Document Summarization Based on Data Reconstruction
Zhanying He and Chun Chen and Jiajun Bu and Can Wang and Lijun Zhang
Zhejiang Provincial Key Laboratory of Service Robot, College of Computer Science,
Zhejiang University, Hangzhou 310027, China.
{hezhanying, chenc, bjj, wcan, zljzju}@zju.edu.cn
Deng Cai and Xiaofei He
State Key Lab of CAD&CG, College of Computer Science,
Zhejiang University, Hangzhou 310058, China.
{dengcai, xiaofeihe}@cad.zju.edu.cn
Abstract
Document summarization is of great value to many
real world applications, such as snippets generation for
search results and news headlines generation. Tradition-
ally, document summarization is implemented by ex-
tracting sentences that cover the main topics of a doc-
ument with a minimum redundancy. In this paper, we
take a different perspective from data reconstruction and
propose a novel framework named Document Summa-
rization based on Data Reconstruction (DSDR). Specif-
ically, our approach generates a summary which consists
of those sentences that can best reconstruct the original
document. To model the relationship among sentences,
we introduce two objective functions: (1) linear recon-
struction, which approximates the document by linear
combinations of the selected sentences; (2) nonnega-
tive linear reconstruction, which allows only additive,
not subtractive, linear combinations. In this framework,
the reconstruction error becomes a natural criterion for
measuring the quality of the summary. For each objec-
tive function, we develop an efficient algorithm to solve
the corresponding optimization problem. Extensive ex-
periments on summarization benchmark data sets DUC
2006 and DUC 2007 demonstrate the effectiveness of
our proposed approach.
Introduction
With the explosive growth of the Internet, people are over-
whelmed by a large number of accessible documents. Sum-
marization can represent the document with a short piece
of text covering the main topics, helping users sift through
the Internet, find the most relevant documents, and filter out
redundant information. Document summarization has thus
become one of the most important research topics in the natural
language processing and information retrieval communities.
In recent years, automatic summarization has been ap-
plied broadly in varied domains. For example, search en-
gines can provide users with snippets as the previews of
the document contents (Turpin et al. 2007; Huang, Liu, and
Chen 2008; Cai et al. 2004; He et al. 2007). News sites usu-
ally describe hot news topics in concise headlines to facili-
tate browsing. Both the snippets and headlines are specific
forms of document summary in practical applications.
Copyright © 2012, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Most of the existing generic summarization approaches
use a ranking model to select sentences from a candidate set
(Brin and Page 1998; Kleinberg 1999; Wan and Yang 2007).
These methods suffer from a severe problem that top ranked
sentences usually share much redundant information. Al-
though there are some methods (Conroy and O’leary 2001;
Park et al. 2007; Shen et al. 2007) trying to reduce the redun-
dancy, selecting sentences which have both good coverage
and minimum redundancy is a non-trivial task.
In this paper, we propose a novel summarization method
from the perspective of data reconstruction. As far as we
know, our approach is the first to treat the document sum-
marization as a data reconstruction problem. We argue that
a good summary should consist of those sentences that can
best reconstruct the original document. Therefore, the re-
construction error becomes a natural criterion for measur-
ing the quality of summary. We propose a novel framework
called Document Summarization based on Data Reconstruc-
tion (DSDR) which finds the summary sentences by mini-
mizing the reconstruction error. DSDR firstly learns a recon-
struction function for each candidate sentence of an input
document and then obtains the error formula by that func-
tion. Finally it obtains an optimal summary by minimizing
the reconstruction error. From the geometric interpretation,
DSDR tends to select sentences that span the intrinsic sub-
space of candidate sentence space so that it is able to cover
the core information of the document.
To model the relationship among sentences, we discuss
two kinds of reconstruction. The first one is linear recon-
struction, which approximates the document by linear com-
binations of the selected sentences. Optimizing the corre-
sponding objective function is achieved through a greedy
method which extracts sentences sequentially. The second
one is non-negative linear reconstruction, which allows only
additive, not subtractive, combinations among the selected
sentences. Previous studies have shown that there is psycho-
logical and physiological evidence for parts-based represen-
tation in the human brain (Palmer 1977; Wachsmuth, Oram,
and Perrett 1994; Cai et al. 2011). Naturally, a document
summary should consist of the parts of sentences. With the
nonnegative constraints, our method leads to parts-based re-
construction so that no redundant information needs to be
subtracted from the combination. We formulate the nonnegative
linear reconstruction as a convex optimization problem
and design a multiplicative updating algorithm which
guarantees monotonic convergence to the global minimum.

Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence
Extensive experiments on summarization benchmark data
sets DUC 2006 and DUC 2007 demonstrate the effective-
ness of our proposed approach.
Related work
Recently, lots of extractive document summarization meth-
ods have been proposed. Most of them involve assigning
salient scores to sentences of the original document and
composing the result summary of the top sentences with the
highest scores. The computation rules of salient scores can
be categorized into three groups (Hu, Sun, and Lim 2008):
feature based measurements, lexical chain based measure-
ments and graph based measurements. In (Wang et al. 2008),
the semantic relations of terms in the same semantic role
are discovered by using the WordNet (Miller 1995). A tree
pattern expression for extracting information from syntac-
tically parsed text is proposed in (Choi 2011). Algorithms
like PageRank (Brin and Page 1998) and HITS (Kleinberg
1999) are used in the sentence score propagation based on
the graph constructed based on the similarity between sen-
tences. Wan and Yang (2007) show that graph based mea-
surements can also improve the single-document summa-
rization by integrating multiple documents of the same topic.
Most of these scoring-based methods have to incorporate
the adjustment of word weights, which is one of the
most important factors that influence the summarization per-
formance (Nenkova, Vanderwende, and McKeown 2006).
Thus, much work has studied how to extract sentences
without saliency scores. Inspired by the latent semantic in-
dexing (LSA), the singular value decomposition (SVD) is
used to select highly ranked sentences for generic document
summarization (Gong and Liu 2001). Harabagiu and Laca-
tusu (2005) analyze five different topic representations and
propose a novel topic representation based on topic themes.
Wang et al. (2008) use the symmetric non-negative matrix
factorization (SNMF) to cluster sentences into groups and
select sentences from each group for summarization.
The Proposed Framework
Most of the existing summarization methods aim to obtain
the summary which covers the core information of the docu-
ment. In this paper, we study the summarization from a data
reconstruction perspective. We believe that a good summary
should contain those sentences that can be used to recon-
struct the document as well as possible, namely, minimizing
the reconstruction error.
In this section, we describe the details of our proposed
framework Document Summarization based on Data Recon-
struction (DSDR) which minimizes the reconstruction error
for summarization. The algorithm procedure of DSDR is as
follows:
• After stemming and stop-word elimination, we decom-
pose the document into individual sentences and create
a weighted term-frequency vector for every sentence. All
the sentences form the candidate set.
• For any sentence in the document, DSDR selects the re-
lated sentences from the candidate set to reconstruct the
given sentence by learning a reconstruction function for
the sentence.
• For the entire document (or a set of documents), DSDR
aims to find an optimal set of representative sentences
to approximate the entire document (or document set)
by minimizing the reconstruction error.
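The preprocessing and candidate-set construction above can be sketched as follows. The regex sentence splitter, toy stop-word list, and tf-idf-style weighting are simplified stand-ins; the paper does not specify its exact tokenizer or weighting scheme:

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that"}  # toy list

def preprocess(document):
    """Split a document into sentences and build weighted term-frequency vectors."""
    sentences = [s.strip() for s in re.split(r"[.!?]", document) if s.strip()]
    # Tokenize, lowercase, and drop stop words (stemming omitted for brevity).
    token_lists = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    vocab = sorted({w for toks in token_lists for w in toks})
    index = {w: j for j, w in enumerate(vocab)}
    n, d = len(sentences), len(vocab)
    # Document frequency of each term, for a simple tf-idf weighting.
    df = Counter(w for toks in token_lists for w in set(toks))
    V = [[0.0] * d for _ in range(n)]
    for i, toks in enumerate(token_lists):
        for w, c in Counter(toks).items():
            V[i][index[w]] = c * math.log((1 + n) / (1 + df[w]))
    return sentences, vocab, V

sentences, vocab, V = preprocess(
    "Summarization helps users. Users read summaries. Summaries cover main topics."
)
```

The rows of `V` form the candidate set from which DSDR selects summary sentences.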
We denote the candidate sentence set as $V = [v_1, v_2, \ldots, v_n]^T$, where $v_i \in \mathbb{R}^d$ is the weighted term-frequency vector of sentence $i$. Note that we use $V$ to represent both the matrix and the candidate set $\{v_i\}$. Supposing the document contains $d$ terms and $n$ sentences in total, $V$ is an $n \times d$ matrix. We denote the summary sentence set as $X = [x_1, x_2, \ldots, x_m]^T$ with $m < n$ and $X \subset V$.
Given a sentence $v_i \in V$, DSDR attempts to represent it with a reconstruction function $f_i(X)$ given the selected sentence set $X$. Denoting the parameters of $f_i$ as $a_i$, we obtain the reconstruction error of $v_i$ as:

$$L(v_i, f_i(X; a_i)) = \|v_i - f_i(X; a_i)\|^2, \qquad (1)$$

where $\|\cdot\|$ is the $L_2$-norm.
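As a tiny numeric illustration of Eq. (1), with made-up vectors standing in for a sentence and its (already computed) reconstruction:

```python
# Hypothetical 3-term sentence vector v_i and its reconstruction f_i(X; a_i).
v_i = [1.0, 2.0, 0.0]
f_i = [0.5, 2.0, 0.5]

# Squared L2 reconstruction error, as in Eq. (1).
error = sum((a - b) ** 2 for a, b in zip(v_i, f_i))
print(error)  # 0.25 + 0.0 + 0.25 = 0.5
```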
By minimizing the sum of reconstruction errors over all the sentences in the document, DSDR picks the optimal set of representative sentences. The objective function of DSDR can be formally defined as:

$$\min_{X,\,a_i} \sum_{i=1}^{n} \|v_i - f_i(X; a_i)\|^2. \qquad (2)$$

In the following, we will discuss two types of reconstruction function $f_i(X; a_i)$, namely, linear reconstruction and nonnegative linear reconstruction.
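As noted in the introduction, the linear case of objective (2) is optimized by a greedy method that extracts sentences sequentially. A naive (and computationally wasteful) sketch of that idea, using unpenalized least squares as the reconstruction function, might look like the following; the data and function names are illustrative, not the paper's implementation:

```python
import numpy as np

def total_error(V, idx):
    """Sum of least-squares reconstruction errors when every sentence v_i is
    reconstructed from the selected rows V[idx] (a stand-in for f_i in Eq. (2))."""
    X = V[idx]                                        # m x d selected sentences
    A, *_ = np.linalg.lstsq(X.T, V.T, rcond=None)     # solve X^T a_i ~ v_i for all i
    return float(((V.T - X.T @ A) ** 2).sum())

def greedy_dsdr(V, m):
    """Greedily add the sentence that most reduces the total reconstruction error."""
    selected = []
    for _ in range(m):
        rest = [j for j in range(V.shape[0]) if j not in selected]
        best = min(rest, key=lambda j: total_error(V, selected + [j]))
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
V = rng.random((6, 4))            # 6 candidate sentences, 4 terms (toy data)
summary_idx = greedy_dsdr(V, m=2)
```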
Linear Reconstruction
First we define the reconstruction function $f_i(X)$ as a linear function:

$$f_i(X; a_i) = \sum_{j=1}^{m} x_j a_{ij}, \qquad X = [x_1, x_2, \ldots, x_m]^T. \qquad (3)$$
Then a candidate sentence $v_i$ can be approximately represented as:

$$v_i \approx \sum_{j=1}^{m} x_j a_{ij}, \qquad 1 \le i \le n.$$

Now, the reconstruction error of the document can be obtained as:

$$\sum_{i=1}^{n} \|v_i - X^T a_i\|^2.$$
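With the summary set $X$ fixed, finding each $a_i$ is an ordinary least-squares problem. A numpy sketch on toy data (the matrices here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((5, 3))            # n = 5 candidate sentences, d = 3 terms
X = V[[0, 2]]                     # a chosen summary set: m = 2 rows of V

# For each v_i, solve min_a ||v_i - X^T a||^2; lstsq handles all i at once,
# giving A with columns a_i.
A, _, _, _ = np.linalg.lstsq(X.T, V.T, rcond=None)

total_error = float(((V.T - X.T @ A) ** 2).sum())

# A sentence that is itself in X is reconstructed exactly (zero residual).
assert np.allclose(V[0], X.T @ A[:, 0])
```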
The solution obtained by minimizing the above equation often exhibits high variance and results in high generalization error, especially when the dimension of the sentence vectors is smaller than the number of sentences. The variance can be reduced by shrinking the coefficients $a_i$, if we impose a penalty on their size. Inspired by ridge regression (Hoerl and Kennard 1970),
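The available text is cut off at this point. In standard ridge regression, the penalized objective $\min_a \|v - X^T a\|^2 + \lambda\|a\|^2$ has the closed-form solution $a = (XX^T + \lambda I)^{-1} X v$; the sketch below is an assumption about the direction the ridge-inspired penalty takes (the data and $\lambda$ are illustrative, not the paper's values) and demonstrates the shrinkage effect:

```python
import numpy as np

def ridge_coefficients(X, v, lam):
    """Closed-form minimizer of ||v - X^T a||^2 + lam * ||a||^2:
    a = (X X^T + lam * I)^{-1} X v."""
    m = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(m), X @ v)

rng = np.random.default_rng(2)
X = rng.random((3, 4))            # m = 3 selected sentences, d = 4 terms
v = rng.random(4)                 # a candidate sentence to reconstruct

a_ols = ridge_coefficients(X, v, lam=1e-12)   # ~ unpenalized least squares
a_ridge = ridge_coefficients(X, v, lam=1.0)

# The penalty shrinks the coefficient vector toward zero.
assert np.linalg.norm(a_ridge) < np.linalg.norm(a_ols)
```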