Document Summarization Based on Data Reconstruction
Zhanying He and Chun Chen and Jiajun Bu and Can Wang and Lijun Zhang
Zhejiang Provincial Key Laboratory of Service Robot, College of Computer Science,
Zhejiang University, Hangzhou 310027, China.
{hezhanying, chenc, bjj, wcan, zljzju}@zju.edu.cn
Deng Cai and Xiaofei He
State Key Lab of CAD&CG, College of Computer Science,
Zhejiang University, Hangzhou 310058, China.
{dengcai, xiaofeihe}@cad.zju.edu.cn
Abstract
Document summarization is of great value to many
real world applications, such as snippets generation for
search results and news headlines generation. Tradition-
ally, document summarization is implemented by ex-
tracting sentences that cover the main topics of a doc-
ument with a minimum redundancy. In this paper, we
take a different perspective from data reconstruction and
propose a novel framework named Document Summa-
rization based on Data Reconstruction (DSDR). Specif-
ically, our approach generates a summary which consists
of those sentences that can best reconstruct the original
document. To model the relationship among sentences,
we introduce two objective functions: (1) linear recon-
struction, which approximates the document by linear
combinations of the selected sentences; (2) nonnega-
tive linear reconstruction, which allows only additive,
not subtractive, linear combinations. In this framework,
the reconstruction error becomes a natural criterion for
measuring the quality of the summary. For each objec-
tive function, we develop an efficient algorithm to solve
the corresponding optimization problem. Extensive ex-
periments on summarization benchmark data sets DUC
2006 and DUC 2007 demonstrate the effectiveness of
our proposed approach.
Introduction
With the explosive growth of the Internet, people are over-
whelmed by a large number of accessible documents. Sum-
marization can represent the document with a short piece
of text covering the main topics, helping users sift through
the Internet, find the most relevant documents, and filter out
redundant information. Document summarization has thus
become one of the most important research topics in the natural
language processing and information retrieval communities.
In recent years, automatic summarization has been ap-
plied broadly in varied domains. For example, search en-
gines can provide users with snippets as the previews of
the document contents (Turpin et al. 2007; Huang, Liu, and
Chen 2008; Cai et al. 2004; He et al. 2007). News sites usu-
ally describe hot news topics in concise headlines to facili-
tate browsing. Both the snippets and headlines are specific
forms of document summary in practical applications.
Copyright © 2012, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Most of the existing generic summarization approaches
use a ranking model to select sentences from a candidate set
(Brin and Page 1998; Kleinberg 1999; Wan and Yang 2007).
These methods suffer from a severe problem that top ranked
sentences usually share much redundant information. Al-
though there are some methods (Conroy and O’leary 2001;
Park et al. 2007; Shen et al. 2007) trying to reduce the redun-
dancy, selecting sentences which have both good coverage
and minimum redundancy is a non-trivial task.
In this paper, we propose a novel summarization method
from the perspective of data reconstruction. As far as we
know, our approach is the first to treat the document sum-
marization as a data reconstruction problem. We argue that
a good summary should consist of those sentences that can
best reconstruct the original document. Therefore, the re-
construction error becomes a natural criterion for measur-
ing the quality of summary. We propose a novel framework
called Document Summarization based on Data Reconstruc-
tion (DSDR) which finds the summary sentences by mini-
mizing the reconstruction error. DSDR firstly learns a recon-
struction function for each candidate sentence of an input
document and then obtains the error formula by that func-
tion. Finally it obtains an optimal summary by minimizing
the reconstruction error. From the geometric interpretation,
DSDR tends to select sentences that span the intrinsic sub-
space of candidate sentence space so that it is able to cover
the core information of the document.
To model the relationship among sentences, we discuss
two kinds of reconstruction. The first one is linear recon-
struction, which approximates the document by linear com-
binations of the selected sentences. Optimizing the corre-
sponding objective function is achieved through a greedy
method which extracts sentences sequentially. The second
one is non-negative linear reconstruction, which allows only
additive, not subtractive, combinations among the selected
sentences. Previous studies have shown that there is psycho-
logical and physiological evidence for parts-based represen-
tation in the human brain (Palmer 1977; Wachsmuth, Oram,
and Perrett 1994; Cai et al. 2011). Naturally, a document
summary should consist of the parts of sentences. With the
nonnegative constraints, our method leads to parts-based re-
construction so that no redundant information needs to be
subtracted from the combination. We formulate the nonnegative
linear reconstruction as a convex optimization problem
and design a multiplicative updating algorithm which
guarantees monotonic convergence to the global minimum.

Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence
Extensive experiments on summarization benchmark data
sets DUC 2006 and DUC 2007 demonstrate the effective-
ness of our proposed approach.
Related work
Recently, lots of extractive document summarization meth-
ods have been proposed. Most of them involve assigning
salient scores to sentences of the original document and
composing the result summary of the top sentences with the
highest scores. The computation rules of salient scores can
be categorized into three groups (Hu, Sun, and Lim 2008):
feature based measurements, lexical chain based measure-
ments and graph based measurements. In (Wang et al. 2008),
the semantic relations of terms in the same semantic role
are discovered by using the WordNet (Miller 1995). A tree
pattern expression for extracting information from syntac-
tically parsed text is proposed in (Choi 2011). Algorithms
like PageRank (Brin and Page 1998) and HITS (Kleinberg
1999) are used in the sentence score propagation based on
the graph constructed based on the similarity between sen-
tences. Wan and Yang (2007) show that graph based mea-
surements can also improve the single-document summa-
rization by integrating multiple documents of the same topic.
Most of these scoring-based methods have to incorporate
the adjustment of word weights, which is one of the
most important factors that influence the summarization per-
formance (Nenkova, Vanderwende, and McKeown 2006).
Thus, much work has studied how to extract sentences
without saliency scores. Inspired by the latent semantic in-
dexing (LSA), the singular value decomposition (SVD) is
used to select highly ranked sentences for generic document
summarization (Gong and Liu 2001). Harabagiu and Laca-
tusu (2005) analyze five different topic representations and
propose a novel topic representation based on topic themes.
Wang et al. (2008) use the symmetric non-negative matrix
factorization (SNMF) to cluster sentences into groups and
select sentences from each group for summarization.
The Proposed Framework
Most of the existing summarization methods aim to obtain
the summary which covers the core information of the docu-
ment. In this paper, we study the summarization from a data
reconstruction perspective. We believe that a good summary
should contain those sentences that can be used to recon-
struct the document as well as possible, namely, minimizing
the reconstruction error.
In this section, we describe the details of our proposed
framework Document Summarization based on Data Recon-
struction (DSDR) which minimizes the reconstruction error
for summarization. The algorithm procedure of DSDR is as
follows:
• After stemming and stop-word elimination, we decom-
pose the document into individual sentences and create
a weighted term-frequency vector for every sentence. All
the sentences form the candidate set.
• For any sentence in the document, DSDR selects the re-
lated sentences from the candidate set to reconstruct the
given sentence by learning a reconstruction function for
the sentence.
• For the entire document (or a set of documents), DSDR
aims to find an optimal set of representative sentences
to approximate the entire document (or document set)
by minimizing the reconstruction error.
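The preprocessing and candidate-set construction above can be sketched as follows. The regex sentence splitter, toy stop-word list, and tf-idf-style weighting are simplified stand-ins; the paper does not specify its exact tokenizer or weighting scheme:

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that"}  # toy list

def preprocess(document):
    """Split a document into sentences and build weighted term-frequency vectors."""
    sentences = [s.strip() for s in re.split(r"[.!?]", document) if s.strip()]
    # Tokenize, lowercase, and drop stop words (stemming omitted for brevity).
    token_lists = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    vocab = sorted({w for toks in token_lists for w in toks})
    index = {w: j for j, w in enumerate(vocab)}
    n, d = len(sentences), len(vocab)
    # Document frequency of each term, for a simple tf-idf weighting.
    df = Counter(w for toks in token_lists for w in set(toks))
    V = [[0.0] * d for _ in range(n)]
    for i, toks in enumerate(token_lists):
        for w, c in Counter(toks).items():
            V[i][index[w]] = c * math.log((1 + n) / (1 + df[w]))
    return sentences, vocab, V

sentences, vocab, V = preprocess(
    "Summarization helps users. Users read summaries. Summaries cover main topics."
)
```

The rows of `V` form the candidate set from which DSDR selects summary sentences.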
We denote the candidate sentence set as $V = [v_1, v_2, \ldots, v_n]^T$, where $v_i \in \mathbb{R}^d$ is the weighted term-frequency vector of sentence $i$. Note that we use $V$ to represent both the matrix and the candidate set $\{v_i\}$. Supposing the document contains $d$ terms and $n$ sentences in total, $V$ is an $n \times d$ matrix. We denote the summary sentence set as $X = [x_1, x_2, \ldots, x_m]^T$ with $m < n$ and $X \subset V$.
Given a sentence $v_i \in V$, DSDR attempts to represent it with a reconstruction function $f_i(X)$ given the selected sentence set $X$. Denoting the parameters of $f_i$ as $a_i$, we obtain the reconstruction error of $v_i$ as:

$$L(v_i, f_i(X; a_i)) = \|v_i - f_i(X; a_i)\|^2, \qquad (1)$$

where $\|\cdot\|$ is the $L_2$-norm.
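As a tiny numeric illustration of Eq. (1), with made-up vectors standing in for a sentence and its (already computed) reconstruction:

```python
# Hypothetical 3-term sentence vector v_i and its reconstruction f_i(X; a_i).
v_i = [1.0, 2.0, 0.0]
f_i = [0.5, 2.0, 0.5]

# Squared L2 reconstruction error, as in Eq. (1).
error = sum((a - b) ** 2 for a, b in zip(v_i, f_i))
print(error)  # 0.25 + 0.0 + 0.25 = 0.5
```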
By minimizing the sum of reconstruction errors over all the sentences in the document, DSDR picks the optimal set of representative sentences. The objective function of DSDR can be formally defined as:

$$\min_{X,\,a_i} \sum_{i=1}^{n} \|v_i - f_i(X; a_i)\|^2. \qquad (2)$$

In the following, we will discuss two types of reconstruction function $f_i(X; a_i)$, namely, linear reconstruction and nonnegative linear reconstruction.
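As noted in the introduction, the linear case of objective (2) is optimized by a greedy method that extracts sentences sequentially. A naive (and computationally wasteful) sketch of that idea, using unpenalized least squares as the reconstruction function, might look like the following; the data and function names are illustrative, not the paper's implementation:

```python
import numpy as np

def total_error(V, idx):
    """Sum of least-squares reconstruction errors when every sentence v_i is
    reconstructed from the selected rows V[idx] (a stand-in for f_i in Eq. (2))."""
    X = V[idx]                                        # m x d selected sentences
    A, *_ = np.linalg.lstsq(X.T, V.T, rcond=None)     # solve X^T a_i ~ v_i for all i
    return float(((V.T - X.T @ A) ** 2).sum())

def greedy_dsdr(V, m):
    """Greedily add the sentence that most reduces the total reconstruction error."""
    selected = []
    for _ in range(m):
        rest = [j for j in range(V.shape[0]) if j not in selected]
        best = min(rest, key=lambda j: total_error(V, selected + [j]))
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
V = rng.random((6, 4))            # 6 candidate sentences, 4 terms (toy data)
summary_idx = greedy_dsdr(V, m=2)
```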
Linear Reconstruction
First we define the reconstruction function $f_i(X)$ as a linear function:

$$f_i(X; a_i) = \sum_{j=1}^{m} x_j a_{ij}, \qquad X = [x_1, x_2, \ldots, x_m]^T. \qquad (3)$$
Then a candidate sentence $v_i$ can be approximately represented as:

$$v_i \approx \sum_{j=1}^{m} x_j a_{ij}, \qquad 1 \le i \le n.$$

Now, the reconstruction error of the document can be obtained as:

$$\sum_{i=1}^{n} \|v_i - X^T a_i\|^2.$$
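With the summary set $X$ fixed, finding each $a_i$ is an ordinary least-squares problem. A numpy sketch on toy data (the matrices here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((5, 3))            # n = 5 candidate sentences, d = 3 terms
X = V[[0, 2]]                     # a chosen summary set: m = 2 rows of V

# For each v_i, solve min_a ||v_i - X^T a||^2; lstsq handles all i at once,
# giving A with columns a_i.
A, _, _, _ = np.linalg.lstsq(X.T, V.T, rcond=None)

total_error = float(((V.T - X.T @ A) ** 2).sum())

# A sentence that is itself in X is reconstructed exactly (zero residual).
assert np.allclose(V[0], X.T @ A[:, 0])
```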
The solution obtained by minimizing the above equation often exhibits high variance and results in high generalization error, especially when the dimension of the sentence vectors is smaller than the number of sentences. The variance can be reduced by shrinking the coefficients $a_i$, if we impose a penalty on their size. Inspired by ridge regression (Hoerl and Kennard 1970),
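The available text is cut off at this point. In standard ridge regression, the penalized objective $\min_a \|v - X^T a\|^2 + \lambda\|a\|^2$ has the closed-form solution $a = (XX^T + \lambda I)^{-1} X v$; the sketch below is an assumption about the direction the ridge-inspired penalty takes (the data and $\lambda$ are illustrative, not the paper's values) and demonstrates the shrinkage effect:

```python
import numpy as np

def ridge_coefficients(X, v, lam):
    """Closed-form minimizer of ||v - X^T a||^2 + lam * ||a||^2:
    a = (X X^T + lam * I)^{-1} X v."""
    m = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(m), X @ v)

rng = np.random.default_rng(2)
X = rng.random((3, 4))            # m = 3 selected sentences, d = 4 terms
v = rng.random(4)                 # a candidate sentence to reconstruct

a_ols = ridge_coefficients(X, v, lam=1e-12)   # ~ unpenalized least squares
a_ridge = ridge_coefficients(X, v, lam=1.0)

# The penalty shrinks the coefficient vector toward zero.
assert np.linalg.norm(a_ridge) < np.linalg.norm(a_ols)
```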