Surfer100: Generating Surveys From Web Resources, Wikipedia-style
Irene Li, Alexander Fabbri, Rina Kawamura, Yixin Liu, Xiangru Tang,
Jaesung Tae, Chang Shen, Sally Ma, Tomoe Mizutani, Dragomir Radev
Department of Computer Science
Yale University
Abstract
Fast-developing fields such as Artificial Intelli-
gence (AI) often outpace the efforts of encyclo-
pedic sources such as Wikipedia, which either
do not completely cover recently-introduced
topics or lack such content entirely. As a re-
sult, methods for automatically producing con-
tent are valuable tools to address this informa-
tion overload. We show that recent advances
in pretrained language modeling can be com-
bined for a two-stage extractive and abstrac-
tive approach for Wikipedia lead paragraph
generation. We extend this approach to gen-
erate longer Wikipedia-style summaries with
sections and examine how such methods strug-
gle in this application through detailed stud-
ies with 100 reference human-collected sur-
veys. To the best of our knowledge, this is
the first study on utilizing web resources for
long Wikipedia-style summaries.
1 Introduction
Novel concepts are being introduced and evolv-
ing at a rate that makes creating high-quality, up-
to-date Wikipedia pages for such topics challeng-
ing. A pipeline for automatically creating such
Wikipedia pages is thus desirable. While there
has been some work on generating full Wikipedia
pages, these efforts are either domain-specific
(Sauper and Barzilay, 2009), making strong as-
sumptions about the topics being summarized
(Banerjee and Mitra, 2016), or are purely extractive
(Jha et al., 2015). In a related line of work, query-based
summarization has been applied to specific
sections of Wikipedia pages (Deutsch and Roth,
2019; Zhu et al., 2019), which can be viewed as a
more self-contained version of Wikipedia page gen-
eration. Recent Wikipedia page generation work
has focused on generating the initial leading para-
graph of a Wikipedia page (Liu et al., 2018; Liu
and Lapata, 2019; Perez-Beltrachini et al., 2019).
These papers consist of a two-step framework by
which an extractive method selects relevant con-
tent for a specific topic, and an abstractive method
generates the final summary of the topic.
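To make the two-step framework concrete, the sketch below illustrates the extract-then-abstract idea in miniature. It is not the authors' implementation: the lexical-overlap scorer is a simplified stand-in for the neural extractive rankers used in the papers above, and the abstractive stage (a pretrained model such as BART) is only indicated in a comment, since running it requires model weights.

```python
def relevance(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words appearing in the passage.
    A stand-in for the learned extractive rankers used in practice."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / (len(q) or 1)

def extract_top_k(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Stage 1 (extractive): select the k passages most relevant to the topic."""
    return sorted(passages, key=lambda p: relevance(query, p), reverse=True)[:k]

passages = [
    "BERT is a pretrained language model based on transformers.",
    "The weather in New Haven is mild in autumn.",
    "Pretrained language models improve many NLP tasks.",
]
selected = extract_top_k("pretrained language model", passages)
# Stage 2 (abstractive, not run here): " ".join(selected) would be fed to a
# pretrained seq2seq model such as BART to generate the final lead paragraph.
```

The design point this illustrates is the division of labor: the extractive stage cheaply narrows a large pool of web text down to topic-relevant content, so the expensive abstractive model only conditions on a short, relevant input.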
In this paper, we first examine how recently-
introduced pretrained language models (Devlin
et al., 2019; Liu et al., 2019; Lewis et al., 2019)
improve upon both the extractive and abstractive
steps of previous models for the task of lead para-
graph generation. We further focus on analyzing
the extension of such methods to full Wikipedia
page generation on scientific topics related to AI
and Natural Language Processing (NLP). We man-
ually create summaries of 100 AI and NLP topics
divided along sections, as on Wikipedia pages. We
perform ablation studies on content selection and
generation methods over selected topics, finding
that current content selection methods are not pre-
cise and fail to differentiate content well among
queries for subtopics of the main topic.
Our contributions are: 1) We demonstrate how
recent advances in pretrained language models
improve Wikipedia lead paragraph generation;
2) We extend this method to generate full
Wikipedia-style pages on scientific topics;
3) For evaluation, we manually collected Surfer100,
100 SURveys From wEb Resources on scientific
topics, filling the gap in human-written surveys
built from web resources on scientific topics. We
provide a better understanding of current methods
and their shortcomings in a real-world application.
2 Wikipedia Lead Paragraph Generation
In this section, we show how combining recent
methods in a two-stage approach of content
selection and generation gives improved results on
the WikiSum dataset (Liu et al., 2018) as well as a
newly curated set of Wikipedia articles.
2.1 Data
We make use of the WikiSum dataset (Liu et al.,
2018), a collection of over 1.5 million Wikipedia
arXiv:2112.06377v1 [cs.CL] 13 Dec 2021