A Hierarchical Neural Autoencoder for Paragraphs and Documents
Jiwei Li, Minh-Thang Luong and Dan Jurafsky
Computer Science Department, Stanford University, Stanford, CA 94305, USA
{jiweil, lmthang, jurafsky}@stanford.edu
Abstract
Natural language generation of coherent long texts like paragraphs or longer documents is a challenging problem for recurrent network models. In this paper, we explore an important step toward this generation task: training an LSTM (Long Short-Term Memory) autoencoder to preserve and reconstruct multi-sentence paragraphs. We introduce an LSTM model that hierarchically builds an embedding for a paragraph from embeddings for sentences and words, then decodes this embedding to reconstruct the original paragraph. We evaluate the reconstructed paragraphs using standard metrics like ROUGE and Entity Grid, showing that neural models are able to encode texts in a way that preserves syntactic, semantic, and discourse coherence. While only a first step toward generating coherent text units from neural models, our work has the potential to significantly impact natural language generation and summarization. (Code for the models described in this paper is available at www.stanford.edu/~jiweil/.)
1 Introduction
Generating coherent text is a central task in natural language processing. A wide variety of theories exist for representing relationships between text units, such as Rhetorical Structure Theory (Mann and Thompson, 1988) or Discourse Representation Theory (Lascarides and Asher, 1991), for extracting these relations from text units (Marcu, 2000; LeThanh et al., 2004; Hernault et al., 2010; Feng and Hirst, 2012, inter alia), and for extracting other coherence properties characterizing the role each text unit plays with others in a discourse (Barzilay and Lapata, 2008; Barzilay and Lee, 2004; Elsner and Charniak, 2008; Li and Hovy, 2014, inter alia). However, applying these theories to text generation remains difficult. To understand how discourse units are connected, one has to understand the communicative function of each unit, and the role it plays within the context that encapsulates it, recursively all the way up for the entire text. Identifying increasingly sophisticated human-developed features may be insufficient for capturing these patterns. But developing neural-based alternatives has also been difficult. Although neural representations for sentences can capture aspects of coherent sentence structure (Ji and Eisenstein, 2014; Li et al., 2014; Li and Hovy, 2014), it is not clear how they could help in generating more broadly coherent text.
Recent LSTM models (Hochreiter and Schmidhuber, 1997) have shown powerful results on generating meaningful and grammatical sentences in sequence generation tasks like machine translation (Sutskever et al., 2014; Bahdanau et al., 2014; Luong et al., 2015) or parsing (Vinyals et al., 2014). This performance is at least partially attributable to the ability of these systems to capture local compositionality: the way neighboring words are combined semantically and syntactically to form the meanings a speaker wishes to express.
Could these models be extended to deal with generation of larger structures like paragraphs or even entire documents? In standard sequence-to-sequence generation tasks, an input sequence is mapped to a vector embedding that represents the sequence, and then to an output string of words. Multi-text generation tasks like summarization could work in a similar way: the system reads a collection of input sentences, and is then asked to generate meaningful texts with certain properties (such as, for summarization, being succinct and conclusive). A minimal sketch of this standard setup is given below.
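To make the sequence-to-sequence setup concrete, the following is a minimal illustrative sketch in PyTorch, not the hierarchical models introduced later in this paper; the class, names, and dimensions are our own hypothetical choices. An encoder LSTM compresses a word sequence into a fixed-length vector, and a decoder LSTM reconstructs the sequence from that vector:

    # A minimal sketch of a sequence-to-sequence LSTM autoencoder.
    # All names and sizes are illustrative, not the paper's models.
    import torch
    import torch.nn as nn

    class Seq2SeqAutoencoder(nn.Module):
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tokens):                 # tokens: (batch, length)
            emb = self.embed(tokens)
            # The final (hidden, cell) state is the fixed-length
            # embedding of the whole input sequence.
            _, state = self.encoder(emb)
            # Teacher forcing: the decoder re-reads the gold sequence and
            # predicts each next word from the sequence embedding.
            dec_out, _ = self.decoder(emb, state)
            return self.out(dec_out)               # (batch, length, vocab)

    model = Seq2SeqAutoencoder(vocab_size=10000)
    tokens = torch.randint(0, 10000, (4, 12))      # a toy batch
    logits = model(tokens)
    # Reconstruction loss: logits at position t predict the word at t+1.
    loss = nn.CrossEntropyLoss()(logits[:, :-1].reshape(-1, 10000),
                                 tokens[:, 1:].reshape(-1))

Here the final encoder state serves as the embedding of the entire sequence, and training minimizes the cross-entropy of reconstructing each next word from that embedding.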
Just as the local semantic and syntactic compositionality of words can be captured by LSTM models, can the com-