provements on the English-French and English-German translation tasks from the WMT’14 dataset (Luong et al., 2014; Jean et al., 2014). It has also been used for other tasks such as parsing (Vinyals et al., 2014a) and image captioning (Vinyals et al., 2014b). Since it is well known that vanilla RNNs suffer from vanishing gradients, most researchers use variants of the Long Short Term Memory (LSTM) recurrent neural network (Hochreiter & Schmidhuber, 1997).
Our work is also inspired by the recent success of neural language modeling (Bengio et al., 2003; Mikolov et al., 2010; Mikolov, 2012), which shows that recurrent neural networks are rather effective models for natural language. More recently, Sordoni et al. (2015) and Shang et al. (2015) used recurrent neural networks to model dialogue in short conversations (trained on Twitter-style chats).
Building bots and conversational agents has been pursued by many researchers over the last decades, and it is out of the scope of this paper to provide an exhaustive list of references. However, most of these systems require a rather complicated processing pipeline of many stages (Lester et al., 2004; Will, 2007; Jurafsky & Martin, 2009). Our work differs from conventional systems by proposing an end-to-end approach to the problem which lacks domain knowledge. It could, in principle, be combined with other systems to re-score a short-list of candidate responses, but our work is based on producing answers given by a probabilistic model trained to maximize the probability of the answer given some context.
3. Model
Our approach makes use of the sequence to sequence (seq2seq) model described in Sutskever et al. (2014). The model is based on a recurrent neural network which reads the input sequence one token at a time, and predicts the output sequence, also one token at a time. During training, the true output sequence is given to the model, so learning can be done by backpropagation. The model is trained to minimize the cross entropy (equivalently, to maximize the log-probability) of the correct sequence given its context. During inference, since the true output sequence is not observed, we simply feed the predicted output token back as input to predict the next output token. This is a “greedy” inference approach. A less greedy approach is to use beam search, feeding several candidates from the previous step into the next step; the predicted sequence can then be selected based on the probability of the whole sequence.
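As an illustration of the two decoding procedures just described, the sketch below contrasts greedy decoding with a small beam search. It assumes a hypothetical interface model.step(prefix) that returns the log-probabilities of the next token given the tokens decoded so far (the encoder state is taken to be folded into the model); the names greedy_decode, beam_search_decode, bos_id, and eos_id are ours and not part of the original system.

def greedy_decode(model, bos_id, eos_id, max_len=50):
    """Feed the most likely token back in at every step."""
    prefix = [bos_id]
    while len(prefix) < max_len:
        log_probs = model.step(prefix)            # dict: token id -> log-prob
        next_id = max(log_probs, key=log_probs.get)
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix

def beam_search_decode(model, bos_id, eos_id, beam_size=4, max_len=50):
    """Keep the beam_size best partial sequences at every step."""
    beams = [([bos_id], 0.0)]                     # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = model.step(prefix)
            for tok, lp in log_probs.items():
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))  # completed hypothesis
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]

With beam_size set to 1, the beam search collapses to the greedy procedure.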
Concretely, suppose that we observe a conversation with two turns: the first person utters “ABC”, and a second person replies “WXYZ”. We can use a recurrent neural network and train it to map “ABC” to “WXYZ”, as shown in Figure 1.
Figure 1. Using the seq2seq framework for modeling conversations.
The hidden state of the model when it receives the end of
sequence symbol “<eos>” can be viewed as the thought
vector because it stores the information of the sentence, or
thought, “ABC”.
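To make this training setup concrete, here is a minimal sketch of how consecutive conversation turns could be converted into (input, target) pairs terminated by the “<eos>” symbol. The whitespace tokenization and the helper name turns_to_pairs are assumptions made purely for illustration.

def turns_to_pairs(turns):
    # Pair each turn with the turn that follows it; both sides are
    # terminated with the end-of-sequence symbol.
    pairs = []
    for context, reply in zip(turns, turns[1:]):
        pairs.append((context.split() + ["<eos>"],
                      reply.split() + ["<eos>"]))
    return pairs

print(turns_to_pairs(["A B C", "W X Y Z"]))
# [(['A', 'B', 'C', '<eos>'], ['W', 'X', 'Y', 'Z', '<eos>'])]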
The strength of this model lies in its simplicity and generality. We can use this model for machine translation, question/answering, and conversations without major changes in the architecture.
Unlike easier tasks like translation, however, a model like sequence to sequence will not be able to successfully “solve” the problem of modeling dialogue due to several obvious simplifications: the objective function being optimized does not capture the actual objective achieved through human communication, which is typically longer term and based on exchange of information rather than next step prediction. The lack of a model to ensure consistency and general world knowledge is another obvious limitation of a purely unsupervised model.
4. Datasets
In our experiments we used two datasets: a closed-domain
IT helpdesk troubleshooting dataset and an open-domain
movie transcript dataset. The details of the two datasets are
as follows.
4.1. IT Helpdesk Troubleshooting dataset
Our first set of experiments used a dataset extracted from an IT helpdesk troubleshooting chat service. In this service, customers deal with computer-related issues, and a specialist helps them walk through a solution. Typical interactions (or threads) are 400 words long, and turn taking is clearly signaled. Our training set contains 30M tokens, and 3M tokens were used as validation. Some amount of cleanup was performed, such as removing common names, numbers, and full URLs.
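The exact cleanup rules are not specified here, so the snippet below is only a plausible sketch of this kind of anonymization; the regular expressions, the placeholder tokens <name>, <number>, and <url>, and the toy name list are all assumptions.

import re

COMMON_NAMES = {"alice", "bob"}   # in practice this would be a much larger list

def clean_utterance(text):
    text = re.sub(r"https?://\S+", "<url>", text)    # replace full URLs
    text = re.sub(r"\b\d+\b", "<number>", text)       # replace standalone numbers
    return " ".join("<name>" if tok.lower() in COMMON_NAMES else tok
                    for tok in text.split())

print(clean_utterance("Hi Bob please call 555 or visit http://example.com"))
# Hi <name> please call <number> or visit <url>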
4.2. OpenSubtitles dataset
We also tested our model on the OpenSubtitles dataset (Tiedemann, 2009). This dataset consists of movie conversations in XML format. It contains sentences uttered by characters in movies. We applied a simple processing step removing XML tags and obvious non-conversational