when modelling the underlying distribution of
human conversations.
Generative Adversarial Nets (GANs) (Goodfellow et al., 2014; Chen et al., 2016) offer an effective architecture for jointly training a generative model and a discriminative classifier to generate sharp and realistic images. This architecture could potentially be applied to conversational response generation as well, to relieve the safe response problem: the generative part can be a Seq2Seq-based model that generates response utterances for given queries, and the discriminative part can evaluate the quality of the generated utterances along diverse dimensions against human-produced responses.
However, unlike in image generation, training such a GAN for text generation is not straightforward. The decoding phase of the Seq2Seq model usually involves sampling discrete words from the predicted distributions, which are then fed into the training of the discriminator. This sampling procedure is non-differentiable and therefore breaks back-propagation from the discriminator to the generator, as the sketch below illustrates.
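To see the problem concretely, here is a minimal PyTorch sketch (our illustration, not from any cited work) of a single decoding step: the sampled word id is an integer tensor detached from the computation graph, so the discriminator's signal never reaches the generator's parameters.

```python
# Minimal sketch of why sampling discrete words blocks gradients
# (our illustration; sizes and names are arbitrary).
import torch

vocab_size, emb_dim = 1000, 64
logits = torch.randn(1, vocab_size, requires_grad=True)  # decoder output
embedding = torch.nn.Embedding(vocab_size, emb_dim)

probs = torch.softmax(logits, dim=-1)
word_id = torch.multinomial(probs, num_samples=1)  # non-differentiable sampling
fake_word = embedding(word_id)                     # fed to the discriminator
loss = fake_word.sum()                             # stand-in for a discriminator loss
loss.backward()
print(logits.grad)  # None: no gradient flows back to the generator
```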
To the best of our knowledge, Reinforcement Learning (RL) was first introduced to address the above problem (Li et al., 2017; Yu et al., 2017): the score predicted by a discriminator is used as the reinforcement signal to train the generator, yielding a hybrid model of GAN and RL (sketched below). To train the RL phase, however, Li et al. (2017) introduced two approximations for computing the reward at each action (word) selection step: a Markov Chain Monte Carlo (MCMC) sampling method and a partial utterance scoring approach. They state that the former is time-consuming, while the latter lowers performance due to overfitting caused by adding a large number of partial utterances to the training set. We further argue that, beyond the time complexity of MCMC, RL itself is not an optimal choice either. As our experimental results in Section 5.1 show, a more elegant design of an end-to-end differentiable GAN can significantly increase the model's performance on this text generation task.
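For concreteness, the following is a hedged sketch of such a policy-gradient update, where the discriminator's score on a sampled utterance serves as the reward. The `generator.sample` interface and other names are hypothetical, and the per-step reward machinery of Li et al. (2017) (MCMC rollouts, partial-utterance scoring) is collapsed into a single utterance-level reward.

```python
# Hypothetical REINFORCE-style generator update for a GAN-RL hybrid.
# generator and discriminator are assumed interfaces, not the cited models.
import torch

def reinforce_step(generator, discriminator, query, optimizer):
    # Sample a response and keep the per-word log-probabilities.
    response_ids, log_probs = generator.sample(query)
    # The discriminator's scalar score acts as the reward (no gradient through it).
    reward = discriminator(query, response_ids).detach()
    # Policy gradient: maximize the expected reward.
    loss = -(log_probs.sum() * reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```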
In this paper, we propose a novel variant of GAN for conversational response generation, which introduces an approximate embedding layer to replace the sampling-based decoding phase, such that the entire model is continuous and differentiable. Experiments on two datasets show that the proposed method significantly outperforms three representative existing approaches on both relevance- and diversity-oriented automatic metrics. Human evaluations further demonstrate the potential of the proposed model.
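To preview the idea, one way to make the decoding step differentiable is to feed the discriminator the probability-weighted average of word embeddings instead of a sampled word. The sketch below is a simplified illustration under that reading; the noise term and all names are our assumptions, not the exact layer defined in Section 3.

```python
# Simplified sketch of an approximate embedding step (our reading of the idea,
# not the paper's exact formulation).
import torch

def approximate_embedding(logits, embedding_matrix, noise_std=0.1):
    # logits: (batch, vocab); embedding_matrix: (vocab, emb_dim).
    # Noise on the logits mimics the stochasticity of sampling
    # (the noise term is an assumption of this sketch).
    noisy_logits = logits + noise_std * torch.randn_like(logits)
    probs = torch.softmax(noisy_logits, dim=-1)
    # Expected embedding: fully differentiable w.r.t. the logits.
    return probs @ embedding_matrix
```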
2 Related Work
Inspired by recent advances in Neural Machine
Translation (NMT), Ritter et al. (2011) and
Vinyals and Le (2015) have shown that single-
turn short-text conversations can be modelled as
a generative process trained using query-response
pairs accumulated on social networks. Earlier work focused only on paired word sequences, while Zhou et al. (2016) and Iulian et al. (2017) demonstrated that the comprehensibility of the generated responses can benefit from multi-view training over words, coarse tokens, and utterances. Moreover, Sordoni et al. (2015)
proposed a context-aware response generation
model that goes beyond single-turn conversations.
In addition, attention mechanisms were introduced into Seq2Seq-based models by Shang et al. (2015) and Chen et al. (2017) to capture topic and dialog focus information, which has proven helpful for improving query-response relevance (Wu et al., 2016). Additional features such as persona information (Li et al., 2016b) and latent semantics (Zhou et al., 2017; Serban et al., 2017) have also proven beneficial in this context.
Compared to previous work, this paper focuses on single-turn conversation modelling and employs a GAN to yield informative responses.
3 Building a Conversational Response
Generator via GAN
3.1 Notations
Let $\mathcal{D} = \{(q_i, r_i)\}_{i=1}^{N}$ be a set of $N$ single-turn human-human conversations, where $q_i = (w_{q_i,1}, \dots, w_{q_i,t}, \dots, w_{q_i,m})$ is a query, $r_i = (w_{r_i,1}, \dots, w_{r_i,t}, \dots, w_{r_i,n})$ is the response to $q_i$, and $w_{q_i,t}$ and $w_{r_i,t}$ denote the $t$-th words in $q_i$ and $r_i$, respectively. This paper aims to learn a generative model $G(r|q)$, guided by a discriminator $D$, that can predict informative responses with good diversity for arbitrary input queries.
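As a concrete (invented) illustration of this data format, each training instance is simply a tokenized query paired with a tokenized human response:

```python
# Toy rendering of the corpus {(q_i, r_i)}_{i=1}^N: tokenized query-response
# pairs (the example conversations are invented for illustration).
corpus = [
    (["how", "are", "you", "?"], ["i", "am", "fine", ",", "thanks", "."]),
    (["any", "plans", "tonight", "?"], ["maybe", "a", "movie", "."]),
]
for q_i, r_i in corpus:
    print(" ".join(q_i), "->", " ".join(r_i))
```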