regularization noise can hamper that. Second, hard mixtures
yield more diverse generations than soft mixtures, similar to
how K-Means tends to find centroids that are farther apart
from each other compared to the means found by a mix-
ture of Gaussians (Kearns et al., 1998). Third, employing
a uniform prior encourages all mixture components to pro-
duce good translations for any input source sentence, which
is highly desirable. Finally, using independently parame-
terized mixture components provides greater capacity for
diverse outputs than sharing parameters; but if responsibilities are
refreshed online, independent parameterization is prone to
a degeneracy where only a single component is trained
because of the “rich get richer” effect. Conversely, the
combination of shared parameters and offline responsibility
assignment may lead to another degeneracy, in which the
mixture components fail to specialize and behave the same.
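To make the hard-versus-soft distinction above concrete, the sketch below (a minimal illustration with names of our own choosing, not the paper's implementation) contrasts a full soft posterior over components with a hard arg-max assignment when computing responsibilities:

```python
import torch
import torch.nn.functional as F

def responsibilities(joint_log_probs, hard=False):
    """Responsibilities r(z) over K mixture components for one minibatch.

    joint_log_probs: (batch, K) tensor holding log p(y, z | x) for each
    component z (an assumed input; how it is computed is model-specific).
    """
    if not hard:
        # Soft (EM-style) posterior p(z | x, y) over components.
        return torch.softmax(joint_log_probs, dim=-1)
    # Hard (K-means-style) assignment: all responsibility on the best component.
    best = joint_log_probs.argmax(dim=-1)
    return F.one_hot(best, num_classes=joint_log_probs.size(-1)).float()
```

Under hard assignment, only the selected component receives gradient for a given sentence pair, which is exactly what makes independently parameterized components with online responsibility refreshes vulnerable to the “rich get richer” degeneracy noted above.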
We extend our evaluation to three WMT benchmark datasets
for which test sets with multiple human references are
available. We demonstrate that mixture models, when suc-
cessfully trained, consistently outperform variational NMT
(Zhang et al., 2016) and diverse decoding algorithms such
as diverse beam search (Li et al., 2017; Vijayakumar et al.,
2018) and biased sampling (Graves, 2013; Fan et al., 2018).
Our qualitative analysis shows that different mixture compo-
nents can capture consistent translation styles across exam-
ples, enabling users to control generations in an interpretable
and semantically meaningful way.
2. Related Work
Prior studies have investigated prediction uncertainty in
machine translation. Dreyer & Marcu (2012) and Galley
et al. (2015) introduced new metrics to address uncertainty
at evaluation time. Ott et al. (2018a) inspected the sources
of uncertainty and proposed tools to assess the fit between
the model and the data distributions. They also observed
that modern conditional auto-regressive NMT models can
only capture uncertainty to a limited extent, and tend to
spread probability mass too smoothly over the hypothesis space.
Recent work has explored latent variable modeling for ma-
chine translation. Zhang et al. (2016) leveraged variational
inference (Kingma & Welling, 2014; Bowman et al., 2016)
to augment an NMT system with a single Gaussian latent
variable. This work was extended by Schulz et al. (2018),
who considered a sequence of latent Gaussian variables to
represent each target word. Kaiser et al. (2018) proposed
a similar model, but with groups of discrete multinomial
latent variables. In their qualitative analysis, Kaiser et al.
(2018) showed that the latent codes do affect the output pre-
dictions in interesting ways, but their focus was on speeding
up regular decoding rather than producing a diverse set of
hypotheses. None of these works analyzed and quantified
diversity introduced by such latent variables.
The most relevant work is by He et al. (2018), who proposed
a soft mixture model with a uniform prior for diverse
machine translation. However, they did not evaluate on
datasets with multiple references, nor did they analyze the
full spectrum of design choices for building mixture mod-
els. Moreover, they used weaker base models and did not
compare to variational NMT or diverse decoding baselines,
which makes their empirical analysis less conclusive. We
provide a comprehensive study and shed light on the differ-
ent behaviors of mixture models in a variety of settings.
Besides machine translation, there is work on latent vari-
ables for dialogue generation (Serban et al., 2017; Cao &
Clark, 2017; Wen et al., 2017) and image captioning (Wang
et al., 2017; Dai et al., 2017). The proposed mixture model
departs from these VAE- and GAN-based approaches and,
importantly, is much simpler. It could also be applied to
other text generation tasks.
3. Mixture Models for Diverse MT
A standard neural machine translation (NMT) model has
an encoder-decoder structure. The encoder maps a source
sentence $x$ to a sequence of hidden states, which are
then fed to the decoder to generate an output sentence
one word at a time. At each time step, the decoder ad-
ditionally conditions its output on the previous outputs,
resulting in an auto-regressive factorization
$p(y|x; \theta) = \prod_{t=1}^{T} p(y_t | y_{1:t-1}, x; \theta)$, where $(y_1, \cdots, y_T)$ are the words that compose a target sentence $y$.
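As a concrete reading of this factorization, the log-likelihood of a target sentence is the sum of per-token log-probabilities computed under teacher forcing. The sketch below is a minimal illustration under assumed interfaces (a `decoder` callable returning per-step vocabulary logits, and precomputed encoder states); it is not tied to any particular NMT implementation.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(decoder, enc_states, y, bos_idx, pad_idx):
    """log p(y | x; theta) = sum_t log p(y_t | y_{1:t-1}, x; theta).

    decoder(prev_tokens, enc_states) is assumed to return logits of shape
    (batch, T, vocab); y is a (batch, T) tensor of target token ids.
    """
    # Teacher forcing: predict y_t from the gold prefix y_{1:t-1}.
    prev = torch.cat([torch.full_like(y[:, :1], bos_idx), y[:, :-1]], dim=1)
    log_probs = F.log_softmax(decoder(prev, enc_states), dim=-1)
    token_lp = log_probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)   # (batch, T)
    mask = (y != pad_idx).float()                                  # ignore padding
    return (token_lp * mask).sum(dim=1)                            # (batch,)
```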
However, the machine translation task has inherent uncertainty, due to the existence of multiple valid translations $y$ for a given source sentence $x$. With the auto-regressive factorization, all uncertainty is represented in the decoder output distribution, making it difficult to search for multiple modes of $p(y|x; \theta)$. Indeed, widely used decoding algorithms such as beam search typically produce hypotheses of low diversity, with only minor differences in the suffix (Ott et al., 2018a).
Mixture models provide an alternative approach to model-
ing uncertainty and generating diverse translations. While
these models have primarily been explored as a means of in-
creasing model capacity (Jacobs et al., 1991; Shazeer et al.,
2017), they are also a natural way of modeling different
translation styles (He et al., 2018).
Formally, given a source sentence $x$ and reference translation $y$, a mixture model introduces a multinomial latent variable $z \in \{1, \cdots, K\}$ and decomposes the marginal likelihood as:

$$p(y|x; \theta) = \sum_{z=1}^{K} p(y, z|x; \theta) = \sum_{z=1}^{K} p(z|x; \theta)\, p(y|z, x; \theta) \qquad (1)$$
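Numerically, the marginal in Eq. (1) is a log-sum-exp over components. The sketch below assumes per-component sequence log-likelihoods (e.g., from a routine like `sequence_log_prob` above) and a prior over components, with a uniform prior as a special case; names and shapes are illustrative assumptions rather than the paper's code.

```python
import math
import torch

def mixture_log_marginal(component_log_probs, prior_log_probs):
    """log p(y|x) = log sum_z p(z|x) p(y|z,x), computed stably in log space.

    component_log_probs: (batch, K) tensor of log p(y | z, x; theta).
    prior_log_probs:     (batch, K) tensor of log p(z | x; theta).
    """
    return torch.logsumexp(prior_log_probs + component_log_probs, dim=-1)

# With a uniform prior, log p(z|x) is simply -log K for every component:
K, batch = 3, 2
component_lp = torch.randn(batch, K) - 50.0          # stand-in log-likelihoods
uniform_prior = torch.full((batch, K), -math.log(K))
log_marginal = mixture_log_marginal(component_lp, uniform_prior)
```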