Bridging the Gap between Training and Inference
for Neural Machine Translation
Wen Zhang 1,2   Yang Feng 1,2∗   Fandong Meng 3   Di You 4   Qun Liu 5
1 Key Laboratory of Intelligent Information Processing,
  Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
2 University of Chinese Academy of Sciences, Beijing, China
  {zhangwen,fengyang}@ict.ac.cn
3 Pattern Recognition Center, WeChat AI, Tencent Inc, China
  fandongmeng@tencent.com
4 Worcester Polytechnic Institute, Worcester, MA, USA
  dyou@wpi.edu
5 Huawei Noah’s Ark Lab, Hong Kong, China
  qun.liu@huawei.com
Abstract
Neural Machine Translation (NMT) generates target words sequentially by predicting the next word conditioned on the context words. At training time, it predicts with the ground truth words as context, while at inference it has to generate the entire sequence from scratch. This discrepancy in the fed context leads to error accumulation along the way. Furthermore, word-level training requires strict matching between the generated sequence and the ground truth sequence, which leads to over-correction of different but reasonable translations. In this paper, we address these issues by sampling context words not only from the ground truth sequence but also from the sequence predicted by the model during training, where the predicted sequence is selected with a sentence-level optimum. Experimental results on Chinese→English and WMT’14 English→German translation tasks demonstrate that our approach achieves significant improvements on multiple datasets.
1 Introduction
Neural Machine Translation has shown promising results and drawn increasing attention recently. Most NMT models fit in the encoder-decoder framework, including the RNN-based (Sutskever et al., 2014; Bahdanau et al., 2015; Meng and Zhang, 2019), the CNN-based (Gehring et al., 2017) and the attention-based (Vaswani et al., 2017) models, which predict the next word conditioned on the previous context words, deriving a language model over target words. The scenario is that at training time the ground truth words are used as context
∗ Corresponding author.
while at inference the entire sequence is generated by the resulting model on its own, and hence the previous words generated by the model are fed as context. As a result, the predicted words at training and inference are drawn from different distributions, namely, from the data distribution as opposed to the model distribution. This discrepancy, called exposure bias (Ranzato et al., 2015), leads to a gap between training and inference. As the target sequence grows, the errors accumulate along the sequence and the model has to predict under conditions it has never met at training time.
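To make the discrepancy concrete, the sketch below contrasts the context fed to the decoder under teacher-forced training with the context used at inference. It is a toy illustration under our own assumptions, not the paper's code; toy_next_word is a hypothetical stand-in for a decoder step.

```python
# Minimal, self-contained sketch (not from the paper) contrasting training-time
# and inference-time contexts. toy_next_word is a hypothetical decoder step.

def toy_next_word(context):
    # Hypothetical decoder: picks the next word from a fixed toy vocabulary
    # based only on the length of the context it is given.
    vocab = ["we", "should", "abide", "by", "the", "law", "</s>"]
    return vocab[min(len(context), len(vocab) - 1)]

ground_truth = ["we", "should", "comply", "with", "the", "rule", "</s>"]

# Training (teacher forcing): the context is always a ground-truth prefix, so
# every prediction is made from a gold context regardless of earlier mistakes.
for t in range(1, len(ground_truth)):
    gold_context = ground_truth[:t]
    prediction = toy_next_word(gold_context)   # compared against ground_truth[t]

# Inference: the context is the model's own previous output, so an early error
# ("abide" instead of "comply") changes every later context the model sees.
generated = [ground_truth[0]]
while generated[-1] != "</s>" and len(generated) < 10:
    generated.append(toy_next_word(generated))
print(generated)
```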
Intuitively, to address this problem, the model should be trained to predict under the same conditions it will face at inference. Inspired by DATA
AS DEMONSTRATOR (DAD) (Venkatraman et al.,
2015), feeding as context both ground truth words
and the predicted words during training can be
a solution. NMT models usually optimize the
cross-entropy loss which requires a strict pairwise
matching at the word level between the predicted
sequence and the ground truth sequence. Once
the model generates a word deviating from the
ground truth sequence, the cross-entropy loss will
correct the error immediately and draw the re-
maining generation back to the ground truth se-
quence. However, this causes a new problem. A
sentence usually has multiple reasonable transla-
tions and it cannot be said that the model makes a
mistake even if it generates a word different from
the ground truth word. For example,
reference: We should comply with the rule.
cand1: We should abide with the rule.
cand2: We should abide by the law.
cand3: We should abide by the rule.
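The remedy outlined in the abstract is to let the training-time context mix ground-truth words with the model's own predictions. The snippet below is a minimal sketch of that sampling idea under our own assumptions: the decay schedule and the word-level mixing shown here are illustrative, not the authors' exact formulation (which selects the predicted sequence with a sentence-level optimum).

```python
import math
import random

# Minimal sketch (illustrative assumptions, not the authors' implementation):
# each context word fed to the decoder during training is drawn either from the
# ground-truth sequence or from the model's predicted sequence, with the chance
# of using the ground truth decaying as training proceeds.

def gold_prob(epoch, mu=12.0):
    # Illustrative decay: starts near 1 and gradually exposes the model to its
    # own predictions in later epochs.
    return mu / (mu + math.exp(epoch / mu))

def sample_context_word(gold_word, predicted_word, epoch):
    return gold_word if random.random() < gold_prob(epoch) else predicted_word

# Usage: build a mixed training context from aligned gold and predicted words.
gold = ["we", "should", "comply", "with", "the", "rule"]
predicted = ["we", "should", "abide", "by", "the", "rule"]
mixed_context = [sample_context_word(g, p, epoch=5) for g, p in zip(gold, predicted)]
```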