regularization noise can hamper that. Second, hard mixtures
yield more diverse generations than soft mixtures, similar to
how K-Means tends to find centroids that are farther apart
from each other compared to the means found by a mix-
ture of Gaussians (Kearns et al., 1998). Third, employing
a uniform prior encourages all mixture components to pro-
duce good translations for any input source sentence, which
is highly desirable. Finally, using independently parame-
terized mixture components provides greater capacity for
diverse outputs than sharing parameters; but if responsibilities are
refreshed online, independent parameterization is prone to
a degeneracy where only a single component is trained
because of the “rich get richer” effect. Conversely, the
combination of shared parameters and offline responsibility
assignment may lead to another degeneracy, in which the
mixture components fail to specialize and behave the same.
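To make the hard-versus-soft distinction above concrete, the sketch below (a minimal illustration with names of our own choosing, not the paper's implementation) contrasts a full soft posterior over components with a hard arg-max assignment when computing responsibilities:

```python
import torch
import torch.nn.functional as F

def responsibilities(joint_log_probs, hard=False):
    """Responsibilities r(z) over K mixture components for one minibatch.

    joint_log_probs: (batch, K) tensor holding log p(y, z | x) for each
    component z (an assumed input; how it is computed is model-specific).
    """
    if not hard:
        # Soft (EM-style) posterior p(z | x, y) over components.
        return torch.softmax(joint_log_probs, dim=-1)
    # Hard (K-means-style) assignment: all responsibility on the best component.
    best = joint_log_probs.argmax(dim=-1)
    return F.one_hot(best, num_classes=joint_log_probs.size(-1)).float()
```

Under hard assignment, only the selected component receives gradient for a given sentence pair, which is exactly what makes independently parameterized components with online responsibility refreshes vulnerable to the “rich get richer” degeneracy noted above.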
We extend our evaluation to three WMT benchmark datasets
for which test sets with multiple human references are
available. We demonstrate that mixture models, when suc-
cessfully trained, consistently outperform variational NMT
(Zhang et al., 2016) and diverse decoding algorithms such
as diverse beam search (Li et al., 2017; Vijayakumar et al.,
2018) and biased sampling (Graves, 2013; Fan et al., 2018).
Our qualitative analysis shows that different mixture compo-
nents can capture consistent translation styles across exam-
ples, enabling users to control generations in an interpretable
and semantically meaningful way.
2. Related Work
Prior studies have investigated prediction uncertainty in
machine translation. Dreyer & Marcu (2012) and Galley
et al. (2015) introduced new metrics to address uncertainty
at evaluation time. Ott et al. (2018a) inspected the sources
of uncertainty and proposed tools to assess the fit between
the model and the data distributions. They also observed
that modern conditional auto-regressive NMT models can
only capture uncertainty to a limited extent, and tend to
spread probability mass too smoothly over the hypothesis space.
Recent work has explored latent variable modeling for ma-
chine translation. Zhang et al. (2016) leveraged variational
inference (Kingma & Welling, 2014; Bowman et al., 2016)
to augment an NMT system with a single Gaussian latent
variable. This work was extended by Schulz et al. (2018),
who considered a sequence of latent Gaussian variables to
represent each target word. Kaiser et al. (2018) proposed
a similar model, but with groups of discrete multinomial
latent variables. In their qualitative analysis, Kaiser et al.
(2018) showed that the latent codes do affect the output pre-
dictions in interesting ways, but their focus was on speeding
up regular decoding rather than producing a diverse set of
hypotheses. None of these works analyzed and quantified
diversity introduced by such latent variables.
The most relevant work is by He et al. (2018), who proposed
a soft mixture model with a uniform prior for diverse
machine translation. However, they did not evaluate on
datasets with multiple references, nor did they analyze the
full spectrum of design choices for building mixture mod-
els. Moreover, they used weaker base models and did not
compare to variational NMT or diverse decoding baselines,
which makes their empirical analysis less conclusive. We
provide a comprehensive study and shed light on the differ-
ent behaviors of mixture models in a variety of settings.
Besides machine translation, there is work on latent vari-
ables for dialogue generation (Serban et al., 2017; Cao &
Clark, 2017; Wen et al., 2017) and image captioning (Wang
et al., 2017; Dai et al., 2017). The proposed mixture model
departs from these VAE- and GAN-based approaches and,
importantly, is much simpler. It could also be applied to
other text generation tasks.
3. Mixture Models for Diverse MT
A standard neural machine translation (NMT) model has
an encoder-decoder structure. The encoder maps a source
sentence $x$ to a sequence of hidden states, which are
then fed to the decoder to generate an output sentence
one word at a time. At each time step, the decoder ad-
ditionally conditions its output on the previous outputs,
resulting in an auto-regressive factorization
$p(y|x; \theta) = \prod_{t=1}^{T} p(y_t | y_{1:t-1}, x; \theta)$, where $(y_1, \cdots, y_T)$ are the words that compose a target sentence $y$.
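As a concrete reading of this factorization, the log-likelihood of a target sentence is the sum of per-token log-probabilities computed under teacher forcing. The sketch below is a minimal illustration under assumed interfaces (a `decoder` callable returning per-step vocabulary logits, and precomputed encoder states); it is not tied to any particular NMT implementation.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(decoder, enc_states, y, bos_idx, pad_idx):
    """log p(y | x; theta) = sum_t log p(y_t | y_{1:t-1}, x; theta).

    decoder(prev_tokens, enc_states) is assumed to return logits of shape
    (batch, T, vocab); y is a (batch, T) tensor of target token ids.
    """
    # Teacher forcing: predict y_t from the gold prefix y_{1:t-1}.
    prev = torch.cat([torch.full_like(y[:, :1], bos_idx), y[:, :-1]], dim=1)
    log_probs = F.log_softmax(decoder(prev, enc_states), dim=-1)
    token_lp = log_probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)   # (batch, T)
    mask = (y != pad_idx).float()                                  # ignore padding
    return (token_lp * mask).sum(dim=1)                            # (batch,)
```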
However, the machine translation task has inherent uncertainty, due to the existence of multiple valid translations $y$ for a given source sentence $x$. With the auto-regressive factorization, all uncertainty is represented in the decoder output distribution, making it difficult to search for multiple modes of $p(y|x; \theta)$. Indeed, widely used decoding algorithms such as beam search typically produce hypotheses of low diversity, with only minor differences in the suffix (Ott et al., 2018a).
Mixture models provide an alternative approach to model-
ing uncertainty and generating diverse translations. While
these models have primarily been explored as a means of in-
creasing model capacity (Jacobs et al., 1991; Shazeer et al.,
2017), they are also a natural way of modeling different
translation styles (He et al., 2018).
Formally, given a source sentence $x$ and reference translation $y$, a mixture model introduces a multinomial latent variable $z \in \{1, \cdots, K\}$ and decomposes the marginal likelihood as:

$$p(y|x; \theta) = \sum_{z=1}^{K} p(y, z|x; \theta) = \sum_{z=1}^{K} p(z|x; \theta)\, p(y|z, x; \theta) \qquad (1)$$
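Numerically, the marginal in Eq. (1) is a log-sum-exp over components. The sketch below assumes per-component sequence log-likelihoods (e.g., from a routine like `sequence_log_prob` above) and a prior over components, with a uniform prior as a special case; names and shapes are illustrative assumptions rather than the paper's code.

```python
import math
import torch

def mixture_log_marginal(component_log_probs, prior_log_probs):
    """log p(y|x) = log sum_z p(z|x) p(y|z,x), computed stably in log space.

    component_log_probs: (batch, K) tensor of log p(y | z, x; theta).
    prior_log_probs:     (batch, K) tensor of log p(z | x; theta).
    """
    return torch.logsumexp(prior_log_probs + component_log_probs, dim=-1)

# With a uniform prior, log p(z|x) is simply -log K for every component:
K, batch = 3, 2
component_lp = torch.randn(batch, K) - 50.0          # stand-in log-likelihoods
uniform_prior = torch.full((batch, K), -math.log(K))
log_marginal = mixture_log_marginal(component_lp, uniform_prior)
```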