Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151–164
Florence, Italy, July 28 – August 2, 2019.
©2019 Association for Computational Linguistics
Massively Multilingual Transfer for NER
Afshin Rahimi∗  Yuan Li∗  Trevor Cohn
School of Computing and Information Systems
The University of Melbourne
yuanl4@student.unimelb.edu.au
{rahimia,t.cohn}@unimelb.edu.au
Abstract
In cross-lingual transfer, NLP models over one
or more source languages are applied to a low-
resource target language. While most prior
work has used a single source model or a
few carefully selected models, here we con-
sider a “massive” setting with many such mod-
els. This setting raises the problem of poor
transfer, particularly from distant languages.
We propose two techniques for modulating
the transfer, suitable for zero-shot or few-shot
learning, respectively. Evaluating on named
entity recognition, we show that our tech-
niques are much more effective than strong
baselines, including standard ensembling, and
our unsupervised method rivals oracle selec-
tion of the single best individual model.¹
1 Introduction
Supervised learning remains king in natural lan-
guage processing, with most tasks requiring large
quantities of annotated corpora. The majority of
the world’s 6,000+ languages however have lim-
ited or no annotated text, and therefore much
of the progress in NLP has yet to be realised
widely. Cross-lingual transfer learning is a tech-
nique which can compensate for the dearth of
data, by transferring knowledge from high- to low-
resource languages, which has typically taken the
form of annotation projection over parallel corpora
or other multilingual resources (Yarowsky et al.,
2001; Hwa et al., 2005), or making use of trans-
ferable representations, such as phonetic transcrip-
tions (Bharadwaj et al., 2016), closely related lan-
guages (Cotterell and Duh, 2017) or bilingual dic-
tionaries (Mayhew et al., 2017; Xie et al., 2018).
Most methods proposed for cross-lingual trans-
fer rely on a single source language, which lim-
its the transferable knowledge to only one source.
∗ Both authors contributed equally to this work.
¹ The code and the datasets will be made available at https://github.com/afshinrahimi/mmner.
The target language might be similar to many
source languages, on the grounds of the script,
word order, loan words, etc., and transfer would
benefit from these diverse sources of information.
There are a few exceptions, which use transfer
from several languages, ranging from multitask
learning (Duong et al., 2015; Ammar et al., 2016;
Fang and Cohn, 2017) to annotation projection
from several languages (Täckström, 2012; Fang
and Cohn, 2016; Plank and Agić, 2018). How-
ever, to the best of our knowledge, none of these
approaches adequately account for the quality of
transfer, but rather “weight” the contribution of
each language uniformly.
In this paper, we propose a novel method for
zero-shot multilingual transfer, inspired by re-
search in truth inference in crowd-sourcing, a re-
lated problem, in which the ‘ground truth’ must be
inferred from the outputs of several unreliable an-
notators (Dawid and Skene, 1979). In this prob-
lem, the best approaches estimate each model’s
reliability, and their patterns of mistakes (Kim
and Ghahramani, 2012). Our proposed model
adapts these ideas to a multilingual transfer set-
ting, whereby we learn the quality of transfer, and
language-specific transfer errors, in order to infer
the best labelling in the target language, as part of
a Bayesian graphical model. The key insight is
that while the majority of poor models make lots
of mistakes, these mistakes are diverse, while the
few good models consistently provide reliable in-
put. This allows the model to infer which are the
reliable models in an unsupervised manner, i.e.,
without explicit supervision in the target language,
and thereby make accurate inferences despite the
substantial noise.
In the paper, we also consider a supervised set-
ting, where a tiny annotated corpus is available in
the target language. We present two methods to
use this data: 1) estimate reliability parameters of
the Bayesian model, and 2) explicit model selec-
tion and fine-tuning of a low-resource supervised
model, thus allowing for more accurate modelling
of language specific parameters, such as charac-
ter embeddings, shown to be important in previous
work (Xie et al., 2018).
Experimenting on two NER corpora, one with
as many as 41 languages, we show that single
model transfer has highly variable performance,
and uniform ensembling often substantially under-
performs the single best model. In contrast, our
zero-shot approach does much better, exceeding
the performance of the single best model, and our
few-shot supervised models result in further gains.
2 Approach
We frame the problem of multilingual transfer as follows. We assume a collection of H models, all trained in a high resource setting, denoted M^h = {M^h_i, i ∈ (1, H)}. Each of these models is not well matched to our target data setting; for instance, these may be trained on data from different domains, or on different languages, as we evaluate in our experiments, where we use cross-lingual embeddings for model transfer. This is a problem of transfer learning, namely, how best we can use the H models for the best results in the target language.²
Simple approaches in this setting include a) choosing a single model M ∈ M^h, on the grounds of practicality, or the similarity between the model's native data condition and the target, and using this model to label the target data; or b) allowing all models to 'vote' in a classifier ensemble, such that the most frequent outcome is selected as the ensemble output. Unfortunately, neither of these approaches is very accurate in a cross-lingual transfer setting, as we show in §4, where a fixed source language model (en) dramatically underperforms compared to oracle selection of the source language, and the same is true for uniform voting.
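For concreteness, baseline (b), uniform voting, can be sketched in a few lines. This is our own minimal illustration, not the paper's code; it assumes each model's output is a flat list of label strings, and the name `uniform_vote` is ours.

```python
from collections import Counter

def uniform_vote(predictions):
    """Uniform ensembling: for each instance, pick the label that the
    most transfer models agree on (ties broken by Counter ordering).

    predictions: list of H lists, each holding N predicted labels.
    """
    n = len(predictions[0])
    ensemble = []
    for i in range(n):
        votes = Counter(model[i] for model in predictions)
        ensemble.append(votes.most_common(1)[0][0])
    return ensemble

# Three hypothetical source-language models labelling four tokens:
preds = [["B-PER", "O", "O", "B-ORG"],
         ["B-PER", "O", "O", "B-LOC"],
         ["O",     "O", "O", "B-ORG"]]
print(uniform_vote(preds))  # ['B-PER', 'O', 'O', 'B-ORG']
```

Note that every model's vote carries equal weight here, which is exactly the weakness the paper targets: one noisy distant-language model can outvote a single reliable one.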
Motivated by these findings, we propose novel methods for learning. For the "zero-shot" setting where no labelled data is available in the target, we propose the BEA_uns method, inspired by work in truth inference from crowd-sourced datasets or diverse classifiers (§2.1). To handle the "few-shot" case, §2.2 presents a rival supervised technique, RaRe, based on using very limited annotations in the target language for model selection and classifier fine-tuning.

² We limit our attention to transfer in a 'black-box' setting, that is, given predictive models, but not assuming access to their data, nor their implementation. This is the most flexible scenario, as it allows for application to settings with closed APIs, and private datasets. It does, however, preclude multi-task learning, as the source models are assumed to be static.

[Figure: plate diagram with plates over instances i = 1…N and transfer models j = 1…H, variables π, z_i, y_ij, V^(j), and hyperparameters α, β.]
Figure 1: Plate diagram for the BEA model.
2.1 Zero-Shot Transfer
One way to improve the performance of the en-
semble system is to select a subset of compo-
nent models carefully, or more generally, learn a
non-uniform weighting function. Some models do much better than others, on their own, so it stands to reason that identifying this handful of models will give rise to better ensemble performance.
How might we proceed to learn the relative qual-
ity of models in the setting where no annotations
are available in the target language? This is a clas-
sic unsupervised inference problem, for which we
propose a probabilistic graphical model, inspired
by Kim and Ghahramani (2012).
We develop a generative model, illustrated in Figure 1, of the transfer models' predictions, y_ij, where i ∈ [1, N] is an instance (a token or an entity span), and j ∈ [1, H] indexes a transfer model. The generative process assumes a 'true' label, z_i ∈ [1, K], which is corrupted by each transfer model in producing the prediction, y_ij. The corruption process is described by P(y_ij = l | z_i = k, V^(j)) = V^(j)_{kl}, where V^(j) ∈ R^{K×K} is the confusion matrix specific to a transfer model.
To complete the story, the confusion matri-
ces are drawn from vague row-wise independent
Dirichlet priors, with a parameter α = 1, and the
true labels are governed by a Dirichlet prior, π,
which is drawn from an uninformative Dirichlet
distribution with a parameter β = 1. This genera-
tive model is referred to as BEA.
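To make the generative story concrete, the following sketch samples from it: a label prior π from a Dirichlet with parameter β, row-wise Dirichlet confusion matrices V^(j) with parameter α, true labels z, and corrupted predictions y. This is our own illustration; the function name `sample_bea` and the default sizes are assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bea(N=1000, H=5, K=3, alpha=1.0, beta=1.0):
    """Sample from the BEA generative process."""
    pi = rng.dirichlet(beta * np.ones(K))               # label prior pi ~ Dir(beta)
    V = rng.dirichlet(alpha * np.ones(K), size=(H, K))  # V[j, k] = row k of model j's confusion matrix
    z = rng.choice(K, size=N, p=pi)                     # true labels z_i ~ Cat(pi)
    y = np.array([[rng.choice(K, p=V[j, z[i]]) for j in range(H)]
                  for i in range(N)])                   # y_ij ~ Cat(V^(j)_{z_i, .})
    return pi, V, z, y

pi, V, z, y = sample_bea()
print(y.shape)  # (1000, 5)
```

Simulated data like this is a convenient sanity check for the inference procedure: a correct implementation should recover z with high accuracy when a few of the H confusion matrices are close to the identity.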
Inference under the BEA model involves explaining the observed predictions Y in the most efficient way. Where several transfer models have identical predictions, k, on an instance, this can be explained by letting z_i = k,³ and the confusion matrices of those transfer models assigning high probability to V^(j)_{kk}. Other, less reliable, transfer models will have divergent predictions, which are less likely to be in agreement, or else are heavily biased towards a particular class. Accordingly, the BEA model can better explain these predictions through label confusion, using the off-diagonal elements of the confusion matrix. Aggregated over a corpus of instances, the BEA model can learn to differentiate between those reliable transfer models, with high V^(j)_{kk}, and those less reliable ones, with high V^(j)_{kl}, l ≠ k. This procedure applies per label, and thus the 'reliability' of a transfer model is with respect to a specific label, and may differ between classes. This helps in the NER setting, where many poor transfer models have excellent accuracy for the outside label, but considerably worse performance for entity labels.
For inference, we use mean-field variational Bayes (Jordan, 1998), which learns a variational distribution, q(Z, V, π), to optimise the evidence lower bound (ELBO),

log P(Y | α, β) ≥ E_{q(Z,V,π)} [ log ( P(Y, Z, V, π | α, β) / q(Z, V, π) ) ],

assuming a fully factorised variational distribution, q(Z, V, π) = q(Z) q(V) q(π). This gives rise to an iterative learning algorithm with update rules:
E_q[log π_k] = ψ(β + Σ_i q(z_i = k)) − ψ(Kβ + N)                                  (1a)

E_q[log V^(j)_{kl}] = ψ(α + Σ_i q(z_i = k) 1[y_ij = l]) − ψ(Kα + Σ_i q(z_i = k))  (1b)

q(z_i = k) ∝ exp( E_q[log π_k] + Σ_j E_q[log V^(j)_{k, y_ij}] )                   (2)
³ Although there is no explicit breaking of the symmetry of the model, we initialise inference using the majority vote, which results in a bias towards this solution.
         w1     w2     w3     w4    | [1,4]  [2,4]  [3,4]
M^h_1    B-ORG  I-ORG  I-ORG  I-ORG | ORG    O      O
M^h_2    O      B-ORG  I-ORG  I-ORG | O      ORG    O
M^h_3    O      O      B-ORG  I-ORG | O      O      ORG
M^h_4    O      B-PER  I-PER  I-PER | O      PER    O
M^h_5    O      B-PER  I-PER  I-PER | O      PER    O
Agg.     O      B-PER  I-ORG  I-ORG | O      PER    O

Table 1: An example sentence with its aggregated labels in both token view and entity view. Aggregation in token view may generate results inconsistent with the BIO scheme.
where ψ is the digamma function, defined as the logarithmic derivative of the gamma function. The sets of rules (1) and (2) are applied alternately, to update the values of E_q[log π_k], E_q[log V^(j)_{kl}], and q(z_i = k) respectively. This repeats until convergence, when the difference in the ELBO between two iterations is smaller than a threshold.

The final prediction of the model is based on q(Z), using the maximum a posteriori label ẑ_i = arg max_z q(z_i = z). This method is referred to as BEA_uns.
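As a concrete illustration, update rules (1a), (1b) and (2) fit in a short NumPy/SciPy routine. The sketch below is our own rendering, not the paper's released code: the name `bea_vb`, the fixed iteration cap (in place of the ELBO-based stopping criterion), and zero-based labels are all assumptions. It is initialised from the majority vote, per footnote 3.

```python
import numpy as np
from scipy.special import digamma

def bea_vb(y, K, alpha=1.0, beta=1.0, n_iter=50):
    """Mean-field VB for the BEA model (updates (1a), (1b), (2)).

    y: (N, H) integer matrix of transfer-model predictions in [0, K).
    Returns q(z): an (N, K) posterior over the true labels.
    """
    N, H = y.shape
    # Initialise q(z) from the majority vote, breaking symmetry.
    qz = np.zeros((N, K))
    for i in range(N):
        counts = np.bincount(y[i], minlength=K)
        qz[i] = counts / counts.sum()
    for _ in range(n_iter):
        # (1a): E_q[log pi_k]
        Elog_pi = digamma(beta + qz.sum(0)) - digamma(K * beta + N)
        # (1b): E_q[log V^(j)_{kl}] for each transfer model j
        Elog_V = np.empty((H, K, K))
        for j in range(H):
            counts = np.zeros((K, K))  # counts[k, l] = sum_i q(z_i=k) 1[y_ij=l]
            for l in range(K):
                counts[:, l] = qz[y[:, j] == l].sum(0)
            Elog_V[j] = digamma(alpha + counts) - \
                digamma(K * alpha + counts.sum(1, keepdims=True))
        # (2): update q(z_i = k), normalised per instance
        logq = Elog_pi + sum(Elog_V[j][:, y[:, j]].T for j in range(H))
        logq -= logq.max(1, keepdims=True)
        qz = np.exp(logq)
        qz /= qz.sum(1, keepdims=True)
    return qz
```

On synthetic data where a couple of models are reliable and the rest are noisy, `qz.argmax(1)` recovers the true labels well, which is the behaviour the paper relies on in the zero-shot setting.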
In our NER transfer task, classifiers are diverse in their F1 scores, ranging from almost 0 to around 80, motivating spammer removal (Raykar and Yu, 2012) to filter out the worst of the transfer models. We adopt a simple strategy that first estimates the confusion matrices for all transfer models on all labels, then ranks the models by their mean recall over the entity categories (the diagonal elements of their confusion matrices), and finally runs the BEA model again using only the labels from the top k transfer models. We call this method BEA_uns×2, and its results are reported in §4.
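The ranking step can be sketched as follows, assuming the confusion matrices have already been estimated by a first BEA run; the helper `rank_models` is our own naming, not from the paper.

```python
import numpy as np

def rank_models(V_hat, entity_labels, top_k):
    """Rank transfer models by mean recall over entity classes and
    keep the top k, as in the BEA_uns×2 filtering step.

    V_hat: (H, K, K) estimated confusion matrices (rows sum to 1).
    entity_labels: indices of the entity classes (excluding 'O').
    """
    # Recall of model j on class k is the diagonal entry V_hat[j, k, k].
    recalls = V_hat[:, entity_labels, entity_labels]  # shape (H, |entities|)
    scores = recalls.mean(axis=1)
    return np.argsort(scores)[::-1][:top_k]

# Two hypothetical models: model 0 is reliable on entity classes 1 and 2,
# model 1 mostly collapses entities onto the outside label (class 0).
V = np.array([[[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
              [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.6, 0.1, 0.3]]])
print(rank_models(V, entity_labels=[1, 2], top_k=1))  # [0]
```

Scoring only the entity rows matters here: both hypothetical models look equally good on the outside label, so ranking on overall diagonal mass would fail to separate them.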
2.1.1 Token versus Entity Granularity
Our proposed aggregation method in §2.1 is based
on the assumption that the true annotations are independent of each other, which simplifies the
model but may generate undesired results. That
is, entities predicted by different transfer models
could be mixed, resulting in labels inconsistent
with the BIO scheme. Table 1 shows an exam-
ple, where a sentence with 4 words is annotated
by 5 transfer models with 4 different predictions,
among which at most one is correct as they over-
lap. However, the aggregated result in the token
view is a mixture of two predictions, which is sup-
ported by no transfer models.
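To make the inconsistency concrete, a small helper (ours, not from the paper) flags token positions where an aggregated sequence violates the BIO scheme; on the Table 1 aggregation it flags the I-ORG that follows B-PER.

```python
def bio_violations(labels):
    """Return positions where a token sequence breaks the BIO scheme:
    an I-X tag must follow a B-X or I-X tag of the same entity type."""
    bad = []
    prev = "O"
    for i, tag in enumerate(labels):
        if tag.startswith("I-") and prev not in (f"B-{tag[2:]}", tag):
            bad.append(i)
        prev = tag
    return bad

# Token-view aggregation from Table 1 mixes two models' predictions:
agg = ["O", "B-PER", "I-ORG", "I-ORG"]
print(bio_violations(agg))  # [2]
```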
To deal with this problem, we consider aggre-