Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151–164
Florence, Italy, July 28 – August 2, 2019.
©2019 Association for Computational Linguistics
Massively Multilingual Transfer for NER
Afshin Rahimi∗  Yuan Li∗  Trevor Cohn
School of Computing and Information Systems
The University of Melbourne
yuanl4@student.unimelb.edu.au
{rahimia,t.cohn}@unimelb.edu.au
Abstract
In cross-lingual transfer, NLP models over one
or more source languages are applied to a low-
resource target language. While most prior
work has used a single source model or a
few carefully selected models, here we con-
sider a “massive” setting with many such mod-
els. This setting raises the problem of poor
transfer, particularly from distant languages.
We propose two techniques for modulating
the transfer, suitable for zero-shot or few-shot
learning, respectively. Evaluating on named
entity recognition, we show that our tech-
niques are much more effective than strong
baselines, including standard ensembling, and
our unsupervised method rivals oracle selec-
tion of the single best individual model.¹
1 Introduction
Supervised learning remains king in natural lan-
guage processing, with most tasks requiring large
quantities of annotated corpora. The majority of
the world’s 6,000+ languages however have lim-
ited or no annotated text, and therefore much
of the progress in NLP has yet to be realised
widely. Cross-lingual transfer learning is a tech-
nique which can compensate for the dearth of
data, by transferring knowledge from high- to low-
resource languages, which has typically taken the
form of annotation projection over parallel corpora
or other multilingual resources (Yarowsky et al.,
2001; Hwa et al., 2005), or making use of trans-
ferable representations, such as phonetic transcrip-
tions (Bharadwaj et al., 2016), closely related lan-
guages (Cotterell and Duh, 2017) or bilingual dic-
tionaries (Mayhew et al., 2017; Xie et al., 2018).
Most methods proposed for cross-lingual trans-
fer rely on a single source language, which lim-
its the transferable knowledge to only one source.
∗ Both authors contributed equally to this work.
¹ The code and the datasets will be made available at https://github.com/afshinrahimi/mmner.
The target language might be similar to many
source languages, on the grounds of the script,
word order, loan words, etc., and transfer would
benefit from these diverse sources of information.
There are a few exceptions, which use transfer
from several languages, ranging from multitask
learning (Duong et al., 2015; Ammar et al., 2016;
Fang and Cohn, 2017) to annotation projection
from several languages (Täckström, 2012; Fang
and Cohn, 2016; Plank and Agić, 2018). How-
ever, to the best of our knowledge, none of these
approaches adequately account for the quality of
transfer, but rather “weight” the contribution of
each language uniformly.
In this paper, we propose a novel method for
zero-shot multilingual transfer, inspired by re-
search in truth inference in crowd-sourcing, a re-
lated problem, in which the ‘ground truth’ must be
inferred from the outputs of several unreliable an-
notators (Dawid and Skene, 1979). In this prob-
lem, the best approaches estimate each model’s
reliability, and their patterns of mistakes (Kim
and Ghahramani, 2012). Our proposed model
adapts these ideas to a multilingual transfer set-
ting, whereby we learn the quality of transfer, and
language-specific transfer errors, in order to infer
the best labelling in the target language, as part of
a Bayesian graphical model. The key insight is
that while the majority of poor models make lots
of mistakes, these mistakes are diverse, while the
few good models consistently provide reliable in-
put. This allows the model to infer which are the
reliable models in an unsupervised manner, i.e.,
without explicit supervision in the target language,
and thereby make accurate inferences despite the
substantial noise.
In the paper, we also consider a supervised set-
ting, where a tiny annotated corpus is available in
the target language. We present two methods to
use this data: 1) estimate reliability parameters of
the Bayesian model, and 2) explicit model selec-
tion and fine-tuning of a low-resource supervised
model, thus allowing for more accurate modelling
of language specific parameters, such as charac-
ter embeddings, shown to be important in previous
work (Xie et al., 2018).
Experimenting on two NER corpora, one with
as many as 41 languages, we show that single
model transfer has highly variable performance,
and uniform ensembling often substantially under-
performs the single best model. In contrast, our
zero-shot approach does much better, exceeding
the performance of the single best model, and our
few-shot supervised models result in further gains.
2 Approach
We frame the problem of multilingual transfer as follows. We assume a collection of H models, all trained in a high resource setting, denoted M^h = {M^h_i, i ∈ (1, H)}. Each of these models is not well matched to our target data setting; for instance, these may be trained on data from different domains, or on different languages, as we evaluate in our experiments, where we use cross-lingual embeddings for model transfer. This is a problem of transfer learning, namely, how best we can use the H models for the best results in the target language.²
Simple approaches in this setting include a) choosing a single model M ∈ M^h, on the grounds of practicality, or the similarity between the model's native data condition and the target, and using this model to label the target data; or b) allowing all models to 'vote' in a classifier ensemble, such that the most frequent outcome is selected as the ensemble output. Unfortunately, neither of these approaches is very accurate in a cross-lingual transfer setting, as we show in §4, where a fixed source language model (en) dramatically underperforms compared to oracle selection of the source language, and the same is true for uniform voting.
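For concreteness, baseline (b), uniform voting, can be sketched in a few lines. This is our own minimal illustration, not the paper's code; it assumes each model's output is a flat list of label strings, and the name `uniform_vote` is ours.

```python
from collections import Counter

def uniform_vote(predictions):
    """Uniform ensembling: for each instance, pick the label that the
    most transfer models agree on (ties broken by Counter ordering).

    predictions: list of H lists, each holding N predicted labels.
    """
    n = len(predictions[0])
    ensemble = []
    for i in range(n):
        votes = Counter(model[i] for model in predictions)
        ensemble.append(votes.most_common(1)[0][0])
    return ensemble

# Three hypothetical source-language models labelling four tokens:
preds = [["B-PER", "O", "O", "B-ORG"],
         ["B-PER", "O", "O", "B-LOC"],
         ["O",     "O", "O", "B-ORG"]]
print(uniform_vote(preds))  # ['B-PER', 'O', 'O', 'B-ORG']
```

Note that every model's vote carries equal weight here, which is exactly the weakness the paper targets: one noisy distant-language model can outvote a single reliable one.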
Motivated by these findings, we propose novel methods for learning. For the "zero-shot" setting where no labelled data is available in the target, we propose the BEA_uns method, inspired by work in truth inference from crowd-sourced datasets or diverse classifiers (§2.1). To handle the "few-shot" case, §2.2 presents a rival supervised technique, RaRe, based on using very limited annotations in the target language for model selection and classifier fine-tuning.

² We limit our attention to transfer in a 'black-box' setting, that is, given predictive models, but not assuming access to their data, nor their implementation. This is the most flexible scenario, as it allows for application to settings with closed APIs, and private datasets. It does, however, preclude multi-task learning, as the source models are assumed to be static.

[Figure: plate diagram with plates over instances i = 1…N and transfer models j = 1…H, variables π, z_i, y_ij, V^(j), and hyperparameters α, β.]
Figure 1: Plate diagram for the BEA model.
2.1 Zero-Shot Transfer
One way to improve the performance of the en-
semble system is to select a subset of compo-
nent models carefully, or more generally, learn a
non-uniform weighting function. Some models do much better than others, on their own, so it stands to reason that identifying this handful of models will give rise to better ensemble performance.
How might we proceed to learn the relative qual-
ity of models in the setting where no annotations
are available in the target language? This is a clas-
sic unsupervised inference problem, for which we
propose a probabilistic graphical model, inspired
by Kim and Ghahramani (2012).
We develop a generative model, illustrated in Figure 1, of the transfer models' predictions, y_ij, where i ∈ [1, N] is an instance (a token or an entity span), and j ∈ [1, H] indexes a transfer model. The generative process assumes a 'true' label, z_i ∈ [1, K], which is corrupted by each transfer model in producing the prediction, y_ij. The corruption process is described by P(y_ij = l | z_i = k, V^(j)) = V^(j)_{kl}, where V^(j) ∈ R^{K×K} is the confusion matrix specific to a transfer model.
To complete the story, the confusion matri-
ces are drawn from vague row-wise independent
Dirichlet priors, with a parameter α = 1, and the
true labels are governed by a Dirichlet prior, π,
which is drawn from an uninformative Dirichlet
distribution with a parameter β = 1. This genera-
tive model is referred to as BEA.
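To make the generative story concrete, the following sketch samples from it: a label prior π from a Dirichlet with parameter β, row-wise Dirichlet confusion matrices V^(j) with parameter α, true labels z, and corrupted predictions y. This is our own illustration; the function name `sample_bea` and the default sizes are assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bea(N=1000, H=5, K=3, alpha=1.0, beta=1.0):
    """Sample from the BEA generative process."""
    pi = rng.dirichlet(beta * np.ones(K))               # label prior pi ~ Dir(beta)
    V = rng.dirichlet(alpha * np.ones(K), size=(H, K))  # V[j, k] = row k of model j's confusion matrix
    z = rng.choice(K, size=N, p=pi)                     # true labels z_i ~ Cat(pi)
    y = np.array([[rng.choice(K, p=V[j, z[i]]) for j in range(H)]
                  for i in range(N)])                   # y_ij ~ Cat(V^(j)_{z_i, .})
    return pi, V, z, y

pi, V, z, y = sample_bea()
print(y.shape)  # (1000, 5)
```

Simulated data like this is a convenient sanity check for the inference procedure: a correct implementation should recover z with high accuracy when a few of the H confusion matrices are close to the identity.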
Inference under the BEA model involves explaining the observed predictions Y in the most efficient way. Where several transfer models have identical predictions, k, on an instance, this can be explained by letting z_i = k,³ and the confusion matrices of those transfer models assigning high probability to V^(j)_{kk}. Other, less reliable, transfer models will have divergent predictions, which are less likely to be in agreement, or else are heavily biased towards a particular class. Accordingly, the BEA model can better explain these predictions through label confusion, using the off-diagonal elements of the confusion matrix. Aggregated over a corpus of instances, the BEA model can learn to differentiate between those reliable transfer models, with high V^(j)_{kk}, and those less reliable ones, with high V^(j)_{kl}, l ≠ k. This procedure applies per label, and thus the 'reliability' of a transfer model is with respect to a specific label, and may differ between classes. This helps in the NER setting, where many poor transfer models have excellent accuracy for the outside label, but considerably worse performance for entity labels.
For inference, we use mean-field variational Bayes (Jordan, 1998), which learns a variational distribution, q(Z, V, π), to optimise the evidence lower bound (ELBO),

log P(Y | α, β) ≥ E_{q(Z,V,π)} [ log ( P(Y, Z, V, π | α, β) / q(Z, V, π) ) ],

assuming a fully factorised variational distribution, q(Z, V, π) = q(Z) q(V) q(π). This gives rise to an iterative learning algorithm with update rules:
E_q[log π_k] = ψ(β + Σ_i q(z_i = k)) − ψ(Kβ + N)                                  (1a)

E_q[log V^(j)_{kl}] = ψ(α + Σ_i q(z_i = k) 1[y_ij = l]) − ψ(Kα + Σ_i q(z_i = k))  (1b)

q(z_i = k) ∝ exp( E_q[log π_k] + Σ_j E_q[log V^(j)_{k, y_ij}] )                   (2)
³ Although there is no explicit breaking of the symmetry of the model, we initialise inference using the majority vote, which results in a bias towards this solution.
         w1     w2     w3     w4    | [1,4]  [2,4]  [3,4]
M^h_1    B-ORG  I-ORG  I-ORG  I-ORG | ORG    O      O
M^h_2    O      B-ORG  I-ORG  I-ORG | O      ORG    O
M^h_3    O      O      B-ORG  I-ORG | O      O      ORG
M^h_4    O      B-PER  I-PER  I-PER | O      PER    O
M^h_5    O      B-PER  I-PER  I-PER | O      PER    O
Agg.     O      B-PER  I-ORG  I-ORG | O      PER    O

Table 1: An example sentence with its aggregated labels in both token view and entity view. Aggregation in token view may generate results inconsistent with the BIO scheme.
where ψ is the digamma function, defined as the logarithmic derivative of the gamma function. The sets of rules (1) and (2) are applied alternately, to update the values of E_q[log π_k], E_q[log V^(j)_{kl}], and q(z_i = k) respectively. This repeats until convergence, when the difference in the ELBO between two iterations is smaller than a threshold.

The final prediction of the model is based on q(Z), using the maximum a posteriori label ẑ_i = arg max_z q(z_i = z). This method is referred to as BEA_uns.
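As a concrete illustration, update rules (1a), (1b) and (2) fit in a short NumPy/SciPy routine. The sketch below is our own rendering, not the paper's released code: the name `bea_vb`, the fixed iteration cap (in place of the ELBO-based stopping criterion), and zero-based labels are all assumptions. It is initialised from the majority vote, per footnote 3.

```python
import numpy as np
from scipy.special import digamma

def bea_vb(y, K, alpha=1.0, beta=1.0, n_iter=50):
    """Mean-field VB for the BEA model (updates (1a), (1b), (2)).

    y: (N, H) integer matrix of transfer-model predictions in [0, K).
    Returns q(z): an (N, K) posterior over the true labels.
    """
    N, H = y.shape
    # Initialise q(z) from the majority vote, breaking symmetry.
    qz = np.zeros((N, K))
    for i in range(N):
        counts = np.bincount(y[i], minlength=K)
        qz[i] = counts / counts.sum()
    for _ in range(n_iter):
        # (1a): E_q[log pi_k]
        Elog_pi = digamma(beta + qz.sum(0)) - digamma(K * beta + N)
        # (1b): E_q[log V^(j)_{kl}] for each transfer model j
        Elog_V = np.empty((H, K, K))
        for j in range(H):
            counts = np.zeros((K, K))  # counts[k, l] = sum_i q(z_i=k) 1[y_ij=l]
            for l in range(K):
                counts[:, l] = qz[y[:, j] == l].sum(0)
            Elog_V[j] = digamma(alpha + counts) - \
                digamma(K * alpha + counts.sum(1, keepdims=True))
        # (2): update q(z_i = k), normalised per instance
        logq = Elog_pi + sum(Elog_V[j][:, y[:, j]].T for j in range(H))
        logq -= logq.max(1, keepdims=True)
        qz = np.exp(logq)
        qz /= qz.sum(1, keepdims=True)
    return qz
```

On synthetic data where a couple of models are reliable and the rest are noisy, `qz.argmax(1)` recovers the true labels well, which is the behaviour the paper relies on in the zero-shot setting.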
In our NER transfer task, classifiers are diverse in their F1 scores, ranging from almost 0 to around 80, motivating spammer removal (Raykar and Yu, 2012) to filter out the worst of the transfer models. We adopt a simple strategy that first estimates the confusion matrices for all transfer models on all labels, then ranks the models by their mean recall over the entity categories (the diagonal elements of their confusion matrices), and finally runs the BEA model again using only the labels from the top k transfer models. We call this method BEA_uns×2, and its results are reported in §4.
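The ranking step can be sketched as follows, assuming the confusion matrices have already been estimated by a first BEA run; the helper `rank_models` is our own naming, not from the paper.

```python
import numpy as np

def rank_models(V_hat, entity_labels, top_k):
    """Rank transfer models by mean recall over entity classes and
    keep the top k, as in the BEA_uns×2 filtering step.

    V_hat: (H, K, K) estimated confusion matrices (rows sum to 1).
    entity_labels: indices of the entity classes (excluding 'O').
    """
    # Recall of model j on class k is the diagonal entry V_hat[j, k, k].
    recalls = V_hat[:, entity_labels, entity_labels]  # shape (H, |entities|)
    scores = recalls.mean(axis=1)
    return np.argsort(scores)[::-1][:top_k]

# Two hypothetical models: model 0 is reliable on entity classes 1 and 2,
# model 1 mostly collapses entities onto the outside label (class 0).
V = np.array([[[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
              [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.6, 0.1, 0.3]]])
print(rank_models(V, entity_labels=[1, 2], top_k=1))  # [0]
```

Scoring only the entity rows matters here: both hypothetical models look equally good on the outside label, so ranking on overall diagonal mass would fail to separate them.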
2.1.1 Token versus Entity Granularity
Our proposed aggregation method in §2.1 is based
on the assumption that the true annotations are independent of each other, which simplifies the
model but may generate undesired results. That
is, entities predicted by different transfer models
could be mixed, resulting in labels inconsistent
with the BIO scheme. Table 1 shows an exam-
ple, where a sentence with 4 words is annotated
by 5 transfer models with 4 different predictions,
among which at most one is correct as they over-
lap. However, the aggregated result in the token
view is a mixture of two predictions, which is sup-
ported by no transfer models.
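To make the inconsistency concrete, a small helper (ours, not from the paper) flags token positions where an aggregated sequence violates the BIO scheme; on the Table 1 aggregation it flags the I-ORG that follows B-PER.

```python
def bio_violations(labels):
    """Return positions where a token sequence breaks the BIO scheme:
    an I-X tag must follow a B-X or I-X tag of the same entity type."""
    bad = []
    prev = "O"
    for i, tag in enumerate(labels):
        if tag.startswith("I-") and prev not in (f"B-{tag[2:]}", tag):
            bad.append(i)
        prev = tag
    return bad

# Token-view aggregation from Table 1 mixes two models' predictions:
agg = ["O", "B-PER", "I-ORG", "I-ORG"]
print(bio_violations(agg))  # [2]
```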
To deal with this problem, we consider aggre-