BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout}@google.com
Abstract
We introduce a new language representa-
tion model called BERT, which stands for
Bidirectional Encoder Representations from
Transformers. Unlike recent language repre-
sentation models (Peters et al., 2018; Radford
et al., 2018), BERT is designed to pre-train
deep bidirectional representations by jointly
conditioning on both left and right context in
all layers. As a result, the pre-trained BERT
representations can be fine-tuned with just one
additional output layer to create state-of-the-
art models for a wide range of tasks, such
as question answering and language inference,
without substantial task-specific architecture
modifications.
BERT is conceptually simple and empirically
powerful. It obtains new state-of-the-art re-
sults on eleven natural language processing
tasks, including pushing the GLUE bench-
mark to 80.4% (7.6% absolute improvement),
MultiNLI accuracy to 86.7% (5.6% abso-
lute improvement) and the SQuAD v1.1 ques-
tion answering Test F1 to 93.2 (1.5 absolute
improvement), outperforming human perfor-
mance by 2.0.
1 Introduction
Language model pre-training has been shown to be effective for improving many natural language pro-
cessing tasks (Dai and Le, 2015; Peters et al.,
2017, 2018; Radford et al., 2018; Howard and
Ruder, 2018). These tasks include sentence-level
tasks such as natural language inference (Bow-
man et al., 2015; Williams et al., 2018) and para-
phrasing (Dolan and Brockett, 2005), which aim
to predict the relationships between sentences by
analyzing them holistically, as well as token-level
tasks such as named entity recognition (Tjong
Kim Sang and De Meulder, 2003) and SQuAD
question answering (Rajpurkar et al., 2016), where
models are required to produce fine-grained output
at the token-level.
There are two existing strategies for apply-
ing pre-trained language representations to down-
stream tasks: feature-based and fine-tuning. The
feature-based approach, such as ELMo (Peters
et al., 2018), uses task-specific architectures that
include the pre-trained representations as addi-
tional features. The fine-tuning approach, such as
the Generative Pre-trained Transformer (OpenAI
GPT) (Radford et al., 2018), introduces minimal
task-specific parameters, and is trained on the
downstream tasks by simply fine-tuning the pre-
trained parameters. In previous work, both ap-
proaches share the same objective function dur-
ing pre-training, where they use unidirectional lan-
guage models to learn general language represen-
tations.
We argue that current techniques severely re-
strict the power of the pre-trained representations,
especially for the fine-tuning approaches. The ma-
jor limitation is that standard language models are
unidirectional, and this limits the choice of archi-
tectures that can be used during pre-training. For
example, in OpenAI GPT, the authors use a left-
to-right architecture, where every token can only
attend to previous tokens in the self-attention
layers of the Transformer (Vaswani et al., 2017).
Such restrictions are sub-optimal for sentence-
level tasks, and could be devastating when ap-
plying fine-tuning based approaches to token-level
tasks such as SQuAD question answering (Ra-
jpurkar et al., 2016), where it is crucial to incor-
porate context from both directions.
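To make this restriction concrete, here is a minimal sketch, assuming NumPy and a helper name of our own choosing (make_attention_mask is not from the paper), of the difference between a left-to-right (causal) attention mask, where position i may only attend to positions j <= i, and the fully visible mask a bidirectional encoder can use.

import numpy as np

def make_attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Return a (seq_len, seq_len) 0/1 matrix; entry [i, j] = 1 means token i may attend to token j."""
    if causal:
        # Left-to-right (GPT-style): token i sees only tokens j <= i.
        return np.tril(np.ones((seq_len, seq_len), dtype=np.int32))
    # Bidirectional (BERT-style): every token sees the full left and right context.
    return np.ones((seq_len, seq_len), dtype=np.int32)

print(make_attention_mask(4, causal=True))   # lower-triangular: no access to future tokens
print(make_attention_mask(4, causal=False))  # all ones: full bidirectional context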
In this paper, we improve the fine-tuning based
approaches by proposing BERT: Bidirectional
Encoder Representations from Transformers.
BERT addresses the previously mentioned uni-
directional constraints by proposing a new
pre-training objective: the “masked language
model” (MLM), inspired by the Cloze task (Tay-
lor, 1953). The masked language model randomly
masks some of the tokens from the input, and the
objective is to predict the original vocabulary id of
the masked word based only on its context. Un-
like left-to-right language model pre-training, the
MLM objective allows the representation to fuse
the left and the right context, which allows us
to pre-train a deep bidirectional Transformer. In
addition to the masked language model, we also
introduce a “next sentence prediction” task that
jointly pre-trains text-pair representations.
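As a rough illustration of the masked language model objective, the sketch below randomly replaces input tokens with a [MASK] symbol and records the original tokens as prediction targets. The 15% masking probability and the helper mask_tokens are illustrative assumptions; the paper's later sections (beyond this excerpt) describe the exact masking strategy, which also sometimes keeps or randomly replaces the selected tokens instead of always inserting [MASK].

import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens; return (masked_tokens, labels) where labels[i] is the
    original token the model must predict at a masked position, else None."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the input
            labels.append(tok)    # the model must recover it from left and right context
        else:
            masked.append(tok)
            labels.append(None)   # no prediction loss at unmasked positions
    return masked, labels

print(mask_tokens("the man went to the store to buy a gallon of milk".split()))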
The contributions of our paper are as follows:
• We demonstrate the importance of bidirec-
tional pre-training for language representa-
tions. Unlike Radford et al. (2018), which
uses unidirectional language models for pre-
training, BERT uses masked language mod-
els to enable pre-trained deep bidirectional
representations. This is also in contrast to
Peters et al. (2018), which uses a shallow
concatenation of independently trained left-
to-right and right-to-left LMs.
• We show that pre-trained representations
eliminate the need for many heavily-
engineered task-specific architectures. BERT
is the first fine-tuning based representation
model that achieves state-of-the-art perfor-
mance on a large suite of sentence-level and
token-level tasks, outperforming many sys-
tems with task-specific architectures.
• BERT advances the state-of-the-art for eleven
NLP tasks. We also report extensive abla-
tions of BERT, demonstrating that the bidi-
rectional nature of our model is the single
most important new contribution. The code
and pre-trained model will be available at goo.gl/language/bert (to be released before the end of October 2018).
2 Related Work
There is a long history of pre-training general lan-
guage representations, and we briefly review the
most popular approaches in this section.
2.1 Feature-based Approaches
Learning widely applicable representations of
words has been an active area of research for
decades, including non-neural (Brown et al., 1992;
Ando and Zhang, 2005; Blitzer et al., 2006) and
neural (Collobert and Weston, 2008; Mikolov
et al., 2013; Pennington et al., 2014) methods. Pre-
trained word embeddings are considered to be an
integral part of modern NLP systems, offering sig-
nificant improvements over embeddings learned
from scratch (Turian et al., 2010).
These approaches have been generalized to
coarser granularities, such as sentence embed-
dings (Kiros et al., 2015; Logeswaran and Lee,
2018) or paragraph embeddings (Le and Mikolov,
2014). As with traditional word embeddings,
these learned representations are also typically
used as features in a downstream model.
ELMo (Peters et al., 2017) generalizes tradi-
tional word embedding research along a differ-
ent dimension. They propose to extract context-
sensitive features from a language model. When
integrating contextual word embeddings with ex-
isting task-specific architectures, ELMo advances
the state-of-the-art for several major NLP bench-
marks (Peters et al., 2018) including question an-
swering (Rajpurkar et al., 2016) on SQuAD, sen-
timent analysis (Socher et al., 2013), and named
entity recognition (Tjong Kim Sang and De Meul-
der, 2003).
2.2 Fine-tuning Approaches
A recent trend in transfer learning from language
models (LMs) is to pre-train some model ar-
chitecture on a LM objective before fine-tuning
that same model for a supervised downstream
task (Dai and Le, 2015; Howard and Ruder, 2018;
Radford et al., 2018). The advantage of these ap-
proaches is that few parameters need to be learned
from scratch. At least partly due to this advantage,
OpenAI GPT (Radford et al., 2018) achieved previous state-of-the-art results on many sentence-
level tasks from the GLUE benchmark (Wang
et al., 2018).
2.3 Transfer Learning from Supervised Data
While the advantage of unsupervised pre-training
is that there is a nearly unlimited amount of data
available, there has also been work showing ef-
fective transfer from supervised tasks with large
datasets, such as natural language inference (Con-
neau et al., 2017) and machine translation (Mc-
Cann et al., 2017). Outside of NLP, computer
vision research has also demonstrated the impor-
tance of transfer learning from large pre-trained
models, where an effective recipe is to fine-tune
Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers.
models pre-trained on ImageNet (Deng et al.,
2009; Yosinski et al., 2014).
3 BERT
We introduce BERT and its detailed implementa-
tion in this section. We first cover the model ar-
chitecture and the input representation for BERT.
We then introduce the pre-training tasks, the core
innovation in this paper, in Section 3.3. The
pre-training and fine-tuning procedures are detailed in Sections 3.4 and 3.5, respec-
tively. Finally, the differences between BERT and
OpenAI GPT are discussed in Section 3.6.
3.1 Model Architecture
BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library (https://github.com/tensorflow/tensor2tensor). Because the use of Transformers has become ubiquitous recently and our implementation is effectively identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer” (http://nlp.seas.harvard.edu/2018/04/03/attention.html).
In this work, we denote the number of layers
(i.e., Transformer blocks) as L, the hidden size as
H, and the number of self-attention heads as A.
In all cases we set the feed-forward/filter size to
be 4H, i.e., 3072 when H = 768 and 4096 when H = 1024. We primarily report results on two
model sizes:
• BERT_BASE: L=12, H=768, A=12, Total Parameters=110M
• BERT_LARGE: L=24, H=1024, A=16, Total Parameters=340M
BERT_BASE was chosen to have an identical
model size as OpenAI GPT for comparison pur-
poses. Critically, however, the BERT Transformer
uses bidirectional self-attention, while the GPT
Transformer uses constrained self-attention where
every token can only attend to context to its left.
We note that in the literature the bidirectional
Transformer is often referred to as a “Transformer
encoder” while the left-context-only version is re-
ferred to as a “Transformer decoder” since it can
be used for text generation. The comparisons be-
tween BERT, OpenAI GPT and ELMo are shown
visually in Figure 1.
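To keep the notation above straight, the following sketch (our own illustration; the dataclass and field names are not from the paper) records the two reported configurations and derives the feed-forward/filter size as 4H.

from dataclasses import dataclass

@dataclass
class BertSize:
    num_layers: int            # L: number of Transformer blocks
    hidden_size: int           # H
    num_attention_heads: int   # A

    @property
    def feed_forward_size(self) -> int:
        # The paper sets the feed-forward/filter size to 4H.
        return 4 * self.hidden_size

BERT_BASE = BertSize(num_layers=12, hidden_size=768, num_attention_heads=12)    # ~110M parameters
BERT_LARGE = BertSize(num_layers=24, hidden_size=1024, num_attention_heads=16)  # ~340M parameters

print(BERT_BASE.feed_forward_size, BERT_LARGE.feed_forward_size)  # 3072 4096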
3.2 Input Representation
Our input representation is able to unambiguously represent either a single text sentence or a pair of text sentences (e.g., [Question, Answer]) in one token sequence.[4] For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visual representation of our input representation is given in Figure 2.
The specifics are:
• We use WordPiece embeddings (Wu et al.,
2016) with a 30,000 token vocabulary. We
denote split word pieces with ##.
• We use learned positional embeddings with
supported sequence lengths up to 512 tokens.
[4] Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.
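To illustrate the construction described in Section 3.2, here is a minimal sketch that sums token, segment, and position embeddings for a toy two-sentence input. The tiny hidden size, the random initialization, and the specific token ids are placeholder assumptions; only the vocabulary size (30,000), the maximum length (512), and the sum-of-three-embeddings structure come from the text.

import numpy as np

rng = np.random.default_rng(0)
H, VOCAB_SIZE, MAX_LEN, NUM_SEGMENTS = 8, 30000, 512, 2  # H kept tiny for readability

# Learned lookup tables (randomly initialized placeholders standing in for trained weights).
token_emb = rng.normal(size=(VOCAB_SIZE, H))
segment_emb = rng.normal(size=(NUM_SEGMENTS, H))
position_emb = rng.normal(size=(MAX_LEN, H))

def embed(token_ids, segment_ids):
    """Input representation = token + segment + position embedding at each position."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# Segment ids mark which sentence each token belongs to (0 = first, 1 = second).
token_ids = np.array([101, 2023, 2003, 102, 2019, 6251, 102])  # hypothetical WordPiece ids
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])
print(embed(token_ids, segment_ids).shape)  # (7, 8): one H-dimensional vector per token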
[The remaining 13 pages of the paper are not included in this excerpt.]