BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout}@google.com
Abstract
We introduce a new language representa-
tion model called BERT, which stands for
Bidirectional Encoder Representations from
Transformers. Unlike recent language repre-
sentation models (Peters et al., 2018; Radford
et al., 2018), BERT is designed to pre-train
deep bidirectional representations by jointly
conditioning on both left and right context in
all layers. As a result, the pre-trained BERT
representations can be fine-tuned with just one
additional output layer to create state-of-the-
art models for a wide range of tasks, such
as question answering and language inference,
without substantial task-specific architecture
modifications.
BERT is conceptually simple and empirically
powerful. It obtains new state-of-the-art re-
sults on eleven natural language processing
tasks, including pushing the GLUE bench-
mark to 80.4% (7.6% absolute improvement),
MultiNLI accuracy to 86.7% (5.6% abso-
lute improvement) and the SQuAD v1.1 ques-
tion answering Test F1 to 93.2 (1.5 absolute
improvement), outperforming human perfor-
mance by 2.0.
1 Introduction
Language model pre-training has been shown to be effective for improving many natural language pro-
cessing tasks (Dai and Le, 2015; Peters et al.,
2017, 2018; Radford et al., 2018; Howard and
Ruder, 2018). These tasks include sentence-level
tasks such as natural language inference (Bow-
man et al., 2015; Williams et al., 2018) and para-
phrasing (Dolan and Brockett, 2005), which aim
to predict the relationships between sentences by
analyzing them holistically, as well as token-level
tasks such as named entity recognition (Tjong
Kim Sang and De Meulder, 2003) and SQuAD
question answering (Rajpurkar et al., 2016), where
models are required to produce fine-grained output
at the token-level.
There are two existing strategies for apply-
ing pre-trained language representations to down-
stream tasks: feature-based and fine-tuning. The
feature-based approach, such as ELMo (Peters
et al., 2018), uses task-specific architectures that
include the pre-trained representations as addi-
tional features. The fine-tuning approach, such as
the Generative Pre-trained Transformer (OpenAI
GPT) (Radford et al., 2018), introduces minimal
task-specific parameters, and is trained on the
downstream tasks by simply fine-tuning the pre-
trained parameters. In previous work, both ap-
proaches share the same objective function dur-
ing pre-training, where they use unidirectional lan-
guage models to learn general language represen-
tations.
We argue that current techniques severely re-
strict the power of the pre-trained representations,
especially for the fine-tuning approaches. The ma-
jor limitation is that standard language models are
unidirectional, and this limits the choice of archi-
tectures that can be used during pre-training. For
example, in OpenAI GPT, the authors use a left-
to-right architecture, where every token can only
attend to previous tokens in the self-attention
layers of the Transformer (Vaswani et al., 2017).
Such restrictions are sub-optimal for sentence-
level tasks, and could be devastating when ap-
plying fine-tuning based approaches to token-level
tasks such as SQuAD question answering (Ra-
jpurkar et al., 2016), where it is crucial to incor-
porate context from both directions.
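To make this restriction concrete, here is a minimal sketch, assuming NumPy and a helper name of our own choosing (make_attention_mask is not from the paper), of the difference between a left-to-right (causal) attention mask, where position i may only attend to positions j <= i, and the fully visible mask a bidirectional encoder can use.

import numpy as np

def make_attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Return a (seq_len, seq_len) 0/1 matrix; entry [i, j] = 1 means token i may attend to token j."""
    if causal:
        # Left-to-right (GPT-style): token i sees only tokens j <= i.
        return np.tril(np.ones((seq_len, seq_len), dtype=np.int32))
    # Bidirectional (BERT-style): every token sees the full left and right context.
    return np.ones((seq_len, seq_len), dtype=np.int32)

print(make_attention_mask(4, causal=True))   # lower-triangular: no access to future tokens
print(make_attention_mask(4, causal=False))  # all ones: full bidirectional context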
In this paper, we improve the fine-tuning based
approaches by proposing BERT: Bidirectional
Encoder Representations from Transformers.
BERT addresses the previously mentioned uni-
directional constraints by proposing a new
pre-training objective: the “masked language
model” (MLM), inspired by the Cloze task (Tay-
lor, 1953). The masked language model randomly
masks some of the tokens from the input, and the
objective is to predict the original vocabulary id of
the masked word based only on its context. Un-
like left-to-right language model pre-training, the
MLM objective allows the representation to fuse
the left and the right context, which allows us
to pre-train a deep bidirectional Transformer. In
addition to the masked language model, we also
introduce a “next sentence prediction” task that
jointly pre-trains text-pair representations.
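As a rough illustration of the masked language model objective, the sketch below randomly replaces input tokens with a [MASK] symbol and records the original tokens as prediction targets. The 15% masking probability and the helper mask_tokens are illustrative assumptions; the paper's later sections (beyond this excerpt) describe the exact masking strategy, which also sometimes keeps or randomly replaces the selected tokens instead of always inserting [MASK].

import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens; return (masked_tokens, labels) where labels[i] is the
    original token the model must predict at a masked position, else None."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the input
            labels.append(tok)    # the model must recover it from left and right context
        else:
            masked.append(tok)
            labels.append(None)   # no prediction loss at unmasked positions
    return masked, labels

print(mask_tokens("the man went to the store to buy a gallon of milk".split()))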
The contributions of our paper are as follows:
• We demonstrate the importance of bidirec-
tional pre-training for language representa-
tions. Unlike Radford et al. (2018), which
uses unidirectional language models for pre-
training, BERT uses masked language mod-
els to enable pre-trained deep bidirectional
representations. This is also in contrast to
Peters et al. (2018), which uses a shallow
concatenation of independently trained left-
to-right and right-to-left LMs.
• We show that pre-trained representations
eliminate the need for many heavily-
engineered task-specific architectures. BERT
is the first fine-tuning based representation
model that achieves state-of-the-art perfor-
mance on a large suite of sentence-level and
token-level tasks, outperforming many sys-
tems with task-specific architectures.
• BERT advances the state-of-the-art for eleven
NLP tasks. We also report extensive abla-
tions of BERT, demonstrating that the bidi-
rectional nature of our model is the single
most important new contribution. The code
and pre-trained model will be available at goo.gl/language/bert (to be released before the end of October 2018).
2 Related Work
There is a long history of pre-training general lan-
guage representations, and we briefly review the
most popular approaches in this section.
2.1 Feature-based Approaches
Learning widely applicable representations of
words has been an active area of research for
decades, including non-neural (Brown et al., 1992;
Ando and Zhang, 2005; Blitzer et al., 2006) and
neural (Collobert and Weston, 2008; Mikolov
et al., 2013; Pennington et al., 2014) methods. Pre-
trained word embeddings are considered to be an
integral part of modern NLP systems, offering sig-
nificant improvements over embeddings learned
from scratch (Turian et al., 2010).
These approaches have been generalized to
coarser granularities, such as sentence embed-
dings (Kiros et al., 2015; Logeswaran and Lee,
2018) or paragraph embeddings (Le and Mikolov,
2014). As with traditional word embeddings,
these learned representations are also typically
used as features in a downstream model.
ELMo (Peters et al., 2017) generalizes tradi-
tional word embedding research along a differ-
ent dimension. They propose to extract context-
sensitive features from a language model. When
integrating contextual word embeddings with ex-
isting task-specific architectures, ELMo advances
the state-of-the-art for several major NLP bench-
marks (Peters et al., 2018) including question an-
swering (Rajpurkar et al., 2016) on SQuAD, sen-
timent analysis (Socher et al., 2013), and named
entity recognition (Tjong Kim Sang and De Meul-
der, 2003).
2.2 Fine-tuning Approaches
A recent trend in transfer learning from language
models (LMs) is to pre-train some model ar-
chitecture on a LM objective before fine-tuning
that same model for a supervised downstream
task (Dai and Le, 2015; Howard and Ruder, 2018;
Radford et al., 2018). The advantage of these ap-
proaches is that few parameters need to be learned
from scratch. At least partly due to this advantage,
OpenAI GPT (Radford et al., 2018) achieved previous state-of-the-art results on many sentence-
level tasks from the GLUE benchmark (Wang
et al., 2018).
2.3 Transfer Learning from Supervised Data
While the advantage of unsupervised pre-training
is that there is a nearly unlimited amount of data
available, there has also been work showing ef-
fective transfer from supervised tasks with large
datasets, such as natural language inference (Con-
neau et al., 2017) and machine translation (Mc-
Cann et al., 2017). Outside of NLP, computer
vision research has also demonstrated the impor-
tance of transfer learning from large pre-trained
models, where an effective recipe is to fine-tune
Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers.
models pre-trained on ImageNet (Deng et al.,
2009; Yosinski et al., 2014).
3 BERT
We introduce BERT and its detailed implementa-
tion in this section. We first cover the model ar-
chitecture and the input representation for BERT.
We then introduce the pre-training tasks, the core
innovation in this paper, in Section 3.3. The
pre-training and fine-tuning procedures are detailed in Sections 3.4 and 3.5, respec-
tively. Finally, the differences between BERT and
OpenAI GPT are discussed in Section 3.6.
3.1 Model Architecture
BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library (https://github.com/tensorflow/tensor2tensor). Because the use of Transformers has become ubiquitous recently and our implementation is effectively identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer” (http://nlp.seas.harvard.edu/2018/04/03/attention.html).
In this work, we denote the number of layers
(i.e., Transformer blocks) as L, the hidden size as
H, and the number of self-attention heads as A.
In all cases we set the feed-forward/filter size to
be 4H, i.e., 3072 when H = 768 and 4096 when H = 1024. We primarily report results on two
model sizes:
• BERT_BASE: L=12, H=768, A=12, Total Parameters=110M
• BERT_LARGE: L=24, H=1024, A=16, Total Parameters=340M
BERT_BASE was chosen to have an identical
model size as OpenAI GPT for comparison pur-
poses. Critically, however, the BERT Transformer
uses bidirectional self-attention, while the GPT
Transformer uses constrained self-attention where
every token can only attend to context to its left.
We note that in the literature the bidirectional
Transformer is often referred to as a “Transformer
encoder” while the left-context-only version is re-
ferred to as a “Transformer decoder” since it can
be used for text generation. The comparisons be-
tween BERT, OpenAI GPT and ELMo are shown
visually in Figure 1.
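To keep the notation above straight, the following sketch (our own illustration; the dataclass and field names are not from the paper) records the two reported configurations and derives the feed-forward/filter size as 4H.

from dataclasses import dataclass

@dataclass
class BertSize:
    num_layers: int            # L: number of Transformer blocks
    hidden_size: int           # H
    num_attention_heads: int   # A

    @property
    def feed_forward_size(self) -> int:
        # The paper sets the feed-forward/filter size to 4H.
        return 4 * self.hidden_size

BERT_BASE = BertSize(num_layers=12, hidden_size=768, num_attention_heads=12)    # ~110M parameters
BERT_LARGE = BertSize(num_layers=24, hidden_size=1024, num_attention_heads=16)  # ~340M parameters

print(BERT_BASE.feed_forward_size, BERT_LARGE.feed_forward_size)  # 3072 4096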
3.2 Input Representation
Our input representation is able to unambiguously represent either a single text sentence or a pair of text sentences (e.g., [Question, Answer]) in one token sequence.[4] For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visual representation of our input representation is given in Figure 2.
The specifics are:
• We use WordPiece embeddings (Wu et al.,
2016) with a 30,000 token vocabulary. We
denote split word pieces with ##.
• We use learned positional embeddings with
supported sequence lengths up to 512 tokens.
[4] Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.
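To illustrate the construction described in Section 3.2, here is a minimal sketch that sums token, segment, and position embeddings for a toy two-sentence input. The tiny hidden size, the random initialization, and the specific token ids are placeholder assumptions; only the vocabulary size (30,000), the maximum length (512), and the sum-of-three-embeddings structure come from the text.

import numpy as np

rng = np.random.default_rng(0)
H, VOCAB_SIZE, MAX_LEN, NUM_SEGMENTS = 8, 30000, 512, 2  # H kept tiny for readability

# Learned lookup tables (randomly initialized placeholders standing in for trained weights).
token_emb = rng.normal(size=(VOCAB_SIZE, H))
segment_emb = rng.normal(size=(NUM_SEGMENTS, H))
position_emb = rng.normal(size=(MAX_LEN, H))

def embed(token_ids, segment_ids):
    """Input representation = token + segment + position embedding at each position."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# Segment ids mark which sentence each token belongs to (0 = first, 1 = second).
token_ids = np.array([101, 2023, 2003, 102, 2019, 6251, 102])  # hypothetical WordPiece ids
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])
print(embed(token_ids, segment_ids).shape)  # (7, 8): one H-dimensional vector per token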
[The remaining 13 pages of the paper are not included in this excerpt.]