ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, Haifeng Wang
Baidu Inc., Beijing, China
{sunyu02, wangshuohuan, tianhao, wu hua, wanghaifeng}@baidu.com
Abstract
Recently, pre-trained models have achieved state-of-the-art results in various language understanding tasks. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. However, besides co-occurring information, there exists other valuable lexical, syntactic and semantic information in training corpora, such as named entities, semantic closeness and discourse relations. In order to extract the lexical, syntactic and semantic information from training corpora, we propose a continual pre-training framework named ERNIE 2.0, which incrementally builds pre-training tasks and then learns pre-trained models on these constructed tasks via continual multi-task learning. Based on this framework, we construct several tasks and train the ERNIE 2.0 model to capture lexical, syntactic and semantic aspects of information in the training data. Experimental results demonstrate that the ERNIE 2.0 model outperforms BERT and XLNet on 16 tasks, including English tasks on the GLUE benchmark and several similar tasks in Chinese. The source code and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.
Introduction
Pre-trained language representations such as ELMo (Peters et al. 2018), OpenAI GPT (Radford et al. 2018), BERT (Devlin et al. 2018), ERNIE 1.0 (Sun et al. 2019)¹ and XLNet (Yang et al. 2019) have been proven to be effective for improving the performance of various natural language understanding tasks, including sentiment classification (Socher et al. 2013), natural language inference (Bowman et al. 2015), named entity recognition (Sang and De Meulder 2003) and so on.
Generally, pre-training procedures train the model based on the co-occurrence of words and sentences. In fact, there is other lexical, syntactic and semantic information worth examining in training corpora beyond co-occurrence. For example, named entities such as person names, location names, and organization names may contain conceptual information. Information like sentence order and sentence proximity enables the models to learn structure-aware representations, and semantic similarity at the document level or discourse relations among sentences allow the models to learn semantic-aware representations. In order to discover all valuable information in training corpora, be it lexical, syntactic or semantic representations, we propose a continual pre-training framework named ERNIE 2.0, which can incrementally build and train a large variety of pre-training tasks through continual multi-task learning.

¹ To distinguish the ERNIE 2.0 framework from the ERNIE model, the latter is referred to as ERNIE 1.0 (Sun et al. 2019).
Our ERNIE framework supports the continual introduction of various customized tasks, which is realized through continual multi-task learning. When given one or more new tasks, the continual multi-task learning method trains the newly introduced tasks together with the original tasks simultaneously and efficiently, without forgetting previously learned knowledge. In this way, our framework can incrementally train the distributed representations based on the parameters it has previously learned. Moreover, in this framework all the tasks share the same encoding networks, thus making it possible to encode lexical, syntactic and semantic information across different tasks.
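To make this mechanism concrete, the following is a minimal, illustrative sketch of continual multi-task learning with a shared encoder and one lightweight head per pre-training task, where each newly introduced task is trained jointly with all previously added tasks. The module names, head design, and round-robin schedule are our own simplifying assumptions (written in PyTorch), not the released PaddlePaddle implementation.

import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    # Stand-in for the Transformer encoder shared by all pre-training tasks.
    def __init__(self, vocab_size=30000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))

class MultiTaskModel(nn.Module):
    # Shared encoder plus one small output head per task (hypothetical head design).
    def __init__(self, hidden=256):
        super().__init__()
        self.hidden = hidden
        self.encoder = SharedEncoder(hidden=hidden)
        self.heads = nn.ModuleDict()

    def add_task(self, name, num_labels):
        # A newly introduced task only adds a head; the encoder stays shared.
        self.heads[name] = nn.Linear(self.hidden, num_labels)

    def forward(self, task, token_ids):
        hidden_states = self.encoder(token_ids)       # (batch, seq, hidden)
        return self.heads[task](hidden_states[:, 0])  # classify from the first token

def continual_multitask_train(model, task_batches, steps=100):
    # Jointly optimize every task added so far (new and original), so earlier
    # knowledge keeps being refreshed instead of being forgotten.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        for task, (tokens, labels) in task_batches.items():  # simple round-robin
            optimizer.zero_grad()
            loss_fn(model(task, tokens), labels).backward()
            optimizer.step()

When a new pre-training stage begins, one would call add_task for the new objective and rerun continual_multitask_train over batches from both the new task and all earlier tasks, starting from the previously trained parameters.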
In summary, our contributions are as follows:
• We propose a continual pre-training framework, ERNIE 2.0, which efficiently supports customized training tasks and continual multi-task learning in an incremental way.
• We construct three kinds of unsupervised language processing tasks to verify the effectiveness of the proposed framework. Experimental results demonstrate that ERNIE 2.0 achieves significant improvements over BERT and XLNet on 16 tasks, including the English GLUE benchmark and several Chinese tasks.
• Our fine-tuning code of ERNIE 2.0 and the models pre-trained on English corpora are available at https://github.com/PaddlePaddle/ERNIE.
Related Work
Unsupervised Learning for Language Representation
It is effective to learn general language representation by
pre-training a language model with a large amount of unan-