Under review
TINYBERT: DISTILLING BERT FOR NATURAL LANGUAGE UNDERSTANDING

Xiaoqi Jiao¹∗†, Yichun Yin²∗, Lifeng Shang², Xin Jiang², Xiao Chen², Linlin Li³, Fang Wang¹ and Qun Liu²

¹ Huazhong University of Science and Technology
² Huawei Noah’s Ark Lab
³ Huawei Technologies Co., Ltd.

arXiv:1909.10351v2 [cs.CL] 24 Sep 2019
ABSTRACT
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to execute them effectively on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be well transferred to a small “student” TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT.
TinyBERT¹ is empirically effective and achieves results comparable to BERT on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT is also significantly better than state-of-the-art baselines on BERT distillation, with only ∼28% of their parameters and ∼31% of their inference time.
1 INTRODUCTION
Pre-training language models and then fine-tuning them on downstream tasks has become a new paradigm for
natural language processing (NLP). Pre-trained language models (PLMs), such as BERT (Devlin
et al., 2018), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019) and SpanBERT (Joshi et al.,
2019), have achieved great success in many NLP tasks (e.g., the GLUE benchmark (Wang et al.,
2018) and the challenging multi-hop reasoning task (Ding et al., 2019)). However, PLMs usually
have an extremely large number of parameters and require long inference times, which makes them difficult to deploy on edge devices such as mobile phones. Moreover, recent studies (Wu et al., 2019a) also demonstrate that there is redundancy in PLMs. It is therefore both crucial and feasible to reduce the computational overhead and model storage of PLMs while preserving their performance.
Many model compression techniques (Han et al., 2015a) have been proposed to accelerate deep model inference and reduce model size while maintaining accuracy. The most commonly
used techniques include quantization (Gong et al., 2014), weights pruning (Han et al., 2015b), and
knowledge distillation (KD) (Romero et al., 2014). In this paper we focus on knowledge distillation,
an idea proposed by Hinton et al. (2015) in a teacher-student framework. KD aims to transfer the
knowledge embedded in a large teacher network to a small student network. The student network is
trained to reproduce the behaviors of the teacher network. Based on this framework, we propose a
novel distillation method specifically for Transformer-based models (Vaswani et al., 2017), and use
BERT as an example to investigate the KD methods for large scale PLMs.
∗ Authors contributed equally.
† This work was done while Xiaoqi Jiao was an intern at Huawei Noah’s Ark Lab.
¹ Our code and models will be made publicly available.
KD has been extensively studied in NLP (Kim & Rush, 2016; Hu et al., 2018), while designing
KD methods for BERT has been less explored. The pre-training-then-fine-tuning paradigm first pre-trains BERT on a large-scale unsupervised text corpus and then fine-tunes it on a task-specific dataset, which greatly increases the difficulty of BERT distillation: an effective KD strategy is required for both stages. To build a competitive TinyBERT, we first propose a new Transformer distillation method to distill the knowledge embedded in the teacher BERT. Specifically, we design several loss functions to fit different representations from BERT layers: 1) the output of the embedding layer; 2) the hidden states and attention matrices derived from the Transformer layers; 3) the logits output by the prediction layer. The attention-based fitting is inspired by the recent findings (Clark et al., 2019) that the attention weights learned by BERT can capture substantial linguistic knowledge, which suggests that this linguistic knowledge can be well transferred from the teacher BERT to the student TinyBERT. However, this knowledge is ignored in existing KD methods for BERT, such as Distilled BiLSTM_SOFT (Tang et al., 2019), BERT-PKD (Sun et al., 2019) and DistilBERT². Then,
we propose a novel two-stage learning framework consisting of general distillation and task-specific distillation. At the general distillation stage, the original BERT without fine-tuning acts as the teacher model. The student TinyBERT learns to mimic the teacher’s behavior by executing the proposed Transformer distillation on a large-scale corpus from the general domain, yielding a general TinyBERT that can be fine-tuned for various downstream tasks. At the task-specific distillation stage, we perform data augmentation to provide more task-related material for teacher-student learning, and then re-execute the Transformer distillation on the augmented data. Both stages are essential for improving the performance and generalization capability of TinyBERT. A detailed comparison between the proposed method and other existing methods is summarized in Table 1. The Transformer distillation and the two-stage learning framework are the two key ideas of the proposed method.
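To make the layer-wise fitting above concrete, the following is a minimal PyTorch sketch of the kinds of losses involved, assuming mean-squared error for the embedding outputs, hidden states and attention matrices (with a learned projection bridging the student/teacher dimension gap) and a temperature-scaled soft cross-entropy for the prediction logits; the function names, the projection and the temperature are illustrative assumptions rather than the exact formulation of the proposed method.

```python
# Illustrative sketch only: the loss forms, the projection and the temperature
# are assumptions, not the exact formulation of the proposed method.
import torch.nn.functional as F

def embedding_loss(student_embed, teacher_embed, proj):
    # Fit the embedding-layer outputs; proj maps the (smaller) student
    # dimension into the teacher's dimension.
    return F.mse_loss(student_embed @ proj, teacher_embed)

def hidden_loss(student_hidden, teacher_hidden, proj):
    # Fit the hidden states of a Transformer layer, again via a projection.
    return F.mse_loss(student_hidden @ proj, teacher_hidden)

def attention_loss(student_attn, teacher_attn):
    # Attention matrices share the (heads x seq_len x seq_len) shape,
    # so no projection is needed.
    return F.mse_loss(student_attn, teacher_attn)

def prediction_loss(student_logits, teacher_logits, temperature=1.0):
    # Soft cross-entropy between teacher and student output distributions.
    t = temperature
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    student_logprob = F.log_softmax(student_logits / t, dim=-1)
    return -(teacher_prob * student_logprob).sum(dim=-1).mean()
```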
Table 1: A summary of KD methods for BERT. Abbreviations: INIT (initializing student BERT with some layers of the pre-trained teacher BERT), DA (conducting data augmentation for the task-specific training data). Embd, Attn, Hidn, and Pred represent the knowledge from embedding layers, attention matrices, hidden states, and final prediction layers, respectively.

KD Methods            |      KD at Pre-training Stage      |     KD at Fine-tuning Stage
                      | INIT | Embd | Attn | Hidn | Pred   | Embd | Attn | Hidn | Pred | DA
Distilled BiLSTM_SOFT |      |      |      |      |        |      |      |      |  X   | X
BERT-PKD              |  X   |      |      |      |        |      |      |  X³  |  X   |
DistilBERT            |  X   |      |      |      |  X⁴    |      |      |      |  X   |
TinyBERT (our method) |      |  X   |  X   |  X   |        |  X   |  X   |  X   |  X   | X
The main contributions of this work are as follows: 1) We propose a new Transformer distillation method that encourages the linguistic knowledge encoded in the teacher BERT to be well transferred to TinyBERT. 2) We propose a novel two-stage learning framework that performs the proposed Transformer distillation at both the pre-training and fine-tuning stages, which ensures that TinyBERT can capture both the general-domain and the task-specific knowledge of the teacher BERT. 3) We show experimentally that our TinyBERT achieves results comparable to the teacher BERT on GLUE tasks, while having far fewer parameters (∼13.3%) and less inference time (∼10.6%), and significantly outperforms other state-of-the-art baselines on BERT distillation.
2 PRELIMINARIES
We first describe the formulations of the Transformer (Vaswani et al., 2017) and Knowledge Distillation (Hinton et al., 2015). Our proposed Transformer distillation is a KD method specially designed for Transformer-based models.
2.1 TRANSFORMER LAYER
Most recent pre-trained language models (e.g., BERT, XLNet and RoBERTa) are built with Transformer layers, which can capture long-term dependencies between input tokens via the self-attention mechanism. Specifically, a standard Transformer layer includes two main sub-layers: multi-head attention (MHA) and a fully connected feed-forward network (FFN).

² https://medium.com/huggingface/distilbert-8cf3380435b5
³ The student learns from the [CLS] (a special classification token of BERT) hidden states of the teacher.
⁴ The output of pre-training tasks (such as dynamic masking) is used as the supervision signal.
Multi-Head Attention (MHA). The calculation of the attention function depends on three components: queries, keys and values, denoted as matrices Q, K and V respectively. The attention function can be formulated as follows:

A = \frac{Q K^{T}}{\sqrt{d_k}},   (1)

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(A) V,   (2)

where d_k is the dimension of the keys and acts as a scaling factor, and A is the attention matrix calculated from the compatibility of Q and K by a dot-product operation. The final output is a weighted sum of the values V, with the weights computed by applying the softmax(·) operation to the matrix A. The attention matrices A of BERT can capture substantial linguistic knowledge, and they play an essential role in our proposed distillation method.
Multi-head attention is defined by concatenating the attention heads from different representation subspaces as follows:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W,   (3)

where h is the number of attention heads, head_i denotes the i-th attention head, which is calculated by the Attention() function with inputs from different representation subspaces, and the matrix W acts as a linear transformation.
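For reference, equations (1)–(3) can be written as a short PyTorch sketch; the explicit per-head slicing of Q, K and V and the single output matrix W are simplifications made only for illustration.

```python
# A sketch of Eqs. (1)-(3); head slicing and the single output matrix W are
# simplifications for illustration.
import math
import torch

def attention(Q, K, V):
    # Eq. (1): scaled dot-product compatibility scores.
    d_k = K.size(-1)
    A = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Eq. (2): weighted sum of values with softmax-normalized weights.
    return torch.softmax(A, dim=-1) @ V, A

def multi_head(Q, K, V, W, h):
    # Eq. (3): run h heads on per-head slices of Q, K, V, concatenate the
    # results, then apply the linear transformation W.
    d_h = Q.size(-1) // h
    heads = []
    for i in range(h):
        s = slice(i * d_h, (i + 1) * d_h)
        head_i, _ = attention(Q[..., s], K[..., s], V[..., s])
        heads.append(head_i)
    return torch.cat(heads, dim=-1) @ W
```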
Position-wise Feed-Forward Network (FFN). The Transformer layer also contains a fully connected feed-forward network, which is formulated as follows:

\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2.   (4)

The FFN consists of two linear transformations with a ReLU activation in between.
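Equation (4) maps directly onto code; the sketch below simply mirrors the formula, with the weight shapes left to the caller.

```python
import torch

def ffn(x, W1, b1, W2, b2):
    # Eq. (4): two linear transformations with a ReLU activation in between.
    return torch.relu(x @ W1 + b1) @ W2 + b2
```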
2.2 KNOWLEDGE DISTILLATION
KD aims to transfer the knowledge of a large teacher network T to a small student network S. The student network is trained to mimic the behaviors of the teacher network. Let f^{T} and f^{S} represent the behavior functions of the teacher and student networks, respectively. A behavior function transforms network inputs into some informative representations, and it can be defined as the output of any layer of the network. In the context of Transformer distillation, the output of the MHA layer or the FFN layer, or some intermediate representation (e.g., the attention matrix A), can be used as the behavior
function. Formally, KD can be modeled as minimizing the following objective function:
\mathcal{L}_{\mathrm{KD}} = \sum_{x \in \mathcal{X}} L\big(f^{S}(x), f^{T}(x)\big),   (5)
where L(·) is a loss function that evaluates the difference between the teacher and student networks, x is the text input and \mathcal{X} denotes the training dataset. Thus the key research problem is how to define effective behavior functions and loss functions. Different from previous KD methods, we also need to
consider how to perform KD at the pre-training stage of BERT, which further increases the difficulty
of KD for BERT.
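As a generic illustration of Eq. (5), the sketch below sums a per-example loss between the chosen behavior functions of the student and teacher over a dataset; leaving f^S, f^T and L as arguments, and treating the teacher outputs as fixed targets, are assumptions of this sketch.

```python
import torch

def kd_objective(dataset, f_student, f_teacher, loss_fn):
    # Eq. (5): sum the loss between student and teacher behaviors over the data.
    total = torch.tensor(0.0)
    for x in dataset:
        with torch.no_grad():  # teacher behaviors serve as fixed targets
            target = f_teacher(x)
        total = total + loss_fn(f_student(x), target)
    return total
```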
3 METHOD
In this section, we propose a novel distillation method for Transformer-based models and then present the two-stage learning framework of TinyBERT.
3.1 TRANSFORMER DISTILLATION
The proposed Transformer distillation is a specially designed KD method for Transformer networks.
Figure 1 displays an overview of the proposed KD method. In this work, both the student and teacher