Under review
TINYBERT: DISTILLING BERT FOR NATURAL LANGUAGE UNDERSTANDING

Xiaoqi Jiao¹∗†, Yichun Yin²∗, Lifeng Shang², Xin Jiang², Xiao Chen², Linlin Li³, Fang Wang¹ and Qun Liu²

¹ Huazhong University of Science and Technology
² Huawei Noah’s Ark Lab
³ Huawei Technologies Co., Ltd.

arXiv:1909.10351v2 [cs.CL] 24 Sep 2019
ABSTRACT
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to execute them effectively on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be well transferred to a small “student” TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT.
TinyBERT¹ is empirically effective and achieves results comparable to BERT on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT is also significantly better than state-of-the-art baselines on BERT distillation, with only ∼28% of their parameters and ∼31% of their inference time.
1 INTRODUCTION
Pre-training language models and then fine-tuning them on downstream tasks has become a new paradigm for
natural language processing (NLP). Pre-trained language models (PLMs), such as BERT (Devlin
et al., 2018), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019) and SpanBERT (Joshi et al.,
2019), have achieved great success in many NLP tasks (e.g., the GLUE benchmark (Wang et al.,
2018) and the challenging multi-hop reasoning task (Ding et al., 2019)). However, PLMs usually
have an extremely large number of parameters and require long inference times, which makes them difficult to deploy on edge devices such as mobile phones. Moreover, recent studies (Wu et al., 2019a) also demonstrate that there is redundancy in PLMs. It is therefore both crucial and feasible to reduce the computational overhead and model storage of PLMs while preserving their performance.
Many model compression techniques (Han et al., 2015a) have been proposed to accelerate deep model inference and reduce model size while maintaining accuracy. The most commonly
used techniques include quantization (Gong et al., 2014), weights pruning (Han et al., 2015b), and
knowledge distillation (KD) (Romero et al., 2014). In this paper we focus on knowledge distillation,
an idea proposed by Hinton et al. (2015) in a teacher-student framework. KD aims to transfer the
knowledge embedded in a large teacher network to a small student network. The student network is
trained to reproduce the behaviors of the teacher network. Based on this framework, we propose a
novel distillation method specifically for Transformer-based models (Vaswani et al., 2017), and use
BERT as an example to investigate the KD methods for large scale PLMs.
∗ Authors contributed equally.
† This work was done while Xiaoqi Jiao was an intern at Huawei Noah’s Ark Lab.
¹ Our code and models will be made publicly available.
KD has been extensively studied in NLP (Kim & Rush, 2016; Hu et al., 2018), while designing
KD methods for BERT has been less explored. The pre-training-then-fine-tuning paradigm first pre-trains BERT on a large-scale unsupervised text corpus and then fine-tunes it on a task-specific dataset, which greatly increases the difficulty of BERT distillation: an effective KD strategy is required for both stages. To build a competitive TinyBERT, we first propose a new Transformer distillation method to distill the knowledge embedded in the teacher BERT. Specifically, we design several loss functions to fit different representations from BERT layers: 1) the output of the embedding layer; 2) the hidden states and attention matrices derived from the Transformer layers; 3) the logits output by the prediction layer. The attention-based fitting is inspired by the recent findings (Clark et al., 2019) that the attention weights learned by BERT can capture substantial linguistic knowledge, which suggests that this linguistic knowledge can be well transferred from the teacher BERT to the student TinyBERT. However, this knowledge is ignored in existing KD methods for BERT, such as Distilled BiLSTM_SOFT (Tang et al., 2019), BERT-PKD (Sun et al., 2019) and DistilBERT². Then,
we propose a novel two-stage learning framework consisting of general distillation and task-specific distillation. At the general distillation stage, the original BERT without fine-tuning acts as the teacher model. The student TinyBERT learns to mimic the teacher’s behavior by executing the proposed Transformer distillation on a large-scale corpus from the general domain, yielding a general TinyBERT that can be fine-tuned for various downstream tasks. At the task-specific distillation stage, we perform data augmentation to provide more task-related material for teacher-student learning, and then re-execute the Transformer distillation on the augmented data. Both stages are essential for improving the performance and generalization capability of TinyBERT. A detailed comparison between the proposed method and other existing methods is summarized in Table 1. The Transformer distillation and the two-stage learning framework are the two key ideas of the proposed method.
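To make the layer-wise fitting above concrete, the following is a minimal PyTorch sketch of the kinds of losses involved, assuming mean-squared error for the embedding outputs, hidden states and attention matrices (with a learned projection bridging the student/teacher dimension gap) and a temperature-scaled soft cross-entropy for the prediction logits; the function names, the projection and the temperature are illustrative assumptions rather than the exact formulation of the proposed method.

```python
# Illustrative sketch only: the loss forms, the projection and the temperature
# are assumptions, not the exact formulation of the proposed method.
import torch.nn.functional as F

def embedding_loss(student_embed, teacher_embed, proj):
    # Fit the embedding-layer outputs; proj maps the (smaller) student
    # dimension into the teacher's dimension.
    return F.mse_loss(student_embed @ proj, teacher_embed)

def hidden_loss(student_hidden, teacher_hidden, proj):
    # Fit the hidden states of a Transformer layer, again via a projection.
    return F.mse_loss(student_hidden @ proj, teacher_hidden)

def attention_loss(student_attn, teacher_attn):
    # Attention matrices share the (heads x seq_len x seq_len) shape,
    # so no projection is needed.
    return F.mse_loss(student_attn, teacher_attn)

def prediction_loss(student_logits, teacher_logits, temperature=1.0):
    # Soft cross-entropy between teacher and student output distributions.
    t = temperature
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    student_logprob = F.log_softmax(student_logits / t, dim=-1)
    return -(teacher_prob * student_logprob).sum(dim=-1).mean()
```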
Table 1: A summary of KD methods for BERT. Abbreviations: INIT (initializing student BERT with some layers of the pre-trained teacher BERT), DA (conducting data augmentation for the task-specific training data). Embd, Attn, Hidn, and Pred represent the knowledge from embedding layers, attention matrices, hidden states, and final prediction layers, respectively.

KD Methods            |      KD at Pre-training Stage      |     KD at Fine-tuning Stage
                      | INIT | Embd | Attn | Hidn | Pred   | Embd | Attn | Hidn | Pred | DA
Distilled BiLSTM_SOFT |      |      |      |      |        |      |      |      |  X   | X
BERT-PKD              |  X   |      |      |      |        |      |      |  X³  |  X   |
DistilBERT            |  X   |      |      |      |  X⁴    |      |      |      |  X   |
TinyBERT (our method) |      |  X   |  X   |  X   |        |  X   |  X   |  X   |  X   | X
The main contributions of this work are as follows: 1) We propose a new Transformer distillation method that encourages the linguistic knowledge encoded in the teacher BERT to be well transferred to TinyBERT. 2) We propose a novel two-stage learning framework that performs the proposed Transformer distillation at both the pre-training and fine-tuning stages, which ensures that TinyBERT can capture both the general-domain and the task-specific knowledge of the teacher BERT. 3) We show experimentally that our TinyBERT achieves results comparable to the teacher BERT on GLUE tasks, while having far fewer parameters (∼13.3%) and less inference time (∼10.6%), and significantly outperforms other state-of-the-art baselines on BERT distillation.
2 PRELIMINARIES
We first describe the formulations of the Transformer (Vaswani et al., 2017) and Knowledge Distillation (Hinton et al., 2015). Our proposed Transformer distillation is a KD method specially designed for Transformer-based models.
2.1 TRANSFORMER LAYER
Most recent pre-trained language models (e.g., BERT, XLNet and RoBERTa) are built with Transformer layers, which can capture long-term dependencies between input tokens via the self-attention mechanism. Specifically, a standard Transformer layer includes two main sub-layers: multi-head attention (MHA) and a fully connected feed-forward network (FFN).

² https://medium.com/huggingface/distilbert-8cf3380435b5
³ The student learns from the [CLS] (a special classification token of BERT) hidden states of the teacher.
⁴ The output of pre-training tasks (such as dynamic masking) is used as the supervision signal.
Multi-Head Attention (MHA). The calculation of the attention function depends on three components: queries, keys and values, denoted as matrices Q, K and V respectively. The attention function can be formulated as follows:

A = \frac{Q K^{T}}{\sqrt{d_k}},   (1)

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(A) V,   (2)

where d_k is the dimension of the keys and acts as a scaling factor, and A is the attention matrix calculated from the compatibility of Q and K by a dot-product operation. The final output is a weighted sum of the values V, with the weights computed by applying the softmax(·) operation to the matrix A. The attention matrices A of BERT can capture substantial linguistic knowledge, and they play an essential role in our proposed distillation method.
Multi-head attention is defined by concatenating the attention heads from different representation subspaces as follows:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W,   (3)

where h is the number of attention heads, head_i denotes the i-th attention head, which is calculated by the Attention() function with inputs from different representation subspaces, and the matrix W acts as a linear transformation.
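For reference, equations (1)–(3) can be written as a short PyTorch sketch; the explicit per-head slicing of Q, K and V and the single output matrix W are simplifications made only for illustration.

```python
# A sketch of Eqs. (1)-(3); head slicing and the single output matrix W are
# simplifications for illustration.
import math
import torch

def attention(Q, K, V):
    # Eq. (1): scaled dot-product compatibility scores.
    d_k = K.size(-1)
    A = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Eq. (2): weighted sum of values with softmax-normalized weights.
    return torch.softmax(A, dim=-1) @ V, A

def multi_head(Q, K, V, W, h):
    # Eq. (3): run h heads on per-head slices of Q, K, V, concatenate the
    # results, then apply the linear transformation W.
    d_h = Q.size(-1) // h
    heads = []
    for i in range(h):
        s = slice(i * d_h, (i + 1) * d_h)
        head_i, _ = attention(Q[..., s], K[..., s], V[..., s])
        heads.append(head_i)
    return torch.cat(heads, dim=-1) @ W
```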
Position-wise Feed-Forward Network (FFN). The Transformer layer also contains a fully connected feed-forward network, which is formulated as follows:

\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2.   (4)

The FFN consists of two linear transformations with a ReLU activation in between.
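Equation (4) maps directly onto code; the sketch below simply mirrors the formula, with the weight shapes left to the caller.

```python
import torch

def ffn(x, W1, b1, W2, b2):
    # Eq. (4): two linear transformations with a ReLU activation in between.
    return torch.relu(x @ W1 + b1) @ W2 + b2
```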
2.2 KNOWLEDGE DISTILLATION
KD aims to transfer the knowledge of a large teacher network T to a small student network S. The student network is trained to mimic the behaviors of the teacher network. Let f^{T} and f^{S} represent the behavior functions of the teacher and student networks, respectively. A behavior function transforms network inputs into some informative representations, and it can be defined as the output of any layer of the network. In the context of Transformer distillation, the output of the MHA layer or the FFN layer, or some intermediate representation (e.g., the attention matrix A), can be used as the behavior
function. Formally, KD can be modeled as minimizing the following objective function:
\mathcal{L}_{\mathrm{KD}} = \sum_{x \in \mathcal{X}} L\big(f^{S}(x), f^{T}(x)\big),   (5)
where L(·) is a loss function that evaluates the difference between the teacher and student networks, x is the text input and \mathcal{X} denotes the training dataset. Thus the key research problem is how to define effective behavior functions and loss functions. Different from previous KD methods, we also need to
consider how to perform KD at the pre-training stage of BERT, which further increases the difficulty
of KD for BERT.
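As a generic illustration of Eq. (5), the sketch below sums a per-example loss between the chosen behavior functions of the student and teacher over a dataset; leaving f^S, f^T and L as arguments, and treating the teacher outputs as fixed targets, are assumptions of this sketch.

```python
import torch

def kd_objective(dataset, f_student, f_teacher, loss_fn):
    # Eq. (5): sum the loss between student and teacher behaviors over the data.
    total = torch.tensor(0.0)
    for x in dataset:
        with torch.no_grad():  # teacher behaviors serve as fixed targets
            target = f_teacher(x)
        total = total + loss_fn(f_student(x), target)
    return total
```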
3 METHOD
In this section, we propose a novel distillation method for Transformer-based models and then present the two-stage learning framework of TinyBERT.
3.1 TRANSFORMER DISTILLATION
The proposed Transformer distillation is a specially designed KD method for Transformer networks.
Figure 1 displays an overview of the proposed KD method. In this work, both the student and teacher