Analyzing Multi-Head Self-Attention:
Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
Elena Voita (1,2)   David Talbot (1)   Fedor Moiseev (1,5)   Rico Sennrich (3,4)   Ivan Titov (3,2)
(1) Yandex, Russia   (2) University of Amsterdam, Netherlands   (3) University of Edinburgh, Scotland
(4) University of Zurich, Switzerland   (5) Moscow Institute of Physics and Technology, Russia
{lena-voita, talbot, femoiseev}@yandex-team.ru
rico.sennrich@ed.ac.uk   ititov@inf.ed.ac.uk
Abstract
Multi-head self-attention is a key component
of the Transformer, a state-of-the-art architec-
ture for neural machine translation. In this
work we evaluate the contribution made by in-
dividual attention heads in the encoder to the
overall performance of the model and analyze
the roles played by them. We find that the
most important and confident heads play con-
sistent and often linguistically-interpretable
roles. When pruning heads using a method
based on stochastic gates and a differentiable
relaxation of the L0 penalty, we observe that
specialized heads are last to be pruned. Our
novel pruning method removes the vast major-
ity of heads without seriously affecting perfor-
mance. For example, on the English-Russian
WMT dataset, pruning 38 out of 48 encoder
heads results in a drop of only 0.15 BLEU.¹

¹ We release code at https://github.com/lena-voita/the-story-of-heads.
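The abstract mentions pruning heads with stochastic gates and a differentiable relaxation of the L0 penalty. As a rough illustration only, not the released implementation, the sketch below shows one standard way such gates can be parameterized, in the style of the Hard Concrete distribution of Louizos et al. (2018); all names and hyperparameter values are assumptions.

```python
# Illustrative sketch only: per-head stochastic gates with a Hard-Concrete-style
# relaxation of the L0 penalty (Louizos et al., 2018). Not the released code;
# hyperparameters (beta, gamma, zeta, the 0.05 weight) are assumed values.
import torch
import torch.nn as nn

class HeadGates(nn.Module):
    def __init__(self, n_heads, beta=0.33, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_heads))  # learnable gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        # Reparameterized, differentiable sample of one gate value per head.
        u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        s = s * (self.zeta - self.gamma) + self.gamma   # stretch past [0, 1]
        return s.clamp(0.0, 1.0)                        # exact zeros/ones are possible

    def expected_l0(self):
        # Differentiable surrogate for the number of non-zero (kept) heads.
        ratio = torch.tensor(-self.gamma / self.zeta)
        return torch.sigmoid(self.log_alpha - self.beta * torch.log(ratio)).sum()

gates = HeadGates(n_heads=48)
z = gates()                                   # shape (48,): one gate per encoder head
sparsity_term = 0.05 * gates.expected_l0()    # added to the translation loss
```

In this kind of scheme, each head's output is multiplied by its gate, so heads whose gates collapse to zero are effectively pruned while the rest of the network trains as usual.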
1 Introduction
The Transformer (Vaswani et al., 2017) has be-
come the dominant modeling paradigm in neu-
ral machine translation. It follows the encoder-
decoder framework using stacked multi-head self-
attention and fully connected layers. Multi-head
attention was shown to make more efficient use of
the model’s capacity: performance of the model
with 8 heads is almost 1 BLEU point higher than
that of a model of the same size with single-head
attention (Vaswani et al., 2017). The Transformer
achieved state-of-the-art results in recent shared
translation tasks (Bojar et al., 2018; Niehues
et al., 2018). Despite the model’s widespread
adoption and recent attempts to investigate the
kinds of information learned by the model’s en-
coder (Raganato and Tiedemann, 2018), the anal-
ysis of multi-head attention and its importance
for translation is challenging. Previous analysis
of multi-head attention considered the average of
attention weights over all heads at a given posi-
tion or focused only on the maximum attention
weights (Voita et al., 2018; Tang et al., 2018),
but neither method explicitly takes into account
the varying importance of different heads. Also,
this obscures the roles played by individual heads
which, as we show, influence the generated trans-
lations to differing extents. We attempt to answer
the following questions:
• To what extent does translation quality de-
pend on individual encoder heads?
• Do individual encoder heads play consistent
and interpretable roles? If so, which are the
most important ones for translation quality?
• Which types of model attention (encoder
self-attention, decoder self-attention or
decoder-encoder attention) are most sensitive
to the number of attention heads and on
which layers?
• Can we significantly reduce the number of
attention heads while preserving translation
quality?
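As background for these questions, the following is a minimal sketch of multi-head scaled dot-product attention in the spirit of Vaswani et al. (2017); the function names, shapes, and the absence of masking are simplifying assumptions, not the paper's code.

```python
# Minimal multi-head scaled dot-product attention in the spirit of Vaswani et al.
# (2017). Shapes, names, and the lack of masking/bias terms are simplifications.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (seq_len, d_model); each w_*: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project queries/keys/values, then split the model dimension into heads.
    q, k, v = (x @ w for w in (w_q, w_k, w_v))
    q, k, v = (m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2) for m in (q, k, v))
    # Each head computes its own attention distribution over positions.
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (n_heads, seq, seq)
    heads = weights @ v                                            # (n_heads, seq, d_head)
    out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ w_o
    return out, weights  # per-head weights are what head-level analyses inspect
```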
We start by identifying the most important
heads in each encoder layer using layer-wise rele-
vance propagation (Ding et al., 2017). For heads
judged to be important, we then attempt to charac-
terize the roles they perform. We observe the fol-
lowing types of role: positional (heads attending
to an adjacent token), syntactic (heads attending
to tokens in a specific syntactic dependency rela-
tion) and attention to rare words (heads pointing to
the least frequent tokens in the sentence).
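One simple way to operationalize the positional role described above is to measure how often a head's largest attention weight falls on an adjacent token. The sketch below is a hypothetical illustration under that assumption, including the 0.9 threshold, rather than the evaluation code used in the paper.

```python
# Hypothetical check for a "positional" head: the fraction of positions whose
# maximum attention weight falls on an adjacent token. The 0.9 threshold is an
# assumed value, not taken from this section.
import numpy as np

def positional_score(weights, offset):
    """weights: (seq_len, seq_len) attention matrix of one head for one sentence;
    offset: -1 for the previous token, +1 for the next token."""
    seq_len = weights.shape[0]
    hits = sum(
        1
        for pos in range(seq_len)
        if 0 <= pos + offset < seq_len and weights[pos].argmax() == pos + offset
    )
    return hits / seq_len

def looks_positional(per_sentence_weights, threshold=0.9):
    # Average the score over many sentences for both offsets.
    scores = {
        offset: np.mean([positional_score(w, offset) for w in per_sentence_weights])
        for offset in (-1, +1)
    }
    return max(scores.values()) >= threshold
```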
To understand whether the remaining heads per-
form vital but less easily defined roles, or are sim-
ply redundant to the performance of the model as