Consider Figure 1. The left panel shows the original attention distribution $\alpha$ over the words of a particular movie review using a standard attentive BiLSTM architecture for sentiment analysis. It is tempting to conclude from this that the token ‘waste’ is largely responsible for the model coming to its disposition of ‘negative’ ($\hat{y} = 0.01$). But one can construct an alternative attention distribution $\tilde{\alpha}$ (right panel) that attends to entirely different tokens yet yields an essentially identical prediction (holding all other parameters of $f$, $\theta$, constant).
Such counterfactual distributions imply that ex-
plaining the original prediction by highlighting
attended-to tokens is misleading. One may, e.g., now conclude from the right panel that model output was due primarily to ‘was’; but both ‘waste’ and ‘was’ cannot simultaneously be responsible. Further, the attention weights in this case correlate only weakly with gradient-based measures of feature importance ($\tau_g = 0.29$). And arbitrarily permuting the entries in $\alpha$ yields a median output difference of 0.006 from the original prediction.
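To make the permutation check concrete, the following is a minimal sketch. It assumes, hypothetically, access to the frozen encoder outputs h, the induced attention weights alpha, and a decoder dec that maps a weighted instance representation to a prediction; these names and this interface are illustrative, not the exact experimental code.

```python
import torch

def permutation_output_diff(h, alpha, dec, n_perm=100):
    """Randomly permute the attention weights and record how much the model
    output changes, holding the hidden states and decoder fixed.

    h     : (T, m) hidden states from the (frozen) encoder
    alpha : (T,)   originally induced attention distribution
    dec   : callable mapping an m-dimensional context vector to a prediction
    """
    with torch.no_grad():
        y_orig = dec(alpha @ h)                    # prediction under the original weights
        diffs = []
        for _ in range(n_perm):
            perm = torch.randperm(alpha.numel())   # random reassignment of weights to tokens
            y_perm = dec(alpha[perm] @ h)          # same weights, shuffled over positions
            diffs.append((y_perm - y_orig).abs().max().item())
    return torch.median(torch.tensor(diffs))       # median output difference across permutations
```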
These and similar findings call into question
the view that attention provides meaningful insight
into model predictions, at least for RNN-based
models. We thus caution against using attention
weights to highlight input tokens “responsible for”
model outputs and constructing just-so stories on
this basis.
Research questions and contributions. We examine the extent to which the (often implicit) narrative that attention provides model transparency (Lipton, 2016) holds. We are specifically interested in
whether attention weights indicate why a model
made the prediction that it did. This is sometimes
called faithful explanation (Ross et al., 2017). We
investigate whether this holds across tasks by ex-
ploring the following empirical questions.
1. To what extent do induced attention weights correlate with measures of feature importance – specifically, those resulting from gradients and leave-one-out (LOO) methods? (A sketch of these importance measures follows this list.)
2. Would alternative attention weights (and
hence distinct heatmaps/“explanations”) nec-
essarily yield different predictions?
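As a concrete (and simplified) illustration of the feature importance measures referenced in question 1, the sketch below assumes a hypothetical differentiable model(x_e) that maps embedded tokens to a scalar prediction; the exact definitions used in the experiments (e.g., how variable-length inputs and multi-class outputs are handled) may differ.

```python
import torch
from scipy.stats import kendalltau

def gradient_importance(model, x_e):
    """Gradient-based importance: magnitude of d y_hat / d x_e at each
    token position, summed over the embedding dimension."""
    x_e = x_e.clone().requires_grad_(True)
    y_hat = model(x_e)                       # scalar prediction for this instance
    y_hat.backward()
    return x_e.grad.abs().sum(dim=-1)        # (T,) per-token importance scores

def loo_importance(model, x_e):
    """Leave-one-out importance: change in the prediction when token t is removed."""
    with torch.no_grad():
        y_hat = model(x_e)
        scores = [
            (y_hat - model(torch.cat([x_e[:t], x_e[t + 1:]], dim=0))).abs().item()
            for t in range(x_e.size(0))
        ]
    return torch.tensor(scores)

def tau_with_attention(alpha, importance):
    """Kendall correlation between attention weights and an importance ranking."""
    tau, _ = kendalltau(alpha.detach().numpy(), importance.numpy())
    return tau
```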
Our findings for attention weights in recurrent (BiLSTM) encoders with respect to these questions are summarized as follows: (1) Only weakly and inconsistently, and (2) No; it is very often possible to construct adversarial attention distributions that yield predictions effectively equivalent to those obtained with the originally induced attention weights, despite attending to entirely different input features. Even more strikingly, randomly permuting attention weights often induces only minimal changes in output. By contrast, attention weights in simple, feedforward (weighted average) encoders exhibit better behavior with respect to these criteria.
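One plausible way to instantiate the adversarial search mentioned above is sketched below; this is a soft, penalty-based version in which the objective, divergence measures, and hyperparameters are illustrative assumptions rather than the exact experimental procedure. It seeks attention weights far from the induced ones, as measured by Jensen-Shannon divergence, while keeping the change in output below a tolerance epsilon.

```python
import torch

def adversarial_attention(h, alpha_hat, dec, epsilon=0.01, steps=500, lr=0.1, lam=10.0):
    """Gradient-based search for an attention distribution that diverges from
    alpha_hat yet leaves the (frozen) model's prediction nearly unchanged.

    h         : (T, m) frozen hidden states
    alpha_hat : (T,)   originally induced attention weights
    dec       : differentiable map from an m-dim context vector to output probabilities
    """
    def jsd(p, q):  # Jensen-Shannon divergence between two distributions
        m = 0.5 * (p + q)
        kl = lambda a, b: (a * (a / b).clamp_min(1e-12).log()).sum()
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    alpha_hat = alpha_hat.detach()
    y_orig = dec(alpha_hat @ h).detach()
    scores = alpha_hat.clamp_min(1e-12).log().clone().requires_grad_(True)
    opt = torch.optim.Adam([scores], lr=lr)
    for _ in range(steps):
        alpha_tilde = torch.softmax(scores, dim=-1)
        tvd = 0.5 * (dec(alpha_tilde @ h) - y_orig).abs().sum()          # output divergence
        loss = -jsd(alpha_tilde, alpha_hat) + lam * torch.relu(tvd - epsilon)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(scores, dim=-1).detach()
```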
2 Preliminaries and Assumptions
We consider exemplar NLP tasks for which attention mechanisms are commonly used: classification, natural language inference (NLI), and question answering.² We adopt the following general modeling assumptions and notation.
We assume model inputs $x \in \mathbb{R}^{T \times |V|}$, composed of one-hot encoded words at each position. These are passed through an embedding matrix $E$ which provides dense ($d$-dimensional) token representations $x_e \in \mathbb{R}^{T \times d}$. Next, an encoder Enc consumes the embedded tokens in order, producing $T$ $m$-dimensional hidden states: $h = \mathrm{Enc}(x_e) \in \mathbb{R}^{T \times m}$. We predominantly consider a Bi-RNN as the encoder module, and for contrast we analyze unordered ‘average’ embedding variants in which $h_t$ is the embedding of token $t$ after being passed through a linear projection layer and ReLU activation. For completeness we also considered ConvNets, which are somewhere between these two models; we report results for these in the supplemental materials.
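For concreteness, a minimal PyTorch sketch of the two encoder variants just described is given below (a simplification under the stated assumptions, not the exact experimental code; the embedding lookup $E$ is assumed to have already been applied, and the ConvNet variant is omitted):

```python
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Bidirectional LSTM encoder: embedded tokens (B, T, d) -> hidden states (B, T, m)."""
    def __init__(self, d, m):
        super().__init__()
        assert m % 2 == 0, "m is split across the two directions"
        self.rnn = nn.LSTM(d, m // 2, bidirectional=True, batch_first=True)

    def forward(self, x_e):                    # x_e: (B, T, d)
        h, _ = self.rnn(x_e)                   # h: (B, T, m)
        return h

class AverageEncoder(nn.Module):
    """Unordered 'average' variant: each token is projected independently,
    so h_t depends only on token t (no contextualization)."""
    def __init__(self, d, m):
        super().__init__()
        self.proj = nn.Linear(d, m)

    def forward(self, x_e):                    # x_e: (B, T, d)
        return torch.relu(self.proj(x_e))      # h: (B, T, m)
```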
A similarity function $\phi$ maps $h$ and a query $Q \in \mathbb{R}^m$ (e.g., hidden representation of a question in QA, or the hypothesis in NLI) to scalar scores, and attention is then induced over these: $\hat{\alpha} = \mathrm{softmax}(\phi(h, Q)) \in \mathbb{R}^T$. In this work we consider two common similarity functions: Additive, $\phi(h, Q) = v^{\top} \tanh(W_1 h + W_2 Q)$ (Bahdanau et al., 2014), and Scaled Dot-Product, $\phi(h, Q) = \frac{hQ}{\sqrt{m}}$ (Vaswani et al., 2017), where $v$, $W_1$, and $W_2$ are model parameters.
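The two similarity functions can be sketched as follows (batched shapes and module structure are illustrative assumptions; for tasks without an explicit query, such as classification, $Q$ would presumably be dropped or fixed):

```python
import math
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """phi(h, Q) = v^T tanh(W1 h + W2 Q), followed by a softmax over positions."""
    def __init__(self, m):
        super().__init__()
        self.W1 = nn.Linear(m, m, bias=False)
        self.W2 = nn.Linear(m, m, bias=False)
        self.v = nn.Linear(m, 1, bias=False)

    def forward(self, h, Q):                   # h: (B, T, m), Q: (B, m)
        scores = self.v(torch.tanh(self.W1(h) + self.W2(Q).unsqueeze(1)))  # (B, T, 1)
        return torch.softmax(scores.squeeze(-1), dim=-1)                   # alpha_hat: (B, T)

class ScaledDotProductAttention(nn.Module):
    """phi(h, Q) = (h . Q) / sqrt(m), followed by a softmax over positions."""
    def forward(self, h, Q):                   # h: (B, T, m), Q: (B, m)
        scores = torch.einsum("btm,bm->bt", h, Q) / math.sqrt(h.size(-1))
        return torch.softmax(scores, dim=-1)   # alpha_hat: (B, T)
```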
Finally, a dense layer Dec with parameters $\theta$ consumes a weighted instance representation and yields a prediction $\hat{y} = \sigma(\theta \cdot h_{\alpha}) \in \mathbb{R}^{|\mathcal{Y}|}$, where $h_{\alpha} = \sum_{t=1}^{T} \hat{\alpha}_t \cdot h_t$; $\sigma$ is an output activation function; and $|\mathcal{Y}|$ denotes the label set size.
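Putting the pieces together, the weighted instance representation and output layer can be sketched as follows (again an illustrative simplification; here $\sigma$ is taken to be a sigmoid, as would be appropriate for binary classification):

```python
import torch
import torch.nn as nn

class AttentiveDecoder(nn.Module):
    """y_hat = sigma(theta . h_alpha), where h_alpha = sum_t alpha_hat_t * h_t."""
    def __init__(self, m, n_labels):
        super().__init__()
        self.out = nn.Linear(m, n_labels)      # parameters theta

    def forward(self, h, alpha_hat):           # h: (B, T, m), alpha_hat: (B, T)
        h_alpha = torch.einsum("bt,btm->bm", alpha_hat, h)   # weighted instance representation
        return torch.sigmoid(self.out(h_alpha))              # replace with softmax for multi-class
```

A full forward pass then composes the modules sketched above: h = encoder(x_e); alpha_hat = attention(h, Q); y_hat = decoder(h, alpha_hat).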
² While attention is perhaps most common in seq2seq tasks like translation, our impression is that interpretability is not typically emphasized for such tasks, in general.