Consider Figure 1. The left panel shows the original attention distribution $\alpha$ over the words of a particular movie review using a standard attentive BiLSTM architecture for sentiment analysis. It is tempting to conclude from this that the token ‘waste’ is largely responsible for the model coming to its disposition of ‘negative’ ($\hat{y} = 0.01$). But one can construct an alternative attention distribution $\tilde{\alpha}$ (right panel) that attends to entirely different tokens yet yields an essentially identical prediction (holding all other parameters of $f$, $\theta$, constant).
Such counterfactual distributions imply that ex-
plaining the original prediction by highlighting
attended-to tokens is misleading. One may, e.g., now conclude from the right panel that model output was due primarily to ‘was’; but both ‘waste’ and ‘was’ cannot simultaneously be responsible. Further, the attention weights in this case correlate only weakly with gradient-based measures of feature importance ($\tau_g = 0.29$). And arbitrarily permuting the entries in $\alpha$ yields a median output difference of 0.006 from the original prediction.
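To make the permutation check concrete, the following is a minimal sketch. It assumes, hypothetically, access to the frozen encoder outputs h, the induced attention weights alpha, and a decoder dec that maps a weighted instance representation to a prediction; these names and this interface are illustrative, not the exact experimental code.

```python
import torch

def permutation_output_diff(h, alpha, dec, n_perm=100):
    """Randomly permute the attention weights and record how much the model
    output changes, holding the hidden states and decoder fixed.

    h     : (T, m) hidden states from the (frozen) encoder
    alpha : (T,)   originally induced attention distribution
    dec   : callable mapping an m-dimensional context vector to a prediction
    """
    with torch.no_grad():
        y_orig = dec(alpha @ h)                    # prediction under the original weights
        diffs = []
        for _ in range(n_perm):
            perm = torch.randperm(alpha.numel())   # random reassignment of weights to tokens
            y_perm = dec(alpha[perm] @ h)          # same weights, shuffled over positions
            diffs.append((y_perm - y_orig).abs().max().item())
    return torch.median(torch.tensor(diffs))       # median output difference across permutations
```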
These and similar findings call into question
the view that attention provides meaningful insight
into model predictions, at least for RNN-based
models. We thus caution against using attention
weights to highlight input tokens “responsible for”
model outputs and constructing just-so stories on
this basis.
Research questions and contributions. We examine the extent to which the (often implicit) narrative that attention provides model transparency (Lipton, 2016) holds. We are specifically interested in
whether attention weights indicate why a model
made the prediction that it did. This is sometimes
called faithful explanation (Ross et al., 2017). We
investigate whether this holds across tasks by ex-
ploring the following empirical questions.
1. To what extent do induced attention weights correlate with measures of feature importance – specifically, those resulting from gradients and leave-one-out (LOO) methods? (A sketch of these importance measures follows this list.)
2. Would alternative attention weights (and
hence distinct heatmaps/“explanations”) nec-
essarily yield different predictions?
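As a concrete (and simplified) illustration of the feature importance measures referenced in question 1, the sketch below assumes a hypothetical differentiable model(x_e) that maps embedded tokens to a scalar prediction; the exact definitions used in the experiments (e.g., how variable-length inputs and multi-class outputs are handled) may differ.

```python
import torch
from scipy.stats import kendalltau

def gradient_importance(model, x_e):
    """Gradient-based importance: magnitude of d y_hat / d x_e at each
    token position, summed over the embedding dimension."""
    x_e = x_e.clone().requires_grad_(True)
    y_hat = model(x_e)                       # scalar prediction for this instance
    y_hat.backward()
    return x_e.grad.abs().sum(dim=-1)        # (T,) per-token importance scores

def loo_importance(model, x_e):
    """Leave-one-out importance: change in the prediction when token t is removed."""
    with torch.no_grad():
        y_hat = model(x_e)
        scores = [
            (y_hat - model(torch.cat([x_e[:t], x_e[t + 1:]], dim=0))).abs().item()
            for t in range(x_e.size(0))
        ]
    return torch.tensor(scores)

def tau_with_attention(alpha, importance):
    """Kendall correlation between attention weights and an importance ranking."""
    tau, _ = kendalltau(alpha.detach().numpy(), importance.numpy())
    return tau
```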
Our findings for attention weights in recurrent (BiLSTM) encoders with respect to these questions are summarized as follows: (1) Only weakly and inconsistently, and (2) No; it is very often possible to construct adversarial attention distributions that yield predictions effectively equivalent to those obtained with the originally induced attention weights, despite attending to entirely different input features. Even more strikingly, randomly permuting attention weights often induces only minimal changes in output. By contrast, attention weights in simple, feedforward (weighted average) encoders exhibit better behavior with respect to these criteria.
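One plausible way to instantiate the adversarial search mentioned above is sketched below; this is a soft, penalty-based version in which the objective, divergence measures, and hyperparameters are illustrative assumptions rather than the exact experimental procedure. It seeks attention weights far from the induced ones, as measured by Jensen-Shannon divergence, while keeping the change in output below a tolerance epsilon.

```python
import torch

def adversarial_attention(h, alpha_hat, dec, epsilon=0.01, steps=500, lr=0.1, lam=10.0):
    """Gradient-based search for an attention distribution that diverges from
    alpha_hat yet leaves the (frozen) model's prediction nearly unchanged.

    h         : (T, m) frozen hidden states
    alpha_hat : (T,)   originally induced attention weights
    dec       : differentiable map from an m-dim context vector to output probabilities
    """
    def jsd(p, q):  # Jensen-Shannon divergence between two distributions
        m = 0.5 * (p + q)
        kl = lambda a, b: (a * (a / b).clamp_min(1e-12).log()).sum()
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    alpha_hat = alpha_hat.detach()
    y_orig = dec(alpha_hat @ h).detach()
    scores = alpha_hat.clamp_min(1e-12).log().clone().requires_grad_(True)
    opt = torch.optim.Adam([scores], lr=lr)
    for _ in range(steps):
        alpha_tilde = torch.softmax(scores, dim=-1)
        tvd = 0.5 * (dec(alpha_tilde @ h) - y_orig).abs().sum()          # output divergence
        loss = -jsd(alpha_tilde, alpha_hat) + lam * torch.relu(tvd - epsilon)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(scores, dim=-1).detach()
```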
2 Preliminaries and Assumptions
We consider exemplar NLP tasks for which attention mechanisms are commonly used: classification, natural language inference (NLI), and question answering.² We adopt the following general modeling assumptions and notation.
We assume model inputs $x \in \mathbb{R}^{T \times |V|}$, composed of one-hot encoded words at each position. These are passed through an embedding matrix $E$ which provides dense ($d$-dimensional) token representations $x_e \in \mathbb{R}^{T \times d}$. Next, an encoder Enc consumes the embedded tokens in order, producing $T$ $m$-dimensional hidden states: $h = \mathrm{Enc}(x_e) \in \mathbb{R}^{T \times m}$. We predominantly consider a Bi-RNN as the encoder module, and for contrast we analyze unordered ‘average’ embedding variants in which $h_t$ is the embedding of token $t$ after being passed through a linear projection layer and ReLU activation. For completeness we also considered ConvNets, which are somewhere between these two models; we report results for these in the supplemental materials.
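For concreteness, a minimal PyTorch sketch of the two encoder variants just described is given below (a simplification under the stated assumptions, not the exact experimental code; the embedding lookup $E$ is assumed to have already been applied, and the ConvNet variant is omitted):

```python
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Bidirectional LSTM encoder: embedded tokens (B, T, d) -> hidden states (B, T, m)."""
    def __init__(self, d, m):
        super().__init__()
        assert m % 2 == 0, "m is split across the two directions"
        self.rnn = nn.LSTM(d, m // 2, bidirectional=True, batch_first=True)

    def forward(self, x_e):                    # x_e: (B, T, d)
        h, _ = self.rnn(x_e)                   # h: (B, T, m)
        return h

class AverageEncoder(nn.Module):
    """Unordered 'average' variant: each token is projected independently,
    so h_t depends only on token t (no contextualization)."""
    def __init__(self, d, m):
        super().__init__()
        self.proj = nn.Linear(d, m)

    def forward(self, x_e):                    # x_e: (B, T, d)
        return torch.relu(self.proj(x_e))      # h: (B, T, m)
```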
A similarity function $\phi$ maps $h$ and a query $Q \in \mathbb{R}^m$ (e.g., hidden representation of a question in QA, or the hypothesis in NLI) to scalar scores, and attention is then induced over these: $\hat{\alpha} = \mathrm{softmax}(\phi(h, Q)) \in \mathbb{R}^T$. In this work we consider two common similarity functions: Additive, $\phi(h, Q) = v^{\top} \tanh(W_1 h + W_2 Q)$ (Bahdanau et al., 2014), and Scaled Dot-Product, $\phi(h, Q) = \frac{hQ}{\sqrt{m}}$ (Vaswani et al., 2017), where $v$, $W_1$, and $W_2$ are model parameters.
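The two similarity functions can be sketched as follows (batched shapes and module structure are illustrative assumptions; for tasks without an explicit query, such as classification, $Q$ would presumably be dropped or fixed):

```python
import math
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """phi(h, Q) = v^T tanh(W1 h + W2 Q), followed by a softmax over positions."""
    def __init__(self, m):
        super().__init__()
        self.W1 = nn.Linear(m, m, bias=False)
        self.W2 = nn.Linear(m, m, bias=False)
        self.v = nn.Linear(m, 1, bias=False)

    def forward(self, h, Q):                   # h: (B, T, m), Q: (B, m)
        scores = self.v(torch.tanh(self.W1(h) + self.W2(Q).unsqueeze(1)))  # (B, T, 1)
        return torch.softmax(scores.squeeze(-1), dim=-1)                   # alpha_hat: (B, T)

class ScaledDotProductAttention(nn.Module):
    """phi(h, Q) = (h . Q) / sqrt(m), followed by a softmax over positions."""
    def forward(self, h, Q):                   # h: (B, T, m), Q: (B, m)
        scores = torch.einsum("btm,bm->bt", h, Q) / math.sqrt(h.size(-1))
        return torch.softmax(scores, dim=-1)   # alpha_hat: (B, T)
```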
Finally, a dense layer Dec with parameters $\theta$ consumes a weighted instance representation and yields a prediction $\hat{y} = \sigma(\theta \cdot h_{\alpha}) \in \mathbb{R}^{|\mathcal{Y}|}$, where $h_{\alpha} = \sum_{t=1}^{T} \hat{\alpha}_t \cdot h_t$; $\sigma$ is an output activation function; and $|\mathcal{Y}|$ denotes the label set size.
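Putting the pieces together, the weighted instance representation and output layer can be sketched as follows (again an illustrative simplification; here $\sigma$ is taken to be a sigmoid, as would be appropriate for binary classification):

```python
import torch
import torch.nn as nn

class AttentiveDecoder(nn.Module):
    """y_hat = sigma(theta . h_alpha), where h_alpha = sum_t alpha_hat_t * h_t."""
    def __init__(self, m, n_labels):
        super().__init__()
        self.out = nn.Linear(m, n_labels)      # parameters theta

    def forward(self, h, alpha_hat):           # h: (B, T, m), alpha_hat: (B, T)
        h_alpha = torch.einsum("bt,btm->bm", alpha_hat, h)   # weighted instance representation
        return torch.sigmoid(self.out(h_alpha))              # replace with softmax for multi-class
```

A full forward pass then composes the modules sketched above: h = encoder(x_e); alpha_hat = attention(h, Q); y_hat = decoder(h, alpha_hat).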
² While attention is perhaps most common in seq2seq tasks like translation, our impression is that interpretability is not typically emphasized for such tasks, in general.