We apply Transformer-XH to two multi-evidence reasoning tasks: HotpotQA, the multi-hop question answering task (Yang et al., 2018), and FEVER, the fact verification benchmark whose claims often require multiple pieces of evidence to verify (Thorne et al., 2018). Rather than decomposing the task into a series of sub-tasks to fit the constraints of pre-trained Transformers, Transformer-XH is a solution that fits the problem as it naturally occurs. It is a single model that represents and combines evidence from multiple documents to conduct the reasoning process. On HotpotQA's FullWiki setting, which requires strong multi-hop reasoning capability (Min et al., 2019b; Jiang & Bansal, 2019), Transformer-XH outperforms CogQA (Ding et al., 2019), the previous state-of-the-art, by 12 points on answer F1. On the FEVER 1.0 shared task, Transformer-XH significantly outperforms GEAR, the Graph Neural Network based approach. On both applications, Transformer-XH beats the contemporary BERT-based pipeline SR-MRS (Nie et al., 2019) by 2-3 points.
The results follow from our simple yet effective design, with one unified model operating over the inherent structure of the task, rather than melding the outputs from disparate sub-tasks adapted to the sequential constraints of pre-trained Transformers. Our ablation studies demonstrate Transformer-XH's efficacy on questions that are known to require multi-hop reasoning (Min et al., 2019b) and on verifying multi-evidence claims (Liu et al., 2019b). Our analyses confirm that the source of Transformer-XH's effectiveness is the eXtra Hop attention's ability to fuse and propagate information across multiple documents.¹
2 MODEL
This section first discusses preliminaries on sequential Transformers and then shows how we incorporate eXtra hop attention to create Transformer-XH.
2.1 PRELIMINARIES
Transformers represent a sequence of input text tokens $X = \{x_1, ..., x_i, ..., x_n\}$ as contextualized distributed representations $H = \{h_1, ..., h_i, ..., h_n\}$ (Vaswani et al., 2017). This process involves multiple stacked self-attention layers that convert the input $X$ into $\{H^0, H^1, ..., H^l, ..., H^L\}$, starting from $H^0$, the embeddings, to the final layer of depth $L$.
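To make this layered notation concrete, here is a minimal sketch, assuming a PyTorch setting; `nn.TransformerEncoderLayer` is used only as a stand-in for the self-attention layers formalized below, and the sizes (`vocab_size`, `d_model`, `num_layers`) are illustrative, not those of the paper's model.

```python
# Illustrative sketch only: token ids X are embedded into H^0, and each of the
# L stacked layers maps H^{l-1} to H^l; the final H^L is the contextualized output.
import torch
import torch.nn as nn

vocab_size, d_model, num_layers = 30522, 768, 12            # assumed sizes
embed = nn.Embedding(vocab_size, d_model)                   # produces H^0
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
     for _ in range(num_layers)]
)

x = torch.randint(0, vocab_size, (1, 16))   # a toy token sequence X (batch of 1)
h = embed(x)                                # H^0: the embeddings
hidden_states = [h]
for layer in layers:                        # H^1, ..., H^L
    h = layer(h)
    hidden_states.append(h)
```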
The key idea of the Transformer is its attention mechanism, which calculates the $l$-th layer output $H^l$ using the input $H^{l-1}$ from the previous layer:
$$H^l = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \cdot V^T, \qquad (1)$$
$$Q^T; K^T; V^T = W^q \cdot H^{l-1}; W^k \cdot H^{l-1}; W^v \cdot H^{l-1}. \qquad (2)$$
It includes three projections on the input $H^{l-1}$: Query ($Q$), Key ($K$), and Value ($V$).
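A minimal single-head sketch of Eqns. (1)-(2), assuming a PyTorch implementation in the usual row-major convention; the class name `SelfAttentionLayer` and its dimensions are illustrative assumptions, not the released Transformer-XH code.

```python
# Single-head self-attention as in Eqns. (1)-(2), sketched for illustration.
import math
import torch
import torch.nn as nn


class SelfAttentionLayer(nn.Module):
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.d_k = d_k
        self.w_q = nn.Linear(d_model, d_k, bias=False)      # W^q in Eqn. (2)
        self.w_k = nn.Linear(d_model, d_k, bias=False)      # W^k in Eqn. (2)
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # W^v in Eqn. (2)

    def forward(self, h_prev: torch.Tensor) -> torch.Tensor:
        # h_prev: [n, d_model], the previous layer's output H^{l-1}
        q, k, v = self.w_q(h_prev), self.w_k(h_prev), self.w_v(h_prev)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)  # QK^T / sqrt(d_k)
        attn = torch.softmax(scores, dim=-1)   # each token i attends to all tokens j
        return attn @ v                        # H^l in Eqn. (1)
```

For example, `SelfAttentionLayer(768, 64)(torch.randn(16, 768))` maps a 16-token $H^{l-1}$ to an $H^l$ of the same shape.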
Specifically, the slice $h_i^l$ for token $i$, computed from the projections in Eqn. (2), is:
$$h_i^l = \sum_j \text{softmax}_j\left(\frac{q_i^T \cdot k_j}{\sqrt{d_k}}\right) \cdot v_j, \qquad (3)$$
which first calculates its attention to all other tokens $j$ in the sequence and then combines the token values $v_j$ into a new representation $h_i^l$, using the normalized attention weights. Multiple attentions can be used in one Transformer layer and concatenated as multi-head attention (Vaswani et al., 2017). The architecture is stacked to form rather deep networks, which leads to the significant success of large pre-trained Transformer models (Devlin et al., 2019; Liu et al., 2019a).
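As a quick sanity check on this per-token view (a hypothetical example with arbitrary toy sizes), the snippet below confirms that Eqn. (3) reproduces row $i$ of the matrix form in Eqn. (1):

```python
# Verify that the per-token form (Eqn. 3) equals row i of the matrix form (Eqn. 1).
import math
import torch

n, d_k, d_v = 5, 8, 8                      # arbitrary toy sizes
q, k, v = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_v)

# Matrix form: all token representations at once.
h_matrix = torch.softmax(q @ k.T / math.sqrt(d_k), dim=-1) @ v

# Per-token form: token i attends to every token j, then mixes the values v_j.
i = 2
weights_i = torch.softmax(q[i] @ k.T / math.sqrt(d_k), dim=-1)  # softmax over j
h_i = weights_i @ v

assert torch.allclose(h_matrix[i], h_i, atol=1e-6)
```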
A challenge of the Transformer is that its attention is calculated over all token pairs (Eqn. 3), which is hard to scale to long text sequences. Transformer-XL (eXtra Long) addresses this challenge by breaking down longer texts, e.g., a multi-paragraph document, into a sequence of text segments $\{X_1, ..., X_\tau, ..., X_\zeta\}$, and propagates the information between adjacent text segments using the following attention:
$$\tilde{H}_\tau^{l-1} = [\text{Freeze}(H_{\tau-1}^{l-1}) \circ H_\tau^{l-1}]. \qquad (4)$$
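A minimal sketch of how Eqn. (4) is typically used, following the Transformer-XL recipe: the previous segment's hidden states are detached ("frozen") and concatenated in front of the current segment, so they contribute keys and values but receive no gradient. The function name and tensor shapes here are our assumptions for illustration.

```python
# Segment-level recurrence sketch for Eqn. (4): the frozen memory from segment
# tau-1 is concatenated with the current segment before the attention projections.
import math
import torch


def attend_with_memory(h_curr: torch.Tensor,   # H^{l-1}_tau,     [n, d_model]
                       memory: torch.Tensor,   # H^{l-1}_{tau-1}, [m, d_model]
                       w_q: torch.Tensor, w_k: torch.Tensor, w_v: torch.Tensor,
                       d_k: int) -> torch.Tensor:
    # Eqn. (4): ~H^{l-1}_tau = [Freeze(H^{l-1}_{tau-1}) o H^{l-1}_tau]
    h_tilde = torch.cat([memory.detach(), h_curr], dim=0)

    # Queries come only from the current segment; keys and values also see the
    # frozen memory, which propagates information from the previous segment.
    q = h_curr @ w_q
    k = h_tilde @ w_k
    v = h_tilde @ w_v
    attn = torch.softmax(q @ k.T / math.sqrt(d_k), dim=-1)
    return attn @ v    # H^l_tau for the current segment
```

Detaching the memory keeps the gradient within the current segment while still letting information flow forward along the segment chain.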
¹Code available at https://aka.ms/transformer-xh.