Combiner: Full Attention Transformer
with Sparse Computation Cost
Hongyu Ren∗†, Hanjun Dai∗, Zihang Dai∗, Mengjiao Yang, Jure Leskovec†, Dale Schuurmans‡, Bo Dai
†Stanford University, {hyren,jure}@cs.stanford.edu
Google Research, Brain Team, {hadai,zihangd,sherryy,schuurmans,bodai}@google.com
‡University of Alberta
∗ indicates equal contribution. The work was completed during HR's internship at Google Brain.
35th Conference on Neural Information Processing Systems (NeurIPS 2021). arXiv:2107.05768v2 [cs.LG] 28 Oct 2021.
Abstract
Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity $O(L^2)$ with respect to the sequence length in attention layers, which restricts application in extremely long sequences. Most existing approaches leverage sparsity or low-rank assumptions in the attention matrix to reduce cost, but sacrifice expressiveness. Instead, we propose Combiner, which provides full attention capability in each attention head while maintaining low computation and memory complexity. The key idea is to treat the self-attention mechanism as a conditional expectation over embeddings at each location, and approximate the conditional distribution with a structured factorization. Each location can attend to all other locations, either via direct attention, or through indirect attention to abstractions, which are again conditional expectations of embeddings from corresponding local regions. We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention, resulting in the same sub-quadratic cost ($O(L \log L)$ or $O(L\sqrt{L})$). Combiner is a drop-in replacement for attention layers in existing transformers and can be easily implemented in common frameworks. An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach, yielding state-of-the-art results on several image and text modeling tasks.
1 Introduction
The Transformer [1] is a powerful neural network architecture that has demonstrated state-of-the-art performance in machine translation [2] and many other natural language processing (NLP) tasks via pretraining, using either unidirectional language modeling [3] or bidirectional language modeling [4–8]. It has also achieved excellent results in other domains like image recognition [9], code understanding [10], speech recognition [11], protein [12], music [13] and image [14] generative modeling. The core component of the Transformer is the attention mechanism, which computes dependencies between all pairs of positions in a sequence. However, for a sequence of length $L$, the expressiveness of pairwise attention comes at a quadratic cost $O(L^2)$ in both time and memory consumption. This makes the vanilla Transformer [1] prohibitive for applications that involve long sequences, including high-resolution images, protein sequences, or raw speech signals [15], where the sequence length $L$ is often larger than 10,000 [14].
Recently, there have been several attempts to scale up attention to long sequences. A popular class of methods sparsifies the attention matrix with different sparsity patterns, including local window [16, 17], local+stride [14], log-sparse [18], axial [19, 20], or learnable patterns through hashing [21] or clustering [22]. Sparse attention enjoys sub-quadratic cost, but is lossy in capturing all-pair relationships. Generally, sparse attention requires more layers [14, 20, 23] to achieve full autoregressive or bidirectional dependencies (or receptive fields [20]) for each location in a long sequence.
Alternatively, another line of research has tried to achieve scalability with an explicit low-rank assumption [24, 25] on the attention matrix or by using explicit feature maps of some kernels [26]. However, these explicit low-dimensional approximations might be too restricted for the potentially full-rank attention matrix, which uses exponential kernels that are effectively infinite dimensional [27]. The Performer [28] is among the first works that attempts to approximate regular full-rank attention with the random feature trick [29]. However, such random-feature based approaches [30] require many more bases to better approximate the exponential kernel [27], and empirically we found it produces inferior results in some sequence modeling tasks, such as density estimation.
In this paper we propose Combiner, a drop-in replacement for the vanilla quadratic attention mechanism with sub-quadratic computation and memory cost. Combiner still achieves full attention capability within each head of multi-head attention, unlike approaches that adopt sparse or low-rank approximations. As we will discuss, the standard attention computed at each location can be seen as the conditional expectation of the value embeddings at all feasible locations given the current location. Based on such an understanding, Combiner explicitly approximates this conditional distribution through a structured factorization of the probability space. Specifically, given a location $x$, the probability of attending to location $y$ can either be calculated directly via the query vector of $x$ and the key vector of $y$, or indirectly through a local abstraction, where $x$ first attends to the key vector that represents a group of locations containing $y$, and this probability is multiplied by the probability of choosing $y$ within that group. We refer to this model as Combiner since the conditional distribution in attention becomes a combination of several local attentions and direct attentions. This structured decomposition enables Combiner to take existing sparse attention patterns and convert them into corresponding design choices for probability factorizations that achieve full attention. As shown in Figure 1, Combiner achieves full attention with the same asymptotic complexity as sparse variants. Combiner can be easily implemented in most existing deep learning frameworks without the need for specialized hardware implementation, and is GPU/TPU friendly. In fact, both the fixed and learnable sparse attention patterns from many existing Transformer variants [14, 18, 20, 22] can be enhanced with such structured factorizations, with the same order of time or memory cost.
We validate Combiner on both autoregressive and bidirectional sequence modeling tasks over a variety of domains including text and images. We show that Combiner can achieve better perplexity and accuracy when using the same transformer architectures while being much faster in terms of runtime, and achieves state-of-the-art performance on density estimation on the standard datasets CIFAR-10 (2.77 bits/dim) and ImageNet-64 (3.42 bits/dim), as well as on the Long-Range Arena [31]. The implementation of Combiner can be found at https://github.com/google-research/google-research/tree/master/combiner.
2 Attention as Conditional Expectation
In this section, we revisit the formulation of the standard Transformer [1] from the perspective of conditional expectation, which inspires the derivation of Combiner.
Without loss of generality, we use a single sequence in the self-attention scenario. Given a sequence of $L$ embeddings $X = [x_1, x_2, \ldots, x_L]$, where $X \in \mathbb{R}^{L \times d}$ and each embedding $x_i \in \mathbb{R}^d$ is a $d$-dimensional vector, the core component of the Transformer is multi-head attention, where each head $h$ is a scaled dot-product attention:
$$A_h(X) = \mathrm{softmax}\left(\frac{Q_h}{\sqrt{d}}\, K_h^\top\right) V_h, \qquad \left\{Q_h = XW_h^Q,\; K_h = XW_h^K,\; V_h = XW_h^V\right\} \in \mathbb{R}^{L \times d}, \quad (1)$$
and the attention vector from each head $A_h(X)$ is concatenated and projected:
$$\mathrm{MultiHeadAttn}(X) = \left[A_1(X), A_2(X), \ldots, A_H(X)\right] W^o, \qquad W^o \in \mathbb{R}^{Hd \times d}. \quad (2)$$
Here $H$ is the total number of heads per Transformer layer. In this paper, we focus on how to approximate full attention within each head of multi-head attention. For ease of notation, we drop the head index $h$ whenever possible, and use lower-case letters $x_i, q_i, k_i, v_i \in \mathbb{R}^d$ to denote rows in $X, Q, K, V$ respectively, which correspond to a location $i$ in the original sequence of length $L$. We use $[n]$ to denote the set of positive integers $\{1, 2, \ldots, n\}$.
[Figure 1: attention matrices of several instantiations of Combiner in the autoregressive setting; panels (A) Fixed, (B) Logsparse, (C) Axial, (D) Combiner-Fixed, (E) Combiner-Logsparse, (F) Combiner-Axial; blue marks direct expectation, yellow marks local expectation.] We transform several sparse attention patterns, Fixed (A) [14], Logsparse (B) [18] and Axial (C) [20], into Combiner-Fixed (D), Combiner-Logsparse (E) and Combiner-Axial (F). Combiner approximates the conditional expectation (3) with a combination of direct expectation (blue) and local expectation (yellow). Our instantiations (D), (E) and (F) achieve full attention with the same sub-quadratic complexity.
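To make the quadratic cost in (1)-(2) concrete, the following is a minimal NumPy sketch of one multi-head attention layer; the shapes and variable names are our own illustration, not the paper's released implementation. The per-head $(L \times L)$ score matrix is exactly the source of the $O(L^2)$ time and memory cost.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_o):
    # Eq. (1)-(2): per-head scaled dot-product attention, then concatenation + projection.
    # X: (L, d); W_Q/W_K/W_V: lists of H matrices of shape (d, d); W_o: (H*d, d).
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each (L, d)
        scores = (Q / np.sqrt(Q.shape[-1])) @ K.T   # (L, L) score matrix: quadratic in L
        heads.append(softmax(scores) @ V)           # A_h(X) as in Eq. (1)
    return np.concatenate(heads, axis=-1) @ W_o     # Eq. (2)

# Toy usage with L = 8, d = 4, H = 2.
L, d, H = 8, 4, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))
W_Q = [rng.normal(size=(d, d)) for _ in range(H)]
W_K = [rng.normal(size=(d, d)) for _ in range(H)]
W_V = [rng.normal(size=(d, d)) for _ in range(H)]
W_o = rng.normal(size=(H * d, d))
out = multi_head_attention(X, W_Q, W_K, W_V, W_o)   # (L, d)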
For a position $i \in [L]$, the attention formulation (1) can be viewed as a conditional expectation of rows in $V$. Specifically, since softmax outputs a probability distribution, we can rewrite (1) as
$$A(x_i) = \mathbb{E}_{p(j|i)}\left[v_j\right], \qquad p(j|i) = \frac{1}{Z(x_i)} \exp\left(\frac{q_i}{\sqrt{d}}\, k_j^\top\right), \quad (3)$$
where $p(j|i)$ denotes the conditional probability at position $j$ given the token at position $i$, and the partition function $Z(x_i) = \sum_{j \in \Omega_i} \exp\left(\frac{q_i}{\sqrt{d}} k_j^\top\right)$ is computed over the support $\Omega_i$. The support $\Omega_i$ of $p(j|i)$ defines the set of valid locations that the $i$-th token can attend to. For instance, the support set in autoregressive language modeling (LM) consists of all previous tokens, i.e., $\Omega_i^{\mathrm{LM}} = [i]$;² in masked language modeling (MLM) the support consists of all tokens in the sequence, i.e., $\Omega_i^{\mathrm{MLM}} = [L]$. That is, $\Omega_i^{\mathrm{LM}}$ and $\Omega_i^{\mathrm{MLM}}$ represent the full attention capability in the LM and MLM settings, respectively.
² Following the conventional implementation, the input sequence is "right-shifted" so that position $i$ can attend to itself in the LM setting.
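The conditional-expectation reading of (3) can be checked with a short sketch; the helper below is our own illustration (not from the paper's code) and simply evaluates $A(x_i) = \mathbb{E}_{p(j|i)}[v_j]$ over an explicit support, e.g. $\Omega_i^{\mathrm{LM}} = [i]$ in the autoregressive case.

import numpy as np

def attention_as_expectation(q_i, K, V, support):
    # Eq. (3): A(x_i) = E_{p(j|i)}[v_j], with p(j|i) a softmax restricted to Omega_i.
    # q_i: (d,); K, V: (L, d); support: list of valid positions j (0-indexed here).
    d = q_i.shape[-1]
    logits = np.array([q_i @ K[j] / np.sqrt(d) for j in support])
    p = np.exp(logits - logits.max())
    p = p / p.sum()                                   # p(j|i), normalized by Z(x_i)
    return sum(p_j * V[j] for p_j, j in zip(p, support))

# LM support: positions 0..i (position i attends to itself after right-shifting);
# MLM would instead pass support = range(L).
L, d = 6, 4
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, L, d))
i = 3
a_i = attention_as_expectation(Q[i], K, V, support=list(range(i + 1)))  # (d,)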
3 Combiner: Full Attention via Structured Conditional Expectation
The complexity of $p(j|i)$ is the bottleneck of the computation for $A(x_i)$. Generally, in existing sparse transformers, the support of $p(j|i)$ is sparsified to reduce the computation and memory complexity, e.g., $\Omega_i^{\mathrm{Sparse}} \subsetneq \Omega_i^{\mathrm{LM}}$ for LM and $\Omega_i^{\mathrm{Sparse}} \subsetneq \Omega_i^{\mathrm{MLM}}$ for MLM, but this can lead to either reduced capacity or limited applicability. We defer detailed discussion of the full capacity of the model to Appendix A. In this section we introduce Combiner, which achieves $\Omega_i^{\mathrm{Combiner}} = \Omega_i^{\mathrm{LM}}$ for LM and $\Omega_i^{\mathrm{Combiner}} = \Omega_i^{\mathrm{MLM}}$ for MLM, while still maintaining sub-quadratic computation and memory cost. Below we denote $\Omega_i$ as the support for full attention when there is no ambiguity or need to distinguish between LM and MLM. We introduce the main design framework in Section 3.1 and possible parameterizations in Section 3.2. Then in Section 3.3 we analyze the trade-off of Combiner.
3.1 Local Factorization for Conditional Expectation
The main idea of Combiner is to exploit a hierarchical structure for conditional probability modeling in (3), which provides the opportunity for reducing computation complexity while maintaining the same support. Specifically, we introduce support variables $\Omega_i^r$, for $r = 0, \ldots, n_i$ and $i \in [L]$. The support variables are disjoint, i.e., $\Omega_i^r \cap \Omega_i^s = \emptyset, \forall r \neq s$, and $\cup_{r=0}^{n_i} \Omega_i^r = \Omega_i$. Then we can factorize $p(j|i)$ as
$$p(j|i) = \sum_{r=0}^{n_i} p(j, \Omega_i^r \mid i) = \sum_{r=0}^{n_i} p(j \mid \Omega_i^r, i)\, p(\Omega_i^r \mid i) = p(j \mid \Omega_i^{r_j}, i)\, p(\Omega_i^{r_j} \mid i), \quad (4)$$
where $r_j$ denotes the index of the support to which $j$ belongs. The last equality arises from the fact that the $\Omega_i^r$ are disjoint from each other ($\Omega_i^r \cap \Omega_i^s = \emptyset, \forall r \neq s$). Therefore, there is only one support, $\Omega_i^{r_j}$, containing $j$. The remaining terms, where $j \notin \Omega_i^r$ for $r \neq r_j$, are all zero since $p(j \mid \Omega_i^r, i) = 0$.
Furthermore, assuming $\Omega_i^{r_j}$ is a sufficient statistic, i.e., $j$ and $i$ are independent given $\Omega_i^{r_j}$, we obtain
$$p(j|i) = p(j \mid \Omega_i^{r_j})\, p(\Omega_i^{r_j} \mid i). \quad (5)$$
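As a concrete illustration (ours, not from the paper): take an LM position $i = 5$ with support $\Omega_5 = \{1, \ldots, 5\}$, and partition it into a direct part $\Omega_5^0 = \{4, 5\}$ and one abstraction $\Omega_5^1 = \{1, 2, 3\}$. For $j = 2$ we have $r_j = 1$, so (5) gives $p(2 \mid 5) = p(2 \mid \Omega_5^1)\, p(\Omega_5^1 \mid 5)$: position 5 first attends to the abstraction of $\{1, 2, 3\}$, and that probability mass is then distributed within the group, while $j \in \{4, 5\}$ is handled by direct attention in the factorized expectation below.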
Given the partition $\{\Omega_i^r\}_{r=0}^{n_i}$, the attention form in (3) can be rewritten as
$$A(x_i) = \mathbb{E}_{p(j|i)}\left[v_j\right] = \sum_{r=0}^{n_i} \sum_{j \in \Omega_i^r} p(j, \Omega_i^r \mid i)\, v_j \quad (6)$$
$$= \underbrace{\sum_{j \in \Omega_i^0} \tilde{p}(j|i)\, v_j}_{\text{direct expectation}} \;+\; \underbrace{\sum_{r=1}^{n_i} p(\Omega_i^r \mid i) \sum_{j \in \Omega_i^r} p(j \mid \Omega_i^r)\, v_j}_{\text{local expectation}}, \quad (7)$$
where we consider direct attention in partition $\Omega_i^0$ and apply the local factorization (5) to the partitions $r = 1, \ldots, n_i$. Here $\tilde{p}(j|i) \propto p(j|i)$ but with a different normalization constant, which will be explained below. We refer to this model as Combiner since the structured attention (7) combines the direct expectation over $\Omega_i^0$ and multiple local expectations via $p(j \mid \Omega_i^r)$ and $p(\Omega_i^r \mid i)$ to form the final conditional expectation.
Equivalently, we can also rewrite the structured attention (7) as
$$A(x_i) = \sum_{j \in \Omega_i} \underbrace{\left[ \mathbb{I}(j \in \Omega_i^0)\, \tilde{p}(j|i) + \sum_{r=1}^{n_i} \mathbb{I}(j \in \Omega_i^r)\, p(j \mid \Omega_i^r)\, p(\Omega_i^r \mid i) \right]}_{\text{the new effective conditional probability } q(j|i)} v_j, \quad (8)$$
where $\mathbb{I}(\cdot)$ is a binary indicator function. After reordering, one can see from (8) that we obtain an effective conditional probability $q(j|i)$ that tries to approximate the original $p(j|i)$. Each probability term depends on both the current location $i$ and the other location $j$, and the expectation is still taken with respect to a valid conditional probability (non-negative and summing to 1 over $\Omega_i$).
Requirement for Sub-quadratic Cost. We can immediately see the benefit of this formulation from the fact that the local expectation in (7) is independent of the position $i$. The full dependence is achieved via the multiplier $p(\Omega_i^r \mid i)$ where $j \in \Omega_i^r$. If we can design the local factorization such that:
1. the number of terms in (7) for $p(\cdot|i)$ over all $i \in [L]$, i.e., $\sum_{i=1}^{L} \left(n_i + |\Omega_i^0|\right)$, is sub-quadratic;
2. letting $\mathcal{U} = \{\Omega_i^r\}_{i \in [L],\, r \in [1, n_i]}$ be the unique set of partitions used for the local expectation calculation, the number of unique partitions $|\mathcal{U}|$ is sub-quadratic; and
3. the total number of unique calculations of the local expectations across all locations in (7), $\sum_{\Omega \in \mathcal{U}} |\Omega|$, is sub-quadratic;
then one can see that the overall computation and memory cost will be sub-quadratic with full attention support $\Omega_i^{\mathrm{Combiner}} = \Omega_i, \forall i \in [L]$. We will discuss in detail in Section 4 how to instantiate this principle by drawing inspiration from existing sparse transformers, and how to convert them into a full attention model almost for free with identical asymptotic complexity.
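As a quick sanity check of condition 1 (our own illustrative accounting, assuming a Fixed-like autoregressive partition with blocks of size roughly $\sqrt{L}$; not the paper's analysis), the following sketch counts the attention-score terms and shows the total grows as $O(L\sqrt{L})$:

import math

def combiner_fixed_cost(L):
    # Illustrative accounting only: each position i attends directly to its own block
    # (|Omega_i^0| = i % s + 1) and to one abstraction per preceding block (n_i = i // s),
    # with block size s = isqrt(L); each complete block's local expectation is computed once.
    s = math.isqrt(L) or 1
    per_position = sum((i % s + 1) + (i // s) for i in range(L))   # sum_i (|Omega_i^0| + n_i)
    unique_local = (L // s) * s                                    # sum over unique partitions of |Omega|
    return per_position + unique_local

for L in [256, 1024, 4096]:
    cost = combiner_fixed_cost(L)
    print(L, cost, cost / (L * math.sqrt(L)))   # ratio stays roughly constant => O(L * sqrt(L))

The printed ratio stays close to 1 as $L$ grows, consistent with the $O(L\sqrt{L})$ cost quoted for the Combiner-Fixed variant.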
Remark (Further Hierarchical Decomposition): We introduce the local decomposition with a one-layer partition of the support of $p(\cdot|i)$ for simplicity. In fact, such local decompositions can be stacked further, which introduces a partition tree. Specifically, we can further partition $\Omega_i^r$ with disjoint subsets $\{\Omega_i^{rk}\}_{k=1}^{n_r}$, and consider the local decomposition $p(j, \Omega_i^r \mid i) = p(j \mid \Omega_i^{r k_j}, i)\, p(\Omega_i^{r k_j} \mid \Omega_i^r, i)\, p(\Omega_i^r \mid i)$, where $k_j$ is the index of the sub-region to which $j$ belongs. Thus, we obtain a hierarchical decomposition of $p(j|i)$, which can also be plugged into (6) to yield a new full attention formulation.