Combiner: Full Attention Transformer
with Sparse Computation Cost
Hongyu Ren∗†, Hanjun Dai∗, Zihang Dai∗, Mengjiao Yang, Jure Leskovec†, Dale Schuurmans‡, Bo Dai
†Stanford University, {hyren,jure}@cs.stanford.edu
Google Research, Brain Team, {hadai,zihangd,sherryy,schuurmans,bodai}@google.com
‡University of Alberta
∗ indicates equal contribution. The work was completed during HR's internship at Google Brain.
35th Conference on Neural Information Processing Systems (NeurIPS 2021). arXiv:2107.05768v2 [cs.LG] 28 Oct 2021.
Abstract
Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity $O(L^2)$ with respect to the sequence length in attention layers, which restricts application in extremely long sequences. Most existing approaches leverage sparsity or low-rank assumptions in the attention matrix to reduce cost, but sacrifice expressiveness. Instead, we propose Combiner, which provides full attention capability in each attention head while maintaining low computation and memory complexity. The key idea is to treat the self-attention mechanism as a conditional expectation over embeddings at each location, and approximate the conditional distribution with a structured factorization. Each location can attend to all other locations, either via direct attention, or through indirect attention to abstractions, which are again conditional expectations of embeddings from corresponding local regions. We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention, resulting in the same sub-quadratic cost ($O(L \log L)$ or $O(L\sqrt{L})$). Combiner is a drop-in replacement for attention layers in existing transformers and can be easily implemented in common frameworks. An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach, yielding state-of-the-art results on several image and text modeling tasks.
1 Introduction
The Transformer [1] is a powerful neural network architecture that has demonstrated state-of-the-art performance in machine translation [2] and many other natural language processing (NLP) tasks via pretraining, using either unidirectional language modeling [3] or bidirectional language modeling [4–8]. It has also achieved excellent results in other domains like image recognition [9], code understanding [10], speech recognition [11], protein [12], music [13] and image [14] generative modeling. The core component of the Transformer is the attention mechanism, which computes dependencies between all pairs of positions in a sequence. However, for a sequence of length $L$, the expressiveness of pairwise attention comes at a quadratic cost $O(L^2)$ in both time and memory consumption. This makes the vanilla Transformer [1] prohibitive for applications that involve long sequences, including high-resolution images, protein sequences, or raw speech signals [15], where the sequence length $L$ is often larger than 10,000 [14].
Recently, there have been several attempts to scale up attention to long sequences. A popular class of methods sparsifies the attention matrix with different sparsity patterns, including local window [16, 17], local+stride [14], log-sparse [18], axial [19, 20], or learnable patterns through hashing [21] or clustering [22]. Sparse attention enjoys sub-quadratic cost, but is lossy in capturing all-pair relationships. Generally, sparse attention requires more layers [14, 20, 23] to achieve full autoregressive or bidirectional dependencies (or receptive fields [20]) for each location in a long sequence.
Alternatively, another line of research has tried to achieve scalability with an explicit low-rank assumption [24, 25] on the attention matrix or by using explicit feature maps of some kernels [26]. However, these explicit low-dimensional approximations might be too restricted for the potentially full-rank attention matrix, which uses exponential kernels that are effectively infinite dimensional [27]. The Performer [28] is among the first works that attempts to approximate regular full-rank attention with the random feature trick [29]. However, such random-feature based approaches [30] require many more bases to better approximate the exponential kernel [27], and empirically we found it produces inferior results in some sequence modeling tasks, such as density estimation.
In this paper we propose Combiner, a drop-in replacement for the vanilla quadratic attention mechanism with sub-quadratic computation and memory cost. Combiner still achieves full attention capability within each head of multi-head attention, unlike approaches that adopt sparse or low-rank approximations. As we will discuss, the standard attention computed at each location can be seen as the conditional expectation of the value embeddings at all feasible locations given the current location. Based on such an understanding, Combiner explicitly approximates this conditional distribution through a structured factorization of the probability space. Specifically, given a location $x$, the probability of attending to location $y$ can either be calculated directly via the query vector of $x$ and the key vector of $y$, or indirectly through a local abstraction, where $x$ first attends to the key vector that represents a group of locations containing $y$, and this probability is multiplied by the probability of choosing $y$ within that group. We refer to this model as Combiner since the conditional distribution in attention becomes a combination of several local attentions and direct attentions. This structured decomposition enables Combiner to take existing sparse attention patterns and convert them into corresponding design choices for probability factorizations that achieve full attention. As shown in Figure 1, Combiner achieves full attention with the same asymptotic complexity as sparse variants. Combiner can be easily implemented in most existing deep learning frameworks without the need for specialized hardware implementation, and is GPU/TPU friendly. In fact, both the fixed and learnable sparse attention patterns from many existing Transformer variants [14, 18, 20, 22] can be enhanced with such structured factorizations, with the same order of time or memory cost.
We validate Combiner on both autoregressive and bidirectional sequence modeling tasks over a variety of domains including text and images. We show that Combiner can achieve better perplexity and accuracy when using the same transformer architectures while being much faster in terms of runtime, and achieves state-of-the-art performance on density estimation on the standard datasets CIFAR-10 (2.77 bits/dim) and ImageNet-64 (3.42 bits/dim), as well as on the Long-Range Arena [31]. The implementation of Combiner can be found at https://github.com/google-research/google-research/tree/master/combiner.
2 Attention as Conditional Expectation
In this section, we revisit the formulation of the standard Transformer [1] from the perspective of conditional expectation, which inspires the derivation of Combiner.
Without loss of generality, we use a single sequence in the self-attention scenario. Given a sequence of $L$ embeddings $X = [x_1, x_2, \ldots, x_L]$, where $X \in \mathbb{R}^{L \times d}$ and each embedding $x_i \in \mathbb{R}^d$ is a $d$-dimensional vector, the core component of the Transformer is multi-head attention, where each head $h$ is a scaled dot-product attention:
$$A_h(X) = \mathrm{softmax}\left(\frac{Q_h}{\sqrt{d}}\, K_h^\top\right) V_h, \qquad \left\{Q_h = XW_h^Q,\; K_h = XW_h^K,\; V_h = XW_h^V\right\} \in \mathbb{R}^{L \times d}, \quad (1)$$
and the attention vector from each head $A_h(X)$ is concatenated and projected:
$$\mathrm{MultiHeadAttn}(X) = \left[A_1(X), A_2(X), \ldots, A_H(X)\right] W^o, \qquad W^o \in \mathbb{R}^{Hd \times d}. \quad (2)$$
Here $H$ is the total number of heads per Transformer layer. In this paper, we focus on how to approximate full attention within each head of multi-head attention. For ease of notation, we drop the head index $h$ whenever possible, and use lower-case letters $x_i, q_i, k_i, v_i \in \mathbb{R}^d$ to denote rows in $X, Q, K, V$ respectively, which correspond to a location $i$ in the original sequence of length $L$. We use $[n]$ to denote the set of positive integers $\{1, 2, \ldots, n\}$.
[Figure 1: attention matrices of several instantiations of Combiner in the autoregressive setting; panels (A) Fixed, (B) Logsparse, (C) Axial, (D) Combiner-Fixed, (E) Combiner-Logsparse, (F) Combiner-Axial; blue marks direct expectation, yellow marks local expectation.] We transform several sparse attention patterns, Fixed (A) [14], Logsparse (B) [18] and Axial (C) [20], into Combiner-Fixed (D), Combiner-Logsparse (E) and Combiner-Axial (F). Combiner approximates the conditional expectation (3) with a combination of direct expectation (blue) and local expectation (yellow). Our instantiations (D), (E) and (F) achieve full attention with the same sub-quadratic complexity.
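To make the quadratic cost in (1)-(2) concrete, the following is a minimal NumPy sketch of one multi-head attention layer; the shapes and variable names are our own illustration, not the paper's released implementation. The per-head $(L \times L)$ score matrix is exactly the source of the $O(L^2)$ time and memory cost.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_o):
    # Eq. (1)-(2): per-head scaled dot-product attention, then concatenation + projection.
    # X: (L, d); W_Q/W_K/W_V: lists of H matrices of shape (d, d); W_o: (H*d, d).
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each (L, d)
        scores = (Q / np.sqrt(Q.shape[-1])) @ K.T   # (L, L) score matrix: quadratic in L
        heads.append(softmax(scores) @ V)           # A_h(X) as in Eq. (1)
    return np.concatenate(heads, axis=-1) @ W_o     # Eq. (2)

# Toy usage with L = 8, d = 4, H = 2.
L, d, H = 8, 4, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))
W_Q = [rng.normal(size=(d, d)) for _ in range(H)]
W_K = [rng.normal(size=(d, d)) for _ in range(H)]
W_V = [rng.normal(size=(d, d)) for _ in range(H)]
W_o = rng.normal(size=(H * d, d))
out = multi_head_attention(X, W_Q, W_K, W_V, W_o)   # (L, d)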
For a position $i \in [L]$, the attention formulation (1) can be viewed as a conditional expectation of rows in $V$. Specifically, since softmax outputs a probability distribution, we can rewrite (1) as
$$A(x_i) = \mathbb{E}_{p(j|i)}\left[v_j\right], \qquad p(j|i) = \frac{1}{Z(x_i)} \exp\left(\frac{q_i}{\sqrt{d}}\, k_j^\top\right), \quad (3)$$
where $p(j|i)$ denotes the conditional probability at position $j$ given the token at position $i$, and the partition function $Z(x_i) = \sum_{j \in \Omega_i} \exp\left(\frac{q_i}{\sqrt{d}} k_j^\top\right)$ is computed over the support $\Omega_i$. The support $\Omega_i$ of $p(j|i)$ defines the set of valid locations that the $i$-th token can attend to. For instance, the support set in autoregressive language modeling (LM) consists of all previous tokens, i.e., $\Omega_i^{\mathrm{LM}} = [i]$;² in masked language modeling (MLM) the support consists of all tokens in the sequence, i.e., $\Omega_i^{\mathrm{MLM}} = [L]$. That is, $\Omega_i^{\mathrm{LM}}$ and $\Omega_i^{\mathrm{MLM}}$ represent the full attention capability in the LM and MLM settings, respectively.
² Following the conventional implementation, the input sequence is "right-shifted" so that position $i$ can attend to itself in the LM setting.
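The conditional-expectation reading of (3) can be checked with a short sketch; the helper below is our own illustration (not from the paper's code) and simply evaluates $A(x_i) = \mathbb{E}_{p(j|i)}[v_j]$ over an explicit support, e.g. $\Omega_i^{\mathrm{LM}} = [i]$ in the autoregressive case.

import numpy as np

def attention_as_expectation(q_i, K, V, support):
    # Eq. (3): A(x_i) = E_{p(j|i)}[v_j], with p(j|i) a softmax restricted to Omega_i.
    # q_i: (d,); K, V: (L, d); support: list of valid positions j (0-indexed here).
    d = q_i.shape[-1]
    logits = np.array([q_i @ K[j] / np.sqrt(d) for j in support])
    p = np.exp(logits - logits.max())
    p = p / p.sum()                                   # p(j|i), normalized by Z(x_i)
    return sum(p_j * V[j] for p_j, j in zip(p, support))

# LM support: positions 0..i (position i attends to itself after right-shifting);
# MLM would instead pass support = range(L).
L, d = 6, 4
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, L, d))
i = 3
a_i = attention_as_expectation(Q[i], K, V, support=list(range(i + 1)))  # (d,)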
3 Combiner: Full Attention via Structured Conditional Expectation
The complexity of $p(j|i)$ is the bottleneck of the computation for $A(x_i)$. Generally, in existing sparse transformers, the support of $p(j|i)$ is sparsified to reduce the computation and memory complexity, e.g., $\Omega_i^{\mathrm{Sparse}} \subsetneq \Omega_i^{\mathrm{LM}}$ for LM and $\Omega_i^{\mathrm{Sparse}} \subsetneq \Omega_i^{\mathrm{MLM}}$ for MLM, but this can lead to either reduced capacity or limited applicability. We defer detailed discussion of the full capacity of the model to Appendix A. In this section we introduce Combiner, which achieves $\Omega_i^{\mathrm{Combiner}} = \Omega_i^{\mathrm{LM}}$ for LM and $\Omega_i^{\mathrm{Combiner}} = \Omega_i^{\mathrm{MLM}}$ for MLM, while still maintaining sub-quadratic computation and memory cost. Below we denote $\Omega_i$ as the support for full attention when there is no ambiguity or need to distinguish between LM and MLM. We introduce the main design framework in Section 3.1 and possible parameterizations in Section 3.2. Then in Section 3.3 we analyze the trade-off of Combiner.
3.1 Local Factorization for Conditional Expectation
The main idea of Combiner is to exploit a hierarchical structure for conditional probability modeling in (3), which provides the opportunity for reducing computation complexity while maintaining the same support. Specifically, we introduce support variables $\Omega_i^r$, for $r = 0, \ldots, n_i$ and $i \in [L]$. The support variables are disjoint, i.e., $\Omega_i^r \cap \Omega_i^s = \emptyset, \forall r \neq s$, and $\cup_{r=0}^{n_i} \Omega_i^r = \Omega_i$. Then we can factorize $p(j|i)$ as
$$p(j|i) = \sum_{r=0}^{n_i} p(j, \Omega_i^r \mid i) = \sum_{r=0}^{n_i} p(j \mid \Omega_i^r, i)\, p(\Omega_i^r \mid i) = p(j \mid \Omega_i^{r_j}, i)\, p(\Omega_i^{r_j} \mid i), \quad (4)$$
where $r_j$ denotes the index of the support to which $j$ belongs. The last equality arises from the fact that the $\Omega_i^r$ are disjoint from each other ($\Omega_i^r \cap \Omega_i^s = \emptyset, \forall r \neq s$). Therefore, there is only one support, $\Omega_i^{r_j}$, containing $j$. The remaining terms, where $j \notin \Omega_i^r$ for $r \neq r_j$, are all zero since $p(j \mid \Omega_i^r, i) = 0$.
Furthermore, assuming $\Omega_i^{r_j}$ is a sufficient statistic, i.e., $j$ and $i$ are independent given $\Omega_i^{r_j}$, we obtain
$$p(j|i) = p(j \mid \Omega_i^{r_j})\, p(\Omega_i^{r_j} \mid i). \quad (5)$$
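As a concrete illustration (ours, not from the paper): take an LM position $i = 5$ with support $\Omega_5 = \{1, \ldots, 5\}$, and partition it into a direct part $\Omega_5^0 = \{4, 5\}$ and one abstraction $\Omega_5^1 = \{1, 2, 3\}$. For $j = 2$ we have $r_j = 1$, so (5) gives $p(2 \mid 5) = p(2 \mid \Omega_5^1)\, p(\Omega_5^1 \mid 5)$: position 5 first attends to the abstraction of $\{1, 2, 3\}$, and that probability mass is then distributed within the group, while $j \in \{4, 5\}$ is handled by direct attention in the factorized expectation below.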
Given the partition $\{\Omega_i^r\}_{r=0}^{n_i}$, the attention form in (3) can be rewritten as
$$A(x_i) = \mathbb{E}_{p(j|i)}\left[v_j\right] = \sum_{r=0}^{n_i} \sum_{j \in \Omega_i^r} p(j, \Omega_i^r \mid i)\, v_j \quad (6)$$
$$= \underbrace{\sum_{j \in \Omega_i^0} \tilde{p}(j|i)\, v_j}_{\text{direct expectation}} \;+\; \underbrace{\sum_{r=1}^{n_i} p(\Omega_i^r \mid i) \sum_{j \in \Omega_i^r} p(j \mid \Omega_i^r)\, v_j}_{\text{local expectation}}, \quad (7)$$
where we consider direct attention in partition $\Omega_i^0$ and apply the local factorization (5) to the partitions $r = 1, \ldots, n_i$. Here $\tilde{p}(j|i) \propto p(j|i)$ but with a different normalization constant, which will be explained below. We refer to this model as Combiner since the structured attention (7) combines the direct expectation over $\Omega_i^0$ and multiple local expectations via $p(j \mid \Omega_i^r)$ and $p(\Omega_i^r \mid i)$ to form the final conditional expectation.
Equivalently, we can also rewrite the structured attention (7) as
$$A(x_i) = \sum_{j \in \Omega_i} \underbrace{\left[ \mathbb{I}(j \in \Omega_i^0)\, \tilde{p}(j|i) + \sum_{r=1}^{n_i} \mathbb{I}(j \in \Omega_i^r)\, p(j \mid \Omega_i^r)\, p(\Omega_i^r \mid i) \right]}_{\text{the new effective conditional probability } q(j|i)} v_j, \quad (8)$$
where $\mathbb{I}(\cdot)$ is a binary indicator function. After reordering, one can see from (8) that we obtain an effective conditional probability $q(j|i)$ that tries to approximate the original $p(j|i)$. Each probability term depends on both the current location $i$ and the other location $j$, and the expectation is still taken with respect to a valid conditional probability (non-negative and summing to 1 over $\Omega_i$).
Requirement for Sub-quadratic Cost. We can immediately see the benefit of this formulation from the fact that the local expectation in (7) is independent of the position $i$. The full dependence is achieved via the multiplier $p(\Omega_i^r \mid i)$ where $j \in \Omega_i^r$. If we can design the local factorization such that:
1. the number of terms in (7) for $p(\cdot|i)$ over all $i \in [L]$, i.e., $\sum_{i=1}^{L} \left(n_i + |\Omega_i^0|\right)$, is sub-quadratic;
2. letting $\mathcal{U} = \{\Omega_i^r\}_{i \in [L],\, r \in [1, n_i]}$ be the unique set of partitions used for the local expectation calculation, the number of unique partitions $|\mathcal{U}|$ is sub-quadratic; and
3. the total number of unique calculations of the local expectations across all locations in (7), $\sum_{\Omega \in \mathcal{U}} |\Omega|$, is sub-quadratic;
then one can see that the overall computation and memory cost will be sub-quadratic with full attention support $\Omega_i^{\mathrm{Combiner}} = \Omega_i, \forall i \in [L]$. We will discuss in detail in Section 4 how to instantiate this principle by drawing inspiration from existing sparse transformers, and how to convert them into a full attention model almost for free with identical asymptotic complexity.
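As a quick sanity check of condition 1 (our own illustrative accounting, assuming a Fixed-like autoregressive partition with blocks of size roughly $\sqrt{L}$; not the paper's analysis), the following sketch counts the attention-score terms and shows the total grows as $O(L\sqrt{L})$:

import math

def combiner_fixed_cost(L):
    # Illustrative accounting only: each position i attends directly to its own block
    # (|Omega_i^0| = i % s + 1) and to one abstraction per preceding block (n_i = i // s),
    # with block size s = isqrt(L); each complete block's local expectation is computed once.
    s = math.isqrt(L) or 1
    per_position = sum((i % s + 1) + (i // s) for i in range(L))   # sum_i (|Omega_i^0| + n_i)
    unique_local = (L // s) * s                                    # sum over unique partitions of |Omega|
    return per_position + unique_local

for L in [256, 1024, 4096]:
    cost = combiner_fixed_cost(L)
    print(L, cost, cost / (L * math.sqrt(L)))   # ratio stays roughly constant => O(L * sqrt(L))

The printed ratio stays close to 1 as $L$ grows, consistent with the $O(L\sqrt{L})$ cost quoted for the Combiner-Fixed variant.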
Remark (Further Hierarchical Decomposition): We introduce the local decomposition with a one-layer partition of the support of $p(\cdot|i)$ for simplicity. In fact, such local decompositions can be stacked further, which introduces a partition tree. Specifically, we can further partition $\Omega_i^r$ with disjoint subsets $\{\Omega_i^{rk}\}_{k=1}^{n_r}$, and consider the local decomposition $p(j, \Omega_i^r \mid i) = p(j \mid \Omega_i^{r k_j}, i)\, p(\Omega_i^{r k_j} \mid \Omega_i^r, i)\, p(\Omega_i^r \mid i)$, where $k_j$ is the index of the sub-region to which $j$ belongs. Thus, we obtain a hierarchical decomposition of $p(j|i)$, which can also be plugged into (6) to yield a new full attention formulation.