Attention Mechanisms in Computer Vision:
A Survey
Meng-Hao Guo, Tian-Xing Xu, Jiang-Jiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai
Zhang, Ralph R. Martin, Ming-Ming Cheng, Senior Member, IEEE, Shi-Min Hu, Senior Member, IEEE,
Abstract—Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention
mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention
mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have
achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video
understanding, image generation, 3D vision, multi-modal tasks and self-supervised learning. In this survey, we provide a comprehensive
review of various attention mechanisms in computer vision and categorize them according to approach, such as channel attention, spatial
attention, temporal attention and branch attention; a related repository https://github.com/MenghaoGuo/Awesome-Vision-Attentions is
dedicated to collecting related work. We also suggest future directions for attention mechanism research.
Index Terms—Attention, Transformer, Survey, Computer Vision, Deep Learning, Salience.
1 INTRODUCTION
Methods for diverting attention to the most impor-
tant regions of an image and disregarding irrelevant
parts are called attention mechanisms; the human visual
system uses one [1], [2], [3], [4] to assist in analyzing and
understanding complex scenes efficiently and effectively.
This in turn has inspired researchers to introduce attention
mechanisms into computer vision systems to improve their
performance. In a vision system, an attention mechanism can
be treated as a dynamic selection process that is realized by
adaptively weighting features according to the importance
of the input. Attention mechanisms have provided benefits
in many visual tasks, e.g. image classification [5], [6],
object detection [7], [8], semantic segmentation [9], [10], face
recognition [11], [12], person re-identification [13], [14], action
recognition [15], [16], few-shot learning [17], [18], medical
image processing [19], [20], image generation [21], [22], pose
estimation [23], super resolution [24], [25], 3D vision [26],
[27], and multi-modal tasks [28], [29].
In the past decade, the attention mechanism has played
an increasingly important role in computer vision; Fig. 3
briefly summarizes the history of attention-based models
in computer vision in the deep learning era. Progress can
be coarsely divided into four phases. The first phase begins
with RAM [31], pioneering work that combined deep neural
networks with attention mechanisms. It recurrently predicts
the important region and updates the whole network in an
end-to-end manner through a policy gradient. Later, various
works [21], [35] adopted a similar strategy for attention in
• M.H. Guo, T.X. Xu, Z.N. Liu, T.J. Mu, S.H. Zhang and S.M. Hu are with the BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.
• J.J. Liu, P.T. Jiang and M.M. Cheng are with TKLNDST, College of Computer Science, Nankai University, Tianjin 300350, China.
• R.R. Martin was with the School of Computer Science and Informatics, Cardiff University, UK.
• S.M. Hu is the corresponding author. E-mail: shimin@tsinghua.edu.cn.
[Fig. 1: the six attention categories with representative methods: channel attention (e.g. SENet), answering what to attend to; spatial attention (e.g. self-attention), where to attend; temporal attention (e.g. GLTR), when to attend; branch attention (e.g. SKNet), which to attend to; and the hybrids channel & spatial attention (e.g. CBAM) and spatial & temporal attention (e.g. STA-LSTM).]
Fig. 1. Attention mechanisms can be categorised according to data domain. These include four fundamental categories of channel attention, spatial attention, temporal attention and branch attention, and two hybrid categories, combining channel & spatial attention and spatial & temporal attention. ∅ means such combinations do not (yet) exist.
vision. In this phase, recurrent neural networks (RNNs) were
necessary tools for an attention mechanism. At the start of
the second phase, Jaderberg et al. [32] proposed the STN,
which introduces a sub-network to predict an affine trans-
formation used to select important regions in the input.
Explicitly predicting discriminatory input features is the
major characteristic of the second phase; DCNs [7], [36] are
representative works. The third phase began with SENet [5]
[Fig. 2: schematic showing, for each attention type, where attention is applied within a feature tensor with channel (C), spatial (H, W) and temporal (T) dimensions.]
Fig. 2. Channel, spatial and temporal attention can be regarded as operating on different domains. C represents the channel domain, H and W represent spatial domains, and T means the temporal domain. Branch attention is complementary to these. Figure following [30].
that presented a novel channel-attention network which
implicitly and adaptively predicts the potential key features.
CBAM [6] and ECANet [37] are representative works of this
phase. The last phase is the self-attention era. Self-attention
was first proposed in [33] and rapidly brought great
advances in the field of natural language processing [33],
[38], [39]. Wang et al. [15] took the lead in introducing self-
attention to computer vision and presented a novel non-local
network with great success in video understanding and
object detection. It was followed by a series of works such as
EMANet [40], CCNet [41], HamNet [42] and the Stand-Alone
Network [43], which improved speed, quality of results,
and generalization capability. Recently, various pure deep
self-attention networks (visual transformers) [27], [34], [44],
[45], [46], [47], [48], [49] have appeared, showing the huge
potential of attention-based models. It is clear that attention-
based models have the potential to replace convolutional
neural networks and become a more powerful and general
architecture in computer vision.
The goal of this paper is to summarize and classify
current attention methods in computer vision. Our approach
TABLE 1
Key notation in this paper. Other minor notation is explained where used.

Symbol     Description
X          input feature map, X ∈ R^(C×H×W)
Y          output feature map
W          learnable kernel weight
FC         fully-connected layer
Conv       convolution
GAP        global average pooling
GMP        global max pooling
[ ]        concatenation
δ          ReLU activation [51]
σ          sigmoid activation
tanh       tanh activation
Softmax    softmax activation
BN         batch normalization [52]
Expand     expand input by repetition
is shown in Fig. 1 and further explained in Fig. 2: it is
based around data domain. Some methods consider the
question of when the important data occurs, others where it
occurs, and so on, and accordingly try to find key times or locations
in the data. We divide existing attention methods into
six categories, comprising four basic categories: channel
attention (what to pay attention to [50]), spatial attention (where
to pay attention), temporal attention (when to pay attention)
and branch attention (which to pay attention to), along with two
hybrid categories: channel & spatial attention and
spatial & temporal attention. These ideas are further briefly
summarized together with related works in Tab. 2.
The main contributions of this paper are:
• a systematic review of visual attention methods, covering the unified description of attention mechanisms, the development of visual attention mechanisms as well as current research,
• a categorisation grouping attention methods according to their data domain, allowing us to link visual attention methods independently of their particular application, and
• suggestions for future research in visual attention.
Sec. 2 considers related surveys, then Sec. 3 is the main
body of our survey. Suggestions for future research are given
in Sec. 4 and finally, we give conclusions in Sec. 5.
2 OTHER SURVEYS
In this section, we briefly compare this paper to various
existing surveys which have reviewed attention methods
and visual transformers. Chaudhari et al. [140] provide
a survey of attention models in deep neural networks
which concentrates on their application to natural language
processing, while our work focuses on computer vision.
Three more specific surveys [141], [142], [143] summarize the
development of visual transformers while our paper reviews
attention mechanisms in vision more generally, not just self-
attention mechanisms. Wang et al. [144] present a survey of
attention models in computer vision, but it only considers
RNN-based attention models, which form just a part of our
survey. In addition, unlike previous surveys, we provide
a classification which groups various attention methods
according to their data domain, rather than according to
their field of application. Doing so allows us to concentrate
[Fig. 3: timeline of key attention models:
RAM (2014.06.24): adopts an RNN and reinforcement learning to achieve spatial attention.
STN (2015.06.05): proposes to select important regions by learning an affine transformation.
Highway Network (2015.07.22): proposes to combine different branches by using an attention method.
SENet (2017.09.05): proposes to adaptively recalibrate channels by using attention weights.
Non-Local Network (2017.11.21): first successfully uses self-attention to model non-local relationships in computer vision.
CBAM (2018.07.17): proposes to combine channel attention with spatial attention.
2018-2020: popular attention-related networks such as SAGAN, OCNet, DANet, EMANet, OCRNet and HamNet.
ViT (2020.10.22): the first pure transformer structure to achieve great results in computer vision.
2020-now: various transformer variants such as PCT, Deformable DETR, DeiT, T2T-ViT, IPT, PVT and Swin Transformer.]
Fig. 3. Brief summary of key developments in attention in computer vision, which have loosely occurred in four phases. Phase 1 adopted RNNs to construct attention, a representative method being RAM [31]. Phase 2 explicitly predicted important regions, a representative method being STN [32]. Phase 3 implicitly completed the attention process, a representative method being SENet [5]. Phase 4 used self-attention methods [15], [33], [34].
TABLE 2
Brief summary of attention categories and key related works.

Channel attention
  Description: Generate attention mask across the channel domain and use it to select important channels.
  Related work: [5], [37], [53], [54], [55], [56], [57], [58], [25], [59], [60]

Spatial attention
  Description: Generate attention mask across spatial domains and use it to select important spatial regions (e.g. [15], [61]) or predict the most relevant spatial position directly (e.g. [7], [31]).
  Related work: [8], [9], [15], [21], [31], [32], [34], [35], [22], [26], [62], [63], [64], [65], [66], [67], [41], [68], [69], [70], [71], [72], [73], [74], [8], [34], [42], [43], [75], [76], [77], [78], [27], [44], [45], [46], [79], [80], [81], [82], [61], [83], [84], [85], [86], [87], [88], [89], [47], [90], [91], [92], [93], [94], [95], [96], [97], [98], [99], [100], [101], [102], [103], [104], [20], [105], [106], [107], [108], [109]

Temporal attention
  Description: Generate attention mask in time and use it to select key frames.
  Related work: [110], [111], [112]

Branch attention
  Description: Generate attention mask across the different branches and use it to select important branches.
  Related work: [113], [114], [115], [116]

Channel & spatial attention
  Description: Predict channel and spatial attention masks separately (e.g. [6], [117]) or generate a joint 3-D channel, height, width attention mask directly (e.g. [118], [119]) and use it to select important features.
  Related work: [6], [50], [117], [119], [120], [121], [122], [10], [101], [118], [123], [124], [125], [126], [13], [14], [127], [128], [129]

Spatial & temporal attention
  Description: Compute temporal and spatial attention masks separately (e.g. [16], [130]), or produce a joint spatiotemporal attention mask (e.g. [131]), to focus on informative regions.
  Related work: [130], [132], [133], [134], [135], [136], [137], [138], [139]
on the attention methods in their own right, rather than
treating them as supplementary to other tasks.
3 ATTENTION METHODS IN COMPUTER VISION
In this section, we first sum up a general form for the
attention mechanism, based on the recognition process of the
human visual system, in Sec. 3.1. Then we review the various
categories of attention models given in Fig. 1, with a
subsection dedicated to each category. In each, we tabulate
representative works for that category. We also introduce
each category of attention strategy more deeply, considering
its development in terms of motivation, formulation and
function.
3.1 General form
When viewing a scene in daily life, we focus on the
discriminative regions, and process those regions quickly.
This process can be formulated as:

Attention = f(g(x), x)   (1)

Here g(x) represents generating attention, which corresponds
to the process of attending to the discriminative regions;
f(g(x), x) means processing the input x based on the attention
g(x), which is consistent with processing the critical regions
and obtaining information.
With the above definition, we find that almost all existing
attention mechanisms can be written in terms of the above
formulation. Here we take self-attention [15] and squeeze-and-
excitation (SE) attention [5] as examples.
[Fig. 4: taxonomy tree of attention mechanisms.
Channel attention: SENet, improved via the squeeze module (GSoP-Net, FcaNet, etc.), the excitation module (ECANet), or both squeeze and excitation modules (SRM, GCT, etc.).
Spatial attention: RNN-based attention (RAM; DRAW, Glimpse Net, etc.); explicit prediction of the relevant region (STN; DCN, DCN V2, etc.); implicit prediction of a soft mask (GENet); self-attention based (Non-local), including visual transformers (DETR, ViT, etc.), local self-attention (SASA, SAN, etc.) and efficient self-attention (CCNet, EMANet, etc.).
Branch attention: combining features of different branches (Highway Network; SKNet, ResNeSt, etc.) or combining different convolution kernels (CondConv; Dynamic Conv.).
Channel & spatial attention: split channel attention and spatial attention (CBAM, BAM, scSE, et al.); directly estimate a 3D attention map (Residual attention; SimAM, Strip Pooling, SCNet); cross-dimension interaction (Triplet Attention); long-range dependencies (Coordinate Attention, DANet); relation-aware attention (RGA).
Temporal attention: combining local attention and global attention (TAM); self-attention based (GLTR).
Temporal & spatial attention: split temporal attention and spatial attention (RSTAN); jointly produce spatial and temporal attention (STA); pairwise relation-based (STGCN).]
Fig. 4. Developmental context of visual attention.
For self-attention, g(x) and f(g(x), x) can be written as:
Q, K, V = Linear(x)              (2)
g(x) = Softmax(QK)               (3)
f(g(x), x) = g(x)V               (4)
For SE, g(x) and f(g(x), x) can be written as:
g(x) = Sigmoid(MLP(GAP(x))) (6)
f(g(x), x) = g(x)x (7)
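To make this concrete, the sketch below (our illustration, not code from the paper) instantiates g and f for both examples in PyTorch; the tensor shapes and the use of plain weight matrices for the linear maps are our assumptions.

```python
# Minimal sketch of Eq. 1 instantiated for self-attention (Eq. 2-4)
# and SE attention (Eq. 6-7). Illustrative only; shapes are assumed.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (N, C) token features; w_q, w_k, w_v: (C, C) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # Q, K, V = Linear(x)
    g = F.softmax(q @ k.transpose(-2, -1), dim=-1)  # g(x) = Softmax(QK)
    return g @ v                                    # f(g(x), x) = g(x)V

def se_attention(x, w1, w2):
    """x: (C, H, W) feature map; w1: (C/r, C), w2: (C, C/r) MLP weights."""
    gap = x.mean(dim=(-2, -1))                      # GAP(x): (C,)
    g = torch.sigmoid(w2 @ torch.relu(w1 @ gap))    # g(x) = Sigmoid(MLP(GAP(x)))
    return g[:, None, None] * x                     # f(g(x), x) = g(x)x
```

In both cases g produces the attention weights and f applies them to the input, matching Eq. 1.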
In the following, we will introduce various attention
mechanisms and show how each fits the above formulation.
3.2 Channel Attention
In deep neural networks, different channels in different
feature maps usually represent different objects [50]. Channel
attention adaptively recalibrates the weight of each channel,
and can be viewed as an object selection process, thus
determining what to pay attention to. Hu et al. [5] first
proposed the concept of channel attention and presented
SENet for this purpose. As Fig. 4 shows, and we discuss
shortly, three streams of work continue to improve channel
attention in different ways.
In this section, we first summarize the representative
channel attention works, specifying the processes g(x) and
f(g(x), x) of Eq. 1 for each, in Tab. 3 and Fig. 5. Then
we discuss the various channel attention methods along with
their respective development.
3.2.1 SENet
SENet [5] pioneered channel attention. The core of SENet
is a squeeze-and-excitation (SE) block which is used to collect
global information, capture channel-wise relationships and
improve representation ability.
SE blocks are divided into two parts, a squeeze module
and an excitation module. Global spatial information is
collected in the squeeze module by global average pooling.
The excitation module captures channel-wise relationships
and outputs an attention vector by using fully-connected
layers and non-linear layers (ReLU and sigmoid). Then, each
channel of the input feature is scaled by multiplying the
corresponding element in the attention vector. Overall, a
squeeze-and-excitation block F_se (with parameters θ), which
takes X as input and outputs Y, can be formulated as:

s = F_se(X, θ) = σ(W_2 δ(W_1 GAP(X)))   (8)
Y = sX                                   (9)
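As a concrete illustration, a minimal SE block following Eqs. 8 and 9 might be implemented as below (a sketch on our part, not the authors' released code; the reduction ratio r = 16 is the common default from [5]):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: s = sigmoid(W2 relu(W1 GAP(X))), Y = sX."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)  # W1: squeeze to C/r
        self.fc2 = nn.Linear(channels // r, channels)  # W2: restore to C

    def forward(self, x):                    # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))               # squeeze: GAP over H, W -> (N, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))  # excitation
        return s[:, :, None, None] * x       # rescale each input channel
```

As the text notes next, such a block is light enough to be appended after each residual unit.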
SE blocks play the role of emphasizing important channels
while suppressing noise. An SE block can be added after each
residual unit [145] due to their low computational resource
requirements. However, SE blocks have shortcomings. In
the squeeze module, global average pooling is too simple
to capture complex global information. In the excitation
module, fully-connected layers increase the complexity of the
model. As Fig. 4 indicates, later works attempt to improve the
outputs of the squeeze module (e.g. GSoP-Net [54]), reduce
the complexity of the model by improving the excitation
module (e.g. ECANet [37]), or improve both the squeeze
module and the excitation module (e.g. SRM [55]).
3.2.2 GSoP-Net
An SE block captures global information by only using global
average pooling (i.e. first-order statistics), which limits its
modeling capability, in particular the ability to capture high-
order statistics.
To address this issue, Gao et al. [54] proposed to improve
the squeeze module by using a global second-order pooling
(GSoP) block to model high-order statistics while gathering
global information.
Like an SE block, a GSoP block also has a squeeze
module and an excitation module. In the squeeze module,
a GSoP block first reduces the number of channels from c
to c′ (c′ < c) using a 1×1 convolution, then computes a
c′ × c′ covariance matrix for the different channels to obtain
their correlation. Next, row-wise normalization is performed
on the covariance matrix. Each entry (i, j) in the normalized
covariance matrix explicitly relates channel i to channel j.
In the excitation module, a GSoP block performs row-wise
convolution to maintain structural information and output a
vector. Then a fully-connected layer and a sigmoid function
are applied to get a c-dimensional attention vector. Finally, it
multiplies the input features by the attention vector, as in an
SE block. A GSoP block can be formulated as:

s = F_gsop(X, θ) = σ(W RC(Cov(Conv(X))))   (10)
Y = sX                                      (11)

Here, Conv(·) reduces the number of channels, Cov(·)
computes the covariance matrix and RC(·) means row-wise
convolution.
By using second-order pooling, GSoP blocks have im-
proved the ability to collect global information over the
SE block. However, this comes at the cost of additional
computation. Thus, a single GSoP block is typically added
after several residual blocks.
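The sketch below is our reading of the description above (not the official GSoP-Net code); the choice of c′, the number of row-wise filters, and the exact normalization are assumptions:

```python
import torch
import torch.nn as nn

class GSoPBlock(nn.Module):
    """Second-order channel attention: s = sigmoid(W RC(Cov(Conv(X)))), Y = sX."""
    def __init__(self, c: int, c_prime: int = 64, filters_per_row: int = 4):
        super().__init__()
        self.reduce = nn.Conv2d(c, c_prime, kernel_size=1)   # Conv: c -> c'
        # Row-wise convolution (RC): a grouped conv treats each covariance
        # row as its own channel, so rows are filtered independently.
        self.rc = nn.Conv2d(c_prime, filters_per_row * c_prime,
                            kernel_size=(1, c_prime), groups=c_prime)
        self.fc = nn.Linear(filters_per_row * c_prime, c)    # W: back to c dims

    def forward(self, x):                          # x: (N, C, H, W)
        z = self.reduce(x)                         # (N, c', H, W)
        h, w = z.shape[-2:]
        z = z.flatten(2)                           # (N, c', HW)
        z = z - z.mean(dim=2, keepdim=True)        # center before covariance
        cov = z @ z.transpose(1, 2) / (h * w)      # Cov: (N, c', c')
        cov = cov / (cov.abs().sum(dim=2, keepdim=True) + 1e-5)  # row-wise norm
        v = self.rc(cov.unsqueeze(2))              # RC: (N, 4*c', 1, 1)
        s = torch.sigmoid(self.fc(v.flatten(1)))   # attention vector: (N, C)
        return s[:, :, None, None] * x             # Y = sX
```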
3.2.3 SRM
Motivated by successes in style transfer, Lee et al. [55] pro-
posed the lightweight style-based recalibration module (SRM).
SRM combines style transfer with an attention mechanism. Its
main contribution is style pooling which utilizes both mean
and standard deviation of the input features to improve
its capability to capture global information. It also adopts
a lightweight channel-wise fully-connected (CFC) layer, in
place of the original fully-connected layer, to reduce the
computational requirements.
Given an input feature map X ∈ R^(C×H×W), SRM first
collects global information by using style pooling (SP(·)),
which combines global average pooling and global standard
deviation pooling. Then a channel-wise fully-connected
(CFC(·)) layer (i.e. fully connected per channel), batch
normalization BN and a sigmoid function σ are used to provide
the attention vector. Finally, as in an SE block, the input
features are multiplied by the attention vector. Overall, an
SRM can be written as:

s = F_srm(X, θ) = σ(BN(CFC(SP(X))))   (12)
Y = sX                                 (13)
The SRM block improves both squeeze and excitation mod-
ules, yet can be added after each residual unit like an SE
block.
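A minimal sketch of SRM as described (our illustration, not the released implementation; we model the CFC layer as an independent two-input linear map per channel, one plausible reading of "fully connected per channel"):

```python
import torch
import torch.nn as nn

class SRM(nn.Module):
    """Style-based recalibration: s = sigmoid(BN(CFC(SP(X)))), Y = sX."""
    def __init__(self, channels: int):
        super().__init__()
        # CFC: per-channel weights over the two pooled statistics (mean, std);
        # no mixing across channels, unlike the full FC layer in an SE block.
        self.cfc = nn.Parameter(torch.randn(channels, 2) * 0.1)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x):                       # x: (N, C, H, W)
        mean = x.mean(dim=(2, 3))               # style pooling: channel means
        std = x.std(dim=(2, 3))                 # ...and standard deviations
        sp = torch.stack([mean, std], dim=2)    # SP(X): (N, C, 2)
        z = (sp * self.cfc).sum(dim=2)          # CFC: (N, C)
        s = torch.sigmoid(self.bn(z))           # BN, then sigmoid gate
        return s[:, :, None, None] * x          # Y = sX
```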
3.2.4 GCT
Due to the computational demand and number of parameters
of the fully connected layer in the excitation module, it
is impractical to use an SE block after each convolution
layer. Furthermore, using fully connected layers to model
channel relationships is an implicit procedure. To overcome
the above problems, Yang et al. [56] proposed the gated channel
transformation (GCT) to efficiently collect information while
explicitly modeling channel-wise relationships.
Unlike previous methods, GCT first collects global
information by computing the l2-norm of each channel. Next,
a learnable vector α is applied to scale the feature. Then a
competition mechanism is adopted by channel normalization
to interact between channels. Like other common
normalization methods, a learnable scale parameter γ and bias β
are applied to rescale the normalization. However, unlike
previous methods, GCT adopts tanh activation to control
the attention vector. Finally, it not only multiplies the input