self-attention at low (linear computational and memory) costs without sacrificing accuracy. PADRe’s
unifying framework provides a guide for designing alternative transformer architectures. In particular,
we observe and demonstrate that many recently-proposed attention replacement mechanisms (e.g.,
Hyena [20], Mamba [8], Conv2Former [11], SimA [13], Castling-ViT [36]), as well as the standard
attention itself [30], may in fact be interpreted as specific instances of the PADRe framework (see
Table 1).
Our framework relies on polynomial approximants in the input to replace the standard self-attention
mechanism. Polynomials are well-studied objects known to be powerful and efficient multi-
dimensional function approximants over compact sets [28]. We use these existing results, together
with the aforementioned observations, as guiding principles. We also leverage simple, mobile-friendly
operations, such as Hadamard products, to induce nonlinearities, providing the necessary modeling
capacity of the polynomial functions. Moreover, our design allows for easily and efficiently scaling up
the modeling capacity of existing network architectures by increasing the degree, while maintaining
linear computational complexity and memory cost.
In addition to discussing the theoretical underpinnings and unifying properties of PADRe, we
further experimentally demonstrate the significant advantages of PADRe over the standard self-
attention through a variety of experiments on practical computer vision tasks. More specifically, by
replacing the self-attention operation with our design, we maintain model performance while
incurring only linear computational and memory costs. Further, by profiling on
actual hardware platforms, we demonstrate the significantly better on-device efficiency of PADRe as
compared to the standard self-attention operation.
We summarize the main contributions of the paper as follows:
• We propose PADRe, a unifying framework for replacing the attention mechanism based on
polynomial approximants, which provides an efficient, linear-complexity alternative to the
standard self-attention without sacrificing accuracy.
• We show that PADRe provides a unifying formulation for many of the recently-proposed
attention replacement schemes, including Hyena [20], Mamba [8], Conv2Former [11],
SimA [13], and Castling-ViT [36].
• We implement a specific instance of PADRe following the proposed design principles. We
demonstrate the performance and advantages of PADRe through multiple experiments on
diverse, practical computer vision applications. Our PADRe-based models achieve similar
or better accuracy than the original ones while incurring significantly lower costs. Our
on-device profiling further validates the computational efficiency of PADRe.
2 Related Work
Efficient Vision Transformer Backbones: A majority of existing research proposes holistically
designed vision transformer backbones. These backbones are capable of extracting features for
various vision tasks, such as image classification, object detection, and segmentation. Most of these
methods employ a hierarchical architecture, where the spatial dimension of the image feature map is
progressively downsampled [17]. This strategy allows the deeper transformer layers to operate on
smaller inputs, thereby reducing computational and memory costs. Specifically, many recent designs
adopt convolution layers in the early stages of the network and apply transformers only to significantly
downsampled feature maps later in the network [14, 10, 29, 33]. While this approach considerably
improves efficiency, it still suffers from the inherent quadratic complexity when input size becomes
large. Building upon hierarchical and/or convolution-transformer hybrid architectures, some studies
propose more efficient, alternative attention schemes, e.g., ReLU-based attention [4], transposed
attention [19], convolutional modulation [11], additive attention [23], shift-add attention [35], and
linear-angular attention [36]. However, all these models are designed to process a single 2D image and
cannot be directly applied to more complex visual inputs, such as 3D point clouds. Many of them
also rely on jointly designing the entire backbone to achieve good performance on 2D tasks and do
not work well as drop-in replacements. In fact, using some of these efficient attention methods as
drop-ins can degrade the model’s performance, as reported by [9].
Efficient Attention Drop-in Replacements: To enhance the computational efficiency of vision
transformers, various alternative attention mechanisms have been introduced. For instance, Swin [17]
proposes windowed attention, where self-attention computation is limited to local neighborhoods