self-attention at low (linear computational and memory) costs without sacrificing accuracy. PADRe’s
unifying framework provides a guide for designing alternative transformer architectures. In particular,
we observe and demonstrate that many recently-proposed attention replacement mechanisms (e.g.,
Hyena [20], Mamba [8], Conv2Former [11], SimA [13], Castling-ViT [36]), as well as the standard
attention itself [30], may in fact be interpreted as specific instances of the PADRe framework (see
Table 1).
Our framework relies on polynomial approximants in the input to replace the standard self-attention
mechanism. Polynomials are well-studied objects known to be powerful and efficient multi-
dimensional function approximants over compact sets [28]. We use these existing results, together
with the aforementioned observations, as guiding principles. We also leverage simple, mobile-friendly
operations, such as Hadamard products, to induce nonlinearities, providing the necessary modeling
capacity of the polynomial functions. Moreover, our design allows for easily and efficiently scaling up
the modeling capacity of existing network architectures by increasing the degree, while maintaining
linear computational complexity and memory cost.
In addition to discussing the theoretical underpinnings and unifying properties of PADRe, we
further experimentally demonstrate the significant advantages of PADRe over the standard self-
attention through a variety of experiments on practical computer vision tasks. More specifically, by
replacing the self-attention operation with our design, we maintain model performance while
incurring only linear computational and memory costs. Further, by profiling on
actual hardware platforms, we demonstrate the significantly better on-device efficiency of PADRe as
compared to the standard self-attention operation.
We summarize the main contributions of the paper as follows:
• We propose PADRe, a unifying framework for replacing the attention mechanism based on
polynomial approximants, which provides an efficient, linear-complexity alternative to the
standard self-attention without sacrificing accuracy.
• We show that PADRe provides a unifying formulation for many of the recently-proposed
attention replacement schemes, including Hyena [20], Mamba [8], Conv2Former [11],
SimA [13], and Castling-ViT [36].
• We implement a specific instance of PADRe following the proposed design principles. We
demonstrate the performance and advantages of PADRe through multiple experiments on
diverse, practical computer vision applications. Our PADRe-based models achieve similar
or better accuracy than the original ones while incurring significantly lower costs. Our
on-device profiling further validates the computational efficiency of PADRe.
2 Related Work
Efficient Vision Transformer Backbones: A majority of existing research proposes holistically
designed vision transformer backbones. These backbones are capable of extracting features for
various vision tasks, such as image classification, object detection, and segmentation. Most of these
methods employ a hierarchical architecture, where the spatial dimension of the image feature map is
progressively downsampled [17]. This strategy allows the deeper transformer layers to operate on
smaller inputs, thereby reducing computational and memory costs. Specifically, many recent designs
adopt convolution layers in the early stages of the network and apply transformers only to significantly
downsampled feature maps later in the network [14, 10, 29, 33]. While this approach considerably
improves efficiency, it still suffers from the inherent quadratic complexity when input size becomes
large. Building upon hierarchical and/or convolution-transformer hybrid architectures, some studies
propose more efficient, alternative attention schemes, e.g., ReLU-based attention [4], transposed
attention [19], convolutional modulation [11], additive attention [23], shift-add attention [35], and
linear-angular attention [36]. However, all these models are designed to process a single 2D image and
cannot be directly applied to more complex visual inputs, such as 3D point clouds. Many of them
also rely on jointly designing the entire backbone to achieve good performance on 2D tasks and do
not work well as drop-in replacements. In fact, using some of these efficient attention methods as
drop-ins can degrade the model’s performance, as reported by [9].
Efficient Attention Drop-in Replacements: To enhance the computational efficiency of vision
transformers, various alternative attention mechanisms have been introduced. For instance, Swin [17]
proposes windowed attention, where self-attention computation is limited to local neighborhoods