Enhancing the Locality and Breaking the Memory
Bottleneck of Transformer on Time Series Forecasting
Shiyang Li (shiyangli@ucsb.edu), Xiaoyong Jin (x_jin@ucsb.edu), Yao Xuan (yxuan@ucsb.edu), Xiyou Zhou (xiyou@ucsb.edu), Wenhu Chen (wenhuchen@ucsb.edu), Yu-Xiang Wang (yuxiangw@cs.ucsb.edu), Xifeng Yan (xyan@cs.ucsb.edu)
University of California, Santa Barbara
Abstract
Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situations. In this paper, we propose to tackle such forecasting problems with Transformer [1]. Although impressed by its performance in our preliminary study, we found two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in the canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: the space complexity of the canonical Transformer grows quadratically with the sequence length $L$, making directly modeling long time series infeasible. To solve these two issues, we first propose convolutional self-attention, producing queries and keys with causal convolution so that local context can be better incorporated into the attention mechanism. Then, we propose the LogSparse Transformer with only $O(L(\log L)^2)$ memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under a constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.
1 Introduction
Time series forecasting plays an important role in daily life, helping people manage resources and make decisions. For example, in the retail industry, probabilistic forecasting of product demand and supply based on historical data can guide inventory planning to maximize profit. Although still widely used, traditional time series forecasting models, such as State Space Models (SSMs) [2] and Autoregressive (AR) models, are designed to fit each time series independently. Besides, they also require practitioners' expertise in manually selecting trend, seasonality and other components. These two major weaknesses have greatly hindered their application to modern large-scale time series forecasting tasks.
To tackle the aforementioned challenges, deep neural networks [3, 4, 5, 6] have been proposed as an alternative solution, where Recurrent Neural Networks (RNNs) [7, 8, 9] have been employed to model time series in an autoregressive fashion. However, RNNs are notoriously difficult to train [10] because of the vanishing and exploding gradient problem. Despite the emergence of various variants, including LSTM [11] and GRU [12], the issues remain unresolved. As an example, [13] shows that language models using LSTM have an effective context size of about 200 tokens on average but are only able to sharply distinguish the nearest 50 tokens, indicating that even LSTM struggles to
capture long-term dependencies. On the other hand, real-world forecasting applications often have both long- and short-term repeating patterns [7]. For example, the hourly occupancy rate of a freeway in traffic data has both daily and hourly patterns. In such cases, modeling long-term dependencies becomes the critical step in achieving promising performance.
Recently, Transformer [1, 14] has been proposed as a brand new architecture which leverages the attention mechanism to process sequential data. Unlike RNN-based methods, Transformer allows the model to access any part of the history regardless of distance, making it potentially more suitable for grasping recurring patterns with long-term dependencies. However, canonical dot-product self-attention matches queries against keys without regard to local context, which may make the model prone to anomalies and bring underlying optimization issues. More importantly, the space complexity of the canonical Transformer grows quadratically with the input length $L$, which causes a memory bottleneck when directly modeling long time series with fine granularity. We specifically delve into these two issues and investigate the application of Transformer to time series forecasting. Our contributions are threefold:
• We successfully apply the Transformer architecture to time series forecasting and perform extensive experiments on both synthetic and real datasets to validate Transformer's potential value in handling long-term dependencies better than RNN-based models.
• We propose convolutional self-attention by employing causal convolutions to produce queries and keys in the self-attention layer. Query-key matching aware of local context, e.g. shapes, can help the model achieve lower training loss and further improve its forecasting accuracy.
• We propose the LogSparse Transformer, with only $O(L(\log L)^2)$ space complexity, to break the memory bottleneck, not only making fine-grained long time series modeling feasible but also producing comparable or even better results with much less memory usage compared to the canonical Transformer (see the sketch after this list).
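The mechanism behind the $O(L(\log L)^2)$ cost is detailed later in the paper, beyond the excerpt shown here. As a hedged illustration only, the Python sketch below shows one log-sparse attention pattern consistent with that budget, in which each position attends to itself and to past positions at exponentially increasing distances, so each row of the attention matrix has $O(\log L)$ nonzero entries; the function name and the exact index rule are illustrative assumptions, not necessarily the paper's definition.

```python
def log_sparse_indices(t: int) -> set:
    """Hypothetical log-sparse attention set for position t (1-indexed):
    t attends to itself and to t - 2^0, t - 2^1, ... while they stay >= 1,
    giving O(log t) attended positions per step instead of O(t)."""
    indices = {t}
    step = 1
    while t - step >= 1:
        indices.add(t - step)
        step *= 2
    return indices

# Each row has O(log L) entries, so one attention layer needs O(L log L) memory;
# stacking O(log L) such layers to preserve a full receptive field is one way the
# stated O(L (log L)^2) total could arise.
L = 16
for t in range(1, L + 1):
    print(t, sorted(log_sparse_indices(t)))
```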
2 Related Work
Due to the wide applications of forecasting, various methods have been proposed to solve the problem. One of the most prominent models is ARIMA [15]. Its statistical properties, as well as the well-known Box-Jenkins methodology [16] for model selection, make it a natural first attempt for practitioners. However, its linear assumption and limited scalability make it unsuitable for large-scale forecasting tasks. Further, information across similar time series cannot be shared since each time series is fitted individually. In contrast, [17] models related time series data as a matrix and treats forecasting as a matrix factorization problem. [18] proposes hierarchical Bayesian methods to learn across multiple related count time series from the perspective of graphical models.
Deep neural networks have been proposed to capture shared information across related time series for accurate forecasting. [3] fuses traditional AR models with RNNs by modeling a probabilistic distribution in an encoder-decoder fashion. Instead, [19] uses an RNN as an encoder and Multi-Layer Perceptrons (MLPs) as a decoder to address the so-called error accumulation issue and conduct multi-step-ahead forecasting in parallel. [6] uses a global RNN to directly output the parameters of a linear SSM at each step for each time series, aiming to approximate nonlinear dynamics with locally linear segments. In contrast, [9] deals with noise using a local Gaussian process for each time series while using a global RNN to model the shared patterns. [20] tries to combine the advantages of AR models and SSMs, and maintains a complex latent process to conduct multi-step forecasting in parallel.
The well-known self-attention-based Transformer [1] has recently been proposed for sequence modeling and has achieved great success. Several recent works apply it to translation, speech, music and image generation [1, 21, 22, 23]. However, scaling attention to extremely long sequences is computationally prohibitive, since the space complexity of self-attention grows quadratically with sequence length [21]. This becomes a serious issue when forecasting time series with fine granularity and strong long-term dependencies.
3 Background
Problem definition
Suppose we have a collection of $N$ related univariate time series $\{z_{i,1:t_0}\}_{i=1}^{N}$, where $z_{i,1:t_0} \triangleq [z_{i,1}, z_{i,2}, \cdots, z_{i,t_0}]$ and $z_{i,t} \in \mathbb{R}$ denotes the value of time series $i$ at time $t$.¹ We are going to predict the next $\tau$ time steps for all time series, i.e. $\{z_{i,t_0+1:t_0+\tau}\}_{i=1}^{N}$. Besides, let $\{\mathbf{x}_{i,1:t_0+\tau}\}_{i=1}^{N}$ be a set of associated time-based covariate vectors with dimension $d$ that are assumed to be known over the entire time period, e.g. day-of-the-week and hour-of-the-day. We aim to model the following conditional distribution

$$p(z_{i,t_0+1:t_0+\tau} \mid z_{i,1:t_0}, \mathbf{x}_{i,1:t_0+\tau}; \Phi) = \prod_{t=t_0+1}^{t_0+\tau} p(z_{i,t} \mid z_{i,1:t-1}, \mathbf{x}_{i,1:t}; \Phi).$$
We reduce the problem to learning a one-step-ahead prediction model $p(z_t \mid z_{1:t-1}, \mathbf{x}_{1:t}; \Phi)$,² where $\Phi$ denotes the learnable parameters shared by all time series in the collection. To fully utilize both the observations and covariates, we concatenate them to obtain an augmented matrix as follows:

$$\mathbf{y}_t \triangleq [z_{t-1} \circ \mathbf{x}_t] \in \mathbb{R}^{d+1}, \qquad \mathbf{Y}_t = [\mathbf{y}_1, \cdots, \mathbf{y}_t]^{T} \in \mathbb{R}^{t \times (d+1)},$$

where $[\cdot \circ \cdot]$ represents concatenation. An appropriate model $z_t \sim f(\mathbf{Y}_t)$ is then explored to predict the distribution of $z_t$ given $\mathbf{Y}_t$.
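As a concrete illustration of this input construction, the following is a minimal NumPy sketch that builds the augmented matrix $\mathbf{Y}_t$ from an observation history and its covariates. The array names and the simple day-of-week/hour-of-day covariates are illustrative assumptions, not code from the paper.

```python
import numpy as np

def build_augmented_matrix(z: np.ndarray, x: np.ndarray, t: int) -> np.ndarray:
    """Build Y_t = [y_1, ..., y_t]^T with y_s = [z_{s-1} o x_s] in R^{d+1}.

    z: shape (T,)   observed values z_1, ..., z_T (z_0 taken as 0 padding here)
    x: shape (T, d) covariates x_1, ..., x_T, known over the whole horizon
    """
    # Shift observations by one step so that row s pairs z_{s-1} with x_s
    # (the value at the current step is what the model has to predict).
    z_prev = np.concatenate([[0.0], z[: t - 1]])             # shape (t,)
    Y_t = np.concatenate([z_prev[:, None], x[:t]], axis=1)   # shape (t, d+1)
    return Y_t

# Illustrative usage with hourly data and day-of-week / hour-of-day covariates.
T, d = 168, 2
z = np.random.randn(T)
hours = np.arange(T)
x = np.stack([(hours // 24) % 7, hours % 24], axis=1).astype(float)
print(build_augmented_matrix(z, x, t=24).shape)  # (24, 3)
```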
Transformer
We instantiate $f$ with Transformer³ by taking advantage of the multi-head self-attention mechanism, since self-attention enables Transformer to capture both long- and short-term dependencies, and different attention heads learn to focus on different aspects of temporal patterns. These advantages make Transformer a good candidate for time series forecasting. We briefly introduce its architecture here and refer readers to [1] for more details.
In the self-attention layer, a multi-head self-attention sublayer simultaneously transforms $\mathbf{Y}$⁴ into $H$ distinct query matrices $\mathbf{Q}_h = \mathbf{Y}\mathbf{W}_h^{Q}$, key matrices $\mathbf{K}_h = \mathbf{Y}\mathbf{W}_h^{K}$, and value matrices $\mathbf{V}_h = \mathbf{Y}\mathbf{W}_h^{V}$, respectively, with $h = 1, \cdots, H$. Here $\mathbf{W}_h^{Q}, \mathbf{W}_h^{K} \in \mathbb{R}^{(d+1) \times d_k}$ and $\mathbf{W}_h^{V} \in \mathbb{R}^{(d+1) \times d_v}$ are learnable parameters. After these linear projections, the scaled dot-product attention computes a sequence of vector outputs:

$$\mathbf{O}_h = \mathrm{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_h \mathbf{K}_h^{T}}{\sqrt{d_k}} \cdot \mathbf{M}\right)\mathbf{V}_h.$$
Note that a mask matrix $\mathbf{M}$ is applied to filter out rightward attention by setting all upper triangular elements to $-\infty$, in order to avoid future information leakage. Afterwards, $\mathbf{O}_1, \mathbf{O}_2, \cdots, \mathbf{O}_H$ are concatenated and linearly projected again. On top of the attention output, a position-wise feedforward sublayer with two fully-connected layers and a ReLU activation in the middle is stacked.
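To make the masking concrete, here is a small PyTorch-style sketch of a single head of causally masked scaled dot-product self-attention as described above. The tensor names and the use of `torch` are assumptions for illustration, and the sketch omits multi-head concatenation, the output projection, and the feedforward sublayer.

```python
import math
import torch

def masked_self_attention_head(Y, W_q, W_k, W_v):
    """One head of causally masked scaled dot-product self-attention.

    Y:        (t, d+1) augmented input matrix
    W_q, W_k: (d+1, d_k) query/key projections; W_v: (d+1, d_v) value projection
    """
    Q, K, V = Y @ W_q, Y @ W_k, Y @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / math.sqrt(d_k)                  # (t, t) attention logits
    # Mask out rightward (future) positions with -inf before the softmax,
    # so each step only attends to itself and earlier steps.
    t = scores.shape[0]
    causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V           # (t, d_v)

# Illustrative usage.
t, d_in, d_k, d_v = 24, 3, 8, 8
Y = torch.randn(t, d_in)
W_q, W_k, W_v = torch.randn(d_in, d_k), torch.randn(d_in, d_k), torch.randn(d_in, d_v)
print(masked_self_attention_head(Y, W_q, W_k, W_v).shape)  # torch.Size([24, 8])
```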
4 Methodology
4.1 Enhancing the locality of Transformer
Patterns in time series may evolve significantly over time due to various events, e.g. holidays and extreme weather, so whether an observed point is an anomaly, a change point or part of a pattern is highly dependent on its surrounding context. However, in the self-attention layers of the canonical Transformer, the similarities between queries and keys are computed based on their point-wise values without fully leveraging local context such as shape, as shown in Figure 1(a) and (b). Query-key matching agnostic of local context may confuse the self-attention module as to whether the observed value is an anomaly, a change point or part of a pattern, and bring underlying optimization issues.
We propose convolutional self-attention to ease the issue. The architectural view of the proposed convolutional self-attention is illustrated in Figure 1(c) and (d). Rather than using convolution of kernel size 1 (i.e. the point-wise projection above), causal convolution with a larger kernel is used to transform the inputs into queries and keys, so that local context such as shape is incorporated into query-key matching.
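The excerpt ends mid-section here, so the following PyTorch sketch is only a hedged illustration of the idea stated in the abstract and contributions: producing queries and keys with a causal (left-padded) 1-D convolution instead of a point-wise projection. The class name, kernel size, and shapes are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CausalConvQK(nn.Module):
    """Produce queries and keys with causal 1-D convolution over the input sequence.

    A kernel size of 1 reduces to the canonical point-wise projection; a larger
    kernel lets each query/key summarize its local context (e.g. local shape).
    """
    def __init__(self, d_in: int, d_k: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size - 1  # left padding only, so no future leakage
        self.q_conv = nn.Conv1d(d_in, d_k, kernel_size)
        self.k_conv = nn.Conv1d(d_in, d_k, kernel_size)

    def forward(self, Y: torch.Tensor):
        # Y: (batch, t, d_in) -> Conv1d expects (batch, channels, t)
        Yc = Y.transpose(1, 2)
        Yc = nn.functional.pad(Yc, (self.pad, 0))  # pad only on the past side
        Q = self.q_conv(Yc).transpose(1, 2)        # (batch, t, d_k)
        K = self.k_conv(Yc).transpose(1, 2)        # (batch, t, d_k)
        return Q, K

# Illustrative usage: same sequence length in and out, causality preserved.
Y = torch.randn(4, 24, 3)
Q, K = CausalConvQK(d_in=3, d_k=8, kernel_size=3)(Y)
print(Q.shape, K.shape)  # torch.Size([4, 24, 8]) torch.Size([4, 24, 8])
```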
¹ Here the time index $t$ is relative, i.e. the same $t$ in different time series may represent a different actual time point.
² Since the model is applicable to all time series, we omit the subscript $i$ for simplicity and clarity.
³ By referring to Transformer, we only consider the autoregressive Transformer decoder in the following.
⁴ At each time step the same model is applied, so we simplify the formulation with some abuse of notation.