Enhancing the Locality and Breaking the Memory
Bottleneck of Transformer on Time Series Forecasting
Shiyang Li (shiyangli@ucsb.edu), Xiaoyong Jin (x_jin@ucsb.edu), Yao Xuan (yxuan@ucsb.edu), Xiyou Zhou (xiyou@ucsb.edu), Wenhu Chen (wenhuchen@ucsb.edu), Yu-Xiang Wang (yuxiangw@cs.ucsb.edu), Xifeng Yan (xyan@cs.ucsb.edu)
University of California, Santa Barbara
Abstract
Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situations. In this paper, we propose to tackle such forecasting problems with Transformer [1]. Although impressed by its performance in our preliminary study, we found two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in the canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: the space complexity of the canonical Transformer grows quadratically with the sequence length $L$, making directly modeling long time series infeasible. To solve these two issues, we first propose convolutional self-attention, producing queries and keys with causal convolution so that local context can be better incorporated into the attention mechanism. Then, we propose the LogSparse Transformer with only $O(L(\log L)^2)$ memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under a constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.
1 Introduction
Time series forecasting plays an important role in daily life, helping people manage resources and make decisions. For example, in the retail industry, probabilistic forecasting of product demand and supply based on historical data can guide inventory planning to maximize profit. Although still widely used, traditional time series forecasting models, such as State Space Models (SSMs) [2] and Autoregressive (AR) models, are designed to fit each time series independently. Besides, they also require practitioners' expertise in manually selecting trend, seasonality and other components. These two major weaknesses have greatly hindered their application to modern large-scale time series forecasting tasks.
To tackle the aforementioned challenges, deep neural networks [3, 4, 5, 6] have been proposed as an alternative solution, where Recurrent Neural Networks (RNNs) [7, 8, 9] have been employed to model time series in an autoregressive fashion. However, RNNs are notoriously difficult to train [10] because of the vanishing and exploding gradient problem. Despite the emergence of various variants, including LSTM [11] and GRU [12], the issues remain unresolved. As an example, [13] shows that language models using LSTM have an effective context size of about 200 tokens on average but are only able to sharply distinguish the nearest 50 tokens, indicating that even LSTM struggles to
capture long-term dependencies. On the other hand, real-world forecasting applications often have both long- and short-term repeating patterns [7]. For example, the hourly occupancy rate of a freeway in traffic data has both daily and hourly patterns. In such cases, modeling long-term dependencies becomes the critical step in achieving promising performance.
Recently, Transformer [1, 14] has been proposed as a brand new architecture which leverages the attention mechanism to process sequential data. Unlike RNN-based methods, Transformer allows the model to access any part of the history regardless of distance, making it potentially more suitable for grasping recurring patterns with long-term dependencies. However, canonical dot-product self-attention matches queries against keys without regard to local context, which may make the model prone to anomalies and bring underlying optimization issues. More importantly, the space complexity of the canonical Transformer grows quadratically with the input length $L$, which causes a memory bottleneck when directly modeling long time series with fine granularity. We specifically delve into these two issues and investigate the application of Transformer to time series forecasting. Our contributions are threefold:
• We successfully apply the Transformer architecture to time series forecasting and perform extensive experiments on both synthetic and real datasets to validate Transformer's potential value in handling long-term dependencies better than RNN-based models.
• We propose convolutional self-attention by employing causal convolutions to produce queries and keys in the self-attention layer. Query-key matching aware of local context, e.g. shapes, can help the model achieve lower training loss and further improve its forecasting accuracy.
• We propose the LogSparse Transformer, with only $O(L(\log L)^2)$ space complexity, to break the memory bottleneck, not only making fine-grained long time series modeling feasible but also producing comparable or even better results with much less memory usage compared to the canonical Transformer (see the sketch after this list).
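The mechanism behind the $O(L(\log L)^2)$ cost is detailed later in the paper, beyond the excerpt shown here. As a hedged illustration only, the Python sketch below shows one log-sparse attention pattern consistent with that budget, in which each position attends to itself and to past positions at exponentially increasing distances, so each row of the attention matrix has $O(\log L)$ nonzero entries; the function name and the exact index rule are illustrative assumptions, not necessarily the paper's definition.

```python
def log_sparse_indices(t: int) -> set:
    """Hypothetical log-sparse attention set for position t (1-indexed):
    t attends to itself and to t - 2^0, t - 2^1, ... while they stay >= 1,
    giving O(log t) attended positions per step instead of O(t)."""
    indices = {t}
    step = 1
    while t - step >= 1:
        indices.add(t - step)
        step *= 2
    return indices

# Each row has O(log L) entries, so one attention layer needs O(L log L) memory;
# stacking O(log L) such layers to preserve a full receptive field is one way the
# stated O(L (log L)^2) total could arise.
L = 16
for t in range(1, L + 1):
    print(t, sorted(log_sparse_indices(t)))
```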
2 Related Work
Due to the wide applications of forecasting, various methods have been proposed to solve the problem. One of the most prominent models is ARIMA [15]. Its statistical properties, as well as the well-known Box-Jenkins methodology [16] for model selection, make it a natural first attempt for practitioners. However, its linear assumption and limited scalability make it unsuitable for large-scale forecasting tasks. Further, information across similar time series cannot be shared since each time series is fitted individually. In contrast, [17] models related time series data as a matrix and treats forecasting as a matrix factorization problem. [18] proposes hierarchical Bayesian methods to learn across multiple related count time series from the perspective of graphical models.
Deep neural networks have been proposed to capture shared information across related time series for accurate forecasting. [3] fuses traditional AR models with RNNs by modeling a probabilistic distribution in an encoder-decoder fashion. Instead, [19] uses an RNN as an encoder and Multi-Layer Perceptrons (MLPs) as a decoder to address the so-called error accumulation issue and conduct multi-step-ahead forecasting in parallel. [6] uses a global RNN to directly output the parameters of a linear SSM at each step for each time series, aiming to approximate nonlinear dynamics with locally linear segments. In contrast, [9] deals with noise using a local Gaussian process for each time series while using a global RNN to model the shared patterns. [20] tries to combine the advantages of AR models and SSMs, and maintains a complex latent process to conduct multi-step forecasting in parallel.
The well-known self-attention-based Transformer [1] has recently been proposed for sequence modeling and has achieved great success. Several recent works apply it to translation, speech, music and image generation [1, 21, 22, 23]. However, scaling attention to extremely long sequences is computationally prohibitive, since the space complexity of self-attention grows quadratically with sequence length [21]. This becomes a serious issue when forecasting time series with fine granularity and strong long-term dependencies.
3 Background
Problem definition
Suppose we have a collection of $N$ related univariate time series $\{z_{i,1:t_0}\}_{i=1}^{N}$, where $z_{i,1:t_0} \triangleq [z_{i,1}, z_{i,2}, \cdots, z_{i,t_0}]$ and $z_{i,t} \in \mathbb{R}$ denotes the value of time series $i$ at time $t$.¹ We are going to predict the next $\tau$ time steps for all time series, i.e. $\{z_{i,t_0+1:t_0+\tau}\}_{i=1}^{N}$. Besides, let $\{\mathbf{x}_{i,1:t_0+\tau}\}_{i=1}^{N}$ be a set of associated time-based covariate vectors with dimension $d$ that are assumed to be known over the entire time period, e.g. day-of-the-week and hour-of-the-day. We aim to model the following conditional distribution

$$p(z_{i,t_0+1:t_0+\tau} \mid z_{i,1:t_0}, \mathbf{x}_{i,1:t_0+\tau}; \Phi) = \prod_{t=t_0+1}^{t_0+\tau} p(z_{i,t} \mid z_{i,1:t-1}, \mathbf{x}_{i,1:t}; \Phi).$$
We reduce the problem to learning a one-step-ahead prediction model $p(z_t \mid z_{1:t-1}, \mathbf{x}_{1:t}; \Phi)$,² where $\Phi$ denotes the learnable parameters shared by all time series in the collection. To fully utilize both the observations and covariates, we concatenate them to obtain an augmented matrix as follows:

$$\mathbf{y}_t \triangleq [z_{t-1} \circ \mathbf{x}_t] \in \mathbb{R}^{d+1}, \qquad \mathbf{Y}_t = [\mathbf{y}_1, \cdots, \mathbf{y}_t]^{T} \in \mathbb{R}^{t \times (d+1)},$$

where $[\cdot \circ \cdot]$ represents concatenation. An appropriate model $z_t \sim f(\mathbf{Y}_t)$ is then explored to predict the distribution of $z_t$ given $\mathbf{Y}_t$.
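As a concrete illustration of this input construction, the following is a minimal NumPy sketch that builds the augmented matrix $\mathbf{Y}_t$ from an observation history and its covariates. The array names and the simple day-of-week/hour-of-day covariates are illustrative assumptions, not code from the paper.

```python
import numpy as np

def build_augmented_matrix(z: np.ndarray, x: np.ndarray, t: int) -> np.ndarray:
    """Build Y_t = [y_1, ..., y_t]^T with y_s = [z_{s-1} o x_s] in R^{d+1}.

    z: shape (T,)   observed values z_1, ..., z_T (z_0 taken as 0 padding here)
    x: shape (T, d) covariates x_1, ..., x_T, known over the whole horizon
    """
    # Shift observations by one step so that row s pairs z_{s-1} with x_s
    # (the value at the current step is what the model has to predict).
    z_prev = np.concatenate([[0.0], z[: t - 1]])             # shape (t,)
    Y_t = np.concatenate([z_prev[:, None], x[:t]], axis=1)   # shape (t, d+1)
    return Y_t

# Illustrative usage with hourly data and day-of-week / hour-of-day covariates.
T, d = 168, 2
z = np.random.randn(T)
hours = np.arange(T)
x = np.stack([(hours // 24) % 7, hours % 24], axis=1).astype(float)
print(build_augmented_matrix(z, x, t=24).shape)  # (24, 3)
```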
Transformer
We instantiate $f$ with Transformer³ by taking advantage of the multi-head self-attention mechanism, since self-attention enables Transformer to capture both long- and short-term dependencies, and different attention heads learn to focus on different aspects of temporal patterns. These advantages make Transformer a good candidate for time series forecasting. We briefly introduce its architecture here and refer readers to [1] for more details.
In the self-attention layer, a multi-head self-attention sublayer simultaneously transforms $\mathbf{Y}$⁴ into $H$ distinct query matrices $\mathbf{Q}_h = \mathbf{Y}\mathbf{W}_h^{Q}$, key matrices $\mathbf{K}_h = \mathbf{Y}\mathbf{W}_h^{K}$, and value matrices $\mathbf{V}_h = \mathbf{Y}\mathbf{W}_h^{V}$, respectively, with $h = 1, \cdots, H$. Here $\mathbf{W}_h^{Q}, \mathbf{W}_h^{K} \in \mathbb{R}^{(d+1) \times d_k}$ and $\mathbf{W}_h^{V} \in \mathbb{R}^{(d+1) \times d_v}$ are learnable parameters. After these linear projections, the scaled dot-product attention computes a sequence of vector outputs:

$$\mathbf{O}_h = \mathrm{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_h \mathbf{K}_h^{T}}{\sqrt{d_k}} \cdot \mathbf{M}\right)\mathbf{V}_h.$$
Note that a mask matrix $\mathbf{M}$ is applied to filter out rightward attention by setting all upper triangular elements to $-\infty$, in order to avoid future information leakage. Afterwards, $\mathbf{O}_1, \mathbf{O}_2, \cdots, \mathbf{O}_H$ are concatenated and linearly projected again. On top of the attention output, a position-wise feedforward sublayer with two fully-connected layers and a ReLU activation in the middle is stacked.
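To make the masking concrete, here is a small PyTorch-style sketch of a single head of causally masked scaled dot-product self-attention as described above. The tensor names and the use of `torch` are assumptions for illustration, and the sketch omits multi-head concatenation, the output projection, and the feedforward sublayer.

```python
import math
import torch

def masked_self_attention_head(Y, W_q, W_k, W_v):
    """One head of causally masked scaled dot-product self-attention.

    Y:        (t, d+1) augmented input matrix
    W_q, W_k: (d+1, d_k) query/key projections; W_v: (d+1, d_v) value projection
    """
    Q, K, V = Y @ W_q, Y @ W_k, Y @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / math.sqrt(d_k)                  # (t, t) attention logits
    # Mask out rightward (future) positions with -inf before the softmax,
    # so each step only attends to itself and earlier steps.
    t = scores.shape[0]
    causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V           # (t, d_v)

# Illustrative usage.
t, d_in, d_k, d_v = 24, 3, 8, 8
Y = torch.randn(t, d_in)
W_q, W_k, W_v = torch.randn(d_in, d_k), torch.randn(d_in, d_k), torch.randn(d_in, d_v)
print(masked_self_attention_head(Y, W_q, W_k, W_v).shape)  # torch.Size([24, 8])
```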
4 Methodology
4.1 Enhancing the locality of Transformer
Patterns in time series may evolve significantly over time due to various events, e.g. holidays and extreme weather, so whether an observed point is an anomaly, a change point or part of a pattern is highly dependent on its surrounding context. However, in the self-attention layers of the canonical Transformer, the similarities between queries and keys are computed based on their point-wise values without fully leveraging local context such as shape, as shown in Figure 1(a) and (b). Query-key matching agnostic of local context may confuse the self-attention module as to whether the observed value is an anomaly, a change point or part of a pattern, and bring underlying optimization issues.
We propose convolutional self-attention to ease the issue. The architectural view of the proposed convolutional self-attention is illustrated in Figure 1(c) and (d). Rather than using convolution of kernel size 1 (i.e. the point-wise projection above), causal convolution with a larger kernel is used to transform the inputs into queries and keys, so that local context such as shape is incorporated into query-key matching.
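The excerpt ends mid-section here, so the following PyTorch sketch is only a hedged illustration of the idea stated in the abstract and contributions: producing queries and keys with a causal (left-padded) 1-D convolution instead of a point-wise projection. The class name, kernel size, and shapes are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CausalConvQK(nn.Module):
    """Produce queries and keys with causal 1-D convolution over the input sequence.

    A kernel size of 1 reduces to the canonical point-wise projection; a larger
    kernel lets each query/key summarize its local context (e.g. local shape).
    """
    def __init__(self, d_in: int, d_k: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size - 1  # left padding only, so no future leakage
        self.q_conv = nn.Conv1d(d_in, d_k, kernel_size)
        self.k_conv = nn.Conv1d(d_in, d_k, kernel_size)

    def forward(self, Y: torch.Tensor):
        # Y: (batch, t, d_in) -> Conv1d expects (batch, channels, t)
        Yc = Y.transpose(1, 2)
        Yc = nn.functional.pad(Yc, (self.pad, 0))  # pad only on the past side
        Q = self.q_conv(Yc).transpose(1, 2)        # (batch, t, d_k)
        K = self.k_conv(Yc).transpose(1, 2)        # (batch, t, d_k)
        return Q, K

# Illustrative usage: same sequence length in and out, causality preserved.
Y = torch.randn(4, 24, 3)
Q, K = CausalConvQK(d_in=3, d_k=8, kernel_size=3)(Y)
print(Q.shape, K.shape)  # torch.Size([4, 24, 8]) torch.Size([4, 24, 8])
```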
¹ Here the time index $t$ is relative, i.e. the same $t$ in different time series may represent a different actual time point.
² Since the model is applicable to all time series, we omit the subscript $i$ for simplicity and clarity.
³ By referring to Transformer, we only consider the autoregressive Transformer decoder in the following.
⁴ At each time step the same model is applied, so we simplify the formulation with some abuse of notation.