arXiv:2201.12740v3 [cs.LG] 16 Jun 2022
FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting
Tian Zhou *1, Ziqing Ma *1, Qingsong Wen 1, Xue Wang 1, Liang Sun 1, Rong Jin 1
Abstract
Although Transformer-based methods have significantly improved state-of-the-art results for long-term series forecasting, they are not only computationally expensive but, more importantly, are unable to capture the global view of time series (e.g., overall trend). To address these problems, we propose to combine Transformer with the seasonal-trend decomposition method, in which the decomposition method captures the global profile of time series while Transformers capture more detailed structures. To further enhance the performance of Transformer for long-term prediction, we exploit the fact that most time series tend to have a sparse representation in a well-known basis such as the Fourier transform, and develop a frequency enhanced Transformer. Besides being more effective, the proposed method, termed Frequency Enhanced Decomposed Transformer (FEDformer), is more efficient than standard Transformer, with linear complexity in the sequence length. Our empirical studies with six benchmark datasets show that, compared with state-of-the-art methods, FEDformer can reduce prediction error by 14.8% and 22.6% for multivariate and univariate time series, respectively. Code is publicly available at https://github.com/MAZiqing/FEDformer.
1. Introduction
Long-term time series forecasting is a long-standing challenge in various applications (e.g., energy, weather, traffic, economics). Despite the impressive results achieved by RNN-type methods (Rangapuram et al., 2018; Flunkert et al., 2017), they often suffer from the problem of gradient vanishing or exploding (Pascanu et al., 2013), significantly limiting their performance. Following the recent success in the NLP and CV communities (Vaswani et al., 2017; Devlin et al., 2019; Dosovitskiy et al., 2021; Rao et al., 2021), Transformer (Vaswani et al., 2017) has been introduced to capture long-term dependencies in time series forecasting and shows promising results (Zhou et al., 2021; Wu et al., 2021). Since high computational complexity and memory requirements make it difficult for Transformer to be applied to long sequence modeling, numerous studies are devoted to reducing the computational cost of Transformer (Li et al., 2019; Kitaev et al., 2020; Zhou et al., 2021; Wang et al., 2020; Xiong et al., 2021; Ma et al., 2021). A thorough overview of this line of work can be found in Appendix A.

*Equal contribution. 1Machine Intelligence Technology, Alibaba Group. Correspondence to: Tian Zhou <tian.zt@alibaba-inc.com>, Rong Jin <jinrong.jr@alibaba-inc.com>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).
Despite the progress made by Transformer-based methods for time series forecasting, they tend to fail in capturing the overall characteristics/distribution of time series in some cases. In Figure 1, we compare the time series of the ground truth with that predicted by the vanilla Transformer method (Vaswani et al., 2017) on the real-world ETTm1 dataset (Zhou et al., 2021). It is clear that the predicted time series shares a different distribution from that of the ground truth. The discrepancy between ground truth and prediction could be explained by the point-wise attention and prediction in Transformer. Since the prediction for each time step is made individually and independently, the model likely fails to maintain the global property and statistics of the time series as a whole. To address this problem, we exploit two ideas in this work. The first idea is to incorporate a seasonal-trend decomposition approach (Cleveland et al., 1990; Wen et al., 2019), which is widely used in time series analysis, into the Transformer-based method. Although this idea has been exploited before (Oreshkin et al., 2019; Wu et al., 2021), we present a special design of network that is effective in bringing the distribution of the prediction close to that of the ground truth, according to the Kolmogorov-Smirnov distribution test. Our second idea is to combine Fourier analysis with the Transformer-based method. Instead of applying Transformer to the time domain, we apply it to the frequency domain, which helps Transformer better capture global properties of time series. Combining both ideas, we propose a Frequency Enhanced Decomposition Transformer, or FEDformer for short, for long-term time series forecasting.
One critical question with FEDformer is which subset of frequency components should be used by Fourier analysis to represent time series. A common wisdom is to keep the low-frequency components and throw away the high-frequency ones. This may not be appropriate for time series forecasting, as some of the trend changes in time series are related to important events, and this piece of information could be lost if we simply remove all high-frequency components. We address this problem by effectively exploiting the fact that time series tend to have (unknown) sparse representations on a basis like the Fourier basis. According to our theoretical analysis, a randomly selected subset of frequency components, including both low and high ones, will give a better representation for time series, which is further verified by extensive empirical studies. Besides being more effective for long-term forecasting, combining Transformer with frequency analysis allows us to reduce the computational cost of Transformer from quadratic to linear complexity. We note that this is different from previous efforts on speeding up Transformer, which often lead to a performance drop.

In short, we summarize the key contributions of this work as follows:
1. We propose a frequency enhanced decomposed Transformer architecture with mixture of experts for seasonal-trend decomposition in order to better capture global properties of time series.

2. We propose Fourier enhanced blocks and Wavelet enhanced blocks in the Transformer structure that allow us to capture important structures in time series through frequency domain mapping. They serve as substitutions for both self-attention and cross-attention blocks.

3. By randomly selecting a fixed number of Fourier components, the proposed model achieves linear computational complexity and memory cost. The effectiveness of this selection method is verified both theoretically and empirically.

4. We conduct extensive experiments over 6 benchmark datasets across multiple domains (energy, traffic, economics, weather and disease). Our empirical studies show that the proposed model improves the performance of state-of-the-art methods by 14.8% and 22.6% for multivariate and univariate forecasting, respectively.
Figure 1. Different distribution between the ground truth and the forecasting output of vanilla Transformer on the real-world ETTm1 dataset. Left: frequency mode and trend shift. Right: trend shift.
2. Compact Representation of Time Series in Frequency Domain
It is well-known that time series data can be modeled from the time domain and the frequency domain. One key contribution of our work, which separates it from other long-term forecasting algorithms, is the frequency-domain operation with a neural network. While Fourier analysis is a common tool for diving into the frequency domain, how to appropriately represent the information in time series using Fourier analysis is critical. Simply keeping all the frequency components may result in inferior representations, since many high-frequency changes in time series are due to noisy inputs. On the other hand, only keeping the low-frequency components may also be inappropriate for series forecasting, as some trend changes in time series represent important events. Instead, keeping a compact representation of time series using a small number of selected Fourier components will lead to efficient computation of the Transformer, which is crucial for modelling long sequences. We propose to represent time series by randomly selecting a constant number of Fourier components, including both high-frequency and low-frequency ones. Below, an analysis that justifies the random selection is presented theoretically. Empirical verification can be found in the experimental section.
Consider we have m time series, denoted as X_1(t), ..., X_m(t). By applying the Fourier transform to each time series, we turn each X_i(t) into a vector a_i = (a_{i,1}, ..., a_{i,d})^⊤ ∈ R^d. By putting all the Fourier transform vectors into a matrix, we have A = (a_1, a_2, ..., a_m)^⊤ ∈ R^{m×d}, with each row corresponding to a different time series and each column corresponding to a different Fourier component. Although using all the Fourier components allows us to best preserve the history information in the time series, it may potentially lead to overfitting of the history data and consequently a poor prediction of future signals. Hence, we need to select a subset of Fourier components that, on the one hand, should be small enough to avoid the overfitting problem and, on the other hand, should be able to preserve most of the history information. Here, we propose to select s components from the d Fourier components (s < d)
Figure 2. FEDformer structure. FEDformer consists of N encoders and M decoders. The Frequency Enhanced Block (FEB, green blocks) and Frequency Enhanced Attention (FEA, red blocks) are used to perform representation learning in the frequency domain. Either FEB or FEA has two subversions (FEB-f & FEB-w, or FEA-f & FEA-w), where 'f' means using Fourier basis and 'w' means using Wavelet basis. The Mixture Of Experts Decomposition Blocks (MOEDecomp, yellow blocks) are used to extract seasonal-trend patterns from the input data.
uniformly at random. More specifically, we denote by i_1 < i_2 < ... < i_s the randomly selected components. We construct the matrix S ∈ {0, 1}^{s×d}, with S_{i,k} = 1 if i = i_k and S_{i,k} = 0 otherwise. Then, our representation of multivariate time series becomes A′ = AS^⊤ ∈ R^{m×s}. Below, we will show that, although the Fourier basis is randomly selected, under a mild condition, A′ is able to preserve most of the information from A.
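The selection above can be sketched directly in NumPy. This is a minimal illustration of the construction A′ = AS^⊤, not the paper's implementation; the function name `random_fourier_select` is our own.

```python
import numpy as np

def random_fourier_select(A: np.ndarray, s: int, seed: int = 0) -> np.ndarray:
    """Keep s of the d Fourier components uniformly at random.

    A has shape (m, d): one row per time series, one column per Fourier
    component, mirroring the matrix A in the text. Returns A' = A S^T of
    shape (m, s), where S is the 0/1 selection matrix built from the
    sampled indices i_1 < i_2 < ... < i_s.
    """
    m, d = A.shape
    rng = np.random.default_rng(seed)
    idx = np.sort(rng.choice(d, size=s, replace=False))  # i_1 < ... < i_s
    # Build S in {0,1}^{s x d} explicitly, as in the text.
    S = np.zeros((s, d))
    S[np.arange(s), idx] = 1.0
    A_prime = A @ S.T  # equivalent to column slicing A[:, idx]
    assert np.allclose(A_prime, A[:, idx])
    return A_prime
```

In practice the multiplication by S^⊤ is just a column gather, which is why the selection adds no meaningful cost.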
In order to measure how well A′ preserves information from A, we project each column vector of A into the subspace spanned by the column vectors in A′. We denote by P_{A′}(A) the resulting matrix after the projection, where P_{A′}(·) represents the projection operator. If A′ preserves a large portion of information from A, we would expect a small error between A and P_{A′}(A), i.e., ‖A − P_{A′}(A)‖. Let A_k represent the approximation of A by its first k largest singular values. The theorem below shows that ‖A − P_{A′}(A)‖ is close to ‖A − A_k‖ if the number of randomly sampled Fourier components s is on the order of k².

Theorem 1. Assume that µ(A), the coherence measure of matrix A, is Ω(k/n). Then, with high probability, we have

‖A − P_{A′}(A)‖ ≤ (1 + ε)‖A − A_k‖

if s = O(k²/ε²).
The detailed analysis can be found in Appendix C.

For real-world multivariate time series, the corresponding matrix A from the Fourier transform often exhibits a low-rank property, since the univariate variables in a multivariate time series depend not only on their own past values but also on each other, and share similar frequency components. Therefore, as indicated by Theorem 1, randomly selecting a subset of Fourier components allows us to appropriately represent the information in the Fourier matrix A.
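The low-rank intuition behind Theorem 1 can be checked numerically. The sketch below builds a synthetic near-rank-k matrix, compares the projection error onto s randomly chosen columns against the optimal rank-k SVD error, and is purely illustrative: the matrix is simulated, not a real Fourier matrix, and the constants are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, k, s = 64, 256, 4, 48  # s on the order of k^2, as the theorem suggests

# Synthetic near-low-rank "Fourier matrix": rank-k signal plus small noise.
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, d))
A += 0.01 * rng.standard_normal((m, d))

# Baseline: best rank-k approximation error ||A - A_k|| via SVD.
U, sing, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * sing[:k]) @ Vt[:k]
err_svd = np.linalg.norm(A - A_k)

# Projection of A's columns onto the span of s randomly chosen columns A'.
idx = rng.choice(d, size=s, replace=False)
A_prime = A[:, idx]
Q, _ = np.linalg.qr(A_prime)  # orthonormal basis for the column span of A'
err_proj = np.linalg.norm(A - Q @ (Q.T @ A))

# With s ~ k^2 random columns, the projection error stays within a small
# multiplicative factor of the optimal rank-k error, as Theorem 1 predicts.
print(err_proj / err_svd)
```

When A is close to low rank, the random columns almost surely capture the dominant subspace, which is the mechanism the theorem formalizes.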
Similarly, wavelet orthogonal polynomials, such as Legendre polynomials, obey the restricted isometry property (RIP) and can be used to capture information in time series as well. Compared to the Fourier basis, wavelet-based representation is more effective in capturing local structures in time series and thus can be more effective for some forecasting tasks. We defer the discussion of wavelet-based representation to Appendix B. In the next section, we will present the design of the frequency enhanced decomposed Transformer architecture that incorporates the Fourier transform into the Transformer.
3. Model Structure

In this section, we introduce (1) the overall structure of FEDformer, as shown in Figure 2, (2) two subversion structures for signal processing: one using Fourier basis and the other using Wavelet basis, (3) the mixture of experts mechanism for seasonal-trend decomposition, and (4) the complexity analysis of the proposed model.
3.1. FEDformer Framework

Preliminary. Long-term time series forecasting is a sequence-to-sequence problem. We denote the input length as I and the output length as O, and we denote by D the hidden dimension of the series. The input of the encoder is an I × D matrix and the decoder has an (I/2 + O) × D input.
FEDformer Structure. Inspired by the seasonal-trend decomposition and distribution analysis discussed in Section 1, we renovate Transformer as a deep decomposition architecture, as shown in Figure 2, including the Frequency Enhanced Block (FEB), the Frequency Enhanced Attention (FEA) connecting encoder and decoder, and the Mixture Of Experts Decomposition block (MOEDecomp). The detailed descriptions of the FEB, FEA, and MOEDecomp blocks are given in Sections 3.2, 3.3, and 3.4, respectively.
The encoder adopts a multilayer structure: X^l_en = Encoder(X^{l−1}_en), where l ∈ {1, ..., N} denotes the l-th encoder layer and X^0_en ∈ R^{I×D} is the embedded historical series. The Encoder(·) is formalized as

S^{l,1}_en, _ = MOEDecomp(FEB(X^{l−1}_en) + X^{l−1}_en),
S^{l,2}_en, _ = MOEDecomp(FeedForward(S^{l,1}_en) + S^{l,1}_en),
X^l_en = S^{l,2}_en,     (1)

where S^{l,i}_en, i ∈ {1, 2} represents the seasonal component after the i-th decomposition block in the l-th layer, respectively. The FEB module has two different versions (FEB-f & FEB-w), implemented through the Discrete Fourier Transform (DFT) and Discrete Wavelet Transform (DWT) mechanisms respectively, and can seamlessly replace the self-attention block.
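The dataflow of Eq. (1) can be sketched as below. This is a toy illustration of the residual-then-decompose pattern only: the moving-average `decomp` is a stand-in for MOEDecomp, and the linear maps passed as `feb`/`ffn` are placeholders for the real FEB and FeedForward blocks.

```python
import numpy as np

def decomp(x: np.ndarray, kernel: int = 5):
    """Placeholder for MOEDecomp: moving-average trend, residual seasonal part.
    x has shape (L, D); returns (seasonal, trend) of the same shape."""
    pad = kernel // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    trend = np.stack([xp[i:i + kernel].mean(axis=0) for i in range(x.shape[0])])
    return x - trend, trend

def encoder_layer(x: np.ndarray, feb, ffn) -> np.ndarray:
    """One FEDformer encoder layer following Eq. (1):
    the trend is discarded, only the seasonal parts flow onward."""
    s1, _ = decomp(feb(x) + x)    # S^{l,1}_en
    s2, _ = decomp(ffn(s1) + s1)  # S^{l,2}_en
    return s2                     # X^l_en

# Toy stand-ins for FEB and FeedForward (small linear maps).
rng = np.random.default_rng(0)
I, D = 96, 8
W1 = 0.1 * rng.standard_normal((D, D))
W2 = 0.1 * rng.standard_normal((D, D))
x0 = rng.standard_normal((I, D))
x1 = encoder_layer(x0, feb=lambda x: x @ W1, ffn=lambda x: x @ W2)
```

Note that, unlike a vanilla Transformer layer, the residual sum is split by the decomposition and only the seasonal component is propagated.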
The decoder also adopts a multilayer structure: X^l_de, T^l_de = Decoder(X^{l−1}_de, T^{l−1}_de), where l ∈ {1, ..., M} denotes the l-th decoder layer. The Decoder(·) is formalized as

S^{l,1}_de, T^{l,1}_de = MOEDecomp(FEB(X^{l−1}_de) + X^{l−1}_de),
S^{l,2}_de, T^{l,2}_de = MOEDecomp(FEA(S^{l,1}_de, X^N_en) + S^{l,1}_de),
S^{l,3}_de, T^{l,3}_de = MOEDecomp(FeedForward(S^{l,2}_de) + S^{l,2}_de),
X^l_de = S^{l,3}_de,
T^l_de = T^{l−1}_de + W_{l,1}·T^{l,1}_de + W_{l,2}·T^{l,2}_de + W_{l,3}·T^{l,3}_de,     (2)

where S^{l,i}_de, T^{l,i}_de, i ∈ {1, 2, 3} represent the seasonal and trend components after the i-th decomposition block in the l-th layer, respectively. W_{l,i}, i ∈ {1, 2, 3} represents the projector for the i-th extracted trend T^{l,i}_de. Similar to FEB, FEA has two different versions (FEA-f & FEA-w), implemented through DFT and DWT projection respectively with an attention design, which can replace the cross-attention block. The detailed description of FEA(·) is given in Section 3.3.
The final prediction is the sum of the two refined decomposed components: W_S · X^M_de + T^M_de, where W_S projects the deep transformed seasonal component X^M_de to the target dimension.
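Eq. (2)'s trend pathway and the final combination can be sketched as follows. As before, the moving-average `decomp` is only a stand-in for MOEDecomp, and the `feb`/`fea`/`ffn` callables are placeholders; for simplicity the target dimension is kept equal to D here.

```python
import numpy as np

def decomp(x, kernel=5):
    """Placeholder for MOEDecomp (moving-average split into seasonal + trend)."""
    pad = kernel // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    trend = np.stack([xp[i:i + kernel].mean(axis=0) for i in range(x.shape[0])])
    return x - trend, trend

def decoder_layer(x, trend, x_enc, feb, fea, ffn, W):
    """One FEDformer decoder layer following Eq. (2): the seasonal stream is
    refined three times; each extracted trend is projected and accumulated."""
    s1, t1 = decomp(feb(x) + x)
    s2, t2 = decomp(fea(s1, x_enc) + s1)
    s3, t3 = decomp(ffn(s2) + s2)
    trend = trend + t1 @ W[0] + t2 @ W[1] + t3 @ W[2]
    return s3, trend

rng = np.random.default_rng(0)
L, D = 72, 8
x_de, t_de = rng.standard_normal((L, D)), np.zeros((L, D))
x_enc = rng.standard_normal((96, D))
W = [0.1 * rng.standard_normal((D, D)) for _ in range(3)]
x_de, t_de = decoder_layer(
    x_de, t_de, x_enc,
    feb=lambda x: 0.1 * x,       # stand-in for FEB
    fea=lambda s, xe: 0.1 * s,   # stand-in for FEA (ignores x_enc here)
    ffn=lambda x: 0.1 * x, W=W)

# Final prediction: W_S * X^M_de + T^M_de.
W_S = rng.standard_normal((D, D))
prediction = x_de @ W_S + t_de
```

The key design point is that the trend is refined additively across layers rather than re-estimated from scratch, so each layer only contributes a projected correction.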
3.2. Fourier Enhanced Structure
Discrete Fourier Transform (DFT). The proposed Fourier enhanced structures use the discrete Fourier transform (DFT). Let F denote the Fourier transform and F^{−1} denote the inverse Fourier transform. Given a sequence of real numbers x_n in the time domain, where n = 1, 2, ..., N, the DFT is defined as X_l = Σ_{n=0}^{N−1} x_n e^{−iωln}, where i is the imaginary unit and X_l, l = 1, 2, ..., L, is a sequence of complex numbers in the frequency domain. Similarly, the inverse DFT is defined as x_n = Σ_{l=0}^{L−1} X_l e^{iωln}. The complexity of the DFT is O(N²). With the fast Fourier transform (FFT), the computation complexity can be reduced to O(N log N). Here a random subset of the Fourier basis is used, and the size of the subset is bounded by a scalar. When we choose the mode indices before the DFT and inverse DFT operations, the computation complexity can be further reduced to O(N).

Figure 3. Frequency Enhanced Block with Fourier transform (FEB-f) structure.

Figure 4. Frequency Enhanced Attention with Fourier transform (FEA-f) structure; σ(·) is the activation function.
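The mode-selection idea can be illustrated with NumPy's FFT. This sketch selects the modes after a full FFT for clarity; the function name is our own, and the O(N) claim in the text refers to fixing the mode indices in advance so that only M = O(1) frequency coefficients ever need to be computed and transformed back.

```python
import numpy as np

def dft_keep_random_modes(x: np.ndarray, M: int, seed: int = 0) -> np.ndarray:
    """Transform x (length N, real) to the frequency domain, keep M randomly
    selected modes, zero out the rest, and transform back."""
    N = len(x)
    rng = np.random.default_rng(seed)
    modes = rng.choice(N // 2, size=M, replace=False)  # sample rfft bins
    X = np.fft.rfft(x)                                 # O(N log N) via FFT
    X_kept = np.zeros_like(X)
    X_kept[modes] = X[modes]                           # keep only M modes
    return np.fft.irfft(X_kept, n=N)

# A signal with a few strong frequencies is well captured by few modes.
t = np.arange(256) / 256
x = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 17 * t)
y = dft_keep_random_modes(x, M=64)
```

Because zeroing frequency bins can only remove energy (by Parseval's theorem), the filtered signal never has more energy than the input.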
Frequency Enhanced Block with Fourier Transform (FEB-f). The FEB-f is used in both the encoder and the decoder, as shown in Figure 2. The input x ∈ R^{N×D} of the FEB-f block is first linearly projected with w ∈ R^{D×D}, so q = x·w. Then q is converted from the time domain to the frequency domain. The Fourier transform of q is denoted as Q ∈ C^{N×D}. In the frequency domain, only the M randomly selected modes are kept, so we use a select operator:

Q̃ = Select(Q) = Select(F(q)),     (3)

where Q̃ ∈ C^{M×D} and M ≪ N. Then, the FEB-f is defined as

FEB-f(q) = F^{−1}(Padding(Q̃ ⊙ R)),     (4)

where R ∈ C^{D×D×M} is a parameterized kernel initialized randomly. Let Y = Q̃ ⊙ R, with Y ∈ C^{M×D}. The production operator ⊙ is defined as Y_{m,d_o} = Σ_{d_i=1}^{D} Q̃_{m,d_i} · R_{d_i,d_o,m}, where d_i = 1, 2, ..., D is the input channel and d_o = 1, 2, ..., D is the output channel. The result of Q̃ ⊙ R is then zero-padded to C^{N×D} before performing the inverse Fourier transform back to the time domain. The structure is shown in Figure 3.