arXiv:2201.12740v3 [cs.LG] 16 Jun 2022
FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting

Tian Zhou*1, Ziqing Ma*1, Qingsong Wen1, Xue Wang1, Liang Sun1, Rong Jin1

*Equal contribution. 1Machine Intelligence Technology, Alibaba Group. Correspondence to: Tian Zhou <tian.zt@alibaba-inc.com>, Rong Jin <jinrong.jr@alibaba-inc.com>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).
Abstract

Although Transformer-based methods have significantly improved state-of-the-art results for long-term series forecasting, they are not only computationally expensive but, more importantly, are unable to capture the global view of time series (e.g., overall trend). To address these problems, we propose to combine Transformer with the seasonal-trend decomposition method, in which the decomposition method captures the global profile of time series while Transformers capture more detailed structures. To further enhance the performance of Transformer for long-term prediction, we exploit the fact that most time series tend to have a sparse representation in a well-known basis such as the Fourier transform, and develop a frequency enhanced Transformer. Besides being more effective, the proposed method, termed Frequency Enhanced Decomposed Transformer (FEDformer), is more efficient than the standard Transformer, with complexity linear in the sequence length. Our empirical studies with six benchmark datasets show that, compared with state-of-the-art methods, FEDformer can reduce prediction error by 14.8% and 22.6% for multivariate and univariate time series, respectively. Code is publicly available at https://github.com/MAZiqing/FEDformer.
1. Introduction

Long-term time series forecasting is a long-standing challenge in various applications (e.g., energy, weather, traffic, economics). Despite the impressive results achieved by RNN-type methods (Rangapuram et al., 2018; Flunkert et al., 2017), they often suffer from the problem of gradient vanishing or exploding (Pascanu et al., 2013), significantly limiting their performance. Following the recent success in the NLP and CV communities (Vaswani et al., 2017; Devlin et al., 2019; Dosovitskiy et al., 2021; Rao et al., 2021), Transformer (Vaswani et al., 2017) has been introduced to capture long-term dependencies in time series forecasting and shows promising results (Zhou et al., 2021; Wu et al., 2021). Since high computational complexity and memory requirements make it difficult to apply Transformer to long sequence modeling, numerous studies have been devoted to reducing the computational cost of Transformer (Li et al., 2019; Kitaev et al., 2020; Zhou et al., 2021; Wang et al., 2020; Xiong et al., 2021; Ma et al., 2021). A thorough overview of this line of work can be found in Appendix A.
Despite the progress made by Transformer-based methods for time series forecasting, they tend to fail in capturing the overall characteristics/distribution of time series in some cases. In Figure 1, we compare the ground-truth time series with that predicted by the vanilla Transformer method (Vaswani et al., 2017) on the real-world ETTm1 dataset (Zhou et al., 2021). It is clear that the predicted time series shares a different distribution from that of the ground truth. The discrepancy between ground truth and prediction can be explained by the point-wise attention and prediction in Transformer. Since the prediction for each timestep is made individually and independently, the model likely fails to maintain the global properties and statistics of the time series as a whole. To address this problem, we exploit two ideas in this work. The first idea is to incorporate a seasonal-trend decomposition approach (Cleveland et al., 1990; Wen et al., 2019), which is widely used in time series analysis, into the Transformer-based method. Although this idea has been exploited before (Oreshkin et al., 2019; Wu et al., 2021), we present a special network design that is effective in bringing the distribution of the prediction close to that of the ground truth, according to the Kolmogorov-Smirnov distribution test. Our second idea is to combine Fourier analysis with the Transformer-based method. Instead of applying Transformer to the time domain, we apply it to the frequency domain, which helps Transformer better capture the global properties of time series. Combining both ideas, we propose a Frequency Enhanced Decomposed Transformer, or FEDformer for short, for long-term time series forecasting.
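
As a concrete illustration of the distribution check referenced here, a two-sample Kolmogorov-Smirnov test can compare forecast and ground-truth value distributions. The sketch below uses SciPy with synthetic arrays standing in for model output and true values (the data and shift are illustrative assumptions, not results from the paper):

```python
# Minimal sketch: compare forecast vs. ground-truth distributions with a
# two-sample Kolmogorov-Smirnov test (synthetic data for illustration only).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
ground_truth = rng.normal(loc=0.0, scale=1.0, size=1000)  # stand-in true series values
forecast = rng.normal(loc=0.5, scale=1.3, size=1000)      # stand-in drifting prediction

stat, p_value = ks_2samp(ground_truth, forecast)
print(f"KS statistic={stat:.3f}, p={p_value:.3g}")
# A large statistic / small p-value indicates the predicted distribution
# deviates from the ground truth, the failure mode described above.
```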
One critical question with FEDformer is which subset of frequency components should be used by Fourier analysis to represent time series. Common wisdom is to keep the low-frequency components and throw away the high-frequency ones. This may not be appropriate for time series forecasting, as some trend changes in time series are related to important events, and this piece of information could be lost if we simply removed all high-frequency components. We address this problem by effectively exploiting the fact that time series tend to have (unknown) sparse representations on a basis such as the Fourier basis. According to our theoretical analysis, a randomly selected subset of frequency components, including both low and high ones, gives a better representation of time series, which is further verified by extensive empirical studies. Besides being more effective for long-term forecasting, combining Transformer with frequency analysis allows us to reduce the computational cost of Transformer from quadratic to linear complexity. We note that this is different from previous efforts on speeding up Transformer, which often lead to a performance drop.
In short, we summarize the key contributions of this work as follows:

1. We propose a frequency enhanced decomposed Transformer architecture with a mixture of experts for seasonal-trend decomposition in order to better capture the global properties of time series.

2. We propose Fourier enhanced blocks and Wavelet enhanced blocks in the Transformer structure that allow us to capture important structures in time series through frequency domain mapping. They serve as substitutes for both the self-attention and cross-attention blocks.

3. By randomly selecting a fixed number of Fourier components, the proposed model achieves linear computational complexity and memory cost. The effectiveness of this selection method is verified both theoretically and empirically.

4. We conduct extensive experiments over 6 benchmark datasets across multiple domains (energy, traffic, economics, weather, and disease). Our empirical studies show that the proposed model improves the performance of state-of-the-art methods by 14.8% and 22.6% for multivariate and univariate forecasting, respectively.
Figure 1. Different distributions between the ground truth and the forecasting output of the vanilla Transformer on the real-world ETTm1 dataset. Left: frequency mode and trend shift. Right: trend shift.
2. Compact Representation of Time Series in the Frequency Domain

It is well known that time series data can be modeled in the time domain and in the frequency domain. One key contribution of our work that separates it from other long-term forecasting algorithms is frequency-domain operation with a neural network. Fourier analysis is a common tool for working in the frequency domain, but how to appropriately represent the information in a time series using Fourier analysis is critical. Simply keeping all the frequency components may result in inferior representations, since many high-frequency changes in time series are due to noisy inputs. On the other hand, keeping only the low-frequency components may also be inappropriate for series forecasting, as some trend changes in time series represent important events. Instead, keeping a compact representation of time series using a small number of selected Fourier components leads to efficient computation of the Transformer, which is crucial for modeling long sequences. We propose to represent time series by randomly selecting a constant number of Fourier components, including both high-frequency and low-frequency ones. Below, an analysis that justifies the random selection is presented theoretically. Empirical verification can be found in the experimental section.
Suppose we have $m$ time series, denoted as $X_1(t), \ldots, X_m(t)$. By applying the Fourier transform to each time series, we turn each $X_i(t)$ into a vector $a_i = (a_{i,1}, \ldots, a_{i,d})^\top \in \mathbb{R}^d$. By putting all the Fourier transform vectors into a matrix, we have $A = (a_1, a_2, \ldots, a_m)^\top \in \mathbb{R}^{m \times d}$, with each row corresponding to a different time series and each column corresponding to a different Fourier component. Although using all the Fourier components allows us to best preserve the history information in the time series, it may potentially lead to overfitting of the history data and consequently poor prediction of future signals. Hence, we need to select a subset of Fourier components that, on the one hand, is small enough to avoid the overfitting problem and, on the other hand, preserves most of the history information. Here, we propose to select $s$ components from the $d$ Fourier components ($s < d$) uniformly at random.
Figure 2. FEDformer structure. FEDformer consists of N encoders and M decoders. The Frequency Enhanced Block (FEB, green blocks) and Frequency Enhanced Attention (FEA, red blocks) are used to perform representation learning in the frequency domain. Either FEB or FEA has two subversions (FEB-f & FEB-w, or FEA-f & FEA-w), where '-f' means using the Fourier basis and '-w' means using the Wavelet basis. The Mixture Of Experts Decomposition blocks (MOEDecomp, yellow blocks) are used to extract seasonal-trend patterns from the input data.
More specifically, we denote by $i_1 < i_2 < \ldots < i_s$ the randomly selected components. We construct the matrix $S \in \{0, 1\}^{s \times d}$, with $S_{j,k} = 1$ if $k = i_j$ and $S_{j,k} = 0$ otherwise. Then our representation of the multivariate time series becomes $A' = A S^\top \in \mathbb{R}^{m \times s}$. Below, we show that, although the Fourier basis is randomly selected, under a mild condition, $A'$ is able to preserve most of the information in $A$.
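
To make the construction concrete, the following NumPy sketch forms the Fourier matrix $A$ for $m$ series, draws $s$ component indices uniformly at random, and builds the selection matrix $S$ and the compressed representation $A' = AS^\top$ (all sizes and the input data are illustrative assumptions):

```python
# Sketch of the random Fourier-component selection A' = A S^T (NumPy).
import numpy as np

rng = np.random.default_rng(42)
m, n, s = 8, 256, 16                      # series count, series length, kept components

X = rng.standard_normal((m, n))           # m time series (illustrative data)
A = np.fft.rfft(X, axis=1)                # Fourier matrix A, one row per series
d = A.shape[1]                            # number of Fourier components

idx = np.sort(rng.choice(d, size=s, replace=False))  # i_1 < ... < i_s
S = np.zeros((s, d))
S[np.arange(s), idx] = 1.0                # S_{j,k} = 1 iff k = i_j

A_prime = A @ S.T                         # compressed representation in C^{m x s}
print(A.shape, A_prime.shape)             # (8, 129) (8, 16)
```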
To measure how well $A'$ preserves the information in $A$, we project each column vector of $A$ into the subspace spanned by the column vectors of $A'$. We denote by $P_{A'}(A)$ the resulting matrix after the projection, where $P_{A'}(\cdot)$ represents the projection operator. If $A'$ preserves a large portion of the information in $A$, we would expect a small error between $A$ and $P_{A'}(A)$, i.e. $|A - P_{A'}(A)|$. Let $A_k$ denote the best rank-$k$ approximation of $A$ obtained from its singular value decomposition. The theorem below shows that $|A - P_{A'}(A)|$ is close to $|A - A_k|$ if the number of randomly sampled Fourier components $s$ is on the order of $k^2$.
Theorem 1. Assume that $\mu(A)$, the coherence measure of matrix $A$, is $\Omega(k/n)$. Then, with high probability, we have

$$|A - P_{A'}(A)| \le (1 + \epsilon)\,|A - A_k|$$

if $s = O(k^2/\epsilon^2)$.
The detailed analysis can be found in Appendix C.
For real-world multivariate time series, the corresponding matrix $A$ from the Fourier transform often exhibits a low-rank property, since the univariate variables in a multivariate time series depend not only on their own past values but also on each other, and share similar frequency components. Therefore, as indicated by Theorem 1, randomly selecting a subset of Fourier components allows us to appropriately represent the information in the Fourier matrix $A$.
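
As a rough empirical check of Theorem 1, one can compare the projection error $|A - P_{A'}(A)|$ against the best rank-$k$ error $|A - A_k|$. In the sketch below, a synthetic low-rank-plus-noise matrix stands in for $A$, and the projection onto randomly selected columns is computed with a pseudoinverse (all sizes are illustrative assumptions):

```python
# Sketch: error of projecting A onto randomly selected columns vs. best rank-k error.
import numpy as np

rng = np.random.default_rng(0)
m, d, k, s = 64, 256, 4, 32
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, d))
A += 0.1 * rng.standard_normal((m, d))        # low-rank signal plus noise

cols = rng.choice(d, size=s, replace=False)   # random component selection
A_sub = A[:, cols]
P = A_sub @ np.linalg.pinv(A_sub)             # projector onto span of A''s columns
proj_err = np.linalg.norm(A - P @ A)

U, sv, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * sv[:k]) @ Vt[:k]            # best rank-k approximation A_k
svd_err = np.linalg.norm(A - A_k)
print(f"|A - P_A'(A)| = {proj_err:.3f}, |A - A_k| = {svd_err:.3f}")
```

For a matrix that is close to rank $k$, the two printed errors should be of comparable magnitude, consistent with the bound above.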
Similarly, wavelet orthogonal polynomials, such as Legendre polynomials, obey the restricted isometry property (RIP) and can be used to capture information in time series as well. Compared to the Fourier basis, wavelet-based representations are more effective in capturing local structures in time series and thus can be more effective for some forecasting tasks. We defer the discussion of the wavelet-based representation to Appendix B. In the next section, we present the design of the frequency enhanced decomposed Transformer architecture that incorporates the Fourier transform into the Transformer.
3. Model Structure

In this section, we introduce (1) the overall structure of FEDformer, as shown in Figure 2, (2) two subversion structures for signal processing: one using the Fourier basis and the other using the Wavelet basis, (3) the mixture of experts mechanism for seasonal-trend decomposition, and (4) the complexity analysis of the proposed model.
3.1. FEDformer Framework

Preliminary. Long-term time series forecasting is a sequence-to-sequence problem. We denote the input length as $I$, the output length as $O$, and the hidden state dimension of the series as $D$. The input of the encoder is an $I \times D$ matrix, and the decoder input is $(I/2 + O) \times D$.
FEDformer Structure. Inspired by the seasonal-trend decomposition and distribution analysis discussed in Section 1, we renovate Transformer into a deep decomposition architecture as shown in Figure 2, including the Frequency Enhanced Block (FEB), the Frequency Enhanced Attention (FEA) connecting encoder and decoder, and the Mixture Of Experts Decomposition block (MOEDecomp). Detailed descriptions of the FEB, FEA, and MOEDecomp blocks are given in Sections 3.2, 3.3, and 3.4, respectively.
The encoder adopts a multilayer structure: $\mathcal{X}_{en}^{l} = \mathrm{Encoder}(\mathcal{X}_{en}^{l-1})$, where $l \in \{1, \cdots, N\}$ denotes the output of the $l$-th encoder layer and $\mathcal{X}_{en}^{0} \in \mathbb{R}^{I \times D}$ is the embedded historical series. The $\mathrm{Encoder}(\cdot)$ is formalized as

$$
\begin{aligned}
\mathcal{S}_{en}^{l,1}, \_ &= \mathrm{MOEDecomp}\left(\mathrm{FEB}\left(\mathcal{X}_{en}^{l-1}\right) + \mathcal{X}_{en}^{l-1}\right), \\
\mathcal{S}_{en}^{l,2}, \_ &= \mathrm{MOEDecomp}\left(\mathrm{FeedForward}\left(\mathcal{S}_{en}^{l,1}\right) + \mathcal{S}_{en}^{l,1}\right), \\
\mathcal{X}_{en}^{l} &= \mathcal{S}_{en}^{l,2},
\end{aligned}
\tag{1}
$$

where $\mathcal{S}_{en}^{l,i}$, $i \in \{1, 2\}$, represents the seasonal component after the $i$-th decomposition block in the $l$-th layer. The FEB module has two versions (FEB-f & FEB-w), implemented through the Discrete Fourier Transform (DFT) and the Discrete Wavelet Transform (DWT) respectively, which can seamlessly replace the self-attention block.
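
To make the data flow of Eq. (1) concrete, here is a minimal structural sketch of one encoder layer in PyTorch style. The `feb` argument is a stand-in for FEB-f or FEB-w (Section 3.2), and a single moving-average decomposition stands in for MOEDecomp (Section 3.4), so this sketches the wiring rather than the reference implementation:

```python
# Structural sketch of one FEDformer encoder layer (Eq. 1), PyTorch-style.
import torch.nn as nn

class MovingAvgDecomp(nn.Module):
    """Stand-in single-expert decomposition: moving-average trend + residual seasonal."""
    def __init__(self, kernel_size=25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel_size, stride=1, padding=kernel_size // 2,
                                count_include_pad=False)

    def forward(self, x):                       # x: (B, L, D)
        trend = self.avg(x.transpose(1, 2)).transpose(1, 2)
        return x - trend, trend                 # (seasonal, trend)

class EncoderLayer(nn.Module):
    def __init__(self, feb, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        self.feb = feb                          # frequency enhanced block (stand-in)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.decomp1 = MovingAvgDecomp()
        self.decomp2 = MovingAvgDecomp()

    def forward(self, x):                       # x: (B, I, D)
        s1, _ = self.decomp1(self.feb(x) + x)   # first decomposition; trend discarded
        s2, _ = self.decomp2(self.ff(s1) + s1)  # second decomposition
        return s2                               # X^l_en = S^{l,2}
```

For a quick smoke test, `EncoderLayer(nn.Identity(), d_model=64)` wires the layer with a trivial block in place of FEB.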
The decoder also adopts a multilayer structure: $\mathcal{X}_{de}^{l}, \mathcal{T}_{de}^{l} = \mathrm{Decoder}(\mathcal{X}_{de}^{l-1}, \mathcal{T}_{de}^{l-1})$, where $l \in \{1, \cdots, M\}$ denotes the output of the $l$-th decoder layer. The $\mathrm{Decoder}(\cdot)$ is formalized as

$$
\begin{aligned}
\mathcal{S}_{de}^{l,1}, \mathcal{T}_{de}^{l,1} &= \mathrm{MOEDecomp}\left(\mathrm{FEB}\left(\mathcal{X}_{de}^{l-1}\right) + \mathcal{X}_{de}^{l-1}\right), \\
\mathcal{S}_{de}^{l,2}, \mathcal{T}_{de}^{l,2} &= \mathrm{MOEDecomp}\left(\mathrm{FEA}\left(\mathcal{S}_{de}^{l,1}, \mathcal{X}_{en}^{N}\right) + \mathcal{S}_{de}^{l,1}\right), \\
\mathcal{S}_{de}^{l,3}, \mathcal{T}_{de}^{l,3} &= \mathrm{MOEDecomp}\left(\mathrm{FeedForward}\left(\mathcal{S}_{de}^{l,2}\right) + \mathcal{S}_{de}^{l,2}\right), \\
\mathcal{X}_{de}^{l} &= \mathcal{S}_{de}^{l,3}, \\
\mathcal{T}_{de}^{l} &= \mathcal{T}_{de}^{l-1} + \mathcal{W}_{l,1} \cdot \mathcal{T}_{de}^{l,1} + \mathcal{W}_{l,2} \cdot \mathcal{T}_{de}^{l,2} + \mathcal{W}_{l,3} \cdot \mathcal{T}_{de}^{l,3},
\end{aligned}
\tag{2}
$$

where $\mathcal{S}_{de}^{l,i}, \mathcal{T}_{de}^{l,i}$, $i \in \{1, 2, 3\}$, represent the seasonal and trend components after the $i$-th decomposition block in the $l$-th layer, respectively, and $\mathcal{W}_{l,i}$, $i \in \{1, 2, 3\}$, represents the projector for the $i$-th extracted trend $\mathcal{T}_{de}^{l,i}$. Similar to FEB, FEA has two versions (FEA-f & FEA-w), implemented through DFT and DWT projections respectively with an attention design, and it can replace the cross-attention block. The detailed description of $\mathrm{FEA}(\cdot)$ is given in Section 3.3.
The final prediction is the sum of the two refined decomposed components: $\mathcal{W}_{\mathcal{S}} \cdot \mathcal{X}_{de}^{M} + \mathcal{T}_{de}^{M}$, where $\mathcal{W}_{\mathcal{S}}$ projects the deep transformed seasonal component $\mathcal{X}_{de}^{M}$ to the target dimension.
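
A matching structural sketch of one decoder layer following Eq. (2), including the trend accumulation, is given below. It reuses `MovingAvgDecomp` from the encoder sketch as a stand-in for MOEDecomp; `feb` and `fea` are stand-in blocks, and the trend projectors are kept in the model dimension $D$ for simplicity (an actual implementation may project to the output dimension):

```python
# Structural sketch of one FEDformer decoder layer (Eq. 2), PyTorch-style,
# reusing MovingAvgDecomp from the encoder sketch as a stand-in for MOEDecomp.
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, feb, fea, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        self.feb, self.fea = feb, fea       # frequency enhanced block / attention
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.decomps = nn.ModuleList(MovingAvgDecomp() for _ in range(3))
        # W_{l,1..3}: one projector per extracted trend (kept D -> D here).
        self.trend_proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(3))

    def forward(self, x, cross, trend):     # x: (B, I/2+O, D); cross = X^N_en
        s1, t1 = self.decomps[0](self.feb(x) + x)
        s2, t2 = self.decomps[1](self.fea(s1, cross) + s1)
        s3, t3 = self.decomps[2](self.ff(s2) + s2)
        # T^l_de = T^{l-1}_de + sum_i W_{l,i} * T^{l,i}_de
        for w, t in zip(self.trend_proj, (t1, t2, t3)):
            trend = trend + w(t)
        return s3, trend                    # X^l_de = S^{l,3}, accumulated trend
```

After the last of the $M$ layers, the forecast is formed as above: a final projection $\mathcal{W}_{\mathcal{S}}$ of the seasonal stream added to the accumulated trend.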
3.2. Fourier Enhanced Structure

Discrete Fourier Transform (DFT). The proposed Fourier enhanced structures use the discrete Fourier transform (DFT). Let $\mathcal{F}$ denote the Fourier transform and $\mathcal{F}^{-1}$ the inverse Fourier transform.
Figure 3. Frequency Enhanced Block with Fourier transform (FEB-f) structure.
Figure 4. Frequency Enhanced Attention with Fourier transform (FEA-f) structure; $\sigma(\cdot)$ is the activation function.
Given a sequence of real numbers $x_n$ in the time domain, where $n = 1, 2, \ldots, N$, the DFT is defined as $X_l = \sum_{n=0}^{N-1} x_n e^{-i\omega l n}$, where $i$ is the imaginary unit and $X_l$, $l = 1, 2, \ldots, L$, is a sequence of complex numbers in the frequency domain. Similarly, the inverse DFT is defined as $x_n = \sum_{l=0}^{L-1} X_l e^{i\omega l n}$. The complexity of the DFT is $O(N^2)$. With the fast Fourier transform (FFT), the computational complexity can be reduced to $O(N \log N)$. Here a random subset of the Fourier basis is used, and the size of the subset is bounded by a scalar. When we choose the mode indices before the DFT and inverse DFT operations, the computational complexity can be further reduced to $O(N)$.
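
The $O(N)$ claim can be illustrated directly: if the $s$ retained mode indices are fixed in advance, each selected coefficient is a single inner product with one Fourier basis row, costing $O(N)$ per mode and $O(sN)$ overall, i.e. linear in $N$ for constant $s$. A small NumPy sketch (length and indices are illustrative):

```python
# Sketch: computing only s pre-selected DFT modes costs O(s*N) instead of O(N^2).
import numpy as np

rng = np.random.default_rng(1)
N, s = 1024, 8
x = rng.standard_normal(N)
modes = rng.choice(N, size=s, replace=False)          # fixed mode indices

n = np.arange(N)
basis = np.exp(-2j * np.pi * np.outer(modes, n) / N)  # s selected Fourier basis rows
X_sel = basis @ x                                     # s inner products, O(s*N)

assert np.allclose(X_sel, np.fft.fft(x)[modes])       # matches the full FFT at those modes
```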
Frequency Enhanced Block with Fourier Transform (FEB-f). FEB-f is used in both the encoder and the decoder, as shown in Figure 2. The input $x \in \mathbb{R}^{N \times D}$ of the FEB-f block is first linearly projected with $w \in \mathbb{R}^{D \times D}$, so $q = x \cdot w$. Then $q$ is converted from the time domain to the frequency domain. The Fourier transform of $q$ is denoted as $Q \in \mathbb{C}^{N \times D}$. In the frequency domain, only the randomly selected $M$ modes are kept, so we use a select operator:

$$\tilde{Q} = \mathrm{Select}(Q) = \mathrm{Select}(\mathcal{F}(q)), \tag{3}$$
where $\tilde{Q} \in \mathbb{C}^{M \times D}$ and $M \ll N$. Then, FEB-f is defined as

$$\mathrm{FEB\text{-}f}(q) = \mathcal{F}^{-1}(\mathrm{Padding}(\tilde{Q} \odot R)), \tag{4}$$

where $R \in \mathbb{C}^{D \times D \times M}$ is a randomly initialized parameterized kernel. Let $Y = \tilde{Q} \odot R$, with $Y \in \mathbb{C}^{M \times D}$. The production operator $\odot$ is defined as $Y_{m, d_o} = \sum_{d_i=1}^{D} \tilde{Q}_{m, d_i} \cdot R_{d_i, d_o, m}$, where $d_i = 1, 2, \ldots, D$ is the input channel and $d_o = 1, 2, \ldots, D$ is the output channel. The result of $\tilde{Q} \odot R$ is then zero-padded to $\mathbb{C}^{N \times D}$ before performing the inverse Fourier transform back to the time domain. The structure is shown in Figure 3.