
arXiv:2305.14649v2 [cs.LG] 28 Oct 2023

A Joint Time-frequency Domain Transformer for Multivariate Time Series Forecasting

Yushu Chen^a, Shengzhuo Liu^b, Jinzhe Yang^c, Hao Jing^d, Wenlai Zhao^a, Guangwen Yang^a

a Department of Computer Science and Technology, Tsinghua University, RM.3-126, FIT Building, Haidian District, Beijing, 100084, China
b College of Computer Science and Mathematics, Fujian University of Technology, RM.213, Building C4, Fuzhou, Fujian, 350118, China
c Techorigin, No.581, Jianzhu West Road, Binhu District, Wuxi, Jiangsu, 214000, China
d Earth System Modeling and Prediction Center, No.46, Zhongguancun South Street, Haidian District, Beijing, 100081, China

Abstract
In order to enhance the performance of Transformer models for long-term
multivariate forecasting while minimizing computational demands, this paper
introduces the Joint Time-Frequency Domain Transformer (JTFT). JTFT
combines time and frequency domain representations to make predictions.
The frequency domain representation efficiently extracts multi-scale depen-
dencies while maintaining sparsity by utilizing a small number of learnable
frequencies. Simultaneously, the time domain (TD) representation is de-
rived from a fixed number of the most recent data points, strengthening the
modeling of local relationships and mitigating the effects of non-stationarity.
Importantly, the length of the representation remains independent of the
input sequence length, enabling JTFT to achieve linear computational com-
plexity. Furthermore, a low-rank attention layer is proposed to efficiently
capture cross-dimensional dependencies, thus preventing performance degra-
dation resulting from the entanglement of temporal and channel-wise model-
ing. Experimental results on six real-world datasets demonstrate that JTFT
outperforms state-of-the-art baselines in predictive performance.

Email address: [email protected] (Guangwen Yang)


1. Yushu Chen, Wenlai Zhao, and Guangwen Yang also work at National Supercomputing Center in Wuxi, China. Guangwen Yang is the corresponding author.

Preprint submitted to arXiv October 31, 2023


Keywords: time series forecasting, multivariate, frequency domain,
Transformer

1. Introduction
Time series forecasting predicts the future based on historical data. It has
broad applications including but not limited to climatology, energy, finance,
trading, and logistics (Petropoulos et al., 2022). Following the great success
of Transformers (Vaswani et al., 2017) in NLP (Kalyan et al., 2021), CV
(Khan et al., 2021), and speech (Karita et al., 2019), Transformers have been introduced in time series forecasting and have achieved promising results (Wen et al., 2022).
One of the primary drawbacks of Transformers is their quadratic complex-
ity in both computation and memory, making them less suitable for long-term
forecasting. To address this limitation, a plethora of Transformer-based mod-
els, e.g., LogTrans, Informer, AutoFormer, Performer, and PyraFormer (Li
et al., 2019; Zhou et al., 2021; Wu et al., 2021; Choromanski et al., 2021; Liu
et al., 2022a), have been proposed to enhance predictive performance while
maintaining low complexity. Notably, Zhou et al. (2022b) observed that
most time series which are dense in the time domain (TD) tend to have a
sparse representation in the frequency domain (FD). In response, they intro-
duced FEDFormer, which leverages a randomly selected subset of frequency
components to exploit FD sparsity. This approach has linear complexity
and achieved state-of-the-art (SOTA) results in early 2022. However, the
sparse representation using random frequencies alone may not fully capture
the multi-scale characteristics of time series. As a result, seasonal-trend de-
composition remains necessary in FEDFormer, despite its susceptibility to
hyperparameters. Additionally, due to the periodic nature of the basis func-
tions, FD representation is vulnerable to non-stationarity. Enhancing the
sparse FD representation to capture multi-scale features has become crucial,
especially after the emergence of a simple linear model known as DLinear,
which outperformed existing Transformers in time-series forecasting (Zeng
et al., 2022). PatchTST (Nie et al., 2023), an improved Transformer with a patching approach, has recently outperformed DLinear.
While PatchTST’s concept is valuable for improving FD Transformers, it
still exhibits quadratic complexity and cannot utilize cross-channel correla-
tions.

In this paper, we present a joint time-frequency domain Transformer
(JTFT). It exploits the FD sparsity in time series using a small number
of learnable frequencies and enhances the learning of local relations by in-
corporating a fixed number of the latest data points. A low-rank attention
layer is also used to effectively extract cross-channel dependencies. The main contributions are fourfold:

• A customized discrete cosine transform (CDCT) is presented to compute customized FD components and enable the learning of frequencies. Based on the CDCT, an FD representation with a few learnable frequencies is developed, which effectively captures multi-scale structures of time series.

• Time series are encoded based on both the sparse FD and the latest TD representations, which enables the Transformer to extract both long-term and local dependencies effectively with linear complexity.

• A low-rank attention layer is introduced to learn cross-channel interaction, which further improves the predictive performance by mitigating the entanglement and redundancy in the capture of temporal and channel dependencies.

• Extensive experimental results on 6 real-world benchmark datasets covering multiple fields (medical care, energy, trade, transportation, and weather) show that our model improves upon the predictive performance of SOTA methods. Specifically, JTFT ranks as the top-performing model in 54 out of 60 settings with varying prediction lengths and metrics, and ranks second in the remaining ones.

The source code of JTFT is available at https://github.com/rationalspark/JTFT.git.

2. Related work
Well-known traditional methods for time series forecasting include AR, ARIMA, VAR, GARCH (Box et al., 2015), kernel methods (Chen et al., 2008), ensembles (Bouchachia & Bouchachia, 2008), Gaussian processes (Frigola & Rasmussen, 2014), regime switching (Tong & Lim, 2009), and so on. Benefiting from improvements in computing power and neural network structures, many deep learning models have been proposed and often empirically surpass traditional methods in predictive performance.
Various deep neural network architectures, such as recurrent neural net-
works (RNN), convolutional neural networks (CNN), graph neural networks
(GNN), multi-layer perceptrons (MLP), and Transformers, have been widely
applied in time series forecasting. RNN-type models (Hochreiter & Schmid-
huber, 1997; Qin et al., 2017; Rangapuram et al., 2018; Salinas et al., 2020)
were once the most popular choice, but they often encountered issues re-
lated to gradient vanishing or exploding (Pascanu et al., 2012), limiting their
performance. CNN structures, on the other hand, excel at extracting local
features from time series (Bai et al., 2018; Borovykh et al., 2017; Sen et al.,
2019; Wang et al., 2023a; Wu et al., 2022). However, they usually require
many layers to effectively represent global relationships, mainly due to their
limited receptive field size. GNNs are increasingly applied to enhance both
temporal and dimensional pattern recognition in time series data (Wu et al.,
2020; Cao et al., 2020). Recent advancements (Challu et al., 2023; Li et al.,
2023b; Das et al., 2023; Ekambaram et al., 2023) suggest that MLP-type
structures remain competitive in forecasting.
Transformers are appealing in time series forecasting since their attention mechanism is effective at capturing long-term temporal dependencies. However, their complexity and memory requirements are quadratic in the sequence length, which hinders application in long-term modeling. Numerous methods have been proposed to both improve the predictive performance and reduce the costs of Transformers. Typically, these methods have to exploit some
form of sparsity. LogTrans (Li et al., 2019) introduces attention layers with
LogSparse design and achieves O(L log2 (L)) complexity, where L is the input
length. Informer (Zhou et al., 2021) selects the top-k in attention matrix with
a KL-divergence based method, and has O(L log L) complexity. Autoformer
(Wu et al., 2021) obtains the same complexity using an Auto-Correlation
mechanism. Several improved Transformers have achieved linear complex-
ity. Among these methods, Linformer (Wang et al., 2020) compresses the se-
quence by learnable linear projection; Luna (Ma et al., 2021) adopts a nested
linear structure; Nyströformer (Xiong et al., 2021) applies Nyström approx-
imation in the attention mechanism; Performer (Choromanski et al., 2021)
leverages a positive orthogonal random features approach; Pyraformer (Liu
et al., 2022a) introduces a pyramidal attention module to extract dependency
at different resolutions; FEDformer (Zhou et al., 2022b) combines seasonal-
trend decomposition and frequency enhanced structures to capture both the

global and local dependencies. Numerous effective approaches have been
proposed besides the efforts to reduce complexity. Non-stationary Trans-
formers (Liu et al., 2022b) and Scaleformer (Shabani et al., 2023) provide
generic frameworks to improve the accuracy of existing methods. Cross-
former (Zhang & Yan, 2023) develops two-stage attention layers to utilize
cross-channel dependency. Zhou et al. (2022c) show that a linear head may
achieve better performance than the Transformer decoder.
However, recent work (Zeng et al., 2022) shows that a simple linear model (DLinear) outperforms the existing SOTA Transformer-based methods, often by a large margin. Additionally, Li et al. (2023b) argue that attention is not necessary for capturing temporal dependencies. Indeed, PatchTST (Nie et al., 2023), which uses continuous segments of series as the input of a vanilla Transformer together with a channel-independence setting, has recently beaten DLinear on the standard multivariate forecasting benchmarks. The method still needs improvement in terms of computational efficiency and the extraction of cross-channel dependencies.
Frequency domain (FD) methods are also widely explored in time series
modeling. Many commonly used FD features are obtained through the Dis-
crete Fourier Transform (DFT). For instance, Wu et al. (2021) efficiently com-
pute the auto-correlation function using the Fast Fourier Transform (FFT).
Lee-Thorp et al. (2022) significantly accelerate training with minor accuracy
loss by replacing self-attention in a Transformer encoder with the Fourier
Transform. Sun & Boning (2022) present a competitive FD neural network
for univariate time series forecasting. Rather than using all FD components,
exploiting FD sparsity is crucial for computational efficiency and noise sup-
pression. FEDformer, for example, randomly selects a subset of FD compo-
nents. Similarly, Wang et al. (2023b) develop a compressed sensing technique
using random FD components. Zhou et al. (2022a) improves the Legendre
memory model using low-frequency Fourier components and a low-rank ap-
proximation. Woo et al. (2023) develops a concatenated Fourier features
module for efficiently learning high-frequency patterns. Additionally, Li et al.
(2023a) combines temporal and frequency streams for time series classifi-
cation and regression, with FD components selected by a band-pass filter.
In contrast to these approaches, JTFT introduces a custom discrete cosine
transform (CDCT) to extract FD features. The Discrete Cosine Transform
(DCT) is chosen for its superior energy compaction characteristics compared
to DFT, allowing it to obtain a sparse FD representation of the input effec-
tively. CDCT further extends DCT by enabling the learning of frequencies,

improving the extraction of periodic dependencies that may not align with
the uniform frequency grids of DCT bases.
Other methods, such as Short Time Fourier Transform and Discrete
Wavelet Transform, are used to capture time-frequency features (e.g., Chao-
valit et al., 2011; Yao et al., 2019; Singh & Tiwari, 2006; Wen et al., 2021;
Ding et al., 2022). These methods excel in extracting information from non-
stationary data due to their ability to adapt FD representations over time.
However, time-frequency features include FD components for discrete time
points, resulting in considerably larger sizes compared to FFT and DCT.
Consequently, these methods often require more effective feature selection or
compression techniques to counteract overfitting. In contrast, JTFT effec-
tively addresses the challenges posed by non-stationarity through recent TD
representations.

3. Methods
In this section, we will introduce (1) the overall structure of JTFT, (2)
the sparse FD representation with learnable frequencies, (3) the low-rank at-
tention layers to extract cross-channel dependencies, and (4) the complexity
analysis of the proposed method.

3.1. JTFT framework


Preliminary: Multivariate time series forecasting predicts the future values of time series based on historical data. We denote the input series by x_{1:L} = {x_1, · · · , x_L}, where L represents the look-back window (input length). This series has D channels, and the i-th channel is denoted as x^i = {x^i_1, · · · , x^i_L}. The future values to be forecasted are represented as x_{L+1:L+T} = {x_{L+1}, · · · , x_{L+T}}, where T is the target window (prediction length). The model's task is to map x_{1:L} to y ∈ R^{D×T}, which is an approximation of x_{L+1:L+T}.
Overall structure: The overall structure of JTFT is shown in Figure
1. The model firstly transforms the input series to a joint time-frequency
domain embedding, then maps it with a Transformer Encoder and a low-rank
cross-channel attention mechanism to the latent representation, and finally
generates the output using a prediction head followed by denormalization.
In the preprocessing stage, each channel of the input undergoes mean
subtraction and standard deviation scaling. Afterward, a patching technique
(Dosovitskiy et al., 2021; Bao et al., 2022; He et al., 2021; Nie et al., 2023)

is applied, dividing the time series into either overlapped or non-overlapped segments (for detailed information, please refer to Appendix A). These continuous segments serve as the fundamental input units for the subsequent modeling stage.

Figure 1: JTFT framework. The input time series undergo preprocessing to obtain a joint time-frequency embedding, which is then fed to the encoders for extracting time-frequency and cross-channel dependencies, respectively. The forecast is generated using a prediction head.
The preprocessed data then passes through a Custom Discrete Cosine
Transform (CDCT) module to extract the frequency-domain (FD) compo-
nents. These FD components are integrated with the most recent time-
domain (TD) patches, creating a joint time-frequency domain representation
(JTFR). Each channel’s JTFR is subsequently projected into the latent space
of the Transformer Encoder, forming an embedding of the input series, to which a learnable position embedding is added to account for the sequence's order. The process is further illustrated in Figure 2.
This fusion of TD and FD data plays a crucial role in mitigating the
adverse effects of non-stationarity in time series data. Non-stationary data
exhibits changes in statistical properties and joint distributions over time,
making the time series less predictable (Liu et al., 2022b). Additionally,
the cyclical nature of the CDCT basis may not accurately represent these
temporal changes. Therefore, the incorporation of the most recent TD data
is essential for capturing up-to-date local relationships.

Figure 2: Time-Frequency Domain Data Embedding. Patched data is transformed into FD
representation (FDR) by multiplying it with the CDCT bases. FDR is then concatenated
with the TD representation (TDR) to form the JTFR. This JTFR is mapped to the model
dimension using the projection matrix P and added with the position embedding to create
the inputs for the encoders.
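To make the data flow in Figure 2 concrete, a minimal sketch of the embedding step follows. It assumes the CDCT is applied along the patch dimension, consistent with the figure; `cdct` stands for any module mapping length-L̄ sequences to n_f FD components, and all names and shapes here are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Sketch of the joint time-frequency embedding of Figure 2 (illustrative names)."""
    def __init__(self, cdct: nn.Module, n_time: int, patch_len: int, d_model: int, l_hat: int):
        super().__init__()
        self.cdct = cdct                                       # FD transform applied along the patch axis
        self.n_time = n_time                                   # number of most recent TD patches kept
        self.proj = nn.Linear(patch_len, d_model)              # projection matrix P
        self.pos = nn.Parameter(torch.zeros(l_hat, d_model))   # learnable position embedding

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (D, L_bar, patch_len) produced by the patching step
        fdr = self.cdct(patches.transpose(-1, -2)).transpose(-1, -2)  # (D, n_f, patch_len)
        tdr = patches[:, -self.n_time:, :]                            # (D, n_t, patch_len) most recent patches
        jtfr = torch.cat([fdr, tdr], dim=1)                           # (D, n_f + n_t, patch_len)
        return self.proj(jtfr) + self.pos                             # (D, L_hat, d_model)
```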

The Transformer encoder and low-rank attention layers in JTFT extract


the time-frequency and cross-channel dependencies respectively. It is intu-
itively believed that cross-channel relationships are useful to improve predic-
tion performance. However, Nie et al. (2023) show that PatchTST with a
channel-independent (CI) setting outperforms channel-mixing models, as it
is less susceptible to overfitting and more robust to noise. Similarly, Cross-
former (Zhang & Yan, 2023) employs two-stage attention layers to learn
cross-channel dependency, but its performance is inferior to PatchTST with
CI in most of the experiments. Furthermore, Li et al. (2023b) illustrate
that entanglement and redundancy in capturing temporal and channel inter-
action affect forecasting performance. Motivated by these findings, JTFT
leverages a two-stage approach to separately capture time-frequency and
channel dependencies. The Transformer encoder in JTFT extracts time-
frequency dependency using a CI setting. While cross-channel interaction
is not modeled at this stage, the encoder is shared among all channels and
trained with all available data, which helps to mitigate overfitting. The low-
rank attention layers that extract cross-dimensional dependencies adopt a
time-frequency-independent (TFI) setting, shared across all time-frequency

points and trained with all available data, which also benefits generalization
performance.
The prediction head consists of a GELU activation function (Hendrycks &
Gimpel, 2016), dropout layer (Srivastava et al., 2014), and linear projection
layer. It maps the latent representation to output, which is then denormalized
using the statistics saved in preprocessing. The model is trained using the
Huber loss function, which offers greater resilience to outliers in the data
compared to the Mean Squared Error (MSE) loss.
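A minimal sketch of such a head is shown below, assuming the per-channel latent representation is flattened before the linear projection (as in PatchTST-style heads); the class and argument names are illustrative, not the authors' exact code.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Sketch of the prediction head: GELU, dropout, linear projection, then denormalization."""
    def __init__(self, l_hat: int, d_model: int, target_len: int, dropout: float = 0.1):
        super().__init__()
        self.head = nn.Sequential(nn.GELU(), nn.Dropout(dropout),
                                  nn.Linear(l_hat * d_model, target_len))

    def forward(self, z: torch.Tensor, mean: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
        # z: (D, l_hat, d_model) latent representation; mean/std: (D,) per-channel
        # statistics saved during the mean-subtraction / std-scaling preprocessing.
        y = self.head(z.flatten(start_dim=-2))                 # (D, target_len)
        return y * std.unsqueeze(-1) + mean.unsqueeze(-1)      # denormalize

# Training then minimizes the Huber loss, e.g. nn.HuberLoss()(y_hat, y_true),
# which is more robust to outliers than MSE.
```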

3.2. The sparse FD representation with learnable frequencies


Keeping all the frequency components may result in overfitting, because many high-frequency changes are caused by noise. It is also crucial to exploit FD sparsity in order to reduce the computation and memory complexity. Consequently, a critical problem for FD Transformers is how to select a subset of frequency components to represent the time series. Instead of keeping only the low-frequency components, Zhou et al. (2022b) show that the random selection used in FEDformer gives a better representation. They further show that its representing ability is close to that of the approximation by the first s largest components of the singular value decomposition, when the number of components is on the order of s^2. However, when a more precise frequency domain representation is expected, the O(s^2) number of random components will be large. In order to represent the time series more precisely with fewer components, we introduce an FD representation with learnable frequencies.
A customized discrete cosine transform (CDCT) is developed to compute customized FD components, which enables the learning of frequencies. The CDCT is a generalization of the discrete cosine transform (DCT), which expresses a sequence as a sum of real-valued cosine functions oscillating at different frequencies. Compared with the discrete Fourier transform (DFT), which uses complex exponential functions, the DCT roughly halves the length needed to represent a real sequence and has a strong energy compaction property. Consequently, it is preferred in compression and is thus suitable for representing series with a small number of FD components. A commonly used

form of DCT is

$$\tilde{\mathbf{z}} = \mathrm{DCT}(\mathbf{z}) = \tilde{\mathbf{T}}\mathbf{z}, \qquad
\tilde{T}_{k,n} =
\begin{cases}
1/\sqrt{N}, & k = 0,\ n \in \{0, \cdots, N-1\} \\
\sqrt{2/N}\,\cos\!\left(\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right), & k \in \{1, \cdots, N-1\},\ n \in \{0, \cdots, N-1\},
\end{cases}
\qquad (1)$$

where z is a sequence with length N. It may differ from the input time sequence due to certain preprocessing steps.
The matrix T̃ is orthogonal, so the inverse transform is

$$\mathbf{z} = \mathrm{IDCT}(\tilde{\mathbf{z}}) = \tilde{\mathbf{T}}^{-1}\tilde{\mathbf{z}} = \tilde{\mathbf{T}}^{T}\tilde{\mathbf{z}}. \qquad (2)$$

Some insignificant FD components in DCT can be ignored to exploit sparsity.


In definition (1), the TD series z is transformed to the FD by multiplying it with an orthogonal basis of cosine functions, the frequencies of which lie on uniform grid points indexed by k. However, these specified frequencies may not be adequate to express some real-world phenomena. For example, when the sampling rate is 1, multiple DCT components are required to express the simple function cos(1.1πt/N), even though it contains only a single frequency.
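The following short numeric check illustrates this example; it is an illustration only, using the orthonormal DCT-II from SciPy, which matches Eq. (1).

```python
import numpy as np
from scipy.fft import dct

# A cosine whose frequency falls between the DCT grid points spreads its energy
# over several DCT components, which motivates learnable frequencies.
N = 128
t = np.arange(N)
z = np.cos(1.1 * np.pi * t / N)            # one "off-grid" frequency
coeffs = dct(z, type=2, norm='ortho')      # orthonormal DCT-II, matching Eq. (1)
energy = coeffs ** 2 / np.sum(coeffs ** 2)
print(np.sort(energy)[::-1][:4])           # no single component carries all the energy
```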
In order to further improve the representing ability with a small fixed
number of FD components, we propose the CDCT by generalizing the DCT
with customized frequencies as

$$\hat{\mathbf{z}} = \mathrm{CDCT}(\mathbf{z}) = \hat{\mathbf{T}}\mathbf{z}, \qquad
\hat{T}_{k,n} =
\begin{cases}
1/\sqrt{N}, & k = 0,\ n \in \{0, \cdots, N-1\} \\
\sqrt{2/N}\,\cos\!\left(\left(n + \frac{1}{2}\right)\pi\psi_k\right), & \psi_k \in (0, 1),
\end{cases}
\qquad (3)$$

where Ψ = {0, ψ_1, · · · , ψ_{n_f−1}} is a group of customized frequency coefficients, k ∈ {1, · · · , n_f − 1}, and n ∈ {0, · · · , N − 1}. We set ψ_0 = 0 to retain the mean. The CDCT maps a time series of length N to n_f FD components, which is applicable for obtaining a compact data representation in time series forecasting. If n_f ≪ N is treated as a constant, the complexity of the CDCT is O(N). However, unlike the DCT, the CDCT does not have an intuitive inverse transform because its basis is not orthogonal in general.
Benefiting from modern deep learning frameworks, it is convenient to learn a projection matrix to recover the TD series from the sparse FD representation obtained by the CDCT.
The CDCT allows the learning of frequencies by setting ψ1:nf −1 as learn-
able parameters. This flexibility enables the adjustment of frequencies within
CDCT to better approximate the most significant frequencies in the data,
which may not align with the uniform grid points of the DCT. In the imple-
mentation, ψ1:nf −1 are initialized based on the top frequencies obtained by
applying DCT to a randomly sampled subset of the dataset. Although ini-
tializing CDCT with DCT frequencies has a time complexity of O(L log L),
this initialization stage is relatively quick, particularly when compared to
the training times. Since time series datasets are typically not as large as those in CV and NLP, even performing the DCT on the whole dataset without sampling is acceptable in many real-world scenarios. Therefore, being slightly less rigorous, this short initialization stage is omitted when analyzing the overall complexity.
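A minimal PyTorch sketch of the CDCT and of the DCT-based initialization described above is given below; the names `dct_basis`, `CDCT`, and `init_psi`, and the use of a sample tensor, are illustrative assumptions, and the authors' implementation may differ.

```python
import math
import torch
import torch.nn as nn

def dct_basis(N: int) -> torch.Tensor:
    """Orthonormal DCT-II basis of Eq. (1); row k holds the frequency-k cosine."""
    n = torch.arange(N, dtype=torch.float64)
    k = torch.arange(N, dtype=torch.float64).unsqueeze(1)
    basis = math.sqrt(2.0 / N) * torch.cos(math.pi / N * (n + 0.5) * k)
    basis[0, :] = 1.0 / math.sqrt(N)
    return basis                                           # (N, N), basis @ basis.T ~ I

class CDCT(nn.Module):
    """CDCT of Eq. (3) with learnable frequency coefficients psi_k in (0, 1);
    psi_0 = 0 is kept fixed to retain the mean."""
    def __init__(self, N: int, psi_init: torch.Tensor):
        super().__init__()
        self.N = N
        self.psi = nn.Parameter(psi_init.clone().float())  # (n_f - 1,) learnable frequencies

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (..., N) time-domain sequence -> (..., n_f) FD components
        n = torch.arange(self.N, dtype=z.dtype, device=z.device)
        row0 = torch.full((1, self.N), 1.0 / math.sqrt(self.N),
                          dtype=z.dtype, device=z.device)
        rows = math.sqrt(2.0 / self.N) * torch.cos(
            math.pi * (n + 0.5) * self.psi.to(z.dtype).unsqueeze(1))
        return z @ torch.cat([row0, rows], dim=0).T        # multiply by the (n_f, N) basis

def init_psi(sample: torch.Tensor, n_f: int) -> torch.Tensor:
    """Initialize psi_{1:n_f-1} from the top-energy DCT frequencies of a data sample
    (`sample` is a hypothetical (num_series, N) tensor)."""
    coeffs = sample.double() @ dct_basis(sample.shape[-1]).T
    energy = (coeffs ** 2).sum(dim=0)
    top_k = torch.topk(energy[1:], n_f - 1).indices + 1    # skip k = 0 (the mean)
    return top_k.float() / sample.shape[-1]                # psi_k = k / N lies in (0, 1)
```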
We compare the representation ability of learnable (LRN), low (LOW),
and random (RND) frequencies by examining the errors in reconstructing
the input TD series. For random and low frequencies, the TD series is recon-
structed using IDCT, while a learned linear projection is used for learnable
frequencies. From the results shown in Figure 3, the learnable frequencies
representation retains significantly more information from the input com-
pared to random and low frequencies when the number of components is the
same. It’s worth noting that lossless data representations of the inputs are
unnecessary because insignificant frequency components often have higher
noise levels and may be detrimental for prediction. Although the representa-
tional ability of low frequencies may be close to that of learnable frequencies
when the number of components is large, the latter are more effective at ob-
taining a sparse FD representation with fewer components that retains the
important information.

3.3. The low-rank attention layers to extract cross-channel dependencies


There are two intuitive ways to capture cross-channel dependencies. The
first approach involves embedding data points from all channels at the same
time step into a unified feature vector, while the second approach employs
a Transformer along the channel dimension. Zhang & Yan (2023) reduces
the computational and memory costs in the channel-wise Transformer by
replacing the self-attention with two smaller-scaled attentions. However,
empirical results (Nie et al., 2023; Zhou et al., 2022c) indicate that these

approaches generate larger errors when compared to channel-independence (CI) PatchTST.

Figure 3: Comparison of the errors in reconstructing the input TD series from the FD representation of learnable (LRN), low (LOW), and random (RND) frequencies on (a) Illness, (b) Electricity, and (c) Traffic, showing normalized MSE versus the number of FD components. The datasets are divided into continuous multivariate subsequences, each with a length of 512, and the MSE is normalized by the sum-of-squares of the data. The RND frequencies are executed 5 times, and the error bars represent the standard deviation.
To address the need for capturing cross-channel dependencies with less
redundancy, we introduce low-rank attention (LRA) layers. LRA is a com-
putationally efficient approach to integrate cross-channel information into CI
modeling. It conducts a lightweight attention across the channel dimension,
generating low-rank corrections to the outputs of CI Transformer encoders.
The goal of LRA is to enhance accuracy beyond the already strong base-
line provided by the CI model, which captures high-accuracy temporal de-
pendencies. The process is prone to overfitting. LRA mitigates overfitting
by confining the updates to a low-rank space through two approaches. First,
it maps cross-channel sequences to low-dimensional representations, followed
by a linear projection back to the original space to generate updates. Sec-
ond, it employs a time-frequency independence (TFI) setting that shares
corrections across the time-frequency dimension. Additionally, it reduces the
number of parameters by simplifying the attention mechanism and moving
the positional embedding from the input space to the low-dimensional space
of representations.
Figure 4: LRA layers. LRA conducts attention across the channel dimension. Inputs and a learnable query are fed into the LMSA, whose outputs are added to a compact learnable position embedding. The results are projected across the channel dimension and replicated along the time-frequency dimension to match the input shape. The shortcut connection, LayerNorm, and MLP are applied as in the Transformer encoder. M is the number of layers.

Figure 4 illustrates the structure of LRA. It replaces the resource-intensive multi-head self-attention (MSA) in a channel-wise Transformer encoder with a lightweight multi-head self-attention (LMSA). LMSA utilizes short learnable queries to aggregate messages from all channels into a low-rank space.
Additionally, a compact learnable position embedding is added to the LMSA
outputs to account for position within the low-rank space. The resulting
outputs are mapped to match the number of channels through linear projec-
tion, distributing the aggregated messages among the channels. The results
are duplicated in the time-frequency dimension, following the TFI setting.
Subsequently, the shortcut connection, layer normalization (LayerNorm) (Ba
et al., 2016), and multi-layer feedforward network (denoted by MLP) are ap-
plied in the same manner as in the Transformer encoder.
The LMSA is a simplified version of the MSA with reduced computational
requirements and parameters. In the LMSA, both the keys and values share
the same linear projection. Because it is used alongside learnable queries,
the linear projection for queries is also omitted. Denote the standard MSA
by
$$\mathrm{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_k}}\right)\mathbf{V}, \qquad (4)$$

where Q ∈ R^{l_q × d_k} and K, V ∈ R^{l_{kv} × d_k}. The LMSA is defined as

$$\begin{aligned}
\mathrm{LMSA}(\hat{\mathbf{Q}}, \hat{\mathbf{K}}, \hat{\mathbf{V}}) &= \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h)\,\mathbf{W}_o \\
\mathrm{head}_i &= \mathrm{Attn}\left(\hat{\mathbf{Q}}_{(i-1)d_k+1:\,i d_k},\ \hat{\mathbf{K}}\mathbf{W}_{kv,i},\ \hat{\mathbf{V}}\mathbf{W}_{kv,i}\right),
\end{aligned} \qquad (5)$$

where the inputs Q̂ ∈ R^{l_q × h d_k} and K̂, V̂ ∈ R^{l_{kv} × d_m}, and the learnable matrices W_o ∈ R^{h d_k × d_m}, W_{kv,i} ∈ R^{d_m × d_k}. h is the number of heads, and d_m is the model width. The batch-size dimension is ignored in the discussion for convenience.
The input of LRA is the CI representation obtained by the Transformer encoder. It is denoted as Z_in ∈ R^{D×(n_t+n_f)×d_m}, where n_t and n_f are the lengths of the TD and FD representations. According to the TFI setting, Z_in is reshaped to Ẑ ∈ R^{D(n_t+n_f)×d_m}. Then the LRA can be described as

$$\begin{aligned}
\mathbf{B} &= \mathrm{LMSA}\left(\mathbf{R}, \hat{\mathbf{Z}}, \hat{\mathbf{Z}}\right) \\
\hat{\mathbf{B}} &= \mathbf{B} + \mathbf{E}_{pos} \\
\bar{\mathbf{Z}} &= \mathbf{W}_e \hat{\mathbf{B}} \\
\tilde{\mathbf{Z}} &= \mathrm{LayerNorm}\left(\mathbf{Z}_{in} \oplus \bar{\mathbf{Z}}\right) \\
\mathbf{Z}_{out} &= \mathrm{LayerNorm}\left(\tilde{\mathbf{Z}} + \mathrm{MLP}\left(\tilde{\mathbf{Z}}\right)\right).
\end{aligned} \qquad (6)$$

In equation (6), R ∈ R^{d_r × h d_k} is the learnable query, whose length d_r ≪ D; E_pos ∈ R^{d_r × d_m} is the compact learnable position embedding; W_e ∈ R^{D × d_r} is the learnable projection matrix; ⊕ maps Z̄ ∈ R^{D × d_m} into R^{D×(n_t+n_f)×d_m} by repeating it along the time-frequency dimension and then adds it to Z_in.
LRA is computationally efficient, benefiting from its low-rank nature.
While the channel-wise Transformer has a complexity of O(D2 ) with respect
to D, which can lead to substantial computational costs, LRA reduces the
complexity to O(Ddr ). This complexity can be approximated as O(D) when
dr ≪ D is viewed as a constant.
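A minimal PyTorch sketch of one LRA layer following Eqs. (5)-(6) is given below; the batch dimension is omitted as in the text, and names such as `LRALayer`, `d_r`, and `mlp_ratio` are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LRALayer(nn.Module):
    """Sketch of a low-rank attention layer: a lightweight channel-wise attention
    that produces a low-rank correction shared across the time-frequency dimension."""
    def __init__(self, d_model: int, n_channels: int, d_r: int, n_heads: int, mlp_ratio: int = 2):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dk = n_heads, d_model // n_heads
        self.R = nn.Parameter(torch.randn(d_r, d_model) * 0.02)      # learnable query R
        self.E_pos = nn.Parameter(torch.zeros(d_r, d_model))         # compact position embedding E_pos
        self.W_kv = nn.Linear(d_model, d_model, bias=False)          # shared key/value projections W_kv,i
        self.W_o = nn.Linear(d_model, d_model, bias=False)           # output projection W_o
        self.W_e = nn.Parameter(torch.randn(n_channels, d_r) * 0.02) # projection W_e back to D channels
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, mlp_ratio * d_model), nn.GELU(),
                                 nn.Linear(mlp_ratio * d_model, d_model))

    def forward(self, z_in: torch.Tensor) -> torch.Tensor:
        # z_in: (D, L_hat, d_model) channel-independent representation from the encoder
        D, L_hat, dm = z_in.shape
        z_hat = z_in.reshape(D * L_hat, dm)                  # TFI: flatten time-frequency into the key/value sequence
        kv = self.W_kv(z_hat).view(-1, self.h, self.dk)      # keys and values share one projection (LMSA)
        q = self.R.view(-1, self.h, self.dk)                 # split the learnable query into heads
        scores = torch.einsum('qhd,khd->hqk', q, kv) / self.dk ** 0.5
        heads = torch.einsum('hqk,khd->qhd', scores.softmax(dim=-1), kv)
        B = self.W_o(heads.reshape(-1, dm))                  # (d_r, d_model)
        Z_bar = self.W_e @ (B + self.E_pos)                  # (D, d_model) low-rank correction
        Z_tilde = self.norm1(z_in + Z_bar.unsqueeze(1))      # broadcast across the time-frequency dimension
        return self.norm2(Z_tilde + self.mlp(Z_tilde))
```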

3.4. Complexity analysis


Instead of using the input series directly, most of the operations in JTFT
are applied to the joint time-frequency domain representation. Its sequence
length L̂, which is the sum of the TD and FD representations length nt
and nf , remains constant and irrelevant to the input series length L. For
example, the maximum L in the experiments is 512. However, the maximum

L̂ is only 48, since we use at most 16 frequency components and 32 time-domain patches.

Table 1: Complexity analysis of different forecasting models

Methods      Time            Space           Methods       Time            Space
Autoformer   O(L log L)      O(L log L)      Crossformer   O(L^2)          O(L^2)
FEDformer    O(L)            O(L)            Informer      O(L log L)      O(L log L)
JTFT         O(L)            O(L)            LogTrans      O(L log^2 L)    O(L log^2 L)
LSTM         O(L)            O(L)            MICN          O(L)            O(L)
PatchTST     O(L^2)          O(L^2)          Pyraformer    O(L)            O(L)
Reformer     O(L log L)      O(L log L)      Transformer   O(L^2)          O(L^2)

The following analysis shows that the reduction of sequence
length allows JTFT to have low complexity. The time and space complexity
are not specified in the analysis since they are the same for all the main
modules in JTFT. We also ignore the channel number D because all the
complexity analyzed is proportional to it.
In JTFT, there are three main functional modules: preprocessing, en-
coder, and prediction head. The preprocessing, which comprises normaliza-
tion, patching, CDCT, and embedding, has a complexity of O(L̂L) = O(L).
The Transformer encoder takes most of the computation and memory of
JTFT in practice. However, its complexity is only O(L̂2 ) = O(1), which can
be a bit misleading since the coefficient is large. In contrast, the low-rank
attention layers have O(DL̂) = O(1) complexity and are much cheaper than
the Transformer encoder in practical applications. The prediction head has
a complexity of O(L̂T ) = O(T ), which can be approximated as O(L) under
normal conditions where the target window T and look-back window L have
the same order of magnitude. Consequently, the overall complexity of JTFT
is O(L).
Table 1 compares the time and space complexity of various time series
forecasting models. JTFT is one of the prediction models with the lowest
complexity (O(L)). Notably, Crossformer and PatchTST have high theoret-
ical complexities (O(L2 )), but they effectively reduce actual computational
costs by using segment projections as input to Transformer encoders instead
of individual time points.

4. Experiments
Datasets: We assess the performance of JTFT on six real-world datasets,
including Exchange, Weather, Traffic, Electricity (Electricity Consumption
Load), ILI (Influenza-Like Illness), and ETTm2 (Electricity Transformer
Temperature-minutely). The datasets are divided into training, validation,
and test sets following Nie et al. (2023), with split ratios of 0.6:0.2:0.2 for
ETTm2 and 0.7:0.1:0.2 for other datasets.
Baselines: We use several SOTA models for time series forecasting as
baselines, including PatchTST (Nie et al., 2023), Crossformer (Zhang & Yan,
2023), FEDformer (Zhou et al., 2022b), FiLM (Zhou et al., 2022a), DLinear
(Zeng et al., 2022), DeepTime (Woo et al., 2023), TSMixer (Chen et al.,
2023). Classic models such as ARIMA, basic RNN/LSTM/CNN models,
and some popular Transformer-based models, including LogTrans (Li et al.,
2019), Reformer (Kitaev et al., 2020), Pyraformer (Liu et al., 2022a), Auto-
former (Wu et al., 2021), and Informer (Zhou et al., 2021) are not included
in the main results because they exhibit relatively inferior performance, as
shown in (Zhou et al., 2022b; Zeng et al., 2022; Nie et al., 2023).
Experimental Settings: The forecasting length for ILI is T ∈ {24, 36, 48, 60},
while for the other datasets, it is T ∈ {96, 192, 336, 720}. Baseline results
for PatchTST, DLinear, FiLM, DeepTime, and TSMixer are obtained from
their original papers, with the look-back window L either searched for or
set to a suggested value. Specific to PatchTST, we present the results of
PatchTST/64, which generally performs better overall than PatchTST/42.
For other methods or settings not available in the literature, L is deter-
mined through grid search to establish strong baselines. The search range
is {24, 48, 84, 96, 128} for the two smaller datasets (ILI and Exchange) and
{48, 96, 192, 336, 512, 720} for the other datasets. In contrast, the JTFT con-
figuration aligns with PatchTST/64, where L is set at 128 for the smaller
datasets and 512 for the larger ones. The evaluation metrics reported include
Mean Squared Error (MSE) and Mean Absolute Error (MAE) for multivari-
ate time series forecasting.

4.1. Main results


Table 2 presents the multivariate forecasting results, where JTFT ex-
cels over all baseline methods. Across 60 different settings involving various
datasets, evaluation metrics, and diverse prediction lengths, it ranks top-1

in 54 settings and top-2 in the remaining 6 settings. This performance surpasses previous FD methods like FEDformer and FiLM, as well as the SOTA Transformer-based model, PatchTST. Worth noting is that FEDformer significantly outperforms earlier well-known Transformer models such as Autoformer, Informer, LogTrans, and Performer (Zhou et al., 2022b), and FiLM is an improved iteration of FEDformer proposed by the same team. The channel-independent PatchTST outperforms Crossformer, designed for extracting cross-channel dependencies. The results highlight the challenge of improving the performance of SOTA channel-independent methods by incorporating cross-channel information in multivariate time series forecasting, primarily due to the relatively smaller sizes of available datasets compared to applications like computer vision (CV) and natural language processing (NLP), which carry a higher risk of overfitting. Alongside Transformer-based methods, DLinear, FiLM, DeepTime, and TSMixer are shown to be competitive.

Table 2: Multivariate long-term series forecasting results on 6 datasets. The best results are highlighted in bold, and the second best results are underlined.

Models   JTFT         PatchTST     Crossformer  FEDformer    FiLM         DLinear      DeepTime     TSMixer
Metric   MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE

Exchange
96    0.080 0.199  0.085 0.202  0.285 0.410  0.106 0.245  0.086 0.204  0.081 0.203  0.081 0.205  0.081 0.198
192   0.148 0.279  0.178 0.299  0.536 0.544  0.214 0.357  0.188 0.292  0.157 0.293  0.151 0.284  0.176 0.297
336   0.260 0.381  0.329 0.415  0.804 0.731  0.413 0.493  0.356 0.433  0.305 0.414  0.314 0.412  0.334 0.416
720   0.667 0.618  0.901 0.715  1.266 0.926  1.038 0.796  0.727 0.669  0.643 0.601  0.856 0.663  0.867 0.702
avg   0.289 0.369  0.373 0.408  0.723 0.653  0.443 0.473  0.339 0.400  0.296 0.378  0.350 0.391  0.364 0.403

Weather
96    0.144 0.186  0.149 0.198  0.148 0.212  0.238 0.314  0.199 0.262  0.176 0.237  0.166 0.221  0.145 0.198
192   0.187 0.228  0.194 0.241  0.191 0.258  0.275 0.329  0.228 0.288  0.22  0.282  0.207 0.261  0.191 0.242
336   0.237 0.270  0.245 0.282  0.244 0.308  0.339 0.377  0.267 0.323  0.265 0.319  0.251 0.298  0.242 0.280
720   0.308 0.321  0.314 0.334  0.311 0.355  0.389 0.409  0.319 0.361  0.323 0.362  0.301 0.338  0.320 0.336
avg   0.219 0.251  0.226 0.264  0.224 0.283  0.310 0.357  0.253 0.308  0.246 0.300  0.231 0.280  0.225 0.264

Traffic
96    0.351 0.232  0.360 0.249  0.482 0.268  0.576 0.359  0.416 0.294  0.410 0.282  0.390 0.275  0.376 0.264
192   0.374 0.243  0.379 0.256  0.495 0.271  0.610 0.380  0.408 0.288  0.423 0.287  0.402 0.278  0.397 0.277
336   0.385 0.249  0.392 0.264  0.512 0.280  0.608 0.375  0.425 0.298  0.436 0.296  0.415 0.288  0.413 0.290
720   0.429 0.275  0.432 0.286  0.561 0.313  0.621 0.375  0.52  0.353  0.466 0.315  0.449 0.307  0.444 0.306
avg   0.385 0.250  0.391 0.264  0.512 0.283  0.604 0.372  0.442 0.308  0.434 0.295  0.414 0.287  0.407 0.284

Electricity
96    0.131 0.224  0.129 0.222  0.217 0.311  0.186 0.302  0.154 0.267  0.140 0.237  0.137 0.238  0.131 0.229
192   0.144 0.237  0.147 0.240  0.263 0.337  0.197 0.311  0.164 0.258  0.153 0.249  0.152 0.252  0.151 0.246
336   0.157 0.252  0.163 0.259  0.319 0.370  0.213 0.328  0.188 0.283  0.169 0.267  0.166 0.268  0.161 0.261
720   0.182 0.275  0.197 0.290  0.388 0.412  0.233 0.344  0.236 0.332  0.203 0.301  0.201 0.302  0.197 0.293
avg   0.154 0.247  0.159 0.253  0.297 0.357  0.207 0.321  0.185 0.285  0.166 0.264  0.164 0.265  0.160 0.257

ILI
24    1.027 0.604  1.319 0.754  2.918 1.139  2.624 1.095  1.970 0.875  2.215 1.081  2.425 1.086  2.415 1.058
36    0.995 0.621  1.579 0.870  3.020 1.123  2.516 1.021  1.982 0.859  1.963 0.963  2.231 1.008  2.280 1.027
48    0.980 0.637  1.553 0.815  3.241 1.192  2.505 1.041  1.868 0.896  2.130 1.024  2.230 1.016  2.379 1.056
60    1.386 0.760  1.470 0.788  3.324 1.188  2.742 1.122  2.057 0.929  2.368 1.096  2.143 0.985  2.370 1.047
avg   1.097 0.655  1.480 0.807  3.126 1.161  2.597 1.070  1.969 0.890  2.169 1.041  2.257 1.024  2.361 1.047

ETTm2
96    0.160 0.247  0.166 0.256  0.280 0.371  0.180 0.271  0.165 0.256  0.167 0.260  0.166 0.257  0.163 0.252
192   0.213 0.284  0.223 0.296  0.364 0.446  0.252 0.318  0.222 0.296  0.224 0.303  0.225 0.302  0.216 0.290
336   0.265 0.319  0.274 0.329  0.990 0.734  0.324 0.364  0.277 0.333  0.281 0.342  0.277 0.336  0.268 0.324
720   0.348 0.373  0.362 0.385  1.892 1.026  0.410 0.420  0.371 0.389  0.397 0.421  0.383 0.409  0.420 0.422
avg   0.247 0.306  0.256 0.317  0.881 0.644  0.291 0.343  0.259 0.319  0.267 0.332  0.263 0.326  0.267 0.322

4.2. Ablation study


In this subsection, we investigate the impact of the Joint Time-Frequency
Domain Representation (JTFR) and Low-Rank Attention (LRA) layers. We
refer to JTFT with only a TD representation and no LRA as TDR, JTFT
with only an FD representation and no LRA as FDR, and the one with both
TD and FD representations but no LRA as JTFR. The performance of TDR,
FDR, and JTFR is assessed on three datasets: ILI, Weather, and Electricity,
representing small, medium, and large datasets, respectively. The results are
compared with PatchTST in Table 3. FEDformer and FiLM are also included
in the baseline as previous SOTA FD methods, which utilize random and low
frequencies, respectively.
Table 3 demonstrates that JTFR enhances performance in comparison to
both TDR and FDR across most settings. The inclusion of both TD and FD
representations provides additional information to correct errors in predic-
tions made using only one of them. Specifically, JTFR achieves performance
comparable to, or in some cases, slightly better than PatchTST while reduc-
ing the length of the representation L̂. Here, L̂ corresponds to the patch
number in PatchTST and is the sum of TD and FD patch numbers (nt and
nf ) in JTFT. In the experiment, L̂ is reduced from 64 to 32 in Weather and
48 in ILI and Electricity, reducing the computation and memory costs since
the complexity of Transformers is proportional to O(L̂2 ). Additionally, FDR
significantly outperforms FEDformer and even surpasses FiLM in most of
the settings, highlighting the effectiveness of learnable frequencies. JTFT
outperforms JTFR, showcasing that LRA inclusion enhances prediction per-
formance by leveraging cross-channel correlations.

4.3. Actual time efficiency and memory usage


In some cases, methods with low theoretical complexity may still incur
significant costs in practice due to large coefficients in the expressions. In
this subsection, we compare the actual time and memory costs of JTFT
with other Transformer-based methods (e.g., PatchTST, FEDformer, Cross-
former, Autoformer, Informer) and DLinear.
The experiments were conducted on the Weather dataset using an NVIDIA
RTX 3090 GPU. All previous methods used the settings detailed in their
original papers. To ensure a fair comparison of computational efficiency and

memory usage, the batch size was uniformly set to 128. However, for Autoformer and Informer, a batch size of 64 was used for look-back windows of 96, 192, and 336, respectively, due to GPU memory limitations. Tests were not performed on the two methods with larger look-back windows, as doing so would have necessitated a further reduction in batch sizes, potentially resulting in an unfair comparison of speed. Additionally, the compilation function in PyTorch 2 was disabled.

Table 3: Ablation study of the Joint Time-Frequency Domain Representation (JTFR) and Low-Rank Attention Layer (LRA) in JTFT. TDR denotes JTFT with only a TD representation and no LRA, FDR represents JTFT with only an FD representation and no LRA, and JTFR includes both TD and FD representations but no LRA. The best results are highlighted in bold, and the second-best results are underlined.

Models   JTFT         TDR          FDR          JTFR         FEDformer    FiLM         PatchTST
Metric   MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE

ILI
24    1.027 0.604  1.509 0.718  1.626 0.835  1.124 0.668  2.624 1.095  1.970 0.875  1.319 0.754
36    0.995 0.621  1.449 0.715  1.739 0.887  1.055 0.653  1.153 0.679  1.982 0.859  1.579 0.870
48    0.980 0.637  1.490 0.738  1.864 0.957  1.107 0.690  2.505 1.041  1.868 0.896  1.553 0.815
60    1.386 0.760  1.818 0.846  1.946 0.993  1.454 0.794  2.742 1.122  2.057 0.929  1.470 0.788
avg   1.097 0.656  1.567 0.754  1.794 0.918  1.185 0.701  2.256 0.984  1.969 0.890  1.480 0.807

Weather
96    0.144 0.186  0.158 0.201  0.147 0.193  0.147 0.190  0.238 0.314  0.199 0.262  0.149 0.198
192   0.187 0.228  0.200 0.240  0.191 0.237  0.192 0.234  0.275 0.329  0.228 0.288  0.194 0.241
336   0.237 0.270  0.251 0.280  0.245 0.280  0.245 0.276  0.339 0.377  0.267 0.323  0.245 0.282
720   0.308 0.321  0.322 0.332  0.319 0.334  0.316 0.328  0.389 0.409  0.319 0.361  0.314 0.334
avg   0.219 0.251  0.233 0.263  0.226 0.261  0.225 0.257  0.310 0.357  0.253 0.309  0.226 0.262

Electricity
96    0.131 0.224  0.132 0.221  0.146 0.245  0.131 0.222  0.186 0.302  0.154 0.267  0.129 0.222
192   0.144 0.237  0.148 0.237  0.161 0.259  0.147 0.237  0.197 0.311  0.164 0.258  0.147 0.240
336   0.157 0.252  0.164 0.255  0.176 0.272  0.163 0.253  0.213 0.328  0.188 0.283  0.163 0.259
720   0.182 0.275  0.205 0.288  0.211 0.300  0.200 0.286  0.233 0.344  0.236 0.332  0.197 0.290
avg   0.154 0.247  0.162 0.250  0.174 0.269  0.160 0.250  0.207 0.321  0.186 0.285  0.159 0.253
The results are displayed in Figure 5. Among the Transformer-based
methods, JTFT is the fastest and requires the least memory. While DLinear
is faster than JTFT, its accuracy is inferior. These results highlight that
JTFT is an efficient approach, especially when considering both speed and
accuracy. Further details and explanations are provided below.
Firstly, the computation and memory costs for different look-back win-
dows L are compared in Figures 5 (a, b, c), where the target window size
consistently set at 720. According to the complexity analysis, although the
time and space complexity of JTFT is O(L), the real time and memory cost
are mainly decided by the input length of the Transformer encoder, denoted
by L̂ = nt + nf . In the expression, nt and nf are the length of TD and FD

representation, which are assigned by users and decoupled from L. Typically, L̂ is less than the number of patches (refer to Appendix A), since the FD and TD representations are linearly dependent otherwise.

Figure 5: Comparison of actual runtime efficiency and memory usage. Panels (a), (b), and (c) show the training time (ms/sample), inference time (ms/sample), and maximum GPU memory (GB) for various look-back windows while the target window is 720. Panels (d), (e), and (f) show the same metrics for various target windows when the look-back window is 512. In JTFT_C, the TD and FD representation lengths remain constant, while in JTFT_V, the representation lengths increase roughly linearly with the look-back window. In panels (d), (e), and (f), JTFT uses the largest representation length of JTFT_V. In general, JTFT is more efficient in both computation and memory than the other Transformer-based methods in the comparison.
Two versions of JTFT with different strategies to set n_t and n_f are considered. JTFT_C uses a constant (n_t, n_f) = (8, 8); it is not applied for L = 96 because the number of patches would then be less than L̂. In JTFT_V, (n_t, n_f) increases along with L: it is set to (4, 4), (8, 8), (12, 12), and (16, 16) for L of 96, 192, 336, and 512, respectively.
Figures 5 (a, b, c) demonstrate that both JTFT_C and JTFT_V out-
perform the other Transformer-based methods in most settings. The compu-
tational time and GPU memory usage of JTFT_C remain almost constant
with respect to L because the costs of encoders and heads, which account for
most of the expenses in JTFT, do not increase with L when L̂ is kept con-
stant. In contrast, for JTFT_V, the computation and memory costs increase
as L̂ grows with L. JTFT_C and JTFT_V require slightly more memory
compared to PatchTST when L = 192 because they include additional LRA
layers.
Next, we compare the computational and memory costs for various target
windows (T ) in Figure 5 (d, e, f). The look-back window size is consistently
set to 512 in these figures. Similar to the previous results, JTFT remains
faster and requires less memory compared to the other Transformer-based
methods. The costs increase at a small and roughly constant rate with respect
to T . This behavior is because, when both L and L̂ are fixed, only the cost
of the prediction head, which is not resource-intensive, increases along with
T.

5. Conclusions and future works


This paper introduces JTFT, a joint time-frequency domain Transformer
for multivariate time series forecasting. JTFT effectively captures multi-
scale structures using a small number of learnable frequencies, while also
leveraging the latest time-domain data to enhance local relation learning and
mitigate the adverse effects of non-stationarity. Additionally, it utilizes a low-
rank attention layer to extract cross-channel correlations while alleviating the
entanglement with the modeling of temporal dependencies. JTFT has linear
complexity in both time and space. Extensive experiments on 6 real-world
datasets demonstrate that our method achieves state-of-the-art performance
in long-term forecasting.

In addition to its application in the Transformer architecture, the joint
time-frequency domain representation could be utilized in other types of
neural network models, including those based on CNNs and MLPs. Moreover,
this representation holds the potential for adoption in various domains, such
as NLP, where input sequences tend to be lengthy.

6. Acknowledgments
This work was supported in part by the National Natural Science Foun-
dation of China (Grant No. T2125006, U2242210, 42174057, 61972231,
62102114, 62202119), the Jiangsu Innovation Capacity Building Program
(Grant No. BM2022028), the Science and Technology Project of Qinghai
Province (Grant No. 2023-QY-208), and the Key Research Project of Zhe-
jiang Lab (Grant No. 2021PB0AC01).

Appendix A. A brief introduction to patching in time series forecasting
As analyzed in Section 3, the patching technique and channel-independent
setting introduced in PatchTST have been found to be effective in extracting
temporal dependencies, and are thus applied in JTFT. We further propose
the time-frequency-independence setting to exploit the cross-channel depen-
dencies with low redundancy after the temporal modeling. Although these
approaches are not inherently complex, they may not be intuitive for readers
who are unfamiliar with PatchTST. Therefore, some brief explanations are
presented below.
Patching involves dividing the time series into either overlapped or non-
overlapped continuous segments, which serve as the fundamental input units
for subsequent modeling steps. This technique retains local semantic informa-
tion within the embedding, thereby enhancing the capture of comprehensive
semantic information that may not be available at the point-level. More-
over, it reduces the length of inputs for Transformer encoders, resulting in
significant computational and memory savings.
During the patching process, each channel of the input multivariate time
series is treated as a separate univariate time series. The input univariate
time series are then divided into patches, with parameters defined as follows:
the patch length is denoted as lp , and the stride as ls . It is assumed that the
input length L and patch length lp are divisible by ls , and the time series

is padded by repeating the last element ls times. In this setup, a total of
L̄ = (L − lp )/ls + 2 patches are generated by extracting continuous segments
within the input sequence. The input is rearranged as
$$\mathrm{Patching}(\mathbf{x}_{1:L}) = \mathbf{Z}_{pch}, \qquad (A.1)$$

where the input series x_{1:L} ∈ R^{D×L}, the patched series Z_pch ∈ R^{D×L̄×l_p}, and

$$\left(\mathbf{Z}_{pch}\right)_{ij} = \mathbf{x}^{i}_{(j-1)l_s+1:(j-1)l_s+l_p} \qquad (A.2)$$

for channel index i ∈ {1, · · · , D} and patch index j ∈ {1, · · · , L̄}.


Patching reduces the input sequence length of the Transformer encoder from L to approximately L/l_s, which results in significant savings in computation and memory. This reduction also allows for more efficient processing of long sequences.
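A minimal sketch of the patching step of Eqs. (A.1)-(A.2) follows; the function and argument names are illustrative.

```python
import torch

def patch_series(x: torch.Tensor, patch_len: int, stride: int) -> torch.Tensor:
    """Pad by repeating the last value `stride` times, then unfold into (possibly
    overlapping) segments. x: (D, L) -> (D, L_bar, patch_len) with
    L_bar = (L - patch_len) / stride + 2."""
    x = torch.cat([x, x[:, -1:].repeat(1, stride)], dim=-1)
    return x.unfold(dimension=-1, size=patch_len, step=stride)

# Example with the settings used for the larger datasets (L = 512, l_p = 16, l_s = 8):
patches = patch_series(torch.randn(7, 512), patch_len=16, stride=8)
assert patches.shape == (7, 64, 16)   # L_bar = (512 - 16) / 8 + 2 = 64
```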
The channel-independent (CI) setting treats each channel of a multivari-
ate series as an independent univariate series. The network structures under
the CI setting are shared for all channels, so that much more data samples are
available in training. Consequently, CI networks are more stable to noise and
less prone to overfitting. In implementation, the patched data is permuted
and the channel dimension is merged into the batch-size dimension, that
enables the data to be processed by a vanilla Transformer encoder directly.
The overall structure of PatchTST is under the CI setting, which brings an
enormous advantage compared with the channel-mixed models and achieves
SOTA performance. However, the trade-off of the CI setting is that it is
unable to capture cross-channel dependencies.
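To make the CI reshape concrete, here is a tiny shape-level sketch; the tensor names and sizes are illustrative.

```python
import torch

# Channel-independent (CI) processing: fold the channel dimension into the batch
# dimension so one shared (vanilla) Transformer encoder handles every channel.
B, D, L_hat, d_model = 32, 7, 48, 16
z = torch.randn(B, D, L_hat, d_model)
z_ci = z.reshape(B * D, L_hat, d_model)          # input to the shared encoder
z_restored = z_ci.reshape(B, D, L_hat, d_model)  # split channels back afterwards
```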
In order to utilize the cross-channel relations, we apply a time-frequency-
independence (TFI) setting in our low-rank attention (LRA) layers. It is
proposed based on the observation that incorporating CI and non-CI net-
work structures typically leads to degradation of performance compared
with the strong baseline of PatchTST. TFI shares the network structures
along the time-frequency dimension instead of the channel dimension in CI.
Furthermore, being more compact, it also shares the increments resulting
from cross-channel interactions across the time-frequency dimension. This
approach significantly reduces the updating space in LRA, thereby mitigat-
ing the entanglement and redundancy associated with modeling temporal
and channel-wise relations simultaneously.
JTFT successfully combines the CI and TFI settings. The Transformer
encoder and its embedding are channel-independent, resulting in a CI repre-
sentation of the input series. The representation is fed to the TFI LRA layers

to incorporate cross-channel information. Finally, the prediction head maps the representation to the prediction. The head is also shared along the channel dimension, but the channel-wise information has been integrated into its inputs. Empirical results show that the mixed CI_TFI design is effective on most datasets.

Table B.4: Statistics of the benchmark datasets

Dataset      ETTm2     Electricity   Exchange   ILI      Traffic   Weather
Length       69980     26304         7588       966      17544     52696
Channels     7         321           8          7        862       21
Frequency    15 min    1 hour        1 day      7 days   1 hour    10 min

Appendix B. Datasets details


The details of the datasets are introduced as follows: (1) ETTm2 (Zhou
et al., 2021) contains the electricity transformer temperature and power loads
collected from 2 counties in China. It is a 15-minute-level dataset spanning
2 years. (2) Electricity 2 comprises the hourly electricity consumption of 321
customers, also spanning 2 years. (3) Exchange (Lai et al., 2018) logs the
daily exchange rates of eight different countries from 1990 to 2016. (4) ILI 3
includes the weekly data of recorded influenza-like illness (ILI) patients from
the Centers for Disease Control and Prevention of the United States, covering
the period from 2002 to 2021. This dataset provides information on the ratio
of patients seen with ILI to the total number of patients. (5) Traffic 4 consists
of hourly data obtained from the California Department of Transportation,
providing information on road occupancy rates measured by various sensors
installed on freeways in the San Francisco Bay area. (6) Weather 5 is captured
at 10-minute intervals throughout the year 2020, featuring 21 meteorological
indicators, such as air temperature, humidity, and wind speed. The statistics
of the datasets are summarized in Table B.4. All the datasets can be accessed
from Autoformer.

2. https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
3. https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html
4. http://pems.dot.ca.gov/
5. https://www.bgc-jena.mpg.de/wetter/

Appendix C. Experimental details
By default, JTFT consists of 3 Transformer encoder layers and 1 low-rank attention (LRA) layer. However, the number of LRA layers is set to 0 for the Traffic dataset, as LRA negatively impacts performance there. The dimension of the latent space in the Transformer encoder, denoted as d_m, increases with the dataset size. Specifically, it is set to 8 for ILI and Exchange, 16 for ETTm2, and 128 for Weather, Electricity, and Traffic. The d_m of the LRA is half of that in the Transformer encoder. For ILI and Exchange, the patch length and stride are set to (4, 2), while for the other datasets, they are (16, 8). The lengths of the TD and FD representations (n_t, n_f) are searched over (16, 16) and (16, 32).
Our method is implemented in PyTorch and trained on a workstation
equipped with 4 NVIDIA RTX 3090 GPUs, each with 24GB memory. All 4
GPUs are utilized for training on the Electricity and Traffic datasets, while
only 1 GPU is used for the remaining datasets.

Appendix D. Multiple random runs


The results of JTFT reported in the main text are obtained using a fixed
random seed of 1. To assess the robustness of the method, we conduct
5 random runs on the ILI dataset, and 3 runs on the Weather and Traffic
datasets. These datasets represent small, medium, and large datasets, respec-
tively. PatchTST, which performed the best among the previous experiments
except for JTFT, is included as a baseline, along with DLinear.
The mean and standard deviation of the MSE and MAE are reported
in Table D.5. The results demonstrate that the variances of JTFT are low,
particularly for the Weather and Traffic datasets. Slightly higher variances
are observed for ILI, which can be attributed to the smaller size of the dataset.
The comparison also reveals that JTFT significantly outperforms PatchTST
and DLinear in the majority of experimental settings, as the improvements
in the mean metrics are much larger than the standard deviations. While
DLinear exhibits inferior performance compared to JTFT and PatchTST,
it displays remarkably low variance on ILI and Traffic. This characteristic
may stem from the model’s smaller number of parameters compared to the
other methods.
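The aggregation behind Table D.5 is straightforward; a minimal sketch is
given below, where train_and_evaluate is a hypothetical helper standing in
for the full training and evaluation pipeline and is assumed to return the
test (MSE, MAE) pair for a given seed.

    import numpy as np

    def multi_seed_summary(train_and_evaluate, seeds=(1, 2, 3, 4, 5)):
        # Run the experiment once per seed and report mean and std of MSE/MAE.
        results = np.array([train_and_evaluate(seed) for seed in seeds])  # shape: (n_seeds, 2)
        mean, std = results.mean(axis=0), results.std(axis=0)
        return {
            "MSE": f"{mean[0]:.4f} ± {std[0]:.4f}",
            "MAE": f"{mean[1]:.4f} ± {std[1]:.4f}",
        }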

Table D.5: Multivariate long-term series forecasting results on 3 datasets, showing both
the mean and the standard deviation over the random runs. The best results are
highlighted in bold, and the second best results are underlined.

Dataset | Pred. len | JTFT MSE        | JTFT MAE        | PatchTST MSE    | PatchTST MAE    | DLinear MSE     | DLinear MAE
ILI     |  24       | 0.9940 ± 0.0461 | 0.6053 ± 0.0064 | 1.3989 ± 0.0844 | 0.7670 ± 0.0233 | 1.9569 ± 0.0232 | 0.9788 ± 0.0070
ILI     |  36       | 1.0380 ± 0.0744 | 0.6414 ± 0.0191 | 1.2534 ± 0.0778 | 0.7381 ± 0.0240 | 2.0823 ± 0.0067 | 0.9980 ± 0.0024
ILI     |  48       | 1.0639 ± 0.0597 | 0.6639 ± 0.0174 | 1.6462 ± 0.1520 | 0.8318 ± 0.0513 | 2.1333 ± 0.0063 | 1.0251 ± 0.0022
ILI     |  60       | 1.3866 ± 0.0568 | 0.7593 ± 0.0152 | 1.4527 ± 0.0547 | 0.8008 ± 0.0105 | 2.3175 ± 0.0178 | 1.0854 ± 0.0090
Weather |  96       | 0.1440 ± 0.0003 | 0.1875 ± 0.0016 | 0.1483 ± 0.0004 | 0.1978 ± 0.0002 | 0.1690 ± 0.0004 | 0.2300 ± 0.0015
Weather | 192       | 0.1876 ± 0.0008 | 0.2300 ± 0.0016 | 0.1943 ± 0.0013 | 0.2419 ± 0.0013 | 0.2133 ± 0.0012 | 0.2724 ± 0.0033
Weather | 336       | 0.2385 ± 0.0013 | 0.2710 ± 0.0013 | 0.2461 ± 0.0010 | 0.2828 ± 0.0007 | 0.2570 ± 0.0016 | 0.3078 ± 0.0030
Weather | 720       | 0.3087 ± 0.0008 | 0.3233 ± 0.0019 | 0.3126 ± 0.0011 | 0.3329 ± 0.0011 | 0.3161 ± 0.0016 | 0.3559 ± 0.0026
Traffic |  96       | 0.3525 ± 0.0021 | 0.2339 ± 0.0016 | 0.3602 ± 0.0006 | 0.2487 ± 0.0003 | 0.4101 ± 0.0001 | 0.2819 ± 0.0002
Traffic | 192       | 0.3732 ± 0.0005 | 0.2425 ± 0.0002 | 0.3788 ± 0.0004 | 0.2560 ± 0.0002 | 0.4227 ± 0.0003 | 0.2873 ± 0.0002
Traffic | 336       | 0.3854 ± 0.0013 | 0.2500 ± 0.0008 | 0.3917 ± 0.0011 | 0.2639 ± 0.0007 | 0.4357 ± 0.0003 | 0.2956 ± 0.0004
Traffic | 720       | 0.4292 ± 0.0006 | 0.2753 ± 0.0005 | 0.4322 ± 0.0009 | 0.2863 ± 0.0003 | 0.4658 ± 0.0001 | 0.3148 ± 0.0001

References
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization.
arXiv:1607.06450.

Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of


generic convolutional and recurrent networks for sequence modeling. arXiv
preprint arXiv:1803.01271.

Bao, H., Dong, L., Piao, S., & Wei, F. (2022). BEiT: BERT pre-training of
image transformers. In International Conference on Learning Representa-
tions. URL: https://openreview.net/forum?id=p-BhZSz59o4.

Borovykh, A., Bohté, S. M., & Oosterlee, C. W. (2017). Conditional time


series forecasting with convolutional neural networks. arXiv: Machine
Learning.

Bouchachia, A., & Bouchachia, S. (2008). Ensemble learning for time series
prediction. Proceedings of the 1st International Workshop on Nonlinear
Dynamics and Synchronization.

Box, G. E., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time
series analysis: forecasting and control. John Wiley & Sons.

Cao, D., Wang, Y., Duan, J., Zhang, C., Zhu, X., Huang, C., Tong, Y., Xu,
B., Bai, J., Tong, J., & Zhang, Q. (2020). Spectral temporal graph neural
network for multivariate time-series forecasting. arXiv, abs/2103.07719.
Challu, C., Olivares, K. G., Oreshkin, B. N., Ramirez, F. G., Canseco, M. M.,
& Dubrawski, A. (2023). Nhits: Neural hierarchical interpolation for time
series forecasting. In Proceedings of the AAAI Conference on Artificial
Intelligence (pp. 6989–6997). volume 37.
Chaovalit, P., Gangopadhyay, A., Karabatis, G., & Chen, Z. (2011). Discrete
wavelet transform-based time series analysis and mining. ACM Computing
Surveys (CSUR), 43 , 1–37.
Chen, S., Wang, X. X., & Harris, C. J. (2008). NARX-based nonlinear system
identification using orthogonal least squares basis hunting. IEEE Trans-
actions on Control Systems, (pp. 78–84).
Chen, S.-A., Li, C.-L., Arik, S. O., Yoder, N. C., & Pfister, T. (2023).
TSMixer: An all-MLP architecture for time series forecasting. Transac-
tions on Machine Learning Research. URL: https://openreview.net/
forum?id=wbpxTuXgm0.
Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A.,
Sarlós, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger,
D. B., Colwell, L. J., & Weller, A. (2021). Rethinking attention with
performers. In 9th International Conference on Learning Representations
(ICLR), Virtual Event, Austria, May 3-7, 2021 .
Das, A., Kong, W., Leach, A., Sen, R., & Yu, R. (2023). Long-
term forecasting with TiDE: Time-series dense encoder. arXiv preprint
arXiv:2304.08424.
Ding, Y., Jia, M., Miao, Q., & Cao, Y. (2022). A novel time–frequency
transformer based on self–attention mechanism and its application in fault
diagnosis of rolling bearings. Mechanical Systems and Signal Processing,
168 , 108616.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Un-
terthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszko-
reit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale. In International Conference on

Learning Representations. URL: https://openreview.net/forum?id=
YicbFdNTTy.

Ekambaram, V., Jati, A., Nguyen, N., Sinthong, P., & Kalagnanam, J.
(2023). TSMixer: Lightweight MLP-Mixer model for multivariate time series
forecasting. arXiv preprint arXiv:2306.09364.

Frigola, R., & Rasmussen, C. E. (2014). Integrated pre-processing for


Bayesian nonlinear system identification with Gaussian processes. IEEE
Conference on Decision and Control, (pp. 552–560).

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. B. (2021). Masked
autoencoders are scalable vision learners. CoRR, abs/2111.06377. URL:
https://arxiv.org/abs/2111.06377. arXiv:2111.06377.

Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus).
arXiv preprint arXiv:1606.08415.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural


Computation, 9 , 1735–1780.

Kalyan, K. S., Rajasekharan, A., & Sangeetha, S. (2021). Ammus: A survey


of transformer-based pretrained models in natural language processing.
arXiv preprint arXiv:2108.05542.

Karita, S., Chen, N., Hayashi, T., Hori, T., Inaguma, H., Jiang, Z., Someki,
M., Soplin, N. E. Y., Yamamoto, R., Wang, X. et al. (2019). A comparative
study on transformer vs rnn in speech applications. In IEEE Automatic
Speech Recognition and Understanding Workshop (ASRU) (pp. 449–456).
IEEE.

Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., & Shah,
M. (2021). Transformers in vision: A survey. ACM Computing Surveys
(CSUR).

Kitaev, N., Kaiser, L., & Levskaya, A. (2020). Reformer: The efficient
transformer. In 8th International Conference on Learning Representations,
ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 .

Lai, G., Chang, W.-C., Yang, Y., & Liu, H. (2018). Modeling long-and short-
term temporal patterns with deep neural networks. In The 41st interna-
tional ACM SIGIR conference on research & development in information
retrieval (pp. 95–104).

Lee-Thorp, J., Ainslie, J., Eckstein, I., & Ontanon, S. (2022). Fnet: Mixing
tokens with fourier transforms. arXiv:2105.03824.

Li, B., Cui, W., Zhang, L., Zhu, C., Wang, W., Tsang, I., & Zhou, J. T.
(2023a). Difformer: Multi-resolutional differencing transformer with dy-
namic ranging for time series analysis. IEEE Transactions on Pattern
Analysis and Machine Intelligence.

Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X., & Yan, X. (2019).
Enhancing the locality and breaking the memory bottleneck of transformer
on time series forecasting. In Advances in Neural Information Processing
Systems. volume 32. URL: https://proceedings.neurips.cc/paper/
2019/file/6775a0635c302542da2c32aa19d86be0-Paper.pdf.

Li, Z., Rao, Z., Pan, L., & Xu, Z. (2023b). Mts-mixers: Multivariate time se-
ries forecasting via factorized temporal and channel mixing. arXiv preprint
arXiv:2302.04501.

Liu, S., Yu, H., Liao, C., Li, J., Lin, W., Liu, A. X., & Dustdar, S. (2022a).
Pyraformer: Low-complexity pyramidal attention for long-range time se-
ries modeling and forecasting. In International Conference on Learning
Representations.

Liu, Y., Wu, H., Wang, J., & Long, M. (2022b). Non-stationary
transformers: Exploring the stationarity in time series forecast-
ing. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho,
& A. Oh (Eds.), Advances in Neural Information Processing Sys-
tems (pp. 9881–9893). Curran Associates, Inc. volume 35. URL:
https://proceedings.neurips.cc/paper_files/paper/2022/file/
4054556fcaa934b0bf76da52cf4f92cb-Paper-Conference.pdf.

Ma, X., Kong, X., Wang, S., Zhou, C., May, J., Ma, H., & Zettlemoyer, L.
(2021). Luna: Linear unified nested attention. CoRR, abs/2106.01540 .
arXiv:2106.01540.

Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. (2023). A time
series is worth 64 words: Long-term forecasting with transformers. In The
Eleventh International Conference on Learning Representations. URL:
https://openreview.net/forum?id=Jbdc0vTOcol.

Pascanu, R., Mikolov, T., & Bengio, Y. (2012). On the difficulty of train-
ing recurrent neural networks. In International Conference on Machine
Learning.

Petropoulos, F., Apiletti, D., Assimakopoulos, V. et al. (2022). Forecasting:
theory and practice. International Journal of Forecasting, 38, 705–871.
URL: https://www.sciencedirect.com/science/article/pii/S0169207021001758.
doi:https://doi.org/10.1016/j.ijforecast.2021.11.001.

Qin, Y., Song, D., Cheng, H., Cheng, W., Jiang, G., & Cottrell, G. W.
(2017). A dual-stage attention-based recurrent neural network for time se-
ries prediction. In International Joint Conference on Artificial Intelligence
(pp. 2627–2633).

Rangapuram, S. S., Seeger, M. W., Gasthaus, J., Stella, L., Wang, B., &
Januschowski, T. (2018). Deep state space models for time series forecast-
ing. In Neural Information Processing Systems.

Salinas, D., Flunkert, V., Gasthaus, J., & Januschowski, T. (2020). DeepAR:
Probabilistic forecasting with autoregressive recurrent networks. Interna-
tional Journal of Forecasting, 36 , 1181–1191.

Sen, R., Yu, H.-F., & Dhillon, I. S. (2019). Think globally, act locally: A
deep neural network approach to high-dimensional time series forecasting.
In Neural Information Processing Systems.

Shabani, M. A., Abdi, A. H., Meng, L., & Sylvain, T. (2023). Scaleformer: It-
erative multi-scale refining transformers for time series forecasting. In The
Eleventh International Conference on Learning Representations. URL:
https://openreview.net/forum?id=sCrnllCtjoE.

Singh, B. N., & Tiwari, A. K. (2006). Optimal selection of wavelet basis


function applied to ecg signal denoising. Digital signal processing, 16 ,
275–287.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R.
(2014). Dropout: a simple way to prevent neural networks from overfitting.
The journal of machine learning research, 15 , 1929–1958.

Sun, F.-K., & Boning, D. S. (2022). Fredo: Frequency domain-based long-


term time series forecasting. arXiv:2205.12301.

Tong, H., & Lim, K. S. (2009). Threshold autoregression, limit cycles and
cyclical data. In Exploration Of A Nonlinear World: An Appreciation of
Howell Tong’s Contributions to Statistics (pp. 9–56). World Scientific.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you
need. In Advances in Neural Information Processing Systems. vol-
ume 30. URL: https://proceedings.neurips.cc/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Wang, H., Peng, J., Huang, F., Wang, J., Chen, J., & Xiao, Y. (2023a).
MICN: Multi-scale local and global context modeling for long-term series
forecasting. In The Eleventh International Conference on Learning Repre-
sentations. URL: https://openreview.net/forum?id=zt53IDUR1U.

Wang, S., Li, B. Z., Khabsa, M., Fang, H., & Ma, H. (2020). Lin-
former: Self-attention with linear complexity. CoRR, abs/2006.04768 .
arXiv:2006.04768.

Wang, Z., Zhao, H., Zheng, M., Niu, S., Gao, X., & Li, L. (2023b). A novel
time series prediction method based on pooling compressed sensing echo
state network and its application in stock market. Neural Networks, 164 ,
216–227.

Wen, Q., He, K., Sun, L., Zhang, Y., Ke, M., & Xu, H. (2021). RobustPe-
riod: Time-frequency mining for robust multiple periodicity detection. In
Proceedings of the 2021 International Conference on Management of Data
(SIGMOD ’21) (pp. 205–215).

Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J., & Sun, L. (2022).
Transformers in time series: A survey. arXiv preprint arXiv:2202.07125.

Woo, G., Liu, C., Sahoo, D., Kumar, A., & Hoi, S. (2023). Learning deep
time-index models for time series forecasting.

Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., & Long, M. (2022). Timesnet:
Temporal 2d-variation modeling for general time series analysis. In The
Eleventh International Conference on Learning Representations.

Wu, H., Xu, J., Wang, J., & Long, M. (2021). Autoformer: Decomposition
transformers with Auto-Correlation for long-term series forecasting. In
Advances in Neural Information Processing Systems.

Wu, Z., Pan, S., Long, G., Jiang, J., Chang, X., & Zhang, C. (2020). Con-
necting the dots: Multivariate time series forecasting with graph neural
networks. Proceedings of the 26th ACM SIGKDD International Confer-
ence on Knowledge Discovery & Data Mining, .

Xiong, Y., Zeng, Z., Chakraborty, R., Tan, M., Fung, G., Li, Y., & Singh,
V. (2021). Nyströmformer: A nyström-based algorithm for approximating
self-attention. In Thirty-Fifth AAAI Conference on Artificial Intelligence
(pp. 14138–14148).

Yao, S., Piao, A., Jiang, W., Zhao, Y., Shao, H., Liu, S., Liu, D., Li, J.,
Wang, T., Hu, S. et al. (2019). stfnets: Learning sensing signals from the
time-frequency perspective with short-time fourier neural networks. In The
World Wide Web Conference (pp. 2192–2202).

Zeng, A., Chen, M., Zhang, L., & Xu, Q. (2022). Are transformers effective
for time series forecasting? arXiv preprint arXiv:2205.13504.

Zhang, Y., & Yan, J. (2023). Crossformer: Transformer utilizing cross-


dimension dependency for multivariate time series forecasting. In The
Eleventh International Conference on Learning Representations. URL:
https://openreview.net/forum?id=vSVLM2j9eie.

Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W.
(2021). Informer: Beyond efficient transformer for long sequence time-
series forecasting. In The Thirty-Fifth AAAI Conference on Artificial In-
telligence (pp. 11106–11115). volume 35.

Zhou, T., Ma, Z., Wen, Q., Sun, L., Yao, T., Yin, W., Jin, R. et al. (2022a).
Film: Frequency improved legendre memory model for long-term time
series forecasting. Advances in Neural Information Processing Systems,
35 , 12677–12690.

Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., & Jin, R. (2022b). FED-
former: Frequency enhanced decomposed transformer for long-term series
forecasting. In Proc. 39th International Conference on Machine Learning.

Zhou, Z., Zhong, R., Yang, C., Wang, Y., Yang, X., & Shen, W. (2022c). A
k-variate time series is worth k words: Evolution of the vanilla trans-
former architecture for long-term multivariate time series forecasting.
arXiv:2212.02789.

