A Joint Time-Frequency Domain Transformer For Multivariate Time Series Forecasting
a Department of Computer Science and Technology, Tsinghua University, RM.3-126, FIT Building, Haidian District, Beijing, 100084, China
b College of Computer Science and Mathematics, Fujian University of Technology, RM.213, Building C4, Fuzhou, Fujian, 350118, China
c Techorigin, No.581, Jianzhu West Road, Binhu District, Wuxi, Jiangsu, 214000, China
d Earth System Modeling and Prediction Center, No.46, Zhongguancun South Street, Haidian District, Beijing, 100081, China
Abstract
In order to enhance the performance of Transformer models for long-term
multivariate forecasting while minimizing computational demands, this paper
introduces the Joint Time-Frequency Domain Transformer (JTFT). JTFT
combines time and frequency domain representations to make predictions.
The frequency domain representation efficiently extracts multi-scale depen-
dencies while maintaining sparsity by utilizing a small number of learnable
frequencies. Simultaneously, the time domain (TD) representation is de-
rived from a fixed number of the most recent data points, strengthening the
modeling of local relationships and mitigating the effects of non-stationarity.
Importantly, the length of the representation remains independent of the
input sequence length, enabling JTFT to achieve linear computational com-
plexity. Furthermore, a low-rank attention layer is proposed to efficiently
capture cross-dimensional dependencies, thus preventing performance degra-
dation resulting from the entanglement of temporal and channel-wise model-
ing. Experimental results on six real-world datasets demonstrate that JTFT
outperforms state-of-the-art baselines in predictive performance.
1. Introduction
Time series forecasting predicts the future based on historical data. It has
broad applications including but not limited to climatology, energy, finance,
trading, and logistics (Petropoulos et al., 2022). Following the great success
of Transformers (Vaswani et al., 2017) in NLP (Kalyan et al., 2021), CV
(Khan et al., 2021), and speech (Karita et al., 2019), Transformers have
been introduced in time series forecasting and achieve promising results
(Wen et al., 2022).
One of the primary drawbacks of Transformers is their quadratic complex-
ity in both computation and memory, making them less suitable for long-term
forecasting. To address this limitation, a plethora of Transformer-based mod-
els, e.g., LogTrans, Informer, Autoformer, Performer, and Pyraformer (Li
et al., 2019; Zhou et al., 2021; Wu et al., 2021; Choromanski et al., 2021; Liu
et al., 2022a), have been proposed to enhance predictive performance while
maintaining low complexity. Notably, Zhou et al. (2022b) observed that
most time series which are dense in the time domain (TD) tend to have a
sparse representation in the frequency domain (FD). In response, they intro-
duced FEDformer, which leverages a randomly selected subset of frequency
components to exploit FD sparsity. This approach has linear complexity
and achieved state-of-the-art (SOTA) results in early 2022. However, the
sparse representation using random frequencies alone may not fully capture
the multi-scale characteristics of time series. As a result, seasonal-trend de-
composition remains necessary in FEDformer, despite its sensitivity to hyperparameter choices. Additionally, due to the periodic nature of the basis func-
tions, FD representation is vulnerable to non-stationarity. Enhancing the
sparse FD representation to capture multi-scale features has become crucial,
especially after the emergence of a simple linear model known as DLinear,
which outperformed existing Transformers in time-series forecasting (Zeng
et al., 2022). PatchTST (Nie et al., 2023), an improved Transformer with
a patching approach, has recently outperformed DLinear.
While PatchTST’s concept is valuable for improving FD Transformers, it
still exhibits quadratic complexity and cannot utilize cross-channel correla-
tions.
In this paper, we present a joint time-frequency domain Transformer
(JTFT). It exploits the FD sparsity in time series using a small number
of learnable frequencies and enhances the learning of local relations by in-
corporating a fixed number of the latest data points. A low-rank attention
layer is also used to effectively extract cross-channel dependencies. The main
contributions are fourfold:
• Time series are encoded based on both the sparse FD and the latest TD representations, which enables the Transformer to extract both long-term and local dependencies effectively with linear complexity.
2. Related work
Well-known traditional methods for time series forecasting include AR, ARIMA, VAR, and GARCH (Box et al., 2015), kernel methods (Chen et al., 2008), ensembles (Bouchachia & Bouchachia, 2008), Gaussian processes (Frigola & Rasmussen, 2014), regime switching (Tong & Lim, 2009), and so on. Benefit-
ing from the improvement in computing power and neural network structure,
many deep learning models have been proposed, and they often empirically surpass traditional methods in predictive performance.
Various deep neural network architectures, such as recurrent neural net-
works (RNN), convolutional neural networks (CNN), graph neural networks
(GNN), multi-layer perceptrons (MLP), and Transformers, have been widely
applied in time series forecasting. RNN-type models (Hochreiter & Schmid-
huber, 1997; Qin et al., 2017; Rangapuram et al., 2018; Salinas et al., 2020)
were once the most popular choice, but they often encountered issues re-
lated to gradient vanishing or exploding (Pascanu et al., 2012), limiting their
performance. CNN structures, on the other hand, excel at extracting local
features from time series (Bai et al., 2018; Borovykh et al., 2017; Sen et al.,
2019; Wang et al., 2023a; Wu et al., 2022). However, they usually require
many layers to effectively represent global relationships, mainly due to their
limited receptive field size. GNNs are increasingly applied to enhance both
temporal and dimensional pattern recognition in time series data (Wu et al.,
2020; Cao et al., 2020). Recent advancements (Challu et al., 2023; Li et al.,
2023b; Das et al., 2023; Ekambaram et al., 2023) suggest that MLP-type
structures remain competitive in forecasting.
Transformers are appealing for time series forecasting since their attention mechanism is effective at capturing long-term temporal dependencies. However, their computation and memory requirements are quadratic in the sequence length, which hinders their application to long-term modeling. Numerous methods have been proposed to improve the predictive performance and reduce the costs of Transformers. Typically, these methods exploit some form of sparsity. LogTrans (Li et al., 2019) introduces attention layers with a LogSparse design and achieves O(L log²(L)) complexity, where L is the input length. Informer (Zhou et al., 2021) selects the top-k queries in the attention matrix with a KL-divergence-based method and has O(L log L) complexity. Autoformer
(Wu et al., 2021) obtains the same complexity using an Auto-Correlation
mechanism. Several improved Transformers have achieved linear complex-
ity. Among these methods, Linformer (Wang et al., 2020) compresses the se-
quence by learnable linear projection; Luna (Ma et al., 2021) adopts a nested
linear structure; Nyströmformer (Xiong et al., 2021) applies the Nyström approx-
imation in the attention mechanism; Performer (Choromanski et al., 2021)
leverages a positive orthogonal random features approach; Pyraformer (Liu
et al., 2022a) introduces a pyramidal attention module to extract dependency
at different resolutions; FEDformer (Zhou et al., 2022b) combines seasonal-
trend decomposition and frequency enhanced structures to capture both the
global and local dependencies. Besides these efforts to reduce complexity, numerous other effective approaches have been proposed. Non-stationary Trans-
formers (Liu et al., 2022b) and Scaleformer (Shabani et al., 2023) provide
generic frameworks to improve the accuracy of existing methods. Cross-
former (Zhang & Yan, 2023) develops two-stage attention layers to utilize
cross-channel dependency. Zhou et al. (2022c) show that a linear head may
achieve better performance than the Transformer decoder.
However, recent work (Zeng et al., 2022) shows that a simple linear model (DLinear) outperforms the existing SOTA Transformer-based methods, often by a large margin. Additionally, Li et al. (2023b) argue that attention is not necessary for capturing temporal dependencies. Indeed, PatchTST (Nie et al., 2023), which feeds continuous segments of the series into a vanilla Transformer under a channel-independence setting, has recently beaten DLinear on the standard multivariate forecasting benchmarks. The method still
needs improvement in terms of computational efficiency and the extraction
of cross-channel dependency.
Frequency domain (FD) methods are also widely explored in time series
modeling. Many commonly used FD features are obtained through the Dis-
crete Fourier Transform (DFT). For instance, Wu et al. (2021) efficiently com-
pute the auto-correlation function using the Fast Fourier Transform (FFT).
Lee-Thorp et al. (2022) significantly accelerate training with minor accuracy
loss by replacing self-attention in a Transformer encoder with the Fourier
Transform. Sun & Boning (2022) present a competitive FD neural network
for univariate time series forecasting. Rather than using all FD components,
exploiting FD sparsity is crucial for computational efficiency and noise sup-
pression. FEDformer, for example, randomly selects a subset of FD compo-
nents. Similarly, Wang et al. (2023b) develop a compressed sensing technique
using random FD components. Zhou et al. (2022a) improve the Legendre memory model using low-frequency Fourier components and a low-rank approximation. Woo et al. (2023) develop a concatenated Fourier features module for efficiently learning high-frequency patterns. Additionally, Li et al. (2023a) combine temporal and frequency streams for time series classification and regression, with FD components selected by a band-pass filter.
In contrast to these approaches, JTFT introduces a custom discrete cosine
transform (CDCT) to extract FD features. The Discrete Cosine Transform
(DCT) is chosen for its superior energy compaction characteristics compared
to DFT, allowing it to obtain a sparse FD representation of the input effec-
tively. CDCT further extends DCT by enabling the learning of frequencies,
improving the extraction of periodic dependencies that may not align with
the uniform frequency grids of DCT bases.
Other methods, such as Short Time Fourier Transform and Discrete
Wavelet Transform, are used to capture time-frequency features (e.g., Chao-
valit et al., 2011; Yao et al., 2019; Singh & Tiwari, 2006; Wen et al., 2021;
Ding et al., 2022). These methods excel in extracting information from non-
stationary data due to their ability to adapt FD representations over time.
However, time-frequency features include FD components for discrete time
points, resulting in considerably larger sizes compared to FFT and DCT.
Consequently, these methods often require more effective feature selection or
compression techniques to counteract overfitting. In contrast, JTFT effec-
tively addresses the challenges posed by non-stationarity through recent TD
representations.
3. Methods
In this section, we will introduce (1) the overall structure of JTFT, (2)
the sparse FD representation with learnable frequencies, (3) the low-rank at-
tention layers to extract cross-channel dependencies, and (4) the complexity
analysis of the proposed method.
Figure 1: JTFT framework. The input time series undergo preprocessing to obtain a
joint time-frequency embedding, which is then fed to the encoders for extracting time-
frequency and cross-channel dependencies, respectively. The forecast is generated using a prediction head.
Figure 2: Time-Frequency Domain Data Embedding. Patched data is transformed into FD
representation (FDR) by multiplying it with the CDCT bases. FDR is then concatenated
with the TD representation (TDR) to form the JTFR. This JTFR is mapped to the model
dimension using the projection matrix P, and the position embedding is added to create the inputs for the encoders.
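For concreteness, a minimal PyTorch sketch of this embedding step is given below. It is an illustration under stated assumptions rather than the authors' implementation: the class and argument names are hypothetical, the FD representation is assumed to be taken along the patch axis, and `fd_basis` stands for the CDCT basis described in Section 3.

```python
import torch
import torch.nn as nn

class JointTFEmbedding(nn.Module):
    """Hypothetical sketch of the Figure 2 embedding: FDR + TDR -> JTFR -> projection P."""

    def __init__(self, fd_basis: torch.Tensor, l_p: int, n_t: int, d_model: int):
        super().__init__()
        # fd_basis: (n_f, L_bar) CDCT basis over the patch axis (kept fixed in this sketch)
        self.register_buffer("fd_basis", fd_basis)
        self.n_t = n_t
        self.P = nn.Linear(l_p, d_model)                          # projection matrix P
        self.pos = nn.Parameter(torch.zeros(n_t + fd_basis.shape[0], d_model))

    def forward(self, z_pch: torch.Tensor) -> torch.Tensor:
        # z_pch: (D, L_bar, l_p) patched input (see Appendix A)
        fdr = torch.einsum("fl,dlp->dfp", self.fd_basis, z_pch)   # FD representation (D, n_f, l_p)
        tdr = z_pch[:, -self.n_t:, :]                             # TD representation: latest n_t patches
        jtfr = torch.cat([fdr, tdr], dim=1)                       # joint representation (D, n_t + n_f, l_p)
        return self.P(jtfr) + self.pos                            # encoder inputs (D, n_t + n_f, d_model)
```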
points and trained with all available data, which also benefits generalization
performance.
The prediction head consists of a GELU activation function (Hendrycks &
Gimpel, 2016), dropout layer (Srivastava et al., 2014), and linear projection
layer. It maps the latent representation to the output, which is then denormalized
using the statistics saved in preprocessing. The model is trained using the
Huber loss function, which offers greater resilience to outliers in the data
compared to the Mean Squared Error (MSE) loss.
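A minimal PyTorch sketch of such a head is shown below; the names and shapes are assumptions, and `mean`/`std` denote the per-channel statistics saved during preprocessing.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Hypothetical sketch: GELU -> dropout -> linear projection -> denormalization."""

    def __init__(self, repr_len: int, d_model: int, target_len: int, dropout: float = 0.1):
        super().__init__()
        self.act = nn.GELU()
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(repr_len * d_model, target_len)

    def forward(self, z, mean, std):
        # z: (B, D, repr_len, d_model); mean, std: (B, D, 1) statistics from preprocessing
        y = self.proj(self.dropout(self.act(z.flatten(start_dim=-2))))  # (B, D, target_len)
        return y * std + mean                                           # denormalize

# Training objective: the Huber loss is less sensitive to outliers than MSE.
criterion = nn.HuberLoss()
```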
form of the DCT is
\[
\tilde{z} = \mathrm{DCT}(z) = \tilde{T} z, \qquad
\tilde{T}_{k,n} =
\begin{cases}
1/\sqrt{N}, & k = 0,\; n \in \{0, \dots, N-1\},\\[2pt]
\sqrt{\dfrac{2}{N}} \cos\!\left(\dfrac{\pi}{N}\left(n + \dfrac{1}{2}\right) k\right), & k \in \{1, \dots, N-1\},\; n \in \{0, \dots, N-1\},
\end{cases}
\tag{1}
\]
where z is a sequence of length N. It may differ from the input time sequence due to certain preprocessing steps.
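For illustration, the orthonormal basis T̃ of Eq. (1) can be constructed and checked for orthogonality in a few lines of NumPy (`dct_basis` is a hypothetical helper, not part of the original code):

```python
import numpy as np

def dct_basis(N: int) -> np.ndarray:
    """Orthonormal DCT basis matrix T_tilde of Eq. (1)."""
    k = np.arange(N)[:, None]        # frequency index (rows)
    n = np.arange(N)[None, :]        # time index (columns)
    T = np.sqrt(2.0 / N) * np.cos(np.pi / N * (n + 0.5) * k)
    T[0, :] = 1.0 / np.sqrt(N)       # k = 0 row
    return T

T = dct_basis(16)
print(np.allclose(T @ T.T, np.eye(16)))   # True: T_tilde is orthogonal, so the IDCT is T_tilde^T
```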
The matrix T̃ is orthogonal, so the inverse transform is
\[
z = \mathrm{IDCT}(\tilde{z}) = \tilde{T}^{\top} \tilde{z}.
\tag{2}
\]
The CDCT generalizes the DCT by allowing non-uniform frequencies:
\[
\hat{z} = \mathrm{CDCT}(z) = \hat{T} z, \qquad
\hat{T}_{k,n} =
\begin{cases}
1/\sqrt{N}, & k = 0,\; n \in \{0, \dots, N-1\},\\[2pt]
\sqrt{\dfrac{2}{N}} \cos\!\left(\left(n + \dfrac{1}{2}\right) \pi \psi_k\right), & \psi_k \in (0, 1),
\end{cases}
\tag{3}
\]
Since the CDCT basis with learnable frequencies is generally not orthogonal, JTFT uses a learned projection matrix to recover the TD series from the sparse FD representation obtained by the CDCT.
The CDCT allows the learning of frequencies by setting ψ_{1:n_f−1} as learnable parameters. This flexibility enables the adjustment of frequencies within the CDCT to better approximate the most significant frequencies in the data, which may not align with the uniform grid points of the DCT. In the implementation, ψ_{1:n_f−1} are initialized based on the top frequencies obtained by
applying the DCT to a randomly sampled subset of the dataset. Although initializing the CDCT with DCT frequencies has a time complexity of O(L log L), this initialization stage is relatively quick, particularly when compared to the training time. Since time series datasets are typically not as large as those in CV and NLP, even performing the DCT on the whole dataset without sampling is acceptable in many real-world scenarios. Therefore, at a slight loss of rigor, this short initialization stage is omitted when analyzing the overall complexity.
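A possible PyTorch sketch of the CDCT with learnable frequencies is given below. The class name and interface are assumptions: the DC row is kept fixed, and ψ_{1:n_f−1} are initialized from DCT frequency indices (for example, the top components found on a sampled subset of the data).

```python
import math
import torch
import torch.nn as nn

class CDCT(nn.Module):
    """Hypothetical sketch of the custom DCT of Eq. (3) with learnable frequencies."""

    def __init__(self, seq_len: int, n_freq: int, init_idx: torch.Tensor = None):
        super().__init__()
        self.N = seq_len
        if init_idx is None:
            init_idx = torch.arange(1, n_freq)          # fall back to the lowest frequencies
        # DCT index k corresponds to psi = k / N; psi_{1:n_freq-1} are learnable
        self.psi = nn.Parameter(init_idx.float() / seq_len)

    def basis(self) -> torch.Tensor:
        n = torch.arange(self.N, dtype=self.psi.dtype, device=self.psi.device) + 0.5
        rows = math.sqrt(2.0 / self.N) * torch.cos(math.pi * self.psi[:, None] * n[None, :])
        dc = torch.full((1, self.N), 1.0 / math.sqrt(self.N), dtype=rows.dtype, device=rows.device)
        return torch.cat([dc, rows], dim=0)             # (n_freq, N), the rows of T_hat

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (..., N) -> sparse FD representation (..., n_freq)
        return z @ self.basis().T
```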
We compare the representation ability of learnable (LRN), low (LOW),
and random (RND) frequencies by examining the errors in reconstructing
the input TD series. For random and low frequencies, the TD series is recon-
structed using the IDCT, while a learned linear projection is used for learnable frequencies. As shown in Figure 3, the learnable-frequency representation retains significantly more information from the input than random and low frequencies when the number of components is the same. It is worth noting that a lossless representation of the input is unnecessary, because insignificant frequency components often have higher noise levels and may be detrimental to prediction. Although the representa-
tional ability of low frequencies may be close to that of learnable frequencies
when the number of components is large, the latter are more effective at ob-
taining a sparse FD representation with fewer components that retains the
important information.
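The low- and random-frequency baselines of this comparison can be reproduced with a short NumPy/SciPy sketch; the learnable-frequency variant additionally trains the frequencies and a linear projection, which is not reproduced here.

```python
import numpy as np
from scipy.fft import dct, idct

def recon_nmse(x: np.ndarray, k: int, mode: str = "low", rng=None) -> float:
    """Normalized MSE of reconstructing x from k DCT components (lowest or random)."""
    c = dct(x, norm="ortho")
    if mode == "low":
        keep = np.arange(k)
    else:
        keep = (rng or np.random.default_rng()).choice(len(x), k, replace=False)
    mask = np.zeros_like(c)
    mask[keep] = 1.0
    x_rec = idct(c * mask, norm="ortho")     # reconstruct from the kept components only
    return float(np.sum((x - x_rec) ** 2) / np.sum(x ** 2))

x = np.sin(np.linspace(0, 8 * np.pi, 512)) + 0.1 * np.random.default_rng(0).standard_normal(512)
print(recon_nmse(x, 16, "low"), recon_nmse(x, 16, "rnd"))
```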
Figure 3: Comparison of the errors in reconstructing the input TD series from the FD
representation of learnable (LRN), low (LOW), and random (RND) frequencies. The
datasets are divided into continuous multivariate subsequences, each with a length of 512,
and the MSE is normalized by the sum-of-squares of the data. The RND frequencies are
executed 5 times, and the error bar represents the standard deviation.
Figure 4: LRA layers. The layer conducts attention across the channel dimension. The inputs and a learnable query are fed into the LMSA, and the outputs are added to a compact learnable position embedding. The outcomes are projected across the channel dimension and replicated along the time-frequency dimension to match the input shape. The shortcut connection, LayerNorm, and MLP are applied as in the Transformer encoder. M is the number of layers.
The LRA layer uses learnable queries to aggregate messages from all channels into a low-rank space.
Additionally, a compact learnable position embedding is added to the LMSA
outputs to account for position within the low-rank space. The resulting
outputs are mapped to match the number of channels through linear projec-
tion, distributing the aggregated messages among the channels. The results
are duplicated in the time-frequency dimension, following the TFI setting.
Subsequently, the shortcut connection, layer normalization (LayerNorm) (Ba
et al., 2016), and multi-layer feedforward network (denoted by MLP) are ap-
plied in the same manner as in the Transformer encoder.
The LMSA is a simplified version of the MSA with reduced computational
requirements and parameters. In the LMSA, both the keys and values share
the same linear projection. Because it is used alongside learnable queries,
the linear projection for queries is also omitted. Denote the standard MSA
by
\[
\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\tag{4}
\]
where $Q \in \mathbb{R}^{l_q \times d_k}$ and $K, V \in \mathbb{R}^{l_{kv} \times d_k}$. The LMSA is defined as
\[
\mathrm{LMSA}(\hat{Q}, \hat{K}, \hat{V}) = \mathrm{Concat}\!\left(\mathrm{head}_1, \dots, \mathrm{head}_h\right) W_o, \qquad
\mathrm{head}_i = \mathrm{Attn}\!\left(\hat{Q}_i, \hat{K} W_{kv,i}, \hat{V} W_{kv,i}\right),
\tag{5}
\]
where the inputs $\hat{Q} \in \mathbb{R}^{l_q \times h d_k}$ and $\hat{K}, \hat{V} \in \mathbb{R}^{l_{kv} \times d_m}$, $\hat{Q}_i \in \mathbb{R}^{l_q \times d_k}$ denotes the $i$-th head's slice of $\hat{Q}$, and the learnable matrices are $W_o \in \mathbb{R}^{h d_k \times d_m}$ and $W_{kv,i} \in \mathbb{R}^{d_m \times d_k}$. $h$ is the number of heads, and $d_m$ is the model width. The batch-size dimension is ignored in the discussion for convenience.
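A minimal PyTorch sketch of the LMSA described above (a shared key/value projection per head and unprojected queries) is shown below. It is a reconstruction from the stated dimensions, not the original code; the batch dimension is omitted as in the text.

```python
import math
import torch
import torch.nn as nn

class LMSA(nn.Module):
    """Hypothetical sketch of the low-rank multi-head self-attention of Eq. (5)."""

    def __init__(self, d_model: int, d_k: int, n_heads: int):
        super().__init__()
        self.h, self.d_k = n_heads, d_k
        self.w_kv = nn.Linear(d_model, n_heads * d_k, bias=False)   # shared projection W_kv,i
        self.w_o = nn.Linear(n_heads * d_k, d_model, bias=False)    # output projection W_o

    def forward(self, q: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # q: (l_q, h*d_k) learnable queries; x: (l_kv, d_model) provides both keys and values
        lq, lkv = q.shape[0], x.shape[0]
        q = q.view(lq, self.h, self.d_k).transpose(0, 1)                 # (h, l_q, d_k)
        kv = self.w_kv(x).view(lkv, self.h, self.d_k).transpose(0, 1)    # (h, l_kv, d_k)
        attn = torch.softmax(q @ kv.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        out = (attn @ kv).transpose(0, 1).reshape(lq, self.h * self.d_k)
        return self.w_o(out)                                             # (l_q, d_model)
```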
The input of the LRA is the CI representation obtained by the Transformer encoder. It is denoted as $Z_{in} \in \mathbb{R}^{D \times (n_t + n_f) \times d_m}$, where $n_t$ and $n_f$ are the lengths of the TD and FD representations. According to the TFI setting, $Z_{in}$ is reshaped to $\hat{Z} \in \mathbb{R}^{D(n_t + n_f) \times d_m}$. Then the LRA can be described as
\[
\begin{aligned}
B &= \mathrm{LMSA}\!\left(R, \hat{Z}, \hat{Z}\right),\\
\hat{B} &= B + E_{pos},\\
\bar{Z} &= W_e \hat{B},\\
\tilde{Z} &= \mathrm{LayerNorm}\!\left(Z_{in} \oplus \bar{Z}\right),\\
Z_{out} &= \mathrm{LayerNorm}\!\left(\tilde{Z} + \mathrm{MLP}(\tilde{Z})\right).
\end{aligned}
\tag{6}
\]
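A corresponding sketch of one LRA layer implementing Eq. (6) is given below, reusing the LMSA sketch above. The class name, the exact shape of W_e, and the interpretation of ⊕ as the shortcut addition are assumptions.

```python
import torch
import torch.nn as nn

class LRALayer(nn.Module):
    """Hypothetical sketch of Eq. (6); `rows` = D*(n_t + n_f) after the TFI reshape."""

    def __init__(self, rows: int, r: int, d_model: int, d_k: int, n_heads: int, d_ff: int):
        super().__init__()
        self.R = nn.Parameter(torch.randn(r, n_heads * d_k))   # learnable queries R
        self.E_pos = nn.Parameter(torch.zeros(r, d_model))     # compact position embedding
        self.lmsa = LMSA(d_model, d_k, n_heads)                # from the sketch after Eq. (5)
        self.W_e = nn.Linear(r, rows, bias=False)              # distribute messages back to the rows
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, z_in: torch.Tensor) -> torch.Tensor:
        # z_in: (rows, d_model)
        b_hat = self.lmsa(self.R, z_in) + self.E_pos             # B + E_pos, shape (r, d_model)
        z_bar = self.W_e(b_hat.transpose(0, 1)).transpose(0, 1)  # W_e B_hat, shape (rows, d_model)
        z_tilde = self.norm1(z_in + z_bar)                       # shortcut + LayerNorm
        return self.norm2(z_tilde + self.mlp(z_tilde))           # + MLP + LayerNorm
```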
Table 1: Complexity analysis of different forecasting models
4. Experiments
Datasets: We assess the performance of JTFT on six real-world datasets,
including Exchange, Weather, Traffic, Electricity (Electricity Consumption
Load), ILI (Influenza-Like Illness), and ETTm2 (Electricity Transformer
Temperature-minutely). The datasets are divided into training, validation,
and test sets following Nie et al. (2023), with split ratios of 0.6:0.2:0.2 for
ETTm2 and 0.7:0.1:0.2 for other datasets.
Baselines: We use several SOTA models for time series forecasting as
baselines, including PatchTST (Nie et al., 2023), Crossformer (Zhang & Yan,
2023), FEDformer (Zhou et al., 2022b), FiLM (Zhou et al., 2022a), DLinear
(Zeng et al., 2022), DeepTime (Woo et al., 2023), TSMixer (Chen et al.,
2023). Classic models such as ARIMA, basic RNN/LSTM/CNN models,
and some popular Transformer-based models, including LogTrans (Li et al.,
2019), Reformer (Kitaev et al., 2020), Pyraformer (Liu et al., 2022a), Auto-
former (Wu et al., 2021), and Informer (Zhou et al., 2021) are not included
in the main results because they exhibit relatively inferior performance, as
shown in (Zhou et al., 2022b; Zeng et al., 2022; Nie et al., 2023).
Experimental Settings: The forecasting length for ILI is T ∈ {24, 36, 48, 60},
while for the other datasets, it is T ∈ {96, 192, 336, 720}. Baseline results
for PatchTST, DLinear, FiLM, DeepTime, and TSMixer are obtained from
their original papers, with the look-back window L either searched for or
set to a suggested value. For PatchTST, we present the results of PatchTST/64, which generally performs better than PatchTST/42.
For other methods or settings not available in the literature, L is deter-
mined through grid search to establish strong baselines. The search range
is {24, 48, 84, 96, 128} for the two smaller datasets (ILI and Exchange) and
{48, 96, 192, 336, 512, 720} for the other datasets. In contrast, the JTFT con-
figuration aligns with PatchTST/64, where L is set at 128 for the smaller
datasets and 512 for the larger ones. The evaluation metrics reported include
Mean Squared Error (MSE) and Mean Absolute Error (MAE) for multivari-
ate time series forecasting.
Table 2: Multivariate long-term series forecasting results on 6 datasets. The best results
are highlighted in bold, and the second best results are underlined.
Exchange
192 0.148 0.279 0.178 0.299 0.536 0.544 0.214 0.357 0.188 0.292 0.157 0.293 0.151 0.284 0.176 0.297
336 0.260 0.381 0.329 0.415 0.804 0.731 0.413 0.493 0.356 0.433 0.305 0.414 0.314 0.412 0.334 0.416
720 0.667 0.618 0.901 0.715 1.266 0.926 1.038 0.796 0.727 0.669 0.643 0.601 0.856 0.663 0.867 0.702
avg 0.289 0.369 0.373 0.408 0.723 0.653 0.443 0.473 0.339 0.400 0.296 0.378 0.350 0.391 0.364 0.403
Weather
96 0.144 0.186 0.149 0.198 0.148 0.212 0.238 0.314 0.199 0.262 0.176 0.237 0.166 0.221 0.145 0.198
192 0.187 0.228 0.194 0.241 0.191 0.258 0.275 0.329 0.228 0.288 0.22 0.282 0.207 0.261 0.191 0.242
336 0.237 0.270 0.245 0.282 0.244 0.308 0.339 0.377 0.267 0.323 0.265 0.319 0.251 0.298 0.242 0.280
720 0.308 0.321 0.314 0.334 0.311 0.355 0.389 0.409 0.319 0.361 0.323 0.362 0.301 0.338 0.320 0.336
avg 0.219 0.251 0.226 0.264 0.224 0.283 0.310 0.357 0.253 0.308 0.246 0.300 0.231 0.280 0.225 0.264
Traffic
96 0.351 0.232 0.360 0.249 0.482 0.268 0.576 0.359 0.416 0.294 0.410 0.282 0.390 0.275 0.376 0.264
192 0.374 0.243 0.379 0.256 0.495 0.271 0.610 0.380 0.408 0.288 0.423 0.287 0.402 0.278 0.397 0.277
336 0.385 0.249 0.392 0.264 0.512 0.280 0.608 0.375 0.425 0.298 0.436 0.296 0.415 0.288 0.413 0.290
720 0.429 0.275 0.432 0.286 0.561 0.313 0.621 0.375 0.52 0.353 0.466 0.315 0.449 0.307 0.444 0.306
avg 0.385 0.250 0.391 0.264 0.512 0.283 0.604 0.372 0.442 0.308 0.434 0.295 0.414 0.287 0.407 0.284
Electricity
96 0.131 0.224 0.129 0.222 0.217 0.311 0.186 0.302 0.154 0.267 0.140 0.237 0.137 0.238 0.131 0.229
192 0.144 0.237 0.147 0.240 0.263 0.337 0.197 0.311 0.164 0.258 0.153 0.249 0.152 0.252 0.151 0.246
336 0.157 0.252 0.163 0.259 0.319 0.370 0.213 0.328 0.188 0.283 0.169 0.267 0.166 0.268 0.161 0.261
720 0.182 0.275 0.197 0.290 0.388 0.412 0.233 0.344 0.236 0.332 0.203 0.301 0.201 0.302 0.197 0.293
avg 0.154 0.247 0.159 0.253 0.297 0.357 0.207 0.321 0.185 0.285 0.166 0.264 0.164 0.265 0.160 0.257
ILI
24 1.027 0.604 1.319 0.754 2.918 1.139 2.624 1.095 1.970 0.875 2.215 1.081 2.425 1.086 2.415 1.058
36 0.995 0.621 1.579 0.870 3.020 1.123 2.516 1.021 1.982 0.859 1.963 0.963 2.231 1.008 2.280 1.027
48 0.980 0.637 1.553 0.815 3.241 1.192 2.505 1.041 1.868 0.896 2.130 1.024 2.230 1.016 2.379 1.056
60 1.386 0.760 1.470 0.788 3.324 1.188 2.742 1.122 2.057 0.929 2.368 1.096 2.143 0.985 2.370 1.047
avg 1.097 0.655 1.480 0.807 3.126 1.161 2.597 1.070 1.969 0.890 2.169 1.041 2.257 1.024 2.361 1.047
ETTm2
96 0.160 0.247 0.166 0.256 0.280 0.371 0.180 0.271 0.165 0.256 0.167 0.260 0.166 0.257 0.163 0.252
192 0.213 0.284 0.223 0.296 0.364 0.446 0.252 0.318 0.222 0.296 0.224 0.303 0.225 0.302 0.216 0.290
336 0.265 0.319 0.274 0.329 0.990 0.734 0.324 0.364 0.277 0.333 0.281 0.342 0.277 0.336 0.268 0.324
720 0.348 0.373 0.362 0.385 1.892 1.026 0.410 0.420 0.371 0.389 0.397 0.421 0.383 0.409 0.420 0.422
avg 0.247 0.306 0.256 0.317 0.881 0.644 0.291 0.343 0.259 0.319 0.267 0.332 0.263 0.326 0.267 0.322
methods, DLinear, FiLM, DeepTime, and TSMixer are shown to be compet-
itive.
Table 3: Ablation study of the Joint Time-Frequency Domain Representation (JTFR)
and Low-Rank Attention Layer (LRA) in JTFT. TDR denotes JTFT with only a TD
representation and no LRA, FDR represents JTFT with only an FD representation and
no LRA, and JTFR includes both TD and FD representations but no LRA. The best
results are highlighted in bold, and the second-best results are underlined.
48 0.980 0.637 1.490 0.738 1.864 0.957 1.107 0.690 2.505 1.041 1.868 0.896 1.553 0.815
60 1.386 0.760 1.818 0.846 1.946 0.993 1.454 0.794 2.742 1.122 2.057 0.929 1.470 0.788
avg 1.097 0.656 1.567 0.754 1.794 0.918 1.185 0.701 2.256 0.984 1.969 0.890 1.480 0.807
Weather
96 0.144 0.186 0.158 0.201 0.147 0.193 0.147 0.190 0.238 0.314 0.199 0.262 0.149 0.198
192 0.187 0.228 0.200 0.240 0.191 0.237 0.192 0.234 0.275 0.329 0.228 0.288 0.194 0.241
336 0.237 0.270 0.251 0.280 0.245 0.280 0.245 0.276 0.339 0.377 0.267 0.323 0.245 0.282
720 0.308 0.321 0.322 0.332 0.319 0.334 0.316 0.328 0.389 0.409 0.319 0.361 0.314 0.334
avg 0.219 0.251 0.233 0.263 0.226 0.261 0.225 0.257 0.310 0.357 0.253 0.309 0.226 0.262
Electricity
96 0.131 0.224 0.132 0.221 0.146 0.245 0.131 0.222 0.186 0.302 0.154 0.267 0.129 0.222
192 0.144 0.237 0.148 0.237 0.161 0.259 0.147 0.237 0.197 0.311 0.164 0.258 0.147 0.240
336 0.157 0.252 0.164 0.255 0.176 0.272 0.163 0.253 0.213 0.328 0.188 0.283 0.163 0.259
720 0.182 0.275 0.205 0.288 0.211 0.300 0.200 0.286 0.233 0.344 0.236 0.332 0.197 0.290
avg 0.154 0.247 0.162 0.250 0.174 0.269 0.160 0.250 0.207 0.321 0.186 0.285 0.159 0.253
memory usage, the batch size was uniformly set to 128. However, for Autoformer and Informer, a batch size of 64 was used for look-back windows of 96, 192, and 336 due to GPU memory limitations. Tests were not
performed on the two methods with larger look-back windows, as doing so
would have necessitated a further reduction in batch sizes, potentially result-
ing in an unfair comparison of speed. Additionally, the compilation function
in PyTorch 2 was disabled.
The results are displayed in Figure 5. Among the Transformer-based
methods, JTFT is the fastest and requires the least memory. While DLinear
is faster than JTFT, its accuracy is inferior. These results highlight that
JTFT is an efficient approach, especially when considering both speed and
accuracy. Further details and explanations are provided below.
Firstly, the computation and memory costs for different look-back windows L are compared in Figures 5 (a, b, c), where the target window size is consistently set to 720. According to the complexity analysis, although the time and space complexity of JTFT is O(L), the actual time and memory costs are mainly determined by the input length of the Transformer encoder, denoted by L̂ = n_t + n_f. In the expression, n_t and n_f are the lengths of the TD and FD representations, which are assigned by the user and decoupled from L.
[Figure 5: (a) Training time (ms/sample), (b) Inference time (ms/sample), and (c) Maximum GPU memory (GB) versus the look-back window; (d) Training time (ms/sample), (e) Inference time (ms/sample), and (f) Maximum GPU memory (GB) versus the target window. Methods shown: PatchTST, FEDformer, Crossformer, Autoformer, Informer, DLinear, JTFT_C, and JTFT_V.]
Figure 5: Comparison of actual runtime efficiency and memory usage. Figures (a, b, c)
depict the results for various look-back windows while the target window is 720. Figures (d,
e, f) explore the impact of target window variations when the look-back window is 512. In
JTFT_C, the TD and FD representation lengths remain constant, while in JTFT_V, the
representation lengths increase roughly linearly with the look-back window. In Figures (d,
e, f), JTFT uses the largest representation length of JTFT_V. In general, JTFT is more
efficient in both computation and memory than the other Transformer-based methods in
the comparison.
Typically, L̂ is less than the number of patches (refer to Appendix A), since the FD and TD representations would otherwise be linearly dependent.
Two versions of JTFT with different strategies for setting n_t and n_f are considered. JTFT_C uses a constant (n_t, n_f) = (8, 8); it is not applied for L = 96 because the number of patches is then smaller than L̂. In JTFT_V, (n_t, n_f) increases along with L; it is set to (4, 4), (8, 8), (12, 12), and (16, 16) for L of 96, 192, 336, and 512, respectively.
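As a quick sanity check (not from the paper), the patch count L̄ = (L − l_p)/l_s + 2 of Appendix A can be compared with L̂ = n_t + n_f under the (l_p, l_s) = (16, 8) setting of Appendix C:

```python
# Verify L_hat <= number of patches for the JTFT_V settings; (l_p, l_s) = (16, 8) is assumed.
l_p, l_s = 16, 8
for L, (n_t, n_f) in [(96, (4, 4)), (192, (8, 8)), (336, (12, 12)), (512, (16, 16))]:
    n_patches = (L - l_p) // l_s + 2
    print(L, n_patches, n_t + n_f, n_t + n_f <= n_patches)
# For JTFT_C with (n_t, n_f) = (8, 8), L = 96 yields only 12 patches < 16, hence it is skipped.
```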
Figures 5 (a, b, c) demonstrate that both JTFT_C and JTFT_V out-
perform the other Transformer-based methods in most settings. The compu-
tational time and GPU memory usage of JTFT_C remain almost constant
with respect to L because the costs of encoders and heads, which account for
most of the expenses in JTFT, do not increase with L when L̂ is kept con-
stant. In contrast, for JTFT_V, the computation and memory costs increase
as L̂ grows with L. JTFT_C and JTFT_V require slightly more memory
compared to PatchTST when L = 192 because they include additional LRA
layers.
Next, we compare the computational and memory costs for various target
windows (T ) in Figure 5 (d, e, f). The look-back window size is consistently
set to 512 in these figures. Similar to the previous results, JTFT remains
faster and requires less memory compared to the other Transformer-based
methods. The costs increase at a small and roughly constant rate with respect
to T . This behavior is because, when both L and L̂ are fixed, only the cost
of the prediction head, which is not resource-intensive, increases along with
T.
In addition to its application in the Transformer architecture, the joint
time-frequency domain representation could be utilized in other types of
neural network models, including those based on CNNs and MLPs. Moreover,
this representation holds the potential for adoption in various domains, such
as NLP, where input sequences tend to be lengthy.
6. Acknowledgments
This work was supported in part by the National Natural Science Foun-
dation of China (Grant No. T2125006, U2242210, 42174057, 61972231,
62102114, 62202119), the Jiangsu Innovation Capacity Building Program
(Grant No. BM2022028), the Science and Technology Project of Qinghai
Province (Grant No. 2023-QY-208), and the Key Research Project of Zhe-
jiang Lab (Grant No. 2021PB0AC01).
is padded by repeating the last element l_s times. In this setup, a total of L̄ = (L − l_p)/l_s + 2 patches are generated by extracting continuous segments within the input sequence. The input is rearranged as
\[
\mathrm{Patching}(x_{1:L}) = Z_{pch},
\tag{A.1}
\]
where the input series $x_{1:L} \in \mathbb{R}^{D \times L}$, the patched series $Z_{pch} \in \mathbb{R}^{D \times \bar{L} \times l_p}$, and
\[
\left(Z_{pch}\right)_{i,j} = x_{i,\,(j-1) l_s + 1 \,:\, (j-1) l_s + l_p}.
\tag{A.2}
\]
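The padding and patching of Eqs. (A.1)-(A.2) can be expressed compactly with `torch.Tensor.unfold`; the following is an illustrative sketch, not the original code.

```python
import torch

def patchify(x: torch.Tensor, l_p: int, l_s: int) -> torch.Tensor:
    """Pad the end by repeating the last value l_s times, then take overlapping
    windows of length l_p with stride l_s: (D, L) -> (D, L_bar, l_p)."""
    x = torch.cat([x, x[:, -1:].expand(-1, l_s)], dim=-1)   # repeat the last element l_s times
    return x.unfold(dimension=-1, size=l_p, step=l_s)

z = patchify(torch.arange(24.0).reshape(1, 24), l_p=8, l_s=4)
print(z.shape)   # torch.Size([1, 6, 8]); matches L_bar = (24 - 8) / 4 + 2 = 6
```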
Table B.4: Statistics of the benchmark datasets

2 https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
3 https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html
4 http://pems.dot.ca.gov/
5 https://www.bgc-jena.mpg.de/wetter/
Appendix C. Experimental details
By default, JTFT consists of 3 Transformer encoder layers and 1 low-rank attention (LRA) layer. However, the number of LRA layers is set to 0 for the Traffic dataset, as the LRA negatively impacts performance there. The dimension of the latent space in the Transformer encoder, denoted as d_m, increases with the dataset size: it is set to 8 for ILI and Exchange, 16 for ETTm2, and 128 for Weather, Electricity, and Traffic. The d_m of the LRA is half of that in the Transformer encoder. For ILI and Exchange, the patch length and stride are set to (4, 2), while for the other datasets they are (16, 8). The lengths of the TD and FD representations, (n_t, n_f), are searched over (16, 16) and (16, 32).
Our method is implemented in PyTorch and trained on a workstation
equipped with 4 NVIDIA RTX 3090 GPUs, each with 24GB memory. All 4
GPUs are utilized for training on the Electricity and Traffic datasets, while
only 1 GPU is used for the remaining datasets.
Table D.5: Multivariate long-term series forecasting results on 3 datasets showing both
Mean and STD. The best results are highlighted in bold, and the second best results are
underlined.
48 1.0639 ± 0.0597 0.6639 ± 0.0174 1.6462 ± 0.1520 0.8318 ± 0.0513 2.1333 ± 0.0063 1.0251 ± 0.0022
60 1.3866 ± 0.0568 0.7593 ± 0.0152 1.4527 ± 0.0547 0.8008 ± 0.0105 2.3175 ± 0.0178 1.0854 ± 0.0090
Weather
96 0.1440 ± 0.0003 0.1875 ± 0.0016 0.1483 ± 0.0004 0.1978 ± 0.0002 0.1690 ± 0.0004 0.2300 ± 0.0015
192 0.1876 ± 0.0008 0.2300 ± 0.0016 0.1943 ± 0.0013 0.2419 ± 0.0013 0.2133 ± 0.0012 0.2724 ± 0.0033
336 0.2385 ± 0.0013 0.2710 ± 0.0013 0.2461 ± 0.0010 0.2828 ± 0.0007 0.2570 ± 0.0016 0.3078 ± 0.0030
720 0.3087 ± 0.0008 0.3233 ± 0.0019 0.3126 ± 0.0011 0.3329 ± 0.0011 0.3161 ± 0.0016 0.3559 ± 0.0026
Traffic
96 0.3525 ± 0.0021 0.2339 ± 0.0016 0.3602 ± 0.0006 0.2487 ± 0.0003 0.4101 ± 0.0001 0.2819 ± 0.0002
192 0.3732 ± 0.0005 0.2425 ± 0.0002 0.3788 ± 0.0004 0.2560 ± 0.0002 0.4227 ± 0.0003 0.2873 ± 0.0002
336 0.3854 ± 0.0013 0.2500 ± 0.0008 0.3917 ± 0.0011 0.2639 ± 0.0007 0.4357 ± 0.0003 0.2956 ± 0.0004
720 0.4292 ± 0.0006 0.2753 ± 0.0005 0.4322 ± 0.0009 0.2863 ± 0.0003 0.4658 ± 0.0001 0.3148 ± 0.0001
References
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization.
arXiv:1607.06450.
Bao, H., Dong, L., Piao, S., & Wei, F. (2022). BEiT: BERT pre-training of
image transformers. In International Conference on Learning Representa-
tions. URL: https://openreview.net/forum?id=p-BhZSz59o4.
Bouchachia, A., & Bouchachia, S. (2008). Ensemble learning for time series
prediction. Proceedings of the 1st International Workshop on Nonlinear
Dynamics and Synchronization.
Box, G. E., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time
series analysis: forecasting and control . John Wiley & Sons.
Cao, D., Wang, Y., Duan, J., Zhang, C., Zhu, X., Huang, C., Tong, Y., Xu,
B., Bai, J., Tong, J., & Zhang, Q. (2020). Spectral temporal graph neural
network for multivariate time-series forecasting. arXiv:2103.07719.
Challu, C., Olivares, K. G., Oreshkin, B. N., Ramirez, F. G., Canseco, M. M.,
& Dubrawski, A. (2023). NHITS: Neural hierarchical interpolation for time
series forecasting. In Proceedings of the AAAI Conference on Artificial
Intelligence (pp. 6989–6997). volume 37.
Chaovalit, P., Gangopadhyay, A., Karabatis, G., & Chen, Z. (2011). Discrete
wavelet transform-based time series analysis and mining. ACM Computing
Surveys (CSUR), 43 , 1–37.
Chen, S., Wang, X. X., & Harris, C. J. (2008). NARX-based nonlinear system identification using orthogonal least squares basis hunting. IEEE Transactions on Control Systems, (pp. 78–84).
Chen, S.-A., Li, C.-L., Arik, S. O., Yoder, N. C., & Pfister, T. (2023).
TSMixer: An all-MLP architecture for time series forecasting. Transactions on Machine Learning Research. URL: https://openreview.net/forum?id=wbpxTuXgm0.
Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A.,
Sarlós, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger,
D. B., Colwell, L. J., & Weller, A. (2021). Rethinking attention with
performers. In 9th International Conference on Learning Representations
(ICLR), Virtual Event, Austria, May 3-7, 2021 .
Das, A., Kong, W., Leach, A., Sen, R., & Yu, R. (2023). Long-
term forecasting with TiDE: Time-series dense encoder. arXiv preprint arXiv:2304.08424.
Ding, Y., Jia, M., Miao, Q., & Cao, Y. (2022). A novel time–frequency
transformer based on self–attention mechanism and its application in fault
diagnosis of rolling bearings. Mechanical Systems and Signal Processing,
168 , 108616.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Un-
terthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszko-
reit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale. In International Conference on
Learning Representations. URL: https://openreview.net/forum?id=
YicbFdNTTy.
Ekambaram, V., Jati, A., Nguyen, N., Sinthong, P., & Kalagnanam, J.
(2023). TSMixer: Lightweight MLP-Mixer model for multivariate time series forecasting. arXiv preprint arXiv:2306.09364.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. B. (2021). Masked
autoencoders are scalable vision learners. arXiv:2111.06377. URL: https://arxiv.org/abs/2111.06377.
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
Karita, S., Chen, N., Hayashi, T., Hori, T., Inaguma, H., Jiang, Z., Someki,
M., Soplin, N. E. Y., Yamamoto, R., Wang, X. et al. (2019). A comparative
study on transformer vs rnn in speech applications. In IEEE Automatic
Speech Recognition and Understanding Workshop (ASRU) (pp. 449–456).
IEEE.
Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., & Shah,
M. (2021). Transformers in vision: A survey. ACM Computing Surveys
(CSUR).
Kitaev, N., Kaiser, L., & Levskaya, A. (2020). Reformer: The efficient
transformer. In 8th International Conference on Learning Representations,
ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 .
Lai, G., Chang, W.-C., Yang, Y., & Liu, H. (2018). Modeling long-and short-
term temporal patterns with deep neural networks. In The 41st interna-
tional ACM SIGIR conference on research & development in information
retrieval (pp. 95–104).
Lee-Thorp, J., Ainslie, J., Eckstein, I., & Ontanon, S. (2022). FNet: Mixing tokens with Fourier transforms. arXiv:2105.03824.
Li, B., Cui, W., Zhang, L., Zhu, C., Wang, W., Tsang, I., & Zhou, J. T.
(2023a). Difformer: Multi-resolutional differencing transformer with dy-
namic ranging for time series analysis. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X., & Yan, X. (2019).
Enhancing the locality and breaking the memory bottleneck of transformer
on time series forecasting. In Advances in Neural Information Processing
Systems. volume 32. URL: https://proceedings.neurips.cc/paper/
2019/file/6775a0635c302542da2c32aa19d86be0-Paper.pdf.
Li, Z., Rao, Z., Pan, L., & Xu, Z. (2023b). MTS-Mixers: Multivariate time series forecasting via factorized temporal and channel mixing. arXiv preprint arXiv:2302.04501.
Liu, S., Yu, H., Liao, C., Li, J., Lin, W., Liu, A. X., & Dustdar, S. (2022a).
Pyraformer: Low-complexity pyramidal attention for long-range time se-
ries modeling and forecasting. In International Conference on Learning
Representations.
Liu, Y., Wu, H., Wang, J., & Long, M. (2022b). Non-stationary
transformers: Exploring the stationarity in time series forecast-
ing. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho,
& A. Oh (Eds.), Advances in Neural Information Processing Sys-
tems (pp. 9881–9893). Curran Associates, Inc. volume 35. URL:
https://proceedings.neurips.cc/paper_files/paper/2022/file/
4054556fcaa934b0bf76da52cf4f92cb-Paper-Conference.pdf.
Ma, X., Kong, X., Wang, S., Zhou, C., May, J., Ma, H., & Zettlemoyer, L.
(2021). Luna: Linear unified nested attention. arXiv:2106.01540.
Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. (2023). A time
series is worth 64 words: Long-term forecasting with transformers. In The
Eleventh International Conference on Learning Representations. URL:
https://openreview.net/forum?id=Jbdc0vTOcol.
Pascanu, R., Mikolov, T., & Bengio, Y. (2012). On the difficulty of train-
ing recurrent neural networks. In International Conference on Machine
Learning.
Qin, Y., Song, D., Cheng, H., Cheng, W., Jiang, G., & Cottrell, G. W.
(2017). A dual-stage attention-based recurrent neural network for time se-
ries prediction. In International Joint Conference on Artificial Intelligence
(pp. 2627–2633).
Rangapuram, S. S., Seeger, M. W., Gasthaus, J., Stella, L., Wang, B., &
Januschowski, T. (2018). Deep state space models for time series forecast-
ing. In Neural Information Processing Systems.
Salinas, D., Flunkert, V., Gasthaus, J., & Januschowski, T. (2020). DeepAR:
Probabilistic forecasting with autoregressive recurrent networks. Interna-
tional Journal of Forecasting, 36 , 1181–1191.
Sen, R., Yu, H.-F., & Dhillon, I. S. (2019). Think globally, act locally: A
deep neural network approach to high-dimensional time series forecasting.
In Neural Information Processing Systems.
Shabani, M. A., Abdi, A. H., Meng, L., & Sylvain, T. (2023). Scaleformer: It-
erative multi-scale refining transformers for time series forecasting. In The
Eleventh International Conference on Learning Representations. URL:
https://openreview.net/forum?id=sCrnllCtjoE.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R.
(2014). Dropout: a simple way to prevent neural networks from overfitting.
The journal of machine learning research, 15 , 1929–1958.
Tong, H., & Lim, K. S. (2009). Threshold autoregression, limit cycles and
cyclical data. In Exploration Of A Nonlinear World: An Appreciation of
Howell Tong’s Contributions to Statistics (pp. 9–56). World Scientific.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you
need. In Advances in Neural Information Processing Systems. vol-
ume 30. URL: https://proceedings.neurips.cc/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Wang, H., Peng, J., Huang, F., Wang, J., Chen, J., & Xiao, Y. (2023a).
MICN: Multi-scale local and global context modeling for long-term series
forecasting. In The Eleventh International Conference on Learning Repre-
sentations. URL: https://openreview.net/forum?id=zt53IDUR1U.
Wang, S., Li, B. Z., Khabsa, M., Fang, H., & Ma, H. (2020). Lin-
former: Self-attention with linear complexity. arXiv:2006.04768.
Wang, Z., Zhao, H., Zheng, M., Niu, S., Gao, X., & Li, L. (2023b). A novel
time series prediction method based on pooling compressed sensing echo
state network and its application in stock market. Neural Networks, 164 ,
216–227.
Wen, Q., He, K., Sun, L., Zhang, Y., Ke, M., & Xu, H. (2021). RobustPe-
riod: Time-frequency mining for robust multiple periodicity detection. In
Proceedings of the 2021 International Conference on Management of Data
(SIGMOD ’21) (pp. 205–215).
Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J., & Sun, L. (2022).
Transformers in time series: A survey. arXiv preprint arXiv:2202.07125.
Woo, G., Liu, C., Sahoo, D., Kumar, A., & Hoi, S. (2023). Learning deep
time-index models for time series forecasting.
Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., & Long, M. (2022). TimesNet: Temporal 2D-variation modeling for general time series analysis. In The
Eleventh International Conference on Learning Representations.
Wu, H., Xu, J., Wang, J., & Long, M. (2021). Autoformer: Decomposition
transformers with Auto-Correlation for long-term series forecasting. In
Advances in Neural Information Processing Systems.
Wu, Z., Pan, S., Long, G., Jiang, J., Chang, X., & Zhang, C. (2020). Con-
necting the dots: Multivariate time series forecasting with graph neural
networks. Proceedings of the 26th ACM SIGKDD International Confer-
ence on Knowledge Discovery & Data Mining.
Xiong, Y., Zeng, Z., Chakraborty, R., Tan, M., Fung, G., Li, Y., & Singh,
V. (2021). Nyströmformer: A nyström-based algorithm for approximating
self-attention. In Thirty-Fifth AAAI Conference on Artificial Intelligence
(pp. 14138–14148).
Yao, S., Piao, A., Jiang, W., Zhao, Y., Shao, H., Liu, S., Liu, D., Li, J.,
Wang, T., Hu, S. et al. (2019). STFNets: Learning sensing signals from the time-frequency perspective with short-time Fourier neural networks. In The
World Wide Web Conference (pp. 2192–2202).
Zeng, A., Chen, M., Zhang, L., & Xu, Q. (2022). Are transformers effective
for time series forecasting? arXiv preprint arXiv:2205.13504.
Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W.
(2021). Informer: Beyond efficient transformer for long sequence time-
series forecasting. In The Thirty-Fifth AAAI Conference on Artificial In-
telligence (pp. 11106–11115). volume 35.
Zhou, T., Ma, Z., Wen, Q., Sun, L., Yao, T., Yin, W., Jin, R. et al. (2022a).
Film: Frequency improved legendre memory model for long-term time
series forecasting. Advances in Neural Information Processing Systems,
35 , 12677–12690.
Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., & Jin, R. (2022b). FED-
former: Frequency enhanced decomposed transformer for long-term series
forecasting. In Proc. 39th International Conference on Machine Learning.
Zhou, Z., Zhong, R., Yang, C., Wang, Y., Yang, X., & Shen, W. (2022c). A
k-variate time series is worth k words: Evolution of the vanilla trans-
former architecture for long-term multivariate time series forecasting.
arXiv:2212.02789.