SOFTS Multivariate Time Series
Abstract
Multivariate time series forecasting plays a crucial role in various fields such as finance, traffic management, energy, and healthcare. Recent studies have highlighted the advantages of channel independence to resist distribution drift but neglect channel correlations, limiting further enhancements. Several methods utilize mechanisms like attention or mixer to address this by capturing channel correlations, but they either introduce excessive complexity or rely too heavily on the correlation to achieve satisfactory results under distribution drifts, particularly with a large number of channels. Addressing this gap, this paper presents an efficient MLP-based model, the Series-cOre Fused Time Series forecaster (SOFTS), which incorporates a novel STar Aggregate-Redistribute (STAR) module. Unlike traditional approaches that manage channel interactions through distributed structures, e.g., attention, STAR employs a centralized strategy to improve efficiency and reduce reliance on the quality of each channel. It aggregates all series to form a global core representation, which is then dispatched and fused with individual series representations to facilitate channel interactions effectively. SOFTS achieves superior performance over existing state-of-the-art methods with only linear complexity. The broad applicability of the STAR module across different forecasting models is also demonstrated empirically. For further research and development, we have made our code publicly available at https://github.com/Secilia-Cxy/SOFTS.
1 Introduction
Time series forecasting plays a critical role in numerous applications across various fields, including
environment [9], traffic management [15], energy [16], and healthcare [27]. The ability to accurately
predict future values based on previously observed data is fundamental for decision-making, policy
development, and strategic planning in these areas. Historically, models such as ARIMA and
Exponential Smoothing were standard in forecasting, noted for their simplicity and effectiveness in
certain contexts [2]. However, the emergence of deep learning models, particularly those exploiting
structures like Recurrent Neural Networks (RNNs) [14, 3, 29] and Convolutional Neural Networks
(CNN) [1, 8], has shifted the paradigm towards more complex models capable of understanding
intricate patterns in time series data. To overcome the inability to capture long-term dependencies,
Transformer-based models have been a popular direction and achieved remarkable performance,
especially on long-term multivariate time series forecasting [46, 28, 26].
Early Transformer-based methods apply embedding techniques such as linear or convolutional layers to aggregate information from different channels and then extract information along the temporal dimension via attention mechanisms [46, 35, 47]. However, such channel-mixing structures were found to be vulnerable to distribution drift, to the extent that they were often less effective than simpler channel-independent models.
Our main contributions are summarized as follows:
1. We present the Series-cOre Fused Time Series (SOFTS) forecaster, a simple MLP-based model that demonstrates state-of-the-art performance with lower complexity.
2. We introduce the STar Aggregate-Redistribute (STAR) module, which serves as the foundation of SOFTS. STAR is designed as a centralized structure that uses a core to aggregate and exchange information across channels. Compared to distributed structures like attention, STAR not only reduces complexity but also improves robustness against anomalies in channels.
3. Lastly, through extensive experiments, the effectiveness and scalability of SOFTS are validated.
The universality of STAR is also validated on various attention-based time series forecasters.
2 Related Work
Time series forecasting. Time series forecasting is a critical area of research that finds applications
in both industry and academia. With the powerful representation capability of neural networks, deep
forecasting models have undergone a rapid development [22, 38, 37, 4, 5]. Two widely used methods
for time series forecasting are recurrent neural networks (RNNs) and convolutional neural networks
(CNNs). RNNs model successive time points based on the Markov assumption [14, 3, 29], while
CNNs extract variation information along the temporal dimension using techniques such as temporal
convolutional networks (TCNs) [1, 8]. However, due to the Markov assumption in RNNs and the local receptive field of TCNs, neither model can capture the long-term dependencies
in sequential data. Recently, the potential of Transformer models for long-term time series forecasting
tasks has garnered attention due to their ability to extract long-term dependencies via the attention
mechanism [46, 35, 47].
Figure 1: Overview of our SOFTS method. The multivariate time series is first embedded along the
temporal dimension to get the series representation for each channel. Then the channel correlation is
captured by multiple layers of STAR modules. The STAR module utilizes a centralized structure that
first aggregates the series representation to obtain a global core representation, and then dispatches
and fuses the core with each series, which encodes the local information.
Some works further model the series in the frequency domain using selected components to enhance performance [47]. Despite these innovations, models that mix channels in multivariate series often exhibit reduced robustness to distribution drifts and achieve subpar performance [43, 11]. Consequently, some researchers have adopted a
channel-independent approach, simplifying the model architecture and delivering robust results as
well [28, 23]. However, ignoring the interactions among variates can limit further advancements.
Recent trends have therefore shifted towards leveraging attention mechanisms to capture channel
correlations [45, 33, 26]. Even though the performance is promising, their scalability is limited on
large datasets. Another stream of research focuses on modeling time and channel dependencies
through simpler structures like MLP [44, 7, 40]. Yet, they usually achieve sub-optimal performance
compared to SOTA transformer-based methods, especially when the number of channels is large.
In this paper, we propose a new MLP-based method that breaks the dilemma of performance and
efficiency, achieving state-of-the-art performance with merely linear complexity to both the number
of channels and the length of the lookback window.
3 SOFTS
Multivariate time series forecasting (MTSF) deals with time series data that contain multiple variables, or channels, at each time step. Given the historical values X ∈ R^{C×L}, where L is the length of the lookback window and C is the number of channels, the goal of MTSF is to predict the future values Y ∈ R^{C×H}, where H > 0 is the forecast horizon.
3.1 Overview
Our Series-cOre Fused Time Series forecaster (SOFTS) comprises the following components and its
structure is illustrated in Figure 1.
Reversible instance normalization. Normalization is a common technique to calibrate the distribution of the input data. In time series forecasting, the local statistics of the history are usually removed to stabilize the prediction of the base forecaster and are later restored to the model prediction [17]. Following the common practice in many state-of-the-art models [28, 26], we apply reversible instance normalization, which centers each series to zero mean, scales it to unit variance, and reverses the normalization on the forecasted series. For the PEMS datasets, we follow Liu et al. [26] and selectively perform normalization according to the performance.
Series embedding. Series embedding is an extreme case of the prevailing patch embedding in time series [28] and is equivalent to setting the patch length to the length of the whole series [26]. Unlike patch embedding, series embedding does not produce an extra dimension and is thus less complex. Therefore, in this work, we perform series embedding on the lookback window. Concretely, we use a linear projection to embed the series of each channel, obtaining S_0 ∈ R^{C×d}, where d is the hidden dimension of the series embedding.
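A minimal sketch of this embedding step, assuming a single linear layer shared across channels that maps each length-L series to a d-dimensional vector; the concrete dimensions below are illustrative.

```python
import torch
import torch.nn as nn

L, d = 96, 256                      # lookback length and hidden dimension (assumed values)
embed = nn.Linear(L, d)             # one projection shared across all channels

x = torch.randn(32, 862, L)         # (batch, channels, length), e.g. the Traffic dataset
s0 = embed(x)                       # (batch, channels, d): one embedding per channel
```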
Figure 2: The comparison of the STAR module and several common modules, like attention, GNN
and mixer. These modules employ a distributed structure to perform the interaction, which relies on
the quality of each channel. On the contrary, our STAR module utilizes a centralized structure that
first aggregates the information from all the series to obtain a comprehensive core representation.
Then the core information is dispatched to each channel. This kind of interaction pattern reduces not
only the complexity of interaction but also the reliance on the channel quality.
Channel interaction. The series embedding is refined by multiple layers of STAR modules:

S_i = STAR(S_{i-1}),  i = 1, 2, . . . , N.    (2)

The STAR module utilizes a star-shaped structure that exchanges information between different channels, as fully described in the next section.
Linear predictor. After N layers of STAR, we use a linear predictor (R^d → R^H) to produce the forecasting results. Let the output series representation of layer N be S_N; the prediction Ŷ ∈ R^{C×H} is computed as:

Ŷ = Linear(S_N).
Our main contribution is a simple but efficient STar Aggregate-Redistribute (STAR) module that captures the dependencies between channels. Existing methods employ modules like attention to extract such interaction. Although these modules directly compare the characteristics of each pair of channels, they incur quadratic complexity in the number of channels. Besides, such a distributed structure may lack robustness when there are abnormal channels, because it relies on the exact correlations between channels. Existing research on channel independence has already shown that correlations estimated on non-stationary time series can be untrustworthy [43, 11]. To this end, we propose the STAR module to address the inefficiency of distributed interaction modules. This module is inspired by the star-shaped centralized system in software engineering: instead of letting the clients communicate with each other, a central server aggregates and exchanges the information [30, 10], which is both efficient and reliable. Following this idea, STAR replaces the mutual series interaction with indirect interaction through a core, which represents the global representation across all the channels. Compared to a distributed structure, STAR takes advantage of the robustness brought by aggregating channel statistics [11], and thus achieves even better performance. Figure 2 illustrates the main idea of STAR and its difference from existing modules like attention [32], GNN [19] and Mixer [31].
Given the series representation of each channel as input, STAR first obtains the core representation of the multivariate series, which lies at the heart of our SOFTS method. We define the core representation as follows:
Definition 3.1 (Core Representation). Given a multivariate series with C channels {s_1, s_2, . . . , s_C}, the core representation o is a vector generated by an arbitrary function f of the form:

o = f(s_1, s_2, . . . , s_C)
The core representation encodes the global information across all the channels. To obtain such repre-
sentation, we employ the following form, which is inspired by the Kolmogorov-Arnold representation
theorem [20] and DeepSets [41]:
o_i = Stoch_Pool(MLP_1(S_{i-1}))    (3)

where MLP_1 : R^d → R^{d′} is a projection that maps the series representation from the series hidden dimension d to the core dimension d′, composed of two layers with hidden dimension d and GELU [13] activation. Stoch_Pool is the stochastic pooling [42] that obtains the core representation o ∈ R^{d′} by aggregating the representations of the C series. Stochastic pooling combines the advantages of mean and max pooling. The details of computing the core representation can be found in Appendix B.2. Next, we fuse the representations of the core and all the series:

F_i = Repeat_Concat(S_{i-1}, o_i)    (4)
S_i = MLP_2(F_i) + S_{i-1}    (5)

The Repeat_Concat operation concatenates the core representation o_i to each series representation, yielding F_i ∈ R^{C×(d+d′)}. Then another MLP (MLP_2 : R^{d+d′} → R^d) is used to fuse the concatenated representation and project it back to the hidden dimension d, i.e., S_i ∈ R^{C×d}. Like many deep learning modules, we also add a residual connection from the input to the output [12].
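A minimal PyTorch sketch of one STAR layer under the formulation above, using mean pooling as a stand-in for the stochastic pooling described in Appendix B.2; the module name, MLP shapes, and defaults are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class STARLayer(nn.Module):
    def __init__(self, d, d_core):
        super().__init__()
        # MLP1: project each series embedding to the core dimension d'
        self.mlp1 = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d_core))
        # MLP2: fuse the concatenated (series, core) representation back to dimension d
        self.mlp2 = nn.Sequential(nn.Linear(d + d_core, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, s):
        # s: (batch, channels, d) series representations
        core = self.mlp1(s).mean(dim=1, keepdim=True)         # Eq. (3), with mean pooling stand-in
        core = core.expand(-1, s.size(1), -1)                 # repeat the core for every channel
        fused = torch.cat([s, core], dim=-1)                  # Eq. (4): Repeat_Concat
        return self.mlp2(fused) + s                           # Eq. (5): fuse, then residual connection
```

Stacking N such layers and applying a linear head over the last dimension yields the forecaster described in Algorithm 1 of Appendix B.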
We analyze the complexity of each component of SOFTS step by step concerning the window length L, number of channels C, model dimension d, and forecasting horizon H. The complexity of the reversible instance normalization and series embedding is O(CL) and O(CLd), respectively. In STAR, assuming d′ = d, MLP_1 is an R^d → R^d mapping with complexity O(Cd²). Stoch_Pool computes the softmax along the channel dimension, with complexity O(Cd). MLP_2 on the concatenated embedding has complexity O(Cd²). The complexity of the predictor is O(CdH). In total, the complexity of the encoding part is O(CLd + Cd² + CdH), which is linear in C, L, and H. Ignoring the model dimension d, which is a constant of the algorithm and independent of the problem size, we compare the complexity of several popular forecasters in Table 1.
Table 1: Complexity comparison between popular time series forecasters concerning window length
L, number of channels C and forecasting horizon H. Our method achieves only linear complexity.
Model           Complexity
SOFTS (ours)    O(CL + CH)
iTransformer    O(C² + CL + CH)
PatchTST        O(CL² + CH)
Transformer     O(CL + L² + HL + CH)
4 Experiments
Datasets. To thoroughly evaluate the performance of our proposed SOFTS, we conduct extensive
experiments on 6 widely used, real-world datasets including ETT (4 subsets), Traffic, Electricity,
Weather [46, 35], Solar-Energy [21] and PEMS (4 subsets) [24]. Detailed descriptions of the datasets
can be found in Appendix A.
Forecasting benchmarks. The long-term forecasting benchmarks follow the setting in In-
former [46] and SCINet [24]. The lookback window length (L) is set to 96 for all datasets. We set the
prediction horizon (H) to {12, 24, 48, 96} for PEMS and {96, 192, 336, 720} for others. Performance
comparison among different methods is conducted based on two primary evaluation metrics: Mean
Squared Error (MSE) and Mean Absolute Error (MAE). The results of PatchTST and TSMixer are
reproduced for the ablation study and other results are taken from iTransformer [26].
Implementation details. We use the ADAM optimizer [18] with an initial learning rate of 3 × 10⁻⁴.
This rate is modulated by a cosine learning rate scheduler. The mean squared error (MSE) loss
function is utilized for model optimization. We explore the number of STAR blocks N within the
set {1, 2, 3, 4}, and the dimension of series d within {128, 256, 512}. Additionally, the dimension
of the core representation d′ varies among {64, 128, 256, 512}. Other detailed implementations are
described in Appendix B.3.
Main results. As shown in Table 2, SOFTS provides the best or second-best predictive outcomes on all 6 datasets on average. Moreover, compared to previous state-of-the-art methods, SOFTS
has demonstrated significant advancements. For instance, on the Traffic dataset, SOFTS improved
the average MSE error from 0.428 to 0.409, representing a notable reduction of about 4.4%. On the
PEMS07 dataset, SOFTS achieves a substantial relative decrease of 13.9% in average MSE error,
from 0.101 to 0.087. These significant improvements indicate that the SOFTS model possesses robust
performance and broad applicability in multivariate time series forecasting tasks, especially in tasks
with a large number of channels, such as the Traffic dataset, which includes 862 channels, and the
PEMS dataset, with a varying range from 170 to 883 channels.
Table 2: Multivariate forecasting results with horizon H ∈ {12, 24, 48, 96} for PEMS and H ∈
{96, 192, 336, 720} for others and fixed lookback window length L = 96. Results are averaged from
all prediction horizons. Full results are listed in Table 6.
Models SOFTS (ours) iTransformer PatchTST TSMixer Crossformer TiDE TimesNet DLinear SCINet FEDformer Stationary
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ECL 0.174 0.264 0.178 0.270 0.189 0.276 0.186 0.287 0.244 0.334 0.251 0.344 0.192 0.295 0.212 0.300 0.268 0.365 0.214 0.327 0.193 0.296
Traffic 0.409 0.267 0.428 0.282 0.454 0.286 0.522 0.357 0.550 0.304 0.760 0.473 0.620 0.336 0.625 0.383 0.804 0.509 0.610 0.376 0.624 0.340
Weather 0.255 0.278 0.258 0.278 0.256 0.279 0.256 0.279 0.259 0.315 0.271 0.320 0.259 0.287 0.265 0.317 0.292 0.363 0.309 0.360 0.288 0.314
Solar-Energy 0.229 0.256 0.233 0.262 0.236 0.266 0.260 0.297 0.641 0.639 0.347 0.417 0.301 0.319 0.330 0.401 0.282 0.375 0.291 0.381 0.261 0.381
ETTm1 0.393 0.403 0.407 0.410 0.396 0.406 0.398 0.407 0.513 0.496 0.419 0.419 0.400 0.406 0.403 0.407 0.485 0.481 0.448 0.452 0.481 0.456
ETTm2 0.287 0.330 0.288 0.332 0.287 0.330 0.289 0.333 0.757 0.610 0.358 0.404 0.291 0.333 0.350 0.401 0.571 0.537 0.305 0.349 0.306 0.347
ETTh1 0.449 0.442 0.454 0.447 0.453 0.446 0.463 0.452 0.529 0.522 0.541 0.507 0.458 0.450 0.456 0.452 0.747 0.647 0.440 0.460 0.570 0.537
ETTh2 0.373 0.400 0.383 0.407 0.385 0.410 0.401 0.417 0.942 0.684 0.611 0.550 0.414 0.427 0.559 0.515 0.954 0.723 0.437 0.449 0.526 0.516
PEMS03 0.104 0.210 0.113 0.221 0.137 0.240 0.119 0.233 0.169 0.281 0.326 0.419 0.147 0.248 0.278 0.375 0.114 0.224 0.213 0.327 0.147 0.249
PEMS04 0.102 0.208 0.111 0.221 0.145 0.249 0.103 0.215 0.209 0.314 0.353 0.437 0.129 0.241 0.295 0.388 0.092 0.202 0.231 0.337 0.127 0.240
PEMS07 0.087 0.184 0.101 0.204 0.144 0.233 0.112 0.217 0.235 0.315 0.380 0.440 0.124 0.225 0.329 0.395 0.119 0.234 0.165 0.283 0.127 0.230
PEMS08 0.138 0.219 0.150 0.226 0.200 0.275 0.165 0.261 0.268 0.307 0.441 0.464 0.193 0.271 0.379 0.416 0.158 0.244 0.286 0.358 0.201 0.276
Model efficiency. Our SOFTS model demonstrates efficient performance with minimal memory
and time consumption. Figure 3b illustrates the memory and time usage across different models on
the Traffic dataset, with lookback window L = 96, horizon H = 720, and batch size 4. Despite
their low resource usage, Linear-based or MLP-based models such as DLinear and TSMixer perform
poorly with a large number of channels. Figure 3a explores the memory requirements of the three
best-performing models from Figure 3b. This figure reveals that the memory usage of both PatchTST
and iTransformer escalates significantly with an increase in channels. In contrast, our SOFTS model
maintains efficient operation, with its complexity scaling linearly with the number of channels,
effectively handling large channel counts.
In this section, the prediction horizon (H) is set to {12, 24, 48, 96} for PEMS and {96, 192, 336, 720} for the other datasets. All results are averaged over the four horizons. Unless otherwise specified, the lookback window length (L) is set to 96 by default.
Comparison of different pooling methods. The comparison of different pooling methods in STAR
is shown in Table 3. The term "w/o STAR" refers to a scenario where an MLP is utilized with the
Channel Independent (CI) strategy, without the use of STAR. Mean pooling computes the average
value of all the series representations. Max pooling selects the maximum value of each hidden feature
among all the channels. Weighted average learns the weight for each channel. Stochastic pooling
applies random selection during training and weighted average during testing according to the feature
value. The result reveals that incorporating STAR into the model leads to a consistent enhancement
in performance across all pooling methods. Additionally, stochastic pooling deserves attention as it
outperforms the other methods across nearly all the datasets.
Figure 3: Memory and time consumption of different models. In Figure 3a, we set the lookback window L = 96, horizon H = 720, and batch size to 16 on a synthetic dataset we construct. In Figure 3b, we set the lookback window L = 96, horizon H = 720, and batch size to 4 on the Traffic dataset. Figure 3a reveals that the SOFTS model scales to a large number of channels more effectively than Transformer-based models. Figure 3b shows that previous Linear-based or MLP-based models such as DLinear and TSMixer perform poorly with a large number of channels, while the SOFTS model demonstrates efficient performance with minimal memory and time consumption.
Table 3: Comparison of the effect of different pooling methods. The term "w/o STAR" refers to a
scenario where an MLP is utilized with the Channel Independent (CI) strategy, without the use of
STAR. The result reveals that incorporating STAR into the model leads to a consistent enhancement
in performance across all pooling methods. Apart from that, stochastic pooling performs better than
mean and max pooling. Full results can be found in Table 7.
ECL Traffic Weather Solar ETTh2 PEMS04
Pooling Method
MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
w/o STAR 0.187 0.273 0.442 0.281 0.261 0.281 0.247 0.272 0.381 0.406 0.143 0.245
Mean 0.174 0.266 0.420 0.277 0.261 0.281 0.234 0.262 0.379 0.404 0.106 0.212
Max 0.180 0.270 0.406 0.271 0.259 0.280 0.246 0.269 0.379 0.401 0.116 0.223
Weighted 0.184 0.275 0.440 0.292 0.263 0.284 0.264 0.280 0.379 0.403 0.109 0.218
Stochastic 0.174 0.264 0.409 0.267 0.255 0.278 0.229 0.256 0.373 0.400 0.102 0.208
Influence of lookback window length. Common sense suggests that a longer lookback window should improve forecast accuracy. However, incorporating too many features can lead to a curse of dimensionality, potentially compromising the model's forecasting effectiveness. We explore how varying the lookback window length from 48 to 336 impacts the forecasting performance on all datasets. As shown in Figure 4, SOFTS consistently improves its performance by effectively utilizing the additional information available from an extended lookback
Table 4: The performance of STAR in different models. The attention modules replaced by STAR are the time attention in PatchTST, the channel attention in iTransformer, and both the time and channel attention in the modified Crossformer. The results demonstrate that replacing attention with STAR, which requires fewer computational resources, can maintain and even improve the models' performance on several datasets. †: The Crossformer used here is a modified version that replaces the decoder with a flattened head, as PatchTST does. Full results can be found in Table 8.
Model / Component         ECL           Traffic       Weather       PEMS03        PEMS04        PEMS07
Metric                    MSE   MAE     MSE   MAE     MSE   MAE     MSE   MAE     MSE   MAE     MSE   MAE
PatchTST / Attention      0.189 0.276   0.454 0.286   0.256 0.279   0.137 0.240   0.145 0.249   0.144 0.233
PatchTST / STAR           0.185 0.272   0.448 0.279   0.252 0.277   0.134 0.233   0.136 0.238   0.137 0.225
Crossformer† / Attention  0.202 0.301   0.546 0.297   0.254 0.310   0.100 0.208   0.090 0.198   0.084 0.181
Crossformer† / STAR       0.198 0.292   0.549 0.292   0.252 0.305   0.100 0.204   0.087 0.194   0.080 0.175
iTransformer / Attention  0.178 0.270   0.428 0.282   0.258 0.278   0.113 0.221   0.111 0.221   0.101 0.204
iTransformer / STAR       0.174 0.264   0.409 0.267   0.255 0.278   0.104 0.210   0.102 0.208   0.087 0.184
window. Also, SOFTS performs consistently better than other models under different lookback
window lengths, especially in shorter cases.
Figure 4: Influence of lookback window length L. SOFTS performs consistently better than other
models under different lookback window lengths, especially in shorter cases.
Series embedding adaptation of STAR. The STAR module adapts the series embeddings by
extracting the interaction between channels. To give an intuition of the functionality of STAR, we
visualize the series embeddings before and after being adjusted by STAR. The multivariate series is selected from the test set of Traffic, with a lookback window of 96 and 862 channels. Figure 6
shows the series embeddings visualized by T-SNE before and after the first STAR module. Among the
862 channels, there are 2 channels embedded far away from the other channels. These two channels
can be seen as anomalies, marked as (⋆) in the figure. Without STAR, i.e., using only the channel
independent strategy, the prediction on the series can only achieve 0.414 MSE. After being adjusted
by STAR, the abnormal channels can be clustered towards normal channels by exchanging channel
information. An example of the normal channels is marked as (△). Predictions on the adapted series
embeddings can improve the performance to 0.376, a 9% improvement.
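A sketch of how such a visualization could be produced, assuming access to the per-channel embeddings before and after the first STAR layer; the placeholder arrays and settings below are illustrative only.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# s_before, s_after: (C, d) series embeddings before / after the first STAR layer
s_before = np.random.randn(862, 256)   # placeholder standing in for the real embeddings
s_after = np.random.randn(862, 256)

for s, title in [(s_before, "before STAR"), (s_after, "after STAR")]:
    z = TSNE(n_components=2, init="pca", random_state=0).fit_transform(s)  # 2-D projection
    plt.figure()
    plt.scatter(z[:, 0], z[:, 1], s=5)
    plt.title(f"Series embeddings {title}")
plt.show()
```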
Impact of channel noise. As previously mentioned, SOFTS can cluster abnormal channels towards
normal channels by exchanging channel information. To test the impact of an abnormal channel on
Figure 5: Impact of several key hyperparameters: the hidden dimension of the model, denoted as
d, the hidden dimension of the core, represented by d′ , and the number of encoder layers, N . Full
results can be seen in Appendix E.
Figure 6: Figures 6a and 6b: T-SNE of the series embeddings on the Traffic dataset. 6a: the series embeddings before STAR. Two abnormal channels (⋆) are located far from the other channels. Forecasting on these embeddings achieves 0.414 MSE. 6b: series embeddings after being adjusted by STAR. The two channels are clustered towards the normal channels (△) by exchanging channel information. The adapted series embeddings improve forecasting performance to 0.376. Figure 6c: Impact of noise on one channel. Our method is more robust against channel noise than other methods.
the performance of three models—SOFTS, PatchTST, and iTransformer—we select one channel from
the PEMS03 dataset and add Gaussian noise with a mean of 0 and a standard deviation representing
the strength of the noise. The lookback window and horizon are set to 96 for this experiment. In
Figure 6c, we observe that the MSE of PatchTST increases sharply as the strength of the noise grows.
In contrast, SOFTS and iTransformer can better handle the noise. This indicates that suitable channel
interaction can improve the robustness against noise in one channel using information from the
normal channels. Moreover, SOFTS demonstrates superior noise handling compared to iTransformer.
This suggests that while the abnormal channel can affect the model’s judgment of normal channels,
our STAR module can mitigate the negative impact more effectively by utilizing core representation
instead of building relationships between every pair of channels.
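A sketch of how such a perturbation could be applied to one channel of a test batch, assuming the (batch, channels, length) layout used above; the channel index and noise scale are illustrative.

```python
import torch

def perturb_channel(x, channel=0, strength=1.0):
    # x: (batch, channels, length) lookback window
    noise = torch.randn_like(x[:, channel, :]) * strength   # zero-mean Gaussian noise
    x = x.clone()
    x[:, channel, :] = x[:, channel, :] + noise             # corrupt a single channel
    return x
```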
5 Conclusion
Although channel independence has been found an effective strategy to improve robustness for
multivariate time series forecasting, channel correlation is important information to be utilized
for further improvement. The previous methods faced a dilemma between model complexity and
performance in extracting the correlation. In this paper, we solve the dilemma by introducing the
Series-cOre Fused Time Series forecaster (SOFTS) which achieves state-of-the-art performance with
low complexity, along with a novel STar Aggregate-Redistribute (STAR) module to efficiently capture
the channel correlation.
Our paper explores a way of building a scalable multivariate time series forecaster that maintains equal or even better performance than state-of-the-art methods, which we think may pave the way to forecasting on datasets of much larger scale under resource constraints.
References
[1] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolu-
tional and recurrent networks for sequence modeling. CoRR, abs/1803.01271, 2018.
[2] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series
analysis: forecasting and control. John Wiley & Sons, 2015.
[3] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the
properties of neural machine translation: Encoder-decoder approaches. In SSST@EMNLP,
pages 103–111. Association for Computational Linguistics, 2014.
[4] Razvan-Gabriel Cirstea, Darius-Valer Micu, Gabriel-Marcel Muresan, Chenjuan Guo, and Bin
Yang. Correlated time series forecasting using multi-task deep neural networks. In ICKM, pages
1527–1530, 2018.
[5] Yue Cui, Kai Zheng, Dingshan Cui, Jiandong Xie, Liwei Deng, Feiteng Huang, and Xiaofang
Zhou. METRO: A generic graph neural network framework for multivariate time series
forecasting. Proc. VLDB Endow., 15(2):224–236, 2021.
[6] Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan Mathur, Rajat Sen, and Rose Yu. Long-
term forecasting with tide: Time-series dense encoder. CoRR, abs/2304.08424, 2023.
[7] Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam.
Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting. In KDD, pages
459–469, 2023.
[8] Jean-Yves Franceschi, Aymeric Dieuleveut, and Martin Jaggi. Unsupervised scalable represen-
tation learning for multivariate time series. In NeurIPS, pages 4652–4663, 2019.
[9] Aleksandra Gruca, Federico Serva, Llorenç Lliso, Pilar Rípodas, Xavier Calbet, Pedro Herruzo,
Jiří Pihrt, Rudolf Raevskyi, Petr Šimánek, Matej Choma, et al. Weather4cast at neurips
2022: Super-resolution rain movie prediction under spatio-temporal shifts. In NeurIPS 2022
Competition Track, pages 292–313. PMLR, 2022.
[10] Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. Star-
transformer. In NAACL-HLT, pages 1315–1325. Association for Computational Linguistics,
2019.
[11] Lu Han, Han-Jia Ye, and De-Chuan Zhan. The capacity and robustness trade-off: Revisiting the
channel independent strategy for multivariate time series forecasting. CoRR, abs/2304.05206,
2023.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
[13] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with
gaussian error linear units. CoRR, abs/1606.08415, 2016.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.
[15] Akhil Kadiyala and Ashok Kumar. Multivariate time series models for prediction of air
quality inside a public transportation bus using available software. Environmental Progress &
Sustainable Energy, 33(2):337–341, 2014.
[16] Evaggelos G Kardakos, Minas C Alexiadis, Stylianos I Vagropoulos, Christos K Simoglou,
Pandelis N Biskas, and Anastasios G Bakirtzis. Application of time series and artificial neural
network models in short-term forecasting of pv power generation. In 2013 48th International
Universities’ Power Engineering Conference (UPEC), pages 1–6. IEEE, 2013.
[17] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo.
Reversible instance normalization for accurate time-series forecasting against distribution shift.
In ICLR, 2021.
[18] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[19] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. In ICLR. OpenReview.net, 2017.
[21] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long-and short-term
temporal patterns with deep neural networks. In SIGIR, pages 95–104, 2018.
[22] Bryan Lim and Stefan Zohren. Time series forecasting with deep learning: A survey. CoRR,
abs/2004.13408, 2020.
[23] Shengsheng Lin, Weiwei Lin, Wentai Wu, Feiyu Zhao, Ruichao Mo, and Haotong Zhang.
Segrnn: Segment recurrent neural network for long-term time series forecasting. CoRR,
abs/2308.11200, 2023.
[24] Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu.
Scinet: Time series modeling and forecasting with sample convolution and interaction. In
NeurIPS, 2022.
[25] Yong Liu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Non-stationary transformers:
Exploring the stationarity in time series forecasting. In NeurIPS, 2022.
[26] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng
Long. itransformer: Inverted transformers are effective for time series forecasting. CoRR,
abs/2310.06625, 2023.
[27] Mohammad Amin Morid, Olivia R Liu Sheng, and Joseph Dunbar. Time series prediction using
deep learning methods in healthcare. ACM Transactions on Management Information Systems,
14(1):1–29, 2023.
[28] Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is
worth 64 words: Long-term forecasting with transformers. In ICLR, 2023.
[29] Syama Sundar Rangapuram, Matthias W. Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang,
and Tim Januschowski. Deep state space models for time series forecasting. In NeurIPS, pages
7796–7805, 2018.
[30] Lawrence G Roberts and Barry D Wessler. Computer network development to achieve resource
sharing. In Proceedings of the May 5-7, 1970, spring joint computer conference, pages 543–549,
1970.
[31] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas
Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic,
and Alexey Dosovitskiy. Mlp-mixer: An all-mlp architecture for vision. In NeurIPS, pages
24261–24272, 2021.
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008,
2017.
[33] Xue Wang, Tian Zhou, Qingsong Wen, Jinyang Gao, Bolin Ding, and Rong Jin. Make trans-
former great again for time series forecasting: Channel aligned robust dual transformer. CoRR,
abs/2305.12095, 2023.
[34] Zepu Wang, Yuqi Nie, Peng Sun, Nam H. Nguyen, John M. Mulvey, and H. Vincent Poor.
ST-MLP: A cascaded spatio-temporal linear framework with channel-independence strategy for
traffic forecasting. CoRR, abs/2308.07496, 2023.
[35] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition
transformers with auto-correlation for long-term series forecasting. In NeurIPS, pages 101–112,
2021.
[36] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet:
Temporal 2d-variation modeling for general time series analysis. In ICLR. OpenReview.net,
2023.
[37] Xinle Wu, Dalin Zhang, Chenjuan Guo, Chaoyang He, Bin Yang, and Christian S. Jensen.
Autocts: Automated correlated time series forecasting. Proc. VLDB Endow., 15(4):971–983,
2021.
[38] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, Xiaojun Chang, and Chengqi Zhang.
Connecting the dots: Multivariate time series forecasting with graph neural networks. In
SIGKDD, pages 753–763, 2020.
[39] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding
adaptation with set-to-set functions. In CVPR, pages 8805–8814. Computer Vision Foundation
/ IEEE, 2020.
[40] Kun Yi, Qi Zhang, Wei Fan, Shoujin Wang, Pengyang Wang, Hui He, Ning An, Defu Lian,
Longbing Cao, and Zhendong Niu. Frequency-domain mlps are more effective learners in time
series forecasting. In NeurIPS, 2023.
[41] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabás Póczos, Ruslan Salakhutdinov,
and Alexander J. Smola. Deep sets. In NIPS, pages 3391–3401, 2017.
[42] Matthew D. Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional
neural networks. In ICLR, 2013.
[43] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series
forecasting? In AAAI, pages 11121–11128, 2023.
[44] Tianping Zhang, Yizhuo Zhang, Wei Cao, Jiang Bian, Xiaohan Yi, Shun Zheng, and Jian
Li. Less is more: Fast multivariate time series forecasting with light sampling-oriented MLP
structures. CoRR, abs/2207.01186, 2022.
[45] Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency
for multivariate time series forecasting. In ICLR. OpenReview.net, 2023.
[46] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai
Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In
AAAI, pages 11106–11115, 2021.
[47] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer:
Frequency enhanced decomposed transformer for long-term series forecasting. In ICML,
volume 162, pages 27268–27286, 2022.
A Datasets Description
Table 5: Detailed dataset descriptions. Channels denotes the number of channels in each dataset.
Dataset Split denotes the total number of time points in (Train, Validation, Test) split respectively.
Prediction Length denotes the future time points to be predicted and four prediction settings are
included in each dataset. Granularity denotes the sampling interval of time points.
Dataset Channels Prediction Length Dataset Split Granularity Domain
ETTh1, ETTh2 7 {96, 192, 336, 720} (8545, 2881, 2881) Hourly Electricity
ETTm1, ETTm2 7 {96, 192, 336, 720} (34465, 11521, 11521) 15min Electricity
Weather 21 {96, 192, 336, 720} (36792, 5271, 10540) 10min Weather
ECL 321 {96, 192, 336, 720} (18317, 2633, 5261) Hourly Electricity
Traffic 862 {96, 192, 336, 720} (12185, 1757, 3509) Hourly Transportation
Solar-Energy 137 {96, 192, 336, 720} (36601, 5161, 10417) 10min Energy
PEMS03 358 {12, 24, 48, 96} (15617,5135,5135) 5min Transportation
PEMS04 307 {12, 24, 48, 96} (10172,3375,281) 5min Transportation
PEMS07 883 {12, 24, 48, 96} (16911,5622,468) 5min Transportation
PEMS08 170 {12, 24, 48, 96} (10690,3548,265) 5min Transportation
B Implementation Details
The overall architecture of SOFTS is detailed in Algorithm 1. Initially, a linear layer is employed to
obtain the embedding for each series (Lines 1-2). Subsequently, several encoder layers are applied.
Within each encoder layer, the core representation is first derived by applying an MLP to the series
embeddings and pooling them (Line 4). This core representation is then concatenated with each series
(Line 5), and another MLP is used to fuse them (Line 6). After passing through multiple encoder
layers, a final linear layer projects the series embeddings to the predicted series (Line 8).
Dataset sources: ETT: https://github.com/zhouhaoyi/ETDataset; PEMS: http://pems.dot.ca.gov (https://pems.dot.ca.gov/); ECL: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014.
Algorithm 1 Series-cOre Fused Time Series forecaster (SOFTS) applied to multivariate time series.
Require: Lookback window X ∈ R^{L×C}
1:  X = X.transpose                        ▷ X ∈ R^{C×L}
2:  S_0 = Linear(X)                        ▷ Get series embedding, S_0 ∈ R^{C×d}
3:  for i = 1 . . . N do
4:      o_i = Stoch_Pool(MLP(S_{i-1}))     ▷ Get core representation, o_i ∈ R^{d′}
5:      F_i = Repeat_Concat(S_{i-1}, o_i)  ▷ F_i ∈ R^{C×(d+d′)}
6:      S_i = MLP(F_i) + S_{i-1}           ▷ Fuse series and core, S_i ∈ R^{C×d}
7:  end for
8:  Ŷ = Linear(S_N)                        ▷ Project series embedding to predicted series, Ŷ ∈ R^{C×H}
9:  Ŷ = Ŷ.transpose                        ▷ Ŷ ∈ R^{H×C}
10: return Ŷ
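A compact PyTorch sketch of the forward pass in Algorithm 1, with mean pooling standing in for the stochastic pooling; the class name, MLP shapes, and default hyperparameters are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class SOFTS(nn.Module):
    """Sketch of Algorithm 1: embed each channel, apply N STAR layers, predict H steps."""
    def __init__(self, seq_len, horizon, d=256, d_core=128, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(seq_len, d)                    # line 2: series embedding
        self.mlp1 = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d_core))
                                   for _ in range(n_layers)])
        self.mlp2 = nn.ModuleList([nn.Sequential(nn.Linear(d + d_core, d), nn.GELU(), nn.Linear(d, d))
                                   for _ in range(n_layers)])
        self.head = nn.Linear(d, horizon)                     # line 8: linear predictor

    def forward(self, x):
        # x: (batch, L, C) lookback window, as in Algorithm 1
        s = self.embed(x.transpose(1, 2))                     # lines 1-2: (batch, C, d)
        for mlp1, mlp2 in zip(self.mlp1, self.mlp2):
            core = mlp1(s).mean(dim=1, keepdim=True)          # line 4 (mean pooling stand-in)
            fused = torch.cat([s, core.expand(-1, s.size(1), -1)], dim=-1)   # line 5
            s = mlp2(fused) + s                               # line 6: fuse + residual
        return self.head(s).transpose(1, 2)                   # lines 8-9: (batch, H, C)

# Example: SOFTS(seq_len=96, horizon=720)(torch.randn(4, 96, 862)) -> shape (4, 720, 862)
```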
Core representation. Recall that the core representation for the multivariate time series is defined
in Definition 3.1 with the following form:
o = f(s_1, s_2, . . . , s_C)
To obtain the representation, we draw inspiration from the two theorems:
Theorem B.1 (Kolmogorov-Arnold representation [20]). Let f : [0, 1]^M → R. Then f is an arbitrary multivariate continuous function if and only if it has the representation

f(x_1, . . . , x_M) = ρ( Σ_{m=1}^{M} λ_m φ(x_m) )

with continuous outer and inner functions ρ : R^{2M+1} → R and φ : R → R^{2M+1}. The inner function φ is independent of the function f.
Theorem B.2 (DeepSets [41]). Assume the elements are from a compact set in R^d, i.e., possibly uncountable, and the set size is fixed to M. Then any continuous function operating on a set X, i.e., f : R^{d×M} → R, which is permutation invariant to the elements in X, can be approximated arbitrarily closely in the form of

ρ( Σ_{x∈X} φ(x) ),

for suitable transformations φ and ρ.
The two formulations are very similar, except for the dependence of the inner transformation on the coordinate through λ_m. The existence of λ determines whether the formulation is permutation invariant or not. In this paper, we find in Table 4 that the permutation-invariant expression (Theorem B.2) performs much better than the permutation-variant one (Theorem B.1). This may be attributed to the characteristics of the channel series being sufficient to infer the index of each channel (coordinate). Introducing extra parameters specific to each channel may strengthen the dependence on the channel coordinate and reduce the dependence on the history, therefore causing low robustness when encountering unknown series. Consequently, we utilize the DeepSets form to express the core representation:

o = ρ( Σ_{s∈S} φ(s) ).
We propose two modifications to the expression:
1. We generalize the mean pooling over the inner transformation to arbitrary pooling methods.
2. We remove the outer transformation ρ because it is redundant with the MLP during the
fusion process.
For 1., we tested several common pooling methods and found that the mean pooling and max pooling
outperform each other in different cases. Stochastic pooling (described in the following paragraph)
can achieve the best results in averaged cases (Table 3). So, the core is computed as Equation (3).
Stochastic pooling. Stochastic pooling is a pooling method that combines the characteristics of
max pooling and mean pooling [42]. In stochastic pooling, the pooled map response is selected
by sampling from a multinomial distribution formed from the activations of each pooling region.
Specifically, we first calculate the probabilities p for each dimension j by normalizing the activations within that dimension with a softmax:

p_{ij} = e^{A_{ij}} / Σ_{k=1}^{C} e^{A_{kj}}    (6)

During training, we sample from the multinomial distribution based on p to pick a channel c within dimension j. The pooled result is then simply A_{cj}:

o_j = A_{cj},  where c ∼ P(p_{1j}, p_{2j}, . . . , p_{Cj})    (7)

At test time, we use a probabilistic form of averaging:

o_j = Σ_{i=1}^{C} p_{ij} A_{ij}    (8)
This approach allows for a more robust and statistically grounded pooling mechanism, which can
enhance the generalization capabilities of the model across different data scenarios.
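A minimal sketch of this pooling over channel activations A of shape (batch, C, d′), following Eqs. (6)-(8); the function name and tensor layout are assumptions for illustration.

```python
import torch

def stochastic_pool(a, training=True):
    # a: (batch, C, d') activations; pool over the channel dimension C
    p = torch.softmax(a, dim=1)                                   # Eq. (6): per-dimension channel probabilities
    if training:
        # Eq. (7): sample one channel per (batch, dimension) pair and take its activation
        idx = torch.multinomial(p.permute(0, 2, 1).reshape(-1, a.size(1)), 1)
        idx = idx.view(a.size(0), a.size(2), 1).permute(0, 2, 1)  # (batch, 1, d')
        return a.gather(dim=1, index=idx).squeeze(1)              # (batch, d')
    # Eq. (8): probability-weighted average at test time
    return (p * a).sum(dim=1)
```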
All the experiments are conducted on a single NVIDIA GeForce RTX 3090 with 24G VRAM. The
mean squared error (MSE) loss function is utilized for model optimization. Performance comparison
among different methods is conducted based on two primary evaluation metrics: Mean Squared Error
(MSE) and Mean Absolute Error (MAE). We use the ADAM optimizer [18] with an initial learning
rate of 3 × 10⁻⁴. This rate is modulated by a cosine learning rate scheduler. We explore the number
of STAR blocks N within the set {1, 2, 3, 4}, and the dimension of series d within {128, 256, 512}.
Additionally, the dimension of the core representation d′ is searched in {64, 128, 256, 512}, with the
constraint that d′ does not exceed d.
Mean Squared Error (MSE):

MSE = (1/H) Σ_{i=1}^{H} (Y_i − Ŷ_i)²    (9)

Mean Absolute Error (MAE):

MAE = (1/H) Σ_{i=1}^{H} |Y_i − Ŷ_i|    (10)
where Y, Ŷ ∈ RH×C are the ground truth and prediction results of the future with H time points
and C channels. Yi denotes the i-th future time point.
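As a quick sanity check of Eqs. (9) and (10), the metrics can be computed as below; averaging over horizon, channels, and test batches is assumed, which matches the usual reporting convention.

```python
import torch

def mse(y, y_hat):
    return ((y - y_hat) ** 2).mean()    # Eq. (9), averaged over all points

def mae(y, y_hat):
    return (y - y_hat).abs().mean()     # Eq. (10), averaged over all points
```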
C Full Results
C.1 Full Results of Multivariate Forecasting Benchmark
The complete results of our forecasting benchmarks are presented in Table 6. We conducted experi-
ments using six widely utilized real-world datasets and compared our method against ten previous
state-of-the-art models. Our approach, SOFTS, demonstrates strong performance across these tests.
The complete results of our pooling method ablation are presented in Table 7. The term "w/o STAR"
refers to a scenario where an MLP is utilized with the Channel Independent (CI) strategy, without
the use of STAR. Mean pooling computes the average value of all the series representations. Max
pooling selects the maximum value of each hidden feature among all the channels. Weighted average
learns the weight for each channel. Stochastic pooling applies random selection during training and
weighted average during testing according to the feature value. The result reveals that incorporating
STAR into the model leads to a consistent enhancement in performance across all pooling methods.
Table 6: Multivariate forecasting results with prediction lengths H ∈ {12, 24, 48, 96} for PEMS
and H ∈ {96, 192, 336, 720} for others and fixed lookback window length L = 96. The results
of PatchTST and TSMixer are reproduced for the ablation study and other results are taken from
iTransformer [26].
Models SOFTS (ours) iTransformer PatchTST TSMixer Crossformer TiDE TimesNet DLinear SCINet FEDformer Stationary
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.325 0.361 0.334 0.368 0.329 0.365 0.323 0.363 0.404 0.426 0.364 0.387 0.338 0.375 0.345 0.372 0.418 0.438 0.379 0.419 0.386 0.398
ETTm1
192 0.375 0.389 0.377 0.391 0.380 0.394 0.376 0.392 0.450 0.451 0.398 0.404 0.374 0.387 0.380 0.389 0.439 0.450 0.426 0.441 0.459 0.444
336 0.405 0.412 0.426 0.420 0.400 0.410 0.407 0.413 0.532 0.515 0.428 0.425 0.410 0.411 0.413 0.413 0.490 0.485 0.445 0.459 0.495 0.464
720 0.466 0.447 0.491 0.459 0.475 0.453 0.485 0.459 0.666 0.589 0.487 0.461 0.478 0.450 0.474 0.453 0.595 0.550 0.543 0.490 0.585 0.516
Avg 0.393 0.403 0.407 0.410 0.396 0.406 0.398 0.407 0.513 0.496 0.419 0.419 0.400 0.406 0.403 0.407 0.485 0.481 0.448 0.452 0.481 0.456
96 0.180 0.261 0.180 0.264 0.184 0.264 0.182 0.266 0.287 0.366 0.207 0.305 0.187 0.267 0.193 0.292 0.286 0.377 0.203 0.287 0.192 0.274
ETTm2
192 0.246 0.306 0.250 0.309 0.246 0.306 0.249 0.309 0.414 0.492 0.290 0.364 0.249 0.309 0.284 0.362 0.399 0.445 0.269 0.328 0.280 0.339
336 0.319 0.352 0.311 0.348 0.308 0.346 0.309 0.347 0.597 0.542 0.377 0.422 0.321 0.351 0.369 0.427 0.637 0.591 0.325 0.366 0.334 0.361
720 0.405 0.401 0.412 0.407 0.409 0.402 0.416 0.408 1.730 1.042 0.558 0.524 0.408 0.403 0.554 0.522 0.960 0.735 0.421 0.415 0.417 0.413
Avg 0.287 0.330 0.288 0.332 0.287 0.330 0.289 0.333 0.757 0.610 0.358 0.404 0.291 0.333 0.350 0.401 0.571 0.537 0.305 0.349 0.306 0.347
96 0.381 0.399 0.386 0.405 0.394 0.406 0.401 0.412 0.423 0.448 0.479 0.464 0.384 0.402 0.386 0.400 0.654 0.599 0.376 0.419 0.513 0.491
ETTh1
192 0.435 0.431 0.441 0.436 0.440 0.435 0.452 0.442 0.471 0.474 0.525 0.492 0.436 0.429 0.437 0.432 0.719 0.631 0.420 0.448 0.534 0.504
336 0.480 0.452 0.487 0.458 0.491 0.462 0.492 0.463 0.570 0.546 0.565 0.515 0.491 0.469 0.481 0.459 0.778 0.659 0.459 0.465 0.588 0.535
720 0.499 0.488 0.503 0.491 0.487 0.479 0.507 0.490 0.653 0.621 0.594 0.558 0.521 0.500 0.519 0.516 0.836 0.699 0.506 0.507 0.643 0.616
Avg 0.449 0.442 0.454 0.447 0.453 0.446 0.463 0.452 0.529 0.522 0.541 0.507 0.458 0.450 0.456 0.452 0.747 0.647 0.440 0.460 0.570 0.537
96 0.297 0.347 0.297 0.349 0.288 0.340 0.319 0.361 0.745 0.584 0.400 0.440 0.340 0.374 0.333 0.387 0.707 0.621 0.358 0.397 0.476 0.458
ETTh2
192 0.373 0.394 0.380 0.400 0.376 0.395 0.402 0.410 0.877 0.656 0.528 0.509 0.402 0.414 0.477 0.476 0.860 0.689 0.429 0.439 0.512 0.493
336 0.410 0.426 0.428 0.432 0.440 0.451 0.444 0.446 1.043 0.731 0.643 0.571 0.452 0.452 0.594 0.541 1.000 0.744 0.496 0.487 0.552 0.551
720 0.411 0.433 0.427 0.445 0.436 0.453 0.441 0.450 1.104 0.763 0.874 0.679 0.462 0.468 0.831 0.657 1.249 0.838 0.463 0.474 0.562 0.560
Avg 0.373 0.400 0.383 0.407 0.385 0.410 0.401 0.417 0.942 0.684 0.611 0.550 0.414 0.427 0.559 0.515 0.954 0.723 0.437 0.449 0.526 0.516
96 0.143 0.233 0.148 0.240 0.164 0.251 0.157 0.260 0.219 0.314 0.237 0.329 0.168 0.272 0.197 0.282 0.247 0.345 0.193 0.308 0.169 0.273
192 0.158 0.248 0.162 0.253 0.173 0.262 0.173 0.274 0.231 0.322 0.236 0.330 0.184 0.289 0.196 0.285 0.257 0.355 0.201 0.315 0.182 0.286
ECL
336 0.178 0.269 0.178 0.269 0.190 0.279 0.192 0.295 0.246 0.337 0.249 0.344 0.198 0.300 0.209 0.301 0.269 0.369 0.214 0.329 0.200 0.304
720 0.218 0.305 0.225 0.317 0.230 0.313 0.223 0.318 0.280 0.363 0.284 0.373 0.220 0.320 0.245 0.333 0.299 0.390 0.246 0.355 0.222 0.321
Avg 0.174 0.264 0.178 0.270 0.189 0.276 0.186 0.287 0.244 0.334 0.251 0.344 0.192 0.295 0.212 0.300 0.268 0.365 0.214 0.327 0.193 0.296
96 0.376 0.251 0.395 0.268 0.427 0.272 0.493 0.336 0.522 0.290 0.805 0.493 0.593 0.321 0.650 0.396 0.788 0.499 0.587 0.366 0.612 0.338
192 0.398 0.261 0.417 0.276 0.454 0.289 0.497 0.351 0.530 0.293 0.756 0.474 0.617 0.336 0.598 0.370 0.789 0.505 0.604 0.373 0.613 0.340
Traffic
336 0.415 0.269 0.433 0.283 0.450 0.282 0.528 0.361 0.558 0.305 0.762 0.477 0.629 0.336 0.605 0.373 0.797 0.508 0.621 0.383 0.618 0.328
720 0.447 0.287 0.467 0.302 0.484 0.301 0.569 0.380 0.589 0.328 0.719 0.449 0.640 0.350 0.645 0.394 0.841 0.523 0.626 0.382 0.653 0.355
Avg 0.409 0.267 0.428 0.282 0.454 0.286 0.522 0.357 0.550 0.304 0.760 0.473 0.620 0.336 0.625 0.383 0.804 0.509 0.610 0.376 0.624 0.340
96 0.166 0.208 0.174 0.214 0.176 0.217 0.166 0.210 0.158 0.230 0.202 0.261 0.172 0.220 0.196 0.255 0.221 0.306 0.217 0.296 0.173 0.223
Weather
192 0.217 0.253 0.221 0.254 0.221 0.256 0.215 0.256 0.206 0.277 0.242 0.298 0.219 0.261 0.237 0.296 0.261 0.340 0.276 0.336 0.245 0.285
336 0.282 0.300 0.278 0.296 0.275 0.296 0.287 0.300 0.272 0.335 0.287 0.335 0.280 0.306 0.283 0.335 0.309 0.378 0.339 0.380 0.321 0.338
720 0.356 0.351 0.358 0.347 0.352 0.346 0.355 0.348 0.398 0.418 0.351 0.386 0.365 0.359 0.345 0.381 0.377 0.427 0.403 0.428 0.414 0.410
Avg 0.255 0.278 0.258 0.278 0.256 0.279 0.256 0.279 0.259 0.315 0.271 0.320 0.259 0.287 0.265 0.317 0.292 0.363 0.309 0.360 0.288 0.314
Solar-Energy
96 0.200 0.230 0.203 0.237 0.205 0.246 0.221 0.275 0.310 0.331 0.312 0.399 0.250 0.292 0.290 0.378 0.237 0.344 0.242 0.342 0.215 0.249
192 0.229 0.253 0.233 0.261 0.237 0.267 0.268 0.306 0.734 0.725 0.339 0.416 0.296 0.318 0.320 0.398 0.280 0.380 0.285 0.380 0.254 0.272
336 0.243 0.269 0.248 0.273 0.250 0.276 0.272 0.294 0.750 0.735 0.368 0.430 0.319 0.330 0.353 0.415 0.304 0.389 0.282 0.376 0.290 0.296
720 0.245 0.272 0.249 0.275 0.252 0.275 0.281 0.313 0.769 0.765 0.370 0.425 0.338 0.337 0.356 0.413 0.308 0.388 0.357 0.427 0.285 0.200
Avg 0.229 0.256 0.233 0.262 0.236 0.266 0.260 0.297 0.641 0.639 0.347 0.417 0.301 0.319 0.330 0.401 0.282 0.375 0.291 0.381 0.261 0.381
12 0.064 0.165 0.071 0.174 0.073 0.178 0.075 0.186 0.090 0.203 0.178 0.305 0.085 0.192 0.122 0.243 0.066 0.172 0.126 0.251 0.081 0.188
PEMS03
24 0.083 0.188 0.093 0.201 0.105 0.212 0.095 0.210 0.121 0.240 0.257 0.371 0.118 0.223 0.201 0.317 0.085 0.198 0.149 0.275 0.105 0.214
48 0.114 0.223 0.125 0.236 0.159 0.264 0.121 0.240 0.202 0.317 0.379 0.463 0.155 0.260 0.333 0.425 0.127 0.238 0.227 0.348 0.154 0.257
96 0.156 0.264 0.164 0.275 0.210 0.305 0.184 0.295 0.262 0.367 0.490 0.539 0.228 0.317 0.457 0.515 0.178 0.287 0.348 0.434 0.247 0.336
Avg 0.104 0.210 0.113 0.221 0.137 0.240 0.119 0.233 0.169 0.281 0.326 0.419 0.147 0.248 0.278 0.375 0.114 0.224 0.213 0.327 0.147 0.249
12 0.074 0.176 0.078 0.183 0.085 0.189 0.079 0.188 0.098 0.218 0.219 0.340 0.087 0.195 0.148 0.272 0.073 0.177 0.138 0.262 0.088 0.196
PEMS04
24 0.088 0.194 0.095 0.205 0.115 0.222 0.089 0.201 0.131 0.256 0.292 0.398 0.103 0.215 0.224 0.340 0.084 0.193 0.177 0.293 0.104 0.216
48 0.110 0.219 0.120 0.233 0.167 0.273 0.111 0.222 0.205 0.326 0.409 0.478 0.136 0.250 0.355 0.437 0.099 0.211 0.270 0.368 0.137 0.251
96 0.135 0.244 0.150 0.262 0.211 0.310 0.133 0.247 0.402 0.457 0.492 0.532 0.190 0.303 0.452 0.504 0.114 0.227 0.341 0.427 0.186 0.297
Avg 0.102 0.208 0.111 0.221 0.145 0.249 0.103 0.215 0.209 0.314 0.353 0.437 0.129 0.241 0.295 0.388 0.092 0.202 0.231 0.337 0.127 0.240
12 0.057 0.152 0.067 0.165 0.068 0.163 0.073 0.181 0.094 0.200 0.173 0.304 0.082 0.181 0.115 0.242 0.068 0.171 0.109 0.225 0.083 0.185
PEMS07
24 0.073 0.173 0.088 0.190 0.102 0.201 0.090 0.199 0.139 0.247 0.271 0.383 0.101 0.204 0.210 0.329 0.119 0.225 0.125 0.244 0.102 0.207
48 0.096 0.195 0.110 0.215 0.170 0.261 0.124 0.231 0.311 0.369 0.446 0.495 0.134 0.238 0.398 0.458 0.149 0.237 0.165 0.288 0.136 0.240
96 0.120 0.218 0.139 0.245 0.236 0.308 0.163 0.255 0.396 0.442 0.628 0.577 0.181 0.279 0.594 0.553 0.141 0.234 0.262 0.376 0.187 0.287
Avg 0.087 0.184 0.101 0.204 0.144 0.233 0.112 0.217 0.235 0.315 0.380 0.440 0.124 0.225 0.329 0.395 0.119 0.234 0.165 0.283 0.127 0.230
12 0.074 0.171 0.079 0.182 0.098 0.205 0.083 0.189 0.165 0.214 0.227 0.343 0.112 0.212 0.154 0.276 0.087 0.184 0.173 0.273 0.109 0.207
PEMS08
24 0.104 0.201 0.115 0.219 0.162 0.266 0.117 0.226 0.215 0.260 0.318 0.409 0.141 0.238 0.248 0.353 0.122 0.221 0.210 0.301 0.140 0.236
48 0.164 0.253 0.186 0.235 0.238 0.311 0.196 0.299 0.315 0.355 0.497 0.510 0.198 0.283 0.440 0.470 0.189 0.270 0.320 0.394 0.211 0.294
96 0.211 0.253 0.221 0.267 0.303 0.318 0.266 0.331 0.377 0.397 0.721 0.592 0.320 0.351 0.674 0.565 0.236 0.300 0.442 0.465 0.345 0.367
Avg 0.138 0.219 0.150 0.226 0.200 0.275 0.165 0.261 0.268 0.307 0.441 0.464 0.193 0.271 0.379 0.416 0.158 0.244 0.286 0.358 0.201 0.276
1st Count 40 47 2 4 6 8 1 0 3 0 0 0 1 2 1 0 5 4 4 0 0 0
Table 7: Comparison of the effect of different pooling methods. The term "w/o STAR" refers to a
scenario where an MLP is utilized with the Channel Independent (CI) strategy, without the use of
STAR. The result reveals that incorporating STAR into the model leads to a consistent enhancement
in performance across all pooling methods. Apart from that, stochastic pooling performs better than
mean and max pooling.
Pooling Method w/o STAR Mean Max Weighted Stochastic
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.161 0.248 0.146 0.239 0.150 0.241 0.156 0.247 0.143 0.233
192 0.171 0.259 0.166 0.258 0.165 0.256 0.173 0.264 0.158 0.248
ECL 336 0.188 0.276 0.175 0.269 0.188 0.280 0.190 0.284 0.178 0.269
720 0.228 0.311 0.211 0.300 0.216 0.304 0.217 0.305 0.218 0.305
Avg 0.187 0.273 0.174 0.266 0.180 0.270 0.184 0.275 0.174 0.264
96 0.414 0.266 0.380 0.255 0.386 0.261 0.410 0.275 0.376 0.251
192 0.428 0.272 0.406 0.268 0.397 0.267 0.434 0.288 0.398 0.261
Traffic 336 0.446 0.284 0.442 0.293 0.406 0.273 0.447 0.295 0.415 0.269
720 0.480 0.303 0.453 0.293 0.433 0.284 0.470 0.308 0.447 0.287
Avg 0.442 0.281 0.420 0.277 0.406 0.271 0.440 0.292 0.409 0.267
96 0.179 0.217 0.174 0.213 0.172 0.211 0.180 0.222 0.166 0.208
192 0.227 0.259 0.227 0.260 0.226 0.260 0.226 0.261 0.217 0.253
Weather 336 0.281 0.299 0.281 0.299 0.280 0.298 0.284 0.302 0.282 0.300
720 0.357 0.348 0.361 0.352 0.360 0.350 0.360 0.351 0.356 0.351
Avg 0.261 0.281 0.261 0.281 0.259 0.280 0.263 0.284 0.255 0.278
96 0.215 0.250 0.202 0.238 0.206 0.243 0.219 0.260 0.200 0.230
192 0.246 0.271 0.238 0.260 0.245 0.266 0.255 0.272 0.229 0.253
Solar 336 0.263 0.282 0.248 0.277 0.267 0.284 0.292 0.294 0.243 0.269
720 0.263 0.283 0.247 0.271 0.265 0.284 0.290 0.293 0.245 0.272
Avg 0.247 0.272 0.234 0.262 0.246 0.269 0.264 0.280 0.229 0.256
96 0.298 0.349 0.298 0.348 0.296 0.347 0.292 0.344 0.297 0.347
192 0.375 0.398 0.376 0.396 0.378 0.396 0.387 0.401 0.373 0.394
ETTh2 336 0.420 0.431 0.417 0.430 0.423 0.428 0.428 0.435 0.410 0.426
720 0.433 0.448 0.423 0.442 0.421 0.435 0.409 0.433 0.411 0.433
Avg 0.381 0.406 0.379 0.404 0.379 0.401 0.379 0.403 0.373 0.400
12 0.084 0.189 0.075 0.177 0.078 0.182 0.077 0.180 0.074 0.176
24 0.113 0.220 0.090 0.196 0.095 0.204 0.094 0.203 0.088 0.194
PEMS04 48 0.164 0.266 0.117 0.225 0.126 0.236 0.120 0.231 0.110 0.219
96 0.209 0.304 0.142 0.250 0.164 0.269 0.147 0.258 0.135 0.244
Avg 0.143 0.245 0.106 0.212 0.116 0.223 0.109 0.218 0.102 0.208
C.3 Full Results of STAR Ablation
The complete results of our ablation on the universality of STAR are presented in Table 8. The STar Aggregate-Redistribute (STAR) module is a set-to-set function [39] that can replace the attention module in arbitrary Transformer-based methods. We test the effectiveness of STAR on existing Transformer-based forecasters, such as PatchTST [28] and Crossformer [45]; note that our method itself can be regarded as replacing the channel attention in iTransformer [26]. Concretely, we substitute the time attention in PatchTST with STAR and incrementally replace both the time and channel attention in Crossformer with STAR. The results in Table 8 demonstrate that replacing attention with STAR, which requires fewer computational resources, maintains and in several datasets even improves the models' performance.
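The replacement itself is mechanical: any encoder block that mixes information across tokens (channels in iTransformer, patches in PatchTST) through self-attention can instead mix them through a centralized core. The PyTorch sketch below illustrates such a swap under our own naming and sizing assumptions (STARMixer, MixerLayer, weighted pooling for the core); it is not the released implementation, and the stochastic pooling of Table 7 can be substituted for the pooling step during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class STARMixer(nn.Module):
    """Centralized token mixer: aggregate all tokens into one core, then
    redistribute the core and fuse it with every token. Linear in token count."""

    def __init__(self, d_model: int, d_core: int):
        super().__init__()
        self.to_core = nn.Sequential(nn.Linear(d_model, d_core), nn.GELU())
        self.score = nn.Linear(d_core, 1)            # pooling weights over tokens
        self.fuse = nn.Sequential(nn.Linear(d_model + d_core, d_model), nn.GELU(),
                                  nn.Linear(d_model, d_model))

    def forward(self, x):                            # x: [B, tokens, d_model]
        h = self.to_core(x)                          # [B, T, d_core]
        w = F.softmax(self.score(h), dim=1)          # [B, T, 1]
        core = (w * h).sum(dim=1, keepdim=True)      # [B, 1, d_core] global core
        core = core.expand(-1, x.size(1), -1)        # dispatch core to every token
        return self.fuse(torch.cat([x, core], dim=-1))


class SelfAttention(nn.Module):
    """Thin wrapper so nn.MultiheadAttention exposes the same x -> x interface."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        return self.attn(x, x, x, need_weights=False)[0]


class MixerLayer(nn.Module):
    """Encoder layer with a pluggable token mixer (attention or STAR-style)."""

    def __init__(self, d_model: int, mixer: nn.Module, d_ff: int = 256):
        super().__init__()
        self.mixer = mixer
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.mixer(x))
        return self.norm2(x + self.ff(x))


# Swapping the channel (or patch) mixer is then a one-argument change:
attn_layer = MixerLayer(128, mixer=SelfAttention(128))                 # quadratic in tokens
star_layer = MixerLayer(128, mixer=STARMixer(d_model=128, d_core=64))  # linear in tokens
```

Because the core is a single representation shared by all tokens, the mixing cost grows linearly with the number of channels rather than quadratically.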
D Error Bar
In this section, we test the robustness of SOFTS. We ran 5 experiments with different random seeds, and the averaged results are presented in Table 9. SOFTS shows robust performance across different datasets and horizons.
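As a small illustration, the helper below aggregates one metric over repeated runs into the mean ± deviation format of Table 9; the function name and the use of the sample standard deviation (ddof=1) are our assumptions, not a statement of how the table was produced.

```python
import numpy as np

def summarize_seeds(per_seed_values, digits=3):
    """Aggregate one metric (e.g. MSE for a fixed dataset and horizon) over runs
    that differ only in the random seed, in the 'mean ± deviation' style of Table 9."""
    v = np.asarray(per_seed_values, dtype=float)
    return f"{v.mean():.{digits}f} ± {v.std(ddof=1):.{digits}f}"
```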
[Figure 7 plots: test MSE versus the hidden dimension of series d ∈ {32, 64, 128, 256, 512, 1024}, with one curve per dataset (ETTm1, ETTm2, ETTh1, ETTh2, ECL, Traffic, Weather, Solar, PEMS03, PEMS04, PEMS07, PEMS08).]
Figure 7: Influence of the hidden dimension of series d. Traffic datasets (such as Traffic and PEMS)
require larger hidden dimensions to handle their intricacies effectively.
[Figure 8 plots: test MSE versus the hidden dimension of the core d′ ∈ {64, 128, 256, 512, 1024}, with one curve per dataset as in Figure 7.]
Figure 8: Influence of the hidden dimension of the core d′. Variations in d′ have a minimal influence on the model's overall performance.
Table 8: The performance of STAR in different models. The attention modules replaced by STAR are the time attention in PatchTST, the channel attention in iTransformer, and both the time and channel attention in the modified Crossformer. The results demonstrate that replacing attention with STAR, which requires fewer computational resources, maintains and in several datasets even improves the models' performance. †: The Crossformer used here is a modified version that replaces the decoder with a flatten head, as in PatchTST.
Model iTransformer PatchTST Crossformer
Component Attention STAR Attention STAR Attention STAR
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Electricity
96 0.148 0.240 0.143 0.233 0.164 0.251 0.160 0.248 0.156 0.259 0.166 0.263
192 0.162 0.253 0.158 0.248 0.173 0.262 0.169 0.257 0.182 0.284 0.182 0.277
336 0.178 0.269 0.178 0.269 0.190 0.279 0.187 0.275 0.203 0.305 0.200 0.296
720 0.225 0.317 0.218 0.305 0.230 0.313 0.225 0.308 0.267 0.358 0.243 0.334
Avg 0.178 0.270 0.174 0.264 0.189 0.276 0.185 0.272 0.202 0.301 0.198 0.292
Traffic
96 0.395 0.268 0.376 0.251 0.427 0.272 0.423 0.265 0.508 0.275 0.520 0.277
192 0.417 0.276 0.398 0.261 0.454 0.289 0.434 0.271 0.519 0.281 0.535 0.285
336 0.433 0.283 0.415 0.269 0.450 0.282 0.447 0.278 0.556 0.304 0.551 0.292
720 0.467 0.302 0.447 0.287 0.484 0.301 0.489 0.301 0.600 0.329 0.591 0.315
Avg 0.428 0.282 0.409 0.267 0.454 0.286 0.448 0.279 0.546 0.297 0.549 0.292
Weather
96 0.174 0.214 0.166 0.208 0.176 0.217 0.170 0.214 0.174 0.245 0.174 0.239
192 0.221 0.254 0.217 0.253 0.221 0.256 0.215 0.251 0.219 0.283 0.220 0.282
336 0.278 0.296 0.282 0.300 0.275 0.296 0.273 0.296 0.271 0.327 0.272 0.324
720 0.358 0.347 0.356 0.351 0.352 0.346 0.349 0.346 0.351 0.383 0.343 0.376
Avg 0.258 0.278 0.255 0.278 0.256 0.279 0.252 0.277 0.254 0.310 0.252 0.305
PEMS03
12 0.071 0.174 0.064 0.165 0.073 0.178 0.071 0.173 0.067 0.170 0.065 0.165
24 0.093 0.201 0.083 0.188 0.105 0.212 0.101 0.206 0.081 0.187 0.081 0.184
48 0.125 0.236 0.114 0.223 0.159 0.264 0.157 0.256 0.109 0.220 0.109 0.216
96 0.164 0.275 0.156 0.264 0.210 0.305 0.205 0.296 0.142 0.255 0.147 0.250
Avg 0.113 0.222 0.104 0.210 0.137 0.240 0.134 0.233 0.100 0.208 0.100 0.204
PEMS04
12 0.078 0.183 0.074 0.176 0.085 0.189 0.082 0.184 0.069 0.171 0.071 0.174
24 0.095 0.205 0.088 0.194 0.115 0.222 0.108 0.214 0.082 0.190 0.079 0.185
48 0.120 0.233 0.110 0.219 0.167 0.273 0.155 0.258 0.097 0.207 0.091 0.200
96 0.150 0.262 0.135 0.244 0.211 0.310 0.198 0.297 0.111 0.222 0.106 0.218
Avg 0.111 0.221 0.102 0.208 0.145 0.249 0.136 0.238 0.090 0.198 0.087 0.194
PEMS07
12 0.067 0.165 0.057 0.152 0.068 0.163 0.065 0.160 0.056 0.151 0.055 0.150
24 0.088 0.190 0.073 0.173 0.102 0.201 0.098 0.195 0.070 0.166 0.067 0.165
48 0.110 0.215 0.096 0.195 0.170 0.261 0.162 0.250 0.090 0.192 0.088 0.183
96 0.139 0.245 0.120 0.218 0.236 0.308 0.222 0.294 0.120 0.215 0.110 0.203
Avg 0.101 0.204 0.087 0.184 0.144 0.233 0.137 0.225 0.084 0.181 0.080 0.175
Table 9: Robustness of SOFTS. Results are averaged over 5 experiments with different random seeds.
Dataset ETTm1 Weather Traffic
Horizon MSE MAE MSE MAE MSE MAE
96 0.325 ± 0.002 0.361 ± 0.002 0.166 ± 0.002 0.208 ± 0.002 0.376 ± 0.002 0.251 ± 0.001
192 0.375 ± 0.002 0.389 ± 0.003 0.217 ± 0.003 0.253 ± 0.002 0.398 ± 0.002 0.261 ± 0.002
336 0.405 ± 0.004 0.412 ± 0.003 0.282 ± 0.001 0.300 ± 0.001 0.415 ± 0.002 0.269 ± 0.002
720 0.466 ± 0.004 0.447 ± 0.002 0.356 ± 0.002 0.351 ± 0.002 0.447 ± 0.002 0.287 ± 0.001
Dataset PEMS03 PEMS04 PEMS07
Horizon MSE MAE MSE MAE MSE MAE
12 0.064 ± 0.002 0.165 ± 0.002 0.074 ± 0.000 0.176 ± 0.000 0.057 ± 0.000 0.152 ± 0.000
24 0.083 ± 0.002 0.188 ± 0.002 0.088 ± 0.000 0.194 ± 0.000 0.073 ± 0.003 0.173 ± 0.004
48 0.114 ± 0.004 0.223 ± 0.003 0.110 ± 0.001 0.219 ± 0.002 0.096 ± 0.002 0.195 ± 0.002
96 0.156 ± 0.001 0.264 ± 0.001 0.135 ± 0.003 0.244 ± 0.003 0.120 ± 0.003 0.218 ± 0.003
[Figure 9 plots: test MSE versus the number of encoder layers N ∈ {1, 2, 3, 4}, with one curve per dataset as in Figure 7.]
Figure 9: Influence of the number of encoder layers N. Traffic datasets (such as Traffic and PEMS) require more encoder layers to handle their intricacies effectively.
F Showcase
F.1 Visualization of the Core
In this section, we present a visualization of the core. We freeze the trained model and use it to extract the series embeddings from the final encoder layer. These embeddings are then used as inputs to a two-layer MLP autoencoder whose task is to map the high-dimensional embeddings back to the original input series. The visualization is shown in Figure 10. Highlighted by the red line, the core captures the global trend shared across all the channels. A code sketch of this procedure follows Figure 10.
Figure 10: Visualization of the core, represented by the red line, alongside the original input channels.
We freeze our model and extract the series embeddings from the last encoder layer to train a two-layer
MLP autoencoder. This autoencoder maps the embeddings back to the original series, allowing us to
visualize the core effectively.
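Below is a minimal PyTorch sketch of the visualization procedure just described. The hooks used to obtain the per-channel embeddings and the core (forecaster.embed_series, forecaster.get_core) are assumptions introduced for illustration rather than the repository's API, and the core is assumed to live in the same d_series space as the series embeddings. The idea is simply that a small two-layer decoder, trained to reconstruct the input series from the frozen embeddings, can then be applied to the core embedding to draw it in series space (the red line in Figure 10).

```python
import torch
import torch.nn as nn


class EmbeddingDecoder(nn.Module):
    """Two-layer MLP mapping a d_series-dimensional embedding back to a length-L window."""

    def __init__(self, d_series: int, lookback: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_series, hidden), nn.ReLU(),
                                 nn.Linear(hidden, lookback))

    def forward(self, z):
        return self.net(z)


def visualize_core(forecaster, x, d_series, steps=1000, lr=1e-3):
    """x: [B, L, C] input windows. Returns the core decoded into series space."""
    forecaster.eval()                                  # frozen, trained SOFTS model
    with torch.no_grad():
        emb = forecaster.embed_series(x)               # assumed hook: [B, C, d_series]
        core = forecaster.get_core(x)                  # assumed hook: [B, 1, d_series]
    decoder = EmbeddingDecoder(d_series, lookback=x.size(1))
    optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
    target = x.transpose(1, 2)                         # [B, C, L]: one window per channel
    for _ in range(steps):                             # fit decoder: embedding -> series
        loss = nn.functional.mse_loss(decoder(emb), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return decoder(core).squeeze(1)                    # [B, L]: the red line in Figure 10
```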
F.2 Visualization of Predictions
Figure 11: Visualization of Prediction on ECL dataset with lookback window 96, horizon 96.
Figure 12: Visualization of Prediction on ETTh2 dataset with lookback window 96, horizon 96.
Figure 13: Visualization of Prediction on Traffic dataset with lookback window 96, horizon 96.
As stated in Section 4.2, after being adjusted by STAR, the abnormal channels can be clustered
towards normal channels by exchanging channel information. In this section, we choose two abnormal
channels in the ECL and PEMS03 datasets to demonstrate our SOFTS model’s advantage in handling
noise from abnormal channels. As shown in Figure 15, the value of channel 160 in PEMS03
experiences a sharp decrease followed by a smooth period. In this case, SOFTS is able to capture
the slowly increasing trend effectively. Similarly, in Figure 16, the signal of channel 298 in ECL
resembles the sum of an impulse function and a step function, which lacks a continuous trend. Here,
our SOFTS model provides a more stable prediction compared to the other two models.
Figure 14: Visualization of Prediction on PEMS03 dataset with lookback window 96, horizon 96. Panels: (a) SOFTS (ours), (b) iTransformer, (c) PatchTST.
Figure 15: Visualization of Prediction on abnormal channel in PEMS03 dataset with lookback window 96, horizon 96.
Figure 16: Visualization of Prediction on abnormal channel in ECL dataset with lookback window 96, horizon 96. Panels: (a) SOFTS (ours), (b) iTransformer, (c) PatchTST.
G Limitations
Dependence on core representation quality. The effectiveness of the STAR module heavily
depends on the quality of the global core representation. If this core representation does not accurately
capture the essential features of the individual series, the model’s performance might degrade.
Ensuring the robustness and accuracy of this core representation across diverse datasets remains a
challenge that warrants further research.
H Societal Impacts
The development of the Series-cOre Fused Time Series (SOFTS) forecaster has the potential to
significantly benefit various fields such as finance, traffic management, energy, and healthcare by
improving the accuracy and efficiency of time series forecasting, thereby enhancing decision-making
processes and optimizing operations. However, there are potential negative societal impacts to
consider. Privacy concerns may arise from the use of personal data, especially in healthcare and
finance, leading to possible violations if data is not securely handled. Additionally, biases in the data
could result in unfair outcomes, perpetuating or exacerbating existing disparities. Over-reliance on
automated forecasting models might lead to neglect of important contextual or qualitative factors,
causing adverse outcomes when predictions are incorrect. To mitigate these risks, robust data
protection protocols should be implemented, and continuous monitoring for bias is necessary to
ensure fairness. Developing ethical use policies and maintaining human oversight in decision-making
can further ensure that the deployment of SOFTS maximizes its positive societal impact while
minimizing potential negative consequences.