
Information Fusion 112 (2024) 102589


Full length article

MixMamba: Time series modeling with adaptive expertise


Khaled Alkilane, Yihang He, Der-Horng Lee ∗
Zhejiang University–University of Illinois Urbana-Champaign Institute, Haining 314400, China

ARTICLE INFO

Keywords:
Time series modeling
Mixture-of-experts
Multivariate time series forecasting

ABSTRACT

From finance and healthcare to transportation and beyond, effective time series modeling underpins a wide range of applications. While transformers have achieved success, their reliance on global context limits scalability for lengthy sequences due to the quadratic increase in computational cost with sequence length. Recent research suggests linear models can achieve comparable performance with lower complexity. However, the heterogeneity and non-stationary characteristics of time series data continue to challenge single models' ability to capture complex temporal dynamics, especially in long-term forecasting. This paper proposes MixMamba, a novel framework for time series modeling applicable across diverse domains. The framework leverages the content-based reasoning strengths of the Mamba model by integrating it as an expert within a mixture-of-experts (MoE) framework. This framework decomposes modeling into a pool of specialized experts, enabling the model to learn robust representations and capture the full spectrum of patterns present in time series data. Furthermore, a dynamic gating network is introduced within the framework. This network adaptively allocates each data segment to the most suitable expert based on its characteristics. This is crucial in non-stationary time series, as it allows the model to adjust dynamically to temporal changes in the underlying data distribution. To prevent bias towards a limited subset of experts, a load balancing loss function is incorporated. Extensive experiments on benchmark datasets demonstrate the effectiveness and robustness of our proposed method in various time series modeling tasks, including long-term and short-term forecasting, as well as classification.

1. Introduction

Time series data, a collection of observations measured at consecutive time intervals, forms the foundation for a wide range of real-world applications [1]. Time series modeling has attracted considerable research interest due to its various tasks, including long-term forecasting [2–5], short-term prediction [6–8], and classification [6,9,10]. Therefore, the development of robust models for time series modeling remains a critical area of research.

Time series analysis leverages models to extract valuable insights from the temporal dynamics of variables. By identifying and capturing these underlying temporal patterns, improved data representations can be achieved. The development of such models has been characterized by a shift from traditional statistical methods to advanced deep learning techniques. Initially, statistical methods [11,12] effectively identified trends but were limited by assumptions of linearity in the data. The introduction of machine learning brought greater flexibility [13], yet struggled with capturing long-term dependencies within the time series. Deep learning marked a significant breakthrough by enabling the modeling of complex underlying patterns [1]. However, the pursuit of ever-increasing accuracy, particularly in critical applications like healthcare and autonomous vehicles, necessitates continued development in this field.

The Transformer architecture has emerged as a powerful tool in various sequence modeling tasks, achieving remarkable results in NLP and machine translation. This success has motivated researchers to investigate its applicability to time series modeling. Recent studies employing Transformer-based methods [2,14,15] have shown promise due to their ability to capture complex temporal dependencies. It is noteworthy that simpler architectures, like linear models [8,16], can also capture temporal dynamics in time series data to a certain degree, offering a good level of performance. State space models (SSMs) have a long-standing reputation for effectively modeling the dynamics of sequential data [17]. Recent advancements like Mamba [18] capitalize on this strength, achieving linear computational complexity for lengthy sequences. Mamba utilizes a selective attention mechanism to focus on crucial data points, while a causal convolutional layer facilitates efficient learning from extended sequences [18]. These models are emerging as potential challengers to Transformer-based models in sequential tasks.

∗ Corresponding author.
E-mail addresses: [email protected] (K. Alkilane), [email protected] (Y. He), [email protected] (D.-H. Lee).

https://doi.org/10.1016/j.inffus.2024.102589
Received 8 April 2024; Received in revised form 18 June 2024; Accepted 15 July 2024
Available online 18 July 2024
1566-2535/© 2024 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

Time series data pose significant challenges for modeling due to their non-stationarity, inherent heterogeneity, and long-range dependencies with complex temporal patterns. Existing methods have several limitations. Transformer-based models suffer from a key bottleneck: their inherent quadratic time complexity with respect to sequence length. This arises from the self-attention mechanism, where each element in the sequence attends to all others. This complexity might limit the model's ability to learn effectively from distant past observations. Linear-based models, which rely on stacked linear transformations, have a shallow architecture. This impedes their ability to capture complex, long-range dependencies in time series data. Another limitation of existing studies is the reliance on a single model to learn all the complex patterns and dynamic dependencies. Heterogeneous data can hinder a single model, potentially leading to overfitting or underfitting on different segments and to suboptimal capture of the underlying dynamics.

To address the limitations of existing time series modeling methods, this paper proposes MixMamba, a new approach for effective time series modeling. It incorporates the SSM model, Mamba, as a specialist within the well-established mixture-of-experts (MoE) framework. MixMamba decomposes the modeling task into a collection of specialized Mamba experts, each skilled in capturing a specific aspect of the underlying data dynamics. This division of labor empowers the model to achieve superior representation learning compared to a single model, effectively capturing a wide range of complexities within the data. Furthermore, a gating network is designed to function as a dynamic router. This network dynamically routes data segments to domain-specific experts. This adaptability is crucial for non-stationary time series, as it empowers the model to adjust flexibly to temporal changes in the underlying data distribution. For instance, one expert might specialize in capturing long-term trends, while another focuses on short-term fluctuations. By enabling parallel training of simpler experts, the MixMamba framework facilitates scalability for large datasets compared to the computational challenges of training a single, complex model. Additionally, it embodies the principles of ensemble learning, where multiple models collectively achieve superior performance. The main contributions of this work are threefold:

• This paper proposes MixMamba, a novel approach for time series modeling that leverages collaborative learning and dynamic expert allocation. This combination enables efficient capture of complex temporal dependencies and diverse patterns within the data, resulting in improved performance on time series tasks.
• By exploiting Mamba's linear-time efficiency, our approach overcomes the trade-off between efficient sequence modeling and capturing long-term trends. This allows for scalable learning of long-term dependencies without compromising computational resources.
• A dynamic gating network, acting as an intelligent router, allocates each data segment to the most suitable expert module based on its specific characteristics. This ensures flexible handling of heterogeneity and temporal changes in the data.

To comprehensively evaluate MixMamba's effectiveness, we conducted extensive experiments on widely-used benchmark datasets. The results demonstrate that MixMamba outperforms previous state-of-the-art methods in terms of prediction accuracy for both long-term and short-term forecasting tasks. Additionally, MixMamba achieves improvements in classification accuracy when compared to existing approaches. These empirical findings strongly reinforce the effectiveness of MixMamba's proposed innovations in handling complex time series data.

2. Related works

Recent years have witnessed the rise of deep learning as the dominant approach for time series modeling. A wide array of neural network architectures have been specifically designed to tackle a multitude of time series tasks, including short- and long-term forecasting [3,9,14,15,19] and classification [6,10]. A thorough review of recent advancements in time series representation learning can be found in [1]. This section presents a systematic review of the existing literature, categorized into four schemes.

Traditional methods. This category incorporates statistical [11,12] and machine learning techniques [13,20,21] designed to capture the temporal dependencies within time series data. While these methods demonstrate computational efficiency, making them suitable for large datasets, their reliance on pre-defined statistical assumptions can hinder their ability to effectively model complex non-linear relationships and highly dynamic data.

Transformer-based methods. Motivated by the remarkable success of the Transformer architecture [22] in sequence-to-sequence tasks, researchers have increasingly focused on adapting it to the domain of time series modeling [4,14,23]. Several methods, such as [2,3], have emerged as attractive options due to their inherent advantages in handling sequential data. Unlike traditional methods, Transformers excel at capturing long-range dependencies in sequential data and utilizing these dependencies within sequences. This is particularly significant for time series forecasting, where accurate predictions often hinge on capturing subtle interactions that unfold over extended periods. However, these advantages come at a cost. A notable limitation lies in their high computational complexity, which exhibits quadratic scaling in relation to the input sequence length [18]. This translates to increased memory consumption and prolonged training times, particularly when dealing with extensive datasets. Furthermore, these models may require larger amounts of data to achieve optimal performance, as they typically have a higher number of trainable parameters compared to other architectures. Their complexity may also increase their propensity to overfitting, especially in scenarios with limited available data [1]. Additionally, recent work has explored the application of Graph Convolutional Networks (GCNs) in this domain. Xu et al. propose GDGCN [24], which utilizes parameter sharing, temporal graph convolution, and dynamic graph construction for improved traffic flow forecasting. Zhan et al. [25] propose a hybrid framework that combines Fuzzy C-means (FCM) clustering and feature selection to overcome limitations of Extreme Learning Machines (ELMs) in MTS prediction. Their approach leverages information fusion and a multi-metric strategy to optimize FCM and feature selection, ultimately achieving improved prediction accuracy. Additionally, Zhu et al. [26] recognize limitations in information granulation for long-term MTS forecasting and propose an LSTM-based approach that incorporates multilinear trend fuzzy information granules within a periodic framework.

Linear-based methods. Despite advancements in Transformers, several studies [7,8,16] highlight the continued effectiveness of linear models in time series modeling. These methods, known as univariate linear models, treat multivariate data as collections of univariate sequences and have demonstrated promising results [8]. This approach is further exemplified by the work of [3], where a univariate patch Transformer is proposed. The effectiveness of these methods often relies on the assumption of periodicity or smoothness in the time series data [19]. This assumption has been further generalized to the notion that time series can be decomposed into a periodic component and a component with a smooth trend. Several studies [8,16,19] further emphasize that a key advantage of linear models lies in their static mapping weights for each data point within the sequence. This contrasts with recurrent and attention-based architectures, where mapping weights are outputs of gates and attention layers that are data dependent. However, the inherent simplicity of linear models may not be sufficient to accurately represent complex patterns and dependencies within non-stationary data. Furthermore, their reliance on a key assumption, namely that time series data exhibit periodic patterns or smooth trends, may not be universally applicable, as time series data can display irregular fluctuations.


Deep state space models. Deep SSMs have emerged as a promising alternative to traditional sequence models like RNNs, CNNs, and Transformers [27]. They demonstrate remarkable computational efficiency and robust modeling capabilities, particularly for long sequences [17]. Traditional SSMs offer a principled framework for modeling and learning time series patterns, including trend and seasonality, leading to interpretability and data-efficient learning [28,29]. However, they lack the ability to infer shared patterns across datasets of similar time series because each series is fitted individually. In contrast, deep neural networks (DNNs) offer significant advantages in identifying complex patterns within and across time series, but their interpretability is limited and enforcing assumptions within the model can be challenging. To address these limitations, researchers have explored combining SSMs with DNNs for time series modeling [29,30]. For instance, the work in [17] proposes a method where RNN parameters are learned simultaneously from raw time series data and associated features. This empowers the model to extract relevant characteristics and learn temporal dependencies from the data. Furthermore, a rich body of well-established structured SSMs has been developed in [31–33]. HiPPO [32] is designed for analyzing sequential data at various timescales; it incorporates memory dynamics into RNNs, enabling it to capture complex temporal dependencies. Mamba [18], a recently introduced selective SSM, enhances SSMs with content-based reasoning over discrete modalities. This allows the model to selectively retain or discard information throughout the sequence based on the current element. Additionally, Mamba leverages a hardware-aware parallel algorithm for recurrent operations, facilitating faster inference and linear scaling with sequence length, demonstrating improved performance.

3. Preliminaries

Multivariate time series (MTS) data, characterized by multiple interdependent variables observed over time, present a unique challenge in forecasting due to the inherent complexity arising from both temporal dynamics and interrelationships between variables. In MTS forecasting, we are given historical observations, typically represented as a matrix $X = \{X_1, \ldots, X_L\} \in \mathbb{R}^{L \times N}$, where $L$ denotes the number of time steps and $N$ represents the number of variables. The objective is to forecast a subsequent sequence of future values, $\hat{X} = \{X_{L+1}, \ldots, X_{L+T}\} \in \mathbb{R}^{T \times N}$, for $T$ future time steps. Each time step $t$ is associated with a multidimensional vector $X_t$, reflecting the inherent complexity of the data.

The key challenges in MTS forecasting include: capturing intricate temporal dynamics; modeling complex interdependencies between different variables; and handling the high dimensionality that can arise with a large number of variables.

State space models (SSMs) are a powerful class of mathematical models employed to represent systems characterized by hidden internal states. These models map one-dimensional input sequences, $u(t) \in \mathbb{R}$, to output sequences, $y(t) \in \mathbb{R}$, utilizing a hidden state variable, $h(t) \in \mathbb{R}^V$. They are defined by a system of linear ordinary differential equations:

$\dot{h}(t) = \mathbf{A}h(t) + \mathbf{B}u(t), \quad y(t) = \mathbf{C}h(t) + \mathbf{D}u(t)$  (1)

where $\dot{h}(t)$ denotes the time derivative of the state vector $h(t)$, $\mathbf{A} \in \mathbb{R}^{V \times V}$ represents the state evolution matrix, $\mathbf{B} \in \mathbb{R}^{V \times 1}$ and $\mathbf{C} \in \mathbb{R}^{1 \times V}$ are the input and output projection matrices, respectively, and $\mathbf{D}$ is used for the skip connection.

While well-suited for continuous data, traditional SSMs are not directly applicable to discrete data like MTS data. This necessitates the adaptation of SSMs to the discrete domain. To address this, a timescale, $\Delta$, is introduced to transform the continuous parameters, $\mathbf{A}$ and $\mathbf{B}$, into their discrete counterparts, $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$. Methods like the zero-order hold (ZOH) can be employed for this transformation, defined as:

$\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}), \quad \bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}\left(\exp(\Delta\mathbf{A}) - I\right) \cdot \Delta\mathbf{B}$  (2)

The resulting discretized model takes the form:

$h_t = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}u_t, \quad y_t = \mathbf{C}h_t$  (3)

This discretized form enables direct application to MTS data. The key advantage lies in the recursive relationship with respect to the hidden state, $h(t)$, which allows the model to capture temporal dynamics more effectively in various contexts. The model can also produce its output through a global convolution, defined as:

$\bar{\mathbf{K}} = \left(\mathbf{C}\bar{\mathbf{B}}, \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}}, \ldots, \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}\right), \quad y = u * \bar{\mathbf{K}}$  (4)

Here, $L$ is the input sequence length and $\bar{\mathbf{K}} \in \mathbb{R}^L$ represents a structured convolutional kernel.

However, a key limitation of traditional SSMs lies in their inherent Linear Time-Invariant (LTI) nature. Fixed parameters like $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\Delta$ restrict their ability to capture nuances in diverse sequences. The selective SSM model, Mamba [18], addresses this by making these parameters input-dependent. This transition to a time-variant model enhances adaptability and leads to a more accurate representation of the input sequence.

4. Method

This study introduces MixMamba, which combines MoE and Mamba for effective time series dependency modeling. The schematic architecture of the model is depicted in Fig. 1. The model preprocesses data through normalization, segments sequences into manageable parts, and embeds them for efficient processing. It then employs positional encoding to retain temporal order. The core MoM block dynamically selects the most suitable Mamba modules to extract key features from each segment. A gating network assigns weights to these modules, determining their influence on the final prediction. Finally, the model denormalizes the output to recover the original data scale.

4.1. Normalization

Due to the significant challenge of distribution shift, where the statistical properties of time series data (e.g., mean and variance) change over time, accurate forecasting becomes difficult due to discrepancies between training and test data distributions. To tackle this issue, we use Reversible Instance Normalization (RevIN) [34]. RevIN operates in two stages: (1) normalization, where it removes non-stationary information from input sequences to reduce distribution discrepancies, and (2) denormalization, where it injects the removed information back into the output sequences to preserve the original data distribution. The mean $\mu$ and standard deviation $\sigma$ are computed for every instance $X^{(i)} \in \mathbb{R}^L$ of the input data as:

$\mu_t^{(i)} = \frac{1}{L}\sum_{j=1}^{L} X_j^{(i)}, \quad \sigma_t^{(i)} = \frac{1}{L}\sum_{j=1}^{L}\left(X_j^{(i)} - \mu_t^{(i)}\right)^2$  (5)

Following this, the normalized input is computed using learnable affine parameter vectors $\gamma, \beta \in \mathbb{R}^N$ as follows:

$X_t^{(i)} = \gamma_N\left(\frac{X_t^{(i)} - \mu_t^{(i)}}{\sqrt{\sigma_t^{(i)} + \epsilon}}\right) + \beta_N$  (6)

Finally, the denormalized forecast value, denoted by $\hat{X}$, is obtained by applying the following equation to the model output, denoted by $\tilde{X}$:

$\hat{X}_t^{(i)} = \sqrt{\sigma_t^{(i)} + \epsilon} \cdot \left(\frac{\tilde{X}_t^{(i)} - \beta_N}{\gamma_N}\right) + \mu_t^{(i)}$  (7)

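To make the RevIN step of Section 4.1 concrete, the following is a minimal PyTorch sketch of Eqs. (5)–(7). The class and method names (RevIN, normalize, denormalize) are illustrative assumptions rather than the authors' released code; the per-instance statistics are simply cached on the module between the two calls.

```python
import torch

class RevIN(torch.nn.Module):
    """Minimal reversible instance normalization (Eqs. (5)-(7)).

    Illustrative sketch only; names and details are assumptions,
    not the authors' implementation.
    """
    def __init__(self, num_features: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = torch.nn.Parameter(torch.ones(num_features))  # affine scale (gamma)
        self.beta = torch.nn.Parameter(torch.zeros(num_features))  # affine shift (beta)

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, N); statistics computed per instance over the time axis (Eq. (5))
        self.mu = x.mean(dim=1, keepdim=True)
        self.var = x.var(dim=1, keepdim=True, unbiased=False)
        x_norm = (x - self.mu) / torch.sqrt(self.var + self.eps)    # Eq. (6)
        return x_norm * self.gamma + self.beta

    def denormalize(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, T, N) model output; invert the affine map and the statistics (Eq. (7))
        y = (y - self.beta) / self.gamma
        return y * torch.sqrt(self.var + self.eps) + self.mu


if __name__ == "__main__":
    revin = RevIN(num_features=7)
    x = torch.randn(4, 96, 7)          # (batch, look-back L, variables N)
    x_in = revin.normalize(x)          # fed to the forecasting model
    y_hat = torch.randn(4, 720, 7)     # stand-in for the model's raw forecast
    y_out = revin.denormalize(y_hat)   # restored to the original data scale
```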

Fig. 1. Schematic architecture of MixMamba. The process begins with pre-processing the raw time series data through normalization and segmentation (left). These patches are
then embedded and augmented with positional information before being input into the mixture of Mamba (MoM) block (center). This block consists of multiple Mamba experts
coordinated via a gating network (right). Each Mamba module includes a series of projections, convolution, selective SSM, and a skip connection to learn temporal dependencies.
Finally, a linear prediction head is employed to generate final outputs.

4.2. Patch segmentation

Drawing inspiration from the success of the Transformer architecture in NLP, Vision Transformer (ViT) [35] pioneered the use of patching in image processing tasks. The process involves segmenting an image into a sequence of smaller patches, which are then processed by the Transformer. Recently, PatchTST [3] successfully applied the concept of patch segmentation to time series data. In this work, we build upon this successful approach by segmenting time series data into patches. These patches serve as input tokens that are fed into the MoM block, a core component of our model. This method effectively reduces the dimensionality of the time series data, making it computationally more manageable for the model to process. Consequently, the model gains the ability to capture important local features within each patch. Patch-based processing offers a significant advantage over considering the entire time series at once, as it helps to preserve local dependencies and structures within individual segments that might otherwise be obscured. Additionally, training on patches potentially encourages the model to learn generalizable features that are invariant across different segments of the time series. This, in turn, could enhance the model's capacity to generalize from training data to unseen data, leading to improved forecasting performance.

The initial stage of our approach involves transforming the input univariate time series into a sequence of patches. These patches have a predefined size of $P$ and can be either overlapping or non-overlapping. The non-overlapping strategy relies on a concept called stride, denoted by $S$, which determines the distance between the beginnings of two consecutive patches. The resulting output is a sequence of patches $X_p^{(i)} \in \mathbb{R}^{n \times P}$, where $n$ represents the total number of patches extracted from the original series. The calculation of $n$ leverages the floor function to account for incomplete final patches and is formulated as $n = \lfloor (L - P)/S \rfloor + 2$.

4.3. Patch embedding and positional encoding

The Mamba module maintains a consistent latent vector size across all its layers. To ensure compatibility with this dimension, we employ a trainable linear projection $e_p \in \mathbb{R}^{P \times D}$ in Eq. (8) to map each patch from its original dimension to $D$. The resulting vectors are referred to as patch embeddings. To preserve the temporal order within the sequence of patches, we incorporate position embeddings. These embeddings are learnable one-dimensional vectors $e_{pos} \in \mathbb{R}^{n \times D}$, and are implemented as described in [3]. Each position within the sequence is encoded using a set of trigonometric functions across $D$. By combining the patch embeddings and the position embeddings, we obtain a sequence of embedding vectors $Z_p$, which serves as the input to the subsequent MoM block in our model.

$Z_p^{(i)} = \left[X_p^{(1)}e_p; \ldots; X_p^{(n)}e_p\right] + e_{pos}, \quad Z_p^{(i)} \in \mathbb{R}^{n \times D}$  (8)

4.4. Mixture-of-Mamba (MoM)

MixMamba presents a promising approach to tackle the challenges of time series modeling. Composed of several expert models and a gating network, MixMamba allows individual experts to focus on learning specific patterns and relationships within the complex time series data. This specialization is particularly effective for time series data that exhibit a wide range of patterns and trends across various segments, thereby enabling MixMamba models to capture both short-term trends and long-term seasonality. The gating network, typically a specialized neural network, dynamically determines the weighting or contribution of each expert's output towards the final output. As a result, MixMamba has the potential to achieve higher accuracy in comparison to single-model approaches, leveraging the collective expertise of each expert focused on specific aspects of the data.

We propose the MoM block, specifically designed to capture the complex and evolving temporal dependencies within time series data. This block consists of a set of $\eta$ experts $(\mathcal{E}_1, \ldots, \mathcal{E}_\eta)$, each being a Mamba module with its own set of trainable parameters, and a gating network $\mathcal{G}$ generating a sparse $\eta$-dimensional vector. Each element in this vector corresponds to the weighting or contribution of the respective expert in the final prediction. Fig. 1 provides a visual representation of the MoM block.
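Before detailing the MoM block further, the sketch below illustrates the patch segmentation of Section 4.2 and the embedding of Eq. (8) for a single channel. It is a hypothetical PyTorch fragment, not the authors' implementation; it assumes PatchTST-style end-replication padding so the patch count equals ⌊(L − P)/S⌋ + 2, and it uses learnable position embeddings as one possible reading of the description above.

```python
import torch

def patchify(x: torch.Tensor, patch_len: int = 16, stride: int = 8) -> torch.Tensor:
    """Split a batch of univariate series (batch, L) into patches (batch, n, P).

    The last stride of the series is replicated once so the final, incomplete
    patch is kept, giving n = floor((L - P) / S) + 2 patches.
    """
    last = x[:, -1:].repeat(1, stride)          # replicate the final value S times
    x = torch.cat([x, last], dim=1)             # (batch, L + S)
    return x.unfold(dimension=1, size=patch_len, step=stride)  # (batch, n, P)

class PatchEmbedding(torch.nn.Module):
    """Linear patch projection plus learnable position embeddings (Eq. (8))."""
    def __init__(self, patch_len: int, d_model: int, n_patches: int):
        super().__init__()
        self.proj = torch.nn.Linear(patch_len, d_model)                        # e_p
        self.pos = torch.nn.Parameter(torch.randn(n_patches, d_model) * 0.02)  # e_pos

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, n, P) -> embeddings Z_p: (batch, n, D)
        return self.proj(patches) + self.pos

if __name__ == "__main__":
    L, P, S, D = 96, 16, 8, 128
    x = torch.randn(32, L)                      # one channel of a multivariate batch
    patches = patchify(x, P, S)                 # (32, 12, 16); n = (96 - 16)//8 + 2 = 12
    embed = PatchEmbedding(P, D, n_patches=patches.shape[1])
    z_p = embed(patches)                        # (32, 12, 128), input to the MoM block
```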


Algorithm 1 Expert
Require: patch sequence Z_pd : (B', C, D)
Ensure: patch sequence Z_m : (B', C, D)
1: for i ← 0 to L do  ⊳ Run through the Mamba module
2:   Z_pd : (B', C, D) ← RMSNorm(Z_pd)
3:   Z_g : (B', C, E) ← SiLU(Linear(Z_pd))
4:   Z_c : (B', C, E) ← Linear(Z_pd)
5:   Z_c : (B', C, E) ← SiLU(Conv1d(Z_c))
6:   A : (E, V) ← Parameter  ⊳ Represents a V × V matrix
7:   B, C : (B', C, V) ← Linear(Z_c)
8:   Δ : (B', C, V) ← τ_Δ(Parameter + Linear(Z_c))
9:   Ā, B̄ : (B', C, E, V) ← Discretize(Δ, A, B)
10:  Z_s : (B', C, E) ← SSM(Ā, B̄, C)(Z_c)
11:  Z_o : (B', C, E) ← Z_s ⊙ Z_g
12:  Z_o : (B', C, D) ← Linear(Z_o)
13:  Z_pd : (B', C, D) ← Z_o + Z_pd  ⊳ Skip connection
14:  Z_m ← Z_pd
15: end for
16: return Z_m

Algorithm 2 Gating Network
Require: patch sequence Z_p : (B', n, D)
Ensure: weights and dispatch mask W, Z_mask : (B', n, η, C)
1: Z_e : (B', n, η) ← Linear(Z_p)
2: Z_e : (B', n, η) ← H(Z_e)  ⊳ Noise
3: W, Z_mask : (B', n, η, C) ← Softmax(TopKGating(Z_e, k))
4: L_aux ← L_Z(Z_p) + L_B(Z_p)  ⊳ Auxiliary loss
5: return W, Z_mask, L_aux  ⊳ Weights; dispatch mask; loss

Algorithm 3 MixMamba
Require: time series X : (B, L, N)
Ensure: forecast X̂ : (B, T, N)
1: X' : (B, L, N) ← Norm(X)
2: X' : (B, N, L) ← Permute(X')
3: X_p : (B, N, n, P) ← Patchify(X')  ⊳ Patch segmentation
4: Z_p : (B, N, n, D) ← Linear(X_p)
5: Z_p : (B', n, D) ← Reshape(Z_p)  ⊳ Channel independence; B' = B × N
6: W, Z_mask, L_aux : (B', n, η, C) ← G(Z_p)  ⊳ Run Algorithm 2
7: Z_pd : (B', η, C, D) ← einsum(Z_p, Z_mask)  ⊳ Masked input
8: for i ← 0 to η do
9:   Z_m^(i) : (B', C, D) ← E^(i)(Z_pd)  ⊳ Run Algorithm 1
10: end for
11: out : (B', η, C, D) ← Stack(Z_m^(i))
12: Z_θ : (B', n, D) ← einsum(W, out)  ⊳ Weighted sum
13: Z_θ : (B, N, n, D) ← Reshape(Z_θ)
14: Z_θ : (B, N, n·D) ← Reshape(Z_θ)
15: Z_h : (B, N, T) ← Linear(Z_θ)  ⊳ Prediction head
16: Z_h : (B, T, N) ← Permute(Z_h)
17: X̂ : (B, T, N) ← De-Norm(Z_h)
18: L ← L_MAE(X̂, Y) + L_aux  ⊳ Overall loss; Y: ground truth
19: return X̂, L

Given a sequence of embedded patches Z_p, the final output of the MoM block, Z_θ, is calculated by summing the element-wise products of the gating network's outputs G(Z_p)_i and the individual expert outputs E_i(Z_p):

$Z_\theta = \sum_{i=1}^{\eta} \mathcal{G}(Z_p)_i \cdot \mathcal{E}_i(Z_p)$  (9)

4.4.1. Expert model

Building upon the efficiency of structured state space models (SSMs) [31–33] for sequence modeling, we utilize the recently introduced Mamba architecture to address the challenge of handling discrete and information-dense data in the context of time series. Mamba's key strengths lie in its selection mechanism and its linear scaling with sequence length. The selection mechanism enables the model to prioritize crucial information within the input sequence, effectively filtering out irrelevant details. This is particularly advantageous in time series, where identifying relevant features from multiple variables is crucial for accurate forecasting. Additionally, Mamba's linear scaling with sequence length makes it well-suited for handling long sequences of time series data.

We leverage the Mamba module as an expert model for modeling long-range dependencies within time series tasks. Fig. 1 provides a visual representation of the Mamba module's structure, while Algorithm 1 outlines the sequence of its operations. Specifically, the input sequence Z_p is initially normalized using a Root Mean Square Normalization layer (RMSNorm) [36]. The normalized sequence is then duplicated and directed towards two processing branches: the gate branch and the main branch. The gate branch processes its copy of the sequence through a projection layer that maps it to a higher-dimensional space E. This projection aims to extract informative features from the sequence. Subsequently, a Swish activation function (SiLU) is applied for non-linearity. The output of the gate branch is computed in Eq. (10). The main branch also processes its copy of the sequence through a projection layer followed by a convolution layer. This convolution helps capture local dependencies within the time series sequence. A SiLU activation function is then applied to introduce non-linearity. Subsequently, a Selective SSM (S6) [18] is employed to generate sequence representations Z_s in Eq. (11), capturing long-range dependencies within the data. For a deeper understanding of the S6 module's inner workings, please refer to Section 3. Following individual branch processing, the outputs from both branches are multiplied element-wise. This step enables the gate branch to control the information flow from the main branch. The resulting product is then projected back to the original input dimension, D. Finally, a skip connection is added to facilitate gradient flow through the network and improve training performance. The output of the Mamba module E is computed in Eq. (12).

$Z_g = \mathrm{SiLU}\left(\mathcal{W}_g \cdot \mathrm{Norm}(Z_p)\right)$  (10)

$Z_s = \mathrm{SSM}\left(\mathrm{SiLU}\left(\mathrm{Conv}\left(\mathcal{W}_c \cdot \mathrm{Norm}(Z_p)\right)\right)\right)$  (11)

$\mathcal{E}(Z_p) = \mathcal{W}_m \cdot \left(Z_g \odot Z_s\right) + Z_p$  (12)

where W_g, W_c, and W_m are trainable weight matrices.

4.4.2. Gating network

Our proposed MixMamba architecture incorporates a refined gating network design inspired by research in [37]. This design, illustrated in Fig. 1 (upper right panel), introduces sparsity and enhances computational efficiency (as detailed in Algorithm 2). Its key functionalities are:

∙ Tunable sparsity: Prior to applying the Softmax function, the network output is augmented with tunable Gaussian noise, controlled by a weight matrix W_noise, to introduce a degree of randomness and improve load balancing. To achieve sparsity and enhance computational efficiency, only the top k values from the network output are retained, while the remaining values are suppressed to −∞, effectively setting their corresponding gate values to zero.

∙ Balanced gating: This selective approach offers a promising balance between the aforementioned objectives, potentially leading to improved performance in capturing complex temporal dependencies within multivariate time series data.

$\mathcal{G}(Z_p) = \mathrm{Softmax}\left(\mathrm{TopKGating}(H(Z_p), k)\right)$  (13)

$H(Z_p)_i = (Z_p \mathcal{W}_{gate})_i + \mathrm{Softplus}\left((Z_p \mathcal{W}_{noise})_i\right)$  (14)

$\mathrm{TopKGating}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}$  (15)

where W_gate and W_noise are trainable weight matrices.

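The following is a minimal PyTorch sketch of the noisy top-k gate of Eqs. (13)–(15) together with the weighted combination of Eq. (9). All names are illustrative assumptions rather than the authors' code; the explicit Gaussian draw follows the "tunable Gaussian noise" description in the text, and the auxiliary losses and expert buffer capacity are omitted for brevity.

```python
import torch
import torch.nn.functional as F

class NoisyTopKGate(torch.nn.Module):
    """Sketch of the noisy top-k gate of Eqs. (13)-(15); illustrative only."""
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_gate = torch.nn.Linear(d_model, n_experts, bias=False)   # W_gate
        self.w_noise = torch.nn.Linear(d_model, n_experts, bias=False)  # W_noise

    def forward(self, z_p: torch.Tensor) -> torch.Tensor:
        # z_p: (tokens, D) -> gate weights: (tokens, n_experts), sparse over experts
        clean = self.w_gate(z_p)
        noise_scale = F.softplus(self.w_noise(z_p))
        logits = clean + noise_scale * torch.randn_like(clean)   # Eq. (14), Gaussian noise
        topk_val, topk_idx = logits.topk(self.k, dim=-1)         # Eq. (15): keep only top-k
        masked = torch.full_like(logits, float("-inf"))
        masked.scatter_(-1, topk_idx, topk_val)                  # remaining logits -> -inf
        return F.softmax(masked, dim=-1)                         # Eq. (13)

if __name__ == "__main__":
    gate = NoisyTopKGate(d_model=128, n_experts=16, k=2)
    z_p = torch.randn(32 * 12, 128)              # flattened (B' * n) patch embeddings
    w = gate(z_p)                                # two non-zero weights per patch
    expert_out = torch.randn(16, 32 * 12, 128)   # stand-in for E_i(Z_p), i = 1..16
    z_theta = torch.einsum("te,ted->td", w, expert_out.permute(1, 0, 2))  # Eq. (9)
```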

As noted by [38,39], the gating network often converges to a state where it consistently assigns large weights to a select few experts. To mitigate this, we adopt the approach proposed by [38] by defining the significance of an expert in relation to a batch of training samples as the sum of the gate values for that specific expert across the batch. This balance loss L_B, defined in Eq. (16), is equivalent to the square of the coefficient of variation (CV) of the load vector, multiplied by a scaling factor w_B that is adjusted manually. Minimizing this loss term encourages a more uniform distribution and ensures balanced exposure to samples across all experts.

$\mathcal{L}_B(Z_p) = w_B \cdot \mathrm{CV}\left(\mathrm{Load}(Z_p)\right)^2$  (16)

where Load(·) denotes a smooth estimator that quantifies the expert-wise sample allocation [38].

∙ Regularization: To further regularize the gating network, we incorporate an additional loss term L_Z based on [37]. This loss penalizes large logits within the network, and its effectiveness before the gating layer has been demonstrated in [37].

$\mathcal{L}_Z(Z_p) = \frac{1}{n}\sum_{i=1}^{n}\left(\log\sum_{j=1}^{\eta} e^{(Z_p)_{ij}}\right)^2$  (17)

where n represents the number of patches and η denotes the number of experts.

The total auxiliary loss L_aux is subsequently calculated as a linear combination of the aforementioned losses weighted by specific factors, and then incorporated into the overall model loss function. Algorithm 3 presents a detailed outline of MixMamba's operation. This algorithm comprehensively describes the step-by-step execution of our proposed approach, facilitating a clear understanding of its functionality.

5. Experiments

Datasets. We compiled 21 benchmark datasets representing various time series data types, obtained from several sources widely used in the field of time series modeling.¹ These datasets encompass diverse domains: mechanical systems (ETT), electricity (ECL), traffic (PEMS), weather, economics (Exchange), and disease (ILI). Table 1 presents relevant data statistics. The M4 [40] dataset is used for short-term forecasting. This dataset is derived from the Makridakis competitions and offers a rich collection of 100,000 time series across six subsets, encompassing a wide spectrum of frequencies, including high-frequency data (hourly, daily, and weekly) and low-frequency data (monthly, quarterly, and yearly). For the classification task, we leverage seven diverse datasets from the UEA archive [41]. These datasets encompass various data modalities, including image, audio, and gesture, and represent a range of classification tasks, such as handwriting recognition, heartbeat classification, and spoken digit recognition. To ensure the integrity of our experiments, we adhered to the standard protocol [14] for data processing and split the data in chronological order.

Table 1
Statistics of long-term forecast datasets.
Dataset        Variables  Size                      Frequency  Domain
ETTh1, ETTh2   7          (8545, 2881, 2881)        Hourly     Electricity
ETTm1, ETTm2   7          (34 465, 11 521, 11 521)  15 min     Electricity
Exchange       8          (5120, 665, 1422)         Daily      Economy
Weather        21         (36 792, 5271, 10 540)    10 min     Weather
ECL            321        (18 317, 2633, 5261)      Hourly     Electricity
PEMS03         358        (15 617, 5135, 5135)      5 min      Transportation
PEMS08         170        (10 690, 3548, 265)       5 min      Transportation
ILI            7          (617, 74, 170)            Weekly     Illness

Baselines. We leverage a range of recent baseline models categorized by their underlying architecture. Transformer-based methods include iTransformer [2], PatchTST [3], Crossformer [15], FEDformer [4], Stationary [9], Informer [23], and Autoformer [14]. Linear-based methods include DLinear [8], RLinear [16], and TiDE [19]. Other deep learning methods include TimesNet [6], SCINet [42], LightTS [7], MICN [5], and FiLM [43]. To ensure a consistent evaluation framework, all models were subjected to the same experimental setup. Baseline results for long-term forecasting (L = 96) were obtained from [2]. The evaluation metrics employed in this study were Mean Squared Error (MSE) and Mean Absolute Error (MAE).

Model parameters. The default configuration of MixMamba employs sixteen experts (η = 16), each containing a single Mamba module. The model operates in a D-dimensional space (D ∈ [32, 512]). Each Mamba module comprises three linear layers: two for projecting the data to a higher-dimensional space (E = 2 × D) and one for projecting it back to the original dimensionality. The SiLU activation function is used within the Mamba module, while the GELU activation function is employed elsewhere in the model. The gating network utilizes a single linear layer to project the D-dimensional features to the number of experts. The gating network's parameter k, controlling the number of active experts per data point, is set to 2. Additionally, the buffer capacity (C) for each expert within each batch is set to 8 by default. Similar to other baseline models, the look-back window (L) and prediction window (T) are dataset-specific. Patch-based processing is implemented following PatchTST [3] with a patch size of 16 and a stride of 8. The MAE serves as the primary loss function, along with the auxiliary loss (L_aux) derived from the gating network. The AdamW optimizer is employed for training with a learning rate of 1e−3. The model is implemented using PyTorch on a Linux workstation equipped with an Nvidia GeForce RTX 4080 GPU. To ensure reliability, all experiments are replicated ten times with varying random seeds. The results presented throughout this work represent the average performance across these ten replications.

5.1. Long-term forecasting results

To evaluate the efficacy of our proposed model, MixMamba, in multivariate time series long-term forecasting, we perform a series of rigorous experiments utilizing 10 benchmark datasets, namely ETT (4 subsets), ECL, ILI, PEMS03, PEMS08, Weather, and Exchange. The comprehensive results are presented in Table 2. We evaluate the model's performance across various prediction window lengths, ranging from 96 to 720 time steps. To assess the accuracy of the predictions, we employ MSE and MAE as metrics, with lower values signifying higher prediction accuracy.

The findings presented in Table 2 provide compelling evidence of the superiority of MixMamba in comparison to existing methods. MixMamba achieves state-of-the-art performance, consistently exhibiting the lowest MSE/MAE across most datasets and prediction horizons. Our model achieves substantial improvements over SOTA methods such as FEDformer and Autoformer. Furthermore, MixMamba demonstrates superior performance when compared to recently introduced models, including PatchTST, iTransformer, DLinear, and RLinear. This success can be attributed to the following key features. Firstly, MixMamba utilizes a gating network that strategically routes data to specialized expert models. This approach allows MixMamba to effectively model the diverse characteristics in time series data, leading to the generation of more informative and accurate representations.

¹ The datasets and baselines' codes used in this work are publicly available and can be downloaded from the following repository: https://github.com/thuml/Time-Series-Library.

Table 2
Long-term forecasting performance on various datasets. The look-back window (𝐿) is set to 36 for ILI and 96 for all other datasets. The prediction window (𝑇 ) varies: PEMS (𝑇 ∈ {12, 24, 48, 96}), ILI (𝑇 ∈ {24, 36, 48, 60}), and all others
(𝑇 ∈ {96, 192, 336, 720}). Avg represents the average results across these four prediction windows. Bold red values indicate the best performance, while underlined blue values represent the second-best.
Models MixMamba iTransformer [2] RLinear [16] PatchTST [3] Crossformer [15] TiDE [19] TimesNet [6] DLinear [8] SCINet [42] FEDformer [4] Stationary [9] Autoformer [14]
Metric (Ours) (2023) (2023) (2023) (2023) (2023) (2023) (2023) (2022) (2022) (2022) (2021)
MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.318 0.350 0.334 0.368 0.355 0.376 0.329 0.367 0.404 0.426 0.364 0.387 0.338 0.375 0.345 0.372 0.418 0.438 0.379 0.419 0.386 0.398 0.505 0.475
192 0.363 0.372 0.377 0.391 0.391 0.392 0.367 0.385 0.450 0.451 0.398 0.404 0.374 0.387 0.380 0.389 0.439 0.450 0.426 0.441 0.459 0.444 0.553 0.496
ETTm1

336 0.391 0.393 0.426 0.420 0.424 0.415 0.399 0.410 0.532 0.515 0.428 0.425 0.410 0.411 0.413 0.413 0.490 0.485 0.445 0.459 0.495 0.464 0.621 0.537
720 0.450 0.427 0.491 0.459 0.487 0.450 0.454 0.439 0.666 0.589 0.487 0.461 0.478 0.450 0.474 0.453 0.595 0.550 0.543 0.490 0.585 0.516 0.671 0.561
Avg 0.381 0.386 0.407 0.410 0.414 0.407 0.387 0.400 0.513 0.496 0.419 0.419 0.400 0.406 0.403 0.407 0.485 0.481 0.448 0.452 0.481 0.456 0.588 0.517
96 0.176 0.254 0.180 0.264 0.182 0.265 0.175 0.259 0.287 0.366 0.207 0.305 0.187 0.267 0.193 0.292 0.286 0.377 0.203 0.287 0.192 0.274 0.255 0.339
192 0.241 0.297 0.250 0.309 0.246 0.304 0.241 0.302 0.414 0.492 0.290 0.364 0.249 0.309 0.284 0.362 0.399 0.445 0.269 0.328 0.280 0.339 0.281 0.340
ETTm2

336 0.301 0.337 0.311 0.348 0.307 0.342 0.305 0.343 0.597 0.542 0.377 0.422 0.321 0.351 0.369 0.427 0.637 0.591 0.325 0.366 0.334 0.361 0.339 0.372
720 0.400 0.394 0.412 0.407 0.407 0.398 0.402 0.400 1.730 1.042 0.558 0.524 0.408 0.403 0.554 0.522 0.960 0.735 0.421 0.415 0.417 0.413 0.433 0.432
Avg 0.280 0.321 0.288 0.332 0.286 0.327 0.281 0.326 0.757 0.610 0.358 0.404 0.291 0.333 0.350 0.401 0.571 0.537 0.305 0.349 0.306 0.347 0.327 0.371
96 0.374 0.389 0.386 0.405 0.386 0.395 0.414 0.419 0.423 0.448 0.479 0.464 0.384 0.402 0.386 0.400 0.654 0.599 0.376 0.419 0.513 0.491 0.449 0.459
192 0.420 0.417 0.441 0.436 0.437 0.424 0.460 0.445 0.471 0.474 0.525 0.492 0.436 0.429 0.437 0.432 0.719 0.631 0.420 0.448 0.534 0.504 0.500 0.482
ETTh1

336 0.463 0.439 0.487 0.458 0.479 0.446 0.501 0.466 0.570 0.546 0.565 0.515 0.491 0.469 0.481 0.459 0.778 0.659 0.459 0.465 0.588 0.535 0.521 0.496
720 0.455 0.454 0.503 0.491 0.481 0.470 0.500 0.488 0.653 0.621 0.594 0.558 0.521 0.500 0.519 0.516 0.836 0.699 0.506 0.507 0.643 0.616 0.514 0.512
Avg 0.429 0.428 0.454 0.447 0.446 0.434 0.469 0.454 0.529 0.522 0.541 0.507 0.458 0.450 0.456 0.452 0.747 0.647 0.440 0.460 0.570 0.537 0.496 0.487
96 0.283 0.329 0.297 0.349 0.288 0.338 0.302 0.348 0.745 0.584 0.400 0.440 0.340 0.374 0.333 0.387 0.707 0.621 0.358 0.397 0.476 0.458 0.346 0.388
192 0.363 0.380 0.380 0.400 0.374 0.390 0.388 0.400 0.877 0.656 0.528 0.509 0.402 0.414 0.477 0.476 0.860 0.689 0.429 0.439 0.512 0.493 0.456 0.452
ETTh2

336 0.406 0.416 0.428 0.432 0.415 0.426 0.426 0.433 1.043 0.731 0.643 0.571 0.452 0.452 0.594 0.541 1.000 0.744 0.496 0.487 0.552 0.551 0.482 0.486
720 0.415 0.433 0.427 0.445 0.420 0.440 0.431 0.446 1.104 0.763 0.874 0.679 0.462 0.468 0.831 0.657 1.249 0.838 0.463 0.474 0.562 0.560 0.515 0.511
Avg 0.367 0.390 0.383 0.407 0.374 0.398 0.387 0.407 0.942 0.684 0.611 0.550 0.414 0.427 0.559 0.515 0.954 0.723 0.437 0.449 0.526 0.516 0.450 0.459
12 0.072 0.175 0.071 0.174 0.126 0.236 0.099 0.216 0.090 0.203 0.178 0.305 0.085 0.192 0.122 0.243 0.066 0.172 0.126 0.251 0.081 0.188 0.272 0.385
PEMS03

24 0.091 0.199 0.093 0.201 0.246 0.334 0.142 0.259 0.121 0.240 0.257 0.371 0.118 0.223 0.201 0.317 0.085 0.198 0.149 0.275 0.105 0.214 0.334 0.440
48 0.121 0.233 0.125 0.236 0.551 0.529 0.211 0.319 0.202 0.317 0.379 0.463 0.155 0.260 0.333 0.425 0.127 0.238 0.227 0.348 0.154 0.257 1.032 0.782

96 0.162 0.270 0.164 0.275 1.057 0.787 0.269 0.370 0.262 0.367 0.490 0.539 0.228 0.317 0.457 0.515 0.178 0.287 0.348 0.434 0.247 0.336 1.031 0.796
Avg 0.112 0.222 0.113 0.221 0.495 0.472 0.180 0.291 0.169 0.281 0.326 0.419 0.147 0.248 0.278 0.375 0.114 0.224 0.213 0.327 0.147 0.249 0.667 0.601
12 0.090 0.186 0.079 0.182 0.133 0.247 0.168 0.232 0.165 0.214 0.227 0.343 0.112 0.212 0.154 0.276 0.087 0.184 0.173 0.273 0.109 0.207 0.436 0.485
PEMS08

24 0.125 0.226 0.115 0.219 0.249 0.343 0.224 0.281 0.215 0.260 0.318 0.409 0.141 0.238 0.248 0.353 0.122 0.221 0.210 0.301 0.140 0.236 0.467 0.502
48 0.184 0.237 0.186 0.235 0.569 0.544 0.321 0.354 0.315 0.355 0.497 0.510 0.198 0.283 0.440 0.470 0.189 0.270 0.320 0.394 0.211 0.294 0.966 0.733
96 0.216 0.262 0.221 0.267 1.166 0.814 0.408 0.417 0.377 0.397 0.721 0.592 0.320 0.351 0.674 0.565 0.236 0.300 0.442 0.465 0.345 0.367 1.385 0.915
Avg 0.154 0.228 0.150 0.226 0.529 0.487 0.280 0.321 0.268 0.307 0.441 0.464 0.193 0.271 0.379 0.416 0.158 0.244 0.286 0.358 0.201 0.276 0.814 0.659
96 0.084 0.200 0.086 0.206 0.093 0.217 0.088 0.205 0.256 0.367 0.094 0.218 0.107 0.234 0.088 0.218 0.267 0.396 0.148 0.278 0.111 0.237 0.197 0.323
Exchange

192 0.174 0.295 0.177 0.299 0.184 0.307 0.176 0.299 0.470 0.509 0.184 0.307 0.226 0.344 0.176 0.315 0.351 0.459 0.271 0.315 0.219 0.335 0.300 0.369
336 0.333 0.415 0.331 0.417 0.351 0.432 0.301 0.397 1.268 0.883 0.349 0.431 0.367 0.448 0.313 0.427 1.324 0.853 0.460 0.427 0.421 0.476 0.509 0.524
720 0.826 0.682 0.847 0.691 0.886 0.714 0.901 0.714 1.767 1.068 0.852 0.698 0.964 0.746 0.839 0.695 1.058 0.797 1.195 0.695 1.092 0.769 1.447 0.941
Avg 0.360 0.401 0.360 0.403 0.378 0.417 0.367 0.404 0.940 0.707 0.370 0.413 0.416 0.443 0.354 0.414 0.750 0.626 0.519 0.429 0.461 0.454 0.613 0.539
96 0.179 0.214 0.174 0.214 0.192 0.232 0.177 0.218 0.158 0.230 0.202 0.261 0.172 0.220 0.196 0.255 0.221 0.306 0.217 0.296 0.173 0.223 0.266 0.336
Weather

192 0.226 0.254 0.221 0.254 0.240 0.271 0.225 0.259 0.206 0.277 0.242 0.298 0.219 0.261 0.237 0.296 0.261 0.340 0.276 0.336 0.245 0.285 0.307 0.367
336 0.281 0.293 0.278 0.296 0.292 0.307 0.278 0.297 0.272 0.335 0.287 0.335 0.280 0.306 0.283 0.335 0.309 0.378 0.339 0.380 0.321 0.338 0.359 0.395
720 0.355 0.342 0.358 0.347 0.364 0.353 0.354 0.348 0.398 0.418 0.351 0.386 0.365 0.359 0.345 0.381 0.377 0.427 0.403 0.428 0.414 0.410 0.419 0.428
Avg 0.261 0.277 0.258 0.278 0.272 0.291 0.259 0.281 0.259 0.315 0.271 0.320 0.259 0.287 0.265 0.317 0.292 0.363 0.309 0.360 0.288 0.314 0.338 0.382
24 1.971 0.838 2.472 0.994 5.742 1.772 2.290 0.920 3.906 1.332 5.452 1.732 2.317 0.934 2.398 1.040 3.687 1.420 3.228 1.260 2.294 0.945 3.483 1.287
36 1.875 0.816 2.288 0.964 5.343 1.672 2.345 0.928 3.880 1.278 4.960 1.621 1.972 0.920 2.646 1.088 3.941 1.582 2.679 1.080 1.825 0.848 3.103 1.148
ILI

48 1.898 0.829 2.227 0.951 4.722 1.563 2.213 0.916 3.896 1.273 4.561 1.533 2.238 1.982 2.614 1.086 3.193 1.202 2.622 1.078 2.010 0.900 2.669 1.085
60 1.893 0.849 2.267 0.966 4.526 1.529 2.143 0.904 4.190 1.331 4.632 1.556 2.027 0.928 2.804 1.146 3.187 1.198 2.857 1.157 2.178 0.963 2.770 1.125



Avg 1.909 0.858 2.302 0.968 5.083 1.621 2.247 0.917 3.968 1.303 4.894 1.610 2.139 0.931 2.616 1.090 3.502 1.350 2.847 1.144 2.077 0.914 3.006 1.161
96 0.152 0.248 0.148 0.240 0.201 0.281 0.195 0.285 0.219 0.314 0.237 0.329 0.168 0.272 0.197 0.282 0.247 0.345 0.193 0.308 0.169 0.273 0.201 0.317
192 0.189 0.276 0.162 0.253 0.201 0.283 0.199 0.289 0.231 0.322 0.236 0.330 0.184 0.289 0.196 0.285 0.257 0.355 0.201 0.315 0.182 0.286 0.222 0.334
ECL

336 0.204 0.301 0.178 0.269 0.215 0.298 0.215 0.305 0.246 0.337 0.249 0.344 0.198 0.300 0.209 0.301 0.269 0.369 0.214 0.329 0.200 0.304 0.231 0.338
720 0.226 0.322 0.225 0.317 0.257 0.331 0.256 0.337 0.280 0.363 0.284 0.373 0.220 0.320 0.245 0.333 0.299 0.390 0.246 0.355 0.222 0.321 0.254 0.361
Avg 0.193 0.287 0.178 0.270 0.219 0.298 0.216 0.304 0.244 0.334 0.251 0.344 0.192 0.295 0.212 0.300 0.268 0.365 0.214 0.327 0.193 0.296 0.227 0.338
1st count 29 38 8 10 0 0 2 1 3 0 0 0 1 0 2 0 2 2 2 0 1 0 0 0

Table 3
Short-term forecasting performance on the M4 dataset. Values in red font denote the best performing model(s) for each metric, while underlined blue values represent the second-best
performing model(s).
Models MixMamba iTransformer [2] TimesNet [6] PatchTST [3] DLinear [8] FEDformer [4] LightTS [7] MICN [5] FiLM [43] Informer [23] Stationary [9] Autoformer [14]
SMAPE 13.363 14.262 13.62 13.562 14.319 13.93 13.569 14.53 14.377 16.127 13.717 13.974
Year MASE 2.993 3.257 3.112 3.029 3.078 3.141 3.066 3.38 3.031 3.544 3.078 3.134
OWA 0.785 0.846 0.808 0.796 0.825 0.821 0.801 0.87 0.821 0.939 0.807 0.822
SMAPE 10.187 13.64 10.244 10.884 10.514 10.836 10.431 11.45 10.788 13.703 10.958 11.338
Quarter MASE 1.181 1.719 1.202 1.308 1.241 1.285 1.212 1.391 1.308 1.742 1.325 1.365
OWA 0.893 1.246 0.903 0.971 0.93 0.961 0.915 1.027 0.967 1.257 0.981 1.012
SMAPE 12.89 16.8 12.88 13.146 13.464 14.091 13.043 13.858 13.384 15.867 13.917 13.958
Month MASE 0.941 1.399 0.953 1.011 1.028 1.076 0.999 1.085 1.027 1.263 1.097 1.103
OWA 0.897 1.24 0.894 0.931 0.95 0.994 0.922 0.99 0.947 1.144 0.998 1.002
SMAPE 4.758 5.823 5.009 4.963 5.114 5.025 5.791 6.207 5.82 7.213 6.302 5.485
Others MASE 3.189 4.044 3.346 3.334 3.641 3.279 3.802 4.311 3.881 5.198 4.064 3.865
OWA 1.004 1.25 1.055 1.048 1.112 1.046 1.209 1.333 1.224 1.579 1.304 1.187
SMAPE 12.04 14.909 12.024 12.29 12.535 12.82 12.175 13.052 12.611 14.975 12.780 12.909
Avg MASE 1.584 2.035 1.629 1.663 1.681 1.711 1.666 1.848 1.698 2.099 1.756 1.771
OWA 0.858 1.082 0.869 0.888 0.902 0.92 0.884 0.964 0.909 1.101 0.930 0.939

Note: The official code and settings from TimesNet [6] were used; however, the reproduced results exhibit variations from those reported in the paper.

Fig. 2. Visualization of long-term forecasting performance with 𝐿 = 96 and 𝑇 = 192 on ETTh1 dataset.

Fig. 3. Visualization of long-term forecasting performance with 𝐿 = 96 and 𝑇 = 192 on weather dataset.

Secondly, the incorporation of the Mamba architecture within MixMamba, with its efficient selection process and capability to scale linearly with sequence length, facilitates the model's effective capture of broader context and complex temporal dependencies present within time series data. This ensures that MixMamba maintains its efficiency and performance even when dealing with increasingly long sequences.

To further substantiate the superior performance of MixMamba, we present visualizations showcasing its long-term forecasting capabilities on the ETTh1 and Weather datasets (Figs. 2 and 3, respectively). We compare MixMamba's performance against established models: iTransformer [2], PatchTST [3], TiDE [19], DLinear [8], and Autoformer [14]. The results demonstrably indicate that MixMamba achieves a higher degree of accuracy in its predictions compared to the aforementioned models.

5.2. Short-term forecasting results

To further validate the effectiveness of MixMamba for short-term forecasting, we compare its performance to SOTA baselines on the M4 dataset. Unlike long-term forecasting datasets, which encompass high-frequency data like weekly, daily, hourly, and even sub-hourly intervals (e.g., 15, 10, and 5 min, as shown in Table 1), the M4 dataset has lower frequencies (yearly, quarterly, and monthly data). This allows for a comprehensive evaluation of MixMamba's performance across a variety of forecasting horizons with distinct frequency characteristics.


Fig. 4. Visualization of short-term forecasting performance on the M4 dataset (Yearly).

Table 4
Classification accuracy on diverse datasets. The asterisk (*) refers to Transformer family (e.g., Re* is Reformer).
Datasets/Models DTW XGBoost Rocket LSTM LSTNet LSSL TCN Trans* Re* In* Pyra* Auto* Station* FED* ETS* Flow* DLinear LightTS TimesNet MixMamba
FaceDetection 52.9 63.3 64.7 57.7 65.7 66.7 52.8 67.3 68.6 67.0 65.7 68.4 68.0 66.0 66.3 67.6 68.0 67.5 68.6 69.0
Heartbeat 71.7 73.2 75.6 72.2 77.1 72.7 75.6 76.1 77.1 80.5 75.6 74.6 73.7 73.7 71.2 77.6 75.1 75.1 78.0 77.0
JapaneseVowels 94.9 86.5 96.2 79.7 98.1 98.4 98.9 98.7 97.8 98.9 98.4 96.2 99.2 98.4 95.9 98.9 96.2 96.2 98.4 98.1
SelfRegulationSCP1 77.7 84.6 90.8 68.9 84.0 90.8 84.6 92.2 90.4 90.1 88.1 84.0 89.4 88.7 89.6 92.5 87.3 89.8 91.8 91.8
SelfRegulationSCP2 53.9 48.9 53.3 46.6 52.8 52.2 55.6 53.9 56.7 53.3 53.3 50.6 57.2 54.4 55.0 56.1 50.5 51.1 57.2 58.3
SpokenArabicDigits 96.3 69.6 71.2 31.9 100.0 100.0 95.6 98.4 97.0 100.0 99.6 100.0 100.0 100.0 100.0 98.8 81.4 100.0 99.0 98.5
UWaveGestureLibrary 90.3 75.9 94.4 41.2 87.8 85.9 88.4 85.6 85.6 85.6 83.4 85.9 87.5 85.3 85.0 86.6 82.1 80.3 85.3 85.9
Average accuracy 76.81 71.71 78.03 56.89 80.79 80.96 78.79 81.74 81.89 82.20 80.59 79.96 82.14 80.93 80.43 82.59 77.23 80.00 82.61 82.66

hourly (48). The detailed results pertaining to this experiment are MixMamba’s classification foundation relies on entropy values, em-
presented in Table 3.

Despite the inherent complexity of the M4 dataset, which encompasses diverse temporal variations from various sources, MixMamba outperforms all comparative benchmark models across a broad spectrum of frequencies. Notably, at low frequencies, MixMamba achieves significant improvements on all three evaluation metrics: Symmetric Mean Absolute Percentage Error (SMAPE), Mean Absolute Scaled Error (MASE), and Overall Weighted Average (OWA). The improvement is particularly pronounced when compared to PatchTST, the second-best performing model, and can be attributed to MixMamba's ability to effectively capture the distinct patterns inherent in low-frequency data. MixMamba's effectiveness extends beyond low frequencies and generalizes well to high-frequency data (Others) as well. The model's design, with its emphasis on handling a wide range of temporal dynamics, proves advantageous in this domain: the specialized experts within the MoE framework are adept at acquiring knowledge from the rapid variations present in high-frequency data. Compared to PatchTST and FEDformer, MixMamba achieves substantial improvements on all metrics, showcasing its ability to handle rapid fluctuations within the data. Furthermore, MixMamba surpasses recently introduced models, including iTransformer and DLinear, demonstrating its competitive edge in this evolving field. Most importantly, MixMamba surpasses TimesNet, the current SOTA model specifically designed for short-term forecasting, in terms of accuracy across most frequencies.

To further demonstrate MixMamba's superiority in short-term forecasting, we present a visualization for the M4 dataset in Fig. 4. Here, we compare MixMamba's performance against iTransformer [2], PatchTST [3], TimesNet [6], DLinear [8], and Autoformer [14]. This visualization illuminates MixMamba's ability to capture the nuances of short-term patterns within the M4 data.
5.3. Classification results

To assess the effectiveness of MixMamba in different time series tasks, we conduct experiments on seven established time series classification datasets. We compare our model against a variety of baseline models, including Transformer-based architectures, linear-based models, and TimesNet, the current SOTA in classification. MixMamba handles this task by employing cross-entropy loss during training. A key feature is its dynamic gating mechanism within the MoE component. This mechanism selects experts based on a weighted combination of factors, including past performance on similar data and the current data point's characteristics. Consequently, each expert specializes in handling specific temporal dynamics within the time series data. This diversity allows MixMamba to tackle a wide range of input data characteristics, ultimately improving the overall robustness and accuracy of classification. To illustrate this process, consider a simplified 3-class problem with TopKGating set to 3. Here, the gating network selects the three most suitable experts, and each expert outputs a probability distribution over all classes. The gating network then assigns weights to each expert based on their past performance and the current data point, and these weighted expert predictions are combined to produce the final classification. MixMamba employs the cross-entropy loss function to measure the difference between the predicted distribution and the true class labels.
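To make the combination step concrete, the following is a minimal sketch of such a gated mixture of expert classifiers. PyTorch is assumed here, and the class name, the purely linear gating layer (which in this sketch conditions only on the current representation, not on past performance), and the linear per-expert heads are illustrative placeholders rather than the exact MixMamba implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedExpertClassifier(nn.Module):
    # Illustrative mixture of expert classifiers: each expert maps a pooled
    # series representation to a class distribution; the gate selects the
    # top-k experts per sample and fuses their distributions.
    def __init__(self, d_model, num_classes, num_experts=8, k=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, num_classes) for _ in range(num_experts)])
        self.gate = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, h):                       # h: (batch, d_model)
        gate_logits = self.gate(h)              # (batch, num_experts)
        topv, topi = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(topv, dim=-1)       # renormalize over the selected experts
        probs = torch.stack([F.softmax(e(h), dim=-1) for e in self.experts], dim=1)
        chosen = torch.gather(                  # keep only the k selected experts
            probs, 1, topi.unsqueeze(-1).expand(-1, -1, probs.size(-1)))
        return (weights.unsqueeze(-1) * chosen).sum(dim=1)   # fused class distribution

# Training then minimizes cross-entropy between the fused distribution and the labels, e.g.
# loss = F.nll_loss(torch.log(model(h) + 1e-9), labels)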


The results, reported in Table 4, demonstrate our model's superior classification accuracy, achieving the highest average accuracy of 82.66%. The unique architecture of our model facilitates its proficiency in learning high-level representations and capturing the rich information within the data. This, in turn, enhances its classification performance relative to the baseline models. While Transformer-based models demonstrate competitive performance in forecasting tasks, their ability to capture comprehensive representations for classification tasks seems limited. Conversely, TimesNet, while exhibiting promise in classification tasks, struggles with long-term forecasting.

Table 5 presents additional classification results obtained using two new datasets from the UEA archive: PEMS-SF and EigenWorms. F1 Score and AUC-ROC serve as the evaluation metrics to assess the performance of the various classification models. Notably, MixMamba demonstrates superior performance compared to the Transformer-based and linear-based models, achieving the highest F1 Score and AUC-ROC on both datasets. These findings underscore the robustness and superior classification capabilities of MixMamba.

Table 5
F1 Score and AUC-ROC classification results on the PEMS-SF and EigenWorms datasets. Bold indicates the best performing model.

Datasets     Metrics    CrossFormer   DLinear   PatchTST   TimesNet   MixMamba
PEMS-SF      F1         0.609         0.681     0.713      0.769      0.792
             AUC-ROC    0.825         0.842     0.857      0.863      0.885
EigenWorms   F1         0.533         0.584     0.637      0.682      0.713
             AUC-ROC    0.768         0.772     0.791      0.814      0.833

5.4. Impact of look-back window

Fig. 5. Performance under varied look-back window length L ∈ {96, 192, 336, 720} on the PEMS03 dataset (T = 720).
In order to evaluate the performance of MixMamba under more demanding conditions, we conduct an experiment utilizing extended look-back window lengths on the PEMS03 dataset. We fix the prediction window length (T) at 720 timesteps, while the look-back sequence length (L) is varied across four values: 96, 192, 336, and 720. This experiment aims to assess the model's capacity to capture complex dependencies within longer sequences. We compare the performance of MixMamba against three established baselines, and the results are visualized in Fig. 5. Notably, MixMamba exhibits a consistent reduction in MSE with increasing look-back window length. This superior performance can be primarily attributed to two key aspects of MixMamba's architecture. Firstly, the utilization of specialized experts allows the model to effectively capture a wider range of patterns within the data. Secondly, the strength of the Mamba module enables the model to learn informative representations from longer sequences. It is noteworthy that, while the MSE of the other models increases at the longest look-back length (L = 720), MixMamba maintains its capability to capture dependencies and reduce prediction error.
5.5. Analysis of representation

This study examines the representational quality learned by our proposed model, with a particular focus on the core MoM block. Two analyses are conducted on separate datasets to evaluate the effectiveness of MixMamba in capturing the underlying structure of time series data.

Fig. 6. Comparison of learned representations on the ETTm1 dataset with L = 96, T = 720.

The first analysis employs CKA [44] as a robust metric to assess the similarity between the input and output representations generated by the MoM block. We use nonlinear CKA kernels to capture more complex relationships than the linear kernel allows. The results, presented in Fig. 6, demonstrate that MixMamba achieves the highest CKA similarity together with the lowest MSE compared to other models. A high CKA score for our model indicates several key advantages: (1) MixMamba effectively preserves the essential structure of the input data throughout the transformation process, ensuring that the model maintains the key characteristics of the original information, which is crucial for understanding and leveraging the underlying patterns in the data. (2) The high CKA similarity highlights the proficiency of MixMamba in capturing the relational structure inherent in the time series data, such as long-term trends and seasonal patterns.
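For reference, the nonlinear CKA score used here can be computed as in [44]; the sketch below is a minimal NumPy version with an RBF kernel, where the median-distance bandwidth heuristic and the function names are our own choices rather than details taken from the paper.

import numpy as np

def _center(K):
    # Double-center a Gram matrix.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def _rbf(X, sigma=None):
    # RBF Gram matrix; bandwidth defaults to the median pairwise distance.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]))
    return np.exp(-d2 / (2.0 * sigma**2))

def cka_rbf(X, Y):
    # CKA = HSIC(K, L) / sqrt(HSIC(K, K) * HSIC(L, L)) on centered kernels.
    K, L = _center(_rbf(X)), _center(_rbf(Y))
    return (K * L).sum() / np.sqrt((K * K).sum() * (L * L).sum())

# Example: cka_rbf(inputs.reshape(n, -1), outputs.reshape(n, -1)) returns a score in [0, 1];
# higher values mean the MoM output preserves more of the input's relational structure.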

The second analysis employs UMAP [45] to visualize the outputs of the eight experts within the MoM block in Fig. 6. UMAP operates on the principle of manifold learning, aiming to represent the high-dimensional data structure in a lower-dimensional space while preserving both local and global relationships. The results in Fig. 6 provide tangible evidence of the capability of MixMamba to learn diverse patterns within the time series data. The figure shows distinct clusters for each expert, indicating that different experts specialize in capturing different patterns within the data. The separation between clusters suggests that the features or patterns each expert responds to are distinct, which is indicative of a well-functioning MixMamba model.
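As a pointer to how such a plot can be produced, the following sketch uses the umap-learn and matplotlib packages; the array layout, the function name, and the choice of two components are assumptions made for illustration.

import numpy as np
import umap                      # umap-learn [45]
import matplotlib.pyplot as plt

def plot_expert_umap(expert_outputs, seed=0):
    # expert_outputs: (num_experts, num_samples, d_model) outputs of the MoM experts
    n_exp, n_samp, d = expert_outputs.shape
    X = expert_outputs.reshape(n_exp * n_samp, d)
    labels = np.repeat(np.arange(n_exp), n_samp)        # which expert produced each row
    emb = umap.UMAP(n_components=2, random_state=seed).fit_transform(X)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=4)
    plt.colorbar(label="expert index")
    plt.show()

# Well-separated colour clusters in the embedding indicate that the experts have
# specialized in distinct temporal patterns.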

5.6. Computational efficiency

Fig. 7. Comparison of memory usage (top) and computation time (bottom) on the ETTm2 dataset (batch size is set to 32).

In the realm of time-series applications, the importance of computational and memory efficiency cannot be overstated [1]. As the forecast horizon extends, the associated computational and memory costs escalate. Fig. 7 depicts the computational time and memory usage at various forecast horizons in a single batch for six models compared to our proposed model. DLinear exhibits the lowest time and memory consumption due to its limited number of parameters and simple architecture. Conversely, TimesNet exhibits the highest computational cost and memory usage, as it extracts features from multiple time periods and iterates computations for each period. Transformer-based models possess quadratic complexity, explaining their observed rise in computation and memory. Our model emerges as the second most efficient in terms of both computation and memory usage, surpassing all Transformer-based models. Notably, the computational cost and memory requirements of our method do not increase significantly with extended forecast lengths. Consequently, our approach achieves performance superior to DLinear while maintaining lower computational and memory demands than the Transformer-based models.

5.7. Ablation study

To investigate the influence of each component within the MixMamba architecture, we construct five distinct variants, each targeting a specific element. w/o MoE: This variant investigates the impact of the MoE paradigm by entirely removing it from the architecture; instead, it employs a single Mamba module to directly learn both data representations and model dependencies. w/o Mamba: This variant replaces the Mamba module with a simple MLP module, acting as an expert within the framework. w/o Patching: This variant bypasses the patching step, feeding the complete sequence directly into the MoM block without any prior segmentation. w/o aux: This variant explores the significance of the auxiliary loss associated with the gating network, eliminating this term from the overall loss function employed during training. The final variant, MixTrans, investigates the role of the Mamba module by replacing it with a standard Transformer block. The performance of each variant is evaluated on two benchmark datasets: ECL and Weather. The results of this ablation study are presented in Fig. 8.

Fig. 8. Ablation analysis of MixMamba and its variants on the ECL and Weather datasets (L = 96, T = 720).

The ablation study depicted in Fig. 8 offers valuable insights into the contributions of each component within the MixMamba framework. We observe the following key points: (1) Removing the MoE component leads to the most significant performance decline, with MSE increasing by 85.49% and 26.87% on the ECL and Weather datasets, respectively. This finding underscores the critical role of MoE in enabling the model to capture diverse characteristics within the time series data through specialized experts, which facilitates superior representation learning and ultimately leads to enhanced performance. Notably, the impact of removing MoE is more pronounced on the ECL dataset, likely due to its larger number of features (variables) compared to the Weather dataset. (2) Replacing the Mamba module with alternative expert models also results in a performance decline. When a Transformer block is employed as a substitute on the ECL and Weather datasets, MSE increases by 18.14% and 8.17%, respectively. This highlights the advantages of the Mamba module in capturing intricate relationships within lengthy sequences, which it achieves through its selection mechanism and causal convolution; in contrast, the full-attention mechanism inherent to Transformers can limit their ability to learn efficiently from distant past observations in long sequences. Furthermore, substituting the Mamba module with a simple MLP module leads to an even greater increase in MSE (38.05% and 22.39% on ECL and Weather, respectively), likely due to the limited capacity of the MLP module's shallow architecture to capture the complex temporal dependencies present in long sequences. (3) Omitting the patching step also influences model performance, with MSE increasing by 4.87% and 5.15%. The benefit of patching lies in its computational efficiency and improved memory usage, as it reduces the sequence length to approximately L/P (where L is the original sequence length and P is the patch size). (4) Eliminating the auxiliary loss term associated with the gating network leads to a decrease in performance, with MSE increasing by 6.37% and 3.66%. This auxiliary loss term plays a crucial role in load balancing within the MoE framework; without it, the gating network may assign excessively large weights to a select few experts, hindering the model's ability to leverage the full capacity of the expert pool.
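To make points (3) and (4) more concrete, the sketch below illustrates both mechanisms under stated assumptions: a non-overlapping patching step that shortens a length-L sequence to roughly L/P tokens, and a load-balancing auxiliary loss in the spirit of sparsely-gated MoE [38] and GShard [37]. The exact form of MixMamba's auxiliary term is defined by its own loss equation; this generic PyTorch version only approximates that idea.

import torch

def patchify(x, patch_len):
    # x: (batch, seq_len, n_vars) -> (batch, seq_len // patch_len, patch_len * n_vars)
    b, L, c = x.shape
    n = L // patch_len                        # the token count drops to about L / P
    return x[:, :n * patch_len, :].reshape(b, n, patch_len * c)

def load_balance_loss(gate_probs, topk_mask):
    # gate_probs: (tokens, num_experts) softmax outputs of the gating network
    # topk_mask:  (tokens, num_experts) 1.0 where an expert is among the top-k, else 0.0
    # Penalizes routing that concentrates tokens on a few experts by pushing both
    # the per-expert load and the mean gate probability toward uniform.
    num_experts = gate_probs.size(-1)
    frac_tokens = topk_mask.float().mean(dim=0)   # fraction of tokens routed to each expert
    frac_probs = gate_probs.mean(dim=0)           # average gate probability per expert
    return num_experts * torch.sum(frac_tokens * frac_probs)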

5.8. Hyperparameters study

Fig. 9. MixMamba hyperparameter analysis on the Exchange and ILI datasets (L = 96, T = 720).

We investigate the impact of two key parameters on the performance of MixMamba: the number of experts (η) in Eq. (9) and the parameter k in TopKGating in Eq. (13). The results are presented in Fig. 9. First, we vary the number of experts η ∈ {4, 8, 16, 24}. As illustrated in Fig. 9, an architecture with 16 experts achieves the optimal configuration on both datasets. This configuration allows MixMamba to effectively learn representations and capture the temporal dependencies within the time series data, whereas using fewer or more experts reduces performance: with fewer experts, the model's capacity might be insufficient to capture the diversity of patterns and relationships in the data, while an increased number of experts can lead to overfitting the training data. We then examine the influence of the parameter k ∈ {2, 3, 4, 5}. The results in Fig. 9 indicate that the model achieves the best performance with k = 2, which provides a good balance between specialization and capacity. Performance declines at higher values of k, as employing a larger number of active experts per data point increases the model's complexity, making it challenging for the gating network to learn which experts to select and how to effectively balance their contributions.
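For completeness, a generic form of the TopKGating referenced here is sketched below; the learned gating of Eq. (13) may include noise terms and other details not shown, so this is only an illustration of how η and k interact. It keeps the k largest gate logits out of η and renormalizes them, so that exactly k experts are active per patch.

import torch
import torch.nn.functional as F

def topk_gating(gate_logits, k):
    # gate_logits: (tokens, num_experts); returns weights of the same shape with
    # exactly k non-zero, renormalized entries per token.
    topv, topi = gate_logits.topk(k, dim=-1)
    weights = torch.zeros_like(gate_logits)
    weights.scatter_(-1, topi, F.softmax(topv, dim=-1))
    return weights

# With η = 16 experts and k = 2 (the best setting above), each patch is routed to its
# two highest-scoring experts and their outputs are mixed using these weights.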
6. Conclusion

In essence, this work addresses limitations in time series modeling by introducing the MixMamba framework. MixMamba effectively captures complex long-term temporal dynamics within non-stationary time series data. This is achieved by leveraging the strengths of the mixture-of-experts paradigm and the Mamba model, enabling superior representation learning compared to single models. Furthermore, a dynamic gating network enhances MixMamba's adaptability by efficiently routing data segments to the most appropriate experts based on their specific characteristics. This approach yields several key advantages.


First, MixMamba effectively models long-term dependencies by capitalizing on the linear-time efficiency of the Mamba model, facilitating scalable learning for long sequences. Second, the dynamic gating network empowers MixMamba to flexibly adapt to heterogeneous data, efficiently handling diverse patterns and temporal changes. Finally, extensive evaluations demonstrate that MixMamba outperforms existing state-of-the-art methods across various time series modeling tasks, including long-term and short-term forecasting, classification, and imputation. Looking towards the future, MixMamba presents exciting avenues for further exploration. One promising direction involves incorporating domain knowledge to potentially enhance performance for specific applications; this could involve pre-training experts on domain-specific datasets or designing expert architectures tailored to known domain characteristics. Additionally, integrating MixMamba with probabilistic forecasting techniques holds promise for generating prediction intervals. This capability would provide valuable insights into the uncertainty associated with forecasts, which is critical for decision-making systems in various domains.

CRediT authorship contribution statement

Khaled Alkilane: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation, Data curation, Conceptualization. Yihang He: Visualization, Validation, Software, Data curation. Der-Horng Lee: Writing – review & editing, Supervision, Methodology, Funding acquisition, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgment

This research is financially supported by the Smart Urban Future (SURF) Laboratory, Zhejiang Province.

References

[1] M. Jin, H.Y. Koh, Q. Wen, D. Zambon, C. Alippi, G.I. Webb, I. King, S. Pan, A survey on graph neural networks for time series: Forecasting, classification, imputation, and anomaly detection, 2023, arXiv:2307.03759.
[2] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, M. Long, iTransformer: Inverted transformers are effective for time series forecasting, 2024, arXiv:2310.06625.
[3] Y. Nie, N.H. Nguyen, P. Sinthong, J. Kalagnanam, A time series is worth 64 words: Long-term forecasting with transformers, 2022, arXiv:2211.14730.
[4] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, R. Jin, FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting, in: International Conference on Machine Learning, PMLR, 2022, pp. 27268–27286.
[5] H. Wang, J. Peng, F. Huang, J. Wang, J. Chen, Y. Xiao, MICN: Multi-scale local and global context modeling for long-term series forecasting, in: The Eleventh International Conference on Learning Representations, 2023, URL https://openreview.net/forum?id=zt53IDUR1U.
[6] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, M. Long, TimesNet: Temporal 2D-variation modeling for general time series analysis, 2023, arXiv:2210.02186.
[7] T. Zhang, Y. Zhang, W. Cao, J. Bian, X. Yi, S. Zheng, J. Li, Less is more: Fast multivariate time series forecasting with light sampling-oriented MLP structures, 2022, arXiv:2207.01186.
[8] A. Zeng, M. Chen, L. Zhang, Q. Xu, Are transformers effective for time series forecasting? in: Proceedings of the AAAI Conference on Artificial Intelligence, 2023, pp. 11121–11128.
[9] Y. Liu, H. Wu, J. Wang, M. Long, Non-stationary transformers: Exploring the stationarity in time series forecasting, in: NeurIPS, 2022.
[10] W. Xi, A. Jain, L. Zhang, J. Lin, LB-SimTSC: An efficient similarity-aware graph neural network for semi-supervised time series classification, 2023, arXiv:2301.04838.
[11] G.E. Box, G.M. Jenkins, G.C. Reinsel, G.M. Ljung, Time Series Analysis: Forecasting and Control, John Wiley & Sons, 2015.
[12] E.S. Gardner Jr., Exponential smoothing: The state of the art, J. Forecast. 4 (1985) 1–28.
[13] V. Cerqueira, L. Torgo, I. Mozetič, Evaluating time series forecasting models: An empirical study on performance estimation methods, Mach. Learn. 109 (2020) 1997–2028.
[14] H. Wu, J. Xu, J. Wang, M. Long, Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting, Adv. Neural Inf. Process. Syst. 34 (2021) 22419–22430.
[15] Y. Zhang, J. Yan, Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting, in: The Eleventh International Conference on Learning Representations, 2023.
[16] Z. Li, S. Qi, Y. Li, Z. Xu, Revisiting long-term time series forecasting: An investigation on linear mapping, 2023, arXiv:2305.10721.
[17] S.S. Rangapuram, M.W. Seeger, J. Gasthaus, L. Stella, Y. Wang, T. Januschowski, Deep state space models for time series forecasting, Adv. Neural Inf. Process. Syst. 31 (2018).
[18] A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, 2023, arXiv:2312.00752.
[19] A. Das, W. Kong, A. Leach, S.K. Mathur, R. Sen, R. Yu, Long-term forecasting with TiDE: Time-series dense encoder, Trans. Mach. Learn. Res. (2023), URL https://openreview.net/forum?id=pCbC3aQB5W.
[20] J.M. Valente, S. Maldonado, SVR-FFS: A novel forward feature selection approach for high-frequency time series forecasting using support vector regression, Expert Syst. Appl. 160 (2020) 113729.
[21] E. de Bézenac, S.S. Rangapuram, K. Benidis, M. Bohlke-Schneider, R. Kurle, L. Stella, H. Hasson, P. Gallinari, T. Januschowski, Normalizing Kalman filters for multivariate time series analysis, Adv. Neural Inf. Process. Syst. 33 (2020) 2995–3007.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017, URL https://arxiv.org/pdf/1706.03762.pdf.
[23] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 11106–11115.
[24] Y. Xu, L. Han, T. Zhu, L. Sun, B. Du, W. Lv, Generic dynamic graph convolutional network for traffic flow forecasting, Inf. Fusion 100 (2023) 101946, http://dx.doi.org/10.1016/j.inffus.2023.101946.
[25] J. Zhan, X. Huang, Y. Xian, W. Ding, A fuzzy C-means clustering-based hybrid multivariate time series prediction framework with feature selection, IEEE Trans. Fuzzy Syst. (2024) 1–15.
[26] C. Zhu, X. Ma, W. Ding, J. Zhan, Long-term time series forecasting with multilinear trend fuzzy information granules for LSTM in a periodic framework, IEEE Trans. Fuzzy Syst. 32 (2024) 322–336.
[27] J.M. Sanchez-Bornot, R.C. Sotero, Machine learning for time series forecasting using state space models, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2023, pp. 470–482.
[28] L. Zhou, M. Poli, W. Xu, S. Massaroli, S. Ermon, Deep latent state space models for time-series generation, in: International Conference on Machine Learning, PMLR, 2023, pp. 42625–42643.
[29] Y. Lin, I. Koprinska, M. Rana, SSDNet: State space decomposition neural network for time series forecasting, in: 2021 IEEE International Conference on Data Mining, ICDM, IEEE, 2021, pp. 370–378.
[30] A.F. Ansari, A. Heng, A. Lim, H. Soh, Neural continuous-discrete state space models for irregularly-sampled time series, in: International Conference on Machine Learning, PMLR, 2023, pp. 926–951.
[31] A. Gu, K. Goel, C. Ré, Efficiently modeling long sequences with structured state spaces, 2022, arXiv:2111.00396.
[32] A. Gu, T. Dao, S. Ermon, A. Rudra, C. Ré, HiPPO: Recurrent memory with optimal polynomial projections, 2020, arXiv:2008.07669.
[33] A. Gupta, A. Gu, J. Berant, Diagonal state spaces are as effective as structured state spaces, 2022, arXiv:2203.14343.
[34] T. Kim, J. Kim, Y. Tae, C. Park, J.H. Choi, J. Choo, Reversible instance normalization for accurate time-series forecasting against distribution shift, in: International Conference on Learning Representations, 2022, URL https://openreview.net/forum?id=cGDAkQo1C0p.
[35] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, 2021, arXiv:2010.11929.
[36] B. Zhang, R. Sennrich, Root mean square layer normalization, 2019, arXiv:1910.07467.
[37] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, Z. Chen, GShard: Scaling giant models with conditional computation and automatic sharding, 2020, arXiv:2006.16668.


[38] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017, URL https://openreview.net/pdf?id=B1ckMDqlg.
[39] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, W. Fedus, ST-MoE: Designing stable and transferable sparse expert models, 2022, arXiv:2202.08906.
[40] S. Makridakis, M4 dataset, 2018, URL https://github.com/M4Competition/M4-methods/tree/master/Dataset.
[41] A. Bagnall, H.A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom, P. Southam, E. Keogh, The UEA multivariate time series classification archive, 2018, 2018, arXiv:1811.00075.
[42] M. Liu, A. Zeng, M. Chen, Z. Xu, Q. Lai, L. Ma, Q. Xu, SCINet: Time series modeling and forecasting with sample convolution and interaction, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., 2022, pp. 5816–5828.
[43] T. Zhou, Z. Ma, X. Wang, Q. Wen, L. Sun, T. Yao, W. Yin, R. Jin, FiLM: Frequency improved Legendre memory model for long-term time series forecasting, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., 2022, pp. 12677–12690.
[44] S. Kornblith, M. Norouzi, H. Lee, G. Hinton, Similarity of neural network representations revisited, 2019, arXiv:1905.00414.
[45] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, 2020, arXiv:1802.03426.
