MixMamba: Time Series Modeling With Adaptive Expertise
Keywords: Time series modeling; Mixture-of-experts; Multivariate time series forecasting

Abstract

From finance and healthcare to transportation and beyond, effective time series modeling underpins a wide range of applications. While transformers have achieved success, their reliance on global context limits scalability for lengthy sequences due to the quadratic increase in computational cost with sequence length. Recent research suggests linear models can achieve comparable performance with lower complexity. However, the heterogeneity and non-stationary characteristics of time series data continue to challenge single models' ability to capture complex temporal dynamics, especially in long-term forecasting. This paper proposes MixMamba, a novel framework for time series modeling applicable across diverse domains. The framework leverages the content-based reasoning strengths of the Mamba model by integrating it as an expert within a mixture-of-experts (MoE) framework. This framework decomposes modeling into a pool of specialized experts, enabling the model to learn robust representations and capture the full spectrum of patterns present in time series data. Furthermore, a dynamic gating network is introduced within the framework. This network adaptively allocates each data segment to the most suitable expert based on its characteristics. This is crucial in non-stationary time series, as it allows the model to adjust dynamically to temporal changes in the underlying data distribution. To prevent bias towards a limited subset of experts, a load balancing loss function is incorporated. Extensive experiments on benchmark datasets demonstrate the effectiveness and robustness of our proposed method in various time series modeling tasks, including long-term and short-term forecasting, as well as classification.
1. Introduction

Time series data pose significant challenges for modeling due to their non-stationarity, inherent heterogeneity, and long-range dependencies with complex temporal patterns. Existing methods have several limitations. Transformer-based models suffer from a key bottleneck: their inherent quadratic time complexity with respect to sequence length. This arises from the self-attention mechanism, where each element in the sequence attends to all others. This complexity might limit the model's ability to learn effectively from distant past observations. Linear-based models, which rely on stacked linear transformations, have a shallow architecture. This impedes their ability to capture complex, long-range dependencies in time series data. Another limitation of existing studies is the reliance on a single model to learn all the complex patterns and dynamic dependencies. Heterogeneous data can hinder a single model, potentially leading to overfitting or underfitting on different segments and to suboptimal capture of the underlying dynamics.

To address the limitations of existing time series modeling methods, this paper proposes MixMamba, a new approach for effective time series modeling. It incorporates the SSM model, Mamba, as a specialist within the well-established mixture-of-experts (MoE) framework. MixMamba decomposes the modeling task into a collection of specialized Mamba experts, each skilled in capturing a specific aspect of the underlying data dynamics. This division of labor empowers the model to achieve superior representation learning compared to a single model, effectively capturing a wide range of complexities within the data. Furthermore, a gating network is designed to function as a dynamic router. This network dynamically routes data segments to domain-specific experts. This adaptability is crucial for non-stationary time series, as it empowers the model to adjust flexibly to temporal changes in the underlying data distribution. For instance, one expert might specialize in capturing long-term trends, while another focuses on short-term fluctuations. By enabling parallel training of simpler experts, the MixMamba framework facilitates scalability for large datasets compared to the computational challenges of training a single, complex model. Additionally, it embodies the principles of ensemble learning, where multiple models collectively achieve superior performance. The main contributions of this work are threefold:

• This paper proposes MixMamba, a novel approach for time series modeling that leverages collaborative learning and dynamic expert allocation. This combination enables efficient capture of complex temporal dependencies and diverse patterns within the data, resulting in improved performance on time series tasks.
• By exploiting Mamba's linear-time efficiency, our approach overcomes the trade-off between efficient sequence modeling and capturing long-term trends. This allows for scalable learning of long-term dependencies without compromising computational resources.
• A dynamic gating network, acting as an intelligent router, allocates each data segment to the most suitable expert module based on its specific characteristics. This ensures flexible handling of heterogeneity and temporal changes in the data.

To comprehensively evaluate MixMamba's effectiveness, we conducted extensive experiments on widely-used benchmark datasets. The results demonstrate that MixMamba outperforms previous state-of-the-art methods in terms of prediction accuracy for both long-term and short-term forecasting tasks. Additionally, MixMamba achieves improvements in classification accuracy when compared to existing approaches. These empirical findings strongly reinforce the effectiveness of MixMamba's proposed innovations in handling complex time series data.

2. Related works

Recent years have witnessed the rise of deep learning as the dominant approach for time series modeling. A wide array of neural network architectures have been specifically designed to tackle a multitude of time series tasks, including short- and long-term forecasting [3,9,14,15,19] and classification [6,10]. A thorough review of recent advancements in time series representation learning can be found in [1]. This section presents a systematic review of the existing literature, categorized into four schemes.

Traditional methods. This category incorporates statistical [11,12] and machine learning techniques [13,20,21] designed to capture the temporal dependencies within time series data. While these methods demonstrate computational efficiency, making them suitable for large datasets, their reliance on pre-defined statistical assumptions can hinder their ability to effectively model complex non-linear relationships and highly dynamic data.

Transformer-based methods. Motivated by the remarkable success of the Transformer architecture [22] in sequence-to-sequence tasks, researchers have increasingly focused on adapting it to the domain of time series modeling [4,14,23]. Several methods, such as [2,3], have emerged as attractive options due to their inherent advantages in handling sequential data. Unlike traditional methods, Transformers excel at capturing long-range dependencies in sequential data and utilizing these dependencies within sequences. This is particularly significant for time series forecasting, where accurate predictions often hinge on capturing subtle interactions that unfold over extended periods. However, these advantages come at a cost. A notable limitation lies in their high computational complexity, which exhibits quadratic scaling in relation to the input sequence length [18]. This translates to increased memory consumption and prolonged training times, particularly when dealing with extensive datasets. Furthermore, these models may require larger amounts of data to achieve optimal performance, as they typically have a higher number of trainable parameters compared to other architectures. Their complexity may also increase their propensity to overfitting, especially in scenarios with limited available data [1]. Additionally, recent work has explored the application of Graph Convolutional Networks (GCNs) in this domain. Xu et al. propose GDGCN [24], which utilizes parameter sharing, temporal graph convolution, and dynamic graph construction for improved traffic flow forecasting. Zhan et al. [25] propose a hybrid framework that combines Fuzzy C-means (FCM) clustering and feature selection to overcome limitations of Extreme Learning Machines (ELMs) in MTS prediction. Their approach leverages information fusion and a multi-metric strategy to optimize FCM and feature selection, ultimately achieving improved prediction accuracy. Additionally, Zhu et al. [26] recognize limitations in information granulation for long-term MTS forecasting and propose an LSTM-based approach that incorporates multilinear trend fuzzy information granules within a periodic framework.

Linear-based methods. Despite advancements in Transformers, several studies [7,8,16] highlight the continued effectiveness of linear models in time series modeling. These methods, known as univariate linear models, treat multivariate data as collections of univariate sequences and have demonstrated promising results [8]. This approach is further exemplified by the work of [3], where a univariate patch Transformer is proposed. The effectiveness of these methods often relies on the assumption of periodicity or smoothness in the time series data [19]. This assumption has been further generalized to the notion that time series can be decomposed into a periodic component and a component with a smooth trend. Several studies [8,16,19] further emphasize that a key advantage of linear models lies in their static mapping weights for each data point within the sequence. This contrasts with recurrent and attention-based architectures, where mapping weights are outputs of gates and attention layers that are data dependent. However, the inherent simplicity of linear models may not be sufficient to accurately represent complex patterns and dependencies within non-stationary data. Furthermore, their reliance on a key assumption – that time series data exhibit periodic patterns or smooth trends – may not be universally applicable, as time series data can display irregular fluctuations.
Deep state space models. Deep SSMs have emerged as a promising alternative to traditional sequence models like RNNs, CNNs, and Transformers [27]. They demonstrate remarkable computational efficiency and robust modeling capabilities, particularly for long sequences [17]. Traditional SSMs offer a principled framework for modeling and learning time series patterns, including trend and seasonality, leading to interpretability and data-efficient learning [28,29]. However, they lack the ability to infer shared patterns across datasets of similar time series because each series is fitted individually. In contrast, deep neural networks (DNNs) offer significant advantages in identifying complex patterns within and across time series, but their interpretability is limited and enforcing assumptions within the model can be challenging. To address these limitations, researchers have explored combining SSMs with DNNs for time series modeling [29,30]. For instance, the work in [17] proposes a method where RNN parameters are learned simultaneously from raw time series data and associated features. This empowers the model to extract relevant characteristics and learn temporal dependencies from the data. Furthermore, a rich body of well-established structured SSMs has been developed in [31–33]. HiPPO [32] is designed for analyzing sequential data at various timescales. It incorporates memory dynamics into RNNs, enabling it to capture complex temporal dependencies. Mamba [18], a recently introduced selective SSM, enhances SSMs with content-based reasoning for discrete modalities. This allows the model to selectively retain or discard information throughout the sequence based on the current element. Additionally, Mamba leverages a hardware-aware parallel algorithm for recurrent operations, facilitating faster inference and linear scaling with sequence length, demonstrating improved performance.

3. Preliminaries

Multivariate time series (MTS) data, characterized by multiple interdependent variables observed over time, present a unique challenge in forecasting due to the inherent complexity arising from both temporal dynamics and interrelationships between variables. In MTS forecasting, we are given historical observations, typically represented as a matrix $X = \{X_1, \dots, X_L\} \in \mathbb{R}^{L \times N}$, where $L$ denotes the number of time steps and $N$ represents the number of variables. The objective is to forecast a subsequent sequence of future values, $\hat{X} = \{X_{L+1}, \dots, X_{L+T}\} \in \mathbb{R}^{T \times N}$, for $T$ future time steps. Each time step $t$ is associated with a multidimensional vector $X_t$, reflecting the inherent complexity of the data.

The key challenges in MTS forecasting include: capturing intricate temporal dynamics; modeling complex interdependencies between different variables; and handling the high dimensionality that can arise with a large number of variables.

State Space Models (SSMs) are a powerful class of mathematical models employed to represent systems characterized by hidden internal states. These models map one-dimensional input sequences, $u(t) \in \mathbb{R}$, to output sequences, $y(t) \in \mathbb{R}$, utilizing a hidden state variable, $h(t) \in \mathbb{R}^V$. They are defined by a system of linear ordinary differential equations:

$$\dot{h}(t) = \mathbf{A} h(t) + \mathbf{B} u(t), \qquad y(t) = \mathbf{C} h(t) + \mathbf{D} u(t) \tag{1}$$

where $\dot{h}(t)$ denotes the time derivative of the state vector $h(t)$, $\mathbf{A} \in \mathbb{R}^{V \times V}$ represents the state evolution matrix, $\mathbf{B} \in \mathbb{R}^{V \times 1}$ and $\mathbf{C} \in \mathbb{R}^{1 \times V}$ are the input and output projection matrices, respectively, and $\mathbf{D}$ is used for the skip connection.

While well-suited for continuous data, traditional SSMs are not directly applicable to discrete data like MTS data. This necessitates the adaptation of SSMs to the discrete domain. To address this, a timescale, $\Delta$, is introduced to transform the continuous parameters, $\mathbf{A}$ and $\mathbf{B}$, into their discrete counterparts, $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$. Methods like the zero-order hold (ZOH) can be employed for this transformation, defined as:

$$\bar{\mathbf{A}} = \exp(\Delta \mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta \mathbf{A})^{-1}\left(\exp(\Delta \mathbf{A}) - I\right) \cdot \Delta \mathbf{B} \tag{2}$$

The resulting discretized model takes the form:

$$h_t = \bar{\mathbf{A}} h_{t-1} + \bar{\mathbf{B}} u_t, \qquad y_t = \mathbf{C} h_t \tag{3}$$

This discretized form enables direct application to MTS data. The key advantage lies in the recursive relationship with respect to the hidden state, $h(t)$. This allows the model to capture temporal dynamics more effectively in various contexts. The model can also compute its output through a global convolution, defined as:

$$\bar{\mathbf{K}} = \left(\mathbf{C}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\ \dots,\ \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}\right), \qquad y = u * \bar{\mathbf{K}} \tag{4}$$

Here, $L$ is the input sequence length and $\bar{\mathbf{K}} \in \mathbb{R}^{L}$ represents a structured convolutional kernel.

However, a key limitation of traditional SSMs lies in their inherent Linear Time-Invariant (LTI) nature. Fixed parameters like $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\Delta$ restrict their ability to capture nuances in diverse sequences. The selective SSM model, Mamba [18], addresses this by making these parameters input-dependent. This transition to a time-variant model enhances adaptability and leads to a more accurate representation of the input sequence.
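To make the discretization concrete, the following sketch (a minimal NumPy illustration under assumed shapes and random parameter values, not the authors' implementation) applies the ZOH rule of Eq. (2) and checks that the recurrence of Eq. (3) and the global convolution of Eq. (4) produce the same output.

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    # Eq. (2): A_bar = exp(Delta*A),  B_bar = (Delta*A)^{-1} (exp(Delta*A) - I) * Delta*B
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, u):
    # Eq. (3): h_t = A_bar h_{t-1} + B_bar u_t,  y_t = C h_t
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for u_t in u:
        h = A_bar @ h + B_bar * u_t
        ys.append(float(C @ h))
    return np.array(ys)

def ssm_convolution(A_bar, B_bar, C, u):
    # Eq. (4): K_bar = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar),  y = u * K_bar (causal)
    L = len(u)
    K = np.array([float(C @ np.linalg.matrix_power(A_bar, i) @ B_bar) for i in range(L)])
    return np.array([np.dot(K[: t + 1][::-1], u[: t + 1]) for t in range(L)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, L, delta = 4, 32, 0.1
    A = -np.eye(V) + 0.1 * rng.standard_normal((V, V))   # roughly stable state matrix (assumed)
    B = rng.standard_normal((V, 1))
    C = rng.standard_normal((1, V))
    u = rng.standard_normal(L)
    A_bar, B_bar = zoh_discretize(A, B, delta)
    assert np.allclose(ssm_recurrence(A_bar, B_bar, C, u),
                       ssm_convolution(A_bar, B_bar, C, u), atol=1e-8)
```

The equivalence of the two views is what allows SSM-based models to train with the parallel convolutional form and run inference with the cheap recurrent form.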
4. Method

This study introduces MixMamba, which combines MoE and Mamba for effective time series dependency modeling. The schematic architecture of the model is depicted in Fig. 1. The model preprocesses data through normalization, segments sequences into manageable parts, and embeds them for efficient processing. It then employs positional encoding to retain temporal order. The core MoM block dynamically selects the most suitable Mamba modules to extract key features from each segment. A gating network assigns weights to these modules, determining their influence on the final prediction. Finally, the model denormalizes the output to recover the original data scale.

4.1. Normalization

Due to the significant challenge of distribution shift, where the statistical properties of time series data (e.g., mean and variance) change over time, accurate forecasting becomes difficult due to discrepancies between training and test data distributions. To tackle this issue, we use Reversible Instance Normalization (RevIN) [34]. RevIN operates in two stages: (1) normalization, where it removes non-stationary information from input sequences to reduce distribution discrepancies, and (2) denormalization, where it injects the removed information back into the output sequences to preserve the original data distribution. The mean $\mu$ and variance $\sigma$ are computed for every instance $X^{(i)} \in \mathbb{R}^{L}$ of the input data as:

$$\mu_t^{(i)} = \frac{1}{L} \sum_{j=1}^{L} X_j^{(i)}, \qquad \sigma_t^{(i)} = \frac{1}{L} \sum_{j=1}^{L} \left( X_j^{(i)} - \mu_t^{(i)} \right)^2 \tag{5}$$

Following this, the normalized input is computed using learnable affine parameter vectors $\gamma, \beta \in \mathbb{R}^{N}$ as follows:

$$\dot{X}_t^{(i)} = \gamma_N \left( \frac{X_t^{(i)} - \mu_t^{(i)}}{\sqrt{\sigma_t^{(i)} + \epsilon}} \right) + \beta_N \tag{6}$$

Finally, the denormalized forecast value, denoted by $\hat{X}$, is obtained by applying the following equation to the model output, denoted by $\tilde{X}$:

$$\hat{X}_t^{(i)} = \sqrt{\sigma_t^{(i)} + \epsilon} \cdot \left( \frac{\tilde{X}_t^{(i)} - \beta_N}{\gamma_N} \right) + \mu_t^{(i)} \tag{7}$$
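A minimal sketch of this normalize/denormalize pair is given below, assuming a (batch, length, variables) tensor layout and per-instance statistics as in Eqs. (5)–(7); the official RevIN implementation differs in detail.

```python
import torch

class SimpleRevIN(torch.nn.Module):
    """Per-instance normalization/denormalization in the spirit of RevIN [34] (Eqs. (5)-(7))."""

    def __init__(self, num_features: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = torch.nn.Parameter(torch.ones(num_features))   # learnable affine scale
        self.beta = torch.nn.Parameter(torch.zeros(num_features))   # learnable affine shift

    def normalize(self, x):                 # x: (batch, L, num_features), look-back window
        self.mu = x.mean(dim=1, keepdim=True)                        # Eq. (5), per instance
        self.var = x.var(dim=1, keepdim=True, unbiased=False)
        x_norm = (x - self.mu) / torch.sqrt(self.var + self.eps)     # Eq. (6)
        return x_norm * self.gamma + self.beta

    def denormalize(self, y):               # y: (batch, T, num_features), model output
        y = (y - self.beta) / self.gamma                             # Eq. (7); gamma kept away from zero in practice
        return y * torch.sqrt(self.var + self.eps) + self.mu
```

Caching the look-back statistics on the module is what allows the forecast to be mapped back to the original scale after the MoM block and the prediction head.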
Fig. 1. Schematic architecture of MixMamba. The process begins with pre-processing the raw time series data through normalization and segmentation (left). These patches are
then embedded and augmented with positional information before being input into the mixture of Mamba (MoM) block (center). This block consists of multiple Mamba experts
coordinated via a gating network (right). Each Mamba module includes a series of projections, convolution, selective SSM, and a skip connection to learn temporal dependencies.
Finally, a linear prediction head is employed to generate final outputs.
4.2. Patch segmentation

Drawing inspiration from the success of the Transformer architecture in NLP, Vision Transformer (ViT) [35] pioneered the use of patching in image processing tasks. The process involves segmenting an image into a sequence of smaller patches, which are then processed by the Transformer. Recently, PatchTST [3] successfully applied the concept of patch segmentation to time series data. In this work, we build upon this successful approach by segmenting time series data into patches. These patches serve as input tokens that are fed into the MoM block, a core component of our model. This method effectively reduces the dimensionality of the time series data, making it computationally more manageable for the model to process. Consequently, the model gains the ability to capture important local features within each patch. Patch-based processing offers a significant advantage over considering the entire time series at once, as it helps to preserve local dependencies and structures within individual segments that might otherwise be obscured. Additionally, training on patches potentially encourages the model's ability to learn generalizable features that are invariant across different segments of the time series. This, in turn, could enhance the model's capacity to generalize from training data to unseen data, leading to improved forecasting performance.

The initial stage of our approach involves transforming the input univariate time series into a sequence of patches. These patches have a predefined size of $P$ and can be either overlapping or non-overlapping. The non-overlapping strategy relies on a concept called stride, denoted by $S$, which determines the distance between the beginnings of two consecutive patches. The resulting output is a sequence of patches, $X_p^{(i)} \in \mathbb{R}^{n \times P}$, with each patch being a vector of length $P$. Here, $n$ represents the total number of patches extracted from the original series. The calculation of $n$ leverages the floor function to account for incomplete final patches, and is formulated as: $n = \left\lfloor \frac{L-P}{S} \right\rfloor + 2$.

4.3. Patch embedding and positional encoding

The Mamba module maintains a consistent latent vector size across all its layers. To ensure compatibility with this dimension, we employ a trainable linear projection $e_p \in \mathbb{R}^{P \times D}$ in Eq. (8) to map each patch from its original dimension to $D$. The resulting vectors are referred to as patch embeddings. To preserve the temporal order within the sequence of patches, we incorporate position embeddings. These embeddings are learnable one-dimensional vectors $e_{pos} \in \mathbb{R}^{n \times D}$, and are implemented as described in [3]. Each position within the sequence is encoded using a set of trigonometric functions across $D$. By combining the patch embeddings and the position embeddings, we obtain a sequence of embedding vectors $Z_p$. This sequence serves as the input to the subsequent MoM block in our model.

$$Z_p^{(i)} = \left[ X_p^{(1)} e_p; \dots; X_p^{(n)} e_p \right] + e_{pos}, \qquad Z_p^{(i)} \in \mathbb{R}^{n \times D} \tag{8}$$
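The patching and embedding steps can be sketched as follows. This is a hedged illustration: the padding that yields the "+2" in the patch count is assumed to repeat the last value for one extra stride (as in PatchTST), and the positional embedding is taken as a learnable table; names and shapes are assumptions rather than the authors' exact code.

```python
import torch

def patchify(x, patch_len: int, stride: int):
    """Split a univariate series x of shape (batch, L) into patches of shape (batch, n, patch_len).

    The last value is repeated `stride` times (assumed padding), so that
    n = floor((L - patch_len) / stride) + 2 patches are produced, matching Section 4.2.
    """
    x = torch.cat([x, x[:, -1:].repeat(1, stride)], dim=1)
    return x.unfold(dimension=1, size=patch_len, step=stride)      # (batch, n, patch_len)

class PatchEmbedding(torch.nn.Module):
    def __init__(self, patch_len: int, d_model: int, n_patches: int):
        super().__init__()
        self.proj = torch.nn.Linear(patch_len, d_model)                   # e_p in Eq. (8)
        self.pos = torch.nn.Parameter(torch.zeros(n_patches, d_model))    # e_pos, learnable

    def forward(self, patches):               # patches: (batch, n, patch_len)
        return self.proj(patches) + self.pos  # Z_p: (batch, n, d_model)

# Example: L = 96, P = 16, S = 8  ->  n = (96 - 16) // 8 + 2 = 12 patches
x = torch.randn(4, 96)
z = PatchEmbedding(16, 128, 12)(patchify(x, 16, 8))
print(z.shape)   # torch.Size([4, 12, 128])
```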
and long-term seasonality. The gating network, typically a specialized
The non-overlapping strategy relies on a concept called stride, denoted
neural network, dynamically determines the weighting or contribution
by 𝑆, which determines the distance between the beginnings of two of each expert’s output towards the final output. As a result, MixMamba
consecutive patches. The resulting output is a sequence of patches with has the potential to achieve higher accuracy in comparison to single-
each patch being a vector of dimension 𝑋𝑝(𝑖) ∈ R𝑛×𝑃 . Here, 𝑛 represents model approaches, leveraging the collective expertise of each expert
the total number of patches extracted from the original series. The focused on specific aspects of the data.
calculation of 𝑛 leverages the floor function
⌊ to
⌋ account for incomplete We propose the MoM block, specifically designed to capture the
final patches, and is formulated as: 𝑛 = 𝐿−𝑃 𝑆
+ 2. complex and evolving temporal dependencies within time series data.
This block consists of a set of 𝜂 experts (1 , … , 𝜂 ), each being
4.3. Patch embedding and positional encoding a Mamba module with its own set of trainable parameters, and a
gating network () generating a sparse 𝜂-dimensional vector. Each
The Mamba module maintains a consistent latent vector size across element in this vector corresponds to the weighting or contribution of
all its layers. To ensure compatibility with this dimension, we employ the respective expert in the final prediction. Fig. 1 provides a visual
a trainable linear projection 𝑒𝑝 ∈ R𝑃 ×𝐷 in Eq. (8) to map each patch representation of the MoM block.
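As a schematic of how such a block can be wired, the sketch below stubs each Mamba expert with a small feed-forward module (the full selective-SSM expert is out of scope here) and combines the experts through a sparse top-k gate. The per-sample gating granularity, names, and shapes are assumptions, not the authors' exact design.

```python
import torch

class ExpertStub(torch.nn.Module):
    """Placeholder for a Mamba expert (projections + causal conv + selective SSM in the paper)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(d_model, d_model),
                                       torch.nn.GELU(),
                                       torch.nn.Linear(d_model, d_model))
    def forward(self, z):                 # z: (batch, n, d_model)
        return self.net(z)

class MoMBlock(torch.nn.Module):
    """Mixture-of-Mamba block (schematic): a gate emits a sparse eta-dimensional weight
    vector and the selected experts' outputs are combined with those weights."""
    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = torch.nn.ModuleList(ExpertStub(d_model) for _ in range(n_experts))
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, z):                                  # z: (batch, n, d_model)
        logits = self.gate(z.mean(dim=1))                  # one gating decision per input segment
        topk, idx = logits.topk(self.k, dim=-1)
        weights = torch.full_like(logits, float("-inf")).scatter(-1, idx, topk).softmax(-1)
        out = torch.stack([e(z) for e in self.experts], dim=1)   # (batch, eta, n, d_model)
        return torch.einsum("be,bend->bnd", weights, out)

block = MoMBlock(d_model=128, n_experts=8, k=2)
y = block(torch.randn(4, 12, 128))        # (4, 12, 128)
```

A production implementation would dispatch inputs only to the selected experts instead of evaluating all of them, which is where the efficiency of sparse mixtures comes from.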
Table 1
Statistics of long-term forecast datasets. Size is reported as (train, validation, test) time steps.

Dataset         Variables   Size                      Frequency   Domain
ETTh1, ETTh2    7           (8545, 2881, 2881)        Hourly      Electricity
ETTm1, ETTm2    7           (34465, 11521, 11521)     15 min      Electricity
Exchange        8           (5120, 665, 1422)         Daily       Economy
Weather         21          (36792, 5271, 10540)      10 min      Weather
ECL             321         (18317, 2633, 5261)       Hourly      Electricity
PEMS03          358         (15617, 5135, 5135)       5 min       Transportation
PEMS08          170         (10690, 3548, 265)        5 min       Transportation
ILI             7           (617, 74, 170)            Weekly      Illness

The top-$k$ selection applied by the gating network keeps only the $k$ largest entries of the gating vector $v$:

$$\mathrm{TopKGating}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases} \tag{15}$$

where $W_{\text{gate}}$ and $W_{\text{noise}}$ are trainable weight matrices.

As noted by [38,39], the gating network often converges to a state where it consistently assigns large weights to a select few experts. To mitigate this, we adopt the approach proposed by [38] by defining the significance of an expert in relation to a batch of training samples as the sum of the gate values for that specific expert across the batch.
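Under the assumption that the omitted Eqs. (9)–(14) follow the noisy top-k gating of [38], the gate and the importance-based balancing term can be sketched as follows; the exact loss form used by the authors may differ.

```python
import torch
import torch.nn.functional as F

def noisy_topk_gate(x, w_gate, w_noise, k: int, train: bool = True):
    """Noisy top-k gating in the spirit of [38] (cf. Eq. (15)).

    x: (batch, d_model); w_gate, w_noise: (d_model, n_experts).
    Returns sparse gate weights of shape (batch, n_experts).
    """
    logits = x @ w_gate
    if train:  # tunable noise encourages exploration over experts during training
        logits = logits + torch.randn_like(logits) * F.softplus(x @ w_noise)
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf")).scatter(-1, topk_idx, topk_vals)  # TopKGating, Eq. (15)
    return masked.softmax(dim=-1)

def load_balance_loss(gates):
    """Importance-based auxiliary loss from [38]: squared coefficient of variation of the
    per-expert sum of gate values over the batch, discouraging collapse onto a few experts."""
    importance = gates.sum(dim=0)                                   # (n_experts,)
    return importance.var(unbiased=False) / (importance.mean() ** 2 + 1e-10)
```

The auxiliary term is added to the task loss with a small coefficient so that every expert keeps receiving gradient signal.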
Table 2
Long-term forecasting performance on various datasets. The look-back window (𝐿) is set to 36 for ILI and 96 for all other datasets. The prediction window (𝑇 ) varies: PEMS (𝑇 ∈ {12, 24, 48, 96}), ILI (𝑇 ∈ {24, 36, 48, 60}), and all others
(𝑇 ∈ {96, 192, 336, 720}). Avg represents the average results across these four prediction windows. Bold red values indicate the best performance, while underlined blue values represent the second-best.
Models MixMamba iTransformer [2] RLinear [16] PatchTST [3] Crossformer [15] TiDE [19] TimesNet [6] DLinear [8] SCINet [42] FEDformer [4] Stationary [9] Autoformer [14]
Metric (Ours) (2023) (2023) (2023) (2023) (2023) (2023) (2023) (2022) (2022) (2022) (2021)
MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTm1
96 0.318 0.350 0.334 0.368 0.355 0.376 0.329 0.367 0.404 0.426 0.364 0.387 0.338 0.375 0.345 0.372 0.418 0.438 0.379 0.419 0.386 0.398 0.505 0.475
192 0.363 0.372 0.377 0.391 0.391 0.392 0.367 0.385 0.450 0.451 0.398 0.404 0.374 0.387 0.380 0.389 0.439 0.450 0.426 0.441 0.459 0.444 0.553 0.496
336 0.391 0.393 0.426 0.420 0.424 0.415 0.399 0.410 0.532 0.515 0.428 0.425 0.410 0.411 0.413 0.413 0.490 0.485 0.445 0.459 0.495 0.464 0.621 0.537
720 0.450 0.427 0.491 0.459 0.487 0.450 0.454 0.439 0.666 0.589 0.487 0.461 0.478 0.450 0.474 0.453 0.595 0.550 0.543 0.490 0.585 0.516 0.671 0.561
Avg 0.381 0.386 0.407 0.410 0.414 0.407 0.387 0.400 0.513 0.496 0.419 0.419 0.400 0.406 0.403 0.407 0.485 0.481 0.448 0.452 0.481 0.456 0.588 0.517
ETTm2
96 0.176 0.254 0.180 0.264 0.182 0.265 0.175 0.259 0.287 0.366 0.207 0.305 0.187 0.267 0.193 0.292 0.286 0.377 0.203 0.287 0.192 0.274 0.255 0.339
192 0.241 0.297 0.250 0.309 0.246 0.304 0.241 0.302 0.414 0.492 0.290 0.364 0.249 0.309 0.284 0.362 0.399 0.445 0.269 0.328 0.280 0.339 0.281 0.340
336 0.301 0.337 0.311 0.348 0.307 0.342 0.305 0.343 0.597 0.542 0.377 0.422 0.321 0.351 0.369 0.427 0.637 0.591 0.325 0.366 0.334 0.361 0.339 0.372
720 0.400 0.394 0.412 0.407 0.407 0.398 0.402 0.400 1.730 1.042 0.558 0.524 0.408 0.403 0.554 0.522 0.960 0.735 0.421 0.415 0.417 0.413 0.433 0.432
Avg 0.280 0.321 0.288 0.332 0.286 0.327 0.281 0.326 0.757 0.610 0.358 0.404 0.291 0.333 0.350 0.401 0.571 0.537 0.305 0.349 0.306 0.347 0.327 0.371
ETTh1
96 0.374 0.389 0.386 0.405 0.386 0.395 0.414 0.419 0.423 0.448 0.479 0.464 0.384 0.402 0.386 0.400 0.654 0.599 0.376 0.419 0.513 0.491 0.449 0.459
192 0.420 0.417 0.441 0.436 0.437 0.424 0.460 0.445 0.471 0.474 0.525 0.492 0.436 0.429 0.437 0.432 0.719 0.631 0.420 0.448 0.534 0.504 0.500 0.482
336 0.463 0.439 0.487 0.458 0.479 0.446 0.501 0.466 0.570 0.546 0.565 0.515 0.491 0.469 0.481 0.459 0.778 0.659 0.459 0.465 0.588 0.535 0.521 0.496
720 0.455 0.454 0.503 0.491 0.481 0.470 0.500 0.488 0.653 0.621 0.594 0.558 0.521 0.500 0.519 0.516 0.836 0.699 0.506 0.507 0.643 0.616 0.514 0.512
Avg 0.429 0.428 0.454 0.447 0.446 0.434 0.469 0.454 0.529 0.522 0.541 0.507 0.458 0.450 0.456 0.452 0.747 0.647 0.440 0.460 0.570 0.537 0.496 0.487
ETTh2
96 0.283 0.329 0.297 0.349 0.288 0.338 0.302 0.348 0.745 0.584 0.400 0.440 0.340 0.374 0.333 0.387 0.707 0.621 0.358 0.397 0.476 0.458 0.346 0.388
192 0.363 0.380 0.380 0.400 0.374 0.390 0.388 0.400 0.877 0.656 0.528 0.509 0.402 0.414 0.477 0.476 0.860 0.689 0.429 0.439 0.512 0.493 0.456 0.452
336 0.406 0.416 0.428 0.432 0.415 0.426 0.426 0.433 1.043 0.731 0.643 0.571 0.452 0.452 0.594 0.541 1.000 0.744 0.496 0.487 0.552 0.551 0.482 0.486
720 0.415 0.433 0.427 0.445 0.420 0.440 0.431 0.446 1.104 0.763 0.874 0.679 0.462 0.468 0.831 0.657 1.249 0.838 0.463 0.474 0.562 0.560 0.515 0.511
Avg 0.367 0.390 0.383 0.407 0.374 0.398 0.387 0.407 0.942 0.684 0.611 0.550 0.414 0.427 0.559 0.515 0.954 0.723 0.437 0.449 0.526 0.516 0.450 0.459
PEMS03
12 0.072 0.175 0.071 0.174 0.126 0.236 0.099 0.216 0.090 0.203 0.178 0.305 0.085 0.192 0.122 0.243 0.066 0.172 0.126 0.251 0.081 0.188 0.272 0.385
24 0.091 0.199 0.093 0.201 0.246 0.334 0.142 0.259 0.121 0.240 0.257 0.371 0.118 0.223 0.201 0.317 0.085 0.198 0.149 0.275 0.105 0.214 0.334 0.440
48 0.121 0.233 0.125 0.236 0.551 0.529 0.211 0.319 0.202 0.317 0.379 0.463 0.155 0.260 0.333 0.425 0.127 0.238 0.227 0.348 0.154 0.257 1.032 0.782
96 0.162 0.270 0.164 0.275 1.057 0.787 0.269 0.370 0.262 0.367 0.490 0.539 0.228 0.317 0.457 0.515 0.178 0.287 0.348 0.434 0.247 0.336 1.031 0.796
Avg 0.112 0.222 0.113 0.221 0.495 0.472 0.180 0.291 0.169 0.281 0.326 0.419 0.147 0.248 0.278 0.375 0.114 0.224 0.213 0.327 0.147 0.249 0.667 0.601
PEMS08
12 0.090 0.186 0.079 0.182 0.133 0.247 0.168 0.232 0.165 0.214 0.227 0.343 0.112 0.212 0.154 0.276 0.087 0.184 0.173 0.273 0.109 0.207 0.436 0.485
24 0.125 0.226 0.115 0.219 0.249 0.343 0.224 0.281 0.215 0.260 0.318 0.409 0.141 0.238 0.248 0.353 0.122 0.221 0.210 0.301 0.140 0.236 0.467 0.502
48 0.184 0.237 0.186 0.235 0.569 0.544 0.321 0.354 0.315 0.355 0.497 0.510 0.198 0.283 0.440 0.470 0.189 0.270 0.320 0.394 0.211 0.294 0.966 0.733
96 0.216 0.262 0.221 0.267 1.166 0.814 0.408 0.417 0.377 0.397 0.721 0.592 0.320 0.351 0.674 0.565 0.236 0.300 0.442 0.465 0.345 0.367 1.385 0.915
Avg 0.154 0.228 0.150 0.226 0.529 0.487 0.280 0.321 0.268 0.307 0.441 0.464 0.193 0.271 0.379 0.416 0.158 0.244 0.286 0.358 0.201 0.276 0.814 0.659
Exchange
96 0.084 0.200 0.086 0.206 0.093 0.217 0.088 0.205 0.256 0.367 0.094 0.218 0.107 0.234 0.088 0.218 0.267 0.396 0.148 0.278 0.111 0.237 0.197 0.323
192 0.174 0.295 0.177 0.299 0.184 0.307 0.176 0.299 0.470 0.509 0.184 0.307 0.226 0.344 0.176 0.315 0.351 0.459 0.271 0.315 0.219 0.335 0.300 0.369
336 0.333 0.415 0.331 0.417 0.351 0.432 0.301 0.397 1.268 0.883 0.349 0.431 0.367 0.448 0.313 0.427 1.324 0.853 0.460 0.427 0.421 0.476 0.509 0.524
720 0.826 0.682 0.847 0.691 0.886 0.714 0.901 0.714 1.767 1.068 0.852 0.698 0.964 0.746 0.839 0.695 1.058 0.797 1.195 0.695 1.092 0.769 1.447 0.941
Avg 0.360 0.401 0.360 0.403 0.378 0.417 0.367 0.404 0.940 0.707 0.370 0.413 0.416 0.443 0.354 0.414 0.750 0.626 0.519 0.429 0.461 0.454 0.613 0.539
Weather
96 0.179 0.214 0.174 0.214 0.192 0.232 0.177 0.218 0.158 0.230 0.202 0.261 0.172 0.220 0.196 0.255 0.221 0.306 0.217 0.296 0.173 0.223 0.266 0.336
192 0.226 0.254 0.221 0.254 0.240 0.271 0.225 0.259 0.206 0.277 0.242 0.298 0.219 0.261 0.237 0.296 0.261 0.340 0.276 0.336 0.245 0.285 0.307 0.367
336 0.281 0.293 0.278 0.296 0.292 0.307 0.278 0.297 0.272 0.335 0.287 0.335 0.280 0.306 0.283 0.335 0.309 0.378 0.339 0.380 0.321 0.338 0.359 0.395
720 0.355 0.342 0.358 0.347 0.364 0.353 0.354 0.348 0.398 0.418 0.351 0.386 0.365 0.359 0.345 0.381 0.377 0.427 0.403 0.428 0.414 0.410 0.419 0.428
Avg 0.261 0.277 0.258 0.278 0.272 0.291 0.259 0.281 0.259 0.315 0.271 0.320 0.259 0.287 0.265 0.317 0.292 0.363 0.309 0.360 0.288 0.314 0.338 0.382
ILI
24 1.971 0.838 2.472 0.994 5.742 1.772 2.290 0.920 3.906 1.332 5.452 1.732 2.317 0.934 2.398 1.040 3.687 1.420 3.228 1.260 2.294 0.945 3.483 1.287
36 1.875 0.816 2.288 0.964 5.343 1.672 2.345 0.928 3.880 1.278 4.960 1.621 1.972 0.920 2.646 1.088 3.941 1.582 2.679 1.080 1.825 0.848 3.103 1.148
48 1.898 0.829 2.227 0.951 4.722 1.563 2.213 0.916 3.896 1.273 4.561 1.533 2.238 1.982 2.614 1.086 3.193 1.202 2.622 1.078 2.010 0.900 2.669 1.085
60 1.893 0.849 2.267 0.966 4.526 1.529 2.143 0.904 4.190 1.331 4.632 1.556 2.027 0.928 2.804 1.146 3.187 1.198 2.857 1.157 2.178 0.963 2.770 1.125
ECL
336 0.204 0.301 0.178 0.269 0.215 0.298 0.215 0.305 0.246 0.337 0.249 0.344 0.198 0.300 0.209 0.301 0.269 0.369 0.214 0.329 0.200 0.304 0.231 0.338
720 0.226 0.322 0.225 0.317 0.257 0.331 0.256 0.337 0.280 0.363 0.284 0.373 0.220 0.320 0.245 0.333 0.299 0.390 0.246 0.355 0.222 0.321 0.254 0.361
Avg 0.193 0.287 0.178 0.270 0.219 0.298 0.216 0.304 0.244 0.334 0.251 0.344 0.192 0.295 0.212 0.300 0.268 0.365 0.214 0.327 0.193 0.296 0.227 0.338
1st count 29 38 8 10 0 0 2 1 3 0 0 0 1 0 2 0 2 2 2 0 1 0 0 0
Table 3
Short-term forecasting performance on the M4 dataset. Values in red font denote the best performing model(s) for each metric, while underlined blue values represent the second-best
performing model(s).
Models MixMamba iTransformer [2] TimesNet [6] PatchTST [3] DLinear [8] FEDformer [4] LightTS [7] MICN [5] FiLM [43] Informer [23] Stationary [9] Autoformer [14]
Year
SMAPE 13.363 14.262 13.62 13.562 14.319 13.93 13.569 14.53 14.377 16.127 13.717 13.974
MASE 2.993 3.257 3.112 3.029 3.078 3.141 3.066 3.38 3.031 3.544 3.078 3.134
OWA 0.785 0.846 0.808 0.796 0.825 0.821 0.801 0.87 0.821 0.939 0.807 0.822
Quarter
SMAPE 10.187 13.64 10.244 10.884 10.514 10.836 10.431 11.45 10.788 13.703 10.958 11.338
MASE 1.181 1.719 1.202 1.308 1.241 1.285 1.212 1.391 1.308 1.742 1.325 1.365
OWA 0.893 1.246 0.903 0.971 0.93 0.961 0.915 1.027 0.967 1.257 0.981 1.012
Month
SMAPE 12.89 16.8 12.88 13.146 13.464 14.091 13.043 13.858 13.384 15.867 13.917 13.958
MASE 0.941 1.399 0.953 1.011 1.028 1.076 0.999 1.085 1.027 1.263 1.097 1.103
OWA 0.897 1.24 0.894 0.931 0.95 0.994 0.922 0.99 0.947 1.144 0.998 1.002
Others
SMAPE 4.758 5.823 5.009 4.963 5.114 5.025 5.791 6.207 5.82 7.213 6.302 5.485
MASE 3.189 4.044 3.346 3.334 3.641 3.279 3.802 4.311 3.881 5.198 4.064 3.865
OWA 1.004 1.25 1.055 1.048 1.112 1.046 1.209 1.333 1.224 1.579 1.304 1.187
Avg
SMAPE 12.04 14.909 12.024 12.29 12.535 12.82 12.175 13.052 12.611 14.975 12.780 12.909
MASE 1.584 2.035 1.629 1.663 1.681 1.711 1.666 1.848 1.698 2.099 1.756 1.771
OWA 0.858 1.082 0.869 0.888 0.902 0.92 0.884 0.964 0.909 1.101 0.930 0.939
Note: The official code and settings from TimesNet [6] were used; however, the reproduced results exhibit variations from those reported in the paper.
Fig. 2. Visualization of long-term forecasting performance with 𝐿 = 96 and 𝑇 = 192 on ETTh1 dataset.
Fig. 3. Visualization of long-term forecasting performance with 𝐿 = 96 and 𝑇 = 192 on weather dataset.
selection process and capability to scale linearly with sequence length, facilitates the model's effective capture of broader context and complex temporal dependencies present within time series data. This ensures that MixMamba maintains its efficiency and performance even when dealing with increasingly long sequences.

To further substantiate the superior performance of MixMamba, we present visualizations showcasing its long-term forecasting capabilities on the ETTh1 and Weather datasets (Figs. 2 and 3, respectively). We compare MixMamba's performance against established models: iTransformer [2], PatchTST [3], TiDE [19], DLinear [8], and Autoformer [14]. The results demonstrably indicate that MixMamba achieves a higher degree of accuracy in its predictions compared to the aforementioned models.

5.2. Short-term forecasting results

To further validate the effectiveness of MixMamba for short-term forecasting, we compare its performance to SOTA baselines on the M4 dataset. Unlike long-term forecasting datasets, which encompass high-frequency data like weekly, daily, hourly, and even sub-hourly intervals (e.g., 15, 10, and 5 min as shown in Table 1), the M4 dataset has lower frequencies (yearly, quarterly, and monthly data). This allows for a comprehensive evaluation of MixMamba's performance across a variety of forecasting horizons with distinct frequency characteristics. Consistent with the forecasting horizons employed in TimesNet [6], we adopted the following time steps for different M4 data granularities: yearly (6), quarterly (8), monthly (18), weekly (13), daily (14), and hourly (48). The detailed results pertaining to this experiment are presented in Table 3.
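The M4 comparison is scored with SMAPE, MASE, and OWA. For reference, the first two metrics can be computed as in the short sketch below (OWA, a weighted average of SMAPE and MASE relative to the Naïve2 baseline, is omitted); this is not the authors' evaluation code.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE as used in the M4 competition (in percent)."""
    return 200.0 * np.mean(np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))

def mase(y_true, y_pred, insample, season: int = 1):
    """Mean Absolute Scaled Error: forecast MAE scaled by the in-sample seasonal-naive MAE."""
    scale = np.mean(np.abs(insample[season:] - insample[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / scale
```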
Table 4
Classification accuracy on diverse datasets. The asterisk (*) refers to Transformer family (e.g., Re* is Reformer).
Datasets/Models DTW XGBoost Rocket LSTM LSTNet LSSL TCN Trans* Re* In* Pyra* Auto* Station* FED* ETS* Flow* DLinear LightTS TimesNet MixMamba
FaceDetection 52.9 63.3 64.7 57.7 65.7 66.7 52.8 67.3 68.6 67.0 65.7 68.4 68.0 66.0 66.3 67.6 68.0 67.5 68.6 69.0
Heartbeat 71.7 73.2 75.6 72.2 77.1 72.7 75.6 76.1 77.1 80.5 75.6 74.6 73.7 73.7 71.2 77.6 75.1 75.1 78.0 77.0
JapaneseVowels 94.9 86.5 96.2 79.7 98.1 98.4 98.9 98.7 97.8 98.9 98.4 96.2 99.2 98.4 95.9 98.9 96.2 96.2 98.4 98.1
SelfRegulationSCP1 77.7 84.6 90.8 68.9 84.0 90.8 84.6 92.2 90.4 90.1 88.1 84.0 89.4 88.7 89.6 92.5 87.3 89.8 91.8 91.8
SelfRegulationSCP2 53.9 48.9 53.3 46.6 52.8 52.2 55.6 53.9 56.7 53.3 53.3 50.6 57.2 54.4 55.0 56.1 50.5 51.1 57.2 58.3
SpokenArabicDigits 96.3 69.6 71.2 31.9 100.0 100.0 95.6 98.4 97.0 100.0 99.6 100.0 100.0 100.0 100.0 98.8 81.4 100.0 99.0 98.5
UWaveGestureLibrary 90.3 75.9 94.4 41.2 87.8 85.9 88.4 85.6 85.6 85.6 83.4 85.9 87.5 85.3 85.0 86.6 82.1 80.3 85.3 85.9
Average accuracy 76.81 71.71 78.03 56.89 80.79 80.96 78.79 81.74 81.89 82.20 80.59 79.96 82.14 80.93 80.43 82.59 77.23 80.00 82.61 82.66
Despite the inherent complexity of the M4 dataset, which encompasses diverse temporal variations from various sources, MixMamba outperforms all comparative benchmark models across a broad spectrum of frequencies. Notably, at low frequencies, MixMamba achieves significant improvements on all three evaluation metrics: Symmetric Mean Absolute Percentage Error (SMAPE), Mean Absolute Scaled Error (MASE), and Overall Weighted Average (OWA). This improvement is particularly pronounced when compared to PatchTST, the second-best performing model. This superiority can be attributed to MixMamba's ability to effectively capture the distinct patterns inherent in the low-frequency data. MixMamba's effectiveness extends beyond low-frequency data and generalizes well to high-frequency data (Others) as well. The model's design, with its emphasis on handling a wide range of temporal dynamics, proves advantageous in this domain. The specialized experts within the MoE framework are adept at acquiring knowledge from the rapid variations present in high-frequency data. Compared to PatchTST and FEDformer, MixMamba achieves substantial improvements on all metrics, showcasing its ability to handle rapid fluctuations within the data. Furthermore, MixMamba surpasses recently introduced models, including iTransformer and DLinear, demonstrating its competitive edge in this evolving field. Most importantly, MixMamba surpasses TimesNet, the current SOTA model specifically designed for short-term forecasting, in terms of accuracy across most frequencies.

To further demonstrate MixMamba's superiority in short-term forecasting, we present a visualization for the M4 dataset in Fig. 4. Here, we compare MixMamba's performance against iTransformer [2], PatchTST [3], TimesNet [6], DLinear [8], and Autoformer [14]. This visualization illuminates MixMamba's ability to capture the nuances of short-term patterns within the M4 data.

5.3. Classification results

To assess the effectiveness of MixMamba in different time series tasks, we conduct experiments on seven established time series classification datasets. We compare our model against a variety of baseline models, including Transformer-based architectures, linear-based models, and TimesNet, the current SOTA in classification.

MixMamba's classification relies on predicted class distributions, employing the cross-entropy loss during training. A key feature is its dynamic gating mechanism within the MoE component. This mechanism selects experts based on a weighted combination of factors, including past performance on similar data and the current data point's characteristics. Consequently, each expert specializes in handling specific temporal dynamics within the time series data. This diversity allows MixMamba to tackle a wide range of input data characteristics, ultimately improving the overall robustness and accuracy of classification. To illustrate this process, consider a simplified 3-class problem with TopKGating set to 3. Here, the gating network selects the three most suitable experts. Each expert outputs a probability distribution over all classes. The gating network then assigns weights to each expert based on their past performance and the current data point. Finally, these weighted expert predictions are combined to produce the final classification. MixMamba employs the cross-entropy loss function to measure the difference between the predicted distribution and the true class labels.

The results, reported in Table 4, demonstrate our model's superior classification accuracy, achieving the highest average accuracy of 82.66%. The unique architecture of our model facilitates its proficiency in learning high-level representations and capturing the rich information within the data. This, in turn, enhances its classification performance relative to the baseline models. While Transformer-based models demonstrate competitive performance in forecasting tasks, their ability to capture comprehensive representations for classification tasks seems limited. Conversely, TimesNet, while exhibiting promise in classification tasks, struggles with long-term forecasting.

Table 5 presents additional classification results obtained using two new datasets from the UEA archive: PEMS-SF and EigenWorms. F1 Score and AUC-ROC serve as the evaluation metrics to assess the performance of various classification models. Notably, MixMamba demonstrates superior performance compared to Transformer-based and linear-based models, achieving the highest F1 Score and AUC-ROC on both datasets. These findings underscore the robustness and superior classification capabilities of MixMamba.
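The 3-class illustration above can be written out numerically as follows; the probabilities and gate weights are made up for illustration only and are not taken from the experiments.

```python
import torch
import torch.nn.functional as F

# Three selected experts each emit class probabilities, the gate weights them,
# and cross-entropy is computed against the true label.
expert_probs = torch.tensor([[0.7, 0.2, 0.1],    # expert 1
                             [0.5, 0.3, 0.2],    # expert 2
                             [0.2, 0.6, 0.2]])   # expert 3
gate_weights = torch.tensor([0.5, 0.3, 0.2])     # produced by the gating network, sums to 1

mixed = gate_weights @ expert_probs              # final class distribution: tensor([0.54, 0.31, 0.15])
true_label = torch.tensor(0)
loss = F.nll_loss(torch.log(mixed).unsqueeze(0), true_label.unsqueeze(0))
print(mixed, loss)                               # cross-entropy = -log(0.54) ≈ 0.616
```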
Table 5
F1 Score and AUC-ROC classification results on PEMS-SF and EigenWorms datasets. Bold indicates best performing model.
Datasets Metrics CrossFormer DLinear PatchTST TimesNet MixMamba
PEMS-SF
F1 0.609 0.681 0.713 0.769 0.792
AUC-ROC 0.825 0.842 0.857 0.863 0.885
EigenWorms
F1 0.533 0.584 0.637 0.682 0.713
AUC-ROC 0.768 0.772 0.791 0.814 0.833
Fig. 5. Performance under varied look-back window length L ∈ {96, 192, 336, 720} on the PEMS03 dataset (T = 720).

5.4. Impact of look-back window

In order to evaluate the performance of MixMamba under more demanding conditions, we conduct an experiment utilizing extended look-back window lengths on the PEMS03 dataset. We fix the prediction window length (T) at 720 timesteps, while the look-back sequence length (L) is varied across four values: 96, 192, 336, and 720. This experiment aims to assess the model's capacity to capture complex dependencies within longer sequences. We compare the performance of MixMamba against three established baselines. The results are visualized in Fig. 5. Notably, MixMamba exhibits a consistent reduction in MSE with increasing look-back window length. This superior performance can be primarily attributed to two key aspects of MixMamba's architecture. Firstly, the utilization of specialized experts allows the model to effectively capture a wider range of patterns within the data. Secondly, the strength of the Mamba module enables the model to learn informative representations from longer sequences. It is noteworthy that, while the MSE of other models increased for the longest look-back length (L = 720), MixMamba maintains its capability to capture dependencies and reduce prediction error.

5.5. Analysis of representation

This study examines the representational quality learned by our proposed model, with a particular focus on the core MoM block. Two analyses are conducted on separate datasets to evaluate the effectiveness of MixMamba in capturing the underlying structure of time series data.

The first analysis employs CKA [44] as a robust metric to assess the similarity between the input and output representations generated by the MoM block. We use CKA with nonlinear kernels to capture more complex relationships compared to the linear kernel. The results, presented in Fig. 6, demonstrate that MixMamba achieves the highest CKA similarity with the lowest MSE compared to other models. A high CKA score for our model indicates several key advantages: (1) MixMamba effectively preserves the essential structure of the input data throughout the transformation process, ensuring that the model maintains the key characteristics of the original information, crucial for understanding and leveraging the underlying patterns in the data. (2) The high CKA similarity highlights the proficiency of MixMamba in capturing the relational structure inherent in the time series data, such as long-term trends and seasonal patterns.
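The similarity reported in Fig. 6 can in principle be reproduced with a kernel-CKA routine like the sketch below; the authors' exact kernel choice, bandwidth, and preprocessing are not specified here, so this is a reference implementation only.

```python
import numpy as np

def _center(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def rbf_kernel(X, sigma=None):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    if sigma is None:                      # median heuristic for the bandwidth (assumed)
        sigma = np.sqrt(np.median(sq[sq > 0]) / 2)
    return np.exp(-sq / (2 * sigma ** 2))

def cka(X, Y, kernel=rbf_kernel):
    """Kernel CKA [44] between representation matrices X, Y of shape (n_samples, dim)."""
    Kc, Lc = _center(kernel(X)), _center(kernel(Y))
    hsic = np.sum(Kc * Lc)
    return hsic / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))
```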
The second analysis employs UMAP [45] to visualize the outputs of eight experts within the MoM block in Fig. 6. UMAP operates on the principle of manifold learning, aiming to represent the high-dimensional data structure in a lower-dimensional space while preserving both local and global relationships. The results in Fig. 6 provide tangible evidence of the capability of MixMamba to learn diverse patterns within the time series data. This figure shows distinct clusters for each expert, indicating that different experts specialize in capturing different patterns within the data. The separation between clusters suggests that the features or patterns each expert responds to are distinct, which is indicative of a well-functioning MixMamba model.

5.6. Computational efficiency

In the realm of time-series applications, the importance of computational and memory efficiency cannot be overstated [1]. As the forecast horizon extends, the associated computational and memory costs escalate. Fig. 7 depicts the computational time and memory usage for various forecast horizons in a single batch for six models compared to our proposed model. DLinear exhibits the lowest time and memory consumption due to its limited number of parameters and simple architecture. Conversely, TimesNet exhibits the highest computational cost and memory usage, as it extracts features from multiple time periods and iterates computations for each period. Transformer-based models possess quadratic complexity, explaining the observed rise in computation and memory. Our model emerges as the second most efficient in terms of both computation and memory usage, surpassing all Transformer-based models. Notably, the computational cost and memory requirements of our method do not exhibit a significant increase with extended forecast lengths. Consequently, our approach achieves superior performance compared to DLinear, while maintaining lower computational and memory demands compared to the Transformer-based models.
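Per-batch latency and peak GPU memory of the kind plotted in Fig. 7 are typically measured along the following lines; the exact protocol behind Fig. 7 is not stated, so this sketch only illustrates a common measurement setup.

```python
import time
import torch

def profile_forecast(model, batch, device="cuda", warmup: int = 3, iters: int = 10):
    """Rough per-batch inference latency (ms) and peak GPU memory (MB) for one model."""
    model = model.to(device).eval()
    batch = batch.to(device)
    with torch.no_grad():
        for _ in range(warmup):                      # warm up kernels / allocator
            model(batch)
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        torch.cuda.synchronize(device)
    latency_ms = (time.perf_counter() - start) / iters * 1e3
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
    return latency_ms, peak_mb
```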
Fig. 8. Ablation analysis of MixMamba and its variants on the ECL and Weather datasets (L = 96, T = 720).

5.7. Ablation study

To investigate the influence of each component within the MixMamba architecture, we construct five distinct variants, each targeting a specific element. w/o MoE: This variant investigates the impact of the MoE paradigm by entirely removing it from the architecture. Instead, it employs a single Mamba module to directly learn both data representations and model dependencies. w/o Mamba: This variant replaces the Mamba module with a simple MLP module, acting as an expert within the framework. w/o Patching: This variant bypasses the patching step, feeding the complete sequence directly into the MoM block without any prior segmentation. w/o $\mathcal{L}_{aux}$: This variant explores the significance of the auxiliary loss associated with the gating network. It eliminates this term from the overall loss function employed during training. The final variant, MixTrans, investigates the role of the Mamba module by replacing it with a standard Transformer block. The performance of each variant is evaluated on two benchmark datasets: ECL and Weather. The results of this ablation study are presented in Fig. 8.

The ablation study depicted in Fig. 8 offers valuable insights into the contributions of each component within the MixMamba framework. We observe the following key points: (1) Removing the MoE component leads to the most significant performance decline, with MSE increasing by 85.49% and 26.87% on the ECL and Weather datasets, respectively. This finding underscores the critical role of MoE in enabling the model to capture diverse characteristics within the time series data through specialized experts. This facilitates superior representation learning, ultimately leading to enhanced performance. Notably, the impact of removing MoE is more pronounced on the ECL dataset, likely due to its larger number of features (variables) compared to the Weather dataset. (2) Replacing the Mamba module with alternative expert models also results in a performance decline. When a Transformer block is employed as a substitute model on the ECL and Weather datasets, the MSE increases by 18.14% and 8.17%, respectively. This highlights the advantages of the Mamba module in capturing intricate relationships within lengthy sequences, which it achieves through its selection mechanism and causal convolution capabilities.

We investigate the impact of two key parameters on the performance of MixMamba: the number of experts ($\eta$) in Eq. (9) and the parameter $k$ in TopKGating in Eq. (13). The results are presented in Fig. 9. First, we vary the number of experts $\eta \in \{4, 8, 16, 24\}$. As illustrated in Fig. 9, an architecture with 16 experts achieves the optimal configuration on both datasets. This configuration allows MixMamba to effectively learn representations and capture the temporal dependencies within the time series data. However, using fewer or more experts reduces performance. With fewer experts, the model's capacity might be insufficient to capture the diversity of patterns and relationships in the data. Conversely, an increased number of experts can lead to overfitting the training data. We then examine the influence of the parameter $k \in \{2, 3, 4, 5\}$. The results in Fig. 9 indicate that the model achieves the best performance with $k = 2$. This value provides a good balance between specialization and capacity. Performance declines with higher values of $k$, where employing a larger number of active experts per data point increases the model's complexity, making it challenging for the gating network to learn which experts to select and how to effectively balance their contributions.

6. Conclusion

In essence, this work addresses limitations in time series modeling by introducing the MixMamba framework. MixMamba effectively captures complex long-term temporal dynamics within non-stationary time series data. This is achieved by leveraging the strengths of the mixture-of-experts paradigm and the Mamba model, enabling superior representation learning compared to single models. Furthermore, a dynamic gating network enhances MixMamba's adaptability by efficiently routing data segments to the most appropriate experts based on their specific characteristics. This approach yields several key advantages.
First, MixMamba effectively models long-term dependencies by capitalizing on the linear-time efficiency of the Mamba model, facilitating scalable learning for long sequences. Second, the dynamic gating network empowers MixMamba to flexibly adapt to heterogeneous data, efficiently handling diverse patterns and temporal changes. Finally, extensive evaluations demonstrate that MixMamba outperforms existing state-of-the-art methods across various time series modeling tasks, including long-term and short-term forecasting, classification, and imputation. Looking towards the future, MixMamba presents exciting avenues for further exploration. One promising direction involves incorporating domain knowledge to potentially enhance performance for specific applications. This could involve pre-training experts on domain-specific datasets or designing expert architectures tailored to capture known domain characteristics. Additionally, integrating MixMamba with probabilistic forecasting techniques holds promise for generating prediction intervals. This capability would provide valuable insights into the uncertainty associated with forecasts, critical for decision-making systems in various domains.

CRediT authorship contribution statement

Khaled Alkilane: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation, Data curation, Conceptualization. Yihang He: Visualization, Validation, Software, Data curation. Der-Horng Lee: Writing – review & editing, Supervision, Methodology, Funding acquisition, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgment

This research is financially supported by the Smart Urban Future (SURF) Laboratory, Zhejiang Province.

References

[1] M. Jin, H.Y. Koh, Q. Wen, D. Zambon, C. Alippi, G.I. Webb, I. King, S. Pan, A survey on graph neural networks for time series: Forecasting, classification, imputation, and anomaly detection, 2023, arXiv:2307.03759.
[2] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, M. Long, iTransformer: Inverted transformers are effective for time series forecasting, 2024, arXiv:2310.06625.
[3] Y. Nie, N.H. Nguyen, P. Sinthong, J. Kalagnanam, A time series is worth 64 words: Long-term forecasting with transformers, 2022, arXiv:2211.14730.
[4] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, R. Jin, FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting, in: International Conference on Machine Learning, PMLR, 2022, pp. 27268–27286.
[5] H. Wang, J. Peng, F. Huang, J. Wang, J. Chen, Y. Xiao, MICN: Multi-scale local and global context modeling for long-term series forecasting, in: The Eleventh International Conference on Learning Representations, 2023, URL https://openreview.net/forum?id=zt53IDUR1U.
[6] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, M. Long, TimesNet: Temporal 2D-variation modeling for general time series analysis, 2023, arXiv:2210.02186.
[7] T. Zhang, Y. Zhang, W. Cao, J. Bian, X. Yi, S. Zheng, J. Li, Less is more: Fast multivariate time series forecasting with light sampling-oriented MLP structures, 2022, CoRR abs/2207.01186, arXiv:2207.01186.
[8] A. Zeng, M. Chen, L. Zhang, Q. Xu, Are transformers effective for time series forecasting? in: Proceedings of the AAAI Conference on Artificial Intelligence, 2023, pp. 11121–11128.
[9] Y. Liu, H. Wu, J. Wang, M. Long, Non-stationary transformers: Exploring the stationarity in time series forecasting, in: NeurIPS, 2022.
[10] W. Xi, A. Jain, L. Zhang, J. Lin, LB-SimTSC: An efficient similarity-aware graph neural network for semi-supervised time series classification, 2023, arXiv:2301.04838.
[11] G.E. Box, G.M. Jenkins, G.C. Reinsel, G.M. Ljung, Time Series Analysis: Forecasting and Control, John Wiley & Sons, 2015.
[12] E.S. Gardner Jr., Exponential smoothing: The state of the art, J. Forecast. 4 (1985) 1–28.
[13] V. Cerqueira, L. Torgo, I. Mozetič, Evaluating time series forecasting models: An empirical study on performance estimation methods, Mach. Learn. 109 (2020) 1997–2028.
[14] H. Wu, J. Xu, J. Wang, M. Long, Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting, Adv. Neural Inf. Process. Syst. 34 (2021) 22419–22430.
[15] Y. Zhang, J. Yan, Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting, in: The Eleventh International Conference on Learning Representations, 2023.
[16] Z. Li, S. Qi, Y. Li, Z. Xu, Revisiting long-term time series forecasting: An investigation on linear mapping, 2023, arXiv:2305.10721.
[17] S.S. Rangapuram, M.W. Seeger, J. Gasthaus, L. Stella, Y. Wang, T. Januschowski, Deep state space models for time series forecasting, Adv. Neural Inf. Process. Syst. 31 (2018).
[18] A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, 2023, arXiv:2312.00752.
[19] A. Das, W. Kong, A. Leach, S.K. Mathur, R. Sen, R. Yu, Long-term forecasting with TiDE: Time-series dense encoder, Trans. Mach. Learn. Res. (2023), URL https://openreview.net/forum?id=pCbC3aQB5W.
[20] J.M. Valente, S. Maldonado, SVR-FFS: A novel forward feature selection approach for high-frequency time series forecasting using support vector regression, Expert Syst. Appl. 160 (2020) 113729.
[21] E. de Bézenac, S.S. Rangapuram, K. Benidis, M. Bohlke-Schneider, R. Kurle, L. Stella, H. Hasson, P. Gallinari, T. Januschowski, Normalizing Kalman filters for multivariate time series analysis, Adv. Neural Inf. Process. Syst. 33 (2020) 2995–3007.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017, URL https://arxiv.org/pdf/1706.03762.pdf.
[23] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 11106–11115.
[24] Y. Xu, L. Han, T. Zhu, L. Sun, B. Du, W. Lv, Generic dynamic graph convolutional network for traffic flow forecasting, Inf. Fusion 100 (2023) 101946, http://dx.doi.org/10.1016/j.inffus.2023.101946.
[25] J. Zhan, X. Huang, Y. Xian, W. Ding, A fuzzy C-means clustering-based hybrid multivariate time series prediction framework with feature selection, IEEE Trans. Fuzzy Syst. (2024) 1–15.
[26] C. Zhu, X. Ma, W. Ding, J. Zhan, Long-term time series forecasting with multilinear trend fuzzy information granules for LSTM in a periodic framework, IEEE Trans. Fuzzy Syst. 32 (2024) 322–336.
[27] J.M. Sanchez-Bornot, R.C. Sotero, Machine learning for time series forecasting using state space models, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2023, pp. 470–482.
[28] L. Zhou, M. Poli, W. Xu, S. Massaroli, S. Ermon, Deep latent state space models for time-series generation, in: International Conference on Machine Learning, PMLR, 2023, pp. 42625–42643.
[29] Y. Lin, I. Koprinska, M. Rana, SSDNet: State space decomposition neural network for time series forecasting, in: 2021 IEEE International Conference on Data Mining, ICDM, IEEE, 2021, pp. 370–378.
[30] A.F. Ansari, A. Heng, A. Lim, H. Soh, Neural continuous-discrete state space models for irregularly-sampled time series, in: International Conference on Machine Learning, PMLR, 2023, pp. 926–951.
[31] A. Gu, K. Goel, C. Ré, Efficiently modeling long sequences with structured state spaces, 2022, arXiv:2111.00396.
[32] A. Gu, T. Dao, S. Ermon, A. Rudra, C. Re, HiPPO: Recurrent memory with optimal polynomial projections, 2020, arXiv:2008.07669.
[33] A. Gupta, A. Gu, J. Berant, Diagonal state spaces are as effective as structured state spaces, 2022, arXiv:2203.14343.
[34] T. Kim, J. Kim, Y. Tae, C. Park, J.H. Choi, J. Choo, Reversible instance normalization for accurate time-series forecasting against distribution shift, in: International Conference on Learning Representations, 2022, URL https://openreview.net/forum?id=cGDAkQo1C0p.
[35] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, 2021, arXiv:2010.11929.
[36] B. Zhang, R. Sennrich, Root mean square layer normalization, 2019, arXiv:1910.07467.
[37] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, Z. Chen, GShard: Scaling giant models with conditional computation and automatic sharding, 2020, arXiv:2006.16668.
[38] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017, URL https://openreview.net/pdf?id=B1ckMDqlg.
[39] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, W. Fedus, ST-MoE: Designing stable and transferable sparse expert models, 2022, arXiv:2202.08906.
[40] S. Makridakis, M4 dataset, 2018, URL https://github.com/M4Competition/M4-methods/tree/master/Dataset.
[41] A. Bagnall, H.A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom, P. Southam, E. Keogh, The UEA multivariate time series classification archive, 2018, 2018, arXiv:1811.00075.
[42] M. Liu, A. Zeng, M. Chen, Z. Xu, Q. Lai, L. Ma, Q. Xu, SCINet: Time series modeling and forecasting with sample convolution and interaction, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., 2022, pp. 5816–5828.
[43] T. Zhou, Z. Ma, X. Wang, Q. Wen, L. Sun, T. Yao, W. Yin, R. Jin, FiLM: Frequency improved Legendre memory model for long-term time series forecasting, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., 2022, pp. 12677–12690.
[44] S. Kornblith, M. Norouzi, H. Lee, G. Hinton, Similarity of neural network representations revisited, 2019, arXiv:1905.00414.
[45] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, 2020, arXiv:1802.03426.