
KAN4TSF: Are KAN and KAN-based models

Effective for Time Series Forecasting?

Xiao Han1,2, Xinfeng Zhang1B, Yiling Wu2B, Zhenduo Zhang1,2, Zhe Wu2
1 School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
2 Peng Cheng Laboratory, Shenzhen, China
{hanxiao22,zhangzhenduo21}@mails.ucas.ac.cn
[email protected], {wuyl02,wuzh02}@pcl.ac.cn

arXiv:2408.11306v1 [cs.LG] 21 Aug 2024

Abstract

Time series forecasting is a crucial task that predicts the future values of variables
based on historical data. Time series forecasting techniques have been developing
in parallel with the machine learning community, from early statistical learning
methods to current deep learning methods. Although existing methods have made
significant progress, they still suffer from two challenges. The mathematical
theory of mainstream deep learning-based methods does not establish a clear
relation between network sizes and fitting capabilities, and these methods often lack
interpretability. To this end, we introduce the Kolmogorov-Arnold Network (KAN)
into time series forecasting research, which has better mathematical properties
and interpretability. First, we propose the Reversible Mixture of KAN experts
(RMoK) model, which is a KAN-based model for time series forecasting. RMoK
uses a mixture-of-experts structure to assign variables to KAN experts. Then, we
compare performance, integration, and speed between RMoK and various baselines
on real-world datasets, and the experimental results show that RMoK achieves the
best performance in most cases. We also find a relationship between temporal
feature weights and data periodicity through visualization, which roughly explains
RMoK's mechanism. Thus, we conclude that KAN and KAN-based models
(RMoK) are effective in time series forecasting. Code is available at
https://github.com/2448845600/KAN4TSF.

1 Introduction
Time series forecasting (TSF) is the task of using historical data to predict future states of variables.
This research area includes a broad scope of applications, such as financial investment, weather
forecasting, traffic estimation, and health management Bi et al. [2023], Gao et al. [2023], Savcisens
et al. [2024], Han et al. [2024a]. The machine learning community’s progress has long inspired time
series forecasting technology: the popularity of early statistical learning methods gave rise to SVR
and ARIMA, while the development of deep learning introduced MLP and Transformer into time
series forecasting. At present, various time series forecasting methods cover almost all deep learning
network architectures, such as RNN, CNN, Transformer, and MLP Nie et al. [2023], Wu et al. [2023],
Han et al. [2024b]. The forecasting models derived from different network architectures have their
own advantages in forecasting performance, running speed, and resource usage.
Although deep learning-based models have made notable progress in time series forecasting, there
are still several challenges. The universal approximation theorem (UAT), which is the mathematical
foundation of most mainstream forecasting models, cannot provide a guarantee on the necessary
network sizes (depths and widths) to approximate a predetermined continuous function with specific
accuracy. Moreover, this theory can only achieve an approximation rather than a representation. The
limitations of UAT have become the sword of Damocles hanging over time series forecasting.
Furthermore, the prediction mechanism of existing models is a black box, resulting in a lack of
interpretability. These nontransparent methods are suspected of being unsuitable for tasks that require
a low tolerance for errors, such as medicine, law, and finance.
Kolmogorov-Arnold Network (KAN) Liu et al. [2024a], which is based on the Kolmogorov-Arnold
representation theorem (KART), has become a novel approach to solving the above challenges. On the
one hand, KART proves that a multivariate continuous function can be represented as a combination
of finite univariate continuous functions. This theorem establishes the relationship between network
size and input shape under the premise of representation. On the other hand, KAN offers a pruning
strategy that simplifies the trained KAN into a set of symbolic functions, enabling the analysis
of specific modules’ mechanisms, thereby significantly enhancing the network’s interpretability.
In addition, KAN’s function fitting idea is consistent with the properties of time series, such as
periodicity and trend, which is conducive to embedding prior knowledge into the network structure
and improving the performance of the network.
Despite being a relatively recent proposal, KAN, which employs trainable 1D B-spline functions to
transform incoming signals, has already sparked numerous efforts to improve or broaden its capabilities.
Some studies propose KAN’s variants which replace the B-splines with Chebyshev polynomials SS
[2024], wavelet functions Bozorgasl and Chen [2024], Jacobi polynomials Aghaei [2024], ReLU
functions Qiu et al. [2024], etc., to accelerate training speed and improve network performance.
Other studies combine KAN with existing popular network structures for various applications. For
example, ConvKAN Bodner et al. [2024] and GraphKAN Zhang and Zhang [2024], Xu et al. [2024]
are proposed for image processing and graph processing, respectively. In summary, KANs have been extensively
empirically studied in vision and language Azam and Akhtar [2024], Yu et al. [2024]. However,
existing studies lack a KAN-based model that considers time series domain knowledge, making it
impossible to verify whether KAN is effective in time series forecasting.
To this end, we aim to propose a KAN-based model for the time series forecasting task and evaluate its
effectiveness from four perspectives: performance, integration, speed, and interpretability. First, we
propose the Reversible Mixture of KAN Experts model (RMoK), a KAN-based time series forecasting
model that uses multiple KAN variants as experts and a gating network to adaptively assign variables
to specific experts for prediction. RMoK is implemented as a single-layer network because we
hope that it can achieve performance similar to, and interpretability better than, existing methods. Then,
we use a unified training and evaluation setting to compare the performance of RMoK and current
popular baselines on seven real-world datasets. The experimental results show that RMoK achieves
state-of-the-art (SOTA) performance in most cases. Subsequently, we conduct a comprehensive
empirical study on KAN-based models, including the comparison between KAN and Linear, the
effect of integrating KANs with the Transformer, and the speed of the KAN-based models. Finally,
we discuss the interpretability of RMoK using the example of weather prediction. We visualize the
weights of temporal features at different time steps in KAN and find the correlation between the
weight distribution and the periodicity of the data.
To sum up, the contributions of this work include:

• To the best of our knowledge, this is the first work that comprehensively discusses the
effectiveness of the booming KANs for time series forecasting.
• To validate our claims, we propose the Reversible Mixture of KAN Experts model, which
uses a single layer of the mixture of KAN experts to keep a balance between performance
and interpretability.
• We fairly compare the performance of RMoK and the baselines on seven real-world
datasets, and the experimental results show that RMoK achieves the best performance in
most cases. We also conduct a comprehensive empirical study on KAN-based models,
covering integration and speed.
• We mine the relationship between temporal feature weights and data periodicity through
visualization, which roughly explains the mechanism of RMoK.

In summary, compared with the baselines in terms of performance, integration, speed, and inter-
pretability, we conclude that KAN is effective in time series forecasting.

Figure 1: The computational process of Linear and KAN layers under a certain input and output
dimension.

2 Problem Definition

In multivariate time series forecasting, given historical data $X = [X_1, \cdots, X_T] \in \mathbb{R}^{T \times C}$, where $T$ is the number of historical time steps and $C$ is the number of variates, the time series forecasting task is to predict $Y = [X_{T+1}, \cdots, X_{T+P}] \in \mathbb{R}^{P \times C}$ over the future $P$ time steps.
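To make the notation concrete, here is a tiny sketch of the tensor shapes involved; the specific sizes below are illustrative (borrowed from the ETTh1 setting used later in Table 5) and are not part of the definition.

```python
import torch

# Section 2 notation: T historical steps, C variates, P future steps.
T, C, P = 96, 7, 720               # illustrative sizes (ETTh1-style setting)
X = torch.randn(T, C)              # historical data X in R^{T x C}
# A forecasting model f maps X to a prediction Y_hat in R^{P x C}:
Y_hat = torch.zeros(P, C)          # placeholder for f(X)
```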

3 Related Work

3.1 Time-Series Forecasting Models

Although Transformer-based methods have almost become the standard in CV and NLP, various
network architectures (such as Transformer, CNN, and MLP) have recently been competing in time
series forecasting.
Transformer-based time series forecasting models have strong performance but high time and memory
complexity. Informer Zhou et al. [2021] proposed ProbSparse self-attention to reduce the complexity
from $O(T^2)$ to $O(T \log T)$. Pyraformer Liu et al. [2022a] utilizes the pyramid attention mechanism
to capture hierarchical multi-scale time dependencies with a time and memory complexity of $O(T)$.
PatchTST Nie et al. [2023] and Crossformer Zhang and Yan [2023] use the patch operation to reduce
the number of input tokens, thereby reducing time complexity. The performance of early multi-layer
perceptron-based (MLP-based) models is generally weaker than Transformer-based methods.
However, NLinear Zeng et al. [2023] and RLinear Li et al. [2023] combine different normalization
methods with a single-layer MLP, and achieve performance that exceeds Transformer-based models
on some datasets with extremely low computational cost.
Recurrent neural networks (RNNs) are suitable for handling sequence data, making them a favorable
choice for time series analysis. The size of the hidden state in RNN is independent of the input time
series length, so recent RNN-based time series prediction models, such as SegRNN Lin et al. [2023]
and WITRAN Jia et al. [2024], apply the RNN structure to time series forecasting with longer input.
Convolutional neural networks (CNNs) are frequently used in time series forecasting models in the
form of 1D convolution, such as ModernTCN donghao and wang xue [2024] and SCINet Liu et al.
[2022b], but TimesNet Wu et al. [2023] takes a different approach by converting the 1D time series
into a 2D matrix through the Fourier transform and then using 2D convolution for prediction. In
addition to the above four types of network structures, there is also time series forecasting work based
on new network architectures such as Mamba Dao and Gu [2024], Wang et al. [2024] and RWKV
Peng et al. [2023], Hou and Yu [2024].

3.2 Kolmogorov-Arnold Network

The Kolmogorov-Arnold representation theorem (KART) is the mathematical foundation of the
Kolmogorov-Arnold Network (KAN) Liu et al. [2024a], which gives KAN better fitting capability and
interpretability than Multi-Layer Perceptrons (MLPs) based on the universal approximation theorem. We
show the difference between KAN and MLP in Figure 1. Given an input tensor $x \in \mathbb{R}^{n_0}$, the structure
of an $L$-layer KAN network can be represented as:
$$\mathrm{KAN}(x) = (\Phi_L \circ \Phi_{L-1} \circ \cdots \circ \Phi_2 \circ \Phi_1)\, x, \tag{1}$$

Figure 2: The structure of RMoK and the MoK layer.

where $\Phi_l$, $l \in [1, 2, \cdots, L]$, is a KAN layer, and the output dimensions of the KAN layers can be
expressed as $[n_1, n_2, \cdots, n_L]$. The transformation of the $j$-th feature in the $l$-th layer can therefore be
written as:
$$x_{l,j} = \sum_{i=1}^{n_{l-1}} \phi_{l-1,j,i}\left(x_{l-1,i}\right), \quad j = 1, \cdots, n_l, \tag{2}$$

where $\phi$ consists of two parts, a spline function and a residual activation function, with learnable
parameters $w_b$, $w_s$:
$$\phi(x) = w_b\, \mathrm{SiLU}(x) + w_s\, \mathrm{Spline}(x), \tag{3}$$
where $\mathrm{Spline}(\cdot)$ is a linear combination of B-spline basis functions, $\mathrm{Spline}(x) = \sum_i c_i B_i(x)$.
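To make Equations (1)-(3) concrete, below is a minimal PyTorch sketch of a single KAN layer. This is not the official implementation: for brevity the spline term uses degree-1 ("hat") B-spline bases on a fixed grid, whereas the released KAN code uses higher-order B-splines with grid updates; the class and parameter names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleKANLayer(nn.Module):
    """Minimal sketch of one KAN layer following Eqs. (1)-(3):
    each edge applies phi(x) = w_b * SiLU(x) + w_s * Spline(x).
    The spline term here uses degree-1 ("hat") B-spline bases on a fixed grid,
    a simplification of the higher-order splines in the official KAN code."""

    def __init__(self, in_dim: int, out_dim: int, grid_size: int = 8,
                 grid_range: tuple = (-2.0, 2.0)):
        super().__init__()
        grid = torch.linspace(grid_range[0], grid_range[1], grid_size)
        self.register_buffer("grid", grid)                       # knot positions
        self.step = (grid_range[1] - grid_range[0]) / (grid_size - 1)
        # basis coefficients c_i, one set per (output, input) edge
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, grid_size))
        self.w_b = nn.Parameter(torch.ones(out_dim, in_dim))     # residual weight
        self.w_s = nn.Parameter(torch.ones(out_dim, in_dim))     # spline weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim)
        dist = (x.unsqueeze(-1) - self.grid).abs() / self.step   # (batch, in, grid)
        basis = torch.clamp(1.0 - dist, min=0.0)                 # hat B-spline values
        spline = torch.einsum("big,oig->boi", basis, self.coef)  # Spline(x) per edge
        resid = self.w_b * F.silu(x).unsqueeze(1)                # w_b * SiLU(x) per edge
        return (resid + self.w_s * spline).sum(dim=-1)           # sum over inputs, Eq. (2)


# Usage sketch: two stacked KAN layers mapping 96 input steps to 720 outputs
kan = nn.Sequential(SimpleKANLayer(96, 64), SimpleKANLayer(64, 720))
y = kan(torch.randn(32, 96))      # -> (32, 720)
```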
Recent studies have proposed a large number of KAN variants to expand its application scope and
improve its performance. Some work uses functions such as wavelet functions Bozorgasl and Chen
[2024], Taylor polynomials SS [2024] and Jacobi polynomials Aghaei [2024] instead of spline
functions to improve the performance of KAN in different fields. Some work extends KAN to the
fields of vision and graphics and proposes ConvKAN and GraphKAN Zhang and Zhang [2024],
Bodner et al. [2024]. Few works try to apply KAN to time series, but there is a lack of sufficient
theoretical and experimental analysis Vaca-Rubio et al. [2024]. This paper aims to analyze the
effectiveness of KAN and its variants in time series forecasting; we use XKAN to refer to these
methods uniformly (for example, TaylorKAN is KAN with Taylor polynomials).

4 RMoK
4.1 Mixture of KAN Experts Layer

Time series in real-world scenarios frequently exhibit non-stationarity, with their statistical properties
(such as mean and variance) varying over time. Moreover, there are significant distribution discrepan-
cies between variables in multivariate time series. This poses a significant challenge to time series
forecasting techniques, inevitably impacting the KAN-based methods as well. Fortunately, when
it comes to dealing with distribution shift, KAN has a unique characteristic compared to existing
Linear and Transformer-based networks: KAN has many variants using different spline functions.
Considering that the special spline function may be suitable for modeling certain data distributions,
we try to combine several KANs into a single layer and adaptively schedule them according to the
input data.
Following this idea, we propose the mixture of KAN experts (MoK) layer, which is a combination
of KAN and mixture of experts (MoE). The MoK layer uses a gating network to assign KAN layers
to variables according to temporal features, where each expert is responsible for a specific part of
the data. KAN and its variants differ only in the spline function in Equation 3, so we use $K(\cdot)$ to
represent these methods uniformly in this paper. Our proposed MoK layer with $N$ experts can be
formulated as:
$$x_{l+1} = \sum_{i=1}^{N} G(x_l)_i\, K_i(x_l), \tag{4}$$

Table 1: The statistics of the seven TSF datasets.
Dataset Variates Timesteps Granularity
ETTh1,h2 7 17,420 1 hour
ETTm1,m2 7 69,680 15 min
ECL 321 26,304 1 hour
Traffic 862 17,544 1 hour
Weather 21 52,696 10 min

where G(·) is a gating network. This mixture of experts structure can adapt to the diversity of time
series, with each expert learning different parts of temporal features, thus improving performance on
time series forecasting task.
The gating network is the key module of the MoK layer; it is responsible for learning the weight
of each expert from the input data. The softmax gating network $G_{\mathrm{softmax}}$ uses a softmax
function and a learnable weight matrix $w_g$ to schedule the input data:
$$G_{\mathrm{softmax}}(x) = \mathrm{Softmax}(x w_g). \tag{5}$$
This gating network is popular due to its simple structure. However, it activates all experts at once,
resulting in low efficiency when there are a large number of experts. Therefore, we adopt the sparse
gating network Shazeer et al. [2017], which only activates the best-matching top-$k$ experts. It adds
Gaussian noise to the input time series through $w_{\mathrm{noise}}$, and uses a KeepTopK operation to retain the
experts with the highest $k$ values. The sparse gating network can be described as:
$$G_{\mathrm{sparse}}(x) = \mathrm{Softmax}(\mathrm{KeepTopK}(H(x), k)), \tag{6}$$
$$H(x) = x w_g + \mathrm{Norm}(\mathrm{Softplus}(x w_{\mathrm{noise}})), \tag{7}$$
where $\mathrm{Norm}(\cdot)$ is standard normalization.
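As an illustration of Equations (4)-(7), the following is a hedged sketch of the sparse gating network and the resulting mixture of KAN experts. Names such as w_g, w_noise, and KeepTopK follow the text, but the exact noise handling and the expert interface are assumptions (following Shazeer et al. [2017]) rather than the released RMoK implementation, and for simplicity all experts are evaluated even though only the top-k receive non-zero weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseGate(nn.Module):
    """Sparse gating network of Eqs. (6)-(7): only the top-k experts get
    non-zero weights. The noisy-gating details follow Shazeer et al. [2017]
    and may differ from the paper's released code (assumption)."""

    def __init__(self, in_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_g = nn.Parameter(torch.zeros(in_dim, num_experts))
        self.w_noise = nn.Parameter(torch.zeros(in_dim, num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim) -> gate weights: (batch, num_experts)
        h = x @ self.w_g
        if self.training:  # noisy gating encourages exploration during training
            h = h + torch.randn_like(h) * F.softplus(x @ self.w_noise)
        topk_val, topk_idx = h.topk(self.k, dim=-1)                        # KeepTopK
        masked = torch.full_like(h, float("-inf")).scatter(-1, topk_idx, topk_val)
        return F.softmax(masked, dim=-1)  # entries outside the top-k become exactly 0


class MoKLayer(nn.Module):
    """Mixture of KAN experts (Eq. 4): a gate-weighted sum of expert outputs.
    `experts` is any list of modules mapping (batch, in_dim) -> (batch, out_dim);
    a real sparse implementation would dispatch only to the active experts."""

    def __init__(self, in_dim: int, experts: list, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.gate = SparseGate(in_dim, len(experts), k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.gate(x)                                    # (batch, N)
        outputs = torch.stack([e(x) for e in self.experts], -1)   # (batch, out, N)
        return (outputs * weights.unsqueeze(1)).sum(-1)           # Eq. (4)
```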

4.2 Reversible Mixture of KAN Experts Model

We could build a sophisticated KAN-based model by stacking multiple KANs or replacing the linear
layers of existing models with KANs. However, we instead design a simple KAN-based model that
is easy to analyze while achieving performance comparable to state-of-the-art time series
forecasting methods.
Inspired by several successful single-layer methods Li et al. [2023], Zeng et al. [2023], we propose a
simple, effective and interpretable KAN-based model, the Reversible Mixture of KAN Experts network
(RMoK), which uses RevIN Kim et al. [2022] and a single MoK layer, as shown in Figure 2. First,
RevIN+ (the normalization operation of RevIN) uses a learnable affine transformation to normalize
the input time series of each variable. Then, the MoK layer produces the prediction based on
the normalized time series features. Finally, RevIN− (the denormalization operation of RevIN) maps the
prediction back to the original distribution space using the affine transformation parameters from the
first step.
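For reference, here is a minimal sketch of the RevIN+/RevIN− wrapper described above, assuming per-variable instance statistics over the time dimension and a learnable per-variable affine transform; the parameter names and epsilon handling are our assumptions rather than the official RevIN code.

```python
import torch
import torch.nn as nn


class RevIN(nn.Module):
    """Reversible instance normalization (Kim et al. [2022]) sketch:
    normalize each variable of each sample before forecasting (RevIN+),
    then restore the original distribution afterwards (RevIN-)."""

    def __init__(self, num_vars: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_vars))   # learnable affine scale
        self.beta = nn.Parameter(torch.zeros(num_vars))   # learnable affine shift

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, num_vars); statistics per sample and per variable
        self.mean = x.mean(dim=1, keepdim=True).detach()
        self.std = x.std(dim=1, keepdim=True).detach() + self.eps
        return (x - self.mean) / self.std * self.gamma + self.beta

    def denormalize(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, pred_len, num_vars); reuse the statistics from normalize()
        return (y - self.beta) / (self.gamma + self.eps) * self.std + self.mean
```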
During the training stage, the gating network has a tendency to reach a winner-take-all state where
it always gives large weights to the same few experts. Following previous work Shazeer et al.
[2017], we add a load-balancing loss function to encourage the experts to have equal importance. First, we
count the weights of the experts as loads, and calculate the square of the coefficient of variation of the
load values as an additional loss:
$$\mathcal{L}_{\mathrm{load\text{-}balancing}} = \mathrm{CV}(\mathrm{loads})^2. \tag{8}$$
The total loss function is the sum of the prediction loss and the load-balancing loss with weight $w_l$:
$$\mathcal{L} = \mathrm{MSE}(Y, \hat{Y}) + w_l \cdot \mathcal{L}_{\mathrm{load\text{-}balancing}}. \tag{9}$$
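A small sketch of the load-balancing objective in Equations (8)-(9), assuming the load of an expert is the sum of its gate weights over a batch; the exact load statistic and the default value of w_l are assumptions, not values reported here.

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(gate_weights: torch.Tensor) -> torch.Tensor:
    """Eq. (8): squared coefficient of variation of the expert loads.
    gate_weights: (batch, num_experts), output of the gating network."""
    loads = gate_weights.sum(dim=0)               # per-expert load over the batch
    cv = loads.std() / (loads.mean() + 1e-10)     # coefficient of variation
    return cv ** 2


def total_loss(y: torch.Tensor, y_hat: torch.Tensor,
               gate_weights: torch.Tensor, w_l: float = 0.01) -> torch.Tensor:
    """Eq. (9): prediction loss plus weighted load-balancing loss.
    w_l = 0.01 is an assumed default, not a value reported in the paper."""
    return F.mse_loss(y_hat, y) + w_l * load_balancing_loss(gate_weights)
```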

5 Experiments
In this section, we experimentally validate the effectiveness of our proposed KAN-based model,
RMoK, on various time series forecasting benchmarks. Specifically, we conduct extensive experi-
ments, including a performance comparison, a running speed comparison, the impact of integrating the
KAN and MoK layers into another network structure (Transformer), and the interpretability of RMoK.

Table 2: Overall performance. Both the best results of KAN-based models and baselines are
highlighted in bold, and the best results of all methods are marked in red.
KAN-based Transformer-based CNN-based Linear-based
Dataset P RMoK-S RMoK-B PatchTST FEDformer TimesNet SCINet RLinear DLinear
MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.326 0.360 0.320 0.358 0.329 0.367 0.379 0.419 0.338 0.375 0.418 0.438 0.355 0.376 0.345 0.372
192 0.367 0.382 0.364 0.383 0.367 0.385 0.426 0.441 0.374 0.387 0.439 0.450 0.391 0.392 0.380 0.389
ETTm1 336 0.400 0.404 0.395 0.405 0.399 0.410 0.426 0.441 0.410 0.411 0.490 0.485 0.424 0.415 0.413 0.413
720 0.462 0.439 0.457 0.440 0.454 0.439 0.543 0.490 0.478 0.450 0.595 0.550 0.487 0.450 0.474 0.453
avg 0.389 0.396 0.384 0.397 0.387 0.400 0.448 0.452 0.400 0.406 0.485 0.481 0.414 0.407 0.403 0.407
96 0.176 0.261 0.176 0.261 0.175 0.259 0.203 0.287 0.187 0.267 0.286 0.377 0.182 0.265 0.193 0.292
192 0.244 0.305 0.240 0.302 0.241 0.302 0.269 0.328 0.249 0.309 0.399 0.445 0.246 0.304 0.284 0.362
ETTm2 336 0.306 0.346 0.299 0.342 0.305 0.343 0.325 0.366 0.321 0.351 0.637 0.591 0.307 0.342 0.369 0.427
720 0.405 0.404 0.397 0.401 0.402 0.400 0.421 0.415 0.408 0.403 0.960 0.735 0.407 0.398 0.554 0.522
avg 0.283 0.329 0.278 0.326 0.281 0.326 0.305 0.349 0.291 0.333 0.571 0.537 0.286 0.327 0.350 0.401
96 0.382 0.396 0.374 0.397 0.414 0.419 0.376 0.419 0.384 0.402 0.654 0.599 0.386 0.395 0.386 0.400
192 0.430 0.426 0.419 0.429 0.460 0.445 0.420 0.448 0.436 0.429 0.719 0.631 0.437 0.424 0.437 0.432
ETTh1 336 0.468 0.443 0.461 0.450 0.501 0.466 0.459 0.465 0.491 0.469 0.778 0.659 0.479 0.446 0.481 0.459
720 0.450 0.458 0.474 0.467 0.500 0.488 0.506 0.507 0.521 0.500 0.836 0.699 0.481 0.470 0.519 0.516
avg 0.433 0.431 0.432 0.436 0.469 0.454 0.440 0.460 0.458 0.450 0.747 0.647 0.446 0.434 0.456 0.452
96 0.313 0.357 0.301 0.353 0.302 0.348 0.358 0.397 0.340 0.374 0.707 0.621 0.288 0.338 0.333 0.387
192 0.397 0.409 0.379 0.405 0.388 0.400 0.429 0.439 0.402 0.414 0.860 0.689 0.374 0.390 0.477 0.476
ETTh2 336 0.441 0.446 0.432 0.446 0.426 0.433 0.496 0.487 0.452 0.452 1.000 0.744 0.415 0.426 0.594 0.541
720 0.453 0.461 0.446 0.463 0.431 0.446 0.463 0.474 0.462 0.468 1.249 0.838 0.420 0.440 0.831 0.657
avg 0.401 0.418 0.390 0.417 0.387 0.407 0.437 0.449 0.414 0.427 0.954 0.723 0.374 0.398 0.559 0.515
96 0.187 0.273 0.178 0.267 0.181 0.270 0.193 0.308 0.168 0.272 0.247 0.345 0.201 0.281 0.197 0.282
192 0.193 0.280 0.187 0.274 0.188 0.274 0.201 0.315 0.184 0.289 0.257 0.355 0.201 0.283 0.196 0.285
ECL 336 0.210 0.295 0.204 0.290 0.204 0.293 0.214 0.329 0.198 0.300 0.269 0.369 0.215 0.298 0.209 0.301
720 0.256 0.330 0.247 0.323 0.246 0.324 0.246 0.355 0.220 0.320 0.299 0.390 0.257 0.331 0.245 0.333
avg 0.211 0.294 0.204 0.288 0.205 0.290 0.214 0.327 0.192 0.295 0.268 0.365 0.219 0.298 0.212 0.300
96 0.601 0.367 0.541 0.340 0.462 0.295 0.587 0.366 0.593 0.321 0.788 0.499 0.649 0.389 0.650 0.396
192 0.576 0.354 0.529 0.330 0.466 0.296 0.604 0.373 0.617 0.336 0.789 0.505 0.601 0.366 0.598 0.370
Traffic 336 0.588 0.359 0.545 0.334 0.482 0.304 0.621 0.383 0.629 0.336 0.797 0.508 0.609 0.369 0.605 0.373
720 0.625 0.376 0.580 0.351 0.514 0.322 0.626 0.382 0.640 0.350 0.841 0.523 0.647 0.387 0.645 0.394
avg 0.597 0.364 0.549 0.339 0.481 0.304 0.610 0.376 0.620 0.336 0.804 0.509 0.626 0.378 0.625 0.383
96 0.175 0.225 0.171 0.221 0.177 0.218 0.217 0.296 0.172 0.220 0.221 0.306 0.192 0.232 0.196 0.255
192 0.224 0.267 0.220 0.263 0.225 0.259 0.276 0.336 0.219 0.261 0.261 0.340 0.240 0.271 0.237 0.296
Weather 336 0.284 0.308 0.277 0.302 0.278 0.297 0.339 0.380 0.280 0.306 0.309 0.378 0.292 0.307 0.283 0.335
720 0.366 0.358 0.360 0.354 0.354 0.348 0.403 0.428 0.365 0.359 0.377 0.427 0.364 0.353 0.345 0.381
avg 0.262 0.290 0.257 0.285 0.259 0.281 0.309 0.360 0.259 0.287 0.292 0.363 0.272 0.291 0.265 0.317
Top-1 Count 31 23 7 12

Table 3: The performance of KAN-based models vs. Linear-based models. The best results among
models with and without MoE are highlighted in bold, and the best results among all methods are marked in
red.
with MoE w/o MoE
Dataset P RMoK-S RMoL-S RWavKAN RTaylorKAN RLinear
MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.326 0.360 0.335 0.367 0.341 0.370 0.340 0.368 0.355 0.376
192 0.367 0.382 0.374 0.385 0.374 0.384 0.380 0.386 0.391 0.392
ETTm1 336 0.400 0.404 0.407 0.406 0.406 0.405 0.412 0.406 0.424 0.415
720 0.462 0.439 0.470 0.440 0.465 0.438 0.474 0.439 0.487 0.450
avg 0.389 0.396 0.397 0.400 0.397 0.399 0.402 0.400 0.414 0.407
96 0.382 0.396 0.382 0.392 0.396 0.402 0.388 0.398 0.386 0.395
192 0.430 0.426 0.436 0.423 0.439 0.430 0.438 0.424 0.437 0.424
ETTh1 336 0.468 0.443 0.487 0.449 0.479 0.449 0.477 0.442 0.479 0.446
720 0.450 0.458 0.486 0.472 0.464 0.461 0.477 0.461 0.481 0.470
avg 0.433 0.431 0.448 0.434 0.444 0.435 0.445 0.431 0.446 0.434
96 0.175 0.225 0.174 0.223 0.180 0.228 0.182 0.232 0.192 0.232
192 0.224 0.267 0.226 0.267 0.227 0.268 0.235 0.274 0.240 0.271
Weather 336 0.284 0.308 0.286 0.308 0.284 0.308 0.291 0.312 0.292 0.307
720 0.366 0.358 0.367 0.358 0.368 0.360 0.370 0.360 0.364 0.353
avg 0.262 0.290 0.263 0.289 0.264 0.291 0.269 0.294 0.272 0.291

Table 4: The performance of KAN within iTransformer. The best results among all methods are highlighted
in bold. KAN, JacobKAN, and TaylorKAN are KANs with B-spline functions, Jacobi polynomials,
and Taylor polynomials, respectively.
iTransformer +KAN +JacobKAN +TaylorKAN +MoK (Ours)
Dataset P
MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.341 0.373 0.689 0.547 0.340 0.373 0.331 0.365 0.329 0.364
192 0.388 0.397 0.707 0.556 0.382 0.395 0.379 0.392 0.376 0.390
ETTm1 336 0.426 0.420 0.720 0.565 0.418 0.419 0.417 0.416 0.412 0.415
720 0.498 0.458 0.746 0.582 0.487 0.456 0.484 0.450 0.478 0.453
avg 0.413 0.412 0.716 0.562 0.407 0.411 0.403 0.406 0.399 0.406
96 0.390 0.404 0.696 0.557 0.401 0.414 0.397 0.409 0.393 0.409
192 0.445 0.436 0.716 0.572 0.454 0.445 0.443 0.436 0.442 0.441
ETTh1 336 0.489 0.457 0.721 0.583 0.499 0.469 0.491 0.461 0.476 0.458
720 0.505 0.487 0.731 0.607 0.521 0.499 0.510 0.488 0.494 0.485
avg 0.457 0.446 0.716 0.580 0.469 0.457 0.460 0.448 0.451 0.448

Table 5: Comparison of model parameters and running speed. All results are tested with
L=96 and P=720 on ETTh1. The training batch size is 64 and the inference batch size is 1.
Model Param Train (it/s) Infer (it/s)
DLinear 139 K 102.51 360.93
RLinear 69.9 K 92.11 345.04
iTransformer 3.6 M 51.03 188.79
PatchTST 7.6 M 43.92 193.58
RKAN 1.2 M 5.94 193.25
RWKAN 277 K 70.39 308.69
RTKAN 208 K 77.87 316.94
RJKAN (d=4) 345 K 76.77 280.06
RJKAN (d=6) 483 K 58.96 254.65
RMoK-S 972 K 34.86 190.57

5.1 Experimental Settings

Dataset. We conduct extensive experiments on seven widely-used real-world datasets, including
ETT(h1, h2, m1, m2) Zhou et al. [2021], ECL, Traffic and Weather Lai et al. [2018], whose statistical
information is shown in Table 1. We follow the same data processing operations used in TimesNet Wu
et al. [2023], where the training, validation, and testing sets are divided according to chronological
order.

Evaluation Metrics. Following previous works Zhou et al. [2021], Wu et al. [2023], we use Mean
Squared Error (MSE) and Mean Absolute Error (MAE) to evaluate the performance of time series
forecasting.

Baselines. We select six well-known forecasting models as our baselines, including (1)
Transformer-based methods: PatchTST Nie et al. [2023] and FEDformer Zhou et al. [2022]; (2)
CNN-based methods: TimesNet Wu et al. [2023] and SCINet Liu et al. [2022b]; (3) Linear-based
methods: RLinear Li et al. [2023] and DLinear Zeng et al. [2023].

5.2 Can KAN-based Models get SOTA Performance?

In this section, we conduct extensive experiments to compare the forecasting performance of KAN-
based models with advanced baselines.
We use two versions of our proposed RMoK: RMoK-S (the small version, which has four experts),
and RMoK-B (the base version, which has eight experts). We compare RMoKs with six popular
Transformer-based, CNN-based and Linear-based baselines on seven benchmarks. The experimental

Figure 3: Heatmap of the expert loads in the RMoK model on the Weather dataset. A value of 1 at
location (x, y) means that the gating network assigns the y-th variable to the x-th expert in all
test samples.

results are shown in Table 2, where both the best results of KAN-based models and baselines are
highlighted in bold and the best results of all models are marked in red. The results of baselines
come from published papers Wu et al. [2023], and the results of RMoK are averaged over four runs
with fixed seeds {0, 1, 2, 3}. Surprisingly, RMoK achieves the best results in most
cases. Considering that RMoK is a simple single-layer method that does not model correlations
among variates, this empirical finding adequately demonstrates that KAN-based models are effective
in the TSF task.
Specifically, we can divide the seven datasets into two groups based on the number of variables.
On the four ETT subsets with fewer variables, RMoK outperforms the baselines in most cases, and RMoK-S
and RMoK-B exhibit their own strengths under different forecasting lengths P. On the Weather, ECL,
and Traffic datasets with more variables, RMoK-B is significantly better than RMoK-S, indicating
that the mixture-of-experts approach is suitable for dealing with a large number of variables. In
addition, due to the intricate spatiotemporal correlation among variables in the Traffic dataset, the
Transformer-based method PatchTST achieves the best results, whereas our proposed simple RMoK
achieves suboptimal results that still far exceed the other baselines.

5.3 Does KAN Outperform Linear?

In this section, we conduct ablation experiments to compare KAN-based models with Linear-based
models on three time series forecasting datasets.
For a fair comparison, we replace the four KAN experts in RMoK-S with four Linear experts to obtain
a Linear-based baseline with the mixture-of-experts structure, named RMoL-S. We also replace the
entire MoK layer with a single KAN or Linear layer to obtain RWavKAN, RTaylorKAN and RLinear, to
analyze the performance of KAN variants and Linear on the TSF task. The experimental results
are reported in Table 3, where all results are averaged over four runs and the best results
are highlighted in bold. We draw three useful empirical observations. First, the KAN-based
model outperforms the Linear-based model in most cases. We speculate this is due to
KAN's function-representation idea, which is more effective at capturing the periodicity and trend in
time series. Second, the mixture-of-experts structure is applicable to both KAN and Linear, which
should be attributed to the fact that the gating network assigns variables to specific experts. Third, the
performance of KAN-based models is affected by the specific spline function, which may be related to the
intrinsic distribution of the time series.

5.4 Can KAN be Integrated into Other Methods?

In this section, we verify whether KAN can be integrated into the existing time series forecasting
models as a plug-in to improve performance. We select iTransformer Liu et al. [2024b] as the baseline
and replace the linear projections of all attention modules with different KAN variants and the MoK
layer. The experiments are conducted on the ETT datasets. For a fair comparison, we use the same
model hyperparameters in all experiments (hidden dimension 512, two layers), search for the best
learning rate from 1e-2 to 1e-5 with a grid search, and repeat each experiment four times to
report the average results in Table 4. While iTransformer with the various KANs performs well only
on ETTm1, iTransformer with MoK achieves the best performance in most cases on both ETTh1
and ETTm1. These experimental findings demonstrate that MoK is a successful way of integrating
KAN into Transformer-based methods.

Figure 4: Visualization of KAN learning the periodicity in the data. (a) Data visualization. (b) The
weight of each input feature.

5.5 Are KAN-based Models Efficient?

We report the model parameter counts and the training and inference speed of KAN-based methods
and baselines on the ETTh1 dataset with input length 96 and prediction length 720. We implement all
methods in a unified code library with PyTorch Paszke et al. [2019], and the testing platform is a GPU
server with NVIDIA A100 80GB GPUs. We set the training batch size to 64 and the inference batch size to 1.
KAN, WKAN, TKAN, and JKAN denote KANs with B-spline functions, wavelet functions (Mexican
hat), Taylor polynomials (order 3), and Jacobi polynomials (d is the degree, set to 4 or 6), respectively.
As shown in Table 5, the running speed of KANs is affected by the specific implementation.
The KAN variant with Taylor polynomials achieves running efficiency close to Linear. With future
hardware optimization, the efficiency of KANs can be further improved.

5.6 Are KAN-based Models Interpretable?

In this section, we aim to analyze the interpretability of RMoK for the time series forecasting task.
First, we generate a heatmap to visualize the outputs of the gating network, which decide variable-
expert assignment. This visualization demonstrates how the RMoK simplifies the multivariate
time series forecasting task into multiple univariate forecasting subtasks. Then, we analyze what
knowledge RMoK learns from real-world time-varying systems in each subtask.
Our proposed RMoK model uses a single MoK layer, which consists of gating network and KAN
experts. The gating network outputs the matching score between variables and experts based on
the input feature, and the top-k matching experts are selected for each variable to predict the future
state. We train an RMoK-S model with 4 experts on the Weather dataset with 21 variables, then count
the top-1 scores of all samples in the test set and generate the heatmap in Figure 3, where (x, y) = 1.0
means the y-th variable matches the x-th expert in all test samples. This figure shows that, although the
matching score is related to the time-varying input data, each variable is still closely tied to a
specific expert. Therefore, we can roughly approximate the trained RMoK as a linear combination of
several KANs, and RMoK treats the multivariate forecasting task as multiple univariate forecasting
tasks. Although the above process is crude, this simplification helps us analyze the explainability of
RMoK in complex real-world scenarios.
After this simplification, we analyze why KAN is effective in time series forecasting from the
perspective of a univariate, single-expert subtask. We use the temperature variable from the Weather
dataset, which collects data every 10 minutes, to train a KAN with B-spline functions as the forecasting
model. We decompose and visualize the data into trend, seasonal, and residual terms using the
statsmodels library (https://www.statsmodels.org/stable/index.html). As shown in Figure 4, the
temperature series has an obvious daily periodicity, which is consistent with everyday experience.
Then, we input a complete daily period (the past 144 time steps) of data to predict the state of the
next time step. We visualize the weight of each feature dimension of the trained RMoK. As shown
in Figure 4b, there are three peaks in the feature weights, located near time steps 0, 72 and 144.
Step 0 corresponds to the temperature at the same time of the previous day's period, step 144
corresponds to the temperature at the adjacent moment, and step 72 corresponds to half a period.
These three weight peaks correspond to the three zero points of the cosine function in one period.
Finally, we conclude that RMoK can learn the periodicity of time series, which preliminarily explains
its effectiveness in time series prediction.

6 Conclusion
This work discusses the effectiveness of KANs in time series forecasting. Motivated by the many KAN
variants with different spline functions, we propose a single-layer mixture of KAN experts model (RMoK)
to alleviate distribution variation in time series. We experimentally compared KAN with existing
baseline methods across various network architectures in terms of performance, integration, efficiency,
and interpretability. Experimental results on seven real-world datasets show that RMoK performs best
on most metrics, which is sufficient to conclude that KAN and KAN-based models are effective in the
time series forecasting task and that KAN can gain a place in the increasingly fierce model-architecture
competition. The single-layer KAN-based model proposed in this work is only a preliminary attempt
to introduce KAN into time series forecasting. We hope that our work can help future studies improve
the performance and interpretability of KAN-based models.

References
Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Accurate medium-
range global weather forecasting with 3d neural networks. Nature, 619(7970):533–538, 2023.
Siyu Gao, Yunbo Wang, and Xiaokang Yang. Stockformer: learning hybrid trading machines with
predictive coding. In Proceedings of the Thirty-Second International Joint Conference on Artificial
Intelligence, IJCAI ’23, 2023. ISBN 978-1-956792-03-4. doi: 10.24963/ijcai.2023/530. URL
https://doi.org/10.24963/ijcai.2023/530.
Germans Savcisens, Tina Eliassi-Rad, Lars Kai Hansen, Laust Hvas Mortensen, Lau Lilleholt, Anna
Rogers, Ingo Zettler, and Sune Lehmann. Using sequences of life-events to predict human lives.
Nature Computational Science, 4(1):43–56, 2024.
Xiao Han, Xinfeng Zhang, Yiling Wu, Zhenduo Zhang, Tianyu Zhang, and Yaowei Wang. Knowledge-
based multiple relations modeling for traffic forecasting. IEEE Transactions on Intelligent Trans-
portation Systems, pages 1–14, 2024a. doi: 10.1109/TITS.2024.3373123.
Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth
64 words: Long-term forecasting with transformers. In International Conference on Learning
Representations, 2023.
Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet:
Temporal 2d-variation modeling for general time series analysis. International Conference on
Learning Representations, 2023.
Xiao Han, Zhenduo zhang, Yiling Wu, Xinfeng Zhang, and Zhe Wu. Event traffic forecasting with
sparse multimodal data. In ACM Multimedia 2024, 2024b. URL https://openreview.net/forum?id=7XyXOOGYfV.
Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić,
Thomas Y Hou, and Max Tegmark. Kan: Kolmogorov-arnold networks. arXiv preprint
arXiv:2404.19756, 2024a.
Sidharth SS. Chebyshev polynomial-based kolmogorov-arnold networks: An efficient architecture
for nonlinear function approximation. arXiv preprint arXiv:2405.07200, 2024.
Zavareh Bozorgasl and Hao Chen. Wav-kan: Wavelet kolmogorov-arnold networks. arXiv preprint
arXiv:2405.12832, 2024.

Alireza Afzal Aghaei. fkan: Fractional kolmogorov-arnold networks with trainable jacobi basis
functions. arXiv preprint arXiv:2406.07456, 2024.
Qi Qiu, Tao Zhu, Helin Gong, Liming Luke Chen, and Huansheng Ning. Relu-kan: New kolmogorov-
arnold networks that only need matrix addition, dot multiplication, and relu. ArXiv, abs/2406.02075,
2024. URL https://api.semanticscholar.org/CorpusID:270226410.
Alexander Dylan Bodner, Antonio Santiago Tepsich, Jack Natan Spolski, and Santiago Pourteau.
Convolutional kolmogorov-arnold networks. arXiv preprint arXiv:2406.13155, 2024.
Fan Zhang and Xin Zhang. Graphkan: Enhancing feature extraction with graph kolmogorov arnold
networks. arXiv preprint arXiv:2406.13597, 2024.
Jinfeng Xu, Zheyu Chen, Jinze Li, Shuo Yang, Wei Wang, Xiping Hu, and Edith C-H Ngai. Fourierkan-
gcf: Fourier kolmogorov-arnold network–an effective and efficient feature transformation for graph
collaborative filtering. arXiv preprint arXiv:2406.01034, 2024.
Basim Azam and Naveed Akhtar. Suitability of kans for computer vision: A preliminary investigation.
arXiv preprint arXiv:2406.09087, 2024.
Runpeng Yu, Weihao Yu, and Xinchao Wang. Kan or mlp: A fairer comparison. arXiv preprint
arXiv:2407.16674, 2024.
Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings
of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115, 2021.
Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X. Liu, and Schahram Dust-
dar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling
and forecasting. In International Conference on Learning Representations, 2022a. URL
https://api.semanticscholar.org/CorpusID:251649164.
Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency
for multivariate time series forecasting. In The Eleventh International Conference on Learning
Representations, 2023. URL https://openreview.net/forum?id=vSVLM2j9eie.
Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series
forecasting? In Brian Williams, Yiling Chen, and Jennifer Neville, editors, Thirty-Seventh
AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative
Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances
in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 11121–
11128. AAAI Press, 2023. doi: 10.1609/AAAI.V37I9.26317. URL https://doi.org/10.1609/aaai.v37i9.26317.
Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. Revisiting long-term time series forecasting: An investi-
gation on linear mapping. ArXiv, abs/2305.10721, 2023. URL https://api.semanticscholar.org/CorpusID:258762346.
Shengsheng Lin, Weiwei Lin, Wentai Wu, Feiyu Zhao, Ruichao Mo, and Haotong Zhang. Seg-
rnn: Segment recurrent neural network for long-term time series forecasting. arXiv preprint
arXiv:2308.11200, 2023.
Yuxin Jia, Youfang Lin, Xinyan Hao, Yan Lin, Shengnan Guo, and Huaiyu Wan. Witran: Water-wave
information transmission and recurrent acceleration network for long-range time series forecasting.
Advances in Neural Information Processing Systems, 36, 2024.
Luo donghao and wang xue. ModernTCN: A modern pure convolution structure for general time
series analysis. In The Twelfth International Conference on Learning Representations, 2024. URL
https://openreview.net/forum?id=vpJMJerXHU.
Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu. Scinet:
time series modeling and forecasting with sample convolution and interaction. In Proceedings
of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red
Hook, NY, USA, 2022b. Curran Associates Inc. ISBN 9781713871088.

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through
structured state space duality. arXiv preprint arXiv:2405.21060, 2024.
Zihan Wang, Fanheng Kong, Shi Feng, Ming Wang, Han Zhao, Daling Wang, and Yifei Zhang. Is
mamba effective for time series forecasting? arXiv preprint arXiv:2403.11144, 2024.
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman,
Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi
Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej
Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu
Song, Xiangru Tang, Johan Wind, Stanisław Woźniak, Zhenyuan Zhang, Qinghua Zhou, Jian
Zhu, and Rui-Jie Zhu. RWKV: Reinventing RNNs for the transformer era. In Houda Bouamor,
Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics:
EMNLP 2023, pages 14048–14077, Singapore, December 2023. Association for Computational
Linguistics. doi: 10.18653/v1/2023.findings-emnlp.936. URL https://aclanthology.org/2023.findings-emnlp.936.
Haowen Hou and F. Richard Yu. RWKV-TS: beyond traditional recurrent neural network for
time series tasks. CoRR, abs/2401.09093, 2024. doi: 10.48550/ARXIV.2401.09093. URL
https://doi.org/10.48550/arXiv.2401.09093.
Cristian J Vaca-Rubio, Luis Blanco, Roberto Pereira, and Màrius Caus. Kolmogorov-arnold networks
(kans) for time series analysis. arXiv preprint arXiv:2405.08790, 2024.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton,
and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.
In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=B1ckMDqlg.
Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Re-
versible instance normalization for accurate time-series forecasting against distribution shift. In
International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=cGDAkQo1C0p.
Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term
temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference
on Research & Development in Information Retrieval, SIGIR ’18, page 95–104, New York, NY,
USA, 2018. Association for Computing Machinery. ISBN 9781450356572. doi: 10.1145/3209978.
3210006. URL https://doi.org/10.1145/3209978.3210006.
Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer: Frequency
enhanced decomposed transformer for long-term series forecasting. In International conference on
machine learning, pages 27268–27286. PMLR, 2022.
Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long.
itransformer: Inverted transformers are effective for time series forecasting. In The Twelfth
International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=JePfAI8fah.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style,
high-performance deep learning library. Advances in neural information processing systems, 32,
2019.

A Implementation Details
A.1 Datasets

We conduct extensive experiments on seven real-world datasets, including ETT(h1, h2, m1, m2)
Zhou et al. [2021], ECL, Traffic and Weather Lai et al. [2018], whose detailed information is shown in
Table 6. We follow the same data processing operations used in TimesNet Wu et al. [2023], where
the training, validation, and testing sets are divided according to chronological order. The datasets we
used cover multiple domains (electricity, transportation, and weather), which are sufficient to verify
the generalizability of our method.

Table 6: The information of the seven time series forecasting datasets.


Dataset Variates Timesteps Granularity Domain
ETTh1,h2 7 17,420 1 hour Electricity
ETTm1,m2 7 69,680 15 min Electricity
ECL 321 26,304 1 hour Electricity
Traffic 862 17,544 1 hour Transportation
Weather 21 52,696 10 min Weather

A.2 Baselines

To comprehensively compare KAN with other baselines under different network structures, we
select six well-known forecasting models as our baselines, including (1) Transformer-based methods:
PatchTST and FEDformer; (2) CNN-based methods: TimesNet and SCINet; (3) Linear-based
methods: RLinear and DLinear. We report the details of these baselines in Table 7.

Table 7: The information of baselines.


Name Type Cite
PatchTST Transformer Nie et al. [2023]
FEDformer Transformer Zhou et al. [2022]
TimesNet CNN Wu et al. [2023]
SCINet CNN Liu et al. [2022b]
RLinear Linear Li et al. [2023]
DLinear Linear Zeng et al. [2023]

A.3 Experimental Settings

We implement our method and baselines in a unified code library with PyTorch2 and PyTorch Lightning3
(code will be released after acceptance), and we conduct all experiments on a GPU server with
NVIDIA A100 80GB GPUs. To prevent the sample drop in the test phase where the samples of the
last batch are discarded, we set the test batch size to 1. Following previous works Zhou et al. [2021],
Wu et al. [2023], we use Mean Squared Error (MSE) and Mean Absolute Error (MAE) to evaluate
the performance of time series forecasting.

$$\mathrm{MAE}(x, \hat{x}) = \frac{1}{N} \sum_{i=1}^{N} |x_i - \hat{x}_i|, \qquad \mathrm{MSE}(x, \hat{x}) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2, \tag{10}$$
where $x_i$ denotes the $i$-th ground truth, $\hat{x}_i$ represents the $i$-th predicted value, and $N$ represents the
number of testing samples.
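For completeness, here is a direct PyTorch transcription of Equation (10); the tensors are whatever the evaluation loop accumulates over the test set.

```python
import torch


def mae(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """Mean Absolute Error, Eq. (10)."""
    return (x - x_hat).abs().mean()


def mse(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """Mean Squared Error, Eq. (10)."""
    return ((x - x_hat) ** 2).mean()
```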
2 https://github.com/pytorch/pytorch
3 https://github.com/Lightning-AI/pytorch-lightning

B Experimental Analysis
We report hyperparameter sensitivity and performance robustness results to further
evaluate the effectiveness of KAN-based models in time series forecasting.

B.1 Hyperparameter Sensitivity

We conduct hyperparameter experiments to compare the performance of MoK and single KAN layers.
Results are shown in Table 8, where WavKAN, TaylorKAN and JacobKAN are KANs with wavelet
functions, Taylor polynomials, and Jacobi polynomials. The type of spline function affects KAN's
ability to model time series, but there is currently a lack of theoretical guidance on how to choose the
most suitable spline function, which can only be determined experimentally. Thus,
MoK becomes a best practice for KAN, since it does not require extensive experiments to select the
optimal spline function. This cocktail-like solution achieves better performance than a single KAN
layer in most cases and achieves overall performance improvements on datasets with a large number
of variables (such as Weather, Traffic, ECL).

Table 8: The performance of KAN-based models. The best results among models with and without MoE
are highlighted in bold, and the best results among all methods are marked in red.
with MoE w/o MoE
Setting
MoK-S MoK-B RWavKAN RTaylorKAN RJacobKAN
MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.326 0.360 0.320 0.358 0.341 0.370 0.340 0.368 0.353 0.376
ETTm1

192 0.367 0.382 0.364 0.383 0.374 0.384 0.380 0.386 0.386 0.391
336 0.400 0.404 0.395 0.405 0.406 0.405 0.412 0.406 0.413 0.410
720 0.462 0.439 0.457 0.440 0.465 0.438 0.474 0.439 0.465 0.442
avg 0.389 0.396 0.384 0.397 0.397 0.399 0.402 0.400 0.404 0.405
96 0.176 0.261 0.176 0.261 0.179 0.265 0.181 0.264 0.177 0.263
ETTm2

192 0.244 0.305 0.240 0.302 0.247 0.309 0.247 0.306 0.246 0.308
336 0.306 0.346 0.299 0.342 0.311 0.349 0.308 0.345 0.309 0.348
720 0.405 0.404 0.397 0.401 0.412 0.408 0.405 0.401 0.406 0.403
avg 0.283 0.329 0.278 0.326 0.287 0.333 0.285 0.329 0.284 0.331
96 0.382 0.396 0.374 0.397 0.396 0.402 0.388 0.398 0.389 0.411
ETTh1

192 0.430 0.426 0.419 0.429 0.439 0.430 0.438 0.424 0.434 0.437
336 0.468 0.443 0.461 0.450 0.479 0.449 0.477 0.442 0.464 0.451
720 0.450 0.458 0.474 0.467 0.464 0.461 0.477 0.461 0.452 0.460
avg 0.433 0.431 0.432 0.436 0.444 0.435 0.445 0.431 0.435 0.440
96 0.313 0.357 0.301 0.353 0.321 0.358 0.297 0.344 0.310 0.352
ETTh2

192 0.397 0.409 0.379 0.405 0.407 0.413 0.387 0.401 0.396 0.405
336 0.441 0.446 0.432 0.446 0.452 0.450 0.434 0.441 0.437 0.441
720 0.453 0.461 0.446 0.463 0.487 0.477 0.448 0.458 0.445 0.456
avg 0.401 0.418 0.390 0.417 0.417 0.424 0.392 0.411 0.397 0.414
96 0.187 0.273 0.178 0.267 0.202 0.284 0.203 0.282 0.201 0.284
192 0.193 0.280 0.187 0.274 0.203 0.285 0.205 0.285 0.200 0.285
ECL

336 0.210 0.295 0.204 0.290 0.218 0.299 0.221 0.300 0.216 0.301
720 0.256 0.330 0.247 0.323 0.262 0.332 0.265 0.333 0.260 0.333
avg 0.211 0.294 0.204 0.288 0.221 0.300 0.224 0.300 0.220 0.301
96 0.175 0.225 0.171 0.221 0.180 0.228 0.182 0.232 0.178 0.230
Weather

192 0.224 0.267 0.220 0.263 0.227 0.268 0.235 0.274 0.229 0.271
336 0.284 0.308 0.277 0.302 0.284 0.308 0.291 0.312 0.288 0.310
720 0.366 0.358 0.360 0.354 0.368 0.360 0.370 0.360 0.368 0.359
avg 0.262 0.290 0.257 0.285 0.264 0.291 0.269 0.294 0.266 0.292
