Motion Code Arxiv
Abstract
Despite extensive research, time series classification and forecasting on noisy data
remain highly challenging. The main difficulties lie in finding suitable mathematical
concepts to describe time series and effectively separate noise from the true
signals. Unlike traditional methods treating time series as static vectors or fixed
sequences, we propose a novel framework that views each time series, regardless
of length, as a realization of a continuous-time stochastic process. This mathematical
approach captures dependencies across timestamps and detects hidden,
time-varying signals within the noise. However, real-world data often involves
multiple distinct dynamics, making it insufficient to model the entire process with
a single stochastic model. To address this, we assign each dynamic a unique
signature vector and introduce the concept of "most informative timestamps" to
infer a sparse approximation of the individual dynamics from these vectors. The
resulting model, called Motion Code, includes parameters that fully capture diverse
underlying dynamics in an integrated manner, enabling simultaneous classification
and forecasting of time series. Extensive experiments on noisy datasets, including
real-world Parkinson’s disease sensor tracking, demonstrate Motion Code’s
strong performance against established benchmarks for time series classification
and forecasting.
1 INTRODUCTION
Noisy time series analysis is a challenging problem due to the difficulty in finding appropriate
mathematical models to represent and study such data, unlike images or text. For example, consider
two groups of time series, each representing the audio data of a word pronounced by speakers with
different accents, and with varying lengths between 80 and 95 data points (see Figure 1). Common
methods, like distance-based [Jeong et al., 2011] or shapelet-based [Bostrom and Bagnall, 2017]
approaches, treat time series as ordered vectors, but fail when those vectors are highly mixed, with
many red series resembling blue ones (see Figure 1a). Deep learning methods, such as recurrent
neural networks (e.g., LSTM-FCN [Karim et al., 2019]) or convolutional networks (e.g., ROCKET
[Dempster et al., 2020]), struggle to capture higher-order correlations in datasets with many data
points but a limited number of individual series, as in this example. Techniques that rely on empirical
statistics, such as dictionary-based [Schäfer, 2015] or interval-based methods [Deng et al., 2013], are
also unreliable because noise can distort the collected statistics, making it difficult to separate noise
from true signals. These challenges motivate our approach, which models time series collections
using stochastic processes to recover underlying dynamics from noisy data.
We propose modeling each time series as an instance of a common stochastic process governing
the group’s dynamics. While individual series may be noisy, our method recovers the underlying
stochastic process, capturing key patterns such as increases, decreases, or stable phases, as shown by
the skeleton approximations in Figure 1b and Figure 1c. These approximations reveal core signals
and the statistical relationships between data points.
∗ Equal contribution
(a) 2 Time Series Collections (b) Absorptivity (c) Anything
Figure 1: (a): Two Collections Of Time Series Representing Pronunciation Audio Data For The Words
Absorptivity And Anything. (b) And (c): Most Informative Timestamps For The Pronunciation Of
Absorptivity (Red) And Anything (Blue).
Modeling time series with multiple dynamics, however, requires more than a single stochastic process.
Unlike previous methods [Qi et al., 2010, Durbin, 2012, Cao et al., 2013, Moss et al., 2023] that focus
on a single series, our framework introduces the most informative timestamps (see Definition 2) to
identify key features across multiple series. In addition, each dynamic model is assigned a signature
vector, or motion code, which is jointly optimized using sparse learning techniques to prevent
overfitting. This enables us to accurately distinguish between different dynamics. For instance, in
Figure 1, scatter points reveal two distinct patterns that would otherwise be hidden in the noisy, mixed
time series collections.
Combining these innovations, our proposed model, Motion Code, effectively learns from multiple
underlying processes in a robust and comprehensive way. Our key contributions include:
1. Motion Code: A model that jointly learns from noisy time series collections, explicitly
modeling the underlying stochastic process to separate noise from core signals.
2. Irregular Data Handling: Motion Code handles out-of-sync, varying-length, and missing
data directly without interpolation, preserving temporal structure and avoiding alignment
distortions.
Time Series Classification: Common techniques include distance-based methods [Jeong et al.,
2011], interval-based [Deng et al., 2013], dictionary-based [Schäfer, 2015], shapelet-based [Bostrom
and Bagnall, 2017], feature-based [Lubba et al., 2019], and ensemble models [Lines et al., 2016,
Middlehurst et al., 2021]. In deep learning, popular approaches involve convolutional neural networks
[Karim et al., 2019, Dempster et al., 2020], residual networks [Ma et al., 2016], and autoencoders [Hu
et al., 2016].
Time Series Forecasting: Methods for forecasting include exponential smoothing [Holt, 2004],
TBATS [De Livera et al., 2011], ARIMA [Malki et al., 2021], probabilistic state-space models
[Durbin, 2012], and deep learning frameworks [Lim and Zohren, 2021, Liu et al., 2021].
Stochastic Modeling: Gaussian processes [Rasmussen and Williams, 2006] are widely used for
continuous-time series. To reduce computational cost, sparse Gaussian processes [Titsias, 2009] have
been developed. Building on these approaches, advanced generative models with either approximate
or exact inference [Qi et al., 2010, Cao et al., 2013, Moss et al., 2023] have been introduced. However,
these methods are typically limited to individual time series, whereas our approach extends to
multi-time series collections, enabling joint modeling across different series.
The rest of the paper is organized as follows: Section 2 details the mathematical and algorithmic
framework for Motion Code. Section 3 presents experiments and benchmarking for classification and
forecasting tasks. Section 4 discusses the benefits of our framework.
1. Classification: Given a new time series $y = (y(t))_{t \in T}$ with timestamps $T$, classify it into
one of the $L$ possible groups.
2. Forecasting: For a time series $y$ generated by $G_k$ ($k \in \{1, \dots, L\}$), predict future values at new
timestamps $T$, i.e., predict $\{y_t\}_{t \in T}$.
Stochastic Process And Notations: Recall that a stochastic process $G$ is defined as $\{g(t)\}_{t \ge 0}$, where
$g$ is a random function, and $g(t)$ is the random data point at time $t$. The random function $g$ is referred
to as the underlying signal of the process $G$. For any timestamps set $T$, let $g_T$ denote the signal
vector $(g(t))_{t \in T} \in \mathbb{R}^{|T|}$.
Data Assumptions: We assume that the observed time series data points $y$ are normally distributed
around the underlying signal of their respective stochastic processes. Specifically, let $g_k$ represent
the underlying signal of the stochastic process $G_k$, i.e., $G_k = \{g_k(t)\}_{t \ge 0}$. Then, for $k \in \{1, \dots, L\}$,
the data $y^{i,k} = (y^{i,k}_t)_{t \in T_{i,k}}$ follows a Gaussian distribution with mean $(g_k)_{T_{i,k}} = (g_k(t))_{t \in T_{i,k}} \in \mathbb{R}^{|T_{i,k}|}$
and covariance matrix $\sigma^2 I_{|T_{i,k}|}$, where $I_n$ is the $n \times n$ identity matrix. The constant $\sigma^2$, with $\sigma \in \mathbb{R}_+$, is the
unknown noise variance in the sample data from the underlying signals.
In this sub-section, we develop the core mathematical concept behind Motion Code called the most
informative timestamps. The most informative timestamps of a time series collection generalize the
concept of inducing points of a single time series introduced in [Titsias, 2009]. They are a small
subset of timestamps that minimizes the mismatch between the original data and the information
reconstructed using only this subset. The visualization of the most informative timestamps is provided
in Figure 2, Figure 3, Figure 4, and is further discussed in Section 4.
To concretely define the most informative timestamps, we first introduce the generalized evidence lower
bound (GELB) function in Definition 1. We then define the most informative timestamps as the
maximizers of this GELB function in Definition 2.
Definition 1. Suppose we are given a stochastic process $G = \{g(t)\}_{t \ge 0}$ and a collection of time
series $C = \{y^i\}_{i=1}^{B}$ consisting of $B$ independent time series $y^i$ sampled from $G$. Each series
$y^i = (y^i_t)_{t \in T_i}$ consists of $N_i = |T_i|$ data points and is called a realization of $G$. Let $m$ be a fixed
positive integer. We define the generalized evidence lower bound function $L = L(C, G, S^m, \phi)$
as a function of the data collection $C$, the stochastic process $G$, the $m$-elements timestamps set
(a) Weekend (b) Weekday
Also define the function $L^{\max}$ such that $L^{\max}(C, G, S^m) := \max_{\phi} L(C, G, S^m, \phi)$. Hence, $(S^m)^*$
can be found by maximizing $L^{\max}$ over all possible $S^m$.
To compute the training loss function for Motion Code, we need to computationally approximate the
function $L^{\max}$, which defines the most informative timestamps (see Algorithm 1). Specifically, for
a given set of $m$ timestamps $S^m$, a stochastic process $G$, and a collection $C = \{y^i\}_{i=1}^{B}$ of $B$ time series
sampled from $G$, our goal is to approximate $L^{\max}(C, G, S^m)$. This is achieved by approximating $G$
with a kernelized Gaussian process (see Definition 3), denoted as $H$, with a kernel function $K$.
Definition 3. A kernelized Gaussian process [Rasmussen and Williams, 2006] $H := \{h(t)\}_{t \ge 0}$
with underlying signal $h$ is a stochastic process defined by the mean function $\mu : \mathbb{R} \to \mathbb{R}$ and the
positive-definite kernel function $K : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. For the timestamps $T$, the joint distribution of the
signal vector $h_T = (h(t))_{t \in T}$ is Gaussian and characterized by:
$$p(h_T) = p((h(t))_{t \in T}) = N(\mu_T, K_{TT}) \tag{3}$$
Here $\mu_T$ is the mean vector $(\mu(t))_{t \in T}$, and $K_{TT}$ is the positive-definite $|T| \times |T|$ kernel matrix
$(K(t, s))_{t, s \in T}$. $N(\mu, \Sigma)$ denotes a Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$.
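As a minimal sketch of Definition 3, the snippet below draws signal vectors $h_T \sim N(\mu_T, K_{TT})$ at two sets of unaligned timestamps of different lengths. The kernel form, its parameter values, the zero mean function, and the names `rbf_kernel`, `sample_signal`, and `jitter` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def rbf_kernel(t, s, alpha=1.0, beta=50.0):
    """Illustrative kernel K(t, s) = alpha * exp(-0.5 * beta * (t - s)^2)."""
    t = np.asarray(t, dtype=float)[:, None]
    s = np.asarray(s, dtype=float)[None, :]
    return alpha * np.exp(-0.5 * beta * (t - s) ** 2)

def sample_signal(timestamps, mean_fn, rng, jitter=1e-6):
    """Draw one signal vector h_T ~ N(mu_T, K_TT) at the given timestamps."""
    T = np.asarray(timestamps, dtype=float)
    mu_T = mean_fn(T)
    # A small diagonal jitter (an assumption) keeps K_TT numerically positive-definite.
    K_TT = rbf_kernel(T, T) + jitter * np.eye(len(T))
    return rng.multivariate_normal(mu_T, K_TT)

rng = np.random.default_rng(0)
# Two independent realizations with different lengths and unaligned timestamps in [0, 1].
T1 = np.sort(rng.uniform(0.0, 1.0, size=80))
T2 = np.sort(rng.uniform(0.0, 1.0, size=95))
h1 = sample_signal(T1, np.zeros_like, rng)  # zero mean function (assumption)
h2 = sample_signal(T2, np.zeros_like, rng)
```

Because the process is indexed by continuous time, no interpolation or alignment is needed to evaluate it on either timestamp set.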
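With the kernel $K$ of the approximate process $H$, for each $i \in \{1, \dots, B\}$, define the kernel matrices
$K_{T_i T_i}$, $K_{S^m T_i}$, and $K_{T_i S^m}$ as follows: $K_{T_i T_i} = (K(t, s))_{t \in T_i, s \in T_i}$, $K_{S^m T_i} = (K(t, s))_{t \in S^m, s \in T_i}$,
and $K_{T_i S^m} = (K(t, s))_{t \in T_i, s \in S^m}$. From these, define the $|T_i|$-by-$|T_i|$ matrix $Q_{T_i T_i} :=
K_{T_i S^m} (K_{S^m S^m})^{-1} K_{S^m T_i}$ for $i \in \{1, \dots, B\}$. Lastly, define the vector $Y$ and the joint matrix $Q^{C,G}$
as follows:
$$Y = \begin{bmatrix} y^1 \\ \vdots \\ y^B \end{bmatrix}, \quad Q^{C,G} = \begin{bmatrix} Q_{T_1 T_1} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & Q_{T_B T_B} \end{bmatrix} \tag{4}$$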
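The function $L^{\max}$ is then approximated by
$$L^{\max}(C, G, S^m) \approx \log p_N(Y \mid 0, B\sigma^2 I + Q^{C,G}) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) \tag{5}$$
where $p_N(X \mid \mu, \Sigma)$ denotes the density function of a Gaussian random variable $X$ with mean $\mu$ and
covariance matrix $\Sigma$. The detailed proof for this approximation is given in the Appendix.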
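The quantities in Equation (4) and the approximation of $L^{\max}$ stated in the Appendix (Equation (10)) can be sketched numerically as follows. This is a naive illustration that builds the full block-diagonal matrix explicitly, so it does not reflect the efficient computation analyzed in the paper. The single-component kernel, its parameters, and the names `kernel`, `q_matrix`, `approx_l_max`, and `jitter` are assumptions made for the example.

```python
import numpy as np

def kernel(t, s, alpha=1.0, beta=50.0):
    """Illustrative kernel K(t, s) = alpha * exp(-0.5 * beta * (t - s)^2)."""
    t = np.asarray(t, dtype=float)[:, None]
    s = np.asarray(s, dtype=float)[None, :]
    return alpha * np.exp(-0.5 * beta * (t - s) ** 2)

def q_matrix(T_i, S, jitter=1e-8):
    """Q_{T_i T_i} = K_{T_i S} (K_{S S})^{-1} K_{S T_i}."""
    K_SS = kernel(S, S) + jitter * np.eye(len(S))  # jitter for stability (assumption)
    K_ST = kernel(S, T_i)
    return K_ST.T @ np.linalg.solve(K_SS, K_ST)

def approx_l_max(series, timestamps, S, sigma):
    """log N(Y | 0, B sigma^2 I + Q) - (1 / (2 sigma^2 B)) * sum_i Tr(K_TiTi - Q_TiTi)."""
    B = len(series)
    Y = np.concatenate(series)           # stacked vector Y
    N = len(Y)
    Q = np.zeros((N, N))                 # block-diagonal joint matrix
    trace_term = 0.0
    start = 0
    for T_i in timestamps:
        n_i = len(T_i)
        Q_i = q_matrix(T_i, S)
        Q[start:start + n_i, start:start + n_i] = Q_i
        trace_term += np.trace(kernel(T_i, T_i) - Q_i)
        start += n_i
    cov = B * sigma**2 * np.eye(N) + Q
    _, logdet = np.linalg.slogdet(cov)
    quad = Y @ np.linalg.solve(cov, Y)
    log_density = -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)
    return log_density - trace_term / (2.0 * sigma**2 * B)

# Toy collection: three noisy series with different lengths and timestamps.
rng = np.random.default_rng(0)
timestamps = [np.sort(rng.uniform(0.0, 1.0, n)) for n in (5, 7, 6)]
series = [np.sin(2 * np.pi * T) + 0.1 * rng.standard_normal(len(T)) for T in timestamps]
S = np.linspace(0.1, 0.9, 4)             # candidate informative timestamps
value = approx_l_max(series, timestamps, S, sigma=0.1)
```

A candidate set $S^m$ that summarizes the collection better should yield a larger value, which is what the training loss exploits.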
With the core concept of the most informative timestamps outlined in Section 2.2 and the approximation
formula for $L^{\max}$ in Section 2.3, we can now describe the Motion Code learning framework in detail:
Model And Parameters: We approximate each stochastic process $G_k$ using a kernelized Gaussian
process with a kernel function $K^{\eta_k}$, parameterized by $\eta_k$, for each $k \in \{1, \dots, L\}$. All timestamps are
normalized to the interval $[0, 1]$, and we select $m \in \mathbb{N}$ as the number of the most informative
timestamps, as well as a fixed latent dimension $d \in \mathbb{N}$.
We jointly model the most informative timestamps $S^{m,k}$ for each stochastic process $G_k$ (with
corresponding data collection $C_k$) through a common mapping $G : \mathbb{R}^d \to \mathbb{R}^m$. Specifically, we define
$L$ distinct $d$-dimensional vectors $z_1, \dots, z_L \in \mathbb{R}^d$, referred to as motion codes, and use them to
model $S^{m,k}$ as:
$$\widehat{S^{m,k}} := \mathrm{sigmoid}(G(z_k)) \in \mathbb{R}^m \tag{6}$$
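A minimal sketch of Equation (6), taking the mapping to be linear with matrix $\Theta$ as in step 3 of Algorithm 1 and using the initialization described in step 1. The function names are hypothetical, and the sizes $m = 10$, $d = 2$ follow the hyperparameters reported for the basic datasets.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predicted_timestamps(Theta, z_k):
    """Predicted most informative timestamps sigmoid(Theta z_k), each in (0, 1)."""
    return sigmoid(Theta @ z_k)

m, d = 10, 2
# Initialization from Algorithm 1: every column of Theta is the arithmetic
# sequence between 0.1 and 0.9, and the motion code is the constant vector 1.
Theta = np.tile(np.linspace(0.1, 0.9, m)[:, None], (1, d))
z_k = np.ones(d)
S_hat = predicted_timestamps(Theta, z_k)
```

The sigmoid keeps every predicted timestamp inside the normalized interval $[0, 1]$, so the optimizer can update $\Theta$ and $z_k$ without explicit box constraints.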
Training Loss Function: The goal is to have $\widehat{S^{m,k}}$ closely approximate the true $S^{m,k}$, which
maximizes $L^{\max}$. To achieve this, we maximize $L^{\max}(C_k, G_k, \widehat{S^{m,k}})$ for all $k$, leading to the
following loss function:
$$U(\eta, z, \Theta) = -\sum_{k=1}^{L} L^{\max}(C_k, G_k, \widehat{S^{m,k}}) + \lambda \sum_{k=1}^{L} \|z_k\|_2^2 \tag{7}$$
The first term is computed using the approximation formula for Lmax in Equation (5). The second
term is a regularization term for the motion codes zk , controlled by the hyperparameter λ. The full
training procedure is detailed in Algorithm 1.
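To make the structure of Equation (7) concrete, the sketch below combines per-class values of $L^{\max}$ with the regularization term. The numeric values are placeholders chosen for the example, not results from the paper, and `motion_code_loss` is a hypothetical name.

```python
import numpy as np

def motion_code_loss(l_max_values, motion_codes, lam):
    """U = -sum_k L^max_k + lam * sum_k ||z_k||_2^2."""
    data_term = -float(np.sum(l_max_values))
    reg_term = lam * float(np.sum(np.square(motion_codes)))
    return data_term + reg_term

# Placeholder inputs: two classes (L = 2) with 2-dimensional motion codes.
l_max_values = np.array([-12.5, -9.0])
motion_codes = np.array([[1.0, 1.0],
                         [0.5, -0.5]])
loss = motion_code_loss(l_max_values, motion_codes, lam=1.0)
```

Here the data term contributes 21.5 and the regularizer contributes 2.5, so a larger $\lambda$ pulls the motion codes toward the origin at the expense of the fit.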
Algorithm 1 Motion Code training algorithm
Input: $L$ collections of time series data $C_k = \{y^{i,k}\}_{i=1}^{B_k}$, where the series $y^{i,k}$ has timestamps $T_{i,k}$,
for $k \in \{1, \dots, L\}$. Additional hyperparameters include the number of most informative timestamps $m$,
motion code dimension $d$, regularization parameter $\lambda$, maximum number of iterations $M$, and stopping threshold $\epsilon$.
Output: Parameters $\eta, z, \Theta$ that optimize the loss function $U(\eta, z, \Theta)$ (see Section 2.4).
1: Initialize $\eta$ and $z$ to be constant vectors $\mathbf{1}$, and $\Theta$ to be the constant matrix, where each column is
   the arithmetic sequence between 0.1 and 0.9.
2: repeat
3:   Use the current parameters $\Theta, z$ to calculate the predicted most informative timestamps for the
     $k$-th stochastic process: $\widehat{S^{m,k}} = \mathrm{sigmoid}(\Theta z_k)$.
4:   Calculate $K^{\eta_k}_{S^{m,k} S^{m,k}}$, $K^{\eta_k}_{S^{m,k} T_{i,k}}$, $K^{\eta_k}_{T_{i,k} S^{m,k}}$ for $k \in \{1, \dots, L\}$, $i \in \{1, \dots, B_k\}$. Then calculate the
     corresponding matrices $Q$, and $Q^{C,G}$ defined in Section 2.3.
5:   Use the above calculations to compute $L^{\max}(C_k, G_k, \widehat{S^{m,k}})$ approximated by Equation (5) via an
We use the trained parameters $\eta, z, \Theta$ from Algorithm 1 to perform both time series forecasting and
classification. The first step is to compute preliminary predictions that yield the predicted mean
signal $p_k = E[(g_k)_T] \in \mathbb{R}^{|T|}$, which forms the basis for these tasks.
Preliminary Predictions: For a given $k \in \{1, \dots, L\}$, the predicted distribution of the signal vector $(g_k)_T$
is obtained by marginalizing over the signal $(g_k)_{S^{m,k}}$ at the most informative timestamps $S^{m,k}$ for
process $G_k$:
$$p((g_k)_T) = \int p((g_k)_T \mid (g_k)_{S^{m,k}}) \, \phi^*((g_k)_{S^{m,k}}) \, d(g_k)_{S^{m,k}} \tag{8}$$
where the optimal variational distribution $\phi^*$ is defined in Equation (2). A detailed calculation of
the distribution $p((g_k)_T)$ and its mean $p_k = E[(g_k)_T]$, referred to as the predicted mean signal, is
provided in the Appendix.
Forecasting: For the stochastic process $G_k$, the predicted mean signal $p_k = E[(g_k)_T]$ serves as the
forecast for the process.
Classification: To classify a series $y$ with timestamps $T$, we compute the predicted mean signal
$p_k \in \mathbb{R}^{|T|}$ for each $k \in \{1, \dots, L\}$. Motion Code outputs the label of the closest $p_k$ under
the Euclidean distance $\|\cdot\|_2$ on $\mathbb{R}^{|T|}$:
$$\hat{k} = \arg\min_{k \in \{1, \dots, L\}} \|y - p_k\|_2$$
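The classification rule can be sketched as a nearest-mean lookup. The predicted mean signals below are synthetic stand-ins (a sine and a cosine) rather than outputs of a trained model, labels are 0-indexed, and `classify` is a hypothetical name.

```python
import numpy as np

def classify(y, predicted_means):
    """Return the index k whose predicted mean signal p_k is closest to y in Euclidean distance."""
    distances = [np.linalg.norm(y - p_k) for p_k in predicted_means]
    return int(np.argmin(distances))

T = np.linspace(0.0, 1.0, 50)
# Stand-ins for the predicted mean signals p_0 and p_1 evaluated at timestamps T.
predicted_means = [np.sin(2 * np.pi * T), np.cos(2 * np.pi * T)]

rng = np.random.default_rng(1)
y = np.cos(2 * np.pi * T) + 0.3 * rng.standard_normal(T.shape)  # noisy series from class 1
label = classify(y, predicted_means)
```

Since each $p_k$ is evaluated at the query series' own timestamps $T$, the comparison does not require the series to share a grid with the training data.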
Time Complexity: Matrix multiplication between an $m$-by-$m$ matrix and an $m$-by-$|T_{i,k}|$ or $|T_{i,k}|$-by-$m$
matrix is the most computationally expensive operation in Algorithm 1. As a result, the
time complexity of Algorithm 1 is $O\left(\sum_{k=1}^{L} \sum_{i=1}^{B_k} m^2 |T_{i,k}| \times M\right) = O(m^2 N M)$, where $N =
\sum_{k=1}^{L} \sum_{i=1}^{B_k} |T_{i,k}|$ represents the total number of data points, $M$ is the maximum number of iterations,
and $m$ is the number of most informative timestamps. By the same argument, the
cost of predicting a single mean vector $p_k$ is $O(m^2 |T|)$. Thus, the cost of forecasting at timestamps
$T$ is also $O(m^2 |T|)$. For classification across $L$ groups of time series, classifying a time series
with timestamps $T$ has a complexity of $O(m^2 |T| L)$. Since $m$ is typically chosen to be small, these
complexities are approximately linear in the number of data points in the input time series.
3 EXPERIMENTS
3.1 Datasets
Table 1: Classification Accuracy (Percentage) For 7 Time Series Algorithms On Noisy Basic Sensor
Datasets. The Highest Accuracy Is Highlighted In Red, The Second Highest In Blue.
Table 2: Classification Accuracy (Percentage) For 7 Time Series Algorithms On Noisy Basic Sensor
Datasets. “Error” Indicates Failure To Run.
Parkinson’s Disease Sensor Data: The Parkinson data are derived from the Clinician Input Study
(CIS-PD) [Elm et al., 2019, Raykov et al., 2019], a 6-month project using Apple Watch devices
to monitor patients during clinic visits and at home. For two days before each clinic visit, patients
reported symptoms every 30 minutes, focusing on medication state and tremor severity. The
accelerometer data was segmented into 20-minute intervals (10 minutes before and after each symptom
report). These Parkinson data were obtained from the Biomarker & Endpoint Assessment
to Track Parkinson’s disease DREAM Challenge. For up-to-date information on the study, visit
https://www.synapse.org/Synapse:syn20825169/wiki/600898.
We used two experimental settings for Parkinson’s monitoring. The first tracks patients fully on
medication state, distinguishing between no tremor and mild tremor to assess whether the patient
has fully recovered or is still symptomatic. The second setting adds a third category for moderate to
severe tremor, independent of medication state, aiming to capture broader tremor patterns, including
cases where symptoms persist despite medication. This offers a more comprehensive assessment of
tremor severity beyond recovery stages.
Motion Code was applied to three groups of datasets: 12 basic datasets with added noise, pronunciation audio
data, and Parkinson’s sensor data, focusing on classification tasks. Forecasting was performed on the
basic datasets and the audio dataset. All experiments were run on an Nvidia A100 GPU.
Data Preprocessing: For the Parkinson’s dataset, we downsampled each segment by averaging per
second, calculated the absolute differences between consecutive points, and applied an exponential
moving average filter. We interpolated the data to 1,600 points for benchmark algorithms that require
same-length time series, though Motion Code can handle misaligned data directly without the need
for interpolation. More details are given in the Appendix.
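The three preprocessing steps described for the Parkinson’s data can be sketched as below. The sampling rate, the smoothing factor of the exponential moving average, and the choice to drop a trailing partial second are assumptions, since the main text does not specify them; `preprocess_segment` is a hypothetical name.

```python
import numpy as np

def preprocess_segment(signal, samples_per_second, ema_alpha):
    """Average per second, take absolute consecutive differences, then apply an EMA filter."""
    signal = np.asarray(signal, dtype=float)
    n_seconds = len(signal) // samples_per_second
    # Drop a trailing partial second (assumption) and average within each second.
    trimmed = signal[: n_seconds * samples_per_second]
    per_second = trimmed.reshape(n_seconds, samples_per_second).mean(axis=1)
    # Absolute differences between consecutive per-second averages.
    abs_diff = np.abs(np.diff(per_second))
    # Exponential moving average: s[0] = x[0], s[i] = a * x[i] + (1 - a) * s[i - 1].
    smoothed = np.empty_like(abs_diff)
    for i, x in enumerate(abs_diff):
        smoothed[i] = x if i == 0 else ema_alpha * x + (1.0 - ema_alpha) * smoothed[i - 1]
    return smoothed

# Toy example: 4 seconds of a linear ramp sampled at a hypothetical 5 Hz.
toy = np.arange(20, dtype=float)
features = preprocess_segment(toy, samples_per_second=5, ema_alpha=0.1)
```

For the ramp, the per-second averages are 2, 7, 12, and 17, so every absolute difference equals 5 and the filter leaves them unchanged.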
Kernel Choice: We used a spectral kernel defined as $K^{\eta}(t, s) := \sum_{j=1}^{J} \alpha_j \exp(-0.5 \beta_j |t - s|^2)$ with
parameters $\eta = (\alpha_1, \cdots, \alpha_J, \beta_1, \cdots, \beta_J)$.
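A direct transcription of this kernel into NumPy might look as follows. The function name and the example parameter values are illustrative, and the sketch assumes all $\alpha_j$ and $\beta_j$ are positive.

```python
import numpy as np

def spectral_kernel(t, s, alphas, betas):
    """K(t, s) = sum_j alphas[j] * exp(-0.5 * betas[j] * |t - s|^2), evaluated pairwise."""
    t = np.asarray(t, dtype=float)[:, None]
    s = np.asarray(s, dtype=float)[None, :]
    sq_dist = (t - s) ** 2                               # shape (len(t), len(s))
    alphas = np.asarray(alphas, dtype=float)[:, None, None]
    betas = np.asarray(betas, dtype=float)[:, None, None]
    # Broadcast over the J components on the leading axis, then sum them out.
    return np.sum(alphas * np.exp(-0.5 * betas * sq_dist), axis=0)

# Example with J = 2 components on two timestamps.
K = spectral_kernel([0.0, 0.5], [0.0, 0.5], alphas=[1.0, 2.0], betas=[1.0, 4.0])
```

On the diagonal the kernel reduces to $\sum_j \alpha_j$, which is 3 in this example.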
Hyperparameters: For experiments, we set $d = 2$, $\lambda = 1$, $\epsilon = 10^{-5}$, and $M = 10$. For basic and
pronunciation audio datasets, we selected 10 most informative timestamps ($m = 10$) and 1 kernel
component ($J = 1$). For Parkinson’s disease (PD), we used $m = 6$, $J = 2$ for the first setting, and
$m = 12$, $J = 2$ for the second setting.
We compared Motion Code’s performance on time series classification against 12 algorithms: DTW
[Jeong et al., 2011], TSF [Deng et al., 2013], RISE [Lines et al., 2016], BOSS [Schäfer, 2015], BOSS-E
[Schäfer, 2015], catch22 [Lubba et al., 2019], Shapelet [Bostrom and Bagnall, 2017], Teaser
[Schäfer and Leser, 2020], SVC [Löning et al., 2019], LSTM-FCN [Karim et al., 2019], Rocket
[Dempster et al., 2020], and Hive-Cote 2 [Middlehurst et al., 2021]. We evaluated performance based
on classification accuracy (measured in percentage).
As shown in Table 1 and Table 2, Motion Code outperforms other algorithms on more than half of the
noisy basic datasets and consistently ranks in the top 2, only behind the ensemble model Hive-Cote 2.
This demonstrates the robustness of our method in handling collections of noisy time series.
Table 3: Classification Accuracy For 7 Time Series Algorithms On Pronunciation Audio And
Parkinson Data.
| Data sets | Shapelet | Teaser | SVC | LSTM-FCN | Rocket | Hive-Cote 2 | Motion Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pronunciation Audio | 68.75 | Error | 62.5 | 56.25 | 75 | 75 | 87.5 |
| Parkinson setting 1 | 52.80 | 59.94 | 63.96 | 43.48 | 61.49 | 59.63 | 70.81 |
| Parkinson setting 2 | 44.99 | 37.53 | 48.02 | 24.01 | 51.52 | 50.82 | 54.31 |
Table 4: Classification Accuracy For 7 Time Series Algorithms On Pronunciation Audio And
Parkinson Data.
| Data sets | DTW | TSF | RISE | BOSS | BOSS-E | catch22 | Motion Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pronunciation Audio | 50 | 87.5 | 62.5 | 68.75 | 62.5 | 50 | 87.5 |
| Parkinson setting 1 | 63.35 | 63.98 | 70.81 | 61.80 | 65.53 | 68.94 | 70.81 |
| Parkinson setting 2 | 43.12 | 51.98 | 53.61 | 45.92 | 36.83 | 51.52 | 54.31 |
For real-world datasets, Table 3 and Table 4 show competitive performance from Motion Code
compared to 12 other algorithms, highlighting its effectiveness in handling noise inherent in
real-world data.
3.4 Evaluation on Time Series Forecasting
For forecasting, each dataset was split into two parts: 80% of the data points were used for training,
while the remaining 20% of future data points were reserved for testing. For Motion Code, we
generated a single prediction for all series within the same collection. We selected 5 algorithms as
baselines for comparison: Exponential Smoothing [Holt, 2004], ARIMA [Malki et al., 2021], State
Space Model [Durbin, 2012], TBATS [De Livera et al., 2011], and Last Seen, a basic method that
uses previous values to predict the next time steps.
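The evaluation protocol can be illustrated with the simplest baseline. The sketch below assumes that Last Seen repeats the final training value over the whole test horizon, which is one plausible reading of the description above. The helper names are hypothetical and the toy series is not taken from the benchmark.

```python
import numpy as np

def train_test_split_series(y, train_fraction=0.8):
    """Chronological split: the first 80% of points for training, the rest for testing."""
    split = int(len(y) * train_fraction)
    return y[:split], y[split:]

def last_seen_forecast(train, horizon):
    """Repeat the last observed training value for every future step (assumed baseline)."""
    return np.full(horizon, train[-1], dtype=float)

def rmse(y_true, y_pred):
    """Root mean-square error between two equal-length sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

y = np.arange(1.0, 11.0)                 # toy series 1, 2, ..., 10
train, test = train_test_split_series(y)
forecast = last_seen_forecast(train, len(test))
error = rmse(test, forecast)
```

On this toy series the forecast is 8 for both test points (9 and 10), giving an RMSE of $\sqrt{2.5} \approx 1.58$.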
Table 5: Average Root Mean-Square Error (RMSE) For 6 Time Series Forecasting Algorithms.
| ID | Exp. Smoothing | ARIMA | State space | Last seen | TBATS | Motion Code |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Error | 1079 | 775.96 | 723.1 | 633.04 | 518.49 |
| 2 | 0.34 | 0.43 | 1.58 | 0.19 | 0.17 | 0.27 |
| 3 | 0.88 | 0.58 | 0.93 | 0.57 | 0.56 | 0.74 |
| 4 | 60.38 | 128.44 | 59.83 | 41.51 | 20.94 | 417.94 |
| 5 | 1117 | 3386 | 730.51 | 497.3 | 560.88 | 648.27 |
| 6 | 0.043 | 0.095 | 0.25 | 0.019 | 0.02 | 0.048 |
| 7 | Error | 2.02 | 2.37 | 1.24 | 0.96 | 0.67 |
| 8 | 1.7 | 2.85 | 1.7 | 1.35 | 1.35 | 1.08 |
| 9 | 1.11 | 1.52 | 1.09 | 1.01 | 0.88 | 0.82 |
| 10 | 3.38 | 4.85 | 4.41 | 1.77 | 1.72 | 1.15 |
| 11 | 2.79 | 2.01 | 3.21 | 1.39 | 1.52 | 2.26 |
| 12 | 4.37 | 5 | 4.45 | 0.98 | 1.44 | 0.98 |
| Audio | 0.087 | 0.27 | 0.086 | 0.1 | 0.059 | 0.085 |
We ran the 5 baseline algorithms with individual predictions and compared the results, as shown in
Table 5. Despite not making individual predictions for each series, Motion Code achieved the lowest
RMSE on 6 of the 13 datasets (including one tie with Last Seen), more than any other method.
Code: The implementation is available at https://github.com/mpnguyen2/motion_code.
Despite several noisy time series deviating from the common mean, the points at the most
informative timestamps $S^{m,k}$ form a skeleton approximation of the underlying stochastic process.
All the important twists and turns are consistently captured by the corresponding points at these
timestamps (see Figure 2). Those points create a feature that helps visualize the underlying dynamics
with explicit global behaviors such as increasing, decreasing, or staying still, unlike the original complex
time series collections with no visible common patterns among series.
Pronunciation Audio: For pronunciation audio data, where speakers from different nationalities
pronounce complex words, Motion Code highlights key linguistic features. For “absorptivity”
(ab-sorp-ti-vi-ty), the most informative timestamps align with significant phonetic components,
identifying emphasis on “ab” and “sorp” followed by a notable silent pause before proceeding to “ti”
and then “vi-ty” (see Figure 3a). Similarly, for “anything”, it captures a strong vocal raise on “a-ny”
and emphasis on “thing”, preserving core pronunciation patterns across accents despite variations
(see Figure 3b). This ability to focus on key moments reveals common speech dynamics shared
across accents.
Parkinson’s Disease Data: When tracking normal movement, the series appears random with no
clear pattern. This reflects the unpredictable nature of normal motion, where no consistent behavior
or tremor can be observed (see Figure 4a). In contrast, for patients with light tremor, the extracted
timestamps reveal a more consistent, repetitive pattern, characterized by slight oscillations that
correspond to minor, controlled hand swings. These small fluctuations, captured at key timestamps,
represent typical behavior in light tremor (see Figure 4b). For more severe tremors, the timestamps
highlight a progression from smaller, repeated movements to larger, more exaggerated swings.
Initially, the differences between consecutive data points are minimal, but as the tremor worsens,
the fluctuations become more pronounced, with larger variations visible at critical timestamps (see
Figure 4c). This interpretable feature allows us to track the severity and progression of tremors over
time, offering valuable insights into patients’ conditions.
Motion Code processes each data point and its timestamp independently, allowing the algorithm
to handle time series with different timestamps. Despite the time series having uneven lengths and
out-of-sync timestamps, Motion Code maintains accurate skeleton approximations (see Figure 1),
demonstrating its effectiveness with incomplete and varying-length data.
For the Parkinson’s dataset, Motion Code efficiently handles out-of-sync timestamps and missing values.
The time series from wearable sensors vary in length from 200 to 1660 points, with intermediate
lengths such as 500 and 1000 points. Traditional methods often require interpolation to standardize
these lengths, which can distort the data, especially when dealing with large disparities. Motion Code
bypasses this need, processing time series of different lengths directly and learning across classes
without interpolation, preserving the original data’s integrity.
This capability is particularly useful for monitoring Parkinson’s disease, where tremors and bradykinesia
fluctuate, and sensor readings are irregular due to patient activities. Unlike other methods
that struggle with asynchronous data, Motion Code treats each reading as part of an underlying
stochastic process, enabling it to handle noisy, incomplete, and unsynchronized data efficiently. This
eliminates the need for strict time-aligned monitoring, allowing patients to maintain natural schedules
while ensuring accurate symptom tracking. Clinicians also benefit from clear, actionable insights,
improving their ability to monitor disease progression and make timely interventions.
5 CONCLUSION
In this work, we developed an integrated framework called Motion Code, utilizing variational
inference and sparse stochastic process modeling. Unlike most existing methods focusing on either
classification or forecasting, Motion Code performs both tasks simultaneously across diverse time
series collections. Our model demonstrates robustness to noise and consistently achieves competitive
performance against other leading time series algorithms. As discussed in Section 4, Motion Code
offers interpretable features that capture the core dynamics of the underlying stochastic process.
This is especially useful in domains like Parkinson’s disease monitoring, where understanding key
patterns offers actionable insights for clinicians. Additionally, it handles varying-length time series
and missing data, challenges that many other methods struggle with. In future work, we aim to extend
Motion Code by incorporating non-Gaussian approximations to adapt to time series from different
application domains.
References
Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, and Eamonn Keogh. The great time
series classification bake off: a review and experimental evaluation of recent algorithmic advances.
Data Min. Knowl. Discov., 31(3):606–660, 2017.
Aaron Bostrom and Anthony Bagnall. Binary shapelet transform for multiclass time series classification.
In Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXII, pages 24–46.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2017.
Yanshuai Cao, Marcus A Brubaker, David J Fleet, and Aaron Hertzmann. Efficient optimization
for sparse gaussian process regression. In Advances in Neural Information Processing Systems,
volume 26. Curran Associates, Inc., 2013.
Alysha M De Livera, Rob J Hyndman, and Ralph D Snyder. Forecasting time series with complex
seasonal patterns using exponential smoothing. J. Am. Stat. Assoc., 106(496):1513–1527, 2011.
Angus Dempster, François Petitjean, and Geoffrey I Webb. ROCKET: exceptionally fast and accurate
time series classification using random convolutional kernels. Data Min. Knowl. Discov., 34(5):
1454–1495, 2020.
Houtao Deng, George Runger, Eugene Tuv, and Martyanov Vladimir. A time series forest for
classification and feature extraction. Inf. Sci. (Ny), 239:142–153, 2013.
James Durbin. Time Series Analysis by State Space Methods: Second Edition. Oxford University
Press, 2012.
Jordan J Elm, Margaret Daeschler, Lauren Bataille, Ruth Schneider, Amy Amara, Alberto J Espay,
Michal Afek, Chen Admati, Abeba Teklehaimanot, and Tanya Simuni. Feasibility and utility of a
clinician dashboard from wearable and mobile application parkinson’s disease data. NPJ Digit.
Med., 2(1):95, 2019.
Charles C Holt. Forecasting seasonals and trends by exponentially weighted moving averages. Int. J.
Forecast., 20(1):5–10, 2004.
Qinghua Hu, Rujia Zhang, and Yucan Zhou. Transfer learning for short-term wind speed prediction
with deep neural networks. Renew. Energy, 85:83–95, 2016.
Young-Seon Jeong, Myong K Jeong, and Olufemi A Omitaomu. Weighted dynamic time warping for
time series classification. Pattern Recognit., 44(9):2231–2240, 2011.
Fazle Karim, Somshubra Majumdar, Houshang Darabi, and Samuel Harford. Multivariate LSTM-
FCNs for time series classification. Neural Netw., 116:237–245, 2019.
Bryan Lim and Stefan Zohren. Time-series forecasting with deep learning: a survey. Philos. Trans. A
Math. Phys. Eng. Sci., 379(2194):20200209, 2021.
Jason Lines, Sarah Taylor, and Anthony Bagnall. HIVE-COTE: The hierarchical vote collective of
transformation-based ensembles for time series classification. In 2016 IEEE 16th International
Conference on Data Mining (ICDM). IEEE, 2016.
Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization.
Math. Program., 45(1-3):503–528, 1989.
Zhenyu Liu, Zhengtong Zhu, Jing Gao, and Cheng Xu. Forecast methods for time series data: A
survey. IEEE Access, 9:91896–91912, 2021.
Markus Löning, Anthony Bagnall, Sajaysurya Ganesh, Viktor Kazakov, Jason Lines, and Franz J.
Király. sktime: A Unified Interface for Machine Learning with Time Series. arXiv e-prints
arXiv:1909.07872, 2019.
Carl H Lubba, Sarab S Sethi, Philip Knaute, Simon R Schultz, Ben D Fulcher, and Nick S Jones.
catch22: CAnonical time-series CHaracteristics: Selected through highly comparative time-series
analysis. Data Min. Knowl. Discov., 33(6):1821–1852, 2019.
Qianli Ma, Lifeng Shen, Weibiao Chen, Jiabin Wang, Jia Wei, and Zhiwen Yu. Functional echo state
network for time series classification. Inf. Sci. (Ny), 373:1–20, 2016.
Zohair Malki, El-Sayed Atlam, Ashraf Ewis, Guesh Dagnew, Ahmad Reda Alzighaibi, Ghada
ELmarhomy, Mostafa A Elhosseini, Aboul Ella Hassanien, and Ibrahim Gad. ARIMA models for
predicting the end of COVID-19 pandemic and the risk of second rebound. Neural Comput. Appl.,
33(7):2929–2948, 2021.
Forvo Media. Forvo: The pronunciation guide. http://www.forvo.com/, 2022.
Matthew Middlehurst, James Large, Michael Flynn, Jason Lines, Aaron Bostrom, and Anthony
Bagnall. HIVE-COTE 2.0: a new meta ensemble for time series classification. Mach. Learn., 110
(11-12):3211–3243, 2021.
Henry B. Moss, Sebastian W. Ober, and Victor Picheny. Inducing point allocation for sparse gaussian
processes in high-throughput bayesian optimisation. In Proceedings of The 26th International
Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine
Learning Research, pages 5213–5230, 2023.
Yuan Qi, Ahmed H Abdel-Gawad, and Thomas P Minka. Sparse-posterior gaussian processes for
general likelihoods. In Proceedings of the 26th conference on uncertainty in artificial intelligence,
pages 450–457. Citeseer, 2010.
Carl Edward Rasmussen and Christopher K Williams. Gaussian processes for machine learning.
MIT Press, 2006.
Yordan P. Raykov, Luc J. W. Evers, Reham Badawy, Marjan J. Faber, Bastiaan R. Bloem, Kasper
Claes, and Max A. Little. Probabilistic modelling of gait for remote passive monitoring applications.
arXiv preprint arXiv:1812.02585, 2019.
Patrick Schäfer. The BOSS is concerned with time series classification in the presence of noise. Data
Min. Knowl. Discov., 29(6):1505–1530, 2015.
Patrick Schäfer and Ulf Leser. TEASER: early and accurate time series classification. Data Min.
Knowl. Discov., 34(5):1336–1362, 2020.
Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In Proceedings
of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of
Proceedings of Machine Learning Research, pages 567–574, Florida, USA, 2009.
A MATHEMATICAL THEORY AND PROOF
In this section, we provide the details deferred from the main paper, including the approximation of
the $L^{\max}$ function (Section 2.3) and the derivation of the signal distribution $p((g_k)_T)$ (Section 2.5).
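\[
L^{\max}(C, G, S^m) \approx \log p_N\left(Y \mid 0,\, B\sigma^2 I + Q^{C,G}\right) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}\left(K_{T_i T_i} - Q_{T_i T_i}\right) \tag{10}
\]
\[
Y = \begin{pmatrix} y^1 \\ \vdots \\ y^B \end{pmatrix}, \qquad
Q^{C,G} = \begin{pmatrix} Q_{T_1 T_1} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & Q_{T_B T_B} \end{pmatrix} \tag{11}
\]
\[
L^{\max}(C, G, S^m) \approx \log p_N\left(Y \mid 0,\, B\sigma^2 I + Q^{C,G}\right) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}\left(K_{T_i T_i} - Q_{T_i T_i}\right) \tag{12}
\]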
where $p_N(X \mid \mu, \Sigma)$ denotes the density function of a Gaussian random variable $X$ with mean $\mu$ and
covariance matrix $\Sigma$.
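The right-hand side of Equation (12) can be evaluated numerically as in the following minimal sketch. It is an illustration under stated assumptions rather than the implementation used in the paper: the squared-exponential kernel, the function names `rbf_kernel` and `lmax_approx`, the jitter term, and the Nyström form $Q_{T_i T_i} = K_{T_i S^m} (K_{S^m S^m})^{-1} K_{S^m T_i}$ are all assumptions made here, since this appendix does not restate the definition of $Q$.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel between 1-D timestamp arrays (assumed kernel choice)."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


def lmax_approx(timestamps, series, inducing, sigma, jitter=1e-8):
    """Evaluate the right-hand side of Equation (12) for B series of varying length."""
    B = len(series)
    # Small jitter keeps K_{S^m S^m} numerically invertible (an added safeguard).
    K_ss = rbf_kernel(inducing, inducing) + jitter * np.eye(len(inducing))

    Y = np.concatenate(series)            # stacked observations y^1, ..., y^B
    n = Y.size
    cov = B * sigma**2 * np.eye(n)        # B * sigma^2 * I
    trace_term = 0.0
    start = 0
    for T in timestamps:
        K_ts = rbf_kernel(T, inducing)
        # Assumption: Q_{T_i T_i} = K_{T_i S} K_{SS}^{-1} K_{S T_i} (Nystrom form).
        Q_tt = K_ts @ np.linalg.solve(K_ss, K_ts.T)
        stop = start + len(T)
        cov[start:stop, start:stop] += Q_tt        # i-th diagonal block of Q^{C,G}
        trace_term += np.trace(rbf_kernel(T, T) - Q_tt)
        start = stop

    # Gaussian log-density log p_N(Y | 0, cov), computed through a Cholesky factor.
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, Y)
    log_density = -0.5 * (z @ z) - np.log(np.diag(chol)).sum() - 0.5 * n * np.log(2 * np.pi)
    return log_density - trace_term / (2 * sigma**2 * B)
```

The function accepts series of different lengths, since each block $Q_{T_i T_i}$ is built from its own timestamp set $T_i$. When a single series is used and the inducing timestamps coincide with its timestamps, the trace term vanishes up to the jitter.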
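Furthermore, the optimal variational distribution $\phi^* = \arg\max_{\phi} L(C, G, S^m, \phi)$ (see Section 2.2)
is a Gaussian distribution of the form:
\[
\phi^*(g_{S^m}) = N\left( \sigma^{-2} K_{S^m S^m} \Sigma \left( \frac{1}{B} \sum_{i=1}^{B} K_{S^m T_i}\, y^i \right),\; K_{S^m S^m} \Sigma K_{S^m S^m} \right) \tag{13}
\]
where $\Sigma = \Lambda^{-1}$ with $\Lambda := K_{S^m S^m} + \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} K_{S^m T_i} K_{T_i S^m}$.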
Proof. Define the conditional mean signal vector $\alpha_i = E[g_{T_i} \mid g_{S^m}]$. From the rescaled mean signals
approximation, the conditional distribution of $g_T$ given $g_S$ is Gaussian with the following mean and
covariance:
\[
p(g_T \mid g_S, T, S) = N\left( K_{TS} K_{SS}^{-1} g_S,\; K_{TT} - K_{TS} K_{SS}^{-1} K_{ST} \right) \tag{14}
\]
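As a result, $\alpha_i = K_{T_i S^m} (K_{S^m S^m})^{-1} g_{S^m}$. Then, following the derivation from [Titsias, 2009],
individual terms in Definition 1 can be approximated as follows:
\[
\begin{aligned}
&\int p(g_{T_i} \mid g_{S^m})\, \phi(g_{S^m}) \log \frac{p(y^i \mid g_{T_i})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{T_i}\, dg_{S^m} \\
&\quad = \int \phi(g_{S^m}) \left[ \int p(g_{T_i} \mid g_{S^m}) \log p(y^i \mid g_{T_i})\, dg_{T_i} + \log \frac{p(g_{S^m})}{\phi(g_{S^m})} \right] dg_{S^m} \\
&\quad \approx \int \phi(g_{S^m}) \left[ \log p_N\left(y^i \mid \alpha_i, \sigma^2 I_{|T_i|}\right) - \frac{1}{2\sigma^2} \mathrm{Tr}\left(K_{T_i T_i} - Q_{T_i T_i}\right) + \log \frac{p(g_{S^m})}{\phi(g_{S^m})} \right] dg_{S^m} \\
&\quad = \int \phi(g_{S^m}) \log \frac{p_N\left(y^i \mid \alpha_i, \sigma^2 I_{|T_i|}\right) p(g_{S^m})}{\phi(g_{S^m})}\, dg_{S^m} - \frac{1}{2\sigma^2} \mathrm{Tr}\left(K_{T_i T_i} - Q_{T_i T_i}\right)
\end{aligned}
\]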
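Let $A$ be the combined mean signal vector $A := \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_B \end{pmatrix}$. Using the above approximation for individual
terms, we upper-bound the function $L(C, G, S^m, \phi)$ (see Definition 1 in Section 2.2) as follows:
\[
\begin{aligned}
L(C, G, S^m, \phi)
&\approx \sum_{i=1}^{B} \frac{1}{B} \int \phi(g_{S^m}) \log \frac{p_N(y^i \mid \alpha_i, \sigma^2 I)\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}\left(K_{T_i T_i} - Q_{T_i T_i}\right) \\
&= \int \phi(g_{S^m}) \log \left( \left( \prod_{i=1}^{B} p_N(y^i \mid \alpha_i, \sigma^2 I) \right)^{1/B} \frac{p(g_{S^m})}{\phi(g_{S^m})} \right) dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}\left(K_{T_i T_i} - Q_{T_i T_i}\right) \\
&\leq \log \int \left( \prod_{i=1}^{B} p_N(y^i \mid \alpha_i, \sigma^2 I) \right)^{1/B} p(g_{S^m})\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}\left(K_{T_i T_i} - Q_{T_i T_i}\right) \\
&= \log \int p_N\left(Y \mid A, B\sigma^2 I\right) p(g_{S^m})\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}\left(K_{T_i T_i} - Q_{T_i T_i}\right) \\
&= \log p_N\left(Y \mid 0,\, B\sigma^2 I + Q^{C,G}\right) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}\left(K_{T_i T_i} - Q_{T_i T_i}\right)
\end{aligned}
\]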
The only inequality in this bound is due to Jensen's inequality. This upper bound no longer depends
on the variational distribution $\phi$ and only depends on the timestamps in $S^m$. As a result, by definition
of $L^{\max}$, we obtain Equation (12). Moreover, equality holds in this bound when:
\[
\begin{aligned}
\phi^*(g_{S^m}) &\propto \prod_{i=1}^{B} p_N(y^i \mid \alpha_i, \sigma^2 I)^{1/B}\, p(g_{S^m}) \\
&\propto \exp\left( \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} (y^i)^T K_{T_i S^m} (K_{S^m S^m})^{-1} g_{S^m} - \frac{1}{2} (g_{S^m})^T \times \right. \\
&\qquad \left. \left( \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} (K_{S^m S^m})^{-1} K_{S^m T_i} K_{T_i S^m} (K_{S^m S^m})^{-1} + (K_{S^m S^m})^{-1} \right) \times g_{S^m} \right)
\end{aligned}
\]
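Hence, $\phi^*$ is (approximately) a Gaussian distribution with the following mean and covariance, where
$\Sigma = \Lambda^{-1}$ as in Equation (13):
\[
\phi^*(g_{S^m}) = N\left( \sigma^{-2} K_{S^m S^m} \Sigma \left( \frac{1}{B} \sum_{i=1}^{B} K_{S^m T_i}\, y^i \right),\; K_{S^m S^m} \Sigma K_{S^m S^m} \right) \tag{15}
\]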
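The completing-the-square step that leads to the Gaussian form of $\phi^*$ can be checked numerically. The sketch below is illustrative only and is not taken from the paper's code: it assumes a squared-exponential kernel, and the function names `optimal_variational` and `exponent_terms` are hypothetical. It compares the mean and covariance of Equation (13) with the precision matrix and linear coefficient read directly off the exponent of the proportionality relation above.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel between 1-D timestamp arrays (assumed kernel choice)."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


def optimal_variational(timestamps, series, inducing, sigma, jitter=1e-8):
    """Mean and covariance of phi* as written in Equation (13)."""
    B = len(series)
    K_ss = rbf_kernel(inducing, inducing) + jitter * np.eye(len(inducing))
    Lam = K_ss.copy()
    weighted_sum = np.zeros(len(inducing))
    for T, y in zip(timestamps, series):
        K_st = rbf_kernel(inducing, T)
        Lam += K_st @ K_st.T / (B * sigma**2)     # Lambda accumulates K_{S T_i} K_{T_i S}
        weighted_sum += K_st @ y / B              # (1/B) * sum_i K_{S T_i} y^i
    mean = K_ss @ np.linalg.solve(Lam, weighted_sum) / sigma**2
    cov = K_ss @ np.linalg.solve(Lam, K_ss)       # K Sigma K with Sigma = Lambda^{-1}
    return mean, cov


def exponent_terms(timestamps, series, inducing, sigma, jitter=1e-8):
    """Precision matrix and linear coefficient of exp(linear^T g - 0.5 * g^T precision g)."""
    B = len(series)
    K_ss = rbf_kernel(inducing, inducing) + jitter * np.eye(len(inducing))
    K_inv = np.linalg.inv(K_ss)
    precision = K_inv.copy()
    linear = np.zeros(len(inducing))
    for T, y in zip(timestamps, series):
        K_st = rbf_kernel(inducing, T)
        precision += K_inv @ K_st @ K_st.T @ K_inv / (B * sigma**2)
        linear += K_inv @ K_st @ y / (B * sigma**2)
    return precision, linear
```

For a density proportional to $\exp(b^T g - \frac{1}{2} g^T P g)$, the covariance is $P^{-1}$ and the mean is $P^{-1} b$. The covariance returned by `optimal_variational` should therefore invert the precision returned by `exponent_terms`, and the two means should agree up to numerical error.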
Recall that $S^{m,k}$ represents the most informative timestamps for the underlying stochastic process $G_k$,
associated with the time series data collection $C_k$ (see Section 2.1). Furthermore, $G_k$ is approximated
by a Gaussian process with a parameterized kernel function $K^{\eta_k}$ (see Section 2.4).
We now provide a detailed calculation/approximation of the distribution of the underlying signal
$p((g_k)_T)$ and its mean $p_k = E[(g_k)_T]$, referred to as the predicted mean signal (see Section 2.5).
For $k \in \{1, \dots, L\}$, the predicted distribution of the signal vector $(g_k)_T$ is obtained by marginalizing over
the signal $(g_k)_{S^{m,k}}$ at the most informative timestamps $S^{m,k}$ for process $G_k$:
\[
p((g_k)_T) = \int p\left((g_k)_T \mid (g_k)_{S^{m,k}}\right) \phi^*\left((g_k)_{S^{m,k}}\right) d(g_k)_{S^{m,k}} \tag{16}
\]
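Here the optimal variational distribution $\phi^*$ has the approximate form defined in Equation (13).
Specifically, $\phi^*$ can be approximated by a Gaussian distribution with the following mean $\mu_k$ and
covariance matrix $A_k$, based on Lemma 1:
\[
\mu_k = \sigma^{-2} K^{\eta_k}_{S^{m,k} S^{m,k}} \Sigma \left( \frac{1}{B_k} \sum_{i=1}^{B_k} K^{\eta_k}_{S^{m,k} T_{i,k}}\, y^i \right) \tag{17}
\]
\[
A_k = K^{\eta_k}_{S^{m,k} S^{m,k}} \Sigma K^{\eta_k}_{S^{m,k} S^{m,k}} \tag{18}
\]
where $\Sigma = \Lambda^{-1}$ with $\Lambda := K^{\eta_k}_{S^{m,k} S^{m,k}} + \frac{\sigma^{-2}}{B_k} \sum_{i=1}^{B_k} K^{\eta_k}_{S^{m,k} T_{i,k}} K^{\eta_k}_{T_{i,k} S^{m,k}}$.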
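With this approximation for $\phi^*$, Equation (16) simplifies to an integral of a product of two Gaussian densities.
An explicit calculation shows that the distribution of $(g_k)_T$ is approximated by a Gaussian distribution
with the following mean and covariance:
\[
p_k = E[(g_k)_T] = K^{\eta_k}_{T S^{m,k}} \left(K^{\eta_k}_{S^{m,k} S^{m,k}}\right)^{-1} \mu_k \tag{19}
\]
\[
\begin{aligned}
\mathrm{Var}[(g_k)_T] = {} & K^{\eta_k}_{TT} - K^{\eta_k}_{T S^{m,k}} \left(K^{\eta_k}_{S^{m,k} S^{m,k}}\right)^{-1} K^{\eta_k}_{S^{m,k} T} \\
& + K^{\eta_k}_{T S^{m,k}} \left(K^{\eta_k}_{S^{m,k} S^{m,k}}\right)^{-1} A_k \left(K^{\eta_k}_{S^{m,k} S^{m,k}}\right)^{-1} K^{\eta_k}_{S^{m,k} T}
\end{aligned} \tag{20}
\]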
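Equations (17) to (20) translate into a few lines of linear algebra. The following is a minimal sketch for a single class $k$, not the authors' implementation: the squared-exponential kernel stands in for the parameterized kernel $K^{\eta_k}$, and the function names `rbf_kernel` and `predict_signal`, the jitter term, and the fixed kernel hyperparameters are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel, used here as a stand-in for K^{eta_k} (assumption)."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


def predict_signal(query, timestamps, series, inducing, sigma, jitter=1e-8):
    """Predicted mean signal p_k and covariance of (g_k)_T at the query timestamps."""
    B_k = len(series)
    K_ss = rbf_kernel(inducing, inducing) + jitter * np.eye(len(inducing))

    Lam = K_ss.copy()
    weighted_sum = np.zeros(len(inducing))
    for T, y in zip(timestamps, series):
        K_st = rbf_kernel(inducing, T)
        Lam += K_st @ K_st.T / (B_k * sigma**2)
        weighted_sum += K_st @ y / B_k

    mu_k = K_ss @ np.linalg.solve(Lam, weighted_sum) / sigma**2   # Equation (17)
    A_k = K_ss @ np.linalg.solve(Lam, K_ss)                        # Equation (18)

    K_ts = rbf_kernel(query, inducing)
    # W = K_{T S} K_{S S}^{-1}; the transpose trick relies on K_ss being symmetric.
    W = np.linalg.solve(K_ss, K_ts.T).T
    mean = W @ mu_k                                                # Equation (19)
    cov = rbf_kernel(query, query) - W @ K_ts.T + W @ A_k @ W.T    # Equation (20)
    return mean, cov
```

As a usage example, passing three copies of a noise-free sine wave as `series`, with six evenly spaced inducing timestamps on $[0, 1]$ and `sigma=0.1`, should give a predicted mean that tracks the sine wave closely. The diagonal of the returned covariance gives the pointwise variance of the predicted signal.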
B DATASETS
B.1 Basic Datasets
Twelve publicly available time-series datasets were sourced from the UCR archive [Bagnall et al.,
2017]. These datasets are the following, with their respective IDs from 1 to 12: Chinatown,
ECGFiveDays, FreezerSmallTrain, GunPointOldVersusYoung, HouseTwenty, InsectEPGRegularTrain,
ItalyPowerDemand, Lightning7, MoteStrain, PowerCons, SonyAIBORobotSurface2, and
UWaveGestureLibraryAll.
For the Pronunciation Audio dataset, the audio samples were obtained from publicly available
pronunciation recordings [Media, 2022]. The original pronunciation audio files are included in the
folder data/audio, and all processing steps related to these files can be found in the provided Python
file data_processing.py. This file contains the full preprocessing pipeline for converting raw audio
into time-series data, which was used for benchmarking and experimentation.
For the Parkinson's sensor dataset, all personally identifiable information (PII) was removed during
data processing to protect privacy. The sensor data was aggregated to a per-second level, so the
original, unaggregated data cannot be recovered, which minimizes the risk of data exposure. The
processing steps for this dataset are available in the Python file
parkinson_data_processing.py. This code generates the second-level processed data, which serves
as input for all benchmarking algorithms, including Motion Code.
To access the full original data and labeled datasets, researchers must apply for a separate
license by following the instructions provided at:
https://www.synapse.org/Synapse:syn20825169/wiki/600903.
In addition, we provide general information for both the Pronunciation Audio data and the Parkinson’s
sensor data as processed by us in Table 6 below:
Table 6: Descriptions of 3 Datasets Processed by Authors.
C ADDITIONAL FIGURES
C.1 Interpretable Features
(a) Class 1 (b) Class 2 (c) Class 3
Figure 7: Interpretable Features for InsectEPGRegularTrain.