
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2024.3387317

Self-Supervised Learning for Time Series Analysis: Taxonomy, Progress, and Prospects

Kexin Zhang, Qingsong Wen†, Chaoli Zhang, Rongyao Cai, Ming Jin, Yong Liu, James Y. Zhang, Yuxuan Liang, Guansong Pang, Dongjin Song, and Shirui Pan

Abstract—Self-supervised learning (SSL) has recently achieved impressive performance on various time series tasks. The most prominent advantage of SSL is that it reduces the dependence on labeled data: with the pre-training and fine-tuning strategy, high performance can be achieved with only a small amount of labeled data. In contrast to the many published self-supervised surveys on computer vision and natural language processing, a comprehensive survey for time series SSL is still missing. To fill this gap, this article reviews current state-of-the-art SSL methods for time series data. We first comprehensively review existing surveys related to SSL and time series, and then provide a new taxonomy of existing time series SSL methods by summarizing them from three perspectives: generative-based, contrastive-based, and adversarial-based. These methods are further divided into ten subcategories with detailed reviews and discussions of their key intuitions, main frameworks, advantages, and disadvantages. To facilitate the experiments and validation of time series SSL methods, we also summarize datasets commonly used in time series forecasting, classification, anomaly detection, and clustering tasks. Finally, we present the future directions of SSL for time series analysis.

Index Terms—Time series analysis, self-supervised learning, representation learning, deep learning

1 INTRODUCTION

Time series data abound in many real-world scenarios [1], [2], including human activity recognition [3], industrial fault diagnosis [4], smart building management [5], and healthcare [6]. The key to most tasks based on time series analysis is to extract useful and informative features. In recent years, Deep Learning (DL) has shown impressive performance in extracting hidden patterns and features of the data. Generally, the availability of sufficiently large labeled data is one of the critical factors for a reliable DL-based feature extraction model, a setting usually referred to as supervised learning. Unfortunately, this requirement is difficult to meet in some practical scenarios, particularly for time series data, where obtaining labeled data is a time-consuming process. As an alternative, Self-Supervised Learning (SSL) has garnered increasing attention for its label-efficiency and generalization ability, and consequently, many of the latest time series modeling methods have been following this learning paradigm.

SSL is a subset of unsupervised learning that utilizes pretext tasks to derive supervision signals from unlabeled data. These pretext tasks are self-generated challenges that the model solves to learn from the data, thereby creating valuable representations for downstream tasks. SSL does not require additional manually labeled data because the supervisory signal is derived from the data itself. With the help of well-designed pretext tasks, SSL has recently achieved great success in the domains of Computer Vision (CV) [7]–[10] and Natural Language Processing (NLP) [11], [12].

With the great success of SSL in CV and NLP, it is appealing to extend SSL to time series data. However, transferring pretext tasks designed for CV/NLP directly to time series data is non-trivial and often fails in many scenarios. Here we highlight some typical challenges that arise when applying SSL to time series data. First, time series data exhibit unique properties such as seasonality, trend, and frequency-domain information [13]–[15]. Since most pretext tasks designed for image or language data do not consider these time-series-specific semantics, they cannot be directly adopted. Second, some techniques commonly used in SSL, such as data augmentation, need to be specially designed for time series data. For example, rotation and crop are commonly used augmentation techniques for image data [16], but these two techniques may break the temporal dependency of series data. Third, most time series data contain multiple dimensions, i.e., multivariate time series. However, useful information usually exists in only a few dimensions, making it difficult to extract useful information from time series using SSL methods designed for other data types.

Kexin Zhang, Rongyao Cai, and Yong Liu are with the Institute of Cyber-Systems and Control, Zhejiang University. (e-mail: [email protected], [email protected], and [email protected])
Qingsong Wen is with DAMO Academy, Alibaba Group. (e-mail: [email protected])
Chaoli Zhang is with the School of Computer Science and Technology, Zhejiang Normal University. (e-mail: [email protected])
Ming Jin is with the Faculty of Information Technology, Monash University. (e-mail: [email protected])
James Zhang is with Ant Group. (e-mail: [email protected])
Yuxuan Liang is with INTR & DSA Thrust, Hong Kong University of Science and Technology (Guangzhou). (e-mail: [email protected])
Guansong Pang is with the School of Computing and Information Systems, Singapore Management University. (e-mail: [email protected])
Dongjin Song is with the School of Computing, University of Connecticut. (e-mail: [email protected])
Shirui Pan is with the School of Information and Communication Technology, Griffith University. (e-mail: [email protected])
† Corresponding author: Qingsong Wen. (e-mail: [email protected])
GitHub Page: https://github.com/qingsongedu/Awesome-SSL4TS


Time Series SSL
• Generative-based Methods
  – Autoregressive-based forecasting: THOC [52], STraTS [54], GDN [55], SSTSC [58]
  – Autoencoder-based reconstruction: Autowarp [60], TimeNet [61], PT-LSTM-SAE [62], RANSynCoders [63], DTCR [65], USAD [66], FuSAGNet [67], DAETI [69], DTCRAE [71], STEP [74], MVTSTrans [75], VSF [76], TARNet [77], InterFusion [80], OmniAnomaly [81], GRELEN [82], VGCRN [83], mTANs [84], P-VAE [85], HetVAE [86], LaST [87]
  – Diffusion-based generation: CSDI [98], TimeGrad [99], D3VAE [100], ImDiffusion [101], SSSD [102], DiffLoad [103], DiffSTG [104]
• Contrastive-based Methods
  – Sampling contrast: USRL [105], TNC [106], NCL [108]
  – Prediction contrast: CPC [109], LNT [110], TRL-CPC [111], TS-CP2 [112], Skip-Step CPC [113], CMLF [115], TS-TCC [116], CA-TCC [117]
  – Augmentation contrast: TS2Vec [118], CoST [123], BTSF [124], TF-C [125], DCdetector [126], TimeCLR [127], CLOCS [128], CLUDA [129], MTFCC [130], MRLF [131], CMLF [115], SSLAPP [132], TS-TCC [116], CA-TCC [117]
  – Prototype contrast: ShapeNet [136], TapNet [137], DVSL [138], MHCCL [139]
  – Expert knowledge contrast: SSP-TSC [142], ExpCLR [143], SleepPriorCL [144]
• Adversarial-based Methods
  – Generation and imputation: C-RNN-GAN [154], TimeGAN [155], TTS-GAN [156], E2GAN [157], COSCI-GAN [158], PSA-GAN [159], GT-GAN [160], GRUI [161], SSGAN [162]
  – Auxiliary representation enhancement: USAD [66], AnomalyTrans [163], DUBCNs [164], CRLI [165], AST [166], ACT [167], BeatGAN [168], Activity2vec [169]

Fig. 1: The proposed taxonomy of SSL for time series data.

To the best of our knowledge, there has yet to be a comprehensive and systematic review of SSL for time series data, in contrast to the extensive literature on SSL for CV or NLP [17], [18]. The surveys proposed by Eldele et al. [19] and Deldari et al. [20] are partly similar to our work. However, these two reviews only discuss a small part of self-supervised contrastive learning (SSCL) and lack a more comprehensive literature review. Furthermore, they do not include a summary of benchmark time series datasets, and discussions of potential research directions for time series SSL are also scarce.

This article provides a review of current state-of-the-art SSL methods for time series data. We begin by summarizing recent reviews on SSL and time series data and then propose a new taxonomy from three perspectives: generative-based, contrastive-based, and adversarial-based. The taxonomy is similar to the one proposed by Liu et al. [21] but specifically concentrated on time series data. For generative-based methods, we describe three frameworks: autoregressive-based forecasting, autoencoder-based reconstruction, and diffusion-based generation. For contrastive-based methods, we divide the existing work into five categories based on how positive and negative samples are generated: sampling contrast, prediction contrast, augmentation contrast, prototype contrast, and expert knowledge contrast. Then we sort out and summarize the adversarial-based methods based on two target tasks: time series generation/imputation and auxiliary representation enhancement. The proposed taxonomy is shown in Fig. 1. We conclude this work by discussing possible future directions for time series SSL, including the selection and combination of data augmentations, the selection of positive and negative samples in SSCL, inductive bias for time series SSL, theoretical analysis of SSCL, adversarial attacks and robustness analysis on time series, time series domain adaptation, pre-training and large models for time series, time series SSL in collaborative systems, and benchmark evaluation for time series SSL.

Our main contributions are summarized as follows.
• New taxonomy and comprehensive review. We provide a new taxonomy and a detailed and up-to-date review of time series SSL. We divide existing methods into ten categories, and for each category, we describe the basic frameworks, mathematical expressions, fine-grained classifications, detailed comparisons, advantages, and disadvantages. To the best of our knowledge, this is the first work to comprehensively and systematically review the existing studies of SSL for time series data.
• Collection of applications and datasets. We collect resources on time series SSL, including applications and datasets, and investigate related data sources, characteristics, and corresponding works.
• Abundant future directions. We point out key problems in this field from both application and methodology perspectives, analyze their causes and possible solutions, and discuss future research directions for time series SSL. We strongly believe that our efforts will ignite further research interest in time series SSL.

The rest of the article is organized as follows. Section 2 reviews the literature on SSL and time series data. Sections 3 to 5 describe the generative-based, contrastive-based, and adversarial-based methods, respectively. Section 6 lists commonly used time series datasets from the application perspective; quantitative performance comparisons and discussions are also provided. Section 7 discusses promising directions of time series SSL, and Section 8 concludes the article.

2 RELATED SURVEYS

In this section, the definition of time series data is first introduced, and then several recent reviews on SSL and time series analysis are scrutinized.

2.1 Definition of time series data

2.1.1 Univariate time series
A univariate time series refers to an ordered sequence of observations or measurements of the same variable indexed by time. It can be defined as X = (x0, x1, x2, ..., xt), where xi is the point at timestamp i. Most often, the measurements are made at regular time intervals.


2.1.2 Multivariate time series
A multivariate time series consists of two or more interrelated variables (or dimensions) that depend on time. It is a combination of multiple univariate time series and can be defined as X = [X1, X2, ..., Xp], where p is the number of variables.

2.1.3 Multiple multivariate time series
Consider the scenario where distinct sets of multivariate time series are concurrently examined. Analyzing such datasets involves studying each set independently and exploring the relationships between different sets. For instance, if we study meteorological data from different cities, each city's data forms a multivariate time series, collectively resulting in multiple multivariate time series. This can be articulated as X = {X1, X2, ..., Xn}, where n is the number of multivariate time series.
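To make the three definitions concrete, the following sketch (our illustration; the array names and sizes are arbitrary) shows that the three forms differ only in array shape:

```python
import numpy as np

t, p, n = 100, 6, 3  # time steps, variables, number of series (illustrative sizes)

# Univariate: an ordered sequence of scalar observations indexed by time.
univariate = np.random.randn(t)                   # shape (t,)

# Multivariate: p interrelated variables sharing the same time index.
multivariate = np.random.randn(t, p)              # shape (t, p)

# Multiple multivariate: n multivariate series examined concurrently,
# e.g., meteorological records from n different cities.
multiple_multivariate = np.random.randn(n, t, p)  # shape (n, t, p)
```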
TABLE 1: An overview of recent SSL surveys across six data modalities (image, video, audio, graph, text, and time series), covering [21], [22], [28], [29], [17], [20], [18], [30], [31], [23], [24], [25], [26], [32], [27], and this article; among these, this article is mainly concerned with SSL on time series data.

2.2 Surveys on SSL

The surveys on SSL can be categorized by different criteria. In this paper, we outline three widely used criteria: learning paradigms, pretext tasks, and components/modules.

2.2.1 Learning paradigms
This category focuses on model architectures and training objectives. The SSL methods can be roughly divided into the following categories: generative-based, contrastive-based, and adversarial-based methods. The characteristics and descriptions of these methods can be found in Appendix A. Using the learning paradigm as a taxonomy is arguably the most popular choice among the existing SSL surveys, including [20], [22]–[27]. However, not all surveys cover the above three categories; the readers are referred to these surveys for more details. In Table 1, we also provide the data modalities involved in each survey, which can help readers quickly find the research work closely related to them.

2.2.2 Pretext tasks
The pretext task serves as a means to learn informative representations for downstream tasks. Unlike the learning-paradigm-based criterion, the pretext-task-based criterion is also related to data modality. For example, Ericsson et al. [28] provide a very comprehensive review of pretext tasks for multiple modalities, including image, video, text, audio, time series, and graph. The various self-supervised pretexts are divided into five broad families: transformation prediction, masked prediction, instance discrimination, clustering, and contrastive instance discrimination. Jing and Tian [18] summarize the self-supervised feature learning methods on image and video data, and four categories are discussed: generation-based, context-based, free semantic label-based, and cross modal-based, where cross-modal-based methods construct the learning task using the RGB frame sequence and the optical flow sequence, which are unique features of video. Gui et al. [30] explore four kinds of pretext tasks in computer vision and natural language processing, including context-based methods, contrastive learning methods, generative algorithms, and contrastive generative methods. Essentially, the core of the pretext task is how to construct pseudo-supervision signals. Generally speaking, ignoring the differences in data modalities, existing pretext tasks can be roughly summarized into three categories: context prediction, instance discrimination, and instance generation. The main differences and examples are summarized in Table 3. It should be noted that here we only list some commonly used pretext tasks; some special pretext tasks are not the focus of this article. The details can be found in Appendix B.2.

2.2.3 Components and modules
Categorizing SSCL methods according to their modules and components throughout the pipeline is also an important direction. Jaiswal et al. [17], Le-Khac et al. [29], and Liu et al. [33] sort out the modules and components required in SSL from different perspectives. Specifically, Liu et al. [33] summarize the research progress of self-supervised contrastive learning on medical time series data. In summary, the pipeline can be divided into four


components: positive and negative samples, pretext task, model architecture, and training loss.

The basic intuition behind SSCL is to pull positive samples closer and push negative samples away. Therefore, the first component is the construction of positive and negative samples. Following the suggestions of Le-Khac et al. [29], the main methods can be divided into the following categories: multisensory signals, data augmentation, local-global consistency, and temporal consistency. Additional descriptions of the characteristics of these categories can be found in Appendix B.1.

The second component is the pretext task, a self-supervised task that acts as an important strategy to learn data representations using pseudo-labels [17]. Pretext tasks have been summarized and categorized in the previous subsection, so they are not repeated here; details can be found in Section 2.2.2 and Appendix B.2.

The third component is the model architecture, which determines how positive and negative samples are encoded during training. The major categories include end-to-end [16], memory bank [34], momentum encoder [35], and clustering [36]. More details of these four architectures are summarized in Appendix B.3.

The fourth component is the training loss. As summarized in [29], commonly used contrastive loss functions generally include scoring functions (cosine similarity), energy-based margin functions (pair loss and triplet loss), probabilistic NCE-based functions, and mutual-information-based functions. More details of these loss functions are summarized in Appendix B.4.

2.3 Surveys on time series data

The surveys on time series data can be roughly divided into two categories. The first category focuses on different tasks, such as classification [37], [38], forecasting [39]–[42], and anomaly detection [43], [44]. These surveys comprehensively sort out the existing methods for each task. The second category focuses on the key components of time series modeling based on deep neural networks, such as data augmentation [33], [45]–[47] and model structure [33], [48], [49]. [45] proposes a new taxonomy that divides the existing data augmentation techniques into basic and advanced approaches. [46] also provides a taxonomy and outlines four families: transformation-based methods, pattern mixing, generative models, and decomposition methods. Moreover, both [45] and [46] empirically compare different data augmentation methods for time series classification tasks. [48] systematically reviews transformer schemes for time series modeling from two perspectives: network structure and applications. Liu et al. [33] provide a comprehensive summary of the various augmentations applied to medical time series data, the architectures of pre-training encoders, the types of fine-tuning classifiers and clustering methods, and the popular contrastive loss functions. The taxonomies proposed by Eldele et al. [19], Deldari et al. [20], and Liu et al. [33] are somewhat similar to ours in that all three involve time series self-supervised contrastive learning methods. However, our taxonomy provides more detailed categories and covers more literature in the contrastive-based approach. Although the taxonomy proposed by Liu et al. [33] also focuses on time series data, it emphasizes medical time series data, while we focus more on general time series SSL. More importantly, in addition to contrastive-based approaches, we also thoroughly review a large body of literature on the generative-based and adversarial-based approaches.

3 GENERATIVE-BASED METHODS

In this category, the pretext task is to generate the expected data based on a given view of the data. In the context of time series modeling, the commonly used pretext tasks include using the past series to forecast future windows or specific time stamps, using an encoder and decoder to reconstruct the input, and forecasting the unseen part of a masked time series. This section sorts out the existing self-supervised representation learning methods in time series modeling from the perspectives of autoregressive-based forecasting, autoencoder-based reconstruction, and diffusion-based generation. It should be noted that the autoencoder-based reconstruction task is also viewed as an unsupervised framework; in the context of SSL, we mainly use the reconstruction task as a pretext task, and the final goal is to obtain representations through autoencoder models. An illustration of generative-based SSL for time series is shown in Fig. 2. In Appendix C.1 - C.3, the main advantages and disadvantages of the three generative-based submethods are summarized. Furthermore, a direct comparison of the three methods is given in Appendix C.4.

3.1 Autoregressive-based forecasting

Given the current time step t, the goal of an autoregressive-based forecasting (ARF) task is to forecast K future horizons based on t historical time steps, which can be expressed as:

x̂[t+1:t+K] = f(x[1:t]),   (1)

where x̂[t+1:t+K] represents the target window, and K represents the length of the target window. When K = 1, (1) is a single-step forecasting model, and it is a multi-step forecasting model when K > 1. x[1:t] represents the input series up to time t (including t), which is usually used as the input of the model, and f(·) represents the forecasting model. The learning objective is to minimize the distance between the predicted target window and the ground truth, thus the loss function can be defined as:

L = D(x̂[t+1:t+K], x[t+1:t+K]),   (2)

where D(·) represents the distance between the predicted future window x̂[t+1:t+K] and the ground-truth future window x[t+1:t+K], usually measured by the mean square error (MSE), i.e.,

L = (1/K) Σ_{k=1}^{K} (x̂[t+k] − x[t+k])².   (3)

In time series modeling with the autoregressive-based forecasting task as a pretext task, Recurrent Neural Networks (RNNs) are widely used thanks to their strong capability in spatiotemporal dynamic behavior modeling and sequence prediction [41], [42], [50], [51]. Therefore, they are also naturally applied in this pretext task. THOC [52] constructs a self-supervised pretext task for multi-resolution single-step forecasting called Temporal Self-Supervision (TSS). TSS takes an L-layer dilated RNN with a skip-connection structure as the model; by setting the skip length, it ensures that the forecasting tasks are performed at different resolutions at the same time. In addition to RNNs, forecasting models based on Convolutional Neural Networks (CNNs) have also been developed [53]. Moreover, STraTS [54] first encodes the time series data into triple representations to avoid the limitations of basic RNNs and CNNs in modeling irregular and sparse time series data, and then builds a transformer-based forecasting model for multivariate medical clinical time series. Graph-based time series forecasting methods can also be used as a self-supervised pretext task. Compared with RNNs and CNNs, Graph Neural Networks (GNNs) can better capture the correlations among variables and constituents in multivariate time series data, such as GDN [55] and GTS [56]. Graph-augmented normalizing flow (GANF) is another graph-based approach that can model the conditional dependencies among constituent time series [57]. To help choose a more appropriate model when building a time series SSL task, we further give the advantages and disadvantages of these three commonly used models; the details can be found in Appendix D. Unlike the above methods, SSTSC [58] proposes a temporal relation learning prediction task based on the "Past-Anchor-Future" strategy as a self-supervised pretext task. Instead of directly forecasting the values of future time windows, SSTSC predicts the relationships of the time windows, which can fully mine the temporal relationships in the data.

Fig. 2: Three categories of generative-based SSL for time series data: (a) the autoregressive-based forecasting task, (b) the autoencoder-based reconstruction task, and (c) the diffusion-based generation task.

3.2 Autoencoder-based reconstruction

The autoencoder is an unsupervised artificial neural network composed of an encoder and a decoder [59]. The encoder maps the input x to the representation z, and then the decoder re-maps the representation z back to the input. The output of the decoder is defined as the reconstructed input x̃. The process can be expressed as:

z = E(x), x̃ = D(z),   (4)

where E(·) and D(·) represent the encoder and decoder, respectively. The difference between the original input x and the reconstructed input x̃ is called the reconstruction error, and the goal of a self-supervised pretext task using the autoencoder structure is to minimize the error between x and x̃, i.e.,

L = ∥x − x̃∥².   (5)

The model structure of (4) is defined as the basic autoencoder (BAE). Most BAE-based methods jointly train the encoder E(·) and the decoder D(·), then remove the decoder D(·) and keep only the encoder E(·) as a feature extractor, whose representation z is used for downstream tasks [60]–[63]. For example, TimeNet [61], PT-LSTM-SAE [62], and Autowarp [60] all use RNNs to build a sequence autoencoder model, including encoder and decoder, which tries to reconstruct the input series. Once the model is learned, the encoder is used as a feature extractor to obtain an embedded representation of time series samples, which can help downstream tasks, such as classification and forecasting, achieve better performance. Zhang et al. [64] build a CNN-based autoencoder model and keep the encoder as a feature extractor after minimizing (5); their experimental results show that using the encoded representation is better than directly using the original time series data in industrial fault detection tasks.

However, the representations obtained by (5) are sometimes task-agnostic. Therefore, it is feasible to introduce additional training constraints on top of (5). Abdulaal et al. [63] focus on complex asynchronous multivariate time series data and introduce spectral analysis into the autoencoder model. The synchronous representation of the time series is extracted by learning the phase information in the data, which is eventually used for the anomaly detection task. DTCR [65] is a temporal clustering-friendly representation learning model. It introduces K-means constraints in the reconstruction task, making the learned representation more friendly to clustering tasks. USAD [66] uses an encoder and two decoders to build an autoencoder model and introduces adversarial training on top of (5) to enhance the representation ability of the model. FuSAGNet [67] introduces graph learning on the sparse autoencoder to explicitly model relationships in multivariate time series.

Denoising autoencoder (DAE) is another widely used approach, which is based on adding noise to the input series to corrupt the data, followed by the reconstruction task [68]. DAE can be formulated as:

xn = T(x), z = E(xn), x̃ = D(z),   (6)

where T indicates the operation that adds noise. The learning objective of a DAE is the same as that of a BAE, which is to minimize the difference between x and x̃. In time series modeling, more than one method can add noise to the input, such as adding Gaussian noise [69], [70] and randomly setting some time steps to zero [71], [72].
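Both corruptions reduce to the same corrupt-and-reconstruct recipe. The sketch below (a generic illustration under our own assumptions, not any specific surveyed method) implements T(·) from (6) for both variants and, for the masked case, computes the loss only on the corrupted steps, as in the masked autoencoder discussed next:

```python
import torch
import torch.nn as nn

def corrupt(x, mode="noise", ratio=0.15, sigma=0.1):
    """T(.) in Eq. (6): add Gaussian noise, or zero out random time steps."""
    if mode == "noise":
        return x + sigma * torch.randn_like(x), None
    mask = torch.rand(x.shape[:2], device=x.device) < ratio   # (batch, t)
    return x * (~mask).unsqueeze(-1), mask

encoder = nn.GRU(6, 64, batch_first=True)
decoder = nn.Linear(64, 6)

x = torch.randn(32, 100, 6)                    # unlabeled multivariate windows
x_corrupt, mask = corrupt(x, mode="mask")
z, _ = encoder(x_corrupt)                      # per-step representations z
x_hat = decoder(z)                             # reconstruction x~

err = (x_hat - x) ** 2
if mask is None:                               # DAE: reconstruct everything, Eq. (5)
    loss = err.mean()
else:                                          # MAE-style: loss only on masked steps, Eq. (8)
    loss = err[mask].mean()
```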

Mask autoencoder (MAE) is a structure widely used in language models and vision models in recent years [11], [73]. The core idea behind MAE is that in the pre-training phase, the model first masks part of the input and then tries to predict the masked part from the unmasked part. Unlike BAE and DAE, the loss of MAE is only computed on the masked part. MAE can be formulated as:

xm = M(x), z = E(xm), x̃ = D(z),   (7)

L = M(∥x − x̃∥²),   (8)

where M(·) represents the mask operation and xm represents the masked input. In language models, since the input is usually a sentence, the mask operation masks some words in a sentence or replaces them with other words. In vision models, the mask operation masks pixels or patches in an image. For time series data, a feasible operation is to mask part of the time steps and then use the unmasked part to predict the masked time steps. Existing masking methods for time series data can be divided into three categories: time-step-wise masking, segment-wise masking, and variable-wise masking.

Time-step-wise masking randomly selects a certain proportion of time steps in the series to mask, so fine-grained information is easier to capture, but it is difficult to learn contextual semantic information in the time series. Segment-wise masking randomly selects segments to mask, which allows the model to pay more attention to slow features in the time series, such as trends or high-level semantic information. STEP [74] divides the series into multiple non-overlapping segments of equal length and then randomly selects a certain proportion of the segments for masking. Moreover, STEP points out two advantages of segment-wise masking: the ability to capture semantic information and the reduced input length to the encoder. Different from STEP, Zerveas et al. [75] perform a more complex masking operation on the time series, i.e., the multivariate time series is randomly divided into multiple non-overlapping segments of unequal length on each variable. Variable-wise masking was introduced by Chauhan et al. [76], who defined a new time series forecasting task called variable subset forecast (VSF). In VSF, the time series samples used for training and inference have different dimensions or variables, which may be caused by the absence of some sensor data. This new forecasting task brings the feasibility of self-supervised learning based on variable-wise masking. Unlike random masking, TARNet [77] argues that a pre-trained model whose masking strategy is irrelevant to the downstream task leads to sub-optimal representations. TARNet uses the self-attention score distribution from downstream task training to determine the time steps that require masking.
Variational autoencoder (VAE) is a model based on variational inference [78], [79]. The encoder encodes the input x into a probability distribution P(z|x) instead of an explicit representation z. When the decoder is used to reconstruct the input, a vector sampled from the distribution P(z|x) is used as the input to the decoder. The process can be expressed as:

P(z|x) = E(x), z = S(P(z|x)), x̃ = D(z),   (9)

where S(·) represents the sampling operation. Unlike (5), the loss function of a VAE includes two terms: the reconstruction term and the regularization term, i.e.,

L = ∥x − x̃∥² + KL(N(µ, δ), N(0, I)),   (10)

where KL(·) represents the Kullback-Leibler divergence. The role of the regularization term is to ensure that the learned distribution P(z|x) is close to the standard normal distribution, thereby regulating the representation of the latent space. Representation learning methods based on VAE can model the distribution of each time step to better capture complex spatiotemporal dependencies, and they provide better interpretability in time series modeling tasks. For example, InterFusion [80] is a hierarchical VAE that models inter-variable and temporal dependencies in time series data. OmniAnomaly [81] combines VAE and Planar Normalizing Flow to propose an interpretable time series anomaly detection algorithm. In order to better capture the dependencies between different variables in multivariate time series, GRELEN [82] and VGCRN [83] introduce the graph structure into the VAE. In addition to modeling regular time series, VAE-based methods have also made progress in representation learning for sparse and irregular time series data, such as mTANs [84], P-VAE [85], and HetVAE [86]. The latest work attempts to extract seasonal and trend representations in time series data based on VAE: LaST [87] is a disentangled variational inference framework with mutual information constraints, which separates seasonal and trend representations in the latent space to achieve accurate time series forecasting.

3.3 Diffusion-based generation

As a new kind of deep generative model, diffusion models have recently achieved great success in many fields, including image synthesis, video generation, speech generation, bioinformatics, and natural language processing, due to their powerful generating ability [88]–[92]. The key design of the diffusion model contains two inverse processes: the forward process of injecting random noise to destruct data, and the reverse process of generating samples from a noise distribution (usually a normal distribution). The intuition is that if the forward process is done step-by-step with a transition kernel between any two adjacent states, then the reverse process can follow a reverse state transition operation to generate samples from noise (the final state of the forward process). However, it is usually not easy to formulate the reverse transition kernel, and thus diffusion models learn to approximate the kernel with deep neural networks. Nowadays, there are mainly three basic formulations of diffusion models: denoising diffusion probabilistic models (DDPMs) [88], [93], score matching diffusion models [94], [95], and score SDEs [96], [97].

For DDPMs, the forward and reverse processes are two Markov chains: a forward chain that adds random noise to data and a reverse chain that transforms noise back into data. Formally, denoting the data distribution as x0 ∼ q(x0), the forward Markov process gradually adds Gaussian noise to the data according to a transition kernel q(xt|xt−1), generating a sequence of random variables x1, x2, ..., xT.


Thus the joint distribution of x1, x2, ..., xT conditioned on x0 is

q(x1, x2, ..., xT | x0) = ∏_{t=1}^{T} q(xt | xt−1).   (11)

For simplicity of calculation, the transition kernel is usually set as

q(xt | xt−1) = N(xt; √(1 − βt) xt−1, βt I),   (12)

where β1, β2, ..., βT is a variance schedule of the forward process (usually chosen with βt ∈ (0, 1) ahead of model training) and p(xT) = N(xT; 0, I). Similarly, the joint distribution of the reverse process is

pθ(x0, x1, ..., xT) = p(xT) ∏_{t=1}^{T} pθ(xt−1 | xt),   (13)

where θ denotes the model parameters and pθ(xt−1 | xt) = N(xt−1; µθ(xt, t), Σθ(xt, t)). The key to successful sample generation is training the parameters θ to match the actual reverse process, that is, minimizing the Kullback-Leibler divergence between the two joint distributions. Thus, according to Jensen's inequality, the training loss is

KL(q(x1, x2, ..., xT) || pθ(x0, x1, ..., xT)) ≥ E[−log pθ(x0)] + const.   (14)
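Equation (12) composes across steps, giving xt = √ᾱt x0 + √(1 − ᾱt) ε with ᾱt = ∏_{s≤t}(1 − βs), so training typically uses the simplified noise-prediction objective derived from (14). The sketch below is a generic illustration (the tiny network and the linear schedule are our own placeholder choices, not a surveyed model):

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # variance schedule beta_1..beta_T
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta)

eps_model = nn.Sequential(nn.Linear(100 + 1, 128), nn.ReLU(), nn.Linear(128, 100))

def ddpm_loss(x0):                              # x0: (batch, 100) univariate windows
    t = torch.randint(0, T, (x0.size(0),))      # a random noise level per sample
    a = alpha_bar[t].unsqueeze(-1)              # (batch, 1)
    eps = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * eps   # forward process via Eq. (12)
    # The network predicts the injected noise, conditioned on the level t.
    inp = torch.cat([xt, t.float().unsqueeze(-1) / T], dim=-1)
    return ((eps_model(inp) - eps) ** 2).mean()

loss = ddpm_loss(torch.randn(32, 100))
```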
For score-based diffusion models, the key idea is to perturb the data with a sequence of Gaussian noise and then jointly estimate the score functions for all noisy data distributions by training a deep neural network conditioned on the noise levels. The motivation is that, in many situations, it is easier to model and estimate the score function than the original probability density function. Langevin dynamics is one of the proper techniques: with a step size α > 0, a number of iterations T, and an initial sample x0, Langevin dynamics iteratively performs the following update to obtain a close approximation of p(x):

xt ← xt−1 + α ∇x log p(xt−1) + √(2α) zt,  1 ≤ t ≤ T,   (15)

where zt ∼ N(0, I). However, the score function is inaccurate without the training data, and Langevin dynamics may not converge correctly. Thus, the key approach (NCSN, a noise-conditional score network) perturbs the data with a noise sequence and jointly estimates the score function for all the noisy data with a deep neural network conditioned on the noise levels [94]. Training and sampling are decoupled in score-based generative models, which inspires different choices in these two processes [95].
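As a toy illustration of the sampler in (15) (our own sketch; for p(x) = N(0, I) the score ∇x log p(x) = −x is known in closed form, whereas NCSN learns it with a noise-conditioned network):

```python
import torch

def langevin_sample(score_fn, x0, alpha=1e-2, T=1000):
    """Iterate Eq. (15): x_t = x_{t-1} + alpha * score + sqrt(2*alpha) * z_t."""
    x = x0.clone()
    for _ in range(T):
        z = torch.randn_like(x)
        x = x + alpha * score_fn(x) + (2 * alpha) ** 0.5 * z
    return x

# With the analytic score of N(0, I), samples initialized far away drift
# toward the standard normal distribution.
samples = langevin_sample(lambda x: -x, torch.full((1000, 2), 5.0))
```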
For score SDEs, the diffusion operation is processed according to a stochastic differential equation (SDE) [97]:

dx = f(x, t)dt + g(t)dw,   (16)

where f(x, t) and g(t) are the drift function and diffusion function of the SDE, respectively, and w is a standard Wiener process. Different from DDPMs and SGMs, score SDEs generalize the diffusion process to the case of infinite time steps. Fortunately, DDPMs and SGMs can also be formulated with corresponding SDEs. For DDPMs, the SDE is

dx = −(1/2) β(t) x dt + √β(t) dw,   (17)

where β(t/T) = T βt when T goes to infinity; for SGMs, the SDE is

dx = √(d[δ(t)²]/dt) dw,   (18)

where δ(t/T) = δt as T goes to infinity. With any diffusion process in the form of (16), the reverse process can be obtained by solving the following SDE:

dx = [f(x, t) − g(t)² ∇x log qt(x)]dt + g(t)dw,   (19)

where w is a standard Wiener process when time flows backwards and dt is an infinitesimal time step. Besides, there exists an ordinary differential equation, also called the probability flow ODE, defined as follows:

dx = [f(x, t) − (1/2) g(t)² ∇x log qt(x)]dt.   (20)

The trajectories of the probability flow ODE have the same marginals as the reverse-time SDE. Once the score function at each time step is known, the reverse SDE can be solved with various numerical techniques; a similar objective is designed for SGMs.

Diffusion models have also been applied to time series analysis recently. We briefly summarize them based on their designed architectures and the main diffusion techniques used. Conditional score-based diffusion models for imputation (CSDI) [98] were proposed for the time series imputation task; CSDI utilizes score-based diffusion models conditioned on the observed data. In time series forecasting, TimeGrad [99] uses an RNN-conditioned diffusion probabilistic model at each time step to depict the fixed forward process and the learned reverse process. D3VAE [100] is a bidirectional variational auto-encoder (BVAE) equipped with diffusion, denoising, and disentanglement; in D3VAE, the coupled diffusion process augments the input and output time series simultaneously. ImDiffusion [101] combines imputation and diffusion models for time series anomaly detection. SSSD [102] combines diffusion models and structured state space models for time series imputation and forecasting tasks. DiffLoad [103] proposes a diffusion-based structure for electrical load probabilistic forecasting that considers both epistemic and aleatoric uncertainties. DiffSTG [104] presents the first attempt to predict the evolution of spatio-temporal graphs using DDPMs.

4 CONTRASTIVE-BASED METHODS

Contrastive learning is a widely used self-supervised learning strategy that has shown strong learning ability in computer vision and natural language processing. Unlike discriminative models that learn a mapping to true labels and generative models that try to reconstruct inputs, contrastive-based methods aim to learn data representations by contrasting positive and negative samples. Specifically, positive samples should have similar representations, while negative samples should have different representations. Therefore, the selection of positive and negative samples is very important to contrastive-based methods. This section sorts out and summarizes the existing contrastive-based methods in time series modeling according to the selection of positive and negative samples.

Fig. 3: Five categories of contrastive-based SSL for time series data: (a) sampling contrast, (b) prediction contrast, (c) augmentation contrast, (d) prototype contrast, and (e) expert knowledge contrast.

An illustration of contrastive-based SSL for time series is shown in Fig. 3. In Appendix E.1 - E.5, the main advantages and disadvantages of the five contrastive-based submethods are summarized.

4.1 Sampling contrast

Sampling contrast follows a widely used assumption in time series analysis that two neighboring time windows or time stamps have a high degree of similarity, so positive and negative samples are directly sampled from the raw time series, as shown in Fig. 3(a). Specifically, given a time window (or a time stamp) as an anchor, its nearby window (or time stamp) is more likely to be similar (small distance), and a distant window (or time stamp) should be less similar (large distance). The term "similar" indicates that two windows (or two time stamps) share more common patterns, such as the same amplitude, the same periodicity, and the same trend.

As mentioned in [105], suppose one anchor x^ref, one positive sample x^pos, and K negative samples x^neg_k, k ∈ 1, 2, ..., K are chosen; we expect to assimilate x^ref and x^pos and to distinguish between x^ref and x^neg_k, i.e.,

L = −log(S(x^ref, x^pos)) − Σ_{k=1}^{K} log(−S(x^ref, x^neg_k)),   (21)

where S(·) denotes the similarity of two representations. However, due to the non-stationary characteristics of most time series data, it is still a challenge to choose correct positive and negative samples based on contextual information in time series data. Temporal neighborhood coding (TNC) was recently proposed to deal with this problem [106]. TNC uses the augmented Dickey-Fuller (ADF) statistical test to determine the stationary region and introduces positive-unlabeled (PU) learning to handle the problem of sampling bias by treating negative samples as unknown samples and then assigning weights to these samples. The learning objective is extended to

L = −E_{x^pos ∈ N} [log S(x^ref, x^pos)] − E_{x^neg ∈ Ñ} [(1 − w) × log(−S(x^ref, x^neg)) + w × log S(x^ref, x^neg)],   (22)

where w is the probability of sampling false negative samples, N denotes the neighboring area, and Ñ denotes the non-neighboring area. Supervised contrastive learning (SCL) [107] effectively addresses sampling bias, so introducing a supervised signal to identify positive and negative samples is a feasible solution. Neighborhood contrastive learning (NCL) is a recent time series modeling method that combines context sampling and the supervised signal to generate positive and negative samples [108]. NCL assumes that if two samples share some predefined attributes, then they are considered to share the same neighboring area.

4.2 Prediction contrast

In this category, prediction tasks that use the context (present) to predict the target (future information) are considered self-supervised pretext tasks, and the goal is to maximally preserve the mutual information of the context and the target. Contrastive predictive coding (CPC), proposed by [109], provides a contrastive learning framework to perform the prediction task using the InfoNCE loss. As shown in Fig. 3(b), the context c_t and a sample from p(x_{t+k} | c_t) construct a positive pair, while samples from the 'proposal' distribution p(x_{t+k}) are negative samples. The learning objective is as follows:

L = −E_X [ log ( f_k(x_{t+k}, c_t) / Σ_{x_j ∈ X} f_k(x_j, c_t) ) ],   (23)

where f_k(·) is a density ratio that preserves the mutual information of c_t and x_{t+k} [109], and it can be estimated by a simple log-bilinear model:

f_k(x_{t+k}, c_t) = exp(z_{t+k}^T W_k c_t).   (24)

It can be seen that CPC does not directly predict the future observations x_{t+k}; instead, it tries to preserve the mutual information of c_t and x_{t+k}. This allows the model to capture the "slow features" that span multiple time steps. Following the architecture of CPC, LNT [110], TRL-CPC [111], TS-CP2 [112], and Skip-Step CPC [113] were proposed. LNT and TRL-CPC use the same structure as the original CPC [109] to build a representation learning model, and the purpose is to capture the local semantics across time to detect anomaly points.
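The InfoNCE objective (23) with the log-bilinear critic (24) can be written compactly by scoring each context against every candidate future in a batch; the sketch below is a minimal illustration under our own assumptions (placeholder encodings, one prediction horizon), not the original CPC implementation:

```python
import torch
import torch.nn.functional as F

batch, d = 32, 64
z_future = torch.randn(batch, d)    # encodings z_{t+k} of the true future windows
c_context = torch.randn(batch, d)   # context vectors c_t from an autoregressive model
W_k = torch.randn(d, d, requires_grad=True)  # one bilinear matrix per horizon k

# f_k(x_j, c_t) = exp(z_j^T W_k c_t): the other futures in the batch act as the
# 'proposal' negatives, so the positive sits on the diagonal of the score matrix.
scores = z_future @ W_k @ c_context.T        # (batch, batch)
labels = torch.arange(batch)
loss = F.cross_entropy(scores.T, labels)     # Eq. (23), averaged over the batch
```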


is to capture the local semantics across the time to detect series samples to generate a newly augmented view, while
the anomaly points. TS-CP2 and Skip-Step CPC replace the the pretext task is to correctly predict the proportion of two
autoregressive model in the original CPC structure with original time series samples in augmented view.
TCN [114], which improves feature learning ability and Data augmentation in the frequency domain is also
computational efficiency. Moreover, Skip-Step CPC points out that adjusting the distance between the context representation c_t and x_{t+k} can construct different positive pairs, which leads to different results in time series anomaly detection.

In addition to the basic contextual prediction tasks mentioned before, some more complex prediction tasks have been constructed and proved useful. CMLF [115] transforms time series into coarse-grained and fine-grained representations and proposes a multi-granularity prediction task, which allows the model to represent the time series at different scales. TS-TCC [116] and its extended version CA-TCC [117] designed a cross prediction task, which uses the context of x_{T1} to predict the target in x_{T2}, and vice versa uses the context of x_{T2} to predict the target in x_{T1}.

4.3 Augmentation contrast

Augmentation contrast is one of the most widely used contrastive frameworks, as shown in Fig. 3(c). Most methods utilize data augmentation techniques to generate different views of an input sample and then learn representations by maximizing the similarity of views that come from the same sample and minimizing the similarity of views that come from different samples. SimCLR [16] is a typical multi-view invariance-based representation learning framework that has been adopted by many subsequent methods. The objective function based on this framework is:

\mathcal{L} = -\log \frac{\exp\left(\operatorname{sim}(z_1, z_2)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq 1]} \exp\left(\operatorname{sim}(z_1, z_k)/\tau\right)},   (25)

where \tau is the temperature parameter, sim(\cdot) denotes the similarity between two representation vectors, and z_k denotes the representations of the training samples in a batch. In feature learning frameworks based on multi-view invariance, the core is to obtain different views of the input samples. When handling images in computer vision, commonly used data augmentation methods include cropping, scaling, adding noise, rotation, and resizing [16]. However, compared with augmentation methods for images, augmentation methods for time series need to consider both temporal and variable dependencies.
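To make (25) concrete, the following is a minimal PyTorch sketch of the batched NT-Xent objective used by SimCLR-style methods; the normalization step, batch layout, and default temperature are illustrative assumptions rather than the prescription of any particular time series method.

import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """NT-Xent / InfoNCE loss of (25) for N pairs of augmented views.
    z1, z2: (N, d) representations of two views of the same N samples."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d); cosine similarity
    sim = z @ z.t() / tau                               # (2N, 2N) similarity logits
    sim.fill_diagonal_(float("-inf"))                   # a view is not its own negative
    # For row i, the positive is the other view of the same sample.
    targets = (torch.arange(2 * n, device=z.device) + n) % (2 * n)
    # Cross-entropy over the remaining 2N - 1 candidates realizes the
    # negative log-softmax in (25).
    return F.cross_entropy(sim, targets)

Calling nt_xent_loss(encoder(view1), encoder(view2)) on two augmented views of a batch is the typical usage; the quality of the learned representations then hinges on how the two views are generated, which is exactly the design space discussed next.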
Since time series data can be converted into frequency-domain representations through the Fourier transform, augmentation methods can be developed in both the time and frequency domains. In the time domain, TS-TCC [116] and its extended version CA-TCC [117] designed two time series data augmentation techniques: a strong augmentation (permutation-and-jitter) and a weak augmentation (jitter-and-scale). TS2Vec [118] generates different views through masking operations that randomly mask out some time steps. Generally speaking, there is no one-size-fits-all answer to the choice of data augmentation methods. Therefore, some works comprehensively compare and study the augmentation methods and further evaluate their performance on different tasks [45], [119]–[121]. All the above methods need only a single time series sample in the augmentation operation, while Mixing-up [122] fuses two time series samples to generate new views. In the frequency domain, perturbing the spectral representation of a series is also feasible for time series data. CoST [123] is a disentangled seasonal-trend representation learning method, which uses the fast Fourier transform to convert different augmented views into amplitude and phase representations and then uses (25) to train the model. BTSF [124] is a contrastive-based method built on a time-frequency fusion strategy: it first generates an augmented view in the time domain through the dropout operation and then generates another augmented view in the frequency domain through the Fourier transform; finally, a bilinear temporal-spectral fusion mechanism is used to fuse the time-frequency information. However, CoST and BTSF do not modify the frequency representation, while TF-C [125] directly augments the time series data through frequency perturbations and has achieved better performance than TS2Vec [118] and TS-TCC [116]. Specifically, TF-C implements three augmentation strategies: low- vs. high-band perturbations, single- vs. multi-component perturbations, and random vs. distributional perturbations.
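The time- and frequency-domain view generation described above can be sketched as follows with NumPy; the noise scales, segment count, and masking ratio are arbitrary illustrative choices and do not reproduce the exact settings of TS-TCC, TS2Vec, or TF-C.

import numpy as np

def jitter(x: np.ndarray, sigma: float = 0.03) -> np.ndarray:
    """Additive Gaussian noise; x has shape (T, C)."""
    return x + np.random.normal(0.0, sigma, size=x.shape)

def scaling(x: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Multiply each channel by a random factor."""
    return x * np.random.normal(1.0, sigma, size=(1, x.shape[1]))

def permutation(x: np.ndarray, n_segments: int = 5) -> np.ndarray:
    """Split along time and shuffle the segments."""
    segments = np.array_split(x, n_segments, axis=0)
    order = np.random.permutation(len(segments))
    return np.concatenate([segments[i] for i in order], axis=0)

def frequency_mask(x: np.ndarray, ratio: float = 0.1) -> np.ndarray:
    """Zero out a random subset of frequency bins per channel."""
    spec = np.fft.rfft(x, axis=0)
    keep = np.random.rand(*spec.shape) > ratio
    return np.fft.irfft(spec * keep, n=x.shape[0], axis=0)

x = np.random.randn(128, 3)            # one sample: (timesteps, channels)
weak_view = scaling(jitter(x))         # jitter-and-scale
strong_view = jitter(permutation(x))   # permutation-and-jitter
freq_view = frequency_mask(x)          # a simple frequency-domain view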
In addition to the above methods, many view generation methods are closely tied to downstream tasks. Recently, DCdetector [126] proposed a dual attention contrastive representation framework for time series anomaly detection. In-patch and patch-wise representations are designed to provide two views of the input samples, as normal samples behave differently from abnormal ones across these two views. TimeCLR [127] proposed a DTW-based augmentation, which can not only simulate phase shifts and amplitude changes but also retain the structure and characteristics of the time series. CLOCS [128] is a self-supervised pre-training method for medical and physiological signals, which applies multi-view invariance contrast along the three perspectives of time, space, and patient to promote higher similarity of representations from the same source. CLUDA [129] introduces multi-view invariance contrast into the time series domain adaptation problem and captures the contextual representation of time series data through intra-domain and inter-domain contrast. MTFCC [130] is another view generation method based on multi-scale characteristics: it samples time series at multiple scales and considers views from the same sample to have similar representations, even if their scales differ. Methods that construct multiple contrastive views based on multi-granularity or multi-scale augmentations also include MRLF [131], CMLF [115], and SSLAPP [132].

4.4 Prototype contrast

The contrastive learning frameworks based on (23) and (25) are essentially instance discrimination tasks, which encourage samples to form a uniform distribution in the feature space [133]. However, the real data distribution should satisfy the property that samples of the same class are concentrated in a cluster, while the distance between different clusters should be large. SCL [107] is an ideal solution when real labels are available, but this is difficult to realize in practice, especially for time series data. Therefore, introducing clustering constraints into existing contrastive learning
frameworks is an alternative, such as CC [134], PCL [135], and SwAV [36]. PCL and SwAV contrast the samples with constructed prototypes, i.e., the cluster centers, which reduces the computation and encourages the samples to present a cluster-friendly distribution in the feature space. An illustration of prototype contrast is shown in Fig. 3(d).

In time series modeling based on prototype contrast, ShapeNet [136] takes shapelets as input and constructs a cluster-level triplet loss, which considers the distance between the anchor and multiple positive (negative) samples as well as the distance among positive (negative) samples. ShapeNet performs an implicit prototype contrast because it does not introduce explicit prototypes (cluster centers) during the training phase. TapNet [137] and DVSL [138] perform an explicit prototype contrast because explicit prototypes are introduced. TapNet introduces a learnable prototype for each predefined class and classifies the input time series sample according to the distance between the sample and each class prototype. DVSL defines virtual sequences, which play the same role as prototypes: it minimizes the distance between samples and virtual sequences while maximizing the distance between virtual sequences. MHCCL [139] proposes hierarchical clustering with an upward masking strategy and a contrastive-pair selection scheme based on a downward masking strategy. In the upward masking strategy, MHCCL holds that outliers greatly impact prototypes, so these outliers should be removed when updating prototypes. The downward masking strategy, in turn, uses the clustering results to select positive and negative samples, i.e., samples belonging to the same prototype are regarded as true positive samples, and samples belonging to different prototypes are regarded as true negative samples.
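The explicit prototype contrast idea, contrasting each sample against cluster centers instead of against all other instances, can be sketched as follows in PyTorch; the clustering-based assignment refresh, prototype count, and temperature are illustrative assumptions and do not correspond to the exact procedure of PCL, TapNet, or MHCCL.

import torch
import torch.nn.functional as F

def prototype_contrast_loss(z: torch.Tensor, prototypes: torch.Tensor,
                            assignments: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Pull each representation toward its own prototype, away from the rest.
    z: (N, d) sample representations; prototypes: (K, d) cluster centers;
    assignments: (N,) cluster index of each sample."""
    z = F.normalize(z, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    logits = z @ prototypes.t() / tau   # (N, K) sample-to-prototype similarities
    return F.cross_entropy(logits, assignments)

# Prototypes and assignments are typically refreshed periodically from the
# current representations, e.g., with a nearest-center clustering step:
# assignments = torch.cdist(z, prototypes).argmin(dim=1)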
4.5 Expert knowledge contrast

Expert knowledge contrast is a relatively new representation learning framework. Generally speaking, this modeling framework incorporates expert prior knowledge or information into deep neural networks to guide model training [140], [141]. In the contrastive learning framework, prior knowledge can help the model choose the correct positive and negative samples during training. An example of expert knowledge contrast is shown in Fig. 3(e).

Here we sort out three typical works of expert knowledge contrast for time series data. Shi et al. [142] used the DTW distance between time series samples as prior information and assumed that two samples with a small DTW distance have a higher similarity. Specifically, given the anchor x_ref and two other samples x_i and x_j, the DTW distance between x_ref and each of the two samples is calculated first; the sample with the smaller distance from x_ref is then considered the positive sample of x_ref. This selection process is defined as

\text{label} = \begin{cases} 1, & \mathrm{DTW}(x_{\mathrm{ref}}, x_i) \geq \mathrm{DTW}(x_{\mathrm{ref}}, x_j) \\ 0, & \text{otherwise.} \end{cases}   (26)

Based on the pair-loss, ExpCLR [143] introduces expert features of time series data to obtain more informative representations. Given two input samples x_i and x_j and corresponding representations f_i and f_j, ExpCLR defines the normalized distance between two samples:

s_{ij} = 1 - \frac{\lVert f_i - f_j \rVert_2}{\max \lVert f_k - f_l \rVert_2},   (27)

where f_k and f_l are the two representation vectors with the largest distance among all samples. Compared with the original pair-loss, the distance between samples x_i and x_j is changed from a discrete value (0 or 1) to a continuous value s_{ij}, which enables the model to learn the relationships between samples more accurately, thereby enhancing the representation ability of the model. In addition to the above two works, SleepPriorCL [144] was proposed to alleviate the sampling bias problem faced by (25). Like ExpCLR, SleepPriorCL also introduces prior features to ensure the model can identify correct positive and negative samples.
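The DTW prior behind (26) can be sketched as below; the quadratic-time DTW recursion and the pair-selection helper are generic illustrations under simple assumptions, not the exact procedure of Shi et al. [142].

import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance
    between two univariate series."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return float(d[n, m])

def choose_positive(x_ref: np.ndarray, x_i: np.ndarray, x_j: np.ndarray) -> int:
    """Following (26): the candidate with the smaller DTW distance to the
    anchor is treated as the positive sample (0 selects x_i, 1 selects x_j)."""
    return 0 if dtw_distance(x_ref, x_i) < dtw_distance(x_ref, x_j) else 1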
Actually, introducing more prior knowledge into contrastive-based SSL can help the model extract better representations. The trend of this family of methods can be summarized from two perspectives: (i) Addressing sampling bias. Sampling bias is caused by the inappropriate selection of positive and negative samples, so introducing prior knowledge useful for selecting positive and negative samples can deal with this problem, such as a clustering-based negative sample detection algorithm [145] and a sample identification strategy based on real labels [107], [108]. (ii) Addressing representation bias. Representation bias means that the extracted representations cannot be guaranteed to be strongly related to the downstream task; the essential reason is that there may be a big difference between the goals of the pretext task and the downstream task. An interesting trend is to fuse semi-supervised learning and contrastive-based SSL to guide the training of the encoder through a small amount of labeled data [146], [147].

5 ADVERSARIAL-BASED METHODS

Adversarial-based self-supervised representation learning methods utilize generative adversarial networks (GANs) to construct pretext tasks. A GAN contains a generator G and a discriminator D. The generator G is responsible for generating synthetic data similar to the real data, while the discriminator D is responsible for determining whether a given sample is real or synthetic. Therefore, the goal of the generator is to maximize the decision failure rate of the discriminator, and the goal of the discriminator is to minimize its own failure rate [49], [148]. The generator G and the discriminator D play a mutual game, so the learning objective is:

\mathcal{L} = \mathbb{E}_{x \sim P_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))].   (28)
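A minimal adversarial training step realizing (28) for sequence data might look as follows in PyTorch; the GRU-based generator and discriminator, the non-saturating generator loss, and all sizes are illustrative assumptions, not the architecture of any surveyed method.

import torch
import torch.nn as nn

class SeqGenerator(nn.Module):
    """Maps a noise sequence to a synthetic series of shape (B, T, C)."""
    def __init__(self, noise_dim: int = 16, hidden: int = 64, channels: int = 3):
        super().__init__()
        self.rnn = nn.GRU(noise_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, channels)

    def forward(self, z):              # z: (B, T, noise_dim)
        h, _ = self.rnn(z)
        return self.out(h)             # (B, T, C)

class SeqDiscriminator(nn.Module):
    """Scores a series as real (positive logit) or synthetic (negative)."""
    def __init__(self, channels: int = 3, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(channels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (B, T, C)
        _, h = self.rnn(x)
        return self.out(h[-1])         # (B, 1) logits

def gan_step(gen, disc, opt_g, opt_d, real, noise_dim: int = 16):
    """One alternating update of the two players in (28)."""
    bce = nn.BCEWithLogitsLoss()
    b, t = real.size(0), real.size(1)
    z = torch.randn(b, t, noise_dim)

    # Discriminator: push D(x) toward 1 for real data, 0 for generated data.
    fake = gen(z).detach()
    d_loss = bce(disc(real), torch.ones(b, 1)) + bce(disc(fake), torch.zeros(b, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: make D score generated data as real (non-saturating loss).
    g_loss = bce(disc(gen(z)), torch.ones(b, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()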
According to the final task, the existing adversarial-based representation learning methods can be divided into time series generation and imputation, and auxiliary representation enhancement. An illustration of adversarial-based SSL for time series is shown in Fig. 4. In Appendix F.1–F.2, the main advantages and disadvantages of the two adversarial-based submethods are summarized. Furthermore, the main differences in characteristics and limitations between the adversarial-based methods and the previous two categories (generative-based and contrastive-based) are shown in Appendix G.

[Fig. 4: Three categories of adversarial-based SSL for time series data: (a) time series generation, where a generator maps random noise to a synthetic series and a discriminator judges real vs. fake; (b) time series imputation, where the generator completes a non-complete series; (c) auxiliary representation enhancement, where an adversarial loss is added to the base loss of an encoder.]

5.1 Time series generation and imputation

The generator in a GAN can generate synthetic data close to the real data, so adversarial representation learning has a wide range of applications in the data generation field [149], especially in image generation [150]–[153]. In recent years, many scholars have also explored the potential of generative representation learning in time series generation and imputation, such as C-RNN-GAN [154], TimeGAN [155], TTS-GAN [156], and E2GAN [157]. It should be emphasized that although Brophy et al. [49] reviewed GAN-based time series generation methods in a recent survey, that review differs from the proposed taxonomy: we sort out the two aspects of complete time series generation and missing value imputation, while Brophy et al. organized the methods from the perspective of discrete and continuous time series modeling.

Complete time series generation refers to generating a new time series that does not exist in the existing dataset. The new sample can be a univariate or multivariate time series. C-RNN-GAN [154] is an early method of generating time series samples using a GAN; its generator is an RNN, and its discriminator is a bidirectional RNN. RNN-based structures can capture the dynamic dependencies across multiple time steps but ignore the static features of the data. TimeGAN [155] is an improved time series generation framework that combines the basic GAN with an autoregressive model, allowing the preservation of the temporal dynamic characteristics of the series. TimeGAN also emphasizes that static features and temporal characteristics are crucial to the generation task.

Some recently proposed methods consider more complex time series generative tasks [156], [158]–[160]. For example, COSCI-GAN [158] is a time series generation framework that considers the correlation between the dimensions of a multivariate time series. It includes Channel GANs and a Central Discriminator: the Channel GANs are responsible for generating the data in each dimension independently, while the Central Discriminator is responsible for determining whether the correlation between different dimensions of the generated series is the same as in the raw series. PSA-GAN [159] is a framework for long time series generation that introduces a self-attention mechanism; it further presents Context-FID, a new metric for evaluating the quality of generated series. Li et al. [156] explored the generation of time series data with irregular spatiotemporal relationships and proposed TTS-GAN, which uses a Transformer instead of an RNN to build the discriminator and the generator and treats the time series data as image data of height one.

Different from generating a new time series, the task of time series imputation refers to filling in the missing values of a non-complete time series sample (for example, one in which the data of some time steps is missing) based on the contextual information. Luo et al. [161] treat the problem of missing value imputation as a data generation task and use a GAN to learn the distribution of the training data set. In order to better capture the dynamic characteristics of the series, the GRUI module was proposed. The GRUI uses a time-lag matrix to record the time-lag information between the effective values of the incomplete time series data, which follows an unknown non-uniform distribution and is very helpful for analyzing the dynamic characteristics of the series. The GRUI module was further used in E2GAN [157]. SSGAN [162] is a semi-supervised framework for time series data imputation, which includes a generative network, a discriminative network, and a classification network. Unlike previous frameworks, SSGAN's classification network makes full use of label information, which helps the model achieve more accurate imputations.

5.2 Auxiliary representation enhancement

In addition to generation and imputation tasks, an adversarial-based representation learning strategy can be added to existing learning frameworks as an additional auxiliary learning module, which we call adversarial-based auxiliary representation enhancement. Auxiliary representation enhancement aims to promote the model to learn more informative representations for downstream tasks by adding adversarial-based learning strategies. It can be defined as:

\mathcal{L} = \mathcal{L}_{\mathrm{base}} + \mathcal{L}_{\mathrm{adv}},   (29)

where \mathcal{L}_{\mathrm{base}} is the basic learning objective and \mathcal{L}_{\mathrm{adv}} is the additional adversarial-based learning objective. It should be noted that when \mathcal{L}_{\mathrm{adv}} is not available, the model can still extract representations from the data, so \mathcal{L}_{\mathrm{adv}} is regarded as an auxiliary learning objective.

USAD [66] is a time series anomaly detection framework that includes two BAE models, defined as
AE_1 and AE_2, respectively. The core idea behind USAD is to amplify the reconstruction error through adversarial training between the two BAEs. In USAD, AE_1 is regarded as the generator, and AE_2 is regarded as the discriminator. The auxiliary goal is to use AE_2 to distinguish the real data from the data reconstructed by AE_1, and to train AE_1 to deceive AE_2. The whole process can be expressed as:

\mathcal{L}_{\mathrm{adv}} = \min_{AE_1} \max_{AE_2} \lVert W - AE_2(AE_1(W)) \rVert_2,   (30)

where W is the real input series. Similar to USAD, AnomalyTrans [163] also uses an adversarial strategy to amplify the anomaly scores of anomalies. But unlike (30), which uses the reconstruction error, AnomalyTrans defines a prior-association and a series-association and then uses the Kullback-Leibler divergence to measure the discrepancy between the two associations.
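A sketch of the USAD-style coupled objective in (30) is given below; the flattened-window interface and the omission of USAD's epoch-dependent loss weights are simplifying assumptions based only on the description above, not the authors' reference implementation.

import torch
import torch.nn.functional as F

def usad_losses(ae1, ae2, w: torch.Tensor):
    """Adversarially coupled reconstruction losses in the spirit of (30).
    ae1 acts as the generator and ae2 as the discriminator; both are
    autoencoders over flattened windows w of shape (B, T * C)."""
    w1 = ae1(w)        # AE1's reconstruction of the input
    w2 = ae2(w1)       # AE2's reconstruction of AE1's output
    # AE1: reconstruct w and fool AE2, i.e., minimize ||W - AE2(AE1(W))||.
    loss_ae1 = F.mse_loss(w1, w) + F.mse_loss(w2, w)
    # AE2: reconstruct w while pushing AE2(AE1(W)) away from W, i.e.,
    # maximize ||W - AE2(AE1(W))|| (AE1 detached so only AE2 is updated here).
    loss_ae2 = F.mse_loss(ae2(w), w) - F.mse_loss(ae2(w1.detach()), w)
    return loss_ae1, loss_ae2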
DUBCNs [164] and CRLI [165] are used for series retrieval and clustering tasks, respectively. Both methods adopt an RNN-based BAE as the model, and a clustering-based loss and an adversarial-based loss are added to the basic reconstruction loss, i.e.,

\mathcal{L} = \mathcal{L}_{\mathrm{mse}} + \lambda_1 \mathcal{L}_{\mathrm{cluster}} + \lambda_2 \mathcal{L}_{\mathrm{adv}},   (31)

where \lambda_1 and \lambda_2 are the weight coefficients of the auxiliary objectives.

The adversarial-based strategy is also effective in other time series modeling tasks. For example, introducing adversarial training into time series forecasting can improve the accuracy and capture long-term repeated patterns, as in AST [166] and ACT [167]. BeatGAN [168] introduces adversarial representation learning into the abnormal beat detection task on ECG data and provides an interpretable detection framework. In modeling behavior data, Activity2vec [169] uses adversarial-based training to model target invariance and enhance the representation ability of the model across different behavior stages.

6 APPLICATIONS AND DATASETS

SSL has many applications across different time series tasks. This section summarizes the most widely used datasets and representative references according to the application area, including anomaly detection, forecasting, classification, and clustering. As shown in Table 2, we provide useful information, including dataset name, dimension, size, source, and useful comments. For each task, we summarize the following aspects: task description, related methods, evaluation metrics, examples, and task flow. Due to space limitations, the descriptions of evaluation metrics, examples, and task flow can be found in Appendix H. In addition, we provide performance comparisons of different methods on the same dataset and further summarize the correlation between methods and tasks; the details can be found in Appendix I.

TABLE 2: Summary of time series applications and widely used datasets. The UCR and UEA archives each contain multiple sub-datasets, so their sizes and dimensions are not fixed values; the size and dimension of each sub-dataset are represented by M and D, respectively. AnRa: Anomaly Ratio. SaIn: Sampling Interval. Datasets for classification and clustering tasks are listed together because the goals of these two tasks are similar. AD: anomaly detection. F: forecasting. C&C: classification and clustering.

App  | Dataset           | Size                | Dim | Comment
AD   | PSM [63]          | 132,481 / 87,841    | 26  | AnRa: 27.80%
AD   | SMD [81]          | 708,405 / 708,405   | 38  | AnRa: 4.16%
AD   | MSL [170]         | 58,317 / 73,729     | 55  | AnRa: 10.72%
AD   | SMAP [170]        | 135,183 / 427,617   | 25  | AnRa: 13.13%
AD   | SWaT [171]        | 475,200 / 449,919   | 51  | AnRa: 12.98%
AD   | WADI [172]        | 1,048,571 / 172,801 | 103 | AnRa: 5.99%
F    | ETTh [173]        | 17,420              | 7   | SaIn: 1 hour
F    | ETTm [173]        | 69,680              | 7   | SaIn: 15 min
F    | Wind [174]        | 10,957              | 28  | SaIn: 1 day
F    | Electricity [175] | 26,304              | 321 | SaIn: 1 hour
F    | ILI [176]         | 966                 | 7   | SaIn: 1 week
F    | Weather [177]     | 52,696              | 21  | SaIn: 10 min
F    | Traffic [178]     | 17,544              | 862 | SaIn: 1 hour
F    | Exchange [179]    | 7,588               | 8   | SaIn: 1 day
F    | Solar [180]       | 52,560              | 137 | SaIn: 10 min
C&C  | HAR [181]         | 173,056 / 173,056   | 9   | Classes: 6
C&C  | UCR 128 [182]     | 128 * M             | 1   | N/A
C&C  | UEA 30 [183]      | 30 * M              | D   | N/A

6.1 Anomaly detection

• Task description. The anomaly detection problem for time series is usually formulated as identifying outlier time points or unexpected time sequences relative to some norm or usual signal.
• Related methods. Most time series anomaly detection methods are constructed under an unsupervised learning framework because obtaining labels for anomalous data is challenging. Autoregressive-based forecasting and autoencoder-based reconstruction are the most commonly used modeling strategies. To be concrete, THOC [52] and GDN [55] employ the autoregressive-based forecasting SSL framework, which assumes that anomalous sequences or time points are not predictable. RANSynCoders [63], USAD [66], AnomalyTrans [163], and DAEMON [184] employ the autoencoder-based reconstruction SSL framework. Furthermore, VGCRN [83] and FuSAGNet [67] combine the two frameworks to achieve more robust and accurate results. It is also beneficial to introduce adversarial-based SSL, which can further amplify the difference between normal and anomalous data, as in USAD [66] and DAEMON [184]; a minimal reconstruction-based scoring sketch is given after this list.
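As referenced in the bullet above, the following is a minimal sketch of how a reconstruction-based SSL model is typically turned into an anomaly detector at test time; the sliding-window loop, the model.reconstruct interface, and the mean-plus-k-sigma threshold are hypothetical illustrative choices, not part of any surveyed method.

import numpy as np

def anomaly_scores(model, series: np.ndarray, window: int = 64) -> np.ndarray:
    """Score each sliding window of a (T, C) series by reconstruction error.
    model.reconstruct(x) -> x_hat is a hypothetical interface for a trained
    autoencoder operating on (window, C) arrays."""
    scores = []
    for start in range(len(series) - window + 1):
        x = series[start:start + window]
        x_hat = model.reconstruct(x)
        scores.append(float(np.mean((x - x_hat) ** 2)))
    return np.asarray(scores)

def flag_anomalies(scores: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag windows whose score exceeds mean + k * std of all scores."""
    return scores > scores.mean() + k * scores.std()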
6.2 Forecasting

• Task description. Time series forecasting is the process of analyzing time series data using statistics and modeling to make predictions of future windows or time points.
• Related methods. The pretext task based on autoregressive-based forecasting is essentially a time series forecasting task. Therefore, various models based on forecasting tasks have been proposed, such as Pyraformer [185], FiLM [15], Quatformer [186], Informer [173], Triformer [187], Scaleformer [188], Crossformer [189], and TimesNet [190]. Moreover, decomposing the series (into seasonality and trend) and then learning and forecasting on the decomposed components helps improve the final forecasting accuracy, as in
MICN [191] and CoST [123]. Besides, introducing adversarial SSL is viable when there are missing values in the series. For example, LGnet [192] introduces adversarial training to enhance the modeling of the global temporal distribution, which mitigates the impact of missing values on forecasting accuracy.

6.3 Classification and clustering

• Task description. The goals of classification and clustering tasks are similar, i.e., to identify the real category to which a certain time series sample belongs.
• Related methods. Contrastive-based SSL methods are the most suitable choice for these two tasks since the core of contrastive learning is identifying positive and negative samples. Specifically, TS-TCC [116] introduces temporal contrast and contextual contrast in order to obtain more robust representations. TS2Vec [118] and MHCCL [139] perform a hierarchical contrastive learning strategy over augmented views, which enables robust representations. Similar to the anomaly detection and forecasting tasks, an adversarial-based SSL strategy can also be introduced into classification and clustering tasks. DTCR [65] proposes a fake-sample generation strategy to assist the encoder in obtaining more expressive representations.

7 DISCUSSION AND FUTURE DIRECTIONS

In this section, we point out some critical problems in current studies and outline several research directions worthy of further investigation.

7.1 Selection and combination of data augmentation

Data augmentation is one of the effective methods to generate augmented views in SSCL [47], [193]. Widely used methods for time series data include jittering, scaling, rotation, permutation, and warping [45], [119]–[121], [194]. In SimCLR [16], nine different augmentation methods for image data were discussed. The experiments show that "no single transformation suffices to learn good representations" and "the composition of random cropping and random color distortion is the most effective augmentation method". This naturally raises the question of which single method or composition of data augmentation methods is optimal for time series. Recently, Um et al. [195] showed that the combination of three basic augmentation methods (permutation, rotation, and time warping) is better than any single method and achieves the best performance in a time series classification task. Iwana et al. [121] evaluated twelve time series data augmentation methods on 128 time series classification datasets with six different types of neural networks. Different evaluation frameworks give different recommendations and results. Therefore, an interesting direction is to construct a reasonable evaluation framework for time series data augmentation methods and then select the optimal method or combination strategy.

7.2 Inductive bias for time series SSL

Existing SSL methods often pursue an entirely data-driven modeling approach. However, introducing reasonable inductive biases or priors is helpful for many deep neural network-based modeling tasks [140], [196], [197]. On the one hand, although a purely data-driven model can easily be extended to various tasks, it requires a large amount of data to train. On the other hand, time series data usually has some exploitable characteristics, such as seasonality, periodicity, trend, and frequency-domain biases [198]–[200]. Thus, one future direction is to consider more effective ways of inducing such inductive biases into time series SSL, based on an understanding of the time series data and the characteristics of specific tasks.

7.3 SSL for irregular and sparse time series

Irregular and sparse time series also widely exist in various scenarios. Such data is measured at irregular time intervals, and not all variables are available for each sample [201]. The straightforward approach to dealing with irregular and sparse time series data is to use interpolation algorithms to estimate the missing values [161], [162], [202]. However, interpolation-based models add undesirable noise and extra overhead to the model, which usually worsens as the time series becomes increasingly sparse [54]. Moreover, for irregular and sparse time series it is often expensive to obtain sufficient labeled data, which motivates building time series analysis models on SSL for various tasks. Therefore, building SSL models directly on irregular and sparse time series data, without interpolation, is a valuable direction.

7.4 Pretraining and large models for time series

Nowadays, many large language models have shown powerful perception and learning capabilities on many different tasks. A similar phenomenon also appears in computer vision [203]. It is natural to ask how this plays out in the time series analysis field. As far as we know, there is limited work on pretraining models on large-scale time series. Exploring the potential of pretraining and large time series models is a promising direction.

7.5 Adversarial attacks and robust analysis on time series

With the widespread use of deep neural networks in time series forecasting, classification, and anomaly detection, the vulnerability and robustness of deep models under adversarial attacks have become a significant concern [204]–[207]. In the field of time series forecasting, Liu et al. [205] study indirect and sparse adversarial attacks on multivariate probabilistic forecasting models and propose two defense mechanisms: randomized smoothing and mini-max defense. Wu et al. [206] propose an attack strategy for generating an adversarial time series by adding malicious perturbations to the original time series so as to deteriorate the performance of time series prediction models. Zhuo et al. [208] summarize and compare various recent and typical adversarial attack and defense methods for fault classifiers in data-driven fault detection and classification systems, including white-box attacks (FGSM [209], IGSM [210], C&W attack [211], DeepFool [212]) and gray-box and black-box attacks (UAP [213], SPSA [214], random noise). Research on adversarial attacks and defenses for time series data is a worthwhile direction, but there is much less literature on this topic. Existing studies mainly involve forecasting and classification tasks. However, the impact of adversarial examples on time series self-supervised pre-training tasks is still unknown.

7.6 Benchmark evaluation for time series SSL

SSL has many applications in time series classification, forecasting, clustering, and anomaly detection. However, most current research seeks to achieve the best performance on specific tasks, and more discussion and evaluation of the self-supervised techniques themselves is needed. One interesting direction is to pay more attention to SSL itself, analyze its properties in time series modeling tasks, and provide reliable benchmark evaluations.

7.7 Time series SSL in collaborative systems

Distributed systems have been widely deployed in many scenarios, including intelligent control systems, wireless sensor networks, and network file systems. On the one hand, an appropriate collaborative learning strategy is fundamental in these systems, as users can train their own local models without sharing their private local data or violating the relevant privacy policies [215]. On the other hand, time series data is widely distributed across such systems, and obtaining sufficient labeled data is difficult, so time series SSL has great deployment potential. In recent years, federated learning has become the most popular collaborative learning framework and has been used successfully in various applications. Combining time series self-supervised learning with federated learning is a valuable research direction that can provide additional modeling tools for modern distributed systems.

8 CONCLUSION

This article concentrates on time series SSL methods and provides a new taxonomy. We categorize the existing methods into three broad categories according to their learning paradigms: generative-based, contrastive-based, and adversarial-based. Moreover, we sort all methods into ten detailed subcategories: autoregressive-based forecasting, autoencoder-based reconstruction, diffusion-based generation, sampling contrast, prediction contrast, augmentation contrast, prototype contrast, expert knowledge contrast, generation and imputation, and auxiliary representation enhancement. We also provide useful information about applications and widely used time series datasets. Finally, multiple future directions are summarized. We believe this review fills the gap in time series SSL and will ignite further research interest in SSL for time series data.

REFERENCES

[1] Q. Wen, L. Yang, T. Zhou, and L. Sun, "Robust time series analysis and applications: An industrial perspective," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 4836–4837.
[2] P. Esling and C. Agon, "Time-series data mining," ACM Computing Surveys (CSUR), vol. 45, no. 1, pp. 1–34, 2012.
[3] J. B. Yang, M. N. Nguyen, P. P. San, X. L. Li, and S. Krishnaswamy, "Deep convolutional neural networks on multichannel time series for human activity recognition," in Proceedings of the 24th International Conference on Artificial Intelligence, 2015, pp. 3995–4001.
[4] K. Zhang, Y. Liu, Y. Gu, X. Ruan, and J. Wang, "Multiple-timescale feature learning strategy for valve stiction detection based on convolutional neural network," IEEE/ASME Transactions on Mechatronics, vol. 27, no. 3, pp. 1478–1488, 2022.
[5] S. Li, D. Hong, and H. Wang, "Relation inference among sensor time series in smart buildings with metric learning," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 4683–4690, Apr. 2020.
[6] Y. Xu, S. Biswal, S. R. Deshpande, K. O. Maher, and J. Sun, "RAIM: Recurrent attentive and intensive model of multimodal patient monitoring data," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD '18, 2018, pp. 2565–2573.
[7] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," in International Conference on Learning Representations, 2018.
[8] R. Zhang, P. Isola, and A. A. Efros, "Split-brain autoencoders: Unsupervised learning by cross-channel prediction," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 645–654.
[9] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, "Discriminative unsupervised feature learning with convolutional neural networks," in Advances in Neural Information Processing Systems, vol. 27, 2014.
[10] Y. Tian, D. Krishnan, and P. Isola, "Contrastive multiview coding," in Computer Vision – ECCV 2020, Cham, 2020, pp. 776–794.
[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, Jun. 2019, pp. 4171–4186.
[12] T. Gao, X. Yao, and D. Chen, "SimCSE: Simple contrastive learning of sentence embeddings," in Empirical Methods in Natural Language Processing (EMNLP), 2021.
[13] R. B. Cleveland, W. S. Cleveland, J. E. McRae, and I. Terpenning, "STL: A seasonal-trend decomposition," J. Off. Stat, vol. 6, no. 1, pp. 3–73, 1990.
[14] Q. Wen, Z. Zhang, Y. Li, and L. Sun, "Fast RobustSTL: Efficient and robust seasonal-trend decomposition for time series with complex patterns," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2203–2213.
[15] T. Zhou, Z. Ma, Q. Wen, L. Sun, T. Yao, W. Yin, R. Jin et al., "FiLM: Frequency improved legendre memory model for long-term time series forecasting," Advances in Neural Information Processing Systems, vol. 35, pp. 12677–12690, 2022.
[16] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in Proceedings of the 37th International Conference on Machine Learning, ser. ICML'20, 2020.
[17] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, "A survey on contrastive self-supervised learning," Technologies, vol. 9, no. 1, 2021.
[18] L. Jing and Y. Tian, "Self-supervised visual feature learning with deep neural networks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 11, pp. 4037–4058, 2021.
[19] E. Eldele, M. Ragab, Z. Chen, M. Wu, C.-K. Kwoh, and X. Li, "Label-efficient time series representation learning: A review," 2023.
[20] S. Deldari, H. Xue, A. Saeed, J. He, D. V. Smith, and F. D. Salim, "Beyond just vision: A review on self-supervised representation learning on multimodal and temporal data," CoRR, vol. abs/2206.02353, 2022.
[21] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, "Self-supervised learning: Generative or contrastive," IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2021.
[22] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[23] S. Liu, A. Mallol-Ragolta, E. Parada-Cabaleiro, K. Qian, X. Jing, A. Kathan, B. Hu, and B. W. Schuller, "Audio self-supervised learning: A survey," Patterns, vol. 3, no. 12, p. 100616, 2022.
[24] A. Mohamed, H.-y. Lee, L. Borgholt, J. D. Havtorn, J. Edin, C. Igel, K. Kirchhoff, S.-W. Li, K. Livescu, L. Maaløe, T. N. Sainath, and S. Watanabe, "Self-supervised speech representation learning: A review," IEEE Journal of Selected Topics in Signal Processing, pp. 1–34, 2022.
[25] Y. Liu, M. Jin, S. Pan, C. Zhou, Y. Zheng, F. Xia, and P. Yu, "Graph self-supervised learning: A survey," IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2022.
[26] Y. Xie, Z. Xu, J. Zhang, Z. Wang, and S. Ji, "Self-supervised learning of graph neural networks: A unified review," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2022.
[27] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, "Pre-trained models for natural language processing: A survey," CoRR, vol. abs/2003.08271, 2020.
[28] L. Ericsson, H. Gouk, C. C. Loy, and T. M. Hospedales, "Self-supervised representation learning: Introduction, advances, and challenges," IEEE Signal Processing Magazine, vol. 39, no. 3, pp. 42–62, 2022.
[29] P. H. Le-Khac, G. Healy, and A. F. Smeaton, "Contrastive representation learning: A framework and review," IEEE Access, vol. 8, pp. 193907–193934, 2020.
[30] J. Gui, T. Chen, J. Zhang, Q. Cao, Z. Sun, H. Luo, and D. Tao, "A survey on self-supervised learning: Algorithms, applications, and future trends," 2023.
[31] S. Latif, R. Rana, S. Khalifa, R. Jurdak, J. Qadir, and B. W. Schuller, "Deep representation learning in speech processing: Challenges, recent advances, and future trends," CoRR, vol. abs/2001.00378, 2020.
[32] L. Wu, H. Lin, C. Tan, Z. Gao, and S. Z. Li, "Self-supervised learning on graphs: Contrastive, generative, or predictive," IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2021.
[33] Z. Liu, A. Alavi, M. Li, and X. Zhang, "Self-supervised contrastive learning for medical time series: A systematic review," Sensors, vol. 23, no. 9, 2023.
[34] I. Misra and L. van der Maaten, "Self-supervised learning of pretext-invariant representations," CoRR, vol. abs/1912.01991, 2019.
[35] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9726–9735.
[36] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, "Unsupervised learning of visual features by contrasting cluster assignments," in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS'20, Red Hook, NY, USA, 2020.
[37] A. Abanda, U. Mori, and J. A. Lozano, "A review on distance based time series classification," Data Mining and Knowledge Discovery, vol. 33, pp. 378–412, 2019.
[38] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.-A. Muller, "Deep learning for time series classification: a review," Data Mining and Knowledge Discovery, vol. 33, pp. 917–963, 2019.
[39] B. Lim and S. Zohren, "Time-series forecasting with deep learning: a survey," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 379, no. 2194, p. 20200209, Feb. 2021.
[40] O. B. Sezer, M. U. Gudelek, and A. M. Ozbayoglu, "Financial time series forecasting with deep learning: A systematic literature review: 2005–2019," Applied Soft Computing, vol. 90, p. 106181, 2020.
[41] Z. Liu, Z. Zhu, J. Gao, and C. Xu, "Forecast methods for time series data: A survey," IEEE Access, vol. 9, pp. 91896–91912, 2021.
[42] K. Benidis, S. S. Rangapuram, V. Flunkert, Y. Wang, D. Maddix, C. Turkmen, J. Gasthaus, M. Bohlke-Schneider, D. Salinas, L. Stella, F.-X. Aubet, L. Callot, and T. Januschowski, "Deep learning for time series forecasting: Tutorial and literature survey," ACM Comput. Surv., vol. 55, no. 6, Dec. 2022.
[43] A. Blázquez-García, A. Conde, U. Mori, and J. A. Lozano, "A review on outlier/anomaly detection in time series data," ACM Comput. Surv., vol. 54, no. 3, Apr. 2021.
[44] A. A. Cook, G. Mısırlı, and Z. Fan, "Anomaly detection for IoT time-series data: A survey," IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6481–6494, 2020.
[45] Q. Wen, L. Sun, F. Yang, X. Song, J. Gao, X. Wang, and H. Xu, "Time series data augmentation for deep learning: A survey," in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), 8 2021, pp. 4653–4660.
[46] B. K. Iwana and S. Uchida, "An empirical survey of data augmentation for time series classification with neural networks," CoRR, vol. abs/2007.15951, 2020.
[47] G. Iglesias, E. Talavera, Á. González-Prieto, A. Mozo, and S. Gómez-Canaval, "Data augmentation techniques in time series domain: a survey and taxonomy," Neural Computing and Applications, vol. 35, no. 14, pp. 10123–10145, 2023.
[48] Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. Yan, and L. Sun, "Transformers in time series: A survey," in International Joint Conference on Artificial Intelligence (IJCAI), 2023.
[49] E. Brophy, Z. Wang, Q. She, and T. Ward, "Generative adversarial networks in time series: A systematic literature review," ACM Comput. Surv., vol. 55, no. 10, Feb. 2023.
[50] M. Schirmer, M. Eltayeb, S. Lessmann, and M. Rudolph, "Modeling irregular time series with continuous recurrent units," in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 162, 17–23 Jul 2022, pp. 19388–19405.
[51] Q. Tan, M. Ye, B. Yang, S. Liu, A. J. Ma, T. C.-F. Yip, G. L.-H. Wong, and P. Yuen, "DATA-GRU: Dual-attention time-aware gated recurrent unit for irregular multivariate time series," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, pp. 930–937, Apr. 2020.
[52] L. Shen, Z. Li, and J. Kwok, "Timeseries anomaly detection using temporal hierarchical one-class network," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 13016–13026.
[53] S. Jawed, J. Grabocka, and L. Schmidt-Thieme, "Self-supervised learning for semi-supervised time series classification," in Advances in Knowledge Discovery and Data Mining, Cham, 2020, pp. 499–511.
[54] S. Tipirneni and C. K. Reddy, "Self-supervised transformer for sparse and irregularly sampled multivariate clinical time-series," ACM Trans. Knowl. Discov. Data, vol. 16, no. 6, Jul. 2022.
[55] A. Deng and B. Hooi, "Graph neural network-based anomaly detection in multivariate time series," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 5, pp. 4027–4035, May 2021.
[56] C. Shang, J. Chen, and J. Bi, "Discrete graph structure learning for forecasting multiple time series," arXiv preprint arXiv:2101.06861, 2021.
[57] E. Dai and J. Chen, "Graph-augmented normalizing flows for anomaly detection of multiple time series," in International Conference on Learning Representations, 2022.
[58] L. Xi, Z. Yun, H. Liu, R. Wang, X. Huang, and H. Fan, "Semi-supervised time series classification model with self-supervised learning," Engineering Applications of Artificial Intelligence, vol. 116, p. 105331, 2022.
[59] P. Baldi, "Autoencoders, unsupervised learning, and deep architectures," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, ser. Proceedings of Machine Learning Research, vol. 27, Bellevue, Washington, USA, 02 Jul 2012, pp. 37–49.
[60] A. Abid and J. Zou, "Autowarp: Learning a warping distance from unlabeled time series using sequence autoencoders," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS'18, Red Hook, NY, USA, 2018, pp. 10568–10578.
[61] P. Malhotra, V. TV, L. Vig, P. Agarwal, and G. Shroff, "TimeNet: Pre-trained deep recurrent neural network for time series classification," CoRR, vol. abs/1706.08838, 2017.
[62] A. Sagheer and M. Kotb, "Unsupervised pre-training of a deep LSTM-based stacked autoencoder for multivariate time series forecasting problems," Scientific Reports, vol. 9, p. 19038, 2019.
[63] A. Abdulaal, Z. Liu, and T. Lancewicki, "Practical approach to asynchronous multivariate time series anomaly detection and localization," in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, ser. KDD '21, New York, NY, USA, 2021, pp. 2485–2494.
[64] K. Zhang and Y. Liu, "Unsupervised feature learning with data augmentation for control valve stiction detection," in 2021 IEEE 10th Data Driven Control and Learning Systems Conference (DDCLS), 2021, pp. 1385–1390.
[65] Q. Ma, J. Zheng, S. Li, and G. W. Cottrell, "Learning representations for time series clustering," in Advances in Neural Information Processing Systems, vol. 32, 2019.
[66] J. Audibert, P. Michiardi, F. Guyard, S. Marti, and M. A. Zuluaga, "USAD: Unsupervised anomaly detection on multivariate time series," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD '20, 2020, pp. 3395–3404.
[67] S. Han and S. S. Woo, "Learning sparse latent graph representations for anomaly detection in multivariate time series," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD '22, New York, NY, USA, 2022, pp. 2977–2986.
[68] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning, ser. ICML '08, New York, NY, USA, 2008, pp. 1096–1103.
[69] G. Jiang, P. Xie, H. He, and J. Yan, "Wind turbine fault detection using a denoising autoencoder with temporal information," IEEE/ASME Transactions on Mechatronics, vol. 23, no. 1, pp. 89–100, 2018.
[70] J. Zhang and P. Yin, "Multivariate time series missing data imputation using recurrent denoising autoencoder," in 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2019, pp. 760–764.
[71] Z. Zheng, Z. Zhang, L. Wang, and X. Luo, "Denoising temporal convolutional recurrent autoencoders for time series classification," Information Sciences, vol. 588, pp. 159–173, 2022.
[72] J. Li, Z. Struzik, L. Zhang, and A. Cichocki, "Feature learning from incomplete EEG with denoising autoencoder," Neurocomputing, vol. 165, pp. 23–31, 2015.
[73] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15979–15988.
[74] Z. Shao, Z. Zhang, F. Wang, and Y. Xu, "Pre-training enhanced spatial-temporal graph neural network for multivariate time series forecasting," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD '22, New York, NY, USA, 2022, pp. 1567–1577.
[75] G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff, "A transformer-based framework for multivariate time series representation learning," in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, ser. KDD '21, 2021, pp. 2114–2124.
[76] J. Chauhan, A. Raghuveer, R. Saket, J. Nandy, and B. Ravindran, "Multi-variate time series forecasting on variable subsets," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD '22, New York, NY, USA, 2022, pp. 76–86.
[77] R. R. Chowdhury, X. Zhang, J. Shang, R. K. Gupta, and D. Hong, "TARNet: Task-aware reconstruction for time-series transformer," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD '22, New York, NY, USA, 2022, pp. 212–220.
[78] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
[79] D. P. Kingma and M. Welling, "An introduction to variational autoencoders," CoRR, vol. abs/1906.02691, 2019.
[80] Z. Li, Y. Zhao, J. Han, Y. Su, R. Jiao, X. Wen, and D. Pei, "Multivariate time series anomaly detection and interpretation using hierarchical inter-metric and temporal embedding," in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, ser. KDD '21, 2021, pp. 3220–3230.
[81] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, and D. Pei, "Robust anomaly detection for multivariate time series through stochastic recurrent neural network," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD '19, 2019, pp. 2828–2837.
[82] W. Zhang, C. Zhang, and F. Tsung, "GRELEN: Multivariate time series anomaly detection from the perspective of graph relational learning," in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, 7 2022, pp. 2390–2397, main track.
[83] W. Chen, L. Tian, B. Chen, L. Dai, Z. Duan, and M. Zhou, "Deep variational graph convolutional recurrent network for multivariate time series anomaly detection," in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 162, 17–23 Jul 2022, pp. 3621–3633.
[84] S. N. Shukla and B. M. Marlin, "Multi-time attention networks for irregularly sampled time series," in Proceedings of the ICLR, 2021.
[85] S. C.-X. Li and B. Marlin, "Learning from irregularly-sampled time series: A missing data perspective," in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119, 13–18 Jul 2020, pp. 5937–5946.
[86] S. N. Shukla and B. M. Marlin, "Heteroscedastic temporal variational autoencoder for irregularly sampled time series," CoRR, vol. abs/2107.11350, 2021.
[87] Z. Wang, X. Xu, G. Trajcevski, W. Zhang, T. Zhong, and F. Zhou, "Learning latent seasonal-trend representations for time series forecasting," in Advances in Neural Information Processing Systems, 2022.
[88] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[89] P. Dhariwal and A. Nichol, "Diffusion models beat GANs on image synthesis," Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021.
[90] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, Y. Shao, W. Zhang, B. Cui, and M.-H. Yang, "Diffusion models: A comprehensive survey of methods and applications," arXiv preprint arXiv:2209.00796, 2022.
[91] H. Cao, C. Tan, Z. Gao, G. Chen, P.-A. Heng, and S. Z. Li, "A survey on generative diffusion model," arXiv preprint arXiv:2209.02646, 2022.
[92] C. Luo, "Understanding diffusion models: A unified perspective," arXiv preprint arXiv:2208.11970, 2022.
[93] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in International Conference on Machine Learning. PMLR, 2015, pp. 2256–2265.
[94] Y. Song and S. Ermon, "Generative modeling by estimating gradients of the data distribution," Advances in Neural Information Processing Systems, vol. 32, 2019.
[95] Y. Song and S. Ermon, "Improved techniques for training score-based generative models," Advances in Neural Information Processing Systems, vol. 33, pp. 12438–12448, 2020.
[96] Y. Song, C. Durkan, I. Murray, and S. Ermon, "Maximum likelihood training of score-based diffusion models," Advances in Neural Information Processing Systems, vol. 34, pp. 1415–1428, 2021.
[97] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," arXiv preprint arXiv:2011.13456, 2020.
[98] Y. Tashiro, J. Song, Y. Song, and S. Ermon, "CSDI: Conditional score-based diffusion models for probabilistic time series imputation," Advances in Neural Information Processing Systems, vol. 34, pp. 24804–24816, 2021.
[99] K. Rasul, C. Seward, I. Schuster, and R. Vollgraf, "Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting," in International Conference on Machine Learning. PMLR, 2021, pp. 8857–8868.
[100] Y. Li, X. Lu, Y. Wang, and D. Dou, "Generative time series forecasting with diffusion, denoise, and disentanglement," NeurIPS, 2022.
[101] Y. Chen, C. Zhang, M. Ma, Y. Liu, R. Ding, B. Li, S. He, S. Rajmohan, Q. Lin, and D. Zhang, "ImDiffusion: Imputed diffusion models for multivariate time series anomaly detection," arXiv preprint arXiv:2307.00754, 2023.
[102] J. M. L. Alcaraz and N. Strodthoff, "Diffusion-based time series imputation and forecasting with structured state space models," arXiv preprint arXiv:2208.09399, 2022.
[103] Z. Wang, Q. Wen, C. Zhang, L. Sun, and Y. Wang, "DiffLoad: Uncertainty quantification in load forecasting with diffusion model," arXiv preprint arXiv:2306.01001, 2023.
[104] H. Wen, Y. Lin, Y. Xia, H. Wan, Q. Wen, R. Zimmermann, and Y. Liang, "DiffSTG: Probabilistic spatio-temporal graph forecasting with denoising diffusion models," arXiv preprint arXiv:2301.13629, 2023.
[105] J.-Y. Franceschi, A. Dieuleveut, and M. Jaggi, "Unsupervised scalable representation learning for multivariate time series," in Advances in Neural Information Processing Systems, vol. 32, 2019.
[106] S. Tonekaboni, D. Eytan, and A. Goldenberg, "Unsupervised representation learning for time series with temporal neighborhood coding," CoRR, vol. abs/2106.00750, 2021.
[107] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, "Supervised contrastive learning," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 18661–18673.
[108] H. Yèche, G. Dresdner, F. Locatello, M. Hüser, and G. Rätsch, "Neighborhood contrastive learning applied to online patient monitoring," in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 139, 18–24 Jul 2021, pp. 11964–11974.
[109] A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," CoRR, vol. abs/1807.03748, 2018.
[110] T. Schneider, C. Qiu, M. Kloft, D. Aspandi-Latif, S. Staab, S. Mandt, and M. Rudolph, "Detecting anomalies within time series using local neural transformations," CoRR, vol. abs/2202.03944, 2022.
[111] T. Pranavan, T. Sim, A. Ambikapathi, and S. Ramasamy, "Contrastive predictive coding for anomaly detection in multi-variate time series data," CoRR, vol. abs/2202.03639, 2022.
[112] S. Deldari, D. V. Smith, H. Xue, and F. D. Salim, "Time series change point detection with self-supervised contrastive predictive coding," in Proceedings of the Web Conference 2021, ser. WWW '21, New York, NY, USA, 2021, pp. 3124–3135.
[113] K. Zhang, Q. Wen, C. Zhang, L. Sun, and Y. Liu, "Time series anomaly detection using skip-step contrastive predictive coding," in NeurIPS 2022 Workshop: Self-Supervised Learning - Theory and Practice, 2022.
[114] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," CoRR, vol. abs/1803.01271, 2018.
[125] …for time series via time-frequency consistency," in Proceedings of Neural Information Processing Systems, NeurIPS, 2022.
[126] Y. Yang, C. Zhang, T. Zhou, Q. Wen, and L. Sun, "DCdetector: Dual attention contrastive representation learning for time series anomaly detection," in Proc. 29th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2023), Long Beach, CA, Aug. 2023.
[127] X. Yang, Z. Zhang, and R. Cui, "TimeCLR: A self-supervised contrastive learning framework for univariate time series representation," Knowledge-Based Systems, vol. 245, p. 108606, 2022.
[128] D. Kiyasseh, T. Zhu, and D. A. Clifton, "CLOCS: Contrastive learning of cardiac signals across space, time, and patients," in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 139, 18–24 Jul 2021, pp. 5606–5615.
[129] Y. Ozyurt, S. Feuerriegel, and C. Zhang, "Contrastive learning for unsupervised domain adaptation of time series," in The Eleventh International Conference on Learning Representations, 2023.
[130] K. Zhang, Y. Liu, Y. Gu, J. Wang, and X. Ruan, "Valve stiction detection using multitimescale feature consistent constraint for time-series data," IEEE/ASME Transactions on Mechatronics, pp. 1–12, 2022.
[131] M. Hou, C. Xu, Z. Li, Y. Liu, W. Liu, E. Chen, and J. Bian, "Multi-granularity residual learning with confidence estimation for time series prediction," in Proceedings of the ACM Web Conference 2022, ser. WWW '22, 2022, pp. 112–121.
[132] H. Lee, E. Seong, and D.-K. Chae, "Self-supervised learning with attention-based latent signal augmentation for sleep staging with limited labeled data," in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, 7 2022, pp. 3868–3876, main track.
[133] T. Wang and P. Isola, "Understanding contrastive representation learning through alignment and uniformity on the hypersphere,"
[115] M. Hou, C. Xu, Y. Liu, W. Liu, J. Bian, L. Wu, Z. Li, E. Chen, in Proceedings of the 37th International Conference on Machine Learn-
and T.-Y. Liu, “Stock trend prediction with multi-granularity ing, ser. ICML’20, 2020.
data: A contrastive learning approach with adaptive fusion,” [134] Y. Li, P. Hu, Z. Liu, D. Peng, J. T. Zhou, and X. Peng, “Contrastive
Proceedings of the 30th ACM International Conference on Information clustering,” Proceedings of the AAAI Conference on Artificial Intelli-
& Knowledge Management, p. 700–709, 2021. gence, vol. 35, no. 10, pp. 8547–8555, May 2021.
[116] E. Eldele, M. Ragab, Z. Chen, M. Wu, C. K. Kwoh, X. Li, and
[135] J. Li, P. Zhou, C. Xiong, and S. Hoi, “Prototypical contrastive
C. Guan, “Time-series representation learning via temporal and
learning of unsupervised representations,” in International Con-
contextual contrasting,” in Proceedings of the Thirtieth International
ference on Learning Representations, 2021.
Joint Conference on Artificial Intelligence, IJCAI-21, 8 2021, pp. 2352–
2359. [136] G. Li, B. Choi, J. Xu, S. S Bhowmick, K.-P. Chun, and G. L.-
H. Wong, “Shapenet: A shapelet-neural network approach for
[117] E. Eldele, M. Ragab, Z. Chen, M. Wu, C.-K. Kwoh, X. Li, and
multivariate time series classification,” Proceedings of the AAAI
C. Guan, “Self-supervised contrastive representation learning
Conference on Artificial Intelligence, vol. 35, no. 9, pp. 8375–8383,
for semi-supervised time-series classification,” IEEE Transactions
May 2021.
on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp.
15 604–15 618, 2023. [137] X. Zhang, Y. Gao, J. Lin, and C.-T. Lu, “Tapnet: Multivariate
[118] Z. Yue, Y. Wang, J. Duan, T. Yang, C. Huang, Y. Tong, and time series classification with attentional prototypical network,”
B. Xu, “TS2Vec: Towards universal representation of time series,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34,
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 04, pp. 6845–6852, Apr. 2020.
no. 8, pp. 8980–8987, Jun. 2022. [138] A. Dorle, F. Li, W. Song, and S. Li, “Learning discriminative
[119] J. Pöppelbaum, G. S. Chadha, and A. Schwung, “Contrastive virtual sequences for time series classification,” in Proceedings
learning based self-supervised time-series analysis,” Applied Soft of the 29th ACM International Conference on Information & Knowl-
Computing, vol. 117, p. 108397, 2022. edge Management, ser. CIKM ’20, New York, NY, USA, 2020, p.
[120] T. Peng, C. Shen, S. Sun, and D. Wang, “Fault feature extractor 2001–2004.
based on bootstrap your own latent and data augmentation [139] Q. Meng, H. Qian, Y. Liu, L. Cui, Y. Xu, and Z. Shen, “MHCCL:
algorithm for unlabeled vibration signals,” IEEE Transactions on masked hierarchical cluster-wise contrastive learning for multi-
Industrial Electronics, vol. 69, no. 9, pp. 9547–9555, 2022. variate time series,” in AAAI, 2023, pp. 9153–9161.
[121] B. K. Iwana and S. Uchida, “An empirical survey of data aug- [140] X. Wu, L. Xiao, Y. Sun, J. Zhang, T. Ma, and L. He, “A survey
mentation for time series classification with neural networks,” of human-in-the-loop for machine learning,” Future Generation
PLOS ONE, vol. 16, no. 7, p. e0254841, jul 2021. Computer Systems, vol. 135, pp. 364–381, 2022.
[122] K. Wickstrøm, M. Kampffmeyer, K. Øyvind Mikalsen, and [141] Y. Chen and D. Zhang, “Integration of knowledge and data in
R. Jenssen, “Mixing up contrastive learning: Self-supervised rep- machine learning,” arXiv preprint arXiv:2202.10337v2, 2022.
resentation learning for time series,” Pattern Recognition Letters, [142] P. Shi, W. Ye, and Z. Qin, “Self-supervised pre-training for time
vol. 155, pp. 54–61, 2022. series classification,” in 2021 International Joint Conference on Neu-
[123] G. Woo, C. Liu, D. Sahoo, A. Kumar, and S. Hoi, “CoST: Con- ral Networks (IJCNN), 2021, pp. 1–8.
trastive learning of disentangled seasonal-trend representations [143] M. T. Nonnenmacher, L. Oldenburg, I. Steinwart, and D. Reeb,
for time series forecasting,” in International Conference on Learning “Utilizing expert features for contrastive learning of time-series
Representations, 2022. representations,” in Proceedings of the 39th International Conference
[124] L. Yang and S. Hong, “Unsupervised time-series representation on Machine Learning, ser. Proceedings of Machine Learning Re-
learning with iterative bilinear temporal-spectral fusion,” in Pro- search, vol. 162, 17–23 Jul 2022, pp. 16 969–16 989.
ceedings of the 39th International Conference on Machine Learning, [144] H. Zhang, J. Wang, Q. Xiao, J. Deng, and Y. Lin, “SleepPriorCL:
ser. Proceedings of Machine Learning Research, vol. 162, 17–23 Contrastive representation learning with prior knowledge-based
Jul 2022, pp. 25 038–25 054. positive mining and adaptive temperature for sleep staging,”
[125] Zhang, Xiang and Zhao, Ziyuan and Tsiligkaridis, Theodoros 2021.
and Zitnik, Marinka, “Self-supervised contrastive pre-training [145] T.-S. Chen, W.-C. Hung, H.-Y. Tseng, S.-Y. Chien, and M.-H. Yang,

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SAO PAULO. Downloaded on August 12,2024 at 19:21:50 UTC from IEEE Xplore. Restrictions apply.
© 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2024.3387317

18

“Incremental false negative detection for contrastive learning,” in of the 34th International Conference on Neural Information Processing
International Conference on Learning Representations, 2022. Systems, ser. NIPS’20, Red Hook, NY, USA, 2020.
[146] K. Zhang, R. Cai, and Y. Liu, “Industrial fault detection using [167] Y. Li, H. Wang, J. Li, C. Liu, and J. Tan, “Act: Adversarial
contrastive representation learning on time-series data,” IFAC- convolutional transformer for time series forecasting,” in 2022
PapersOnLine, vol. 56, no. 2, pp. 3197–3202, 2023, 22nd IFAC International Joint Conference on Neural Networks (IJCNN), 2022,
World Congress. pp. 1–8.
[147] Y. Zhang, X. Zhang, J. Li, R. C. Qiu, H. Xu, and Q. Tian, “Semi- [168] B. Zhou, S. Liu, B. Hooi, X. Cheng, and J. Ye, “Beatgan: Anoma-
supervised contrastive learning with similarity co-calibration,” lous rhythm detection using adversarially generated time series,”
IEEE Transactions on Multimedia, vol. 25, pp. 1749–1759, 2023. in Proceedings of the Twenty-Eighth International Joint Conference on
[148] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- Artificial Intelligence, IJCAI-19, 7 2019, pp. 4433–4439.
Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adver- [169] K. Aggarwal, S. Joty, L. Fernandez-Luque, and J. Srivastava, “Ad-
sarial nets,” in Advances in Neural Information Processing Systems, versarial unsupervised representation learning for activity time-
vol. 27, 2014. series,” Proceedings of the AAAI Conference on Artificial Intelligence,
[149] Z. Wang, Q. She, and T. E. Ward, “Generative adversarial net- vol. 33, no. 01, pp. 834–841, Jul. 2019.
works in computer vision: A survey and taxonomy,” ACM Com- [170] K. Hundman, V. Constantinou, C. Laporte, I. Colwell, and
put. Surv., vol. 54, no. 2, feb 2021. T. Soderstrom, “Detecting spacecraft anomalies using lstms and
[150] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image nonparametric dynamic thresholding,” in Proceedings of the 24th
translation with conditional adversarial networks,” in 2017 IEEE ACM SIGKDD International Conference on Knowledge Discovery &
Conference on Computer Vision and Pattern Recognition (CVPR), Data Mining, ser. KDD ’18, New York, NY, USA, 2018, p. 387–395.
2017, pp. 5967–5976. [171] J. Goh, S. Adepu, K. N. Junejo, and A. Mathur, “A dataset to sup-
[151] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to- port research in the design of secure water treatment systems,”
image translation using cycle-consistent adversarial networks,” in Critical Information Infrastructures Security: 11th International
in 2017 IEEE International Conference on Computer Vision (ICCV), Conference, CRITIS 2016, Paris, France, October 10–12, 2016, Revised
2017, pp. 2242–2251. Selected Papers 11. Springer, 2017, pp. 88–99.
[152] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN [172] C. M. Ahmed, V. R. Palleti, and A. P. Mathur, “Wadi: A water
training for high fidelity natural image synthesis,” in International distribution testbed for research in the design of secure cyber
Conference on Learning Representations, 2019. physical systems,” in Proceedings of the 3rd International Workshop
[153] T. Karras, S. Laine, and T. Aila, “A style-based generator archi- on Cyber-Physical Systems for Smart Water Networks, ser. CySWA-
tecture for generative adversarial networks,” in 2019 IEEE/CVF TER ’17, New York, NY, USA, 2017, p. 25–28.
Conference on Computer Vision and Pattern Recognition (CVPR), [173] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and
2019, pp. 4396–4405. W. Zhang, “Informer: Beyond efficient transformer for long se-
[154] O. Mogren, “C-RNN-GAN: continuous recurrent neural net- quence time-series forecasting,” Proceedings of the AAAI Confer-
works with adversarial training,” CoRR, vol. abs/1611.09904, ence on Artificial Intelligence, vol. 35, no. 12, pp. 11 106–11 115, May
2016. 2021.
[155] J. Yoon, D. Jarrett, and M. van der Schaar, Time-Series Generative [174] European Commission’s STETIS program, “30 years of european
Adversarial Networks, Red Hook, NY, USA, 2019. wind generation,” https://www.kaggle.com/datasets/sohier/
[156] X. Li, V. Metsis, H. Wang, and A. H. H. Ngu, “TTS-GAN: A 30-years-of-european-wind-generation.
transformer-based time-series generative adversarial network,” [175] UCI Machine Learning Repository, “Electricityloaddia-
in Artificial Intelligence in Medicine: 20th International Conference on grams20112014 data set,” https://archive.ics.uci.edu/ml/
Artificial Intelligence in Medicine, AIME 2022, Halifax, NS, Canada, datasets/ElectricityLoadDiagrams20112014, 2011.
June 14–17, 2022, Proceedings, Berlin, Heidelberg, 2022, p. 133–143. [176] Centers for Disease Control and Prevention, “National, regional,
[157] Y. Luo, Y. Zhang, X. Cai, and X. Yuan, “E²gan: End-to-end genera- and state level outpatient illness and viral surveillance,” https:
tive adversarial network for multivariate time series imputation,” //gis.cdc.gov/grasp/fluview/fluportaldashboard.html.
in Proceedings of the Twenty-Eighth International Joint Conference on [177] Max-Planck-Institut fur Biogeochemie, Jena, “Weather data set,”
Artificial Intelligence, IJCAI-19, 7 2019, pp. 3094–3100. https://www.bgc-jena.mpg.de/wetter/.
[158] A. Seyfi, J.-F. Rajotte, and R. T. Ng, “Generating multivariate time [178] California Department of Transportation, “Traffic data set,” http:
series with COmmon source coordinated GAN (COSCI-GAN),” //pems.dot.ca.gov/.
in Advances in Neural Information Processing Systems, 2022. [179] G. Lai, W.-C. Chang, Y. Yang, and H. Liu, “Modeling long-
[159] P. Jeha, M. Bohlke-Schneider, P. Mercado, S. Kapoor, R. S. Nir- and short-term temporal patterns with deep neural networks,”
wan, V. Flunkert, J. Gasthaus, and T. Januschowski, “PSA-GAN: in The 41st International ACM SIGIR Conference on Research &
Progressive self attention GANs for synthetic time series,” in Development in Information Retrieval, ser. SIGIR ’18, New York,
International Conference on Learning Representations, 2022. NY, USA, 2018, p. 95–104.
[160] J. Jeon, J. KIM, H. Song, S. Cho, and N. Park, “GT-GAN: Gen- [180] National Renewable Energy Laboratory, “Solar power
eral purpose time series synthesis with generative adversarial data for integration studies,” https://www.nrel.gov/grid/
networks,” in Advances in Neural Information Processing Systems, solar-power-data.html, 2006.
2022. [181] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz,
[161] Y. Luo, X. Cai, Y. ZHANG, J. Xu, and Y. xiaojie, “Multivariate “A public domain dataset for human activity recognition using
time series imputation with generative adversarial networks,” in smartphones,” in The European Symposium on Artificial Neural
Advances in Neural Information Processing Systems, vol. 31, 2018. Networks, 2013.
[162] X. Miao, Y. Wu, J. Wang, Y. Gao, X. Mao, and J. Yin, “Generative [182] H. A. Dau, A. Bagnall, K. Kamgar, C.-C. M. Yeh, Y. Zhu,
semi-supervised learning for multivariate time series imputa- S. Gharghabi, C. A. Ratanamahatana, and E. Keogh, “The ucr
tion,” Proceedings of the AAAI Conference on Artificial Intelligence, time series archive,” IEEE/CAA Journal of Automatica Sinica, vol. 6,
vol. 35, no. 10, pp. 8983–8991, May 2021. no. 6, pp. 1293–1305, 2019.
[163] J. Xu, H. Wu, J. Wang, and M. Long, “Anomaly transformer: [183] A. Bagnall, H. A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom,
Time series anomaly detection with association discrepancy,” in P. Southam, and E. Keogh, “The uea multivariate time series
International Conference on Learning Representations, 2022. classification archive, 2018,” arXiv preprint arXiv:1811.00075, 2018.
[164] D. Zhu, D. Song, Y. Chen, C. Lumezanu, W. Cheng, B. Zong, [184] X. Chen, L. Deng, Y. Zhao, and K. Zheng, “Adversarial au-
J. Ni, T. Mizoguchi, T. Yang, and H. Chen, “Deep unsupervised toencoder for unsupervised time series anomaly detection and
binary coding networks for multivariate time series retrieval,” interpretation,” in Proceedings of the Sixteenth ACM International
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, Conference on Web Search and Data Mining, ser. WSDM ’23, New
no. 02, pp. 1403–1411, Apr. 2020. York, NY, USA, 2023, p. 267–275.
[165] Q. Ma, C. Chen, S. Li, and G. W. Cottrell, “Learning represen- [185] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, and S. Dust-
tations for incomplete time series clustering,” Proceedings of the dar, “Pyraformer: Low-complexity pyramidal attention for long-
AAAI Conference on Artificial Intelligence, vol. 35, no. 10, pp. 8837– range time series modeling and forecasting,” in International
8846, May 2021. Conference on Learning Representations, 2022.
[166] S. Wu, X. Xiao, Q. Ding, P. Zhao, Y. Wei, and J. Huang, “Adversar- [186] W. Chen, W. Wang, B. Peng, Q. Wen, T. Zhou, and L. Sun, “Learn-
ial sparse transformer for time series forecasting,” in Proceedings ing to rotate: Quaternion transformer for complicated periodical

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SAO PAULO. Downloaded on August 12,2024 at 19:21:50 UTC from IEEE Xplore. Restrictions apply.
© 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2024.3387317

19

time series forecasting,” in Proceedings of the 28th ACM SIGKDD [206] T. Wu, X. Wang, S. Qiao, X. Xian, Y. Liu, and L. Zhang, “Small
Conference on Knowledge Discovery and Data Mining, ser. KDD ’22, perturbations are enough: Adversarial attacks on time series
New York, NY, USA, 2022, p. 146–156. prediction,” Information Sciences, vol. 587, pp. 794–812, 2022.
[187] R.-G. Cirstea, C. Guo, B. Yang, T. Kieu, X. Dong, and S. Pan, “Tri- [207] F. Karim, S. Majumdar, and H. Darabi, “Adversarial attacks on
former: Triangular, variable-specific attentions for long sequence time series,” IEEE Transactions on Pattern Analysis and Machine
multivariate time series forecasting,” in Proceedings of the Thirty- Intelligence, vol. 43, no. 10, pp. 3309–3320, 2021.
First International Joint Conference on Artificial Intelligence, IJCAI-22, [208] Y. Zhuo, Z. Yin, and Z. Ge, “Attack and defense: Adversarial se-
7 2022, pp. 1994–2001, main Track. curity of data-driven fdc systems,” IEEE Transactions on Industrial
[188] M. A. Shabani, A. H. Abdi, L. Meng, and T. Sylvain, “Scale- Informatics, vol. 19, no. 1, pp. 5–19, 2023.
former: Iterative multi-scale refining transformers for time series [209] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and
forecasting,” in The Eleventh International Conference on Learning harnessing adversarial examples,” 2015.
Representations, 2023. [210] K. Lee, J. Kim, S. Chong, and J. Shin, “Making stochastic neural
[189] Y. Zhang and J. Yan, “Crossformer: Transformer utilizing cross- networks from deterministic ones,” 2017.
dimension dependency for multivariate time series forecasting,” [211] N. Carlini and D. Wagner, “Towards evaluating the robustness of
in The Eleventh International Conference on Learning Representations, neural networks,” in 2017 IEEE Symposium on Security and Privacy
2023. (SP), 2017, pp. 39–57.
[190] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long, “Timesnet: [212] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: A
Temporal 2d-variation modeling for general time series analysis,” simple and accurate method to fool deep neural networks,” in
in International Conference on Learning Representations, 2023. 2016 IEEE Conference on Computer Vision and Pattern Recognition
[191] H. Wang, J. Peng, F. Huang, J. Wang, J. Chen, and Y. Xiao, “MICN: (CVPR), 2016, pp. 2574–2582.
Multi-scale local and global context modeling for long-term series [213] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard,
forecasting,” in The Eleventh International Conference on Learning “Universal adversarial perturbations,” in 2017 IEEE Conference on
Representations, 2023. Computer Vision and Pattern Recognition (CVPR), 2017, pp. 86–94.
[192] X. Tang, H. Yao, Y. Sun, C. Aggarwal, P. Mitra, and S. Wang, “Joint [214] J. Uesato, B. O’Donoghue, P. Kohli, and A. van den Oord, “Adver-
modeling of local and global temporal dynamics for multivariate sarial risk and the dangers of evaluating against weak attacks,” in
time series forecasting with missing values,” Proceedings of the Proceedings of the 35th International Conference on Machine Learning,
AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 5956– ser. Proceedings of Machine Learning Research, vol. 80, 10–15 Jul
5963, Apr. 2020. 2018, pp. 5025–5034.
[193] X. Wang, K. Wang, and S. Lian, “A survey on face data augmen- [215] J. Li, L. Lyu, D. Iso, C. Chakrabarti, and M. Spranger, “MocoSFL:
tation for the training of deep neural networks,” Neural Comput. enabling cross-client collaborative self-supervised learning,” in
Appl., vol. 32, no. 19, p. 15503–15531, oct 2020. The Eleventh International Conference on Learning Representations,
[194] J. Gao, X. Song, Q. Wen, P. Wang, L. Sun, and H. Xu, “RobustTAD: 2023.
Robust time series anomaly detection via decomposition and [216] Z. Zhang, Z. Zhao, and Z. Lin, “Unsupervised representation
convolutional neural networks,” KDD Workshop on Mining and learning from pre-trained diffusion probabilistic models,” in
Learning from Time Series (KDD-MileTS’20), 2020. Advances in Neural Information Processing Systems, 2022.
[195] T. T. Um, F. M. J. Pfister, D. Pichler, S. Endo, M. Lang, S. Hirche, [217] S. Sørbø and M. Ruocco, “Navigating the metric maze: A taxon-
U. Fietzek, and D. Kulić, “Data augmentation of wearable sensor omy of evaluation metrics for anomaly detection in time series,”
data for parkinson’s disease monitoring using convolutional 2023.
neural networks,” in Proceedings of the 19th ACM International
Conference on Multimodal Interaction, ser. ICMI 2017, New York,
NY, USA, 2017, pp. 216–220.
[196] C. Deng, X. Ji, C. Rainey, J. Zhang, and W. Lu, “Integrating
machine learning with human knowledge,” iScience, vol. 23,
no. 11, p. 101656, 2020.
Kexin Zhang received the B.S. and the M.S. degrees in engineering from China University of Geosciences, Wuhan, China, in 2016 and 2019, respectively, and the Ph.D. degree in control engineering and science from Zhejiang University, Hangzhou, China, in 2023. His major research interests include intelligent time series analysis, deep learning, data-driven industrial fault diagnosis, and artificial intelligence security.

Qingsong Wen (SM'23) is the Head of AI Research & Chief Scientist at Squirrel Ai Learning. Previously, he worked at Alibaba, Qualcomm, and Marvell. He received his M.S. and Ph.D. degrees in Electrical and Computer Engineering from the Georgia Institute of Technology, Atlanta, USA. He has published over 80 top-ranked conference and journal papers, received the AAAI/IAAI 2023 Innovative Application Award, and won First Place in the 2022 ICASSP Grand Challenge Competition. He co-organized the Workshop on AI for Time Series (at KDD 2023, IJCAI 2023, ICDM 2023, SDM 2024, and AAAI 2024). He is an Associate Editor for Neurocomputing, a Guest Editor for the IEEE Internet of Things Journal, a Guest Editor for Applied Energy, and has regularly served as an Area Chair/SPC/PC member of major AI conferences including AAAI, IJCAI, KDD, ICDM, and ICASSP. His research interests include AI for time series, AI for education, and general machine learning.
Chaoli Zhang received the B.S. degree in information security from Nankai University, China, in 2015, and the Ph.D. degree in computer science and engineering from Shanghai Jiao Tong University, China, in 2020. She is currently a lecturer with the School of Computer Science and Technology, Zhejiang Normal University. She engaged in research at Alibaba DAMO Academy for nearly three years. Her research interests include time series analysis, algorithmic game theory and mechanism design, and networking. She won the gold prize of the ICASSP-SPGC root cause analysis for wireless network fault localization challenge in 2022. She was the recipient of the Google Anita Borg Scholarship in 2014 and the AAAI/IAAI Innovative Deployed Application Award in 2023.

Rongyao Cai received the B.S. degree in chemical engineering from Zhejiang University of Technology, Hangzhou, China, in 2022. He is studying for the M.S. degree in control engineering with the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. His major research interests include data mining and machine learning on industrial time-series data.

Ming Jin received the B.Eng. degree from the Hebei University of Technology, Tianjin, China, in 2017, and the M.Inf.Tech. degree from the University of Melbourne, Melbourne, Australia, in 2019. He is currently pursuing his Ph.D. degree in Computer Science at Monash University, Melbourne, Australia. His research focuses on graph neural networks (GNNs), time series analysis, data mining, and machine learning.

Yong Liu (M'11) received the B.S. degree in computer science and engineering and the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 2001 and 2007, respectively. He is currently a Professor with the Institute of Cyber Systems and Control, Department of Control Science and Engineering, Zhejiang University. He has authored or co-authored more than 30 research papers in machine learning, computer vision, information fusion, and robotics. His current research interests include machine learning, robotics vision, information processing, and granular computing.

James Y. Zhang is the managing director of the AI Forecast and Strategy Platform of Ant Group. Prior to his employment with Ant Group, he worked on finance-related AI at Bloomberg, spearheaded the creation of Bloomberg's GPU computation farm, and participated in the establishment of the AI branch of Bloomberg Labs. He obtained his Ph.D. degree in Electrical Engineering from the University of Ottawa, Canada, in 2006, and both his Master's and Bachelor's degrees from Zhejiang University, China, in 2000 and 1997, respectively. His industrial experience at startups and larger corporations spans various disciplines, such as image processing, natural language processing, time series analysis, high-speed hardware development, optical networks, operations research, biometrics, and financial systems, with extensive patents and publications.

Yuxuan Liang is currently an Assistant Professor at the Intelligent Transportation Thrust, also affiliated with the Data Science and Analytics Thrust, Hong Kong University of Science and Technology (Guangzhou). He is working on the research, development, and innovation of spatio-temporal data mining and AI, with a broad range of applications in smart cities. Before that, he obtained his PhD degree at the National University of Singapore. He has published over 50 peer-reviewed papers in refereed journals and conferences, such as TPAMI, TKDE, AI Journal, TMC, KDD, WWW, NeurIPS, and ICLR. Three of them were selected as the most influential IJCAI/KDD papers. He received the 23rd China Patent Excellence Award in 2022.

Guansong Pang is a tenure-track Assistant Professor of Computer Science in the School of Computing and Information Systems at Singapore Management University (SMU), Singapore. Before joining SMU, he was a Research Fellow with the Australian Institute for Machine Learning (AIML). He received a Ph.D. degree from the University of Technology Sydney, Australia, in 2019. His research interests lie in machine learning techniques and their applications, with a focus on handling abnormal and unknown data.

Dongjin Song received a Ph.D. degree from the University of California San Diego (UCSD) in 2016. Currently, he is an assistant professor in the School of Computing at the University of Connecticut (UConn). His research interests include machine learning, deep learning, data mining, and related applications for time series data and graph representation learning. Papers describing his research have been published at top-tier data science and artificial intelligence conferences, such as NeurIPS, ICML, ICLR, KDD, ICDM, SDM, AAAI, IJCAI, CVPR, and ICCV. He has co-organized the AI for Time Series (AI4TS) workshop at IJCAI 2022 and 2023 and the Mining and Learning from Time Series workshop at KDD 2022 and 2023. He has also served as a senior PC member for AAAI, IJCAI, and CIKM. He won the UConn Research Excellence Program (REP) Award in 2021. His research has been funded by NSF, USDA, Morgan Stanley, NEC Labs America, Travelers, etc.

Shirui Pan received a Ph.D. in computer science from the University of Technology Sydney (UTS), Ultimo, NSW, Australia. He is a Professor with the School of Information and Communication Technology, Griffith University, Australia. Prior to this, he was a Senior Lecturer with the Faculty of IT at Monash University. His research interests include data mining and machine learning. To date, Dr Pan has published over 100 research papers in top-tier journals and conferences, including TPAMI, TKDE, TNNLS, ICML, NeurIPS, and KDD. His research has attracted over 20,000 citations. His research received the 2024 IEEE CIS TNNLS Outstanding Paper Award and the 2020 IEEE ICDM Best Student Paper Award. He is recognized as one of the AI 2000 AAAI/IJCAI Most Influential Scholars in Australia (2021). He is an ARC Future Fellow and a Fellow of the Queensland Academy of Arts and Sciences (FQA).