Deep Learning for Time Series Anomaly Detection: A Survey
Time series anomaly detection is important for a wide range of research fields and applications, including financial markets, economics,
earth sciences, manufacturing, and healthcare. The presence of anomalies can indicate novel or unexpected events, such as production
faults, system defects, and heart palpitations, and is therefore of particular interest. The large size and complexity of patterns in time
series data have led researchers to develop specialised deep learning models for detecting anomalous patterns. This survey provides a
structured and comprehensive overview of state-of-the-art deep learning for time series anomaly detection. It provides a taxonomy
based on anomaly detection strategies and deep learning models. Aside from describing the basic anomaly detection techniques in each
category, their advantages and limitations are also discussed. Furthermore, this study includes examples of deep anomaly detection in
time series across various application domains in recent years. Finally, it summarises open research issues and challenges faced
when applying deep anomaly detection models to time series data.
CCS Concepts: • Computing methodologies → Anomaly detection; • General and reference → Surveys and overviews.
Additional Key Words and Phrases: Anomaly detection, Outlier detection, Time series, Deep learning, Multivariate time series,
Univariate time series
1 INTRODUCTION
The detection of anomalies, also known as outlier or novelty detection, has been an active research field in numerous
application domains since the 1960s [72]. Advances in computing have enabled the collection of big data and its use in
artificial intelligence (AI), supporting time series analysis, including the detection of anomalies. With
greater data availability and increasing algorithmic efficiency and computational power, time series analysis is increasingly
used to address business applications through forecasting, classification, and anomaly detection [57], [23]. Time series
anomaly detection (TSAD) has received increasing attention in recent years, because of increasing applicability in a
wide variety of domains, including urban management, intrusion detection, medical risk, and natural disasters.
Authors’ addresses: Zahra Zamanzadeh Darban, [email protected], Monash University, Melbourne, Victoria, Australia; Geoffrey I. Webb,
[email protected], Monash University, Melbourne, Victoria, Australia; Shirui Pan, [email protected], Griffith University, Gold Coast,
Queensland, Australia; Charu C. Aggarwal, [email protected], IBM T. J. Watson Research Center, Yorktown Heights, NY, USA; Mahsa Salehi,
[email protected], Monash University, Melbourne, Victoria, Australia.
Deep learning has become increasingly capable over the past few years of learning expressive representations of
complex time series, like multidimensional data with both spatial (intermetric) and temporal characteristics. In deep
anomaly detection, neural networks are used to learn feature representations or anomaly scores in order to detect
anomalies. Many deep anomaly detection models have been developed, providing significantly higher performance
than traditional time series anomaly detection methods in a range of real-world applications.
Although the field of anomaly detection has been explored in several literature surveys [26], [140], [24], [17], [20]
and some evaluation review papers exist [153], [101], there is only one survey on deep anomaly detection methods for
time series data [37]. However, that survey [37] does not cover the vast range of TSAD methods that have
emerged in recent years, such as DAEMON [33], TranAD [171], DCT-GAN [114], and Interfusion [117]. Additionally,
representation learning methods within the taxonomy of TSAD methodologies have not been addressed in that
survey. As a result, there is a need for a survey that enables researchers to identify important future directions of
research in TSAD and the methods that are suitable to various application settings. Specifically, this article makes the
following contributions:
• Taxonomy: We present a novel taxonomy of deep anomaly detection models for time series data. These
models are broadly classified into four categories: forecasting-based, reconstruction-based, representation-based
and hybrid methods. Each category is further divided into subcategories based on the deep neural network
architectures used. This taxonomy helps to characterise the models by their unique structural features and their
contribution to anomaly detection capabilities.
• Comprehensive Review: Our study provides a thorough review of the current state-of-the-art in time series
anomaly detection up to 2024. This review offers a clear picture of the prevailing directions and emerging trends
in the field, making it easier for readers to understand the landscape and advancements.
• Benchmarks and Datasets: We compile and describe the primary benchmarks and datasets used in this field.
Additionally, we categorise the datasets into a set of domains and provide hyperlinks to these datasets, facilitating
easy access for researchers and practitioners.
• Guidelines for Practitioners: Our survey includes practical guidelines for readers on selecting appropriate deep
learning architectures, datasets, and models. These guidelines are designed to assist researchers and practitioners
in making informed choices based on their specific needs and the context of their work.
• Fundamental Principles: We discuss the fundamental principles underlying the occurrence of different types
of anomalies in time series data. This discussion aids in understanding the nature of anomalies and how they can
be effectively detected.
• Evaluation Metrics and Interpretability: We provide an extensive discussion on evaluation metrics together
with guidelines for metric selection. Additionally, we include a detailed discussion on model interpretability to
help practitioners understand and explain the behaviour and decisions of TSAD models.
This article is organised as follows. In Section 2, we start by introducing preliminary definitions, which is followed
by a taxonomy of anomalies in time series. Section 3 discusses the application of deep anomaly detection models to
time series. Different deep models and their capabilities are then presented based on the main approaches (forecasting-
based, reconstruction-based, representation-based, and hybrid) and architectures of deep neural networks. Additionally,
Appendix D explores the applications of deep time series anomaly detection models in different domains. Finally, Section
5 provides several challenges in this field that can serve as future opportunities. An overview of publicly available and
commonly used datasets for the considered anomaly detection models can be found in Section 4.
2 BACKGROUND
A time series is a series of data points indexed sequentially over time. The most common form of time series is a
sequence of observations recorded over time [75]. Time series are often divided into univariate (one-dimensional) and
multivariate (multi-dimensional). These two types are defined in the following subsections. Thereafter, decomposable
components of the time series are outlined. Following that, we provide a taxonomy of anomaly types based on time
series’ components and characteristics.
A univariate time series (UTS) is an ordered sequence of real-valued observations:

$$X = (x_1, x_2, \ldots, x_t) \quad (1)$$

A multivariate time series (MTS) is a sequence of multi-dimensional observations:

$$X = (X_1, X_2, \ldots, X_t) \quad (2)$$

where $X_i = (x_i^1, x_i^2, \ldots, x_i^d)$ represents a data vector at time $i$, with each $x_i^j$ indicating the observation at time $i$ for the $j$-th dimension. A time series can typically be decomposed into the following four components:
• Secular trend: This is the long-term trend in the series, such as increasing, decreasing or stable. The secular
trend represents the general pattern of the data over time and does not have to be linear. The change in population
in a particular region over several years is an example of nonlinear growth or decay depending on various
dynamic factors.
• Seasonal variations: Depending on the month, weekday, or duration, a time series may exhibit a seasonal
pattern. Seasonality always occurs at a fixed frequency. For instance, a study of gas/electricity consumption
shows that the consumption curve does not follow a similar pattern throughout the year. Depending on the
season and the locality, the pattern is different.
Fig. 1. (a) An overview of different temporal anomalies plotted from the NeurIPS-TS dataset [107]. Global and contextual anomalies
occur in a point (coloured in blue). Seasonal, trend and shapelet can occur in a subsequence (coloured in red). (b) Intermetric and
temporal-intermetric anomalies in MTS. In this figure, metric 1 is power consumption, and metric 2 is CPU usage.
• Cyclical fluctuations: A cycle is defined as an extended deviation from the underlying series defined by
the secular trend and seasonal variations. Unlike seasonal effects, cyclical effects vary in onset and duration.
Examples include economic cycles such as booms and recessions.
• Irregular variations: This refers to random, irregular events. It is the residual after all the other components
are removed. A disaster such as an earthquake or flood can lead to irregular variations.
A time series can be mathematically described by estimating its four components separately, and each of them may
deviate from the normal behaviour.
2.4.1 Types of Anomalies. Anomalies in UTS and MTS can be classified as temporal, intermetric, or temporal-intermetric
anomalies [117]. In a time series, temporal anomalies can be compared with either their neighbours (local) or the whole
time series (global), and they present different forms depending on their behaviour [107]. There are several types of
temporal anomalies that commonly occur in UTS, all of which are shown in Fig. 1a. Temporal anomalies can also occur
in MTS and affect multiple or all dimensions. A subsequence anomaly may appear when an unusual
pattern of behaviour emerges over time, even though each individual observation may not be an outlier on its own. A
point anomaly, in contrast, is an unexpected event occurring at a single point in time, which may also be regarded as a very short sequence.
Different types of temporal anomalies are as follows (a minimal detection sketch is given after this list):
• Global: These are spikes in the series, i.e., point(s) with extreme values compared to the rest of the series.
A global anomaly, for instance, is an unusually large payment by a customer on a typical day. Given a
threshold, it can be described as:

$$|x_t - \hat{x}_t| > threshold \quad (3)$$

where $\hat{x}_t$ is the output of the model. If the difference between the output and the actual point value is greater than a
threshold, the point is recognised as an anomaly. An example of a global anomaly is shown on the left side
of Fig. 1a, where −6 deviates markedly from the rest of the time series.
• Contextual: A contextual anomaly is a point that deviates from its given context, defined here as the neighbouring
time points within a certain range of proximity. These anomalies are small glitches in sequential data, i.e., values
that deviate from their neighbours. A point that is normal in one context may be an anomaly in another. For
example, large transactions, such as those on Boxing Day, are considered normal, but not so on other days. The
formula is the same as that of a global anomaly, but the threshold for finding anomalies differs and is determined
by taking the context of the neighbours into account:

$$threshold = \lambda \cdot \mathrm{var}(X_{t-w:t}) \quad (4)$$

where $X_{t-w:t}$ refers to the context of the data point $x_t$ with a window of size $w$, var is the variance of the context
of the data point, and $\lambda$ is a controlling coefficient for the threshold. The second blue highlight in Fig. 1a is a contextual
anomaly that occurs locally in a specific context.
• Seasonal: In spite of normal shapes and trends of the time series, their seasonality is unusual compared to the
overall seasonality. An example is the number of customers in a restaurant during a week. Such a series has a
clear weekly seasonality, so it makes sense to look for deviations in this seasonality and process the anomalous
periods individually.
$$diss_S(S, \hat{S}) > threshold \quad (5)$$

where $diss_S$ is a function measuring the dissimilarity between two subsequences and $\hat{S}$ denotes the seasonality of
the expected subsequences. As demonstrated in the first red highlight of Fig. 1a, the seasonal anomaly changes
the frequency of a rise and drop of data in the particular segment.
• Trend: A trend anomaly is an event that causes a permanent shift in the mean of the data and produces a transition
in the trend of the time series. While this anomaly preserves the normal cycle and seasonality, it drastically alters the slope.
Trends can occasionally change direction, meaning they may go from increasing to decreasing and vice versa. As
an example, when a new song comes out, it becomes popular for a while, then it disappears from the charts like
the segment in Fig. 1a where the trend is changed and is assumed as a trend anomaly. It is likely that the trend
will restart in the future.
$$diss_T(T, \hat{T}) > threshold \quad (6)$$
where 𝑇ˆ is the normal trend.
• Shapelet: A shapelet is a distinctive time series subsequence pattern. A shapelet anomaly is a subsequence
whose pattern or cycle differs from the usual pattern found in the rest of the sequence. Variations in economic
conditions, like the total demand for and supply of goods and services, are often the cause of such fluctuations.
In the short run, these changes lead to periods of expansion and recession.
$$diss_C(C, \hat{C}) > threshold \quad (7)$$

where $\hat{C}$ specifies the cycle or shape of the expected subsequences. An example is the last highlight in Fig. 1a,
where the shape of the segment changes due to some fluctuations.
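The following numpy sketch illustrates the threshold tests of Eqs. (3)–(7). All names are illustrative; plain Euclidean distance stands in for the dissimilarity functions $diss_S$, $diss_T$ and $diss_C$, although measures such as DTW (discussed next) are common substitutes.

```python
import numpy as np

def global_anomalies(x, x_hat, threshold):
    """Eq. (3): flag points whose deviation from the model output exceeds a fixed threshold."""
    return np.abs(x - x_hat) > threshold

def contextual_anomalies(x, x_hat, w, lam):
    """Eq. (4)-style test: the threshold scales with the variance of the local context window."""
    flags = np.zeros(len(x), dtype=bool)
    for t in range(w, len(x)):
        context = x[t - w:t]
        flags[t] = np.abs(x[t] - x_hat[t]) > lam * np.var(context)
    return flags

def subsequence_anomaly(s, s_expected, threshold):
    """Eqs. (5)-(7): compare an observed subsequence with its expected seasonality,
    trend, or shape; Euclidean distance is used here as the dissimilarity function."""
    return np.linalg.norm(s - s_expected) > threshold
```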
Having discussed various types of anomalies, we understand that these can often be characterised by the distance
between the actual subsequence observed and the expected subsequence. In this context, dynamic time warping (DTW)
[134], which optimally aligns two time series, is a valuable method for measuring this dissimilarity. Consequently,
DTW’s ability to accurately calculate temporal alignments makes it a suitable tool for anomaly detection applications,
as evidenced in several studies [15], [161]. Moreover, MTS is composed of multiple dimensions (a.k.a, metrics [117, 163])
that each describe a different aspect of a complex entity. Spatial dependencies (correlations) among dimensions within
an entity, also known as intermetric dependencies, can be linear or nonlinear. MTS would exhibit a wide range of
anomalous behaviour if these correlations were broken. An example is shown in the left part of Fig. 1b: the correlation
between power consumption in the first dimension (metric 1) and CPU usage in the second dimension (metric 2)
is positive, but it breaks at around the 100th second. Such an anomaly is named an intermetric anomaly in
this study and can be formalised as:
$$\max_{\forall j,k \in D,\ j \neq k} diss_{corr}\left(\mathrm{Corr}(X^j, X^k),\ \mathrm{Corr}\big(X^j_{t+\delta t_j\, :\, t+w+\delta t_j},\ X^k_{t+\delta t_k\, :\, t+w+\delta t_k}\big)\right) > threshold \quad (8)$$
where 𝑋 𝑗 and 𝑋 𝑘 are different dimensions of the MTS, Corr denotes the correlation function that measures the
relationship between two dimensions, 𝛿𝑡 𝑗 and 𝛿𝑡𝑘 are time shifts that adjust the comparison windows for dimensions 𝑗
and 𝑘, accommodating asynchronous events or delays between observations, 𝑡 is the starting point of the time window,
𝑤 is the width of the time window, indicating the duration over which correlations are assessed, disscorr is a function
that quantifies the divergence in correlation between the standard, long-term measurement and the dynamic, short-term
measurement within the specified window, threshold is a predefined limit that determines when the divergence in
correlations signifies an anomaly, and 𝐷 is the set of all dimensions within the MTS, with the comparison conducted
between every unique pair ( 𝑗, 𝑘) where 𝑗 ≠ 𝑘.
Dimensionality reduction techniques, such as selecting a subset of critical dimensions based on domain knowledge
or preliminary analysis, help manage the computational complexity that increases with the number of dimensions.
In essence, when the correlation between two normally correlated dimensions $X^j$ and $X^k$ deteriorates within the
window $t:t+w$, i.e., the coefficient deviates from its normal value by more than the threshold, an intermetric anomaly is flagged.
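The numpy sketch below illustrates the idea behind Eq. (8) under simplifying assumptions: the time shifts $\delta t_j$, $\delta t_k$ are set to zero, Pearson correlation is used for Corr, and the absolute difference of coefficients stands in for $diss_{corr}$.

```python
import numpy as np

def intermetric_anomalies(X, w, threshold):
    """Flag windows whose pairwise correlations diverge from the long-term pattern.
    X: array of shape (T, d), an MTS with d metrics; assumes non-constant windows."""
    T, d = X.shape
    flags = np.zeros(T, dtype=bool)
    base = np.corrcoef(X.T)              # long-term ("normal") correlations, shape (d, d)
    for t in range(T - w):
        win = np.corrcoef(X[t:t + w].T)  # short-term correlations inside the window
        div = np.abs(base - win)         # diss_corr: absolute change in each coefficient
        np.fill_diagonal(div, 0.0)
        if div.max() > threshold:
            flags[t:t + w] = True
    return flags
```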
Intermetric-temporal anomalies introduce added complexity and challenges in anomaly detection; however, they
occasionally facilitate easier detection across temporal or various dimensional perspectives due to their simultaneous
violation of intermetric and temporal dependencies, as illustrated on the right side of Fig. 1b.
[Figure: taxonomy of deep neural network architectures used in TSAD models (TCN, ResNet, LSTM, Bi-LSTM, GRU, SAE, DAE, CAE, VAE, GCN, GAT), together with the overall processing pipeline of preprocessing, learning/inference, and scoring, in either a step-by-step or an end-to-end fashion.]
3.1.1 Temporal/Spatial. With a UTS as input, a model can capture temporal information (i.e., pattern), while with a
MTS as input, it can learn normality through both temporal and spatial dependencies. Moreover, if the model input is
an MTS in which spatial dependencies are captured, the model can also detect intermetric anomalies (shown in Fig. 1b).
3.1.2 Learning Schemes. In practice, training data tends to have a very small number of anomalies that are labelled. As
a consequence, most of the models attempt to learn the representation or features of normal data. Based on anomaly
definitions, anomalies are then detected by finding deviations from normal data. There are four learning schemes in the
recent deep models for anomaly detection: unsupervised, supervised, semi-supervised, and self-supervised. These are
based on the availability (or lack) of labelled data points. The supervised method learns the boundaries between
anomalous and normal data from the labels in the training set. It can determine an
appropriate threshold value for classifying all timestamps as anomalous if the anomaly score (Section
3.1) assigned to those timestamps exceeds the threshold. The problem with this method is that it is not applicable to
many real-world applications because anomalies are often unknown or improperly labelled. In contrast, the unsupervised
approach uses no labels and makes no distinction between training and testing datasets. These techniques are the most
flexible since they rely exclusively on intrinsic features of the data. They are useful in streaming applications because
they do not require labels for training and testing. Despite these advantages, researchers may encounter difficulties
evaluating anomaly detection models using unsupervised methods. The anomaly detection problem is typically treated
as an unsupervised learning problem due to the inherently unlabelled nature of historical data and the unpredictable
nature of anomalies. Semi-supervised anomaly detection in time series data may be utilised in cases where the dataset
only consists of labelled normal data, unlike supervised methods that require a fully labelled dataset of both normal and
anomalous points. Unlike unsupervised methods, which detect anomalies without any labelled data, semi-supervised
TSAD relies on labelled normal data to define normal patterns and detect deviations as anomalies. This approach is
distinct from self-supervised learning, where the model generates its own supervisory signal from the input data without
needing explicit labels.
3.1.3 Input. A model may take an individual point (i.e., a time step) or a window (i.e., a sequence of time steps
containing historical information) as an input. Windows can be used in order, also called sliding windows, or shuffled
without regard to the order, depending on the application. To address the challenges of comparing subsequences
rather than individual points, many models operate on representations of subsequences (windows) instead of raw data,
employing sliding windows that carry the history of previous time steps and thus preserve the order of subsequences
within the time series. Sliding window extraction is performed in the preprocessing phase, after other operations such
as imputing missing values, downsampling or upsampling the data, and data normalisation, as sketched below.
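A minimal numpy sketch of such a preprocessing pipeline follows. The forward-fill imputation and z-score normalisation are illustrative stand-ins for whatever operations a particular model requires, and the code assumes the first value of the series is present.

```python
import numpy as np

def make_windows(x, w, stride=1):
    """Extract sliding windows of length w, preserving temporal order."""
    return np.stack([x[i:i + w] for i in range(0, len(x) - w + 1, stride)])

def preprocess(x, w):
    x = np.asarray(x, dtype=float)
    # 1. Impute missing values (simple forward fill).
    mask = np.isnan(x)
    idx = np.where(~mask, np.arange(len(x)), 0)
    np.maximum.accumulate(idx, out=idx)
    x = x[idx]
    # 2. Normalise (z-score over the whole series).
    x = (x - x.mean()) / (x.std() + 1e-8)
    # 3. Extract sliding windows last, as described above.
    return make_windows(x, w)
```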
3.1.4 Interpretability. Interpretation provides the cause of an anomalous observation. Interpretability is essential
when anomaly detection is used as a diagnostic tool, since it facilitates troubleshooting and analysing anomalies. MTS
are challenging to interpret, and stochastic deep learning complicates the process even further. A typical procedure for
troubleshooting entity anomalies involves searching for the top dimensions that differ most from previously observed
behaviour. It is therefore possible to interpret a detected entity anomaly by analysing several dimensions
with the highest anomaly scores.
3.1.5 Point/Subsequence anomaly. The model can detect either point anomalies or subsequence anomalies. A point
anomaly is a point that is unusual when compared with the rest of the dataset. Subsequence anomalies occur when
consecutive observations have unusual cooperative behaviour, although each observation is not necessarily an outlier
on its own. Different types of anomalies are described in Section 2.4 and illustrated in Fig. 1a and Fig. 1b.
3.1.6 Stochasticity. As shown in Tables 1 and 2, we investigate the stochasticity of anomaly detection models as well.
Deterministic models can accurately predict future events without relying on randomness: given all the necessary
data, they produce exactly the same results for a given set of inputs. Stochastic models, by contrast, can handle
uncertainty in the inputs; through the use of a random component as an input, they can account for a certain level of
unpredictability or randomness.
3.1.7 Incremental. This is a machine learning paradigm in which the model’s knowledge extends whenever one or more
new observations appear. It specifies a dynamic learning strategy that can be used if training data becomes available
gradually. The goal of incremental learning is to adapt a model to new data while preserving its past knowledge.
Moreover, the deep model processes the input in a step-by-step or end-to-end fashion (see Fig. 3). In the first category
(step-by-step), there is a learning module followed by an anomaly scoring module. It is possible to combine the two
modules in the second category to learn anomaly scores using neural networks as an end-to-end process. An output
of these models may be anomaly scores or binary labels for inputs. Contrary to algorithms whose objective is to
improve representations, DevNet [141], for example, introduces deviation networks to detect anomalies by leveraging a
few labelled anomalies to achieve end-to-end learning for optimizing anomaly scores. End-to-end models in anomaly
detection are designed to directly output the final classification of data points or subsequences as normal or anomalous,
which includes the explicit labelling of these points. In contrast, step-by-step models typically generate intermediate
outputs at each stage of the analysis, such as anomaly scores for each subsequence or point. These scores then
require additional post-processing, such as thresholding, to determine if an input is anomalous. Common methods for
establishing these thresholds include Nonparametric Dynamic Thresholding (NDT) [92] and Peaks-Over-Threshold
(POT) [158], which help convert scores into final labels.

[Table excerpt, flattened in extraction — forecasting-based RNN models (Section 3.2.1): LSTM RNN [19] (2016, semi-supervised, point input, subsequence anomalies); LSTM-based [56] (2019, unsupervised, window input); TCQSA [118] (2020, supervised, point input).]
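As a simplified stand-in for NDT or POT, the sketch below converts per-point anomaly scores into binary labels with a plain quantile threshold; the function name and the choice of quantile are illustrative.

```python
import numpy as np

def labels_from_scores(scores, q=0.99):
    """Post-process anomaly scores into binary labels with a quantile threshold,
    a lightweight alternative to NDT or POT."""
    threshold = np.quantile(scores, q)
    return (scores > threshold).astype(int), threshold
```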
An anomaly score is mostly defined based on a loss function. In most of the reconstruction-based approaches,
reconstruction probability is used, and in forecasting-based approaches, the prediction error is used to define an
anomaly score. An anomaly score indicates the degree of an anomaly in each data point. Anomaly detection can be
accomplished by ranking data points according to their anomaly scores ($AS$) and deriving a decision based on a threshold value:

$$y_t = \begin{cases} \text{anomalous}, & AS_t > threshold \\ \text{normal}, & \text{otherwise} \end{cases}$$

Evaluation metrics used in these papers are introduced in Appendix A.
3.2.1 Recurrent Neural Networks (RNN). RNNs have internal memory, allowing them to process variable-length input
sequences and retain temporal dynamics [2, 167]. An example of a simple RNN architecture is shown in Fig. 4a. Recurrent
units take the points of the input window 𝑋𝑡 −𝑤:𝑡 −1 and forecast the next timestamp 𝑥𝑡′ . The input sequence is processed
iteratively, timestamp by timestamp. Given the input $x_{t-1}$ and the previous hidden state $o_{t-2}$, a recurrent unit
applies an activation function such as tanh to compute the output $x'_t$.

[Table excerpt, flattened in extraction — columns: Approach, Main Architecture, Model, Year, Temporal/Spatial, Supervision, Input, Interpretability, Point/Subsequence, Stochastic, Incremental; rows (forecasting, RNN, Section 3.2.1): LSTM-PRED [66] (2017), LSTM-NDT [92] (2018), LGMAD [49] (2019), THOC [156] (2020), AD-LTI [183] (2020).]

Fig. 4. An Overview of (a) Recurrent neural network (RNN), (b) Long short-term memory unit (LSTM), and (c) Gated recurrent unit
(GRU). These models can predict 𝑥𝑡′ by capturing the temporal information of a window of 𝑤 samples prior to 𝑥𝑡 in the time series.
Using the error |𝑥𝑡 − 𝑥𝑡′ |, an anomaly score can be computed.

An early model of this kind, LSTM-AD, stacks LSTM layers to detect anomalies in UTS without using labelled data for
training. This stacking helps learn higher-order temporal
patterns without needing prior knowledge of their duration. The network predicts several future time steps to capture
the sequence’s temporal structure, resulting in multiple error values for each point in the sequence. These prediction
errors are modelled as a multivariate Gaussian distribution to assess the likelihood of anomalies. LSTM-AD’s results
suggest that LSTM-based models are more effective than RNN-based models, especially when it’s unclear whether
normal behaviour involves long-term dependencies.
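The PyTorch sketch below captures the common recipe these forecasting-based models share: train an LSTM stack to predict the next value, fit a Gaussian to its prediction errors on normal data, and score new points by their deviation under that Gaussian. Layer sizes and names are illustrative, not the exact configuration of LSTM-AD or its successors.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Stacked LSTM that predicts the next time step from a window of past values."""
    def __init__(self, d_in=1, hidden=64, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(d_in, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, d_in)

    def forward(self, x):                  # x: (batch, w, d_in)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])       # prediction x'_t

def anomaly_scores(model, windows, targets):
    """Fit a Gaussian to prediction errors on normal data and score each point
    by its squared standardised error (a 1D Mahalanobis distance)."""
    with torch.no_grad():
        err = (model(windows) - targets).squeeze(-1)
    mu, sigma = err.mean(), err.std() + 1e-8
    return ((err - mu) / sigma) ** 2
```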
As opposed to the stacked LSTM used in LSTM-AD, Bontemps et al. [19] propose a simpler LSTM RNN model for
collective anomaly detection based on its predictive abilities for UTS. First, an LSTM RNN is trained with normal time
series data to make predictions, considering both current states and historical data. By introducing a circular array, the
model detects collective anomalies by identifying prediction errors that exceed a certain threshold within a sequence.
Motivated by promising results in LSTM models for UTS anomaly detection, a number of methods attempt to detect
anomalies in MTS based on LSTM architectures. In DeepLSTM [28], stacked LSTM recurrent networks are trained on
normal time series data. The prediction errors are then fitted to a multivariate Gaussian using maximum likelihood
estimation. This model predicts both normal and anomalous data, recording the Probability Density Function (PDF)
values of the errors. This approach has the advantage of not requiring preprocessing, and it works directly on raw
time series. LSTM-PRED [66] utilises three LSTM stacks with 100 hidden units each, processing data sequences of 100
seconds to learn temporal dependencies. Instead of setting thresholds for each sensor, it uses the Cumulative Sum
(CUSUM) method to detect anomalies. CUSUM calculates the cumulative sum of the sequence predictions to identify
small deviations, reducing false positives. It computes the positive and negative differences between predicted and
actual values, setting Upper Control Limits (UCL) and Lower Control Limits (LCL) from the validation data to determine
anomalies. Moreover, this model can pinpoint the specific sensor showing abnormal behaviour.
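A minimal sketch of the two-sided CUSUM statistic used in this kind of post-processing follows; the drift parameter and the quantile-based control limits are illustrative assumptions rather than LSTM-PRED's exact settings.

```python
import numpy as np

def cusum(errors, drift=0.0):
    """Accumulate positive and negative deviations of the prediction errors
    separately, so that small persistent shifts become visible."""
    s_pos = s_neg = 0.0
    pos, neg = np.zeros(len(errors)), np.zeros(len(errors))
    for i, e in enumerate(errors):
        s_pos = max(0.0, s_pos + e - drift)
        s_neg = max(0.0, s_neg - e - drift)
        pos[i], neg[i] = s_pos, s_neg
    return pos, neg

# Control limits can be set from validation data, e.g.:
#   ucl = np.quantile(pos_val, 0.99)
# and a timestamp is flagged when pos[i] > ucl (similarly for neg and the LCL).
```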
In all three above-mentioned models, LSTMs are stacked to improve prediction accuracy by analysing historical
data from MTS; however, LSTM-NDT [92] combines various techniques. LSTM-NDT model introduces a technique
that automatically adjusts thresholds for data changes, addressing issues like diversity and instability in evolving data.
Another model, LGMAD [49], enhances LSTM-based anomaly detection by combining LSTM with a Gaussian Mixture
Model (GMM) to detect anomalies in both simple and complex systems, assessing the system's health status through a
health factor. This model can only be applied in low-dimensional settings; for high-dimensional data, dimension-reduction
methods such as PCA are suggested for effective anomaly detection [88].
Ergen and Kozat [56] present LSTM-based anomaly detection algorithms in an unsupervised framework, as well
as semi-supervised and fully supervised frameworks. To detect anomalies, it uses scoring functions implemented by
One Class-SVM (OC-SVM) and Support Vector Data Description (SVDD) algorithms. In this framework, LSTM and
OC-SVM (or SVDD) architecture parameters are jointly trained with well-defined objective functions, utilising two joint
optimisation approaches. The gradient-based joint optimisation method uses revised OC-SVM and SVDD formulations,
illustrating their convergence to the original formulations. As a result of the LSTM-based structure, methods are able
to process data sequences of variable length. Aside from that, the model is effective at detecting anomalies in time
series data without preprocessing. Moreover, since the approach is generic, the LSTM architecture in this model can be
replaced by a gated recurrent unit (GRU) architecture [38].
GRU was proposed by Cho et al. [36] in 2014, similar to LSTM but incorporating a more straightforward structure
that leads to less computing time (see Fig. 4c). Both LSTM and GRU use gated architectures to control information
flow. However, GRU has gating units that modulate the information flow inside the unit without having any separate
memory unit, unlike LSTM [47]. There is no output gate but an update gate and a reset gate. Fig. 4c shows the GRU cell
that integrates the new input with the previous memory using its reset gate. The update gate defines how much of
the last memory to keep [73]. The issue is that LSTMs and GRUs are limited in learning complex seasonal patterns
in multi-seasonal time series. As more hidden layers are stacked and the backpropagation distance (through time) is
increased, accuracy can be improved. However, training may be costly.

Fig. 5. Structure of a Convolutional Neural Network (CNN) predicting the next values of an input time series based on a previous
data window. Time series dependency dictates that predictions rely solely on previously observed inputs.
In this regard, the AD-LTI model is a forecasting tool that combines a GRU network with a method called Prophet to
learn seasonal time series data without needing labelled data. It starts by breaking down the time series to highlight
seasonal trends, which are then specifically fed into the GRU network for more effective learning. When making
predictions, the model considers both the overall trends and specific seasonal patterns like weekly and daily changes.
However, since it uses past data that might include anomalies, the projections might not always be reliable. To address
this, it introduces a new measure called Local Trend Inconsistency (LTI), which assesses the likelihood of anomalies by
comparing recent predictions against the probability of them being normal, overcoming the issue that there might be
anomalous frames in history.
Traditional one-class classifiers are developed for fixed-dimension data and struggle with capturing temporal
dependencies in time series data [149]. A recent model, called THOC [156], addresses this by using a complex network
that includes a multilayer dilated RNN [27] and hierarchical SVDD [165]. This setup allows it to capture detailed
temporal features at multiple scales (resolution) and efficiently recognise complex patterns in time series data. It
improves upon older models by using information from various layers, not just the simplest features, and it detects
anomalies by comparing current data against its normal pattern representation. In spite of the accomplishments of
RNNs, they still face challenges in processing very long sequences due to their fixed window size.
3.2.2 Convolutional Neural Networks (CNN). Convolutional Neural Networks (CNNs) are adaptations of multilayer
perceptrons designed to identify hierarchical patterns in data. These networks employ convolutional, pooling, and fully
connected layers, as depicted in Fig. 5. Convolutional layers utilise a set of learnable filters that are applied across the
entire input to produce 2D activation maps through dot products. Pooling layers summarise these outputs statistically.
The CNN-based DeepAnt model [135] efficiently detects small deviations in time series patterns with minimal
training data and can handle data contamination under 5% in an unsupervised setup. DeepAnt is applicable to both
UTS and MTS and detects various anomaly types, including point, contextual anomalies, and discords.
Despite their effectiveness, traditional CNNs struggle with sequential data due to their inherent design. This limitation
has been addressed by the development of Temporal Convolutional Networks (TCN) [11], which use dilated convolutions
to accommodate time series data. TCNs ensure that outputs are the same length as inputs without future data leakage.
This is achieved using a 1D fully convolutional network and dilated convolutions, ensuring all computations for a
timestamp 𝑡 use only historical data.

Fig. 6. The basic structure of Graph Neural Network (GNN) for MTS anomaly detection that can learn the relationships (correlations)
between metrics and predict the expected behaviour of time series.

The dilated convolution operation is defined as:
$$x'(t) = (x *_l f)(t) = \sum_{i=0}^{k-1} f(i) \cdot x_{t - l \cdot i} \quad (16)$$
where 𝑓 is a filter of size 𝑘, ∗𝑙 denotes convolution with dilation factor 𝑙, and 𝑥𝑡 −𝑙 ·𝑖 represents past data points.
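A PyTorch sketch of one dilated causal convolution layer implementing Eq. (16) is shown below; channel counts and kernel size are illustrative. Left-only padding keeps the output length equal to the input length while using only past timestamps.

```python
import torch
import torch.nn as nn

class CausalDilatedConv(nn.Module):
    """TCN-style layer: the output at time t depends only on x_t, x_{t-l}, ..., x_{t-(k-1)l}."""
    def __init__(self, channels, k=3, dilation=2):
        super().__init__()
        self.pad = (k - 1) * dilation          # left padding so no future data leaks in
        self.conv = nn.Conv1d(channels, channels, kernel_size=k, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, T)
        x = nn.functional.pad(x, (self.pad, 0))
        return self.conv(x)                    # output: (batch, channels, T)
```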
He and Zhao [78] use different methods to predict and detect anomalies in data over time. They use a TCN trained on
normal data to forecast trends and calculate anomaly scores using multivariate Gaussian distribution fitted to prediction
errors. It includes a skip connection to blend multi-scale features, accommodating different pattern sizes. Ren et al.
[147] combine a Spectral Residual model, originally for visual saliency detection [83], with a CNN to enhance accuracy.
This method, used by over 200 Microsoft teams, can rapidly detect anomalies in millions of time series per minute. The
TCN Autoencoder (TCN-AE), developed by Thill et al. [169] (2020), modifies the standard AE by using CNNs instead of
dense layers, making it more effective and adaptable. It uses two TCNs for encoding and decoding, with layers that
respectively downsample and upsample data.
Many real-world scenarios produce quasi-periodic time series (QTS), like the patterns seen in ECGs (electrocardiograms).
An automated system for spotting anomalies in QTS, called AQADF [118], uses a two-part method.
First, it segments the QTS into consistent periods using a clustering algorithm (TCQSA) that groups similar data
points via hierarchical clustering without manual help, even filtering out errors to make it more reliable.
Second, it analyses these segments with an attention-based hybrid LSTM-CNN model (HALCM), which looks at both
broad trends and detailed features in the data. Furthermore, HALCM is further enhanced by three attention mechanisms,
allowing it to capture more precise details of the fluctuation patterns in QTS. Specifically, TAGs are embedded in LSTMs
in order to fine-tune variations extracted from different parts of QTS. A feature attention mechanism and a location
attention mechanism are embedded into a CNN in order to enhance the effects of key features extracted from QTSs.
TimesNet [181] is a versatile deep learning model designed for comprehensive time series analysis. It transforms 1D
time series data into 2D tensors to effectively capture complex temporal patterns. By using a modular structure called
TimesBlock, which incorporates a parameter-efficient inception block, TimesNet excels in a variety of tasks, including
forecasting, classification, and anomaly detection. This innovative approach allows it to handle intricate variations in
time series data, making it suitable for applications across different domains.
Fig. 7. (a) Components of an HTM-based (Hierarchical Temporal Memory) anomaly detection system calculating prediction error
and anomaly likelihood. (b) An HTM cell internal structure. Dendrites act as detectors with synapses. Context dendrites receive
lateral input from other neurons. Sufficient lateral activity puts the cell in a predicted state.
3.2.3 Graph Neural Networks (GNN). In recent years, researchers have proposed extracting spatial information from
MTS to form a graph structure, converting TSAD into a problem of detecting anomalies based on these graphs using
GNNs. As shown in Fig. 6, GNNs use pairwise message passing, where graph nodes iteratively update their representa-
tions by exchanging information. In MTS anomaly detection, each dimension is a node in the graph, represented as
$V = \{1, \ldots, d\}$. Edges $E$ indicate correlations learned from the MTS. For node $u \in V$, the message passing layer output for iteration $k+1$ is:

$$h_u^{k+1} = \mathrm{UPDATE}\left(h_u^k,\ \mathrm{AGGREGATE}\left(\{h_v^k : v \in N(u)\}\right)\right) \quad (17)$$

where $h_u^k$ is the embedding of node $u$ at iteration $k$ and $N(u)$ is the neighbourhood of node $u$. GNNs enhance MTS modelling
by learning spatial structures [151]. Various GNN architectures exist, such as Graph Convolution Networks (GCN)
[103], which aggregate one-step neighbours, and Graph Attention Networks (GAT) [173], which use attention functions
to compute different weights for each neighbour.
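As a concrete instance of this aggregation, the numpy sketch below implements a single GCN-style layer with symmetric normalisation; the names and the ReLU update are illustrative.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One message-passing step: each node aggregates its one-step neighbours
    (including itself) with symmetric degree normalisation.
    H: (d, f) node embeddings, A: (d, d) adjacency, W: (f, f') weights."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    H_next = d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)             # ReLU update
```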
Incorporating relationships between features is beneficial. Deng and Hooi [45] introduced GDN, a GNN attention-
based model that captures sensor characteristics as nodes and their correlations as edges, predicting behaviour based
on adjacent sensors. Anomaly detection framework GANF (Graph-Augmented Normalizing Flow) [40] augments
normalizing flow with graph structure learning, detecting anomalies by identifying low-density instances. GANF
represents time series as a Bayesian network, learning conditional densities with a graph-based dependency encoder
and using graph adjacency matrix optimisation [189].
In conclusion, extracting graph structures from time series and modelling them with GNNs enables the detection of
spatial changes over time, representing a promising research direction.
3.2.4 Hierarchical Temporal Memory (HTM). Hierarchical Temporal Memory (HTM) mimics the hierarchical processing
of the neocortex for anomaly detection [65]. Fig. 7a shows the typical components of the HTM. The input 𝑥𝑡 is encoded
and then processed through sparse spatial pooling [39], resulting in 𝑎(𝑥𝑡 ), a sparse binary vector. Sequence memory
models temporal patterns in 𝑎(𝑥𝑡 ) and returns a sparse vector prediction 𝜋 (𝑥𝑡 ). The prediction error is defined as:
$$err_t = 1 - \frac{\pi(x_{t-1}) \cdot a(x_t)}{|a(x_t)|} \quad (18)$$

where $|a(x_t)|$ is the number of 1s in $a(x_t)$. The anomaly likelihood, based on the model's prediction history and error
distribution, indicates whether the current state is anomalous.

Fig. 8. Transformer network structure for anomaly detection. The Transformer uses an encoder-decoder structure with multiple
identical blocks. Each encoder block includes a multi-head self-attention module and a feedforward network. During decoding,
cross-attention is added between the self-attention module and the feedforward network.
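Eq. (18) is a one-line computation over sparse binary vectors, as in the hypothetical helper below; an error close to 1 means that almost none of the currently active bits were predicted.

```python
import numpy as np

def htm_prediction_error(pi_prev, a_t):
    """Eq. (18): fraction of active bits in a(x_t) not predicted by pi(x_{t-1}).
    Both arguments are binary vectors of equal length."""
    return 1.0 - (pi_prev @ a_t) / a_t.sum()
```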
HTM neurons are organised in columns within a layer (Fig. 7b). Multiple regions exist within each hierarchical
level, with fewer regions at higher levels combining patterns from lower levels to recognise more complex patterns.
Sensory data enters lower-level regions during learning and generates patterns for higher levels. HTM is robust to
noise, has high capacity, and can learn multiple patterns simultaneously. It recognises and memorises frequent spatial
input patterns and identifies sequences likely to occur in succession.
Numenta HTM [5] detects temporal anomalies of UTS in predictable and noisy environments. It effectively handles
extremely noisy data, adapts continuously to changes, and can identify small anomalies without false alarms. Multi-
HTM [182] learns context over time, making it noise-tolerant and capable of real-time predictions for various anomaly
detection challenges, so it can be used as an adaptive model. In particular, it is used for univariate problems and applied
efficiently to MTS. RADM [48] proposes a real-time, unsupervised framework for detecting anomalies in MTS by
combining HTM with a naive Bayesian network. Initially, HTM efficiently identifies anomalies in UTS with excellent
results in terms of detection and response times. Then, it pairs with a Bayesian network to improve MTS anomaly
detection without needing to reduce data dimensions, catching anomalies missed in UTS analyses. Bayesian networks
help refine observations due to their adaptability and ease in calculating probabilities.
3.2.5 Transformers. Transformers [172] are deep learning models that weigh input data differently depending on the
significance of different parts. In contrast to RNNs, transformers process the entire data simultaneously. Due to its
architecture based solely on attention mechanisms, illustrated in Fig. 8, it can capture long-term dependencies while
being computationally efficient. Recent studies utilise them to detect time series anomalies as they process sequential
data for translation in text data.
The original transformer architecture is encoder-decoder-based. An essential part of the transformer’s functionality
is its multi-head self-attention mechanism, stated in the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \quad (19)$$

where $Q$, $K$ and $V$ are the query, key and value matrices, and $d_k$ is the key dimension used to normalise the attention map.
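Eq. (19) translates directly into a few lines of PyTorch; this minimal sketch omits the per-head projections and masking used in full multi-head attention blocks.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Eq. (19): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., seq_q, seq_k)
    return torch.softmax(scores, dim=-1) @ V
```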
Fig. 9. A time series may be unknown at any given moment or may change rapidly like (b), which illustrates sensor readings for
manual control [125]. Such a time series cannot be predicted in advance, making prediction-based anomaly detection ineffective.
A semantic correlation is identified in a long sequence, filtering out unimportant elements. Since transformers lack
recurrence or convolution, they need positional encoding for token positions (i.e. relative or absolute positions). GTA [34]
uses transformers for sequence modelling and a bidirectional graph to learn relationships among multiple IoT sensors.
It introduces an Influence Propagation (IP) graph convolution for semi-supervised learning of sensor dependencies.
To boost efficiency, each node’s neighbourhood is constrained, and then graph convolution layers model information
flow. As a next step, a multiscale dilated convolution and graph convolution are fused for hierarchical temporal context
encoding. They use transformers for parallelism and contextual understanding and propose multi-branch attention to
reduce attention complexity. In another recent work, SAnD [160] uses a transformer with stacked encoder-decoder
structures, relying solely on attention mechanisms to model clinical time series. The architecture utilises self-attention
to capture dependencies with multiple heads, positional encoding, and dense interpolation embedding for temporal
order. It was also extended for multitask diagnoses.
3.3.1 Autoencoder (AE). Autoencoders (AEs), also known as auto-associative neural networks [105], are widely used in
MTS anomaly detection for their nonlinear dimensionality reduction capabilities [150, 203]. Recent advancements in
deep learning have focused on learning low-dimensional representations (encoding) using AEs [16, 81].
AEs consist of an encoder and a decoder (see Fig. 10a). The encoder converts input into a low-dimensional represen-
tation, and the decoder reconstructs the input from this representation. The goal is to achieve accurate reconstruction
and minimise reconstruction error. This process is summarised as follows:
𝑍𝑡 −𝑤:𝑡 = 𝐸𝑛𝑐 (𝑋𝑡 −𝑤:𝑡 , 𝜙), 𝑋ˆ𝑡 −𝑤:𝑡 = 𝐷𝑒𝑐 (𝑍𝑡 −𝑤:𝑡 , 𝜃 ) (20)
where 𝑋𝑡 −𝑤:𝑡 is a sliding window of input data, 𝑥𝑡 ∈ R𝑑 , 𝐸𝑛𝑐 is the encoder with parameters 𝜙, and 𝐷𝑒𝑐 is the
decoder with parameters 𝜃 . 𝑍 represents the latent space (encoded representation). The encoder and decoder parameters
are optimised during training to minimise reconstruction error:
(𝜙 ∗, 𝜃 ∗ ) = arg min Err(𝑋𝑡 −𝑤:𝑡 , 𝐷𝑒𝑐 (𝐸𝑛𝑐 (𝑋𝑡 −𝑤:𝑡 , 𝜙), 𝜃 )) (21)
𝜙,𝜃
To improve representation, techniques such as Sparse Autoencoder (SAE) [137], Denoising Autoencoder (DAE) [174],
and Convolutional Autoencoder (CAE) [139] are used. The anomaly score of a window in an AE-based model is defined
based on the reconstruction error:

$$AS = \lVert X_{t-w:t} - \hat{X}_{t-w:t} \rVert^2 \quad (22)$$
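A minimal PyTorch sketch of this recipe over flattened windows follows (Eqs. (20)–(22)); layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class WindowAE(nn.Module):
    """Fully connected autoencoder over flattened windows of w steps and d dimensions."""
    def __init__(self, w, d, latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(w * d, 128), nn.ReLU(), nn.Linear(128, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, w * d))

    def forward(self, x):                       # x: (batch, w*d)
        return self.dec(self.enc(x))

def reconstruction_score(model, x):
    """Eq. (22): the anomaly score of a window is its reconstruction error."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)
```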
In MSCRED [192], attention-based ConvLSTM networks capture temporal trends, and a convolutional autoencoder
(CAE) reconstructs a signature matrix, representing inter-sensor correlations instead of relying on the time series
explicitly. The matrix length is 16, with a step interval of 5. An anomaly score is derived from the reconstruction error,
aiding in anomaly detection, root cause identification, and anomaly duration interpretation. In CAE-Ensemble [22], a
convolutional sequence-to-sequence autoencoder captures temporal dependencies with high parallelism. Gated Linear
Units (GLU) with convolution layers and attention capture local patterns, recognising recurring subsequences like
periodicity. The ensemble combines outputs from diverse models based on CAEs and uses a parameter-transfer training
strategy, which enhances accuracy and reduces training time and error. In order to ensure diversity, the objective
function also considers the differences between basic models rather than simply assessing their accuracy.
RANSysCoders [1] outlines a real-time anomaly detection system used by eBay. The authors propose an architecture
with multiple encoders and decoders, using random feature selection and majority voting to infer and localise anomalies.
The decoders set reconstruction bounds, functioning as bootstrapped AE for feature-bounds construction. The authors
also recommend using spectral analysis of the latent space representation to extract priors for MTS synchronisation.
Improved accuracy comes from feature synchronisation, bootstrapping, quantile loss, and majority voting. This method
addresses issues with previous approaches, such as threshold identification, time window selection, downsampling, and
inconsistent performance for large feature dimensions.
A novel Adaptive Memory Network with Self-supervised Learning (AMSL) [198] is designed to increase the
generalisation of unsupervised anomaly detection. AMSL uses an AE framework with convolutions for end-to-end training. It
combines self-supervised learning and memory networks to handle limited normal data. The encoder maps the raw
time series and its six transformations into a feature space. A multi-class classifier is then used to classify these features
and improve generalisation. The features are also processed through global and local memory networks, which learn
common and specific features. Finally, an adaptive fusion module merges these features into a new reconstruction
representation. Recently, ContextDA [106] utilises deep reinforcement learning to optimise domain adaptation for
TSAD. It frames context sampling as a Markov decision process, focusing on aligning windows from the source and
target domains. Earlier discriminator-based approaches align the domains without leveraging label information in the
source domain, which may lead to ineffective alignment when anomaly classes differ. ContextDA addresses this by
leveraging source labels, enhancing the alignment of normal samples and improving detection accuracy.
3.3.2 Variational Autoencoder (VAE). Fig. 10b shows a typical configuration of the variational autoencoder (VAE),
a directional probabilistic graph model which combines neural network autoencoders with mean-field variational
Bayes [102]. The VAE works similarly to AE, but instead of encoding inputs as single points, it encodes them as a
distribution using inference network 𝑞𝜙 (𝑍𝑡 −𝑤+1:𝑡 |𝑋𝑡 −𝑤+1:𝑡 ) where 𝜙 is its parameters. It represents a 𝑑 dimensional
input 𝑋𝑡 −𝑤+1:𝑡 to a latent representation 𝑍𝑡 −𝑤+1:𝑡 with a lower dimension 𝑘 < 𝑑. A sampling layer takes a sample from
a latent distribution and feeds it to the generative network 𝑝𝜃 (𝑋𝑡 −𝑤+1:𝑡 |𝑍𝑡 −𝑤+1:𝑡 ) with parameters 𝜃 , and its output
is 𝑔(𝑍𝑡 −𝑤+1:𝑡 ), reconstruction of the input. There are two components of the loss function, as stated in Equation (23)
that are minimised in a VAE: a reconstruction error that aims to improve the process of encoding and decoding and a
regularisation factor, which aims to regularise the latent space by making the encoder’s distribution as close to the
preferred distribution as possible.
$$loss = \lVert X_{t-w+1:t} - g(Z_{t-w+1:t}) \rVert^2 + KL(\mathcal{N}(\mu_x, \sigma_x), \mathcal{N}(0, 1)) \quad (23)$$
where $KL$ is the Kullback–Leibler divergence. By using regularised training, it avoids overfitting and ensures that the
latent space is appropriate for a generative process.

Fig. 10. Structure of (a) Auto-Encoder, which compresses an input window into a lower-dimensional representation (ℎ) and then
reconstructs the output $\hat{X}$ from this representation, and (b) Variational Auto-Encoder, whose encoder compresses an input window of
size $w$ into a latent distribution. The decoder uses sampled data from this distribution to produce $\hat{X}$, closely matching $X$.
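The PyTorch sketch below shows a minimal window-level VAE with the two-term loss of Eq. (23); the architecture sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class WindowVAE(nn.Module):
    """Encoder outputs a distribution (mu, log-variance); a sample z is drawn via the
    reparameterisation trick; the decoder reconstructs the window from z."""
    def __init__(self, w_d, latent=8):
        super().__init__()
        self.enc = nn.Linear(w_d, 64)
        self.mu = nn.Linear(64, latent)
        self.logvar = nn.Linear(64, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, w_d))

    def forward(self, x):                       # x: (batch, w*d)
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # sampling layer
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    """Eq. (23): reconstruction error plus KL(N(mu, sigma) || N(0, 1))."""
    rec = ((x_rec - x) ** 2).sum(dim=1)
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)
    return (rec + kl).mean()
```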
LSTM-VAE [143] represents a variation of the VAE that uses LSTM instead of a feed-forward network. This model is
trained with a denoising autoencoding method for better representation. It detects anomalies when the log-likelihood of
a data point is below a dynamic, state-based threshold to reduce false alarms. Xu et al. [184] found that training on both
normal and abnormal data is crucial for VAE anomaly detection. Their model, Donut, uses a VAE trained on shuffled
data for unsupervised anomaly detection. Donut’s Modified ELBO, Missing Data Injection, and MCMC Imputation
make it excellent at detecting anomalies in the seasonal KPI dataset. However, due to VAE’s nonsequential nature
and sliding window format, Donut struggles with temporal anomalies. Later on, Bagel [115] is introduced to handle
temporal anomalies robustly and unsupervised. Instead of using VAE in Donut, Bagel employs conditional variational
autoencoder (CVAE) [109] and considers temporal information. VAE models the relationship between two random
variables, 𝑥 and 𝑧. CVAE models the relationship between 𝑥 and 𝑧, conditioned on 𝑦, i.e., it models 𝑝 (𝑥, 𝑧|𝑦).
STORNs [159], or stochastic recurrent networks, use variational inference to model high-dimensional time series data.
The algorithm is flexible and generic and doesn’t need domain knowledge for structured time series. OmniAnomaly
[163] uses a VAE with stochastic RNNs for robust representations of multivariate data and planar normalizing flow for
non-Gaussian latent space distributions. It detects anomalies based on reconstruction probability and uses POT for
thresholding. InterFusion [117] uses a hierarchical Variational Autoencoder (HVAE) with two stochastic latent variables
for intermetric and temporal representations, along with a two-view embedding. To prevent overfitting anomalies in
training data, InterFusion employs prefiltering of temporal anomalies. The paper also introduces MCMC imputation
for anomaly interpretation in MTS, and IPS for assessing results.
There are a few studies on anomaly detection in noisy time series data. Buzz [32] uses an adversarial training
method to capture patterns in univariate KPI with non-Gaussian noises and complex data distributions. This model
links Bayesian networks with optimal transport theory using Wasserstein distance. SISVAE (smoothness-inducing
sequential VAE) [112] detects point-level anomalies by smoothing before training a deep generative model using a
Bayesian method. As a result, it benefits from the efficiency of classical optimisation models as well as the ability to
model uncertainty with deep generative models. This model adjusts thresholds dynamically based on noise estimates,
crucial for changing time series. Other studies have used VAE for anomaly detection, assuming an unimodal Gaussian
distribution as a prior. Existing studies have struggled to learn the complex distribution of time series due to its inherent
multimodality. The GRU-based Gaussian Mixture VAE [74] addresses this challenge of learning complex distributions
by using GRU cells to discover time sequence correlations and represent multimodal data with a Gaussian Mixture.
In [191], a VAE with two extra modules is introduced: a Re-Encoder and a Latent Constraint network (VELC). The
Re-Encoder generates new latent vectors, and this complex setup maximises the anomaly score (reconstruction error) in
both the original and latent spaces to accurately model normal samples. The VELC network prevents the reconstruction
of untrained anomalies, leading to latent variables similar to the training data, which helps distinguish normal from
anomalous data. The VAE and LSTM are integrated as a single component in PAD [30] to support unsupervised anomaly
detection and robust prediction. The VAE minimises noise impact on predictions, while LSTMs help VAE capture
long-term sequences. Spectral residuals (SR) [83] are also used to improve performance by assigning weights to each
subsequence, indicating their normality.
TopoMAD (topology-aware multivariate time series anomaly detector) [79] is an anomaly detector in cloud systems
that uses GNN, LSTM, and VAE for spatiotemporal learning. It’s a stochastic seq2seq model that leverages topological
information to identify anomalies using graph-based representations. The model replaces standard LSTM cells with
graph neural networks (GCN and GAT) to capture spatial dependencies. To improve anomaly detection, models like
VAE-GAN [138] use partially labelled data. This semi-supervised model integrates LSTMs into a VAE, training an
encoder, generator, and discriminator simultaneously. The model distinguishes anomalies using both VAE reconstruction
differences and discriminator results.
The recently developed Robust Deep State Space Model (RDSSM) [113] is an unsupervised density-reconstruction-based
model for detecting anomalies in MTS. Unlike many existing methods, RDSSM can be trained on raw data that may contain
anomalies. It incorporates two transition modules to handle temporal dependency and uncertainty, and its
emission model includes a heavy-tailed error buffer, allowing it to handle contaminated, unlabelled
training data robustly. On top of this generative model, the authors build a detection method that manages noise that
fluctuates over time and provides adaptive anomaly scores for probabilistic detection, outperforming many existing methods.
In [177], a variational transformer is introduced for unsupervised anomaly detection in MTS. Instead of using an
explicit feature-relationship graph, the model captures correlations through self-attention, and its performance benefits
from reduced dimensionality and sparse correlations. The transformer's positional encoding, a global temporal
encoding, helps capture long-term dependencies, and multi-scale feature fusion allows the model to extract robust features
from different time scales. A residual VAE module encodes the hidden space using local features, and its residual structure
improves the KL divergence term and enhances the model's generative ability.
3.3.3 Generative Adversarial Networks (GAN). A generative adversarial network (GAN) is a generative modelling
framework based on game theory [69]. A generative model explores the training examples and learns the probability
distribution that generated them, so a GAN can generate further examples from the estimated distribution, as illustrated
in Fig. 11. Denote the generator by G and the discriminator by D. The generator and discriminator are trained with the
following minimax objective:
$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p(X)}\big[\log D(X_{t-w+1:t})\big] + \mathbb{E}_{z \sim p(Z)}\big[\log\big(1 - D(Z_{t-w+1:t})\big)\big] \tag{24}$$
Fig. 11. Overview of a Generative Adversarial Network (GAN) with two main components: generator and discriminator. The generator
creates fake time series windows for the discriminator, which learns to distinguish between real and fake data. A combined anomaly
score is calculated using both the trained discriminator and generator.
where p(X) is the probability distribution of the input data and X_{t-w+1:t} is a sliding window from the training set,
the real input in Fig. 11. Likewise, p(Z) is the prior probability distribution of the generating variable and Z_{t-w+1:t} is a generated
input window drawn from a random space with the same window size.
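To make Eq. (24) concrete, the sketch below alternates the two updates on flattened sliding windows. The MLP generator and discriminator, their sizes, and the non-saturating generator loss are standard practical choices we assume here; they are not prescribed by Eq. (24) or by any particular model in this section.

```python
import torch
import torch.nn as nn

w, d, z_dim = 32, 5, 16                      # window length, dimensions, noise size
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, w * d))
D = nn.Sequential(nn.Linear(w * d, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):                        # real: (batch, w * d) flattened windows
    b = real.size(0)
    fake = G(torch.randn(b, z_dim))
    # max_D  log D(X) + log(1 - D(G(z)))
    opt_d.zero_grad()
    d_loss = (bce(D(real), torch.ones(b, 1)) +
              bce(D(fake.detach()), torch.zeros(b, 1)))
    d_loss.backward()
    opt_d.step()
    # min_G  log(1 - D(G(z))), implemented in the usual non-saturating form
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(b, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```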
Although GANs have been applied to a wide variety of purposes (mainly in research), they still pose unique
challenges and research opportunities because they rely on game theory, which distinguishes them from most other
approaches to generative modelling. Generally, GAN-based models exploit the fact that adversarial learning
makes the discriminator more sensitive to data outside the current dataset, making such data harder to
reconstruct. BeatGAN [200] regularises its reconstructions robustly by combining AEs and
GANs [69] in settings where labels are unavailable. Moreover, a time series warping method augments the training
data with speed variations, improving detection accuracy and making BeatGAN robust to time-warping variability in
time series. Research shows that BeatGAN can detect anomalies accurately in both ECG and sensor data.
However, training a GAN is usually difficult and requires a careful balance between the discriminator and generator
[104]; the instability and difficulty of convergence of adversarial training also make such systems unsuitable for online
use. DAEMON (Adversarial Autoencoder Anomaly Detection Interpretation) [33] detects anomalies
using adversarially generated time series. Its training involves three steps. First, a one-dimensional CNN
encodes the MTS. Then, instead of decoding the hidden variable directly, a prior distribution is imposed on the latent vector
and an adversarial strategy aligns the posterior with this prior, which avoids inaccurate reconstructions
of unseen patterns. Finally, a decoder reconstructs the time series, and a second adversarial training step minimises the
difference between the original and reconstructed values.
MAD-GAN (Multivariate Anomaly Detection with GAN) [111] is a GAN-based model that uses LSTM-RNNs as
both the generator and discriminator to capture the temporal relationships in time series. It detects anomalies using
both reconstruction error and discrimination loss. Furthermore, FGANomaly (Filter GAN) [54] tackles overfitting in AE-
based and GAN-based anomaly detection models by filtering out potentially abnormal samples before training using
pseudo-labels. Its generator uses an Adaptive Weight Loss that assigns weights to samples based on their reconstruction
errors during training, allowing the model to focus on normal data and reduce overfitting.
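MAD-GAN's anomaly score combines how well the generator can reconstruct a window (found by searching the latent space for the best z) with how fake the discriminator considers it. The sketch below reuses the G and D shapes from the previous sketch and abbreviates the latent-space search to a fixed number of gradient steps; the weight lam and the step count are illustrative assumptions, not the paper's settings.

```python
import torch

def combined_score(window, G, D, z_dim=16, steps=50, lam=0.5):
    """window: (1, w * d) flattened window. Higher score = more anomalous."""
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=0.01)
    for _ in range(steps):                   # invert G: find z that reconstructs the window
        opt.zero_grad()
        ((G(z) - window) ** 2).mean().backward()
        opt.step()
    with torch.no_grad():
        rec_err = ((G(z) - window) ** 2).mean()
        p_real = torch.sigmoid(D(window)).squeeze()   # discriminator's belief the window is real
    return (lam * rec_err + (1 - lam) * (1 - p_real)).item()
```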
3.3.4 Transformers. Anomaly Transformer [185] uses an attention mechanism to spot unusual patterns by simultane-
ously modelling prior and series associations for each timestamp, which makes rare anomalies more distinguishable:
anomalies are hard to associate with the entire series, whereas normal patterns associate more easily with nearby
timestamps. Prior associations model each timestamp's focus on nearby points with a Gaussian kernel, while series
associations are the self-attention weights learned from the raw data. Along with the reconstruction loss, a minimax strategy is used to enhance
the difference between normal and abnormal association discrepancies. TranAD [171] is another transformer-based
model, featuring self-conditioning and adversarial training. Its architecture makes it efficient to train
and test while preserving stability on large inputs. Transformer-based encoder-decoder networks may fail to detect
anomalies that are subtle; TranAD's adversarial training amplifies reconstruction errors to address this, and
self-conditioning ensures robust feature retrieval, improving stability and generalisation.
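The prior association itself is compact enough to sketch: a row-normalised Gaussian kernel over timestamp distances. This is our minimal reading of the mechanism; in the actual model the per-timestamp scale sigma is learned jointly with the attention weights.

```python
import torch

def prior_association(seq_len: int, sigma: torch.Tensor) -> torch.Tensor:
    """Row-normalised Gaussian kernel over timestamp distances |i - j|.

    sigma: per-timestamp scale, shape (seq_len,); learnable in the original
    model, fixed here for illustration.
    """
    idx = torch.arange(seq_len, dtype=torch.float32)
    dist = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()     # dist[i, j] = |i - j|
    prior = torch.exp(-dist ** 2 / (2 * sigma.unsqueeze(1) ** 2))
    return prior / prior.sum(dim=1, keepdim=True)          # each row sums to 1

# The association discrepancy then contrasts this prior with the series
# association (the self-attention weights); anomalies, which attend mostly to
# adjacent points, show a small discrepancy and are scored accordingly.
```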
Li et al. [114] present an unsupervised method, DCT-GAN, which uses a transformer to handle time series
data, a GAN to reconstruct samples and spot anomalies, and dilated CNNs to capture temporal information from the latent space.
The model blends multiple transformer generators at different scales to improve its generalisation and integrates the
generators through a weight-based mechanism, making it suitable for various types of anomalies. Additionally, MT-RVAE [177],
discussed earlier, combines the transformer's sequence-modelling strengths with the capabilities of a VAE and therefore
belongs to both of these architecture categories.
Dual-TF [136] is a framework for detecting anomalies in time series data that utilises both time and frequency
information. It employs two parallel transformers to analyse the data in the two domains separately, then combines their
losses to improve the detection of complex anomalies. This dual-domain approach accurately pinpoints both
point-wise and subsequence-wise anomalies by overcoming the granularity discrepancy between time and frequency.
3.4.1 Transformers. TS2Vec [190] utilises a hierarchical transformer architecture to capture contextual information
at multiple scales, providing a universal representation learning approach based on self-supervised contrastive learning
that treats anomaly detection as a downstream task across various time series datasets. In TS2Vec, positive
pairs are representations at the same timestamp in two augmented contexts created by timestamp masking and random
cropping, while negative samples are representations at different timestamps from the same series, or from other series
in the batch at the same timestamp.
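The pair construction can be pictured in a few lines of NumPy. This is a simplified reading: TS2Vec applies timestamp masking to latent representations, whereas we mask raw values here to keep the sketch short, and the crop-sampling details below are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_overlapping_crops(x: np.ndarray, min_len: int = 16):
    """Return two random crops of x (shape (T, d)) that overlap on [a2, b1)."""
    T = x.shape[0]
    assert T > 2 * min_len, "series too short for this sketch"
    a1, a2 = sorted(rng.integers(0, T - min_len, size=2))
    b1, b2 = sorted(rng.integers(a2 + min_len, T + 1, size=2))
    return x[a1:b1].copy(), x[a2:b2].copy(), (a2, b1)   # two crops + overlap span

def timestamp_mask(x: np.ndarray, p: float = 0.1) -> np.ndarray:
    """Zero out whole timestamps with probability p."""
    x = x.copy()
    x[rng.random(x.shape[0]) < p] = 0.0
    return x
```

Representations of the timestamps inside the overlap of the two masked crops then form the positive pairs; other timestamps and other series in the batch act as negatives.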
3.4.2 Convolutional Neural Networks (CNN). The TF-C (Time-Frequency Consistency) model [196] is a self-supervised
contrastive pre-training framework designed for time series data. By leveraging both time-based and frequency-based
representations, the model uses a novel consistency loss to ensure that these embeddings lie close together in a shared
latent space. Using 3-layer 1-D ResNets as the backbone of its time and frequency encoders, the model captures
both the temporal and the spectral characteristics of time series. This architecture allows TF-C to learn generalisable
representations that can be used for time series anomaly detection as a downstream task. In TF-C, a positive pair consists
of a slightly perturbed version of an original sample, while negative pairs are formed from different original samples or their
perturbed versions.
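The core idea can be sketched as follows; the placeholder MLP encoders and the simple squared-distance consistency term are our assumptions, standing in for the paper's 1-D ResNet encoders and its full contrastive objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

w, d, emb = 64, 3, 32
enc_t = nn.Sequential(nn.Flatten(), nn.Linear(w * d, emb))             # time-domain encoder
enc_f = nn.Sequential(nn.Flatten(), nn.Linear((w // 2 + 1) * d, emb))  # frequency-domain encoder

def consistency_loss(x):                     # x: (batch, w, d)
    z_t = F.normalize(enc_t(x), dim=1)
    xf = torch.fft.rfft(x, dim=1).abs()      # frequency view of the same window
    z_f = F.normalize(enc_f(xf), dim=1)
    return ((z_t - z_f) ** 2).sum(dim=1).mean()   # keep the two views consistent
```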
DCdetector [187] employs a deep CNN with a dual attention mechanism. This structure focuses on both spatial and
temporal dimensions, using contrastive learning to enhance the separability of normal and anomalous patterns, making
it adept at identifying subtle anomalies. In this model, a positive pair consists of representations of different views of
the same time series; no negative samples are used, and the model relies on the dual attention structure to distinguish
anomalies by maximising the representation discrepancy between normal and abnormal samples.
In contrast, CARLA [42] introduces a self-supervised contrastive representation learning approach with a two-phase
framework. The first (pretext) phase differentiates anomaly-injected samples from original samples. In
the second phase, self-supervised classification leverages information about each representation's neighbours to enhance
anomaly detection, learning both normal behaviours and the deviations that indicate anomalies. In CARLA, positive pairs
are selected from neighbours, while negative pairs are anomaly-injected samples. In recent work, DACAD [43]
combines a TCN with unsupervised domain adaptation in its contrastive learning framework. It introduces
synthetic anomalies to improve learning and generalisation across domains, identifying anomalies through enhanced
feature extraction and domain-invariant learning. DACAD selects positive and negative pairs in a similar way to CARLA.
These models exemplify the advances in deep learning for TSAD, highlighting a shift towards models
that not only detect anomalies but also understand the intricate patterns in time series data, which makes this area of
research promising. Finally, while all the models in this category are based on self-supervised contrastive learning,
self-prediction-based self-supervised approaches remain unexplored in the TSAD literature.
3.5.1 Autoencoder (AE). By capturing spatiotemporal correlations in multi-sensor time series, CAE-M (Deep Convolutional
Autoencoding Memory network) [197] models generalised patterns in normalised data by performing
reconstruction and prediction simultaneously. It uses a deep convolutional AE with a Maximum Mean Discrepancy
(MMD) penalty to match a target distribution in low dimensions, which helps prevent overfitting to noise or
anomalies. To better capture temporal dependencies, it employs nonlinear bidirectional LSTMs with attention together with linear
autoregressive models. Neural System Identification and Bayesian Filtering (NSIBF) [60] is a density-based TSAD
approach for cyber-physical systems (CPS). It uses a neural network with a state-space model to track hidden-state
uncertainty over time, capturing the dynamics of the CPS. In the detection phase, Bayesian filtering is applied to the state-space
model to estimate the likelihood of observed values. This combination of neural networks and Bayesian filters allows
NSIBF to detect anomalies accurately in noisy CPS sensor data.
3.5.2 Recurrent Neural Networks (RNN). TAnoGan [13] detects anomalies in time series when only a limited number
of examples are available. It has been evaluated on 46 NAB time series datasets covering a range of topics, and the
experiments show that, through adversarial training, LSTM-based GANs can outperform LSTM-based models when
challenged with time series data.
3.5.3 Graph Neural Networks (GNN). In [199], two parallel graph attention (GAT) layers are introduced for self-
supervised multivariate TSAD: one identifies connections between different time series, and the other learns relationships
between timestamps. The model combines forecasting and reconstruction approaches: the forecasting model predicts
a single point, while the reconstruction model learns a latent representation of the entire time series, and the model can
diagnose which series are anomalous (interpretability). FuSAGNet [76] fuses SAE reconstruction and GNN forecasting to
find complex anomalies in multivariate data. It builds on GDN [45] but embeds the sensors in each process, followed by
recurrent units that capture temporal patterns. By learning recurrent sensor embeddings and sparse latent representations,
the GNN predicts expected behaviour in the testing phase.
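A single-head graph attention layer, the building block such models rely on, can be sketched compactly. This minimal version (our simplification, without multi-head attention or learned graph structure) treats each series as a node described by a feature vector:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGAT(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared node projection
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scorer

    def forward(self, h):                     # h: (nodes, in_dim)
        n = h.size(0)
        z = self.W(h)                         # (n, out_dim)
        # score every ordered pair (i, j) of nodes
        pairs = torch.cat([z.repeat_interleave(n, 0), z.repeat(n, 1)], dim=1)
        e = F.leaky_relu(self.a(pairs)).view(n, n)        # attention logits
        alpha = torch.softmax(e, dim=1)                   # who attends to whom
        return alpha @ z                                  # aggregated node features

# e.g. embed d sensors (nodes), each described by its last w readings:
# gat = SimpleGAT(in_dim=w, out_dim=16); out = gat(window.T)   # window: (w, d)
```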
• Multidimensional Data with Complex Dependencies: GNNs are suitable for capturing both temporal and
spatial dependencies in multivariate time series. They are particularly effective in scenarios such as IoT sensor
networks and industrial systems, where intricate interdependencies exist among dimensions. GNN architectures
such as GCNs and GATs are recommended in these settings.
• Sequential Data with Long-Term Temporal Dependencies: LSTM and GRU are effective for applications
requiring the modelling of long-term temporal dependencies. LSTM is commonly used in financial time series
analysis, predictive maintenance, and healthcare monitoring. GRU, with its simpler structure, offers faster training
times and is suitable for efficient temporal dependency modelling.
• Large Datasets Requiring Scalability and Efficiency: Transformers utilise self-attention mechanisms to
efficiently model long-range dependencies, making them suitable for handling large-scale datasets [97], such
as network traffic analysis. They are designed for robust anomaly detection by capturing complex temporal
patterns, with models like the Anomaly Transformer [185] and TranAD [171] being notable examples.
• Handling Noise in Anomaly Detection: AEs and VAEs architectures are particularly adept at handling noise in
the data, making them suitable for applications like network traffic, multivariate sensor data, and cyber-physical
systems.
• High-Frequency Data and Detailed Temporal Patterns: CNNs are useful for capturing local temporal
patterns in high-frequency data. They are particularly effective in detecting small deviations and subtle anomalies
in data such as web traffic and real-time monitoring systems. TCNs extend CNNs with dilated convolutions
that capture long-term dependencies, making them suitable for applications that exhibit long-range
dependencies as well as local patterns [11].
• Data with Evolving Patterns and Multimodal Distributions: Combining the strengths of various archi-
tectures, hybrid models are designed to handle complex, high-dimensional time series data with evolving
patterns like smart grid monitoring, industrial automation, and climate monitoring. These models, such as those
integrating GNNs, VAEs, and LSTMs, are suitable for the mentioned applications.
• Capturing Hierarchical and Multi-Scale Contexts: HTM models are designed to capture hierarchical and
multi-scale contexts in time series data. They are robust to noise and can learn multiple patterns simultaneously,
making them suitable for applications involving complex temporal patterns and noisy data.
• Generalisation Across Diverse Datasets: Contrastive learning excels in scenarios requiring generalisation
across diverse datasets by learning robust representations through positive and negative pairs. It effectively
distinguishes normal from anomalous patterns in time series data, making it well suited to applications with varying
conditions, such as industrial monitoring, network security, and healthcare diagnostics.

Table 3. Public datasets and benchmarks commonly used for anomaly detection in time series. The names in the first column are
direct hyperlinks to the dataset sources.
4 DATASETS
This section summarises datasets and benchmarks for TSAD, which provides a rich resource for researchers in TSAD.
Some of these datasets are single-purpose datasets for anomaly detection, and some are general-purpose time series
datasets that can be used to evaluate anomaly detection models under certain assumptions or with some customisation.
Each dataset or benchmark can be characterised by multiple aspects of its nature and features. Here, we collect 48
well-known and/or highly cited datasets examined by classic and state-of-the-art (SOTA) deep models for anomaly
detection in time series. These datasets are characterised by the following attributes:
• Nature of the data generation, which can be real, synthetic, or a combination of both.
• Number of entities, i.e., the number of independent time series within each dataset.
• Type of variety of each dataset or benchmark, which can be multivariate, univariate, or a combination of both.
• Number of dimensions, i.e., the number of features of an entity within the dataset.
• Total number of samples across all entities in the dataset.
• The application domain of the dataset.
Note that some datasets have been updated by their authors and contributors, occasionally or regularly, over time; we
report the latest versions of the datasets and their attributes. Table 3 lists all 48 datasets with the
attributes described above, including hyperlinks to the primary source for downloading the latest version
of each dataset.
Based on our exploration, the MTS datasets most commonly used in SOTA TSAD models are MSL [92], SMAP [92], SMD
[115], SWaT [129], PSM [1], and WADI [7]. For UTS, the most commonly used datasets are Yahoo [93], KPI [25], NAB [5], and
UCR [44]. These datasets are frequently used to benchmark and compare the performance of different TSAD models.
More detailed information about them can be found in this GitHub repository: https://github.com/zamanzadeh/ts-anomaly-benchmark.
• The use of anomaly detection for diagnostic purposes requires interpretability. Even so, anomaly detection
research focuses primarily on detection accuracy and largely neglects interpretability.
• Anomalies that occur on a periodic basis are rarely addressed in the literature and make detection
more challenging. A periodic subsequence anomaly is a subsequence that repeats over time [146]. In contrast to
point anomaly detection, periodic subsequence anomaly detection can be adapted to areas such as
fraud detection to identify periodic anomalous transactions over time.
The main objective of this study was to explore and identify state-of-the-art deep learning models for TSAD, their industrial
applications, and the associated datasets. To this end, a variety of perspectives were explored, covering the characteristics
of time series, the types of anomalies in time series, and the structure of deep learning models for TSAD. On the basis of
these perspectives, 64 recent deep models were comprehensively discussed and categorised. Moreover, deep anomaly
detection applications for time series across multiple domains were discussed, along with the datasets commonly used in this
area of research. In the future, active research on deep anomaly detection for time series is needed to overcome
the challenges discussed in this survey.
REFERENCES
[1] Ahmed Abdulaal, Zhuanghua Liu, and Tomer Lancewicki. 2021. Practical approach to asynchronous multivariate time series anomaly detection
and localization. In KDD. 2485–2494.
[2] Oludare Isaac Abiodun, Aman Jantan, Abiodun Esther Omolara, Kemi Victoria Dada, Nachaat AbdElatif Mohamed, and Humaira Arshad. 2018.
State-of-the-art in artificial neural network applications: A survey. Heliyon 4, 11 (2018), e00938.
[3] Charu C Aggarwal. 2007. Data streams: models and algorithms. Vol. 31. Springer.
[4] Charu C Aggarwal. 2017. An introduction to outlier analysis. In Outlier analysis. Springer, 1–34.
[5] Subutai Ahmad, Alexander Lavin, Scott Purdy, and Zuha Agha. 2017. Unsupervised real-time anomaly detection for streaming data. Neurocomputing
262 (2017), 134–147.
[6] Azza H Ahmed, Michael A Riegler, Steven A Hicks, and Ahmed Elmokashfi. 2022. RCAD: Real-time Collaborative Anomaly Detection System for
Mobile Broadband Networks. In KDD. 2682–2691.
[7] Chuadhry Mujeeb Ahmed, Venkata Reddy Palleti, and Aditya P Mathur. 2017. WADI: a water distribution testbed for research in the design of
secure cyber physical systems. In Proceedings of the 3rd international workshop on cyber-physical systems for smart water networks. 25–28.
[8] Khaled Alrawashdeh and Carla Purdy. 2016. Toward an online anomaly intrusion detection system based on deep learning. In ICMLA. IEEE,
195–200.
[9] Rafal Angryk, Petrus Martens, Berkay Aydin, Dustin Kempton, Sushant Mahajan, Sunitha Basodi, Azim Ahmadzadeh, Xumin Cai, Soukaina
Filali Boubrahimi, Shah Muhammad Hamdi, Micheal Schuh, and Manolis Georgoulis. 2020. SWAN-SF. https://doi.org/10.7910/DVN/EBCFKM
[10] Julien Audibert, Pietro Michiardi, Frédéric Guyard, Sébastien Marti, and Maria A Zuluaga. 2020. Usad: Unsupervised anomaly detection on
multivariate time series. In KDD. 3395–3404.
[11] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence
modeling. arXiv preprint arXiv:1803.01271 (2018).
[12] Guillermo Barrenetxea. 2019. Sensorscope Data. https://doi.org/10.5281/zenodo.2654726
[13] Md Abul Bashar and Richi Nayak. 2020. TAnoGAN: Time series anomaly detection with generative adversarial networks. In 2020 IEEE Symposium
Series on Computational Intelligence (SSCI). IEEE, 1778–1785.
[14] Sagnik Basumallik, Rui Ma, and Sara Eftekharnejad. 2019. Packet-data anomaly detection in PMU-based state estimator using convolutional neural
network. International Journal of Electrical Power & Energy Systems 107 (2019), 690–702.
[15] Seif-Eddine Benkabou, Khalid Benabdeslem, and Bruno Canitia. 2018. Unsupervised outlier detection for time series by entropy and dynamic time
warping. Knowledge and Information Systems 54, 2 (2018), 463–486.
[16] Siddharth Bhatia, Arjit Jain, Pan Li, Ritesh Kumar, and Bryan Hooi. 2021. MSTREAM: Fast anomaly detection in multi-aspect streams. In WWW.
3371–3382.
[17] Ane Blázquez-García, Angel Conde, Usue Mori, and Jose A Lozano. 2021. A review on outlier/anomaly detection in time series data. CSUR 54, 3
(2021), 1–33.
[18] Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas, Mohammed Meftah, and Emmanuel Remy. 2021. Unsupervised and scalable
subsequence anomaly detection in large data series. The VLDB Journal 30, 6 (2021), 909–931.
[19] Loïc Bontemps, Van Loi Cao, James McDermott, and Nhien-An Le-Khac. 2016. Collective anomaly detection based on long short-term memory
recurrent neural networks. In FDSE. Springer, 141–152.
[20] Mohammad Braei and Sebastian Wagner. 2020. Anomaly detection in univariate time-series: A survey on the state-of-the-art. arXiv preprint
arXiv:2004.00433 (2020).
[21] Yin Cai, Mei-Ling Shyu, Yue-Xuan Tu, Yun-Tian Teng, and Xing-Xing Hu. 2019. Anomaly detection of earthquake precursor data using long
short-term memory networks. Applied Geophysics 16, 3 (2019), 257–266.
[22] David Campos, Tung Kieu, Chenjuan Guo, Feiteng Huang, Kai Zheng, Bin Yang, and Christian S Jensen. 2021. Unsupervised time series outlier
detection with diversity-driven convolutional ensembles. VLDB 15, 3 (2021), 611–623.
[23] Ander Carreño, Iñaki Inza, and Jose A Lozano. 2020. Analyzing rare event, anomaly, novelty and outlier detection terms under the supervised
classification framework. Artificial Intelligence Review 53, 5 (2020), 3575–3594.
[24] Raghavendra Chalapathy and Sanjay Chawla. 2019. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407 (2019).
[25] International AIOPS Challenges. 2018. KPI Anomaly Detection. https://competition.aiops-challenge.com/home/competition/1484452272200032281
[26] Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. CSUR 41, 3 (2009), 1–58.
[27] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A Hasegawa-Johnson, and Thomas S
Huang. 2017. Dilated recurrent neural networks. NeurIPS 30 (2017).
[28] Sucheta Chauhan and Lovekesh Vig. 2015. Anomaly detection in ECG time signals via deep long short-term memory networks. In 2015 IEEE
International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 1–7.
[29] Qing Chen, Anguo Zhang, Tingwen Huang, Qianping He, and Yongduan Song. 2020. Imbalanced dataset-based echo state networks for anomaly
detection. Neural Computing and Applications 32, 8 (2020), 3685–3694.
[30] Run-Qing Chen, Guang-Hui Shi, Wan-Lei Zhao, and Chang-Hui Liang. 2021. A joint model for IT operation series prediction and anomaly detection.
Neurocomputing 448 (2021), 130–139.
[31] Tingting Chen, Xueping Liu, Bizhong Xia, Wei Wang, and Yongzhi Lai. 2020. Unsupervised anomaly detection of industrial robots using
sliding-window convolutional variational autoencoder. IEEE Access 8 (2020), 47072–47081.
[32] Wenxiao Chen, Haowen Xu, Zeyan Li, Dan Pei, Jie Chen, Honglin Qiao, Yang Feng, and Zhaogang Wang. 2019. Unsupervised anomaly detection
for intricate kpis via adversarial training of vae. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications. IEEE, 1891–1899.
[33] Xuanhao Chen, Liwei Deng, Feiteng Huang, Chengwei Zhang, Zongquan Zhang, Yan Zhao, and Kai Zheng. 2021. Daemon: Unsupervised anomaly
detection and interpretation for multivariate time series. In ICDE. IEEE, 2225–2230.
[34] Zekai Chen, Dingshuo Chen, Xiao Zhang, Zixuan Yuan, and Xiuzhen Cheng. 2021. Learning graph structures with transformer for multivariate
time series anomaly detection in iot. IEEE Internet of Things Journal (2021).
[35] Yongliang Cheng, Yan Xu, Hong Zhong, and Yi Liu. 2019. HS-TCN: A semi-supervised hierarchical stacking temporal convolutional network for
anomaly detection in IoT. In IPCCC. IEEE, 1–7.
[36] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation:
Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. 103–111.
[37] Kukjin Choi, Jihun Yi, Changhwa Park, and Sungroh Yoon. 2021. Deep learning for anomaly detection in time-series data: review, analysis, and
guidelines. IEEE Access (2021).
[38] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on
sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[39] Yuwei Cui, Subutai Ahmad, and Jeff Hawkins. 2017. The HTM spatial pooler—a neocortical algorithm for online sparse distributed coding. Frontiers
in computational neuroscience (2017), 111.
[40] Enyan Dai and Jie Chen. 2022. Graph-Augmented Normalizing Flows for Anomaly Detection of Multiple Time Series. In ICLR.
[41] Andrea Dal Pozzolo, Olivier Caelen, Reid A Johnson, and Gianluca Bontempi. 2015. Calibrating probability with undersampling for unbalanced
classification. In 2015 IEEE symposium series on computational intelligence. IEEE, 159–166.
[42] Zahra Zamanzadeh Darban, Geoffrey I Webb, Shirui Pan, and Mahsa Salehi. 2023. CARLA: A Self-supervised Contrastive Representation Learning
Approach for Time Series Anomaly Detection. arXiv preprint arXiv:2308.09296 (2023).
[43] Zahra Zamanzadeh Darban, Geoffrey I Webb, and Mahsa Salehi. 2024. DACAD: Domain Adaptation Contrastive Learning for Anomaly Detection
in Multivariate Time Series. arXiv preprint arXiv:2404.11269 (2024).
[44] Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Yanping,
Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, Gustavo Batista, and Hexagon-ML. 2018. The UCR Time Series Classification
Archive. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/.
[45] Ailin Deng and Bryan Hooi. 2021. Graph neural network-based anomaly detection in multivariate time series. In AAAI, Vol. 35. 4027–4035.
[46] Leyan Deng, Defu Lian, Zhenya Huang, and Enhong Chen. 2022. Graph convolutional adversarial networks for spatiotemporal anomaly detection.
TNNLS 33, 6 (2022), 2416–2428.
[47] Rahul Dey and Fathi M Salem. 2017. Gate-variants of gated recurrent unit (GRU) neural networks. In IEEE 60th international midwest symposium on
circuits and systems. IEEE, 1597–1600.
[48] Nan Ding, Huanbo Gao, Hongyu Bu, Haoxuan Ma, and Huaiwei Si. 2018. Multivariate-time-series-driven real-time anomaly detection based on
bayesian network. Sensors 18, 10 (2018), 3367.
[49] Nan Ding, HaoXuan Ma, Huanbo Gao, YanHua Ma, and GuoZhen Tan. 2019. Real-time anomaly detection based on long short-Term memory and
Gaussian Mixture Model. Computers & Electrical Engineering 79 (2019), 106458.
[50] Zhiguo Ding and Minrui Fei. 2013. An anomaly detection approach based on isolation forest algorithm for streaming data using sliding window.
IFAC Proceedings Volumes 46, 20 (2013), 12–17.
[51] Third International Knowledge Discovery and Data Mining Tools Competition. 1999. KDD Cup 1999 Data. https://kdd.ics.uci.edu/databases/
kddcup99/kddcup99.html
[52] Yadolah Dodge. 2008. Time Series. Springer New York, New York, NY, 536–539. https://doi.org/10.1007/978-0-387-32833-1_401
[53] Nicola Dragoni, Saverio Giallorenzo, Alberto Lluch Lafuente, Manuel Mazzara, Fabrizio Montesi, Ruslan Mustafin, and Larisa Safina. 2017.
Microservices: yesterday, today, and tomorrow. Present and ulterior software engineering (2017), 195–216.
[54] Bowen Du, Xuanxuan Sun, Junchen Ye, Ke Cheng, Jingyuan Wang, and Leilei Sun. 2021. GAN-Based Anomaly Detection for Multivariate Time
Series Using Polluted Training Set. TKDE (2021).
[55] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository.
[56] Tolga Ergen and Suleyman Serdar Kozat. 2019. Unsupervised anomaly detection with LSTM neural networks. TNNLS 31, 8 (2019), 3127–3141.
[57] Philippe Esling and Carlos Agon. 2012. Time-series data mining. CSUR 45, 1 (2012), 1–34.
[58] Okwudili M Ezeme, Qusay Mahmoud, and Akramul Azim. 2020. A framework for anomaly detection in time-driven and event-driven processes
using kernel traces. TKDE (2020).
[59] Cheng Fan, Fu Xiao, Yang Zhao, and Jiayuan Wang. 2018. Analytical investigation of autoencoder-based methods for unsupervised anomaly
detection in building energy data. Applied energy 211 (2018), 1123–1135.
[60] Cheng Feng and Pengwei Tian. 2021. Time series anomaly detection for cyber-physical systems via neural system identification and bayesian
filtering. In KDD. 2858–2867.
[61] Yong Feng, Zijun Liu, Jinglong Chen, Haixin Lv, Jun Wang, and Xinwei Zhang. 2022. Unsupervised Multimodal Anomaly Detection With Missing
Sources for Liquid Rocket Engine. TNNLS (2022).
[62] Bob Ferrell and Steven Santuro. 2005. NASA Shuttle Valve Data. http://www.cs.fit.edu/~pkc/nasa/data/
[63] Pavel Filonov, Andrey Lavrentyev, and Artem Vorontsov. 2016. Multivariate industrial time series with cyber-attack simulation: Fault detection
using an lstm-based predictive data model. arXiv preprint arXiv:1612.06676 (2016).
[64] A Garg, W Zhang, J Samaran, R Savitha, and CS Foo. 2022. An Evaluation of Anomaly Detection and Diagnosis in Multivariate Time Series. TNNLS
33, 6 (2022), 2508–2517.
[65] Dileep George. 2008. How the brain might work: A hierarchical and temporal model for learning and recognition. Stanford University.
[66] Jonathan Goh, Sridhar Adepu, Marcus Tan, and Zi Shan Lee. 2017. Anomaly detection in cyber physical systems using recurrent neural networks.
In 2017 IEEE 18th International Symposium on High Assurance Systems Engineering (HASE). IEEE, 140–145.
[67] A L Goldberger, L A Amaral, L Glass, J M Hausdorff, P C Ivanov, R G Mark, J E Mietus, G B Moody, C K Peng, and H E Stanley. 2000. PhysioBank,
PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, 23 (2000), e215–e220.
[68] Abbas Golestani and Robin Gras. 2014. Can we predict the unpredictable? Scientific reports 4, 1 (2014), 1–6.
[69] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative
adversarial nets. NeurIPS 27 (2014).
[70] Adam Goodge, Bryan Hooi, See-Kiong Ng, and Wee Siong Ng. 2020. Robustness of Autoencoders for Anomaly Detection Under Adversarial
Impact.. In IJCAI. 1244–1250.
[71] Scott David Greenwald, Ramesh S Patil, and Roger G Mark. 1990. Improved detection and classification of arrhythmias in noise-corrupted electrocar-
diograms using contextual information. IEEE.
[72] Frank E Grubbs. 1969. Procedures for detecting outlying observations in samples. Technometrics 11, 1 (1969), 1–21.
[73] Antonio Gulli and Sujit Pal. 2017. Deep learning with Keras. Packt Publishing Ltd.
[74] Yifan Guo, Weixian Liao, Qianlong Wang, Lixing Yu, Tianxi Ji, and Pan Li. 2018. Multidimensional time series anomaly detection: A gru-based
gaussian mixture variational autoencoder approach. In Asian Conference on Machine Learning. PMLR, 97–112.
[75] James Douglas Hamilton. 2020. Time series analysis. Princeton university press.
[76] Siho Han and Simon S Woo. 2022. Learning Sparse Latent Graph Representations for Anomaly Detection in Multivariate Time Series. In KDD.
2977–2986.
[77] Douglas M Hawkins. 1980. Identification of outliers. Vol. 11. Springer.
[78] Yangdong He and Jiabao Zhao. 2019. Temporal convolutional networks for anomaly detection in time series. In Journal of Physics: Conference
Series, Vol. 1213. IOP Publishing, 042050.
[79] Zilong He, Pengfei Chen, Xiaoyun Li, Yongfeng Wang, Guangba Yu, Cailin Chen, Xinrui Li, and Zibin Zheng. 2020. A spatiotemporal deep learning
approach for unsupervised anomaly detection in cloud systems. TNNLS (2020).
[80] Michiel Hermans and Benjamin Schrauwen. 2013. Training and analysing deep recurrent neural networks. NeurIPS 26 (2013).
[81] Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. science 313, 5786 (2006), 504–507.
[82] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[83] Xiaodi Hou and Liqing Zhang. 2007. Saliency detection: A spectral residual approach. In IEEE Conference on Computer Vision and Pattern Recognition.
IEEE, 1–8.
[84] Ruei-Jie Hsieh, Jerry Chou, and Chih-Hsiang Ho. 2019. Unsupervised online anomaly detection on multivariate sensing time series data for smart
manufacturing. In IEEE 12th Conference on Service-Oriented Computing and Applications (SOCA). IEEE, 90–97.
[85] Chia-Yu Hsu and Wei-Chen Liu. 2021. Multiple time-series convolutional neural network for fault detection and diagnosis and empirical study in
semiconductor manufacturing. Journal of Intelligent Manufacturing 32, 3 (2021), 823–836.
[86] Chao Huang, Chuxu Zhang, Peng Dai, and Liefeng Bo. 2021. Cross-interaction hierarchical attention networks for urban anomaly prediction. In
IJCAI. 4359–4365.
[87] Ling Huang, Xing-Xing Liu, Shu-Qiang Huang, Chang-Dong Wang, Wei Tu, Jia-Meng Xie, Shuai Tang, and Wendi Xie. 2021. Temporal Hierarchical
Graph Attention Network for Traffic Prediction. ACM Transactions on Intelligent Systems and Technology (TIST) 12, 6 (2021), 1–21.
[88] Ling Huang, XuanLong Nguyen, Minos Garofalakis, Michael Jordan, Anthony Joseph, and Nina Taft. 2006. In-network PCA and anomaly detection.
NeurIPS 19 (2006).
[89] Tao Huang, Pengfei Chen, and Ruipeng Li. 2022. A Semi-Supervised VAE Based Active Anomaly Detection Framework in Multivariate Time Series
for Online Systems. In WWW. 1797–1806.
[90] Xin Huang, Jangsoo Lee, Young-Woo Kwon, and Chul-Ho Lee. 2020. CrowdQuake: A networked system of low-cost sensors for earthquake
detection via deep learning. In KDD. 3261–3271.
[91] Alexis Huet, Jose Manuel Navarro, and Dario Rossi. 2022. Local evaluation of time series anomaly detection algorithms. In KDD. 635–645.
[92] Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. 2018. Detecting spacecraft anomalies using lstms
and nonparametric dynamic thresholding. In KDD. 387–395.
[93] Yahoo Inc. 2021. S5-A Labeled Anomaly Detection Dataset, Version 1.0. https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70
[94] Vincent Jacob, Fei Song, Arnaud Stiegler, Bijan Rad, Yanlei Diao, and Nesime Tatbul. 2021. Exathlon: A Benchmark for Explainable Anomaly
Detection over Time Series. VLDB (2021).
[95] Herbert Jaeger. 2007. Echo state network. scholarpedia 2, 9 (2007), 2330.
[96] Ahmad Javaid, Quamar Niyaz, Weiqing Sun, and Mansoor Alam. 2016. A deep learning approach for network intrusion detection system. In
Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS). 21–26.
[97] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2022. Transformers in vision: A
survey. CSUR 54, 10s (2022), 1–41.
[98] Tung Kieu, Bin Yang, Chenjuan Guo, and Christian S Jensen. 2019. Outlier Detection for Time Series with Recurrent Autoencoder Ensembles.. In
IJCAI. 2725–2732.
[99] Dohyung Kim, Hyochang Yang, Minki Chung, Sungzoon Cho, Huijung Kim, Minhee Kim, Kyungwon Kim, and Eunseok Kim. 2018. Squeezed
convolutional variational autoencoder for unsupervised anomaly detection in edge device industrial internet of things. In 2018 international
conference on information and computer technologies (icict). IEEE, 67–71.
[100] Eunji Kim, Sungzoon Cho, Byeongeon Lee, and Myoungsu Cho. 2019. Fault detection and diagnosis using self-attentive convolutional neural
networks for variable-length sensor data in semiconductor manufacturing. IEEE Transactions on Semiconductor Manufacturing 32, 3 (2019), 302–309.
[101] Siwon Kim, Kukjin Choi, Hyun-Soo Choi, Byunghan Lee, and Sungroh Yoon. 2022. Towards a rigorous evaluation of time-series anomaly detection.
In AAAI, Vol. 36. 7194–7201.
[102] Diederik P Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. stat 1050 (2014), 1.
[103] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
[104] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. 2017. On convergence and stability of gans. arXiv preprint arXiv:1705.07215 (2017).
[105] Mark A Kramer. 1991. Nonlinear principal component analysis using autoassociative neural networks. AIChE journal 37, 2 (1991), 233–243.
[106] Kwei-Herng Lai, Lan Wang, Huiyuan Chen, Kaixiong Zhou, Fei Wang, Hao Yang, and Xia Hu. 2023. Context-aware domain adaptation for time
series anomaly detection. In SDM. SIAM, 676–684.
[107] Kwei-Herng Lai, Daochen Zha, Junjie Xu, Yue Zhao, Guanchu Wang, and Xia Hu. 2021. Revisiting time series outlier detection: Definitions and
benchmarks. In NeurIPS.
[108] Siddique Latif, Muhammad Usman, Rajib Rana, and Junaid Qadir. 2018. Phonocardiographic sensing using deep learning for abnormal heartbeat
detection. IEEE Sensors Journal 18, 22 (2018), 9393–9400.
[109] Alexander Lavin and Subutai Ahmad. 2015. Evaluating real-time anomaly detection algorithms–the Numenta anomaly benchmark. In ICMLA.
IEEE, 38–44.
[110] Tae Jun Lee, Justin Gottschlich, Nesime Tatbul, Eric Metcalf, and Stan Zdonik. 2018. Greenhouse: A zero-positive machine learning system for
time-series anomaly detection. arXiv preprint arXiv:1801.03168 (2018).
[111] Dan Li, Dacheng Chen, Baihong Jin, Lei Shi, Jonathan Goh, and See-Kiong Ng. 2019. MAD-GAN: Multivariate anomaly detection for time series
data with generative adversarial networks. In ICANN. Springer, 703–716.
[112] Longyuan Li, Junchi Yan, Haiyang Wang, and Yaohui Jin. 2020. Anomaly detection of time series with smoothness-inducing sequential variational
auto-encoder. TNNLS 32, 3 (2020), 1177–1191.
[113] Longyuan Li, Junchi Yan, Qingsong Wen, Yaohui Jin, and Xiaokang Yang. 2022. Learning Robust Deep State Space for Unsupervised Anomaly
Detection in Contaminated Time-Series. TKDE (2022).
[114] Yifan Li, Xiaoyan Peng, Jia Zhang, Zhiyong Li, and Ming Wen. 2021. DCT-GAN: Dilated Convolutional Transformer-based GAN for Time Series
Anomaly Detection. TKDE (2021).
[115] Zeyan Li, Wenxiao Chen, and Dan Pei. 2018. Robust and unsupervised kpi anomaly detection based on conditional variational autoencoder. In
IPCCC. IEEE, 1–9.
[116] Zhang Li, Bian Xia, and Mei Dong-Cheng. 2001. Gamma-ray light curve and phase-resolved spectra from Geminga pulsar. Chinese Physics 10, 7
(2001), 662.
[117] Zhihan Li, Youjian Zhao, Jiaqi Han, Ya Su, Rui Jiao, Xidao Wen, and Dan Pei. 2021. Multivariate time series anomaly detection and interpretation
using hierarchical inter-metric and temporal embedding. In KDD. 3220–3230.
[118] Fan Liu, Xingshe Zhou, Jinli Cao, Zhu Wang, Tianben Wang, Hua Wang, and Yanchun Zhang. 2020. Anomaly detection in quasi-periodic time
series based on automatic data segmentation and attentional LSTM-CNN. TKDE (2020).
[119] Jianwei Liu, Hongwei Zhu, Yongxia Liu, Haobo Wu, Yunsheng Lan, and Xinyu Zhang. 2019. Anomaly detection for time series using temporal
convolutional networks and Gaussian mixture model. In Journal of Physics: Conference Series, Vol. 1187. IOP Publishing, 042111.
[120] Manuel Lopez-Martin, Angel Nevado, and Belen Carro. 2020. Detection of early stages of Alzheimer’s disease based on MEG activity with a
randomized convolutional neural network. Artificial Intelligence in Medicine 107 (2020), 101924.
[121] Zhilong Lu, Weifeng Lv, Zhipu Xie, Bowen Du, Guixi Xiong, Leilei Sun, and Haiquan Wang. 2022. Graph Sequence Neural Network with an
Attention Mechanism for Traffic Speed Prediction. ACM Transactions on Intelligent Systems and Technology (TIST) 13, 2 (2022), 1–24.
[122] Tie Luo and Sai G Nagarajan. 2018. Distributed anomaly detection using autoencoder neural networks in WSN for IoT. In ICC. IEEE, 1–6.
[123] Lyft. 2022. Citi Bike Trip Histories. https://ride.citibikenyc.com/system-data
[124] Junshui Ma and Simon Perkins. 2003. Online novelty detection on temporal sequences. In KDD. 613–618.
[125] Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. 2016. LSTM-based encoder-decoder
for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148 (2016).
[126] Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, Puneet Agarwal, et al. 2015. Long short term memory networks for anomaly detection in time
series. In ESANN, Vol. 89. 89–94.
[127] Behrooz Mamandipoor, Mahshid Majd, Seyedmostafa Sheikhalishahi, Claudio Modena, and Venet Osmani. 2020. Monitoring and detecting faults in
wastewater treatment plants using deep learning. Environmental monitoring and assessment 192, 2 (2020), 1–12.
[128] Mohammad M Masud, Qing Chen, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han, and Bhavani Thuraisingham. 2010. Addressing concept-
evolution in concept-drifting data streams. In ICDM. IEEE, 929–934.
[129] Aditya P Mathur and Nils Ole Tippenhauer. 2016. SWaT: A water treatment testbed for research and training on ICS security. In 2016 international
workshop on cyber-physical systems for smart water networks (CySWater). IEEE, 31–36.
[130] Hengyu Meng, Yuxuan Zhang, Yuanxiang Li, and Honghua Zhao. 2019. Spacecraft anomaly detection via transformer reconstruction error. In
International Conference on Aerospace System Science and Engineering. Springer, 351–362.
[131] George B Moody and Roger G Mark. 2001. The impact of the MIT-BIH arrhythmia database. IEEE Engineering in Medicine and Biology Magazine 20,
3 (2001), 45–50.
[132] Steffen Moritz, Frederik Rehbach, Sowmya Chandrasekaran, Margarita Rebolledo, and Thomas Bartz-Beielstein. 2018. GECCO Industrial Challenge
2018 Dataset: A water quality dataset for the ’Internet of Things: Online Anomaly Detection for Drinking Water Quality’ competition at the Genetic and
Evolutionary Computation Conference 2018, Kyoto, Japan. https://doi.org/10.5281/zenodo.3884398
[133] Masud Moshtaghi, James C Bezdek, Christopher Leckie, Shanika Karunasekera, and Marimuthu Palaniswami. 2014. Evolving fuzzy rules for
anomaly detection in data streams. IEEE Transactions on Fuzzy Systems 23, 3 (2014), 688–700.
[134] Meinard Müller. 2007. Dynamic time warping. Information retrieval for music and motion (2007), 69–84.
[135] Mohsin Munir, Shoaib Ahmed Siddiqui, Andreas Dengel, and Sheraz Ahmed. 2018. DeepAnT: A deep learning approach for unsupervised anomaly
detection in time series. IEEE Access 7 (2018), 1991–2005.
[136] Youngeun Nam, Susik Yoon, Yooju Shin, Minyoung Bae, Hwanjun Song, Jae-Gil Lee, and Byung Suk Lee. 2024. Breaking the Time-Frequency
Granularity Discrepancy in Time-Series Anomaly Detection. (2024).
[137] Andrew Ng et al. 2011. Sparse autoencoder. CS294A Lecture notes 72, 2011 (2011), 1–19.
[138] Zijian Niu, Ke Yu, and Xiaofei Wu. 2020. LSTM-based VAE-GAN for time-series anomaly detection. Sensors 20, 13 (2020), 3738.
[139] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. 2015. Learning deconvolution network for semantic segmentation. In ICCV. 1520–1528.
[140] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. 2021. Deep learning for anomaly detection: A review. CSUR 54, 2
(2021), 1–38.
[141] Guansong Pang, Chunhua Shen, and Anton van den Hengel. 2019. Deep anomaly detection with deviation networks. In KDD. 353–362.
[142] John Paparrizos, Paul Boniol, Themis Palpanas, Ruey S Tsay, Aaron Elmore, and Michael J Franklin. 2022. Volume under the surface: a new accuracy
evaluation measure for time-series anomaly detection. VLDB 15, 11 (2022), 2774–2787.
[143] Daehyung Park, Yuuna Hoshi, and Charles C Kemp. 2018. A multimodal anomaly detector for robot-assisted feeding using an lstm-based variational
autoencoder. IEEE Robotics and Automation Letters 3, 3 (2018), 1544–1551.
[144] Thibaut Perol, Michaël Gharbi, and Marine Denolle. 2018. Convolutional neural network for earthquake detection and location. Science Advances 4,
2 (2018), e1700578.
[145] Tie Qiu, Ruixuan Qiao, and Dapeng Oliver Wu. 2017. EABS: An event-aware backpressure scheduling scheme for emergency Internet of Things.
IEEE Transactions on Mobile Computing 17, 1 (2017), 72–84.
[146] Faraz Rasheed and Reda Alhajj. 2013. A framework for periodic outlier pattern detection in time-series sequences. IEEE transactions on cybernetics
44, 5 (2013), 569–582.
[147] Hansheng Ren, Bixiong Xu, Yujing Wang, Chao Yi, Congrui Huang, Xiaoyu Kou, Tony Xing, Mao Yang, Jie Tong, and Qi Zhang. 2019. Time-series
anomaly detection service at microsoft. In KDD. 3009–3017.
[148] Jonathan Rubin, Rui Abreu, Anurag Ganguli, Saigopal Nelaturi, Ion Matei, and Kumar Sricharan. 2017. Recognizing Abnormal Heart Sounds Using
Deep Learning. In IJCAI.
[149] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft.
2018. Deep one-class classification. In ICML. PMLR, 4393–4402.
[150] Mayu Sakurada and Takehisa Yairi. 2014. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Workshop on Machine
Learning for Sensory Data Analysis. 4–11.
[151] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE
transactions on neural networks 20, 1 (2008), 61–80.
[152] Udo Schlegel, Hiba Arnout, Mennatallah El-Assady, Daniela Oelke, and Daniel A Keim. 2019. Towards a rigorous evaluation of xai methods on
time series. In ICCVW. IEEE, 4197–4201.
[153] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. 2022. Anomaly detection in time series: a comprehensive evaluation. VLDB 15, 9
(2022), 1779–1797.
[154] Pump sensor data. 2018. Pump sensor data for predictive maintenance. https://www.kaggle.com/datasets/nphantawee/pump-sensor-data
[155] Iman Sharafaldin, Arash Habibi Lashkari, and Ali A Ghorbani. 2018. Toward generating a new intrusion detection dataset and intrusion traffic
characterization. ICISSp 1 (2018), 108–116.
[156] Lifeng Shen, Zhuocong Li, and James Kwok. 2020. Timeseries anomaly detection using temporal hierarchical one-class network. NeurIPS 33 (2020),
13016–13026.
[157] Nathan Shone, Tran Nguyen Ngoc, Vu Dinh Phai, and Qi Shi. 2018. A deep learning approach to network intrusion detection. IEEE transactions on
emerging topics in computational intelligence 2, 1 (2018), 41–50.
[158] Alban Siffer, Pierre-Alain Fouque, Alexandre Termier, and Christine Largouet. 2017. Anomaly detection in streams with extreme value theory. In
KDD. 1067–1075.
[159] Maximilian Sölch, Justin Bayer, Marvin Ludersdorfer, and Patrick van der Smagt. 2016. Variational inference for on-line anomaly detection in
high-dimensional time series. arXiv preprint arXiv:1602.07109 (2016).
[160] Huan Song, Deepta Rajan, Jayaraman Thiagarajan, and Andreas Spanias. 2018. Attend and diagnose: Clinical time series analysis using attention
models. In AAAI, Vol. 32.
[161] Xiaomin Song, Qingsong Wen, Yan Li, and Liang Sun. 2022. Robust Time Series Dissimilarity Measure for Outlier Detection and Periodicity
Detection. In CIKM. 4510–4514.
[162] Yanjue Song and Suzhen Li. 2021. Gas leak detection in galvanised steel pipe with internal flow noise using convolutional neural network. Process
Safety and Environmental Protection 146 (2021), 736–744.
[163] Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. 2019. Robust anomaly detection for multivariate time series through stochastic
recurrent neural network. In KDD. 2828–2837.
[164] Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A Ghorbani. 2009. A detailed analysis of the KDD CUP 99 data set. In 2009 IEEE Symposium
on Computational Intelligence for Security and Defense Applications. IEEE, 1–6.
[165] David MJ Tax and Robert PW Duin. 2004. Support vector data description. Machine learning 54 (2004), 45–66.
[166] NYC Taxi and Limousine Commission. 2022. TLC Trip Record Data. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
[167] Ahmed Tealab. 2018. Time series forecasting using artificial neural networks methodologies: A systematic review. Future Computing and Informatics
Journal 3, 2 (2018), 334–340.
[168] M G Terzano, L Parrino, A Sherieri, R Chervin, S Chokroverty, C Guilleminault, M Hirshkowitz, M Mahowald, H Moldofsky, A Rosa, R Thomas,
and A Walters. 2001. Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (CAP) in human sleep. Sleep Med. 2, 6 (Nov.
2001), 537–553.
[169] Markus Thill, Wolfgang Konen, and Thomas Bäck. 2020. Time series encodings with temporal convolutional networks. In International Conference
on Bioinspired Methods and Their Applications. Springer, 161–173.
[170] Markus Thill, Wolfgang Konen, and Thomas Bäck. 2020. MarkusThill/MGAB: The Mackey-Glass Anomaly Benchmark. https://doi.org/10.5281/
zenodo.3760086
[171] Shreshth Tuli, Giuliano Casale, and Nicholas R. Jennings. 2022. TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time
Series Data. VLDB 15 (2022), 1201–1214.
[172] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is
all you need. NeurIPS 30 (2017).
[173] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. GRAPH ATTENTION NETWORKS.
stat 1050 (2018), 4.
[174] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising
autoencoders. In ICML. 1096–1103.
[175] Alexander von Birgelen and Oliver Niggemann. 2018. Anomaly detection and localization for cyber-physical production systems with self-organizing
maps. In Improve-innovative modelling approaches for production systems to raise validatable efficiency. Springer Vieweg, Berlin, Heidelberg, 55–71.
[176] Kai Wang, Youjin Zhao, Qingyu Xiong, Min Fan, Guotan Sun, Longkun Ma, and Tong Liu. 2016. Research on healthy anomaly detection model
based on deep learning from multiple time-series physiological signals. Scientific Programming 2016 (2016).
[177] Xixuan Wang, Dechang Pi, Xiangyan Zhang, Hao Liu, and Chang Guo. 2022. Variational transformer-based anomaly detection approach for
multivariate time series. Measurement 191 (2022), 110791.
[178] Yi Wang, Linsheng Han, Wei Liu, Shujia Yang, and Yanbo Gao. 2019. Study on wavelet neural network based anomaly detection in ocean observing
data series. Ocean Engineering 186 (2019), 106129.
[179] Politechnika Warszawska. 2020. Damadics Benchmark Website. https://iair.mchtr.pw.edu.pl/Damadics
[180] Tailai Wen and Roy Keyes. 2019. Time series anomaly detection using convolutional neural networks and transfer learning. arXiv preprint
arXiv:1905.13628 (2019).
[181] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. 2023. TimesNet: Temporal 2D-Variation Modeling for General
Time Series Analysis. In ICLR.
[182] Jia Wu, Weiru Zeng, and Fei Yan. 2018. Hierarchical temporal memory method for time-series-based anomaly detection. Neurocomputing 273
(2018), 535–546.
[183] Wentai Wu, Ligang He, Weiwei Lin, Yi Su, Yuhua Cui, Carsten Maple, and Stephen A Jarvis. 2020. Developing an unsupervised real-time anomaly
detection scheme for time series with multi-seasonality. TKDE (2020).
[184] Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang Feng, et al. 2018. Unsupervised
anomaly detection via variational auto-encoder for seasonal kpis in web applications. In WWW. 187–196.
[185] Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. 2021. Anomaly Transformer: Time Series Anomaly Detection with Association
Discrepancy. In ICLR.
[186] Kenji Yamanishi and Jun-ichi Takeuchi. 2002. A unifying framework for detecting outliers and change points from non-stationary time series data.
In KDD. 676–681.
[187] Yiyuan Yang, Chaoli Zhang, Tian Zhou, Qingsong Wen, and Liang Sun. 2023. DCdetector: Dual Attention Contrastive Representation Learning for
Time Series Anomaly Detection. In KDD.
[188] Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding, Hoang Anh Dau, Diego Furtado Silva, Abdullah Mueen, and
Eamonn Keogh. 2016. Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In
ICDM. 1317–1322.
[189] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. 2019. DAG-GNN: DAG structure learning with graph neural networks. In ICML. PMLR, 7154–7163.
[190] Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu. 2022. Ts2vec: Towards universal
representation of time series. In AAAI, Vol. 36. 8980–8987.
[191] Chunkai Zhang, Shaocong Li, Hongye Zhang, and Yingyang Chen. 2019. VELC: A new variational autoencoder based model for time series
anomaly detection. arXiv preprint arXiv:1907.01702 (2019).
[192] Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V
Chawla. 2019. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In AAAI, Vol. 33.
1409–1416.
[193] Mingyang Zhang, Tong Li, Hongzhi Shi, Yong Li, Pan Hui, et al. 2019. A decomposition approach for urban anomaly detection across spatiotemporal
data. In IJCAI. International Joint Conferences on Artificial Intelligence.
[194] Runtian Zhang and Qian Zou. 2018. Time series prediction and anomaly detection of light curve using lstm neural network. In Journal of Physics:
Conference Series, Vol. 1061. IOP Publishing, 012012.
[195] Weishan Zhang, Wuwu Guo, Xin Liu, Yan Liu, Jiehan Zhou, Bo Li, Qinghua Lu, and Su Yang. 2018. LSTM-based analysis of industrial IoT equipment.
IEEE Access 6 (2018), 23551–23560.
[196] Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. 2022. Self-supervised contrastive pre-training for time series via
time-frequency consistency. NeurIPS 35 (2022), 3988–4003.
[197] Yuxin Zhang, Yiqiang Chen, Jindong Wang, and Zhiwen Pan. 2021. Unsupervised deep anomaly detection for multi-sensor time-series signals.
TKDE (2021).
[198] Yuxin Zhang, Jindong Wang, Yiqiang Chen, Han Yu, and Tao Qin. 2022. Adaptive memory networks with self-supervised learning for unsupervised
anomaly detection. TKDE (2022).
[199] Hang Zhao, Yujing Wang, Juanyong Duan, Congrui Huang, Defu Cao, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, and Qi Zhang. 2020. Multivariate
time-series anomaly detection via graph attention network. In ICDM. IEEE, 841–850.
[200] Bin Zhou, Shenghua Liu, Bryan Hooi, Xueqi Cheng, and Jing Ye. 2019. BeatGAN: Anomalous Rhythm Detection using Adversarially Generated
Time Series.. In IJCAI. 4433–4439.
[201] Lingxue Zhu and Nikolay Laptev. 2017. Deep and confident prediction for time series at uber. In ICDMW. IEEE, 103–110.
[202] Weiqiang Zhu and Gregory C Beroza. 2019. PhaseNet: a deep-neural-network-based seismic arrival-time picking method. Geophysical Journal
International 216, 1 (2019), 261–273.
[203] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. 2018. Deep autoencoding gaussian
mixture model for unsupervised anomaly detection. In ICLR.
B INTERPRETABILITY METRICS
These metrics collectively offer a way to assess the interpretability of anomaly detection systems, specifically their
ability to identify and prioritise the most relevant factors or dimensions contributing to each detected anomaly.
HitRate@P% is adapted in [163] from the HitRate@K measure used in recommender systems, modified to evaluate the accuracy of anomaly interpretation at the segment level. HitRate@P% assesses whether the true causes (relevant dimensions) of an anomaly are included within the top P% of the causes identified by the algorithm.
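As an illustration, the following is a minimal Python sketch of HitRate@P%, assuming the common convention that the cutoff is ⌊P% × |true causes|⌋; the function and variable names are illustrative rather than taken from [163].

from typing import Sequence, Set

def hit_rate_at_p(ranked_dims: Sequence[int], true_causes: Set[int], p: float) -> float:
    """Fraction of true causal dimensions found in the top
    floor(P% * |true_causes|) of the model's ranked dimensions."""
    if not true_causes:
        return 0.0
    k = int(len(true_causes) * p / 100)  # cutoff derived from P%
    hits = len(set(ranked_dims[:k]) & true_causes)
    return hits / len(true_causes)

# Example: ranked [3, 0, 7, 2], true causes {0, 2}, P = 100 -> top-2 = [3, 0] -> 0.5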
IPS measures how completely a model recovers the true causes of each detected anomaly. It is typically defined as the proportion of correctly identified causes within the top-k ranked items or factors:

$$\mathrm{IPS} = \frac{1}{N}\sum_{i=1}^{N}\frac{\text{Number of true causes in top } k \text{ for segment } i}{\text{Total number of true causes in segment } i} \qquad (26)$$

where $N$ is the number of segments analysed, and the counts are taken from the top-$k$ causes identified by the model for each segment.
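A minimal sketch of Eq. (26), assuming each segment's predicted top-k causes and true causes are given as sets (names are illustrative):

from typing import List, Set

def ips(top_k_causes: List[Set[int]], true_causes: List[Set[int]]) -> float:
    """Eq. (26): average over N segments of the fraction of each
    segment's true causes recovered among the model's top-k causes.
    Assumes every segment has at least one true cause."""
    n = len(true_causes)
    total = sum(
        len(pred & truth) / len(truth)
        for pred, truth in zip(top_k_causes, true_causes)
    )
    return total / n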
RC-top-k (Relevant Causes top-k) [64] measures the fraction of events for which at least one true cause appears among the top k causes identified by the model:

$$\text{RC-top-}k = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[\text{top-}k \text{ causes for event } i \text{ contain a true cause}\big] \qquad (27)$$

This metric focuses on the model's ability to capture at least one relevant cause out of the potentially several contributing factors.
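A corresponding sketch of RC-top-k, with per-event inputs mirroring those above (illustrative names):

from typing import List, Sequence, Set

def rc_top_k(ranked_causes: List[Sequence[int]], true_causes: List[Set[int]], k: int) -> float:
    """Eq. (27): fraction of events whose top-k ranked causes
    contain at least one ground-truth cause."""
    hits = sum(
        1 for ranked, truth in zip(ranked_causes, true_causes)
        if set(ranked[:k]) & truth  # non-empty intersection counts as a hit
    )
    return hits / len(true_causes)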
RDCG@P% applies a discounted-cumulative-gain-style weighting to the ranked dimensions, so that relevant dimensions placed higher in the ranking contribute more:

$$\mathrm{RDCG@P\%} = \sum_{i=1}^{P}\frac{2^{r_i}-1}{\log_2(i+1)} \qquad (28)$$

where $r_i$ is the relevance score of the dimension at position $i$ in the ranking, and the sum runs over the top P% of ranked dimensions.
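A minimal sketch of Eq. (28), where `relevance` holds $r_i$ for the dimension ranked $i$-th (illustrative names):

import math
from typing import Sequence

def rdcg_at_p(relevance: Sequence[float], p: float) -> float:
    """Eq. (28): DCG-style gain over the top P% of ranked dimensions."""
    top = max(1, int(len(relevance) * p / 100))  # number of positions within P%
    return sum(
        (2 ** r - 1) / math.log2(i + 1)          # gain discounted by rank
        for i, r in enumerate(relevance[:top], start=1)
    )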
C EXPERIMENTAL RESULTS
The plots in Fig. 12 compare various TSAD models across four MTS datasets: MSL, SMAP, SMD, and SWaT. Each model's performance is evaluated using two metrics: the $F1$ score and the point-adjusted $F1$ score ($F1_{PA}$). Fig. 12 illustrates that DACAD (2024) generally outperforms the other models, especially on the MSL, SMAP, and SMD datasets, although results for DACAD on SWaT are not available. CARLA (2023) and TimesNet (2023) also show strong performance across these datasets. In contrast, older models such as DAGMM (2018), LSTM-VAE (2018), and OmniAnomaly (2018) generally exhibit lower scores than the more recent models. The trend is clear: newer models tend to achieve higher $F1$ and $F1_{PA}$ scores, indicating steady advances in anomaly detection techniques over time.
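For clarity, $F1_{PA}$ applies point adjustment before computing $F1$: if any point inside a ground-truth anomaly segment is flagged, every point in that segment is treated as detected, a convention popularised by [184]. A minimal sketch of the adjustment step (illustrative names):

import numpy as np

def point_adjust(pred: np.ndarray, label: np.ndarray) -> np.ndarray:
    """Point adjustment: mark a whole ground-truth anomaly segment as
    detected if any point inside it was flagged. Arrays are 0/1."""
    adjusted = pred.copy()
    t = 0
    while t < len(label):
        if label[t] == 1:                        # start of an anomaly segment
            end = t
            while end < len(label) and label[end] == 1:
                end += 1
            if adjusted[t:end].any():            # any hit inside the segment
                adjusted[t:end] = 1
            t = end
        else:
            t += 1
    return adjusted

$F1$ is then computed pointwise on the adjusted predictions, which is why $F1_{PA}$ is typically much higher than the plain $F1$ score.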
[Fig. 12: four panels (MSL, SMAP, SMD, SWaT), each plotting $F1$ and $F1_{PA}$ scores on a 0–1 axis for DAGMM (2018), LSTM-VAE (2018), OmniAnomaly (2018), MSCRED (2019), THOC (2020), USAD (2020), MTAD-GAT (2020), GDN (2021), AnomalyTransformer (2021), TranAD (2022), TS2Vec (2022), DCdetector (2023), TimesNet (2023), CARLA (2023), and DACAD (2024).]
Fig. 12. $F1$ and $F1_{PA}$ results for 15 state-of-the-art TSAD models on the four most commonly used MTS datasets.
One of the most challenging tasks in transportation is forecasting traffic speed. Using a traffic prediction system prior to travel in urban areas can help drivers avoid potential congestion and reduce travel time. GTransformer [121] studies how GNNs can be combined with attention mechanisms to improve traffic prediction accuracy, and TH-GAT [87] is a temporal hierarchical graph attention network designed for the same purpose.
D.7 Aerospace
Due to the complexity and cost of spacecraft, failure to detect hazards during flight can lead to serious or even catastrophic consequences. In [130], a transformer-based model with two novel components is presented: an attention mechanism that updates timestamps concurrently and a masking strategy that detects anomalies in advance. Testing was conducted on NASA telemetry datasets.
Monitoring and diagnosing the health of liquid rocket engines (LREs) is a central concern for spacecraft and vehicle safety, particularly for crewed launches. An engine failure directly causes the failure of the space launch, leading to irreparable losses. To achieve reliable and automatic anomaly detection for large equipment such as LREs with multisource data, Feng et al. [61] propose a multimodal unsupervised method that can handle missing sources.
D.9 Energy
Purification and refinement processes inevitably affect various petroleum products. In this context, an LSTM-based approach [63] is employed to monitor and detect faults in a multivariate industrial time series that includes signals from the sensors and control systems of a gasoil plant heating loop (GHL). Likewise, Wen and Keyes
[180] use a CNN within a transfer learning framework to detect time series anomalies while mitigating data sparsity problems. The results were demonstrated on the GHL dataset [63], which contains data on cyber-attacks against utility systems.
The use of phasor measurement units (PMUs) by utilities for power system monitoring increases the potential for cyberattacks. In [14], anomalies are detected, before each state estimation cycle, in MTS data generated by PMU data packets corresponding to different events, such as line faults, trips, and generation and load changes. This can help operators identify targeted cyber-attacks and make better decisions to ensure grid reliability.
The management of energy in buildings can improve energy efficiency, increase equipment life, and reduce energy
consumption and operational costs. Fan et al. [59] propose an autoencoder-based ensemble method for the analysis of
energy time series in buildings and the detection of unexpected consumption patterns and excessive waste.
D.11 Robotics
In the modern manufacturing industry, as production lines become increasingly dependent on robots, the failure of any robot can plunge production into a disastrous situation, and some faults are difficult to identify. To detect incipient failures before robots stop working completely, a real-time method is required that continuously tracks robots by collecting time series data from them. A sliding-window convolutional variational autoencoder (SWCVAE) is proposed in [31] to detect anomalies in MTS both spatially and temporally in an unsupervised manner.
In addition, many people with disabilities require physical assistance from caregivers, and robots can substitute for some human caregiving, helping with daily living activities such as feeding and shaving. By detecting and stopping abnormal task execution during assistance, potential hazards can be prevented or reduced [143].
Other work detects collective faults in WWTPs (wastewater treatment plants), outperforming earlier methods. Moreover, energy management systems must manage gas storage and transportation continuously in order to reduce expenses and safeguard the environment. In [162], an end-to-end CNN-based model implements an internal-flow-noise leak detector for pipes.