DCdetector: Dual Attention Contrastive Representation Learning for Time Series Anomaly Detection
classified as statistical, classic machine learning, and deep learning-based methods [6, 56]. Machine learning methods, especially deep learning-based methods, have succeeded greatly due to their powerful representation advantages. Most supervised and semi-supervised methods [14, 18, 48, 50, 83, 86] cannot handle the challenge of limited labeled data, especially when anomalies are dynamic and new anomalies that have never been observed before may occur. Unsupervised methods are popular because they place no strict requirements on labeled data; they include one-class classification-based, probabilistic-based, distance-based, forecasting-based, and reconstruction-based approaches [11, 24, 42, 56, 63, 81, 88].

Reconstruction-based methods learn a model to reconstruct normal samples, so the instances that the learned model fails to reconstruct are anomalies. Such approaches are developing rapidly due to their power in handling complex data when combined with different machine learning models, and due to their interpretability: the flagged instances behave unusually. However, it is usually challenging to learn a well-reconstructed model for normal data without being obstructed by anomalies. The situation is even worse in time series anomaly detection, as the number of anomalies is unknown and normal and abnormal points may appear in one instance, which makes it harder to learn a clean, well-reconstructed model for normal points.

Recently, contrastive representation learning has attracted attention due to its diverse designs and outstanding performance on downstream tasks in the computer vision field [13, 15, 29, 78]. However, the effectiveness of contrastive representation learning still needs to be explored in the time-series anomaly detection area. In this paper, we propose a Dual attention Contrastive representation learning anomaly detector called DCdetector to handle the challenges in time series anomaly detection. The key idea of DCdetector is that normal time series points share a latent pattern, which means normal points have strong correlations with other points, whereas anomalies do not (i.e., they have weak correlations with others). Learning consistent representations from different views is therefore hard for anomalies but easy for normal points. The primary motivation is that if the representations of normal and abnormal points are distinguishable, we can detect anomalies without a highly qualified reconstruction model.

Specifically, we propose a contrastive structure with two branches and a dual attention module, where the two branches share network weights. The model is trained based on the similarity of the two branches; as normal points are the majority, the representation inconsistency of anomalies will be conspicuous. Thus, the representation difference between normal and abnormal data is enlarged without a highly qualified reconstruction model. To capture the temporal dependency in time series, DCdetector utilizes patching-based attention networks as the basic module. A multi-scale design is proposed to reduce information loss during patching. DCdetector represents all channels efficiently with a channel independence design for multivariate time series. In particular, DCdetector does not require prior knowledge about anomalies and thus can handle new outliers never observed before. The main contributions of DCdetector are summarized as follows:
• Architecture: A contrastive learning-based dual-branch attention structure is designed to learn a permutation invariant representation that enlarges the representation differences between normal points and anomalies. Also, channel independence patching is proposed to enhance local semantic information in time series, and a multi-scale design is proposed in the attention module to reduce information loss during patching.
• Optimization: An effective and robust loss function is designed based on the similarity of the two branches. Note that the model is trained purely contrastively, without a reconstruction loss, which reduces distractions from anomalies.
• Performance & Justification: DCdetector achieves performance comparable or superior to state-of-the-art methods on six multivariate and one univariate time series anomaly detection benchmark datasets. We also provide a justification discussion to explain how our model avoids collapse without negative samples.

2 RELATED WORK
In this section, we review the literature related to this work, covering anomaly detection and contrastive representation learning.

Time Series Anomaly Detection. There are various approaches to detecting anomalies in time series, including statistical methods, classical machine learning methods, and deep learning methods [60]. Statistical methods include moving averages, exponential smoothing [53], and the autoregressive integrated moving average (ARIMA) model [9]. Machine learning methods include clustering algorithms such as k-means [34] and density-based methods, as well as classification algorithms such as decision trees [35, 46] and support vector machines (SVMs). Deep learning methods include autoencoders, variational autoencoders (VAEs) [50, 58], and recurrent neural networks (RNNs) [12, 63] such as long short-term memory (LSTM) networks [81]. Recent works in time series anomaly detection also include generative adversarial network (GAN)-based methods [16, 41, 87] and deep reinforcement learning (DRL)-based methods [30, 80]. In general, deep learning methods are more effective at identifying anomalies in time series data, especially when the data is high-dimensional or non-linear.

From another view, time series anomaly detection models can be roughly divided into two categories: supervised and unsupervised algorithms. Supervised methods can perform better when anomaly labels are available or affordable; such methods can be dated back to AutoEncoder [58], LSTM-VAE [50], Spectral Residual (SR) [54], RobustTAD [26], and so on. On the other hand, unsupervised anomaly detection algorithms can be applied in cases where anomaly labels are difficult to obtain. Such versatility results in the community's long-lasting interest in developing new unsupervised time-series anomaly detection methods, including DAGMM [88], OmniAnomaly [63], GDN [24], RDSSM [42], and so on. Unsupervised deep learning methods have been widely studied in time series anomaly detection, mainly for two reasons. First, it is usually hard or unaffordable to obtain labels for all time series sequences in real-world applications. Second, deep models are powerful in representation learning and have the potential to achieve decent detection accuracy under the unsupervised setting. Most of them are based on a reconstruction approach, where a well-reconstructed model is learned for normal points; then, the instances failing to be reconstructed are anomalies. Recently, some
self-supervised learning-based methods have been proposed to enhance the generalization ability in unsupervised anomaly detection [33, 84, 86].

Contrastive Representation Learning. The goal of contrastive representation learning is to learn an embedding space in which similar data samples stay close to each other while dissimilar ones are far apart. The idea of contrastive learning can be traced back to InstDisc [69]. Classical contrastive models create <positive, negative> sample pairs to learn a representation in which positive samples are near each other (pulled together) and far from negative samples (pushed apart) [13, 15, 29, 78]. Their key design questions are how to define negative samples and how to deal with the requirements of high computation power and large batches [37]. On the other hand, BYOL [28] and SimSiam [17] get rid of negative samples entirely, and such a simple siamese model (SimSiam) achieves performance comparable with other, more complex state-of-the-art architectures.

It is illuminating to enlarge the distance between the two types of samples using a contrastive design. We try to distinguish time series anomalies from normal points with a well-designed multi-scale patching-based attention module. Moreover, our DCdetector is also free from negative samples and does not fall into a trivial solution even without the "stop gradient".

In some way, the underlying inductive bias we use here is similar to what Anomaly Transformer explored [71]: anomalies have less connection or interaction with the whole series than their adjacent points. The Anomaly Transformer detects anomalies by the association discrepancy between a learned Gaussian kernel and the attention weight distribution. In contrast, we propose DCdetector, which achieves a similar goal in a much more general and concise way with a dual-attention self-supervised contrastive-type structure.

To better position our work in the landscape of time series anomaly detection, we give a brief comparison of three approaches. Note that Anomaly Transformer is a representative of a series of explicit association modeling works [7, 19, 23, 85], not the only one; we merely use it for a direct comparison with the closest work. Figure 1 shows the architecture comparison of the three approaches. The reconstruction-based approach (Figure 1(a)) uses a representation neural network to learn the pattern of normal points and perform reconstruction. Anomaly Transformer (Figure 1(b)) takes advantage of the observation that it is difficult to build nontrivial associations from abnormal points to the whole series. Thereby, the prior discrepancy is learned with a Gaussian kernel, and the association discrepancy is learned with a transformer module. MinMax association learning is also critical for Anomaly Transformer, and a reconstruction loss is included. In contrast, the proposed DCdetector (Figure 1(c)) is concise, in the sense that it needs neither a specially designed Gaussian kernel, nor a MinMax learning strategy, nor a reconstruction loss. DCdetector mainly leverages the designed contrastive learning-based dual-branch attention for discrepancy learning of anomalies in different views to enlarge the differences between anomalies and normal points. This simplicity and effectiveness contribute to DCdetector's versatility.

3 METHODOLOGY
Consider a multivariate time-series sequence of length $T$:
$$\mathcal{X} = (x_1, x_2, \ldots, x_T),$$
where each data point $x_t \in \mathbb{R}^d$ is acquired at a certain timestamp $t$ from industrial sensors or machines, and $d$ is the data dimensionality, e.g., the number of sensors or machines. Our problem can be stated as follows: given an input time-series sequence $\mathcal{X}$, for another unknown test sequence $\mathcal{X}_{test}$ of length $T'$ with the same modality as the training sequence, we want to predict $\mathcal{Y}_{test} = (y_1, y_2, \ldots, y_{T'})$. Here $y_t \in \{0, 1\}$, where 1 denotes an anomalous data point and 0 denotes a normal one.

As mentioned previously, representation learning is a powerful tool to handle the complex patterns of time series. Due to the high cost of obtaining labels in practice, unsupervised and self-supervised methods are more popular. The critical issue in time series anomaly detection is to distinguish anomalies from normal points. Learning representations in which the two demonstrate wide disparities, without relying on anomaly labels, is promising. We amplify the advantages of contrastive representation learning with a dual attention structure.

3.1 Overall Architecture
Figure 2 shows the overall architecture of DCdetector, which consists of four main components: the Forward Process module, the Dual Attention Contrastive Structure module, the Representation Discrepancy module, and the Anomaly Criterion module.

The input multivariate time series in the Forward Process module is normalized by an instance normalization [38, 66] module. The inputs to the instance normalization all come from the independent channels themselves. It can be seen as a consolidation and adjustment of global information, and a more stable approach to the training process.
(Figure 2 here: (a) Backbone, (b) Dual Attention Contrastive Structure, (c) Dual Attention, showing the patch-wise and in-patch attention branches with shared weights, upsampling, and the representation discrepancy.)
Figure 2: The workflow of the DCdetector framework. DCdetector consists of four main components: Forward Process module,
Dual Attention Contrastive Structure module, Representation Discrepancy module, and Anomaly Criterion module.
(Figure 3 here: channel independence, patching, self-attention, and concatenation; input $\mathcal{X} \in \mathbb{R}^{T \times d}$, output $\mathcal{X}' \in \mathbb{R}^{N \times d}$.)
Figure 3: Basic patching attention with channel independence. Each channel in the multivariate time series input is considered
as a single time series and divided into patches. Each channel shares the same self-attention network, and the representation
results are concatenated as the final output.
The channel independence assumption has been proven helpful in multivariate time series forecasting tasks [43, 59] for reducing parameter numbers and overfitting issues. Our DCdetector follows this channel independence setting to simplify the attention network with patching.

More specifically, the basic patching attention with channel independence is shown in Figure 3. Each channel in the multivariate time series input ($\mathcal{X} \in \mathbb{R}^{T \times d}$) is considered as a single time series ($\mathcal{X}_i \in \mathbb{R}^{T \times 1}$, $i = 1, 2, \ldots, d$) and divided into patches. Each channel shares the same self-attention network, and the representation results ($\mathcal{X}'_i \in \mathbb{R}^{N \times 1}$, $i = 1, 2, \ldots, d$) are concatenated as the final output ($\mathcal{X}' \in \mathbb{R}^{N \times d}$). In the implementation phase, running a sliding window over the time series, as is widely done in time series anomaly detection tasks [61, 71], has little influence on the main design. More implementation details are left to the experiment section.
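To make the channel-independence and patching step concrete, here is a minimal sketch in PyTorch, assuming non-overlapping patches and a window length that divides evenly by the patch size; the function name and shapes are illustrative, not the released implementation.

```python
import torch

def channel_independent_patching(x, patch_size):
    """Split a multivariate window into per-channel patches.

    x: tensor of shape (batch, T, d), a sliding window of the series.
    Returns a tensor of shape (batch * d, num_patches, patch_size),
    i.e., every channel is treated as its own univariate series.
    """
    batch, T, d = x.shape
    assert T % patch_size == 0, "window length must be divisible by patch size"
    num_patches = T // patch_size
    # (batch, T, d) -> (batch, d, T): one row per channel
    x = x.permute(0, 2, 1)
    # fuse channels into the batch dimension (channel independence)
    return x.reshape(batch * d, num_patches, patch_size)

# toy usage: a window of length 60 with 5 channels and patch size 5
window = torch.randn(8, 60, 5)
patches = channel_independent_patching(window, patch_size=5)
print(patches.shape)  # torch.Size([40, 12, 5])
```

Fusing the channel dimension into the batch dimension is what lets every channel share one attention network while keeping the parameter count independent of $d$.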
The Dual Attention Contrastive Structure module is critical in our design. It learns the representation of the inputs in different views. The insight is that most normal points share the same latent pattern even in different views (a strong correlation is not easy to destroy). However, as anomalies are rare and do not have explicit patterns, it is hard for them to share latent modes with normal points or among themselves (i.e., anomalies have a weak correlation with other points). Thus, the difference between representations from different views will be slight for normal points and large for anomalies, and we can distinguish anomalies from normal points with a well-designed Representation Discrepancy criterion. The details of the Dual Attention Contrastive Structure and the Representation Discrepancy are given in Section 3.2 and Section 3.3, respectively.

As for the Anomaly Criterion, we calculate anomaly scores based on the discrepancy between the two representations and use a prior threshold for anomaly detection. The details are given in Section 3.4.

3.2 Dual Attention Contrastive Structure
In DCdetector, we propose a contrastive representation learning structure with dual attention to obtain representations of the input time series from different views. Concretely, with the patching operation, DCdetector takes patch-wise and in-patch representations as two views. Note that this differs from traditional contrastive learning, where original and augmented data are considered as two views of the original data. Moreover, DCdetector does not construct <positive, negative> pairs like typical contrastive methods [29, 69]. Instead, its basic setting is similar to contrastive methods that only use positive samples [17, 28].

3.2.1 Dual Attention. As shown in Figure 2, the input time series $\mathcal{X} \in \mathbb{R}^{T \times d}$ is patched as $\mathcal{X} \in \mathbb{R}^{P \times N \times d}$, where $P$ is the size of the patches and $N$ is the number of patches. Then, we fuse the channel information with the batch dimension, and the input size becomes $\mathcal{X} \in \mathbb{R}^{P \times N}$. With such patched time series, DCdetector learns representations in the patch-wise and in-patch views with self-attention networks. Our dual attention can be stacked for $L$ layers; for simplicity, we use only one of these layers as an example.

For the patch-wise representation, a single patch is considered as a unit, and the dependencies among patches are modeled by a multi-head self-attention network (named patch-wise attention). In detail, an embedding operation is applied in the patch_size ($P$) dimension, and the shape of the embedding is $\mathcal{X}_{\mathcal{N}} \in \mathbb{R}^{N \times d_{model}}$. Then, we adopt multi-head attention weights to calculate the patch-wise representation. First, initialize the query and key:
$$\mathcal{Q}_{\mathcal{N}_i}, \mathcal{K}_{\mathcal{N}_i} = W^{\mathcal{Q}}_i \mathcal{X}_{\mathcal{N}_i}, W^{\mathcal{K}}_i \mathcal{X}_{\mathcal{N}_i}, \quad 1 \le i \le H, \qquad (1)$$
where $\mathcal{Q}_{\mathcal{N}_i}, \mathcal{K}_{\mathcal{N}_i} \in \mathbb{R}^{N \times \frac{d_{model}}{H}}$ denote the query and key, respectively, $W^{\mathcal{Q}}_i, W^{\mathcal{K}}_i \in \mathbb{R}^{\frac{d_{model}}{H} \times \frac{d_{model}}{H}}$ represent the learnable parameter matrices of $\mathcal{Q}_{\mathcal{N}_i}, \mathcal{K}_{\mathcal{N}_i}$, and $H$ is the head number. Then, compute the attention weights:
$$Attn_{\mathcal{N}_i} = \mathrm{Softmax}\left(\frac{\mathcal{Q}_{\mathcal{N}_i} \mathcal{K}_{\mathcal{N}_i}^{T}}{\sqrt{d_{model}}}\right), \qquad (2)$$
where the $\mathrm{Softmax}(\cdot)$ function normalizes the attention weights. Finally, concatenate the multi-head results and obtain the final patch-wise representation $Attn_{\mathcal{N}}$:
$$Attn_{\mathcal{N}} = \mathrm{Concat}(Attn_{\mathcal{N}_1}, \cdots, Attn_{\mathcal{N}_H})\, W^{O}_{\mathcal{N}}, \qquad (3)$$
where $W^{O}_{\mathcal{N}} \in \mathbb{R}^{d_{model} \times d_{model}}$ is a learnable parameter matrix.

Similarly, for the in-patch representation, the dependencies of points within the same patch are captured by a multi-head self-attention network (called in-patch attention). Note that the patch-wise attention network shares weights with the in-patch attention network. Specifically, another embedding operation is applied in the patch_number ($N$) dimension, and the shape of the embedding is $\mathcal{X}_{\mathcal{P}} \in \mathbb{R}^{P \times d_{model}}$. Then, we adopt multi-head attention weights to calculate the in-patch representation. First, initialize the query and key:
$$\mathcal{Q}_{\mathcal{P}_i}, \mathcal{K}_{\mathcal{P}_i} = W^{\mathcal{Q}}_i \mathcal{X}_{\mathcal{P}_i}, W^{\mathcal{K}}_i \mathcal{X}_{\mathcal{P}_i}, \quad 1 \le i \le H, \qquad (4)$$
where $\mathcal{Q}_{\mathcal{P}_i}, \mathcal{K}_{\mathcal{P}_i} \in \mathbb{R}^{P \times \frac{d_{model}}{H}}$ denote the query and key, respectively, and $W^{\mathcal{Q}}_i, W^{\mathcal{K}}_i \in \mathbb{R}^{\frac{d_{model}}{H} \times \frac{d_{model}}{H}}$ represent the learnable parameter matrices of $\mathcal{Q}_{\mathcal{P}_i}, \mathcal{K}_{\mathcal{P}_i}$. Then, compute the attention weights:
$$Attn_{\mathcal{P}_i} = \mathrm{Softmax}\left(\frac{\mathcal{Q}_{\mathcal{P}_i} \mathcal{K}_{\mathcal{P}_i}^{T}}{\sqrt{d_{model}}}\right), \qquad (5)$$
where the $\mathrm{Softmax}(\cdot)$ function normalizes the attention weights. Finally, concatenate the multi-head results and obtain the final in-patch representation $Attn_{\mathcal{P}}$:
$$Attn_{\mathcal{P}} = \mathrm{Concat}(Attn_{\mathcal{P}_1}, \cdots, Attn_{\mathcal{P}_H})\, W^{O}_{\mathcal{P}}, \qquad (6)$$
where $W^{O}_{\mathcal{P}} \in \mathbb{R}^{d_{model} \times d_{model}}$ is a learnable parameter matrix.

Note that $W^{\mathcal{Q}}_i$ and $W^{\mathcal{K}}_i$ are the weights shared between the in-patch attention representation network and the patch-wise attention representation network.
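A minimal single-head, single-layer sketch of the two attention views defined in Eqs. (1)-(6) is given below, assuming the attention weight matrices themselves serve as the representations; the module and variable names are ours, and the shared query/key projections mirror the shared-weight note above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionSketch(nn.Module):
    """Single-head, single-layer sketch of patch-wise vs. in-patch attention.

    Both views reuse the same query/key projections (shared weights); the
    returned attention weight matrices act as the two representations.
    """
    def __init__(self, patch_size, num_patches, d_model):
        super().__init__()
        # embedding along the patch_size dimension -> tokens are patches
        self.embed_patchwise = nn.Linear(patch_size, d_model)
        # embedding along the patch_number dimension -> tokens are points in a patch
        self.embed_inpatch = nn.Linear(num_patches, d_model)
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # shared across views
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** 0.5

    def forward(self, x):
        # x: (B, num_patches, patch_size), e.g. the output of the patching sketch
        x_n = self.embed_patchwise(x)                 # (B, N, d_model)
        x_p = self.embed_inpatch(x.transpose(1, 2))   # (B, P, d_model)
        attn_n = F.softmax(self.w_q(x_n) @ self.w_k(x_n).transpose(1, 2) / self.scale, dim=-1)
        attn_p = F.softmax(self.w_q(x_p) @ self.w_k(x_p).transpose(1, 2) / self.scale, dim=-1)
        return attn_n, attn_p                         # (B, N, N) patch-wise, (B, P, P) in-patch

# toy usage: 12 patches of size 5 from a window of 60 points
attn_n, attn_p = DualAttentionSketch(5, 12, 64)(torch.randn(40, 12, 5))
print(attn_n.shape, attn_p.shape)  # torch.Size([40, 12, 12]) torch.Size([40, 5, 5])
```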
(Figure 4 here: patch-wise and in-patch up-sampling of patches $P_i$ and points $p_i$.)
Figure 4: A simple example of how up-sampling is done. For the patch-wise branch, repeating is done within patches (from patch to points). For the in-patch branch, repeating is done from "one" patch to the full number of patches (from points to patches).

3.2.2 Up-sampling and Multi-scale Design. Although the patching design benefits from gaining local semantic information, patch-wise attention ignores the relevance among points within a patch, and in-patch attention ignores the relevance among patches. To compare the results of the two representation networks, we need to do up-sampling first. For the patch-wise branch, as we only have the dependencies among patches, repeating is done inside patches (i.e., from patch to points) for up-sampling, and we obtain the final patch-wise representation $\mathcal{N}$. For the in-patch branch, as only the dependencies among the points of a patch are gained, repeating is done from "one" patch to the full number of patches, and we obtain the final in-patch representation $\mathcal{P}$.

A simple example is shown in Figure 4, where a patch is denoted as $P_i$ and a point is denoted as $p_i$. Such patching and repeating up-sampling operations inevitably lead to information loss. To better keep the information from the original data, DCdetector introduces a multi-scale design for the patching representation and up-sampling. The final representation concatenates results at different scales (i.e., patch sizes). Specifically, we can preset a list of patch sizes to perform parallel patching and the computation of the dual attention representations simultaneously. After up-sampling, the parts for each patch size are summed to obtain the final patch-wise representation $\mathcal{N}$ and in-patch representation $\mathcal{P}$:
$$\mathcal{N} = \sum_{Patch\ list} \mathrm{Upsampling}(Attn_{\mathcal{N}}); \qquad (7)$$
$$\mathcal{P} = \sum_{Patch\ list} \mathrm{Upsampling}(Attn_{\mathcal{P}}). \qquad (8)$$
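The repetition-based up-sampling behind Eqs. (7)-(8) can be sketched as follows; using `repeat_interleave` for the patch-wise branch and tiling for the in-patch branch is our assumption about how "from patch to points" and "from one patch to the full number of patches" could be realized.

```python
import torch

def upsample_patch_wise(attn_n, patch_size):
    # attn_n: (B, N, N) patch-to-patch weights; repeating each entry
    # patch_size times along both axes spreads it back over the points.
    return attn_n.repeat_interleave(patch_size, dim=1).repeat_interleave(patch_size, dim=2)

def upsample_in_patch(attn_p, num_patches):
    # attn_p: (B, P, P) point-to-point weights within a patch; tiling it
    # num_patches times along both axes covers the whole window.
    return attn_p.repeat(1, num_patches, num_patches)

# toy usage for a window of 60 points split into 12 patches of size 5:
attn_n = torch.softmax(torch.randn(2, 12, 12), dim=-1)
attn_p = torch.softmax(torch.randn(2, 5, 5), dim=-1)
n_up = upsample_patch_wise(attn_n, patch_size=5)   # (2, 60, 60)
p_up = upsample_in_patch(attn_p, num_patches=12)   # (2, 60, 60)
# multi-scale: compute (n_up, p_up) for each patch size in a preset list
# and sum them to obtain the final N and P representations.
print(n_up.shape, p_up.shape)
```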
3.2.3 Contrastive Structure. The patch-wise and in-patch branches output representations of the same input time series in two different views. As shown in Figure 2(c), the patch-wise sample representation learns a weighted combination between sample points at the same position in each patch. The in-patch sample representation, on the other hand, learns a weighted combination between points within the same patch. We can treat these two representations as permutated multi-view representations. The key inductive bias we exploit here is that normal points can maintain their representation under permutations while anomalies cannot. From such dual attention non-negative contrastive learning, we want to learn a permutation invariant representation. The learning details are given in Section 3.3.

3.3 Representation Discrepancy
With the dual attention contrastive structure, representations from two views (the patch-wise branch and the in-patch branch) are obtained. We formalize a loss function based on the Kullback–Leibler divergence (KL divergence) to measure the similarity of these two representations. The intuition is that, as anomalies are rare and normal points share latent patterns, the representations of the same input from the two views should be similar.

3.3.1 Loss function definition. Define the similarity metric $\mathcal{D}$ of the two output representation matrices $\mathcal{P}, \mathcal{N}$ as $\mathcal{D}(\mathcal{P}, \mathcal{N}) = KL(\mathcal{P} \,\|\, \mathcal{N})$, where $KL(\cdot\|\cdot)$ is the KL divergence distance. The loss function of DCdetector is then defined as
$$\mathcal{L}\{\mathcal{P}, \mathcal{N}; \mathcal{X}\} = \frac{1}{2}\,\mathcal{D}(\mathcal{P}, \mathrm{Stopgrad}(\mathcal{N})) + \frac{1}{2}\,\mathcal{D}(\mathcal{N}, \mathrm{Stopgrad}(\mathcal{P})), \qquad (9)$$
where $\mathcal{X}$ is the input time series, and $\mathcal{P}$ and $\mathcal{N}$ are the representation result matrices of the in-patch branch and the patch-wise branch, respectively. The stop-gradient (labeled as 'Stopgrad') operation is used in our loss function to train the two branches asynchronously.
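A compact sketch of Eq. (9), assuming the up-sampled representations are row-stochastic (softmax-normalized) attention maps; `detach()` plays the role of the stop-gradient operation.

```python
import torch

def kl_div(p, q, eps=1e-8):
    # KL(p || q) per time point for row-stochastic attention maps.
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=-1)

def dcdetector_loss(p_repr, n_repr):
    """L = 0.5 * D(P, stopgrad(N)) + 0.5 * D(N, stopgrad(P)), as in Eq. (9).

    p_repr, n_repr: up-sampled in-patch and patch-wise representations of
    shape (batch, T, T), each row normalized by a softmax.
    """
    loss_p = kl_div(p_repr, n_repr.detach()).mean()  # stop gradient on the patch-wise branch
    loss_n = kl_div(n_repr, p_repr.detach()).mean()  # stop gradient on the in-patch branch
    return 0.5 * loss_p + 0.5 * loss_n

# toy usage
p = torch.softmax(torch.randn(2, 60, 60), dim=-1)
n = torch.softmax(torch.randn(2, 60, 60), dim=-1)
print(dcdetector_loss(p, n))
```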
Unlike most anomaly detection works, which are based on a reconstruction framework [56], DCdetector is a self-supervised framework based on representation learning, and no reconstruction part is utilized in our model. There is no doubt that reconstruction helps to detect
the anomalies which do not behave as expected. However, it is not easy to build a suitable encoder and decoder that 'reconstruct' the time series as they are expected to be, given the interference of anomalies. Moreover, the representation ability is restricted, as the latent pattern information is not fully considered.

3.3.2 Discussion about Model Collapse. Interestingly, with only single-type inputs (that is, no negative samples included), our DCdetector model does not fall into a trivial solution (model collapse). SimSiam [17] gives the main credit for avoiding model collapse to the stop gradient operation in their setting. However, we find that DCdetector still works without the stop gradient operation, although, with the same parameters, the no-stop-gradient version does not attain the best performance. Details are shown in the ablation study (Section 4.5.1).

A possible explanation is that our two branches are totally asymmetric. Following the unified perspective proposed in [82], consider the output vector of a branch as $Z$, which can be decomposed into two parts $o$ and $r$ as $Z = o + r$, where $o = \mathbb{E}[Z]$ is the center vector, defined as the average of $Z$ over the whole representation space, and $r$ is the residual vector. When collapse happens, all vectors $Z$ fall into the center vector $o$, and $o$ dominates over $r$. With the two branches noted as $Z_p = o_p + r_p$ and $Z_n = o_n + r_n$, if the branches are symmetric, i.e., $o_p = o_n$, then the distance between them is $Z_p - Z_n = r_p - r_n$. As $r_p$ and $r_n$ come from the same input example, this leads to collapse. Fortunately, the two branches in DCdetector are asymmetric, so it is not easy for $o_p$ to equal $o_n$ even when $r_p$ and $r_n$ are similar. Thus, due to our asymmetric design, DCdetector hardly falls into a trivial solution.

3.4 Anomaly Criterion
With the insight that normal points usually share latent patterns (with strong correlations among them), the distances between the representation results from different views are smaller for normal points than for anomalies. The final anomaly score of $\mathcal{X} \in \mathbb{R}^{T \times d}$ is defined as
$$\mathrm{AnomalyScore}(\mathcal{X}) = \frac{1}{2}\,\mathcal{D}(\mathcal{P}, \mathcal{N}) + \frac{1}{2}\,\mathcal{D}(\mathcal{N}, \mathcal{P}). \qquad (10)$$
It is a point-wise anomaly score, and anomalies result in higher scores than normal points.

Based on the point-wise anomaly score, a hyperparameter threshold $\delta$ is used to decide whether a point is an anomaly (1) or not (0). If the score exceeds the threshold, the output $\mathcal{Y}$ is an anomaly. That is,
$$\mathcal{Y}_i = \begin{cases} 1: \text{anomaly} & \mathrm{AnomalyScore}(\mathcal{X}_i) \ge \delta \\ 0: \text{normal} & \mathrm{AnomalyScore}(\mathcal{X}_i) < \delta \end{cases} \qquad (11)$$
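To make Eqs. (10)-(11) concrete, the following hedged sketch scores each time point by the symmetric KL term and thresholds it; the epsilon smoothing and the example threshold value are illustrative choices.

```python
import torch

def anomaly_score(p_repr, n_repr, eps=1e-8):
    """Point-wise score 0.5*KL(P||N) + 0.5*KL(N||P), following Eq. (10)."""
    kl_pn = (p_repr * (torch.log(p_repr + eps) - torch.log(n_repr + eps))).sum(-1)
    kl_np = (n_repr * (torch.log(n_repr + eps) - torch.log(p_repr + eps))).sum(-1)
    return 0.5 * kl_pn + 0.5 * kl_np               # shape (batch, T)

def detect(p_repr, n_repr, delta=1.0):
    """Eq. (11): label a point anomalous (1) when its score reaches the threshold delta."""
    return (anomaly_score(p_repr, n_repr) >= delta).long()

# toy usage on random softmax-normalized representations
p = torch.softmax(torch.randn(1, 60, 60), dim=-1)
n = torch.softmax(torch.randn(1, 60, 60), dim=-1)
print(detect(p, n, delta=1.0)[0, :10])
```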
4 EXPERIMENTS
4.1 Benchmark Datasets
We adopt seven representative benchmarks from five real-world applications to evaluate DCdetector: (1) MSL (Mars Science Laboratory dataset) is collected by NASA and shows the condition of the sensor and actuator data from the Mars rover [32]. (2) SMAP (Soil Moisture Active Passive dataset) is also collected by NASA and presents the soil samples and telemetry information used by the Mars rover [32]. Compared with MSL, SMAP has more point anomalies. (3) PSM (Pooled Server Metrics dataset) is a public dataset from eBay server machines with 25 dimensions [1]. (4) SMD (Server Machine Dataset) is a five-week-long dataset collected from an internet company compute cluster, which stacks accessed traces of the resource utilization of 28 machines [63]. (5) NIPS-TS-SWAN is an openly accessible, comprehensive multivariate time series benchmark extracted from solar photospheric vector magnetograms in the Spaceweather HMI Active Region Patch series [5, 40]. (6) NIPS-TS-GECCO is a drinking water quality dataset for the 'Internet of Things', published at the 2018 Genetic and Evolutionary Computation Conference [40, 47]. Besides the above multivariate time series datasets, we also test univariate time series datasets. (7) UCR is provided by the Multi-dataset Time Series Anomaly Detection Competition of KDD 2021 and contains 250 sub-datasets from various natural sources [21, 36]. It is a univariate time series dataset of subsequence anomalies; in each time series, there is one and only one anomaly. More details of the seven benchmark datasets are summarized in Table 8 in Appendix B.

4.2 Baselines and Evaluation Criteria
We compare our model with 26 baselines for a comprehensive evaluation, including the reconstruction-based models: AutoEncoder [58], LSTM-VAE [50], OmniAnomaly [63], BeatGAN [87], InterFusion [45], Anomaly Transformer [71]; the autoregression-based models: VAR [4], Autoregression [55], LSTM-RNN [8], LSTM [32], CL-MPPCA [64]; the density-estimation models: LOF [10], MMPCACD [72], DAGMM [88]; the clustering-based methods: Deep-SVDD [57], THOC [61], ITAD [62]; the classic methods: OCSVM [65], OCSVM-based subsequence clustering (OCSVM*), IForest [46], IForest-based subsequence clustering (IForest*), Gradient boosting regression (GBRT) [25]; and the change point detection and time series segmentation methods: BOCPD [2], U-Time [52], TS-CP2 [22]. We also compare our model with a time-series subsequence anomaly detection algorithm, Matrix Profile [79].

Besides, we adopt various evaluation criteria for a comprehensive comparison, including the commonly used evaluation measures: accuracy, precision, recall, and F1-score; and the recently proposed evaluation measures: the affiliation precision/recall pair [31] and Volume Under the Surface (VUS) [49]. F1-score is the most widely used metric but does not consider anomaly events. Affiliation precision and recall are calculated based on the distance between ground truth and prediction events. The VUS metric takes anomaly events into consideration based on the receiver operating characteristic (ROC) curve. Different metrics provide different evaluation views. We employ the commonly used adjustment technique for a fair comparison [61, 63, 70, 71], according to which all abnormalities in an abnormal segment are considered to have been detected if a single time point in that segment is identified.
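The point-adjustment convention can be sketched as below: if any point inside a ground-truth anomaly segment is flagged, the whole segment counts as detected. This is a common reference pattern written under our own assumptions, not the exact evaluation code of the paper.

```python
import numpy as np

def point_adjust(pred, gt):
    """Expand detections to whole ground-truth segments.

    pred, gt: 1-D 0/1 arrays of the same length.  If at least one point of a
    contiguous gt==1 segment is predicted as 1, the entire segment is set to 1
    in the returned prediction.
    """
    pred = pred.copy()
    t = 0
    while t < len(gt):
        if gt[t] == 1:
            start = t
            while t < len(gt) and gt[t] == 1:
                t += 1
            if pred[start:t].any():          # one hit adjusts the whole segment
                pred[start:t] = 1
        else:
            t += 1
    return pred

# toy usage: one detected point inside a 4-point anomaly segment
gt   = np.array([0, 0, 1, 1, 1, 1, 0, 0])
pred = np.array([0, 0, 0, 1, 0, 0, 0, 0])
print(point_adjust(pred, gt))   # [0 0 1 1 1 1 0 0]
```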
4.3 Implementation Details
We summarize all the default hyper-parameters of our implementation as follows. Our DCdetector model contains three encoder layers ($L = 3$). The dimension of the hidden state $d_{model}$ is 256, and the number of attention heads $H$ is 1 for simplicity. We select various patch size and window size options for different datasets, as shown in Table 8 in Appendix B. Our model defines an anomaly as a time point whose anomaly score exceeds a hyperparameter threshold $\delta$, whose default value is 1.
Table 1: Overall results on real-world multivariate datasets. Performance ranked from lowest to highest. The P, R and F1 are
the precision, recall and F1-score. All results are in %, the best ones are in Bold, and the second ones are underlined.
Dataset MSL SMAP PSM SMD
Metric P R F1 P R F1 P R F1 P R F1
LOF 47.72 85.25 61.18 58.93 56.33 57.60 57.89 90.49 70.61 56.34 39.86 46.68
OCSVM 59.78 86.87 70.82 53.85 59.07 56.34 62.75 80.89 70.67 44.34 76.72 56.19
U-Time 57.20 71.66 63.62 49.71 56.18 52.75 82.85 79.34 81.06 65.95 74.75 70.07
IForest 53.94 86.54 66.45 52.39 59.07 55.53 76.09 92.45 83.48 42.31 73.29 53.64
DAGMM 89.60 63.93 74.62 86.45 56.73 68.51 93.49 70.03 80.08 67.30 49.89 57.30
ITAD 69.44 84.09 76.07 82.42 66.89 73.85 72.80 64.02 68.13 86.22 73.71 79.48
VAR 74.68 81.42 77.90 81.38 53.88 64.83 90.71 83.82 87.13 78.35 70.26 74.08
MMPCACD 81.42 61.31 69.95 88.61 75.84 81.73 76.26 78.35 77.29 71.20 79.28 75.02
CL-MPPCA 73.71 88.54 80.44 86.13 63.16 72.88 56.02 99.93 71.80 82.36 76.07 79.09
TS-CP2 86.45 68.48 76.42 87.65 83.18 85.36 82.67 78.16 80.35 87.42 66.25 75.38
Deep-SVDD 91.92 76.63 83.58 89.93 56.02 69.04 95.41 86.49 90.73 78.54 79.67 79.10
BOCPD 80.32 87.20 83.62 84.65 85.85 85.24 80.22 75.33 77.70 70.9 82.04 76.07
LSTM-VAE 85.49 79.94 82.62 92.20 67.75 78.10 73.62 89.92 80.96 75.76 90.08 82.30
BeatGAN 89.75 85.42 87.53 92.38 55.85 69.61 90.30 93.84 92.04 72.90 84.09 78.10
LSTM 85.45 82.50 83.95 89.41 78.13 83.39 76.93 89.64 82.80 78.55 85.28 81.78
OmniAnomaly 89.02 86.37 87.67 92.49 81.99 86.92 88.39 74.46 80.83 83.68 86.82 85.22
InterFusion 81.28 92.70 86.62 89.77 88.52 89.14 83.61 83.45 83.52 87.02 85.43 86.22
THOC 88.45 90.97 89.69 92.06 89.34 90.68 88.14 90.99 89.54 79.76 90.95 84.99
AnomalyTrans 91.92 96.03 93.93 93.59 99.41 96.41 96.94 97.81 97.37 88.47 92.28 90.33
DCdetector 93.69 99.69 96.60 95.63 98.92 97.02 97.14 98.74 97.94 83.59 91.10 87.18
Table 2: Multi-metrics results on real-world multivariate datasets. Aff-P and Aff-R are the precision and recall of affiliation metric [31],
respectively. R_A_R and R_A_P are Range-AUC-ROC and Range-AUC-PR [49], which denote two scores based on label transformation under
ROC curve and PR curve, respectively. V_ROC and V_PR are volumes under the surfaces created based on the ROC curve and PR curve [49],
respectively. All results are in %, and the best ones are in Bold.
Dataset Method Acc F1 Aff-P [31] Aff-R [31] R_A_R [49] R_A_P [49] V_ROC [49] V_PR [49]
AnomalyTrans 98.69 93.93 51.76 95.98 90.04 87.87 88.20 86.26
MSL
DCdetector 99.06 96.60 51.84 97.39 93.17 91.64 93.15 91.66
AnomalyTrans 99.05 96.41 51.39 98.68 96.32 94.07 95.52 93.37
SMAP
DCdetector 99.21 97.02 51.46 98.64 96.03 94.18 95.19 93.46
AnomalyTrans 98.68 97.37 55.35 80.28 91.83 93.03 88.71 90.71
PSM
DCdetector 98.95 97.94 54.71 82.93 91.55 92.93 88.41 90.58
For all experiments on the above hyperparameter selection and trade-offs, please refer to Appendix C. Besides, all the experiments are implemented in PyTorch [51] with one NVIDIA Tesla-V100 32GB GPU. Adam [39] with default parameters is applied for optimization. We set the initial learning rate to $10^{-4}$ and the batch size to 128, with 3 epochs for all datasets.
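Collecting the defaults just listed, a hypothetical configuration for reproducing this setup might look like the following; the key names are ours, and the window size and patch list actually vary per dataset (see Table 8 in Appendix B).

```python
# Hypothetical default configuration assembled from the values reported above.
DEFAULT_CONFIG = {
    "encoder_layers": 3,        # L
    "d_model": 256,             # hidden dimension of the attention blocks
    "attention_heads": 1,       # H, kept at 1 for simplicity
    "window_size": 60,          # typical choice; 60 or 100 depending on dataset
    "patch_sizes": [3, 5],      # multi-scale patch list, dataset dependent
    "learning_rate": 1e-4,      # Adam with default parameters
    "batch_size": 128,
    "epochs": 3,
    "anomaly_threshold": 1.0,   # delta in Eq. (11)
}
```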
Table 3: Overall results on NIPS-TS datasets. Performance ranked from lowest to highest. All results are in %, the best ones are in bold, and the second ones are underlined.
Dataset NIPS-TS-GECCO NIPS-TS-SWAN
Metric P R F1 P R F1
OCSVM* 2.1 34.1 4.0 19.3 0.1 0.1
MatrixProfile 4.6 18.5 7.4 16.7 17.5 17.1
GBRT 17.5 14.0 15.6 44.7 37.5 40.8
LSTM-RNN 34.3 27.5 30.5 52.7 22.1 31.2
Autoregression 39.2 31.4 34.9 42.1 35.4 38.5
OCSVM 18.5 74.3 29.6 47.4 49.8 48.5
IForest* 39.2 31.5 39.0 40.6 42.5 41.6
AutoEncoder 42.4 34.0 37.7 49.7 52.2 50.9
AnomalyTrans 25.7 28.5 27.0 90.7 47.4 62.3
IForest 43.9 35.3 39.1 56.9 59.8 58.3
DCdetector 38.3 59.7 46.6 95.5 59.6 73.4

4.4 Main Results
4.4.1 Multivariate Anomaly Detection. We first evaluate our DCdetector against nineteen competitive baselines on four real-world multivariate datasets, as shown in Table 1. It can be seen that the proposed DCdetector achieves SOTA results under the widely used F1 metric [54, 71] on most benchmark datasets. It is worth mentioning that there has been an intense discussion among recent studies about how to evaluate the performance of anomaly detection algorithms fairly. Precision, Recall, and F1-score are still the most widely used metrics for comparison, and some additional metrics (the affiliation precision/recall pair, VUS, etc.) have been proposed to complement their deficiencies [31, 49, 61, 63, 70, 71]. Judging which metric is the best is beyond the scope of our work, so we include all the metrics here.
Table 4: Multi-metrics results on NIPS-TS datasets. All results are in %, and the best ones are in Bold.
Dataset Method Acc P R F1 Aff-P Aff-R R_A_R R_A_P V_ROC V_PR
AnomalyTrans 84.57 90.71 47.43 62.29 58.45 9.49 86.42 93.26 84.81 92.00
NIPS-TS-SWAN
DCdetector 85.94 95.48 59.55 73.35 50.48 5.63 88.06 94.71 86.25 93.50
AnomalyTrans 98.03 25.65 28.48 26.99 49.23 81.20 56.35 22.53 55.45 21.71
NIPS-TS-GECCO
DCdetector 98.56 38.25 59.73 46.63 50.05 88.55 62.95 34.17 62.41 33.67
Table 5: Ablation studies on Stop Gradient in DCdetector. All results are in %, and the best ones are in Bold.
Stop Gradient MSL SMAP PSM
Patch-wise Branch In-patch Branch P R F1 P R F1 P R F1
✘ ✘ 91.99 89.98 90.97 94.49 96.56 95.51 96.86 97.51 97.18
✔ ✘ 91.27 72.61 80.88 94.46 93.17 93.81 97.15 98.51 97.83
✘ ✔ 92.18 96.27 94.18 94.37 98.19 96.24 96.98 98.04 97.51
✔ ✔ 93.69 99.69 96.60 95.63 98.92 97.02 97.14 98.74 97.94
Table 6: Ablation studies on Forward Process module in DCdetector. All results are in %, and the best ones are in Bold.
Forward Process MSL SMAP PSM
Bilateral Filter Instance Norm P R F1 P R F1 P R F1
✘ ✘ 92.58 96.68 94.59 94.65 97.38 96.00 97.01 97.79 97.40
✔ ✘ 92.64 98.74 95.59 94.48 98.48 96.44 97.11 98.44 97.77
✘ ✔ 93.69 99.69 96.60 95.63 98.92 97.02 97.14 98.74 97.94
✔ ✔ 92.28 98.82 95.44 95.11 97.06 96.08 96.88 97.82 97.35
Table 7: Overall results on the univariate dataset. Results are in %, and the best ones are in Bold.
Dataset UCR
Metric Acc P R F1 Count
AnomalyTrans 99.49 60.41 100 73.08 42
DCdetector 99.51 61.62 100 74.05 46

As the recent Anomaly Transformer achieves better results than the other baseline models, we mainly evaluate DCdetector against the Anomaly Transformer in this multi-metrics comparison, as shown in Table 2. It can be seen that DCdetector performs better than, or at least comparably to, the Anomaly Transformer in most metrics.

We also evaluate the performance on another two datasets, NIPS-TS-SWAN and NIPS-TS-GECCO, in Table 3; they are more challenging, with more types of anomalies than the above four datasets. Although the two datasets have the highest (32.6% in NIPS-TS-SWAN) and lowest (1.1% in NIPS-TS-GECCO) anomaly ratios, DCdetector is still able to achieve SOTA results and completely outperform the other methods. Similarly, multi-metrics comparisons between DCdetector and Anomaly Transformer are conducted and summarized in Table 4, and DCdetector still achieves better performance in most metrics.

4.4.2 Univariate Anomaly Detection. In this part, we compare the performance of DCdetector and Anomaly Transformer on univariate time series anomaly detection. We trained and tested separately on each of the sub-datasets in the UCR dataset, and the average results are shown in Table 7. The count indicates how many sub-datasets reached SOTA. The sub-datasets of UCR all have only one segment of subsequence anomalies, and DCdetector can identify and locate them correctly and achieve optimal results.

4.5 Model Analysis
4.5.1 Ablation Studies. Table 5 shows the ablation study of the stop gradient. According to the loss function definition in Section 3.3, we use two stop gradient modules in $\mathcal{L}\{\mathcal{P}, \mathcal{N}; \mathcal{X}\}$, noted as the stop gradient in the patch-wise branch and in the in-patch branch, respectively. With the two stop gradient modules, we can see that DCdetector gains the best performance. If no stop gradient is included, DCdetector still works and does not fall into a trivial solution; moreover, in such a setting, it outperforms all the baselines except Anomaly Transformer. Besides, we also conduct an ablation study on how the two main preprocessing methods (a bilateral filter for denoising and instance normalization for normalization) affect the performance of our method in Table 6. It can be seen that either of them slightly improves the performance of our model when used individually. However, if they are utilized simultaneously, the performance degrades. Therefore, our final DCdetector only contains the instance normalization module for preprocessing. More ablation studies on multi-scale patching, window size, attention heads, embedding dimension, encoder layers, the anomaly threshold, and the metrics in the loss function are left to Appendix C.

4.5.2 Visual Analysis. We show how DCdetector works by visualizing different anomalies in Figure 5. We use the synthetic data generation methods reported in [40] to generate univariate time series with different types of anomalies, including point-wise anomalies (global point and contextual point anomalies) and pattern-wise anomalies (seasonal, group, and trend anomalies) [40]. It can be seen that DCdetector can robustly distinguish various anomalies from normal points with relatively higher anomaly scores.

4.5.3 Parameter Sensitivity. We also study the parameter sensitivity of DCdetector. Figure 6(a) shows the performance under different window sizes. As discussed, a single point cannot be taken as an instance in time series; window segmentation is widely used in the analysis, and the window size is a significant parameter. For our primary evaluation, the window size is usually set to 60 or 100. Nevertheless, the results in Figure 6(a) demonstrate that DCdetector is robust over a wide range of window sizes (from 30 to 210). Actually, in the window size range [45, 195], the performance fluctuates by less than 2.3%. Figure 6(b) shows the performance under different multi-scale sizes.
(Figure 5 here: input time series with the corresponding anomaly scores of AnomalyTrans and of DCdetector.)
(Figure 6 here: F1 score (%) on MSL, SMAP, and PSM under (a) window size, (b) multi-scale patch sizes, (c) encoder layer number, (d) attention head number, and (e) $d_{model}$ size of attention.)
Figure 6: Parameter sensitivity studies of main hyper-parameters in DCdetector.
The horizontal coordinate in Figure 6(b) is the patch-size combination used in the multi-scale attention, which means we combine several dual-attention modules with a given patch-size combination. Unlike the window size, the multi-scale design contributes to the final performance of DCdetector, and different patch-size combinations lead to different performances. Note that when studying the parameter sensitivity of the window size, the scale size is fixed as [3,5]; when studying the parameter sensitivity of the scale size, the window size is fixed at 60. Figure 6(c) shows the performance under different numbers of encoder layers, since the performance of many deep neural networks is affected by the layer number. Figure 6(d) and Figure 6(e) show the model performance with different head numbers and $d_{model}$ sizes in the attention. It can be seen that DCdetector achieves the best performance with a small attention head number and $d_{model}$ size. The memory and time usage with different $d_{model}$ sizes are reported in Figure 7. We choose the dimension of the hidden state $d_{model} = 256$ for the performance-complexity trade-off, and it can be seen that DCdetector works quite well under $d_{model} = 256$ with efficient running time and small memory consumption.

(Figure 7 here: (a) memory used and (b) time cost for $d_{model}$ sizes 128, 256, 512, and 1024.)
Figure 7: The averaged GPU memory cost and the averaged running time of 100 iterations during training with different $d_{model}$ sizes.

5 CONCLUSION
This paper proposes a novel algorithm named DCdetector for time-series anomaly detection. We design a contrastive learning-based dual-branch attention structure in DCdetector to learn a permutation invariant representation. Such a representation enlarges the differences between normal points and anomalies, improving detection accuracy. Besides, two additional designs, multi-scale and channel independence patching, are implemented to enhance the performance. Moreover, we propose a pure contrastive loss function without a reconstruction error, which empirically demonstrates the effectiveness of contrastive representation compared to the widely used reconstructive one. Lastly, extensive experiments show that DCdetector achieves the best or comparable performance on seven benchmark datasets compared to various state-of-the-art algorithms.

6 ACKNOWLEDGEMENTS
This work was supported by Alibaba Group through the Alibaba Research Intern Program.
REFERENCES
[1] Ahmed Abdulaal, Zhuanghua Liu, and Tomer Lancewicki. 2021. Practical approach to asynchronous multivariate time series anomaly detection and localization. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2485–2494.
[2] Ryan Prescott Adams and David JC MacKay. 2007. Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742 (2007).
[3] Archana Anandakrishnan, Senthil Kumar, Alexander Statnikov, Tanveer Faruquie, and Di Xu. 2018. Anomaly detection in finance: editors' introduction. In KDD 2017 Workshop on Anomaly Detection in Finance. PMLR, 1–7.
[4] OD Anderson. 1976. Time-Series. 2nd edn.
[5] Rafal Angryk, Petrus Martens, Berkay Aydin, Dustin Kempton, Sushant Mahajan, Sunitha Basodi, Azim Ahmadzadeh, Xumin Cai, Soukaina Filali Boubrahimi, Shah Muhammad Hamdi, Micheal Schuh, and Manolis Georgoulis. 2020. SWAN-SF. https://doi.org/10.7910/DVN/EBCFKM
[6] Ane Blázquez-García, Angel Conde, Usue Mori, and Jose A Lozano. 2021. A review on outlier/anomaly detection in time series data. ACM Computing Surveys (CSUR) 54, 3 (2021), 1–33.
[7] Paul Boniol and Themis Palpanas. 2020. Series2Graph: Graph-based Subsequence Anomaly Detection for Time Series. ArXiv abs/2207.12208 (2020).
[8] Loïc Bontemps, Van Loi Cao, James McDermott, and Nhien-An Le-Khac. 2016. Collective anomaly detection based on long short-term memory recurrent neural networks. In International conference on future data and security engineering. Springer, 141–152.
[9] George EP Box and David A Pierce. 1970. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American statistical Association 65, 332 (1970), 1509–1526.
[10] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. 2000. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data. 93–104.
[11] David Campos, Tung Kieu, Chenjuan Guo, Feiteng Huang, Kai Zheng, Bin Yang, and Christian S Jensen. 2021. Unsupervised Time Series Outlier Detection with Diversity-Driven Convolutional Ensembles–Extended Version. arXiv preprint arXiv:2111.11108 (2021).
[12] Mikel Canizo, Isaac Triguero, Angel Conde, and Enrique Onieva. 2019. Multi-head CNN–RNN for multi-time series anomaly detection: An industrial case study. Neurocomputing 363 (2019), 246–260.
[13] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems 33 (2020), 9912–9924.
[14] Sucheta Chauhan and Lovekesh Vig. 2015. Anomaly detection in ECG time signals via deep long short-term memory networks. In 2015 IEEE international conference on data science and advanced analytics (DSAA). IEEE, 1–7.
[15] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
[16] Xuanhao Chen, Liwei Deng, Feiteng Huang, Chengwei Zhang, Zongquan Zhang, Yan Zhao, and Kai Zheng. 2021. Daemon: Unsupervised anomaly detection and interpretation for multivariate time series. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2225–2230.
[17] Xinlei Chen and Kaiming He. 2021. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15750–15758.
[18] Zekai Chen, Dingshuo Chen, Xiao Zhang, Zixuan Yuan, and Xiuzhen Cheng. 2021. Learning graph structures with transformer for multivariate time-series anomaly detection in IoT. IEEE Internet of Things Journal 9, 12 (2021), 9179–9189.
[19] Haibin Cheng, Pang-Ning Tan, Christopher Potter, and Steven Klooster. 2009. Detection and characterization of anomalies in multivariate time series. In Proceedings of the 2009 SIAM international conference on data mining. SIAM, 413–424.
[20] Andrew A Cook, Göksel Mısırlı, and Zhong Fan. 2019. Anomaly detection for IoT time-series data: A survey. IEEE Internet of Things Journal 7, 7 (2019), 6481–6494.
[21] Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh. 2019. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica 6, 6 (2019), 1293–1305.
[22] Shohreh Deldari, Daniel V Smith, Hao Xue, and Flora D Salim. 2021. Time series change point detection with self-supervised contrastive predictive coding. In Proceedings of the Web Conference 2021. 3124–3135.
[23] Ailin Deng and Bryan Hooi. 2021. Graph Neural Network-Based Anomaly Detection in Multivariate Time Series. In AAAI Conference on Artificial Intelligence.
[24] Ailin Deng and Bryan Hooi. 2021. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 4027–4035.
[25] Shereen Elsayed, Daniela Thyssens, Ahmed Rashed, Hadi Samer Jomaa, and Lars Schmidt-Thieme. 2021. Do we really need deep learning models for time series forecasting? arXiv preprint arXiv:2101.02118 (2021).
[26] Jingkun Gao, Xiaomin Song, Qingsong Wen, Pichao Wang, Liang Sun, and Huan Xu. 2020. RobustTAD: Robust time series anomaly detection via decomposition and convolutional neural networks. KDD Workshop MileTS (2020).
[27] Koosha Golmohammadi and Osmar R Zaiane. 2015. Time series contextual anomaly detection for detecting market manipulation in stock market. In 2015 IEEE international conference on data science and advanced analytics (DSAA). IEEE, 1–10.
[28] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33 (2020), 21271–21284.
[29] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738.
[30] Chengqiang Huang, Yulei Wu, Yuan Zuo, Ke Pei, and Geyong Min. 2018. Towards experienced anomaly detector through reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[31] Alexis Huet, Jose Manuel Navarro, and Dario Rossi. 2022. Local Evaluation of Time Series Anomaly Detection Algorithms. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 635–645.
[32] Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. 2018. Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 387–395.
[33] Yang Jiao, Kai Yang, Dongjing Song, and Dacheng Tao. 2022. Timeautoad: Autonomous anomaly detection with self-supervised contrastive loss for multivariate time series. IEEE Transactions on Network Science and Engineering 9, 3 (2022), 1604–1619.
[34] Neha Kant and Manish Mahajan. 2019. Time-series outlier detection using enhanced k-means in combination with pso algorithm. In Engineering Vibration, Communication and Information Processing: ICoEVCI 2018, India. Springer, 363–373.
[35] Paweł Karczmarek, Adam Kiersztyn, Witold Pedrycz, and Ebru Al. 2020. K-Means-based isolation forest. Knowledge-based systems 195 (2020), 105659.
[36] Eamonn Keogh, Dutta Roy Taposh, U Naik, and A Agrawal. 2021. Multi-dataset Time-Series Anomaly Detection Competition. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://compete.hexagonml.com/practice/competition/39.
[37] Adnan Khan, Sarah AlBarri, and Muhammad Arslan Manzoor. 2022. Contrastive self-supervised learning: a survey on different architectures. In 2022 2nd International Conference on Artificial Intelligence (ICAI). IEEE, 1–6.
[38] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. 2021. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations.
[39] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[40] Kwei-Herng Lai, Daochen Zha, Junjie Xu, Yue Zhao, Guanchu Wang, and Xia Hu. 2021. Revisiting time series outlier detection: Definitions and benchmarks. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
[41] Dan Li, Dacheng Chen, Baihong Jin, Lei Shi, Jonathan Goh, and See-Kiong Ng. 2019. MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks. In Artificial Neural Networks and Machine Learning–ICANN 2019: Text and Time Series: 28th International Conference on Artificial Neural Networks, Munich, Germany, September 17–19, 2019, Proceedings, Part IV. Springer, 703–716.
[42] Longyuan Li, Junchi Yan, Qingsong Wen, Yaohui Jin, and Xiaokang Yang. 2022. Learning robust deep state space for unsupervised anomaly detection in contaminated time-series. IEEE Transactions on Knowledge and Data Engineering (2022).
[43] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. 2019. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in neural information processing systems 32 (2019).
[44] Xing Li, Qiquan Shi, Gang Hu, Lei Chen, Hui Mao, Yiyuan Yang, Mingxuan Yuan, Jia Zeng, and Zhuo Cheng. 2021. Block Access Pattern Discovery via Compressed Full Tensor Transformer. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 957–966.
[45] Zhihan Li, Youjian Zhao, Jiaqi Han, Ya Su, Rui Jiao, Xidao Wen, and Dan Pei. 2021. Multivariate time series anomaly detection and interpretation using hierarchical inter-metric and temporal embedding. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3220–3230.
[46] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In 2008 eighth ieee international conference on data mining. IEEE, 413–422.
[47] Steffen Moritz, Frederik Rehbach, Sowmya Chandrasekaran, Margarita Rebolledo, and Thomas Bartz-Beielstein. 2018. GECCO Industrial Challenge 2018 Dataset: a water quality dataset for the "Internet of Things: Online Anomaly Detection for Drinking Water Quality" competition at the Genetic and Evolutionary Computation Conference 2018, Kyoto, Japan (2018).
[48] Zijian Niu, Ke Yu, and Xiaofei Wu. 2020. LSTM-based VAE-GAN for time-series anomaly detection. Sensors 20, 13 (2020), 3738.
[49] John Paparrizos, Paul Boniol, Themis Palpanas, Ruey S Tsay, Aaron Elmore, and Michael J Franklin. 2022. Volume under the surface: a new accuracy evaluation measure for time-series anomaly detection. Proceedings of the VLDB Endowment 15, 11 (2022), 2774–2787.
[50] Daehyung Park, Yuuna Hoshi, and Charles C Kemp. 2018. A multimodal anomaly detector for robot-assisted feeding using an lstm-based variational autoencoder. IEEE Robotics and Automation Letters 3, 3 (2018), 1544–1551.
[51] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).
[52] Mathias Perslev, Michael Jensen, Sune Darkner, Poul Jørgen Jennum, and Christian Igel. 2019. U-time: A fully convolutional network for time series segmentation applied to sleep staging. Advances in Neural Information Processing Systems 32 (2019).
[53] Peter CB Phillips and Sainan Jin. 2015. Business cycles, trend elimination, and the HP filter. (2015).
[54] Hansheng Ren, Bixiong Xu, Yujing Wang, Chao Yi, Congrui Huang, Xiaoyu Kou, Tony Xing, Mao Yang, Jie Tong, and Qi Zhang. 2019. Time-series anomaly detection service at microsoft. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 3009–3017.
[55] Peter J Rousseeuw and Annick M Leroy. 2005. Robust regression and outlier detection. John wiley & sons.
[56] Lukas Ruff, Jacob R Kauffmann, Robert A Vandermeulen, Grégoire Montavon, Wojciech Samek, Marius Kloft, Thomas G Dietterich, and Klaus-Robert Müller. 2021. A unifying review of deep and shallow anomaly detection. Proc. IEEE 109, 5 (2021), 756–795.
[57] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. 2018. Deep one-class classification. In International conference on machine learning. PMLR, 4393–4402.
[58] Mayu Sakurada and Takehisa Yairi. 2014. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd workshop on machine learning for sensory data analysis. 4–11.
[59] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. 2020.
[70] Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang Feng, et al. 2018. Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In Proceedings of the 2018 world wide web conference. 187–196.
[71] Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. 2021. Anomaly transformer: Time series anomaly detection with association discrepancy. arXiv preprint arXiv:2110.02642 (2021).
[72] Takehisa Yairi, Naoya Takeishi, Tetsuo Oda, Yuta Nakajima, Naoki Nishimura, and Noboru Takata. 2017. A data-driven health monitoring method for satellite housekeeping data based on probabilistic clustering and dimensionality reduction. IEEE Trans. Aerospace Electron. Systems 53, 3 (2017), 1384–1401.
[73] Yiyuan Yang, Rongshang Li, Qiquan Shi, Xijun Li, Gang Hu, Xing Li, and Mingxuan Yuan. 2023. SGDP: A Stream-Graph Neural Network Based Data Prefetcher. arXiv preprint arXiv:2304.03864 (2023).
[74] Yiyuan Yang, Yi Li, and Haifeng Zhang. 2021. Pipeline safety early warning method for distributed signal using bilinear CNN and LightGBM. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4110–4114.
[75] Yiyuan Yang, Yi Li, Taojia Zhang, Yan Zhou, and Haifeng Zhang. 2021. Early safety warnings for long-distance pipelines: A distributed optical fiber sensor machine learning approach. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 14991–14999.
[76] Yiyuan Yang, Haifeng Zhang, and Yi Li. 2021. Long-distance pipeline safety early warning: a distributed optical fiber sensing semi-supervised learning method. IEEE Sensors Journal 21, 17 (2021), 19453–19461.
[77] Yiyuan Yang, Haifeng Zhang, and Yi Li. 2021. Pipeline safety early warning by multifeature-fusion CNN and LightGBM analysis of signals from distributed optical fiber sensors. IEEE Transactions on Instrumentation and Measurement 70 (2021), 1–13.
[78] Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. 2019. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6210–6219.
[79] Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding, Hoang Anh Dau, Diego Furtado Silva, Abdullah Mueen, and Eamonn Keogh. 2016. Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In 2016 IEEE 16th international conference on data mining (ICDM). Ieee, 1317–1322.
[80] Mengran Yu and Shiliang Sun. 2020. Policy-based reinforcement learning for time series anomaly detection. Engineering Applications of Artificial Intelligence
DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Inter- 95 (2020), 103919.
national Journal of Forecasting 36, 3 (2020), 1181–1191. [81] Zahra Zamanzadeh Darban, Geoffrey I Webb, Shirui Pan, Charu C Aggarwal,
[60] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. 2022. Anomaly and Mahsa Salehi. 2022. Deep Learning for Time Series Anomaly Detection: A
detection in time series: a comprehensive evaluation. Proceedings of the VLDB Survey. arXiv e-prints (2022), arXiv–2211.
Endowment 15, 9 (2022), 1779–1797. [82] Chaoning Zhang, Kang Zhang, Chenshuang Zhang, Trung X Pham, Chang D
[61] Lifeng Shen, Zhuocong Li, and James Kwok. 2020. Timeseries anomaly detection Yoo, and In So Kweon. 2022. How does SimSiam avoid collapse without nega-
using temporal hierarchical one-class network. Advances in Neural Information tive samples? a unified understanding with self-supervised contrastive learning.
Processing Systems 33 (2020), 13016–13026. Proceedings of International Conference on Learning Representations (ICLR) (2022).
[62] Youjin Shin, Sangyup Lee, Shahroz Tariq, Myeong Shin Lee, Okchul Jung, Daewon [83] Chaoli Zhang, Tian Zhou, Qingsong Wen, and Liang Sun. 2022. TFAD: A De-
Chung, and Simon S Woo. 2020. Itad: integrative tensor-based anomaly detection composition Time Series Anomaly Detection Architecture with Time-Frequency
system for reducing false positives of satellite systems. In Proceedings of the Analysis. In Proceedings of the 31st ACM International Conference on Information
29th ACM international conference on information & knowledge management. & Knowledge Management. 2497–2507.
2733–2740. [84] Yuxin Zhang, Jindong Wang, Yiqiang Chen, Han Yu, and Tao Qin. 2022. Adap-
[63] Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. 2019. Robust tive memory networks with self-supervised learning for unsupervised anomaly
anomaly detection for multivariate time series through stochastic recurrent detection. IEEE Transactions on Knowledge and Data Engineering (2022).
neural network. In Proceedings of the 25th ACM SIGKDD international conference [85] Hang Zhao, Yujing Wang, Juanyong Duan, Congrui Huang, Defu Cao, Yunhai
on knowledge discovery & data mining. 2828–2837. Tong, Bixiong Xu, Jing Bai, Jie Tong, and Qi Zhang. 2020. Multivariate Time-
[64] Shahroz Tariq, Sangyup Lee, Youjin Shin, Myeong Shin Lee, Okchul Jung, Daewon series Anomaly Detection via Graph Attention Network. 2020 IEEE International
Chung, and Simon S Woo. 2019. Detecting anomalies in space using multivariate Conference on Data Mining (ICDM) (2020), 841–850.
convolutional LSTM with mixtures of probabilistic PCA. In Proceedings of the [86] Hang Zhao, Yujing Wang, Juanyong Duan, Congrui Huang, Defu Cao, Yunhai
25th ACM SIGKDD international conference on knowledge discovery & data mining. Tong, Bixiong Xu, Jing Bai, Jie Tong, and Qi Zhang. 2020. Multivariate time-
2123–2133. series anomaly detection via graph attention network. In 2020 IEEE International
[65] David MJ Tax and Robert PW Duin. 2004. Support vector data description. Conference on Data Mining (ICDM). IEEE, 841–850.
Machine learning 54, 1 (2004), 45–66. [87] Bin Zhou, Shenghua Liu, Bryan Hooi, Xueqi Cheng, and Jing Ye. 2019. BeatGAN:
[66] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2017. Improved texture Anomalous Rhythm Detection using Adversarially Generated Time Series.. In
networks: Maximizing quality and diversity in feed-forward stylization and IJCAI, Vol. 2019. 4433–4439.
texture synthesis. In Proceedings of the IEEE conference on computer vision and [88] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki
pattern recognition. 6924–6932. Cho, and Haifeng Chen. 2018. Deep autoencoding gaussian mixture model for
[67] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, unsupervised anomaly detection. In International conference on learning represen-
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all tations.
you need. Advances in neural information processing systems 30 (2017).
[68] Qingsong Wen, Linxiao Yang, Tian Zhou, and Liang Sun. 2022. Robust time series
analysis and applications: An industrial perspective. In Proceedings of the 28th
ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4836–4837.
[69] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised
feature learning via non-parametric instance discrimination. In Proceedings of
the IEEE conference on computer vision and pattern recognition. 3733–3742.
Table 8: Details of benchmark datasets. AR (anomaly ratio) represents the abnormal proportion of the whole dataset.
Benchmark Source Dimension Window Patch Size #Training #Test (Labeled) AR (%)
MSL NASA Space Sensors 55 90 [3,5] 58,317 73,729 10.5
SMAP NASA Space Sensors 25 105 [3,5,7] 135,183 427,617 12.8
PSM eBay Server Machine 25 60 [1,3,5] 132,481 87,841 27.8
SMD Internet Server Machine 38 105 [5,7] 708,405 708,420 4.2
NIPS-TS-SWAN Space (Solar) Weather 38 36 [1,3] 60,000 60,000 32.6
NIPS-TS-GECCO Water Quality for IoT 9 90 [1,3,5] 69,260 69,261 1.1
UCR Various Natural Sources 1 105 [3,5,7] 2,238,349 6,143,541 0.6
Table 9: Ablation studies on metrics in the loss function. All results are in %. The best ones are in Bold.
Dataset MSL SMAP PSM
Metric P R F1 P R F1 P R F1
JS 89.23 70.42 78.72 93.23 93.62 93.42 97.18 92.29 94.67
Simple KL 92.44 98.82 95.52 92.20 93.44 92.82 97.80 96.71 97.25
DCdetector 93.69 99.69 96.60 95.63 98.92 97.02 97.14 98.74 97.94
B DATASET DESCRIPTION
We summarize the seven benchmark datasets adopted for evaluation in Table 8. They cover both univariate and multivariate time series with different anomaly types and anomaly ratios: MSL, SMAP, PSM, SMD, NIPS-TS-SWAN, and NIPS-TS-GECCO are multivariate time series datasets, while UCR is a univariate time series dataset.
C EXTRA STUDIES
To verify the sensitivity of the parameters of the proposed DCdetector, we conduct additional ablation experiments in this part, providing more detailed results than those in Section 4.5.1. We also report the memory usage and the per-iteration time spent during training.
C.1 Study on Metrics in Loss Function
We use different statistical distances to calculate the discrepancy between the patch-wise representation and the in-patch representation; the results are shown in Table 9. The loss function proposed in Section 3.3 achieves state-of-the-art performance on all benchmarks. Even with simple KL divergence, which is an asymmetric loss function, we still obtain comparable results. With Jensen-Shannon (JS) divergence, however, there is a visible performance degradation, especially on the MSL benchmark.
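As a rough illustration of the metrics compared in Table 9, the following sketch computes a simple (asymmetric) KL discrepancy, a symmetrized KL discrepancy, and a JS divergence between a patch-wise and an in-patch representation. It assumes both representations have been normalized into discrete distributions over window positions; the function names and shapes are illustrative and do not reproduce the exact loss of Section 3.3.

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """Row-wise KL(p || q) for discrete distributions along the last axis."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q), axis=-1)

def simple_kl(patch_wise, in_patch):
    """Asymmetric discrepancy, analogous to the 'Simple KL' row of Table 9."""
    return kl(patch_wise, in_patch)

def symmetric_kl(patch_wise, in_patch):
    """Symmetrized KL: KL(P || Q) + KL(Q || P)."""
    return kl(patch_wise, in_patch) + kl(in_patch, patch_wise)

def js(patch_wise, in_patch):
    """Jensen-Shannon divergence with mixture M = (P + Q) / 2."""
    m = 0.5 * (patch_wise + in_patch)
    return 0.5 * kl(patch_wise, m) + 0.5 * kl(in_patch, m)

# Toy usage: distributions over 10 positions for a batch of 4 windows.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(10), size=4)  # hypothetical patch-wise representation
Q = rng.dirichlet(np.ones(10), size=4)  # hypothetical in-patch representation
print(simple_kl(P, Q), symmetric_kl(P, Q), js(P, Q))
```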
C.2 Study on Multi-scale Patching
We test multi-patching scales in {[1], [3], [5], [1, 3], [1, 5], [3, 5], [1, 3, 5]}. Odd patch sizes are preferred to avoid information loss during upsampling. In general, the multi-scale design requires more memory, and different datasets have different optimal multi-patching scales, presumably because of their different information densities and anomaly types. Detailed evaluation results are shown in Table 10.
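To make the patching notation above concrete, here is a minimal sketch of channel-independent, non-overlapping patching at several scales; the function name, shapes, and remainder handling are our own assumptions rather than the released implementation.

```python
import numpy as np

def split_into_patches(window, patch_sizes):
    """Split a (window_length, channels) window into non-overlapping patches,
    once per patch size; returns {patch_size: (num_patches, patch_size, channels)}."""
    length, channels = window.shape
    out = {}
    for p in patch_sizes:
        num = length // p                       # drop any remainder for simplicity
        out[p] = window[: num * p].reshape(num, p, channels)
    return out

window = np.random.randn(60, 25)                # e.g., one PSM-like window with 25 channels
for p, arr in split_into_patches(window, [1, 3, 5]).items():
    print(p, arr.shape)                         # (60, 1, 25), (20, 3, 25), (12, 5, 25)
```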
C.3 Study on Window Size
Window size is a significant hyperparameter in time series analysis. It is used to split the time series into instances, since a single point usually cannot be treated as a sample. The results in Table 11 show that DCdetector is rather robust to the window size: over a large range [45, 195], the performance is only slightly lower than the best result on each benchmark. We also test the impact of window size on memory cost and running time. Memory cost grows quadratically with the window size, so the trade-off between sliding window size and memory cost/running time is important, especially in real-life scenarios. Fortunately, DCdetector works well with a window size of at most 105 on all benchmarks, which greatly reduces the complexity of the model and its memory cost.
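Below is a minimal sliding-window splitter illustrating how instances are formed from a series; the stride of 1 and the toy shapes are assumptions, not the paper's data pipeline.

```python
import numpy as np

def sliding_windows(series, window_size, stride=1):
    """Split a (T, channels) series into overlapping windows of shape
    (num_windows, window_size, channels). Attention over a window costs
    memory roughly quadratic in window_size."""
    T = series.shape[0]
    starts = range(0, T - window_size + 1, stride)
    return np.stack([series[s:s + window_size] for s in starts])

series = np.random.randn(1000, 38)                       # e.g., an SMD-like multivariate series
print(sliding_windows(series, window_size=105).shape)    # (896, 105, 38)
```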
C.4 Study on Attention Head
Multi-head attention is widely used in attention networks, so we study the influence of the number of attention heads 𝐻 in DCdetector. Since the number of attention heads is typically even, we set 𝐻 ∈ {1, 2, 4, 8}. As shown in Table 12, our model achieves good performance even with a small number of heads (either the best result or only slightly below it), so DCdetector does not need a large amount of memory when running.
C.5 Study on Embedding Dimension
The embedding dimension 𝑑𝑚𝑜𝑑𝑒𝑙 is another important parameter of the attention network. As a hyperparameter of the hidden channels, it may affect model performance, memory cost, and running efficiency. We set 𝑑𝑚𝑜𝑑𝑒𝑙 ∈ {128, 256, 512, 1024}, following the hyperparameters suggested for the Transformer [67]. For SMAP and PSM, 𝑑𝑚𝑜𝑑𝑒𝑙 has little effect on the final results; for MSL, the best performance is achieved with a small 𝑑𝑚𝑜𝑑𝑒𝑙 and a small memory footprint. Overall, DCdetector achieves good performance with low memory cost and good real-time performance. Details are in Table 13.
C.6 Study on Encoder Layer
The performance of many deep models depends on the number of network layers 𝐿. We show the influence of the number of encoder layers in Table 14, setting 𝐿 ∈ {1, 2, 3, 4, 5} as suggested for the Transformer [67]. Different benchmarks have different optimal values, but our model reaches its best performance with no more than 3 layers, and it neither fails with too few encoder layers nor overfits with too many.
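For convenience, the defaults explored across Tables 12–14 (patch size [3,5], window size 60) can be summarized as a single configuration; the key names below are illustrative and are not the identifiers used in the released code.

```python
# Hypothetical configuration mirroring the ablation defaults above;
# key names are our own and only for illustration.
dcdetector_config = {
    "window_size": 60,
    "patch_sizes": [3, 5],
    "n_heads": 1,          # Table 12: a single head is already competitive and cheapest
    "d_model": 256,        # Table 13: 128-256 performs on par with larger dimensions here
    "encoder_layers": 3,   # Table 14: best results reached with at most 3 layers
    "anomaly_threshold": 1.0,
}
```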
C.7 Study on Anomaly Threshold
The anomaly threshold 𝛿 is a hyperparameter that determines whether a point is labeled as an anomaly, based on Eq. 11. We use a default value of 1 for all benchmarks. As shown in Table 15, when 𝛿 is in the range of 0.5 to 1, it has little effect on the final model performance. PSM and SMAP are also more robust to the anomaly threshold than MSL. On all three benchmarks, the best results appear when 𝛿 equals 0.7 or 0.8.
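The thresholding step itself is simple; the sketch below applies the decision rule described above to precomputed anomaly scores (it does not reproduce the exact scoring of Eq. 11).

```python
import numpy as np

def label_anomalies(anomaly_scores, delta=1.0):
    """Mark a time point as anomalous when its anomaly score exceeds the
    threshold delta (default 1, as used for all benchmarks above)."""
    return (np.asarray(anomaly_scores) > delta).astype(int)

scores = np.array([0.2, 0.9, 1.4, 0.3, 2.1])    # hypothetical per-point anomaly scores
print(label_anomalies(scores, delta=0.8))        # -> [0 1 1 0 1]
```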
Table 10: Ablation studies on multi-scale patching results (window size=60). All results are in %. The best ones are in Bold.
Dataset MSL SMAP PSM Mem Time
Metric Acc P R F1 Acc P R F1 Acc P R F1 (GB) (s)
Patch Size = [1] 97.98 92.77 88.29 90.48 98.96 93.91 98.31 96.06 98.82 97.27 98.07 97.66 16.9 0.42
Patch Size = [3] 98.64 92.39 95.34 93.84 98.59 94.65 94.43 94.54 97.22 96.95 91.84 94.33 6.0 0.24
Patch Size = [5] 98.91 92.55 97.87 95.14 98.92 94.40 97.44 95.90 98.75 97.22 97.84 97.53 3.2 0.17
Patch Size = [1,3] 98.30 93.19 90.77 91.96 98.98 94.42 97.87 96.11 98.83 96.96 98.42 97.68 16.9 0.59
Patch Size = [1,5] 98.52 92.88 93.60 93.24 98.89 94.15 97.49 95.79 98.76 97.03 98.07 97.55 16.9 0.46
Patch Size = [3,5] 98.93 93.88 97.72 95.76 98.89 94.61 96.95 95.77 98.38 97.00 96.54 96.77 6.0 0.27
Patch Size = [1,3,5] 98.44 91.52 93.78 92.64 99.03 93.72 99.10 96.34 98.95 97.14 98.74 97.94 16.9 0.71
Table 11: Ablation studies on window size results (patch size=[3,5]). All results are in %. The best ones are in Bold.
Dataset MSL SMAP PSM Mem Time
Metric Acc P R F1 Acc P R F1 Acc P R F1 (GB) (s)
Window size = 30 96.87 92.39 76.47 83.68 98.62 93.93 95.46 94.69 98.42 97.42 96.27 96.84 2.9 0.17
Window size = 45 98.77 92.80 96.13 94.44 98.94 94.24 97.75 95.96 98.79 97.01 98.09 97.55 6.0 0.22
Window size = 60 98.63 91.82 96.01 93.87 98.87 94.87 96.44 95.65 98.91 97.04 98.67 97.85 6.1 0.28
Window size = 75 98.79 91.71 97.93 94.72 98.97 94.62 97.60 96.09 98.79 97.26 98.20 97.73 7.6 0.36
Window size = 90 98.94 92.04 98.82 95.31 98.99 94.61 97.69 96.13 98.74 96.87 98.05 97.46 7.6 0.40
Window size = 105 99.06 93.69 99.69 96.60 99.16 94.69 98.87 96.74 98.57 96.84 97.37 97.10 18.5 0.46
Window size = 120 98.95 92.64 98.74 95.59 99.08 94.48 98.48 96.44 98.80 97.00 98.24 97.61 18.5 0.53
Window size = 135 98.44 91.52 94.45 92.96 99.09 94.26 98.91 96.53 98.70 96.97 98.17 97.57 24.4 0.60
Window size = 150 98.49 91.70 95.02 93.34 98.93 94.40 97.48 95.92 98.55 97.02 97.18 97.10 24.4 0.67
Window size = 165 98.64 92.61 95.68 94.12 99.01 94.50 98.07 96.25 98.77 97.11 98.05 97.58 24.4 0.74
Window size = 180 98.68 92.13 96.13 94.09 98.99 94.53 97.73 96.10 98.67 97.31 97.87 97.59 24.4 0.81
Window size = 195 98.50 92.68 94.18 93.43 98.95 94.44 97.60 96.00 98.66 97.24 97.88 97.55 24.5 0.89
Window size = 210 98.03 91.31 90.88 91.09 98.58 92.79 96.29 94.51 98.39 96.86 96.59 96.72 24.5 0.97
Table 12: Ablation studies on attention head 𝐻 results (patch size=[3,5], window size=60). All results are in %. The best ones are
in Bold.
Dataset MSL SMAP PSM Mem Time
Metric Acc P R F1 Acc P R F1 Acc P R F1 (GB) (s)
𝐻 =1 98.63 91.82 96.01 93.87 98.87 94.87 96.44 95.65 98.91 97.04 98.67 97.85 6.1 0.05
𝐻 =2 98.50 92.13 94.29 93.20 98.98 93.99 98.44 96.16 98.67 97.16 97.55 97.36 6.1 0.17
𝐻 =4 98.67 91.93 95.71 93.78 98.97 93.89 98.46 96.12 98.69 97.02 97.79 97.41 9.8 0.19
𝐻 =8 98.55 91.30 95.21 93.21 99.11 94.87 98.47 96.63 98.63 96.84 97.73 97.29 9.8 0.40
Table 13: Ablation studies on embedding 𝑑𝑚𝑜𝑑𝑒𝑙 results (patch size=[3,5], window size=60). All results are in %. The best ones are
in Bold.
Dataset MSL SMAP PSM Mem Time
Metric Acc P R F1 Acc P R F1 Acc P R F1 (GB) (s)
𝑑𝑚𝑜𝑑𝑒𝑙 = 128 98.47 91.62 94.59 93.08 99.10 94.85 98.34 96.56 98.86 97.13 98.38 97.75 3.9 0.05
𝑑𝑚𝑜𝑑𝑒𝑙 = 256 98.79 91.47 98.02 94.63 99.02 94.25 98.40 96.28 98.85 96.98 98.51 97.74 6.1 0.10
𝑑𝑚𝑜𝑑𝑒𝑙 = 512 98.63 91.82 96.01 93.87 98.87 94.87 96.44 95.65 98.91 97.04 98.67 97.85 10.3 0.28
𝑑𝑚𝑜𝑑𝑒𝑙 = 1024 98.13 91.92 90.81 91.36 98.97 94.87 97.27 96.06 98.97 97.11 98.83 97.96 18.4 0.92
Table 14: Ablation studies on encoder layers 𝐿 results (patch size=[3,5], window size=60). All results are in %. The best ones are
in Bold.
Dataset MSL SMAP PSM Mem Time
Metric Acc P R F1 Acc P R F1 Acc P R F1 (GB) (s)
𝐿=1 98.73 91.76 96.52 94.08 98.94 94.34 97.69 95.98 96.92 97.26 90.33 93.66 6.0 0.02
𝐿=2 98.67 98.67 94.56 93.72 98.75 93.93 96.53 95.22 98.88 97.24 98.35 97.79 6.0 0.04
𝐿=3 98.63 91.82 96.01 93.87 98.87 94.87 96.44 95.65 98.91 97.04 98.67 97.85 6.1 0.10
𝐿=4 98.33 91.52 92.72 92.11 99.01 95.03 97.42 96.21 98.88 97.00 98.53 97.76 6.1 0.19
𝐿=5 98.34 91.38 92.91 92.14 99.04 94.23 98.63 96.38 98.83 96.97 98.41 97.69 6.1 0.24
Table 15: Ablation studies on anomaly threshold 𝛿 results (patch size=[3,5], window size=60). All results are in %. The best ones
are in Bold.
Dataset MSL SMAP PSM
Metric Acc P R F1 Acc P R F1 Acc P R F1
𝛿=0.5 96.62 95.69 72.23 82.32 98.85 96.60 94.40 95.49 98.45 98.76 95.03 96.86
𝛿=0.6 98.08 94.75 87.20 90.80 99.12 96.66 96.53 96.59 98.71 98.27 96.56 97.41
𝛿=0.7 98.80 93.65 95.47 94.55 99.29 95.73 98.88 97.28 98.98 98.14 97.81 97.97
𝛿=0.8 98.83 93.12 96.40 94.73 99.06 94.29 98.72 96.45 98.88 97.71 97.85 97.78
𝛿=0.9 98.42 91.90 93.75 92.82 98.82 93.81 97.32 95.54 98.92 97.42 98.31 97.86
𝛿=1.0 98.33 92.58 92.04 92.31 98.90 93.30 98.56 95.86 98.91 97.04 98.67 97.85