Time Series Anomaly Detection With Multiresolution Ensemble Decoding
... length, the decoder has to focus on macro temporal characteristics such as trend patterns and seasonality, whereas a decoder with a long decoding length can capture more detailed local temporal patterns. Furthermore, instead of simply averaging the decoder outputs as in (Kieu et al. 2019), the lower-resolution temporal information is used to guide decoding at a higher resolution. Specifically, we introduce a multiresolution shape-forcing loss to encourage the decoders to match the input's global temporal shape at multiple resolutions. This avoids overfitting the nonlinear local patterns at a higher resolution, and alleviates error accumulation during decoding. Finally, the output from the highest resolution (whose decoding length equals the length of the whole time series) is used as the ensemble output.

Our main contributions can be summarized as follows:
• We present the novel recurrent autoencoder RAMED, with multiple decoders of different decoding lengths. By introducing a shape-forcing reconstruction loss, decoders can capture temporal characteristics of the time series at multiple resolutions.
• We introduce a fusion mechanism to integrate multiresolution temporal information from multiple decoders.
• We conduct extensive empirical studies on time series anomaly detection. Results demonstrate that the proposed model outperforms recent strong baselines.

Related Work

Autoencoders for Time Series Anomaly Detection
The sequence-to-sequence model (Sutskever, Vinyals, and Le 2014) is a popular auto-encoding approach for sequential data. There are two steps in its learning procedure: (i) encoding, which compresses the sequential data into a fixed-length representation; and (ii) decoding, which reconstructs the original input from the learned compressed representation. The sequence-to-sequence model has been widely used in natural language processing (Bahdanau, Cho, and Bengio 2015). Recently, it has also been used in time series applications such as prediction (Le Guen and Thome 2019), clustering (Ma et al. 2019) and anomaly detection (Malhotra et al. 2016; Yoo, Kim, and Kim 2019; Kieu et al. 2019). Two recent representative sequence-to-sequence models for time series anomaly detection are the recurrent auto-encoder (RAE) (Malhotra et al. 2016) and the recurrent reconstructive network (RRN) (Yoo, Kim, and Kim 2019).

Time Series Encoding  The recurrent neural network (RNN) is often used to encode time series data. Let $X = [x_1, x_2, \ldots, x_T]$, where $x_t \in \mathbb{R}^d$, be a time series of length $T$. At time $t$, the encoder's hidden state $h_t^{(E)}$ is updated as:
\[ h_t^{(E)} = f^{(E)}([x_t; h_{t-1}^{(E)}]), \tag{1} \]
where $f^{(E)}$ is a nonlinear function. The $h_T^{(E)}$ at the last time step $T$ is then used as $X$'s compressed representation.
A popular choice for $f^{(E)}$ is the long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997). Kieu et al. (2019) added sparse skip connections to the RNN cells so that additional hidden states in the past can be considered. Specifically, $f^{(E)}$ uses not only the immediate previous state $h_{t-1}^{(E)}$, but also $h_{t-s}^{(E)}$ for some skip length $s > 1$:
\[ h_t^{(E)} = f^{(E)}\left(\left[x_t;\; \frac{w_1 h_{t-1}^{(E)} + w_2 h_{t-s}^{(E)}}{|w_1| + |w_2|}\right]\right), \tag{2} \]
where the coefficients $(w_1, w_2)$ are randomly sampled from $\{(1, 0), (0, 1), (1, 1)\}$ at each time step. In (Kieu et al. 2019), the skip length $s$ is randomly sampled from $[1, 10]$ and fixed before training.
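For concreteness, the following is a minimal NumPy sketch of the encoder update in (1)-(2). It is an illustration under simplifying assumptions, not the authors' implementation: a tanh RNN cell stands in for the LSTM $f^{(E)}$, and all names and weight shapes are hypothetical.

```python
import numpy as np

def encode(X, W, b, s=3, rng=None):
    """Sketch of the skip-connection encoder update in (1)-(2).

    X: (T, d) time series; W: (m, d + m) weights; b: (m,) bias.
    A tanh RNN cell stands in for the LSTM f^(E); s is the skip length.
    """
    rng = rng or np.random.default_rng(0)
    T, d = X.shape
    m = b.shape[0]
    H = np.zeros((T + 1, m))            # H[t] holds h_t^(E); H[0] is the zero initial state
    choices = [(1, 0), (0, 1), (1, 1)]  # (w1, w2) sampled at each time step, as in (2)
    for t in range(1, T + 1):
        w1, w2 = choices[rng.integers(len(choices))]
        h_skip = H[t - s] if t - s >= 0 else np.zeros(m)
        h_prev = (w1 * H[t - 1] + w2 * h_skip) / (abs(w1) + abs(w2))
        H[t] = np.tanh(W @ np.concatenate([X[t - 1], h_prev]) + b)
    return H[T]                         # compressed representation h_T^(E)

# toy usage with random data and weights
rng = np.random.default_rng(1)
T, d, m = 20, 2, 8
h_T = encode(rng.standard_normal((T, d)),
             0.1 * rng.standard_normal((m, d + m)), np.zeros(m), s=3, rng=rng)
```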
Decoding for Anomaly Detection  The encoder's compressed representation $h_T^{(E)}$ can be decoded by using an LSTM. In time series anomaly detection, decoding is usually easier when performed in time-reverse order (Kieu et al. 2019; Yoo, Kim, and Kim 2019), i.e., the target reconstructed output for input $X = [x_1, x_2, \ldots, x_T]$ is $[y_T, y_{T-1}, \ldots, y_1]$, where $y_t$ is the LSTM's output at time $t$ (for $t = T, T-1, \ldots, 1$). After initializing $h_T$ by $h_T^{(E)}$, $\{h_{T-1}, \ldots, h_1\}$ are obtained as:
\[ y_t = W h_t + b, \qquad h_{t-1} = \mathrm{LSTM}([y_t; h_t]), \tag{3} \]
where $W, b$ are learnable parameters. An anomaly score is computed for each $x_t$ based on the error $e(t) = y_t - x_t$. As can be seen from (3), this error can accumulate during the sequential decoding process.
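A corresponding sketch of the time-reversed decoding in (3) is shown below, again with a tanh cell standing in for the LSTM; weights and shapes are illustrative.

```python
import numpy as np

def decode_reverse(h_T, T, W, b, U, c):
    """Sketch of the time-reversed decoding in (3).

    h_T: (m,) initial decoder state, set to the encoder's h_T^(E).
    W: (d, m), b: (d,) give the output y_t = W h_t + b.
    U: (m, d + m), c: (m,) form a tanh recurrent cell standing in for the LSTM.
    Returns Y with Y[t-1] = y_t, i.e. the reconstruction in original time order.
    """
    Y = np.zeros((T, b.shape[0]))
    h = h_T
    for t in range(T, 0, -1):                          # t = T, T-1, ..., 1
        y = W @ h + b                                  # y_t = W h_t + b
        Y[t - 1] = y
        h = np.tanh(U @ np.concatenate([y, h]) + c)    # h_{t-1} = cell([y_t; h_t])
    return Y

# toy usage: the per-step errors e(t) = y_t - x_t drive the anomaly score
rng = np.random.default_rng(0)
T, d, m = 20, 2, 8
X = rng.standard_normal((T, d))
Y = decode_reverse(rng.standard_normal(m), T,
                   0.1 * rng.standard_normal((d, m)), np.zeros(d),
                   0.1 * rng.standard_normal((m, d + m)), np.zeros(m))
E = Y - X
```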
Encoder-Decoder Ensemble  In the recurrent autoencoder ensemble (RAE-ensemble) (Kieu et al. 2019), multiple recurrent encoders and decoders are used. Let $L^{(E)}$ be the number of encoders, and $h_T^{(E_i)}$ be the representation from the $i$th encoder. The integrated compressed representation $h^{(E)}$ is obtained by
\[ h^{(E)} = F_{\mathrm{MLP}}\left(\mathrm{concat}[h_T^{(E_1)}; \ldots; h_T^{(E_i)}; \ldots; h_T^{(E_{L^{(E)}})}]\right), \tag{4} \]
where $F_{\mathrm{MLP}}$ is a fully-connected layer, and $h^{(E)}$ shares the same dimension as each $h_T^{(E_i)}$. During decoding, $L^{(D)}$ decoders are used, each of which follows the same recurrent decoding process. After initializing $h_T^{(k)}$ to $h^{(E)}$, the $k$th decoder $D^{(k)}$ outputs $\{y_T^{(k)}, \ldots, y_1^{(k)}\}$ and $\{h_{T-1}^{(k)}, \ldots, h_1^{(k)}\}$ as:
\[ y_t^{(k)} = W^{(k)} h_t^{(k)} + b^{(k)}, \qquad h_{t-1}^{(k)} = \mathrm{LSTM}^{(k)}([y_t^{(k)}; h_t^{(k)}]), \tag{5} \]
where $W^{(k)}$ and $b^{(k)}$ are learnable parameters. Note that this also suffers from error accumulation as in (3). During inference, outputs from all the decoders are pooled together.
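The combination step in (4) is essentially a concatenation followed by one fully-connected layer; a small sketch (with hypothetical names and shapes) is:

```python
import numpy as np

def combine_encoders(h_list, W_mlp, b_mlp):
    """Sketch of (4): concatenate the encoders' last states and map them back to one state.

    h_list: list of L^(E) arrays of shape (m,), one per encoder.
    W_mlp: (m, L^(E) * m), b_mlp: (m,) form the fully-connected layer F_MLP, so the
    joint representation h^(E) has the same dimension m as each h_T^(E_i).
    """
    z = np.concatenate(h_list)          # concat[h_T^(E_1); ...; h_T^(E_L)]
    return W_mlp @ z + b_mlp            # h^(E)

rng = np.random.default_rng(0)
m, L_E = 8, 3
h_E = combine_encoders([rng.standard_normal(m) for _ in range(L_E)],
                       0.1 * rng.standard_normal((m, L_E * m)), np.zeros(m))
# h_E then initializes every decoder's state h_T^(k) before the per-decoder decoding in (5)
```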
Multiresolution Temporal Modeling
To capture multiresolution temporal information, Hihi and Bengio (1996) developed a hierarchical RNN that integrates multiple delays and time scales in different recurrent neurons. Similarly, to model multiscale structures in text, Hermans and Schrauwen (2013) and Chung, Ahn, and Bengio (2017) introduced the hierarchical multiscale RNN.
This stacks multiple recurrent layers, with each layer receiving hidden states from the previous layer as input. Instead of simply stacking multiple recurrent layers, Liu et al. (2019) proposed a coarse-to-fine procedure for time series imputation. Very recently, the pyramid RNN (Ma et al. 2020) aggregates the multiresolution information from each recurrent layer in a bottom-up manner.

Figure 1: The proposed Recurrent Autoencoder with Multiresolution Ensemble Decoding (RAMED). (In the figure, decoders are arranged from coarser to finer; multiresolution outputs from the top layers are passed downward, and a multiresolution shape-forcing loss is applied.)

... shape-forcing loss based on dynamic time warping (DTW) (Sakoe and Chiba 1978). The following sections describe these various components in more detail.

Decoder Lengths  The $k$th decoder $D^{(k)}$ reconstructs a time series of length $T^{(k)}$, where $T^{(k)} = \alpha^k T$ and ...
... state $h_{t+1}^{(k)}$ with the corresponding slightly-coarser information $h_{\lceil t/\tau \rceil}^{(k+1)}$ from the sibling decoder $D^{(k+1)}$ as:
\[ \hat{h}_t^{(k)} = \beta h_{t+1}^{(k)} + (1-\beta)\, F'_{\mathrm{MLP}}\left(\mathrm{concat}[h_{t+1}^{(k)}; h_{\lceil t/\tau \rceil}^{(k+1)}]\right), \tag{7} \]
where $F'_{\mathrm{MLP}}$ is a two-layer fully-connected network with the PReLU (Parametric Rectified Linear Unit) (He et al. 2015) activation, and $\beta h_{t+1}^{(k)}$ (with $\beta > 0$) plays a similar role as the residual connection (He et al. 2016). Analogous to (5), this $\hat{h}_t^{(k)}$ is then fed into the LSTM cell to generate
\[ h_t^{(k)} = \mathrm{LSTM}^{(k)}([y_{t+1}^{(k)} + \delta;\; \hat{h}_t^{(k)}]), \]
for $t = T^{(k)} - 1, \ldots, 1$.
Finally, the ensemble's reconstructed output can be obtained from the bottom-most decoder as $Y_{\mathrm{recon}} = [y_1^{(1)}, y_2^{(1)}, \ldots, y_T^{(1)}]$ (after reversing to the original time order). To encourage $Y_{\mathrm{recon}}$ to be close to the input $X = [x_1, \ldots, x_T]$, we use the square loss as the reconstruction error:
\[ \mathcal{L}_{\mathrm{MSE}}(X) = \sum_{t=1}^{T} \| y_t^{(1)} - x_t \|^2. \tag{8} \]
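The fusion in (7) can be sketched as follows; the two-layer $F'_{\mathrm{MLP}}$ is shown with a PReLU-style activation, and all weights, names and the choice $\beta = 0.1$ are illustrative only.

```python
import numpy as np
from math import ceil

def prelu(z, a=0.25):
    """PReLU activation: identity for positive inputs, slope a for negative inputs."""
    return np.where(z > 0, z, a * z)

def fuse(h_fine_next, h_coarse, W1, b1, W2, b2, beta=0.1):
    """Sketch of (7): blend the finer state h^(k)_{t+1} with the coarser h^(k+1)_{ceil(t/tau)}."""
    z = np.concatenate([h_fine_next, h_coarse])       # concat[h^(k)_{t+1}; h^(k+1)_{ceil(t/tau)}]
    mlp = W2 @ prelu(W1 @ z + b1) + b2                # two-layer F'_MLP
    return beta * h_fine_next + (1.0 - beta) * mlp    # fused state \hat{h}^(k)_t

# toy usage: the coarser sibling's state is indexed by ceil(t / tau)
rng = np.random.default_rng(0)
m, tau, t = 8, 3, 5
H_coarse = rng.standard_normal((10, m))               # states of decoder k+1 (1-indexed conceptually)
h_hat = fuse(rng.standard_normal(m), H_coarse[ceil(t / tau) - 1],
             0.1 * rng.standard_normal((m, 2 * m)), np.zeros(m),
             0.1 * rng.standard_normal((m, m)), np.zeros(m), beta=0.1)
# h_hat is then fed, together with y^(k)_{t+1}, into the decoder's LSTM cell
```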
Multiresolution Shape-Forcing Loss  To further encourage decoders to learn consistent temporal patterns at different resolutions, we force the decoders' outputs to have similar shapes as the original input by introducing a loss based on dynamic time warping (DTW) (Sakoe and Chiba 1978).
Let the output from decoder $D^{(k)}$ be $Y^{(k)} = [y_1^{(k)}, y_2^{(k)}, \ldots, y_{T^{(k)}}^{(k)}]$. Since $T^{(k)} \neq T$ for $k = 2, \ldots, L^{(D)}$, we define a similarity between time series $X$ and each $Y^{(k)}$ by DTW. The DTW similarity is based on distances along the (sub-)optimal DTW alignment path. Let the alignment be represented by a matrix $A \in \{0, 1\}^{T \times T^{(k)}}$, in which $A_{i,j} = 1$ when $x_i$ is aligned to $y_j^{(k)}$, and zero otherwise, with boundary conditions $A_{1,1} = 1$ and $A_{T,T^{(k)}} = 1$. All valid alignment paths run from the upper-left entry $(1, 1)$ to the lower-right entry $(T, T^{(k)})$ using moves $\downarrow$, $\rightarrow$ or $\searrow$. The alignment costs are stored in a matrix $C$. For simplicity, we use $C_{i,j} = \| x_i - y_j^{(k)} \|$, the Euclidean distance. The DTW distance between $X$ and $Y^{(k)}$ is then:
\[ \mathrm{DTW}(X, Y^{(k)}) = \min_{A \in \mathcal{A}} \langle A, C \rangle, \tag{9} \]
where $\mathcal{A}$ is the set of $T \times T^{(k)}$ binary alignment matrices, and $\langle \cdot, \cdot \rangle$ is the matrix inner product. The DTW distance is non-differentiable due to the $\min$ operator. To integrate DTW into end-to-end training, we replace (9) by the smoothed DTW (sDTW) distance (Cuturi and Blondel 2017):
\[ \mathrm{sDTW}(X, Y^{(k)}) = -\gamma \log\left( \sum_{A \in \mathcal{A}} e^{-\langle A, C \rangle / \gamma} \right), \tag{10} \]
where $\gamma > 0$. This is based on the smoothed min operator $\min^{\gamma}\{a_1, \ldots, a_n\} = -\gamma \log \sum_{i=1}^{n} e^{-a_i / \gamma}$, which reduces to the $\min$ operator when $\gamma$ approaches zero.
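The sDTW distance in (10) is usually evaluated with a dynamic program in which the hard min of (9) is replaced by the smoothed min. A minimal NumPy sketch of the forward pass (using the Euclidean cost $C_{i,j} = \|x_i - y_j^{(k)}\|$ from above; names are illustrative) is:

```python
import numpy as np

def soft_dtw(X, Y, gamma=0.1):
    """Sketch of sDTW(X, Y) in (10) via the usual soft-min dynamic program.

    X: (T, d), Y: (T', d); cost C[i, j] = ||x_i - y_j|| (Euclidean), as in the text.
    softmin(a) = -gamma * log(sum(exp(-a / gamma))) replaces the hard min of (9).
    """
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)   # (T, T') pairwise costs
    T, Tp = C.shape
    R = np.full((T + 1, Tp + 1), np.inf)
    R[0, 0] = 0.0

    def softmin(a):
        a = np.asarray(a)
        m = a.min()                                              # stabilized log-sum-exp
        return m - gamma * np.log(np.exp(-(a - m) / gamma).sum())

    for i in range(1, T + 1):
        for j in range(1, Tp + 1):
            R[i, j] = C[i - 1, j - 1] + softmin([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
    return R[T, Tp]

# toy usage: compare an input of length 64 with a coarser decoder output of length 21
rng = np.random.default_rng(0)
print(soft_dtw(rng.standard_normal((64, 2)), rng.standard_normal((21, 2)), gamma=0.1))
```

In practice this recursion would be implemented in an automatic-differentiation framework so that gradients flow back into the decoders.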
With the sDTW distance, we encourage decoders at different resolutions to output time series with similar temporal characteristics as the input. Here, decoders whose decoding length is less than the length of the whole time series are considered. This leads to the following multiresolution shape-forcing loss:
\[ \mathcal{L}_{\mathrm{shape}}(X) = \frac{1}{L^{(D)} - 1} \sum_{k=2}^{L^{(D)}} \mathrm{sDTW}(X, Y^{(k)}). \tag{11} \]
Given a batch of samples $\{X_b\}_{b=1,2,\ldots,B}$ (where $B$ is the batch size), the total loss is:
\[ \mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \left( \mathcal{L}_{\mathrm{MSE}}(X_b) + \lambda \mathcal{L}_{\mathrm{shape}}(X_b) \right), \tag{12} \]
where $\lambda$ is a trade-off parameter. This can be minimized by stochastic gradient descent or its variants (such as Adam (Kingma and Ba 2015)). The training procedure is shown in Algorithm 1.

Algorithm 1  Recurrent Autoencoder with Multiresolution Ensemble Decoding (RAMED).
Input: a set of time series $\{X_b\}$; batch size $B$; number of encoders $L^{(E)}$; number of decoders $L^{(D)}$; $\tau$.
1: for each decoder, its decoding length is $T^{(k)} = \alpha^k T$;
2: repeat
3:   sample a batch of $B$ time series;
4:   for $b = 1, \ldots, B$ do
5:     feed time series $X_b$ to the encoders and obtain the last hidden states $\{h_{b,T}^{(E)}\}$;
6:     obtain joint representations $\{h^{(E)}\}$ via (4);
7:     for $k = L^{(D)}, L^{(D)}-1, \ldots, 1$ do
8:       run the decoder $D^{(k)}$;
9:       if $k \neq L^{(D)}$ then
10:        perform coarse-to-fine fusion;
11:      end if
12:      obtain updated hidden states $\{h_{b,t}^{(k)}\}$ and outputs $\{y_{b,t}^{(k)}\}$;
13:    end for
14:    minimize (12) by SGD or its variants;
15:  end for
16: until convergence.
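For completeness, a small sketch of how (8), (11) and (12) combine over a batch, given each sample's finest-decoder reconstruction and precomputed sDTW values (e.g. from a routine like the one above); names and toy data are illustrative.

```python
import numpy as np

def total_loss(batch_X, batch_Y1, batch_sdtw, lam=1e-4):
    """Sketch of the batch loss (12), built from (8) and (11).

    batch_X: list of (T, d) inputs X_b.
    batch_Y1: list of (T, d) reconstructions from the finest decoder (k = 1), in original time order.
    batch_sdtw: batch_sdtw[b][j] = sDTW(X_b, Y_b^(k)) for the coarser decoders k = 2, ..., L^(D).
    """
    total = 0.0
    for X, Y1, sdtw_vals in zip(batch_X, batch_Y1, batch_sdtw):
        l_mse = np.sum((Y1 - X) ** 2)      # (8)
        l_shape = np.mean(sdtw_vals)       # (11): average over k = 2, ..., L^(D)
        total += l_mse + lam * l_shape
    return total / len(batch_X)            # (12)

# toy usage with 4 samples and L^(D) = 3 (so two coarser decoders per sample)
rng = np.random.default_rng(0)
X = [rng.standard_normal((64, 2)) for _ in range(4)]
Y1 = [x + 0.05 * rng.standard_normal(x.shape) for x in X]
sdtw = [[1.2, 0.8] for _ in X]
print(total_loss(X, Y1, sdtw, lam=1e-4))
```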
Anomaly Score and Detection
Given a time series $X = [x_1, x_2, \ldots, x_T]$ and its reconstruction $Y = [y_1, y_2, \ldots, y_T]$, the reconstruction error at time $t$ is $e(t) = y_t - x_t$. We then fit a normal distribution $\mathcal{N}(\mu, \Sigma)$ using the set of $\{e(t)\}$ from all time steps and all time series in the validation set.
On inference, the probability that $x_t$ from an unseen time series in the test set is anomalous is defined as:
\[ 1 - \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left( -\frac{1}{2} (e(t) - \mu)^\top \Sigma^{-1} (e(t) - \mu) \right). \]
Thus, we can take
\[ s(t) = (e(t) - \mu)^\top \Sigma^{-1} (e(t) - \mu) \tag{13} \]
as $x_t$'s anomaly score. When this is greater than a predefined threshold, $x_t$ is classified as an anomaly.
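A minimal sketch of this scoring procedure (fit $\mathcal{N}(\mu, \Sigma)$ on validation errors, then score test errors with the Mahalanobis term in (13)); the small ridge added to $\Sigma$ for invertibility and the example threshold are implementation details not specified in the text.

```python
import numpy as np

def fit_error_gaussian(errors, eps=1e-6):
    """Fit N(mu, Sigma) to validation reconstruction errors e(t) (one error vector per row)."""
    mu = errors.mean(axis=0)
    diff = errors - mu
    Sigma = diff.T @ diff / len(errors) + eps * np.eye(errors.shape[1])  # ridge for invertibility
    return mu, Sigma

def anomaly_scores(errors, mu, Sigma):
    """Mahalanobis score s(t) = (e(t) - mu)^T Sigma^{-1} (e(t) - mu), as in (13)."""
    diff = errors - mu
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)

# toy usage: errors e(t) = y_t - x_t collected on validation data, then scored on test data
rng = np.random.default_rng(0)
val_err = 0.1 * rng.standard_normal((500, 2))
test_err = 0.1 * rng.standard_normal((100, 2))
mu, Sigma = fit_error_gaussian(val_err)
s = anomaly_scores(test_err, mu, Sigma)
flags = s > 9.0          # flag points whose score exceeds a predefined threshold (value illustrative)
```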
Experiments
In this section, experiments are performed on the following nine commonly-used real-world time series benchmarks¹ (Table 1):
1. ECG: This is a collection of 6 data sets on the detection of anomalous beats from electrocardiogram readings.
2. 2D-gesture: This contains time series of X-Y coordinates of an actor's right hand. The data is extracted from a video in which the actor grabs a gun from his hip-mounted holster, moves it to the target, and returns it to the holster. The anomalous region is in the area where the actor fails to return his gun to the holster.
3. Power-demand: This contains one year of power consumption records measured by a Dutch research facility in 1997.
4. Yahoo's S5 Webscope: This contains records from real production traffic of the Yahoo website. Anomalies are manually labeled by human experts.

¹ECG, 2D-gesture and Power-demand are from http://www.cs.ucr.edu/~eamonn/discords/, while Yahoo's S5 is from https://webscope.sandbox.yahoo.com/.

ECG and 2D-gesture are bivariate time series ($d = 2$), while Power-demand and Yahoo's S5 are univariate ($d = 1$). For each of ECG, 2D-gesture and Power-demand, the public data set includes a training set (containing only normal data) and a test set. We use 30% of the training set for validation, and the rest for actual training. The model with the lowest reconstruction loss on the validation set is selected for evaluation. For Yahoo's S5, the available data set is split into three parts: 40% of the samples for training, another 30% for validation, and the remaining 30% for testing. The training set contains unknown anomalies, and we use the model with the highest AUROC value on the validation set for evaluation.
The time series are partitioned into length-$T$ sequences by using a sliding window. The sliding window has a stride of 32 on the ECG data sets, 512 on Power-demand, and 64 on 2D-gesture and Yahoo's S5. Table 1 shows the sequence length $T$, number of sequences in the training/validation/testing set, and percentage of anomalous samples in the test set.

Table 1: Sequence length T, number of training/validation/testing sequences, and percentage of anomalous samples in the test set.
  Dataset                   T    #Training  #Validation  #Testing  %Anomaly
  ECG
    (A) chfdb_chf01_275     64        40         17          59      14.61
    (B) chfdb_chf13_45590   64        53         22          40      12.35
    (C) chfdbchf15          64       237        101         104       4.45
    (D) ltstdb_20221_43     64        57         24          35      11.51
    (E) ltstdb_20321_240    64        43         18          45       9.61
    (F) mitdb_100_180       64        64         27          70       8.38
  2D-gesture                64        91         39          47      24.63
  Power-demand             512        25         11          29      11.44
  Yahoo's S5               128       659        398         394       3.20
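The sliding-window preprocessing can be sketched as follows (function name and the example data are illustrative; the window/stride values are the ones stated above).

```python
import numpy as np

def sliding_windows(series, window, stride):
    """Split an (N, d) series into overlapping length-`window` segments with the given stride."""
    starts = range(0, len(series) - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])

# e.g. ECG: T = 64 with stride 32; Power-demand: T = 512 with stride 512 (non-overlapping)
rng = np.random.default_rng(0)
ecg_like = rng.standard_normal((1000, 2))
segments = sliding_windows(ecg_like, window=64, stride=32)
print(segments.shape)    # (30, 64, 2)
```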
Baselines  The proposed RAMED model is compared with four recent anomaly detection baselines:² (i) recurrent autoencoder (RAE) (Malhotra et al. 2016); (ii) recurrent reconstructive network (RRN) (Yoo, Kim, and Kim 2019), which combines attention, skip transition and a state-forcing regularizer; (iii) recurrent autoencoder ensemble (RAE-ensemble) (Kieu et al. 2019), which uses an ensemble of RNNs with sparse skip connections as encoders and decoders; and (iv) BeatGAN (Zhou et al. 2019), which is a recent CNN autoencoder-based generative adversarial network (GAN) (Goodfellow et al. 2014) for time series anomaly detection.

²RAE and RRN are downloaded from https://github.com/YongHoYoo/AnomalyDetection, BeatGAN is from https://github.com/Vniex/BeatGAN, and RAE-ensemble is from https://github.com/tungk/OED

Evaluation Metrics  Performance measures such as precision and recall depend on thresholding the anomaly score. To avoid setting this threshold, we use the following metrics, which have been widely used in anomaly detection (Wang et al. 2019; Ren et al. 2019; Li et al. 2020; Su et al. 2019): (i) area under the ROC curve (AUROC), (ii) area under the precision-recall curve (AUPRC), and (iii) the highest F1-score (denoted F1best) (Li et al. 2020; Su et al. 2019), which is selected from 1000 thresholds uniformly distributed from 0 to the maximum anomaly score over all time steps in the test set (Yoo, Kim, and Kim 2019).
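A direct reading of the F1best computation described above is sketched below (point-wise labels assumed; the data are toy examples only).

```python
import numpy as np

def best_f1(scores, labels, num_thresholds=1000):
    """Best point-wise F1 over `num_thresholds` thresholds uniformly spaced in [0, max score]."""
    best = 0.0
    for th in np.linspace(0.0, scores.max(), num_thresholds):
        pred = scores > th
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp == 0:
            continue
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# toy usage: anomalies receive systematically higher scores
rng = np.random.default_rng(0)
labels = (rng.random(500) < 0.05).astype(int)
scores = rng.random(500) + labels
print(round(best_f1(scores, labels), 3))
```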
Implementation Details  We use 3 encoders and 3 decoders. Each encoder and decoder is a single-layer LSTM with 64 units. We perform grid search on the hyperparameter $\beta$ in (7) from $\{0.1, 0.2, \ldots, 0.9\}$ and $\lambda$ in (12) from $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$; $\tau$ in (6) is set to 3 and $\gamma$ in (10) is set to 0.1. The Adam optimizer (Kingma and Ba 2015) is used with an initial learning rate of $10^{-3}$.
ECG 2D- Power- Yahoo’s avg
metric method gesture demand S5 rank
A B C D E F
BeatGAN 0.6566 0.7056 0.7329 0.6173 0.8160 0.4223 0.7256 0.5796 0.8728 4.33
RAE 0.6728 0.7502 0.8289 0.5452 0.7970 0.4715 0.7601 0.6122 0.8823 3.78
AUROC RRN 0.6393 0.7623 0.7405 0.6318 0.8101 0.4531 0.7530 0.6607 0.8869 3.44
RAE-ensemble 0.6884 0.7788 0.8570 0.6400 0.8035 0.5213 0.7808 0.6587 0.8850 2.44
RAMED 0.7127 0.8551 0.8736 0.6473 0.8828 0.6399 0.7839 0.6787 0.8942 1.00
BeatGAN 0.5197 0.4101 0.2254 0.1613 0.3342 0.0778 0.4952 0.1228 0.4702 4.44
RAE 0.5501 0.4249 0.4996 0.1435 0.2126 0.0894 0.4979 0.1350 0.4782 3.78
AUPRC RRN 0.5260 0.5653 0.4139 0.1652 0.3206 0.0833 0.4866 0.1446 0.4794 3.22
RAE-ensemble 0.5549 0.4769 0.5256 0.2026 0.2798 0.0948 0.5287 0.1400 0.4783 2.56
RAMED 0.5803 0.7008 0.5486 0.2203 0.3784 0.1253 0.5331 0.1627 0.4809 1.00
BeatGAN 0.5102 0.4204 0.2931 0.2502 0.4776 0.1562 0.4941 0.2266 0.4484 4.44
RAE 0.5478 0.4736 0.5046 0.2193 0.3886 0.1581 0.5300 0.2798 0.4473 3.78
F1best RRN 0.5440 0.5502 0.4537 0.2621 0.4548 0.1562 0.5240 0.2926 0.4502 3.00
RAE-ensemble 0.5479 0.5016 0.5333 0.2735 0.3910 0.1602 0.5511 0.2678 0.4497 2.67
RAMED 0.5762 0.6871 0.5541 0.3466 0.4855 0.2090 0.5633 0.2934 0.4502 1.00
Table 2: Anomaly detection results (the larger the better). The best results are highlighted. Average rank (the smaller the better)
is recorded in the last column.
Figure 2: Effect of varying $\beta$ ((a) ECG(A); (b) Gesture).

Figure 3: Effect of varying $\lambda$ ((a) ECG(A); (b) Gesture).

... $\beta = 0.1$, $\lambda = 10^{-4}$ and $L^{(D)} = 3$. Experiments are performed on the ECG(A) and Gesture data sets.
Figure 2 shows the AUROCs at different $\beta$'s. As can be seen, the performance w.r.t. $\beta$ is relatively stable. When $\beta$ is set to 0.1, the proposed model achieves the best performance. This is because when $\beta$ is small, more coarse-grained information can be used to help temporal modeling at the higher-resolution levels, whereas a larger $\beta$ may ignore the coarse-grained information and degrade performance.
Figure 3 shows the AUROCs at different $\lambda$'s. As can be seen, when $\lambda$ is small ($10^{-4}$) but nonzero, better performance is achieved.
Table 4 shows the results with different numbers of decoders. As can be seen, increasing $L^{(D)}$ (from 1 to 3) improves performance, as more abundant multiresolution temporal patterns are involved. However, when $L^{(D)}$ increases to 4, performance is degraded. This is because when $L^{(D)} = 4$, the coarsest decoder has a decoding length of only 2, and cannot provide useful global temporal information.

  ECG(A)        T^(k)'s        AUROC   AUPRC   F1best
  L^(D) = 1     64             0.6781  0.5305  0.5275
  L^(D) = 2     21, 64         0.6976  0.5828  0.5708
  L^(D) = 3     7, 21, 64      0.7172  0.5803  0.5762
  L^(D) = 4     2, 7, 21, 64   0.6743  0.5132  0.5196

  Gesture       T^(k)'s        AUROC   AUPRC   F1best
  L^(D) = 1     64             0.7656  0.5292  0.5448
  L^(D) = 2     21, 64         0.7693  0.5080  0.5462
  L^(D) = 3     7, 21, 64      0.7839  0.5331  0.5633
  L^(D) = 4     2, 7, 21, 64   0.7779  0.5153  0.5545

Table 4: Effect of varying $L^{(D)}$ on ECG(A) and Gesture. $T^{(k)}$'s are the decoding lengths of the various decoders.
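The exact definition of the decoding-length schedule $T^{(k)} = \alpha^k T$ is cut off in this extract, so the following sketch is only one reading that reproduces the lengths listed in Table 4 (7, 21, 64 for $L^{(D)} = 3$ and 2, 7, 21, 64 for $L^{(D)} = 4$, with $T = 64$): a geometric schedule with ratio $\alpha = 1/3$ in which the finest decoder ($k = 1$) decodes the full length.

```python
def decoding_lengths(T, num_decoders, alpha=1/3):
    """Assumed geometric schedule T^(k) ~ alpha^(k-1) * T for k = 1, ..., L^(D).

    With T = 64 and alpha = 1/3 this gives [64, 21, 7] for L^(D) = 3 and
    [64, 21, 7, 2] for L^(D) = 4, matching the decoding lengths in Table 4.
    """
    return [max(1, round(alpha ** (k - 1) * T)) for k in range(1, num_decoders + 1)]

print(decoding_lengths(64, 3))   # [64, 21, 7]
print(decoding_lengths(64, 4))   # [64, 21, 7, 2]
```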
Acknowledgments
This work was partially funded by the Foreign Science and Technology Cooperation Program of Huangpu District of Guangzhou (No. 2018GH09, 2019-2020), the National Natural Science Foundation of China (Grant Nos. 61502174, 61872148), the Natural Science Foundation of Guangdong Province (Grant Nos. 2017A030313355, 2019A1515010768), the Guangzhou Science and Technology Planning Project (Grant Nos. 201704030051, 201902010020), the Key R&D Program of Guangdong Province (No. 2018B010107002), and the Fundamental Research Funds for the Central Universities.

References
Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
Chia, C.-C.; and Syed, Z. 2014. Scalable noise mining in long-term electrocardiographic time-series to predict death following heart attacks. In International Conference on Knowledge Discovery and Data Mining.
Chung, J.; Ahn, S.; and Bengio, Y. 2017. Hierarchical multiscale recurrent neural networks. In International Conference on Learning Representations.
Cook, A. A.; Misirli, G.; and Fan, Z. 2020. Anomaly detection for IoT time-series data: A survey. IEEE Internet of Things Journal 7(7): 6481–6494.
Cuturi, M.; and Blondel, M. 2017. Soft-DTW: A differentiable loss function for time-series. In International Conference on Machine Learning.
Dietterich, T. G. 2000. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems.
Ding, Z.; Yang, B.; Chi, Y.; and Guo, L. 2016. Enabling smart transportation systems: A parallel spatio-temporal database approach. IEEE Transactions on Computers 65(5): 1377–1391.
Filonov, P.; Lavrentyev, A.; and Vorontsov, A. 2016. Multivariate industrial time series with cyber-attack simulation: Fault detection using an LSTM-based predictive data model. Preprint arXiv:1612.06676.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems.
Gupta, M.; Gao, J.; Aggarwal, C. C.; and Han, J. 2014. Outlier detection for temporal data: A survey. IEEE Transactions on Knowledge and Data Engineering 26(9): 2250–2267.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
Hermans, M.; and Schrauwen, B. 2013. Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems.
Hihi, S. E.; and Bengio, Y. 1996. Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems.
Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8): 1735–1780.
Kieu, T.; Yang, B.; Guo, C.; and Jensen, C. S. 2019. Outlier detection for time series with recurrent autoencoder ensembles. In International Joint Conferences on Artificial Intelligence.
Kieu, T.; Yang, B.; and Jensen, C. S. 2018. Outlier detection for multidimensional time series using deep neural networks. In IEEE International Conference on Mobile Data Management.
Kingma, D. P.; and Ba, J. L. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
Le Guen, V.; and Thome, N. 2019. Shape and time distortion loss for training deep time series forecasting models. In Advances in Neural Information Processing Systems.
Li, L.; Yan, J.; Wang, H.; and Jin, Y. 2020. Anomaly detection of time series with smoothness-inducing sequential variational auto-encoder. IEEE Transactions on Neural Networks 1–15.
Liu, Y.; Yu, R.; Zheng, S.; Zhan, E.; and Yue, Y. 2019. NAOMI: Non-autoregressive multiresolution sequence imputation. In Advances in Neural Information Processing Systems.
Ma, Q.; Lin, Z.; Chen, E.; and Garrison, C. 2020. Temporal pyramid recurrent neural network. In AAAI Conference on Artificial Intelligence.
Ma, Q.; Zheng, J.; Li, S.; and Cottrell, G. W. 2019. Learning representations for time series clustering. In Advances in Neural Information Processing Systems.
Malhotra, P.; Ramakrishnan, A.; Anand, G.; Vig, L.; Agarwal, P.; and Shroff, G. M. 2016. LSTM-based encoder-decoder for multisensor anomaly detection. Preprint arXiv:1607.00148.
Ren, H.; Xu, B.; Wang, Y.; Yi, C.; Huang, C.; Kou, X.; Xing, T.; Yang, M.; Tong, J.; and Zhang, Q. 2019. Time-series anomaly detection service at Microsoft. In International Conference on Knowledge Discovery & Data Mining.
Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S. A.; Binder, A.; Müller, E.; and Kloft, M. 2018. Deep one-class classification. In International Conference on Machine Learning.
Sakoe, H.; and Chiba, S. 1978. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26(1): 159–165.
Su, Y.; Zhao, Y.; Niu, C.; Liu, R.; Sun, W.; and Pei, D.
2019. Robust anomaly detection for multivariate time se-
ries through stochastic recurrent neural network. In Interna-
tional Conference on Knowledge Discovery & Data Mining.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence
to sequence learning with neural networks. In Advances in
Neural Information Processing Systems.
Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P.-A.
2008. Extracting and composing robust features with de-
noising autoencoders. In International Conference on Ma-
chine Learning.
Wang, S.; Zeng, Y.; Liu, X.; Zhu, E.; Yin, J.; Xu, C.; and
Kloft, M. 2019. Effective end-to-end unsupervised outlier
detection via inlier priority of discriminative network. In
Advances in Neural Information Processing Systems.
Wold, H. 1938. A study in analysis of stationary time series.
Journal of the Royal Statistical Society 102(2): 295–298.
Yoo, Y.; Kim, U.; and Kim, J. 2019. Recurrent reconstruc-
tive network for sequential anomaly detection. IEEE Trans-
actions on Cybernetics 1–12.
Zhang, C.; Song, D.; Chen, Y.; Feng, X.; Lumezanu, C.;
Cheng, W.; Ni, J.; Zong, B.; Chen, H.; and Chawla, N. 2019.
A deep neural network for unsupervised anomaly detection
and diagnosis in multivariate time series data. In AAAI Con-
ference on Artificial Intelligence.
Zhou, B.; Liu, S.; Hooi, B.; Cheng, X.; and Ye, J. 2019.
BeatGAN: Anomalous rhythm detection using adversarially
generated time series. In International Joint Conferences on
Artificial Intelligence.