
The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Time Series Anomaly Detection with Multiresolution Ensemble Decoding


Lifeng Shen 1, Zhongzhong Yu 2, Qianli Ma 2,3, James T. Kwok 1
1 Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong
2 School of Computer Science and Engineering, South China University of Technology, Guangzhou
3 Key Laboratory of Big Data and Intelligent Robot (South China University of Technology), Ministry of Education
[email protected], [email protected], [email protected], [email protected]

Abstract

Recurrent autoencoder is a popular model for time series anomaly detection, in which outliers or abnormal segments are identified by their high reconstruction errors. However, existing recurrent autoencoders can easily suffer from overfitting and error accumulation due to sequential decoding. In this paper, we propose a simple yet efficient recurrent network ensemble called Recurrent Autoencoder with Multiresolution Ensemble Decoding (RAMED). By using decoders with different decoding lengths and a new coarse-to-fine fusion mechanism, lower-resolution information can help long-range decoding for decoders with higher-resolution outputs. A multiresolution shape-forcing loss is further introduced to encourage decoders' outputs at multiple resolutions to match the input's global temporal shape. Finally, the output from the decoder with the highest resolution is used to obtain an anomaly score at each time step. Extensive empirical studies on real-world benchmark data sets demonstrate that the proposed RAMED model outperforms recent strong baselines on time series anomaly detection.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Anomaly detection aims to identify anomalous patterns from data. In particular, time series anomaly detection has received a lot of attention in the past decade (Gupta et al. 2014; Cook, Misirli, and Fan 2020). Time series data can be easily found in many real-world applications. One example is cyber-physical systems such as smart buildings, factories, and power plants (Chia and Syed 2014; Ding et al. 2016), in which there are a large number of sensors. Efficient and robust time series anomaly detection can help monitor system behaviors such that potential risks and financial losses can be avoided. However, detecting outliers from time series data is challenging. First, finding and labeling anomalies are very time-consuming and expensive in practice. Moreover, time series data usually have complex nonlinear and high-dimensional dynamics that are difficult to model. To alleviate the first issue, time series anomaly detection is usually formulated as a one-class classification problem (Ruff et al. 2018; Zhou et al. 2019), in which the training set contains only normal samples.

Existing time series anomaly detection techniques can be roughly categorized as either predictive or reconstruction-based. Classical predictive models include the autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) models (Wold 1938). They are linear regressors and use the prediction error as the anomaly score. More recently, recurrent neural networks (RNNs) (Zhang et al. 2019) and other deep predictors (Filonov, Lavrentyev, and Vorontsov 2016) are also used. However, these methods depend largely on the models' extrapolation capacities (Yoo, Kim, and Kim 2019). On the other hand, reconstruction-based methods learn a compressed representation for the core statistical structures of normal data, and then use this to reconstruct the time series. Points or segments that cannot be well reconstructed are considered as outliers. Reconstruction methods are popular in applications such as anomalous rhythm detection (Zhou et al. 2019) and network traffic monitoring (Kieu, Yang, and Jensen 2018). In this paper, we focus on reconstruction-based methods.

In deep learning, the recurrent auto-encoder (RAE) (Malhotra et al. 2016) has demonstrated good performance in time series anomaly detection. Following the well-known sequence-to-sequence framework (Sutskever, Vinyals, and Le 2014), the RAE consists of an encoder and a decoder. The reconstruction error at each time step is used as an anomaly score. Very recently, other autoencoder variants have also been proposed. For example, Yoo, Kim, and Kim (2019) developed the recurrent reconstructive network (RRN), which uses self-attention and feedback transition to help capture the temporal dynamics. Kieu et al. (2019) proposed the recurrent autoencoder ensemble based on ensemble learning (Dietterich 2000). Both the encoders and decoders consist of several RNNs with sparse skip connections. On inference, the median reconstruction error from all decoders is used as the anomaly score. However, the RAE and its variants can have difficulties in decoding long time series due to error accumulation from previous time steps.

In this paper, we propose the Recurrent Autoencoder with Multiresolution Ensemble Decoding (RAMED). Inspired by (Kieu et al. 2019), RAMED also has an ensemble of decoders. However, the difference is that the proposed decoders capture the time series' temporal information at multiple resolutions. This is achieved by controlling the number of decoding steps in the decoders.
With a short decoding length, the decoder has to focus on macro temporal characteristics such as trend patterns and seasonality; whereas a decoder with a long decoding length can capture more detailed local temporal patterns. Furthermore, instead of simply averaging the decoder outputs as in (Kieu et al. 2019), the lower-resolution temporal information is used to guide decoding at a higher resolution. Specifically, we introduce a multiresolution shape-forcing loss to encourage the decoders to match the input's global temporal shape at multiple resolutions. This avoids overfitting the nonlinear local patterns at a higher resolution, and alleviates error accumulation during decoding. Finally, the output from the highest resolution (whose decoding length equals the length of the whole time series) is used as the ensemble output.

Our main contributions can be summarized as follows:

• We present the novel recurrent autoencoder RAMED, with multiple decoders of different decoding lengths. By introducing a shape-forcing reconstruction loss, decoders can capture temporal characteristics of the time series at multiple resolutions.

• We introduce a fusion mechanism to integrate multiresolution temporal information from multiple decoders.

• We conduct extensive empirical studies on time series anomaly detection. Results demonstrate that the proposed model outperforms recent strong baselines.

Related Work

Autoencoders for Time Series Anomaly Detection

The sequence-to-sequence model (Sutskever, Vinyals, and Le 2014) is a popular auto-encoding approach for sequential data. There are two steps in its learning procedure: (i) encoding, which compresses the sequential data into a fixed-length representation; and (ii) decoding, which reconstructs the original input from the learned compressed representation. The sequence-to-sequence model has been widely used in natural language processing (Bahdanau, Cho, and Bengio 2015). Recently, it is also used in time series applications such as prediction (Le Guen and Thome 2019), clustering (Ma et al. 2019) and anomaly detection (Malhotra et al. 2016; Yoo, Kim, and Kim 2019; Kieu et al. 2019). Two recent representative sequence-to-sequence models for time series anomaly detection are the recurrent auto-encoder (RAE) (Malhotra et al. 2016) and the recurrent reconstructive network (RRN) (Yoo, Kim, and Kim 2019).

Time Series Encoding  The recurrent neural network (RNN) is often used to encode time series data. Let X = [x_1, x_2, ..., x_T], where x_t ∈ R^d, be a time series of length T. At time t, the encoder's hidden state h_t^{(E)} is updated as:

    h_t^{(E)} = f^{(E)}([x_t; h_{t-1}^{(E)}]),    (1)

where f^{(E)} is a nonlinear function. The h_T^{(E)} at the last time step T is then used as X's compressed representation.

A popular choice for f^{(E)} is the long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997). Kieu et al. (2019) added sparse skip connections to the RNN cells so that additional hidden states in the past can be considered. Specifically, f^{(E)} uses not only the immediate previous state h_{t-1}^{(E)}, but also h_{t-s}^{(E)} for some skip length s > 1:

    h_t^{(E)} = f^{(E)}([x_t; (w_1 h_{t-1}^{(E)} + w_2 h_{t-s}^{(E)}) / (|w_1| + |w_2|)]),    (2)

where the coefficients (w_1, w_2) are randomly sampled from {(1, 0), (0, 1), (1, 1)} at each time step. In (Kieu et al. 2019), the skip length s is randomly sampled from [1, 10] and fixed before training.
length representation; and (ii) decoding, which reconstructs h(E) is obtained by
the original input from the learned compressed representa- 
(E ) (E ) (E (E) )

tion. The sequence-to-sequence model has been widely used h(E) = FMLP concat[hT 1 ;. . .; hT i ;. . .; hT L ] , (4)
in natural language processing (Bahdanau, Cho, and Ben-
gio 2015). Recently, it is also used in time series applica- where FMLP is a fully-connected layer, and h(E) shares
(E )
tions such as prediction (Le Guen and Thome 2019), clus- the same dimension as each hT i . During decoding, L(D)
tering (Ma et al. 2019) and anomaly detection (Malhotra decoders are used, with each of which following the
et al. 2016; Yoo, Kim, and Kim 2019; Kieu et al. 2019). (k)
same recurrent decoding process. After initializing hT to
Two recent representative sequence-to-sequence models for (k) (k)
h(E) , the kth decoder D(k) outputs {yT , . . . , y1 } and
time series anomaly detection are the recurrent auto-encoder
{hT −1 , . . . , h1 } as:
(RAE) (Malhotra et al. 2016) and recurrent reconstructive
network (RRN) (Yoo, Kim, and Kim 2019). (k) (k)
yt = W(k) ht + b(k) ,
(k) (k) (k) (5)
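The reverse-order decoding in (3) can be sketched as follows (again a hedged PyTorch sketch, assuming a single-layer LSTMCell and a linear output layer; it is meant to make the error-accumulation issue concrete, since each output y_t is fed back as the next input):

import torch
import torch.nn as nn

class ReverseDecoder(nn.Module):
    """Sketch of Eq. (3): decode [y_T, y_{T-1}, ..., y_1] from h_T^(E)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, input_dim)       # y_t = W h_t + b

    def forward(self, h_T, T):                            # h_T: (1, hidden_dim)
        h, c = h_T, torch.zeros_like(h_T)
        outputs = []
        for _ in range(T):
            y = self.out(h)                               # y_t = W h_t + b
            # y_t is fed back as the next LSTM input, so any reconstruction
            # error at this step propagates to all later decoding steps
            h, c = self.cell(y, (h, c))                   # h_{t-1} = LSTM([y_t; h_t])
            outputs.append(y)
        return torch.stack(outputs[::-1], dim=0)          # reverse back to y_1, ..., y_T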
Encoder-Decoder Ensemble  In the recurrent autoencoder ensemble (RAE-ensemble) (Kieu et al. 2019), multiple recurrent encoders and decoders are used. Let L^{(E)} be the number of encoders, and h_T^{(E_i)} be the representation from the ith encoder. The integrated compressed representation h^{(E)} is obtained by

    h^{(E)} = F_MLP(concat[h_T^{(E_1)}; ...; h_T^{(E_i)}; ...; h_T^{(E_{L^{(E)}})}]),    (4)

where F_MLP is a fully-connected layer, and h^{(E)} shares the same dimension as each h_T^{(E_i)}. During decoding, L^{(D)} decoders are used, each of which follows the same recurrent decoding process. After initializing h_T^{(k)} to h^{(E)}, the kth decoder D^{(k)} outputs {y_T^{(k)}, ..., y_1^{(k)}} and {h_{T-1}^{(k)}, ..., h_1^{(k)}} as:

    y_t^{(k)} = W^{(k)} h_t^{(k)} + b^{(k)},
    h_{t-1}^{(k)} = LSTM^{(k)}([y_t^{(k)}; h_t^{(k)}]),    (5)

where W^{(k)} and b^{(k)} are learnable parameters. Note that this also suffers from error accumulation as in (3). During inference, outputs from all the decoders are pooled together.
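The encoder fusion in (4) simply concatenates the L^{(E)} final encoder states and maps the result back to the common hidden dimension. A minimal sketch (layer and variable names are ours, not from the original implementation):

import torch
import torch.nn as nn

class EncoderFusion(nn.Module):
    """Sketch of Eq. (4): concatenate encoder states, then one fully-connected layer."""
    def __init__(self, hidden_dim, num_encoders):
        super().__init__()
        self.fc = nn.Linear(num_encoders * hidden_dim, hidden_dim)   # F_MLP

    def forward(self, encoder_states):       # list of L^(E) tensors, each (1, hidden_dim)
        return self.fc(torch.cat(encoder_states, dim=-1))            # h^(E)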
Multiresolution Temporal Modeling

To capture multiresolution temporal information, Hihi and Bengio (1996) developed a hierarchical RNN that integrates multiple delays and time scales in different recurrent neurons. Similarly, to model multiscale structures in text, Hermans and Schrauwen (2013) and Chung, Ahn, and Bengio (2017) introduced the hierarchical multiscale RNN. This stacks multiple recurrent layers, with each layer receiving hidden states from the previous layer as input. Instead of simply stacking multiple recurrent layers, Liu et al. (2019) proposed a coarse-to-fine procedure for time series imputation. Very recently, the pyramid RNN (Ma et al. 2020) aggregates the multiresolution information from each recurrent layer in a bottom-up manner.
Figure 1: The proposed Recurrent Autoencoder with Multiresolution Ensemble Decoding (RAMED). Stage I: encoding of the input time series; Stage II: multiresolution ensemble decoding, with coarser-to-finer decoders, a multiresolution shape-forcing loss on the outputs from the top layers, and an MSE loss on the reversed output time series.

Proposed Architecture

In this paper, we utilize multiresolution temporal information by integrating it with a coarse-to-fine decoding process. Figure 1 shows the proposed model (with L^{(E)} = L^{(D)} = 3), which will be called Recurrent Autoencoder with Multiresolution Ensemble Decoding (RAMED).

Multiresolution Ensemble Decoding

As in (Kieu et al. 2019), RAMED uses an ensemble of L^{(E)} RNN encoders with the encoding process in (4). At the lowest resolution layer, macro temporal characteristics of the time series are captured. This is then passed to the next layer (with a higher decoding resolution), and so on.

Subsequently, multiple reconstructions are obtained by running L^{(D)} recurrent decoders on the compressed representation h^{(E)}. To encourage different decoders to capture temporal behaviors of the time series at different resolutions, we use different numbers of decoding steps for the decoders. A decoder with a short decoding length has to focus on the macro temporal characteristics, whereas a decoder with a long decoding length can capture more detailed local temporal patterns. A multiresolution fusion strategy is used to efficiently fuse the decoder outputs in a coarse-to-fine manner. Moreover, the decoded output is encouraged to be similar to the input time series by using a differentiable shape-forcing loss based on dynamic time warping (DTW) (Sakoe and Chiba 1978). The following sections describe these various components in more detail.

Decoder Lengths  The kth decoder D^{(k)} reconstructs a time series of length T^{(k)}, where T^{(k)} = α_k T and

    α_k = 1/τ^{k-1} ∈ (0, 1],    (6)

for some τ > 1 (τ = 2 in Figure 1). Note that α_1 = 1 and T^{(1)} = T. We require T^{(L^{(D)})} ≥ 2, so that the decoder at the top takes at least two decoding steps.

To improve robustness, as in the denoising autoencoder (Vincent et al. 2008), we add a small amount of noise εδ to the LSTM's input, where ε is a small scalar (10^{-4} in the experiments), and δ is random noise drawn from the standard normal distribution N(0, 1).
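As a small illustration of (6): with T = 64 and τ = 3 (the setting used in the experiments), rounding α_k T to the nearest integer gives decoding lengths that match those listed in Table 4 (the exact rounding rule is our assumption; the paper only states T^{(k)} = α_k T):

def decoding_lengths(T, tau, num_decoders):
    # T^(k) = alpha_k * T with alpha_k = 1 / tau**(k-1), Eq. (6);
    # each length is kept >= 2 so the coarsest decoder takes at least two steps
    return [max(2, round(T / tau ** (k - 1))) for k in range(1, num_decoders + 1)]

print(decoding_lengths(64, 3, 4))   # -> [64, 21, 7, 2]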
Coarse-to-Fine Fusion  Since the outputs from different decoders have different lengths, they cannot be summarized into an ensemble output by simply using the average or median as in (Kieu et al. 2019). In the following, we propose a simple yet efficient multiresolution coarse-to-fine strategy to fuse the coarser-grained decoders with the finer-grained decoders.

Consider two decoders D^{(k+1)} and D^{(k)}. Note from (6) that T^{(k)} = τ T^{(k+1)} > T^{(k+1)}, and so information extracted from D^{(k+1)} is coarser than that from D^{(k)}. In other words, the decoder at the top (k = L^{(D)}), with output {h_1^{(L^{(D)})}, ..., h_{T^{(L^{(D)})}}^{(L^{(D)})}} obtained via (5), is the coarsest among all decoders.
For the other decoders D^{(k)} (k = L^{(D)}−1, ..., 1), instead of using (5), each decoder first combines its previous hidden state h_{t+1}^{(k)} with the corresponding slightly-coarser information h_{⌈t/τ⌉}^{(k+1)} from the sibling decoder D^{(k+1)} as:

    ĥ_t^{(k)} = β h_{t+1}^{(k)} + (1−β) F'_MLP(concat[h_{t+1}^{(k)}; h_{⌈t/τ⌉}^{(k+1)}]),    (7)

where F'_MLP is a two-layer fully-connected network with the PReLU (Parametric Rectified Linear Unit) (He et al. 2015) activation, and β h_{t+1}^{(k)} (with β > 0) plays a similar role as the residual connection (He et al. 2016). Analogous to (5), this ĥ_t^{(k)} is then fed into the LSTM cell to generate

    h_t^{(k)} = LSTM^{(k)}([y_{t+1}^{(k)} + εδ; ĥ_t^{(k)}]),

for t = T^{(k)}−1, ..., 1.
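A minimal sketch of the fusion step in (7) for one decoding step of D^{(k)} (a hedged PyTorch sketch, assuming F'_MLP is the two-layer PReLU network described above; all names are ours):

import math
import torch
import torch.nn as nn

class CoarseToFineFusion(nn.Module):
    """Sketch of Eq. (7): mix h_{t+1}^(k) with the coarser state h_{ceil(t/tau)}^(k+1)."""
    def __init__(self, hidden_dim, beta=0.1):
        super().__init__()
        self.beta = beta
        self.mlp = nn.Sequential(                       # F'_MLP: two layers with PReLU
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.PReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, h_next_fine, coarse_states, t, tau):
        # coarse_states[j-1] holds h_j^(k+1) from the sibling (coarser) decoder
        h_coarse = coarse_states[math.ceil(t / tau) - 1]
        fused = self.mlp(torch.cat([h_next_fine, h_coarse], dim=-1))
        return self.beta * h_next_fine + (1.0 - self.beta) * fused   # hat{h}_t^(k)

The fused state then replaces h_{t+1}^{(k)} as the recurrent state when the LSTM cell consumes the (noise-perturbed) previous output y_{t+1}^{(k)}, as in the update above.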
Finally, the ensemble's reconstructed output can be obtained from the bottom-most decoder as Y_recon = [y_1^{(1)}, y_2^{(1)}, ..., y_T^{(1)}] (after reversing to the original time order). To encourage Y_recon to be close to input X = [x_1, ..., x_T], we use the square loss as the reconstruction error:

    L_MSE(X) = Σ_{t=1}^{T} ||y_t^{(1)} − x_t||².    (8)
Multiresolution Shape-Forcing Loss  To further encourage decoders to learn consistent temporal patterns at different resolutions, we force the decoders' outputs to have similar shapes as the original input by introducing a loss based on dynamic time warping (DTW) (Sakoe and Chiba 1978).

Let the output from decoder D^{(k)} be Y^{(k)} = [y_1^{(k)}, y_2^{(k)}, ..., y_{T^{(k)}}^{(k)}]. Since T^{(k)} ≠ T for k = 2, ..., L^{(D)}, we define a similarity between time series X and each Y^{(k)} by DTW. The DTW similarity is based on distances along the (sub-)optimal DTW alignment path. Let the alignment be represented by a matrix A ∈ {0, 1}^{T×T^{(k)}}, in which A_{i,j} = 1 when x_i is aligned to y_j^{(k)}, and zero otherwise, with boundary conditions A_{1,1} = 1 and A_{T,T^{(k)}} = 1. All valid alignment paths run from the upper-left entry (1, 1) to the lower-right entry (T, T^{(k)}) using moves ↓, → or ↘. The alignment costs are stored in a matrix C. For simplicity, we use C_{i,j} = ||x_i − y_j^{(k)}||, the Euclidean distance. The DTW distance between X and Y^{(k)} is then:

    DTW(X, Y^{(k)}) = min_{A ∈ 𝒜} ⟨A, C⟩,    (9)

where 𝒜 is the set of T × T^{(k)} binary alignment matrices, and ⟨·, ·⟩ is the matrix inner product. The DTW distance is non-differentiable due to the min operator. To integrate DTW into end-to-end training, we replace (9) by the smoothed DTW (sDTW) distance (Cuturi and Blondel 2017):

    sDTW(X, Y^{(k)}) = −γ log( Σ_{A ∈ 𝒜} e^{−⟨A, C⟩/γ} ),    (10)

where γ > 0. This is based on the smoothed min operator min_γ{a_1, ..., a_n} = −γ log Σ_{i=1}^{n} e^{−a_i/γ}, which reduces to the min operator when γ approaches zero.

With the sDTW distance, we encourage decoders at different resolutions to output time series with similar temporal characteristics as the input. Here, decoders whose decoding length is less than the length of the whole time series are considered. This leads to the following multiresolution shape-forcing loss:

    L_shape(X) = 1/(L^{(D)} − 1) Σ_{k=2}^{L^{(D)}} sDTW(X, Y^{(k)}).    (11)
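Although the sum in (10) runs over exponentially many alignments, sDTW can be computed by the standard DTW dynamic program with the soft-min in place of the hard min (Cuturi and Blondel 2017). A NumPy sketch of the forward recursion (illustrative only; training would use a differentiable implementation with the matching backward pass):

import numpy as np

def soft_dtw(X, Y, gamma=0.1):
    # sDTW(X, Y) of Eq. (10); X: (T, d), Y: (T_k, d); gamma > 0 smooths the min
    def soft_min(values):
        z = -np.asarray(values) / gamma
        m = z.max()
        return -gamma * (m + np.log(np.exp(z - m).sum()))

    T, T_k = len(X), len(Y)
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)   # C[i, j] = ||x_i - y_j||
    R = np.full((T + 1, T_k + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, T_k + 1):
            # the three predecessors correspond to the moves ↓, → and ↘
            R[i, j] = C[i - 1, j - 1] + soft_min([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
    return R[T, T_k]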
Given a batch of samples {X_b}_{b=1,2,...,B} (where B is the batch size), the total loss is:

    L = 1/B Σ_{b=1}^{B} (L_MSE(X_b) + λ L_shape(X_b)),    (12)

where λ is a trade-off parameter. This can be minimized by stochastic gradient descent or its variants (such as Adam (Kingma and Ba 2015)). The training procedure is shown in Algorithm 1.

Algorithm 1: Recurrent Autoencoder with Multiresolution Ensemble Decoding (RAMED).
Input: a set of time series {X_b}; batch size B; number of encoders L^{(E)}; number of decoders L^{(D)}; τ.
1: for each decoder, set its decoding length to T^{(k)} = α_k T;
2: repeat
3:   sample a batch of B time series;
4:   for b = 1, ..., B do
5:     feed time series X_b to the encoders and obtain the last hidden states {h_{b,T}^{(E)}};
6:     obtain the joint representations {h^{(E)}} via (4);
7:     for k = L^{(D)}, L^{(D)}−1, ..., 1 do
8:       run the decoder D^{(k)};
9:       if k ≠ L^{(D)} then
10:        perform coarse-to-fine fusion;
11:      end if
12:      obtain updated hidden states {h_{b,t}^{(k)}} and outputs {y_{b,t}^{(k)}};
13:    end for
14:    minimize (12) by SGD or its variants;
15:  end for
16: until convergence.

Anomaly Score and Detection

Given a time series X = [x_1, x_2, ..., x_T] and its reconstruction Y = [y_1, y_2, ..., y_T], the reconstruction error at time t is e(t) = y_t − x_t. We then fit a normal distribution N(µ, Σ) using the set of {e(t)} from all time steps and all time series in the validation set.

On inference, the probability that x_t from an unseen time series in the test set is anomalous is defined as:

    1 − 1/√((2π)^d |Σ|) · exp(−(1/2) (e(t)−µ)^T Σ^{-1} (e(t)−µ)).
Thus, we can take

    s(t) = (e(t)−µ)^T Σ^{-1} (e(t)−µ)    (13)

as x_t's anomaly score. When this is greater than a predefined threshold, x_t is classified as an anomaly.
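A sketch of this scoring procedure (NumPy; the Gaussian is fitted on validation-set reconstruction errors, and test points are scored by the Mahalanobis-type distance in (13); the small ridge added to Σ is our own numerical safeguard, not mentioned in the paper):

import numpy as np

def fit_error_gaussian(val_errors, ridge=1e-6):
    # val_errors: (N, d) array of e(t) = y_t - x_t collected on the validation set
    mu = val_errors.mean(axis=0)
    sigma = np.cov(val_errors, rowvar=False) + ridge * np.eye(val_errors.shape[1])
    return mu, np.linalg.inv(sigma)

def anomaly_scores(test_errors, mu, sigma_inv):
    # s(t) = (e(t) - mu)^T Sigma^{-1} (e(t) - mu), Eq. (13)
    diff = test_errors - mu
    return np.einsum('td,de,te->t', diff, sigma_inv, diff)

# usage: time steps whose score exceeds a chosen threshold are flagged as anomalies
# mu, sigma_inv = fit_error_gaussian(val_errors)
# is_anomaly = anomaly_scores(test_errors, mu, sigma_inv) > threshold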
Experiments

In this section, experiments are performed on the following nine commonly-used real-world time series benchmarks¹ (Table 1):

1. ECG: This is a collection of 6 data sets on the detection of anomalous beats from electrocardiogram readings.

2. 2D-gesture: This contains time series of X-Y coordinates of an actor's right hand. The data is extracted from a video in which the actor grabs a gun from his hip-mounted holster, moves it to the target, and returns it to the holster. The anomalous region is in the area where the actor fails to return his gun to the holster.

3. Power-demand: This contains one year of power consumption records measured by a Dutch research facility in 1997.

4. Yahoo's S5 Webscope: This contains records from real production traffic of the Yahoo website. Anomalies are manually labeled by human experts.

ECG and 2D-gesture are bivariate time series (d = 2), while Power-demand and Yahoo's S5 are univariate (d = 1). For each of ECG, 2D-gesture and Power-demand, the public data set includes a training set (containing only normal data) and a test set. We use 30% of the training set for validation, and the rest for actual training. The model with the lowest reconstruction loss on the validation set is selected for evaluation. For Yahoo's S5, the available data set is split into three parts: 40% of the samples for training, another 30% for validation, and the remaining 30% for testing. The training set contains unknown anomalies, and we use the model with the highest AUROC value on the validation set for evaluation.

The time series are partitioned into length-T sequences by using a sliding window. The sliding window has a stride of 32 on the ECG data sets, 512 on Power-demand, and 64 on 2D-gesture and Yahoo's S5. Table 1 shows the sequence length T, the number of sequences in the training/validation/testing sets, and the percentage of anomalous samples in the test set.

Dataset                    T    # Training  # Validation  # Testing  % Anomaly
ECG (A) chfdb chf01 275    64   40          17            59         14.61
ECG (B) chfdb chf13 45590  64   53          22            40         12.35
ECG (C) chfdbchf15         64   237         101           104        4.45
ECG (D) ltstdb 20221 43    64   57          24            35         11.51
ECG (E) ltstdb 20321 240   64   43          18            45         9.61
ECG (F) mitdb 100 180      64   64          27            70         8.38
2D-gesture                 64   91          39            47         24.63
Power-demand               512  25          11            29         11.44
Yahoo's S5                 128  659         398           394        3.20

Table 1: Statistics of the time series data sets.

Baselines  The proposed RAMED model is compared with four recent anomaly detection baselines:² (i) recurrent autoencoder (RAE) (Malhotra et al. 2016); (ii) recurrent reconstructive network (RRN) (Yoo, Kim, and Kim 2019), which combines attention, skip transition and a state-forcing regularizer; (iii) recurrent autoencoder ensemble (RAE-ensemble) (Kieu et al. 2019), which uses an ensemble of RNNs with sparse skip connections as encoders and decoders; (iv) BeatGAN (Zhou et al. 2019), which is a recent CNN autoencoder-based generative adversarial network (GAN) (Goodfellow et al. 2014) for time series anomaly detection.

Evaluation Metrics  Performance measures such as precision and recall depend on thresholding the anomaly score. To avoid setting this threshold, we use the following metrics which have been widely used in anomaly detection (Wang et al. 2019; Ren et al. 2019; Li et al. 2020; Su et al. 2019): (i) area under the ROC curve (AUROC), (ii) area under the precision-recall curve (AUPRC), and (iii) the highest F1-score (denoted F1best) (Li et al. 2020; Su et al. 2019), which is selected from 1000 thresholds uniformly distributed from 0 to the maximum anomaly score over all time steps in the test set (Yoo, Kim, and Kim 2019).
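F1best can be computed from the point-wise anomaly scores and binary labels roughly as follows (a sketch assuming scikit-learn's f1_score; the 1000-threshold grid follows the description above):

import numpy as np
from sklearn.metrics import f1_score

def best_f1(scores, labels, num_thresholds=1000):
    # highest F1 over thresholds uniformly spaced in [0, max anomaly score]
    thresholds = np.linspace(0.0, scores.max(), num_thresholds)
    return max(f1_score(labels, scores > thr) for thr in thresholds)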
Implementation Details  We use 3 encoders and 3 decoders. Each encoder and decoder is a single-layer LSTM with 64 units. We perform a grid search on the hyperparameter β in (7) over {0.1, 0.2, ..., 0.9} and on λ in (12) over {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1}; τ in (6) is set to 3 and γ in (10) is set to 0.1. The Adam optimizer (Kingma and Ba 2015) is used with an initial learning rate of 10^{-3}.

¹ ECG, 2D-gesture and Power-demand are from http://www.cs.ucr.edu/~eamonn/discords/, while Yahoo's S5 is from https://webscope.sandbox.yahoo.com/.
² RAE and RRN are downloaded from https://github.com/YongHoYoo/AnomalyDetection, BeatGAN is from https://github.com/Vniex/BeatGAN, and RAE-ensemble is from https://github.com/tungk/OED.

metric   method         ECG(A)  ECG(B)  ECG(C)  ECG(D)  ECG(E)  ECG(F)  2D-gesture  Power-demand  Yahoo's S5  avg rank

AUROC    BeatGAN        0.6566  0.7056  0.7329  0.6173  0.8160  0.4223  0.7256      0.5796        0.8728      4.33
         RAE            0.6728  0.7502  0.8289  0.5452  0.7970  0.4715  0.7601      0.6122        0.8823      3.78
         RRN            0.6393  0.7623  0.7405  0.6318  0.8101  0.4531  0.7530      0.6607        0.8869      3.44
         RAE-ensemble   0.6884  0.7788  0.8570  0.6400  0.8035  0.5213  0.7808      0.6587        0.8850      2.44
         RAMED          0.7127  0.8551  0.8736  0.6473  0.8828  0.6399  0.7839      0.6787        0.8942      1.00

AUPRC    BeatGAN        0.5197  0.4101  0.2254  0.1613  0.3342  0.0778  0.4952      0.1228        0.4702      4.44
         RAE            0.5501  0.4249  0.4996  0.1435  0.2126  0.0894  0.4979      0.1350        0.4782      3.78
         RRN            0.5260  0.5653  0.4139  0.1652  0.3206  0.0833  0.4866      0.1446        0.4794      3.22
         RAE-ensemble   0.5549  0.4769  0.5256  0.2026  0.2798  0.0948  0.5287      0.1400        0.4783      2.56
         RAMED          0.5803  0.7008  0.5486  0.2203  0.3784  0.1253  0.5331      0.1627        0.4809      1.00

F1best   BeatGAN        0.5102  0.4204  0.2931  0.2502  0.4776  0.1562  0.4941      0.2266        0.4484      4.44
         RAE            0.5478  0.4736  0.5046  0.2193  0.3886  0.1581  0.5300      0.2798        0.4473      3.78
         RRN            0.5440  0.5502  0.4537  0.2621  0.4548  0.1562  0.5240      0.2926        0.4502      3.00
         RAE-ensemble   0.5479  0.5016  0.5333  0.2735  0.3910  0.1602  0.5511      0.2678        0.4497      2.67
         RAMED          0.5762  0.6871  0.5541  0.3466  0.4855  0.2090  0.5633      0.2934        0.4502      1.00

Table 2: Anomaly detection results (the larger the better). The best results are highlighted. Average rank (the smaller the better) is recorded in the last column.

method                       ECG(A)                   ECG(B)                   ECG(C)
                             AUROC   AUPRC   F1best   AUROC   AUPRC   F1best   AUROC   AUPRC   F1best
w/o coarse-to-fine fusion    0.6781  0.5305  0.5275   0.8196  0.6144  0.5539   0.7537  0.4700  0.5263
w/o L_shape                  0.6916  0.5720  0.5556   0.8542  0.6362  0.6266   0.8450  0.4828  0.5365
full model                   0.7127  0.5803  0.5762   0.8551  0.7008  0.6871   0.8736  0.5486  0.5541

method                       ECG(D)                   ECG(E)                   ECG(F)
                             AUROC   AUPRC   F1best   AUROC   AUPRC   F1best   AUROC   AUPRC   F1best
w/o coarse-to-fine fusion    0.5125  0.1326  0.2092   0.8212  0.3058  0.3886   0.5083  0.0875  0.1593
w/o L_shape                  0.5609  0.1742  0.2537   0.8455  0.3228  0.4235   0.5598  0.0993  0.1704
full model                   0.6473  0.2203  0.3466   0.8828  0.3784  0.4855   0.6399  0.1253  0.2090

method                       2D-gesture               Power-demand             Yahoo's S5
                             AUROC   AUPRC   F1best   AUROC   AUPRC   F1best   AUROC   AUPRC   F1best
w/o coarse-to-fine fusion    0.7656  0.5292  0.5448   0.6357  0.1385  0.2642   0.8846  0.4501  0.4335
w/o L_shape                  0.7716  0.5328  0.5525   0.6632  0.1500  0.2805   0.8895  0.4783  0.4497
full model                   0.7839  0.5331  0.5633   0.6787  0.1627  0.2934   0.8942  0.4809  0.4502

Table 3: Effectiveness of coarse-to-fine fusion and multiresolution shape-forcing loss in RAMED.

Performance Comparison

Results are shown in Table 2. In terms of the average ranking, RAE-ensemble has better performance among the 4 baselines. This agrees with the general view that ensemble learning is beneficial. The proposed RAMED consistently outperforms RAE-ensemble on all three metrics, and the improvement on the AUROC is particularly large. This demonstrates that using multiresolution information in an ensemble further helps time series reconstruction, such that lower false positive rates and higher true negative rates can be achieved.

Ablation Study

In this section, we examine the contributions of the coarse-to-fine fusion strategy and multiresolution shape-forcing loss in the proposed RAMED model. Sensitivity analysis on the hyperparameters is also performed.

Figure 2: Effect of varying β ((a) ECG(A); (b) Gesture).

Figure 3: Effect of varying λ ((a) ECG(A); (b) Gesture).

Effect of Coarse-to-Fine Fusion  In this experiment, we still use 3 encoders as in the full model, but only use one decoder (with decoding length equal to the input length). As can be seen from Table 3, when only one decoder is left to reconstruct the input time series, performance on all metrics decreases. This verifies the usefulness of the multiresolution fusion strategy.

Effect of Multiresolution Shape-Forcing Loss  In this experiment, we remove the multiresolution shape-forcing loss from (12), and only minimize the reconstruction error. As can be seen from Table 3, the shape-forcing loss also plays an important role in multiresolution decoding.

Sensitivity to Hyperparameters  We study the following hyperparameters in the proposed model: (i) the coarse-to-fine fusion weight β in (7), (ii) the tradeoff parameter λ on the multiresolution shape-forcing loss in (12), and (iii) L^{(D)}, the number of decoders. The default hyperparameter settings are β = 0.1, λ = 10^{-4} and L^{(D)} = 3. Experiments are performed on the ECG(A) and Gesture data sets.

Figure 2 shows the AUROCs at different β's. As can be seen, the performance w.r.t. β is relatively stable. When β is set to 0.1, the proposed model achieves the best performance. This is because when β is small, more coarse-grained information can be used to help temporal modeling at higher-resolution levels; whereas a larger β may ignore the coarse-grained information and degrade performance.

Figure 3 shows the AUROCs at different λ's. As can be seen, when λ is small (10^{-4}) but nonzero, better performance is achieved.

Table 4 shows the AUROCs with different numbers of decoders. As can be seen, increasing L^{(D)} (from 1 to 3) can improve performance, as more abundant multiresolution temporal patterns are involved. However, when L^{(D)} increases to 4, performance is degraded. This is because when L^{(D)} = 4, the coarsest decoder has a decoding length of only 2, and cannot provide useful global temporal information.

ECG(A)
L^(D)   T^(k)'s          AUROC   AUPRC   F1best
1       64               0.6781  0.5305  0.5275
2       21, 64           0.6976  0.5828  0.5708
3       7, 21, 64        0.7172  0.5803  0.5762
4       2, 7, 21, 64     0.6743  0.5132  0.5196

Gesture
L^(D)   T^(k)'s          AUROC   AUPRC   F1best
1       64               0.7656  0.5292  0.5448
2       21, 64           0.7693  0.5080  0.5462
3       7, 21, 64        0.7839  0.5331  0.5633
4       2, 7, 21, 64     0.7779  0.5153  0.5545

Table 4: Effect of varying L^(D) on ECG(A) and Gesture. T^(k)'s are the decoding lengths of the various decoders.
reconstruct the input time series, performance on all metrics Conclusion


decrease. This verifies usefulness of the multiresolution fu-
In this paper, we introduce a recurrent ensemble network
sion strategy.
called Recurrent Autoencoder with Multiresolution Ensem-
Effect of Multiresolution Shape-Forcing Loss In this ex- ble Decoding (RAMED) for time series anomaly detection.
periment, we remove the multiresolution shape-forcing loss RAMED is based on a new coarse-to-fine fusion mecha-
from (12), and only minimize the reconstruction error. As nism, which integrates all the decoders into an ensemble,
can be seen from Table 3, the shape-forcing loss also plays and a multiresolution shape-forcing loss, which encourages
an important role in multiresolution decoding. decoders’ outputs to match the input’s global temporal shape
at multiple resolutions. This avoids overfitting the nonlinear
Sensitivity to Hyperparameters We study the following local patterns at a higher resolution, and alleviates error ac-
hyperparameters in the proposed model: (i) coarse-to-fine cumulation during decoding. Experiments on various time
fusion weight β in (7), (ii) tradeoff parameter λ on the mul- series benchmark data sets demonstrate that the proposed
tiresolution shape-forcing loss in (12), and (iii) L(D) , the model achieves better anomaly detection performance than
number of decoders. The default hyperparameter settings are competitive baselines.

Acknowledgments

This work was partially funded by the Foreign Science and Technology Cooperation Program of Huangpu District of Guangzhou (No. 2018GH09, 2019-2020), the National Natural Science Foundation of China (Grant Nos. 61502174, 61872148), the Natural Science Foundation of Guangdong Province (Grant Nos. 2017A030313355, 2019A1515010768), the Guangzhou Science and Technology Planning Project (Grant Nos. 201704030051, 201902010020), the Key R&D Program of Guangdong Province (No. 2018B010107002), and the Fundamental Research Funds for the Central Universities.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
Chia, C.-C.; and Syed, Z. 2014. Scalable noise mining in long-term electrocardiographic time-series to predict death following heart attacks. In International Conference on Knowledge Discovery and Data Mining.
Chung, J.; Ahn, S.; and Bengio, Y. 2017. Hierarchical multiscale recurrent neural networks. In International Conference on Learning Representations.
Cook, A. A.; Misirli, G.; and Fan, Z. 2020. Anomaly detection for IoT time-series data: A survey. IEEE Internet of Things Journal 7(7): 6481–6494.
Cuturi, M.; and Blondel, M. 2017. Soft-DTW: A differentiable loss function for time-series. In International Conference on Machine Learning.
Dietterich, T. G. 2000. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems.
Ding, Z.; Yang, B.; Chi, Y.; and Guo, L. 2016. Enabling smart transportation systems: A parallel spatio-temporal database approach. IEEE Transactions on Computers 65(5): 1377–1391.
Filonov, P.; Lavrentyev, A.; and Vorontsov, A. 2016. Multivariate industrial time series with cyber-attack simulation: Fault detection using an LSTM-based predictive data model. Preprint arXiv:1612.06676.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems.
Gupta, M.; Gao, J.; Aggarwal, C. C.; and Han, J. 2014. Outlier detection for temporal data: A survey. IEEE Transactions on Knowledge and Data Engineering 26(9): 2250–2267.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
Hermans, M.; and Schrauwen, B. 2013. Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems.
Hihi, S. E.; and Bengio, Y. 1996. Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems.
Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8): 1735–1780.
Kieu, T.; Yang, B.; Guo, C.; and Jensen, C. S. 2019. Outlier detection for time series with recurrent autoencoder ensembles. In International Joint Conferences on Artificial Intelligence.
Kieu, T.; Yang, B.; and Jensen, C. S. 2018. Outlier detection for multidimensional time series using deep neural networks. In IEEE International Conference on Mobile Data Management.
Kingma, D. P.; and Ba, J. L. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
Le Guen, V.; and Thome, N. 2019. Shape and time distortion loss for training deep time series forecasting models. In Advances in Neural Information Processing Systems.
Li, L.; Yan, J.; Wang, H.; and Jin, Y. 2020. Anomaly detection of time series with smoothness-inducing sequential variational auto-encoder. IEEE Transactions on Neural Networks 1–15.
Liu, Y.; Yu, R.; Zheng, S.; Zhan, E.; and Yue, Y. 2019. NAOMI: Non-autoregressive multiresolution sequence imputation. In Advances in Neural Information Processing Systems.
Ma, Q.; Lin, Z.; Chen, E.; and Garrison, C. 2020. Temporal pyramid recurrent neural network. In AAAI Conference on Artificial Intelligence.
Ma, Q.; Zheng, J.; Li, S.; and Cottrell, G. W. 2019. Learning representations for time series clustering. In Advances in Neural Information Processing Systems.
Malhotra, P.; Ramakrishnan, A.; Anand, G.; Vig, L.; Agarwal, P.; and Shroff, G. M. 2016. LSTM-based encoder-decoder for multisensor anomaly detection. Preprint arXiv:1607.00148.
Ren, H.; Xu, B.; Wang, Y.; Yi, C.; Huang, C.; Kou, X.; Xing, T.; Yang, M.; Tong, J.; and Zhang, Q. 2019. Time-series anomaly detection service at Microsoft. In International Conference on Knowledge Discovery & Data Mining.
Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S. A.; Binder, A.; Müller, E.; and Kloft, M. 2018. Deep one-class classification. In International Conference on Machine Learning.
Sakoe, H.; and Chiba, S. 1978. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26(1): 159–165.
Su, Y.; Zhao, Y.; Niu, C.; Liu, R.; Sun, W.; and Pei, D.
2019. Robust anomaly detection for multivariate time se-
ries through stochastic recurrent neural network. In Interna-
tional Conference on Knowledge Discovery & Data Mining.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence
to sequence learning with neural networks. In Advances in
Neural Information Processing Systems.
Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P.-A.
2008. Extracting and composing robust features with de-
noising autoencoders. In International Conference on Ma-
chine Learning.
Wang, S.; Zeng, Y.; Liu, X.; Zhu, E.; Yin, J.; Xu, C.; and
Kloft, M. 2019. Effective end-to-end unsupervised outlier
detection via inlier priority of discriminative network. In
Advances in Neural Information Processing Systems.
Wold, H. 1938. A study in analysis of stationary time series.
Journal of the Royal Statistical Society 102(2): 295–298.
Yoo, Y.; Kim, U.; and Kim, J. 2019. Recurrent reconstruc-
tive network for sequential anomaly detection. IEEE Trans-
actions on Cybernetics 1–12.
Zhang, C.; Song, D.; Chen, Y.; Feng, X.; Lumezanu, C.;
Cheng, W.; Ni, J.; Zong, B.; Chen, H.; and Chawla, N. 2019.
A deep neural network for unsupervised anomaly detection
and diagnosis in multivariate time series data. In AAAI Con-
ference on Artificial Intelligence.
Zhou, B.; Liu, S.; Hooi, B.; Cheng, X.; and Ye, J. 2019.
BeatGAN: Anomalous rhythm detection using adversarially
generated time series. In International Joint Conferences on
Artificial Intelligence.

