ABSTRACT
This paper proposes to use autoencoders with nonlinear dimensionality reduction in the anomaly detection task. The authors apply dimensionality reduction using an autoencoder to both artificial data and real data, and compare it with linear PCA and kernel PCA to clarify its properties. The artificial data is generated from the Lorenz system, and the real data is the spacecrafts' telemetry data. This paper demonstrates that autoencoders are able to detect subtle anomalies which linear PCA fails to detect. Also, autoencoders can increase their accuracy by extending them to denoising autoencoders. Moreover, autoencoders can be useful as nonlinear techniques without the complex computation that kernel PCA requires. Finally, the authors examine the learned features in the hidden layer of autoencoders, and present that autoencoders learn the normal state properly and activate differently with anomalous input.

Categories and Subject Descriptors
I.2.1 [Artificial Intelligence]: Applications and Expert Systems—Industrial automation; I.5.4 [Pattern Recognition]: Applications—Signal processing

General Terms
Performance

Keywords
anomaly detection, novelty detection, fault detection, autoencoder, auto-associative neural network, denoising autoencoder, dimensionality reduction, nonlinear, spacecrafts

1. INTRODUCTION
Recently, feature learning using neural networks with dimensionality reduction has become popular in the Deep Learning context [4]. An autoencoder, which is a neural network with nonlinear dimensionality reduction capability, has been used since the 1990s, often under the name auto-associative neural network [8], [7]. However, there are still few works in which researchers try to apply those learned features to other data mining tasks. Our idea is to apply them to one of the fundamental data mining tasks: the anomaly detection task. We perform dimensionality reduction using autoencoders on data which contain anomalies. We investigate the difference in detection performance by comparing an autoencoder with other traditional approaches such as linear principal component analysis (hereinafter referred to as PCA) and kernel PCA. Previous work proposed another extension of the ordinary autoencoder, the denoising autoencoder [13], and we also include this approach in our comparison.
Our work eventually aims to detect anomalies in spacecrafts' telemetry data by dimensionality reduction techniques. Spacecrafts have a complex system and their telemetry data have hundreds of variables. Most of the variables are nonlinearly correlated and temporally dependent. It is difficult for humans to distinguish the abnormal state from the normal state from the raw data alone. For this reason, training the machine to learn the normal state and using the reconstruction error as the anomaly score is valuable. Thus, in this paper we especially focus on time series data which consist of 10-100 variables with nonlinear correlations.
Our contribution is three-fold. First, we apply dimensionality reduction using autoencoders to both artificial data and real data, and show that autoencoders are applicable to anomaly detection. Second, we compare the performance among autoencoders, denoising autoencoders, linear PCA and kernel PCA to clarify the properties of autoencoders. We found that 1) autoencoders can detect anomalies which linear PCA fails to detect, and can further increase the accuracy when extended to denoising autoencoders, and 2) autoencoders avoid the complex computation that kernel PCA requires without degrading detection performance. Finally, we investigate the learned features in the hidden layer of the autoencoder, and show that they learn the normal state properly and activate differently with anomalous input.
2. RELATED WORK
One of the properties of autoencoders is that they can perform nonlinear dimensionality reduction. There are several papers, such as [8], [6], [10], in which the authors investigated this nonlinear property. In [6], the authors theoretically demonstrated the nonlinearity of autoencoders. In [8], [6], [10], they applied autoencoders to artificially generated nonlinear anomaly detection data. However, the data they used are too simple to simulate real data. In our work, we generated data with 25 dimensions from a more complicated nonlinear system using the Lorenz equations.
Some of the previous works applied autoencoders to real data or to realistic data generated by simulating a real-world model [11], [3], [12], [9]. However, these works are insufficient in that they either use only low dimensional data or lack a comparison with other approaches. We applied two kinds of real data: one has 10 dimensions and the other has more than 100 dimensions. Although some works compare an autoencoder with other approaches [7], [15], in this paper we focus on dimensionality reduction and determine the difference in performance according to the reconstruction error.
Figure 1: Autoencoder [1]. (Diagram of a three-layer network: input layer L1 with units x1-x5, hidden layer L2, and output layer L3 producing the reconstructions x̂1-x̂5; "+1" denotes the bias units.)

3. ANOMALY DETECTION USING AUTOENCODERS
Figure 2: Top: Normal {z(1), z(2), ..., z(849)} (blue) and anomalous {z(850), z(851), ..., z(1000)} (red) data from the Lorenz system (axes z1, z2, z3). Bottom: Normalized 25 dimensional Lorenz system data x over the time index, split into training and test ranges.

Figure 3: Top: Normalized data of Satellite-A. Bottom: Normalized data of Satellite-B.
[...] with a small subset of neurons. We can achieve this by increasing the number of hidden units and adding some noise to the input. There are several ways to add noise to the input, but in this work we corrupt the input by randomly choosing a fixed number of its components and forcing them to 0, which is sometimes called salt-and-pepper noise [14].
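As an illustration, here is a minimal sketch of such a corruption step (the function name and the use of NumPy are ours, not the authors' code):

```python
import numpy as np

def corrupt_salt_and_pepper(X, destruction_level=0.1, rng=None):
    """Force a fixed number of randomly chosen components of each sample to 0.

    X: array of shape (n_samples, n_features); destruction_level: fraction of
    components zeroed per sample (0.1 in the experiments below).
    """
    rng = np.random.default_rng(rng)
    X_corrupted = X.copy()
    n_features = X.shape[1]
    n_zero = max(1, int(round(destruction_level * n_features)))
    for i in range(X.shape[0]):
        idx = rng.choice(n_features, size=n_zero, replace=False)
        X_corrupted[i, idx] = 0.0
    return X_corrupted
```

The denoising autoencoder is then trained to reconstruct the clean input from the corrupted one, as in [13].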
4. EXPERIMENTAL SETUP
We performed dimensionality reduction on each dataset with four methods: linear PCA, an autoencoder, a denoising autoencoder and kernel PCA. In each method, the dimension of the latent space was adjusted manually. For the autoencoders and denoising autoencoders, we set the parameters of the objective function (Eq. 3) to λ = 0.00001, β = 3 and ρ = 0.01. The destruction level, i.e., the probability that each element is forced to 0, was fixed to 0.1. We compared the performances based on the reconstruction error in Eq. 1.
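Eq. 1 and Eq. 3 are not reproduced in this excerpt; as a hedged sketch, assume the usual sparse-autoencoder objective of squared reconstruction error plus an L2 weight-decay term weighted by λ and a KL-divergence sparsity penalty weighted by β with target hidden activation ρ. The Keras-based helpers below, the layer sizes and the training settings are illustrative, not the authors' implementation:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

class KLSparsity(keras.regularizers.Regularizer):
    """KL-divergence sparsity penalty: target mean activation rho, weight beta (assumed form of Eq. 3)."""
    def __init__(self, beta=3.0, rho=0.01):
        self.beta, self.rho = beta, rho

    def __call__(self, h):
        rho_hat = tf.reduce_mean(h, axis=0)  # mean activation of each hidden unit over the batch
        kl = (self.rho * tf.math.log(self.rho / (rho_hat + 1e-10))
              + (1.0 - self.rho) * tf.math.log((1.0 - self.rho) / (1.0 - rho_hat + 1e-10)))
        return self.beta * tf.reduce_sum(kl)

def build_autoencoder(n_inputs, n_hidden, lam=1e-5, beta=3.0, rho=0.01):
    """One-hidden-layer autoencoder trained with squared error,
    L2 weight decay (lam) and the sparsity penalty above."""
    x = keras.Input(shape=(n_inputs,))
    h = keras.layers.Dense(n_hidden, activation="sigmoid",
                           kernel_regularizer=keras.regularizers.l2(lam),
                           activity_regularizer=KLSparsity(beta, rho))(x)
    x_hat = keras.layers.Dense(n_inputs,
                               kernel_regularizer=keras.regularizers.l2(lam))(h)
    model = keras.Model(x, x_hat)
    model.compile(optimizer="adam", loss="mse")
    return model

def reconstruction_error(model, X):
    """Per-sample squared reconstruction error, used as the anomaly score."""
    X_hat = model.predict(X, verbose=0)
    return np.sum((X - X_hat) ** 2, axis=1)

# Denoising variant: feed corrupted inputs, reconstruct the clean ones.
# ae = build_autoencoder(n_inputs=X_train.shape[1], n_hidden=30)   # hidden size illustrative
# ae.fit(corrupt_salt_and_pepper(X_train, destruction_level=0.1), X_train,
#        epochs=200, batch_size=32, verbose=0)
# scores = reconstruction_error(ae, X_test)
```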
4.1 Artificial Data
We prepared nonlinear simulated data using the Lorenz system, which consists of the following equations:

    ż1(t) = σ(z2(t) − z1(t))
    ż2(t) = z1(t)(ρ − z3(t)) − z2(t)        (4)
    ż3(t) = z1(t)z2(t) − βz3(t)

We set the three parameters σ, ρ and β to 28, 10 and 8/3 respectively. According to Eq. 4, we first generated the vectors z(t) = (z1(t) z2(t) z3(t))^T. We sampled 1000 vectors by running this simulation for 100 s at a sampling interval of 0.1 s, with small observation noise and system transition noise. To generate the anomalous data, after sampling we flipped the values from z3(850) to z3(1000) horizontally so that z3 is aligned in reverse chronological order after the 850th sample. To generate the high dimensional vectors x(t), we first constructed a matrix W ∈ R^{25×3} whose components were randomly chosen from the interval (−5, 5). Then we multiplied W by each vector z(t), i.e., x(t) = Wz(t). We divided the 1000 samples into 700 training samples {x(1), x(2), ..., x(700)} and 300 test samples {x(701), x(702), ..., x(1000)}, with the latter half of the test samples containing anomalies. Fig. 2 shows the distribution of the 1000 vectors z and the data x after normalization to zero mean and unit variance.
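A minimal sketch of this data-generation procedure (ours, not the authors' code): it uses simple Euler integration, assumes the classic chaotic Lorenz parameterization σ = 10, ρ = 28, β = 8/3, and the noise levels, initial state and seed are illustrative.

```python
import numpy as np

def generate_lorenz_data(n_samples=1000, dt=0.1, n_dims=25, anomaly_start=850, seed=0):
    """Simulate the Lorenz system, reverse z3 after `anomaly_start`,
    project to n_dims dimensions with a random matrix W, and normalize."""
    rng = np.random.default_rng(seed)
    sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0      # classic chaotic regime (assumed)
    z = np.array([1.0, 1.0, 1.0])                 # illustrative initial state
    Z = np.empty((n_samples, 3))
    substeps = 100                                # Euler sub-steps per 0.1 s sample
    h = dt / substeps
    for t in range(n_samples):
        for _ in range(substeps):
            dz = np.array([sigma * (z[1] - z[0]),
                           z[0] * (rho - z[2]) - z[1],
                           z[0] * z[1] - beta * z[2]])
            z = z + h * dz + 1e-3 * rng.standard_normal(3)   # small transition noise
        Z[t] = z + 1e-2 * rng.standard_normal(3)             # small observation noise
    Z[anomaly_start:, 2] = Z[anomaly_start:, 2][::-1]        # reverse z3 after the 850th sample
    W = rng.uniform(-5.0, 5.0, size=(n_dims, 3))             # random projection W in R^{25x3}
    X = Z @ W.T                                              # x(t) = W z(t)
    X = (X - X.mean(axis=0)) / X.std(axis=0)                 # zero mean, unit variance
    return X[:700], X[700:]                                  # 700 training, 300 test samples

X_train, X_test = generate_lorenz_data()
```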
4.2 Real Data
We used two kinds of spacecraft telemetry data in our real data experiment: Satellite-A and Satellite-B. Satellite-A and Satellite-B have 17 and 106 continuous sensor measurements respectively. Spacecraft telemetry data have many different sensor measurements, and in general these inputs are correlated with each other [16], [17]. This means that we can remove redundant inputs and represent each data sample as a lower dimensional vector. Fig. 3 shows the data after we normalized each variable so that its mean and variance become 0 and 1 respectively.

Table 1: The average AUC of the 4 different methods on the 3 different datasets. LPCA, AE, dAE and KPCA denote linear PCA, an autoencoder, a denoising autoencoder and kernel PCA respectively. The first row (Lorenz) has the results on the artificial data from the Lorenz system, and the last two rows (Sat-A and Sat-B) have the results on the two kinds of spacecrafts' telemetry data.

         LPCA     AE       dAE      KPCA
Lorenz   0.5104   0.6473   0.7011   0.7045
Sat-A    0.8852   0.8847   0.9354   0.8862
Sat-B    0.9764   0.9763   0.8355   0.7689
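Table 1 summarizes detection performance as AUC over reconstruction-error scores. A rough sketch of such a computation (ours, not the authors' pipeline), shown for a linear-PCA baseline on the Lorenz split generated above; the number of components and the label construction are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

def pca_reconstruction_error(X_train, X_test, n_components):
    """Fit linear PCA on the (normal) training data and score each test
    sample by its squared reconstruction error."""
    pca = PCA(n_components=n_components).fit(X_train)
    X_hat = pca.inverse_transform(pca.transform(X_test))
    return np.sum((X_test - X_hat) ** 2, axis=1)

# In the Lorenz split above, the latter half of the 300 test samples is anomalous.
y_true = np.concatenate([np.zeros(150), np.ones(150)])
scores = pca_reconstruction_error(X_train, X_test, n_components=3)
print("linear PCA AUC:", roc_auc_score(y_true, scores))  # higher error on anomalies -> AUC near 1
```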
5. RESULT AND DISCUSSION

Figure 4: Result on Lorenz system data. The reconstruction error (left column) and the difference between the original and reconstructed data (right column) of linear PCA, an autoencoder, a denoising autoencoder and kernel PCA. The normal and anomalous regions of the test set are marked in each panel.
Figure 5: Result on Satellite-A data. The reconstruction error (left column) and the difference (right column) of linear PCA (latent space: 4 dim), an autoencoder, a denoising autoencoder and kernel PCA are shown from top to bottom.
Figure 6: Result on Satellite-B data. The reconstruction error (left column) and the difference (right column) of linear PCA (latent space: 16 dim), an autoencoder, a denoising autoencoder and kernel PCA are shown from top to bottom.
[4] G. E. Hinton and R. R. Salakhutdinov. Reducing the
dimensionality of data with neural networks. Science,
313(5786):504–507, 2006.
[5] H. Hoffmann. Kernel pca for novelty detection.
Pattern Recognition, 40(3):863–874, 2007.
[6] B. Hwang and S. Cho. Characteristics of
auto-associative mlp as a novelty detector. In
Proceedings of the International Joint Conference on
Neural Networks, volume 5, pages 3086–3091, 1999.
[7] N. Japkowicz, C. Myers, and M. Gluck. A novelty
detection approach to classification. In Proceedings of
the 14th International Joint Conference on Artificial
Intelligence, volume 1, pages 518–523, 1995.
[8] M. A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE J., 37(2):233–243, 1991.
[9] M. Martinelli, E. Tronci, G. Dipoppa, and [...]
[...] Networks, volume 3, pages 2878–2883, 2002.
[13] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103, 2008.
[14] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, Dec. 2010.
[15] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu. A comparative study of rnn for outlier detection in data mining. In Proceedings of the International Conference on Data Mining, page 709, 2002.
[16] T. Yairi, M. Inui, A. Yoshiki, Y. Kawahara, and N. Takata. Spacecraft telemetry data monitoring by dimensionality reduction techniques. In Proceedings of SICE Annual Conference, pages 1230–1234, Aug. 2010.
[17] T. Yairi, T. Tagawa, and N. Takata. Telemetry monitoring by dimensionality reduction and learning hidden markov model. In Proceedings of the International Symposium on Artificial Intelligence, Robotics and Automation in Space, 2012.

Figure 7: Top: An example of the activation of three of the neurons in the hidden layer of the denoising autoencoder with normal input (blue) and anomalous input (red). Bottom: Another example of the activation of two of the neurons in the hidden layer of the denoising autoencoder with normal input (blue) and anomalous input (red). The axes are the activations of individual hidden units (e.g., Unit 1 and Unit 3). In these two figures, the hidden units of the denoising autoencoder activate in a different way with anomalous input.