Transfer Learning for Time Series Anomaly Detection
1 Introduction
Time series data frequently arise in many different scientific and industrial con-
texts. For instance, companies use a variety of sensors to continuously monitor
equipment and natural resources. One relevant use case is developing algorithms
that can automatically identify time series that show anomalous behavior. Ide-
ally, anomaly detection could be posed as a supervised learning problem. How-
ever, these algorithms require large amounts of labeled training data. Unfor-
tunately, such data is often not available as obtaining expert labels is time-
consuming and expensive. Typically, only a small number of labels are known
for a limited number of data sets. For example, if a company monitors several
similar machines, they may only label events (e.g., shutdown, maintenance...)
for a small subset of them.
Transfer learning is an area of research focused on methods that are able to
extract information (e.g., labels, knowledge, etc.) from a data set and reapply
it in another, different data set. Specifically, the goal of transfer learning is
to improve performance on the target domain by leveraging information from
a related data set called the source domain [10]. In this paper, we adopt the
paradigm of transfer learning for anomaly detection. In our setting, we assume
that labeled examples are only available in the source domains, and that there
are no labeled examples in the target domain. Returning to the machine-monitoring
example above, we utilize the label information available for machine A to help
construct an anomaly detection algorithm for machine B, for which no labeled
points are available.
In this paper we study transfer learning in the context of time-series anomaly
detection, a setting that has received relatively little attention in the transfer learning literature [1, 6, 10]. Our
approach attempts to transfer instances from the source domain to the target
domain. It is based on two important and common insights about anomalous
data points, namely that they are infrequent and unexpected. We leverage these
insights to propose two different ways of identifying which source instances should
be transferred to the target domain. Finally, we make predictions in the target
domain using a 1-nearest-neighbor classifier for which the transferred instances are
the only labeled data points in the target domain. We experimentally evaluate
our approach on a large collection of time series derived from a real-world data set
and find that it outperforms an unsupervised approach.
2 Problem statement
We can formally define the task we address in this paper as follows:
Given: One or multiple source domains DS with source domain data {XS, YS},
and a target domain DT with target domain data {XT, YT}, where the instances
x ∈ X are time series and the labels y ∈ Y take values in {anomaly, normal}.
Additionally, only partial label information is available in the source domains,
and no label information in the target domain.
Do: Learn a model for anomaly detection fT(·) in the target domain DT using
the knowledge in DS, where DS ≠ DT.
Both the source and target domain instances are time series. Thus each instance
x = {(t1, v1), . . . , (tn, vn)}, where ti is a time stamp and vi is a single measurement
of the variable of interest v at time ti. The problem has the following
characteristics (a concrete sketch of this setting follows the list below):
– The joint distributions of source and target domain data, denoted by pS(X, Y)
and pT(X, Y), are not necessarily equal.
– No labels are known for the target domain, thus YT = ∅. In the source
domain, (partial) label information is available.
– The same variable v is monitored in the source and target domain, under
possibly different conditions (e.g., the same machine in different factories).
– The number of samples in DS and DT are denoted respectively by nS =
|XS| and nT = |XT|, and no restrictions are imposed on them.
– Each time series in DS or DT has the same length d.
– The source and target domain instances are randomly sampled from the true
underlying distribution.
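To make the setting concrete, the following is a minimal sketch of how source and target domains could be represented in code under the assumptions above (fixed-length series, labels only partially available in the source domains). The class, array shapes, and variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Domain:
    """A domain holds n fixed-length time series and, optionally, labels."""
    X: np.ndarray                    # shape (n, d): one row of measurements per series
    y: Optional[np.ndarray] = None   # shape (n,): "normal", "anomaly", or None per series

# Source domain D_S: only a handful of instances carry expert labels (d = 24 is arbitrary here).
labels = np.full(850, None, dtype=object)
labels[:20] = "normal"
labels[20:25] = "anomaly"
D_S = Domain(X=np.random.randn(850, 24), y=labels)

# Target domain D_T: same variable and length d, but no labels at all (Y_T = ∅).
D_T = Domain(X=np.random.randn(850, 24), y=None)
```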
4 Methodology
In order to learn the model for anomaly detection fT (·) in the target domain, we
transfer labeled instances from di↵erent source domains. To avoid situations of
negative transfer (e.g., transferring an instance with the label anomaly that maps
to a normal instance in the target domain), a decision function decides whether
to transfer an instance or not. First, we outline the intuitions behind the decision
function based on two commonly known characteristics of anomalous instances
(Sec. 4.1). Then, we propose two distinct decision functions (Sec. 4.2 and 4.3).
Finally, we describe a method for supervised anomaly detection in the target
domain based on the transferred instances (Sec. 4.4).
4.1 Instance-based transfer learning for anomaly detection
Recall that anomalies are assumed to be infrequent and unexpected. In the transfer
setting, these insights translate into two properties. Property 1: a source instance
labeled normal should be frequent, i.e., lie in a high-density region of the target
domain data. Property 2: a source instance labeled anomaly should be infrequent,
i.e., lie in a low-density region of the target domain data. Notice that in the latter
property the time series xS can have any form, while this is not true for the first
property, where its form is restricted by the distribution of the target domain data.
Given a labeled instance (xS, yS) ∈ DS that we want to transfer to the target
domain, Property 1 and Property 2 allow us to decide whether or not to transfer it.
We can formally define a weight associated with xS which will be high when the
transfer makes sense, and low when it will likely cause negative transfer:
w_S = \begin{cases} p_T(x_S) & \text{if } y_S = \text{normal} \\ 1 - p_T(x_S) & \text{if } y_S = \text{anomaly} \end{cases}    (1)
4.2 Density-based transfer decision function
Instead of estimating pT directly on entire time series, the densities are estimated
over fixed-length subsequences with a Gaussian kernel density estimator f̂_{T,m},
where hnT is the standard deviation of the Gaussian and the si are the subsequences
of the instances in XT. The Gaussian kernel ensures that instead of simply
counting similar subsequences, the count is weighted for each subsequence si
based on the kernelized distance to sS.
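The paper's exact kernel density estimator f̂_{T,m} is not reproduced above, so the following is only a plausible sketch of a Gaussian-kernel density estimate over target subsequences consistent with the description; the subsequence length m, the bandwidth h (playing the role of h_{nT}), and the helper names are assumptions.

```python
import numpy as np

def subsequences(x, m):
    """Split one length-d series into consecutive, non-overlapping length-m subsequences."""
    return [x[j:j + m] for j in range(0, len(x) - m + 1, m)]

def kde_subsequence(s, target_subseqs, h):
    """Kernel-weighted 'count' of target subsequences similar to s: each target
    subsequence contributes a Gaussian of its Euclidean distance to s (std. dev. h)."""
    dists = np.array([np.linalg.norm(s - t) for t in target_subseqs])
    return np.mean(np.exp(-0.5 * (dists / h) ** 2))
```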
Estimating the densities for the subsequences yields more accurate estimates
given the reduced dimensionality, but simultaneously results in l = d/m estimates
for each time series xS. Hence, we have to adjust Eq. 1 to reflect this new
situation. We only show the case in which the label yS = normal as the reverse
case is straightforward:
w_S = \frac{1}{Z_{\max} - Z_{\min}} \left( \sum_{i=1}^{l} \hat{f}_{T,m}(s_i) - Z_{\min} \right)    (5)

Z_{\max} = \max_{x_T \in \{X_T \cup x_S\}} \sum_{s_j \in x_T} \hat{f}_{T,m}(s_j)    (6)
The sum of the density estimates over the subsequences is normalized using
min-max normalization, such that wS ∈ [0, 1]. Zmin is calculated in the same way
as Zmax in Eq. 6, but taking the minimum instead of the maximum. By setting a
threshold on the final weights, we make a decision on whether or not to transfer.
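Combining Eqs. 5 and 6 with a threshold gives a decision rule such as the sketch below, which reuses the hypothetical subsequences and kde_subsequence helpers from the previous sketch; the handling of the anomaly label simply flips the normalized sum, mirroring Eq. 1.

```python
def series_density(x, target_subseqs, m, h):
    """Inner sum of Eqs. 5-6: total subsequence density of one time series."""
    return sum(kde_subsequence(s, target_subseqs, h) for s in subsequences(x, m))

def density_transfer_decision(x_S, y_S, X_T, m, h, threshold=0.5):
    """Return (w_S, transfer?) for a labeled source instance (x_S, y_S)."""
    target_subseqs = [s for x_T in X_T for s in subsequences(x_T, m)]
    # Z_max and Z_min are taken over the target series plus the candidate (Eq. 6).
    totals = [series_density(x, target_subseqs, m, h) for x in list(X_T) + [x_S]]
    z_min, z_max = min(totals), max(totals)
    w_S = (totals[-1] - z_min) / (z_max - z_min)   # Eq. 5, case y_S = normal
    if y_S == "anomaly":
        w_S = 1.0 - w_S                            # reverse case
    return w_S, w_S >= threshold
```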
4.3 Cluster-based transfer decision function
Our second proposed decision function is also based on the intuitions outlined in
Sec. 4.1. First, the target domain data XT are clustered using k-means clustering.
Second, the resulting set of clusters C over XT is divided into a set of large
clusters and a set of small clusters according to the following definition [5],
where the clusters C1, . . . , Ck are assumed to be sorted by decreasing size and
b marks the boundary between large and small clusters:
\sum_{i=1}^{b} |C_i| \ge n_T \times \alpha    (7)

\frac{|C_b|}{|C_{b+1}|} \ge \beta    (8)
LC = {Ci | i ≤ b} and SC = {Ci | i > b} are respectively the sets of large and small
clusters, and LC ∪ SC = C.
Furthermore, we define the radius of a cluster as ri = max_{xj ∈ Ci} ‖xj − ci‖²,
where ci is the center of cluster Ci.
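A sketch of the clustering step, the large/small split of Eqs. 7–8, and the cluster radii is given below, using k-means from scikit-learn. The excerpt does not state how the two conditions are combined to pick the boundary index b, so here either condition is allowed to mark it; that choice, and all function and variable names, are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_target(X_T, k=10, alpha=0.95, beta=4.0):
    """Cluster X_T with k-means and split the clusters into large (LC) and small (SC)."""
    km = KMeans(n_clusters=k, n_init=10).fit(X_T)
    sizes = np.bincount(km.labels_, minlength=k)
    order = np.argsort(-sizes)                 # cluster indices by decreasing size
    n_T = len(X_T)

    b = k - 1                                  # default: all clusters are "large"
    for i in range(k - 1):
        covers_most = sizes[order[:i + 1]].sum() >= alpha * n_T      # Eq. 7
        size_gap = sizes[order[i]] >= beta * sizes[order[i + 1]]     # Eq. 8
        if covers_most or size_gap:            # assumption: either condition suffices
            b = i
            break

    LC = set(order[:b + 1].tolist())           # large clusters
    SC = set(order[b + 1:].tolist())           # small clusters
    centers = km.cluster_centers_
    # radius r_i: maximum squared distance of a member to its cluster center
    radii = np.array([((X_T[km.labels_ == i] - centers[i]) ** 2).sum(axis=1).max()
                      if sizes[i] > 0 else 0.0 for i in range(k)])
    return centers, radii, LC, SC
```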
Lastly, a decision is made whether or not to transfer a labeled instance xS
from the source domain. Intuitively, and in line with the two properties from
Sec. 4.1, anomalies in XT should fall in small clusters, while large clusters contain
the normal instances. Transferred labeled instances from the source domain should
adhere to the same intuitions. Each transferred instance is assigned to the cluster
Ci ∈ C for which ‖xS − ci‖² is minimal. An instance is only transferred in two
cases. First, if the instance has label normal and is assigned to a cluster Ci such
that Ci ∈ LC and the distance of the instance to the cluster center is less than or
equal to the radius of the cluster. Second, if the instance has label anomaly and
fulfills either of two conditions: the instance is assigned to a cluster Ci such that
Ci ∉ LC, or it is assigned to a cluster Ci such that Ci ∈ LC and the distance of
the instance to the cluster center is larger than the radius of the cluster. In all
other cases there is no transfer.
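Given the clusters, centers, and radii, the two transfer rules above reduce to a few lines. This sketch reuses the hypothetical cluster_target output and compares squared distances, consistent with the squared-distance radius defined earlier.

```python
import numpy as np

def cluster_transfer_decision(x_S, y_S, centers, radii, LC):
    """Decide whether to transfer the labeled source instance (x_S, y_S)."""
    d2 = ((centers - x_S) ** 2).sum(axis=1)    # squared distance to every center
    i = int(np.argmin(d2))                     # assigned (closest) cluster
    within_radius = d2[i] <= radii[i]
    if y_S == "normal":
        # Rule 1: normal instances must land inside a large cluster.
        return i in LC and within_radius
    # Rule 2: anomalies must land in a small cluster, or outside the radius of a large one.
    return (i not in LC) or not within_radius
```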
4.4 Anomaly detection in the target domain
After transferring instances from one or multiple source domains to the target
domain using the decision functions in Sec. 4.2 and 4.3, we can construct a
classifier in the target domain to detect anomalies. Ignoring the unlabeled target
domain data, we only use the set of labeled data L = {(x_i, y_i)}_{i=1}^{n_A},
where nA is the number of transferred instances. It has been shown that a
one-nearest-neighbor (1NN) classifier with dynamic time warping (DTW) or
Euclidean distance is a strong candidate for time series classification [9]. To that
end, we construct a 1NN-DTW classifier on top of L to predict the labels of
unseen instances.
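For completeness, a minimal 1NN-DTW classifier over the transferred labeled set L could look as follows; this is a plain dynamic-programming DTW without the warping-window constraints or lower-bounding speedups commonly used in practice, and the function names are ours.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance between two series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def predict_1nn_dtw(L_X, L_y, x):
    """Label x with the label of its DTW-nearest neighbour among the transferred instances."""
    dists = [dtw_distance(x, x_i) for x_i in L_X]
    return L_y[int(np.argmin(dists))]
```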
5 Experimental evaluation
In this section we aim to answer the following research question:
– Do the proposed decision functions for instance-based time series transfer
succeed in transferring useful knowledge between the source and target domains?
First, we introduce the unsupervised baseline method to which we will compare
the 1NN-DTW method with instance transfer (Sec. 5.1). Then, we discuss the
data, the experimental setup, and the results (Sec. 5.2).
5.2 Experiments
Data. Due to the lack of readily available benchmarks for the problem outlined
in Sec. 2, we experimentally evaluate on a real-world data set obtained from
a large company. The provided data detail resource usage continuously tracked
over a period of approximately two years. Since the usage is highly dependent
on the time of day, we can generate 24 (hourly) data sets by grouping the usage
data by hour. Each data set contains about 850 di↵erent time series. For a
limited number of these series in each set we possess expert labels indicating
either normal or anomaly.
Experimental setup. In turn, we treat each of the 24 data sets as the target do-
main and the remaining data sets as source domains. We consider transferring
from a single source or multiple sources. Any labeled examples in the target
domain are set aside and serve as the test set. First, the proposed decision
functions are used to transfer instances from either a single source domain or
multiple source domains combined to the target domain. Then, we train both
the unsupervised CBLOF model (Sec. 5.1) and the supervised 1NN-DTW anomaly
detection model that uses the labeled instances transferred to the target domain
(Sec. 4.4). Finally, both models predict the labels of the test set, and we report
classification accuracy. For the density-based approach, we set the threshold on
the final weights to 0.5. For the cluster-based approach, we selected α = 0.95,
β = 4, and k = 10 clusters.
Fig. 1: The graph plots the mean classification accuracy and the standard deviation
for each of the 24 (hourly) data sets. These statistics are calculated after considering
7 randomly chosen data sets as source domains, and performing the analysis for each
combination of source and target. The plot indicates both transfer approaches with
1NN-DTW perform quite similarly, while outperforming the unsupervised method in
21 of the 24 data sets.
6 Conclusion
In this paper we introduced two decision functions to guide instance-based transfer
learning in the case where the instances are time series and the task at hand is
anomaly detection. Both functions are based on two commonly known insights about
anomalies: they are infrequent and unexpected. We experimentally evaluated both
decision functions, combined with a 1NN-DTW classifier, on real-world resource-usage
data and found that they outperform an unsupervised anomaly detection approach on
most target data sets.
Table 1: A limited excerpt of the experimental evaluation. The number of transferred
instances is denoted by nA . Density-based is the density-based decision function with
1NN-DTW anomaly detection. Cluster-based is the cluster-based decision function with
1NN-DTW. CBLOF is the unsupervised anomaly detection. All reported numbers are
classification accuracies on a hold-out test set in the target domain, rounded off. Combo
is the combination of 7 separate, randomly chosen source domains.
References
1. Andrews, J.T., Tanay, T., Morton, E., Griffin, L.: Transfer representation-learning
for anomaly detection. ICML (2016)
2. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM com-
puting surveys (CSUR) 41(3), 1–72 (2009)
3. Chattopadhyay, R., Sun, Q., Fan, W., Davidson, I., Panchanathan, S., Ye, J.:
Multisource domain adaptation and its application to early detection of fatigue.
ACM Transactions on Knowledge Discovery from Data (TKDD) 6(4), 18 (2012)
4. Fukunaga, K.: Introduction to statistical pattern recognition. Academic press
(2013)
5. Kha, N.H., Anh, D.T.: From cluster-based outlier detection to time series discord
discovery. In: Revised Selected Papers of the PAKDD 2015 Workshops on Trends
and Applications in Knowledge Discovery and Data Mining-Volume 9441. pp. 16–
28. Springer-Verlag New York, Inc. (2015)
6. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on knowledge
and data engineering 22(10), 1345–1359 (2010)
7. Spiegel, S.: Transfer learning for time series classification in dissimilarity spaces.
In: Proceedings of AALTD 2016: Second ECML/PKDD International Workshop
on Advanced Analytics and Learning on Temporal Data. p. 78 (2016)
8. Torrey, L., Shavlik, J.: Transfer learning. Handbook of Research on Machine Learn-
ing Applications and Trends: Algorithms, Methods, and Techniques 1, 242 (2009)
9. Wei, L., Keogh, E.: Semi-supervised time series classification. In: Proceedings of
the 12th ACM SIGKDD international conference on Knowledge discovery and data
mining. pp. 748–753. ACM (2006)
10. Weiss, K., Khoshgoftaar, T.M., Wang, D.: A survey of transfer learning. Journal
of Big Data 3(1), 9 (2016)