Evaluation Metrics For Anomaly Detection Algorithms
DOI: 10.2478/ausi-2019-0008
György KOVÁCS
Technical University of Cluj-Napoca, Romania
email: [email protected]
Computing Classification System 1998: G.2.2
Mathematics Subject Classification 2010: 68R15
Key words and phrases: anomaly detection, classification, evaluation metrics

1 Introduction

Anomaly detection is the process of identifying erroneous data in large data sets, in order to improve the quality of further data processing. An anomaly detection method classifies data into normal and abnormal values. The selection of the best detection method greatly depends on the characteristics of the data set. Therefore, we need metrics to evaluate the performance of different methods on a given data set.
Traditionally, in order to evaluate the quality of a classification, the confusion matrix or one of its derived metrics is used. These metrics work well when the data set does not have a temporal dimension.
The anomaly detection task has certain particularities when it comes to
time-series data. The temporal dimension that may be lacking in other types
of data sets can be taken into account in order to improve the evaluation of
these methods.
In this paper we propose evaluation metrics that are more appropriate for time series. The basic idea of the new metrics is to take into consideration the temporal distance between the true and the predicted anomaly points. This way, a small time shift between the true and the detected anomaly is considered a good result, as opposed to the traditional metrics, which would consider it an erroneous detection.
Through a number of experiments, we demonstrate that our proposed metrics are closer to the intuition of a human expert.
The remainder of this paper is organized as follows: Section 2 discusses how time-series classification is used in the field and what metrics are used to evaluate the quality of classifications. Section 3 introduces the notation used throughout the paper and defines the anomaly detection problem for time-series data; it also lists the requirements we expect a metric that takes temporal distances into account to satisfy. Section 4 presents the classification metrics we propose, which are evaluated in Section 5. There we check whether the requirements from Section 3 hold for our metrics and compare them with the traditional confusion matrix derived metrics such as accuracy, precision, and recall. We also present the results of applying our methods to real-world data. Section 6 concludes the paper.
2 Related work
Anomaly detection in time-series data is an important subset of generic anomaly detection. Much work has been done in developing anomaly detection methods. Some work has also been done in developing better metrics.
In many applications it is more efficient to perform feature extraction on the time-series data and to classify based on those features rather than on the actual time series, as is the case in [7]. This is due to the fact that in many applications the volume of time-series data is large and multi-dimensional; it is not easily analyzed, and where speed is important it is not practical to run algorithms directly on the raw data. Instead, the time series is split into segments, and for each segment features such as the mean value, the maximum and minimum amplitude and so on are calculated. Classification methods such as K-Nearest Neighbors (KNN) can then be used to classify each segment of time-series data.
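To make the segment-and-classify approach concrete, the following is a minimal sketch. The fixed window length, the small feature set (mean, minimum, maximum, standard deviation), the random example data and the use of scikit-learn's KNeighborsClassifier are assumptions made only for illustration and do not come from the cited works.

```python
# Minimal sketch of the segment-feature approach described above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def segment_features(series, window=50):
    """Split a 1-D series into fixed-length segments and compute
    simple per-segment features (mean, min, max, standard deviation)."""
    n_segments = len(series) // window
    feats = []
    for i in range(n_segments):
        seg = series[i * window:(i + 1) * window]
        feats.append([seg.mean(), seg.min(), seg.max(), seg.std()])
    return np.array(feats)

# Hypothetical usage: one label (0 = normal, 1 = anomalous) per segment.
rng = np.random.default_rng(0)
train_series = rng.normal(size=5000)
train_labels = rng.integers(0, 2, size=len(train_series) // 50)

X_train = segment_features(train_series)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, train_labels)

test_series = rng.normal(size=1000)
print(clf.predict(segment_features(test_series)))
```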
The confusion matrix is generally used in the context of time-series classification, as is the case in [1]. In [3] the authors use the confusion matrix explicitly as an input to train the classification model.
Better metrics for time series have been proposed. In [2] the authors propose a metric that can differentiate between the generative processes of the time-series data. In [4] the authors propose a number of metrics such as Average Segmentation Count (ASC), Absolute Segmentation Distance (ASD) and Average Direction Tendency (ADT). These metrics were developed for evaluating a segmentation of a time series, but they can be used just as well for evaluating the quality of anomaly detection. We slightly modify the names of ASC and ASD by replacing segmentation with detection; in the experiments section we will use these metrics as Average Detection Count (ADC) and Absolute Detection Distance (ADD) and compare them with our own.
3 Problem statement
3.1 Notation
In order to express the ideas presented in this paper concisely, we define here the main concepts of anomaly detection and the notation that will be used in the following sections.
This paper discusses concepts related to time-series data. By time-series data
we mean an ordered set of real values that are indexed by natural numbers.
We will not be discussing continuous values, since in practice we measure by
sampling.
$$X = \{x_0, x_1, x_2, \ldots, x_n\}, \quad x_t \in \mathbb{R}$$
The main focus of this paper is classifications. The set of class labels, which will be referred to as a classification, is similar to X, the difference being that while X consists of real values, C consists of binary values {0, 1}. We consider values labelled 0 to be normal values and values labelled 1 to be anomalous values.

$$C = \{c_0, c_1, c_2, \ldots, c_n\}, \quad c_t \in \{0, 1\}$$

As an example, consider the following time series:

X = {8, 8, 8, 8, 42, 8, 8, 8, 8}
In the example given, w = 0, where w denotes the size of the window the classifier looks at; thus the classifier only looks at one point for each classification decision. The result is the following classification:

C = {0, 0, 0, 0, 1, 0, 0, 0, 0}
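As a small illustration, the following sketch reproduces the classification above; the concrete decision rule (distance from the median above a fixed threshold) is an assumption made only for this example.

```python
X = [8, 8, 8, 8, 42, 8, 8, 8, 8]

median = sorted(X)[len(X) // 2]   # 8
threshold = 10                    # assumed threshold

# Window w = 0: each point is labelled on its own, without looking at neighbours.
C = [1 if abs(x - median) > threshold else 0 for x in X]
print(C)  # [0, 0, 0, 0, 1, 0, 0, 0, 0]
```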
In order to decide which one approximates the original function better, some metric is used.
This problem can be expressed as a comparison of classifications generated by different classifiers. We consider the target classifier C0: this is the function that we would like to reproduce. Given a number of different classifiers C1, C2, C3, we would like to find which one approximates C0 the most.
In order to do this, we compare the classifications generated by them given the same training data X. We will use the graphical representation from Figure 2.
Figure 2: A comparison of three classifications, the first one being the target classification C0 and the rest being candidate classifications. One can see that C1 identifies the anomaly prematurely, while C2 identifies two anomalies, one prematurely and one with a delay.
For a classification Ci, the number of detected anomalies can be counted as:

$$\mathrm{count}(C_i) = \sum_{j=1}^{n} c_j, \quad c_j \in C_i$$
A simple metric can then be defined as the difference between the number of anomalies in the candidate and in the target classification, m↓(Ci) = |count(Ci) − count(C0)|. Using this metric, the score for C1 is m↓(C1) = 0 and the score for C2 is m↓(C2) = 1. We can say that the first classification is better than the second one, since it has a lower value. This is represented by the subscript arrow pointing down. A metric where a higher value is better is denoted by a little arrow pointing up.
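A minimal sketch of the count function and of a count-difference score that reproduces the values above; the exact definition of m↓ may differ in detail, so treat this only as an illustration.

```python
def count(C):
    """Number of points labelled as anomalies in a classification."""
    return sum(C)

def m_down(C_target, C_candidate):
    """Count-difference score: 0 when both classifications flag the
    same number of anomalies (lower is better)."""
    return abs(count(C_candidate) - count(C_target))

C0 = [0, 0, 0, 1, 0, 0, 0, 0, 0]  # target: one anomaly
C1 = [0, 1, 0, 0, 0, 0, 0, 0, 0]  # one anomaly at the wrong position
C2 = [0, 1, 0, 0, 0, 0, 1, 0, 0]  # two anomalies

print(m_down(C0, C1))  # 0
print(m_down(C0, C2))  # 1
```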
The examples presented in this section are simplistic and are only meant to familiarize the reader with the notation used in the remainder of this paper.
Closeness Going further, we consider that the closer a detection is to the actual anomaly in the target classification, the better that classification is. The graphical representation can be seen in Figure 3e. While both classifications missed the anomaly, C1 detected an anomaly closer to the target one than C2 did.
[Figure 3: panels (a) Detection, (b) False Negative, (c) Less Wrong, (d) Near Detection, (e) Closeness, (f) Locally Perfect vs Globally Good, each comparing C0, C1 and C2.]
Locally Perfect vs Globally Good The last rule takes into consideration the fact that we can have clusters of anomalies. Each cluster can contain a single anomaly or multiple anomalies in close proximity to each other. This rule assumes that it is better to discover each cluster than to perfectly match every single anomaly within one single cluster. In the example from Figure 3f, we can see that C0 has two clusters, one with a single anomaly and one with five close anomalies. We consider C2 worse than C1 even though it perfectly described the anomalies from the second cluster, because it failed to detect the first cluster.
4 Proposed metrics
4.1 Temporal distance method
This method consists of calculating the sum of all distances between anomalies from the two classifications. It is similar to the ADD metric from [4], the difference being that ADD looks only in the proximity of the detection, while our method looks at the closest detection, regardless of proximity. To this end we define a function that calculates the distance between each anomaly of the first classification and the corresponding closest anomaly from the second classification, f_closest : C² → ℝ.
Next, we can define our method using the function described above in the following manner:

$$TD_{\downarrow}(C_i) = TTC + CTT$$

where TTC stands for Target To Candidate and is given by f_closest(C0, Ci), and CTT stands for Candidate To Target and is given by f_closest(Ci, C0). Here C0 stands for the target classification and Ci stands for the candidate classification.
Note that lower values produced by this method are better than higher ones.
This is represented by a little downward pointing arrow.
We calculate both the sum of all the distances of the closest anomalies from
the candidate classification to the target classification (CTT), and the sum
of all distances of the closest anomalies from the target classification to the
candidate one (TTC). By adding these two together, we have a metric that
punishes false negative values and false positive values. TTC punishes false
negatives and CTT punishes false positives.
A graphical visualization of the metric can be seen in Figure 4.

[Figure 4: the temporal distances ∆t1 and ∆t2 between the two anomalies of C0 and the single anomaly of C1.]

We can see that C0 has two anomalies, but C1 only has one. Thus the closest anomaly to both anomalies from C0 is the single anomaly in C1. Because the two temporal distances are ∆t1 and ∆t2 respectively, TTC = ∆t1 + ∆t2. However, from the perspective of C1, the closest anomaly to its only anomaly is the first anomaly of C0. Only that distance is taken into account, thus CTT = ∆t1. Finally, the resulting value calculated by the metric is TD↓(C1) = 2∆t1 + ∆t2. One can see that the best possible value for this metric is 0.
We also define a variation of this metric that we dubbed the Squared Temporal Distance (STD). STD is defined similarly to TD, except that the distances are squared before being added up. This is done in order to punish larger distances more than smaller ones. For example, the value of STD for the given example is 2∆t1² + ∆t2².
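The following is a minimal sketch of TD and STD as described above, assuming anomaly positions are the indices of the points labelled 1 and that both classifications contain at least one anomaly (otherwise the metric is undecidable, as discussed in the evaluation section).

```python
def anomaly_indices(C):
    """Positions of the points labelled as anomalies."""
    return [i for i, c in enumerate(C) if c == 1]

def f_closest(C_from, C_to):
    """Sum of the distances from every anomaly in C_from to the closest
    anomaly in C_to. Assumes C_to contains at least one anomaly."""
    targets = anomaly_indices(C_to)
    return sum(min(abs(i - j) for j in targets) for i in anomaly_indices(C_from))

def td(C_target, C_candidate):
    """Temporal Distance: TTC + CTT (lower is better)."""
    return f_closest(C_target, C_candidate) + f_closest(C_candidate, C_target)

def std(C_target, C_candidate):
    """Squared Temporal Distance: like TD, but each closest distance is squared."""
    def squared_sum(C_from, C_to):
        targets = anomaly_indices(C_to)
        return sum(min(abs(i - j) for j in targets) ** 2 for i in anomaly_indices(C_from))
    return squared_sum(C_target, C_candidate) + squared_sum(C_candidate, C_target)

C0 = [0, 0, 1, 0, 0, 0, 0, 1, 0]  # target with two anomalies
C1 = [0, 0, 0, 1, 0, 0, 0, 0, 0]  # single detection, one step late

print(td(C0, C1))   # 2*dt1 + dt2     = 2*1 + 4  = 6
print(std(C0, C1))  # 2*dt1^2 + dt2^2 = 2*1 + 16 = 18
```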
4. False Anomaly (FA), Figure 5d: every normal target value that has an associated anomalous candidate value is counted as a false anomaly.
[Figure 5: panels (a) Exact Match (EM), (b) Detected Anomaly (DA), (c) Missed Anomaly (MA), (d) False Anomaly (FA), each comparing C0 and C1.]
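A hedged sketch of counting the four values follows. Only FA is defined explicitly in this excerpt, so the rules used for EM, DA and MA (exact index match, match within an assumed tolerance window, and no match at all) are assumptions based on the panel titles of Figure 5.

```python
def em_da_ma_fa(C_target, C_candidate, tolerance=1):
    """Count Exact Matches, Detected Anomalies, Missed Anomalies and
    False Anomalies; the tolerance window for DA is an assumption."""
    target = [i for i, c in enumerate(C_target) if c == 1]
    candidate = [i for i, c in enumerate(C_candidate) if c == 1]

    em = sum(1 for i in target if i in candidate)                   # same index flagged in both
    da = sum(1 for i in target if i not in candidate
             and any(abs(i - j) <= tolerance for j in candidate))   # near match within the window
    ma = len(target) - em - da                                      # target anomalies never detected
    fa = sum(1 for j in candidate if C_target[j] == 0)              # detections on normal target values
    return em, da, ma, fa

C0 = [0, 1, 0, 0, 1, 0, 0, 1, 0]
C1 = [0, 1, 0, 0, 0, 1, 0, 0, 0]
print(em_da_ma_fa(C0, C1))  # (1, 1, 1, 1)
```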
These four values can be used further to define derived metrics. We define
two such metrics here:
[Figure: the temporal distance ∆t between an anomaly in C0 and the closest anomaly in C1.]
We denote by WS the sum of all the weighted values produced for each target anomaly. Note that we only take into consideration the closest candidate anomaly. We also count all the false anomaly cases, FA, similarly to the previous section.
We define the Weighted Detection Difference metric using WS and FA: we scale FA by some factor and subtract it from WS,

$$WDD_{\uparrow} = WS - w_f \cdot FA$$
where w_f is the weight of the false anomalies. Other functions were also considered, such as a linear function:

$$f(\Delta t) = 1 - \frac{\Delta t}{t_{max}}$$

or a variant that punishes outliers equally:

$$f(\Delta t) = \begin{cases} 1 - \frac{\Delta t}{t_{max}} & \text{if } \Delta t < t_{max} \\ -1 & \text{otherwise} \end{cases}$$
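A sketch of the WDD computation using the linear weighting function above; the paper's primary weighting function is not shown in this excerpt, and the values of w_f and t_max are chosen here only for illustration.

```python
def linear_weight(dt, t_max):
    """Linear weighting function from the text: 1 - dt / t_max."""
    return 1 - dt / t_max

def capped_linear_weight(dt, t_max):
    """Variant that punishes all detections farther than t_max equally."""
    return 1 - dt / t_max if dt < t_max else -1.0

def wdd(C_target, C_candidate, weight=linear_weight, t_max=5, w_f=1.0):
    """Weighted Detection Difference: WS - w_f * FA (higher is better).
    Assumes the candidate classification contains at least one anomaly."""
    target = [i for i, c in enumerate(C_target) if c == 1]
    candidate = [i for i, c in enumerate(C_candidate) if c == 1]

    # WS: weighted value of the closest candidate anomaly for each target anomaly
    ws = sum(weight(min(abs(i - j) for j in candidate), t_max) for i in target)
    # FA: candidate anomalies that fall on normal target values
    fa = sum(1 for j in candidate if C_target[j] == 0)
    return ws - w_f * fa

C0 = [0, 1, 0, 0, 0, 0, 1, 0, 0]
C1 = [0, 0, 1, 0, 0, 0, 1, 0, 0]
print(wdd(C0, C1))  # 0.8 with these (assumed) parameters
```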
real numbers. These cases arise because the metric may use the number of detected anomalies as a divisor; if there are no anomalies, that operation cannot be performed. We call this situation undecidable and use a dash to denote it.
In the second example we produced the ranking in a similar fashion to the
previous example. The actual change point happens in the third group of
anomalies from C0 . We consider that only the change point is an anomaly.
The rest of the anomalies are outliers. Outliers can be found both before and
after the change point.
The metrics that best matched the imposed ranking this time were Recall, ADD, STD and DAIR. In this particular example the classical methods had distances similar to those of the proposed metrics. We believe this is because, in this particular example, all of the anomaly groups were made up of sequential anomalies.
Consider a classifier that is always a few time-samples behind with its classification. If all anomalies are point anomalies, all candidate anomalies would miss the target anomalies and a classical metric would produce a bad score. If the anomalies were not point anomalies but continuous intervals, then even though the candidate anomaly intervals were shifted, most anomaly points would still overlap, producing a better score.
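The following toy example illustrates this point, using recall as the pointwise metric; the series and the one-sample delay are made up for illustration.

```python
def recall(C_target, C_candidate):
    """Fraction of target anomaly points that the candidate also flags."""
    hits = sum(1 for a, b in zip(C_target, C_candidate) if a == 1 and b == 1)
    return hits / sum(C_target)

def delayed(C, k=1):
    """The same classification, produced k samples too late."""
    return [0] * k + C[:-k]

point_anomalies    = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
interval_anomalies = [0, 1, 1, 1, 1, 0, 0, 1, 1, 1]

print(recall(point_anomalies, delayed(point_anomalies)))        # 0.0  - every point anomaly is missed
print(recall(interval_anomalies, delayed(interval_anomalies)))  # ~0.71 - the intervals still mostly overlap
```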
Table 1: For each rule defined in Section 3 we verify whether the metric respects the corresponding relation.
The table shows that while some of the proposed metrics may sometimes
$$\mathrm{Distance} = \sum_{i=1}^{5} |r_i - \hat{r}_i|$$
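A small sketch of this ranking distance, where r_i is the rank imposed by the human expert on candidate Ci and r̂_i the rank induced by the metric under evaluation; the example ranks below are invented.

```python
def ranking_distance(expert_ranks, metric_ranks):
    """Sum of absolute rank differences over the five candidate classifications."""
    return sum(abs(r - r_hat) for r, r_hat in zip(expert_ranks, metric_ranks))

expert = [1, 2, 3, 4, 5]  # ranking imposed by the human expert on C1..C5
metric = [2, 1, 3, 5, 4]  # ranking induced by some metric (invented)
print(ranking_distance(expert, metric))  # 4
```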
[Figure: the target classification C0 and the candidate classifications C1–C5.]
Table 2: CO2 emissions. Note that all missing values are considered to be the
worst scores.
anomaly exists. Looking at the table of results, we can see that precision and accuracy cannot tell the difference between these classifications. However, we would argue that C2 is the clear winner. While its detections of the anomalies are off by one or two time samples, they are still in the neighborhood of the true anomalies; only the fourth and fifth anomalies are missed by it.
The metrics that best matched our ranking were TD, TDIR and DAIR. STD and ADD did not match exactly, but the classifications they ranked highest are also good ones. This example also shows the potential instability of the WDD method, which produced a ranking as bad as the ones produced by the confusion matrix derived metrics.
[Figure: the target classification C0 and the candidate classifications C1–C5.]
6 Conclusion
In this paper we tackled with the problem of qualitative metrics applied to
anomaly detection in time-series data. We concluded that classical metrics
such as Accuracy, Precision and Recall do not take into consideration the
time dimension of time-series data, in which near matches might be just as
good as exact matches, or at least they are better than complete misses.
We defined the problem in more rigorous terms, and provided some require-
ments that we believe a good metric should meet. Next we defined some new
metrics. We checked whether or not our proposed metrics respect the require-
ments set out by us previously. We also compared the performance of our
References
[1] E. Baidoo, J. Lewis Priestley, An Analysis of Accuracy using Logistic Regression and Time Series, Grey Literature from PhD Candidates 2 (2016), https://digitalcommons.kennesaw.edu/dataphdgreylit/2/
[2] J. Caiado, N. Crato, D. Peña, A periodogram-based metric for time series classification, Computational Statistics & Data Analysis 50 (2006) 2668–2684.
[3] B. Esmael, A. Arnaout, R. K. Fruhwirth, G. Thonhauser, Improving time series classification using Hidden Markov Models, Proceedings of the 12th International Conference on Hybrid Intelligent Systems (HIS), 2012, pp. 502–507.
[4] A. Gensler, B. Sick, Novel Criteria to Measure Performance of Time Series Segmentation Techniques, Proceedings of the LWA 2014 Workshops: KDML, IR, FGWM, Aachen, Germany, 2014.
[5] R. J. Hyndman, Time Series Data Library, accessed 2018-11-12, https://datamarket.com/data/list/?q=provider:tsdl
[6] N. Laptev, S. Amizadeh, I. Flint, Generic and Scalable Framework for Automated Time-series Anomaly Detection, Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1939–1947.
[7] S. I. Lee, C. P. Adans-Dester, M. Grimaldi, A. V. Dowling, P. C. Horak, R. M. Black-Schaffer, P. Bonato, J. T. Gwin, Enabling stroke rehabilitation in home and community settings: a wearable sensor-based approach for upper-limb motor training, IEEE Journal of Translational Engineering in Health and Medicine 6 (2018) 1–11.
[8] Gh. Sebestyen, A. Hangan, Gy. Kovacs, Z. Czako, Platform for Anomaly Detection in Time-Series, XXXIV. Kandó Conference, Budapest, Hungary, 2018.