Evaluation Metrics For Anomaly Detection Algorithms
DOI: 10.2478/ausi-2019-0008
György KOVÁCS
Technical University of Cluj-Napoca, Romania
email: [email protected]
Computing Classification System 1998: G.2.2
Mathematics Subject Classification 2010: 68R15
Key words and phrases: anomaly detection, classification, evaluation metrics

1 Introduction

Anomaly detection is the process of identifying erroneous data in large data sets, in order to improve the quality of further data processing. An anomaly detection method classifies data into normal and abnormal values. The selection of the best detection method greatly depends on the characteristics of the data set. Therefore, we need metrics to evaluate the performance of different methods on a given data set.
Traditionally, in order to evaluate the quality of a classification, the confusion matrix or one of its derived metrics is used. These metrics work well when the data set does not have a temporal dimension.
The anomaly detection task has certain particularities when it comes to
time-series data. The temporal dimension that may be lacking in other types
of data sets can be taken into account in order to improve the evaluation of
these methods.
In this paper we propose evaluation metrics that are more appropriate for time series. The basic idea of the new metrics is to take into consideration the temporal distance between the true and the predicted anomaly points. This way, a small time shift between the true and the detected anomaly is considered a good result, as opposed to the traditional metrics, which would consider it an erroneous detection.
Through a number of experiments, we demonstrate that our proposed metrics are closer to the intuition of a human expert.
The remainder of this paper is organized as follows: Section 2 discusses how time-series classification is used in the field and what metrics are used to evaluate the quality of classifications. Section 3 introduces the notation used throughout the paper and defines the anomaly detection problem for time-series data; it also lists the requirements we expect a metric that takes temporal distances into account to satisfy. Section 4 presents the classification metrics we propose, which are evaluated in Section 5. There we check whether the requirements from Section 3 hold for our metrics and compare them with the traditional confusion matrix derived metrics such as accuracy, precision, and recall. We also present the results of applying our methods to real-world data. Section 6 concludes the paper.
2 Related work
Anomaly detection in time-series data is an important subset of generic anomaly detection. Much work has been done in developing anomaly detection methods. Some work has also been done in developing better metrics.
In many applications it is more efficient to perform feature extraction on the time-series data and to classify based on those features rather than on the actual time series, as is the case in [7]. This is due to the fact that in many applications the volume of time-series data is large and multi-dimensional; it is not easily analyzed, and where speed is important it is not practical to run algorithms directly on the raw data. Instead, the time series is split into segments, and for each segment features such as the mean value, the maximum and minimum amplitude and so on are calculated. Classification methods such as K-Nearest Neighbors (KNN) can then be used to classify each segment of time-series data.
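To make the segment-and-classify approach concrete, the following is a minimal sketch. The fixed window length, the small feature set (mean, minimum, maximum, standard deviation), the random example data and the use of scikit-learn's KNeighborsClassifier are assumptions made only for illustration and do not come from the cited works.

```python
# Minimal sketch of the segment-feature approach described above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def segment_features(series, window=50):
    """Split a 1-D series into fixed-length segments and compute
    simple per-segment features (mean, min, max, standard deviation)."""
    n_segments = len(series) // window
    feats = []
    for i in range(n_segments):
        seg = series[i * window:(i + 1) * window]
        feats.append([seg.mean(), seg.min(), seg.max(), seg.std()])
    return np.array(feats)

# Hypothetical usage: one label (0 = normal, 1 = anomalous) per segment.
rng = np.random.default_rng(0)
train_series = rng.normal(size=5000)
train_labels = rng.integers(0, 2, size=len(train_series) // 50)

X_train = segment_features(train_series)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, train_labels)

test_series = rng.normal(size=1000)
print(clf.predict(segment_features(test_series)))
```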
The confusion matrix is generally used in the context of time-series classification, as is the case in [1]. In [3] the authors use the confusion matrix explicitly as an input to train the classification model.
Better metrics for time series have been proposed. In [2] the authors propose a metric that can differentiate between the generative processes of the time-series data. In [4] the authors propose a number of metrics such as Average Segmentation Count (ASC), Absolute Segmentation Distance (ASD) and Average Direction Tendency (ADT). These metrics were developed for evaluating a segmentation of a time series, but they can be used just as well for evaluating the quality of anomaly detection. We slightly modify the names of ASC and ASD by replacing segmentation with detection; in the experiments section we will use these metrics as Average Detection Count (ADC) and Absolute Detection Distance (ADD) and compare them with our own.
3 Problem statement
3.1 Notation
In order to express the ideas presented in this paper concisely, we define here the main concepts of anomaly detection and the notation that will be used in the following sections.
This paper discusses concepts related to time-series data. By time-series data
we mean an ordered set of real values that are indexed by natural numbers.
We will not be discussing continuous values, since in practice we measure by
sampling.
$$X = \{x_0, x_1, x_2, \ldots, x_n\}, \quad x_t \in \mathbb{R}$$
The main focus of this paper is classifications. The set of class labels, which will be referred to as a classification, is similar to X, the difference being that while X consists of real values, C consists of binary values {0, 1}. We consider values labelled 0 to be normal values and values labelled 1 to be anomalous values.

$$C = \{c_0, c_1, c_2, \ldots, c_n\}, \quad c_t \in \{0, 1\}$$

As an example, consider the following time series:

X = {8, 8, 8, 8, 42, 8, 8, 8, 8}
In the example given, w = 0, where w denotes the size of the window the classifier looks at; thus the classifier only looks at one point for each classification decision. The result is the following classification:

C = {0, 0, 0, 0, 1, 0, 0, 0, 0}
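As a small illustration, the following sketch reproduces the classification above; the concrete decision rule (distance from the median above a fixed threshold) is an assumption made only for this example.

```python
X = [8, 8, 8, 8, 42, 8, 8, 8, 8]

median = sorted(X)[len(X) // 2]   # 8
threshold = 10                    # assumed threshold

# Window w = 0: each point is labelled on its own, without looking at neighbours.
C = [1 if abs(x - median) > threshold else 0 for x in X]
print(C)  # [0, 0, 0, 0, 1, 0, 0, 0, 0]
```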
In order to decide which one approximates the original function better, some metric is used.
This problem can be expressed as a comparison of classifications generated by different classifiers. We consider the target classifier C0: this is the function that we would like to reproduce. Given a number of different classifiers C1, C2, C3, we would like to find which one approximates C0 the most.
In order to do this, we compare the classifications generated by them given the same training data X. We will use the graphical representation from Figure 2.
Figure 2: A comparison of three classifications, the first one being the target classification C0 and the rest being candidate classifications. One can see that C1 identifies the anomaly prematurely, while C2 identifies two anomalies, one prematurely and one with a delay.
For a classification Ci, the number of detected anomalies can be counted as:

$$\mathrm{count}(C_i) = \sum_{j=1}^{n} c_j, \quad c_j \in C_i$$
A simple metric can then be defined as the difference between the number of anomalies in the candidate and in the target classification, m↓(Ci) = |count(Ci) − count(C0)|. Using this metric, the score for C1 is m↓(C1) = 0 and the score for C2 is m↓(C2) = 1. We can say that the first classification is better than the second one, since it has a lower value. This is represented by the subscript arrow pointing down. A metric where a higher value is better is denoted by a little arrow pointing up.
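A minimal sketch of the count function and of a count-difference score that reproduces the values above; the exact definition of m↓ may differ in detail, so treat this only as an illustration.

```python
def count(C):
    """Number of points labelled as anomalies in a classification."""
    return sum(C)

def m_down(C_target, C_candidate):
    """Count-difference score: 0 when both classifications flag the
    same number of anomalies (lower is better)."""
    return abs(count(C_candidate) - count(C_target))

C0 = [0, 0, 0, 1, 0, 0, 0, 0, 0]  # target: one anomaly
C1 = [0, 1, 0, 0, 0, 0, 0, 0, 0]  # one anomaly at the wrong position
C2 = [0, 1, 0, 0, 0, 0, 1, 0, 0]  # two anomalies

print(m_down(C0, C1))  # 0
print(m_down(C0, C2))  # 1
```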
The examples presented in this section are simplistic and are only meant to familiarize the reader with the notation used in the remainder of this paper.
Closeness Going further, we consider that the closer a detection is to the actual anomaly in the target classification, the better that classification is. The graphical representation can be seen in Figure 3e. While both classifications missed the anomaly, C1 detected an anomaly closer to the target one than C2 did.
[Figure 3: panels (a) Detection, (b) False Negative, (c) Less Wrong, (d) Near Detection, (e) Closeness, (f) Locally Perfect vs Globally Good, each comparing C0, C1 and C2.]
Locally Perfect vs Globally Good The last rule takes into consideration the fact that we can have clusters of anomalies. Each cluster can contain a single anomaly or multiple anomalies in close proximity to each other. This rule assumes that it is better to discover each cluster than to perfectly match every single anomaly within one single cluster. In the example from Figure 3f, we can see that C0 has two clusters, one with a single anomaly and one with five close anomalies. We consider C2 worse than C1 even though it perfectly described the anomalies from the second cluster, because it failed to detect the first cluster.
4 Proposed metrics
4.1 Temporal distance method
This method consists of calculating the sum of all distances between anomalies from the two classifications. It is similar to the ADD metric from [4], the difference being that ADD looks only in the proximity of the detection, while our method looks at the closest detection, regardless of proximity. To this end we define a function that calculates the distance between each anomaly of the first classification and the corresponding closest anomaly from the second classification, f_closest : C² → ℝ.
Next, we can define our method using the function described above in the following manner:

$$TD_{\downarrow}(C_i) = TTC + CTT$$

where TTC stands for Target To Candidate and is given by f_closest(C0, Ci), and CTT stands for Candidate To Target and is given by f_closest(Ci, C0). Here C0 stands for the target classification and Ci stands for the candidate classification.
Note that lower values produced by this method are better than higher ones.
This is represented by a little downward pointing arrow.
We calculate both the sum of all the distances of the closest anomalies from
the candidate classification to the target classification (CTT), and the sum
of all distances of the closest anomalies from the target classification to the
candidate one (TTC). By adding these two together, we have a metric that
punishes false negative values and false positive values. TTC punishes false
negatives and CTT punishes false positives.
A graphical visualization of the metric can be seen in Figure 4.

[Figure 4: the temporal distances ∆t1 and ∆t2 between the two anomalies of C0 and the single anomaly of C1.]

We can see that C0 has two anomalies, but C1 only has one. Thus the closest anomaly to both anomalies from C0 is the single anomaly in C1. Because the two temporal distances are ∆t1 and ∆t2 respectively, TTC = ∆t1 + ∆t2. However, from the perspective of C1, the closest anomaly to its only anomaly is the first anomaly of C0. Only that distance is taken into account, thus CTT = ∆t1. Finally, the resulting value calculated by the metric is TD↓(C1) = 2∆t1 + ∆t2. One can see that the best possible value for this metric is 0.
We also define a variation of this metric that we dubbed the Squared Temporal Distance (STD). STD is defined similarly to TD, except that the distances are squared before being added up. This is done in order to punish larger distances more than smaller ones. For example, the value of STD for the given example is 2∆t1² + ∆t2².
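The following is a minimal sketch of TD and STD as described above, assuming anomaly positions are the indices of the points labelled 1 and that both classifications contain at least one anomaly (otherwise the metric is undecidable, as discussed in the evaluation section).

```python
def anomaly_indices(C):
    """Positions of the points labelled as anomalies."""
    return [i for i, c in enumerate(C) if c == 1]

def f_closest(C_from, C_to):
    """Sum of the distances from every anomaly in C_from to the closest
    anomaly in C_to. Assumes C_to contains at least one anomaly."""
    targets = anomaly_indices(C_to)
    return sum(min(abs(i - j) for j in targets) for i in anomaly_indices(C_from))

def td(C_target, C_candidate):
    """Temporal Distance: TTC + CTT (lower is better)."""
    return f_closest(C_target, C_candidate) + f_closest(C_candidate, C_target)

def std(C_target, C_candidate):
    """Squared Temporal Distance: like TD, but each closest distance is squared."""
    def squared_sum(C_from, C_to):
        targets = anomaly_indices(C_to)
        return sum(min(abs(i - j) for j in targets) ** 2 for i in anomaly_indices(C_from))
    return squared_sum(C_target, C_candidate) + squared_sum(C_candidate, C_target)

C0 = [0, 0, 1, 0, 0, 0, 0, 1, 0]  # target with two anomalies
C1 = [0, 0, 0, 1, 0, 0, 0, 0, 0]  # single detection, one step late

print(td(C0, C1))   # 2*dt1 + dt2     = 2*1 + 4  = 6
print(std(C0, C1))  # 2*dt1^2 + dt2^2 = 2*1 + 16 = 18
```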
4. False Anomaly (FA), Figure 5d: every normal target value that has an associated anomalous candidate value is counted as a false anomaly.
[Figure 5: panels (a) Exact Match (EM), (b) Detected Anomaly (DA), (c) Missed Anomaly (MA), (d) False Anomaly (FA), each comparing C0 and C1.]
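A hedged sketch of counting the four values follows. Only FA is defined explicitly in this excerpt, so the rules used for EM, DA and MA (exact index match, match within an assumed tolerance window, and no match at all) are assumptions based on the panel titles of Figure 5.

```python
def em_da_ma_fa(C_target, C_candidate, tolerance=1):
    """Count Exact Matches, Detected Anomalies, Missed Anomalies and
    False Anomalies; the tolerance window for DA is an assumption."""
    target = [i for i, c in enumerate(C_target) if c == 1]
    candidate = [i for i, c in enumerate(C_candidate) if c == 1]

    em = sum(1 for i in target if i in candidate)                   # same index flagged in both
    da = sum(1 for i in target if i not in candidate
             and any(abs(i - j) <= tolerance for j in candidate))   # near match within the window
    ma = len(target) - em - da                                      # target anomalies never detected
    fa = sum(1 for j in candidate if C_target[j] == 0)              # detections on normal target values
    return em, da, ma, fa

C0 = [0, 1, 0, 0, 1, 0, 0, 1, 0]
C1 = [0, 1, 0, 0, 0, 1, 0, 0, 0]
print(em_da_ma_fa(C0, C1))  # (1, 1, 1, 1)
```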
These four values can be used further to define derived metrics. We define
two such metrics here:
[Figure: the temporal distance ∆t between an anomaly in C0 and the closest anomaly in C1.]
We denote by WS the sum of all the weighted values produced for each target anomaly. Note that we only take into consideration the closest candidate anomaly. We also count all the false anomaly cases, FA, similarly to the previous section.
We define the Weighted Detection Difference metric using WS and FA: we scale FA by some factor and subtract it from WS,

$$WDD_{\uparrow} = WS - w_f \cdot FA$$
where w_f is the weight of the false anomalies. Other functions were also considered, such as a linear function:

$$f(\Delta t) = 1 - \frac{\Delta t}{t_{max}}$$

or a variant that punishes outliers equally:

$$f(\Delta t) = \begin{cases} 1 - \frac{\Delta t}{t_{max}} & \text{if } \Delta t < t_{max} \\ -1 & \text{otherwise} \end{cases}$$
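A sketch of the WDD computation using the linear weighting function above; the paper's primary weighting function is not shown in this excerpt, and the values of w_f and t_max are chosen here only for illustration.

```python
def linear_weight(dt, t_max):
    """Linear weighting function from the text: 1 - dt / t_max."""
    return 1 - dt / t_max

def capped_linear_weight(dt, t_max):
    """Variant that punishes all detections farther than t_max equally."""
    return 1 - dt / t_max if dt < t_max else -1.0

def wdd(C_target, C_candidate, weight=linear_weight, t_max=5, w_f=1.0):
    """Weighted Detection Difference: WS - w_f * FA (higher is better).
    Assumes the candidate classification contains at least one anomaly."""
    target = [i for i, c in enumerate(C_target) if c == 1]
    candidate = [i for i, c in enumerate(C_candidate) if c == 1]

    # WS: weighted value of the closest candidate anomaly for each target anomaly
    ws = sum(weight(min(abs(i - j) for j in candidate), t_max) for i in target)
    # FA: candidate anomalies that fall on normal target values
    fa = sum(1 for j in candidate if C_target[j] == 0)
    return ws - w_f * fa

C0 = [0, 1, 0, 0, 0, 0, 1, 0, 0]
C1 = [0, 0, 1, 0, 0, 0, 1, 0, 0]
print(wdd(C0, C1))  # 0.8 with these (assumed) parameters
```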
real numbers. These cases arise because the metric may use the number of detected anomalies as a divisor; if there are no anomalies, that operation cannot be performed. We call this situation undecidable and use a dash to denote it.
In the second example we produced the ranking in a similar fashion to the
previous example. The actual change point happens in the third group of
anomalies from C0 . We consider that only the change point is an anomaly.
The rest of the anomalies are outliers. Outliers can be found both before and
after the change point.
The metrics that best matched the imposed ranking this time were Recall, ADD, STD and DAIR. In this particular example the classical methods had distances similar to those of the proposed metrics. We believe this is because, in this particular example, all of the anomaly groups were made up of sequential anomalies.
Consider a classifier that is always a few time-samples behind with its classification. If all anomalies are point anomalies, all candidate anomalies would miss the target anomalies and a classical metric would produce a bad score. If the anomalies were not point anomalies but continuous intervals, then even though the candidate anomaly intervals were shifted, most anomaly points would still overlap, producing a better score.
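The following toy example illustrates this point, using recall as the pointwise metric; the series and the one-sample delay are made up for illustration.

```python
def recall(C_target, C_candidate):
    """Fraction of target anomaly points that the candidate also flags."""
    hits = sum(1 for a, b in zip(C_target, C_candidate) if a == 1 and b == 1)
    return hits / sum(C_target)

def delayed(C, k=1):
    """The same classification, produced k samples too late."""
    return [0] * k + C[:-k]

point_anomalies    = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
interval_anomalies = [0, 1, 1, 1, 1, 0, 0, 1, 1, 1]

print(recall(point_anomalies, delayed(point_anomalies)))        # 0.0  - every point anomaly is missed
print(recall(interval_anomalies, delayed(interval_anomalies)))  # ~0.71 - the intervals still mostly overlap
```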
Table 1: For each rule defined in Section 3 we verify whether the metric respects the corresponding relation.
The table shows that while some of the proposed metrics may sometimes
$$\mathrm{Distance} = \sum_{i=1}^{5} |r_i - \hat{r}_i|$$
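A small sketch of this ranking distance, where r_i is the rank imposed by the human expert on candidate Ci and r̂_i the rank induced by the metric under evaluation; the example ranks below are invented.

```python
def ranking_distance(expert_ranks, metric_ranks):
    """Sum of absolute rank differences over the five candidate classifications."""
    return sum(abs(r - r_hat) for r, r_hat in zip(expert_ranks, metric_ranks))

expert = [1, 2, 3, 4, 5]  # ranking imposed by the human expert on C1..C5
metric = [2, 1, 3, 5, 4]  # ranking induced by some metric (invented)
print(ranking_distance(expert, metric))  # 4
```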
[Figure: the target classification C0 and the candidate classifications C1–C5.]
Table 2: CO2 emissions. Note that all missing values are considered to be the
worst scores.
anomaly exists. Looking at the table of results, we can see that precision and accuracy cannot tell the difference between these classifications. However, we would argue that C2 is the clear winner. While its detections of the anomalies are off by one or two time samples, they are still in the neighborhood of the true anomalies; only the fourth and fifth anomalies are missed by it.
The metrics that best matched our ranking were TD, TDIR and DAIR. STD and ADD did not match exactly, but the classifications they ranked highest are also good ones. This example also shows the potential instability of the WDD method, which produced a ranking as bad as the ones produced by the confusion matrix derived metrics.
[Figure: the target classification C0 and the candidate classifications C1–C5.]
6 Conclusion
In this paper we tackled with the problem of qualitative metrics applied to
anomaly detection in time-series data. We concluded that classical metrics
such as Accuracy, Precision and Recall do not take into consideration the
time dimension of time-series data, in which near matches might be just as
good as exact matches, or at least they are better than complete misses.
We defined the problem in more rigorous terms, and provided some require-
ments that we believe a good metric should meet. Next we defined some new
metrics. We checked whether or not our proposed metrics respect the require-
ments set out by us previously. We also compared the performance of our
References
[1] E. Baidoo, J. Lewis Priestley, An Analysis of Accuracy using Logistic Regression and Time Series, Grey Literature from PhD Candidates 2 (2016), https://digitalcommons.kennesaw.edu/dataphdgreylit/2/
[2] J. Caiado, N. Crato, D. Peña, A periodogram-based metric for time series classification, Computational Statistics & Data Analysis 50 (2006) 2668–2684.
[3] B. Esmael, A. Arnaout, R. K. Fruhwirth, G. Thonhauser, Improving time series classification using Hidden Markov Models, Proceedings of the 12th International Conference on Hybrid Intelligent Systems (HIS), 2012, pp. 502–507.
[4] A. Gensler, B. Sick, Novel Criteria to Measure Performance of Time Series Segmentation Techniques, Proceedings of the LWA 2014 Workshops: KDML, IR, FGWM, Aachen, Germany, 2014.
[5] R. J. Hyndman, Time Series Data Library, accessed 2018-11-12, https://datamarket.com/data/list/?q=provider:tsdl
[6] N. Laptev, S. Amizadeh, I. Flint, Generic and Scalable Framework for Automated Time-series Anomaly Detection, Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1939–1947.
[7] S. I. Lee, C. P. Adans-Dester, M. Grimaldi, A. V. Dowling, P. C. Horak, R. M. Black-Schaffer, P. Bonato, J. T. Gwin, Enabling stroke rehabilitation in home and community settings: a wearable sensor-based approach for upper-limb motor training, IEEE Journal of Translational Engineering in Health and Medicine 6 (2018) 1–11.
[8] Gh. Sebestyen, A. Hangan, Gy. Kovacs, Z. Czako, Platform for Anomaly Detection in Time-Series, XXXIV. Kandó Conference, Budapest, Hungary, 2018.