2022
Abstract
Anomalies in data can be of great importance, as they often indicate faulty behaviour. Locating them can thus assist in finding the source of an issue. Isolation Forest, an unsupervised machine learning model used to detect anomalies, is evaluated against two other commonly used models. The data set used consisted of log files from a company named Trimma. The log files contained information about different events that were executed, and different types of events can differ in execution time. The models were used to find logs where some event took longer than usual to execute. The feature created for the models was the percentage difference from the median of each job type. The comparison, made on various data set sizes using one feature, showed that Isolation Forest did not have the best execution time among the models. Isolation Forest classified similar data points as anomalies compared to the other models, although its smallest classified anomaly differed somewhat from theirs. This discrepancy was only seen in the smaller anomalies; the larger deviations were consistently classified as anomalies by all models.
Acknowledgements
I would like to give my most sincere thanks to Lili Jiang and Andreas Theodorou, who helped me not only with machine-learning related questions, but also with general questions about writing the report and interpreting the results. I also thank my family, especially my mom and dad, for supporting me through all the years of university, and my friend Dilshad, for being an amazing friend who always believed in me and helped when life was stressful. This would not have been possible without any of you.
I would also like to thank Anton Vermcrantz and Kristoffer Granberg at Trimma for supplying the required data and for their general support with the issues encountered.
Contents
1 Introduction
1.1 Research question
1.2 Limitations
2 Related work
3 Theoretical background
3.1 Anomaly detection
3.2 Unsupervised Machine Learning
3.3 Isolation Forest (IF)
3.4 K-nearest neighbors (kNN)
3.5 Cluster-based Local Outlier Factor (CBLOF)
3.5.1 Kmeans clustering
4 Method
4.1 Hardware
4.2 PYOD
4.3 The data set
4.3.1 Data manipulation
5 Results
5.1 Isolation Forest
5.2 K-nearest neighbor
5.3 Cluster-based Local Outlier Factor
6 Discussion
7 Conclusion
8 Future work
Bibliography
1 Introduction
Anomaly detection refers to the problem of recognizing data points that deviate from the normal behaviour of the data [1]. If one were to use a rule-based system to detect anomalies, it would require frequent updates to define what normal is [2]. Instead, by using anomaly detection systems to define what normal is, unseen anomalies can be detected by comparing them to the normal.
I have been working with Trimma, a decision support company, which loads and processes data from their customers to help examine it. There is a unique solution for each of their customers in the way that their data is loaded and processed. This is a result of each customer's data set being different, and what is to be examined varies from customer to customer. These processes are run at different intervals; some at set times, such as at midnight every second day, and others at more dynamic intervals, such as each time some button is clicked.
However, issues sometimes arise in that the execution time of these processes takes longer than usual. This could be the result of a bug in the processing, an abnormal data set size received from the customer, or a server performing poorly, among other causes. Locating these instances of abnormal execution time could thus be of great value in noticing that there is an issue and finding its cause. Doing this manually is not feasible, as too many logs are generated every day, which would result in a slow response from Trimma. To solve this issue, a machine learning model is to be used on their event logs to detect when some process has taken an unusual amount of time to execute. These anomalous logs are then to be examined by analysts at Trimma in order to locate the root cause of the problem.
In this dissertation, a few commonly used machine-learning approaches are used on Trimma’s
data to evaluate the models’ performance in execution time and detecting anomalies.
1.2 Limitations
The models' performance was only evaluated using one feature. The results might differ when more dimensions are used. This is especially true for Isolation Forest, which scales well with high-dimensional data sets [3].
As there was no ground truth available, the contamination rate had to be estimated. This estimation was made by comparing the results of different contamination rates on one data set. If another data set were used, this contamination rate would likely yield different results.
2 Related work
Identifying data that does not fit expectations is very useful in a variety of fields, and unsupervised machine learning is often used for it. For example, Schlegl et al. [4] successfully used unsupervised machine learning to identify potential imaging markers in disease progression. Leung and Leckie [5] used unsupervised machine learning to create a model that detects network intrusions. By using unsupervised learning, the model could also detect new types of attacks, not only those available in the training data. Lastly, Bolton and Hand [6] used unsupervised machine learning to detect fraud by looking at longitudinal data, i.e. data captured from the same subjects at multiple points in time.
Systems regularly create log files with information about current processes. These log files can be a good source of data for anomaly detection. He et al. [7] evaluated six models of both unsupervised and supervised nature. The models were used to find abnormal log sequences in log files. To begin, the data set was parsed by creating templates of the logs. The templates were simply groups of similar logs, where the logs had some common part. After parsing, features were extracted from the logs by using windowing to separate log data into groups, each group representing a log sequence. These extracted features were then used in the anomaly detection models. In this paper, an anomaly is not an abnormal sequence but an abnormal duration, and thus different features are used to detect anomalies. Since the data set used in this paper is more structured, the categorical and numerical data are already separated and thus require no parsing.
The preprocessing described in Section 4.3.1 is very similar to Median Absolute Deviation (MAD) [8] in that it uses the deviation from the median to detect anomalies. However, MAD is used specifically for univariate data, which was not directly applicable in this case since the data set contains several different categories. The preprocessing removes the need for categories by calculating the median of each category and then calculating the deviation of each data point with respect to its category's median. This deviation is then scaled to be proportional to the category's median. The difference between this solution and MAD is that this solution also takes the size of the deviation into account. MAD defines anomalies by comparing each deviation from the median against the median of all such deviations. When applying this method to the current data set, some categories had a median deviation of 0, so any data point that deviated by just a second would be classified as an anomaly. Therefore, a percentage difference from the median is a better solution for this data set.
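To make the contrast concrete, here is a minimal sketch (with invented durations, not Trimma's data) of why plain MAD breaks down on categories whose runs are nearly constant, while the percentage difference keeps the size of the deviation:

```python
from statistics import median

def mad(values):
    """Median Absolute Deviation: median of absolute deviations from the median."""
    m = median(values)
    return median(abs(v - m) for v in values)

# Hypothetical durations (seconds) for one job category: most runs are identical,
# so the median deviation collapses to 0 and MAD would flag any change at all.
durations = [30, 30, 30, 30, 31]
print(mad(durations))  # 0 -- even a 1-second blip exceeds any multiple of MAD

# The percentage-difference feature instead keeps the size of the deviation:
# the 31-second run deviates only about 3.3% from the 30-second median.
m = median(durations)
pct_diff = [(d - m) / m * 100 for d in durations]
```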
3 Theoretical background
This section explains the algorithms used in this paper, as well as general background knowledge of the anomaly detection field.
1. anomalies are different from the norm with respect to their features and
2. they are rare in a dataset compared to normal instances.
As unsupervised learning does not involve labeled data, the ground truth is unknown. As a result, the contamination rate has to be estimated in other ways. Since one of the characteristics of an anomaly is its rarity in the data set, an estimate was found by examining the data points classified as anomalies. With c = 10%, the default value of the models, the data points classified as anomalies contained far too many false positives according to domain experts at Trimma. The smallest deviation classified as an anomaly (the lower threshold) was 0.08% from the median value of the category. Halving c resulted in a lower threshold of approximately 10%, which is more accurate, but still not a large enough deviation according to Trimma's analysts. Halving c again gave c = 2.5% and a lower threshold of approximately 25%. This is the final contamination rate used across the models, as it was deemed an appropriate definition of an anomaly in this situation. As seen in Figure 1, the contamination rate correlates with the lower threshold: the lower the contamination rate, the larger the smallest detected anomaly. Note that only the increase threshold was considered during this process, the reasoning being that finding anomalies that increased in duration is of more importance to Trimma.
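The relationship between the contamination rate and the lower threshold can be sketched as follows (with synthetic scores standing in for the real model output): the threshold is the (1 − c) quantile of the outlier scores, so halving c pushes it upward.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic outlier scores: a dense bulk near zero plus a sparse tail,
# standing in for the % deviations of the real data set.
scores = np.concatenate([rng.normal(0, 5, 950), rng.uniform(50, 400, 50)])

# The lower threshold is the smallest score still labeled an anomaly at rate c;
# smaller c -> larger threshold, matching the trend in Figure 1.
for c in (0.10, 0.05, 0.025):
    print(c, round(float(np.quantile(scores, 1 - c)), 2))
```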
Figure 1: Lower threshold for each model with various contamination rates. Note that the
x-axis is inverted.
Data points with a shorter path are more likely to be anomalous. Liu et al. [3] explain why this technique works:
"This random partitioning produces noticeable shorter paths for anomalies since (a) the fewer instances of anomalies result in a smaller number of partitions – shorter paths in a tree structure, and (b) instances with distinguishable attribute-values are more likely to be separated in early partitioning."
Since Isolation Forest is an ensemble of trees, one of its parameters is the number of trees used. Each tree uses a random partition of the data and recursively splits it as explained above. The average number of splits it takes for each data point to be isolated is then calculated; data points that require fewer splits on average have a higher probability of being anomalies. The other parameter is the subsample size. IF only uses a subsample of the whole data set. This helps with issues such as masking and swamping, while also allowing IF to scale well. Swamping refers to the issue where a model wrongly identifies normal data points as anomalies. Masking refers to clusters of anomalies, which makes them harder to detect since more splits would be required to isolate any of them [3]. By using a smaller sample size, the severity of these issues is mitigated. The sample size has a default value of min(256, n), where n is the size of the data set. The execution time largely depends on the number of isolation trees used in the ensemble and the number of features used.
Figures 1 and 2 visualize splitting a data set to isolate different points. As shown, it takes fewer splits to isolate an anomalous point than a normal point. This is generally the case, though it is possible for a normal point to take fewer splits to be isolated than an anomalous point. However, using more than one tree helps mitigate this issue, as the average number of splits is used.
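The mechanism above can be sketched in a few lines for one-dimensional data (a toy sketch, not the real Isolation Forest implementation): each "tree" splits on a uniformly random value and follows the side containing the point, and the anomaly needs far fewer splits on average to end up alone.

```python
import random

def isolation_depth(x, data, depth=0):
    """Depth at which x becomes isolated under random splits (one feature)."""
    if len(data) <= 1:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)
    # Follow the partition that still contains x.
    subset = [v for v in data if v < split] if x < split else [v for v in data if v >= split]
    return isolation_depth(x, subset, depth + 1)

random.seed(42)
data = [random.gauss(0, 2) for _ in range(255)] + [100.0]  # one clear anomaly

def avg_depth(x, trees=200):
    return sum(isolation_depth(x, data) for _ in range(trees)) / trees

print(avg_depth(100.0))  # shallow: isolated after very few splits
print(avg_depth(0.0))    # deep: many splits needed inside the dense region
```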
The anomaly score can be computed in different ways. One way is using the distance to the data point's kth nearest neighbor [15]. In doing this, data points that are sparse and further away from other data points have a greater distance to their kth nearest neighbour, and are thus more likely to be anomalies. Another option is to use the average distance to the k nearest neighbors. This approach has the benefit of also taking into account the local density of the points [12]. In this paper, the latter option is used.
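A minimal numpy sketch of the latter variant (average distance to the k nearest neighbours) on one-dimensional, made-up deviation values:

```python
import numpy as np

def knn_scores(X, k):
    """Anomaly score = mean distance to the k nearest neighbours (1-D data)."""
    X = np.asarray(X, dtype=float)
    d = np.abs(X[:, None] - X[None, :])        # pairwise distances
    d_sorted = np.sort(d, axis=1)[:, 1:k + 1]  # drop the self-distance (column 0)
    return d_sorted.mean(axis=1)

# Hypothetical % deviations: dense around 0, one large outlier at index 6.
X = [0.0, 1.0, -2.0, 0.5, 1.5, -1.0, 250.0]
scores = knn_scores(X, k=3)
print(scores.argmax())  # 6 -- the outlier has by far the largest average distance
```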
(|C_1| + |C_2| + ... + |C_b|) ≥ |D| · α
|C_b| / |C_{b+1}| ≥ β
Here, α is a parameter used to decide how much of the data the large clusters should collectively contain. If α is set to 90%, the large clusters should collectively contain 90% of the data points. β is used to decide how much large and small clusters should differ in size; e.g., if β is set to 5, any large cluster should be at least 5 times larger than any small cluster.
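The two conditions can be sketched as a small function (with hypothetical cluster sizes; the returned b counts how many of the size-sorted clusters are "large"):

```python
def split_clusters(sizes, alpha=0.9, beta=5):
    """Return b such that clusters[:b] are 'large' and the rest are 'small'.

    sizes must be sorted in decreasing order; the boundary is the first b
    satisfying either condition of the CBLOF definition above.
    """
    total = sum(sizes)
    running = 0
    for b in range(1, len(sizes)):
        running += sizes[b - 1]
        if running >= alpha * total:         # large clusters cover alpha of the data
            return b
        if sizes[b - 1] / sizes[b] >= beta:  # sharp size drop between cluster b and b+1
            return b
    return len(sizes)

# Hypothetical cluster sizes from Kmeans on the deviation feature:
print(split_clusters([7600, 300, 60, 40]))  # 1 -- one dominant cluster
```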
To decide outlierness, CBLOF uses a combination of distance and size. Assume p is a data point; to decide its anomaly score, first calculate the distance from p to the nearest large cluster center, and second, calculate the size of the cluster p belongs to, to use as a weight.
The CBLOF implementation in PyOD does not use the cluster sizes for weighting. The reason given is that the weighting hurts detection, since anomalies that are close to small clusters would not be detected. This implementation also uses Kmeans++ clustering instead of the Squeezer algorithm used in the original algorithm.
"A variant that chooses centers at random from the data points, but weighs the data points according to their squared distance from the closest center already chosen"
I.e., this variant first selects one initial cluster center (centroid) at random. The next centroids are chosen using a probabilistic method instead of uniformly at random: the distance from each data point to the nearest previously chosen centroid is calculated, and data points that are further away have a larger weight for being chosen as the next centroid, since they have a larger chance of lying in a different cluster. This process is repeated until k centroids have been selected. This increases the chance of selecting centroids that lie in different clusters, and thus may require fewer iterations than the original algorithm.
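The seeding procedure can be sketched for one-dimensional points (a toy sketch, not the PyOD implementation):

```python
import random

def kmeanspp_init(points, k, seed=0):
    """k-means++ seeding: weight candidates by squared distance to the nearest chosen centre."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest already-chosen centroid.
        weights = [min((p - c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

# A tight group near 0 and one far-away point: the far point is almost
# certain to be picked as one of the two centroids.
points = [0.1, 0.3, -0.2, 0.0, 0.2, 500.0]
cents = kmeanspp_init(points, k=2)
```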
4 Method
In this section, an overview of the methods used is presented. Section 4.2 presents the toolkit used to implement the models. Section 4.3 shows the data set and how it is prepared.
4.1 Hardware
All experiments were run on a computer with the following hardware: CPU: AMD Ryzen 5 5600X, 3.7 GHz, 35 MB cache; GPU: GeForce GTX 1070 Ti; RAM: 16 GB, 3200 MHz.
4.2 PYOD
PyOD [17], a Python toolkit for detecting outliers, is used to build the different models. The toolkit has a variety of models already implemented. The anomaly-detection models require a contamination parameter, which describes the fraction of anomalies in the data set. The model assigns every data point an outlier score, and the n · contamination rate points with the highest scores are then labeled as anomalies. This forces the highest-scoring data points to be classified as anomalies, even if their scores are low. As a result, if the contamination rate is set too high, the model is forced to misclassify normal points as anomalies. If it is set too low, the model might miss some anomalies and only capture the most severe ones.
The fitted models can be saved and used to detect anomalies in a new data set. The new data set is then compared to the fitted model, and all data points that are above the model's lower threshold are labeled as anomalies. The contamination rate does not affect predictions on a new data set, since the thresholds are already set. I.e., if a data set contains 10,000 data points, all data points above the set thresholds are considered anomalies, not only the top-scoring 2.5%.
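This labeling scheme can be sketched as follows (a simplification of the behaviour described above, using synthetic scores): fitting fixes a threshold at the (1 − contamination) quantile of the training scores, and predictions on new data reuse that stored threshold.

```python
import numpy as np

def fit_threshold(scores, contamination=0.025):
    """Label the top n * contamination training scores as anomalies; keep the threshold."""
    threshold = float(np.quantile(scores, 1 - contamination))
    labels = (scores >= threshold).astype(int)  # 1 = anomaly
    return threshold, labels

rng = np.random.default_rng(1)
train_scores = rng.normal(0, 1, 10_000)
threshold, labels = fit_threshold(train_scores)  # exactly 2.5% labeled anomalous

# New data is judged against the stored threshold, not re-ranked, so its own
# size and score distribution do not change the definition of an anomaly.
new_scores = rng.normal(0, 1, 1_000)
new_labels = (new_scores >= threshold).astype(int)
```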
4.3.1 Data manipulation
The categorical data needed for these models were the job id and step id. The machine-learning models can only use numerical data, so some preprocessing steps were required for the categorical data. Examples of logs are available in Table 1.
The goal of the model was to find local anomalies within each category. However, splitting the data across 350 categories resulted in imbalanced data. When using the categorical data as a feature, the models would classify data points with rare categories as anomalies, and when using the raw duration as a feature, the models would classify rare durations as anomalies. Neither of these is the desired result: the raw duration does not matter, it is the difference from the ordinary duration within the category that is the important aspect. As a result, neither the categorical data nor the raw duration was used as a feature. By creating a new feature that combines the categorical data with the duration, it was possible to look for global anomalies instead of local ones within each category.
The feature created for this solution was the percentage difference from the median of the category. This was done by first calculating the median duration of each category, and then calculating the percentage difference between each log's duration and its category's median. The difference was calculated with the following formula:
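Writing d for a log's duration and m for the median duration of its category, the description above corresponds to:

```latex
\text{medianDiff} = \frac{d - m}{m} \times 100
```

So an event that takes twice its category's median gets a medianDiff of 100, and one that takes half the median gets -50.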
This feature was created since all logs, regardless of category, have something in common: their deviation from the ordinary. What counts as ordinary depends on the log's category, and this deviation is precisely what the model should classify each log on. If the deviation is large enough, the log should be deemed an anomaly. In doing this, we capture the important parts of the categorical features without having to use them directly in the model. This allows the model to locate local anomalies globally. By using a percentage, it is ensured that shorter job types are weighted the same as longer job types. Another advantage of this approach is that new categories can be evaluated without re-fitting the model, by simply calculating their median and the percentage difference from it. A possible drawback of this approach is that increases have a larger weight than decreases. This is because the increase is unbounded, while the decrease is bounded: a process cannot take less than 0 seconds, so the largest possible decrease is 100%.
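As a sketch, the feature can be computed with nothing more than a per-category median (the log tuples below are invented stand-ins for the job id / step id and duration columns):

```python
from statistics import median
from collections import defaultdict

# Hypothetical logs: (category, duration in seconds). Real logs carry
# a job id and step id; a single string stands in for the category here.
logs = [("jobA.step1", 30), ("jobA.step1", 32), ("jobA.step1", 31),
        ("jobB.step1", 2000), ("jobB.step1", 2100), ("jobA.step1", 60)]

by_cat = defaultdict(list)
for cat, dur in logs:
    by_cat[cat].append(dur)
medians = {cat: median(durs) for cat, durs in by_cat.items()}

# Percentage difference from the category median -- the single model feature.
features = [(dur - medians[cat]) / medians[cat] * 100 for cat, dur in logs]
```

The 60-second jobA run stands out globally even though its raw duration is far below every jobB run, which is exactly the point of the feature.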
The entire preprocessing pipeline is, as shown in Figure 4, divided into four steps. The cleaning step involves removing data that would not contribute to good performance. This includes data whose category only has one instance, and data of events that failed to execute. These data points are excluded since they would add no value to the model and could negatively affect the results. Figure 5 shows that the majority of the values had a deviation of 0, or a slight decrease.
Table 2 shows an example of logs after adding median and medianDiff as columns.
Figure 4: Preprocessing pipeline for the data set
Figure 5: Distribution of percentual deviations from median in data set
5 Results
This chapter discusses the results of the models. All models' contamination rate was set to 0.025 (2.5%). Exactly how well the models perform is hard to measure, since no exact threshold was decided for what defines an anomaly. However, as previously stated, an anomaly was defined as belonging to the top 2.5% of deviating data points in this data set. All models reported a threshold between approximately 20% and 30%, and these thresholds were confirmed by domain experts to be fitting.
The data set used to fit each model contains approximately 8,000 data points; the fitted models can then be saved and loaded. In doing this, the definition of an anomaly does not change with new data sets. A fitted model is able to detect anomalies in new data sets with the same thresholds chosen during fitting.
Figure 6: Thresholds set for Isolation Forest with an increasing amount
of trees, black line showing the average
Figure 8: Differences in anomalies found between IF and the other models
5.2 K-nearest neighbor
kNN performed well in terms of finding anomalies with k = 12. This k was chosen because the threshold values converge after that point, as seen in Figure 9. The value of k is very dependent on the data set; k = 12 performs well when fitting the data set of 8,000 data points. However, when using k = 12 on the data set with 40,000 data points, the anomaly detection was not consistent in what was classified as an anomaly. The data points that were not classified as anomalies despite having a larger deviation than some anomaly are the 'inconsistent normal points' seen in Figure 10.
The smallest increase classified as an anomaly was approximately 23.41% from the median. The smallest deviation for events that decreased in duration was -20.72%. However, kNN, despite being a very simple model, had the drawback of a slower execution time, as seen in Figure 11. It scaled the worst of the three models, although it had the best execution times for n < 2000. kNN classified 213 data points as anomalies.
Using the fitted model to predict new samples is faster, but kNN still performs the slowest of the three models, as seen in Figure 12.
kNN suffers from the effects of masking: if there are many similar anomalies, they are not detected. This can be demonstrated by adding k copies of an anomaly to the data set, as seen in Figure 13.
Figure 9: Thresholds set for different values of 𝑘
Figure 11: Execution time of fitting CBLOF, IF and kNN with an increasing data set size
Figure 12: Execution time when predicting new data sets of increasing
sizes using the fitted models
Figure 13: Masked anomaly in kNN
5.3 Cluster-based Local Outlier Factor
CBLOF performed similarly to Isolation Forest in that it was good at finding anomalies. Unlike IF, its results did not vary between fittings; the lower thresholds for anomalies stayed the same each run unless the data set was changed. The execution time increased approximately linearly with the number of clusters used, as seen in Figure 14. The anomalies found were approximately the same with increasing clusters, as seen in Figure 15, meaning that no more than two clusters were needed for this data set. The smallest increase from the median classified as an anomaly was approximately 26.83%, and the smallest decrease was approximately 24.16%. Unlike kNN, CBLOF did detect the masked anomaly seen in Figure 8.
CBLOF performed the best of the three models with regard to execution time, in both fitting and predicting on new data sets with the fitted model, while also having thresholds similar to kNN and IF. CBLOF classified 213 data points as anomalies.
Figure 14: Differences in execution time with an increasing amount of
clusters
6 Discussion
IF required about 20 trees to gain stable performance, but this also affected the execution time of the model, making it worse than CBLOF on this type of data. As the data used to fit the model was one-dimensional, Isolation Forest only split on one feature, as opposed to a random feature when using multi-dimensional data. As a result, each tree split on a random value between the minimum and maximum value of the % deviation. As seen in Figure 5, the majority of values are close to 0%, with many having a small decrease compared to the median. Consequently, when a tree in Isolation Forest splits on a random value, it is much more likely to split where the values are sparse; it is thus more likely to isolate the larger deviations than a normal point. These random splits are the reason the threshold varies a little every run. When the threshold for increases becomes smaller, the threshold for decreases becomes larger. This is because the contamination rate fixes the number of anomalies: if the increase threshold gets smaller, more anomalies are found among the increases, and fewer among the decreases, as that threshold moves further away from zero. The number of anomalies found by IF varied slightly every run, but on average it found fewer anomalies than CBLOF and kNN. IF had a slightly larger threshold for increases than the other models, while having approximately the same threshold for decreases. One possible explanation is that the values with more than a 30% increase were sparser and thus more likely to be isolated: there were 50 logs between +20% and +30%, and only 25 logs between +30% and +40%.
CBLOF did not require many clusters due to the distribution of the data; approximately 90% of the data points had a % deviation within approximately [−7.5, 6]%. Since the majority of the values had a small deviation from the median, there existed only one large cluster, whose center depends on where the majority of the values lie. In this data set, the large cluster center had a value of 1.62. The small cluster contained only four values, which were extremely large anomalies with a % deviation between [400, 1600]. As a result, CBLOF used the distance between each data point and 1.62 as the anomaly score. As anomalies were defined as points that deviate from the median, this resulted in larger deviations having larger anomaly scores. Since the large cluster center had a value of 1.62 and not 0, the thresholds were skewed: anomalies that increased required 1.62 more percentage points than anomalies that decreased in order to be classified as anomalies. This is the reason the absolute values of the two thresholds in Figure 15 are not exactly equal.
kNN works by using the distances to the k nearest neighbors of each data point. The larger the average of these distances is, the more weight is given, and a larger weight indicates a larger probability of being an anomaly. As the majority of the values are close to 0%, the normal values have a small weight, while the values with a larger % difference are sparser and thus more likely to be anomalies. kNN suffers from masking in this data set: if there are several similar anomalies, they are not detected. This is because the model only looks at the k nearest neighbors; if the k neighbors are similar data points, the average distances between them are small, giving them a small weight, and they are not detected. This issue does not occur in CBLOF or IF. CBLOF is not affected, as it only looks at the distance to the closest large cluster center, so small clusters of anomalies do not affect detection. IF only uses a subset of the data, thus mitigating the masking issue.
As seen in Figure 8, the models differ only in detecting the smaller anomalies. This is expected, since the models differ in threshold values: if a model has a lower threshold than another, it will also classify smaller deviations as anomalies. Some instances of masking are seen as well, where both CBLOF and IF detected the masked anomalies whereas kNN did not. CBLOF classified approximately 30 more increasing data points than IF, but IF classified approximately 30 more decreasing data points than CBLOF. IF did, however, classify 5 fewer data points in total than kNN and CBLOF.
7 Conclusion
To answer the research question: Isolation Forest did perform better than kNN in execution time, and it did not have the same problems with masking, since each tree uses a subset of the data. However, IF did not perform better than CBLOF in terms of execution time. This is because CBLOF did not require many clusters to achieve good results on one-dimensional data, while IF required around 20 trees for stable results. Thus, the results are inconclusive and require further testing.
The three models' detection results were similar but had some differences in detecting smaller anomalies. CBLOF had thresholds at [-24.16, 26.83], kNN at [-20.73, 23.41], and finally IF at [-21.14, 31.09]. These thresholds are simply the smallest % differences from the median classified as anomalies, for both decreases and increases. IF has a larger increase threshold than the other two, but a similar decrease threshold. All models classified a similar number of data points: kNN and CBLOF each classified 213 data points as anomalies, and IF classified 208.
8 Future work
If there is to be continued work on this model, one area of improvement would be to account for the time of day of each log. Logs where the event runs at night should only be compared to the same event running at night; processes running during the night can afford to take longer, since the results will not be needed until the morning. One could also scale the feature so that logs with a lower median have more flexibility than the bigger jobs. The reason for this is that a 20% increase of a job whose median is 30 seconds is significantly cheaper than a 20% increase of a job whose median is 2000 seconds. Frequency could also be accounted for: more frequently run jobs are of higher priority than rare jobs.
As the experiments were done with a one-dimensional feature space, it would be interesting to see how the models compare when using more features.
More aspects could be considered when comparing these models. This paper compared
the anomaly detection as well as execution time. However, it would be of interest to also
compare memory usage for each model.
One could investigate the use of a reinforcement learning model, which in this case could
be a good option since anomalies will be investigated by an engineer. This engineer could
then rate how well the model performed in order to improve the classification.
Lastly, it would be interesting to investigate whether these machine learning models perform better than simpler statistical methods, such as using the interquartile range (IQR) to detect anomalies [18].
Bibliography
[1] Varun Chandola, Arindam Banerjee, and Vipin Kumar. “Anomaly detection: A survey”.
In: ACM computing surveys (CSUR) 41.3 (2009), pp. 1–58.
[2] Animesh Patcha and Jung-Min Park. “An overview of anomaly detection techniques:
Existing solutions and latest technological trends”. In: Computer networks 51.12 (2007),
pp. 3448–3470.
[3] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. “Isolation forest”. In: 2008 eighth ieee
international conference on data mining. IEEE. 2008, pp. 413–422.
[4] Thomas Schlegl et al. “Unsupervised anomaly detection with generative adversarial
networks to guide marker discovery”. In: International conference on information pro-
cessing in medical imaging. Springer. 2017, pp. 146–157.
[5] Kingsly Leung and Christopher Leckie. “Unsupervised anomaly detection in network
intrusion detection using clusters”. In: Proceedings of the Twenty-eighth Australasian
conference on Computer Science-Volume 38. 2005, pp. 333–342.
[6] Richard J Bolton, David J Hand, et al. “Unsupervised profiling methods for fraud detec-
tion”. In: Credit scoring and credit control VII (2001), pp. 235–255.
[7] Shilin He et al. “Experience report: System log analysis for anomaly detection”. In:
2016 IEEE 27th international symposium on software reliability engineering (ISSRE). IEEE.
2016, pp. 207–218.
[8] Christophe Leys et al. “Detecting outliers: Do not use standard deviation around the
mean, use absolute deviation around the median”. In: Journal of experimental social
psychology 49.4 (2013), pp. 764–766.
[9] Markus Goldstein and Seiichi Uchida. “A comparative evaluation of unsupervised anomaly
detection algorithms for multivariate data”. In: PloS one 11.4 (2016), e0152173.
[10] Mennatallah Amer and Markus Goldstein. “Nearest-Neighbor and Clustering based Anomaly Detection Algorithms for RapidMiner”. In: (). url: https://www.goldiges.de/publications/Anomaly_Detection_Algorithms_for_RapidMiner.pdf.
[11] Zengyou He, Xiaofei Xu, and Shengchun Deng. “Discovering cluster-based local out-
liers”. In: Pattern recognition letters 24.9-10 (2003), pp. 1641–1650.
[12] Fabrizio Angiulli and Clara Pizzuti. “Fast outlier detection in high dimensional spaces”.
In: European conference on principles of data mining and knowledge discovery. Springer.
2002, pp. 15–27.
[13] Markus M Breunig et al. “LOF: identifying density-based local outliers”. In: Proceedings
of the 2000 ACM SIGMOD international conference on Management of data. 2000, pp. 93–
104.
[14] Thomas Cover and Peter Hart. “Nearest neighbor pattern classification”. In: IEEE trans-
actions on information theory 13.1 (1967), pp. 21–27.
[15] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. “Efficient algorithms for min-
ing outliers from large data sets”. In: Proceedings of the 2000 ACM SIGMOD international
conference on Management of data. 2000, pp. 427–438.
[16] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding.
Tech. rep. Stanford, 2006.
[17] Yue Zhao, Zain Nasrullah, and Zheng Li. “PyOD: A Python Toolbox for Scalable Outlier Detection”. In: Journal of Machine Learning Research 20.96 (2019), pp. 1–7. url: http://jmlr.org/papers/v20/19-011.html.
[18] Peter J Rousseeuw and Mia Hubert. “Robust statistics for outlier detection”. In: Wiley
interdisciplinary reviews: Data mining and knowledge discovery 1.1 (2011), pp. 73–79.