2022
Abstract
Anomalies in data can be of great importance, as they often indicate faulty behaviour. Locating them can thus assist in finding the source of an issue. Isolation Forest, an unsupervised machine learning model used to detect anomalies, is evaluated against two other commonly used models. The data set used consisted of log files from a company named Trimma. The log files contained information about different events that were executed, and different types of events can differ in execution time. The models were used to find logs where some event took longer than usual to execute. The feature created for the models was the percentage difference from the median of each job type. The comparison, made on various data set sizes using one feature, showed that Isolation Forest did not have the best execution time among the models. Isolation Forest classified similar data points as anomalies compared to the other models, although its smallest classified anomaly differed somewhat from theirs. This discrepancy was only seen in the smaller anomalies; the larger deviations were consistently classified as anomalies by all models.
Acknowledgements
I would like to give my most sincere thanks to Lili Jiang and Andreas Theodorou, who helped me not only with machine-learning related questions, but also with general questions about writing the report and interpreting the results. I also thank my family, especially my mom and dad, for supporting me through all the years of university, and my friend Dilshad, for being an amazing friend who always believed in me and helped when life was stressful. This would not have been possible without any of you.
I would also like to thank Anton Vermcrantz and Kristoffer Granberg at Trimma for supplying the required data and for their general support with the issues encountered.
Contents
1 Introduction
1.1 Research question
1.2 Limitations
2 Related work
3 Theoretical background
3.1 Anomaly detection
3.2 Unsupervised Machine Learning
3.3 Isolation Forest (IF)
3.4 K-nearest neighbors (kNN)
3.5 Cluster-based Local Outlier Factor (CBLOF)
3.5.1 Kmeans clustering
4 Method
4.1 Hardware
4.2 PYOD
4.3 The data set
4.3.1 Data manipulation
5 Results
5.1 Isolation Forest
5.2 K-nearest neighbor
5.3 Cluster-based Local Outlier Factor
6 Discussion
7 Conclusion
8 Future work
Bibliography
1 Introduction
Anomaly detection refers to the problem of recognizing data points that deviate from the normal behaviour of the data [1]. If one were to use a rule-based system to detect anomalies, it would require frequent updates to define what normal is [2]. Instead, by using anomaly detection systems to define what normal is, unseen anomalies can be detected by comparing them to the normal.
I have been working with Trimma, a decision support company, which loads and processes data from their customers to help examine it. There is a unique solution for each of their customers in the way that their data is loaded and processed. This is a result of each customer's data set being different, and what is to be examined varies from customer to customer. These processes are run at different intervals; some at set times, such as at midnight every second day, and others at more dynamic intervals, such as each time some button is clicked.
However, issues sometimes arise in that the execution time of these processes takes longer than usual. This could be the result of a bug in the processing, an abnormal data set size received from the customer, or a server performing poorly, among other causes. Locating these instances of abnormal execution time could thus be of great value in noticing that there is an issue and finding its cause. Doing this manually is not feasible, as too many logs are generated every day, which would result in a slow response from Trimma. To solve this issue, a machine learning model is to be used on their event logs to detect when some process has taken an unusual amount of time to execute. These anomalous logs are then to be examined by analysts at Trimma in order to locate the root cause of the problem.
In this dissertation, a few commonly used machine-learning approaches are used on Trimma’s
data to evaluate the models’ performance in execution time and detecting anomalies.
1.2 Limitations
The models' performance was only evaluated using one feature. The results might differ when more dimensions are used. This is especially true for Isolation Forest, which scales well with high-dimensional data sets [3].
As there was no ground truth available, the contamination rate had to be estimated. This estimation was made by comparing the results of different contamination rates on one data set. If another data set were used, this contamination rate would likely yield different results.
2 Related work
Identifying data that does not fit expectations is very useful in a variety of fields, and unsupervised machine learning is often used for it. For example, Schlegl et al. [4] successfully used unsupervised machine learning to identify potential imaging markers in disease progression. Leung and Leckie [5] used unsupervised machine learning to create a model that detects network intrusions. By using unsupervised learning, the model could also detect new types of attacks, not only those available in the training data. Lastly, Bolton and Hand [6] used unsupervised machine learning to detect fraud by looking at longitudinal data, i.e. data captured from the same subjects at multiple points in time.
Systems regularly create log files with information about current processes. These log files can be a good source of data for anomaly detection. He et al. [7] evaluated six models of both unsupervised and supervised nature. The models were used to find abnormal log sequences in log files. To begin, the data set was parsed by creating templates of the logs. The templates were simply groups of similar logs, where the logs had some common part. After parsing, features were extracted from the logs by using windowing to separate log data into groups, each group representing a log sequence. These extracted features were then used in the anomaly detection models. In this paper, an anomaly is not an abnormal sequence but an abnormal duration, and thus different features are used to detect anomalies. Since the data set used in this paper is more structured, the categorical and numerical data are already separated and thus require no parsing.
The preprocessing described in Section 4.3.1 is very similar to Median Absolute Deviation (MAD) [8] in that it uses the deviation from the median to detect anomalies. However, MAD is used specifically for univariate data, which was not directly applicable in this case since the data set contains several different categories. The preprocessing removes the need for categories by calculating the median of each category and then calculating the deviation of each data point with respect to its category's median. This deviation is then scaled to be proportional to the category's median. The difference between this solution and MAD is that this solution also takes the size of the deviation into account. MAD defines anomalies by comparing each deviation from the median against the median of all such deviations. When applying this method to the current data set, some categories had a median deviation of 0, so any data point that deviated by just a second would be classified as an anomaly. Therefore, a percentage difference from the median is a better solution for this data set.
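To make the contrast concrete, here is a minimal sketch (with invented durations, not Trimma's data) of why plain MAD breaks down on categories whose runs are nearly constant, while the percentage difference keeps the size of the deviation:

```python
from statistics import median

def mad(values):
    """Median Absolute Deviation: median of absolute deviations from the median."""
    m = median(values)
    return median(abs(v - m) for v in values)

# Hypothetical durations (seconds) for one job category: most runs are identical,
# so the median deviation collapses to 0 and MAD would flag any change at all.
durations = [30, 30, 30, 30, 31]
print(mad(durations))  # 0 -- even a 1-second blip exceeds any multiple of MAD

# The percentage-difference feature instead keeps the size of the deviation:
# the 31-second run deviates only about 3.3% from the 30-second median.
m = median(durations)
pct_diff = [(d - m) / m * 100 for d in durations]
```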
3 Theoretical background
This section explains the algorithms used in this paper, as well as general background knowledge of the anomaly detection field.
1. anomalies are different from the norm with respect to their features and
2. they are rare in a dataset compared to normal instances.
As unsupervised learning does not involve labeled data, the ground truth is unknown. As a result, the contamination rate has to be estimated in other ways. Since one of the characteristics of an anomaly is its rarity in the data set, an estimate was found by examining the data points classified as anomalies. With c = 10%, the default value of the models, the data points classified as anomalies contained far too many false positives according to domain experts at Trimma. The smallest deviation classified as an anomaly (the lower threshold) was 0.08% from the median value of the category. Halving c resulted in a lower threshold of approximately 10%, which is more accurate, but still not a large enough deviation according to Trimma's analysts. Halving c again gave c = 2.5% and a lower threshold of approximately 25%. This is the final contamination rate used across the models, as it was deemed an appropriate definition of an anomaly in this situation. As seen in Figure 1, the contamination rate correlates with the lower threshold: the lower the contamination rate, the larger the smallest detected anomaly. Note that only the increase threshold was considered during this process, the reasoning being that finding anomalies that increased in duration is of more importance to Trimma.
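The relationship between the contamination rate and the lower threshold can be sketched as follows (with synthetic scores standing in for the real model output): the threshold is the (1 − c) quantile of the outlier scores, so halving c pushes it upward.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic outlier scores: a dense bulk near zero plus a sparse tail,
# standing in for the % deviations of the real data set.
scores = np.concatenate([rng.normal(0, 5, 950), rng.uniform(50, 400, 50)])

# The lower threshold is the smallest score still labeled an anomaly at rate c;
# smaller c -> larger threshold, matching the trend in Figure 1.
for c in (0.10, 0.05, 0.025):
    print(c, round(float(np.quantile(scores, 1 - c)), 2))
```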
Figure 1: Lower threshold for each model with various contamination rates. Note that the
x-axis is inverted.
Data points with a shorter path are more likely to be anomalous. Liu et al. [3] explain why this technique works:
"This random partitioning produces noticeable shorter paths for anomalies since (a) the fewer instances of anomalies result in a smaller number of partitions – shorter paths in a tree structure, and (b) instances with distinguishable attribute-values are more likely to be separated in early partitioning."
Since Isolation Forest is an ensemble of trees, one of its parameters is the number of trees used. Each tree uses a random partition of the data and recursively splits it as explained above. The average number of splits it takes for each data point to be isolated is then calculated; data points that require fewer splits on average have a higher probability of being anomalies. The other parameter is the subsample size. IF only uses a subsample of the whole data set. This helps with issues such as masking and swamping, while also allowing IF to scale well. Swamping refers to the issue where a model wrongly identifies normal data points as anomalies. Masking refers to clusters of anomalies, which makes them harder to detect since more splits would be required to isolate any of them [3]. By using a smaller sample size, the severity of these issues is mitigated. The sample size has a default value of min(256, n), where n is the size of the data set. The execution time largely depends on the number of isolation trees used in the ensemble and the number of features used.
Figures 1 and 2 visualize splitting a data set to isolate different points. As shown, it takes fewer splits to isolate an anomalous point than a normal point. This is generally the case, though it is possible for a normal point to take fewer splits to be isolated than an anomalous point. However, using more than one tree helps mitigate this issue, as the average number of splits is used.
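The mechanism above can be sketched in a few lines for one-dimensional data (a toy sketch, not the real Isolation Forest implementation): each "tree" splits on a uniformly random value and follows the side containing the point, and the anomaly needs far fewer splits on average to end up alone.

```python
import random

def isolation_depth(x, data, depth=0):
    """Depth at which x becomes isolated under random splits (one feature)."""
    if len(data) <= 1:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)
    # Follow the partition that still contains x.
    subset = [v for v in data if v < split] if x < split else [v for v in data if v >= split]
    return isolation_depth(x, subset, depth + 1)

random.seed(42)
data = [random.gauss(0, 2) for _ in range(255)] + [100.0]  # one clear anomaly

def avg_depth(x, trees=200):
    return sum(isolation_depth(x, data) for _ in range(trees)) / trees

print(avg_depth(100.0))  # shallow: isolated after very few splits
print(avg_depth(0.0))    # deep: many splits needed inside the dense region
```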
The anomaly score can be computed in different ways. One way is using the distance to the data point's kth nearest neighbor [15]. In doing this, data points that are sparse and further away from other data points have a greater distance to their kth nearest neighbour, and are thus more likely to be anomalies. Another option is to use the average distance to the k nearest neighbors. This approach has the benefit of also taking into account the local density of the points [12]. In this paper, the latter option is used.
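A minimal numpy sketch of the latter variant (average distance to the k nearest neighbours) on one-dimensional, made-up deviation values:

```python
import numpy as np

def knn_scores(X, k):
    """Anomaly score = mean distance to the k nearest neighbours (1-D data)."""
    X = np.asarray(X, dtype=float)
    d = np.abs(X[:, None] - X[None, :])        # pairwise distances
    d_sorted = np.sort(d, axis=1)[:, 1:k + 1]  # drop the self-distance (column 0)
    return d_sorted.mean(axis=1)

# Hypothetical % deviations: dense around 0, one large outlier at index 6.
X = [0.0, 1.0, -2.0, 0.5, 1.5, -1.0, 250.0]
scores = knn_scores(X, k=3)
print(scores.argmax())  # 6 -- the outlier has by far the largest average distance
```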
(|C_1| + |C_2| + ... + |C_b|) ≥ |D| · α
|C_b| / |C_{b+1}| ≥ β
Here, α is a parameter used to decide how much of the data the large clusters should collectively contain. If α is set to 90%, the large clusters should collectively contain 90% of the data points. β is used to decide how much large and small clusters should differ in size; e.g., if β is set to 5, any large cluster should be at least 5 times larger than any small cluster.
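The two conditions can be sketched as a small function (with hypothetical cluster sizes; the returned b counts how many of the size-sorted clusters are "large"):

```python
def split_clusters(sizes, alpha=0.9, beta=5):
    """Return b such that clusters[:b] are 'large' and the rest are 'small'.

    sizes must be sorted in decreasing order; the boundary is the first b
    satisfying either condition of the CBLOF definition above.
    """
    total = sum(sizes)
    running = 0
    for b in range(1, len(sizes)):
        running += sizes[b - 1]
        if running >= alpha * total:         # large clusters cover alpha of the data
            return b
        if sizes[b - 1] / sizes[b] >= beta:  # sharp size drop between cluster b and b+1
            return b
    return len(sizes)

# Hypothetical cluster sizes from Kmeans on the deviation feature:
print(split_clusters([7600, 300, 60, 40]))  # 1 -- one dominant cluster
```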
To decide outlierness, CBLOF uses a combination of distance and size. Assume p is a data point; to decide its anomaly score, first calculate the distance from p to the nearest large cluster center, and second, calculate the size of the cluster p belongs to, to use as a weight.
The CBLOF implementation in PyOD does not use the cluster sizes for weighting. The reason given is that the weighting hurts detection, since anomalies that are close to small clusters would not be detected. This implementation also uses Kmeans++ clustering instead of the Squeezer algorithm used in the original algorithm.
"A variant that chooses centers at random from the data points, but weighs the data points according to their squared distance from the closest center already chosen"
I.e., this variant first selects one initial cluster center (centroid) at random. The next centroids are chosen using a probabilistic method instead of uniformly at random: the distance from each data point to the nearest previously chosen centroid is calculated, and data points that are further away have a larger weight for being chosen as the next centroid, since they have a larger chance of lying in a different cluster. This process is repeated until k centroids have been selected. This increases the chance of selecting centroids that lie in different clusters, and thus may require fewer iterations than the original algorithm.
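The seeding procedure can be sketched for one-dimensional points (a toy sketch, not the PyOD implementation):

```python
import random

def kmeanspp_init(points, k, seed=0):
    """k-means++ seeding: weight candidates by squared distance to the nearest chosen centre."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest already-chosen centroid.
        weights = [min((p - c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

# A tight group near 0 and one far-away point: the far point is almost
# certain to be picked as one of the two centroids.
points = [0.1, 0.3, -0.2, 0.0, 0.2, 500.0]
cents = kmeanspp_init(points, k=2)
```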
4 Method
In this section, an overview of the methods used is presented. Section 4.2 presents the toolkit used to implement the models. Section 4.3 shows the data set and how it is prepared.
4.1 Hardware
All experiments were run on a computer with the following hardware: CPU: AMD Ryzen 5 5600X, 3.7 GHz, 35 MB cache; GPU: GeForce GTX 1070 Ti; RAM: 16 GB, 3200 MHz.
4.2 PYOD
PyOD [17], a Python toolkit for detecting outliers, is used to build the different models. The toolkit has a variety of models already implemented. The anomaly-detection models require a contamination parameter, which describes the fraction of anomalies in the data set. The model assigns every data point an outlier score, and the n · contamination rate points with the highest scores are then labeled as anomalies. This forces the highest-scoring data points to be classified as anomalies, even if their scores are low. As a result, if the contamination rate is set too high, the model is forced to misclassify normal points as anomalies. If it is set too low, the model might miss some anomalies and only capture the most severe ones.
The fitted models can be saved and used to detect anomalies in a new data set. The new data set is then compared to the fitted model, and all data points that are above the model's lower threshold are labeled as anomalies. The contamination rate does not affect predictions on a new data set, since the thresholds are already set. I.e., if a data set contains 10,000 data points, all data points above the set thresholds are considered anomalies, not only the top-scoring 2.5%.
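This labeling scheme can be sketched as follows (a simplification of the behaviour described above, using synthetic scores): fitting fixes a threshold at the (1 − contamination) quantile of the training scores, and predictions on new data reuse that stored threshold.

```python
import numpy as np

def fit_threshold(scores, contamination=0.025):
    """Label the top n * contamination training scores as anomalies; keep the threshold."""
    threshold = float(np.quantile(scores, 1 - contamination))
    labels = (scores >= threshold).astype(int)  # 1 = anomaly
    return threshold, labels

rng = np.random.default_rng(1)
train_scores = rng.normal(0, 1, 10_000)
threshold, labels = fit_threshold(train_scores)  # exactly 2.5% labeled anomalous

# New data is judged against the stored threshold, not re-ranked, so its own
# size and score distribution do not change the definition of an anomaly.
new_scores = rng.normal(0, 1, 1_000)
new_labels = (new_scores >= threshold).astype(int)
```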
4.3.1 Data manipulation
The categorical data needed for these models were the job id and step id. The machine-learning models can only use numerical data, so some preprocessing steps were required for the categorical data. Examples of logs are available in Table 1.
The goal of the model was to find local anomalies within each category. However, splitting the data across 350 categories resulted in imbalanced data. When using the categorical data as a feature, the models would classify data points with rare categories as anomalies, and when using the raw duration as a feature, the models would classify rare durations as anomalies. Neither of these is the desired result: the raw duration does not matter, it is the difference from the ordinary duration within the category that is the important aspect. As a result, neither the categorical data nor the raw duration was used as a feature. By creating a new feature that combines the categorical data with the duration, it was possible to look for global anomalies instead of local ones within each category.
The feature created for this solution was the percentage difference from the median of the category. This was done by first calculating the median duration of each category, and then calculating the percentage difference between each log's duration and its category's median. The difference was calculated with the following formula:
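Writing d for a log's duration and m for the median duration of its category, the description above corresponds to:

```latex
\text{medianDiff} = \frac{d - m}{m} \times 100
```

So an event that takes twice its category's median gets a medianDiff of 100, and one that takes half the median gets -50.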
This feature was created since all logs, regardless of category, have something in common: their deviation from the ordinary. What counts as ordinary depends on the log's category, and this deviation is precisely what the model should classify each log on. If the deviation is large enough, the log should be deemed an anomaly. In doing this, we capture the important parts of the categorical features without having to use them directly in the model. This allows the model to locate local anomalies globally. By using a percentage, it is ensured that shorter job types are weighted the same as longer job types. Another advantage of this approach is that new categories can be evaluated without re-fitting the model, by simply calculating their median and the percentage difference from it. A possible drawback of this approach is that increases have a larger weight than decreases. This is because the increase is unbounded, while the decrease is bounded: a process cannot take less than 0 seconds, so the largest possible decrease is 100%.
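As a sketch, the feature can be computed with nothing more than a per-category median (the log tuples below are invented stand-ins for the job id / step id and duration columns):

```python
from statistics import median
from collections import defaultdict

# Hypothetical logs: (category, duration in seconds). Real logs carry
# a job id and step id; a single string stands in for the category here.
logs = [("jobA.step1", 30), ("jobA.step1", 32), ("jobA.step1", 31),
        ("jobB.step1", 2000), ("jobB.step1", 2100), ("jobA.step1", 60)]

by_cat = defaultdict(list)
for cat, dur in logs:
    by_cat[cat].append(dur)
medians = {cat: median(durs) for cat, durs in by_cat.items()}

# Percentage difference from the category median -- the single model feature.
features = [(dur - medians[cat]) / medians[cat] * 100 for cat, dur in logs]
```

The 60-second jobA run stands out globally even though its raw duration is far below every jobB run, which is exactly the point of the feature.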
The entire preprocessing pipeline is, as shown in Figure 4, divided into four steps. The cleaning step involves removing data that would not contribute to good performance. This includes data whose category only has one instance, and data of events that failed to execute. These data points are excluded since they would add no value to the model and could negatively affect the results. Figure 5 shows that the majority of the values had a deviation of 0, or a slight decrease.
Table 2 shows an example of logs after adding median and medianDiff as columns.
Figure 4: Preprocessing pipeline for the data set
Figure 5: Distribution of percentual deviations from median in data set
5 Results
This chapter discusses the results of the models. All models' contamination rate was set to 0.025 (2.5%). Exactly how well the models perform is hard to measure, since no exact threshold was decided for what defines an anomaly. However, as previously stated, an anomaly was defined as belonging to the top 2.5% of deviating data points in this data set. All models reported a threshold between approximately 20% and 30%, and these thresholds were confirmed by domain experts to be fitting.
The data set used to fit each model contains approximately 8,000 data points; the fitted models can then be saved and loaded. In doing this, the definition of an anomaly does not change with new data sets. A fitted model is able to detect anomalies in new data sets with the same thresholds chosen during fitting.
Figure 6: Thresholds set for Isolation Forest with an increasing amount
of trees, black line showing the average
Figure 8: Differences in anomalies found between IF and the other models
5.2 K-nearest neighbor
kNN performed well in terms of finding anomalies with k = 12. This k was chosen because the threshold values converge after that point, as seen in Figure 9. The value of k is very dependent on the data set; k = 12 performs well when fitting the data set of 8,000 data points. However, when using k = 12 on the data set with 40,000 data points, the anomaly detection was not consistent in what was classified as an anomaly. The data points that were not classified as anomalies despite having a larger deviation than some anomaly are the 'inconsistent normal points' seen in Figure 10.
The smallest increase classified as an anomaly was approximately 23.41% from the median. The smallest deviation for events that decreased in duration was -20.72%. However, kNN, despite being a very simple model, had the drawback of a slower execution time, as seen in Figure 11. It scaled the worst of the three models, although it had the best execution times for n < 2000. kNN classified 213 data points as anomalies.
Using the fitted model to predict new samples is faster, but kNN still performs the slowest of the three models, as seen in Figure 12.
kNN suffers from the effects of masking: if there are many similar anomalies, they are not detected. This can be demonstrated by adding k copies of an anomaly to the data set, as seen in Figure 13.
Figure 9: Thresholds set for different values of 𝑘
Figure 11: Execution time of fitting CBLOF, IF and kNN with an increasing data set size
Figure 12: Execution time when predicting new data sets of increasing
sizes using the fitted models
Figure 13: Masked anomaly in kNN
5.3 Cluster-based Local Outlier Factor
CBLOF performed similarly to Isolation Forest in that it was good at finding anomalies. Unlike IF, its results did not vary between fittings; the lower thresholds for anomalies stayed the same each run unless the data set was changed. The execution time increased approximately linearly with the number of clusters used, as seen in Figure 14. The anomalies found were approximately the same with increasing clusters, as seen in Figure 15, meaning that no more than two clusters were needed for this data set. The smallest increase from the median classified as an anomaly was approximately 26.83%, and the smallest decrease was approximately 24.16%. Unlike kNN, CBLOF did detect the masked anomaly seen in Figure 8.
CBLOF performed the best of the three models with regard to execution time, in both fitting and predicting on new data sets with the fitted model, while also having thresholds similar to kNN and IF. CBLOF classified 213 data points as anomalies.
Figure 14: Differences in execution time with an increasing amount of
clusters
6 Discussion
IF required about 20 trees to gain stable performance, but this also affected the execution time of the model, making it worse than CBLOF on this type of data. As the data used to fit the model was one-dimensional, Isolation Forest only split on one feature, as opposed to a random feature when using multi-dimensional data. As a result, each tree split on a random value between the minimum and maximum value of the % deviation. As seen in Figure 5, the majority of values are close to 0%, with many having a small decrease compared to the median. Consequently, when a tree in Isolation Forest splits on a random value, it is much more likely to split where the values are sparse; it is thus more likely to isolate the larger deviations than a normal point. These random splits are the reason the threshold varies a little every run. When the threshold for increases becomes smaller, the threshold for decreases becomes larger. This is because the contamination rate fixes the number of anomalies: if the increase threshold gets smaller, more anomalies are found among the increases, and fewer among the decreases, as that threshold moves further away from zero. The number of anomalies found by IF varied slightly every run, but on average it found fewer anomalies than CBLOF and kNN. IF had a slightly larger threshold for increases than the other models, while having approximately the same threshold for decreases. One possible explanation is that the values with more than a 30% increase were sparser and thus more likely to be isolated: there were 50 logs between +20% and +30%, and only 25 logs between +30% and +40%.
CBLOF did not require many clusters due to the distribution of the data; approximately 90% of the data points had a % deviation within approximately [−7.5, 6]%. Since the majority of the values had a small deviation from the median, there existed only one large cluster, whose center depends on where the majority of the values lie. In this data set, the large cluster center had a value of 1.62. The small cluster contained only four values, which were extremely large anomalies with a % deviation between [400, 1600]. As a result, CBLOF used the distance between each data point and 1.62 as the anomaly score. As anomalies were defined as points that deviate from the median, this resulted in larger deviations having larger anomaly scores. Since the large cluster center had a value of 1.62 and not 0, the thresholds were skewed: anomalies that increased required 1.62 more percentage points than anomalies that decreased in order to be classified as anomalies. This is the reason the absolute values of the two thresholds in Figure 15 are not exactly equal.
kNN works by using the distances to the k nearest neighbors of each data point. The larger the average of these distances is, the more weight is given, and a larger weight indicates a larger probability of being an anomaly. As the majority of the values are close to 0%, the normal values have a small weight, while the values with a larger % difference are sparser and thus more likely to be anomalies. kNN suffers from masking in this data set: if there are several similar anomalies, they are not detected. This is because the model only looks at the k nearest neighbors; if the k neighbors are similar data points, the average distances between them are small, giving them a small weight, and they are not detected. This issue does not occur in CBLOF or IF. CBLOF is not affected, as it only looks at the distance to the closest large cluster center, so small clusters of anomalies do not affect detection. IF only uses a subset of the data, thus mitigating the masking issue.
As seen in Figure 8, the models differ only in detecting the smaller anomalies. This is expected, since the models differ in threshold values: if a model has a lower threshold than another, it will also classify smaller deviations as anomalies. Some instances of masking are seen as well, where both CBLOF and IF detected the masked anomalies whereas kNN did not. CBLOF classified approximately 30 more increasing data points than IF, but IF classified approximately 30 more decreasing data points than CBLOF. IF did, however, classify 5 fewer data points in total than kNN and CBLOF.
7 Conclusion
To answer the research question: Isolation Forest did perform better than kNN in execution time, and it did not have the same problems with masking, since each tree uses a subset of the data. However, IF did not perform better than CBLOF in terms of execution time. This is because CBLOF did not require many clusters to achieve good results on one-dimensional data, while IF required around 20 trees for stable results. Thus, the results are inconclusive and require further testing.
The three models' detection results were similar but had some differences in detecting smaller anomalies. CBLOF had thresholds at [-24.16, 26.83], kNN at [-20.73, 23.41], and finally IF at [-21.14, 31.09]. These thresholds are simply the smallest % differences from the median classified as anomalies, for both decreases and increases. IF has a larger increase threshold than the other two, but a similar decrease threshold. All models classified a similar number of data points: kNN and CBLOF each classified 213 data points as anomalies, and IF classified 208.
8 Future work
If there is to be continued work on this model, one area of improvement would be to account for the time of day of each log. Logs where the event runs at night should only be compared to the same event running at night; processes running during the night can afford to take longer, since the results will not be needed until the morning. One could also scale the feature so that logs with a lower median have more flexibility than the bigger jobs. The reason for this is that a 20% increase of a job whose median is 30 seconds is significantly cheaper than a 20% increase of a job whose median is 2000 seconds. Frequency could also be accounted for: more frequently run jobs are of higher priority than rare jobs.
As the experiments were done with a one-dimensional feature space, it would be interesting to see how the models compare when using more features.
More aspects could be considered when comparing these models. This paper compared
the anomaly detection as well as execution time. However, it would be of interest to also
compare memory usage for each model.
One could investigate the use of a reinforcement learning model, which in this case could
be a good option since anomalies will be investigated by an engineer. This engineer could
then rate how well the model performed in order to improve the classification.
Lastly, it would be interesting to investigate whether these machine learning models perform better than simpler statistical methods, such as using the interquartile range (IQR) to detect anomalies [18].
Bibliography
[1] Varun Chandola, Arindam Banerjee, and Vipin Kumar. “Anomaly detection: A survey”.
In: ACM computing surveys (CSUR) 41.3 (2009), pp. 1–58.
[2] Animesh Patcha and Jung-Min Park. “An overview of anomaly detection techniques:
Existing solutions and latest technological trends”. In: Computer networks 51.12 (2007),
pp. 3448–3470.
[3] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. “Isolation forest”. In: 2008 eighth ieee
international conference on data mining. IEEE. 2008, pp. 413–422.
[4] Thomas Schlegl et al. “Unsupervised anomaly detection with generative adversarial
networks to guide marker discovery”. In: International conference on information pro-
cessing in medical imaging. Springer. 2017, pp. 146–157.
[5] Kingsly Leung and Christopher Leckie. “Unsupervised anomaly detection in network
intrusion detection using clusters”. In: Proceedings of the Twenty-eighth Australasian
conference on Computer Science-Volume 38. 2005, pp. 333–342.
[6] Richard J Bolton, David J Hand, et al. “Unsupervised profiling methods for fraud detec-
tion”. In: Credit scoring and credit control VII (2001), pp. 235–255.
[7] Shilin He et al. “Experience report: System log analysis for anomaly detection”. In:
2016 IEEE 27th international symposium on software reliability engineering (ISSRE). IEEE.
2016, pp. 207–218.
[8] Christophe Leys et al. “Detecting outliers: Do not use standard deviation around the
mean, use absolute deviation around the median”. In: Journal of experimental social
psychology 49.4 (2013), pp. 764–766.
[9] Markus Goldstein and Seiichi Uchida. “A comparative evaluation of unsupervised anomaly
detection algorithms for multivariate data”. In: PloS one 11.4 (2016), e0152173.
[10] Mennatallah Amer and Markus Goldstein. “Nearest-Neighbor and Clustering based Anomaly Detection Algorithms for RapidMiner”. In: (). url: https://www.goldiges.de/publications/Anomaly_Detection_Algorithms_for_RapidMiner.pdf.
[11] Zengyou He, Xiaofei Xu, and Shengchun Deng. “Discovering cluster-based local out-
liers”. In: Pattern recognition letters 24.9-10 (2003), pp. 1641–1650.
[12] Fabrizio Angiulli and Clara Pizzuti. “Fast outlier detection in high dimensional spaces”.
In: European conference on principles of data mining and knowledge discovery. Springer.
2002, pp. 15–27.
[13] Markus M Breunig et al. “LOF: identifying density-based local outliers”. In: Proceedings
of the 2000 ACM SIGMOD international conference on Management of data. 2000, pp. 93–
104.
[14] Thomas Cover and Peter Hart. “Nearest neighbor pattern classification”. In: IEEE trans-
actions on information theory 13.1 (1967), pp. 21–27.
[15] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. “Efficient algorithms for min-
ing outliers from large data sets”. In: Proceedings of the 2000 ACM SIGMOD international
conference on Management of data. 2000, pp. 427–438.
[16] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding.
Tech. rep. Stanford, 2006.
[17] Yue Zhao, Zain Nasrullah, and Zheng Li. “PyOD: A Python Toolbox for Scalable Outlier Detection”. In: Journal of Machine Learning Research 20.96 (2019), pp. 1–7. url: http://jmlr.org/papers/v20/19-011.html.
[18] Peter J Rousseeuw and Mia Hubert. “Robust statistics for outlier detection”. In: Wiley
interdisciplinary reviews: Data mining and knowledge discovery 1.1 (2011), pp. 73–79.