

Threshold based Technique to Detect Anomalies using Log
Files
TOLUWALOPE DAVID AKANDE, Faculty of Computer Science, UNB, Canada
BARJINDER KAUR, Faculty of Computer Science, UNB, Canada
SAJJAD DADKHAH, Faculty of Computer Science, UNB, Canada
ALI A. GHORBANI, Faculty of Computer Science, UNB, Canada
Every action carried out on a computer system can be captured in log files, and careful scanning of these files can reveal security breaches. However, because log files contain a voluminous number of events, they should be analyzed by a large-scale data processing engine. This paper proposes an anomaly detection approach that uses a threshold to discriminate between regular and aberrant log files. The experiments are performed on HDFS, a publicly available log dataset. The system's efficacy is evaluated using the Robust Random Cut Forest (RRCF), an unsupervised tree-based approach, with which we achieved a precision of 97.10% and an F1-score of 98.47%. The Hadoop framework is used to run the experiments because of its ability to process large datasets in parallel in less time.
CCS Concepts: • Security and privacy → Intrusion/anomaly detection and malware mitigation.
Additional Key Words and Phrases: Anomaly detection, log analysis, unsupervised machine learning, dis-
tributed system, RRCF.
ACM Reference Format:
Toluwalope David Akande, Barjinder Kaur, Sajjad Dadkhah, and Ali A. Ghorbani. 2022. Threshold based Technique to Detect Anomalies using Log Files. In 2022 The 7th International Conference on Machine Learning Technologies (ICMLT'22), March 11-13, 2022, Rome, Italy. ACM, New York, NY, USA, 12 pages.
https://doi.org/10.1145/3529399.3529430

1 INTRODUCTION
Log files record the activity of the computers attached to a network. These files can be used for a number of purposes, including security breach analysis [4], troubleshooting [18], and anomaly detection [23]. In earlier days, log analysis was done manually, but with the increase in the volume of log files caused by the growing number of connected systems, it has become difficult to process these files with traditional tools.
With the increased complexity of the data collected from heterogeneous sources and stored in
logs, detecting anomalous activity without prior knowledge is becoming a real challenge. Companies
are continuously migrating operations from centralized systems to distributed systems. Because of the essential services these companies provide, such as payment solutions, DNS services, and search engines, they cannot afford downtime. Downtime, in the form of service outages or deterioration in quality of service, leads to brand damage and revenue loss. In 2014, an Avaya report found that
downtime loss averages from 2,300 - 9,000 dollars per minute depending on factors like company
size and industry vertical [1]. In August 2016, a five-hour downtime in an operation center caused

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
ICMLT'22, March 11-13, 2022, Rome, Italy
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06. . . $15.00
https://doi.org/10.1145/3529399.3529430


2,000 canceled flights and an estimated loss of 150 million dollars for Delta Airlines [1]. Hence,
building a reliable system has become essential.
Software logs are required to record the run-time information of software, and this type of log file has been employed in a variety of reliability assurance tasks [13]. Log files are unstructured text containing information about all behaviours and events that occur while software or a process is running. Due to the volume and velocity of processes running in a distributed environment, manual investigation of system behaviour for malfunction detection is impractical, as the volume of data runs into gigabytes.
Logging may be classified into several categories. Security logging entails information collection
from systems related to security and helps to identify possible breaches, malicious programs,
information thefts and to assess the condition of security measures [3]. Access logs, which provide
information regarding user authentication, are also included in these logs. System failures and
malfunctions are revealed via operational logging [3]. Compliance logging, which is commonly confused with security logging, gives information regarding compliance with security standards. These logs are divided into two categories: recording of information-system security in terms of data flow and storage, such as compliance with the PCI DSS or HIPAA standards, and tracking of system settings [3].
Log analysis involves the following four main steps:
(1) log collection.
(2) log parsing.
(3) feature extraction.
(4) anomaly detection.
A typical log analysis management architecture is made up of variable and constant components.
Traditionally, developers relied on regex for log parsing and detecting anomalies. Log parsing is
usually the first step of automated log analysis. However, this process is challenging due to the
following reasons.
• The large number of logs and, consequently, the considerable time spent on building regexes by hand.
• The complexity of the software and, consequently, the variety of different event templates.
• The frequency with which software upgrades are performed and, hence, the frequency with which logging statements are updated [13].
Anomaly detection, which tries to reveal anomalous system behaviours promptly, is critical in large-scale system incident management. In real time, it helps system developers identify and rectify issues quickly and decrease system downtime. Traditionally, developers manually review system logs or create rules to identify abnormalities based on domain expertise, aided by keyword search (e.g., "failure", "exception") or regular expression matching. However, for large-scale systems, this anomaly detection approach, based primarily on manual analysis of logs, has proven unsatisfactory for the following reasons [14].
• Because of the large-scale and parallel nature of current systems, the system behaviours are too complicated to be fully understood by any single developer, who is typically responsible for only a sub-component.
• Modern systems create a large number of logs every hour. Even with commands like search
and grep, the sheer volume of such logs makes it notoriously difficult, if not impossible, to
manually filter the vital information from the noise data for anomaly detection.
As a result, automated methods for identifying irregularities in logs are necessary. The contributions of this study are as follows:
• Firstly, we propose a threshold-based technique for detecting anomalies in log files.


• Secondly, we focus on anomaly identification in log files using tree-based data mining techniques.
• Finally, the proposed approach is evaluated using RRCF, which is compared with six different
classifiers.
The rest of the paper is organized as follows. In Section 2 the previous work is presented.
Section 3 describes our proposed methodology followed by dataset description, machine learning
methodologies and results achieved in Section 4. Finally, a conclusion of the work with future
research directions is presented in Section 5.

2 LITERATURE REVIEW
Log analysis is the first and foremost step in discovering abnormalities in log files. In this section, we first discuss the different steps of a log analysis framework and then summarize previous work on anomaly detection.

2.1 Framework for Log Analysis


The framework for anomaly detection consists primarily of four steps: log collection, log parsing,
feature extraction, and anomaly detection.
• Log Collection: Large-scale systems generate logs regularly to record system states and
runtime information, with each log containing a date and a log message summarizing what
happened [14]. Because this vital information may be utilized for several purposes, including
anomaly detection, logs are first collected for later use.
• Log Parsing: Logs are usually plain text consisting of a constant part and a variable part. Because log messages are unstructured, the goal is to extract a set of event templates from the raw logs so that they can be organized into a more structured format. More specifically, each log message can be processed into a constant part with defined characteristics [27]. Fig. 1 depicts the framework for log parsing.
• Feature Extraction: After log parsing, the next important step is extracting robust features that represent the log events. The log data is grouped according to its identifiers, so the log events act as input and the output is an event count matrix. This information is fed into anomaly detection models appropriate to the proposed application [12].
• Anomaly Detection: This is the final step of the framework, where different machine learning or deep learning models are constructed to detect anomalies in the log files. These models help identify whether or not a new incoming log sequence is an anomaly [7].

2.2 Related Work


Anomaly detection, as described in [6], is the discovery of patterns in data that do not conform to expected behaviour. Anomalies in data arise for several reasons; suspicious network traffic behaviour, for example, might suggest that a cyber-attacker has infiltrated the network. All anomaly detection techniques are fundamentally concerned with creating a representation of typical behaviour and labeling anything that does not fit this representation as an abnormality. Fig. 2 depicts the traditional method for detecting anomalous log files; however, this method assumes that the deviation from regular events is known before it occurs.
Wang et al. [25] propose the use of K-NN, an unsupervised method, for detecting anomalies in log files. The study offers a three-part log-based anomaly detection approach with efficient neighbor searching and an automated k-nearest-neighbor selection technique. In the initial step, neighbors are searched using MinHash and an MVP-tree, after which the automatic selection of k neighbors is performed.

Fig. 1. Framework for log parsing

Neighbors from the MVP tree are chosen and saved into a spare neighbor sample set, and a neighbor assessment technique based on the Silhouette Coefficient is applied. The average distance between the sample to be detected and the real neighbors of each category is used to detect anomalies. However, this method requires the tuning of hyper-parameters to obtain the best results.
Existing approaches primarily use past log files, which are fed into a detection model to detect abnormalities. In [31], the authors proposed LogRobust, a novel log-based anomaly detection technique that extracts and encodes semantic information from log events as semantic vectors. An attention-based Bi-LSTM model is then used to detect anomalies in log files, collecting contextual information in log sequences and automatically learning the significance of different log events. With the same type of model, i.e., an LSTM, Wang et al. worked on one month of log data to detect anomalies, and their method achieved a precision of 0.96. The log entries were obtained from NetEngine40E series router devices installed in real network settings [26].
A deep neural network named DeepLog is proposed to detect log anomalies online [8]. DeepLog learns and encodes the whole log message, including the timestamp, log key, and parameter values, and detects anomalies at the log-entry level rather than at the session level. A precision, recall, and f-measure of 0.88, 1.00, and 0.93, respectively, are achieved with the proposed technique. The drawback of this approach is that the deviation of events from normal is unknown.
Yin et al. [28] developed LogC, a novel log-based anomaly detection technique with component-aware analysis. In this approach, logs are divided into log template sequences and component sequences, which are then used to train a combined LSTM model for identifying anomalous logs. With the proposed approach, accuracy, recall, and f-measure values of 93.53%, 98.29%, and 95.85%, respectively, are achieved.
The authors in [20] proposed two-level parsing, with log keys as input to a convolutional neural network (CNN). The parsing is applied to the raw data to retrieve the log keys and log vectors, respectively.


After comparing with an LSTM and a Multilayer Perceptron (MLP), they found that the CNN gives better accuracy with their approach.
Based on the density-based OPTICS algorithm, the authors of [29] proposed an unsupervised streaming anomaly detection mechanism consisting of a knowledge-base construction system and a streaming anomaly detection system that generates alerts in real time. Using the HDFS log dataset, the F1-score for detecting anomalies was 83%.
An integrated, scalable framework for anomaly detection in distributed environments with large amounts of unlabeled log data has also been proposed. The experiments, performed on the NASA Hypertext Transfer Protocol (HTTP) log dataset, are validated using a combined k-means and XGBoost system after extracting fourteen features [17].

Fig. 2. Traditional Approach to Anomaly Detection in Log Files [31]

Fig. 3. Framework for Anomaly detection

3 FEATURE EXTRACTION & METHODOLOGIES


In this section, we describe our proposed approach to detecting anomalies in log files. The content generated by logging statements can be classified into many groups in order to detect abnormal logs that represent system faults automatically and correctly. We utilize a system that can manage streaming
data and is less susceptible to outliers while detecting anomalies. Due to the voluminous number of logs generated, we selected Hadoop for processing the large volume of data. Fig. 3 depicts the proposed framework used for detecting anomalies.

3.1 Feature Extraction


Raw logs are unstructured and record the actions taken on the system as free text. Firstly, the unstructured logs are converted to structured logs using a template. Next, the structured logs are converted to vector embeddings using Gensim (https://pypi.org/project/gensim/), a natural language processing library in Python.
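To make the embedding step concrete, the following is a minimal sketch of how structured log events can be turned into fixed-length session vectors with Gensim. It assumes the structured logs have already been grouped into per-session sequences of event template IDs; the variable names and hyper-parameter values are illustrative, not the exact experimental configuration.

```python
# Minimal sketch: embedding structured log events with Gensim (>= 4.x).
# `sessions` is a hypothetical stand-in for per-session event ID sequences.
import numpy as np
from gensim.models import Word2Vec

sessions = [
    ["E5", "E22", "E5", "E11"],   # one log session = one "sentence"
    ["E5", "E26", "E26", "E9"],
]

# Train a small Word2Vec model so that every event template gets a dense vector.
model = Word2Vec(sentences=sessions, vector_size=32, window=5, min_count=1, sg=1)

# Represent each session as the mean of its event vectors, giving one
# fixed-length feature vector per session for the detectors in Section 3.2.
session_vectors = np.array(
    [np.mean([model.wv[event] for event in session], axis=0) for session in sessions]
)
print(session_vectors.shape)  # (number of sessions, 32)
```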

3.2 Methodologies
To detect anomalies in log files, we use several unsupervised anomaly detection models: RRCF, one-class support vector machine (OCSVM), Isolation Forest (IF), Local Outlier Factor (LOF), Elliptic Envelope (EE), OPTICS, and LogCluster. These models were chosen for their ability to detect anomalies efficiently; an illustrative scikit-learn sketch of four of the baseline detectors is given after the list below.
• Robust random cut forest (RRCF): An anomaly detection model designed to handle high-dimensional and streaming data; we later evaluate RRCF against the other unsupervised learning methods. A robust random cut tree (RRCT) is created from a point set S by iteratively dividing it until each point is isolated in its own bounding box [10]. At each iteration of the tree construction procedure, a dimension is chosen at random with probability proportional to its range (the difference between its minimum and maximum values), and a cut value is chosen uniformly between the minimum and maximum values of that dimension. If the cut separates a point x from the remainder of the point set, a new leaf node for x is generated and the point is removed from the point set. The procedure is repeated recursively for every resulting subset. In an RRCF, outliers are more likely to be found closer to the root of the tree. The collusive displacement (CoDisp) of a point is used to determine whether it is an outlier: if inserting a point greatly increases the model's total bit depth (the sum of the bit depths of all points in the tree), that point is more likely to be an outlier. The algorithm for constructing an RRCT is formally specified in [10]; a sketch of CoDisp-based scoring is given below.
• One-Class SVM (OCSVM): An unsupervised intrusion detection method that separates the training data from the origin by finding a maximal-margin hyperplane (or linear decision boundary) [16]. The basic idea is to map the input data into a high-dimensional feature space using an appropriate kernel function. By providing the training data points without any class information, a hyperplane or linear function is constructed in the feature space and is computed as given in Eq. (1) [16]:

f_e(f) = w^T φ(f) − ρ    (1)

where f_e(f) is the decision function in the feature space, w is the normal (perpendicular) vector of the hyperplane, ρ is the bias of the hyperplane, and φ(f) is the mapping function.
• Isolation Forest (IF): An ensemble approach that is widely used for anomaly detection. The forest is built by adding isolation trees which, following the idea of extremely randomized trees, partition attributes by choosing split values between their lowest and highest values. The partitioning continues until all the samples are isolated, and each sample needs a certain number of partitions along the path from the root to a leaf to be isolated. An instance or sample located near the root signifies an anomaly, because abnormal samples usually
have a smaller average path length than normal samples and are thus easier to isolate [9, 19]. The anomaly score s_c is calculated as defined in Eq. (2):

s_c(x, n) = 2^(−E(h(x)) / c(n))    (2)

where E(h(x)) represents the average value of h(x) over the collection of isolation trees (itrees) and c(n) is the average value of h(x) for a given n, used for normalization. The instance x is assigned to the outlier class if the value of s_c is close to 1; otherwise, it is considered normal.
• Local Outlier Factor (LOF): An outlier detection algorithm that calculates the degree to which a data point is anomalous [5] by measuring the density in the neighborhood surrounding that point. There are two basic steps: firstly, the density measure for a local data point is obtained by computing the inverse of the average reachability distance; secondly, based on this approximated local density, the algorithm computes the LOF value for every data point, defined in Eq. (3) [32]. An LOF value close to 1 indicates an inlier, whereas a value greater than 1 indicates an outlier.

LOF_k(a) = ( Σ_{b ∈ N_k(a)} lrd(b) / |N_k(a)| ) · ( 1 / lrd(a) )    (3)

where LOF_k(a) is the average local reachability density of the neighbors of a divided by the local reachability density of a itself, k is the number of neighbors considered, N_k(a) is shorthand for N_{k-distance}(a), and lrd is the local reachability density.
• Elliptic Envelope (EE): An unsupervised methodology that helps in modeling high-dimensional data. Based on a density concept, an ellipse is drawn around the data points: values that lie inside the ellipse are considered normal, and those lying outside the distribution density are labeled as outliers [22].
• OPTICS: A density-based algorithm that groups data into clusters without requiring the number of clusters to be specified. It is based on a given MinPts (minimum number of points) parameter, and the key idea is to process higher-density points first. It retains the clustering order using two quantities: the core distance and the reachability distance [30].
• LogCluster: A data mining algorithm that is well suited to analyzing security logs. By creating data clusters, the algorithm helps detect patterns in textual log files. It works in two passes: the first pass over the log files finds frequent words, and the second splits the log lines into clusters [24].
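As a point of reference for the baselines above, the following sketch shows how OCSVM, IF, LOF, and EE can be fitted on the session vectors from Section 3.1 using scikit-learn. The contamination rate and other hyper-parameters are assumptions for illustration, not the settings used in our experiments.

```python
# Illustrative sketch: fitting four unsupervised baselines with scikit-learn.
# The random data stands in for the session vectors; hyper-parameters
# (e.g., contamination=0.03) are assumptions, not the experimental settings.
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))  # stand-in for the session vectors

detectors = {
    "IF": IsolationForest(contamination=0.03, random_state=0),
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=0.03),
    "OCSVM": OneClassSVM(kernel="rbf", nu=0.03),
    "EE": EllipticEnvelope(contamination=0.03, random_state=0),
}

for name, detector in detectors.items():
    # In scikit-learn, fit_predict returns +1 for inliers and -1 for outliers.
    labels = detector.fit_predict(X)
    print(name, "flagged", int((labels == -1).sum()), "samples as anomalous")
```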
Because points cannot be inserted into or removed from the trees once they have been built, the methods mentioned above are not well suited to streaming data. Furthermore, they are susceptible to "irrelevant dimensions", which means that partitions are frequently wasted on dimensions that provide little information. The robust random cut forest aims to solve these problems.
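To illustrate CoDisp-based scoring on a stream, the sketch below uses the open-source rrcf package (https://pypi.org/project/rrcf/). It is a simplified example, not our exact experimental code: the forest size, tree size, and input data are placeholders.

```python
# Illustrative sketch of streaming CoDisp scoring with the `rrcf` package.
# Forest/tree sizes and the synthetic stream are placeholders.
import numpy as np
import rrcf

num_trees, tree_size = 40, 256
forest = [rrcf.RCTree() for _ in range(num_trees)]

rng = np.random.default_rng(0)
stream = rng.normal(size=(2000, 32))  # stand-in for session feature vectors

codisp_scores = []
for index, point in enumerate(stream):
    point_scores = []
    for tree in forest:
        # Keep each tree at a fixed size by forgetting the oldest point.
        if len(tree.leaves) > tree_size:
            tree.forget_point(index - tree_size)
        tree.insert_point(point, index=index)
        point_scores.append(tree.codisp(index))
    # The anomaly score is the collusive displacement averaged over the forest.
    codisp_scores.append(np.mean(point_scores))

codisp_scores = np.array(codisp_scores)
```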

3.3 Performance Evaluation Metrics


To assess the performance of each detection model, in this work we use precision, recall, and f-measure, which are defined in terms of the following quantities [21]:
• True Positive (TP): the number of anomalous logs that are correctly identified as anomalous.
• False Positive (FP): the number of normal logs that are incorrectly identified as anomalous.
• True Negative (TN): the number of normal logs that are correctly identified as normal.
• False Negative (FN): the number of anomalous logs that are incorrectly identified as normal.


Table 1. Dataset Information

Dataset     Total    System        # of Anomalies   Anomalies (%)
Hadoop      33,415   Distributed   976              2.92
OpenStack   70,389   Operating     19,144           27.19

• Precision: the ratio of correctly detected anomalous samples to all samples detected as anomalous: TP / (TP + FP).
• Recall: the ratio of correctly detected anomalous samples to all samples that should have been detected: TP / (TP + FN).
• F1-measure: the harmonic mean (weighted average) of precision and recall: 2·TP / (2·TP + FN + FP).
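As a quick reference, these metrics can be computed directly from predicted and true labels, for example with scikit-learn; the labels below are made up for illustration only.

```python
# Minimal sketch of the evaluation metrics on made-up labels
# (1 = anomalous log, 0 = normal log); not the experimental data.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))          # harmonic mean of the two
```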

4 DATASET DESCRIPTION & RESULTS


Research Questions: We design our experiments to answer the following research questions:
RQ1: What is the most effective threshold setting to use for the robust random cut forest?
RQ2: How effective are the robust random cut forest, Isolation Forest, Local Outlier Factor, One-Class SVM, Elliptic Envelope, LogCluster, and OPTICS in detecting anomalous logs?

4.1 Dataset Description:


In this work, we evaluate our proposed method using two open-source datasets [15]. Table 1 summarises the datasets' fundamental information. All the experiments are performed using Hadoop and Python 3.6.
• Hadoop Distributed File System (HDFS): A log dataset created in a distributed private cloud environment using benchmark workloads. The dataset, generated by running Hadoop-based map-reduce jobs on Amazon EC2, consists of 11,172,157 log messages, of which 284,818 are anomalous. The labels were manually crafted by defining rules to identify anomalies. In this study, the HDFS component, content, and event fields were used for log anomaly detection, and a total of 33,415 log files are used for analysis [11].
• OpenStack: A cloud operating system that manages pools of hardware resources throughout a data center. The dataset was generated on CloudLab, with anomaly samples created by injecting failures into the system. A total of 1,335,318 log entries were collected, of which 7% are abnormal. For evaluating the system in this study, 70,389 log files were analyzed [7].
Feature Selection: Features play an essential role in obtaining accurate results. Log files contain different fields, including date-time, PID, logging level, component, content, event id, and event template. For the experiments in this study, we use the component and event features of the log files for anomaly detection; a small selection sketch follows below. Both features are variables in string format. The component feature of HDFS has two parts, i.e., the data blocks and the nodes used for storing those data blocks, whereas the event variable refers to the action that took place.
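The selection sketch below illustrates this step, assuming the parsed logs are available as a structured CSV whose column names follow the fields listed above; the file name and column names are hypothetical.

```python
# Sketch of the feature-selection step: keep only the component and event
# columns of a parsed (structured) log file. The file name and column names
# are hypothetical and assumed to match the parser's output.
import pandas as pd

structured = pd.read_csv("HDFS_structured.csv")        # hypothetical path
features = structured[["Component", "EventId"]]

# Both features are strings, so they are encoded (e.g., with the Gensim
# embedding from Section 3.1) before being passed to the detectors.
print(features.head())
```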
Threshold Selection for RRCF: The collusive displacement of a point in a robust random cut forest determines its anomaly score. A specific threshold must be set on this score to discriminate between a regular log and an aberrant log. The most effective threshold setting was determined by experimenting with various candidates: maximum deviation, three-sigma, median absolute deviation, and 1.5 * median absolute deviation. The three-sigma threshold is defined as the mean plus three times the standard deviation; any score that exceeds this criterion is almost certainly an anomaly. The median absolute deviation is the median of the absolute deviations from the median. It is a dispersion metric comparable to the standard deviation, but more resistant to outliers [2].
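The sketch below shows how these candidate thresholds could be derived from a set of CoDisp scores (such as those produced in the RRCF sketch of Section 3.2). Only the three-sigma rule is fully specified above; treating the MAD-based settings as the median plus the stated multiple of the MAD is an assumption, and the maximum-deviation rule is omitted because it is not defined here.

```python
# Sketch of candidate threshold settings applied to CoDisp anomaly scores.
# Treating the MAD-based thresholds as median + k * MAD is an assumption;
# the "maximum deviation" setting is not reproduced here.
import numpy as np
from scipy.stats import median_abs_deviation

scores = np.asarray(codisp_scores)  # CoDisp scores, e.g., from the RRCF sketch

mad = median_abs_deviation(scores)
thresholds = {
    "three-sigma": scores.mean() + 3 * scores.std(),
    "MAD": np.median(scores) + mad,           # assumed: median + 1.0 * MAD
    "1.5 * MAD": np.median(scores) + 1.5 * mad,
}

for name, threshold in thresholds.items():
    flagged = scores > threshold              # logs above the threshold are aberrant
    print(f"{name}: threshold = {threshold:.3f}, flagged = {int(flagged.sum())}")
```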

Fig. 4. Threshold Setting for both HDFS and OpenStack Dataset

4.2 Results
We report our experimental results corresponding to the research questions stated at the beginning of this section.
RQ1: What is the most effective robust random cut forest threshold to use?
Selecting a suitable threshold setting is an essential task in detecting anomalous logs. We investigate the impact of using maximum deviation, three-sigma, median absolute deviation, and 1.5 * median absolute deviation on both the HDFS and OpenStack datasets. As shown in Fig. 4, setting the threshold to the maximum deviation yielded the best result on both datasets and performed better than the three-sigma setting. With the maximum-deviation threshold, the OpenStack dataset gives a precision, recall, and f-measure of 0.73, 0.99, and 0.84, respectively, while the HDFS dataset gives 0.97, 0.99, and 0.98, respectively, which is higher than for OpenStack. This signifies that the threshold setting affects the performance of RRCF in detecting anomalies in log files.
RQ2: How effective are RRCF, IF, LOF, OCSVM, EE, LogCluster, and OPTICS in detecting anomalous logs?
Here, we investigate how well RRCF performs compared with the other unsupervised learning algorithms. For RRCF, we selected the maximum deviation as the threshold, since the previous section determined that it produces the best results. As depicted in Fig. 5, RRCF achieved the best recall and f-measure scores (recall of 98.10% and 99.89% on OpenStack and HDFS, respectively) compared with the other algorithms on both datasets. However, on the OpenStack dataset the EE algorithm provided the best precision, whereas EE and IF both gave the best precision results on the HDFS dataset.
We further compared our method with the log clustering approach implemented by He et al. [15]. As depicted in Table 2, our proposed approach using RRCF achieved better precision, recall, and f-measure values than LogCluster [15].


Fig. 5. Performance Metric for both HDFS & OpenStack Dataset.

Table 2. Results on both datasets

                                        OpenStack                                    HDFS
Algorithm                   Precision (%)  Recall (%)  F-measure (%)    Precision (%)  Recall (%)  F-measure (%)
Robust random cut forest    72.82          98.10       83.59            97.10          99.89       98.47
Isolation Forest            71.97          79.09       75.36            97.53          80.37       88.12
Local Outlier Factor        76.76          63.53       69.52            97.07          87.73       92.16
One-Class SVM               67.69          46.48       55.11            97.51          50.22       66.30
Elliptic Envelope           73.38          90.72       81.14            97.53          90.42       93.84
LogCluster                  -              -           -                87.00          74.00       79.00
OPTICS                      -              -           -                71.00          100.00      83.00

Additionally, the OPTICS algorithm used by Zeufack et al. [29] is compared with RRCF. From Table 2, it can be seen that RRCF achieved better precision and f-measure scores (97.10% and 98.47%) than OPTICS (71% and 83%).

5 CONCLUSION
In today’s large-scale distributed systems, logs are frequently used to discover abnormalities.
However, traditional anomaly identification, which depends mainly on manual log examination,
becomes infeasible due to the dramatic rise in log size. Automated log analysis and anomaly
detection technologies have been extensively researched in recent years to decrease manual labor.
However, developers are often still unaware of cutting-edge anomaly detection methods, and they frequently have to re-design an anomaly detection method on their own owing to the lack of a complete study and comparison of current approaches. This study addresses that need by offering a comprehensive assessment and evaluation of several cutting-edge anomaly detection methods. After determining the best threshold value for detecting anomalies, this work finds that RRCF performs better than the other state-of-the-art techniques. It has also been observed that RRCF with the maximum deviation as the threshold setting outperforms the IF, OCSVM, EE, LOF, and LogCluster algorithms.


However, because each log file has a distinct format, in the future, we propose to test the RRCF
model against additional log files from IoT devices. Because IoT devices are vulnerable to breaches,
spotting anomalies before they cause harm to owners of IoT devices is essential.

REFERENCES
[1] [n.d.]. Cost of downtime. https://www.atlassian.com/incident-management/kpis/cost-of-downtime
[2] [n.d.]. scipy.stats.median_abs_deviation. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_abs_deviation.html
[3] Jakub Breier and Jana Branišová. 2015. Anomaly detection from log files using data mining techniques. In Information
Science and Applications. Springer, 449–457.
[4] Jakub Breier and Jana Branišová. 2017. A dynamic rule creation based anomaly detection method for identifying
security breaches in log records. Wireless Personal Communications 94, 3 (2017), 497–511.
[5] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. 2000. LOF: identifying density-based local
outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data. 93–104.
[6] Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly Detection: A Survey. ACM Comput. Surv. 41, 3,
Article 15 (July 2009), 58 pages. https://doi.org/10.1145/1541880.1541882
[7] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. Deeplog: Anomaly detection and diagnosis from system
logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications
Security. 1285–1298.
[8] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. DeepLog: Anomaly Detection and Diagnosis from System
Logs through Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications
Security (Dallas, Texas, USA) (CCS ’17). Association for Computing Machinery, New York, NY, USA, 1285–1298.
https://doi.org/10.1145/3133956.3134015
[9] Amir Farzad and T Aaron Gulliver. 2020. Unsupervised log message anomaly detection. ICT Express 6, 3 (2020),
229–237.
[10] Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. 2016. Robust Random Cut Forest Based Anomaly
Detection on Streams. In Proceedings of The 33rd International Conference on Machine Learning (Proceedings of Machine
Learning Research, Vol. 48), Maria Florina Balcan and Kilian Q. Weinberger (Eds.). PMLR, New York, New York, USA,
2712–2721. http://proceedings.mlr.press/v48/guha16.html
[11] Haixuan Guo, Shuhan Yuan, and Xintao Wu. 2021. LogBERT: Log Anomaly Detection via BERT. arXiv preprint
arXiv:2103.04475 (2021).
[12] Shangbin Han, Qianhong Wu, Han Zhang, Bo Qin, Jiankun Hu, Xingang Shi, Linfeng Liu, and Xia Yin. 2021. Log-Based
Anomaly Detection With Robust Feature Extraction and Online Learning. IEEE Transactions on Information Forensics
and Security 16 (2021), 2300–2311. https://doi.org/10.1109/TIFS.2021.3053371
[13] Shilin He, Pinjia He, Zhuangbin Chen, Tianyi Yang, Yuxin Su, and Michael R Lyu. 2020. A Survey on Automated Log
Analysis for Reliability Engineering. arXiv preprint arXiv:2009.07237 (2020).
[14] Shilin He, Jieming Zhu, Pinjia He, and Michael R Lyu. 2016. Experience report: System log analysis for anomaly
detection. In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 207–218.
[15] Shilin He, Jieming Zhu, Pinjia He, and Michael R. Lyu. 2020. Loghub: A Large Collection of System Log Datasets
towards Automated Log Analytics. CoRR abs/2008.06448 (2020). arXiv:2008.06448 https://arxiv.org/abs/2008.06448
[16] Katherine Heller, Krysta Svore, Angelos D Keromytis, and Salvatore Stolfo. 2003. One class support vector machines
for detecting anomalous windows registry accesses. (2003).
[17] João Henriques, Filipe Caldeira, Tiago Cruz, and Paulo Simões. 2020. Combining k-means and xgboost models for
anomaly detection using log datasets. Electronics 9, 7 (2020), 1164.
[18] Nathaniel Kremer-Herman and Douglas Thain. 2020. Log Discovery for Troubleshooting Open Distributed Systems
with TLQ. In Practice and Experience in Advanced Research Computing. 224–231.
[19] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In 2008 eighth ieee international conference on
data mining. IEEE, 413–422.
[20] Siyang Lu, Xiang Wei, Yandong Li, and Liqiang Wang. 2018. Detecting anomaly in big data system logs using
convolutional neural network. In 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl
Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science
and Technology Congress (DASC/PiCom/DataCom/CyberSciTech). IEEE, 151–158.
[21] Nour Moustafa, Jiankun Hu, and Jill Slay. 2019. A holistic review of network anomaly detection systems: A compre-
hensive survey. Journal of Network and Computer Applications 128 (2019), 33–55.
[22] Peter J Rousseeuw and Katrien Van Driessen. 1999. A fast algorithm for the minimum covariance determinant estimator.
Technometrics 41, 3 (1999), 212–223.


[23] Tabea Schmidt, Florian Hauer, and Alexander Pretschner. 2020. Automated Anomaly Detection in CPS Log Files. In
International Conference on Computer Safety, Reliability, and Security. Springer, 179–194.
[24] Risto Vaarandi, Bernhards Blumbergs, and Markus Kont. 2018. An unsupervised framework for detecting anomalous
messages from syslog log files. In NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symposium. IEEE,
1–6.
[25] Bingming Wang, Ying Shi, and Zhe Yang. 2020. A Log-Based Anomaly Detection Method with Efficient Neighbor
Searching and Automatic K Neighbor Selection. Sci. Program. 2020 (2020), 4365356:1–4365356:17.
[26] Xiaojuan Wang, Defu Wang, Yong Zhang, Lei Jin, and Mei Song. 2019. Unsupervised learning for log data analysis
based on behavior and attribute features. In Proceedings of the 2019 International Conference on Artificial Intelligence
and Computer Science. 510–518.
[27] Rakesh Bahadur Yadav, P Santosh Kumar, and Sunita Vikrant Dhavale. 2020. A Survey on Log Anomaly Detection
using Deep Learning. In 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends
and Future Directions)(ICRITO). IEEE, 1215–1220.
[28] Kun Yin, Meng Yan, Ling Xu, Zhou Xu, Zhao Li, Dan Yang, and Xiaohong Zhang. 2020. Improving Log-Based Anomaly
Detection with Component-Aware Analysis. In 2020 IEEE International Conference on Software Maintenance and
Evolution (ICSME). 667–671. https://doi.org/10.1109/ICSME46990.2020.00069
[29] Vannel Zeufack, Donghyun Kim, Daehee Seo, and Ahyoung Lee. 2021. An unsupervised anomaly detection framework
for detecting anomalies in real time through network system’s log files analysis. High-Confidence Computing 1, 2
(2021), 100030. https://doi.org/10.1016/j.hcc.2021.100030
[30] Vannel Zeufack, Donghyun Kim, Daehee Seo, and Ahyoung Lee. 2021. An unsupervised anomaly detection framework
for detecting anomalies in real time through network system’s log files analysis. High-Confidence Computing 1, 2
(2021), 100030.
[31] Xu Zhang, Yong Xu, Qingwei Lin, Bo Qiao, Hongyu Zhang, Yingnong Dang, Chunyu Xie, Xinsheng Yang, Qian Cheng,
Ze Li, et al. 2019. Robust log-based anomaly detection on unstable log data. In Proceedings of the 2019 27th ACM Joint
Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
807–817.
[32] Tim Zwietasch. 2014. Detecting anomalies in system log files using machine learning techniques. B.S. thesis.
