Threshold based Technique to Detect Anomalies using Log Files
Toluwalope David Akande, Barjinder Kaur, Sajjad Dadkhah, and Ali A. Ghorbani
1 INTRODUCTION
Log files record the activity of the computers attached to a network. These files can be used for a number of purposes, including security breach analysis [4], troubleshooting [18], and anomaly detection [23]. In earlier days, this analysis was done manually, but as the volume of log files has grown with the number of connected systems, it has become difficult to process these files with traditional tools.
With the increased complexity of the data collected from heterogeneous sources and stored in
logs, detecting anomalous activity without prior knowledge is becoming a real challenge. Companies
are continuously migrating operations from centralized systems to distributed systems. Due to
the essential services such as payment solutions, DNS services, search engines these companies
provide, they can not afford down-times. Down-times such as service outages and deterioration of
quality of service lead to brand damage and revenue loss. In 2014, an Avaya report found that downtime losses average from $2,300 to $9,000 per minute, depending on factors like company size and industry vertical [1]. In August 2016, a five-hour downtime in an operation center caused
2,000 canceled flights and an estimated loss of 150 million dollars for Delta Airlines [1]. Hence,
building a reliable system has become essential.
Software logs are required for recording a system's run-time information, and this type of log file has been employed in a variety of reliability assurance tasks [13]. Log files are unstructured text that contains information about all behaviours and events during the execution of software or a process. Due to the volume and velocity of processes running in a distributed environment, manual investigation of system behaviour for malfunction detection is impractical, as the volume of data runs into gigabytes.
Logging may be classified into several categories. Security logging entails information collection
from systems related to security and helps to identify possible breaches, malicious programs,
information thefts and to assess the condition of security measures [3]. Access logs, which provide
information regarding user authentication, are also included in these logs. System failures and
malfunctions are revealed via operational logging [3]. Compliance logging, which is commonly
confused with security logging, gives information regarding compliance with security standards.
These logs are divided into two categories: recording of information system security in terms of data flow and storage, such as PCI DSS or HIPAA standard compliance, and tracking of system settings [3].
Log analysis involves the following four main steps:
(1) log collection.
(2) log parsing.
(3) feature extraction.
(4) anomaly detection.
A typical log analysis management architecture is made up of variable and constant components.
Traditionally, developers relied on regular expressions (regex) for log parsing and anomaly detection; a minimal parsing sketch is given after the list below. Log parsing is usually the first step of automated log analysis. However, this process is challenging for the following reasons.
• The large number of logs and, consequently, the considerable time spent on manual regex construction.
• The complexity of software and, consequently, the variety of different event templates.
• The frequency with which software upgrades are performed, and hence the frequency with
which logging statements are updated [13].
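To make the manual effort concrete, the following minimal Python sketch parses a single log line with a hand-written regex. The line format and field names are hypothetical, illustrative assumptions rather than any real system's template; every new event template or software upgrade would require revisiting such patterns.

import re

# Hand-written pattern for one hypothetical HDFS-style log format; each new
# event template would need another pattern like this, which is what makes
# manual regex maintenance costly.
LOG_PATTERN = re.compile(
    r"(?P<date>\d{6}) (?P<time>\d{6}) (?P<level>\w+) "
    r"(?P<component>[\w.$]+): (?P<message>.*)"
)

line = "081109 203615 INFO dfs.DataNode$PacketResponder: Received block blk_-1608 of size 91178"
match = LOG_PATTERN.match(line)
if match:
    event = match.groupdict()
    # 'message' holds the variable part; the rest forms the constant template.
    print(event["level"], event["component"], "->", event["message"])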
Anomaly detection, which tries to reveal anomalous system behaviours instantly, is critical in large-scale system incident management. In real time, it helps system developers identify and rectify issues quickly and decrease system downtime. In traditional systems, developers manually review system logs or create rules to identify abnormalities based on domain expertise, aided by keyword search (e.g., “failure”, “exception”) or regular expression matching. However, for large-scale systems, this anomaly detection approach, based primarily on manual analysis of logs, has proven unsatisfactory for the following reasons [14].
• Because of the large-scale and parallel nature of current systems, the overall system behaviour is too complicated for any single developer, each of whom is typically accountable for only a sub-component.
• Modern systems create a large number of logs every hour. Even with commands like search and grep, the sheer volume of such logs makes it notoriously difficult, if not impossible, to manually separate the vital information from the noise for anomaly detection.
As a result, automated methods for identifying irregularities in logs are necessary. Thus, our contributions in this study are as follows:
• Firstly, we propose a threshold-based technique for detecting anomalies in log files.
• Secondly, we focus on anomaly identification in log files using tree-based data mining techniques.
• Finally, the proposed approach is evaluated using RRCF, which is compared with six different
classifiers.
The rest of the paper is organized as follows. Section 2 presents previous work. Section 3 describes our proposed methodology, followed by the dataset description, machine learning methodologies, and results achieved in Section 4. Finally, the conclusion of the work and future research directions are presented in Section 5.
2 LITERATURE REVIEW
Log analysis is the initial and foremost step in discovering abnormalities in log files. In this section, we first discuss the different steps of a log analysis framework and then summarize previous work on anomaly detection approaches.
In [25], a log-based anomaly detection method with efficient neighbor searching is performed. Neighbors from the MVP tree are chosen and saved into a spare neighbor sample set, and a neighbor assessment technique based on the Silhouette Coefficient is used. The average distance between the sample to be detected and the real neighbors of each category is used to detect anomalies. However, this method requires tuning of hyper-parameters to obtain the best result.
Existing approaches primarily use past log files, which are then fed into a detection model to detect abnormalities. In [31], the authors proposed LogRobust, a novel log-based anomaly detection technique. The proposed technique extracts semantic information from log events and encodes it as semantic vectors. An attention-based Bi-LSTM model is then used to detect anomalies in log files, collecting contextual information from log sequences and automatically learning the significance of various log events. Using the same type of model, i.e., LSTM, Wang et al. worked on one month of log data to detect anomalies; their method achieved a precision of 0.96. The log entries were obtained from NetEngine40E series router devices installed in real network settings [26].
A deep neural network named DeepLog has been proposed to detect live log anomalies [8]. DeepLog understands and encodes the whole log message, including the timestamp, log key, and parameter values, and it detects anomalies at the log-entry level rather than at the session level. Precision, recall, and F-measure of 0.88, 1.00, and 0.93 are achieved using the proposed technique. The drawback of this approach is that the deviation of events from normal is unknown.
Yin et al. [28] developed LogC, a novel log-based anomaly detection technique with component-aware analysis. In this approach, logs are divided into log template sequences and component sequences, which are further used to train a combined LSTM model for identifying anomalous logs. With the proposed approach, accuracy, recall, and F-measure values of 93.53%, 98.29%, and 95.85%, respectively, are achieved.
The authors in [20] proposed two-level parsing with the log key as input to a convolutional neural network (CNN). Parsing is applied to the raw data to retrieve the log key and vector, respectively. After
comparing it with LSTM and Multilayer Perceptron (MLP) models, they found that CNN gives better accuracy with their approach.
Based on the density-based OPTICS algorithm, the authors of [29] proposed an unsupervised streaming anomaly detection mechanism consisting of a knowledge-base construction system and a streaming anomaly detection system for generating alerts in real time. Using the HDFS log file dataset, an F1-score of 83% was achieved for detecting anomalies.
An integrated, scalable framework for anomaly detection in distributed environments using large volumes of unlabeled log data has also been proposed. Experiments performed on NASA Hypertext Transfer Protocol (HTTP) log datasets are validated using a combined k-means and XGBoost system after extracting fourteen features [17].
The RRCF model handles streaming data and is less susceptible to outliers while detecting anomalies. Due to the voluminous number of logs generated, we selected Hadoop for processing the large data. Fig. 4 depicts the proposed framework used in detecting anomalies.
3.2 Methodologies
To detect anomalies in log files, we used several unsupervised anomaly detection models, i.e., RRCF, one-class support vector machine (OCSVM), Isolation Forest (IF), Local Outlier Factor (LOF), and Elliptic Envelope (EE). These models were selected for our proposed approach because of their ability to detect anomalies efficiently; minimal usage sketches follow the list below.
• Robust random cut forest (RRCF): An anomaly detection model that handles high-dimensional and streaming data well; we later evaluate RRCF against the other unsupervised learning methods. A robust random cut tree (RRCT) is created from a point set S by iteratively dividing it until each point is isolated in its own bounding box [10]. At each iteration of the tree construction procedure, a dimension is chosen at random with probability proportional to its range, i.e., the difference between its maximum and minimum values, and a cut value is then chosen uniformly at random between that minimum and maximum. If the partition separates a point x from the remainder of the point set, a new leaf node for x is generated and the point is deleted from the point set. The procedure is applied recursively to every subset. In RRCF, outliers are more likely to be found closer to the base of the tree. The collusive displacement (CoDisp) of a point is used to determine whether it is an outlier: if inserting a point substantially increases the total bit depth of the points in the tree, it is more likely to be an outlier. The algorithm for constructing an RRCT is formally specified in [10].
• One-Class SVM (OCSVM): An unsupervised intrusion detection method that iteratively separates the training data from the origin by finding the maximal-margin hyperplane (or linear decision boundary) [16]. The basic idea is to map the input data into a high-dimensional feature space using an appropriate kernel function. Given training data points without any class information, a hyperplane or linear function is constructed in the feature space, computed as given in Eq. (1) [16]:

$f(x) = w^{T} \phi(x) - \rho$    (1)

where $f(x)$ is the decision function in the feature space, $w$ is the normal vector perpendicular to the hyperplane, $\rho$ is the bias of the hyperplane, and $\phi(x)$ is the mapping function.
• Isolation Forest (IF): An ensemble approach that has been widely used for anomaly detection. The structure of IF is built by adding isolation trees, based on the concept of the extremely randomized tree, that partition attributes at random values between their lowest and highest. The partitioning continues until every sample is isolated, each isolation corresponding to a path from the root to a leaf. An instance located near the root signifies an anomaly, because abnormal samples usually
have a smaller average path length than normal samples and are thus easier to isolate [9, 19]. The anomaly score $sc$ is calculated as defined in Eq. (2):

$sc(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$    (2)

where $E(h(x))$ is the average of $h(x)$ over the collection of isolation trees. The instance $x$ is assigned as an outlier if the value of $sc$ is close to 1; otherwise it is considered normal. The value $c(n)$ is the average of $h(x)$ for a given $n$.
• Local Outlier Factor (LOF): An outlier detection algorithm that calculates the degree to which a data point is anomalous [5]. The degree is obtained by measuring the density in the neighborhood surrounding that point. There are two basic steps: firstly, the local density of a data point is measured by computing the inverse of its average reachability distance; secondly, from the approximated local densities, the algorithm computes the LOF value for each data point as defined in Eq. (3) [32]. A LOF value close to 1 indicates an inlier, whereas a value greater than 1 indicates an outlier.

$LOF_k(a) = \frac{\sum_{b \in N_k(a)} lrd(b)}{|N_k(a)|} \cdot \frac{1}{lrd(a)}$    (3)

where $LOF_k(a)$ is the average local reachability density of the neighbors of $a$ divided by the local reachability density of the point itself, $k$ is the neighborhood size, $N_k(a)$ is shorthand for $N_{k\text{-}distance}(a)$, and $lrd$ is the local reachability density.
• Elliptic Envelope (EE): A methodology, usable in both unsupervised and supervised settings, that helps in modeling high-dimensional data. Based on a density concept, an ellipse is fitted around the data points; values lying inside the ellipse are termed normal, and those lying outside the fitted density are labeled outliers [22].
• OPTICS: A density-based algorithm that forms groups of clusters without the number of clusters being defined; it is governed by MinPts, the minimum number of points. The key idea is to process the higher-density points first. It retains the clustering order using the core distance and the reachability distance [30].
• LogCluster: A data mining algorithm that is well suited to analyzing security logs. By creating data clusters, the algorithm helps detect patterns in textual log files. It works in two passes: firstly, passing over the log files to find frequent words, and secondly, splitting the log files into clusters [24].
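As a minimal sketch of how the four scikit-learn baselines above (IF, OCSVM, LOF, and EE) can be applied, assuming log sequences have already been converted into a numeric feature matrix; the data below is synthetic and the hyper-parameters are illustrative, not the exact settings of our experiments:

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))   # stand-in for log-derived feature vectors
X[:20] += 6.0                     # inject a few synthetic anomalies

detectors = {
    "IF": IsolationForest(contamination=0.02, random_state=0),
    "OCSVM": OneClassSVM(kernel="rbf", nu=0.02),
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=0.02),
    "EE": EllipticEnvelope(contamination=0.02, random_state=0),
}
for name, model in detectors.items():
    labels = model.fit_predict(X)  # scikit-learn convention: -1 = anomaly, +1 = normal
    print(name, "flagged:", int((labels == -1).sum()))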
Because points cannot be inserted into or removed from their trees once they have been created, the methods mentioned earlier are not suitable for use with streaming data. Furthermore, they are susceptible to "irrelevant dimensions," meaning that partitions are frequently wasted on dimensions that carry little information. Robust random cut forest aims to solve these problems.
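The following sketch shows RRCF scoring a stream point-by-point, assuming the open-source rrcf Python package (an independent implementation of [10], not the exact code used in our experiments); the window size, tree count, and cut-off are illustrative:

import numpy as np
import rrcf  # community implementation of [10]; an assumption, not our exact code

NUM_TREES, WINDOW = 40, 256
forest = [rrcf.RCTree() for _ in range(NUM_TREES)]

def codisp_score(index, point):
    # Insert the streaming point into every tree and average its collusive
    # displacement (CoDisp); old points are forgotten to keep a sliding window.
    scores = []
    for tree in forest:
        if len(tree.leaves) >= WINDOW:
            tree.forget_point(index - WINDOW)
        tree.insert_point(point, index=index)
        scores.append(tree.codisp(index))
    return float(np.mean(scores))

stream = np.random.default_rng(1).normal(size=(400, 4))  # synthetic feature stream
for i, x in enumerate(stream):
    score = codisp_score(i, x)
    if score > 30.0:  # illustrative cut-off; Section 4 studies threshold choices
        print(f"point {i} flagged (CoDisp = {score:.1f})")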
• Precision: the ratio of correctly detected samples to all samples detected as anomalous:
$Precision = TP/(TP + FP)$
• Recall: the ratio of correctly detected samples to all samples that should have been detected:
$Recall = TP/(TP + FN)$
• F1-measure: the harmonic mean of precision and recall:
$F1 = 2 \cdot TP/(2 \cdot TP + FN + FP)$
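These three measures follow directly from the confusion-matrix counts, as in the small helper below; the counts in the example call are illustrative only, not experimental results:

def prf(tp: int, fp: int, fn: int):
    # Precision, recall, and F1 from raw confusion-matrix counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fn + fp)
    return precision, recall, f1

print(prf(tp=97, fp=3, fn=1))  # illustrative counts only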
A threshold on the anomaly score is used to distinguish between a regular log and an aberrant log. The most efficient threshold setting was determined by experimenting with various threshold values: maximum deviation, three-sigma, median absolute deviation, and 1.5 * median absolute deviation. The three-sigma threshold is defined as the mean plus three times the standard deviation; any residual that exceeds the three-sigma criterion is almost certainly an anomaly. The median absolute deviation is the median of the absolute deviations from the median. It is a dispersion metric comparable to the standard deviation, but more resistant to outliers [2].
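The candidate thresholds can be computed over a vector of anomaly scores as sketched below, using the SciPy MAD routine cited in [2]. The exact form of the "maximum deviation" rule is not spelled out here, so the reading used in the sketch (largest absolute departure from the median) is an assumption:

import numpy as np
from scipy.stats import median_abs_deviation  # SciPy implementation cited in [2]

def candidate_thresholds(scores):
    med = float(np.median(scores))
    mad = float(median_abs_deviation(scores))
    return {
        "three_sigma": float(scores.mean() + 3.0 * scores.std()),
        "mad": med + mad,
        "1.5_mad": med + 1.5 * mad,
        # "maximum deviation" is our assumed reading, not defined in the text.
        "max_deviation": med + float(np.max(np.abs(scores - med))),
    }

scores = np.random.default_rng(2).gamma(2.0, size=500)  # synthetic CoDisp scores
print(candidate_thresholds(scores))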
4.2 Results
We report our experimental results corresponding to the research questions mentioned in subsection 4.1.
RQ1: What is the most effective threshold to use for the robust random cut forest?
Selecting a suitable threshold setting is an essential task in detecting anomalous logs. We investigate the impact of using the maximum deviation, three-sigma, median absolute deviation, and 1.5 * median absolute deviation on both the HDFS and OpenStack datasets. As shown in Fig. 4, setting the threshold to the maximum deviation yielded the best result on both datasets and performed better than the 3-sigma setting. Setting the threshold to the maximum deviation on the OpenStack dataset gives a precision, recall, and F-measure of 0.73, 0.99, and 0.84, respectively, while on the HDFS dataset the same setting results in a precision, recall, and F-measure of 0.97, 0.99, and 0.98, respectively, which is higher than on the OpenStack dataset. This signifies that the threshold impacts the performance of RRCF in detecting anomalies in log files.
RQ2: How effective are RRCF, IF, LOF, OCSVM, EE, LogCluster, and OPTICS in detecting anomalous logs?
Here, we investigated how well RRCF performs when compared to the other unsupervised learning algorithms. For RRCF, we selected the maximum deviation as the threshold, since the previous section determined that it produces the best results. As depicted in Fig. 5, RRCF achieved the best recall and F-measure scores (98.10% and 99.89% for OpenStack and HDFS, respectively) compared to the other algorithms on both datasets. However, on the OpenStack dataset the EE algorithm provided the best precision, whereas both EE and IF gave the best precision results on the HDFS dataset.
We further compared our method to the log clustering implemented by He et al. [15]. Our proposed RRCF approach achieved better precision, recall, and F-measure values than LogCluster [15], as depicted in Table 2.
Additionally, the OPTICS algorithm used by Zeufack et al. [29] is compared with RRCF. From Table 2 it can be seen that RRCF achieved better precision and F-measure scores (97.10% and 98.47%) than OPTICS (71% and 83%).
5 CONCLUSION
In today’s large-scale distributed systems, logs are frequently used to discover abnormalities.
However, traditional anomaly identification, which depends mainly on manual log examination,
becomes infeasible due to the dramatic rise in log size. Automated log analysis and anomaly
detection technologies have been extensively researched in recent years to decrease manual labor.
However, developers are still often unaware of cutting-edge anomaly detection methods, and they frequently have to design a new anomaly detection method on their own, owing to the lack of a complete study and comparison of current approaches. This study fills that gap by offering a comprehensive assessment and evaluation of five cutting-edge anomaly detection methods. This work determines the best threshold value for detecting anomalies and shows that RRCF performs better than other state-of-the-art techniques. It has also been observed that RRCF with the maximum deviation as its threshold setting outperforms the IF, OCSVM, EE, LOF, and LogCluster algorithms.
However, because each log file has a distinct format, in the future, we propose to test the RRCF
model against additional log files from IoT devices. Because IoT devices are vulnerable to breaches,
spotting anomalies before they cause harm to owners of IoT devices is essential.
REFERENCES
[1] [n.d.]. Cost of downtime. https://www.atlassian.com/incident-management/kpis/cost-of-downtime
[2] [n.d.]. mad. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_abs_deviation.html#scipy.stats.
median_abs_deviation
[3] Jakub Breier and Jana Branišová. 2015. Anomaly detection from log files using data mining techniques. In Information
Science and Applications. Springer, 449–457.
[4] Jakub Breier and Jana Branišová. 2017. A dynamic rule creation based anomaly detection method for identifying
security breaches in log records. Wireless Personal Communications 94, 3 (2017), 497–511.
[5] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. 2000. LOF: identifying density-based local
outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data. 93–104.
[6] Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly Detection: A Survey. ACM Comput. Surv. 41, 3,
Article 15 (July 2009), 58 pages. https://doi.org/10.1145/1541880.1541882
[7] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. Deeplog: Anomaly detection and diagnosis from system
logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications
Security. 1285–1298.
[8] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. DeepLog: Anomaly Detection and Diagnosis from System
Logs through Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications
Security (Dallas, Texas, USA) (CCS ’17). Association for Computing Machinery, New York, NY, USA, 1285–1298.
https://doi.org/10.1145/3133956.3134015
[9] Amir Farzad and T Aaron Gulliver. 2020. Unsupervised log message anomaly detection. ICT Express 6, 3 (2020),
229–237.
[10] Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. 2016. Robust Random Cut Forest Based Anomaly
Detection on Streams. In Proceedings of The 33rd International Conference on Machine Learning (Proceedings of Machine
Learning Research, Vol. 48), Maria Florina Balcan and Kilian Q. Weinberger (Eds.). PMLR, New York, New York, USA,
2712–2721. http://proceedings.mlr.press/v48/guha16.html
[11] Haixuan Guo, Shuhan Yuan, and Xintao Wu. 2021. LogBERT: Log Anomaly Detection via BERT. arXiv preprint
arXiv:2103.04475 (2021).
[12] Shangbin Han, Qianhong Wu, Han Zhang, Bo Qin, Jiankun Hu, Xingang Shi, Linfeng Liu, and Xia Yin. 2021. Log-Based
Anomaly Detection With Robust Feature Extraction and Online Learning. IEEE Transactions on Information Forensics
and Security 16 (2021), 2300–2311. https://doi.org/10.1109/TIFS.2021.3053371
[13] Shilin He, Pinjia He, Zhuangbin Chen, Tianyi Yang, Yuxin Su, and Michael R Lyu. 2020. A Survey on Automated Log
Analysis for Reliability Engineering. arXiv preprint arXiv:2009.07237 (2020).
[14] Shilin He, Jieming Zhu, Pinjia He, and Michael R Lyu. 2016. Experience report: System log analysis for anomaly
detection. In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 207–218.
[15] Shilin He, Jieming Zhu, Pinjia He, and Michael R. Lyu. 2020. Loghub: A Large Collection of System Log Datasets
towards Automated Log Analytics. CoRR abs/2008.06448 (2020). arXiv:2008.06448 https://arxiv.org/abs/2008.06448
[16] Katherine Heller, Krysta Svore, Angelos D Keromytis, and Salvatore Stolfo. 2003. One class support vector machines
for detecting anomalous windows registry accesses. (2003).
[17] João Henriques, Filipe Caldeira, Tiago Cruz, and Paulo Simões. 2020. Combining k-means and xgboost models for
anomaly detection using log datasets. Electronics 9, 7 (2020), 1164.
[18] Nathaniel Kremer-Herman and Douglas Thain. 2020. Log Discovery for Troubleshooting Open Distributed Systems
with TLQ. In Practice and Experience in Advanced Research Computing. 224–231.
[19] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In 2008 eighth ieee international conference on
data mining. IEEE, 413–422.
[20] Siyang Lu, Xiang Wei, Yandong Li, and Liqiang Wang. 2018. Detecting anomaly in big data system logs using
convolutional neural network. In 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl
Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science
and Technology Congress (DASC/PiCom/DataCom/CyberSciTech). IEEE, 151–158.
[21] Nour Moustafa, Jiankun Hu, and Jill Slay. 2019. A holistic review of network anomaly detection systems: A compre-
hensive survey. Journal of Network and Computer Applications 128 (2019), 33–55.
[22] Peter J Rousseeuw and Katrien Van Driessen. 1999. A fast algorithm for the minimum covariance determinant estimator.
Technometrics 41, 3 (1999), 212–223.
[23] Tabea Schmidt, Florian Hauer, and Alexander Pretschner. 2020. Automated Anomaly Detection in CPS Log Files. In
International Conference on Computer Safety, Reliability, and Security. Springer, 179–194.
[24] Risto Vaarandi, Bernhards Blumbergs, and Markus Kont. 2018. An unsupervised framework for detecting anomalous
messages from syslog log files. In NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symposium. IEEE,
1–6.
[25] Bingming Wang, Ying Shi, and Zhe Yang. 2020. A Log-Based Anomaly Detection Method with Efficient Neighbor
Searching and Automatic K Neighbor Selection. Sci. Program. 2020 (2020), 4365356:1–4365356:17.
[26] Xiaojuan Wang, Defu Wang, Yong Zhang, Lei Jin, and Mei Song. 2019. Unsupervised learning for log data analysis
based on behavior and attribute features. In Proceedings of the 2019 International Conference on Artificial Intelligence
and Computer Science. 510–518.
[27] Rakesh Bahadur Yadav, P Santosh Kumar, and Sunita Vikrant Dhavale. 2020. A Survey on Log Anomaly Detection
using Deep Learning. In 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends
and Future Directions)(ICRITO). IEEE, 1215–1220.
[28] Kun Yin, Meng Yan, Ling Xu, Zhou Xu, Zhao Li, Dan Yang, and Xiaohong Zhang. 2020. Improving Log-Based Anomaly
Detection with Component-Aware Analysis. In 2020 IEEE International Conference on Software Maintenance and
Evolution (ICSME). 667–671. https://doi.org/10.1109/ICSME46990.2020.00069
[29] Vannel Zeufack, Donghyun Kim, Daehee Seo, and Ahyoung Lee. 2021. An unsupervised anomaly detection framework
for detecting anomalies in real time through network system’s log files analysis. High-Confidence Computing 1, 2
(2021), 100030. https://doi.org/10.1016/j.hcc.2021.100030
[30] Vannel Zeufack, Donghyun Kim, Daehee Seo, and Ahyoung Lee. 2021. An unsupervised anomaly detection framework
for detecting anomalies in real time through network system’s log files analysis. High-Confidence Computing 1, 2
(2021), 100030.
[31] Xu Zhang, Yong Xu, Qingwei Lin, Bo Qiao, Hongyu Zhang, Yingnong Dang, Chunyu Xie, Xinsheng Yang, Qian Cheng,
Ze Li, et al. 2019. Robust log-based anomaly detection on unstable log data. In Proceedings of the 2019 27th ACM Joint
Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
807–817.
[32] Tim Zwietasch. 2014. Detecting anomalies in system log files using machine learning techniques. B.S. thesis.