
Anomaly Detection from Web Log Data Using Machine Learning Model


2023 7th International Conference on Computer Applications in Electrical Engineering-Recent Advances (CERA) | 979-8-3503-0500-5/23/$31.00 ©2023 IEEE | DOI: 10.1109/CERA59325.2023.10455153

Amit Kumar Mishra
Dept. of Computer Science & Engineering, Graphic Era Hill University, Dehradun, India
[email protected]

Piyush Bagla
Department of Computer Science & Engineering, National Institute of Technology, Jalandhar, India
[email protected]

Ravi Sharma
Department of Computer Science & Engineering, National Institute of Technology, Jalandhar, India
[email protected]

Neeraj Kumar Pandey
Department of Computer Science & Engineering, Graphic Era (Deemed to be University), Dehradun, India
[email protected]

Neha Tripathi
Department of Computer Science & Engineering, Graphic Era (Deemed to be University), Dehradun, India
[email protected]

Abstract: The information in the logs produced by servers, devices, and applications can be utilized to assess the system's health. Logs often have to be reviewed manually, for instance during upgrades, to verify whether the update and data movement went smoothly. Manual testing is insufficiently trustworthy, and manual log examination takes much time and effort. In this paper, we propose to search log files for anomalous sequences using the machine learning methods K-Means and DBSCAN. The two data representation approaches examined in this study were feature vector representation and IDF representation. The effectiveness of the deployed machine learning algorithms was examined using evaluation measures like F1 score, recall, and precision. The study found considerable differences in the algorithms' capacities to spot anomalies, with some algorithms being better at detecting the distinct types of anomalous sequences than at detecting their total number. By using the study's findings, the user might be able to spot strange arrangements after manually sifting through the log file.

Keywords: Clustering, Anomaly detection, Web logs, Machine Learning.

I. INTRODUCTION

The complexity and size of modern IT systems make it difficult to debug and identify system failures. In some circumstances, the only approach to pinpoint the underlying cause of a system failure is to analyse the log files. Enormous systems generate a huge amount of data; as the volume of data increases, it becomes increasingly difficult and time-consuming to detect errors and flaws manually. A system bug, human error, or device fault are all potential causes of an unhealthy system. Anomaly detection is a popular technique for investigating a system breakdown. Finding anomaly data can be crucial in identifying system faults or the system's state [1]. Anomaly data may also point to anything intriguing occurring in the system.

Log analysis is a crucial step in assessing the system's health and identifying the root of issues. The logs produced by servers, equipment, and applications are necessary to record the system information. Upgrades and data migration can fail, and it is standard procedure to analyse the logs manually, an unpleasant and time-consuming operation for the end user. Manually inspecting the log files frequently necessitates expert knowledge of the system, which is not always feasible. Different developers produce various services, and these services are subject to change over time. The log data can be manually analysed using a few standard methods. These methods' precision and potency, however, are constrained. Searching for log entries using keywords is a standard technique for analysing huge log files [2].

Outliers, which differ from the majority of the data and may indicate anything wrong or abnormal with the system, are also known as anomaly data. By identifying anomalous data in the log files, we examine in this paper the potential uses of machine learning for analysing the log files at Mobilaris. Different machine learning techniques are tested and assessed because no single machine learning approach is known to be the most effective for recognising unusual sequences in Mobilaris log files. This study also investigates various data representation techniques that are essential in locating unusual data structures. Mobilaris can provide real-world data for the log analysis in this project. We can evaluate the data in available log files to locate aberrant log arrangements, so that the system administrator can detect problems and system breakdowns without manually going through the available log files.

Fig. 1. Log Analyzer

A web log analyzer is a tool or software that processes and analyzes web server log files to extract valuable information and generate insights about website usage, visitor behavior, and performance. It helps understand how users interact with a website, identify trends, detect

Authorized licensed use limited to: National Institute of Technology. Downloaded on July 29,2024 at 10:02:25 UTC from IEEE Xplore. Restrictions apply.
anomalies, and optimize various aspects of web operations (Figure 1). To analyze log files and detect anomalous logs, we follow some commonly used steps:

Data Preprocessing: Begin by preprocessing the log files. This could entail preparing the data for analysis, dealing with missing values, and structuring the logs. Depending on the format of the log files, you may need to extract pertinent elements from them, such as timestamps, IP addresses, request types, or response codes.

Feature Extraction: Identify key features that will be utilized to distinguish anomalies in the log data. This can involve extracting statistics, patterns, or unique characteristics from the log entries. For example, you might extract information about the frequency of certain events, time intervals between events, or unusual combinations of events.

Define a Baseline: Establish a baseline or normal behavior profile for the log data. This baseline represents the expected patterns and characteristics of the logs under normal operating conditions. You can compute statistical measures (e.g., mean, standard deviation) or use historical data as a reference to define the baseline.

Anomaly Detection Techniques: Apply appropriate anomaly detection techniques to identify deviations from the established baseline. There are various methods you can use, such as:

• Statistical Methods: Statistical techniques like z-score, percentile-based methods, or clustering can help identify data points that significantly deviate from the expected behaviour.

• Machine Learning Approaches: Supervised or unsupervised machine learning algorithms like clustering, random forest, or neural networks can be trained on labelled or unlabelled data to detect anomalies in the log files.

• Time-Series Analysis: If the log files contain temporal information, time-series analysis techniques like ARIMA (Autoregressive Integrated Moving Average) or LSTM (Long Short-Term Memory) networks can be applied to capture patterns and detect anomalies.

• Rule-based Methods: Rule-based approaches comprise rules or thresholds based on expert knowledge or domain-specific requirements. Any log entry that violates these rules is flagged as an anomaly.

Alert Generation and Response: Generate an alert to inform the relevant parties when abnormal behaviour is identified. Establish appropriate response procedures to investigate and address potential security threats or system irregularities.

Continuous Monitoring and Feedback: Implement continuous monitoring of the log files and regularly update and refine the anomaly detection models. Monitor the performance of the detection techniques in real-world scenarios, adapt to evolving patterns, and incorporate feedback to improve the effectiveness of the anomaly detection process.

The major contributions of this research are:

• To prepare the data from the received file by parsing it.

• To give the machine learning algorithms a representation of the log entries as numerical values.

• To use machine learning algorithms to search the Mobilaris log file for abnormal log sequences.

The remainder of the paper is organized as follows: Section II reviews the literature on the relevant machine learning algorithms. Section III defines the proposed methodology for finding anomalies in the log files. Section IV presents the results and their analysis. Finally, the article concludes the results and explores the future scope.

II. LITERATURE REVIEW

Textual data is occasionally seen as meaningful information and can be crucial to machine learning algorithms [3]. Several natural language processing techniques can be used for logs that contain textual data. One of the more popular natural language processing methods is term frequency-inverse document frequency (TF-IDF) [3], which represents textual data by term frequency and is a popular technique for representing data in anomaly detection [3]. Word2vec, a method for natural language processing, has proven to be highly successful and efficient at representing textual data in small dimensions in paper [4].

Many studies also employ log vectorization approaches to describe the log data [5][6]. Using log abstraction techniques that rely on the constant portion of the log messages, log messages are converted into log events and log sequences are transformed into vectors. The code's log print statement generates generic log messages that act as placeholders for log events. The sequences in paper [7] contain many log events, and the log events are weighted in two different ways: adjusted IDF-based weighting and contrast-based weighting. According to the study, the significance of each log event varies according to how frequently it appears in various log classifications. The weighting technique was suggested as a consequence.

The journal [8] also discussed the tree structure diagram, also called the decision tree algorithm. The SVM, another classification-based technique, was also examined in the paper. Each of the three supervised algorithms combined the labels and training examples to create an event count vector. The decision tree's selections proved to be more reasonable to the developer than those of the other two algorithms when discovering abnormalities in the data. Although all of the algorithms were effective at finding anomalies, SVM had the highest overall accuracy. The LogCluster algorithm was chosen by the publication [9]. LogCluster is a data clustering algorithm that finds patterns in log files. Its performance was measured over 92 days: 1,879,209 of the 296,699,550 log messages processed by the implementation were identified as anomalies. The authors also found that not all detected anomalies were genuinely anomalous log messages; some were standard log messages. The paper [10] used the DBSCAN algorithm, a density-based clustering approach. It was suggested that DBSCAN be used to spot anomalies in monthly temperature data. According to

the report, it offers several advantages over statistical strategies for spotting irregularities.

III. PROPOSED METHODOLOGY

K-means and DBSCAN, two distinct unsupervised machine learning methods, are employed in this study to find aberrant sequences in Mobilaris log files. Unsupervised learning is preferred because it can derive knowledge from data structures without labels and because log sequences can reveal hidden issues [11]. The process of finding anomalies in a data file is broken down into various parts, as shown in Figure 2. With the assistance of the Mobilaris staff, anomalies in the log file were labelled.

We obtained the log file from Tag Vibration Service (TVS), one of Mobilaris' services. TVS counts the intervals between blinks to detect the validity of a tag [12]. The constant and variable data in log entries from the file are represented by textual or numerical data. It is crucial to gather and prepare the relevant data before utilising the machine learning algorithms. Finding patterns in data requires a fundamental step called data representation [13]. The distance between sets of data that are comparable to one another should be kept to a minimum, while the distance between sets of data that are not should be kept at a maximum. After processing and applying the log abstraction technique, a new log file with log ids representing the log entries is produced [14]. The feature vector encoding allows the textual log entries in a log sequence to be represented as numbers. The vector has as many dimensions as there are distinct log entries in the whole log file, and every log entry type has a designated position. The feature vector representation counts how often each log item appears in the sliding window [15].

Fig. 2. Proposed Methodology


IV. RESULTS AND DISCUSSION

The data file that Mobilaris sent contains 41294 log entries of 29 different types. Among these 41294 entries there are 1176 anomalous sequences of varied sizes. Precision, recall, and F1-score are used as the three performance indicators in this paper, and they are applied in two different evaluations: the first counts the total number of anomalous log sequences found, and the second counts the variety of anomalous log sequences found. A total of 174 different types of anomalous sequences, each occurring at a different frequency, make up the 1176 anomalous sequences in the TVS dataset.

A. K-Means with Feature Vector Approach

This subsection presents the results of the k-means approach with the feature vector representation. Results for all anomalous log sequences are shown in Table I, and results for only the unique ones are shown in Table II. The three threshold percentiles were 99, 98, and 97.5, and the metrics recall, F1, and precision are displayed. Figures 3, 4, and 5 show the K-means plots for the applied thresholds. PCA was used to reduce the dimensions from 29 to 2. Each plot contains three clusters, and the blue stars indicate the centre of each cluster. The anomalies identified by the algorithm are indicated by red circles surrounding the data points.
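
A minimal sketch of k-means with a percentile threshold is given below. It assumes the threshold is applied to each point's distance from its own cluster centre, which the text does not spell out. The plain Lloyd's-algorithm `kmeans` helper, `flag_anomalies`, and the synthetic two-dimensional data are illustrative assumptions, not the authors' code or data; k = 3 mirrors the three clusters in the plots.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

def flag_anomalies(X, centroids, labels, percentile):
    """Flag points whose distance to their own centroid exceeds the given percentile."""
    own_dist = np.linalg.norm(X - centroids[labels], axis=1)
    threshold = np.percentile(own_dist, percentile)
    return own_dist > threshold

# Synthetic stand-in data: three dense groups plus scattered points (200 in total).
rng = np.random.default_rng(42)
blobs = [rng.normal(loc=c, scale=0.5, size=(60, 2)) for c in ((0, 0), (5, 5), (0, 5))]
scattered = rng.uniform(low=-4, high=9, size=(20, 2))
X = np.vstack(blobs + [scattered])

centroids, labels = kmeans(X, k=3)
flags = {p: flag_anomalies(X, centroids, labels, p) for p in (99, 98, 97.5)}
```

Lowering the percentile can only add flagged points, never remove them, which is consistent with recall rising as the threshold drops from 99 to 97.5 in the tables.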

TABLE I. TIME TAKEN FOR DATA ENCRYPTION AND DECRYPTION

Metrics      K-means, threshold    K-means, threshold    K-means, threshold
             percentile 99         percentile 98         percentile 97.5
Recall       0.34                  0.61                  0.87
F1-score     0.51                  0.75                  0.93
Precision    1                     1                     1
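
The relation between the three metrics can be checked with a short sketch, assuming the standard definition of F1 as the harmonic mean of precision and recall (the paper does not state its formula). The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Recall values reported for the three threshold percentiles, with precision 1.
reported = {0.34: 0.51, 0.61: 0.75, 0.87: 0.93}
computed = {recall: f1_from(1.0, recall) for recall in reported}
```

With precision fixed at 1 (no false positives), F1 is determined entirely by recall, and the computed values agree with the reported F1 scores to within about 0.01.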

Fig. 4. Total evaluation time

The suggested method achieves a memory storage quantity of 13,598,247.75 bits by altering the number of bullets. The optimization strategies' combined execution time is 21,008 milliseconds. Figure 3 shows how the system performs using the suggested technique when the number of repeats is changed. Figure 4 represents the fitness value of the suggested approach. The message with the highest fitness value in the MPSO had the lowest mistake frequency. As the number of observations rises in this case, the efficiency score falls. Table II displays the proposed MANNs-based back propagation technique's thorough classification validity. The proposed MANN provides 91.25 percent accuracy in this case.

TABLE II. PERFORMANCE PARAMETERS OF THE MODEL

Metrics      K-means, threshold    K-means, threshold    K-means, threshold
             percentile 99         percentile 98         percentile 97.5
Recall       0.78                  0.79                  0.83
F1-score     0.88                  0.88                  0.91
Precision    1                     1                     1

Fig. 5. Model Threshold Values

B. K-Means with IDF Representation

This subsection presents the results of the K-means approach with the IDF data representation. Results for all anomalous log sequences are shown in Table III, while Table IV shows only the unique ones. The three threshold percentiles were 99, 98, and 97.5, and the metrics recall, F1, and precision are displayed. Figures 6, 7, and 8 show the plots that the K-means algorithm produces when the specified threshold is used. The dimensions were also reduced from 29 to 2 using PCA. Every plot consists of three clusters, and the centroid of every cluster is marked by a blue star. The anomalies predicted by the algorithm are indicated by red circles surrounding the data points.

Fig. 3. Frequency of distinct keys

Table III: Outcome from all log arrangements, IDF Representation and K-means.

Metrics      K-means, threshold    K-means, threshold    K-means, threshold
             percentile 99         percentile 98         percentile 97.5
Recall       0.31                  0.31                  0.87
F1-score     0.48                  0.48                  0.93
Precision    1                     1                     1

Fig. 8: Threshold at k=3, 97.5% accuracy with K-Means

Table IV: Result from unique log sequences, K-means, IDF Representation

Metrics      K-means, threshold    K-means, threshold    K-means, threshold
             percentile 99         percentile 98         percentile 97.5
Recall       0.87                  0.87                  0.91
F1-score     0.93                  0.93                  0.95
Precision    1                     1                     1

C. DBSCAN with IDF

This subsection presents the results of the DBSCAN technique with the IDF representation. Results for all anomalous log sequences are shown in Table V, whereas results for only the unique ones are shown in Table VI. Recall, F1, and precision scores are displayed. The output from the DBSCAN technique is shown in Figure 9. The dimensions were also reduced from 29 to 2 using PCA. DBSCAN produced six clusters. The anomalous sequences predicted by DBSCAN are represented by the purple data points.
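
A minimal sketch of density-based anomaly flagging in the spirit of DBSCAN follows. It assumes that points labelled as noise (not reachable from any dense region) are the ones treated as anomalous, which the paper does not state explicitly. The compact `dbscan` function, the `eps` and `min_samples` values, and the toy points are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def dbscan(X, eps, min_samples):
    """Minimal DBSCAN; returns cluster labels where -1 marks noise points."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Each neighbourhood includes the point itself (distance 0).
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue  # not a core point; stays noise unless a cluster reaches it
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border or core point joins this cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
        cluster += 1
    return labels

# Toy data: two tight groups and two isolated points.
group = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X = np.vstack([group, group + 5.0, [[2.5, 10.0], [-4.0, 3.0]]])

labels = dbscan(X, eps=0.3, min_samples=3)
anomalies = labels == -1
```

Unlike k-means, this approach needs no preset number of clusters and no percentile threshold; the number of clusters and the set of noise points both follow from `eps` and `min_samples`.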
Table V: Performance parameters with DBSCAN

Metrics DBSCAN
Recall 0.35
F1- score 0.51
Precision 1

Table VI: Log sequences for IDF and DBSCAN

Metrics DBSCAN
Recall 0.78
F1- score 0.87
Precision 1

Fig. 6. Fitness Value of Proposed Model

Fig. 7: Threshold at k=3, 98% accuracy with KMeans


Fig. 9: DBSCAN Accuracy
When the threshold percentile was 99, the k-means model with the feature vector representation performed poorly at detecting all anomalous sequences. Recall and F1 scores increased when the threshold was lowered to 97.5, as seen in Table I: recall increased from 0.34 to 0.87 and F1 increased from 0.51 to 0.93. For distinct anomalous sequences, the k-means technique with feature vector representation performed well at all percentiles (F1 ranged from 0.88 to 0.91 and recall from 0.78 to 0.83), and both scores peaked when the threshold percentile was 97.5.

V. CONCLUSION

We looked into the potential of machine learning for anomaly detection when analysing the log files at Mobilaris. As a result, we discovered that both K-means and DBSCAN were effective in identifying distinctive log anomalies. The only algorithms that successfully discovered the total number of anomalous log sequences were K-means techniques with low percentile thresholds. According to our investigation, the K-means strategy using IDF representation and the one using feature vector representation were equally good at finding the total number of aberrant log sequences with a threshold percentile of 97.5. We also considered how the data representation would affect the outcomes. K-means with IDF representation outperformed the other techniques at identifying distinct log sequences when the threshold percentile was 97.5. Experimenting with different window widths could improve the research's findings even further.

REFERENCES

[1] Yin, Kun et al. "Improving Log-Based Anomaly Detection with Component Aware Analysis". In: 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). 2020, pp. 667–671. DOI: 10.1109/ICSME46990.2020.00069.
[2] Lin, Qingwei et al. "Log Clustering Based Problem Identification for Online Service Systems". In: 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C). 2016, pp. 102–111.
[3] Si, Yaqing, Zhou, Wendi, and Gai, Jiale. "Research and Implementation of Data Extraction Method Based on NLP". In: 2020 IEEE 14th International Conference on Anti-counterfeiting, Security, and Identification (ASID). 2020, pp. 11–15. DOI: 10.1109/ASID50160.2020.9271745.
[4] Wang, Mengying, Xu, Lele, and Guo, Lili. "Anomaly Detection of System Logs Based on Natural Language Processing and Deep Learning". In: 2018 4th International Conference on Frontiers of Signal Processing (ICFSP). 2018, pp. 140–144. DOI: 10.1109/ICFSP.2018.8552075.
[5] Xiao, Tong et al. "LPV: A Log Parser Based on Vectorization for Offline and Online Log Parsing". In: 2020 IEEE International Conference on Data Mining (ICDM). 2020, pp. 1346–1351. DOI: 10.1109/ICDM50108.2020.
[6] N. K. Pandey, A. K. Mishra, V. Kumar, A. Kumar, M. Diwakar and N. Tripathi, "Machine Learning based Food Demand Estimation for Restaurants," 2023 6th International Conference on Information Systems and Computer Networks (ISCON), Mathura, India, 2023, pp. 1-5, doi: 10.1109/ISCON57294.2023.10112059.
[7] B. Zhang, H. Zhang, V.-H. Le, P. Moscato, and A. Zhang, "Semi-supervised and unsupervised anomaly detection by mining numerical workflow relations from system logs," Automated Software Engineering, vol. 30, no. 1. Springer Science and Business Media LLC, Dec. 03, 2022. doi: 10.1007/s10515-022-00370-w.
[8] Vaarandi, Risto, Blumbergs, Bernhards, and Kont, Markus. "An unsupervised framework for detecting anomalous messages from syslog log files". In: NOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium. 2018, pp. 1–6. DOI: 10.1109/NOMS.2018.8406283.
[9] P. Jain, M. Shankar Bajpai, and R. Pamula, "A Modified DBSCAN Algorithm for Anomaly Detection in Time-series Data with Seasonality," The International Arab Journal of Information Technology, vol. 19, no. 1. Zarqa University, Jan. 01, 2022. doi: 10.34028/iajit/19/1/3.
[10] F. Gerz, T. R. Basturk, J. Kirchhoff, J. Denker, L. Al-Shrouf, and M. Jelali, "A Comparative Study and a New Industrial Platform for Decentralized Anomaly Detection Using Machine Learning Algorithms," 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, Jul. 18, 2022. doi: 10.1109/ijcnn55064.2022.9892939.
[11] N. K. Pandey, A. K. Mishra, N. Tripathi, P. Bagla and R. Sharma, "Implementation and Monitoring of Network Traffic Security using Machine Learning," 2023 2nd International Conference on Smart Technologies and Systems for Next Generation Computing (ICSTSN), Villupuram, India, 2023, pp. 1-5, doi: 10.1109/ICSTSN57873.2023.10151471.
[12] A. K. Mishra, N. Tripathi, A. Gupta, D. Upadhyay and N. K. Pandey, "Prediction and detection of nutrition deficiency using machine learning," 2023 International Conference on Device Intelligence, Computing and Communication Technologies (DICCT), Dehradun, India, 2023, pp. 1-5, doi: 10.1109/DICCT56244.2023.10110072.
[13] N. K. Pandey, K. Kumar, G. Saini, A. K. Mishra, "Security Issues and Challenges in Cloud of Things-Based Applications for Industrial Automation," Annals of Operations Research, 2023. https://doi.org/10.1007/s10479-023-05285-7.
[14] A. K. Mishra, M. Wazid, D. P. Singh, A. K. Das, S. Roy and S. Shetty, "ACKS-IA: An Access Control and Key Agreement Scheme for Securing Industry 4.0 Applications," in IEEE Transactions on Network Science and Engineering, doi: 10.1109/TNSE.2023.3296329.
[15] A. K. Mishra, M. Wazid, D. P. Singh, A. K. Das, J. Singh, and A. V. Vasilakos, "Secure Blockchain-Enabled Authentication Key Management Framework with Big Data Analytics for Drones in Networks Beyond 5G Applications," Drones, vol. 7, no. 8, p. 508, Aug. 2023, doi: https://doi.org/10.3390/drones7080508.
