Anomaly Detection From Web Log Data Using Machine Learning Model
Abstract— The information in the logs produced by servers, devices, and applications can be utilized to assess a system's health. It is crucial to review logs, for instance during upgrades, to verify whether the update and data migration went smoothly. Manual testing is insufficiently trustworthy, however, and manual log examination takes much time and effort. In this paper, we propose to search log files for anomalous sequences using the machine learning methods K-means and DBSCAN. The two data representation approaches examined in this study are feature vector representation and IDF representation. The effectiveness of the deployed machine learning algorithms was examined using evaluation measures such as F1 score, recall, and precision. The study found considerable differences in the algorithms' capacities to spot anomalies, with some algorithms better at detecting particular types of anomalous sequences than others. Using the study's findings, a user may be able to spot anomalous sequences without manually sifting through the log file.

Keywords— Clustering, Anomaly detection, Web logs, Machine Learning.
I. INTRODUCTION

The complexity and size of modern IT systems make it difficult to debug and identify system failures. In some circumstances, the only approach to pinpoint the underlying cause of a system failure is to analyse the log files. Enormous systems generate a huge amount of data; as the volume of data increases, it becomes increasingly difficult and time-consuming to detect errors and flaws manually. A system bug, human error, or a device fault are all potential causes of an unhealthy system. Anomaly detection is a popular technique for investigating a system breakdown. Finding anomaly data can be crucial in identifying system faults or the system's state [1]. Anomaly data may also point to something noteworthy occurring in the system.

Log analysis is a crucial step in assessing the system's health and identifying the root of issues. The logs written by servers, equipment, and applications are necessary to record the system information. Upgrades and data migration can fail, and it is standard procedure to analyse the logs manually, an unpleasant and time-consuming operation for the end user. Manually inspecting the log files frequently necessitates expert knowledge of the system, which is not always feasible. Different developers produce various services, and these services are subject to change over time. The log data can be manually analysed using a few standard methods, but their precision and potency are constrained. Searching for log entries using keywords is a standard technique for analysing huge log files [2].

Outliers, which differ from the majority of the data and may indicate something wrong or abnormal in the system, are also known as anomaly data. In this paper, we examine the potential uses of machine learning for analysing the log files at Mobilaris by identifying anomalous data in those files. Different machine learning techniques will be tested and assessed, because no single machine learning approach is known to be the most effective for recognising unusual sequences in Mobilaris log files. This study also investigates various data representation techniques that are essential in locating unusual data structures. Mobilaris provided real-world data for the log analysis in this project. We can evaluate the available log files to locate aberrant log sequences and to detect problems and system breakdowns without the system administrator manually going through the files.

Fig. 1. Log Analyzer

A web log analyzer is a tool or software that processes and analyzes web server log files to extract valuable information and generate insights about website usage, visitor behavior, and performance. It helps understand how users interact with a website, identify trends, detect
anomalies, and optimize various aspects of web operations (Figure 1). To analyze log files and detect anomalous logs, we follow some commonly used steps:

Data Preprocessing: Begin by preprocessing the log files. This could entail preparing the data for analysis, dealing with missing values, and structuring the logs. Depending on the format of the log files, you may need to extract pertinent elements from them, such as timestamps, IP addresses, request types, or response codes.
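For illustration, the sketch below parses one hypothetical Apache-style access log line into the fields mentioned above (timestamp, IP address, request type, response code). The log format, the regular expression, and the sample line are assumptions for demonstration; the actual Mobilaris log layout is not given in this paper.

```python
import re
from datetime import datetime

# Hypothetical Apache-style access log pattern; a real deployment would
# adapt this to the actual log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3})'
)

def parse_line(line):
    """Extract timestamp, IP, request type, and response code from one entry."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # malformed entry; could be kept aside for manual review
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    fields["status"] = int(fields["status"])
    return fields

line = '10.0.0.1 - - [29/Jul/2024:10:02:25 +0000] "GET /index.html HTTP/1.1" 200'
print(parse_line(line))
```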
Feature Extraction: Identify key features that will be utilized to distinguish anomalies in the log data. This can involve extracting statistics, patterns, or unique characteristics from the log entries. For example, you might extract information about the frequency of certain events, time intervals between events, or unusual combinations of events.
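Building on the parser above, a minimal sketch of one such feature is the per-window frequency of each event type. Using the response status code as the event type is our assumption for illustration; any categorical field would do.

```python
from collections import Counter

def window_features(entries, window_seconds=60):
    """Group parsed log entries into fixed time windows and count event types.

    `entries` is a list of dicts as produced by parse_line() above; the
    event type here is the response status code.
    """
    windows = {}
    for entry in entries:
        bucket = int(entry["timestamp"].timestamp()) // window_seconds
        windows.setdefault(bucket, Counter())[entry["status"]] += 1
    return windows
```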
Define a Baseline: Establish a baseline or normal behavior profile for the log data. This baseline represents the expected patterns and characteristics of the logs under normal operating conditions. You can compute statistical measures (e.g., mean, standard deviation) or use historical data as a reference to define the baseline.
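A minimal sketch of such a baseline, assuming per-window event totals (e.g., sums of the counters produced above) collected during a period known to be healthy:

```python
import statistics

def fit_baseline(counts):
    """Compute mean and standard deviation of per-window event counts
    from a reference period of normal operation."""
    mean = statistics.fmean(counts)
    std = statistics.stdev(counts) if len(counts) > 1 else 0.0
    return mean, std
```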
Anomaly Detection Techniques: Apply appropriate anomaly detection techniques to identify deviations from the established baseline. There are various methods you can use, such as:

• Statistical Methods: Statistical techniques like z-score, percentile-based methods, or clustering can help identify data points that significantly deviate from the expected behaviour (see the sketch after this list).

• Machine Learning Approaches: Supervised or unsupervised machine learning algorithms such as clustering, random forests, or neural networks can be trained on labelled or unlabelled data to detect anomalies in the log files.

• Time-Series Analysis: If the log files contain temporal information, time-series analysis techniques like ARIMA (Autoregressive Integrated Moving Average) or LSTM (Long Short-Term Memory) networks can be applied to capture patterns and detect anomalies.

• Rule-based Methods: Rule-based approaches comprise heuristics or thresholds based on expert knowledge or domain-specific requirements. Any log entry that violates these rules is flagged as an anomaly.
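The following sketch illustrates the statistical approach from the list above, reusing fit_baseline() from the earlier sketch and flagging windows whose event count deviates from the baseline mean by more than three standard deviations. The threshold of 3.0 is a conventional default, not a value from this study.

```python
def zscore_anomalies(counts, mean, std, threshold=3.0):
    """Flag window indices whose event count deviates more than
    `threshold` standard deviations from the baseline mean."""
    if std == 0:
        return []
    return [i for i, c in enumerate(counts) if abs(c - mean) / std > threshold]

# Baseline fitted on healthy traffic, then applied to new windows.
mean, std = fit_baseline([98, 102, 101, 99, 100])
print(zscore_anomalies([97, 103, 180, 100], mean, std))  # -> [2]
```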
Alert Generation and Response: To inform the relevant parties, an alert is generated when abnormal behaviour is identified. Establish appropriate response procedures to investigate and address potential security threats or system irregularities.
Continuous Monitoring and Feedback: Implement continuous monitoring of the log files and regularly update and refine the anomaly detection models. Monitor the performance of the detection techniques in real-world scenarios, adapt to evolving patterns, and incorporate feedback to improve the anomaly detection process.

The major contributions of this research are to:

• Prepare the data from the received file by parsing it.

• Give the machine learning algorithms a numerical representation of the log entries.

• Use machine learning algorithms to search the Mobilaris log file for abnormal log sequences.

The remaining paper is formulated as follows: Section II focuses on the literature review of the relevant machine learning algorithms. Section III defines the proposed methodology for finding anomalies in the log files. Section IV presents the outcomes and analysis. Finally, the article concludes the results and explores the future scope.

II. LITERATURE REVIEW

Textual data is often meaningful and can be crucial input to machine learning algorithms [3]. Several natural language processing techniques can be applied to logs that contain textual data. One of the more popular of these methods is term frequency-inverse document frequency (TF-IDF), which represents textual data using term frequencies and is a common technique for expressing data in anomaly detection [3].
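As an illustration of the TF-IDF representation discussed here, the sketch below vectorizes a few invented log messages with scikit-learn; the messages and the default tokenization are assumptions for demonstration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy log messages standing in for real entries; the vectorizer treats
# each message as a document and weights rarer tokens more heavily.
messages = [
    "connection established to tag 42",
    "connection established to tag 17",
    "vibration interval out of range for tag 42",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(messages)
print(tfidf.shape)                       # (3, number_of_distinct_tokens)
print(vectorizer.get_feature_names_out())
```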
Word2vec, another natural language processing method, proved highly successful and efficient at representing textual data in low dimensions in [4].

Many studies also employ log vectorization approaches to describe the log data [5][6], using log abstraction techniques that exploit the constant portion of the log messages: log sequences are converted into vectors of log events, where the log print statements in the code generate generic log messages that serve as placeholders for log events. The sequences in [7] contain many log events, and the log events are weighted in two different ways: adjusted IDF-based weighting and contrast-based weighting. According to that study, the significance of each log event varies with how frequently it appears in the various log classes, which is why the weighting technique was proposed.

The journal article [8] also discussed the tree structure diagram, also called the decision tree algorithm, and examined SVM, another classification-based technique. Each of the three supervised algorithms combined the labels and training examples to create an event count vector. The decision tree's selections proved more reasonable for the developer when picking anomalies than the other two algorithms. Although all of the algorithms were effective at finding anomalies, SVM had the highest overall accuracy.

The LogCluster algorithm was chosen by the publication [9]. LogCluster is a data clustering algorithm that uses log files to find patterns. LogCluster was used to measure performance over 92 days: 1,879,209 of the 296,699,550 log messages processed by the implementation were identified as anomalies. The authors also found that not all anomalies were specific anomaly log messages; some were standard log messages.

The paper [10] used the DBSCAN algorithm, a density-based clustering approach, and suggested that DBSCAN be used to spot anomalies in monthly temperature data. According to
the report, it offers several advantages over statistical strategies for spotting irregularities.
III. PROPOSED METHODOLOGY

K-means and DBSCAN, two distinct unsupervised machine learning methods, are employed in this study to find aberrant sequences in Mobilaris log files. Unsupervised learning is preferred because it can derive knowledge from data structures without labels and because log sequences can reveal hidden issues [11]. The process of finding anomalies in a data file is broken down into various parts, as shown in Figure 2. With the assistance of the Mobilaris staff, anomalies in the log file were labelled.

The log file we obtained came from Tag Vibration Service (TVS), one of Mobilaris' services. TVS counts the intervals between blinks to detect the validity of a tag [12]. The constant and variable data in log entries from the file are represented by textual or numerical data. It is crucial to gather and prepare the relevant data before utilising the machine learning algorithms. Finding patterns in data requires a fundamental step called data representation [13]. The distance between sets of data that are comparable to one another should be kept to a minimum, while the distance between sets that are not should be kept at a maximum. After processing and applying the log abstraction technique, a new log file with log ids representing the log entries is produced [14]. The feature vector encoding allows textual log entries in a log sequence to be represented as numbers. The vector has the same number of dimensions as there are distinct log entries in the whole log file, and every log entry has a designated position. Each log entry's frequency of appearance in the sliding window is counted using the feature vector representation [15] (see the sketch below).
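A minimal sketch of this count-based feature vector construction, assuming the log entries have already been abstracted to integer log ids. The window width and step below are illustrative, not the values used in the study.

```python
import numpy as np

def sliding_window_vectors(log_ids, num_ids, window=10, step=1):
    """Count how often each log id appears in each sliding window.

    `log_ids` is the sequence of integer ids produced by log abstraction;
    each output row has one dimension per distinct log id, as described
    above.
    """
    vectors = []
    for start in range(0, max(len(log_ids) - window + 1, 1), step):
        vec = np.zeros(num_ids, dtype=int)
        for log_id in log_ids[start:start + window]:
            vec[log_id] += 1
        vectors.append(vec)
    return np.array(vectors)

X = sliding_window_vectors([0, 1, 1, 2, 0, 1], num_ids=3, window=3)
print(X)  # one row per window, one column per distinct log id
```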
To reduce the dimensions from 29 to 2, PCA was used. Three clusters were produced; the blue stars present in each plot indicate the centre of each cluster. The anomalies identified by the algorithm are indicated by red circles surrounding the data points.
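A sketch of how such a K-means detector with a percentile threshold on the distance to the cluster centre could be implemented. The random input data, the choice of Euclidean distance, and the PCA projection used only for plotting are our assumptions, not details confirmed by the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def kmeans_percentile_anomalies(X, n_clusters=3, percentile=97.5):
    """Cluster the window vectors and flag points whose distance to their
    cluster centre exceeds the given percentile of all such distances."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    centres = km.cluster_centers_[km.labels_]
    distances = np.linalg.norm(X - centres, axis=1)
    threshold = np.percentile(distances, percentile)
    return np.where(distances > threshold)[0]

# Placeholder 29-dimensional window vectors; PCA to 2D only for plotting.
X = np.random.default_rng(0).normal(size=(200, 29))
X2d = PCA(n_components=2).fit_transform(X)
print(kmeans_percentile_anomalies(X, n_clusters=3, percentile=97.5))
```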
Fig. 4. Total evaluation time

The suggested method achieves a memory storage quantity of 13,598,247.75 bits by altering the number of bullets. The optimization strategies' combined execution time is 21,008 milliseconds. Figure 3 shows how the system performs under the suggested technique when the number of repeats is changed, and Figure 4 represents the fitness value of the suggested approach. The message with the highest fitness value in the MPSO had the lowest mistake frequency. As the number of observations rises in this case, the efficiency score falls. Table II displays the thorough classification validity of the proposed MANN-based backpropagation technique; the proposed MANN provides 91.25 percent accuracy in this case.
Fig. 8: Threshold at k=3, 97.5% accuracy with K-Means
Table IV: Results for unique log sequences, K-means, IDF representation

Metrics     K-means, threshold    K-means, threshold    K-means, threshold
            percentile 99         percentile 98         percentile 97.5
Recall      0.87                  0.87                  0.91
F1-score    0.93                  0.93                  0.95
Precision   1                     1                     1

C. DBSCAN with IDF

In this subsection we demonstrate the outcome of the DBSCAN technique with IDF representation. All anomalous log sequence results are shown in Table V, whereas only the unique ones are shown in Table VI; recall, F1, and precision scores are displayed. The output of the DBSCAN technique is shown in Figure 9. The dimensions were again decreased from 29 to 2 using PCA. Six clusters were produced by DBSCAN. The predicted anomalous sequences from DBSCAN are represented by the purple data points.
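A minimal sketch of this DBSCAN-based detection, treating DBSCAN's noise points as the anomalous windows. The eps and min_samples values and the random input are placeholders; on real IDF vectors they would need tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

# Placeholder 29-dimensional IDF-weighted window vectors.
X = np.random.default_rng(1).normal(size=(200, 29))

db = DBSCAN(eps=3.0, min_samples=5).fit(X)   # parameters require tuning;
anomalies = np.where(db.labels_ == -1)[0]    # DBSCAN labels noise as -1
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(f"{n_clusters} clusters, {len(anomalies)} windows flagged as noise")

# As in the paper, PCA can reduce the 29 dimensions to 2 for visualisation.
X2d = PCA(n_components=2).fit_transform(X)
```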
Table V: Performance parameters with DBSCAN, all anomalous log sequences

Metrics     DBSCAN
Recall      0.35
F1-score    0.51
Precision   1

Table VI: Performance parameters with DBSCAN, unique log sequences

Metrics     DBSCAN
Recall      0.78
F1-score    0.87
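For reference, the recall, F1, and precision scores reported in the tables above can be computed as below; the labels shown are invented for demonstration and do not reproduce the paper's numbers.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels: 1 = anomalous window, 0 = normal.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # 1.0: no false alarms
print("Recall:   ", recall_score(y_true, y_pred))     # 0.75: one anomaly missed
print("F1 score: ", f1_score(y_true, y_pred))
```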
V. CONCLUSION
We looked into the potential of machine learning for anomaly detection when analysing the log files at Mobilaris. We discovered that both K-means and DBSCAN were effective in identifying distinctive log anomalies. The only algorithms that successfully discovered the total number of anomalous log sequences were the K-means techniques with low percentile thresholds. According to our investigation, the K-means strategy was equally good with IDF representation and with feature vector representation at finding the total number of aberrant log sequences with a threshold percentile of 97. We also considered how the data representation affects the outcomes: K-means with IDF representation outperformed the other techniques at identifying distinct log sequences when the threshold percentile was 97. Experimenting with different window widths could improve these findings even further.