Applied Soft Computing Journal 153 (2024) 111314


A robust Wide & Deep learning framework for log-based anomaly detection
Weina Niu a, Xuhan Liao a, Shiping Huang a, Yudong Li a, Xiaosong Zhang a, Beibei Li b,∗

a School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
b School of Cyber Science and Engineering, Sichuan University, Chengdu, China

ARTICLE INFO

Keywords: Log-based anomaly detection; Log templates extraction; Semantic information; Multi-features; Deep learning

ABSTRACT

Log-based anomaly detection has shown huge commercial potential in system maintenance. However, existing methods encounter two practical challenges. Firstly, they struggle to maintain consistent performance when dealing with evolving logs over time. Secondly, they face difficulties in effectively detecting frequency anomalies, such as abnormal system resource usage and abnormal system operating frequencies. In this paper, we propose a robust log-based anomaly detection framework using Wide & Deep learning called WDLog. In particular, we enhance the processing of template semantic information by building upon the Drain algorithm; we then introduce invariant features and statistical features and propose a multi-feature anomaly detection method based on the Wide & Deep framework. The experimental results on the HDFS and BGL datasets demonstrate the promising performance of WDLog compared to state-of-the-art methods in terms of anomaly detection effectiveness. Furthermore, WDLog exhibits robustness to evolving logs, achieving F1-scores of over 90% under different degrees of log variation.

1. Introduction

In dynamic and complex environments, system failures are inevitable. These failures can manifest in various ways, such as components running incorrectly, errors occurring during their interaction, or intentional attacks on the system. While simple system failures can often be resolved by restarting the system, complex faults are more challenging to repair. If these faults are not addressed in a timely manner, they can have a significant impact on the system. Due to internal or external reasons, the probability of abnormal operations and crashes of terminal systems and software systems is increasing, which has a great negative impact on the user experience and can even cause huge losses. For example, on November 26, 2020, Amazon Web Services (AWS) experienced a major outage that affected several companies that depend on AWS cloud services, including Adobe, Roku, Twilio, and Flickr [1]. On February 24, 2021, several key payment system services of the Federal Reserve were interrupted, and users were unable to use any payment services for 4 h, causing disruption to millions of financial transactions [2].

Existing computer systems use logs to record the status of key points and events during their operation, thereby helping maintenance personnel debug system performance and troubleshoot. Early log-based anomaly detection methods mainly relied on experienced experts to find anomalies, and their efficiency is low. The mainstream methods [3–6] automatically discover abnormal logs through data mining and machine learning. Although there are differences in the structure and content of different system logs [7], each log statement can be described by a pattern of variables and constants. Thus, these methods first extract log templates, that is, they abstract specific log execution information into general and representative system execution behaviors. However, existing methods lose a lot of key information during the template extraction process. For example, DeepLog [8] replaced the original log with the log sequence number after extracting the log template, so the semantic information is lost.

Moreover, when the system changes, both the content and the format of the logs can be affected. The log data is constantly expanding and updating, which makes existing detection methods unsuitable for dealing with different types and structures of log data; their accuracy also decreases when they are applied to new and complex log data. The primary factor contributing to this is the continuous generation of system logs in current software systems, particularly in distributed systems, resulting in a significant volume of log data being generated. Studies show that about 20%–45% of log statements will change during their lifetime [9]. For example, log statements that represent the same system behavior may replace "down" with "off". However, existing machine-learning-based log anomaly detection methods tend to overfit, so it is difficult to maintain excellent anomaly detection accuracy after the log statements change. Furthermore, existing methods struggle to identify anomalies that do not result in changes in the execution flow.

∗ Corresponding author.
E-mail address: [email protected] (B. Li).
1 https://github.com/adadadad194/WDLog

https://doi.org/10.1016/j.asoc.2024.111314
Received 23 July 2023; Received in revised form 15 January 2024; Accepted 19 January 2024
Available online 30 January 2024
1568-4946/© 2024 Elsevier B.V. All rights reserved.

We propose a robust log-based anomaly detection method using Wide & Deep learning, called WDLog,1 to address the above problems. First, semantic embedding and clustering are added on the basis of the log template extraction algorithm Drain. Then, we extract temporal features, invariant features, and statistical features from the log template sequences. Next, we train a GRU model based on the attention mechanism using the temporal features, and a Gradient Boosting Decision Tree (GBDT) model that combines the invariant features with the statistical features. Finally, we utilize Wide & Deep [10] and multi-features to detect anomalous logs.

The main contributions of this paper include the following:

(1) We design an optimized log template extraction algorithm, which generates log templates through embedding and clustering on the basis of Drain, thus effectively reducing its sensitivity to word changes and improving the robustness of the algorithm.

(2) We propose a log-based anomaly detection method that combines multi-features with the Wide & Deep framework. We extract temporal features, invariant features, and statistical features based on correlation from the generated log template sequences, thus enabling the detection model to deal with different anomalies.

(3) We conduct different experiments on two public datasets with different characteristics (HDFS and BGL). The experimental results demonstrate that WDLog achieves better detection performance and stronger robustness compared with well-known solutions.

The remainder of this paper is arranged as follows. Related work is described in Section 2. Section 3 elaborates on the proposed WDLog framework. Experimental results are shown in Section 4. Section 5 is the discussion. Our conclusions are drawn in Section 6.

2. Related work

2.1. Log template extraction

Log template extraction is used to classify system logs in a specific way. The variable part of the same type of log is replaced with "*" or other proper words, special symbols, etc., and integrated with the constant part to obtain the log template. The mainstream log extraction methods are mainly based on frequent pattern mining, clustering, and heuristics.

The method based on regular expressions [11] is the simplest and most direct way to extract log templates. This method parses logs based on rules that are manually written by experts and adapted to specific logs. However, this method makes it difficult to effectively deal with massive log data. Therefore, regular expressions are often used as an auxiliary technology to extract log templates together with other algorithms. For example, the Drain algorithm [12] used regular expressions to preprocess log data to optimize processing time and then constructed a log parsing tree based on the processed data to extract log templates.

A log template is an expression pattern that frequently appears in logs. Therefore, logs can be distinguished by counting the frequency of occurrence of some items of different statements in the log. Typical methods applying this technique include SLCT [13], LogCluster [14], and Logram [15]. These methods first traverse the log data and then construct frequent item sets. After that, the log messages are categorized into the generated clusters, and finally, the log templates are extracted from the clusters.

To capture common parts of logs, similar logs can be clustered together and common tags shared across each cluster can be identified as log templates. Compared with pattern mining-based methods, this method can extract common parts on local clusters of similar logs instead of global log datasets. Typical methods applying this technique include LKE [16], LogSig [17], SHISO [18], and LenMa [19].

Unlike general text data, log messages have some unique characteristics. Therefore, the log template can be extracted by adopting a method that adapts to the log structure and content characteristics. Classical applications of heuristics include AEL [20], POP [21], IPLoM [22], and Drain [12]. Specifically, AEL [20] grouped log messages by comparing occurrences between constant and variable tokens. IPLoM [22] grouped log messages based on message length, token position, and mapping relationships using an iterative partitioning strategy. Spell [23] parsed logs in a stream based on the longest common subsequence algorithm. Based on the prefix tree idea, Drain [12] constructed a fixed-depth tree structure to represent log messages and efficiently extract common templates.

However, the Drain algorithm cannot effectively deal with updated logs. For example, when the system is updated, it is likely to replace some old phrases with more commonly used synonymous phrases when recording logs. These logs should belong to the same template, but Drain may classify them into multiple templates, resulting in redundancy. This not only increases the training time, but the new template numbers may also cause recognition anomalies.

2.2. Log-based anomaly detection

Although supervised log anomaly detection methods have better overall performance, they need a well-marked and labeled dataset, which is time-consuming. Researchers have tried to solve the problem of difficult dataset collection and labeling for supervised learning methods. For example, Meng et al. [24] proposed a framework for modeling log streams as natural language sequences, called LogAnomaly, whose basic idea was to convert log templates into vectors and classify log templates by calculating the spatial distance of the vectors. This method only needs a small number of normal log datasets to train the proposed deep learning model, and then to detect abnormal log data. However, this method cannot adapt to the scenario where log templates increase due to the continuous update of the system. Aiming at the problem that old anomaly detection methods fail over time, Zhang et al. [25] proposed a new log anomaly detection method, called LogRobust. This method semanticized the content of the log template instead of directly representing the template with a number, which reduces the loss of information and enables it to handle new logs with minor changes.

Compared with supervised anomaly detection methods, unsupervised anomaly detection methods do not need to pre-label the dataset. However, due to the lack of labeling information, their overall performance is relatively poor. Because anomaly labels are often unavailable in practice, unsupervised methods are much more practical in real-world service systems. Researchers have proposed many anomaly detection models based on unsupervised methods. For example, Du et al. [8] proposed the DeepLog framework, which processed log information as a natural language sequence and automatically learned log patterns from the normal execution flow. DeepLog detected anomalies by determining whether incoming log events violate the prediction results of the stacked LSTM model. But the method still has the problem of low precision caused by the evolution of log sentences over time. Concerning this problem, Yin et al. [26] proposed an unsupervised anomaly detection model called LogCL, which can automatically learn the characteristics of log data and compare the learned characteristics. They used CNN and Bi-LSTM models to automatically extract log features and identify abnormal logs through clustering algorithms. In order to better analyze and understand logs, Vaarandi et al. [14] proposed a log clustering and pattern mining algorithm called LogCluster. The algorithm can group log data, cluster similar logs together, classify them into a cluster, and extract frequent event patterns from them.


Fig. 1. The framework of WDLog.

Semi-supervised anomaly detection methods only need to use part of the labeled data and most of the unlabeled data to complete the training. They have the advantages of both the unsupervised method and the supervised method, so they achieve good performance at a relatively small cost. For instance, Lin et al. [27] introduced a novel and practical log-based anomaly detection approach called PLELog. This method employed semi-supervised learning to eliminate the time-consuming manual labeling process and integrated historical anomaly knowledge through probabilistic label estimation, leveraging the advantages of supervised approaches. Aiming at the problem of semi-supervised log anomaly detection, where the only training data available are normal logs from a baseline period, Yen et al. [28] proposed a model called "CausalConvLSTM". The model combined the convolutional neural network (CNN) [29] and the long short-term memory network (LSTM) [30], which can effectively model and predict time series data, thereby realizing log anomaly detection. This method used a two-stage training procedure. The first stage was supervised training, using labeled normal log data and the cross-entropy loss function to optimize the model. The second stage was semi-supervised training, which uses unlabeled log data and a semi-supervised loss function to optimize the model. This method can effectively detect abnormal information in logs and has high accuracy and robustness.

Many research studies have been conducted in the field of log template extraction and log anomaly detection, yielding promising results. However, most of these studies have primarily focused on the time series aspect of the logs, overlooking certain invariant features and statistical characteristics of the logs. This limited utilization of log information in anomaly detection has led to incomplete coverage and gaps in the detection of anomalies.

3. The proposed WDLog framework

WDLog mainly contains two modules: log template extraction and anomaly detection, as shown in Fig. 1. The log template extraction relies on an optimized version of Drain to obtain the log templates; anomaly detection combines temporal features, invariant features, and statistical features with the Wide & Deep framework to comprehensively evaluate anomalous logs.

3.1. Log template extraction

Log template extraction mainly includes three modules: log pre-processing, original template generation, and template compression. The objective of log pre-processing is to eliminate strings that cannot be recognized by the algorithm in the log data. This step involves replacing certain portions of the content in the log statements with a designated mask. Original template generation mainly builds a log parsing tree according to the format and content characteristics of the pre-processed logs and then generates initial log templates based on the parsing tree structure and its leaf node distribution. The core of template compression is to convert initial log templates into semantic vectors and cluster them.

3.1.1. Log pre-processing

Log statements usually contain compound words consisting of more than two words and special symbols. An Out of Vocabulary (OOV) exception occurs if these words, which are not recognized by the dictionary, are not replaced. Log statements also contain some strings with a fixed format and high frequency, such as IP addresses, IP port numbers, fixed-length hexadecimal strings, etc. Replacing them with a constant corresponding to their semantics can not only express the real semantics of the log statement more effectively but also speed up parsing and reduce its difficulty, thereby improving the generation speed of the parsing tree and the accuracy of classification. Thus, log pre-processing consists of three main steps:

Step 1: Analyzing the linguistic characteristics of log files, identifying the strings that can be replaced and meet the requirements, and constructing a substitution table that maps the strings to their semantics, as shown in Table 1.

Step 2: Identifying the words in the log file based on the vocabulary list. If any unrecognized words are found, add an out-of-vocabulary (OOV) entry to the substitution table.

Step 3: Constructing corresponding regular expressions based on the identified strings, performing a global search on the log file to find string variables that match the regular expressions, and replacing the original string variables with the replacement words from the substitution table. The main procedure is described in Algorithm 1.


Algorithm 1 Regular Expression Matching Algorithm
Require: regexList, logs (hdfsList/bglList)
Ensure: logsList
1: logsList ← initList()
2: for log in logs do
3:   for regex in regexList do
4:     processed_log ← findRegex(log, regex)   ▷ replace words in logs
5:   logsList ← addLog(log, processed_log)
6: return logsList

Table 1
Substitution table.
Substitute word    Original string
IP port            10.251.110.8:50010
NUM                235 237
ID                 blk −2419185540153636210
HEX                0x1808d120
OOV                addStoredBlock, <
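As an illustration of the pre-processing step (Algorithm 1 and Table 1), the following Python sketch applies a small substitution table with regular expressions. The patterns and substitute words are hypothetical examples rather than the exact rules used by WDLog.

```python
import re

# Hypothetical substitution table: (compiled regex, substitute word).
SUBSTITUTIONS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}:\d+\b"), "IP port"),   # IP:port
    (re.compile(r"\bblk_?-?\d+\b"), "ID"),                         # HDFS block id
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "HEX"),                    # hexadecimal string
    (re.compile(r"\b\d+\b"), "NUM"),                               # plain number
]

def preprocess(log_line: str) -> str:
    """Replace high-frequency variable strings with constant substitute words."""
    for pattern, substitute in SUBSTITUTIONS:
        log_line = pattern.sub(substitute, log_line)
    return log_line

if __name__ == "__main__":
    raw = "Received block blk_-2419185540153636210 of size 67108864 from 10.251.110.8:50010"
    print(preprocess(raw))
    # -> "Received block ID of size NUM from IP port"
```

Ordering the patterns from most to least specific (IP:port and block IDs before plain numbers) keeps the more informative substitute words from being clobbered by the generic NUM rule.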
3.1.2. Original template generation

Original template generation consists of two main steps:

Step 1: Constructing the log statement parsing tree based on the Drain algorithm according to factors such as length, similarity, and word order between log statements. Specifically, the log statements are first classified according to their lengths; then each classified sub-tree is reclassified by means of constant prefix matching in a multi-threaded concurrent execution manner. To prevent the parsing tree from becoming excessively large and to avoid excessive construction time, certain limitations are imposed on the iterations and scale of the log parsing tree. This is achieved by setting the maximum depth and the maximum number of leaf nodes.

Step 2: Calculating the similarity to classify the log statements. The similarity between log statements in a log group is calculated to identify common patterns and determine log templates. Here we define the calculation of similarity in Eqs. (1) and (2). EQ is a function that differentiates words, and the value of Sim indicates the similarity between two statements. Specifically, we need to calculate the similarity between each pair of log statements within a log group. If the similarity between two log statements reaches a threshold value, then the two objects are of high similarity and will be classified to the same template, i.e., to the same leaf node as the object already classified. Here we consider the log statement that is compared first as the classified object. If the similarity does not reach the threshold, it means that the compared object and the classified object are so different that they cannot belong to the same log template. Therefore, the log statement with a large difference needs to be classified to a new leaf node, indicating an added classification. The detailed process is described in Algorithm 2.

EQ(w_1, w_2) = 1 if w_1 = w_2, and 0 otherwise    (1)

Sim(seq_1, seq_2) = (Σ_{i=1}^{n} EQ(seq_1(i), seq_2(i))) / n    (2)

Algorithm 2 Template generation
Require: logsList, config
Ensure: log2TemplateList
1: log2TemplateList ← initList()
2: lenGroup ← sepByLength(logsList)   ▷ group logs by length
3: lenGroup ← sort(logsList, config.maxgroup)   ▷ sort by occurrence frequency, categorize the top n according to the configuration in config and treat the rest as one category
4: for group in config.depth do
5:   lenGroup ← sepByPrefix(lenGroup)   ▷ iterate through classification by word prefix based on the maximum depth set in config
6: for Group in lenGroup do
7:   Group ← similarity(Group)   ▷ compute similarity
8:   Group ← sepBySim(Group)   ▷ group by similarity
9:   log2templateList ← generate(Group)   ▷ generate log template
10: return log2TemplateList
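To make Eqs. (1) and (2) concrete, a minimal sketch of the token-level similarity used when grouping pre-processed log statements is shown below; the example statements are illustrative, and the 0.6 threshold follows the HDFS setting in Table 5.

```python
def eq(w1: str, w2: str) -> int:
    """EQ in Eq. (1): 1 if the two tokens are identical, else 0."""
    return 1 if w1 == w2 else 0

def sim(seq1: list[str], seq2: list[str]) -> float:
    """Sim in Eq. (2): fraction of positions where the two token sequences agree."""
    assert len(seq1) == len(seq2), "statements are only compared within a length group"
    return sum(eq(a, b) for a, b in zip(seq1, seq2)) / len(seq1)

s1 = "Received block ID of size NUM from IP port".split()
s2 = "Received block ID of size NUM to IP port".split()
print(sim(s1, s2))  # 8/9 ≈ 0.89, above a 0.6 threshold -> same template
```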
3.1.3. Template compression

Template compression mainly includes two steps: converting log templates into semantic vectors and clustering them.

Step 1: Decomposing each log template into the words it contains, then using the pre-trained GloVe word vectors [31] to get the semantic vector corresponding to each word, and finally adding the semantic vectors of all the words in the sentence together to form a new 300-dimensional template semantic vector. In order to better represent the template semantic information in the template semantic vector, we use Term Frequency–Inverse Document Frequency (TF–IDF) to calculate the weights of different words instead of directly using uniform weights.

Step 2: Using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [32] to cluster the obtained log templates with semantic vectors to achieve template compression. DBSCAN has no requirements on the shape or number of clusters. It divides regions that match the set density into clusters, and then merges connected clusters with high density into the same cluster. DBSCAN defines two parameters, Epsilon (ε) and minPts. Epsilon (ε) indicates the maximum radius of a cluster, and minPts indicates the minimum number of points within a cluster. A group of points is only considered a cluster if it contains at least minPts clustered objects within the radius; otherwise, the points are treated as discrete noise. The whole procedure is described in Algorithm 3.

Algorithm 3 Template Compression Algorithm
Require: log2templateList, glove.txt
Ensure: template2reprList
1: template2reprList ← initList()
2: for template in log2templateList do
3:   repr ← initFloatList()
4:   for word in template do
5:     TF ← CalcTF(word, template)
6:     IDF ← CalcIDF(template, log2templateList)
7:     TF_IDF ← TF × IDF
8:     repr ← Add(TF_IDF × toVec(word, glove.txt))
9:   repr ← Add(repr)
10:  template2reprList ← Add(reprToList(repr))
11: template2reprList ← DBSCAN(template2reprList)
12: return template2reprList
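The sketch below mirrors Algorithm 3 with NumPy and scikit-learn's DBSCAN. The GloVe file name, the particular TF–IDF variant, and the eps/min_samples values are assumptions made for illustration, not the exact settings of WDLog.

```python
import math
import numpy as np
from sklearn.cluster import DBSCAN

def load_glove(path="glove.6B.300d.txt"):
    """Load GloVe vectors (word -> 300-d numpy array) from a plain-text file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def template_vector(template, all_templates, glove, dim=300):
    """TF-IDF weighted sum of the word vectors of one template (Algorithm 3, lines 4-9)."""
    words = template.split()
    vec = np.zeros(dim, dtype=np.float32)
    for word in set(words):
        tf = words.count(word) / len(words)
        df = sum(1 for t in all_templates if word in t.split())
        idf = math.log(len(all_templates) / (1 + df))  # one common TF-IDF variant
        vec += tf * idf * glove.get(word, np.zeros(dim, dtype=np.float32))
    return vec

def compress_templates(templates, glove, eps=1.0, min_samples=2):
    """Cluster template vectors with DBSCAN; templates sharing a label are merged."""
    matrix = np.stack([template_vector(t, templates, glove) for t in templates])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(matrix)
    return labels  # label -1 marks templates treated as noise (kept as singletons)
```

Because DBSCAN needs neither the cluster count nor cluster shapes in advance, templates that differ only by a synonym or small wording change tend to fall into the same dense region and are merged into one compressed template.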


3.2. Anomaly detection

Multi-feature log-based anomaly detection mainly contains two phases: feature extraction and model building. We extract key features, including temporal features, invariant features, and statistical features, from the log sequences in the feature extraction phase. Then we build a Wide & Deep model that processes these features separately with different components.

3.2.1. Feature extraction

Since the log generation and collection methods of different systems are different, it is necessary to select an appropriate sequence segmentation method according to the characteristics of the logs. The log events in HDFS all contain block subjects, which have obvious correlations, so the subject-based independent sequence segmentation method is used. However, there is no obvious correlation between BGL log statements, and the log sequence is divided according to time periods, so the sliding window segmentation method is used. After splitting the log files into sequences, the next step is to extract the key features in the log sequences. Logs contain three key features, namely temporal features, invariant features, and statistical features. For temporal features, the log events are recorded in the order of occurrence, and the log sequence is divided based on the generated timestamps; therefore, the sequence itself inherently represents the time relationship. Invariant features refer to the corresponding relationship between paired log events with causal relationships, where an open event must be followed by a close event for normal operation of the system. The normal functioning of such paired events ensures the overall proper functioning of the system. For instance, a log event pair consisting of "start transaction" and "commit" is an invariant feature. Statistical features refer to the number of occurrences of individual events in a log sequence.

Algorithm 4 Feature Extraction Algorithm
Require: logSequence, eventPairs, events
Ensure: invariantFeature, statisticalFeature, temporalFeature
1: n ← len(eventPairs), m ← len(events), t ← len(logSequence)
2: invariantFeature, statisticalFeature, temporalFeature ← initFeatures(n, m, t)
3: for i, eventPair in eventPairs do
4:   number0 = getNumber(logSequence, eventPair[0])
5:   number1 = getNumber(logSequence, eventPair[1])
6:   if number0 == number1 then
7:     invariantFeature[i] = 1
8:   else
9:     invariantFeature[i] = 0
10: for i, event in events do
11:   statisticalFeature[i] = getNumber(logSequence, event)
12: for i, log in logSequence do
13:   temporalFeature[i] = getEventID(log)
14: return invariantFeature, statisticalFeature, temporalFeature

Given a log sequence, the extraction process of its features is shown in Algorithm 4. For the invariant features, we initialize an n-dimensional vector x_Invariant (n is the number of log event pairs). Then we set the value of x_Invariant_i to 1 if the numbers of occurrences of the two events in the i-th event pair are equal in the log sequence, and to 0 otherwise. We use an m-dimensional vector x_Statistical to represent the statistical features, where x_Statistical_i is the number of occurrences of event i in the log sequence. The temporal feature x_Temporal represents the log sequence itself, and x_Temporal_i is the event ID of the i-th log in the log sequence.
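A minimal Python rendering of Algorithm 4 is given below, assuming a log sequence is represented as a list of event IDs; the example event pair is hypothetical.

```python
from collections import Counter

def extract_features(log_sequence, event_pairs, events):
    """Return (invariant, statistical, temporal) features of one log sequence.

    log_sequence: list of event IDs in order of occurrence (temporal feature).
    event_pairs:  list of (open_event, close_event) pairs (invariant features).
    events:       list of all event IDs (statistical features).
    """
    counts = Counter(log_sequence)
    # x_Invariant[i] = 1 if the two events of the i-th pair occur equally often.
    invariant = [1 if counts[a] == counts[b] else 0 for a, b in event_pairs]
    # x_Statistical[i] = number of occurrences of event i in the sequence.
    statistical = [counts[e] for e in events]
    # x_Temporal is the event-ID sequence itself.
    temporal = list(log_sequence)
    return invariant, statistical, temporal

# Hypothetical example: events 3 and 4 form an open/close pair.
seq = [1, 3, 2, 4, 3, 5, 4]
print(extract_features(seq, event_pairs=[(3, 4)], events=[1, 2, 3, 4, 5]))
# -> ([1], [1, 1, 2, 2, 1], [1, 3, 2, 4, 3, 5, 4])
```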
These features can be found by experienced technicians, but as the number of logs grows extremely fast, it is unaffordable to rely solely on human effort. Moreover, it should be noted that the number and types of invariant features and statistical features can be large, and not all of them are equally informative for the anomaly detection task. So feature filtering is required to select features with a positive impact on the anomaly detection task. Based on this situation, we use the Pearson correlation coefficient as an indicator for feature screening, which is fast and easy to compute and understand. The Pearson coefficient can assess the linear correlation of different variables to identify features that have a strong correlation, as calculated in Eq. (3). Notice that we calculate the Pearson correlation coefficient between pairs of log events to select invariant features, and between log events and labels to select statistical features. For each event, we obtain an array representing the occurrences of that event in each log sequence.

ρ_{x,y} = cov(X, Y) / (σ_X σ_Y) = (E(XY) − E(X)E(Y)) / (√(E(X²) − E²(X)) · √(E(Y²) − E²(Y)))    (3)

As for the invariant features, we directly use the occurrence arrays of two log events to compute the Pearson coefficients. The degree of correlation between different log events can be assessed by the Pearson correlation coefficient of the occurrence numbers of the two log events. The invariant feature is actually a linearly and positively correlated relationship. For a completely normal log sequence, the Pearson correlation coefficient of the numbers of occurrences of two log events with an invariant feature relationship is 1, and we consider that the reason for some abnormalities is that the invariant feature relationship does not match. However, since not all log files used in the calculation of Pearson's correlation coefficient are normal log sequences, the actual distribution of log sequences in log files consists of a large number of normal log sequences and a small number of abnormal sequences. Taking HDFS logs as an example, the abnormal log sequences account for three percent. Therefore, the Pearson correlation coefficient cannot reach 1. So we set a threshold value of the Pearson correlation coefficient less than 1 to find the correct invariant feature relationships; setting a coefficient of less than 1 also helps to reduce the noise caused by manual labeling errors.

Filtering statistical features additionally requires obtaining an array of labels for the log sequences. The Pearson correlation coefficient is calculated between the occurrences of each log event and the corresponding actual label of the log sequence. The correlation coefficient represents how relevant the statistical data is to the normal or abnormal sequences, and to some extent, it also indicates the importance of different statistical data for the anomaly detection algorithm. Statistical feature filtering is also done by setting thresholds, but the thresholds for statistical features are considered in relation to the needs of the downstream task.

3.2.2. Threshold experiment of feature filtering

In our approach, the selection of the threshold is critical, since it directly determines the features used in the anomaly detection phase. A high threshold may ignore valuable features, causing the loss of information and reducing anomaly detection capability. Conversely, a low threshold will retain redundant features, interfere with the learning of valuable information, and slow down the training of the model. The selection of the threshold needs to be considered together with the specific application scenario. In our experiments, the dataset and feature type have a great influence on the threshold setting, and the results are shown in Tables 2 and 3. When the Pearson coefficient is above 0.9, both HDFS and BGL yield a number of invariant features, indicating the universality of invariant features, and the BGL dataset, which has more template categories, shows even more invariant features. Compared with the invariant features, statistical features can only be extracted by setting a lower threshold, which indicates that the direct correlation between log events and labels is weak, and these features can only be used to assist other features for anomaly detection.

Table 2
Invariant features obtained by setting different thresholds.
Dataset   Threshold   Invariant features (example ranges)      Feature count
HDFS      0.9         1–5, 3–4, 6–7, 8–9, . . .                12
HDFS      0.8         . . . 6–11, 6–12, 17–32, . . .           14
HDFS      0.7         . . . 14–23, 14–24, 22–25, . . .         21
HDFS      0.6         . . . 20–22, 20–25, 22–25, . . .         38
BGL       0.95        4–7, 13–14, 17–177, 17–204, . . .        625
BGL       0.9         . . . 20–33, 20–34, 20–35, . . .         976
BGL       0.85        . . . 21–23, 21–24, 21–25, . . .         1271
BGL       0.8         . . . 10–220, 10–261, 13–232, . . .      1581

Table 3
Statistical features obtained by setting different thresholds.
Dataset   Threshold   Statistical features (example events)        Feature count
HDFS      0.5         event3, event4                                2
HDFS      0.4         event3, event4, event16, event33              4
HDFS      0.3         . . . event11, event12, event33 . . .         7
BGL       0.5         event19, event127                             2
BGL       0.4         event19, event127                             2
BGL       0.3         . . . event30, event31, event32 . . .         26

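The feature filtering described above can be sketched as follows; NumPy's corrcoef stands in for Eq. (3), and the thresholds (0.9 for invariant pairs, 0.5 for statistical events) follow the example values in Tables 2 and 3 rather than a fixed recommendation.

```python
import numpy as np

def select_invariant_pairs(event_counts, threshold=0.9):
    """Select candidate invariant event pairs by Pearson correlation (Eq. (3)).

    event_counts: array of shape (num_sequences, num_events); entry [s, e] is how
    often event e occurs in log sequence s. Returns (i, j) index pairs whose
    occurrence counts correlate above the threshold.
    """
    corr = np.nan_to_num(np.corrcoef(event_counts, rowvar=False))  # constant columns -> 0
    num_events = event_counts.shape[1]
    return [(i, j)
            for i in range(num_events)
            for j in range(i + 1, num_events)
            if corr[i, j] >= threshold]

def select_statistical_events(event_counts, labels, threshold=0.5):
    """Select events whose occurrence counts correlate with the anomaly label."""
    selected = []
    for e in range(event_counts.shape[1]):
        rho = np.corrcoef(event_counts[:, e], labels)[0, 1]
        if abs(rho) >= threshold:
            selected.append(e)
    return selected
```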

3.2.3. Model building

To fully use the valuable information in the logs to improve the anomaly detection capability, we propose a Wide & Deep based multi-feature anomaly detection method. The Wide & Deep architecture is not a specific model, but rather a programmatic idea combining deep learning and machine learning. This architecture consists of two parts: Wide and Deep. The Wide part is generally a generalized linear model, which gives the whole model a strong memory capability and applies to simple and strongly correlated features. The Deep part, on the other hand, enables the model to have generalization ability and is suitable for learning complex features. Our architecture mainly consists of a Gradient Boosting Decision Tree–Logistic Regression model (GBDT-LR) [33] on the Wide end and an Attention-Based Gated Recurrent Unit (Attention-GRU) model on the Deep end.

In order to make full use of the three different features, we divide them among different parts of this anomaly detection model. The temporal features are input to the Deep end, while the invariant features and statistical features are input to the Wide end. After abstracting the high-dimensional features on the Wide end and the Deep end, the features are combined and a prediction is computed through a logistic regression layer. Unlike anomaly detection methods that use only deep learning or machine learning, our approach combines both the Deep and Wide ends. This combination allows more features to be processed by type and exploits more useful information in the logs to achieve better anomaly detection capabilities. The main components of the model, namely the Deep and Wide ends, are described in detail below.

Deep end. We choose the Attention-Based Gated Recurrent Unit model as the Deep end for processing the input content. The Gated Recurrent Unit (GRU) [34] is a modified Recurrent Neural Network (RNN). Compared to the basic RNN structure, the GRU has shown improved performance in handling the gradient vanishing problem and capturing long-term dependencies in sequential data. Also, the GRU has a simpler internal structure and better computational speed than another improved RNN model, the LSTM [30]. However, using the GRU alone is not enough. In a log anomaly detection model, the output of each GRU unit depends on the current log event and on the memory of the past. But if a log sequence is too long, there may be cases where anomalous events far apart have an impact on the current event, which can affect the validity of the prediction. In order to eliminate this effect, we use an attention mechanism [35] to reassign weights to the degree of influence of past memory units on the current output; the process is shown in Fig. 2.

The output of each GRU unit depends on the input log event of this unit and the hidden state of the previous GRU unit. At the same time, the attention mechanism learns the weight with which each output contributes to the accuracy of the prediction, increasing the degree of effective memory and improving the reliability of the prediction results. Here are some examples that show the effectiveness of the attention mechanism. Due to the complexity of the log system, the distribution of log sequences also shows diversity, and there may be partially similar log sequences in the system. Figs. 3 and 4 depict two possible scenarios for log sequences. There are cases where the same sequence is followed by a number of different sequences within the system, as shown in Fig. 3. There are also cases where different sequences are followed by the same sequence, as shown in Fig. 4. For the above two cases, if the two log sequences have the same state, i.e., they are both normal or abnormal, then their common parts contribute more to the model. Otherwise, their different parts contribute more to the model. Such potential sequential relationships can be better learned by using the GRU model based on the attention mechanism. By learning such timing characteristics, abnormal execution processes can be identified more accurately.

The calculation process on the Deep end is shown in Formula (4). In the following formula, x_i is the log template vector of the i-th log in the log sequence, z_i and r_i are the outputs of the update gate and reset gate respectively, h_i represents the hidden state at time i, and W^z, W^r and W^h are the corresponding weight matrices. Output_Deep is the final output of the Deep end.

X_Temporal = {x_1, x_2, …, x_n}
z_i = σ(W^z · [h_{i−1}, x_i])
r_i = σ(W^r · [h_{i−1}, x_i])
h̃_i = tanh(W^h · [r_i ⊙ h_{i−1}, x_i])    (4)
h_i = (1 − z_i) ⊙ h̃_i + z_i ⊙ h_{i−1}
Output_Deep = tanh(Σ_{i=1}^{n} λ_i h_i)
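A minimal PyTorch sketch of such an attention-based GRU Deep end (cf. Eq. (4)) is shown below. The embedding layer, the single-linear attention scorer, and the hyperparameters (hidden size 150, 4 layers, mirroring Table 5) are assumptions made for illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

class AttentionGRU(nn.Module):
    """Attention-weighted GRU over a sequence of log template IDs."""

    def __init__(self, num_templates: int, embed_dim: int = 300,
                 hidden_size: int = 150, num_layers: int = 4):
        super().__init__()
        self.embedding = nn.Embedding(num_templates, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_size, num_layers, batch_first=True)
        self.attention = nn.Linear(hidden_size, 1)   # one score per time step

    def forward(self, event_ids: torch.Tensor) -> torch.Tensor:
        # event_ids: (batch, seq_len) integer template IDs
        x = self.embedding(event_ids)                       # (batch, seq_len, embed_dim)
        h, _ = self.gru(x)                                  # (batch, seq_len, hidden)
        weights = torch.softmax(self.attention(h), dim=1)   # attention weights lambda_i
        context = (weights * h).sum(dim=1)                  # weighted sum over time
        return torch.tanh(context)                          # Output_Deep, (batch, hidden)

if __name__ == "__main__":
    model = AttentionGRU(num_templates=50)
    batch = torch.randint(0, 50, (8, 120))                  # 8 sequences of length 120
    print(model(batch).shape)                               # torch.Size([8, 150])
```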
Wide end. We choose the Gradient Boosting Decision Tree–Logistic Regression model (GBDT-LR) as the Wide end. The invariant features and statistical features are superficial for log anomaly detection, and a Gradient Boosting Decision Tree (GBDT) can easily remember such relationships. Therefore, in the subsequent detection phase, the GBDT has the ability to identify anomalies caused by the invariant and statistical features of the logs.

The prediction process of the GBDT is shown in Eq. (5). X is the combination of the invariant features and statistical features of the log sequence, which is used as input to the Wide end. F_{m−1}(X) represents the cumulative output of the first m − 1 trees, h_m(X) represents the predicted value of the leaf node in the m-th tree, and the result obtained after M iterations is Output_Wide.

X = concat{X_Invariant, X_Statistical}
F_m(X) = F_{m−1}(X) + h_m(X)    (5)
Output_Wide = F_M(X)

Furthermore, the integration of Logistic Regression (LR) on top of the GBDT has proven to be a significant improvement. This approach adds only a layer of computation to the GBDT algorithm and achieves a more comprehensive prediction by taking multiple features into account. Our method uses the GBDT to abstract the invariant and statistical features and recombine the useful information to obtain new features that have a more positive impact on the outcome. Then, as shown in Formula (6), the LR layer combines all original features (including the invariant features and statistical features) with the newly generated features as input and predicts the final result.

Input_LR = concat{Output_Deep, X_Invariant, X_Statistical, Output_Wide}    (6)
Result = LR(Input_LR)

Note that the Wide end is trained before the Deep end and the Logistic Regression. We first train the GBDT using the invariant and statistical features with the corresponding labels. Then we use the trained GBDT as the Wide end to get the synthetic feature Output_Wide, which forms part of the Logistic Regression's input. At last, we apply backward propagation and gradient descent to train the Deep end and the Logistic Regression simultaneously.
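The Wide end and its combination with the Deep output (Eqs. (5) and (6)) can be sketched with scikit-learn as follows. The data are synthetic, the 150 estimators follow Table 5, and the final logistic regression is fitted separately here instead of being trained jointly with the Deep end by back-propagation as described above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the Wide-end input: each row is the concatenation of the
# invariant and statistical features of one log sequence (Eq. (5)).
rng = np.random.default_rng(0)
X_wide = rng.integers(0, 5, size=(1000, 20)).astype(float)
y = rng.integers(0, 2, size=1000)

# Step 1 (trained first, as described above): GBDT produces Output_Wide.
gbdt = GradientBoostingClassifier(n_estimators=150)
gbdt.fit(X_wide, y)
output_wide = gbdt.predict_proba(X_wide)[:, [1]]       # synthetic Wide-end feature

# Output_Deep would come from the Attention-GRU; random here for illustration.
output_deep = rng.normal(size=(1000, 150))

# Step 2: the LR layer combines Output_Deep, the original features and Output_Wide (Eq. (6)).
input_lr = np.hstack([output_deep, X_wide, output_wide])
lr = LogisticRegression(max_iter=1000)
lr.fit(input_lr, y)
print(lr.predict(input_lr[:5]))                         # 0 = normal, 1 = anomalous
```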
4. Experimental evaluation

In this section, we present the experiments conducted to validate the effectiveness of the WDLog framework. We begin by describing our experimental setup in Section 4.1. Next, in Section 4.2, we discuss the comparative methods employed and the performance metrics utilized. Subsequently, in Section 4.3, we outline the model settings and experimental parameters employed in our study.


Fig. 2. Attention based GRU.
Fig. 3. Same sequence followed by different sequences.
Fig. 4. Different sequences followed by the same sequence.

The impact of parameter settings on WDLog is examined in Section 4.4. We analyze the anomaly detection performance of the proposed WDLog method, along with other advanced methods, on multiple datasets in Section 4.5. Finally, in Section 4.6, we showcase the effectiveness of the WDLog architecture through ablation experiments.

4.1. Experimental setup

The experiments were written in Python 3.9, modeled in the PyTorch 1.11.0 environment, and conducted on a Windows-OS-based device, whose main parameters are:
CPU: AMD Ryzen 7 5800H with Radeon Graphics, 3.20 GHz
Graphics Card: GTX3060
Memory: 16G RAM

We choose two datasets with different characteristics from the open source project Loghub [36]: Hadoop Distributed File System (HDFS) and BlueGene/L (BGL). The HDFS dataset is generated in an HDFS cluster using a baseline workload, where anomalies are identified and labeled manually during the log generation process based on manually developed rules. The BGL (Blue Gene/L) dataset is a supercomputing system log dataset collected by Lawrence Livermore National Laboratory (LLNL). The difference is that log information in the HDFS dataset can be divided into log sequences by unique block IDs, while logs in the BGL dataset do not contain block IDs and are divided into log sequences by sliding windows.

The main characteristics of each dataset are as follows:

(1) HDFS: It is obtained by running map-reduce tasks for 38.7 h on more than 2000 Amazon EC2 nodes. It contains a total of 11,172,157 log records, which include 284,818 abnormal records. Each log record contains one or more block_ids. In our experiments, the HDFS dataset is divided into 575,061 log sequences according to block_id, and then divided into the training set, validation set, and testing set according to the ratio of 6:1:3.

(2) BGL: It is collected over 7 months by a supercomputer deployed at Lawrence Livermore National Laboratory. It contains a total of 4,747,963 log records, of which 348,460 log records are abnormal. In our experiments, the BGL dataset uses a sliding window to extract the log sequences, where the window size is set to 120, and then the training set, validation set, and testing set are divided according to the ratio of 6:1:3.

Details of these two open datasets in our experiments are given in Table 4.

Table 4
Dataset sequence distribution.
Dataset   Split            Number of normal sequences   Number of abnormal sequences
HDFS      training set     333,792                      11,244
HDFS      validation set   55,635                       1,871
HDFS      testing set      168,796                      3,723
BGL       training set     24,146                       27,200
BGL       validation set   4,012                        4,545
BGL       testing set      12,116                       13,558
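The two sequence segmentation strategies described above can be sketched as follows. Since the paper does not state whether the BGL windows overlap, the sketch uses non-overlapping windows of 120 log events as a simple assumption.

```python
from collections import defaultdict
import re

def split_hdfs_by_block(log_lines):
    """Group HDFS logs into sequences by the block_id each line mentions (subject-based split)."""
    sequences = defaultdict(list)
    for line in log_lines:
        match = re.search(r"blk_-?\d+", line)
        if match:
            sequences[match.group()].append(line)
    return list(sequences.values())

def split_bgl_by_window(log_lines, window_size=120):
    """Split BGL logs into fixed-size windows of 120 events (Section 4.1)."""
    return [log_lines[i:i + window_size]
            for i in range(0, len(log_lines), window_size)]
```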

4.2. Comparative methods and performance metrics

Several known log-based anomaly detection methods are also selected for experimental comparison with WDLog (DeepLog, LogAnomaly, PLELog, and LogCluster). We adopt the Precision, Recall, and F1-score (F1) as the evaluation metrics, which are typically used in this field.

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)    (7)
F1 = 2 × Precision × Recall / (Precision + Recall)

4.3. Model settings

The primary model settings and experimental parameters for WDLog on the HDFS and BGL datasets are presented in Table 5. The values were carefully tested and determined as described in Section 4.4. For the comparative models, their parameters are set based on the references and further fine-tuned using the optimal test results obtained in our study.


Table 5
Model settings for HDFS and BGL.
Parameter       HDFS   BGL
Depth           4      4
Similarity      0.6    0.5
Hidden States   150    150
GRU Layers      4      4
Estimators      150    150

Fig. 5. Number of generated log templates on HDFS.
Fig. 6. Number of generated log templates on BGL.
Fig. 7. Performances on different sizes of hidden states.

4.4. Parameter sensitivity experiment

4.4.1. Parameters for log template extraction

When generating log parse trees, the choice of the similarity threshold and the parse tree depth can significantly impact the accuracy of log template extraction. To investigate this, we performed experiments on the HDFS and BGL log datasets, adjusting the similarity threshold and parse tree depth to observe their effects on the number of generated log templates. The experimental results are depicted in Figs. 5 and 6.

Based on the observation of the content of the generated templates and the number of log statements associated with different parameter settings, it was found that the number of log statements classified to redundant templates for HDFS logs is only a few dozen, and for BGL logs, it is only a few hundred. In comparison to the large number of log statements in the HDFS dataset (10 million) and the BGL dataset (millions), the impact of redundant templates on accuracy is minimal. However, the presence of redundant templates does affect the total number of template types, and having too many templates can increase the time overhead and resource consumption of downstream tasks. Therefore, it is important to find an optimal parameter selection that strikes a balance.

Based on the experimental results, setting Depth = 4 and Similarity = 0.6 is found to be the best parameter configuration for HDFS logs, while Depth = 4 and Similarity = 0.5 are optimal for BGL logs. This choice of parameters ensures a reasonable number of extracted templates while maintaining accuracy.

4.4.2. Parameters for anomaly detection

By adjusting the number of hidden states and GRU layers in the deep structure, as well as the number of estimators in the wide structure, the performance of the log anomaly detection model can be evaluated under various parameter settings. This analysis helps understand the impact of these settings on the model's performance and identifies the most effective configuration for accurate anomaly detection.

To assess the influence of the number of hidden states, an experiment was conducted by varying the hidden states parameter while keeping the other parameters at their default values. The F1-score was measured for different numbers of hidden states: 50, 100, 150, 200, 250, and 300. The results, depicted in Fig. 7, show the relationship between the number of hidden states and the corresponding F1-score.

The results show that the F1-score exhibits a gradual increase followed by a gradual decrease as the number of hidden states increases. This observation indicates that a higher number of hidden states allows the model to capture more refined features from the training set, leading to improved performance. However, when the number becomes excessively large, it can lead to overfitting and a decline in performance on the test set. Therefore, there exists an optimal range of hidden states that balances model complexity and generalization ability, ensuring the best anomaly detection performance.

Similarly, experiments were conducted to evaluate the impact of the number of GRU layers on anomaly detection performance. The GRU layers were varied from 1 to 6, while the other parameters were kept at their default values. The experimental results, shown in Fig. 8, demonstrate that both the GRU layers and the hidden states significantly influence the model's performance, highlighting their importance in determining the model's effectiveness.

Furthermore, the number of estimators in the wide structure was also investigated.


Table 6
Performance of different methods.
Log    Method             Precision   Recall   F1 Score
HDFS   DeepLog [8]        0.945       0.900    0.922
HDFS   LogAnomaly [24]    0.860       0.898    0.878
HDFS   PLELog [27]        0.957       0.888    0.921
HDFS   LogCluster [14]    1.000       0.836    0.911
HDFS   WDLog              0.979       0.925    0.951
BGL    DeepLog [8]        0.907       0.956    0.931
BGL    LogAnomaly [24]    0.972       0.941    0.956
BGL    PLELog [27]        0.967       0.978    0.973
BGL    LogCluster [14]    0.914       0.642    0.754
BGL    WDLog              0.994       0.978    0.986

Fig. 8. Performances on different numbers of GRU layers.
Fig. 9. Performances on different numbers of estimators.

Estimators were set to values of 50, 100, 150, 200, 250, and 300, while the other parameters remained unchanged. The results, presented in Fig. 9, indicate that increasing the number of estimators positively contributes to the F1-score of anomaly detection, but the effect exhibits a slower rate of change. Estimators are associated with the level of feature abstraction, and a higher value allows for finer abstraction of features, leading to improved anomaly detection. Additionally, the experiments reveal that the impact of increasing estimators is more significant for HDFS logs compared to BGL logs, suggesting that HDFS logs benefit more from increased feature abstraction due to their characteristics.

4.5. Comparative experiment

4.5.1. Anomaly detection performance

To evaluate the anomaly detection capability of WDLog, we conducted a comparative analysis with well-known solutions in the field, including DeepLog, LogAnomaly, PLELog, and LogCluster. The experimental results are presented in Table 6.

The results indicate that LogCluster achieves the highest precision in identifying abnormal logs in the HDFS dataset. However, its performance is not as strong when applied to the BGL dataset. The main reason for this discrepancy lies in the nature of LogCluster as a semantic clustering algorithm. In the case of HDFS, the distinction between abnormal logs and normal logs is typically characterized by substantial differences, making it relatively easy for LogCluster to separate them effectively. However, when it comes to the BGL dataset, LogCluster faces challenges in distinguishing certain normal logs that fall within the marginal range during the clustering process of abnormal logs. This difficulty arises from the larger number of template categories in BGL and the relatively smaller differences between these templates. Consequently, LogCluster's clustering results on BGL may be compromised, leading to suboptimal performance in anomaly detection.

DeepLog, LogAnomaly and PLELog are models based on log temporal features. When extracting log templates, DeepLog only utilizes template types and does not consider the semantic features of templates, which means that template semantic information is not used to represent templates. LogAnomaly introduces an innovative approach that specifically targets the distinctions between near-synonyms and antonyms in log statements. It further delves into the semantic differences between log statements and uses vectors with semantic information to represent log templates, resulting in remarkable performance on both the BGL and HDFS datasets. Notably, LogAnomaly exhibits exceptional performance in terms of the F1-score on the BGL dataset, indicating the substantial amount of untapped information present in log statements that can be effectively leveraged. PLELog, on the other hand, extracts semantic vectors of log templates by using a pre-trained model, GloVe, which allows mining more semantic information in log templates and achieves better performance. For anomaly detection, DeepLog and LogAnomaly adopt the LSTM to learn the temporal information in log templates, while PLELog employs the GRU model. Both can effectively extract the hidden information in the log template vectors; the only difference is that the GRU model has a simpler construction, which can shorten the training time. These approaches demonstrate strong overall performance, reaffirming the significance of temporal features as a crucial foundation for effective anomaly detection.

Compared with the above methods, WDLog combines GloVe with the clustering technique DBSCAN to compress templates, which enhances the robustness against small changes of logs in real environments. During the anomaly detection phase, WDLog takes a comprehensive approach by considering a broader range of log features and their potential characteristics for identification. It incorporates the strengths of the GRU and the GBDT by adopting the Wide & Deep framework. Besides the temporal features, our model further considers the invariant features and statistical features in the log sequences to improve the performance of anomaly detection. As a result, WDLog delivers outstanding performance on both datasets, showcasing excellent recall and an F1-score that surpasses the other approaches.

4.5.2. Robustness

To assess the robustness of WDLog and the other methods in the context of iterative changes in log updates, we conducted experiments by injecting iterated HDFS logs and BGL logs at various percentages (0%, 5%, 10%, 15%, and 20%). These new logs represent different degrees of normal post-iteration logs.


Table 7
Time consumption of different methods on HDFS and BGL.
Method HDFS BGL
Training Testing Training Testing
DeepLog [8] 1 h 50 m 20 m 44 m 7 m
LogAnomaly [24] 4 h 40 m 477 m 4 h 20 m 39 m
PLELog [27] 43 m 42 s 24 m 10 s
LogCluster [14] 19 m 23 s 41 s 40 s
WDLog 53 m 51 s 30 m 12 s

Table 8
Ablation results on HDFS and BGL.
Dataset Scheme Precision Recall F1 Score
without Attention 0.981 0.803 0.883
without Wide 0.973 0.895 0.932
HDFS
without Deep 0.975 0.716 0.825
WDLog 0.979 0.925 0.951
without Attention 0.978 0.904 0.940
without Wide 0.979 0.955 0.967
BGL
without Deep 0.979 0.782 0.869
Fig. 10. Robustness of different methods on changing HDFS and BGL logs.
WDLog 0.994 0.978 0.986

evaluated using the F1-score, and the results were visualized in a box
plot depicted in Fig. 10. LogAnomaly consume much more time than the GRU-based method
DeepLog heavily relies on the temporal features of logs for anomaly PLELog and the WDLog in this paper.
detection. It employs log template tags to represent log statements The clustering methods mainly focus on the selection of clustering
as time series. However, when logs change, new log statements are indexes. And the whole clustering process is relatively simple and costs
categorized into new template tags. Unless DeepLog is retrained, it little time. The GRU model is a simplification of the LSTM model. The
struggles to handle the new time series correctly. Hence, DeepLog input, output and forgetting gates of LSTM are reduced to the reset and
exhibits the highest box plot height and demonstrates poor robustness. update gates of GRU, which reduces a large part of the computation
LogCluster and LogAnomaly both rely on semantic information for process and costs less time.
anomaly detection. However, LogCluster performs better when logs Compared with PLELog, which also uses the GRU model, our
change because LogAnomaly divides logs into more fine-grained cate- method adds semantic clustering in the preprocessing stage, and a Wide
gories, thereby amplifying the impact of semantic changes and reducing end in the anomaly detection stage to perform anomaly detection task
robustness. On the other hand, PLELog utilizes both semantic and together. The overall complexity and computation increase to some
temporal features for anomaly detection. While both semantics and extent, but compared with PLELog which only has the Deep end, the
temporal sequences may change during log iteration, PLELog effec- training phase only increases by 10 min, and the testing phase only
tively handles log changes with good robustness. However, PLELog increases by 2 s. The time consumption is still in an advantageous
relies on a limited set of features, and if these features change during position among all the methods compared.
log iteration, the anomaly detection performance will be significantly
affected. 4.6. Ablation experiment
WDLog is crafted to tackle the problem of evolving logs, particularly
during template extraction. It replaces log statements with template
To assess the contribution of different components to the overall
semantics to minimize sensitivity to variations introduced by evolving
model, we conducted ablation experiments on the attention mecha-
logs. On the other hand, the DBSCAN used for template clustering is
nism, Wide model, and Deep model. Each component was masked out
able to classify the logs into the correct templates even when there are
separately, while the remaining parts of the model were left unchanged.
minor changes in the logging system, such as synonym substitutions,
abbreviations, and so on. Different from traditional approaches, WDLog Anomaly detection was performed on HDFS logs and BGL logs, and the
incorporates invariant and statistical features in its anomaly detec- results are presented in Table 8.
tion process. This comprehensive strategy enhances the framework’s The experimental results demonstrate that the attention mechanism
robustness to the dynamic nature of evolving logs. plays a significant role in improving the overall performance of the
algorithm. With numerous features involved, it is crucial to effectively
4.5.3. Time consumption

Time consumption is a key indicator for evaluating anomaly detection systems and plays a vital role in real-life scenarios where a fast and effective response is critical. Our evaluation breaks time consumption down into two components: training time, which reflects how efficiently WDLog adapts to the specific characteristics of the log data during the training phase, and test time, which reflects how efficiently WDLog detects anomalies in the log data. The experimental data are shown in Table 7. In general, all of the compared log anomaly detection methods train and test quickly enough to satisfy daily anomaly detection tasks.

It is clear that the least time-consuming method is the clustering-based LogCluster, while the LSTM-based methods DeepLog and LogAnomaly […]. The input, output, and forget gates of the LSTM are reduced to the reset and update gates of the GRU, which removes a large part of the computation and therefore costs less time. Compared with PLELog, which also uses a GRU model, our method adds semantic clustering in the preprocessing stage and a Wide end in the anomaly detection stage, so that the two ends perform the anomaly detection task together. The overall complexity and computation therefore increase to some extent, but compared with PLELog, which has only a Deep end, the training phase increases by only 10 min and the testing phase by only 2 s, so WDLog's time consumption remains in an advantageous position among all of the compared methods.
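As a rough illustration of the gate-count argument above, the snippet below compares the parameter counts of a single LSTM layer and a single GRU layer of the same size; the input and hidden sizes are arbitrary examples, not the configuration used in WDLog.

# Rough illustration of the LSTM-vs-GRU cost argument: for the same input and
# hidden sizes, a GRU layer (reset + update gates) carries about 3/4 of the
# parameters of an LSTM layer (input, forget, and output gates plus the cell
# candidate). Sizes below are arbitrary examples.
import torch.nn as nn

input_size, hidden_size = 300, 128
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# PyTorch stores 4*h*(d+h+2) parameters for an LSTM layer and 3*h*(d+h+2) for a
# GRU layer, so dropping one gate block cuts the per-layer cost to roughly 75%.
print("LSTM parameters:", n_params(lstm))
print("GRU parameters: ", n_params(gru))
print("GRU/LSTM ratio : %.2f" % (n_params(gru) / n_params(lstm)))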
4.6. Ablation experiment

To assess the contribution of the different components to the overall model, we conducted ablation experiments on the attention mechanism, the Wide model, and the Deep model. Each component was masked out separately while the remaining parts of the model were left unchanged, and anomaly detection was then performed on the HDFS and BGL logs. The results are presented in Table 8.

The experimental results demonstrate that the attention mechanism plays a significant role in improving the overall performance of the algorithm. With numerous features involved, it is crucial to learn and utilize the relevant features effectively so that irrelevant ones do not interfere with predictions; the attention mechanism focuses the model's learning on the important features, thereby enhancing its performance. Regarding the Wide and Deep components, the results indicate that omitting the Wide component reduces the overall prediction ability, and omitting the Deep component likewise diminishes it. The Deep component leverages temporal features for anomaly detection, while the Wide component predominantly utilizes invariant features and statistical features, so the two complement each other in prediction. Therefore, to achieve optimal results, the Deep component needs to be supplemented by the Wide component.

In summary, the attention mechanism contributes to the overall performance improvement, and both the Wide and Deep components are crucial for achieving the best results.
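To make the division of labour between the two ends concrete, the sketch below shows a Wide & Deep style detector that routes the sequence of template semantic vectors through a GRU with additive attention (the Deep end) and the invariant plus statistical features through a linear model (the Wide end). The feature dimensions, layer sizes, and the exact attention form are illustrative assumptions and do not reproduce the WDLog architecture or its hyperparameters.

# Minimal Wide & Deep sketch for log anomaly detection (illustrative only).
import torch
import torch.nn as nn

class WideDeepDetector(nn.Module):
    def __init__(self, template_dim=300, wide_dim=20, hidden=128):
        super().__init__()
        # Deep end: GRU over the sequence of template semantic vectors,
        # followed by a simple additive attention over the hidden states.
        self.gru = nn.GRU(template_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.deep_out = nn.Linear(hidden, 1)
        # Wide end: linear model over invariant + statistical features.
        self.wide = nn.Linear(wide_dim, 1)

    def forward(self, template_seq, wide_feats):
        # template_seq: (batch, seq_len, template_dim); wide_feats: (batch, wide_dim)
        states, _ = self.gru(template_seq)
        weights = torch.softmax(self.attn(states), dim=1)       # attention over time steps
        context = (weights * states).sum(dim=1)                 # weighted sequence summary
        logit = self.deep_out(context) + self.wide(wide_feats)  # sum the two ends
        return torch.sigmoid(logit)                             # probability of "anomalous"

# Example: 4 log windows, each with 50 template vectors and 20 wide features.
model = WideDeepDetector()
scores = model(torch.randn(4, 50, 300), torch.randn(4, 20))
print(scores.shape)  # torch.Size([4, 1])

Summing the two logits before the sigmoid follows the usual Wide & Deep recipe: the Wide end memorizes simple feature patterns (here, frequency-style anomalies captured by invariant and statistical features), while the Deep end generalizes over the temporal sequence, so masking out either end removes one of two complementary signals.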
5. Discussion

5.1. Class imbalance

We believe that expanding the anomaly log dataset with class balancing techniques, such as manually creating anomaly logs, in order to improve model performance presents significant challenges: the temporal features required by the model are difficult to replicate, and expanding the statistical and invariant features may result in overfitting. Nevertheless, this is still a worthwhile direction to pursue, and we will consider applying class balancing techniques to log anomaly detection in future work.
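For context, the simplest form of such class balancing is to oversample the existing anomalous sessions, as in the sketch below; it merely duplicates minority-class windows rather than synthesizing new temporal behaviour, which is exactly the limitation discussed above. The helper function and its parameters are hypothetical and are not part of WDLog.

# Naive class balancing by oversampling anomalous log sessions (illustrative only).
import random

def oversample(sessions, labels, target_ratio=1.0, seed=0):
    """Duplicate anomalous sessions until anomalies approximate target_ratio * normals."""
    rng = random.Random(seed)
    anomalies = [s for s, y in zip(sessions, labels) if y == 1]
    normals = [s for s, y in zip(sessions, labels) if y == 0]
    if not anomalies:                      # nothing to oversample
        return list(sessions), list(labels)
    extra = []
    while len(anomalies) + len(extra) < target_ratio * len(normals):
        extra.append(rng.choice(anomalies))
    balanced = normals + anomalies + extra
    new_labels = [0] * len(normals) + [1] * (len(anomalies) + len(extra))
    return balanced, new_labels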
5.2. Scalability

We also evaluated the scalability of the log anomaly detection framework. In terms of handling log data at different scales, the framework exhibits excellent performance and strong linear scalability, and can effectively handle datasets of different sizes. In terms of model training, the Wide & Deep learning-based training scheme greatly reduces training time and improves the efficiency of the system in processing massive datasets. In addition, the framework shows flexibility in hardware resource utilization and in model updating and maintenance, providing a reliable foundation for the long-term maintenance of the system. Overall, these findings not only guide the optimization of the current log anomaly detection system's performance, but also provide substantial support for future applications in large-scale data environments.

6. Conclusion

In this study, we presented WDLog, a robust log-based anomaly detection framework designed to effectively identify abnormal logs even in the presence of log changes. We improved the Drain algorithm to generate log templates that are less sensitive to changing words, thereby preserving more semantic information and enhancing robustness. Additionally, we extracted multiple features based on correlation to better capture the characteristics of different anomalies. By leveraging the Wide & Deep framework, we employed different feature types to detect anomalous logs, thereby improving the accuracy of the detection process. Experimental results on the HDFS and BGL datasets demonstrated the effectiveness of WDLog in anomaly detection. However, it should be noted that WDLog currently does not support intelligent word segmentation in log processing. As part of future research, we will focus on further advancements in this area and explore intelligent word segmentation techniques to enhance the capabilities of WDLog.

CRediT authorship contribution statement

Weina Niu: Writing – review & editing, Writing – original draft, Software, Methodology, Formal analysis, Conceptualization. Xuhan Liao: Writing – review & editing, Software, Methodology, Investigation, Data curation, Conceptualization. Shiping Huang: Writing – review & editing, Writing – original draft, Software, Data curation. Yudong Li: Software, Investigation, Formal analysis. Xiaosong Zhang: Writing – review & editing, Supervision, Funding acquisition. Beibei Li: Writing – review & editing, Validation, Methodology.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work was supported in part by the National Science Foundation of China under Grant 61902262, and in part by the National Key Research and Development Program of China under Grant 2023QY0101.