A robust Wide & Deep learning framework for log-based anomaly detection
Weina Niu a, Xuhan Liao a, Shiping Huang a, Yudong Li a, Xiaosong Zhang a, Beibei Li b,∗

a School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
b School of Cyber Science and Engineering, Sichuan University, Chengdu, China
∗ Corresponding author. E-mail address: [email protected] (B. Li).

https://doi.org/10.1016/j.asoc.2024.111314
Received 23 July 2023; Received in revised form 15 January 2024; Accepted 19 January 2024; Available online 30 January 2024.
Keywords: Log-based anomaly detection; Log templates extraction; Semantic information; Multi-features; Deep learning

Abstract

Log-based anomaly detection has shown huge commercial potential in system maintenance. However, existing methods encounter two practical challenges. Firstly, they struggle to maintain consistent performance when dealing with evolving logs over time. Secondly, they face difficulties in effectively detecting frequency anomalies, such as abnormal system resource usage and abnormal system operating frequencies. In this paper, we propose a robust log-based anomaly detection framework using Wide & Deep learning, called WDLog. In particular, we enhance the processing of template semantic information by building upon the Drain algorithm; we then introduce invariant features and statistical features and propose a multi-feature anomaly detection method based on the Wide & Deep framework. Experimental results on the HDFS and BGL datasets demonstrate the promising performance of WDLog compared to state-of-the-art methods in terms of anomaly detection effectiveness. Furthermore, WDLog is robust to evolving logs, achieving F1-scores of over 90% under different degrees of log variation.
1. Introduction

In dynamic and complex environments, system failures are inevitable. These failures can manifest in various ways, such as components running incorrectly, errors occurring during their interaction, or intentional attacks on the system. While simple system failures can often be resolved by restarting the system, complex faults are more challenging to repair. If these faults are not addressed in a timely manner, they can have a significant impact on the system. Due to internal or external reasons, the probability of abnormal operations and crashes of terminal systems and software systems is increasing, which has a great negative impact on the user experience and can even cause huge losses. For example, on November 26, 2020, Amazon Web Services (AWS) experienced a major outage that affected several companies that depend on AWS cloud services, including Adobe, Roku, Twilio, and Flickr [1]. On February 24, 2021, several key payment system services of the Federal Reserve were interrupted, and users were unable to use any payment services for 4 h, disrupting millions of financial transactions [2].

Existing computer systems use logs to record the status of key points and events during their operation, thereby helping maintenance personnel debug system performance and troubleshoot. Early log-based anomaly detection methods mainly relied on experienced experts to find anomalies, and their efficiency was low. The mainstream methods [3-6] automatically discover abnormal logs through data mining and machine learning. Although there are differences in the structure and content of different system logs [7], each log statement can be described by a pattern of variables and constants. Thus, these methods first extract log templates, that is, they abstract specific log execution information into general and representative system execution behaviors. However, existing methods lose a lot of key information during the template extraction process. For example, DeepLog [8] replaced the original log with the log sequence number after extracting the log template, so the semantic information is lost.

Moreover, when the system changes, both the content and format of the log can be affected. Log data is constantly expanding and updating, which makes existing detection methods unsuitable for dealing with different types and structures of log data; their accuracy also decreases when applied to new and complex log data. The primary factor contributing to this is the continuous generation of system logs in current software systems, particularly in distributed systems, resulting in a significant volume of log data. Studies show that about 20%-45% of log statements change during their lifetime [9]. For example, log statements that represent the same system behavior may replace "down" with "off". However, existing machine-learning-based log anomaly detection methods tend to overfit, so it is difficult for them to maintain excellent anomaly detection
accuracy after the log statement changes. Furthermore, existing methods struggle to identify anomalies that do not result in changes in the execution flow.

We propose a robust log-based anomaly detection method using Wide & Deep learning, called WDLog (source code available at https://github.com/adadadad194/WDLog), to address the above problems. First, semantic embedding and clustering are added on top of the log template extraction algorithm Drain. Then, we extract temporal features, invariant features, and statistical features from the log template sequences. Next, we train a GRU model based on the attention mechanism using the temporal features, and a Gradient Boosting Decision Tree (GBDT) model that combines the invariant features with the statistical features. Finally, we utilize Wide & Deep [10] and multi-features to detect anomaly logs.

The main contributions of this paper include the following:

(1) We design an optimized log template extraction algorithm, which generates log templates through embedding and clustering on the basis of Drain, thus effectively reducing its sensitivity to word changes and improving the robustness of the algorithm.

(2) We propose a log-based anomaly detection method that combines multiple features with the Wide & Deep framework. We extract temporal features, invariant features, and statistical features based on correlation from the generated log template sequences, thus enabling the detection model to deal with different anomalies.

(3) We conduct different experiments on two public datasets with different characteristics (HDFS and BGL). The experimental results demonstrate that WDLog achieves better detection performance and stronger robustness compared with well-known solutions.

The remainder of this paper is arranged as follows. Related work is described in Section 2. Section 3 elaborates on the proposed WDLog framework. Experimental results are shown in Section 4. Section 5 presents a discussion. Our conclusions are drawn in Section 6.

2. Related work

2.1. Log template extraction

Log template extraction is used to classify system logs in a specific way. The variable part of the same type of log is replaced with "*" or other proper words, special symbols, etc., and integrated with the constant part to obtain the log template. Mainstream log extraction methods are based on frequent pattern mining, clustering, and heuristics.

The method based on regular expressions [11] is the simplest and most direct way to extract log templates. It parses logs based on rules that are manually written by experts and adapted to specific logs. However, this method struggles to deal effectively with massive log data. Therefore, regular expressions are often used as an auxiliary technique to extract log templates together with other algorithms. For example, the Drain algorithm [12] used regular expressions to preprocess log data to optimize processing time and then constructed a log parsing tree based on the processed data to extract log templates.

A log template is an expression pattern that frequently appears in logs. Therefore, logs can be distinguished by counting the frequency of occurrence of certain items of different statements in the log. Typical methods applying this technique include SLCT [13], LogCluster [14], and Logram [15]. These methods first traverse the log data and then construct frequent item sets. After that, the log messages are categorized into the generated clusters, and finally, the log templates are extracted from the clusters.

To capture common parts of logs, similar logs can be clustered together, and common tags shared across each cluster can be identified as log templates. Compared with pattern mining-based methods, this approach can extract common parts on local clusters of similar logs instead of global log datasets. Typical methods applying this technique include LKE [16], LogSig [17], SHISO [18], and LenMa [19].

Unlike general text data, log messages have some unique characteristics. Therefore, log templates can be extracted by adopting a method that adapts to the structural and content characteristics of logs. Classical applications of heuristics include AEL [20], POP [21], IPLoM [22], and Drain [12]. Specifically, AEL [20] grouped log messages by comparing occurrences between constant and variable tokens. IPLoM [22] grouped log messages based on message length, token position, and mapping relationships using an iterative partitioning strategy. Spell [23] parsed logs in a stream based on the longest common subsequence algorithm. Based on the prefix tree idea, Drain [12] constructed a fixed-depth tree structure to represent log messages and efficiently extract common templates.

However, the Drain algorithm cannot effectively deal with updated logs. For example, when the system is updated, some old phrases are likely to be replaced with more commonly used synonymous phrases when recording logs. These logs should belong to the same template, but Drain may classify them into multiple templates, resulting in redundancy. This not only increases the training time, but the new template numbers may also cause recognition anomalies.

2.2. Log-based anomaly detection

Although supervised log anomaly detection methods have better overall performance, they need a well-marked and labeled dataset, which is time-consuming to obtain. Researchers have tried to solve the problem of difficult dataset collection and labeling for supervised learning methods. For example, Meng et al. [24] proposed a framework for modeling log streams as natural language sequences, called LogAnomaly, whose basic idea was to convert log templates into vectors and classify log templates by calculating the spatial distance of the vectors. This method only needs a small number of normal log datasets to train the proposed deep learning model, which then detects abnormal log data. However, this method cannot adapt to the scenario where log templates increase due to continuous updates of the system. Aiming at the problem that old anomaly detection methods fail over time, Zhang et al. [25] proposed a new log anomaly detection method, called LogRobust. This method semanticized the content of the log template instead of directly representing the template with a number, which reduces the loss of information and enables it to handle new logs with minor changes.

Compared with supervised anomaly detection methods, unsupervised anomaly detection methods do not need to pre-label the dataset. However, due to the lack of labeling information, their overall performance is relatively poor. Because anomaly labels are often unavailable in practice, unsupervised methods are much more practical in real-world service systems. Researchers have proposed many anomaly detection models based on unsupervised methods. For example, Du et al. [8] proposed the DeepLog framework, which processed log information as a natural language sequence and automatically learned log patterns from normal execution flow. DeepLog detected anomalies by determining whether incoming log events violate the prediction results of a stacked LSTM model. But the method still has the problem of low precision caused by the evolution of log sentences over time. Concerning this problem, Yin et al. [26] proposed an unsupervised anomaly detection model called LogCL, which can automatically learn the characteristics of log data and compare the learned characteristics. They used CNN and Bi-LSTM models to automatically extract log features and identify abnormal logs through clustering algorithms. In order to better analyze and understand logs, Vaarandi et al. [14] proposed a log clustering and pattern mining algorithm called LogCluster. The algorithm can group log data, cluster similar logs together, classify them into a cluster, and extract frequent event patterns from them.

Semi-supervised anomaly detection methods only need to use part of the labeled data and most of the unlabeled data to complete the
training. They have the advantages of both unsupervised and supervised methods, so they achieve good performance at a relatively small cost. For instance, Lin et al. [27] introduced a novel and practical log-based anomaly detection approach called PLELog. This method employed semi-supervised learning to eliminate the time-consuming manual labeling process and integrated historical anomaly knowledge through probabilistic label estimation, leveraging the advantages of supervised approaches. Aiming at the problem of semi-supervised log anomaly detection, where the only training data available are normal logs from a baseline period, Yen et al. [28] proposed a model called "CausalConvLSTM". The model combined the convolutional neural network (CNN) [29] and the long short-term memory network (LSTM) [30], which can effectively model and predict time series data, thereby realizing log anomaly detection. This method used a two-stage training procedure. The first stage was supervised training, which used labeled normal log data and the cross-entropy loss function to optimize the model. The second stage was semi-supervised training, which used unlabeled log data and a semi-supervised loss function to optimize the model. This method can effectively detect abnormal information in logs and has high accuracy and robustness.

Many research studies have been conducted in the field of log template extraction and log anomaly detection, yielding promising results. However, most of these studies have primarily focused on the time series aspect of the logs, overlooking certain invariant features and statistical characteristics of the logs. This limited utilization of log information in anomaly detection has led to incomplete coverage and gaps in the detection of anomalies.

3. The proposed WDLog framework

WDLog mainly contains two modules: log template extraction and anomaly detection, as shown in Fig. 1. The log template extraction relies on an optimized version of Drain to obtain the log templates; anomaly detection combines temporal features, invariant features, and statistical features with the Wide & Deep framework to comprehensively evaluate anomaly logs.

3.1. Log template extraction

Log template extraction mainly includes three modules: log pre-processing, original template generation, and template compression. The objective of log pre-processing is to eliminate strings in the log data that cannot be recognized by the algorithm. This step involves replacing certain portions of the content in the log statements with a designated mask. Original template generation mainly builds a log parsing tree according to the format and content characteristics of the pre-processed logs and then generates initial log templates based on the parsing tree structure and its leaf node distribution. The core of template compression is to convert initial log templates into semantic vectors and cluster them, as sketched below.
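To make the compression step concrete, the following is a minimal sketch, assuming GloVe-style word vectors [31] (replaced here by toy vectors) and DBSCAN clustering; the vector values and the eps threshold are illustrative assumptions, not the paper's settings.

```python
# A minimal sketch of template compression: initial templates are embedded as
# averaged word vectors (toy vectors stand in for pre-trained GloVe embeddings)
# and merged with DBSCAN, so synonym-level wording changes ("down" vs. "off")
# collapse into one compressed template.
import numpy as np
from sklearn.cluster import DBSCAN

# Toy word vectors; in practice these would come from pre-trained GloVe.
WORD_VECS = {
    "connection": np.array([0.9, 0.1, 0.0]),
    "down":       np.array([0.1, 0.8, 0.1]),   # near-synonym of "off"
    "off":        np.array([0.1, 0.7, 0.2]),
    "block":      np.array([0.0, 0.1, 0.9]),
    "received":   np.array([0.5, 0.5, 0.0]),
}

def embed(template: str) -> np.ndarray:
    """Average the vectors of known words in a template."""
    vecs = [WORD_VECS[w] for w in template.lower().split() if w in WORD_VECS]
    return np.mean(vecs, axis=0)

templates = ["connection down", "connection off", "block received"]
X = np.stack([embed(t) for t in templates])

# eps is a hypothetical distance threshold; the paper does not report one here.
labels = DBSCAN(eps=0.2, min_samples=1).fit_predict(X)
print(labels)  # [0 0 1]: the two synonymous templates share a cluster
```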
3.1.1. Log pre-processing

Log statements usually contain compound words consisting of more than two words, as well as special symbols. An out-of-vocabulary (OOV) exception occurs if these words, which are not recognized by the dictionary, are not replaced. Log statements also contain some strings with a fixed format and high frequency, such as IP addresses, IP port numbers, and fixed-length hexadecimal strings. Replacing them with constants corresponding to their semantics not only expresses the real semantics of the log statement more effectively, but also speeds up parsing and reduces its difficulty, thereby improving the generation speed of the parsing tree and the accuracy of classification. Thus, log pre-processing consists of three main steps:

Step 1: Analyze the linguistic characteristics of log files, identify the strings that can be replaced and meet the requirements, and construct a substitution table that maps the strings to their semantics, as shown in Table 1.

Step 2: Identify the words in the log file based on the vocabulary list. If any unrecognized words are found, add an out-of-vocabulary (OOV) entry to the substitution table.

Step 3: Construct corresponding regular expressions based on the identified strings. Perform a global search on the log file to find string variables that match the regular expressions, and replace the original string variables with the replacement words from the substitution table. The main procedure is described in Algorithm 1; a sketch of this step is given below.
Table 3
Statistical features obtained by setting different thresholds.

Dataset   Threshold   Statistical features (example events)    Feature count
HDFS      0.5         event3, event4                           2
HDFS      0.4         event3, event4, event16, event33         4
HDFS      0.3         ...event11, event12, event33...          7
BGL       0.5         event19, event127                        2
BGL       0.4         event19, event127                        2
BGL       0.3         ...event30, event31, event32...          26

Wide & Deep [10] is not a specific model, but rather a programmatic idea combining deep learning and machine learning. This architecture consists of two parts: Wide and Deep. The Wide part is generally a generalized linear model, which gives the whole model a strong memory capability and applies to simple and strongly correlated features. The Deep part, on the other hand, gives the model generalization ability and is suitable for learning complex features. Our architecture mainly consists of a Gradient Boosting Decision Tree-Logistic Regression model (GBDT-LR) [33] on the Wide end and an Attention-Based Gated Recurrent Unit (Attention-GRU) model on the Deep end.

In order to make full use of the three different features, we feed them into different parts of this anomaly detection model. The temporal features are input to the Deep end, while the invariant features and statistical features are input to the Wide end. After abstracting the high-dimensional features on the Wide end and Deep end, the features are combined and used for prediction through a logistic regression layer. Unlike anomaly detection methods that use only deep learning or machine learning, our approach combines both the Deep and Wide ends. This combination allows more features to be processed by type and exploits more useful information in the logs to achieve better anomaly detection capabilities. The main components of the model, namely the Deep and Wide ends, are described in detail below.

Deep end. We choose the Attention-Based Gated Recurrent Unit model as the Deep end for processing the input content. The Gated Recurrent Unit (GRU) [34] is a modified Recurrent Neural Network (RNN). Compared to the basic RNN structure, the GRU shows improved performance in handling the vanishing gradient problem and capturing long-term dependencies in sequential data. The GRU also has a simpler internal structure and better computational speed than another improved RNN model, LSTM [30]. However, using the GRU alone is not enough. In a log anomaly detection model, the output of each GRU unit depends on the current log event and on the memory of the past. But if a log sequence is too long, anomalous events far apart may have an impact on the current event, which can affect the validity of the prediction. To eliminate this effect, we use an attention mechanism [35] to reassign weights to the degree of influence of past memory units on the current output; the process is shown in Fig. 2.

The output of each GRU unit depends on the input log event of this unit and the hidden state of the previous GRU unit. At the same time, the attention mechanism learns the weight with which each output contributes to the accuracy of the prediction, increasing the degree of effective memory and improving the reliability of the prediction results. Here are some examples that show the effectiveness of the attention mechanism. Due to the complexity of the log system, the distribution of log sequences also shows diversity. There may be partially similar log sequences in the system. Figs. 3 and 4 depict two possible scenarios for log sequences. There are cases where the same sequence is followed by a number of different sequences within the system, as shown in Fig. 3. There are also cases where different sequences are followed by the same sequence, as shown in Fig. 4. In the above two cases, if the two log sequences have the same state, i.e., they are both normal or abnormal, then their common parts contribute more to the model. Otherwise, their different parts contribute more to the model. Such potential sequential relationships can be better learned by using the GRU model with the attention mechanism. By learning such timing characteristics, abnormal execution processes can be identified more accurately.

The calculation process on the Deep end is shown in Formula (4). In the following formula, $x_i$ is the log template vector of the $i$th log in the log sequence, $z_i$ and $r_i$ are the outputs of the update gate and reset gate respectively, $h_i$ represents the hidden state at time $i$, $W^z$, $W^r$ and $W^h$ are the corresponding weight matrices, and $\lambda_i$ are the attention weights assigned to the hidden states. $Output_{Deep}$ is the final output of the Deep end.

$$
\begin{aligned}
X_{Temporal} &= \{x_1, x_2, \ldots, x_n\} \\
z_i &= \sigma(W^z \cdot [h_{i-1}, x_i]) \\
r_i &= \sigma(W^r \cdot [h_{i-1}, x_i]) \\
\tilde{h}_i &= \tanh(W^h \cdot [r_i \odot h_{i-1}, x_i]) \\
h_i &= (1 - z_i) \odot \tilde{h}_i + z_i \odot h_{i-1} \\
Output_{Deep} &= \tanh\Big(\sum_{i=1}^{n} \lambda_i h_i\Big)
\end{aligned}
\tag{4}
$$
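For concreteness, the following is a minimal PyTorch sketch of the Deep end under Formula (4). The hidden size (150) and number of GRU layers (4) are the values reported in Table 5, but the exact attention parameterization is an assumption, since the paper only specifies that weights $\lambda_i$ over the hidden states are learned.

```python
# A minimal sketch of the Deep end (Attention-GRU) following Formula (4).
# The linear scoring layer that produces lambda_i is an assumed parameterization.
import torch
import torch.nn as nn

class AttentionGRU(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 150, num_layers: int = 4):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, num_layers, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1, bias=False)  # produces lambda_i

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim) -- sequence of log template vectors
        h, _ = self.gru(x)                          # h: (batch, seq_len, hidden_dim)
        lam = torch.softmax(self.score(h), dim=1)   # lambda_i: (batch, seq_len, 1)
        return torch.tanh((lam * h).sum(dim=1))     # Output_Deep: (batch, hidden_dim)

# Usage: a batch of 8 sequences of 20 template vectors of dimension 300.
deep_out = AttentionGRU(input_dim=300)(torch.randn(8, 20, 300))
print(deep_out.shape)  # torch.Size([8, 150])
```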
Wide end. We choose the Gradient Boosting Decision Tree-Logistic Regression model (GBDT-LR) as the Wide end. The invariant features and statistical features are superficial for log anomaly detection, and the Gradient Boosting Decision Tree (GBDT) can easily memorize such relationships. Therefore, in the subsequent detection phase, the GBDT is able to identify anomalies caused by the invariant and statistical features of the logs.

The prediction process of the GBDT is shown in Eq. (5). $X$ is the combination of invariant features and statistical features of the log sequence, which is used as input to the Wide end. $F_{m-1}(X)$ represents the cumulative output of the first $m-1$ trees, $h_m(X)$ represents the predicted value of the leaf node in the $m$th tree, and the result obtained after $M$ iterations is $Output_{Wide}$.

$$
\begin{aligned}
X &= concat\{X_{Invariant}, X_{Statistical}\} \\
F_m(X) &= F_{m-1}(X) + h_m(X) \\
Output_{Wide} &= F_M(X)
\end{aligned}
\tag{5}
$$

Furthermore, the integration of Logistic Regression (LR) on top of GBDT has proven to be a significant improvement. This approach adds only one layer of computation to the GBDT algorithm and achieves a more comprehensive prediction by taking multiple features into account. Our method uses the GBDT to abstract the invariant and statistical features and recombine the useful information into new features that have a more positive impact on the outcome. Then, as shown in Formula (6), the LR layer takes all original features (including invariant features and statistical features) together with the newly generated features as input and predicts the final result.

$$
\begin{aligned}
Input_{LR} &= concat\{Output_{Deep}, X_{Invariant}, X_{Statistical}, Output_{Wide}\} \\
Result &= LR(Input_{LR})
\end{aligned}
\tag{6}
$$

Note that the Wide end is trained before the Deep end and the Logistic Regression. We first train the GBDT using the invariant and statistical features with the corresponding labels. Then we use the trained GBDT as the Wide end to get the synthetic features $Output_{Wide}$, which form part of the Logistic Regression's input. At last, we apply backward propagation and gradient descent to train the Deep end and the Logistic Regression simultaneously.
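A minimal scikit-learn sketch of this two-stage procedure (GBDT first, then LR over the concatenated features of Eq. (6)) is given below. The Deep-end output is faked with random vectors, sklearn's LogisticRegression stands in for the jointly trained LR layer, and only the estimator count (150, from Table 5) is taken from the paper.

```python
# A minimal sketch of the Wide end and the final LR layer (Eqs. (5)-(6)):
# stage 1 trains the GBDT on invariant + statistical features; stage 2 trains
# LR on concat{Output_Deep, raw features, Output_Wide}. The Deep-end output is
# a random stand-in here; feature dimensions are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X_wide = rng.normal(size=(n, 12))         # invariant + statistical features
y = rng.integers(0, 2, size=n)            # anomaly labels
output_deep = rng.normal(size=(n, 150))   # stand-in for the Attention-GRU output

# Stage 1: train the GBDT Wide end (150 estimators, as in Table 5).
gbdt = GradientBoostingClassifier(n_estimators=150).fit(X_wide, y)
output_wide = gbdt.decision_function(X_wide).reshape(-1, 1)  # F_M(X), Eq. (5)

# Stage 2: LR over concat{Output_Deep, X_Invariant/Statistical, Output_Wide}.
input_lr = np.hstack([output_deep, X_wide, output_wide])
lr = LogisticRegression(max_iter=1000).fit(input_lr, y)
result = lr.predict(input_lr)             # final anomaly decisions, Eq. (6)
print(result[:10])
```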
4. Experimental evaluation

In this section, we present the experiments conducted to validate the effectiveness of the WDLog framework. We begin by describing our experimental setup in Section 4.1. Next, in Section 4.2, we discuss the comparative methods employed and the performance metrics utilized. Subsequently, in Section 4.3, we outline the model settings and experimental parameters employed in our study. The impact of
Table 4
Dataset sequence distribution.

Dataset   Split            Normal sequences   Abnormal sequences
HDFS      training set     333,792            11,244
HDFS      validation set   55,635             1,871
HDFS      testing set      168,796            3,723
BGL       training set     24,146             27,200
BGL       validation set   4,012              4,545
BGL       testing set      12,116             13,558

Fig. 3. Same sequence followed by different sequences.
Table 5
Model settings for HDFS and BGL.

Parameter       HDFS   BGL
Depth           4      4
Similarity      0.6    0.5
Hidden States   150    150
GRU Layers      4      4
Estimators      150    150

Given the huge number of log statements in the HDFS dataset (10 million) and the BGL dataset (millions), the impact of redundant templates on accuracy is minimal. However, the presence of redundant templates does affect the total number of template types, and having too many templates can increase the time overhead and resource consumption of downstream tasks. Therefore, it is important to find a parameter selection that strikes a balance.

Fig. 5. Number of generated log templates on HDFS.

Based on the experimental results, setting Depth = 4 and Similarity = 0.6 is found to be the best parameter configuration for HDFS logs, while Depth = 4 and Similarity = 0.5 are optimal for BGL logs. This choice of parameters ensures a reasonable number of extracted templates while maintaining accuracy.
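As an illustration only, the role of the Depth and Similarity parameters in Table 5 can be sketched with the open-source drain3 implementation of Drain [12] as a stand-in; this is an assumption for demonstration, not the paper's own optimized Drain code.

```python
# Illustrating the Depth/Similarity settings of Table 5 with the open-source
# drain3 package (a stand-in, not the paper's optimized Drain implementation).
from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.drain_depth = 4     # "Depth" in Table 5: fixed depth of the parsing tree
config.drain_sim_th = 0.6  # "Similarity" in Table 5 (0.6 for HDFS, 0.5 for BGL)

miner = TemplateMiner(config=config)
for line in [
    "Received block blk_1 from 10.0.0.1",
    "Received block blk_2 from 10.0.0.2",
]:
    result = miner.add_log_message(line)
print(result["template_mined"])  # e.g. "Received block <*> from <*>"
```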
Table 6
Performance of different methods.

Log    Method            Precision   Recall   F1 Score
HDFS   DeepLog [8]       0.945       0.900    0.922
HDFS   LogAnomaly [24]   0.860       0.898    0.878
HDFS   PLELog [27]       0.957       0.888    0.921
HDFS   LogCluster [14]   1.000       0.836    0.911
HDFS   WDLog             0.979       0.925    0.951
BGL    DeepLog [8]       0.907       0.956    0.931
BGL    LogAnomaly [24]   0.972       0.941    0.956
BGL    PLELog [27]       0.967       0.978    0.973
BGL    LogCluster [14]   0.914       0.642    0.754
BGL    WDLog             0.994       0.978    0.986
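The Precision, Recall, and F1-score values in Table 6 follow their standard definitions; a minimal sketch of how such scores are computed from true and predicted anomaly labels (toy labels here, not the experimental data) is shown below.

```python
# How the Table 6 metrics are computed from true and predicted anomaly labels
# (standard definitions; the label arrays below are toy examples).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = anomalous sequence, 0 = normal
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # P=0.750 R=0.750 F1=0.750
```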
Table 7
Time consumption of different methods on HDFS and BGL.

Method            HDFS Training   HDFS Testing   BGL Training   BGL Testing
DeepLog [8]       1 h 50 m        20 m           44 m           7 m
LogAnomaly [24]   4 h 40 m        477 m          4 h 20 m       39 m
PLELog [27]       43 m            42 s           24 m           10 s
LogCluster [14]   19 m            23 s           41 s           40 s
WDLog             53 m            51 s           30 m           12 s

Table 8
Ablation results on HDFS and BGL.

Dataset   Scheme              Precision   Recall   F1 Score
HDFS      without Attention   0.981       0.803    0.883
HDFS      without Wide        0.973       0.895    0.932
HDFS      without Deep        0.975       0.716    0.825
HDFS      WDLog               0.979       0.925    0.951
BGL       without Attention   0.978       0.904    0.940
BGL       without Wide        0.979       0.955    0.967
BGL       without Deep        0.979       0.782    0.869
BGL       WDLog               0.994       0.978    0.986

Fig. 10. Robustness of different methods on changing HDFS and BGL logs.
evaluated using the F1-score, and the results were visualized in the box plot depicted in Fig. 10.

DeepLog heavily relies on the temporal features of logs for anomaly detection. It employs log template tags to represent log statements as time series. However, when logs change, new log statements are categorized into new template tags. Unless DeepLog is retrained, it struggles to handle the new time series correctly. Hence, DeepLog exhibits the highest box plot height and demonstrates poor robustness. LogCluster and LogAnomaly both rely on semantic information for anomaly detection. However, LogCluster performs better when logs change, because LogAnomaly divides logs into more fine-grained categories, thereby amplifying the impact of semantic changes and reducing robustness. On the other hand, PLELog utilizes both semantic and temporal features for anomaly detection. While both semantics and temporal sequences may change during log iteration, PLELog effectively handles log changes with good robustness. However, PLELog relies on a limited set of features, and if these features change during log iteration, its anomaly detection performance will be significantly affected.

WDLog is crafted to tackle the problem of evolving logs, particularly during template extraction. It replaces log statements with template semantics to minimize sensitivity to variations introduced by evolving logs. In addition, the DBSCAN used for template clustering is able to classify the logs into the correct templates even when there are minor changes in the logging system, such as synonym substitutions, abbreviations, and so on. Different from traditional approaches, WDLog incorporates invariant and statistical features in its anomaly detection process. This comprehensive strategy enhances the framework's robustness to the dynamic nature of evolving logs.

4.5.3. Time consumption

Time consumption is a key indicator for evaluating anomaly detection systems and plays a vital role in real-life scenarios, where a fast and effective response is critical. Our evaluation breaks down the time consumption into two key components: training time, which reflects how efficiently WDLog adapts to the specific characteristics of the log data during the training phase, and testing time, which focuses on how efficiently WDLog detects anomalies in the log data. The experimental data are shown in Table 7. In general, all the current log anomaly detection methods take only a short time to train and test, which can satisfy daily anomaly detection tasks.

It is clear that the least time-consuming method is the clustering method LogCluster, while the LSTM-based methods DeepLog and LogAnomaly consume much more time than the GRU-based method PLELog and the WDLog of this paper.

The clustering methods mainly focus on the selection of clustering indexes, and the whole clustering process is relatively simple and costs little time. The GRU model is a simplification of the LSTM model: the input, output, and forget gates of LSTM are reduced to the reset and update gates of the GRU, which removes a large part of the computation and costs less time.

Compared with PLELog, which also uses the GRU model, our method adds semantic clustering in the preprocessing stage and a Wide end in the anomaly detection stage to perform the anomaly detection task together. The overall complexity and computation increase to some extent, but compared with PLELog, which only has the Deep end, the training phase only increases by 10 min and the testing phase only increases by 2 s. The time consumption is still in an advantageous position among all the compared methods.

4.6. Ablation experiment

To assess the contribution of different components to the overall model, we conducted ablation experiments on the attention mechanism, the Wide model, and the Deep model. Each component was masked out separately, while the remaining parts of the model were left unchanged. Anomaly detection was performed on HDFS logs and BGL logs, and the results are presented in Table 8.

The experimental results demonstrate that the attention mechanism plays a significant role in improving the overall performance of the algorithm. With numerous features involved, it is crucial to effectively learn and utilize the relevant features to avoid interference in predictions. The attention mechanism helps focus the model's learning on important features, thereby enhancing its performance. Regarding the Wide and Deep components within the model, the results indicate that omitting the Wide component reduces the overall prediction ability, whereas omitting the Deep component diminishes it even more severely. The Deep component leverages temporal features for anomaly detection, while the Wide component predominantly utilizes invariant features and statistical features. These components complement each other in prediction. Therefore, to achieve optimal results, the Deep component needs to be supplemented by the Wide component.

In summary, the attention mechanism contributes to overall performance improvement, and both the Wide and Deep components are crucial for achieving the best results.
5. Discussion

5.1. Class imbalance

We believe that expanding the anomaly log dataset using class balancing techniques, such as manually creating anomaly logs, to improve model performance presents significant challenges, because the temporal features required by the model are difficult to replicate, and expanding the statistical and invariant features may result in overfitting. However, this is still a worthwhile direction to pursue, and we will consider applying class balancing techniques to log anomaly detection in future work.

5.2. Scalability

We conducted a comprehensive evaluation of the log anomaly detection framework. In terms of handling log data at scale, the framework exhibits excellent performance and strong linear scalability, and can effectively handle datasets of different sizes. In terms of model training, the use of Wide & Deep learning-based training solutions greatly reduces training time and improves the efficiency of the system in processing massive datasets. In addition, the framework shows flexibility in hardware resource utilization and model update maintenance, providing a reliable foundation for long-term maintenance of the system. Overall, these findings not only guide the optimization of current log anomaly detection system performance, but also provide substantial support for future applications in large-scale data environments.

6. Conclusion

In this study, we presented WDLog, a robust log-based anomaly detection framework designed to effectively identify abnormal logs even in the presence of log changes. We improved the Drain algorithm to generate log templates that are less sensitive to changing words, thereby preserving more semantic information and enhancing robustness. Additionally, we extracted multiple features based on correlation to better capture the characteristics of different anomalies. By leveraging the Wide & Deep framework, we employed different feature types to detect anomaly logs, thereby improving the accuracy of the detection process. Experimental results on the HDFS and BGL datasets demonstrated the effectiveness of WDLog in anomaly detection. However, it should be noted that WDLog currently does not support intelligent word segmentation in log processing. As part of future research, we will focus on further advancements in this field and explore intelligent word segmentation techniques to enhance the capabilities of WDLog.

CRediT authorship contribution statement

Weina Niu: Writing – review & editing, Writing – original draft, Software, Methodology, Formal analysis, Conceptualization. Xuhan Liao: Writing – review & editing, Software, Methodology, Investigation, Data curation, Conceptualization. Shiping Huang: Writing – review & editing, Writing – original draft, Software, Data curation. Yudong Li: Software, Investigation, Formal analysis. Xiaosong Zhang: Writing – review & editing, Supervision, Funding acquisition. Beibei Li: Writing – review & editing, Validation, Methodology.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work was supported in part by the National Science Foundation of China under Grant 61902262, and in part by the National Key Research and Development Program of China under Grant 2023QY0101.

References

[1] T. Soper, Amazon Web Services outage affects Adobe, Roku, Twilio, Flickr, others, 2020, https://www.geekwire.com/2020/amazon-web-services-outage-affects-adobe-roku-twilio-flickr-others/.
[2] Annual report of the board of governors of the federal reserve system, 2021, https://www.federalreserve.gov/publications/2021-ar-payment-system-and-reserve-bank-oversight.htm.
[3] Q. Fu, J.-G. Lou, Q. Lin, R. Ding, D. Zhang, T. Xie, Contextual analysis of program logs for understanding system behaviors, in: 2013 10th Working Conference on Mining Software Repositories, MSR, IEEE, 2013, pp. 397–400.
[4] M. Farshchi, J.-G. Schneider, I. Weber, J. Grundy, Experience report: Anomaly detection of cloud application operations using log and cloud metric correlation analysis, in: 2015 IEEE 26th International Symposium on Software Reliability Engineering, ISSRE, IEEE, 2015, pp. 24–34.
[5] K. Zhang, J. Xu, M.R. Min, G. Jiang, K. Pelechrinis, H. Zhang, Automated IT system failure prediction: A deep learning approach, in: 2016 IEEE International Conference on Big Data, Big Data, IEEE, 2016, pp. 1291–1300.
[6] X. Shu, J. Smiy, D.D. Yao, H. Lin, Massive distributed and parallel log analysis for organizational security, in: 2013 IEEE Globecom Workshops, GC Wkshps, IEEE, 2013, pp. 194–199.
[7] P. He, J. Zhu, S. He, J. Li, M.R. Lyu, An evaluation study on log parsing and its use in log mining, in: 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN, IEEE, 2016, pp. 654–661.
[8] M. Du, F. Li, G. Zheng, V. Srikumar, DeepLog: Anomaly detection and diagnosis from system logs through deep learning, in: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 1285–1298.
[9] S. Kabinna, C.-P. Bezemer, W. Shang, M.D. Syer, A.E. Hassan, Examining the stability of logging statements, Empir. Softw. Eng. 23 (2018) 290–333.
[10] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al., Wide & deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 2016, pp. 7–10.
[11] K. Kaushik, G. Sharma, G. Goyal, A.K. Sharma, A. Chaubey, A systematic approach for analyzing log files based on string matching regular expressions, in: Cyber Security and Digital Forensics: Proceedings of ICCSDF 2021, Springer, 2022, pp. 3–10.
[12] P. He, J. Zhu, Z. Zheng, M.R. Lyu, Drain: An online log parsing approach with fixed depth tree, in: 2017 IEEE International Conference on Web Services, ICWS, 2017, pp. 33–40, http://dx.doi.org/10.1109/ICWS.2017.13.
[13] R. Vaarandi, Mining event logs with SLCT and LogHound, in: NOMS 2008 - 2008 IEEE Network Operations and Management Symposium, IEEE, 2008, pp. 1071–1074.
[14] R. Vaarandi, M. Pihelgas, LogCluster - A data clustering and pattern mining algorithm for event logs, in: 2015 11th International Conference on Network and Service Management, CNSM, IEEE, 2015, pp. 1–7.
[15] H. Dai, H. Li, C.-S. Chen, W. Shang, T.-H. Chen, Logram: Efficient log parsing using n-gram dictionaries, IEEE Trans. Softw. Eng. 48 (3) (2020) 879–892.
[16] Q. Fu, J.-G. Lou, Y. Wang, J. Li, Execution anomaly detection in distributed systems through unstructured log analysis, in: 2009 Ninth IEEE International Conference on Data Mining, IEEE, 2009, pp. 149–158.
[17] L. Tang, T. Li, C.-S. Perng, LogSig: Generating system events from raw textual logs, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, 2011, pp. 785–794.
[18] M. Mizutani, Incremental mining of system log format, in: 2013 IEEE International Conference on Services Computing, IEEE, 2013, pp. 595–602.
[19] K. Shima, Length matters: Clustering system log messages using length of words, 2016, arXiv preprint arXiv:1611.03213.
[20] Z.M. Jiang, A.E. Hassan, P. Flora, G. Hamann, Abstracting execution logs to execution events for enterprise applications (short paper), in: 2008 the Eighth International Conference on Quality Software, IEEE, 2008, pp. 181–186.
[21] P. He, J. Zhu, S. He, J. Li, M.R. Lyu, Towards automated log parsing for large-scale log data analysis, IEEE Trans. Dependable Secure Comput. 15 (6) (2017) 931–944.
[22] A.A. Makanju, A.N. Zincir-Heywood, E.E. Milios, Clustering event logs using iterative partitioning, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 1255–1264.
[23] M. Du, F. Li, Spell: Online streaming parsing of large unstructured system logs, IEEE Trans. Knowl. Data Eng. 31 (11) (2018) 2213–2227.
[24] W. Meng, Y. Liu, Y. Zhu, S. Zhang, D. Pei, Y. Liu, Y. Chen, R. Zhang, S. Tao, P. Sun, et al., LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs, in: IJCAI, vol. 19, 2019, pp. 4739–4745.
[25] X. Zhang, Y. Xu, Q. Lin, B. Qiao, H. Zhang, Y. Dang, C. Xie, X. Yang, Q. Cheng, Z. Li, et al., Robust log-based anomaly detection on unstable log data, in: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 807–817.
[26] C. Yin, Y. Zhang, Unsupervised log anomaly detection model based on CNN and Bi-LSTM (in Chinese), J. Comput. Appl. (2023).
[27] L. Yang, J. Chen, Z. Wang, W. Wang, J. Jiang, X. Dong, W. Zhang, Semi-supervised log-based anomaly detection via probabilistic label estimation, in: 2021 IEEE/ACM 43rd International Conference on Software Engineering, ICSE, 2021, pp. 1448–1460, http://dx.doi.org/10.1109/ICSE43902.2021.00130.
[28] S. Yen, M. Moh, T.-S. Moh, CausalConvLSTM: Semi-supervised log anomaly detection through sequence modeling, in: 2019 18th IEEE International Conference on Machine Learning and Applications, ICMLA, 2019, pp. 1334–1341, http://dx.doi.org/10.1109/ICMLA.2019.00217.
[29] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[30] A. Graves, Long short-term memory, in: Supervised Sequence Labelling with Recurrent Neural Networks, Springer, 2012, pp. 37–45.
[31] J. Pennington, R. Socher, C.D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, 2014, pp. 1532–1543.
[32] A. Ram, S. Jalal, A.S. Jalal, M. Kumar, A density based algorithm for discovering density varied clusters in large spatial databases, Int. J. Comput. Appl. 3 (6) (2010) 1–4.
[33] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al., Practical lessons from predicting clicks on ads at Facebook, in: Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, 2014, pp. 1–9.
[34] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, 2014, arXiv preprint arXiv:1406.1078.
[35] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014, arXiv preprint arXiv:1409.0473.
[36] S. He, J. Zhu, P. He, M.R. Lyu, Loghub: A large collection of system log datasets towards automated log analytics, 2020, arXiv preprint arXiv:2008.06448.