Anomaly detection in system log data using lightweight multi 2024

Uploaded by

Riaz Ahmad

Anomaly detection in system log data using lightweight multi-tasking machine learning algorithm

Abstract
Log anomaly detection plays a vital role in monitoring IT systems, alongside metrics and traces.
Anomalies can be identified through two main types of logs: individual logs and log sequences.
Individual logs reflect the status of the system at a specific moment, while the aggregation of multiple
logs illustrates the execution paths within the system. When the patterns observed in log sequences
significantly diverge from expected normal behaviors, it may indicate the presence of system anomalies.
Supervised learning methods are typically favored for detecting anomalies in log sequences due to their
superior performance; however, these methods necessitate labeled data for model training. As systems
evolve, the volume of logs can increase dramatically, rendering the labeling process labor-intensive and
often impractical. Consequently, alternative learning approaches, such as semi-supervised or
unsupervised methods, become more viable for detection tasks. In practice, the detection of log anomalies
is fraught with challenges, including unstable log formats, the emergence of new log types, and the
complexities of unexplored log semantics. To tackle these issues and improve detection performance, we
introduce a lightweight semi-supervised multi-task learning method called MultiLog in this paper. The
core components of this proposed method include the pre-trained language model BERT, dimensionality
reduction techniques, attention mechanisms derived from the Transformer architecture, and multi-task
learning strategies. In line with previous research, we perform extensive experiments on three well-known
datasets: HDFS, BGL, and Thunderbird. Our proposed model is not only 50 times smaller in size but also
maintains comparable F1-Scores to the original model. Furthermore, it surpasses baseline methods in
effectiveness, achieving performance levels akin to those of supervised learning models.

Index Terms: System logs, anomaly detection, masked language model, multi-task learning,
Transformer
Introduction
System logs are essential sources of information that document significant events and illustrate the
behavior and operation of systems. When errors, failures, or performance issues arise, these logs are
instrumental in analyzing and identifying the root causes (Pham & Lee, 2024, p. 1). However, as the size and
complexity of logs grow due to system evolution, manually diagnosing issues becomes increasingly difficult.
Consequently, it is crucial to develop automated methods for detecting anomalies and promptly notifying
system operators to avert potential severe consequences.
Anomaly detection in system logs entails identifying patterns that diverge from expected normal behaviors.
These deviations can signal possible system malfunctions, security threats, or operational inefficiencies.
The definition of normal behaviors can vary significantly, depending on the specific application domains or
the interests of the analysts involved. Within the context of logs, two primary types of inputs are utilized to
detect abnormal system activities: individual logs and log sequences.
Research studies [3]–[5] have investigated the use of sentiment analysis to identify anomalies in individual
logs by employing two sentiment polarities: positive for normal behavior and negative for abnormal
behavior. These studies found that event logs predominantly express positive sentiments when systems
are operating correctly. Conversely, when unexpected events such as crashes or failures take place, the
logs tend to reflect negative sentiments.
Unlike individual logs, analyzing unusual execution patterns through log sequences presents greater
challenges. Specifically, the patterns within logs can vary significantly due to the dynamic nature of log data.
In this paper, we investigate execution patterns derived from log sequences to identify log anomalies.
Detection methods that utilize log sequences are generally categorized into two types: traditional machine
learning-based methods and deep learning-based methods. Machine learning approaches, such as
Support Vector Machines (SVM) [6], Isolation Forest (IF) [7], and Principal Component Analysis (PCA) [8],
typically focus on counting message types for each session window and aggregating them into message
count vectors, which serve as invariant features for training and detection. However, this method overlooks
the temporal relationships among log messages. To address this limitation, deep learning techniques
convert log sequences into log template ID sequences [9], [10] or semantic vector sequences [11]–[16] for
model input. Supervised methods, which benefit from labeled data, effectively extract both normal and
abnormal log patterns, achieving remarkable performance in log anomaly detection [11], [12], [14].
Nonetheless, the process of labeling log sequence data is labor-intensive and often impractical due to the
vast volume of logs generated daily (e.g., 50GB/h) [17]. In the absence of labeled data, semi-supervised
learning methods emerge as more viable alternatives, operating under the reasonable assumption that the
training data predominantly consists of normal system behaviors.

Log messages are essentially print statements produced by software developers, and their formats can
vary based on the specific system or software architecture. Typically, a log message consists of two
components: a constant part and a variable part, as depicted in Figure 1. The initial step in processing logs
involves log parsing, which aims to distinguish between these fixed and variable elements. The result of this
parsing process yields log templates and their corresponding IDs. Given that log events are articulated in
human language, relying solely on log template IDs for detection can overlook the semantic significance of
these templates. Consequently, recent studies advocate for the use of semantic vector sequences to
enhance anomaly detection.
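As a rough illustration of separating the constant and variable parts of a message, the sketch below masks a few common variable fields with placeholders using regular expressions and assigns an ID to each resulting template. The patterns, the placeholder names, and the sample message are assumptions chosen for illustration; they are not the parser or the preprocessing rules used in this paper.

```python
import re

# Hypothetical patterns for common variable fields; real systems need their own rules.
# Order matters: IP addresses and block IDs are masked before bare numbers.
PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),
    (re.compile(r"\bblk_-?\d+\b"), "<BLK>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(message):
    """Replace variable fields with placeholders, leaving the constant part."""
    for pattern, placeholder in PATTERNS:
        message = pattern.sub(placeholder, message)
    return message


def assign_template_ids(messages):
    """Map each distinct template to an integer ID, in order of first appearance."""
    template_to_id = {}
    ids = []
    for message in messages:
        template = to_template(message)
        if template not in template_to_id:
            template_to_id[template] = len(template_to_id)
        ids.append(template_to_id[template])
    return template_to_id, ids
```

Two messages that differ only in their variable fields map to the same template and therefore to the same ID, which is the property the later stages rely on.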
In this paper, we focus on addressing three key challenges in log anomaly detection: the large size of
model weights, the instability of logs, and the lack of exploration of log semantics. The Transformer
architecture [18] is a type of neural network that has demonstrated remarkable effectiveness in various
natural language processing applications. Transformers employ self-attention mechanisms, allowing them
to consider all positions of each token in the input sequences simultaneously, which enhances the learning
of relationships among log messages. NeuralLog [14] leverages a BERT base model [19] with 110 million
parameters to create semantic vectors for log messages. Notably, BERT is constructed solely with
Transformer encoders and does not incorporate Transformer decoders. Similarly, LAnoBERT [15] initializes
its network using the same parameters as the BERT base model for both training and detection. However,
the substantial number of trainable parameters in these models poses challenges for practical deployment
due to extended training and prediction times.
In practical applications, logs are typically produced by various components or processes and subsequently
consolidated in a central location for analysis. This aggregation process can encounter several challenges,
including network congestion, packet loss, and communication errors, which may introduce noise into the
log data. Such noise can manifest as missing, duplicated, or disordered log entries, resulting in
uncertainties regarding the temporal alignment and completeness of the log information. Additionally,
logging statements are frequently modified as developers update and refactor software, which can cause
these altered logs to be recognized as new log templates. This situation may lead to false positives in
detection methods.

To tackle the aforementioned challenges in log anomaly detection, we introduce a semi-supervised
approach called MultiLog, which utilizes multi-task learning and the Transformer architecture. Initially, we
leverage the SBERT model [22], a variant of the pre-trained BERT network, to extract semantically rich
sentence embeddings that have been fine-tuned on extensive datasets. After converting log messages into
log templates, SBERT is employed to embed all unique log templates. The resulting embedding vectors are
then utilized to initialize the log template matrix.

Next, we recognize that the size of the log template embeddings has a significant effect on the overall
model weight. To mitigate this, we train a PCA model [23] using the SNLI dataset to reduce the
dimensionality of the vectors. Furthermore, we apply a self-attention mechanism to learn the
interrelationships among log messages and to identify patterns indicative of normal behavior. Additionally,
our model is trained using a multi-task learning framework with three specific objectives: predicting the next
log template ID, predicting masked template IDs, and minimizing the distance between the encoded [CLS]
vector and an anchor vector. By combining self-attention with the third objective, we aim to effectively
address the instability issues present in log data.

The contributions of this study are outlined as follows:

• To our knowledge, we introduce the first semi-supervised method for log anomaly detection that
incorporates multi-task learning and dimension reduction for template embedding vectors.

• In contrast to prior research that addresses log instability through supervised learning methods, we
approach this issue using a semi-supervised framework, which presents a greater challenge.

• We perform extensive experiments on three widely recognized datasets to gather insights and
assess model performance.

• This study is the first to investigate anomaly patterns across these three datasets and to compare
the effectiveness of using top candidates versus predictive probabilities as anomaly scores.

• To ensure reproducibility, we make our code and datasets publicly available.

The remainder of this paper is organized as follows. Section II offers a review of existing
literature that discusses both supervised and semi-supervised approaches to log anomaly detection. In
Section III, we detail our proposed method, including the background and overall architecture. Section IV
outlines the specifics of the datasets used and the experimental setup. Section V presents the results of our
experiments along with insights gained from the ablation study. Finally, Section VI concludes the paper
and discusses potential future research directions.

Related Work
This section reviews prior research that employs log sequences for anomaly detection. Broadly, these
studies can be classified into two main categories: supervised and semi-supervised learning methods. A
brief timeline of notable approaches to log anomaly detection is illustrated in Figure 2.
DeepLog [9] is recognized as the first study to utilize the LSTM model for identifying abnormal
execution paths within sequences of log messages. This semi-supervised approach is trained exclusively
on data representing normal execution paths to derive patterns of typical behavior. The method employs
the Spell log parser [24] to transform raw logs into log templates, using log template identifiers as input
features. The model is designed to predict the next log template ID based on a sequence of log template
IDs. During the inference phase, if the actual next log template ID is not found among the predicted
top g candidates, the corresponding log template sequence is flagged as an anomaly. However, relying
solely on log template IDs as features neglects the semantic relationships between log events, which can
hinder the model's overall performance.
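The top-g decision rule described above can be sketched in a few lines. The function name and the dictionary of predicted probabilities are assumptions made for illustration; in practice the probabilities would come from the trained sequence model.

```python
def is_anomalous_top_g(next_id_probs, actual_next_id, g):
    """Flag a sequence when the actual next template ID is not among
    the g most probable predictions.

    next_id_probs: dict mapping template ID -> predicted probability.
    """
    ranked = sorted(next_id_probs, key=next_id_probs.get, reverse=True)
    return actual_next_id not in ranked[:g]
```

A larger g makes the rule more tolerant (fewer alarms), while a smaller g makes it stricter.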
To investigate the semantics of log messages, LogAnomaly [10] introduces the template2vec approach,
which extracts semantic information from log templates. This method computes a semantic template
vector by taking the weighted average of the word vectors present in the template, with these word
vectors generated using a lexical-contrast embedding model known as dLCE [25]. Unlike DeepLog,
LogAnomaly is capable of detecting both sequential and quantitative log anomalies by integrating template
vector sequences with template count vectors. During the detection phase, the predicted next log template
IDs are compared against the actual next log template IDs to identify anomalies, similar to the approach
used in DeepLog. To handle modifications or the introduction of new log templates that occur between
consecutive training sessions, LogAnomaly creates a “temporary” template vector that approximates an
existing one based on the similarity of the template vectors.
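One way to picture the matching of a new template to an existing one is a nearest-neighbor lookup over template vectors. The sketch below uses cosine similarity, which is an assumption; the text only says that similarity between template vectors is used. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_known_template(new_vector, known_vectors):
    """Return the ID of the known template whose vector is most similar.

    known_vectors: dict mapping template ID -> vector (list of floats).
    """
    return max(
        known_vectors,
        key=lambda tid: cosine_similarity(new_vector, known_vectors[tid]),
    )
```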
LogRobust [11] is recognized as the first study to tackle the instability problem associated with log data.
This approach utilizes semantic vectorization along with an attention-based Bi-LSTM neural network
[26] to address the issue. The semantic template vector is computed as the weighted average of the word
vectors derived from a log template, with the weights determined using the TF-IDF method [27]. The
word vectors themselves are obtained through the FastText algorithm [28]. Subsequently, the Bi-LSTM
network is employed to capture the contextual relationships within log sequences.
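A TF-IDF weighted average of word vectors can be sketched as follows. The exact TF-IDF variant, the normalization by the total weight, and the fallback to a plain average are assumptions made here; the toy word vectors stand in for real pre-trained embeddings.

```python
import math
from collections import Counter


def tfidf_template_vector(template, all_templates, word_vectors):
    """Weighted average of a template's word vectors, weighted by TF-IDF.

    template: list of words; all_templates: list of such lists;
    word_vectors: dict mapping word -> vector (list of floats).
    """
    counts = Counter(template)
    dim = len(next(iter(word_vectors.values())))
    weighted_sum = [0.0] * dim
    total_weight = 0.0
    for word, count in counts.items():
        tf = count / len(template)
        # Number of templates containing the word (at least 1 to avoid dividing by zero).
        doc_freq = max(1, sum(1 for t in all_templates if word in t))
        idf = math.log(len(all_templates) / doc_freq)
        weight = tf * idf
        total_weight += weight
        for i, value in enumerate(word_vectors[word]):
            weighted_sum[i] += weight * value
    if total_weight == 0.0:
        # Every word occurs in every template: fall back to a plain average.
        return [
            sum(word_vectors[w][i] for w in template) / len(template)
            for i in range(dim)
        ]
    return [value / total_weight for value in weighted_sum]
```

Words that occur in every template receive zero weight, so the resulting vector is dominated by the words that distinguish one template from another.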
HitAnomaly [12] utilizes a hierarchical transformer architecture for log anomaly detection, effectively
leveraging both log template sequences and parameter values. Initially, the Drain log parser [29] extracts
sequences of log templates and parameter values, which are then encoded simultaneously using dedicated
encoders for each type of input. The resulting encoded vectors are combined with attention scores to
classify anomalies. By employing semantic vectorization and the self-attention mechanism inherent in the
Transformer architecture, HitAnomaly demonstrates strong detection capabilities, particularly with
unstable log events.
Building on the principles of BERT, LogBERT [13] introduces a semi-supervised framework that
incorporates two training tasks: masked log key prediction and hypersphere volume minimization, aimed
at detecting anomalous log sequences. In the first task, the model learns to predict log templates that have
been randomly masked within normal log sequences. The second task focuses on clustering normal log
sequences closer together in the embedding space. Unlike DeepLog, which predicts the next log
template, LogBERT predicts the randomly masked log templates during the detection phase, also utilizing
the concept of top g candidates for anomaly detection. However, a significant limitation of this
approach is that the log template embedding matrix is trained from scratch, missing the advantages of
pre-trained embeddings. This can hinder the identification of evolving logs due to the limited vocabulary
present in log datasets.
NeuralLog [14] takes a different approach by using a pre-trained BERT model to embed preprocessed
log messages without relying on a log parser. The semantic vectors generated from these log messages are
then input into a Transformer encoder for classification. While this model achieves high detection
performance, it suffers from slow training and inference times, making it impractical for real-time
applications. The use of the BERT-base model, which contains 110 million parameters to convert each
log message into its corresponding semantic vector, creates a bottleneck in the process.
To facilitate real-time local anomaly detection on edge devices, LightLog [16] employs a dimensionality
reduction technique known as PPA [30] to compress high-dimensional semantic vectors into
lower-dimensional representations. A temporal convolutional network (TCN) is then trained for binary
classification tasks.
In contrast to LogBERT, LAnoBERT [15] proposes a semi-supervised log anomaly detection method that
does not utilize the Drain log parser. Instead, raw log messages are preprocessed into log templates using
regular expressions to identify elements such as numbers, dates, and IP addresses. These log templates are
subsequently tokenized and converted into semantic vectors using a method similar to that of the original
BERT model. LAnoBERT focuses solely on masked log key prediction as its training objective, but during
the testing phase, it masks all log keys in a sequence and computes predictive probabilities rather than
relying on the top g candidates for anomaly detection.
Although this parser-free approach has its advantages, it can still be significantly affected by log key
explosion [29] if the regular expressions are not well-defined. Additionally, the large number of model
parameters (110 million) poses challenges for real-time deployment.

Proposed Method
In response to the limitations identified in existing semi-supervised learning methods discussed in the
introduction and related works, we have developed a novel framework for anomaly detection designed to
tackle these challenges and enhance detection performance. The overall architecture of MultiLog is
illustrated in Figure 3, which comprises two primary components: the training phase and the inference
phase.
During the training phase, raw log messages are either parsed into log template IDs using a log parser or
preprocessed with regular expressions without the parser. We will assess the effectiveness of both
approaches in the experimental section. Following this, sequences of log template IDs are provided as
inputs to the model. Subsequently, semantic log template vectors are retrieved from the log template
embedding matrix, using the log template IDs as keys. These semantic vectors are then utilized to train a
Transformer encoder, which learns the contextual relationships among the log templates.
During the inference phase, we utilize the probability obtained from predicting the next log template ID as
the anomaly score. This score is then compared against a predefined threshold to classify a log sequence
as either normal or abnormal. To compute the final anomaly score, we mask each token ID in the sequence
and gather all scores associated with the masked sequences using the trained model. The ultimate score
for a log sequence is determined by aggregating the top k scores from this process.
In this framework, the embedding matrix serves as a repository of low-dimensional semantic vectors, with
log template IDs as the keys. This matrix is created only once prior to the model training. Its purpose is to
address semantic and model weight challenges. Additionally, the use of self-attention, the prediction of
masked template IDs, and the minimization of Euclidean distance are intended to tackle issues related to
log instability. Training through a multi-task approach may enhance the effectiveness of each task and
improve overall detection performance. The subsequent subsections will provide further details about this
framework.

A. Dimension Reduction
The dimension reduction phase aims to address the semantic issues of log templates while minimizing
model weights without significantly compromising performance. This phase focuses on generating a low-
dimensional embedding matrix that encapsulates all log templates from the target log data. To tackle the
semantic challenge, we utilize the pre-trained SBERT, which has been fine-tuned on BERT using a large-
scale dataset. SBERT employs Siamese and triplet networks along with cosine similarity during training,
ensuring that similar sentences are positioned closely in the embedding space, while dissimilar ones are
pushed apart.
For reducing model weights, we apply the PCA algorithm to decrease the size of the embedding vectors,
which in turn helps to reduce the overall size of the Transformer model. Figure 4 illustrates the process of
reducing the dimensionality of semantic template vectors during both training and inference.
In the training phase, we begin by randomly selecting 10,000 sentences from the SNLI dataset [31] and
extracting their embedding vectors using the pre-trained SBERT. The choice of the SNLI dataset is
strategic, as it has been used for fine-tuning SBERT, ensuring that its embedding vectors occupy the same
embedding space as those from other large-scale datasets used in the fine-tuning process. This alignment
is advantageous for embedding log templates later on. Following this, a PCA model is trained on the
SNLI embedding vectors to produce lower-dimensional embeddings. It is important to note that the PCA
model is trained only once and will not be retrained, even if MultiLog undergoes retraining with new log
data. The impact of vector dimensionality on the model weights of MultiLog is presented in Table 1.

In the inference phase, the unique log templates are initially embedded using the pre-trained SBERT.
Next, the trained PCA model is utilized to reduce the dimensionality of the log template embedding
vectors. These vectors are then updated in the template embedding matrix and integrated into MultiLog.
The template embedding matrix is initially set up with random weights and consists of four rows, one for
each of the four special tokens: <UNK>, <PAD>, <MASK>, and <CLS>.
B. Training Stage
The training process of MultiLog is illustrated in Figure 5. Initially, unstructured log messages are parsed
into log templates and their corresponding IDs. These log template IDs are then organized in
chronological order and transformed into ID sequences for model input using a sliding window approach.
To improve model performance on predicting masked tokens, we opt to mask an ID sequence in each new
epoch rather than just once during data preprocessing. For each ID sequence, the masking strategy
follows the original BERT methodology, with masked tokens selected randomly at a rate of 15%. To
capture the contextual relationships among log templates within a sequence, a special token <CLS>,
which signifies classification, is added at the start of the sequence. Utilizing the self-attention mechanism,
the final hidden state associated with the <CLS> token encodes the entire log sequence.
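The per-epoch masking step can be sketched as below. This is a simplified illustration under stated assumptions: every selected position is replaced by the mask token (the original BERT recipe also sometimes keeps or randomizes the selected token), and the token strings and function name are placeholders rather than the authors' implementation.

```python
import random

CLS, MASK = "<CLS>", "<MASK>"


def mask_sequence(ids, mask_rate=0.15, rng=random):
    """Prepend <CLS> and replace roughly mask_rate of the IDs with <MASK>.

    Returns (inputs, labels). labels holds the original ID at masked
    positions and None elsewhere, so the loss is computed only where
    a token was hidden.
    """
    inputs, labels = [CLS], [None]
    for template_id in ids:
        if rng.random() < mask_rate:
            inputs.append(MASK)
            labels.append(template_id)
        else:
            inputs.append(template_id)
            labels.append(None)
    return inputs, labels
```

Calling `mask_sequence` again at the start of every epoch, instead of once during preprocessing, gives the model a different set of masked positions for the same sequence each time.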
Once the log key ID sequences are masked, these inputs are converted into their respective embedding
vectors. Unlike LogBERT and LAnoBERT, which train the embedding matrix from scratch, MultiLog
leverages a pre-trained model and reduces vector dimensions in advance, thereby significantly decreasing
both training and inference time. Nevertheless, we still incorporate special tokens such
as <UNK>, <PAD>, <MASK>, and <CLS> during the training phase.
C. Inference Stage
An anomaly score is typically employed to assess whether a log sequence is normal or abnormal. Models
such as DeepLog, LogBERT, and LogAnomaly utilize the concept of top k candidates for their
predictions, whereas LAnoBERT relies on predictive probability as the anomaly score. Through
experimentation with both methods of calculating anomaly scores, we discovered that using predictive
probability yields more effective and stable results compared to the top k candidates. Consequently, we
have adopted this approach for MultiLog, mirroring the strategy used in LAnoBERT.
In contrast to LAnoBERT, which optimizes its model solely by predicting masked tokens, our model
generates three outputs based on three distinct training objectives. As a result, we will evaluate anomaly
scores from all available types and select the most effective one. Figure 8 illustrates the anomaly
detection strategy employed in MultiLog. For a log key ID sequence, each ID is replaced by
the <MASK> token one at a time, resulting in w masked sequence inputs (one per position). The <CLS>
token is then added to the beginning of each masked sequence, and these final sequences are input into the
trained Transformer encoder. From each masked input sequence, the encoder retrieves three output values:
the predictive probability of the actual token corresponding to the masked token, the predictive probability
of the next token, and the Euclidean distance between the encoded <CLS> vector and the center vector C.
Since the model is trained exclusively on normal log sequences, it is anticipated
that <MASK> tokens will not be accurately reconstructed when presented with abnormal log
sequences during the testing phase. Consequently, low predictive probabilities for the masked tokens may
serve as indicators of log anomalies, similar to the predictive probabilities for the subsequent log key IDs.
Additionally, the Euclidean distance is generally small for normal sequences and larger for abnormal
ones. The final anomaly score is calculated as the average of the top k scores from the total w scores,
where k is less than or equal to w.
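The top-k aggregation can be sketched as follows. The conversion of a predictive probability p into a per-position score of 1 - p is an assumption made here so that higher scores mean "more anomalous"; the function names and the threshold comparison are likewise illustrative.

```python
def scores_from_probabilities(probabilities):
    """Turn predictive probabilities into anomaly scores (assumed: 1 - p)."""
    return [1.0 - p for p in probabilities]


def sequence_anomaly_score(position_scores, k):
    """Average of the k largest per-position scores, with 1 <= k <= w."""
    w = len(position_scores)
    if not 1 <= k <= w:
        raise ValueError("k must satisfy 1 <= k <= w")
    top_k = sorted(position_scores, reverse=True)[:k]
    return sum(top_k) / k


def is_abnormal(score, threshold):
    """Classify a sequence by comparing its score with a fixed threshold."""
    return score > threshold
```

With k = 1 the score reduces to the single worst position, while k = w averages over the whole sequence and dilutes isolated anomalies.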
Experimental Setup
In line with previous research, we assess the detection performance of MultiLog using three datasets:
HDFS [34], BGL [35], and Thunderbird [35]. The HDFS dataset is produced by the Hadoop
Distributed File System within a private cloud setting, with labels determined through handcrafted rules. The
BGL dataset is sourced from a BlueGene/L supercomputer at Lawrence Livermore National Laboratories
and includes both alert and non-alert messages categorized by alert tags. Lastly, the Thunderbird dataset is
generated from a supercomputer at Sandia National Laboratories, with log labels similarly identified by alert
category tags, just like the BGL dataset. Table 1 provides detailed information about each log dataset.

For the HDFS dataset, we organize log messages into sequences based on block session IDs. In contrast,
for the BGL and Thunderbird datasets, we create non-overlapping sequences in chronological order using
a fixed window size of 100, treating each window as a session. We divide the data into training and testing
sets at an 80:20 ratio to evaluate the performance of MultiLog and baseline models. To generate
subsequences for model input, we employ a sliding window technique with a size of 10 and a step size of 1.
Additionally, we apply text preprocessing methods to extract unique logs, treating these logs as log
templates without utilizing a log parser.
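The two windowing steps for BGL and Thunderbird can be sketched as below, using the sizes stated above (session size 100, subsequence size 10, step 1) as defaults. The function names are hypothetical, and keeping a shorter trailing session is an assumption, since the text does not say how a partial final window is handled.

```python
def session_windows(ids, size=100):
    """Split a chronological ID stream into non-overlapping sessions.

    The final session may be shorter than `size` (assumed to be kept).
    """
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def sliding_subsequences(session, size=10, step=1):
    """Generate overlapping subsequences of a session for model input."""
    return [
        session[i:i + size]
        for i in range(0, len(session) - size + 1, step)
    ]
```

A full session of 100 IDs yields 91 subsequences of length 10, and a session shorter than the subsequence size yields none.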

The experiments were carried out on a server running Ubuntu OS 22.04.3, equipped with an NVIDIA
GeForce RTX 4090 with 24GB of memory and 256GB of RAM. The environment was configured using
Anaconda 2023.09-0, Python 3.9.18, and TensorFlow 2.9.0. Our code and implementation details are
available at https://github.com/tuananhphamds/MultiLog.

Table 3 outlines the hyperparameter information. We employed the AdamW optimizer [36] to fine-tune
the model parameters, setting an initial learning rate of 0.0001 and dedicating 20% of the training steps to
the warmup phase. In our experiments, we assigned equal contributions to the loss weights α, β, and γ.
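Read together with the three training objectives, the equal loss weights suggest a combined objective of the following form. The symbols for the individual losses, and the pairing of each weight with a particular term, are assumptions made here for illustration; the text only states that α, β, and γ contribute equally.

```latex
\mathcal{L}_{\text{total}}
  = \alpha \, \mathcal{L}_{\text{next}}
  + \beta \, \mathcal{L}_{\text{mask}}
  + \gamma \, \mathcal{L}_{\text{center}},
\qquad \alpha = \beta = \gamma
```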

We utilize the F1 Score metric to assess the performance of the model, which is derived from precision and
recall. Precision is the proportion of true positives (TP) among all predicted positives, whereas recall
is the proportion of actual positives that the model identifies. FP and FN denote false positives and
false negatives, respectively.
The F1 Score is calculated as the harmonic mean of precision and recall. It is important to note that we
have not performed hyperparameter tuning to enhance the detection performance of MultiLog.
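$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{11}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{12}$$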
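Equations (10) to (12) translate directly into code. In the sketch below, returning 0.0 when a denominator is zero is a convention assumed here, not something stated in the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, 90 true positives, 10 false positives, and 30 false negatives give a precision of 0.9, a recall of 0.75, and an F1 Score of about 0.818.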

Experiment Results and Analysis


A. Detection Performance Comparison of MultiLog and Baseline Methods
In this section, we evaluate the detection performance of the MultiLog model against other semi-supervised
baseline methods using the HDFS, BGL, and Thunderbird datasets. To ensure a fair
comparison, we compute the F1 Score for each method based on its optimal threshold. An
asterisk (*) denotes methods that were re-implemented, as the F1 Scores obtained from existing
repositories did not meet expectations. Additionally, we investigate the performance of MultiLog by
analyzing the outcomes from its three training objectives: Next, Mask, and Center, as illustrated in
Figure 8.
The evaluation results of the models are summarized in Table 4. In the HDFS dataset, LogRobust
demonstrates the highest performance, achieving an F1 Score of 99.00%. In comparison, the F1
Score of MultiLog(Next) is just 0.92 percentage points lower than that of LogRobust, while also surpassing
other semi-supervised methods. For the BGL and Thunderbird datasets, LogBERT and LAnoBERT
emerge as the top-performing models, with F1 Scores of 99.39% and 99.90%, respectively. Overall,
MultiLog(Next) secures the highest average F1 Score across the three datasets, reaching 98.59%.
It is important to note that we utilized a random selection strategy for training data, consistent with
existing semi-supervised methods, whereas the NeuralLog model employed a chronological
selection based on time order (Pham & Lee, 2024, p. 9). This difference in selection strategy accounts
for the superior F1 Score of MultiLog(Next) compared to the NeuralLog model.
A noticeable trend emerges in the F1 Scores across various semi-supervised learning methods.
Specifically, models such as DeepLog, LogAnomaly, and MultiLog(Next) demonstrate strong detection
performance in the HDFS dataset, but their effectiveness diminishes in the BGL and Thunderbird
datasets. Conversely, LogBERT, LAnoBERT, and MultiLog(Mask) perform well in the BGL and
Thunderbird datasets, while their performance in the HDFS dataset is comparatively lower.
The first key observation is that DeepLog, LogAnomaly, and MultiLog(Next) focus on predicting the
next log key ID as their training objective, whereas the other methods rely on masked token prediction.
The second observation highlights the distinct nature of abnormal patterns across the three datasets: the
HDFS dataset primarily features collective anomalies, while the BGL and Thunderbird datasets are
characterized by point anomalies.
In the BGL and Thunderbird datasets, the test sets include individual anomalous log templates that
are absent from the training data. This makes the prediction of masked tokens particularly effective
for identifying these singular logs, leading to improved F1 Scores. In contrast, collective anomalies
cannot be detected solely by analyzing individual log messages. The HDFS dataset predominantly
features collective anomalies, which account for nearly 60% (10,043 out of 16,838) of the total
anomalies. This suggests that Multi-Log(Next) is more adept at capturing the overall patterns (global
patterns) present in log sequences compared to Multi-Log(Mask), which focuses on local patterns.
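
To make the next-log-key objective more concrete, the following minimal sketch shows two ways a predicted distribution over log keys could be turned into an anomaly decision: a top-g candidate check and a probability-based score. It is an illustration under stated assumptions, not the implementation evaluated here; the function names, the toy probabilities, the threshold, and the rule that one flagged step flags the whole sequence are all hypothetical.

```python
import math

def top_g_is_anomalous(probs, actual_key, g):
    """Flag a step when the observed key is not among the g most probable keys."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return actual_key not in ranked[:g]

def probability_score(probs, actual_key):
    """Negative log-probability of the observed key.

    Keys the model has never seen (for example a new template) get a tiny
    probability floor, so they receive a very high score.
    """
    return -math.log(max(probs.get(actual_key, 0.0), 1e-12))

def sequence_is_anomalous(step_probs, observed_keys, threshold):
    """Assumed rule: flag the sequence if any step's score exceeds the threshold."""
    return any(
        probability_score(p, k) > threshold
        for p, k in zip(step_probs, observed_keys)
    )

# Toy predicted distribution for one step (values are made up).
probs = {"E5": 0.70, "E22": 0.20, "E11": 0.09, "E9": 0.01}
flag_top_g = top_g_is_anomalous(probs, "E11", g=2)
flag_sequence = sequence_is_anomalous([probs, probs], ["E5", "E9"], threshold=3.0)
```

With a top-g rule the decision flips abruptly as g changes, whereas the probability score varies smoothly and only the threshold needs tuning.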

Effect of Using Multi-Task Learning on Model Performance


In this section, we assess the impact of integrating multiple training objectives on the detection of log
anomalies. The results of this comparison are detailed in Table 5. The Multi-Log model, utilizing the
combined objectives of Next, Mask, and Distance, demonstrates superior performance in the HDFS
and BGL datasets, achieving increases of 1.8% and 2.29%, respectively, compared to the model
that focuses solely on the Next objective. Additionally, the average F1 Scores indicate an overall
enhancement of 1.31% when employing a multi-objective training approach for log anomaly
detection.
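
As a rough illustration of how three objectives might be combined during training, the sketch below adds a cross-entropy term for the next log key, a cross-entropy term for a masked token, and a Euclidean distance term. The weights `w_mask` and `w_dist`, the reference vector `ref`, and the simple weighted sum are assumptions made for this example; this section does not state how the objectives are actually weighted or which vectors the distance is computed between.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the target index."""
    return -math.log(max(probs[target], 1e-12))

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def multi_task_loss(next_probs, next_target, mask_probs, mask_target,
                    emb, ref, w_mask=1.0, w_dist=1.0):
    """Weighted sum of the Next, Mask, and Distance terms (weights are hypothetical)."""
    loss_next = cross_entropy(next_probs, next_target)
    loss_mask = cross_entropy(mask_probs, mask_target)
    loss_dist = euclidean(emb, ref)
    return loss_next + w_mask * loss_mask + w_dist * loss_dist

# Toy values: both predictions are fairly confident, and the embedding is close to ref.
example_loss = multi_task_loss(
    next_probs=[0.1, 0.8, 0.1], next_target=1,
    mask_probs=[0.6, 0.3, 0.1], mask_target=0,
    emb=[0.2, 0.1], ref=[0.0, 0.0],
)
```

Setting `w_mask=0` and `w_dist=0` reduces this to the Next-only objective used as the baseline in the comparison.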

Effect of Unstable Logs on Model Performance


We adopted a similar approach to that used in Log-Robust to create synthetic datasets, aiming to
demonstrate the instability of log data in real-world deployments. Figure 9 depicts the process of
generating unstable log event sequences from normal logs. In the synthetic datasets, we introduce three
types of unstable logs: deletion, shuffling, and duplication. For each type, we incorporate varying ratios of
unstable logs into the normal HDFS test dataset, which consists of 111,644 normal log sequences. Each
sequence is randomly altered by injecting an unimportant event from within the sequence itself. To
identify these unimportant events in the HDFS dataset, we analyze the frequency of occurrence for each
log event in the normal test data. The distribution of log template occurrences is illustrated in Figure 10,
from which we select the six log template IDs with the highest frequencies as the unimportant events.
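
The generation procedure described above can be sketched as follows. This is one possible reading, not the exact procedure from Figure 9: the helper names, the way a ratio of sequences is sampled, and the choice to move a single event to a random position for the shuffling case are all assumptions made for illustration.

```python
import random
from collections import Counter

def unimportant_events(sequences, top_k=6):
    """Return the top_k most frequent log event IDs in the normal sequences."""
    counts = Counter(event for seq in sequences for event in seq)
    return {event for event, _ in counts.most_common(top_k)}

def inject(seq, unimportant, kind, rng):
    """Alter one unimportant event of the sequence; return a new list."""
    positions = [i for i, event in enumerate(seq) if event in unimportant]
    out = list(seq)
    if not positions:
        return out  # nothing to alter in this sequence
    i = rng.choice(positions)
    if kind == "deletion":
        del out[i]
    elif kind == "duplication":
        out.insert(i + 1, out[i])
    elif kind == "shuffling":
        event = out.pop(i)
        # The new position is random and may coincide with the old one.
        out.insert(rng.randrange(len(out) + 1), event)
    else:
        raise ValueError(f"unknown instability type: {kind}")
    return out

def make_unstable(sequences, unimportant, kind, ratio, seed=0):
    """Apply one type of instability to a given ratio of the sequences."""
    rng = random.Random(seed)
    n = round(len(sequences) * ratio)
    chosen = set(rng.sample(range(len(sequences)), n))
    return [
        inject(seq, unimportant, kind, rng) if i in chosen else list(seq)
        for i, seq in enumerate(sequences)
    ]

# Toy event sequences (IDs are made up).
seqs = [["E5", "E22", "E5", "E11"], ["E5", "E9", "E5", "E26"]]
frequent = unimportant_events(seqs, top_k=1)
unstable = make_unstable(seqs, frequent, "deletion", ratio=0.5)
```

Because only high-frequency events are altered, the modified sequences should still represent normal behavior, which is what makes them a test of robustness rather than of detection.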
The evaluation results for various semi-supervised methods are presented in Table 6. The detection
performance of Deep Log and Log Anomaly declines significantly, while Multi-Log(Next) and
LogBERT maintain their effectiveness despite the instability of log data.

However, the F1 Scores of LogBERT vary because masked tokens are selected at random in its
prediction strategy. Notably, LogBERT achieves an F1 Score of 85.31% for the deletion type at a
5% injection rate, yet this score rises to 88.45% when the injection rate is raised to
20%. In our approach, we inject one unimportant log event for each type of unstable log, yet the
detection performance of Deep Log and Log Anomaly is considerably impacted. In summary, Multi-
Log(Next) demonstrates the best performance in handling unstable log data, outperforming the other
methods.

Effect of Different Log Parsers on Model Performance

This section examines the impact of different log parsers on the performance of the Multi-Log(Next)
model, as parsing errors can lead to a decline in detection accuracy. We assess Multi-Log(Next) using
three preprocessing methods for log messages: Spell, Drain, and a no-parser approach, which utilizes
regular expressions and simple transformations to extract unique processed log messages, treating them as
log templates. The results of this evaluation are summarized in Table 7, which illustrates the effects of
these log parsers on detection performance.
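
As an illustration of what a no-parser preprocessing step might look like, the sketch below replaces variable fields with placeholder tokens using regular expressions and treats each unique processed message as a template. The specific patterns, placeholder tokens, and function names are hypothetical; the regular expressions actually used in the experiments are not listed in this section.

```python
import re

# Ordered rules: more specific patterns first, generic numbers last.
_RULES = [
    (re.compile(r"blk_-?\d+"), "<BLK>"),                               # HDFS-style block IDs
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),     # IPv4 with optional port
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),                      # hexadecimal values
    (re.compile(r"\b\d+\b"), "<NUM>"),                                 # remaining integers
]

def to_template(message):
    """Mask variable fields and normalize whitespace."""
    text = message.strip()
    for pattern, token in _RULES:
        text = pattern.sub(token, text)
    return re.sub(r"\s+", " ", text)

def build_vocabulary(messages):
    """Assign an integer ID to each unique processed message."""
    vocab = {}
    for message in messages:
        vocab.setdefault(to_template(message), len(vocab))
    return vocab

template = to_template(
    "Receiving block blk_-1608999687919862906 src: /10.250.19.102:54106 dest: /10.250.19.102:50010"
)
vocab = build_vocabulary([
    "Deleting block blk_1 file 5",
    "Deleting block blk_2 file 6",
])
```

Since every message is transformed by the same deterministic rules, two messages can only share a template if they are identical after masking, which avoids the grouping errors a learned parser can introduce.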
To gain a clearer understanding of the results, we analyze the number of abnormal logs that were
incorrectly classified as normal and present this information in Table 8. Our analysis focuses on the BGL
and Thunderbird datasets, as they provide explicit labels for each log entry (with ‘-’ indicating normal
logs and other labels for abnormal logs). The findings indicate that using regular expressions and text
transformations results in no parsing errors, contributing to a slight increase in the F1 Score across all
three datasets. In the BGL dataset, both the Spell and Drain parsers misclassify approximately 19% of
anomalous logs as normal, which accounts for the lower F1 Scores of 97.66% and 97.20%, respectively.
In the Thunderbird dataset, despite 99.97% of anomalous logs being incorrectly parsed, the F1 Score only
experiences a minor reduction of 3.7% (from 99.13% to 95.43%), a trend also observed in the Deep Log
model.

Conclusion
In this paper, we introduce a novel lightweight semi-supervised multi-task learning approach for detecting
anomalies in system logs. We effectively tackle three key challenges associated with log anomaly
detection: the issue of large model weights, the instability of log data, and the complexities of unexplored
log semantics. Our extensive experiments have yielded several important insights.
First, utilizing the primary training objective of predicting the next token ID proves to be effective and
suitable for real-world applications, as it can successfully identify both collective and point anomalies.
Second, employing predictive probabilities as anomaly scores results in more stable F1 Scores compared
to using top-g candidates. Third, by designing appropriate regular expressions and effective text
transformations, we can eliminate the need for log parsers, thereby avoiding parsing errors and enhancing
model performance. Lastly, the integration of semantic embeddings, the self-attention mechanism, and the
third objective focused on minimizing Euclidean distance helps to mitigate the instability issues present in
log data.

Semi-supervised learning methods typically operate under the assumption that the training data
consists solely of normal behaviors, which is a reasonable expectation. However, in real-world
scenarios, it is possible for anomalies to inadvertently be included in the training data, potentially
leading to a decline in detection performance. In future research, we intend to investigate the impact
of introducing anomalies into the training set, recognizing that this presents a significant challenge.
We believe that our proposed method, along with the insights gained, will contribute positively to
future studies in this area.

References
