LogPrompt_A_Log-based_Anomaly_Detection_Framework_Using_Prompts

The document presents LogPrompt, a log anomaly detection framework that utilizes prompts to enhance the performance of pre-trained language models (PLMs) in learning semantic and sequential information from log data. It addresses challenges such as the scarcity of labeled logs and class imbalance by employing focal loss during training and allowing for effective detection even with limited data. The framework consists of five stages, including log preprocessing, prompt construction, and prompt-based tuning on PLMs, demonstrating improved detection capabilities in experiments.


LogPrompt: A Log-based Anomaly Detection Framework Using Prompts

Ting Zhang^1, Xin Huang^2,*, Wen Zhao^3, Shaohuang Bian^4, Peng Du^1
1 School of Software and Microelectronics, Peking University, Beijing, China
2 School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing, China
3 National Engineering Research Center for Software Engineering, Peking University, Beijing, China
4 College of Information and Electrical Engineering, China Agricultural University, Beijing, China
{zhangting2019, zhaowen, pdu}@pku.edu.cn, [email protected], [email protected]

* Corresponding author.

2023 International Joint Conference on Neural Networks (IJCNN) | 978-1-6654-8867-9/23/$31.00 ©2023 IEEE | DOI: 10.1109/IJCNN54540.2023.10191948

Abstract: Log data are widely used in anomaly detection tasks for software systems. Log anomaly detection methods based on deep learning have progressed greatly, but existing methods have the following limitations: (1) logs are produced at large scale but labeled logs are rare, so training a detection model from scratch, which requires a large amount of labeled log data, is costly and impractical; (2) log anomaly detection usually needs to consider both the semantic and the sequential information in logs, but most current frameworks build models from only one of these aspects; (3) normal and abnormal logs are imbalanced in the real world, which seriously reduces detection recall. This paper proposes a log anomaly detection framework called LogPrompt to address these problems. LogPrompt leverages prompts to guide a pre-trained language model (PLM) to better learn the semantic and sequential information of logs, and avoids training a model from scratch. Even with little training data, the model achieves good detection performance. Moreover, it uses focal loss instead of cross entropy loss to guide model optimization during training, which alleviates the class imbalance problem. Experiments show that LogPrompt detects log anomalies more effectively and efficiently with prompts, and that it significantly improves recall and F1 score.

Index Terms: anomaly detection, pre-trained language model, prompt tuning

I. INTRODUCTION

The development of modern networks has increased the requirements for real-time, accurate anomaly detection technology. Log data are closely related to network security and big data. Software systems usually record operation and status information by printing console logs, and a complex software system may generate a very large number of logs. When an anomaly occurs, log data can help operation and maintenance (O&M) personnel discover the system failure in time. Therefore, log data are widely used in log anomaly detection tasks.

Traditionally, log data analysis and anomaly detection are performed manually by O&M personnel on the basis of their professional knowledge. However, manual log analysis cannot meet current requirements because system log data are massive and unstructured. Therefore, log anomaly detection based on deep learning has become an important research trend. However, log anomaly detection tasks need to consider the following issues:

(1) Learning from the whole labeled log data requires tremendous resources because of its large volume [1]. Therefore, a model suitable for training with little data is needed.

(2) Log anomaly detection usually needs to consider both the semantic and the sequential information, which reflect the point and conditional anomalies of logs, respectively [1]. However, most current log anomaly detection frameworks build models from only one of these aspects.

(3) Normal and abnormal logs are imbalanced [1]; that is, normal logs are usually the majority, while abnormal logs are the minority. An unbalanced data distribution biases model training toward the majority class (i.e., normal logs), which negatively affects the model.

To address the aforementioned problems, this paper proposes a log anomaly detection framework called LogPrompt, which is based on prompt tuning. LogPrompt consists of five stages: (1) log preprocessing, including log parsing and grouping; (2) prompt and verbalizer construction; (3) semantic, sequential, and prompt embeddings; (4) prompt-based tuning on the PLM, using focal loss to alleviate the log imbalance problem; (5) log anomaly detection based on the prompt-tuned PLM.

The contributions of this paper are as follows:
• We propose a novel anomaly detection framework named LogPrompt, which leverages prompts to enable the PLM to learn more about the representations of logs. Compared with training a model from scratch, the PLM provides good parameter initialization, which reduces resource consumption. Even with little training data, LogPrompt achieves good detection performance.
• Semantic and sequential tokens are considered together and embedded to help the PLM detect point and conditional anomalies effectively and efficiently.
• Focal loss is used in place of cross entropy loss, which alleviates the class imbalance of real-world log data and thereby improves the evaluation metrics.

II. RELATED WORK

A. Deep Learning-based Log Anomaly Detection

Du et al. [2] proposed DeepLog, which leverages long short-term memory (LSTM) to capture the execution sequence of

Authorized licensed use limited to: Zhejiang University. Downloaded on June 15,2024 at 15:29:21 UTC from IEEE Xplore. Restrictions apply.
normal logs in the training stage and then determines whether log sequences are abnormal in the anomaly detection stage.

Meng et al. [3] proposed LogAnomaly, which also uses an LSTM model. They also proposed a word embedding method called Template2Vec that considers synonym and antonym information. Their aims were to learn the semantic similarity between log keywords and templates, and to solve the online learning problem for new templates.

Zhang et al. [4] proposed LogRobust, which extracts the semantic information of log events and represents it as a semantic vector. It uses an attention-based bidirectional LSTM (Bi-LSTM) model to capture contextual information in log sequences and to automatically learn the representations of different log events. Thus, LogRobust can identify and handle unstable log events and sequences.

At present, large-scale language models are widely used in the field of natural language processing (NLP), and they show good performance in various NLP tasks. Devlin et al. [5] proposed BERT in 2018, which pre-trains a bidirectional Transformer encoder. Inspired by BERT, Guo et al. [6] proposed LogBERT, which captures sequential information by learning from normal log sequences.

B. Pre-trained Language Model

A pre-trained language model first learns from general tasks, which updates the model weights, and is then transferred to downstream tasks for further learning so that it adapts to them.

A great number of studies have shown that language models pre-trained on large corpora can learn general language representations and provide a better initialization of model parameters, which not only improves the generalization ability of models but also speeds up convergence on the target task. In addition, pre-training saves a lot of resources by avoiding training a new model from scratch.

BERT [5] is a widely used pre-trained language model. It uses the masked language model (MLM) objective, which predicts the masked tokens in a sentence with a Transformer encoder. RoBERTa [7] and ALBERT [8] are improvements of BERT. RoBERTa mainly changes static masking to dynamic masking. ALBERT compresses the model size and reduces the number of parameters through cross-layer parameter sharing and factorized embedding parameterization.

C. Prompt Tuning

PLMs are usually pre-trained on large corpora, but the semantics of logs may differ greatly from those of the pre-training tasks because the training objectives differ. Prompt tuning [9] is a method for adapting downstream tasks to the PLM. Through prompt tuning, the PLM can better understand the log anomaly detection task. Moreover, prompt tuning is suitable for few-shot learning, which can significantly improve the learning capabilities for machine intelligence and practical adaptive applications by accessing only a small number of labeled examples [10].

Prompt tuning mainly includes three processes: template engineering, answer searching, and answer mapping. Template engineering refers to the process of designing prompt templates, which help PLMs "recall" what they "learned" during pre-training. It is a technical approach to activating the knowledge of PLMs. Prompt templates mainly include discrete templates (e.g., PET [11]) and continuous templates (e.g., P-tuning [12]), both of which contain a [MSK] token. Answer searching refers to searching the vocabulary of the PLM for the token with the highest probability of filling the [MSK] position in the prompt template. Answer mapping maps the prediction for [MSK] to a label, which is normal or anomaly in the log anomaly detection task. The mapping function is also called a verbalizer, so the process of answer mapping is also called verbalizing.

III. DEFINITIONS AND TASK DESCRIPTION

A. Definitions

1) Log: A log, also called a log entry, is typically printed to a console or file by log print statements in programs. Most logs contain information such as a timestamp, a log level, and the log content. Log content consists of constants and variables. The constant part printed by the same print statement is always the same. The variable part is also called a parameter, and it reflects the variable information of the running system; under different states or activities, the variable may differ. In general, log content is simply referred to as a log. If the log content is segmented by certain delimiters (e.g., blank space), then each element obtained is called a token. The log length usually refers to the number of tokens. We can represent a log as:

$$L = \{t_1, t_2, \ldots, t_m\} \tag{1}$$

where $L$ represents a log, $t_i$ represents the $i$-th token, and $m$ represents the log length.

2) Log sequence: A log sequence is usually composed of logs in chronological order. The length of a log sequence usually refers to the number of log entries it contains. Therefore, we can represent a log sequence as:

$$S_L = \{L_1, L_2, \ldots, L_n\} \tag{2}$$

where $S_L$ represents a log sequence, $L_i$ represents the $i$-th log, and $n$ represents the length of the log sequence.

B. Task Description

Given a log sequence $S_L = \{L_1, L_2, \ldots, L_n\}$, we want to detect whether the log sequence reflects point anomalies or conditional anomalies that occur in the system. Specifically, a point anomaly exists when a single log $L_i \in S_L$ indicates that an anomaly has occurred in the system. A conditional anomaly exists when a subsequence $S_L' = \{L_i, L_{i+1}, \ldots\} \subseteq S_L$ indicates that an anomaly has occurred in the system. The objective of the log anomaly detection task is to detect whether anomalies occur in the running system by analyzing logs.
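As an illustration of Eqs. (1) and (2), the following minimal Python sketch represents a log as a list of whitespace-delimited tokens and a log sequence as an ordered list of logs. The function name, the class name, and the sample log contents are hypothetical and chosen only for this example; they are not part of LogPrompt.

```python
from dataclasses import dataclass
from typing import List


def tokenize(log_content: str) -> List[str]:
    """Split log content on blank space; each element is one token t_i (Eq. 1)."""
    return log_content.split()


@dataclass
class LogSequence:
    """Chronologically ordered logs L_1..L_n (Eq. 2); each log is a token list."""
    logs: List[List[str]]

    def __len__(self) -> int:
        # The sequence length n counts log entries, not tokens.
        return len(self.logs)


# Hypothetical log contents written in the style of HDFS logs (illustration only).
raw_contents = [
    "Receiving block blk_1 src: /10.0.0.1:40524 dest: /10.0.0.1:50010",
    "BLOCK* NameSystem.allocateBlock: /tmp/job.split blk_1",
]

sequence = LogSequence([tokenize(content) for content in raw_contents])
log_length = len(sequence.logs[0])  # m, the number of tokens in the first log
sequence_length = len(sequence)     # n, the number of logs in the sequence
```

Under these definitions, a point anomaly is a property of one element of `sequence.logs`, whereas a conditional anomaly is a property of a contiguous slice of it.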

[Figure 1: workflow diagram. Raw logs pass through (1) log preprocessing to produce structured logs, from which log semantic tokens (e.g., {"Receiving", "block", "src", "dest", "NameSystem", "allocateBlock"}) and log sequential tokens (e.g., {"bbb51b95", "3d91fa85"}) are extracted. In (2) prompt and verbalizer construction, a template is chosen from the prompt repository (template ① in the figure), giving the prompt {"semantic", <SEM>, "sequential", <SEQ>, "it", "is", [MSK]}. In (3), semantic, sequential, and template embeddings are fed to a pre-trained language model (e.g., BERT, RoBERTa, ALBERT). Stage (4) is prompt-based tuning on the PLM with focal loss, and stage (5) is log anomaly detection, where the verbalizer maps the [MSK] prediction to Normal or Anomaly.]

Prompt repository shown in the figure:
Templates
① semantic <SEM> sequential <SEQ> it is [MSK]
② <SEM> <SEQ> it is [MSK]
③ <SEM> <SEQ> normal or anomaly ? [MSK]
④ <SEM> <SEQ> h[PRO] h[PRO] [MSK]
Label words
Anomaly: error, failure
Normal: test, size

Fig. 1. Workflow of LogPrompt

IV. LOGPROMPT

In this section, we propose a log anomaly detection framework called LogPrompt, which is based on prompt tuning. The pipeline of LogPrompt includes five steps: (1) log preprocessing; (2) prompt and verbalizer construction; (3) semantic, sequential, and prompt embeddings; (4) tuning the PLM based on prompts; (5) log anomaly detection. The workflow of LogPrompt is shown in Fig. 1.

A. Log Preprocessing

In the log preprocessing stage, raw logs are sent to a log parser to obtain structured logs. LogPrompt integrates the Spell [13] and Drain [14] log parsers, which parse logs through the longest common subsequence and a fixed-depth parse tree, respectively. In structured logs, variables are replaced with "<*>". Variables and special symbols (e.g., "/" and "$") can be ignored in log semantic sequences, so they are removed from the semantic sequences. Because log sequences can be excessively long, LogPrompt also deduplicates the tokens in semantic sequences: when a token is repeated several times, only its first occurrence is retained, which also preserves the order of the semantic sequence. Log execution sequences, in contrast, are retained in full without any processing.

B. Prompt and Verbalizer Construction

Prompts are used for tuning the PLM, so the construction of prompts is critical. A typical prompt consists of two parts: a template and a set of label words.

1) Template: A template can be represented as:

$$T = \{[PRO]_1, \ldots, \langle SEM \rangle, [PRO]_i, \ldots, \langle SEQ \rangle, [PRO]_j, \ldots, [MSK]\} \tag{3}$$

where $T$ represents a template, $[PRO]_i$ represents the $i$-th token of the template, <SEM> and <SEQ> are placeholders for the semantic sequence and the event execution sequence, respectively, and [MSK] represents the mask in the template.

However, manually constructing templates is suboptimal because we usually have no prior knowledge of how to construct good templates. For this reason, continuous templates are designed to enable the model to automatically "learn" the templates suitable for the log anomaly detection task. The original token $[PRO]_i$ is replaced with a learnable token $h[PRO]_i$, and thus we obtain:

$$T = \{h[PRO]_1, \ldots, \langle SEM \rangle, h[PRO]_i, \ldots, \langle SEQ \rangle, h[PRO]_j, \ldots, [MSK]\} \tag{4}$$

where $h[PRO]_i$ represents the $i$-th continuous token, which is a trainable tensor. In addition, some tokens can be set to be continuous while the other tokens remain discrete.

LogPrompt provides a model for training continuous templates; the model contains a Bi-LSTM head that uses ReLU as the activation function and an MLP head. The training process is represented as:

$$h[PRO]_i \leftarrow \mathrm{MLP}(\text{Bi-LSTM}(h[PRO]_i)) \tag{5}$$
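Returning to the preprocessing rules of Section IV-A, the construction of a semantic sequence from parsed event templates can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name is hypothetical, every non-alphanumeric character is treated as a special symbol, and deduplication is case-insensitive. The last two choices are assumptions made so that the output matches the semantic sequence used in the paper's case study, where "BLOCK*" does not yield a second "block" token.

```python
import re
from typing import List


def semantic_tokens(event_templates: List[str]) -> List[str]:
    """Build a deduplicated semantic token sequence from parsed event templates."""
    seen = set()
    result = []
    for template in event_templates:
        # Drop the variable placeholder "<*>" produced by the log parser.
        text = template.replace("<*>", " ")
        # Assumption: every non-alphanumeric character is a special symbol.
        for token in re.split(r"[^A-Za-z0-9]+", text):
            key = token.lower()  # assumption: case-insensitive deduplication
            if token and key not in seen:
                seen.add(key)
                result.append(token)  # keep only the first occurrence, in order
    return result


# Event templates and event IDs taken from the paper's case study.
templates = [
    "Receiving block <*> src: /<*> dest: /<*>",
    "BLOCK* NameSystem.allocateBlock: <*> <*>",
]
x_sem = semantic_tokens(templates)
# The execution sequence is kept in full, without any processing.
x_seq = ["bbb51b95", "3d91fa85"]
```

Because first occurrences are appended in reading order, the relative order of the remaining tokens is preserved, which is the property the preprocessing stage relies on.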

2) Label Words and Verbalizer: Label words $V_{label}$ are constructed manually, and they represent normal and abnormal meanings. The selected label words should be in the vocabulary $V$ of the PLM $\mathcal{L}$. Then, a mapping called a verbalizer, which maps the label words to the class set $Y = \{0, 1\}$ (0 represents normal, 1 represents anomaly), can be constructed manually. Formally, the verbalizer can be represented as:

$$M : v \rightarrow y \tag{6}$$

where $v$ represents a word in the label words, that is, $v \in V$, and $y$ represents the category in the log anomaly detection task, that is, normal or anomaly. For example, label words such as "error" and "failure" can be used to indicate anomalies, and other words unrelated to anomalies, such as "test" and "size", can be used to indicate normal instances.

C. Embeddings

Logs are unstructured text data, so many works use word embeddings to represent the semantic information of logs. Word2Vec [15] is a non-contextual embedding method that mainly includes skip-gram and continuous bag of words. It has two main limitations. The first is that the embeddings are static, which means that the embedding of a word is always the same regardless of its context. Thus, it fails to model polysemous words. For example, the word "block" may refer to a data block in HDFS, or to preventing something from happening, but Word2Vec cannot distinguish between the two. The second is the out-of-vocabulary problem, which means that it cannot represent words that do not appear in the vocabulary.

Language models such as ELMo [16], GPT [17], and BERT [5] were proposed to address the limitations of non-contextual embeddings. They use contextual embeddings to capture the semantic information of words in different contexts. Moreover, they can effectively alleviate the out-of-vocabulary problem by pre-training on a large-scale corpus.

The contextual embeddings produced by PLMs are powerful. Thus, we leverage BERT-based models to generate log semantic and sequential embeddings by inputting the log texts and the log event execution sequences, respectively. The main reason for modeling the two kinds of embeddings is that system anomalies can be mainly divided into point and conditional anomalies [18]. Point anomalies are usually detected at the semantic level of log texts, while conditional anomalies are usually detected in the log event execution sequences. We combine them to improve the robustness of LogPrompt.

D. Prompt Tuning

Prompt-based tuning enables the PLM to better understand the log anomaly detection task, learn the representations of normal and abnormal logs, and distinguish them.

First, the semantic sequence $x_{sem}$ and the execution sequence $x_{seq}$ are filled into the placeholders <SEM> and <SEQ> of the prompt template $T$, respectively. Then, the prompt sequence $x_{in}$ is obtained:

$$x_{in} = \{[PRO]_1, \ldots, x_{sem}, [PRO]_i, \ldots, x_{seq}, [PRO]_j, \ldots, [MSK]\} \tag{7}$$

Then, the PLM $\mathcal{L}$ predicts the [MSK] in $x_{in}$; that is, $P([MSK] = v \mid x_{in})$ is calculated to indicate which word $v$ in $V_{label}$ is the best substitute for [MSK]. Finally, the word $v$ with the highest probability is selected, and the result $y$ is obtained according to the mapping $M$:

$$y = M\left(\arg\max_{v \in V} P([MSK] = v \mid x_{in})\right) \tag{8}$$

For example, suppose that a log semantic sequence and an execution sequence are $x_{sem}$ = {Receiving, block, src, dest, NameSystem, allocateBlock} and $x_{seq}$ = {bbb51b95, 3d91fa85}. Filling them into the prompt template $T$ = {semantic, <SEM>, sequential, <SEQ>, it, is, [MSK]} gives $x_{in}$ = {semantic, $x_{sem}$, sequential, $x_{seq}$, it, is, [MSK]}. The PLM $\mathcal{L}$ embeds $x_{in}$ and predicts the [MSK] token. We define the verbalizer as $M$ = {{error, failure} → 1, {test, size} → 0}. If the prediction for [MSK] is "error", then the verbalizer maps it to the anomaly label. If the prediction for [MSK] is "test", then the log sequence is labeled normal. This case study is shown in Fig. 2.

E. Loss Function

In the training stage, the PLM learns through prompts which words represent normal and anomaly, and performs anomaly detection. For supervised training, a loss function is used to evaluate the performance of the model and to guide its updates. The target of the log anomaly detection task is to detect whether an anomaly occurs in the system through a log entry or a log execution sequence. In essence, log anomaly detection is a binary classification task. Thus, most deep learning-based models for log anomaly detection use cross entropy loss as the loss function. For the log anomaly detection task, the cross entropy loss function is defined as follows:

$$L_{CE} = -y\log(p) - (1-y)\log(1-p) = \begin{cases} -\log(p), & \text{if } y = 1 \\ -\log(1-p), & \text{if } y = 0 \end{cases} \tag{9}$$

where $y$ represents the ground-truth label of a single log entry or a log event execution sequence ($y = 1$ means the log is abnormal and $y = 0$ means it is normal), and $p$ represents the predicted probability that the log is abnormal.

However, abnormal logs are rarer than normal logs in the real world, so capturing the semantic words that represent system anomalies is more difficult. To alleviate the impact of class imbalance, a weight factor $\alpha$ is added to the cross entropy loss function to obtain the balanced cross entropy loss function, which is defined as follows:

$$L_{BCE} = -y\alpha\log(p) - (1-y)(1-\alpha)\log(1-p) = \begin{cases} -\alpha\log(p), & \text{if } y = 1 \\ -(1-\alpha)\log(1-p), & \text{if } y = 0 \end{cases} \tag{10}$$

[Figure 2: case study in four panels.]

1. Raw logs
081109 203519 143 INFO dfs.DataNode$DataXceiver: Receiving block blk_-1608999687919862906 src: /10.250.10.6:40524 dest: /10.250.10.6:50010
081109 203520 26 INFO dfs.FSNamesystem: BLOCK* NameSystem.allocateBlock: /mnt/hadoop/mapred/system/job_200811092030_0001/job.split. blk_-1608999687919862906

2. Structured logs (after log parsing)
Date      Time      ...  Event ID  Event Template
08/11/09  20:35:19  ...  bbb51b9   Receiving block <*> src: /<*> dest: /<*>
08/11/09  20:35:20  ...  3d91fa85  BLOCK* NameSystem.allocateBlock: <*> <*>

3. Prompts (prompt repository)
Templates
① semantic <SEM> sequential <SEQ> it is [MSK]
② <SEM> <SEQ> it is [MSK]
③ <SEM> <SEQ> normal or anomaly ? [MSK]
④ <SEM> <SEQ> h[PRO] h[PRO] [MSK]
Label words
Anomaly: error, failure
Normal: test, size

4. Prompt sequence
semantic Receiving block src dest NameSystem allocateBlock sequential bbb51b9 3d91fa85 it is [MSK]

The prompt sequence is passed to the PLM and the verbalizer for anomaly detection.

Fig. 2. Case study
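The case study can also be traced in code. The following minimal Python sketch fills the placeholders of template ① as in Eq. (7) and applies the verbalizer as in Eq. (8). The function names are hypothetical, and the probabilities in `mask_probs` are invented stand-ins for the output of a prompt-tuned PLM, which is not called here.

```python
from typing import Dict, List

# Template ① from the prompt repository and the verbalizer from the case study.
TEMPLATE = ["semantic", "<SEM>", "sequential", "<SEQ>", "it", "is", "[MSK]"]
VERBALIZER: Dict[str, int] = {"error": 1, "failure": 1, "test": 0, "size": 0}


def fill_template(template: List[str], x_sem: List[str], x_seq: List[str]) -> List[str]:
    """Replace the <SEM> and <SEQ> placeholders with the token sequences (Eq. 7)."""
    x_in: List[str] = []
    for token in template:
        if token == "<SEM>":
            x_in.extend(x_sem)
        elif token == "<SEQ>":
            x_in.extend(x_seq)
        else:
            x_in.append(token)
    return x_in


def verbalize(mask_probs: Dict[str, float]) -> int:
    """Select the most probable label word for [MSK] and map it to a class (Eq. 8)."""
    best_word = max(VERBALIZER, key=lambda word: mask_probs.get(word, 0.0))
    return VERBALIZER[best_word]


x_sem = ["Receiving", "block", "src", "dest", "NameSystem", "allocateBlock"]
x_seq = ["bbb51b95", "3d91fa85"]
x_in = fill_template(TEMPLATE, x_sem, x_seq)

# Hypothetical P([MSK] = v | x_in) values; a real system would read them from the PLM.
mask_probs = {"error": 0.61, "failure": 0.22, "test": 0.10, "size": 0.07}
label = verbalize(mask_probs)  # 1 means anomaly, 0 means normal
```

In this sketch the search is restricted to the label words, which corresponds to choosing the best substitute for [MSK] among the words in the label set.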

where $y$ and $p$ are defined as above, and $\alpha \in [0, 1]$ is a weight factor used to adjust the imbalance between normal logs and abnormal logs.

We want the model to pay more attention to the logs that are difficult to classify, in order to improve classification accuracy. Focal loss [19] further improves the balanced cross entropy loss by adding the modulating factors $(1-p)^{\gamma}$ and $p^{\gamma}$, where $\gamma$ is a focusing parameter. This not only alleviates the log class imbalance problem but also makes the model focus more on hard-to-classify logs during training. Focal loss is defined as follows:

$$L_F = -y\alpha(1-p)^{\gamma}\log(p) - (1-y)(1-\alpha)p^{\gamma}\log(1-p) = \begin{cases} -\alpha(1-p)^{\gamma}\log(p), & \text{if } y = 1 \\ -(1-\alpha)p^{\gamma}\log(1-p), & \text{if } y = 0 \end{cases} \tag{11}$$

where $y$, $p$, and $\alpha$ are defined as above, and $\gamma$ is the adjustable focusing parameter.

V. EXPERIMENTS

A. Datasets

Loghub [20] is an open-source platform that collects many real-world log datasets, including supercomputer, distributed system, and operating system logs. The HDFS and BGL datasets are selected as the experimental datasets in this paper.

• The HDFS dataset is generated by MapReduce jobs (such as distributed sorting and text scanning) running on 203 Amazon EC2 nodes. It includes 11,197,954 log entries, about 2.9% of which are abnormal. Based on the identifier block id, the HDFS dataset can be grouped into 575,060 log sequences. Since the anomalies in the HDFS dataset are labeled per block id, it is necessary to consider both the point anomalies and the conditional anomalies that may be included in the log event execution sequence of the corresponding block id. In general, point anomalies can be detected from the semantic information of log texts, and conditional anomalies can be detected from the log execution sequences.
• The BGL dataset is an open dataset of logs collected from the BlueGene/L supercomputer system at Lawrence Livermore National Labs (LLNL). It contains fine-grained labels, so it is widely used for the log anomaly detection task. Unlike the HDFS dataset, the BGL dataset does not have identifiers.

For each dataset, logs are randomly sampled to build a training set that contains 10,000 log sequences for few-shot learning, since the whole datasets are much larger. The whole datasets are used to evaluate the performance of LogPrompt and to compare it with other log anomaly detection frameworks.

TABLE I
DETAILED STATISTICAL INFORMATION OF THE DATASETS USED IN THE EXPERIMENTS

Dataset   # Training set (ano.)   # Testing set (ano.)
HDFS      10,000 (298)            575,060 (16,838)
BGL       10,000 (759)            4,747,963 (348,460)

B. Experimental Design and Results

The performance of log anomaly detection is usually evaluated by four metrics: precision, recall, F1 score, and accuracy. We conduct comparative experiments to explore the influence of different parameter settings and to compare with previous work, and we conduct an ablation study to show the effectiveness of leveraging both semantic and sequential information in LogPrompt. We put forward research questions (RQs) and answer them through experiments.

RQ1: What are the advantages of LogPrompt compared with other log anomaly detection frameworks?

To evaluate the performance of LogPrompt, we compared it with four other log anomaly detection frameworks based

TABLE II
PERFORMANCE OF LOGPROMPT AND FOUR OTHER LOG ANOMALY DETECTION FRAMEWORKS

                         HDFS                              BGL
Framework                Precision↑  Recall↑  F1 score↑    Precision↑  Recall↑  F1 score↑
DeepLog                  0.8844      0.6949   0.7734       0.8974      0.8278   0.8612
LogAnomaly               0.9415      0.4047   0.5619       0.7312      0.7609   0.7408
LogRobust                0.9830      0.9480   0.9460       0.8970      0.7920   0.8410
LogBERT                  0.8702      0.7810   0.8232       0.8940      0.9232   0.9083
LogPrompt (SEM)          0.9832      0.7436   0.8459       0.9841      0.8741   0.9219
LogPrompt (SEQ)          0.6819      0.5783   0.5202       0.8842      0.8963   0.8902
LogPrompt (SEM + SEQ)    0.9965      0.9474   0.9713       0.9848      0.9753   0.9799

TABLE III
PERFORMANCE EFFECT OF DIFFERENT PLMS ON LOGPROMPT

                      HDFS                                         BGL
PLM                   Precision↑  Recall↑  F1 score↑  Accuracy↑    Precision↑  Recall↑  F1 score↑  Accuracy↑
BERT-base-uncased     0.9964      0.8911   0.9406     0.9949       0.9861      0.9367   0.9607     0.9941
BERT-large-uncased    0.9981      0.5909   0.7065     0.9813       1.0000      0.7542   0.8599     0.9817
RoBERTa-base          0.9975      0.8938   0.9426     0.9951       0.9843      0.9363   0.9597     0.9939
RoBERTa-large         0.9811      0.6982   0.8092     0.9857       0.9649      0.8555   0.9030     0.9867
ALBERT-base-v1        0.9968      0.8938   0.9419     0.9951       0.9872      0.9607   0.9737     0.9961
ALBERT-base-v2        0.9965      0.9474   0.9713     0.9963       0.9848      0.9753   0.9799     0.9970
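For reference, the four metrics reported in the tables can be computed from the counts of a binary confusion matrix, with anomaly as the positive class. The sketch below uses the standard definitions; the function name and the confusion counts are hypothetical and are not taken from the paper's experiments.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute precision, recall, F1 score, and accuracy (anomaly = positive class)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for an imbalanced test set with 120 anomalies in 1,000 sequences.
metrics = detection_metrics(tp=90, fp=10, fn=30, tn=870)
```

With imbalanced data, accuracy is dominated by the normal class, which is why recall and F1 score are the more informative columns when comparing frameworks.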

on deep learning, namely, DeepLog, LogAnomaly, LogRobust, and LogBERT. Table II shows the performance of these log anomaly detection frameworks evaluated on the two log datasets.

TABLE IV
DESIGN OF PROMPTS

Prompt     Template                            Label words
Bi-LSTM    <SEM> <SEQ> h[PRO] h[PRO] [MSK]     Anomaly: "error"; Normal: "normal"
Manual-0   <SEM> <SEQ> it is [MSK]             Anomaly: "error"; Normal: "normal"
Manual-1   <SEM> <SEQ> it is [MSK]             Anomaly: "normal"; Normal: "error"
Manual-2   <SEM> <SEQ> anomaly ? [MSK]         Anomaly: "yes"; Normal: "no"
None       <SEM> <SEQ> [MSK]                   Anomaly: "error"; Normal: "normal"

The experimental results of LogPrompt are obtained under the optimal parameter combination: ALBERT-base-v2 is used as the PLM, the prompt encoder type is Bi-LSTM, and the other parameters are set to the default values of ALBERT-base-v2. In the experiment, three different random seeds are set and one model is trained per seed; each model is evaluated on the testing set after training. Table II reports the mean value of each metric over the three models.

Table II shows that LogPrompt exceeds most baseline models on the HDFS dataset; it is only slightly lower than LogRobust in recall. It outperforms all baseline models on the BGL dataset. In addition, the ablation experiments show that if LogPrompt uses only the semantic or only the sequential information of the logs, most metrics are worse than when both kinds of information are used.

RQ2: What is the effect of using different PLMs on the performance of log anomaly detection?

We compare six different BERT, RoBERTa, and ALBERT models to explore the impact of using different PLMs on the performance. Table III shows their performance on the two log datasets.

The performance of a PLM may differ across datasets. For example, RoBERTa-large performs well on the BGL dataset but poorly on the HDFS dataset. We also find that, in these experiments, the base-size models perform better than the large models. For example, BERT-base-uncased and RoBERTa-base achieve better recall than BERT-large-uncased and RoBERTa-large, respectively. Moreover, ALBERT-base-v1 and ALBERT-base-v2 perform better than the BERT and RoBERTa models. One possible reason is that they compress the model size and reduce the number of parameters through cross-layer parameter sharing and factorized embedding parameterization.

RQ3: What is the effect of using different kinds of prompts on the performance of log anomaly detection?

We use the PLM ALBERT-base-v2 for these experiments. On the one hand, a continuous template is constructed automatically based on Bi-LSTM. On the other hand, discrete templates are constructed manually. The prompt designs are shown in Table IV, and the experimental results are shown in Table V.

The experimental results show that the precision, F1 score, and accuracy of constructing a continuous template automatically based on Bi-LSTM are higher than those of constructing

TABLE V
PERFORMANCE OF USING DIFFERENT KINDS OF PROMPTS

            HDFS                                         BGL
Prompt      Precision↑  Recall↑  F1 score↑  Accuracy↑    Precision↑  Recall↑  F1 score↑  Accuracy↑
Bi-LSTM     0.9965      0.9474   0.9713     0.9963       0.9848      0.9753   0.9799     0.9970
Manual-0    0.8982      0.9849   0.9112     0.9947       0.9680      0.9734   0.9702     0.9955
Manual-1    0.9863      0.8219   0.8960     0.9857       0.9863      0.8219   0.8960     0.9857
Manual-2    0.9759      0.9420   0.9586     0.9961       0.9733      0.9795   0.9764     0.9964
None        0.3440      0.8456   0.4891     0.9196       0.6320      0.0178   0.0346     0.9253

TABLE VI
PERFORMANCE OF USING DIFFERENT KINDS OF LOSS FUNCTIONS

Dataset   Loss function         Precision↑  Recall↑  F1 score↑  Accuracy↑
HDFS      Cross entropy loss    0.8982      0.9849   0.9112     0.9947
HDFS      Focal loss            0.9965      0.9474   0.9713     0.9963
BGL       Cross entropy loss    0.9966      0.8861   0.9359     0.9912
BGL       Focal loss            0.9848      0.9753   0.9799     0.9970

[Figure 3: four scatter plots of output logits. (a) Predictions using cross entropy loss; (b) ground truths for the cross entropy loss case; (c) predictions using focal loss; (d) ground truths for the focal loss case.]

Fig. 3. Visualization of logits

discrete templates manually (Manual-0, 1, 2) on the HDFS dataset. On the BGL dataset, automatically constructing continuous templates outperforms manually constructing discrete templates in F1 score and accuracy. In conclusion, constructing continuous templates automatically is feasible and effective, and it avoids the tedious procedure of constructing templates manually.

Comparing Manual-0 with Manual-1, recall and F1 score decrease when the label word mapping is exchanged. This suggests that the model misreads the semantic information of the logs when label words are assigned meanings opposite to the ground-truth labels. For example, the label word "error" might be interpreted as having a positive meaning, so that the log is classified as a normal instance.

If [MSK] is predicted without other prompt tokens (None), all metrics decrease significantly, which implies that prompt construction is crucial to the log anomaly detection task.

RQ4: Does focal loss alleviate the log class imbalance problem compared with cross entropy loss?

We construct continuous templates automatically based on Bi-LSTM and use cross entropy loss and focal loss on the PLM ALBERT-base-v2 to conduct anomaly detection experiments. The experimental results are shown in Table VI. On the HDFS dataset, precision, F1 score and accuracy are higher with focal loss than with cross entropy loss, but recall is lower. On the BGL dataset, cross entropy loss yields a higher precision than focal loss, but lower recall and F1 score. Overall, focal loss alleviates the log class imbalance problem to a certain extent and improves the overall performance of anomaly detection.

We visualize the output logits when evaluating on the testing set of the BGL dataset, as shown in Fig. 3. Fig. 3(a) shows the predictions of ALBERT-base-v2 trained with cross entropy loss. The region to the upper left of the line y = x is classified as anomaly, and the region to the lower right is classified as normal. Fig. 3(b) shows the ground-truth labels of the logs. We can observe that many anomalies are classified as normal. Fig. 3(c) shows the predictions of ALBERT-base-v2 trained with focal loss, with the same decision regions on either side of the line y = x. Fig. 3(d) shows the ground-truth labels of the logs. Compared with cross entropy loss, the number of misclassified logs is greatly reduced.
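
The label word mapping discussed for Manual-0 and Manual-1 and the decision line y = x in Fig. 3 can be sketched together as follows. This is only an illustration under stated assumptions: the label words "normal" and "error", the mapping names, and the logit values are hypothetical (the actual label words are those listed in Table IV), and x is assumed to be the logit of the label word for the normal class while y is the logit of the label word for the anomaly class.

```python
from typing import Dict

# Hypothetical label word mappings (class name -> label word).
VERBALIZER_ALIGNED = {"normal": "normal", "anomaly": "error"}
VERBALIZER_SWAPPED = {"normal": "error", "anomaly": "normal"}  # exchanged mapping


def predict(word_logits: Dict[str, float], verbalizer: Dict[str, str]) -> str:
    """Classify one log sequence from the label word logits at the [MSK] position."""
    x = word_logits[verbalizer["normal"]]   # assumed x-axis: logit for the normal class
    y = word_logits[verbalizer["anomaly"]]  # assumed y-axis: logit for the anomaly class
    # Points above the line y = x are classified as anomaly, the rest as normal.
    return "anomaly" if y > x else "normal"


# Made-up logits for a log sequence whose [MSK] position favours "error".
logits = {"normal": 1.2, "error": 3.4}
print(predict(logits, VERBALIZER_ALIGNED))
print(predict(logits, VERBALIZER_SWAPPED))
```

With the same logits, exchanging the mapping flips the decision, which illustrates why a mapping that contradicts the ordinary meaning of the label words forces the PLM to work against its pre-trained semantics.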

In conclusion, compared with cross entropy loss, focal loss detects anomalies more accurately. Moreover, the intra-class distance within the normal and anomaly clusters is smaller and the inter-class distance between them is larger, which greatly reduces the possibility of predicting anomalous logs as normal logs.

VI. CONCLUSION

This paper proposes LogPrompt, a prompt-based log anomaly detection framework. It constructs prompts to guide the PLM to learn semantic and sequential information in the logs, improving the evaluation metrics of the log anomaly detection task. Focal loss is used instead of cross entropy loss to guide optimization during model training, alleviating the imbalance between normal and abnormal log samples. Experiments show that LogPrompt can detect anomalies more effectively and efficiently by learning semantic and sequential information from prompts, even when trained with few log data.

R EFERENCES
[1] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, “Deep learning for
anomaly detection: A review,” ACM Computing Surveys (CSUR), vol. 54,
no. 2, pp. 1–38, 2021.
[2] M. Du, F. Li, G. Zheng, and V. Srikumar, “Deeplog: Anomaly detection
and diagnosis from system logs through deep learning,” in Proceedings
of the 2017 ACM SIGSAC conference on computer and communications
security, 2017, pp. 1285–1298.
[3] W. Meng, Y. Liu, Y. Zhu, S. Zhang, D. Pei, Y. Liu, Y. Chen, R. Zhang,
S. Tao, P. Sun et al., “Loganomaly: Unsupervised detection of sequential
and quantitative anomalies in unstructured logs.” in IJCAI, vol. 19, no. 7,
2019, pp. 4739–4745.
[4] X. Zhang, Y. Xu, Q. Lin, B. Qiao, H. Zhang, Y. Dang, C. Xie, X. Yang,
Q. Cheng, Z. Li et al., “Robust log-based anomaly detection on unstable
log data,” in Proceedings of the 2019 27th ACM Joint Meeting on
European Software Engineering Conference and Symposium on the
Foundations of Software Engineering, 2019, pp. 807–817.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training
of deep bidirectional transformers for language understanding,” arXiv
preprint arXiv:1810.04805, 2018.
[6] H. Guo, S. Yuan, and X. Wu, “Logbert: Log anomaly detection via bert,”
in 2021 International Joint Conference on Neural Networks (IJCNN).
IEEE, 2021, pp. 1–8.
[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert
pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[8] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut,
“Albert: A lite bert for self-supervised learning of language representa-
tions,” arXiv preprint arXiv:1909.11942, 2019.
[9] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-
train, prompt, and predict: A systematic survey of prompting methods
in natural language processing,” arXiv preprint arXiv:2107.13586, 2021.
[10] N. Zhang, S. Deng, Z. Sun, J. Chen, W. Zhang, and H. Chen, “Relation
adversarial network for low resource knowledge graph completion,” in
Proceedings of The Web Conference 2020, 2020, pp. 1–12.
[11] T. Schick and H. Schütze, “Few-shot text generation with natural lan-
guage instructions,” in Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, 2021, pp. 390–402.
[12] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, “Gpt
understands, too,” arXiv preprint arXiv:2103.10385, 2021.
[13] M. Du and F. Li, “Spell: Streaming parsing of system event logs,” in
2016 IEEE 16th International Conference on Data Mining (ICDM).
IEEE, 2016, pp. 859–864.
[14] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing
approach with fixed depth tree,” in 2017 IEEE international conference
on web services (ICWS). IEEE, 2017, pp. 33–40.
