Tools and Benchmarks for Automated Log Parsing
Abstract—Logs are imperative in the development and maintenance process of many software systems. They record detailed runtime information that allows developers and support engineers to monitor their systems and dissect anomalous behaviors and errors. The increasing scale and complexity of modern software systems, however, make the volume of logs explode. In many cases, the traditional way of manual log inspection becomes impractical. Many recent studies, as well as industrial tools, resort to powerful text search and machine learning-based analytics solutions. Due to the unstructured nature of logs, a first crucial step is to parse log messages into structured data for subsequent analysis. In recent years, automated log parsing has been widely studied in both academia and industry, producing a series of log parsers based on different techniques. To better understand the characteristics of these log parsers, in this paper we present a comprehensive evaluation study on automated log parsing and further release the tools and benchmarks for easy reuse. More specifically, we evaluate 13 log parsers on a total of 16 log datasets spanning distributed systems, supercomputers, operating systems, mobile systems, server applications, and standalone software. We report the benchmarking results in terms of accuracy, robustness, and efficiency, which are of practical importance when deploying automated log parsing in production. We also share the success stories and lessons learned in an industrial application at Huawei. We believe that our work can serve as the basis of, and provide valuable guidance to, future research and deployment of automated log parsing.

Index Terms—Log management, log parsing, log analysis, anomaly detection, AIOps

* Part of the work was done when the author was an intern at Huawei.

/* A logging code snippet extracted from:
   hadoop/hdfs/server/datanode/BlockReceiver.java */
LOG.info("Received block " + block + " of size "
    + block.getNumBytes() + " from " + inAddr);

Log Message:
2015-10-18 18:05:29,570 INFO dfs.DataNode$PacketResponder: Received block blk_-562725280853087685 of size 67108864 from /10.251.91.84

Structured Log:
TIMESTAMP:      2015-10-18 18:05:29,570
LEVEL:          INFO
COMPONENT:      dfs.DataNode$PacketResponder
EVENT TEMPLATE: Received block <*> of size <*> from /<*>
PARAMETERS:     ["blk_-562725280853087685", "67108864", "10.251.91.84"]

Fig. 1. An Illustrative Example of Log Parsing

I. INTRODUCTION

Logs play an important role in the development and maintenance of software systems. It is a common practice to record detailed system runtime information into logs, allowing developers and support engineers to understand system behaviors and track down problems that may arise. The rich information and the pervasiveness of logs enable a wide variety of system management and diagnostic tasks, such as analyzing usage statistics [1], ensuring application security [2], identifying performance anomalies [3], [4], and diagnosing errors and crashes [5], [6].

Despite the tremendous value buried in logs, how to analyze them effectively is still a great challenge [7]. First, modern software systems routinely generate tons of logs (e.g., about gigabytes of data per hour for a commercial cloud application [8]). The huge volume of logs makes it impractical to manually inspect log messages for key diagnostic information, even with search and grep utilities. Second, log messages are inherently unstructured, because developers usually record system events using free text for convenience and flexibility [9]. This further increases the difficulty of automated analysis of log data. Many recent studies (e.g., [10]–[12]), as well as industrial solutions (e.g., Splunk [13], ELK [14], Logentries [15]), have evolved to provide powerful text search and machine learning-based analytics capabilities. To enable such log analysis, the first and foremost step is log parsing [9], a process that parses free-text raw log messages into a stream of structured events.

As the example in Fig. 1 illustrates, each log message is printed by a logging statement and records a specific system event with its message header and message content. The message header is determined by the logging framework and thus can be relatively easily extracted, including the timestamp, verbosity level (e.g., ERROR/INFO/DEBUG), and component. In contrast, it is often difficult to structurize the free-text message content written by developers.
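To make the parsing step in Fig. 1 concrete, here is a minimal Python sketch (our own illustration, not the implementation of any studied parser): it splits the example message into its header fields with a regular expression and, given an already-known event template, recovers the parameters. The header pattern and field names are illustrative choices.

import re

# Illustrative header pattern for the Fig. 1 example:
# timestamp, verbosity level, component, and free-text content.
HEADER = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<component>\S+): (?P<content>.*)"
)

def parse(line, template):
    """Split a raw log line into header fields and template parameters."""
    match = HEADER.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Turn the event template into a regex: each "<*>" wildcard captures one parameter.
    pattern = "^" + re.escape(template).replace(re.escape("<*>"), "(.*?)") + "$"
    params = re.match(pattern, fields["content"])
    fields["template"] = template
    fields["parameters"] = list(params.groups()) if params else []
    return fields

line = ("2015-10-18 18:05:29,570 INFO dfs.DataNode$PacketResponder: "
        "Received block blk_-562725280853087685 of size 67108864 from /10.251.91.84")
print(parse(line, "Received block <*> of size <*> from /<*>"))

The hard part, of course, is that a real parser must discover the event template itself; the techniques surveyed in Section II differ precisely in how they do that.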
TABLE I
Summary of Industrial Log Management Tools and Services
TABLE II
Summary of Automated Log Parsing Tools

Log Parser | Year | Technique                  | Mode    | Efficiency | Coverage | Preprocessing | Open Source | Industrial Use
SLCT       | 2003 | Frequent pattern mining    | Offline | High       | ✗        | ✗             | ✓           | ✗
AEL        | 2008 | Heuristics                 | Offline | High       | ✓        | ✓             | ✗           | ✓
IPLoM      | 2012 | Iterative partitioning     | Offline | High       | ✓        | ✗             | ✗           | ✗
LKE        | 2009 | Clustering                 | Offline | Low        | ✓        | ✓             | ✗           | ✓
LFA        | 2010 | Frequent pattern mining    | Offline | High       | ✓        | ✗             | ✗           | ✗
LogSig     | 2011 | Clustering                 | Offline | Medium     | ✓        | ✗             | ✗           | ✗
SHISO      | 2013 | Clustering                 | Online  | High       | ✓        | ✗             | ✗           | ✗
LogCluster | 2015 | Frequent pattern mining    | Offline | High       | ✗        | ✗             | ✓           | ✓
LenMa      | 2016 | Clustering                 | Online  | Medium     | ✓        | ✗             | ✓           | ✗
LogMine    | 2016 | Clustering                 | Offline | Medium     | ✓        | ✓             | ✗           | ✓
Spell      | 2016 | Longest common subsequence | Online  | High       | ✓        | ✗             | ✗           | ✗
Drain      | 2017 | Parsing tree               | Online  | High       | ✓        | ✓             | ✓           | ✗
MoLFI      | 2018 | Evolutionary algorithms    | Offline | Low        | ✓        | ✓             | ✓           | ✗
to many duplicate issues. It is crucial to automatically identify duplicate issues to reduce the efforts of developers and support engineers. Microsoft has reported some studies [11], [35], [36] on this task, in which structured event data are required.
• Performance modeling. Facebook has recently reported a use case [3] applying logs as a valuable data source for performance modeling, where potential performance improvements can be quickly validated. A prerequisite to this approach is to extract all possible event templates from the logs; the performance model construction takes event sequences as inputs.
• Failure diagnosis. Manual failure diagnosis is a time-consuming and challenging task, since logs are not only of huge volume but also extremely verbose and messy. Some recent progress [4], [37] has been made to automate root cause analysis based on machine learning techniques. Likewise, log parsing is deemed a prerequisite.

B. Characteristics of Log Parsers

As an important step in log analysis, automated approaches to log parsing have been widely studied, producing an abundance of log parsers ranging from research prototypes to industrial solutions. To give an overview of existing log parsers, we summarize their key characteristics below.

1) Industrial Solutions. Table I provides a summary of some industrial log analysis and management tools. With the upsurge of big data, many cloud providers as well as startup companies provide on-premise or software-as-a-service (SaaS) solutions for log management. They enable powerful log search, visualization, and machine learning (ML) analytics capabilities. To illustrate, we list 10 representative products in the market, including both well-established ones (e.g., Splunk [13]) and newly started ones (e.g., Logz.io [38]). As a key component, automated log parsing has recently risen as an appealing selling point in some products [39]–[41]. Current solutions for automated log parsing, however, offer built-in parsing support only for common log types, such as Apache and Nginx logs [39]. For other types of logs, they rely on users to perform custom parsing with regex scripts, grok patterns [16], or a parsing wizard. Because such parsing solutions require deep domain knowledge, they fall outside the scope of this study.

2) Research Studies. Table II provides a summary of 13 representative log parsers proposed in the literature, which are the main subjects of our study. These log parsers all aim at automated log parsing but may differ in quality. After reviewing the literature, we identify the following key characteristics of log parsers that are of practical importance.

Technique. Different log parsers may adopt different log parsing strategies. We categorize them into 7 types: frequent pattern mining, clustering, iterative partitioning, longest common subsequence, parsing tree, evolutionary algorithms, and other heuristics. We present more details of these log parsing methods in Section II-C.

Mode. According to different scenarios of log parsing, log parsers can be categorized into two main modes, i.e., offline and online.
Offline log parsers are a type of batch processing and require all the log data to be available before parsing. On the contrary, online log parsers process log messages one by one in a streaming manner, which is often more practical when logs are collected as a stream.

Efficiency. Efficiency is always a major concern for log parsing in practice, considering the large volume of logs. An inefficient log parser can greatly hinder subsequent log analysis tasks with low-latency requirements, such as real-time anomaly detection and performance monitoring. In Table II, the efficiency of current tools is categorized into three levels: high, medium, and low.

Coverage. Coverage denotes the capability of a log parser to successfully parse all input log messages, in which case it is marked with "✓". An "✗" indicates that a log parser can only structurize part of the logs. For example, SLCT can extract frequently occurring event templates by applying frequent pattern mining, but fails to handle rare event templates precisely. A high-quality log parser should be able to process all input log messages, since ignoring any important event may miss the opportunity for anomaly detection and root cause identification.

Preprocessing. Preprocessing is a step that removes some common variable values, such as IP addresses and numbers, by manually specifying simple regular expressions. The preprocessing step is straightforward but requires some additional manual work. We mark "✓" if a preprocessing step is explicitly specified in a log parsing method, and "✗" otherwise.

Open source. An open-source log parser allows researchers and practitioners to easily reuse and further improve existing log parsing methods. This can not only benefit related research but also facilitate wide adoption of automated log parsing. However, current open-source tools for log parsing are still limited. We mark "✓" if an existing log parser is open-source, and "✗" otherwise.

Industrial use. A log parser has more practical value and should be more reliable if it has been deployed in production for industrial use. We mark "✓" if a log parser has been reported to be in use in an industrial setting, and "✗" otherwise.

C. Techniques of Log Parsers

In this work, we have studied a total of 13 log parsers. We briefly summarize the techniques used by these log parsers from the following aspects:

1) Frequent Pattern Mining: A frequent pattern is a set of items that occurs frequently in a data set. Likewise, an event template can be seen as a set of constant tokens that occur frequently in logs. Therefore, frequent pattern mining is a straightforward approach to automated log parsing. Examples include SLCT [20], LFA [42], and LogCluster [21]. All three log parsers are offline methods and follow a similar parsing procedure: 1) traversing over the log data in several passes, 2) building frequent itemsets (e.g., tokens, token-position pairs) at each traversal, 3) grouping log messages into several clusters, and 4) extracting event templates from each cluster. SLCT is, to our knowledge, the first work that applies frequent pattern mining to log parsing. Furthermore, LFA considers the token frequency distribution in each log message instead of the whole log data, in order to parse rare log messages. LogCluster is an extension of SLCT that is robust to shifts in token positions.
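As a toy illustration of this procedure (our own simplification in the spirit of SLCT, not its actual implementation): tokens that occur frequently at a given position are kept as constants, and the rest become wildcards.

from collections import Counter

def mine_templates(messages, support=2):
    """Toy SLCT-style mining: frequent (position, token) pairs become constants."""
    counts = Counter()
    # Pass 1: count how often each token occurs at each position.
    for msg in messages:
        for pos, tok in enumerate(msg.split()):
            counts[(pos, tok)] += 1
    # Pass 2: rebuild each message, replacing infrequent tokens with a wildcard.
    templates = set()
    for msg in messages:
        templates.add(" ".join(
            tok if counts[(pos, tok)] >= support else "<*>"
            for pos, tok in enumerate(msg.split())
        ))
    return templates

logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Disk quota exceeded",
    "Disk quota exceeded",
]
print(mine_templates(logs))
# -> {'Connection from <*> closed', 'Disk quota exceeded'} (set order may vary)

Real miners are more elaborate; e.g., LogCluster drops the fixed-position assumption to tolerate shifted tokens.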
2) Clustering: An event template forms a natural pattern of a group of log messages. From this view, log parsing can be modeled as a clustering problem of log messages. Examples that apply clustering algorithms to log parsing include 3 offline methods (i.e., LKE [23], LogSig [43], and LogMine [27]) and 2 online methods (i.e., SHISO [44] and LenMa [26]). Specifically, LKE employs hierarchical clustering based on weighted edit distances between pairwise log messages. LogSig is a message-signature-based algorithm that clusters log messages into a predefined number of clusters. LogMine generates event templates in a hierarchical clustering way, grouping log messages into clusters from bottom to top. SHISO and LenMa are both online methods that parse logs in a similar streaming manner. For each newly arriving log message, the parsers first compute its similarity to the representative event templates of existing log clusters. The log message is added to an existing cluster if it is successfully matched; otherwise, a new log cluster is created. Then, the corresponding event template is updated accordingly.
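The streaming match-or-create loop shared by these online methods can be sketched as follows (a minimal illustration using a naive token-agreement similarity; SHISO and LenMa use more refined measures, e.g., LenMa compares word-length vectors):

def similarity(tokens, template):
    """Fraction of positions where the message agrees with the template."""
    if len(tokens) != len(template):
        return 0.0
    return sum(t == c for t, c in zip(tokens, template)) / len(template)

def parse_stream(messages, threshold=0.6):
    """Match each incoming message to a cluster template, or create a new one."""
    templates = []  # each template is a token list; "<*>" marks variable positions
    for msg in messages:
        tokens = msg.split()
        best = max(templates, key=lambda t: similarity(tokens, t), default=None)
        if best is not None and similarity(tokens, best) >= threshold:
            # Update the matched template: disagreeing positions become wildcards.
            for i, tok in enumerate(tokens):
                if best[i] != tok:
                    best[i] = "<*>"
        else:
            templates.append(tokens)  # no good match: start a new cluster
    return [" ".join(t) for t in templates]

print(parse_stream([
    "Received block blk_1 of size 67108864",
    "Received block blk_2 of size 131072",
    "Deleting block blk_1 file /data/blk_1",
]))
# -> ['Received block <*> of size <*>', 'Deleting block blk_1 file /data/blk_1']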
3) Heuristics: Different from general text data, log messages have some unique characteristics. As such, some work (i.e., AEL [45], IPLoM [22], Drain [25]) proposes heuristics-based log parsing methods. Specifically, AEL separates log messages into multiple groups by comparing the occurrences of constant tokens and variable tokens. IPLoM employs an iterative partitioning strategy, which partitions log messages into groups by message length, token position, and mapping relation. Drain applies a fixed-depth tree structure to represent log messages and extracts common templates efficiently. These heuristics make use of the characteristics of logs and perform quite well in many cases.
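As a rough sketch of Drain's fixed-depth tree idea (our own toy version; the real Drain adds preprocessing, wildcard-aware internal nodes, and a configurable depth), messages are routed by token count and leading token before similarity-based template matching:

class ToyDrainTree:
    """Toy fixed-depth parse tree: route by token count, then by first token,
    then match against the templates stored in the resulting leaf."""

    def __init__(self, sim_threshold=0.5):
        self.sim_threshold = sim_threshold
        self.root = {}  # token count -> first token -> list of token-list templates

    def parse(self, message):
        tokens = message.split()
        leaf = self.root.setdefault(len(tokens), {}).setdefault(tokens[0], [])
        best, best_sim = None, 0.0
        for template in leaf:
            sim = sum(a == b for a, b in zip(tokens, template)) / len(tokens)
            if sim > best_sim:
                best, best_sim = template, sim
        if best is None or best_sim < self.sim_threshold:
            leaf.append(list(tokens))   # unseen event: create a template group
            return " ".join(tokens)
        for i, tok in enumerate(tokens):  # merge: disagreements become wildcards
            if best[i] != tok:
                best[i] = "<*>"
        return " ".join(best)

tree = ToyDrainTree()
tree.parse("Received block blk_1 of size 67108864")
print(tree.parse("Received block blk_2 of size 131072"))
# -> Received block <*> of size <*>

The fixed-depth routing is what keeps matching cheap: only templates with the same length and leading token are ever compared.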
4) Others: Some other methods also exist. For example, Spell [24] utilizes the longest common subsequence algorithm to parse logs in a streaming manner. Recently, Messaoudi et al. [28] proposed MoLFI, which models log parsing as a multi-objective optimization problem and solves it using evolutionary algorithms.

D. Tool Implementation

Although automated log parsing has been studied for several years, it is still not a well-received technique in industry, largely due to the lack of publicly available tools that are ready for industrial use. For operation engineers, who often have limited expertise in machine learning techniques, implementing an automated log parsing tool requires non-trivial effort, which may exceed the overhead of manually crafting regular expressions. Our work aims to bridge this gap between academia and industry and to promote the adoption of automated log parsing. We have implemented an open-source log parsing toolkit, namely logparser, and released a large benchmark set as well. As a part-time project, the implementation of logparser has taken over two years and comprises 11.7K LOC in Python.
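For illustration, a Drain run through logparser looks roughly like the snippet below, adapted from the demo scripts shipped with the toolkit (parameter names and defaults may differ slightly across versions; the paths are placeholders):

from logparser import Drain

input_dir  = 'logs/HDFS/'    # directory containing the raw log file
output_dir = 'result/'       # directory for the structured CSV outputs
log_file   = 'HDFS_2k.log'
# Header layout of each message; <Content> is the free-text part to be parsed.
log_format = '<Date> <Time> <Pid> <Level> <Component>: <Content>'
# Optional preprocessing: regexes for common variables (block IDs, IP addresses).
regex = [r'blk_(|-)[0-9]+', r'(\d+\.){3}\d+(:\d+)?']

parser = Drain.LogParser(log_format, indir=input_dir, outdir=output_dir,
                         depth=4, st=0.5, rex=regex)
parser.parse(log_file)  # writes a *_structured.csv and a *_templates.csv file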
TABLE III
Summary of Loghub Datasets
Dataset  Description  Time Span  Data Size  #Messages  #Templates (total)  #Templates (2k sample)
Distributed system logs
HDFS Hadoop distributed file system log 38.7 hours 1.47 GB 11,175,629 30 14
Hadoop Hadoop mapreduce job log N.A. 48.61 MB 394,308 298 114
Spark Spark job log N.A. 2.75 GB 33,236,604 456 36
ZooKeeper ZooKeeper service log 26.7 days 9.95 MB 74,380 95 50
OpenStack OpenStack software log N.A. 60.01 MB 207,820 51 43
Supercomputer logs
BGL Blue Gene/L supercomputer log 214.7 days 708.76 MB 4,747,963 619 120
HPC High performance cluster log N.A. 32.00 MB 433,489 104 46
Thunderbird Thunderbird supercomputer log 244 days 29.60 GB 211,212,192 4,040 149
Operating system logs
Windows Windows event log 226.7 days 26.09 GB 114,608,388 4,833 50
Linux Linux system log 263.9 days 2.25 MB 25,567 488 118
Mac Mac OS log 7.0 days 16.09 MB 117,283 2,214 341
Mobile system logs
Android Android framework log N.A. 3.38 GB 30,348,042 76,923 166
HealthApp Health app log 10.5 days 22.44 MB 253,395 220 75
Server application logs
Apache Apache server error log 263.9 days 4.90 MB 56,481 44 6
OpenSSH OpenSSH server log 28.4 days 70.02 MB 655,146 62 27
Standalone software logs
Proxifier Proxifier software log N.A. 2.42 MB 21,329 9 8
Currently, logparser contains a total of 13 log parsing methods proposed by researchers and practitioners. Among them, five log parsers (i.e., SLCT, LogCluster, LenMa, Drain, MoLFI) are open-sourced from existing research work. However, they are implemented in different programming languages and have different input/output formats. Examples and documentation are also missing or incomplete, making them difficult to try out. For ease of use, we define a standard and unified input/output interface for the different log parsing methods and wrap the existing tools into a single Python package. Logparser takes a raw log file with free-text log messages as input, and outputs a structured log file and an event template file with aggregated event counts. The outputs can be easily fed into subsequent log mining tasks. Our logparser toolkit can help engineers quickly identify the strengths and weaknesses of different log parsing methods and evaluate their suitability for industrial use cases.

III. EVALUATION

In this section, we evaluate 13 log parsers on 16 benchmark datasets and report the benchmarking results in terms of accuracy, robustness, and efficiency. These are three key qualities of interest when applying log parsing in production.
• Accuracy measures the ability of a log parser to distinguish constant parts from variable parts. Accuracy is one main focus of existing log parsing studies, because an inaccurate log parser can greatly limit the effectiveness of downstream log mining tasks [9].
• Robustness of a log parser measures the consistency of its accuracy on log datasets of different sizes or from different systems. A robust log parser should perform consistently across different datasets, so that it can be used in versatile production environments.
• Efficiency measures the processing speed of a log parser. We evaluate efficiency by recording the time that a parser takes to parse a specific dataset. The less time a log parser consumes, the higher the efficiency it provides.

A. Experimental Setup

Dataset. Real-world log data are currently scarce in public due to confidentiality issues, which hinders the research and development of new log analysis techniques. In this work, we have released, in our loghub data repository [31], a large collection of logs from 16 different systems spanning distributed systems, supercomputers, operating systems, mobile systems, server applications, and standalone software. Table III presents a summary of the datasets. Some of them (e.g., HDFS [18], Hadoop [11], BGL [30]) are production logs released by previous studies, while the others (e.g., Spark, ZooKeeper, HealthApp, Android) are collected from real-world systems in our lab. Loghub contains a total of 440 million log messages, amounting to 77 GB in size. To the best of our knowledge, it is the largest collection of log datasets. Wherever possible, the logs are not sanitized, anonymized, or modified in any way. They are freely accessible for research purposes. At the time of writing, our loghub datasets have been downloaded over 1,000 times by more than 150 organizations from both industry (35%) and academia (65%).

In this work, we use the loghub datasets as benchmarks to evaluate all existing log parsers.
TABLE IV
Accuracy of Log Parsers on Different Datasets
Dataset SLCT AEL IPLoM LKE LFA LogSig SHISO LogCluster LenMa LogMine Spell Drain MoLFI Best
HDFS 0.545 0.998 1* 1* 0.885 0.850 0.998 0.546 0.998 0.851 1* 0.998 0.998 1
Hadoop 0.423 0.538 0.954 0.670 0.900 0.633 0.867 0.563 0.885 0.870 0.778 0.948 0.957* 0.957
Spark 0.685 0.905 0.920 0.634 0.994* 0.544 0.906 0.799 0.884 0.576 0.905 0.920 0.418 0.994
Zookeeper 0.726 0.921 0.962 0.438 0.839 0.738 0.660 0.732 0.841 0.688 0.964 0.967* 0.839 0.967
OpenStack 0.867 0.758 0.871* 0.787 0.200 0.200 0.722 0.696 0.743 0.743 0.764 0.733 0.213 0.871
BGL 0.573 0.758 0.939 0.128 0.854 0.227 0.711 0.835 0.69 0.723 0.787 0.963* 0.960 0.963
HPC 0.839 0.903* 0.824 0.574 0.817 0.354 0.325 0.788 0.830 0.784 0.654 0.887 0.824 0.903
Thunderb. 0.882 0.941 0.663 0.813 0.649 0.694 0.576 0.599 0.943 0.919 0.844 0.955* 0.646 0.955
Windows 0.697 0.690 0.567 0.990 0.588 0.689 0.701 0.713 0.566 0.993 0.989 0.997* 0.406 0.997
Linux 0.297 0.673 0.672 0.519 0.279 0.169 0.701 0.629 0.701* 0.612 0.605 0.690 0.284 0.701
Mac 0.558 0.764 0.673 0.369 0.599 0.478 0.595 0.604 0.698 0.872* 0.757 0.787 0.636 0.872
Android 0.882 0.682 0.712 0.909 0.616 0.548 0.585 0.798 0.880 0.504 0.919* 0.911 0.788 0.919
HealthApp 0.331 0.568 0.822* 0.592 0.549 0.235 0.397 0.531 0.174 0.684 0.639 0.780 0.440 0.822
Apache 0.731 1* 1* 1* 1* 0.582 1* 0.709 1* 1* 1* 1* 1* 1
OpenSSH 0.521 0.538 0.802 0.426 0.501 0.373 0.619 0.426 0.925* 0.431 0.554 0.788 0.500 0.925
Proxifier 0.518 0.518 0.515 0.495 0.026 0.967* 0.517 0.951 0.508 0.517 0.527 0.527 0.013 0.967
Average 0.637 0.754 0.777 0.563 0.652 0.482 0.669 0.665 0.721 0.694 0.751 0.865* 0.605 N.A.
The large size and diversity of the loghub datasets allow us not only to measure the accuracy of log parsers but also to test their robustness and efficiency. To allow easy reproduction of the benchmarking results, we randomly sample 2,000 log messages from each dataset and manually label the event templates as ground truth. Specifically, in Table III, "#Templates (2k sample)" indicates the number of event templates in the log samples, while "#Templates (total)" shows the total number of event templates generated by a rule-based log parser.

Accuracy Metric. To quantify the effectiveness of automated log parsing, as in [24], we define the parsing accuracy (PA) metric as the ratio of correctly parsed log messages over the total number of log messages. After parsing, each log message has an event template, which in turn corresponds to a group of messages with the same template. A log message is considered correctly parsed if and only if its event template corresponds to the same group of log messages as the ground truth does. For example, if a log sequence [E1, E2, E2] is parsed to [E1, E4, E5], we get PA = 1/3, since the 2nd and 3rd messages are not grouped together. In contrast to the standard evaluation metrics used in previous studies, such as precision, recall, and F1-measure [9], [22], [28], PA is a more rigorous metric: in PA, partially matched events are considered incorrect.
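A minimal sketch of this grouping-based computation (our own restatement of the definition above, not the evaluation script itself):

from collections import defaultdict

def parsing_accuracy(truth, predicted):
    """PA: fraction of messages whose predicted event groups exactly
    the same set of messages as the ground-truth event does."""
    def groups(labels):
        g = defaultdict(set)
        for i, label in enumerate(labels):
            g[label].add(i)
        return g
    truth_groups, pred_groups = groups(truth), groups(predicted)
    correct = sum(
        1 for i, label in enumerate(predicted)
        if pred_groups[label] == truth_groups[truth[i]]
    )
    return correct / len(truth)

# The paper's example: ground truth [E1, E2, E2] parsed as [E1, E4, E5].
print(parsing_accuracy(["E1", "E2", "E2"], ["E1", "E4", "E5"]))  # 0.333...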
The parameters of all the log parsers were fine-tuned over more than 10 runs, and the best results are reported to avoid bias from randomization. All the experiments were conducted on a server with 32 Intel(R) Xeon(R) 2.60GHz CPUs, 62GB of RAM, and Ubuntu 16.04.3 LTS installed.

B. Accuracy of Log Parsers

In this part, we evaluate the accuracy of log parsers. We found that some log parsers (e.g., LKE) cannot handle the original datasets in reasonable time (e.g., even within days). Thus, for a fair comparison, the accuracy experiments are conducted on sampled subsets, each containing 2,000 log messages. The log messages are randomly sampled from the original log dataset yet retain its key properties, such as event redundancy and event variety.

Table IV presents the accuracy results of the 13 log parsers evaluated on the 16 log datasets. Each row denotes the parsing accuracy of the different log parsers on one dataset, which facilitates comparison among log parsers. Each column represents the parsing accuracy of one log parser over the different datasets, which helps identify its robustness across different types of logs. In particular, we mark accuracy values greater than 0.9 in boldface since they indicate high accuracy in practice. For each dataset, the best accuracy is highlighted with an asterisk "*" and shown in the column "Best". We can observe that most of the datasets are accurately (over 90%) parsed by at least one log parser. In total, 8 out of the 13 log parsers attain the best accuracy on at least two log datasets. Moreover, some log parsers can parse the HDFS and Apache datasets with 100% accuracy. This is because HDFS and Apache error logs have relatively simple event templates that are easy to identify. However, several types of logs (e.g., OpenStack, Linux, Mac, HealthApp) still cannot be parsed accurately, due to their complex structure and abundant event templates (e.g., 341 templates in the Mac logs). Therefore, further improvements should be made towards better parsing of such complex log data.

To measure the overall effectiveness of log parsers, we compute the average accuracy of each log parser across the different datasets, as shown in the last row of Table IV. We can observe that, on average, the most accurate log parser is Drain, which attains high accuracy on 9 out of 16 datasets. The other top-ranked log parsers include IPLoM, AEL, and Spell, which achieve high accuracy on 6 datasets. In contrast, the four log parsers with the lowest average accuracy are LogSig, LFA, MoLFI, and LKE.
Fig. 2. Accuracy Distribution of Log Parsers across Different Types of Logs
Therefore, we can briefly conclude that log parsers should take full advantage of the inherent structure and characteristics of log messages to achieve good parsing accuracy, instead of directly applying standard algorithms such as clustering and frequent pattern mining.

One may have noticed that the above accuracy results are lower than those reported by previous papers (e.g., [25], [27]). The reasons are as follows: 1) we use a more rigorous accuracy metric, which rejects partially matched events; and 2) for fairness of comparison, we apply the same preprocessing rules (e.g., IP or number replacement) to each log parser, which are much fewer than those reported before.

C. Robustness of Log Parsers

Robustness is crucial to the practical use of a log parser in production environments. In this part, we evaluate the robustness of log parsers from two aspects: 1) robustness across different types of logs and 2) robustness on different volumes of logs.

Figure 2 shows a boxplot that indicates the accuracy distribution of each log parser across the 16 log datasets. For each box, the horizontal lines from bottom to top correspond to the minimum, 25th-percentile, median, 75th-percentile, and maximum accuracy values. The diamond mark denotes an outlier point, since LenMa only achieves an accuracy of 0.174 on the HealthApp logs. From left to right in the figure, the log parsers are arranged in ascending order of the average accuracy shown in Table IV; that is, LogSig has the lowest and Drain the highest average accuracy. A good log parser should be able to parse many different types of logs for general use. However, we can observe that, although most log parsers achieve a maximal accuracy over 0.9, they have a large variance over different datasets. There is still no log parser that performs well on all log data. Therefore, we suggest that users first try different log parsers on their own logs. Currently, Drain performs the best among all the 13 log parsers under study: it not only attains the highest accuracy on average, but also shows the smallest variance.

In addition, we evaluate the robustness of log parsers on different volumes of logs. In this experiment, we select six log parsers, i.e., MoLFI, Spell, LenMa, IPLoM, AEL, and Drain. They have achieved high accuracy (over 90%) on more than four log datasets, as shown in Table IV. Meanwhile, MoLFI is the most recently published log parser, and the other five log parsers are ranked at the top in Figure 2. We also choose three large datasets, i.e., HDFS, BGL, and Android. The raw logs have a volume of over 1GB each, and the ground-truth templates are readily available for accuracy computation. HDFS and BGL have also been used as benchmark datasets in previous work [22], [24]. For each log dataset, we vary the volume from 300KB to 1GB, while fixing the parameters of the log parsers that were fine-tuned on the 2k log samples. Specifically, 300KB is roughly the size of each 2k log sample. We truncate the raw log files to obtain samples of the other volumes (e.g., 1GB). Figure 3 shows the parsing accuracy results. Note that some lines are incomplete in the figure, because methods like MoLFI and LenMa cannot finish parsing within reasonable time (6 hours in our experiment). A good log parser should be robust to such changes of log volume. However, we can see that parameters tuned on small log samples do not fit large log data well. All six best-performing log parsers show a drop in accuracy or obvious fluctuations as the log volume increases. The log parsers, except IPLoM, are relatively stable on the HDFS data, achieving an accuracy over 80%. Drain and AEL also show relatively stable accuracy on the BGL data. However, on the Android data, all the parsers suffer a large degradation in accuracy, because Android logs have quite a large number of event templates and are more complex to parse. Compared to the other log parsers, Drain achieves relatively stable accuracy and shows its robustness under changing volumes of logs.

D. Efficiency of Log Parsers

Efficiency is an important aspect of log parsers when handling log data at large scale. To measure the efficiency of a log parser, we record the running time it needs to finish the entire parsing process. Similar to the setting of the previous experiment, we evaluate six log parsers on three log datasets.
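The measurement itself is plain wall-clock timing over a full run; a minimal sketch (our own illustration, where parser stands for any benchmarked tool exposing a parse method):

import time

def measure_running_time(parser, log_file):
    """Wall-clock time for one complete parsing run over a log file."""
    start = time.perf_counter()
    parser.parse(log_file)
    return time.perf_counter() - start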
Fig. 3. Accuracy of Log Parsers on Different Volumes of Logs: (a) HDFS, (b) BGL, (c) Android
The results are presented in Figure 4. It is obvious that the parsing time increases with the log size on all three datasets. Drain and IPLoM have the best efficiency, scaling linearly with the log size; both methods can finish parsing 1GB of logs within tens of minutes. AEL also performs well except on the large BGL data, because AEL needs to compare against every log message in a bin, and BGL has a large bin size when the dataset is large. The other log parsers do not scale well with the volume of logs. In particular, LenMa and MoLFI cannot even finish parsing 1GB of BGL or Android data within 6 hours. The efficiency of a log parser also depends on the type of logs. When the log data are simple and have a limited number of event templates, log parsing is often an efficient process. For instance, HDFS logs contain only 30 event templates, so all the log parsers can process 1GB of data within an hour. However, the parsing process becomes slow for logs with a large number of event templates (e.g., Android).

IV. INDUSTRIAL DEPLOYMENT

In this section, we share our experiences of deploying automated log parsing in production at Huawei. System X (anonymized name) is one of the popular products of Huawei. Logs are collected during the whole product lifecycle, from development, testing, and beta testing to online monitoring. They are used as a main data source for failure diagnosis, performance optimization, user profiling, resource allocation, and other tasks for improving product quality. When the system was still small in scale, many of these analysis tasks could be performed manually. However, after rapid growth in recent years, System X nowadays produces terabytes of log data daily. It has become impractical for engineers to manually inspect the logs for diagnostic information, which requires not only non-trivial effort but also deep knowledge of the logs. In many cases, event statistics and correlations are valuable hints that help engineers make informed decisions.

To reduce the efforts of engineers, a LogKit platform has been built to automate the log analysis process, including log search, rule-based diagnosis, and dashboard reporting of event statistics and correlations. A key feature of this platform is to parse logs into structured data. At first, log parsing was done in an ad-hoc way by writing regular expressions to match the events of interest. However, the parsing rules quickly became unmanageable. First, the existing parsing rules cannot cover all types of logs, since it is time-consuming to write parsing rules one by one. Second, System X is evolving quickly, leading to frequent changes of log structures. Maintenance of such a rule base for log parsing has become a new pain point. As a result, automated log parsing is in high demand.

Success stories. Through close collaboration with the product team, we have successfully deployed automated log parsing in production. After detailed comparisons of different log parsers, as described in Section III, we chose Drain because of its superiority in accuracy, robustness, and efficiency. In addition, by taking advantage of the characteristics of the logs of System X, we have optimized the Drain approach in the following aspects. 1) Preprocessing. The logs of System X have over ten thousand event templates as well as a wide range of parameters. As we have done in [9], we apply
a simple yet effective preprocessing step to filter common parameters, such as IPs, package names, numbers, and file paths. This greatly simplifies the problem for subsequent parsing. Notably, some of the preprocessing scripts are extracted from the original parsing rule base, which was already available. 2) Deduplication. Many log messages comprise only a constant string, with no parameters inside (e.g., "VM terminated."). Recurrences of such log messages result in a large number of duplicates in the logs. Meanwhile, the preprocessing step produces many duplicate log messages as well (e.g., "Connected to <IP>"), in which the common parameters have been removed. We deduplicate these log messages to reduce the data size, which significantly improves the efficiency of log parsing.
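A minimal sketch of these two steps (our own illustration with made-up patterns, not the production scripts):

import re
from collections import Counter

# Illustrative patterns for common parameters (IP addresses, file paths, numbers).
COMMON_PARAMS = [
    (re.compile(r"(\d{1,3}\.){3}\d{1,3}(:\d+)?"), "<IP>"),
    (re.compile(r"(/[\w.-]+)+"), "<PATH>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),  # applied last so IP octets are not split
]

def preprocess(message):
    """Replace common variable values before template mining."""
    for pattern, token in COMMON_PARAMS:
        message = pattern.sub(token, message)
    return message

def deduplicate(messages):
    """Collapse repeated (preprocessed) messages, keeping their counts,
    so each distinct message only needs to be parsed once."""
    return Counter(preprocess(m) for m in messages)

logs = [
    "Connected to 10.251.91.84:9000",
    "Connected to 10.251.0.8:9000",
    "VM terminated.",
    "VM terminated.",
]
print(deduplicate(logs))
# -> Counter({'Connected to <IP>': 2, 'VM terminated.': 2})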
3) Partitioning. The log message header contains two fields: verbosity level and component. In fact, log messages of different levels or components are always printed by different logging statements (e.g., DEBUG vs. INFO). Therefore, it is beneficial to partition log messages into different groups according to the level and component information. This naturally divides the original problem into independent subproblems. 4) Parallelization. The partitioning of logs not only narrows down the search space of event templates, but also allows for parallelization. In particular, we extend Drain with Spark and naturally exploit the above log data partitioning for quick parallelization. By now, we have successfully run Drain in production for more than one year, attaining over 90% accuracy in System X. We believe that the above optimizations are general and can easily be carried over to other similar systems.
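Combining 3) and 4), the partition-then-parse idea can be outlined with PySpark roughly as follows (an illustrative sketch with a made-up header layout and a trivial stand-in for the per-group parser; this is not the Huawei implementation):

from pyspark.sql import SparkSession

def split_header(line):
    # Made-up header layout: "LEVEL COMPONENT: free-text content".
    header, content = line.split(": ", 1)
    level, component = header.split(" ", 1)
    return ((level, component), content)

def parse_group(group):
    (level, component), messages = group
    # Stand-in for a per-group parser such as Drain; here we just deduplicate.
    return [(level, component, m) for m in set(messages)]

spark = SparkSession.builder.appName("log-parsing").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///logs/systemx/*.log")  # placeholder path
templates = (lines.map(split_header)  # key each message by (level, component)
                  .groupByKey()       # one independent subproblem per key
                  .flatMap(parse_group))
templates.saveAsTextFile("hdfs:///logs/systemx_templates")

Each (level, component) group is parsed independently and in parallel, which is exactly what makes the partitioning profitable.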
Potential improvements. During the industrial deployment of Drain, we have observed some directions that need further improvement. 1) State identification. State variables are of significant importance in log analysis (e.g., "DB connection ok" vs. "DB connection error"). However, current log parsers cannot distinguish state values from other parameters. 2) Dealing with log messages of variable lengths. A single logging statement may produce log messages of variable lengths (e.g., when printing a list). Current log parsers are length-sensitive and fail to deal with such cases, resulting in degraded accuracy. 3) Automated parameter tuning. Most current log parsers apply data-driven approaches to extract event templates, and some model parameters need to be tuned manually. It is desirable to develop a mechanism for automated parameter tuning. We call for research efforts to realize the above potential improvements, which would contribute to better adoption of automated log parsing.

V. RELATED WORK

Log parsing is only a small part of the broad problem of log management. In this section, we review the related work from the aspects of log quality, log parsing, and log analysis.

Log quality. The effectiveness of log analysis is directly determined by the quality of logs. To enhance log quality, recent studies have focused on providing informative logging guidance or effective logging mechanisms during development. Yuan et al. [46] and Fu et al. [47] report the logging practices in open-source and industrial systems, respectively. Zhu et al. [48] propose LogAdvisor, a classification-based method to make logging suggestions on where to log. Zhao et al. [49] further provide an entropy metric to determine logging points with maximal coverage of a control flow. Yuan et al. [50] design LogEnhancer to enhance existing logging statements with informative variables. Recently, He et al. [51] have conducted an empirical study on the natural language descriptions in logging statements. Ding et al. [52] provide a cost-effective way for dynamic logging with limited overhead.

Log parsing. Log parsing has been widely studied in recent years and can be categorized into rule-based, source code-based, and data-driven parsing. Most current log management tools support rule-based parsing (e.g., [40], [41]). Some studies [18], [19] make use of static analysis techniques for source code-based parsing. Data-driven log parsing approaches are the main focus of this paper, most of which have been summarized in Section II. More recently, He et al. [53] have studied large-scale log parsing through parallelization on Spark. Thaler et al. [54] model textual log messages with deep neural networks. Gao et al. [55] apply an optimization algorithm to discover multi-line structures in logs.

Log analysis. Log analysis is a research area that has been studied for decades due to its practical importance. There is an abundance of techniques and applications of log analysis. Typical applications include anomaly detection [12], [18], [23], [56], problem diagnosis [4], [5], runtime verification [57], and performance modeling [3]. To address the challenges involved in log analysis, many data analytics techniques have been developed. For example, Xu et al. [18] apply principal component analysis (PCA) to identify anomalous issues. Du et al. [10] investigate the use of deep learning to model event sequences. Lin et al. [11] develop a clustering algorithm to group similar issues. Our work on log parsing serves as the basis for such analyses and can greatly reduce the effort required in the subsequent log analysis process.

VI. CONCLUSION

Log parsing plays an important role in system maintenance, because it serves as the first step towards automated log analysis. In recent years, many research efforts have been devoted to automated log parsing. However, there is a lack of publicly available log parsing tools and benchmark datasets. In this paper, we implement a total of 13 log parsing methods and evaluate them on 16 log datasets from different types of software systems. We have open-sourced our toolkit and released the benchmark datasets to researchers and practitioners for easy reuse. Moreover, we share our experience of deploying automated log parsing at Huawei. We hope our work, together with the released tools and benchmarks, can facilitate more research on log analysis.

ACKNOWLEDGMENTS

This work was supported by the National Key R&D Program of China (2018YFB1004804), the National Natural Science Foundation of China (61722214, 61502401, 61332010),
the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (2016ZT06D211), and the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14210717 of the General Research Fund). Zibin Zheng is the corresponding author.

REFERENCES

[1] G. Lee, J. J. Lin, C. Liu, A. Lorek, and D. V. Ryaboy, "The unified logging infrastructure for data analytics at Twitter," PVLDB, vol. 5, no. 12, pp. 1771–1780, 2012.
[2] A. Oprea, Z. Li, T. Yen, S. H. Chin, and S. A. Alrwais, "Detection of early-stage enterprise infection by mining large-scale log data," in DSN, 2015, pp. 45–56.
[3] M. Chow, D. Meisner, J. Flinn, D. Peek, and T. F. Wenisch, "The mystery machine: End-to-end performance analysis of large-scale internet services," in OSDI, 2014, pp. 217–231.
[4] K. Nagaraj, C. E. Killian, and J. Neville, "Structured comparative analysis of systems logs to diagnose performance problems," in NSDI, 2012, pp. 353–366.
[5] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy, "SherLog: Error diagnosis by connecting clues from run-time logs," in ASPLOS, 2010, pp. 143–154.
[6] X. Xu, L. Zhu, I. Weber, L. Bass, and D. Sun, "POD-Diagnosis: Error diagnosis of sporadic operations on cloud applications," in DSN, 2014, pp. 252–263.
[7] A. J. Oliner, A. Ganapathi, and W. Xu, "Advances and challenges in log analysis," Commun. ACM, vol. 55, no. 2, pp. 55–61, 2012.
[8] H. Mi, H. Wang, Y. Zhou, M. R. Lyu, and H. Cai, "Toward fine-grained, unsupervised, scalable performance diagnosis for production cloud computing systems," IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 6, pp. 1245–1255, 2013.
[9] P. He, J. Zhu, S. He, J. Li, and M. R. Lyu, "An evaluation study on log parsing and its use in log mining," in DSN, 2016, pp. 654–661.
[10] M. Du, F. Li, G. Zheng, and V. Srikumar, "DeepLog: Anomaly detection and diagnosis from system logs through deep learning," in CCS, 2017, pp. 1285–1298.
[11] Q. Lin, H. Zhang, J. Lou, Y. Zhang, and X. Chen, "Log clustering based problem identification for online service systems," in ICSE, 2016.
[12] S. He, J. Zhu, P. He, and M. R. Lyu, "Experience report: System log analysis for anomaly detection," in ISSRE, 2016, pp. 207–218.
[13] Splunk. [Online]. Available: http://www.splunk.com
[14] ELK. [Online]. Available: https://www.elastic.co/elk-stack
[15] Logentries. [Online]. Available: https://logentries.com
[16] A beginner's guide to logstash grok. [Online]. Available: https://logz.io/blog/logstash-grok
[17] W. Xu, "System problem detection by mining console logs," Ph.D. dissertation, University of California, Berkeley, 2010.
[18] W. Xu, L. Huang, A. Fox, D. A. Patterson, and M. I. Jordan, "Detecting large-scale system problems by mining console logs," in SOSP, 2009, pp. 117–132.
[19] M. Nagappan, K. Wu, and M. A. Vouk, "Efficiently extracting operational profiles from execution logs using suffix arrays," in ISSRE, 2009, pp. 41–50.
[20] R. Vaarandi, "A data clustering algorithm for mining patterns from event logs," in IPOM, 2003.
[21] R. Vaarandi and M. Pihelgas, "LogCluster - a data clustering and pattern mining algorithm for event logs," in CNSM, 2015, pp. 1–7.
[22] A. Makanju, A. Zincir-Heywood, and E. Milios, "Clustering event logs using iterative partitioning," in KDD, 2009.
[23] Q. Fu, J.-G. Lou, Y. Wang, and J. Li, "Execution anomaly detection in distributed systems through unstructured log analysis," in ICDM, 2009, pp. 149–158.
[24] M. Du and F. Li, "Spell: Streaming parsing of system event logs," in ICDM, 2016, pp. 859–864.
[25] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, "Drain: An online log parsing approach with fixed depth tree," in ICWS, 2017, pp. 33–40.
[26] K. Shima, "Length matters: Clustering system log messages using length of words," arXiv:1611.03213, 2016.
[27] H. Hamooni, B. Debnath, J. Xu, H. Zhang, G. Jiang, and A. Mueen, "LogMine: Fast pattern recognition for log analytics," in CIKM, 2016, pp. 1573–1582.
[28] S. Messaoudi, A. Panichella, D. Bianculli, L. Briand, and R. Sasnauskas, "A search-based approach for accurate identification of log message formats," in ICPC, 2018.
[29] Loggly: Cloud log management service. [Online]. Available: https://www.loggly.com
[30] A. Oliner and J. Stearley, "What supercomputers say: A study of five system logs," in DSN, 2007.
[31] Loghub: A collection of system log datasets for intelligent log analysis. [Online]. Available: https://github.com/logpai/loghub
[32] Overview of logs-based metrics. [Online]. Available: https://cloud.google.com/logging/docs/logs-based-metrics
[33] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier, "Using Magpie for request extraction and workload modelling," in OSDI, 2004, pp. 259–272.
[34] J. Lou, Q. Fu, S. Yang, Y. Xu, and J. Li, "Mining invariants from console logs for system problem detection," in ATC, 2010.
[35] R. Ding, Q. Fu, J. G. Lou, Q. Lin, D. Zhang, and T. Xie, "Mining historical issue repositories to heal large-scale online service systems," in DSN, 2014, pp. 311–322.
[36] M. Lim, J. Lou, H. Zhang, Q. Fu, A. B. J. Teoh, Q. Lin, R. Ding, and D. Zhang, "Identifying recurrent and unknown performance issues," in ICDM, 2014, pp. 320–329.
[37] Automated root cause analysis for Spark application failures. [Online]. Available: https://www.oreilly.com/ideas/automated-root-cause-analysis-for-spark-application-failures
[38] Logz.io. [Online]. Available: https://logz.io
[39] New automated log parsing. [Online]. Available: https://blog.rapid7.com/2016/03/03/new-automated-log-parsing
[40] Log parsing - automated, easy to use, and efficient. [Online]. Available: https://logz.io/product/log-parsing
[41] Automated parsing log types. [Online]. Available: https://www.loggly.com/docs/automated-parsing
[42] M. Nagappan and M. A. Vouk, "Abstracting log lines to log event types for mining software system logs," in MSR, 2010, pp. 114–117.
[43] L. Tang, T. Li, and C.-S. Perng, "LogSig: Generating system events from raw textual logs," in CIKM, 2011, pp. 785–794.
[44] M. Mizutani, "Incremental mining of system log format," in SCC, 2013, pp. 595–602.
[45] Z. M. Jiang, A. E. Hassan, P. Flora, and G. Hamann, "Abstracting execution logs to execution events for enterprise applications," in QSIC, 2008, pp. 181–186.
[46] D. Yuan, S. Park, and Y. Zhou, "Characterizing logging practices in open-source software," in ICSE, 2012, pp. 102–112.
[47] Q. Fu, J. Zhu, W. Hu, J.-G. Lou, R. Ding, Q. Lin, D. Zhang, and T. Xie, "Where do developers log? An empirical study on logging practices in industry," in ICSE, 2014, pp. 24–33.
[48] J. Zhu, P. He, Q. Fu, H. Zhang, M. R. Lyu, and D. Zhang, "Learning to log: Helping developers make informed logging decisions," in ICSE, vol. 1, 2015, pp. 415–425.
[49] X. Zhao, K. Rodrigues, Y. Luo, M. Stumm, D. Yuan, and Y. Zhou, "Log20: Fully automated optimal placement of log printing statements under specified overhead threshold," in SOSP, 2017, pp. 565–581.
[50] D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage, "Improving software diagnosability via log enhancement," in ASPLOS, 2011, pp. 3–14.
[51] P. He, Z. Chen, S. He, and M. R. Lyu, "Characterizing the natural language descriptions in software logging statements," in ASE, 2018, pp. 178–189.
[52] R. Ding, H. Zhou, J. Lou, H. Zhang, Q. Lin, Q. Fu, D. Zhang, and T. Xie, "Log2: A cost-aware logging mechanism for performance diagnosis," in ATC, 2015.
[53] P. He, J. Zhu, S. He, J. Li, and M. R. Lyu, "Towards automated log parsing for large-scale log data analysis," IEEE Trans. Dependable Sec. Comput. (TDSC), vol. 15, no. 6, pp. 931–944, 2018.
[54] S. Thaler, V. Menkovski, and M. Petkovic, "Towards a neural language model for signature extraction from forensic logs," in ISDFS, 2017.
[55] Y. Gao, S. Huang, and A. G. Parameswaran, "Navigating the data lake with DATAMARAN: Automatically extracting structure from log datasets," in SIGMOD, 2018, pp. 943–958.
[56] S. He, Q. Lin, J. Lou, H. Zhang, M. R. Lyu, and D. Zhang, "Identifying impactful service system problems via log analysis," in FSE, 2018, pp. 60–70.
[57] W. Shang, Z. Jiang, H. Hemmati, B. Adams, A. Hassan, and P. Martin, "Assisting developers of big data analytics applications when deploying on Hadoop clouds," in ICSE, 2013, pp. 402–411.