Log-Based Software Monitoring: A Systematic Mapping Study
ABSTRACT
Modern software development and operations rely on monitoring to understand
how systems behave in production. The data provided by application logs and
runtime environment are essential to detect and diagnose undesired behavior and
improve system reliability. However, despite the rich ecosystem around industry-
ready log solutions, monitoring complex systems and getting insights from log data
remains a challenge. Researchers and practitioners have been actively working to
address several challenges related to logs, e.g., how to effectively provide better
tooling support for logging decisions to developers, how to effectively process and
store log data, and how to extract insights from log data. A holistic view of the
research effort on logging practices and automated log analysis is key to provide
directions and disseminate the state-of-the-art for technology transfer. In this paper,
we study 108 papers (72 research track papers, 24 journals, and 12 industry track
papers) from different communities (e.g., machine learning, software engineering,
and systems) and structure the research field in light of the life-cycle of log data.
Our analysis shows that (1) logging is challenging not only in open-source projects
but also in industry, (2) machine learning is a promising approach to enable a
contextual analysis of source code for log recommendation but further investigation
is required to assess the usability of those tools in practice, (3) few studies approached
efficient persistence of log data, and (4) there are open opportunities to analyze
application logs and to evaluate state-of-the-art log analysis techniques in a DevOps
context.
Figure 1 Overview of the life-cycle of log data: the developer instruments the source code (Logging), the running system emits log data, and the ops engineer queries and visualizes it through dashboards (Log Analysis).
system behaves in production. In fact, the symbiosis between development and operations
resulted in a mix known as DevOps (Bass, Weber & Zhu, 2015; Dyck, Penners & Lichter,
2015; Roche, 2013), where both roles work in a continuous cycle. In addition, given the rich
nature of data produced by large-scale systems in production and the popularization of
machine learning, there is an increasing trend to adopt artificial intelligence to automate
operations. Gartner (2019) refers to this movement as AIOps and also highlights
companies providing automated operations as a service. Unsurprisingly, the demand to
analyze operations data fostered the creation of a multi-million dollar business
(TechCrunch, 2017; Investor’s Business Daily, 2018) and a plethora of open-source and
commercial tools to process and manage log data. For instance, the Elastic stack
(https://www.elastic.co/what-is/elk-stack) (a.k.a. “ELK” stack) is a popular option to collect,
process, and analyze log data (possibly from different sources) in a centralized manner.
Figure 1 provides an overview of how the life-cycle of log data relates to different
stages of the development cycle. First, the developer instruments the source code with API
calls to a logging framework (e.g., SLF4J or Log4J) to record events about the internal
state of the system (in this case, whenever the reference “data” is “null”). Once the system
is live in production, it generates data continuously whenever the execution flow reaches
the log statements. The data provided by application logs (i.e., data generated from API
calls of logging frameworks) and runtime environments (e.g., CPU and disk usage) are
essential to detect and diagnose undesired behavior and improve software reliability. In
practice, companies rely on a logging infrastructure to process and manage that data.
In the context of the Elastic stack, possible components would be Elasticsearch
(https://www.elastic.co/elasticsearch/), Logstash (https://www.elastic.co/logstash) and
Kibana (https://www.elastic.co/kibana): Logstash is a log processor tool with several
plugins available to parse and extract log data, Kibana provides an interface for
visualization, query, and exploration of log data, and Elasticsearch, the core component of
the Elastic stack, is a distributed and fault-tolerant search engine built on top of Apache
Lucene (https://lucene.apache.org). Variants of those components from other vendors include
Grafana (https://grafana.com) for the user interface and Fluentd (https://www.fluentd.org) for log processing.
Figure 2 Overview of survey methodology: our four steps consist of the discovery of related studies (“Search Process”), the selection of
relevant studies (“Study Selection”), the mapping process (“Classification”), and the update for papers published in 2019 (“Survey
Update”). Full-size DOI: 10.7717/peerj-cs.489/fig-2
SURVEY METHODOLOGY
The goal of this paper is to discover, categorize, and summarize the key research results in
log-based software monitoring. To this end, we perform a systematic mapping study to
provide a holistic view of the literature in logging and automated log analysis. Concretely,
we investigate the following research questions:
RQ1: What are the publication trends in research on log-based monitoring over the years?
RQ2: What are the different research scopes of log-based monitoring?
The first research question (RQ1) addresses the historical growth of the research field.
Answering this research question enables us to identify the popular venues and the
communities (e.g., Software Engineering, Distributed Systems) that have been focusing
on log-based monitoring innovation. Furthermore, we aim at investigating the participation
of industry in the research field. Our analysis can help researchers make more informed
decisions regarding venues for paper submission. In addition, our
analysis also serves as a guide to practitioners willing to engage with the research community
either by attending conferences or looking for references to study and experiment. The
second research question (RQ2) addresses the actual mapping of the primary studies.
As illustrated in Fig. 1, the life-cycle of log data contains different inter-connected contexts
(i.e., “Logging”, “Log Infrastructure”, and “Log Analysis”) with their own challenges and
concerns that span the entire development cycle. Answering this research question enables
us to identify those concerns for each context and quantify the research effort by the number
of primary studies in a particular category. In addition, we aim at providing an overview
of the studies so practitioners and researchers are able to use our mapping study as a starting
point for an in-depth analysis of a particular topic of interest.
Overall, we follow the standard guidelines for systematic mapping (Petersen et al., 2008).
Our survey methodology is divided into four parts as illustrated in Fig. 2. First, we perform
preliminary searches to derive our search criteria and build an initial list of potential
relevant studies based on five data sources. Next, we apply our inclusion/exclusion criteria
to arrive at the eventual list of selected papers up to 2018 (when we first conducted the
survey). We then conduct the data extraction and classification procedures. Finally, we
update the results of our survey to include papers published in 2019.
Study selection
We conduct the selection process by assessing the 4,187 entries according to inclusion/
exclusion criteria and by selecting publications from highly ranked venues. We define the
criteria as follows:
The rationale for criterion C1 is that major venues use English as the standard language for
submission. The rationale for criterion C2 is to avoid including secondary studies in our
mapping, as suggested by Kitchenham & Charters (2007). In addition, the process of
applying this criterion allows us to identify other systematic mappings and systematic
literature reviews related to ours. The rationale for criterion C3 is that some databases
return gray literature as well as short papers; our focus is on full peer-reviewed research
papers, which we consider mature research, ready for real-world tests. Note that different
venues might have different page-count requirements to determine whether a
submission is a full or short paper, and these requirements might change over time.
We consulted the page limits of each venue to avoid unfair exclusion. The rationale
for criterion C4 is to exclude papers that are unrelated to the scope of this mapping
study. We noticed that some of the results are in the context of, e.g., mathematics and
environmental studies. While we could have tweaked our search criteria to minimize
the occurrence of those false positives (e.g., NOT deforestation), we were unable to
systematically derive all keywords to exclude; therefore, we favored a higher false positive
rate in exchange for a better chance of discovering relevant papers.
The first author manually performed the inclusion procedure. He analyzed the titles and
abstracts of all the papers, marking each paper as “in” or “out”. During this process, he
applied the criteria and categorized the reasons for exclusion. For instance,
whenever an entry failed criterion C4, it was classified as “Out of Scope”. The
categories we used are: “Out of Scope”, “Short/workshop paper”, “Not a research paper”,
“Unpublished” (e.g., unpublished self-archived paper indexed by Google Scholar),
“Secondary study”, and “Non-English manuscript”. It is worth mentioning that we flagged
three entries as “Duplicate” as our merging step missed these cases due to special
characters in the title. After applying the selection criteria, we removed 3,872 entries
resulting in 315 entries.
In order to filter the remaining 315 papers by rank, we used the CORE Conference
Rank (CORE Rank) (http://www.core.edu.au/conference-portal) as a reference. We
considered studies published only in venues ranked as A* or A. According to the CORE
Rank, those categories indicate that the venue is widely known in the computer science
community and has a strict review process by experienced researchers. After applying the
rank criteria, we removed 219 papers.
Our selection consists of (315 − 219 =) 96 papers after applying inclusion/exclusion
criteria (step 1) and filtering by venue rank (step 2). Table 1 summarizes the selection
process.
Figure 4 Data extraction and classification for RQ2. The dashed arrows denote the use of the data
schema by the researchers with the primary studies. Full-size DOI: 10.7717/peerj-cs.489/fig-4
complete meta-analysis is out of the scope of our study, we believe the extracted data is
sufficient to address the research question. Figure 3 summarizes the process for RQ1.
To answer RQ2, we collect the abstracts from the primary studies. In this process, we
structure the abstract to better identify the motivation of the study, what problem the
authors are addressing, how the researchers are mitigating the problem, and the results of
the study. Given the diverse set of problems and domains, we first group the studies
according to their overall context (e.g., whether the paper relates to “Logging”, “Log
Infrastructure”, or “Log Analysis”). To mitigate self-bias, we conducted two independent
triages and compared our results. In case of divergence, we review the paper in depth to
assign the context that best fits the paper. To derive the classification schema for each
context, we perform the keywording of abstracts (Petersen et al., 2008). In this process,
we extract keywords in the abstract (or introduction, if necessary) and cluster similar
keywords to create categories. We perform this process using a random sample of papers
to derive an initial classification schema.
Later, with all the papers initially classified, the authors explored the specific objectives
of each paper and reviewed the assigned category. To that aim, the first and second authors
performed card sorting (Spencer & Warfel, 2004; Usability.gov, 2019) to determine the
goal of each of the studied papers. Note that, in case new categories emerge in this process,
we either generalize them into one of the existing categories or enhance our classification
schema to update our view of the different objectives in a particular research area. After the
first round of card sorting, we noticed that some of the groups (often the ones with high
number of papers) could be further broken down into subcategories (we discuss the
categories and related subcategories in the Results section).
The first author conducted two separate blinded classifications at different periods of
time to measure the degree of adherence to the schema, given that classification is subject
to interpretation and, thus, a source of bias. The two classifications converged in 83% of
the cases (80 out of the 96 identified papers). The divergences were then discussed with the
second author of this paper. Furthermore, the second author reviewed the resulting
classification. Note that, while a paper may address more than one category, we choose the
category related to the most significant contribution of that paper. Figure 4 summarizes the
process for RQ2.
Figure 5 Growth of publication types over the years. Labels indicate the number of publications per
type in a specific year. There are 108 papers in total. Full-size DOI: 10.7717/peerj-cs.489/fig-5
Survey update
As of October 2020, we updated our survey to include papers published in 2019, since
we first conducted this analysis in December 2018. To this end, we select all 11
papers from 2018 and perform forward snowballing to fetch a preliminary list of papers
from 2019. We use snowballing for simplicity since we can leverage the “Cited By” feature
from Google Scholar rather than scraping data from all five digital libraries. It is worth
mentioning that we limit the results up to 2019 to avoid incomplete results for 2020.
For the preliminary list of 2019, we apply the same selection and rank criteria
(see Section “Study Selection”); then, we analyze and map the studies according to the
existing classification schema (see Section “Data Extraction and Classification”). In this
process, we identify 12 new papers and merge them with our existing dataset. Our final
dataset consists of (96 + 12 =) 108 papers.
RESULTS
Publication trends (RQ1)
Figure 5 highlights the growth of publications from 1992 to 2019. Interest in logging
has been continuously increasing since the early 2000s. During this time span, we
observed the appearance of industry track papers reporting applied research in a real
context. This gives some evidence that the growing interest in the topic attracted not only
researchers from different areas but also companies, fostering the collaboration between
academia and industry.
We identified 108 papers (72 research track papers, 24 journals, and 12 industry track
papers) published in 46 highly ranked venues spanning different communities (Table 2).
Table 2 highlights the distribution of venues grouped by the research community,
e.g., there are 44 papers published on 10 Software Engineering venues.
Table 3 highlights the most recurring venues in our dataset (we omitted venues with fewer
than three papers for brevity). The “International Conference on Software Engineering
(ICSE)”, the “Empirical Software Engineering Journal (EMSE)”, and the “International
Conference on Dependable Systems and Networks (DSN)” are the top three recurring
venues related to the subject and are well-established venues. DSN and ICSE are
conferences with more than 40 editions each and EMSE is a journal with an average of
five issues per year since 1996. At a glance, we noticed that papers from DSN have an
emphasis on log analysis of system logs while papers from ICSE and EMSE have an
emphasis on development aspects of logging practices (more details about the research
areas in the next section). Note that the venues in Table 3 also account for 65% (71 out of 108)
of the primary studies in our dataset.
Table 3 Most recurring venues in our dataset (venues with fewer than three papers omitted).
International Conference on Software Engineering (ICSE), 10 papers: Andrews & Zhang (2003), Yuan, Park & Zhou (2012), Beschastnikh et al. (2014), Fu et al. (2014a), Pecchia et al. (2015), Zhu et al. (2015), Lin et al. (2016), Chen & Jiang (2017a), Li et al. (2019b), Zhu et al. (2019)
Empirical Software Engineering Journal (EMSE), 9 papers: Huynh & Miller (2009), Shang, Nagappan & Hassan (2015), Russo, Succi & Pedrycz (2015), Chen & Jiang (2017b), Li, Shang & Hassan (2017), Hassani et al. (2018), Li et al. (2018), Zeng et al. (2019), Li et al. (2019a)
IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 8 papers: Oliner & Stearley (2007), Lim, Singh & Yajnik (2008), Cinque et al. (2010), Di Martino, Cinque & Cotroneo (2012), El-Sayed & Schroeder (2013), Oprea et al. (2015), He et al. (2016a), Neves, Machado & Pereira (2018)
International Symposium on Software Reliability Engineering (ISSRE), 7 papers: Tang & Iyer (1992), Mariani & Pastore (2008), Banerjee, Srikanth & Cukic (2010), Pecchia & Russo (2012), Farshchi et al. (2015), He et al. (2016b), Bertero et al. (2017)
International Conference on Automated Software Engineering (ASE), 5 papers: Andrews (1998), Chen et al. (2018), He et al. (2018a), Ren et al. (2019), Liu et al. (2019a)
International Symposium on Reliable Distributed Systems (SRDS), 5 papers: Zhou et al. (2010), Kc & Gu (2011), Fu et al. (2012), Chuah et al. (2013), Gurumdimma et al. (2016)
ACM International Conference on Knowledge Discovery and Data Mining (KDD), 4 papers: Makanju, Zincir-Heywood & Milios (2009), Nandi et al. (2016), Wu, Anchuri & Li (2017), Li et al. (2017)
IEEE International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 4 papers: Prewett (2005), Yoon & Squicciarini (2014), Lin et al. (2015), Di et al. (2017)
IEEE Transactions on Software Engineering (TSE), 4 papers: Andrews & Zhang (2003), Tian, Rudraraju & Li (2004), Cinque, Cotroneo & Pecchia (2013), Liu et al. (2019b)
Annual Computer Security Applications Conference (ACSAC), 3 papers: Abad et al. (2003), Barse & Jonsson (2004), Yen et al. (2013)
IBM Journal of Research and Development, 3 papers: Aharoni et al. (2011), Ramakrishna et al. (2017), Wang et al. (2017)
International Conference on Software Maintenance and Evolution (ICSME), 3 papers: Shang et al. (2014), Zhi et al. (2019), Anu et al. (2019)
IEEE International Conference on Data Mining (ICDM), 3 papers: Fu et al. (2009), Xu et al. (2009a), Tang & Li (2010)
Journal of Systems and Software (JSS), 3 papers: Mavridis & Karatza (2017), Bao et al. (2018), Farshchi et al. (2018)
Total: 71
Logging
Log messages are usually in the form of free text and may expose parts of the system
state (e.g., exceptions and variable values) to provide additional context. The full log
statement also includes a severity level to indicate the purpose of that statement. Logging
frameworks provide developers with different log levels: debug for low-level logging, info to
provide information on the system execution, error to indicate an unexpected state that
may compromise the normal execution of the application, and fatal to indicate a severe
state that might terminate the execution of the application. Logging an application involves
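To make those severity levels concrete, the sketch below shows a small, hypothetical method instrumented with Log4j 2 (one of the logging frameworks mentioned earlier); the class, method, and variable names are illustrative only and not taken from any of the studied systems.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class PaymentProcessor {

    // One logger per class is the usual Log4j 2 convention.
    private static final Logger logger = LogManager.getLogger(PaymentProcessor.class);

    public void process(String orderId, String customer) {
        // debug: low-level detail, typically disabled in production
        logger.debug("Processing order {}", orderId);

        if (customer == null) {
            // error: unexpected state that may compromise normal execution;
            // the message exposes part of the system state to aid later diagnosis
            logger.error("Customer reference is null for order {}", orderId);
            return;
        }

        // info: high-level information about the system execution
        // (fatal would be reserved for a severe state that terminates the application)
        logger.info("Order {} accepted for customer {}", orderId, customer);
    }
}
```

At runtime, only statements at or above the configured level are emitted, which is why low-level debug calls can remain in the code without flooding production logs.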
Empirical studies
Understanding how practitioners deal with the log engineering process in a real scenario is
key to identify open problems and provide research directions. Papers in this category aim
at addressing this agenda through empirical studies in open-source projects (and their
communities).
Yuan, Park & Zhou (2012) conducted the first empirical study focused on
understanding logging practices. They investigated the pervasiveness of logging, the
benefits of logging, and how log-related code changes over time in four open-source
projects (Apache httpd, OpenSSH, PostgreSQL, and Squid). In summary, while logging
was widely adopted in the projects and was beneficial for failure diagnosis, they show that
logging as a practice relies on the developer’s experience. Most of the recurring changes
were updates to the content of the log statement.
Later, Chen & Jiang (2017b) conducted a replication study with a broader corpus: 21
Java-based projects from the Apache Foundation. Both studies confirm that logging code is
actively maintained and that log changes are recurrent; however, the presence of log data in
bug reports is not necessarily correlated with the resolution time of bug fixes (Chen & Jiang,
2017b). This is understandable as resolution time also relates to the complexity of the
reported issue.
It is worth mentioning that the need for tooling support for logging also applies in an
industry setting. For instance, Pecchia et al. (2015) show that
the lack of format conventions in log messages, while not severe for manual analysis,
undermines the use of automatic analysis. They suggest that a tool to detect inconsistent
Log requirements
An important requirement of log data is that it must be informative and useful to a
particular purpose. Papers in this subcategory aim at evaluating whether log statements
can deliver expected data, given a known requirement.
Log infrastructure
The infrastructure supporting the analysis process plays an important role because
the analysis may involve the aggregation and selection of high volumes of data.
The requirements for the data processing infrastructure depend on the nature of the
analysis and the nature of the log data. For instance, popular log processors, e.g., Logstash
and Fluentd, provide regular expressions out-of-the-box to extract data from well-
known log formats of popular web servers (e.g., Apache Tomcat and Nginx). However,
extracting content from highly unstructured data into a meaningful schema is not trivial.
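As an illustration of that kind of out-of-the-box extraction, the sketch below parses a line in the Apache "common log format" with a plain regular expression. It is a simplified stand-in for the ready-made patterns shipped with processors such as Logstash and Fluentd, not their actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccessLogParser {

    // Simplified pattern for the Apache "common log format":
    // host ident authuser [timestamp] "request" status bytes
    private static final Pattern COMMON_LOG = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";

        Matcher m = COMMON_LOG.matcher(line);
        if (m.matches()) {
            // Turns the unstructured line into named fields.
            System.out.println("host    = " + m.group(1));
            System.out.println("time    = " + m.group(4));
            System.out.println("request = " + m.group(5));
            System.out.println("status  = " + m.group(6));
            System.out.println("bytes   = " + m.group(7));
        }
    }
}
```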
LOG INFRASTRUCTURE deals with the tooling support necessary to make the further
analysis feasible. For instance, the data representation might influence the efficiency of data
aggregation. Other important concerns include the ability to handle log data for real-
time or offline analysis and scalability to handle the increasing volume of data.
Log parsing
Parsing is the backbone of many log analysis techniques. Some analyses operate under the
assumption that source code is unavailable; therefore, they rely on parsing techniques to
process log data. Given that log messages often have variable content, the main challenge
tackled by these papers is to identify which log messages describe the same event. For
example, “Connection from A port B” and “Connection from C port D” represent the same
event. The heart of studies in parsing is the template extraction from raw log data.
Fundamentally, this process consists of identifying the constant and variable parts of raw
log messages.
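The sketch below captures the essence of that step with a deliberately naive heuristic: two messages with the same number of tokens are aligned position by position, and positions whose tokens differ are abstracted into a wildcard. Published parsers are considerably more sophisticated (dictionaries, clustering, trees), but the underlying separation of constant and variable parts is the same.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveTemplateExtractor {

    /**
     * Merges two log messages with the same token count into a template:
     * tokens that agree form the constant part, tokens that differ become
     * the variable part, represented here by the wildcard "<*>".
     */
    public static String merge(String a, String b) {
        String[] ta = a.split("\\s+");
        String[] tb = b.split("\\s+");
        if (ta.length != tb.length) {
            throw new IllegalArgumentException("different token counts; not the same event under this heuristic");
        }
        List<String> template = new ArrayList<>();
        for (int i = 0; i < ta.length; i++) {
            template.add(ta[i].equals(tb[i]) ? ta[i] : "<*>");
        }
        return String.join(" ", template);
    }

    public static void main(String[] args) {
        // Prints: "Connection from <*> port <*>"
        System.out.println(merge("Connection from A port B", "Connection from C port D"));
    }
}
```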
Several approaches rely on the “textual similarity” between the log messages. Aharon
et al. (2009) create a dictionary of all words that appear in the log message and use the
frequency of each word to cluster log messages together. Somewhat similar, IPLOM
(Iterative Partitioning Log Mining) leverages the similarities between log messages related
to the same event, e.g., number, position, and variability of tokens (Makanju, Zincir-
Heywood & Milios, 2009; Makanju, Zincir-Heywood & Milios, 2012). Liang et al. (2007)
also build a dictionary out of the keywords that appear in the logs. Next, each log is
converted to a binary vector, with each element representing whether the log contains
that keyword. With these vectors, the authors compute the correlation between any two
events.
Somewhat different from others, Gainaru et al. (2011) cluster log messages by searching
for the best place to split a log message into its “constant” and its “variable” parts. These
clusters are self-adaptive as new log messages are processed in a streamed fashion.
Hamooni et al. (2016) also use string similarity to cluster logs. Interestingly, the authors
made use of map-reduce to speed up the processing. Finally, Zhou et al. (2010)
propose a fuzzy match algorithm based on the contextual overlap between log lines.
Transforming logs into “sequences” is another way of clustering logs. Lin et al.
(2016) convert logs into vectors, where each vector contains a sequence of log events of a
given task, and each event has a different weight, calculated in different ways. Tang &
Li (2010) propose LOGTREE, a semi-structural way of representing a log message. The
overall idea is to represent a log message as a tree, where each node is a token, extracted via
a context-free grammar parser that the authors wrote for each of the studied systems.
Interestingly, in this paper, the authors raise awareness to the drawbacks of clustering
techniques that consider only word/term information for template extraction. According
to them, log messages related to the same event often do not share a single word.
From an empirical perspective, He et al. (2016a) compared four log parsers on five
datasets with over 10 million raw log messages and evaluated their effectiveness in a
real log-mining task. The authors show, among many other findings, that current log
parsing methods already achieve high accuracy, but do not scale well to large log data.
Later, Zhu et al. (2019) extended the former study and evaluated a total of 13 parsing
Log storage
Modern complex systems easily generate giga- or petabytes of log data a day. Thus, in the
log data life-cycle, storage plays an important role as, when not handled carefully, it might
become the bottleneck of the analysis process. Researchers and practitioners have been
addressing this problem by offloading computation and storage to server farms and
leveraging distributed processing.
Mavridis & Karatza (2017) frame the problem of log analysis at scale as a “big data”
problem. Authors evaluated the performance and resource usage of two popular big data
solutions (Apache Hadoop and Apache Spark) with web access logs. Their benchmarks
show that both approaches scale with the number of nodes in a cluster. However, Spark is
more efficient for data processing since it minimizes reads and writes to disk. Results
suggest that Hadoop is better suited for offline analysis (i.e., batch processing) while Spark
is better suited for online analysis (i.e., stream processing). Indeed, as mentioned earlier,
He et al. (2018b) leverage Spark for parallel parsing because of its fast in-memory
processing.
Another approach to reduce storage costs consists of data compression techniques for
efficient analysis (Lin et al., 2015; Liu et al., 2019a). Lin et al. (2015) argue that while
traditional data compression algorithms are useful to reduce storage footprint, the
compression-decompression loop to query data undermines the efficiency of log analysis.
The rationale is that traditional compression mechanisms (e.g., gzip) perform compression
and decompression in blocks of data. In the context of log analysis, this results in a waste
of CPU cycles spent compressing and decompressing unnecessary log data. They propose a
compression approach named Cowik that operates at the granularity of log entries.
They evaluated their approach in a log search and log joining system. Results suggest
that the approach is able to achieve better performance on query operations and produce
the same join results with less memory. Liu et al. (2019a) propose a different approach
named LOGZIP based on an intermediate representation of raw data that exploits the
structure of log messages. The underlying idea is to remove redundant information from
log events and compress the intermediate representation rather than raw logs. Results
indicate higher compression rates compared to baseline approaches (including COWIK).
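As a rough illustration of such an intermediate representation (our own simplification, not the actual Cowik or LOGZIP design), the sketch below stores each message as a template identifier plus its variable parts, so the redundant constant text is kept only once in a dictionary before a general-purpose compressor is applied to the encoded lines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TemplateEncoder {

    // Dictionary: template string -> numeric id (the constant text is stored once).
    private final Map<String, Integer> templateIds = new LinkedHashMap<>();
    private final List<String> encoded = new ArrayList<>();

    /** Encodes a message as "<templateId>|<param1>,<param2>,...". */
    public void add(String template, List<String> parameters) {
        int id = templateIds.computeIfAbsent(template, t -> templateIds.size());
        encoded.add(id + "|" + String.join(",", parameters));
    }

    public Map<String, Integer> dictionary() { return templateIds; }
    public List<String> lines() { return encoded; }

    public static void main(String[] args) {
        TemplateEncoder enc = new TemplateEncoder();
        enc.add("Connection from <*> port <*>", List.of("A", "B"));
        enc.add("Connection from <*> port <*>", List.of("C", "D"));
        // The repeated constant text appears only in the dictionary;
        // the compact lines ("0|A,B", "0|C,D") are what would be compressed.
        System.out.println(enc.dictionary());
        System.out.println(enc.lines());
    }
}
```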
Anomaly detection
Anomaly detection techniques aim to find undesired patterns in log data given that
manual analysis is time-consuming, error-prone, and unfeasible in many cases.
We observe that a significant part of the research in the logging area is focused on this type
of analysis. Often, these techniques focus on identifying problems in software systems.
Based on the assumption that an “anomaly” is something worth investigating, these
techniques look for anomalous traces in the log files.
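To give a flavor of the simpler statistical variants, the sketch below builds an event-count profile over a set of sessions and flags a session whose counts deviate strongly from the average. It is a toy illustration of count-vector-based detection, not a reimplementation of any of the approaches cited below.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountVectorDetector {

    /**
     * Flags a session as anomalous when the count of any event type deviates
     * from the mean per-session count by more than `threshold` standard
     * deviations. Sessions are given as lists of event template ids.
     */
    public static boolean isAnomalous(List<List<String>> allSessions, List<String> session, double threshold) {
        Map<String, double[]> stats = new HashMap<>(); // eventId -> {sum, sumOfSquares}
        for (List<String> s : allSessions) {
            for (Map.Entry<String, Integer> e : countEvents(s).entrySet()) {
                double[] acc = stats.computeIfAbsent(e.getKey(), k -> new double[2]);
                acc[0] += e.getValue();
                acc[1] += (double) e.getValue() * e.getValue();
            }
        }
        int n = allSessions.size();
        Map<String, Integer> counts = countEvents(session);
        for (Map.Entry<String, double[]> e : stats.entrySet()) {
            double mean = e.getValue()[0] / n;
            double variance = e.getValue()[1] / n - mean * mean;
            double std = Math.sqrt(Math.max(variance, 1e-9));
            int observed = counts.getOrDefault(e.getKey(), 0);
            if (Math.abs(observed - mean) > threshold * std) {
                return true;
            }
        }
        return false;
    }

    private static Map<String, Integer> countEvents(List<String> session) {
        Map<String, Integer> counts = new HashMap<>();
        for (String event : session) {
            counts.merge(event, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> history = List.of(
                List.of("open", "read", "close"),
                List.of("open", "read", "close"),
                List.of("open", "read", "read", "close"));
        // A session with many repeated "read" events stands out from the history: prints true.
        System.out.println(isAnomalous(history,
                List.of("open", "read", "read", "read", "read", "close"), 2.0));
    }
}
```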
Oliner & Stearley (2007) raise awareness of the need for datasets from real systems to
conduct studies and provide directions to the research community. They analyzed log
data from five supercomputers and conclude that logs do not contain sufficient
information for automatic detection of failures or root cause diagnosis, small events
might dramatically impact the number of logs generated, different failures have different
predictive signatures, and messages that are corrupted or have inconsistent formats are
not uncommon. Many of the challenges raised by the authors are well known nowadays
and have been under continuous investigation in academia.
Researchers have been trying several different techniques, such as deep learning and
NLP (Du et al., 2017; Bertero et al., 2017; Meng et al., 2019; Zhang et al., 2019), data mining,
statistical learning methods, and machine learning (Lu et al., 2017; He et al., 2016b;
Ghanbari, Hashemi & Amza, 2014; Tang & Iyer, 1992; Lim, Singh & Yajnik, 2008; Xu et al.,
2009b; Xu et al., 2009a), control flow graph mining from execution logs (Nandi et al., 2016),
finite state machines (Fu et al., 2009; Debnath et al., 2018), frequent itemset mining
(Lim, Singh & Yajnik, 2008), dimensionality reduction techniques (Juvonen, Sipola &
Hämäläinen, 2015), grammar compression of log sequences (Gao et al., 2014), and
probabilistic suffix trees (Bao et al., 2018).
Interestingly, while these papers often make use of systems logs (e.g., logs generated
by Hadoop, a common case study among log analysis in general) for evaluation, we
conjecture that these approaches are sufficiently general and could be explored on (or are
worth trying with) other types of logs (e.g., application logs).
Failure prediction
Being able to anticipate failures in critical systems not only represents a competitive business
advantage but also prevents unrecoverable consequences to the business.
Quality assurance
Log analysis might support developers during the software development life cycle and,
more specifically, during activities related to quality assurance.
Andrews & Zhang (2000, 2003) have advocated the use of logs for testing purposes since the
early 2000s. In their work, the authors propose an approach called log file analysis (LFA).
LFA requires the software under test to write a record of events to a log file, following a
pre-defined logging policy that states precisely what the software should log. A log file
analyzer, also written by the developers, then analyzes the produced log file and only
accepts it in case the run did not reveal any failures. The authors propose a log file analysis
language to specify such analyses.
More than 10 years later, Chen et al. (2018) propose LogCoCo, an automated approach to
estimate code coverage via execution logs. The motivation for this use of log
data comes from the need to estimate code coverage from production code. The authors
argue that, in a large-scale production system, code coverage from test workloads might
not reflect coverage under production workload. Their approach relies on program
analysis techniques to match log data and their corresponding code paths. Based on this
data, LogCoCo estimates different coverage criteria, i.e., method, statement, and branch
Log platforms
Monitoring systems often contain dashboards and metrics to measure the “heartbeat” of
the system. In the occurrence of abnormal behavior, the operations team is able to visualize
the abnormality and conduct further investigation to identify the cause. Techniques to
reduce/filter the amount of log data and efficient querying play an important role to
support the operations team in diagnosing problems. One consideration is that, while visual
aids are useful, at one extreme it can be overwhelming to handle several charts and
dashboards at once. In addition, it can be non-trivial to judge if an unknown pattern on
the dashboard represents an unexpected situation. In practice, operations engineers may
rely on experience and past situations to make this judgment. Papers in this subcategory
focus on full-fledged platforms that aim at providing a full experience for monitoring
teams.
Two studies were explicitly conducted in an industry setting, namely MELODY
(Aharoni et al., 2011) at IBM and FLAP (Li et al., 2017) at Huawei Technologies. MELODY
is a tool for efficient log mining that features machine learning-based anomaly detection
for proactive monitoring. It was applied with ten large IBM clients, and the authors
reported that MELODY was useful to reduce the excessive amount of data faced by their
users. FLAP is a tool that combines state-of-the-art processing, storage, and analysis
techniques. One interesting feature that was not mentioned in other studies is the use of
template learning for unstructured logs. The authors also report that FLAP is in
production internally at Huawei.
While an industry setting is not always accessible to the research community, publicly
available datasets are useful to overcome this limitation. Balliu et al. (2015) propose BIDAL,
a tool to characterize the workload of cloud infrastructures. They use log data from Google
data clusters for evaluation and incorporate support for popular analysis languages and
storage backends in their tool. Di et al. (2017) propose LOGAIDER, a tool that integrates log
mining and visualization to analyze different types of correlation (e.g., spatial and
temporal). In this study, they use log data from Mira, an IBM Blue Gene-based
supercomputer for scientific computing, and report high accuracy and precision in
uncovering correlations associated with failures. Gunter et al. (2007) propose a log
summarization solution for time-series data integrated with anomaly detection techniques
to troubleshoot grid systems. They used a publicly available testbed and conducted
controlled experiments to generate log data and anomalous events. The authors highlight
the importance of being able to choose which anomaly detection technique to use, since
they observed different performance depending on the anomaly under analysis.
Open-source systems for cloud infrastructure and big data can also be used as
representative objects of study. Yu et al. (2016) and Neves, Machado & Pereira (2018)
conduct experiments based on OpenStack and Apache Zookeeper, respectively. CLOUDSEER
(Yu et al., 2016) is a solution to monitor management tasks in cloud infrastructures.
The technique is based on the characterization of administrative tasks as models inferred
DISCUSSION
Our results show that logging is an active research field that attracted not only researchers
but also practitioners. We observed that most of the research effort focuses on log analysis
techniques, while the other research areas are still in an early stage. In the following, we
highlight open problems, gaps, and future directions per research area.
In LOGGING, several empirical studies highlight the importance of better tooling
support for developers since logging is conducted in a trial-and-error manner
(see subcategory “Empirical Studies”). Part of the problem is the lack of requirements
for log data. When the requirements are well defined, logging frameworks can be tailored
to a particular use case and it is feasible to test whether the generated log data fits the use
case (see subcategory “Log Requirements”). However, when requirements are not clear,
developers rely on their own experience to make log-related decisions. While static analysis
is useful to anticipate potential issues in log statements (e.g., null reference in a logged
variable), other logging decisions (e.g., where to log) rely on the context of source code
(see subcategory “Implementation of Log Statements”). Research on this area already
shows the feasibility of employing machine learning to address those context-sensitive
decisions. However, the implications of deploying such tools to developers are still
unknown. Further work is necessary to address usability and operational aspects of those
techniques. For instance, false positives are a reality in machine learning: no model is 100%
accurate, and false positives will eventually emerge, even if at a low rate. Communicating
results in a way that keeps developers engaged and productive is
important to bridge the gap between theory and practice. This also calls for closer collaboration
between academia and industry.
In LOG INFRASTRUCTURE, most of the research effort focused on parsing techniques.
We observed that most papers in the “Log Parsing” subcategory address the template
extraction problem as an unsupervised problem, mainly by clustering the static part of the
log messages. While the analysis of system logs (e.g., web logs and other data provided by
the runtime environment) has been extensively explored (mostly Hadoop log data), little has
been explored in the field of application logs. We believe that this is due to the lack of
publicly available datasets. In addition, application logs might not have a well-defined
structure and can vary significantly from structured system logs. This could undermine
the feasibility of exploiting clustering techniques. One way to address the availability
problem could be using log data generated from test suites in open-source projects.
However, test suites might not produce a comparable volume of data. Unless there is a
publicly available large-scale application that could be used by the research community, we
argue that the only way to explore log parsing at large-scale is in partnership with industry.
THREATS TO VALIDITY
Our study maps the research landscape in logging, log infrastructure, and log analysis
based on our interpretation of the 108 studies published from 1992 to 2019. In this section,
we discuss possible threats to the validity of this work and possibilities for future
expansions of this systematic mapping.
External validity
The main threat to the generalization of our conclusions relates to the representativeness
of our dataset. Our procedure to discover relevant papers consists of querying popular
digital libraries rather than looking into already known venues in Software Engineering
(authors’ field of expertise). While we collected data from five different sources, it is
unclear how each library indexes the entries. It is possible that we may have missed a
relevant paper because none of the digital libraries reported it. Therefore, the search
procedure might be unable to yield complete results. Another factor that influences the
completeness of our dataset is the filtering of papers based on the venue rank (i.e., A and A*
according to the CORE Rank). There are several external factors that influence the
acceptance of a paper that are not necessarily related to the quality and relevance of the
study. The rationale for applying the exclusion criterion by venue rank is to reduce the
Internal validity
The main threat to the internal validity relates to our classification procedure. The first
author conducted the first step of the characterization procedure. Given that the entire
process was mostly manual, this might introduce a bias on the subsequent analysis. To
reduce its impact, the first author performed the procedure twice. Moreover, the second
author revisited all the decisions made by the first author throughout the process.
All divergences were discussed and settled throughout the study.
CONCLUSIONS
In this work, we show how researchers have been addressing the different challenges in
the life-cycle of log data. Logging provides a rich source of data that can enable several
types of analysis that are beneficial to the operations of complex systems. LOG ANALYSIS is a
mature field, and we believe that part of this success is due to the availability of datasets to
foster innovation. LOGGING and LOG INFRASTRUCTURE, on the other hand, are still in an early
stage of development. There are several barriers that hinder innovation in those areas,
e.g., the lack of representative application log data and of access to developers. We believe that
closing the gap between academia and industry can increase momentum and enable the
future generation of tools and standards for logging.
Funding
This work was supported by the Netherlands Organization for Scientific Research (NWO)
MIPL project [grant number 628.008.003]. The funders had no role in study design, data
collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests
Arie van Deursen is an Academic Editor for PeerJ Computer Science.
Jeanderson Barros Cândido is a Ph.D. student at TU Delft and is conducting his
research at Adyen N.V., the industry partner of his Ph.D. program.
Author Contributions
Jeanderson Cândido conceived and designed the experiments, performed the
experiments, analyzed the data, performed the computation work, prepared figures and/
or tables, authored or reviewed drafts of the paper, and approved the final draft.
Maurício Aniche conceived and designed the experiments, performed the experiments,
analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.
Arie van Deursen analyzed the data, authored or reviewed drafts of the paper, and
approved the final draft.
Data Availability
The following information was supplied regarding data availability:
The raw data is available in the Supplemental File.
Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/
peerj-cs.489#supplemental-information.
REFERENCES
Abad C, Taylor J, Sengul C, Yurcik W, Zhou Y, Rowe K. 2003. Log correlation for intrusion
detection: a proof of concept. In: Proceedings of the 19th Annual Computer Security Applications
Conference, 2003, Las Vegas, Nevada, USA. Piscataway: IEEE, 255–264.
Agrawal A, Karlupia R, Gupta R. 2019. Logan: a distributed online log parser. In: 2019 IEEE 35th
International Conference on Data Engineering (ICDE). Piscataway: IEEE, 1946–1951.
Aharon M, Barash G, Cohen I, Mordechai E. 2009. One graph is worth a thousand logs:
uncovering hidden structures in massive system event logs. In: Buntine W, Grobelnik M,
Mladenić D, Shawe-Taylor J, eds. Machine Learning and Knowledge Discovery in Databases.
Berlin, Heidelberg: Springer, 227–243.
Aharoni E, Fine S, Goldschmidt Y, Lavi O, Margalit O, Rosen-Zvi M, Shpigelman L. 2011.
Smarter log analysis. IBM Journal of Research and Development 55(5):10:1–10:10
DOI 10.1147/JRD.2011.2165675.
Andrews JH. 1998. Testing using log file analysis: tools, methods, and issues. In: Proceedings 13th
IEEE International Conference on Automated Software Engineering (Cat. No.98EX239). 157–166.
Andrews JH, Zhang Y. 2000. Broad-spectrum studies of log file analysis. In: Proceedings of the
22nd International Conference on Software Engineering - ICSE ’00, Limerick, Ireland. New York:
ACM Press, 105–114.