Detecting Large-Scale System Problems by Mining Console Logs
Wei Xu∗  Ling Huang†  Armando Fox∗  David Patterson∗  Michael I. Jordan∗

∗ EECS Department, University of California at Berkeley, Berkeley, CA, USA
† Intel Labs Berkeley, Berkeley, CA, USA

{xuw,fox,pattrsn,jordan}@cs.berkeley.edu  [email protected]
method for doing this augmentation is part of our contribution.

We studied logs and source code of many popular software systems used in Internet services, and observed that a typical console log is much more structured than it appears: the definition of its "schema" is implicit in the log printing statements, which can be recovered from program source code. This observation is key to our log parsing approach, which yields detailed and accurate features. Given the ubiquitous presence of open-source software in many Internet systems, we believe the need for source code is not a practical drawback to our approach.

starting: xact 325 is COMMITTING
starting: xact 346 is ABORTING

1 CLog.info("starting: " + txn);

2 class Transaction {
3   public String toString() {
4     return "xact " + this.tid +
5            " is " + this.state;
6   }
7 }

Figure 1: Top: two lines from a simple console log. Bottom: Java code that could produce these lines.
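To make the "schema" idea concrete, the following minimal Python sketch (illustrative only, not the implementation evaluated in this paper) shows what recovering the template behind Figure 1 buys us: because the regular expression and the variable names tid and state come from the source code rather than the log text, each log line parses into a message type plus (name, value) pairs.

    import re

    # Template recovered from the logging statement in Figure 1. The variable
    # names and types (tid: int, state: String) come from the source code,
    # not from the log text itself.
    TEMPLATE = re.compile(r"starting: xact (?P<tid>\d+) is (?P<state>\w+)")

    for line in ["starting: xact 325 is COMMITTING",
                 "starting: xact 346 is ABORTING"]:
        m = TEMPLATE.fullmatch(line)
        if m:
            # message type = the constant part; variables = (name, value) pairs
            print(m.groupdict())   # {'tid': '325', 'state': 'COMMITTING'}, ...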
Our contribution is a general four-step methodology that allows machine learning and information retrieval techniques to be applied to free-text logs to find the "needles in the haystack" that might indicate operational problems, without any manual input. Specifically, our methodology involves the following four contributions:

1) A technique for analyzing source code to recover the structure inherent in console logs;

2) The identification of common information in logs—state variables and object identifiers—and the automatic creation of features from the logs (exploiting the structure found) that can be subjected to analysis by a variety of machine learning algorithms;

3) Demonstration of a machine learning and information retrieval methodology that effectively detects unusual patterns or anomalies across large collections of such features extracted from a console log;

4) Where appropriate, automatic construction of a visualization that distills the results of anomaly detection in a compact and operator-friendly format that assumes no understanding of the details of the algorithms used to analyze the features.

The combination of elements in our approach, including our novel combination of source code analysis with log analysis and the automatic creation of features for anomaly detection, enables a level of detail in log analysis that was previously impossible due to the inability of previous methods to correctly identify the features necessary for problem identification.

Our approach requires no changes to existing software and works on existing textual console logs of any size. Some of the more computationally expensive steps are embarrassingly parallel, allowing us to run them as Hadoop [2] map-reduce jobs using cloud computing, achieving nearly linear speedup for a few dollars per run.

We evaluate our approach and demonstrate its capability and scalability with two real-world systems: the Darkstar online game server [28] and the Hadoop File System. For Darkstar, our method accurately detects performance anomalies immediately after they happen and provides hints as to the root cause. For Hadoop, we detect runtime anomalies that are commonly overlooked, and distill over 24 million lines of console logs (collected from 203 Hadoop nodes) to a one-page decision tree that a domain expert can readily understand. This automated process can be done with Hadoop map-reduce on 60 Amazon EC2 nodes within 3 minutes.

Section 2 provides an overview of our approach, Section 3 describes our log parsing technique in detail, Sections 4 and 5 present our solutions for feature creation and anomaly detection, Section 6 evaluates our approach and discusses the visualization technique, Section 7 discusses extensions and provides suggestions to improve log quality, Section 8 summarizes related work, and Section 9 draws some conclusions.

2 Overview of Approach

2.1 Information buried in textual logs

Important information is buried in the millions of lines of free-text console logs. To analyze logs automatically, we need to create high quality features, the numerical representation of log information that is understandable by a machine learning algorithm. The following three key observations lead to our solution to this problem.

Source code is the "schema" of logs. Although console logs appear in free-text form, they are in fact quite structured, because they are generated entirely from a relatively small set of log printing statements in the system. Consider the simple console log excerpt and the source code that generated it in Figure 1. Intuitively, it is easier to recover the log's hidden "schema" using the source code information (especially for a machine). Our method leverages source code analysis to recover the inherent structure of logs. The most significant advantage of our approach is that we are able to accurately parse all possible log messages, even the ones rarely seen in actual logs. In addition, we are able to eliminate most of the heuristics and guesses for log parsing used by existing solutions.

Variable     Examples                                           Distinct values
Identifiers  transaction id in Darkstar; block id in Hadoop     many
             file system; cache key in the Apache web server;
             task id in Hadoop map reduce
State vars   transaction execution state in Darkstar; server    few
             name for each block in Hadoop file system; HTTP
             status code (200, 404); POSIX process return
             values

Table 1: State variables and identifiers
Figure 2: The four steps of our approach. (1) Log parsing: message templates extracted from the source code (e.g., starting: xact (.*) is (.*)) convert the raw console log (e.g., starting: xact 325 is PREPARING) into a structured log of records such as type=1, tid=325, state=PREPARING. (2) Feature creation: the structured log is summarized into state ratio vectors (per time window, counts of each state value such as PREPARING, COMMITTING, COMMITTED, ABORTED) and message count vectors (per identifier, e.g., 325: 111000000). (3) PCA anomaly detection. (4) Decision tree visualization.
Common log structures lead to useful features. A person usually reads the log messages in Figure 1 as a constant part (starting: xact ... is) and multiple variable parts (325/346 and COMMITTING/ABORTING). In this paper, we call the constant part the message type and the variable part the message variable.

Message types—marked by constant strings in a log message—are essential for analyzing console logs and have been widely used in earlier work [17]. In our analysis, we use the constant strings solely as markers for the message types, completely ignoring their semantics as English words, which are known to be ambiguous [22].

Message variables carry crucial information as well. In contrast to prior work that focuses on numerical variables [17, 22, 35], we identified two important types of message variables for problem detection by studying logs from many systems and by interviewing Internet service developers and operators who heavily use console logs. We acknowledge that logs also contain other types of message variables, such as timestamps and various counts. We do not discuss those variables in this paper because they have been well studied in existing work [27].

Identifiers are variables used to identify an object manipulated by the program (e.g., the transaction ids 325 and 346 in Figure 1), while state variables are labels that enumerate a set of possible states an object could have in the program (e.g., COMMITTING and ABORTING in Figure 1). Table 1 provides more examples of such variables. We can determine whether a given variable is an identifier or a state variable programmatically, based on its frequency in console logs. Intuitively, state variables have a small number of distinct values, while identifiers take on a large number of distinct values (detailed in Section 4).

Message types and variables contain important runtime information useful to operators. However, lacking tools to extract these structures, operators either ignore them or spend hours grepping and manually examining log messages, which is tedious and inefficient. Our accurate log parsing allows us to use structured information such as message types and variables to automatically create features that capture the information conveyed in logs. To our knowledge, this is the first work extracting information at this fine level of granularity from console logs for problem detection.

Messages are strongly correlated. When log messages are grouped properly, there is a strong and stable correlation among messages within the same group. For example, messages containing a certain file name are likely to be highly correlated because they are likely to come from logically related execution steps in the program.

A group of related messages is often a better indicator of problems than individual messages. Many anomalies are indicated only by incomplete message sequences. For example, if a write operation to a file fails silently (perhaps because the developers do not handle the error correctly), no single error message is likely to indicate the failure. By correlating messages about the same file, however, we can detect such cases by observing that the expected "closing file" message is missing, as the sketch below illustrates. Previous work grouped logs by time windows only, and the detection accuracy suffers from noise in the correlation [14, 27, 35]. In contrast, we create message groups based on more accurate information, such as the message variables described above. In this way, the correlation is much stronger and more readily encoded, so that abnormal correlations also become easier to detect.
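As a toy illustration of this observation (ours, not the detector this paper actually uses, which is PCA as described in Section 5), grouping messages by an identifier such as a file name makes a silently missing "closing file" message easy to spot; the message type names below are hypothetical.

    from collections import defaultdict

    # (message type, file name) pairs as they might come out of the parser;
    # the message type names here are hypothetical.
    parsed = [("opening file", "/data/a"), ("writing file", "/data/a"),
              ("closing file", "/data/a"),
              ("opening file", "/data/b"), ("writing file", "/data/b")]
              # note: no "closing file" message for /data/b

    groups = defaultdict(list)
    for msg_type, fname in parsed:
        groups[fname].append(msg_type)        # group by identifier value

    for fname, msgs in groups.items():
        if "opening file" in msgs and "closing file" not in msgs:
            print(fname, "never closed:", msgs)   # -> /data/b never closed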
2.2 Workflow of our approach

Figure 2 shows the four steps in our general framework for mining console logs.

1) Log parsing. We first convert a log message from unstructured text to a data structure that shows the message type and a list of message variables in (name, value) pairs. We get all possible log message template strings from the source code and match these templates against each log message to recover its structure (that is, its message type and variables). Our experiments show that we can achieve high parsing accuracy in real-world systems.

There are systems that use structured tracing only, such as BerkeleyDB (Java edition). In this case, because the logs are already structured, we can skip this first step and directly apply our feature creation and anomaly detection methods. Note that these structured logs still contain both identifiers and state variables.¹
System Lang Logger Msg Construction LOC LOL Vars Parse ID ST
Operating system
Linux (Ubuntu) C custom printk + printf wrap 7477k 70817 70506 Y Yb Y
Low level Linux services
Bootp C custom printf wrap 11k 322 220 Y N N
DHCP server C custom printf wrap 23k 540 491 Y Yb Y
DHCP client C custom printf wrap 5k 239 205 Y Yb Y
ftpd C custom printf wrap 3k 66 67 Y Y N
openssh C custom printf wrap 124k 3341 3290 Y Y Y
crond C printf printf wrap 7k 112 131 Y N Y
Kerberos 5 C custom printf wrap 44k 6261 4971 Y Y Y
iptables C custom printf wrap 52k 2341 1528 Y N Y
Samba 3 C custom printf wrap 566k 8461 6843 Y Y Y
Internet service building blocks
Apache2 C custom printf wrap 312k 4008 2835 Y Y Y
mysql C custom printf wrap 714k 5092 5656 Y Yb Yb
postgresql C custom printf wrap 740k 12389 7135 Y Yb Yb
Squid C custom printf wrap 148k 2675 2740 Y Y Y
Jetty Java log4j string concatenation 138k 699 667 Y Y Y
Lucene Java custom custom log function 217k 143 159 Ya Y N
BDB (Java) Java custom custom structured trace 260k - - - Y N
Distributed systems
Hadoop Java custom log4j string concatenation 173k 911 1300 Y Y Y
Darkstar Java jdk-log Java format string 90k 578 658 Y Yb Yb
Nutch Java log4j string concatenation 64k 507 504 Y Y Y
Cassandra Java log4j string concatenation 46k 393 437 Y N Y
Storage Prototype C custom custom structured trace -c -c -c -c Y Y
a Logger class is not consistent in every module, so we need to manually specify the logger function name for each module.
b System prints minimal amount of logs by default, so we need to enable debug logging.
c Source code not available, but logs are well structured so manual parsing is easy.
Table 2: Console logging in popular software systems. LOC = lines of code in the system. LOL = number of log printing
statements. Vars = number of variables reported in log messages. Parse = whether our source analysis based parsing applies. ID =
whether identifier variables are reported. ST = whether state variables are reported.
2) Feature creation. Next, we construct feature vectors from the extracted information by choosing appropriate variables and grouping related messages. In this paper, we focus on constructing the state ratio vector and the message count vector features, which are unexploited in prior work. In our experiments with two large-scale real-world systems, both features yield good detection results.

3) Anomaly detection. Then, we apply anomaly detection methods to mine the feature vectors, labeling each feature vector as normal or abnormal. We find that the Principal Component Analysis (PCA)-based anomaly detection method [5] works very well with both features. This method is an unsupervised learning algorithm, in which all parameters can be either chosen automatically or tuned easily, eliminating the need for prior input from the operators. Although we use this specific machine learning algorithm for our case studies, it is not intrinsic to our approach, and different algorithms utilizing different extracted features could be readily "plugged in" to our framework.

4) Visualization. Finally, in order to let system integrators and operators better understand the PCA anomaly detection results, we visualize the results in a decision tree [34]. Compared to the PCA-based detector, the decision tree provides a more detailed explanation of how the problems are detected, in a form that resembles the event processing rules [10] with which system integrators and operators are familiar.

2.3 Case study and data collection

We studied source code and logs from 22 widely deployed open source systems. Table 2 summarizes the results. Although these systems are distinct in nature, developed in different languages by different developers at different times, 20 of the 22 systems use free-text logs, and our source-code-analysis based log parsing applies to all 20. Interestingly, we found that about 1%-5% of code lines are logging calls in most of the systems, but most of these calls are rarely, if ever, executed because they represent erroneous execution paths. It is almost impossible to maintain log-parsing rules manually with such a large number of distinct logger calls, which highlights our advantage of discovering message types automatically from source code.

¹ In fact, the last system in Table 2 (Storage Prototype) is an anonymous research prototype with built-in customized structured traces. Without any context, even without knowing the functionality of the system, our feature creation and anomaly detection algorithm successfully discovered log segments that the developer found insightful.
System         Nodes  Messages    Log Size
Darkstar       1      1,640,985   266 MB
Hadoop (HDFS)  203    24,396,061  2412 MB

Table 3: Data sets used in evaluation. Nodes = number of nodes in the experiments.

On average, a message reports a single variable. However, there are many messages, such as starting server, that report no variables, while other messages can report 10 or more.

Most C programs use printf-style format strings for logging, although a large portion uses wrapper functions to generate standard information such as time stamps and severity levels. These wrappers, even if customized, can be detected automatically from the format string parameter. In contrast, Java programs usually use string concatenation to generate log messages and often rely on standard logger packages (such as log4j). Analyzing these logging calls requires understanding data types, which we detail in Section 3. Our source-code-analysis based log parsing approach successfully works on most of them, and can find at least one of state variables and identifiers in 21 of the 22 systems in Table 2 (16 have both), confirming our assumption of their prevalence.

To be succinct yet reveal important issues in console log mining, we focus further discussion on two representative systems shown in Table 2: the Darkstar online game server and the Hadoop File System (HDFS). Both systems handle persistence, an important yet complicated function in large-scale Internet services. However, the two systems are different in nature. Darkstar focuses on small, time-sensitive transactions, while HDFS is a file system designed for storing large files and batch processing. They represent two major open source contributors (Sun and Apache, respectively) with different coding and logging styles.

We collected logs from systems running on Amazon's Elastic Compute Cloud (EC2), and we also used EC2 to analyze these logs. Table 3 summarizes the log data sets we used. The Darkstar example revealed a behavior that strongly depended on the deployment environment, which led to problems when migrating from traditional server farms to clouds. In particular, we found that Darkstar did not gracefully handle performance variations that are common in the cloud-computing environment. By analyzing console logs, we found the reason for this problem, as discussed in detail in Section 6.2.

Satisfied with the Darkstar results, to further evaluate our method we analyzed HDFS logs, which are much more complex. We collected HDFS logs from a Hadoop cluster running on over 200 EC2 nodes, yielding 24 million lines of logs. We successfully extracted log segments indicating run-time performance problems that have been confirmed by Hadoop developers.

All log data are collected from unmodified off-the-shelf systems. Console logs are written directly to local disks on each node and collected offline by simply copying the log files, which shows the convenience (no instrumentation or configuration) of our log mining approach. In the HDFS experiment, we used the default logging level, while in the Darkstar experiment, we turned on debug logging (FINER level in the logging framework).

3 Log Parsing with Source Code

In addition to standard "fields" in console logs, such as timestamps, we focus on the free-text part of a log message. For the log excerpt at the top of Figure 1, human readers would reasonably conclude that 325, 346, COMMITTING and ABORTING are message variables, while the rest are constant strings marking message types. They could then write a regular expression such as starting: xact (.*) is (.*) to "templatize" such log messages. We want to automate this process.

The difficulty. Unless the log itself is marked with formatting to distinguish these elements, we must "templatize" automatically. As discussed in Section 2.1, it is much easier for a machine to use the source code as the "schema" for console logs. If the software is written in a language like C, it is likely that the template can be directly inferred from the printf variants that generate the messages, such as fprintf(LOG, "starting: xact %d is %s"), with the various escapes (%d, %f, and so on) telling us something about the types of the variables. However, it is more challenging in object-oriented (OO) languages such as Java, which are increasingly used for open source software (Table 2). Consider the excerpt of Java source shown in the bottom half of Figure 1, which generated the two example log lines. Clearly, the tid variable of the txn object corresponds to the identifiers 325 and 346 in the log messages of Figure 1, and the state variable corresponds to the labels COMMITTING and ABORTING. Trying to extract a regular expression by simply "grepping" the source code would only give us starting: (.*) (line 1), which does not distinguish tid and state as separate features with distinct ranges of possible values. Critically, as we will show later, we need this finer level of feature resolution to find "interesting" problems.
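For the C case, the following simplified Python sketch (ours, for illustration; it handles only the %d, %f, and %s escapes, not the full printf specification) shows how a format string can be turned directly into a matching template.

    import re

    # printf escape -> regex fragment (only the common cases, for illustration)
    PRINTF_ESCAPES = {"d": r"[-+]?\d+", "f": r"[-+]?\d*\.?\d+", "s": r".+?"}

    def printf_to_template(fmt):
        """Turn a printf format string into a (regex, types) template."""
        types, out, i = [], "", 0
        while i < len(fmt):
            if fmt[i] == "%" and i + 1 < len(fmt) and fmt[i + 1] in PRINTF_ESCAPES:
                out += "(" + PRINTF_ESCAPES[fmt[i + 1]] + ")"   # capture a variable
                types.append(fmt[i + 1])
                i += 2
            else:
                out += re.escape(fmt[i])                        # constant part
                i += 1
        return re.compile(out), types

    regex, types = printf_to_template("starting: xact %d is %s")
    print(regex.fullmatch("starting: xact 325 is COMMITTING").groups())
    # -> ('325', 'COMMITTING'); types == ['d', 's']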
Three reasons cause this difficulty to arise in OO languages. First, we need to know that CLog identifies a logger object; that is, knowing the name of the logger class is not enough. Second, the OO idiom for printing is for an object to implement a toString() method that returns a printable representation of itself for interpolation into a string; in this example, the toString() method of the abstract type Transaction actually reveals the underlying structure of the log message. Third, due to class inheritance, the actual toString() method used in a particular call might be defined in a subclass rather than in the base class of the logger object.
Figure 3: Using source code information to parse console logs. Static source code analysis (left): the source code is parsed into an AST and type hierarchy information, from which partial message templates (e.g., starting: (.*) [transact][Transaction] [at Participant.java:345]) and toString() definitions are extracted; combining them yields complete message templates such as starting: xact (.*) is (.*) [tid,state][int,String] [at Participant.java:345]. Runtime log parsing (right): the complete templates are compiled into a reverse index that matches incoming console logs to produce parsing results.
Our parsing approach. All three reasons are addressed by our log parsing method, which consists of two steps: a static source code analysis step and a runtime log parsing step, as Figure 3 illustrates. In particular, we do not claim to handle every situation correctly (despite extensive support for language idioms), but we do show that some of the important features used in our results cannot be extracted using existing log parsing techniques.

The static source code analysis step takes the program source (and possibly the names of the logger classes) as input. In this step, we first generate the source code's abstract syntax tree (AST) [1], a popular data structure for traversing and analyzing source code. We use the AST implementations built into the open-source Eclipse IDE [25]. We use the AST to identify all method calls on objects of the logger classes (or their subclasses), recording the filename and line number of each call. Each call gives us only a partial message template, since the template may involve interpolation of objects of nonprimitive types, as in line 1 of the source code excerpt in Figure 1. We then enumerate all toString() methods in all classes, and look at the string formatting statements in those methods to deduce the types of the variables in message templates, substituting this type information back into the partial templates. We do this recursively until all templates interpolate only primitive types; if no toString() method can be found for a particular variable anywhere along its inheritance path, we assume that the variable can take on any string value and do no further semantic interpretation. A single pass over the AST can accomplish all of these operations. The output of the process is the complete set of message templates, with a data structure containing each message's template (regular expression), its position in the source code, and the names and data types of all variables appearing in the message. We describe the details of the template extraction approach in Appendix A.

To parse the logs, we first compile all message templates into an Apache Lucene [11] reverse index [20], which allows us to quickly associate any log message with the corresponding template. Following established heuristics in log analysis [17, 30], we construct an index query from each log message by removing all numbers and special symbols. From the list of relevance-ranked candidate results returned by the reverse-index search, we pick the highest-ranked result that allows a regular expression match to succeed against the log message. We note that once the reverse index is constructed (it usually fits in memory), the parsing step is embarrassingly parallel; we implement it as a Hadoop map-reduce job by replicating the index to every worker node and partitioning the log among the workers, achieving near-linear speedup. The map stage performs the reverse-index search; the reduce stage processing depends on the features to be constructed, and Section 4 shows two examples.
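To illustrate the runtime step, here is a much-simplified Python sketch (ours; the actual system uses Apache Lucene rather than this toy dictionary index) of matching a log line against templates via a reverse index built from each template's constant words.

    import re
    from collections import defaultdict

    # Complete message templates, as produced by the static analysis step.
    # (Illustrative; real templates also carry variable names and types.)
    TEMPLATES = ["starting: xact (.*) is (.*)",
                 "prepare: xact (.*) is (.*)"]
    COMPILED = [re.compile(t.replace("(.*)", r"(\S+)")) for t in TEMPLATES]

    # Reverse index: constant word -> ids of templates containing that word.
    index = defaultdict(set)
    for i, t in enumerate(TEMPLATES):
        for word in re.findall(r"[A-Za-z]+", t.replace("(.*)", " ")):
            index[word].add(i)

    def parse(line):
        # Query: the line with numbers and special symbols removed.
        words = re.findall(r"[A-Za-z]+", line)
        scores = defaultdict(int)          # rank templates by shared words
        for w in words:
            for i in index.get(w, ()):
                scores[i] += 1
        for i in sorted(scores, key=scores.get, reverse=True):
            m = COMPILED[i].fullmatch(line)
            if m:                          # highest-ranked template that matches
                return i, m.groups()
        return None                        # unparsed: fall back to raw string

    print(parse("starting: xact 325 is COMMITTING"))  # -> (0, ('325', 'COMMITTING'))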
To summarize, unlike existing log-parsing methods, the fine granularity of structure revealed by our method enables analyses that are traditionally possible only with structured logs. Section 7 discusses the many intrinsic subtleties in source code analysis and log parsing.

4 Feature Creation

This section describes our technique for constructing features from parsed logs. We focus on two features, the state ratio vector and the message count vector, based on state variables and identifiers (see Section 2.1), respectively. The state ratio vector captures the aggregated behavior of the system over a time window. The message count vector helps detect problems related to individual operations. Both features describe message groups constructed to have strong correlations among their members. The features faithfully capture these correlations, which are often good indicators of runtime problems. Although these features come from the same log and are similar in structure, they are constructed independently and have different semantics.

4.1 State variables and state ratio vectors

State variables appear in a large portion of log messages. In fact, 32% of the log messages from Hadoop and 28% of the messages from Darkstar contain state variables. In many systems, during normal execution the relative frequency of each value of a state variable in a time window usually stays the same. For example, in Darkstar, the ratio between ABORTING and COMMITTING is very stable during normal execution, but changes significantly when a problem occurs. Notice that the actual numbers do not matter (they depend on workload); it is the ratio among the different values that matters.
We construct state ratio vectors y to encode this correlation: each state ratio vector represents a group of state variables in a time window, each dimension of the vector corresponds to a distinct state variable value, and the value of the dimension is how many times this state value appears in the time window.

In creating features based on state variables we used an automatic procedure that combined two desiderata: 1) message variables should be frequently reported, and 2) they should range across a small, constant number of distinct values that does not depend on the number of messages. Specifically, in our experiments we chose state variables that were reported at least 0.2N times, with N the number of messages, and that had a number of distinct values not increasing with N for large values of N (e.g., more than a few thousand). Our results were not sensitive to the choice of 0.2.

The time window size is also automatically determined. Currently we choose a size that allows the variable to appear at least 10D times in 80% of all the time windows, where D is the number of distinct values. This choice of time window allows the variable to appear enough times in each window to make the count statistically significant [4], while keeping the time window small enough to capture transient problems. We tried parameters other than 10 and 80% and did not see a significant change in detection results.

We stack all n-dimensional y's from m time windows to construct the m × n state ratio matrix Y_s.
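The following compact Python sketch (ours, illustrative only; the window size is taken as given rather than derived by the 10D/80% rule above) builds the state ratio matrix from parsed (timestamp, state value) records.

    from collections import Counter
    import numpy as np

    def state_ratio_matrix(records, window):
        """records: (timestamp, state_value) pairs; window: seconds per window.
        Returns the m x n matrix Y_s and the column order (state values)."""
        values = sorted({v for _, v in records})       # n distinct state values
        col = {v: j for j, v in enumerate(values)}
        t0 = min(t for t, _ in records)
        counts = {}                                    # window id -> Counter
        for t, v in records:
            w = int((t - t0) // window)
            counts.setdefault(w, Counter())[v] += 1
        Y = np.zeros((max(counts) + 1, len(values)))
        for w, c in counts.items():
            for v, k in c.items():
                Y[w, col[v]] = k                       # count per state value
        return Y, values

    records = [(0.5, "COMMITTING"), (1.2, "ABORTING"), (1.9, "COMMITTING"),
               (3.4, "COMMITTING"), (4.8, "ABORTING")]
    Y, states = state_ratio_matrix(records, window=3.0)
    print(states, Y)   # rows = time windows, columns = state values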
4.2 Identifiers and message count vectors

Identifiers are also prevalent in logs. For example, almost 50% of the messages in HDFS logs contain identifiers. We observe that all log messages reporting the same identifier convey a single piece of information about that identifier. For instance, in HDFS, there are multiple log messages about a block when the block is allocated, written, replicated, or deleted. By grouping these messages, we get the message count vector, which is similar to an execution path [8] (from custom instrumentation).

To form the message count vector, we first automatically discover identifiers, then group together messages with the same identifier values, and create a vector per group. Each vector dimension corresponds to a different message type, and the value of the dimension tells how many messages of that type appear in the message group. The structure of this feature is analogous to the bag of words model in information retrieval [6]. In our application, the "document" is the message group. The dimensions of the vector consist of the union of all useful message types across all groups (analogous to all possible "terms"), and the value of a dimension is the number of appearances of the corresponding message type in a group (corresponding to "term frequency").

Algorithm 1 Message count vector construction
1. Find all message variables reported in the log with the following properties:
   a. Reported many times;
   b. Has many distinct values;
   c. Appears in multiple message types.
2. Group messages by the values of the variables chosen above.
3. For each message group, create a message count vector y = [y_1, y_2, ..., y_n], where y_i is the number of appearances of messages of type i (i = 1 ... n) in the message group.

Algorithm 1 summarizes our three-step process for feature construction; a sketch follows at the end of this subsection. We now provide some intuition behind the design choices in this algorithm.

In the first step of the algorithm, we automatically choose identifiers (we do not want to require operators to specify a search key). The intuition is that if a variable meets the three criteria in step 1 of Algorithm 1, it is likely to identify objects such as transactions. The frequency/distinct-value pattern of identifiers is very different from that of other variables, so it is easy to discover identifiers². We had very few false selections in all data sets, and the small number of false choices is easy to eliminate by manual examination.

In the second step, the message group essentially describes an execution path, with two major differences. First, not every processing step is necessarily represented in the console logs. Since the logging points are hand chosen by developers, it is reasonable to assume that the logged steps are important for diagnosis. Second, correct ordering of messages is not guaranteed across multiple nodes, due to unsynchronized clocks across many computers. This ordering might be a problem for diagnosing synchronization-related problems, but it is still useful in identifying many kinds of anomalies.

In the third step, we use the bag of words model [6] to represent the message group because: 1) it does not require ordering among terms (message types), and 2) documents with unusual terms are given more weight in document ranking. In our case, the rare log messages are indeed likely to be more important.

We gather all the message count vectors to construct the message count matrix Y_m, an m × n matrix where each row is a message count vector y, as described in step 3 of Algorithm 1. Y_m has n columns, corresponding to the n message types that reported the identifier (analogous to "terms"), and m rows, each of which corresponds to a message group (analogous to a "document").

² Like the state variable case, identifiers are chosen as variables reported at least 0.2N times, where N is the total number of messages. We also require that the variables have at least 0.02N distinct values and be reported in at least 5 distinct message types.
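Following Algorithm 1, a minimal Python sketch (ours; using the thresholds from footnote 2, and assuming message types are already known from parsing):

    from collections import Counter, defaultdict
    import numpy as np

    # Parsed log: (message_type, variable_name, variable_value) triples.
    def choose_identifiers(parsed, N):
        reports, distinct, types = Counter(), defaultdict(set), defaultdict(set)
        for msg_type, var, val in parsed:
            reports[var] += 1
            distinct[var].add(val)
            types[var].add(msg_type)
        return [v for v in reports
                if reports[v] >= 0.2 * N             # reported many times
                and len(distinct[v]) >= 0.02 * N     # many distinct values
                and len(types[v]) >= 5]              # in >= 5 message types

    def message_count_matrix(parsed, id_var, all_types):
        groups = defaultdict(Counter)        # identifier value -> type counts
        for msg_type, var, val in parsed:
            if var == id_var:
                groups[val][msg_type] += 1
        return np.array([[c[t] for t in all_types] for c in groups.values()])
        # one row (message count vector) per message group

In our implementation, these two passes run as the map-reduce jobs described in Section 4.3; the sketch above shows only the single-machine logic.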
Although the message count matrix Y_m has completely different semantics from the state ratio matrix Y_s, both can be analyzed using matrix-based anomaly detection tools (see Section 5). Table 4 summarizes the semantics of the rows and columns of each feature matrix.

Feature                    Rows         Columns
State ratio matrix Y_s     time window  state value
Message count matrix Y_m   identifier   message type

Table 4: Semantics of rows and columns of features

4.3 Implementing feature creation algorithms

To improve the efficiency of our feature generation algorithms in map-reduce, we tailored the implementation. The step of discovering state variables and/or identifiers (the first steps in Sections 4.1 and 4.2) is a single map-reduce job that calculates the number of distinct values for all variables and determines which variables to include in further feature generation steps. The step of constructing features from variables is another map-reduce job, with log parsing as the map stage and message grouping as the reduce stage. For the state ratio vector, we sort messages by time stamp, while for the message count vector, we sort by identifier values. Notice that the map stage (the parsing step) only needs to output the required data rather than the entire text message, resulting in huge I/O savings during the data shuffling and sorting before the reduce. Feature creation time is negligible compared to parsing time.

5 Anomaly Detection

We use anomaly detection methods to find unusual patterns in logs. In this way, we can automatically find the log segments that are most likely to indicate problems. Given the feature matrices we construct, outlier detection methods can be applied to detect anomalies contained in the logs. We have investigated a variety of such methods and have found that Principal Component Analysis (PCA) [5, 16], combined with term-weighting techniques from information retrieval [23, 26], yields excellent anomaly detection results on both feature matrices, while requiring little parameter tuning.

PCA is a statistical method that captures patterns in high-dimensional data by automatically choosing a set of coordinates—the principal components—that reflect covariation among the original coordinates. We use PCA to separate out repeating patterns in feature vectors, thereby making abnormal message patterns easier to detect. PCA has runtime linear in the number of feature vectors; therefore, detection can scale to large logs.

Intuition behind PCA anomaly detection. (The math-challenged may want to skip to the results in Section 6.) By construction, the dimensions in our feature vectors are highly correlated, due to the strong correlation among log messages within a group. We aim to identify abnormal vectors that deviate from such correlation patterns.

Figure 4: The intuition behind PCA detection with simplified data (ACTIVE transactions per second vs. COMMITTING transactions per second, showing the normal-subspace axis S_d, an abnormal point A far from the axis, and a normal point B close to it). We plot only two dimensions from the Darkstar state variable feature. It is easy to see the high correlation between these two dimensions. PCA determines the dominant normal pattern, separates it out, and makes it easier to identify anomalies.

Figure 4 illustrates a simplified example using two dimensions (the number of ACTIVE and COMMITTING transactions per second) from the Darkstar state ratio vectors. We see that most data points reside close to a straight line (a one-dimensional subspace). In this case, we say the data have low effective dimensionality. The axis S_d captures the strong correlation between the two dimensions. Intuitively, a data point far from S_d (such as point A) shows unusual correlation, and thus represents an anomaly. In contrast, point B, although far from most other points, resides close to S_d, and is thus normal. In fact, both ACTIVE and COMMITTING are larger in this case, which simply indicates that the system is busier.

Feature data set           n    k
Darkstar - message count   18   3
Darkstar - state ratio     6    1
HDFS - message count       28   4
HDFS - state ratio         202  2

Table 5: Low effective dimensionality of feature data. n = dimensionality of feature vector y; k = dimensionality required to capture 95% of the variance in the data. In all of our data, we have k ≪ n, exhibiting low effective dimensionality.

Indeed, we do observe low effective dimensionality in the feature matrices Y_s and Y_m in many systems. Table 5 shows k, the number of dimensions required to capture 95% of the variance in the data³. Intuitively, in the case of the state ratio, when the system is in a stable state, the ratios among the different state variable values are roughly constant. For the message count vector, as each dimension corresponds to a certain stage in the program and the stages are determined by the program logic, the messages in a group are correlated. The correlations among messages, determined by normal program execution, result in highly correlated dimensions for both features.

³ This is a common heuristic for determining k in PCA detectors [15]; we use this number in all of our experiments.
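Table 5's k can be computed directly from its definition: the smallest number of principal components whose explained variance sums to 95% of the total. A quick sketch (ours) using numpy:

    import numpy as np

    def effective_dimensionality(Y, frac=0.95):
        """Smallest k whose top-k principal components capture frac of variance."""
        Yc = Y - Y.mean(axis=0)                    # center each column
        _, s, _ = np.linalg.svd(Yc, full_matrices=False)
        var = s**2 / np.sum(s**2)                  # variance explained per component
        return int(np.searchsorted(np.cumsum(var), frac) + 1)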
In summary, PCA captures the dominant patterns in the data to construct a (low) k-dimensional normal subspace S_d in the original n-dimensional space. The remaining (n − k) dimensions form the abnormal subspace S_a. By projecting a vector y onto S_a (separating out its component on S_d), it is much easier to identify abnormal vectors. This forms the basis for anomaly detection [5, 16].

Detecting anomalies. Intuitively, we use the "distance" from the end point of a vector y to the normal subspace S_d to determine whether y is abnormal. This can be formalized by computing the squared prediction error SPE ≡ ‖y_a‖² (the squared length of the vector y_a), where y_a is the projection of y onto the abnormal subspace S_a, computed as y_a = (I − PP^T)y, where P = [v_1, v_2, ..., v_k] is formed by the first k principal components chosen by the PCA algorithm.

As Figure 4 shows, abnormal vectors are typically far away from the normal subspace S_d. Thus, the "detection rule" is simple: we mark y as abnormal if

    SPE = ‖y_a‖² > Q_α,    (1)

where Q_α denotes the threshold statistic for the SPE residual function at the (1 − α) confidence level.

Automatically determining the detection threshold. To compute Q_α we make use of the Q-statistic, a well-known test statistic for the SPE residual function [13]. The computed threshold Q_α guarantees that the false alarm probability is no more than α, under the assumption that the data matrix Y has a multivariate Gaussian distribution. However, as pointed out by Jensen and Solomon [13], and as verified in our empirical work, the Q-statistic is robust even when the underlying distribution of the data differs substantially from Gaussian.

The choice of the confidence parameter α for anomaly detection has been studied in previous work [16], and we follow standard recommendations in choosing α = 0.001 in our experiments. We found that our detection results are not sensitive to this parameter choice.

Improving PCA detection results. Our message count vector is constructed in a way similar to the bag-of-words model, so it is natural to consider term-weighting techniques from information retrieval. We applied Term Frequency / Inverse Document Frequency (TF-IDF), a well-established heuristic in information retrieval [23, 26], to pre-process the data. Instead of applying PCA directly to the feature matrix Y_m, we replace each entry y_{i,j} of Y_m with a weighted entry w_{i,j} ≡ y_{i,j} log(n/df_j), where df_j is the total number of message groups that contain the j-th message type. Intuitively, multiplying the original count by the IDF reduces the weight of common message types that appear in most groups, which are less likely to indicate problems. We found this step to be essential for improving detection accuracy.

TF-IDF does not apply to the state ratio feature, because the state ratio matrix is a dense matrix that is not amenable to an interpretation as a bag-of-words model. However, applying the PCA method directly to Y_s gives good results on the state ratio feature.
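Putting this section together, a minimal Python sketch (ours, illustrative only) of the detector: IDF weighting, principal components via SVD, SPE per row, and thresholding. Two simplifications to note: the threshold Q_alpha is taken as a given precomputed value rather than derived via the Q-statistic of [13], and the IDF numerator is the number of message groups (the standard IDF form; it plays the role of the numerator in the w_{i,j} formula above).

    import numpy as np

    def pca_detect(Y, k, Q_alpha):
        """Flag rows of feature matrix Y whose SPE exceeds Q_alpha (Eq. 1)."""
        Yc = Y - Y.mean(axis=0)                 # center columns
        _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
        P = Vt[:k].T                            # first k principal components
        Ya = Yc - Yc @ P @ P.T                  # projection onto abnormal subspace
        spe = np.sum(Ya**2, axis=1)             # SPE = ||y_a||^2 per row
        return spe > Q_alpha                    # True = abnormal

    def tfidf(Y):
        """Weight each entry by the inverse document frequency of its column;
        df_j = number of message groups containing message type j."""
        m = Y.shape[0]                          # number of message groups
        df = np.count_nonzero(Y, axis=0).clip(min=1)
        return Y * np.log(m / df)

    # Usage sketch: flags = pca_detect(tfidf(Ym), k=4, Q_alpha=...)  # k as in Table 5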
System    Total Log   Failed  Failed %
HDFS      24,396,061  29,636  0.121%
Darkstar  1,640,985   35      0.002%

Table 6: Parsing accuracy. A parse fails on a message when we cannot find a message template that matches the message and extract the message variables.

Figure 5: Scalability of log parsing with the number of nodes used. The x-axis is the number of nodes used, while the y-axis is the number of messages processed per minute (in millions). All nodes are Amazon EC2 high-CPU medium instances. We used the HDFS data set (described in Table 3) with over 24 million lines. We parsed raw textual logs and generated the message count vector feature (see Section 4.2). Each experiment was repeated 4 times and the reported data point is the mean.

6 Evaluation and Visualization

We first show the accuracy and scalability achieved by our log parsing method (Section 6.1) and then discuss our experiences with the two real-world systems. We began our problem detection experiments with Darkstar, in which both features give simple yet insightful results (Section 6.2). Satisfied with these results, we applied our techniques to the much more complex HDFS logs, where we also achieve high detection accuracy (Section 6.3). However, the results are less intuitive to system operators and developers, so we developed a decision tree visualization method, which summarizes the PCA detection results in a single, intuitive picture (Section 6.4) that is more operator friendly because the tree resembles the rule-based event processing systems that operators use [10].

6.1 Log parsing accuracy and scalability

Accuracy. Table 6 shows that our log parsing method achieves over 99.8% accuracy on both systems. Specifically, our technique successfully handled rare message types, even those that appeared only twice in over 24 million messages in HDFS. In contrast, word-frequency based console log analysis tools, such as SLCT [32], do
not recover either of the features we use in this paper. State variables are too common to be separated from constant strings by word frequency alone. In addition, these tools ignore all rare messages, which are required to construct message count vectors.

There are only a few message types that our parser fails to handle. Almost all of these messages contain long string variables. These long strings may overwhelm the constant strings we are searching for, preventing the reverse-index search from finding the correct message template. However, these messages typically appear at the initialization or termination phase of a system (or a subsystem), when the state of the system is dumped to the console. Thus, we did not see any impact of missing these messages on our detection results.

We believe the accuracy of our approach to parsing is essential; only with an accurate parsing system can we extract state variables and identifiers—the basis for our feature construction—from textual logs. Thus, we consider the requirement of access to source code a small price to pay (especially given that many modules are open-source) for the high quality parsing results that our technique produces.

Scalability. We evaluated the scalability of our log parsing approach with a varying number of EC2 nodes. Figure 5 shows the result: our log parsing and feature extraction algorithms scale almost linearly with up to about 50 nodes. Even though we parsed all messages generated by 200 HDFS nodes (with aggressive logging) over 48 hours, log parsing takes less than 3 minutes with 50 nodes, or less than 10 minutes with 10 nodes. When we use more than 60 nodes, the overhead of index dissemination and job scheduling dominates the running time.

6.2 Darkstar experiment results

As mentioned in Section 2.3, we observed high performance (i.e., client-side response time) variability when deploying the Darkstar server in a cloud-computing environment such as EC2 during performance disturbances, especially under CPU contention. We wanted to see whether we could understand the reason for this high performance variability solely from console logs. Indeed, we were unfamiliar with Darkstar, so our setting was realistic: the operator often knows little about system internals.

In the experiment, we deployed an unmodified Darkstar 0.95 distribution on a single node (because the Darkstar version we use supports only one node). Darkstar does not log much by default, so we turned on debug-level logging. We deployed a simple game, DarkMud, provided by the Darkstar team, and created a workload generator that emulated 60 user clients in the DarkMud virtual world performing random operations such as flipping switches and picking up and dropping items. The client emulator recorded the latency of each operation. We ran the experiment for 4800 seconds and injected a performance disturbance by capping the CPU available to Darkstar to 50% of the normal level during time 1400 to 1800.

Detection by state ratio vectors. The only state variable chosen by our feature generation algorithm is state, which is reported in 456,996 messages (about 28% of all log messages in our data set). It has 8 distinct values, including PREPARING, ACTIVE, COMMITTING, ABORTING and so on, so our state ratio matrix Y_s has 8 columns (dimensions). The time window (automatically determined according to Section 4.1) is 3 seconds; we restricted the choice to whole seconds.

Figures 6 (a) and (b) show the results between time 1000 and 2500. Plot (a) displays the average latency reported by the client emulator, which acts as ground truth for evaluating our method, and plot (b) displays the PCA anomaly detection results on the state ratio matrix Y_s. The anomalies detected by our method during the time interval (1400, 1800) match the high client-side latency very well; i.e., the anomalies detected in the state ratio matrix correlate very well with the increases in client latency. Comparing the abnormal vectors to the normal vectors, we see that the ratio of ABORTING to COMMITTING increases from about 1:2000 to about 1:2, indicating that a disproportionate number of ABORTING transactions are related to the poor client latency.

Generally, an abnormal state ratio may be the cause, a symptom, or a consequence of performance degradation. In the Darkstar case, the ratio reflects the cause of the problem: when system performance gets worse, Darkstar does not adjust its transaction timeout accordingly, causing many normal transactions to be aborted and restarted, which further increases the load on the system.

Notice that a traditional grep-based method does not help in this case, for two reasons. 1) To a normal user of Darkstar—without knowledge of its internals—the transaction states are obscure implementation details, so it is difficult to choose the correct variables to search for among the many available; in contrast, we systematically discover and analyze all state variables. 2) ABORTING happens even during normal operation, due to the optimistic concurrency model used in Darkstar, where aborting is used to handle access conflicts. It is not a single ABORTING message, but the ratio of ABORTING to the other values of the state variable, that captures the problem.

Detection by message count vectors. From the Darkstar logs, Algorithm 1 automatically chooses two identifier variables: the transaction id and the asynchronous channel id. Figure 6(c) shows detection results on the message count vector constructed from the transaction id variable. There are 68,029 transaction ids reported in 18 different message types; thus, the dimension of the matrix Y_m is 68,029 × 18.
Figure 6: Darkstar detection results: (a) client latency, (b) state ratio vector detection, (c) message count vector detection, each plotted against time since start (sec), with the disturbance start and end marked. (a) shows that the disturbance injection caused a huge increase in client response time. (b) shows PCA anomaly detection results on the state ratio vector created from the message variable state. The dashed line shows the threshold Q_α. The solid line with spikes is the SPE calculated according to Eq. (1). The circles denote the anomalous vectors detected by our method, whose SPE values exceed the threshold Q_α. (c) shows detection results with the message count vector. The SPE value of each vector (the solid line) is plotted at the time when the last message of the group occurs.
By construction, each message count vector represents the set of operations (message types) that occur when executing a transaction. PCA identifies the normal vectors as those corresponding to a common set of operations (simplified for presentation): {create, join txn, commit, prepareAndCommit}. Abnormal transactions deviate from this set by missing a few message types, or by having rare message types such as abort txn instead of commit and join txn. We detected 504 of these as abnormal. To validate our result, we augmented each feature vector with the timestamp of the last message in its group, and we found that almost all abnormal transactions occur when the disturbance is injected. The anomalies continue to appear (with a smaller frequency) for a short time period after the disturbance stopped, due to the queueing effect as the system recovered from the disturbance. Notice that the state ratio vector method did not mark the recovery period as abnormal, demonstrating that the message count vector method is more sensitive because it models individual operations, while the state ratio vector method captures only aggregate behavior.

There were no anomalies on the channel id variable during the entire experiment, suggesting that the channel id variable is not related to this performance anomaly. This result is consistent with the state ratio vector detection result. In console logs, it is common that several different pieces of information describe the same system behavior. This commonality suggests an important direction for future research: to exploit multi-source learning algorithms, which combine multiple detection results to further improve accuracy.

6.3 Hadoop experiment results

Compared to Darkstar, HDFS is larger in scale and its logic is much more complex. In this experiment, we show that we can automatically discover many abnormal behaviors in HDFS. We generated the HDFS logs by setting up a Hadoop cluster on 203 EC2 nodes and running sample Hadoop map-reduce jobs for 48 hours, generating and processing over 200 TB of random data. We collected over 24 million lines of logs from HDFS.

Detection on message count vectors. From the HDFS logs, Algorithm 1 automatically chooses one identifier variable, the blockid, which is reported in 11,197,954 messages (about 50% of all messages) in 29 message types. There are 575,139 distinct blockids reported in the log, so the message count matrix Y_m has dimension 575,139 × 29. The PCA detector gives very good separation between normal and abnormal row vectors in the matrix: using an automatically determined threshold (Q_α in Eq. (1) in Section 5), it successfully detects abnormal vectors corresponding to blocks that went through abnormal execution paths.

To further validate our results, we manually labeled each distinct message vector, not only marking it as normal or abnormal, but also determining the type of problem for each vector. The labeling was done by carefully studying the HDFS code and by consulting with local Hadoop experts. We show in the next section that the decision tree visualization helps both ourselves and the Hadoop developers understand our results. We emphasize that this labeling step is done only to validate our method—it is not a required step when using our technique. Labeling half a million vectors is possible because many of the vectors are exactly the same. In fact, there are only 680 distinct vectors, confirming our intuition that most blocks go through a common execution path.

Table 7 shows the manual labels and the detection results. We see that the PCA detector detects a large fraction of the anomalies in the data, and that significant improvement is achieved when we preprocess the data with TF-IDF, confirming our expectations from Section 5.

Throughout the experiment, we experienced no catastrophic failures; thus, most problems listed in Table 7
trophic failures; thus, most problems listed in Table 7 # Anomaly Description Actual Raw TF-IDF
1 Namenode not updated after 4297 475 4297
only affect performance.
deleting block
The first anomaly in Table 7 uncovered a bug that has 2 Write exception client give up 3225 3225 3225
been hidden in HDFS for a long time. In a certain (rel- 3 Write failed at beginning 2950 2950 2950
atively rare) code path, when a block is deleted (due to 4 Replica immediately deleted 2809 2803 2788
5 Received block that does not 1240 20 1228
temporary over-replication), the record on the namenode belong to any file
is not updated until the next write to the block, caus- 6 Redundant addStoredBlock 953 33 953
ing the file system to believe in a replica that no longer 7 Delete a block that no longer 724 18 650
exists, which causes subsequent block deletion to fail. exists on data node
8 Empty packet for block 476 476 476
Hadoop developers have recently confirmed this bug.
9 Receive block exception 89 89 89
This anomaly is hard to find because there is no single 10 Replication Monitor timedout 45 37 45
error message indicating the problem. However, we dis- 11 Other anomalies 108 91 107
cover it because we analyze abnormal execution paths. Total 16916 10217 16808
We also notice that we do not have the problem that # False Positive Description Raw TF-IDF
causes confusion in traditional grep based log analysis. 1 Normal background migration 1399 1397
In HDFS datanode logs, we see many messages like 2 Multiple replica (for task / job desc files) 372 349
#:Got Exception while serving # to #:#. Ac- 3 Unknown Reason 26 0
Total 1797 1746
cording to Apache issue tracking (jira) HADOOP-3678,
this is a normal behavior of HDFS: the HDFS data node Table 7: Detected anomalies and false positives using PCA on
generates the exception when a HDFS client does not fin- Hadoop message count vector feature. Actual is the number of
anomalies labeled manually. Raw is PCA detection result on
ish reading an entire block before it stops. These excep-
raw data, TF-IDF is detection result on data preprocessed with
tion messages have confused many users, as indicated by
TF-IDF and normalized by vector length (Section 5).
multiple discussion threads on the Hadoop user mailing
list. While traditional keyword matching (e.g., searching log, instead of manually specified.
for words like Exception or Error) would have flagged
these as errors, our message count method successfully 6.4 Visualizing detection results with decision trees
avoids this false positive because this happens too many From the point of view of an operator, the transforma-
times to be abnormal. tion underlying PCA is a black box algorithm: it pro-
Our algorithm does report some false positives, which vides no intuitive explanation of the detection results and
are inevitable in any unsupervised learning algorithm. cannot be interrogated. Human operators need to man-
For example, the second false positive in Table 7 occurs ually examine anomalies to understand the root cause,
because a few blocks are replicated 10 times instead of 3 and PCA itself provides little help in this regard. In this
times for the majority of blocks. These message groups section, we show how to augment PCA-based detection
look suspicious, but Hadoop experts told us that these with decision trees to make the results more easily un-
are normal situations when the map-reduce system is dis- derstandable and actionable by operators. The decision
tributing job configuration files to all the nodes. It is in- tree result resembles the (manually written) rules used
deed a rare situation compared to the data accesses, but in many system-event-processing programs [10], so it is
is normal by the system design. Eliminating this type of easier for non-machine learning experts. This technique
“rare but normal” false positive requires domain expert is especially useful for features with many dimensions,
knowledge. As a future direction, we are investigating such as the message count vector feature in HDFS.
semi-supervised learning techniques that can take opera- Decision trees have been widely used for classifi-
tor feedback and further improve our results. cation. Because decision tree construction works in
Detection on state ratio vectors. The only state variable chosen from the HDFS logs by our feature generation algorithm is the node name. A node name might not sound like a state variable, but the set of nodes (203 in total) is relatively fixed in HDFS, so node names meet the criteria for state variables described in Section 4.1. The state ratio vector feature thus reduces to per-node activity counts, a feature well studied in existing work [12, 17]. As in this previous work, we are able to detect transient workload imbalance as well as node reboot events. However, our approach is less ad hoc because the state ratio feature is chosen automatically, based on information in the console log, rather than being manually specified.

6.4 Visualizing detection results with decision trees

From the point of view of an operator, the transformation underlying PCA is a black-box algorithm: it provides no intuitive explanation of the detection results and cannot be interrogated. Human operators need to manually examine anomalies to understand the root cause, and PCA itself provides little help in this regard. In this section, we show how to augment PCA-based detection with decision trees to make the results more easily understandable and actionable by operators. The decision tree resembles the (manually written) rules used in many system-event-processing programs [10], so it is accessible to non-machine-learning experts. This technique is especially useful for features with many dimensions, such as the message count vector feature in HDFS.

Decision trees have been widely used for classification. Because decision tree construction works in the original coordinates of the input data, its classification decisions tend to be easy to visualize and understand [34]. Constructing a decision tree requires a training set with class labels; in contrast to the normal use of decision trees, we use the automatically generated PCA detection results (normal vs. abnormal) as the class labels. Our decision tree is thus constructed to explain the underlying logic of the detection algorithm rather than the nature of the dataset.
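To make this construction concrete, the sketch below finds the root split of such a tree: the message type and count threshold that best separate the PCA labels by information gain. It is a simplified, hypothetical rendering (a full tree recurses on each side of the split, and we actually used an off-the-shelf learner in RapidMiner [21] rather than code like this).

    public class Stump {
        static double entropy(int pos, int total) {
            if (total == 0 || pos == 0 || pos == total) return 0;
            double p = (double) pos / total;
            return -p * Math.log(p) - (1 - p) * Math.log(1 - p);
        }

        // counts[i][j]: count of message type j in group i; label[i]: PCA
        // verdict (true = abnormal). Returns {messageType, threshold} for
        // the rule "abnormal if count >= threshold" with maximum gain.
        static int[] bestSplit(int[][] counts, boolean[] label, int maxCount) {
            int n = counts.length, m = counts[0].length, pos = 0;
            for (boolean b : label) if (b) pos++;
            double base = entropy(pos, n), bestGain = -1;
            int[] best = {0, 0};
            for (int j = 0; j < m; j++) {
                for (int t = 1; t <= maxCount; t++) {
                    int hi = 0, hiPos = 0;
                    for (int i = 0; i < n; i++)
                        if (counts[i][j] >= t) { hi++; if (label[i]) hiPos++; }
                    int lo = n - hi, loPos = pos - hiPos;
                    double gain = base - ((double) hi / n) * entropy(hiPos, hi)
                                       - ((double) lo / n) * entropy(loPos, lo);
                    if (gain > bestGain) { bestGain = gain; best = new int[]{j, t}; }
                }
            }
            return best;
        }
    }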
Figure 7 shows the decision tree generated by RapidMiner [21] from the anomaly detection results on the HDFS log.
[Figure 7 (tree graphic). The internal nodes test, from the root down, the counts of message types including "blockMap updated: # is added to # size #", "Received block # of size # from #", "Unexpected error trying to delete block # ...", "Redundant addStoredBlock request received...", "Receiving empty packet for block #", "# starting thread to transfer block # to #", "... But it does not belong to any file ...", and "Adding an already existing block #"; the leaves carry the PCA labels.]

Figure 7: The decision tree visualization. Each node is a message type string (# is the placeholder for variables). The number on each edge is the threshold on the message count, generated by the decision tree algorithm. Small boxes contain the labels from PCA, with a red 1 for abnormal and a green 0 for normal.
It clearly shows the most important message types. For example, the first level shows that if blockMap (the data structure that keeps block locations) is updated more than 3 times, the group is abnormal. This indicates the over-replication problem (Anomaly 4 or False Positive 1 in Table 7). The second level shows that if a block is received 2 times or fewer, the group is abnormal; this indicates under-replication or block-write failure (Anomalies 2 and 3 in Table 7). Level 3 of the decision tree is related to the bug we discussed in Section 6.3.

In summary, visualizing the results with decision trees helps operators and developers notice types of abnormal behavior instead of individual abnormal events, which can greatly improve the efficiency of finding root causes and preventing future alarms.
7 Discussion

Should we completely replace console logs with structured tracing? There are various such efforts [8, 29]. Progress has been slow, however, mainly because there is no standard for structured tracing embraced by all open-source developers. It is also technically difficult to design a global "schema" that accommodates all the information contained in console logs in a structured format.4 Even if such a standard existed, manually porting all legacy code to the schema would be expensive, and automatically porting legacy logging code to structured logging would be no simpler than our log parsing. Our feature creation and anomaly detection algorithms can be used without log parsing in systems that have only structured traces; we described a successful example in Section 2.2.

4 syslog is not structured because it uses the free text field heavily.

Improving console logs. We have discovered some bad logging practices that significantly reduce the usefulness of console logs. Some of them are easy to fix. For example, Facebook's Cassandra storage system traces all operations of nodes sending messages to each other, but it does not write the sequence number or ID of the messages logged. This renders the log almost useless if multiple threads on a single machine send messages concurrently. However, just by adding the message ID, our message count method readily applies and would help detect node communication problems (sketched below).
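A minimal before/after sketch of this fix (hypothetical code, not Cassandra's; the Messenger class and its fields are ours, and we assume a log4j-style logger [9]):

    import org.apache.log4j.Logger;

    class Messenger {
        private static final Logger LOG = Logger.getLogger(Messenger.class);

        void send(long messageId, String dest) {
            // Bad: concurrent sends from different threads are
            // indistinguishable, so log lines cannot be grouped per message.
            // LOG.info("sending message to " + dest);

            // Better: the message ID identifies each send/receive pair, so
            // the message count method can group log lines by message.
            LOG.info("sending message " + messageId + " to " + dest);
        }
    }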
Another bad logging practice, discovered in prior work, is the poor estimation of event severity: many "FATAL" or "ERROR" events are not as bad as the developer thinks [14, 22]. This mistake arises because each developer judges severity only in the context of his own module instead of in the context of the entire system. As we showed in the Hadoop read-exception example, our tool, based on the frequency of events, can give developers insight into the real severity of individual events and thus improve the quality of future logging.

Challenges in log parsing. Since we rely on static source code analysis to extract structure from the logs, our method may fail in some cases and fall back on identifying a large chunk of a log message as an unparsed string. For example, if programmers use very general types such as Object in Java (very rare in practice), our type resolution step fails because there are too many possibilities. We guard against this by limiting the number of descendants of a class to 100, which is large enough to accommodate all the logs we studied but small enough to filter out generic JDK, AWT and Swing classes with many subclasses (such as Object). Features such as generics and mix-ins in modern object-oriented languages provide the mechanisms usually needed to avoid declaring an object with a very general class. In addition, some log messages are undecorated, emitting only a variable of some primitive type without any constant label. These messages are usually leftovers from the debugging phase, and we simply ignore them.

8 Related Work

Most existing work treats the entire log as a single sequence of repeating message types and mines it with time series analysis methods. Hellerstein et al. developed a novel method to mine important patterns, such as message bursts, periodicity and dependencies, from SNMP data in an enterprise network [12, 18]. Yamanishi et al. modeled syslog sequences as a mixture of Hidden Markov Models (HMMs) in order to find messages that are likely to be related to critical failures [35]. Lim et al. analyzed a large-scale enterprise telephony system log with multiple heuristic filters to find messages related to actual failures [17].
Treating a log as a single time series, however, does not perform well in large-scale clusters with multiple independent processes that generate interleaved logs: the model becomes overly complex and its parameters are hard to tune [35]. Our analysis is instead based on message groups rather than a time series of individual messages, which makes it possible to obtain useful results with simple, efficient algorithms such as PCA.

A crucial but questionable assumption in previous work is that message types can be detected accurately. Some projects [12, 18] use manual type labels from SNMP data, which are not generally available in console logs. Many other projects use simple heuristics, such as removing all numeric values and strings that resemble IP addresses, to detect message types [17, 35]. These heuristics are not sufficiently general: if they fail to capture some relevant variables, the resulting message types can number in the tens of thousands [17].
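For illustration, a heuristic of this kind might look like the following (our paraphrase of the approach, not code from the cited systems); any variable that is neither numeric nor IP-shaped, such as a state name, leaks into the "type" and multiplies the apparent message types:

    class NaiveTypeHeuristic {
        // Replace IP-like strings and numbers with wildcards, as the cited
        // heuristics do. Variables matching neither pattern survive, which
        // is why such heuristics can yield tens of thousands of "types".
        static String messageType(String logLine) {
            return logLine
                .replaceAll("\\b\\d{1,3}(\\.\\d{1,3}){3}(:\\d+)?\\b", "IP") // IP[:port]
                .replaceAll("\\b\\d+\\b", "NUM");                           // numbers
        }
        // messageType("xact 325 is COMMITTING") -> "xact NUM is COMMITTING":
        // the state variable still splits one template into many types.
    }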
SLCT [32], Loghound [33], Sisyphus [27], and [7] use more advanced clustering and association rules, as well as scoring methods from information retrieval, to extract message templates for log parsing. IPLoM [19] uses a series of heuristics to capture the differences between similar log messages in order to determine message types. Although these methods can successfully detect recurring patterns, they do so by considering only the textual properties of logs. In contrast, our approach extracts information about program objects from log messages, and our detection is based on event traces related to those objects rather than on textual properties. Indeed, our message count vector feature is more similar to path-based problem detection approaches such as Pinpoint [3, 8], as we pointed out in Section 4.2.

Software development involves textual information other than console logs. By making use of source code, Tan et al. proposed a novel approach to detecting inconsistencies between textual comments and program logic [31]. Our idea is similar in that we make textual information designed for humans also machine-understandable by exploiting highly structured source code. Console log analysis poses unique challenges, however, because we must analyze runtime information in addition to source code.

9 Conclusions and Future Work

We propose a general approach to problem detection via the analysis of console logs, the built-in monitoring information in most software systems. Using source code as a reference to understand the structure of console logs, we are able to parse logs accurately. Accurate parsing allows us to extract identifiers and state variables, which are widely found in logs yet usually ignored because of the difficulty of log parsing. Using console logs, we are able to construct powerful features that were previously exploited only in structured traces. These features reveal accurate information about system execution; thus, efficient algorithms such as PCA yield promising anomaly detection results. In addition, we summarize detection results with a decision tree visualization, which helps operators, integrators and developers quickly understand the detection results.

Our work opens up many new opportunities for turning built-in console logs into a powerful monitoring system for problem detection, and it suggests a variety of future directions, including: 1) extracting log templates from program binaries instead of source code, which would not only make our approach work on non-open-source modules but also bring much operational convenience; 2) designing other features to fully utilize the rich information in console logs; 3) developing online detection algorithms to replace the current postmortem analysis; and 4) investigating methods to correlate logs from multiple related applications and detect more complex failure cases.

Acknowledgements

The authors would like to thank Bill Bolosky, Richard Draves, Jon Stearley, Byung-Gon Chun, Jaideep Chandrashekar, Petros Maniatis, Peter Vosshall, Deborah Weisser, Kimberly Keeton and Kristal Sauer for their great suggestions on early drafts of this paper. Our thanks also go to the anonymous SOSP reviewers and especially to our shepherd, Jeff Chase, for their feedback.

This research is supported in part by gifts from Sun Microsystems, Google, Microsoft, Amazon Web Services, Cisco Systems, Facebook, Hewlett-Packard, Network Appliance, and VMware, and by matching funds from the University of California Industry/University Cooperative Research Program (UC Discovery) grant COM07-10240.

References

[1] A. W. Appel. Modern Compiler Implementation in Java. Cambridge University Press, second edition, 2002.
[2] D. Borthakur. The Hadoop distributed file system: Architecture and design. Hadoop Project Website, 2007.
[3] M. Y. Chen et al. Path-based failure and evolution management. In Proc. NSDI '04, San Francisco, CA, 2004. USENIX.
[4] M. H. DeGroot and M. J. Schervish. Probability and Statistics. Addison-Wesley, 3rd edition, 2002.
[5] R. Dunia and S. J. Qin. Multi-dimensional fault diagnosis using a subspace approach. In Proc. ACC, 1997.
[6] R. Feldman and J. Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2006.
[7] K. Fisher, D. Walker, K. Q. Zhu, and P. White. From dirt to shovels: Fully automatic tool generation from ad hoc data. In Proc. ACM POPL '08, pages 421–434, 2008.
[8] R. Fonseca et al. X-Trace: A pervasive network tracing framework. In Proc. NSDI '07, 2007.
[9] C. Gulcu. Short introduction to log4j, March 2002. http://logging.apache.org/log4j.
[10] S. E. Hansen and E. T. Atkins. Automated system monitoring and notification with Swatch. In Proc. USENIX LISA '93, pages 145–152, 1993.
[11] E. Hatcher and O. Gospodnetic. Lucene in Action. Manning Publications Co., Greenwich, CT, 2004.
[12] J. Hellerstein, S. Ma, and C. Perng. Discovering actionable patterns in event data. IBM Systems Journal, 41(3), 2002.
[13] J. E. Jackson and G. S. Mudholkar. Control procedures for residuals associated with principal component analysis. Technometrics, 21(3):341–349, 1979.
[14] W. Jiang et al. Understanding customer problem troubleshooting from storage system logs. In Proc. USENIX FAST '09, 2009.
[15] I. Jolliffe. Principal Component Analysis. Springer, 2002.
[16] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proc. ACM SIGCOMM, 2004.
[17] C. Lim, N. Singh, and S. Yajnik. A log mining approach to failure analysis of enterprise telephony systems. In Proc. DSN, June 2008.
[18] S. Ma and J. L. Hellerstein. Mining partially periodic event patterns with unknown periods. In Proc. IEEE ICDE, Washington, DC, 2001.
[19] A. A. Makanju, A. N. Zincir-Heywood, and E. E. Milios. Clustering event logs using iterative partitioning. In Proc. KDD '09, 2009.
[20] C. Manning, P. Raghavan, et al. Introduction to Information Retrieval. Cambridge University Press, 2008.
[21] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler. YALE: Rapid prototyping for complex data mining tasks. In Proc. ACM KDD, New York, NY, 2006.
[22] A. Oliner and J. Stearley. What supercomputers say: A study of five system logs. In Proc. IEEE DSN, Washington, DC, 2007.
[23] K. Papineni. Why inverse document frequency? In Proc. NAACL '01, pages 1–8, Morristown, NJ, 2001. Association for Computational Linguistics.
[24] J. E. Prewett. Analyzing cluster log files using Logsurfer. In Proc. Annual Conference on Linux Clusters, 2003.
[25] T. Sager, A. Bernstein, M. Pinzger, and C. Kiefer. Detecting similar Java classes using tree algorithms. In Proc. ACM MSR '06, pages 65–71, 2006.
[26] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Technical report, Cornell University, Ithaca, NY, 1987.
[27] J. Stearley. Towards informatic analysis of syslogs. In Proc. IEEE CLUSTER, Washington, DC, 2004.
[28] Sun. Project Darkstar. www.projectdarkstar.com, 2008.
[29] Sun. Solaris Dynamic Tracing Guide, 2008.
[30] J. Tan et al. SALSA: Analyzing logs as StAte machines. In Proc. WASL '08, 2008.
[31] L. Tan, D. Yuan, G. Krishna, and Y. Zhou. /*icomment: Bugs or bad comments?*/. In Proc. ACM SOSP '07, New York, NY, 2007.
[32] R. Vaarandi. A data clustering algorithm for mining patterns from event logs. In Proc. IPOM, 2003.
[33] R. Vaarandi. A breadth-first algorithm for mining frequent patterns from event logs. In INTELLCOMM, volume 3283, pages 293–308. Springer, 2004.
[34] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
[35] K. Yamanishi and Y. Maruyama. Dynamic syslog mining for network failure monitoring. In Proc. ACM KDD, New York, NY, 2005.

A Appendix: Extracting message templates from source code

We illustrate the details of our source code analysis techniques for message template extraction with a running example in Java, though the general idea applies to other object-oriented languages as well.

0  Transaction transact = ...;
1  Log.info("starting: " + transact);

2  Class Transaction {
3    public String toString() {
4      return "xact " + this.tid +
5        " is " + this.state;
6    }
7  }

8  Class SubTransaction extends Transaction {
9    private Node node = ....;
10   public String toString() {
11     return "xact " + this.tid +
12       " is " + this.state + " at " + node;
13   }
14 }

15 Class TransactExec extends Transaction {
16   .....

Figure 8: Example source code segments. Notice that the logger call in line 1 may generate different log messages at runtime due to the class hierarchies.

Line 1 of Figure 8 is a simple logger call. However, as we discussed in Section 3, it might generate different kinds of messages, such as

starting: xact 325 is COMMITTING
starting: xact 346 is ABORTING at n1:8080    (1)

This is because the variable transact has a complex data type with multiple toString() definitions (Lines 2–15). Our goal is to discover all possible message templates that Line 1 can generate, so we need to resolve the type hierarchy information of transact.

Figure 9 illustrates the major steps of our approach. All analysis is done on the abstract syntax tree (AST) [1] generated by the Eclipse IDE. Our analysis uses three data structures created from the AST: a list of partial message templates, a table of templates representing the toString() methods of all declared types (the "toString Table"), and a Class Hierarchy Table. Although the data structures are logically independent of each other, our implementation builds them in a single pass over the AST.

Partial message template extraction. We first look for all method invocations on objects of the logger class. This gives us the list of all log messages that could possibly be generated, whether or not they actually appear in the log. Common logger class libraries, such as log4j-based loggers [9], can be detected automatically by examining the libraries the software uses.
[Figure 9 (diagram). Major steps of message template extraction. Partial template extraction yields (a) the partial message template starting: (.*) [transact][Transaction] [at Participant.java:345]. (b) The toString Table maps Transaction to xact (.*) is (.*) [tid, state][int, String]. (c) The Class Hierarchy Table records SubTransaction and TransactExec as subclasses of Transaction. Type resolution then produces (d) the complete message templates starting: xact (.*) is (.*) [tid, state][int, String] [at Participant.java:345] and starting: xact (.*) is (.*) at node (.*) [tid, state, node][int, String, Node] [at Participant.java:345].]
Analyzing the parameters of the logger call in Line 1 of Figure 8 gives the partial message template shown in Figure 9 (a). We also record the names and types of the message variables interpolated into the log message (such as transact in Line 1 of Figure 8), which are crucial for the final type resolution, as well as the filename and line number of the logger call.
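The extraction step can be sketched as follows, under stated assumptions: our real tool walks the Eclipse AST, whereas here the logger argument is assumed to be pre-split into the operands of the string concatenation, and the Operand and PartialTemplate types are ours.

    import java.util.*;
    import java.util.regex.Pattern;

    class Operand {
        final boolean isLiteral;
        final String text;   // literal text, or variable name
        final String type;   // declared type for a variable; null for literals
        Operand(boolean isLiteral, String text, String type) {
            this.isLiteral = isLiteral; this.text = text; this.type = type;
        }
    }

    class PartialTemplate {
        final StringBuilder regex = new StringBuilder();
        final List<String> varNames = new ArrayList<>();
        final List<String> varTypes = new ArrayList<>();

        // E.g., Log.info("starting: " + transact) from Figure 8 yields
        // "starting: (.*)" with variable [transact] of type [Transaction].
        static PartialTemplate extract(List<Operand> operands) {
            PartialTemplate t = new PartialTemplate();
            for (Operand op : operands) {
                if (op.isLiteral) {
                    t.regex.append(Pattern.quote(op.text)); // constant part
                } else {
                    t.regex.append("(.*)");  // placeholder until type resolution
                    t.varNames.add(op.text);
                    t.varTypes.add(op.type);
                }
            }
            return t;
        }
    }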
Type analysis. We next determine how each message variable will be rendered as a string in the logger call. For example, because transact is of type Transaction, we can determine how it would appear in a log message by looking at the toString() method of the Transaction class. We traverse the AST to build a toString Table containing the toString() definitions and toString templates of all classes. Figure 9 (b) shows the toString templates extracted from Lines 2–16 of Figure 8.

Because class hierarchy information is so important, we perform a third traversal of the AST to build the Class Hierarchy Table. Box (c) in Figure 9 shows an example.
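A minimal sketch of the two lookup structures, populated by hand with the Figure 8 classes for illustration (our implementation fills both tables during AST traversal, and the string-valued templates here stand in for the richer PartialTemplate records above):

    import java.util.*;

    class Tables {
        // class name -> the template its toString() produces
        static final Map<String, String> toStringTable = new HashMap<>();
        // class name -> its known subclasses (Figure 9, box (c))
        static final Map<String, List<String>> classHierarchy = new HashMap<>();

        static {
            // From Lines 2-7 of Figure 8:
            toStringTable.put("Transaction", "xact (.*) is (.*)");
            // From Lines 8-14, assuming Node's toString reduces to an
            // unparsed string:
            toStringTable.put("SubTransaction", "xact (.*) is (.*) at (.*)");
            classHierarchy.put("Transaction",
                    List.of("SubTransaction", "TransactExec"));
        }
    }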
Type resolution. Finally, for each partial message template containing non-primitive variables (i.e., members of a non-primitive class), we look up the class's toString method and the corresponding toString templates in the toString Table, and substitute the templates found into the partial message template. For example, for the logger call in Line 1 of Figure 8, which references the transact object, we look up the toString method of its class (Transaction). If the toString() method is not explicitly defined in the Transaction class, we use the Class Hierarchy Table, built from the AST, to attempt to resolve toString() in the object's superclasses. We do this recursively until either a toString() method is found or we reach the root of the class hierarchy (in Java, the java.lang.Object class), in which case we give up and treat the template as an unparsed string (.*).

The sub-classing problem is also handled in this step. We find all descendants of a declared class; if a toString() method is defined in any subclass, we generate a message template as if the subclass were used instead of the declared class. For example, because SubTransaction is a subclass of Transaction, we generate a second message template capturing the case when transact is actually an instance of SubTransaction. We do this for every subclass of Transaction known at compile time.

Lastly, note that type resolution is recursive. For example, if an object has class SubTransaction, we examine the toString method of SubTransaction (Line 8 of Figure 8) and find that it interpolates a variable node of non-primitive type (Line 11). We recurse and substitute in the toString template of Node, and we continue until the type of every variable becomes a primitive type or an unparsed string. We also limit the maximum recursion depth to deal with recursive type definitions.

Because we use static analysis to predict what the log output will look like at runtime, it is impossible to handle all cases correctly; loops and recursive calls are examples. We make our technique robust by allowing it to fall back to an unparsed string (.*) in such cases. In the real systems we studied, these hard cases rarely occur in log printing and rarely cause problems in practice. There are also language-specific idioms, such as Arrays.deepToString(array) (array dumping) in Java, which has an implicit built-in format that separates array elements with commas. Our parser recognizes these idioms and handles them as special cases.
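Putting these rules together, a simplified resolver might look like this (a sketch under the assumptions above, reusing the hypothetical Tables class; it shows the subclass expansion, the depth limit, and the unparsed-string fallback, while the substitution of nested fields such as node is elided):

    import java.util.*;

    class Resolver {
        static final int MAX_DEPTH = 5;   // guards recursive type definitions
        static final Set<String> DIRECT =  // types rendered directly as values
                Set.of("int", "long", "boolean", "double", "String");

        // All toString templates a variable declared with this type may
        // produce at runtime, covering subclasses known at compile time.
        static List<String> resolve(String type, int depth) {
            if (DIRECT.contains(type) || depth >= MAX_DEPTH)
                return List.of("(.*)");   // direct value, or fallback
            List<String> out = new ArrayList<>();
            String tmpl = Tables.toStringTable.get(type);
            out.add(tmpl != null ? tmpl : "(.*)"); // no toString(): give up
            for (String sub : Tables.classHierarchy.getOrDefault(type, List.of()))
                out.addAll(resolve(sub, depth + 1)); // sub-classing case
            return out;
        }
    }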