Error Log Processing For Accurate Failure Prediction
2 The Data Set

The data set used for the experiments derives from a commercial telecommunication system. Its main purpose is to realize a Service Control Point (SCP) in an Intelligent Network (IN), providing Service Control Functions (SCF) for communication-related management such as billing, number translations, or prepaid functionality. Services are offered for Mobile Originated Calls (MOC), Short Message Service (SMS), or General Packet Radio Service (GPRS). Service requests are transmitted to the system using various communication protocols such as Remote Authentication Dial In User Interface (RADIUS), Signaling System Number 7 (SS7), or Internet Protocol (IP). Since the system is an SCP, it cooperates closely with other telecommunication systems in the Global System for Mobile Communication (GSM); however, it does not switch calls itself. The system is realized as a multi-tier architecture employing a component-based software design. At the time when measurements were taken, the system consisted of more than 1.6 million lines of code and approximately 200 components realized by more than 2000 classes, running simultaneously in several containers, each replicated for fault tolerance.

The specification of the telecommunication system requires that within successive, non-overlapping five-minute intervals, the fraction of calls having a response time longer than 250 ms must not exceed 0.01%. This definition of the failures to be predicted is equivalent to a required four-nines interval service availability. Hence, the failures that are predicted by the HSMM failure predictor belong to the class of performance failures.

The setup from which data has been collected is depicted in Figure 2. A call tracker kept track of request response times and logged each request whose response time exceeded 250 ms. Furthermore, the call tracker reported in five-minute intervals whether call availability dropped below 99.99%. More specifically, the exact time of failure has been determined to be the first failed request that caused interval availability to drop below the threshold.

instances) to SMS and MOC services. In this study, only the first failure type has been investigated.

3 Data Preprocessing

3.1 Creating Machine Processable Logfiles

Traditionally, logfiles were intended to be read by humans in order to support fault diagnosis and root cause analysis after a system had failed. They are not well suited for machine processing. An (anonymized) example log record consisting of three lines in the error log is shown in Figure 3.

2004/04/09-19:26:13.634089-29846-00010-LIB ABC USER-AGOMP#020200034000060|
020101044430000|000000000000-020234f43301e000-2.0.1|020200003200060
2004/04/09-19:26:13.634089-29846-00010-LIB ABC USER-NOT: src=ERROR APPLICATION
sev=SEVERITY MINOR id=020d02222083730a
2004/04/09-19:26:13.634089-29846-00010-LIB ABC USER-unknown nature of address
value specified

Figure 3: Anonymized error log record from the telecommunication system. The record consists of three log lines.

In order to simplify machine processing, we applied the transformations described in the following paragraphs.

Eliminating logfile rotation. Logfile rotation denotes a technique to switch to a new logfile when the current logfile has reached a size limit, a time span limit, or both. In the telecommunication system, logging was organized in a ring-buffer fashion consisting of n logfiles. The data has been reorganized to form one large, chronologically ordered logfile.

Identifying borders between messages. While error messages "travel" through various modules and architectural levels of the system, more and more information is accumulated until the resulting log record is written to the logfile. This often leads to situations where the original error message is quoted several times within one log record, and one log record spans several lines in the file. We eliminated the duplicated information and assigned each piece to a fixed column in the log such that each line corresponds to exactly one log record. This also involved the use of a unique field delimiter.
3.2 Assigning IDs to Error Messages

Many error analysis tools, including the HSMM failure predictor, rely on an integer number to characterize the type of each error message. However, in our case such an identifier was not available. This section describes the algorithm we used to assign an ID to each error message in the log.

The type of an error report is only implicitly given by a natural language sentence describing the event. In this section, we propose a method to automatically assign error IDs to messages on the basis of Levenshtein's edit distance. Note that the error ID is meant to characterize what has happened, which corresponds to the type of an error message in contrast to the message source, as has been discussed in [6].

Removal of numbers. Assume that the following message occurs in the error log:

process 1534: end of buffer reached

The situation that exactly the process with number 1534 reaches the end of a buffer will occur rather rarely. Furthermore, the process number relates to the source rather than to the type of the message. Hence, all numbers and log-record-specific data such as IP addresses, etc., are replaced by placeholders. For example, the message shown above is translated into:

process xx: end of buffer reached

In order not to lose the information, a copy of the original message is kept.

Data              No of msgs   Reduction in %
Original          1,695,160    n/a
Without numbers   12,533       99.26%
Levenshtein       1,435        88.55% / 99.92%

Table 1: Number of different log messages in the original data, after substitution of numbers by placeholders, and after clustering by the Levenshtein distance metric.

to each other. Except for a few blocks in the middle of the plot, dark steps only occur along the main descending diagonal, and the rest of the plot is rather light-colored. This indicates that strong similarity is only present among messages with the same ID and not between other message types. In addition to the plot, we have manually checked a selection of a few tens of messages. Hence, using a fixed threshold is a simple yet robust approach. Nevertheless, as is the case for any grouping algorithm, it may assign the same ID to two error messages that should be kept separate. For example, if process 1534 was a crucial singleton process in the system (like the "init" process in the Linux kernel, which always has process ID one), the number would be an important piece of information that should not be eliminated. However, in our case the significant reduction in the number of messages outweighs such effects. Note that Levenshtein distances have to be computed only once for any pair of messages.
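The two steps of the ID assignment — replacing numbers by placeholders and grouping messages by edit distance — can be sketched as follows. This is a minimal Python sketch: the placeholder `xx` follows the example above, but the normalized distance, the 0.1 threshold, and the greedy choice of one representative per group are illustrative assumptions, since the exact grouping parameters are not given here:

```python
import re

def remove_numbers(msg):
    """Replace numbers by a placeholder; a copy of the original
    message would be kept in a separate column."""
    return re.sub(r"\d+", "xx", msg)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def assign_ids(messages, threshold=0.1):
    """Assign an integer ID to each message: messages whose normalized
    edit distance to a group representative is below the threshold
    share an ID; otherwise a new ID (and representative) is created."""
    reps, ids = [], []
    for msg in messages:
        for gid, rep in enumerate(reps):
            if levenshtein(msg, rep) / max(len(msg), len(rep), 1) < threshold:
                ids.append(gid)
                break
        else:
            ids.append(len(reps))
            reps.append(msg)
    return ids
```

Because each new message is compared only against one representative per existing group, no pair of messages is compared more than once.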
in [10] the authors introduce a correlation measure for this purpose.

We adopt the tupling method of [9]. However, equating the location reported in an error message with the true location of the fault only works for systems with strong fault containment regions. Since this assumption does not hold for the telecommunication system under consideration, spatial tupling is not considered any further here. The basic idea of tupling is that all errors showing an inter-arrival time of less than a threshold ε are grouped.1 Grouping can lead to two problems:

1. Error messages might be combined that refer to several (unrelated) faults. This is called a collision.

2. If an inter-arrival time > ε occurs within the error pattern of one single fault, this pattern is divided into more than one tuple. This effect is called truncation.

Both the number of collisions and the number of truncations depend on ε. If ε is large, truncation happens rarely but collisions become very likely; if ε is small, the effect is vice versa. To find an optimal ε, the authors suggest plotting the number of tuples over ε. This should yield an L-shaped curve: if ε equals zero, the number of tuples equals the number of error events in the logfile. While ε is increased, the number drops quickly. When the optimal value for ε is reached, the curve flattens suddenly. Our data supports this claim: Figure 5 shows the plot for a subset of one million log records. The graph shows a clear change point, and a value of ε = 0.015 s has been chosen.

or not, the failure log, which has been written by the call tracker, has been analyzed to extract the timestamps and types of failure occurrences. In this last step of data preprocessing, both types of sequences are extracted from the data set.

Three parameters are involved in sequence extraction:

1. Lead-time. In order to predict failures before a failure occurs, extracted failure sequences preceded the time of failure occurrence by a lead-time ∆tl. In the experiments we used a value of five minutes.

2. Data window size. The length of each sequence is determined by a maximum time ∆td. In the experiments we used sequences of five minutes in length.

3. Margins for non-failure sequences. The set of non-failure sequences should be extracted from the log at times when the system is fault-free. However, whether a system really is fault-free is hard to tell. Therefore, we applied a "ban period" of size ∆tm before and after each failure. By visual inspection (length of bursts of failures, etc.), we determined ∆tm to be 20 minutes.

Non-failure sequences have been generated using overlapping time windows, which simulates the case that failure prediction is performed each time an error occurs, and a random selection has been used to reduce the size of the training data set.
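The temporal tupling and the sequence-extraction parameters can be sketched as follows. This is a minimal Python sketch assuming error timestamps are sorted and given in seconds; the function names, the half-open window boundaries, and the example values are illustrative assumptions:

```python
def tuple_errors(timestamps, eps):
    """Temporal tupling: a new tuple starts whenever the inter-arrival
    time to the previous error is >= eps; otherwise errors are grouped
    into the current tuple."""
    tuples = []
    for t in timestamps:
        if tuples and t - tuples[-1][-1] < eps:
            tuples[-1].append(t)
        else:
            tuples.append([t])
    return tuples

def tuple_counts(timestamps, eps_values):
    """Number of tuples as a function of eps; plotted over eps this
    yields the L-shaped curve used to choose the threshold."""
    return {eps: len(tuple_errors(timestamps, eps)) for eps in eps_values}

def failure_sequence(error_times, t_failure, lead=300.0, window=300.0):
    """Extract the failure sequence for a failure at t_failure: all
    errors in a window of length ∆td = window that ends ∆tl = lead
    seconds before the failure (both five minutes = 300 s here)."""
    end = t_failure - lead
    return [t for t in error_times if end - window < t <= end]
```

With eps equal to zero every error forms its own tuple, and with a very large eps everything collapses into one tuple, which is exactly the behavior behind the L-shaped curve.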
Figure 6: Matrix of logarithmic sequence likelihoods. Each element (i, j) of the matrix is the logarithmic sequence likelihood log P(Fi | Mj) for sequence Fi and model Mj.

ertheless, D(j, j) is close to zero since it denotes the log-sequence likelihood for the sequence that model Mj has been trained with.

In order to achieve a good measure of similarity among sequences, the models should not be overfitted to their training sequences. Furthermore, one model needs to be trained for each failure sequence in the training data set. Therefore, the models Mi have only a few states and are ergodic (have the structure of a clique). An example is shown in Figure 7. In order to further avoid too specific models, exponential distributions for inter-error durations and a uniform background distribution have been used. Background distributions add some small probability to
a HSMM with 20 states and a uniform background distribution with a weighting factor of 0.25. Banner plots connect data points (sequences) by a bar whose length corresponds to the level of the distance metric at which the two points are merged / divided. Single linkage clustering (second row, left) shows
Figure 10: The three different sequence sets that can be used to compute symbol prior probabilities.

3. Time windows are defined that reach backwards in time. The length of the time windows is fixed, and time windows may overlap. Time windows are indicated by shaded vertical bars in the figure.

4. The test is performed for each time window separately, taking into account all error events that have occurred within the time window in all failure sequences of the group.

5. Only error events that occur significantly more frequently in the time window than their prior probability suggests stay in the set of training sequences. All other error events within the time window are removed.

6. Filtering rules are stored for each time window, specifying the error symbols that pass the filter. The filter rules are used later for online failure prediction in order to filter new sequences that occur during runtime.

More formally, each error ei that occurs in failure sequences of the same cluster within a time window (t − ∆t, t] prior to failure is checked for significant deviation from the prior p̂0i by a test variable known from χ-grams, which are a non-squared version of the testing variable of the χ2 goodness-of-fit test (see, e.g., [13]). The testing variable Xi is defined as the non-squared standardized difference:

Xi = (ni − n p̂0i) / √(n p̂0i) ,    (2)

where ni denotes the number of occurrences of error ei and n is the total number of errors in the time window. An analysis reveals that all Xi have an expected value of zero and a variance of one; hence they can all be compared to one threshold c: filtering eliminates from the sequences within the time window all errors ei for which Xi < c. For online prediction, the sequence under investigation is filtered the same way before sequence likelihood is computed.

Three variants regarding the computation of the priors p̂0i exist (see Figure 10):

1. p̂0i are estimated from all training sequences (failure and non-failure). Xi compares the frequency of occurrence of error ei to the frequency of occurrence within the entire training data.

3. p̂0i are estimated separately for each group of failure sequences from all errors within the group (over all time windows). For each error ei the testing variable Xi compares the occurrence within one time window to the entire group of failure sequences.

Experiments have been performed on the data set used previously for the clustering analysis, and six non-overlapping filtering time windows of length 50 seconds have been analyzed. Figure 11 plots the average number of symbols in one group of failure sequences after filtering out all errors with Xi < c, for various values of c.

Figure 11: Mean sequence length depending on threshold c for three different priors.

Regarding the prior computed from all sequences (solid line), all symbols pass the filter for very small thresholds. At some value of c, the length of the sequences starts dropping quickly up to a point where sequence lengths stabilize for some range of c. With further increasing c, the average sequence length drops again until finally not a single symbol passes the filter. Similar to the tupling heuristic of [8], we consider a threshold at the beginning of the middle plateau to best distinguish between "signal" and noise. The other priors do not show this behavior; hence we used the priors estimated from all sequences (first variant).

6 Results

As stated before, the overall objective was to predict failures of the telecommunication system as accurately as possible. The metric used to measure the accuracy of predictions is the so-called F-measure, which is the harmonic mean of precision and recall. Precision is the relative number
of correctly predicted failures to the total number of predictions, and recall is the relative number of correctly predicted failures to the total number of failures. A definition and comprehensive analysis can be found in Chapter 8.2 of [5]. The HSMM prediction method involves a customizable threshold determining whether a failure warning is issued very easily (at a low level of confidence in the prediction) or only if it is rather sure that a failure is imminent, which affects the trade-off between precision and recall.2 In this paper we only report the maximum achievable F-measure.

2 In fact, either precision or recall can be increased to 100% at the cost of the other.

Applying the full chain of data preparation as described in Sections 3 to 5 yields a failure prediction F-measure of 0.66. A comparative study has shown that this result is significantly more accurate than the best-known error-based prediction approaches (see Chapter 9.9 of [5]). In order to determine the effect of clustering and filtering, we have conducted experiments based on ungrouped (unclustered) data as well as on unfiltered data. Unfortunately, experiments applying neither filtering nor grouping were not feasible. All experiments have been performed with the same HSMM setup (i.e., number of states, model topology, etc.). The results unveil that data preparation plays a significant role in achieving accurate failure predictions (see Table 2).

Method             Max. F-Measure   rel. Quality
Optimal results    0.66             100%
Without grouping   0.5097           77%
Without filtering  0.3601           55%

Table 2: Failure prediction accuracy expressed as maximum F-measure from data with full data preparation, without failure sequence grouping (clustering), and without noise filtering.

7 Conclusions

It is common perception today that logfiles, and in particular error logs, are a fruitful source of information both for analysis after failure and for proactive fault handling, which frequently builds on the anticipation of upcoming failures. However, in order to get (machine) access to the information contained in logs, the data needs to be put into shape, and valuable pieces of information need to be picked from the vast amount of data. This paper described the process we used to prepare error logs of a commercial telecommunication system for a hidden semi-Markov failure predictor.

The preparation process consists of three major steps and involved the following new concepts: (a) an algorithm to automatically assign integer error IDs to error messages, (b) a clustering algorithm for error sequences, and (c) a statistical filtering algorithm to reduce noise in the sequences. We conducted experiments to assess the effect of sequence clustering and noise filtering. The results unveiled that elaborate data preparation is a very important step to achieve good prediction accuracy.

In addition to failure prediction, the proposed techniques might also be helpful to speed up the process of diagnosis: for example, if root causes have been identified for each failure group in a reference data set, identification of the most similar reference sequence would allow a first assignment of potential root causes for a failure that has occurred during runtime.

References

[1] R. K. Iyer, L. T. Young, and V. Sridhar. Recognition of error symptoms in large systems. In Proceedings of the 1986 ACM Fall Joint Computer Conference, pages 797-806, Los Alamitos, CA, USA, 1986. IEEE Computer Society Press.

[2] Paul Horn. Autonomic computing: IBM's perspective on the state of information technology, Oct. 2001.

[3] Adam Oliner and Jon Stearley. What supercomputers say: A study of five system logs. In IEEE Proceedings of the International Conference on Dependable Systems and Networks (DSN'07), pages 575-584. IEEE Computer Society, 2007.

[4] David Bridgewater. Standardize messages with the Common Base Event model, 2004.

[5] Felix Salfner. Event-based Failure Prediction: An Extended Hidden Markov Model Approach. dissertation.de - Verlag im Internet GmbH, Berlin, Germany, 2008. (Available as PDF).

[6] Felix Salfner, Steffen Tschirpke, and Miroslaw Malek. Comprehensive logfiles for autonomic systems. In IEEE Proceedings of IPDPS, Workshop on Fault-Tolerant Parallel, Distributed and Network-Centric Systems (FTPDS), 2004.

[7] Alberto Apostolico and Zvi Galil. Pattern Matching Algorithms. Oxford University Press, 1997.

[8] R. Iyer and D. Rosetti. A statistical load dependency of CPU errors at SLAC. In IEEE Proceedings of the 12th International Symposium on Fault-Tolerant Computing (FTCS-12), 1982.

[9] M. M. Tsao and Daniel P. Siewiorek. Trend analysis on system error files. In Proc. 13th International Symposium on Fault-Tolerant Computing, pages 116-119, Milano, Italy, 1983.

[10] Song Fu and Cheng-Zhong Xu. Quantifying temporal and spatial fault event correlation for proactive failure management. In IEEE Proceedings of the Symposium on Reliable and Distributed Systems (SRDS 07), 2007.

[11] Padhraic Smyth. Clustering sequences with hidden Markov models. In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 648. The MIT Press, 1997.

[12] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data. John Wiley and Sons, New York, 1990.

[13] Rainer Schlittgen. Einführung in die Statistik: Analyse und Modellierung von Daten. Oldenbourg-Wissenschaftsverlag, München, Wien, 9th edition, 2000.