Error Log Processing For Accurate Failure Prediction
2 The Data Set

The data set used for the experiments derives from a commercial telecommunication system. Its main purpose is to realize a Service Control Point (SCP) in an Intelligent Network (IN), providing Service Control Functions (SCF) for communication-related management such as billing, number translations, or prepaid functionality. Services are offered for Mobile Originated Calls (MOC), Short Message Service (SMS), or General Packet Radio Service (GPRS). Service requests are transmitted to the system using various communication protocols such as Remote Authentication Dial In User Interface (RADIUS), Signaling System Number 7 (SS7), or Internet Protocol (IP). Since the system is an SCP, it cooperates closely with other telecommunication systems in the Global System for Mobile Communication (GSM); however, it does not switch calls itself. The system is realized as a multi-tier architecture employing a component-based software design. At the time when measurements were taken, the system consisted of more than 1.6 million lines of code and approximately 200 components realized by more than 2000 classes, running simultaneously in several containers, each replicated for fault tolerance.

The specification of the telecommunication system requires that within successive, non-overlapping five-minute intervals, the fraction of calls having a response time longer than 250 ms must not exceed 0.01%. This definition of the failures to be predicted is equivalent to a required four-nines interval service availability. Hence, the failures that are predicted by the HSMM failure predictor belong to the class of performance failures.

The setup from which data has been collected is depicted in Figure 2. A call tracker kept track of request response times and logged each request whose response time exceeded 250 ms. Furthermore, the call tracker reported in five-minute intervals whether call availability dropped below 99.99%. More specifically, the exact time of failure has been determined to be the first failed request that caused interval availability to drop below the threshold.

instances) to SMS and MOC services. In this study, only the first failure type has been investigated.

3 Data Preprocessing

3.1 Creating Machine Processable Logfiles

Traditionally, logfiles were intended to be read by humans in order to support fault diagnosis and root cause analysis after a system had failed. They are not well suited for machine processing. An (anonymized) example log record consisting of three lines in the error log is shown in Figure 3.

2004/04/09-19:26:13.634089-29846-00010-LIB ABC USER-AGOMP#020200034000060|
020101044430000|000000000000-020234f43301e000-2.0.1|020200003200060
2004/04/09-19:26:13.634089-29846-00010-LIB ABC USER-NOT: src=ERROR APPLICATION
sev=SEVERITY MINOR id=020d02222083730a
2004/04/09-19:26:13.634089-29846-00010-LIB ABC USER-unknown nature of address
value specified

Figure 3: Anonymized error log record from the telecommunication system. The record consists of three log lines.

In order to simplify machine processing, we applied the transformations described in the following paragraphs.

Eliminating logfile rotation. Logfile rotation denotes a technique to switch to a new logfile when the current logfile has reached a size limit, a time span limit, or both. In the telecommunication system, logging was organized in a ring-buffer fashion consisting of n logfiles. The data has been reorganized to form one large, chronologically ordered logfile.

Identifying borders between messages. While error messages "travel" through various modules and architectural levels of the system, more and more information is accumulated until the resulting log record is written to the logfile. This often leads to situations where the original error message is quoted several times within one log record, and one log record spans several lines in the file. We eliminated the duplicated information and assigned each piece to a fixed column in the log such that each line corresponds to exactly one log record. This also involved the use of a unique field delimiter.
3.2 Assigning IDs to Error Messages

Many error analysis tools, including the HSMM failure predictor, rely on an integer number to characterize the type of each error message. However, in our case such an identifier was not available. This section describes the algorithm we used to assign an ID to each error message in the log.

The type of an error report is only implicitly given by a natural language sentence describing the event. In this section, we propose a method to automatically assign error IDs to messages on the basis of Levenshtein's edit distance. Note that the error ID is meant to characterize what has happened, which corresponds to the type of an error message in contrast to the message source, as has been discussed in [6].

Removal of numbers. Assume that the following message occurs in the error log:

process 1534: end of buffer reached

The situation that exactly the process with number 1534 reaches the end of a buffer will occur rather rarely. Furthermore, the process number relates to the source rather than to the type of the message. Hence, all numbers and log-record-specific data such as IP addresses, etc., are replaced by placeholders. For example, the message shown above is translated into:

process xx: end of buffer reached

In order not to lose the information, a copy of the original message is kept.

Data              No of msgs   Reduction in %
Original          1,695,160    n/a
Without numbers   12,533       99.26%
Levenshtein       1,435        88.55% / 99.92%

Table 1: Number of different log messages in the original data, after substitution of numbers by placeholders, and after clustering by the Levenshtein distance metric.

to each other. Except for a few blocks in the middle of the plot, dark steps only occur along the main descending diagonal, and the rest of the plot is rather light-colored. This indicates that strong similarity is only present among messages with the same ID and not between other message types. In addition to the plot, we have manually checked a selection of a few tens of messages. Hence, using a fixed threshold is a simple yet robust approach. Nevertheless, as is the case for any grouping algorithm, it may assign the same ID to two error messages that should be kept separate. For example, if process 1534 was a crucial singleton process in the system (like the "init" process in the Linux kernel, which always has process ID one), the number would be an important piece of information that should not be eliminated. However, in our case the significant reduction in the number of messages outweighs such effects. Note that Levenshtein distances have to be computed only once for any pair of messages.
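The two steps of the ID assignment — replacing numbers by placeholders and grouping messages by edit distance — can be sketched as follows. This is a minimal Python sketch: the placeholder `xx` follows the example above, but the normalized distance, the 0.1 threshold, and the greedy choice of one representative per group are illustrative assumptions, since the exact grouping parameters are not given here:

```python
import re

def remove_numbers(msg):
    """Replace numbers by a placeholder; a copy of the original
    message would be kept in a separate column."""
    return re.sub(r"\d+", "xx", msg)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def assign_ids(messages, threshold=0.1):
    """Assign an integer ID to each message: messages whose normalized
    edit distance to a group representative is below the threshold
    share an ID; otherwise a new ID (and representative) is created."""
    reps, ids = [], []
    for msg in messages:
        for gid, rep in enumerate(reps):
            if levenshtein(msg, rep) / max(len(msg), len(rep), 1) < threshold:
                ids.append(gid)
                break
        else:
            ids.append(len(reps))
            reps.append(msg)
    return ids
```

Because each new message is compared only against one representative per existing group, no pair of messages is compared more than once.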
in [10] the authors introduce a correlation measure for this purpose.

We adopt the tupling method of [9]. However, equating the location reported in an error message with the true location of the fault only works for systems with strong fault containment regions. Since this assumption does not hold for the telecommunication system under consideration, spatial tupling is not considered any further here. The basic idea of tupling is that all errors showing an inter-arrival time of less than a threshold ε are grouped.1 Grouping can lead to two problems:

1. Error messages might be combined that refer to several (unrelated) faults. This is called a collision.

2. If an inter-arrival time > ε occurs within the error pattern of one single fault, this pattern is divided into more than one tuple. This effect is called truncation.

Both the number of collisions and the number of truncations depend on ε. If ε is large, truncation happens rarely but collisions become very likely; if ε is small, the effect is vice versa. To find an optimal ε, the authors suggest plotting the number of tuples over ε. This should yield an L-shaped curve: if ε equals zero, the number of tuples equals the number of error events in the logfile. While ε is increased, the number drops quickly. When the optimal value for ε is reached, the curve flattens suddenly. Our data supports this claim: Figure 5 shows the plot for a subset of one million log records. The graph shows a clear change point, and a value of ε = 0.015 s has been chosen.

or not, the failure log, which has been written by the call tracker, has been analyzed to extract the timestamps and types of failure occurrences. In this last step of data preprocessing, both types of sequences are extracted from the data set.

Three parameters are involved in sequence extraction:

1. Lead-time. In order to predict failures before a failure occurs, extracted failure sequences preceded the time of failure occurrence by a lead-time ∆tl. In the experiments we used a value of five minutes.

2. Data window size. The length of each sequence is determined by a maximum time ∆td. In the experiments we used sequences of five minutes in length.

3. Margins for non-failure sequences. The set of non-failure sequences should be extracted from the log at times when the system is fault-free. However, whether a system really is fault-free is hard to tell. Therefore, we applied a "ban period" of size ∆tm before and after each failure. By visual inspection (length of bursts of failures, etc.), we determined ∆tm to be 20 minutes.

Non-failure sequences have been generated using overlapping time windows, which simulates the case that failure prediction is performed each time an error occurs, and a random selection has been used to reduce the size of the training data set.
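The temporal tupling and the sequence-extraction parameters can be sketched as follows. This is a minimal Python sketch assuming error timestamps are sorted and given in seconds; the function names, the half-open window boundaries, and the example values are illustrative assumptions:

```python
def tuple_errors(timestamps, eps):
    """Temporal tupling: a new tuple starts whenever the inter-arrival
    time to the previous error is >= eps; otherwise errors are grouped
    into the current tuple."""
    tuples = []
    for t in timestamps:
        if tuples and t - tuples[-1][-1] < eps:
            tuples[-1].append(t)
        else:
            tuples.append([t])
    return tuples

def tuple_counts(timestamps, eps_values):
    """Number of tuples as a function of eps; plotted over eps this
    yields the L-shaped curve used to choose the threshold."""
    return {eps: len(tuple_errors(timestamps, eps)) for eps in eps_values}

def failure_sequence(error_times, t_failure, lead=300.0, window=300.0):
    """Extract the failure sequence for a failure at t_failure: all
    errors in a window of length ∆td = window that ends ∆tl = lead
    seconds before the failure (both five minutes = 300 s here)."""
    end = t_failure - lead
    return [t for t in error_times if end - window < t <= end]
```

With eps equal to zero every error forms its own tuple, and with a very large eps everything collapses into one tuple, which is exactly the behavior behind the L-shaped curve.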
Figure 6: Matrix of logarithmic sequence likelihoods. Each element (i, j) of the matrix is the logarithmic sequence likelihood log P(Fi | Mj) for sequence Fi and model Mj.

ertheless, D(j, j) is close to zero since it denotes the log-sequence likelihood for the sequence that model Mj has been trained with.

In order to achieve a good measure of similarity among sequences, the models should not be overfitted to their training sequences. Furthermore, one model needs to be trained for each failure sequence in the training data set. Therefore, the models Mi have only a few states and are ergodic (have the structure of a clique). An example is shown in Figure 7. In order to further avoid too specific models, exponential distributions for inter-error durations and a uniform background distribution have been used. Background distributions add some small probability to
a HSMM with 20 states and a uniform background distribution with a weighting factor of 0.25. Banner plots connect data points (sequences) by a bar whose length corresponds to the level of the distance metric at which the two points are merged / divided. Single linkage clustering (second row, left) shows
Figure 10: The three different sequence sets that can be used to compute symbol prior probabilities.

3. Time windows are defined that reach backwards in time. The length of the time windows is fixed, and time windows may overlap. Time windows are indicated by shaded vertical bars in the figure.

4. The test is performed for each time window separately, taking into account all error events that have occurred within the time window in all failure sequences of the group.

5. Only error events that occur significantly more frequently in the time window than their prior probability suggests stay in the set of training sequences. All other error events within the time window are removed.

6. Filtering rules are stored for each time window, specifying the error symbols that pass the filter. The filter rules are used later for online failure prediction in order to filter new sequences that occur during runtime.

More formally, each error ei that occurs in failure sequences of the same cluster within a time window (t − ∆t, t] prior to failure is checked for significant deviation from the prior p̂0i by a test variable known from χ-grams, which are a non-squared version of the testing variable of the χ2 goodness-of-fit test (see, e.g., [13]). The testing variable Xi is defined as the non-squared standardized difference:

Xi = (ni − n p̂0i) / √(n p̂0i) ,    (2)

where ni denotes the number of occurrences of error ei and n is the total number of errors in the time window. An analysis reveals that all Xi have an expected value of zero and a variance of one; hence they can all be compared to one threshold c: filtering eliminates from the sequences within the time window all errors ei for which Xi < c. For online prediction, the sequence under investigation is filtered the same way before sequence likelihood is computed.

Three variants regarding the computation of the priors p̂0i exist (see Figure 10):

1. p̂0i are estimated from all training sequences (failure and non-failure). Xi compares the frequency of occurrence of error ei to the frequency of occurrence within the entire training data.

3. p̂0i are estimated separately for each group of failure sequences from all errors within the group (over all time windows). For each error ei the testing variable Xi compares the occurrence within one time window to the entire group of failure sequences.

Experiments have been performed on the data set used previously for the clustering analysis, and six non-overlapping filtering time windows of length 50 seconds have been analyzed. Figure 11 plots the average number of symbols in one group of failure sequences after filtering out all errors with Xi < c, for various values of c.

Figure 11: Mean sequence length depending on threshold c for three different priors.

Regarding the prior computed from all sequences (solid line), all symbols pass the filter for very small thresholds. At some value of c, the length of the sequences starts dropping quickly up to a point where sequence lengths stabilize for some range of c. With further increasing c, the average sequence length drops again until finally not a single symbol passes the filter. Similar to the tupling heuristic of [8], we consider a threshold at the beginning of the middle plateau to best distinguish between "signal" and noise. The other priors do not show this behavior; hence we used the priors estimated from all sequences (first variant).

6 Results

As stated before, the overall objective was to predict failures of the telecommunication system as accurately as possible. The metric used to measure the accuracy of predictions is the so-called F-measure, which is the harmonic mean of precision and recall. Precision is the relative number
of correctly predicted failures to the total number of predictions, and recall is the relative number of correctly predicted failures to the total number of failures. A definition and comprehensive analysis can be found in Chapter 8.2 of [5]. The HSMM prediction method involves a customizable threshold determining whether a failure warning is issued very easily (at a low level of confidence in the prediction) or only if it is rather sure that a failure is imminent, which affects the trade-off between precision and recall.2 In this paper we only report the maximum achievable F-measure.

2 In fact, either precision or recall can be increased to 100% at the cost of the other.

Applying the full chain of data preparation as described in Sections 3 to 5 yields a failure prediction F-measure of 0.66. A comparative study has shown that this result is significantly more accurate than the best-known error-based prediction approaches (see Chapter 9.9 of [5]). In order to determine the effect of clustering and filtering, we have conducted experiments based on ungrouped (unclustered) data as well as on unfiltered data. Unfortunately, experiments applying neither filtering nor grouping were not feasible. All experiments have been performed with the same HSMM setup (i.e., number of states, model topology, etc.). The results unveil that data preparation plays a significant role in achieving accurate failure predictions (see Table 2).

Method             Max. F-Measure   rel. Quality
Optimal results    0.66             100%
Without grouping   0.5097           77%
Without filtering  0.3601           55%

Table 2: Failure prediction accuracy expressed as maximum F-measure from data with full data preparation, without failure sequence grouping (clustering), and without noise filtering.

7 Conclusions

It is common perception today that logfiles, and in particular error logs, are a fruitful source of information both for analysis after failure and for proactive fault handling, which frequently builds on the anticipation of upcoming failures. However, in order to get (machine) access to the information contained in logs, the data needs to be put into shape, and valuable pieces of information need to be picked from the vast amount of data. This paper described the process we used to prepare error logs of a commercial telecommunication system for a hidden semi-Markov failure predictor.

The preparation process consists of three major steps and involved the following new concepts: (a) an algorithm to automatically assign integer error IDs to error messages, (b) a clustering algorithm for error sequences, and (c) a statistical filtering algorithm to reduce noise in the sequences. We conducted experiments to assess the effect of sequence clustering and noise filtering. The results unveiled that elaborate data preparation is a very important step to achieve good prediction accuracy.

In addition to failure prediction, the proposed techniques might also be helpful to speed up the process of diagnosis: for example, if root causes have been identified for each failure group in a reference data set, identification of the most similar reference sequence would allow a first assignment of potential root causes for a failure that has occurred during runtime.

References

[1] R. K. Iyer, L. T. Young, and V. Sridhar. Recognition of error symptoms in large systems. In Proceedings of the 1986 ACM Fall Joint Computer Conference, pages 797-806, Los Alamitos, CA, USA, 1986. IEEE Computer Society Press.

[2] Paul Horn. Autonomic computing: IBM's perspective on the state of information technology, Oct. 2001.

[3] Adam Oliner and Jon Stearley. What supercomputers say: A study of five system logs. In IEEE Proceedings of the International Conference on Dependable Systems and Networks (DSN'07), pages 575-584. IEEE Computer Society, 2007.

[4] David Bridgewater. Standardize messages with the Common Base Event model, 2004.

[5] Felix Salfner. Event-based Failure Prediction: An Extended Hidden Markov Model Approach. dissertation.de - Verlag im Internet GmbH, Berlin, Germany, 2008. (Available as PDF).

[6] Felix Salfner, Steffen Tschirpke, and Miroslaw Malek. Comprehensive logfiles for autonomic systems. In IEEE Proceedings of IPDPS, Workshop on Fault-Tolerant Parallel, Distributed and Network-Centric Systems (FTPDS), 2004.

[7] Alberto Apostolico and Zvi Galil. Pattern Matching Algorithms. Oxford University Press, 1997.

[8] R. Iyer and D. Rosetti. A statistical load dependency of CPU errors at SLAC. In IEEE Proceedings of the 12th International Symposium on Fault-Tolerant Computing (FTCS-12), 1982.

[9] M. M. Tsao and Daniel P. Siewiorek. Trend analysis on system error files. In Proc. 13th International Symposium on Fault-Tolerant Computing, pages 116-119, Milano, Italy, 1983.

[10] Song Fu and Cheng-Zhong Xu. Quantifying temporal and spatial fault event correlation for proactive failure management. In IEEE Proceedings of the Symposium on Reliable and Distributed Systems (SRDS 07), 2007.

[11] Padhraic Smyth. Clustering sequences with hidden Markov models. In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 648. The MIT Press, 1997.

[12] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data. John Wiley and Sons, New York, 1990.

[13] Rainer Schlittgen. Einführung in die Statistik: Analyse und Modellierung von Daten. Oldenbourg-Wissenschaftsverlag, München, Wien, 9th edition, 2000.