
Real-Time System Log Monitoring/Analytics Framework

Raghul Gunasekaran, Sarp Oral, David Dillow, Byung Park, Galen Shipman, Al Geist
Oak Ridge National Laboratory
{gunasekaranr,oralhs,dillowda,parkph,gshipman,gst}@ornl.gov

Abstract

Analyzing system logs provides useful insights for identifying system and application anomalies and helps in making better use of system resources. Nevertheless, it is simply not practical to scan through the raw log messages on a regular basis for large-scale systems. First, the sheer volume of unstructured log messages affects readability, and second, correlating the log messages to system events is a daunting task. These factors limit the use of large-scale system logs primarily to generating alerts on known system events and to post-mortem diagnosis for identifying previously unknown system events that impacted the system's performance. In this paper, we describe a log monitoring framework that enables prompt analysis of system events in real-time. Our web-based framework provides a summarized view of the console, netwatch, consumer, and apsched logs in real-time. The logs are parsed and processed to generate views by application, message type, individual or groups of compute nodes, and sections of the compute platform. Also, from past application runs we build a statistical profile of user/application characteristics with respect to known system events, recoverable/non-recoverable error messages, and resources utilized. The web-based tool is being developed for the Jaguar XT5 at the Oak Ridge Leadership Computing Facility.

1 Introduction

System logs, generally referred to as RAS (Reliability, Availability and Serviceability) logs, provide a wealth of information on the status of large scale computing systems. Often these system logs are large volumes of unstructured and redundant information, which affects their readability and easy interpretation. The Jaguar XT5 typically generates a few hundred thousand log messages per day. Moreover, interpreting these logs for diagnosing system or application anomalies requires an extensive understanding of the system state and environment. System administrators are tasked with the tedious and daunting effort of isolating problems and understanding system failures from logs. Though a number of commercial log analysis tools such as Splunk are being adopted at a huge cost, they simply serve as a sophisticated query interface. These factors limit the use of the logs on a regular basis; they are primarily used only for diagnosing previously unknown system events that cause serious performance degradation. This results in a number of silent errors going undetected, which can result in poor application performance.

System logs and metrics are time stamped values collected regularly over time and aggregated over the entire machine. In general the log messages can be parsed to identify the source generating the error, the error type, the error message, and the potential target (entity at fault) based on the type of error. Combining the RAS and scheduler logs allows us to associate errors with measurements to examine a specific application that was running during the time period.

This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725. Notice: This manuscript has been authored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

Figure 1: Log Monitoring/Analysis Framework

With an understanding of the compute infrastructure, we can further categorize the messages based on source/target types, physical location, and the mapping/dependence between the various entities. Associating these information points with each other, we can generate summarized views of log messages in a more precise and readable format. Furthermore, by observing multiple runs of applications over a longer period, we can build a profile for individual applications based on observed system events and metrics. In comparison to traditional tools, our approach to profiling does not provide fine-grained details of application behavior, but simply models the runtime characteristics of the application. Also, general profiling tools [6, 2, 1, 14] have a significant compute overhead (4-8%) and bandwidth requirements while capturing trace information, so they carry an unacceptable cost when used for continuous monitoring of applications. Our approach to profiling will help model typical application runs in terms of system events, which can be used for understanding the resource utilization of individual applications and for anomaly detection. We define an anomaly as a deviation from the expected behavior of an application. This does not necessarily imply that the application is at fault; the abnormal behavior of an application could indicate shared resource constraints or impact by other applications sharing the same platform. Our characterization and anomaly detection are indicators of poor or undesirable application performance and act as triggers for further investigation.

In summary, we propose a framework for processing and analyzing log messages and enabling summarized views of log messages via a web interface. First, we process raw log message streams by structuring the log messages and directing them into MySQL database tables. Second, we generate materialized views of log messages with respect to application, error type, physical location (cabinets, nodes), and source-destination pairing. Third, we enable querying and viewing log messages via a web interface in real-time, and also allow other users and tools to query the raw logs from the database. Finally, we use the system logs and other system metrics to profile the runtime characteristics of individual applications. Our implementation is based on the Oak Ridge Leadership Computing Facility (OLCF) Jaguar XT5 system logs and Spider center-wide file system [13] backend disk stats. The framework described in this paper extends our previous work described in [10].

2 Monitoring Framework

To enhance the readability and the ability to analyze logs in real-time, we are developing a log monitoring framework for the Jaguar XT5, illustrated in Figure 1. Organizing the log messages is a two step process: first, pre-processing, where the raw logs are parsed and stored in database tables; second, the log entries are grouped by associating messages within a time window to provide a concise summary of the log messages, referred to as materialized views. The monitoring framework is being developed on a dedicated server hosting a MySQL database and Apache webserver.
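To make the pre-processing step concrete, the sketch below shows the kind of table one log stream could be parsed into, carrying the fields detailed in Section 2.1. The table name, column names, connection parameters, and the use of the mysql-connector-python driver are illustrative assumptions, not the framework's actual schema.

    # Hypothetical schema for one parsed RAS log stream (here, the console log).
    # The column set follows the fields described in Section 2.1: timestamp,
    # source, target, error type, error message, plus the appID and type fields
    # filled in at insert time. Names are illustrative, not the actual schema.
    import mysql.connector

    CONSOLE_LOG_TABLE = """
    CREATE TABLE IF NOT EXISTS console_log (
        id           BIGINT AUTO_INCREMENT PRIMARY KEY,
        ts           DATETIME NOT NULL,
        source       VARCHAR(32) NOT NULL,  -- node generating the message
        target       VARCHAR(32),           -- entity being complained about
        error_type   VARCHAR(64) NOT NULL,
        error_msg    TEXT,                  -- verbose message kept for detailed views
        appid        BIGINT,                -- filled in from the apsched log
        source_type  VARCHAR(16),           -- cnode, ionode, svc, rtr, oss, ost, ...
        target_type  VARCHAR(16),
        INDEX (ts), INDEX (appid)
    )
    """

    def create_tables(conn):
        cur = conn.cursor()
        cur.execute(CONSOLE_LOG_TABLE)
        conn.commit()

    if __name__ == "__main__":
        # Connection details are placeholders for whatever the server uses.
        conn = mysql.connector.connect(host="localhost", user="logmon",
                                       password="secret", database="raslogs")
        create_tables(conn)

One such table would be kept per log stream, as described in the following section.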

Table 1: A few examples of RAS and apsched log entries

Netwatch Log
Timestamp         Node ID       Port   Remote Node   Port   Errs   Error Type
100130 11:52:39 c4-6c2s0s0 5 c4-4c0s0s3 0 1 uPacket Squash
100130 11:52:39 c7-5c1s0s0 5 c7-7c1s0s3 0 3 uPacket Squash
100130 11:52:39 c23-2c1s0s0 6 * * Deadlock Timeout, Req Chan

Console Log
[2010-06-11 00:24:21][c14-2c0s4n0]beer: cpu id 9: nid 9812, cpu 0 has been unresponsive for 240 seconds
[2010-06-11 00:24:21][c14-2c0s4n0]LustreError: 773:0:(ptllnd tx.c:469:kptllnd tx callback()) Portals error
to 12345-9812@ptl1: PTL EVENT SEND END(9) tx=ffff8103f99bcd80 fail=PTL NAL FAILED(4)
unlinked=0 1276230767 ref 2 fl Rpc:N/0/0 rc 0/0
[2010-06-11 00:24:21][c14-2c0s4n0]Lustre: widow1-OST0015-osc-ffff8103f9e8d000: Connection to service
widow1-OST0015 via nid 10.36.227.118@o2ib was lost; in progress operations
this service will wait for recovery to complete.
[2010-06-11 00:24:23][c16-2c0s6n1]HARDWARE ERROR
[2010-06-11 00:24:23][c16-2c0s6n1]CPU 0: Machine Check Exception: 0 Bank 4: dc04400040080813
[2010-06-11 00:24:23][c16-2c0s6n1]TSC 4b9ec89ff4f6 ADDR 8e497800 MISC c0090fff01000000

Apsched Log
01:20:05: Confirmed apid 4325785 resId 349 pagg 0 nids: 12706,12710,13446,13506,13510,13696,13700,...
01:20:06: Bound apid 0 resId 349 pagg 16574 batch 418808
01:20:06: Placed apid 4325786 resId 349 pagg 16574 uid 63137 cmd jobcleanup nids: 12706,12710,...
02:14:23: Released apid 4325786 resId 349 pagg 16574 claim
02:14:23: Canceled apid 4325785 resId 349 pagg 16574

The log streams are directed to the server via syslog-ng from the Cray SMW. The file system metrics are queried periodically from the back-end storage system, described in [11].

2.1 Log pre-processing

The Jaguar XT5 system status is monitored via four log streams: syslog, console, netwatch, and consumer logs. For our analysis we use only the last three log streams, which are generally referred to as the RAS logs; the syslog is not used in our analysis. We also study the apsched log from the ALPS (Application Level Placement Scheduler) subsystem, the Cray supported system for placing and launching applications on the XT nodes.

To improve readability without loss of information, we parse the log streams and store them in MySQL database tables; each log stream is stored in a separate database table. The RAS log entries are time stamped values generated by a specific node; a few sample log entries are shown in Table 1. From these log messages we can parse out the timestamp, source, target, and error type, which uniquely identify every log entry. Source refers to the node generating the error and target refers to the entity (node, router, OSS) the source is complaining about. The verbose description of the error in the log message is also stored in the tables as the error message, supporting the more detailed views described in the next subsection. The netwatch and consumer logs have a more definitive structure. For example, for the netwatch log shown in Table 1, all messages have the same structure with different error types. However, the console log messages are more verbose and unstructured, reporting on Lustre (file system) issues, BEER, machine check exceptions, and out of memory conditions, to name a few. Identifying the target information from the console logs is not straightforward, as the target information is embedded in the error message, and the structure of the message varies based on the error type. This gets more complicated for Lustre error messages, as extracting the target information depends on the specific error message.

For handling Lustre error messages, we parsed the CDEBUG messages from the source code categorized as D_EMERG and D_ERROR. We then defined regular expressions for each individual message, identifying the various components of information [10]. All the regular expressions are stored in a text file and are accessed by the parser. Each log stream has a separate parser, written in Python, which uses regular expressions to match log messages, parses the log, and writes to the database tables.
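As a sketch of this per-stream parsing (not the authors' actual parser code), the fragment below matches netwatch entries like those in Table 1 with a single regular expression and writes the extracted fields to the database. The regular expression, table name, and column names are assumptions for illustration; the real framework loads its expressions from an external text file.

    # Illustrative netwatch-log parser: regex-match a line, pull out the fields,
    # and insert one row per message.
    import re
    from datetime import datetime

    # Example line (Table 1):
    # 100130 11:52:39 c4-6c2s0s0 5 c4-4c0s0s3 0 1 uPacket Squash
    NETWATCH_RE = re.compile(
        r"^(?P<date>\d{6})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
        r"(?P<source>\S+)\s+(?P<sport>\d+)\s+"
        r"(?P<target>\S+)\s+(?P<tport>\S+)\s+"
        r"(?P<count>\d+)\s+(?P<error_type>.+)$"
    )

    def parse_netwatch_line(line):
        """Return a dict of parsed fields, or None if the line does not match."""
        m = NETWATCH_RE.match(line.strip())
        if m is None:
            return None
        ts = datetime.strptime(m.group("date") + " " + m.group("time"),
                               "%y%m%d %H:%M:%S")
        return {
            "ts": ts,
            "source": m.group("source"),
            "target": m.group("target"),
            "error_type": m.group("error_type").strip(),
            "count": int(m.group("count")),
        }

    def store(cursor, rec):
        # Table and column names are assumptions; see the schema sketch above.
        cursor.execute(
            "INSERT INTO netwatch_log (ts, source, target, error_type, err_count) "
            "VALUES (%s, %s, %s, %s, %s)",
            (rec["ts"], rec["source"], rec["target"],
             rec["error_type"], rec["count"]),
        )

A console-log parser would follow the same pattern, with one expression per message class derived from the Lustre CDEBUG definitions described above.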
The apsched log records the allocation of nodes for a specific user job. The log has a well defined structure, as documented in [7] and shown in Table 1. Log entries with the keywords Confirmed and Bound record the allocation of compute nodes to a specific user job, identified by an application ID (apid), reservation ID (resId), and session ID (pagg), which together uniquely identify an allocation. A log entry with the keyword Canceled marks the end of the job. Entries with the keywords Placed and Released mark the usage of nodes for appruns initiated by the user within the job, each identified by a new apid and the same reservation and session ID as the job. For mapping a node to the scheduled application we simply use the apid associated with the individual apprun.

Each raw RAS log entry, as described above, is parsed to identify the timestamp, source, target, error type and error message, where each item is a table column. To facilitate further analysis of the logs, the following fields (columns in the table) are also inserted into the tables in real-time.

- appID: from the apsched log we can associate every node with an application ID, or appID. This information is added to every RAS log entry, identifying the application the source node is running.

- sourceType, targetType: we identify every source or target entity as a compute node, I/O node, service node, router, OST, OSS, MDT or MDS.

- sourceID, targetID: in general the source node is labeled using a CID (example c14-2c4s5n3), which identifies the node's physical location in the cabinet. The target information in the logs is represented as a CID, IP address, NID or hostname. A NID is an integer value identifying a node on the compute platform. To facilitate analysis we maintain a uniform identification schema, where all nodes on the compute platform are identified by NIDs and all other entities are prefixed by their type, such as ost, mdt, oss, mds or ib, followed by an integer value, for example ost432.

Apart from the appID-node mapping, the mappings described above are fixed and specific to the OLCF compute infrastructure. The mappings are encoded as a configuration file and made available to the parser.

Table 2: Aggregated Log Messages

Console Log
  Time: 05:10:00 - 05:20:00
  Applications: 49
  Source: 3427        SourceType: cnode
  Target: 1           TargetType: rtr
  Target Node: 6311
  Lustre PTL errors: 10194

Netwatch Log
  Time: 12:34:00 - 12:44:00
  Applications: 17
  Source: 1183        SourceType: cnode
  Target: 1136        TargetType: cnode
  Column: 1,2,3,4,5.24
  Row: 5
  upacket squash: 12115
  Deadlock Timeout: 3542

Application View
  Appid: 26457
  Starttime: 18:34:00
  Nodes: 2000
  Source: 347         SourceType: cnode
  Target: 34          TargetType: cnode, rtr
  BEER errors: 234
  MCE: 37
  Lustre PTL errors: 592

2.2 Log views

Having organized the data in a structured and queryable format, we are still overwhelmed with the volume of information. Correlating these messages to the system's state and environment, we are able to generate concise summaries of the log. By system state, we mean details of node allocations for applications and the condition of system components (active/failed). System environment refers to the properties and dependencies of various hardware components, such as the mapping between OSTs, OSSes and RAID controllers, and the individual node types (compute, I/O, service) on the compute platform.
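Mappings of this kind, together with the uniform identification schema from Section 2.1, can be captured in a small helper. The sketch below assumes a simple whitespace-separated CID-to-NID configuration file and hypothetical function names; the actual configuration format used by the framework is not shown here.

    # Normalize the various source/target labels (CID, NID, prefixed entity
    # names) to the uniform schema: compute nodes become integer NIDs, other
    # entities keep a type prefix such as ost432 or oss12.
    import re

    CID_RE = re.compile(r"^c\d+-\d+c\d+s\d+n\d+$")       # e.g. c14-2c4s5n3
    ENTITY_RE = re.compile(r"^(ost|mdt|oss|mds|ib)\d+$")  # e.g. ost432

    def load_cid_map(path):
        """Read 'cid nid' pairs from a configuration file (format assumed)."""
        cid_to_nid = {}
        with open(path) as f:
            for line in f:
                cid, nid = line.split()
                cid_to_nid[cid] = int(nid)
        return cid_to_nid

    def normalize(label, cid_to_nid):
        if label.isdigit():              # already a NID
            return int(label)
        if CID_RE.match(label):          # physical location, map to NID
            return cid_to_nid[label]
        if ENTITY_RE.match(label):       # ost/mdt/oss/mds/ib plus an integer
            return label
        return label                     # IP address, hostname, '*', ...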

To generate materialized views, which are precise overviews of multiple lines of log entries, we cluster the log messages in two different ways. First, we cluster log messages over the runtime of an application, and second, we cluster log messages in fixed time windows. The first approach helps view events occurring on the nodes running the same application, and the second approach helps identify errors that are related to specific hardware and not to any particular application.

Table 2 shows examples of summarized views of log messages. The console log view shows 3427 distinct source nodes complaining about a single router node, generating 10194 portal error messages within a ten minute period; a total of 49 applications are affected. In the second example, the grouping of the netwatch log, we see heavy network congestion on cabinets in row five, which implies congestion along a specific axis in a segment of the 3D torus. This congestion affects 17 different applications. Finally, when clustering with respect to application, we see the errors generated by an application running on 2000 nodes since its start time. This application generated Lustre portal errors, BEER errors, and MCE (machine check exception) errors.

The clustering approach was implemented within the MySQL database with the help of SQL events and stored procedures. To enable materialized views we create additional tables in the database that store aggregate counts of errors, both per application and per log stream. For aggregation based on an application, a table (app_log_view), indexed on timestamp and appid, maintains a list of sources and error types and a count of the errors reported for each type. An SQL event is timed to execute an SQL procedure periodically at one minute intervals. The procedure aggregates the data from all the RAS log tables generated in the last minute and updates the app_log_view table. Similarly, for the RAS logs we maintain individual aggregation tables, which are also updated at one minute granularity. We also aggregate Lustre messages from the console log in a separate table. As described in the earlier section, we have extensively defined Lustre error messages as regular expressions, which help present more detailed views of the logs. This helps us identify the module within Lustre, such as lnet, ptl, llite, ldlm, obd, or mdc, that is causing the error. Though the aggregation within the database is limited to one minute intervals, this organization of the data allows querying at lower granularity, say a ten minute interval, through simple SQL queries, which execute in a few tens of milliseconds.
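A minimal sketch of this timed aggregation is shown below, assuming hypothetical table, column, and procedure names (console_log, app_log_view, aggregate_last_minute) and using mysql-connector-python only to install the SQL event and procedure. It illustrates the approach rather than reproducing the framework's actual definitions.

    # Illustrative per-application aggregation driven by a MySQL event.
    # Requires the MySQL event scheduler to be enabled (event_scheduler=ON).
    import mysql.connector

    AGG_PROCEDURE = """
    CREATE PROCEDURE aggregate_last_minute()
    BEGIN
        INSERT INTO app_log_view (ts_min, appid, source, error_type, err_count)
        SELECT DATE_FORMAT(ts, '%Y-%m-%d %H:%i:00') AS ts_min,
               appid, source, error_type, COUNT(*) AS err_count
        FROM console_log
        WHERE ts >= NOW() - INTERVAL 1 MINUTE
        GROUP BY ts_min, appid, source, error_type;
    END
    """

    AGG_EVENT = """
    CREATE EVENT IF NOT EXISTS aggregate_ras_logs
    ON SCHEDULE EVERY 1 MINUTE
    DO CALL aggregate_last_minute()
    """

    conn = mysql.connector.connect(host="localhost", user="logmon",
                                   password="secret", database="raslogs")
    cur = conn.cursor()
    cur.execute("DROP PROCEDURE IF EXISTS aggregate_last_minute")
    cur.execute(AGG_PROCEDURE)   # a production version would also track the
    cur.execute(AGG_EVENT)       # last processed timestamp and upsert counts
    conn.commit()

The same pattern, one procedure per log stream, covers the per-stream aggregation tables mentioned above.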
2.3 Other System Metrics

Apart from the system logs, we collect file system usage metrics from the back-end disk storage system. The Spider [13] file system has 96 DDN RAID controllers, which are accessed by the compute nodes via the Object Storage Servers. We monitor the read/write bandwidth, IOPS usage, tier delay, and request size distribution from all the RAID controllers. This information is used to actively monitor the file system usage in real-time. We use this information to model the file system usage of individual applications, detailed in the next section.

Organizing the raw data and correlating them is done within the database, which can be queried by external users. However, the benefit of this work is in presenting such correlated information via a web interface in real-time and in a usable format for system administrators. We have a basic web interface to view the logs and metrics in real-time, and we are currently working on presenting this information in a more interactive manner.

3 Profiling Applications

Profiling helps understand the functional and resource utilization characteristics of individual applications. In general, profiling is done using fine-grained performance tools and is undertaken by application developers in collaboration with performance experts for optimization purposes. However, the runtime behavior of an application in a shared resource environment is very different from that observed in controlled test environments. Also, it is important to profile individual user-application behavior rather than grouping users by a common scientific application. Users, based on their scientific needs, define various parameters at runtime, such as the number of compute nodes, the dimensions of the compute grid, and the I/O usage (frequency of checkpointing), all of which determine the true runtime characteristics of an application. In our observation of the jobs submitted to the Jaguar XT5, every user has one or more fixed job allocation models. In this section we present a few preliminary findings on profiling user applications based on the netwatch log and I/O usage.

The analysis presented in this section is based on four months of data collected between June and September of 2010. Our approach of profiling applications based on logs and system metrics can help identify an application's resource needs and provision system resources accordingly.

3.1 Network Utilization

The netwatch log reports on the link status of the SeaStar 3D torus interconnect. A few sample log entries are shown in Table 1. The first field is the timestamp of the log entry, followed by the source and remote node IDs and port numbers. Every SeaStar router has 6 ports (numbered 0 to 5) and is connected to six neighbor nodes. The last two fields are the number of packet squashes and the error type, read as micro-packet squashes. The packet squash error indicates the number of retransmissions required on a specific link, which is sampled every minute independently on each SeaStar router. Log messages are generated when a nonzero number of packet squashes is detected in the sample interval. The packet squash error, being a data link layer message, can indicate data loss, data corruption, or simply a bad link (hardware problem). Hardware problems were ruled out in our analysis: they occur irrespective of the application running, and links reporting packet squash errors over long time periods are replaced during scheduled maintenance. The other types of error messages recorded in the netwatch log are deadlock timeouts and buffer overruns. For our preliminary analysis we only analyze the packet squash error messages.
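As an illustration of how such statistics can be derived from the structured tables, the sketch below computes, for one application run, the packet squash message rate and the fraction of allocated nodes reporting errors. It assumes the netwatch table carries the appid and err_count columns added during pre-processing; the names are again illustrative.

    # Per-run packet squash statistics for one application.
    # cursor is a mysql-connector cursor; start_ts/end_ts are datetimes taken
    # from the apsched allocation, nodes_allocated from the same record.
    def squash_stats(cursor, appid, start_ts, end_ts, nodes_allocated):
        cursor.execute(
            "SELECT COUNT(*), COUNT(DISTINCT source) "
            "FROM netwatch_log "
            "WHERE appid = %s AND ts BETWEEN %s AND %s "
            "AND error_type LIKE %s",
            (appid, start_ts, end_ts, "%uPacket Squash%"),
        )
        n_msgs, n_sources = cursor.fetchone()
        minutes = max((end_ts - start_ts).total_seconds() / 60.0, 1.0)
        return {
            "msgs_per_min": n_msgs / minutes,
            "pct_nodes_reporting": 100.0 * n_sources / nodes_allocated,
        }

Aggregated over many runs of the same user application, this yields ranges like those reported in Table 3.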
We found a strong correlation between packet squash errors and specific applications, as shown in Table 3. This table shows the range of nodes reporting errors and the range of the number of error messages whenever three particular applications were running. From prior knowledge of these applications, App-1 is I/O intensive, App-3 is MPI intensive (interprocess communication), and App-4 uses Global Arrays. We found that applications that are I/O intensive and non-MPI intensive generate the fewest packet squash messages. Large numbers of packet squash errors are reported when an MPI-intensive job is running; an even larger number of errors is reported when a job utilizes Global Arrays with MPI. Our interpretations are based on observing multiple runs of the same user application at different times, even though these runs occurred along with different mixtures of other running applications during the four month observation period. The results match the properties of the applications; the interesting finding is that we were able to deduce such application behavior from the netwatch error log. Having established a correlation between packet squashes and link utilization, we are working towards quantifying the relation between the observed packet squash error rate and the actual network utilization by an application in GB/s. With knowledge of such application characteristics we will be able to make better scheduling decisions.

Table 3: Netwatch Log Stats

Applications                    App-1     App-3     App-4
# of Compute Nodes              2k        3-5k      5k
% of Nodes Reporting Error      0-0.7%    8-10%     15-18%
Error Rate (msg/min)            <2        8-12      15-19

3.2 I/O Usage

Understanding the I/O demands of applications is critical for provisioning storage system resources. A specific scientific application can define a generic I/O demand; however, the actual I/O utilization is specific to the user running the application. In general, the observed pattern of I/O behavior is a read at the beginning of a run and a write at the end of a run and, for long running jobs, checkpointing at intermediate points. The frequency of checkpointing usually drives the I/O demand of an application. The utilization will also vary depending on how the user performs parallel I/O, e.g., whether or not files are shared between processes. Projecting individual users' peak bandwidth (or I/O operations per second) and frequency of file system usage will help provision system resources more efficiently.

Figure 2: Plot of file system usage observed on a given day.

In Figure 2, we present a typical usage of the Spider file system in a day. The write bandwidth usage is sampled from every RAID controller using the manufacturer provided API, and the aggregate value across all controllers gives the total file system usage. From the scheduler's log, our application of interest (App-1) was running during the following time periods: 0:25 to 2:42 hrs, 9:30 to 9:56 hrs, 14:53 to 17:31 hrs, and 19:45 to 23:18 hrs. Observing the write bandwidth usage during the above mentioned time periods, we see a clear pattern of high write bandwidth by App-1. Though there might be other applications running concurrently and taxing the file system resources, it is possible to get an estimate of the I/O usage of the application of interest by observing multiple runs of the same application. The bandwidth metrics captured at the controllers are not per application, as that would require extensive application trace information, adding considerable overhead. It is worth mentioning that we observed an ongoing constant 5 GB/s write activity on the back-end disks at all times, which is taken into consideration in our framework as background noise.

Figure 3: CDF of write bandwidth usage derived from multiple runs of the application (App-1).

Using the time series data, we plot the Cumulative Distribution Function (CDF) of the write bandwidth usage by the application (App-1), as shown in Figure 3. From the plot we can infer that for more than 20% of the total application runtime the user is writing at a rate greater than 32 GB/s, with peaks around 42 GB/s. The CDF plot provides us with an estimate of the user application's I/O needs. In our study of a few other applications, a similar pattern was observable with two other applications, as shown in Table 4. This table summarizes the I/O usage for three applications, with the peak write bandwidth observed for each application and the percentage of the total runtime the application operates at more than 80% and 50% of the peak write bandwidth. The application App-1 is a short duration routine that generally runs for a few tens of minutes. However, for longer running jobs, say 3 hours or more, it is difficult to capture such I/O behavior by directly observing the file system usage. In general, for long running applications, users do frequent checkpointing, which is one of the most I/O consuming tasks. An autocorrelation over the runtime stats of the application will give us the periodic I/O usage pattern, or checkpointing pattern, of the application. This is evident in Figure 2, for the time period of 10:00 to 14:30 hrs, where two separate activities with two distinct frequencies and two different amplitudes can be observed.

Table 4: Applications I/O Usage

Applications              App-1      App-2     App-3
Observed Peak (GB/s)      38-42      12-15     22-35
Runtime >80% of peak      18-20%     5-6%      4-5%
Runtime >50% of peak      38-42%     12-16%    20-25%
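The sketch below illustrates the procedure just described: it builds an empirical CDF of the aggregate write bandwidth restricted to the scheduler-reported run windows, after subtracting the constant background write activity. The 5 GB/s background figure comes from the text; the function names and data layout are assumptions.

    # samples: iterable of (datetime, aggregate write GB/s) across controllers.
    # run_windows: list of (start, end) datetime pairs taken from the scheduler log.
    BACKGROUND_GBPS = 5.0  # constant background write activity noted above

    def app_write_cdf(samples, run_windows):
        """Return sorted (bandwidth, cumulative fraction) pairs for the app."""
        in_window = sorted(
            max(bw - BACKGROUND_GBPS, 0.0)
            for ts, bw in samples
            if any(start <= ts <= end for start, end in run_windows)
        )
        n = len(in_window)
        return [(bw, (i + 1) / n) for i, bw in enumerate(in_window)] if n else []

    def fraction_above(cdf, threshold_gbps):
        """Fraction of the application's sampled runtime spent above a write rate."""
        if not cdf:
            return 0.0
        below = sum(1 for bw, _ in cdf if bw <= threshold_gbps)
        return 1.0 - below / len(cdf)

For App-1, fraction_above(cdf, 32.0) would correspond to the roughly 20% of runtime spent writing above 32 GB/s noted above.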

4 Related Work

Conventionally, log messages have been associated with temporal data mining techniques, where the frequency and sequence of events are of interest. Recent studies on HPC logs have focused on machine learning and statistical methods for analyzing and detecting system failures. In the machine-learning paradigm described in [15], the log message structures are parsed from the source code and a feature vector is constructed as a sequence of log messages. Using principal component analysis, deviations of the run-time log from the predefined vectors are identified, and any variance is defined as an anomaly. SLCT (Simple Logfile Clustering Tool [12]) is a data-clustering paradigm for mining event patterns in log data. An apriori algorithm, its first stage is to generate a count of all unique words in the log, identify log messages containing words above a threshold value, and then cluster those log lines. This method is based on the assumption that events of interest occur in bursts, and it ignores errors with low frequency.

Nodeinfo [4], an entropy based anomaly detection system, tags every log message to quantify its importance. Then, the entropy of every node in the system is quantified based on the number of occurrences of alerts within a given time period. It is presumed that all nodes operate similarly and that the entropy of every node should be the same; any variance of entropy is categorized as an alert/anomaly. Similarly, the models proposed in [5] and [3] group log messages and use predictive techniques under the assumption that the logs carry all event information and occur in bursts. Recent work [9, 8] has suggested using system logs to understand component level interaction in large scale systems and to model the system state leveraging machine learning techniques. First, log events or anomalies are correlated to specific hardware, and then the successive events are mapped to other components in the system that are affected by the specific event. This helps understand how system components are interdependent and the cascading effect of system events.

In the reviewed papers, one of the principal assumptions in analyzing system logs is that the system supports logging of all events, which may be impractical for large systems like Jaguar. In a peta-scale system, the logs in general capture only failure events, which in itself generates a few gigabytes of data per day. In general, the above reviewed papers leverage machine learning techniques for finding trends or abnormalities that occur multiple times over a period of time. However, it is of interest to identify such abnormalities on their first occurrence. Our approach of profiling characterizes normal behavior, which can then be used for capturing abnormal activities. Our approach towards using logs for application characterization comes with a profound understanding of the system architecture and applications. This helps in correlating events to errors and in understanding the impact of such errors on the system and the application's performance.

5 Conclusion

As systems have grown to peta-scale, the debug levels of logs have decreased (less verbose), while the volume of logs generated has increased. This poses a challenge in terms of readability and the valuable interpretations that can be made from the logs. Apart from the need to enhance the readability of the logs, our approach is focused on using structured log data for building and supporting analytical tools that can enhance our interpretations of the log. Currently, we are working on presenting this information via a web interface in a more interactive manner. Also, we are extending our work by building a profiling tool to model the runtime characteristics of an individual application, which can help in anomaly detection, identify inter-job interference, and lead to the design of context-aware schedulers. Profiling an application gives us the expected or acceptable behavior of an application, providing the capability to identify anomalous behavior, which is a deviation from the expected behavior. Also, such deviations can help us identify applications that tend to be impacted or have poor performance because of other applications running on the shared compute platform. Identifying such inter-job interference can help us make well-informed scheduling decisions, increasing the overall throughput of the system.

References

[1] TotalView. http://www.roguewave.com/support/product-documentation/index.html.aspx.
[2] Using Cray performance analysis tools. Document S-2474-51, Cray User Documents (http://docs.cray.com), 2009.

[3] A. Makanju, A. N. Zincir-Heywood, and E. E. Milios. Clustering event logs using iterative partitioning. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.

[4] J. Oliner, A. Aiken, and J. Stearley. Alert detection in system logs. In Proceedings of the International Conference on Data Mining (ICDM), 2008.

[5] Y. Liang, Y. Zhang, H. Xiong, and R. Sahoo. Failure prediction in IBM BlueGene/L event logs. In Proceedings of the International Conference on Data Mining (ICDM), 2007.

[6] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 22(6):685-701, 2010.

[7] Hwa-Chun Wendy Lin. Understanding aprun use patterns. In Cray User Group Conference, 2009.

[8] J. Oliner and A. Aiken. A query language for understanding component interactions in production systems. In Proceedings of the International Conference on Supercomputing (ICS), 2010.

[9] J. Oliner and A. Aiken. Online detection of multi-component interactions in production systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), 2011.

[10] R. Gunasekaran, D. Dillow, B. Park, G. Shipman, D. Maxwell, J. Hill, and A. Geist. Correlating log messages for system diagnostics. In Cray User Group Conference, 2010.

[11] R. Miller, J. Hill, D. Dillow, R. Gunasekaran, and D. Maxwell. Monitoring tools for large scale systems. In Cray User Group Conference, 2010.

[12] Risto Vaarandi. A data clustering algorithm for mining patterns from event logs. In IEEE IPOM'03 Proceedings, 2003.

[13] G. M. Shipman, D. A. Dillow, S. Oral, and F. Wang. The Spider center wide file system: from concept to reality. In Cray User Group Conference, 2009.

[14] D. Terpstra, H. Jagode, H. You, and J. Dongarra. Collecting performance data with PAPI-C. In 3rd Parallel Tools Workshop, 2009.

[15] W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan. Mining console logs for large-scale system problem detection. In 3rd Workshop on Tackling System Problems with Machine Learning Techniques (SysML), 2008.
