Auto Analysis ITSys Perf MGMT
Systems Performance Management
Nov 2011
This paper makes a case for auto analysis of production data to inform
infrastructure experts and their managers about what is going on in the
system, so that they can proactively plan performance management. This
does not mean that actions will be taken automatically; it simply means
that, whether working proactively or reactively, the production system
expert has a tool in hand to manage system performance more effectively.
A copy of the license is included in the section entitled "GNU Free Documentation License".
http://www.gnu.org/licenses/fdl.html
Auto Analysis for IT Systems Performance Management
November 2011
1. Introduction
This paper makes a case for auto analysis of production data to inform
infrastructure experts and their managers about what is going on in the system,
so that they can proactively plan performance management. This does not mean
that actions will be taken automatically; it simply means that, whether working
proactively or reactively, the production system expert has a tool in hand to
manage system performance more effectively. Nor do we claim that this is an
innovative approach or a proposal for a new tool. It is likely that most of what
we discuss in the paper can be implemented within the framework of any ESM
tool. We simply make the discussion and analysis techniques explicit, so that
auto analysis can be implemented for system performance management.
The rest of this paper is organized as follows. Section 2 provides the framework
for discussion by explaining what we mean by an IT system and what the
principal drivers for performance and capacity analysis are. Section 3
provides a set of commonly occurring events that need to be proactively tracked
for better system management. Section 4 provides simple rules for
automatically analyzing the impact of these events. Section 5 provides guidelines
for implementing auto analysis within any capacity management or performance
management tool.
(e.g. login or search) which in turn may be implemented through a set of web
transactions.
Figure 1
The next section discusses the types of events that are useful to monitor from a
system performance management perspective.
3. Useful Events to Monitor
Even though it may be an infrastructure expert who monitors the system, one has
to bear in mind that any event we classify as ‘useful’ must be related to the
overall business and its customers. Thus service levels, in terms of response
times and the ability to process a certain rate of transactions, are the most
important metrics to monitor. Problems with these service levels may in turn
show up as high server utilization when capacity is inadequate or when services
are implemented sub-optimally.
Table 1 lists causes and effects that are useful to capture in most IT
systems.
Table 1
No.  Cause                                       Effect
1    Increase in workload                        Higher response times, higher utilizations
2    Spike in workload                           Higher instantaneous response times, higher instantaneous utilizations
3    Change in workload mix                      Change in per class response times and per class utilizations
4    Sudden increase of infrequent transactions  Possible shoot up in response times and utilizations, shift in bottleneck
5    Release of new functionality                Possible increase in concurrency and thereby response times and utilizations
6    Application version upgrade                 Possible increase/decrease in service times, response times, utilizations, shift in bottleneck
7    Critical application component failure      Sudden drop in utilizations, potential increase in queue sizes
8    Network outage                              Sudden drop in workload and thereby response times and utilization
9    Network reconnect                           Surge in connections
10   Hardware upgrade                            Reduction in response times and utilizations, shift in bottleneck
11   Software version upgrade                    Possible change in response times and thereby utilizations
12   Ad hoc operations                           Increase in response times and utilizations
13   Database maintenance                        Possible change in response times and utilizations
Table 1 is an indicative list of cause and effect relationships. More can be
added along similar lines. The purpose of listing them is to derive the inverse
relationship, from each effect to its possible causes, which is useful for
auto analysis.
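The inverse relationship described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the paper: it covers only a subset of Table 1, and the short effect labels (for example "lower utilizations") are assumptions chosen for this sketch so that similar effects from different rows can be matched.

```python
from collections import defaultdict

# Subset of Table 1. The short effect labels are assumptions made for this
# sketch; the paper describes the effects in free text.
CAUSE_EFFECTS = {
    "Increase in workload": ["higher response times", "higher utilizations"],
    "Network outage": ["drop in workload", "lower response times", "lower utilizations"],
    "Critical application component failure": ["lower utilizations", "larger queue sizes"],
    "Hardware upgrade": ["lower response times", "lower utilizations", "shift in bottleneck"],
    "Ad hoc operations": ["higher response times", "higher utilizations"],
}


def invert(cause_effects):
    """Map each effect to the sorted list of causes that can produce it."""
    effect_causes = defaultdict(list)
    for cause, effects in cause_effects.items():
        for effect in effects:
            effect_causes[effect].append(cause)
    return {effect: sorted(causes) for effect, causes in effect_causes.items()}


EFFECT_CAUSES = invert(CAUSE_EFFECTS)

# Looking up an observed effect yields the candidate causes to investigate.
for cause in EFFECT_CAUSES["lower utilizations"]:
    print(cause)
```

A lookup on an observed effect then gives the expert a short list of candidate causes instead of the full table.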
4. Auto Analysis
Table 2
The idea behind auto analysis is that whenever an event is discovered in production,
either proactively or reactively (see Section 5), the tool should automatically present
the potential causes, as well as correlations, to the infrastructure expert, who can
then easily decide what could be causing the problem and what needs to be done to
rectify it.
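As one possible way to surface the correlations mentioned above, the sketch below computes the Pearson correlation between the metric in which an event was seen and a set of candidate metrics, and reports the strongly correlated ones. The metric names, sample values, and the 0.8 threshold are all hypothetical; the paper does not prescribe a particular correlation measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def correlated_metrics(event_series, candidates, threshold=0.8):
    """Return (name, r) pairs with |r| >= threshold, strongest first."""
    scored = [(name, pearson(event_series, series)) for name, series in candidates.items()]
    hits = [(name, r) for name, r in scored if abs(r) >= threshold]
    return sorted(hits, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples taken over the same five intervals.
response_time_ms = [120, 130, 180, 260, 400]
candidates = {
    "login_tps": [50, 55, 70, 90, 120],
    "db_cpu_util": [0.30, 0.31, 0.29, 0.30, 0.31],
}

report = correlated_metrics(response_time_ms, candidates)
for name, r in report:
    print(f"{name}: r = {r:.2f}")
```

In this made-up data the response time rises together with the login rate while database CPU stays flat, so only the login rate is reported. Correlation alone does not establish the cause; it narrows down what the expert looks at first.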
5. Implementation Strategy
We now focus on how auto analysis should be implemented within the framework
of a production system performance monitoring or capacity management tool. The
tool first needs to be able to monitor the events listed in Table 2. This
means that we need to be able to capture
service times, visit counts and demand per class to be able to determine
bottlenecks
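To illustrate how service times, visit counts and demand per class can point to a bottleneck, here is a minimal sketch based on the standard operational laws: the demand a class places on a resource is its visit count times its service time per visit, and a resource's utilization is the sum over classes of throughput times demand. The class names, resource names and all numbers are hypothetical, and a real tool would read them from monitoring data.

```python
def demands(visits, service_times):
    """Per-class demand at each resource: visit count x service time per visit."""
    return {
        cls: {res: visits[cls][res] * service_times[cls][res] for res in visits[cls]}
        for cls in visits
    }


def bottleneck(class_demands):
    """Resource with the largest demand for one class."""
    return max(class_demands, key=class_demands.get)


def utilizations(throughput, demand):
    """Utilization law: U_r = sum over classes c of X_c * D_{c,r}."""
    util = {}
    for cls, per_resource in demand.items():
        for res, d in per_resource.items():
            util[res] = util.get(res, 0.0) + throughput[cls] * d
    return util


# Hypothetical measurements: visits per transaction, seconds per visit.
visits = {
    "login": {"web_cpu": 3, "db_cpu": 1, "db_disk": 3},
    "search": {"web_cpu": 1, "db_cpu": 4, "db_disk": 10},
}
service_times = {
    "login": {"web_cpu": 0.005, "db_cpu": 0.010, "db_disk": 0.002},
    "search": {"web_cpu": 0.004, "db_cpu": 0.003, "db_disk": 0.004},
}
throughput = {"login": 10.0, "search": 15.0}  # transactions per second

demand = demands(visits, service_times)
per_class_bottleneck = {cls: bottleneck(d) for cls, d in demand.items()}
util = utilizations(throughput, demand)
busiest = max(util, key=util.get)
print(per_class_bottleneck, busiest)
```

Keeping demand per class, rather than only aggregate utilization, is what lets the tool notice that a change in workload mix moves the bottleneck from one resource to another.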
Once this data is at our disposal, we should proactively monitor the effects of
the following: