
Auto Analysis for IT Systems Performance Management

Nov 2011

This paper makes a case for auto analysis of production data to inform
infrastructure experts and their managers about what is going on in the
system, so that they can proactively plan for performance management.
This does not mean that actions will be taken automatically; it simply
means that, whether working proactively or reactively, the production
system expert has a tool in hand to manage system performance more
effectively.

Copyright (C) 2011 Rajesh Mansharamani


Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or any later version
published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts,
and no Back-Cover Texts.

A copy of the license is included in the section entitled "GNU Free Documentation License".

http://www.gnu.org/licenses/fdl.html
Auto Analysis for IT Systems Performance Management


November 2011

1. Introduction

IT systems performance is managed in production in one of two ways. Either
one or more infrastructure experts monitor the system and take actions to
ensure smooth functioning of the system, or an Enterprise System Management
(ESM) tool (such as IBM Tivoli or HP OpenView) is used to generate alerts and
infrastructure experts act upon the alerts. In either case the infrastructure
expert has to diagnose the root cause of the problem by means of various kinds
of analysis and then rectify it. More often than not problems are resolved
reactively, that is, after they have occurred, rather than being prevented from
occurring in the first place. Given the speed at which a production system needs
to be brought back on track, capacity addition is usually the easiest solution
and also the most expensive one, with no guarantee of how much benefit it will
bring.

This paper makes a case for auto analysis of production data to inform
infrastructure experts and their managers about what is going on in the system,
so that they can proactively plan for performance management. This does not
mean that actions will be taken automatically; it simply means that, whether
working proactively or reactively, the production system expert has a tool in
hand to manage system performance more effectively. Nor do we claim that this
is an innovative approach or a proposal for a new tool. It is likely that most
of what we discuss in the paper can be implemented within the framework of any
ESM tool. We simply make the discussion and analysis techniques explicit so
that auto analysis can be implemented for system performance management.

The rest of this paper is organized as follows. Section 2 provides the framework
for discussion by explaining what we mean by an IT system and what the principal
drivers of performance and capacity analysis are. Section 3 provides a set of
commonly occurring events that need to be proactively tracked for better system
management. Section 4 provides simple rules for automatically analyzing the
impact of the events. Section 5 provides guidelines for implementing auto
analysis within any capacity management or performance management tool.

2. IT System and Performance Drivers

Figure 1 depicts an IT system comprising a set of N users who access services,
over a wide area network, that may be hosted across one or more data centres.
The services are implemented within a set of IT applications that are hosted
across one or more servers, which have access to one or more storage devices
(either internal or external to the server). While our discussion implicitly
assumes online transaction processing (OLTP) of business transactions, it can
easily be extended to batch system analysis as well.

We assume that users submit business transactions to the system. A business
transaction (e.g. order entry) is composed of a set of elementary transactions
(e.g. login or search) which in turn may be implemented through a set of web
transactions.

Figure 1

When we talk about system performance we mean response times and
throughputs of business transactions. To meet service levels across these
metrics we need to invest in capacity of servers (processors, memory), storage,
and network, and monitor how much of that capacity is being consumed.

System performance may be impacted in one of three ways:

1) By the workload on the system
2) By the applications used to implement business services
3) By the underlying infrastructure (hardware and software technology)

Changes to any of these elements can cause changes to system performance.
Therefore, for a holistic approach to system performance management, we
should try to monitor all three elements in as much detail as possible.

Monitoring of the workload can be achieved by analyzing the web
server logs. Monitoring of applications can be achieved by means of profiling
tools as well as operating system monitors that operate at a process level.
Monitoring of infrastructure is usually the easiest given the number of
monitoring tools for CPU, memory, disk, and network available in most IT
systems. In addition to these it is useful to maintain logs of system activity,
especially of changes in the system such as introduction of a new service,
release of a new application, and addition of capacity.
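
As a minimal sketch of workload monitoring from web server logs, the following
Python fragment counts requests per minute. It assumes that the logs carry
timestamps in the Common Log Format; the function name `requests_per_minute`
and the sample log lines are hypothetical and are not taken from any particular
system.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: Common Log Format timestamps, e.g. [10/Nov/2011:13:55:36 +0530].
# The capture group stops at the minute so that requests bucket per minute.
_TIMESTAMP = re.compile(
    r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]"
)

def requests_per_minute(lines):
    """Return {minute (datetime): request count} for the given log lines."""
    counts = Counter()
    for line in lines:
        match = _TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines that do not match the assumed format
        minute = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M")
        counts[minute] += 1
    return dict(sorted(counts.items()))

# Hypothetical sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [10/Nov/2011:13:55:36 +0530] "GET /login HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Nov/2011:13:55:59 +0530] "GET /search HTTP/1.1" 200 2048',
    '10.0.0.1 - - [10/Nov/2011:13:56:02 +0530] "POST /order HTTP/1.1" 200 128',
]
workload = requests_per_minute(sample)
```

Keying the counter on the pair (minute, request path) instead of the minute
alone would give a per class view of the workload along the same lines.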

The next section discusses the types of events that are useful to monitor from a
system performance management perspective.

3. Useful Events to Monitor

Even though it may be an infrastructure expert monitoring the system, one has
to bear in mind that any event that we classify as 'useful' must be related to
the overall business and its customers. Thus service levels, in terms of
response times and the ability to process a certain rate of transactions, are
the most important metrics to monitor. Service level problems may in turn
manifest as high utilization of servers if capacity is inadequate or if
services are implemented suboptimally.

Table 1 provides a list of causes and effects that are useful to capture in
most IT systems.

Table 1

#   Cause                                       Effect
--  ------------------------------------------  ------------------------------------------
1   Increase in workload                        Higher response times, higher utilizations
2   Spike in workload                           Higher instantaneous response times, higher instantaneous utilizations
3   Change in workload mix                      Change in per class response times and per class utilizations
4   Sudden increase of infrequent transactions  Possible shoot up in response times and utilizations, shift in bottleneck
5   Release of new functionality                Possible increase in concurrency and thereby response times and utilizations
6   Application version upgrade                 Possible increase/decrease in service times, response times, utilizations, shift in bottleneck
7   Critical application component failure      Sudden drop in utilizations, potential increase in queue sizes
8   Network outage                              Sudden drop in workload and thereby response times and utilization
9   Network reconnect                           Surge in connections
10  Hardware upgrade                            Reduction in response times and utilizations, shift in bottleneck
11  Software version upgrade                    Possible change in response times and thereby utilizations
12  Ad hoc operations                           Increase in response times and utilizations
13  Database maintenance                        Possible change in response times and utilizations

Table 1 is an indicative list of cause and effect relationships. More of these
can be added along similar lines. The purpose of listing them is to derive the
inverse relationship, from each effect to its possible causes, which is what
auto analysis needs.

4. Auto Analysis

Whenever we monitor a production system we are in fact observing the effects
of some cause that has made the system behave in a particular way. Table 2
inverts Table 1 and presents the possible causes of the events observed
during monitoring.

Table 2

#   Event Observed                                           Possible Cause
--  -------------------------------------------------------  ------------------------------------------
1   Spike in response time, utilizations                     Spike in workload, including ad hoc requests
2   Sudden jump in response times, utilizations              Jump in workload, change in workload mix, sudden increase of infrequent transactions, application/software/hardware upgrade
3   Sudden drop in workload, utilization or response times   Network outage or application/software component failure
4   Surge in connections                                     Network reconnect
5   Reduction in response times                              Hardware/software/application upgrade or reduction in workload or database maintenance
6   Shift in bottleneck                                      Change in workload mix, hardware/software/application upgrade, failure of redundant devices

Table 2 is by no means exhaustive, but it is indicative of the kind of
relationships to list so that the mapping from effects back to causes can
easily be automated.

The idea behind auto analysis is that whenever an event is discovered in
production, either proactively or reactively (see Section 5), we should
automatically present the potential causes, as well as correlations, to the
infrastructure expert, who can then easily decide what could be causing the
problem and thereby what needs to be done to rectify it.
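
One way to sketch this lookup is to encode Table 2 as a dictionary and return
the candidate causes for a set of observed events. The event keys, the function
name `candidate_causes`, and the ranking by number of matching events are our
own illustrative assumptions rather than part of any ESM tool; the two upgrade
entries of Table 2 are also normalized to a single label so that they can be
counted together.

```python
# Table 2 encoded as a lookup. The event keys are hypothetical labels.
POSSIBLE_CAUSES = {
    "spike_in_response_time_or_utilization": [
        "spike in workload, including ad hoc requests",
    ],
    "sudden_jump_in_response_time_or_utilization": [
        "jump in workload",
        "change in workload mix",
        "sudden increase of infrequent transactions",
        "hardware/software/application upgrade",
    ],
    "sudden_drop_in_workload_utilization_or_response_time": [
        "network outage",
        "application/software component failure",
    ],
    "surge_in_connections": ["network reconnect"],
    "reduction_in_response_time": [
        "hardware/software/application upgrade",
        "reduction in workload",
        "database maintenance",
    ],
    "shift_in_bottleneck": [
        "change in workload mix",
        "hardware/software/application upgrade",
        "failure of redundant devices",
    ],
}

def candidate_causes(observed_events):
    """List possible causes, those explaining the most observed events first.

    Ranking by vote count is an assumption made for this sketch; ties are
    broken alphabetically so that the output is deterministic.
    """
    votes = {}
    for event in observed_events:
        for cause in POSSIBLE_CAUSES.get(event, []):
            votes[cause] = votes.get(cause, 0) + 1
    return sorted(votes, key=lambda cause: (-votes[cause], cause))

ranked = candidate_causes(
    ["sudden_jump_in_response_time_or_utilization", "shift_in_bottleneck"]
)
```

Here the two causes that appear under both observed events come first in
`ranked`, which is the kind of shortlist the expert would then confirm or
reject against the system activity logs.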

5. Implementation Strategy

We now focus on how auto analysis can be implemented within the framework of a
production system performance monitoring or capacity management tool. First of
all, the tool needs to be able to monitor the events listed in Table 2. This
means that we need to be able to capture:

• overall utilizations of CPU, disk, and network
• per class utilizations (to map on to workload mix changes); these can be
  derived if the per class workload is known
• workload, overall and per class
• number of connections, either through measurement or derivation
• service times, visit counts, and demand per class, to be able to determine
  bottlenecks
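
To illustrate how service times and visit counts identify a bottleneck, the
sketch below applies the standard operational laws: the service demand at a
device is its visit count multiplied by its service time per visit, the device
with the largest demand is the bottleneck, and throughput cannot exceed the
reciprocal of that demand. The device names and figures are hypothetical.

```python
def service_demands(visit_counts, service_times_sec):
    """Service demand per device: D = V * S (seconds per transaction)."""
    return {
        device: visit_counts[device] * service_times_sec[device]
        for device in visit_counts
    }

def find_bottleneck(demands):
    """Return (device with the largest demand, throughput bound 1 / D_max)."""
    device = max(demands, key=demands.get)
    return device, 1.0 / demands[device]

# Hypothetical measurements for one transaction class.
visit_counts = {"cpu": 10, "disk": 6, "network": 4}
service_times_sec = {"cpu": 0.004, "disk": 0.010, "network": 0.002}

demands = service_demands(visit_counts, service_times_sec)
bottleneck_device, max_tps = find_bottleneck(demands)
```

With these figures the disk has the largest demand (0.06 seconds per
transaction), so it bounds throughput at roughly 16.7 transactions per second
regardless of spare capacity elsewhere.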

Once we have this at our disposal, we should proactively monitor the effects of
the following:

• Hardware/software/application upgrade, change in workload, introduction of
  new functionality: Has the bottleneck reduced? Where has the new bottleneck
  shifted to? When is it likely to be hit or, more specifically, how much more
  load can the system take before service levels are violated?
• Increase in workload: How fast is the workload growing, and how long before
  service levels are violated? Which is the first bottleneck that we will hit?
  What do we need to do to fix the bottleneck and, if the fix is a capacity
  increase, how much capacity is required to sustain the workload up to what
  level?
• Spikes in utilization: Are these one-off or recurring? If they are recurring,
  how do they correlate with other device utilizations and, from the day the
  spikes first occurred, with other related events that have changed the system
  configuration? Also, is there a proportional increase in workload to justify
  the spike?
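
The question of how much more load the system can take can be sketched with the
utilization law, U = X * D, where X is throughput and D is the service demand
at the bottleneck. The sketch below makes two assumptions of our own: that a
utilization ceiling (0.7 here) stands in for the point at which service levels
are violated, and that workload grows linearly. All names and figures are
hypothetical.

```python
def max_sustainable_tps(demand_sec, util_ceiling=0.7):
    """Utilization law U = X * D, solved for X at the utilization ceiling."""
    return util_ceiling / demand_sec

def months_until_ceiling(current_tps, growth_tps_per_month, demand_sec,
                         util_ceiling=0.7):
    """Months of headroom left, assuming linear workload growth."""
    headroom = max_sustainable_tps(demand_sec, util_ceiling) - current_tps
    if headroom <= 0:
        return 0.0  # the ceiling has already been reached
    if growth_tps_per_month <= 0:
        return float("inf")  # workload is flat or shrinking
    return headroom / growth_tps_per_month

# Hypothetical figures: bottleneck demand of 0.05 s per transaction,
# 8 transactions/s today, growing by 0.5 transactions/s each month.
limit_tps = max_sustainable_tps(0.05)
months_left = months_until_ceiling(8.0, 0.5, 0.05)
```

For these figures the bottleneck sustains about 14 transactions per second at
the assumed ceiling, leaving about 12 months at the assumed growth rate.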

Reactively, we can do the following:

• Increase in response times: can we relate this to an increase in workload, a
  change in system/application configuration, or an increase in users?
• All the other items in Table 2
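
One simple way to relate an increase in response times to an increase in
workload is a correlation coefficient over the same monitoring intervals. The
sketch below computes the Pearson coefficient by hand; the choice of this
statistic and the sample series are our own assumptions, and a high value only
suggests a relationship rather than proving the cause.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Hypothetical per-interval measurements.
workload_tps = [100, 120, 150, 200, 260]
response_time_sec = [0.8, 0.9, 1.1, 1.6, 2.4]

r = pearson(workload_tps, response_time_sec)
```

A coefficient close to 1, as in this sample, would point the expert towards
workload growth; a low coefficient would shift attention to configuration
changes or the other causes in Table 2.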

The idea behind this is to show the relations described above quantitatively,
so that the analyst does not have to manually pore through a lot of data to
infer them. For implementation purposes it is best to draw a flowchart for each
item described above and implement it within the performance management
framework. This should include what-if analysis in terms of adding capacity or
shifting the bottleneck. For example, if capacity cannot be added for budgetary
reasons, then the extent to which demand needs to be reduced should be listed
and targets provided to the application owners.
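
A what-if calculation of this kind can again be sketched with the utilization
law, U = X * D. The sketch assumes a utilization ceiling of 0.75 as the
planning limit and, for the capacity option, identical servers with load spread
evenly across them; these assumptions, the function names, and the figures are
ours and purely illustrative.

```python
from math import ceil

def required_demand_reduction(current_demand_sec, target_tps,
                              util_ceiling=0.75):
    """Fraction by which service demand must fall to sustain target_tps
    on the existing capacity without exceeding the utilization ceiling."""
    allowed_demand = util_ceiling / target_tps  # from U = X * D
    if allowed_demand >= current_demand_sec:
        return 0.0  # the current demand already fits
    return 1.0 - allowed_demand / current_demand_sec

def servers_needed(demand_sec, target_tps, util_ceiling=0.75):
    """Number of identical servers needed if demand is left unchanged,
    assuming load is balanced evenly across them."""
    return ceil(target_tps * demand_sec / util_ceiling)

# Hypothetical figures: 0.06 s demand per transaction at the bottleneck,
# and a target of 15 transactions/s.
reduction = required_demand_reduction(0.06, 15.0)
servers = servers_needed(0.06, 15.0)
```

For these figures the choice is between a second server and a reduction in
demand of about 17 percent, which is the kind of target that can be handed to
the application owners.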
