Arnold O. Allen
Software Technology Division
Hewlett-Packard
Roseville, California
AP PROFESSIONAL
Harcourt Brace & Company, Publishers
1300 Boylston Street, Chestnut Hill, MA 02167
ISBN 0-12-051070-7
Anyone who works through all the examples and exercises will gain a basic
understanding of computer performance analysis and will be able to put it to use
in computer performance management.
The prerequisites for this book are a basic knowledge of computers and
some mathematical maturity. By basic knowledge of computers I mean that the
reader is familiar with the components of a computer system (CPU, memory, I/O
devices, operating system, etc.) and understands the interaction of these compo-
nents to produce useful work. It is not necessary to be one of the digerati (see the
definition in the Definitions and Notation section at the end of this preface) but it
would be helpful. For most people mathematical maturity means a semester or so
of calculus but others reach that level from studying college algebra.
I chose Mathematica as the primary tool for constructing examples and mod-
els because it has some ideal properties for this. Stephen Wolfram, the original
developer of Mathematica, says in the "What is Mathematica?" section of his
book [Wolfram 1991]:
Mathematica is a general computer software system and language intended
for mathematical and other applications.
You can use Mathematica as:
1. A numerical and symbolic calculator where you type in questions, and Mathe-
matica prints out answers.
2. A visualization system for functions and data.
If you want to know what the millionth prime is, without listing all those
preceding it, proceed as follows.
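Using the built-in Prime function, one short dialog does it (the display details
vary with the Mathematica version):

In[1]:= Prime[1000000]

Out[1]= 15485863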
The number π has been computed to over two billion decimal digits. Before the
age of computers an otherwise unknown British mathematician, William Shanks,
spent twenty years computing π to 707 decimal places. His result was published
in 1873. Many years later it was learned that he had written a 5 rather than a 4 in
the 528th place so that all the remaining digits were wrong. Now you can calculate
707 digits of π in a few seconds with Mathematica and all 707 of them will be
correct!
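A dialog like the following produces them (only the first few of the 707 digits of
the output are shown here):

In[2]:= N[Pi, 707]

Out[2]= 3.14159265358979323846...  (707 digits in all)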
Mathematica can also eliminate much of the drudgery we all experienced in
high school when we learned algebra. Suppose you were given the messy
expression 6x²y² - 4xy³ + x⁴ - 4x³y + y⁴ and told to simplify it. Using
Mathematica you would proceed as follows:

In[3]:= 6 x^2 y^2 - 4 x y^3 + x^4 - 4 x^3 y + y^4

Out[3]= x^4 - 4 x^3 y + 6 x^2 y^2 - 4 x y^3 + y^4

In[4]:= Simplify[%]

Out[4]= (-x + y)^4
If you use calculus in your daily work or if you have to help one of your children
with calculus, you can use Mathematica to do the tricky parts. You may remember
the scene in the movie Stand and Deliver where Jaime Escalante of James A.
Garfield High School in Los Angeles uses tabular integration by parts to show that
    ∫ x² sin x dx = -x² cos x + 2x sin x + 2 cos x + C
With Mathematica you get this result as follows.
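A one-line dialog does the job; the form of the output varies from version to
version, but it is equivalent to the result above (Mathematica omits the arbitrary
constant C):

In[5]:= Integrate[x^2 Sin[x], x]

Out[5]= (2 - x^2) Cos[x] + 2 x Sin[x]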
Mathematica can even help you if you've forgotten the quadratic formula and
want to find the roots of the polynomial x² + 6x - 12. You proceed as follows:

In[4]:= Solve[x^2 + 6 x - 12 == 0, x]

Out[4]= {{x -> (-6 + 2 Sqrt[21])/2}, {x -> (-6 - 2 Sqrt[21])/2}}
None of the above Mathematica output looks exactly like what you will see on the
screen but is as close as I could capture it using the SessionLog.m functions.
We will not use the advanced mathematical capabilities of Mathematica very
often but it is nice to know they are available. We will frequently use two other
powerful strengths of Mathematica. They are the advanced programming lan-
guage that is built into Mathematica and its graphical capabilities.
In the example below we show how easy it is to use Mathematica to generate
the points needed for a graph and then to make the graph. If you are new to
computer performance analysis you may not understand some of the parameters
used. They will be defined and discussed in the book. The purpose of this exam-
ple is to show how easy it is to create a graph. If you want to reproduce the graph
you will need to load in the package work.m. The Mathematica program
Approx is used to generate the response times for workers who are using termi-
nals as we allow the number of user terminals to vary from 20 to 70. We assume
there are also 25 workers at terminals doing another application on the computer
system. The vector Think gives the think times for the two job classes and the
array Demands provides the service requirements for the job classes. (We will
define think time and service requirements later.)
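The dialog below is not the book's actual example; it is a minimal sketch of the
plotting step only, with invented response time values standing in for the output
of Approx:

In[6]:= terminals = Range[20, 70, 10];

In[7]:= resp = {1.2, 1.6, 2.1, 2.9, 4.0, 5.6};  (* hypothetical values *)

In[8]:= ListPlot[Transpose[{terminals, resp}]]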
Acknowledgments
Many people helped bring this book into being. It is a pleasure to acknowledge
their contributions. Without the help of Gary Hynes, Dan Sternadel, and Tony
Engberg from Hewlett-Packard in Roseville, California this book could not have
been written. Gary Hynes suggested that such a book should be written and
provided an outline of what should be in it. He also contributed to the
Mathematica programming effort and provided a usable scheme for printing the
output of Mathematica programs; piles of numbers are difficult to interpret! In
addition, he supplied some graphics and got my workstation organized so that it
was possible to do useful work with it. Dan Sternadel lifted a big administrative
load from my shoulders so that I could spend most of my time writing. He
arranged for all the hardware and software tools I needed as well as FrameMaker
and Mathematica training. He also handled all the other difficult administrative
problems that arose. Tony Engberg, the R & D Manager for the Software
Technology Division of Hewlett-Packard, supported the book from the beginning.
He helped define the goals for and contents of the book and provided some very
useful reviews of early drafts of several of the chapters.
Thanks are due to Professor Leonard Kleinrock of UCLA. He read an early
outline and several preliminary chapters and encouraged me to proceed. His two
volume opus on queueing theory has been a great inspiration for me; it is an out-
standing example of how technical writing should be done.
A number of people from the Hewlett-Packard Performance Technology
Center supported my writing efforts. Philippe Benard has been of tremendous
assistance. He helped conquer the dynamic interfaces between UNIX, Frame-
Maker, and Mathematica. He solved several difficult problems for me including
discovering a method for importing Mathematica graphics into FrameMaker and
coercing FrameMaker into producing a proper Table of Contents. Tom Milner
became my UNIX advisor when Philippe moved to the Hewlett-Packard Cuper-
tino facility. Jane Arteaga provided a number of graphics from Performance
Technology Center documents in a format that could be imported into Frame-
Maker. Helen Fong advised me on RTEs, created a nice graphic for me, proofed
several chapters, and checked out some of the Mathematica code. Jim Lewis read
several drafts of the book, found some typos, made some excellent suggestions
for changes, and ran most of the Mathematica code. Joe Wihnyk showed me how
to force the FrameMaker HELP system to provide useful information. Paul Prim-
mer, Richard Santos, and Mel Eelkema made suggestions about code profilers
and SPT/iX. Mel also helped me describe the expert system facility of HP Glan-
cePlus for MPE/iX. Rick Bowers proofed several chapters, made some helpful
suggestions, and contributed a solution for an exercise. Jim Squires proofed sev-
eral chapters, and made some excellent suggestions. Gerry Wade provided some
insight into how collectors, software monitors, and diagnostic tools work. Sharon
Riddle and Lisa Nelson provided some excellent graphics. Dave Gershon con-
verted them to a format acceptable to FrameMaker. Tim Gross advised me on
simulation and handled some ticklish UNIX problems. Norbert Vicente installed
FrameMaker and Mathematica for me and customized my workstation. Dean
Coggins helped me keep my workstation going.
Some Hewlett-Packard employees at other locations also provided support
for the book. Frank Rowand and Brian Carroll from Cupertino commented on a
draft of the book. John Graf from Sunnyvale counseled me on how to measure
the CPU power of PCs. Peter Friedenbach, former Chairman of the Executive
Steering Committee of the Transaction Processing Performance Council (TPC),
advised me on the TPC benchmarks and provided me with the latest TPC bench-
mark results. Larry Gray from Fort Collins helped me understand the goals of the
Standard Performance Evaluation Corporation (SPEC) and the new SPEC bench-
marks. Larry is very active in SPEC. He is a member of the Board of Directors,
Chair of the SPEC Planning Committee, and a member of the SPEC Steering
Committee. Dr. Bruce Spenner, the General Manager of Disk Memory at Boise,
advised me on Hewlett-Packard I/O products. Randi Braunwalder from the same
facility provided the specifications for specific products such as the 1.3-inch Kit-
tyhawk drive.
Several people from outside Hewlett-Packard also made contributions. Jim
Calaway, Manager of Systems Programming for the State of Utah, provided
some of his own papers as well as some hard-to-find IBM manuals, and
reviewed the manuscript for me. Dr. Barry Merrill from Merrill Consultants
reviewed my comments on SMF and RMF. Pat Artis from Performance Associ-
ates, Inc. reviewed my comments on IBM I/O and provided me with the manu-
script of his book, MVS I/O Subsystems: Configuration Management and
Performance Analysis, McGraw-Hill, as well as his Ph.D. dissertation. (His
coauthor for the book is Gilbert E. Houtekamer.) Steve Samson from Candle Cor-
poration gave me permission to quote from several of his papers and counseled
me on the MVS operating system. Dr. Anil Sahai from Amdahl Corporation
reviewed my discussion of IBM I/O devices and made suggestions for improve-
ment. Yu-Ping Chen proofed several chapters. Sean Conley, Chris Markham, and
Marilyn Gibbons from Frame Technology Technical Support provided extensive
help in improving the appearance of the book. Marilyn Gibbons was especially
helpful in getting the book into the exact format desired by my publisher. Brenda
Feltham from Frame Technology answered my questions about the Microsoft
Windows version of FrameMaker. The book was typeset using FrameMaker on a
Hewlett-Packard workstation and on an IBM PC compatible running under
Microsoft Windows. Thanks are due to Paul R. Robichaux and Carol Kaplan for
making Sean, Chris, Marilyn, and Brenda available. Dr. T. Leo Lo of McDonnell
Douglas reviewed Chapter 7 and made several excellent recommendations. Brad
Horn and Ben Friedman from Wolfram Research provided outstanding advice on
how to use Mathematica more effectively.
Thanks are due to Wolfram Research not only for asking Brad Horn and Ben
Friedman to counsel me about Mathematica but also for providing me with
Mathematica for my personal computer and for the HP 9000 computer that sup-
ported my workstation. The address of Wolfram Research is
Wolfram Research, Inc.
P. O. Box 6059
Champaign, Illinois 61821
Telephone: (217)398-0700
Brian Miller, my production editor at Academic Press Boston, did an excel-
lent job in producing the book under a tight schedule. Finally, I would like
to thank Jenifer Niles, my editor at Academic Press Professional, for her encour-
agement and support during the sometimes frustrating task of writing this book.
References
1. Martha L. Abell and James P. Braselton, Mathematica by Example, Academic
Press, 1992.
2. Martha L. Abell and James P. Braselton, The Mathematica Handbook, Aca-
demic Press, 1992.
3. Nancy R. Blachman, Mathematica: A Practical Approach, Prentice-Hall,
1992.
4. Richard E. Crandall, Mathematica for the Sciences, Addison-Wesley, 1991.
5. Theodore Gray and Jerry Glynn, Exploring Mathematics with Mathematica,
Addison-Wesley, 1991.
6. Leonard Kleinrock, Queueing Systems, Volume I: Theory, John Wiley, 1975.
7. Leonard Kleinrock, Queueing Systems, Volume II: Computer Applications,
John Wiley, 1976.
8. Roman Maeder, Programming in Mathematica, Second Edition, Addison-
Wesley, 1991.
9. Stan Wagon, Mathematica in Action, W. H. Freeman, 1991.
10. Stephen Wolfram, Mathematica: A System for Doing Mathematics by Com-
puter, Second Edition, Addison-Wesley, 1991.
A computer can never have too much memory or too fast a CPU.
Michael Doob
Notices of the AMS
1.1 Introduction
The word performance in computer performance means the same thing that
performance means in other contexts; that is, it means "How well is the computer
system doing the work it is supposed to do?" Thus it means the same thing for
personal computers, workstations, minicomputers, midsize computers,
mainframes, and supercomputers. Almost everyone has a personal computer but
very few people think their PC is too fast. Most would like a more powerful model
so that Microsoft Windows would come up faster and/or their spreadsheets would
run faster and/or their word processor would perform better, etc. Of course a more
powerful machine also costs more. I have a fairly powerful personal computer at
home; I would be willing to pay up to $1500 to upgrade my machine if it would
run Mathematica programs at least twice as fast. To me that represents good
performance because I spend a lot of time running Mathematica programs and
they run slower than any other programs I run. It is more difficult to decide what
good or even acceptable performance is for a computer system used in business.
It depends a great deal on what the computer is used for; we call the work the
computer does the workload. For some applications, such as an airline reservation
system, poor performance could cost an airline millions of dollars per day in lost
revenue. Merrill has a chapter in his excellent book [Merrill 1984] called
"Obtaining Agreement on Service Objectives." (By service objectives Merrill is
referring to how well the computer executes the workload.) Merrill says:
There are three ways to set the goal value of a service objec-
tive: a measure of the user's subjective perception, manage-
ment dictate, and guidance from others' experiences.
Of course, the best method for setting the service objective
goal value requires the most effort. Record the user's subjec-
tive perception of response and then correlate perception with
internal response measures.
Merrill describes a case study that was used to set the goal for a CICS (Customer
Information Control System, one of the most popular IBM mainframe application
programs) system with 24 operators at one location. (IBM announced in
September 1992 that CICS will be ported to IBM RS/6000 systems as well as to
Hewlett-Packard HP 3000 and HP 9000 platforms.) For two weeks each of the 24
operators rated the response time at the end of each hour with the subjective
ratings of Excellent, Good, Fair, Poor, or Rotten (the operators were not given any
actual response times). After throwing out the outliers, the ratings were compared
to the response time measurements from the CICS Performance Analyzer (an IBM
CICS performance measurement tool). It was discovered that whenever over 93%
of the CICS transactions completed in under 4 seconds, all operators rated the
service as Excellent or Good. When the percentage dropped below 89% the
operators rated the service as Poor or Rotten. Therefore, the service objective goal
was set such that 90% of CICS transactions must complete in 4 seconds.
We will discuss the problem of determining acceptable performance in a
business environment in more detail later in the chapter.
Since acceptable computer performance is important for most businesses we
have an important-sounding phrase for describing the management of computer
performance: it is called performance management or capacity management.
Performance management is an umbrella term to include most operations
and resource management aspects of computer performance. There are various
ways of breaking performance management down into components. At the
Hewlett-Packard Performance Technology Center we segment performance man-
agement as shown in Figure 1.1.
We believe there is a core area consisting of common access routines that
provide access to performance metrics regardless of the operating system plat-
form. Each quadrant of the figure is concerned with a different aspect of perfor-
mance management.
Application optimization helps to answer questions such as "Why is the pro-
gram I use so slow?" Tools such as profilers can be used to improve the perfor-
mance of application code, and other tools can be used to improve the efficiency
of operating systems.
If we know in advance that an applica-
tion will require more computer resources for completion than are currently
available, then arrangements can be made to procure the required capacity before
it is needed. It is not knowing what the requirements are that can lead to
panic.
Investing in performance management saves money. Having limited
resources is thus a compelling reason to do more planning rather than less. It
doesn't require a large effort to avoid many really catastrophic problems.
With regard to the last item there are some who ask: "Since computer sys-
tems are getting cheaper and more powerful every day, why don't we solve any
capacity shortage problem by simply adding more equipment? Wouldn't this be
less expensive than using the time of highly paid staff people to do a detailed sys-
tems analysis for the best upgrade solution?" There are at least three problems
with this solution. The first is that, even though the cost of computing power is
declining, most companies are spending more on computing every year because
they are developing new applications. Many of these new applications make
sense only because computer systems are declining in cost. Thus the computing
budget is increasing and the executives in charge of this resource must compete
with other executives for funds. A good performance management effort makes it
easier to justify expenditures for computing resources.
Another advantage of a good performance management program is that it
makes the procurement of upgrades more cost effective (this will help get the
required budget, too).
A major use of performance management is to prevent a sudden crisis in
computer capacity. Without it there may be a performance crisis in a major appli-
cation, which could cost the company dearly.
In organizing performance management we must remember that hardware is
not the only resource involved in computer performance. Other factors include
how well the computer systems are tuned, the efficiency of the software, the
operating system chosen, and priority assignments.
5. the number and speed of I/O channels and the size as well as the speed of disk
cache (on disk controllers or in main memory)
6. tape memory
7. the speed of the communication lines connecting the terminals or workstations
to the computer system.
This list is incomplete but provides some idea of the scope of computer
performance. We discuss the components of computer performance in more detail
in Chapter 2.
For RISC processors, the rate of improvement in CPU performance has
been almost 100% per year! (RISC means reduced instruction set computers as
compared to the traditional CISC or complex instruction set computers.) Similar
rates of improvement are being made in main memory technology. Unfortunately,
the improvement rate for I/O devices lags behind those for other technologies.
These changes must be kept in mind when planning future upgrades.
In spite of the difficulties inherent in capacity planning, many progressive
companies have successful capacity planning programs. For the story of how the
M&G Group PLC of England successfully set up capacity planning at an IBM
mainframe installation see the interesting article [Claridge 1992]. There are four
parts of a successful program:
1. understanding the current business requirements and users' performance
requirements
2. prediction of future workload
3. an evaluation of future configurations
4. an ongoing management process.
For an order-entry application, for example, the predicted workload must be
translated into computer resource requirements per order entered; that is, into
CPU seconds per transaction, I/Os required per
transaction, memory required, etc.
Devising a measurement strategy for assessing the actual performance and
utilization of a computer system and its components is an important part of
capacity planning. We must obtain the capability for measuring performance and
for storing the performance data for later reference, that is, we must have mea-
surement tools and a performance database. The kind of program that collects
system resource consumption data on a continuous basis is called a software
monitor and the performance data files produced by a monitor are often called
log files. For example, the Hewlett-Packard performance tool HP LaserRX has
a monitor called SCOPE that collects performance information and stores it for
later use in log files. If you have an IBM mainframe running under the MVS
operating system, the monitor most commonly used is the IBM Resource Mea-
surement Facility (RMF). From the performance information that has been cap-
tured we can determine what our current service levels are, that is, how well we
are serving our customers. Other tools exist that make it easy for us to analyze the
performance data and present it in meaningful ways to users and management.
An example is shown in Figure 1.2, which was provided by the Hewlett-Packard
UNIX performance measurement tool HP LaserRX/UX. HP LaserRX/UX soft-
ware lets you display and analyze collected data from one or more HP-UX based
systems. This figure shows how you can examine a graph called Global Bottle-
necks (which does not directly indicate bottlenecks but does show the major
resource utilization at the global level), view CPU system utilization at the global
level, and then make a more detailed inspection at the application and process
level. Thus we examine our system first from an overall point of view and then
home in on more detailed information. We discuss performance tools in more
detail later in this chapter.
Once we have determined how well our current computer systems are sup-
porting the major applications we need to set performance objectives.
1.2.1.1 Performance Measures
The two most common performance measures for interactive processing are
average response time and average throughput. The first of these measures is the
delay the user experiences between the instant a request for service from the
computer system is made and when the computer responds. The average
throughput is a measure of how fast the computer system is processing the work.
The precise value of an individual response time is the elapsed time from the
instant the user hits the enter key until the instant the corresponding reply begins
to appear on the screen. (The 90th percentile value is the value such
that 1 out of 10 values will exceed it.) It is part of the folk-
lore of capacity planning that the perceived value of the average response time
experienced is the 90th percentile value of the actual average. If the response time
has an exponential distribution (a common occurrence) then the 90th percentile
value is 2.3 times the average value. Thus, if a user has experienced a long
sequence of exponentially distributed response times with an average value of 2
seconds, the user will perceive an average response time of 4.6 seconds! The rea-
son for this is as follows: Although only 1 out of 10 response times exceeds 4.6
seconds, these long response times make a bigger impression on the memory
than the 9 out of 10 that are smaller. We all seem to remember bad news better
than good news! (Maybe that's why most of the news in the daily paper seems to
be bad news.)
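The 2.3 factor is easy to verify. For an exponential distribution with mean S, the
probability that a response time is no greater than t is F(t) = 1 - e^(-t/S);
setting F(t) = 0.9 and solving for t gives t = S ln 10, or about 2.3S. Mathematica
confirms the constant:

In[1]:= N[Log[10]]

Out[1]= 2.30259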
The average throughput is the average rate at which jobs are completed in an
interval of time, that is, the number of jobs or transactions completed divided by
the time in which they were completed. Thus, for an order-entry application, the
throughput might be measured in units of number of orders entered per hour, that
is, orders per hour. The average throughput is of more interest to management
than to the end user at the terminal; it is not sensed by the users as response time
is, but it is important as a measure of productivity. It measures whether or not the
work is getting done on time. Thus, if Short Shingles receives 4,000 orders per
day but the measured throughput of their computer system is only 3,500 order-
entry applications per day, then the orders are not being processed on time. Either
the computer system is not keeping up, there are not enough order-entry person-
nel to handle all the work, or some other problem exists. Something needs to be
done!
The primary performance measures for batch processing are average job
turnaround time and average throughput. Another important performance mea-
sure is completion of the batch job in the batch window for installations that
have an important batch job that must be completed within a window. The
window of such a batch job is the time period in which it must be started and
completed. The payroll is such an application. It cannot be started until the work
records of the employees are available and must be completed by a fixed time or
there will be a lot of disgruntled employees. An individual job turnaround time is
the interval between the instant a batch program (job) is read into the computer
system and the instant that the program completes execution. Thus a batch sys-
tem processing bills to customers for services rendered might have a turnaround
time of 12 minutes and a throughput of three jobs per hour.
Another performance measure of interest to user departments is the avail-
ability of the computer system. This is defined as the percentage of scheduled
computer system time in which the system is actually available to users to do use-
ful work. The system can fail to be available because of hardware failures, soft-
ware failures, or by allowing preventive maintenance to be scheduled during
normal operating hours.
For example, a response time objective might be stated as between
0.25 and 1.5 seconds during the peak period of the day, or as an average of 1.25
seconds with a 95th percentile response time of 3.75 seconds at all times. The
objectives usually vary by time of day, day of the week, day of the month, type of
work, and by other factors, such as a holiday season, that can impact perfor-
mance. Service level objectives are usually established for online response time,
batch turnaround time, availability requirements for resources and workloads,
backup and recovery resources and procedures, and disaster plans.
One of the major advantages of the use of SLAs is that it gets a dialog going
between the user departments and the computer installation management. This
two-way communication helps system management understand the needs of their
users and it helps the users understand the problems IS management has in
providing the level of service desired by the users. As Backman [Backman 1990]
says about SLA benefits:
The expectations of both the supplier and the consumer are set.
Both sides are in agreement on the service and the associated
criteria defined. This is the main tangible benefit of using
SLAs.
The intangible benefits, however, provide much to the par-
ties as well. The transition from a reactionary fire fighting
methodology of performance management to one of a proac-
tive nature will be apparent if the SLA is followed and sup-
ported. Just think how you will feel if all those system
surprises have been eliminated, allowing you to think about
the future. The SLA method provides a framework for organi-
zational cooperation. The days of frantically running around
juggling batch schedules and moving applications from
machine to machine are eliminated if the SLA has been prop-
erly defined and adhered to.
Also, capacity planning becomes a normal, scheduled
event. Regular capacity planning reports will save money in
the long run since the output of the capacity plan will be fac-
tored into future SLAs over time, allowing for the planned
increases in volume to be used in the projection of future hard-
ware purchases.
Miller in his article [Miller 1987] on service level agreements describes the
elements that need to be structured for a successful service level agreement.
Miller also provides a proposed general format for service level agreements
and an excellent service level agreement checklist.
If service level agreements are to work well, there must be cooperation and
understanding between the users and the suppliers of the information systems.
Vanvick in his interesting paper [Vanvick 1992] provides a quiz to be taken by IS
managers and user managers to help them understand each other. He recom-
mends that IS respondents with a poor score get one week in a user re-education
camp where acronyms are prohibited. User managers get one week in an IS re-
education camp where acronyms are the only means of communication.
Another tool that is often used in conjunction with service level agreements
is chargeback to the consumer of computer resources.
Chargeback
There are those who believe that a service level agreement is a carrot to encourage
user interest in performance management while chargeback is the stick. That is, if
users are charged for the IS resources they receive, they will be less likely to make
unrealistic performance demands. In addition users can sometimes be persuaded
to shift some of their processing to times other than the peak period of the day by
offering them lower rates.
Not all installations use chargeback but some types of installations have no
choice. For example, universities usually have a chargeback system to prevent
students from using excessive amounts of IS resources. Students usually have job
identification numbers; a limited amount of computing is allowed for each num-
ber.
According to Freimayer [Freimayer 1988], a chargeback system offers a number
of benefits.
These seem to be real benefits but, like most things in this world, they are not
obtained without effort. The problems with chargeback systems are always more
political than technical, especially if a chargeback system is just being
implemented. Most operating systems provide the facilities for collecting the
information needed for a chargeback program and commercial software is
available for implementing chargeback. The difficulties are in deciding the goals
of a program and implementing the program in a way that will be acceptable to the
users and to upper management.
The key to implementing a chargeback program is to treat it as a project to
be managed just as any other project is managed. This means that the goals of the
project must be clearly formulated. Some typical goals are:
1. Recover the full cost to IS for the service provided.
2. Encourage users to take actions that will improve performance, such as per-
forming low priority processing at off-peak times, deleting obsolete data from
disk storage, and moving some processing such as word processing or spread-
sheets to PCs or workstations.
3. Discourage users from demanding unreasonable service levels.
Part of the implementation project is to ensure that the users understand and
feel comfortable with the goals of the chargeback system that is to be imple-
mented. It is important that the system be perceived as being fair. Only then
should the actual chargeback system be designed and implemented. Two impor-
tant parts of the project are: (1) to get executive level management approval and
(2) to verify with the accounting department that the accounting practices used in
the plan meet company standards. Then the chargeback algorithms can be
designed and put into effect. The computer resources for which users are typi-
cally billed include:
1. CPU time
2. disk I/O
3. disk space used (quantity and duration)
4. tape I/O
5. connect time
6. network costs
7. paging rate
8. lines printed
9. amount of storage used (real/virtual).
Factors that may affect the billing rates of the above resources include:
1. job class
2. job priority surcharges
3. day shift (premium)
4. evening shift (discount).
As an example of how a charge might be levied, suppose that the CPU cost
per month for a certain computer is $100,000 and that the number of hours of
CPU time used in October was 200. Then the CPU billing rate for October would
be $100,000/200 = $500 per hour, assuming there were no premium charges. If
Group A used 10 hours of CPU time in October, the group would be charged
$5,000 for CPU time plus charges for other items that were billable such as the
disk I/O, lines printed, and amount of storage used.
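As a quick Mathematica check of this arithmetic:

In[1]:= cpuRate = 100000/200      (* dollars per CPU hour *)

Out[1]= 500

In[2]:= 10 cpuRate                (* Group A's October CPU charge *)

Out[2]= 5000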
Standard costing is another method of chargeback that can be used for
mature systems, that is, systems that have been in use long enough that IS knows
how much of each computer resource is needed, on the average, to process one of
the standard units, also called a business work unit (BWU) or natural forecasting
unit (NFU). An example for a travel agency might be a booking of an airline
flight. For a bank it might be the processing of a monthly checking account for a
private (not business) customer. A BWU for a catalog service that takes most
orders by 800-number phone calls could be a phone order processed.
Other questions that must be answered as part of the implementation project
include:
1. What reports must be part of the chargeback process and who receives them?
2. How are disagreements about charges negotiated?
3. When is the chargeback system reviewed?
4. When is the chargeback system renegotiated?
A chargeback system works best when combined with a service level agree-
ment so both can be negotiated at the same time.
Schrier [Schrier 1992] described how the City of Seattle developed a charge-
back system for a data communications network.
Not everyone agrees that chargeback is a good idea; especially when dis-
gruntled users can buy their own PCs or workstations. The article by Butler [But-
ler 1992] contains interviews with a number of movers and shakers as well as a
discussion of the tools available for chargeback. The subtitle of the article is
"Users, IS disagree on chargeback merit for cost control in downsized environ-
ment."

Software performance engineering (SPE) is the practice of designing
performance into an application throughout the entire software develop-
ment cycle. The standard book on SPE is [Smith 1991]. Smith says, in the open-
ing paragraph:
The basic principle of SPE is that service level objectives are set during the
application specification phase of development and are designed in as the
functionality of the application is specified and detailed design begins.
Furthermore, resource requirements to achieve the desired service levels are also
part of the development process.
One of the key techniques of SPE is the performance walkthrough. It is per-
formed early in the software development cycle, in the requirements analysis
phase, as soon as a general idea of system functions is available. The main part of
the meeting is a walkthrough of the major system functions to determine whether
or not the basic design can provide the desired performance with the anticipated
volume of work and the envisioned hardware platform. An example of how this
might work is provided by Bailey [Bailey 1991]. A database transaction process-
ing system was being designed that was required to process 14 transactions per
second during the peak period of the day. Each transaction required the execution
of approximately 1 million computer instructions on the proposed computer.
Since the computer could process far in excess of 14 million instructions per sec-
ond, it appeared there would be no performance problems. However, closer
inspection revealed that the proposed computer was a multiprocessor with four
CPUs and that the database system was single threaded, that is, to achieve the
required performance each processor would need the capability of processing 14
million instructions per second! Since a single CPU could not deliver the
required CPU cycles the project was delayed until the database system was mod-
ified to allow multithreading operations, that is, so that four transactions could be
executed simultaneously. When the database system was upgraded the project
went forward and was very successful. Without the walkthrough the system
would have been developed prematurely.
I believe that a good performance walkthrough could have prevented many,
if not most, of the performance disasters that have occurred. However, Murphy's
law must be repealed before we can be certain of the efficacy of performance
walkthroughs. Of course the performance walkthrough is just the beginning of
the SPE activity in a software development cycle, but a very important part.
Organizations that have adopted SPE claim that they need to spend very little
time tuning their applications after they go into the production phase, have fewer
unpleasant surprises just before putting their applications into production, and
have a much better idea of what hardware resources will be needed to support
their applications in the future. Application development done using SPE also
results in less software maintenance, less emergency hardware procurement, and
more efficient application development. These are strong claims, as one would
expect from advocates, but SPE seems to be the wave of the future.
Howard in his interesting paper [Howard 1992a] points out that serious
political questions can arise in implementing SPE. Howard says:
SPE ensures that application development not only satis-
fies functional requirements, but also performance require-
ments.
There is a problem that hinders the use of SPE for many
shops, however. It is a political barrier between the application
development group and other groups that have a vested interest
in performance. This wall keeps internal departments from
communicating information that can effectively increase the
performance of software systems, and therefore decrease over-
all MIS operating cost.
Lack of communication and cooperation is the greatest
danger. This allows issues to slip away without being resolved.
MIS and the corporation can pay dearly for system inefficien-
cies, and sometimes do not even know it.
A commitment from management to improve communica-
tions is important. Establishing a common goal of software
development (the success of the corporation) is also critical
to achieving staff support. Finally, the use of performance analysis ...
Howard gives several real examples, without the names of the corporations
involved, in which major software projects failed because of performance
problems. He provides a list of representative performance management products
with a description of what they do. He quotes from a number of experts and from
several managers of successful projects who indicate why they were successful. It
all comes down to the subtitle of Howard's paper, "To balance program
performance and function, users, developers must share business goals."
Howard [Howard 1992b] amplifies some of his remarks in [Howard 1992a]
and provides some helpful suggestions on selling SPE to application developers.
Statistical projection techniques are discussed briefly later in this chapter, in the
section on statistical projection, and in more detail in Chapter 7. The techniques
most commonly used to predict computer performance are:
1. rules of thumb
2. back-of-the-envelope calculations
3. statistical forecasting
4. analytical queueing theory modeling
5. simulation modeling
6. benchmarking.
A good example is Rosenberg's three-part rule of thumb:
1. If the CPU is at 100% utilization or less and the required work is being com-
pleted on time, everything is okay for now (but always remember, tomorrow is
another day).
2. If the CPU is at 100% busy, and all work is not completed, you have a prob-
lem. Begin looking at the CPU resource.
3. If the CPU is not 100% busy, and all work is not being completed, a problem
also exists and the I/O and memory subsystems should be investigated.
Rules of thumb are often used in conjunction with other modeling tech-
niques as we will show later. As valuable as rules of thumb are, one must use cau-
tion in applying them because a particular rule may not apply to the system under
consideration. For example, many of the rules of thumb given in [Zimmer 1990]
are operating system dependent or hardware dependent; that is, may only be valid
for systems using the IBM MVS operating system or for Tandem computer sys-
tems, etc.
Samson in his delightful paper [Samson 1988] points out that some rules of
thumb are of doubtful authenticity. These include the following:
1. There is a knee in the curve.
2. Keep device utilization below 33%.
3. Keep path utilization below 30%.
4. Keep CPU utilization below ??%.
To understand these questionable rules of thumb you need to know about the
curve of queueing time versus utilization for the simple M/M/1 queueing system.
The M/M/1 designation means there is one service center with one server; this
server provides exponentially distributed service. The M/M/1 system is an open
system with customers arriving at the service center in a pattern such that the
time between the arrival of consecutive customers has an exponential distribu-
tion. The curve of queueing time versus server utilization is smooth with a verti-
cal asymptote at a utilization of 1. This curve is shown in Figure 1.4. If we let S
represent the average service time, that is, the time it takes the server to provide
service to one customer, on the average, and U the server utilization, then the
average queueing time for the M/M/1 queueing system is given by

    US/(1 - U).
[Figure 1.4. Response time versus utilization for the M/M/1 queueing system.]
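The formula is easy to experiment with in Mathematica. For example, at a
utilization of 1/3 the queueing time is half the average service time, a fact we
will meet again below:

In[1]:= queueTime[s_, u_] := u s / (1 - u)

In[2]:= queueTime[s, 1/3]

Out[2]= s/2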
With regard to the first questionable rule of thumb (There is a knee in the
curve), many performance analysts believe that, if response time or queueing
time is plotted versus load on the system or device, then, at a magic value of load,
the curve turns up sharply. This point is known as the knee of the curve. In Fig-
ure 1.5 it is the point (0.5, 0.5). As Samson says (I agree with him):
At one-third utilization, queueing time is equal
to half the service time. Someone decided many years ago that
these numbers had some magical significance: that a device
less than one-third busy wasn't busy enough, and that delay
more than half of service time was excessive.
Samson has other wise things to say about this rule in his "The rest of the story"
and "Lesson of the legend" comments. You may want to check that

    (1/3)S/(1 - 1/3) = S/2.
With respect to the third questionable rule of thumb (Keep path utilization
below 30%), Samson points out that it is pretty much the preceding rule repeated.
With newer systems, path utilizations exceeding 30% often have satisfactory per-
formance. You must study the specific system rather than rely on questionable
rules of thumb.
The final questionable rule of thumb (Keep CPU utilization below ??%) is
the most common. The ?? value is usually 70 or 80. This rule of thumb overlooks
the fact that it is sometimes very desirable for a computer system to run with
100% CPU utilization. An example is an interactive system that runs these work-
loads at a high priority but also has low priority batch jobs to utilize the CPU
power not needed for interactive work. Rosenberg's three-part rule of thumb
applies here.
Exercise 1.1
Two women on bicycles face each other at opposite ends of a road that is 40 miles
long. Ms. West at the western end of the road and Ms. East at the eastern end start
toward each other, simultaneously. Each of them proceeds at exactly 20 miles per
hour until they meet. Just as the two women begin their journeys a bumblebee flies
from Ms. West's left shoulder and proceeds at a constant 50 miles per hour to Ms.
East's left shoulder, then back to Ms. West, then back to Ms. East, etc., until the
two women meet. How far does the bumblebee fly? Hint: For the first flight
segment we have the equation 50t = 40 - 20t, where t is the time in hours for
the flight segment. This equation yields t = 40/70, or a distance of 50(40/70) = 200/7 =
28.571428571 miles.
Linear Projection
Linear projection is a very natural technique to apply since most of us tend to think
linearly. We believe we'd be twice as happy if we had twice as much money, etc.
Suppose we have averaged the CPU utilization for each of the last 12 months to
obtain the following 12 numbers: {0.605, 0.597, 0.623, 0.632, 0.647, 0.639, 0.676,
0.723, 0.698, 0.743, 0.759, 0.772}. Then we could use the Mathematica program
shown in Table 1.2 to fit a least-squares line through the points; see Figure 1.6 for
the result.
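If you want to experiment without typing in the Table 1.2 program, the built-in
Fit function produces essentially the same least-squares line (the months are
assumed to be numbered 1 through 12):

In[1]:= util = {0.605, 0.597, 0.623, 0.632, 0.647, 0.639,
          0.676, 0.723, 0.698, 0.743, 0.759, 0.772};

In[2]:= Fit[util, {1, x}, x]

Out[2]= 0.568667 + 0.0165385 x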
The least-squares line is the line fitted to the points so that the sum of the
squares of the vertical deviations between the line and the given points is mini-
mized. This is a straightforward calculation with some nice mathematical proper-
ties. In addition, it leads to a line that intuitively looks like a good fit. The
concept of a least-squares estimator was discovered by the great German mathe-
matician Carl Friedrich Gauss in 1795 when he was 18 years old!
One must use great care when using linear projection because data that
appear linear over a period of time sometimes become very nonlinear in a short
time. There is a standard mathematical way of fitting a straight line to a set of
points called linear regression which provides both (a) a measure of how well a
straight line fits the measured points and (b) how much error to expect if we
extend the straight line forward to predict values for the future. We will discuss
these topics and others in the chapter on forecasting.
HP RXForecast Example
Figure 1.7 is an example of how linear regression and forecasting can be done with
the Hewlett-Packard product HP RXForecast/UX. The figure is from page 2-16 of
the HP RXForecast User's Manual for HP-UX Systems. The fluctuating curve is
the smoothed curve of observed weekly peak disk utilization for a computer using
the UNIX operating system. The center line is the trend line which extends beyond
the observed values. The upper and lower lines provide the 90% prediction
interval in which the predicted values will fall 90 percent of the time.
Example 1.1
In this example we show that simulation can be used for other interesting
problems that we encounter every day. The problem we discuss is called the
Monty Hall problem on computer bulletin boards. Marilyn vos Savant, in her
"Ask Marilyn" column, claimed that a game show contestant who is offered the
chance to switch doors should always switch. Was she right? Simulation can
decide! The program and the output from a run of 10,000 trials are shown in Table
1.3.
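Table 1.3 is not reproduced here. The sketch below is a hypothetical stand-in for
such a program (written with the RandomInteger function of newer Mathematica
versions); it exploits the fact that, once the host opens a losing door, the
switching player wins exactly when the first pick was wrong:

montyHall[n_] :=
  Module[{prize, pick, switchWins = 0},
    Do[
      prize = RandomInteger[{1, 3}];    (* door hiding the car *)
      pick = RandomInteger[{1, 3}];     (* contestant's first choice *)
      If[pick != prize, switchWins++],  (* switching wins iff first pick is wrong *)
      {n}];
    N[switchWins/n]]

A call such as montyHall[10000] should return a value near 2/3.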
The best and shortest paper in a mathematics or statistics journal I have seen
about Marilyn's problem is the paper by Gillman [Gillman 1992]. Gillman also
discusses some other equivalent puzzles. In the paper [Barbeau 1993], Barbeau
discusses the problem, gives the history of the problem with many references,
and considers a number of equivalent problems.
We see from the output that, with 10,000 trials, the person who always
switches won 66.7% of the time and someone who never switches won 33.3% of
the time for this run of the simulation. This is good evidence that the switching
strategy will win about two-thirds of the time. Marilyn is right!
Several aspects of this simulation result are common to simulation. In the
first place, we do not get the exact answer of 2/3 for the probability that a contes-
tant who always switches will win, although in this case it was very close to 2/3.
If we ran the simulation again we would get a slightly different answer. You may
want to try it yourself to see the variability.
Don't feel bad if you disagreed with Marilyn. Persi Diaconis, one of the best
known experts on probability and statistics in the world (he won one of the
famous MacArthur Prize Fellowship "genius" awards), said about the Monty
Hall problem, "I can't remember what my first reaction to it was because I've
known about it for so many years. I'm one of the many people who have written
papers about it. But I do know that my first reaction has been wrong time after
time on similar problems. Our brains are just not wired to do probability prob-
lems very well, so I'm not surprised there were mistakes."
Exercise 1.2
This exercise is for programmers only. If you do not like to write code you will
only frustrate yourself with this problem.
Consider the land of Femina where females are held in such high regard that
every man and wife wants to have a girl. Every couple follows exactly the same
strategy: They continue to have children until the first female child is born. Then
they have no further children. Thus the possible birth sequences are G, BG, BBG,
BBBG,.... Write a Mathematica simulation program to determine the average
number of children in a family in Femina. Assume that only single births occur
(no twins or triplets), every family does have children, etc.
In a classic study, Boyse and Warn described a success-
ful use of analytic queueing theory models to evaluate the possible configuration
changes to a computer system. The computer system they were modeling was a
mainframe computer with a virtual memory operating system servicing automo-
tive design engineers who were using graphics terminals. These terminals put a
heavy computational load on the system and accessed a large database. The sys-
tem supported 10 terminals and had a fixed multiprogramming level of three, that
is, three jobs were kept in main memory at all times. The two main upgrade alter-
natives that were modeled were: (a) adding 0.5 megabytes of main memory
(computer memory was very expensive at the time this study was made) or (b)
procuring I/O devices that would reduce the average time required for an I/O
operation from 38 milliseconds to 15.5 milliseconds. Boyse and Warn were able
to show that the two alternatives would have almost the same effect upon perfor-
mance. Each would reduce the average response time from 21 to 16.8 seconds,
increase the throughput from 0.4 to 0.48 transactions per second, and increase the
number of terminals that could be supported with the current average response
time from 10 to 12.
It should be noted that the simulation modeling would have taken a great deal
longer if it had been done using a general purpose simulation modeling system
such as GPSS or SIMSCRIPT. SNAP/SHOT is a special purpose simulator
designed by IBM to model IBM hardware and to accept inputs from IBM
performance data collectors.
1.2.4.7 Benchmarking
Dongarra, Martin, and Worlton [Dongarra, Martin, and Worlton 1987] define
benchmarking as "running a set of well-known programs on a machine to
compare its performance with that of others." Thus it is a process used to evaluate
the performance or potential performance of a computer system for some
specified kind of workload. For example, personal computer magazines publish
the test results obtained from running benchmarks designed to measure the
performance of different computer systems for a particular application such as
word processing, spread sheet analysis, or statistical analysis. They also publish
results that measure the performance of one computer performing the same task,
such as spread sheet analysis or statistical analysis, with different software
systems; this type of test measures software performance rather than hardware
performance. There are standard benchmarks such as Livermore Loops, Linpack,
Whetstones, and Dhrystones. The first two benchmarks are used to test scalar and
vector floating-point performance. The Whetstones benchmark tests basic
floating-point performance, while the Dhrystones benchmark tests integer and
string-handling performance.
1.2.5 Validation
Before a model can be used for making performance predictions it must, of course,
be validated. By validating a model we mean confirming that it reasonably
represents the computer system it is designed to represent.
The usual method of validating a model is to use measured parameter values
from the current computer system to set up and run the model and then to com-
pare the predicted performance parameters from the model with the measured
performance values. The model is considered valid if these values are close. How
close they must be to consider the model validated depends upon the type of
model used. Thus a very detailed simulation model would be expected to perform
more accurately than an approximate queueing theory network model or a statis-
tical forecasting model. For a complex simulation model the analyst may need to
use a statistical testing procedure to make a judgment about the conformity of the
model to the actual system. One of the most quoted papers about statistical
approaches to validation of simulation models is [Schatzoff and Tillman 1975].
Rules of thumb are often used to determine the validity of an approximate queue-
ing theory model. Back-of-the-envelope calculations are valuable for validating
any model. In all validation procedures, common sense, knowledge about the
installed computer system, and experience are important.
Validating models of systems that do not yet exist is much more challenging
than validating a model of an existing system that can be measured and compared
with a model. For such systems it is useful to apply several modeling techniques
for comparison. Naturally, back-of-the-envelope calculations should be made to
verify that the model output is not completely wrong. Simulation is the most
likely modeling technique to use as the primary technique but it should be cross-
checked with queueing theory models and even simple benchmarks. A talent for
good validation is what separates the outstanding modelers from the also-rans.
All system managers should have the first goal; if there were no users there
would be no need for system managers! The second goal has the virtue of being
quantified so that its achievement can be verified. The last goal could probably
qualify as what John Rockart [Rockart 1979] calls a critical success factor. A
system manager who fails to achieve critical success factor goals will probably
not remain a system manager for very long. (A critical success factor is some-
thing that is of critical importance for the success of the organization.)
Deese [Deese 1988] provides some interesting comments on the manage-
ment perspective on capacity planning.
Exercise 1.3
You are the new systems manager of a departmental computer system for a
marketing group at Alpha Alpha. The system consists of a medium-sized
computer connected by a LAN to a number of workstations. Your customers are
a number of professionals who use the workstations to perform their daily work.
The previous systems manager, Manager Manager (he changed his name from
John Smith to Manager Manager to celebrate his first management position), left
things in a chaotic mess. The users complain about:
1. Very poor response timeespecially during peak periods of the day, that is,
just after the office opens in the morning and in the middle of the afternoon.
2. Unpredictable response times. The response time for the same application may
vary between 0.5 seconds and 25 seconds even outside the busiest periods of
the day!
3. The batch jobs that are to be run in the evening often have not been processed
when people arrive in the morning. These batch jobs must be completed before
the marketing people can do their work.
Exercise 1.4
The following service level agreement appears in [Duncombe 1991]:
1. EXPECTATIONS
The party of the first part (AP) agrees to limit their
demands on and use of the services to a reasonable
level.
2. PENALTIES
If either party to this contract breaches the
aforementioned EXPECTATIONS, the breaching party must
buy lunch.
By:
Title:
Witness:
Date:
Diagnostic Tools
Diagnostic tools are used to find out what is happening on your computer system
now. For example, you may ask, "Why has my response time deteriorated from 2
seconds to 2 minutes?" Diagnostic tools can answer your question by telling you
what programs are running and how they are using the system resources.
Diagnostic tools can be used to discover problems such as a program caught in a
loop and burning up most of the CPU time on the system, a shortage of memory
causing memory management problems, excessive file opening and closing
causing unnecessary demands on the I/O system, or unbalanced disk utilization.
Some diagnostic monitors can log data for later examination.
The diagnostic tool we use the most at the Hewlett-Packard Performance
Technology Center is the HP GlancePlus family. Figure 1.8 is from the HP Glan-
cePlus/UX Users Manual [HP 1990]. It shows the last of nine HP GlancePlus/
UX screens used by a performance analyst who was investigating a performance
problem in a diskless workstation cluster.
The owner of the
workstation is a new user on the cluster and does not realize how much memory
is needed.
Profilers report how an application spends its time (CPU
time, disk I/O time, etc.), time spent waiting on locks, etc. With this information
the application can be tuned to perform more efficiently. Unfortunately, program
profilers and other application optimization tools seem to be the Rodney
Dangerfields of software tools; they just don't get the respect they deserve.
Software engineers tend to feel that they know how to make a program efficient
without any outside help. (Donald Knuth, regarded by many, including myself, as
the best programmer in the world, is a strong believer in profilers. His paper
[Knuth 1971] is highly regarded by knowledgeable programmers.) Literature is
limited on application optimization tools, and even computer performance books
tend to overlook them. An exception is the excellent introduction to profilers
provided by Bentley in his chapter on this subject [Bentley 1988]. Bentley
provides other articles on improving program performance in [Bentley 1986].
The neglect of profilers and other application optimization tools is unfortu-
nate because profilers are available for most computers and most applications.
For example, on an IBM personal computer or plug compatible, Borland Interna-
tional, Inc., provides Turbo Profiler, which will profile programs written using
Turbo Pascal, any of Borland's C++ compilers, and Turbo Assembler, as well as
programs compiled with Microsoft C and MASM. Other vendors also provide
profilers, of course. Profilers are available on most computer systems. The pro-
filer most actively used at the Hewlett-Packard Performance Technology Center
is the HP Software Performance Tuner/XL (HP SPT/XL) for Hewlett-Packard
HP 3000 computers. This tool was developed at the Performance Technology
Center and is very effective in improving the running time of application pro-
grams. One staff member was able to make a large simulation program run in
one-fifth of the original time after using HP SPT/XL to tune it. HP SPT/XL has
also been used very effectively by the software engineers who develop new ver-
sions of the HP MPE/iX operating system.
Figure 1.9 displays a figure from page 3-4 of the HP SPT/XL User's Manual:
Analysis Software. It shows that, for the application studied, 94.4% of the
processing time was spent in system code. It also shows that DBGETs, which are
calls to the TurboImage database system, take up 45.1% of the processing time.
As can be seen from the DBGETs line, these 6,857 calls spend only a fraction of
this time utilizing the CPU; the remainder of the time is spent waiting for some-
thing such as disk I/O, database locks, etc. Therefore, the strategy for optimizing
this application would require you to determine why the application is waiting
and to fix the problem.
Application optimization tools are most effective when they are used during
application development. Thus these tools are important for SPE (systems perfor-
mance engineering) activities.
Finally, in the last several years, expert systems for computer performance
evaluation have been developed. As Hood says [Hood 1992]: "The MVS
operating system and its associated subsystems could be described as the most
complex entity ever developed by man." For this reason a number of commercial
expert systems for analyzing the performance of MVS have been developed
including CA-ISS/THREE, CPExpert, MINDOVER MVS, and MVS Advisor.
CA-ISS/THREE is especially interesting because it is one of the earliest
computer performance systems with an expert system component as well as
queueing theory modeling capability.
In his paper [Domanski 1990] Domanski cites the following advantages of
expert systems for computer performance evaluation:
1. Expert systems are often cost effective when human expertise is very costly,
not available, or contradictory.
2. Expert systems are objective. They are not biased to any pre-determined goal
state, and they will not jump to conclusions.
3. Expert systems can apply a systematic reasoning process requiring a very large
knowledge base that a human expert cannot retain because of its size.
4. Expert systems can be used to solve problems when given an unstructured
problem or when no clear procedure/algorithm exists.
From this discussion it is clear that an expert system for a complex operating
system can do a great deal to help manage performance. However, even for
simpler operating systems, an expert system for computer performance analysis
can do a great deal to help manage performance. For example, Hewlett-Packard
recently announced that an expert system capability has been added to the online
diagnostic tool HP GlancePlus for MPE/iX systems. It uses a comprehensive set
of rules developed by performance specialists to alert the user whenever a possible
performance problem arises. It also provides an extensive online help facility
developed by performance experts. We quote from the HP GlancePlus User's
Manual (for MPE/iX Systems):
This says that most experts would agree that the system is
experiencing a problem when interactive users consume more
than 90% of the CPU. Currently, interactive use is 96.4%.
Since the probability is only 75% (not 100%), some additional
situations are not true. (In this case, the number of processes
currently starved for the CPU might not be high enough to
declare a real emergency.)
...
High level analysis can be performed only if the Expert facility
is enabled for high level; use the V command: XLEVEL=HIGH.
After the global analysis in which a problem type was
not normal, the processes that executed during the last interval ...
The last Action line of the preceding display means that the priority should be
changed (QZAP) for process identification number 122, a Pascal compilation
(PASXL). Furthermore, the Log-on of the person involved is Mel.Eelkema, and
his process should be moved from the C queue to the D queue. Mel is a software
engineer at the Performance Technology Center. He said the expert system caught
him compiling in an interactive queue where large compilations are not
recommended.
The expert system provides three levels of analysis: low level, high level,
and dump level. For example, the low level analysis might be:
If we ask for high level analysis of this problem, we obtain more details about the
problems observed and a possible solution as follows:
----------------------------SWITCH Analysis----------------------------
Excessive Mode Switching exists for processes in the D queue.
An excessive amount of mode switching was found for the
following processes:
Check for possible conversion CM to NM or use the OCT program

JSNo  Dev  Logon     Pin  Program  Pri  CPU%   Disc  CM%  MMsw  CMsw
J9    10   FIN.PROD  110  CHECKS   D    16.4%  2.3   0%   533   0
Several professional organizations are active in the field of computer
performance. For example, the IBM Share and Guide organizations have
performance committees.
The professional organization that should be of interest to most readers of
this book is the Computer Measurement Group, abbreviated CMG. CMG holds a
conference in December of each year. Papers are presented on all aspects of com-
puter performance analysis and all the papers are available in a proceedings.
CMG also publishes a quarterly, CMG Transactions, and has local CMG chapters
that usually meet once per month. The address of CMG headquarters is The
Computer Measurement Group, 414 Plaza Drive, Suite 209, Westmont, IL
60559; (708) 655-1812 (voice), (708) 655-1813 (fax).
The Capacity Management Review, formerly called EDP Performance
Review, is a monthly newsletter on managing computer performance. Included
are articles by practitioners, reports of conferences, and reports on new computer
performance tools, classes, etc. It is published by the Institute for Computer
Capacity Management, P.O. Box 82847, Phoenix, AZ 85071, (602) 997-7374.
Another computer performance analysis organization that is organized to
support more theoretically inclined professionals such as university professors
and personnel from suppliers of performance software is ACM Sigmetrics. It is a
special interest group of the Association for Computing Machinery (ACM). Sig-
metrics publishes the Performance Evaluation Review quarterly and holds an
annual meeting. One issue of the Performance Evaluation Review is the proceed-
ings of that meeting. Their address is ACM Sigmetrics, c/o Association for Com-
puting Machinery, 11 West 42nd Street, New York, NY 10036, (212) 869-7440.
6. What is software performance engineering and what are some of the problems
of implementing it?
7. What are the primary modeling techniques used for computer performance
studies?
8. What are the three basic components of any computer system according to
Rosenberg?
9. What are some rules of thumb of doubtful authenticity according to Samson?
10. Suppose you're on a game show and you're given a choice of three doors.
Behind one door is a car; behind the others, goats. You pick a door, say, No. 1,
and the host, who knows what's behind the doors, opens another door, say,
No. 3, which has a goat. He then says to you, "Do you want to pick door
No. 2?" Is it to your advantage to switch your choice?
11. Name two expert systems for computer performance analysis.
1.5 Solutions
Solution to Exercise 1.1
This is sometimes called the von Neumann problem. John von Neumann (1903-1957)
was the greatest mathematician of the twentieth century. Many of those who
knew him said he was the smartest person who ever lived. Von Neumann loved to
solve back-of-the-envelope problems in his head. The easy way to solve the
problem (I'm sure this is the way you did it) is to reason that the bumblebee flies
at a constant 50 miles per hour until the cyclists meet. Since they meet in one hour,
the bee flies 50 miles. The story often told is that, when John von Neumann was
presented with the problem, he solved it almost instantly. The proposer then said,
"So you saw the trick." He answered, "What trick? It was an easy infinite series
to sum." Recently, Bailey [Bailey 1992] showed how von Neumann might have
set up the infinite series for a simpler version of the problem. Even for the simpler
version, setting up the infinite series is not easy.
A similar problem appears in Nancy Blachman's book [Blachman 1992] (I had
not seen Ms. Blachman's solution when I wrote this program).
nancy[n_] :=
  Block[{i, trials, average, k},
    (* trials counts the number of births for each couple. *)
    (* It is initialized to zero. *)
    trials = Table[0, {n}];
    For[i = 1, i <= n, i++,
      While[True,
        trials[[i]] = trials[[i]] + 1;
        If[Random[Integer, {0, 1}] > 0, Break[]]
      ]
    ];
    (* The While statement counts the number of births for couple i. *)
    (* It is set up to test after a pass through the loop so we can *)
    (* count the birth of the first girl baby. *)
    average = Sum[trials[[k]], {k, 1, n}]/n;
    Print["The average number of children is ", average];
  ]
It is not difficult to prove that, if one attempts to perform a task which has
probability of success p on each attempt, then the average number of attempts
until the first success is 1/p. See the solution to Exercise 4, Chapter 3, of [Allen
1990]. Hence we would expect an average family size of 2 children. We see
below that with 1,000 families the program estimated the average number of
children to be 2.007, pretty close to 2!
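This is just the mean of a geometric random variable with success probability p:

$$E[N] = \sum_{k=1}^{\infty} k\,p\,(1-p)^{k-1} = \frac{1}{p},$$

so with p = 1/2 (each birth is a girl with probability 1/2) the expected family size is 2.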
In[8]:= nancy[1000]
The average number of children is 2007/1000
In[9]:= N[%]
Out[9]= 2.007
This answer is very close to 2. Ms. Blachman sent me her solution before her
book was published. I present it here with her problem statement and her permis-
sion. Ever the instructor, she pointed out relative to my solution: "By the way, it is
not necessary to include {0, 1} in the call to Random[Integer, {0, 1}].
Random[Integer] returns either 0 or 1." The statement of her exercise and the solution
from page 296 of [Blachman 1992] follow:
10.3 Suppose families have children until they have a boy. Run a simulation
with 1000 families and determine how many children a family will have on aver-
age. On average, how many daughters and how many sons will there be in a fam-
ily?
makeFamily[] :=
  Block[{children = {}},
    While[Random[Integer] == 0,
      AppendTo[children, girl]
    ];
    Append[children, boy]
  ]
makeFamily::usage = "makeFamily[] returns a list of children."
numChildren[n_Integer] :=
  Block[{allChildren},
    allChildren = Flatten[Table[makeFamily[], {n}]];
    {
      avgChildren -> Length[allChildren]/n,
      avgBoys -> Count[allChildren, boy]/n,
      avgGirls -> Count[allChildren, girl]/n
    }
  ]
numChildren::usage = "numChildren[n] returns statistics on the number of
children from n families."
You can see that Ms. Blachman's programs are very elegant indeed! It is
very easy to follow the logic of her code. Her numChildren program also runs
faster than my nancy program. I ran her program with the following result:
In[9]:= numChildren[1000]//Timing
girl[n_] :=
  Block[{boys = 0},
    (* boys counts, over all n families, the children born *)
    (* before each family's final child *)
    For[i = 1, i <= n, i++,
      While[Random[Integer] == 0, boys = boys + 1]
    ];
    (* each family also has one final child, hence boys + n *)
    Return[N[(boys + n)/n]]
  ]
1. Get the computer system functioning the way it should so that your users can
be more productive.
2. Establish a symbiotic relationship with the users of your computer system,
possibly leading to a service level agreement.
1. Finding the source of the difficulties with response time and the batch jobs not
being run on time. This book is designed to help you solve problems like these.
2. Once the source of the problems is uncovered, the solutions can be under-
taken. We hope this book will help with this, too.
3. You must communicate to your users what the reasons are for their poor ser-
vice in the past and how you are going to fix the problems. It is important to
keep the users apprised of what you are doing to remedy the problems and
what the current performance is. The latter is usually in the form of a weekly or
monthly performance report. The contents and format of the report will depend
upon what measurement and reporting tools are available.
For an excellent example of a service level agreement with notes on what the
terms mean see [Dithmar, Hugo, and Knight 1989].
1.6 References
1. Arnold O. Allen, Probability, Statistics, and Queueing Theory with Computer
Science Applications, Second Edition, Academic Press, San Diego, 1990.
2. Arnold O. Allen, "Back-of-the-envelope modeling," EDP Performance
Review, July 1987, 16.
3. Rex Backman, "Performance contracts," INTERACT, September 1990, 50-52.
4. David H. Bailey, "A capacity planning primer," SHARE 62 Proceedings, 1984.
5. Herbert R. Bailey, "The girl and the fly: a von Neumann legend," Mathemati-
cal Spectrum, 24(4), 1992, 108-109.
6. Peter Bailey, "The ABCs of SPE: software performance engineering," Capac-
ity Management Review, September 1991.
7. Ed Barbeau, "The problem of the car and goats," The College Mathematics
Journal, 24(2), March 1993, 149-154.
2.1 Introduction
In Chapter 1 we listed some of the hardware and software characteristics that had
an effect on the performance of a computer system, that is, on how fast it will
perform the work you want it to do. In this chapter we will consider these
characteristics and some others in more detail. We also consider how these
components or contributors to computer performance are modeled. In addition we
shall attempt to give you a feeling for the relative size of the contributions of each
of these components to the overall performance of a computer system in executing
a workload.
Our first task is to describe how we state a speed comparison between two
machines performing the same task. For example, when someone says machine
A is twice as fast as machine B in performing task X, exactly what is meant? We
will use the definitions recommended by Hennessy and Patterson [Hennessy and
Patterson 1990]. For example, "A is n% faster than machine B" means
$$\frac{\text{Execution Time}_B}{\text{Execution Time}_A} = 1 + \frac{n}{100},$$

where the numerator in the fraction is the time it takes machine B to execute task
X and the denominator is the time it takes machine A to do so. Since we want to
solve for n, we rewrite the formula in the form

$$n = 100 \times \left( \frac{\text{Execution Time}_B}{\text{Execution Time}_A} - 1 \right).$$
To avoid confusion we always set up the ratio so that n is positive, that is, we
talk in terms of "A is faster than B" rather than "B is slower than A." Let us con-
sider an example.
Example 2.1
A Mathematica calculation took 17.36 seconds on machine A and 74.15 seconds
on machine B. Then n = 100 × (74.15/17.36 - 1) ≈ 327, so machine A is about
327% faster than machine B.
Exercise 2.1
We know that machine A runs a program in 20 seconds while machine B requires
30 seconds to run the same program. Which of the following statements is true?
1. A is 50% faster than B.
2. A is 33% faster than B.
3. Neither of the above.
$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}.$$

This formula defines speedup (the first equality) and describes how we calculate
it using Amdahl's law (the second equality). Thus the speedup is two if the new
execution time is exactly one-half the old execution time. Let us consider an example.
Example 2.2
Suppose we are considering a floating-point coprocessor for our computer.
Suppose, also, that the coprocessor will speed up numerical processing by a factor
of 20 but that only 20% of our workload uses numerical processing. We want to
compute the overall speedup from obtaining the floating-point coprocessor. We
see that Fraction_enhanced = 0.2 and Speedup_enhanced = 20, so that

$$\text{Speedup}_{\text{overall}} = \frac{1}{0.8 + \dfrac{0.2}{20}} = 1.234568.$$
Amdahl's law is important in that it shows that, if an enhancement can be used
for only a fraction of a job, then the maximum speedup cannot exceed the
reciprocal of one minus that fraction. In Example 2.2, the maximum speedup is
limited by the reciprocal of 0.8, or 1.25. This also demonstrates the law of dimin-
ishing returns; speeding up the coprocessor to 50 times as fast as the computer
without it will improve the overall speedup very little over the 20 times speedup
(in fact, only from 1.2345679 to 1.2437811, or 0.75%). The only thing that would
really help the speedup would be to increase the fraction of the time that the
enhancement is effective.
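To see where the 1.2437811 figure comes from, substitute the factor-of-50
enhancement into Amdahl's law:

$$\text{Speedup}_{\text{overall}} = \frac{1}{0.8 + \dfrac{0.2}{50}} = \frac{1}{0.804} \approx 1.2437811.$$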
The Mathematica program speedup from the package first.m can be used to
make speedup calculations. The listing of the program follows.
speedup[enhanced_, speedup_] :=
  (* enhanced is the percent of time in enhanced mode *)
  (* speedup is the speedup while in enhanced mode *)
  Block[{frac, speed},
    frac = enhanced / 100;
    speed = 1 / (1 - frac + frac / speedup);
    Print["The speedup is ", N[speed, 8]];
  ]
The Mathematica program speedup can be used to make the calculation in
Example 2.2 as follows:
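With this definition, the call for Example 2.2 (20% of the workload enhanced by
a factor of 20) and its printed output are:

speedup[20, 20]
The speedup is 1.2345679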
The concepts of speedup and "A is n% faster than B" are related but not
equivalent. For example, if machine A is enhanced so that it runs 100% faster for
all its calculations, and the enhanced machine is called machine B, then the
speedup of the enhanced system is 2.0 and machine B is 100% faster than machine A.
The Intel i586 microprocessor (code named the P5 until late October 1992 when
Intel announced that it would be known as the Pentium) was released by Intel in
March 1993. Personal computer vendors introduced and displayed personal com-
puters using the Pentium chip in May 1993 at Comdex in Atlanta. As you are
reading this passage you probably know all about the Pentium (i586) and possi-
bly the i686 or i786. We can be sure that the computers available a year from any
given time will be much more powerful than those available at the given time.
The clock rate can be used to compare, roughly but not exactly, two processors
of exactly the same type, such as two Intel 80486 microprocessors. Thus a
100 MHz Intel 80486 computer would run almost exactly twice as fast as a 50
MHz 80486, if the caches were the same size and speed, they each had the same
amount of main memory of the same speed, etc. However, a computer with a 25
MHz Motorola 68040 microprocessor and the same amount of memory as a com-
puter with a 25 MHz Intel 80486 microprocessor would not be expected to have
the same computing power. The reason for this is that the average number of
clock cycles per instruction (CPI) is not the same for the two microprocessors,
and the CPI itself depends upon what program is run to compute it.
For a given program which has a given instruction count (number of instruc-
tions) or instruction path length (in the IBM mainframe world this is usually
shortened to path length), the CPI is defined by the equation

$$\text{CPI} = \frac{\text{CPU clock cycles for the program}}{\text{Instruction count for the program}}.$$

Thus the CPU time required to execute a program is given by the formula

$$\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}.$$
In this formula, the instruction count depends upon the program itself, the
instruction set architecture of the computer, and the compiler used to generate the
instructions. Thus the CPI depends upon the program, the computer architecture,
and compiler technology. The clock cycle time depends upon the computer
architecture, that is, its organization and technology. Thus, not one of the three
factors in the formula is independent of the other two! We note that the total
CPU time depends very much upon what sort of work we are doing with our
computer. Compiling a FORTRAN program, updating a database, and running a
spreadsheet make very different demands upon the CPU.
At this point you are probably wondering, "Why has nothing been said about
MIPS? Aren't MIPS a universal measure of CPU power?" In case you are not
familiar with MIPS, it means millions of instructions per second.
What is usually left out of the statement of the MIPS rating is what the
instructions are accomplishing. Since computers require more clock cycles to
perform some instructions than others, the number of instructions that can be
executed in any time interval depends upon what mix of instructions is executed.
Thus running different programs on the same computer can yield different MIPS
ratings, so there is no fixed MIPS rating for a given computer. Comparing different
computers with different instruction sets is very difficult using MIPS because a
program could require a great many more instructions on one machine than the
other. One way that people have tried to get around this difficulty is to declare a
certain computer as a standard and compare the time it takes to perform a certain
task against the time it takes to perform it on the standard machine, thus generat-
ing relative MIPS. The machine most often used as a standard 1-MIPS machine
is the VAX-11/780. (It is now widely known that the actual VAX-11/780 speed is
approximately 0.5 MIPS.) For example, suppose program A ran on a standard
VAX-11/780 in 345 seconds but required only 69 seconds on machine B.
Machine B would then be said to have a relative MIPS rating of 345/69 = 5.
There are a number of obvious difficulties with this approach. If program A was
written to run on an IBM 4381 or a Hewlett-Packard 3000 Series 955, it might be
difficult to run the program on a VAX-11/780, so one would probably have to
limit the use of this standard machine to comparisons with other VAX machines.
Even then there would be the question of whether one should use the latest com-
piler and operating system on the VAX-11/780 or the original ones that were used
when the rating was established. Weicker, the developer of the Dhrystone bench-
mark, in his paper [Weicker 1990], reported that he ran his Dhrystone benchmark
program on two VAX-11/780 computers with different compilers. He reported
that on the first run the benchmark was translated into 483 instructions that exe-
cuted in 700 microseconds for a native MIPS rating of 0.69 MIPS. On the second
run 226 instructions were executed in 543 microseconds, yielding 0.42 native
MIPS. Weicker notes that the run with the lowest MIPS rating executed the
benchmark faster.
In his paper Weicker addressed the question, "Why, then, should this article
bother to characterize in detail these stone age benchmarks?" (Weicker is refer-
ring to benchmarks such as the Dhrystone, Whetstone, and Linpack.) He answers
in part:
What Weicker dislikes is that the Dhrystone 1.1 benchmark is run to obtain a
rating in Dhrystones per second. This rating is then divided by 1757 to obtain the
number of relative VAX MIPS. If you read that a computer manufacturer claims a
MIPS rating of, say, 50, with no further explanation, you can be almost certain
that the rating was obtained in this way. Most manufacturers will also provide the
results of the Dhrystone, Whetstone, and other leading benchmarks. As an exam-
ple, I have a 33 MHz 80486DX personal computer. The Power Meter rating for
my PC is 14.652 relative VAX MIPS. Power Meter (a product of The Database
Group, Inc.) is a measurement program used by many PC vendors to obtain the
relative VAX MIPS rating for their IBM PC or compatible computers.
Because of the difficulty in pinning down exactly what MIPS means, it is
sometimes said that MIPS means "Meaningless Indication of Processor Speed."
The only meaningful measure of how fast your CPU can do your work is to
use a monitor to measure how fast it does so. Of course your CPU also needs the
assistance of other computer components such as I/O devices, cache, main mem-
ory, the operating system, etc., and no description of CPU performance is com-
plete without specifying these other components as well. A typical software
performance monitor will measure I/O activity as well as other performance-
related indicators.
Although there is some variability in how long it takes a CPU to perform
even a simple operation, such as adding two numbers, there will be an averaging
effect if you measure the performance of a computer system as it executes a pro-
gram. The main problem is in selecting a program or mix of programs that faith-
fully represent the workload on your system. We discuss this problem in more
detail in the chapter on benchmarking.
Example 2.3
Sam Spade has written a very clever piece of software called SeeItAll that will
monitor the performance of any IBM PC or compatible computer. SeeItAll has
magical properties; it provides any item of performance information that is of
interest to anyone and causes no overhead on the PC measured. Using SeeItAll,
Sam measures the execution of the long Mathematica program
ComputeEverything on his 50 MHz 80486 PC. He finds that
ComputeEverything requires 50 seconds of CPU time and has an instruction
count of 750 million instructions. What is the CPI for ComputeEverything on
Sam's machine? What is the MIPS rating of Sam's machine while running
ComputeEverything?
Solution
The appropriate formula for the calculation is

$$\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}.$$

To simplify the calculation we use Mathematica as follows:
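A direct way to make the computation (a minimal sketch of the session, not
necessarily the book's exact input) is:

cycles = 50 * 50 10^6;      (* CPU cycles: 50 seconds at 50 x 10^6 cycles per second *)
cpi = cycles / (750 10^6)   (* divide by the 750 million instruction count; yields 10/3 *)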
This shows that the CPI is 10/3 clock cycles per instruction. Note that we used the
formula

$$\text{CPI} = \frac{\text{Clock rate} \times \text{CPU time}}{\text{Instruction count}} = \frac{50 \times 10^6 \times 50}{750 \times 10^6} = \frac{10}{3}.$$
The MIPS rating is 750/50 or 15 because 750 million instructions were executed
in 50 seconds. We can make these calculations easier using the Mathematica
program cpu from the package first.m:
cpu[instructions_, MHz_, cputime_] :=
  (* instructions is the number of instructions executed *)
  (* by the cpu in the length of time cputime *)
  Block[{cpi, mips},
    mips = 10^(-6) instructions / cputime;
    cpi = MHz / mips;
    Print["The speed in MIPS is ", N[mips, 8]];
    Print["The number of clock cycles per instruction, CPI, is ", N[cpi, 10]];
  ]
Note that we use the identity CPI = MHz/MIPS. We left out the algebra that
shows that this formula is true, but it follows from the formula

$$\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}.$$
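The omitted algebra is brief. Since the clock cycle time is the reciprocal of the
clock rate,

$$\text{CPI} = \frac{\text{Clock rate} \times \text{CPU time}}{\text{Instruction count}}
\quad\text{and}\quad
\text{MIPS} = \frac{\text{Instruction count}}{10^6 \times \text{CPU time}},$$

so dividing MHz = Clock rate/10^6 by MIPS gives

$$\frac{\text{MHz}}{\text{MIPS}} = \frac{\text{Clock rate} \times \text{CPU time}}{\text{Instruction count}} = \text{CPI}.$$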
The calculations for Example 2.3 using cpu follow:
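With the definition of cpu above, the call and its printed output (output
formatting may vary slightly with Mathematica version) are:

cpu[750 10^6, 50, 50]
The speed in MIPS is 15.
The number of clock cycles per instruction, CPI, is 3.333333333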
Exercise 2.2
Sam Spade's friend Mike Hammer borrows SeeItAll to check the speed of the
prototype of an IBM PC-compatible personal computer that his company is
designing. He runs ComputeEverything in 20 seconds according to SeeItAll.
Unfortunately, Mike doesn't know the speed of the Intel 80486 microprocessor in
the machine. Could it be the 100 MHz microprocessor that everyone is talking
about?
Exercise 2.3
Sam Spade's friend Dick Tracy claims that his company is designing an Intel
80486 clone with a clock speed of 200 MHz that will enable their new personal
computer to execute the program ComputeEverything in 5 seconds flat. What
CPI and MIPS are required for this machine to attain this goal?
The operation of a CPU with pipelining, caching, and other advanced features
is very difficult to model exactly. Fortunately, detailed modeling is not nec-
essary for the purpose of performance management as it would be for engineers
who are designing a new computer system. We need model only as accurately as
we can predict future workloads. The CPU of a computer system can be effec-
tively modeled with a queueing theory model using only the average amount of
CPU service time required to run a representative workload. This number can be
obtained from a software monitor. We discuss measurement considerations in
Chapter 5.
So far we have discussed only uniprocessor systems, that is, computer sys-
tems with one CPU. Many computer systems have more than one processor and
thus are known as multiprocessor systems (What else?). There are two basic
organizations for such systems: loosely coupled and tightly coupled. Tightly cou-
pled systems are more common. This type of organization is used for computer
systems with a small number of processors, usually not more than 8, but 2 or 4
processors are more common. Loosely coupled systems usually have 32 or more
processors. The new CM-5 Connection Machine recently announced by Think-
ing Machines has from 32 to 16,384 processors.
Tightly coupled multiprocessors, also called shared memory multiprocessors,
are distinguished by the fact that all the processors share the same memory.
There is only one operating system, which synchronizes the operation of the pro-
cessors as they make memory and database requests. Most such systems allow a
certain degree of parallelism, that is, for some applications they allow more than
one processor to be active simultaneously doing work for the same application.
Tightly coupled multiprocessor computer systems can be modeled using queue-
ing theory and information from a software monitor. This is a more difficult task
than modeling uniprocessor systems because of the interference between proces-
sors. Modeling is achieved using a load dependent queueing model together with
some special measurement techniques.
Loosely coupled multiprocessor systems, also known as distributed memory
systems, are sometimes called massively parallel computers or multicomputers.
Each processor has its own memory and sometimes a local operating system as
well. There are several different organizations for loosely coupled systems but
the problem all of them have is indicated by Amdahl's law, which says that the
degree of speedup due to the parallel operation is given by

$$\text{Speedup} = \frac{1}{(1 - \text{Fraction}_{\text{parallel}}) + \dfrac{\text{Fraction}_{\text{parallel}}}{n}},$$
where n is the total number of processors. The problem is in achieving a high
degree of parallelism. For example, if the system has 100 processors with all of
them running in parallel half of the time, the speedup is only 1.9802. To obtain a
speedup of 50 requires that the fraction of the time that all processors are operating
in parallel is 98/99 = 0.98989899.
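Note that this is the same formula used by the speedup program from first.m,
with n playing the role of the enhanced-mode speedup, so speedup can check
the first figure:

speedup[50, 100]
The speedup is 1.9801980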
Thinking Machines is the best known company that builds massively parallel
computers. Patterson, in his article [Patterson 1992], says of the latest Thinking
Machines computer:
This is truly an exciting time for computer designers and everyone who uses a
computer will benefit!
There is a great deal of active research on parallel computing systems. The
September/November 1991 issue of the IBM Journal of Research and Develop-
ment is devoted entirely to parallel processing. Gordon Bells paper [Bell 1992]
is an excellent current technology review of the field. The papers [Flatt 1991],
[Eager, Zahorjan, and Lazowska 1989], [Tanenbaum, Kaashoek, and Bal 1992],
and [Kleinrock and Huang 1992] are excellent contemporary research papers on
parallel processing. [Tanenbaum, Kaashoek, and Bal 1992] is an especially good
paper for the software side of parallel computing. The September 1992 issue of
IEEE Spectrum is a special issue devoted to supercomputers; it covers all aspects
of the newest computer architectures as well as the problems of developing soft-
ware to take advantage of the processing power. An update to some of the articles
is provided in the January 1993 issue of IEEE Spectrum, the annual review of
products and applications.
Ideally one would desire an indefinitely large memory capacity such that any
particular word would be immediately available.... We are...forced to recognize
the possibility of constructing a hierarchy of memories, each of which has
greater capacity than the preceding but which is less quickly accessible.
A. W. Burks, H. G. Goldstine, and J. von Neumann
"Preliminary Discussion of the Logical Design of an Electronic Computing
Instrument" (1946)
Figure 2.1 shows the typical memory hierarchy on a computer system; it is valid
for most computers ranging from personal computers and workstations to
supercomputers. It fits the description provided by Burks, Goldstine, and von
Neumann in their prescient 1946 report. The fastest memory, and the smallest in
the system, is provided by the CPU registers. As we proceed from left to right in
the hierarchy, memories become larger, the access times increase, and the cost per
byte decreases. The goal of a well-designed memory hierarchy is a system in
which the average memory access time is only slightly slower than that of the
fastest element, the CPU cache (the CPU registers are faster than the CPU cache
but cannot be used for general storage), with an average cost per bit that is only
slightly higher than that of the lowest cost element.
A CPU (processor) cache is a small, fast memory that holds the most
recently accessed data and instructions from main memory. Some computer
architectures, such as the Hewlett-Packard Precision Architecture, call for sepa-
rate caches for data and instructions. When the item sought is not found in the
cache, a cache miss occurs, and the item must be retrieved from main memory.
This is a much slower access, and the processor may become idle while waiting
for the data element to be delivered. Fortunately, because of the strong locality of
reference exhibited by a program's instruction and data reference sequences,
95% to more than 98% of all requests are satisfied by the cache on a typical sys-
tem. Caches work because of the principle of locality, which is described by
Hennessy and Patterson [Hennessy and Patterson 1990] as follows:
Thus a cache operates as a system that moves recently accessed items and the
items near them to a storage medium that is faster than main memory.
Just as all objects referenced by the CPU need not be in the CPU cache or
caches, not all objects referenced in a program need be in main memory. Most
computers (even personal computers) have virtual memory so that some lines of
a program may be stored on a disk. The most common way that virtual memory
is handled is to divide the address space into fixed-size blocks called pages. At
any given time a page can be stored either in main memory or on a disk. When the
CPU references an item within a page that is not in the CPU cache or in main
memory, a page fault occurs, and the page is moved from the disk to main mem-
ory. Thus the CPU cache and main memory have the same relationship as main
memory and disk memory. Disk storage devices, such as the IBM 3380 and 3390,
have cache storage in the disk control unit so that a large percentage of the time a
page or block of data can be read from the cache, obviating the need to perform a
disk read. Special algorithms and hardware for writing to the cache have also
been developed. According to Cohen, King, and Brady [Cohen, King, and Brady
1989] disk cache controllers can give up to an order of magnitude better I/O ser-
vice time than an equivalent configuration of uncached disk storage.
Because caches consist of small high speed memory, they are very fast and
can significantly improve the performance of computer systems. Let us see, in a
rough sort of way, what a CPU cache can do for performance.
Example 2.4
Jack Smith has an older personal computer that does not have a CPU cache. He
decides to upgrade his machine. The machine he decides is best for him is
available with two different CPU cache sizes. Jack has used a profiler to study the
large program that he uses most of the time. His calculations indicate that with
the smaller of the two CPU caches he will get a cache hit 60% of the time, while
with the larger cache he will get a hit 90% of the time. How much will each of the
caches speed up his processing compared to no cache at all if cache memory has
a speedup of 5 compared to main memory?
Solution
We make the calculations with the Mathematica program speedup as follows:
In[9]:= speedup[60, 5]
The speedup is 1.9230769
In[10]:= speedup[90, 5]
The speedup is 3.5714286
Thus the smaller cache provides a speedup of 1.9230769 while the larger
cache provides a speedup of 3.5714286. It usually pays to obtain the largest cache
offered because the difference in cost for a larger cache is usually small.
CPU caches make it more difficult to analyze benchmark results because
many benchmark programs are so small that they fit into many caches although a
typical program that is run on the system will not fit into the cache. Suppose, for
example, your main application program had 20,000 lines of code and the 80/20
rule applied, that is, 20% of the code accounted for 80% of the execution time.
Thus 4,000 lines of code account for 80% of the execution time. If the cache
could hold 2,000 lines of code, then we would have a 40% hit rate for the CPU
cache, that is, 50% of 80%. According to speedup, this would give us a speedup
of 1.4705882:
In[8]:= speedup[40, 5]
The speedup is 1.4705882
Example 2.5
The fastest memory in an IBM PC or compatible with a 33 MHz Intel 486DX
microprocessor is in the CPU registers, which have access times of about 10 ns.
The next fastest is the primary cache memory on the processor. Most 486 PCs also
have an off-chip cache called the secondary cache. Thus the primary cache is a
cache into the secondary cache, which is a cache for main memory. This double
caching is necessary because main memory speeds have not kept up with CPU
speeds. Caches work because of the principle of locality described earlier. A cache
operates as a system that moves recently accessed items and the items near them
to a storage medium that is faster than main memory. The main memory access
times for personal computers today (June 1993) vary from about 70 ns to 100 ns.
The next level of storage below main memory is virtual storage, that is, hard disk
storage. Hard disks typically have an access time of around 15 ms. This means that
main memory is about 200,000 times as fast as hard disk memory. (On my PC this
ratio is about 204,286.)
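The round figure follows from the quoted access times; with the faster 70 ns
memory,

$$\frac{15 \times 10^{-3}\ \text{s}}{70 \times 10^{-9}\ \text{s}} \approx 214{,}000,$$

which is the order of magnitude claimed (the exact ratio depends on the
particular drive and memory).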
A significant problem with large, fast computers is that of providing sufficient I/O
bandwidth to keep the CPU busy.
Richard E. Matick
IBM Systems Journal 1986
Because of its effect on the overall system throughput and end-user response time,
minimization of DASD response time is a primary objective in the design of a
storage hierarchy.... Long-term trends in processor and DASD technology show
a 10 percent compound increase of the processor and DASD-performance gap.
Significant contributors to DASD performance are based on mechanical rather
than electronic technologies. Therefore, other avenues must be explored to keep
pace with the DASD response time requirements of systems.
Edward I. Cohen, Gary M. King, and James T. Brady
IBM Systems Journal 1989
2.3.1 Input/Output
I/O has been the Achilles heel of computers and computing for a number of years,
although there are some signs of improvement on the horizon. In fact Hennessy
and Patterson, in their admirable book [Hennessy and Patterson 1990] have a
chapter on Input/Output that begins with the paragraph:
IBM refers to disk drives as DASD (for direct access storage devices) and disk
memory is often referred to as auxiliary storage by most authors. PC users usually
refer to their disk drives as hard drives or fixed disks to differentiate them from
their floppy drives, which are used primarily to load new software or to back up
the other drives.
Let us briefly review the characteristics of the most common I/O device on
most computers from PCs and workstations to supercomputers: the magnetic
disk drive. A magnetic disk drive has a collection of platters rotating on a spindle.
The most common rotational speed is 3,600 revolutions per minute (RPM),
although some of the newer drives spin at 6,400 RPM. The platters are metal
disks covered with magnetic recording material on both sides. (Of course, the
floppy drives on PCs have removable plastic disks called diskettes.) Disk drives
have diameters as small as 1.8 inches for subnotebook computers and as large as
14 inches on mainframe drives such as the IBM 3390. (Hewlett-Packard
announced a drive with a diameter of only 1.3 inches in June 1992, with deliveries
beginning in early 1993.)
The top as well as the bottom surface of each platter is used for storage and
is divided into concentric circles called tracks. (On some drives, such as the IBM
3380, the top of the top platter and the bottom of the bottom platter are not used
for storage.) A 1.44-MB floppy drive for a PC has 80 tracks on each surface;
large drives can have as many as 2,200 tracks. Each track is divided into sectors;
the sector is the smallest unit of information that can be read. A sector is 512
bytes on most disk drives. This is approximately the storage required for a half
page of ordinary double-spaced text. A 1.44-MB floppy drive has 18 sectors per
track; the 200-MB disk drive on my PC has 38 sectors on each of the 682 tracks
on each of its 16 surfaces.
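These figures are mutually consistent; multiplying out the geometry of the
200-MB drive gives

$$38 \times 512 \times 682 \times 16 = 212{,}303{,}872\ \text{bytes} \approx 202\ \text{MB}.$$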
To read or write information in a sector, a read/write head, attached to a
movable arm, is located over or under each surface. Bits are magnetically read or
recorded on the track by the read/write head. The arms are connected so that each
read/write head is over the same track of every surface. A cylinder is the set of all
tracks under the heads at a given time. Thus, if a disk drive has 20 surfaces, a cyl-
inder consists of 20 tracks.
Each disk drive has a controller, which begins a read or write operation by
moving the arm to the proper cylinder. This is called a seek; naturally the time
required to move the read/write heads to the required cylinder is called the seek
time. The minimum seek time is the time to move the arm one track; the maxi-
mum seek time is the time to move from the first to the last track (or vice versa). The
average seek time is defined by disk drive vendors as the sum of the time for all
possible seeks divided by the number of possible seeks. However, due to locality
of reference for most applications, in most cases measured average seek time is
25% to 30% of that provided by the vendors. (Sometimes no seek is required and
large seeks are rarely required.) For example, Cohen, King, and Brady [Cohen,
King, and Brady 1989] report: "The IBM 3380 Model K has a rated average seek
time of 16 milliseconds. However, due to the reference pattern to the data, in
most cases the experienced average seek is about 25 to 30 percent of the rated
average seek."
Latency is the delay associated with the rotation of the platters until the
requested sector is located under the read/write head. The average latency (usu-
ally called the latency) is the time it takes to complete a half revolution of the
disk. Since most drives rotate at 3,600 RPM, the latency is usually 8.3 millisec-
onds.
The next component of the disk access time is the data transfer time. This is
the time it takes to move the data from the storage device. It can be calculated by
the formula

$$\text{transfer time} = \frac{\text{number of sectors transferred}}{\text{number of sectors per track}} \times \text{disk rotation time}.$$
For example, the 200-MB disk drive on my PC has 38 sectors per track, each 512
bytes long, for a total track capacity of 19,456 bytes. It rotates at 3,600 RPM and
thus completes a rotation in 16.667 milliseconds or 0.016667 seconds. The time to
transfer one sector of data is thus (1/38) × 16.667 = 0.439 milliseconds. The data
transfer time is usually a small part of the access time. As Johnson says [Johnson
1991]: "For a 4,096-byte block on a 3.0 megabyte per second channel, it takes
approximately 1.3 milliseconds for data transfer, yet performance tuning experts
are happy when an average I/O takes 20 to 40 ms."
As we indicate in Figure 2.2, a string of disk drives is usually connected to
the CPU through a channel and a control unit. Some IBM systems also have mul-
tiple strings connected to control units; each separate string of drives is connected
through a head-of-string device.
Rotational position sensing (RPS) is used for many I/O subsystems. This
technique allows the transfer path (controller, channel, etc.) to be used by other
devices during a drive's seek and rotational latency period. The controller tells
the drive to issue an alert when the desired sector is approaching the read/write
head. When the drive issues this signal, the controller attempts to establish a
communication path to main memory so that the required data transfer can occur.
If communication is established, the transfer is performed, and the drive is avail-
able for further service. If the attempt to connect fails because one or more of the
path elements is busy, the drive must make a full revolution before another
attempt at connection can be made. This additional delay is called an RPS miss.
There are some drives, such as the EMC Symmetrix II system, which have actua-
tor-level buffers that eliminate RPS delay entirely. If a path is not available at the
critical time, the information from the track is read into an actuator buffer. The
information is then transmitted from the buffer when a path is available. This has
the effect of lowering the channel utilization as well.
Some computer systems have alternative channel paths between the disk
drives and the CPU. That is, each disk drive can be connected to more than one
controller, and each controller can be connected to more than one channel. For
these systems an RPS miss occurs only if all the channel paths are busy when the
disk drive is ready to transmit data. On IBM systems this is called dynamic path
selection (DPS) and up to four internal data paths are available for each disk
drive. The DPS facility is sometimes known as floating channels because it
allows a read command to a disk drive to go out on one channel while the data
may be returned on a different channel.
The total disk access time is the sum of the seek time, latency time, transfer
time, controller overhead, RPS miss time, and the queueing time. The queueing
time is the most difficult to estimate and is the sum of the two delays: the initial
delay until the drive is free so that it can be used and the delay until a channel is
free to transmit the I/O commands to the disk. For non-RPS systems there is
another queueing delay for the channel after the seek to place the read/write
heads over the desired cylinder is completed. The channel is required to search
for the sector to be read as well as for the transfer.
Example 2.6
Suppose Superdrive Inc. has announced a super new disk drive with the following
characteristics: Average seek time 20 ms, rotation time 12.5 ms (4,800 RPM), and
150 sectors, each 512 bytes long, per track. Compute the average time to read or
write an 8-sector block of data, assuming no queueing delays, controller overhead
of 2 ms, and no RPS misses.
Solution
The value of 2 ms for controller overhead is a value often used by I/O experts.
Since we have assumed no queueing delays or RPS misses, the average time to
access 8 sectors (4,096 bytes) is the sum of the average seek time, the average
latency (rotational delay), data transfer time, and the controller overhead. We can
safely use 30% of the average seek time provided by Superdrive or 6 ms for the
average seek time. The average latency is 6.25 ms. By the formula we used earlier,
the data transfer time is (8/150) × 12.5 = 0.6667 ms. Hence the average access
time is 6 + 6.25 + 0.6667 + 2 = 14.9167 ms.
Exercise 2.4
Consider the following Mathematica program. Use simpledisk to verify the
solution to Example 2.6.
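Only the name simpledisk is given in the text; a minimal sketch consistent with
the access-time formula of Example 2.6 (the argument names and order are
assumptions) would be:

simpledisk[seek_, rotation_, sectorspertrack_, sectors_, overhead_] :=
  (* seek is the vendor's rated average seek time in ms *)
  (* rotation is the time for one full rotation in ms *)
  (* sectors is the number of sectors transferred *)
  (* overhead is the controller overhead in ms *)
  Block[{seektime, latency, transfer, access},
    seektime = 0.3 seek;          (* measured seek is about 30% of rated *)
    latency = rotation / 2;       (* average rotational delay *)
    transfer = (sectors / sectorspertrack) rotation;
    access = seektime + latency + transfer + overhead;
    Print["The average access time is ", N[access, 6], " ms"];
  ]

With this sketch, simpledisk[20, 12.5, 150, 8, 2] prints an average access time
of about 14.9167 ms, matching Example 2.6.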
While I/O performance has not increased as much per year in recent years as
CPU performance, there have been some substantial improvements in disk perfor-
mance, even on PCs. (Hennessy and Patterson claim the I/O improvement is 4%
to 6% per year compared to 18% to 35% per year for CPU performance.)
ago the average seek time for a PC hard disk was 28 ms or so. The hard disk I
bought for my PC in May 1993 has an average seek time of 13.8 ms. The storage
on this drive cost $1.39 per MB compared to $33.50 per MB for the RAM mem-
ory I bought at the same time. (These prices were about half what I spent for sim-
ilar hardware in late 1991. They are probably even lower as you are reading this.)
Software and even hardware caching is often used on PCs, which further
improves I/O performance. Even with these improvements I/O is still often the
bottleneck.
This morning as I came into my office building I noticed a number of
Hewlett-Packard HP7935 disk drives in the hall that were being replaced. (They
look like the icon in the right margin.) These drives were state-of-the-art for HP
3000 computer systems in 1983 and only five years ago most computer rooms at
Hewlett-Packard installations were full of them. (Some still are.) This drive,
which can store 404 MB of data, is, according to my tape measure, 22 inches
wide, 33 inches deep, and 32 inches high. The drives are usually stacked two
high to produce a stack that is about the size of a phone booth. The average seek
time on these drives is 24.0 ms with an average rotational delay of 11.1 ms. The
drives I saw were replaced by Hewlett-Packard C2202A drives, which are stored
in cabinets with four to each cabinet. These drives are the natural replacement for
the HP7935s because they both use the HPIB interface. (Hewlett-Packard has
higher performance drives, which use the SCSI interface.) Each C2202A drive
can store 670 MB of data, has an average seek time of 17 ms, and an average
latency of 7.5 ms. Thus a cabinet that is much smaller than an HP7935 drive (14.5
in by 27 in by 28 in) can store 2.617 GB of data. The C2202A is a tremendous
improvement over the HP7935 disk drive but not nearly as much improvement as
there has been in CPUs and memories over the period between the two drives.
These studies indicate that the IBM 3330 disks are so much
faster than the IBM 2314s that they can radically change the
productivity of an IBM 360 computerin fact, a good part of
the superior productivity claimed for the IBM 370 may be due
to the faster disks. Using faster disks on an IBM 360 can
reduce the 20% to 30% idle time common for this machine to
less than 10%.
In spite of these revelations, IBM has never had anything but a good reputation for
I/O design. Hennessy and Patterson say:
Naturally, after these reports, IBM continued to improve its I/O performance. IBM
increased the speed and size of its disk drives, added cache memory to the control
units of some drives, and instituted floating channels so that the commands to
read data from a disk drive could go out on one channel but the data retrieved
could be returned on a different channel; hardware determines what channels to
use. One of the biggest improvements was the announcement of the IBM 3090
with expanded storage, which is also referred to in some IBM documents as
extended storage. Expanded storage on the IBM 3090 and later models is not at
all like expanded or extended storage on a personal computer; it is more like a
RAM disk on a PC. Expanded storage on an IBM mainframe is generally regarded
as an ultra-high-speed paging subsystem. When the MVS memory manager
(called RSM for real storage manager although the IBM term for main memory is
central storage) decides to move a page from main memory it can go either to disk
storage (auxiliary memory) or to expanded storage. Similarly, when a page must
be brought into main memory it can come from auxiliary storage or from
expanded storage.
Expanded storage can only be used for 4K block transfers to and from cen-
tral storage. Individual bytes in expanded storage cannot be addressed directly,
and direct I/O transfers between expanded storage and conventional auxiliary
storage cannot occur. The time to resolve a page fault for a page located in
expanded storage can range from 75 to 135 microseconds (no one seems to be
sure about the exact values of these ranges). This compares with an expected
time of 2 to 20 milliseconds to resolve a page fault from auxiliary storage; thus
expanded storage is from about 15 to 265 times as fast as auxiliary storage. There
is also a savings in processor overhead for I/O initiation and the subsequent han-
dling of the I/O completion interrupt.
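The endpoints of that range follow from the extremes of the two quoted ranges:

$$\frac{2\ \text{ms}}{135\ \mu\text{s}} \approx 15 \qquad\text{and}\qquad \frac{20\ \text{ms}}{75\ \mu\text{s}} \approx 267,$$

which is the source of the roughly 15-to-265 comparison above.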
There now seems to be a general perception that MVS I/O problems can be
solved if adequate main and expanded storage is provided. As Beretvas says
[Beretvas 1987]:
Samson [Samson 1992] claims that the MVS I/O problem has been solved for old
applications, but there are some new large applications, now feasible because of
the increased capabilities of the new IBM mainframes and the new releases of
MVS/ESA, that can create I/O performance problems.
In his paper [Artis 1992], Artis explains the evolution of the IBM I/O sub-
system as it has evolved from the initial facilities provided by the IBM System/
360 through the IBM System/390 system operating under MVS/ESA.
Naturally, other computer manufacturers have similar stories to tell about the
evolution of their I/O systems.
As we mentioned earlier, Hewlett-Packard has constantly improved their
disk drives. For example, during 1991 the average seek time was reduced to 12.6
ms for the fastest drives. Most drives now have a latency of 7.5 ms or less and
controller overhead has been lowered to less than 1 ms. In November 1991,
Hewlett-Packard announced the availability of disk arrays, better known as
RAID for Redundant Arrays of Inexpensive Disks (see [Patterson, Gibson, and
Katz 1988]). (We discuss RAID later in this chapter.) In June 1992 Hewlett-Pack-
ard announced a disk drive with 21.4 MB of storage and a disk diameter of 1.3
in., thus becoming the first company to announce such a small disk drive. This
amazing disk drive, called the Kittyhawk Personal Storage Module, is designed
to withstand a system drop of about 3 feet during read/write operation. It spins at
5,400 RPM thus having a latency of 5.56 ms. It has an average seek time of less
than 18 ms, a sustained transfer rate of 0.9 MB/second with a burst data rate of
1.2 MB/second. It has a spinup of approximately 1 second. One model (the one
with 14 MB of storage) has one platter and two heads while the model with 21.4
MB of storage has two platters and three heads. This drive measures 0.4 in by 2
in by 1.44 in and weighs approximately 1 ounce. Delivery of these drives began in
early 1993. In March 1993 Hewlett-Packard announced a second version, the
Kittyhawk II PSM, with a storage capacity of 42.8 MB. It remains the world's
smallest disk drive and can store the equivalent of 28,778 typed pages of infor-
mation.
In spite of the progress it has made with disk drives, Hewlett-Packard has
recognized that the CPU and memory speeds on their computers are improving
more rapidly than disk access speeds and that memory costs are constantly mov-
ing down. Therefore, Hewlett-Packard has improved the performance of I/O-
intensive applications by increasing memory size and using main memory as a
buffer for disk memory.
The HP 3000 MPE/iX operating system uses an improved disk caching
capability called mapped files. The mapped files technique significantly
improves I/O performance by reducing the number of physical I/Os without
imposing additional CPU overhead or sacrificing data integrity and protection.
This technique also eliminates file system buffering and optimizes global mem-
ory management.
Mapped files are based on the operating systems demand-paged virtual
memory and are made possible by the extremely large virtual address space
(MPE/iX provides approximately 281 trillion bytes of virtual address space) on
the system. When a file is opened it is logically mapped into the virtual space.
That is, all files on the system and their contents are referenced by virtual
addresses. Every byte of each opened file has a unique virtual address.
File access performance is improved when the code and data required for
processing can be found in main memory. Traditional disk caching reduces costly
disk reads by using main memory for code and data. HP mapped files and virtual
memory management further improve performance by caching writes. Once a
virtual page is read into memory, it can be read by multiple users without addi-
tional I/O overhead. If it is a data page (HP pages data and instructions sepa-
rately), it can be read and written to in memory without physically writing it to
disk. When the desired page is already in memory, locking delays are greatly
reduced, which increases throughput. Finally, when the memory manager does
write a page back to disk, it combines multiple pages into a single write, again
reducing multiple physical I/Os. The virtual-to-physical address translations to
locate portions of the mapped-in files are performed by the system hardware, so
that operating system overhead is greatly reduced.
In addition, the mapped file technique eliminates file system buffering: since
the memory manager fetches data directly into the user's area, file system
buffers are not needed.
Other computer manufacturers have of course found other ways to improve
their I/O performance. Companies that specialize in disk drives have been
stretching the envelope over the last several years. In 1990, the typical, almost
universal rotational speed of disk drives was 3,600 RPM. This has been increased
to 4,004 RPM, then to 5,400 RPM, and, as we mentioned earlier, in January 1993
Hewlett-Packard announced a drive with a 6,400 RPM spin rate; thus its
latency is only 4.69 ms. It also has 2.1 GB of storage capacity and a diameter of 3
1/2 in. You may be asking, "Why don't the mainframe folks speed up their large
drives, too?" (Some mainframe drives have a diameter of 14 in.) The answer lies
in physics. It is very difficult to keep a large drive from flying apart when it is
spun rapidly. The smaller a drive, the faster it can spin. This is leading to small
drives with very high data densities. By the time you read this paragraph the sta-
tistics of disk drive performance will surely be higher, but the improvements in
disk technology will still be lagging the improvements in CPU and main memory
speeds.
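Since the average rotational latency is the time for half a revolution, the latencies quoted above follow directly from the spin rates. A quick check in Mathematica (the function name is ours):

latencyMs[rpm_] := 60000./(2 rpm)     (* half a revolution, in milliseconds *)
latencyMs /@ {3600, 5400, 6400}       (* {8.33, 5.56, 4.69} *)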
The hottest new innovation in disk storage technology is the disk array, more
commonly denoted by the acronym RAID (Redundant Array of Inexpensive
Disks). The seminal paper for this technology is the paper [Patterson, Gibson,
and Katz 1988]. It introduced RAID terminology and established a research
agenda for a group of researchers at the University of California at Berkeley for
several years. The abstract of their paper, which provides a concise statement
about the technology follows:
Nash [Nash 1993] also reports that currently the price per MB of disk storage for main-
frames is about $5.20, but is expected to drop to approximately $1 per MB within
four years. He also claims that minicomputer and PC disk drives currently sell for
about $3.50/MB and $3.00/MB, respectively, but are expected to drop to $1/MB
by 1997. Nash also provides a list of third-party vendors offering RAID systems
for different platforms. Platforms included are PCs and networks, Macintosh,
UNIX systems, superservers, minicomputers, and mainframes.
Modeling disk I/O can be very easy or very difficult depending upon what
level of detail is necessary for your modeling effort. Recall that the total time to
complete an I/O operation for a traditional disk drive (not RAID) is the sum of
the seek time, latency time, transfer time, controller overhead, RPS miss time for
RPS systems, and the queueing or contention time. All of these are easy to com-
pute except the queueing time and the RPS miss time. For modeling systems with
no I/O performance problems, that is, with few RPS misses and no queueing,
modeling is trivial. Computer systems with I/O problems can often be modeled
using queueing network models. If the I/O problems are very serious it might be
necessary to use simulation modeling or hybrid modeling. For the hybrid model-
ing approach simulation is used to model the I/O subsystem in detail to arrive at
an accurate average I/O access time. This average access time is then used in a
queueing network model as a delay time.
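As a back-of-the-envelope illustration of this sum, the following Mathematica fragment simply adds up the components named above; every numeric value here is assumed for illustration, not measured:

(* all component times in milliseconds; the values are hypothetical *)
seek = 12.; latency = 5.56; transfer = 2.; controller = 1.; rpsMiss = 0.; queueing = 0.;
ioTime = seek + latency + transfer + controller + rpsMiss + queueing   (* 20.56 ms *)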
CPU-I/O-Memory Connection
We have been treating the CPU, I/O, and main memory resources somewhat
independently, almost as though they really were independent, which they aren't.
Of course you must have adequate CPU power to execute a particular workload
within a reasonable time frame and with reasonable response time. (No one can do
a mainframe job with an original 4.77 MHz IBM PC.) On the other hand, the
fastest CPU in the world cannot do much if there is insufficient main memory or
insufficient I/O capability.
As Schardt noted earlier, if you don't have enough main memory, you cannot
fully utilize the processor. The processor will spend a lot of time waiting for I/O
completions.
One of the unmistakable signs of lack of memory is thrashing, that is, pag-
ing that is so excessive that almost nothing else is done by the computer. If you
have attempted to run large Mathematica programs on your PC with insufficient
main memory or not enough swapping memory on your hard drive, you have
probably experienced this phenomenon. Your hard disk activity light will stay on
all the time but there will be almost no indication of new results on your monitor.
There are similar sorts of indications of thrashing that occur on larger machines,
of course.
Not enough main memory (or main/expanded on an IBM mainframe or com-
patible) can also prevent your I/O subsystem from operating properly. Finally,
too little main memory sometimes keeps the multiprogramming level so low that
the CPU is frequently idle when there is work to be done. The multiprogramming
level is low because there is room for only a few programs at a time in main
memory. The CPU also could be idle because all the programs in main memory
are inactive due to page faults or other I/O requests that are pending.
Naturally, a computer system cannot function well if there is not sufficient I/
O capability in the form of disk drives, channels, control units, and I/O caches to
handle the I/O required by the application programs. However, for adequate I/O
performance there must also be sufficient main memory and sufficient CPU pro-
cessor power.
Rosenberg's rules mentioned in Chapter 1 provide some guidelines for deter-
mining the cause of performance problems. Rosenberg's rules [Rosenberg 1991]
are:
1. If the CPU is at 100% utilization or less and the required
work is being completed on time, everything is okay for now.
(But always remember, tomorrow is another day.)
2. If the CPU is at 100% busy and all the required work is not
completed, you have a problem. Begin looking at the CPU
level.
3. If the CPU is not 100% busy, and all work is not completed,
a problem also exists and the I/O and memory subsystems
should be investigated.
Rule 3 conforms to what one would expect; the problem is in the I/O subsystem,
the memory subsystem, or both subsystems. Rule 2 is not so obvious. The problem
is not necessarily that the CPU is underpowered. By checking to see what the CPU
is busy doing you may discover that the CPU is spending too much time on paging
activity. As Rosenberg points out, this means there is a memory problem.
Checking the CPU activity could also show that the I/O subsystem is causing the
problem.
Some peripheral devices need not be modeled because they place no
significant demands on the computer system during prime time, that is, during
the peak periods of the day. These devices include printers, graphic display
devices such as computer monitors, and tape drives. Tape drives are usually
excluded because they are used primarily as backup devices and are used during
off-shift times. It is possible that tape drives need to be modeled as part of the
system if there is a great deal of online logging to tape drives. Similarly, for some
workstations that do very extensive graphical applications such as CAD, the
graphics subsystem must be explicitly modeled. Large printing jobs are usually
done off-line so need not be modeled unless the performance problem is in getting
the printing done on time.
2.4 Solutions
Solution to Exercise 2.1
We see that n = ((30 − 20)/20) × 100 = 50 percent, so that A is 50% faster than B.
The calculation using perform is:
2.666666667
This shows that, if we assume the microprocessor runs at 100 MHz, then the MIPS
has jumped to 37.5 and the CPI has dropped to 8/3. These numbers are similar to
some of those reported by Intel.
A 150 MIPS machine with a CPI of 4/3 would be a remarkable machine in 1993
but the Intel 80586 (renamed the Pentium by Intel) approaches some of these
performance statistics! The first Pentium-based personal computers were
announced by vendors in May 1993. Intel has released two versions of the
Pentium, a 60 MHz version and a 66 MHz version. According to [Smith 1993],
it has been suggested that Intel uses the word Pentium to describe the 80586
because "pent" means five, leading to the suggestion that they should have called
it the "Cinco de Micro."
2.5 References
1. H. Pat Artis, "MVS/ESA: Evolution of the S/390 I/O subsystem," Enterprise
Systems Journal, April 1992, 86–93.
2. Kevin Bachus, Patrick Houston, and Elizabeth Longsworth, "Right as
RAID," Corporate Computing, May 1993, 61–85.
3. Gordon Bell, "ULTRACOMPUTERS: A teraflop before its time," CACM,
August 1992, 27–47.
4. Thomas Beretvas, "Paging analysis in an expanded storage environment,"
CMG 87 Conference Proceedings, Computer Measurement Group, 1987,
256–265.
5. Edward I. Cohen, Gary M. King, and James T. Brady, "Storage hierarchies,"
IBM Systems Journal, 28(1), 1989, 62–76.
6. Elizabeth Corcoran, "Thinking Machines: Hillis & Company race toward a
teraflops," Scientific American, December 1991, 140–141.
7. Peter J. Denning, "RISC architecture," American Scientist, January–February
1993, 7–10.
8. Derek L. Eager, John Zahorjan, and Edward D. Lazowska, "Speedup versus
efficiency in parallel systems," IEEE Transactions on Computers, 38(3),
March 1989, 408–423.
9. Horace P. Flatt, "Further results using the overhead model for parallel sys-
tems," IBM Journal of Research and Development, 36(5/6), September/
November 1991, 721–726.
10. Mark B. Friedman, ed., CMG Transactions, Fall 1991, Computer Measure-
ment Group. Special issue with selected papers on RAID.
11. John L. Hennessy and David A. Patterson, Computer Architecture: A Quan-
titative Approach, Morgan Kaufmann, San Mateo, CA, 1990.
12. Gilbert E. Houtekamer and H. Pat Artis, MVS I/O Subsystems: Configura-
tion Management and Performance Analysis, McGraw-Hill, New York,
1992.
13. Robert H. Johnson, "DASD: IBM direct access storage devices," CMG91
Conference Proceedings, Computer Measurement Group, 1991, 1251–1263.
14. David K. Kahaner and Ulrich Wattenberg, "Japan: a competitive assess-
ment," IEEE Spectrum, September 1992, 42–47.
15. Leonard Kleinrock and Jau-Hsiung Huang, "On parallel processing systems:
Amdahl's law generalized and some results on optimal design," IEEE
Transactions on Software Engineering, 18(5), May 1992, 434–447.
16. Elizabeth Lindholm, "Closing the performance gap: as RAID systems
mature, vendors are tinkering with the architecture to increase performance,"
Datamation, March 1, 1993, 122–126.
17. Lester Lipsky and C. D. Church, "Applications of a queueing network
model for a computer system," Computing Surveys, 1977, 205–222.
18. Richard E. Matick, "Impact of memory systems on computer architecture
and system organization," IBM Systems Journal, 25(3/4), 1986, 274–304.
19. Kim S. Nash, "When it RAIDS, it pours," ComputerWorld, June 7, 1993, 49.
20. David A. Patterson, "Expert opinion: Traditional mainframes and
supercomputers are losing the battle," IEEE Spectrum, January 1992, 34.
21. David A. Patterson, Garth Gibson, and Randy H. Katz, "A case for redundant
arrays of inexpensive disks (RAID)," ACM SIGMOD Conference Proceed-
ings, June 1–3, 1988, 109–116. Reprinted in CMG Transactions, Fall
1991.
22. Tom Quinlan, "HP disk array provides secure storage for servers," Info-
World, May 31, 1993, 30.
23. Jerry L. Rosenberg, "More magic and mayhem: formulas, equations and
relationships for I/O and storage subsystems," CMG91 Proceedings, Com-
puter Measurement Group, 1991, 1136–1149.
24. Stephen L. Samson, private communication, 1992.
25. Richard M. Schardt, "An MVS tuning approach," IBM Systems Journal,
19(1), 1980, 102–119.
26. Gina Smith, "Will the Pentium kill the 486?," PC Computing, May 1993,
116–125.
3.1 Introduction
For all performance calculations we assume some sort of model of the system
under study. A model is an abstraction of a system that is easier to manipulate and
experiment with than the real system, especially if the system under study does
not yet exist. It could be a simple back-of-the-envelope model. However, for more
formal modeling studies, computer systems are usually modeled by symbolic
mathematical models. (An exception is a detailed benchmark in which real people
key in transactions to a real computer system running a real application. Because
of the complications and expense of this procedure, it is rarely done.) We usually
use a queueing network model when thinking about a computer system. The most
difficult part of effective modeling is determining what features of the system
must be included and which can safely be left out. Fortunately, using a queueing
network model of a computer system helps us solve this key modeling problem.
The reason for this is that queueing network models tend to mirror computer
systems in a natural way. Such models can then be solved using analytic
techniques or by simulation. In this chapter we show that quite a lot can be
calculated using simple back-of-the-envelope techniques. These are made possible
by some queueing network laws including Little's law, the utilization law, the
response time law, and the forced flow law. We will illustrate these laws with
examples and provide some simple exercises to enable you to test your
understanding.
We describe the workload intensity for each of the three workload types as
follows: for a terminal workload by the number of active terminals N and the
average think time Z, for a batch workload by the average number of active jobs
N, and for a transaction workload by the average arrival rate of requests.
A delay service center is assumed to have an unlimited number of
servers, that is, sufficiently many servers so that one can always be provided to
an arriving customer. We model terminals as delay servers because we assume
each user has a terminal and does not need to queue up to use it.
A queueing center is somewhat different and represents the most common
service center in a queueing network because customers must compete for ser-
vice with the other customers. If all the servers at the center are busy, arriving
customers join a waiting line to queue (wait) for service. We usually refer to the
waiting line as a queue. CPUs and I/O devices are modeled as queueing service
centers.
The service demands for a single class model are usually given in terms of
Dk, the total service time a customer requires at service center k. (We assume the
service centers are numbered 1, 2,..., K.) Sometimes Dk is defined in terms of the
average service demand Sk per visit to service center k and the average number of
visits Vk that a customer makes to service center k. Then we can write
Dk = Vk × Sk. For example, if the service center is the CPU, we may find that the
average time a job spends at the CPU on a single visit is 0.02 seconds but that, on
the average, 30 visits are required. Then D1 = 30 × 0.02 = 0.6 seconds.
We are following the usual queueing theory terminology of using the word
customer to refer to a service request. For modeling an open computer system
we have in mind a queueing network similar to that in Figure 3.3. In this figure
the customers (requests for service) arrive at the computer center where they
begin service with a CPU burst. Then the customer goes to one of the I/O devices
(disks) to receive some I/O service (perhaps a request for a customer record).
Following the I/O service the customer returns to the CPU queue for more CPU
service. Eventually the customer will receive the final CPU service and leave the
computer system.
We assume that the queueing network representation of a computer system
has C customer classes and K service centers. We use the symbol Sc,k for the
average service time for a class c customer at service center k, that is, for the
average time required for a server in service center k to provide the required ser-
vice to one class c customer. It is the reciprocal of μc,k, the Greek symbol used to
represent the average service rate, that is, the average number of class c customers ser-
viced per unit of time at service center k when the service center is busy. Sup-
pose, for example, that a single workload class computer system has one CPU
and we let k = 1 for the CPU service center. Then, if the average CPU service
requirement is 2 seconds for each customer, we have S1 = 2 seconds and the
average service rate for the CPU is μ1 = 0.5 customers per second.
Some service centers, such as a multiprocessor computer system with sev-
eral CPUs, have multiple servers. It is customary to specify the average service
time on a per-server basis. Thus, if a multiprocessor system has two CPUs, we
specify how long a single processor requires, on the average, to process one cus-
tomer and designate this number as the average service time. For queueing net-
work models we are not as interested in the average service time of a customer
for one visit as we are in the total service demand Dc,k = Vc,k × Sc,k where Vc,k
is the average number of visits a class c customer makes to service center k.
Example 3.1
Suppose the performance analysts at Fast Gunn decide to model their computer
system as shown in Table 3.1 with one CPU and three I/O devices. They decide to
use two workload classes and to number the CPU server as Center 1, with the I/O
devices numbered 2, 3, and 4. Both workloads are terminal workloads. Workload
1 has 20 active terminals and a mean think time of 10 seconds, that is, N1 = 20
and Z1 = 10 seconds. Workload 2 has 15 active terminals and a mean think time
of 5 seconds, that is, N2 = 15 and Z2 = 5 seconds.
In a priority queueing system with a preemptive discipline, a customer receiv-
ing service has its service preempted if an arriving customer has a higher priority.
The preempted customer returns to the head of its priority class to queue for ser-
vice. The interrupted service is continued at the interruption point for preemptive-
resume systems and must be begun from the beginning for preemptive-
repeat systems. Nonpreemptive systems are called head-of-the-line queueing
systems, abbreviated HOL.
In recent years a classless queueing discipline called processor sharing has
been widely used. At a service center with the processor sharing queueing disci-
pline, each customer at the center shares the processing service of the center
equally. Thus a processor sharing service center that can service a single cus-
tomer at the rate of 10 per second services each of 2 customers at the rate of 5 per
second or each of 10 customers at the rate of 1 per second.
There are also performance measures for the individual service centers, as
shown in Table 3.3: Uk, the average utilization at center k; Rk, the average
residence (response) time at center k; Xk, the average throughput of center k; and
Lk, the average number of customers at center k. For example, if we considered the
CPU service center, we might find that the average utilization U1 = 0.78, aver-
age response time R1 = 0.9 seconds, average throughput X1 = 5.6 jobs per second,
and average number at the CPU L1 = 5.04 jobs.
For a multiclass model we may also want a breakdown of response time into the
CPU portion and the I/O portion so that we can determine where upgrading is most
urgently needed. Examples of some of the multiclass performance measures are
shown in Example 3.4.
Similarly, we have service center measures of two types: aggregate or total
measures and per class measures. Thus we may want to know the total CPU utili-
zation as well as the breakdown of this utilization between the different work-
loads.
Let us consider some further applications of Little's law to the closed model
of Figure 3.4. First we consider the CPU by itself, without the queue, to be our
system and suppose the average arrival rate to the CPU, including the flow back
from the I/O devices, is 60 transactions per second while the average service time
per visit of a job to the CPU is 0.01 seconds. Then, by Little's law, the average
number of transactions in service at the CPU is 60 × 0.01 = 0.6. Now let us con-
sider the application of Little's law to the CPU system consisting of the CPU and
the queue for the CPU. Suppose there are 18.6 transactions, on the average, in the
CPU system, including those in the queue. Since the average number at the CPU
itself is 0.6, this means there are 18 in the queue, on the average. Hence, by Lit-
tle's law, the average time in the queue is 18/60 = 0.3 seconds. Thus the average
total time (queueing plus service) a job spends at the CPU for one pass is 0.3 +
0.01 = 0.31 seconds. We can check this value using Little's law for the system. It
yields 18.6/60 = 0.31 seconds. (We must have done it right.)
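These back-of-the-envelope steps are easy to reproduce in Mathematica; the fragment below simply repeats the arithmetic above (the variable name lambda is ours):

lambda = 60;         (* average arrival rate at the CPU, transactions per second *)
lambda 0.01          (* Little's law: 0.6 transactions in service at the CPU *)
18/lambda // N       (* 0.3 seconds average time in the queue *)
18.6/lambda          (* 0.31 seconds total time at the CPU per visit *)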
Example 3.3
Suppose Arnold's Armchairs has an interactive computer system (single
workload) with the characteristics shown in Table 3.5.
Example 3.4
You may be wondering what the performance estimates are for the model we
described in Example 3.1. Unfortunately, this is a rather complex model to solve.
It is one of the models we explain in Chapter 4. However, the Mathematica
program Exact from my book [Allen 1990] (slightly revised) can be used to make
the calculations we show here. A revised form of the program called
MultiCentralServer appears in the paper [Allen and Hynes 1991]. It also can be
used to make the same calculations. The first line of Exact follows:
We see from this line that the first parameter that must be entered, Pop, is a
vector whose components are the number of customers in each class; the next
parameter, Think, is a vector of the think times (recall that a batch workload has a
think time of zero); and the final parameter, Demands, is an array of service
demands. In Example 3.1 we have Pop = {20, 15} because workload class 1 has
20 active terminals and workload class 2 has 15 active terminals. Similarly the
entry for the parameter Think is the vector {10, 5}. The service demands of the
workloads are given in an array in which row 1 provides the service demands for
workload class 1, row 2 the service demands for workload class 2, etc. For this
example it is called Demands and is displayed in the Mathematica session for
Example 3.1 that follows:
In[15]:= Think
Out[15]= {10, 5}
In[16]:= Pop
Out[16]= {20, 15}
The output shows that the CPU is the bottleneck device and is nearly satu-
rated. The second and third disk drives seem to be somewhat heavily utilized
according to the performance rules of thumb commonly used.
Exercise 3.1
Consider Example 3.4. Suppose the computer system is upgraded so that the CPU
is twice as fast and each I/O device is twice as fast as well. Use Exact to calculate
the new values for the performance data.
It is part of the folklore that scientific computing jobs are CPU bound, while business-
oriented jobs are I/O bound. That is, for scientific workloads such as CAD (com-
puter-aided design), FORTRAN compilations, etc., the CPU is usually the bottle-
neck. Workloads that are business oriented, such as database management
systems, electronic mail, payroll computations, etc., tend to have I/O bottlenecks.
Of course, one can always find a particular scientific workload that is not CPU
bound and a particular business system that is not I/O bound, but it is true that
different workloads on the same computer system can have dramatically different
bottlenecks. Since the workload on many computer systems changes during dif-
ferent periods of the day, so do the bottlenecks. Usually, we are most interested in
the bottleneck during the peak (busiest) period of the day.
Example 3.5
Sue Simpson, the lead performance analyst at Sample Systems, measures the
performance parameters of a small batch processing computer system. She finds
that the CPU has the visit ratio V1 = 30 with S1 = 0.04 seconds, the first I/O
device has V2 = 10 and S2 = 0.03 seconds, while the other I/O device has V3 = 5
and S3 = 0.04 seconds. Hence, Sue calculates D1 = 1.2 seconds, D2 = 0.3
seconds, while D3 = 0.2 seconds. She concludes that the bottleneck is the CPU
(the system is CPU bound).
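Sue's arithmetic is a one-line Mathematica computation (the variable names are ours):

visits = {30, 10, 5}; s = {0.04, 0.03, 0.04};
demands = visits s     (* Dk = Vk × Sk gives {1.2, 0.3, 0.2} *)
Max[demands]           (* 1.2 seconds at the CPU, so the CPU is the bottleneck *)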
The response time R satisfies the bounds max[D, N × Dmax − Z] ≤ R ≤ N × D.
Example 3.6
Consider Example 3.5. For this example D = D1 + D2 + D3 = 1.7 seconds and
Dmax = D1 = 1.2 seconds. If we assume the average number of batch programs
in the system is N = 5 (with Z = 0 for a batch workload), the throughput bounds
N/(N × D + Z) ≤ X ≤ min[N/(D + Z), 1/Dmax]
yield 5/8.5 = 0.588235 ≤ X ≤ min[5/1.7, 1/1.2] = 0.833333 jobs per second,
and the response time bounds yield max[1.7, 5 × 1.2 − 0] = 6 ≤ R ≤ 5 × 1.7 = 8.5 seconds.
We have shown the brute force back-of-the-envelope solution you could perform
with a calculator. The solution using the Mathematica program bounds follows.
As we ask you to show in Exercise 4.4, the exact answers are X = 0.831941
jobs per second and R = 6.01004 seconds. At this point you may be thinking, "If
I have a Mathematica program that will compute the exact values of X and R for
me, what good are the bounds?" The bounds are best used for back-of-the-enve-
lope kinds of calculations when you may be away from your workstation or PC.
The bounds are also excellent for validating a model you are developing, espe-
cially if it is a simulation model; simulation models are often difficult to validate.
(Of course, you could use the exact solution obtained with your Mathematica
program here, too.) However, if you develop a simulation model, make a long
run, and have results for X and R that do not fall within the bounds, you know
there is an error somewhere. Conversely, if the results do fall within the bounds
you have some reason for optimism.
Bounds have been developed for multiclass queueing network models but they
are so difficult to calculate that they are of little practical importance.
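For the single class case the asymptotic bounds are easy to code. The sketch below (boundsSketch is our name, not the book's bounds program) returns the bounds on X and R; running it on the Example 3.6 inputs reproduces the numbers above:

boundsSketch[n_, z_, demands_List] :=
 Module[{d = Total[demands], dmax = Max[demands]},
  {{n/(n d + z), Min[n/(d + z), 1/dmax]},   (* lower and upper bounds on X *)
   {Max[d, n dmax - z], n d}}]              (* lower and upper bounds on R *)

boundsSketch[5, 0, {1.2, 0.3, 0.2}]   (* {{0.588235, 0.833333}, {6., 8.5}} *)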
In the first phase a model of the current system is constructed and calibrated
until it represents that system with sufficient
accuracy for the purpose of the study. The current system is called the baseline
system. (The process of determining that the model is a good representation of the
current or baseline system is called validation.)
The second phase is the evaluation phase in which the model is modified to
represent the system under study after planned changes are made to the hardware,
software, and workload. The model is then run to determine the performance
parameters of the modified system. Typically the modified model represents a
computer system with a more powerful CPU, more memory, more I/O capacity,
and (possibly) improved software.
The final phase is the verification phase when the actual new system perfor-
mance is compared to the performance that was predicted during the evaluation
phase. This third phase is often not performed but can be very valuable because it
helps us improve our modeling techniques.
The most critical part of a modeling study is the setting of clear objectives
for the study. Most modeling studies that fail do so because the purpose of the study
was not clearly understood. We recommend that no modeling study be under-
taken without a succinct statement of purpose such as one of the following:
After the objective of the study is decided upon the model construction
phase is begun. The most common case is one in which a current computer sys-
tem must be modeled. Sometimes the model is of a computer system that does
not yet exist, but this is usually the case only for a computer manufacturer who is
designing a new line of equipment. We will assume that a model is to be con-
structed of a current computer system or systems.
As in all modeling, constructing a queueing network model requires that the
modeler decide what are the important features of the system modeled that must
be included in the model and what features do not have a primary effect and can
safely be excluded. The purpose of the model has a big influence here. The model
should include only those system resources and workload components that have a
primary effect on performance and for which parameter values can be obtained.
delayed for six months we will be able to test not only our model but our predic-
tion of future workloads.
Very little can be added to this beautiful statement. The special issue of the ACM
Computing Surveys in which Graham's statement appears was dedicated to
queueing network models of computer system performance; it was published in
September 1978 but contains material that is still relevant.
The best known books on queueing theory, especially as the theory can be
applied to computer systems, are the two volumes by Kleinrock [Kleinrock 1975,
1976]. These two volumes are distinguished by being clearly written and filled
with useful information. Scholars as well as practitioners praise Kleinrock's two
volumes.
In this book we will show you how to use queueing network models of com-
puter systems. We will demonstrate how measured data can be used to construct
the input parameters for the models and how to overcome the pitfalls that some-
times occur. We will provide Mathematica programs to solve the models using
both analytic queueing theory as well as simulation and give you an opportunity
to experiment with the models.
3.7 Solutions
Solution to Exercise 3.1
We use the Mathematica program Exact to obtain the output shown. We halved
all the service requirements in the array Demands. The other parameters were not
changed.
In[5]:= MatrixForm[Demands]
Out[7]= {10, 5}
In[8]:= Exact[Pop, Think, Demands]
3.8 References
1. Arnold O. Allen, Probability, Statistics, and Queueing Theory with Computer
Science Applications, Second Edition, Academic Press, San Diego, 1990.
2. Arnold O. Allen and Gary Hynes, "Solving a queueing model with Mathemat-
ica," The Mathematica Journal, 1(3), Winter 1991, 108–112.
3. G. Scott Graham, "Guest editor's overview: Queueing network models of
computer system performance," ACM Computing Surveys, 10(3), September
1978, 219–224. A special issue devoted to queueing network models of
computer system performance.
4. Leonard Kleinrock, Queueing Systems Volume I: Theory, John Wiley, New
York, 1975.
5. Leonard Kleinrock, Queueing Systems Volume II: Computer Applications,
John Wiley, New York, 1976.
6. Edward D. Lazowska, John Zahorjan, G. Scott Graham, and Kenneth C.
Sevcik, Quantitative System Performance: Computer System Analysis Using
Queueing Network Models, Prentice-Hall, Englewood Cliffs, NJ, 1984.
7. John D. C. Little, "A proof of the queueing formula: L = λW," Operations
Research, 9(3), 1961, 383–387.
4.1 Introduction
In Chapter 3 we discussed queueing network models and some of the laws of such
models, such as Little's law, the utilization law, the response time law, and the
forced flow law. We also considered simple bounds analysis. Also discussed were
the parameters needed to define a queueing network model and the performance
measures that can be calculated for such models. We describe most computer
systems under study in terms of queueing network models. Such models can be
solved using either analytic solution methods or simulation. In this chapter we will
discuss the mean value analysis (MVA) approach to the analytic solution of
queueing network models. MVA is a solution technique developed by Reiser and
Lavenberg [Reiser 1979, Reiser and Lavenberg 1980]. In Chapter 6 we discuss
solutions of queueing network models through simulation.
Although analytic queueing theory is very powerful there are queueing net-
works that cannot be solved exactly using the theory. In their paper [Baskett et al.
1975], a widely quoted paper in analytic queueing theory, Baskett et al. general-
ized the types of networks that can be solved analytically. Multiple customer
classes each with different service requirements as well as service time distribu-
tions other than exponential are allowed. Open, closed, and mixed networks of
queues are also allowed. They allow four types of service centers, each with a
different queueing discipline. Before this seminal paper was published most
queueing theory was restricted to Jackson networks that allowed only one cus-
tomer class and required all service times to be exponential.
The calculations for the single, open class model are shown in Table 4.1.
We assume that the average arrival rate as well as the average service
demands at the service centers are known; thus these are the inputs to the model.
The outputs or performance measures are what we calculate using the formulas in
Table 4.1.
The calculations exhibited in the table can be made using the Mathematica
program sopen from the package work.m, which follows Table 4.1.
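The sopen listing itself is the author's; the single class open formulas are short enough to sketch, however. The fragment below (sopenSketch is a hypothetical stand-in) applies the utilization law Uk = λDk and the open queueing center residence time Rk = Dk/(1 − Uk):

sopenSketch[lambda_, visits_List, s_List] :=
 Module[{d = visits s, u, rk},
  u = lambda d;                  (* utilization law: Uk = lambda Dk *)
  rk = d/(1 - u);                (* residence time at each queueing center *)
  {u, rk, lambda rk, Total[rk]}  (* Uk, Rk, Lk, and the total response time R *)
 ]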
Example 4.1
The analysts at Gopher Garbage feel they can model one of their computer
systems using the single class open model with three service centers, a CPU and
two I/O devices. Their measurements provide the statistics in Table 4.2. Although
not shown in the table, they measured the average arrival rate of transactions to be
0.25 transactions per second.
In[3]:= <<work.m
In[6]:= sopen[0.25, v, s]
It is clear from the output that the first disk is the bottleneck and the cause of
the poor performance. The analysts could approximate the effect of adding
another disk drive like the first drive and splitting the load over the two drives by
using two drives in place of the first drive, each with Vdisk = 40 and Sdisk = 0.03
seconds. We make this change in the following Mathematica session:
In[7]:= sopen[lambda, v, s]
The performance has improved considerably and the new bottleneck appears
to be the third disk drive, that is, the one with the mean service time of 0.028 sec-
onds. The effect of further upgrades can easily be tested.
Exercise 4.1
Consider Example 4.1. Suppose that, instead of replacing the first drive with two
identical drives, Gopher Garbage decides to replace this drive by one that is twice
as fast; that is, by one with a visit ratio of 80 and an average service time of 0.015
seconds. Use sopen to make the performance calculations for the upgraded
system.
Exercise 4.2
Consider Example 4.1 after the new drive has been added; that is, after the first
drive is replaced by two drives. Use sopen to estimate the performance that would
result for the enhanced system if the third drive is replaced by two drives (one new
one), each with a mean service time of 0.028 seconds and with the load split
between them.
Single Class Closed MVA Algorithm. Consider the closed computer system of
Figure 4.2. Suppose the mean think time is Z for each of the N active terminals.
The CPU has either the FCFS or the processor sharing queue discipline with
service demand D1 given. We are also given the service demands of the I/O
devices numbered from 2 to K. We calculate the performance measures as
follows:
Step 1. Set Lk[0] = 0 for k = 1, 2, ..., K.
Step 2. For n = 1, 2, ..., N compute
Rk[n] = Dk(1 + Lk[n − 1]), k = 1, 2, ..., K,
R[n] = R1[n] + R2[n] + ... + RK[n],
X[n] = n / (R[n] + Z),
Lk[n] = X[n] × Rk[n], k = 1, 2, ..., K.
Step 3. Set X = X[N] and R = R[N].
Set the average number of customers (jobs) in the main computer system to
L = X × R. Set server utilizations to Uk = X × Dk, k = 1, 2, ..., K.
We calculated Lk[N] and Rk[N] for each server in the last iteration of Step 2.
This algorithm is implemented by the Mathematica program sclosed.
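As a rough illustration of the same steps (not the book's sclosed listing; mvaClosed is a hypothetical name), the algorithm can be sketched in a few lines of Mathematica:

mvaClosed[nUsers_Integer, z_, demands_List] :=
 Module[{l = 0 demands, rk, r, x},           (* Step 1: Lk[0] = 0 *)
  Do[
   rk = demands (1 + l);                     (* Rk[n] = Dk (1 + Lk[n-1]) *)
   r = Total[rk];                            (* R[n] = sum of the Rk[n] *)
   x = n/(r + z);                            (* X[n] = n/(R[n] + Z) *)
   l = x rk,                                 (* Lk[n] = X[n] Rk[n] *)
   {n, nUsers}];
  {x, r, x r, x demands}                     (* X, R, L, and the utilizations Uk *)
 ]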
The algorithm is actually quite straightforward and intuitive except for the
first equation of Step 2, which depends upon the arrival theorem, stated by
Reiser [Reiser 1981] as follows:
Like all MVA algorithms, this algorithm depends upon Little's law (discussed in
Chapter 3) and the above arrival theorem. The key equation is the first equation of
Step 2, Rk[n] = Dk(1 + Lk[n − 1]), which is executed for each service center. By the
arrival theorem, when a customer arrives at service station k the customer finds
Lk[n − 1] customers already there. Thus the total number of customers requiring
service, including the new arrival, is 1 + Lk[n − 1]. Hence the total time the new
customer spends at the center is given by the first equation in Step 2, if we assume
we needn't account for the service time that a customer in service has already
received. The fact that we need not do this is one of the theorems of MVA! The
arrival theorem provides us with the bootstrap technique needed to solve the
equation Rk[n] = Dk(1 + Lk[n − 1]) for n = N. When n is 1, Lk[n − 1] = Lk[0] = 0,
so that Rk[1] = Dk, which seems very reasonable; when there is only one
customer in the system there cannot be a queue for any device, so the response time
at each device is merely the service demand. The next equation is the assertion that
the total response time is the sum of the times spent at the devices. The last two
equations are examples of the application of Little's law. The final equation
provides the input needed for the first equation of Step 2 for the next iteration and
the bootstrap is complete. Step 3 completes the algorithm by observing the
performance measures that have been calculated and using the utilization law, a
form of Little's law.
Let us illustrate the single class closed MVA model with an example.
Example 4.2
Mellow Memory Makers has an interactive computer system consisting of 50
active terminals connected to a computer system as in Figure 4.2. The
performance analysts at MMM find that they can model this system by the
queueing model described in the preceding algorithm with one CPU and three disk
I/O devices. Their measurements indicate that the average think time is 20
seconds, the mean CPU service demand per interaction is 0.2 seconds, and the
mean service demand per interaction on the three I/O devices is 0.03, 0.04, and
0.06 seconds, respectively. The calculations to apply the model can be made with
sclosed.
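Equivalently, with the hypothetical mvaClosed sketch given earlier, the inputs would be entered as shown below; the values in the comment are those quoted in the text:

mvaClosed[50, 20, {0.2, 0.03, 0.04, 0.06}]
(* X = 2.43623, R = 0.523474, L = 1.2753, U = {0.487247, 0.073087, 0.097449, 0.146174} *)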
We see from the output that the throughput is X = 2.43623 interactions per
second, the mean response time R = 0.523474 seconds, the CPU utilization is
0.487247, and the average number of customers (active inquiries) in the com-
puter system is L = 1.2753. We also see that the CPU is the bottleneck of the
computer system.
Exercise 4.3
Use sclosed to find the performance of the Mellow Memory Makers system of
Example 4.2 if the CPU is upgraded to twice the current capacity but the I/O
devices are retained.
Exercise 4.4
Use sclosed to find the exact solution of the computer system described in
Examples 3.5 and 3.6. Assume the average population of batch jobs is 5.
Help with performance studies is now available from computer manufacturers and others with both measurement and modeling tools. For exam-
ple, Digital Equipment Corporation has announced DECperformance Solution
V1.0, an integrated product set providing performance and capacity management
capabilities for DEC VAX systems running under the VMS operating system.
Hewlett-Packard provides an HP Performance Consulting service to help cus-
tomers with HP 3000 or HP 9000 computer systems solve their performance
problems.
(Here Rc denotes the average system response time for class c customers, and
Rc,k denotes the average class c response time at service center k.)
Example 4.3
The performance analysts at the Zealous Zymurgy brewery feel they can model
one of their computer systems using the multiclass open model with the
parameters given in Table 4.4.
Class c   λc     k   Dc,k
1         1.2    1   0.20
                 2   0.08
                 3   0.10
2         0.8    1   0.05
                 2   0.06
                 3   0.15
3         0.5    1   0.02
                 2   0.21
                 3   0.12
The performance analysts enter the data from Table 4.4 into the program mopen
and obtain their output in the following Mathematica session:
All times in the output of mopen are in seconds. The performance appears to
be excellent! Users from each workload class have an average response time that
is less than one second. The system is well balanced with each service center
almost equally loaded. The second disk drive is loaded slightly higher than the
other service centers, making it the bottleneck. We ask you to use mopen to
determine the effect of replacing the second disk drive by one that is twice as
fast.
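The Table 4.4 numbers also make an easy hand check of the multiclass open formulas, with center utilizations Uk summed over the classes and class response times Rc summed over the centers. The fragment below is our own illustration, not the mopen listing:

lambda = {1.2, 0.8, 0.5};                     (* class arrival rates from Table 4.4 *)
demands = {{0.20, 0.08, 0.10},
           {0.05, 0.06, 0.15},
           {0.02, 0.21, 0.12}};               (* the service demands Dc,k *)
u = lambda . demands                           (* center utilizations {0.29, 0.249, 0.30} *)
Table[Total[demands[[c]]/(1 - u)], {c, 3}]     (* class response times, each under 1 second *)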
Exercise 4.5
Consider Example 4.3. Suppose Zealous Zymurgy decides to replace the second
disk drive by one that is twice as fast. Assuming the current workload, what are
the new values of average response time for each workload class? What would
these numbers be if each workload intensity was doubled after the new disk was
installed?
Example 4.4
Consider Example 4.2. The solution to the original model using the program
sclosed required 0.35 seconds on my workstation as we see from the printout
below.
Suppose we convert to a model with two classes by arbitrarily placing each user
into one of two identical terminal classes. Then we solve the model using Exact
as follows:
We get exactly the same performance statistics as before but it took 18.32 seconds
to run the multiclass model compared to only 0.35 seconds for the single class
model!
Exercise 4.6
Verify that the output of Exact in Example 4.4 does provide the same performance
statistics as the output of sclosed.
Although Approx requires much less computation than the exact
solution, the solution it produces is not usually the exact solution, no matter how
small we make epsilon. However, the solution is usually sufficiently close to the
exact solution for all practical purposes. Thus the approximate algorithm allows
us to model many computer systems that it would not be practical to model using
the exact algorithm. Let us consider some examples. We display the first line of
Approx here so you can see what the inputs are:
Example 4.5
Consider Example 3.4. We show the solutions of that example using Exact and
Approx with an epsilon of 0.001, and Approx with an epsilon of 0.000001. Note
that the exact solution using Exact required 9.45 seconds on my workstation,
Approx with an epsilon of 0.001 required 1.24 seconds, and Approx with an
epsilon of 0.000001 took 1.85 seconds. The calculated performance measures
from Approx changed very little as epsilon was dropped from 0.001 to 0.000001.
The differences in output values between Approx and Exact run from about 2 to
6 percent. This is not as bad as it may first appear because the uncertainty of the
values of input data, especially for predicting input values for future time periods,
is often larger than that.
Out[5]= {10, 5}
Exercise 4.7
The computer performance analysts at Serene Syrup studied one of their computer
systems and found it could be analyzed as a closed system with three workload
classes, two terminal and one batch. Tables 4.5 and 4.6 define the inputs to the
current model. Find the performance statistics for the computer system using
Exact and compare the results to the solution using Approx with an epsilon of
0.01. Also compare the solution times.
c Nc Thinkc
1 5 20
2 5 20
3 9 0
c  k  Dc,k
1  1  0.25
1  2  0.08
1  3  0.12
2  1  0.20
2  2  0.40
2  3  0.60
3  1  0.60
3  2  0.10
3  3  0.12
The first reason is that it is often difficult to determine the
average number of terminals in use or the average number of active batch jobs
directly from the measurement data. In addition, users who project their future
workloads often can predict their future volume of work only in terms of
throughput required, that is, in terms such as the number of transactions per
month or week rather than in the average number of active terminals. It is com-
mon practice for modelers in this situation to replace such a workload by a trans-
action workload with the same throughput and the same service demands as the
original measured workload.
The second reason for using transaction workloads is not very important
since efficient algorithms for approximating closed models exist. An example is
the algorithm we use in Approx.
The third reason is important, too; we illustrate it with an example.
While modeling customer systems with queueing network models at the
Hewlett-Packard Performance Technology Center we discovered that the use of
open (transaction) workloads sometimes causes problems in modeling multiple
class workloads. One would expect a closed workload with a small population to
be poorly represented as an open class because an open class has an infinite pop-
ulation. This expectation is easy to verify. In addition, we found that in using the
approximate MVA mixed multiclass algorithm, significant closed workloads
(that is, workloads with high utilization of some resources) represented as an
open workload class can cause sizable errors in other classes which must com-
pete for resources at the same priority level. We avoid these problems by using a
modified type of closed workload class that we call a fixed throughput class. We
developed an algorithm that converts a terminal workload or a batch workload
into a modified terminal or batch workload with a given throughput. In the case
of a terminal workload we use as input the required throughput, the desired mean
think time, and the service demands to create a terminal workload that has the
desired throughput. We also compute the average number of active terminals
required to produce the given throughput. The same algorithm works for a batch
class workload because a batch workload can be thought of as a terminal work-
load with zero think time. For the batch class workload we compute the average
number of batch jobs required to generate the required throughput.
We present an example that illustrates difficulties that arise in using transac-
tion workloads in situations in which their use seems appropriate. We also show
how fixed throughput classes allow us to obtain satisfactory results. There are
cases, of course, in which the use of transaction workloads to represent batch or
terminal workloads does produce satisfactory results.
Example 4.6
The analysts at Hooch Distilleries have successfully modeled one of their
computer systems using the approximate MVA model with three batch workload
classes and three service centers: a CPU and two I/O devices. The service
demands are shown in Table 4.7. All times are in seconds.
The populations of workload classes A, B, and C are one, two, and one,
respectively. Using this information and that from Table 4.7, the analysts at
Hooch use the Mathematica program Approx to obtain the performance results
shown in Tables 4.8 and 4.9. All times in the tables are in seconds and through-
puts in transactions per second. The Hooch analysts are satisfied that the model
values are a good approximation to the measured values of their system. We treat
them as identical in this example.
c k Dc,k
A CPU 300.0
I/O 1 90.0
I/O 2 60.0
B CPU 90.0
I/O 1 0.6
I/O 2 12.0
C CPU 1800.0
I/O 1 18.0
I/O 2 9
c Xc Rc
A 0.000751 1330.858
B 0.005565 359.369
C 0.000145 6882.214
k Uk Lk
Out[5]= {0, 0}
Out[6]= {1, 2}
Nue is very disappointed with the results. He thought that removing work-
load class C from the system would greatly improve the performance of the sys-
tem in processing workload classes A and B but the CPU is still almost saturated
while the turnaround times for workload classes A and B are down only 23 per-
cent and 26 percent, respectively. Suddenly he realizes that he has not modeled
the workload correctly. The way he modeled the system makes it do more class A
and class B work than the original measured system did. To do the same amount
of work in the same amount of time the model should have the same throughput
rates for each workload class as the measured system. Nue decides to model the
modified system with transaction workloads having the same throughputs as the
original measured system. He decides to validate this model by modeling the cur-
rent system with three transaction class workloads: the first having the same
throughput and service demands as workload class A, the second the same as
workload class B, and the third like workload class C. If the output of this model
predicts performance that is close to the measured values, the model is validated.
He uses the Mathematica program mopen in the Mathematica session that fol-
lows:
Out[4]= {{300, 90, 60}, {90, 0.6, 12}, {1800, 18, 9}}
This output is very different from that in Tables 4.8 and 4.9. The modeled
response time for workload class A has increased by 1,754 percent, for workload
class B by 1,950 percent, and for workload class C by 2,038 percent! The use of
transaction workloads will clearly not work here. It is hard to believe that the
transaction workload model predicts an average response time for workload class
C that is 21.38 times as big as the measured value. The reason for this very large
discrepancy is that a workload class with a small finite population is represented
in this model as a workload class with an infinite population.
If we now run the Mathematica program Fixed, requesting the throughputs
shown in Table 4.8 with the service demands of Table 4.7, we obtain the output
shown:
300.   90    60
90     0.6   12.
1800.  18    9
Out[6]= {0, 0, 0}
Class# ArrivR Pc
------ ----------- ---------
1 0.000751 0.993882
2 0.005565 1.9844
3 0.000145 0.991998
Class# Resp TPut
------ ------------- ----------
1 1323.411353 0.000751
2 356.585718 0.005565
3 6841.362715 0.000145
Note, also, that these numbers are very close to the actual sizes of the original popula-
tions. It might not be clear to you how to use Fixed. To explain how it is used, let
us look at the whole program. In spite of the name, the program will calculate the
performance statistics for ordinary terminal and batch workload classes as well as
fixed workload classes, using the approximation techniques presented in the pro-
gram Approx. Fixed was written by Gary Hynes for our joint paper [Allen and
Hynes 1991]. Some of the notation is slightly different from that used in this
book.
In the first line of the program,
each element of the vector Ac is zero for a terminal or batch class but the desired
throughput for a fixed class. Since we have only fixed classes for this example we
used ArrivalRate, a vector of the desired throughputs, for Ac. Each element of the
vector Nc is blank for fixed classes and the actual population of the class for
terminal or batch classes. For this example we entered { ,, } for Nc because all three
classes were considered fixed classes. The input vector Zc has as component c the
mean think time for the class c workload. The component is zero for batch classes
and the mean think time for terminal classes. The array Dck is an array such that
the element in row c and column k is the service demand of the class c workload
at service center k. Finally, epsilon is the error criterion. We used an epsilon of
0.001 in this example.
The vector Pc is a bit unusual. If component c is a fixed class that component
of Pc is the estimate provided by Fixed of the population Nc of class c. Since all
components in our example are fixed class, the final output is composed of these
estimates. In general, if class c is not a fixed class, component c of Pc is Xc, the
calculated throughput of class c customers. If you see a non-zero number in the
column labeled ArrivR in the output, then the corresponding number in the col-
umn Pc is the estimate provided by Fixed of the population Nc of class c. If the
number in the column labeled ArrivR is zero, then the number in column Pc is
Xc, the calculated throughput of class c customers.
In the Mathematica calculations that follow, Nue uses the Fixed program to
estimate the performance of the current system with workload C removed. He
assumes the currently measured throughput rates for workloads A and B.
Out[7]= {0, 0}
Class# ArrivR Pc
------ ---------- ---------
1 0.000751 0.497256
2 0.005565 0.76552
The predicted performance values seem very reasonable. Note that the
model predicts that 0.497256 class A batch jobs and 0.76552 class B batch jobs
must be in the system on the average. This is the end of Example 4.6.
Exercise 4.8
Consider Example 4.6. Suppose Hooch Distilleries does make the planned change
to the system studied in the example and the performance is very close to that
predicted by Fixed. Use Fixed to predict the response time for class A and class
B workloads if the throughput for each class increases by 20 percent. Assume the
service demands do not change. Use an epsilon of 0.001.
Example 4.7
A small computer system at Symple Symon Sugar has two workload classes, a
terminal class and a batch class with the service demands shown in Table 4.10.
Assume the average think time for the terminal workload is 20 seconds. The size
of the terminal class is 35 and of the batch class is 5. Let us first calculate the
c k Dc,k
1 CPU 0.40
I/O 1 0.12
I/O 2 0.12
2 CPU 20.00
I/O 1 15.00
I/O 2 15.00
Out[7]= {35, 5}
Out[8]= {20, 0}
3 0.788597 0.457408
Analysts at Symple Symon are not happy with this result because they want
the average response time for their terminal customers to be less than 1.5 sec-
onds. They estimate the performance values for a priority system with the termi-
nal workload given priority one and the batch workload priority two as follows:
First they compute the performance values as though the only workload was the
terminal workload using Approx as shown:
Out[10]= {35}
Out[11]= {20}
For this call of Approx the analysts used the original terminal workload
class. The average response time is only 1.39518 seconds and the average
throughput is 1.635882 interactions per second compared to 4.862755 seconds
and 1.407728 interactions per second without priorities. To compute the perfor-
mance of the batch class, we compute the effective demands of the batch work-
load by using the formula
D′c,k = Vc,k × S′c,k = Vc,k × Sc,k / (1 − (U1,k + U2,k + ... + Uc−1,k))
      = Dc,k / (1 − (U1,k + U2,k + ... + Uc−1,k)).
We calculate the performance of the batch workload using Approx and the
effective demands with the following Mathematica session.
In[22]:= U = N[U, 6]
This shows that the response time with priorities for the batch class is
300.734038 seconds with a throughput of 0.016626 jobs per second. The compu-
tation using the Mathematica program Pri that calculates the performance statis-
tics for the system with priorities follows:
Out[27]= {35, 5}
Out[28]= {20, 0}
The output from Pri yields average response times of 1.39518 and
300.738369 seconds, respectively, and throughputs of 1.635882 and
0.016626 jobs per second, respectively. These are almost exactly the values
we calculated with the more indirect approach. Note that these values are only
approximate for two reasons: We used the reduced-work-rate approximation for
calculating the priorities and we used the approximate MVA techniques as well.
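The reduced-work-rate arithmetic is easy to reproduce; the fragment below uses the terminal class throughput and the Table 4.10 demands quoted above (the variable names are ours):

xTerminal = 1.635882;                     (* terminal class throughput with priority *)
dTerminal = {0.40, 0.12, 0.12};           (* terminal class service demands *)
u = xTerminal dTerminal                    (* utilization law: {0.654353, 0.196306, 0.196306} *)
{20.0, 15.0, 15.0}/(1 - u)                 (* effective batch demands, about {57.9, 18.7, 18.7} *)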
Exercise 4.9
Consider Example 4.6. Use Pri to estimate the performance parameters that would
result if the first workload class is given preemptive-resume priority over the
second workload class. Use an epsilon value of 0.0001.
central server model shown in Figure 4.3. This model was developed by Buzen
[Buzen 1971].
The central server referred to in the title of this model is the CPU. The cen-
tral server model is closed because it contains a fixed number of programs N (this
is also the multiprogramming level, of course). The programs can be thought of
as markers or tokens that cycle around the system interminably. Each time a pro-
gram makes the trip from the CPU directly back to the end of the CPU queue we
assume that a program execution has been completed and a new program enters
the system. Thus there must be a backlog of jobs ready to enter the computer sys-
tem at all times. We assume there are K service centers where service center 1 is
the CPU. We assume also that the service demand at each center is known. Buzen
provided an algorithm called the convolution algorithm to calculate the perfor-
mance statistics of the central server model. We provide an MVA algorithm that is
more intuitive and is a modification of the single class closed MVA algorithm we
presented in Section 4.2.1.2.
MVA Central Server Algorithm. Consider the central server system of Figure
4.3. Suppose we are given the mean total resource requirement Dk for each of the
K service centers and the multiprogramming level N. Then we calculate the
performance measures of the system as follows:
Step 1. Set Lk[0] = 0 for k = 1, 2, ..., K.
Step 2. For n = 1, 2, ..., N compute
Rk[n] = Dk(1 + Lk[n − 1]), k = 1, 2, ..., K,
R[n] = R1[n] + R2[n] + ... + RK[n],
X[n] = n / R[n],
Lk[n] = X[n] × Rk[n], k = 1, 2, ..., K.
Step 3. Set X = X[N] and R = R[N].
The central server algorithm is valid for the same reasons that the single
class closed algorithm is valid. It depends upon repeated applications of Little's
law and the arrival theorem. The Mathematica program cent implements the
algorithm. Example 4.8 demonstrates its use.
k Dk
CPU 3.5
I/O 1 3.0
I/O 2 2.0
I/O 3 7.5
k Uk Lk
Example 4.8
The Creative Cryogenics Corporation has a batch computer system that runs only
one application. Actually, it is used for other purposes during the day but runs one
batch application during the evening hours. Priscilla Pridefull, the chief
performance analyst, measures the system and obtains service and performance
numbers. All times are in seconds. The average measured turnaround time was
26.69 seconds with an average throughput of 0.11 jobs per second. The service
demands are shown in Table 4.11, and the utilizations of and number of customers
at each service center are shown in Table 4.12.
After verifying that the output of the central server model run with the mea-
sured data agreed well with the measured performance, using a multiprogram-
ming level of 3, Priscilla decided to use cent to determine what the performance
would be if enough additional main memory were obtained to allow a multipro-
gramming level of 5. (She knows how much memory is needed for the operating
systems and other components of the system as well as how much is needed for
each copy of the batch program.) Her Mathematica run follows the display of the
first line from cent. Note that, as the first line shows, Priscilla enters the multi-
programming level N and the vector of service demands to execute the program.
cent[N_?IntegerQ, D_?VectorQ]:=
In[8]:= Demands
We see that the throughput has increased 15.8% to 0.127406 jobs per second
(458.66 per hour) while the response time has increased 47% to 39.2446 seconds.
We also note that the bottleneck device, the third disk drive, is almost saturated
(the utilization is 0.955546).
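Since the central server model is just the closed model with Z = 0, the earlier hypothetical mvaClosed sketch reproduces these numbers:

mvaClosed[3, 0, {3.5, 3.0, 2.0, 7.5}]   (* R ≈ 26.69 seconds, X ≈ 0.112 jobs per second *)
mvaClosed[5, 0, {3.5, 3.0, 2.0, 7.5}]   (* R ≈ 39.2446, X ≈ 0.127406, third disk U ≈ 0.955546 *)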
Priscilla notes that she must do something about the third I/O device. She
decides to model the system to see how much improvement would result from
splitting the load between the third I/O device and a new identical device. In
addition, her users are complaining that it takes too long to run all their batch
jobs. They need to get them all done before they must turn the computer system
over to the day shift. Priscilla estimates that a throughput of 720 jobs per hour
(0.2 jobs per second) will be required within a year to meet the user requirements.
She uses the program Fixed to decide what multiprogramming level will be
needed to be sure of obtaining a throughput of 0.2 jobs per second. Fixed com-
putes 8.05661 for the average number of batch jobs needed to obtain a through-
put of 0.2 jobs per second, which means that the proper multiprogramming level
is probably 8 but could be 9. In the program call of Fixed, Priscilla uses braces
around 0.2 and 0 (twice), and double braces around the service demands
because Fixed assumes the service demands are given as an array and that Ac,
Nc, and Zc are vectors:
Class# ArrivR Pc
------ --------- ----------
1 0.2 8.05661
After running Fixed she makes the following calculations using Mathemat-
ica to check that, with the new I/O device, she needs enough memory to maintain
a multiprogramming level of 8 as was predicted by Fixed, and that with this mul-
tiprogramming level the requirements are met.
Note that Priscilla modeled the new configuration by setting Demands equal
to {3.5, 3.0, 2.0, 3.75, 3.75} to account for the new I/O device. For multipro-
gramming level 8 the throughput exceeds 0.2 jobs per second.
Note, also, that the central server model does not directly model the CPU and I/O
overhead needed to manage memory. (Analysts sometimes correct for
this by adding a little to the CPU service demand.) In spite of this, the central
server model can be used to model some fairly complex systems. For example, in
their book [Ferrari, Serazzi, and Zeigner 1983] Ferrari et al. used the central
server model to find the optimal multiprogramming level in a large mainframe
virtual memory system, to improve a virtual memory system configuration, for
bottleneck forecasting for a real-time application, and for other studies.
Exercise 4.10
For the final system modeled by Priscilla Pridefull at Creative Cryogenics the
third and fourth I/O devices are still the bottlenecks of the system. Suppose the
two new I/O devices are replaced by faster I/O devices so that the new average
service demands on them are 2.5 seconds. Suppose, also, that enough memory is
added so that the multiprogramming level can be increased to 10. Use cent to
calculate the average throughput and response time of the system. Assume the
system will be run at multiprogramming level 10 until all the jobs are completed.
Although the central server model has been used extensively it has two
major flaws. The first flaw is that it models only batch workloads and only one of
them at a time. That is, it cannot be used to model terminal workloads at all and it
cannot be used to model more than one batch workload at a time. The other flaw
is that it assumes a fixed multiprogramming level although most computer sys-
tems have a fluctuating value for this variable. In the next model we show how to
adapt the central server model so that it can model a terminal or a batch workload
with a multiprogramming level that changes over time. We need only assume that
there is a maximum possible multiprogramming level m.
Example 4.9
Meridian Mappers wants to connect their 30 personal computers together by a
LAN with a powerful file server; the server can be modeled with one CPU and two
I/O devices. Their estimates of the service demands their personal computers will
make on the file server are 0.1, 0.2, and 0.25 seconds, respectively, for the CPU,
I/O device 1, and I/O device 2. Their average think time is estimated to be 20
seconds and the maximum multiprogramming level that can be achieved by the
file server is 5. They hope that this system will provide an average response time
that is less than 1 second with an average throughput of at least 1 interaction per
second. Their modeling of it with online follows:
Center# Utiliz
------- ----------
1 0.144408
2 0.288816
3 0.361021
Exercise 4.11
Suppose Meridian Mappers of Example 4.9 decides to consider a file server that
is half as fast but has I/O devices that are twice as fast, that is, that Demands =
{0.2, 0.1, 0.125}, but that will support a maximum multiprogramming level of 10.
Use online to estimate the performance.
At this point you may be thinking: "You have shown how to model memory
in a computer system with either a single batch workload or a single terminal
workload, although the latter was a bit complicated. Can memory be modeled in
a multiclass workload model?" My answer is a resounding, "Yes, but . . ." There
is no exact model for modeling memory in a computer system with multiple
workload classes. However, comprehensive (and expensive) modeling packages
such as Best/1 MVS and MAP do model such systems. The bad news about this
is that the models are very complex as well as proprietary. At the Hewlett-Pack-
ard Performance Technology Center, Gary Hynes has added the capability of
modeling memory in multiclass computer systems with hundreds of lines of C++
code. In principle I could translate the code to Mathematica, but in practice I can-
not. There is no easy way to build a queueing model that can model memory in a
multiclass computer system but you can buy a package that will do so. Calaway
[Calaway 1991] mentioned that he modeled memory with Best/1 MVS but was
unable to do so with the simulation package SNAP/SHOT. Some of his com-
ments follow:
change, and the CPU busy went from 73.1 to 72.6 (a difference
of 0.5 percent) at the low end and from 93.2 to 93.0 (a differ-
ence of 0.2) at the high end. See Figure. This would indicate
that our system was not memory constrained.
4.3 Solutions
Solution to Exercise 4.1
We made the calculations with the following Mathematica session:
In[7]:= sopen[0.25, v, s]
This output shows that better performance results from replacing the slow
disk with a fast disk than from adding a second slow disk and splitting the load
between the two. This is a well-known result from queueing theory.
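The underlying arithmetic is easy to check. In an open model each disk behaves as
an M/M/1 queue, so a disk of service rate 2μ serving the whole arrival stream λ
gives R = 1/(2μ − λ), while two disks of rate μ that split the load each see λ/2
and give R = 1/(μ − λ/2) = 2/(2μ − λ), exactly twice as long. A throwaway check
(the rates here are illustrative, not taken from the exercise):

  rFast[lambda_, mu_] := 1/(2 mu - lambda)     (* one disk, twice as fast *)
  rSplit[lambda_, mu_] := 1/(mu - lambda/2)    (* two slow disks, split load *)
  {rFast[0.25, 0.2], rSplit[0.25, 0.2]}        (* {6.667, 13.333} *)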
Out[8]= 0.25
In[11]:= sopen[lambda, v, s]
Adding another drive has certainly improved the performance but the perfor-
mance of this system is not as good as that of the system in Exercise 4.1.
In[9]:= Demands
As the output shows, the average response time has dropped from 0.523474
seconds to 0.278343 seconds, and the number of interactions in process has
dropped from 1.2753 to 0.686306, both of which are significant improvements,
although the throughput has increased only from 2.43623 interactions per second
to 2.46568 interactions per second, a very minor improvement.
In[8]:= Demands
Thus the system mean response time is slightly larger than the lower bound
we calculated in Example 3.6, and the system mean throughput is about halfway
between the lower and upper bounds.
In[12]:= MatrixForm[Demands]
The new average response time for each class is 0.447038 seconds (0.531),
0.238551 seconds (0.365), and 0.378384 seconds (0.479), respectively, where the
number in parentheses is the value with the slower drive. The improvements are
significant but not spectacular.
The performance calculation with the new drive but doubled workload inten-
sities follows:
We see that the new average response times for the three classes are
0.706982 seconds, 0.345712 seconds, and 0.55166 seconds, respectively. We get
excellent response times with twice the load. Perhaps the system is overconfig-
ured!
The last two columns in the output of each program are identical. These rep-
resent the total number of customers and the total utilization, respectively, at the
service centers. sclosed also provides the residence (response) time at each of the
service centers. We do not provide this information as output in Exact because it
is not very meaningful for a multiclass model (OK, I know you may think that the
performance statistics are not exactly the same with this left out, and you are
probably right). sclosed prints out the average response time, which is 0.523474.
This agrees with the average response time of each class in the output of Exact.
sclosed also provides the average throughput, 2.43623 customers per second. In
the output of Exact we give two numbers for this, one for each class. These num-
bers are both 1.21812 so their sum is 2.43624. The third number in the output of
sclosed is 1.2753, the total number of customers in the system. This agrees with
the sum of the elements of the next-to-last column in the output from both
sclosed and Exact.
The solution using Approx is accurate enough for most practical purposes
and was generated in much less time.
In[15]:= Demands
In[16]:= ArrivalRate
In[18]:= Think
Out[18]= {0, 0}
Class# ArrivR Pc
------ ------------ --------
1 0.000901644 0.65881
2 0.00667807 1.0057
From the output we see that RA = 730.677 seconds and RB = 150.598 sec-
onds. Thus the response time for workload class A has increased by only 10.35
percent and that of workload B by 9.46 percent. The CPU is the bottleneck and
has reached a utilization of 0.87152.
Out[6]= {10, 5}
We see that the performance of the first workload class improves consider-
ably. The average response time drops from 10.35 seconds to 3.569473 seconds
while the average throughput increases from 0.98276 interactions per second to
1.473897 interactions per second. This improvement for the first workload class
leads to poorer performance for the second workload class for which the average
response time increases from 8.18 to 21.16 seconds, while the average through-
put declines from 1.13 interactions per second to 0.573333 interactions per sec-
ond.
In[9]:= Demands
Center# Utiliz
------- ----------
1 0.292039
2 0.14602
3 0.182525
4.4 References
1. Arnold O. Allen, Probability, Statistics, and Queueing Theory with Computer
Science Applications, Second Edition, Academic Press, San Diego, 1990.
2. Arnold O. Allen and Gary Hynes, "Solving a queueing model with Mathematica,"
Mathematica Journal, 1(3), Winter 1991, 108–112.
3. Arnold O. Allen and Gary Hynes, "Approximate MVA solutions with fixed
throughput classes," CMG Transactions (71), Winter 1991, 29–37.
4. Forest Baskett, K. Mani Chandy, Richard R. Muntz, and Fernando G. Palacios,
"Open, closed, and mixed networks of queues with different classes of
customers," JACM, 22(2), April 1975, 248–260.
5. Jeffrey P. Buzen, "Queueing network models of multiprogramming," Ph.D.
dissertation, Division of Engineering and Applied Physics, Harvard University,
Cambridge, MA, May 1971.
6. James D. Calaway, "SNAP/SHOT VS BEST/1," Technical Support, March
1991, 18–22.
5.1 Introduction
In this chapter we examine the measurement problem and the problem of
parameterization. The measurement problem is, "How can I measure how well my
computer system is processing the workload?" We assume that you have one or
more measurement tools available for your computer system or systems. We
discuss how to use your measurement tools to find out what your computer system
is doing from a performance point of view. We also discuss how to get the data
you need for parameterizing a model. In many cases it is necessary to process the
measurement data to obtain the parameters needed for modeling.
nanosecond range while software monitors usually use a system clock with milli-
second resolutions, and (3) higher sampling rates (we discuss sampling later).
The overwhelming disadvantage for most installations is the high cost and the
need for special expertise to use a hardware monitor effectively. Most readers of
this book will not be concerned with hardware monitors.
There are other detailed classifications of performance monitors but we
restrict our discussion to software monitors because they are the concern of
almost all performance managers. The three most common types of software
monitors are diagnostic monitors (sometimes called real-time or troubleshooting
monitors), monitors for studying long-term trends (sometimes called historical
monitors), and job accounting monitors for gathering chargeback information. These three
types can be used for monitoring the whole computer system or be specialized for
a particular piece of software such as CICS, IMS, or DB2 on an IBM mainframe.
There are probably more specialized monitors designed for CICS than for any
other software system.
The uses for a diagnostic monitor include the following:
To accomplish these uses a diagnostic monitor should first present you with
an overall picture of what is happening on your system plus the ability to focus
on critical areas in more detail. A good diagnostic monitor will provide assis-
tance to the user in deciding what is important. For example, the monitor may
highlight the names of jobs or processes that are performing poorly or that are
causing overall systems problems. Some diagnostic monitors have expert system
capabilities to analyze the system and make recommendations to the user.
A diagnostic monitor with a built-in expert system can be especially useful
for an installation with no resident performance expert. An expert system or
adviser can diagnose performance problems and make recommendations to the
user. For example, the expert system might recommend that the priority of some
jobs be changed, that the I/O load be balanced, that more main memory or a
faster CPU is needed, etc. The expert system could reassure the user in some
cases as well. For example, if the CPU is running at 100% utilization but all the
interactive jobs have satisfactory response times and low priority batch jobs are
running to fully utilize the CPU, this could be reported to the user by the expert
system.
Uses for monitors designed for long-term performance management include
the following:
As Merrill points out, SMF information is also used for computer performance
evaluation.
Accounting monitors, such as SMF, generate records at the termination of
batch jobs or interactive sessions indicating the system resources consumed by
the job or session. Items such as CPU seconds, I/O operations, memory residence
time, etc., are recorded.
Two software monitors produced by the Hewlett-Packard Performance Tech-
nology Center are used to measure the performance of the HP-UX system I am
using to write this book. HP GlancePlus/UX is an online diagnostic tool (sometimes
called a troubleshooting tool) that monitors ongoing system activity. The
HP GlancePlus/UX Users Manual provides a number of examples of how this
monitor can be used to perform diagnostics, that is, determine the cause of a per-
formance problem. The other software monitor used on the system is HP
LaserRX/UX. This monitor is used to look into overall system behavior on an
ongoing basis, that is, for trend analysis. This is important for capacity planning.
It is also the tool we use to provide the information needed to parameterize a
model of the system.
There are two parts to every software monitor: the collector, which gathers the
performance data, and the presentation tools, which present the data in a
meaningful way. The presentation tools usually process the raw data to put it into
a convenient form for presentation. Most early monitors were run as batch jobs
and the presentation was in the form of a report, which also was generated by a
batch job. While the collectors for long-range monitors are batch jobs, most
diagnostic monitors collect performance data only while the monitor is activated.
The two basic modes of operation of software monitors are called event-
driven and sampling. Events indicate the start or the end of a period of activity or
inactivity of a hardware or software component. For example, an event could be
the beginning or end of an I/O operation, the beginning or end of a CPU burst of
activity, etc. An event-driven monitor operates by detecting events. A sampling
monitor operates by testing the states of a system at predetermined time intervals,
such as every 10 ms. A sampling monitor would find the CPU utilization by
checking the CPU every t seconds to find out if it is busy or not. Clearly, the
value of t must be fairly small to ensure the accuracy of the measurement of CPU
utilization; it is usually on the order of 10 to 15 milliseconds. A small value of t
means sampling occurs fairly often, which increases sampling overhead. CPU
sampling overhead is typically in the range of 1 to 5 percent, that is, the CPU is
used 1 to 5 percent of the time to perform the sampling. Ferrari et al. in Chapter 5
of [Ferrari, Serazzi, and Zeigner 1983] provide more details about sampling over-
head.
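To see how sampling works, consider the following sketch (mine, not from the book),
which samples a hypothetical CPU that alternates between exponentially distributed
busy and idle periods; with mean busy and idle times of 30 and 70 ms the true
utilization is 0.3:

  SeedRandom[7];
  expon[m_] := -m Log[Random[]]        (* exponential variate with mean m *)
  busy = 30.; idle = 70.; t = 10.;     (* period means and sample interval, ms *)
  time = 0.; state = 0;                (* state 1 = busy, 0 = idle *)
  next = expon[idle];                  (* time of the next state change *)
  hits = 0;
  Do[time += t;                        (* advance to the next sample instant *)
     While[next < time,
       state = 1 - state;
       next += expon[If[state == 1, busy, idle]]];
     If[state == 1, hits++],
     {20000}];
  N[hits/20000]                        (* sampled utilization, close to 0.3 *)

Shortening t improves the estimate but, on a real system, raises the sampling
overhead discussed above.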
Software monitors are very complex programs that require an intimate
knowledge of both the hardware and operating system of the computer system
being measured. Therefore, a software monitor is usually purchased from the
computer company that produced the computer being monitored or a software
performance vendor such as Candle Corporation, Boole & Babbage, Legent,
Computer Associates, etc. For more detailed information on available monitors
see [Howard Volume 2].
If you are buying a software monitor for obtaining the performance parame-
ters you need for modeling your system, the properties you should look for
include:
1. Low overhead.
2. The ability to measure throughput, service times, and utilization for the major
servers.
3. The ability to separate the workload into homogeneous classes with demand
levels and response times for each.
4. The ability to report metrics for different types of classes such as interactive,
batch, and transaction.
5. The ability to capture all activity on the system, including operating system
overhead.
6. Sufficient detail to detect anomalous behavior (such as a runaway process)
that indicates atypical activity.
7. Support for long-term trending via low-volume data.
8. Good documentation and training provided by the vendor.
9. Good tools for presenting and interpreting the measurement results.
Example 5.1
The performance analysts at Black Bart measure their MVS system over a period
of 4,500 seconds with RMF and find that the measured total CPU time is 2,925
seconds so the average CPU utilization over the period is 2,925/4,500 = 0.65 or 65
percent. However, the total CPU time reported for the two workload classes, wk1
and wk2, is 1,800 seconds and 675 seconds, respectively. Since these numbers add
up to 2,475 seconds, 450 seconds are not accounted for and thus must be assumed
to be overhead. If the analysts do not know the capture ratios for the two workload
classes, the usual procedure is to assign the overhead proportionally, that is, assign
(1,800/(1,800 + 675))(450) ≈ 327 seconds to wk1 and the other 123 seconds to
wk2. Then, over the 4,500-second interval wk1 has (1,800 + 327)/4,500 ≈ 0.47 or
47 percent CPU utilization and wk2 has (675 + 123)/4,500 ≈ 0.18 or 18 percent
CPU utilization. This means the effective capture ratio for each class is 0.55/0.65 ≈ 0.85.
On the other hand, if the Black Bart performance analysts had previously found
that the capture ratio for wk1 was approximately 0.9 and for wk2 it was 0.85, then
they would assign 1,800/0.9 = 2,000 CPU seconds to wk1 and 675/0.85 = 794
seconds to wk2 even though the sum is not exactly 2,925 seconds. According to
Bronner [Bronner 1983], if the sum of all the CPU times estimated from the use
of capture ratios is within 10 percent of the actual CPU utilization, the CPU
estimates are acceptable. Here the error is only 4.48 percent.
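The proportional-assignment arithmetic is easy to reproduce in Mathematica (the
variable names here are mine):

  In[1]:= wk = {1800., 675.};
  In[2]:= overhead = 2925 - Apply[Plus, wk]
  Out[2]= 450.
  In[3]:= alloc = wk + overhead wk/Apply[Plus, wk]
  Out[3]= {2127.27, 797.727}
  In[4]:= alloc/4500
  Out[4]= {0.472727, 0.177273}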
Monitors are able to accumulate huge amounts of data. It is important to
have facilities for reducing and presenting this data in an understandable format.
One of the most common ways of presenting information, such as global CPU
utilization, is by means of graphs showing the evolution of the measurement(s)
over time. In Figure 5.1 we can see parts of a couple of graphs and a display table
from a software monitor. The table shows that at 11 am on April 3, 1991, the
application called system notes was consuming 17.5 percent of the CPU on the
HP-UX system being monitored by HP LaserRX/UX. The reason for displaying
the very detailed table was that the graph above it indicated that the Global Sys-
tem CPU Utilization was very high at 11 am on April 3. The use of this graph in
turn was triggered by the study of the Global Bottlenecks graph. Thus in using
monitors one normally proceeds from the general to the specific.
Although Steps 1 and 7 are very important, these steps tend to be the most
neglected.
Failure to specify carefully the purpose of a modeling study is an almost
surefire guarantee of failure. The purpose of the study colors the measurements
taken, the method of analysis, the assumptions made, the resources used, the
reports to management, and other considerations too numerous to catalog.
An example of the purpose of a modeling study is: Can the workloads run-
ning on two separate Hewlett-Packard HP 3000 Series 980/100 uniprocessors be
combined to run on one HP 3000 Series 980/300 multiprocessor? The Series
980/300 has three processors and is rated as roughly 2.1 times as powerful as a
Series 980/100. To answer this question, the hardware and software of the three
computers in question must be completely specified, the workloads carefully
defined, and the performance criteria for measuring whether or not the combined
workload can run on one Series 980/300 must be chosen.
Step 7 in the modeling paradigm is an opportunity to learn from the study. If
the predicted performance of the modified system is quite different from the
actual measured performance, it is important to find out why. Often the differ-
ence is due to errors in predicting the load on the modified system. For example,
it might have been necessary to schedule work on the modified system that had
not been anticipated. It may have been due to modeling errors. If this is true, it is
important to correct the errors so that future modeling studies can be improved.
For Step 2 we must decide what measurement period to use for the model.
Analysts usually choose a peak period of the day, week, or month since this is
when performance problems are most likely to exist. The length of the measure-
ment interval is also very important because of the problem of end effects. End
effects are measurement errors caused because some of the customers are pro-
cessed partly outside the measurement interval. Longer intervals have less error
from end effects than shorter intervals. Intervals of 30 to 90 minutes are typical
choices because they are long enough to keep end effects under control and short
enough to keep the amount of data manageable.
For each workload class c and service center k, the parameter Dc,k is the service
demand, that is, the total service time required at center k by a customer of class c.
Some modeling software has the capability of automating the parameteriza-
tion of the model. However, the person running the modeling package must still
get involved in the validation process, which can lead to changes in the modeling
setup. Two modeling packages that have the automated modeling capability are
Best/1 MVS from BGS Systems and MAP from Amdahl Corporation. Both
model IBM mainframes running the MVS operating system.
Best/1 MVS uses the CAPTURE/MVS data reduction and analysis tool. By
combining data from two standard measurement facilities (RMF and SMF),
CAPTURE/MVS reports contain both system-wide use of hardware resources
and workload specific performance measures. In addition, CAPTURE/MVS also
automatically produces input to BEST/1 MVS. By using the AUTO-CAPTURE
facility, new or infrequent users need not learn the command syntax and associ-
ated JCL statements and thus save a lot of time and effort.
For MAP users the automated method uses the OBTAIN feature of MAP.
This facility, available only for MVS installations, allows SMF/RMF data to be
processed and a MAP model generated. OBTAIN processes the SMF/RMF data
and constructs a system model based on both the information contained in these
records, and on user-provided parameters that specify how workload data in SMF
records is to be interpreted. The OBTAIN feature is a separate application pro-
gram within the MAP product that executes interactively. Stoesz [Stoesz 1985]
discusses the validation process after using CAPTURE/MVS or OBTAIN to con-
struct an analytical queueing model of an MVS system.
In the following example we assume that the performance information avail-
able is similar to that provided by SMF and RMF records on an IBM mainframe
running under the MVS operating system. We have used the technical bulletins
[Bronner 1983] and [Wicks 1991] as guides for this example. We assume that for
terminal workload classes the average number of active terminals, the average
number of interactions completed, the average response time, and the average
service demand of the workload class for each service center is provided or can
be obtained without excessive calculation. Then, from the number of class c
interactions completed in the observation interval, Cc, we calculate the average
throughput $X_c = C_c / T$, where T is the length of the interval. (This is an
approximation due to end effects.) We estimate the average think time from the
response time formula as follows:

$$Z_c = \frac{N_c}{X_c} - R_c.$$
For batch workload classes we assume we are provided with the average
number of jobs in service, the number of completions, the average turnaround
time, and all service demands.
Example 5.2
A small computer system at Big Bucks Bank was measured using their software
performance monitor for 1 hour and 15 minutes (4,500 seconds). The computer
system has three workload classes, two terminal and one batch. The terminal
classes are numbered 1 and 2 with the batch class assigned number 3. Some of the
measurement results are shown in Tables 5.1 through 5.3.
Table 5.1
  c    Nc    Interactions    Rc
They also obtained the device utilization and average number of customers
at each of the three devices as shown in Table 5.2. The CPU utilization has been
corrected for any capture ratio errors, that is, the CPU utilization accounts for
CPU overhead.
Table 5.2
  k    Number    Utilization
  1    2.06      0.93
  2    0.16      0.13
  3    0.22      0.18
Table 5.3 provides the measured service demands for each job class at the CPU
and each of the two I/O devices.
  c    k        Dc,k
  1    CPU      0.025
  1    I/O 1    0.040
  1    I/O 2    0.060
  2    CPU      0.200
  2    I/O 1    0.200
  2    I/O 2    0.060
  3    CPU      0.600
  3    I/O 1    0.050
  3    I/O 2    0.060
In[4]:= x1 = 1485./4500
Out[4]= 0.33
Out[5]= 30.4061
In[6]:= x2 = 1062./4500
Out[6]= 0.236
Out[7]= 19.6127
In[8]:= x3 = 6570./4500
Out[8]= 1.46
In[9]:= n3 = 1.46/x3
Out[9]= 1.
In[10]:= n3 = 1.41 x3
Out[10]= 2.0586
In[15]:= Demands
The analysts at Big Bucks Bank feel that the model outputs are sufficiently close to
the measured values to validate the model. They are satisfied with the current
performance of the computer system but the users have told them that the
throughput of the first online system will quadruple and the throughput of the
second online workload will double in the next six months, although the batch
component is not expected to increase. The analysts feel that an upgrade to a
computer with a CPU that is 1.5 times as fast without changing the I/O might sat-
isfy the requirements of their users. The users want to be able to process the new
volume of online work without increasing the response time of the first workload
class above 0.2 seconds and that of the second workload class above 1.0 seconds
with the turnaround time of the batch workload remaining below 1.0 seconds.
The analysts model the proposed system using the Mathematica program Fixed
as follows:
In[6]:= x1 = 4 0.33
Out[6]= 1.32
In[7]:= x2 = 2 0.236
Out[7]= 0.472
In[8]:= x3 = 1.46
Out[8]= 1.46
Class# ArrivR Pc
-------- ----------- ---------
1 1.32 40.3563
2 0.472 9.68986
3 1.46 0.876127
Note that the response time requirements are far exceeded. Perhaps Big
Bucks could make do with a slightly smaller processor. Note, also, that there will
be approximately 40.3563 active users of the first online application, 9.68986
active users of the second online application, and 0.876127 active batch jobs with
the new system.
Exercise 5.1
Ross Ringer, a fledgling performance analyst at Big Bucks Bank, suggests that
they could save a lot of money by procuring the model of their current machine
with a CPU 25 percent faster than their current machine rather than one that is 50
percent faster. This machine could then be board upgraded to a CPU with twice
the power of the current machine for a very reasonable price. By board upgraded
we mean that the old CPU board could be replaced with the faster CPU board
without changing any of the other components. Use Fixed to see if Ross is right.
Exercise 5.2
Fruitful Farms measures the performance of one of their computer systems during
the peak afternoon period of the day for 1 hour (3,600 seconds). Their monitor
reports that the CPU is idle for 600 seconds of this interval and thus busy for 3,000
seconds (50 minutes). Fruitful Farms has three workload classes on the computer
system, one terminal class, term, and two batch classes, batch1 and batch2. The
monitor reports that workload class term used 20 minutes of CPU time, batch1
used 8 minutes, and batch2 used 2 minutes. (a) Calculate the amount of the 3000
seconds of CPU time that should be allocated to each workload class assuming the
capture ratio is the same for all workloads. (b) Make the calculation of part (a)
assuming that all CPU overhead is due to paging and that 80% of the paging is for
the terminal class while 15% is for batch1 and 5% for batch2.
5.4 Solutions
Solution to Exercise 5.1
Ross calculates the new service demands for the CPU for the three workload
classes by multiplying each of the demands for the upgraded CPU in Example 5.2
by 1.5/1.25, yielding the values shown in the matrix Demands displayed in the
following Mathematica session:
In[19]:= MatrixForm[Demands]
Out[23]//MatrixForm= 0.02   0.04   0.06
                     0.16   0.2    0.06
                     0.48   0.05   0.06
In[24]:= Think
Out[24]= {30.4061, 19.6127, 0}
From the output above we see that Ross is almost right! The average
response time for the second online workload class is 1.016144 seconds, which is
slightly over the 1.0-second goal. However, this is an approximate model and all
the estimates are approximate as well, so Ross's recommendation is OK.
Solution to Exercise 5.2
For part (a) we allocate the 1,200 unaccounted CPU seconds in proportion to the
measured CPU times of 20, 8, and 2 minutes, and then convert each class's total
CPU time to minutes:
In[45]:= 1200 20/30
Out[45]= 800
In[46]:= 1200 8/30
Out[46]= 320
In[47]:= 1200 2/30
Out[47]= 80
In[48]:= (20 60 + 800)/60
Out[48]= 100/3
In[49]:= N[%]
Out[49]= 33.3333
In[50]:= (8 60 + 320)/60
Out[50]= 40/3
In[51]:= N[%]
Out[51]= 13.3333
In[52]:= (2 60 + 80)/60
Out[52]= 10/3
For part (b) we allocate 80% of the 1,200 unallocated CPU seconds to the term
workload class; this comes to 960 seconds or 16 minutes. We allocate 15% of
1200 or 180 seconds (3 minutes) to batch1 and the other 5% or 1 minute to batch2.
The Mathematica calculations for this follow:
In[55]:= .8 1200
Out[55]= 960.
In[56]:= %/60
Out[56]= 16.
In[57]:= .15 1200
Out[57]= 180.
In[58]:= %/60
Out[58]= 3.
5.5 References
1. Leroy Bronner, Capacity Planning: Basic Hand Analysis, IBM Washington
Systems Center Technical Bulletin, December 1983.
2. Domenico Ferrari, Giuseppe Serazzi, and Alessandro Zeigner, Measurement
and Tuning of Computer Systems, Prentice-Hall, Englewood Cliffs, NJ, 1983.
3. Phillip C. Howard, IS Capacity Management Handbook Series, Volume 1,
Capacity Planning, Institute for Computer Capacity Management, updated
every few months.
4. Phillip C. Howard, IS Capacity Management Handbook Series, Volume 2,
Performance Analysis and Tuning, Institute for Computer Capacity Manage-
ment, updated every few months.
5. IBM, MVS/ESA Resource Measurement Facility Version 4 General Informa-
tion, GC28-1028-3, IBM, March 1991.
6. H. W. Barry Merrill, Merrill's Expanded Guide to Computer Performance
Evaluation Using the SAS System, SAS Institute, Cary, NC, 1984.
7. Roger D. Stoesz, "Validation tips for analytic models of MVS systems,"
CMG 85 Conference Proceedings, Computer Measurement Group, 1985,
670–674.
8. Raymond J. Wicks, Balanced Systems and Capacity Planning, IBM Wash-
ington Systems Center Technical Bulletin GG22-9299-03, September 1991.
6.1 Introduction
Simulation and benchmarking have a great deal in common. When simulating a
computer system we manipulate a model of the system; when benchmarking a
computer system we manipulate the computer system itself. Manipulating the real
computer system is more difficult and much less flexible than manipulating a
simulation model. In the first place, we must have physical possession of the
computer system we are benchmarking. This usually means it cannot be doing any
other work while we are conducting our benchmarking studies. If we find that a
more powerful system is needed we must obtain access to the more powerful
system before we can conduct benchmarking studies on it. By contrast, if we are
dealing with a simulation model, in many cases, all we need to do to change the
model is to change some of the parameters.
Simulation is used for many purposes: to study theories in physics, cosmology,
and other disciplines, and to model computer systems. After the crash of a DC-10
aircraft near Chicago a few years ago because an engine fell off, a DC-10 flight
training simulator was used to study whether or not the plane could be controlled
with one engine detached. (It could, but the pilots did not realize they had lost an
engine until it was too late.)
exotic applications of simulation see [Pool 1992].
Twenty years ago modeling computer systems was almost synonymous with
simulation. Since that time so much progress has been made in analytic queueing
theory models of computer systems that simulation has been displaced by queue-
ing theory as the modeling technique of choice; simulation is now considered by
many computer performance analysts to be the modeling technique of last resort.
Most modelers use analytic queueing theory if possible and simulation only if it
is very difficult or impossible to use queueing theory. Most current computer sys-
tem modeling packages use queueing network models that are solved analyti-
cally. Some of the best known of these are Best/1 MVS from BGS Systems, Inc.;
MAP from Amdahl Corp.; CA-ISS/THREE from Computer Associates, Interna-
tional, Inc.; and Model 300 from Boole & Babbage. RESQ from IBM provides
both simulation and analytic queueing theory modeling capabilities.
The reason most analysts prefer analytic queueing theory modeling is that the
model is much easier to formulate and takes much less computer time to solve
than a simulation. See, for example, the paper [Calaway
1991] we discussed in Chapter 1. Kobayashi in his well-known book [Kobayashi
1978] says:
1. Construct the model by choosing the service centers, the service time
distributions at each center, and the interconnections of the centers.
2. Generate the transactions (customers) and route them through the model to
represent the system.
3. Keep track of how long each transaction spends at each service center. The
service time distribution is used to generate these times.
4. Construct the performance statistics from the above counts.
5. Analyze the statistics.
6. Validate the model.
Of course, these same tasks are necessary for Step 6 of the modeling study
paradigm.
One of the major activities in any simulation study is writing the computer
code that makes the calculations for the study. Such programs are called simula-
tors. In the next section we discuss how simulators are written.
The M/M/1 queueing system is an open system with one server that provides
exponentially distributed service; this means that the probability that the provided
service will require not more than t time units is given by $P[s \le t] = 1 - e^{-t/S}$,
where S is the average service time. For the M/M/1 queueing system the
interarrival time, that is, the time between successive arrivals, also has an
exponential distribution. Thus, if I describes the interarrival time, then
$P[I \le t] = 1 - e^{-\lambda t}$, where $\lambda$ is the average arrival rate. The two parameters that
define this model are the average arrival rate $\lambda$ (customers per second) and the
average service time S (seconds per customer).
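The listing of simmm1 itself is long, but the heart of an M/M/1 simulator fits in a
few lines. The following sketch is mine, not the book's simmm1 (which adds a warmup
period and batch means); it simulates a FIFO M/M/1 queue directly from the
recurrence "service starts at the later of the arrival time and the time the server
becomes free":

  SeedRandom[1];
  expon[m_] := -m Log[Random[]]        (* exponential variate with mean m *)
  lambda = 0.8; s = 1.0;               (* arrival rate and mean service time *)
  arrive = 0.; free = 0.; total = 0.;
  Do[arrive += expon[1/lambda];        (* next arrival time *)
     start = Max[arrive, free];        (* service starts when the server is free *)
     free = start + expon[s];          (* completion time of this customer *)
     total += free - arrive,           (* accumulate response times *)
     {5000}];
  total/5000                           (* observed mean response time *)

For these parameters the exact M/M/1 mean response time is S/(1 − λS) = 5.0
seconds, the value quoted in the discussion that follows.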
The purpose of printing out the value of mean response time at the end of the
warmup period is to determine whether or not it seems likely that the steady-state
has been reached. Since the correct value of mean response time is 5.0, the run
length of 250 didn't seem to be long enough. But neither did a run of length 2500
where the error rose from 0.9967 (for the run of length 250) to 2.6083 (for the run
of length 2500)! A warmup period of 10000 appeared to be adequate. However,
the batch runs should have been longer than 250 in our last run as the large
confidence interval shows. MacDougall, in Table 4.2 of [MacDougall 1987],
claims that to obtain 5% accuracy in the average queueing time (response time
minus service time) requires a sample size (run length) of 189774. We had a
sample size of 250000 after the warmup in our first run and the estimated average
queueing time of 3.92449 is in error by only 1.92%. The error in the average
response time is 1.53%. We show the exact values of all the performance
measures in the output of the program mm1. Note that mm1 required only 0.03
seconds for the calculation while the first simulation run was 872.92 seconds (14
minutes and 32.92 seconds) long.
Our simmm1 example illustrates some of the problems of simulation. We
will discuss other problems after the following exercise.
Exercise 6.1
Make two M/M/1 simulation runs with simmm1, first with a lambda value of 0.9,
an average service time of 1.0 seconds, a seed of 11, a warmup value (n) of 1500,
and a batch length value (m) of 500. Then repeat the run with all values the same
except the batch length (m); make it 2000. Compare the 95 percent confidence
intervals for the two runs. (Warning: The first run on my 33 MHz 486 PC took
253.21 seconds and the second 982.4 seconds. If you have a slower computer,
such as a 16 MHz 386SX, the two runs could be very long. In this case you may
want to take a coffee break or a walk around the block while the computations are
made.)
The basic problem in discrete event simulation is that the outputs of a simu-
lator are sequences of random variables rather than the exact performance num-
bers we would like. The conclusions of a simulation study are based on estimates
made from these random variables. Therefore, the estimates themselves are also
random variables rather than the performance numbers we want. We usually are
interested in estimates of the average values of performance parameters of the
computer system under study. For example, we are interested in the average
response time of customers in a workload class. If we push n customers of work-
load class c through the simulator, we obtain the numbers R1, R2, ..., Rn. From
these numbers, which are the measured values of the response times for the n
customers, the simulator must estimate the average response time for the class. If
n is 10000, we may have the simulator ignore the first 1000 of these 10000
numbers to avoid the transient phase and estimate the true value of the average
response time R by $\bar{R}$, where

$$\bar{R} = \frac{1}{9000} \sum_{i=1001}^{10000} R_i.$$

This is the usual method of estimating an average value; $\bar{R}$ is called the simple
mean of the numbers $R_{1001}, R_{1002}, \ldots, R_{10000}$. It is important in a simulation study
not only to be able to obtain estimates of important parameters from the study, but
also to have some sort of assurance that the estimate is close enough to the true
value to satisfy the needs of the modeling study. In the program simmm1 we used
the method of batch means to calculate a 95 percent confidence interval for the
mean response time. There are a couple of other methods that are sometimes used
for this purpose and also help with the problem of determining that the simulation
process has reached the steady-state. Unfortunately, both of these methods are
rather advanced and thus not easy for beginners to implement. Some simulation
languages, such as RESQ, have built-in facilities for both these methods.
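Returning to the method of batch means for a moment, here is a concrete
illustration (a sketch under my own assumptions, not the code in simmm1): given a
list r of response times collected after the warmup period, split it into batches of
length m and treat the batch averages as approximately independent observations:

  batchCI[r_List, m_Integer] :=
    Module[{k = Floor[Length[r]/m], means, mean, se},
      means = Table[Apply[Plus, Take[r, {(i - 1) m + 1, i m}]]/m, {i, k}];
      mean = Apply[Plus, means]/k;                      (* grand mean *)
      se = Sqrt[Apply[Plus, (means - mean)^2]/(k (k - 1))];
      {mean - 1.96 se, mean + 1.96 se}]                 (* approximate 95% interval *)

The factor 1.96 is the normal approximation; it is reasonable when the number of
batches k is large and each batch is long enough to be nearly independent of its
neighbors.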
The first advanced method is called the regeneration method. This method
simultaneously solves three problems: (1) the problem of independent runs, (2)
the problem of the transient state, and (3) the problem of generating a confidence
interval for an estimate. In our discussion of the method of batch means, we
neglected to mention the problem of making the batch runs independent. What
tends to keep them from being independent is the correlation between successive
customers. If one customer has a very long response time because of long queues
at the service centers, then immediately succeeding customers tend to have long
response times as well; of course, if a customer has a short response time, then
immediately succeeding customers tend to have short response times, too. The
batch runs are approximately independent if each of them is sufficiently long,
however. The regeneration method automatically generates independent subruns.
The regenerative method also solves the problem of the transient state. Finally,
the regenerative method supplies a technique for generating confidence intervals.
With these three advantages one might suppose that everyone should use the
regenerative method. Unfortunately, there are disadvantages for the regenerative
method, too. The method does not apply to all simulation models, although it
does apply to the simulation of most computer systems. In addition it is much
more complex to set up properly and more difficult to program.
The regeneration method depends upon the existence of regeneration or
renewal points. At each such point future behavior of the simulation is indepen-
dent of past behavior and, in a probabilistic sense, restarts or regenerates its
behavior from that point. Eventually the system returns to the same regeneration
point or state in what is called a regeneration cycle. The regeneration cycles are
used as subruns for the simulation study. Since each regeneration point represents
identical simulation model states, the behavior of the system during one cycle is
independent of the behavior in another cycle, so the subruns are independent. The
bias due to the initial conditions also disappears. An example of a regeneration
point for the M/M/1 queueing model is the initial state in which the system is
empty and the first customer to enter the system will appear in Δt seconds, where
Δt is a random number from an exponentially distributed stream with average
value $1/\lambda$. The first regeneration cycle ends the next time the simulated system
again reaches the empty state.
In Section 3.3.2 of [Bratley, Fox, and Schrage 1987], the authors discuss
regenerative methods, provide an algorithm for using the regeneration method,
and give a list of pros and cons of the regenerative method. One of the cons is
that the regeneration cycles may be embarrassingly long. Although Bratley et al.
didn't mention it, there may be extremely short regeneration cycles as well.
Another problem is in setting up regeneration points to begin a simulation. This
can be a real challenge. The regeneration method is not recommended for begin-
ners.
There is also a discussion of the spectral method in [Bratley, Fox, and
Schrage 1987]. The spectral method is supported by the RESQ programming lan-
guage and examples of its use are given in [MacNair and Sauer 1985]. The
method does provide confidence intervals for steady-state averages. In addition,
MacNair and Sauer claim:
Bratley et al. [Bratley, Fox, and Schrage 1987] discuss other advanced methods,
which they call autoregressive methods. These methods are not widely used and
Bratley et al. do not present an optimistic portrayal of their use. In fact, they end
Section 3.3 with the statement:
Knuth mentions this program in a pejorative manner in several other places in his
book.
The most common random number generators are linear congruential
generators, which work as follows: Given a positive integer m and an initial seed
$z_0$ with $0 \le z_0 < m$, the sequence $z_0, z_1, z_2, \ldots$ is generated by
$z_{n+1} = (a z_n + b) \bmod m$, where a and b are integers less than m. The integer a is
called the multiplier and is in the range $2, 3, \ldots, m-1$; b is called the increment;
and m is the modulus. In the formula for generating the next random number, mod
m means to take the remainder upon division by m. Thus, if m is 13, then 27
mod m is 1.
Park and Miller recommend a standard uniform random number generator
based on a linear congruential generator with increment zero. They also recom-
mend that the modulus m be a large prime integer. (Recall that a positive integer
m is prime if the only positive integers that divide it evenly are 1 and m. By con-
vention, 1 is not considered a prime number so the sequence of prime numbers is
2, 3, 5, 7, 11, 13, 17, ....) Their algorithm is begun by choosing a seed $z_1$ and
generating the sequence $z_1, z_2, z_3, \ldots$ by the formula $z_{n+1} = a z_n \bmod m$ for
$n = 1, 2, 3, \ldots$. Finally, each $z_n$ is converted into a number between zero and one
by dividing by m, which yields a new sequence $u_1, u_2, u_3, \ldots$ where $u_n = z_n/m$.
Park and Miller refer to this algorithm as the Lehmer generator. The numbers m
and a must be chosen very carefully to make the Lehmer generator work prop-
erly.
All linear congruential generators are periodic; that is, after a certain number
of iterations the generator repeats itself. Let us illustrate by an example from
[Park and Miller 1988]. Suppose we choose the Lehmer generator with
the multiplier a = 6 and modulus m = 13. Then, if the initial seed is 2, the Leh-
mer generator yields the sequence (before dividing by 13) of
2, 12, 7, 3, 5, 4, 11, 1, 6, 10, 8, 9, 2, ... After the second 2 the sequence repeats
itself. The choice of any other initial seed would yield a circular shift of the
above sequence. This generator is a full period generator, that is, it yields all the
numbers from 1 through 12 exactly once in each period. The multiplier a = 5 in
the above example yields a Lehmer generator with period of only four; it is not a
full period generator. We demonstrate these properties with the Mathematica pro-
gram ran:
{2, 12, 7, 3, 5, 4, 11, 1, 6, 10, 8, 9}
The statement on line 5 shows how the Mathematica program uran can be used
to generate uniform random variables on the interval between zero and one.
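The listings of ran and uran are not shown above, but minimal versions are easy
to sketch (these are my reconstructions, not necessarily the book's programs):

  ran[a_, m_, z0_, n_] := NestList[Mod[a #, m] &, z0, n - 1]  (* n Lehmer values *)

  seed = 1;                                       (* global seed for uran *)
  uran := (seed = Mod[16807 seed, 2147483647];    (* Park and Miller minimal standard *)
           N[seed/2147483647])                    (* uniform value in (0, 1) *)

For example, ran[6, 13, 2, 12] returns the full-period sequence
{2, 12, 7, 3, 5, 4, 11, 1, 6, 10, 8, 9} shown above, while ran[5, 13, 2, 8] exposes
the period-four cycle {2, 10, 11, 3, 2, 10, 11, 3}.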
Exercise 6.2
Consider the Lehmer generator with m = 13. We saw that with the multiplier
a = 6 we have a full period generator, while the multiplier a = 5 yields a
generator with a period of only 4. Test all the other multipliers between 2 and 12
to see which give you a full period Lehmer generator.
Knuth [Knuth 1981] discusses how to choose the parameters of a linear con-
gruential generator to obtain a full period. He considers generators with b = 0 as
a special case. The solution for this case is given by Theorem B on page 19 of the
Knuth book. A linear congruential generator with b = 0 is called a multiplicative
linear congruential generator. Every full period linear congruential generator
produces a fixed circular list; the initial seed determines the starting point on this
list for the output of any particular run.
Another desirable property of a random number generator is that the output
be random. As Gardner shows in [Gardner 1989, Gardner 1992], the exact mean-
ing of random is difficult to define. Loosely speaking, the output of a random
number generator is random if it appears to be so. Statistical tests have been
designed to test this property because humans cannot make good judgments
about randomness. Knuth [Knuth 1981] has a long, difficult section with the title
"What is a random sequence?" It turns out that, if a sequence is random, then
subsequences must exist that appear to be very nonrandom, that is, subsequences
such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 0. In practice we must depend upon statistical tests
to decide whether or not a random number generator yields random output. Some
choices of a and m for the Lehmer generator yield sequences that are more ran-
dom than others. It is not easy to choose the combinations of a and m for a Leh-
mer generator that will generate satisfactory random output. For their minimal
standard random number generator Park and Miller recommend the multiplier
a = 16807 with the modulus m = 2147483647. They chose
$m = 2^{31} - 1 = 2147483647$ because it is a large prime. For this value of m there
are more than 534 million values of a that make the generator a full period gener-
ator. Extensive testing has been performed which suggests that the combination
of a and m recommended by Park and Miller does yield a truly random full
period sequence. Being truly random means that it has passed the statistical
tests that are used to determine randomness or lack of it. The Park and Miller
minimal standard random number generator has been implemented successfully
on a number of computer platforms.
From a uniform random number generator, which generates a sequence
u1, u2, u3, ... where each un is between zero and one, it is possible to generate a
sequence with any probability distribution desired. Knuth [Knuth 1981] includes
algorithms for most distributions of interest to those modeling computer systems.
Some of the algorithms are somewhat complex but the algorithm for generating
an exponentially distributed random sequence is very straightforward. One can
generate an exponentially distributed random sequence with average value x by
calculating $b_n = -x \log u_n$ for each n, where log is the natural logarithm, that is,
the logarithm to the base e, where e is approximately 2.718281828.
The Mathematica program rexpon can be used to generate an exponential ran-
dom sequence.
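A one-line sketch of what rexpon might look like, built on the uran sketch given
earlier (again my reconstruction, not the book's listing):

  rexpon[xbar_, n_] := Table[-xbar Log[uran], {n}]  (* n exponential variates, mean xbar *)

For example, Apply[Plus, rexpon[3.5, 10000]]/10000 should return a value close
to 3.5.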
Out[15]= 3.47863
In[3]:= <<Statistics`ContinuousDistributions`
In[5]:= Mean[table1]
Out[5]= 3.56487
In[7]:= Mean[table1]
Out[7]= 4.62718
In[9]:= Mean[table1]
Out[9]= 2.86325
In[11]:= Mean[table1]
Out[11]= 3.53028
In[12]:= Variance[table1]
Out[12]= 12.73
In[13]:= 3.5^2
Out[13]= 12.25
Note that for small samples, such as 20, the mean was not always close to
3.5, but for a sample of size 10000, both the mean and variance were fairly close
to the underlying distribution. (The variance for an exponential random variable
is the square of its mean, so, if the mean is 3.5, the variance should be 12.25.)
Marsaglia is one of the leaders in random number generation. In his keynote
address "A Current View of Random Number Generators" for the Computer Science
and Statistics: 16th Symposium on the Interface, Atlanta, 1984, which is
published as [Marsaglia 1985], he made some important remarks. He said, in the
abstract:
Most computer systems have random number generators avail-
able, and for most purposes they work remarkably well.
Indeed, a random number generator is much like sex: when it's
good it's wonderful, and when it's bad it's still pretty good.
Marsaglia then describes some of the other common random number generators,
some new generators, some new, more stringent tests, and the results of applying
the tests to old and new random number generators. He concludes with the
following paragraph:
ULTRA has a period of some $10^{366}$, and every possible m-tuple, from pairs,
3-tuples, and 4-tuples up to 37-tuples, can appear. Statistical tests show that those
m-tuples appear with frequencies consistent with the underlying probability theory.
If you read [Knuth 1981] you will be amazed by the number of tests for ran-
domness he provides. However, if you do a simulation study you may be tempted
to skip the testing of your random number generator. This would be a mistake.
Jon Bentley, the author of the regular column Software Exploratorium in UNIX
Review, in [Bentley 1992] discusses the use of a random number generator to
study the approximate solution to the traveling-salesman problem. He uses a ran-
dom number generator recommended by Knuth in Algorithm A and implemented
by Program A written in Knuth's MIXAL on page 27 of [Knuth 1981]. Bentley
tested his version of the program more thoroughly than Knuth did and discovered
that, for his application, Knuth's recommendation wouldn't work! If he had not
done the extensive testing he might not have discovered the error for some time.
Bentley found a modification to the algorithm based on some of Knuth's recom-
mendations that does work satisfactorily. In his column Bentley gave the follow-
ing exercise:
In Exercise 12, fortune refers to a program that reads a file of one-line quotations
and prints one at random. The generator referred to is a FORTRAN program on
page 171 of [Knuth 1981]. The answer to Exercise 12 provided by Bentley is:
Knuth is one of the most admired computer scientists of our time. His book [Knuth
1981] is the standard reference on random number generation. His final advice in
the SUMMARY for the chapter RANDOM NUMBERS includes the following
statements:
$$\mathrm{chisq} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \cdots + \frac{(O_k - E_k)^2}{E_k}.$$
Each numerator in the sum for chisq measures the square of the difference
between the observed and expected number in a category; the number in each
denominator scales the squared value. Fortunately, for large n, the distribution of
chisq approaches the well-known probability distribution called the chi-square
distribution. The chi-square distribution is completely characterized by one inte-
ger parameter called the degrees of freedom.
In the program chisquare k = 4. We calculate the three numbers x25, x50,
and x75 which define four intervals of the real line in such a way that, if the ran-
dom sequence has an exponential distribution with mean value mean, then one-
fourth of the sequence will fall into each interval. Since we assume we know the
mean of the sequence, by the rules for calculating number of degrees of freedom
of the chi-square distribution approximating chisq, it has k − 1 = 3 degrees of
freedom. If our null hypothesis had been merely that the sequence was exponen-
tial so that we must estimate the mean from the data we would lose another
degree of freedom so that chisq would be approximated by a chi-square distribu-
tion with 2 degrees of freedom. We now provide some output from chisquare
that shows some tests of exponential random numbers generated by rexpon and
by Mathematica using Random. The Mathematica package work.m was loaded
before the statements below were executed using Version 2.0 of Mathematica.
SeedRandom yields different values for other versions of Mathematica so you
may get somewhat different results if you use a version of Mathematica
other than 2.0.
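The listing of chisquare does not appear above, so here is a sketch of a test along
the same lines (my reconstruction, not the book's program; it assumes the
Statistics`ContinuousDistributions` package is loaded so that
ChiSquareDistribution is available):

  chisquareSketch[alpha_, data_, mean_] :=
    Module[{n = Length[data], cuts, obs, e, stat, p, q},
      cuts = -mean Log[1 - {0.25, 0.5, 0.75}];       (* x25, x50, x75 *)
      obs = {Count[data, y_ /; y <= cuts[[1]]],      (* observed counts in the *)
             Count[data, y_ /; cuts[[1]] < y <= cuts[[2]]],  (* four intervals *)
             Count[data, y_ /; cuts[[2]] < y <= cuts[[3]]],
             Count[data, y_ /; y > cuts[[3]]]};
      e = n/4;                                       (* expected count per interval *)
      stat = Apply[Plus, (obs - e)^2/e];             (* the chisq statistic *)
      q = CDF[ChiSquareDistribution[3], stat];       (* 3 degrees of freedom *)
      p = 1 - q;
      If[p < alpha/2 || q < alpha/2, "rejected", "passed"]]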
In[6]:= Mean[y]
Out[6]= 3.54594
In[8]:= SeedRandom[2]
In[9]:= x = Table[Random[ExponentialDistribution[1/3.5]], {5000}]; //Timing
In[11]:= Mean[x]
Out[11]= 3.54394
In[12]:= SeedRandom[23]
In[13]:= x = Table[Random[ExponentialDistribution[1/3.5]], {5000}]; //Timing
In[14]:= Mean[x]
Out[14]= 3.52034
In[15]:= chisquare[0.02, x, 3.5]
p is 0.946125
q is 0.0538745
The sequence passes the test.
In[5]:= SeedRandom[47]
In[6]:= y = Table[Random[ExponentialDistribution[1/10]], {5000}];
Although rexpon uses the Park and Miller minimal standard random number
generator, which they claim is very efficient, it required 166.21 seconds to gener-
ate 5000 exponential random variates compared to 11.55 seconds required to pro-
duce them using the Mathematica Random function. The program chisquare
rejects the sequence if p is less than half of alpha or q is less than half of alpha.
We calculate p as the probability that, if the null hypothesis is true, a value of
chisq as large or larger than the one observed would be observed. Similarly, q is
the probability that a value of chisq smaller than the one observed would occur.
We have followed Knuth's recommendation of testing each random number
generator at least three times with different seeds. Both random number generators
pass all the tests with an alpha of 0.02 (two percent). Some authorities would not
reject a sequence based on q being less than one half of alpha but would reject only
if p is less than alpha. We follow Knuth's recommendation in choosing success or
failure of the sequence in chisquare.
Exercise 6.3
Load the Mathematica package work.m and use chisquare to test the sequence
generated by the following Mathematica statements to see if it is a random sample
from an exponential distribution with mean 10. Use 0.02 as the alpha value.
In[4]:= SeedRandom[47]
In[5]:= y = Table[Random[ExponentialDistribution[1/10]], {1000}];
Simulation languages of the third type are specifically designed for analyz-
ing computer systems. These languages have special facilities that make it easier
to construct a simulator for a modeling study of a computer system. The modeler
is thus relieved of a great deal of complex coding and analysis. For example, such
languages can easily generate random number distributions of the kind usually
used in models of computer systems. In addition Type 3 languages make it easy
to simulate computer hardware devices such as disk drives, CPUs, and channels
as part of a computer system simulation. Some languages, such as RESQ, also
allow advanced methods for controlling the length of a simulator run, such as the
regeneration method, running until the confidence interval for an estimated per-
formance parameter is less than a critical value, etc. Type 3 languages are more
expensive, in general, than Type 1 or Type 2 languages, as one would expect, but
provide a savings in the time to construct a simulator. Of course there is a learn-
ing curve for any new language; it might be necessary to attend a class to attain
the best results.
Type 2 programming languages provide a number of features needed for general purpose simulation but no special features for modeling computer systems as such. Therefore, it is easier to develop a simulator with a Type 2 programming language than with a Type 1 general purpose language, but not as easy as with a Type 3 language.
Type 1 languages should be used for constructing a simulator of a computer
system only if (a) the simulator is to be used so extensively that efficiency is of
paramount importance, (b) personnel with the requisite skills in statistics and
coding are available to construct the model, and (c) a simple technique for proto-
typing the simpler versions of the simulator is available to assist in validating the
simulator.
Bratley et al. [Bratley, Fox, and Schrage 1987] provide examples of simulators written in Type 1 languages (Fortran and Pascal) as well as Type 2 languages (Simscript, GPSS, and Simula). They also warn:
MacNair and Sauer [MacNair and Sauer 1985] provide a number of computer
modeling examples using simulation written in the Type 3 language RESQ.
3. Collecting the information needed for conducting the simulation study. Information is needed for validation as well as for construction of the model.
4. Choosing the simulation language. This choice depends upon the skills of the
people available to do the coding.
5. Coding the simulation, including generating the random number streams
needed, testing the random number streams, and verifying that the coding is
correct. People with special skills are needed for this step.
6. Overcoming the special simulation problems of determining when the simula-
tion process has reached the steady-state and a method of judging the accuracy
of the results.
7. Validating the simulation model.
8. Evaluating the results of the simulation model.
A failure of any one of these steps can cause a failure of the whole effort.
Simulation is the only tool available for modeling computer hardware that does
not yet exist and thus is of great importance to computer designers. It also plays a
leading role in analyzing the performance of complex communication networks.
Fortier and Desrochers [Fortier and Desrochers 1990] describe how the MATLAN
simulation modeling package can be used to analyze local area networks (LANs).
6.6 Benchmarking
We discussed benchmarking briefly in Chapters 1 and 2. There are actually two basically different kinds of benchmarking. The first kind is defined by Dongarra et al. [Dongarra, Martin, and Worlton 1987] as "running a set of well-known programs on a machine to compare its performance with that of others." Every computer manufacturer runs these kinds of benchmarks and reports the results for each announced computer system. The second kind is defined by Artis and Domanski [Artis and Domanski 1988] as "a carefully designed and structured experiment that is designed to evaluate the characteristics of a system or subsystem to perform a specific task or tasks." The Artis and Domanski kind of benchmark is the type one would use to model the workload on your current system and run on the proposed system. It is the most difficult kind of modeling in current use for computer systems. Before we discuss the Artis and Domanski type of benchmark, we will discuss the first type of benchmark, the kind that is called a standard benchmark. We have previously mentioned some of the standard benchmarks, including the Dhrystone benchmark, in Chapter 2.
In the very early days of computers, the speed of different machines was
compared using main memory access time, clock speed, and the number of CPU
clock cycles needed to perform the addition and multiply instructions. Since most
programming in those days was done either in machine language or assembly
language, in principle, programmers could use this information plus the cycle
times of other common instructions to estimate the performance of a new
machine.
The next improvement in estimating computer performance was the Gibson
Mix provided by J. C. Gibson of IBM and formally described in [Gibson 1970].
Gibson ran some dynamic instruction traces on a selection of programs written
for the IBM 650 and 704 computers. From these traces he was able to calculate
what percent of instructions were of various types. For example, he found that
Load/Store instructions accounted for 31.2% of all instructions executed and
Add/Subtract accounted for 6.1%. From the percentage of each instruction used
and the execution time of each instruction, it is possible to compute the average
execution time of an instruction and thus the average execution rate. In his excel-
lent historical paper [Serlin 1986] Serlin shows how the Gibson Mix could be
used to estimate the MIPS for a 1970-vintage Supermini computer. Serlin also
points out that the Gibson Mix was part of industry lore as early as 1964, although Gibson did not formally publish his results until 1970, and then only in an IBM internal report.
It was quickly discovered that the Gibson Mix was not representative of the
work done on many computer systems and did not measure the ability of compil-
ers to produce good optimized code. These concerns led to the development of
some standard synthetic benchmarks.
As Engberg says [Engberg 1988] about synthetic benchmarks:
Thus synthetic benchmarks do not do any useful calculations, unlike the Linpack
benchmark, which is a collection of Fortran subroutines for solving a system of
linear equations. Results of the Linpack benchmark are given in terms of Linpack
MFLOPS.
The two best known synthetic benchmarks are the Whetstone and the Dhrys-
tone. The Whetstone benchmark was developed at the National Physical Labora-
tory in Whetstone, England, by Curnow and Wichmann in 1976. It was designed
to measure the speed of numerical computation and floating-point operations for
midsize and small computers. Now it is most often used to rate the floating-point
operation of scientific workstations. My IBM PC compatible 33 MHz 486 has a
Whetstone rating of 5,700K Whetstones per second. According to [Serlin 1986]
the HP 3000/930 has a rating of 2,841K Whetstones per second, the IBM 4381-
11 has a rating of approximately 2,000K Whetstones per second, and the IBM RT
PC a rating of 200K Whetstones per second.
The Dhrystone benchmark was developed by Weicker in 1984 to measure
the performance of system programming types of operating systems, compilers,
editors, etc. The result of running the Dhrystone benchmark is reported in Dhrys-
tones per second. Weicker in his paper [Weicker 1990] describes his original
benchmark as well as Versions 1.1 and 2.0. Dhrystones per second are often converted into MIPS, or millions of instructions per second. The MIPS usually reported are relative VAX MIPS, that is, MIPS calculated relative to the VAX 11/780, which was once thought to be a 1 MIPS machine but is now generally believed to be approximately a 0.5 MIPS machine. By this we mean that for most programs run on the VAX 11/780 it executes approximately 500,000 instructions per second. Weicker [Weicker 1990] not only discusses his Dhrystone benchmark
but also discusses the Whetstone, Livermore Fortran Kernels, Stanford Small
Programs Benchmark Set, EDN Benchmarks, Sieve of Eratosthenes, and SPEC
benchmarks. Weicker also says:
Members of BAPCo include Advanced Micro Devices Inc., Digital Equipment, Dell Computer, Hewlett-Packard, IBM, Intel, Microsoft, and Ziff-Davis Labs.
Rather than have one composite number for the two combined benchmark suites, SPEC provides a separate metric for CINT92 and for CFP92. SPECint92 is
the composite metric for CINT92. It is the geometric mean of the SPECratios of
the six integer benchmarks. The SPECratio for a benchmark on a given system is
the quotient derived by dividing the SPEC Reference Time for that benchmark
(run time on a DEC VAX 11/780) by the run time for the same benchmark on that
particular system. SPECfp92 is the composite metric for CFP92 and is the geo-
metric mean of the SPECratios of the fourteen floating-point benchmarks. We
provide some representative SPEC benchmark results in Tables 6.1 and 6.2.
These results are those reported to SPEC by the manufacturers. Note that IBM no
longer reports MIPS results.
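The composite arithmetic is simple enough to check by hand. Here is a minimal Mathematica sketch; the six SPECratios below are hypothetical numbers invented for illustration, not figures from Tables 6.1 or 6.2.

In[1]:= (* A SPECratio is the SPEC Reference Time divided by the
           measured run time on the system under test *)
        specRatio[refTime_, runTime_] := refTime/runTime

In[2]:= (* A composite such as SPECint92 is the geometric mean,
           the nth root of the product of the n SPECratios *)
        geometricMean[ratios_List] :=
          Apply[Times, ratios]^(1/Length[ratios])

In[3]:= N[geometricMean[{23.9, 21.2, 35.0, 19.8, 27.4, 25.1}]]

For these hypothetical ratios the composite is approximately 25.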
In Table 6.1 HP/705 is shorthand for Hewlett-Packard HP 9000 Series 705
and similarly for HP/750. IBM/220 is shorthand for IBM RS/6000 Model 220
and similarly for IBM/970. Sun SS2 is an abbreviation for Sun SPARCstation 2.
In Table 6.2 SS is shorthand for SuperSPARC. All the results in Table 6.2 were
reported in [Boudette 1993]. In his article Boudette also included the perfor-
mance results reported by Intel for the Intel 66 MHz Pentium, the 60 MHz Pen-
tium, the 33/66 MHz 486DX2, the 50 MHz 486DX, the 25/50 MHz 486DX2, the
33 MHz 486DX, and the 25 MHz 486DX based on the internal Intel benchmark
Icomp. These benchmark results indicate that the 66 MHz Pentium almost dou-
bles the performance of the 33/66 MHz 486DX2 which is 78.9 percent faster than
the 33 MHz 486DX.
In addition to reporting the composite metrics SPECint92 and SPECfp92
manufacturers report the performance on each individual benchmark. This helps
users better position different computers relative to the work to be done. The
floating-point suite is recommended for comparing the floating-point-intensive
(typically engineering and scientific applications) environment. The integer suite
is recommended for environments that are not floating-point-intensive. It is a
good indicator of performance in a commercial environment. CPU performance
is one of the indicators of commercial environment performance. Other compo-
nents include disk and terminal subsystems, memory, and OS services. SPEC has
announced that benchmarks are being readied to measure overall throughput, net-
working, and disk input/output for release in 1992 and 1993. Currently SPEC
benchmarks run only under UNIX.
a common database over a local or wide-area network (thus the terms tpsA-local and tpsA-wide). The TPC-A benchmark uses the human and computer operations involved in a typical banking automated teller machine (ATM) transaction as a simplified model to represent a wide array of OLTP business transactions. Results of the benchmark are expressed in TPS (transactions per second) and in $/TPS, or dollars per TPS. (At first it was planned to represent the cost in units of thousands of dollars per TPS, $K/TPS, but this was found to be too complicated for business executives to think in those terms.) The TPS rating is equal
to the number of transactions completed per unit of time provided that 90 percent
of the transactions have a response time of two seconds or less. The $/TPS is the
total cost of the system tested divided by the obtained TPS rating. This is
intended as a price-performance measure so the lower the result, the better the
performance. The total system cost includes all major hardware and software
components (including terminals, disk drives, operating system and database
software as required by benchmark specifications), support, and 5 years of main-
tenance costs.
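The price-performance arithmetic is equally direct. The figures in this Mathematica sketch (a $2,100,000 total five-year system cost and a 108 TPS rating) are hypothetical and are not taken from any TPC report.

In[1]:= tps = 108; systemCost = 2100000;

In[2]:= (* price-performance: total system cost divided by the TPS
           rating; the lower the value, the better *)
        N[systemCost/tps]

Out[2]= 19444.4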
The second TPC benchmark, called TPC Benchmark B (TPC-B), is intended as a replacement for TP1. TPC-B was approved in August 1990 and is primarily a database server test in which streams of transactions are submitted to a database host/server in batch mode. The database operations associated with TPC-B transactions are similar to those of TPC-A, but there are no terminals or end users associated with the TPC-B benchmark. Results of this benchmark are expressed in the same units as those of the TPC-A benchmark: TPS and $/TPS.
In Table 6.3 we present some of the results reported by the TPC on March 15–16, 1992. The TPC-A results are local results.
Although the TPC-A and TPC-B benchmarks have been widely accepted
there has been some criticism of some features of these benchmarks. The most
severe charge against the two benchmarks is that neither truly represents any
actual segment of the commercial computing marketplace. Another complaint is
that the TPS rating is too sensitive to the requirement that 90 percent of all trans-
actions must have a response time not exceeding 2 seconds. The TPC-A bench-
mark has been criticized for being a single-transaction workload although most
commercial workloads have a batch component. The TPC-B benchmark has a
batch but no online component. To answer these complaints the TPC has devel-
oped a new benchmark called TPC-C that is considered to be an order-entry
benchmark.
The TPC-A and TPC-B benchmarks are not directly usable for making pur-
chase decisions because neither of them can be matched with an actual applica-
tion. However, they do provide information to those who are planning to develop
OLTP applications. By reading TPC-A and TPC-B reports from different vendors
application developers can obtain rough ideas about the performance of compet-
ing computer systems as well as relative costs. However, developers who have applications similar to that described by the TPC-C benchmark are able to make at least a rough estimate of what model of computer is needed if they read the full disclosure reports (FDRs) in detail for the machines of interest.
The TPC-C results reported in Table 6.4 are from [Boudette 1993].
The applications used are grouped into six categories:

Word Processing: MS Word for Windows 1.1, WordPerfect 5.1
Spreadsheet: Lotus 1-2-3 Release 3.1+, Excel 3.0, Quattro Pro 3.0
Database: dBASE IV 1.1, Paradox 3.5
Desktop Graphics: Harvard Graphics 3.0
Desktop Publishing: PageMaker 4.0
Software Development: Borland C++ 2.0, Microsoft C 6.0
The metric used to quantify performance is scripts per minute. This metric is
calculated for each application and then combined to yield a performance metric
for each category. Thus there is a metric for word processing, spreadsheets,
database, desktop graphics, desktop publishing, and software development.
According to Strehlo [Strehlo 1992], the scoring is calibrated so that a typical 33
MHz 486 computer will score approximately 100. One could use the output from
the SYSmark92 benchmark performed on a number of different personal
computers to help decide what personal computers to buy for people who have
similar workloads. For example, for users in a group that makes a lot of
spreadsheet calculations, the spreadsheet rating can be used to compare the
usefulness of different personal computers for making spreadsheet computations.
Then all the PCs that satisfy your spreadsheet rating criterion can be analyzed
relative to other factors such as price, ease-of-use, quality, support policies,
training requirements, if any, etc., to make the final purchase decision. Part of any
decision should involve allowing some of the final users to test the machines to
see which ones they like.
!SCRIPT AUTOCAPTURE
!*
!* Automated MPE V/E Script For Ldev 120
$CONTROL ERRORS=10, WARN
!*
!* Set the terminal line transmission speed to 960, emulation
!* mode to character mode
!*
!SET speed=960, mode=char, type=0
!SET eor=nul
!*TIMER = 15:32:44
!LOGON
!* Generate a message to the SUT to logon and wait 70
!* deciseconds from the receipt of a PROMPT character from
!* the SUT before sending the next message.
!*
!SEND "hello manager.sys", CR
!WAIT 0, 70
!*
!* Generate a message to the SUT to execute GLANCE
!*
!SEND "run glancev.pub.sys", CR
!WAIT 0, 3
!*
!* Generate a message to the SUT to examine the GLOBAL screen
!*
!SEND "g"
!WAIT 0, 0
!*
!* Generate a message to the SUT to EXIT from GLANCE
!*
!SEND "e"
!WAIT 0, 26
!* Generate a message to the SUT to logoff the MPE session
!*
!SEND "BYE", CR
!*TIMER = 15:33:22
!LOGON
!* End Of Script
!*TIMER = 15:33:23
!END
All computer vendors have drivers for controlling their benchmarks. Since
there are more IBM installations than any other kind, the IBM Teleprocessing
Network Simulator (program number 5662-262, usually called TPNS) is proba-
bly the best known driver in use. TPNS generates actual messages in the IBM
Communications Controller and sends them over physical communication lines
(one for each line that TPNS is emulating) to the computer system under test.
TPNS consists of two software components, one of which runs in the IBM
mainframe or plug compatible used for controlling the benchmark and one that
runs in the IBM Communications Controller. TPNS can simulate a specified net-
work of terminals and their associated messages, with the capability of altering
network conditions and loads during the run. It enables user programs to operate
as they would under actual conditions, since TPNS does not simulate or affect
any functions of the host system(s) being tested. Thus it (and most other similar
drivers including WRANGLER, the driver used at the Hewlett-Packard Perfor-
mance Technology Center) can be used to model system performance, evaluate
communication network design, and test new application programs. A driver may
be much less difficult to use than the development of some detailed simulation
models but is expensive in terms of the hardware required. One of its most
important uses is testing new or modified online programs both for accuracy and
performance. Drivers such as TPNS or WRANGLER make it possible to utilize
all seven of the uses of benchmarks described by Artis and Domanski. Kube in
[Kube 1981] describes how TPNS has been used for all these activities. Of
course the same claim can be made for most commercial drivers.
If, like most installations, you run batch jobs on your computer system along with your online (terminal) workload classes, you can use your RTE to capture the scripts with which the batch jobs are launched. However, it can be a real challenge to construct a representative batch workload if you run a number of different batch jobs with very different resource requirements. The benchmark section of [Howard] describes the rather tedious procedure for constructing a representative batch workload.
In spite of all the difficulties and challenges I have cited, it is possible to construct representative and useful benchmarks. Computer manufacturers couldn't live without them and some large computer installations depend upon them. However, constructing a good benchmark for your installation is not an easy task and is not recommended for most installations. In their excellent paper [Dongarra, Martin, and Worlton 1987], Dongarra et al. warned:
Note that the authors define a kernel to be the central portion of a program,
containing the bulk of its calculations, which consumes the most execution time.
Clark, in his interesting paper [Clark 1991], provides a report on his experi-
ences at his installation in developing and running their first benchmark. Their
benchmark was what Clark calls a proof-of-concept (POC) benchmark. Clark
describes this type of benchmark as follows:
Clark does not reveal the exact purpose of the benchmark study. However, it
appears that the feasibility of moving an application that was running under CICS
on an IBM mainframe to an open platform was to be determined. On the open
platform SQL would be used to access the data. For the latter part of the
benchmark it was necessary to simulate SQL transactions using a relational
database management system. Clark discusses the planning, team involvement,
establishing control over the vendor benchmark personnel, scope, workload, data,
driving the benchmark, documentation, and the final report. Clark is an experienced performance analyst and had access to advice from Bernard Domanski, an experienced benchmarker, so his chances for success were much better than those of someone relatively new to computer performance analysis. For Clark's study, workload characterization and the generation of test data were especially challenging.
Exercise 6.4
You are the lead performance analyst at Information Overload. You have
excellent rapport with your users who provide very good feedback on their
workload growth so that you can accurately predict the demands on your computer
system. Your performance studies show that your current computer system will be
able to support your workload at the level required by the service level agreement
you have with your users for only six more months. You have prepared a list of
three different computer systems from three different vendors that you feel are
good upgrade candidates based on your modeling of the three systems. Clarence
Clod, the manager of your installation, insists that you must conduct benchmark
studies on the three different computer systems using a representative benchmark
that you must develop before a new system can be ordered. Your biggest challenge
in complying with his orders will be:
(a) Constructing a truly representative benchmark in time to run it on the
three systems.
(b) Assuming that you succeed with (a), running the benchmark successfully
on the three candidate systems.
(c) Assuming you succeed with (a) and (b), analyzing the results of the three
studies in a way that will give you great confidence that you can make the correct
choice.
(d) None of the above.
Exercise 6.5
You are the manager of a group of engineers who are using a simulation package
on their workstations to design electronic circuits. The simulation package is
heavily dependent upon floating-point calculations. The engineers complain that
their projects are getting behind schedule because their workstations are so slow.
You obtain authorization from your management to replace all your workstations.
As you read the literature from different vendors on their workstations what
benchmarks or performance metrics will be of most importance to you?
6.7 Solutions
Solution to Exercise 6.1
The two runs requested follow. They were made on my Hewlett-Packard workstation and thus took less time, but they yielded exactly the same results as the runs made on my home 33 MHz 486DX IBM PC compatible.
The exact value of the average steady-state response time for an M/M/1 queueing
system with server utilization 0.9 is 10. For the first run the estimate of this
quantity is 9.86683, the 95 percent confidence interval contains the correct value,
and the length of the confidence interval is 2.19117. For the second run the
estimated value of the average response time is 9.85506 (not quite as good an
estimate as we obtained for the shorter first run), the confidence interval contains
the correct value, and the length of the confidence interval is 1.46548.
In[5]:= m = 13
Out[5]= 13
In[6]:= seed = 2
Out[6]= 2
In[7]:= n = 13
Out[7]= 13
Out[9]= {2, 6, 5, 2, 6, 5, 2, 6, 5, 2, 6, 5, 2}
Out[13]= {2, 5, 6, 2, 5, 6, 2, 5, 6, 2, 5, 6, 2}
From the above runs of ran and the runs performed earlier we construct Table 6.6, which gives the period of each multiplier modulo 13:

Multiplier:   2   3   4   5   6   7   8   9  10  11  12
Period:      12   3   6   4  12  12   4   3   6  12   2

It is interesting to note that there are four full period multipliers: 2, 6, 7, and 11.
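You can verify the periods in Table 6.6 directly. The following short function is a sketch written for this discussion (it is not part of work.m); for a prime modulus it computes the multiplicative order of the multiplier a, which is the period of the sequence a generates.

period[a_, m_] :=
  Module[{t = 1, x = Mod[a, m]},
    (* multiply by a until we return to 1 *)
    While[x != 1, x = Mod[a x, m]; t++];
    t]

In[2]:= Table[{a, period[a, 13]}, {a, 2, 12}]
Out[2]= {{2, 12}, {3, 3}, {4, 6}, {5, 4}, {6, 12}, {7, 12},
         {8, 4}, {9, 3}, {10, 6}, {11, 12}, {12, 2}}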
Solution to Exercise 6.3
In[3]:= <<work.m
In[5]:= SeedRandom[47]
In[6]:= y = Table[Random[ExponentialDistribution[1/10]], {5000}];
In[7]:= chisquare[0.02, y, 10]
p is 0.126232
q is 0.873768
The sequence passes the test.
This solution was made using Version 2.1 of Mathematica; Version 2.0 yields slightly different values for p and q because the output of SeedRandom[47] is different for the two versions of Mathematica.
Remember that vendor personnel are highly motivated to try to convince you that their system is the most effective for your workload.
6.8 References
1. Anon et al., "A measure of transaction processing power," Datamation, April 1, 1985, 112–118.
2. H. Pat Artis and Bernard Domanski, Benchmarking MVS Systems, notes from the course taught January 11–14, 1988, at Tysons Corner, VA.
3. Jon Bentley, "Some random thoughts," Unix Review, June 1992, 71–77.
4. Neal Boudette, "Intel gears Pentium to drive continued 486 system sales," PCWEEK, February 15, 1993.
5. Paul Bratley, Bennett L. Fox, and Linus E. Schrage, A Guide to Simulation, Second Edition, Springer-Verlag, New York, 1987.
6. James D. Calaway, "SNAP/SHOT vs. BEST/1," Technical Support, March 1991, 18–22.
7. Philip I. Clark, "What do you really expect from a benchmark?: a beginner's perspective," CMG 91 Proceedings, Computer Measurement Group, 826–832.
8. Jack Dongarra, Joanne L. Martin, and Jack Worlton, "Computer benchmarking: paths and pitfalls," IEEE Spectrum, July 1987, 38–43.
9. Tony Engberg, "Performance: questions worth asking," Interact, August 1988, 50–61.
10. Paul J. Fortier and George R. Desrochers, Modeling and Analysis of Local Area Networks, CRC Press, Boca Raton, FL, 1990.
7.1 Introduction
As Patrick Henry suggests, forecasting means predicting the future from the past.
In ancient times this was done by examining chicken entrails or consulting an
oracle. In modern times the concept of time series analysis has developed to help
us predict the future. Forecasting is most useful in predicting workload growth but
may sometimes be used to predict CPU utilization or even response time.
Forecasting using time series analysis is essentially a form of pattern recognition
or curve fitting. The most popular pattern is a straight line but other patterns
sometimes used include exponential curves and the S-curve. One of the keys to
good forecasting is good data and the source of much useful data is the user
community. That is why one of the most popular and successful forecasting
techniques for computer systems is forecasting using natural forecasting units
(NFUs), also known as business units (BUs) or key volume indicators (KVIs).
The users can forecast the growth of natural forecasting units such as new
checking accounts, new home equity loans, or new life insurance policies sold
much more accurately than computer capacity planners in the installation can
predict future computer resource requirements from past requirements. If the
capacity planners can associate the computer resource usage with the natural
forecasting units, future computer resource requirements can be predicted. For
example, it may be true that the CPU utilization for a computer system is strongly
correlated with the number of new life insurance policies sold. Then, from the
predictions of the growth of policies sold, the capacity planning group can predict
when the CPU utilization will exceed the threshold which will require an upgrade.
to the trend. The most common curve used is a straight line, but exponential curves and S-curves are sometimes fitted as well. After a curve is fitted to the trend data
the seasonality and cyclic components are returned to the series so that the fore-
cast can be made. Of course the random component must be taken into account in
making the final forecast. Fortunately, we have statistical systems available to
handle the rather complex mathematics of all this.
Natural forecasting units are sometimes called business units or key volume
indicators because an NFU is usually a business unit. The papers [Browning
1990], [Bowerman 1987], [Reyland 1987], [Lo and Elias 1986], and [Yen 1985]
are some of the papers on NFU (business unit) forecasting that have been pre-
sented at international CMG conferences. In their paper [Lo and Elias 1986], Lo
and Elias list a number of other good NFU forecasting papers.
The basic problem that NFU forecasting solves is that the end users, the peo-
ple who depend upon computers to get their work done, are not familiar with
computer performance units (sometimes called DPUs for data processing units)
such as interactions per second, CPU utilization, or I/Os per second, while com-
puter capacity planners are not familiar with the NFUs or the load that NFUs put
on a computer system.
Lo and Elias [Lo and Elias 1986] describe a pilot project undertaken to
investigate the feasibility of adopting the NFU forecasting technique as part of a
capacity planning program. According to Lo and Elias, the major steps needed
for applying the NFU forecasting technique are (I have changed the wording
slightly from their statement):
Lo and Elias used the Boole & Babbage Workload Planner software to do the
dependency analysis. This software was also used to project the future capacity
requirements using standard linear and compound regression techniques. One of
their biggest challenges was manually keying in all the data for 266 NFUs. They
were able to reduce the number of NFUs to three highly smoothed ones.
Example 7.1
Yen, in his paper [Yen 1985], describes how he predicted future CPU
requirements for his IBM mainframe computer installation from input from users.
He describes the procedure in the abstract for his paper as follows:
Yen discovered that user departments can accurately predict their magnetic disk
requirements (IBM refers to magnetic disks as DASD for direct access storage
device). They can do this because application developers know the record sizes
of files they are designing and the people who will be using the systems can make
good predictions of business volumes. Yen used 5 years of historical data
describing DASD allocations and CPU consumption in a regression study. He
made a scatter diagram in which the y-axis represented CPU hours required for a
month, Monday through Friday, 8 am to 4 pm, while the x-axis represented GB of
DASD storage installed online on the fifteenth day of that month. Yen found that
the regression line y = 34.58 + 2.59x fit the data extraordinarily well. The usual
measure of goodness-of-fit is the R-squared value, which was 0.95575. (R-squared
is also called the coefficient of determination.) In regression analysis studies, R-squared can vary between 0, which means no correlation between the x and y values, and 1, which means perfect correlation between the x and y values. A statistician might describe the R-squared value of 0.95575 by saying, "95.575 percent of the total variation in the sample is due to the linear association between the variables x and y." An R-squared value larger than 0.9 means that there is a strong linear relationship between x and y.
Yen no longer has the data he used in his paper but provided me with data
from December 1985 through October 1990. From this data I obtained the x and y
values plotted in Figure 7.1 together with the regression line obtained from the
following Mathematica calculations using the standard Mathematica package
In[3]:= <<Statistics`LinearRegression`
In[12]:= Regress[data, {1, x}, x]
(The output of Regress is omitted here.) Here we show the ParameterTable from Regress for the data with all the outliers, including all December points, deleted:
Yen was able to make use of his regression equation plus input from some
application development projects to predict when the next computer upgrade was
needed. Let us examine how that might be done with the data in Figure 7.2. The
rightmost data point is (512.15, 921.019). Since there are 152 hours in a time
period consisting of 19 days with 8 hours per day, the number of equivalent IBM
3083 Model J CPUs for this point is 6.06. We assume that Blue Cross has the
equivalent of at least 7 IBM 3083 Model J computers at this time. If it is exactly
7, we would like to know when at least 8 will be needed. We can use the regres-
sion line to estimate this as shown in the following Mathematica calculation. We
see that at least eight equivalent CPUs will be needed when the online storage
reaches 643.391 GB. We can predict when that will happen and thus when an
upgrade will be needed, at least to within a few months.
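The calculation just described is a matter of solving the regression equation for x. Here is a minimal Mathematica sketch; the intercept and slope are hypothetical stand-ins, since the text does not reproduce the fitted coefficients for the Figure 7.2 data, and eight equivalent IBM 3083 Model J CPUs correspond to 8 x 152 = 1216 CPU hours for the 19-day, 8-hour-per-day period.

In[13]:= a = 40.0; b = 1.83; (* hypothetical intercept and slope *)

In[14]:= (* find the online DASD storage x (in GB) at which the
            fitted CPU hours a + b x reach eight CPUs' worth *)
         Solve[a + b x == 8*152, x]

Out[14]= {{x -> 642.623}}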
While the technique used by Yen can predict to within a few months when the next upgrade should occur, forecasting the total CPU hours needed per month alone does not provide much information on the performance of the system as it approaches the point where more computing capacity is needed. More detailed
information is needed to determine when the performance deteriorates so that the
users feel that such performance measures as average response time are unac-
ceptable. Yen and his colleagues of course tracked performance information to
avoid this problem. In fact, Yen used the modeling package Best/1 MVS to make
frequent performance predictions. The forecasting process allowed Yen to predict
far in advance when an upgrade would likely be needed so that the necessary pro-
curement procedures could be carried out in a timely fashion.
Exercise 7.1
Apply linear regression to the file data1 that is on the diskette in the back of the
book. Hint: Don't forget to read in the package LinearRegression from Statistics.
How you read it in depends upon what version of Mathematica you have.
Example 7.2
This example is taken from the HP RXForecast User's Manual for HP-UX Systems. One of the useful features of HP RXForecast is the capability of associating business units with computer performance metrics to see if there is a correlation. When there is a strong correlation, HP RXForecast will forecast computer performance metrics from business unit forecasts. For this example the scopeux collector was run continuously from January 3, 1990, until March 19, 1990, to generate the TAHOE.PRF performance log file. Then HP RXForecast was used to correlate the global CPU utilization to the business units provided in the business unit file TAHOEWK.BUS. The contents of this flat ASCII file follow:
1 2 1990 5510
1 3 1990 4300
1 4 1990 5000
2 1 1990 5920
2 2 1990 4800
2 3 1990 3000
2 4 1990 5700
3 1 1990 4800
3 2 1990 5200
3 3 1990 7800
3 4 1990 6500
4 1 1990 6700
4 2 1990 7000
4 3 1990 6200
4 4 1990 7400
5 1 1990 7700
5 2 1990 6900
5 3 1990 8100
5 4 1990 8300
6 1 1990 8600
6 3 1990 9000
6 4 1990 9300
The graph shown in Figure 7.3 was produced by HP RXForecast. The first
part of the graph (up to week 3 of the third month) compares the actual global
CPU utilization and the global CPU utilization predicted by regression of CPU
utilization on business units. The two curves are very close. The single curve
starting in the third week of the third month is the RXForecast forecast of CPU
utilization from the predicted business units. The regression for the first part of
the curve is very good with an R-squared value of 0.86 and a standard error of
only 5.49. Note that, for the business unit forecasting technique to work, the pre-
diction of the growth of business units must be provided to HP RXForecast.
7.3 Solutions
Solution to Exercise 7.1
We used Mathematica as shown here, except that we do not show the data1 file being read in with a simple <<data1, because doing so dumps all the numbers on the screen. We also display only the final graphic. The fit looks pretty good in Figure 7.4, although the R-squared value of 0.883297 is slightly lower than we'd like.
In[3]:= <<Statistics`LinearRegression`
In[6]:= gp = ListPlot[data]
Out[6]= -Graphics-
Out[12]= -Graphics-
7.4 References
1. Tim Browning, "Forecasting computer resources using business elements: a pilot study," CMG 90 Conference Proceedings, Computer Measurement Group, 1990, 421–427.
2. James R. Bowerman, "An introduction to business element forecasting," CMG 87 Conference Proceedings, Computer Measurement Group, 1987, 703–709.
3. C. Chatfield, The Analysis of Time Series: An Introduction, Third Edition, Chapman and Hall, London, 1984.
4. T. L. Lo and J. P. Elias, "Workload forecasting using NFU: a capacity planner's perspective," CMG 86 Conference Proceedings, Computer Measurement Group, 1986, 115–120.
5. George W. (Bill) Miller, "Workload characterization and forecasting for a large commercial environment," CMG 87 Conference Proceedings, Computer Measurement Group, 1987, 655–665.
6. John M. Reyland, "The use of natural forecasting units," CMG 87 Conference Proceedings, Computer Measurement Group, 1987, 710–713.
7. Kaisson Yen, "Projecting SPU capacity requirements: a simple approach," CMG 85 Conference Proceedings, Computer Measurement Group, 1985, 386–391.
8.1 Introduction
I hope the reader fits Shaw's definition of unreasonable and wants to change
things for the better. The purpose of this chapter is to review the first seven
chapters of this book and to suggest what you might do to continue your education
in computer performance analysis.
$$\frac{\text{Execution Time B}}{\text{Execution Time A}} = 1 + \frac{n}{100},$$
where the numerator in the fraction is the time it takes machine B to execute task X and the denominator is the time it takes machine A to do so. Solving for n yields
$$n = 100\left(\frac{\text{Execution Time B}}{\text{Execution Time A}} - 1\right).$$
Processors (CPUs)
One of the most important components of any computer system is the central
processing unit (CPU) (CPUs on multiprocessor systems). The processing power
of a CPU is primarily determined by the clock cycle or smallest unit of time in
which the CPU can execute a single instruction. (According to [Kahaner and
Wattenberg 1992] the Hitachi S-3800 has the shortest clock cycle of any
commercial computer in the world; it is two billionths of a second!) Some
superscalar RISC (reduced instruction set computer) systems can execute more
than one instruction per cycle by pipelining. Pipelining is a method of improving
the throughput of a CPU by overlapping the execution of multiple instructions. It
is described in detail in [Hennessy and Patterson 1990] and conceptually in
[Denning 1993]. It is customary to provide basic CPU speed in units of millions
of clock cycles per second or MHz. As this is being written (June 1993) the fastest
50 seconds, the CPI is 3 1/3 clock cycles per instruction, and the MIPS rating is
15 for the code executed. Both of these numbers would probably be different if a
different code sequence was executed.
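The arithmetic connecting clock rate, CPI, and MIPS is worth making explicit: the MIPS rating equals the clock rate in MHz divided by the CPI. The sketch below reproduces the figures quoted above under the assumption of a 50 MHz clock; the clock rate is our assumption, while the CPI of 3 1/3 and the 15 MIPS rating come from the text.

In[1]:= clockMHz = 50;   (* assumed clock rate *)
        cpi = 10/3;      (* clock cycles per instruction *)

In[2]:= (* MIPS: millions of instructions executed per second *)
        N[clockMHz/cpi]

Out[2]= 15.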
Multiprocessors
Many computer systems have more than one processor (CPU) and thus are known
as multiprocessor systems. There are two basic organizations for such systems:
loosely coupled and tightly coupled.
Tightly coupled multiprocesors, also called shared memory multiprocessors,
are distinguished by the fact that all the processors share the same memory. There
is only one operating system, which synchronizes the operation of the processors
as they make memory and data base requests. Most such systems allow a certain
degree of parallelism; that is, for some applications they allow more than one
processor to be active simultaneously doing work for the same application.
Tightly coupled multiprocessor computer systems can be modeled using queue-
ing theory and information from a software monitor. This is a more difficult task
than modeling uniprocessor systems because of the interference between proces-
sors. Modeling is achieved using a load dependent queueing model together with
some special measurement techniques.
Loosely coupled multiprocessor systems, also known as distributed memory
systems, are sometimes called massively parallel computers or multicomputers.
Each processor has its own memory and sometimes a local operating system as
well. There are several different organizations for loosely coupled systems, but the problem all of them have in achieving high speeds is indicated by Amdahl's law, which says that the speedup due to parallel operation is given by
$$\text{Speedup} = \frac{1}{\left(1 - \text{Fraction}_{\text{parallel}}\right) + \dfrac{\text{Fraction}_{\text{parallel}}}{n}},$$
where n is the total number of processors. The problem is achieving a high degree
of parallelism. For example, if the system has 100 processors with all of them
running in parallel one half of the time, the speedup is only 1.9802. To obtain a
speedup of 50 requires that the fraction of the time that all processors are operating
in parallel is 98/99=0.98989899.
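A one-line Mathematica function reproduces both of the numbers just quoted.

In[1]:= (* Amdahl's law: speedup when fraction f of the work runs
           in parallel on n processors *)
        speedup[f_, n_] := 1/((1 - f) + f/n)

In[2]:= speedup[0.5, 100]
Out[2]= 1.9802

In[3]:= (* the parallel fraction required for a speedup of 50 *)
        Solve[speedup[f, 100] == 50, f]
Out[3]= {{f -> 98/99}}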
We discuss a number of the leading multiprocessor computer systems in Chapter 2. We also recommend the September 1992 issue of IEEE Spectrum. It is a special issue devoted to supercomputers and it covers all aspects of the newest machines.
Most computers (even personal computers) have virtual memory so that some lines of a program may be stored on a disk. The most common way that virtual memory
a program may be stored on a disk. The most common way that virtual memory
is handled is to divide the address space into fixed-size blocks called pages. At
any given time a page can be stored either in main memory or on a disk. When the CPU references an item within a page that is not in the CPU cache or in main memory, a page fault occurs, and the page is moved from disk to main memory.
Thus the CPU cache and main memory have the same relationship as main mem-
ory and disk memory. Disk storage devices, such as the IBM 3380 and 3390,
have cache storage in the disk control unit so that a large percentage of the time a
page or block of data can be read from the cache obviating the need to perform a
disk read. Special algorithms and hardware for writing to the cache have also
been developed. According to Cohen, King, and Brady [Cohen, King, and Brady
1989] disk cache controllers can give up to an order of magnitude better I/O ser-
vice time than an equivalent configuration of uncached disk storage.
Because caches consist of small, speedy memory elements they are very fast
and can significantly improve the performance of computer systems. In Chapter
2 we give some examples of how CPU caches can improve performance.
Input and output is a very important component of the performance of computer systems, although this fact is frequently overlooked. The most important I/O device for most computers is the magnetic disk drive, which we discuss in some detail in Chapter 2.
The hottest new innovation in disk storage technology is the disk array, more
commonly denoted by the acronym RAID (Redundant Array of Inexpensive
Disks). The seminal paper for this technology is the paper [Patterson, Gibson,
and Katz 1988]. It introduced RAID terminology and established a research
agenda for a group of researchers at UC Berkeley for several years. The abstract of their paper, which provides a concise statement about the technology, follows.
response is sent back to the user's terminal. This model is a queueing network model which can be solved using either analytic queueing theory or simulation.
workload is called a transaction workload and does not correlate quite so closely
with the way an actual user utilizes a computer system. Large data base systems
such as airline reservation systems have transaction workloads, which corre-
spond roughly to computer systems with a very large number of active terminals.
There are two types of parameters for each workload type: parameters that
specify the workload intensity and parameters that specify the service require-
ment of the workload at each of the computer service centers.
We describe the workload intensity for each of the three workload types as
follows:
average service time for a class c customer at service center k, that is, for the average time required for a server in service center k to provide the required service to one class c customer. It is the reciprocal of $\mu_{c,k}$, the Greek symbol used to represent the average service rate, that is, the average number of class c customers serviced per unit of time at service center k when the service center is busy.
The average response time, R, and average throughput, X, are the most com-
mon system performance metrics for terminal and batch workloads. These same
performance metrics are used for queueing networks, both as measurements of
system wide performance and measurements of service center performance. In
addition we are interested in the average utilization, U, of each service facility.
For any server the average utilization of the device over a time period is the frac-
tion of the time that the server is busy. Thus, if over a 10 minute period the CPU
is busy 5 minutes, then we have U = 0.5 for that period. Sometimes the utiliza-
tion is given in percentage terms so this utilization would be stated as 50% utili-
zation. In Chapter 3 we discuss the queueing network performance
measurements separately for single workload class models and multiple work-
load class models. For single workload class models, the primary system perfor-
mance parameters are the average response time, R, the average throughput, X,
and the average number of customers in the system, L. In addition, for each ser-
vice center we are interested in the average utilization, the average time a cus-
tomer spends at the center, the average center throughput, and the average
number of customers at the center.
For multiple workload class models there also are system performance mea-
sures and center performance measures. Thus we may be interested in the aver-
age response time for users who are performing order entry as well as for those
who are making customer inquiries. In addition we may want to know the break-
down of response time into the CPU portion and the I/O portion so that we can
determine where upgrading is most urgently needed.
Similarly, we have service center measures of two types: aggregate or total
measures and per class measures. Thus we may want to know the total CPU utili-
zation as well as the breakdown of this utilization between the different work-
loads.
after John D. C. Little, who gave the first formal proof in his 1961 paper [Little 1961]. Before Little's proof the result had the status of a folk theorem; that is, almost everyone believed the result was true but no one knew how to prove it. Little's law is the most important and useful principle of queueing theory, and his paper is the single most quoted paper in the queueing theory literature. Little's law applies to any system with the following properties:
$$R = \frac{N}{X} - Z.$$
The response time law can be generalized to the multiclass case to yield
$$R_c = \frac{N_c}{X_c} - Z_c.$$
In Section 3.3.3 we provide several examples of the use of the response time
law.
For a single workload class computer system the forced flow law says that the throughput of service center k, $X_k$, is given by $X_k = V_k X$, where $V_k$ is the average number of visits a customer makes to service center k and X is the computer system throughput. This means that a computer system is holistic in the sense that the overall throughput of the system determines the throughput through each service center and vice versa.
We repeat Example 3.3 below (as Example 8.1) because it illustrates several
of the laws under discussion.
Example 8.1
Suppose Arnolds Armchairs has an interactive computer system (single
workload) with the characteristics shown in Table 8.1.
(Table 8.1, which lists the parameters and their descriptions, is not reproduced here.)
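Since the body of Table 8.1 was not captured above, here is a small worked sketch in the same spirit with hypothetical numbers: 25 terminals, a mean think time of 10 seconds, a measured system throughput of 2 interactions per second, and 4 visits to the disk per interaction.

In[1]:= nTerminals = 25; z = 10; x = 2; vDisk = 4;

In[2]:= (* response time law: R = N/X - Z *)
        r = nTerminals/x - z
Out[2]= 5/2

In[3]:= (* forced flow law: disk throughput is V_disk times X *)
        vDisk x
Out[3]= 8

In[4]:= (* Little's law: mean number in the main system, L = X R *)
        x r
Out[4]= 5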
Although analytic queueing theory is very powerful there are queueing net-
works that cannot be solved exactly using the theory. In their paper [Baskett,
Chandy, Muntz, and Palacios 1975], a widely quoted paper in analytic queueing
theory, Baskett et al. generalized the types of networks that can be solved analyt-
ically. Multiple customer classes, each with different service requirements, as
well as service time distributions other than exponential are allowed. Open,
closed, and mixed networks of queues are also allowed. They allow four types of
service centers, each with a different queueing discipline. Before this seminal
paper was published most queueing theory was restricted to Jackson networks
that allowed only one customer class (a single workload class) and required all
service times to be exponential. The exponential distribution is a popular one in
applied probability because of its nice mathematical properties and because many
real world probability distributions are approximately exponential. The networks
described by Baskett et al. are now known as BCMP networks. For these net-
works efficient solution algorithms are known; many of them are presented in
Chapter 4 together with Mathematica programs for their solution.
counted as device 1. The MVA algorithm for the performance calculations fol-
lows.
Single Class Closed MVA Algorithm. Consider the closed computer system of
Figure 8.4. Suppose the mean think time is Z for each of the N active terminals.
The CPU has either the FCFS or the processor sharing queue discipline with
service demand D1 given. We are also given the service demands of each I/O
device numbered from 2 to K. We calculate the performance measures as follows:
Step 1. Set $L_k[0] = 0$ for $k = 1, 2, \ldots, K$.
Step 2. For $n = 1, 2, \ldots, N$, compute
$$R_k[n] = D_k\left(1 + L_k[n-1]\right), \quad k = 1, 2, \ldots, K,$$
$$R[n] = \sum_{k=1}^{K} R_k[n],$$
$$X[n] = \frac{n}{R[n] + Z},$$
$$L_k[n] = X[n]\,R_k[n], \quad k = 1, 2, \ldots, K.$$
Step 3. Set $X = X[N]$ and $R = R[N]$. Set the average number of customers (jobs) in the main computer system to $L = XR$.
The algorithm is actually quite straightforward and intuitive except for the
first equation of Step 2 which depends upon the arrival theorem, stated by Reiser
in [Reiser 1981] as follows:
Like all MVA algorithms, this algorithm depends upon Little's law (discussed in Chapter 3) and the arrival theorem. The key equation is the first equation of Step 2, $R_k[n] = D_k(1 + L_k[n-1])$, which is executed for each service center. By the arrival theorem, when a customer arrives at service station k the customer finds $L_k[n-1]$ customers already there. Thus the total number of customers requiring service, including the new arrival, is $1 + L_k[n-1]$. Hence the total time the new customer spends at the center is given by the first equation in Step 2 if we assume we needn't account for the service time that a customer in service has already received. The fact that we need not do this is one of the theorems of MVA! The arrival theorem provides us with the bootstrap technique needed to solve the equation $R_k[n] = D_k(1 + L_k[n-1])$ for n = N. When n is 1, $L_k[n-1] = L_k[0] = 0$, so that $R_k[1] = D_k$, which seems very reasonable; when there is only one customer in the system there cannot be a queue for any device, so the response time
the total response time is the sum of the times spent at the devices. The last two
equations are examples of the application of Littles law. The final equation
provides the input needed for the first equation of Step 2 for the next iteration and
the bootstrap is complete. Step 3 completes the algorithm by observing the
performance measures that have been calculated and using the utilization law, a
form of Littles law.
This algorithm is implemented by the Mathematica program sclosed in the
package work.m. In Chapter 4 we provide an example of the use of this model
and two exercises for the reader.
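To make the bootstrap concrete, here is a minimal Mathematica sketch of the single class closed MVA algorithm. It was written for this discussion and is not the sclosed program itself; demands is the list of service demands Dk, z is the mean think time Z, and nTerm is the number of terminals N.

mva[demands_List, z_, nTerm_Integer] :=
  Module[{l, r, rtot, x},
    l = Table[0, {Length[demands]}];  (* Step 1: Lk[0] = 0 *)
    Do[
      r = demands (1 + l);            (* Rk[n] = Dk (1 + Lk[n-1]) *)
      rtot = Apply[Plus, r];          (* R[n]: sum over the centers *)
      x = n/(rtot + z);               (* X[n] = n/(R[n] + Z) *)
      l = x r,                        (* Lk[n] = X[n] Rk[n] *)
      {n, 1, nTerm}];
    {x, rtot, x rtot}]                (* {X, R, L = X R} *)

For example, mva[{0.40, 0.20, 0.10}, 10., 25] returns the throughput, response time, and mean number in the main system for a hypothetical three-center system.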
assumes that each workload class is a transaction class. The Mathematica pro-
gram mopen in the package work.m implements the calculations. In Chapter 4
we provide an example and an exercise that use mopen.
The exact MVA solution algorithm for the closed multiclass model is based on the same ideas as the single class model (Little's law and the arrival theorem) but is much more difficult to explain and to implement. In addition, the computational requirements suffer a combinatorial explosion as the number of classes and the population of each class increase. I explain the algorithm on pages 413–414 of my book [Allen 1990] and in my article [Allen and Hynes 1991] with Gary
Hynes. In Chapter 4 we show how to use the Mathematica program Exact from
the package work.m, which is a slightly revised form of the program by that
name in my book [Allen 1990]. In Chapter 4 we consider some examples using
Exact.
Unfortunately, as we mentioned earlier, Exact is very computationally inten-
sive and thus is not practical for modeling systems with many workload classes
or many service centers (or systems with both many workload classes and many
service centers). To obviate this problem, we consider an approximate MVA
algorithm for closed multiclass systems. The approximate algorithm is suffi-
ciently accurate for most modeling studies and is much faster than the exact algo-
rithm. We provide the Mathematica program Approx in the package work.m to
implement the approximate algorithm; we also provide an example of its use as
well as an exercise to test your understanding of the use of Approx.
There is an approximate MVA algorithm for modeling computer systems
that (simultaneously) have both open and closed workload classes. (Recall that
transaction workload classes are open although both terminal and batch work-
loads are closed.) The algorithm for solving mixed multiclass models is pre-
sented in my book [Allen 1990] on pages 415–416 with an example of its use.
However, we do not recommend the use of this algorithm for reasons that are
explained in Chapter 4.
We avoid these problems by using a modified type of closed workload class
that we call a fixed throughput class. At the Hewlett-Packard Performance Tech-
nology Center Gary Hynes developed an algorithm that converts a terminal
workload or a batch workload into a modified terminal or batch workload with a
given throughput. In the case of a terminal workload we use as input the required
throughput, the desired mean think time, and the service demands to create a ter-
minal workload that has the desired throughput. We also compute the average
number of active terminals required to produce the given throughput. The same
algorithm works for a batch class workload because a batch workload can be
thought of as a terminal workload with zero think time. For the batch class work-
load we compute the average number of batch jobs required to generate the
required throughput.
In Chapter 4 we present an example that illustrates difficulties that arise in
using transaction (open) workloads in situations in which their use seems appro-
priate. We also show how fixed throughput classes allow us to obtain satisfactory
results. To do this we provide the Mathematica program Fixed in the package
work.m to implement the fixed class algorithm. We also provide an exercise to
test your understanding of the use of Fixed.
Priority Queues
In all of the models discussed so far we have assumed that there are no priorities
for workload classes, that is, that all are treated the same. However, most actual
computer systems do allow some workloads to have priority, that is, to receive
preferential treatment over other workload classes. For example, if a computer
system has two workload classes, a terminal class that is handling incoming
customer telephone orders for products and the other is a batch class handling
accounting or billing, it seems reasonable to give the terminal workload class
priority over the batch workload class.
Every service center in a queueing network has a queue discipline or algo-
rithm for determining the order in which arriving customers receive service if
there is a conflict, that is, if there is more than one customer at the service center.
The most common queue discipline in which there are no priority classes is the
first-come, first-served assignment system, abbreviated as FCFS or FIFO (first-
in, first-out). Other nonpriority queueing disciplines include last-come, first-
served (LCFS or LIFO), and random-selection-for-service (RSS or SIRO).
For priority queueing systems workloads are divided into priority classes
numbered from 1 to n. We assume that the lower the priority class number, the
higher the priority, that is, that workloads in priority class i are given preference
over workloads in priority class j if i < j. That is, workload 1 has the most prefer-
ential priority followed by workload 2, etc. Customers within a workload class
are served with respect to that class by the FCFS queueing discipline.
There are two basic control policies to resolve the conflict when a customer
of class i arrives to find a customer of class j receiving service, where i < j. In a
nonpreemptive priority system, the newly arrived customer waits until the cus-
tomer in service completes service before beginning service. This type of priority
system is called a head-of-the-line system, abbreviated HOL. In a preemptive pri-
ority system, service for the priority j customer is interrupted and the newly
arrived customer begins service. The customer whose service was interrupted
returns to the head of the queue for the jth class. As a further refinement, in a pre-
emptive-resume priority queueing system, the customer whose service was inter-
rupted begins service at the point of interruption on the next access to the service
facility.
Unfortunately, exact calculations cannot be made for networks with work-
load class priorities. However, widely used approximations do exist. The sim-
plest approximation is the reduced-work-rate approximation for preemptive-
resume priority systems that have the same priority structure at each service cen-
ter. It works as follows: The processing power at node k for class c customers is
reduced by the proportion of time that the service center is processing higher pri-
ority customers. Suppose the service rate of class c customers at service center k
is mc,k Then the effective service rate of at node k for class c jobs is given by
c1
c,k = c,k 1
U
r=1
r,k .
The new effective service rate means that the effective service time
1
Sc,k = .
c,k
Note that all customers are unaffected by lower priority customers, so, in particular, priority class 1 customers have an effective service rate equal to the actual full service rate. It is also true that the network can be solved exactly for class 1 workloads.
In Chapter 4 we show how to use the reduced-work-rate approximation
directly from the definition. We also show how to use the Mathematica program
Pri from the package work.m to make the calculations and provide an exercise
in the use of Pri.
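Applying the approximation directly from the definition takes only a few lines. The sketch below is not the Pri program; it simply inflates each class's service demand by the reduced-work-rate factor. Here demands[[c, k]] is the service demand of class c at center k, util[[c, k]] is the utilization of center k by class c, and class 1 has the highest priority.

effectiveDemands[demands_List, util_List] :=
  Table[
    (* divide by 1 minus the total higher priority utilization *)
    demands[[c, k]]/(1 - Sum[util[[r, k]], {r, 1, c - 1}]),
    {c, Length[demands]}, {k, Length[First[demands]]}]

Note that for c = 1 the sum is empty, so the class 1 demands are returned unchanged, in agreement with the observation above.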
central server model shown in Figure 8.5. This model was developed by Buzen
[Buzen 1971].
The central server referred to in the title of this model is the CPU. The cen-
tral server model is closed because it contains a fixed number of programs N (this
is also the multiprogramming level, of course). The programs can be thought of
as markers or tokens that cycle around the system interminably. Each time a pro-
gram makes the trip from the CPU directly back to the end of the CPU queue we
assume that a program execution has been completed and a new program enters
the system. Thus there must be a backlog of jobs ready to enter the computer sys-
tem at all times. We assume there are K service centers with service center 1 the
CPU. We assume also that the service demand at each center is known. Buzen
provided an algorithm called the convolution algorithm to calculate the perfor-
mance statistics of the central server model. In Section 4.2.4 of Chapter 4 we pro-
vide an MVA algorithm that is more intuitive and is a modification of the single
class closed MVA algorithm we presented earlier in this chapter.
We provide the Mathematica program cent in the package work.m to imple-
ment the algorithm; in Chapter 4 we also provide examples of its use and an exer-
cise.
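The flavor of the MVA computation is easy to convey. The following is not the
book's cent program but a minimal single class closed MVA loop of the same kind,
where demands is the vector of service demands and nn the fixed multiprogramming
level:

mva[nn_Integer, demands_?VectorQ] :=
  Block[{L, w, wn, lambda, n},
    L = Table[0, {Length[demands]}];
    For[n = 1, n <= nn, n++,
      w = demands (L + 1);     (* response time at each center *)
      wn = Apply[Plus, w];     (* total time in system *)
      lambda = n/wn;           (* throughput, by Little's law *)
      L = lambda w];           (* queue lengths for the next level *)
    {lambda, wn, L}]
mva[4, {0.10, 0.06, 0.04}]   (* invented service demands, in seconds *)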
Although the central server model has been used extensively it has two
major flaws. The first flaw is that it models only batch workloads and only one of
them at a time. That is, it cannot be used to model terminal workloads at all and it
cannot be used to model more than one batch workload at a time. The other flaw
is that it assumes a fixed multiprogramming level although most computer sys-
tems have a fluctuating value for this variable. In Chapter 4 we show how to
adapt the central server model so that it can model a terminal or a batch workload
with time varying multiprogramming level. We need only assume that there is a
maximum possible multiprogramming level m.
Since a batch computer system can be viewed as a terminal system with
think time zero, we imagine the closed system of Figure 8.4 as a system with N
terminals or workstations all connected to a central computer system. We assume
that the computer system has a fluctuating multiprogramming level with a maxi-
mum value m. If a request for service arrives at the central computer system
when there are already m requests in process the request must join a queue to wait
for entry into main memory. (We assume that the number of terminals, N, is
larger than m.) The response time for a request is lowest when there are no other
requests being processed and is largest when there are N requests either in pro-
cess or queued up to enter the main memory of the central computer system. A
computer system with terminals connected to a central computer with an upper
limit on the multiprogramming level (the usual case) is not a BCMP queueing net-
work. The non-BCMP model for this system is created in two steps. In the first
step the entire central computer system, that is, everything but the terminals, is
replaced by a flow-equivalent service center (FESC). This FESC can be thought of as a
black box that when given the system workload as input responds with the same
throughput and response time as the real system. The FESC is a load dependent
server, that is, the throughput and response time at any time depends upon the
number of requests in the FESC. We create the FESC by computing the through-
put for the central system considered as a central server model with multipro-
gramming level 1, 2, 3,..., m. The second step in the modeling process is to
replace the central computer system in Figure 8.4 by the FESC as shown in Fig-
ure 8.6. The algorithm to make the calculations is rather complex so we will not
explain it completely here. (It is Algorithm 6.3.3 in my book [Allen 1990].) How-
ever, the Mathematica program online in the package work.m implements the
algorithm. The inputs to online are m, the maximum multiprogramming level;
Demands, the vector of demands for the K service centers; N, the number of
terminals; and T, the average think time. The outputs of online are the average
throughput, the average response time, the average number of requests from the
terminals that are in process, the vector of probabilities that there are 0, 1, ..., m
requests in the central computer system, the average number in the central com-
puter system, the average time there, the average number in the queue to enter the
central computer system (remember, no more than m can be there), the average
time in the queue, and the vector of utilizations of the service centers.
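For example, assuming work.m is on the search path, a hypothetical call for a
system with three service centers, 25 terminals, a 30-second think time, and a
maximum multiprogramming level of 3 (the demand values are invented) would be:

<<work.m
online[3, {0.10, 0.06, 0.04}, 25, 30.]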
In Example 4.9 we show how the FESC form of the central server model can
be used to model the file server on a LAN.
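The first step of the construction, computing the load-dependent throughputs
that define the FESC, can be sketched with the mva function given earlier in
this section; this mirrors the role of subcent and srate in work.m but is not
the book's code:

(* FESC rates: central server throughput at MPL 1, 2, ..., m *)
fescRates[m_Integer, demands_?VectorQ] :=
  Table[First[mva[n, demands]], {n, 1, m}]
fescRates[3, {0.10, 0.06, 0.04}]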
Unfortunately, there is no easy way to extend the central server model so that
it can model main memory with more than one workload class. There are expen-
sive tools available to model memory for IBM MVS systems but they use very
complex, proprietary algorithms. My colleague Gary Hynes at the Hewlett-Pack-
ard Performance Technology Center has written a modeling package that can be
used to model memory for Hewlett-Packard computer systems; it is proprietary,
of course.
Figure 8.6. FESC Form of Central Server Model
Monitors
The basic measurement tool for computer performance is the monitor. There are
two basic types of monitors: software monitors and hardware monitors. Since
hardware monitors are used almost exclusively by computer manufacturers, we
discuss only software monitors in Chapter 5. The three most common types of
software monitors are used for diagnostics (sometimes called real-time or
troubleshooting monitors), for studying long-term trends (sometimes called
historical monitors), and for job accounting, that is, for gathering chargeback
information.
These three types can be used for monitoring the whole computer system or be
specialized for a particular piece of software such as CICS, IMS, or DB2 on an
IBM mainframe. There are probably more specialized monitors designed for
CICS than for any other software system.
The uses for a diagnostic monitor include the following:
Some diagnostic monitors have expert system capabilities to analyze the sys-
tem and make recommendations to the user. A diagnostic monitor with a built-in
expert system can be especially useful for an installation with no resident perfor-
mance expert. An expert system or adviser can diagnose performance problems
and make recommendations to the user. For example, the expert system might
recommend that the priority of some jobs be changed, that the I/O load be bal-
anced, that more main memory or a faster CPU is needed, etc. The expert system
could reassure the user in some cases as well. For example, if the CPU is running
at 100% utilization but all the interactive jobs have satisfactory response times
and low priority batch jobs are running to fully utilize the CPU, this could be
reported to the user by the expert system.
Desirable features of monitors designed for long-term performance management
include the following:
1. Low overhead.
2. The ability to measure throughput, service times, and utilization for the major
servers.
3. The ability to separate workload into homogeneous classes with demand levels
and response times for each.
4. The ability to report metrics for different types of classes such as interactive,
batch, and transaction.
5. The ability to capture all activity on the system including system overhead by
the operating system.
6. Provision of sufficient detail to detect anomalous behavior (such as a runaway
process) which indicates atypical activity.
7. Provision for long term trending via low volume data.
8. Good documentation and training provided by the vendor.
9. Good tools for presenting and interpreting the measurement results.
Once the capture ratio for a job or workload class is known, the total CPU utilization can
be obtained by dividing the sum of the TCB time and the SRB time by the cap-
ture ratio. The CPU capture ratio can be estimated by linear regression and other
techniques. Wicks describes how to use the regression technique in Appendix D
of [Wicks 1991]. The approximate values of the capture ratio for many types of
applications are known. For example, for CICS it is usually between 0.85 and
0.9, for TSO between 0.35 and 0.45, for commercial batch workload classes
between 0.55 and 0.65, and for scientific batch workload classes between 0.8 and
0.9.
We illustrate the calculation of capture ratios in Example 5.1.
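As a back-of-the-envelope sketch with invented numbers (not those of Example
5.1): suppose a commercial batch workload is charged 540 seconds of TCB time
and 180 seconds of SRB time over a one-hour interval, and we assume a capture
ratio of 0.6:

tcb = 540.; srb = 180.;    (* captured CPU seconds in the interval *)
captureRatio = 0.6;        (* assumed commercial batch value *)
interval = 3600.;          (* one-hour measurement interval *)
(tcb + srb)/captureRatio/interval
(* about 0.33, that is, roughly 33% total CPU utilization *)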
We provide a further discussion of the modeling study paradigm in Section
5.3.1. (We had discussed it earlier in Section 3.5.)
Simulation
The kind of simulation that is most important for modeling computer systems is
often called discrete event simulation but certainly falls within the rubric of what
Knuth calls the Monte Carlo method. Knuth, in his widely referenced book [Knuth
1981], says, "These traditional uses of random numbers have suggested the name
Monte Carlo method, a general term used to describe any algorithm that employs
random numbers."
Twenty years ago modeling computer systems was almost synonymous with
simulation. Since that time so much progress has been made in analytic queueing
theory models of computer systems that simulation has been displaced by queue-
ing theory as the modeling technique of choice; simulation is now considered by
many computer performance analysts to be the modeling technique of last resort.
Most modelers use analytic queueing theory if possible and simulation only if it
is very difficult or impossible to use queueing theory. Most current computer sys-
tem modeling packages use queueing network models that are solved analyti-
cally.
The reason for the preference by most analysts for analytic queueing theory
modeling is that it is much easier to formulate the model and takes much less
computer time to use than simulation. See, for example, the paper [Calaway
1991] we discussed in Chapter 1.
When using simulation as the modeling tool for a modeling study the first
step of the modeling study paradigm discussed in Section 5.3.1 is especially
important, that is, to define the purpose of the modeling study.
Bratley, Fox, and Schrage [Bratley, Fox, and Schrage 1987] define simula-
tion as follows:
A simulation program generates transactions (customers) and routes them through the model in the same way that a real workload moves
through a computer system. Thus visits are made to a representation of the CPU,
representations of I/O devices, etc.
To perform steps 4 and 5 of the modeling study paradigm described in Sec-
tion 5.3.1 (and more briefly in Section 3.5) requires the following basic tasks.
1. Construct the model by choosing the service centers, the service center service
time distributions, and the interconnection of the centers.
2. Generate the transactions (customers) and route them through the model to
represent the system.
3. Keep track of how long each transaction spends at each service center. The ser-
vice time distribution is used to generate these times.
4. Construct the performance statistics from the above counts.
5. Analyze the statistics.
6. Validate the model.
Of course, these same tasks are necessary for Step 6 of the modeling study
paradigm.
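Task 3, for instance, amounts to drawing service times from the chosen
distributions. A minimal sketch using the standard Statistics packages, with an
assumed exponential service time of 0.05 second:

<<Statistics`ContinuousDistributions`
(* ExponentialDistribution[lambda] has mean 1/lambda *)
serviceTimes = Table[Random[ExponentialDistribution[1/0.05]], {100}];
N[Apply[Plus, serviceTimes]/100]   (* sample mean, near 0.05 *)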
Simulation is a powerful modeling technique but requires a great deal of
effort to perform successfully. It is much more difficult to conduct a successful
modeling study using simulation than is generally believed.
Challenges of modeling a computer system using simulation include all of the
basic tasks just listed. A failure of any one of these steps can cause a failure
of the whole effort.
We discuss all of these simulation challenges with examples and exercises in
Chapter 6.
Benchmarking
There are actually two basically different kinds of benchmarking. The first kind is
defined by Dongarra et al. [Dongarra, Martin, and Worlton 1987] as "running a
set of well-known programs on a machine to compare its performance with that of
others." Every computer manufacturer runs these kinds of benchmarks and reports
the results for each announced computer system. The second kind is defined by
Artis and Domanski [Artis and Domanski 1988] as "a carefully designed and
structured experiment that is designed to evaluate the characteristics of a system
or subsystem to perform a specific task or tasks." The first kind of benchmark is
represented by the Whetstone, Dhrystone, and Linpack benchmarks.
The Artis and Domanski kind of benchmark is the type you would use to
model the workload on your current system and run it on the proposed system. It is
the most difficult kind of modeling in current use for computer systems.
Before we discuss the Artis and Domanski type of benchmark we discuss the
first type of benchmark, the kind that is called a standard benchmark.
The two best known standard benchmarks are the Whetstone and the Dhrys-
tone. The Whetstone benchmark was developed at the National Physical Labora-
tory in Whetstone, England, by Curnow and Wichmann in 1976. It was designed
to measure the speed of numerical computation and floating-point operations for
midsize and small computers. Now it is most often used to rate the floating-point
operation of scientific workstations. My IBM PC compatible 33 MHz 486 has a
Whetstone rating of 5,700K Whetstones per second. According to [Serlin 1986]
the HP3000/930 has a rating of 2,841K Whetstones per second, the IBM 4381-11
has a rating of approximately 2,000K Whetstones per second, and the IBM RT
PC a rating of 200K Whetstones per second.
The Dhrystone benchmark was developed by Weicker in 1984 to measure
the performance of system programming types of operating systems, compilers,
editors, etc. The result of running the Dhrystone benchmark is reported in Dhrys-
tones per second. Weicker in his paper [Weicker 1990] describes his original
benchmark as well as Versions 1.1 and 2.0. Weicker [Weicker 1990] not only dis-
cusses his Dhrystone benchmark but also discusses the Whetstone, Livermore
Fortran Kernels, Stanford Small Programs Benchmark Set, EDN Benchmarks,
Sieve of Eratosthenes, and SPEC benchmarks. Weicker's paper is one of the best
summary papers available on standard benchmarks.
According to QAPLUS Version 3.12, my IBM PC 33 MHz 486 compatible
executes 22,758 Dhrystones per second. According to [Serlin 1986] the IBM
3090/200 executes 31,250 Dhrystones per second, the HP3000/930 executes
10,000 Dhrystones per second, and the DEC VAX 11/780 executes 1,640 Dhrys-
tones per second, with all figures based on the Version 1.1 benchmark. However,
IBM calculates VAX MIPS by dividing the Dhrystones per second from the
Dhrystone 1.1 benchmark by 1,757; IBM evidently feels that the VAX 11/780 is
a 1,757 Dhrystones per second machine. The Dhrystone statistics on the 11/780
are very sensitive to the version of the compiler in use. Weicker [Weicker 1990]
reports that he obtained very different results running the Dhrystone benchmark
on a VAX 11/780 with Berkeley UNIX (4.2) Pascal and with DEC VMS Pascal
(V.2.4). On the first run he obtained a rating of 0.69 native MIPS and on the sec-
ond run a rating of 0.42 native MIPS. He did not reveal the Dhrystone ratings.
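The arithmetic behind such conversions is simple. Using the 1,757 Dhrystones
per second figure for the VAX 11/780 and assuming Version 1.1 ratings
throughout, my 486's rating converts as follows:

vaxMips[dhrystones_] := dhrystones/1757.
vaxMips[22758]   (* about 12.95 VAX MIPS *)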
Standard benchmarks are useful in providing at least ballpark estimates of
the capacity of different computer systems. However there are a number of prob-
lems with the older standard benchmarks such as Whetstone, Dhrystone, Lin-
pack, etc. One problem is that there are a number of different versions of these
benchmarks and vendors sometimes fail to mention which version was used. In
addition, not all vendors execute them in exactly the same way. That is appar-
ently the reason why Checkit, QAPLUS, and Power Meter report different values
for the Whetstone and Dhrystone benchmarks. Another complicating factor is the
environment in which the benchmark is run. These could include operating sys-
tem version, compiler version, memory speed, I/O devices, etc. Unless these are
spelled out in detail it is difficult to interpret the results of a standard benchmark.
Three new organizations have been formed recently with the goal of provid-
ing more meaningful benchmarks for comparing the capability of computer sys-
tems for doing different types of work. The Transaction Processing Performance
Council (TPC) was founded in 1988 at the initiative of Omri Serlin to develop
online transaction processing (OLTP) benchmarks. Just as the TPC was organized to
develop benchmarks for OLTP the Standard Performance Evaluation Corporation
(SPEC) is a nonprofit corporation formed to establish, maintain, and endorse a
standardized set of benchmarks that can be applied to the newest generation of
high-performance computers.
Drivers (RTEs)
To perform some of the benchmarks we mention in Chapter 6, such as the TPC
benchmarks TPC-A and TPC-C, a special form of simulator called a driver or
remote terminal emulator (RTE) is used to generate the online component of the
workload. The driver simulates the work of the people at the terminals or
workstations connected to the system as well as the communication equipment
and the actual input requests to the computer system under test (SUT in
benchmarking terminology). An RTE, as shown in Figure 8.7, consists of a
separate computer with special software that accepts configuration information
and executes job scripts to represent the users and thus generate the traffic to the
SUT. There are communication lines to connect the driver to the SUT. To the SUT
the input is exactly the same as if real users were submitting work from their
terminals. The benchmark program and the support software such as compilers or
database management software are loaded into the SUT, and driver scripts
representing the users are placed on the RTE system. The RTE software reads the
scripts, generates requests for service, transmits the requests over the
communication lines to the benchmark on the SUT, waits for and times the
responses from the benchmark program, and logs the functional and performance
information. Most drivers also have software for recording a great deal of
statistical performance information.
Most RTEs have two powerful software features for dealing with scripts.
The first is the ability to capture scripts from work as it is being performed. The
second is the ability to generate scripts by writing them out in the format under-
stood by the software.
All computer vendors have drivers for controlling their benchmarks. Since
there are more IBM installations than any other kind, the IBM Teleprocessing
Network Simulator (program number 5662-262, usually called TPNS) is proba-
bly the best known driver in use. TPNS generates actual messages in the IBM
Communications Controller and sends them over physical communication lines
(one for each line that TPNS is emulating) to the computer system under test.
TPNS consists of two software components, one of which runs in the IBM
mainframe or plug compatible used for controlling the benchmark and one that
runs in the IBM Communications Controller. TPNS can simulate a specified net-
work of terminals and their associated messages, with the capability of altering
network conditions and loads during the run. It enables user programs to operate
as they would under actual conditions, since TPNS does not simulate or affect
any functions of the host system(s) being tested. Thus it (and most other similar
drivers including WRANGLER, the driver used at the Hewlett-Packard Perfor-
mance Technology Center) can be used to model system performance, evaluate
communication network design, and test new application programs. A driver may
be much less difficult to use than developing a detailed simulation model but is
expensive in terms of the hardware required. One of its most
important uses is testing new or modified online programs both for accuracy and
performance. Drivers such as TPNS or WRANGLER make it possible to utilize
all seven of the uses of benchmarks described by Artis and Domanski. Kube
[Kube 1981] describes how TPNS has been used for all these activities. Of
course the same claim can be made for most commercial drivers.
Forecasting
Natural forecasting units (NFUs) are sometimes called business units or key volume
indicators because an NFU is usually a business unit. The papers [Browning
1990], [Bowerman 1987], [Reyland 1987], [Lo and Elias 1986], and [Yen 1985]
are some of the papers on NFU (business unit) forecasting that have been pre-
sented at National CMG Conferences. In their paper [Lo and Elias 1986], Lo and
Elias list a number of other good NFU forecasting papers.
The basic problem that NFU forecasting solves is that the end users, the peo-
ple who depend upon computers to get their work done, are not familiar with
computer performance units (sometimes called DPUs for data processing units)
such as interactions per second, CPU utilization, or I/Os per second, while com-
puter capacity planners are not familiar with the NFUs or the load that NFUs put
on a computer system.
Lo and Elias [Lo and Elias 1986] describe a pilot project undertaken at their
installation. According to Lo and Elias, the major steps needed for applying the
NFU forecasting technique are (I have changed the wording slightly from their
statement):
Lo and Elias used the Boole & Babbage Workload Planner software to do
the dependency analysis. This software was also used to project the future capac-
ity requirements using standard linear and compound regression techniques.
Yen, in his excellent paper [Yen 1985], describes how he predicted future
CPU requirements for his IBM mainframe computer installation from input from
users. He describes the procedure in the abstract for his paper as follows:
Yen discovered that user departments can accurately predict their magnetic disk
requirements (IBM refers to magnetic disks as DASD for direct access storage
device). They can do this because application developers know the record sizes
of files they are designing and the people who will be using the systems can make
good predictions of business volumes. Yen used 5 years of historical data
describing DASD allocations and CPU consumption in a regression study. He
made a scatter diagram in which the y-axis represented CPU hours required for a
month, Monday through Friday, 8 am to 4 pm, while the x-axis represented GB of
DASD storage installed online on the fifteenth day of that month. Yen found that
the regression line y = 34.58 + 2.59x fit the data extraordinarily well. The usual
measure of goodness-of-fit is the R-squared value, which was 0.95575. (R-squared
is also called the coefficient of determination.) In regression analysis studies, R-
squared can vary between 0, which means no correlation between x and y values,
and 1, which means perfect correlation between x and y values. A statistician
might describe the R-squared value of 0.95575 by saying, "95.575 percent of the
total variation in the sample is due to the linear association between the variables
x and y." An R-squared value larger than 0.9 means that there is a strong linear
relationship between x and y.
Yen was able to make use of his regression equation plus input from some
application development projects to predict when the next computer upgrade was
needed.
Yen no longer has the data he used in his paper but provided me with data
from December 1985 through October 1990. From this data I obtained the x and y
values plotted in Figure 8.8 together with the regression line obtained using the
package LinearRegression from the Statistics directory of Mathematica. The x
values are GB of DASD storage online as of the fifteenth of the month, while y is
the measured number of CPU hours for the month, normalized into 19 days of 8
hours per day measured in units of IBM System/370 Model 3083 J processors.
The Parameter Table in the output from the Regress program shows that the
regression line is y = 310.585 + 2.25101 x, where x is the number of GB of
online DASD storage and y is the corresponding number of CPU hours for the
month. We also see that R-squared is 0.918196 and that the estimates of the con-
stants in the regression equation are both considered significant. If you are well
versed in statistics you know what the last statement means. If not, I can tell you
that it means that the estimates look very good. Further information is provided
in the ANOVA Table to bolster the belief that the regression line fits the data very
well. However, a glance at Figure 8.8 indicates there are several points in the
scatter diagram that appear to be outliers. (An outlier is a data point that doesn't
seem to belong to the remainder of the set.) Yen has assured me that the two most
prominent points that appear to be outliers really are! The leftmost outlier is the
December 1987 value. It is the low point just above the x-axis at x = 376.6. Yen
says that the installation had just upgraded their DASD so that there was a big
jump in installed online DASD storage. In addition, Yen recommends taking out
all December points because every December is distorted by extra holidays. The
rightmost outlier is the point for December 1989, which is located at (551.25,
627.583). Yen says the three following months are outliers as well, although they
don't appear to be so in the figure. Again, the reason these points are outliers is
another DASD upgrade and file conversion.
In[3]:= <<Statistics`LinearRegression`
In[12]:= Regress[data, {1,x}, x]
Here we show the Parameter Table from Regress with the outliers removed.
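For readers who want to reproduce the shape of the computation, a self-contained
sketch follows; the data list holds four invented {GB, CPU hours} pairs, not
Yen's measurements:

<<Statistics`LinearRegression`
data = {{376.6, 450.}, {430., 600.}, {480., 780.}, {551.25, 930.}};
Regress[data, {1, x}, x]
(* the result includes the ParameterTable, R-squared, and ANOVA table *)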
8.3 Recommendations
This book is an introductory one so that, even if you have absorbed every word in
it, there is still much to be learned about computer performance management. In
this section I make recommendations about how to learn more about performance
management of computer systems from both the management and purely technical
views. There is much more material available on the technical side than the
management side. In fact, I have not been able to find even one outstanding
contemporary book on managing computer performance activities. The book
[Martin et al. 1991] is an excellent book on the management of IS that emphasizes
the importance of good performance but provides little information on how to
achieve good performance. In spite of this weakness, if you are part of IS
management, you should read this book. It provides a number of good references,
an excellent elementary introduction to computer systems as well as
telecommunications and networking, and sections on all aspects of IS
management. Another useful but brief book [Lam and Chan 1987] discusses
capacity planning from a management point of view. It features the results of an
empirical study of computer capacity planning practices based on a survey the
authors made of the 1985 Fortune 1000 companies. Lam and Chan base their
conclusions on the 388 responses received to their questionnaire. (They mailed
930 questionnaires; 388 usable replies were returned.) The Lam and Chan book
also has an excellent bibliography with both management and technical
references.
Neither of these books covers in detail some of the most important manage-
ment tools such as service level agreements, chargeback, and software perfor-
mance engineering. (The brief book [Dithmar, Hugo, and Knight 1989] provides
a lucid discussion of service level agreements with an excellent example service
level agreement with notes.) The best source for written information on these
techniques is the collection of articles mentioned in Chapter 1 and listed in the
references to that chapter. A few are listed at the end of this chapter as well. (The
papers on service level agreements [Miller 1987a] and [Duncombe 1992] are
especially recommended.) These should be supplemented with articles published
by the Computer Measurement Group in the annual proceedings for the Decem-
ber meeting and in their quarterly publication CMG Reviews. (The paper by
Rosenberg [Rosenberg 1992] is highly recommended both for its wisdom and its
entertaining style.) Another source of good management articles is The Capacity
Management Review published by the Institute for Computer Capacity Manage-
ment. This organization also publishes six volumes of their IS Capacity Manage-
ment Handbook Series, which is updated on a regular basis and contains a great
deal of information that is valuable for managers of computer installations. The
institute also publishes technical reports such as their 1989 report Managing Cus-
tomer Service.
If you are going to implement a new technique such as the negotiation of ser-
vice level agreements with your users, the implementation of a chargeback sys-
tem, or both techniques, the most efficient way to learn how to do so without
excessive pain is to attend a class or workshop on each such technique. If you
work for a company that uses techniques such as service level agreements and
chargeback, there are probably classes or workshops available, internally. If not,
the Institute for Computer Capacity Management has the following courses or
workshops that could be of help: Costing and Chargeback Workshop, Managing
IS Costs, and Managing Customer Service. [Of the 13 organizations I have iden-
tified that provide training in performance management related areas, only the
Institute for Computer Capacity Management (ICCM) offers instruction in ser-
vice level management and chargeback except, possibly, as part of a more gen-
eral course.] If you are contemplating starting a capacity planning program, there
are even more training opportunities including the following: Introduction to IS
Capacity Management (ICCM), Preparing a Formal Capacity Plan (ICCM),
Basic Capacity Planning (Watson and Walker, Inc.), and Capacity Planning
(Hitachi Data Systems).
One important area of performance management that we were unable to
include in this book is the general area of computer communication networks.
The most important application of these networks is client/server computing,
sometimes called distributed processing, cooperative processing, or even transac-
tion processing and described as "the network is the computer." I describe it in
[Allen 1993]: "Client/server computing refers to an architecture in which appli-
cations on intelligent workstations work transparently with data and applications
on other processors, or servers, across a local or wide area network." To under-
stand client/server computing you must, of course, understand computer commu-
nication networks. A very simple nontechnical introduction to such networks is
provided in Chapter 6 of [Martin et al. 1991]. For a more detailed, technical
description that is very clearly written see [Tanenbaum 1988]. (Tanenbaum's
book comes close to being the standard computer network book for technical
readers.) A more elementary discussion is provided by [Miller 1991]. I wrote a
tutorial [Allen 1993] about client/server computing. There are a number of tech-
nical books about the subject including [Berson 1992], [Inmon 1993], and [Orfali
and Harkey 1992]. The book by Inmon is the least technical of these books but
very clearly written and highly recommended. Although we do not discuss com-
puter communication networks or client/server computing in this book, many of
the tools we discussed are valuable in studying the performance of these systems.
For example, in their paper [Turner, Neuse, and Goldgar 1992], Turner et al. dis-
cuss how to use simulation to study the performance of a client/server system.
Similarly, Swink [Swink 1992] shows how SPE can be utilized in the client/
server environment.
A number of computer communication network short courses (2 to 5 days)
are taught by the following vendors: QED Information Sciences, Amdahl Educa-
tion, Data-Tech Institute, and Technology Exchange Company. There are also a
number of client/server courses including: Building Client/Server Applications
(Technology Training Corp.), How to Integrate Client-Server Into the IBM Envi-
ronment (Technology Transfer Institute), Managing the Migration to Client-
Server Architectures (Microsoft University), Analysis and Design of Client-
Server Systems (Microsoft University), and Implementing Client/Server Appli-
cations and Distributed Data (Digital Consulting, Inc.).
Installations that take this approach tend to use very simple modeling techniques
such as rules-of-thumb. Others use more sophisticated techniques such as
queueing theory or simulation but apply them to the component of the system
most likely to be the bottleneck such as the CPU or an I/O device. Very simple
queueing theory models can sometimes be applied to components. By simple we
mean an open queueing system with a single service center. Queueing theory was
originally developed for the study of telephone systems using simple but powerful
models. These same models have been used to study I/O devices including
channels and disks, caches, and LANs. My book [Allen 1990] covers these simple
queueing models as well as the more complex queueing network models used in
Chapter 4 of this book. My self-study course [Allen 1992] uses my book as a
textbook and includes a modeling package that runs under Microsoft Windows
3.x. The two volumes [Kleinrock 1975, Kleinrock 1976] are the definitive books
on queueing theory; they are praised by theoreticians as well as practitioners and
cover most aspects of the theory as it applies to computer systems and networks.
The elegant and elementary book [Hall 1991] is especially recommended for
learning beginning queueing theory, although none of the examples in the book
concern computer system performance. The book has an excellent chapter on
simulation as well as a number of examples of the use of simulation throughout
the book. Don't you wish Professor Hall had designed your computer room or the
waiting room of your HMO? It would be difficult to praise Randolph's book too highly!
The standard book on the use of analytic queueing theory network models to
study the performance of computer systems using MVA (Mean Value Analysis) is
[Lazowska et al. 1984]. More recent books on the subject include [King 1990],
[Molloy 1989] and [Robertazzi 1990]. Computer installations that use analytic
queueing theory network models often find that it is more cost effective to pur-
chase a modeling package than to develop the software required to make the cal-
culations. Most available modeling packages are described in [Howard Volume
1]. Vendors for the software also provide the training necessary to use the prod-
ucts.
Benchmarking is a skill best learned by reading articles such as [Morse 1993]
and [Incorvia 1992] and by serving an apprenticeship under an expert. There is
no royal road to benchmarking.
Forecasting is a discipline that is widely used by management, is well docu-
mented in books and articles, and is taught not only in colleges and universities
but also by those who offer training in computer performance management. In
addition, there are a number of workload forecasting tools available and listed in
[Howard Volume 1].
I hope you have found this book useful. If you have questions or suggestions
for the second edition, please write to me; if it is extremely urgent, call me. My
address is: Dr. Arnold Allen, Hewlett-Packard, 8000 Foothills Boulevard,
Roseville, CA 95747. My phone number is (916) 785-5230.
8.4 References
1. Arnold O. Allen, Probability, Statistics, and Queueing Theory with Computer
Science Applications, Second Edition, Academic Press, San Diego, 1990.
2. Arnold O. Allen, "So you want to communicate? Can open systems and the
client/server model help?," Capacity Planning and Alternative Platforms,
Institute for Computer Capacity Management, 1993.
3. Arnold O. Allen and Gary Hynes, "Approximate MVA solutions with fixed
throughput classes," CMG Transactions (71), Winter 1991, 29–37.
4. Arnold O. Allen and Gary Hynes, "Solving a queueing model with Mathemat-
ica," Mathematica Journal, 1(3), Winter 1991, 108–112.
5. H. Pat Artis and Bernard Domanski, Benchmarking MVS Systems, notes from
the course taught January 11–14, 1988, at Tysons Corner, VA.
6. Forest Baskett, K. Mani Chandy, Richard R. Muntz, and Fernando G. Palacios,
"Open, closed, and mixed networks of queues with different classes of cus-
tomers," JACM, 22(2), April 1975, 248–260.
7. Alex Berson, Client/Server Architecture, McGraw-Hill, New York, 1992.
8. James R. Bowerman, "An introduction to business element forecasting," CMG
87 Conference Proceedings, Computer Measurement Group, 1987, 703–709.
9. Paul Bratley, Bennett L. Fox, and Linus E. Schrage, A Guide to Simulation,
Second Edition, Springer-Verlag, New York, 1987.
10. Leroy Bronner, Capacity Planning: Basic Hand Analysis, IBM Washington
Systems Center Technical Bulletin, December 1983.
11. Tim Browning, "Forecasting computer resources using business elements: a
pilot study," CMG 90 Conference Proceedings, Computer Measurement
Group, 1990, 421–427.
12. Jeffrey P. Buzen, Queueing Network Models of Multiprogramming, Ph.D. dis-
sertation, Division of Engineering and Applied Physics, Harvard University,
Cambridge, MA, May 1971.
13. James D. Calaway, "SNAP/SHOT VS BEST/1," Technical Support, March
1991, 18–22.
14. C. Chatfield, The Analysis of Time Series: An Introduction, Third Edition,
Chapman and Hall, London, 1984.
15. Edward I. Cohen, Gary M. King, and James T. Brady, "Storage hierarchies,"
IBM Systems Journal, 28(1), 1989, 62–76.
16. Peter J. Denning, "RISC architecture," American Scientist, January–February
1993, 7–10.
17. Hans Dithmar, Ian St. J. Hugo, and Alan J. Knight, The Capacity Manage-
ment Primer, Computer Capacity Management Service Ltd., 1989. (Also
available from the Institute for Computer Capacity Management.)
18. Jack Dongarra, Joanne L. Martin, and Jack Worlton, "Computer benchmark-
ing: paths and pitfalls," IEEE Spectrum, July 1987, 38–43.
19. Brian Duncombe, "Managing your way to effective service level agree-
ments," Capacity Management Review, December 1992, 1–4.
20. Domenico Ferrari, Giuseppe Serazzi, and Alessandro Zeigner, Measurement
and Tuning of Computer Systems, Prentice-Hall, Englewood Cliffs, NJ, 1983.
21. Randolph W. Hall, Queueing Methods, Prentice-Hall, Englewood Cliffs, NJ,
1991.
22. Richard W. Hamming, The Art of Probability for Scientists and Engineers,
Addison-Wesley, Reading, MA, 1991.
23. John L. Hennessy and David A. Patterson, Computer Architecture: A Quanti-
tative Approach, Morgan Kaufmann, San Mateo, CA, 1990.
52. Martin Reiser, "Mean value analysis of queueing networks, a new look at an
old problem," Proc. 4th Int. Symp. on Modeling and Performance Evaluation
of Computer Systems, Vienna, 1979.
53. Martin Reiser, "Mean value analysis and convolution method for queue-
dependent servers in closed queueing networks," Performance Evaluation,
1(1), January 1981, 7–18.
54. Martin Reiser and Stephen S. Lavenberg, "Mean value analysis of closed
multichain queueing networks," JACM, 27(2), April 1980, 313–322.
55. John M. Reyland, "The use of natural forecasting units," CMG 87 Confer-
ence Proceedings, Computer Measurement Group, 1987, 710–713.
56. Thomas G. Robertazzi, Computer Networks and Systems: Queueing Theory
and Performance Evaluation, Springer-Verlag, New York, 1990.
57. Jerry L. Rosenberg, "The capacity planning manager's phrase book and sur-
vival guide," CMG 92 Conference Proceedings, Computer Measurement
Group, 1992, 983–989.
58. Omri Serlin, "MIPS, Dhrystones and other tales," Datamation, June 1986,
112–118.
59. Carol Swink, "SPE in a client/server environment: a case study," CMG 92
Conference Proceedings, Computer Measurement Group, 1992, 271–276.
60. Andrew S. Tanenbaum, Computer Networks, Second Edition, Prentice-Hall,
Englewood Cliffs, NJ, 1988.
61. Michael Turner, Douglas Neuse, and Richard Goldgar, "Simulating optimizes
move to client/server applications," CMG 92 Conference Proceedings, Com-
puter Measurement Group, 1992, 805–812.
62. Reinhold P. Weicker, "An overview of common benchmarks," IEEE Com-
puter, December 1990, 65–75.
63. Peter D. Welch, "The statistical analysis of simulation results," in Computer
Performance Modeling Handbook, Stephen S. Lavenberg, Ed., Academic
Press, New York, 1983.
64. Raymond J. Wicks, Balanced Systems and Capacity Planning, IBM Wash-
ington Systems Center Technical Bulletin GG22-9299-03, September 1991.
The Remove command allows you to erase the global version of Regress so you
can access the LinearRegression version of Regress as we show in the following
Mathematica session segment, which is slightly scrambled because some of the
output is too wide for the page.
In[5]:=<<first.m
In[6]:= perform[23,45]
In[9]:= <<Statistics`LinearRegression`
In[11]:= Remove[Regress]
Sometimes, when you have loaded a number of packages the contexts can get so
scrambled that you must sign off from Mathematica with the Quit command and
start over again.
Version 2.0 of Mathematica provides a number of help messages that were not
present in Version 1.2. These messages are sometimes very useful and at other
times seem like useless nagging. The help messaging system gets very exercised
if you use names that are similar. For example, if you type function = 12, you
will get the following message:
General::spell1:
Possible spelling error: new symbol name "function"
is similar to existing symbol "Function".
This may be your first warning that Function is the name of a Mathematica function.
You can get a similar message by entering frank = 12 and mfrank = 1. The
warning message is:
General::spell1:
Possible spelling error: new symbol name "mfrank"
is similar to existing symbol "frank".
Messages like this can be a little annoying but come with the territory.
Abell and Braselton wrote two books about Mathematica which were pub-
lished in 1992. In the first book, Mathematica by Example, they provide several
examples of the use of the package LinearRegression.m as well as a number of
other packages that come with Mathematica Version 2.0. In their second book,
Mathematica Handbook, they provide even more discussion of the packages.
Both of their books are heavily oriented toward the Macintosh Mathematica front
end but provide a great many examples that can be appreciated by anyone with
any version of Mathematica. At the time of this writing (June 1993) the Macin-
tosh and NeXT Mathematica front ends are more elaborate than those for the vari-
ous UNIX versions or the two versions for the IBM PC and compatibles. Rumors
abound that the long-awaited X-Windows front end will be available soon.
The package stored in the file first.m and that stored in work.m follow.
BeginPackage["first`"]
first::usage="This is a collection of functions used in this book."
perform::usage="perform[A_, B_] calculates the percentage faster one machine
is over the other where A is the execution time on machine A and B is the execution time on machine B."
Begin["first`private`"]
perform[A_, B_] :=
(* A is the execution time on machine A *)
(* B is the execution time on machine B *)
Block[{n, m},
n = ((B-A)/A) 100;
m = ((A-B)/B) 100;
If[A <= B, Print["Machine A is n% faster than machine B where n = ", N[n,10]],
Print["Machine B is n% faster than machine A where n = ", N[m, 10]]]; ]
speedup[enhanced_, speedup_] :=
(* enhanced is percent of time in enhanced mode *)
(* speedup is speedup while in enhanced mode *)
Block[{frac, speed},
frac = enhanced / 100;
speed = 1/(1 - frac + frac / speedup);
Print["The speedup is ", N[speed, 8]]; ]
nancy[n_] :=
Block[{i,trials, average,k},
(* trials counts the number of births *)
(* for each couple. It is initialized to zero. *)
trials=Table[0, {n}];
For[i=1, i<=n, i++,
While[True,trials[[i]]=trials[[i]]+1;
If[Random[Integer, {0,1 }]>0, Break[]] ];];
(* The while statement counts the number of births for couple i. *)
(* The while is set up to test after a pass through the loop *)
(* so we can count the birth of the first girl baby. *)
average=Sum[trials[[k]], {k, 1, n}]/n;
Print["The average number of children is ", average];
]
trial[n_] :=
Block[{switch=0, noswitch=0},
correctdoor=Table[Random[Integer, {1,3}], {n}];
firstchoice=Table[Random[Integer, {1,3}], {n}];
For[i=1, i<=n, i++,
If[Abs[correctdoor[[i]]-firstchoice[[i]]]>0,
switch=switch+1, noswitch=noswitch+1]];
Return[{N[switch/n,8],N[noswitch/n,8]}];
]
makeFamily[] :=
Block[{
children = { }
},
While[Random[Integer] == 0,
AppendTo[children, girl]
];
Append[children, boy]
]
makeFamily::usage="makeFamily[] returns a list of children."
numChildren[n_Integer] :=
Block[{
allChildren
},
allChildren = Flatten[Table[makeFamily[ ], {n}]];
{
avgChildren -> Length[allChildren]/n,
avgBoys -> Count[allChildren, boy]/n,
avgGirls -> Count[allChildren, girl]/n
}
]
numChildren::usage="numChildren[n] returns statistics on
the number of children from n families."
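(* Example use, not part of the original package: after loading first.m, *)
(* numChildren[1000] returns rules such as avgChildren, avgBoys, and *)
(* avgGirls averaged over 1000 simulated families. *)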
BeginPackage["work","StatisticsNormalDistribution", "StatisticsCom-
monDistributionsCommon", "StatisticsContinuousDistributions"]
work::usage="This is a collection of functions used in this book."
sopen::usage="sopen[lambda_, v_?VectorQ, s_?VectorQ] computes the perfor-
mance statistics for the single workload class open model of Figure 4.1. For this
program lambda is the average throughput, v is the vector of visit ratios for the
service centers, and s is the vector of the average service time per visit for each
service center."
sclosed::usage="sclosed[N_?IntegerQ, D_?VectorQ, Z_] computes the perfor-
mance statistics for the single workload class closed model of Figure 4.2. N is the
number of terminals, D is the vector of service demands and Z is the mean think
time at each terminal."
mopen::usage="mopen[lambda_, d_] computes the performance statistics for the
multiple workload class open model of Figure 4.1. For this program lambda is the
average throughput and d is the C by K matrix of service demands."
cent::usage="cent[N_?IntegerQ, D_?VectorQ] computes the performance statis-
tics for the central server model with fixed MPL N. D is the service demand vector."
online::usage="online[m_?IntegerQ, Demands_?VectorQ, N_?IntegerQ, T_]
computes the performance statistics for a terminal system with a FESC to replace
the central server model of the computer system. The program subcent is used to
calculate the rates needed as input. The maximum multiprogramming level allowed
is m. Demands is the vector of service demands. N is the number of active termi-
nals or workstations connected to the computer system. T is the mean think time
for the users at the terminals."
subcent::usage="Computes the throughput for a central server model with fixed
MPL."
Exact::usage="Exact[Pop_?VectorQ, Think_?VectorQ, Demands_?MatrixQ]
computes the performance statistics for the closed multiworkload class model of
Figure 4.2. Pop is the vector of population by class. Think is the vector of think
time by class and Demands is the C by K matrix of service demands."
Approx::usage="Approx[Pop_?VectorQ, Think_?VectorQ, Demands_?MatrixQ,
epsilon_Real] computes the performance statistics for the closed multiworkload
class model of Figure 4.2 using an approximation technique. Pop is the vector of popu-
lation by class. Think is the vector of think time by class and Demands is the C
by K matrix of service demands. The parameter epsilon determines how accu-
rately the algorithm attempts to calculate the solution."
Print[ ] ;
Print[
SequenceForm[
ColumnForm[ Join[ {"Class#", "------"}, Range[C] ], Right ],
ColumnForm[ Join[ {" TPut", "-----------"}, SetAccuracy[ lambda, 6] ], Right ],
ColumnForm[ Join[ {" Number", "---------"}, SetAccuracy[ L, 6] ], Right ],
ColumnForm[ Join[ {" Resp", "--------------"}, SetAccuracy[R,6]], Right ]]];
Print[ ];
Print[ ];
Print[
SequenceForm[
ColumnForm[ Join[ {"Center#", "------"}, Range[K] ], Right ],
ColumnForm[ Join[ {" Number", "---------"}, SetAccuracy[ number, 6] ], Right ],
ColumnForm[ Join[ {" Utiliz", "----------"}, SetAccuracy[u1,6]], Right ]]];
]
cent[N_?IntegerQ, D_?VectorQ]:=
(* central server model *)
(* k is number of service centers *)
(* N is MPL, D is service demand vector *)
Block[{L, w,k, wn, n, lambdan, rho},
k = Length[D];
L = Table[0, {k}];
For[n=1, n<=N, n++, w=D*(L+1); wn=Apply[Plus,w]; lambdan=n/wn;
L = lambdan w];
qplus=Join[{q0},q];
probin = Flatten[{Take[qplus, m], 1 - Apply[Plus, Take[qplus, m]]}];
numberin = Drop[probin, 1]. Range[1,m];
timein = numberin / lambda;
numberinqueue = L - numberin;
timeinqueue = numberinqueue / lambda;
U = lambda * Demands;
k = Length[Demands];
(* lambda is mean throughput *)
(* w is mean response time *)
(* qplus is vector of conditional probabilities *)
Print[];
Print[];
Print["The average number of requests in process is ", L];
Print["The average system throughput is ", lambda];
Print["The average system response time is ", w];
Print["The average number in main memory is ", SetAccuracy[numberin,5]];
Print[ ];
Print[ ];
Print[
SequenceForm[
ColumnForm[ Join[ {"Center#", "-------"}, Range[k] ], Right],
ColumnForm[ Join[ {" Utiliz", "-----------"}, SetAccuracy[U,6]], Right ]]];
]
subcent[k_?IntegerQ, N_?IntegerQ, D_?VectorQ]:=
(*central server model *)
(* k is number of service centers *)
(* N is MPL, D is service demand vector *)
Block[{L, w, wn, n, lambdan, rho},
L=Table[0, {k}];
For[n=1, n<=N, n++, w=D*(L+1); wn=Apply[Plus,w]; lambdan=n/wn;
L=lambdan w; rho=lambdan D];
(* lambdan is mean throughput *)
(* wn is mean time in system *)
(* L is vector of number at servers *)
(* rho is vector of utilizations *)
Return[{lambdan}];
]
srate[m_?IntegerQ, Demands_?VectorQ] :=
Block[{n},
k = Length[Demands];
rate = {};
For[n = 1, n<=m, n++, rate = Join[ rate, subcent[k, n, Demands]]];
Return[{rate}];
]
m[[x]]-- ;
x--;
While[ (x >= 1) && (m[[x]] == Pop[[x]]), x--];
If[x < 1, Return[{ }] ];
m[[x]]++;
FixPerm[numC, m, Pop ]
]
qtemp = x . r;
If[OddQ[n], q2[v]=qtemp, q1[v]=qtemp ];
v = NextPerm[numC, Pop, v] ];
If[OddQ[n], Clear[q1], Clear[q2]]
];
cr = Apply[Plus, r, 1];
su = x . Demands;
l = x . r;
Print[ ];
Print[ ];
Print[
SequenceForm[
ColumnForm[ Join[ {"Class#", "------"}, Range[numC] ], Right ],
ColumnForm[ Join[ {" Think", "-----"}, Think], Right],
ColumnForm[ Join[ {" Pop", "-----"}, Pop], Right],
ColumnForm[ Join[ {" Resp", "---------"}, SetAccuracy[ cr, 6] ], Right],
ColumnForm[ Join[ {" TPut", "----------"}, SetAccuracy[ x, 6] ], Right] ] ];
Print[ ];
Print[ ];
Print[
SequenceForm[
ColumnForm[ Join[ {"Center#", "------"}, Range[numK] ], Right],
ColumnForm[ Join[ {" Number", "-----------"}, SetAccuracy[ l, 6] ], Right],
ColumnForm[ Join[ {" Utiliz", "-------------"}, SetAccuracy[su, 6]], Right ]]];
Flag = True ;
While[Flag==True,
Flag = False;
q = Table[(newQueueLength = x[[c]] r[[c,k]];
If[ Abs[ q[[c,k]] - newQueueLength] >= epsilon, Flag=True];
newQueueLength),
{c, 1, numC}, {k, 1, numK} ];
];
su = x. Demands ;
number = x . r ;
Print[ ] ;
Print[ ] ;
Print[
SequenceForm[
ColumnForm[ Join[ {"Class#", "------"}, Table[ c, {c,1,numC} ] ], Right],
ColumnForm[ Join[ {" Think", "------"}, Think], Right],
ColumnForm[ Join[ {" Pop", "------"}, Pop], Right],
ColumnForm[ Join[ {" Resp", "-------------"}, SetAccuracy[ cr, 6] ], Right],
ColumnForm[ Join[ {"TPut", "-----------"}, SetAccuracy[ x, 6] ], Right] ] ];
Print[ ];
Print[ ];
Print[
SequenceForm[
ColumnForm[ Join[ {"Center#", "------"}, Table[ c, {c,1,numK} ] ], Right ],
ColumnForm[ Join[ {"number", "--------------"}, SetAccuracy[number, 6]], Right ],
ColumnForm[ Join[ {" Utilization", "-----------"}, SetAccuracy[su, 6]], Right ]]];
] /; Length[Pop] == Length[Think] == Length[Demands]
Block[ {Flag, Rck, Xc, newQ, Qck, Rc, Qk, Uk, Pc, Tc,
numC = Length[Nc], numK = Dimensions[Dck][[2]] },
Tc = N[ Zc + Apply[Plus, Dck, 1] ];
Pc = N[ Table[ If[NumberQ[ Nc[[c]] ], Nc[[c]],
If[Zc[[c]]==0, 1, 100] ], {c, 1, numC} ] ];
Qck = Table[ Dck[[c,k]] / Tc[[c]] Pc[[c]], {c, 1, numC}, {k, 1, numK} ];
Flag = True;
While[Flag==True,
Qk = Apply[Plus, Qck];
Rck = Table[ Dck[[c,k]]*
(1+ Qk[[k]] - Qck[[c,k]] + Qck[[c,k]] *
If[ Pc[[c]] < 1, 0, ((Pc[[c]]-1)/Pc[[c]])] ),
{c,1,numC},{k,1,numK}];
Rc = Apply[Plus, Rck, 1 ];
{c, 1, numC} ] ;
Flag = False;
Qck = Table[(newQ = Xc[[c]] Rck[[c,k]];
If[ Abs[ Qck[[c,k]] - newQ] >= epsilon, Flag=True];
newQ), {c, 1, numC}, {k, 1, numK} ];
];
Uk = Xc . Dck;
Qk = Xc . Rck;
Print[ ];
Print[ ];
Print[
SequenceForm[
ColumnForm[ Join[ {"Class#", "----------"}, Table[ c, {c,1,numC} ] ], Right],
ColumnForm[ Join[ {"ArrivR", "-----------------"}, Ac], Right],
ColumnForm[ Join[ {" Pc", "---------------"}, Pc], Right ]]];
Print[ ];
Print[ ];
Print[
SequenceForm[
ColumnForm[ Join[ {"Class#", "-----------"}, Table[ c, {c,1,numC} ] ], Right ],
Flag = True ;
While[Flag==True,
cr = Apply[Plus, r, 1];
x = Pop / (Think + cr);
Flag = False;
utilize = x. Demands;
number = x . r;
Print[ ];
Print[ ];
Print[
SequenceForm[
ColumnForm[ Join[ {"Class#", "-------"}, Range[numC] ], Right],
ColumnForm[ Join[ {" Think", "-----"}, Think], Right],
ColumnForm[ Join[ {" Pop", "----------"}, Pop], Right],
ColumnForm[ Join[ {" Resp", "--------------"}, SetAccuracy[ cr, 6] ],Right],
ColumnForm[ Join[ {" TPut", "---------------"}, SetAccuracy[ x, 6] ], Right] ] ];
Print[ ];
Print[ ];
Print[
SequenceForm[
ColumnForm[ Join[ {"Center#", "----------"}, Table[ c, {c,1,numK} ] ], Right],
ColumnForm[ Join[ {"Number", "---------------"}, SetAccuracy[number, 6]], Right],
ColumnForm[ Join[ {" Utiliz", "----------"}, SetAccuracy[utilize, 6]], Right ]]];
mm1[lambda_, es_] :=
Block[{wq, rho, w, l, lq, piq90, piw90},
rho=lambda es;
w =es/(1-rho);
wq =rho w;
l=lambda w;
lq=lambda wq;
piq90=N[Max[w Log[10 rho], 0], 10];
piw90=N[w Log[10], 10];
Print[];
Print["The server utilization is ", rho];
Print["The average time in the queue is ", wq];
Print["The average time in the system is ", w];
Print["The average number in the queue is ", lq];
Print["The average number in the system is ", l];
Print["The average number in a nonempty queue is ", 1/(1-rho)];
Print["The 90th percentile value of q is ", piq90];
Print["The 90th percentile value of w is ", piw90]
]
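As a quick check of mm1 (a usage sketch, not part of the package listing), an
M/M/1 queue with an arrival rate of 0.5 requests per second and a mean service
time of 1 second, so that the server utilization is 0.5, can be run as:

mm1[0.5, 1.0]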
A.2 References
1. Martha L. Abell and James P. Braselton, Mathematica by Example, Academic
Press, Boston, 1992.
2. Martha L. Abell and James P. Braselton, Mathematica Handbook, Academic
Press, Boston, 1992.
3. Nancy Blachman, Mathematica: A Practical Approach, Prentice Hall, Engle-
wood Cliffs, NJ, 1992.
A Performance Corporation
Barbeau, Ed, 57
Abell, Martha L., xix, 327, 346 baseline system, 120
ACM Computing Surveys, 122 Baskett, Forrest, 125, 180, 286, 319
ACM Sigmetrics, 52 batch window, 10
ALLCACHE, 74 BCMP networks, 126, 286
Allen, Arnold 0., 57, 115, 124, 140, Bell, C. Gordon, 63, 74, 97
146, 180, 290, 315, 319 Benard, Phillipe, xvii
Amdahls law, 65, 66, 73, 275 benchmark, 203
Anderson, James W., Jr., 80 Debit-Credit, 239
Application optimization, 3 Dhrystone, 37, 70, 232, 233, 302
Approx (Mathematica program),xvi, Linpack, 37, 232, 302
142, 143, 145, 148, 149, 153, 158, Livermore Fortran Kernels, 234,
177, 194, 290, 339 303
Approximate MVA algorithm with Livermore Loops, 37
fixed throughput classes, 146 Sieve of Eratosthenes, 234, 303
arrival theorem, 134, 288 standard, 232, 302
Arteaga, Jane, xvii Stanford Small Programs Bench-
Artis, H. Pat, xviii, xix, 87, 97, 231, mark Set, 234, 303
247, 255, 302, 305, 319 SYSmark92 benchmark suite, 242
Association for Computing Machinery TPC Benchmark A (TPC-A), 239
(ACM), 52 TPC Benchmark B (TPC-B), 240
autoregressive methods, 214 TPC Benchmark C (TPC-C), 241
auxiliary memory, 78, 87 Whetstone, 37, 70, 232, 233, 302
benchmarking, 37, 203, 231
Bentley, Jon, 44, 58, 223, 255
B Beretvas, Thomas, 87, 97
Backman, Rex, 13, 57 Berra,Yogi, 189
back-of-the-envelope calculations, 27, Berson,Alex, 315, 319
28, 39, 126 Best/1 MVS, 36, 136, 169, 191, 205,
back-of-the-envelope modeling, 27, 28 266
Bailey, David H., 57 Blachman, Nancy, xiii, xix, 53, 54,
Bailey, Herbert R., 53, 57 58, 325, 328, 346
Computer Measurement Group, 4, 52, Domanski, Bernard, 46, 59, 231, 247,
314 250, 255, 302, 305, 319
computer performance tools, 41 Dongarra, Jack, 37, 59, 231, 249, 255,
application optimization, 44 302, 320
capacity planning, 45 driver, 204, 299
diagnostic, 42 Duket, S. D., 214
expert system, 45 Duncombe, Brian, 40, 59, 314, 320
resource management, 43 dynamic path selection (DPS), 83
ComputerWorld, 91
confidence interval
for estimate, 213 E
Conley, Sean, xviii Eager, Derek L., 75, 97
convolution algorithm, 161, 293 Einstein, Albert, xi, 125
Corcoran, Elizabeth, 97 Elkema, Mel, xvii
CPExpert, 46, 47, 48, 58 Elias, J. P., 261, 270, 308, 322
CPF, 92, 238 end effects, 191
CPI (cycles per instruction), 68 end users, 261, 308
CPU (Central Processing Unit), 67 Engberg, Tony, xvi, xvii, 233, 255
cpu (Mathematica program), 71, 95, Enterprise Systems Connection
96, 330 (ESCON), 88
CPU bound, 117, 285 Escalante, Jaime, xv
Crandall, Richard E, xix evaluation phase, 121
critical success factor, 39 Exact (Mathematica program), 114,
116, 123, 140, 141, 142, 143, 145,
J
Jackson networks, 125, 286
Johnson, Robert H., 82, 98
Judson, Horace Freeland, 101

K
Kaashoek, M. Frans, 99
Kahaner, David K., 321
Kaplan, Carol, xviii
Katz, Randy H., 98, 322
Kelly-Bootle, Stan, 203
Kelton, W. D., 214
Kelvin, Lord, xi
Kendall Square Research, 74
kernel, 249
key volume indicators (KVI), 259, 307
King, Gary M., 77, 97, 277, 320
King, Peter J. B., 321, 322
Kleinrock, Leonard, xvii, xix, 75, 98, 122, 124, 206, 256, 321
Knight, Alan J., 58, 320
knowledge base, 46
knowledge domain, 46
Knuth, Donald E., 3, 44, 60, 215, 218, 223, 228, 321
Kobayashi, Hisashi, 205, 208
KSR-1, 74
Kube, C. B., 247, 305, 321
KVI, see key volume indicators

L
Lam, Shui F., 314, 321
latency, 82
Lindholm, Elizabeth, 98
linear projection, 29
linear regression, 30
LinearRegression (Mathematica package), 263, 309
Lipsky, Lester, 86, 98
Little, John D. C., 111, 283, 321
Little's law, 111, 113, 118, 134, 282, 288, 289
Lo, Dr. T. L., xviii, 261, 270, 308, 322

M
M/M/1 queueing system, 25, 206
MacArthur Prize Fellowship, 35
MacDougall, Myron H., 208, 210, 211, 322
MacNair, Edward A., 214, 230, 256, 322
Maeder, Roman, xiii, xix
Majors, Joe, 78
makeFamily (Mathematica program), 329
MAP, 136, 169, 191, 192, 205
mapped files, 89
Markham, Chris, xviii
Markowitz, Harry M., 206
Marsaglia, George, 221, 222, 256
Martin, E. Wainright, 313, 322
Martin, Joanne L., 37, 59, 231, 255, 302, 320
massively parallel computers, 73, 275
Matick, Richard E., 79, 98
McBride, Doug, 12, 60
mean value analysis (MVA), 125, 285
S
Sahai, Dr. Anil, xviii
Samson, Stephen L., xviii, 25, 27, 60, 87, 98
Santos, Richard, xvii
saturated server, 116, 285
Sauer, Charles H., 214, 230, 256, 322
Sawyer, Tom, 248
Schardt, Richard M., 78, 86, 98
Schatzoff, Martin, 38, 60
Schrage, Linus E., 32, 58, 213, 230, 255, 300, 319
Schrier, William M., 60
sclosed (Mathematica program), 133, 135, 141, 142, 172, 175, 176, 334
scopeux (Hewlett-Packard collector for HP-UX system), 266
secondary cache, 79
sector, 81
seed, 216
  initial, 216, 217
seek, 81
simulation, ..., 229
  discrete event, 204, 300
  languages, 229
  Monte Carlo, 204, 299
simulation languages
  GPSS, 37, 229
  PAWS, 229
  RESQ, 229
  SCERT II, 229
  SIMSCRIPT, 37, 206, 229
simulation modeling, 32, 300
simulation modeling package
  MATLAN, 231
simulator, 206
Singh, Yogendra, 80
single class closed MVA algorithm, 132, 287
SLA, 12, 13
SMF (System Management Facility), 136
Smith, Connie, 18, 61
SNAP/SHOT, 36, 37, 169, 170, 255
software performance engineering (SPE), 17, 18
verification, 120
Vince, N. C., 61
Vicente, Norbert, xvii
von Neumann, John, 53, 75, 215
vos Savant, Marilyn, 32, 33, 61

W

Y
Yen, Kaisson, 262, 263, 265, 266, 270, 308, 309, 310, 324

Z
Zahorjan, John, 97, 124, 321
Zaman, Arif, 222, 256
Zeigner, Alessandro, 181, 183, 186, 201, 320
Zimmer, Harry, 23, 61
We will frequently use two other powerful strengths of Mathematica: the
advanced programming language that is built into it and its graphical capabilities.
In the example below we show how easy it is to use Mathematica to generate
the points needed for a graph and then to make the graph. If you are new to
computer performance analysis you may not understand some of the parameters
used; they will be defined and discussed in the book. If you want to reproduce the
graph you will need to load the package work.m. The Mathematica program
Approx generates the response times for workers who are using terminals as we
allow the number of user terminals to vary from 20 to 70. We assume there are
also 25 workers at terminals running another application on the computer
system. The vector Think gives the think times for the two job classes and the
array Demands provides the service requirements for the job classes. (We will
define think time and service requirements later.)
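To make this concrete, here is a minimal sketch of what such a session might
look like. The calling convention Approx[population, Think, Demands], the use
of First to select the response time of the first job class, and all of the numeric
values below are assumptions made for this illustration only; the actual
definitions are supplied by the package work.m and are explained in the book.

In[1]:= << work.m   (* load the package that defines Approx *)

In[2]:= Think = {30.0, 45.0};   (* think times for the two job classes *)

In[3]:= Demands = {{0.40, 0.12}, {0.25, 0.20}};   (* service demands by class and device *)

In[4]:= points = Table[{n, First[Approx[{n, 25}, Think, Demands]]},
            {n, 20, 70, 5}]   (* class 1 terminals vary; class 2 fixed at 25 *)

In[5]:= ListPlot[points, PlotJoined -> True,
            AxesLabel -> {"Terminals", "Response Time"}]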
Acknowledgments
Many people helped bring this book into being. It is a pleasure to acknowledge
their contributions. Without the help of Gary Hynes, Dan Sternadel, and Tony
Engberg from Hewlett-Packard in Roseville, California, this book could not have
been written. Gary Hynes suggested that such a book should be written and
provided an outline of what should be in it. He also contributed to the
Mathematica programming effort and provided a usable scheme for printing the
output of Mathematica programs: piles of numbers are difficult to interpret! In
addition, he supplied some graphics and got my workstation organized so that it
was possible to do useful work with it. Dan Sternadel lifted a big administrative
load from my shoulders so that I could spend most of my time writing. He
arranged for all the hardware and software tools I needed as well as FrameMaker
and Mathematica training. He also handled all the other difficult administrative
problems that arose. Tony Engberg, the R & D Manager for the Software
Technology Division of Hewlett-Packard, supported the book from the beginning.
He helped define the goals for and contents of the book and provided some very
useful reviews of early drafts of several of the chapters.
Thanks are due to Professor Leonard Kleinrock of UCLA. He read an early
outline and several preliminary chapters and encouraged me to proceed. His two-
volume opus on queueing theory has been a great inspiration for me; it is an out-
standing example of how technical writing should be done.
A number of people from the Hewlett-Packard Performance Technology
Center supported my writing efforts. Philippe Benard has been of tremendous
assistance. He helped conquer the dynamic interfaces between UNIX, Frame-
Maker, and Mathematica. He solved several difficult problems for me including
discovering a method for importing Mathematica graphics into FrameMaker and
coercing FrameMaker into producing a proper Table of Contents. Tom Milner
became my UNIX advisor when Philippe moved to the Hewlett-Packard Cuper-
tino facility. Jane Arteaga provided a number of graphics from Performance
Technology Center documents in a format that could be imported into Frame-
Maker. Helen Fong advised me on RTEs, created a nice graphic for me, proofed
several chapters, and checked out some of the Mathematica code. Jim Lewis read
several drafts of the book, found some typos, made some excellent suggestions
for changes, and ran most of the Mathematica code. Joe Wihnyk showed me how
to force the FrameMaker HELP system to provide useful information. Paul Prim-
mer, Richard Santos, and Mel Eelkema made suggestions about code profilers
and SPT/iX. Mel also helped me describe the expert system facility of HP Glan-
cePlus for MPE/iX. Rick Bowers proofed several chapters, made some helpful
suggestions, and contributed a solution for an exercise. Jim Squires proofed sev-
eral chapters and made some excellent suggestions. Gerry Wade provided some
insight into how collectors, software monitors, and diagnostic tools work. Sharon
Riddle and Lisa Nelson provided some excellent graphics. Dave Gershon con-
verted them to a format acceptable to FrameMaker. Tim Gross advised me on
simulation and handled some ticklish UNIX problems. Norbert Vicente installed
FrameMaker and Mathematica for me and customized my workstation. Dean
Coggins helped me keep my workstation going.
Some Hewlett-Packard employees at other locations also provided support
for the book. Frank Rowand and Brian Carroll from Cupertino commented on a
draft of the book. John Graf from Sunnyvale counseled me on how to measure
the CPU power of PCs. Peter Friedenbach, former Chairman of the Executive
Steering Committee of the Transaction Processing Performance Council (TPC),
advised me on the TPC benchmarks and provided me with the latest TPC bench-
mark results. Larry Gray from Fort Collins helped me understand the goals of the
Standard Performance Evaluation Corporation (SPEC) and the new SPEC bench-
marks. Larry is very active in SPEC. He is a member of the Board of Directors,
Chair of the SPEC Planning Committee, and a member of the SPEC Steering
Committee. Dr. Bruce Spenner, the General Manager of Disk Memory at Boise,
advised me on Hewlett-Packard I/O products. Randi Braunwalder from the same
facility provided the specifications for specific products such as the 1.3-inch
Kittyhawk drive.
Several people from outside Hewlett-Packard also made contributions. Jim
Calaway, Manager of Systems Programming for the State of Utah, provided
some of his own papers as well as some hard-to-find IBM manuals, and
reviewed the manuscript for me. Dr. Barry Merrill from Merrill Consultants
reviewed my comments on SMF and RMF. Pat Artis from Performance Associ-
ates, Inc. reviewed my comments on IBM I/O and provided me with the manu-
script of his book, MVS I/O Subsystems: Configuration Management and
Performance Analysis, McGraw-Hill, as well as his Ph.D. dissertation. (His
coauthor for the book is Gilbert E. Houtekamer.) Steve Samson from Candle Cor-
poration gave me permission to quote from several of his papers and counseled
me on the MVS operating system. Dr. Anil Sahai from Amdahl Corporation
reviewed my discussion of IBM I/O devices and made suggestions for improve-
ment. Yu-Ping Chen proofed several chapters. Sean Conley, Chris Markham, and
Marilyn Gibbons from Frame Technology Technical Support provided extensive
help in improving the appearance of the book. Marilyn Gibbons was especially
helpful in getting the book into the exact format desired by my publisher. Brenda
Feltham from Frame Technology answered my questions about the Microsoft
Windows version of FrameMaker. The book was typeset using FrameMaker on a
Hewlett-Packard workstation and on an IBM PC compatible running under
Microsoft Windows. Thanks are due to Paul R. Robichaux and Carol Kaplan for
making Sean, Chris, Marilyn, and Brenda available. Dr. T. Leo Lo of McDonnell
Douglas reviewed Chapter 7 and made several excellent recommendations. Brad
Horn and Ben Friedman from Wolfram Research provided outstanding advice on
how to use Mathematica more effectively.
Thanks are due to Wolfram Research not only for asking Brad Horn and Ben
Friedman to counsel me about Mathematica but also for providing me with
Mathematica for my personal computer and for the HP 9000 computer that sup-
ported my workstation. The address of Wolfram Research is
Wolfram Research, Inc.
P. O. Box 6059
Champaign, Illinois 61821
Telephone: (217)398-0700
Brian Miller, my production editor at Academic Press in Boston, did an excel-
lent job of producing the book under a tight schedule. Finally, I would like
to thank Jenifer Niles, my editor at Academic Press Professional, for her encour-
agement and support during the sometimes frustrating task of writing this book.
References
1. Martha L. Abell and James P. Braselton, Mathematica by Example, Academic
Press, 1992.
2. Martha L. Abell and James P. Braselton, The Mathematica Handbook, Aca-
demic Press, 1992.
3. Nancy R. Blachman, Mathematica: A Practical Approach, Prentice-Hall,
1992.
4. Richard E. Crandall, Mathematica for the Sciences, Addison-Wesley, 1991.
5. Theodore Gray and Jerry Glynn, Exploring Mathematics with Mathematica,
Addison-Wesley, 1991.
6. Leonard Kleinrock, Queueing Systems, Volume I: Theory, John Wiley, 1975.
7. Leonard Kleinrock, Queueing Systems, Volume II: Computer Applications,
John Wiley, 1976.
8. Roman Maeder, Programming in Mathematica, Second Edition, Addison-
Wesley, 1991.
9. Stan Wagon, Mathematica in Action, W. H. Freeman, 1991.
10. Stephen Wolfram, Mathematica: A System for Doing Mathematics by Com-
puter, Second Edition, Addison-Wesley, 1991.