CSC 301 Lecture VII
Performance Measurement and Analysis
Chapter 11 Objectives
11.1 Introduction
11.2 The Basic Computer Performance Equation
• The basic computer performance equation has
been useful in our discussions of RISC versus
CISC:
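In its standard form, the basic computer performance equation expresses CPU time as the product of three factors:

```latex
\frac{\text{seconds}}{\text{program}} =
\frac{\text{instructions}}{\text{program}} \times
\frac{\text{average cycles}}{\text{instruction}} \times
\frac{\text{seconds}}{\text{cycle}}
```

RISC designs aim to lower the middle factor (cycles per instruction), while CISC designs aim to lower the first (instructions per program), which is why this equation frames the RISC-versus-CISC discussion.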
• We have also learned that CPU efficiency is not the
sole factor in overall system performance. Memory
and I/O performance are also important.
• Amdahl’s Law tells us that the system performance
gain realized from the speedup of one component
depends not only on the speedup of the component
itself, but also on the fraction of work done by the
component:
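In its usual statement, Amdahl's Law gives the overall speedup S attainable when a fraction f of the work is sped up by a factor k:

```latex
S = \frac{1}{(1 - f) + \dfrac{f}{k}}
```

Note that as k grows without bound, S approaches 1/(1 - f): the fraction of work that is not improved places a hard ceiling on the overall gain.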
• In short, using Amdahl’s Law we know that we need
to make the common case fast.
• So if our system is CPU bound, we want to make
the CPU faster.
• A memory bound system calls for improvements in
memory management.
• The performance of an I/O bound system will
improve with an upgrade to the I/O system.
Of course, fixing a performance problem in one part of the
system can expose a weakness in another part of the system!
11.3 Mathematical Preliminaries
11.4 Benchmarking
11.5 CPU Performance Optimization
• CPU optimization includes many of the topics that
have been covered in preceding chapters.
– These include pipelining, parallel execution units,
and integrated floating-point units.
• We have not yet explored two important CPU
optimization topics: branch optimization and user
code optimization.
• Both of these can affect performance in dramatic
ways.
• We know that pipelines offer significant execution
speedup when the pipeline is kept full.
• Conditional branch instructions are a type of pipeline
hazard that can result in flushing the pipeline.
– Other hazards include resource conflicts, data
dependencies, and memory access delays.
• Delayed branching offers one way of dealing with
branch hazards.
• With delayed branching, one or more instructions
following a conditional branch are sent down the
pipeline regardless of the outcome of the branch.
• The responsibility for setting up delayed branching
most often rests with the compiler.
• It can choose the instruction to place in the delay slot
in a number of ways.
• The first choice is a useful instruction that executes
regardless of whether the branch occurs.
• Other possibilities include instructions that execute if
the branch occurs, but do no harm if the branch does
not occur.
• Delayed branching has the advantage of low
hardware cost.
• Branch prediction is another approach to minimizing
branch penalties.
• Branch prediction tries to avoid pipeline stalls by
guessing the next instruction in the instruction
stream.
– This is called speculative execution.
• Branch prediction techniques vary according to the
type of branching. If/then/else, loop control, and
subroutine branching all have different execution
profiles.
There are various ways in which a prediction can be
made:
• Fixed predictions do not change over time.
• True predictions result in the branch being always
taken or never taken.
• Dynamic prediction uses historical information
about the branch and its outcomes.
• Static prediction does not use any history.
• When fixed prediction assumes that a branch is not
taken, the normal sequential path of the program is
taken.
• However, processing is done in parallel in case the
branch occurs.
• If the prediction is correct, the preprocessing
information is deleted.
• If the prediction is incorrect, the speculative
processing is deleted and the preprocessing
information is used to continue on the correct path.
• When fixed prediction assumes that a branch is
always taken, state information is saved before the
speculative processing begins.
• If the prediction is correct, the saved information is
deleted.
• If the prediction is incorrect, the speculative
processing is deleted and the saved information is
restored, allowing execution to continue on the
correct path.
• Dynamic prediction employs a high-speed branch
prediction buffer to combine an instruction with its
history.
• The buffer is indexed by the lower portion of the
address of the branch instruction; each entry
contains extra bits indicating whether the branch
was recently taken.
– One-bit dynamic prediction uses a single bit to indicate
whether the last occurrence of the branch was taken.
– Two-bit branch prediction retains the history of the
previous two occurrences of the branch along with a
probability of the branch being taken.
• The earliest branch prediction implementations
used static branch prediction.
• Most newer processors (including the Pentium,
PowerPC, UltraSparc, and Motorola 68060) use
two-bit dynamic branch prediction.
• Some superscalar architectures include branch
prediction as a user option.
• Many systems implement branch prediction in
specialized circuits for maximum throughput.
• The best hardware and compilers will never
equal the abilities of a human being who has
mastered the science of effective algorithm and
coding design.
• People can see an algorithm in the context of the
machine it will run on.
– For example, a good programmer will access a stored
column-major array in column-major order.
• We end this section by offering some tips to help
you achieve optimal program performance.
• Operation counting can enhance program
performance.
• With this method, you count the number of each
instruction type executed in a loop, then determine
the number of machine cycles for each instruction.
• The idea is to provide the best mix of instruction
types for a particular architecture.
• Nested loops provide a number of interesting
optimization opportunities.
• Loop unrolling is the process of expanding a loop
so that each new iteration contains several of the
original operations, thus performing more
computations per loop iteration. For example:
for (i = 1; i <= 30; i++)
    a[i] = a[i] + b[i] * c;
becomes
for (i = 1; i <= 30; i += 3)
{ a[i]   = a[i]   + b[i]   * c;
  a[i+1] = a[i+1] + b[i+1] * c;
  a[i+2] = a[i+2] + b[i+2] * c; }
• Loop fusion combines loops that use the same data
elements, possibly improving cache performance.
For example:
for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];
for (i = 0; i < N; i++)
    D[i] = E[i] + C[i];
becomes
for (i = 0; i < N; i++)
{ C[i] = A[i] + B[i];
  D[i] = E[i] + C[i]; }
• Loop fission splits large loops into smaller ones to
reduce data dependencies and resource conflicts.
• A loop fission technique known as loop peeling
removes iterations at the beginning and end of a
loop, eliminating the boundary tests from the loop
body. For example:
for (i = 1; i < N+1; i++)
{ if (i == 1)
    A[i] = 0;
  else if (i == N)
    A[i] = N;
  else
    A[i] = A[i] + 8; }
becomes
A[1] = 0;
for (i = 2; i < N; i++)
    A[i] = A[i] + 8;
A[N] = N;
• The text lists a number of rules of thumb for
getting the most out of program performance.
• Optimization efforts pay the biggest dividends
when they are applied to code segments that
are executed the most frequently.
• In short, try to make the common cases fast.
11.6 Disk Performance
Chapter 11 Conclusion
End of Chapter 11