IT401 Computer Organization and Architecture: Prasun Ghosal
$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$
$C_i$: number of instructions of class i executed
$CPI_i$: average number of cycles per instruction for that instruction class
n: number of instruction classes
Overall program CPI depends on
Number of cycles for each instruction type
Frequency of each instruction type in the program execution
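The per-class sum can be sketched in Python as follows. The instruction classes and their counts and CPI values are hypothetical, chosen only to illustrate the formula; they do not come from the slides.

```python
# Hypothetical instruction mix: class name -> (instruction count C_i, CPI_i).
# These names and numbers are illustrative assumptions, not slide data.
instruction_classes = {
    "alu": (50_000, 1),
    "load_store": (30_000, 2),
    "branch": (20_000, 3),
}


def cpu_clock_cycles(classes):
    """Sum CPI_i * C_i over all instruction classes."""
    return sum(count * cpi for count, cpi in classes.values())


def overall_cpi(classes):
    """Overall CPI = total clock cycles / total instruction count."""
    total_instructions = sum(count for count, _ in classes.values())
    return cpu_clock_cycles(classes) / total_instructions


total_cycles = cpu_clock_cycles(instruction_classes)
program_cpi = overall_cpi(instruction_classes)
print(total_cycles, program_cpi)
```

For this assumed mix the total is 170000 cycles over 100000 instructions, so the overall CPI is 1.7. Changing either the per-class cycle counts or the frequencies changes the result.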
Measuring Performance 4/5
CPU clock cycles = Instructions for a program * CPI
$\text{Performance} = \frac{1}{\text{Execution Time}}$
CPU time = CPU clock cycles * clock cycle time
CPU time = CPU clock cycles for a program / clock rate
CPU time = Instruction count * CPI * clock cycle time
CPU time = Instruction count * CPI / clock rate
$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$
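A minimal sketch of the CPU time relations, assuming a hypothetical program of 10 million instructions with a CPI of 2.0 on a 200 MHz clock (none of these values appear in the slides):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical values, for illustration only.
INSTRUCTION_COUNT = 10_000_000
CPI = 2.0
CLOCK_RATE_HZ = 200e6

seconds = cpu_time(INSTRUCTION_COUNT, CPI, CLOCK_RATE_HZ)

# Same quantity through the clock cycle time form of the equation.
clock_cycle_time = 1 / CLOCK_RATE_HZ
same_seconds = INSTRUCTION_COUNT * CPI * clock_cycle_time

print(seconds, same_seconds)
```

Both forms give about 0.1 seconds, since clock cycle time is the reciprocal of clock rate.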
Benchmarks 1/3
Concept of Workload
Informally, set of programs that the user runs day in and day out
Benchmarks
Programs specifically chosen to measure performance
Form a workload that the user hopes will predict the performance of the actual
workload
Best benchmark types are real programs
Use of benchmarks whose performance depends on small code segments
encourages optimizations in either the architecture or compiler
A problem: Compilers with special-purpose optimizations targeted at specific
benchmarks. Will such optimizations produce good or correct code with a real
application?
Benchmarks 2/3
COPYRIGHT 1998 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED
Matrix 300 in the SPEC suite in 1989
SPEC is the System Performance Evaluation Cooperative
For matrix 300, the enhanced compiler improves performance by a factor of more than 9, although the improvement on the other benchmarks is much smaller
SPEC benchmark web site
http://www.specbench.org
Benchmarks 3/3
Why are real programs not used to measure performance?
Small size of benchmark (easier compilation and simulation)
Compilers might not be available for a new machine
Numerous published performance results are available for small
benchmarks
Benchmarks are OK for the initial design phase, but a working computer
system should be evaluated with a real program
Writing Performance reports
Reproducibility
Include everything needed to be able to duplicate the experiment
Comparing and Summarizing Performance 1/4
Selected benchmark
Agreed to use response time or throughput
How to summarize performance of a group of benchmarks?
        M/C A   M/C B
P1          1      10
P2       1000     100
Total    1001     110
A is 10 times faster than B for P1
B is 10 times faster than A for P2
What is the relative performance of A and B?
Use Total Execution Time
$\frac{\text{Performance}(B)}{\text{Performance}(A)} = \frac{\text{ExecutionTime}(A)}{\text{ExecutionTime}(B)} = \frac{1001}{110} = 9.1$
Comparing and Summarizing Performance 2/4
B is 9.1 times faster than A for P1 and P2 together
Total execution time is a one-figure summary of performance that is directly proportional to execution time
If the workload consists of running P1 and P2 an equal number of times, this statement predicts the relative execution times for the workload on each machine
The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)
$AM = \frac{1}{n} \sum_{i=1}^{n} Time(i)$
Time(i): execution time for the i-th program
n: total number of programs in the workload
A smaller mean means a smaller average execution time and thus improved performance
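As a quick check, the sketch below uses the P1 and P2 times for machines A and B from the earlier table. It shows that comparing arithmetic means gives the same ratio as comparing total execution times when each program runs equally often. The variable and function names are illustrative.

```python
# Execution times for P1 and P2 on machines A and B, taken from the table.
times = {"A": [1, 1000], "B": [10, 100]}


def arithmetic_mean(values):
    """AM = (1/n) * sum of Time(i)."""
    return sum(values) / len(values)


total = {machine: sum(t) for machine, t in times.items()}
mean = {machine: arithmetic_mean(t) for machine, t in times.items()}

ratio_of_totals = total["A"] / total["B"]
ratio_of_means = mean["A"] / mean["B"]

print(total, mean, round(ratio_of_totals, 1), round(ratio_of_means, 1))
```

Both ratios round to 9.1, because dividing each total by the same n does not change their ratio.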
Comparing and Summarizing Performance 3/4
The arithmetic mean is proportional to execution time if the programs in the workload are each run an equal number of times. What if that is not the case?
Assign a weighting factor w(i) to each program to indicate the frequency of the program in the workload
Weighted arithmetic mean
AM is a special case of the weighted AM in which all weights are equal
$WeightedAM = \sum_{i=1}^{n} w(i) \times Time(i)$
Comparing and Summarizing Performance 4/4
Program   M/C A   M/C B   M/C C
P1            1      10      20
P2         1000     100      20
Table shows runtimes of P1 and P2 on three machines A, B, and C
Workload consists of P1 and P2.
P1 is run 10 times as often as P2
Find which machine is fastest for this workload and by how much?
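One way to work this exercise is sketched below with the weighted arithmetic mean. It assumes the weights are normalized to sum to 1, so P1 gets 10/11 and P2 gets 1/11. The slides only say that w(i) indicates frequency, so the normalization is an assumption of this sketch.

```python
# Runtimes of P1 and P2 on each machine, taken from the table.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

# Assumption: weights normalized to sum to 1; P1 runs 10 times as often as P2.
weights = [10 / 11, 1 / 11]


def weighted_arithmetic_mean(weights, values):
    """Weighted AM = sum of w(i) * Time(i)."""
    return sum(w * t for w, t in zip(weights, values))


weighted = {m: weighted_arithmetic_mean(weights, t) for m, t in times.items()}
fastest = min(weighted, key=weighted.get)

# Print machines from fastest to slowest, with the slowdown relative to the fastest.
for machine, value in sorted(weighted.items(), key=lambda item: item[1]):
    print(machine, round(value, 2), round(value / weighted[fastest], 2))
```

Under this weighting, B has the smallest weighted mean (about 18.18), C follows at 20, and A is at about 91.82. That makes B roughly 1.1 times faster than C and about 5.05 times faster than A for this workload.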
SPEC95 Benchmarks
CPU benchmark
Created by a set of computer companies in 1989
SPEC95 (8 integer and 10 floating point programs). Figure 2.6
SPEC95 web site (http://www.specbench.org/osg/cpu95/news/cpu95descr.html)
SPEC ratio for xxx.benchmark = xxx.benchmark reference time / xxx.benchmark run time
Normalized measure. Higher results indicate faster performance
Reference machine is a Sun SPARCstation 10/40
SPECint95 or SPECfp95 summary measurement is obtained by taking the geometric mean of the SPEC ratios
$\sqrt[n]{\prod_{i=1}^{n} SPECratio(i)}$
where $\prod_{i=1}^{n} a_i$ is the product $a_1 \times a_2 \times \dots \times a_n$
SPEC95 Benchmark results for Pentium and Pentium Pro
At the same clock rate, Pentium Pro is 1.4 to 1.5 times faster
When the clock rate is increased by a certain factor, processor performance increases by a lower factor
Pentium clock rate goes from 100 to 200 MHz. SPECint95 performance improves by a factor of only 1.7 (Why?)
SPEC95 Benchmark results for Pentium and Pentium Pro
At the same clock rate, Pentium Pro is 1.7 to 1.8 times faster
Clock rate goes from 100 to 200 MHz, SPECfp95 improves by a factor of only 1.4 (Why?)
The memory system becomes a bottleneck as processor speed increases; the effect is more evident on the floating-point benchmarks because of their size.
Performance Summary Example 1/2
      M/C A   M/C B
P1     1482     139
P2     2266     254
P3     6206     690
Which machine is faster according to total execution time? And by how much?
Total Execution Time (A) = 1482 + 2266 + 6206 = 9954
Total Execution Time (B) = 139 + 254 + 690 = 1083
Machine B is faster by 9954/1083 = 9.19 times
Performance Summary Example 2/2
      M/C A   M/C B
P1     1482     139
P2     2266     254
P3     6206     690
Which machine is faster by the geometric mean measure?
Remember how SPEC reported performance?
Normalize in reference to one machine
Choose A as reference machine
Obtain Execution time ratios (ET Ratio)
ET Ratio(P1) = ET(A)/ET(B) = 1482/139 = 10.66
ET Ratio(P2) = 2266/254 = 8.92
ET Ratio(P3) = 6206/690 = 8.99
Geometric Mean = $(Ratio(P1) \times Ratio(P2) \times Ratio(P3))^{1/3}$
Geometric Mean = 9.49
Machine B is 9.49 times faster than A according to the geometric mean measure
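The geometric mean calculation above can be reproduced with a short Python sketch. The function name `geometric_mean` is illustrative; `math.prod` requires Python 3.8 or later.

```python
import math

# Execution times for P1, P2, P3 on each machine, taken from the table.
times_a = [1482, 2266, 6206]
times_b = [139, 254, 690]


def geometric_mean(values):
    """n-th root of the product of n values."""
    return math.prod(values) ** (1 / len(values))


# Ratios with A as the reference machine: ET(A) / ET(B).
ratios_a_over_b = [a / b for a, b in zip(times_a, times_b)]
gm_a_over_b = geometric_mean(ratios_a_over_b)

# Same comparison with B as the reference machine: ET(B) / ET(A).
gm_b_over_a = geometric_mean([b / a for a, b in zip(times_a, times_b)])

print(round(gm_a_over_b, 2), round(gm_a_over_b * gm_b_over_a, 6))
```

The first value rounds to 9.49. The product of the two geometric means is 1 up to floating-point rounding, which illustrates that the geometric mean of ratios gives the same relative ranking whichever machine is used as the reference.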
Amdahl's Law 1/3
Pitfall
Expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement
Program runs in 100 sec on a machine
Multiply operations are responsible for 80 sec of this time
How much do we need to improve the speed of multiplication if the program is to run 5 times faster?
Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected
Target execution time = 100/5 = 20 sec
Execution time after improvement = 80/n + (100 - 80) = 20
80/n + 20 = 20, so 80/n = 0: no n can be found to achieve the requested improvement
Make the common case fast
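A small sketch of the multiplication example, using the 80 sec and 20 sec split from the slide. It shows the overall speedup approaching 5 without reaching it as the improvement factor n grows. The function name is illustrative.

```python
def time_after_improvement(affected, unaffected, n):
    """Execution time after improvement = affected / n + unaffected."""
    return affected / n + unaffected


ORIGINAL_TIME = 100  # seconds, from the slide
AFFECTED = 80        # seconds spent in multiply operations
UNAFFECTED = ORIGINAL_TIME - AFFECTED

for n in (2, 10, 100, 1_000_000):
    new_time = time_after_improvement(AFFECTED, UNAFFECTED, n)
    print(n, new_time, ORIGINAL_TIME / new_time)
```

However large n becomes, the 20 sec that multiplication does not touch remains, so the speedup stays below 100/20 = 5.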
Amdahl's Law 2/3
Another form of Amdahl's Law (to yield Speedup)
Speedup = Performance after improvement / Performance before improvement
Speedup = Execution time before improvement / Execution time after improvement
Assume new hardware is added to the machine
f = fraction of all operations that use the new hardware
s = speedup of those operations using the new hardware
Execution time with new hardware is $T_{new}$
Execution time without new hardware is $T_{old}$
$T_{new} = f \times T_{old} / s + (1 - f) \times T_{old}$
Overall speedup $S = T_{old} / T_{new}$
Speedup = s / (s - f * (s - 1))
Speedup for each s and f
s     f = 0.1    f = 0.5    f = 0.9    f = 0.99
2     1.052632   1.333333   1.818182   1.980198
5     1.086957   1.666667   3.571429   4.807692
10    1.098901   1.818182   5.263158   9.174312
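The speedup table can be regenerated from the execution time relation, since dividing T_old by T_new = f * T_old / s + (1 - f) * T_old gives S = 1 / (f / s + (1 - f)). A minimal Python sketch, with an illustrative function name:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the work is sped up by a factor s."""
    return 1 / (f / s + (1 - f))


# Same f and s values as the table.
for f in (0.1, 0.5, 0.9, 0.99):
    for s in (2, 5, 10):
        print(f, s, round(amdahl_speedup(f, s), 6))
```

Even with s = 10, the overall speedup stays close to 1 unless f is large, which is the quantitative form of "make the common case fast".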
Amdahl's Law 3/3
Example of memory versus processor speedup
A = B op C
Assume a memory access takes 4 cycles and a typical operation takes 2 cycles
Which of the following achieves the best increase in performance?
Increase memory speed by 50%
Double operation speed
First, calculate how many memory accesses are needed
1 to get the instruction from memory
2 to get B and C from memory
1 to store the result (A) back in memory
So we need a total of 4 memory access operations
Memory access time = 4 (accesses) * 4 (cycles/access) = 16 cycles
Operation time = 1 (operation) * 2 (cycles/operation) = 2 cycles
Total number of cycles = 16 + 2 = 18
Option 1: increase memory speed by 50%
s1 = 1.5 (how?)
f1 = memory access time / total time = 16/18 = 0.889
S1 = 1 / (f1/s1 + (1 - f1)) = 1.42
Option 2: double operation speed
s2 = 2
f2 = operation time / total time = 2/18 = 0.111
S2 = 1 / (f2/s2 + (1 - f2)) = 1.059
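The two options can also be cross-checked by recomputing cycle counts directly rather than going through f and s. This sketch assumes, as the slide does, 4 memory accesses of 4 cycles each and 1 operation of 2 cycles. The constant and variable names are illustrative.

```python
MEMORY_ACCESSES = 4        # 1 instruction fetch, 2 operand reads, 1 result write
CYCLES_PER_ACCESS = 4
OPERATIONS = 1
CYCLES_PER_OPERATION = 2

memory_cycles = MEMORY_ACCESSES * CYCLES_PER_ACCESS
operation_cycles = OPERATIONS * CYCLES_PER_OPERATION
baseline_cycles = memory_cycles + operation_cycles

# Option 1: memory 50% faster, so memory cycles shrink by a factor of 1.5.
option1_cycles = memory_cycles / 1.5 + operation_cycles
# Option 2: operations twice as fast, so operation cycles are halved.
option2_cycles = memory_cycles + operation_cycles / 2

speedup_memory = baseline_cycles / option1_cycles
speedup_operation = baseline_cycles / option2_cycles

print(round(speedup_memory, 3), round(speedup_operation, 3))
```

Speeding up memory wins (about 1.42 versus about 1.059) because memory accesses account for 16 of the 18 cycles.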
MIPS as a Performance Metric
MIPS is million instructions per second
MIPS = instruction count / (Execution time * $10^6$)
Instruction execution rate (instruction/sec)
Faster machines have a higher MIPS rating
Problems with MIPS
Does not take into account the capabilities of instructions
(cannot compare computers with different ISAs)
Varies between programs on the same computer
(a machine cannot have a single MIPS rating for all programs)
Can vary inversely with performance
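To illustrate the last point, the sketch below compares two hypothetical compiled versions of the same program on one machine. The clock rate, per-class CPI values, and instruction counts are all assumptions made up for this example, not data from the slides.

```python
CLOCK_RATE_HZ = 500e6               # hypothetical clock rate
CPI = {"A": 1, "B": 2, "C": 3}      # hypothetical cycles per instruction class


def execution_time(counts):
    """Execution time in seconds = total clock cycles / clock rate."""
    cycles = sum(counts[cls] * CPI[cls] for cls in counts)
    return cycles / CLOCK_RATE_HZ


def mips(counts):
    """MIPS = instruction count / (execution time * 10^6)."""
    instruction_count = sum(counts.values())
    return instruction_count / (execution_time(counts) * 1e6)


# Hypothetical instruction counts per class for two compilers.
compiler_1 = {"A": 5e6, "B": 1e6, "C": 1e6}
compiler_2 = {"A": 10e6, "B": 1e6, "C": 1e6}

for name, counts in (("compiler 1", compiler_1), ("compiler 2", compiler_2)):
    print(name, execution_time(counts), mips(counts))
```

With these assumed numbers, the second version has the higher MIPS rating (about 400 versus about 350) yet takes longer to run (about 0.03 s versus about 0.02 s), because it executes many more cheap instructions.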