Performance of A Computer
Unit 1a
Introduction
Where to place this course
The Computational Stack, Hardware/Software Interface
Administrivia
a[i] = b[i] + c;
Compiler
Systems software
(OS, compiler)
lw  $15, 0($2)
add $16, $15, $14
add $17, $15, $13
lw  $18, 0($12)
lw  $19, 0($17)
add $20, $18, $19
sw  $20, 0($16)
Assembler
Hardware
000000101100000
110100000100010
BE/DigSys
Microarchitect's view:
How to design a computer that
meets system design goals.
Choices critically affect both
the SW programmer and
the HW designer
Operating System
ISA
Microarchitecture
Logic
Circuits
Electrons
Moore's Law
Administrivia
Periodic class assignments
(keep a hardcopy handy)
Surprise Quizzes
(best n-1 of n, n ~ 2/3)
HAs (Daily)
CAs (Weekly, based on the week's coverage)
CTs (Two mid-sem and end-sem)
Attendance
Textbook: Computer Organization and Design: The
Hardware/Software Interface, Hennessy & Patterson, 2nd / 3rd
Ed., MKP
Introduction
Elements of Computing Systems
Von Neumann vs Harvard Model
ISA and Microarchitecture
Processing
Control (sequencing)
Memory (program and data)
I/O
Datapath
Von-Neumann Model
PC
IR
Register File
MAR
MDR
Harvard Architecture
Separate storage and datapath for instructions and data
Originated from Harvard Mark I
CPU can both read an instruction and access data memory
simultaneously.
Can
x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro,
Pentium 4, Core,
Memory
Address space, Addressability, Alignment
Virtual memory management
Computer Architecture
Architecture
The art and science of designing and constructing
buildings
Computer Architecture
The art and science of selecting, designing
and interconnecting hardware components
and designing the hardware/software
interface to create a computing system that
meets the required system design goals
Recap
Performance
Response time & throughput; Factors that affect performance
The Performance Equation
Some Examples
Airplane        Passengers   Cruising speed (mph)
Boeing 737         101            598
Boeing 747         470            610
Concorde           132           1350
Douglas DC-8       146            544
Defining Performance
Maximize Throughput
number of jobs completed in given time
Q. Defining Performance
Assume 5 processes
Each process needs 2 minutes of CPU time
Performance Comparison
Alternatively:
Example 2
Suppose we have two implementations of the
same instruction set architecture. Machine A has
a 1 ns clock cycle and an average CPI (clock cycles
per instruction) of 2.0 for some program and
machine B has a clock cycle time of 2 ns and a
CPI of 1.2 for the same program. Which machine
is faster for this program, and by how much?
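Example 2 can be checked numerically with the performance equation, CPU time = instruction count x CPI x clock cycle time. The sketch below is only an illustration: the function name and the placeholder instruction count of 1 are assumptions. Because both machines implement the same ISA and run the same program, the instruction count cancels in the ratio.

```python
def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """CPU time in ns = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_ns

# Same ISA and same program, so the instruction count is identical on both
# machines; any positive value works because it cancels in the ratio.
INSTRUCTIONS = 1

time_a = cpu_time_ns(INSTRUCTIONS, cpi=2.0, cycle_time_ns=1.0)  # machine A
time_b = cpu_time_ns(INSTRUCTIONS, cpi=1.2, cycle_time_ns=2.0)  # machine B

# Ratio of execution times: how many times faster A is than B.
speedup_a_over_b = time_b / time_a
print(f"A: {time_a:.1f} ns/instr, B: {time_b:.1f} ns/instr, "
      f"A is {speedup_a_over_b:.2f}x faster")
```

Per instruction, A needs 2.0 ns and B needs 2.4 ns, so A is 1.2 times faster for this program even though its CPI is higher.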
Example 3
A compiler designer is trying to decide between two code
sequences for a particular machine. The hardware designers have
supplied the following facts:
Instruction class
A
B
C
A
2
B
1
C
2
Example 3
Considering only one factor (instruction count, in
this case) to assess performance can mislead.
When comparing two computers, we must look at
all three components, which combine to form
execution time.
If some of the factors are identical, like the clock rate
in the previous example, performance can be
determined by comparing all the non-identical
factors.
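A small sketch of why instruction count alone can mislead. All numbers below (the per-class CPI values and the two code sequences) are hypothetical values chosen for illustration and are not taken from the slides. The clock rate is assumed identical for both sequences, so comparing clock cycles is enough.

```python
# Hypothetical values for illustration only (not from the slides).
CPI = {"A": 1, "B": 2, "C": 3}          # assumed cycles per instruction class
SEQUENCE_1 = {"A": 2, "B": 1, "C": 2}   # assumed instruction counts per class
SEQUENCE_2 = {"A": 4, "B": 1, "C": 1}

def instruction_count(seq):
    """Total number of instructions in a code sequence."""
    return sum(seq.values())

def clock_cycles(seq, cpi=CPI):
    """Total cycles = sum over classes of (count x CPI of that class)."""
    return sum(count * cpi[cls] for cls, count in seq.items())

for name, seq in (("sequence 1", SEQUENCE_1), ("sequence 2", SEQUENCE_2)):
    ic = instruction_count(seq)
    cycles = clock_cycles(seq)
    print(f"{name}: {ic} instructions, {cycles} cycles, CPI = {cycles / ic:.2f}")
```

With these assumed numbers, sequence 2 executes more instructions (6 versus 5) yet needs fewer cycles (9 versus 10), so it is the faster of the two.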
Instruction count
Determined by algorithm, compiler, ISA
Measured using software that simulates the ISA, hardware
counters present in many systems
Measuring Performance
Summary
Readings:
HP3E Ch.1 (1.1-1.3, 1.5); Ch.4 (4.1, 4.2)
Performance
Evaluating Performance
Performance Metrics
Computer system performance can be
measured by several performance
metrics.
The metrics we use depend on the
purpose as well as the component of the
system in which we are interested.
For example, to benchmark a networking
device, we'd use network bandwidth, which
tells us the number of bits the component can
transmit per second.
Performance Metrics
MIPS stands for millions of instructions
per second.
A simple metric, but practically useless for
expressing the performance of a system (why?)
Instructions can vary widely among processors
For example, complex instructions take more clocks
than simple instructions.
Thus, the execution rate for complex instructions will be
lower than that for simple instructions.
Performance Metrics
MIPS is perhaps useful in comparing
various versions of processors derived
from the same instruction set.
MFLOPS: popular metric often used in
the scientific computing area.
Millions of floating-point operations per
second.
Synthetic Benchmarks
Programs specifically written for performance
testing.
Whetstone benchmark, named after the
Whetstone Algol compiler (Algol, later Fortran)
was developed in the mid-1970s to
measure floating-point performance
Dhrystone benchmark (Ada, later C)
developed in 1984 to measure integer
performance.
Synthetic Benchmarks
Both Whetstone and Dhrystone
benchmarks are small programs
Drawbacks with synthetic benchmarks:
No user would use them as applications:
they don't do anything useful
Not real programs, so they do not reflect
program behavior
They encouraged excessive optimization
by compilers to distort performance
results
Real Benchmarks
SPEC: System Performance Evaluation
Cooperative
SPEC CPU2006
Benchmark for measuring processor performance,
memory, and compiler
12 integer + 17 floating-point applications written in three or four
different programming languages
Integer programs: compilers, compression, chess, CAD
placement & routing programs, etc.
Floating-point programs: FEM, CFD simulations, ANN, 3D
graphics, image processing programs, etc.
Performance of a REF machine is given (Sun Sparc ?).
Real Benchmarks
Others
SPECmail, SPECweb, SPECjvm
etc.
Means of Performance
Matter of interest: a single summarizing metric to get
an idea of performance
Less information, but preferred by marketers and users
Means of Performance
Arithmetic mean = 90 seconds
The implicit assumption in our arithmetic
mean calculation
Both programs are equally likely in the target
workload.
What if they are not?
                        Computer A   Computer B
Program 1 (seconds)          1           10
Program 2 (seconds)       1000          100
Total time (seconds)      1001          110
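If the two programs are not equally likely, a weighted arithmetic mean can be used instead. The sketch below reuses the execution times from the table; the function name and the skewed weights (Program 1 making up 99.9% of the workload) are hypothetical assumptions for illustration only.

```python
# Execution times in seconds for (Program 1, Program 2), from the table.
TIMES = {"A": [1, 1000], "B": [10, 100]}

def weighted_mean(times, weights):
    """Weighted arithmetic mean; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * t for w, t in zip(weights, times))

equal_weights = [0.5, 0.5]        # both programs equally likely
skewed_weights = [0.999, 0.001]   # hypothetical: Program 1 dominates

for label, weights in (("equal", equal_weights), ("skewed", skewed_weights)):
    mean_a = weighted_mean(TIMES["A"], weights)
    mean_b = weighted_mean(TIMES["B"], weights)
    print(f"{label} weights: A = {mean_a:.3f} s, B = {mean_b:.3f} s")
```

With equal weights B looks better (55 s versus 500.5 s), but under the assumed skewed workload A comes out ahead (about 2.0 s versus about 10.1 s), so the ranking depends on the workload mix.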
Means of Performance
GM has the following property:
GM(Xi) / GM(Yi) = GM(Xi / Yi)
Advantage: Independent of running
times of individual programs and REF
machine
Example:
        Time on A   Time on B
P1          1          10
P2       1000         100
AM        500.5        55
GM         31.6        31.6
Means of Performance
AM values tell us that B is about nine
times faster than A, but GM suggests that
both machines perform the same. (Why?)
Because GM tracks the performance ratio, not
execution time (that's its key drawback).
Since Program 1 runs 10 times faster on A
and Program 2 runs 10 times faster on B, by
using GM we erroneously conclude that the
average performance of the two machines is
the same.
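The comparison of the two means can be reproduced with a few lines of Python. This is a minimal sketch using the P1 and P2 times from the example; the function and variable names are illustrative assumptions.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # n-th root of the product of n values
    return math.prod(values) ** (1.0 / len(values))

time_a = [1, 1000]   # P1, P2 on machine A (seconds)
time_b = [10, 100]   # P1, P2 on machine B (seconds)

am_a, am_b = arithmetic_mean(time_a), arithmetic_mean(time_b)
gm_a, gm_b = geometric_mean(time_a), geometric_mean(time_b)

# GM property: GM(Xi) / GM(Yi) equals GM(Xi / Yi).
ratios = [a / b for a, b in zip(time_a, time_b)]
gm_of_ratios = geometric_mean(ratios)

print(f"AM: A = {am_a:.1f}, B = {am_b:.1f}")
print(f"GM: A = {gm_a:.1f}, B = {gm_b:.1f}")
print(f"GM(A)/GM(B) = {gm_a / gm_b:.3f}, GM of ratios = {gm_of_ratios:.3f}")
```

The arithmetic means differ by roughly a factor of nine, while both geometric means equal the square root of 1000 (about 31.6), because the two per-program ratios (1/10 and 10) cancel.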
Next Class
Wind up Unit 1: Amdahl's law
Review: Performance
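\[
\text{CPU time}\left[\frac{\text{sec}}{\text{program}}\right]
= \text{CPU cycles for program}\left[\frac{\text{clock cycles}}{\text{program}}\right]
\times \text{clock cycle time}\left[\frac{\text{sec}}{\text{clock cycle}}\right]
\]
\[
\text{CPU time}\left[\frac{\text{sec}}{\text{program}}\right]
= \frac{\text{CPU cycles for program}\left[\frac{\text{clock cycles}}{\text{program}}\right]}
       {\text{Clock rate}\left[\frac{\text{clock cycles}}{\text{sec}}\right]}
\]
\[
\text{CPU cycles for program}\left[\frac{\text{clock cycles}}{\text{program}}\right]
= \text{Instruction count}\left[\frac{\text{instructions}}{\text{program}}\right]
\times \text{CPI}\left[\frac{\text{clock cycles}}{\text{instruction}}\right]
\]
\[
\text{CPU time}
= \text{Instruction count} \times \text{CPI} \times \text{clock cycle time}
= \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}}
\]
\[
\text{CPU performance}\left[\frac{\text{programs}}{\text{sec}}\right]
= \frac{1}{\text{CPU time}}
= \frac{\text{Clock rate}}{\text{CPI} \times \text{Instruction count}}
\]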
Review: MIPS
\[
\text{MIPS} = \frac{\text{Instruction count}}{\text{CPU time} \times 10^{6}}
            = \frac{\text{Clock rate}}{\text{CPI} \times 10^{6}}
\]
Machines with different instruction sets?
Programs with different instruction mixes?
Uncorrelated with performance
Marketing metric
"Meaningless Indicator of Processor Speed"
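A minimal sketch of the two equivalent MIPS forms, instruction count / (CPU time x 10^6) and clock rate / (CPI x 10^6). The function names and the one-million-instruction program are assumptions; the machine parameters are reused from Example 2 (A: 1 ns cycle, CPI 2.0; B: 2 ns cycle, CPI 1.2).

```python
def mips_from_rate(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)

def mips_from_counts(instruction_count, cpu_time_s):
    """MIPS = instruction count / (CPU time x 10^6)."""
    return instruction_count / (cpu_time_s * 1e6)

# Machine parameters reused from Example 2 (1 ns cycle is 1 GHz, 2 ns is 500 MHz).
mips_a = mips_from_rate(1.0e9, cpi=2.0)
mips_b = mips_from_rate(0.5e9, cpi=1.2)

# Cross-check with the other form, assuming a program of one million instructions.
instruction_count = 1_000_000
cpu_time_a = instruction_count * 2.0 / 1.0e9   # IC x CPI / clock rate, in seconds
mips_a_check = mips_from_counts(instruction_count, cpu_time_a)

print(f"A: {mips_a:.1f} MIPS, B: {mips_b:.1f} MIPS, cross-check A: {mips_a_check:.1f}")
```

Here the MIPS ranking agrees with execution time only because both machines share an ISA and run the same program; across different instruction sets or instruction mixes the same calculation says nothing reliable about which machine finishes first.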
Review: MFLOPS
Popular in supercomputing community
\[
\text{MFLOP/s} = \frac{\text{Number of FP operations}}{\text{CPU time} \times 10^{6}}
\]
Often not where time is spent
Not all FP operations are equal
Can magnify performance differences
A better algorithm (e.g.,
Amdahl's Law
A motivating example
A program runs in 100 seconds on a
computer, with multiply operations
responsible for 80 seconds of this time. By
how much do I have to improve the speed of
multiplication if I want my program to run
five times faster?
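The motivating question can be worked through in code. This is a sketch under the stated numbers (100 s total, 80 s in multiplication); the function name and the extra target of 4x are assumptions added for contrast.

```python
def required_multiply_speedup(total_s, multiply_s, target_speedup):
    """Return the multiply speedup needed to hit the overall target,
    or None if the target cannot be reached by speeding up multiply alone."""
    target_time = total_s / target_speedup
    other_s = total_s - multiply_s        # time unaffected by the improvement
    remaining = target_time - other_s     # time budget left for multiplication
    if remaining <= 0:
        return None
    return multiply_s / remaining

for target in (4, 5):
    needed = required_multiply_speedup(100, 80, target)
    if needed is None:
        print(f"{target}x overall: impossible by improving multiplication alone")
    else:
        print(f"{target}x overall: multiplication must be {needed:.1f}x faster")
```

Running five times faster means finishing in 20 seconds, but the non-multiply work already takes 20 seconds, so no finite multiply speedup is enough. A 4x target is reachable, with multiplication made 16 times faster.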
Amdahl's Law
Validity of the single processor approach to achieving large scale computing capabilities, G. M. Amdahl,
AFIPS Conference Proceedings, pp. 483-485, April 1967
Historical context
Amdahl was demonstrating the continued validity of the
single processor approach and of the weaknesses of the
multiple processor approach
A fairly obvious conclusion which can be drawn at this point is
that the effort expended on achieving high parallel performance
rates is wasted unless it is accompanied by achievements in
sequential processing rates of very nearly the same magnitude.
Amdahl's Law
\[
R_{\text{avg}} = \frac{1}{\sum_i \dfrac{F_i}{R_i}}
\]
F_i: fraction of results generated at rate R_i
Note: not the fraction of time spent working at this rate
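\[
R_{\text{avg}} = \frac{1}{\dfrac{0.3}{1} + \dfrac{0.2}{10} + \dfrac{0.5}{100}}
= \frac{100}{30 + 2 + 0.5} = \frac{100}{32.5} = 3.08\ \text{MFLOPS}
\]
Fraction of time spent at each rate:
\[
\frac{30}{32.5} = 92.3\%, \qquad \frac{2}{32.5} = 6.2\%, \qquad \frac{0.5}{32.5} = 1.5\%
\]
Bottleneck: the slowest rate (1 MFLOPS) accounts for 92.3% of the time.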
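The average-rate calculation can be sketched as follows, using the fractions (0.3, 0.2, 0.5) and rates (1, 10, 100 MFLOPS) from the example. The function names are illustrative assumptions.

```python
def average_rate(fractions, rates):
    """Weighted harmonic mean: 1 / sum(F_i / R_i)."""
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

def time_shares(fractions, rates):
    """Fraction of total time spent working at each rate."""
    times = [f / r for f, r in zip(fractions, rates)]
    total = sum(times)
    return [t / total for t in times]

fractions = [0.3, 0.2, 0.5]     # fraction of results produced at each rate
rates = [1.0, 10.0, 100.0]      # MFLOPS

r_avg = average_rate(fractions, rates)
shares = time_shares(fractions, rates)

print(f"R_avg = {r_avg:.2f} MFLOPS")
for rate, share in zip(rates, shares):
    print(f"  {rate:6.1f} MFLOPS: {share:.1%} of the time")
```

Although only 30% of the results are produced at the slowest rate, that portion takes over 92% of the time, which is why it is the bottleneck.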
Amdahl's Law
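\[
\text{Exec\_time}_{\text{new}} = \text{Exec\_time}_{\text{old}} \times
\left[ (1 - F_{\text{enhanced}}) + \frac{F_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]
\]
\[
\text{Speedup}_{\text{overall}} = \frac{\text{Exec\_time}_{\text{old}}}{\text{Exec\_time}_{\text{new}}}
= \frac{1}{(1 - F_{\text{enhanced}}) + \dfrac{F_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
\]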
Cache
Memory
Disk / Tape
Example
Which change is more effective on a certain machine: speeding up
10-fold the floating point square root operation only, which takes
up 20% of execution time, or speeding up 2-fold all floating point
operations, which take up 50% of total execution time?
(Assume that the cost of accomplishing either change is the same,
and the two changes are mutually exclusive.)
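The two alternatives can be compared by applying Amdahl's Law, overall speedup = 1 / ((1 - F) + F / S), where F is the fraction of execution time affected and S is the speedup of that fraction. A minimal sketch; the function and variable names are assumptions.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Alternative 1: FP square root is 20% of the time, made 10 times faster.
speedup_sqrt = overall_speedup(0.20, 10)

# Alternative 2: all FP operations are 50% of the time, made 2 times faster.
speedup_all_fp = overall_speedup(0.50, 2)

print(f"FP square root only: {speedup_sqrt:.2f}x overall")
print(f"All FP operations:   {speedup_all_fp:.2f}x overall")
```

The results are about 1.22x versus 1.33x, so the modest improvement applied to the larger fraction of time is the more effective change.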
(HA03)
Suppose we have made the following measurements:
Frequency of floating point operations = 25%
Average CPI of floating point operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2% (FPSQR = instruction for FP square root)
CPI of FPSQR = 20
Given two design alternatives, the first being to reduce the
CPI of FPSQR to 2 and the second being to reduce the average CPI of
all FP operations to 2, which one should we opt for?
Readings
References: HP3E Ch.1 (1.1-1.3, 1.5); Ch.4 (4.1-4.3, 4.5-4.6)
Readings:
Real Stuff: Two SPEC Benchmarks and the Performance of Recent Intel
Notebook..