CIS 501: Computer Architecture: Unit 4: Performance & Benchmarking
This Unit
Metrics
Latency and throughput Speedup Averaging
Benchmarking
Performance Metrics
"Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."
Andrew Tanenbaum, Computer Networks, 4th ed., p. 91
Comparing speeds
System A runs the program in 15s; System B runs it in 5s
How much faster is System B than System A?
speedup of 3x
3x, 200% faster
1/3, 33% the running time
67% less running time
How much slower is System A than System B?
slowdown of 3x
3x, 200% slower
3x, 300% the running time
200% more running time
A is X% faster than B if
X = ((Latency(B)/Latency(A)) - 1) * 100
X = ((Throughput(A)/Throughput(B)) - 1) * 100
Latency(A) = Latency(B) / (1+(X/100))
Throughput(A) = Throughput(B) * (1+(X/100))
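The "X% faster" definitions can be checked numerically. The sketch below is only an illustration; the function names are made up here, and the inputs are the 15s and 5s systems from the earlier slide.

```python
def percent_faster_by_latency(latency_a, latency_b):
    """Percent by which A is faster than B, given latencies (lower is better)."""
    return (latency_b / latency_a - 1) * 100


def percent_faster_by_throughput(throughput_a, throughput_b):
    """Percent by which A is faster than B, given throughputs (higher is better)."""
    return (throughput_a / throughput_b - 1) * 100


# System B (5 s) compared against System A (15 s).
print(percent_faster_by_latency(5, 15))  # 200.0
```

Note that the latency form puts B in the numerator while the throughput form puts A there, because a faster system has lower latency but higher throughput.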
Car/bus example
Latency? Car is 3 times (and 200%) faster than bus
Throughput? Bus is 4 times (and 300%) faster than car
Speedup (program A runs for 200 cycles, program B for 350 cycles):
350/200 = 1.75
Program A is 1.75x faster than program B
As a percentage: (1.75 - 1) * 100 = 75%
If program C is 1x faster than A, how many cycles does C run for? 200 (the same as A)
What if C is 1.5x faster? 133 cycles (50% faster than A)
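A small sketch of the same arithmetic, assuming cycle counts are the latency measure. The helper names are hypothetical.

```python
def speedup(slow_cycles, fast_cycles):
    """Speedup of the faster program over the slower one."""
    return slow_cycles / fast_cycles


def cycles_after_speedup(baseline_cycles, factor):
    """Cycles for a program that is `factor` times faster than the baseline."""
    return baseline_cycles / factor


print(speedup(350, 200))                      # 1.75
print(cycles_after_speedup(200, 1.0))         # 200.0 (1x faster means no change)
print(round(cycles_after_speedup(200, 1.5)))  # 133
```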
CIS 501: Comp. Arch. | Prof. Joe Devietti | Performance 9
For Example
You drive two miles
30 miles per hour for the first mile 90 miles per hour for the second mile
Would the answer be different if each segment was equal time (versus equal distance)?
Answer
You drive two miles
30 miles per hour for the first mile 90 miles per hour for the second mile
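The extracted slide does not show the computed answer. Assuming the question is the average speed over the whole trip, a minimal sketch of both cases (equal distance and equal time) could look like this; the function names are illustrative.

```python
def average_speed_equal_distance(speeds_mph, miles_per_segment=1.0):
    """Average speed when every segment covers the same distance.

    Total distance divided by total time, i.e. the harmonic mean of the speeds.
    """
    total_distance = miles_per_segment * len(speeds_mph)
    total_hours = sum(miles_per_segment / s for s in speeds_mph)
    return total_distance / total_hours


def average_speed_equal_time(speeds_mph):
    """Average speed when every segment lasts the same time (arithmetic mean)."""
    return sum(speeds_mph) / len(speeds_mph)


print(round(average_speed_equal_distance([30, 90]), 2))  # 45.0
print(average_speed_equal_time([30, 90]))                # 60.0
```

With equal distances, the slow mile takes three times as long as the fast one, so it dominates the average; with equal times, the two speeds count equally.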
501 News
some HW1 answers were incorrect
please re-submit if you have already submitted
801 quotes
CPU Performance
CPI example
A program executes equal proportions of integer, floating point (FP), and memory ops
Cycles per instruction type: integer = 1, memory = 2, FP = 3
What is the CPI? (33% * 1) + (33% * 2) + (33% * 3) = 2
Caveat: this sort of calculation ignores many effects
Back-of-the-envelope arguments only
CPI Example
Assume a processor with instruction frequencies and costs
Integer ALU: 50%, 1 cycle
Load: 20%, 5 cycles
Store: 10%, 1 cycle
Branch: 20%, 2 cycles
Compute CPI
Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2 CPI
A (branch cost reduced to 1 cycle) = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8 CPI (1.11x or 11% faster)
B (load cost reduced to 3 cycles) = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 CPI (1.25x or 25% faster)
B is the winner
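The CPI comparison is a weighted average, which can be sketched as below. The dictionary layout and names are assumptions made for illustration; the frequencies and cycle costs are the ones on the slide.

```python
def cpi(mix):
    """Weighted-average CPI for a mix of {insn type: (frequency, cycles)}."""
    return sum(freq * cycles for freq, cycles in mix.values())


base = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 1), "branch": (0.2, 2)}
option_a = dict(base, branch=(0.2, 1))  # branch cost drops from 2 cycles to 1
option_b = dict(base, load=(0.2, 3))    # load cost drops from 5 cycles to 3

for name, mix in [("base", base), ("A", option_a), ("B", option_b)]:
    # Speedup over base assumes instruction count and clock are unchanged.
    print(name, round(cpi(mix), 2), round(cpi(base) / cpi(mix), 2))
```

Like the slide's arithmetic, this is a back-of-the-envelope model: it treats each instruction's cost as independent of its neighbors.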
Measuring CPI
How are CPI and execution-time actually measured?
Execution time? stopwatch timer (Unix time command)
CPI = (CPU time * clock frequency) / dynamic insn count
How is dynamic instruction count measured?
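As a sketch of the CPI formula above, the helper below plugs in measured values. The function name and the sample numbers are hypothetical; in practice the instruction count would have to come from a separate measurement.

```python
def measured_cpi(cpu_time_s, clock_hz, dynamic_insn_count):
    """CPI = (CPU time * clock frequency) / dynamic instruction count."""
    return (cpu_time_s * clock_hz) / dynamic_insn_count


# Hypothetical run: 2.0 s of CPU time at 1 GHz, 1 billion dynamic instructions.
print(measured_cpi(2.0, 1e9, 1e9))  # 2.0
```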
Classic example
800 MHz Pentium III faster than 1 GHz Pentium 4!
More recent example: Core i7 faster clock-per-clock than Core 2
Same ISA and compiler!
Measurement Challenges
Is -O0 really faster than -O3? Why might it not be?
other processes running
not enough runs
not using a high-resolution timer
cold-start effects
managed languages: JIT/GC/VM startup
Experiment Design
Two kinds of errors: systematic and random
Removing systematic error
aka measurement bias or "not measuring what you think you are"
Run on an unloaded system
Measure something that runs for at least several seconds
Understand the system being measured
simple empty-for-loop test => compiler optimizes it away
Vary experimental setup
Use appropriate statistics
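Several of these points (repeated runs, a high-resolution timer, discarding cold-start runs) can be sketched in a small timing harness. This is only an illustration: `time_runs` and its parameters are made-up names, and a real experiment would use a workload that runs for seconds rather than the tiny one shown.

```python
import statistics
import time


def time_runs(fn, runs=5, warmup=1):
    """Time fn() `runs` times with a high-resolution clock, after warm-up runs."""
    for _ in range(warmup):
        fn()  # warm-up runs absorb cold-start effects and are not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def workload():
    # Returns a value so the work is observable (placeholder workload).
    return sum(i * i for i in range(100_000))


samples = time_runs(workload, runs=5, warmup=1)
print(f"median {statistics.median(samples):.6f} s over {len(samples)} runs")
```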
[Figure: plot with axes labeled "count" and "execution time"]
Confidence Intervals
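Compute mean and confidence interval (CI)
$\bar{x} \pm t \frac{s}{\sqrt{n}}$
s = sample standard deviation, n = number of samples, t = critical value for the chosen confidence level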
CI example
setup
130 experiments, mean = 45.4s, standard deviation = 10.1s
at 95% CI, performance is 45.4 ± 1.74 seconds
What if we want a smaller CI?
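The interval in this example can be reproduced with a short sketch. Two assumptions are made here: the 10.1s figure is treated as the sample standard deviation s (this is what makes the 1.74 result come out), and the large-sample critical value 1.96 is used in place of a Student's t value. The function names are illustrative.

```python
import math
import statistics


def ci_half_width(stdev, n, critical=1.96):
    """Half-width of the confidence interval: critical * s / sqrt(n)."""
    return critical * stdev / math.sqrt(n)


def mean_with_ci(samples, critical=1.96):
    """Mean and CI half-width for a list of measurements."""
    mean = statistics.mean(samples)
    half = ci_half_width(statistics.stdev(samples), len(samples), critical)
    return mean, half


half = ci_half_width(10.1, 130)
print(f"45.4 +/- {half:.2f} s")  # 45.4 +/- 1.74 s
```

Because the half-width shrinks with the square root of n, halving the interval takes roughly four times as many experiments.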
Performance Laws
Amdahl's Law
$\text{speedup} = \frac{1}{(1 - P) + \frac{P}{S}}$
How much will an optimization improve performance?
P = proportion of running time affected by optimization
S = speedup
What if I speed up 25% of a program's execution by 2x?
1.14x speedup
What if I speed up 25% of a program's execution by ∞?
1.33x speedup
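Both answers follow directly from the formula, as the sketch below shows. The function name is illustrative; an infinite speedup is modeled with `math.inf`, which makes the P/S term vanish.

```python
import math


def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of running time is sped up by s."""
    return 1.0 / ((1.0 - p) + p / s)


print(round(amdahl_speedup(0.25, 2), 2))         # 1.14
print(round(amdahl_speedup(0.25, math.inf), 2))  # 1.33
```

Even an infinitely fast optimization of 25% of the program cannot beat 1/(1 - 0.25), because the untouched 75% still has to run.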
[Figure: pie chart of budget categories: Social Security, Defense, Interest, Agriculture, Veteran's Affairs, Treasury, Labor, Transportation, Edu, Other]
$\text{speedup} = \frac{1}{(1 - P) + \frac{P}{N}}$
How much will parallelization improve performance?
P = proportion of parallel code
N = threads
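To see how the serial fraction caps the benefit of adding threads, the sketch below evaluates the formula for a hypothetical program in which 90% of the code is parallel. The value of P and the thread counts are assumptions chosen for illustration.

```python
def parallel_speedup(p, n):
    """Speedup with n threads when a fraction p of the code is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


P = 0.9  # hypothetical parallel fraction
for threads in (1, 2, 4, 16, 256):
    print(threads, round(parallel_speedup(P, threads), 2))

# The speedup approaches but never reaches 1 / (1 - P).
print("upper bound:", round(1.0 / (1.0 - P), 2))
```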
501 News
paper review #2 graded
homework #2 out
due Wed 2 Oct at midnight
can only submit once!
Little's Law
$L = \lambda W$
L = items in the system
$\lambda$ = average arrival rate
W = average wait time
Assumption:
system is in steady state, i.e., average arrival rate = average departure rate
No assumptions about:
arrival/departure/wait time distribution or service order (FIFO, LIFO, etc.)
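A minimal numeric sketch of Little's Law follows. The scenario (requests arriving at 2 per cycle and waiting 10 cycles each) is hypothetical, as are the function names.

```python
def items_in_system(arrival_rate, wait_time):
    """Little's Law: L = lambda * W, valid for a system in steady state."""
    return arrival_rate * wait_time


def wait_time_from(items, arrival_rate):
    """Rearranged form: W = L / lambda."""
    return items / arrival_rate


# Hypothetical: 2 requests arrive per cycle and each waits 10 cycles on average.
print(items_in_system(2, 10))  # 20 requests in flight on average
print(wait_time_from(20, 2))   # 10.0 cycles
```

Only averages appear in the calculation, which matches the slide's point that no distribution or service order needs to be assumed.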
Benchmarking
SPECmark 2006
Reference machine: Sun UltraSPARC II (@ 296 MHz)
Latency SPECmark
For each benchmark
Take odd number of samples
Choose median
Take latency ratio (reference machine / your machine)
Take average (Geometric mean) of ratios over all benchmarks
Throughput SPECmark
Run multiple benchmarks in parallel on multiple-processor system
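The latency SPECmark steps (median of an odd number of samples, ratio against the reference machine, geometric mean across benchmarks) can be sketched as below. This is not SPEC's tooling; the function name, benchmark names, and timings are all made up for illustration.

```python
import math
import statistics


def specmark_style_score(reference_times, sample_times):
    """Geometric mean of (reference latency / median measured latency)."""
    ratios = []
    for name, ref in reference_times.items():
        median = statistics.median(sample_times[name])
        ratios.append(ref / median)
    # Geometric mean computed through logs.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical latencies in seconds; three samples per benchmark.
reference = {"bench_a": 100.0, "bench_b": 80.0}
measured = {"bench_a": [50.0, 49.0, 60.0], "bench_b": [10.0, 9.0, 12.0]}

print(round(specmark_style_score(reference, measured), 2))  # 4.0
```

Here the per-benchmark ratios are 2 and 8, so the geometric mean is 4, whereas the arithmetic mean would report 5.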
Example: GeekBench
Set of cross-platform multicore benchmarks
Can run on iPhone, Android, laptop, desktop, etc
Pitfall: Workloads are simple, may not be a completely accurate representation of performance
What we do know: scores are evaluated relative to a baseline benchmark
GeekBench Numbers
Desktop (4 core Ivy Bridge at 3.4GHz): 11456
Laptop:
MacBook Pro (13-inch) - Intel Core i7-3520M 2900 MHz (2 cores) - 7807
Phones:
iPhone 5 - Apple A6 1000 MHz (2 cores) - 1589
iPhone 4S - Apple A5 800 MHz (2 cores) - 642
Samsung Galaxy S III (North America) - Qualcomm Snapdragon S3 MSM8960 1500 MHz (2 cores) - 1429
Summary
Latency = seconds / program =
(instructions / program) * (cycles / instruction) * (seconds / cycle)
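The three-factor latency equation can be written as a one-line function. The sketch below uses a hypothetical program (1 billion instructions, CPI of 2, 1 GHz clock); the function and parameter names are illustrative.

```python
def latency_seconds(insn_count, cpi, clock_hz):
    """seconds/program = (insns/program) * (cycles/insn) * (seconds/cycle)."""
    seconds_per_cycle = 1.0 / clock_hz
    return insn_count * cpi * seconds_per_cycle


# Hypothetical program: 1e9 instructions, CPI of 2, 1 GHz clock.
print(latency_seconds(1e9, 2, 1e9))
```

Improving any one factor helps only if the other two do not get worse, which is why clock frequency alone is a poor predictor of performance.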
[Figure: performance comparison across build variants: -O0, -O3 unrolled, vectorized, openmp]