
CIS 501: Computer Architecture: Unit 4: Performance & Benchmarking

This document discusses a unit on performance and benchmarking from a computer architecture course. It covers various performance metrics like latency and throughput. It also discusses speedup, averaging performance numbers, and factors that impact CPU performance like instructions per cycle and clock frequency. Common performance pitfalls from using partial metrics are described. The document provides examples and guidelines for accurately evaluating and comparing system performance.

Uploaded by

Rajesh Tiwary
Copyright
© Attribution Non-Commercial (BY-NC)

CIS 501: Computer Architecture

Unit 4: Performance & Benchmarking


Slides developed by Joe Devietti, Milo Martin & Amir Roth at UPenn, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood

CIS 501: Comp. Arch. | Prof. Joe Devietti | Performance

This Unit
Metrics
  Latency and throughput
  Speedup
  Averaging
CPU Performance
Performance Pitfalls
Benchmarking


Performance Metrics


Performance: Latency vs. Throughput


Latency (execution time): time to finish a fixed task
Throughput (bandwidth): number of tasks per unit time
Different: exploit parallelism for throughput, not latency (e.g., bread)
Often contradictory (latency vs. throughput)
Will see many examples of this
Choose definition of performance that matches your goals
Scientific program? Latency. Web server? Throughput.

Example: move people 10 miles


Car: capacity = 5, speed = 60 miles/hour
Bus: capacity = 60, speed = 20 miles/hour
Latency: car = 10 min, bus = 30 min
Throughput: car = 15 PPH (people per hour, counting the return trip), bus = 60 PPH
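The car and bus figures can be reproduced with a short calculation. This is a minimal sketch: the helper names `latency_minutes` and `throughput_pph` are illustrative and not from the slides, and throughput assumes each vehicle returns empty before carrying its next load.

```python
def latency_minutes(distance_miles, speed_mph):
    """One-way travel time in minutes."""
    return 60 * distance_miles / speed_mph


def throughput_pph(capacity, distance_miles, speed_mph):
    """People delivered per hour, counting the empty return trip."""
    round_trip_hours = 2 * distance_miles / speed_mph
    return capacity / round_trip_hours


DISTANCE = 10  # miles, from the slide
vehicles = {
    "car": {"capacity": 5, "speed_mph": 60},
    "bus": {"capacity": 60, "speed_mph": 20},
}

for name, v in vehicles.items():
    lat = latency_minutes(DISTANCE, v["speed_mph"])
    tput = throughput_pph(v["capacity"], DISTANCE, v["speed_mph"])
    print(f"{name}: latency = {lat:.0f} min, throughput = {tput:.0f} PPH")
```

The car wins on latency and the bus wins on throughput, which is the point of the example.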

Fastest way to send 10 TB of data? (1+ Gbit/second)



Amazon Does This


"Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."
Andrew Tanenbaum, Computer Networks, 4th ed., p. 91

Comparing speeds
Program running time: System A = 15s, System B = 5s

How much faster is System B than System A?
speedup of 3x
3x, 200% faster
1/3, 33% the running time
67% less running time

How much slower is System A than System B?
slowdown of 3x
3x, 200% slower
3x, 300% the running time
200% more running time


Comparing Performance - Speedup


A is X times faster than B if
X = Latency(B)/Latency(A) (divide by the faster)
X = Throughput(A)/Throughput(B) (divide by the slower)

A is X% faster than B if
X = ((Latency(B)/Latency(A)) - 1) * 100
X = ((Throughput(A)/Throughput(B)) - 1) * 100
Latency(A) = Latency(B) / (1+(X/100))
Throughput(A) = Throughput(B) * (1+(X/100))

Car/bus example
Latency? Car is 3 times (and 200%) faster than bus
Throughput? Bus is 4 times (and 300%) faster than car
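The two speedup definitions can be checked against the car/bus numbers. The sketch below is illustrative only; the function names are not from the slides.

```python
def speedup_from_latency(latency_a, latency_b):
    """How many times faster A is than B, given latencies (lower is better)."""
    return latency_b / latency_a


def speedup_from_throughput(throughput_a, throughput_b):
    """How many times faster A is than B, given throughputs (higher is better)."""
    return throughput_a / throughput_b


def percent_faster(speedup):
    """Convert an 'X times faster' ratio into 'X% faster'."""
    return (speedup - 1) * 100


# Latencies in minutes: car = 10, bus = 30
car_vs_bus = speedup_from_latency(latency_a=10, latency_b=30)
# Throughputs in people per hour: bus = 60, car = 15
bus_vs_car = speedup_from_throughput(throughput_a=60, throughput_b=15)

print(f"car is {car_vs_bus:.0f}x ({percent_faster(car_vs_bus):.0f}%) faster on latency")
print(f"bus is {bus_vs_car:.0f}x ({percent_faster(bus_vs_car):.0f}%) faster on throughput")
```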


Speedup and % Increase and Decrease


Program A runs for 200 cycles
Program B runs for 350 cycles
Percent increase and decrease are not the same.
% increase: ((350 - 200)/200) * 100 = 75%
% decrease: ((350 - 200)/350) * 100 = 42.9%

Speedup:
350/200 = 1.75
Program A is 1.75x faster than program B
As a percentage: (1.75 - 1) * 100 = 75%

If program C is 1x faster than A, how many cycles does C run for? 200 (the same as A)
What if C is 1.5x faster? 133 cycles (50% faster than A)
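The arithmetic on this slide is easy to get backwards, so here is a small sketch of it. The variable and function names are illustrative, not from the slides.

```python
a_cycles = 200
b_cycles = 350

# Going from A to B is an increase measured against A ...
pct_increase = (b_cycles - a_cycles) / a_cycles * 100
# ... while going from B to A is a decrease measured against B.
pct_decrease = (b_cycles - a_cycles) / b_cycles * 100

# Speedup of A over B divides the slower run by the faster run.
speedup_a_over_b = b_cycles / a_cycles


def cycles_for_speedup(base_cycles, speedup):
    """Cycle count of a program that is `speedup` times faster than the base."""
    return base_cycles / speedup


print(f"increase = {pct_increase:.1f}%, decrease = {pct_decrease:.1f}%")
print(f"A is {speedup_a_over_b:.2f}x faster than B")
print(f"C at 1.5x faster than A: {cycles_for_speedup(a_cycles, 1.5):.0f} cycles")
```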

Mean (Average) Performance Numbers


Arithmetic: $\frac{1}{N} \sum_{P=1}^{N} \text{Latency}(P)$
For units that are proportional to time (e.g., latency)

Harmonic: $N / \sum_{P=1}^{N} \frac{1}{\text{Throughput}(P)}$
For units that are inversely proportional to time (e.g., throughput)

You can add latencies, but not throughputs
Latency(P1+P2,A) = Latency(P1,A) + Latency(P2,A)
Throughput(P1+P2,A) != Throughput(P1,A) + Throughput(P2,A)
1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour
Average is not 60 miles/hour

Geometric: $\sqrt[N]{\prod_{P=1}^{N} \text{Speedup}(P)}$
For unitless quantities (e.g., speedup ratios)

For Example
You drive two miles
30 miles per hour for the first mile
90 miles per hour for the second mile

Question: what was your average speed?
Hint: the answer is not 60 miles per hour
Why?

Would the answer be different if each segment was equal time (versus equal distance)?


Answer
You drive two miles
30 miles per hour for the first mile
90 miles per hour for the second mile

Question: what was your average speed?
Hint: the answer is not 60 miles per hour
0.03333 hours per mile for 1 mile
0.01111 hours per mile for 1 mile
0.02222 hours per mile on average = 45 miles per hour
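The driving example is a harmonic mean in disguise. The sketch below uses Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later). The speedup values in the last part are hypothetical and only illustrate the geometric mean.

```python
from statistics import geometric_mean, harmonic_mean, mean

speeds_mph = [30, 90]

# Equal distance per segment: the times add up, so use the harmonic mean.
equal_distance_avg = harmonic_mean(speeds_mph)

# Equal time per segment: the distances add up, so the arithmetic mean applies.
equal_time_avg = mean(speeds_mph)

# Unitless ratios such as speedups use the geometric mean (hypothetical values).
speedups = [2.0, 8.0]
typical_speedup = geometric_mean(speedups)

print(f"equal distance: {equal_distance_avg:.0f} mph")
print(f"equal time:     {equal_time_avg:.0f} mph")
print(f"geometric mean of speedups: {typical_speedup:.1f}x")
```

So the answer does change with the setup: equal-distance segments give 45 miles per hour, while equal-time segments would give 60.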



501 News
some HW1 answers were incorrect
please re-submit if you have already submitted

Canvas instant feedback score isn't always reliable


Paper Review #2: The IBM 801


Important changes from 801 => improved 801
What was the role of the simulator?
What was the role of the compiler?


801 quotes


CPU Performance


Recall: CPU Performance Equation


Multiple aspects to performance: helps to isolate them
Latency = seconds / program =
(insns / program) * (cycles / insn) * (seconds / cycle)
Insns / program: dynamic insn count
Impacted by program, compiler, ISA
Cycles / insn: CPI
Impacted by program, compiler, ISA, micro-arch
Seconds / cycle: clock period (1 / clock frequency in Hz)
Impacted by micro-arch, technology

For low latency (better performance) minimize all three
Difficult: often pull against one another
Example we have seen: RISC vs. CISC ISAs
RISC: low CPI/clock period, high insn count
CISC: low insn count, high CPI/clock period
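The three-term equation translates directly into code. In this sketch the instruction counts, CPIs, and clock rates are hypothetical values chosen only to show the terms pulling against one another; they do not describe real RISC or CISC machines.

```python
def latency_seconds(insn_count, cpi, clock_hz):
    """seconds/program = (insns/program) * (cycles/insn) * (seconds/cycle)."""
    seconds_per_cycle = 1.0 / clock_hz
    return insn_count * cpi * seconds_per_cycle


# Hypothetical "RISC-like" design: more instructions, low CPI, short clock period.
risc_like = latency_seconds(insn_count=1.2e9, cpi=1.0, clock_hz=2.0e9)

# Hypothetical "CISC-like" design: fewer instructions, high CPI, long clock period.
cisc_like = latency_seconds(insn_count=0.8e9, cpi=2.0, clock_hz=1.6e9)

print(f"RISC-like: {risc_like:.2f} s")
print(f"CISC-like: {cisc_like:.2f} s")
```

With these made-up numbers the lower instruction count does not make up for the higher CPI and longer clock period. Different numbers could reverse the outcome, which is why all three terms have to be considered together.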


Cycles per Instruction (CPI)


CPI: cycles per instruction, on average
IPC = 1/CPI
Used more frequently than CPI
Favored because bigger is better, but harder to compute with
Different instructions have different cycle costs
E.g., add typically takes 1 cycle, divide takes >10 cycles
Depends on relative instruction frequencies

CPI example
A program executes equal proportions of integer, floating point (FP), and memory ops
Cycles per instruction type: integer = 1, memory = 2, FP = 3
What is the CPI? (33% * 1) + (33% * 2) + (33% * 3) = 2
Caveat: this sort of calculation ignores many effects
Back-of-the-envelope arguments only

CPI Example
Assume a processor with instruction frequencies and costs
Integer ALU: 50%, 1 cycle
Load: 20%, 5 cycles
Store: 10%, 1 cycle
Branch: 20%, 2 cycles

Which change would improve performance more?
A. Branch prediction to reduce branch cost to 1 cycle?
B. Faster data memory to reduce load cost to 3 cycles?

Compute CPI
Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2 CPI
A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8 CPI (1.11x or 11% faster)
B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 CPI (1.25x or 25% faster)
B is the winner
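The same comparison as a weighted sum in code. This is a back-of-the-envelope sketch like the slide itself; the dictionary layout and the name `cpi` are illustrative.

```python
def cpi(mix):
    """Weighted-average CPI from {insn class: (frequency, cycles)}."""
    return sum(freq * cycles for freq, cycles in mix.values())


base = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 1), "branch": (0.2, 2)}
option_a = {**base, "branch": (0.2, 1)}  # A: branch cost drops to 1 cycle
option_b = {**base, "load": (0.2, 3)}    # B: load cost drops to 3 cycles

for name, mix in (("base", base), ("A", option_a), ("B", option_b)):
    speedup = cpi(base) / cpi(mix)
    print(f"{name}: CPI = {cpi(mix):.2f}, speedup over base = {speedup:.2f}x")
```

Because instruction count and clock period are unchanged in both options, the CPI ratio is the speedup.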

Measuring CPI
How are CPI and execution-time actually measured?
Execution time? stopwatch timer (Unix time command)
CPI = (CPU time * clock frequency) / dynamic insn count
How is dynamic instruction count measured?

More useful is CPI breakdown (CPI_CPU, CPI_MEM, etc.)
So we know what performance problems are and what to fix
Hardware event counters
Available in most processors today
One way to measure dynamic instruction count
Calculate CPI using counter frequencies / known event costs
Cycle-level micro-architecture simulation
+ Measure exactly what you want and impact of potential fixes!
Method of choice for many micro-architects

Pitfalls of Partial Performance Metrics



MHz (MegaHertz) and GHz (GigaHertz)

1 Hertz = 1 cycle per second
1 GHz is 1 cycle per nanosecond, 1 GHz = 1000 MHz
(Micro-)architects often ignore dynamic instruction count
but general public (mostly) also ignores CPI
Equates clock frequency with performance!

Which processor would you buy?


Processor A: CPI = 2, clock = 5 GHz
Processor B: CPI = 1, clock = 3 GHz
Probably A, but B is faster (assuming same ISA/compiler)
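Why B wins becomes clear once clock frequency is divided by CPI. A minimal sketch, assuming the same ISA and compiler so that instruction count cancels out; the function name is illustrative.

```python
def insns_per_second(cpi, clock_hz):
    """Instruction throughput = (insns / cycle) * (cycles / second)."""
    return clock_hz / cpi


proc_a = insns_per_second(cpi=2, clock_hz=5e9)
proc_b = insns_per_second(cpi=1, clock_hz=3e9)

print(f"A: {proc_a / 1e9:.1f} billion insns/s")
print(f"B: {proc_b / 1e9:.1f} billion insns/s")
print(f"B is {proc_b / proc_a:.2f}x faster than A")
```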

Classic example
800 MHz PentiumIII faster than 1 GHz Pentium4!
More recent example: Core i7 faster clock-per-clock than Core 2
Same ISA and compiler!

Meta-point: danger of partial performance metrics!



MIPS (performance metric, not the ISA)


(Micro) architects often ignore dynamic instruction count
Typically work in one ISA/one compiler, treat it as fixed

CPU performance equation becomes
Latency: seconds / insn = (cycles / insn) * (seconds / cycle)
Throughput: insn / second = (insn / cycle) * (cycles / second)

MIPS (millions of instructions per second)
Cycles / second: clock frequency (in MHz)
Example: CPI = 2, clock = 500 MHz => 0.5 * 500 MHz = 250 MIPS

Pitfall: may vary inversely with actual performance
Compiler removes insns, program gets faster, MIPS goes down
Work per instruction varies (e.g., multiply vs. add, FP vs. integer)
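The MIPS pitfall can be made concrete with a small, entirely hypothetical scenario: a compiler optimization removes 40% of the instructions, but the ones that remain are the more expensive ones, so average CPI rises from 1.0 to 1.5. None of these numbers come from the slides.

```python
def run_stats(insn_count, cpi, clock_hz):
    """Return (execution time in seconds, MIPS rating) for one run."""
    seconds = insn_count * cpi / clock_hz
    mips = insn_count / seconds / 1e6
    return seconds, mips


CLOCK_HZ = 500e6  # 500 MHz, as in the slide's MIPS example

before = run_stats(insn_count=1.0e9, cpi=1.0, clock_hz=CLOCK_HZ)
after = run_stats(insn_count=0.6e9, cpi=1.5, clock_hz=CLOCK_HZ)

print(f"before: {before[0]:.2f} s at {before[1]:.0f} MIPS")
print(f"after:  {after[0]:.2f} s at {after[1]:.0f} MIPS")
```

The optimized program finishes sooner yet reports a lower MIPS rating, so ranking by MIPS alone would pick the slower run.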

Performance Rules of Thumb


Design for actual performance, not peak performance
Peak performance: performance you are guaranteed not to exceed
Greater than actual or average or sustained performance
Why? Cache misses, branch mispredictions, limited ILP, etc.
For actual performance X, machine capability must be > X

Easier to buy bandwidth than latency
say we want to transport more cargo via train:
(1) build another track or (2) make a train that goes twice as fast?
Use bandwidth to reduce latency

Build a balanced system
Don't over-optimize 1% to the detriment of other 99%
System performance often determined by slowest component

Measurement Challenges


Measurement Challenges
Is -O0 really faster than -O3?
Why might it not be?
other processes running
not enough runs
not using a high-resolution timer
cold-start effects
managed languages: JIT/GC/VM startup

solution: experiment design + statistics


Experiment Design
Two kinds of errors: systematic and random

removing systematic error
aka measurement bias or not measuring what you think you are
Run on an unloaded system
Measure something that runs for at least several seconds
Understand the system being measured
simple empty-for-loop test => compiler optimizes it away
Vary experimental setup
Use appropriate statistics

removing random error
Perform many runs: how many is enough?
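A minimal sketch of a repeated-measurement harness, assuming Python's `time.perf_counter` as the high-resolution timer. The names `time_runs` and `example_workload` are hypothetical, and the workload is deliberately tiny so the example finishes quickly; a real experiment would time something that runs for several seconds on an unloaded machine.

```python
import statistics
import time


def time_runs(workload, runs=10):
    """Time `workload()` several times with a high-resolution timer."""
    workload()  # warm-up run to reduce cold-start effects; not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def example_workload():
    # Placeholder work that produces a result, unlike an empty loop.
    return sum(i * i for i in range(100_000))


samples = time_runs(example_workload, runs=10)
print(f"mean  = {statistics.mean(samples):.6f} s")
print(f"stdev = {statistics.stdev(samples):.6f} s")
```

Keeping every sample, rather than only the mean, makes it possible to look at the distribution and compute a confidence interval afterwards.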


Determining performance differences


Program runs in 20s on machine A, 20.1s on machine B
Is this a meaningful difference?

[Figure: count (y-axis) against execution time (x-axis)]
the distribution matters!


Confidence Intervals
Compute mean and confidence interval (CI)

$\bar{x} \pm t \frac{s}{\sqrt{n}}$

t = critical value from t-distribution
s = sample standard error
n = # experiments in sample

Meaning of the 95% confidence interval $\bar{x} \pm 0.05$
collected 1 sample with n experiments
given repeated sampling, $\bar{x}$ will be within 0.05 of the true mean 95% of the time

If CIs overlap, differences not statistically significant



CI example
setup
130 experiments, mean = 45.4s, stderr = 10.1s

What's the 95% CI?
t = 1.962 (depends on %CI and # experiments)
look it up in a stats textbook or online
at 95% CI, performance is 45.4 ± 1.74 seconds

What if we want a smaller CI?
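The ±1.74 figure follows from a half-width of t * s / sqrt(n). The sketch below plugs the slide's 10.1s value in as s, which is an assumption about how the slide's number was computed; the function name is illustrative.

```python
import math


def ci_half_width(t_critical, s, n):
    """Half-width of the confidence interval: t * s / sqrt(n)."""
    return t_critical * s / math.sqrt(n)


mean_s = 45.4
half = ci_half_width(t_critical=1.962, s=10.1, n=130)
print(f"95% CI: {mean_s} +/- {half:.2f} s")

# Simplification: t is held fixed here, although it shifts slightly with n.
half_4x = ci_half_width(t_critical=1.962, s=10.1, n=4 * 130)
print(f"with 4x the experiments: +/- {half_4x:.2f} s")
```

Under this formula the interval shrinks with the square root of n, so halving the interval takes roughly four times as many experiments.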


Performance Laws


Amdahl's Law

$\frac{1}{(1 - P) + \frac{P}{S}}$

How much will an optimization improve performance?
P = proportion of running time affected by optimization
S = speedup

What if I speed up 25% of a program's execution by 2x? 1.14x speedup
What if I speed up 25% of a program's execution by $\infty$? 1.33x speedup
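Both answers follow from the formula 1 / ((1 - P) + P / S). A minimal sketch; the function name is illustrative, and `math.inf` stands in for an infinitely fast optimization.

```python
import math


def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of running time is sped up by s."""
    return 1.0 / ((1.0 - p) + p / s)


print(f"25% of the program sped up 2x:       {amdahl_speedup(0.25, 2):.2f}x")
print(f"25% of the program made free (inf):  {amdahl_speedup(0.25, math.inf):.2f}x")
```

Even when the optimized 25% takes no time at all, the untouched 75% caps the overall gain at 1.33x.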


Amdahl's Law for the US Budget

US Federal Govt Expenses 2013
[Figure: chart of 2013 US federal expenses in $B (axis from 0 to 4000), broken down into Social Security, Defense, Interest, Agriculture, Veteran's Affairs, Treasury, Labor, Transportation, Edu, Other]
http://en.wikipedia.org/wiki/2013_United_States_federal_budget

scrapping Dept of Transportation ($98B) cuts budget by 2.7%


Amdahl's Law for Parallelization

$\frac{1}{(1 - P) + \frac{P}{N}}$

How much will parallelization improve performance?
P = proportion of parallel code
N = threads

What is the max speedup for a program that's 10% serial?

What about 1% serial?
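Both questions can be answered by letting the thread count grow without bound, which leaves only the serial fraction in the denominator. A minimal sketch; the function names and the thread counts in the loop are illustrative.

```python
def parallel_speedup(serial_fraction, threads):
    """Amdahl's Law with P = 1 - serial_fraction and N = threads."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


def max_speedup(serial_fraction):
    """Limit of the speedup as the thread count goes to infinity."""
    return 1.0 / serial_fraction


for serial in (0.10, 0.01):
    finite = [round(parallel_speedup(serial, n), 1) for n in (4, 16, 256)]
    print(f"{serial:.0%} serial: max {max_speedup(serial):.0f}x, "
          f"with 4/16/256 threads: {finite}")
```

A program that is 10% serial can never exceed 10x, and one that is 1% serial tops out at 100x, no matter how many threads are added.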


Increasing proportion of parallel code


Amdahl's Law requires extremely parallel code to take advantage of large multiprocessors
two approaches:
strong scaling: shrink the serial component
+ same problem runs faster
- becomes harder and harder to do
weak scaling: increase the problem size
+ natural in many problem domains: internet systems, scientific computing, video games
- doesn't work in other domains


501 News
paper review #2 graded
homework #2 out
due Wed 2 Oct at midnight
can only submit once!


How long am I going to be in this line?

Use Little's Law!


Little's Law

$L = \lambda W$
L = items in the system
$\lambda$ = average arrival rate
W = average wait time
Assumption:
system is in steady state, i.e., average arrival rate = average departure rate

No assumptions about:
arrival/departure/wait time distribution or service order (FIFO, LIFO, etc.)

Works on any queuing system
Works on systems of systems



Little's Law for Computing Systems

Only need to measure two of L, $\lambda$ and W
often difficult to measure L directly

Describes how to meet performance requirements
e.g., to get high throughput ($\lambda$), we need either:
low latency per request (small W)
service requests in parallel (large L)

Addresses many computer performance questions
sizing queue of L1, L2, L3 misses
sizing queue of outstanding network requests for 1 machine or the whole datacenter
calculating average latency for a design
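Two of the sizing questions can be sketched with Little's Law (items in system = arrival rate * wait time). All numbers below are hypothetical and serve only to show the arithmetic; the function names are illustrative.

```python
def items_in_system(rate, wait_time):
    """Little's Law solved for L: items = arrival rate * wait time."""
    return rate * wait_time


def sustainable_rate(items, wait_time):
    """Little's Law solved for the arrival rate: rate = items / wait time."""
    return items / wait_time


# Hypothetical cache: misses take 100 ns and we want to sustain 0.5 misses per ns.
miss_queue_entries = items_in_system(rate=0.5, wait_time=100)

# Hypothetical server: 64 requests in flight, 20 ms average latency per request.
requests_per_second = sustainable_rate(items=64, wait_time=0.020)

print(f"miss queue needs about {miss_queue_entries:.0f} entries")
print(f"server sustains about {requests_per_second:.0f} requests/second")
```

The second calculation shows the trade-off from the slide: at a fixed latency, the only way to raise throughput is to keep more requests in flight.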


Benchmarking


Processor Performance and Workloads


Q: what does performance of a chip mean?
A: nothing, there must be some associated workload
Workload: set of tasks someone (you) cares about

Benchmarks: standard workloads
Used to compare performance across machines
Either are, or are highly representative of, actual programs people run

Micro-benchmarks: non-standard non-workloads
Tiny programs used to isolate certain aspects of performance
Not representative of complex behaviors of real applications
Examples: binary tree search, towers-of-hanoi, 8-queens, etc.


SPECmark 2006
Reference machine: Sun UltraSPARC II (@ 296 MHz)

Latency SPECmark
For each benchmark
Take odd number of samples
Choose median
Take latency ratio (reference machine / your machine)
Take average (Geometric mean) of ratios over all benchmarks

Throughput SPECmark
Run multiple benchmarks in parallel on multiple-processor system

Recent (latency) leaders


SPECint: Intel Xeon E3-1280 v3 (63.7)
SPECfp: Intel Xeon E5-2690 2.90 GHz (96.6)


Example: GeekBench
Set of cross-platform multicore benchmarks
Can run on iPhone, Android, laptop, desktop, etc

Tests integer, floating point, memory, memory bandwidth performance

GeekBench stores all results online


Easy to check scores for many different systems, processors

Pitfall: Workloads are simple, may not be a completely accurate representation of performance
We know they evaluate compared to a baseline benchmark

GeekBench Numbers
Desktop (4 core Ivy bridge at 3.4GHz): 11456

Laptop:
MacBook Pro (13-inch) - Intel Core i7-3520M 2900 MHz (2 cores) - 7807

Phones:
iPhone 5 - Apple A6 1000 MHz (2 cores) - 1589
iPhone 4S - Apple A5 800 MHz (2 cores) - 642
Samsung Galaxy S III (North America) - Qualcomm Snapdragon S3 MSM8960 1500 MHz (2 cores) - 1429


Summary
Latency = seconds / program =
(instructions / program) * (cycles / instruction) * (seconds / cycle)

Instructions / program: dynamic instruction count


Function of program, compiler, instruction set architecture (ISA)

Cycles / instruction: CPI


Function of program, compiler, ISA, micro-architecture

Seconds / cycle: clock period


Function of micro-architecture, technology parameters

Optimize each component


CIS501 focuses mostly on CPI (caches, parallelism)
but some on dynamic instruction count (compiler, ISA)
and some on clock frequency (pipelining, technology)

Paper Review #3: Getting Wrong Results


Give two reasons why changing link order can affect performance.
Do gcc's -O3 optimizations improve over -O2?
Are there specific benchmarks that do reliably benefit from gcc's -O3?
Give another potential source of measurement bias in computer experiments that is not evaluated in this paper.
How can this source increase and/or decrease a program's performance?


Back of the envelope performance


#pragma omp parallel for
for (int i = 0; i < ARRAY_SIZE; i++) {
  z[i] = a*x[i] + b*y[i];
}

unrolling speedup?
vectorization speedup?
OpenMP speedup?


HW1 mean runtime


95% CI for mean runtime


low (blue) and high (green) bounds of 95% CI


HW1 raw data

[Figure: raw runtime data, with labels -O0, -O3 unrolled, vectorized, openmp]
