Performance: Latency
Topics: performance metrics, CPU performance equation, benchmarks and benchmarking,
reporting averages, Amdahl's Law, Little's Law, system balance and tradeoffs,
bursty behavior (average vs. peak performance)
Performance Metrics
latency: response time, execution time
  - good metric for a fixed amount of work (minimize time)
throughput: bandwidth, work per unit time
  - = (1 / latency) when there is NO overlap
  - > (1 / latency) when there is overlap
  - in real processors, there is always overlap (e.g., pipelining)
  - good metric for a fixed amount of time (maximize work)
comparing performance: A is N times faster than B iff
perf(A)/perf(B) = time(B)/time(A) = N
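A small sketch (my own illustration, not from the slides) of why overlap pushes throughput above 1/latency, assuming a 5-stage pipeline with a 1 ns cycle time:

```python
# Throughput with and without overlap, for a pipeline where each instruction
# has a 5-cycle (5 ns) latency but stages can work on different instructions.

def unpipelined_throughput(latency_ns: float) -> float:
    """Without overlap, throughput = 1 / latency (instructions per ns)."""
    return 1.0 / latency_ns

def pipelined_throughput(stages: int, cycle_ns: float, n_insns: int) -> float:
    """With overlap, one instruction completes per cycle once the pipe fills."""
    total_time = (stages + (n_insns - 1)) * cycle_ns
    return n_insns / total_time

latency = 5 * 1.0                           # 5 stages x 1 ns = 5 ns/instruction
print(unpipelined_throughput(latency))      # 0.2 instructions/ns
print(pipelined_throughput(5, 1.0, 1000))   # ~0.996 instructions/ns: > 1/latency
```

The longer the run of instructions, the closer pipelined throughput gets to one instruction per cycle, far above 1/latency.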
MIPS (millions of instructions per second)
  - may vary inversely with actual performance
  - particularly bad metric for multicore chips
MFLOPS: (FP ops / execution time) x 10^-6
  - like MIPS, but counts only FP operations
  - FP ops have longest latencies anyway (problem #1)
  - FP ops are the same across machines (problem #2)
CPU performance equation: time / program =
  (instructions / program) x (cycles / instruction) x (seconds / cycle)
  - instructions / program: dynamic instruction count
    mostly determined by program, compiler, ISA
  - cycles / instruction: CPI
    mostly determined by ISA and CPU/memory organization
  - seconds / cycle: cycle time, clock time, 1 / clock frequency
    mostly determined by technology and CPU organization
uses of the CPU performance equation
  - high-level performance comparisons
  - back-of-the-envelope calculations
  - helping architects think about compilers and technology
example: CISC (CPI = 8) vs. RISC (CPI = 2, but 2x the instructions), same cycle time T
  - CISC time = P x 8 x T = 8PT
  - RISC time = 2P x 2 x T = 4PT = CISC time / 2
the truth is much, much, much more complex
actual data from IBM AS/400 (CISC -> RISC in 1995):
  - CISC time = P x 7 x T = 7PT
  - RISC time = 3.1P x 3 x (T / 3.1) = 3PT (+1 technology generation)
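The back-of-the-envelope comparison above is easy to sanity-check with the CPU performance equation; the concrete values chosen here for P and T are arbitrary stand-ins:

```python
# CPU performance equation: time = instructions x CPI x cycle_time,
# applied to the hypothetical CISC/RISC comparison.

def exec_time(insns: float, cpi: float, cycle_time: float) -> float:
    return insns * cpi * cycle_time

P, T = 1e9, 1e-9                 # stand-in values: 1B instructions, 1 GHz clock
cisc = exec_time(P, 8, T)        # 8PT
risc = exec_time(2 * P, 2, T)    # 4PT
print(risc / cisc)               # 0.5: RISC time = CISC time / 2
```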
  + good for focusing on individual features, but not the big picture
  - over-emphasizes the target feature (for better or worse)
synthetic benchmarks: programs made up for benchmarking
  - e.g., Whetstone, Dhrystone
SPEC CPU2006
  - 12 integer programs (C, C++):
    gcc (compiler), perl (interpreter), hmmer (markov chain),
    bzip2 (compression), go (AI), sjeng (AI),
    libquantum (physics), h264ref (video),
    omnetpp (simulation), astar (path-finding),
    xalanc (XML processing), mcf (network optimization)
Benchmarking Pitfalls
benchmark properties mismatched with the features studied
  - e.g., using SPEC for large cache studies
careless scaling
  - using only the first few million instructions (initialization phase)
  - reducing program data size
Reporting Averages

there is no such thing as "the average program"
  - use averages only when absolutely necessary
arithmetic mean (AM): average latencies of N programs
  - AM = (1/N) x sum over i=1..N of time(i)
harmonic mean (HM): average rates of N programs
  - HM = N / (sum over i=1..N of 1/rate(i))
geometric mean (GM): average speedups of N programs
  - GM = (product over i=1..N of speedup(i))^(1/N)
what if programs run at different frequencies within the workload?
  - use weighting: weighted AM = sum over i=1..N of w(i) x time(i),
    with the weights w(i) summing to 1
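The averaging rules above can be written out directly; a minimal sketch (weights are assumed to sum to 1):

```python
# AM for latencies, HM for rates, GM for speedups, and a weighted AM.
import math

def am(times):
    return sum(times) / len(times)

def hm(rates):
    return len(rates) / sum(1.0 / r for r in rates)

def gm(speedups):
    return math.prod(speedups) ** (1.0 / len(speedups))

def weighted_am(weights, times):
    assert abs(sum(weights) - 1.0) < 1e-9   # weights must sum to 1
    return sum(w * t for w, t in zip(weights, times))

print(am([1, 3]))                          # 2.0
print(hm([10, 0.1]))                       # ~0.198
print(gm([10, 0.1]))                       # 1.0
print(weighted_am([0.25, 0.75], [4, 8]))   # 7.0
```

Note that Python's standard `statistics` module also provides `harmonic_mean` and `geometric_mean` if you prefer not to hand-roll these.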
GM Weirdness
what about averaging ratios (speedups)?
  - AM and HM give different answers depending on which machine is the base
  - GM does not

            time on A   time on B   ratio B/A   ratio A/B
Program 1        1          10          10          0.1
Program 2     1000         100         0.1          10

AM:  (10 + 0.1)/2 = 5.05         (0.1 + 10)/2 = 5.05
     "A is 5.05 times faster!"   "B is 5.05 times faster!"
HM:  2/(1/10 + 1/0.1) = 0.198    2/(1/0.1 + 1/10) = 0.198
     "B is 5.05 times faster!"   "A is 5.05 times faster!"
GM:  sqrt(10 x 0.1) = 1          sqrt(0.1 x 10) = 1
     same conclusion regardless of base
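The base-dependence in the table is quick to reproduce; a short sketch using the same two programs:

```python
# Averaging time ratios: AM gives a different "winner" depending on which
# machine is the base, while GM is consistent either way.
import math

a = [1, 1000]    # machine A times
b = [10, 100]    # machine B times

b_over_a = [tb / ta for ta, tb in zip(a, b)]   # [10.0, 0.1]
a_over_b = [ta / tb for ta, tb in zip(a, b)]   # [0.1, 10.0]

am = lambda xs: sum(xs) / len(xs)
gm = lambda xs: math.prod(xs) ** (1 / len(xs))

print(am(b_over_a), am(a_over_b))   # 5.05 5.05 -> contradictory conclusions
print(gm(b_over_a), gm(a_over_b))   # 1.0 1.0   -> consistent either way
```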
Amdahl's Law
"Validity of the Single-Processor Approach to Achieving Large-Scale
Computing Capabilities", G. Amdahl, AFIPS, 1967

let an optimization speed up fraction f of a program by factor s
  speedup = old / ([(1-f) x old] + [(f/s) x old]) = 1 / ((1-f) + f/s)
example: speeding up f = 95% of the program even infinitely (s -> infinity)
  yields only 1 / [(1-0.95) + (0.95/infinity)] = 1 / 0.05 = 20
make common case fast, but... ...uncommon case eventually limits performance
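Amdahl's Law is a one-liner in code; a minimal sketch:

```python
# Amdahl's Law: speedup = 1 / ((1 - f) + f/s), where fraction f of the
# program's execution is sped up by factor s.

def amdahl(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

print(amdahl(0.95, 10))             # ~6.9: a 10x optimization on 95% of the work
print(amdahl(0.95, float("inf")))   # ~20: the limit from the example above
```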
Little's Law
key relationship between latency and bandwidth:
  average number in system = arrival rate x mean holding time
"Possibly the most useful equation I know"
useful in the design of computers, software, industrial processes, etc.
example: how big a wine cellar should we build?
  - we drink (and buy) an average of 2 bottles per week
  - on average, we want to age the wine for 5 years
  - bottles in cellar = 2 bottles/week x 52 weeks/year x 5 years = 520 bottles
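The wine-cellar arithmetic, written as a tiny sketch (units must agree on both sides):

```python
# Little's Law: average number in system = arrival rate x mean holding time.

def littles_law(arrival_rate: float, holding_time: float) -> float:
    return arrival_rate * holding_time

bottles_per_week = 2
weeks_held = 52 * 5                                 # age each bottle 5 years
print(littles_law(bottles_per_week, weeks_held))    # 520 bottles
```

The same relationship sizes buffers, queues, and in-flight request counts: e.g., outstanding requests = request rate x average latency.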
System Balance
each system component produces & consumes data
make sure data supply and demand are balanced
Tradeoffs
"Bandwidth problems can be solved with money. Latency problems are harder,
because the speed of light is fixed and you can't bribe God." - David Clark
well... we can convert some latency problems into bandwidth problems
  - and solve those with money
  - the famous bandwidth/latency tradeoff
architecture is the art of making tradeoffs
Bursty Behavior
Q: to sustain 2 IPC, how many instructions must the processor be able to
   fetch per cycle? execute per cycle? complete per cycle?
A: NOT 2 (more than 2)
  - dependences will cause stalls (under-utilization)
  - if the desired performance is X, peak performance must be > X
  - programs don't always obey average behavior
  - can't design a processor to handle only average behavior
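A back-of-the-envelope sketch of this point (my own illustration, not from the slides; the stall fraction is an assumed parameter): if some fraction of cycles stall, the peak width must exceed the target sustained IPC.

```python
# If stalls idle the machine for stall_fraction of cycles, then
# sustained IPC = peak_width x (1 - stall_fraction); solve for peak_width.

def required_peak_width(target_ipc: float, stall_fraction: float) -> float:
    return target_ipc / (1.0 - stall_fraction)

print(required_peak_width(2.0, 0.0))   # 2.0: only with zero stalls
print(required_peak_width(2.0, 0.2))   # 2.5: 20% stall cycles need peak > 2
```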