Performance: Latency

The document discusses various performance metrics like latency, throughput, and MIPS and MFLOPS. It explains the CPU performance equation that separates performance into instructions per program, cycles per instruction, and seconds per cycle. Amdahl's Law and Little's Law are also summarized. The document discusses benchmarks and issues with different types of benchmarks. It covers reporting averages using arithmetic, harmonic, and geometric means. Concepts around system balance, tradeoffs, and bursty behavior are also summarized.


Performance

Topics:
- performance metrics
- CPU performance equation
- benchmarks and benchmarking
- reporting averages
- Amdahl's Law
- Little's Law
- concepts: balance, tradeoffs, bursty behavior (average and peak performance)

Performance Metrics
latency: response time, execution time
- good metric for a fixed amount of work (minimize time)
throughput: bandwidth, work per unit time
- = 1 / latency when there is NO overlap
- > 1 / latency when there is overlap
  - in real processors there is always overlap (e.g., pipelining)
- good metric for a fixed amount of time (maximize work)
comparing performance:
- A is N times faster than B iff perf(A)/perf(B) = time(B)/time(A) = N
- A is X% faster than B iff perf(A)/perf(B) = time(B)/time(A) = 1 + X/100
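The two comparison rules can be sketched as follows (the measured times are hypothetical, chosen only to illustrate the arithmetic):

```python
def times_faster(time_b, time_a):
    """N such that A is N times faster than B: perf(A)/perf(B) = time(B)/time(A)."""
    return time_b / time_a

def percent_faster(time_b, time_a):
    """X such that A is X% faster than B: time(B)/time(A) = 1 + X/100."""
    return (time_b / time_a - 1) * 100

# Hypothetical measurements: A runs a workload in 2 s, B in 3 s.
print(times_faster(3.0, 2.0))    # 1.5  -> A is 1.5 times faster than B
print(percent_faster(3.0, 2.0))  # 50.0 -> A is 50% faster than B
```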

Performance Metric I: MIPS


MIPS (millions of instructions per second)
= (instruction count / execution time in seconds) x 10^-6
but instruction count is not a reliable indicator of work:
- problem #1: work per instruction varies (FP multiply >> register move)
- problem #2: instruction sets aren't equal (3 Pentium instructions != 3 Alpha instructions)
MIPS may vary inversely with actual performance
- particularly bad metric for multicore chips

Performance Metric II: MFLOPS

MFLOPS (millions of floating-point operations per second)
= (FP operations / execution time in seconds) x 10^-6
like MIPS, but counts only FP operations:
+ FP ops have the longest latencies anyway (addresses problem #1)
+ FP ops are largely the same across machines (addresses problem #2)
may have been valid in 1980 (most programs were FP-heavy)
- most programs today are integer, i.e., light on FP
- a load from memory can take longer than an FP divide (problem #1 returns)
- Cray doesn't implement divide; Motorola has SQRT, SIN, COS (problem #2 returns)

CPU Performance Equation


processor performance = seconds / program
separate into three components (for a single core):
- instructions / program: dynamic instruction count
  - mostly determined by program, compiler, ISA
- cycles / instruction: CPI
  - mostly determined by ISA and CPU/memory organization
- seconds / cycle: cycle time, clock time, 1 / clock frequency
  - mostly determined by technology and CPU organization
uses of the CPU performance equation:
- high-level performance comparisons
- back-of-the-envelope calculations
- helping architects think about compilers and technology
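The three-factor equation can be sketched directly; all of the numbers below are hypothetical, picked only to show the units working out:

```python
def cpu_time(instructions, cpi, cycle_time_s):
    """seconds/program = (instructions/program) x (cycles/instruction) x (seconds/cycle)."""
    return instructions * cpi * cycle_time_s

# Hypothetical: 1 billion dynamic instructions, CPI of 1.5, 1 GHz clock (1 ns cycle).
t = cpu_time(1e9, 1.5, 1e-9)
print(t)  # ~1.5 seconds
```

Note that each factor is owned by a different layer (compiler/ISA, microarchitecture, technology), which is what makes the decomposition useful for back-of-the-envelope reasoning.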

CPU Performance Comparison


famous example: RISC Wars (RISC vs. CISC)
assume:
- instructions / program: CISC = P, RISC = 2P
- CPI: CISC = 8, RISC = 2
- T = clock period for CISC and RISC (assume they are equal)
then:
- CISC time = P x 8 x T = 8PT
- RISC time = 2P x 2 x T = 4PT = CISC time / 2
the truth is much, much, much more complex
actual data from IBM AS/400 (CISC -> RISC in 1995):
- CISC time = P x 7 x T = 7PT
- RISC time = 3.1P x 3 x T/3.1 = 3PT (and +1 technology generation)
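The back-of-the-envelope RISC/CISC comparison can be checked with a short sketch; P and T are normalized to 1 since only the ratio matters, and all figures are the assumed ones from the text:

```python
# Plug the assumed values into seconds/program = instructions x CPI x cycle time.
P, T = 1.0, 1.0  # arbitrary units; only the ratio of the two times matters

cisc_time = P * 8 * T        # 8PT
risc_time = (2 * P) * 2 * T  # 4PT

print(cisc_time / risc_time)  # 2.0 -> RISC takes half the CISC time under these assumptions
```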

Actually Measuring Performance


how are execution time and CPI actually measured?
- execution time: the Unix time command reports wall-clock, CPU, and system time
- CPI = (CPU time x clock frequency) / # instructions
- more useful: a CPI breakdown (compute, memory stall, etc.)
  - so we know what the performance problems are (what to fix)
measuring the CPI breakdown:
- hardware event counters (built into the core)
  - calculate CPI using instruction frequencies and event costs
- cycle-level microarchitecture simulator (e.g., SimpleScalar)
  + measure exactly what you want
  + model the microarchitecture faithfully (at least the parts of interest)
  - method of choice for many architects (yours, too!)
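As a sketch, CPI can be derived from a timed run this way (clock rate and counts below are hypothetical):

```python
def cpi(cpu_time_s, clock_hz, instructions):
    """CPI = total cycles / instructions, where cycles = CPU time x clock frequency."""
    return cpu_time_s * clock_hz / instructions

# Hypothetical run: 2 s of CPU time on a 1 GHz clock, retiring 1 billion instructions.
print(cpi(2.0, 1e9, 1e9))  # 2.0 cycles per instruction
```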

Benchmarks and Benchmarking


program as the unit of work
- millions of them, many different kinds: which to use?
benchmarks: standard programs for measuring/comparing performance
+ represent programs people care about
+ repeatable!!
benchmarking process:
1. define a workload
2. extract benchmarks from the workload
3. execute the benchmarks on candidate machines
4. project performance of the new machine
5. run the workload on the new machine and compare
6. not close enough? repeat

Benchmarks: Toys, Kernels, Synthetics


toy benchmarks: little programs that no one really runs
- e.g., fibonacci, 8 queens
- little value: what real programs do these represent?
- scary fact: used to prove the value of RISC in the early 80s
kernels: important (frequently executed) pieces of real programs
- e.g., Livermore loops, Linpack (inner product)
+ good for focusing on individual features, but not the big picture
- over-emphasize the target feature (for better or worse)
synthetic benchmarks: programs made up for benchmarking
- e.g., Whetstone, Dhrystone
- essentially toy kernels++: which programs do these represent?

Benchmarks: Real Programs


real programs
+ the only accurate way to characterize performance
- require considerable work (porting)
Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
- collects, standardizes, and distributes benchmark suites
- a consortium made up of industry leaders
- SPEC CPU (CPU-intensive benchmarks): SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006
- other SPEC suites: SPECjvm, SPECmail, SPECweb, SPEComp
other benchmark suites: e.g., TPC-C and TPC-H for databases

SPEC CPU2006
12 integer programs (C, C++):
- gcc (compiler), perl (interpreter), hmmer (hidden Markov models)
- bzip2 (compression), go (AI), sjeng (chess AI)
- libquantum (physics), h264ref (video)
- omnetpp (simulation), astar (path-finding algorithms)
- xalancbmk (XML processing), mcf (network optimization)

17 floating-point programs (C, C++, Fortran):
- fluid dynamics: bwaves, leslie3d, lbm
- quantum chemistry: gamess, tonto
- physics: milc, zeusmp, cactusADM
- biochemistry / molecular dynamics: gromacs, namd
- dealII (finite element analysis), soplex (linear programming), povray (ray tracing)
- calculix (mechanics), GemsFDTD (computational E&M)
- wrf (weather), sphinx3 (speech recognition)

Benchmarking Pitfalls
benchmark properties mismatched with the features studied
- e.g., using SPEC for large cache studies
careless scaling
- using only the first few million instructions (the initialization phase)
- reducing program data size
choosing benchmarks from the wrong application space
- e.g., in a real-time environment, choosing gcc
using old benchmarks
- "benchmark specials": benchmark-specific optimizations
benchmarks must be continuously maintained and updated!

Reporting Average Performance


averages: one of the things architects frequently get wrong
+ pay attention now and you won't get them wrong
important things about averages (i.e., means):
- ideally proportional to execution time (the ultimate metric)
- Arithmetic Mean (AM) for times
- Harmonic Mean (HM) for rates (IPCs)
- Geometric Mean (GM) for ratios (speedups)
- there is no such thing as "the average program"
- use an average only when absolutely necessary

What Does The Mean Mean?


arithmetic mean (AM): averages execution times of N programs
- AM = (1/N) x sum over i of time(i)
harmonic mean (HM): averages IPCs of N programs
- the arithmetic mean cannot be used for rates (e.g., IPCs)
  - 1 mile at 30 MPH + 1 mile at 90 MPH != 60 MPH average (it is 45 MPH)
- HM = N / (sum over i of 1/rate(i))
geometric mean (GM): averages speedups of N programs
- GM = (product over i of speedup(i))^(1/N)
what if programs run at different frequencies within the workload? weighting:
- weighted AM = sum over i of w(i) x time(i), where the weights w(i) sum to 1
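The three means can be sketched in a few lines; the MPH example from the text shows why the arithmetic mean is the wrong tool for rates:

```python
from math import prod

def am(times):
    """Arithmetic mean: appropriate for execution times."""
    return sum(times) / len(times)

def hm(rates):
    """Harmonic mean: appropriate for rates such as IPC or MPH."""
    return len(rates) / sum(1 / r for r in rates)

def gm(ratios):
    """Geometric mean: appropriate for ratios such as speedups."""
    return prod(ratios) ** (1 / len(ratios))

# 1 mile at 30 MPH plus 1 mile at 90 MPH:
print(am([30, 90]))  # 60.0 -- the wrong answer for a rate
print(hm([30, 90]))  # ~45  -- the correct average speed (2 miles in 8/3 minutes)
```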

GM Weirdness
what about averaging ratios (speedups)?
- AM and HM change depending on which machine is the base:

              machine A   machine B   time(B)/time(A)   time(A)/time(B)
  program 1       1           10            10                0.1
  program 2     1000         100            0.1               10

- AM: (10 + 0.1)/2 = 5.05 and (0.1 + 10)/2 = 5.05
  - "B is 5.05 times faster!" and "A is 5.05 times faster!" -- a contradiction
- HM: 2/(1/10 + 1/0.1) = 0.198 either way -- the same base-dependence problem
- GM: sqrt(10 x 0.1) = 1 = sqrt(0.1 x 10) -- consistent regardless of base, but...

geometric mean of ratios is not proportional to total time!
- by total execution time (1001 vs. 110), B is 9.1 times faster
- GM says they are equal
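The two-machine example can be replayed directly, using the times from the table above:

```python
from math import prod

times_a = [1, 1000]  # machine A's times on programs 1 and 2
times_b = [10, 100]  # machine B's times on programs 1 and 2

ratios_b_over_a = [tb / ta for ta, tb in zip(times_a, times_b)]  # [10.0, 0.1]
ratios_a_over_b = [ta / tb for ta, tb in zip(times_a, times_b)]  # [0.1, 10.0]

am = lambda xs: sum(xs) / len(xs)
gm = lambda xs: prod(xs) ** (1 / len(xs))

print(am(ratios_b_over_a), am(ratios_a_over_b))  # ~5.05 both ways: contradictory
print(gm(ratios_b_over_a), gm(ratios_a_over_b))  # 1.0 both ways: consistent...
print(sum(times_a) / sum(times_b))               # ~9.1: ...but by total time B is much faster
```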

Amdahl's Law
"Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities", G. Amdahl, AFIPS, 1967
let an optimization speed up fraction f of a program by factor s:

speedup = old / ([(1 - f) x old] + (f/s) x old) = 1 / (1 - f + f/s)

- f = 95%, s = 1.1: 1/[(1 - 0.95) + 0.95/1.1] = 1.094
- f = 5%, s = 10: 1/[(1 - 0.05) + 0.05/10] = 1.047
- f = 5%, s = infinity: 1/[(1 - 0.05) + 0.05/infinity] = 1.052
- f = 95%, s = infinity: 1/[(1 - 0.95) + 0.95/infinity] = 20

make the common case fast, but...
...the uncommon case eventually limits performance
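The four worked cases above can be reproduced with a one-line function:

```python
def amdahl_speedup(f, s):
    """Overall speedup when fraction f of execution time is sped up by factor s."""
    return 1 / ((1 - f) + f / s)

print(amdahl_speedup(0.95, 1.1))           # ~1.094
print(amdahl_speedup(0.05, 10))            # ~1.047
print(amdahl_speedup(0.05, float("inf")))  # ~1.053
print(amdahl_speedup(0.95, float("inf")))  # ~20: even infinite s on 95% caps speedup at 20
```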

Little's Law
key relationship between latency and bandwidth:
  average number in system = arrival rate x mean holding time
"possibly the most useful equation I know"
useful in the design of computers, software, industrial processes, etc.
example: how big a wine cellar should we build?
- we drink (and buy) an average of 2 bottles per week
- on average, we want to age the wine for 5 years
- bottles in cellar = 2 bottles/week x 52 weeks/year x 5 years = 520 bottles
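The wine-cellar example, as a sketch (units of the rate and the holding time must match):

```python
def little_avg_in_system(arrival_rate, mean_holding_time):
    """Little's Law: average number in system = arrival rate x mean holding time."""
    return arrival_rate * mean_holding_time

# 2 bottles/week arriving, each held 5 years = 5 * 52 weeks.
print(little_avg_in_system(2, 5 * 52))  # 520 bottles
```

The same relation sizes queues and buffers in hardware: e.g., the number of in-flight memory requests a core must track equals the request rate times the average memory latency.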

System Balance
each system component produces and consumes data
make sure data supply and demand are balanced:
- if demand for X >= supply of X, the computation is X-bound
  - e.g., memory-bound, CPU-bound, I/O-bound
- goal: be bound everywhere at once (why? a component that is never the bottleneck is over-provisioned)
- X can be bandwidth or latency
  - X is bandwidth: buy more bandwidth
  - X is latency: a much tougher problem

Tradeoffs
"Bandwidth problems can be solved with money. Latency problems are harder, because the speed of light is fixed and you can't bribe God." -- David Clark
well... some latency problems can be converted into bandwidth problems, and those can be solved with money
- the famous bandwidth/latency tradeoff
architecture is the art of making tradeoffs

Bursty Behavior
Q: to sustain 2 IPC, how many instructions should the processor be able to fetch per cycle? execute per cycle? complete per cycle?
A: NOT 2 (more than 2)
- dependences will cause stalls (under-utilization)
- if the desired sustained performance is X, peak performance must be > X
- programs don't always obey average behavior
- you can't design a processor to handle only average behavior
