Performance: Latency
Topics: performance metrics, CPU performance equation, benchmarks and benchmarking,
reporting averages, Amdahl's Law, Little's Law, system balance and tradeoffs,
bursty behavior (average vs. peak performance)
Performance Metrics
latency: response time, execution time
  - good metric for a fixed amount of work (minimize time)
throughput: bandwidth, work per unit time
  - = (1 / latency) when there is NO overlap
  - > (1 / latency) when there is overlap
  - in real processors, there is always overlap (e.g., pipelining)
  - good metric for a fixed amount of time (maximize work)
comparing performance: A is N times faster than B iff
perf(A)/perf(B) = time(B)/time(A) = N
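A small sketch (my own illustration, not from the slides) of why overlap pushes throughput above 1/latency, assuming a 5-stage pipeline with a 1 ns cycle time:

```python
# Throughput with and without overlap, for a pipeline where each instruction
# has a 5-cycle (5 ns) latency but stages can work on different instructions.

def unpipelined_throughput(latency_ns: float) -> float:
    """Without overlap, throughput = 1 / latency (instructions per ns)."""
    return 1.0 / latency_ns

def pipelined_throughput(stages: int, cycle_ns: float, n_insns: int) -> float:
    """With overlap, one instruction completes per cycle once the pipe fills."""
    total_time = (stages + (n_insns - 1)) * cycle_ns
    return n_insns / total_time

latency = 5 * 1.0                           # 5 stages x 1 ns = 5 ns/instruction
print(unpipelined_throughput(latency))      # 0.2 instructions/ns
print(pipelined_throughput(5, 1.0, 1000))   # ~0.996 instructions/ns: > 1/latency
```

The longer the run of instructions, the closer pipelined throughput gets to one instruction per cycle, far above 1/latency.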
MIPS (millions of instructions per second)
  - may vary inversely with actual performance
  - particularly bad metric for multicore chips
MFLOPS: (FP ops / execution time) x 10^-6
  - like MIPS, but counts only FP operations
  - FP ops have longest latencies anyway (problem #1)
  - FP ops are the same across machines (problem #2)
CPU performance equation: time / program =
  (instructions / program) x (cycles / instruction) x (seconds / cycle)
  - instructions / program: dynamic instruction count
    mostly determined by program, compiler, ISA
  - cycles / instruction: CPI
    mostly determined by ISA and CPU/memory organization
  - seconds / cycle: cycle time, clock time, 1 / clock frequency
    mostly determined by technology and CPU organization
uses of the CPU performance equation
  - high-level performance comparisons
  - back-of-the-envelope calculations
  - helping architects think about compilers and technology
example: CISC (CPI = 8) vs. RISC (CPI = 2, but 2x the instructions), same cycle time T
  - CISC time = P x 8 x T = 8PT
  - RISC time = 2P x 2 x T = 4PT = CISC time / 2
the truth is much, much, much more complex
actual data from IBM AS/400 (CISC -> RISC in 1995):
  - CISC time = P x 7 x T = 7PT
  - RISC time = 3.1P x 3 x (T / 3.1) = 3PT (+1 technology generation)
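The back-of-the-envelope comparison above is easy to sanity-check with the CPU performance equation; the concrete values chosen here for P and T are arbitrary stand-ins:

```python
# CPU performance equation: time = instructions x CPI x cycle_time,
# applied to the hypothetical CISC/RISC comparison.

def exec_time(insns: float, cpi: float, cycle_time: float) -> float:
    return insns * cpi * cycle_time

P, T = 1e9, 1e-9                 # stand-in values: 1B instructions, 1 GHz clock
cisc = exec_time(P, 8, T)        # 8PT
risc = exec_time(2 * P, 2, T)    # 4PT
print(risc / cisc)               # 0.5: RISC time = CISC time / 2
```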
  + good for focusing on individual features, but not the big picture
  - over-emphasizes the target feature (for better or worse)
synthetic benchmarks: programs made up for benchmarking
  - e.g., Whetstone, Dhrystone
SPEC CPU2006
  - 12 integer programs (C, C++):
    gcc (compiler), perl (interpreter), hmmer (markov chain),
    bzip2 (compression), go (AI), sjeng (AI),
    libquantum (physics), h264ref (video),
    omnetpp (simulation), astar (path-finding),
    xalanc (XML processing), mcf (network optimization)
Benchmarking Pitfalls
benchmark properties mismatched with the features studied
  - e.g., using SPEC for large cache studies
careless scaling
  - using only the first few million instructions (initialization phase)
  - reducing program data size
Reporting Averages

there is no such thing as "the average program"
  - use averages only when absolutely necessary
arithmetic mean (AM): average latencies of N programs
  - AM = (1/N) x sum over i=1..N of time(i)
harmonic mean (HM): average rates of N programs
  - HM = N / (sum over i=1..N of 1/rate(i))
geometric mean (GM): average speedups of N programs
  - GM = (product over i=1..N of speedup(i))^(1/N)
what if programs run at different frequencies within the workload?
  - use weighting: weighted AM = sum over i=1..N of w(i) x time(i),
    with the weights w(i) summing to 1
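The averaging rules above can be written out directly; a minimal sketch (weights are assumed to sum to 1):

```python
# AM for latencies, HM for rates, GM for speedups, and a weighted AM.
import math

def am(times):
    return sum(times) / len(times)

def hm(rates):
    return len(rates) / sum(1.0 / r for r in rates)

def gm(speedups):
    return math.prod(speedups) ** (1.0 / len(speedups))

def weighted_am(weights, times):
    assert abs(sum(weights) - 1.0) < 1e-9   # weights must sum to 1
    return sum(w * t for w, t in zip(weights, times))

print(am([1, 3]))                          # 2.0
print(hm([10, 0.1]))                       # ~0.198
print(gm([10, 0.1]))                       # 1.0
print(weighted_am([0.25, 0.75], [4, 8]))   # 7.0
```

Note that Python's standard `statistics` module also provides `harmonic_mean` and `geometric_mean` if you prefer not to hand-roll these.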
GM Weirdness
what about averaging ratios (speedups)?
  - AM and HM give different answers depending on which machine is the base
  - GM does not

            time on A   time on B   ratio B/A   ratio A/B
Program 1        1          10          10          0.1
Program 2     1000         100         0.1          10

AM:  (10 + 0.1)/2 = 5.05         (0.1 + 10)/2 = 5.05
     "A is 5.05 times faster!"   "B is 5.05 times faster!"
HM:  2/(1/10 + 1/0.1) = 0.198    2/(1/0.1 + 1/10) = 0.198
     "B is 5.05 times faster!"   "A is 5.05 times faster!"
GM:  sqrt(10 x 0.1) = 1          sqrt(0.1 x 10) = 1
     same conclusion regardless of base
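The base-dependence in the table is quick to reproduce; a short sketch using the same two programs:

```python
# Averaging time ratios: AM gives a different "winner" depending on which
# machine is the base, while GM is consistent either way.
import math

a = [1, 1000]    # machine A times
b = [10, 100]    # machine B times

b_over_a = [tb / ta for ta, tb in zip(a, b)]   # [10.0, 0.1]
a_over_b = [ta / tb for ta, tb in zip(a, b)]   # [0.1, 10.0]

am = lambda xs: sum(xs) / len(xs)
gm = lambda xs: math.prod(xs) ** (1 / len(xs))

print(am(b_over_a), am(a_over_b))   # 5.05 5.05 -> contradictory conclusions
print(gm(b_over_a), gm(a_over_b))   # 1.0 1.0   -> consistent either way
```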
Amdahl's Law
"Validity of the Single-Processor Approach to Achieving Large-Scale
Computing Capabilities", G. Amdahl, AFIPS, 1967

let an optimization speed up fraction f of a program by factor s
  speedup = old / ([(1-f) x old] + [(f/s) x old]) = 1 / ((1-f) + f/s)
example: speeding up f = 95% of the program even infinitely (s -> infinity)
  yields only 1 / [(1-0.95) + (0.95/infinity)] = 1 / 0.05 = 20
make common case fast, but... ...uncommon case eventually limits performance
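Amdahl's Law is a one-liner in code; a minimal sketch:

```python
# Amdahl's Law: speedup = 1 / ((1 - f) + f/s), where fraction f of the
# program's execution is sped up by factor s.

def amdahl(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

print(amdahl(0.95, 10))             # ~6.9: a 10x optimization on 95% of the work
print(amdahl(0.95, float("inf")))   # ~20: the limit from the example above
```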
Little's Law
key relationship between latency and bandwidth:
  average number in system = arrival rate x mean holding time
"Possibly the most useful equation I know"
useful in the design of computers, software, industrial processes, etc.
example: how big a wine cellar should we build?
  - we drink (and buy) an average of 2 bottles per week
  - on average, we want to age the wine for 5 years
  - bottles in cellar = 2 bottles/week x 52 weeks/year x 5 years = 520 bottles
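The wine-cellar arithmetic, written as a tiny sketch (units must agree on both sides):

```python
# Little's Law: average number in system = arrival rate x mean holding time.

def littles_law(arrival_rate: float, holding_time: float) -> float:
    return arrival_rate * holding_time

bottles_per_week = 2
weeks_held = 52 * 5                                 # age each bottle 5 years
print(littles_law(bottles_per_week, weeks_held))    # 520 bottles
```

The same relationship sizes buffers, queues, and in-flight request counts: e.g., outstanding requests = request rate x average latency.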
System Balance
each system component produces & consumes data
make sure data supply and demand are balanced
Tradeoffs
"Bandwidth problems can be solved with money. Latency problems are harder,
because the speed of light is fixed and you can't bribe God." - David Clark
well... we can convert some latency problems into bandwidth problems
  - and solve those with money
  - the famous bandwidth/latency tradeoff
architecture is the art of making tradeoffs
Bursty Behavior
Q: to sustain 2 IPC, how many instructions must the processor be able to
   fetch per cycle? execute per cycle? complete per cycle?
A: NOT 2 (more than 2)
  - dependences will cause stalls (under-utilization)
  - if the desired performance is X, peak performance must be > X
  - programs don't always obey average behavior
  - can't design a processor to handle only average behavior
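A back-of-the-envelope sketch of this point (my own illustration, not from the slides; the stall fraction is an assumed parameter): if some fraction of cycles stall, the peak width must exceed the target sustained IPC.

```python
# If stalls idle the machine for stall_fraction of cycles, then
# sustained IPC = peak_width x (1 - stall_fraction); solve for peak_width.

def required_peak_width(target_ipc: float, stall_fraction: float) -> float:
    return target_ipc / (1.0 - stall_fraction)

print(required_peak_width(2.0, 0.0))   # 2.0: only with zero stalls
print(required_peak_width(2.0, 0.2))   # 2.5: 20% stall cycles need peak > 2
```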