0% found this document useful (0 votes)
73 views5 pages

This Unit: - ! Multicycle Datapath - ! Clock Vs CPI - ! CPU Performance Equation - ! Performance Metrics - ! Benchmarking

This document discusses performance metrics for CPUs. It introduces the concepts of multicycle datapaths, clock speed versus cycles per instruction (CPI), and the CPU performance equation. It explains that overall performance depends on the number of instructions, CPI, and clock cycle time. Different types of datapaths trade off these factors differently. The document also discusses benchmarking performance using standard workloads like SPEC, and how average performance metrics like mean and harmonic mean are calculated.

Uploaded by

avssrinivasav
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
73 views5 pages

This Unit: - ! Multicycle Datapath - ! Clock Vs CPI - ! CPU Performance Equation - ! Performance Metrics - ! Benchmarking

This document discusses performance metrics for CPUs. It introduces the concepts of multicycle datapaths, clock speed versus cycles per instruction (CPI), and the CPU performance equation. It explains that overall performance depends on the number of instructions, CPI, and clock cycle time. Different types of datapaths trade off these factors differently. The document also discusses benchmarking performance using standard workloads like SPEC, and how average performance metrics like mean and harmonic mean are calculated.

Uploaded by

avssrinivasav
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

CIS 371 (Roth/Martin): Performance & Multicycle 1

CIS 371
Computer Organization and Design
Unit 4: Performance & Multicycle
CIS 371 (Roth/Martin): Performance & Multicycle 2
This Unit
Multicycle datapath
Clock vs CPI
CPU performance equation
Performance metrics
Benchmarking
CPU Mem I/O
System software
App App App
CIS 371 (Roth/Martin): Performance & Multicycle 3
Readings
P&H
Chapter 1.4 (for performance discussion)
CIS 371 (Roth/Martin): Performance & Multicycle 4
240 ! 371
CIS 240: build something that works
CIS 371: build something that works well
well means high-performance but also cheap, low-power, etc.
Mostly high-performance
So, what is the performance of this?
What is performance?
PC
Insn
Mem
Register
File
s1 s2 d
Data
Mem
+
4
CIS 371 (Roth/Martin): Performance & Multicycle 5
CPU Performance Equation
Multiple aspects to performance: helps to isolate them
Latency = seconds / program =
(insns / program) * (cycles / insn) * (seconds / cycle)
Insns / program: dynamic insn count = f(program, compiler, ISA)
Cycles / insn: CPI = f(program, compiler, ISA, micro-arch)
Seconds / cycle: clock period = f(micro-arch, technology)
For low latency (better performance) minimize all three
Difficult: often pull against one another
Example we have seen: RISC vs. CISC ISAs
RISC: low CPI/clock period, high insn count
CISC: low insn count, high CPI/clock period
CIS 371 (Roth/Martin): Performance & Multicycle 6
MIPS (performance metric, not the ISA)
Factor out dynamic insn count, CPU equation becomes
Latency: seconds / insn = (cycles / insn) * (seconds / cycle)
Throughput: insns / second = (insns / cycle) * (cycles / second)
MIPS (millions of insns per second): insns / second * 10
-6
Cycles / second: clock frequency (in MHz)
Example: CPI = 2, clock = 500 MHz (2ns period), what is MIPS?
0.5 * 500 MHz * 10
-6
= 250 MIPS
MIPS is okay for micro-architects
Typically work in one ISA/one compiler, treat insn count as fixed
Not okay for general public
Processors with different ISAs/compilers have incomparable MIPS
Wait, it gets worse
CIS 371 (Roth/Martin): Performance & Multicycle 7
Mhz (MegaHertz) and Ghz (GigaHertz)
1 Hertz = 1 cycle per second
1 Ghz is 1 cycle per nanosecond, 1 Ghz = 1000 Mhz
Micro-architects often ignore instruction count
but general public (mostly) also ignores CPI
Equates clock frequency with performance!!
Which processor would you buy?
Processor A: CPI = 2, clock = 5 GHz
Processor B: CPI = 1, clock = 3 GHz
Probably A, but B is faster (assuming same ISA/compiler)
Classic example
800 MHz PentiumIII faster than 1 GHz Pentium4!
Same ISA and compiler!
Meta-point: danger of partial performance metrics!
CIS 371 (Roth/Martin): Performance & Multicycle 8
System Performance
Clock frequency implies processor core clock frequency
Other system components have their own clocks (or not)
E.g., increasing processor clock doesnt accelerate memory
Example
Processor A: CPI
CORE
= 1, CPI
MEM
= 1, proc. clock = 500 MHz (2ns)
What is the speedup if we double processor clock frequency?
Base: CPI = 2 ! IPC = 0.5 ! MIPS = 250
New: CPI = 3 ! IPC = 0.33 ! MIPS = 333
Clock *= 2 ! CPI
MEM
*= 2
Speedup = 333/250 = 1.33 << 2
What about an infinite clock frequency?
Only a 2X (factor of 2) speedup
Example of Amdahls Law
CIS 371 (Roth/Martin): Performance & Multicycle 9
Amdahls Law
Literally: total speedup limited by non-accelerated piece
Example: can optimize 50% of program A
Even magic optimization that makes this 50% disappear
only yields a 2X speedup
For consumers: buy a balanced system
For microarchitects: build a balanced system
MCCF (Make Common Case Fast)
Focus your efforts on things that matter
CIS 371 (Roth/Martin): Performance & Multicycle 10
Single-Cycle Datapath Performance
Goes against make common case fast (MCCF) principle
+ Low CPI: 1
Long clock period: to accommodate slowest instruction
Especially if multiply/divide are included
P
C
Insn
Mem
Register
File
S
X
s1 s2 d
Data
Mem
a
d
+
4
<<
2
Multi-cycle Operations
Lets allow long-latency operations take many cycles
Calculation assumptions:
Most instructions take 10 nanoseconds (ns)
But multiply instruction takes 40ns
Multiplies are 10% of all instructions
Single-cycle datapath: 40ns clock period, 1 CPI
40ns per instruction
1/40ns = 0.025 billion instructions per second = 25 MIPS
Multi-cycle datapath: 10ns clock period
Average CPI = (90% * 1) + (10% * 4) = 1.3
13ns per instruction
1/13ns = 0.77 billion instructions per second = 77 MIPS
Multi-cycle is 3 times (or 200%) faster than single-cycle
CIS 371 (Roth/Martin): Performance & Multicycle 11 CIS 371 (Roth/Martin): Performance & Multicycle 12
Fine-Grained Multi-Cycle Datapath
Multi-cycle datapath: attacks high clock period
Cut datapath into multiple stages (5 here), isolate using FFs
Finite state machine (FSM) control walks insns through
+ Insns can skip stages and exit early (memory ops vs alu ops)
P
C
Insn
Mem
Register
File
S
X
s1 s2 d
Data
Mem
a
d
+
4
<<
2
I
R
D O
B
A
s3
s3
s3
s4
s5
s5
s5
CIS 371 (Roth/Martin): Performance & Multicycle 13
Multi-Cycle Datapath Performance
Opposite performance split of single-cycle datapath
+ Short clock period
High CPI
P
C
Insn
Mem
Register
File
S
X
s1 s2 d
Data
Mem
a
d
+
4
<<
2
I
R
D O
B
A
Multicycle Performance
Assumptions
30% loads, 5ns
10% stores, 5ns
50% adds, 4ns
10% multiplies, 20ns
Single-cycle datapath: 20ns clock period, 1 CPI
20ns per instruction or 50 MIPS
Simple multi-cycle datapath: 5ns clock
CPI = (90% * 1) + (10% * 4) = 1.3
6.5ns per instruction or 153 MIPS
Fine-grained multi-cycle datapath: 1ns clock
CPI = (30% * 5) + (10% * 5) + (50% * 4) + (10% * 20) =
1.5 + 0.5 + 2 + 2 = 6 CPI
6ns per instruction or 166 MIPS
CIS 371 (Roth/Martin): Performance & Multicycle 14
CIS 371 (Roth/Martin): Performance & Multicycle 15
Processor Performance and Workloads
Q: what does performance of a chip mean?
A: nothing, there must be some associated workload
Workload: set of tasks someone (you) cares about
Benchmarks: standard workloads
Used to compare performance across machines
Either are or highly representative of actual programs people run
Micro-benchmarks: non-standard non-workloads
Tiny programs used to isolate certain aspects of performance
Not representative of complex behaviors of real applications
Examples: binary tree search, towers-of-hanoi, 8-queens, etc.
CIS 371 (Roth/Martin): Performance & Multicycle 16
SPEC Benchmarks
SPEC: Standard Performance Evaluation Corporation
http://www.spec.org/
Consortium that collects, standardizes, and distributes benchmarks
Suites for CPU, Java, I/O, Web, Mail, OpenMP (multithreaded), etc.
Updated every few years: so companies dont target benchmarks
Post SPECmark results for different processors
1 number that represents performance for entire suite
CPU 2006: 29 CPU-intensive C/C++/Fortran programs
integer: bzip2, gcc, perl, hmmer (genomics), h264, etc.
floating-point: wrf (weather), povray, sphynx3 (speech), etc.
TPC: Transaction Processing Council
Like SPEC, but for web/database server workloads
Much heavier on memory, I/O, network, than on CPU
Doesnt give you the source code, only a description
CIS 371 (Roth/Martin): Performance & Multicycle 17
SPECmark
Reference machine: Sun SPARC 10
Latency SPECmark
For each benchmark
Take odd number of samples: on both machines
Choose median
Take latency ratio (Sun SPARC 10 / your machine)
Take geometric mean of ratios over all benchmarks
Throughput SPECmark
Run multiple benchmarks in parallel on multiple-processor system
Recent SPECmark latency leaders
SPECint: Intel 2.3 GHz Core2 Extreme (3119)
SPECfp: IBM 2.1 GHz Power5+ (4051)
CIS 371 (Roth/Martin): Performance & Multicycle 18
Mean (Average) Performance Numbers
Arithmetic: (1/N) * !
P=1..N
Latency(P)
For units that are proportional to time (e.g., latency)
You can add latencies, but not throughputs
Latency(P1+P2,A) = Latency(P1,A) + Latency(P2,A)
Throughput(P1+P2,A) != Throughput(P1,A) + Throughput(P2,A)
1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour
Average is not 60 miles/hour
Harmonic: N / !
P=1..N
1/Throughput(P)
For units that are inversely proportional to time (e.g., throughput)
Geometric:
N
"#
P=1..N
Speedup(P)
For unitless quantities (e.g., speedups)
CIS 371 (Roth/Martin): Performance & Multicycle 19
How Can We Make Common Case Fast?
If we dont know what CC is?
How is CPI actually measured?
Execution time: time (Unix): wall clock / CPU + system
CPI = CPU time / (clock frequency * dynamic insn count)
How is dynamic insn count measured?
Hardware event counters
More useful is CPI breakdown (CPI
CPU
, CPI
MEM
, etc.)
So we know what performance problems are and what to fix
Hardware event counters:
+Accurate
Cant measure everything or evaluate modifications
Cycle-level micro-architecture simulation: e.g., SimpleScalar
+Measure exactly what you want, evaluate potential fixes
Burden of accuracy is on the simulator writer
CIS 371 (Roth/Martin): Performance & Multicycle 20
Summary
Multicycle datapath
Clock vs CPI
CPU performance equation
Performance metrics
Benchmarking
CPU Mem I/O
System software
App App App

You might also like