CAO Fall 2024 Lecture 06 Design Metrics Performance Evaluation
Lecture # 06
Design Metrics and CPU Performance Evaluation
Muhammad Imran
Acknowledgement
▪ Throughput
▪ Amount of data processed per unit time (or per clock cycle)
▪ Bits per cycle or bits per second
▪ Tasks executed per unit time
▪ Instructions per second, Instructions per cycle etc.
▪ Latency
▪ Time to process a single task
▪ Number of cycles or seconds
▪ Timing
▪ Defined by the logic delays between sequential elements
▪ Clock period, frequency
Example …
[Figure: 8-bit input → register (D/Q) → combinational logic → register (D/Q) → combinational logic → register (D/Q) → 8-bit output; all registers share one clock.]
▪ Throughput?
▪ (Bits per output sample / time between two output samples)
▪ 8 bits/cycle; if 1 cycle = 10 ns, throughput = 8 bits / 10 ns = 800 Mbit/s
▪ Throughput can also be expressed as 1 task (sample) per cycle!
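The throughput arithmetic above can be checked with a short Python sketch. The function name `throughput_bits_per_second` and its parameters are illustrative and not part of the lecture.

```python
def throughput_bits_per_second(bits_per_sample: int, sample_period_s: float) -> float:
    """Bits per output sample divided by the time between two output samples."""
    return bits_per_sample / sample_period_s


# One 8-bit output sample every 10 ns clock cycle, as in the example
rate = throughput_bits_per_second(8, 10e-9)
print(f"{rate / 1e6:.0f} Mbit/s")
```

Expressed per cycle instead of per second, the same pipeline delivers 1 sample (8 bits) per cycle.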
Example …
▪ Latency?
▪ Time to complete one task / sample
▪ 3 clock cycles, if 1 cycle = 10 ns, latency = 30 ns
Example …
▪ Timing?
▪ Minimum clock period = tclk2q + longest combinational logic delay + ts (setup time)
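As a rough illustration of the timing relation, the sketch below computes the minimum clock period, the corresponding maximum frequency, and the latency of the three-register example. All delay values are hypothetical and chosen only so that the period comes out at the 10 ns used in the example; the function name is likewise an assumption.

```python
def min_clock_period_ns(t_clk2q_ns, logic_delays_ns, t_setup_ns):
    """Clock-to-Q delay + longest combinational delay + setup time."""
    return t_clk2q_ns + max(logic_delays_ns) + t_setup_ns


# Hypothetical delays in nanoseconds (not taken from the lecture)
period_ns = min_clock_period_ns(t_clk2q_ns=1.0, logic_delays_ns=[6.5, 8.0], t_setup_ns=1.0)
f_max_mhz = 1e3 / period_ns        # 1 / period, converted from ns to MHz
latency_ns = 3 * period_ns         # three register stages in the example

print(period_ns, f_max_mhz, latency_ns)
```

Only the slower of the two logic blocks matters for the clock period; the faster block simply has slack.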
Design Tradeoffs: Multicycle Design
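Xpower = 1;
for (i = 0; i < 3; i++)
    Xpower = X * Xpower;
[Figure: multicycle datapath. A single 8-bit multiplier takes X[7:0] and the output of a mux controlled by Start, which selects between an initial value and the fed-back register output; an 8-bit register clocked by clk holds Xpower[7:0].]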
▪ Throughput?
▪ 1 sample or task / 3 cycles
▪ 8 bits / 3 cycles ≈ 2.67 bits per cycle
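A minimal cycle-level model of the multicycle design, assuming the register is 8 bits wide so that products are truncated to 8 bits (the truncation and the function name `multicycle_xpower` are assumptions, not stated on the slide):

```python
def multicycle_xpower(x: int, iterations: int = 3, width: int = 8):
    """One multiply per clock cycle; the result register is fed back to the multiplier."""
    mask = (1 << width) - 1
    xpower = 1                       # value selected through the mux when Start is asserted
    cycles = 0
    for _ in range(iterations):
        xpower = (x * xpower) & mask  # one pass through the multiplier, kept to `width` bits
        cycles += 1                   # the register captures the product on the clock edge
    return xpower, cycles


result, cycles = multicycle_xpower(3)
bits_per_cycle = 8 / cycles
print(result, cycles, round(bits_per_cycle, 2))
```

Because the single multiplier is reused, a new sample can only be accepted once every three cycles, which is where the 1 sample / 3 cycles figure comes from.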
Design Tradeoffs: Multicycle Design
▪ Latency?
▪ 3 clock cycles
Design Tradeoffs: Multicycle Design
▪ Clock Timing?
▪ Clock period = tclk2q + 1 multiplier delay + 1 mux delay + ts
Design Tradeoffs: Pipelining
Xpower = 1;
for (i = 0; i < 3; i++)
    Xpower = X * Xpower;
[Figure: the multicycle datapath (single multiplier, Start-controlled mux, feedback register) shown together with its pipelined version, in which the loop is unrolled into cascaded multipliers separated by 8-bit pipeline registers clocked by clk, producing Xpower[7:0].]
▪ Cost of pipelining?
▪ More area! (Additional registers + Multiplier)
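To make the trade-off concrete, here is a small cycle-level sketch of one plausible pipelined arrangement: an input register, a middle register pair holding X and X², and an output register. This layout is an assumption inferred from the diagram, and the function name and the 8-bit truncation are illustrative.

```python
def pipelined_xpower(samples, width=8):
    """Cycle-level model of a 3-register pipeline computing X**3 (assumed layout)."""
    mask = (1 << width) - 1
    r_in = None    # input register: X
    r_mid = None   # middle registers: (X, X*X)
    r_out = None   # output register: X*X*X
    outputs = []
    # Feed the samples, then two empty slots (None) to drain the pipeline
    for x in list(samples) + [None, None]:
        # Next-state values are computed from the current registers,
        # then all registers update together on the clock edge.
        next_out = None if r_mid is None else (r_mid[0] * r_mid[1]) & mask
        next_mid = None if r_in is None else (r_in, (r_in * r_in) & mask)
        r_in, r_mid, r_out = x, next_mid, next_out
        outputs.append(r_out)
    return outputs


print(pipelined_xpower([2, 3, 4]))
```

In this model the first result appears on the third clock edge and one result follows on every edge after that. The price is the second multiplier and the extra registers, which is the area cost noted above.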
Design Tradeoffs: Single Cycle Design
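[Figure: single-cycle datapath computing Xpower[7:0] from X[7:0] with cascaded multipliers and no intermediate registers.]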
▪ Throughput?
▪ 1 sample per cycle or 8 bits per cycle!
▪ Latency?
▪ 1 cycle (low latency!)
▪ Timing?
▪ Clock period = tclk2q + 2 multiplier delays + ts
▪ Slower clock may undermine low latency!
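The effect of the slower clock can be estimated by converting latency from cycles to time. The delay values below are hypothetical, picked only for illustration; the clock-period expressions follow the ones given for the multicycle and single-cycle designs.

```python
# Hypothetical delays in ns (assumptions for illustration, not lecture data)
T_CLK2Q, T_SETUP, T_MULT, T_MUX = 1.0, 1.0, 8.0, 0.5

designs = {
    # name: (clock period in ns, latency in cycles)
    "multicycle": (T_CLK2Q + T_MULT + T_MUX + T_SETUP, 3),
    "single cycle": (T_CLK2Q + 2 * T_MULT + T_SETUP, 1),
}

latency_ns = {name: period * cycles for name, (period, cycles) in designs.items()}

for name, (period, cycles) in designs.items():
    print(f"{name}: period = {period} ns, latency = {cycles} cycle(s) = {latency_ns[name]} ns")
```

With these numbers the single-cycle design is 3× better in cycles but only about 1.75× better in nanoseconds, because its clock period is nearly twice as long.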
How do we evaluate computers?
Defining Performance
▪ Cruising Speed
▪ How fast a single task can be executed …
▪ How many passengers are transported in a given time?
▪ That’s throughput …
In a similar manner, computers may be evaluated for
several parameters …
Execution Time vs Throughput
▪ Desktop Computer
▪ How fast does it execute a program?
▪ Parameter of interest is execution time / response time
▪ To improve performance → reduce execution time!
▪ Server / Datacenter Computers
▪ How many tasks / jobs are executed in a given time?
▪ Focus is throughput / bandwidth!
▪ To improve performance → enhance throughput!
▪ For single-core systems
▪ Performance = 1 / Execution Time
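A short sketch of how this definition is typically used to compare two machines. The execution times are hypothetical and the helper names `performance` and `speedup` are illustrative.

```python
def performance(execution_time_s: float) -> float:
    """Performance is the reciprocal of execution time."""
    return 1.0 / execution_time_s


def speedup(time_a_s: float, time_b_s: float) -> float:
    """How many times faster A is than B: Performance_A / Performance_B."""
    return performance(time_a_s) / performance(time_b_s)


# Hypothetical: machine A runs a program in 10 s, machine B in 15 s
ratio = speedup(10.0, 15.0)
print(f"A is {ratio:.2f}x faster than B")
```

The ratio of performances equals the inverse ratio of execution times, so a shorter execution time always means higher performance.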
Execution Time vs Throughput
▪ Solution (a)
▪ Instructions per second = instructions per cycle × cycles per second
▪ Instructions per second for P1 = (1/1.5) × 3 GHz = 2G instructions/s
▪ Instructions per second for P2 = (1/1.0) × 2.5 GHz = 2.5G instructions/s
▪ Instructions per second for P3 = (1/2.2) × 4 GHz ≈ 1.818G instructions/s
▪ P2 has the highest performance!
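The same comparison can be reproduced in a few lines of Python, using the clock rates and CPI values from Solution (a). The dictionary layout and function name are illustrative.

```python
# (clock rate in Hz, CPI) for each processor, as used in Solution (a)
processors = {"P1": (3.0e9, 1.5), "P2": (2.5e9, 1.0), "P3": (4.0e9, 2.2)}


def instructions_per_second(clock_hz: float, cpi: float) -> float:
    """Instructions per cycle (1 / CPI) times cycles per second."""
    return clock_hz / cpi


ips = {name: instructions_per_second(f, cpi) for name, (f, cpi) in processors.items()}
fastest = max(ips, key=ips.get)

for name, value in ips.items():
    print(f"{name}: {value / 1e9:.3f} G instructions/s")
print("Highest performance:", fastest)
```

Note that P3 has the highest clock rate but the lowest instruction rate, because its CPI is also the highest.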
Exercise 1
▪ Consider three different processors P1, P2, and P3 executing the same
instruction set. P1 has a 3 GHz clock rate and a CPI of 1.5. P2 has a 2.5
GHz clock rate and a CPI of 1.0. P3 has a 4.0 GHz clock rate and has a
CPI of 2.2.
b. If the processors each execute a program in 10 seconds, find the number
of cycles and the number of instructions.
▪ Solution (b)
▪ Execution time = Instruction count × CPI × Clock cycle time
▪ Number of cycles = Execution Time × Clock Rate
▪ Number of cycles for P1 = 10 s × 3 GHz = 30G cycles
▪ Number of cycles for P2 = 10 s × 2.5 GHz = 25G cycles
▪ Number of cycles for P3 = 10 s × 4 GHz = 40G cycles
▪ Instruction count = (Execution Time × Clock Rate) / CPI
▪ Instruction count for P1 = (10 s × 3 GHz) / 1.5 = 20G instructions
▪ Instruction count for P2 = (10 s × 2.5 GHz) / 1.0 = 25G instructions
▪ Instruction count for P3 = (10 s × 4 GHz) / 2.2 ≈ 18.18G instructions
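A quick numerical check of Solution (b), written as a sketch with illustrative function names:

```python
# (clock rate in Hz, CPI) for each processor, from the exercise statement
processors = {"P1": (3.0e9, 1.5), "P2": (2.5e9, 1.0), "P3": (4.0e9, 2.2)}
EXECUTION_TIME_S = 10.0


def cycle_count(exec_time_s: float, clock_hz: float) -> float:
    """Number of cycles = execution time × clock rate."""
    return exec_time_s * clock_hz


def instruction_count(exec_time_s: float, clock_hz: float, cpi: float) -> float:
    """Instruction count = number of cycles / CPI."""
    return cycle_count(exec_time_s, clock_hz) / cpi


for name, (f, cpi) in processors.items():
    c = cycle_count(EXECUTION_TIME_S, f)
    n = instruction_count(EXECUTION_TIME_S, f, cpi)
    print(f"{name}: {c / 1e9:.0f}G cycles, {n / 1e9:.2f}G instructions")
```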
Exercise 1
▪ Solution (c)
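▪ Let n be the factor by which the clock rate increases, so the clock cycle time becomes Clock Cycle Time / n
▪ 0.7 × CPU Time = Instruction Count × (1.2 × CPI) × (Clock Cycle Time / n)
▪ Since CPU Time = Instruction Count × CPI × Clock Cycle Time: 1.2 / n = 0.7 → n = 1.2 / 0.7 ≈ 1.714
▪ The clock rate (for any processor) must be increased by a factor of 1.714 (about 71%) to
achieve a 30% reduction in execution time when the CPI rises by 20%!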
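The scaling factor can be verified numerically. The sketch below assumes the instruction count stays the same and uses the clock rates stated in the exercise; the function name is illustrative.

```python
def clock_rate_factor(cpi_factor: float, time_factor: float) -> float:
    """Solve IC × (cpi_factor × CPI) × (T / n) = time_factor × IC × CPI × T for n."""
    return cpi_factor / time_factor


n = clock_rate_factor(cpi_factor=1.2, time_factor=0.7)

# Clock rates from the exercise statement, in GHz
clock_ghz = {"P1": 3.0, "P2": 2.5, "P3": 4.0}
new_clock_ghz = {name: f * n for name, f in clock_ghz.items()}

print(f"n = {n:.3f}")
for name, f in new_clock_ghz.items():
    print(f"{name}: {f:.2f} GHz")
```

Because n depends only on the two ratios, the same factor applies to all three processors regardless of their original clock rate or CPI.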
Exercise 2