CAO Fall 2024 Lecture 06 Design Metrics Performance Evaluation

Uploaded by

Omair Siddique

EE-321 Fall 2024

Computer Architecture and Organization

Lecture # 06
Design Metrics and CPU Performance Evaluation

Muhammad Imran
Acknowledgement
▪ Content from the following sources has been used in these lectures:
▪ Computer Organization and Design, RISC-V 2nd Edition, Patterson and Hennessy
▪ Computer Organization and Design, RISC-V 1st Edition, Patterson and Hennessy
Contents
▪ Design Metrics and Design Tradeoffs


▪ Throughput
▪ Latency
▪ Timing
▪ Evaluating Computing Performance
Design Metrics and Design Tradeoffs
Measuring Speed of a Design
▪ Throughput
▪ Amount of data processed per clock cycle
▪ Bits per cycle or bits per second
▪ Tasks executed per unit time
▪ Instructions per second, Instructions per cycle etc.

▪ Latency
▪ Time to process a single task
▪ Number of cycles or seconds

▪ Timing
▪ Defined by the logic delays between sequential elements
▪ Clock period, frequency
Example …
[Figure: an 8-bit input passes through three D flip-flop register stages (D/Q) separated by two combinational logic blocks and leaves as an 8-bit output.]

▪ Throughput?
▪ (Bits per output sample / time between two output samples)
▪ 8 bits/cycle; if 1 cycle = 10 ns, throughput = 8 bits / 10 ns = 800 Mbit/s
▪ Throughput can also be expressed as 1 task (sample) per cycle!
Example …
[Figure: same three-register pipeline as in the previous example.]

▪ Latency?
▪ Time to complete one task / sample
▪ 3 clock cycles, if 1 cycle = 10 ns, latency = 30 ns
Example …
[Figure: same three-register pipeline as in the previous example.]

▪ Timing?
▪ Minimum clock period = tclk2q + longest combinational logic delay + ts (register clock-to-Q delay, critical-path logic delay, and setup time)
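The three metrics for the example pipeline can be checked with a short calculation. The following is a minimal Python sketch: the 8-bit width, 3-cycle latency, and 10 ns period come from the example, while the individual clock-to-Q, logic, and setup delays are made-up values chosen only so that the critical path adds up to 10 ns.

```python
# Metrics for the three-register, 8-bit pipeline in the example.
# The 8-bit width and 3-cycle latency come from the slides;
# every individual delay value below is hypothetical.

BITS_PER_SAMPLE = 8
PIPELINE_CYCLES = 3            # cycles for one sample to reach the output

# Hypothetical delays in nanoseconds (assumed, not from the lecture).
T_CLK2Q_NS = 1.0               # register clock-to-Q delay
T_LOGIC_NS = [8.0, 6.5]        # delays of the two combinational blocks
T_SETUP_NS = 1.0               # register setup time


def min_clock_period_ns(t_clk2q, logic_delays, t_setup):
    """The clock period is set by the slowest register-to-register path."""
    return t_clk2q + max(logic_delays) + t_setup


period_ns = min_clock_period_ns(T_CLK2Q_NS, T_LOGIC_NS, T_SETUP_NS)
throughput_mbps = BITS_PER_SAMPLE / period_ns * 1e3   # bits/ns -> Mbit/s
latency_ns = PIPELINE_CYCLES * period_ns

print(f"clock period = {period_ns:.1f} ns")
print(f"throughput   = {throughput_mbps:.1f} Mbit/s")
print(f"latency      = {latency_ns:.1f} ns")
```

Only the longest logic block matters for the clock period; shortening the 6.5 ns block in this sketch would change nothing.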
Design Tradeoffs: Multicycle Design
Xpower = 1;
for (i = 0; i < 3; i++)
    Xpower = X * Xpower;

[Figure: multicycle datapath with one 8-bit multiplier (×), a multiplexer controlled by Start, and one 8-bit register clocked by clk; input X[7:0], output Xpower[7:0].]

▪ Throughput?
▪ 1 sample or task / 3 cycles
▪ 8 bits / 3 cycles ≈ 2.7 bits per cycle
Design Tradeoffs: Multicycle Design
[Code and figure: same Xpower loop and multicycle datapath as on the previous slide.]

▪ Latency?
▪ 3 clock cycles
Design Tradeoffs: Multicycle Design
[Code and figure: same Xpower loop and multicycle datapath as on the previous slide.]

▪ Clock Timing?
▪ Clock period = tclk2q + 1 multiplier delay + 1 mux delay + ts
Design Tradeoffs: Pipelining
Xpower = 1;
for (i = 0; i < 3; i++)
    Xpower = X * Xpower;

[Figure: the multicycle datapath alongside its pipelined version. The pipelined version uses two multipliers (×) with 8-bit pipeline registers (D/Q) between stages, all clocked by clk; input X[7:0], output Xpower[7:0].]

▪ Throughput after pipelining?


▪ 1 task per cycle, 8 bits per cycle! (Improved!)
Design Tradeoffs: Pipelining
[Code and figure: same Xpower loop and pipelined datapath as on the previous slide.]

▪ Latency after pipelining?


▪ 3 cycles! (Same!)
Design Tradeoffs: Pipelining
[Code and figure: same Xpower loop and pipelined datapath as on the previous slide.]

▪ Timing after pipelining?


▪ Critical path still involves one multiplier delay (Same!)
Design Tradeoffs: Pipelining
[Code and figure: same Xpower loop and pipelined datapath as on the previous slide.]

▪ Cost of pipelining?
▪ More area! (Additional registers + Multiplier)
Design Tradeoffs: Single Cycle Design
[Figure: single-cycle datapath with two cascaded 8-bit multipliers (×); input X[7:0], output Xpower[7:0].]

▪ Throughput?
▪ 1 sample per cycle or 8 bits per cycle!
▪ Latency?
▪ 1 cycle (low latency!)
▪ Timing?
▪ Clock period = tclk2q + 2 multiplier delays + ts
▪ The slower clock may undermine the low latency!
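The tradeoff between the three Xpower designs is easier to see with numbers. The sketch below is illustrative only: the cycle counts and the clock-period formulas follow the slides, but every delay value is hypothetical, and the pipelined stage is assumed to contain one multiplier and no multiplexer.

```python
# Compare the three Xpower designs (multicycle, pipelined, single cycle).
# Cycle counts follow the slides; every delay value is hypothetical.

T_CLK2Q = 0.5   # ns, assumed register clock-to-Q delay
T_SETUP = 0.5   # ns, assumed register setup time
T_MULT = 6.0    # ns, assumed 8-bit multiplier delay
T_MUX = 1.0     # ns, assumed multiplexer delay

designs = {
    # name: (clock period in ns, latency in cycles, cycles between results)
    "multicycle":   (T_CLK2Q + T_MULT + T_MUX + T_SETUP, 3, 3),
    "pipelined":    (T_CLK2Q + T_MULT + T_SETUP,         3, 1),
    "single cycle": (T_CLK2Q + 2 * T_MULT + T_SETUP,     1, 1),
}


def summarize(period_ns, latency_cycles, cycles_per_result):
    """Return (latency in ns, results per microsecond)."""
    latency_ns = latency_cycles * period_ns
    results_per_us = 1e3 / (cycles_per_result * period_ns)
    return latency_ns, results_per_us


for name, params in designs.items():
    latency_ns, rate = summarize(*params)
    print(f"{name:12s} period={params[0]:5.1f} ns  "
          f"latency={latency_ns:5.1f} ns  throughput={rate:6.1f} results/us")
```

With these assumed delays, pipelining gives the highest throughput, while the single-cycle design keeps the lowest latency in nanoseconds but pays for it with a clock period almost twice as long.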
How do we evaluate computers?
Defining Performance
▪ Which airplane is fastest / best performing?
▪ Cruising Speed
▪ How fast a single task can be executed …
▪ How many passengers are transported in a given time?
▪ That’s throughput …
In a similar manner, computers may be evaluated on several parameters …
Execution Time vs Throughput
▪ Desktop Computer
▪ How fast does it execute a program?
▪ Parameter of interest is execution time / response time
▪ To improve performance → reduce execution time!
▪ Server / Datacenter Computers
▪ How many tasks / jobs are executed in a given time?
▪ Focus is throughput / bandwidth!
▪ To improve performance → enhance throughput!
▪ For single core systems
▪ Performance = 1 / Execution Time
Execution Time vs Throughput
▪ Throughput may impact response time
▪ Example: Given multiple jobs
▪ If we increase the number of cores
▪ Throughput increases!
▪ If jobs need to be queued (too many)
▪ More cores would also reduce the response time!
▪ If we make a single core faster
▪ Response time decreases and throughput increases, i.e., both improve!
CPU Execution Time
▪ Execution Time / Elapsed Time


▪ Time to run a program
▪ Includes I/O time, waiting time etc.
▪ CPU Execution Time / CPU Time
▪ Time spent by CPU in executing the program
▪ Does not include I/O time or waiting time etc!
▪ Sub types:
▪ User CPU Time
▪ Time spent by CPU on program itself
▪ System CPU Time
▪ Time spent by OS on behalf of the program! (Not other programs!)
▪ CPU Performance refers to User CPU Time!
CPU Performance Factors
CPU Execution Time for a program = CPU Clock Cycles for a program × Clock Cycle Time

CPU Execution Time for a program = CPU Clock Cycles for a program / Clock Rate (frequency)

▪ Example
▪ Program runs in 10s on Computer A @ 2GHz
▪ On B, it will run in 6s at a faster clock but will take 1.2 times as many clock cycles!
▪ What would be the clock rate for Computer B?
CPU Performance Factors
▪ Solution
▪ Cycles for Computer A = 10s × 2GHz = 20G cycles
▪ Cycles for Computer B = 1.2 × 20G = 24G cycles
▪ Clock Rate for Computer B = Cycles / Execution Time = (24G cycles)/(6s) = 4 GHz
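The same steps can be written as a small calculation. This is a minimal Python sketch using only the numbers from the example; the function and variable names are illustrative.

```python
# Clock rate needed on Computer B, following the solution steps above.

def clock_cycles(exec_time_s, clock_rate_hz):
    """CPU clock cycles = execution time x clock rate."""
    return exec_time_s * clock_rate_hz


def clock_rate_hz(cycles, exec_time_s):
    """Clock rate = clock cycles / execution time."""
    return cycles / exec_time_s


cycles_a = clock_cycles(10, 2e9)       # Computer A: 10 s at 2 GHz
cycles_b = 1.2 * cycles_a              # B needs 1.2 times as many cycles
rate_b = clock_rate_hz(cycles_b, 6)    # B must finish in 6 s

print(f"Computer B clock rate = {rate_b / 1e9:.1f} GHz")
```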
Instruction Performance
▪ Execution time also depends on the number of instructions in a program!

CPU Clock Cycles for a program = Instructions for a program × Average Clock Cycles per Instruction (CPI)

▪ Average CPI is one way of comparing different implementations of the same ISA
▪ Given the same number of instructions per program
Instruction Performance

▪ Example: Comparing two implementations of same ISA


▪ Computer A has clock cycle of 250ps and CPI of 2.0 for a program
▪ Computer B has clock cycle of 500ps and CPI of 1.2 for same program
▪ Which one is faster for this program? How much?
▪ Solution
▪ Clock cycles for Computer A = I × 2
▪ Clock cycles for Computer B = I × 1.2
▪ CPU Time for A = Clock cycles × Cycle Time = I × 2 × 250ps = I × 500ps
▪ CPU Time for B = Clock cycles × Cycle Time = I × 1.2 × 500ps = I × 600ps
▪ CPU Time for B / CPU Time for A = 600/500 = 1.2, so Computer B takes 1.2 times as long, i.e., Computer A is 1.2 times faster for this program!
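Because both computers run the same instruction count I, the comparison reduces to the time per instruction. The following Python sketch, with illustrative names, reproduces the ratio from the solution.

```python
# Compare two implementations of the same ISA on one program.
# The instruction count I is the same on both, so it cancels in the ratio.

def time_per_instruction_ps(cpi, cycle_time_ps):
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time_ps


time_a = time_per_instruction_ps(2.0, 250)   # Computer A
time_b = time_per_instruction_ps(1.2, 500)   # Computer B
ratio = time_b / time_a                      # how many times longer B takes

faster = "A" if time_a < time_b else "B"
print(f"A: {time_a:.0f} ps/instr, B: {time_b:.0f} ps/instr")
print(f"Computer {faster} is faster by a factor of {max(ratio, 1 / ratio):.2f}")
```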
CPU Performance Equation
CPU Time = Instructions Count × CPI × Clock Cycle Time

CPU Time = (Instructions Count × CPI) / Clock Rate

Time = Seconds / Program = (Instructions / Program) × (Clock Cycles / Instruction) × (Seconds / Clock Cycle)
CPU Performance Equation
CPU Time = Instructions Count × CPI × Clock Cycle Time

Example: Comparing two code sequences!

Hardware specifications (by hardware designer): CPI for each instruction class

Instruction class   A   B   C
CPI                 1   2   3

Two alternative code sequences (options for compiler writer!): instruction counts for each instruction class

Code sequence       A   B   C
1                   2   1   2
2                   4   1   1

▪ Which code sequence executes the most instructions?
▪ What is the CPI of each code sequence?
▪ Which will execute faster?
▪ Solution
▪ Code Sequence 1 executes 2 + 1 + 2 = 5 instructions; Code Sequence 2 executes 4 + 1 + 1 = 6, i.e., the most
▪ Cycles for Code Sequence 1 = (2 × 1) + (1 × 2) + (2 × 3) = 10 cycles, CPI = 10/5 = 2
▪ Cycles for Code Sequence 2 = (4 × 1) + (1 × 2) + (1 × 3) = 9 cycles, CPI = 9/6 = 1.5 (lower)
▪ CPU Time for Code Sequence 1 = 10 × Clock Cycle Time
▪ CPU Time for Code Sequence 2 = 9 × Clock Cycle Time (faster)
▪ Fewer instructions do not always mean faster execution!
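The cycle counts and average CPIs in this solution follow directly from the two tables. A minimal Python sketch, with dictionary names chosen here for illustration:

```python
# Cycles and average CPI for the two code sequences in the example.

CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}

SEQUENCES = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}


def cycles_and_cpi(counts):
    """Return (instruction count, clock cycles, average CPI)."""
    instructions = sum(counts.values())
    cycles = sum(n * CPI_BY_CLASS[cls] for cls, n in counts.items())
    return instructions, cycles, cycles / instructions


for seq, counts in SEQUENCES.items():
    instructions, cycles, cpi = cycles_and_cpi(counts)
    print(f"sequence {seq}: {instructions} instructions, "
          f"{cycles} cycles, CPI = {cpi}")
```

Since both sequences run at the same clock cycle time, the one with fewer cycles finishes first, regardless of its instruction count.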
Knowledge Check!
▪ A given application written in Java runs in 15 seconds on a desktop processor. A new Java compiler is released that requires only 0.6 as many instructions as the old compiler. Unfortunately, it increases the CPI by a factor of 1.1. How fast can we expect the application to run using this new compiler?
▪ Solution
▪ CPU Time Old = 15s
▪ Instructions Count Old = I
▪ Instructions Count New = 0.6 × I
▪ CPI Old = C
▪ CPI New = 1.1 × C
▪ CPU Time Old = 15s = I × C × Clock Cycle Time
▪ CPU Time New = 0.6 × I × 1.1 × C × Clock Cycle Time = 0.66 × I × C × Clock Cycle Time
▪ CPU Time New = 0.66 × CPU Time Old = 0.66 × 15s = 9.9s
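Since the clock cycle time is unchanged, the new CPU time is just the old time scaled by the two ratios. A short Python sketch of that scaling (the function name is illustrative):

```python
# Scale an old CPU time by the changes in instruction count and CPI.
# The clock cycle time is assumed unchanged, as in the exercise.

def scaled_cpu_time(old_time_s, instruction_ratio, cpi_ratio):
    """CPU time is proportional to instruction count x CPI."""
    return old_time_s * instruction_ratio * cpi_ratio


new_time_s = scaled_cpu_time(15.0, instruction_ratio=0.6, cpi_ratio=1.1)
print(f"new CPU time = {new_time_s:.1f} s")
```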
Exercise 1
▪ Consider three different processors P1, P2, and P3 executing the same instruction set. P1 has a 3 GHz clock rate and a CPI of 1.5. P2 has a 2.5 GHz clock rate and a CPI of 1.0. P3 has a 4.0 GHz clock rate and a CPI of 2.2.
a. Which processor has the highest performance expressed in instructions per second?

▪ Solution (a)
▪ Instructions per second = instructions per cycle × cycles per second
▪ Instructions per second for P1 = (1/1.5) × 3GHz = 2G instructions/s
▪ Instructions per second for P2 = (1/1) × 2.5GHz = 2.5G instructions/s
▪ Instructions per second for P3 = (1/2.2) × 4GHz = 1.818G instructions/s
▪ P2 has highest performance!
Exercise 1
b. If the processors each execute a program in 10 seconds, find the number
of cycles and the number of instructions.
▪ Solution (b)
▪ Execution time = Instruction count × CPI × Clock cycle time
▪ Number of cycles = Execution Time × Clock Rate
▪ Number of cycles for P1 = 10s × 3GHz = 30G cycles
▪ Number of cycles for P2 = 10s × 2.5GHz = 25G cycles
▪ Number of cycles for P3 = 10s × 4GHz = 40G cycles
▪ Instructions count = (Execution Time × Clock Rate)/CPI
▪ Instructions count for P1 = (10s × 3G)/1.5 = 20G instructions
▪ Instructions count for P2 = (10s × 2.5G)/1.0 = 25G instructions
▪ Instructions count for P3 = (10s × 4G)/2.2 = 18.18G instructions
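Parts (a) and (b) can be reproduced together. The sketch below uses the clock rates, CPIs, and 10-second run time from the exercise; the data layout and function names are illustrative.

```python
# Exercise 1 (a) and (b): instruction rate, cycles, and instruction count.

PROCESSORS = {
    "P1": {"clock_hz": 3.0e9, "cpi": 1.5},
    "P2": {"clock_hz": 2.5e9, "cpi": 1.0},
    "P3": {"clock_hz": 4.0e9, "cpi": 2.2},
}
EXEC_TIME_S = 10.0


def instructions_per_second(clock_hz, cpi):
    """Instructions per second = clock rate / CPI."""
    return clock_hz / cpi


def cycles_and_instructions(clock_hz, cpi, exec_time_s):
    """Return (clock cycles, instruction count) for a given run time."""
    cycles = exec_time_s * clock_hz
    return cycles, cycles / cpi


for name, p in PROCESSORS.items():
    ips = instructions_per_second(p["clock_hz"], p["cpi"])
    cycles, instructions = cycles_and_instructions(
        p["clock_hz"], p["cpi"], EXEC_TIME_S)
    print(f"{name}: {ips / 1e9:.3f} G instr/s, "
          f"{cycles / 1e9:.0f} G cycles, {instructions / 1e9:.2f} G instructions")

fastest = max(PROCESSORS, key=lambda n: instructions_per_second(**PROCESSORS[n]))
print("highest instruction rate:", fastest)
```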
Exercise 1
c. We are trying to reduce the execution time by 30%, but this leads to
an increase of 20% in the CPI. What clock rate should we have to get
this time reduction?

▪ Solution (c)
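▪ New CPU Time = 0.7 × CPU Time, new CPI = 1.2 × CPI, instruction count unchanged
▪ 0.7 × CPU Time = Instructions Count × (1.2 × CPI) × (Clock Cycle Time / n), where n is the factor by which the clock rate increases
▪ 1.2 / n = 0.7 → n = 1.2/0.7 ≈ 1.714
▪ The clock rate (for any of the processors) must be increased by a factor of about 1.714 to achieve a 30% reduction in execution time, i.e., about 5.14 GHz for P1, 4.29 GHz for P2, and 6.86 GHz for P3!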
Exercise 2
▪ Consider two different implementations of the same instruction set architecture. The instructions can be divided into four classes according to their CPI (classes A, B, C, and D). P1 has a clock rate of 2.5 GHz and CPIs of 1, 2, 3, and 3; P2 has a clock rate of 3 GHz and CPIs of 2, 2, 2, and 2.
▪ Given a program with a dynamic instruction count of 1.0E6 instructions divided into classes as follows: 10% class A, 20% class B, 50% class C, and 20% class D
▪ Which is faster: P1 or P2?
▪ Solution
▪ CPU Time for P1 = (1e6) × (0.1 × 1 + 0.2 × 2 + 0.5 × 3 + 0.2 × 3) × (1/2.5GHz) = 1.04 × 10^-3 seconds
▪ CPU Time for P2 = (1e6) × (0.1 × 2 + 0.2 × 2 + 0.5 × 2 + 0.2 × 2) × (1/3GHz) ≈ 0.667 × 10^-3 seconds
▪ P2 is faster!
Exercise 2
▪ What is the global CPI for each implementation?
▪ Solution
▪ Global CPI for P1 = 0.1 × 1 + 0.2 × 2 + 0.5 × 3 + 0.2 × 3 = 2.6
▪ Global CPI for P2 = 0.1 × 2 + 0.2 × 2 + 0.5 ×2 + 0.2 × 2 = 2
Exercise 2
▪ Find the clock cycles required in both cases
▪ Solution
▪ Clock Cycles for P1 = 1e6 × 2.6 = 2.6M cycles
▪ Clock Cycles for P2 = 1e6 × 2 = 2M cycles
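All three parts of Exercise 2 (global CPI, clock cycles, and CPU time) come from one weighted sum. A minimal Python sketch using the exercise's numbers, with illustrative names:

```python
# Exercise 2: global CPI, clock cycles, and CPU time for P1 and P2.

INSTRUCTION_COUNT = 1.0e6
CLASS_MIX = {"A": 0.10, "B": 0.20, "C": 0.50, "D": 0.20}

IMPLEMENTATIONS = {
    "P1": {"clock_hz": 2.5e9, "cpi": {"A": 1, "B": 2, "C": 3, "D": 3}},
    "P2": {"clock_hz": 3.0e9, "cpi": {"A": 2, "B": 2, "C": 2, "D": 2}},
}


def global_cpi(cpi_by_class):
    """Average CPI weighted by the fraction of each instruction class."""
    return sum(CLASS_MIX[cls] * cpi_by_class[cls] for cls in CLASS_MIX)


def evaluate(impl):
    """Return (global CPI, clock cycles, CPU time in seconds)."""
    cpi = global_cpi(impl["cpi"])
    cycles = INSTRUCTION_COUNT * cpi
    return cpi, cycles, cycles / impl["clock_hz"]


for name, impl in IMPLEMENTATIONS.items():
    cpi, cycles, seconds = evaluate(impl)
    print(f"{name}: CPI = {cpi:.1f}, cycles = {cycles / 1e6:.1f}M, "
          f"CPU time = {seconds * 1e3:.3f} ms")
```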
MIPS as Performance Measure
▪ MIPS: Million instructions per second

MIPS = Instructions Count / (Execution Time × 10^6)

▪ MIPS is the rate of instruction execution
▪ i.e., for a fixed instruction count it is inversely proportional to execution time!
▪ Limitations
▪ Cannot compare computers with different ISAs as the instruction count and CPI would vary!
▪ Varies between programs for the same ISA!
▪ Alternatively, substituting Execution Time = Instructions Count × CPI / Clock Rate:

MIPS = Instructions Count / ((Instructions Count × CPI / Clock Rate) × 10^6) = Clock Rate / (CPI × 10^6)
MIPS vs Execution Time!
▪ For a given program, consider:

Measurement         Computer A   Computer B
Instruction Count   10 billion   8 billion
Clock rate          4 GHz        4 GHz
CPI                 1.0          1.1

▪ Which computer has the higher MIPS rating?
▪ Solution
▪ MIPS for Computer A = 4G/(1.0 × 10^6) = 4 × 10^3
▪ MIPS for Computer B = 4G/(1.1 × 10^6) ≈ 3.64 × 10^3
▪ Computer A has the higher MIPS rating!
MIPS vs Execution Time!
▪ Which one is faster?


▪ Solution
▪ Execution Time for A = (10G)/(4 × 10^3 × 10^6) = 2.5s
▪ Execution Time for B = (8G)/(3.64 × 10^3 × 10^6) ≈ 2.2s
▪ Computer B is faster despite having a lower MIPS rating!
Execution time is a more accurate measure of performance!
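The disagreement between the two metrics can be reproduced directly from the table. The Python sketch below computes execution time from the CPU performance equation rather than from the rounded MIPS value; names are illustrative.

```python
# MIPS rating versus execution time for Computers A and B.

COMPUTERS = {
    "A": {"instructions": 10e9, "clock_hz": 4e9, "cpi": 1.0},
    "B": {"instructions": 8e9,  "clock_hz": 4e9, "cpi": 1.1},
}


def mips(clock_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_hz / (cpi * 1e6)


def execution_time_s(instructions, clock_hz, cpi):
    """CPU time = instruction count x CPI / clock rate."""
    return instructions * cpi / clock_hz


for name, c in COMPUTERS.items():
    rating = mips(c["clock_hz"], c["cpi"])
    seconds = execution_time_s(c["instructions"], c["clock_hz"], c["cpi"])
    print(f"Computer {name}: {rating:.0f} MIPS, {seconds:.2f} s")
```

Computer B has the lower MIPS rating because of its higher CPI, yet it finishes first because it executes fewer instructions.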
Amdahl’s Law

▪ The performance enhancement possible with a given improvement is limited by the amount that the improved feature is used!

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

▪ Example
▪ Suppose a program runs in 100 seconds on a computer, with multiply operations responsible for 80 seconds of this time. How much do I have to improve the speed of multiplication if I want my program to run five times faster?
▪ Solution
▪ Target execution time = 100s / 5 = 20s; unaffected time = 100s - 80s = 20s
▪ 20 = (80/n) + 20 → 80/n = 0, which no finite n satisfies
▪ That is, there is no amount by which we can enhance multiply to achieve a fivefold increase in performance, if multiply accounts for only 80% of the workload.
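Amdahl's Law can be turned around to ask what improvement factor a target time requires. The following Python sketch does that for the multiply example; the fourfold target in the second call is a hypothetical case added here for contrast and is not part of the lecture.

```python
# Amdahl's Law for the multiply example: 80 s affected, 20 s unaffected.

def time_after_improvement(affected_s, unaffected_s, factor):
    """Execution time after speeding up the affected part by `factor`."""
    return affected_s / factor + unaffected_s


def required_factor(affected_s, unaffected_s, target_s):
    """Improvement factor needed to reach target_s, or None if unreachable."""
    remaining = target_s - unaffected_s
    if remaining <= 0:
        return None          # the unaffected part alone already uses the budget
    return affected_s / remaining


# Five times faster: target = 100 s / 5 = 20 s (the lecture's example).
print("5x target:", required_factor(80, 20, 100 / 5))

# Hypothetical contrast: four times faster, target = 100 s / 4 = 25 s.
print("4x target:", required_factor(80, 20, 100 / 4))
```

The first call returns None because the 20 seconds of unaffected time already equal the target, which is exactly why the fivefold speedup is out of reach.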
Homework!
▪ Problems for Practice
▪ Do practice for exam!
▪ Chapter 1 Exercise Problems
▪ 1.6, 1.7, 1.8, 1.14, 1.15
▪ Chapter 2 Exercise Problems
▪ 2.39 and 2.40
Relevant Reading
▪ Computer Organization and Design (RISC-V Edition), Patterson and Hennessy
▪ Chapter 1
▪ Sections 1.6, 1.8, 1.9 and 1.10!
