Week6 Performance Numericals

Uploaded by

Markhor Gaming

Week 6 & 7

Numerical Problems +
Midterm Review for CSA

Adapted mostly from: Prof. Onur Mutlu


ETH Zurich
Evaluating the Single-Cycle
Microarchitecture

A Single-Cycle Microarchitecture
◼ Is this a good idea/design?

◼ When is this a good design?

◼ When is this a bad design?

◼ How can we design a better microarchitecture?

Performance Analysis Basics
Processor Performance
◼ How fast is my program?
❑ Every program consists of a series of instructions
❑ Each instruction needs to be executed.
◼ So how fast are my instructions ?
❑ Instructions are realized on the hardware
❑ They can take one or more clock cycles to complete
❑ Cycles per Instruction = CPI
◼ How much time is one clock cycle?
❑ The critical path determines how much time one cycle
requires = clock period.
❑ 1/clock period = clock frequency = how many cycles can be
done each second.
Processor Performance
◼ Now as a general formula
❑ Our program consists of executing N instructions.
❑ Our processor needs CPI cycles for each instruction.
❑ The maximum clock speed of the processor is f,
and the clock period is therefore T=1/f
◼ Our program executes in
N x CPI x (1/f) =
N x CPI x T seconds
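
The formula above can be sketched in Python as follows. This is a minimal illustration, not part of the original slides; the function name `execution_time` and the sample numbers in the usage lines are assumptions chosen only to show the arithmetic.

```python
def execution_time(n_instructions: float, cpi: float, clock_hz: float) -> float:
    """Return program run time in seconds: N x CPI x T, where T = 1/f."""
    clock_period = 1.0 / clock_hz  # T = 1/f, in seconds
    return n_instructions * cpi * clock_period


# Assumed sample values: 1e9 instructions, CPI of 1.5, 2 GHz clock.
runtime = execution_time(1e9, 1.5, 2e9)
print(f"{runtime:.3f} s")
```

Halving the CPI or doubling the clock frequency halves the run time, which is why the later slides compare designs by the product CPI x T rather than by either factor alone.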
Performance Analysis Basics
◼ Execution time of an instruction
❑ {CPI} x {clock cycle time}
◼ CPI: Number of cycles it takes to execute an instruction

◼ Execution time (aka runtime) of a program


❑ Sum over all instructions [{CPI} x {clock cycle time}]
❑ {# of instructions} x {Average CPI} x {clock cycle time}

Performance Analysis of
Our Single-Cycle Design
A Single-Cycle Microarchitecture: Analysis
◼ Every instruction takes 1 cycle to execute
❑ CPI (Cycles per instruction) is strictly 1

◼ How long each instruction takes is determined by how long
the slowest instruction takes to execute
❑ Even though many instructions do not need that long to
execute

◼ Clock cycle time of the microarchitecture is determined by
how long it takes to complete the slowest instruction
❑ Critical path of the design is determined by the processing
time of the slowest instruction
What is the Slowest Instruction to Process?
◼ Let’s go back to the basics

◼ All six phases of the instruction processing cycle take a single
machine clock cycle to complete
❑ Fetch
❑ Decode
❑ Evaluate Address
❑ Fetch Operands
❑ Execute
❑ Store Result

◼ The same work expressed as five steps:
1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute/Evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store/writeback result (WB)

◼ Does each of the above phases take the same time (latency)
for all instructions?
A simplified view/model of SC processor
◼ Assumptions:
❑ Ignore mux delays
❑ Ignore single register delays (for PC)
❑ Ignore delay for control unit

Example Single-Cycle Datapath Analysis
◼ Assume (for the design in the previous slide)
❑ memory units (read or write): 200 ps
❑ ALU and adders: 100 ps
❑ register file (read or write): 50 ps
❑ other combinational logic: 0 ps
Per-instruction delay (ps); "-" means the step is not used:

Instr.   IF (mem)  ID (RF)  EX (ALU)  MEM (mem)  WB (RF)  Delay
R-type   200       50       100       -          50       400
I-type   200       50       100       -          50       400
LW       200       50       100       200        50       600
SW       200       50       100       200        -        550
Branch   200       50       100       -          -        350
Jump     200       -        -         -          -        200
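
The per-instruction delays in the table can be reproduced by summing the unit delays along each instruction's path. The sketch below is illustrative only: the names `UNIT_DELAY_PS`, `PATHS`, and `instruction_delay` are assumptions, while the delays and the resources used per instruction come from the slide above.

```python
# Unit delays in picoseconds, from the assumptions on this slide.
UNIT_DELAY_PS = {"mem": 200, "RF": 50, "ALU": 100}

# Resources used by each instruction class, in IF/ID/EX/MEM/WB order.
PATHS = {
    "R-type": ["mem", "RF", "ALU", "RF"],
    "I-type": ["mem", "RF", "ALU", "RF"],
    "LW":     ["mem", "RF", "ALU", "mem", "RF"],
    "SW":     ["mem", "RF", "ALU", "mem"],
    "Branch": ["mem", "RF", "ALU"],
    "Jump":   ["mem"],
}


def instruction_delay(name: str) -> int:
    """Sum the delays of every unit on the instruction's path."""
    return sum(UNIT_DELAY_PS[unit] for unit in PATHS[name])


delays = {name: instruction_delay(name) for name in PATHS}
slowest = max(delays, key=delays.get)   # instruction on the critical path
clock_period_ps = delays[slowest]       # single-cycle clock period

for name, delay in delays.items():
    print(f"{name:7s} {delay} ps")
print(f"critical path: {slowest}, clock period = {clock_period_ps} ps")
```

Because a single-cycle design has one fixed clock period, the maximum over all instruction delays, not the average, sets the cycle time.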
Let’s Find the Critical Path

[Figure: single-cycle MIPS datapath with control unit, instruction memory, register file, ALU, data memory, sign extend, and the branch/jump PC-select muxes (PCSrc1=Jump, PCSrc2=Br Taken). Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
R-Type and I-Type ALU

[Figure: single-cycle datapath with the R-type/I-type ALU path highlighted. Cumulative delay labels along the path: 200 ps after instruction memory, 250 ps after register read, 350 ps after the ALU, 400 ps at register write. The PC+4 and branch-target adders are each labeled 100 ps. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
LW

[Figure: single-cycle datapath with the LW path highlighted. Cumulative delay labels along the path: 200 ps after instruction memory, 250 ps after register read, 350 ps after the ALU, 550 ps after data memory, 600 ps at register write. The PC+4 and branch-target adders are each labeled 100 ps. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
SW

[Figure: single-cycle datapath with the SW path highlighted. Cumulative delay labels along the path: 200 ps after instruction memory, 250 ps after register read, 350 ps after the ALU, 550 ps at data memory write. The PC+4 and branch-target adders are each labeled 100 ps. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Branch Taken

[Figure: single-cycle datapath with the taken-branch path highlighted. Cumulative delay labels along the path: 200 ps after instruction memory, 250 ps after register read, 350 ps after the ALU comparison (bcond) and at the PC-select mux. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Jump

[Figure: single-cycle datapath with the jump path highlighted. The path is labeled 200 ps after instruction memory, which is the total delay for a jump; the PC+4 adder is labeled 100 ps. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Exec. Time for 100 billion instructions
◼ Example:
For a program with 100 billion instructions executing on a
single-cycle MIPS processor:

Execution Time = # instructions x CPI x Tc
= (100 × 10^9)(1)(600 × 10^-12 s)
= 100 x 600 x 10^-3 s
= 60 s
Single-Cycle vs. Multicycle

[Figure: two timing diagrams for Instr 1 to Instr 4, each comparing time needed with time allotted. Single-cycle: every instruction is allotted the same long clock period. Multicycle: the instructions take 3, 5, 3, and 4 short cycles respectively, and the difference is marked as time saved.]

Fig. Single-cycle versus multicycle instruction execution.
Performance of the Multicycle Processor

Class    Fraction  Cycles  Contribution to CPI
R-type   44%       4       0.44 × 4 = 1.76
Load     24%       5       0.24 × 5 = 1.20
Store    12%       4       0.12 × 4 = 0.48
Branch   18%       3       0.18 × 3 = 0.54
Jump     2%        3       0.02 × 3 = 0.06
_____________________________
Average CPI ≈ 4.04

[Figure: datapath units used by each instruction class (ALU-type, Load, Store, Branch (and jr), Jump (except jr & jal)), with the unused units marked "Not used".]

Note: ALU is not used in the last two cases here as separate
hardware exists for branch and jump address calculation,
which is not the case for our multicycle MIPS.
Multi-Cycle Performance: Average CPI
◼ Instructions take different number of cycles:
❑ 3 cycles: beq, j
❑ 4 cycles: R-Type, sw, addi
❑ 5 cycles: lw Realistic?
◼ CPI is weighted average, e.g. SPECINT2000 benchmark:
❑ 25% loads
❑ 10% stores
❑ 11% branches
❑ 2% jumps
❑ 52% R-type
❑ 0.25X

◼ Average CPI = (0.11 + 0.02) × 3 + (0.52 + 0.10) × 4 + (0.25) × 5
= 4.12
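
The weighted average can be checked with a short Python sketch. The dictionary names `MIX` and `CYCLES` and the function `average_cpi` are assumptions introduced for illustration; the fractions and cycle counts are the ones listed on this slide.

```python
# Instruction mix (fraction of executed instructions) from the slide.
MIX = {"lw": 0.25, "sw": 0.10, "beq": 0.11, "j": 0.02, "r_type": 0.52}

# Cycles per instruction class on the multi-cycle design.
CYCLES = {"lw": 5, "sw": 4, "beq": 3, "j": 3, "r_type": 4}


def average_cpi(mix: dict, cycles: dict) -> float:
    """Weighted average CPI: sum of (fraction x cycles) over all classes."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * cycles[name] for name, fraction in mix.items())


print(f"Average CPI = {average_cpi(MIX, CYCLES):.2f}")
```

The same function applies to any other mix, as long as every class in the mix has an entry in the cycle table.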
If we can’t ignore “smaller” delays
◼ Find
❑ the critical path delay of SC processor (or: find a lower bound
on the cycle time for the program counter)
❑ the time taken for 100 billion instructions to execute
LW path when muxes etc can’t be ignored

[Figure: single-cycle datapath with the LW path highlighted and mux, PC, and control delays included. Cumulative timing labels along the path, in ps: 5 (at the PC), 25, 40, 60, 80, 82, 97. The PC+4 path is labeled 20 and 24. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
LW path when muxes etc can’t be ignored
◼ Basically 2 muxes and CU operate in parallel to regfile
❑ CU is in parallel to regfile: CU finishes its work at 28 ps
while regfile finishes at 40 ps
❑ One mux output is stable at 27 ps; its output is used after
82 ps, so it is not part of the critical path
❑ Another mux output is stable at 29 ps, while the rs read will
complete at 40 ps
❑ In SW, immediate data is sent and not rt data

[Figure: same LW datapath as on the previous slide, with cumulative timing labels 5, 25, 40, 60, 80, 82, 97 ps along the path and the annotations listed above. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Solution

Execution Time = # instructions x CPI x Tc
= (100 × 10^9)(1)(97 × 10^-12 s)
= 100 x 97 x 10^-3 s
= 9.7 s
SC vs MC Perf. Which one is faster?
◼ Consider the hardware times of major units (all others
being negligible) in a datapath as given below.
Determine the minimum clock cycle time, average CPI,
and average instruction execution time for the single-cycle
and multi-cycle datapaths.
Solution
Single-cycle clock period (cycle time) is determined by the
lw instruction, which activates the critical path:
Cycle time = memory + reg. file + ALU + memory + reg. file
= 25 + 15 + 20 + 25 + 15 = 100ps
Av. CPI =1
Av. time/instruction
= 100 ✕ 1 = 100ps

Multi-cycle clock period (cycle time) is determined by the
slowest hardware unit (memory in this case):
Cycle time = 25ps
Clock cycles used by instructions are 5 for lw, 4 for sw, 4 for
r-type, 3 for branch and 3 for jump. Therefore,
Av. CPI
= 0.1✕5 + 0.1✕4 + 0.4✕4 + 0.2✕3 + 0.2✕3
= 3.7
Av. time/instruction
= 25 ✕ 3.7 = 92.5ps
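
The single-cycle and multi-cycle results above can be reproduced with the sketch below. The unit delays (memory 25 ps, register file 15 ps, ALU 20 ps) and the instruction mix (lw 10%, sw 10%, r-type 40%, branch 20%, jump 20%) are read off the arithmetic in the solution, since the problem's own table is not reproduced here; all variable names are assumptions.

```python
# Unit delays in ps, as used in the solution's cycle-time sum.
UNIT_PS = {"mem": 25, "rf": 15, "alu": 20}

# Mix and multi-cycle cycle counts, as used in the solution's CPI sum.
MIX = {"lw": 0.1, "sw": 0.1, "r_type": 0.4, "branch": 0.2, "jump": 0.2}
MC_CYCLES = {"lw": 5, "sw": 4, "r_type": 4, "branch": 3, "jump": 3}

# Single-cycle: the clock must cover the whole lw path; CPI is 1.
lw_path = ["mem", "rf", "alu", "mem", "rf"]
sc_cycle_ps = sum(UNIT_PS[unit] for unit in lw_path)
sc_time_per_instr_ps = sc_cycle_ps * 1

# Multi-cycle: the clock must cover only the slowest single unit.
mc_cycle_ps = max(UNIT_PS.values())
mc_cpi = sum(MIX[name] * MC_CYCLES[name] for name in MIX)
mc_time_per_instr_ps = mc_cycle_ps * mc_cpi

print(f"single-cycle: {sc_cycle_ps} ps/cycle, {sc_time_per_instr_ps} ps/instr")
print(f"multi-cycle:  {mc_cycle_ps} ps/cycle, CPI {mc_cpi:.1f}, "
      f"{mc_time_per_instr_ps:.1f} ps/instr")
```

With this mix the multi-cycle design is slightly faster per instruction (92.5 ps against 100 ps); a mix with more loads would narrow or reverse that gap.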

Performance ratio
(a) Suppose an operation involving register file, memory or
ALU each takes 1 time unit. Neglecting the time of all other
hardware, how much time will each MIPS instruction
take on a single-cycle datapath? Consider R-type, lw, sw,
beq and j instructions.

(b) What will be the execution times for MIPS instructions on
a 5-cycle multi-cycle datapath using a clock period of 1 time
unit?
(c) A program contains the following mix of instructions: lw
5%, sw 5%, r-type 70%, branch 10%, jump 10%.
What is the ratio of single-cycle CPU time to multicycle CPU
time for running this program on these datapaths?

Solution
(a) Each instruction will take 5 units of time on a single-cycle
datapath.

(b) Times for MIPS instructions to run on a multi-cycle
datapath are:
Load, lw 5 time units
Store, sw 4 time units
R-type, add, etc. 4 time units
Branch, beq, bne 3 time units
Jump, j 3 time units
Solution
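
For part (c), the ratio follows from the instruction mix in the question and the per-instruction times from parts (a) and (b). The Python sketch below is one way to work it out under those stated numbers; the names `MIX`, `MULTICYCLE_TIME`, and `average_time` are assumptions, and the result is this sketch's own arithmetic rather than a figure quoted from the slides.

```python
# Part (a): every instruction takes 5 time units on the single-cycle datapath.
SINGLE_CYCLE_TIME = 5

# Part (b): time units per instruction class on the multi-cycle datapath.
MULTICYCLE_TIME = {"lw": 5, "sw": 4, "r_type": 4, "branch": 3, "jump": 3}

# Part (c): instruction mix given in the question.
MIX = {"lw": 0.05, "sw": 0.05, "r_type": 0.70, "branch": 0.10, "jump": 0.10}


def average_time(mix: dict, time_per_class: dict) -> float:
    """Mix-weighted average time per instruction, in time units."""
    return sum(mix[name] * time_per_class[name] for name in mix)


multicycle_avg = average_time(MIX, MULTICYCLE_TIME)
ratio = SINGLE_CYCLE_TIME / multicycle_avg  # single-cycle time / multicycle time

print(f"multicycle average = {multicycle_avg:.2f} time units")
print(f"single-cycle / multicycle = {ratio:.2f}")
```

Under these numbers the multi-cycle average is 3.85 time units per instruction, so the single-cycle CPU time is roughly 1.3 times the multi-cycle CPU time.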

Performance for a benchmark program
Clock rates for single-cycle and multicycle datapaths are
given as 1GHz and 5GHz, respectively.
The following subroutine is used for estimating performance.
The argument register $a0 contains a large positive
integer and $a1 contains 1.
loop: sub $a0, $a0, $a1
      beq $a0, $0, done
      j loop
done: jr $31
Determine:
(a) Average cycles per instruction (CPI) for the two datapaths.
(b) How much faster is the execution of the program on the
multicycle processor compared to that on the single-cycle processor?
Solution
(a) CPI
Single-cycle CPI = 1.0, because each instruction executes in
one cycle.
The instruction mix for multicycle datapath is:
sub takes 4 cycles and is executed a0 times
beq takes 3 cycles and is executed a0 times
j takes 3 cycles and is executed a0 – 1 times
jr takes 3 cycles and is executed once

Total number of instructions = 3a0 – 1 + 1 = 3a0


Multicycle CPI
= (4×a0 + 3×a0 + 3×a0 – 3 + 3)/(3×a0) = 10/3 = 3.333

Solution
(b) Execution time ratio:
The multicycle clock period is 0.2ns and the single-cycle clock
period is 1ns.
Therefore,
Performance ratio
= (single-cycle exec time)/(multicycle exec time)
= (1 ns × 3a0)/(0.2 ns × 3.333 × 3a0)
= 1.5
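
The instruction counts, multicycle CPI, and performance ratio can be cross-checked by stepping through the loop in Python. This is a sketch under the stated assumptions ($a1 = 1, clock periods of 1 ns and 0.2 ns, and the multicycle cycle counts given in the solution); the function names are hypothetical.

```python
# Multicycle cycle counts per instruction, from the solution to part (a).
CYCLES = {"sub": 4, "beq": 3, "j": 3, "jr": 3}


def count_instructions(a0: int, a1: int = 1) -> dict:
    """Step through the subroutine and count each executed instruction."""
    counts = {"sub": 0, "beq": 0, "j": 0, "jr": 0}
    while True:
        a0 -= a1                 # loop: sub $a0, $a0, $a1
        counts["sub"] += 1
        counts["beq"] += 1       # beq $a0, $0, done
        if a0 == 0:
            break                # branch taken, fall out to done
        counts["j"] += 1         # j loop
    counts["jr"] += 1            # done: jr $31
    return counts


def multicycle_cpi(counts: dict) -> float:
    total_cycles = sum(CYCLES[name] * n for name, n in counts.items())
    return total_cycles / sum(counts.values())


def performance_ratio(counts: dict, sc_period_ns=1.0, mc_period_ns=0.2) -> float:
    """Single-cycle execution time divided by multicycle execution time."""
    n = sum(counts.values())
    sc_time = n * 1.0 * sc_period_ns                # single-cycle CPI is 1
    mc_time = n * multicycle_cpi(counts) * mc_period_ns
    return sc_time / mc_time


counts = count_instructions(1000)
print(counts)
print(f"CPI = {multicycle_cpi(counts):.3f}, ratio = {performance_ratio(counts):.2f}")
```

For any positive a0 the totals come out to 3·a0 instructions and 10·a0 cycles, so the CPI and the ratio do not depend on the loop count.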
