Week6 Performance Numericals
Numerical Problems +
Midterm Review for CSA
A Single-Cycle Microarchitecture
◼ Is this a good idea/design?
Performance Analysis Basics
Processor Performance
◼ How fast is my program?
❑ Every program consists of a series of instructions
❑ Each instruction needs to be executed.
◼ So how fast are my instructions?
❑ Instructions are realized on the hardware
❑ They can take one or more clock cycles to complete
❑ Cycles per Instruction = CPI
◼ How much time is one clock cycle?
❑ The critical path determines how much time one cycle
requires = clock period.
❑ 1/clock period = clock frequency = how many cycles can be
done each second.
Processor Performance
◼ Now as a general formula
❑ Our program consists of executing N instructions.
❑ Our processor needs CPI cycles for each instruction.
❑ The maximum clock speed of the processor is f,
and the clock period is therefore T=1/f
◼ Our program executes in
N × CPI × (1/f) = N × CPI × T seconds
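The formula can be checked numerically with a short sketch. The function name `execution_time` and the sample workload (one billion instructions, CPI of 1.5, 2 GHz clock) are illustrative assumptions, not figures from the slides.

```python
def execution_time(n_instructions: int, cpi: float, clock_hz: float) -> float:
    """Return execution time in seconds: N x CPI x T, with T = 1/f."""
    clock_period = 1.0 / clock_hz  # T = 1/f, in seconds
    return n_instructions * cpi * clock_period


# Hypothetical workload: 1e9 instructions, CPI = 1.5, f = 2 GHz.
t = execution_time(1_000_000_000, 1.5, 2e9)
print(f"{t:.3f} s")  # prints "0.750 s"
```

Halving the CPI or doubling the clock frequency halves the result, which is why the later slides compare designs by the product CPI × T rather than by either factor alone.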
Performance Analysis Basics
◼ Execution time of an instruction
❑ {CPI} x {clock cycle time}
◼ CPI: Number of cycles it takes to execute an instruction
Performance Analysis of
Our Single-Cycle Design
A Single-Cycle Microarchitecture: Analysis
◼ Every instruction takes 1 cycle to execute
❑ CPI (Cycles per instruction) is strictly 1
What is the Slowest Instruction to Process?
◼ Let’s go back to the basics
Example Single-Cycle Datapath Analysis
◼ Assume (for the design in the previous slide)
❑ memory units (read or write): 200 ps
❑ ALU and adders: 100 ps
❑ register file (read or write): 50 ps
❑ other combinational logic: 0 ps
steps       IF     ID     EX     MEM    WB
resources   mem    RF     ALU    mem    RF
delay (ps)  200    50     100    200    50
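The per-instruction path delays follow from summing the unit delays along each path. The sketch below does that under one assumption that is not stated on the slide: the resource sequence used by each instruction class (`PATHS`) follows the usual MIPS single-cycle datapath, with jump modeled as instruction fetch only.

```python
# Unit delays in picoseconds, taken from the assumptions above.
DELAY_PS = {"mem": 200, "ALU": 100, "RF": 50}

# Resources used in sequence by each instruction class.
# Assumption: standard MIPS single-cycle paths; jump only fetches.
PATHS = {
    "R-type": ["mem", "RF", "ALU", "RF"],
    "lw":     ["mem", "RF", "ALU", "mem", "RF"],
    "sw":     ["mem", "RF", "ALU", "mem"],
    "beq":    ["mem", "RF", "ALU"],
    "j":      ["mem"],
}


def path_delay(resources):
    """Sum the unit delays along one instruction's path."""
    return sum(DELAY_PS[r] for r in resources)


delays = {name: path_delay(path) for name, path in PATHS.items()}
clock_period_ps = max(delays.values())  # the slowest instruction sets the clock

for name, d in delays.items():
    print(f"{name:7s}{d:4d} ps")
print("clock period:", clock_period_ps, "ps")
```

Under these assumptions lw is the slowest instruction at 600 ps, so every instruction is charged 600 ps in a single-cycle design.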
[Figures: the single-cycle MIPS datapath, repeated five times with cumulative arrival times marked along the active path. The marked times are 200, 250, 350 and 400 ps in the first diagram; 200, 250, 350, 550 and 600 ps in the second; 200, 250, 350 and 550 ps in the third; 200, 250 and 350 ps in the fourth; and 200 ps in the fifth. These match R-type, lw, sw, branch and jump under the delays assumed above. The PC+4 adder is marked 100 ps.]
Single-Cycle vs. Multicycle
[Figure: clock timing for four instructions. Single-cycle: Instr 1 to Instr 4 are each allotted the same full clock period, regardless of the time needed. Multicycle: the four instructions take 3, 5, 3 and 4 shorter cycles, so the gap between time allotted and time needed is saved.]
Contribution to CPI (fraction × cycles):
R-type   0.44 × 4 = 1.76
Load     0.24 × 5 = 1.20
Store    0.12 × 4 = 0.48
Branch   0.18 × 3 = 0.54
[Figure: resources used by each instruction class; some stages are marked "Not used" for Store and for Branch (and jr).]
Note: ALU is not used in the last two cases here, as separate hardware exists for branch and jump address calculation, which is not the case for our multicycle MIPS.
Multi-Cycle Performance: Average CPI
◼ Instructions take different number of cycles:
❑ 3 cycles: beq, j
❑ 4 cycles: R-Type, sw, addi
❑ 5 cycles: lw (Realistic?)
◼ CPI is weighted average, e.g. SPECINT2000 benchmark:
❑ 25% loads
❑ 10% stores
❑ 11% branches
❑ 2% jumps
❑ 52% R-type
◼ Average CPI = (0.11 + 0.02)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
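The weighted average can be computed directly from the mix and the cycle counts listed above. This is a minimal sketch; the dictionary names and the helper `average_cpi` are illustrative, and stores are taken to cost the 4 cycles listed for sw.

```python
# Cycles per instruction class on the multicycle design (from the list above).
CYCLES = {"load": 5, "store": 4, "branch": 3, "jump": 3, "r_type": 4}

# SPECINT2000 instruction mix quoted above.
MIX = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "r_type": 0.52}


def average_cpi(mix, cycles):
    """Weighted average of cycle counts; the mix fractions must sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix must sum to 1")
    return sum(frac * cycles[kind] for kind, frac in mix.items())


print(f"average CPI = {average_cpi(MIX, CYCLES):.2f}")  # prints "average CPI = 4.12"
```

The result sits between the best case (3) and the worst case (5), pulled toward 4 because R-type and store instructions dominate the mix.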
LW path when muxes etc can’t be ignored
[Figure: the single-cycle datapath annotated with signal arrival times along the lw path when mux and control-unit delays are included. Time labels recoverable from the diagram (ps): 5, 20, 24, 25, 40, 60, 80, 82, 97. Callouts:
- The CU is in parallel to the regfile: the CU finishes its work at 28 ps, while the regfile finishes at 40 ps.
- This mux output is stable at 27 ps.
- This mux output is stable at 29 ps, while the rs read will complete at 40 ps.
- "In SW, immediate" and "Part of the critical path" (callout text only partly recoverable).]
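Single-Cycle vs. Multicycle Performance: Which One Is Faster?
◼ Consider the hardware times of the major units (all others being negligible) in a datapath, as given below. Determine the minimum clock cycle time, average CPI, and average instruction execution time for the single-cycle and multicycle datapaths.
❑ Unit delays (as used in the solution): memory 25 ps, register file 15 ps, ALU 20 ps
❑ Instruction mix (as used in the solution): 10% lw, 10% sw, 40% R-type, 20% branch, 20% jump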
Solution
Single-cycle clock period (cycle time) is determined by the
lw instruction, which activates the critical path:
Cycle time = memory + reg. file + ALU + memory + reg. file
= 25 + 15 + 20 + 25 + 15 = 100 ps
Av. CPI = 1
Av. time/instruction
= 100 × 1 = 100 ps
Multi-cycle clock period (cycle time) is determined by the
slowest hardware unit (memory in this case):
Cycle time = 25ps
Clock cycles used by instructions are 5 for lw, 4 for sw, 4 for R-type, 3 for branch and 3 for jump. Therefore,
Av. CPI
= 0.1 × 5 + 0.1 × 4 + 0.4 × 4 + 0.2 × 3 + 0.2 × 3
= 3.7
Av. time/instruction
= 25 × 3.7 = 92.5 ps
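Both results can be reproduced side by side. The sketch below uses the unit delays, instruction mix and cycle counts from the solution; the variable names are illustrative only.

```python
# Unit delays in picoseconds, as used in the solution.
UNIT_PS = {"mem": 25, "RF": 15, "ALU": 20}

# Instruction mix and multicycle cycle counts, as used in the solution.
MIX = {"lw": 0.1, "sw": 0.1, "R-type": 0.4, "branch": 0.2, "jump": 0.2}
MC_CYCLES = {"lw": 5, "sw": 4, "R-type": 4, "branch": 3, "jump": 3}

# Single-cycle: the clock must cover the full lw path, and CPI is 1.
LW_PATH = ["mem", "RF", "ALU", "mem", "RF"]
sc_cycle_ps = sum(UNIT_PS[u] for u in LW_PATH)
sc_time_ps = sc_cycle_ps * 1.0

# Multicycle: the clock must cover the slowest unit; CPI is a weighted average.
mc_cycle_ps = max(UNIT_PS.values())
mc_cpi = sum(MIX[i] * MC_CYCLES[i] for i in MIX)
mc_time_ps = mc_cycle_ps * mc_cpi

print(f"single-cycle: {sc_cycle_ps} ps/cycle, {sc_time_ps:.1f} ps/instruction")
print(f"multicycle:   {mc_cycle_ps} ps/cycle, CPI {mc_cpi:.1f}, "
      f"{mc_time_ps:.1f} ps/instruction")
```

With this mix the multicycle design is faster per instruction (92.5 ps against 100 ps); a mix with more loads would shrink that gap.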
Performance ratio
(a) Suppose an operation involving register file, memory or
ALU each takes 1 time unit. Neglecting the time of all other
hardware, how much time will each MIPS instruction
take on a single-cycle datapath? Consider R-type, lw, sw,
beq and j instructions.
Solution
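(a) Each instruction will take 5 units of time on a single-cycle datapath. The clock period is set by the slowest instruction, lw, which uses memory, register file, ALU, memory and register file in sequence (5 × 1 unit), and every instruction occupies one full clock period.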
Performance for a benchmark program
Clock rates for the single-cycle and multicycle datapaths are given as 1 GHz and 5 GHz, respectively.
The following subroutine is used for estimating performance. The argument register $a0 contains a large positive integer and $a1 contains 1.
loop: sub $a0, $a0, $a1
      beq $a0, $0, done
      j loop
done: jr $31
Determine:
(a) Average cycles per instruction (CPI) for the two datapaths.
(b) How much faster does the program execute on the multicycle processor than on the single-cycle processor?
Solution
(a) CPI
Single-cycle CPI = 1.0, because each instruction executes in
one cycle.
The instruction mix for multicycle datapath is:
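sub takes 4 cycles and is executed a0 times
beq takes 3 cycles and is executed a0 times
j takes 3 cycles and is executed a0 – 1 times
jr takes 3 cycles and is executed once
Total instructions = a0 + a0 + (a0 – 1) + 1 = 3a0
Total cycles = 4a0 + 3a0 + 3(a0 – 1) + 3 = 10a0
Multicycle CPI = 10a0 / 3a0 ≈ 3.333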
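The execution counts can be confirmed by stepping through the subroutine in software. This is a sketch, not a MIPS simulator: `run_loop` mirrors the control flow of the four instructions with $a1 fixed at 1, and the value a0 = 1000 is an arbitrary stand-in for "a large positive integer".

```python
# Cycles per instruction on the multicycle datapath, as listed above.
MC_CYCLES = {"sub": 4, "beq": 3, "j": 3, "jr": 3}


def run_loop(a0: int) -> dict:
    """Count how often each instruction of the subroutine executes ($a1 = 1)."""
    if a0 < 1:
        raise ValueError("a0 must be a positive integer")
    counts = {"sub": 0, "beq": 0, "j": 0, "jr": 0}
    while True:
        a0 -= 1               # loop: sub $a0, $a0, $a1
        counts["sub"] += 1
        counts["beq"] += 1    #       beq $a0, $0, done
        if a0 == 0:
            break             # branch taken: leave the loop
        counts["j"] += 1      #       j loop
    counts["jr"] += 1         # done: jr $31
    return counts


def multicycle_cpi(a0: int) -> float:
    """Total cycles divided by total instructions for one call."""
    counts = run_loop(a0)
    total_cycles = sum(counts[i] * MC_CYCLES[i] for i in counts)
    return total_cycles / sum(counts.values())


print(run_loop(1000))
print(f"multicycle CPI = {multicycle_cpi(1000):.3f}")  # prints "multicycle CPI = 3.333"
```

The CPI comes out as 10/3 for any positive a0, because both the cycle total (10·a0) and the instruction total (3·a0) scale linearly with a0.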
Solution
(b) Execution time ratio:
The multicycle clock period is 0.2ns and the single-cycle clock
period is 1ns.
Therefore,
Performance ratio
= (single-cycle exec time)/(multicycle exec time)
= (1 ns × 1.0 × 3a0)/(0.2 ns × 3.333 × 3a0)
= 1.5
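The ratio can be recomputed from the clock rates and CPIs. In this sketch the helper `exec_time_ns` and the choice a0 = 1000 are assumptions made for illustration; a0 cancels out of the ratio, so any positive value gives the same answer.

```python
def exec_time_ns(n_instructions: int, cpi: float, clock_ghz: float) -> float:
    """Execution time in nanoseconds: N x CPI x T, with T (ns) = 1 / f (GHz)."""
    return n_instructions * cpi / clock_ghz


a0 = 1000                 # hypothetical loop count
n = 3 * a0                # the subroutine executes 3*a0 instructions in total

single_cycle_ns = exec_time_ns(n, 1.0, 1.0)      # CPI 1.0 at 1 GHz
multicycle_ns = exec_time_ns(n, 10 / 3, 5.0)     # CPI 10/3 at 5 GHz
ratio = single_cycle_ns / multicycle_ns

print(f"single-cycle: {single_cycle_ns:.0f} ns")
print(f"multicycle:   {multicycle_ns:.0f} ns")
print(f"performance ratio = {ratio:.2f}")  # prints "performance ratio = 1.50"
```

The multicycle clock is five times faster, but each instruction needs 3.33 cycles on average, so the net gain is 5 / 3.33 = 1.5.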