07 Pipeline Notes
Hakim Weatherspoon
CS 3410
Computer Science
Cornell University
The slides are the product of many rounds of teaching CS 3410
by Professors Weatherspoon, Bala, Bracy, McKee, and Sirer.
Review: Single Cycle Processor
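[Figure: single-cycle datapath with PC, +4 adders, instruction memory, register file, control, imm extend, comparator (=?), offset/target logic for the new pc, and data memory (addr, din, dout)]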
Review: Single Cycle Processor
Advantages
• A single cycle per instruction makes the logic and clock simple
Disadvantages
• Since instructions take different amounts of time to finish, memory
and functional units are not efficiently utilized
• Cycle time is set by the longest delay
– Load instruction
• Best possible CPI is 1 (actually < 1 with parallelism)
– However, lower MIPS and longer clock period (lower clock
frequency); hence, lower performance
Review: Multi Cycle Processor
Advantages
• Better MIPS and smaller clock period (higher clock
frequency)
• Hence, better performance than Single Cycle processor
Disadvantages
• Higher CPI than single cycle processor
Pipelining
Both!
The Kids
Alice
Bob
Saw, Drill, Glue, Paint
The Instructions
N pieces, each built following the same sequence:
[Figure: schedule of Alice's and Bob's pieces moving through the steps]
Latency: 4 hrs/task
Throughput: 1 task/hr
Concurrency: 4
CPI = 1
Pipelined Performance
What if drilling takes twice as long, but gluing and paint take ½ as long?
[Figure: schedule over time 1 to 10]
Done: 4 cycles
Done: 6 cycles
Done: 8 cycles
Latency: 4 cycles/task
Throughput: 1 task/2 cycles
CPI = 2
Lessons
Principle:
Throughput increased by parallel execution
Balanced pipeline very important
Else slowest stage dominates performance
Pipelining:
• Identify pipeline stages
• Isolate stages from each other
• Resolve pipeline hazards (next lecture)
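The effect of an unbalanced stage can be checked with a small sketch. It assumes each stage handles one task at a time, that a task enters a stage as soon as both the task and the stage are free, and that all tasks are available at time 0; the function name `completion_times` is made up for this illustration.

```python
def completion_times(stage_times, n_tasks):
    """Return the time at which each task leaves the last stage."""
    stage_free = [0.0] * len(stage_times)  # when each stage next becomes idle
    done = []
    for _ in range(n_tasks):
        t = 0.0  # assumption: every task is ready at time 0
        for s, duration in enumerate(stage_times):
            start = max(t, stage_free[s])  # wait for the stage if it is busy
            t = start + duration
            stage_free[s] = t
        done.append(t)
    return done


# Balanced: saw, drill, glue, paint each take 1 time unit.
print(completion_times([1, 1, 1, 1], 3))      # [4.0, 5.0, 6.0]
# Unbalanced: drilling takes 2, gluing and painting take 1/2 each.
print(completion_times([1, 2, 0.5, 0.5], 3))  # [4.0, 6.0, 8.0]
```

In both cases the latency of one task is 4 time units, but in the unbalanced case a task finishes only every 2 units because the slowest stage sets the pace.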
Single Cycle vs Pipelined Processor
Single-cycle
insn0.fetch, dec, exec
                        insn1.fetch, dec, exec
Pipelined
insn0.fetch  insn0.dec    insn0.exec
             insn1.fetch  insn1.dec    insn1.exec
Agenda
5-stage Pipeline
• Implementation
• Working Example
Hazards
• Structural
• Data Hazards
• Control Hazards
A Processor
Review: Single cycle processor
[Figure: single-cycle datapath, as in the review above]
A Processor
[Figure: the same datapath divided into five stages: Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back; jump/branch targets are computed for the new pc]
Pipelined Processor
[Figure: pipelined datapath; the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separate the Fetch, Decode, Execute, Memory, and Write-Back stages and carry inst, A, B, imm, D, M, and ctrl]
Time Graphs
Cycle  1   2   3   4   5   6   7   8   9
add    IF  ID  EX  MEM WB
nand       IF  ID  EX  MEM WB
lw             IF  ID  EX  MEM WB
add                IF  ID  EX  MEM WB
sw                     IF  ID  EX  MEM WB
Latency: 5 cycles
Throughput: 1 insn/cycle
CPI = 1
Concurrency: 5
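The time graph can be regenerated with a short sketch, assuming an ideal five-stage pipeline with no stalls. The helper names `timing_rows` and `total_cycles` are made up for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def timing_rows(instructions):
    """Return (name, cells) pairs; instruction i enters IF in cycle i + 1."""
    return [(name, [""] * i + STAGES) for i, name in enumerate(instructions)]


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed to drain n instructions through an ideal pipeline."""
    return n_stages + n_instructions - 1


program = ["add", "nand", "lw", "add", "sw"]
for name, cells in timing_rows(program):
    print(f"{name:<5}" + "".join(f"{c:<4}" for c in cells))
print(total_cycles(len(program)))  # 9
```

Five instructions finish in 9 cycles rather than 25, which is where the throughput of one instruction per cycle comes from once the pipeline is full.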
Principles of Pipelined Implementation
• Break datapath into multiple cycles (here 5)
• Parallel execution increases throughput
• Balanced pipeline very important
• Slowest stage determines clock rate
• Imbalance kills performance
• Add pipeline registers (flip-flops) for isolation
• Each stage begins by reading values from latch
• Each stage ends by writing values to latch
• Resolve hazards
Pipeline Stages
Stage: Fetch
Perform Functionality: Use PC to index Program Memory, increment PC
Latch values of interest: Instruction bits (to be decoded), PC + 4 (to compute branch targets)
Instruction Fetch (IF)
[Figure: fetch stage; PC addresses the instruction memory (mc: 00 = read word); inst and PC+4 are latched into IF/ID for the rest of the pipeline; pc-sel chooses the new pc]
New pc is one of:
• PC+4
• pc-reg (PC registers: JR)
• pc-rel (PC-relative: BEQ, BNE)
• pc-abs (PC absolute: J and JAL), formed as (PC+4)31..28 • target • 00
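The four next-PC choices can be sketched as a selector function. The pc-abs case follows the slide's (PC+4)31..28 • target • 00 formula; the pc-rel case uses the standard MIPS rule (sign-extended 16-bit offset, shifted left by 2, added to PC+4), which the slide does not spell out. The function and parameter names are assumptions for this illustration.

```python
MASK32 = 0xFFFFFFFF


def next_pc(pc, pc_sel, *, imm16=0, target26=0, reg_value=0):
    """Select the next PC the way the pc-sel mux would (illustrative only)."""
    pc_plus_4 = (pc + 4) & MASK32
    if pc_sel == "pc+4":
        return pc_plus_4
    if pc_sel == "pc-rel":  # BEQ, BNE
        imm16 &= 0xFFFF
        offset = imm16 - 0x10000 if imm16 & 0x8000 else imm16  # sign extend
        return (pc_plus_4 + (offset << 2)) & MASK32
    if pc_sel == "pc-abs":  # J, JAL: (PC+4)[31..28] . target . 00
        return (pc_plus_4 & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)
    if pc_sel == "pc-reg":  # JR
        return reg_value & MASK32
    raise ValueError(f"unknown pc_sel: {pc_sel}")


# A branch at 0x1C with offset -3 words goes back to 0x14.
print(hex(next_pc(0x1C, "pc-rel", imm16=0xFFFD)))  # 0x14
```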
Decode
[Figure: decode stage; takes PC+4 and inst from IF/ID (Stage 1: Instruction Fetch); the register file is read through Ra and Rb to produce A and B and written through Rd, D, and WE using dest and result from write-back; imm is extended; values are latched into ID/EX (Stage 2: Instruction Decode)]
Execute (EX)
[Figure: execute stage; the alu and an adder produce the result and the branch target; pcreg, pcrel, pcabs, pcsel, and branch? feed the new pc; ctrl, target, B, and D are latched into EX/MEM (Stage 3: Execute)]
MEM
[Figure: memory stage; data memory with addr, din, dout, and mc; ctrl, M, and D are latched into MEM/WB (Stage 4: Memory)]
WB
[Figure: write-back stage; dest and result are sent back to the register file]
Putting it all together!
[Figure: complete five-stage pipelined datapath; the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB carry inst, PC+4, A, B, imm, D, M, Rd, and OP]
iClicker Question
Consider a non-pipelined processor with clock
period C (e.g., 50 ns). If you divide the processor
into N stages (e.g., 5) , your new clock period will
be:
A. C
B. N
C. less than C/N
D. C/N
E. greater than C/N
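One way to reason about the question is a small sketch under two assumptions that come from the principles above: the slowest stage sets the clock, and each pipeline register adds some overhead. The delay numbers are made up for illustration, and `pipelined_clock_period` is a hypothetical helper.

```python
def pipelined_clock_period(stage_delays, register_overhead):
    """Clock period = slowest stage + pipeline register (latch) overhead."""
    return max(stage_delays) + register_overhead


C, N = 50, 5  # 50 ns non-pipelined period split into 5 stages

# Perfectly balanced stages, 1 ns register overhead (assumed value).
print(pipelined_clock_period([10, 10, 10, 10, 10], 1))  # 11

# Unbalanced stages that still sum to 50 ns, no overhead at all.
print(pipelined_clock_period([8, 12, 10, 10, 10], 0))   # 12

print(C / N)  # 10.0
```

Under these assumptions the period only reaches C/N when the stages are perfectly balanced and the registers cost nothing.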
Takeaway
Pipelining is a powerful technique to mask
latencies and increase throughput
• Logically, instructions execute one at a time
• Physically, instructions execute in parallel
– Instruction level parallelism
Example: Sample Code (Simple)
add r3, r1, r2
nand r6, r4, r5
lw r4, 20(r2)
add r5, r2, r5
sw r7, 12(r3)
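As a reference for the walkthrough, the sample code can be run one instruction at a time, which is what the pipeline must appear to do. The initial register values and the memory word 99 at address 29 are read off the cycle-by-cycle figures for this example; the tuple encoding and the `run` helper are made up for this sketch.

```python
def run(program, regs, mem):
    """Execute instructions sequentially (no pipeline) and return the state."""
    for op, x, y, z in program:
        if op == "add":     # add rX, rY, rZ
            regs[x] = regs[y] + regs[z]
        elif op == "nand":  # nand rX, rY, rZ
            regs[x] = ~(regs[y] & regs[z])
        elif op == "lw":    # lw rX, Y(rZ)
            regs[x] = mem[y + regs[z]]
        elif op == "sw":    # sw rX, Y(rZ)
            mem[y + regs[z]] = regs[x]
    return regs, mem


program = [
    ("add", 3, 1, 2),
    ("nand", 6, 4, 5),
    ("lw", 4, 20, 2),
    ("add", 5, 2, 5),
    ("sw", 7, 12, 3),
]
regs = [0, 36, 9, 12, 18, 7, 41, 22]  # R0..R7 from the initial figure
mem = {29: 99}                        # word loaded by lw in the figures
regs, mem = run(program, regs, mem)
print(regs)  # [0, 36, 9, 45, 99, 16, -3, 22]
print(mem)   # {29: 99, 57: 22}
```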
[Figures: cycle-by-cycle snapshots of the pipelined datapath executing the sample code. dest is selected from instruction bits 11-15 (Rd) or bits 16-20 (Rt); op comes from bits 26-31]
Initial: PC = 0; R0 = 0, R1 = 36, R2 = 9, R3 = 12, R4 = 18, R5 = 7, R6 = 41, R7 = 22; all pipeline registers hold nop
Cycle 1: Fetch add 3 1 2
Cycle 2: Fetch nand 6 4 5
Cycle 3: Fetch lw 4 20(2)
Cycle 4: Fetch add 5 2 5
Cycle 5: Fetch sw 7 12(3)
Cycles 6-9: No more instructions; the pipeline drains
Results:
• add: 36 + 9 = 45, written to R3
• nand: 18 nand 7 = -3 (01 0010 nand 00 0111 = 11 1101), written to R6
• lw: address 20 + 9 = 29, loaded value 99, written to R4
• add: 9 + 7 = 16, written to R5
• sw: address 12 + 45 = 57, stores R7 = 22
Hazards
Correctness problems associated w/processor design
1. Structural hazards
Same resource needed for different purposes at
the same time (Possible: ALU, Register File, Memory)
2. Data hazards
Instruction output needed before it’s available
3. Control hazards
Next instruction PC unknown at time of Fetch
Dependences and Hazards
Dependence: relationship between two insns
• Data: two insns use same storage location
• Control: 1 insn affects whether another executes at all
• Not a bad thing, programs would be boring otherwise
• Enforced by making older insn go before younger one
– Happens naturally in single-/multi-cycle designs
– But not in a pipeline
Hazard: dependence & possibility of wrong insn order
• Effects of wrong insn order cannot be externally visible
• Hazards are a bad thing: most solutions either complicate
the hardware or reduce performance
iClicker Question
Data Hazards
• register file (RF) reads occur in stage 2 (ID)
• RF writes occur in stage 5 (WB)
• RF written in first half, read in second half of cycle
add r3, r1, r2    IF  ID  X   MEM WB
sub r5, r3, r4        IF  ID  X   MEM WB
1. Is there a dependence?
2. Is there a hazard?
A) Yes
B) No
C) Cannot tell with the information given.
Visualizing Data Hazards
[Figures: pipeline diagrams over clock cycles 1 to 9]
Data Hazards
• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• next instructions may read values about to be written
How to detect?
Detecting Data Hazards
[Figure: pipelined datapath with sub r5,r3,r4 one stage behind add r3, r1, r2; a "detect hazard" unit compares the register fields held in IF/ID, ID/EX, EX/MEM, and MEM/WB]
Hazard condition:
IF/ID.Ra ≠ 0 &&
(IF/ID.Ra == ID/Ex.Rd ||
 IF/ID.Ra == Ex/M.Rd ||
 IF/ID.Ra == M/W.Rd)
Takeaway
Data hazards occur when an operand (register) depends on
the result of a previous instruction that may not be
computed yet. A pipelined processor needs to detect data
hazards.
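The detection condition from the datapath slide can be written as a predicate. This is a minimal sketch: it checks every source register of the instruction in Decode, and it assumes a destination of 0 means "writes nothing", since r0 is never a hazard. The function name `data_hazard` is made up for this illustration.

```python
def data_hazard(src_regs, rd_id_ex, rd_ex_mem, rd_mem_wb):
    """True if the instruction in ID reads a register that an older,
    still in-flight instruction (in EX, MEM, or WB) is going to write."""
    in_flight = (rd_id_ex, rd_ex_mem, rd_mem_wb)
    return any(r != 0 and r in in_flight for r in src_regs)


# sub r5, r3, r4 in ID while add r3, r1, r2 is in EX: hazard on r3.
print(data_hazard([3, 4], rd_id_ex=3, rd_ex_mem=0, rd_mem_wb=0))  # True
# No older instruction writes r3 or r4: no hazard.
print(data_hazard([3, 4], rd_id_ex=6, rd_ex_mem=0, rd_mem_wb=0))  # False
```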
Next Goal
What to do if data hazard detected?
iClicker
What to do if data hazard detected?
A) Wait/Stall
B) Reorder in Software (SW)
C) Forward/Bypass
D) All the above
E) None. We will use some other method
Possible Responses to Data Hazards
1. Do Nothing
• Change the ISA to match implementation
• “Hey compiler: don’t create code w/data hazards!”
(We can do better than this)
2. Stall
• Pause current and subsequent instructions till safe
3. Forward/bypass
• Forward data value to where it is needed
(Only works if value actually exists already)
Stalling
How to stall an instruction in ID stage
• prevent IF/ID pipeline register update
– stalls the ID stage instruction
• convert ID stage instr into nop for later stages
– innocuous “bubble” passes through pipeline
• prevent PC update
– stalls the next (IF stage) instruction
Detecting Data Hazards
[Figure: pipelined datapath with a "detect hazard" unit between IF/ID and ID/EX]
add r3, r1, r2
sub r5, r3, r5
or r6, r3, r4
add r6, r3, r8
If detect hazard:
• MemWr=0 and RegWr=0 (the instruction passed to later stages becomes a nop)
• WE=0 (PC and IF/ID are not updated)
Stalling
time  Clock cycle   1   2   3   4   5   6   7   8   9
add r3, r1, r2      IF  ID  Ex  M   W
sub r5, r3, r5          IF  ID  ID  ID  ID  Ex  M   W
or r6, r3, r4               IF  IF  IF  IF  ID  Ex  M
r3 = 10 (old value), r3 = 20 (value written by add)
3 Stalls
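The stall counts in this diagram can be reproduced with a small scheduling sketch. It assumes no forwarding and the stall condition shown on these slides, which keeps an instruction in ID while a producer of one of its sources is still in EX, MEM, or WB. The `(dest, sources)` encoding and the name `stall_schedule` are made up for this illustration.

```python
def stall_schedule(program):
    """program: list of (dest, [sources]). Returns one dict per instruction
    with its first ID cycle, last ID cycle, WB cycle, and stall count."""
    wb_cycle = {}     # register -> cycle in which its producer is in WB
    rows = []
    prev_id_done = 1  # so that the first instruction decodes in cycle 2
    for dest, sources in program:
        first_id = prev_id_done + 1  # enters ID once the previous one leaves
        # May leave ID only in a cycle after every producer's WB cycle.
        ready = max((wb_cycle.get(r, 0) + 1 for r in sources if r != 0),
                    default=0)
        id_done = max(first_id, ready)
        wb = id_done + 3             # EX, MEM, WB follow ID
        if dest != 0:
            wb_cycle[dest] = wb
        rows.append({"first_id": first_id, "id_done": id_done,
                     "wb": wb, "stalls": id_done - first_id})
        prev_id_done = id_done
    return rows


program = [(3, [1, 2]),   # add r3, r1, r2
           (5, [3, 5]),   # sub r5, r3, r5
           (6, [3, 4])]   # or  r6, r3, r4
for row in stall_schedule(program):
    print(row)
```

For this program the sketch reports 3 stall cycles for sub and none for or, since or is already held back in IF while sub waits.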
Stalling
[Figures: three consecutive cycles of the stalled pipeline. sub r5,r3,r5 is held in Decode and or r6,r3,r4 in Fetch (WE=0, /stall) while nops (MemWr=0, RegWr=0) follow add r3,r1,r2 down the pipeline]
NOP = If(IF/ID.rA ≠ 0 &&
(IF/ID.rA == ID/Ex.Rd ||
 IF/ID.rA == Ex/M.Rd ||
 IF/ID.rA == M/W.Rd))
STALL CONDITION MET in each of the three cycles: first through ID/Ex.Rd, then through Ex/M.Rd, then through M/W.Rd
Forwarding
Forwarding bypasses some pipeline stages by sending a result
directly to a dependent instruction's operand (register).
Forwarding Datapath
[Figure: pipelined datapath with a "detect hazard" unit and a "forward unit"; the forward unit uses Ra and Rb together with the Rd and WE fields of later pipeline registers]
[Figures: datapath with sub r5, r3, r1 one stage behind add r3, r1, r2]
[Figures: datapath with or r6, r3, r4, sub r5, r3, r1, and add r3, r1, r2 in consecutive stages]
Detection Logic:
forward = (M/WB.WE && M/WB.Rd != 0 &&
ID/Ex.Ra == M/WB.Rd &&
not (Ex/M.WE && Ex/M.Rd != 0 &&
ID/Ex.Ra == Ex/M.Rd))
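The detection logic can be sketched as a priority selection: take the value from EX/MEM when the newest producer is there, otherwise from MEM/WB, otherwise use the value read from the register file. The dictionary fields `we` and `rd` and the function name `forward_source` are assumptions for this illustration.

```python
def forward_source(ra, ex_mem, mem_wb):
    """Return where the ALU operand for register `ra` should come from.

    ex_mem and mem_wb are dicts with 'we' (register write enable) and
    'rd' (destination register) for the instructions in those stages.
    """
    if ex_mem["we"] and ex_mem["rd"] != 0 and ex_mem["rd"] == ra:
        return "EX/MEM"  # newest value wins
    if mem_wb["we"] and mem_wb["rd"] != 0 and mem_wb["rd"] == ra:
        return "MEM/WB"  # only when EX/MEM does not also match
    return "ID/EX"       # no hazard: use the register file value


add_r3 = {"we": True, "rd": 3}
other = {"we": True, "rd": 5}
print(forward_source(3, ex_mem=add_r3, mem_wb=other))   # EX/MEM
print(forward_source(3, ex_mem=other, mem_wb=add_r3))   # MEM/WB
print(forward_source(3, ex_mem=add_r3, mem_wb=add_r3))  # EX/MEM
```

Checking EX/MEM first plays the role of the "not (...)" clause in the M/WB condition.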
Register File Bypass
[Figures: datapath with add r6, r3, r8, or r6, r3, r4, sub r5, r3, r1, and add r3, r1, r2 in consecutive stages]
Forwarding Example 2
time  Clock cycle   1   2   3   4   5   6   7   8
lw r6, 4(r3)
or r5, r3, r5           IF  ID  Ex  M   W
sw r6, 12(r3)               IF  ID  Ex  M   W
backwards arrows require time travel
Load-Use Hazard Explained
[Figures: datapath with or r6,r4,r1 one stage behind lw r4, 20(r8)]
lw r4, 20(r8)
or r6, r3, r4
Load-Use Stall
[Figures: datapath with a NOP inserted between or r6,r4,r1 and lw r4, 20(r8)]
lw r4, 20(r8)    IF  ID  Ex  M   W
or r6, r3, r4        IF  ID* ID  Ex  M   W
(ID* = Stall)
Load-Use Detection
[Figures: pipelined datapath with "detect hazard" and "forward unit" blocks; the hazard detector uses MC, Ra, Rb, and Rd]
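The load-use check can be sketched as a predicate. The slides show the detection unit only as a figure, so the condition below is the usual textbook formulation rather than something stated in the slides: stall when the instruction in EX is a load whose destination is a source of the instruction in ID. All names are made up for this illustration.

```python
def load_use_stall(id_ex_is_load, id_ex_rd, if_id_ra, if_id_rb):
    """True if the instruction in ID must wait one cycle for a load in EX."""
    return bool(id_ex_is_load
                and id_ex_rd != 0
                and id_ex_rd in (if_id_ra, if_id_rb))


# lw r4, 20(r8) in EX, or r6, r3, r4 in ID: one stall, then forward.
print(load_use_stall(True, 4, 3, 4))   # True
# add r4, r1, r2 in EX instead of a load: forwarding alone is enough.
print(load_use_stall(False, 4, 3, 4))  # False
```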
Takeaway
Data hazards occur when an operand (register) depends on the result of
a previous instruction that may not be computed yet. A pipelined
processor needs to detect data hazards.
5 Hazards
Quiz
Find all hazards, and say how they are resolved:
sw r6, 12(r2)
Forwarding from M/W→ID/Ex (W→Ex)
Stall
+ Forwarding from M/W→ID/Ex (W→Ex)
Stall
• Pause current and all subsequent instructions
Forward/Bypass
• Try to steal correct value from elsewhere in pipeline
• Otherwise, fall back to stalling or require a delay slot
Tradeoffs?
A bit of Context
i = 0;
do {
    n += 2;
    i++;
} while (i < max);
i = 7;
n--;
Assume: i → r1, n → r2, max → r3
Zap & Flush
• prevent PC update
• clear IF/ID latch
• branch continues
[Figure: pipelined datapath; branch calc and decide branch in Execute; New PC = 14]
If branch taken: Zap
1C  blt r1,r3,L        IF  ID  Ex  M   W
20  addiu r1,r0,7          IF  ID  NOP NOP NOP
24  subi r2,r2,1               IF  NOP NOP NOP NOP
14  L: addi r2,r2,2                IF  ID  Ex  M   W
For every taken branch? OUCH!!!
Reducing the cost of control hazard
1. Delay Slot
• You MUST do this
• MIPS ISA: 1 insn after ctrl insn always executed
• Whether branch taken or not
2. Resolve Branch at Decode
• Some groups do this for Project 3, your choice
• Move branch calc from EX to ID
• Alternative: just zap 2nd instruction when branch taken
3. Branch Prediction
• Not in 3410, but every processor worth anything does this
(no offense!)
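The cost of zapping can be estimated with a simple CPI formula. This is a sketch: the 20% branch frequency and 60% taken rate are made-up numbers, while the penalties of 2 and 1 cycles correspond to resolving the branch in EX (two zapped instructions) and in ID (one).

```python
def effective_cpi(branch_fraction, taken_fraction, penalty_cycles,
                  base_cpi=1.0):
    """Base CPI plus the average number of zapped slots per instruction."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


# Assumed workload: 20% of instructions are branches, 60% of them taken.
resolve_in_ex = effective_cpi(0.20, 0.60, penalty_cycles=2)
resolve_in_id = effective_cpi(0.20, 0.60, penalty_cycles=1)
print(round(resolve_in_ex, 2))  # 1.24
print(round(resolve_in_id, 2))  # 1.12
```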
Problem: Zapping 2 insns/branch
[Figure: pipelined datapath; branch calc and decide branch in Execute; New PC = 1C]
If branch taken: Zap
1C  blt r1, r3, Loop    F  D  X
20  addiu r1, r0, 7        F  D
24  subi r2, r2, 1            F
Zap!
Solution #1: Delay Slot
i = 0;
do {
    n += 2;
    i++;
} while (i < max);
i = 7;
n--;
Assume: i → r1, n → r2, max → r3
x10        addiu r1, r0, 0     # i=0
x14 Loop:  addiu r2, r2, 2     # n += 2
x18        addiu r1, r1, 1     # i++
x1C        blt r1, r3, Loop    # i<max?
Delay Slot in Action
[Figure: pipelined datapath; branch calc and decide branch in Execute; New PC = 1C]
If branch taken: Zap
1C  blt r1, r3, Loop    F  D  X
20  nop                    F  D
24  addiu r1, r0, 7           F
Zap!
Soln #2: Resolve Branches @ Decode
[Figure: pipelined datapath with branch calc and decide branch moved to Decode; New PC = 1C]
If branch taken: No Zapping
1C  blt r1, r3, Loop         F  D  X
20  nop                         F  D
14  Loop: addiu r2,r2,2            F
No Zapping!
Optimization: Fill the Delay Slot
x10        addiu r1, r0, 0     # i=0
x14 Loop:  addiu r2, r2, 2     # n += 2
x18        addiu r1, r1, 1     # i++
x1C        blt r1, r3, Loop    # i<max?
x20        nop
Compiler transforms code
x10        addiu r1, r0, 0     # i=0
x14 Loop:  addiu r1, r1, 1     # i++
x18        blt r1, r3, Loop    # i<max?
x1C        addiu r2, r2, 2     # n += 2 (fills the delay slot)
Optimization In Action!
[Figure: pipelined datapath showing the branch calc and decide-branch logic; New PC = 1C]
Speculative Execution: Loops
Pipeline so far
• “Guess” (predict) that the branch will not be taken
We can do better!
• Make prediction based on last branch
• Predict “take branch” if last branch “taken”
• Or Predict “do not take branch” if last branch “not
taken”
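The "predict the same as last time" scheme can be sketched as a one-entry predictor and compared against always predicting not taken. The loop trace (nine taken iterations, then one fall-through) and the function names are assumptions for this illustration.

```python
def last_outcome_mispredictions(outcomes, initial_prediction=False):
    """Count mispredictions when predicting the previous outcome."""
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # remember what the branch just did
    return misses


def not_taken_mispredictions(outcomes):
    """Count mispredictions when always predicting 'not taken'."""
    return sum(1 for taken in outcomes if taken)


loop_branch = [True] * 9 + [False]  # loop back 9 times, then exit
print(not_taken_mispredictions(loop_branch))     # 9
print(last_outcome_mispredictions(loop_branch))  # 2
```

For a loop branch like this one, the last-outcome predictor misses only on the first iteration and on loop exit.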
Speculative Execution: Branch Execution
Branch Not Taken (NT)
Hazards Summary
Data hazards
• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• next instructions may read values soon to be written
Control hazards
• branch instruction may change the PC in stage 3 (EX)
• next instructions have already started executing
Structural hazards
• resource contention
• so far: impossible because of ISA and pipeline design
Data Hazard Takeaways
Data hazards occur when an operand (register) depends on the result
of a previous instruction that may not be computed yet. Pipelined
processors need to detect data hazards.