Pipeline Hazards: Christos Kozyrakis Stanford University
http://eeclass.stanford.edu/ee108b
[Figure: five-stage pipelined datapath. Stages: IF (Instruction Fetch), RF (Register Fetch), EX (Execution), MEM (Memory), WB (Write back), with pipeline registers between stages.]
[Figure: the same datapath with stages labeled IF, ID, EX, MEM, WB; detail of the PC update logic (adder with constant 4, shift left 2, second adder, and PC multiplexer).]
• Instruction length
– Fixed MIPS instruction length makes pipelining easy, even though
decode does not happen until the second stage
– Intel 80x86 has a much more challenging problem where
instructions vary from 1-17 bytes
• Instruction formats
– Register access possible even though decode does not happen
until second stage due to regularity of formats
• Limited memory access
– Since only loads and stores access memory, other instructions
do not need to use the ALU to calculate the memory address
before the actual computation
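The point about regular formats can be illustrated with a short sketch. It assumes the standard MIPS field layout (rs in bits 25-21, rt in bits 20-16, rd in bits 15-11); the function name `register_fields` is hypothetical. Because the source register numbers sit in the same bit positions in every format, they can be extracted, and the register file read, before the opcode has been decoded.

```python
def register_fields(instr: int) -> dict:
    """Slice a 32-bit MIPS instruction word into its fixed-position fields.

    The rs/rt slices do not depend on the value of op, which is why the
    register file can be read in parallel with decode.
    """
    return {
        "op": (instr >> 26) & 0x3F,   # bits 31-26
        "rs": (instr >> 21) & 0x1F,   # bits 25-21
        "rt": (instr >> 16) & 0x1F,   # bits 20-16
        "rd": (instr >> 11) & 0x1F,   # bits 15-11 (meaningful for R-type only)
        "imm": instr & 0xFFFF,        # bits 15-0  (meaningful for I-type only)
    }


ADD_R1_R2_R3 = 0x00430820  # add r1, r2, r3  (R-type)
LW_R1_0_R2 = 0x8C410000    # lw  r1, 0(r2)   (I-type)

for word in (ADD_R1_R2_R3, LW_R1_0_R2):
    fields = register_fields(word)
    print(f"{word:#010x}: rs={fields['rs']} rt={fields['rt']}")
```

In both words rs comes out as 2, even though one instruction is R-type and the other I-type.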
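[Figure: two-level control. Main Control takes the 6-bit op field and produces ALUop (N bits); the local ALU Control takes ALUop and the 6-bit func field and produces the 3-bit ALUctr for the ALU.]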
C. Kozyrakis EE 108b Lecture 9 10
Pipeline Control
• Not a problem
– Just pipeline the control signals along with the data
– Make sure they line up
[Figure: control signals (ExtOp, ALUSrc, ALUOp) carried through the IF/ID, ID/Ex, Ex/MEM and MEM/WB pipeline registers across the RF/ID, EX, MEM and WB stages.]
[Figure: pipelined datapath with control. The Control unit produces EX, M and WB signal groups that travel through the ID/EX, EX/MEM and MEM/WB registers. Signals shown: PCSrc, RegWrite, Branch, MemWrite, MemRead, ALUSrc, ALUOp, MemtoReg, RegDst. Instruction fields shown: [15-0] (sign-extended from 16 to 32 bits), [20-16], [15-11].]
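A minimal sketch of "pipeline the control signals along with the data", assuming the EX / M / WB grouping shown in the figure. The class and function names are hypothetical, and the signal values for a load are illustrative rather than taken from the slides. Each stage consumes its own group and passes the rest on, so the signals stay lined up with the instruction they belong to.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Control:
    ex: dict   # consumed in EX  (ALUSrc, ALUOp, RegDst)
    mem: dict  # consumed in MEM (Branch, MemRead, MemWrite)
    wb: dict   # consumed in WB  (RegWrite, MemtoReg)


def decode_lw() -> Control:
    # Illustrative control values for a load instruction.
    return Control(
        ex={"ALUSrc": 1, "ALUOp": 0, "RegDst": 0},
        mem={"Branch": 0, "MemRead": 1, "MemWrite": 0},
        wb={"RegWrite": 1, "MemtoReg": 1},
    )


class ControlPipeline:
    """Only the control fields of the ID/EX, EX/MEM and MEM/WB registers."""

    def __init__(self) -> None:
        self.id_ex: Optional[dict] = None
        self.ex_mem: Optional[dict] = None
        self.mem_wb: Optional[dict] = None

    def clock(self, decoded: Optional[Control]) -> None:
        # Update back to front so every register latches the value its
        # predecessor held before this clock edge.
        self.mem_wb = None if self.ex_mem is None else {"wb": self.ex_mem["wb"]}
        self.ex_mem = None if self.id_ex is None else {
            "mem": self.id_ex["mem"], "wb": self.id_ex["wb"]}
        self.id_ex = None if decoded is None else {
            "ex": decoded.ex, "mem": decoded.mem, "wb": decoded.wb}


pipe = ControlPipeline()
pipe.clock(decode_lw())  # lw leaves decode, control sits in ID/EX
pipe.clock(None)         # lw in MEM next cycle: EX group has been dropped
pipe.clock(None)         # lw reaches WB: only the WB group is left
print(pipe.mem_wb)
```

Dropping each group once it has been used mirrors how the pipeline registers in the figure get narrower toward the write-back stage.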
Pipeline Implementation:
[Figure: clock timing diagram over Cycle 1 to Cycle 10 showing a Load (IF, Reg, EX, MEM, WB) and an R-type (IF, Reg, EX, MEM, WB) in the pipeline.]
But Something Is Fishy Here
• If pipelining into 5 stages makes the clock faster, then dividing it into 10 parts would make the clock even faster
– And wouldn’t the CPI still be one?
• Structural hazard: a resource conflict
– Occurs when two instructions try to use the same hardware in the same cycle
– Often arises when some functional unit is not fully pipelined
• To avoid structural hazards:
– Use resource once per instruction
– Always in the same cycle, so this arrangement is bad:
• Load uses Register File’s Write port during its 5th stage
1 2 3 4 5
Load IF RF/ID EX MEM WB
• R-type uses Register File’s Write port during the 4th stage
1 2 3 4
R-type IF RF/ID EX WB
[Figure: several overlapped instructions (R-type, ALU, R-type), each with stages IF, RF/ID, EX, WB.]
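The write-port conflict above can be checked mechanically. The sketch below assumes one instruction is issued per cycle starting at cycle 1, and uses the write stages given above (5th for Load, 4th for R-type); the function name and the `write_stage` table are hypothetical.

```python
from collections import defaultdict

# Stage number (counted from the instruction's own IF) in which each kind of
# instruction uses the register file write port.
WRITE_STAGE = {"load": 5, "r-type": 4}


def write_port_conflicts(program, write_stage=WRITE_STAGE):
    """Return {cycle: [instruction indices]} for every cycle in which more
    than one instruction needs the single register file write port.

    program is a list of instruction kinds, issued one per cycle from cycle 1.
    """
    writers = defaultdict(list)
    for index, kind in enumerate(program):
        issue_cycle = index + 1
        write_cycle = issue_cycle + write_stage[kind] - 1
        writers[write_cycle].append(index)
    return {cycle: idxs for cycle, idxs in writers.items() if len(idxs) > 1}


# A Load followed by an R-type: both want the write port in cycle 5.
print(write_port_conflicts(["load", "r-type"]))

# If every instruction writes in the same (5th) stage, the conflict disappears.
uniform = {"load": 5, "r-type": 5}
print(write_port_conflicts(["load", "r-type"], uniform))
```

The second call corresponds to using the resource always in the same cycle, at the cost of an idle MEM stage for R-type instructions.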
• Follow directions
– Always use resources at the same time in each instruction
• Build more complex functional units
– For example support two writes into register file
• But then you would probably want to make use of this feature in
other ways, and you probably still would like to follow directions
…
• Delay offending instruction (queue up to use the unit)
– Used when units are not fully pipelined (mult, div)
– Hardware inserts a pipeline stall (bubble) that delays the
offending instruction
– Increases CPI from the ideal value of 1
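The effect of stalls on CPI can be sketched as a one-line calculation, assuming stalls are the only source of lost cycles. The function name is hypothetical and the frequencies in the example are made-up illustrative numbers, not measurements.

```python
def effective_cpi(stall_events, ideal_cpi=1.0):
    """CPI = ideal CPI + average stall cycles per instruction.

    stall_events is an iterable of (frequency, stall_cycles) pairs, where
    frequency is the fraction of instructions that incur that stall.
    """
    return ideal_cpi + sum(freq * cycles for freq, cycles in stall_events)


# Illustrative only: 20% of instructions stall for 1 cycle,
# and 5% stall for 3 cycles.
print(effective_cpi([(0.20, 1), (0.05, 3)]))
```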
[Figure: pipeline diagram (Im, Reg, ALU, Dm, Reg for each instruction) for the sequence below, in program order.]
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or r8, r1, r9
xor r10, r1, r11
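The dependences in this sequence can be found with a short sketch. It assumes each instruction is represented as a hypothetical (opcode, destination, sources) tuple, and it pairs every source register with its most recent earlier writer. Whether a given dependence actually becomes a hazard depends on how far apart the two instructions are in the pipeline.

```python
PROGRAM = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r1", "r3")),
    ("and", "r6", ("r1", "r7")),
    ("or", "r8", ("r1", "r9")),
    ("xor", "r10", ("r1", "r11")),
]


def raw_dependences(program):
    """Return (producer index, consumer index, register) triples for every
    read of a register that an earlier instruction in the list writes."""
    last_writer = {}
    deps = []
    for index, (_, dest, sources) in enumerate(program):
        for src in sources:
            if src in last_writer:
                deps.append((last_writer[src], index, src))
        last_writer[dest] = index
    return deps


for producer, consumer, reg in raw_dependences(PROGRAM):
    print(f"{PROGRAM[consumer][0]} reads {reg} written by {PROGRAM[producer][0]}")
```

All four instructions after the add depend on r1.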
Data Hazard Solution
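[Figure: pipeline diagram for add r1, r2, r3; sub r4, r1, r3; and r6, r1, r7; or r8, r1, r9; xor r10, r1, r11, resolved by stalling. The sub waits through three bubbles after its fetch (Im, bubble, bubble, bubble, Reg, Dm, Reg), and the following instructions are delayed behind it.]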
Performance Effect
[Figure: pipeline diagram for add r1, r2, r3; sub r4, r1, r3; and r6, r1, r7; or r8, r1, r9; xor r10, r1, r11, with the sub held for two bubbles (Im, bubble, bubble, Reg, Dm, Reg) and the following instructions delayed behind it.]
Performance Effect
• But you really have the value in the machine at the end of the ALU stage
– If you can use this value, the stall for ALU operations is zero!
• Fastest, but requires more hardware (called forwarding)
[Figure: pipeline diagram for add r1, r2, r3; sub r4, r1, r3; and r6, r1, r7; or r8, r1, r9; xor r10, r1, r11 with forwarding and no bubbles.]
Forwarding Limitations
• With forwarding
– No ALU-to-ALU delay
– 1 cycle load-to-ALU delay
[Figure: pipeline diagram for lw r1, 0(r2); sub r4, r1, r6; and r6, r1, r7; or r8, r1, r9 with no bubbles, in which the sub needs r1 before the load has produced it.]
[Figure: pipeline diagram for lw r1, 0(r2); sub r4, r1, r3; and r6, r1, r7; or r8, r1, r9 with one bubble inserted after the load, delaying the sub and the instructions behind it by one cycle.]
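The 1 cycle load-to-ALU delay means the hardware has to detect a load followed immediately by a consumer and insert the bubble. A minimal sketch of that check follows, assuming the usual textbook condition: the instruction in EX is a load and its destination matches a source of the instruction being decoded. The function and parameter names are hypothetical. Register 0 is excluded because it always reads as zero in MIPS.

```python
def must_stall(ex_is_load: bool, ex_dest: int, id_sources: tuple) -> bool:
    """True if the instruction in decode must wait one cycle for a load.

    ex_is_load: the instruction currently in EX reads memory (MemRead set)
    ex_dest:    register number that instruction will write
    id_sources: register numbers read by the instruction in decode
    """
    return ex_is_load and ex_dest != 0 and ex_dest in id_sources


# lw r1, 0(r2) followed by sub r4, r1, r6: one bubble is needed.
print(must_stall(True, 1, (1, 6)))

# add r1, r2, r3 followed by sub r4, r1, r3: forwarding covers it, no stall.
print(must_stall(False, 1, (1, 3)))
```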
[Figure: pipeline organization with stages IF (Instruction Fetch), RF (Register Fetch), EX (Execution), MEM (Memory), WB (Write back); the datapath shows a separate Add unit in addition to the ALU, Instruction Memory and Data Memory.]
[Figure: pipeline diagram for lw r1, 0(r2); sub r4, r1, r6; and r6, r1, r7; or r8, r1, r9.]
• Old pipeline:
lw r4, 8(r1)      IF RF EX MEM WB
(slot)               IF RF EX MEM WB
add r2, r4, r1          IF RF EX MEM
• Old pipeline:
add r1, r2, r3    IF RF EX MEM WB
lw r4, 8(r1)         IF RF EX MEM WB
• Which is better?
– Depends on which pair of instructions happens more often
• Datapath
– Need to add multiplexers to functional units
– Source operand of a functional unit could come from
• Register file
• Memory
• ALU of last cycle
• ALU from two cycles ago
– Adding this mux increases the critical path of design
• Needs to be designed carefully
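The select logic for such a multiplexer can be sketched as follows. This is a minimal illustration, not the design from the slides: it assumes the standard five-stage arrangement in which the previous instruction's ALU result sits in the EX/MEM register, while both the ALU result from two cycles ago and data loaded from memory sit in the MEM/WB register. All names are hypothetical.

```python
from enum import Enum


class Source(Enum):
    REGISTER_FILE = 0  # no hazard: use the value read in the RF stage
    EX_MEM = 1         # ALU result of the previous instruction
    MEM_WB = 2         # ALU result from two cycles ago, or loaded data


def forward_select(src_reg, ex_mem_writes, ex_mem_dest, mem_wb_writes, mem_wb_dest):
    """Choose where one ALU operand should come from.

    The most recent producer wins, so EX/MEM is checked before MEM/WB.
    Register 0 is never forwarded because it always reads as zero.
    """
    if ex_mem_writes and ex_mem_dest != 0 and ex_mem_dest == src_reg:
        return Source.EX_MEM
    if mem_wb_writes and mem_wb_dest != 0 and mem_wb_dest == src_reg:
        return Source.MEM_WB
    return Source.REGISTER_FILE


# sub r4, r1, r3 right after add r1, r2, r3: r1 comes from EX/MEM.
print(forward_select(1, True, 1, False, 0))

# and r6, r1, r7 two instructions after the add: r1 comes from MEM/WB.
print(forward_select(1, True, 4, True, 1))
```

Each comparison feeding this select sits in front of the ALU input, which is why the extra multiplexer lengthens the critical path.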
[Figure: forwarding multiplexer (MuxB) at an ALU input, with inputs from the ALU and Data Memory.]
[Figure: pipeline diagram with no bubbles for the sequence below, where an unrelated instruction is placed between the load and its first use.]
lw r1, 0(r2)
unrelated instruction
sub r4, r1, r3
and r6, r1, r7
or r8, r1, r9
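The benefit of placing an unrelated instruction after the load can be checked with a small count. The sketch assumes full forwarding, so the only remaining bubble is a load followed immediately by an instruction that reads its result. Instructions are hypothetical (opcode, destination, sources) tuples, and the `add r10, r11, r12` standing in for the unrelated instruction is a made-up placeholder.

```python
def load_use_stalls(program):
    """Count bubbles in a sequence, assuming forwarding handles every case
    except a load immediately followed by a reader of the loaded register."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


LOAD = ("lw", "r1", ("r2",))
USERS = [
    ("sub", "r4", ("r1", "r3")),
    ("and", "r6", ("r1", "r7")),
    ("or", "r8", ("r1", "r9")),
]
UNRELATED = ("add", "r10", ("r11", "r12"))  # hypothetical filler instruction

print(load_use_stalls([LOAD] + USERS))               # load then immediate use
print(load_use_stalls([LOAD, UNRELATED] + USERS))    # filler hides the delay
```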
Control Hazard