l06 Pipeline PDF
Pipeline Hazards
Arvind M.I.T.
Technology Assumptions
- A small amount of very fast memory (caches) backed up by a large, slower memory
- Fast ALU (at least for integers)
- Multiported register files (slower!)
A 5-stage pipelined Harvard architecture will be the focus of our detailed design
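[Figure: 5-stage pipeline datapath. Labels: PC, Inst. Memory (addr, rdata), IR, ALU, Data Memory; stages include I-Fetch (IF) and Memory (MA). Accompanying timing diagram of instructions flowing through IF, ID, EX, MA, WB over time.]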
[Figure: the same datapath with a resource-usage diagram showing which instruction (I3, I4, I5, ...) occupies each of IF, ID, EX, MA, WB at each time step.]
[Figure: pipelined datapath. Labels: PC, Inst Memory (addr, inst), IR, GPRs, Imm Ext, A, ALU, Data Memory (we, addr, wdata, rdata).]
Not quite correct! We need an Instruction Reg (IR) for each stage
September 28, 2005
[Figure: pipelined datapath with an IR in each stage. Labels: PC, Inst Memory (addr, inst), IR, GPRs, Imm Ext, A, ALU, Data Memory (we, addr, wdata, rdata).]
[Figure: pipelined datapath with control signals MemWrite, WBSrc, ExtSel, and BSrc. Other labels: PC, Inst Memory, IR, GPRs, Imm Ext, Y, Data Memory.]
In the extreme case, an instruction may determine the next instruction to be executed: a control hazard (branches, interrupts, ...).
Data Hazards
[Figure: pipelined datapath with two instructions in flight: one that writes r1 and, behind it, one that reads r1 to compute r4.]

...
r1 ← r0 + 10
r4 ← r1 + 17
...
r1 is stale. Oops!
[Figure: four pipeline stages (stage 1 to stage 4) with feedback from later stages to earlier ones.]
Detect a hazard and provide feedback to previous stages to stall or kill instructions.
Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur).
[Figure: pipelined datapath in which a nop can be inserted behind the decode stage. Labels: PC, Add 0x4, Inst Memory, IR, GPRs, Imm Ext, A, ALU, MD1, MD2, Data Memory (we, addr, wdata, rdata).]

...
r1 ← r0 + 10
r4 ← r1 + 17
...
[Timing diagram (only the tail survives): I2 as EX2 MA2 WB2; I3 as ID3 EX3 MA3 WB3; I4 as IF4 ID4 EX4 MA4 WB4; I5 as IF5 ID5 EX5 MA5 WB5.]

Resource Usage (each nop is a pipeline bubble)

time  IF   ID   EX   MA   WB
t0    I1
t1    I2   I1
t2    I3   I2   I1
t3    I3   I2   nop  I1
t4    I3   I2   nop  nop  I1
t5    I3   I2   nop  nop  nop
t6    I4   I3   I2   nop  nop
t7    I5   I4   I3   I2   nop
....
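The resource-usage pattern can be reproduced with a small simulation. This is only a sketch under stated assumptions, not the lecture's hardware: the tuple encoding of instructions, the `simulate` function, and the rule "hold the instruction in ID while any instruction in EX, MA, or WB writes one of its sources" are illustrative choices, and I3 to I5 are assumed to have no register dependences.

```python
NOP = ("nop", None, ())  # a bubble: writes nothing, reads nothing

def simulate(program, cycles):
    """Return, per cycle, the names occupying [IF, ID, EX, MA, WB].

    program is a list of (name, dest_register, source_registers).
    """
    def fetch(i):
        return program[i] if i < len(program) else None

    pipe = [fetch(0), None, None, None, None]
    next_pc = 1
    rows = []
    for _ in range(cycles):
        rows.append([ins[0] if ins else "" for ins in pipe])
        in_if, in_id, in_ex, in_ma, in_wb = pipe
        # Destinations of older instructions that have not yet left the pipe.
        pending = {ins[1] for ins in (in_ex, in_ma, in_wb) if ins and ins[1]}
        stall = in_id is not None and any(src in pending for src in in_id[2])
        if stall:
            # IF and ID hold their instructions; a bubble enters EX.
            pipe = [in_if, in_id, NOP, in_ex, in_ma]
        else:
            pipe = [fetch(next_pc), in_if, in_id, in_ex, in_ma]
            next_pc += 1
    return rows

program = [
    ("I1", "r1", ("r0",)),   # r1 <- r0 + 10
    ("I2", "r4", ("r1",)),   # r4 <- r1 + 17
    ("I3", None, ()),        # assumed independent
    ("I4", None, ()),
    ("I5", None, ()),
]
for t, row in enumerate(simulate(program, 8)):
    print(f"t{t}", *[name or "-" for name in row])
```

Each printed row lists the occupants of IF, ID, EX, MA, WB for one cycle, with "-" for an empty stage.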
[Figure: pipelined datapath with a nop insertable behind the decode stage. Labels: PC, Add 0x4, Inst Memory, IR, GPRs, Imm Ext, A, ALU, Data Memory (we, addr, wdata, rdata).]
Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.
[Figure: pipelined datapath with the control blocks Cdest and Cre attached to the stage IRs; Cdest supplies ws.]
Should we always stall if the rs field matches some rd?
- not every instruction writes a register, hence a write-enable signal we
- not every instruction reads a register, hence a read-enable signal re
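[Figure: instruction formats; the jump format carries a 26-bit immediate (immediate26).]

Instruction   Semantics                                               Source(s)  Destination
ALU           rd ← (rs) func (rt)                                     rs, rt     rd
ALUi          rt ← (rs) op imm                                        rs         rt
Load          rt ← M[(rs) + imm]                                      rs         rt
Store         M[(rs) + imm] ← (rt)                                    rs, rt
Branch        cond (rs): true: PC ← (PC) + imm; false: PC ← (PC) + 4  rs
J             PC ← (PC) + imm
JAL           r31 ← (PC), PC ← (PC) + imm                                        r31
JR            PC ← (rs)                                               rs
JALR          r31 ← (PC), PC ← (rs)                                   rs         r31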
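The register-transfer semantics above determine which register fields each instruction class reads and writes. The following is a minimal sketch of how the enables could be derived; the class names (`LOAD`, `STORE`, `BRANCH`), the table `REG_USAGE`, and the function `decode_enables` are hypothetical, and any special treatment of register 0 is deliberately left out.

```python
# class -> (reads rs, reads rt, which field names the destination)
REG_USAGE = {
    "ALU":    (True,  True,  "rd"),
    "ALUi":   (True,  False, "rt"),
    "LOAD":   (True,  False, "rt"),
    "STORE":  (True,  True,  None),
    "BRANCH": (True,  False, None),
    "J":      (False, False, None),
    "JAL":    (False, False, "r31"),
    "JR":     (True,  False, None),
    "JALR":   (True,  False, "r31"),
}

def decode_enables(opcode, rs, rt, rd):
    """Return read enables, write enable, and write specifier for one instruction."""
    re1, re2, dest = REG_USAGE[opcode]
    if dest == "rd":
        ws = rd
    elif dest == "rt":
        ws = rt
    elif dest == "r31":
        ws = 31
    else:
        ws = None  # this class writes no register
    return {"re1": re1, "re2": re2, "we": ws is not None, "ws": ws}

print(decode_enables("ALUi", rs=0, rt=1, rd=0))
print(decode_enables("STORE", rs=1, rt=2, rd=0))
```

A store reads both rs and rt but writes nothing, so it can cause a stall as a consumer but never as a producer.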
Cstall:

stall = ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D

This is not the full story!
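The stall condition translates directly into a Boolean function. The sketch below assumes a hypothetical `Stage` record carrying the write specifier `ws` and write enable `we` of the instruction in EX, MA, or WB; the names are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    ws: int = 0       # destination register of the instruction in this stage
    we: bool = False  # True if that instruction writes a register

def stall(rs_d, rt_d, re1_d, re2_d, ex, ma, wb):
    """Stall the decode stage if a needed source is written by an uncommitted instruction."""
    def pending(reg):
        # (reg = wsE).weE + (reg = wsM).weM + (reg = wsW).weW
        return any(s.we and s.ws == reg for s in (ex, ma, wb))
    return (re1_d and pending(rs_d)) or (re2_d and pending(rt_d))

# r4 <- r1 + 17 in decode while r1 <- r0 + 10 is in execute
idle = Stage()
print(stall(rs_d=1, rt_d=4, re1_d=True, re2_d=False,
            ex=Stage(ws=1, we=True), ma=idle, wb=idle))
```

With `re2_d` false, a coincidental match on the rt field does not cause a stall, which is the point of the read-enable terms.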
[Figure: pipelined datapath. Labels: PC, Add 0x4, Inst Memory, IR, GPRs, Imm Ext, A, ALU, MD1, MD2, Data Memory (we, addr, wdata, rdata).]
However, the hazard is avoided because our memory system completes writes in one cycle!
Load/Store hazards, even when they do exist, are often resolved in the memory system itself. More on this later in the course.
[Figure: fetch and decode stages with Jump? detection. Labels: stall, PC (104), Add 0x4, Inst Memory, nop, IR holding I2 and IR holding I1; instruction sequence I1, I2, I3, I4.]
Pipelining Jumps
PCSrc (pc+4 / jabs / rind / br)

[Figure: fetch and decode stages with the PCSrc and IRSrcD multiplexers and Jump? detection; IRSrcD can place a nop in the decode-stage IR. PC values shown: 104 and 304; instruction sequence I1, I2, I3, I4.]
Resource Usage (the nop is a pipeline bubble)

time  IF   ID   EX   MA   WB
t0    I1
t1    I2   I1
t2    I3   I2   I1
t3    I4   nop  I2   I1
t4    I5   I4   nop  I2   I1
....
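[Figure: fetch, decode, and execute stages with BEQZ? detection and a zero? test on the A operand. Labels: stall, PC (104), Add 0x4, IRSrcD, nop, IR holding I2, IR holding I1, ALU, Inst Memory; instruction sequence I1, I2, I3, I4.]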
The branch condition is not known until the execute stage. What action should be taken in the decode stage?
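[Figure: the same stages one cycle later, with a question mark on the stall signal. Labels: BEQZ?, zero?, nop, IRs holding I3, I2, and I1, ALU, Inst Memory; instruction sequence I1, I2, I3, I4.]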
Instruction sequence: I1 ADD; I2 BEQZ r1 200; I3 ADD; I4 ADD

If the branch is taken:
- kill the two following instructions
- the instruction at the decode stage is not valid, so the stall signal is not valid
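[Figure: fetch, decode, and execute stages with the IRSrcD and IRSrcE multiplexers, Jump? and BEQZ? detection, and the zero? test. Labels: stall, PC (108), Add 0x4, nop, IRs holding I3, I2, and I1, ALU, Inst Memory; instruction sequence I1, I2, I3, I4.]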
Don't stall if the branch is taken. Why? The instruction at the decode stage is invalid.
IRSrcD = Case opcodeE
           BEQZ.z, BNEZ.!z  ⇒ nop
           ...              ⇒ Case opcodeD
                                J, JAL, JR, JALR  ⇒ nop
                                ...               ⇒ IM

IRSrcE = Case opcodeE
           BEQZ.z, BNEZ.!z  ⇒ nop
           ...              ⇒ stall.nop + !stall.IRD
Give priority to the older instruction, i.e., execute stage instruction over decode stage instruction
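The case equations for the IR multiplexers, including the priority of the execute-stage instruction, can be written out as ordinary conditionals. This is a sketch only: the function names and the string-valued mux selections (`"nop"`, `"IM"`, `"IRD"`) are illustrative, not the lecture's signal encoding.

```python
def branch_taken(opcode_e, z):
    """BEQZ.z or BNEZ.!z, evaluated for the instruction in the execute stage."""
    return (opcode_e == "BEQZ" and z) or (opcode_e == "BNEZ" and not z)

def ir_src_d(opcode_e, z, opcode_d):
    """Select what is loaded into the decode-stage IR."""
    if branch_taken(opcode_e, z):      # older instruction wins
        return "nop"
    if opcode_d in ("J", "JAL", "JR", "JALR"):
        return "nop"                   # kill the instruction fetched after a jump
    return "IM"                        # normal case: take the fetched instruction

def ir_src_e(opcode_e, z, stall):
    """Select what is loaded into the execute-stage IR."""
    if branch_taken(opcode_e, z):
        return "nop"                   # decode-stage instruction is invalid
    return "nop" if stall else "IRD"

# A taken BEQZ in execute overrides a jump sitting in decode.
print(ir_src_d("BEQZ", True, "J"), ir_src_e("BEQZ", True, stall=True))
```

Checking the execute-stage branch first is what gives the older instruction priority over the one in decode.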
[Timing diagram (only the tail survives): WB2; nop entries for the killed instructions; I5 as ID5 EX5 MA5 WB5.]

Resource Usage

time  IF   ID   EX   MA   WB
t0    I1
t1    I2   I1
t2    I3   I2   I1
t3    I4   I3   I2   I1
t4    I5   nop  nop  I2   I1
....
One pipeline bubble can be removed if an extra comparator is used in the Decode stage
PCSrc (pc+4 / jabs / rind / br)

[Figure: fetch and decode stages with the PCSrc multiplexer. Labels: PC, Add 0x4, Add, Inst Memory, nop, IR (D), GPRs; instruction sequence I1, I2, I3, I4.]
Other techniques include branch prediction, which can dramatically reduce the branch penalty (to come later).
Bypassing
time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← r0 + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17          IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3
(I4)                                                IF4  ID4  EX4
(I5)                                                     IF5  ID5

(The repeated ID2 and IF3 entries are the stalled stages.)
Each stall or kill introduces a bubble in the pipeline, so CPI > 1.
A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input.
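To make the CPI claim concrete, the following sketch counts one extra cycle per bubble. It assumes a steady-state view that ignores pipeline fill and drain; the function name `cpi` is illustrative.

```python
def cpi(instructions, bubbles):
    """Cycles per instruction when every bubble costs one issue slot."""
    return (instructions + bubbles) / instructions

# In the stalled example, I2 waits three extra cycles in ID,
# so five instructions occupy eight issue slots.
print(cpi(instructions=5, bubbles=3))
print(cpi(instructions=5, bubbles=0))  # with the bypass, no bubbles
```

Removing the three bubbles brings this short sequence back to one instruction per cycle.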
time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← r0 + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17          IF2  ID2  EX2  MA2  WB2
(I3)                            IF3  ID3  EX3  MA3  WB3
(I4)                                 IF4  ID4  EX4  MA4  WB4
(I5)                                      IF5  ID5  EX5  MA5  WB5
Adding a Bypass
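[Figure: pipelined datapath with a bypass from the ALU output to the A input, selected by the ASrc multiplexer. In flight: r4 ← r1 ... behind r1 ← .... Other labels: stall, PC, Add 0x4, nop, Inst Memory, IR, GPRs, Imm Ext, ALU, Data Memory.]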
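When does this bypass help?
- r1 ← M[r0 + 10]; r4 ← r1 + 17: no
- JAL 500; r4 ← r31 + 17: no

In both cases the value the second instruction needs (the loaded data, or the return address written to r31) is not the ALU result of the first.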
ASrc = (rsD=wsE).weE.re1D
Is this correct?
No, because only ALU and ALUi instructions can benefit from this bypass.
Split weE into two components: we-bypass and we-stall.
ASrc  = (rsD = wsE).we-bypassE.re1D

stall = ((rsD = wsE).we-stallE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D
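The split of weE into we-bypass and we-stall can be sketched as follows. The `Stage` record, the opcode strings, and the function names are hypothetical; the only facts taken from the slides are that ALU and ALUi results are bypassable and that only the rs (A) operand has a bypass at this point.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    opcode: str = "nop"
    ws: int = 0
    we: bool = False

def split_we(ex):
    """Return (we_bypass, we_stall) for the instruction in the execute stage."""
    bypassable = ex.opcode in ("ALU", "ALUi")
    return ex.we and bypassable, ex.we and not bypassable

def control(rs_d, rt_d, re1_d, re2_d, ex, ma, wb):
    """Return (ASrc, stall) for the instruction in the decode stage."""
    we_bypass, we_stall = split_we(ex)
    a_src = rs_d == ex.ws and we_bypass and re1_d
    stall_rs = ((rs_d == ex.ws and we_stall)
                or (rs_d == ma.ws and ma.we)
                or (rs_d == wb.ws and wb.we)) and re1_d
    # No bypass exists for rt here, so the full weE is used.
    stall_rt = ((rt_d == ex.ws and ex.we)
                or (rt_d == ma.ws and ma.we)
                or (rt_d == wb.ws and wb.we)) and re2_d
    return a_src, stall_rs or stall_rt

idle = Stage()
# r1 <- r0 + 10 in EX, r4 <- r1 + 17 in ID: bypass, no stall
print(control(1, 4, True, False, Stage("ALUi", 1, True), idle, idle))
# r1 <- M[r0 + 10] in EX, r4 <- r1 + 17 in ID: no bypass, stall
print(control(1, 4, True, False, Stage("LOAD", 1, True), idle, idle))
```

The first call selects the bypass and avoids the stall; the second cannot use the ALU output and must still stall.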
[Figure: datapath with bypass multiplexers ASrc and BSrc. Register file ports: rs1, rs2, rd1, rd2, ws, wd, we. Other labels: PC, Add 0x4, nop, Inst Memory, IR, GPRs, Imm Ext, A, ALU, Data Memory (we, addr, wdata, rdata).]
Machines with software-visible delay slots may execute a significant number of NOP instructions inserted by the compiler.
Thank you!