
Pipeline Hazards

Computer Science and Artificial Intelligence Laboratory

Arvind M.I.T.

Based on the material prepared by Arvind and Krste Asanovic

6.823 L6- 2 Arvind

Technology Assumptions
- A small amount of very fast memory (caches) backed up by a large, slower memory
- Fast ALU (at least for integers)
- Multiported register files (slower!)

These make the following timing assumption valid:

t_IM ≈ t_RF ≈ t_ALU ≈ t_DM ≈ t_RW

A 5-stage pipelined Harvard architecture will be the focus of our detailed design.

September 28, 2005

5-Stage Pipelined Execution

[Figure: 5-stage datapath. I-Fetch (IF): PC, 0x4 adder, instruction memory. Decode, Reg. Fetch (ID): IR, GPRs, immediate extender. Execute (EX): ALU. Memory (MA): data memory. Write-Back (WB).]

time          t0   t1   t2   t3   t4   t5   t6   t7   ....
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5
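The diagram above follows a simple rule: instruction i enters IF one cycle after instruction i-1 and then advances one stage per cycle. A minimal Python sketch of that rule is given here; the function names `pipeline_schedule` and `render` are illustrative and not part of the lecture, and the sketch assumes an ideal pipeline with no stalls or kills.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def pipeline_schedule(num_instructions):
    """Return {instruction number: {cycle: label}} for an ideal pipeline.

    Instruction i (1-based) is fetched in cycle i-1 and moves one stage
    per cycle, so it occupies stage number s in cycle (i-1) + s.
    """
    schedule = {}
    for i in range(1, num_instructions + 1):
        schedule[i] = {(i - 1) + s: f"{stage}{i}" for s, stage in enumerate(STAGES)}
    return schedule

def render(schedule):
    """Format the schedule as a plain-text pipeline diagram."""
    last_cycle = max(max(row) for row in schedule.values())
    lines = ["time  " + " ".join(f"t{c:<3}" for c in range(last_cycle + 1))]
    for i, row in schedule.items():
        cells = " ".join(f"{row.get(c, ''):<4}" for c in range(last_cycle + 1))
        lines.append(f"I{i:<4} {cells}".rstrip())
    return "\n".join(lines)

if __name__ == "__main__":
    print(render(pipeline_schedule(5)))
```

With five instructions the last write-back lands in cycle t8, so n instructions take n + 4 cycles in the ideal case.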

5-Stage Pipelined Execution: Resource Usage Diagram

[Figure: the same 5-stage datapath (IF, ID, EX, MA, WB).]

Resources

time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   I4   I5
EX              I1   I2   I3   I4   I5
MA                   I1   I2   I3   I4   I5
WB                        I1   I2   I3   I4   I5

Pipelined Execution: ALU Instructions


[Figure: pipelined datapath with pipeline registers between the stages: PC, 0x4 adder, instruction memory, IR, GPRs, immediate extender, register A, ALU, registers MD1 and MD2, and data memory.]

Not quite correct! We need an Instruction Reg (IR) for each stage.

IRs and Control points

[Figure: the pipelined datapath with an IR in each stage.]

Are control points connected properly?

- ALU instructions
- Load/Store instructions
- Write back

Pipelined MIPS Datapath (without jumps)

[Figure: pipelined MIPS datapath with stages F, D, E, M, W, an IR per stage, and the control signals RegDst, RegWrite, ExtSel, OpSel, BSrc, MemWrite and WBSrc.]

How Instructions can Interact with each other in a pipeline


- An instruction in the pipeline may need a resource being used by another instruction in the pipeline ⇒ structural hazard
- An instruction may produce data that is needed by a later instruction ⇒ data hazard
- In the extreme case, an instruction may determine the next instruction to be executed ⇒ control hazard (branches, interrupts, ...)

Data Hazards

[Figure: pipelined datapath annotated with "r4 ← r1 ..." and "r1 ← ..." on the instruction registers of two successive stages.]

...
r1 ← r0 + 10
r4 ← r1 + 17
...

r1 is stale. Oops!

Resolving Data Hazards

- Freeze earlier pipeline stages until the data becomes available ⇒ interlocks
- If data is available somewhere in the datapath, provide a bypass to get it to the right stage
- Speculate about the hazard resolution and kill the instruction later if the speculation is wrong

Feedback to Resolve Hazards

[Figure: four pipeline stages (stage 1 to stage 4) with feedback signals FB1 to FB4.]

- Detect a hazard and provide feedback to previous stages to stall or kill instructions.
- Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur).

Interlocks to resolve Data Hazards


[Figure: pipelined datapath with a Stall Condition signal and a nop input to the instruction register.]

...
r1 ← r0 + 10
r4 ← r1 + 17
...

Stalled Stages and Pipeline Bubbles

time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10   IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17        IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                IF4  ID4  EX4  MA4  WB4
(I5)                                                     IF5  ID5  EX5  MA5  WB5

The repeated ID2 and IF3 entries are the stalled stages.

Resource Usage

time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I3   I3   I3   I4   I5
ID         I1   I2   I2   I2   I2   I3   I4   I5
EX              I1   nop  nop  nop  I2   I3   I4   I5
MA                   I1   nop  nop  nop  I2   I3   I4   I5
WB                        I1   nop  nop  nop  I2   I3   I4   I5

nop ⇒ pipeline bubble
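The resource usage table can be reproduced with a small cycle-by-cycle model. The following Python sketch is an illustration, not the lecture's hardware: it assumes no bypassing, stalls whenever the decode-stage instruction reads a register that an instruction in EX, MA or WB will write, and treats I3 to I5 (unspecified on the slide) as independent instructions. The name `simulate` and the `(dest, sources)` encoding are made up for this example.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]
NOP = "nop"

def simulate(program):
    """Cycle-by-cycle occupancy of an interlocked 5-stage pipeline.

    program is a list of (dest, sources) register-name tuples.
    Returns one dict per cycle mapping stage -> instruction index,
    NOP (a bubble), or None (stage not in use).
    """
    pipe = dict.fromkeys(STAGES)
    pipe["IF"] = 0 if program else None
    next_fetch = 1
    trace = []
    while any(v is not None for v in pipe.values()):
        trace.append(dict(pipe))
        # Stall if the decode-stage instruction reads a register that an
        # older instruction still in EX, MA or WB is going to write.
        stall = False
        if isinstance(pipe["ID"], int):
            sources = program[pipe["ID"]][1]
            for stage in ("EX", "MA", "WB"):
                older = pipe[stage]
                if isinstance(older, int) and program[older][0] in sources:
                    stall = True
        new_if = pipe["IF"]
        if not stall:
            new_if = next_fetch if next_fetch < len(program) else None
            next_fetch += 1
        pipe = {
            "WB": pipe["MA"],
            "MA": pipe["EX"],
            "EX": NOP if stall else pipe["ID"],   # a stall injects a bubble
            "ID": pipe["ID"] if stall else pipe["IF"],
            "IF": new_if,
        }
    return trace

if __name__ == "__main__":
    # I1: r1 <- r0 + 10, I2: r4 <- r1 + 17, I3..I5 assumed independent.
    program = [("r1", ["r0"]), ("r4", ["r1"]), (None, []), (None, []), (None, [])]
    for t, row in enumerate(simulate(program)):
        cells = []
        for s in STAGES:
            v = row[s]
            cells.append(f"I{v + 1}" if isinstance(v, int) else (v or "-"))
        print(f"t{t:<3}", " ".join(f"{c:<4}" for c in cells))
```

Indices in the trace are 0-based, so index 1 is I2. The model ignores the special case of register r0, which the stall equations later in the lecture handle through the `we` signal.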

Interlock Control Logic

[Figure: pipelined datapath with a stall-control block Cstall that takes rs and rt from the decode stage and ws from a later stage and produces the stall signal.]

Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.

Interlocks Control Logic (ignoring jumps & branches)

[Figure: pipelined datapath in which Cstall takes rs, rt, re1 and re2 from the decode stage (re1 and re2 come from Cre) and ws and we from a Cdest block in each later stage.]

Should we always stall if the rs field matches some rd?

- Not every instruction writes a register ⇒ we
- Not every instruction reads a register ⇒ re

Source & Destination Registers

R-type:  op | rs | rt | rd | func
I-type:  op | rs | rt | immediate16
J-type:  op | immediate26

instruction  semantics                                  source(s)  destination
ALU          rd ← (rs) func (rt)                        rs, rt     rd
ALUi         rt ← (rs) op imm                           rs         rt
LW           rt ← M[(rs) + imm]                         rs         rt
SW           M[(rs) + imm] ← (rt)                       rs, rt
BZ           cond (rs)                                  rs
               true:  PC ← (PC) + imm
               false: PC ← (PC) + 4
J            PC ← (PC) + imm
JAL          r31 ← (PC), PC ← (PC) + imm                           31
JR           PC ← (rs)                                  rs
JALR         r31 ← (PC), PC ← (rs)                      rs         31

Deriving the Stall Signal

Cdest

ws = Case opcode
       ALU        ⇒ rd
       ALUi, LW   ⇒ rt
       JAL, JALR  ⇒ R31

we = Case opcode
       ALU, ALUi, LW  ⇒ (ws ≠ 0)
       JAL, JALR      ⇒ on
       ...            ⇒ off

Cre

re1 = Case opcode
        ALU, ALUi, LW, SW, BZ, JR, JALR  ⇒ on
        J, JAL                           ⇒ off

re2 = Case opcode
        ALU, SW  ⇒ on
        ...      ⇒ off

Cstall

stall = ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW) . re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW) . re2D

This is not the full story!
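The Cdest, Cre and Cstall equations translate almost line for line into code. The following Python sketch is one possible transcription; the `Instr` record and the use of `None` for an empty stage are assumptions made for the example, and jumps and branches are ignored as on the slide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instr:
    opcode: str   # "ALU", "ALUi", "LW", "SW", "BZ", "J", "JAL", "JR", "JALR"
    rs: int = 0
    rt: int = 0
    rd: int = 0

def ws(i: Instr) -> Optional[int]:
    """Cdest: which register the instruction writes, if any."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return None

def we(i: Optional[Instr]) -> bool:
    """Cdest: write enable. Writes to register 0 never cause a hazard."""
    if i is None:                      # empty stage or bubble
        return False
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")

def re1(i: Instr) -> bool:
    """Cre: does the instruction read rs?"""
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")

def re2(i: Instr) -> bool:
    """Cre: does the instruction read rt?"""
    return i.opcode in ("ALU", "SW")

def stall(d: Instr, e: Optional[Instr], m: Optional[Instr], w: Optional[Instr]) -> bool:
    """Cstall: compare decode-stage sources with E, M and W destinations."""
    later = (e, m, w)
    rs_hit = any(we(x) and d.rs == ws(x) for x in later)
    rt_hit = any(we(x) and d.rt == ws(x) for x in later)
    return (rs_hit and re1(d)) or (rt_hit and re2(d))

if __name__ == "__main__":
    i1 = Instr("ALUi", rs=0, rt=1)    # r1 <- r0 + 10
    i2 = Instr("ALUi", rs=1, rt=4)    # r4 <- r1 + 17
    print(stall(i2, i1, None, None))  # I1 in execute while I2 is in decode
```

The `re1` and `re2` terms matter because the rs and rt bit fields exist in every encoding: a J instruction whose bits happen to match a pending destination must not stall.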

Hazards due to Loads & Stores

[Figure: pipelined datapath with the Stall Condition signal and a nop input to the instruction register.]

...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...

What if (r1)+7 = (r3)+5?

Is there any possible data hazard in this instruction sequence?

Load & Store Hazards


...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...

(r1)+7 = (r3)+5 ⇒ data hazard

However, the hazard is avoided because our memory system completes writes in one cycle!

Load/Store hazards, even when they do exist, are often resolved in the memory system itself. More on this later in the course.


Five-minute break to stretch your legs

Complications due to Jumps

[Figure: fetch and decode stages with the PCSrc mux (pc+4 / jabs / rind / br), the stall signal, a "Jump?" check and a nop input; PC = 104, with I2 and I1 in successive instruction registers.]

I1  096  ADD
I2  100  J 200
I3  104  ADD    (kill)
I4  304  ADD

Note: fetching the next instruction before decode is speculation.

A jump instruction kills (not stalls) the following instruction. How?

Pipelining Jumps

To kill a fetched instruction, insert a mux before IR.

[Figure: fetch and decode stages with the PCSrc mux (pc+4 / jabs / rind / br), the stall signal, and an IRSrcD mux, driven by the "Jump?" check, that selects between the instruction memory output and nop.]

IRSrcD = Case opcodeD
           J, JAL  ⇒ nop
           ...     ⇒ IM

I1  096  ADD
I2  100  J 200
I3  104  ADD    (kill)
I4  304  ADD

Any interaction between stall and jump?

Jump Pipeline Diagrams

time              t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096: ADD     IF1  ID1  EX1  MA1  WB1
(I2) 100: J 200        IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD               IF3  nop  nop  nop  nop
(I4) 304: ADD                    IF4  ID4  EX4  MA4  WB4

Resource Usage

time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   nop  I4   I5
EX              I1   I2   nop  I4   I5
MA                   I1   I2   nop  I4   I5
WB                        I1   I2   nop  I4   I5

nop ⇒ pipeline bubble

Pipelining Conditional Branches

[Figure: fetch, decode and execute stages with the PCSrc mux (pc+4 / jabs / rind / br), the stall signal, the IRSrcD mux with a nop input, a "BEQZ?" check, and a "zero?" test on register A in the execute stage; PC = 104.]

I1  096  ADD
I2  100  BEQZ r1 200
I3  104  ADD
I4  304  ADD

Branch condition is not known until the execute stage. What action should be taken in the decode stage?

Pipelining Conditional Branches

[Figure: the same datapath one cycle later: PC = 108, I3 in decode, I2 (the branch) in execute, I1 in memory, and a "?" on the path from the "BEQZ?" and "zero?" checks.]

I1  096  ADD
I2  100  BEQZ r1 200
I3  104  ADD
I4  304  ADD

If the branch is taken:
- kill the two following instructions
- the instruction at the decode stage is not valid ⇒ stall signal is not valid

Pipelining Conditional Branches

[Figure: the same datapath with an added IRSrcE mux that can place a nop in the execute-stage IR; PC = 108, I3 in decode, I2 in execute, I1 in memory.]

I1  096  ADD
I2  100  BEQZ r1 200
I3  104  ADD
I4  304  ADD

If the branch is taken:
- kill the two following instructions
- the instruction at the decode stage is not valid ⇒ stall signal is not valid

New Stall Signal

stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )
        . !((opcodeE = BEQZ).z + (opcodeE = BNEZ).!z)

Don't stall if the branch is taken. Why? The instruction at the decode stage is invalid.

Control Equations for PC and IR Muxes


PCSrc = Case opcodeE
          BEQZ.z, BNEZ.!z  ⇒ br
          ...              ⇒ Case opcodeD
                               J, JAL    ⇒ jabs
                               JR, JALR  ⇒ rind
                               ...       ⇒ pc+4

IRSrcD = Case opcodeE
           BEQZ.z, BNEZ.!z  ⇒ nop
           ...              ⇒ Case opcodeD
                                J, JAL, JR, JALR  ⇒ nop
                                ...               ⇒ IM

IRSrcE = Case opcodeE
           BEQZ.z, BNEZ.!z  ⇒ nop
           ...              ⇒ stall.nop + !stall.IRD

Give priority to the older instruction, i.e., execute stage instruction over decode stage instruction.
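The nested Case structure is a priority encoder: the execute-stage branch is checked first, and the decode-stage jump is consulted only if no branch is taken. A minimal Python sketch of the three mux selects follows; the function names and the string encoding of opcodes and mux inputs are assumptions for illustration, and the sketch does not model how stall holds PC and the decode-stage IR.

```python
def branch_taken(opcode_e, z):
    """True when the execute-stage instruction is a taken branch."""
    return (opcode_e == "BEQZ" and z) or (opcode_e == "BNEZ" and not z)

def pc_src(opcode_e, z, opcode_d):
    """Select the next PC; the older (execute-stage) instruction wins."""
    if branch_taken(opcode_e, z):
        return "br"
    if opcode_d in ("J", "JAL"):
        return "jabs"
    if opcode_d in ("JR", "JALR"):
        return "rind"
    return "pc+4"

def ir_src_d(opcode_e, z, opcode_d):
    """Select what enters the decode-stage IR: fetched instruction or nop."""
    if branch_taken(opcode_e, z):
        return "nop"
    if opcode_d in ("J", "JAL", "JR", "JALR"):
        return "nop"
    return "IM"

def ir_src_e(opcode_e, z, stall):
    """Select what enters the execute-stage IR: decoded instruction or nop."""
    if branch_taken(opcode_e, z):
        return "nop"
    return "nop" if stall else "IRD"

if __name__ == "__main__":
    # A taken BEQZ in execute overrides a jump sitting in decode.
    print(pc_src("BEQZ", True, "J"), ir_src_d("BEQZ", True, "J"), ir_src_e("BEQZ", True, False))
```

When a taken branch is in execute, both `ir_src_d` and `ir_src_e` return nop, which is how the two instructions following the branch are killed.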

Branch Pipeline Diagrams (resolved in execute stage)

time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096: ADD         IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQZ 200         IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                   IF3  ID3  nop  nop  nop
(I4) 108:                            IF4  nop  nop  nop  nop
(I5) 304: ADD                             IF5  ID5  EX5  MA5  WB5

Resource Usage

time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   I2   nop  nop  I5
WB                        I1   I2   nop  nop  I5

nop ⇒ pipeline bubble

Reducing Branch Penalty (resolve in decode stage)

One pipeline bubble can be removed if an extra comparator is used in the Decode stage.

[Figure: fetch and decode stages with the PCSrc mux (pc+4 / jabs / rind / br), nop inputs to the instruction registers, and a zero detect on the register file output.]

Pipeline diagram now same as for jumps.

Branch Delay Slots (expose control hazard to software)


Change the ISA semantics so that the instruction that follows a jump or branch is always executed. This gives the compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted.

I1  096  ADD
I2  100  BEQZ r1 200
I3  104  ADD    (delay slot instruction, executed regardless of branch outcome)
I4  304  ADD

Other techniques include branch prediction, which can dramatically reduce the branch penalty ... to come later.

Bypassing

time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← r0 + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17          IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3
(I4)                                                IF4  ID4  EX4
(I5)                                                     IF5  ID5

The repeated ID2 and IF3 entries are the stalled stages.

Each stall or kill introduces a bubble in the pipeline ⇒ CPI > 1.

A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input:

time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← r0 + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17          IF2  ID2  EX2  MA2  WB2
(I3)                            IF3  ID3  EX3  MA3  WB3
(I4)                                 IF4  ID4  EX4  MA4  WB4
(I5)                                      IF5  ID5  EX5  MA5  WB5

Adding a Bypass

[Figure: pipelined datapath with the stall signal and a bypass mux, controlled by ASrc, in front of ALU input register A; the instruction registers are annotated with "r4 ← r1 ..." and "r1 ← ...".]

When does this bypass help?

(I1) r1 ← r0 + 10
(I2) r4 ← r1 + 17         yes

r1 ← M[r0 + 10]
r4 ← r1 + 17              no

JAL 500
r4 ← r31 + 17             no

The Bypass Signal: Deriving it from the Stall Signal

stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )

ws = Case opcode
       ALU        ⇒ rd
       ALUi, LW   ⇒ rt
       JAL, JALR  ⇒ R31

we = Case opcode
       ALU, ALUi, LW  ⇒ (ws ≠ 0)
       JAL, JALR      ⇒ on
       ...            ⇒ off

ASrc = (rsD = wsE).weE.re1D

Is this correct?

No, because only ALU and ALUi instructions can benefit from this bypass.

Split weE into two components: we-bypass, we-stall

Bypass and Stall Signals

Split weE into two components: we-bypass, we-stall

we-bypassE = Case opcodeE
               ALU, ALUi  ⇒ (ws ≠ 0)
               ...        ⇒ off

we-stallE = Case opcodeE
              LW         ⇒ (ws ≠ 0)
              JAL, JALR  ⇒ on
              ...        ⇒ off

ASrc  = (rsD = wsE).we-bypassE . re1D

stall = ((rsD = wsE).we-stallE + (rsD = wsM).weM + (rsD = wsW).weW) . re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW) . re2D
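The split of weE can be checked against the three "When does this bypass help?" cases with a short sketch. The Python below is an illustrative transcription of the equations on this slide, assuming a single bypass into ALU input A only; the `Instr` record and the use of `None` for an empty stage are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instr:
    opcode: str
    rs: int = 0
    rt: int = 0
    rd: int = 0

def ws(i):
    """Destination register, or None if the instruction writes nothing."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    return 31 if i.opcode in ("JAL", "JALR") else None

def we(i):
    if i is None:
        return False
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")

def we_bypass(i):
    """Result is on the ALU output, so the bypass can deliver it."""
    return i is not None and i.opcode in ("ALU", "ALUi") and ws(i) != 0

def we_stall(i):
    """Result is not on the ALU output in execute, so decode must wait."""
    if i is None:
        return False
    return (i.opcode == "LW" and ws(i) != 0) or i.opcode in ("JAL", "JALR")

def re1(i):
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")

def re2(i):
    return i.opcode in ("ALU", "SW")

def asrc(d, e):
    return we_bypass(e) and d.rs == ws(e) and re1(d)

def stall(d, e, m, w):
    rs_hit = (we_stall(e) and d.rs == ws(e)) or any(
        we(x) and d.rs == ws(x) for x in (m, w))
    # No bypass exists for the rt operand yet, so it still uses the full weE.
    rt_hit = any(we(x) and d.rt == ws(x) for x in (e, m, w))
    return (rs_hit and re1(d)) or (rt_hit and re2(d))

if __name__ == "__main__":
    use = Instr("ALUi", rs=1, rt=4)                  # r4 <- r1 + 17
    for producer in (Instr("ALUi", rs=0, rt=1),      # r1 <- r0 + 10
                     Instr("LW", rs=0, rt=1)):       # r1 <- M[r0 + 10]
        print(producer.opcode, asrc(use, producer), stall(use, producer, None, None))
```

For the ALUi producer the bypass fires and no stall is needed; for the LW producer the bypass cannot help and the stall remains.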

Fully Bypassed Datapath

[Figure: pipelined datapath with bypass muxes controlled by ASrc and BSrc, the stall signal, and a path labelled "PC for JAL, ...".]

Is there still a need for the stall signal?

stall = (rsD = wsE).(opcodeE = LWE).(wsE ≠ 0).re1D
      + (rtD = wsE).(opcodeE = LWE).(wsE ≠ 0).re2D
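With full bypassing the only remaining stall is the load-use case: a load in execute whose destination is read by the instruction in decode. A minimal Python sketch of that condition follows; the function name `load_use_stall` and its flat argument list are assumptions for illustration.

```python
def load_use_stall(opcode_e, ws_e, rs_d, rt_d, re1_d, re2_d):
    """Stall signal for a fully bypassed datapath.

    opcode_e, ws_e : opcode and destination register of the execute-stage instruction
    rs_d, rt_d     : source register fields of the decode-stage instruction
    re1_d, re2_d   : whether the decode-stage instruction actually reads rs / rt
    """
    load_pending = opcode_e == "LW" and ws_e != 0
    rs_hit = rs_d == ws_e and re1_d
    rt_hit = rt_d == ws_e and re2_d
    return bool(load_pending and (rs_hit or rt_hit))

if __name__ == "__main__":
    # r1 <- M[r0 + 10] in execute, r4 <- r1 + 17 in decode: one bubble is needed.
    print(load_use_stall("LW", 1, 1, 4, True, False))
    # r1 <- r0 + 10 in execute: the bypass supplies r1, so no stall.
    print(load_use_stall("ALUi", 1, 1, 4, True, False))
```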

Why an Instruction may not be dispatched every cycle (CPI>1)


- Full bypassing may be too expensive to implement
  - typically all frequently used paths are provided
  - some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI

- Loads have two cycle latency
  - Instruction after load cannot use load result
  - MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II.

- Conditional branches may cause bubbles
  - kill following instruction(s) if no delay slots

Machines with software-visible delay slots may execute a significant number of NOP instructions inserted by the compiler.
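The effect of these bubbles on CPI can be estimated with simple arithmetic: each source of bubbles adds (fraction of instructions affected) × (bubbles per occurrence) to the ideal CPI of 1. The Python sketch below illustrates this; the function name `effective_cpi` is made up, and the frequencies in the example are hypothetical numbers chosen only to show the calculation, not measurements from the lecture.

```python
def effective_cpi(bubble_sources):
    """Ideal CPI of 1 plus the average number of bubbles per instruction.

    bubble_sources is an iterable of (fraction_of_instructions, bubbles_each).
    """
    return 1.0 + sum(fraction * bubbles for fraction, bubbles in bubble_sources)

if __name__ == "__main__":
    # Hypothetical instruction mix:
    #   10% of instructions use a load result immediately (1 bubble each)
    #   10% are taken branches resolved in the execute stage (2 bubbles each)
    print(effective_cpi([(0.10, 1), (0.10, 2)]))
```

Under these assumed frequencies the bubbles add 0.3 cycles per instruction on average.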


Thank you !
