
Pipeline Hazards

Computer Architecture
Five Stages of Pipeline
Pipelining is an implementation technique in which multiple
instructions are overlapped in execution.
The stages of instruction execution / pipelining are
 IF --- Instruction Fetch
 ID --- Instruction Decode / Register Read
 EX --- Execute in ALU / calculate address
 MEM --- Data memory access
 WB ---- Write back in register

Data flows from left to right through the stages, except in WB, where the result is written back into the register file (a right-to-left flow).
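As a rough illustration of this overlap (not taken from the slides), the short Python sketch below prints which stage each instruction occupies in every clock cycle of an ideal five-stage pipeline; the instruction strings are arbitrary placeholders.

    # Minimal sketch: one instruction enters the pipeline per cycle and moves
    # through IF, ID, EX, MEM, WB with no hazards.
    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def pipeline_chart(instructions):
        total_cycles = len(instructions) + len(STAGES) - 1
        for i, instr in enumerate(instructions):
            row = ["....."] * total_cycles
            for s, stage in enumerate(STAGES):
                row[i + s] = stage.ljust(5)   # instruction i is in stage s during cycle i + s
            print(f"{instr:<16}" + "".join(row))

    pipeline_chart(["lw  $1, 100($0)", "add $2, $3, $4", "sub $5, $6, $7"])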
Pipeline Hazards
 Hazards: situations that force the pipeline to stall or sit idle
1. Structural hazards
 Caused by resource contention
 Two instructions need the same resource during the same cycle
2. Data hazards
 An instruction computes a result needed by a following instruction
 Hardware can detect dependences between instructions
3. Control hazards
 Caused by instructions that change the flow of control (branches/jumps)
 Delays in changing the flow of control
 Hazards complicate pipeline control and limit performance
1. Structural Hazard - Conflict due to Memory Access
[Pipeline timing diagram, cycles 1-7: a load lw $1, 100($0) is followed by four more instructions, one issued per cycle (the diagram also labels add $1, $2, $3); in clock cycle 4 the load is in its MEM stage while Instruction 4 is in its IF stage.]
 The same memory is used for both instructions and data
 Structural hazard: the processor cannot load data and fetch Instruction 4 during clock cycle 4
Resolving structural hazards
 Problem
 Two different instructions attempt to use the same hardware resource (memory) during the same cycle
 Solution 1: Wait
 Must detect the hazard (a detection sketch follows below)
 Must have a mechanism to delay (stall) the instruction's access to the resource (introduce a bubble / NOP)
 The hazard is serious and cannot be ignored
 Solution 2: Redesign the pipeline
 Add more hardware to eliminate the structural hazard
 In our example: use two memories with two memory ports
 A separate instruction memory and data memory, which can be implemented as caches
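To make the detection in Solution 1 concrete, here is a hedged Python sketch (not from the slides): with a single shared memory, a cycle in which a load or store is in its MEM stage while another instruction is in IF requires a stall. The (name, stage, accesses-data-memory) tuples are an encoding assumed purely for illustration.

    # Sketch: detect a structural hazard on a single shared memory in one cycle.
    def memory_conflict(cycle_snapshot):
        """cycle_snapshot: list of (instruction, stage, accesses_data_memory)."""
        fetching = any(stage == "IF" for _, stage, _ in cycle_snapshot)
        data_access = any(stage == "MEM" and uses_mem
                          for _, stage, uses_mem in cycle_snapshot)
        return fetching and data_access   # True -> stall, or use separate memories

    # Clock cycle 4 of the example above: the lw is in MEM, Instr 4 is in IF.
    cycle4 = [("lw $1, 100($0)", "MEM", True),
              ("Instr 2", "EX", False),
              ("Instr 3", "ID", False),
              ("Instr 4", "IF", False)]
    print(memory_conflict(cycle4))   # True: structural hazard in this cycle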
Solution 1: Detect Structural Hazard and Delay
[Pipeline timing diagram: the load is followed by Instr 2 and Instr 3; a bubble (a NOP instruction) is introduced to stall the pipeline and delay the fetch of Instr 4 by one cycle.]
Solution 2: Add More Hardware (Use Instruction and Data Memories)
 Eliminate structural hazard at design time
 Use two separate memories with two memory ports
 Instruction and data memories can be implemented as caches
[Pipelined datapath with separate instruction and data memories: the IF, ID, EX, MEM, and WB stages are separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; the PC addresses the instruction memory and the ALU result addresses the data memory.]
2. Data Hazards
 Dependency between instructions causes a data hazard
 The dependent instructions are close to each other
 Pipelined execution might change the order of operand access

 Read After Write – RAW Hazard


 Given two instructions I and J, where I comes before J …
 Instruction J should read an operand after it is written by I
 Called a data dependence in compiler terminology
I: add $1, $2, $3 # $1 is written
J: sub $4, $1, $3 # $1 is read
 The hazard occurs when J reads the operand before I writes it (a detection sketch follows below)

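A minimal Python sketch of the detection idea (not the slides' hardware): a RAW dependence exists when the later instruction's source registers include the earlier instruction's destination. The function name and argument shapes are assumptions made for illustration.

    # Sketch: RAW hazard between instruction I (earlier) and J (later).
    def raw_hazard(writer_dst, reader_srcs):
        """writer_dst: destination register written by I, e.g. "$1".
           reader_srcs: source registers read by J, e.g. ["$1", "$3"]."""
        return writer_dst is not None and writer_dst in reader_srcs

    # I: add $1, $2, $3   J: sub $4, $1, $3
    print(raw_hazard("$1", ["$1", "$3"]))   # True: J needs $1 before I writes it back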


Example of a RAW Data Hazard
[Pipeline timing diagram, CC1-CC8, for the sequence below; the value of $2 is 10 during CC1-CC4, is written as 20 during CC5, and is 20 afterwards.]
sub $2, $1, $3
and $4, $2, $5
or  $6, $3, $2
add $7, $2, $2    (no stall: $2 is written during CC5 and add reads the new value in the same cycle)
sw  $8, 10($2)    (the ALU calculates the address $2 + 10 in the EX stage)
 The result of sub is needed by the and, or, add, and sw instructions
 Instructions and & or will read the old value of $2 from the register file
 During CC5, $2 is both written and read – the new value is read
2. Solutions to Data Hazards
a) Reordering code (software)
b) Operand forwarding (hardware)
c) Stalling the pipeline
2a) Solution by reordering: the order of the instructions may be changed so that an instruction need not wait for another instruction's result, without affecting the program logic (a sketch of this idea follows below).
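The sketch below is a simplified take on the reordering idea, not an algorithm from the slides: it moves one independent instruction in between a producer and its dependent consumer so the result is ready when needed. The (text, writes, reads) tuples and the dependence checks are assumptions for illustration and only consider the producer/consumer pair.

    # Sketch: separate a dependent pair by hoisting a later, independent instruction.
    def reorder(instrs):
        """instrs: list of (text, writes, reads) tuples, e.g.
           ("lw $2, 0($1)", {"$2"}, {"$1"})."""
        for i in range(len(instrs) - 1):
            _, writes_p, _ = instrs[i]          # producer
            _, _, reads_c = instrs[i + 1]       # consumer
            if writes_p & reads_c:              # consumer depends on producer
                for j in range(i + 2, len(instrs)):
                    _, writes_x, reads_x = instrs[j]
                    independent = (not (writes_p & reads_x) and
                                   not (writes_x & reads_c) and
                                   not (writes_x & writes_p))
                    if independent:             # safe (in this simplified check) to hoist
                        instrs.insert(i + 1, instrs.pop(j))
                        break
        return [text for text, _, _ in instrs]

    code = [("lw  $2, 0($1)",  {"$2"}, {"$1"}),
            ("add $3, $2, $2", {"$3"}, {"$2"}),         # needs the lw result
            ("or  $5, $6, $7", {"$5"}, {"$6", "$7"})]   # independent filler
    print(reorder(code))   # the or is moved between the lw and the add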


2b. Operand Forwarding (Forwarding ALU Result)
 The ALU result is forwarded (fed back) to the ALU input
 No bubbles are inserted into the pipeline and no cycles are wasted
 ALU result exists in either EX/MEM or MEM/WB register
[Pipeline timing diagram, CC1-CC8: the result of sub $2, $1, $3 is forwarded from the pipeline registers to the ALU inputs of and $4, $2, $5, or $6, $3, $2, add $7, $2, $2, and sw $8, 10($2), so no stalls are needed.]
2b. Operand Forwarding Unit
 Forwarding unit generates ForwardA and ForwardB
 That are used to control the two forwarding multiplexers
 Uses Ra and Rb in ID/EX and Rw in EX/MEM & MEM/WB
[Datapath with the forwarding unit: two multiplexers at the ALU inputs select among the register values read in ID, the ALU result in EX/MEM, and the writeback data in MEM/WB; ForwardA and ForwardB control the selection.]
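A hedged Python sketch of the selection logic (the dictionary field names RegWrite and Rw are assumptions chosen to match the labels in the figure, and the return codes 0/1/2 are an arbitrary encoding): the most recent producer wins, so EX/MEM is checked before MEM/WB, and $0 is never forwarded.

    # Sketch of the ForwardA / ForwardB selection for one ALU source register.
    # 0 = value read from the register file, 1 = ALU result in EX/MEM,
    # 2 = writeback data in MEM/WB.
    def forward_select(src_reg, ex_mem, mem_wb):
        if ex_mem["RegWrite"] and ex_mem["Rw"] != "$0" and ex_mem["Rw"] == src_reg:
            return 1        # newest value: forward the ALU result from EX/MEM
        if mem_wb["RegWrite"] and mem_wb["Rw"] != "$0" and mem_wb["Rw"] == src_reg:
            return 2        # older value: forward the writeback data from MEM/WB
        return 0            # no hazard: the register file value is fine

    # and $4, $2, $5 is in EX while sub $2, $1, $3 sits in EX/MEM:
    ex_mem = {"RegWrite": True,  "Rw": "$2"}
    mem_wb = {"RegWrite": False, "Rw": "$0"}
    print(forward_select("$2", ex_mem, mem_wb))   # 1 -> ForwardA selects EX/MEM
    print(forward_select("$5", ex_mem, mem_wb))   # 0 -> ForwardB selects the register file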


2c. Stalling the Pipeline
[Pipeline timing diagram, CC1-CC8: sub $2, $1, $3 is followed by and $4, $2, $5. The and instruction repeats its register-read stage while bubbles are inserted into ID/EX, and or $6, $3, $2 is delayed behind it.]
 The and instruction cannot read $2 from the register file until CC5
 Two bubbles (NOP instructions) are inserted into ID/EX at the end of cycles CC3 and CC4 (see the stall-count sketch below)
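A tiny Python sketch (an assumption-based model, not the slides' hardware) of how many bubbles are needed without forwarding: the producer writes the register file in its fifth cycle, the consumer reads registers in its ID stage, and a write and a read may share a cycle as noted above.

    # Sketch: bubbles needed between a producer and a dependent consumer
    # in a 5-stage pipeline with no forwarding.
    def stall_cycles(distance):
        """distance: how many instructions after the producer the consumer is
           issued (1 = the very next instruction)."""
        producer_wb_cycle = 5              # e.g. sub writes $2 back in CC5
        consumer_id_cycle = distance + 2   # the consumer reads registers in ID
        # a write in the first half of a cycle can be read in the second half
        return max(0, producer_wb_cycle - consumer_id_cycle)

    print(stall_cycles(1))   # 2 bubbles for 'and' right after 'sub', as above
    print(stall_cycles(3))   # 0 bubbles for 'add' three instructions later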
3. Control Hazards
 Branch instructions can cause a significant performance loss
 A branch instruction needs two things:
 The branch outcome: taken or not taken
 The branch target address:
 PC + 4 if the branch is NOT taken
 PC + 4 + immediate*4 if the branch is taken (see the small sketch below)
 The branch instruction is not detected until the ID stage
 By which point the next instruction has already been fetched
 For our original pipeline:
 The target address is not calculated until the EX stage
 The branch condition is set in the EX/MEM register (EX/MEM.zero)
 The result is a 3-cycle branch delay
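A small Python sketch of the two next-PC choices listed above; the numbers come from the cycle-by-cycle walkthrough that follows (branch at PC = 1000 with immediate 100).

    # Sketch: the two possible next-PC values for a conditional branch.
    def branch_next_pc(pc, immediate, taken):
        if taken:
            return pc + 4 + immediate * 4   # branch taken
        return pc + 4                       # branch not taken

    print(branch_next_pc(1000, 100, taken=False))   # 1004
    print(branch_next_pc(1000, 100, taken=True))    # 1404, the target in the example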
3-Cycle Branch Delay
 Instructions Next_1 through Next_3, stored sequentially after beq, are fetched anyway
 The pipeline should flush Next_1 through Next_3 if the branch is taken
 Otherwise, they can be executed normally
[Pipeline timing diagram, cc1-cc7: beq $1, $3, 100 is followed by Next_1, Next_2, and Next_3, which turn into bubbles when the branch is taken; Branch_Target is fetched in cc5.]


Branch Delay – CC1
 Consider the pipelined execution of: beq $1, $3, 100
 During the first cycle, beq is fetched in the IF stage
[Datapath snapshot: PC = 1000; beq $1, $3, 100 is in the IF stage and the incremented PC (1004) is written into IF/ID.]


Branch Delay – CC2
 During the second cycle, beq is decoded in the ID stage
 The next_1 instruction is fetched in the IF stage
[Datapath snapshot: PC = 1004; next_1 is in the IF stage while beq is in the ID stage, reading registers $1 and $3 and sign-extending the immediate 100.]


Branch Delay – CC3
 During the third cycle, beq is executed in the EX stage
 The next_2 instruction is fetched in the IF stage
[Datapath snapshot: PC = 1008; next_2 is in IF, next_1 is in ID, and beq is in EX, where the ALU compares the operands (both 1234) and the branch target address is calculated.]


Branch Delay – CC4
 During the fourth cycle, beq reaches MEM stage
 The next_3 instruction is fetched in the IF stage
[Datapath snapshot: PC = 1012; next_3 is in IF while beq is in the MEM stage; Zero = 1, so PCSrc selects the branch target 1404 as the next PC.]


Branch Delay – CC5
 During the fifth cycle, branch_target instruction is fetched
 Next_1 thru next_3 should be converted into NOPs
[Datapath snapshot: PC = 1404; the branch_target instruction is fetched while next_1 through next_3 are still in the pipeline and must be turned into NOPs.]


Reducing the Delay of Branches
 The branch delay can be reduced from 3 cycles to just 1 cycle
 The branch decision is moved from the 4th to the 2nd pipeline stage
 Branches are then resolved earlier, in the ID stage
 The branch address calculation adder is moved to the ID stage
 A comparator is added in the ID stage to compare the two fetched registers
 It determines the branch decision: taken or not taken

 Only one instruction that follows the branch will be fetched


 If the branch is taken then only one instruction is flushed
 We need a control signal IF.Flush to zero the IF/ID register
 This will convert the already-fetched instruction into a NOP (a sketch follows below)
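A hedged Python sketch of resolving the branch in ID with IF.Flush (function and variable names are assumptions for illustration, not the slides' signal names): if the registers compare equal, the instruction already sitting in IF/ID is squashed into a NOP and fetching continues at the target.

    # Sketch: branch resolved in the ID stage with a 1-cycle delay and IF.Flush.
    def resolve_branch_in_id(fetched_instr, regs, rs, rt, branch_pc, immediate):
        """Returns (instruction kept in IF/ID, next PC)."""
        if regs[rs] == regs[rt]:                    # equality comparator in ID
            target = branch_pc + 4 + immediate * 4  # branch adder moved into ID
            return "nop", target                    # IF.Flush: squash the fetched instr
        return fetched_instr, branch_pc + 8         # not taken: keep going sequentially

    regs = {"$1": 1234, "$3": 1234}
    print(resolve_branch_in_id("next_1", regs, "$1", "$3", 1000, 100))
    # ('nop', 1404): next_1 becomes a NOP and the target is fetched next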
Reducing the Delay of Branches
[Datapath for the reduced branch delay: the target adder, the register comparator (=), and the IF.Flush signal are in the ID stage; a hazard detection unit, forwarding unit, and the usual multiplexers are also shown.]


Branch Hazard Alternatives
 Always stall the pipeline until branch direction is known
 Next instruction is always flushed (turned into a NOP)
 Predict Branch Not Taken
 Fetch successor instruction: PC+4 already calculated
 Almost half of MIPS branches are not taken on average
 Flush instructions in pipeline only if branch is actually taken
 Predict Branch Taken
 Backward branches in loops can be predicted taken – they are taken most of the time
 However, the branch target address is determined in the ID stage
 The branch delay would have to be reduced from 1 cycle to 0 – but how?
 Delayed Branch
 Define the branch to take effect AFTER the following instruction
Delayed Branch
 Define branch to take place after the next instruction
 For a 1-cycle branch delay, we have one delay slot
branch instruction
branch delay slot – next instruction
...
branch target – if branch taken

branch instruction (taken)              IF  ID  EX  MEM WB
branch delay slot (next instruction)        IF  ID  EX  MEM WB
branch target                                   IF  ID  EX  MEM WB

 The compiler/assembler fills the branch delay slot
 By selecting a useful instruction (a small execution sketch follows below)
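A small Python sketch of delayed-branch semantics (the list-of-tuples program encoding is an assumption for illustration): the instruction in the delay slot always executes, and only afterwards does control transfer to the target of a taken branch.

    # Sketch: executing a program under delayed-branch semantics (one delay slot).
    def run(program):
        pc, pending_target, trace = 0, None, []
        while pc < len(program):
            kind, payload = program[pc]
            trace.append(payload if kind == "op" else f"beq -> {payload}")
            next_pc = pc + 1
            if pending_target is not None:      # a branch taken in the previous step
                next_pc, pending_target = pending_target, None
            if kind == "beq_taken":             # the branch takes effect only AFTER
                pending_target = payload        # the next (delay-slot) instruction
            pc = next_pc
        return trace

    prog = [("beq_taken", 3),       # 0: branch to index 3, taken
            ("op", "delay slot"),   # 1: always executed
            ("op", "skipped"),      # 2: skipped when the branch is taken
            ("op", "target")]       # 3: the branch target
    print(run(prog))                # ['beq -> 3', 'delay slot', 'target']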
Scheduling the Branch Delay Slot
 From an independent instruction before the branch
 From a target instruction when branch is predicted taken
 From fall through when branch is predicted not taken

[Three scheduling examples for the delay slot of beq $s1, $s0: an independent add $t2, $t3, $t4 from before the branch is moved into the slot; sub $t4, $t5, $t6 from the branch target fills the slot when the branch is predicted taken; and sub $t4, $t5, $t6 from the fall-through path fills the slot when the branch is predicted not taken.]


More on Delayed Branch
 Scheduling delay slot with
 Independent instruction is the best choice
 However, not always possible to find an independent instruction
 Target instruction is useful when branch is predicted taken
 Such as in a loop branch
 May need to duplicate instruction if it can be reached by another path
 Cancel branch delay instruction if branch is not taken
 Fall through is useful when branch is predicted not taken
 Cancel branch delay instruction if branch is taken

 Disadvantages of delayed branch


 Branch delay can increase to multiple cycles in deeper pipelines
 Zero-delay branching + dynamic branch prediction are required
Zero-Delayed Branch
 How can we achieve zero delay for a taken branch if the branch target address is computed in the ID stage?

 Solution
 Check the PC to see if the instruction being fetched is a branch
 Store the branch target address in a table in the IF stage
 Such a table is called the branch target buffer
 If branch is predicted taken then
 Next PC = branch target fetched from target buffer
 Otherwise, if branch is predicted not taken then
 Next PC = PC + 4
 Zero-delay is achieved because Next PC is determined in IF stage
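A hedged Python sketch of the IF-stage lookup (a dictionary stands in for the small cache, and the structure names are assumptions for illustration): a hit in the branch target buffer plus a taken prediction selects the stored target, otherwise PC + 4.

    # Sketch: next-PC selection in the IF stage using a branch target buffer.
    def predict_next_pc(pc, btb, prediction):
        """btb: dict PC -> branch target address (stands in for the small cache).
           prediction: dict PC -> True if the branch at PC is predicted taken."""
        if pc in btb and prediction.get(pc, False):
            return btb[pc]      # predicted taken: fetch the stored target next
        return pc + 4           # not a branch, or predicted not taken

    btb = {1000: 1404}          # the beq at address 1000 targets 1404
    prediction = {1000: True}
    print(predict_next_pc(1000, btb, prediction))   # 1404: zero-delay taken branch
    print(predict_next_pc(1004, btb, prediction))   # 1008: ordinary PC + 4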
Branch Target and Prediction Buffer
 The branch target buffer is implemented as a small cache
 That stores the branch target address of taken branches
 We also have a branch prediction buffer
 To store the prediction bits for branch instructions
 The prediction bits are dynamically determined by the hardware

[Block diagram: the PC is used to look up the branch target buffer (PC of branch instruction -> target address) and the prediction buffer; a multiplexer selects between PC + 4 and the stored branch target address for the next PC.]


Dynamic Branch Prediction
 Prediction of branches at runtime using prediction bits
 One or few prediction bits are associated with a branch instruction
 Branch prediction buffer is a small memory
 Indexed by the lower portion of the address of branch instruction
 The simplest scheme is to have 1 prediction bit per branch
 We don’t know if the prediction bit is correct or not
 If correct prediction …
 Continue normal execution – no wasted cycles
 If incorrect prediction (misprediction) …
 Flush the instructions that were incorrectly fetched – wasted cycles
 Update prediction bit and target address for future use
2-bit Prediction Scheme
 Prediction is just a hint that is assumed to be correct
 If incorrect then fetched instructions are flushed
 1-bit prediction scheme has a performance shortcoming
 A loop branch is almost always taken, except for last iteration
 1-bit scheme will predict incorrectly twice, rather than once
 On the first and last loop iterations
 2-bit prediction schemes are often used
 A prediction must be wrong twice before it is changed
 A loop branch is mispredicted only once, on the last iteration
[State diagram of the 2-bit scheme: two Predict Taken states and two Predict Not Taken states; Taken and Not Taken outcomes move the predictor between adjacent states.]
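A Python sketch of the 2-bit scheme as a saturating counter (the counter encoding, with 0-1 predicting not taken and 2-3 predicting taken, and the "start strongly taken" initial state are assumptions for illustration). Running it on a 10-iteration loop branch shows the single misprediction on the last iteration.

    # Sketch: 2-bit saturating-counter branch predictor.
    class TwoBitPredictor:
        def __init__(self, counter=3):      # 3 = strongly taken (assumed start state)
            self.counter = counter

        def predict(self):
            return self.counter >= 2        # 2 or 3 -> predict taken

        def update(self, taken):
            if taken:
                self.counter = min(3, self.counter + 1)
            else:
                self.counter = max(0, self.counter - 1)

    p = TwoBitPredictor()
    outcomes = [True] * 9 + [False]         # a loop branch: taken 9 times, then exits
    mispredictions = 0
    for taken in outcomes:
        if p.predict() != taken:
            mispredictions += 1
        p.update(taken)
    print(mispredictions)                   # 1: only the final iteration is mispredicted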
Implementing Forwarding
 Two multiplexers are added at the inputs of the ALU
 ALU result in EX/MEM is forwarded (fed back)
 Writeback data in MEM/WB is also forwarded
 Two signals: ForwardA and ForwardB control forwarding
[Datapath with the forwarding unit (same structure as the earlier forwarding figure): two multiplexers at the ALU inputs, controlled by ForwardA and ForwardB, select among the register values, the ALU result in EX/MEM, and the writeback data in MEM/WB.]
