
07 Pipeline Notes

Pipelining

Hakim Weatherspoon
CS 3410
Computer Science
Cornell University
The slides are the product of many rounds of teaching CS 3410
by Professors Weatherspoon, Bala, Bracy, McKee, and Sirer.
Review: Single Cycle Processor
[Figure: single-cycle datapath with PC, +4 adder, instruction memory, register file, ALU, compare (=?), data memory (addr, din, dout), control, immediate extend, offset/target, and new-PC logic]
Advantages
• A single cycle per instruction makes the logic and clocking simple
Disadvantages
• Since instructions take different amounts of time to finish, memory
and functional units are not efficiently utilized
• Cycle time is set by the longest delay
– Load instruction
• Best possible CPI is 1 (actually < 1 with parallelism)
– However, lower MIPS and longer clock period (lower clock
frequency); hence, lower performance
Review: Multi Cycle Processor

Advantages
• Better MIPS and smaller clock period (higher clock
frequency)
• Hence, better performance than Single Cycle processor
Disadvantages
• Higher CPI than single cycle processor

Pipelining: Want better Performance


• want small CPI (close to 1) with high MIPS and short
clock period (high clock frequency)
Improving Performance
Parallelism

Pipelining

Both!
The Kids
Alice

Bob

They don’t always get along…


The Bicycle
The Materials

Saw Drill

Glue Paint
The Instructions
N pieces, each built following same sequence:

Saw Drill Glue Paint


Design 1: Sequential Schedule

Alice owns the room


Bob can enter when Alice is finished
Repeat for remaining tasks
No possibility for conflicts
Sequential Performance
time
1 2 3 4 5 6 7 8…

Elapsed Time for Alice: 4
Elapsed Time for Bob: 4
Total elapsed time: 4*N

Latency: 4 hours/task
Throughput: 1 task/4 hrs
Concurrency: 1
CPI = 4
Can we do better?
Design 2: Pipelined Design
Partition room into stages of a pipeline

Dave Carol Bob Alice

One person owns a stage at a time


4 stages
4 people working simultaneously
Everyone moves right in lockstep
Design 2: Pipelined Design (stage occupancy, one step per line)
Step 1: Alice
Step 2: Bob Alice
Step 3: Carol Bob Alice
Step 4: Dave Carol Bob Alice
Alice passes through all four stages in turn: Alice Alice Alice Alice
It still takes all four stages for one job to complete
Pipelined Performance
time
1 2 3 4 5 6 7…

Latency: 4 hrs/task
Throughput: 1 task/hr
Concurrency: 4 CPI = 1
Pipelined Performance
Time
1 2 3 4 5 6 7 8 9 10

What if drilling takes twice as long, but gluing and paint take ½ as long?

First task done: 4 cycles
Second task done: 6 cycles
Third task done: 8 cycles

Latency: 4 cycles/task
Throughput: 1 task/2 cycles
CPI = 2
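The unbalanced case can be checked with a small simulation. This is a minimal sketch, assuming each stage works on one task at a time and a task enters a stage as soon as both the task and the stage are free. The function name `completion_times` and the stage-time list are illustrative and not taken from the slides.

```python
def completion_times(stage_times, n_tasks):
    """Return the time at which each task leaves the last stage."""
    # stage_free[s] is the time at which stage s finishes its current task
    stage_free = [0.0] * len(stage_times)
    done = []
    for _ in range(n_tasks):
        t = 0.0  # time at which this task is ready for its next stage
        for s, duration in enumerate(stage_times):
            start = max(t, stage_free[s])  # wait until the stage is free
            t = start + duration
            stage_free[s] = t
        done.append(t)
    return done


# Saw = 1, Drill = 2, Glue = 0.5, Paint = 0.5 (in cycles)
times = completion_times([1, 2, 0.5, 0.5], 3)
print(times)  # [4.0, 6.0, 8.0]
```

Latency is the sum of the stage times (4 cycles), while the gap between successive completions equals the slowest stage (2 cycles), which is why the slowest stage dominates throughput.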
Lessons
Principle:
Throughput increased by parallel execution
Balanced pipeline very important
Else slowest stage dominates performance

Pipelining:
• Identify pipeline stages
• Isolate stages from each other
• Resolve pipeline hazards (next lecture)
Single Cycle vs Pipelined Processor
Single Cycle → Pipelining

Single-cycle
insn0.fetch, dec, exec
                        insn1.fetch, dec, exec

Pipelined
insn0.fetch insn0.dec   insn0.exec
            insn1.fetch insn1.dec  insn1.exec
Agenda
5-stage Pipeline
• Implementation
• Working Example

Hazards
• Structural
• Data Hazards
• Control Hazards

A Processor
[Figure: the single-cycle datapath divided into five stages: Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back. The branch compare and target logic becomes a "compute jump/branch targets" block.]
Pipelined Processor
[Figure: five-stage pipelined datapath (Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back) with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. The registers carry data values (inst, PC+4, imm, A, B, D, M) and control (ctrl) from stage to stage.]
Time Graphs
Cycle   1   2   3   4   5   6   7   8   9

add     IF  ID  EX  MEM WB
nand        IF  ID  EX  MEM WB
lw              IF  ID  EX  MEM WB
add                 IF  ID  EX  MEM WB
sw                      IF  ID  EX  MEM WB

Latency: 5 cycles
Throughput: 1 insn/cycle
CPI = 1
Concurrency: 5
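A time graph like the one above can be generated mechanically. The sketch below assumes an ideal five-stage pipeline with no stalls, so instruction i simply enters stage s in cycle i + s + 1. The names `time_graph` and `render` are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def time_graph(instructions):
    """Return (name, row) pairs; row[c] is the stage occupied in cycle c + 1."""
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        row = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # one new instruction starts every cycle
        rows.append((name, row))
    return rows


def render(rows):
    """Format the rows as a fixed-width text table."""
    n_cycles = len(rows[0][1])
    header = "Cycle   " + "".join(f"{c:<4}" for c in range(1, n_cycles + 1))
    lines = [header.rstrip()]
    for name, row in rows:
        cells = "".join(f"{cell:<4}" for cell in row)
        lines.append((f"{name:<8}" + cells).rstrip())
    return "\n".join(lines)


rows = time_graph(["add", "nand", "lw", "add", "sw"])
print(render(rows))
```

With five instructions the table spans nine cycles: the first instruction takes five cycles, and each later one finishes one cycle after its predecessor.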
Principles of Pipelined Implementation
• Break datapath into multiple cycles (here 5)
• Parallel execution increases throughput
• Balanced pipeline very important
• Slowest stage determines clock rate
• Imbalance kills performance
• Add pipeline registers (flip-flops) for isolation
• Each stage begins by reading values from latch
• Each stage ends by writing values to latch
• Resolve hazards

Pipeline Stages
Stage: Fetch
  Perform: Use PC to index Program Memory, increment PC
  Latch values of interest: Instruction bits (to be decoded), PC + 4 (to compute branch targets)
Stage: Decode
  Perform: Decode instruction, generate control signals, read register file
  Latch values of interest: Control information, Rd index, immediates, offsets, register values (Ra, Rb), PC+4 (to compute branch targets)
Stage: Execute
  Perform: Perform ALU operation; compute targets (PC+4+offset, etc.) in case this is a branch; decide if branch taken
  Latch values of interest: Control information, Rd index, etc.; result of ALU operation; value in case this is a store instruction
Stage: Memory
  Perform: Perform load/store if needed, address is ALU result
  Latch values of interest: Control information, Rd index, etc.; result of load, pass result from execute
Stage: Writeback
  Perform: Select value, write to register file
Instruction Fetch (IF)
[Figure: PC addresses instruction memory (mc = 00 = read word); inst and PC+4 are latched into IF/ID and passed to the rest of the pipeline; a pc-sel mux chooses the new pc]
The new pc is one of:
• PC+4
• pc-reg (PC registers), e.g. JR
• pc-rel (PC-relative), e.g. BEQ, BNE
• pc-abs (PC absolute), e.g. J and JAL
  (PC+4)31..28 • target • 00
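The four new-pc choices can be sketched as one selection function. This assumes the standard MIPS encodings (a 16-bit signed word offset for branches, a 26-bit word target for jumps); the function and parameter names are illustrative, and the shift of the branch offset by 2 is an assumption from the MIPS ISA rather than something the slide states.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret a 16-bit field as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def next_pc(pc, pc_sel, imm16=0, target26=0, reg_value=0):
    """Return the new pc for the given pc-sel choice."""
    pc_plus_4 = (pc + 4) & MASK32
    if pc_sel == "pc+4":
        return pc_plus_4
    if pc_sel == "pc-rel":  # BEQ, BNE: PC+4 plus the word offset
        return (pc_plus_4 + (sign_extend16(imm16) << 2)) & MASK32
    if pc_sel == "pc-abs":  # J, JAL: (PC+4)[31..28] . target . 00
        return (pc_plus_4 & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)
    if pc_sel == "pc-reg":  # JR: the pc comes from a register
        return reg_value & MASK32
    raise ValueError(f"unknown pc_sel: {pc_sel}")


# A branch at 0x1C with offset -3 words lands on 0x14.
print(hex(next_pc(0x1C, "pc-rel", imm16=0xFFFD)))  # 0x14
```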
Instruction Decode (ID)
[Figure: IF/ID supplies PC+4 and inst. Decode reads Ra and Rb from the register file (outputs A and B), extends the immediate, and latches ctrl, PC+4, imm, B, A into ID/EX. The register file write port (Rd, D, WE) is driven by dest and result coming back from Write-Back.]
Execute (EX)
[Figure: ID/EX supplies ctrl, PC+4, imm, B, A. The ALU produces D; an adder forms the branch target; pcreg, pcrel, and pcabs are the candidate targets, and the branch? decision drives pcsel. ctrl, target, B, D are latched into EX/MEM.]
Memory (MEM)
[Figure: EX/MEM supplies ctrl, target, B, D. D is the address (addr) and B the data in (din) of data memory, controlled by mc; dout becomes M. ctrl, M, D are latched into MEM/WB. pcsel, branch?, pcreg, pcrel, and pcabs feed back to Fetch.]
Write-Back (WB)
[Figure: MEM/WB supplies ctrl, M, D. Write-Back selects result and dest and writes the register file.]
Putting it all together!
[Figure: complete five-stage datapath. IF/ID holds inst and PC+4; ID/EX holds A, B, imm, PC+4, Rt/Rd, and OP; EX/MEM holds D, B, Rd, and OP; MEM/WB holds M, D, Rd, and OP.]
iClicker Question
Consider a non-pipelined processor with clock
period C (e.g., 50 ns). If you divide the processor
into N stages (e.g., 5), your new clock period will
be:

A. C
B. N
C. less than C/N
D. C/N
E. greater than C/N
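A worked sketch for this question, assuming every stage pays a fixed pipeline-register overhead and that stage delays may be unequal. Only C = 50 ns and N = 5 come from the question; the overhead and the unbalanced stage delays are made-up numbers for illustration.

```python
def pipelined_clock_period(stage_delays_ns, latch_overhead_ns):
    """The clock must fit the slowest stage plus the pipeline-register overhead."""
    return max(stage_delays_ns) + latch_overhead_ns


C = 50.0  # non-pipelined clock period in ns (from the question)
N = 5     # number of stages (from the question)

# Perfectly balanced stages, 1 ns of register overhead (assumed)
balanced = pipelined_clock_period([C / N] * N, latch_overhead_ns=1.0)

# Same total logic delay, but unevenly split (assumed split)
unbalanced = pipelined_clock_period([8.0, 8.0, 10.0, 14.0, 10.0], latch_overhead_ns=1.0)

print(C / N, balanced, unbalanced)  # 10.0 11.0 15.0
```

Under these assumptions the period equals C/N only when the stages are perfectly balanced and the registers add no delay; otherwise it is larger.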
Takeaway
Pipelining is a powerful technique to mask
latencies and increase throughput
• Logically, instructions execute one at a time
• Physically, instructions execute in parallel
– Instruction level parallelism

Abstraction promotes decoupling


• Interface (ISA) vs. implementation (Pipeline)
MIPS is designed for pipelining
• Instructions same length
• 32 bits, easy to fetch and then decode

• 3 types of instruction formats


• Easy to route bits between stages
• Can read a register source before even knowing
what the instruction is
• Memory access through lw and sw only
• Access memory after ALU

Example: Sample Code (Simple)
add r3 ← r1, r2
nand r6 ← r4, r5
lw r4 ← 20(r2)
add r5 ← r2, r5
sw r7 → 12(r3)

Assume 8-register machine
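Before tracing the pipeline, it helps to know what the program should compute when run one instruction at a time. The sketch below executes the five instructions in program order. The initial register values and the word 99 at address 29 are read off the datapath figures of this worked example; the dictionary-based memory and the name `run_sample` are simplifications for illustration.

```python
def run_sample(regs, mem):
    """Execute the five sample instructions in program order."""
    r = list(regs)
    m = dict(mem)
    r[3] = r[1] + r[2]          # add  r3 <- r1, r2
    r[6] = ~(r[4] & r[5])       # nand r6 <- r4, r5
    r[4] = m.get(20 + r[2], 0)  # lw   r4 <- 20(r2)
    r[5] = r[2] + r[5]          # add  r5 <- r2, r5
    m[12 + r[3]] = r[7]         # sw   r7 -> 12(r3)
    return r, m


initial_regs = [0, 36, 9, 12, 18, 7, 41, 22]  # R0..R7 at cycle 0
initial_mem = {29: 99}                        # value returned by the lw

regs, mem = run_sample(initial_regs, initial_mem)
print(regs)     # [0, 36, 9, 45, 99, 16, -3, 22]
print(mem[57])  # 22
```

The pipelined execution must end with the same register file and memory contents, which is what the cycle-by-cycle walkthrough confirms.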

[Figure: detailed pipelined datapath used for the example: PC, +4 adder, instruction memory, register file (R0 to R7, with R0 = 0), ALU, data memory, sign extend, and muxes. IF/ID holds the instruction and PC+4; ID/EX holds PC+4, valA, valB, imm, and the Rt/Rd fields; EX/MEM holds target, ALU result, valB, and dest; MEM/WB holds ALU result, mdata, and dest. Instruction bits 26-31 give op, bits 16-20 Rt, bits 11-15 Rd.]
Start State @ Cycle 0
[Figure: initial datapath state at time 0. PC = 0 and every pipeline register (IF/ID, ID/EX, EX/MEM, MEM/WB) holds a nop. Register file: R0 = 0, R1 = 36, R2 = 9, R3 = 12, R4 = 18, R5 = 7, R6 = 41, R7 = 22.]


Cycle 1: Fetch add
[Figure: datapath state, time 1 to 2. add 3 1 2 is fetched into IF/ID with PC+4 = 4 and then decoded: ID/EX receives valA = 36 (R1), valB = 9 (R2), and destination field 3. PC moves from 4 to 8.]


Cycle 2: Fetch nand, Decode add
[Figure: datapath state, time 2 to 3. nand 6 4 5 is fetched and then decoded (valA = 18, valB = 7, destination field 6) while add 3 1 2 executes: ALU result 36 + 9 = 45, dest = 3. PC moves from 8 to 12.]


Cycle 3: Fetch lw, Decode nand, …
[Figure: datapath state, time 3 to 4. lw 4 20(2) is fetched, nand 6 4 5 executes, and add 3 1 2 moves on with result 45 (dest 3). PC moves from 12 to 16.]
nand computation in the ALU:
18 = 01 0010
 7 = 00 0111
------------------
-3 = 11 1101


Cycle 4: Fetch add, Decode lw, …
[Figure: datapath state at time 4. IF/ID: add 5 2 5; ID/EX: lw 4 20(2) with valA = 9 (R2) and imm = 20; EX/MEM: nand result -3, dest 6; MEM/WB: add result 45, dest 3. PC = 20.]


Cycle 5: Fetch sw, Decode add, …
[Figure: datapath state at time 5. IF/ID: sw 7 12(3); ID/EX: add 5 2 5 with valA = 9 and valB = 7; EX/MEM: lw address 20 + 9 = 29, dest 4; MEM/WB: nand result -3, dest 6. add 3 1 2 has written back, so R3 = 45. PC = 24.]


Cycle 6: Decode sw, …
[Figure: datapath state at time 6. No more instructions to fetch. ID/EX: sw 7 12(3) with valA = 45 (R3), valB = 22 (R7), imm = 12; EX/MEM: add result 9 + 7 = 16, dest 5; MEM/WB: lw address 29 and loaded data 99, dest 4. nand 6 4 5 has written back, so R6 = -3.]


Cycle 7: Execute sw, ...
[Figure: datapath state at time 7. No more instructions to fetch. EX/MEM: sw address 12 + 45 = 57 with store data 22; MEM/WB: add result 16, dest 5. lw 4 20(2) has written back, so R4 = 99.]


Cycle 8: Memory sw, ...
[Figure: datapath state at time 8. sw 7 12(3) writes 22 to data memory address 57 and moves into MEM/WB. add 5 2 5 has written back, so R5 = 16.]


Slides thanks to Sally McKee
Cycle 9: Writeback sw, ...
[Figure: datapath state at time 9. sw 7 12(3) passes through Write-Back without writing a register, and the pipeline is empty. Final register file: R0 = 0, R1 = 36, R2 = 9, R3 = 45, R4 = 99, R5 = 16, R6 = -3, R7 = 22.]


iClicker Question
Pipelining is great because:

A. You can fetch and decode the same instruction
at the same time.
B. You can fetch two instructions at the same time.
C. You can fetch one instruction while decoding
another.
D. Instructions only need to visit the pipeline
stages that they require.
E. C and D
Hazards
Correctness problems associated w/processor design

1. Structural hazards
Same resource needed for different purposes at
the same time (Possible: ALU, Register File, Memory)

2. Data hazards
Instruction output needed before it’s available

3. Control hazards
Next instruction PC unknown at time of Fetch
Dependences and Hazards
Dependence: relationship between two insns
• Data: two insns use same storage location
• Control: 1 insn affects whether another executes at all
• Not a bad thing, programs would be boring otherwise
• Enforced by making older insn go before younger one
– Happens naturally in single-/multi-cycle designs
– But not in a pipeline
Hazard: dependence & possibility of wrong insn order
• Effects of wrong insn order cannot be externally visible
• Hazards are a bad thing: most solutions either complicate
the hardware or reduce performance

iClicker Question
Data Hazards
• register file (RF) reads occur in stage 2 (ID)
• RF writes occur in stage 5 (WB)
• RF written in first ½ of cycle, read in second ½ of cycle

x10: add r3 ← r1, r2
x14: sub r5 ← r3, r4

1. Is there a dependence?
2. Is there a hazard?

A) Yes
B) No
C) Cannot tell with the information given.

Answer: A) Yes for both
iClicker Follow-up

Which of the following statements is true?

A. Whether there is a data dependence between two
instructions depends on the machine the program is
running on.
B. Whether there is a data hazard between two
instructions depends on the machine the program is
running on.
C. Both A & B
D. Neither A nor B
Where are the Data Hazards?
Clock cycle         1   2   3   4   5   6   7   8   9

add r3, r1, r2      IF  ID  X   MEM WB
sub r5, r3, r4          IF  ID  X   MEM WB
lw r6, 4(r3)                IF  ID  X   MEM WB
or r5, r3, r5                   IF  ID  X   MEM WB
sw r6, 12(r3)                       IF  ID  X   MEM WB

iClicker
How many data hazards due to r3 only?

add r3, r1, r2
sub r5, r3, r4
lw r6, 4(r3)
or r5, r3, r5
sw r6, 12(r3)

A) 1
B) 2
C) 3
D) 4
E) 5
Visualizing Data Hazards
Clock cycle         1   2   3   4   5   6   7   8   9

add r3, r1, r2      IF  ID  X   MEM WB
sub r5, r3, r4          IF  ID  X   MEM WB
lw r6, 4(r3)                IF  ID  X   MEM WB
or r5, r3, r5                   IF  ID  X   MEM WB
sw r6, 12(r3)                       IF  ID  X   MEM WB

[Figure: arrows drawn from the stage that produces r3 to each stage that reads it; backwards arrows require time travel]
Data Hazards
Data Hazards
• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• next instructions may read values about to be written

e.g. add r3, r1, r2


sub r5, r3, r4

How to detect?
Detecting Data Hazards
[Figure: pipelined datapath with sub r5,r3,r4 in IF/ID and add r3, r1, r2 in ID/EX. A "detect hazard" unit in decode compares the source registers in IF/ID against the Rd fields held in ID/EX, EX/MEM, and MEM/WB.]
Hazard condition:
IF/ID.Ra ≠ 0 &&
(IF/ID.Ra==ID/Ex.Rd ||
IF/ID.Ra==Ex/M.Rd ||
IF/ID.Ra==M/W.Rd)
(repeat for Rb)
Takeaway
Data hazards occur when an operand (register) depends on
the result of a previous instruction that may not be
computed yet. A pipelined processor needs to detect data
hazards.
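The detection condition from the "Detecting Data Hazards" slide can be written as a predicate. This is a sketch of that condition only, assuming a nop carries Rd = 0 so that empty pipeline slots never match; the function and parameter names are illustrative.

```python
def data_hazard(if_id_ra, if_id_rb, id_ex_rd, ex_m_rd, m_w_rd):
    """True if the instruction in ID reads a register that an older,
    still in-flight instruction has yet to write."""
    in_flight = (id_ex_rd, ex_m_rd, m_w_rd)

    def conflicts(src):
        # Register 0 is hardwired to zero, so it never causes a hazard.
        return src != 0 and src in in_flight

    return conflicts(if_id_ra) or conflicts(if_id_rb)


# sub r5, r3, r4 in ID while add r3, r1, r2 sits in ID/EX
print(data_hazard(3, 4, id_ex_rd=3, ex_m_rd=0, m_w_rd=0))  # True
```

This version flags a hazard for all three in-flight positions, which matches a pipeline without forwarding or register-file bypass.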
Next Goal
What to do if data hazard detected?
iClicker
What to do if data hazard detected?
A) Wait/Stall
B) Reorder in Software (SW)
C) Forward/Bypass
D) All the above
E) None. We will use some other method
Possible Responses to Data Hazards
1.Do Nothing
• Change the ISA to match implementation
• “Hey compiler: don’t create code w/data hazards!”
(We can do better than this)
2.Stall
• Pause current and subsequent instructions till safe
3.Forward/bypass
• Forward data value to where it is needed
(Only works if value actually exists already)

Stalling
How to stall an instruction in ID stage
• prevent IF/ID pipeline register update
– stalls the ID stage instruction
• convert ID stage instr into nop for later stages
– innocuous “bubble” passes through pipeline
• prevent PC update
– stalls the next (IF stage) instruction
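The effect of these three rules on a program can be sketched with a small scheduler. It assumes a five-stage pipeline with no forwarding and no register-file bypass, so a dependent instruction leaves ID only in the cycle after its producer's write-back. The name `stall_schedule` and the (text, dest, sources) tuple format are illustrative.

```python
def stall_schedule(program):
    """program: list of (text, dest, sources).
    Returns a list of (text, first_cycle, stages), one stage name per cycle."""
    wb_cycle = {}                      # register -> cycle its latest writer is in W
    prev_id_first, prev_id_last = 1, 1
    rows = []
    for text, dest, sources in program:
        if_first = prev_id_first       # fetched when the previous insn enters ID
        id_first = prev_id_last + 1    # enters ID when the previous insn leaves ID
        ready = [wb_cycle[s] + 1 for s in sources if s != 0 and s in wb_cycle]
        id_last = max([id_first] + ready)  # held in ID until operands are written
        stages = (["IF"] * (id_first - if_first)
                  + ["ID"] * (id_last - id_first + 1)
                  + ["Ex", "M", "W"])
        rows.append((text, if_first, stages))
        if dest != 0:
            wb_cycle[dest] = id_last + 3
        prev_id_first, prev_id_last = id_first, id_last
    return rows


program = [
    ("add r3, r1, r2", 3, (1, 2)),
    ("sub r5, r3, r5", 5, (3, 5)),
    ("or r6, r3, r4", 6, (3, 4)),
    ("add r6, r3, r8", 6, (3, 8)),
]
schedule = stall_schedule(program)
for text, first, stages in schedule:
    print(f"{text:<16} starts cycle {first}: {' '.join(stages)}")
```

For this program the sub is held in ID for three extra cycles and the or is held in IF for the same three cycles, which is the pattern shown in the stalling time graph.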
Detecting Data Hazards
[Figure: pipelined datapath with a "detect hazard" unit in decode. Code in flight: add r3, r1, r2; sub r5, r3, r5; or r6, r3, r4; add r6, r3, r8.]
If detect hazard:
• MemWr=0
• RegWr=0
• WE=0
Stalling
Clock cycle
time 1 2 3 4 5 6 7 8

r3 = 10 before add writes back, r3 = 20 after
add r3, r1, r2: IF ID Ex M W
sub r5, r3, r5 (from cycle 2): IF ID ID ID ID Ex M W (3 stalls)
or r6, r3, r4 (from cycle 3): IF IF IF IF ID Ex M
add r6, r3, r8 (from cycle 7): IF ID Ex


Stalling
[Figure: three successive cycles of the stalled pipeline. sub r5,r3,r5 is held in IF/ID and or r6,r3,r4 is held in fetch (WE=0, stall) while a nop (MemWr=0, RegWr=0) is injected each cycle behind add r3,r1,r2 as it moves through ID/EX, EX/MEM, and MEM/WB. The stall condition is met in each of the three cycles: first against ID/Ex.Rd, then Ex/M.Rd, then M/W.Rd.]
NOP = If(IF/ID.rA ≠ 0 &&
(IF/ID.rA==ID/Ex.Rd ||
IF/ID.rA==Ex/M.Rd ||
IF/ID.rA==M/W.Rd))
Takeaway
Data hazards occur when an operand (register) depends on
the result of a previous instruction that may not be
computed yet. A pipelined processor needs to detect data
hazards.

Stalling, preventing a dependent instruction from
advancing, is one way to resolve data hazards.

Stalling introduces NOPs (“bubbles”) into a pipeline.
Introduce NOPs by (1) preventing the PC from updating,
(2) preventing the IF/ID registers from changing, and
(3) preventing writes to memory and register file.
*Bubbles in pipeline significantly decrease performance.
Forwarding
Forwarding bypasses some pipeline stages by
forwarding a result directly to a dependent instruction's
operand (register).

Three types of forwarding/bypass
• Forwarding from Ex/Mem registers to Ex stage (M → Ex)
• Forwarding from Mem/WB register to Ex stage (W → Ex)
• RegisterFile Bypass
Add the Forwarding Datapath
[Figure: pipelined datapath (IF/ID, ID/Ex, Ex/Mem, Mem/WB) with a forward unit. The unit compares Ra and Rb in ID/Ex against Rd, with MC and WE, in Ex/Mem and Mem/WB, and steers muxes at the ALU inputs. The detect hazard unit remains in decode.]
Forwarding Datapath 1: Ex/MEM → EX
[Figure: sub r5, r3, r1 in EX while add r3, r1, r2 is in MEM; a bypass runs from Ex/Mem.D back to the ALU input]

add r3, r1, r2    IF  ID  Ex  M   W
sub r5, r3, r1        IF  ID  Ex  M   W

Problem: EX needs ALU result that is in MEM stage
Solution: add a bypass from EX/MEM.D to start of EX

Detection Logic in Ex Stage:
forward = (Ex/M.WE && Ex/M.Rd != 0 &&
ID/Ex.Ra == Ex/M.Rd)
|| (same for Rb)
Forwarding Datapath 2: Mem/WB → EX
[Figure: or r6, r3, r4 in EX, sub r5, r3, r1 in MEM, add r3, r1,r2 in WB; a bypass runs from the WB final value back to the ALU input]

add r3, r1, r2    IF  ID  Ex  M   W
sub r5, r3, r1        IF  ID  Ex  M   W
or r6, r3, r4             IF  ID  Ex  M   W

Problem: EX needs value being written by WB
Solution: Add bypass from WB final value to start of EX

Detection Logic:
forward = (M/WB.WE && M/WB.Rd != 0 &&
ID/Ex.Ra == M/WB.Rd &&
not (Ex/M.WE && Ex/M.Rd != 0 &&
ID/Ex.Ra == Ex/M.Rd))
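The two detection rules combine into one operand-select decision per ALU input. The sketch below is an assumption-laden illustration: `Latch` and `forward_source` are hypothetical names, and the priority (Ex/M before M/WB) encodes the "not" clause of the Mem/WB rule so the newest value wins.

```python
from collections import namedtuple

# Write-enable and destination register held in a pipeline register
Latch = namedtuple("Latch", ["we", "rd"])


def forward_source(src, ex_m, m_wb):
    """Pick where an EX-stage operand with register index src comes from."""
    if ex_m.we and ex_m.rd != 0 and src == ex_m.rd:
        return "Ex/M"   # M -> Ex: result of the immediately preceding insn
    if m_wb.we and m_wb.rd != 0 and src == m_wb.rd:
        return "M/WB"   # W -> Ex: used only when Ex/M did not match
    return "ID/Ex"      # no forwarding: value read from the register file


# sub r5, r3, r1 in EX right after add r3, r1, r2
print(forward_source(3, Latch(True, 3), Latch(False, 0)))  # Ex/M
```

The same function would be called once for Ra and once for Rb.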
Register File Bypass
[Figure: add r6, r3, r8 in ID, or r6, r3, r4 in EX, sub r5, r3, r1 in MEM, add r3, r1,r2 in WB]

Problem: Reading a value that is currently being written
Solution: just negate register file clock
• writes happen at end of first half of each clock cycle
• reads happen during second half of each clock cycle

add r3, r1, r2    IF  ID  Ex  M   W
sub r5, r3, r1        IF  ID  Ex  M   W
or r6, r3, r4             IF  ID  Ex  M   W
add r6, r3, r8                IF  ID  Ex  M   W
Forwarding Example 2
Clock cycle       1   2   3   4   5   6   7   8   9

add r3, r1, r2    IF  ID  Ex  M   W
sub r5, r3, r4        IF  ID  Ex  M   W
lw r6, 4(r3)              IF  ID  Ex  M   W
or r5, r3, r6                 IF  ID  Ex  M   W
sw r6, 12(r3)                     IF  ID  Ex  M   W
Forwarding Example 2
Clock cycle       1   2   3   4   5   6   7   8   9

add r3, r1, r2    IF  ID  Ex  M   W
sub r5, r3, r4        IF  ID  Ex  M   W
lw r6, 4(r3)              IF  ID  Ex  M   W
or r5, r3, r5                 IF  ID  Ex  M   W
sw r6, 12(r3)                     IF  ID  Ex  M   W

backwards arrows require time travel
Load-Use Hazard Explained
[Figure: or r6, r3, r4 in EX while lw r4, 20(r8) is in MEM]

Data dependency after a load instruction:
• Value not available until after the M stage
• Next instruction cannot proceed if dependent
THE KILLER HAZARD
Load-Use Stall
[Figure: three steps. The dependent or is held in decode for one cycle while a NOP is inserted behind lw r4, 20(r8); the or then proceeds to EX once the loaded value is available.]

lw r4, 20(r8)     IF  ID  Ex  M   W
or r6, r3, r4         IF  ID* ID  Ex  M   W

ID* marks the stall cycle.
Load-Use Detection
[Figure: pipelined datapath with detect hazard and forward units. The detect hazard unit looks at Ra and Rb in IF/ID and at MC and Rd in ID/Ex.]

Stall = If(ID/Ex.MemRead &&
IF/ID.Ra == ID/Ex.Rd)
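The load-use stall condition can be sketched as a predicate. The slide shows the check for Ra only; checking Rb as well and ignoring loads into register 0 are assumptions added here, and the names are illustrative.

```python
def load_use_stall(id_ex_mem_read, id_ex_rd, if_id_ra, if_id_rb):
    """True if the instruction in ID needs a value that the load
    currently in EX will not have until after its M stage."""
    if not id_ex_mem_read or id_ex_rd == 0:
        return False  # not a load, or a load into r0 (assumed harmless)
    return id_ex_rd in (if_id_ra, if_id_rb)


# or r6, r3, r4 in ID while lw r4, 20(r8) is in EX
print(load_use_stall(True, 4, if_id_ra=3, if_id_rb=4))  # True
```

Only one stall cycle is needed; after that the loaded value can be forwarded from Mem/WB to EX.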
Incorrectly Resolving Load-Use Hazards
[Figure: pipelined datapath with detect hazard and forward units, showing the incorrect bypass]

Most frequent 3410 non-solution to load-use hazards
Why is this “solution” so so so so so so awful?
iClicker Question
Forwarding values directly from Memory to the
Execute stage without storing them in a register
first:

A. Does not remove the need to stall.
B. Adds one too many possible inputs to the ALU.
C. Will cause the pipeline register to have the
wrong value.
D. Halves the frequency of the processor.
E. Both A & D
Resolving Load-Use Hazards
Two MIPS Solutions:
• MIPS 2000/3000: delay slot
– ISA says results of loads are not available until one
cycle later
– Assembler inserts nop, or reorders to fill delay slot

• MIPS 4000 onwards: stall


– But really, programmer/compiler reorders to avoid
stalling in the load delay slot

Takeaway
Data hazards occur when an operand (register) depends on the result of
a previous instruction that may not be computed yet. A pipelined
processor needs to detect data hazards.

Stalling, preventing a dependent instruction from advancing, is one way
to resolve data hazards. Stalling introduces NOPs (“bubbles”) into a
pipeline. Introduce NOPs by (1) preventing the PC from updating, (2)
preventing the IF/ID registers from changing, and (3) preventing
writes to memory and register file. Bubbles (nops) in pipeline
significantly decrease performance.

Forwarding bypasses some pipeline stages by forwarding a result directly
to a dependent instruction's operand (register). Better performance than
stalling.
Quiz
Find all hazards, and say how they are resolved:

add r3, r1, r2
nand r5, r3, r4
add r2, r6, r3
lw r6, 24(r3)
sw r6, 12(r2)

5 Hazards
• nand r5, r3, r4: Forwarding from Ex/M → ID/Ex (M → Ex)
• add r2, r6, r3: Forwarding from M/W → ID/Ex (W → Ex)
• lw r6, 24(r3): RegisterFile (RF) Bypass
• sw r6, 12(r2): Forwarding from M/W → ID/Ex (W → Ex)
• sw r6, 12(r2): Stall + Forwarding from M/W → ID/Ex (W → Ex)
Quiz 2
Find all hazards, and say how they are resolved:

add r3, r1, r2
sub r3, r2, r1
nand r4, r3, r1
or r0, r3, r4
xor r1, r4, r3
sb r4, 1(r0)

Hours and hours of debugging!


Data Hazard Recap
Delay Slot(s)
• Modify ISA to match implementation

Stall
• Pause current and all subsequent instructions

Forward/Bypass
• Try to steal correct value from elsewhere in pipeline
• Otherwise, fall back to stalling or require a delay slot

Tradeoffs?
A bit of Context
i = 0;
do {
  n += 2;
  i++;
} while(i < max)
i = 7;
n--;

Assume: i → r1, n → r2, max → r3

x10       addiu r1, r0, 0     # i=0
x14 Loop: addiu r2, r2, 2     # n += 2
x18       addiu r1, r1, 1     # i++
x1C       blt r1, r3, Loop    # i<max?
Control Hazards
• instructions are fetched in stage 1 (IF)
• branch and jump decisions occur in stage 3 (EX)
→ next PC not known until 2 cycles after branch/jump

x1C blt r1, r3, Loop
x20 addiu r1, r0, 7
x24 subi r2, r2, 1

Branch not taken? No Problem!
Branch taken? Just fetched 2 addi’s → Zap & Flush
Zap & Flush
• prevent PC update
• clear IF/ID latch
• branch continues

[Figure: datapath with branch calc and decide branch in EX feeding the new PC back to fetch]
New PC = 14
If branch Taken → Zap

1C blt r1,r3,L        IF  ID  Ex  M   W
20 addiu r1,r0,7          IF  ID  NOP NOP NOP
24 subi r2,r2,1               IF  NOP NOP NOP NOP
14 L:addi r2,r2,2                 IF  ID  Ex  M   W

For every taken branch? OUCH!!!
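The cost of zapping can be put into a rough CPI estimate. The formula below is a common back-of-the-envelope model, not something stated on the slide, and the branch and taken fractions are made-up numbers for illustration.

```python
def effective_cpi(branch_fraction, taken_fraction,
                  zapped_per_taken_branch=2, base_cpi=1.0):
    """Base CPI plus the average number of zapped slots per instruction."""
    return base_cpi + branch_fraction * taken_fraction * zapped_per_taken_branch


# Assumed workload: 20% of instructions are branches, 60% of those are taken
cpi = effective_cpi(branch_fraction=0.20, taken_fraction=0.60)
print(round(cpi, 2))  # 1.24
```

Lowering `zapped_per_taken_branch` from 2 to 1, or to 0, is what the techniques for reducing the cost of control hazards aim for.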
Reducing the cost of control hazard
1. Delay Slot
• You MUST do this
• MIPS ISA: 1 insn after ctrl insn always executed
• Whether branch taken or not
2. Resolve Branch at Decode
• Some groups do this for Project 3, your choice
• Move branch calc from EX to ID
• Alternative: just zap 2nd instruction when branch taken
3. Branch Prediction
• Not in 3410, but every processor worth anything does this
(no offense!)

Problem: Zapping 2 insns/branch
[Figure: datapath with branch calc and decide branch in EX]
If branch Taken → Zap

1C blt r1, r3, Loop     F   D   X
20 addiu r1, r0, 7          F   D
24 subi r2, r2, 1               F

Zap!
Solution #1: Delay Slot
i = 0;
do {
    n += 2;
    i++;
} while(i < max)
i = 7;
n--;

Assume: i → r1, n → r2, max → r3

x10        addiu r1, r0, 0      # i=0
x14 Loop:  addiu r2, r2, 2      # n += 2
x18        addiu r1, r1, 1      # i++
x1C        blt   r1, r3, Loop   # i<max?
Delay Slot in Action

[Figure: pipelined datapath (PC, +4, inst mem, register outputs A and B, D, data mem) with branch calc and branch decide; New PC = 1C]

If branch Taken → Zap
1C blt r1, r3, Loop    F  D  X
20 nop                    F  D
24 addiu r1, r0, 7           F

Zap!
Soln #2: Resolve Branches @ Decode

[Figure: pipelined datapath (PC, +4, inst mem, register outputs A and B, D, data mem) with branch calc and branch decide; New PC = 1C]

If branch Taken → No Zapping
1C blt r1, r3, Loop       F  D  X
20 nop                       F  D
14 Loop: addiu r2,r2,2          F

No Zapping!
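The cases on these slides (resolve in Ex, add a delay slot, resolve in ID) follow one pattern: the number of zapped instructions is the number fetched before the decision, minus the delay slots the ISA always executes. A small Python sketch of that bookkeeping; `zapped_per_taken_branch` is an illustrative name, and the model assumes one instruction is fetched per cycle.

```python
def zapped_per_taken_branch(resolve_stage, delay_slots=0):
    """Instructions squashed on a taken branch.

    resolve_stage: 1-based stage where the branch decision is made (ID=2, Ex=3).
    delay_slots:   instructions after the branch that the ISA always executes.
    """
    fetched_early = resolve_stage - 1   # younger insns fetched before the decision
    return max(0, fetched_early - delay_slots)

for name, stage, slots in [("Ex, no delay slot", 3, 0),
                           ("Ex, 1 delay slot", 3, 1),
                           ("ID, 1 delay slot", 2, 1)]:
    print(name, "->", zapped_per_taken_branch(stage, slots))
```

The three rows give 2, 1 and 0 zapped instructions, matching the "Zapping 2 insns/branch", "Delay Slot in Action" and "No Zapping" slides.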
Optimization: Fill the Delay Slot
x10        addiu r1, r0, 0      # i=0
x14 Loop:  addiu r2, r2, 2      # n += 2
x18        addiu r1, r1, 1      # i++
x1C        blt   r1, r3, Loop   # i<max?
x20        nop

Compiler transforms code

x10        addiu r1, r0, 0      # i=0
x14 Loop:  addiu r1, r1, 1      # i++
x18        blt   r1, r3, Loop   # i<max?
x1C        addiu r2, r2, 2      # n += 2 (delay slot)
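How might a compiler find an instruction to put in the delay slot? One simple approach, sketched below in Python, is to look backwards from the branch for an instruction with no register dependence on anything it would be moved past. This is an illustrative model, not the course's or any real compiler's algorithm: `Insn`, `conflicts` and `fill_delay_slot` are hypothetical names, and the block is assumed to be straight-line code ending in a branch followed by a nop.

```python
from typing import NamedTuple, Optional, Tuple

class Insn(NamedTuple):
    op: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read

def conflicts(a, b):
    """True if moving `a` past `b` could change results (RAW, WAW or WAR)."""
    a_writes_b = a.dest is not None and (a.dest in b.srcs or a.dest == b.dest)
    b_writes_a = b.dest is not None and b.dest in a.srcs
    return a_writes_b or b_writes_a

def fill_delay_slot(block):
    """block = [..., branch, nop]; return it with the nop replaced if possible."""
    *body, branch, slot = block
    if slot.op != "nop":
        return list(block)
    for i in range(len(body) - 1, -1, -1):          # nearest candidate first
        cand = body[i]
        if not any(conflicts(cand, later) for later in body[i + 1:] + [branch]):
            return body[:i] + body[i + 1:] + [branch, cand]
    return list(block)                              # nothing safe to move

loop = [Insn("addiu", "r2", ("r2",)),               # n += 2
        Insn("addiu", "r1", ("r1",)),               # i++
        Insn("blt", None, ("r1", "r3")),            # i < max?
        Insn("nop", None, ())]
for insn in fill_delay_slot(loop):
    print(insn.op, insn.dest, insn.srcs)
```

On the example loop, `i++` cannot move because the branch reads r1, so `n += 2` ends up in the delay slot, as on the slide.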
Optimization In Action!

[Figure: pipelined datapath (PC, +4, inst mem, register outputs A and B, D, data mem) with branch calc and branch decide; New PC = 1C]

1C blt r1, r3, Loop      F  D  X
20 addi r2,r2,2             F  D
14 Loop: addi r1,r1,1          F

No Nop or Zapping!
Note: Insn in delay slot will always be executed whether branch taken or not
Branch Prediction
Most processors support Speculative Execution
• Guess direction of the branch
– Allow instructions to move through pipeline
– Zap them later if guess turns out to be wrong
• A must for long pipelines

Speculative Execution: Loops
Pipeline so far
• “Guess” (predict) that the branch will not be taken

We can do better!
• Make prediction based on last branch
• Predict “take branch” if last branch “taken”
• Or predict “do not take branch” if last branch “not taken”

• Need one bit to keep track of last branch
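A last-outcome predictor is small enough to simulate directly. The Python sketch below counts wrong guesses for a sequence of branch outcomes; `one_bit_misses` is an illustrative name, and the initial guess of "not taken" is an assumption chosen to match the pipeline so far.

```python
def one_bit_misses(outcomes, last_taken=False):
    """Count mispredictions when each guess repeats the previous outcome.

    outcomes: iterable of booleans, True = branch taken.
    last_taken: the single bit of state (assumed to start as "not taken").
    """
    misses = 0
    for taken in outcomes:
        if taken != last_taken:     # guess was the previous outcome
            misses += 1
        last_taken = taken          # remember this outcome for next time
    return misses

# Loop-exit branch at the top of a while loop: not taken on each of 4
# iterations, then taken once to leave. The loop is executed 3 times.
outcomes = ([False] * 4 + [True]) * 3
print(one_bit_misses(outcomes), "misses out of", len(outcomes))
```

After the first execution of the loop, the predictor misses twice per execution: once on the first iteration and once on exit.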


Speculative Execution: Loops

What is the accuracy of the branch predictor?
Wrong twice per loop! Once on loop entry and once on exit
We can do better with 2 bits

While (r3 ≠ 0) {…. r3--;}
Top:  BEQZ r3, End
      J Top
End:

While (r3 ≠ 0) {…. r3--;}
Top2: BEQZ r3, End2
      J Top2
End2:
Speculative Execution: Branch Execution

[Figure: state diagram of a 2-bit branch predictor. States: Predict Taken 2 (PT2), Predict Taken 1 (PT1), Predict Not Taken 2 (PNT2), Predict Not Taken 1 (PNT1). Edges are labeled Branch Taken (T) or Branch Not Taken (NT).]
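A common way to build a four-state predictor like this one is a 2-bit saturating counter. The Python sketch below assumes that scheme; the exact transitions drawn on the slide may differ. `two_bit_misses` and the counter encoding (0 and 1 predict not taken, 2 and 3 predict taken) are illustrative choices.

```python
def two_bit_misses(outcomes, counter=0):
    """Count mispredictions of a 2-bit saturating-counter predictor.

    outcomes: iterable of booleans, True = branch taken.
    counter:  0..3; values 2 and 3 predict taken. Assumed to start at 0
              (strongly predict not taken).
    """
    misses = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses

# Same loop-exit branch as before: 4 not-taken iterations, then one taken
# exit, with the loop executed 3 times.
outcomes = ([False] * 4 + [True]) * 3
print(two_bit_misses(outcomes), "misses out of", len(outcomes))
```

Because a single surprising outcome only weakens the prediction, the exit branch is mispredicted once per loop execution instead of twice.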


Summary
Control hazards
• Is branch taken or not?
• Performance penalty: stall and flush

Reduce cost of control hazards
• Move branch decision from Ex to ID
– 2 nops to 1 nop
• Delay slot
– Compiler puts useful work in delay slot. ISA level.
• Branch prediction
– Correct. Great!
– Wrong. Flush pipeline. Performance penalty
Hazards Summary
Data hazards
• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• next instructions may read values soon to be written

Control hazards
• branch instruction may change the PC in stage 3 (EX)
• next instructions have already started executing

Structural hazards
• resource contention
• so far: impossible because of ISA and pipeline design
Data Hazard Takeaways
Data hazards occur when an operand (register) depends on the result
of a previous instruction that may not be computed yet. Pipelined
processors need to detect data hazards.

Stalling, preventing a dependent instruction from advancing, is one
way to resolve data hazards. Stalling introduces NOPs (“bubbles”)
into a pipeline. Introduce NOPs by (1) preventing the PC from
updating, (2) preventing the IF/ID registers from changing, and
(3) preventing writes to memory and register file. NOPs significantly
decrease performance.

Forwarding bypasses some pipeline stages by sending a result directly
to a dependent instruction's operand (register). Better performance
than stalling.
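How many bubbles does stalling cost? With register reads in stage 2 (ID) and writes in stage 5 (WB), the count depends only on how far the consumer is behind the producer. The Python sketch below works this out for a pipeline without forwarding. `stalls_without_forwarding` is an illustrative name, and `same_cycle_ok` models an assumption about the register file (a value written in a cycle can be read in that same cycle).

```python
def stalls_without_forwarding(distance, read_stage=2, write_stage=5,
                              same_cycle_ok=True):
    """Bubbles needed for a consumer `distance` instructions after its producer.

    distance=1 means the consumer immediately follows the producer.
    Assumption: with same_cycle_ok=True the register file writes before it
    reads within one cycle, so a read may share the producer's WB cycle.
    """
    gap = write_stage - read_stage              # ID-to-WB distance in cycles
    needed = gap if same_cycle_ok else gap + 1  # minimum safe instruction distance
    return max(0, needed - distance)

for d in (1, 2, 3):
    print("distance", d, "->", stalls_without_forwarding(d), "stall cycles")
```

Forwarding is what removes most of these bubbles, which is why it performs better than stalling.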
Control Hazard Takeaways
Control hazards occur because the PC following a control
instruction is not known until the control instruction is executed.
If the branch is taken → need to zap instructions. Performance
penalty: 2 cycles when branches resolve in Ex, 1 cycle when they
resolve in ID.

Delay slots can recover some of the performance lost to control
hazards. The instruction in the delay slot will always be executed.
Requires software (compiler) to make use of the delay slot. Put a nop
in the delay slot if no useful instruction can be placed there.

We can reduce the cost of a control hazard by moving the branch
decision and calculation from the Ex stage to the ID stage. With a
delay slot, this removes the need to flush instructions on taken
branches.