
Digital Design & Computer Arch.
Lecture 13: Pipelining

Prof. Onur Mutlu
ETH Zürich
Spring 2021
16 April 2021
Required Readings
 This week
 Pipelining
 H&H, Chapter 7.5
 Pipelining Issues
 H&H, Chapter 7.8.1-7.8.3

 Next week
 Out-of-order execution
 H&H, Chapter 7.8-7.9
 Smith and Sohi, “The Microarchitecture of Superscalar
Processors,” Proceedings of the IEEE, 1995
 More advanced pipelining
 Interrupt and exception handling
 Out-of-order and superscalar execution concepts

2
Agenda for Today & Next Few Lectures
 Last week & yesterday
 Single-cycle Microarchitectures
 Multi-cycle Microarchitectures

 Today & next week
 Pipelining
 Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …

 Next week & the week after
 Out-of-Order Execution
 Issues in OoO Execution: Load-Store Handling, …
3
Review: Single-Cycle MIPS Processor

[Figure: single-cycle MIPS datapath: control unit (Jump, MemtoReg, MemWrite, Branch, ALUControl, PCSrc, ALUSrc, RegDst, RegWrite) decoding Op and Funct, driving the PC, instruction memory, register file, ALU, data memory, sign-extend, and branch/jump address logic]

4
Review: Single-Cycle MIPS FSM
 Single-cycle machine

[Figure: combinational logic computes next architectural state AS’ from current state AS held in sequential logic (state)]

AS: Architectural State

5


Can We Do Better?

6
Review: Multi-Cycle MIPS Processor

[Figure: multi-cycle MIPS datapath: control unit (PCWrite, Branch, PCEn, IorD, PCSrc, MemWrite, ALUControl, IRWrite, ALUSrcA/B, RegWrite, MemtoReg, RegDst) with a single shared instruction/data memory, instruction register, register file, and ALU reused across cycles]

7
Review: Multi-Cycle MIPS FSM

[Figure: multi-cycle control FSM: S0: Fetch, S1: Decode, then per-opcode paths: S2: MemAdr -> S3: MemRead -> S4: Mem Writeback (LW) or S5: MemWrite (SW); S6: Execute -> S7: ALU Writeback (R-type); S8: Branch (BEQ); S9: ADDI Execute -> S10: ADDI Writeback; S11: Jump; each state asserts its control signals]

What is the shortcoming of this design?
What does this design assume about memory?

8
Can We Do Better?

9
Can We Do Better?
 What limitations do you see with the multi-cycle
design?

 Limited concurrency
 Some hardware resources are idle during different
phases of instruction processing cycle
 “Fetch” logic is idle when an instruction is being
“decoded” or “executed”
 Most of the datapath is idle when a memory access is
happening

10
Can We Use the Idle Hardware to Improve Concurrency?
 Goal: More concurrency  Higher instruction throughput (i.e., more “work” completed in one cycle)

 Idea: When an instruction is using some resources in its processing phase, process other instructions on idle resources not needed by that instruction
 E.g., when an instruction is being decoded, fetch the next instruction
 E.g., when an instruction is being executed, decode another instruction
 E.g., when an instruction is accessing data memory (ld/st), execute the next instruction
 E.g., when an instruction is writing its result into the register file, access data memory for the next instruction

11
Can Have Different Instructions in Different Stages

 Fetch              1. Instruction fetch (IF)
 Decode             2. Instruction decode and
 Evaluate Address      register operand fetch (ID/RF)
 Fetch Operands     3. Execute/Evaluate memory address (EX)
 Execute            4. Memory operand fetch (MEM)
 Store Result       5. Store/writeback result (WB)

12
Can Have Different Instructions in Different Stages

[Figure: the multi-cycle MIPS datapath again; with pipelining, the fetch, decode, execute, memory, and writeback hardware can each work on a different instruction]

Of course, we need to be more careful than this!

13


Pipelining

14
Pipelining: Basic Idea
 More systematically:
 Pipeline the execution of multiple instructions
 Analogy: “Assembly line processing” of instructions

 Idea:
 Divide the instruction processing cycle into distinct
“stages” of processing
 Ensure there are enough hardware resources to process
one instruction in each stage
 Process a different instruction in each stage
 Instructions consecutive in program order are processed in
consecutive stages

 Benefit: Increases instruction processing throughput (1/CPI)
 Downside: Start thinking about this…

15
Example: Execution of Four Independent ADDs
 Multi-cycle: 4 cycles per instruction

F D E W
        F D E W
                F D E W
                        F D E W
Time

 Pipelined: 4 cycles per 4 instructions (steady state)
   1 instruction completed per cycle

F D E W
  F D E W
    F D E W
      F D E W
Time

Is life always this beautiful?

16
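The two timelines above can be checked with a quick cycle count. A minimal sketch, assuming an ideal 4-stage (F, D, E, W) pipeline with no stalls; the function names are ours, not from the lecture:

```python
def multicycle_cycles(n_insts, n_stages=4):
    # Each instruction occupies the machine for all stages
    # before the next one is allowed to start.
    return n_insts * n_stages

def pipelined_cycles(n_insts, n_stages=4):
    # The first instruction takes n_stages cycles to complete;
    # each later one finishes one cycle after its predecessor.
    return n_stages + (n_insts - 1)

print(multicycle_cycles(4))  # 16 cycles for 4 independent ADDs
print(pipelined_cycles(4))   # 7 cycles
```

In steady state the fill and drain cycles amortize away, which is why the slide says "4 cycles per 4 instructions".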
The Laundry Analogy

[Figure: one load of laundry progressing through four sequential steps over a 6 PM to 2 AM timeline]

 “place one dirty load of clothes in the washer”
 “when the washer is finished, place the wet load in the dryer”
 “when the dryer is finished, take out the dry load and fold”
 “when folding is finished, ask your roommate (??) to put the clothes away”

- steps to do a load are sequentially dependent
- no dependence between different loads
- different steps do not share resources

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
17
Pipelining Multiple Loads of Laundry

[Figure: four loads pipelined over the same 6 PM to 2 AM timeline; load B starts washing while load A dries, and so on]

- 4 loads of laundry in parallel
- no additional resources
- throughput increased by 4
- latency per load is the same

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
18
Pipelining Multiple Loads of Laundry: In Practice

[Figure: pipelined laundry where the dryer takes longer than the washer, creating a bottleneck]

the slowest step (the dryer) decides throughput

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
19
Pipelining Multiple Loads of Laundry: In Practice

[Figure: the same pipeline with a second dryer operating in parallel, removing the bottleneck]

throughput restored (2 loads per hour) using 2 dryers

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
20
An Ideal Pipeline
 Goal: Increase throughput with little increase in cost
(hardware cost, in case of instruction processing)
 Repetition of identical operations
 The same operation is repeated on a large number of
different inputs (e.g., all laundry loads go through the
same steps)
 Repetition of independent operations
 No dependences between repeated operations
 Uniformly partitionable suboperations
 Processing can be evenly divided into uniform-latency
suboperations (that do not share resources)

 Fitting examples: automobile assembly line, doing laundry

21
Ideal Pipelining
BW means Bandwidth, same as Throughput (in this context)

combinational logic (F,D,E,M,W), delay T psec:             BW = ~(1/T)
two stages: T/2 ps (F,D,E) + T/2 ps (M,W):                 BW = ~(2/T)
three stages: T/3 ps (F,D) + T/3 ps (E,M) + T/3 ps (M,W):  BW = ~(3/T)

22
More Realistic Pipeline: Throughput
 Nonpipelined version with delay T
   BW = 1/(T+S) where S = register delay

 k-stage pipelined version
   BW(k-stage) = 1 / (T/k + S)    Register delay reduces throughput (sequencing overhead b/w stages)
   BW(max) = 1 / (1 gate delay + S)

23
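The throughput formula above is easy to explore numerically. A minimal sketch; the values of T and S below are illustrative numbers we chose, not figures from the slides:

```python
def pipeline_bw(T, S, k):
    # Throughput of a k-stage pipeline with total logic delay T and
    # per-stage register (sequencing) overhead S: BW = 1 / (T/k + S).
    return 1.0 / (T / k + S)

T, S = 1000.0, 50.0  # picoseconds (illustrative)
for k in (1, 2, 5, 10, 100):
    print(k, pipeline_bw(T, S, k))
# As k grows, BW approaches but never exceeds the 1/S ceiling
# imposed by the register overhead.
```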
More Realistic Pipeline: Cost
 Nonpipelined version with combinational cost G
   Cost = G + R where R = register cost

 k-stage pipelined version
   Cost(k-stage) = G + R·k    Registers increase hardware cost

24
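The cost model can be sketched the same way; the values of G and R below are illustrative numbers we chose:

```python
def pipeline_cost(G, R, k):
    # Total hardware cost of a k-stage pipeline: combinational logic G
    # plus k pipeline registers of cost R each (Cost = G + R*k).
    return G + R * k

G, R = 10000, 200  # gate counts (illustrative)
print(pipeline_cost(G, R, 1))  # 10200
print(pipeline_cost(G, R, 5))  # 11000
```

Together with the throughput model, this captures the basic trade-off: deeper pipelines raise throughput toward 1/S but pay a linearly growing register cost.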
Pipelining Instruction
Processing

25
Remember: The Instruction Processing Cycle

 FETCH
 DECODE
 EVALUATE ADDRESS
 FETCH OPERANDS
 EXECUTE
 STORE RESULT

26
Remember: The Instruction Processing Cycle

 Fetch              1. Instruction fetch (IF)
 Decode             2. Instruction decode and
 Evaluate Address      register operand fetch (ID/RF)
 Fetch Operands     3. Execute/Evaluate memory address (EX/AG)
 Execute            4. Memory operand fetch (MEM)
 Store Result       5. Store/writeback result (WB)

27
Remember the Single-Cycle Uarch

[Figure: single-cycle MIPS datapath (P&H style): PC, instruction memory, register file, ALU, data memory, sign extend, and jump/branch address logic; PCSrc1 = Jump, PCSrc2 = Br Taken. Whole datapath delay T, BW = ~(1/T)]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
28
Dividing Into Stages

[Figure: the single-cycle datapath partitioned into five stages with latencies: IF: Instruction fetch (200ps), ID: Instruction decode/register file read (100ps), EX: Execute/address calculation (200ps), MEM: Memory access (200ps), WB: Write back (100ps); the RF write happens in WB; ignore the branch/jump logic for now]

Is this the correct partitioning?
Why not 4 or 6 stages? Why not different boundaries?
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
29
Instruction Pipeline Throughput

[Figure: three lw instructions (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)) shown nonpipelined and pipelined. Nonpipelined: each instruction takes 800ps (Instruction fetch, Reg, ALU, Data access, Reg) and a new one starts every 800ps. Pipelined: a new instruction starts every 200ps, the latency of the slowest stage]

5-stage speedup is 4, not 5 as predicted by the ideal model. Why?

30
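The answer follows from the stage latencies in the earlier "Dividing Into Stages" slide: the clock must accommodate the slowest stage, so the cycle only shrinks from 800ps to 200ps, not to the 800/5 = 160ps an ideally balanced pipeline would allow:

```python
# Stage latencies from the 5-stage partitioning: IF, ID, EX, MEM, WB (in ps).
stage_latencies = [200, 100, 200, 200, 100]

nonpipelined_cycle = sum(stage_latencies)  # 800 ps per instruction
pipelined_cycle = max(stage_latencies)     # 200 ps, set by the slowest stage

speedup = nonpipelined_cycle / pipelined_cycle
print(speedup)  # 4.0, not 5: the stages are not uniformly partitioned
```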
Enabling Pipelined Processing: Pipeline Registers

No resource is used by more than one stage

[Figure: the five-stage datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between stages; per-stage values such as PCD+4, PCE+4, nPCM, IRD, AE, BE, ImmE, AoutM, BM, AoutW, MDRW are latched from stage to stage. Per-stage delay is T/k ps instead of T ps]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
31
Pipelined Operation Example
All instruction classes must follow the same path and timing through the pipeline stages.

[Figure: a single lw traced through the pipeline across five clock cycles: Instruction fetch, Instruction decode, Execution, Memory, Write back, with its state carried in the IF/ID, ID/EX, EX/MEM, MEM/WB registers]

Any performance impact?
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
32
Pipelined Operation Example

[Figure: lw $10, 20($1) followed by sub $11, $2, $3 traced through the pipeline; clocks 1 through 6 show the two instructions occupying successive stage pairs (fetch/decode, decode/execute, execute/memory, memory/writeback)]

Is life always this beautiful?
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
33
Illustrating Pipeline Operation: Operation View

        t0   t1   t2   t3   t4   t5
Inst0   IF   ID   EX   MEM  WB
Inst1        IF   ID   EX   MEM  WB
Inst2             IF   ID   EX   MEM  ...
Inst3                  IF   ID   EX   ...
Inst4                       IF   ID   ...

steady state (full pipeline) is reached at t4

34
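The operation view can be generated mechanically. A minimal sketch of our own, assuming an ideal 5-stage pipeline with no stalls, where instruction i enters IF at cycle i:

```python
# Which stage each instruction occupies at each cycle of an ideal pipeline.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_at(inst, cycle):
    # Instruction `inst` is in stage (cycle - inst); None if it is not
    # yet fetched or has already retired.
    s = cycle - inst
    return STAGES[s] if 0 <= s < len(STAGES) else None

for i in range(5):
    row = [stage_at(i, t) or "." for t in range(6)]
    print(f"Inst{i}", *row)
```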
Illustrating Pipeline Operation:
Resource View
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10

IF I0 I1 I2 I3 I4 I5 I6 I7 I8 I9 I10

ID I0 I1 I2 I3 I4 I5 I6 I7 I8 I9

EX I0 I1 I2 I3 I4 I5 I6 I7 I8

MEM I0 I1 I2 I3 I4 I5 I6 I7

WB I0 I1 I2 I3 I4 I5 I6

35
Control Points in a Pipeline

[Figure: the pipelined datapath annotated with its control signals: PCSrc, Branch, RegWrite, MemWrite, ALUSrc, MemtoReg, MemRead, ALUOp, RegDst]

Identical set of control points as the single-cycle datapath
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
36
Control Signals in a Pipeline
 For a given instruction
 same control signals as single-cycle, but
 control signals required at different cycles, depending on stage

⇒ Option 1: decode once using the same logic as single-cycle and buffer signals until consumed

[Figure: control signals generated in Decode and carried down the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers; the EX group is consumed first, then M, then WB]

⇒ Option 2: carry relevant “instruction word/field” down the pipeline and decode locally within each or in a previous stage

Which one is better?

37
Pipelined Control Signals

[Figure: the pipelined datapath where the control signals (RegWrite, MemtoReg, MemWrite, Branch, ALUControl, ALUSrc, RegDst) are latched in the ID/EX, EX/MEM, and MEM/WB pipeline registers alongside the data they control]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
38
Carnegie Mellon

Another Example: Single-Cycle and Pipelined

[Figure: the H&H single-cycle MIPS datapath (top) and the same datapath cut into Fetch, Decode, Execute, Memory, Writeback stages by pipeline registers (bottom); signals gain stage suffixes, e.g., InstrD, SrcAE, ALUOutM, ReadDataW]

39
Carnegie Mellon

Another Example: Correct Pipelined Datapath

[Figure: the pipelined datapath with the destination register number carried through WriteRegE4:0, WriteRegM4:0, WriteRegW4:0 so that it reaches the register file together with ResultW in the Writeback stage]

 WriteReg control signal must arrive at the same time as Result

Pipelined processor. Harris and Harris, Chapter 7.5
40
Carnegie Mellon

Another Example: Pipelined Control

[Figure: the control unit decodes Op and Funct in the Decode stage; RegWrite, MemtoReg, MemWrite, Branch, ALUControl, ALUSrc, and RegDst are pipelined with D/E/M/W suffixes to the stages that consume them, producing PCSrcM in the Memory stage]

 Same control unit as single-cycle processor
 Control delayed to proper pipeline stage

41
Remember: An Ideal Pipeline
 Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing)
 Repetition of identical operations
 The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
 Repetition of independent operations
 No dependences between repeated operations
 Uniformly partitionable suboperations
 Processing can be evenly divided into uniform-latency suboperations (that do not share resources)

 Fitting examples: automobile assembly line, doing laundry

42
Instruction Pipeline: Not An Ideal Pipeline
 Identical operations ... NOT!
 different instructions  not all need the same stages
 Forcing different instructions to go through the same pipe stages
  external fragmentation (some pipe stages idle for some instructions)

 Uniform suboperations ... NOT!
 different pipeline stages  not the same latency
 Need to force each stage to be controlled by the same clock
  internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)

 Independent operations ... NOT!
 instructions are not independent of each other
 Need to detect and resolve inter-instruction dependences to ensure correct execution

43
Issues in Pipeline Design
 Balancing work in pipeline stages
 How many stages and what is done in each stage

 Keeping the pipeline correct, moving, and full in the


presence of events that disrupt pipeline flow
 Handling dependences
 Data
 Control
 Handling resource contention
 Handling long-latency (multi-cycle) operations

 Handling exceptions, interrupts


 Advanced: Improving pipeline throughput
 Minimizing stalls
44
Causes of Pipeline Stalls
 Stall: A condition when the pipeline stops moving

 Resource contention

 Dependences (between instructions)


 Data
 Control

 Long-latency (multi-cycle) operations

45
Dependences and Their Types
 Also called “dependency” or less desirably “hazard”

 Dependences dictate ordering requirements


between instructions

 Two types
 Data dependence
 Control dependence

 Resource contention is sometimes called resource


dependence
 However, this is not fundamental to (dictated by)
program semantics, so we will treat it separately
46
Handling Resource Contention
 Happens when instructions in two pipeline stages
need the same resource

 Solution 1: Eliminate the cause of contention


 Duplicate the resource or increase its throughput
 E.g., use separate instruction and data memories
(caches)
 E.g., use multiple ports for memory structures

 Solution 2: Detect the resource contention and stall


one of the contending stages
 Which stage do you stall?
 Example: What if you had a single read and write port
for the register file?
47
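Solution 2 can be illustrated with a hypothetical detector for the classic IF/MEM conflict on a single shared memory. This is our own sketch, not code from the lecture; it assumes an ideal stall-free schedule (instruction i in IF at cycle i, in MEM at cycle i+3) and merely reports the cycles in which the one memory port would be claimed by both stages:

```python
STAGES = 5
IF_STAGE, MEM_STAGE = 0, 3  # stage indices in a 5-stage pipeline

def memory_conflicts(insts):
    """insts[i] is True if instruction i is a load/store.
    Returns the cycles in which a ld/st in MEM and an instruction in IF
    would both need the single shared memory."""
    n = len(insts)
    conflicts = []
    for t in range(n + STAGES - 1):
        in_mem = t - MEM_STAGE  # instruction index in MEM at cycle t
        in_if = t - IF_STAGE    # instruction index in IF at cycle t
        if 0 <= in_mem < n and 0 <= in_if < n and insts[in_mem]:
            conflicts.append(t)
    return conflicts

# A load as the first of five instructions: its MEM (cycle 3) collides
# with the IF of the fourth instruction.
print(memory_conflicts([True, False, False, False, False]))  # [3]
```

A real pipeline would resolve each reported cycle by stalling IF for one cycle (or by splitting instruction and data memories, Solution 1).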
Carnegie Mellon

Example Resource Dependence: RegFile

 The register file can be read and written in the same cycle:
 write takes place during the 1st half of the cycle
 read takes place during the 2nd half of the cycle => no problem!!!
 However, operations that involve the register file have only half a clock cycle to complete the operation…

[Figure: pipeline diagram over cycles 1 to 8: add $s0, $s2, $s3; and $t0, $s0, $s1; or $t1, $s4, $s0; sub $t2, $s0, $s5, each passing through IM, RF read, ALU, DM, RF write; the write of $s0 overlaps the reads of $s0 by later instructions]

48
Data Dependences
 Types of data dependences
 Flow dependence (true data dependence – read after
write)
 Output dependence (write after write)
 Anti dependence (write after read)

 Which ones cause stalls in a pipelined machine?


 For all of them, we need to ensure semantics of the
program is correct
 Flow dependences always need to be obeyed because
they constitute true dependence on a value
 Anti and output dependences exist due to limited
number of architectural registers
 They are dependence on a name, not a value
 We will later see what we can do about them
49
Data Dependence Types
Flow dependence
r3  r1 op r2 Read-after-Write
r5  r3 op r4 (RAW)

Anti dependence
r3  r1 op r2 Write-after-Read
r1  r4 op r5 (WAR)

Output dependence
r3  r1 op r2 Write-after-Write
r5  r3 op r4 (WAW)
r3  r6 op r7

50
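The three dependence types in the table can be classified mechanically. A minimal sketch of our own (not from the lecture), representing each instruction as a destination register plus a set of source registers:

```python
def dependences(earlier, later):
    # earlier, later: (dest_reg, {source_regs}) with `earlier` first
    # in program order. Returns the set of dependence types present.
    e_dst, e_srcs = earlier
    l_dst, l_srcs = later
    found = set()
    if e_dst in l_srcs:
        found.add("RAW")  # flow: later reads what earlier wrote
    if l_dst in e_srcs:
        found.add("WAR")  # anti: later overwrites a source of earlier
    if l_dst == e_dst:
        found.add("WAW")  # output: both write the same register
    return found

# r3 <- r1 op r2 ; r5 <- r3 op r4  => flow dependence on r3
print(dependences(("r3", {"r1", "r2"}), ("r5", {"r3", "r4"})))  # {'RAW'}
# r3 <- r1 op r2 ; r1 <- r4 op r5  => anti dependence on r1
print(dependences(("r3", {"r1", "r2"}), ("r1", {"r4", "r5"})))  # {'WAR'}
```

Note that only RAW constrains the values a pipeline must deliver; WAR and WAW are name dependences and, as the previous slide says, exist only because architectural registers are limited.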
Pipelined Operation Example
lw $10, 20($1)
Instruction fetch
lw $10, 20($1) sub $11, $2, $3 lw $10, 20($1)
Instruction fetch sub $11, $2, $3 lw $10, 20($1)
0
M
Instruction decode Execution
u Instruction decode Execution sub $11, $2, $3 lw $10, 20($1)
x0
0
1 0
MM
M
u0u
u
Memory Write back
xMx
11
0
1
x
u
x
sub $11, $2, $3 lw $10, 20($1)
IF/ID ID/EX EX/MEM MEM/WB
1
M
u Memory Write back
x
Add
1 IF/ID
IF/ID ID/EX
ID/EX EX/MEM
EX/MEM MEM/WB
MEM/WB
IF/ID ID/EX EX/MEM MEM/WB
IF/ID ID/EX EX/MEM MEM/WB
4 Add
Add result
Add
Add
Add
Add IF/ID ID/EX Shift EX/MEM MEM/WB
4 Add Add
[Figure residue removed: pipelined datapath diagrams showing lw $10, 20($1) and sub $11, $2, $3 moving through the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers at clock cycles 1, 3, and 5, across the Instruction fetch, Instruction decode, Execution, Memory, and Write back stages]
What if the SUB were dependent on LW?
[Figure residue removed: the same pipelined datapath at clock cycles 2, 4, and 6]
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
51
Data Dependence
Handling

52
Reading for Next Few Lectures
 H&H, Chapter 7.5-7.9

 Smith and Sohi, “The Microarchitecture of


Superscalar Processors,” Proceedings of the IEEE,
1995
 More advanced pipelining
 Interrupt and exception handling
 Out-of-order and superscalar execution concepts

53
How to Handle Data
Dependences
 Anti and output dependences are easier to handle
 write to the destination only in last stage and in program order

 Flow dependences are more interesting

 Five fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent
instruction
 Detect and eliminate the dependence at the software
level
 No need for the hardware to detect dependence
 Predict the needed value(s), execute “speculatively”,
and verify 54
Remember: Data Dependence
Types
Flow dependence
r3 ← r1 op r2    Read-after-Write (RAW)
r5 ← r3 op r4

Anti dependence
r3 ← r1 op r2    Write-after-Read (WAR)
r1 ← r4 op r5

Output dependence
r3 ← r1 op r2    Write-after-Write (WAW)
r5 ← r3 op r4
r3 ← r6 op r7
55
RAW Dependence Handling
 Which one of the following flow dependences leads to conflicts in the 5-stage pipeline?

addi ra r- -   IF ID EX MEM WB
addi r- ra -      IF ID EX MEM WB
addi r- ra -         IF ID EX MEM
addi r- ra -            IF ID EX
addi r- ra -               IF ID?
addi r- ra -                  IF
56
Pipeline Stall: Resolving Data
Dependence
        t0   t1   t2   t3   t4   t5
Insth   IF   ID   ALU  MEM  WB
Insti        IF   ID   ALU  MEM  WB
Instj             IF   ID*  ID*  ID*  ID   ALU  MEM  WB   (* = stalled; bubbles enter ALU)
Instk                  IF*  IF*  IF*  IF   ID   ALU  ...
Instl                                 IF   ID   ...

i: rx ← _
j: _ ← rx   dist(i,j)=1   bubble
j: _ ← rx   dist(i,j)=2   bubble
j: _ ← rx   dist(i,j)=3   bubble
j: _ ← rx   dist(i,j)=4

Stall = make the dependent instruction wait until its source data value is available
1. stop all up-stream stages
2. drain all down-stream stages
57
Interlocking
 Detection of dependence between instructions in a
pipelined processor to guarantee correct execution

 Software based interlocking


vs.
 Hardware based interlocking

 MIPS acronym?

58
Approaches to Dependence Detection (I)
 Scoreboarding
 Each register in register file has a Valid bit associated
with it
 An instruction that is writing to the register resets the
Valid bit
 An instruction in Decode stage checks if all its source
and destination registers are Valid
 Yes: No need to stall… No dependence
 No: Stall the instruction

 Advantage:
 Simple. 1 bit per register

 Disadvantage:
 Need to stall for all types of dependences, not only flow dependences
59
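The valid-bit scheme described above can be sketched as a tiny executable model. This is a hypothetical Python sketch, not part of the original slides; the class and method names are illustrative.

```python
# Minimal scoreboard sketch (hypothetical model, not real hardware):
# one Valid bit per register; an in-flight writer clears the bit,
# and Decode stalls until all of its registers are Valid again.

class Scoreboard:
    def __init__(self, num_regs=32):
        self.valid = [True] * num_regs  # 1 Valid bit per register

    def issue(self, srcs, dest):
        """Return True (issue) if all source/dest registers are Valid,
        else False (stall). On issue, the dest register becomes invalid."""
        if not all(self.valid[r] for r in srcs + [dest]):
            return False  # stall: some register still being written
        self.valid[dest] = False  # this instruction now owns dest
        return True

    def writeback(self, dest):
        self.valid[dest] = True  # result written; register Valid again

sb = Scoreboard()
assert sb.issue(srcs=[1, 2], dest=3)      # add r3 <- r1 op r2 issues
assert not sb.issue(srcs=[3, 4], dest=5)  # RAW on r3: must stall
sb.writeback(3)
assert sb.issue(srcs=[3, 4], dest=5)      # value available: issues now
```

Note how the model stalls on the r3 reader even though only a flow dependence exists: with a single Valid bit, anti and output dependences stall in exactly the same way, which is the disadvantage the slide points out.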
Approaches to Dependence Detection (II)
 Combinational dependence check logic
 Special logic checks if any instruction in later stages is
supposed to write to any source register of the
instruction that is being decoded
 Yes: stall the instruction/pipeline
 No: no need to stall… no flow dependence

 Advantage:
 No need to stall on anti and output dependences

 Disadvantage:
 Logic is more complex than a scoreboard
 Logic becomes more complex as we make the pipeline
deeper and wider (flash-forward: think superscalar
execution) 60
Once You Detect the Dependence in
Hardware
 What do you do afterwards?

 Observation: Dependence between two instructions


is detected before the communicated data value
becomes available

 Option 1: Stall the dependent instruction right away


 Option 2: Stall the dependent instruction only when necessary → data forwarding/bypassing
 Option 3: …

61
Data Forwarding/Bypassing
 Problem: A consumer (dependent) instruction has to
wait in decode stage until the producer instruction
writes its value in the register file
 Goal: We do not want to stall the pipeline
unnecessarily
 Observation: The data value needed by the
consumer instruction can be supplied directly from a
later stage in the pipeline (instead of only from the
register file)
 Idea: Add additional dependence check logic and
data forwarding paths (buses) to supply the
producer’s value to the consumer right after the
value is available
62
Aside: A Special Case of Data
Dependence
Control dependence
 Data dependence on the Instruction Pointer / Program
Counter

63
Aside: Control Dependence
 Question: What should the fetch PC be in the next
cycle?
 Answer: The address of the next instruction
 All instructions are control dependent on previous ones.
Why?

 If the fetched instruction is a non-control-flow


instruction:
 Next Fetch PC is the address of the next-sequential
instruction
 Easy to determine if we know the size of the fetched
instruction

 If the instruction that is fetched is a control-flow


instruction: 64
We did not cover the following slides.
They are for your benefit.
We will cover them in future lectures.

66
Data Dependence
Handling: Concepts and
Implementation

67
How to Implement Stalling
[Figure residue removed: pipelined MIPS datapath with control signals (PCSrc, RegWrite, Branch, MemWrite, ALUSrc, MemtoReg, RegDst, ALUOp, MemRead) and the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
 Stall
 disable PC and IF/ID latching; ensure stalled instruction stays in its stage
 Insert “invalid” instructions/nops into the stage following the stalled one (called “bubbles”)
68
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
RAW Data Dependence Example
 One instruction writes a register ($s0) and next
instructions read this register => read after write
(RAW) dependence.

 add writes into $s0 in the first half of cycle 5
 and reads $s0 on cycle 3, obtaining the wrong value (only if the pipeline handles data dependences incorrectly!)
 or reads $s0 on cycle 4, again obtaining the wrong value
 sub reads $s0 in 2nd half of cycle 5, getting the correct value
 subsequent instructions read the correct value of $s0

[Pipeline diagram: time (cycles) 1-8
add $s0, $s2, $s3   IM  RF($s2,$s3)  +  DM  RF($s0)
and $t0, $s0, $s1   IM  RF($s0,$s1)  &  DM  RF($t0)
or  $t1, $s4, $s0   IM  RF($s4,$s0)  |  DM  RF($t1)
sub $t2, $s0, $s5   IM  RF($s0,$s5)  -  DM  RF($t2)]
Compile-Time Detection and
Elimination

[Pipeline diagram: time (cycles) 1-10
add $s0, $s2, $s3   IM RF + DM RF
nop                 IM RF   DM RF
nop                 IM RF   DM RF
and $t0, $s0, $s1   IM RF & DM RF
or  $t1, $s4, $s0   IM RF | DM RF
sub $t2, $s0, $s5   IM RF - DM RF]

 Insert enough NOPs for the required result to be ready
 Or (if you can) move independent useful instructions up
Data Forwarding
 Also called Data Bypassing

 We have already seen the basic idea before


 Forward the result value to the dependent instruction
as soon as the value is available

 Remember dataflow?
 Data value supplied to dependent instruction as soon
as it is available
 Instruction executes when all its operands are
available

 Data forwarding brings a pipeline closer to data flow


execution principles
Data Forwarding

[Pipeline diagram: time (cycles) 1-8; add $s0, $s2, $s3 forwards $s0 from its ALU/Memory stages directly to the dependent and $t0, $s0, $s1; or $t1, $s4, $s0; and sub $t2, $s0, $s5, with no stalls]
Data Forwarding
[Figure residue removed: pipelined datapath with forwarding; the Hazard Unit compares the Execute-stage source registers (RsE, RtE) against WriteRegM/WriteRegW and drives the ForwardAE/ForwardBE muxes feeding the ALU inputs]
Data Forwarding
 Forward to Execute stage from either:
 Memory stage or
 Writeback stage

 When should we forward from either Memory or


Writeback stage?
 If that stage will write to a destination register and the
destination register matches the source register.
 If both the Memory and Writeback stages contain
matching destination registers, the Memory stage
should have priority, because it contains the more
recently executed instruction.
Data Forwarding (in
Pseudocode)
Forward to Execute stage from either:
 Memory stage or
 Writeback stage

 Forwarding logic for ForwardAE (pseudo code):


if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then
ForwardAE = 10 # forward from Memory stage
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then
ForwardAE = 01 # forward from Writeback stage
else
ForwardAE = 00 # no forwarding

 Forwarding logic for ForwardBE same, but replace rsE


with rtE
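The ForwardAE pseudocode above translates almost line for line into an executable sketch. Python is used here purely as a simulator for the combinational logic; signal names follow the slide.

```python
# Executable sketch of the ForwardAE priority logic from the slide.
# The Memory stage wins over Writeback because it holds the more
# recently executed (younger) producer of the register.

def forward_ae(rsE, WriteRegM, RegWriteM, WriteRegW, RegWriteW):
    if rsE != 0 and rsE == WriteRegM and RegWriteM:
        return 0b10  # forward from Memory stage
    if rsE != 0 and rsE == WriteRegW and RegWriteW:
        return 0b01  # forward from Writeback stage
    return 0b00      # no forwarding: read the register file

# Both stages write rsE's register: Memory stage has priority.
assert forward_ae(rsE=5, WriteRegM=5, RegWriteM=1, WriteRegW=5, RegWriteW=1) == 0b10
# Only Writeback matches:
assert forward_ae(rsE=5, WriteRegM=7, RegWriteM=1, WriteRegW=5, RegWriteW=1) == 0b01
# Register $0 is never forwarded:
assert forward_ae(rsE=0, WriteRegM=0, RegWriteM=1, WriteRegW=0, RegWriteW=1) == 0b00
```

The ForwardBE function would be identical with rsE replaced by rtE, exactly as the slide states.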
Stalling
[Pipeline diagram: time (cycles) 1-8; lw $s0, 40($0) is immediately followed by and $t0, $s0, $s1. Trouble! lw's data is not ready when and reaches Execute; or $t1, $s4, $s0 and sub $t2, $s0, $s5 follow]

 Forwarding is sufficient to resolve RAW data dependences


 Unfortunately, there are cases when forwarding is not possible
 due to pipeline design and instruction latencies
 The lw instruction does not finish reading data until the end of the

Memory stage
 its result cannot be forwarded to the Execute stage of the next
instruction
Stalling

[Pipeline diagram: time (cycles) 1-9; and $t0, $s0, $s1 is stalled one cycle in Decode after lw $s0, 40($0), so lw's result can be forwarded from the Memory stage; or $t1, $s4, $s0 and sub $t2, $s0, $s5 follow the stall]
Hardware Needed for Stalling
 Stalls are supported by
 adding enable inputs (EN) to the Fetch and Decode
pipeline registers
 and a synchronous reset/clear (CLR) input to the
Execute pipeline register
 or an INV bit associated with each pipeline register,
indicating that contents are INValid

 When a lw stall occurs


 StallD and StallF are asserted to force the Decode and
Fetch stage pipeline registers to hold their old values.
 FlushE is also asserted to clear the contents of the
Execute stage pipeline register, introducing a bubble
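The EN/CLR behavior described above can be modeled in a few lines. This is a hypothetical Python model of a single pipeline register, not RTL; the NOP encoding and field names are illustrative.

```python
# Hypothetical model of a pipeline register with an enable (EN) input
# and a synchronous clear (CLR) input, as used for stalling and for
# inserting bubbles.

NOP = {"op": "nop"}

class PipeReg:
    def __init__(self):
        self.q = NOP  # current (visible) contents

    def clock(self, d, en=True, clr=False):
        if clr:
            self.q = NOP    # CLR asserted: insert a bubble
        elif en:
            self.q = d      # normal latching on the clock edge
        # else: EN low -> hold the old value (stall)

ifid, idex = PipeReg(), PipeReg()
lw = {"op": "lw"}
ifid.clock(lw)                       # lw moves into Decode
assert ifid.q == lw
ifid.clock({"op": "and"}, en=False)  # StallD: Decode keeps holding lw
assert ifid.q == lw
idex.clock({"op": "and"}, clr=True)  # FlushE: a bubble enters Execute
assert idex.q == NOP
```

During a lw stall, StallF/StallD deassert EN on the Fetch and Decode registers while FlushE asserts CLR on the Execute register, which is exactly the combination the bullet list above describes.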
Stalling and Dependence
Detection Hardware
[Figure residue removed: pipelined datapath whose Hazard Unit also generates StallF, StallD, and FlushE in addition to ForwardAE/ForwardBE; the Fetch and Decode pipeline registers gain enable (EN) inputs and the Execute pipeline register a clear (CLR) input]
Recall: How to Handle Data
Dependences
 Anti and output dependences are easier to handle
 write to the destination in one stage and in program order

 Flow dependences are more interesting

 Five fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent
instruction
 Detect and eliminate the dependence at the software
level
 No need for the hardware to detect dependence
 Predict the needed value(s), execute “speculatively”,
and verify 80
Fine-Grained
Multithreading

81
Fine-Grained Multithreading
 Idea: Hardware has multiple thread contexts
(PC+registers). Each cycle, fetch engine fetches from
a different thread.
 By the time the fetched branch/instruction resolves, no
instruction is fetched from the same thread
 Branch/instruction resolution latency overlapped with
execution of other threads’ instructions

+ No logic needed for handling control and


data dependences within a thread
-- Single thread performance suffers
-- Extra logic for keeping thread contexts
-- Does not overlap latency if not enough
threads to cover the whole pipeline
82
Fine-Grained Multithreading (II)
 Idea: Switch to another thread every cycle such that
no two instructions from a thread are in the pipeline
concurrently

 Tolerates the control and data dependency latencies


by overlapping the latency with useful work from
other threads
 Improves pipeline utilization by taking advantage of
multiple threads

 Thornton, “Parallel Operation in the Control Data 6600,”


AFIPS 1964.
 Smith, “A pipelined, shared resource MIMD computer,”
ICPP 1978.
83
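The cycle-by-cycle thread switching above can be sketched as a simple round-robin fetch selector. This is a hypothetical Python sketch, not from the slides; the function name is illustrative.

```python
# Hypothetical sketch of fine-grained multithreaded fetch: each cycle
# the fetch engine picks the next thread in round-robin order, so no
# two instructions from the same thread are in the pipeline at once
# (given at least as many threads as pipeline stages).

def fgmt_fetch_order(num_threads, num_cycles):
    return [cycle % num_threads for cycle in range(num_cycles)]

order = fgmt_fetch_order(num_threads=4, num_cycles=8)
assert order == [0, 1, 2, 3, 0, 1, 2, 3]
# A given thread fetches again only every num_threads cycles, which is
# what overlaps its branch/data dependency latency with other threads.
```

With 8 pipeline stages and at least 8 threads, a thread's next instruction is not fetched until its previous one has left the pipeline, matching the HEP design on the next slide.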
Fine-Grained Multithreading:
History
CDC 6600’s peripheral processing unit is fine-grained
multithreaded
 Thornton, “Parallel Operation in the Control Data 6600,” AFIPS
1964.
 Processor executes a different I/O thread every cycle
 An operation from the same thread is executed every 10
cycles

 Denelcor HEP (Heterogeneous Element Processor)


 Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
 120 threads/processor
 available queue vs. unavailable (waiting) queue for threads
 each thread can have only 1 instruction in the processor pipeline;
each thread independent
 to each thread, processor looks like a non-pipelined machine
 system throughput vs. single thread performance tradeoff
84
Fine-Grained Multithreading in
HEP
 Cycle time: 100ns

 8 stages → 800 ns to complete an instruction
 assuming no memory access

 No control and data dependency checking
Burton Smith
(1941-2018)

85
Multithreaded Pipeline Example

Slide credit: Joel Emer 86


Sun Niagara Multithreaded
Pipeline

Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
87
Fine-Grained Multithreading
 Advantages
+ No need for dependency checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions
from different threads
+ Improved system throughput, latency tolerance, utilization

 Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs,
register files, …), thread selection logic
- Reduced single thread performance (one instruction fetched
every N cycles from the same thread)
- Resource contention between threads in caches and memory
- Some dependency checking logic between threads remains
(load/store) 88
Modern GPUs are
FGMT Machines

89
NVIDIA GeForce GTX 285
“core”

[Figure: 64 KB of storage for thread contexts (registers); data-parallel (SIMD) functional units with instruction stream decode/control shared across 8 units; multiply-add and multiply units; execution context storage]

90
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285
“core”

[Figure: same core, 64 KB of storage for thread contexts (registers)]
 Groups of 32 threads share instruction stream (each group is a Warp): they execute the same instruction on different data
 Up to 32 warps are interleaved in an FGMT manner
 Up to 1024 thread contexts can be stored
91
Slide credit: Kayvon Fatahalian
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285

[Figure residue removed: array of 30 streaming cores with texture (Tex) units]
30 cores on the GTX 285: 30,720 threads


92
Slide credit: Kayvon Fatahalian
Further Reading for the
Interested (I)

Burton Smith
(1941-2018)

93
Further Reading for the
Interested (II)

94
Recall: How to Handle Data
Dependences
 Anti and output dependences are easier to handle
 write to the destination in one stage and in program order

 Flow dependences are more interesting

 Five fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent
instruction
 Detect and eliminate the dependence at the software
level
 No need for the hardware to detect dependence
 Predict the needed value(s), execute “speculatively”,
and verify 95
A Special Case of Data
Dependence
Control dependence
 Data dependence on the Instruction Pointer / Program
Counter

96
Control Dependence
 Question: What should the fetch PC be in the next
cycle?
 Answer: The address of the next instruction
 All instructions are control dependent on previous ones.
Why?

 If the fetched instruction is a non-control-flow


instruction:
 Next Fetch PC is the address of the next-sequential
instruction
 Easy to determine if we know the size of the fetched
instruction

 If the instruction that is fetched is a control-flow


instruction: 97
Carnegie Mellon

Control Dependences
 Special case of data dependence: dependence on PC
 beq:
 branch is not determined until the fourth stage of the pipeline
 Instructions after the branch are fetched before branch is resolved
Always predict that the next sequential instruction is fetched
 Called “Always not taken” prediction
 These instructions must be flushed if the branch is taken

 Branch misprediction penalty


 number of instructions flushed when branch is taken
 May be reduced by determining branch earlier

98

Control Dependence: Original Pipeline


[Figure residue removed: baseline pipelined datapath with forwarding and stalling; the branch condition and target are computed late, with PCSrcM selecting PCBranchM in the Memory stage]

99

Control Dependence
[Pipeline diagram: time (cycles) 1-9; 20 beq $t1, $t2, 40 resolves late, so the three instructions fetched after it (24 and $t0, $s0, $s1; 28 or $t1, $s4, $s0; 2C sub $t2, $s0, $s5) are flushed, and fetch resumes at the branch target, 64 slt $t3, $s2, $s3]

100

Early Branch Resolution


[Figure residue removed: pipelined datapath with the equality comparator (EqualD) and branch target adder (PCBranchD) moved into the Decode stage, so PCSrcD can redirect fetch one cycle after the branch is fetched]

Introduces another data dependency in the Decode stage…
101



Early Branch Resolution


[Pipeline diagram: time (cycles) 1-9; with the branch resolved in Decode, only one instruction (24 and $t0, $s0, $s1) is flushed after 20 beq $t1, $t2, 40 before fetch resumes at 64 slt $t3, $s2, $s3]

102

Early Branch Resolution: Good Idea?


 Advantages
 Reduced branch misprediction penalty
 Reduced CPI (cycles per instruction)

 Disadvantages
 Potential increase in clock cycle time?
 Higher Tclock?
 Additional hardware cost
 Specialized and likely not used by other instructions

103

Data Forwarding for Early Branch Resolution


[Figure residue removed: datapath adding ForwardAD/ForwardBD paths from the Memory stage into the Decode-stage comparator, plus BranchD-related stall logic in the Hazard Unit]
Data forwarding for early branch resolution. 104



Forwarding and Stalling Hardware Control


// Forwarding logic:
assign ForwardAD = (rsD != 0) & (rsD == WriteRegM) & RegWriteM;
assign ForwardBD = (rtD != 0) & (rtD == WriteRegM) & RegWriteM;

// Stalling logic:
assign lwstall = ((rsD == rtE) | (rtD == rtE)) & MemtoRegE;

assign branchstall = (BranchD & RegWriteE &
                      (WriteRegE == rsD | WriteRegE == rtD))
                     |
                     (BranchD & MemtoRegM &
                      (WriteRegM == rsD | WriteRegM == rtD));

// Stall signals:
assign StallF = lwstall | branchstall;
assign StallD = lwstall | branchstall;
assign FlushE = lwstall | branchstall;

105
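The lwstall condition above can be cross-checked with a small Python model. Signal names follow the slide; this is a simulation sketch, not synthesizable code, and the register numbers in the checks are illustrative.

```python
# Sketch of the lwstall condition from the slide: stall when the
# instruction in Decode reads the register that a lw currently in
# Execute is still loading from memory (MemtoRegE asserted).

def lwstall(rsD, rtD, rtE, MemtoRegE):
    return (rsD == rtE or rtD == rtE) and MemtoRegE

# lw $s0, 40($0) in Execute (rtE = $s0), and $t0, $s0, $s1 in Decode:
s0, s1, t0 = 16, 17, 8   # illustrative register numbers
assert lwstall(rsD=s0, rtD=s1, rtE=s0, MemtoRegE=True)
# An independent instruction in Decode does not stall:
assert not lwstall(rsD=s1, rtD=t0, rtE=s0, MemtoRegE=True)
# A non-load in Execute (MemtoRegE = 0) never triggers lwstall:
assert not lwstall(rsD=s0, rtD=s1, rtE=s0, MemtoRegE=False)
```

The branchstall term could be modeled the same way; together they drive StallF, StallD, and FlushE exactly as in the assign statements above.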

Doing Better: Smarter Branch Prediction


 Guess whether branch will be taken
 Backward branches are usually taken (loops)
 Consider history of whether branch was previously taken to
improve the guess

 Good prediction reduces the fraction of branches


requiring a flush

106
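One common history-based scheme is a 2-bit saturating counter per branch, which must mispredict twice before flipping its guess. The sketch below is a hypothetical illustration, not something shown on the slides.

```python
# Hypothetical sketch of a 2-bit saturating counter branch predictor:
# states 0-1 predict not-taken, states 2-3 predict taken; saturation
# means a single anomalous outcome does not flip the prediction.

class TwoBitPredictor:
    def __init__(self):
        self.state = 1  # weakly not-taken

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True]  # e.g., a loop branch with one blip
predictions = []
for t in outcomes:
    predictions.append(p.predict())
    p.update(t)
assert predictions == [False, True, True, True]
```

Note how the single not-taken outcome does not disturb the taken prediction, which is what makes this scheme work well on backward loop branches.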
Questions to Ponder
 What is the role of the hardware vs. the software in
data dependence handling?
 Software based interlocking
 Hardware based interlocking
 Who inserts/manages the pipeline bubbles?
 Who finds the independent instructions to fill “empty”
pipeline slots?
 What are the advantages/disadvantages of each?
 Think of the performance equation as well

107
Questions to Ponder
 What is the role of the hardware vs. the software in
the order in which instructions are executed in the
pipeline?
 Software based instruction scheduling → static scheduling
 Hardware based instruction scheduling → dynamic scheduling

 How does each impact different metrics?


 Performance (and parts of the performance equation)
 Complexity
 Power consumption
 Reliability
 …
108
More on Software vs. Hardware
 Software based scheduling of instructions → static scheduling
 Compiler orders the instructions, hardware executes
them in that order
 Contrast this with dynamic scheduling (in which
hardware can execute instructions out of the compiler-
specified order)
 How does the compiler know the latency of each
instruction?

 What information does the compiler not know that


makes static scheduling difficult?
 Answer: Anything that is determined at run time
 Variable-length operation latency, memory addr, branch
direction

 109
More on Static Instruction
Scheduling

https://www.youtube.com/onurmutlulectures 110
Lectures on Static Instruction
Scheduling
 Computer Architecture, Spring 2015, Lecture 16
 Static Instruction Scheduling (CMU, Spring 2015)
 https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5C
xxI7b3JCL1TWybTDtKq&index=18

 Computer Architecture, Spring 2013, Lecture 21


 Static Instruction Scheduling (CMU, Spring 2013)
 https://www.youtube.com/watch?v=XdDUn2WtkRg&list=PL5PHm2jkkXmi
dJOd59REog9jDnPDTG6IJ&index=21

https://www.youtube.com/onurmutlulectures 111

Pipelined Performance Example


 SPECINT2006 benchmark:
 25% loads
 10% stores
 11% branches
 2% jumps
 52% R-type

 Suppose:
 40% of loads used by next instruction
 25% of branches mispredicted

 All jumps flush next instruction


 What is the average CPI?
112

Pipelined Performance Example Solution


 Load/Branch CPI = 1 when no stall/flush, 2 when stall/flush.
Thus:
 CPIlw = 1(0.6) + 2(0.4) = 1.4 Average CPI for load
 CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch

 And
 Average CPI =

113

Pipelined Performance Example Solution


 Load/Branch CPI = 1 when no stall/flush, 2 when stall/flush.
Thus:
 CPIlw = 1(0.6) + 2(0.4) = 1.4 Average CPI for load
 CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch

 And
 Average CPI = (0.25)(1.4)    load
             + (0.1)(1)       store
             + (0.11)(1.25)   beq
             + (0.02)(2)      jump
             + (0.52)(1)      r-type

             = 1.15

114
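The arithmetic above can be checked mechanically. The instruction mix and per-class CPIs are taken straight from the slide; Python is used only as a calculator.

```python
# Check of the average-CPI computation, using the mix from the slide.

cpi_lw  = 1 * 0.6  + 2 * 0.4    # 40% of loads stall one cycle
cpi_beq = 1 * 0.75 + 2 * 0.25   # 25% of branches mispredict

avg_cpi = (0.25 * cpi_lw +   # loads
           0.10 * 1 +        # stores
           0.11 * cpi_beq +  # branches
           0.02 * 2 +        # jumps (always flush next instruction)
           0.52 * 1)         # R-type

assert abs(cpi_lw - 1.4) < 1e-9
assert abs(cpi_beq - 1.25) < 1e-9
assert abs(avg_cpi - 1.1475) < 1e-9   # rounds to the slide's 1.15
```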

Pipelined Performance
 There are 5 stages, and 5 different timing paths:

Tc = max {
  tpcq + tmem + tsetup                              fetch
  2(tRFread + tmux + teq + tAND + tmux + tsetup)    decode
  tpcq + tmux + tmux + tALU + tsetup                execute
  tpcq + tmemwrite + tsetup                         memory
  2(tpcq + tmux + tRFwrite)                         writeback
}
 The operation speed depends on the slowest operation
 Decode and Writeback use the register file and have only half a clock cycle to complete
115

Pipelined Performance Example


Element              Parameter   Delay (ps)
Register clock-to-Q  tpcq_PC     30
Register setup       tsetup      20
Multiplexer          tmux        25
ALU                  tALU        200
Memory read          tmem        250
Register file read   tRFread     150
Register file setup  tRFsetup    20
Equality comparator  teq         40
AND gate             tAND        15
Memory write         tmemwrite   220
Register file write  tRFwrite    100

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup)
   = 2[150 + 25 + 40 + 15 + 25 + 20] ps
   = 550 ps
116

Pipelined Performance Example


 For a program with 100 billion instructions executing on a
pipelined MIPS processor:
 CPI = 1.15
 Tc = 550 ps

 Execution Time = (# instructions) × CPI × Tc


= (100 × 109)(1.15)(550 × 10-12)
= 63 seconds

117
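The cycle-time and execution-time products above can be verified the same way; the delays, CPI, and instruction count are those given on the slides.

```python
# Check of Tc and Execution Time = (# instructions) x CPI x Tc.

# Decode-stage critical path (two half-cycle register file accesses):
tc_ps = 2 * (150 + 25 + 40 + 15 + 25 + 20)   # tRFread+tmux+teq+tAND+tmux+tsetup
assert tc_ps == 550                          # 550 ps, as on the slide

num_instructions = 100e9   # 100 billion instructions
cpi = 1.15
tc = tc_ps * 1e-12         # convert ps to seconds

execution_time = num_instructions * cpi * tc
assert abs(execution_time - 63.25) < 1e-6    # ~63 seconds, as on the slide
```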

Performance Summary for MIPS arch.


Processor      Execution Time (seconds)   Speedup (single-cycle is baseline)
Single-cycle   95                         1
Multicycle     133                        0.71
Pipelined      63                         1.51

 Fastest of the three MIPS architectures is Pipelined.


 However, even though we have 5-fold pipelining, it is not 5 times faster than single-cycle.

118
Pipelining and Precise
Exceptions: Preserving
Sequential Semantics
Multi-Cycle Execution
 Not all instructions take the same amount of time
for “execution”
 Idea: Have multiple different functional units that
take different number of cycles
 Can be pipelined or not pipelined
 Can let independent instructions start execution on a
different functional unit before a previous long-latency
instruction finishes execution
F D feed one of several functional units:
Integer add   E
Integer mul   E E E E
FP mul        E E E E E E E E
Load/store    E E E E E E E E ... ?
120
Issues in Pipelining: Multi-Cycle
Execute
 Instructions can take different number of cycles in

EXECUTE stage
 Integer ADD versus FP MULtiply

FMUL R4 ← R1, R2   F D E E E E E E E E W
ADD  R3 ← R1, R2   F D E W
                   F D E W
                   F D E W

FMUL R2 ← R5, R6   F D E E E E E E E E W
ADD  R7 ← R5, R6   F D E W
                   F D E W

 What is wrong with this picture in a Von Neumann


architecture?
 Sequential semantics of the ISA NOT preserved!
 What if FMUL incurs an exception?
121
Exceptions vs. Interrupts
 Cause
 Exceptions: internal to the running thread
 Interrupts: external to the running thread

 When to Handle
 Exceptions: when detected (and known to be non-
speculative)
 Interrupts: when convenient
 Except for very high priority ones
 Power failure
 Machine check (error)

 Priority: process (exception), depends (interrupt)

 Handling Context: process (exception), system 122


Precise Exceptions/Interrupts
 The architectural state should be consistent
(precise) when the exception/interrupt is ready to
be handled

1. All previous instructions should be completely


retired.

2. No later instruction should be retired.

Retire = commit = finish execution and update arch.


state

123
Checking for and Handling Exceptions
in Pipelining
 When the oldest instruction ready-to-be-retired is
detected to have caused an exception, the control
logic

 Ensures architectural state is precise (register file, PC,


memory)

 Flushes all younger instructions in the pipeline

 Saves PC and registers (as specified by the ISA)

 Redirects the fetch engine to the appropriate exception


handling routine
124
Why Do We Want Precise
Exceptions?
Semantics of the von Neumann model ISA specifies
it
 Remember von Neumann vs. Dataflow

 Aids software debugging

 Enables (easy) recovery from exceptions

 Enables (easily) restartable processes

 Enables traps into software (e.g., software


implemented opcodes)

125
Ensuring Precise Exceptions in
Pipelining
Idea: Make each operation take the same amount of
time
FMUL R3 ← R1, R2   F D E E E E E E E E W
ADD  R4 ← R1, R2   F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W

 Downside
 Worst-case instruction latency determines all
instructions’ latency
 What about memory operations?
 Each functional unit takes worst-case number of cycles?
126
Solutions
 Reorder buffer

 History buffer

 Future register file We will not cover these

 Checkpointing

 Suggested reading
 Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans on Computers 1988 and ISCA 1985.

127
Recall: Solution I: Reorder
Buffer
 (ROB)
Idea: Complete instructions out-of-order, but reorder
them before making results visible to architectural
state
 When instruction is decoded it reserves the next-
sequential entry in the ROB
 When instruction completes, it writes result into
ROB entry
 When instruction oldest in ROB and it has
completed without exceptions, its result moved to
Func Unit
reg. file or memory
Register
Instruction Reorder
Cache File Func Unit Buffer

Func Unit

128
Reorder Buffer
 Buffers information about all instructions that are
decoded but not yet retired/committed

129
What’s in a ROB Entry?
V DestRegID DestRegVal StoreAddr StoreData PC Exception?
(valid bits for reg/data + control bits)
 Everything required to:
 correctly reorder instructions back into the program order
 update the architectural state with the instruction’s result(s), if instruction can retire without any issues
 handle an exception/interrupt precisely, if an exception/interrupt needs to be handled before retiring the instruction
 Need valid bits to keep track of readiness of the result(s) and find out if the instruction has completed execution
130
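The entry fields above can be sketched as a simple record. This is an illustrative software model, not the hardware layout; the field names and the `completed` helper are assumptions for the sketch:

```python
from dataclasses import dataclass

@dataclass
class ROBEntry:
    # Illustrative ROB entry, modeled after the fields on the slide
    valid: bool = False        # entry allocated at decode?
    dest_reg_id: int = None    # architectural destination register
    dest_reg_val: int = None   # result value, filled in at completion
    dest_valid: bool = False   # valid bit: is dest_reg_val ready?
    store_addr: int = None     # for stores
    store_data: int = None
    pc: int = 0                # PC of the instruction (needed for precise exceptions)
    exception: bool = False    # did execution raise an exception?

    def completed(self) -> bool:
        # Simplified: an entry is ready to retire once its result is valid
        # (a real design also tracks store address/data readiness)
        return self.dest_valid

# Allocate an entry at decode, fill it at completion
e = ROBEntry(valid=True, dest_reg_id=3, pc=0x400)
e.dest_reg_val, e.dest_valid = 42, True
print(e.completed())  # → True
```

The valid bits are what let the retirement logic find out whether the oldest instruction has finished executing before moving its result to architectural state.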
Reorder Buffer: Independent Operations
 Result first written to ROB on instruction completion
 Result written to register file at commit time
F D E E E E E E E E R W
F D E R W
F D E R W
F D E R W
F D E E E E E E E E R W
F D E R W
F D E R W
 What if a later instruction needs a value in the reorder buffer?
 One option: stall the operation → stall the pipeline
 Better: Read the value from the reorder buffer. How?
131
Reorder Buffer: How to Access?
 A register value can be in the register file, the reorder buffer, or the bypass/forwarding paths
[Figure: Instruction Cache → Register File (Random Access Memory, indexed with Register ID, which is the address of an entry) → Func Units → Reorder Buffer (Content Addressable Memory, searched with Register ID, which is part of the content of an entry); bypass paths]
132
Simplifying Reorder Buffer Access
 Idea: Use indirection
 Access register file first (check if the register is valid)
 If register not valid, register file stores the ID of the reorder buffer entry that contains (or will contain) the value of the register
 Mapping of the register to a ROB entry: register file maps the register to a reorder buffer entry if there is an in-flight instruction writing to the register
 Access reorder buffer next
 Now, reorder buffer does not need to be content addressable
133
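The indirection idea can be sketched as follows; the data structures and names are assumptions made for illustration (each register-file entry either holds a valid value or the ROB entry ID of its in-flight producer):

```python
# Illustrative sketch of operand lookup with indirection.
# Each register-file entry is (valid, value_or_rob_index):
#   valid=True  -> second field is the architectural value
#   valid=False -> second field is the ROB entry ID of the in-flight producer

regfile = {1: (True, 10), 2: (True, 20), 3: (False, 7)}  # r3 produced by ROB entry 7
rob = {7: {"done": True, "value": 30}}                   # ROB indexed by entry ID (plain RAM, not CAM)

def read_operand(reg_id):
    valid, v = regfile[reg_id]
    if valid:
        return v                   # value is in the register file
    entry = rob[v]                 # follow the pointer into the ROB
    if entry["done"]:
        return entry["value"]      # completed but not yet retired
    return None                    # still executing: stall or wait on a bypass path

print(read_operand(1))  # → 10 (from register file)
print(read_operand(3))  # → 30 (from ROB entry 7)
```

Because the register file hands out the ROB index directly, the ROB is accessed by address rather than searched by register ID, which is exactly why it no longer needs to be content addressable.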
Reorder Buffer in Intel Pentium 4
 Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.
134
Important: Register Renaming with a Reorder Buffer
 Output and anti dependencies are not true dependencies
 WHY? The same register refers to values that have nothing to do with each other
 They exist due to lack of register IDs (i.e., names) in the ISA
 The register ID is renamed to the reorder buffer entry that will hold the register’s value
 Register ID → ROB entry ID
 Architectural register ID → Physical register ID
 After renaming, ROB entry ID is used to refer to the register
 This eliminates anti and output dependencies
135
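A minimal rename-table sketch shows how mapping each destination register to a fresh ROB entry removes anti and output dependencies on register names (the table layout and tag format are illustrative assumptions):

```python
# Illustrative renaming: at decode, each destination register is mapped to a
# fresh ROB entry ID; source registers read the latest mapping, so a consumer
# of an in-flight value gets the producer's ROB tag instead of the register name.
rename_table = {}   # architectural reg -> ROB entry ID of latest producer
next_rob = 0

def rename(dest, srcs):
    global next_rob
    renamed_srcs = [rename_table.get(s, f"r{s}") for s in srcs]
    tag = f"ROB{next_rob}"; next_rob += 1
    rename_table[dest] = tag       # a later write to 'dest' gets a new tag,
    return tag, renamed_srcs       # so WAR/WAW on the register name disappear

# r3 ← r1 op r2 ; r1 ← r4 op r5 (WAR on r1) ; r3 ← r6 op r7 (WAW on r3)
print(rename(3, [1, 2]))  # → ('ROB0', ['r1', 'r2'])
print(rename(1, [4, 5]))  # → ('ROB1', ['r4', 'r5'])  write to r1 gets its own tag
print(rename(3, [6, 7]))  # → ('ROB2', ['r6', 'r7'])  second write to r3, too
```

Only the true (flow) dependencies remain: a later reader of r3 would now be given ROB2, the tag of the latest producer.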
Recall: Data Dependence Types
 True (flow) dependence: Read-after-Write (RAW)
   r3 ← r1 op r2
   r5 ← r3 op r4
 Anti dependence: Write-after-Read (WAR)
   r3 ← r1 op r2
   r1 ← r4 op r5
 Output dependence: Write-after-Write (WAW)
   r3 ← r1 op r2
   r5 ← r3 op r4
   r3 ← r6 op r7
136
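The three dependence types can be detected mechanically by comparing the read and write sets of an older and a younger instruction; this helper is an illustrative sketch (instruction encoding assumed to be a (dest, sources) pair):

```python
def classify(first, second):
    # first/second: (dest_reg, src_regs) of an older/younger instruction.
    # Returns the dependence types present between the pair.
    deps = []
    d1, s1 = first
    d2, s2 = second
    if d1 in s2: deps.append("RAW")   # younger reads what older wrote (true)
    if d2 in s1: deps.append("WAR")   # younger writes what older read (anti)
    if d2 == d1: deps.append("WAW")   # both write the same register (output)
    return deps

# The three examples from the slide:
print(classify((3, [1, 2]), (5, [3, 4])))  # → ['RAW']
print(classify((3, [1, 2]), (1, [4, 5])))  # → ['WAR']
print(classify((3, [1, 2]), (3, [6, 7])))  # → ['WAW']
```

Renaming makes the WAR and WAW cases vanish because the two writes (or the read and the later write) end up using different ROB tags; only RAW must still be enforced.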
In-Order Pipeline with Reorder Buffer
 Decode (D): Access regfile/ROB, allocate entry in ROB, check if instruction can execute, if so dispatch instruction
 Execute (E): Instructions can complete out-of-order
 Completion (R): Write result to reorder buffer
 Retirement/Commit (W): Check for exceptions; if none, write result to architectural register file or memory; else, flush pipeline and start from exception handler
 In-order dispatch/execution, out-of-order completion, in-order retirement
[Figure: F and D stages feed functional units of different latencies (Integer add: E; Integer mul: E E E E; FP mul: E x 8; Load/store: E x 8 ...), all converging at R and W]
138
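The retirement (W) stage above can be sketched as a loop over the oldest ROB entry: commit it if it completed without exceptions, flush everything younger if it raised one, stall otherwise. The structure below is an illustrative model, not the slides' hardware:

```python
from collections import deque

# Illustrative in-order retirement loop (entry fields assumed for the sketch)
rob = deque([
    {"pc": 0x100, "done": True,  "exc": False, "dest": 3, "val": 7},
    {"pc": 0x104, "done": True,  "exc": True,  "dest": 4, "val": 0},
    {"pc": 0x108, "done": False, "exc": False, "dest": 5, "val": None},
])
arch_regs = {}   # architectural register file, updated only at retirement

def retire_one():
    if not rob or not rob[0]["done"]:
        return "stall"                         # oldest not complete: in-order retire waits
    e = rob.popleft()
    if e["exc"]:
        rob.clear()                            # flush all younger instructions,
        return f"exception at {hex(e['pc'])}"  # then redirect to the handler
    arch_regs[e["dest"]] = e["val"]            # commit result to architectural state
    return "retired"

print(retire_one())  # → 'retired'  (0x100 commits r3 = 7)
print(retire_one())  # → 'exception at 0x104'  (pipeline flushed)
print(retire_one())  # → 'stall'  (ROB now empty)
```

Because commits happen strictly oldest-first, the architectural state at the moment of the exception reflects exactly the instructions before 0x104, which is what makes the exception precise.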
Reorder Buffer Tradeoffs
 Advantages
 Conceptually simple for supporting precise exceptions
 Can eliminate false dependences
 Disadvantages
 Reorder buffer needs to be accessed to get the results that are yet to be written to the register file
 CAM or indirection → increased latency and complexity
 Other solutions aim to eliminate the disadvantages
 History buffer
 Future file         We will not cover these
 Checkpointing
139