Onur Digitaldesign - Comparch 2021 Lecture13 Pipelining Afterlecture
Computer Arch.
Lecture 13: Pipelining
Prof. Onur Mutlu
ETH Zürich
Spring 2021
16 April 2021
Required Readings
This week
Pipelining
H&H, Chapter 7.5
Pipelining Issues
H&H, Chapter 7.8.1-7.8.3
Next week
Out-of-order execution
H&H, Chapter 7.8-7.9
Smith and Sohi, “The Microarchitecture of Superscalar
Processors,” Proceedings of the IEEE, 1995
More advanced pipelining
Interrupt and exception handling
Out-of-order and superscalar execution concepts
Agenda for Today & Next Few
Lectures
Last week & yesterday
Single-cycle Microarchitectures
Multi-cycle Microarchitectures
Review: Single-Cycle MIPS Processor
[Figure: single-cycle MIPS datapath (PC, instruction memory, register file, ALU, data memory, sign extend) with its control unit, which decodes Op (Instr 31:26) and Funct (Instr 5:0) into Jump, MemtoReg, MemWrite, Branch, ALUControl2:0, ALUSrc, RegDst, and RegWrite]
Review: Single-Cycle MIPS FSM
Single-cycle machine
[Figure: combinational logic computes the next architectural state AS’ from the current state AS, which is held in sequential logic (state)]
Review: Multi-Cycle MIPS Processor
[Figure: multi-cycle MIPS datapath with a shared instruction/data memory, register file, and a single ALU; the control unit generates PCWrite, Branch, IorD, MemWrite, IRWrite, PCSrc, ALUControl2:0, ALUSrcB1:0, ALUSrcA, RegWrite, MemtoReg, and RegDst]
Review: Multi-Cycle MIPS FSM
[Figure: control FSM of the multi-cycle MIPS processor; its states and control outputs are listed below]
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 00, IRWrite, PCWrite
S1: Decode: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
S2: MemAdr (Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S3: MemRead (Op = LW): IorD = 1
S4: Mem Writeback: RegDst = 0, MemtoReg = 1, RegWrite
S5: MemWrite (Op = SW): IorD = 1, MemWrite
S6: Execute (Op = R-type): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
S7: ALU Writeback: RegDst = 1, MemtoReg = 0, RegWrite
S8: Branch (Op = BEQ): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 01, Branch
S9: ADDI Execute (Op = ADDI): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S10: ADDI Writeback: RegDst = 0, MemtoReg = 0, RegWrite
S11: Jump (Op = J): PCSrc = 10, PCWrite
What is the shortcoming of this design?
What does this design assume about memory?
Can We Do Better?
What limitations do you see with the multi-cycle
design?
Limited concurrency
Some hardware resources are idle during different
phases of instruction processing cycle
“Fetch” logic is idle when an instruction is being
“decoded” or “executed”
Most of the datapath is idle when a memory access is
happening
Can We Use the Idle Hardware to Improve Concurrency?
Goal: More concurrency → Higher instruction throughput (i.e., more “work” completed in one cycle)
Can Have Different Instructions in Different Stages
[Figure: the multi-cycle MIPS datapath and control unit, repeated]
Pipelining: Basic Idea
More systematically:
Pipeline the execution of multiple instructions
Analogy: “Assembly line processing” of instructions
Idea:
Divide the instruction processing cycle into distinct
“stages” of processing
Ensure there are enough hardware resources to process
one instruction in each stage
Process a different instruction in each stage
Instructions consecutive in program order are processed in
consecutive stages
[Figure: four instructions, each passing through stages F, D, E, W, overlapped in time]
Time →
F D E W
  F D E W
    F D E W
      F D E W
Pipelined: 4 cycles per 4 instructions (steady state)
1 instruction completed per cycle
Is life always this beautiful?
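The cycle counts behind the F/D/E/W diagram can be checked with a minimal sketch. It assumes an ideal pipeline (no stalls, one cycle per stage); the function names are illustrative and not from the lecture.

```python
def multicycle_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when each instruction finishes every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once, then complete one instruction per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


# Four instructions through the 4-stage F/D/E/W pipeline.
print(multicycle_cycles(4, 4), pipelined_cycles(4, 4))
```

For long instruction streams the fill time becomes negligible, which is why the steady-state rate approaches one instruction per cycle.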
The Laundry Analogy
[Figure: timeline from 6 PM to 2 AM of laundry loads A to D, in task order, first processed one after another and then pipelined]
- 4 loads of laundry in parallel
- no additional resources
- throughput increased by 4
- latency per load is the same
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Pipelining Multiple Loads of Laundry: In Practice
[Figure: timeline from 6 PM to 2 AM of laundry loads A to D, in task order, with a second dryer added]
throughput restored (2 loads per hour) using 2 dryers
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
An Ideal Pipeline
Goal: Increase throughput with little increase in cost
(hardware cost, in case of instruction processing)
Repetition of identical operations
The same operation is repeated on a large number of
different inputs (e.g., all laundry loads go through the
same steps)
Repetition of independent operations
No dependences between repeated operations
Uniformly partitionable suboperations
Processing can be evenly divided into uniform-latency
suboperations (that do not share resources)
More Realistic Pipeline:
Throughput
Nonpipelined version with delay T
BW = 1/(T+S) where S = register delay
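[Figure: a combinational block with delay T ps, and the same logic split into k stages of T/k ps each, separated by registers]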
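A small numeric sketch of the throughput formula follows. The k-stage expression 1/(T/k + S) is the usual extension of BW = 1/(T+S) and assumes perfectly balanced stages; the delay values are hypothetical.

```python
def throughput(total_delay_ps: float, register_delay_ps: float, stages: int = 1) -> float:
    """Results per picosecond when delay T is split into `stages` equal stages.

    Each stage pays the register delay S once, so the cycle time is T/k + S.
    """
    cycle_ps = total_delay_ps / stages + register_delay_ps
    return 1.0 / cycle_ps


T, S = 800.0, 20.0  # hypothetical values, not from the lecture
speedup_5 = throughput(T, S, stages=5) / throughput(T, S, stages=1)
print(f"5-stage speedup: {speedup_5:.2f}x")  # below 5x because S is paid in every stage
```

As k grows, the cycle time approaches S, so register delay bounds the achievable throughput.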
More Realistic Pipeline: Cost
Nonpipelined version with combinational cost G
Cost = G+R where R = register cost
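[Figure: a combinational block of G gates, and the same logic split into k stages of G/k gates each, separated by registers]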
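The cost side can be sketched the same way. The k-stage expression G + k·R assumes one register of cost R per stage and no duplicated logic; the numbers are hypothetical.

```python
def pipeline_cost(gates: int, register_cost: int, stages: int = 1) -> int:
    """Hardware cost of G gates of logic plus one register (cost R) per stage."""
    return gates + stages * register_cost


G, R = 10_000, 500  # hypothetical gate counts, not from the lecture
for k in (1, 5, 20):
    print(k, pipeline_cost(G, R, k))
```

Cost grows linearly in k while the throughput gain saturates, which is one reason pipelines are not made arbitrarily deep.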
Pipelining Instruction Processing
Remember: The Instruction
Processing Cycle
FETCH
DECODE
EVALUATE ADDRESS
FETCH OPERANDS
EXECUTE
STORE RESULT
The six phases (Fetch, Decode, Evaluate Address, Fetch Operands, Execute, Store Result) map onto five pipeline stages:
1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute/Evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store/writeback result (WB)
Remember the Single-Cycle Uarch
[Figure: single-cycle MIPS datapath with control signals RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite, and PCSrc (PCSrc1 = Jump, PCSrc2 = Br Taken)]
Delay T, BW = ~(1/T)
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Dividing Into Stages
Stage latencies:
IF: Instruction fetch                          200 ps
ID: Instruction decode / register file read    100 ps
EX: Execute / address calculation              200 ps
MEM: Memory access                             200 ps
WB: Write back (RF write)                      100 ps
[Figure: the single-cycle datapath divided into the five stages above; part of the figure is marked “ignore for now”]
[Figure: program execution order (in instructions) versus time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Non-pipelined: each lw takes 800 ps (instruction fetch, Reg, ALU, data access, Reg) and the next one starts only after it finishes. Pipelined: each stage is given 200 ps and a new lw starts every 200 ps.]
[Figure: pipelined datapath with pipeline registers between the stages, holding values such as PCF, IRD, AE, ImmE, AoutM, BM, MDRW, and AoutW; a total delay of T ps is split into stages of T/k ps]
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
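The timing in the three-load example follows directly from the stage latencies (200, 100, 200, 200, 100 ps). The sketch below ignores pipeline-register overhead, which is an assumption; the function names are illustrative.

```python
STAGE_LATENCY_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_period(latencies: dict) -> int:
    """A single-cycle design needs a clock as long as all stages combined."""
    return sum(latencies.values())


def pipelined_period(latencies: dict) -> int:
    """A pipelined clock is set by the slowest stage."""
    return max(latencies.values())


def total_time_single(num_instructions: int, latencies: dict) -> int:
    return num_instructions * single_cycle_period(latencies)


def total_time_pipelined(num_instructions: int, latencies: dict) -> int:
    cycles = len(latencies) + num_instructions - 1  # fill the pipeline, then one per cycle
    return cycles * pipelined_period(latencies)


print(total_time_single(3, STAGE_LATENCY_PS), total_time_pipelined(3, STAGE_LATENCY_PS))
```

Because the stages are not uniform, the 100 ps stages waste half of each 200 ps cycle, so the steady-state speedup is 800/200 = 4 rather than 5.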
Pipelined Operation Example
[Figure: lw moving through the pipelined datapath (pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB), one stage per cycle: Instruction fetch, Instruction decode, Execution, Memory, Write back]
All instruction classes must follow the same path and timing through the pipeline stages.
Any performance impact?
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Pipelined Operation Example
[Figure: lw $10, 20($1) followed by sub $11, $2, $3 moving through the pipelined datapath (pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB)]
Clock 1: lw in Instruction fetch
Clock 2: sub in Instruction fetch, lw in Instruction decode
Clock 3: sub in Instruction decode, lw in Execution
Clock 4: sub in Execution, lw in Memory
Clock 5: sub in Memory, lw in Write back
Clock 6: sub in Write back
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Illustrating Pipeline Operation: Operation View
        t0   t1   t2   t3   t4   t5
Inst0   IF   ID   EX   MEM  WB
Inst1        IF   ID   EX   MEM  WB
Inst2             IF   ID   EX   MEM  WB
Inst3                  IF   ID   EX   MEM  WB
Inst4                       IF   ID   EX   MEM
                                 IF   ID   EX
                                      IF   ID
                                           IF
steady state (full pipeline) from t4 onward
Illustrating Pipeline Operation: Resource View
     t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
IF   I0   I1   I2   I3   I4   I5   I6   I7   I8   I9   I10
ID        I0   I1   I2   I3   I4   I5   I6   I7   I8   I9
EX             I0   I1   I2   I3   I4   I5   I6   I7   I8
MEM                 I0   I1   I2   I3   I4   I5   I6   I7
WB                       I0   I1   I2   I3   I4   I5   I6
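The resource view can be generated mechanically: in an ideal pipeline, the stage at depth d holds instruction t - d in cycle t. A minimal sketch, assuming no stalls (the helper names are illustrative):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def resource_view(num_cycles: int, stages=STAGES) -> dict:
    """Map each stage to the instruction index it holds in each cycle (None if empty)."""
    view = {}
    for depth, stage in enumerate(stages):
        view[stage] = [t - depth if t >= depth else None for t in range(num_cycles)]
    return view


def format_view(view: dict) -> str:
    """Render the view as one text row per stage."""
    lines = []
    for stage, row in view.items():
        cells = ["" if i is None else f"I{i}" for i in row]
        lines.append((f"{stage:<5}" + "".join(f"{c:<5}" for c in cells)).rstrip())
    return "\n".join(lines)


print(format_view(resource_view(11)))
```

Reading a column of the output shows which instructions occupy the pipeline in one cycle; reading a row shows how one stage is reused over time.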
Control Points in a Pipeline
[Figure: pipelined datapath annotated with its control points: PCSrc, Branch, RegWrite, MemWrite, ALUSrc, MemtoReg, MemRead, ALUOp, and RegDst]
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
[Figure: the same datapath with a Control block; control bits are grouped into EX, M, and WB fields and carried along in the ID/EX, EX/MEM, and MEM/WB pipeline registers]
[Figure: pipelined MIPS datapath with signals named by stage (PCF, InstrD, SrcAE, SrcBE, WriteDataE, ALUOutM, WriteDataM, PCBranchM, ReadDataW, ALUOutW, ResultW); the destination register number is carried along as WriteRegE, WriteRegM, WriteRegW]
Resource contention
Dependences and Their Types
Also called “dependency” or less desirably “hazard”
Two types
Data dependence
Control dependence
[Figure: pipeline diagram over cycles 1 to 8 (IM, RF, ALU, DM, RF per instruction); add writes $s0, which the following and, or, and sub all read]
add $s0, $s2, $s3
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
Data Dependences
Types of data dependences
Flow dependence (true data dependence – read after
write)
Output dependence (write after write)
Anti dependence (write after read)
Anti dependence: Write-after-Read (WAR)
r3 ← r1 op r2
r1 ← r4 op r5
Output dependence: Write-after-Write (WAW)
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
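The three dependence types can be told apart by comparing destination and source registers of an earlier and a later instruction. A minimal sketch; the `Instr` record and the labels are illustrative, not from the lecture.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def dependences(earlier: Instr, later: Instr) -> set:
    """Return the data dependences of `later` on `earlier` (program order)."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("flow (RAW)")    # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("anti (WAR)")    # later overwrites what earlier reads
    if later.dest == earlier.dest:
        found.add("output (WAW)")  # both write the same register
    return found


first = Instr("r3", ("r1", "r2"))
print(dependences(first, Instr("r5", ("r3", "r4"))))
print(dependences(first, Instr("r1", ("r4", "r5"))))
print(dependences(first, Instr("r3", ("r6", "r7"))))
```

Only the flow dependence carries a value from producer to consumer; the other two arise from reusing a register name.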
Data Dependence Handling
Reading for Next Few Lectures
H&H, Chapter 7.5-7.9
How to Handle Data Dependences
Anti and output dependences are easier to handle
write to the destination only in last stage and in program order
Anti dependence: Write-after-Read (WAR)
r3 ← r1 op r2
r1 ← r4 op r5
Output dependence: Write-after-Write (WAW)
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
RAW Dependence Handling
Which of the following flow dependences lead to conflicts in the 5-stage pipeline?
addi ra r- -    IF   ID   EX   MEM  WB
addi r- ra -         IF   ID   EX   MEM  WB
addi r- ra -              IF   ID   EX   MEM
addi r- ra -                   IF   ID   EX
addi r- ra -                        IF   ?ID
addi r- ra -                             IF
Pipeline Stall: Resolving Data
Dependence
[Figure: timing diagram (t0 to t5) of Inst h, i, j, k, l in stages IF, ID, ALU, MEM, WB; the dependent instruction j is held in ID, the instructions behind it are held in IF, and bubbles fill the stages ahead]
i: rx ← _
j: _ ← rx    dist(i,j)=1
j: _ ← rx    dist(i,j)=2
j: _ ← rx    dist(i,j)=3
j: _ ← rx    dist(i,j)=4
Stall = make the dependent instruction wait until its source data value is available
1. stop all up-stream stages
2. drain all down-stream stages
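How many bubbles a stall inserts depends on dist(i,j). The sketch below assumes a 5-stage pipeline without forwarding, where the producer writes the register file in WB (stage index 4) and the consumer reads it in ID (stage index 1). Whether a register can be read in the same cycle it is written is a design choice, exposed here as the hypothetical parameter `same_cycle_ok`.

```python
def bubbles_needed(distance: int, write_stage: int = 4, read_stage: int = 1,
                   same_cycle_ok: bool = False) -> int:
    """Bubbles between producer i and consumer j so that j reads the new value.

    Cycles are counted from the producer's fetch (cycle 0).
    """
    earliest_read = write_stage if same_cycle_ok else write_stage + 1
    unstalled_read = distance + read_stage  # cycle in which j would read without a stall
    return max(0, earliest_read - unstalled_read)


for d in (1, 2, 3, 4):
    print(d, bubbles_needed(d), bubbles_needed(d, same_cycle_ok=True))
```

With write-before-read inside one cycle, each distance needs one bubble fewer, which is why register files are often built that way.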
Interlocking
Detection of dependence between instructions in a
pipelined processor to guarantee correct execution
MIPS acronym?
Approaches to Dependence Detection (I)
Scoreboarding
Each register in register file has a Valid bit associated
with it
An instruction that is writing to the register resets the
Valid bit
An instruction in Decode stage checks if all its source
and destination registers are Valid
Yes: No need to stall… No dependence
No: Stall the instruction
Advantage:
Simple. 1 bit per register
Disadvantage:
Need to stall for all types of dependences, not only flow dependences
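The Valid-bit scheme can be sketched in a few lines. The class and method names are illustrative, and setting the Valid bit again at writeback is an assumption implied by the scheme rather than stated on the slide.

```python
class Scoreboard:
    """One Valid bit per architectural register."""

    def __init__(self, num_registers: int = 32):
        self.valid = [True] * num_registers

    def can_issue(self, srcs, dest) -> bool:
        """Decode-stage check: all source and destination registers must be Valid."""
        return all(self.valid[r] for r in (*srcs, dest))

    def issue(self, dest: int) -> None:
        """An instruction that will write `dest` resets its Valid bit."""
        self.valid[dest] = False

    def writeback(self, dest: int) -> None:
        """Assumed: the bit is set again once the value has been written."""
        self.valid[dest] = True


sb = Scoreboard()
sb.issue(3)                        # producer of r3 is in flight
print(sb.can_issue((3, 4), 5))     # reads r3: must stall
print(sb.can_issue((1, 2), 3))     # writes r3 again: also stalls
sb.writeback(3)
print(sb.can_issue((3, 4), 5))     # r3 is valid again
```

Checking the destination register as well is what makes this scheme stall on more than flow dependences.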
Approaches to Dependence Detection (II)
Combinational dependence check logic
Special logic checks if any instruction in later stages is
supposed to write to any source register of the
instruction that is being decoded
Yes: stall the instruction/pipeline
No: no need to stall… no flow dependence
Advantage:
No need to stall on anti and output dependences
Disadvantage:
Logic is more complex than a scoreboard
Logic becomes more complex as we make the pipeline deeper and wider (flash-forward: think superscalar execution)
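The combinational check compares the source registers of the instruction in decode against the destination of every instruction in a later stage. A minimal sketch with illustrative names, where `None` marks a stage whose instruction writes no register:

```python
def must_stall(decode_srcs, in_flight_dests) -> bool:
    """True if any instruction in a later stage will write a source of the decoded one.

    in_flight_dests: destination register (or None) for each later pipeline stage.
    """
    return any(d is not None and d in decode_srcs for d in in_flight_dests)


print(must_stall((3, 4), [3, None, None]))  # flow dependence on r3: stall
print(must_stall((1, 2), [3, None, 5]))     # no source is pending: no stall
```

Only sources are compared, so anti and output dependences never trigger a stall. The number of comparisons grows with the number of later stages and the number of instructions decoded per cycle, which is the complexity cost noted above.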
Once You Detect the Dependence in Hardware
What do you do afterwards?
Data Forwarding/Bypassing
Problem: A consumer (dependent) instruction has to
wait in decode stage until the producer instruction
writes its value in the register file
Goal: We do not want to stall the pipeline
unnecessarily
Observation: The data value needed by the
consumer instruction can be supplied directly from a
later stage in the pipeline (instead of only from the
register file)
Idea: Add additional dependence check logic and
data forwarding paths (buses) to supply the
producer’s value to the consumer right after the
value is available
Aside: A Special Case of Data Dependence
Control dependence
Data dependence on the Instruction Pointer / Program Counter
Aside: Control Dependence
Question: What should the fetch PC be in the next
cycle?
Answer: The address of the next instruction
All instructions are control dependent on previous ones.
Why?
We did not cover the following slides. They are for your benefit.
We will cover them in future lectures.
Data Dependence Handling: Concepts and Implementation
How to Implement Stalling
[Figure: pipelined datapath with control (pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) and an added Stall signal]
subsequent instructions read the correct value of $s0
[Figure: pipeline diagram (time in cycles) for the sequence below]
add $s0, $s2, $s3
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
Compile-Time Detection and Elimination
[Figure: pipeline diagram over cycles 1 to 10; two nops inserted after add delay the and until $s0 has been written]
add $s0, $s2, $s3
nop
nop
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
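A compiler pass that inserts the nops can be sketched as follows. The required producer-to-consumer distance of 3 slots is read off the two nops in the sequence above and assumes a register file that can be written and then read within one cycle; the function name and the tuple encoding are illustrative.

```python
def insert_nops(program, min_distance: int = 3):
    """Pad with nops (None) so every consumer is at least `min_distance` slots after its producer.

    program: list of (dest, srcs) tuples in program order.
    """
    out = []
    last_write = {}  # register -> index in `out` of its most recent producer
    for dest, srcs in program:
        need = 0
        for s in srcs:
            if s in last_write:
                gap = len(out) - last_write[s]
                need = max(need, min_distance - gap)
        out.extend([None] * need)
        out.append((dest, srcs))
        last_write[dest] = len(out) - 1
    return out


program = [
    ("$s0", ("$s2", "$s3")),  # add
    ("$t0", ("$s0", "$s1")),  # and
    ("$t1", ("$s4", "$s0")),  # or
    ("$t2", ("$s0", "$s5")),  # sub
]
for slot in insert_nops(program):
    print("nop" if slot is None else slot)
```

A real compiler would first try to move independent instructions into those slots and fall back to nops only when none are available.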
Remember dataflow?
Data value supplied to dependent instruction as soon
as it is available
Instruction executes when all its operands are
available
[Figure: pipeline diagram over cycles 1 to 8 for add $s0, $s2, $s3; and $t0, $s0, $s1; or $t1, $s4, $s0; sub $t2, $s0, $s5, with $s0 supplied to the dependent instructions]
Data Forwarding
[Figure: pipelined MIPS datapath with forwarding multiplexers (inputs 00, 01, 10) in front of the ALU operands, selected by ForwardAE and ForwardBE; the Hazard Unit sees RsE, RtE, WriteRegM, WriteRegW, RegWriteM, and RegWriteW]
Data Forwarding
Forward to Execute stage from either:
Memory stage or
Writeback stage
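The selection between the two forwarding sources can be sketched as below. The mux encodings (10 for the Memory stage, 01 for the Writeback stage, 00 for the register file) and the check against register 0 follow the usual MIPS hazard-unit formulation and are assumptions here; the function name is illustrative.

```python
def forward_select(src_reg_e: int, write_reg_m: int, reg_write_m: bool,
                   write_reg_w: int, reg_write_w: bool) -> int:
    """Select the source of one ALU operand in the Execute stage."""
    if src_reg_e != 0 and src_reg_e == write_reg_m and reg_write_m:
        return 0b10  # newest value: result sitting in the Memory stage
    if src_reg_e != 0 and src_reg_e == write_reg_w and reg_write_w:
        return 0b01  # older value: result sitting in the Writeback stage
    return 0b00      # no pending writer: use the register file output


# add $s0 (register 16) is in Memory while "and" needs $s0 in Execute.
print(forward_select(16, 16, True, 0, False))
```

The Memory stage is checked first because, when both stages hold a write to the same register, it contains the more recent value.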
[Figure: pipeline diagram (time in cycles); the and needs $s0 in its Execute stage before lw has read it from data memory: Trouble!]
lw  $s0, 40($0)
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
lw does not have its data until the end of the Memory stage, so its result cannot be forwarded to the Execute stage of the next instruction
Stalling
[Figure: pipeline diagram over cycles 1 to 9; and repeats its Decode (RF) stage and or repeats its Fetch (IM) stage for one cycle (Stall)]
lw  $s0, 40($0)
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
Hardware Needed for Stalling
Stalls are supported by
adding enable inputs (EN) to the Fetch and Decode
pipeline registers
and a synchronous reset/clear (CLR) input to the
Execute pipeline register
or an INV bit associated with each pipeline register,
indicating that contents are INValid
[Figure: pipelined MIPS datapath with forwarding and stalling; the Hazard Unit takes MemtoRegE, RegWriteM, and RegWriteW as inputs and drives ForwardAE, ForwardBE, StallF and StallD (the EN inputs of the Fetch and Decode pipeline registers), and FlushE (the CLR input of the Execute pipeline register)]
Recall: How to Handle Data
Dependences
Anti and output dependences are easier to handle
write to the destination in one stage and in program
order
Fine-Grained Multithreading
Idea: Hardware has multiple thread contexts
(PC+registers). Each cycle, fetch engine fetches from
a different thread.
By the time the fetched branch/instruction resolves, no
instruction is fetched from the same thread
Branch/instruction resolution latency overlapped with
execution of other threads’ instructions
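The fetch policy can be sketched as a round-robin schedule. The assumption that thread selection is strictly round-robin, and the function names, are illustrative rather than taken from the lecture.

```python
def fgmt_fetch_schedule(num_threads: int, num_cycles: int) -> list:
    """Thread id fetched in each cycle under strict round-robin selection."""
    return [cycle % num_threads for cycle in range(num_cycles)]


def same_thread_overlap(num_threads: int, pipeline_depth: int) -> bool:
    """True if two instructions of one thread can be in the pipeline together.

    A thread fetches every `num_threads` cycles and an instruction stays in the
    pipeline for `pipeline_depth` cycles.
    """
    return num_threads < pipeline_depth


print(fgmt_fetch_schedule(4, 6))
print(same_thread_overlap(8, 8), same_thread_overlap(4, 8))
```

When the number of threads is at least the pipeline depth, no thread ever has two instructions in flight, which is the condition under which intra-thread dependence checks become unnecessary.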
8 stages → 800 ns to complete an instruction (assuming no memory access)
Multithreaded Pipeline Example
Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
Fine-Grained Multithreading
Advantages
+ No need for dependency checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions
from different threads
+ Improved system throughput, latency tolerance, utilization
Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs,
register files, …), thread selection logic
- Reduced single thread performance (one instruction fetched
every N cycles from the same thread)
- Resource contention between threads in caches and memory
- Some dependency checking logic between threads remains
(load/store)
Modern GPUs are FGMT Machines
NVIDIA GeForce GTX 285 “core”
[Figure: one GTX 285 core]
64 KB of storage for thread contexts (registers)
Groups of 32 threads share instruction stream (each group is a Warp): they execute the same instruction on different data
Up to 32 warps are interleaved in an FGMT manner
Up to 1024 thread contexts can be stored
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285
[Figure: GTX 285 chip layout: an array of cores with texture (Tex) units]
Burton Smith
(1941-2018)
Further Reading for the Interested (II)
Control Dependence
Question: What should the fetch PC be in the next
cycle?
Answer: The address of the next instruction
All instructions are control dependent on previous ones.
Why?
Control Dependences
Special case of data dependence: dependence on PC
beq:
branch is not determined until the fourth stage of the pipeline
Instructions after the branch are fetched before branch is resolved
Always predict that the next sequential instruction is fetched
Called “Always not taken” prediction
These instructions must be flushed if the branch is taken
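The cost of “always not taken” can be estimated with a minimal sketch. It assumes the fetch redirect takes effect in the cycle after the branch resolves, so a branch resolved in stage n (Fetch = 1) leaves n - 1 wrong-path instructions to flush; the branch statistics below are hypothetical.

```python
def flushed_on_taken_branch(resolve_stage: int) -> int:
    """Wrong-path instructions fetched before a taken branch resolves (Fetch = stage 1)."""
    return resolve_stage - 1


def average_cpi(base_cpi: float, branch_fraction: float, taken_fraction: float,
                resolve_stage: int) -> float:
    """Average CPI when every taken branch pays the flush penalty."""
    penalty = flushed_on_taken_branch(resolve_stage)
    return base_cpi + branch_fraction * taken_fraction * penalty


# Hypothetical workload: 20% branches, half of them taken.
print(average_cpi(1.0, 0.2, 0.5, resolve_stage=4))
print(average_cpi(1.0, 0.2, 0.5, resolve_stage=2))
```

Resolving the branch earlier in the pipeline shrinks the penalty, at the price of extra hardware in that earlier stage.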
[Figure: pipelined MIPS datapath with forwarding and stalling (Hazard Unit signals ForwardAE, ForwardBE, StallF, StallD, FlushE); the branch target PCBranchM is available in the Memory stage]
Control Dependence
[Figure: pipeline diagram over cycles 1 to 9; the three instructions fetched after beq are flushed when the branch is taken]
20 beq $t1, $t2, 40
24 and $t0, $s0, $s1    (flush)
28 or  $t1, $s4, $s0    (flush)
2C sub $t2, $s0, $s5    (flush)
30 ...
...
64 slt $t3, $s2, $s3
[Figure: pipelined MIPS datapath with the branch resolved in the Decode stage: an equality comparator on the register file outputs produces EqualD, which drives PCSrcD, and the branch target is PCBranchD; the Decode pipeline register gains a CLR input]
[Figure: pipeline timing diagram with the branch resolved in the Decode stage]
20 beq $t1, $t2, 40
24 and $t0, $s0, $s1 (flush this instruction)
30 ...
...
64 slt $t3, $s2, $s3
Disadvantages
Potential increase in clock cycle time?
Higher Tclock?
Additional hardware cost
Specialized and likely not used by other instructions
[Figure: pipelined MIPS datapath with early branch resolution plus Decode-stage forwarding (ForwardAD, ForwardBD) and a Hazard Unit that also takes BranchD and RegWriteE as inputs]
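// Stalling logic:
// Load-use hazard: the lw in Execute writes a register that the instruction in Decode reads
assign lwstall = ((rsD == rtE) | (rtD == rtE)) & MemtoRegE;
// Branch hazard: a branch in Decode needs a result that is not yet available
// (definition assumed here to follow H&H Section 7.5.3)
assign branchstall = (BranchD & RegWriteE & ((WriteRegE == rsD) | (WriteRegE == rtD)))
                   | (BranchD & MemtoRegM & ((WriteRegM == rsD) | (WriteRegM == rtD)));
// Stall signals:
assign StallF = lwstall | branchstall;
assign StallD = lwstall | branchstall;
assign FlushE = lwstall | branchstall;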
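To make the stall conditions easier to experiment with, here is a minimal Python model of the same logic. It is only a sketch: the function name `stall_signals` is hypothetical, register numbers are plain integers, and `branchstall` is taken as an input rather than computed.

```python
def stall_signals(rsD: int, rtD: int, rtE: int, MemtoRegE: bool,
                  branchstall: bool = False) -> dict:
    """Software model of the load-use stall logic.

    lwstall is true when the instruction in Execute is a load (MemtoRegE)
    whose destination (rtE) matches a source of the instruction in Decode.
    """
    lwstall = ((rsD == rtE) or (rtD == rtE)) and MemtoRegE
    stall = bool(lwstall or branchstall)
    # Fetch and Decode hold their contents; Execute receives a bubble.
    return {"StallF": stall, "StallD": stall, "FlushE": stall}


# lw $t1 (register 9) is in Execute and the next instruction reads $t1.
print(stall_signals(rsD=9, rtD=10, rtE=9, MemtoRegE=True))
# Same register match, but the instruction in Execute is not a load.
print(stall_signals(rsD=9, rtD=10, rtE=9, MemtoRegE=False))
```

All three outputs are driven by the same condition, so stalling Fetch and Decode always coincides with inserting a bubble into Execute.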
Questions to Ponder
What is the role of the hardware vs. the software in
data dependence handling?
Software based interlocking
Hardware based interlocking
Who inserts/manages the pipeline bubbles?
Who finds the independent instructions to fill “empty”
pipeline slots?
What are the advantages/disadvantages of each?
Think of the performance equation as well
Questions to Ponder
What is the role of the hardware vs. the software in
the order in which instructions are executed in the
pipeline?
Software based instruction scheduling → static scheduling
Hardware based instruction scheduling → dynamic scheduling
More on Static Instruction
Scheduling
https://www.youtube.com/onurmutlulectures
Lectures on Static Instruction
Scheduling
Computer Architecture, Spring 2015, Lecture 16
Static Instruction Scheduling (CMU, Spring 2015)
https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=18
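Suppose:
40% of loads used by next instruction
25% of branches mispredicted
Then:
Load CPI = 1 without a stall, 2 when the next instruction uses the load: (0.6)(1) + (0.4)(2) = 1.4
Branch CPI = 1 when predicted correctly, 2 when mispredicted: (0.75)(1) + (0.25)(2) = 1.25
And
Average CPI = (0.25)(1.4) +     load
              (0.1)(1) +        store
              (0.11)(1.25) +    beq
              (0.02)(2) +       jump
              (0.52)(1)         r-type
            = 1.1475 ≈ 1.15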
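The weighted sum above can be checked with a few lines of Python. The instruction fractions and per-class CPIs are the ones in the formula; the dictionary layout and the name `average_cpi` are just choices made for this sketch.

```python
# (fraction of instructions, CPI for that class), taken from the formula above
MIX = {
    "load":   (0.25, 1.4),
    "store":  (0.10, 1.0),
    "beq":    (0.11, 1.25),
    "jump":   (0.02, 2.0),
    "r-type": (0.52, 1.0),
}


def average_cpi(mix: dict) -> float:
    """Average CPI as the fraction-weighted sum of per-class CPIs."""
    return sum(fraction * cpi for fraction, cpi in mix.values())


cpi = average_cpi(MIX)
print(f"Average CPI = {cpi:.4f}")  # rounds to 1.15 at two decimals
```

Changing the load-use or misprediction rate only changes the per-class CPI for loads or branches, which makes it easy to see how sensitive the average is to each hazard.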
Pipelined Performance
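There are 5 stages, and 5 different timing paths:
$$
T_c = \max \left\{ \begin{array}{ll}
t_{pcq} + t_{mem} + t_{setup} & \text{fetch} \\
2\,(t_{RFread} + t_{mux} + t_{eq} + t_{AND} + t_{mux} + t_{setup}) & \text{decode} \\
t_{pcq} + t_{mux} + t_{mux} + t_{ALU} + t_{setup} & \text{execute} \\
t_{pcq} + t_{memwrite} + t_{setup} & \text{memory} \\
2\,(t_{pcq} + t_{mux} + t_{RFwrite}) & \text{writeback}
\end{array} \right\}
$$
The operation speed depends on the slowest operation
Decode and Writeback use the register file and have only half a clock cycle each, hence the factor of 2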
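A small Python sketch can evaluate the five paths and report which stage sets the clock period. The delay values below (in picoseconds) are illustrative assumptions, not numbers given in this lecture; only the structure of the five expressions comes from the formula above.

```python
# Assumed element delays in picoseconds (illustrative values only)
d = {
    "tpcq": 30, "tsetup": 20, "tmux": 25, "tALU": 200, "tmem": 250,
    "tRFread": 150, "teq": 40, "tAND": 15, "tmemwrite": 220, "tRFwrite": 100,
}

# One entry per pipeline stage, mirroring the five timing paths
paths = {
    "fetch": d["tpcq"] + d["tmem"] + d["tsetup"],
    "decode": 2 * (d["tRFread"] + d["tmux"] + d["teq"] + d["tAND"]
                   + d["tmux"] + d["tsetup"]),
    "execute": d["tpcq"] + d["tmux"] + d["tmux"] + d["tALU"] + d["tsetup"],
    "memory": d["tpcq"] + d["tmemwrite"] + d["tsetup"],
    "writeback": 2 * (d["tpcq"] + d["tmux"] + d["tRFwrite"]),
}

critical_stage = max(paths, key=paths.get)
Tc = paths[critical_stage]
print(paths)
print(f"Tc = {Tc} ps, limited by the {critical_stage} stage")
```

With these assumed delays the decode path dominates, because it must fit into half a cycle and is therefore counted twice.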
Pipelining and Precise
Exceptions: Preserving
Sequential Semantics
Multi-Cycle Execution
Not all instructions take the same amount of time
for “execution”
Idea: Have multiple different functional units that
take different number of cycles
Can be pipelined or not pipelined
Can let independent instructions start execution on a
different functional unit before a previous long-latency
instruction finishes execution
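[Figure: F and D stages feed parallel functional units of different latency: integer add (E), integer mul (E E E E), FP mul (E E E E E E E E), load/store (E E E E E E E E ...); the stage after execution is marked "?"]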
Issues in Pipelining: Multi-Cycle
Execute
Instructions can take different number of cycles in
EXECUTE stage
Integer ADD versus FP MULtiply
FMUL R4 ← R1, R2   F D E E E E E E E E W
ADD  R3 ← R1, R2     F D E W
                       F D E W
                         F D E W
FMUL R2 ← R5, R6           F D E E E E E E E E W
ADD  R7 ← R5, R6             F D E W
                               F D E W
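The effect of different execute latencies on completion order can be reproduced with a short Python sketch. It assumes one instruction is fetched per cycle, no stalls, and an 8-cycle FP multiply versus a 1-cycle integer operation, as in the diagram; the names `inst2`, `inst3`, and `inst6` are placeholders for the unnamed rows.

```python
def writeback_cycle(index: int, exec_latency: int) -> int:
    """Cycle in which an instruction reaches W, assuming no stalls.

    The instruction at position `index` (0-based) is fetched in cycle
    index + 1, decoded one cycle later, executes for `exec_latency`
    cycles, and writes back in the cycle after its last execute cycle.
    """
    fetch_cycle = index + 1
    decode_cycle = fetch_cycle + 1
    last_execute_cycle = decode_cycle + exec_latency
    return last_execute_cycle + 1


# (name, execute latency in cycles), in program order
PROGRAM = [
    ("FMUL R4", 8), ("ADD R3", 1), ("inst2", 1), ("inst3", 1),
    ("FMUL R2", 8), ("ADD R7", 1), ("inst6", 1),
]

wb = {name: writeback_cycle(i, lat) for i, (name, lat) in enumerate(PROGRAM)}
completion_order = sorted(wb, key=wb.get)
print(wb)
print(completion_order)
```

The younger ADD instructions write back before the older FMUL, so architectural state is updated out of program order.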
When to Handle
Exceptions: when detected (and known to be non-
speculative)
Interrupts: when convenient
Except for very high priority ones
Power failure
Machine check (error)
Checking for and Handling Exceptions
in Pipelining
When the oldest instruction ready-to-be-retired is
detected to have caused an exception, the control
logic
Ensuring Precise Exceptions in
Pipelining
Idea: Make each operation take the same amount of
time
FMUL R3 ← R1, R2   F D E E E E E E E E W
ADD  R4 ← R1, R2     F D E E E E E E E E W
                       F D E E E E E E E E W
                         F D E E E E E E E E W
                           F D E E E E E E E E W
                             F D E E E E E E E E W
                               F D E E E E E E E E W
Downside
Worst-case instruction latency determines all
instructions’ latency
What about memory operations?
Each functional unit takes worst-case number of cycles?
Solutions
Reorder buffer
History buffer
Checkpointing
Suggested reading
Smith and Pleszkun, “Implementing Precise Interrupts in
Pipelined Processors,” IEEE Trans. on Computers 1988 and
ISCA 1985.
Recall: Solution I: Reorder
Buffer
(ROB)
Idea: Complete instructions out-of-order, but reorder
them before making results visible to architectural
state
When instruction is decoded it reserves the next-
sequential entry in the ROB
When instruction completes, it writes result into
ROB entry
When the instruction is the oldest in the ROB and has
completed without exceptions, its result is moved to
the reg. file or memory
[Figure: Instruction Cache, Register File, three Func Units, Reorder Buffer]
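The three rules above (reserve an entry in order, write the result when complete, retire only from the head) can be sketched in Python. This is a simplified model under stated assumptions: the class and method names are hypothetical, only register results are modeled (no stores), and an exception at the head simply flushes the buffer.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: in-order allocation, out-of-order completion, in-order retirement."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.regfile = {}       # architectural register state

    def allocate(self, dest: str) -> dict:
        """Reserve the next sequential entry at decode time."""
        entry = {"dest": dest, "value": None, "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry: dict, value, exception: bool = False) -> None:
        """Record the result in the entry; architectural state is untouched."""
        entry["value"] = value
        entry["done"] = True
        entry["exception"] = exception

    def retire(self) -> list:
        """Retire completed instructions from the head, oldest first."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["exception"]:
                self.entries.clear()  # flush the faulting and all younger instructions
                break
            self.entries.popleft()
            self.regfile[head["dest"]] = head["value"]
            retired.append(head["dest"])
        return retired


rob = ReorderBuffer()
fmul = rob.allocate("R4")   # older, long-latency instruction
add = rob.allocate("R3")    # younger, short-latency instruction
rob.complete(add, 7)        # ADD finishes first
first = rob.retire()        # nothing retires: FMUL at the head is not done
rob.complete(fmul, 42)
second = rob.retire()       # both retire, in program order
print(first, second, rob.regfile)
```

The younger ADD result sits in the buffer until the older FMUL completes, which is what keeps the register file consistent with sequential execution.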
Reorder Buffer
Buffers information about all instructions that are
decoded but not yet retired/committed
What’s in a ROB Entry?
V | DestRegID | DestRegVal | StoreAddr | StoreData | PC | Valid bits for reg/data + control bits | Exception?
F D E E E E E E E E R W
F D E R W
F D E R W
F D E R W
F D E E E E E E E E R W
F D E R W
F D E R W
Important: Register Renaming with a
Reorder Buffer
Output and anti dependencies are not true
dependencies
WHY? The same register refers to values that have
nothing to do with each other
They exist due to lack of register IDs (i.e.
names) in the ISA
Anti dependence: Write-after-Read (WAR)
r3 ← r1 op r2
r1 ← r4 op r5
Output dependence: Write-after-Write (WAW)
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
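A minimal Python sketch of renaming shows why these dependences disappear once every write gets a fresh name. The tags `p0`, `p1`, ... are hypothetical names introduced for illustration; in a ROB-based design the tag could be the ROB entry index, but that mapping is an assumption of this sketch.

```python
def rename(program):
    """Give every destination a fresh tag; sources read the latest mapping.

    `program` is a list of (dest, src1, src2) architectural register names.
    True (read-after-write) dependences are preserved through the mapping,
    while WAR and WAW conflicts vanish because no tag is written twice.
    """
    mapping = {}  # architectural name -> most recent tag
    renamed = []
    for count, (dest, src1, src2) in enumerate(program):
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        tag = f"p{count}"
        mapping[dest] = tag
        renamed.append((tag, s1, s2))
    return renamed


# Output dependence example: r3 is written twice
waw = [("r3", "r1", "r2"), ("r5", "r3", "r4"), ("r3", "r6", "r7")]
print(rename(waw))

# Anti dependence example: r1 is read, then overwritten
war = [("r3", "r1", "r2"), ("r1", "r4", "r5")]
print(rename(war))
```

In the first example the second instruction still reads the tag produced by the first (the true dependence), while the two writes to r3 now target different tags.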
In-Order Pipeline with Reorder
Buffer
Decode (D): Access regfile/ROB, allocate entry in ROB, check if
instruction can execute, if so dispatch instruction
Execute (E): Instructions can complete out-of-order
Completion (R): Write result to reorder buffer
Retirement/Commit (W): Check for exceptions; if none, write
result to architectural register file or memory; else, flush
pipeline and start from exception handler
In-order dispatch/execution, out-of-order completion, in-order
retirement
[Figure: F and D stages feed the functional units (integer add: E; integer mul: E E E E; FP mul: E E E E E E E E; load/store: E E E E E E E E ...), followed by the R and W stages]
Reorder Buffer Tradeoffs
Advantages
Conceptually simple for supporting precise exceptions
Can eliminate false dependences
Disadvantages
Reorder buffer needs to be accessed to get the results
that are yet to be written to the register file
CAM or indirection → increased latency and complexity