Onur Digitaldesign - Comparch 2021 Lecture14 Pipelined Processor Design Afterlecture
Digital Design & Computer Arch.
Lecture 14: Pipelined
Processor Design
Prof. Onur Mutlu
ETH Zürich
Spring 2021
22 April 2021
Required Readings
Last week & This week
Pipelining
H&H, Chapter 7.5
Pipelining Issues
H&H, Chapter 7.8.1-7.8.3
Agenda for Today & Next Few
Lectures
Earlier
Single-cycle Microarchitectures
Multi-cycle Microarchitectures
Review: Single-Cycle MIPS
Processor
[Figure: single-cycle MIPS datapath. The Control Unit decodes Op (31:26) and Funct (5:0) into Jump, MemtoReg, MemWrite, Branch, ALUControl2:0, ALUSrc, RegDst, and RegWrite; the datapath comprises the PC, Instruction Memory, Register File, Sign Extend, ALU, and Data Memory, with next-PC selection among PCPlus4, PCBranch, and PCJump.]
Review: Single-Cycle MIPS FSM
[Figure: single-cycle machine. Combinational logic computes the next state AS' from the current state AS; sequential logic (state) holds AS.]
Review: Multi-Cycle MIPS
Processor
[Figure: multi-cycle MIPS datapath. A single Instr / Data Memory, the Register File, and one ALU are reused across cycles; the Control Unit generates PCWrite, Branch, PCEn, IorD, MemWrite, IRWrite, PCSrc, ALUControl2:0, ALUSrcB1:0, ALUSrcA, RegWrite, MemtoReg, and RegDst.]
Review: Multi-Cycle MIPS
FSM
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 00, IRWrite, PCWrite
S1: Decode: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
S2: MemAdr (Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S3: MemRead (Op = LW): IorD = 1
S4: Mem Writeback: RegDst = 0, MemtoReg = 1, RegWrite
S5: MemWrite (Op = SW): IorD = 1, MemWrite
S6: Execute (Op = R-type): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
S7: ALU Writeback: RegDst = 1, MemtoReg = 0, RegWrite
S8: Branch (Op = BEQ): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 01, Branch
S9: ADDI Execute (Op = ADDI): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S10: ADDI Writeback: RegDst = 0, MemtoReg = 0, RegWrite
S11: Jump (Op = J): PCSrc = 10, PCWrite
What is the shortcoming of this design?
What does this design assume about memory?
Can We Do Better?
Review: Pipelining Basic Idea
[Figure: the multi-cycle datapath redrawn as a pipelined datapath, with signals suffixed by stage (PCPlus4F, PCPlus4D, PCPlus4E, SrcBE, WriteDataE, ALUOutM, WriteDataM, PCBranchM, ReadDataW, ResultW).]
[Figure: timing diagram of four instructions, each passing through stages F, D, E, W, with each instruction starting one cycle after the previous one.]
Pipelined: 4 cycles per 4 instructions (steady state)
1 instruction completed per cycle
Is life always this beautiful?
Review: Issues in Pipeline
Design
Balancing work in pipeline stages
How many stages and what is done in each stage
Review: Data Dependence
Types
Flow dependence: Read-after-Write (RAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
Anti dependence: Write-after-Read (WAR)
  r3 ← r1 op r2
  r1 ← r4 op r5
Output dependence: Write-after-Write (WAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7
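The three dependence types can be checked mechanically from each instruction's destination and source registers. The following is a minimal sketch, not part of the lecture; the `Instr` tuple and the `classify` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str      # destination register (hypothetical representation)
    srcs: tuple    # source registers


def classify(older: Instr, younger: Instr) -> set:
    """Return the dependence types of `younger` on `older`."""
    kinds = set()
    if older.dest in younger.srcs:
        kinds.add("RAW")   # flow: younger reads what older writes
    if younger.dest in older.srcs:
        kinds.add("WAR")   # anti: younger overwrites what older reads
    if younger.dest == older.dest:
        kinds.add("WAW")   # output: both write the same register
    return kinds


# The three-instruction example from the output-dependence case.
i1 = Instr("r3", ("r1", "r2"))
i2 = Instr("r5", ("r3", "r4"))
i3 = Instr("r3", ("r6", "r7"))
print(classify(i1, i2), classify(i1, i3), classify(i2, i3))
```

Only the RAW case reflects a value actually flowing between instructions; the other two arise from reuse of a register name.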
Review: How to Handle Data
Dependences
Anti and output dependences are easier to handle
write to the destination only in last stage and in
program order
[Figure: pipelined datapath with control. Control signals (RegWrite, Branch, MemWrite, MemRead, ALUSrc, ALUOp, RegDst, MemtoReg) travel with each instruction through the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]
Stall
subsequent instructions read the correct value of $s0
[Figure: pipeline timing diagram (IM, RF, ALU, DM, RF per instruction) for the sequence below; and, or, and sub all read $s0, which add writes.]
add $s0, $s2, $s3
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
Compile-Time Detection and Elimination
[Figure: pipeline timing diagram over cycles 1 to 10; two nops inserted after add delay and until $s0 has been written.]
add $s0, $s2, $s3
nop
nop
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
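Compile-time elimination can be sketched as a pass that pads the program with nops until every consumer sits far enough behind its producer. The sketch below is illustrative only: `insert_nops` and `MIN_DISTANCE` are hypothetical names, and the distance of 3 assumes a five-stage pipeline whose register file is written in the first half of a cycle and read in the second half, so that Decode of the consumer may coincide with Writeback of the producer.

```python
MIN_DISTANCE = 3  # assumed: a consumer must be at least 3 slots behind its producer


def insert_nops(program):
    """program: list of (text, dest, srcs). Returns the padded list of instruction texts."""
    out = []          # emitted instruction texts, including nops
    last_write = {}   # register -> index in `out` of its most recent writer
    for text, dest, srcs in program:
        # Earliest slot at which every source register can be read correctly.
        earliest = max(
            (last_write[r] + MIN_DISTANCE for r in srcs if r in last_write),
            default=0,
        )
        while len(out) < earliest:
            out.append("nop")
        out.append(text)
        if dest is not None:
            last_write[dest] = len(out) - 1
    return out


program = [
    ("add $s0, $s2, $s3", "$s0", ("$s2", "$s3")),
    ("and $t0, $s0, $s1", "$t0", ("$s0", "$s1")),
    ("or $t1, $s4, $s0", "$t1", ("$s4", "$s0")),
    ("sub $t2, $s0, $s5", "$t2", ("$s0", "$s5")),
]
padded = insert_nops(program)
print("\n".join(padded))
```

For this sequence the pass places two nops between add and and, matching the diagram. A real compiler would first try to move independent instructions into those slots instead of wasting them.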
Remember dataflow?
Data value supplied to dependent instruction as soon
as it is available
Instruction executes when all its operands are
available
[Figure: pipeline timing diagram over cycles 1 to 8 for add $s0, $s2, $s3; and $t0, $s0, $s1; or $t1, $s4, $s0; sub $t2, $s0, $s5, with no nops: the value of $s0 is passed from add directly to the dependent instructions.]
Data Forwarding
[Figure: pipelined datapath with forwarding. Multiplexers in front of the ALU inputs, selected by ForwardAE and ForwardBE, choose among the register file output (00), ResultW (01), and ALUOutM (10); the Hazard Unit drives them from RsE, RtE, WriteRegM, WriteRegW, RegWriteM, and RegWriteW.]
Data Forwarding
Forward to Execute stage from either:
Memory stage or
Writeback stage
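The choice between the two forwarding sources can be written as a small priority function. This is a sketch under assumptions, not the lecture's own code: it uses the 10 / 01 / 00 select encoding from the textbook convention (10 = from Memory stage, 01 = from Writeback stage, 00 = no forwarding), treats register 0 as never forwarded, and the Python names merely mirror the signal names in the datapath.

```python
def forward_select(rs_e, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    """Return the 2-bit select for one ALU operand in the Execute stage."""
    # The Memory stage holds the youngest matching result, so it has priority.
    if rs_e != 0 and reg_write_m and rs_e == write_reg_m:
        return 0b10
    if rs_e != 0 and reg_write_w and rs_e == write_reg_w:
        return 0b01
    return 0b00


S0, S1 = 16, 17  # MIPS register numbers of $s0 and $s1

# and $t0, $s0, $s1 in Execute while add $s0, ... is in Memory:
print(forward_select(S0, S0, True, 0, False))   # operand $s0
print(forward_select(S1, S0, True, 0, False))   # operand $s1
```

The same function serves both ForwardAE (called with RsE) and ForwardBE (called with RtE).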
[Figure: pipeline timing diagram for the sequence below. lw reads $s0 from memory in the Memory stage, but and needs it one cycle earlier in its Execute stage: Trouble!]
lw $s0, 40($0)
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
A load's data is not available until the end of the Memory stage, so its result cannot be forwarded to the Execute stage of the next instruction
Stalling
[Figure: pipeline timing diagram over cycles 1 to 9 for lw $s0, 40($0); and $t0, $s0, $s1; or $t1, $s4, $s0; sub $t2, $s0, $s5. and repeats its Decode stage and or repeats its Fetch stage for one cycle (Stall), after which $s0 is forwarded to and.]
Hardware Needed for Stalling
Stalls are supported by
adding enable inputs (EN) to the Fetch and Decode
pipeline registers
and a synchronous reset/clear (CLR) input to the
Execute pipeline register
or an INV bit associated with each pipeline register,
indicating that contents are INValid
[Figure: pipelined datapath with forwarding and stalling. The Hazard Unit takes MemtoRegE among its inputs and drives StallF and StallD (EN inputs of the Fetch and Decode pipeline registers) and FlushE (CLR input of the Execute pipeline register), in addition to ForwardAE and ForwardBE.]
A Special Case of Data
Dependence
Control dependence
Data dependence on the Instruction Pointer / Program
Counter
Control Dependence
Question: What should the fetch PC be in the next
cycle?
Answer: The address of the next instruction
All instructions are control dependent on previous ones.
Why?
Control Dependences
Special case of data dependence: dependence on PC
beq:
branch is not resolved until the fourth stage of the pipeline
Instructions after the branch are fetched before branch is resolved
Always predict that the next sequential instruction is fetched
Called “Always not taken” prediction
These instructions must be flushed if the branch is taken
Carnegie Mellon
[Figure: pipelined datapath with forwarding and stalling; the branch target PCBranchM is available in the Memory stage.]
Control Dependence
[Figure: pipeline timing diagram over cycles 1 to 9. The taken beq at address 20 is resolved in its fourth stage, so the three instructions fetched after it (and, or, sub) are flushed and fetch continues at the branch target, slt at address 64.]
20 beq $t1, $t2, 40
24 and $t0, $s0, $s1
28 or $t1, $s4, $s0
2C sub $t2, $s0, $s5
30 ...
...
64 slt $t3, $s2, $s3
[Figure: pipelined datapath with early branch resolution. An equality comparator on the register file outputs (EqualD) and the branch target adder (PCBranchD) are moved into the Decode stage, producing PCSrcD; a CLR input is added to the Decode pipeline register.]
[Figure: pipeline timing diagram. With the branch resolved in Decode, only the one instruction fetched after the taken beq (and) is flushed.]
20 beq $t1, $t2, 40
24 and $t0, $s0, $s1
30 ...
...
64 slt $t3, $s2, $s3
Disadvantages
Potential increase in clock cycle time?
Higher clock period and lower frequency?
Additional hardware cost
Specialized and likely not used by other instructions
[Figure: pipelined datapath with early branch resolution and forwarding into Decode. ForwardAD and ForwardBD select forwarded values for the equality comparator; the Hazard Unit additionally takes BranchD and RegWriteE as inputs.]
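// Stalling logic: stall when a load in Execute writes a register that the
// instruction in Decode reads (branchstall is defined outside this excerpt)
assign lwstall = ((rsD == rtE) | (rtD == rtE)) & MemtoRegE;
// Stall signals
assign StallF = lwstall | branchstall;
assign StallD = lwstall | branchstall;
assign FlushE = lwstall | branchstall;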
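The load-use stall condition can be checked against the lw example by modelling the same equations in software. The following is a rough behavioral sketch only: `hazard_stalls` is a hypothetical helper, and `branchstall` is taken as an input because its definition is not shown in this excerpt.

```python
def hazard_stalls(rs_d, rt_d, rt_e, mem_to_reg_e, branchstall=False):
    """Behavioral model of the stall equations; returns the three control outputs."""
    # A load in Execute (MemtoRegE) whose destination rtE is a source in Decode.
    lwstall = bool((rs_d == rt_e or rt_d == rt_e) and mem_to_reg_e)
    stall = lwstall or bool(branchstall)
    # Hold Fetch and Decode in place and send a bubble into Execute.
    return {"StallF": stall, "StallD": stall, "FlushE": stall}


S0, S1, S4 = 16, 17, 20  # MIPS register numbers of $s0, $s1, $s4

# lw $s0, 40($0) is in Execute; and $t0, $s0, $s1 is in Decode.
print(hazard_stalls(rs_d=S0, rt_d=S1, rt_e=S0, mem_to_reg_e=True))
```

The stall lasts a single cycle: once the load has moved on to Memory, MemtoRegE belongs to the bubble and the condition clears.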
Includes data dependence detection, early br resolution, forwarding, stall logic
More on Branch Prediction (I)
https://www.youtube.com/onurmutlulectures
More on Branch Prediction (II)
More on Branch Prediction (III)
Lectures on Branch Prediction
Digital Design & Computer Architecture, Spring 2020,
Lecture 16b
Branch Prediction I (ETH Zurich, Spring 2020)
https://www.youtube.com/watch?v=h6l9yYSyZHM&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=22
Pipelined Performance
Example
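Suppose:
40% of loads used by next instruction
25% of branches mispredicted
A load that is used by the next instruction and a mispredicted branch each cost one extra cycle, so
\[
\mathrm{CPI}_{lw} = 1 + 0.40 = 1.4, \qquad \mathrm{CPI}_{beq} = 1 + 0.25 = 1.25
\]
And
\[
\text{Average CPI} = \underbrace{(0.25)(1.4)}_{\text{load}} + \underbrace{(0.1)(1)}_{\text{store}} + \underbrace{(0.11)(1.25)}_{\text{beq}} + \underbrace{(0.02)(2)}_{\text{jump}} + \underbrace{(0.52)(1)}_{\text{r-type}} \approx 1.15
\]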
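The same weighted sum can be reproduced in a few lines, which makes it easy to vary the assumptions. This is only a sketch: the instruction mix and per-class CPI values are read off the equation above, and the dictionary names are chosen for illustration.

```python
# Fraction of each instruction class, taken from the coefficients of the sum.
mix = {"load": 0.25, "store": 0.10, "beq": 0.11, "jump": 0.02, "r-type": 0.52}

# Cycles per instruction of each class.
cpi = {
    "load": 1 + 0.40 * 1,   # 40% of loads stall the next instruction for 1 cycle
    "store": 1,
    "beq": 1 + 0.25 * 1,    # 25% of branches are mispredicted, 1 flushed cycle each
    "jump": 2,
    "r-type": 1,
}

average_cpi = sum(mix[k] * cpi[k] for k in mix)
print(f"Average CPI = {average_cpi:.4f}")
```

The unrounded value is 1.1475, which the slide reports as 1.15.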
Pipelined Performance
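There are 5 stages, and 5 different timing paths:
\[
T_c = \max \left\{ \begin{array}{ll}
t_{pcq} + t_{mem} + t_{setup} & \text{fetch} \\
2\,(t_{RFread} + t_{mux} + t_{eq} + t_{AND} + t_{mux} + t_{setup}) & \text{decode} \\
t_{pcq} + t_{mux} + t_{mux} + t_{ALU} + t_{setup} & \text{execute} \\
t_{pcq} + t_{memwrite} + t_{setup} & \text{memory} \\
2\,(t_{pcq} + t_{mux} + t_{RFwrite}) & \text{writeback}
\end{array} \right\}
\]
The operation speed depends on the slowest operation
Decode and Writeback use the register file and have only half a clock cycle each, hence the factor of 2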
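To see which stage limits the clock, the five paths can be evaluated numerically. The delay values below are illustrative assumptions in picoseconds, not figures from this lecture; only the structure of the five expressions follows the formula above.

```python
# Assumed element delays in picoseconds (illustrative only).
t = {
    "pcq": 30, "setup": 20, "mux": 25, "alu": 200,
    "mem": 250, "memwrite": 220,
    "rf_read": 150, "rf_write": 20, "eq": 40, "and": 15,
}

paths = {
    "fetch": t["pcq"] + t["mem"] + t["setup"],
    # Decode and Writeback get only half a cycle, hence the factor of 2.
    "decode": 2 * (t["rf_read"] + t["mux"] + t["eq"] + t["and"] + t["mux"] + t["setup"]),
    "execute": t["pcq"] + t["mux"] + t["mux"] + t["alu"] + t["setup"],
    "memory": t["pcq"] + t["memwrite"] + t["setup"],
    "writeback": 2 * (t["pcq"] + t["mux"] + t["rf_write"]),
}

critical_stage = max(paths, key=paths.get)
cycle_time = paths[critical_stage]
print(critical_stage, cycle_time, "ps")
```

With these assumed delays Decode is the critical stage, which illustrates the cost of resolving branches early: the comparator and AND gate sit on the half-cycle register file path.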
Recall: How to Handle Data
Dependences
Anti and output dependences are easier to handle
write to the destination only in last stage and in
program order
Questions to Ponder
What is the role of the hardware vs. the software in
the order in which instructions are executed in the
pipeline?
Software based instruction scheduling → static scheduling
Hardware based instruction scheduling → dynamic scheduling
More on Static Instruction
Scheduling
Lectures on Static Instruction
Scheduling
Computer Architecture, Spring 2015, Lecture 16
Static Instruction Scheduling (CMU, Spring 2015)
https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=18
Recall: How to Handle Data
Dependences
Anti and output dependences are easier to handle
write to the destination only in last stage and in
program order
Fine-Grained Multithreading
Idea: Hardware has multiple thread contexts
(PC+registers). Each cycle, fetch engine fetches from
a different thread.
By the time the fetched branch/instruction resolves, no
instruction is fetched from the same thread
Branch/instruction resolution latency overlapped with
execution of other threads’ instructions
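The interleaving idea can be sketched with a round-robin thread selector. This is a simplified model under assumptions: strict round-robin selection, every thread always ready, and a fixed `resolve_latency` in cycles; both function names are hypothetical.

```python
def fgmt_fetch_order(num_threads, cycles):
    """Thread ID chosen by the fetch engine in each cycle (strict round-robin)."""
    return [cycle % num_threads for cycle in range(cycles)]


def same_thread_overlap(num_threads, resolve_latency):
    """True if a thread would fetch again before its previous instruction resolves."""
    # A thread fetches every `num_threads` cycles, so overlap occurs only if
    # the previous instruction needs more cycles than that to resolve.
    return num_threads < resolve_latency


print(fgmt_fetch_order(4, 8))
print(same_thread_overlap(num_threads=4, resolve_latency=4))
print(same_thread_overlap(num_threads=2, resolve_latency=4))
```

With at least as many threads as cycles of resolution latency, each thread has at most one instruction in flight, which is what removes the need for dependence checking within a thread.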
8 stages: 800 ns to complete an instruction, assuming no memory access
Multithreaded Pipeline Example
Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
Fine-Grained Multithreading
Advantages
+ No need for dependence checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions
from different threads
+ Improved system throughput, latency tolerance, utilization
Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs,
register files, …), thread selection logic
- Reduced single thread performance (one instruction fetched
every N cycles from the same thread)
- Resource contention between threads in caches and memory
- Some dependence checking logic between threads remains (load/store)
Modern GPUs are
FGMT Machines
NVIDIA GeForce GTX 285
“core”
64 KB of storage
… for thread
contexts
(registers)
Slide credit: Kayvon Fatahalian
Groups of 32 threads share instruction stream (each
group is a Warp): they execute the same instruction
on different data
Up to 32 warps are interleaved in an FGMT manner
Up to 1024 thread contexts can be stored
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285
[Figure: GTX 285 chip layout, an array of cores grouped around texture (Tex) units.]
Burton Smith
(1941-2018)
Further Reading for the
Interested (II)
We did not cover the
following slides. They are for
your benefit.
We will cover them in future
lectures.
Pipelining and Precise
Exceptions: Preserving
Sequential Semantics
Multi-Cycle Execution
Not all instructions take the same amount of time
for “execution”
Idea: Have multiple different functional units that
take different number of cycles
Can be pipelined or not pipelined
Can let independent instructions start execution on a
different functional unit before a previous long-latency
instruction finishes execution
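[Figure: after F and D, an instruction is sent to one of several functional units with different latencies: Integer add (E), Integer mul (E E E E), FP mul (E E E E E E E E), Load/store (E E E E E E E E ...). The stage after execute is marked "?".]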
Issues in Pipelining: Multi-Cycle
Execute
Instructions can take different number of cycles in
EXECUTE stage
Integer ADD versus FP MULtiply
FMUL R4 ← R1, R2   F D E E E E E E E E W
ADD  R3 ← R1, R2     F D E W
                       F D E W
                         F D E W
FMUL R2 ← R5, R6           F D E E E E E E E E W
ADD  R7 ← R5, R6             F D E W
                               F D E W
When to Handle
Exceptions: when detected (and known to be non-speculative)
Interrupts: when convenient
Except for very high priority ones
Power failure
Machine check (error)
Checking for and Handling Exceptions
in Pipelining
When the oldest instruction ready-to-be-retired is detected to have caused an exception, the control logic flushes the pipeline and starts fetching from the exception handler
Ensuring Precise Exceptions in
Pipelining
Idea: Make each operation take the same amount of
time
FMUL R3 ← R1, R2   F D E E E E E E E E W
ADD  R4 ← R1, R2     F D E E E E E E E E W
                       F D E E E E E E E E W
                         F D E E E E E E E E W
                           F D E E E E E E E E W
                             F D E E E E E E E E W
                               F D E E E E E E E E W
Downside
Worst-case instruction latency determines all
instructions’ latency
What about memory operations?
Each functional unit takes worst-case number of cycles?
Solutions
Reorder buffer
History buffer
Checkpointing
Suggested reading
Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans on Computers 1988 and ISCA 1985.
Recall: Solution I: Reorder
Buffer
(ROB)
Idea: Complete instructions out-of-order, but reorder
them before making results visible to architectural
state
When instruction is decoded it reserves the next-sequential entry in the ROB
When instruction completes, it writes result into
ROB entry
When instruction is oldest in ROB and it has completed without exceptions, its result is moved to reg. file or memory
[Figure: Instruction Cache → Register File → Func Units → Reorder Buffer, with results written back from the Reorder Buffer to the Register File.]
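The allocate / complete / retire protocol of a reorder buffer can be sketched as a small queue. This is a toy model under assumptions, not a hardware description: the `ReorderBuffer` class, its method names, and the dictionary-based entries are hypothetical, and stores to memory are left out.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # oldest instruction at the left
        self.regfile = {}        # architectural register file

    def allocate(self, dest):
        """At decode: reserve the next sequential entry."""
        entry = {"dest": dest, "value": None, "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value, exception=False):
        """At completion (possibly out of order): write the result into the entry."""
        entry.update(value=value, done=True, exception=exception)

    def retire(self):
        """Commit completed entries from the head, in program order."""
        events = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            if head["exception"]:
                self.entries.clear()          # flush all younger instructions
                events.append(("exception", head["dest"]))
                break
            self.regfile[head["dest"]] = head["value"]
            events.append(("commit", head["dest"]))
        return events


rob = ReorderBuffer()
fmul = rob.allocate("R3")
add = rob.allocate("R4")
rob.complete(add, 7)            # the short ADD finishes first
early = rob.retire()            # nothing commits: FMUL is still the oldest
rob.complete(fmul, 42)
late = rob.retire()             # both commit, in program order
print(early, late, rob.regfile)
```

The ADD result waits in its entry until the older FMUL has completed, so the register file only ever changes in program order.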
Reorder Buffer
Buffers information about all instructions that are
decoded but not yet retired/committed
What’s in a ROB Entry?
Valid bits for reg/data
V DestRegID DestRegVal StoreAddr StoreData PC Exception?
+ control bits
F D E E E E E E E E R W
F D E R W
F D E R W
F D E R W
F D E E E E E E E E R W
F D E R W
F D E R W
Important: Register Renaming with a
Reorder Buffer
Output and anti dependencies are not true
dependencies
WHY? The same register refers to values that have
nothing to do with each other
They exist due to lack of register ID’s (i.e.
names) in the ISA
Anti dependence: Write-after-Read (WAR)
  r3 ← r1 op r2
  r1 ← r4 op r5
Output dependence: Write-after-Write (WAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7
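Because anti and output dependences come only from reusing register names, giving every write a fresh name removes them. The sketch below illustrates this with a simple rename table; `rename` and the `p0, p1, ...` names are hypothetical, and in a design with a reorder buffer the fresh names would correspond to ROB entries.

```python
import itertools


def rename(program):
    """program: list of (dest, src1, src2) using architectural register names."""
    fresh = (f"p{i}" for i in itertools.count())
    mapping = {}   # architectural name -> most recent fresh name
    renamed = []
    for dest, *srcs in program:
        # Sources use the latest mapping, so true (RAW) dependences are kept.
        new_srcs = [mapping.get(s, s) for s in srcs]
        # Every write gets a new name, so WAR and WAW conflicts disappear.
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], *new_srcs))
    return renamed


program = [
    ("r3", "r1", "r2"),
    ("r5", "r3", "r4"),
    ("r3", "r6", "r7"),
]
renamed = rename(program)
print(renamed)
```

After renaming, the first and third instructions write different names, while the second still reads the first instruction's result.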
In-Order Pipeline with Reorder
Buffer
Decode (D): Access regfile/ROB, allocate entry in ROB, check if
instruction can execute, if so dispatch instruction
Execute (E): Instructions can complete out-of-order
Completion (R): Write result to reorder buffer
Retirement/Commit (W): Check for exceptions; if none, write
result to architectural register file or memory; else, flush
pipeline and start from exception handler
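In-order dispatch/execution, out-of-order completion, in-order retirement
[Figure: after F and D, instructions execute in functional units of different latencies (Integer add: E; Integer mul: E E E E; FP mul: E E E E E E E E; Load/store: E E E E E E E E ...), then pass through R and W.]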
Reorder Buffer Tradeoffs
Advantages
Conceptually simple for supporting precise exceptions
Can eliminate false dependences
Disadvantages
Reorder buffer needs to be accessed to get the results
that are yet to be written to the register file
CAM or indirection → increased latency and complexity