
Digital Design & Computer Arch.
Lecture 14: Pipelined Processor Design
Prof. Onur Mutlu

ETH Zürich
Spring 2021
22 April 2021
Required Readings
 Last week & This week
 Pipelining
 H&H, Chapter 7.5
 Pipelining Issues
 H&H, Chapter 7.8.1-7.8.3

 This week & Next week


 Out-of-order execution
 H&H, Chapter 7.8-7.9
 Smith and Sohi, “The Microarchitecture of Superscalar
Processors,” Proceedings of the IEEE, 1995
 More advanced pipelining
 Interrupt and exception handling
 Out-of-order and superscalar execution concepts

2
Agenda for Today & Next Few Lectures
Earlier
 Single-cycle Microarchitectures
 Multi-cycle Microarchitectures

 Last week & today


 Pipelining
 Issues in Pipelining: Control & Data Dependence
Handling, State Maintenance and Recovery, …

 Tomorrow & Next week


 Out-of-Order Execution
 Issues in OoO Execution: Load-Store Handling, …

3
Review: Single-Cycle MIPS Processor
[Figure: Single-cycle MIPS processor datapath with its control unit (PC, instruction memory, register file, ALU, data memory, and the control signals Jump, MemtoReg, MemWrite, Branch, ALUControl, ALUSrc, RegDst, RegWrite, PCSrc)]

4
Review: Single-Cycle MIPS FSM
 Single-cycle machine

[Figure: combinational logic computes the next architectural state AS' from the current state AS held in sequential logic (state elements)]

 AS: Architectural State 5


Can We Do Better?

6
Review: Multi-Cycle MIPS Processor

[Figure: Multi-cycle MIPS processor datapath with its control unit (shared instruction/data memory, instruction register, register file, ALU, and control signals PCWrite, Branch, PCEn, IorD, PCSrc, MemWrite, IRWrite, ALUControl, ALUSrcA/B, RegWrite, MemtoReg, RegDst)]

7
Review: Multi-Cycle MIPS FSM

[Figure: Multi-cycle MIPS control FSM with states S0: Fetch, S1: Decode, S2: MemAdr, S3: MemRead, S4: Mem Writeback, S5: MemWrite, S6: Execute, S7: ALU Writeback, S8: Branch, S9: ADDI Execute, S10: ADDI Writeback, S11: Jump]

What is the shortcoming of this design?
What does this design assume about memory?

8
Can We Do Better?

9
Review: Pipelining Basic Idea
[Figure: Multi-cycle MIPS datapath, annotated to show how fetch, decode, execute, memory, and writeback work can be overlapped in a pipeline]
Of course, we need to be more careful than this! 10


Carnegie Mellon

Review: Pipelined Datapath & Control


[Figure: Pipelined MIPS datapath and control with Fetch, Decode, Execute, Memory, and Writeback stages separated by pipeline registers; the control signals (RegWrite, MemtoReg, MemWrite, Branch, ALUControl, ALUSrc, RegDst, PCSrc) travel down the pipeline with their instruction]

 Same control unit as single-cycle processor


Control delayed to proper pipeline stage 11
Review: Execution of Four Independent ADDs
 Multi-cycle: 4 cycles per instruction

[Timing diagram: each ADD occupies F, D, E, W in turn, and the next ADD starts only after the previous one finishes]

 Pipelined: 4 cycles per 4 instructions (steady state)
 1 instruction completed per cycle

[Timing diagram: the four ADDs overlap, with a new instruction entering F every cycle]

Is life always this beautiful?

12
Review: Issues in Pipeline Design
 Balancing work in pipeline stages
 How many stages and what is done in each stage

 Keeping the pipeline correct, moving, and full in the


presence of events that disrupt pipeline flow
 Handling dependences
 Data
 Control
 Handling resource contention
 Handling long-latency (multi-cycle) operations

 Handling exceptions, interrupts


 Advanced: Improving pipeline throughput
 Minimizing stalls
13
Data Dependence Handling: Concepts and Implementation

14
Review: Data Dependence Types
 Flow dependence
 r3 ← r1 op r2    Read-after-Write (RAW)
 r5 ← r3 op r4

 Anti dependence
 r3 ← r1 op r2    Write-after-Read (WAR)
 r1 ← r4 op r5

 Output dependence
 r3 ← r1 op r2    Write-after-Write (WAW)
 r5 ← r3 op r4
 r3 ← r6 op r7 15
Review: How to Handle Data Dependences
 Anti and output dependences are easier to handle
 write to the destination only in last stage and in program order

 Flow dependences are more interesting

 Six fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent
instruction
 Detect and eliminate the dependence at the software
level
 No need for the hardware to detect dependence
 Detect and move it out of the way for independent
instructions
16

Review: Pipeline Stall: Resolving Data Dependence

[Timing diagram, cycles t0-t5: instruction i produces rx; a dependent instruction j (_ ← rx) stalls in the ID stage, with bubbles inserted into the pipeline, until rx is available; the number of stall cycles depends on the distance dist(i,j) = 1, 2, 3, or 4 between producer and consumer]

Stall = make the dependent instruction wait until its source data value is available
1. stop all up-stream stages
2. drain all down-stream stages

17
How to Implement Stalling
[Figure: 5-stage MIPS pipeline (P&H style) with PC, IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, register file, ALU, and data memory]

 Stall
 disable PC and IF/ID latching; ensure stalled instruction stays in its stage
 Insert "invalid" instructions/nops into the stage following the stalled one (called "bubbles") 18
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
RAW Data Dependence Example
 One instruction writes a register ($s0) and the next instructions read this register => read after write (RAW) dependence
 add writes $s0 in the first half of cycle 5
 and reads $s0 on cycle 3, obtaining the wrong value
 or reads $s0 on cycle 4, again obtaining the wrong value
 sub reads $s0 in the 2nd half of cycle 5, getting the correct value
 subsequent instructions read the correct value of $s0
(and and or get the wrong value only if the pipeline handles data dependences incorrectly!)

[Timing diagram, cycles 1-8: add $s0, $s2, $s3; and $t0, $s0, $s1; or $t1, $s4, $s0; sub $t2, $s0, $s5 flowing through IM, RF read, ALU, DM, RF write]
Compile-Time Detection and Elimination

[Timing diagram, cycles 1-10: two nops inserted after add $s0, $s2, $s3 so that and $t0, $s0, $s1, or $t1, $s4, $s0, and sub $t2, $s0, $s5 read $s0 only after it has been written back]

 Insert enough NOPs for the required result to be ready
 Or (if you can) move independent useful instructions up
Data Forwarding
 Also called Data Bypassing

 We have already seen the basic idea before


 Forward the result value to the dependent instruction
as soon as the value is available

 Remember dataflow?
 Data value supplied to dependent instruction as soon
as it is available
 Instruction executes when all its operands are
available

 Data forwarding brings a pipeline closer to data flow


execution principles
Data Forwarding

[Timing diagram, cycles 1-8: the result of add $s0, $s2, $s3 is forwarded from its Memory and Writeback stages to the Execute stages of and $t0, $s0, $s1 and or $t1, $s4, $s0; sub $t2, $s0, $s5 reads $s0 from the register file]
Data Forwarding

[Figure: Pipelined MIPS datapath extended for forwarding: multiplexers in front of the ALU select SrcAE and SrcBE from the register-file outputs, ALUOutM, or ResultW, controlled by ForwardAE and ForwardBE signals computed by the Hazard Unit from RsE, RtE, WriteRegM/W, and RegWriteM/W]
Data Forwarding
 Forward to Execute stage from either:
 Memory stage or
 Writeback stage

 When should we forward from either Memory or


Writeback stage?
 If that stage will write to a destination register and the
destination register matches the source register.
 If both the Memory and Writeback stages contain
matching destination registers, the Memory stage
should have priority, because it contains the more
recently executed instruction.
Data Forwarding (in Pseudocode)
 Forward to Execute stage from either:
 Memory stage or
 Writeback stage

 Forwarding logic for ForwardAE (pseudo code):


if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then
ForwardAE = 10 # forward from Memory stage
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then
ForwardAE = 01 # forward from Writeback stage
else
ForwardAE = 00 # no forwarding

 Forwarding logic for ForwardBE is the same, but replace rsE with rtE
Forwarding Is Not Always Possible

[Timing diagram, cycles 1-8: lw $s0, 40($0) followed by and $t0, $s0, $s1, or $t1, $s4, $s0, and sub $t2, $s0, $s5 — "Trouble!": the load's data is available only at the end of its Memory stage, too late to forward to the Execute stage of the immediately following and]

 Forwarding is sufficient to resolve RAW data dependences


 Unfortunately, there are cases when forwarding is not possible
 due to pipeline design and instruction latencies
 The lw instruction does not finish reading data until the end of the

Memory stage
 its result cannot be forwarded to the Execute stage of the next
instruction
Stalling

[Timing diagram, cycles 1-9: lw $s0, 40($0) followed by and $t0, $s0, $s1; the and repeats its Decode stage (a stall) while a bubble is inserted, so that $s0 can be forwarded from the Memory stage of lw; or and sub are delayed by one cycle as well]
Hardware Needed for Stalling
 Stalls are supported by
 adding enable inputs (EN) to the Fetch and Decode
pipeline registers
 and a synchronous reset/clear (CLR) input to the
Execute pipeline register
 or an INV bit associated with each pipeline register,
indicating that contents are INValid

 When a lw stall occurs


 StallD and StallF are asserted to force the Decode and
Fetch stage pipeline registers to hold their old values.
 FlushE is also asserted to clear the contents of the
Execute stage pipeline register, introducing a bubble
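
The condition the Hazard Unit evaluates here can be written as a small behavioral model. This is a minimal Python sketch, not hardware; the signal names (rsD, rtD, rtE, MemtoRegE, StallF, StallD, FlushE) follow the H&H pipeline, while the function wrapper and the example register numbers are my own:

def lw_stall_signals(rsD, rtD, rtE, MemtoRegE):
    """Load-use hazard: the instruction in Decode needs a register that the
    lw currently in Execute will produce only at the end of the Memory stage."""
    lwstall = ((rsD == rtE) or (rtD == rtE)) and MemtoRegE
    # On a stall: hold Fetch and Decode, and insert a bubble into Execute.
    return {"StallF": lwstall, "StallD": lwstall, "FlushE": lwstall}

# Example: lw $s0, 40($0) in Execute (rtE = $s0 = reg 16, MemtoRegE = 1)
# and the dependent and $t0, $s0, $s1 in Decode (rsD = reg 16, rtD = reg 17)
print(lw_stall_signals(rsD=16, rtD=17, rtE=16, MemtoRegE=True))
# -> {'StallF': True, 'StallD': True, 'FlushE': True}

The same condition appears as the lwstall term in the hazard-unit logic shown a few slides later.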
Stalling and Dependence Detection Hardware

[Figure: Pipelined MIPS datapath with a Hazard Unit that takes RsD, RtD, RtE, MemtoRegE, RegWriteM/W, etc. and generates StallF, StallD, FlushE, ForwardAE, and ForwardBE; enable (EN) inputs on the Fetch and Decode pipeline registers and a clear (CLR) input on the Execute pipeline register implement stalling and bubble insertion]
A Special Case of Data Dependence
 Control dependence
 Data dependence on the Instruction Pointer / Program Counter

30
Control Dependence
 Question: What should the fetch PC be in the next
cycle?
 Answer: The address of the next instruction
 All instructions are control dependent on previous ones.
Why?

 If the fetched instruction is a non-control-flow


instruction:
 Next Fetch PC is the address of the next-sequential
instruction
 Easy to determine if we know the size of the fetched
instruction

 If the instruction that is fetched is a control-flow


instruction: 31
Carnegie Mellon

Control Dependences
 Special case of data dependence: dependence on PC
 beq:
 branch is not resolved until the fourth stage of the pipeline
 Instructions after the branch are fetched before branch is resolved
Always predict that the next sequential instruction is fetched
 Called “Always not taken” prediction
 These instructions must be flushed if the branch is taken

 Branch misprediction penalty


 number of instructions flushed when branch is taken
 May be reduced by resolving the branch earlier

32
Carnegie Mellon

Control Dependence: Original Pipeline


[Figure: the pipelined MIPS datapath as designed so far: the branch condition (ZeroM) and target are produced in the Memory stage (PCSrcM selects PCBranchM), so three instructions after a taken branch are already in the pipeline when the branch resolves]

33
Carnegie Mellon

Control Dependence
[Timing diagram, cycles 1-9: beq $t1, $t2, 40 at address 20 resolves in the Memory stage; the three instructions fetched after it (and, or, sub at 24, 28, 2C) must be flushed when the branch is taken, and fetch continues at the branch target (slt $t3, $s2, $s3 at 64)]

34
Carnegie Mellon

Early Branch Resolution


[Figure: Pipelined MIPS datapath with early branch resolution: an equality comparator (EqualD) and a branch-target adder (PCBranchD) are moved into the Decode stage, so PCSrcD selects the branch target at the end of Decode and only the one instruction in Fetch needs to be flushed (CLR on the Decode pipeline register)]

Introduces another data dependence in Decode stage. 35


Carnegie Mellon

Early Branch Resolution


[Timing diagram, cycles 1-9: with the branch resolved in Decode, only the single instruction fetched after beq $t1, $t2, 40 (and at address 24) must be flushed before fetch continues at the target (slt $t3, $s2, $s3 at 64)]

36
Carnegie Mellon

Early Branch Resolution: Good Idea?


 Advantages
 Reduced branch misprediction penalty
 Reduced CPI (cycles per instruction)

 Disadvantages
 Potential increase in clock cycle time?
 Higher clock period and lower frequency?
 Additional hardware cost
 Specialized and likely not used by other instructions

37
Carnegie Mellon

Data Forwarding for Early Branch Resolution


[Figure: Pipelined MIPS datapath with early branch resolution plus extra forwarding paths into the Decode stage (ForwardAD, ForwardBD from ALUOutM) and extended Hazard Unit inputs (BranchD, RegWriteE, MemtoRegM) to stall the branch when its source operands are not yet available]

Data forwarding for early branch resolution. 38


Carnegie Mellon

Forwarding and Stalling Hardware Control


// Forwarding logic:
assign ForwardAD = (rsD != 0) & (rsD == WriteRegM) & RegWriteM;
assign ForwardBD = (rtD != 0) & (rtD == WriteRegM) & RegWriteM;

// Stalling logic:
assign lwstall = ((rsD == rtE) | (rtD == rtE)) & MemtoRegE;

assign branchstall = (BranchD & RegWriteE &
                      (WriteRegE == rsD | WriteRegE == rtD))
                     |
                     (BranchD & MemtoRegM &
                      (WriteRegM == rsD | WriteRegM == rtD));

// Stall signals:
assign StallF = lwstall | branchstall;
assign StallD = lwstall | branchstall;
assign FlushE = lwstall | branchstall;

39
Carnegie Mellon

Final Pipelined MIPS Processor (H&H)

[Figure: complete H&H pipelined MIPS processor datapath] 40
Includes data dependence detection, early branch resolution, forwarding, and stall logic
Carnegie Mellon

Doing Better: Smarter Branch Prediction


 Guess whether branch will be taken
 Backward branches are usually taken (loops)
 Consider history of whether branch was previously taken to
improve the guess

 Good prediction reduces the fraction of branches


requiring a flush

41
More on Branch Prediction (I)

https://www.youtube.com/onurmutlulectures 42
More on Branch Prediction (II)

https://www.youtube.com/onurmutlulectures 43
More on Branch Prediction (III)

https://www.youtube.com/onurmutlulectures 44
Lectures on Branch Prediction
 Digital Design & Computer Architecture, Spring 2020,
Lecture 16b
 Branch Prediction I (ETH Zurich, Spring 2020)
 https://www.youtube.com/watch?v=h6l9yYSyZHM&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=22

 Digital Design & Computer Architecture, Spring 2020,


Lecture 17
 Branch Prediction II (ETH Zurich, Spring 2020)
 https://www.youtube.com/watch?v=z77VpggShvg&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=23

 Computer Architecture, Spring 2015, Lecture 5


 Advanced Branch Prediction (CMU, Spring 2015)
 https://www.youtube.com/watch?v=yDjsr-jTOtk&list=PL5PHm2jkkXmgVhh8CHAu9N76TShJqfYDt&index=4

https://www.youtube.com/onurmutlulectures 45
Pipelined Performance
Example

46
Carnegie Mellon

Pipelined Performance Example


 SPECINT2017 benchmark:
 25% loads
 10% stores
 11% branches
 2% jumps
 52% R-type

 Suppose:
 40% of loads used by next instruction
 25% of branches mispredicted

 All jumps flush next instruction


 What is the average CPI?
47
Carnegie Mellon

Pipelined Performance Example Solution


 Load/Branch CPI = 1 when no stall/flush, 2 when stall/flush.
Thus:
 CPIlw = 1(0.6) + 2(0.4) = 1.4 Average CPI for load
 CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch

 And
 Average CPI =

48
Carnegie Mellon

Pipelined Performance Example Solution


 Load/Branch CPI = 1 when no stall/flush, 2 when stall/flush.
Thus:
 CPIlw = 1(0.6) + 2(0.4) = 1.4 Average CPI for load
 CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch

 And
 Average CPI = (0.25)(1.4)   [load]
             + (0.1)(1)      [store]
             + (0.11)(1.25)  [beq]
             + (0.02)(2)     [jump]
             + (0.52)(1)     [R-type]
             = 1.15

49
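
As a quick cross-check of the weighted sum above, here is a minimal Python sketch (my own, using the instruction mix and per-class CPIs assumed on this slide):

# Instruction mix from the example (fraction of all instructions)
mix = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "rtype": 0.52}

# Per-class CPI: 40% of loads stall, 25% of branches flush,
# every jump flushes the next instruction, everything else takes 1 cycle
cpi = {
    "load":   1 * 0.60 + 2 * 0.40,   # 1.4
    "store":  1.0,
    "branch": 1 * 0.75 + 2 * 0.25,   # 1.25
    "jump":   2.0,
    "rtype":  1.0,
}

avg_cpi = sum(mix[c] * cpi[c] for c in mix)
print(f"Average CPI = {avg_cpi:.2f}")   # -> Average CPI = 1.15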
Carnegie Mellon

Pipelined Performance
 There are 5 stages, and 5 different timing paths:

Tc = max {
  tpcq + tmem + tsetup                                  (Fetch)
  2(tRFread + tmux + teq + tAND + tmux + tsetup)        (Decode)
  tpcq + tmux + tmux + tALU + tsetup                    (Execute)
  tpcq + tmemwrite + tsetup                             (Memory)
  2(tpcq + tmux + tRFwrite)                             (Writeback)
}

 The operation speed depends on the slowest operation
 Decode and Writeback use the register file and have only half a clock cycle to complete, hence the factor of 2
50
Carnegie Mellon

Pipelined Performance Example


Element Parameter Delay (ps)
Register clock-to-Q tpcq_PC 30
Register setup tsetup 20
Multiplexer tmux 25
ALU tALU 200
Memory read tmem 250
Register file read tRFread 150
Register file setup tRFsetup 20
Equality comparator teq 40
AND gate tAND 15
Memory write tmemwrite 220
Register file write tRFwrite 100

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup)
   = 2[150 + 25 + 40 + 15 + 25 + 20] ps
   = 550 ps
51
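
The same critical-path comparison can be scripted to see at a glance which stage limits the clock. A minimal Python sketch (my own), using the delay values from the table above and the stage expressions from the Tc formula on the previous slide:

# Element delays in picoseconds, from the table above
d = dict(tpcq=30, tsetup=20, tmux=25, tALU=200, tmem=250,
         tRFread=150, teq=40, tAND=15, tmemwrite=220, tRFwrite=100)

# One timing path per pipeline stage; Decode and Writeback get only
# half a clock cycle (register file read/write), so their delays are doubled
paths = {
    "fetch":     d["tpcq"] + d["tmem"] + d["tsetup"],
    "decode":    2 * (d["tRFread"] + d["tmux"] + d["teq"] + d["tAND"] + d["tmux"] + d["tsetup"]),
    "execute":   d["tpcq"] + 2 * d["tmux"] + d["tALU"] + d["tsetup"],
    "memory":    d["tpcq"] + d["tmemwrite"] + d["tsetup"],
    "writeback": 2 * (d["tpcq"] + d["tmux"] + d["tRFwrite"]),
}

Tc = max(paths.values())          # the slowest stage sets the cycle time
print(paths)                      # decode (550 ps) is the longest path
print(f"Tc = {Tc} ps")            # -> Tc = 550 ps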
Carnegie Mellon

Pipelined Performance Example


 For a program with 100 billion instructions executing on a
pipelined MIPS processor:
 CPI = 1.15
 Tc = 550 ps

 Execution Time = (# instructions) × CPI × Tc
                = (100 × 10^9)(1.15)(550 × 10^-12)
                = 63 seconds

52
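
Plugging the numbers in (a quick sanity check in Python, under the same assumptions as the slide):

num_instructions = 100e9   # 100 billion instructions
cpi = 1.15                 # average CPI from the earlier example
Tc = 550e-12               # 550 ps clock period, in seconds

exec_time = num_instructions * cpi * Tc
print(f"Execution time = {exec_time:.2f} s")   # -> 63.25 s, i.e., about 63 seconds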
Carnegie Mellon

Performance Summary for MIPS arch.


Execution Time Speedup
Processor (seconds) (single-cycle is baseline)
Single-cycle 95 1
Multicycle 133 0.71
Pipelined 63 1.51

 Fastest of the three MIPS architectures is Pipelined.


 However, even though it is a 5-stage pipeline, it is not
5 times faster than the single-cycle design.

53
Recall: How to Handle Data
Dependences
Anti and output dependences are easier to handle
 write to the destination only in last stage and in
program order

 Flow dependences are more interesting

 Six fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent
instruction
 Detect and eliminate the dependence at the software
level
 No need for the hardware to detect dependence
 Detect and move it out of the way for independent
instructions
54

Recall: How to Handle Data
Dependences
Anti and output dependences are easier to handle
 write to the destination only in last stage and in
program order

 Flow dependences are more interesting

 Six fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent
instruction
 Detect and eliminate the dependence at the software
level
 No need for the hardware to detect dependence
 Detect and move it out of the way for independent
instructions
55

Questions to Ponder
 What is the role of the hardware vs. the software in
data dependence handling?
 Software based interlocking
 Hardware based interlocking
 Who inserts/manages the pipeline bubbles?
 Who finds the independent instructions to fill “empty”
pipeline slots?
 What are the advantages/disadvantages of each?
 Think of the performance equation as well

56
Questions to Ponder
 What is the role of the hardware vs. the software in
the order in which instructions are executed in the
pipeline?
 Software based instruction scheduling  static
scheduling
 Hardware based instruction scheduling  dynamic
scheduling

 How does each impact different metrics?


 Performance (and parts of the performance equation)
 Complexity
 Power consumption
 Reliability
 …
57
More on Software vs. Hardware
 Software based scheduling of instructions  static
scheduling
 Compiler orders the instructions, hardware executes
them in that order
 Contrast this with dynamic scheduling (in which
hardware can execute instructions out of the compiler-
specified order)
 How does the compiler know the latency of each
instruction?

 What information does the compiler not know that


makes static scheduling difficult?
 Answer: Anything that is determined at run time
 Variable-length operation latency, memory addr, branch
direction

 58
More on Static Instruction
Scheduling

https://www.youtube.com/onurmutlulectures 59
Lectures on Static Instruction
Scheduling
 Computer Architecture, Spring 2015, Lecture 16
 Static Instruction Scheduling (CMU, Spring 2015)
 https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=18

 Computer Architecture, Spring 2013, Lecture 21


 Static Instruction Scheduling (CMU, Spring 2013)
 https://www.youtube.com/watch?v=XdDUn2WtkRg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=21

https://www.youtube.com/onurmutlulectures 60
Recall: How to Handle Data
Dependences
Anti and output dependences are easier to handle
 write to the destination only in last stage and in
program order

 Flow dependences are more interesting

 Six fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent
instruction
 Detect and eliminate the dependence at the software
level
 No need for the hardware to detect dependence
 Detect and move it out of the way for independent
instructions
61

Fine-Grained Multithreading

62
Fine-Grained Multithreading
 Idea: Hardware has multiple thread contexts
(PC+registers). Each cycle, fetch engine fetches from
a different thread.
 By the time the fetched branch/instruction resolves, no
instruction is fetched from the same thread
 Branch/instruction resolution latency overlapped with
execution of other threads’ instructions

+ No logic needed for handling control and data dependences within a thread
-- Single thread performance suffers
-- Extra logic for keeping thread contexts
-- Does not overlap latency if not enough
threads to cover the whole pipeline
63
Fine-Grained Multithreading (II)
 Idea: Switch to another thread every cycle such that
no two instructions from a thread are in the pipeline
concurrently

 Tolerates the control and data dependence latencies


by overlapping the latency with useful work from
other threads
 Improves pipeline utilization by taking advantage of
multiple threads

 Thornton, “Parallel Operation in the Control Data 6600,”


AFIPS 1964.
 Smith, “A pipelined, shared resource MIMD computer,”
ICPP 1978.
64
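
To make the round-robin idea concrete, here is a minimal Python sketch (my own illustration, not from the lecture) of a fine-grained multithreaded fetch stage: each cycle it fetches from a different thread context, so with at least as many threads as pipeline stages, no two instructions from the same thread are in flight at once:

from collections import deque

class ThreadContext:
    def __init__(self, tid, start_pc):
        self.tid = tid
        self.pc = start_pc              # per-thread PC (part of the hardware thread context)

def fgmt_fetch_trace(threads, n_cycles):
    """Round-robin fetch: pick a different thread every cycle."""
    ready = deque(threads)
    trace = []
    for cycle in range(n_cycles):
        t = ready.popleft()             # thread selected this cycle
        trace.append((cycle, t.tid, t.pc))
        t.pc += 4                       # next sequential instruction of that thread
        ready.append(t)                 # re-fetched only after all other threads
    return trace

threads = [ThreadContext(tid, start_pc=0x1000 * (tid + 1)) for tid in range(5)]
for cycle, tid, pc in fgmt_fetch_trace(threads, 10):
    print(f"cycle {cycle}: fetch thread {tid} @ {pc:#x}")

With five threads and a five-stage pipeline, a thread's next instruction enters Fetch only after its previous one has left the pipeline, so no intra-thread dependence checking is needed (at the cost of single-thread performance).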
Fine-Grained Multithreading:
History
CDC 6600’s peripheral processing unit is fine-grained
multithreaded
 Thornton, “Parallel Operation in the Control Data 6600,” AFIPS
1964.
 Processor executes a different I/O thread every cycle
 An operation from the same thread is executed every 10
cycles

 Denelcor HEP (Heterogeneous Element Processor)


 Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
 120 threads/processor
 available queue vs. unavailable (waiting) queue for threads
 each thread can have only 1 instruction in the processor pipeline;
each thread independent
 to each thread, processor looks like a non-pipelined machine
 system throughput vs. single thread performance tradeoff
65
Fine-Grained Multithreading in
HEP
Cycle time: 100ns

 8 stages  800 ns to
complete an
instruction
 assuming no
memory access

 No control and data


dependence
checking
Burton Smith
(1941-2018)

66
Multithreaded Pipeline Example

Slide credit: Joel Emer 67


Sun Niagara Multithreaded
Pipeline

Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
68
Fine-Grained Multithreading
 Advantages
+ No need for dependence checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions
from different threads
+ Improved system throughput, latency tolerance, utilization

 Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs,
register files, …), thread selection logic
- Reduced single thread performance (one instruction fetched
every N cycles from the same thread)
- Resource contention between threads in caches and memory
- Some dependence checking logic between threads remains
(load/store) 69
Modern GPUs are
FGMT Machines

70
NVIDIA GeForce GTX 285 "core"

[Figure: one GTX 285 "core": a shared instruction stream decoder controls 8 data-parallel (SIMD) multiply-add functional units, alongside 64 KB of execution context storage for thread contexts (registers)]

71
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285 "core"

[Figure: the same "core", highlighting the 64 KB of storage for thread contexts (registers)]

 Groups of 32 threads share an instruction stream (each group is a Warp): they execute the same instruction on different data
 Up to 32 warps are interleaved in an FGMT manner
 Up to 1024 thread contexts can be stored

72
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285

[Figure: the full GTX 285 chip: an array of 30 such cores with shared texture (Tex) units]

30 cores on the GTX 285: 30,720 threads

73
Slide credit: Kayvon Fatahalian
Further Reading for the
Interested (I)

Burton Smith
(1941-2018)

74
Further Reading for the
Interested (II)

75
Digital Design &
Computer Arch.
Lecture 14: Pipelined
Processor Design
Prof. Onur Mutlu

ETH Zürich
Spring 2021
22 April 2021
We did not cover the following slides. They are for your benefit. We will cover them in future lectures.

77
Pipelining and Precise Exceptions: Preserving Sequential Semantics
Multi-Cycle Execution
 Not all instructions take the same amount of time
for “execution”
 Idea: Have multiple different functional units that
take different number of cycles
 Can be pipelined or not pipelined
 Can let independent instructions start execution on a
different functional unit before a previous long-latency
instruction finishes execution
[Figure: after Fetch and Decode, instructions are issued to functional units with different latencies: integer add (1 E cycle), integer mul (4 E cycles), FP mul (8 E cycles), load/store (multiple E cycles, possibly variable)]

79
Issues in Pipelining: Multi-Cycle
Execute
 Instructions can take different number of cycles in

EXECUTE stage
 Integer ADD versus FP MULtiply

FMUL R4  R1, R2 F D E E E E E E E E W
ADD R3  R1, R2 F D E W
F D E W
F D E W

FMUL R2  R5, R6 F D E E E E E E E E W
ADD R7  R5, R6 F D E W
F D E W

 What is wrong with this picture in a Von Neumann


architecture?
 Sequential semantics of the ISA NOT preserved!
 What if FMUL incurs an exception?
80
Exceptions vs. Interrupts
 Cause
 Exceptions: internal to the running thread
 Interrupts: external to the running thread

 When to Handle
 Exceptions: when detected (and known to be non-
speculative)
 Interrupts: when convenient
 Except for very high priority ones
 Power failure
 Machine check (error)

 Priority: process (exception), depends (interrupt)

 Handling Context: process (exception), system (interrupt) 81


Precise Exceptions/Interrupts
 The architectural state should be consistent (precise) when the exception/interrupt is ready to be handled

1. All previous instructions should be completely retired.

2. No later instruction should be retired.

 Retire = commit = finish execution and update arch. state

82
Checking for and Handling Exceptions
in Pipelining
 When the oldest instruction ready-to-be-retired is
detected to have caused an exception, the control
logic

 Ensures architectural state is precise (register file, PC,


memory)

 Flushes all younger instructions in the pipeline

 Saves PC and registers (as specified by the ISA)

 Redirects the fetch engine to the appropriate exception


handling routine
83
Why Do We Want Precise
Exceptions?
Semantics of the von Neumann model ISA specifies
it
 Remember von Neumann vs. Dataflow

 Aids software debugging

 Enables (easy) recovery from exceptions

 Enables (easily) restartable processes

 Enables traps into software (e.g., software


implemented opcodes)

84
Ensuring Precise Exceptions in
Pipelining
Idea: Make each operation take the same amount of
time
FMUL R3  R1, R2 F D E E E E E E E E W
ADD R4  R1, R2 F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W

 Downside
 Worst-case instruction latency determines all
instructions’ latency
 What about memory operations?
 Each functional unit takes worst-case number of cycles?
85
Solutions
 Reorder buffer

 History buffer

 Future register file We will not cover these

 Checkpointing

 Suggested reading
 Smith and Pleszkun, "Implementing Precise Interrupts in
Pipelined Processors," IEEE Trans on Computers 1988 and
ISCA 1985.

86
Recall: Solution I: Reorder Buffer (ROB)
 Idea: Complete instructions out-of-order, but reorder them before making results visible to architectural state
 When instruction is decoded, it reserves the next-sequential entry in the ROB
 When instruction completes, it writes its result into the ROB entry
 When instruction is the oldest in the ROB and has completed without exceptions, its result is moved to the register file or memory

[Figure: instruction cache and register file feed multiple functional units; results are written to the reorder buffer, which updates the register file in program order]

87
Reorder Buffer
 Buffers information about all instructions that are
decoded but not yet retired/committed

88
What’s in a ROB Entry?
 V | DestRegID | DestRegVal | StoreAddr | StoreData | PC | Exception?
 (plus valid bits for reg/data and other control bits)

 Everything required to:


 correctly reorder instructions back into the program order
 update the architectural state with the instruction’s
result(s), if instruction can retire without any issues
 handle an exception/interrupt precisely, if an
exception/interrupt needs to be handled before retiring
the instruction

 Need valid bits to keep track of readiness of the


result(s) and find out if the instruction has completed
execution 89
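
For illustration only (the field names mirror the entry sketched above; the retire helper is a simplified model added here, not the lecture's design), a ROB entry and the in-order retirement check could look like this in Python:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    valid: bool = False                 # entry allocated?
    done: bool = False                  # instruction finished execution?
    pc: int = 0
    dest_reg: Optional[int] = None      # destination register ID (None if no register result)
    dest_val: Optional[int] = None      # result value, written at completion
    store_addr: Optional[int] = None    # for stores
    store_data: Optional[int] = None
    exception: bool = False             # did this instruction cause an exception?

def try_retire(rob, head, regfile, memory):
    """Retire the oldest instruction if it has completed; return the new head index."""
    entry = rob[head]
    if not (entry.valid and entry.done):
        return head                                   # oldest not done yet: wait
    if entry.exception:
        raise RuntimeError(f"precise exception at PC {entry.pc:#x}")  # flush, run handler
    if entry.dest_reg is not None:
        regfile[entry.dest_reg] = entry.dest_val      # update architectural register state
    if entry.store_addr is not None:
        memory[entry.store_addr] = entry.store_data   # stores update memory at retirement
    entry.valid = False
    return (head + 1) % len(rob)                      # advance circular-buffer head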
Reorder Buffer: Independent
Operations
Result first written to ROB on instruction completion
 Result written to register file at commit time

F D E E E E E E E E R W
F D E R W
F D E R W
F D E R W
F D E E E E E E E E R W
F D E R W
F D E R W

 What if a later instruction needs a value in the reorder


buffer?
 One option: stall the operation  stall the pipeline
 Better: Read the value from the reorder buffer. How?
90
Reorder Buffer: How to Access?
 A register value can be in the register file, the reorder buffer, (or bypass/forwarding paths)

[Figure: the register file is a random access memory, indexed with the Register ID (the ID is the address of an entry); the reorder buffer is a content addressable memory, searched with the Register ID (the ID is part of the content of an entry); bypass paths feed functional-unit outputs back to their inputs]

91
Simplifying Reorder Buffer
Access
Idea: Use indirection

 Access register file first (check if the register is valid)


 If register not valid, register file stores the ID of the
reorder buffer entry that contains (or will contain) the
value of the register
 Mapping of the register to a ROB entry: Register file
maps the register to a reorder buffer entry if there is an
in-flight instruction writing to the register

 Access reorder buffer next

 Now, reorder buffer does not need to be content


addressable
92
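
A minimal Python sketch of this indirection (my own simplification, reusing the ROBEntry fields from the sketch a few slides back): the register file stores, for each architectural register, either a valid value or the ID of the ROB entry that will produce it, so operand lookup never needs a content-addressable search:

class RegFileEntry:
    def __init__(self, value=0):
        self.valid = True          # True: the register file holds the latest value
        self.value = value
        self.rob_id = None         # otherwise: the ROB entry that will produce it

def rename_dest(regfile, reg_id, rob_id):
    """An in-flight instruction will write reg_id: map the register to its ROB entry."""
    regfile[reg_id].valid = False
    regfile[reg_id].rob_id = rob_id        # architectural register ID -> ROB entry ID

def read_operand(regfile, rob, reg_id):
    """Register file first; if invalid, follow the pointer into the reorder buffer."""
    entry = regfile[reg_id]
    if entry.valid:
        return ("value", entry.value)              # value is in the register file
    rob_entry = rob[entry.rob_id]
    if rob_entry.done:
        return ("value", rob_entry.dest_val)       # already computed, still in the ROB
    return ("tag", entry.rob_id)                   # not ready: wait on this ROB tag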
Reorder Buffer in Intel Pentium
III

Boggs et al., “The


Microarchitecture of the
Pentium 4 Processor,” Intel
Technology Journal, 2001.

93
Important: Register Renaming with a Reorder Buffer
 Output and anti dependences are not true dependences
 WHY? The same register refers to values that have nothing to do with each other
 They exist due to lack of register IDs (i.e., names) in the ISA

 The register ID is renamed to the reorder buffer entry that will hold the register's value
 Register ID → ROB entry ID
 Architectural register ID → Physical register ID
 After renaming, the ROB entry ID is used to refer to the register

 This eliminates anti and output dependences 94


Recall: Data Dependence Types
 True (flow) dependence
 r3 ← r1 op r2    Read-after-Write (RAW) -- True
 r5 ← r3 op r4

 Anti dependence
 r3 ← r1 op r2    Write-after-Read (WAR) -- Anti
 r1 ← r4 op r5

 Output dependence
 r3 ← r1 op r2    Write-after-Write (WAW) -- Output
 r5 ← r3 op r4
 r3 ← r6 op r7 95
In-Order Pipeline with Reorder Buffer
 Decode (D): Access regfile/ROB, allocate entry in ROB, check if instruction can execute, if so dispatch instruction
 Execute (E): Instructions can complete out-of-order
 Completion (R): Write result to reorder buffer
 Retirement/Commit (W): Check for exceptions; if none, write result to architectural register file or memory; else, flush pipeline and start from exception handler
 In-order dispatch/execution, out-of-order completion, in-order retirement

[Figure: F and D feed functional units of different latencies (integer add, integer mul, FP mul, load/store); all results pass through R (completion into the ROB) and then W (in-order retirement)]

97
Reorder Buffer Tradeoffs
 Advantages
 Conceptually simple for supporting precise exceptions
 Can eliminate false dependences

 Disadvantages
 Reorder buffer needs to be accessed to get the results
that are yet to be written to the register file
 CAM or indirection → increased latency and complexity

 Other solutions aim to eliminate the disadvantages


 History buffer
 Future file We will not cover these
 Checkpointing

98
