
Digital Design & Computer Arch.

Lecture 13: Pipelined Processor Design

Prof. Onur Mutlu

ETH Zürich
Spring 2023
6 April 2023
Agenda for Today & Next Few Lectures
 Prior weeks: Microarchitecture Fundamentals
 Single-cycle Microarchitectures
 Multi-cycle Microarchitectures

 Last week & today: Pipelining
 Pipelining
 Pipelined Processor Design
 Control & Data Dependence Handling
 Precise Exceptions: State Maintenance & Recovery

 After Easter Break: Out-of-Order Execution
 Out-of-Order Execution
 Issues in OoO Execution: Load-Store Handling, …

[Sidebar figure: levels of transformation; Problem, Algorithm, Program/Language, System Software, SW/HW Interface, Micro-architecture, Logic, Devices, Electrons]
2
Readings
 This week
 Pipelining
 H&H, Chapter 7.5
 Pipelining Issues
 H&H, Chapter 7.7, 7.8.1-7.8.3

 This week & after Easter Break


 Out-of-order execution
 H&H, Chapter 7.8-7.9
 Smith & Sohi, “The Microarchitecture of Superscalar Processors,”
Proceedings of the IEEE, 1995
 More advanced pipelining
 Interrupt and exception handling
 Out-of-order and superscalar execution concepts
3
Review: Pipelined Datapath & Control

[Figure: pipelined MIPS datapath and control; the single-cycle datapath split by pipeline registers into Fetch, Decode, Execute, Memory, and Writeback stages, with control signals (RegWrite, MemtoReg, MemWrite, Branch, ALUControl, ALUSrc, RegDst, PCSrc) carried along and suffixed D/E/M/W]

 Same control unit as single-cycle processor
 Control delayed to proper pipeline stage
4
Review: Causes of Pipeline Stalls
 Stall: A condition when the pipeline stops moving

 Resource contention

 Dependences (between instructions)


 Data
 Control

 Long-latency (multi-cycle) operations

5
Review: Data Dependence Types
Flow dependence
r3  r1 op r2 Read-after-Write
r5  r3 op r4 (RAW)
Anti dependence
r3  r1 op r2 Write-after-Read
r1  r4 op r5 (WAR)

Output dependence
r3  r1 op r2 Write-after-Write
r5  r3 op r4 (WAW)
r3  r6 op r7
6
Review: How to Handle Data Dependences
 Anti and output dependences are easier to handle
 write to the destination only in last stage and in program order

 Flow dependences are more interesting & challenging

 Six fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent instruction
 Detect and eliminate the dependence at the software level
 No need for the hardware to detect dependence
 Detect and move it out of the way for independent instructions
 Predict the needed value(s), execute “speculatively”, and verify
 Do something else (fine-grained multithreading)
 No need to detect
7
Review: Pipeline Stall: Resolving Data Dependence

[Pipeline timing diagram, cycles t0-t5: Insth, Insti, Instj, Instk, Instl flow through IF, ID, ALU, MEM, WB; when instruction j reads a register rx written by instruction i, j waits in ID, inserting bubbles: 3 bubbles for dist(i,j)=1, 2 bubbles for dist(i,j)=2, 1 bubble for dist(i,j)=3, and none for dist(i,j)=4]

i: rx  _
j: _  rx

Stall = make the dependent instruction wait until its source data value is available
1. stop all up-stream stages
2. drain all down-stream stages
8
Data Dependence Handling:
Concepts and Implementation

9
How to Implement Stalling
[Figure: 5-stage MIPS pipeline datapath with IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers; the control unit's signals (RegWrite, Branch, MemWrite, ALUSrc, MemtoReg, MemRead, ALUOp, RegDst) travel down the pipeline with each instruction]

 Stall
 disable PC and IF/ID latching; ensure stalled instruction stays in its stage
 Insert “invalid” instructions/nops into the stage following the stalled one (called “bubbles”)
10
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
RAW Data Dependence Example
 One instruction writes a register ($s0) and the next instructions read this register => read after write (RAW) dependence.
 add writes into $s0 in the first half of cycle 5
 and reads $s0 in cycle 3, obtaining the wrong value
 or reads $s0 in cycle 4, again obtaining the wrong value
 sub reads $s0 in the 2nd half of cycle 5, getting the correct value
 subsequent instructions read the correct value of $s0

Wrong results happen only if the pipeline handles data dependences incorrectly!

[Pipeline diagram, cycles 1-8: add $s0, $s2, $s3; and $t0, $s0, $s1; or $t1, $s4, $s0; sub $t2, $s0, $s5; each instruction fetches (IM), reads registers (RF), executes, accesses data memory (DM), and writes back (RF)]
Compile-Time Detection and Elimination
[Pipeline diagram, cycles 1-10: add $s0, $s2, $s3; nop; nop; and $t0, $s0, $s1; or $t1, $s4, $s0; sub $t2, $s0, $s5; the two nops delay and until add has written $s0]

 Insert enough independent instructions for the required result to be ready by the time it is needed by a dependent one
 Reorder/reschedule/insert instructions at the compiler level
Data Forwarding
 Also called Data Bypassing

 Forward the result value to the dependent instruction as soon as the value is available

 We have already seen the basic idea before
 Remember dataflow?
 Data value is supplied to dependent instruction as soon as it is available
 Instruction executes when all its operands are available

 Data forwarding brings a pipeline closer to dataflow execution principles
Data Forwarding: Locations in Datapath

[Pipeline diagram, cycles 1-8: add's result ($s0) forwarded to and, or, and sub as soon as it is available]

 From latched output of ALU to input of ALU
 From WB to input of ALU
 From WB to RF (internal in Register File)
Data Forwarding: Datapath & Control
[Figure: pipelined datapath with forwarding; muxes at both ALU inputs select among the register file value, ALUOutM (latched ALU output, Memory stage), and ResultW (Writeback); the mux selects ForwardAE and ForwardBE come from a Hazard/Dependence Detection Unit that compares RsE/RtE against WriteRegM/WriteRegW using RegWriteM/RegWriteW; forwarding paths: from latched output of ALU to input of ALU, from WB to input of ALU, and from the Register File]
Data Forwarding: Implementation
 Forward to Execute stage from either:
 Memory stage or
 Writeback stage

 When should we forward from either Memory or Writeback stage?
 If that stage will write to a destination register and the destination register matches the source register
 If both the Memory & Writeback stages contain matching destination registers, the Memory stage has priority to forward its data, because it contains the more recently executed instruction
Data Forwarding (in Pseudocode)
 Forward to Execute stage from either:
 Memory stage or
 Writeback stage

 Forwarding logic for ForwardAE (pseudocode):

if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then
    ForwardAE = 10  # forward from Memory stage
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then
    ForwardAE = 01  # forward from Writeback stage
else
    ForwardAE = 00  # no forwarding

 Forwarding logic for ForwardBE same, but replace rsE with rtE
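The same logic can be written in Verilog, in the style of the stalling code shown later in this lecture; the signal names and the 2-bit mux encoding are taken from the pseudocode above, and this is an illustrative sketch rather than H&H's verbatim code:

// Minimal Verilog sketch of the ForwardAE mux select (illustrative).
assign ForwardAE = ((rsE != 0) & (rsE == WriteRegM) & RegWriteM) ? 2'b10 : // from Memory stage
                   ((rsE != 0) & (rsE == WriteRegW) & RegWriteW) ? 2'b01 : // from Writeback stage
                                                                   2'b00;  // no forwarding
// ForwardBE is the same with rsE replaced by rtE.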
Data Forwarding Is Not Always Possible
[Pipeline diagram, cycles 1-8: lw $s0, 40($0) followed by and $t0, $s0, $s1, or $t1, $s4, $s0, sub $t2, $s0, $s5; Trouble! and needs $s0 in Execute before lw has read it from memory]

 Forwarding is usually sufficient to resolve RAW data dependences
 Unfortunately, there are cases when forwarding is not possible
 due to pipeline design and instruction latencies
 The lw instruction does not finish reading data until the end of Memory stage
 its result cannot be forwarded to the Execute stage of the next instruction unless we want a long critical path  breaks critical path design principle
Stalling Necessary for MEM-EX Dependence

[Pipeline diagram, cycles 1-9: lw $s0, 40($0); and $t0, $s0, $s1 stalls one cycle in Decode and then receives $s0 forwarded from lw; or $t1, $s4, $s0 and sub $t2, $s0, $s5 follow, shifted by the stall]
Stalling and Dependence Detection Hardware
[Figure: pipelined datapath with both forwarding and stalling hardware; the Hazard/Dependence Detection Unit takes MemtoRegE, RegWriteM, RegWriteW, RsD/RtD, and RsE/RtE, and produces ForwardAE/ForwardBE, plus StallF and StallD (enable inputs, EN, on the PC and Fetch/Decode pipeline registers) and FlushE (clear input, CLR, on the Execute pipeline register)]
Hardware Needed for Stalling
 Stalls are supported by adding
 enable inputs (EN) to the Fetch and Decode pipeline registers
 synchronous reset/clear (CLR) input to the Execute pipeline register
 or an INV bit associated with each pipeline register, indicating that its contents are INValid

 When a lw stall occurs
 Keep the values in the Decode and Fetch stage pipeline registers
 StallD and StallF are asserted
 Clear the contents of the Execute stage register, introducing a bubble
 FlushE is also asserted
(a minimal sketch of such a pipeline register follows this list)
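As a minimal sketch (assuming active-high stall/flush signals as in the slides; this is not H&H's exact code), a pipeline register supporting both mechanisms could look like this:

// Illustrative pipeline register with stall (enable) and flush (clear).
// en would be driven with the complement of StallF/StallD; clr with FlushE.
module pipe_reg #(parameter W = 32) (
  input              clk,
  input              en,   // advance when 1; hold current value (stall) when 0
  input              clr,  // synchronous clear: insert a bubble
  input      [W-1:0] d,
  output reg [W-1:0] q
);
  always @(posedge clk)
    if (clr)     q <= {W{1'b0}}; // bubble: zeroed (invalid/nop) contents
    else if (en) q <= d;         // normal operation: latch next stage's inputs
                                 // otherwise hold q: stalled instruction stays put
endmodule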
A Special Case of Data Dependence
 Control dependence
 Data dependence on the Instruction Pointer / Program Counter

22
Control Dependence
 Question: What should the fetch PC be in the next cycle?
 Answer: The address of the next instruction
 All instructions are control dependent on previous ones. Why?

 If the fetched instruction is a non-control-flow instruction:


 Next Fetch PC is the address of the next-sequential instruction
 Easy to determine if we know the size of the fetched instruction

 If the instruction that is fetched is a control-flow instruction:


 How do we determine the next Fetch PC?

 In fact, how do we know whether or not the fetched


instruction is a control-flow instruction?
23

Branch Prediction
 Special case of data dependence: dependence on PC

 beq:
 Conditional branch is not resolved until the fourth stage of the pipeline
 Instructions after the branch are fetched before branch is resolved
 Simple “branch prediction” example:
 Always predict that the next sequential instruction is fetched
 Called “Always not taken” prediction
 Flush (invalidate) “not-taken path” instructions if the branch is taken

 Branch misprediction penalty


 number of instructions flushed when branch is incorrectly predicted
 Penalty can be reduced by resolving the branch earlier
 Called “Early branch resolution”
24
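As a hedged sketch of the flush mechanism, assuming early branch resolution in Decode (introduced on the following slides): PCSrcD = BranchD & EqualD appears in the later datapath figures, while FlushD is an assumed name for the clear input of the Fetch/Decode pipeline register.

// "Always not taken": fetch proceeds sequentially; when the branch
// resolves taken in Decode, squash the one wrong-path instruction.
assign PCSrcD = BranchD & EqualD; // branch condition resolved in Decode
assign FlushD = PCSrcD;           // clear the Fetch/Decode pipeline register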

Our Pipeline So Far


[Figure: the pipelined datapath so far, with forwarding (ForwardAE/ForwardBE), stalling (StallF, StallD, FlushE), and the Dependence Detection (Hazard Unit) logic]

25

Control Dependence: Flush on Misprediction


[Pipeline diagram, cycles 1-9: beq $t1, $t2, 40 at address 20 is resolved (taken) in the Memory stage; the three instructions already fetched at 24 (and $t0, $s0, $s1), 28 (or $t1, $s4, $s0), and 2C (sub $t2, $s0, $s5) are flushed, and fetch continues at 64 with slt $t3, $s2, $s3; 3 instructions flushed]

26

Pipeline with Early Branch Resolution

[Figure: pipelined datapath with the branch comparator and branch target adder moved to the Decode stage, plus Dependence Detection Logic]

Need to calculate branch target and condition in the Decode Stage
27
Early Branch Resolution


[Pipeline diagram, cycles 1-9: beq $t1, $t2, 40 at address 20 resolves in Decode; only the one instruction already fetched at 24 (and $t0, $s0, $s1) is flushed, and fetch continues at 64 with slt $t3, $s2, $s3; 1 instruction flushed]

28

Early Branch Resolution: Good Idea?


 Advantages
 Reduced branch misprediction penalty
 Reduced CPI (cycles per instruction)

 Disadvantages
 Potential increase in clock cycle time?
 Longer clock period and thus lower frequency?
 Additional hardware cost
 Specialized and likely not used by other instructions

29
Recall: Performance Analysis Basics
 Execution time of a single instruction
 {CPI} x {clock cycle time}
 CPI: Number of cycles it takes to execute an instruction

 Execution time of an entire program


 Sum over all instructions [{CPI} x {clock cycle time}]
 {# of instructions} x {Average CPI} x {clock cycle time}

30

Data Forwarding for Early Branch Resolution


[Figure: pipelined datapath with early branch resolution; an equality comparator in Decode produces EqualD, and PCSrcD = BranchD & EqualD; forwarding muxes ForwardAD/ForwardBD feed the comparator from the Memory stage; the Hazard/Dependence Detection Unit additionally takes BranchD, RegWriteE, and MemtoRegM to generate branch stalls]

Data forwarding for early branch resolution adds even more complexity 31

Forwarding and Stalling Hardware Control


// Forwarding logic:
assign ForwardAD = (rsD != 0) & (rsD == WriteRegM) & RegWriteM;
assign ForwardBD = (rtD != 0) & (rtD == WriteRegM) & RegWriteM;

// Stalling logic:
assign lwstall = ((rsD == rtE) | (rtD == rtE)) & MemtoRegE;

assign branchstall = (BranchD & RegWriteE &
                      (WriteRegE == rsD | WriteRegE == rtD))
                   | (BranchD & MemtoRegM &
                      (WriteRegM == rsD | WriteRegM == rtD));

// Stall signals:
assign StallF = lwstall | branchstall;
assign StallD = lwstall | branchstall;
assign FlushE = lwstall | branchstall;

32

Final Pipelined MIPS Processor (H&H)

[Figure: the complete pipelined MIPS datapath with Dependence Detection Logic]

Includes always-not-taken br prediction, early branch resolution, forwarding, stall logic

33

Doing Better: Smarter Branch Prediction


 Guess whether or not branch will be taken
 Backward branches are usually taken (loops iterate many times)
 History of whether branch was previously taken can improve the guess

 Accurate branch prediction reduces the fraction of branches requiring a flush

 Many sophisticated techniques are employed in modern processors
 Including simple machine learning methods (perceptrons)
 We will see them in Branch Prediction lectures

34
More on Branch Prediction (I)

https://www.youtube.com/watch?v=h6l9yYSyZHM&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=22
More on Branch Prediction (II)

https://www.youtube.com/watch?v=z77VpggShvg&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=23
More on Branch Prediction (III)

https://www.youtube.com/watch?v=yDjsr-jTOtk&list=PL5PHm2jkkXmgVhh8CHAu9N76TShJqfYDt&index=4
Lectures on Branch Prediction
 Digital Design & Computer Architecture, Spring 2020, Lecture 16b
 Branch Prediction I (ETH Zurich, Spring 2020)
 https://www.youtube.com/watch?v=h6l9yYSyZHM&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=22

 Digital Design & Computer Architecture, Spring 2020, Lecture 17
 Branch Prediction II (ETH Zurich, Spring 2020)
 https://www.youtube.com/watch?v=z77VpggShvg&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=23

 Computer Architecture, Spring 2015, Lecture 5
 Advanced Branch Prediction (CMU, Spring 2015)
 https://www.youtube.com/watch?v=yDjsr-jTOtk&list=PL5PHm2jkkXmgVhh8CHAu9N76TShJqfYDt&index=4

https://www.youtube.com/onurmutlulectures 38
Pipelined Performance Example

39

Pipelined Performance Example


 An important program consists of:
 25% loads
 10% stores
 11% branches
 2% jumps
 52% R-type

 Assume:
 40% of loads used by next instruction
 25% of branches mispredicted; flushes the next instruction on misprediction
 All jumps flush the next instruction fetched

 What is the average CPI?


40


Pipelined Performance Example: CPI


 Load/Branch CPI = 1 when no stall/flush, 2 when stall/flush. Thus:
 CPIlw = 1(0.6) + 2(0.4) = 1.4 (average CPI for loads)
 CPIbeq = 1(0.75) + 2(0.25) = 1.25 (average CPI for branches)

 Average CPI = (0.25)(1.4)   load
             + (0.10)(1)     store
             + (0.11)(1.25)  beq
             + (0.02)(2)     jump
             + (0.52)(1)     r-type
             = 1.15

42

Pipelined Performance Example: Cycle Time


 There are 5 stages, and 5 different timing paths:

Tc = max {
  tpcq + tmem + tsetup                             (fetch)
  2(tRFread + tmux + teq + tAND + tmux + tsetup)   (decode)
  tpcq + tmux + tmux + tALU + tsetup               (execute)
  tpcq + tmemwrite + tsetup                        (memory)
  2(tpcq + tmux + tRFwrite)                        (writeback)
}

 The clock cycle depends on the slowest stage

 Decode and Writeback use the register file and have only half a clock cycle to complete  that is why there is a 2 in front of them
43

Final Pipelined MIPS Processor (H&H)

[Figure: the complete pipelined MIPS datapath with Dependence Detection Logic]

Includes always-not-taken br prediction, early branch resolution, forwarding, stall logic

44

Pipelined Performance Example: Cycle Time


Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     30
Register setup        tsetup      20
Multiplexer           tmux        25
ALU                   tALU        200
Memory read           tmem        250
Register file read    tRFread     150
Register file setup   tRFsetup    20
Equality comparator   teq         40
AND gate              tAND        15
Memory write          tmemwrite   220
Register file write   tRFwrite    100

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup)
   = 2[150 + 25 + 40 + 15 + 25 + 20] ps
   = 550 ps
45

Pipelined Performance Example: Exec Time


 For a program with 100 billion instructions executing on a pipelined MIPS processor:
 CPI = 1.15
 Tc = 550 ps

 Execution Time = (# instructions) × CPI × Tc
                = (100 × 10^9)(1.15)(550 × 10^-12)
                = 63 seconds

46

Performance Summary for 3 MIPS microarch.


Processor      Execution Time (seconds)   Performance Relative to Single-Cycle
Single-cycle   95                         1
Multicycle     133                        0.71
Pipelined      63                         1.51

 Pipelined implementation is the fastest of the 3 implementations

 Even though we have a 5-stage pipeline, speedup is not 5X over the single-cycle or the multi-cycle system!

47
Recall: How to Handle Data Dependences
 Anti and output dependences are easier to handle
 write to the destination only in last stage and in program order

 Flow dependences are more interesting

 Six fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent instruction
 Detect and eliminate the dependence at the software level
 No need for the hardware to detect dependence
 Detect and move it out of the way for independent instructions
 Predict the needed value(s), execute “speculatively”, and verify
 Do something else (fine-grained multithreading)
 No need to detect
48
Question to Ponder: Hardware vs. Software
 What is the role of the hardware vs. the software in data
dependence handling?
 Software based interlocking
 Hardware based interlocking
 Who inserts/manages the pipeline bubbles?
 Who finds the independent instructions to fill “empty” pipeline
slots?
 What are the advantages/disadvantages of each?
 Think of the performance equation as well

50
Question to Ponder: Hardware vs. Software
 What is the role of the hardware vs. the software in the
order in which instructions are executed in the pipeline?
 Software based instruction scheduling  static scheduling
 Hardware based instruction scheduling  dynamic scheduling

 How does each impact different metrics?


 Performance (and parts of the performance equation)
 Complexity
 Power consumption
 Reliability
 Cost
 …

51
More on Software vs. Hardware
 Software based scheduling of instructions  static scheduling
 Hardware executes the instructions in the compiler-dictated order
 Contrast this with dynamic scheduling: hardware can execute
instructions out of the compiler-specified order
 How does the compiler know the latency of each instruction?

 What information does the compiler not know that makes static scheduling difficult?
 Answer: Anything that is determined at run time
 Variable-length operation latency, memory address, branch direction

 How can the compiler alleviate this (i.e., estimate the unknown)?
 Answer: Profiling (done statically or dynamically)
52
More on Static Instruction Scheduling

https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=18
Lectures on Static Instruction Scheduling

 Computer Architecture, Spring 2015, Lecture 16
 Static Instruction Scheduling (CMU, Spring 2015)
 https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=18

 Computer Architecture, Spring 2013, Lecture 21
 Static Instruction Scheduling (CMU, Spring 2013)
 https://www.youtube.com/watch?v=XdDUn2WtkRg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=21

https://www.youtube.com/onurmutlulectures 54
Recall: Semantic Gap
 How close instructions & data types & addressing modes are to high-level language (HLL)

[Figure: two design points between HLL and HW control signals.
Small semantic gap: ISA with complex instructions, data types & addressing modes; easier mapping of HLL to ISA; less work for the software designer, more work for the hardware designer; optimization burden on HW.
Large semantic gap: ISA with simple instructions, data types & addressing modes; harder mapping of HLL to ISA; more work for the software designer, less work for the hardware designer; optimization burden on SW.]
Recall: How to Change the Semantic Gap Tradeoffs
 Translate from one ISA into a different “implementation” ISA

[Figure: HLL compiled to a small-semantic-gap ISA (x86-64, with complex instructions, data types & addressing modes); a software or hardware translator maps it to an implementation ISA (ARM v8.4, with simple instructions, data types & addressing modes) driving HW control signals]

SW, translator, HW can all perform operation re-ordering 56


An Example: Rosetta 2 Binary Translator

https://en.wikipedia.org/wiki/Rosetta_(software)#Rosetta_2 57
An Example: Rosetta 2 Binary Translator

Apple M1,
2021

Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested 58
Another Example: NVIDIA Denver

https://www.anandtech.com/show/8701/the-google-nexus-9-review/4 https://www.toradex.com/computer-on-modules/apalis-arm-family/nvidia-tegra-k1 59
https://safari.ethz.ch/digitaltechnik/spring2021/lib/exe/fetch.php?media=boggs_ieeemicro_2015.pdf
More on NVIDIA Denver Code Optimizer

60
https://safari.ethz.ch/digitaltechnik/spring2021/lib/exe/fetch.php?media=boggs_ieeemicro_2015.pdf
Transmeta: x86 to VLIW Translation

[Figure: x86 code translated in software to a proprietary VLIW ISA]

Klaiber, “The Technology Behind Crusoe Processors,” Transmeta White Paper 2000.
https://www.wikiwand.com/en/Transmeta_Efficeon 61
https://classes.engineering.wustl.edu/cse362/images/c/c7/Paper_aklaiber_19jan00.pdf
Recall: How to Handle Data Dependences
 Anti and output dependences are easier to handle
 write to the destination only in last stage and in program order

 Flow dependences are more interesting

 Six fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent instruction
 Detect and eliminate the dependence at the software level
 No need for the hardware to detect dependence
 Detect and move it out of the way for independent instructions
 Predict the needed value(s), execute “speculatively”, and verify
 Do something else (fine-grained multithreading)
 No need to detect
64
Fine-Grained Multithreading

65
Fine-Grained Multithreading
 Idea: Fetch from a different thread every cycle such that no
two instructions from a thread are in the pipeline concurrently
 Hardware has multiple thread contexts (PC+registers per thread)
 Threads are completely independent
 No instruction is fetched from the same thread until the prior
branch/instruction from the thread completes

+ No logic needed for handling control and data dependences within a thread
+ High thread-level throughput
-- Single thread performance suffers
-- Extra logic for keeping thread contexts
-- Throughput loss when there are not enough threads to keep the pipeline full

Each pipeline stage has an instruction from a different, completely-independent thread
Fine-Grained Multithreading: Basic Idea
[Figure: the 5-stage pipelined MIPS datapath]

Each pipeline stage has an instruction from a different, completely-independent thread

We need a PC and a register file for each thread + muxes and control
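As a toy SystemVerilog sketch (not from the lecture) of the extra fetch-stage control: a counter rotates over thread IDs and a mux picks the selected thread's PC. NUM_THREADS, pc_thread, and all other names are assumptions.

// Round-robin thread selection for FGMT fetch: a different thread is
// fetched every cycle, so no two instructions from the same thread are
// in the pipeline concurrently (given NUM_THREADS >= pipeline depth).
module fgmt_fetch_select #(parameter NUM_THREADS = 4, parameter TID_W = 2) (
  input  logic              clk,
  input  logic              reset,
  input  logic [31:0]       pc_thread [NUM_THREADS], // one PC per hardware thread
  output logic [31:0]       fetch_pc,                // PC sent to instruction memory
  output logic [TID_W-1:0]  tid                      // thread selected this cycle
);
  always_ff @(posedge clk)
    if (reset)                     tid <= '0;
    else if (tid == NUM_THREADS-1) tid <= '0;        // wrap around
    else                           tid <= tid + 1'b1;

  assign fetch_pc = pc_thread[tid]; // per-thread PC mux
endmodule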
Fine-Grained Multithreading (II)
 Idea: Fetch from a different thread every cycle such that no
two instructions from a thread are in the pipeline concurrently

 Tolerates control and data dependence resolution latencies by overlapping the latency with useful work from other threads
 Improves pipeline utilization by taking advantage of multiple threads
 Improves thread-level throughput but sacrifices per-thread throughput & latency

 Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.


 Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
68
Fine-Grained Multithreading: History
 CDC 6600’s peripheral processing unit is fine-grained
multithreaded
 Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.
 Processor executes a different I/O thread every cycle
 An operation from the same thread is executed every 10 cycles

 Denelcor HEP (Heterogeneous Element Processor)


 Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
 120 threads/processor
 Available queue vs. unavailable (waiting) queue for threads
 Each thread can have only 1 instruction in the processor pipeline
 Each thread independent
 To each thread, processor looks like a non-pipelined machine
 System throughput vs. single thread performance tradeoff

69
Fine-Grained Multithreading in HEP
 Cycle time: 100ns

 8 stages  800 ns to complete an instruction
 assuming no memory access

 No control and data dependence checking

[Photo: Burton Smith (1941-2018)]

70
Multithreaded Pipeline Example

Slide credit: Joel Emer 71


Sun Niagara Multithreaded Pipeline

Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
72
Fine-Grained Multithreading
 Advantages
+ No need for dependence checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from
different threads
+ Improved system throughput, latency tolerance, pipeline utilization

 Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs, register
files, …), thread selection logic
- Reduced single thread performance (one instruction fetched every N
cycles from the same thread)
- Resource contention between threads in caches and memory
- Dependence checking logic between threads may be needed (load/store)
73
Modern GPUs are
FGMT Machines

74
NVIDIA GeForce GTX 285 “core”

[Figure: one GTX 285 core; 64 KB of storage for thread contexts (registers); data-parallel (SIMD) functional units with instruction stream decode/control shared across 8 units; multiply-add and multiply units; execution context storage]

Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285 “core”

[Figure: the same core, highlighting the 64 KB of storage for thread contexts (registers)]

 Groups of 32 threads share an instruction stream (each group is a Warp): they execute the same instruction on different data
 Up to 32 warps are interleaved in an FGMT manner
 Up to 1024 thread contexts can be stored
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285 (~2009)

[Figure: the full GTX 285 chip, an array of cores with texture (Tex) units]

30 cores on the GTX 285: 30,720 threads

Slide credit: Kayvon Fatahalian
Further Reading for the Interested (I)

Burton Smith
(1941-2018)

78
Further Reading for the Interested (II)

79
More on Multithreading (I)

https://www.youtube.com/watch?v=iqi9wFqFiNU&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=51
More on Multithreading (II)

https://www.youtube.com/watch?v=e8lfl6MbILg&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=52
More on Multithreading (III)

https://www.youtube.com/watch?v=7vkDpZ1-hHM&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=53
More on Multithreading (IV)

https://www.youtube.com/watch?v=-hbmzIDe0sA&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=54
Lectures on Multithreading
 Parallel Computer Architecture, Fall 2012, Lecture 9
 Multithreading I (CMU, Fall 2012)
 https://www.youtube.com/watch?v=iqi9wFqFiNU&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=51
 Parallel Computer Architecture, Fall 2012, Lecture 10
 Multithreading II (CMU, Fall 2012)
 https://www.youtube.com/watch?v=e8lfl6MbILg&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=52
 Parallel Computer Architecture, Fall 2012, Lecture 13
 Multithreading III (CMU, Fall 2012)
 https://www.youtube.com/watch?v=7vkDpZ1-hHM&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=53
 Parallel Computer Architecture, Fall 2012, Lecture 15
 Speculation I (CMU, Fall 2012)
 https://www.youtube.com/watch?v=-hbmzIDe0sA&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=54

https://www.youtube.com/onurmutlulectures 84
Pipelining and Precise Exceptions:
Preserving Sequential Semantics
Multi-Cycle Execution
 Not all instructions take the same amount of time in the
“execute stage” of the pipeline

 Idea: Have multiple different functional units that take different numbers of cycles
 Can be pipelined or not pipelined
 Can let independent instructions start execution on a different functional unit before a previous long-latency instruction finishes execution
[Figure: after F and D, instructions issue to functional units with different latencies; integer add takes 1 E cycle, integer mul 4, FP mul 8, and load/store 8 or more]

86
Issues in Pipelining: Multi-Cycle Execute
 Instructions can take different number of cycles in EXECUTE
stage
 Integer ADD versus Integer DIVide
DIV R4  R1, R2 F D E E E E E E E E W   (exception-causing instruction)
ADD R3  R1, R2 F D E W
F D E W
F D E W

DIV R2  R5, R6 F D E E E E E E E E W
ADD R7  R5, R6 F D E W
F D E W

 What is wrong with this picture in a Von Neumann architecture?


 Sequential semantics of the ISA NOT preserved!
 What if DIV incurs an exception? (e.g., DIV by zero)
87
An Example Exception

[Photos, slides 88-93: a real-life analogy shown in class; an exception-causing “instruction” appears at 12:55, delaying another “instruction”; snapshots at 12:57, 12:58, and 13:00; another view; the exception is handled & resolved by 13:06]
Exceptions and Interrupts
 “Unplanned” changes or interruptions in program execution

 Due to internal problems in execution of the program


 Exceptions

 Due to external events that need to be handled by the


processor
 Interrupts

 Both exceptions and interrupts require


 stopping of the current program
 saving the architectural state
 handling the exception/interrupt  switch to handler
 (if possible and makes sense) returning back to program execution
94
We Covered
Until This Point
in Lecture

95
Exceptions and Interrupts: Examples
 Exception examples
 Divide by zero
 Overflow
 Undefined opcode
 General protection (or access protection)
 Page fault
 …

 Interrupt examples
 I/O device needing service (e.g., keyboard input, video input)
 (Periodic) system timer expiration
 Power failure
 Machine check
 …
97
Exceptions vs. Interrupts
 Cause
 Exceptions: internal to the running thread
 Interrupts: external to the running thread

 When to Handle
 Exceptions: when detected (and known to be non-speculative)
 Interrupts: when convenient
 Except for very high priority ones
 Power failure
 Machine check (error)

 Priority: process (exception), depends (interrupt)

 Handling Context: process (exception), system (interrupt)


98
Precise Exceptions/Interrupts
 The architectural state should be consistent (precise)
when the exception/interrupt is ready to be handled

1. All previous instructions should be completely retired

2. No later instruction should be retired

Retire = commit = finish execution and update arch. state

DIV R4  R1, R2
ADD R3  R1, R2
   precise state here (clean separation of sequential instructions)
DIV R2  R5, R6
ADD R7  R5, R6

99
Checking for and Handling Exceptions in Pipelining

 When the oldest instruction ready-to-be-retired is detected to have caused an exception, the control logic

 Ensures architectural state is precise (register file, PC, memory)

 Flushes all younger instructions in the pipeline

 Saves PC and registers (as specified by the ISA)

 Redirects the fetch engine to the appropriate exception handling routine
100
Aside: From the x86-64 ISA Manual

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html 101
Why Do We Want Precise Exceptions?
 Semantics of the von Neumann model ISA specifies it
 Remember von Neumann vs. Dataflow

 Aids software debugging

 Enables (easy) recovery from exceptions

 Enables (easily) restartable processes

 Enables traps into software (e.g., software implemented opcodes)

102
Ensuring Precise Exceptions
 Easy to do in single-cycle and multi-cycle machines

 Single-cycle
 Instruction boundaries == Cycle boundaries

 Multi-cycle
 Add special states in the control FSM that lead to the
exception or interrupt handlers
 Switch to the handler only at a precise state
 before fetching the next instruction

See H&H Section 7.7 for a treatment of exceptions in multi-cycle microarchitecture 103
Precise Exceptions in Multi-Cycle Datapath
EPC register: Holds the exception-causing PC
Cause register: Holds the cause of the exception
Exception Handler starts at address 0x80000180

[Figure: multi-cycle MIPS datapath extended with EPC and Cause registers]

See H&H Section 7.7 for a treatment of exceptions in multi-cycle microarchitecture 104
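In Verilog, the bookkeeping described above might be sketched as follows; the signal names are assumptions, and the multi-cycle FSM would assert exception from its dedicated exception states:

// Hedged sketch: on an exception, save the offending PC and the cause,
// then jump to the fixed exception handler address.
always @(posedge clk)
  if (exception) begin
    EPC   <= PC;             // exception-causing PC, readable by the handler
    Cause <= cause_code;     // e.g., overflow or undefined instruction
    PC    <= 32'h8000_0180;  // exception handler entry point
  end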
Precise Exceptions in Multi-Cycle FSM
 Supports
 Overflow
 Undefined
instruction

 mfc0 instruction
is used to copy
the exception
cause into a
general-purpose
register

See H&H Section 7.7 for a treatment of exceptions in multi-cycle microarchitecture 105
Precise Exceptions in Multi-Cycle Datapath

[Figure: the multi-cycle datapath with full exception support]

106
Multi-Cycle Execute: More Complications
 Instructions can take different number of cycles in EXECUTE
stage
 Integer ADD versus Integer DIVide

DIV R4  R1, R2 F D E E E E E E E E W
ADD R3  R1, R2 F D E W
F D E W
F D E W

DIV R2  R5, R6 F D E E E E E E E E W
ADD R7  R5, R6 F D E W
F D E W

 What is wrong with this picture in a Von Neumann architecture?


 Sequential semantics of the ISA NOT preserved!
 What if DIV incurs an exception? (e.g., DIV by zero)
107
Ensuring Precise Exceptions in Pipelining
 Idea: Make each operation take the same amount of time

DIV R3  R1, R2 F D E E E E E E E E W
ADD R4  R1, R2 F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W

 Downside
 Worst-case instruction latency determines all instructions’ latency
 What about memory operations?
 Each functional unit takes worst-case number of cycles?

108
Solutions
 Reorder buffer
 History buffer
 Future register file
 Checkpointing
We will not cover these: see the suggested lecture video from Spring 2015 and the backup slides

 Suggested reading
 Smith and Plezskun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans on Computers 1988 and ISCA 1985.

109
Solution I: Reorder Buffer (ROB)
 Idea: Complete instructions out-of-order, but reorder them
before making results visible to architectural state
 When instruction is decoded, it reserves the next-sequential
entry in the ROB
 When instruction completes, it writes result into ROB entry
 When instruction oldest in ROB and it has completed
without exceptions, its result moved to reg. file or memory

[Figure: the instruction cache and register file feed multiple functional units; results are written into the Reorder Buffer and later moved, in order, to the register file or memory]

ROB is implemented as a circular queue in hardware 110


Reorder Buffer
 A hardware structure that keeps information about all instructions that are decoded but not yet retired/committed

[Figure: the ROB as a circular queue of entries 0-15; one pointer marks the oldest instruction (the ROB entry with information about the oldest instruction in the machine), another the youngest; each entry holds: Entry Valid?, Dest reg ID, Dest reg value, Dest reg written?]

ROB is implemented as a circular queue in hardware 111
What’s in a ROB Entry?
V | DestRegID | DestRegVal | StoreAddr | StoreData | PC | Exception? | … (valid bits for reg/data + control bits)

 Everything required to:


 correctly reorder instructions back into the program order
 update the architectural state with the instruction’s result(s), if
instruction can retire without any issues
 handle an exception/interrupt precisely, if an
exception/interrupt needs to be handled before retiring the
instruction

 Need valid bits to keep track of readiness of the result(s) and find out if the instruction has completed execution
112
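A minimal SystemVerilog sketch of such an entry and of in-order retirement from the head of the circular queue; the 16-entry size, field widths, and all names are illustrative assumptions, not the lecture's exact design:

// Illustrative ROB entry format and in-order retirement logic.
typedef struct packed {
  logic        valid;        // entry allocated (instruction in flight)?
  logic [4:0]  dest_reg_id;  // architectural destination register
  logic [31:0] dest_reg_val; // result value, once computed
  logic        dest_written; // valid bit: has the result arrived?
  logic [31:0] pc;           // needed to restart at a precise point
  logic        exception;    // did this instruction cause an exception?
} rob_entry_t;

rob_entry_t rob [16];        // ROB as a circular queue
logic [3:0] head;            // oldest instruction: retire from here
logic [3:0] tail;            // next free entry: allocate here at decode

// Retire: when the oldest entry has completed without an exception,
// move its result to the architectural register file, in program order
// (register-file write port and full/empty checks omitted for brevity).
always_ff @(posedge clk)
  if (rob[head].valid && rob[head].dest_written && !rob[head].exception) begin
    rob[head].valid <= 1'b0;
    head            <= head + 4'd1; // 4-bit pointer wraps at 16 naturally
  end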
Reorder Buffer: Independent Operations
 Result first written to ROB on instruction completion
 Result written to register file at commit time

F D E E E E E E E E R W
F D E R W
F D E R W
F D E R W
F D E E E E E E E E R W
F D E R W
F D E R W

 What if a later instruction needs a value in the reorder buffer?


 One option: stall the operation  stall the pipeline
 Better: Read the value from the reorder buffer. How?

113
Reorder Buffer: How to Access?
 A register value can be in the register file, reorder buffer, or bypass/forwarding paths

[Figure: the register file is a random access memory, indexed with the register ID (the address of an entry); the Reorder Buffer is a content addressable memory, searched with the register ID (which is part of the content of an entry); bypass paths also supply values directly]

114
Reorder Buffer Example
[Figure: Register File R0-R7, each entry with Value and Valid? fields; Reorder Buffer entries 0-15, each with Entry Valid?, Dest reg ID, Dest reg value, Dest reg written? fields; pointers mark the oldest and youngest instructions]

Initially: all registers are valid in RF & ROB is empty

Simulate:
MUL R1, R2  R3
MUL R3, R4  R11
ADD R5, R6  R3
ADD R3, R8  R12
115
Simplifying Reorder Buffer Access
 Idea: Use indirection

 Access register file first (check if the register is valid)
 If register not valid, the register file stores the ID of the reorder buffer entry that contains (or will contain) the value of the register
 Mapping of the register to a ROB entry: the register file maps the register to a reorder buffer entry if there is an in-flight instruction writing to the register

 Access reorder buffer next
 Now, the reorder buffer does not need to be content addressable (a minimal lookup sketch follows)

116
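A minimal SystemVerilog sketch of this two-step lookup; all names, widths, and the 16-entry ROB are illustrative assumptions:

// Operand lookup with register-file-to-ROB indirection: each RF entry
// either holds the value (valid) or a tag naming the ROB entry that
// contains (or will contain) it.
logic        rf_valid    [32]; // is the architectural value in the RF?
logic [31:0] rf_value    [32];
logic [3:0]  rf_tag      [32]; // else: ROB entry that produces the value
logic        rob_written [16]; // has that ROB entry's result arrived?
logic [31:0] rob_value   [16];

logic [4:0]  src_reg;          // source register being read
logic [31:0] src_val;
logic        src_ready;

always_comb begin
  src_ready = 1'b1;
  src_val   = '0;
  if (rf_valid[src_reg])                 // case 1: value is in the register file
    src_val = rf_value[src_reg];
  else if (rob_written[rf_tag[src_reg]]) // case 2: computed, still in the ROB
    src_val = rob_value[rf_tag[src_reg]];
  else                                   // case 3: value still in flight
    src_ready = 1'b0;                    // dependent instruction must wait
end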
Reorder Buffer Example
[Figure: Register File R0-R7, each entry now with Value, Valid?, and Tag (pointer to ROB entry) fields; Reorder Buffer entries 0-15 as before]

Initially: all registers are valid in RF & ROB is empty

Simulate:
MUL R1, R2  R3
MUL R3, R4  R11
ADD R5, R6  R3
ADD R3, R8  R12
117
Reorder Buffer in Intel Pentium III/Pro

Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.

A Register Alias Table (RAT) points to where each register’s current value is (or will be)
Intel Pentium Pro (1995)

[Photo: multi-chip module package containing the processor chip and a Level 2 cache chip]

By Moshen - http://en.wikipedia.org/wiki/Image:Pentiumpro_moshen.jpg, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=2262471

119
Important: Register Renaming with a Reorder Buffer
 Output and anti dependences are not true dependences
 WHY? The same register refers to values that have nothing to do with each other
 They exist due to lack of register IDs (i.e., names) in the ISA

 The register ID is renamed to the reorder buffer entry that will hold the register’s value
 Register ID  ROB entry ID
 Architectural register ID  Physical register ID
 After renaming, ROB entry ID used to refer to the register

 This eliminates anti and output dependences
 Gives the illusion that there are a large number of registers

120
Recall: Data Dependence Types
Flow dependence
r3  r1 op r2 Read-after-Write
r5  r3 op r4 (RAW)
Anti dependence
r3  r1 op r2 Write-after-Read
r1  r4 op r5 (WAR)

Output dependence
r3  r1 op r2 Write-after-Write
r5  r3 op r4 (WAW)
r3  r6 op r7
121
Register Renaming Example (On Your Own)
 Assume
 Register file has a pointer to the reorder buffer entry that
contains or will contain the value, if the register is not valid
 Reorder buffer works as described before

 Where is the latest definition of R3 for each instruction below in sequential order?
LD R0(0)  R3
LD R3, R1  R10
MUL R1, R2  R3
MUL R3, R4  R11
ADD R5, R6  R3
ADD R3, R8  R12

122
Reorder Buffer Example
[Figure: Register File R0-R7 with Value, Valid?, and Tag (pointer to ROB entry) fields; Reorder Buffer entries 0-15 with Entry Valid?, Dest reg ID, Dest reg value, Dest reg written? fields]

Initially: all registers are valid in RF & ROB is empty

Simulate:
LD R0(0)  R3
LD R3, R1  R10
MUL R1, R2  R3
MUL R3, R4  R11
ADD R5, R6  R3
ADD R3, R8  R12
123
In-Order Pipeline with Reorder Buffer
 Decode (D): Access regfile/ROB, allocate entry in ROB, check if
instruction can execute, if so dispatch instruction
 Execute (E): Instructions can complete out-of-order
 Completion (R): Write result to reorder buffer
 Retirement/Commit (W): Check for exceptions; if none, write result to
architectural register file or memory; else, flush pipeline and start from
exception handler
 In-order dispatch/execution, out-of-order completion, in-order retirement
[Figure: F and D feed functional units with different latencies (integer add: 1 E cycle; integer mul: 4; FP mul: 8; load/store: 8 or more); all results then pass through R (write to ROB) and W (retire) in program order]

ROB is implemented as a circular queue in hardware 124
Reorder Buffer Tradeoffs
 Advantages
 Conceptually simple for supporting precise exceptions
 Can eliminate false dependences

 Disadvantages
 Reorder buffer needs to be accessed to get the results that
are yet to be written to the register file
 CAM or indirection  increased latency and complexity

 Other solutions aim to eliminate the disadvantages
 History buffer
 Future file
 Checkpointing
We will not cover these: see the suggested lecture video from Spring 2015 and the backup slides

125
More on State Maintenance & Precise Exceptions

https://www.youtube.com/watch?v=nMfbtzWizDA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=13
More on State Maintenance & Precise Exceptions

https://www.youtube.com/watch?v=upJPVXEuqIQ&list=PL5Q2soXY2Zi-iBn_sw_B63HtdbTNmphLc&index=18
More on State Maintenance & Precise Exceptions

https://www.youtube.com/watch?v=9yo3yhUijQs&list=PL5Q2soXY2Zi8J58xLKBNFQFHRO3GrXxA9&index=17
Lectures on State Maintenance & Recovery
 Computer Architecture, Spring 2015, Lecture 11
 Precise Exceptions, State Maintenance/Recovery (CMU, Spring 2015)
 https://www.youtube.com/watch?v=nMfbtzWizDA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=13

 Digital Design & Computer Architecture, Spring 2019, Lecture 15a
 Reorder Buffer (ETH Zurich, Spring 2019)
 https://www.youtube.com/watch?v=9yo3yhUijQs&list=PL5Q2soXY2Zi8J58xLKBNFQFHRO3GrXxA9&index=17

 Digital Design & Computer Architecture, Spring 2021, Lecture 15a
 Precise Exceptions (ETH Zurich, Spring 2021)
 https://www.youtube.com/watch?v=upJPVXEuqIQ&list=PL5Q2soXY2Zi-iBn_sw_B63HtdbTNmphLc&index=18

https://www.youtube.com/onurmutlulectures 129
Suggested Readings for the Interested
 Smith and Plezskun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans on Computers 1988 and ISCA 1985.

 Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995.

 Hwu and Patt, “Checkpoint Repair for Out-of-order Execution Machines,” ISCA 1987.

 Backup Slides

130
