Arch4 Pipelined Processor Design Afterlecture
ETH Zürich
Spring 2023
6 April 2023
Agenda for Today & Next Few Lectures
Prior weeks: Microarchitecture Fundamentals
Single-cycle Microarchitectures
Multi-cycle Microarchitectures
Last week & today: Pipelining
Pipelining
Pipelined Processor Design
Control & Data Dependence Handling
Precise Exceptions: State Maintenance & Recovery
After Easter Break: Out-of-Order Execution
Out-of-Order Execution
Issues in OoO Execution: Load-Store Handling, …
[Figure: levels of transformation: Problem, Algorithm, Program/Language, System Software, SW/HW Interface, Micro-architecture, Logic, Devices, Electrons]
Readings
This week
Pipelining
H&H, Chapter 7.5
Pipelining Issues
H&H, Chapter 7.7, 7.8.1-7.8.3
[Figure: pipelined datapath (instruction memory, register file, ALU, data memory); annotation: Resource contention]
Review: Data Dependence Types
Flow dependence (Read-after-Write, RAW)
r3 ← r1 op r2
r5 ← r3 op r4
Anti dependence (Write-after-Read, WAR)
r3 ← r1 op r2
r1 ← r4 op r5
Output dependence (Write-after-Write, WAW)
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
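The three dependence types above can be checked mechanically by comparing destination and source registers. The following is a minimal sketch, not part of the lecture; the `Instr` type and the `dependences` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    dst: str               # destination register
    srcs: Tuple[str, ...]  # source registers


def dependences(earlier: Instr, later: Instr) -> Set[str]:
    """Return the dependence types that `later` has on `earlier`."""
    found = set()
    if earlier.dst in later.srcs:
        found.add("RAW")  # flow: later reads what earlier writes
    if later.dst in earlier.srcs:
        found.add("WAR")  # anti: later overwrites what earlier reads
    if later.dst == earlier.dst:
        found.add("WAW")  # output: both write the same register
    return found


# The register pairs from the slide.
flow = dependences(Instr("r3", ("r1", "r2")), Instr("r5", ("r3", "r4")))
anti = dependences(Instr("r3", ("r1", "r2")), Instr("r1", ("r4", "r5")))
output = dependences(Instr("r3", ("r1", "r2")), Instr("r3", ("r6", "r7")))
print(flow, anti, output)
```

Only the flow (RAW) case carries a value from one instruction to the next; the other two arise purely from register name reuse.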
Review: How to Handle Data Dependences
Anti and output dependences are easier to handle
write to the destination only in last stage and in program order
[Figure: pipeline timing diagram (t0 to t5) for Inst h through Inst l in stages IF, ID, ALU, MEM, WB, with instruction j held in ID while it waits]
i: rx ← _
j: _ ← rx, dist(i,j)=1: bubble
j: _ ← rx, dist(i,j)=2: bubble
j: _ ← rx, dist(i,j)=3: bubble
j: _ ← rx, dist(i,j)=4
Stall = make the dependent instruction wait until its source data value is available
1. stop all up-stream stages
2. drain all down-stream stages
Data Dependence Handling:
Concepts and Implementation
How to Implement Stalling
[Figure: five-stage pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and control signals PCSrc, RegWrite, Branch, MemWrite, MemRead, ALUSrc, ALUOp, MemtoReg, RegDst]
Stall
disable PC and IF/ID latching; ensure stalled instruction stays in its stage
Insert “invalid” instructions/nops into the stage following the stalled one (called “bubbles”)
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
RAW Data Dependence Example
One instruction writes a register ($s0) and next instructions read this register => read after write (RAW) dependence.
add writes into $s0 in the first half of cycle 5
and reads $s0 in cycle 3, obtaining the wrong value
or reads $s0 in cycle 4, again obtaining the wrong value
sub reads $s0 in 2nd half of cycle 5, getting the correct value
subsequent instructions read the correct value of $s0
Wrong results happen only if the pipeline handles data dependences incorrectly!
[Figure: pipeline diagram, cycles 1 to 8 (IM, RF, ALU, DM, RF for each instruction)]
add $s0, $s2, $s3
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
Compile-Time Detection and Elimination
[Figure: pipeline diagram, cycles 1 to 10, with two nops inserted after add]
add $s0, $s2, $s3
nop
nop
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
[Figure: pipeline diagram, cycles 1 to 8, for the same add, and, or, sub sequence]
[Figure: pipelined datapath with forwarding multiplexers in the Execute stage (select signals ForwardAE, ForwardBE) driven by the Hazard Unit (dependence detection logic); signals shown include RsE, RtE, WriteRegM, WriteRegW, RegWriteM, RegWriteW]
Data Forwarding: Implementation
Forward to Execute stage from either:
Memory stage or
Writeback stage
Forwarding logic for ForwardBE same, but replace rsE with rtE
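The forwarding decision for one ALU operand can be sketched as follows. This is a minimal illustration in the style of the H&H hazard unit, not the slide's exact logic: it assumes register 0 is hardwired to zero (as in MIPS) and that the mux encodings are 10 for the Memory-stage value, 01 for the Writeback-stage value, and 00 for the register file value. The function name `forward_select` is hypothetical.

```python
def forward_select(src_e: int, write_reg_m: int, reg_write_m: bool,
                   write_reg_w: int, reg_write_w: bool) -> int:
    """Select the source of one ALU operand in the Execute stage.

    src_e is rsE for ForwardAE and rtE for ForwardBE.
    """
    # The Memory stage holds the most recent value, so it has priority.
    if src_e != 0 and src_e == write_reg_m and reg_write_m:
        return 0b10  # forward ALUOutM
    if src_e != 0 and src_e == write_reg_w and reg_write_w:
        return 0b01  # forward ResultW
    return 0b00      # no dependence: use the register file output


# Operand register 16 is being written by the instruction in Memory.
forward_ae = forward_select(16, 16, True, 8, True)
# Operand register 17 is being written only by the instruction in Writeback.
forward_be = forward_select(17, 16, True, 17, True)
print(forward_ae, forward_be)
```

Checking the Memory stage first matters when both later stages write the same register: the younger of the two producers holds the value the program order requires.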
Data Forwarding Is Not Always Possible
[Figure: pipeline diagram, cycles 1 to 8; the value loaded by lw is needed by and before it is available ("Trouble!")]
lw $s0, 40($0)
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
[Figure: pipeline diagram, cycles 1 to 9, for the same sequence with a one-cycle stall: and repeats its register-read stage and or repeats its fetch stage]
Stalling and Dependence Detection Hardware
[Figure: pipelined datapath with stalling support: EN inputs on the Fetch and Decode pipeline registers (StallF, StallD), CLR input on the Execute pipeline register (FlushE), and MemtoRegE as an input to the Hazard Unit (dependence detection logic)]
Hardware Needed for Stalling
Stalls are supported by adding
enable inputs (EN) to the Fetch and Decode pipeline registers
a synchronous reset/clear (CLR) input to the Execute pipeline register
or an INV bit associated with each pipeline register, indicating that its contents are INValid
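The effect of the EN and CLR inputs during a one-cycle stall can be sketched with a toy model. This is an illustration under stated assumptions, not the lecture's hardware: `PipelineRegister` is a hypothetical class, instructions are represented by their names, and a cleared register holds the string "nop" as its bubble.

```python
class PipelineRegister:
    """Pipeline register with enable (EN) and synchronous clear (CLR)."""

    def __init__(self, bubble="nop"):
        self.bubble = bubble
        self.value = bubble

    def clock(self, next_value, en=True, clr=False):
        if clr:
            self.value = self.bubble  # insert a bubble
        elif en:
            self.value = next_value   # normal latching
        # otherwise: hold the current value (stalled)


decode_reg = PipelineRegister()   # holds the instruction in Decode
execute_reg = PipelineRegister()  # holds the instruction in Execute
decode_reg.clock("and")
execute_reg.clock("lw")

# Stall cycle: Decode is disabled (StallD) and Execute is cleared (FlushE).
stall = True
execute_reg.clock(decode_reg.value, clr=stall)
decode_reg.clock("or", en=not stall)
after_stall = (decode_reg.value, execute_reg.value)

# Next cycle: the stall is released and the pipeline advances again.
execute_reg.clock(decode_reg.value)
decode_reg.clock("or")
after_resume = (decode_reg.value, execute_reg.value)
print(after_stall, after_resume)
```

During the stall cycle the dependent instruction stays in Decode and a bubble enters Execute, which matches the "stay in its stage, insert a nop behind it" description of stalling.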
Control Dependence
Question: What should the fetch PC be in the next cycle?
Answer: The address of the next instruction
All instructions are control dependent on previous ones. Why?
Branch Prediction
Special case of data dependence: dependence on PC
beq:
Conditional branch is not resolved until the fourth stage of the pipeline
Instructions after the branch are fetched before branch is resolved
Simple “branch prediction” example:
Always predict that the next sequential instruction is fetched
Called “Always not taken” prediction
Flush (invalidate) “not-taken path” instructions if the branch is taken
[Figure: pipelined datapath with forwarding and stalling hardware (ForwardAE, ForwardBE, StallF, StallD, FlushE); the branch target PCBranchM is produced in the Memory stage]
Carnegie Mellon
[Figure: pipeline diagram for a taken branch; the three instructions fetched after beq are flushed ("Flush these instructions")]
20 beq $t1, $t2, 40
24 and $t0, $s0, $s1
28 or $t1, $s4, $s0
2C sub $t2, $s0, $s5
30 ...
...
64 slt $t3, $s2, $s3
[Figure: pipeline diagram for a taken branch; only the one instruction fetched after beq is flushed ("Flush this instruction")]
20 beq $t1, $t2, 40
24 and $t0, $s0, $s1
30 ...
...
64 slt $t3, $s2, $s3
Disadvantages
Potential increase in clock cycle time?
Higher clock period and lower frequency?
Additional hardware cost
Specialized and likely not used by other instructions
Recall: Performance Analysis Basics
Execution time of a single instruction
{CPI} x {clock cycle time}
CPI: Number of cycles it takes to execute an instruction
[Figure: pipelined datapath with early branch resolution: equality comparator in the Decode stage (EqualD, BranchD, PCSrcD, PCBranchD), forwarding multiplexers ForwardAD and ForwardBD, and the Hazard Unit (dependence detection logic)]
Data forwarding for early branch resolution adds even more complexity
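// Stalling logic:
// lwstall: the instruction in Decode reads the register that a load in Execute will write
assign lwstall = ((rsD == rtE) | (rtD == rtE)) & MemtoRegE;
// branchstall: assumed here to follow the definition in H&H Section 7.5;
// a branch in Decode needs a result still being computed (Execute) or loaded (Memory)
assign branchstall = (BranchD & RegWriteE & ((WriteRegE == rsD) | (WriteRegE == rtD))) |
                     (BranchD & MemtoRegM & ((WriteRegM == rsD) | (WriteRegM == rtD)));
// Stall signals:
assign StallF = lwstall | branchstall;
assign StallD = lwstall | branchstall;
assign FlushE = lwstall | branchstall;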
More on Branch Prediction (I)
https://www.youtube.com/watch?v=h6l9yYSyZHM&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=22
More on Branch Prediction (II)
https://www.youtube.com/watch?v=z77VpggShvg&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=23
More on Branch Prediction (III)
https://www.youtube.com/watch?v=yDjsr-jTOtk&list=PL5PHm2jkkXmgVhh8CHAu9N76TShJqfYDt&index=4
Lectures on Branch Prediction
Digital Design & Computer Architecture, Spring 2020, Lecture 16b
Branch Prediction I (ETH Zurich, Spring 2020)
https://www.youtube.com/watch?v=h6l9yYSyZHM&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=22
https://www.youtube.com/onurmutlulectures
Pipelined Performance Example
Assume:
40% of loads used by next instruction
25% of branches mispredicted; flushes the next instruction on misprediction
All jumps flush the next instruction fetched
Average CPI = (0.25)(1.4)     load
            + (0.1)(1)        store
            + (0.11)(1.25)    beq
            + (0.02)(2)       jump
            + (0.52)(1)       R-type
            = 1.15
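The average CPI above can be reproduced step by step. This sketch takes the instruction mix from the weights in the formula (25% loads, 10% stores, 11% branches, 2% jumps, 52% R-type) and assumes, consistent with the per-class CPIs of 1.4 and 1.25, that a load-use stall or a branch misprediction costs exactly one extra cycle. The variable names are illustrative.

```python
# Instruction mix, read off the weights in the average-CPI formula.
mix = {"load": 0.25, "store": 0.10, "beq": 0.11, "jump": 0.02, "r-type": 0.52}

# Per-class CPI: 1 cycle without a stall, 2 cycles with a one-cycle penalty.
load_cpi = 0.60 * 1 + 0.40 * 2    # 40% of loads are used by the next instruction
branch_cpi = 0.75 * 1 + 0.25 * 2  # 25% of branches are mispredicted
cpi = {"load": load_cpi, "store": 1, "beq": branch_cpi, "jump": 2, "r-type": 1}

average_cpi = sum(mix[kind] * cpi[kind] for kind in mix)
print(round(average_cpi, 4))
```

The exact sum is 1.1475, which the slide rounds to 1.15.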
Tc = max {
tpcq + tmem + tsetup (Fetch)
2(tRFread + tmux + teq + tAND + tmux + tsetup) (Decode)
tpcq + tmux + tmux + tALU + tsetup (Execute)
tpcq + tmemwrite + tsetup (Memory)
2(tpcq + tmux + tRFwrite) (Writeback)
}
Decode and Writeback use the register file and have only half a clock cycle to complete, which is why their delays are multiplied by 2.
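To see how the max over stages picks the critical stage, the formula can be evaluated with concrete delays. The delay values below (in picoseconds) are hypothetical placeholders chosen only for illustration; they are not given on the slide, and a real design would use its own timing numbers.

```python
# Hypothetical element delays in picoseconds (illustrative assumptions only).
t = {
    "pcq": 30, "setup": 20, "mux": 25, "ALU": 200, "mem": 250,
    "memwrite": 220, "RFread": 150, "RFwrite": 100, "eq": 40, "AND": 15,
}

# One entry per stage, following the Tc formula term by term.
stage_delay = {
    "fetch": t["pcq"] + t["mem"] + t["setup"],
    "decode": 2 * (t["RFread"] + t["mux"] + t["eq"] + t["AND"] + t["mux"] + t["setup"]),
    "execute": t["pcq"] + t["mux"] + t["mux"] + t["ALU"] + t["setup"],
    "memory": t["pcq"] + t["memwrite"] + t["setup"],
    "writeback": 2 * (t["pcq"] + t["mux"] + t["RFwrite"]),
}

critical_stage = max(stage_delay, key=stage_delay.get)
tc = stage_delay[critical_stage]
print(critical_stage, tc)
```

With these assumed numbers the Decode stage sets the clock period, because the branch comparison added to Decode must fit into half a cycle.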
Recall: How to Handle Data Dependences
Anti and output dependences are easier to handle
write to the destination only in last stage and in program order
Question to Ponder: Hardware vs. Software
What is the role of the hardware vs. the software in the
order in which instructions are executed in the pipeline?
Software based instruction scheduling → static scheduling
Hardware based instruction scheduling → dynamic scheduling
More on Software vs. Hardware
Software based scheduling of instructions → static scheduling
Hardware executes the instructions in the compiler-dictated order
Contrast this with dynamic scheduling: hardware can execute instructions out of the compiler-specified order
How does the compiler know the latency of each instruction?
https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=18
Lectures on Static Instruction Scheduling
https://www.youtube.com/onurmutlulectures
Recall: Semantic Gap
How close instructions & data types & addressing modes
are to high-level language (HLL)
[Figure: two stacks, each from HLL at the top down to HW control signals at the bottom]
Small semantic gap: HLL, then an ISA with complex instructions & data types & addressing modes, then HW control signals
Large semantic gap: HLL, then an ISA with simple instructions & data types & addressing modes, then HW control signals
[Figure: HLL, small semantic gap, x86-64 ISA with complex instructions & data types & addressing modes, followed by a software or hardware translator]
https://en.wikipedia.org/wiki/Rosetta_(software)#Rosetta_2
An Example: Rosetta 2 Binary Translator
Apple M1, 2021
Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested
Another Example: NVIDIA Denver
https://www.anandtech.com/show/8701/the-google-nexus-9-review/4
https://www.toradex.com/computer-on-modules/apalis-arm-family/nvidia-tegra-k1
https://safari.ethz.ch/digitaltechnik/spring2021/lib/exe/fetch.php?media=boggs_ieeemicro_2015.pdf
More on NVIDIA Denver Code Optimizer
https://safari.ethz.ch/digitaltechnik/spring2021/lib/exe/fetch.php?media=boggs_ieeemicro_2015.pdf
Transmeta: x86 to VLIW Translation
[Figure: Transmeta x86 translation diagram]
Klaiber, “The Technology Behind Crusoe Processors,” Transmeta White Paper 2000.
https://www.wikiwand.com/en/Transmeta_Efficeon
https://classes.engineering.wustl.edu/cse362/images/c/c7/Paper_aklaiber_19jan00.pdf
More on Static Instruction Scheduling
https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=18
Lectures on Static Instruction Scheduling
https://www.youtube.com/onurmutlulectures
Recall: How to Handle Data Dependences
Anti and output dependences are easier to handle
write to the destination only in last stage and in program order
Fine-Grained Multithreading
Idea: Fetch from a different thread every cycle such that no
two instructions from a thread are in the pipeline concurrently
Hardware has multiple thread contexts (PC+registers per thread)
Threads are completely independent
No instruction is fetched from the same thread until the prior
branch/instruction from the thread completes
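The condition "no two instructions from a thread are in the pipeline concurrently" can be checked with a small simulation. This is a sketch under assumptions not stated on the slide: thread selection is plain round-robin, every instruction spends exactly one cycle per stage, and no thread ever stalls. The function name `simulate_fgmt` is hypothetical.

```python
from collections import deque


def simulate_fgmt(num_threads: int, pipeline_depth: int, cycles: int) -> int:
    """Fetch round-robin from num_threads threads into a pipeline.

    Returns the largest number of instructions from any single thread
    that were in the pipeline at the same time.
    """
    # Each slot holds the thread id of the instruction in that stage.
    pipeline = deque([None] * pipeline_depth, maxlen=pipeline_depth)
    worst = 0
    for cycle in range(cycles):
        # Fetch from the next thread; the oldest instruction leaves the pipeline.
        pipeline.appendleft(cycle % num_threads)
        in_flight = [tid for tid in pipeline if tid is not None]
        worst = max(worst, max(in_flight.count(tid) for tid in set(in_flight)))
    return worst


enough_threads = simulate_fgmt(num_threads=5, pipeline_depth=5, cycles=20)
too_few_threads = simulate_fgmt(num_threads=2, pipeline_depth=5, cycles=20)
print(enough_threads, too_few_threads)
```

Under these assumptions the property holds only when there are at least as many threads as pipeline stages; with fewer threads, instructions from the same thread overlap and dependence checking would be needed again.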
[Figure: pipelined datapath (instruction memory, register file, ALU, data memory)]
We need a PC and a register file for each thread + muxes and control
Fine-Grained Multithreading (II)
Idea: Fetch from a different thread every cycle such that no
two instructions from a thread are in the pipeline concurrently
Fine-Grained Multithreading in HEP
Cycle time: 100 ns
8 stages → 800 ns to complete an instruction, assuming no memory access
Burton Smith (1941-2018)
Multithreaded Pipeline Example
Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
Fine-Grained Multithreading
Advantages
+ No need for dependence checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from
different threads
+ Improved system throughput, latency tolerance, pipeline utilization
Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs, register
files, …), thread selection logic
- Reduced single thread performance (one instruction fetched every N
cycles from the same thread)
- Resource contention between threads in caches and memory
- Dependence checking logic between threads may be needed (load/store)
Modern GPUs are FGMT Machines
NVIDIA GeForce GTX 285 “core”
64 KB of storage for thread contexts (registers)
[Figure: NVIDIA GeForce GTX 285 core layout with texture (Tex) units]
Further Reading for the Interested (II)
More on Multithreading (I)
https://www.youtube.com/watch?v=iqi9wFqFiNU&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=51
More on Multithreading (II)
https://www.youtube.com/watch?v=e8lfl6MbILg&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=52
More on Multithreading (III)
https://www.youtube.com/watch?v=7vkDpZ1-hHM&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=53
More on Multithreading (IV)
https://www.youtube.com/watch?v=-hbmzIDe0sA&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=54
Lectures on Multithreading
Parallel Computer Architecture, Fall 2012, Lecture 9
Multithreading I (CMU, Fall 2012)
https://www.youtube.com/watch?v=iqi9wFqFiNU&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=51
Parallel Computer Architecture, Fall 2012, Lecture 10
Multithreading II (CMU, Fall 2012)
https://www.youtube.com/watch?v=e8lfl6MbILg&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=52
Parallel Computer Architecture, Fall 2012, Lecture 13
Multithreading III (CMU, Fall 2012)
https://www.youtube.com/watch?v=7vkDpZ1-hHM&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=53
Parallel Computer Architecture, Fall 2012, Lecture 15
Speculation I (CMU, Fall 2012)
https://www.youtube.com/watch?v=-hbmzIDe0sA&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=54
https://www.youtube.com/onurmutlulectures
Pipelining and Precise Exceptions:
Preserving Sequential Semantics
Multi-Cycle Execution
Not all instructions take the same amount of time in the
“execute stage” of the pipeline
[Figure: pipeline with execute stages of different lengths; for example, Load/store: E E E E E E E E ...]
Issues in Pipelining: Multi-Cycle Execute
Instructions can take different number of cycles in EXECUTE
stage
Integer ADD versus Integer DIVide
DIV R4 ← R1, R2   F D E E E E E E E E W     (exception-causing instruction)
ADD R3 ← R1, R2     F D E W
                      F D E W
                        F D E W
DIV R2 ← R5, R6           F D E E E E E E E E W
ADD R7 ← R5, R6             F D E W
                              F D E W
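The ordering problem in the diagram can be made explicit by computing the cycle in which each instruction writes back. This is a minimal sketch assuming one instruction is fetched per cycle, each goes through F, D, a variable number of E cycles, and W, and nothing stalls; the function name `writeback_cycles` and the latencies (8 execute cycles for DIV, 1 for ADD, as drawn above) are used for illustration.

```python
def writeback_cycles(program):
    """Map each instruction name to the cycle of its W stage.

    program is a list of (name, execute_cycles) in program order.
    """
    result = {}
    for index, (name, execute_cycles) in enumerate(program):
        fetch_cycle = index + 1                          # one fetch per cycle
        result[name] = fetch_cycle + 1 + execute_cycles + 1  # D, E..., W
    return result


wb = writeback_cycles([("DIV R4", 8), ("ADD R3", 1)])
# ADD is younger than DIV but updates the register file first.
add_finishes_first = wb["ADD R3"] < wb["DIV R4"]
print(wb, add_finishes_first)
```

If DIV then raises an exception in cycle 11, ADD has already changed R3 in cycle 5, so the architectural state no longer corresponds to a clean point in the sequential instruction stream.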
Delayed “instruction” due to exception
Exception-causing “instruction”
Time: 12:55
An Example Exception
Exception-causing “instruction”
Time: 12:57
An Example Exception
Time: 12:58
An Example Exception
Time: 13:00
Another View
Exception Handled & Resolved…
Exception-causing “instruction”
Time: 13:06
Exceptions and Interrupts
“Unplanned” changes or interruptions in program execution
Digital Design & Computer Arch.
Lecture 13: Pipelined Processor Design
Exceptions and Interrupts: Examples
Exception examples
Divide by zero
Overflow
Undefined opcode
General protection (or access protection)
Page fault
…
Interrupt examples
I/O device needing service (e.g., keyboard input, video input)
(Periodic) system timer expiration
Power failure
Machine check
…
Exceptions vs. Interrupts
Cause
Exceptions: internal to the running thread
Interrupts: external to the running thread
When to Handle
Exceptions: when detected (and known to be non-speculative)
Interrupts: when convenient
Except for very high priority ones
Power failure
Machine check (error)
DIV R4 ← R1, R2
ADD R3 ← R1, R2
DIV R2 ← R5, R6
ADD R7 ← R5, R6
Precise state: clean separation of sequential instructions
Checking for and Handling Exceptions in Pipelining
Aside: From the x86-64 ISA Manual
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
Why Do We Want Precise Exceptions?
Semantics of the von Neumann model ISA specifies it
Remember von Neumann vs. Dataflow
Ensuring Precise Exceptions
Easy to do in single-cycle and multi-cycle machines
Single-cycle
Instruction boundaries == Cycle boundaries
Multi-cycle
Add special states in the control FSM that lead to the
exception or interrupt handlers
Switch to the handler only at a precise state
before fetching the next instruction
See H&H Section 7.7 for a treatment of exceptions in multi-cycle microarchitecture
Precise Exceptions in Multi-Cycle Datapath
EPC register: Holds the PC of the exception-causing instruction
Cause register: Holds the cause of the exception
Exception Handler starts at address 0x80000180
Precise Exceptions in Multi-Cycle FSM
Supports: Overflow, Undefined instruction
mfc0 instruction is used to copy the exception cause into a general-purpose register
Precise Exceptions in Multi-Cycle Datapath
Multi-Cycle Execute: More Complications
Instructions can take different number of cycles in EXECUTE
stage
Integer ADD versus Integer DIVide
DIV R4 ← R1, R2   F D E E E E E E E E W
ADD R3 ← R1, R2     F D E W
                      F D E W
                        F D E W
DIV R2 ← R5, R6           F D E E E E E E E E W
ADD R7 ← R5, R6             F D E W
                              F D E W
DIV R3 ← R1, R2   F D E E E E E E E E W
ADD R4 ← R1, R2     F D E E E E E E E E W
                      F D E E E E E E E E W
                        F D E E E E E E E E W
                          F D E E E E E E E E W
                            F D E E E E E E E E W
                              F D E E E E E E E E W
Downside
Worst-case instruction latency determines all instructions’ latency
What about memory operations?
Each functional unit takes worst-case number of cycles?
Solutions
Reorder buffer
History buffer
Checkpointing
Suggested reading
Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans on Computers 1988 and ISCA 1985.
Solution I: Reorder Buffer (ROB)
Idea: Complete instructions out-of-order, but reorder them
before making results visible to architectural state
When an instruction is decoded, it reserves the next-sequential entry in the ROB
When an instruction completes, it writes its result into its ROB entry
When an instruction is the oldest in the ROB and has completed without exceptions, its result is moved to the register file or memory
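The three ROB rules above (reserve in order, complete in any order, retire only from the oldest entry) can be sketched as a small data structure. This is an illustrative model, not the lecture's hardware: the class and method names are hypothetical, memory results are ignored, and on an exception all entries are simply discarded.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.regfile = {}       # architectural register state

    def allocate(self, dest):
        """At decode: reserve the next-sequential entry."""
        entry = {"dest": dest, "done": False, "value": None, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value=None, exception=False):
        """At completion (possibly out of order): record the result in the entry."""
        entry.update(done=True, value=value, exception=exception)

    def retire(self):
        """Move results of completed oldest entries to the register file.

        Returns the destination of an excepting instruction, or None.
        """
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            if head["exception"]:
                self.entries.clear()  # discard all younger instructions
                return head["dest"]
            self.regfile[head["dest"]] = head["value"]
        return None


rob = ReorderBuffer()
div = rob.allocate("R4")     # older, long-latency instruction
add = rob.allocate("R3")     # younger, short-latency instruction
rob.complete(add, value=7)   # ADD finishes first (values are made up)
rob.retire()
state_before = dict(rob.regfile)  # ADD must wait behind DIV
rob.complete(div, value=2)
rob.retire()
state_after = dict(rob.regfile)
print(state_before, state_after)
```

Because nothing younger than an incomplete instruction reaches the register file, an exception found at the head of the buffer leaves the architectural state precise.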
[Figure: Instruction Cache, Register File, Func Units, and Reorder Buffer]
[Figure: pipeline diagram with a reorder (R) stage before writeback (W)]
F D E E E E E E E E R W
F D E R W
F D E R W
F D E R W
F D E E E E E E E E R W
F D E R W
F D E R W
Reorder Buffer: How to Access?
A register value can be in the register file, reorder buffer,
(or bypass/forwarding paths)
[Figure: register file and reorder buffer feeding the functional units; each register holds either a value or a tag (pointer to ROB entry)]
A Register Alias Table (RAT) points to where each register’s current value is (or will be)
Intel Pentium Pro (1995)
Output dependence (Write-after-Write, WAW)
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
Register Renaming Example (On Your Own)
Assume
Register file has a pointer to the reorder buffer entry that
contains or will contain the value, if the register is not valid
Reorder buffer works as described before
Reorder Buffer Example
[Figure: Register File (RF) with registers R0 to R7, each with a value, a valid bit, and a tag (pointer to ROB entry); Reorder Buffer (ROB) with entries 0 to 15, ordered from oldest to youngest instruction, each with an entry-valid bit]
Initially: all registers are valid in RF & ROB is empty
Disadvantages
Reorder buffer needs to be accessed to get the results that
are yet to be written to the register file
CAM or indirection → increased latency and complexity
More on State Maintenance & Precise Exceptions
https://www.youtube.com/watch?v=nMfbtzWizDA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=13
More on State Maintenance & Precise Exceptions
https://www.youtube.com/watch?v=upJPVXEuqIQ&list=PL5Q2soXY2Zi-iBn_sw_B63HtdbTNmphLc&index=18
More on State Maintenance & Precise Exceptions
https://www.youtube.com/watch?v=9yo3yhUijQs&list=PL5Q2soXY2Zi8J58xLKBNFQFHRO3GrXxA9&index=17
Lectures on State Maintenance & Recovery
Computer Architecture, Spring 2015, Lecture 11
Precise Exceptions, State Maintenance/Recovery (CMU, Spring 2015)
https://www.youtube.com/watch?v=nMfbtzWizDA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=13
https://www.youtube.com/onurmutlulectures
Suggested Readings for the Interested
Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans on Computers 1988 and ISCA 1985.
Backup Slides