Digital Design & Computer Arch.
Lecture 14: Out-of-Order Execution
Prof. Onur Mutlu
ETH Zürich
Spring 2025
4 April 2025
Roadmap for Today (and Past Two Weeks)
◼ Prior to last week: Microarchitecture Fundamentals
❑ Single-cycle Microarchitectures
❑ Multi-cycle Microarchitectures
◼ Last week & yesterday: Pipelining
❑ Pipelining
❑ Pipelined Processor Design
◼ Control & Data Dependence Handling
◼ Precise Exceptions: State Maintenance & Recovery
◼ Today: Out-of-Order Execution
❑ Out-of-Order Execution
❑ Issues in OoO Execution: Load-Store Handling, …
[Figure: levels of transformation: Problem, Algorithm, Program/Language, System Software, SW/HW Interface, Micro-architecture, Logic, Devices, Electrons]
Readings
◼ This week
❑ Out-of-order execution
❑ H&H, Chapter 7.8-7.9
❑ Smith and Sohi, “The Microarchitecture of Superscalar Processors,”
Proceedings of the IEEE, 1995
◼ More advanced pipelining
◼ Interrupt and exception handling
◼ Out-of-order and superscalar execution concepts
◼ Optional
❑ Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999.
◼ Next Week
❑ McFarling, “Combining Branch Predictors,” DEC WRL Technical Report, 1993.
Review: In-Order Pipeline with Reorder Buffer
◼ Decode (D): Access regfile/ROB, allocate entry in ROB, check if instruction
can execute, if so dispatch instruction (i.e., send it to functional unit)
◼ Execute (E): Instructions can complete out-of-order
◼ Completion (R): Write result to reorder buffer
◼ Retirement/Commit (W): Check oldest instruction for exceptions; if none,
write result to architectural register file or memory; else, flush pipeline
and start from exception handler
◼ In-order dispatch/execution, out-of-order completion, in-order retirement
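[Figure: in-order pipeline with a reorder buffer. F and D feed functional units of different latencies (integer add: 1 E stage; integer mul: 4 E stages; FP mul: 8 E stages; load/store: variable latency), followed by R and W.]
Output dependence, i.e., Write-after-Write (WAW):
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7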
Recall: Register Renaming with a Reorder Buffer
◼ Output and anti dependences are not true dependences
❑ WHY? The same register refers to values that have nothing to do with each other
❑ They exist due to a lack of register IDs (i.e., names) in the ISA
[Figure: register file in which each register holds either a value or a tag (a pointer to the ROB entry of the producing instruction)]
A Register Alias Table (RAT) points to where each register’s current value is (or will be)
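The renaming idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hardware mechanism itself: the `rename` function and the tags `t0`, `t1`, … are hypothetical stand-ins for ROB entry ids, and the three instructions are the WAW example from the review slide.

```python
from itertools import count

def rename(program):
    """Give every destination write a fresh tag (hypothetical stand-in for a ROB entry id).

    program: list of (dest, src1, src2) architectural register names.
    Returns (dest_tag, src1, src2); a source is either an architectural name
    (value already in the register file) or the tag of its in-flight producer.
    """
    rat = {}          # architectural register -> tag of its latest in-flight writer
    fresh = count()   # tag generator (assumption: tags are never reused here)
    renamed = []
    for dest, src1, src2 in program:
        # Look up sources BEFORE remapping the destination.
        s1 = rat.get(src1, src1)
        s2 = rat.get(src2, src2)
        tag = f"t{next(fresh)}"
        rat[dest] = tag
        renamed.append((tag, s1, s2))
    return renamed

# r3 <- r1 op r2 ; r5 <- r3 op r4 ; r3 <- r6 op r7
program = [("r3", "r1", "r2"), ("r5", "r3", "r4"), ("r3", "r6", "r7")]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, the two writes to r3 target different tags (`t0` and `t2`), so the output dependence disappears, while the true dependence of the second instruction on the first is preserved through `t0`.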
Out-of-Order Execution
(Dynamic Instruction Scheduling)
An In-order Pipeline
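[Figure: in-order pipeline. F and D feed functional units of different latencies (integer add: 1 E stage; integer mul: 4 E stages; FP mul: 8 E stages; load/store: variable latency, e.g., on a cache miss), followed by R and W.]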
An Example Non-Ready Instruction
[Figure sequence (timestamps 12:55 to 13:00): a long-latency instruction stalls the pipeline, so an independent instruction cannot enter the execution unit]
Another View
Stalling Done & Independents Execute
[Figure (timestamp 13:06): the independent instruction is finally dispatched and executing]
An In-order Pipeline
[Figure: the same in-order pipeline, with a cache miss in the load/store unit]
Can We Do Better?
How Can We Do Better?
◼ What do the following two pieces of code have in common
(with respect to execution in the previous design)?
MUL R3 ← R1, R2        LD  R3 ← R1 (0)
ADD R3 ← R3, R1        ADD R3 ← R3, R1
ADD R4 ← R6, R7        ADD R4 ← R6, R7
MUL R5 ← R6, R8        MUL R5 ← R6, R8
ADD R7 ← R9, R9        ADD R7 ← R9, R9
Out-of-order Execution (Dynamic Scheduling)
◼ Idea: Move the non-ready instructions out of the way of
independent ones (such that independent ones can dispatch)
❑ Rest areas for non-ready instructions: Reservation stations
◼ Benefit:
❑ Latency tolerance: Allows independent instructions to execute
and complete in the presence of a long-latency operation
In-order vs. Out-of-order Dispatch
◼ In order dispatch + precise exceptions:
IMUL R3 ← R1, R2    F D E E E E R W
ADD  R3 ← R3, R1    F D STALL E R W
ADD  R1 ← R6, R7    F STALL D E R W
IMUL R5 ← R6, R8    F D E E E E R W
ADD  R7 ← R3, R5    F D STALL E R W
◼ 16 vs. 12 cycles (in-order vs. out-of-order dispatch)
Enabling OoO Execution
1. Need to link the consumer of a value to the producer
❑ Register renaming: Associate a “tag” with each data value
2. Need to buffer instructions until they are ready to execute
❑ Insert instruction into reservation stations after renaming
3. Instructions need to keep track of readiness of source values
❑ Broadcast the “tag” when the value is produced
❑ Instructions compare their “source tags” to the broadcast tag
→ if match, source value becomes ready
4. When all source values of an instruction are ready, need to
dispatch the instruction to its functional unit (FU)
❑ Instruction wakes up if all sources are ready
❑ If multiple instructions are awake, need to select one per FU
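Steps 3 and 4 above (tag broadcast, wakeup, and select) can be sketched as follows. This is a minimal sketch under assumptions: the `Source` and `RSEntry` classes, the `age` field, and the oldest-first select policy are illustrative choices, not something the slides prescribe. The entries mirror reservation stations a, y, and d from the lecture's simulation while they wait for operands.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Source:
    ready: bool
    tag: Optional[str] = None     # tag of the producer while the value is pending
    value: Optional[int] = None

@dataclass
class RSEntry:
    name: str    # reservation station id (also the tag of the result it produces)
    age: int     # lower = older in program order (assumed tie-breaker)
    fu: str      # functional unit this entry dispatches to: "ADD" or "MUL"
    src1: Source
    src2: Source

def broadcast(entries, tag, value):
    """Step 3: each waiting source compares its tag with the broadcast tag."""
    for e in entries:
        for s in (e.src1, e.src2):
            if not s.ready and s.tag == tag:
                s.ready, s.value, s.tag = True, value, None

def select(entries):
    """Step 4: among awake entries, pick one per FU (oldest first, an assumption)."""
    chosen = {}
    for e in sorted(entries, key=lambda e: e.age):
        if e.src1.ready and e.src2.ready and e.fu not in chosen:
            chosen[e.fu] = e.name
    return chosen

waiting = [
    RSEntry("a", 1, "ADD", Source(False, tag="x"), Source(True, value=4)),
    RSEntry("y", 4, "MUL", Source(False, tag="b"), Source(False, tag="c")),
    RSEntry("d", 5, "ADD", Source(False, tag="a"), Source(False, tag="y")),
]
before = select(waiting)     # nothing is awake yet
broadcast(waiting, "x", 2)   # the producer with tag x finishes with value 2
after = select(waiting)      # entry a is now awake and selected for the adder
print(before, after)
```

In hardware, the comparison in `broadcast` is done by one comparator per source field, all operating in parallel on the broadcast bus.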
Tomasulo’s Algorithm for OoO Execution
◼ OoO with register renaming invented by Robert Tomasulo
❑ Used in IBM 360/91 Floating Point Units
❑ Reading: Tomasulo, “An Efficient Algorithm for Exploiting Multiple
Arithmetic Units,” IBM Journal of R&D, Jan. 1967.
[Figure: out-of-order pipeline. F and D are followed by a SCHEDULE stage in front of the functional units (integer add, integer mul, FP mul, load/store), then a REORDER stage, then R and W.]
◼ Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.
Tomasulo’s Machine: IBM 360/91
[Figure: IBM 360/91 floating-point unit, showing the FP registers, load buffers (from memory), store buffers (to memory), the operation bus from the instruction unit, and reservation stations in front of two FP functional units]
IBM 360/91 in Real World
http://www.columbia.edu/cu/computinghistory/36091.html
http://www.righto.com/2019/04/iconic-consoles-of-ibm-system360.html
Recall Once More: Register Renaming
◼ Output and anti dependences are not true dependences
❑ WHY? The same register refers to values that have nothing to do
with each other
❑ They exist due to a lack of register IDs (i.e., names) in the ISA
[Figure: register file R0-R7 with all valid bits set to 1; each register holds either a value or a tag (a pointer to the ROB entry of the producing instruction); ROB entries 13-15 are shown, each with a valid bit]
This Lecture
Register File (RF) or Register Alias Table (RAT)
[Figure: register file / RAT with entries R0-R7; each entry has a valid bit, a value, and a tag (a pointer to the reservation station entry that will produce the value)]
We will ignore Reorder Buffer for simplicity
Tomasulo’s Algorithm
◼ If reservation station available before renaming
❑ Instruction + renamed operands (source value/tag) inserted into the
reservation station
❑ Only rename if reservation station is available
◼ Else stall
◼ While in reservation station, each instruction:
❑ Watches common data bus (CDB) for tag of its sources
❑ When tag seen, grab value for the source and keep it in the reservation station
❑ When both operands available, instruction ready to be dispatched
◼ Dispatch instruction to the Functional Unit when instruction is ready
◼ After instruction finishes in the Functional Unit
❑ Arbitrate for CDB
❑ Put tagged value onto CDB (tag broadcast)
❑ Register file is connected to the CDB
◼ Register contains a tag indicating the latest writer to the register
◼ If the tag in the register file matches the broadcast tag, write broadcast value
into register (and set valid bit)
❑ Reclaim rename tag
◼ no valid copy of tag in system!
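The register-file side of the algorithm (rename the destination, then accept a broadcast only if the tag still matches the latest writer) can be sketched as below. This is an assumption-laden illustration: the dictionary layout and the helper names `rename_dest` and `cdb_broadcast` are hypothetical, and the values follow the lecture's simulation, where R5 is renamed first to tag a and then to tag d.

```python
def rename_dest(rat, reg, tag):
    """Mark reg as pending: its value will come from the instruction with this tag."""
    rat[reg] = {"valid": False, "tag": tag, "value": None}

def cdb_broadcast(rat, tag, value):
    """Write the broadcast value only into registers whose latest writer has this tag."""
    for entry in rat.values():
        if not entry["valid"] and entry["tag"] == tag:
            entry.update(valid=True, tag=None, value=value)

rat = {"R5": {"valid": True, "tag": None, "value": 5}}

rename_dest(rat, "R5", "a")    # ADD R3, R4 → R5 is allocated to RS a
rename_dest(rat, "R5", "d")    # ADD R5, R11 → R5 is allocated to RS d (renamed twice)

cdb_broadcast(rat, "a", 6)     # older writer finishes: R5 no longer waits for tag a
after_a = dict(rat["R5"])

cdb_broadcast(rat, "d", 142)   # latest writer finishes: R5 captures the value
after_d = dict(rat["R5"])
print(after_a, after_d)
```

The broadcast of tag a still matters to any reservation station that captured tag a as a source; only the register file ignores it, because the register's tag was already overwritten by d.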
An Exercise
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11
Pipeline: F D E W
ADD: F D E E E E W
MUL: F D E E E E E E E E W
Exercise Continued
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11
How It Works
Our First OoO Machine Simulation
Program We Will Simulate
MUL R1, R2 → R3
ADD R3, R4 → R5
ADD R2, R6 → R7
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5
Initially:
1. Reservation Stations (RS’s) are all Invalid (Empty)
2. All Registers are Valid
Register Alias Table
Register  Valid  Tag  Value
R1        1           1
R2        1           2
R3        1           3
R4        1           4
R5        1           5
R6        1           6
R7        1           7
R8        1           8
R9        1           9
R10       1           10
R11       1           11
RS for ADD Unit (+): entries a, b, c, d, all empty
RS for MUL Unit (∗): entries x, y, z, t, all empty
Each RS entry holds Source 1 and Source 2, each with V, Tag, and Value fields
ADD and MUL Execution Units have separate Tag & Value buses
Cycle 0
Cycle
MUL R1, R2 → R3
ADD R3, R4 → R5
ADD R2, R6 → R7
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5
Cycle 1
Cycle 1
MUL R1, R2 → R3 F
ADD R3, R4 → R5
ADD R2, R6 → R7
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5
Cycle 2
MUL gets decoded and allocated into RS x
Step 1: Check if reservation station available. Yes: x
Cycle 3
1. MUL in RS x starts executing
2. ADD gets decoded and allocated into RS a
Cycle 5
Cycle 1 2 3 4 5
MUL R1, R2 → R3 F D E1 E2 E3
ADD R3, R4 → R5 F D - -
ADD R2, R6 → R7 F D E1
ADD R8, R9 → R10 F D
MUL R7, R10 → R11 F
ADD R5, R11 → R5
Cycle 6
Cycle 1 2 3 4 5 6
MUL R1, R2 → R3 F D E1 E2 E3 E4
ADD R3, R4 → R5 F D - - -
ADD R2, R6 → R7 F D E1 E2
ADD R8, R9 → R10 F D E1
MUL R7, R10 → R11 F D
ADD R5, R11 → R5 F
Cycle 7
All six instructions are now decoded and renamed
Note what happened to R5: Renamed twice!
Cycle 1 2 3 4 5 6 7
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5
ADD R3, R4 → R5 F D - - - -
ADD R2, R6 → R7 F D E1 E2 E3
ADD R8, R9 → R10 F D E1 E2
MUL R7, R10 → R11 F D -
ADD R5, R11 → R5 F D
Cycle 8 (First Slide)
MUL in RS x is done: broadcast MUL’s tag (x) and MUL’s result (2)
✓ Check tag
✓ Check for invalidity
Cycle 1 2 3 4 5 6 7 8
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6
ADD R3, R4 → R5 F D - - - -
ADD R2, R6 → R7 F D E1 E2 E3
ADD R8, R9 → R10 F D E1 E2
MUL R7, R10 → R11 F D -
ADD R5, R11 → R5 F D
ADD in RS a is ready to execute in the next cycle!
Cycle 8 (Second Slide)
ADD in RS b is also done: broadcast ADD’s tag (b) and ADD’s result (8)
✓ Check tag
✓ Check for invalidity
Cycle 1 2 3 4 5 6 7 8
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6
ADD R3, R4 → R5 F D - - - - -
ADD R2, R6 → R7 F D E1 E2 E3 E4
ADD R8, R9 → R10 F D E1 E2
MUL R7, R10 → R11 F D -
ADD R5, R11 → R5 F D
MUL in RS y is still NOT ready to execute in the next cycle!
Cycle 8 (Third Slide)
Cycle 1 2 3 4 5 6 7 8
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6
ADD R3, R4 → R5 F D - - - - -
ADD R2, R6 → R7 F D E1 E2 E3 E4
ADD R8, R9 → R10 F D E1 E2 E3
MUL R7, R10 → R11 F D - -
ADD R5, R11 → R5 F D -
Cycle 9
Cycle 1 2 3 4 5 6 7 8 9
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 (Broadcast and Update)
MUL R7, R10 → R11 F D - - -
ADD R5, R11 → R5 F D - -
Cycle 10
Cycle 1 2 3 4 5 6 7 8 9 10
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1
ADD R5, R11 → R5 F D - - -
Cycle 11
Cycle 1 2 3 4 5 6 7 8 9 10 11
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2
ADD R5, R11 → R5 F D - - - -
Cycle 12
Cycle 1 2 3 4 5 6 7 8 9 10 11 12
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 (Broadcast and Update)
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3
ADD R5, R11 → R5 F D - - - - -
Cycle 13
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4
ADD R5, R11 → R5 F D - - - - - -
Cycle 14
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5
ADD R5, R11 → R5 F D - - - - - - -
Cycle 15
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 (Broadcast and Update)
ADD R5, R11 → R5 F D - - - - - - - -
Cycle 16
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 W
ADD R5, R11 → R5 F D - - - - - - - - E1
Cycle 17
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 W
ADD R5, R11 → R5 F D - - - - - - - - E1 E2
Cycle 18
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 W
ADD R5, R11 → R5 F D - - - - - - - - E1 E2 E3
Cycle 19
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 W
ADD R5, R11 → R5 F D - - - - - - - - E1 E2 E3 E4
Broadcast and Update: the ADD unit broadcasts Tag d with Value 142; R5 becomes valid (0 → 1) with value 142
Register Alias Table
Register  Valid  Tag  Value
R1        1           1
R2        1           2
R3        1           2
R4        1           4
R5        0 → 1  d    142
R6        1           6
R7        1           8
R8        1           8
R9        1           9
R10       1           17
R11       1           136
RS for ADD Unit (+)
Entry  Source 1 (V Tag Value)  Source 2 (V Tag Value)
a      1 ~ 2                   1 ~ 4
b      1 ~ 2                   1 ~ 6
c      1 ~ 8                   1 ~ 9
d      1 ~ 6                   1 ~ 136
RS for MUL Unit (∗)
Entry  Source 1 (V Tag Value)  Source 2 (V Tag Value)
x      1 ~ 1                   1 ~ 2
y      1 ~ 8                   1 ~ 17
z
t
Cycle 20
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 W
ADD R5, R11 → R5 F D - - - - - - - - E1 E2 E3 E4 W
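The timeline above can be reproduced with a small dataflow-timing sketch. It is a simplification under explicit assumptions, not a full Tomasulo simulator: one instruction is fetched per cycle, a reservation station is always free, the functional units are pipelined, ADD executes for 4 cycles and MUL for 6, and a dependent instruction may start executing in the cycle after its producer's last execute cycle. The function name `schedule` and the row layout are hypothetical.

```python
LATENCY = {"ADD": 4, "MUL": 6}   # execute cycles used in the simulation

# (op, src1, src2, dest), in program order
PROGRAM = [
    ("MUL", "R1", "R2", "R3"),
    ("ADD", "R3", "R4", "R5"),
    ("ADD", "R2", "R6", "R7"),
    ("ADD", "R8", "R9", "R10"),
    ("MUL", "R7", "R10", "R11"),
    ("ADD", "R5", "R11", "R5"),
]

def schedule(program, latency=LATENCY):
    last_writer = {}   # register -> index of the most recent instruction writing it
    rows = []
    for i, (op, s1, s2, dst) in enumerate(program):
        fetch = i + 1
        decode = fetch + 1
        # Producers of the sources, if they are earlier instructions of this program.
        producers = [last_writer[s] for s in (s1, s2) if s in last_writer]
        operands_ready = max((rows[p]["last_e"] for p in producers), default=0)
        first_e = max(decode, operands_ready) + 1
        last_e = first_e + latency[op] - 1
        rows.append({"F": fetch, "D": decode, "first_e": first_e,
                     "last_e": last_e, "W": last_e + 1})
        last_writer[dst] = i   # update AFTER reading sources (matters for ADD R5, R11 → R5)
    return rows

rows = schedule(PROGRAM)
for (op, s1, s2, dst), r in zip(PROGRAM, rows):
    print(f"{op} {s1}, {s2} → {dst}: E{r['first_e']}..{r['last_e']}, W in cycle {r['W']}")
```

Under these assumptions the last instruction writes back in cycle 20, matching the table; a model with limited reservation stations or non-pipelined units would need extra structural checks.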
Some Questions
◼ What is needed in hardware to perform tag broadcast and
value capture?
Wires, Comparators & Logic
→ make a value valid
→ wake up an instruction
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11
Easy task for you: Draw the dataflow graph for the above code
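One way to check a hand-drawn graph is to derive the edges mechanically. The sketch below assumes the instructions are encoded as hypothetical `(op, dest, src1, src2)` tuples and records an edge from each producer to each consumer of a register value; sources that no earlier instruction writes are program inputs and get no edge.

```python
# (op, dest, src1, src2), in program order
PROGRAM = [
    ("MUL", "R3", "R1", "R2"),
    ("ADD", "R5", "R3", "R4"),
    ("ADD", "R7", "R2", "R6"),
    ("ADD", "R10", "R8", "R9"),
    ("MUL", "R11", "R7", "R10"),
    ("ADD", "R5", "R5", "R11"),
]

def dataflow_edges(program):
    """Return (producer_index, consumer_index, register) for every true dependence."""
    last_writer = {}
    edges = []
    for i, (_op, dest, src1, src2) in enumerate(program):
        for src in (src1, src2):
            if src in last_writer:
                edges.append((last_writer[src], i, src))
        last_writer[dest] = i   # later readers of dest now depend on instruction i
    return edges

edges = dataflow_edges(PROGRAM)
for producer, consumer, reg in edges:
    print(f"I{producer} --{reg}--> I{consumer}")
```

The five edges are exactly the producer-consumer links that the tags in the reservation stations encode during the simulation.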
State of RAT and RS in Cycle 7
All 6 instructions are decoded and renamed
Note what happened to R5: Renamed twice!
Cycle 1 2 3 4 5 6 7
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5
ADD R3, R4 → R5 F D - - - -
ADD R2, R6 → R7 F D E1 E2 E3
ADD R8, R9 → R10 F D E1 E2
MUL R7, R10 → R11 F D -
ADD R5, R11 → R5 F D
Register Alias Table
Register  Valid  Tag                Value
R1        1                         1
R2        1                         2
R3        0      x
R4        1                         4
R5        0      d (previously a)
R6        1                         6
R7        0      b
R8        1                         8
R9        1                         9
R10       0      c
R11       0      y
RS for ADD Unit (+)
Entry  Source 1 (V Tag Value)  Source 2 (V Tag Value)
a      0 x                     1 ~ 4
b      1 ~ 2                   1 ~ 6
c      1 ~ 8                   1 ~ 9
d      0 a                     0 y
RS for MUL Unit (∗)
Entry  Source 1 (V Tag Value)  Source 2 (V Tag Value)
x      1 ~ 1                   1 ~ 2
y      0 b                     0 c
z
t
Out-of-Order Execution with Precise Exceptions
◼ Idea: Use a reorder buffer to reorder instructions before
committing them to architectural state
[Figure: the RAT and the ADD/MUL reservation stations (entries a-d and x-t) from the simulation, extended with a reorder buffer (entries 13-15 shown) and a second register file without valid bits that holds the architectural values of the registers]
[Figure: out-of-order pipeline. F and D are followed by a SCHEDULE stage in front of the functional units (integer add, integer mul, FP mul, load/store), then a REORDER stage, then R and W.]
https://www.anandtech.com/show/1621/3
Enabling OoO Execution, Revisited
1. Link the consumer of a value to the producer
❑ Register renaming: Associate a “tag” with each data value
Summary of OOO Execution Concepts
◼ Register renaming eliminates false dependences, enables
linking of producer to consumers
OOO Execution: Restricted Dataflow
◼ An out-of-order engine dynamically builds the dataflow
graph of a piece of the program
❑ which piece?
[Figure: state of the RAT and the reservation stations in cycle 7 of the simulation; the tags stored in the reservation stations link producers to consumers]
◼ Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.
A Modern OoO Design: Intel Pentium 4
Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.
Intel Pentium 4 Simplified
Mutlu+, “Runahead Execution,” HPCA 2003.
Alpha 21264
MIPS R10000
Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, April 1996
IBM POWER4
◼ Tendler et al., “POWER4 system microarchitecture,” IBM J R&D, 2002.
IBM POWER4
◼ 2 cores, out-of-order execution
◼ 100-entry instruction window in each core
◼ 8-wide instruction fetch, issue, execute
◼ Large, local+global hybrid branch predictor
◼ 1.5MB, 8-way L2 cache
◼ Aggressive stream based prefetching
IBM POWER5
◼ Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004.
AMD Zen2? (2019)
https://en.wikichip.org/wiki/amd/microarchitectures/zen_2
Apple M1 FireStorm? (2020)
https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2
See Backup Slides for:
Handling Out-of-Order Execution
of Loads and Stores
Handling Out-of-Order Execution
of Loads and Stores
Registers versus Memory
◼ So far, we considered mainly registers as part of state
Memory Dependence Handling (I)
◼ Need to obey memory dependences in an out-of-order
machine
❑ and need to do so while providing high performance
◼ Approaches
❑ Conservative: Stall the load until all previous stores have
computed their addresses (or even retired from the machine)
❑ Aggressive: Assume load is independent of unknown-address
stores and schedule the load right away
❑ Intelligent: Predict (with a more sophisticated predictor) if the
load is dependent on any unknown address store
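The three approaches differ only in what they do when some older store has not yet computed its address. A minimal sketch, assuming a hypothetical function `may_issue_past_unknown_stores` and a caller-supplied predictor; dependences on stores whose addresses are already known (and the recovery needed after a wrong guess) are deliberately not modeled.

```python
def may_issue_past_unknown_stores(policy, older_store_addrs, predict_dependent=None):
    """Decide whether a load may be scheduled now.

    older_store_addrs: addresses of older in-flight stores; None means the
    store has not computed its address yet.
    predict_dependent: zero-argument callable used only by the "intelligent"
    policy (an assumed interface for a memory dependence predictor).
    """
    has_unknown = any(addr is None for addr in older_store_addrs)
    if not has_unknown:
        return True                      # nothing left to guess about
    if policy == "conservative":
        return False                     # stall until all older addresses are known
    if policy == "aggressive":
        return True                      # assume independence, schedule right away
    if policy == "intelligent":
        return not predict_dependent()   # follow the predictor
    raise ValueError(f"unknown policy: {policy}")

older = [0x100, None]   # one older store still has an unknown address
decisions = {
    "conservative": may_issue_past_unknown_stores("conservative", older),
    "aggressive": may_issue_past_unknown_stores("aggressive", older),
    "intelligent": may_issue_past_unknown_stores("intelligent", older, lambda: True),
}
print(decisions)
```

Only the conservative policy never needs recovery; the other two must be able to detect and squash a load that turned out to depend on an older store.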
Handling of Store-Load Dependences
◼ A load’s dependence status is not known until all previous store
addresses are available.
◼ How does the OOO engine treat the scheduling of a load instruction wrt
previous stores?
❑ Option 1: Assume load dependent on all previous stores
Memory Disambiguation (I)
◼ Option 1: Assume load is dependent on all previous stores
+ No need for recovery
-- Too conservative: delays independent loads unnecessarily
Memory Disambiguation (II)
◼ Chrysos and Emer, “Memory Dependence Prediction Using Store
Sets,” ISCA 1998.
Data Forwarding Between Stores and Loads
◼ We cannot update memory out of program order
→ Need to buffer all store and load instructions in instruction window
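Because stores stay buffered until they retire, a younger load to the same address has to take its data from the buffer rather than from memory. The sketch below illustrates that lookup under simplifying assumptions: `load_value` is a hypothetical helper, all older store addresses are already known, and every access has the same size and alignment.

```python
def load_value(store_queue, memory, load_addr):
    """Return the value a load should see.

    store_queue: older, not yet retired stores in program order (oldest first),
                 each as an (address, data) pair.
    memory:      dict modeling committed memory contents.
    """
    # Search from the youngest older store backwards: the most recent
    # store to the same address supplies the data (store-to-load forwarding).
    for addr, data in reversed(store_queue):
        if addr == load_addr:
            return data
    # No buffered store matches: read committed memory (default 0 if absent).
    return memory.get(load_addr, 0)

store_queue = [(0x100, 1), (0x200, 2), (0x100, 3)]
memory = {0x100: 99, 0x300: 7}

forwarded = load_value(store_queue, memory, 0x100)    # from the youngest matching store
from_memory = load_value(store_queue, memory, 0x300)  # no match in the queue
print(forwarded, from_memory)
```

In hardware this search is an associative (content-addressed) lookup over the buffered stores, prioritized by age.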
Out-of-Order Completion of Memory Ops
◼ When a store instruction finishes execution, it writes its
address and data in its reorder buffer entry (or SQ entry)