Onur Ddca 2025 Lecture14 Out of Order Execution Afterlecture

This lecture focuses on Out-of-Order (OoO) Execution, a technique that allows independent instructions to execute while waiting for long-latency operations. It covers the concepts of dynamic instruction scheduling, register renaming, and the use of reservation stations to manage instruction readiness. The lecture also references Tomasulo's Algorithm, which is foundational for modern high-performance processors utilizing OoO execution.

Uploaded by

Minh Phuoc

Digital Design & Computer Arch.

Lecture 14: Out-of-Order Execution

Prof. Onur Mutlu

ETH Zürich
Spring 2025
4 April 2025
Roadmap for Today (and Past Two Weeks)
◼ Prior to last week: Microarchitecture Fundamentals
❑ Single-cycle Microarchitectures
❑ Multi-cycle Microarchitectures
◼ Last week & yesterday: Pipelining
❑ Pipelining
❑ Pipelined Processor Design
◼ Control & Data Dependence Handling
◼ Precise Exceptions: State Maintenance & Recovery
◼ Today: Out-of-Order Execution
❑ Out-of-Order Execution
❑ Issues in OoO Execution: Load-Store Handling, …

[Figure: levels of transformation, top to bottom: Problem, Algorithm, Program/Language, System Software, SW/HW Interface, Micro-architecture, Logic, Devices, Electrons]
2
Readings
◼ This week
❑ Out-of-order execution
❑ H&H, Chapter 7.8-7.9
❑ Smith and Sohi, “The Microarchitecture of Superscalar Processors,”
Proceedings of the IEEE, 1995
◼ More advanced pipelining
◼ Interrupt and exception handling
◼ Out-of-order and superscalar execution concepts
◼ Optional
❑ Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999.

◼ Next Week
❑ McFarling, “Combining Branch Predictors,” DEC WRL Technical
Report, 1993.
3
Review: In-Order Pipeline with Reorder Buffer
◼ Decode (D): Access regfile/ROB, allocate entry in ROB, check if instruction
can execute, if so dispatch instruction (i.e., send it to functional unit)
◼ Execute (E): Instructions can complete out-of-order
◼ Completion (R): Write result to reorder buffer
◼ Retirement/Commit (W): Check oldest instruction for exceptions; if none,
write result to architectural register file or memory; else, flush pipeline
and start from exception handler
◼ In-order dispatch/execution, out-of-order completion, in-order retirement
[Figure: in-order pipeline. F and D feed parallel functional units (Integer add: 1 E stage; Integer mul: 4 E stages; FP mul: 8 E stages; Load/store: 8 or more E stages), followed by R and W.]

ROB is implemented as a circular queue in hardware
4
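The circular-queue organization can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design from the lecture: the class name `ReorderBuffer`, its methods, and the entry fields are hypothetical, and exception checking at retirement is omitted.

```python
class ReorderBuffer:
    """Toy circular-queue ROB: allocate at the tail, retire from the head."""

    def __init__(self, size=16):
        self.entries = [None] * size
        self.head = 0      # oldest instruction
        self.tail = 0      # next free slot
        self.count = 0

    def allocate(self, dest_reg):
        if self.count == len(self.entries):
            return None    # ROB full: decode would have to stall
        idx = self.tail
        self.entries[idx] = {"dest": dest_reg, "value": None, "done": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return idx

    def complete(self, idx, value):
        # Completion may happen out of order; the result waits in the ROB.
        self.entries[idx]["value"] = value
        self.entries[idx]["done"] = True

    def retire(self, regfile):
        """Retire the oldest entry only if it has completed."""
        if self.count == 0 or not self.entries[self.head]["done"]:
            return False
        e = self.entries[self.head]
        regfile[e["dest"]] = e["value"]
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return True


rob = ReorderBuffer(size=4)
regs = {}
mul = rob.allocate("R3")     # older, long-latency instruction
add = rob.allocate("R7")     # younger, short-latency instruction
rob.complete(add, 8)         # the younger instruction completes first
blocked = rob.retire(regs)   # oldest is not done, so nothing retires
rob.complete(mul, 2)
rob.retire(regs)
rob.retire(regs)
```

The usage lines show out-of-order completion with in-order retirement: the younger ADD finishes first but cannot update `regs` until the older MUL has retired.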


Recall: Data Dependence Types
Flow dependence: Read-after-Write (RAW)
r3 ← r1 op r2
r5 ← r3 op r4

Anti dependence: Write-after-Read (WAR)
r3 ← r1 op r2
r1 ← r4 op r5

Output dependence: Write-after-Write (WAW)
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
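The three dependence types can be checked mechanically. The sketch below is illustrative only: the function name `classify` and the `(dest, src1, src2)` tuple encoding are assumptions, and it compares just two instructions without looking at writes in between.

```python
def classify(first, second):
    """Return the dependence types from `first` to a later instruction `second`.

    Each instruction is a (dest, src1, src2) tuple of register names.
    """
    d1, *s1 = first
    d2, *s2 = second
    kinds = []
    if d1 in s2:
        kinds.append("RAW")   # second reads what first writes
    if d2 in s1:
        kinds.append("WAR")   # second overwrites what first reads
    if d1 == d2:
        kinds.append("WAW")   # both write the same register
    return kinds


flow = classify(("r3", "r1", "r2"), ("r5", "r3", "r4"))
anti = classify(("r3", "r1", "r2"), ("r1", "r4", "r5"))
output = classify(("r3", "r1", "r2"), ("r3", "r6", "r7"))
```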
5
Recall: Register Renaming with a Reorder Buffer
◼ Output and anti dependences are not true dependences
❑ WHY? The same register refers to values that have nothing to do with each other
❑ They exist due to the lack of register IDs (i.e., names) in the ISA

◼ The register ID is renamed to the reorder buffer entry that will hold the register’s value
❑ Register ID → ROB entry ID
❑ Architectural register ID → Physical register ID
❑ After renaming, the ROB entry ID is used to refer to the register

◼ This eliminates anti and output dependences
❑ Gives the illusion that there are a large number of registers
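A small Python sketch can show why renaming removes anti and output dependences while keeping flow dependences. The tag names `p0`, `p1`, … and the function `rename` are hypothetical; in the lecture the new name is the ROB entry ID.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh tag; sources read the latest mapping."""
    fresh = (f"p{i}" for i in count())
    mapping = {}                        # architectural register -> current tag
    renamed = []
    for dest, src1, src2 in program:
        s1 = mapping.get(src1, src1)    # unmapped sources keep their architectural name
        s2 = mapping.get(src2, src2)
        tag = next(fresh)
        mapping[dest] = tag
        renamed.append((tag, s1, s2))
    return renamed


program = [
    ("r3", "r1", "r2"),
    ("r5", "r3", "r4"),   # RAW on r3: a true dependence
    ("r3", "r6", "r7"),   # WAW with the first write, WAR with the read above
]
renamed = rename(program)
```

After renaming, the third instruction writes `p2` instead of reusing the first instruction's name, so only the true dependence through `p0` remains.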
6
Recall: Reorder Buffer Example
[Figure: Register File (RF) with entries R0 to R7, each holding Value, Tag (pointer to ROB entry), and Valid?; Reorder Buffer (ROB) with Entry 0 to Entry 15 and markers for the oldest and youngest instruction, each entry holding Entry Valid?, Dest reg ID, Dest reg value, and Dest reg written?]

Initially: all registers are valid in RF & ROB is empty

Simulate:
MUL R1, R2 → R3
MUL R3, R4 → R11
ADD R5, R6 → R3
ADD R3, R8 → R12
7
Recall: Reorder Buffer in Intel Pentium Pro

Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.

A Register Alias Table (RAT) points to where each register’s current value is (or will be)
Out-of-Order Execution
(Dynamic Instruction Scheduling)
An In-order Pipeline

[Figure: in-order pipeline. F and D feed parallel functional units (Integer add: 1 E stage; Integer mul: 4 E stages; FP mul: 8 E stages; Cache miss: 8 or more E stages), followed by R and W.]

◼ Dispatch: Act of sending an instruction to a functional unit


◼ Renaming with ROB eliminates stalls due to false dependences
◼ Problem: A non-ready instruction stalls dispatch of younger
instructions into functional (execution) units

10
An Example Non-Ready Instruction

Independent instruction
cannot enter the
execution unit

Long-latency instruction
stalls the pipeline

Time: 12:55 11
An Example Non-Ready Instruction

Stalls the pipeline

Time: 12:57 12
An Example Non-Ready Instruction

Time: 12:58 13
An Example Non-Ready Instruction

Time: 13:00 14
Another View

15
Stalling Done & Independents Execute

Independent instruction
finally
dispatched and executing

Time: 13:06 16
Can We Do Better?

18
How Can We Do Better?
◼ What do the following two pieces of code have in common
(with respect to execution in the previous design)?
MUL R3 ← R1, R2      LD  R3 ← R1 (0)
ADD R3 ← R3, R1      ADD R3 ← R3, R1
ADD R4 ← R6, R7      ADD R4 ← R6, R7
MUL R5 ← R6, R8      MUL R5 ← R6, R8
ADD R7 ← R9, R9      ADD R7 ← R9, R9

◼ Answer: First ADD stalls the whole pipeline!
❑ ADD cannot dispatch because its source register is unavailable
❑ Later independent instructions cannot get dispatched

◼ How are the above code portions different?
❑ Answer: Load latency is variable (unknown until runtime)
❑ What does this affect? Think compiler vs. microarchitecture
19
Preventing Dispatch Stalls
◼ Problem: in-order dispatch (scheduling, or execution)

◼ Solution: out-of-order dispatch (scheduling, or execution)

◼ Actually, we have seen the basic idea before:


❑ Dataflow: “fire” an instruction only when its inputs are ready
❑ We will use similar principles, but not expose it in the ISA

◼ Aside: Any other way to prevent dispatch stalls?


1. Compile-time instruction scheduling/reordering
2. Value prediction
3. Fine-grained multithreading

20
Out-of-order Execution (Dynamic Scheduling)
◼ Idea: Move the non-ready instructions out of the way of
independent ones (such that independent ones can dispatch)
❑ Rest areas for non-ready instructions: Reservation stations

◼ Monitor the source “values” of each instruction in the resting (waiting) area
◼ When all source “values” of an instruction are available, “fire” (i.e., dispatch) the instruction
❑ Instructions dispatched in dataflow (not control-flow) order

◼ Benefit:
❑ Latency tolerance: Allows independent instructions to execute
and complete in the presence of a long-latency operation
21
In-order vs. Out-of-order Dispatch
◼ In order dispatch + precise exceptions:
IMUL R3 ← R1, R2     F D E E E E R W
ADD  R3 ← R3, R1     F D STALL E R W
ADD  R1 ← R6, R7     F STALL D E R W
IMUL R5 ← R6, R8     F D E E E E E R W
ADD  R7 ← R3, R5     F D STALL E R W

◼ Out-of-order dispatch + precise exceptions:
IMUL R3 ← R1, R2     F D E E E E R W
ADD  R3 ← R3, R1     F D WAIT E R W
ADD  R1 ← R6, R7     F D E R W
IMUL R5 ← R6, R8     F D E E E E R W
ADD  R7 ← R3, R5     F D WAIT E R W

◼ 16 vs. 12 cycles
22
Enabling OoO Execution
1. Need to link the consumer of a value to the producer
❑ Register renaming: Associate a “tag” with each data value
2. Need to buffer instructions until they are ready to execute
❑ Insert instruction into reservation stations after renaming
3. Instructions need to keep track of readiness of source values
❑ Broadcast the “tag” when the value is produced
❑ Instructions compare their “source tags” to the broadcast tag
→ if match, source value becomes ready
4. When all source values of an instruction are ready, need to
dispatch the instruction to its functional unit (FU)
❑ Instruction wakes up if all sources are ready
❑ If multiple instructions are awake, need to select one per FU
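Steps 3 and 4 above (tag comparison, wakeup, and select) can be sketched as follows. This is a simplified software model under stated assumptions: the class names `Source` and `RSEntry`, the `age` field, and the oldest-first select policy are illustrative choices, not details given in the lecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    ready: bool
    tag: Optional[str] = None      # producer's tag while the value is pending
    value: Optional[int] = None


@dataclass
class RSEntry:
    name: str
    age: int                       # smaller means older in program order
    src1: Source
    src2: Source

    def capture(self, tag, value):
        """Compare the broadcast tag with each waiting source tag."""
        for s in (self.src1, self.src2):
            if not s.ready and s.tag == tag:
                s.ready, s.value = True, value

    def awake(self):
        return self.src1.ready and self.src2.ready


def select(entries):
    """Pick one awake entry for the functional unit (oldest first, an assumed policy)."""
    awake = [e for e in entries if e.awake()]
    return min(awake, key=lambda e: e.age) if awake else None


a = RSEntry("a", age=1, src1=Source(False, tag="x"), src2=Source(True, value=4))
b = RSEntry("b", age=2, src1=Source(True, value=2), src2=Source(True, value=6))
first = select([a, b])       # only b is awake: a still waits for tag x
for entry in (a, b):
    entry.capture("x", 2)    # producer x broadcasts its tag and value
second = select([a, b])      # both are awake now; the older entry wins
```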

23
Tomasulo’s Algorithm for OoO Execution
◼ OoO with register renaming invented by Robert Tomasulo
❑ Used in IBM 360/91 Floating Point Units
❑ Reading: Tomasulo, “An Efficient Algorithm for Exploiting Multiple
Arithmetic Units,” IBM Journal of R&D, Jan. 1967.

◼ What is the major difference today?


❑ Precise exceptions
❑ Provided by
◼ Patt, Hwu, Shebanow, “HPS, a new microarchitecture: rationale and
introduction,” MICRO 1985.
◼ Patt et al., “Critical issues regarding HPS, a high performance
microarchitecture,” MICRO 1985.

◼ OoO variants are used in most high-performance processors


❑ Initially in Intel Pentium Pro, AMD K5
❑ Alpha 21264, MIPS R10000, IBM POWER5, IBM z196, Oracle UltraSPARC T4, ARM Cortex A15, Apple M1, …
24
Two Humps in a Modern Pipeline
[Figure: pipeline with a TAG and VALUE broadcast bus. F and D run in order and feed the SCHEDULE stage; the functional units (Integer add, Integer mul, FP mul, Load/store) execute out of order; the REORDER stage and W run in order.]

◼ Hump 1: Reservation stations (scheduling window)


◼ Hump 2: Reordering (reorder buffer, aka instruction window
or active window)
25
Two Humps in a Modern Pipeline
[Figure: pipeline with a TAG and VALUE broadcast bus (in order → out of order → in order), with the SCHEDULE and REORDER stages drawn as the two humps, shown with a photo of a Bactrian camel.]

◼ Hump 1: Reservation stations (scheduling window)


◼ Hump 2: Reordering (reorder buffer, aka instruction window
or active window)

Photo credit: http://true-wildlife.blogspot.ch/2010/10/bactrian-camel.html
26


General Organization of an OOO Processor

◼ Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.

27
Tomasulo’s Machine: IBM 360/91

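[Figure: IBM 360/91 floating-point unit. Components shown: load buffers (from memory), FP registers (from instruction unit), store buffers (to memory), operation bus, reservation stations in front of two FP functional units (FP FU), and the common data bus.]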

28
IBM 360/91 in Real World

http://www.columbia.edu/cu/computinghistory/36091.html
29
IBM 360/91 in Real World

http://www.righto.com/2019/04/iconic-consoles-of-ibm-system360.html
30
Recall Once More: Register Renaming
◼ Output and anti dependences are not true dependences
❑ WHY? The same register refers to values that have nothing to do
with each other
❑ They exist due to the lack of register IDs (i.e., names) in the ISA

◼ The register ID is renamed to the reorder buffer entry (or reservation station entry) that will hold the register’s value
❑ Register ID → ROB or RS entry ID
❑ Architectural register ID → Physical register ID
❑ After renaming, the ROB or RS entry ID is used to refer to the register

◼ This eliminates anti and output dependences


❑ Gives the illusion that there are a large number of registers
❑ Approximates the performance benefit of having more registers
31
Tomasulo’s Algorithm: Renaming
◼ Register rename table (register alias table)

Register  Tag  Value  Valid?
R0                    1
R1                    1
R2                    1
R3                    1
R4                    1
R5                    1
R6                    1
R7                    1

If Valid bit is set, the Value in the table is correct.
Otherwise, Tag specifies where to find the correct value.
Tag is a unique name for the Value to be produced.
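The valid/tag/value rule can be written as a tiny lookup. The sketch below is an assumption-laden illustration: the class `RAT` and its methods `read` and `rename` are hypothetical names, and the register values follow the simulation used later in the lecture.

```python
class RAT:
    """Toy register alias table: each entry is [valid, tag, value]."""

    def __init__(self, values):
        self.table = {reg: [True, None, val] for reg, val in values.items()}

    def read(self, reg):
        """Return ('value', v) if valid, else ('tag', t) naming the future producer."""
        valid, tag, value = self.table[reg]
        return ("value", value) if valid else ("tag", tag)

    def rename(self, reg, tag):
        """A newly decoded instruction will produce `reg`: mark it invalid."""
        self.table[reg][0] = False
        self.table[reg][1] = tag


rat = RAT({"R1": 1, "R2": 2, "R3": 3, "R4": 4})
srcs_mul = (rat.read("R1"), rat.read("R2"))   # MUL R1, R2 -> R3
rat.rename("R3", "x")                         # R3 will be produced by tag x
srcs_add = (rat.read("R3"), rat.read("R4"))   # ADD R3, R4 -> R5
```

The ADD receives a tag for R3 and a value for R4, so it knows which producer to wait for.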
32
Recall from Precise Exceptions Lecture
[Figure: Register File (RF) with entries R0 to R7, each holding Value, Tag (pointer to ROB entry), and Valid?; Reorder Buffer (ROB) with Entry 0 to Entry 15 and markers for the oldest and youngest instruction, each entry holding Entry Valid?, Dest reg ID, Dest reg value, and Dest reg written?]

33
This Lecture
[Figure: Register File (RF) or Register Alias Table (RAT) with entries R0 to R7, each holding Value, Tag (pointer to the reservation station entry that will produce the value), and Valid?]

We will ignore Reorder Buffer for simplicity

34
Tomasulo’s Algorithm
◼ If reservation station available before renaming
❑ Instruction + renamed operands (source value/tag) inserted into the
reservation station
❑ Only rename if reservation station is available
◼ Else stall
◼ While in reservation station, each instruction:
❑ Watches common data bus (CDB) for tag of its sources
❑ When tag seen, grab value for the source and keep it in the reservation station
❑ When both operands available, instruction ready to be dispatched
◼ Dispatch instruction to the Functional Unit when instruction is ready
◼ After instruction finishes in the Functional Unit
❑ Arbitrate for CDB
❑ Put tagged value onto CDB (tag broadcast)
❑ Register file is connected to the CDB
◼ Register contains a tag indicating the latest writer to the register
◼ If the tag in the register file matches the broadcast tag, write broadcast value
into register (and set valid bit)
❑ Reclaim rename tag
◼ no valid copy of tag in system!
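The register-file side of the broadcast can be sketched as below. It is a minimal model, assuming a dictionary-based register file and a hypothetical `broadcast` function; the tags and values (x produces 2, a produces 6, R5 renamed to d) are taken from the simulation later in the lecture.

```python
def broadcast(regfile, tag, value):
    """Write `value` into every register whose pending tag equals the broadcast tag."""
    for entry in regfile.values():
        if not entry["valid"] and entry["tag"] == tag:
            entry.update(valid=True, tag=None, value=value)


# R5 was renamed twice (first to a, then to d), so d is its latest writer.
regfile = {
    "R3": {"valid": False, "tag": "x", "value": None},
    "R5": {"valid": False, "tag": "d", "value": None},
}
broadcast(regfile, "x", 2)   # matches R3: the register becomes valid
broadcast(regfile, "a", 6)   # matches nothing: R5 now waits for d
```

The second broadcast shows why the tag check matters: an older writer's result must not overwrite a register that has since been renamed to a younger producer.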

35
An Exercise
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11

Pipeline: F D E W
ADD: F D E E E E W
MUL: F D E E E E E E E E W

◼ Assume ADD (4 cycle execute), MUL (6 cycle execute)
◼ Assume one adder and one multiplier
◼ How many cycles
❑ in a non-pipelined machine: 50 cycles (4*7 + 2*11)
❑ in an in-order-dispatch pipelined machine with imprecise exceptions (no forwarding and forwarding)
❑ in an out-of-order dispatch pipelined machine with imprecise exceptions (forwarding)
36
Exercise Continued
in-order-dispatch pipelined machine w/o forwarding: 31 cycles

in-order-dispatch pipelined machine w/ forwarding: 25 cycles

37
Exercise Continued
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11

out-of-order dispatch pipelined machine w/ forwarding: 20 cycles
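The three cycle counts in this exercise can be reproduced with a short scheduling sketch. It rests on several assumptions that are not stated on the slides: one instruction is fetched per cycle, each functional unit is pipelined and accepts one new operation per cycle, reservation station capacity is unlimited, and hazards other than flow dependences are ignored. The `gap` values (1 cycle with forwarding, 3 cycles without) were chosen so that the model matches the stated results; the function and variable names are hypothetical.

```python
LATENCY = {"ADD": 4, "MUL": 6}

PROGRAM = [                      # (op, dest, src1, src2)
    ("MUL", "R3", "R1", "R2"),
    ("ADD", "R5", "R3", "R4"),
    ("ADD", "R7", "R2", "R6"),
    ("ADD", "R10", "R8", "R9"),
    ("MUL", "R11", "R7", "R10"),
    ("ADD", "R5", "R5", "R11"),
]


def total_cycles(program, in_order, forwarding):
    # Cycles between a producer's last E stage and a consumer's first E stage.
    # Assumption: without forwarding the consumer re-reads the register file
    # in the cycle after the producer's W stage.
    gap = 1 if forwarding else 3
    last_writer = {}             # register -> index of its latest producer
    done = []                    # last execute cycle of each instruction
    busy = set()                 # (op, cycle): one new operation per unit per cycle
    prev_start = 0
    for i, (op, dest, s1, s2) in enumerate(program):
        start = i + 3            # F in cycle i+1, D in cycle i+2
        for src in (s1, s2):
            if src in last_writer:
                start = max(start, done[last_writer[src]] + gap)
        if in_order:
            start = max(start, prev_start + 1)   # cannot pass an older instruction
        while (op, start) in busy:
            start += 1
        busy.add((op, start))
        prev_start = start
        done.append(start + LATENCY[op] - 1)
        last_writer[dest] = i
    return max(done) + 1         # one more cycle for the final W stage


in_order_no_fwd = total_cycles(PROGRAM, in_order=True, forwarding=False)
in_order_fwd = total_cycles(PROGRAM, in_order=True, forwarding=True)
out_of_order_fwd = total_cycles(PROGRAM, in_order=False, forwarding=True)
```

Under these assumptions the three calls return 31, 25, and 20 cycles. The only difference between the last two calls is the `in_order` constraint, which isolates the benefit of letting the independent ADDs start while the first ADD waits for the MUL.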

38
How It Works

39
Our First OoO Machine Simulation
Program We Will Simulate
MUL R1, R2 → R3
ADD R3, R4 → R5
ADD R2, R6 → R7
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5

Initially:
1. Reservation Stations (RS’s) are all Invalid (Empty)
2. All Registers are Valid

Register Alias Table
Register  Valid  Tag  Value
R1        1           1
R2        1           2
R3        1           3
R4        1           4
R5        1           5
R6        1           6
R7        1           7
R8        1           8
R9        1           9
R10       1           10
R11       1           11

RS for ADD Unit: entries a, b, c, d (all empty)
RS for MUL Unit: entries x, y, z, t (all empty)
Each RS entry holds Source 1 (V, Tag, Value) and Source 2 (V, Tag, Value).

ADD (+) and MUL (∗) Execution Units have separate Tag & Value buses
40
Cycle 0
Cycle

MUL R1, R2 → R3
ADD R3, R4 → R5
ADD R2, R6 → R7
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a x
R3 1 3
b y
R4 1 4
c z
R5 1 5
d t
R6 1 6
R7 1 7
R8 1 8
R9 1 9 + ∗
R10 1 10
R11 1 11

41
Cycle 1
Cycle 1

MUL R1, R2 → R3 F
ADD R3, R4 → R5
ADD R2, R6 → R7
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a x
R3 1 3
b y
R4 1 4
c z
R5 1 5
d t
R6 1 6
R7 1 7
R8 1 8
R9 1 9 + ∗
R10 1 10
R11 1 11

42
Cycle 2
MUL gets decoded and allocated into RS x
Step 1: Check if reservation station available. Yes: x
Step 2: Access the Register Alias Table
Step 3: Put source registers into reservation station x
Step 4: Rename destination register R3 → x
R3 is now renamed to x. Its new value will be produced by the reservation station that is identified by tag x.

Cycle               1   2
MUL R1, R2 → R3     F   D
ADD R3, R4 → R5         F
ADD R2, R6 → R7
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5

Register  Valid  Tag  Value
R1        1           1
R2        1           2
R3        0      x         (was valid with value 3)
R4        1           4
R5        1           5
R6        1           6
R7        1           7
R8        1           8
R9        1           9
R10       1           10
R11       1           11

RS for ADD Unit: entries a, b, c, d are empty
RS for MUL Unit
Entry  Source 1 (V, Tag, Value)  Source 2 (V, Tag, Value)
x      1, ~, 1                   1, ~, 2
y, z, t are empty

MUL in RS x is ready to execute in the next cycle!

43
Cycle 3
1. MUL in RS x starts executing
Check readiness (Both sources ready?) → Wakeup
Ready → Dispatch the instruction to the MUL unit
2. ADD gets decoded and allocated into RS a
Same Steps 1-4 for ADD… Rename R5 → a

Cycle               1   2   3
MUL R1, R2 → R3     F   D   E1
ADD R3, R4 → R5         F   D
ADD R2, R6 → R7             F
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5

Register  Valid  Tag  Value
R1        1           1
R2        1           2
R3        0      x
R4        1           4
R5        0      a         (was valid with value 5)
R6        1           6
R7        1           7
R8        1           8
R9        1           9
R10       1           10
R11       1           11

RS for ADD Unit
Entry  Source 1 (V, Tag, Value)  Source 2 (V, Tag, Value)
a      0, x,                     1, ~, 4
b, c, d are empty

RS for MUL Unit (MUL unit latency: 6 cycles)
Entry  Source 1 (V, Tag, Value)  Source 2 (V, Tag, Value)
x      1, ~, 1                   1, ~, 2
y, z, t are empty

ADD in RS a cannot execute in the next cycle: one source is not valid
44
Cycle 4
ADD in RS a waits because one source is not valid. Rename R7 → b
Cycle 1 2 3 4
MUL R1, R2 → R3 F D E1 E2
ADD R3, R4 → R5 F D -
ADD R2, R6 → R7 F D
ADD R8, R9 → R10 F
MUL R7, R10 → R11
ADD R5, R11 → R5

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 0 x 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 0 x
b ~ ~ y
R4 1 4
c z
R5 0 a
d t
R6 1 6
R7 0
1 b 7
R8 1 8
R9 1 9 + ∗
R10 1 10
R11 1 11
ADD in RS b is ready to execute in the next cycle!
It will be executed out of order in the next cycle. 45
Cycle 5
Cycle 1 2 3 4 5

MUL R1, R2 → R3 F D E1 E2 E3
ADD R3, R4 → R5 F D - -
ADD R2, R6 → R7 F D E1
ADD R8, R9 → R10 F D
MUL R7, R10 → R11 F
ADD R5, R11 → R5

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 0 x 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 0 x
b 1 ~ 2 1 ~ 6 y
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 a
d t
R6 1 6
R7 0 b
R8 1 8
R9 1 9 + 4
Cycles

R10 0
1 c 10
R11 1 11
ADD in RS c is ready to execute in the next cycle!

46
Cycle 6
Cycle 1 2 3 4 5 6

MUL R1, R2 → R3 F D E1 E2 E3 E4
ADD R3, R4 → R5 F D - - -
ADD R2, R6 → R7 F D E1 E2
ADD R8, R9 → R10 F D E1
MUL R7, R10 → R11 F D
ADD R5, R11 → R5 F

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 0 x 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 0 x
b 1 ~ 2 1 ~ 6 y 0 b 0 c
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 a
d t
R6 1 6
R7 0 b
R8 1 8
R9 1 9 + ∗
R10 0 c
R11 0
1 y 11

47
Cycle 7
All six instructions are now decoded and renamed
Note what happened to R5: Renamed twice!

Cycle 1 2 3 4 5 6 7

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5
ADD R3, R4 → R5 F D - - - -
ADD R2, R6 → R7 F D E1 E2 E3
ADD R8, R9 → R10 F D E1 E2
MUL R7, R10 → R11 F D -
ADD R5, R11 → R5 F D

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 0 x 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 0 x
b 1 ~ 2 1 ~ 6 y 0 b 0 c
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 a
d
d 0 a 0 y t
R6 1 6
R7 0 b
R8 1 8
R9 1 9 + ∗
R10 0 c
R11 0 y

48
Cycle 8 (First Slide)
MUL in RS x is done
Broadcast MUL’s tag (x)
✓ Check tag
✓ Check for invalidity
Broadcast MUL’s result (2)

Cycle               1   2   3   4   5   6   7   8
MUL R1, R2 → R3     F   D   E1  E2  E3  E4  E5  E6
ADD R3, R4 → R5         F   D   -   -   -   -
ADD R2, R6 → R7             F   D   E1  E2  E3
ADD R8, R9 → R10                F   D   E1  E2
MUL R7, R10 → R11                   F   D   -
ADD R5, R11 → R5                        F   D

Register  Valid  Tag  Value
R1        1           1
R2        1           2
R3        1           2    (was invalid with tag x; captures the broadcast value)
R4        1           4
R5        0      d
R6        1           6
R7        0      b
R8        1           8
R9        1           9
R10       0      c
R11       0      y

RS for ADD Unit
Entry  Source 1 (V, Tag, Value)  Source 2 (V, Tag, Value)
a      1, ~, 2                   1, ~, 4    (Source 1 was 0, x; captures the broadcast value)
b      1, ~, 2                   1, ~, 6
c      1, ~, 8                   1, ~, 9
d      0, a,                     0, y,

RS for MUL Unit
Entry  Source 1 (V, Tag, Value)  Source 2 (V, Tag, Value)
x      1, ~, 1                   1, ~, 2
y      0, b,                     0, c,
z, t are empty

Tag & Value bus: x, 2

ADD in RS a is ready to execute in the next cycle!
49
Cycle 8 (Second Slide)
Cycle 1 2 3 4 5 6 7 8 ADD in RS b is also done
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 Broadcast ADD’s tag (b)
ADD R3, R4 → R5 F D - - - - -
F D E1 E2 E3 E4 ✓ Check tag
ADD R2, R6 → R7
✓ Check for invalidity
ADD R8, R9 → R10 F D E1 E2
MUL R7, R10 → R11 F D -
Broadcast ADD’s result (8)
ADD R5, R11 → R5 F D

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 0
1 b 8 0 c
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 0 a 0 y t
R6 1 6
R7 0
1 b 8
R8 1 8
R9 1 9 + ∗
R10 0 c
b 8
R11 0 y
b 8

50
MUL in RS y is still NOT ready to execute in the next cycle!
Cycle 8 (Third Slide)
Cycle 1 2 3 4 5 6 7 8

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6
ADD R3, R4 → R5 F D - - - - -
ADD R2, R6 → R7 F D E1 E2 E3 E4
ADD R8, R9 → R10 F D E1 E2 E3
MUL R7, R10 → R11 F D - -
ADD R5, R11 → R5 F D -

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 0 c
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 0 a 0 y t
R6 1 6
R7 1 b 8
R8 1 8
R9 1 9 + ∗
R10 0 c
R11 0 y

51
Cycle 9
Cycle 1 2 3 4 5 6 7 8 9

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 Broadcast and Update
MUL R7, R10 → R11 F D - - -
ADD R5, R11 → R5 F D - -

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 0
1 c
~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 0 a 0 y t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 0
1 c 17
c 17
R11 0 y
MUL in RS y is ready to execute in the next cycle!

52
Cycle 10
Cycle 1 2 3 4 5 6 7 8 9 10

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1
ADD R5, R11 → R5 F D - - -

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 1 ~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 0 a 0 y t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 1 17
R11 0 y

53
Cycle 11
Cycle 1 2 3 4 5 6 7 8 9 10 11

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2
ADD R5, R11 → R5 F D - - - -

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 1 ~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 0 a 0 y t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 1 17
R11 0 y

54
Cycle 12
Cycle 1 2 3 4 5 6 7 8 9 10 11 12

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 Broadcast and Update
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3
ADD R5, R11 → R5 F D - - - - -

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 1 ~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 0
1 a
~ 6 0 y t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 1 17
a 6
R11 0 y

55
Cycle 13
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4
ADD R5, R11 → R5 F D - - - - - -

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 1 ~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 1 ~ 6 0 y t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 1 17
R11 0 y

56
Cycle 14
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5
ADD R5, R11 → R5 F D - - - - - - -

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 1 ~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 1 ~ 6 0 y t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 1 17
R11 0 y

57
Cycle 15
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
Broadcast and
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 Update
ADD R5, R11 → R5 F D - - - - - - - -

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 1 ~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 1 ~ 6 0
1 y
~ 136 t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 1 17
y 136
R11 0
1 y 136

ADD in RS d is ready to execute in the next cycle!


58
Cycle 16
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 W
ADD R5, R11 → R5 F D - - - - - - - - E1

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 1 ~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 1 ~ 6 1 ~ 136 t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 1 17
R11 1 136

59
Cycle 17
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 W
ADD R5, R11 → R5 F D - - - - - - - - E1 E2

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 1 ~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 1 ~ 6 1 ~ 136 t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 1 17
R11 1 136

60
Cycle 18
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 W
ADD R5, R11 → R5 F D - - - - - - - - E1 E2 E3

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 1 ~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 d
d 1 ~ 6 1 ~ 136 t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 1 17
R11 1 136

61
Cycle 19
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 W
ADD R5, R11 → R5 F D - - - - - - - - E1 E2 E3 E4
Broadcast and Update
Register Valid Tag Value
Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 1 ~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0
1 d 142
d 1 ~ 6 1 ~ 136 t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 1 17
d 142
R11 1 136

62
Cycle 20
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 W
ADD R5, R11 → R5 F D - - - - - - - - E1 E2 E3 E4 W

Register Valid Tag Value


Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 1 ~ 2 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 1 2
b 1 ~ 2 1 ~ 6 y 1 ~ 8 1 ~ 17
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 1 142
d 1 ~ 6 1 ~ 136 t
R6 1 6
R7 1 8
R8 1 8
R9 1 9 + ∗
R10 1 17
R11 1 136

63
Some Questions
◼ What is needed in hardware to perform tag broadcast and value capture?
❑ Wires, Comparators & Logic
→ make a value valid
→ wake up an instruction

◼ Does the tag have to be the ID of the Reservation Station Entry?
❑ No, could be any unique name that enables linking of producer to consumer

◼ What can potentially become the critical path?
❑ Tag broadcast → value capture → instruction wake up

◼ How can you reduce the potential critical paths?
❑ More pipelining and prediction
64
Dataflow Graph for Our Example

MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11

Easy task for you: Draw the dataflow graph for the above code

65
State of RAT and RS in Cycle 7
All 6 instructions are decoded and renamed
Note what happened to R5: Renamed twice!
Cycle 1 2 3 4 5 6 7
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5
ADD R3, R4 → R5 F D - - - -
ADD R2, R6 → R7 F D E1 E2 E3
ADD R8, R9 → R10 F D E1 E2
MUL R7, R10 → R11 F D -
ADD R5, R11 → R5 F D

Register Valid Tag Value RS for ADD Unit RS for MUL Unit
Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 0 x 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 0 x
b 1 ~ 2 1 ~ 6 y 0 b 0 c
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 a
d
d 0 a 0 y t
R6 1 6
R7 0 b
R8 1 8
R9 1 9 + ∗
R10 0 c
R11 0 y

Register Alias Table


66
State of RAT and RS in Cycle 7
Slightly harder tasks for you:
1. Draw the dataflow graph for the executing code
2. Provide the executing code in sequential order

Register Valid Tag Value RS for ADD Unit RS for MUL Unit
Source 1 Source 2 Source 1 Source 2
R1 1 1
V Tag Value V Tag Value V Tag Value V Tag Value
R2 1 2
a 0 x 1 ~ 4 x 1 ~ 1 1 ~ 2
R3 0 x
b 1 ~ 2 1 ~ 6 y 0 b 0 c
R4 1 4
c 1 ~ 8 1 ~ 9 z
R5 0 a
d
d 0 a 0 y t
R6 1 6
R7 0 b
R8 1 8
R9 1 9 + ∗
R10 0 c
R11 0 y

Register Alias Table


67
Corresponding Dataflow Graph (Reverse Engineered)

We can “easily” reverse-engineer the dataflow graph of the executing code!
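The reverse-engineering step can be expressed directly: every pending source tag in a reservation station entry is an edge from the producing entry to the consuming entry. The sketch below encodes the cycle 7 reservation station state from the slides; the dictionary layout and the function name `dataflow_edges` are illustrative assumptions.

```python
# Source tags of each occupied reservation station entry in cycle 7.
# None means the source already holds a value (no pending producer).
rs_sources = {
    "x": (None, None),   # MUL R1, R2 -> R3
    "a": ("x", None),    # ADD R3, R4 -> R5
    "b": (None, None),   # ADD R2, R6 -> R7
    "c": (None, None),   # ADD R8, R9 -> R10
    "y": ("b", "c"),     # MUL R7, R10 -> R11
    "d": ("a", "y"),     # ADD R5, R11 -> R5
}


def dataflow_edges(rs):
    """Each pending source tag is an edge (producer, consumer)."""
    return sorted(
        (tag, entry)
        for entry, tags in rs.items()
        for tag in tags
        if tag is not None
    )


edges = dataflow_edges(rs_sources)
```

Only dependences on still-pending producers appear as edges; a source that has already captured its value no longer records who produced it.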


Some More Questions (Design Choices)
◼ When is a reservation station entry deallocated?

◼ Should the reservation stations be dedicated to each functional unit or global across functional units?
❑ Centralized vs. Distributed: What are the tradeoffs?

◼ Should reservation stations and ROB store data values or should there be a centralized physical register file where all data values are stored?
❑ What are the tradeoffs?

◼ Timing: Exactly when does an instruction broadcast its tag?

◼ Many other design choices for OoO engines


69
Recall: Our Exercise (We Did This!)
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11

Pipeline: F D E W
ADD: F D E E E E W
MUL: F D E E E E E E E E W

◼ Assume ADD (4 cycle execute), MUL (6 cycle execute)
◼ Assume one adder and one multiplier
◼ How many cycles
❑ in a non-pipelined machine: 50 cycles (4*7 + 2*11)
❑ in an in-order-dispatch pipelined machine with imprecise exceptions (no forwarding and forwarding)
❑ in an out-of-order dispatch pipelined machine with imprecise exceptions (forwarding)
70
For You: An Exercise, w/ Precise Exceptions
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11

Pipeline: F D E R W
ADD: F D E E E E R W
MUL: F D E E E E E E E E R W

◼ Assume ADD (4 cycle execute), MUL (6 cycle execute)
◼ Assume one adder and one multiplier
◼ How many cycles
❑ in an in-order-dispatch pipelined machine with reorder buffer (no forwarding and full forwarding)
❑ in an out-of-order dispatch pipelined machine with reorder buffer (full forwarding)

71
Out-of-Order Execution with Precise Exceptions
◼ Idea: Use a reorder buffer to reorder instructions before
committing them to architectural state

◼ An instruction updates the RAT when it completes execution
❑ Also called frontend register file

◼ An instruction updates a separate architectural register file when it retires
❑ i.e., when it is the oldest in the machine and has completed execution
❑ In other words, the architectural register file is always updated in program order

◼ On an exception: flush pipeline, copy architectural register file into frontend register file
72
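The scheme on this slide can be sketched in Python as a toy model, under several assumptions: registers hold plain integers, one dictionary per register file, and a `latest` map standing in for the tag check so that only the youngest in-flight writer of a register updates the frontend copy. The class `Machine` and all of its method names are hypothetical, not part of the lecture.

```python
from collections import deque


class Machine:
    """Toy model: frontend register file + architectural register file + ROB."""

    def __init__(self, regs):
        self.frontend = dict(regs)  # speculative state, updated at completion
        self.arch = dict(regs)      # precise state, updated only at retirement
        self.latest = {}            # register -> youngest in-flight writer (stands in for the tag)
        self.rob = deque()          # program order, oldest entry on the left

    def dispatch(self, dest):
        entry = {"dest": dest, "value": None, "done": False, "exception": False}
        self.rob.append(entry)
        self.latest[dest] = entry
        return entry

    def complete(self, entry, value=None, exception=False):
        entry.update(value=value, done=True, exception=exception)
        # Only the youngest writer of a register may update the frontend copy.
        if not exception and self.latest.get(entry["dest"]) is entry:
            self.frontend[entry["dest"]] = value
            del self.latest[entry["dest"]]

    def retire(self):
        """Retire completed instructions from the ROB head, in program order.

        Returns True if an exception forced a flush."""
        while self.rob and self.rob[0]["done"]:
            entry = self.rob.popleft()
            if entry["exception"]:
                self.rob.clear()
                self.latest.clear()
                self.frontend = dict(self.arch)  # restore precise state
                return True
            self.arch[entry["dest"]] = entry["value"]
        return False


m = Machine({"R1": 1, "R2": 2, "R3": 3})
older = m.dispatch("R3")    # long-latency instruction
younger = m.dispatch("R1")  # independent instruction, finishes first
m.complete(younger, 42)
snapshot = (m.frontend["R1"], m.arch["R1"])  # frontend runs ahead of architectural state
m.retire()                  # nothing retires: the oldest instruction is not done
m.complete(older, exception=True)
flushed = m.retire()        # exception at the ROB head: flush and restore
```

In this run the younger instruction's result is visible in the frontend register file but never reaches the architectural register file, so the flush leaves the machine in the state it had before the excepting instruction.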
Recall: Our Initial OoO Machine

Register Alias Table
Register  Valid  Tag  Value
R1        1           1
R2        1           2
R3        1           3
R4        1           4
R5        1           5
R6        1           6
R7        1           7
R8        1           8
R9        1           9
R10       1           10
R11       1           11

RS for ADD Unit (+): entries a, b, c, d (all empty)
RS for MUL Unit (∗): entries x, y, z, t (all empty)
Each RS entry holds V, Tag, Value for Source 1 and for Source 2.
Each functional unit outputs a Tag and a Value.
Add Arch Reg File & ROB for Precise Exceptions
Reorder Buffer (ROB): Entry 0, Entry 1, Entry 2, ..., Entry 8, ..., Entry 13, Entry 14, Entry 15

Frontend Register File            Architectural Register File
Register  Valid  Tag  Value       Register  Value
R1        1           1           R1        1
R2        1           2           R2        2
R3        1           3           R3        3
R4        1           4           R4        4
R5        1           5           R5        5
R6        1           6           R6        6
R7        1           7           R7        7
R8        1           8           R8        8
R9        1           9           R9        9
R10       1           10          R10       10
R11       1           11          R11       11
OoO Machine with Precise Exceptions
[Figure: the full machine. RS for ADD Unit (entries a, b, c, d) and RS for MUL Unit (entries x, y, z, t), each entry holding V, Tag, Value for Source 1 and Source 2; Reorder Buffer (ROB) with Entry 0 to Entry 15; Frontend Register File (R1 to R11, all valid, values 1 to 11); Architectural Register File (R1 to R11, values 1 to 11).]
Out-of-Order Execution with Precise Exceptions
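[Figure: pipeline diagram. Fetch (F) and Decode (D) proceed in order and feed a SCHEDULE stage; execution is out of order across functional units with different latencies (integer add, integer mul, FP mul, load/store); a REORDER stage (R) then feeds in-order Writeback (W). A TAG and VALUE Broadcast Bus spans the schedule and reorder stages.]

◼ Hump 1: Reservation stations (scheduling window)
◼ Hump 2: Reordering (reorder buffer, aka instruction window
or active window)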
Two Humps in a Modern Pipeline
[Figure: the same pipeline diagram (F, D, SCHEDULE, out-of-order execution in the integer add, integer mul, FP mul, and load/store units, REORDER, R, W, with the TAG and VALUE Broadcast Bus), with the SCHEDULE and REORDER stages drawn as the two humps of a Bactrian camel.]

◼ Hump 1: Reservation stations (scheduling window)
◼ Hump 2: Reordering (reorder buffer, aka instruction window
or active window)

Photo credit: http://true-wildlife.blogspot.ch/2010/10/bactrian-camel.html


One Issue: Value Replication All Over the Place
[Figure: the same machine as before. Data values appear in several places at once: in the Value fields of the RS entries for the ADD and MUL units, in the Reorder Buffer (Entry 0 to Entry 15), in the Frontend Register File (R1 to R11), and in the Architectural Register File (R1 to R11).]
Getting Rid of Replicated Values
Physical Register File (centralized value storage): PR1 to PR22, holding the values 1 to 22 (PRn holds n)
Reorder Buffer (ROB): Entry 0 to Entry 15

Both register maps hold pointers to the PRF:

Register  Frontend Register Map (PR)  Architectural Register Map (PR)
R1        18                          12
R2        13                          2
R3        10                          10
R4        22                          22
R5        14                          5
R6        19                          9
R7        17                          11
R8        20                          20
R9        3                           7
R10       4                           6
R11       1                           1
Modern OoO Execution w/ Precise Exceptions
◼ Most modern processors use the following

◼ Reorder buffer to support in-order retirement of instructions

◼ A single register file (physical RF) to store all registers
❑ Both speculative and architectural registers
❑ INT and FP are still separate

◼ Two register maps store pointers to the physical RF
❑ Future/frontend register map → used for renaming
❑ Architectural register map → used for maintaining precise state

◼ This design avoids value replication in RSs, ROB, etc.
Getting Rid of Replicated Values (I)
[Figure: Frontend Register Map and Architectural Register Map (R1 to R11), each holding pointers into the Physical Register File (PRF; PR1 to PR22, centralized value storage), next to the Reorder Buffer (ROB, Entry 0 to Entry 15).]
Getting Rid of Replicated Values (II)
At Decode/Rename: Allocate DestPR to Dest Reg
At Decode/Rename: Read and Update Frontend Register Map

[Figure: RS for ADD Unit and RS for MUL Unit, entries a, b, c, d each. Every entry holds only V and PR fields for Source 1 and Source 2, with no data values. The + and ∗ functional units each output a DestPR and a Value.]

Before Execution: Access Physical Register File to Get Source Values
After Execution: Access Physical Register File to Write Result Values
At Retirement: Update Architectural Register Map with DestPR
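The four steps on this slide (rename, read sources, write result, retire) can be sketched as follows. This is a minimal illustration, not the design of any particular processor: the class `RenameEngine`, its method names, the identity initial mapping, and the policy of returning the previous architectural PR to the free list at retirement are all assumptions made for the example.

```python
import operator
from collections import deque


class RenameEngine:
    """Toy physical-register-file renaming with two register maps."""

    def __init__(self, arch_regs, num_phys):
        # Assumption: architectural register i initially maps to physical register i.
        self.frontend_map = {r: i for i, r in enumerate(arch_regs)}
        self.arch_map = dict(self.frontend_map)
        self.prf = [0] * num_phys
        self.free = deque(range(len(arch_regs), num_phys))

    def rename(self, dest, srcs):
        # Decode/rename: read the source mappings first, then allocate DestPR
        # and update the frontend register map.
        src_prs = [self.frontend_map[s] for s in srcs]
        dest_pr = self.free.popleft()
        self.frontend_map[dest] = dest_pr
        return dest_pr, src_prs

    def execute(self, dest_pr, src_prs, op):
        # Before execution: read source values from the PRF.
        values = [self.prf[p] for p in src_prs]
        # After execution: write the result value into the PRF.
        self.prf[dest_pr] = op(*values)

    def retire(self, dest, dest_pr):
        # Retirement: point the architectural map at DestPR; the previously
        # architectural PR can be reused (assumed free-list policy).
        old_pr = self.arch_map[dest]
        self.arch_map[dest] = dest_pr
        self.free.append(old_pr)


eng = RenameEngine(["R1", "R2", "R3"], num_phys=6)
eng.prf[0], eng.prf[1] = 3, 4                      # R1 = 3, R2 = 4
dest_pr, src_prs = eng.rename("R3", ["R1", "R2"])  # MUL R3 <- R1, R2
eng.execute(dest_pr, src_prs, operator.mul)
before_retire = (eng.frontend_map["R3"], eng.arch_map["R3"])
eng.retire("R3", dest_pr)
```

Values live only in `prf`; the reservation stations, the ROB, and both maps would carry PR numbers, which is the point of the design.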
An Example from Modern Processors

Boggs et al., “The Microarchitecture of the Pentium 4 Processor,”
Intel Technology Journal, 2001.
Intel Pentium Pro (1995)

Processor chip Level 2 cache chip

Multi-chip module package

By Moshen - http://en.wikipedia.org/wiki/Image:Pentiumpro_moshen.jpg, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=2262471
Intel Pentium 4 (2000)

On-chip Level 2 Cache

https://www.anandtech.com/show/1621/3
Enabling OoO Execution, Revisited
1. Link the consumer of a value to the producer
❑ Register renaming: Associate a “tag” with each data value

2. Buffer instructions until they are ready
❑ Insert instruction into reservation stations after renaming

3. Keep track of readiness of source values of an instruction
❑ Broadcast the “tag” when the value is produced
❑ Instructions compare their “source tags” to the broadcast tag
→ if match, source value becomes ready

4. When all source values of an instruction are ready, dispatch
the instruction to functional unit (FU)
❑ Wakeup and select/schedule the instruction
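Steps 3 and 4 above can be sketched in Python as a tag broadcast followed by an oldest-first select. The data layout (`Source`, `RSEntry`), the function names, and the oldest-first selection policy are assumptions chosen for the illustration; real schedulers do this with parallel comparators rather than loops.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    valid: bool
    tag: Optional[str] = None    # producer's tag while the value is not ready
    value: Optional[int] = None


@dataclass
class RSEntry:
    tag: str                     # tag of the value this instruction will produce
    src1: Source
    src2: Source

    def ready(self):
        return self.src1.valid and self.src2.valid


def broadcast(stations, tag, value):
    """Wakeup: every waiting source compares its tag to the broadcast tag."""
    for entry in stations:
        for src in (entry.src1, entry.src2):
            if not src.valid and src.tag == tag:
                src.valid, src.value = True, value


def select(stations):
    """Select: dispatch the oldest ready entry (list order stands in for age)."""
    for entry in stations:
        if entry.ready():
            stations.remove(entry)
            return entry
    return None


stations = [
    RSEntry("a", Source(False, tag="x"), Source(True, value=4)),  # waits on x
    RSEntry("b", Source(True, value=2), Source(True, value=6)),   # already ready
]
first = select(stations)      # the younger entry b goes first: out-of-order dispatch
broadcast(stations, "x", 2)   # producer x finishes and broadcasts its tag and value
second = select(stations)     # a has been woken up and can now dispatch
```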
Summary of OOO Execution Concepts
◼ Register renaming eliminates false dependences, enables
linking of producer to consumers

◼ Buffering in reservation stations enables the pipeline to
keep moving for independent instructions

◼ Tag broadcast enables communication (of readiness of
produced value) between instructions

◼ Wakeup and select enables out-of-order dispatch
OOO Execution: Restricted Dataflow
◼ An out-of-order engine dynamically builds the dataflow
graph of a piece of the program
❑ which piece?

◼ The dataflow graph is limited to the instruction window
❑ Instruction window: all decoded but not yet retired
instructions

◼ Can we do it for the whole program?
◼ Why would we like to?
◼ In other words, how can we have a large instruction
window?
◼ Can we do it efficiently with Tomasulo’s algorithm?
State of RAT and RS in Cycle 7
Slightly harder tasks for you:
1. Draw the dataflow graph for the executing code
2. Provide the executing code in sequential order

Register Alias Table
Register  Valid  Tag  Value
R1        1           1
R2        1           2
R3        0      x
R4        1           4
R5        0      d
R6        1           6
R7        0      b
R8        1           8
R9        1           9
R10       0      c
R11       0      y

RS for ADD Unit (+)
Entry  Source 1 (V Tag Value)  Source 2 (V Tag Value)
a      0  x                    1  ~  4
b      1  ~  2                 1  ~  6
c      1  ~  8                 1  ~  9
d      0  a                    0  y

RS for MUL Unit (∗)
Entry  Source 1 (V Tag Value)  Source 2 (V Tag Value)
x      1  ~  1                 1  ~  2
y      0  b                    0  c
z      (empty)
t      (empty)
Recall: Reverse Engineered Dataflow Graph

We can “easily” reverse-engineer the dataflow graph of the executing code!
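One way to see why this is "easy": every not-ready source in a reservation station names its producer by tag, so each such field is an edge of the dataflow graph. The sketch below transcribes the RS contents from the "State of RAT and RS in Cycle 7" slide into a hypothetical Python dictionary and extracts the edges; the encoding and the function name `dataflow_edges` are assumptions made for the example.

```python
# Each RS entry: (functional unit, [source 1, source 2]).
# A source is (valid, payload): payload is a value if valid, else the producer's tag.
rs_cycle7 = {
    "a": ("+", [(0, "x"), (1, 4)]),
    "b": ("+", [(1, 2), (1, 6)]),
    "c": ("+", [(1, 8), (1, 9)]),
    "d": ("+", [(0, "a"), (0, "y")]),
    "x": ("*", [(1, 1), (1, 2)]),
    "y": ("*", [(0, "b"), (0, "c")]),
}


def dataflow_edges(rs):
    """Return (producer, consumer) pairs: one edge per source still waiting on a tag."""
    edges = []
    for consumer, (_unit, sources) in rs.items():
        for valid, payload in sources:
            if not valid:
                edges.append((payload, consumer))
    return sorted(edges)


edges = dataflow_edges(rs_cycle7)
print(edges)
# [('a', 'd'), ('b', 'y'), ('c', 'y'), ('x', 'a'), ('y', 'd')]
```

Sources that are already valid carry only a value, so edges to producers that have already broadcast are no longer recoverable from the RS alone.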


Questions to Ponder
◼ Why is OoO execution beneficial?
❑ What if all operations take a single cycle?
❑ Latency tolerance: OoO execution tolerates the latency of
multi-cycle operations by executing independent operations
concurrently

◼ What if an instruction takes 1000 cycles?
❑ How large of an instruction window do we need to continue
decoding?
❑ How many cycles of latency can OoO tolerate?
❑ What limits the latency tolerance scalability of Tomasulo’s
algorithm?
◼ Instruction window size: how many decoded but not yet retired
instructions you can keep in the machine.
General Organization of an OOO Processor

◼ Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.

A Modern OoO Design: Intel Pentium 4

Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.
Intel Pentium 4 Simplified
Mutlu+, “Runahead Execution,” HPCA 2003.
Alpha 21264

Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, March-April 1999.


MIPS R10000

Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, April 1996.
IBM POWER4
◼ Tendler et al., “POWER4 system microarchitecture,” IBM J R&D, 2002.
IBM POWER4
◼ 2 cores, out-of-order execution
◼ 100-entry instruction window in each core
◼ 8-wide instruction fetch, issue, execute
◼ Large, local+global hybrid branch predictor
◼ 1.5MB, 8-way L2 cache
◼ Aggressive stream based prefetching

IBM POWER5
◼ Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE
Micro 2004.
AMD Zen2? (2019)

https://en.wikichip.org/wiki/amd/microarchitectures/zen_2
Apple M1 FireStorm? (2020)

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2
See Backup Slides for:
Handling Out-of-Order Execution
of Loads and Stores
Handling Out-of-Order Execution
of Loads and Stores
Registers versus Memory
◼ So far, we considered mainly registers as part of state

◼ What about memory?

◼ What are the fundamental differences between registers
and memory?
❑ Register dependences known statically – memory
dependences determined dynamically
❑ Register state is small – memory state is large
❑ Register state is not visible to other threads/processors –
memory state is shared between threads/processors (in a
shared memory multiprocessor)
Memory Dependence Handling (I)
◼ Need to obey memory dependences in an out-of-order
machine
❑ and need to do so while providing high performance

◼ Observation and Problem: Memory address is not known until
a load/store executes

◼ Corollary 1: Renaming memory addresses is difficult
◼ Corollary 2: Determining dependence or independence of
loads/stores has to be handled after their (partial) execution
◼ Corollary 3: When a load/store has its address ready, there
may be older/younger stores/loads with unknown addresses
in the machine
Memory Dependence Handling (II)
◼ When do you schedule a load instruction in an OOO engine?
❑ Problem: A younger load can have its address ready before an
older store’s address is known
❑ Known as the memory disambiguation problem or the unknown
address problem

◼ Approaches
❑ Conservative: Stall the load until all previous stores have
computed their addresses (or even retired from the machine)
❑ Aggressive: Assume load is independent of unknown-address
stores and schedule the load right away
❑ Intelligent: Predict (with a more sophisticated predictor) if the
load is dependent on any unknown address store

Handling of Store-Load Dependences
◼ A load’s dependence status is not known until all previous store
addresses are available.

◼ How does the OOO engine detect dependence of a load instruction on a
previous store?
❑ Option 1: Wait until all previous stores committed (no need to check
for address match)
❑ Option 2: Keep a list of pending stores in a store buffer and check
whether load address matches a previous store address

◼ How does the OOO engine treat the scheduling of a load instruction wrt
previous stores?
❑ Option 1: Assume load dependent on all previous stores

❑ Option 2: Assume load independent of all previous stores

❑ Option 3: Predict the dependence of a load on an outstanding store

Memory Disambiguation (I)
◼ Option 1: Assume load is dependent on all previous stores
+ No need for recovery
-- Too conservative: delays independent loads unnecessarily

◼ Option 2: Assume load is independent of all previous stores
+ Simple and can be common case: no delay for independent loads
-- Requires recovery and re-execution of load and dependents on misprediction

◼ Option 3: Predict the dependence of a load on an
outstanding store
+ More accurate. Load-store dependences persist over time
-- Still requires recovery/re-execution on misprediction
❑ Alpha 21264 : Initially assume load independent, delay loads found to be dependent
❑ Moshovos et al., “Dynamic speculation and synchronization of data dependences,”
ISCA 1997.
❑ Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA 1998.
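The three options can be contrasted with a small scheduling check. This is a sketch under stated assumptions: the function `may_issue_load`, its parameters, and the use of a set of load PCs as the "predictor" are invented for illustration. The set loosely mirrors the behavior described above for the Alpha 21264 (assume independent at first, delay loads that were found to be dependent); it is not the store-sets mechanism of Chrysos and Emer.

```python
def may_issue_load(load_pc, older_store_addrs, policy, predicted_dependent=frozenset()):
    """Decide whether a load whose address is ready may be scheduled now.

    older_store_addrs: addresses of older in-flight stores; None means the
    store has not computed its address yet.
    """
    if all(addr is not None for addr in older_store_addrs):
        return True  # nothing unknown: the dependence can be checked exactly
    if policy == "conservative":   # Option 1: dependent on all previous stores
        return False
    if policy == "aggressive":     # Option 2: independent of all previous stores
        return True
    if policy == "predict":        # Option 3: delay only loads predicted dependent
        return load_pc not in predicted_dependent
    raise ValueError(f"unknown policy: {policy}")


stores = [0x100, None]             # one older store still has an unknown address
dependent_loads = set()            # load PCs that caused a misprediction earlier

conservative = may_issue_load(0x40, stores, "conservative")
aggressive = may_issue_load(0x40, stores, "aggressive")
first_try = may_issue_load(0x40, stores, "predict", dependent_loads)

dependent_loads.add(0x40)          # recovery found that the load was dependent
second_try = may_issue_load(0x40, stores, "predict", dependent_loads)
```

Both the aggressive and the predictive policy still need the recovery and re-execution path for the cases where the guess turns out wrong.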

Memory Disambiguation (II)
◼ Chrysos and Emer, “Memory Dependence Prediction Using Store
Sets,” ISCA 1998.

◼ Predicting store-load dependences is important for performance
◼ Simple predictors (based on past history) can achieve most of
the potential performance
Data Forwarding Between Stores and Loads
◼ We cannot update memory out of program order
→ Need to buffer all store and load instructions in instruction window

◼ Even if we know all addresses of past stores when we
generate the address of a load, two questions still remain:
1. How do we check whether or not it is dependent on a store
2. How do we forward data to the load if it is dependent on a store

◼ Modern processors use an LQ (load queue) and an SQ (store queue) for this
❑ Can be combined or separate between loads and stores
❑ A load searches the SQ after it computes its address. Why?
❑ A store searches the LQ after it computes its address. Why?
Out-of-Order Completion of Memory Ops
◼ When a store instruction finishes execution, it writes its
address and data in its reorder buffer entry (or SQ entry)

◼ When a later load instruction generates its address, it:
❑ searches the SQ with its address
❑ accesses memory with its address
❑ receives the value from the youngest older instruction that
wrote to that address (either from ROB or memory)

◼ This is a complicated “search logic” implemented as a
Content Addressable Memory
❑ Content is “memory address” (but also need size and age)
❑ Called store-to-load forwarding logic
Store-Load Forwarding Complexity
◼ Content Addressable Search (based on Load Address)

◼ Range Search (based on Address and Size of both the Load
and earlier Stores)

◼ Age-Based Search (for last written values)

◼ Load data can come from a combination of multiple places
❑ One or more stores in the Store Buffer (SQ)
❑ Memory/cache
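The three searches (by address, by range, by age) can be sketched at byte granularity in Python. The sketch assumes that all older store addresses are already known, models memory as a `bytearray`, and uses a sequence number as the age; `StoreEntry` and `forwarded_load` are hypothetical names. Hardware performs this search in parallel in a CAM, which the loop below only imitates.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    age: int     # program-order sequence number; smaller means older
    addr: int
    size: int
    data: bytes


def forwarded_load(sq, memory, load_age, addr, size):
    """Assemble load data from memory plus all older overlapping stores."""
    result = bytearray(memory[addr:addr + size])   # start from memory/cache
    older = sorted((s for s in sq if s.age < load_age), key=lambda s: s.age)
    for store in older:                            # oldest first, so younger stores win
        lo = max(addr, store.addr)                 # range search: overlapping bytes
        hi = min(addr + size, store.addr + store.size)
        for byte_addr in range(lo, hi):
            result[byte_addr - addr] = store.data[byte_addr - store.addr]
    return bytes(result)


memory = bytearray(16)
memory[4:8] = b"\x01\x02\x03\x04"
sq = [
    StoreEntry(age=1, addr=4, size=2, data=b"\xaa\xbb"),  # older, overlaps bytes 4-5
    StoreEntry(age=3, addr=5, size=1, data=b"\xcc"),      # older and younger than age 1
    StoreEntry(age=9, addr=6, size=1, data=b"\xff"),      # younger than the load: ignored
]
value = forwarded_load(sq, memory, load_age=5, addr=4, size=4)
print(value.hex())  # aacc0304
```

In this example the four loaded bytes come from three places: one byte from each of the two older stores and two bytes from memory.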
