Onur Ddca 2025 Lecture14 Out of Order Execution Afterlecture

This lecture focuses on Out-of-Order (OoO) Execution, a technique that allows independent instructions to execute while waiting for long-latency operations. It covers the concepts of dynamic instruction scheduling, register renaming, and the use of reservation stations to manage instruction readiness. The lecture also references Tomasulo's Algorithm, which is foundational for modern high-performance processors utilizing OoO execution.

Uploaded by

Minh Phuoc


Digital Design & Computer Arch.

Lecture 14: Out-of-Order Execution

Prof. Onur Mutlu

ETH Zürich
Spring 2025
4 April 2025
Roadmap for Today (and Past Two Weeks)
◼ Prior to last week: Microarchitecture Fundamentals
❑ Single-cycle Microarchitectures
❑ Multi-cycle Microarchitectures
◼ Last week & yesterday: Pipelining
❑ Pipelining
❑ Pipelined Processor Design
◼ Control & Data Dependence Handling
◼ Precise Exceptions: State Maintenance & Recovery
◼ Today: Out-of-Order Execution
❑ Out-of-Order Execution
❑ Issues in OoO Execution: Load-Store Handling, …
[Figure: levels of transformation: Problem, Algorithm, Program/Language, System Software, SW/HW Interface, Micro-architecture, Logic, Devices, Electrons]
2
Readings
◼ This week
❑ Out-of-order execution
❑ H&H, Chapter 7.8-7.9
❑ Smith and Sohi, “The Microarchitecture of Superscalar Processors,”
Proceedings of the IEEE, 1995
◼ More advanced pipelining
◼ Interrupt and exception handling
◼ Out-of-order and superscalar execution concepts
◼ Optional
❑ Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999.

◼ Next Week
❑ McFarling, “Combining Branch Predictors,” DEC WRL Technical
Report, 1993.
3
Review: In-Order Pipeline with Reorder Buffer
◼ Decode (D): Access regfile/ROB, allocate entry in ROB, check if instruction can execute, if so dispatch instruction (i.e., send it to functional unit)
◼ Execute (E): Instructions can complete out-of-order
◼ Completion (R): Write result to reorder buffer
◼ Retirement/Commit (W): Check oldest instruction for exceptions; if none, write result to architectural register file or memory; else, flush pipeline and start from exception handler
◼ In-order dispatch/execution, out-of-order completion, in-order retirement
[Figure: pipeline F, D, then parallel execution units (Integer add: 1 E stage; Integer mul: 4 E stages; FP mul: 8 E stages; Load/store: 8 or more E stages), then R, W]
ROB is implemented as a circular queue in hardware

Recall: Data Dependence Types
Flow dependence: Read-after-Write (RAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
Anti dependence: Write-after-Read (WAR)
  r3 ← r1 op r2
  r1 ← r4 op r5
Output dependence: Write-after-Write (WAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7
5
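The three dependence types can be checked mechanically from the destination and source registers of an older and a younger instruction. The following is a minimal sketch; the `Instr` tuple and the `dependences` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str      # destination register
    srcs: tuple    # source registers


def dependences(older: Instr, younger: Instr) -> set:
    """Return the dependence types from an older to a younger instruction."""
    found = set()
    if older.dest in younger.srcs:
        found.add("RAW")   # flow: younger reads what older writes
    if younger.dest in older.srcs:
        found.add("WAR")   # anti: younger overwrites what older reads
    if younger.dest == older.dest:
        found.add("WAW")   # output: both write the same register
    return found


# The instruction pairs from the slide
flow = dependences(Instr("r3", ("r1", "r2")), Instr("r5", ("r3", "r4")))
anti = dependences(Instr("r3", ("r1", "r2")), Instr("r1", ("r4", "r5")))
output = dependences(Instr("r3", ("r1", "r2")), Instr("r3", ("r6", "r7")))
print(flow, anti, output)
```

Only the RAW case carries a value from producer to consumer; the other two arise from reusing a register name.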
Recall: Register Renaming with a Reorder Buffer
◼ Output and anti dependences are not true dependences
❑ WHY? The same register refers to values that have nothing to do with each other
❑ They exist due to lack of register ID’s (i.e. names) in the ISA
◼ The register ID is renamed to the reorder buffer entry that will hold the register’s value
❑ Register ID → ROB entry ID
❑ Architectural register ID → Physical register ID
❑ After renaming, ROB entry ID used to refer to the register
◼ This eliminates anti and output dependences
❑ Gives the illusion that there are a large number of registers
6
Recall: Reorder Buffer Example
[Figure: Register File (RF) with registers R0–R7 (fields: Value, Tag (pointer to ROB entry), Valid?); Reorder Buffer (ROB) with entries 0–15 and pointers to the oldest and youngest instruction (fields: Entry Valid?, Dest reg ID, Dest reg value, Dest reg written?)]
Initially: all registers are valid in RF & ROB is empty
Simulate:
MUL R1, R2 → R3
MUL R3, R4 → R11
ADD R5, R6 → R3
ADD R3, R8 → R12
7
Recall: Reorder Buffer in Intel Pentium Pro
[Figure from Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.]
A Register Alias Table (RAT) points to where each register’s current value is (or will be)
Out-of-Order Execution
(Dynamic Instruction Scheduling)
An In-order Pipeline
[Figure: pipeline F, D, then parallel execution units (Integer add: 1 E stage; Integer mul: 4 E stages; FP mul: 8 E stages; Cache miss: 8 or more E stages), then R, W]
◼ Dispatch: Act of sending an instruction to a functional unit
◼ Renaming with ROB eliminates stalls due to false dependences
◼ Problem: A non-ready instruction stalls dispatch of younger instructions into functional (execution) units
10
An Example Non-Ready Instruction
[Photo sequence, timestamps 12:55, 12:57, 12:58, 13:00: a long-latency instruction stalls the pipeline; an independent instruction cannot enter the execution unit]
Another View
[Photo]
Stalling Done & Independents Execute
[Photo, timestamp 13:06: independent instruction finally dispatched and executing]
Can We Do Better?

18
How Can We Do Better?
◼ What do the following two pieces of code have in common (with respect to execution in the previous design)?
  MUL R3 ← R1, R2        LD  R3 ← R1 (0)
  ADD R3 ← R3, R1        ADD R3 ← R3, R1
  ADD R4 ← R6, R7        ADD R4 ← R6, R7
  MUL R5 ← R6, R8        MUL R5 ← R6, R8
  ADD R7 ← R9, R9        ADD R7 ← R9, R9
◼ Answer: First ADD stalls the whole pipeline!
❑ ADD cannot dispatch because its source register is unavailable
❑ Later independent instructions cannot get dispatched
◼ How are the above code portions different?
❑ Answer: Load latency is variable (unknown until runtime)
❑ What does this affect? Think compiler vs. microarchitecture
19
Preventing Dispatch Stalls
◼ Problem: in-order dispatch (scheduling, or execution)

◼ Solution: out-of-order dispatch (scheduling, or execution)

◼ Actually, we have seen the basic idea before:

❑ Dataflow: “fire” an instruction only when its inputs are ready
❑ We will use similar principles, but not expose it in the ISA

◼ Aside: Any other way to prevent dispatch stalls?

1. Compile-time instruction scheduling/reordering
2. Value prediction
3. Fine-grained multithreading

20
Out-of-order Execution (Dynamic Scheduling)
◼ Idea: Move the non-ready instructions out of the way of independent ones (such that independent ones can dispatch)
❑ Rest areas for non-ready instructions: Reservation stations
◼ Monitor the source “values” of each instruction in the resting (waiting) area
◼ When all source “values” of an instruction are available, “fire” (i.e., dispatch) the instruction
❑ Instructions dispatched in dataflow (not control-flow) order
◼ Benefit:
❑ Latency tolerance: Allows independent instructions to execute and complete in the presence of a long-latency operation
21
In-order vs. Out-of-order Dispatch
◼ In order dispatch + precise exceptions:
  IMUL R3 ← R1, R2    F D E E E E R W
  ADD  R3 ← R3, R1    F D STALL E R W
  ADD  R1 ← R6, R7    F STALL D E R W
  IMUL R5 ← R6, R8    F D E E E E E R W
  ADD  R7 ← R3, R5    F D STALL E R W
◼ Out-of-order dispatch + precise exceptions:
  IMUL R3 ← R1, R2    F D E E E E R W
  ADD  R3 ← R3, R1    F D WAIT E R W
  ADD  R1 ← R6, R7    F D E R W
  IMUL R5 ← R6, R8    F D E E E E R W
  ADD  R7 ← R3, R5    F D WAIT E R W
◼ 16 vs. 12 cycles
22
Enabling OoO Execution
1. Need to link the consumer of a value to the producer
❑ Register renaming: Associate a “tag” with each data value
2. Need to buffer instructions until they are ready to execute
❑ Insert instruction into reservation stations after renaming
3. Instructions need to keep track of readiness of source values
❑ Broadcast the “tag” when the value is produced
❑ Instructions compare their “source tags” to the broadcast tag
→ if match, source value becomes ready
4. When all source values of an instruction are ready, need to
dispatch the instruction to its functional unit (FU)
❑ Instruction wakes up if all sources are ready
❑ If multiple instructions are awake, need to select one per FU

23
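Steps 3 and 4 above (tag broadcast, wakeup, select) can be sketched in a few lines. This is an illustrative model only: the `RSEntry` class, the `broadcast` and `select` functions, and the oldest-first selection policy are assumptions made for the example, not something the slides prescribe.

```python
from dataclasses import dataclass, field


@dataclass
class RSEntry:
    name: str                                     # reservation station ID
    age: int                                      # smaller = older in program order
    fu: str                                       # functional unit the entry waits for
    waiting_on: set = field(default_factory=set)  # source tags not yet produced

    def ready(self) -> bool:
        return not self.waiting_on


def broadcast(entries, tag):
    """Tag broadcast: every entry compares its source tags to the broadcast tag."""
    for entry in entries:
        entry.waiting_on.discard(tag)


def select(entries):
    """Pick at most one awake entry per functional unit (oldest first, an assumed policy)."""
    chosen = {}
    for entry in sorted(entries, key=lambda e: e.age):
        if entry.ready() and entry.fu not in chosen:
            chosen[entry.fu] = entry.name
    return chosen


entries = [
    RSEntry("a", age=1, fu="ADD", waiting_on={"x"}),
    RSEntry("b", age=2, fu="ADD"),
    RSEntry("c", age=3, fu="ADD"),
    RSEntry("y", age=4, fu="MUL", waiting_on={"b", "c"}),
]
before = select(entries)   # only b and c are awake; b is older
broadcast(entries, "x")    # producer x finishes and broadcasts its tag
after = select(entries)    # a is now awake and is the oldest ADD
print(before, after)
```

In hardware the comparison in `broadcast` is done by one comparator per source tag in every entry, which is why this path tends to be timing-critical.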
Tomasulo’s Algorithm for OoO Execution
◼ OoO with register renaming invented by Robert Tomasulo
❑ Used in IBM 360/91 Floating Point Units
❑ Reading: Tomasulo, “An Efficient Algorithm for Exploiting Multiple
Arithmetic Units,” IBM Journal of R&D, Jan. 1967.

◼ What is the major difference today?

❑ Precise exceptions
❑ Provided by
◼ Patt, Hwu, Shebanow, “HPS, a new microarchitecture: rationale and
introduction,” MICRO 1985.
◼ Patt et al., “Critical issues regarding HPS, a high performance
microarchitecture,” MICRO 1985.

◼ OoO variants are used in most high-performance processors

❑ Initially in Intel Pentium Pro, AMD K5
❑ Alpha 21264, MIPS R10000, IBM POWER5, IBM z196, Oracle UltraSPARC T4, ARM Cortex A15, Apple M1, …
24
Two Humps in a Modern Pipeline
[Figure: F, D (in order), SCHEDULE, parallel execution units (Integer add, Integer mul, FP mul, Load/store; out of order), REORDER, then R, W (in order), with a TAG and VALUE Broadcast Bus]
◼ Hump 1: Reservation stations (scheduling window)
◼ Hump 2: Reordering (reorder buffer, aka instruction window or active window)
25
Two Humps in a Modern Pipeline
[Figure: pipeline diagram with the SCHEDULE and REORDER humps and the TAG and VALUE Broadcast Bus, shown with a photo of a Bactrian camel]
◼ Hump 1: Reservation stations (scheduling window)
◼ Hump 2: Reordering (reorder buffer, aka instruction window or active window)
Photo credit: http://true-wildlife.blogspot.ch/2010/10/bactrian-camel.html

General Organization of an OOO Processor

◼ Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.

27
Tomasulo’s Machine: IBM 360/91
[Figure components: FP registers, load buffers (from memory), store buffers (to memory), input from instruction unit, operation bus, reservation stations, two FP FUs, common data bus]
28
IBM 360/91 in Real World
http://www.columbia.edu/cu/computinghistory/36091.html
IBM 360/91 in Real World
http://www.righto.com/2019/04/iconic-consoles-of-ibm-system360.html
Recall Once More: Register Renaming
◼ Output and anti dependences are not true dependences
❑ WHY? The same register refers to values that have nothing to do with each other
❑ They exist due to lack of register ID’s (i.e. names) in ISA
◼ The register ID is renamed to the reorder buffer entry (or reservation station entry) that will hold the register’s value
❑ Register ID → ROB or RS entry ID
❑ Architectural register ID → Physical register ID
❑ After renaming, ROB or RS entry ID used to refer to the register
◼ This eliminates anti and output dependences
❑ Gives the illusion that there are a large number of registers
❑ Approximates the performance benefit of having more registers
31
Tomasulo’s Algorithm: Renaming
◼ Register rename table (register alias table)

If Valid bit is set, the Value in the table is correct.

Otherwise, Tag specifies where to find the correct value.
Tag is a unique name for the Value to be produced.
32
Recall from Precise Exceptions Lecture
[Figure: Register File (RF) with registers R0–R7 (fields: Value, Tag (pointer to ROB entry), Valid?); Reorder Buffer (ROB) with entries 0–15 and pointers to the oldest and youngest instruction (fields: Entry Valid?, Dest reg ID, Dest reg value, Dest reg written?)]
33
This Lecture
[Figure: Register File (RF) or Register Alias Table (RAT) with registers R0–R7; fields: Value, Tag (pointer to the reservation station entry that will produce the value), Valid?]
We will ignore Reorder Buffer for simplicity
34
Tomasulo’s Algorithm
◼ If reservation station available before renaming
❑ Instruction + renamed operands (source value/tag) inserted into the
reservation station
❑ Only rename if reservation station is available
◼ Else stall
◼ While in reservation station, each instruction:
❑ Watches common data bus (CDB) for tag of its sources
❑ When tag seen, grab value for the source and keep it in the reservation station
❑ When both operands available, instruction ready to be dispatched
◼ Dispatch instruction to the Functional Unit when instruction is ready
◼ After instruction finishes in the Functional Unit
❑ Arbitrate for CDB
❑ Put tagged value onto CDB (tag broadcast)
❑ Register file is connected to the CDB
◼ Register contains a tag indicating the latest writer to the register
◼ If the tag in the register file matches the broadcast tag, write broadcast value
into register (and set valid bit)
❑ Reclaim rename tag
◼ no valid copy of tag in system!

35
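The renaming and tag-broadcast steps above can be sketched as a small functional model. This is a simplification under stated assumptions: it ignores timing, functional unit latencies, CDB arbitration and reservation station reuse (so each unit can hold at most four instructions), and the function name `tomasulo` and its data layout are hypothetical. The tag names a–d and x–t and the initial register values (R1–R11 hold 1–11) follow the simulation slides of this lecture.

```python
import operator

OPS = {"ADD": operator.add, "MUL": operator.mul}


def tomasulo(program, initial_regs):
    """program: list of (op, src1, src2, dest). Returns final register values."""
    # Register alias table: register -> [valid, tag, value]
    rat = {r: [True, None, v] for r, v in initial_regs.items()}
    free_tags = {"ADD": iter("abcd"), "MUL": iter("xyzt")}  # RS names per unit
    stations = {}  # tag -> {"op": ..., "srcs": [[valid, tag, value], ...]}

    # Rename in program order: sources copy the RAT entry (value or tag),
    # then the destination register is renamed to the new tag.
    for op, src1, src2, dest in program:
        tag = next(free_tags[op])  # sketch: entries are never reclaimed
        stations[tag] = {"op": op, "srcs": [list(rat[src1]), list(rat[src2])]}
        rat[dest] = [False, tag, None]

    while stations:
        # Fire any station whose sources are both valid (dataflow order)
        tag = next(t for t, rs in stations.items() if all(s[0] for s in rs["srcs"]))
        rs = stations.pop(tag)
        value = OPS[rs["op"]](rs["srcs"][0][2], rs["srcs"][1][2])
        # Broadcast (tag, value): waiting sources and the RAT compare tags,
        # so only the latest writer of a register updates it.
        waiting = [s for other in stations.values() for s in other["srcs"]]
        for entry in waiting + list(rat.values()):
            if not entry[0] and entry[1] == tag:
                entry[0], entry[2] = True, value
    return {r: entry[2] for r, entry in rat.items()}


PROGRAM = [
    ("MUL", "R1", "R2", "R3"),
    ("ADD", "R3", "R4", "R5"),
    ("ADD", "R2", "R6", "R7"),
    ("ADD", "R8", "R9", "R10"),
    ("MUL", "R7", "R10", "R11"),
    ("ADD", "R5", "R11", "R5"),
]
final = tomasulo(PROGRAM, {f"R{i}": i for i in range(1, 12)})
print(final)
```

Note that R5 is renamed twice (first to a, then to d): the broadcast of tag a feeds the last ADD but does not write R5, because the RAT entry for R5 already holds tag d.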
An Exercise
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11

Pipeline: F D E W
ADD: F D E E E E W
MUL: F D E E E E E E E E W

◼ Assume ADD (4 cycle execute), MUL (6 cycle execute)
◼ Assume one adder and one multiplier
◼ How many cycles
❑ in a non-pipelined machine: 50 cycles (4*7 + 2*11)
❑ in an in-order-dispatch pipelined machine with imprecise exceptions (no forwarding and forwarding)
❑ in an out-of-order dispatch pipelined machine with imprecise exceptions (forwarding)
36
Exercise Continued
in-order-dispatch pipelined machine
w/o forwarding: 31 cycles

in-order-dispatch pipelined machine

w/ forwarding: 25 cycles

37
Exercise Continued
MUL R3  R1, R2
ADD R5  R3, R4
ADD R7  R2, R6
ADD R10  R8, R9
MUL R11  R7, R10
ADD R5  R5, R11

out-of-order dispatch pipelined machine

w/ forwarding: 20 cycles

38
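The 20-cycle figure for out-of-order dispatch can be reproduced with a short calculation. The sketch below rests on several assumptions that are not all spelled out on the slide: one instruction is fetched per cycle, it is decoded in the following cycle, a dependent instruction starts executing in the cycle after its producer's last execute cycle (forwarding), the functional units are pipelined, and no two instructions compete for the same unit or run out of reservation stations (true for this program). The function name `ooo_schedule` is hypothetical.

```python
LATENCY = {"ADD": 4, "MUL": 6}   # execute cycles, from the exercise

PROGRAM = [                      # (op, src1, src2, dest)
    ("MUL", "R1", "R2", "R3"),
    ("ADD", "R3", "R4", "R5"),
    ("ADD", "R2", "R6", "R7"),
    ("ADD", "R8", "R9", "R10"),
    ("MUL", "R7", "R10", "R11"),
    ("ADD", "R5", "R11", "R5"),
]


def ooo_schedule(program):
    """Return (first execute cycle, writeback cycle) for each instruction."""
    last_writer = {}   # register -> index of its latest producer
    finish = []        # last execute cycle of each instruction
    schedule = []
    for i, (op, src1, src2, dest) in enumerate(program):
        decode = i + 2                 # fetch in cycle i+1, decode in cycle i+2
        start = decode + 1
        for src in (src1, src2):
            if src in last_writer:     # RAW: wait for the producer's result
                start = max(start, finish[last_writer[src]] + 1)
        end = start + LATENCY[op] - 1
        finish.append(end)
        schedule.append((start, end + 1))
        last_writer[dest] = i
    return schedule


schedule = ooo_schedule(PROGRAM)
total_cycles = max(writeback for _, writeback in schedule)
print(schedule, total_cycles)
```

Under these assumptions the last ADD starts executing in cycle 16 and writes back in cycle 20, so the program takes 20 cycles.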
How It Works

39
Our First OoO Machine Simulation
Program We Will Simulate
MUL R1, R2 → R3
ADD R3, R4 → R5
ADD R2, R6 → R7
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5

Initially:
1. Reservation Stations (RS’s) are all Invalid (Empty)
2. All Registers are Valid

Register Alias Table
Register  Valid  Tag  Value
R1        1           1
R2        1           2
R3        1           3
R4        1           4
R5        1           5
R6        1           6
R7        1           7
R8        1           8
R9        1           9
R10       1           10
R11       1           11

RS for ADD Unit (entries a, b, c, d) and RS for MUL Unit (entries x, y, z, t): each entry holds V, Tag, Value for Source 1 and for Source 2; all entries are empty
ADD and MUL Execution Units have separate Tag & Value buses
Cycle 0
Cycle

MUL R1, R2 → R3
ADD R3, R4 → R5
ADD R2, R6 → R7
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5

41
Cycle 1
Cycle 1

MUL R1, R2 → R3 F
ADD R3, R4 → R5
ADD R2, R6 → R7
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5

42
Cycle 2
MUL gets decoded and allocated into RS x
Step 1: Check if reservation station available. Yes: x
Step 2: Access the Register Alias Table
Step 3: Put source registers into reservation station x
Step 4: Rename destination register R3 → x
R3 is now renamed to x. Its new value will be produced by the reservation station that is identified by tag x.

Cycle 1 2

MUL R1, R2 → R3 F D
ADD R3, R4 → R5 F
ADD R2, R6 → R7
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5
43
Cycle 3
1. MUL in RS x starts executing
2. ADD gets decoded and allocated into RS a
Check readiness (Both sources ready?) → Wakeup
Ready → Dispatch the instruction to the MUL unit
Same Steps 1-4 for ADD… Rename R5 → a

Cycle 1 2 3

MUL R1, R2 → R3 F D E1
ADD R3, R4 → R5 F D
ADD R2, R6 → R7 F
ADD R8, R9 → R10
MUL R7, R10 → R11
ADD R5, R11 → R5

Cycle 5
Cycle 1 2 3 4 5

MUL R1, R2 → R3 F D E1 E2 E3
ADD R3, R4 → R5 F D - -
ADD R2, R6 → R7 F D E1
ADD R8, R9 → R10 F D
MUL R7, R10 → R11 F
ADD R5, R11 → R5

46
Cycle 6
Cycle 1 2 3 4 5 6

MUL R1, R2 → R3 F D E1 E2 E3 E4
ADD R3, R4 → R5 F D - - -
ADD R2, R6 → R7 F D E1 E2
ADD R8, R9 → R10 F D E1
MUL R7, R10 → R11 F D
ADD R5, R11 → R5 F

47
Cycle 7
All six instructions are now decoded and renamed
Note what happened to R5: Renamed twice!

Cycle 1 2 3 4 5 6 7

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5
ADD R3, R4 → R5 F D - - - -
ADD R2, R6 → R7 F D E1 E2 E3
ADD R8, R9 → R10 F D E1 E2
MUL R7, R10 → R11 F D -
ADD R5, R11 → R5 F D

48
Cycle 8 (First Slide)
MUL in RS x is done
Broadcast MUL’s tag (x)
✓ Check tag
✓ Check for invalidity
Broadcast MUL’s result (2)
ADD in RS a is ready to execute in the next cycle!

Cycle 1 2 3 4 5 6 7 8

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6
ADD R3, R4 → R5 F D - - - -
ADD R2, R6 → R7 F D E1 E2 E3
ADD R8, R9 → R10 F D E1 E2
MUL R7, R10 → R11 F D -
ADD R5, R11 → R5 F D

49
Cycle 8 (Second Slide)
ADD in RS b is also done
Broadcast ADD’s tag (b)
✓ Check tag
✓ Check for invalidity
Broadcast ADD’s result (8)
MUL in RS y is still NOT ready to execute in the next cycle!

Cycle 1 2 3 4 5 6 7 8

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6
ADD R3, R4 → R5 F D - - - - -
ADD R2, R6 → R7 F D E1 E2 E3 E4
ADD R8, R9 → R10 F D E1 E2
MUL R7, R10 → R11 F D -
ADD R5, R11 → R5 F D

50
Cycle 8 (Third Slide)
Cycle 1 2 3 4 5 6 7 8

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6
ADD R3, R4 → R5 F D - - - - -
ADD R2, R6 → R7 F D E1 E2 E3 E4
ADD R8, R9 → R10 F D E1 E2 E3
MUL R7, R10 → R11 F D - -
ADD R5, R11 → R5 F D -
51
Cycle 9
Cycle 1 2 3 4 5 6 7 8 9

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 (Broadcast and Update)
MUL R7, R10 → R11 F D - - -
ADD R5, R11 → R5 F D - -

52
Cycle 10
Cycle 1 2 3 4 5 6 7 8 9 10

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1
ADD R5, R11 → R5 F D - - -

53
Cycle 11
Cycle 1 2 3 4 5 6 7 8 9 10 11

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2
ADD R5, R11 → R5 F D - - - -


54
Cycle 12
Cycle 1 2 3 4 5 6 7 8 9 10 11 12

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 (Broadcast and Update)
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3
ADD R5, R11 → R5 F D - - - - -

55
Cycle 13
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4
ADD R5, R11 → R5 F D - - - - - -

56
Cycle 14
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5
ADD R5, R11 → R5 F D - - - - - - -


57
Cycle 15
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 (Broadcast and Update)
ADD R5, R11 → R5 F D - - - - - - - -

ADD in RS d is ready to execute in the next cycle!

58
Cycle 16
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

59
Cycle 17
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17


60
Cycle 18
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18


61
Cycle 19
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

MUL R1, R2 → R3 F D E1 E2 E3 E4 E5 E6 W
ADD R3, R4 → R5 F D - - - - - E1 E2 E3 E4 W
ADD R2, R6 → R7 F D E1 E2 E3 E4 W
ADD R8, R9 → R10 F D E1 E2 E3 E4 W
MUL R7, R10 → R11 F D - - - E1 E2 E3 E4 E5 E6 W
ADD R5, R11 → R5 F D - - - - - - - - E1 E2 E3 E4 (Broadcast and Update)

ADD unit output: Tag d, Value 142

Register Alias Table
Register  Valid   Tag  Value
R1        1            1
R2        1            2
R3        1            2
R4        1            4
R5        0 → 1   d    142
R6        1            6
R7        1            8
R8        1            8
R9        1            9
R10       1            17
R11       1            136

RS for ADD Unit
Entry  Source 1 (V Tag Value)  Source 2 (V Tag Value)
a      1 ~ 2                   1 ~ 4
b      1 ~ 2                   1 ~ 6
c      1 ~ 8                   1 ~ 9
d      1 ~ 6                   1 ~ 136

RS for MUL Unit
Entry  Source 1 (V Tag Value)  Source 2 (V Tag Value)
x      1 ~ 1                   1 ~ 2
y      1 ~ 8                   1 ~ 17
z      (empty)
t      (empty)
62
Cycle 20
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

63
Some Questions
◼ What is needed in hardware to perform tag broadcast and value capture?
Wires, Comparators & Logic
→ make a value valid
→ wake up an instruction
◼ Does the tag have to be the ID of the Reservation Station Entry?
No, could be any unique name that enables linking of producer to consumer
◼ What can potentially become the critical path?
❑ Tag broadcast → value capture → instruction wake up
◼ How can you reduce the potential critical paths?
❑ More pipelining and prediction
64
Dataflow Graph for Our Example

MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11

Easy task for you: Draw the dataflow graph for the above code

65
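One way to check a hand-drawn dataflow graph is to derive its edges from the code: each source register that was written by an earlier instruction gives an edge from that producer to the consumer. The sketch below does this; the function name `dataflow_edges` and the tuple layout are hypothetical choices for the example.

```python
PROGRAM = [                      # (op, dest, src1, src2), as written on the slide
    ("MUL", "R3", "R1", "R2"),
    ("ADD", "R5", "R3", "R4"),
    ("ADD", "R7", "R2", "R6"),
    ("ADD", "R10", "R8", "R9"),
    ("MUL", "R11", "R7", "R10"),
    ("ADD", "R5", "R5", "R11"),
]


def dataflow_edges(program):
    """Return (producer index, consumer index, register) for every RAW dependence."""
    last_writer = {}   # register -> index of the most recent instruction writing it
    edges = []
    for i, (_, dest, src1, src2) in enumerate(program):
        for src in dict.fromkeys((src1, src2)):   # de-duplicate, keep order
            if src in last_writer:
                edges.append((last_writer[src], i, src))
        last_writer[dest] = i                      # update after reading sources
    return edges


edges = dataflow_edges(PROGRAM)
for producer, consumer, reg in edges:
    print(f"I{producer} --{reg}--> I{consumer}")
```

Registers that no instruction in the program writes (R1, R2, R4, R6, R8, R9) are the graph's external inputs.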
State of RAT and RS in Cycle 7
All 6 instructions are decoded and renamed
Note what happened to R5: Renamed twice!

Cycle 1 2 3 4 5 6 7
MUL R1, R2 → R3 F D E1 E2 E3 E4 E5
ADD R3, R4 → R5 F D - - - -
ADD R2, R6 → R7 F D E1 E2 E3
ADD R8, R9 → R10 F D E1 E2
MUL R7, R10 → R11 F D -
ADD R5, R11 → R5 F D

Register Alias Table
Register  Valid  Tag    Value
R1        1             1
R2        1             2
R3        0      x
R4        1             4
R5        0      a → d
R6        1             6
R7        0      b
R8        1             8
R9        1             9
R10       0      c
R11       0      y

RS for ADD Unit
Entry  Source 1 (V Tag Value)  Source 2 (V Tag Value)
a      0 x                     1 ~ 4
b      1 ~ 2                   1 ~ 6
c      1 ~ 8                   1 ~ 9
d      0 a                     0 y

RS for MUL Unit
Entry  Source 1 (V Tag Value)  Source 2 (V Tag Value)
x      1 ~ 1                   1 ~ 2
y      0 b                     0 c
z      (empty)
t      (empty)
66
State of RAT and RS in Cycle 7
Slightly harder tasks for you:
1. Draw the dataflow graph for the executing code
2. Provide the executing code in sequential order

Register Alias Table

67
Corresponding Dataflow Graph (Reverse Engineered)

We can “easily” reverse-engineer the dataflow graph of the executing code!

Some More Questions (Design Choices)
◼ When is a reservation station entry deallocated?
◼ Should the reservation stations be dedicated to each functional unit or global across functional units?
❑ Centralized vs. Distributed: What are the tradeoffs?
◼ Should reservation stations and ROB store data values or should there be a centralized physical register file where all data values are stored?
❑ What are the tradeoffs?
◼ Timing: Exactly when does an instruction broadcast its tag?
◼ Many other design choices for OoO engines
69
Recall: Our Exercise (We Did This!)
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11

Pipeline: F D E W
ADD: F D E E E E W
MUL: F D E E E E E E E E W

◼ Assume ADD (4 cycle execute), MUL (6 cycle execute)
◼ Assume one adder and one multiplier
◼ How many cycles
❑ in a non-pipelined machine: 50 cycles (4*7 + 2*11)
❑ in an in-order-dispatch pipelined machine with imprecise exceptions (no forwarding and forwarding)
❑ in an out-of-order dispatch pipelined machine with imprecise exceptions (forwarding)
70
For You: An Exercise, w/ Precise Exceptions
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11

Pipeline: F D E R W
ADD: F D E E E E R W
MUL: F D E E E E E E E E R W

◼ Assume ADD (4 cycle execute), MUL (6 cycle execute)
◼ Assume one adder and one multiplier
◼ How many cycles
❑ in an in-order-dispatch pipelined machine with reorder buffer (no forwarding and full forwarding)
❑ in an out-of-order dispatch pipelined machine with reorder buffer (full forwarding)

71
Out-of-Order Execution with Precise Exceptions
◼ Idea: Use a reorder buffer to reorder instructions before committing them to architectural state
◼ An instruction updates the RAT when it completes execution
❑ Also called frontend register file
◼ An instruction updates a separate architectural register file when it retires
❑ i.e., when it is the oldest in the machine and has completed execution
❑ In other words, the architectural register file is always updated in program order
◼ On an exception: flush pipeline, copy architectural register file into frontend register file
72
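The split between a frontend register file (updated at completion) and an architectural register file (updated at retirement) can be sketched as follows. This is an illustrative model only: the `Machine` class and its method names are hypothetical, reservation stations and execution are left out, and instructions are completed by hand to show the ordering.

```python
class Machine:
    def __init__(self, regs):
        self.arch = dict(regs)       # architectural register file: updated at retirement
        self.frontend = dict(regs)   # frontend register file: updated at completion
        self.latest = {}             # register -> ROB entry of its latest writer
        self.rob = []                # in-flight instructions, oldest first

    def dispatch(self, dest):
        entry = {"dest": dest, "value": None, "done": False, "exception": False}
        self.rob.append(entry)
        self.latest[dest] = entry
        return entry

    def complete(self, entry, value=None, exception=False):
        """Execution finished, possibly out of program order."""
        entry.update(value=value, done=True, exception=exception)
        if not exception and self.latest.get(entry["dest"]) is entry:
            self.frontend[entry["dest"]] = value

    def retire(self):
        """Retire completed instructions from the ROB head, in program order."""
        while self.rob and self.rob[0]["done"]:
            head = self.rob.pop(0)
            if head["exception"]:
                self.rob.clear()                 # flush younger instructions
                self.latest.clear()
                self.frontend = dict(self.arch)  # restore the precise state
                return head
            self.arch[head["dest"]] = head["value"]
        return None


m = Machine({"R1": 1, "R2": 2, "R3": 3})
older = m.dispatch("R3")           # long-latency instruction
younger = m.dispatch("R1")         # independent, younger instruction
m.complete(younger, 10)            # completes first: frontend sees R1 = 10
speculative_r1 = m.frontend["R1"]
m.retire()                         # nothing retires: the oldest is not done
m.complete(older, exception=True)  # the older instruction raises an exception
faulting = m.retire()              # flush, copy architectural RF into frontend RF
print(speculative_r1, m.frontend, m.arch)
```

After the flush, the younger instruction's result (R1 = 10) is gone from the frontend register file, which is exactly what makes the exception precise.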
Recall: Our Initial OoO Machine
[Figure: Register Alias Table with all registers valid (R3–R11 shown with values 3–11); RS entries a, b, c, d for the ADD unit (+) and x, y, z, t for the MUL unit (∗), all empty; each unit has its own Tag and Value output]
73
Add Arch Reg File & ROB for Precise Exceptions
Reorder Buffer (ROB): Entry 0 to Entry 15

          Frontend Register File    Architectural Register File
Register  Valid  Tag  Value         Value
R1        1           1             1
R2        1           2             2
R3        1           3             3
R4        1           4             4
R5        1           5             5
R6        1           6             6
R7        1           7             7
R8        1           8             8
R9        1           9             9
R10       1           10            10
R11       1           11            11
74
OoO Machine with Precise Exceptions
[Figure: RS for ADD Unit (entries a–d) and RS for MUL Unit (entries x, y, z, t), each entry with V, Tag, Value for Source 1 and Source 2, all empty; Reorder Buffer (ROB) with entries 0–15; Frontend Register File (Valid, Tag, Value) and Architectural Register File (Value), both holding R1–R11 = 1–11 with all registers valid]
75
Out-of-Order Execution with Precise Exceptions
[Figure: F, D (in order), SCHEDULE, parallel execution units (Integer add, Integer mul, FP mul, Load/store; out of order), REORDER, then R, W (in order), with a TAG and VALUE Broadcast Bus]
◼ Hump 1: Reservation stations (scheduling window)
◼ Hump 2: Reordering (reorder buffer, aka instruction window or active window)
76

One Issue: Value Replication All Over the Place
[Figure: RS for ADD Unit (entries a–d) and RS for MUL Unit (entries x, y, z, t) with Value fields for Source 1 and Source 2; Reorder Buffer (ROB) with entries 0–15; Frontend Register File (Valid, Tag, Value) and Architectural Register File (Value), both holding R1–R11 = 1–11]
78
Getting Rid of Replicated Values
[Figure: Reorder Buffer (ROB) with entries 0–15; Physical Register File as centralized value storage; Frontend Register Map and Architectural Register Map hold pointers to the PRF]

Physical Register File: PR1–PR22 hold the values 1–22 (PRn = n)

Register  Frontend Register Map (PR)  Architectural Register Map (PR)
R1        18                          12
R2        13                          2
R3        10                          10
R4        22                          22
R5        14                          5
R6        19                          9
R7        17                          11
R8        20                          20
R9        3                           7
R10       4                           6
R11       1                           1
Modern OoO Execution w/ Precise Exceptions
◼ Most modern processors use the following
◼ Reorder buffer to support in-order retirement of instructions
◼ A single register file (physical RF) to store all registers
❑ Both speculative and architectural registers
❑ INT and FP are still separate
◼ Two register maps store pointers to the physical RF
❑ Future/frontend register map → used for renaming
❑ Architectural register map → used for maintaining precise state
◼ This design avoids value replication in RSs, ROB, etc.
80
Getting Rid of Replicated Values (I)
[Figure: Reorder Buffer (ROB) with entries 0–15; Physical Register File (PRF) PR1–PR22 as centralized value storage; Frontend Register Map and Architectural Register Map for R1–R11 hold pointers to the PRF]
Getting Rid of Replicated Values (II)
At Decode/Rename: Allocate DestPR to Dest Reg
At Decode/Rename: Read and Update Frontend Register Map

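[Figure: RS for ADD Unit and RS for MUL Unit, entries a–d each; every entry holds V and PR (a pointer into the physical register file) for Source 1 and Source 2]

Before Execution: Access Physical Register File to Get Source Values

[Figure: ADD (+) and MUL (∗) units, each producing a DestPR and a Value]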

After Execution: Access Physical Register File to Write Result Values

At Retirement : Update Architectural Register Map with DestPR
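The four steps above can be sketched with a physical register file and two maps. This is a minimal model under assumptions: the class name `RenameMachine`, the free list, and the policy of freeing the previously mapped physical register at retirement are illustrative choices that the slide does not specify.

```python
class RenameMachine:
    def __init__(self, num_arch, num_phys):
        self.prf = [0] * num_phys                      # centralized value storage
        self.frontend_map = {f"R{i}": i for i in range(num_arch)}
        self.arch_map = dict(self.frontend_map)
        self.free = list(range(num_arch, num_phys))    # unallocated physical registers

    def rename(self, dest, srcs):
        """Decode/rename: read source mappings, then allocate a DestPR."""
        src_prs = [self.frontend_map[s] for s in srcs]
        dest_pr = self.free.pop(0)
        self.frontend_map[dest] = dest_pr
        return dest_pr, src_prs

    def execute(self, dest_pr, src_prs, fn):
        """Read source values from the PRF, write the result to DestPR."""
        self.prf[dest_pr] = fn(*(self.prf[p] for p in src_prs))

    def retire(self, dest, dest_pr):
        """Retirement: point the architectural map at DestPR.

        Assumed policy: the previously mapped physical register is freed here.
        """
        self.free.append(self.arch_map[dest])
        self.arch_map[dest] = dest_pr


m = RenameMachine(num_arch=4, num_phys=8)
m.prf[1], m.prf[2] = 1, 2                          # R1 = 1, R2 = 2
pr_a, srcs_a = m.rename("R3", ["R1", "R2"])        # R3 <- R1 * R2
pr_b, srcs_b = m.rename("R3", ["R3", "R1"])        # R3 <- R3 + R1 (reads the new R3)
m.execute(pr_a, srcs_a, lambda a, b: a * b)
m.execute(pr_b, srcs_b, lambda a, b: a + b)
m.retire("R3", pr_a)
m.retire("R3", pr_b)
print(pr_a, pr_b, m.prf[m.arch_map["R3"]])
```

Each value lives in exactly one place (the PRF); reservation stations, the ROB and both maps only carry physical register numbers.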
An Example from Modern Processors
[Figure from Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.]
Intel Pentium Pro (1995)
[Photo: multi-chip module package with processor chip and Level 2 cache chip]
By Moshen - http://en.wikipedia.org/wiki/Image:Pentiumpro_moshen.jpg, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=2262471

84
Intel Pentium 4 (2000)
[Photo: die with on-chip Level 2 cache]
https://www.anandtech.com/show/1621/3
Enabling OoO Execution, Revisited
1. Link the consumer of a value to the producer
❑ Register renaming: Associate a “tag” with each data value
2. Buffer instructions until they are ready
❑ Insert instruction into reservation stations after renaming
3. Keep track of readiness of source values of an instruction
❑ Broadcast the “tag” when the value is produced
❑ Instructions compare their “source tags” to the broadcast tag → if match, source value becomes ready
4. When all source values of an instruction are ready, dispatch the instruction to functional unit (FU)
❑ Wakeup and select/schedule the instruction
86
Summary of OOO Execution Concepts
◼ Register renaming eliminates false dependences, enables linking of producer to consumers
◼ Buffering in reservation stations enables the pipeline to move for independent instructions
◼ Tag broadcast enables communication (of readiness of produced value) between instructions
◼ Wakeup and select enables out-of-order dispatch
87
OOO Execution: Restricted Dataflow
◼ An out-of-order engine dynamically builds the dataflow
graph of a piece of the program
❑ which piece?

◼ The dataflow graph is limited to the instruction window

❑ Instruction window: all decoded but not yet retired
instructions

◼ Can we do it for the whole program?

◼ Why would we like to?
◼ In other words, how can we have a large instruction
window?
◼ Can we do it efficiently with Tomasulo’s algorithm?
88
State of RAT and RS in Cycle 7
Slightly harder tasks for you:
1. Draw the dataflow graph for the executing code
2. Provide the executing code in sequential order

Register Alias Table

89
Recall: Reverse Engineered Dataflow Graph

We can “easily” reverse-engineer the dataflow graph of the executing code!

Questions to Ponder
◼ Why is OoO execution beneficial?
❑ What if all operations take a single cycle?
❑ Latency tolerance: OoO execution tolerates the latency of
multi-cycle operations by executing independent operations
concurrently

◼ What if an instruction takes 1000 cycles?

❑ How large of an instruction window do we need to continue
decoding?
❑ How many cycles of latency can OoO tolerate?
❑ What limits the latency tolerance scalability of Tomasulo’s
algorithm?
◼ Instruction window size: how many decoded but not yet retired
instructions you can keep in the machine.
91
General Organization of an OOO Processor

◼ Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.

92
A Modern OoO Design: Intel Pentium 4

93
Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.
Intel Pentium 4 Simplified
Mutlu+, “Runahead Execution,”
HPCA 2003.

94
Alpha 21264

Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, March-April 1999. 95

MIPS R10000

Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, April 1996 96
IBM POWER4
◼ Tendler et al., “POWER4 system microarchitecture,” IBM J R&D, 2002.
97
IBM POWER4
◼ 2 cores, out-of-order execution
◼ 100-entry instruction window in each core
◼ 8-wide instruction fetch, issue, execute
◼ Large, local+global hybrid branch predictor
◼ 1.5MB, 8-way L2 cache
◼ Aggressive stream based prefetching

98
IBM POWER5
◼ Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004.

99
AMD Zen2? (2019)

https://en.wikichip.org/wiki/amd/microarchitectures/zen_2
100
Apple M1 FireStorm? (2020)

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2
101
See Backup Slides for:
Handling Out-of-Order Execution
of Loads and Stores
Handling Out-of-Order Execution
of Loads and Stores
Registers versus Memory
◼ So far, we considered mainly registers as part of state

◼ What about memory?

◼ What are the fundamental differences between registers
and memory?
❑ Register dependences known statically – memory
dependences determined dynamically
❑ Register state is small – memory state is large
❑ Register state is not visible to other threads/processors –
memory state is shared between threads/processors (in a
shared memory multiprocessor)

105
Memory Dependence Handling (I)
◼ Need to obey memory dependences in an out-of-order
machine
❑ and need to do so while providing high performance

◼ Observation and Problem: Memory address is not known until
a load/store executes

◼ Corollary 1: Renaming memory addresses is difficult

◼ Corollary 2: Determining dependence or independence of
loads/stores has to be handled after their (partial) execution
◼ Corollary 3: When a load/store has its address ready, there
may be older/younger stores/loads with unknown addresses
in the machine
106
Memory Dependence Handling (II)
◼ When do you schedule a load instruction in an OOO engine?
❑ Problem: A younger load can have its address ready before an
older store’s address is known
❑ Known as the memory disambiguation problem or the unknown
address problem

◼ Approaches
❑ Conservative: Stall the load until all previous stores have
computed their addresses (or even retired from the machine)
❑ Aggressive: Assume load is independent of unknown-address
stores and schedule the load right away
❑ Intelligent: Predict (with a more sophisticated predictor) if the
load is dependent on any unknown address store
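The three approaches can be sketched as a single scheduling decision. This is a simplified illustration, not any particular processor's logic: the function name, the policy strings, and the use of `None` for a not-yet-computed store address are assumptions, and the prediction is passed in as a plain flag instead of coming from a real predictor.

```python
def can_schedule_load(older_store_addrs, policy, predicted_dependent=False):
    """Decide whether a load with a ready address may be scheduled now.

    older_store_addrs: addresses of older in-flight stores, None if not yet computed.
    """
    unknown = any(addr is None for addr in older_store_addrs)
    if policy == "conservative":
        return not unknown                 # wait for every older store address
    if policy == "aggressive":
        return True                        # speculate: assume independence
    if policy == "predict":
        # Wait only if there is an unknown-address store AND we predict a dependence.
        return not (unknown and predicted_dependent)
    raise ValueError(f"unknown policy: {policy}")

older = [0x100, None]                      # one older store has no address yet
print(can_schedule_load(older, "conservative"))
print(can_schedule_load(older, "aggressive"))
print(can_schedule_load(older, "predict", predicted_dependent=True))
```

Only the conservative policy is safe without recovery support; the other two need a way to detect a wrong guess and re-execute the load and its dependents.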

107
Handling of Store-Load Dependences
◼ A load’s dependence status is not known until all previous store
addresses are available.

◼ How does the OOO engine detect dependence of a load instruction on a
previous store?
❑ Option 1: Wait until all previous stores have committed (no need to check
for address match)
❑ Option 2: Keep a list of pending stores in a store buffer and check
whether load address matches a previous store address

◼ How does the OOO engine treat the scheduling of a load instruction wrt
previous stores?
❑ Option 1: Assume load dependent on all previous stores

❑ Option 2: Assume load independent of all previous stores

❑ Option 3: Predict the dependence of a load on an outstanding store

108
Memory Disambiguation (I)
◼ Option 1: Assume load is dependent on all previous stores
+ No need for recovery
-- Too conservative: delays independent loads unnecessarily

◼ Option 2: Assume load is independent of all previous stores

+ Simple and can be common case: no delay for independent loads
-- Requires recovery and re-execution of load and dependents on misprediction

◼ Option 3: Predict the dependence of a load on an
outstanding store
+ More accurate. Load store dependences persist over time
-- Still requires recovery/re-execution on misprediction
❑ Alpha 21264: Initially assume load independent, delay loads found to be dependent
❑ Moshovos et al., “Dynamic speculation and synchronization of data dependences,”
ISCA 1997.
❑ Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA 1998.

109
Memory Disambiguation (II)
◼ Chrysos and Emer, “Memory Dependence Prediction Using Store
Sets,” ISCA 1998.

◼ Predicting store-load dependences important for performance

◼ Simple predictors (based on past history) can achieve most of
the potential performance

110
Data Forwarding Between Stores and Loads
◼ We cannot update memory out of program order
→ Need to buffer all store and load instructions in the instruction window

◼ Even if we know all addresses of past stores when we
generate the address of a load, two questions still remain:
1. How do we check whether or not it is dependent on a store
2. How do we forward data to the load if it is dependent on a store

◼ Modern processors use an LQ (load queue) and an SQ (store queue) for this

❑ Can be combined or separate between loads and stores
❑ A load searches the SQ after it computes its address. Why?
❑ A store searches the LQ after it computes its address. Why?
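One common reason for the store-side search, sketched under simplifying assumptions: if a younger load to the same address has already executed speculatively, it may have read stale data and must be re-executed. The `LoadEntry` fields and the function name are hypothetical, ages are program-order sequence numbers (smaller means older), and the sketch ignores access sizes and the case where the load correctly received its data from an intermediate store.

```python
from dataclasses import dataclass

@dataclass
class LoadEntry:
    age: int          # program-order sequence number; smaller = older
    addr: int
    executed: bool    # True if the load has already obtained its value

def find_violating_loads(load_queue, store_age, store_addr):
    """Younger loads that already executed to the address this store writes."""
    return [ld for ld in load_queue
            if ld.executed and ld.age > store_age and ld.addr == store_addr]

lq = [
    LoadEntry(age=1, addr=0x40, executed=True),    # older than the store: fine
    LoadEntry(age=5, addr=0x40, executed=True),    # younger, same address: violation
    LoadEntry(age=6, addr=0x80, executed=True),    # different address: fine
    LoadEntry(age=7, addr=0x40, executed=False),   # not executed yet: will search the SQ
]
violations = find_violating_loads(lq, store_age=3, store_addr=0x40)
print([ld.age for ld in violations])
```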

111
Out-of-Order Completion of Memory Ops
◼ When a store instruction finishes execution, it writes its
address and data in its reorder buffer entry (or SQ entry)

◼ When a later load instruction generates its address, it:

❑ searches the SQ with its address
❑ accesses memory with its address
❑ receives the value from the youngest older instruction that
wrote to that address (either from ROB or memory)

◼ This is a complicated “search logic” implemented as a
Content Addressable Memory
❑ Content is “memory address” (but also need size and age)

❑ Called store-to-load forwarding logic
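The search a load performs can be modeled in software as follows. This is a minimal sketch that matches on exact addresses only (sizes come up on the next slide); the tuple layout `(age, addr, data)`, the function name, and the default value 0 for untouched memory are assumptions made for illustration. Hardware performs the comparison against all entries in parallel, which is why a CAM is needed.

```python
def load_value(store_queue, memory, load_age, load_addr):
    """Return the value a load should see.

    store_queue: list of (age, addr, data) for stores that have executed.
    memory: dict mapping address -> value (committed state).
    """
    # Keep only older stores (smaller age) that wrote the same address.
    older = [s for s in store_queue if s[0] < load_age and s[1] == load_addr]
    if older:
        return max(older, key=lambda s: s[0])[2]   # youngest older store wins
    return memory.get(load_addr, 0)                # no match: value comes from memory

sq = [(1, 0x10, 11), (3, 0x10, 33), (8, 0x10, 88), (2, 0x20, 22)]
memory = {0x10: 5, 0x30: 7}
print(load_value(sq, memory, load_age=5, load_addr=0x10))   # forwarded from the age-3 store
print(load_value(sq, memory, load_age=5, load_addr=0x30))   # read from memory
```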

112
Store-Load Forwarding Complexity
◼ Content Addressable Search (based on Load Address)

◼ Range Search (based on Address and Size of both the Load
and earlier Stores)

◼ Age-Based Search (for last written values)

◼ Load data can come from a combination of multiple places

❑ One or more stores in the Store Buffer (SQ)
❑ Memory/cache
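A byte-granular sketch shows why all three searches are needed at once and how one load can be assembled from several sources. The tuple layout `(age, addr, data_bytes)`, the function name, and the default byte value 0 are assumptions for illustration; real designs often simplify this case, for example by delaying the load when it is only partially covered by a store.

```python
def load_bytes(store_queue, memory, load_age, load_addr, load_size):
    """Assemble a load's data byte by byte.

    store_queue: list of (age, addr, data_bytes) for executed stores.
    memory: dict mapping byte address -> byte value.
    """
    result = []
    for byte_addr in range(load_addr, load_addr + load_size):
        best = None   # (age, byte) of the youngest older store covering this byte
        for age, addr, data in store_queue:
            covers = addr <= byte_addr < addr + len(data)    # range search
            if age < load_age and covers:                    # older stores only
                if best is None or age > best[0]:            # age-based search
                    best = (age, data[byte_addr - addr])
        result.append(best[1] if best else memory.get(byte_addr, 0))
    return bytes(result)

sq = [(1, 0x100, b"\xAA\xBB"), (2, 0x101, b"\xCC")]
memory = {0x102: 0x11, 0x103: 0x22}
data = load_bytes(sq, memory, load_age=5, load_addr=0x100, load_size=4)
print(data.hex())
```

In this example the 4-byte load takes one byte from each of the two stores and the remaining two bytes from memory.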

113
114
