Onur 447 Spring15 Lecture12 Ooo Execution Afterlecture
Computer Architecture
Lecture 12: Out-of-Order Execution
(Dynamic Instruction Scheduling)
• Pipelining
• Out-of-Order Execution
Readings Specifically for Today
• Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995
  ◦ More advanced pipelining
  ◦ Interrupt and exception handling
  ◦ Out-of-order and superscalar execution concepts
Recap of Last Lecture
• Issues with Multi-Cycle Execution
• Exceptions vs. Interrupts
• Precise Exceptions/Interrupts
• Why Do We Want Precise Exceptions?
• How Do We Ensure Precise Exceptions?
  ◦ Reorder buffer
  ◦ History buffer
  ◦ Future register file (best of both worlds)
  ◦ Checkpointing
• Register renaming with a reorder buffer
• How to Handle Exceptions
• How to Handle Branch Mispredictions
• Speed of State Recovery: Recovery and Interrupt Latency
  ◦ Checkpointing
• Registers vs. Memory
Important: Register Renaming with a Reorder Buffer
• Output and anti dependencies are not true dependencies
  ◦ WHY? The same register refers to values that have nothing to do with each other
  ◦ They exist due to the lack of register IDs (i.e., names) in the ISA
• The register ID is renamed to the reorder buffer entry that will hold the register’s value
  ◦ Register ID → ROB entry ID
  ◦ Architectural register ID → Physical register ID
  ◦ After renaming, the ROB entry ID is used to refer to the register
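The renaming step above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the `Renamer` class, the `ROBn` naming, and the omission of completion and retirement are all assumptions made for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    dest: str                      # architectural destination register
    value: Optional[int] = None    # would be filled in at completion (not modeled)


class Renamer:
    """Hypothetical helper: maps each architectural register to the ROB
    entry that will hold its newest value."""

    def __init__(self):
        self.rob = []          # program-order list of in-flight instructions
        self.rename_map = {}   # architectural register -> ROB entry ID

    def rename(self, dest, srcs):
        # Sources use the current mapping: a ROB entry ID if the producer is
        # still in flight, otherwise the architectural register itself.
        renamed_srcs = [
            f"ROB{self.rename_map[s]}" if s in self.rename_map else s
            for s in srcs
        ]
        # The destination always gets a fresh ROB entry, so a later write to
        # the same architectural register gets a different name.
        self.rob.append(RobEntry(dest))
        rob_id = len(self.rob) - 1
        self.rename_map[dest] = rob_id
        return f"ROB{rob_id}", renamed_srcs


r = Renamer()
program = [
    ("R3", ["R1", "R2"]),   # IMUL R3 <- R1, R2
    ("R3", ["R3", "R1"]),   # ADD  R3 <- R3, R1  (output dependence on R3)
    ("R1", ["R6", "R7"]),   # ADD  R1 <- R6, R7  (anti dependence on R1)
]
renamed = [r.rename(dest, srcs) for dest, srcs in program]
for dest, srcs in renamed:
    print(dest, "<-", srcs)
```

After renaming, the two writes to R3 target different ROB entries, and the write to R1 no longer conflicts with the earlier reads of R1; only the true dependence (the second instruction reading `ROB0`) remains.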
Review: Register Renaming Examples
Review: Checkpointing
• When a branch is decoded
  ◦ Make a copy of the future file/map and associate it with the branch
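A rough sketch of the checkpointing idea, under stated assumptions: the class name `CheckpointedMap`, the branch IDs, and the policy of discarding younger checkpoints on a misprediction are illustrative choices, not details taken from the slide.

```python
class CheckpointedMap:
    """Hypothetical model of a future file / register map with per-branch checkpoints."""

    def __init__(self, initial):
        self.current = dict(initial)   # speculative future file / register map
        self.checkpoints = []          # (branch_id, saved copy), oldest branch first

    def decode_branch(self, branch_id):
        # Copy the map and associate the copy with this branch.
        self.checkpoints.append((branch_id, dict(self.current)))

    def write(self, reg, value):
        self.current[reg] = value

    def resolve(self, branch_id, mispredicted):
        index = next(i for i, (b, _) in enumerate(self.checkpoints) if b == branch_id)
        if mispredicted:
            # Restore the saved state and drop this and all younger checkpoints.
            self.current = self.checkpoints[index][1]
            del self.checkpoints[index:]
        else:
            # Correct prediction: the checkpoint is no longer needed.
            del self.checkpoints[index]


m = CheckpointedMap({"R1": 10, "R2": 20})
m.decode_branch("b0")
m.write("R1", 99)          # wrong-path update after b0
m.decode_branch("b1")
m.write("R2", 77)          # wrong-path update after b1
m.resolve("b0", mispredicted=True)
print(m.current)           # {'R1': 10, 'R2': 20}
```

Recovery here is a single copy regardless of how many wrong-path instructions executed, which is the point of taking the checkpoint at decode time.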
Review: Registers versus Memory
• So far, we considered mainly registers as part of state
Maintaining Speculative Memory State: Stores
• Handling out-of-order completion of memory operations
  ◦ UNDOing a memory write is more difficult than UNDOing a register write. Why?
  ◦ One idea: Keep store address/data in reorder buffer
    ▪ How does a load instruction find its data?
  ◦ Store/write buffer: Similar to reorder buffer, but used only for store instructions
    ▪ Program-order list of un-committed store operations
    ▪ When store is decoded: Allocate a store buffer entry
    ▪ When store address and data become available: Record in store buffer entry
    ▪ When the store is the oldest instruction in the pipeline: Update the memory address (i.e., cache) with store data
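The store buffer steps above might look like the following sketch. The class and method names are hypothetical, memory is a plain dictionary standing in for the cache, and the load lookup (search the buffer youngest-first, then fall back to memory) is one possible answer to the question of how a load finds its data. It assumes every buffered store is older than the load and already has its address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    addr: Optional[int] = None
    data: Optional[int] = None


class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory   # dict: address -> value (stands in for the cache)
        self.entries = []      # program-order list of uncommitted stores

    def allocate(self):
        # When a store is decoded: allocate an entry at the tail.
        entry = StoreEntry()
        self.entries.append(entry)
        return entry

    def record(self, entry, addr, data):
        # When address and data become available: record them in the entry.
        entry.addr, entry.data = addr, data

    def load(self, addr):
        # Assumed policy: youngest matching store wins, else read memory.
        for entry in reversed(self.entries):
            if entry.addr == addr:
                return entry.data
        return self.memory.get(addr, 0)

    def commit_oldest(self):
        # When the store is the oldest instruction: update memory.
        entry = self.entries.pop(0)
        self.memory[entry.addr] = entry.data

    def squash(self):
        # Undoing speculative stores only discards entries; memory is untouched.
        self.entries.clear()


mem = {0x100: 1}
sb = StoreBuffer(mem)
s = sb.allocate()
sb.record(s, 0x100, 42)
forwarded = sb.load(0x100)      # value comes from the buffer, not memory
before_commit = mem[0x100]      # memory still holds the old value
sb.commit_oldest()
after_commit = mem[0x100]
```

Because memory is written only at commit, a squashed store never has to be undone in the cache.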
Remember: Questions to Ponder
• What is the role of the hardware vs. the software in the order in which instructions are executed in the pipeline?
  ◦ Software based instruction scheduling → static scheduling
  ◦ Hardware based instruction scheduling → dynamic scheduling
Dynamic Instruction Scheduling
• Hardware has knowledge of dynamic events on a per-instruction basis (i.e., at a very fine granularity)
  ◦ Cache misses
  ◦ Branch mispredictions
  ◦ Load/store addresses
Out-of-Order Execution
(Dynamic Instruction Scheduling)
An In-order Pipeline
[Figure: in-order pipeline with stages F, D, E, R, W. The E stage takes 1 cycle for integer add, 4 cycles for integer mul, 8 cycles for FP mul, and an open-ended number of cycles on a cache miss.]
Can We Do Better?
• What do the following two pieces of code have in common (with respect to execution in the previous design)?

    IMUL R3 ← R1, R2        LD   R3 ← R1 (0)
    ADD  R3 ← R3, R1        ADD  R3 ← R3, R1
    ADD  R1 ← R6, R7        ADD  R1 ← R6, R7
    IMUL R5 ← R6, R8        IMUL R5 ← R6, R8
    ADD  R7 ← R9, R9        ADD  R7 ← R9, R9

• Benefit:
  ◦ Latency tolerance: Allows independent instructions to execute and complete in the presence of a long latency operation
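One way to approach the question above is to check which instructions actually need the result of the long-latency first instruction. The sketch below is an illustration only; the function name and the `(dest, sources)` encoding are assumptions, and only register (read-after-write) dependences are tracked.

```python
def depends_on_first(program):
    """For each instruction, report whether it needs (directly or
    transitively) the result of instruction 0."""
    tainted = set()   # registers whose current value derives from instruction 0
    result = []
    for i, (dest, srcs) in enumerate(program):
        dep = (i == 0) or any(s in tainted for s in srcs)
        if dep:
            tainted.add(dest)
        else:
            tainted.discard(dest)   # overwritten by an independent value
        result.append(dep)
    return result


left = [
    ("R3", ["R1", "R2"]),   # IMUL R3 <- R1, R2
    ("R3", ["R3", "R1"]),   # ADD  R3 <- R3, R1
    ("R1", ["R6", "R7"]),   # ADD  R1 <- R6, R7
    ("R5", ["R6", "R8"]),   # IMUL R5 <- R6, R8
    ("R7", ["R9", "R9"]),   # ADD  R7 <- R9, R9
]
right = [("R3", ["R1"])] + left[1:]   # LD R3 <- R1 (0), then the same four instructions

print(depends_on_first(left))    # [True, True, False, False, False]
print(depends_on_first(right))   # [True, True, False, False, False]
```

In both sequences only the second instruction waits on the first; the last three are independent, so an in-order pipeline that stalls behind the dependent ADD holds them back for no data-flow reason.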
In-order vs. Out-of-order Dispatch
• In-order dispatch + precise exceptions:

    IMUL R3 ← R1, R2    F D E E E E R W
    ADD  R3 ← R3, R1    F D STALL E R W
    ADD  R1 ← R6, R7    F STALL D E R W
    IMUL R5 ← R6, R8    F D E E E E E R W
    ADD  R7 ← R3, R5    F D STALL E R W

• 16 vs. 12 cycles (in-order vs. out-of-order dispatch)
Enabling OoO Execution
1. Need to link the consumer of a value to the producer
  ◦ Register renaming: Associate a “tag” with each data value
2. Need to buffer instructions until they are ready to execute
  ◦ Insert instruction into reservation stations after renaming
3. Instructions need to keep track of readiness of source values
  ◦ Broadcast the “tag” when the value is produced
  ◦ Instructions compare their “source tags” to the broadcast tag → if match, source value becomes ready
4. When all source values of an instruction are ready, need to dispatch the instruction to its functional unit (FU)
  ◦ Instruction wakes up if all sources are ready
  ◦ If multiple instructions are awake, need to select one per FU
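Steps 3 and 4 above (tag broadcast, wakeup, and select) can be sketched as follows. The `RSEntry` layout, the tag names, and the oldest-first selection policy are assumptions for illustration; the slide only requires that one ready instruction be selected per FU.

```python
from dataclasses import dataclass


@dataclass
class RSEntry:
    name: str
    fu: str            # functional unit type this instruction needs
    src_tags: list     # one tag per source; None means the value is captured
    src_vals: list     # captured values (None while still waiting)

    def ready(self):
        # An instruction wakes up when all of its sources are ready.
        return all(t is None for t in self.src_tags)


def broadcast(stations, tag, value):
    # Every waiting entry compares its source tags to the broadcast tag.
    for entry in stations:
        for i, t in enumerate(entry.src_tags):
            if t == tag:
                entry.src_vals[i] = value
                entry.src_tags[i] = None


def select(stations):
    """Pick at most one ready instruction per FU (assumed policy: oldest first)."""
    chosen = {}
    for entry in stations:              # list is kept in program order
        if entry.ready() and entry.fu not in chosen:
            chosen[entry.fu] = entry
    for entry in chosen.values():
        stations.remove(entry)
    return list(chosen.values())


stations = [
    RSEntry("ADD R5", "add", ["t0", None], [None, 4]),   # waits on tag t0
    RSEntry("ADD R7", "add", [None, None], [2, 6]),
    RSEntry("ADD R10", "add", [None, None], [8, 9]),
]
first = select(stations)          # ADD R5 is not ready; ADD R7 wins the adder
broadcast(stations, "t0", 6)      # producer of t0 finishes with value 6
second = select(stations)         # ADD R5 is now ready and is the oldest
print([e.name for e in first], [e.name for e in second])
```

In hardware the comparison in `broadcast` happens in parallel across all entries; the loop here only models the outcome.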
Tomasulo’s Algorithm
• OoO with register renaming invented by Robert Tomasulo
  ◦ Used in IBM 360/91 Floating Point Units
  ◦ Read: Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967.
[Figure: out-of-order pipeline. F and D feed a SCHEDULE stage; the functional units (integer add, integer mul, FP mul, load/store) have different E latencies; a REORDER stage (R) precedes W.]
• Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.
Tomasulo’s Machine: IBM 360/91
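[Figure: block diagram of the IBM 360/91 floating point unit. Labeled components: FP registers, load buffers (from memory), store buffers (to memory), operation bus, reservation stations, and two FP FUs; instructions arrive from the instruction unit.]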
Register Renaming
• Output and anti dependencies are not true dependencies
  ◦ WHY? The same register refers to values that have nothing to do with each other
  ◦ They exist because there are not enough register IDs (i.e., names) in the ISA
• The register ID is renamed to the reservation station entry that will hold the register’s value
  ◦ Register ID → RS entry ID
  ◦ Architectural register ID → Physical register ID
  ◦ After renaming, the RS entry ID is used to refer to the register
[Figure: register table listing R0 through R9, each marked 1 (valid).]
Tomasulo’s Algorithm
• If a reservation station is available before renaming
  ◦ Instruction + renamed operands (source value/tag) inserted into the reservation station
  ◦ Only rename if a reservation station is available
• Else stall
• While in reservation station, each instruction:
  ◦ Watches common data bus (CDB) for tag of its sources
  ◦ When tag seen, grab value for the source and keep it in the reservation station
  ◦ When both operands available, instruction ready to be dispatched
• Dispatch instruction to the Functional Unit when instruction is ready
• After instruction finishes in the Functional Unit
  ◦ Arbitrate for CDB
  ◦ Put tagged value onto CDB (tag broadcast)
  ◦ Register file is connected to the CDB
    ▪ Register contains a tag indicating the latest writer to the register
    ▪ If the tag in the register file matches the broadcast tag, write broadcast value into register (and set valid bit)
  ◦ Reclaim rename tag
    ▪ no valid copy of tag in system!
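The register-file side of the steps above can be sketched as follows. Everything named here (`RegisterFile`, `issue`, `complete`, the tags `a` and `b`) is hypothetical, reservation stations are identified by their tag, and results are supplied by hand rather than computed by a functional unit.

```python
class RegisterFile:
    """Each register holds a valid value or the tag of its latest in-flight writer."""

    def __init__(self, values):
        self.value = dict(values)
        self.tag = {reg: None for reg in values}   # None means the value is valid

    def read(self, reg):
        t = self.tag[reg]
        return ("tag", t) if t is not None else ("val", self.value[reg])

    def on_broadcast(self, tag, value):
        # Write only if this tag is still the latest writer of the register.
        for reg, t in self.tag.items():
            if t == tag:
                self.value[reg] = value
                self.tag[reg] = None


def issue(rf, free_tags, stations, op, dest, srcs):
    if not free_tags:
        return False                          # no reservation station: stall
    tag = free_tags.pop(0)
    operands = [rf.read(s) for s in srcs]     # read sources before renaming dest
    rf.tag[dest] = tag                        # dest now names this station
    stations[tag] = (op, operands)
    return True


def complete(rf, free_tags, stations, tag, value):
    del stations[tag]
    for t, (op, operands) in list(stations.items()):
        # Waiting instructions capture the value on a tag match.
        stations[t] = (op, [("val", value) if o == ("tag", tag) else o
                            for o in operands])
    rf.on_broadcast(tag, value)
    free_tags.append(tag)                     # reclaim the rename tag


rf = RegisterFile({"R1": 2, "R2": 3, "R3": 0})
free_tags, stations = ["a", "b"], {}
issue(rf, free_tags, stations, "MUL", "R3", ["R1", "R2"])             # gets tag a
issue(rf, free_tags, stations, "ADD", "R3", ["R3", "R1"])             # gets tag b
stalled = not issue(rf, free_tags, stations, "ADD", "R1", ["R2", "R2"])
complete(rf, free_tags, stations, "a", 6)     # MUL result: 2 * 3
r3_after_mul = rf.read("R3")                  # still waiting on tag b
add_operands = stations["b"][1]
complete(rf, free_tags, stations, "b", 8)     # ADD result: 6 + 2
r3_final = rf.read("R3")
```

The MUL's broadcast does not write R3 because the register's tag already points at the later ADD; only the ADD's broadcast updates the register file.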
An Exercise
    MUL R3  ← R1, R2
    ADD R5  ← R3, R4
    ADD R7  ← R2, R6
    ADD R10 ← R8, R9
    MUL R11 ← R7, R10
    ADD R5  ← R5, R11

Pipeline: F D E W
Exercise Continued
    MUL R3  ← R1, R2
    ADD R5  ← R3, R4
    ADD R7  ← R2, R6
    ADD R10 ← R8, R9
    MUL R11 ← R7, R10
    ADD R5  ← R5, R11
How It Works
Cycle 0
Cycle 2
Cycle 3
Cycle 4
Cycle 7
Cycle 8
Some Questions
• What is needed in hardware to perform tag broadcast and value capture?
  → make a value valid
  → wake up an instruction
An Exercise, with Precise Exceptions
    MUL R3  ← R1, R2
    ADD R5  ← R3, R4
    ADD R7  ← R2, R6
    ADD R10 ← R8, R9
    MUL R11 ← R7, R10
    ADD R5  ← R5, R11

Pipeline: F D E R W
Out-of-Order Execution with Precise Exceptions
[Figure: out-of-order pipeline with a TAG and VALUE broadcast bus. F and D feed a SCHEDULE stage; the functional units (integer add, integer mul, FP mul, load/store) have different E latencies; a REORDER stage (R) precedes W.]
An Example from Modern Processors
Summary of OOO Execution Concepts
• Register renaming eliminates false dependencies, enables linking of producer to consumers
OOO Execution: Restricted Dataflow
• An out-of-order engine dynamically builds the dataflow graph of a piece of the program
  ◦ which piece?

    MUL R3  ← R1, R2
    ADD R5  ← R3, R4
    ADD R7  ← R2, R6
    ADD R10 ← R8, R9
    MUL R11 ← R7, R10
    ADD R5  ← R5, R11
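As a rough illustration of what "building the dataflow graph" means for the six instructions above, the sketch below links each source register to its most recent producer. The function name and the tuple encoding are assumptions, and the program is given as a fixed list rather than a window of in-flight instructions.

```python
def build_dataflow(program):
    """Return producer -> consumer edges as (producer_index, consumer_index)."""
    last_writer = {}   # register -> index of its most recent producer
    edges = []
    for i, (op, dest, srcs) in enumerate(program):
        for s in srcs:
            if s in last_writer:               # value comes from an earlier instruction
                edges.append((last_writer[s], i))
        last_writer[dest] = i                  # later readers link to this instruction
    return edges


program = [
    ("MUL", "R3", ["R1", "R2"]),     # 0
    ("ADD", "R5", ["R3", "R4"]),     # 1
    ("ADD", "R7", ["R2", "R6"]),     # 2
    ("ADD", "R10", ["R8", "R9"]),    # 3
    ("MUL", "R11", ["R7", "R10"]),   # 4
    ("ADD", "R5", ["R5", "R11"]),    # 5
]
edges = build_dataflow(program)
print(edges)   # [(0, 1), (2, 4), (3, 4), (1, 5), (4, 5)]
```

Instructions 0, 2, and 3 have no incoming edges, so they can start as soon as functional units are free; the `last_writer` table plays the role that the rename tags play in hardware.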
State of RAT and RS in Cycle 7
Dataflow Graph
In-Class Exercise on Tomasulo
Tomasulo Template
We did not cover the following slides in lecture.
These are for your preparation for the next lecture.
Restricted Data Flow
• An out-of-order machine is a “restricted data flow” machine
  ◦ Dataflow-based execution is restricted to the microarchitecture level
  ◦ ISA is still based on von Neumann model (sequential execution)