EE457Unit9a OoO
EE 457 Unit 9a
Exploiting ILP
Out-of-Order Execution
Credits
• Some of the material in this presentation is taken from:
– Computer Architecture: A Quantitative Approach
• John Hennessy & David Patterson
• Some of the material in this presentation is derived from
course notes and slides from
– Prof. Michel Dubois (USC)
– Prof. Murali Annavaram (USC)
– Prof. David Patterson (UC Berkeley)
Exploiting Parallelism
• With the increasing transistor budgets of modern processors (i.e., we
can do more things at the same time), the question becomes: how do we
find enough useful tasks to increase performance? Put another way, what
is the most effective way of exploiting parallelism?
• Many types of parallelism available
– Instruction Level Parallelism (ILP): Overlapping instructions within a
single process/thread of execution
– Thread Level Parallelism (TLP): Overlap execution of multiple
processes/threads
– Data Level Parallelism (DLP): Overlap an operation (instruction) that is
to be applied independently to multiple data values (usually, an array)
for (int i=0; i < MAX; i++) { A[i] = A[i] + 5; }
Outline
• Instruction Level Parallelism
– In-order (IO) pipeline
• From academic 5-stage pipeline
• To 8-stage MIPS R4000 pipeline
• Superscalar, superpipelined
– Out-of-Order (OoO) Execution
• This unit: OoO Execution (compute the result) AND
OoO Completion (write the result to memory or a register).
(Problem: Exceptions)
• Next Unit: OoO Execution BUT In-order completion
Basic Blocks
• Basic Block (def.) = Sequence of instructions that will
always be executed together
– No conditional branches out
– No branch targets coming in
– Also called “straight-line” code
– Average size: 5-7 instrucs.
        lw   $s3,0($s4)
        and  $t3,$t2,$t3
    L1: add  $t0,$t0,$s4     # basic block starts here (branch target)
        or   $t5,$t3,$t2
        sub  $t1,$t1,$t2
        beq  $t0,$t8,L1      # basic block ends here (branch)
        xor  $s0,$t1,$s2
• Instructions in a basic block can be overlapped if
there are no data dependencies
• Control dependences really limit our window of
possible instructions to overlap
– W/o extra hardware, we can only overlap execution of
instructions within a basic block
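The definition above can be turned into a small block-splitting routine. This is only a sketch under simplifying assumptions that are not from the slides: every labelled instruction is treated as a branch target, and only `beq`, `bne`, and `j` are treated as branches. The function name `split_basic_blocks` and the tuple format are hypothetical.

```python
BRANCHES = {"beq", "bne", "j"}   # assumed set of control-transfer opcodes

def split_basic_blocks(instrs):
    """instrs: list of (label_or_None, text) tuples in program order."""
    blocks, current = [], []
    for label, text in instrs:
        # A branch target (labelled instruction) starts a new block.
        if label is not None and current:
            blocks.append(current)
            current = []
        current.append(text)
        # A branch ends the current block.
        if text.split()[0] in BRANCHES:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

program = [
    (None, "lw $s3,0($s4)"),
    (None, "and $t3,$t2,$t3"),
    ("L1", "add $t0,$t0,$s4"),
    (None, "or $t5,$t3,$t2"),
    (None, "sub $t1,$t1,$t2"),
    (None, "beq $t0,$t8,L1"),
    (None, "xor $s0,$t1,$s2"),
]
blocks = split_basic_blocks(program)
# blocks[1] is the add/or/sub/beq block from the slide
```

On the slide's code this yields three blocks, with the middle one starting at the `L1` target and ending at the `beq`.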
Overview
• Superscalar = More than 1 instruction completing per clock cycle (IPC > 1)
– 2-way superscalar = Proc. that can issue 2 instructions per clock cycle
– Success is sensitive to ability to find independent instructions to issue in the same cycle
• Superpipelining = Many small stages to boost clock freq.
– Success depends on finding instructions to schedule in the shadow of data and control hazards
Superscalar: Executing more than 1 instruction per clock cycle (CPI < 1 or IPC > 1)
Superpipelining: Divide logic into many short stages (Higher Clock Frequency)
Instruction 1: IF1 IF2 ID  EX  DM1 DM2 DM3 WB
Instruction 2:     IF1 IF2 ID  EX  DM1 DM2 DM3 WB
2-way Superscalar
• Ex: One ALU or branch instruction & one data transfer (LW/SW) instruction can be issued at the same time
• Relies on the compiler to find and reorder appropriate instructions (using nops if no
appropriate instruction can be found)
Instruction Pipeline Stages
ALU or branch IF ID EX MEM WB
LW/SW IF ID EX MEM WB
ALU or branch IF ID EX MEM WB
LW/SW IF ID EX MEM WB
ALU or branch IF ID EX MEM WB
LW/SW IF ID EX MEM WB
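[Figure: 2-way superscalar datapath. The PC and I-Cache fetch 2 instructions per cycle; the Reg. File has 4 read and 2 write ports; the Integer Slot uses the ALU and the LD/ST Slot uses an address calculation unit and the D-Cache]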
Sample Scheduling
• Compiler can reorder instructions to find integer and memory
instructions to pair together so that they can be run down the
pipeline at the same time
void f1(int *A, int n) {
  do {
    *A += 5;
    A++;
    n--;
  } while (n != 0);
}
# $6 = A
# $7 = n = # of iterations
L1: lw   $9,0($6)
    addi $9,$9,5
    sw   $9,0($6)
    addi $6,$6,4
    addi $7,$7,-1
    bne  $7,$0,L1
Schedule w/ modifications and code movement:
time   Int./Branch Slot    LD/ST Slot
1      addi $7,$7,-1       lw $9,0($6)
2      addi $6,$6,4
3      addi $9,$9,5
4      bne  $0,$7,L1       sw $9,-4($6)
IPC = 6 instrucs. / 4 cycles = 1.5
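As a quick check of the schedule above, the sketch below encodes the four cycles as (integer/branch slot, load/store slot) pairs, verifies that each instruction sits in a slot that can execute it, and recomputes the IPC. The helper names `slots_are_legal` and `ipc` are hypothetical, and the check deliberately ignores data dependencies.

```python
MEM_OPS = {"lw", "sw"}

# One entry per cycle: (integer/branch slot, load/store slot); None is a nop.
schedule = [
    ("addi $7,$7,-1", "lw $9,0($6)"),
    ("addi $6,$6,4", None),
    ("addi $9,$9,5", None),
    ("bne $0,$7,L1", "sw $9,-4($6)"),
]

def slots_are_legal(sched):
    """True if memory ops appear only in the LD/ST slot and nothing else does."""
    for int_op, mem_op in sched:
        if int_op is not None and int_op.split()[0] in MEM_OPS:
            return False
        if mem_op is not None and mem_op.split()[0] not in MEM_OPS:
            return False
    return True

def ipc(sched):
    """Useful (non-nop) instructions divided by cycles."""
    useful = sum(op is not None for cycle in sched for op in cycle)
    return useful / len(sched)

print(slots_are_legal(schedule), ipc(schedule))   # True 1.5
```

Two of the eight slots hold nops, which is why the IPC is 1.5 rather than the peak of 2.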
Scheduling Strategies
• Static Scheduling
– Compiler re-orders instructions in such a way that no
dependencies will be violated and allows for OoOE
• Dynamic Scheduling
– HW implementing the Tomasulo algorithm or other similar
approach will re-order instructions to allow for OoOE
• More Advanced Concepts
– Branch prediction and speculative execution (executing beyond
a branch and flushing if the prediction is incorrect) will be covered later
Static Scheduling
• Strengths
– Hardware simplicity [Better clock rate]
• Power/energy advantage
• Compiler has a global view of the program anyway, so it should be able to
do a “good” job
– Very predictable: static performance predictions are reliable
• Weaknesses
– Requires re-compilation to take advantage of new/modified
architecture
– Cannot foresee dynamic (data-dependent) events
• Cache misses, conditional branches (can only reschedule instructions within a basic
block)
– Cannot precompute memory addresses
– No good solution for precise exceptions with out-of-order completion
OUT-OF-ORDER EXECUTION
Out-of-Order Motivation
• We will focus on dynamically scheduled, OoO processors
• Hide the impact of dynamic events such as a cache miss
– Let independent instructions behind a stalled instruction execute
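• Separate functional units (ALU, MUL, DMEM, etc.)
• "Queues" where instructions wait
until they are ready, at which point
they can execute "out-of-order"
    LW  $4,0($5)      // cache miss
    ADD $6,$7,$4
    SUB $1,$2,$3
    MUL $9,$7,$2
[Figure: IM and Reg stages feed per-unit queues in front of the functional units (ALU, MUL, DIV, and DMEM (Cache)), followed by a Reg stage; ADD and SUB wait in the ALU queue, MUL in the MUL queue, and LW in the DMEM queue]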
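To see how much the queues help, the sketch below computes when each of the four instructions above could start executing, with and without the ability to pass a stalled older instruction. It is a toy model under assumptions that are not from the slides: one instruction is dispatched per cycle, the cache miss costs 10 cycles, MUL takes 3 cycles, other operations take 1, and structural and WAR/WAW hazards are ignored. The function name `start_cycles` is hypothetical.

```python
def start_cycles(instrs, in_order):
    """instrs: list of (name, dest, sources, latency). Returns {name: start cycle}."""
    ready_at = {}        # register -> cycle at which its value becomes available
    start = {}
    prev_start = -1
    for i, (name, dest, srcs, latency) in enumerate(instrs):
        # Instruction i is dispatched at cycle i and needs all of its sources.
        t = max([i] + [ready_at.get(s, 0) for s in srcs])
        if in_order:
            # An in-order machine cannot start before the previous instruction has.
            t = max(t, prev_start + 1)
        start[name] = t
        ready_at[dest] = t + latency
        prev_start = t
    return start

program = [
    ("LW",  "$4", ["$5"],       10),   # assumed 10-cycle cache miss
    ("ADD", "$6", ["$7", "$4"],  1),
    ("SUB", "$1", ["$2", "$3"],  1),
    ("MUL", "$9", ["$7", "$2"],  3),
]
in_order_starts = start_cycles(program, in_order=True)
ooo_starts = start_cycles(program, in_order=False)
print(in_order_starts)   # {'LW': 0, 'ADD': 10, 'SUB': 11, 'MUL': 12}
print(ooo_starts)        # {'LW': 0, 'ADD': 10, 'SUB': 2, 'MUL': 3}
```

ADD waits for the load either way, but SUB and MUL no longer sit behind it once they are allowed to execute out of order.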
[Figure: the same LW/ADD/SUB/MUL sequence on the queue-based pipeline, with pipeline regions labeled In-order and Out-of-Order]
Branch Handling
• We will first present the concept of OoOC (out-of-order
completion), which is a bit easier, and then come back to the
desired approach of In-Order Completion (IOC)
• OoOC Issues
– Branches…we should not commit an instruction that came after (in
program order) a branch
– Solution: Stall dispatching instructions
after a branch until we resolve the
outcome
[Figure: pipeline divided into Issue/Dispatch, Execution, and Completion]
EX Stage Stalling
• In our 5-stage pipeline, could we have stalled in the EX stage?
• No! If ADD depended on an instruction in WB then it has no place to store
that forwarded data while it stalls
• Why? What if ADD was also dependent on the instruction in WB … buffer that forwarded value
• Thus we stall in ID so we can … stalling in ID incurs only 1 cycle
penalty as would stalling in EX.
[Figure: 5-stage pipelined datapath with the hazard detection unit (HDU), Stall/FLUSH/IF.Flush controls, PCWrite and IRWrite, forwarding muxes (ALUSelA), Register File, ALU, and D-Cache]
Where to Stall?
• But to implement OoO execution, we cannot stall in the decode stage
since that would prevent any further issuing of instructions
• Thus, now we will issue to queues for each of the multiple functional units
and have the instruction stall in the queue until it is ready
[Figure: IM and Reg stages feed queues in front of the functional units (ALU, MUL, DIV), followed by a Reg stage]
… latest version)
• Instead, the dispatch unit will explicitly tell the dependent instruction who to …
[Figure: 5-stage pipelined datapath (PC, I-Cache, Register File, ALU, D-Cache, pipeline stage registers) with its Stall/IF.Flush and forwarding controls]
Tomasulo’s Plan
• OoO Execution
• Multiple functional units
– Integer ALU, Data memory, Multiplier, Divider
• Queues between ID and EX stages (in place of ID/EX
register)
– Allows later instructions to keep issuing even if earlier ones
are stalled
• Method for dealing with RAW data hazards by
specifying who dependent instructions should get
data from
– But with OoO execution, new hazards arise!
• WAR = Write After Read (an anti-dependency)
– add $9, $8, $6
– lw $8, 40($2)
• WAW = Write After Write (an output dependency)
– add $9, $8, $6
– lw $9, 40($2)
Note: No information is communicated in WAR/WAW hazards.
If no info is communicated, can we somehow solve these hazards?
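The hazard types can be told apart mechanically from the destination and source registers of two instructions. The sketch below is illustrative only; the function name `classify_hazards` and the (dest, sources) tuple format are hypothetical.

```python
def classify_hazards(first, second):
    """first, second: (dest_or_None, [sources]); 'first' is earlier in program order."""
    d1, s1 = first
    d2, s2 = second
    hazards = set()
    if d1 is not None and d1 in s2:
        hazards.add("RAW")   # second reads what first writes: real data flow
    if d2 is not None and d2 in s1:
        hazards.add("WAR")   # second overwrites a register first still has to read
    if d1 is not None and d1 == d2:
        hazards.add("WAW")   # both write the same register
    return hazards

add    = ("$9", ["$8", "$6"])   # add $9, $8, $6
lw_war = ("$8", ["$2"])         # lw  $8, 40($2)
lw_waw = ("$9", ["$2"])         # lw  $9, 40($2)

print(classify_hazards(add, lw_war))   # {'WAR'}
print(classify_hazards(add, lw_waw))   # {'WAW'}
```

Only the RAW case passes a value from one instruction to the other; WAR and WAW arise purely from reusing a register name.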
Register Renaming
• WAR and WAW hazards can
always be solved by simply
choosing a DIFFERENT register
• If we had 64 registers instead
of 32 registers, then perhaps
the compiler might have used
$48 instead of $8 and we could
have executed the second part
of the code before the first part
WAR = Write After Read
    add $9, $8, $6
    lw  $48, 40($2)        # $8 replaced by $48
WAW = Write After Write
  First iteration:
    lw  $8, 40($2)
    add $8, $8, $8
    sw  $8, 40($2)
  Second iteration (using alternate register, $48):
    lw  $48, 60($3)
    add $48, $48, $48
    sw  $48, 60($3)
Register Renaming
• Renaming requires more registers
• We have limited architectural registers
– Registers the instruction set is aware of
• We could have more physical registers
– Actual registers that are part of the register file
    lw  $8, 40($2)         # assume $2 is delayed
    add $8, $8, $8
    sw  $8, 40($2)
    lw  $8, 60($3)
    add $8, $8, $8
    sw  $8, 60($3)
• It is clear the compiler is using $8 as a temporary register
• If there is a delay in obtaining $2, the first part of the code cannot proceed
• Unfortunately, the second part of the code cannot proceed either, because of the
name dependency on $8
Register Renaming
• Rather than creating new architectural registers, let
us internally provide multiple "versions" of the same
architectural register
– $8v1 = $8 version 1
– $8v2 = $8 version 2
    lw  $8v1, 40($2)
    add $8v2, $8v1, $8v1
    sw  $8v2, 40($2)
    lw  $8v3, 60($3)
    add $8v4, $8v3, $8v3
    sw  $8v4, 60($3)
[Figure: the single "Arch. Reg" $8 maps to the Phys Regs $8v1, $8v2, $8v3, and $8v4]
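The versioning idea can be sketched as a single pass over the code that gives every write a fresh version and makes every read refer to the latest version at that point. This is a software illustration only, not how the hardware is organized; the function name `rename` and the tuple format are hypothetical.

```python
def rename(instrs):
    """instrs: list of (op, dest_or_None, [sources]). Returns the renamed list."""
    version = {}   # architectural register -> latest version number

    def current(reg):
        v = version.get(reg, 0)
        return f"{reg}v{v}" if v else reg   # no version yet: plain register

    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = [current(s) for s in srcs]   # read sources before bumping dest
        new_dest = None
        if dest is not None:
            version[dest] = version.get(dest, 0) + 1
            new_dest = current(dest)
        renamed.append((op, new_dest, new_srcs))
    return renamed

code = [
    ("lw",  "$8", ["$2"]),
    ("add", "$8", ["$8", "$8"]),
    ("sw",  None, ["$8", "$2"]),   # store: data $8, base $2
    ("lw",  "$8", ["$3"]),
    ("add", "$8", ["$8", "$8"]),
    ("sw",  None, ["$8", "$3"]),
]
renamed = rename(code)
# renamed[1] == ("add", "$8v2", ["$8v1", "$8v1"])
```

After renaming, the second group of three instructions touches only `$8v3` and `$8v4`, so nothing ties it to the first group any more.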
[Figure: block diagram of the OoO processor, annotated as follows]
• Instruc. Queue: … dependencies)
• Dispatch (with Register Status Table): Decode & dispatch multiple
instructions per cycle, tracking dependencies on earlier instructions
• Int., L/S, Mult., and Div Queues: Instructions wait in queues
until their respective functional unit (the hardware that will compute
their value) is free AND they have their data available (from the
instructions they depend upon). These act as additional "physical registers"
• Issue Unit: feeds the Integer / Branch, D-Cache, Div, and Mul units
• Common Data Bus: Results and TAGs of multiple instructions can
be written back per cycle. Results are broadcast to
any instruction waiting for that result.
Block Diagram Adapted from Prof. Michel Dubois (Simplified for EE457)
Tomasulo’s Algorithm
• Dispatch/Issue unit decodes and dispatches instructions
• Assign a binary code (aka TAG) to each instruction producing a register
value using the TAG FIFO
• Adds a Register Status Table (RST) that holds the TAG of the instruction that
is producing the LATEST version of each architectural register or NULL if the
LATEST version is in the register file
• The destination operand is represented by the TAG but not the actual
register name
• For source operands, an instruction carries either the values (if TAG is null in
RST) or TAGs of the operands (but not the actual register name)
• When an instruction executes and produces a result it broadcasts the result
and its destination TAG
– Any instruction waiting can compare its SRC tags with the destination tag and
grab the value if they match
– If entry in RST matches the TAG then this instruction is the latest producer of
the register and the value will be written to the register file
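The bullets above can be sketched as a small software model of dispatch and CDB broadcast. This is a simplification under stated assumptions, not the actual hardware: the class name `Tomasulo`, the dictionary-based entries, a plain queue of free TAGs, and the idea that the producer leaves the machine at broadcast time are all hypothetical, and functional units and timing are not modeled.

```python
from collections import deque

class Tomasulo:
    def __init__(self, regfile, num_tags=8):
        self.rf = dict(regfile)    # register file: reg -> value
        self.rst = {}              # RST: reg -> TAG of latest producer (absent = NULL)
        self.free_tags = deque(f"T{i}" for i in range(1, num_tags + 1))
        self.waiting = []          # dispatched instructions that have not broadcast yet

    def dispatch(self, op, dest, srcs):
        # Sources carry the value if the RST entry is NULL, else the producer's TAG.
        operands = [("tag", self.rst[s]) if s in self.rst else ("val", self.rf[s])
                    for s in srcs]
        tag = None
        if dest is not None:
            tag = self.free_tags.popleft()
            self.rst[dest] = tag   # this instruction is now the latest producer
        entry = {"op": op, "tag": tag, "operands": operands}
        self.waiting.append(entry)
        return entry

    def broadcast(self, tag, value):
        # Every waiting instruction compares its source TAGs with the broadcast TAG.
        for e in self.waiting:
            e["operands"] = [("val", value) if o == ("tag", tag) else o
                             for o in e["operands"]]
        # The register file is written only if this TAG is still the latest producer.
        for reg, t in list(self.rst.items()):
            if t == tag:
                self.rf[reg] = value
                del self.rst[reg]
        self.waiting = [e for e in self.waiting if e["tag"] != tag]
        self.free_tags.append(tag)

m = Tomasulo({"$2": 100, "$3": 200, "$8": 0})
lw1 = m.dispatch("lw", "$8", ["$2"])           # gets T1
add1 = m.dispatch("add", "$8", ["$8", "$8"])   # gets T2, both sources wait on T1
lw2 = m.dispatch("lw", "$8", ["$3"])           # gets T3, now the latest producer of $8
m.broadcast("T1", 7)                           # 7 is a made-up load result
```

After the broadcast, `add1` holds the value 7 for both operands, while the register file entry for `$8` is left alone because the RST names `T3`, not `T1`, as the latest producer.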
Tagging process
    sqrt $2, $10
    lw   $8, 40($2)
    add  $8, $8, $8
    sw   $8, 40($2)
    lw   $8, 60($3)
    add  $8, $8, $8
    sw   $8, 60($3)
RST = Register Status Table (identifies the latest version of a reg.); RF = Register File
[Figure sequence: RST and RF entries $1-$31 above the Issue Logic, which feeds the queues of the INT ALU, MUL/DIV/SQRT, and Load/Store units. Queue entries added or updated at each step:]
– Step 1: T1: SQRT $2 Val / $10 Val
An instruction that will write to a destination register takes a TAG and enters
that TAG into the RST to track the latest version/producer
– Step 2: T2: LW T1 / 40
– Step 3: T3: ADD T2 / T2
Notice the RST only stores the TAG of the LATEST producer/version. Solves WAR/WAW
hazards by not accepting a writeback unless it is from the latest producer
– Step 4: SW T3 / T1 / 40
– Step 5: T4: LW $3 val / 60
– Step 6: T5: ADD T4 / T4
– Step 7: SW 0x2222 / $3 val / 60 (T1: SQRT, T2: LW, T3: ADD, and SW T3 / T1 / 40 are still waiting)
– Step 8: T2: LW 0xacd0 / 40, T3: ADD T2 / T2, SW T3 / 0xacd0 / 40
Since RST entry for $8 is NULL, RF will not update when LW attempts to writeback.
– Step 9: T3: ADD 0x5678 / 0x5678, SW T3 / 0xacd0 / 40
– Step 10: SW 0xacf0 / 0xacd0 / 40
Register Renaming
    sqrt $2, $10
    add  $2, $2, $2
    add  $2, $2, $2
    add  $2, $2, $2
    add  $2, $2, $2
RST = Register Status Table; RF = Register File
[Figure: the RST entry for $2 is overwritten in turn with T1, T2, T3, T4 as the instructions pass through the Issue Logic. Queue entries:]
T1: SQRT $2 Val / $10 Val
T2: ADD T1 / T1
T3: ADD T2 / T2
T4: ADD T3 / T3
Unique TAGs
• Like an SSN, each TAG must be unique
• SSNs are reused.
• Similarly, TAGs can be reused
• TAGs are similar to numbered TOKENs
Tags (= Tokens)
• How many tokens should the bank cashier
have to start with?
• What happens if the tokens run out?
• Does the cashier need to keep any order in
holding tokens and issuing tokens?
• Do they have to collect the tokens back?
TAG FIFO
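FIFOs are taught in EE 560
[Figure: 64-entry TAG FIFO (entries 0-63) with a write pointer (wp) and a read pointer (rp), shown in three states]
– FULL: wp and rp both at entry 0
– 2 Tokens issued: wp at entry 0, rp at entry 2
– 1 Token returned: wp at entry 1, rp at entry 2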
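The three states in the figure can be reproduced with a small circular buffer. This is a sketch, assuming a 64-entry FIFO that starts full of TAGs 0-63 and an explicit occupancy counter to tell full from empty; the class name `TagFifo` and the method names are hypothetical.

```python
class TagFifo:
    def __init__(self, size=64):
        self.size = size
        self.slots = list(range(size))   # starts FULL: every TAG is available
        self.rp = 0                      # read pointer: next TAG to issue
        self.wp = 0                      # write pointer: where a returned TAG goes
        self.count = size                # TAGs currently held in the FIFO

    def issue(self):
        """Hand out the next TAG, or None if there are none left (dispatch must stall)."""
        if self.count == 0:
            return None
        tag = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.size
        self.count -= 1
        return tag

    def give_back(self, tag):
        """Return a TAG once its instruction has broadcast its result."""
        assert self.count < self.size, "more TAGs returned than issued"
        self.slots[self.wp] = tag
        self.wp = (self.wp + 1) % self.size
        self.count += 1

fifo = TagFifo()
a = fifo.issue()
b = fifo.issue()        # 2 tokens issued: rp = 2, wp = 0
fifo.give_back(b)       # 1 token returned: wp = 1, rp = 2
print(fifo.rp, fifo.wp, fifo.count)   # 2 1 63
```

TAGs may come back in a different order than they went out; the FIFO only needs each free TAG to appear once, not in sorted order.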
MEMORY DISAMBIGUATION
Memory Disambiguation
• Data hazards (RAW, WAR, WAW) can occur in memory just as
with registers, and hazards in memory are much harder to deal
with since many base/offset combinations could produce the same address
    sw $2, 2000($0)
    lw $8, 2000($0)        # RAW
This later lw can proceed only if there is
no store ahead of it with the same address
Memory Disambiguation
• When can the LSQ issue a LW or SW to the cache?
– Loads can issue to the cache when their address is ready
– Stores can issue to the cache when both address & data are ready
– Memory hazards (RAW, WAR, WAW) are resolved in the LSQ
• A load can issue to the cache if no store with the same address is before it
• A store can issue to the cache if no store or load with the same address is before it
• Otherwise, the access waits in the LSQ
– If an address is unknown, it is assumed to be the same
• Worst case, to enforce correctness
– The process of figuring out and comparing memory addresses is called
“disambiguation”
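The issue rules above translate fairly directly into a check over the older entries of the LSQ. The sketch below assumes the LSQ is a list in program order whose entries leave once they have accessed the cache; the function name `can_issue` and the dictionary keys are hypothetical.

```python
def can_issue(lsq, i):
    """lsq: list of entries in program order, each a dict with keys
    'kind' ('lw' or 'sw'), 'addr' (int, or None if not yet known),
    and 'data_ready' (bool, only meaningful for stores)."""
    me = lsq[i]
    if me["addr"] is None:
        return False                      # own address not computed yet
    if me["kind"] == "sw" and not me["data_ready"]:
        return False                      # stores also need their data
    for older in lsq[:i]:
        # A load only conflicts with older stores; a store conflicts with both.
        if me["kind"] == "lw" and older["kind"] != "sw":
            continue
        # An unknown older address is conservatively assumed to match.
        if older["addr"] is None or older["addr"] == me["addr"]:
            return False
    return True

lsq = [
    {"kind": "sw", "addr": 2000, "data_ready": False},   # sw $2, 2000($0), data pending
    {"kind": "lw", "addr": 2000, "data_ready": True},    # lw $8, 2000($0)
    {"kind": "lw", "addr": 3000, "data_ready": True},    # unrelated load
]
print([can_issue(lsq, i) for i in range(3)])   # [False, False, True]
```

The load from address 3000 may pass both older accesses, while the load from 2000 must wait behind the store to the same address.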
Issue Unit
• How do we determine when to issue an instruction to the
functional unit?
– Is the instruction ready?
– Is the functional unit free to start the operation?
– CDB availability constraint
• Will there be room on the CDB when the operation finishes?
– Priority/conflict resolution
• If many instructions are available, which should be chosen? (Is round-robin
priority adequate?)
How do we prioritize
instructions that are ready?
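One possible answer is oldest-first selection, sketched below. This is an assumption for illustration, not the policy the slides commit to: the queue is kept in dispatch order, and the functional unit and CDB constraints are reduced to two booleans. The function name `select` is hypothetical.

```python
def select(queue, unit_free, cdb_free):
    """queue: entries in dispatch order (oldest first), each {'name': str, 'ready': bool}.
    Returns the name of the instruction to issue this cycle, or None."""
    if not (unit_free and cdb_free):
        return None
    for entry in queue:          # scan from the oldest entry
        if entry["ready"]:
            return entry["name"]
    return None

int_queue = [
    {"name": "ADD", "ready": False},   # still waiting on an operand
    {"name": "SUB", "ready": True},
    {"name": "XOR", "ready": True},
]
print(select(int_queue, unit_free=True, cdb_free=True))    # SUB
print(select(int_queue, unit_free=False, cdb_free=True))   # None
```

Picking the oldest ready instruction tends to free the most dependents soonest, whereas round-robin gives no preference to age.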
LSQ Ordering/Priority
• Maintaining instructions in the order of arrival
– Issue order/program order in a queue
• Is this necessary and/or desirable?
– In the case of LSQ?
• Necessary! To enforce memory disambiguation
– In the case of Integer, MUL, DIV queues?
• Desirable, so that an earlier instruction gets executed
whenever possible, thereby reducing queue pressure
from too many instructions waiting on it
Conditional Branches
• Dispatcher stalls when it reaches a branch (and waits until it is resolved)
• Branches are dispatched to integer queue where they wait for their
operands (if necessary)
• When branch executes it puts its outcome & target on CDB
– If untaken, dispatch unit resumes
– If taken, then dispatch flushes the IFQ and resumes at the target
• Since we stop dispatching instructions after a branch, does it mean that
this branch is the last instruction to be executed in the back-end?
• Is it possible that the back-end holds simultaneously
– A. Some instructions dispatched before
the branch .. AND ..
– B. Some instructions issued after
the branch
ADD $4,$5,$5
BEQ $6,$7,L1
...
L1: SUB $1,$2,$3
MUL $9,$7,$2
BACKUP
EX
[Figure: multi-cycle functional units in the EX stage: Integer ALU, FP Add (A1 A2 A3 A4), Int. & FP MUL (M1 M2 M3 M4 M5 M6 M7), Int. & FP DIV]
• An added complication of out-of-order execution & completion: WAW & WAR hazards
• Look Ahead: Tomasulo Algorithm will help absorb latency of different functional
units and cache miss latency by allowing other ready instructions to
proceed out-of-order
Functional unit    Latency    Initiation interval
Integer ALU        0          1
FP Add             3          1
FP Mul.            6          1
FP Div.            24         25
[Figure: block diagram with a ROB (Reorder Buffer) added: I-Cache, Br. Pred. Buffer, Instruc. Queue, Dispatch, Int./L/S/Mult./Div Queues, Addr. Buffer, Issue Unit, Exec. Units (Integer / Branch, D-Cache, Div, Mul), L/S Buffer, CDB, ROB, Reg. File, and D-Cache. Annotation: Exceptions? No problem]