EE457Unit9a OoO

Uploaded by

Shaurya Chandra
Copyright
© All Rights Reserved

EE 457 Unit 9a

Exploiting ILP
Out-of-Order Execution
2

Credits
• Some of the material in this presentation is taken from:
– Computer Architecture: A Quantitative Approach
• John Hennessy & David Patterson
• Some of the material in this presentation is derived from
course notes and slides from
– Prof. Michel Dubois (USC)
– Prof. Murali Annavaram (USC)
– Prof. David Patterson (UC Berkeley)
3

Exploiting Parallelism
• With the increasing transistor budgets of modern processors (i.e.,
the ability to do more things at the same time), the question becomes
how to find enough useful work to increase performance
or, put another way, what is the most effective way of
exploiting parallelism?
• Many types of parallelism available
– Instruction Level Parallelism (ILP): Overlapping instructions within a
single process/thread of execution
– Thread Level Parallelism (TLP): Overlap execution of multiple
processes/threads
– Data Level Parallelism (DLP): Overlap an operation (instruction) that is
to be applied independently to multiple data values (usually, an array)
for (int i=0; i < MAX; i++) { A[i] = A[i] + 5; }

• We'll focus on ILP in this unit


4

Outline
• Instruction Level Parallelism
– In-order (IO) pipeline
• From academic 5-stage pipeline
• To 8-stage MIPS R4000 pipeline
• Superscalar, superpipelined
– Out-of-Order (OoO) Execution
• This unit: OoO Execution (compute the result) AND
OoO Completion (write the result to memory or a register)
(Problem: exceptions)
• Next unit: OoO Execution BUT in-order completion
5

Instruction Level Parallelism (ILP)


• Although a program defines a sequential ordering of instructions, in reality
many instructions can be executed in parallel (i.e. out of (program) order).
• ILP refers to the process of finding instructions from a single program/thread
of execution that can be executed in parallel
• Data flow (data dependencies) limits out-of-order execution
• Independent instructions (no data dependencies) can be executed at the
same time
• Control hazards also provide some ordering constraints
Program order (in-order):
    lw  $s3,0($s4)
    and $t3,$t2,$t3
    add $t0,$t0,$s4
    or  $t5,$t3,$t2
    sub $t1,$t1,$t2
    beq $t0,$t8,L1
    xor $s0,$t1,$s2
Dependency graph: LW, ADD, SUB, and AND have no producers; BEQ depends on ADD;
OR depends on AND; XOR depends on SUB and follows BEQ.
We may perform execution out-of-order:
Cycle 1: lw $s3,0($s4) / add $t0,$t0,$s4 / sub $t1,$t1,$t2 / and $t3,$t2,$t3
Cycle 2: / beq $t0,$t8,L1 / / or $t5,$t3,$t2
Cycle 3: / / xor $s0,$t1,$s2 /
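
The cycle assignment above can be reproduced with a small dataflow model. This is only a sketch under simplifying assumptions that are not from the slides: unlimited functional units, one cycle per instruction, only RAW dependencies tracked, and every instruction after a branch waiting for that branch. The helper names `parse` and `earliest_cycles` are made up for illustration.

```python
import re

# The seven instructions from the slide, in program order.
PROGRAM = [
    "lw $s3,0($s4)",
    "and $t3,$t2,$t3",
    "add $t0,$t0,$s4",
    "or $t5,$t3,$t2",
    "sub $t1,$t1,$t2",
    "beq $t0,$t8,L1",
    "xor $s0,$t1,$s2",
]

def parse(line):
    """Return (opcode, destination register or None, list of source registers)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op in ("beq", "bne", "sw"):      # these write no register
        return op, None, regs
    return op, regs[0], regs[1:]

def earliest_cycles(program):
    """Earliest cycle for each instruction given RAW and branch ordering."""
    produced_in = {}        # register -> cycle in which its latest producer runs
    last_branch_cycle = 0   # instructions after a branch wait for it (assumption)
    result = []
    for line in program:
        op, dst, srcs = parse(line)
        waits = [produced_in.get(r, 0) for r in srcs]
        cycle = 1 + max(waits + [last_branch_cycle])
        if dst is not None:
            produced_in[dst] = cycle
        if op in ("beq", "bne"):
            last_branch_cycle = cycle
        result.append((cycle, line))
    return result

for cycle, line in earliest_cycles(PROGRAM):
    print(f"Cycle {cycle}: {line}")
```

The model ignores WAR and WAW hazards, which this example does not contain; those are introduced later in the unit.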
6

Basic Blocks
• Basic Block (def.) = Sequence of instructions that will
always be executed together
– No conditional branches out
– No branch targets coming in
– Also called “straight-line” code
– Average size: 5-7 instructions

        lw  $s3,0($s4)
        and $t3,$t2,$t3
    L1: add $t0,$t0,$s4     # basic block starts w/ a branch target
        or  $t5,$t3,$t2
        sub $t1,$t1,$t2
        beq $t0,$t8,L1      # basic block ends with a branch
        xor $s0,$t1,$s2
• Instructions in a basic block can be overlapped if
there are no data dependencies
• Control dependences really limit our window of
possible instructions to overlap
– W/o extra hardware, we can only overlap execution of
instructions within a basic block
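
As a rough illustration of the definition above, the sketch below splits the slide's code into basic blocks by starting a new block at every labeled instruction and closing a block after every branch. The function name `split_basic_blocks`, the `(label, text)` input format, and the `BRANCHES` set are assumptions made for this example.

```python
BRANCHES = {"beq", "bne", "j"}   # opcodes that end a basic block (illustrative subset)

def split_basic_blocks(program):
    """program: list of (label or None, instruction text) in program order."""
    blocks, current = [], []
    for label, text in program:
        # A branch target may be entered from elsewhere, so it starts a new block.
        if label is not None and current:
            blocks.append(current)
            current = []
        current.append(text)
        # A branch may leave the straight-line path, so it ends the block.
        if text.split()[0] in BRANCHES:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

PROGRAM = [
    (None, "lw $s3,0($s4)"),
    (None, "and $t3,$t2,$t3"),
    ("L1", "add $t0,$t0,$s4"),
    (None, "or $t5,$t3,$t2"),
    (None, "sub $t1,$t1,$t2"),
    (None, "beq $t0,$t8,L1"),
    (None, "xor $s0,$t1,$s2"),
]

for number, block in enumerate(split_basic_blocks(PROGRAM), start=1):
    print(f"Block {number}: {block}")
```

The middle block (add, or, sub, beq) is the one the slide marks as starting with a target and ending with a branch.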
7

Other In-Order techniques

SUPERSCALAR & SUPERPIPELINING


8

Overview
• Superscalar = More than 1 instruction completing per clock cycle (IPC > 1)
– 2-way superscalar = Proc. that can issue 2 instructions per clock cycle
– Success is sensitive to ability to find independent instructions to issue in the same cycle
• Superpipelining = Many small stages to boost clock freq.
– Success depends on finding instructions to schedule in the shadow of data and control hazards

[Figure: Superscalar. Instruction 1 and Instruction 2 each pass through Instruc. Fetch, Instruc. Decode, Execute, Data Memory, and Write back in parallel pipelines.]
Superscalar: Executing more than 1 instruction per clock cycle (CPI < 1 or IPC > 1)

[Figure: Superpipelining.]
Instruction 1: IF1 IF2 ID EX DM1 DM2 DM3 WB
Instruction 2: IF1 IF2 ID EX DM1 DM2 DM3 WB
Superpipelining: Divide logic into many short stages (Higher Clock Frequency)
9

2-way Superscalar
• Ex: One ALU & Data transfer (LW/SW) instruction can be issued at the same time
• Relies on the compiler to find and reorder appropriate instructions (using nops if no
appropriate instruction can be found)
Instruction Pipeline Stages
ALU or branch IF ID EX MEM WB
LW/SW IF ID EX MEM WB
ALU or branch IF ID EX MEM WB
LW/SW IF ID EX MEM WB
ALU or branch IF ID EX MEM WB
LW/SW IF ID EX MEM WB
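[Figure: 2-way superscalar datapath. The PC addresses the I-Cache, which supplies 2 instructions; the Reg. File has 4 read and 2 write ports; the Integer Slot uses the ALU; the LD/ST Slot uses the Addr. Calc. unit and the D-Cache.]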
10

Sample Scheduling
• The compiler can reorder instructions to pair an integer instruction
with a memory instruction so that both can be run down the
pipeline at the same time

void f1(int *A, int n) {
  do {
    *A += 5;
    A++;
    n--;
  } while (n != 0);
}

Original code:
# $6 = A
# $7 = n = # of iterations
L1: lw   $9, 0($6)
    addi $9, $9, 5
    sw   $9, 0($6)
    addi $6, $6, 4
    addi $7, $7, -1
    bne  $7, $0, L1

Scheduled code (w/ modifications and code movement); time runs downward:
Cycle   Int./Branch Slot     LD/ST Slot
1       addi $7, $7, -1      lw $9, 0($6)
2       addi $6, $6, 4
3       addi $9, $9, 5
4       bne  $0, $7, L1      sw $9, -4($6)

IPC = 6 instructions / 4 cycles = 1.5
11

Scheduling Strategies
• Static Scheduling
– Compiler re-orders instructions in such a way that no
dependencies will be violated and allows for OoOE
• Dynamic Scheduling
– HW implementing the Tomasulo algorithm or other similar
approach will re-order instructions to allow for OoOE
• More Advanced Concepts
– Branch prediction and speculative execution (executing beyond
a branch and flushing if the prediction was incorrect) will be covered later
12

Static Scheduling
• Strengths
– Hardware simplicity [Better clock rate]
• Power/energy advantage
• Compiler has a global view of the program anyway, so it should be able to
do a “good” job
– Very predictable: static performance predictions are reliable
• Weaknesses
– Requires re-compilation to take advantage of new/modified
architecture
– Cannot foresee dynamic (data-dependent) events
• Cache misses, conditional branches (can only reschedule instructions within a basic
block)
– Cannot precompute memory addresses
– No good solution for precise exceptions with out-of-order completion
13

OUT-OF-ORDER EXECUTION
14

Out-of-Order Motivation
• We will focus on dynamically scheduled, OoO processors
• Hide the impact of dynamic events such as a cache miss
– Let independent instructions behind a stalled instruction execute
• Separate functional units (ALU, MUL, DMEM, etc.)
• "Queues" where instructions wait until they are ready, at which point
they can execute "out-of-order"

LW  $4,0($5)    // cache miss
ADD $6,$7,$4
SUB $1,$2,$3
MUL $9,$7,$2

[Figure: IM and Reg feed Queues + Functional Units (ALU, MUL, DIV, DMEM (Cache)), followed by Reg. ADD and SUB wait in the ALU queue, MUL in the MUL queue, and LW in the DMEM queue.]
15

Dispatch, Execution, and Completion


• "Execution" here means producing the results not necessarily
writing them to a register or memory
• Completion means committing/writing the results to the register
file or memory
• While we say out-of-order execution we really mean/want:
– In-order (program order) Issue/Dispatch (IoD)
– Out-of-Order Execution (OoOE)
– In-order Completion (IoC) [hard]
• So we'll start with the easier Out-of-Order Completion (OoOC)

LW  $4,0($5)    // cache miss
ADD $6,$7,$4
SUB $1,$2,$3
MUL $9,$7,$2

[Figure: Issue/Dispatch (In-order) -> Execution (Out-of-Order) -> Completion (In-order)]
16

Branch Handling
• We will present the concept of OoOC (out-of-order
completion), which is a bit easier, and then come back to the
desired approach of In-Order Completion (IoC)
• OoOC Issues
– Branches…we should not commit an instruction that came after (in
program order) a branch
– Solution: Stall dispatching instructions after a branch until we resolve the
outcome

LW  $4,0($5)    // cache miss
BEQ $4,$0,L1
ADD $6,$7,$8    // What if we execute this ADD out of order?

[Figure: Issue/Dispatch (In-order) -> Execution (Out-of-Order) -> Completion (In-order), with the annotation "Stall branches here"]
17

Data Hazard Stalling


• In our 5-stage pipeline (in-order execution) RAW dependency
was solved by
– Forwarding (preferably) or
– Stalling (LW followed by dependent instruction)
• Dependent instructions stalled in the ID stage if necessary
• Do we want to stall in the decode stage in our OoO processor?
– No! Doing so would necessarily stall everyone behind us
[Figure: 5-stage pipeline IM, Reg, ALU, DM, Reg. ADD $1,$3,$4 stalls ("Stall here") behind LW $4, leaving a bubble between them. Stalling here would plug up the pipeline.]
18

EX Stage Stalling
• In our 5-stage pipeline, could we have stalled in the EX stage?
• No! If ADD depended on an instruction in WB then it has no place to store
that forwarded data while it stalls
– Why? What if ADD was also dependent on the instruction in WB? ADD has no place to
buffer that forwarded value
– Thus we stall in ID so we can use the Register File to grab dependent values.
Furthermore, stalling in ID incurs only a 1-cycle penalty, the same as stalling in EX would.

[Figure: 5-stage pipelined datapath with hazard detection unit (HDU) and forwarding unit]
19

Where to Stall?
• But to implement OoO execution, we cannot stall in the decode stage
since that would prevent any further issuing of instructions
• Thus, now we will issue to queues for each of the multiple functional units
and have the instruction stall in the queue until it is ready
[Figure: IM and Reg feed Queues + Functional Units (ALU, MUL, DIV, DMEM (Cache)), followed by Reg, with the annotation "Stalling here would plug up the pipeline"]
20

Forwarding in OoO Execution


• In the 5-stage pipeline, later instructions carried their source register IDs into the
EX stage to be compared with the destination register IDs of earlier
instructions
• But in OoO execution, we may have many (earlier) instructions in front of us
and would require more complex hardware to determine who is producing
the data we need (especially when multiple producers exist and we want the
latest version)
• Instead, the dispatch unit will explicitly tell the dependent instruction who to
get data from, using part of Tomasulo's algorithm

[Figure: 5-stage pipelined datapath with hazard detection unit (HDU) and forwarding unit]
21

Tomasulo’s Plan
• OoO Execution
• Multiple functional units
– Integer ALU, Data memory, Multiplier, Divider
• Queues between ID and EX stages (in place of ID/EX
register)
– Allows later instructions to keep issuing even if earlier ones
are stalled
• Method for dealing with RAW data hazards by
specifying who dependent instructions should get
data from
– But with OoO execution, new hazards arise!
22

WAR and WAW

NEW DATA HAZARDS


23

RAW, WAR, and WAW


• RAW = Read After Write
– lw $8, 40($2)
– add $9, $8, $7
• WAR = Write After Read
– add $9, $8, $6  ← say $6 is not available yet, can LW execute?
– lw $8, 40($2)
• WAW = Write After Write
– add $9, $8, $6  ← say $6 is not available yet, can LW execute?
– lw $9, 40($2)
Why would anyone produce one result in $9 without utilizing
that result? Why overwrite it with another result?
How is this possible?
24

WAW can easily occur


• How is WAW possible?
• Example 1
– Say a company gives a standard bonus to
most of the employees and a higher bonus
to managers
– The software may set a default value to the
standard bonus and then overwrite it for the
special case

int x = standard_bonus;
if (manager)
  x = special_bonus;
set_bonus(x);

• Example 2
– Consider multiple iterations of a loop body

for(i=MAX; i != 0; i--)
  A[i] = A[i] * 3;

Original Code
L1: lw   $2, 40($1)
    mult $4, $2, $3
    sw   $4, 40($1)
    addi $1, $1,-4
    bne  $1, $0,L1

Each later iteration executes the same instructions again, writing the same registers ($2, $4, $1):
L1: lw   $2, 40($1)
    mult $4, $2, $3
    sw   $4, 40($1)
    addi $1, $1,-4
    bne  $1, $0,L1
25

RAW, WAR, and WAW


• Some terminology to remember
• RAW = Read After Write: a true dependency
– lw $8, 40($2)
– add $9, $8, $7
• WAR = Write After Read: an anti-dependency (a name dependency)
– add $9, $8, $6
– lw $8, 40($2)
• WAW = Write After Write: an output dependency (a name dependency)
– add $9, $8, $6
– lw $9, 40($2)
Note: No information is communicated in WAR/WAW hazards.
If no info is communicated can we somehow solve these hazards?
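
The three definitions can be checked mechanically for a pair of instructions. The sketch below is illustrative only; the function name `hazards` and the `(destination, sources)` encoding are assumptions made for this example, not notation from the slides.

```python
def hazards(first, second):
    """Classify register hazards between two instructions in program order.

    Each instruction is (destination register or None, list of source registers).
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = []
    if dest1 is not None and dest1 in srcs2:
        found.append("RAW")   # second reads what first writes (true dependency)
    if dest2 is not None and dest2 in srcs1:
        found.append("WAR")   # second overwrites a register first still reads
    if dest1 is not None and dest1 == dest2:
        found.append("WAW")   # both write the same register
    return found

# lw $8, 40($2) followed by add $9, $8, $7
print(hazards(("$8", ["$2"]), ("$9", ["$8", "$7"])))
# add $9, $8, $6 followed by lw $8, 40($2)
print(hazards(("$9", ["$8", "$6"]), ("$8", ["$2"])))
# add $9, $8, $6 followed by lw $9, 40($2)
print(hazards(("$9", ["$8", "$6"]), ("$9", ["$2"])))
```

Only the RAW case passes a value from the first instruction to the second, which is why the other two can be removed by choosing a different register name.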
26

RAW, WAR, and WAW


• In-order execution:
– We need to deal with RAW only
• Out-of-order execution
– Now we need to deal with WAR and WAW hazards besides RAW
– Any of these hazards seems to prevent re-ordering instructions and
executing them out-of-order
27

Register Renaming
• WAR and WAW hazards can always be solved by simply choosing a
DIFFERENT register, since no data is being communicated; we were
simply "reusing" a register
– This is an example of a name dependency

WAR = Write After Read
    add $9, $8, $6
    lw  $48, 40($2)     # was $8
WAW = Write After Write
    add $9, $8, $6
    lw  $49, 40($2)     # was $9

• If we had 64 registers instead of 32 registers, then perhaps
the compiler might have used $48 instead of $8 and we could
have executed the second part of the code before the first part

First iteration:
    lw  $8, 40($2)
    add $8, $8, $8
    sw  $8, 40($2)
Second iteration (using alternate register, $48):
    lw  $48, 60($3)
    add $48, $48, $48
    sw  $48, 60($3)
28

Register Renaming
• Renaming requires more registers
• We have limited architectural registers
– Registers the instruction set is aware of
• We could have more physical registers
– Actual registers that are part of the register file

    lw  $8, 40($2)      # assume delayed
    add $8, $8, $8
    sw  $8, 40($2)

    lw  $8, 60($3)
    add $8, $8, $8
    sw  $8, 60($3)

– It is clear the compiler is using $8 as a temporary register
– If there is a delay in obtaining $2, the first part of the code cannot proceed
– Unfortunately, the second part of the code cannot proceed either, because of the
name dependency on $8
29

Increasing Number of Registers


• Can a later implementation provide 64
registers (instead of 32) while maintaining
binary compatibility with previously compiled
code?
• Answer: NO
• Why? Machine code has 5-bit fields for register IDs

R-Type field widths (bits): opcode=6 rs=5 rt=5 rd=5 shamt=5 func=6


30

Register Renaming
• Rather than creating new architectural registers, let
us internally provide multiple "versions" of the same
architectural register
– $8v1 = $8 version 1
– $8v2 = $8 version 2

    lw  $8v1, 40($2)
    add $8v2, $8v1, $8v1
    sw  $8v2, 40($2)

    lw  $8v3, 60($3)
    add $8v4, $8v3, $8v3
    sw  $8v4, 60($3)

[Figure: the single architectural register ("Arch. Reg") $8 maps to the physical registers ("Phys Reg") $8v1, $8v2, $8v3, and $8v4]
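
The versioning above follows a simple rule: every write creates a new version of the destination, and every read uses the newest version that exists at that point. A minimal sketch of that rule follows; the function name `rename` and the `NO_DEST` set are hypothetical, and real hardware tracks this with tags rather than strings.

```python
import re

REG = re.compile(r"\$\d+")
NO_DEST = {"sw", "beq", "bne"}   # opcodes that write no register (illustrative subset)

def rename(program):
    """Rewrite each register as '<reg>v<n>' where n is the version read or written."""
    version = {}    # architectural register -> latest version number
    renamed = []
    for line in program:
        op, rest = line.split(None, 1)
        regs = REG.findall(rest)
        dest = None if op in NO_DEST else regs[0]
        current = dict(version)             # versions visible to the sources
        if dest is not None:
            version[dest] = version.get(dest, 0) + 1
        position = [0]

        def replace(match):
            reg = match.group(0)
            is_dest = dest is not None and position[0] == 0
            position[0] += 1
            v = version[reg] if is_dest else current.get(reg)
            return reg if v is None else f"{reg}v{v}"   # never-written regs keep their name

        renamed.append(f"{op} {REG.sub(replace, rest)}")
    return renamed

PROGRAM = [
    "lw $8, 40($2)",
    "add $8, $8, $8",
    "sw $8, 40($2)",
    "lw $8, 60($3)",
    "add $8, $8, $8",
    "sw $8, 60($3)",
]

for line in rename(PROGRAM):
    print(line)
```

After renaming, the second group of three instructions no longer shares any register version with the first group, so it is free to run ahead.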
31

Tomasulo's Approach to Renaming


• Cannot change the number of architectural registers
• Instead we will perform Register Renaming through Tagging Registers
– This solves name dependency problems (WAR and WAW)
while attending to true dependency (RAW) through waiting
in queues
– Please be sure you understand this!
32

OoO Execution & Tomasulo's Algorithm


[Block diagram adapted from Prof. Michel Dubois (simplified for EE457): I-Cache -> Instruc. Queue -> Dispatch (with Reg. File and Register Status Table) -> Int., L/S, Mult., and Div Queues -> Issue Unit -> Integer/Branch, D-Cache, Div, and Mul units -> Common Data Bus]

• I-Cache / Instruc. Queue: Fetch multiple instructions per clock cycle in PROGRAM ORDER
(i.e. the normal order generated by the compiler)
• Dispatch: Decode & dispatch multiple instructions per cycle, tracking dependencies on earlier
instructions
• Register Status Table: Uses "tags" to track which instruction is the latest producer
(version) of a register (helps solve RAW, WAR, WAW dependencies)
• Queues: Instructions wait in queues until their respective functional unit (the
hardware that will compute their value) is free AND they have their data
available (from the instructions they depend upon). These act as
additional "physical registers"
• Common Data Bus: Results and TAGs of multiple instructions can be written back per cycle.
Results are broadcast to any instruction waiting for that result.
33

Tomasulo’s Algorithm
• Dispatch/Issue unit decodes and dispatches instructions
• Assign a binary code (aka TAG) to each instruction producing a register
value using the TAG FIFO
• Adds a Register Status Table (RST) that holds the TAG of the instruction that
is producing the LATEST version of each architectural register or NULL if the
LATEST version is in the register file
• The destination operand is represented by the TAG but not the actual
register name
• For source operands, an instruction carries either the values (if TAG is null in
RST) or TAGs of the operands (but not the actual register name)
• When an instruction executes and produces a result it broadcasts the result
and its destination TAG
– Any instruction waiting can compare its SRC tags with the destination tag and
grab the value if they match
– If entry in RST matches the TAG then this instruction is the latest producer of
the register and the value will be written to the register file
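
The bullets above can be condensed into a small behavioral model. This is a sketch, not the course's hardware design: the class name `Tagger`, its methods, and the operand encoding (`("tag", ...)` or `("val", ...)`) are made up for illustration, and a simple counter stands in for the TAG FIFO.

```python
class Tagger:
    """Toy model of dispatch-time tagging and result broadcast."""

    def __init__(self, regfile):
        self.rf = dict(regfile)   # register file: register -> value
        self.rst = {}             # register -> TAG of latest producer (absent = value is in RF)
        self.waiting = []         # dispatched instructions waiting in queues
        self.next_tag = 1         # stand-in for the TAG FIFO (assumption)

    def dispatch(self, op, dest, srcs):
        # Sources are looked up BEFORE the destination takes its new TAG.
        operands = []
        for reg in srcs:
            if reg in self.rst:
                operands.append(("tag", self.rst[reg]))
            else:
                operands.append(("val", self.rf[reg]))
        tag = None
        if dest is not None:
            tag = f"T{self.next_tag}"
            self.next_tag += 1
            self.rst[dest] = tag          # this instruction is now the latest producer
        entry = {"op": op, "tag": tag, "operands": operands}
        self.waiting.append(entry)
        return entry

    def broadcast(self, tag, value):
        # Waiting instructions grab the value if one of their source TAGs matches.
        for entry in self.waiting:
            entry["operands"] = [
                ("val", value) if operand == ("tag", tag) else operand
                for operand in entry["operands"]
            ]
        # The register file is written only if this TAG is still the latest producer.
        for reg, latest in list(self.rst.items()):
            if latest == tag:
                self.rf[reg] = value
                del self.rst[reg]

t = Tagger({"$2": 4, "$3": 100, "$8": 0, "$10": 16})
t.dispatch("sqrt", "$2", ["$10"])            # takes T1
print(t.dispatch("lw", "$8", ["$2"]))        # takes T2, waits on T1
print(t.dispatch("add", "$8", ["$8", "$8"])) # takes T3, waits on T2 twice
```

Because an older producer's TAG no longer matches the RST once a newer writer has dispatched, its late result updates waiting instructions but never the register file.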
34

Tagging process
Program (in program order):
    sqrt $2, $10
    lw   $8, 40($2)
    add  $8, $8, $8
    sw   $8, 40($2)
    lw   $8, 60($3)
    add  $8, $8, $8
    sw   $8, 60($3)

RST = Register Status Table (identifies the latest version of a reg.): entries $1 to $31, all empty
RF = Register File: entries $1 to $31
Issue Logic dispatches to three queues: INT ALU, INT MUL/DIV/SQRT, and Load/Store
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
35

Tagging process: CC1


RST (identifies the latest version of a reg.): $2 = T1
Queues:
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
Note: An instruction that will write to a destination register takes a TAG and
enters that TAG into the RST to track the latest version/producer.
36

Tagging process: CC2


RST: $2 = T1, $8 = T2
Queues:
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
    Load/Store:       T2: LW T1 / 40
37

Tagging process: CC3


RST: $2 = T1, $8 = T3 (replacing T2)
Queues:
    INT ALU:          T3: ADD T2 / T2
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
    Load/Store:       T2: LW T1 / 40
Note: The RST only stores the TAG of the LATEST producer/version. This solves WAR/WAW
hazards by not accepting a writeback unless it is from the latest producer.
38

Tagging process: CC4


RST: $2 = T1, $8 = T3
Queues:
    INT ALU:          T3: ADD T2 / T2
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
    Load/Store:       T2: LW T1 / 40
                      SW T3 / T1 / 40
39

Tagging process: CC5


RST: $2 = T1, $8 = T4 (replacing T3)
Queues:
    INT ALU:          T3: ADD T2 / T2
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
    Load/Store:       T2: LW T1 / 40
                      SW T3 / T1 / 40
                      T4: LW $3 val / 60
40

Tagging process: CC6


RST: $2 = T1, $8 = T5 (replacing T4)
Queues:
    INT ALU:          T3: ADD T2 / T2
                      T5: ADD T4 / T4
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
    Load/Store:       T2: LW T1 / 40
                      SW T3 / T1 / 40
                      T4: LW $3 val / 60
41

Tagging process: CC7


RST: $2 = T1, $8 = T5
Queues:
    INT ALU:          T3: ADD T2 / T2
                      T5: ADD T4 / T4
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
    Load/Store:       T2: LW T1 / 40
                      SW T3 / T1 / 40
                      T4: LW $3 val / 60
Result from Load/Store: T4: Read 0x1111


42

Tagging process: CC8


RST: $2 = T1, $8 = T5 => null
RF: $8 = 0x2222
Queues:
    INT ALU:          T3: ADD T2 / T2
                      T5: ADD 0x1111 / 0x1111
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
    Load/Store:       T2: LW T1 / 40
                      SW T3 / T1 / 40
Result from INT ALU, broadcast to the Issue Logic: T5: Sum 0x2222
Note: When the latest producer writes to a register, we reset the
RST entry to NULL (this indicates that the RF has the
latest value and issuing instructions can just take
that value from the RF).


43

Tagging process: CC9


RST: $2 = T1, $8 = null
RF: $8 = 0x2222
Queues:
    INT ALU:          T3: ADD T2 / T2
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
    Load/Store:       T2: LW T1 / 40
                      SW T3 / T1 / 40
                      SW 0x2222, $3 val / 60
Result from INT ALU: T5: Sum 0x2222


44

Tagging process: CC10


RST: $2 = T1 => null, $8 = null
RF: $8 = 0x2222
Queues:
    INT ALU:          T3: ADD T2 / T2
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
    Load/Store:       T2: LW T1 / 40
                      SW T3 / T1 / 40
Result from INT MUL/DIV/SQRT, broadcast to the Issue Logic: T1: SQRT 0xacd0


45

Tagging process: CC11


RST: all entries null
RF: $2 = 0xacd0, $8 = 0x2222
Queues:
    INT ALU:          T3: ADD T2 / T2
    Load/Store:       T2: LW 0xacd0 / 40
                      SW T3 / 0xacd0 / 40
Result from Load/Store: T2: Read 0x5678
Note: Since the RST entry for $8 is NULL, the RF will not update
when LW attempts to write back.


46

Tagging process: CC12


RST: all entries null
RF: $2 = 0xacd0, $8 = 0x2222
Queues:
    INT ALU:          T3: ADD 0x5678 / 0x5678
    Load/Store:       SW T3 / 0xacd0 / 40
Result from INT ALU: T3: Sum 0xACF0


47

Tagging process: CC13


RST: all entries null
RF: $2 = 0xacd0, $8 = 0x2222
Queues:
    Load/Store:       SW 0xacf0 / 0xacd0 / 40
48

Register Renaming
Program (in program order):
    sqrt $2, $10
    add  $2, $2, $2
    add  $2, $2, $2
    add  $2, $2, $2
    add  $2, $2, $2

RST = Register Status Table: $2 = T1, then T2, T3, T4 as later producers dispatch
RF = Register File
Queues:
    INT ALU:          T2: ADD T1 / T1
                      T3: ADD T2 / T2
                      T4: ADD T3 / T3
    INT MUL/DIV/SQRT: T1: SQRT $2 Val / $10 Val
49

Unique TAGs
• Like SSN, we need a unique TAG
• SSNs are reused
• Similarly, TAGs can be reused
• TAGs are similar to numbered TOKENs
– A numbered token usually helps to create a virtual queue. We do not need that here
– In State Bank of India, the cashier issues a brass token to customers trying to draw
money as an ID (and not at all to put them in any virtual queue / ordering). Token
numbers are in random order.
– The cashier verifies the signature in the record rooms, returns with money, calls the
token number and issues the money.
– Tokens are reclaimed & reused.
50

Tags (= Tokens)
• How many tokens should the bank cashier
have to start with?
• What happens if the tokens run out?
• Does the cashier need to have any order in
holding tokens and issuing tokens?
• Do they have to collect the tokens back?
51

TAG FIFO
• To issue and collect tokens (TAGs) use a circular FIFO (First-In/First-Out) unit
(FIFOs are taught in EE 560)
– While the FIFO order is not important here, a FIFO is the easiest to
implement in hardware compared to a random order in a pile
• Filled with (say) 64 tokens (in any order) initially on reset
• Tokens return in any order
• Put tokens back in the FIFO and reissue

[Figure: three snapshots of a 64-entry TAG FIFO (entries 0 to 63) with write pointer wp and read pointer rp. FULL: wp and rp both at entry 0. 2 tokens issued: wp at entry 0, rp at entry 2. 1 token returned: wp at entry 1, rp at entry 2.]
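
A circular FIFO of this kind can be sketched in a few lines. The class name `TagFifo` and its method names are hypothetical, and the explicit `count` field is just one of several ways to tell a full FIFO from an empty one when rp equals wp.

```python
class TagFifo:
    """Circular FIFO of TAGs: issue from the read pointer, return at the write pointer."""

    def __init__(self, size=64):
        self.size = size
        self.slots = list(range(size))   # filled with all tokens on reset
        self.rp = 0                      # next token to issue
        self.wp = 0                      # next slot for a returned token
        self.count = size                # tokens currently available

    def issue(self):
        """Hand out a TAG, or None if none are left (dispatch would have to stall)."""
        if self.count == 0:
            return None
        tag = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.size
        self.count -= 1
        return tag

    def give_back(self, tag):
        """Return a TAG; tokens may come back in any order."""
        if self.count == self.size:
            raise ValueError("FIFO is already full")
        self.slots[self.wp] = tag
        self.wp = (self.wp + 1) % self.size
        self.count += 1

fifo = TagFifo(64)
first, second = fifo.issue(), fifo.issue()
fifo.give_back(second)      # returned out of order; it will be reissued later
print(fifo.rp, fifo.wp)
```

After two issues and one return the pointers sit at rp = 2 and wp = 1, matching the third snapshot in the figure.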
52

Organization for OoO Execution


[Block diagram adapted from Prof. Michel Dubois (simplified for EE 457): I-Cache -> Instruc. Queue -> Dispatch (with TAG FIFO, Reg. File, and Register Status Table) -> Int., L/S, Mult., and Div Queues -> Issue Unit -> Integer/Branch, D-Cache, Div, and Mul units -> CDB]
53

Front-End & Back-End


• IFQ (Instruction Fetch Queue)
– A FIFO structure
• Dispatch (Issue) Unit
– Includes RST, RF, Tag FIFO
• Load/Store and other Issue Queues
• Issue Units
• Functional units
• CDB (Common Data Bus)
– Like a public address system that everyone can see/hear
when data is produced
54

More Tomasulo Algorithm


• Front End
– Instructions are fetched
– They are stored in a FIFO (IFQ)
– When an instruction reaches the head of the IFQ it is
• Decoded
• Dispatched to an issue queue/functional unit
• Even if some of the inputs are not ready (takes TAGs)
• Back End
– Instructions in issue queues wait for their input operands
– Once register operands are ready instructions can be scheduled for execution provided
they will not conflict for the CDB or their functional unit
– Instructions execute in their functional unit and their result is put on the CDB
– All instructions in queues and the register file “watch” the CDB and grab the value they
are waiting for when it is produced
• Bottleneck in Tomasulo's algorithm?
– The CDB!!!
– Do all instructions use the CDB? No, not SW, J (jump), BEQ
55

Data hazards and memory

MEMORY DISAMBIGUATION
56

Load/Store Queue (LSQ)


• For our course, the LSQ performs
– Address calculation
– Memory disambiguation
• RAW, WAR, WAW hazards due to memory reads and
writes

// Is there a dependency here?


SW $2,0($5)
LW $8,0($5)
// What about here?
SW $2, 1000($4)
LW $3, 0($6)
57

Memory Disambiguation
• Data hazards (RAW, WAR, WAW) can occur in memory just as
with registers, and hazards in memory are much harder to deal
with since many combinations could produce the same address

RAW
    sw $2, 2000($0)
    lw $8, 2000($0)
This later lw can proceed only if there is no store ahead of it with the same address

WAW
    sw $2, 2000($0)
    sw $8, 2000($0)
This later sw can proceed only if there is no store ahead of it with the same address

WAR
    lw $2, 2000($0)
    sw $8, 2000($0)
This later sw can proceed only if there is no load ahead of it with the same address
58

Address Calculation for LW/SW


• EE 557 approach for address calculation
– Loads & store in 2 sub-instructions
• 1 instruction computes address and is dispatched to
integer ALU
• 1 instruction access data cache and is issued to LSQ
• Address is communicated from integer ALU to LSQ via
CDB forwarding using a tag
• EE 560/457 approach
– Use a dedicated adder in the LSQ to compute
address (so just 1 dispatched instruction)
59

Memory Disambiguation
• When can the LSQ issue a LW or SW to the cache?
– Loads can issue to the cache when their address is ready
– Stores can issue to the cache when both address & data are ready
– Memory hazards (RAW, WAR, WAW) are resolved in the LSQ
• A load can issue to the cache if no store with the same address is before it
• A store can issue to the cache if no store or load with the same address is before it
• Otherwise, the access waits in the LSQ
– If an address is unknown it is assumed to be the same
• Worst case, to enforce correctness
– The process of figuring out and comparing memory addresses is called
“disambiguation”
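
The issue rules above can be written as a single check over the older entries in the queue. This is a sketch under assumptions: the function name `can_issue` and the dictionary fields (`kind`, `addr`, `data_ready`) are invented for the example, and an `addr` of `None` stands for an address that has not been computed yet.

```python
def can_issue(lsq, index):
    """Decide whether lsq[index] may go to the cache.

    lsq is a list of entries in program order (oldest first). Each entry is a
    dict with "kind" ("lw" or "sw"), "addr" (int, or None if not yet known),
    and "data_ready" (used for stores only).
    """
    me = lsq[index]
    if me["addr"] is None:
        return False                      # own address not ready
    if me["kind"] == "sw" and not me["data_ready"]:
        return False                      # a store also needs its data
    for older in lsq[:index]:
        if me["kind"] == "lw" and older["kind"] == "lw":
            continue                      # two loads never conflict
        # Unknown older addresses are assumed to match (worst case).
        if older["addr"] is None or older["addr"] == me["addr"]:
            return False
    return True

LSQ = [
    {"kind": "sw", "addr": 2000, "data_ready": False},
    {"kind": "lw", "addr": 2000, "data_ready": True},   # RAW on the store above
    {"kind": "lw", "addr": 3000, "data_ready": True},   # independent
    {"kind": "sw", "addr": None, "data_ready": True},   # address unknown
    {"kind": "lw", "addr": 4000, "data_ready": True},   # blocked by unknown store address
]

print([can_issue(LSQ, i) for i in range(len(LSQ))])
```

The last load shows the cost of the conservative rule: it waits even though its address may turn out to be different from the store's.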
60

Issue Queue priority, Branches, etc.

LAST CONSIDERATIONS FOR OUT-OF-ORDER EXECUTION/COMPLETION
61

Issue Unit
• How do we determine when to issue an instruction to the
functional unit?
– Is the instruction ready?
– Is the functional unit free to start the operation?
– CDB availability constraint
• Will there be room on the CDB when the operation finishes?
– Priority/conflict resolution
• If many instructions are ready, which should be chosen? (Is round-
robin priority adequate?)

How do we prioritize
instructions that are ready?
62

Issue Queue Priority


• Priority (based on the order of arrival among
ready instructions)
– Is it necessary or just desirable?
– Local priority within queues?
– Global priority across the queues?

How do we prioritize
instructions that are ready?
63

LSQ Ordering/Priority
• Maintaining instructions in the order of arrival
– Issue order/program order in a queue
• Is this necessary and/or desirable?
– In the case of LSQ?
• Necessary! To enforce memory disambiguation
– In the case of Integer, MUL, DIV queues?
• Desirable, so that an earlier instruction gets executed
whenever possible, thereby reducing queue pressure
from too many instructions waiting on it
64

Conditional Branches
• Dispatcher stalls when it reaches a branch (and waits until it is resolved)
• Branches are dispatched to integer queue where they wait for their
operands (if necessary)
• When branch executes it puts its outcome & target on CDB
– If untaken, dispatch unit resumes
– If taken, then dispatch flushes the IFQ and resumes at the target
• Since we stop dispatching instructions after a branch, does it mean that
this branch is the last instruction to be executed in the back-end?
• Is it possible that the back-end simultaneously holds
– A. Some instructions dispatched before the branch .. AND ..
– B. Some instructions issued after the branch

    ADD $4,$5,$5
    BEQ $6,$7,L1
    ...
L1: SUB $1,$2,$3
    MUL $9,$7,$2
65

Structural Hazards + Exceptions


• Structural Stalls
– Dispatch must stall if IFQ empty OR all
entries in the desired functional unit’s
issue queue are occupied AND an
instruction of that type is attempting to
dispatch
– Fetch unit must stall if the IFQ is full
– Functional units stall when no ready
instructions in the queue or CDB
scheduling conflicts
• Precise exceptions not supported
– Some instructions after the offending
instruction may have updated registers
or memory! BAD!
– We'll handle this in the next unit
66

BACKUP
67

Tagging Registers: CC1


Orange means dispatched; SQRT is a long-latency computation

Instruction         Destination TAG   Dependent source TAGs
sqrt $2, $10        DOG
lw   $8, 40($2)
add  $8, $8, $8
sw   $8, 40($2)
lw   $8, 60($3)
add  $8, $8, $8
sw   $8, 60($3)

RST = Register Status Table: $2 = DOG
RF = Register File
68

Tagging Registers: CC2


Orange means dispatched; SQRT is a long-latency computation

Instruction         Destination TAG   Dependent source TAGs
sqrt $2, $10        DOG
lw   $8, 40($2)     LION              DOG
add  $8, $8, $8
sw   $8, 40($2)
lw   $8, 60($3)
add  $8, $8, $8
sw   $8, 60($3)

RST = Register Status Table: $2 = DOG, $8 = LION
RF = Register File
69

Tagging Registers: CC3


Orange means dispatched; SQRT is a long-latency computation

Instruction         Destination TAG   Dependent source TAGs
sqrt $2, $10        DOG
lw   $8, 40($2)     LION              DOG
add  $8, $8, $8     TIGER             LION, LION
sw   $8, 40($2)
lw   $8, 60($3)
add  $8, $8, $8
sw   $8, 60($3)

RST = Register Status Table: $2 = DOG, $8 = TIGER
RF = Register File
70

Tagging Registers: CC4


Orange means dispatched; SQRT is a long-latency computation

Instruction         Destination TAG   Dependent source TAGs
sqrt $2, $10        DOG
lw   $8, 40($2)     LION              DOG
add  $8, $8, $8     TIGER             LION, LION
sw   $8, 40($2)                       TIGER
lw   $8, 60($3)
add  $8, $8, $8
sw   $8, 60($3)

RST = Register Status Table: $2 = DOG, $8 = TIGER
RF = Register File
71

Tagging Registers Review


Instruction         Destination TAG   Dependent source TAGs
sqrt $2, $10        DOG
lw   $8, 40($2)     LION              DOG
add  $8, $8, $8     TIGER             LION, LION
sw   $8, 40($2)                       TIGER
lw   $8, 60($3)
add  $8, $8, $8
sw   $8, 60($3)

RST: $2 = DOG, $8 = TIGER

• Dispatch unit decodes and dispatches instructions
• For the destination operand, an instruction carries a
TAG (but not the actual register name)
• For source operands, an instruction carries either
the values (if no TAG in RST) or TAGs of the
operands (but not the actual register name)
• When
72

Organization for OoO Execution


[Block diagram adapted from Prof. Michel Dubois (simplified for EE 457): I-Cache -> Instruc. Queue -> Dispatch (with TAG FIFO, Reg. File, and Register Status Table) -> Int., L/S, Mult., and Div Queues -> Issue Unit -> Integer/Branch, D-Cache, Div, and Mul units -> CDB]
73

Multiple Functional Units


• We now provide multiple functional units
• After decode, issue to a queue, stalling if the unit is busy or
waiting for data dependency to resolve

[Figure: IM and Reg feed Queues + Functional Units (ALU, MUL, DIV, DMEM (Cache)), followed by Reg]
74

Multiple Functional Units


• We now provide multiple functional units
• After decode, issue to a queue, stalling if the unit is busy or
waiting for data dependency to resolve

[Figure: IM and Reg feed Queues + Functional Units (ALU, MUL, DIV, DM (Cache)), followed by DM and Reg]
75

Where to Stall?
• But to implement OoO execution, we cannot stall in the decode stage
since that would prevent any further issuing of instructions
• Thus, now we will issue to queues for each of the multiple functional units
and have the instruction stall in the queue until it is ready

[Figure: IM and Reg feed Queues + Functional Units (ALU, MUL, DIV, Addr Calc.), followed by DM and Reg, with the annotation "Stalling here would plug up the pipeline"]
76

Functional Unit Latencies


Pipeline stages per functional unit:
    Int. ALU, Addr. Calc.:  EX
    FP Add:                 A1 A2 A3 A4
    Int. & FP MUL:          M1 M2 M3 M4 M5 M6 M7
    Int. & FP DIV

• An added complication of out-of-order execution & completion: WAW & WAR hazards
• Look Ahead: Tomasulo's algorithm will help absorb the latency of different functional
units and cache miss latency by allowing other ready instructions to proceed out-of-order

Latency = required stall cycles between dependent [RAW] instructions
Initiation Interval = distance between 2 independent instructions requiring the same FU

Functional Unit   Latency   Initiation Interval
Integer ALU       0         1
FP Add            3         1
FP Mul.           6         1
FP Div.           24        25
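
One way to read the two columns is as simple cycle arithmetic. The sketch below assumes that a dependent instruction can start `1 + latency` cycles after its producer starts, and that an independent instruction can start on the same unit `initiation interval` cycles after the previous one; this reading and the helper names are assumptions for illustration.

```python
# (latency, initiation interval) per functional unit, copied from the table.
FUNCTIONAL_UNITS = {
    "Integer ALU": (0, 1),
    "FP Add": (3, 1),
    "FP Mul": (6, 1),
    "FP Div": (24, 25),
}

def earliest_dependent_start(unit, producer_start):
    """Earliest start cycle of an instruction that needs the producer's result."""
    latency, _ = FUNCTIONAL_UNITS[unit]
    return producer_start + 1 + latency   # latency = stall cycles in between

def earliest_next_use(unit, previous_start):
    """Earliest start cycle of an independent instruction on the same unit."""
    _, interval = FUNCTIONAL_UNITS[unit]
    return previous_start + interval

for unit in FUNCTIONAL_UNITS:
    print(unit, earliest_dependent_start(unit, 1), earliest_next_use(unit, 1))
```

Under this reading the multiplier accepts a new independent multiply every cycle even though each one takes several cycles, whereas the divider is occupied for the whole operation.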
77

OoO Execution w/ ROB


• ROB allows for OoO execution but in-order completion

[Block diagram: I-Cache -> Instruc. Queue -> Dispatch (with Br. Pred. Buffer, Reg. File, and ROB (Reorder Buffer)) -> Int., L/S, Mult., and Div Queues -> Issue Unit -> Integer/Branch, D-Cache (with Addr. Buffer and L/S Buffer), Div, and Mul Exec. Units -> CDB. Annotation: "Exceptions? No problem"]
