CS 6290 Instruction Level Parallelism
• Basic idea:
Execute several instructions in parallel
• We already do pipelining…
– But it can only push through at most 1 inst/cycle
• We want multiple instr/cycle
– Yes, it gets a bit complicated
•More transistors/logic
– That’s how we got from 486 (pipelined)
to Pentium and beyond
Is this Legal?!?
• ISA defines instruction execution one by one
– I1: ADD R1 = R2 + R3
•fetch the instruction
•read R2 and R3
•do the addition
•write R1
•increment PC
– Now repeat for I2
• Darth Sidious: Begin landing your troops.
Nute Gunray: Ah, my lord, is that... legal?
Darth Sidious: I will make it legal.
It’s legal if we don’t get caught…
• How about pipelining?
– already breaks the “rules”
•we fetch I2 before I1 has finished
[Figure: “Before Toll Booth” and “After Toll Booth” panels, captioned “You Didn’t See That…”]
Illusion of Sequentiality
• So long as everything looks OK to the outside
world you can do whatever you want!
– “Outside Appearance” = “Architecture” (ISA)
– “Whatever you want” = “Microarchitecture”
[Figure: two side-by-side pipelines; surviving stage labels: Execute, Writeback]
Repeat Example for Pentium-like CPU
• A: ADD R1 = R2 + R3
• B: SUB R4 = R1 – R5
• C: XOR R6 = R7 ^ R8
• D: Store R6 0[R4]
• E: MUL R3 = R5 * R9
• F: ADD R7 = R1 + R6
• G: SHL R8 = R7 << R4
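As a rough illustration of how the A–G listing might flow through a two-wide in-order machine, the sketch below pairs each instruction with its successor unless the successor reads or rewrites the first one's destination. The pairing rule, the single-cycle latency, and the name `dual_issue` are simplifying assumptions, not the actual Pentium pairing rules.

```python
# Each entry: (name, destination register or None, source registers).
# Operands are read off the A-G listing; D is a store, so it has no
# register destination.
PROGRAM = [
    ("A", "R1", ("R2", "R3")),
    ("B", "R4", ("R1", "R5")),
    ("C", "R6", ("R7", "R8")),
    ("D", None, ("R6", "R4")),
    ("E", "R3", ("R5", "R9")),
    ("F", "R7", ("R1", "R6")),
    ("G", "R8", ("R7", "R4")),
]

def dual_issue(program):
    """Group instructions into issue cycles for a hypothetical 2-wide in-order core.

    Assumes every instruction completes in one cycle, so only dependences
    inside a candidate pair can block pairing.
    """
    cycles = []
    i = 0
    while i < len(program):
        group = [program[i]]
        if i + 1 < len(program):
            _, first_dest, _ = program[i]
            _, second_dest, second_srcs = program[i + 1]
            raw = first_dest is not None and first_dest in second_srcs
            waw = first_dest is not None and first_dest == second_dest
            if not (raw or waw):
                group.append(program[i + 1])
        cycles.append([name for name, _, _ in group])
        i += len(group)
    return cycles

schedule = dual_issue(PROGRAM)
for cycle, names in enumerate(schedule, start=1):
    print(cycle, names)
```

Under these assumptions the seven instructions issue in five cycles: A alone, then B with C, D with E, F alone, and G alone.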
This is “Superscalar”
• “Scalar” CPU executes one inst at a time
– includes pipelined processors
• “Vector” CPU executes one inst at a time, but
on vector data
– X[0:7] + Y[0:7] is one instruction, whereas on a
scalar processor, you would need eight
• “Superscalar” can execute more than one
unrelated instruction at a time
– ADD X + Y, MUL W * Z
Scheduling
• Memory dependencies
– Based on memory address
– This is harder
•Register names known at decode
•Memory addresses not known until execute
Hazards
• When two instructions that have one or more
dependences between them occur close
enough that changing the instruction order
will change the outcome of the program
I1: R2 = 17
I2: R1 = 49
I3: R3 = -8
I4: R5 = LOAD 0[R3]
I5: R4 = R1 + R2
I6: R7 = R4 – R3
I7: R6 = R4 * R5
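To see how much parallelism the I1–I7 listing allows, the sketch below computes the earliest cycle in which each instruction could execute if only RAW register dependences mattered. Unit latency for every instruction (including the LOAD), unlimited issue width, and the name `earliest_cycles` are assumptions made for illustration.

```python
# Each entry: (name, destination register, source registers).
PROGRAM = [
    ("I1", "R2", ()),
    ("I2", "R1", ()),
    ("I3", "R3", ()),
    ("I4", "R5", ("R3",)),        # LOAD 0[R3] reads R3 for its address
    ("I5", "R4", ("R1", "R2")),
    ("I6", "R7", ("R4", "R3")),
    ("I7", "R6", ("R4", "R5")),
]

def earliest_cycles(program):
    """Earliest execution cycle per instruction, limited only by RAW dependences."""
    produced = {}   # register -> cycle in which its producer executes
    start = {}
    for name, dest, srcs in program:
        # An instruction can run one cycle after its latest producer.
        cycle = 1 + max((produced.get(r, 0) for r in srcs), default=0)
        start[name] = cycle
        produced[dest] = cycle
    return start

start = earliest_cycles(PROGRAM)
for name, cycle in start.items():
    print(name, cycle)
```

With these assumptions I1, I2 and I3 can run in cycle 1, I4 and I5 in cycle 2, and I6 and I7 in cycle 3, so the dependence chain is three instructions long even though the program has seven.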
Dynamic (Out-of-Order) Scheduling
– Data dependencies
– Control dependencies
– Memory dependencies
Types of Data Dependencies
(Assume A comes before B in program order)
• RAW (Read-After-Write)
– A writes to a location, B reads from the location,
therefore B has a RAW dependency on A
– Also called a “true dependency”
Data Dep’s (cont’d)
• WAR (Write-After-Read)
– A reads from a location, B writes to the location,
therefore B has a WAR dependency on A
– If B executes before A has read its operand, then
the operand will be lost
– Also called an anti-dependence
Data Dep’s (cont’d)
• WAW (Write-After-Write)
– A writes to a location, B writes to the same
location, therefore B has a WAW dependency on A
– If B writes first, then A writes, the location will end
up with the wrong value
– Also called an output-dependence
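The three definitions can be turned into a small check. The sketch below is a minimal illustration, assuming each instruction is described by one destination register and a tuple of source registers; the function name `classify` is hypothetical, and the instruction pairs are taken from examples elsewhere in these slides.

```python
def classify(a, b):
    """Return the dependences B has on A, where A precedes B in program order.

    Each instruction is a (destination, sources) pair of register names.
    """
    a_dest, a_srcs = a
    b_dest, b_srcs = b
    deps = set()
    if a_dest in b_srcs:
        deps.add("RAW")   # B reads what A writes
    if b_dest in a_srcs:
        deps.add("WAR")   # B overwrites what A reads
    if a_dest == b_dest:
        deps.add("WAW")   # both write the same location
    return deps

# A: ADD R1 = R2 + R3, B: SUB R4 = R1 - R5
raw_pair = classify(("R1", ("R2", "R3")), ("R4", ("R1", "R5")))
# A: R1 = R3 / R4, B: R3 = R2 * R4
war_pair = classify(("R1", ("R3", "R4")), ("R3", ("R2", "R4")))
# A: R1 = R2 + R3, B: R1 = R3 * R4
waw_pair = classify(("R1", ("R2", "R3")), ("R1", ("R3", "R4")))

print(raw_pair, war_pair, waw_pair)
```

A pair can carry more than one dependence at once, which is why the function returns a set rather than a single label.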
Control Dependencies
• If we have a conditional branch, until we
actually know the outcome, all later
instructions must wait
– That is, all instructions are control dependent on
all earlier branches
– This is true for unconditional branches as well
(e.g., can’t return from a function until we’ve
loaded the return address)
Memory Dependencies
• Basically similar to regular (register) data
dependencies: RAW, WAR, WAW
• However, the exact location is not known:
– A: STORE R1, 0[R2]
– B: LOAD R5, 24[R8]
– C: STORE R3, -8[R9]
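Whether A, B and C above depend on each other is decided by the run-time values of R2, R8 and R9. The sketch below computes effective addresses for two hypothetical sets of register values and reports the resulting memory dependences. The register values, the name `memory_dependences`, and the simplification that two accesses conflict only when their addresses are equal (same-size, aligned accesses) are assumptions.

```python
# (name, kind, offset, base register) for the three memory operations.
MEM_OPS = [
    ("A", "store", 0, "R2"),
    ("B", "load", 24, "R8"),
    ("C", "store", -8, "R9"),
]

KIND_TO_DEP = {
    ("store", "load"): "RAW",
    ("load", "store"): "WAR",
    ("store", "store"): "WAW",
}

def memory_dependences(ops, regs):
    """List (earlier, later, type) for every pair that touches the same address."""
    addrs = {name: regs[base] + offset for name, _, offset, base in ops}
    deps = []
    for i, (early, early_kind, _, _) in enumerate(ops):
        for late, late_kind, _, _ in ops[i + 1:]:
            dep = KIND_TO_DEP.get((early_kind, late_kind))
            if dep is not None and addrs[early] == addrs[late]:
                deps.append((early, late, dep))
    return deps

# Hypothetical register contents: B happens to load what A stored.
case1 = memory_dependences(MEM_OPS, {"R2": 100, "R8": 76, "R9": 500})
# Different contents, same code: now C overwrites A's location instead.
case2 = memory_dependences(MEM_OPS, {"R2": 100, "R8": 200, "R9": 108})

print(case1)
print(case2)
```

The same three instructions yield a RAW dependence in one case and a WAW dependence in the other, which is why this check cannot be done at decode the way register checks can.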
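Effect of execution order on register values for each dependence type.
Columns in each group: initial value, after the first executed instruction, after the second. A: or B: marks the instruction that wrote the value.

A executes before B (program order):
    RAW                 WAR                 WAW
R1  5     A:7   7       5     A:3   3       5     A:7   B:27
R2  -2    -2    -2      -2    -2    -2      -2    -2    -2
R3  9     9     9       9     9     B:-6    9     9     9
R4  3     3     B:21    3     3     3       3     3     3

B executes before A (reordered):
    RAW                 WAR                 WAW
R1  5     5     A:7     5     5     A:-2    5     B:27  A:7
R2  -2    -2    -2      -2    -2    -2      -2    -2    -2
R3  9     9     9       9     B:-6  -6      9     9     9
R4  3     B:15  15      3     3     3       3     3     3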
Eliminating WAR Dependencies
• WAR dependencies are from reusing registers
  Original:             Renamed (B writes R5 instead of R3):
  A: R1 = R3 / R4       A: R1 = R3 / R4
  B: R3 = R2 * R4       B: R5 = R2 * R4

    Original, A then B    Original, B then A    Renamed, B then A
R1  5     A:3   3         5     5     A:-2      5     5     A:3
R2  -2    -2    -2        -2    -2    -2        -2    -2    -2
R3  9     9     B:-6      9     B:-6  -6        9     9     9
R4  3     3     3         3     3     3         3     3     3
R5                                              4     B:-6  -6
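• WAW dependencies are removed the same way
  Original:             Renamed (A writes R5 instead of R1):
  A: R1 = R2 + R3       A: R5 = R2 + R3
  B: R1 = R3 * R4       B: R1 = R3 * R4

    Original, A then B    Original, B then A    Renamed, B then A
R1  5     A:7   B:27      5     B:27  A:7       5     B:27  27
R2  -2    -2    -2        -2    -2    -2        -2    -2    -2
R3  9     9     9         9     9     9         9     9     9
R4  3     3     3         3     3     3         3     3     3
R5                                              4     4     A:7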
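The register values on these slides can be reproduced by executing A and B in either order, with and without the renamed destination. The sketch below is one way to do that; the helper `run` and the tuple encoding are assumptions, and integer division is used because both divisions in the example are exact.

```python
from operator import add, floordiv, mul

# Initial register values used in the slide examples.
INITIAL = {"R1": 5, "R2": -2, "R3": 9, "R4": 3, "R5": 4}

def run(instructions):
    """Execute (dest, op, src1, src2) tuples in the order given."""
    regs = dict(INITIAL)
    for dest, op, src1, src2 in instructions:
        regs[dest] = op(regs[src1], regs[src2])
    return regs

# WAR example: A: R1 = R3 / R4, B: R3 = R2 * R4 (B renamed to write R5).
war_a = ("R1", floordiv, "R3", "R4")
war_b = ("R3", mul, "R2", "R4")
war_b_renamed = ("R5", mul, "R2", "R4")

# WAW example: A: R1 = R2 + R3, B: R1 = R3 * R4 (A renamed to write R5).
waw_a = ("R1", add, "R2", "R3")
waw_b = ("R1", mul, "R3", "R4")
waw_a_renamed = ("R5", add, "R2", "R3")

print(run([war_a, war_b])["R1"])          # 3: program order
print(run([war_b, war_a])["R1"])          # -2: reordered, A reads the wrong R3
print(run([war_b_renamed, war_a])["R1"])  # 3: reordered but renamed

print(run([waw_a, waw_b])["R1"])          # 27: program order
print(run([waw_b, waw_a])["R1"])          # 7: reordered, stale value wins
print(run([waw_b, waw_a_renamed])["R1"])  # 27: reordered but renamed
```

In both renamed cases the reordered execution leaves R1 with the same value as program order; later readers of the renamed result would have to read R5 instead.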
Register Renaming
Program code (original, then renamed):
  I1: ADD R1, R2, R3      I1: ADD R1, R2, R3
  I2: SUB R2, R1, R6      I2: SUB S, R1, R6
  I3: AND R6, R11, R7     I3: AND U, R11, R7
  I4: OR  R8, R5, R2      I4: OR  R8, R5, S
  I5: XOR R2, R4, R11     I5: XOR T, R4, R11
Dependences in the original code: RAW on R1 (I1→I2) and R2 (I2→I4); WAR on R2 (I1→I2, I4→I5) and R6 (I2→I3); WAW on R2 (I2→I5)
• Solution:
  Let’s give I2 a temporary name/location (e.g., S) for the value it produces.
• But I4 uses that value, so we must also change that to S…
• In fact, all uses of R2 from I2 to the next instruction
that writes to R2 again must now be changed to S!
• We remove WAW deps in the same way: change R2
in I5 (and subsequent instrs) to T.
Register Renaming
• Implementation
• Simple Solution
– Do renaming for every instruction
– Change the name of a register
each time we decode an
instruction that will write to it.
– Remember what name we gave it ☺
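The "rename every instruction" rule can be sketched on the I1–I5 program from the previous slide. Every destination gets a fresh name at decode, and each source is looked up in a table of the most recent names. The fresh names P1, P2, … are hypothetical stand-ins for the slide's S, T, U, and the function name `rename` is an assumption.

```python
from itertools import count

# (label, opcode, destination, sources), destination-first as on the slide.
PROGRAM = [
    ("I1", "ADD", "R1", ("R2", "R3")),
    ("I2", "SUB", "R2", ("R1", "R6")),
    ("I3", "AND", "R6", ("R11", "R7")),
    ("I4", "OR", "R8", ("R5", "R2")),
    ("I5", "XOR", "R2", ("R4", "R11")),
]

def rename(program):
    """Give every destination a fresh name and redirect later readers to it."""
    current = {}                          # original name -> most recent new name
    fresh = (f"P{n}" for n in count(1))   # hypothetical supply of new names
    renamed = []
    for label, op, dest, srcs in program:
        # Sources must be translated before the destination is remapped,
        # otherwise an instruction like R2 = R1 + R2 would read its own output.
        new_srcs = tuple(current.get(r, r) for r in srcs)
        current[dest] = next(fresh)
        renamed.append((label, op, current[dest], new_srcs))
    return renamed

renamed = rename(PROGRAM)
for label, op, dest, srcs in renamed:
    print(f"{label}: {op} {dest}, {', '.join(srcs)}")
```

After renaming no two instructions share a destination and no instruction writes a name that an earlier one reads, so only the RAW dependences (I1 to I2, I2 to I4) remain.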
Register File Organization
• We need some physical structure to store the
register values
• Architected Register File (ARF)
– “Outside” world sees the ARF
• Register Alias Table (RAT)
• Physical Register File (PRF)
– One PREG per instruction in-flight
Putting it all Together
top:
• R1 = R2 + R3
• R2 = R4 – R1
• R1 = R3 * R6
• R2 = R1 + R2
• R3 = R1 >> 1
• BNEZ R3, top
Free pool: X9, X11, X7, X2, X13, X4, X8, X12, X3, X5…
ARF: R1, R2, R3, R4, R5, R6
PRF: X1 through X16
RAT (initial): R1→R1, R2→R2, R3→R3, R4→R4, R5→R5, R6→R6
Renaming in action
Original code        Renamed code (blanks are filled in as renaming proceeds)
R1 = R2 + R3         __ = R2 + R3
R2 = R4 – R1         __ = R4 – __
R1 = R3 * R6         __ = R3 * R6
R2 = R1 + R2         __ = __ + __
R3 = R1 >> 1         __ = __ >> 1
BNEZ R3, top         BNEZ __, top
R1 = R2 + R3         __ = __ + __
R2 = R4 – R1         __ = R4 – __
R1 = R3 * R6         __ = __ * R6
R2 = R1 + R2         __ = __ + __
R3 = R1 >> 1         __ = __ >> 1
BNEZ R3, top         BNEZ __, top
Free pool: X9, X11, X7, X2, X13, X4, X8, X12, X3, X5…
ARF: R1, R2, R3, R4, R5, R6
PRF: X1 through X16
RAT (initial): R1→R1, R2→R2, R3→R3, R4→R4, R5→R5, R6→R6
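The renaming walk-through can be reproduced with a RAT and the free pool from the slide. The sketch below is a minimal model: a destination takes the next free physical register and updates the RAT, and a source is translated through the RAT, or left as its ARF name if it has not been renamed yet. Unrolling two iterations assumes the branch is taken once; the function name `rename` and the template encoding are assumptions.

```python
# Free pool in the order shown on the slide (the slide's list continues past X5).
FREE_POOL = ["X9", "X11", "X7", "X2", "X13", "X4", "X8", "X12", "X3", "X5"]

# (destination or None, sources, output template) for one loop iteration.
LOOP = [
    ("R1", ("R2", "R3"), "{d} = {s[0]} + {s[1]}"),
    ("R2", ("R4", "R1"), "{d} = {s[0]} - {s[1]}"),
    ("R1", ("R3", "R6"), "{d} = {s[0]} * {s[1]}"),
    ("R2", ("R1", "R2"), "{d} = {s[0]} + {s[1]}"),
    ("R3", ("R1",), "{d} = {s[0]} >> 1"),
    (None, ("R3",), "BNEZ {s[0]}, top"),
]

def rename(program, free_pool, iterations=2):
    """Rename `iterations` copies of the loop body; return the code and final RAT."""
    rat = {}                 # architectural -> physical; absent means "still in ARF"
    free = list(free_pool)
    out = []
    for _ in range(iterations):
        for dest, srcs, template in program:
            # Translate sources first, then claim a new register for the result.
            phys_srcs = tuple(rat.get(r, r) for r in srcs)
            phys_dest = None
            if dest is not None:
                phys_dest = free.pop(0)
                rat[dest] = phys_dest
            out.append(template.format(d=phys_dest, s=phys_srcs))
    return out, rat

renamed, final_rat = rename(LOOP, FREE_POOL)
for line in renamed:
    print(line)
print(final_rat)
```

In this model the second iteration's first instruction reads X2 and X13, the values produced at the end of the first iteration, so the two iterations are linked only by true dependences.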
Even Physical Registers are Limited
• At some point in the future, the newer writer of R3 exits
• This instruction was the most recent writer, now update the RAT
• Deallocate physical register
[Figure: PRF entries T17 and T42; Free Pool]
Instruction Commit: a Problem
I1: ADD R3,R2,R1
I2: ADD R7,R3,R5
I3: ADD R6,R1,R1
• Decode I1 (rename R3 to T42)
• Decode I2 (uses T42 instead of R3)
• Execute I1 (Write result to T42)
• I2 can’t execute (e.g. R5 not ready)
• Commit I1 (T42->R3, free T42)
• Decode I3 (uses T42 instead of R6)
• Execute I3 (writes result to T42)
• R5 finally becomes ready
• Execute I2 (read from T42)
• We read the wrong value!!
[Figure: ARF (R3), RAT (R3, R6), PRF (T42), Free Pool]
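The sequence of events above can be replayed step by step. The sketch below is a toy model in which the ARF and PRF are dictionaries and the free pool holds only T42; the register values and the name `run_scenario` are hypothetical. It frees T42 when I1 commits, lets I3 reuse it, and then compares what I2 reads with what I1 actually produced.

```python
def run_scenario():
    """Replay the slide's event order; return (value I2 should see, value it reads)."""
    arf = {"R1": 1, "R2": 2, "R3": 0, "R5": 10}   # hypothetical contents
    prf = {}
    rat = {}
    free_pool = ["T42"]

    # Decode I1: ADD R3,R2,R1 -> destination R3 renamed to T42.
    rat["R3"] = free_pool.pop()
    # Decode I2: ADD R7,R3,R5 -> source R3 is captured as T42.
    i2_source = rat["R3"]
    # Execute I1: result goes to T42.
    prf[rat["R3"]] = arf["R2"] + arf["R1"]
    expected = prf[i2_source]
    # I2 cannot execute yet (R5 not ready).
    # Commit I1: copy T42 to R3, point the RAT back at the ARF, free T42.
    arf["R3"] = prf["T42"]
    del rat["R3"]
    free_pool.append("T42")
    # Decode I3: ADD R6,R1,R1 -> destination R6 gets the recycled T42.
    rat["R6"] = free_pool.pop()
    # Execute I3: overwrites T42.
    prf[rat["R6"]] = arf["R1"] + arf["R1"]
    # R5 becomes ready; I2 finally executes and reads T42.
    observed = prf[i2_source]
    return expected, observed

expected, observed = run_scenario()
print(expected, observed)
```

With these values I2 should see 3 but reads 2, because T42 was recycled while an in-flight instruction still held a reference to it.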