Chapter 3: Instruction-Level Parallelism and Its Exploitation

Instruction Parallelism Examples
● Loop-level parallelism
– Loop unrolling (compiler)
– Dynamic unrolling (superscalar scheduling)
● Data parallelism
– Vector computers
● Cray X1, X1E, X2; NEC SX-9
– SIMT
● GPUs
– SIMD
● Short SIMD (SSE, AVX, Intel Phi)
Introduction
● Instruction level parallelism = ILP
– (potential) overlap among instructions
● First universal ILP: pipelining (since 1985)
● Two approaches to ILP
– Discover and exploit parallelism in hardware
● Dominant in server and desktop market segments
● Not used in PMD segment due to energy constraints
– May be changing with Cortex-A9
– Software-based discovery at compile time
● Technical markets, scientific computing, HPC
● Itanium is an example of aggressive software discovery
– But mostly abandoned by the majority of server makers

Types of Dependences
● Data dependences
● Name dependences
● Control dependences
Instruction-Level Parallelism Basics
● Goal: minimize CPI (maximize IPC)
● In a pipelined processor
– CPI = ideal CPI + stalls (overheads):
● Structural stalls
● Data hazard stalls
● Control stalls
● Fundamental unit for extracting parallelism is
– Basic block (block of instructions between branches)
– Branches disrupt analysis; add runtime dependence
● But typical basic blocks are small
– 3-6 instructions (15%-25% branch frequency)
● Optimizing across branches is a must
– Examples: loop-level parallelism, data parallelism (SIMD)

Data Dependence: Basics
● Not all instructions can be executed in parallel
● Data dependent instructions have to be executed “in order”
– Data dependence is a property of the code
● Instruction j is data dependent on instruction i if
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k and k is data dependent on i
● Pipeline interlocks
– With interlocks, data dependence causes a hazard and stall
– Without interlocks, data dependence prohibits the compiler from scheduling instructions with overlap
● Data dependence conveys:
– Possibility of a hazard (= negative side-effect if not “in order”)
– The required order of instructions
– Upper bound on achievable parallelism
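The CPI decomposition above can be made concrete with a small numeric sketch; the stall frequencies below are invented for illustration, not taken from the slides:

```c
#include <assert.h>

/* Hypothetical pipeline: ideal CPI of 1.0 plus the average stall
 * cycles per instruction contributed by each hazard class.
 * All stall numbers here are illustrative, not measured. */
double cpi(double ideal, double structural, double data, double control) {
    return ideal + structural + data + control;
}
```

For example, with 0.05 structural, 0.25 data-hazard, and 0.15 control stall cycles per instruction, `cpi(1.0, 0.05, 0.25, 0.15)` gives roughly 1.45, i.e. an IPC of about 0.69.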
Data Dependence Example
double F0, *R1

Loop:  Load.D    F0, 0(R1)       ; F0 = array element     F0 = R1[0]
       Add.D     F4, F0, F2      ; add scalar in F2       F4 = F0 + F2
       Store.D   F4, 0(R1)       ; store result           R1[0] = F4
       Add.I     R1, R1, #-8     ; decrement by 8B        R1 -= 1
       Branch.NE R1, R2, Loop    ; if (R1 != R2) goto Loop

Data Hazards
● Hazards exist as a result of data or name dependence
– Overlap of dependent (and nearby) instructions could change access order to instructions' operands
– Avoiding hazards ensures program order
● Possible data hazards
– RAW (read after write) ← true data dependence
● Instruction i: write to x
● Instruction j: read from x
– WAW (write after write) ← output dependence
● Instruction i: write to x
● Instruction j: write to x
– WAR (write after read) ← antidependence
● Instruction i: read from x
● Instruction j: write to x
● RAR is not a hazard
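In C, the assembly loop above corresponds to adding a scalar to every element of a double array, walking backwards from the end (a sketch; the variable names mirror the register names):

```c
#include <assert.h>

/* C equivalent of the Load.D/Add.D/Store.D loop: walk the array
 * backwards from a[n-1] down to a[0], adding the scalar f2.
 * Every iteration carries a RAW chain: load -> add -> store. */
void add_scalar(double *a, int n, double f2) {
    for (int i = n - 1; i >= 0; --i)
        a[i] += f2;   /* F0 = R1[0]; F4 = F0 + F2; R1[0] = F4 */
}
```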
Data Dependence Details
● Overcoming data dependence
– Maintain the dependence by preventing the hazard
● By compiler or by hardware scheduler
– Eliminate dependence by transforming the code
● A separate topic
– Dependences that flow through memory
● Is R4[100] the same as R6[20]?
● Is R4[20] the same as R6[20]?

Control Dependence
● Control dependence determines ordering of an instruction with respect to a branch instruction
– The order must be preserved
– The execution should be conditional
● Example:
– if p1 { S1 }
– if p2 { S2 }
– S1 is control dependent on p1
– S2 is control dependent on p2
– S2 is not control dependent on p1
● Branches create barriers in code for potential code motion
● It might be possible to violate control dependence but preserve correct execution with extra hardware
– Speculative execution
Name Dependence
● Name dependence occurs when two instructions use the same register (or memory location) but there is no flow of data between them
● Two types of name dependence
– Antidependence: instruction i reads what instruction j writes
● Store.D F4, 0(R1)
● Add.I R1, R1, #-8
– Output dependence: instructions i and j both write
● Renaming is the common technique to deal with name dependence
– Register renaming
– Shadow registers

Control Dependence Examples
● Exception handling
– Add R2, R3, R4
– Branch.equal0 R2, L1
– Load R1, 0(R2)
– L1: NoOp
– No data dependence between Branch and Load
– Load from wrong R2 could cause exception:
● int *r2 = r3 + r4; y = r2 ? r2[0] : 0;
● Data flow
– Add R1, R2, R3
– Branch.equal0 R4, L
– Subtract R1, R5, R6
– L: NoOp
– Or R7, R1, R8
Control Dependence: Software Speculation
● Ignoring control dependence may be possible after code analysis (liveness property)
– Add R1, R2, R3
– Branch.eq0 R12, Skip
– Subtract R4, R5, R6
– Add R5, R4, R9
– Skip: Or R7, R8, R9
– ; R4 is not used again (is dead)

Original vs. Unrolled Loop
● Original loop:
F0 = R1[0]
F4 = F0 + F2
R1[0] = F4
R1 -= 1
if (R1 != R2) goto Loop
● Loop unrolled 4 times:
F0 = R1[0]
F4 = F0 + F2
R1[0] = F4
F6 = R1[-1]
F8 = F6 + F2
R1[-1] = F8
F10 = R1[-2]
F12 = F10 + F2
R1[-2] = F12
F14 = R1[-3]
F16 = F14 + F2
R1[-3] = F16
R1 -= 4
if (R1 != R2) goto Loop
● New registers: F6, F8, ...
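The same transformation in C, as a sketch: the scalar-add loop unrolled four times, assuming for simplicity that the trip count is a multiple of 4:

```c
#include <assert.h>

/* 4x unrolled version: one backward loop iteration now processes
 * four elements, so the counter update and the branch execute
 * once per four array elements. Assumes n is a multiple of 4. */
void add_scalar_unrolled(double *a, int n, double f2) {
    for (int i = n - 1; i >= 3; i -= 4) {
        a[i]     += f2;   /* F4  = F0  + F2 */
        a[i - 1] += f2;   /* F8  = F6  + F2 */
        a[i - 2] += f2;   /* F12 = F10 + F2 */
        a[i - 3] += f2;   /* F16 = F14 + F2 */
    }
}
```

Each copy of the body uses its own temporaries (F6, F8, ... in the slide), which is what removes the false dependences between copies.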
Compiler Techniques for Exposing ILP
● Unscheduled loop:
Loop: Load F0, 0(R1)
      Stall
      Add F4, F0, F2
      Stall
      Stall
      Store F4, 0(R1)
      Add.I R1, R1, #-8
      Branch.NoEq R1, R2, Loop
● Scheduled loop:
Loop: Load F0, 0(R1)
      Add.I R1, R1, #-8
      Add F4, F0, F2
      Stall
      Stall
      Store F4, 0(R1)
      Branch.NoEq R1, R2, Loop
● Assumed latencies:
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

Loop Unrolling + Pipeline Scheduling
● Unrolled loop:
F0 = R1[0]
F4 = F0 + F2
R1[0] = F4
F6 = R1[-1]
F8 = F6 + F2
R1[-1] = F8
F10 = R1[-2]
F12 = F10 + F2
R1[-2] = F12
F14 = R1[-3]
F16 = F14 + F2
R1[-3] = F16
R1 -= 4
if (R1 != R2) goto Loop
● Unrolled and scheduled loop:
F0 = R1[0]
F6 = R1[-1]
F10 = R1[-2]
F14 = R1[-3]
F4 = F0 + F2
F8 = F6 + F2
F12 = F10 + F2
F16 = F14 + F2
R1[0] = F4
R1[-1] = F8
R1 -= 4
R1[2] = F12
R1[1] = F16
if (R1 != R2) goto Loop
Loop Unrolling Overview
● Loop unrolling simply copies the body of the loop multiple times; each copy operates on a new loop index
● Benefits
– Fewer branch instructions
● Less pressure on branch predictor
– Increased basic block size
● Potential for more parallelism
– Fewer instructions executed
● For example: fewer increments of the loop counter
● Downsides
– Greater register pressure
– Increased use of instruction cache
● Could spill the instruction cache and cause cache thrashing

Unrolling with Generic Loops
● Given: for (k=0; k < N; ++k)
– Let's unroll 4 times
– But what if N is not divisible by 4?
● Solution:
– First, run the N%4 leftover iterations: for (k=0; k < N%4; ++k)
– Then loop over groups of 4:
for (k = N%4; k < N; k += 4)
// unrolled 4 times for k, k+1, k+2, k+3
● Refer to Chapter 4 and the technique called
– Strip mining
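A minimal C sketch of this split, shown for an array sum: the remainder loop runs first, then the 4-way unrolled loop covers the rest:

```c
#include <assert.h>

/* Sum an array with 4-way unrolling. The first loop handles the
 * n % 4 leftover elements so that the unrolled loop always runs
 * over a multiple of 4 elements (compare: strip mining). */
double sum_unrolled(const double *a, int n) {
    double s = 0.0;
    int k = 0;
    for (; k < n % 4; ++k)          /* remainder: 0 .. n%4-1 */
        s += a[k];
    for (; k < n; k += 4)           /* groups of 4 */
        s += a[k] + a[k + 1] + a[k + 2] + a[k + 3];
    return s;
}
```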
Branch Prediction
● Instead of waiting for the branch to finish executing
– Try to predict its behavior and act upon the prediction
● Requirements
– Prediction must be cheaper than executing the branch instruction
● Usually based on a few bits of information
– There has to be a way of dealing with wrong predictions
● Beware of exceptions etc.
● Simple predictor
– Keep a bit (or two) for a (fixed) number of branches
– Every time a branch is taken, increase the count
● If N consecutive executions resulted in “branch taken”, then act as if the next branch will be taken
– If N consecutive executions resulted in “branch not taken”, then start predicting “not taken”

Dynamic Scheduling Basics
● Simple techniques can only eliminate some data dependence stalls
– Pipeline scheduling by compiler
– Forwarding and bypassing
● Dynamic scheduling adds another level, adding parallelism while maintaining the data flow
– Some dependences are not known until runtime
– The same binaries can run efficiently without recompilation
– Compiler might not know the details of the micro-architecture
– There could be unpredictable delays: multi-level caches
● Disadvantages
– Substantial increase in hardware complexity
– Exception handling (imprecise exceptions)
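A 2-bit saturating-counter version of the simple predictor can be sketched as follows; the table size and the PC-indexing scheme are arbitrary illustrative choices, not from the slides:

```c
#include <assert.h>
#include <stdint.h>

/* 2-bit saturating counters: states 0,1 predict not-taken and
 * 2,3 predict taken, so two consecutive mispredictions are
 * needed to flip the prediction. Indexed by low PC bits. */
#define BP_ENTRIES 1024
static uint8_t bp_table[BP_ENTRIES];   /* all start at 0 (not taken) */

int bp_predict(uint32_t pc) {
    return bp_table[pc % BP_ENTRIES] >= 2;   /* 1 = predict taken */
}

void bp_update(uint32_t pc, int taken) {
    uint8_t *c = &bp_table[pc % BP_ENTRIES];
    if (taken && *c < 3) ++*c;        /* saturate at 3 */
    else if (!taken && *c > 0) --*c;  /* saturate at 0 */
}
```

After one taken outcome the counter sits at 1 and still predicts not-taken; a second taken outcome moves it to 2 and flips the prediction.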
Correlating Branch Predictors
● Observation (based on existing codes)
– Branches are correlated with each other
● Application: correlating branch predictors
– Instead of keeping track of each branch individually, look also at the recent M branches
● (M, N) predictor
– Uses behavior of last M branches
● Total of 2^M possible histories, each with its own predictor
– Each predictor has N bits
● Advantages
– Better prediction yield (always test on your own code!)
– Little hardware required to implement it

Dynamic Scheduling Details
● Dynamic scheduling breaks the “in order” execution
– Out-of-order execution
● Incoming instructions are rearranged; delays are unknown until runtime
– Out-of-order completion
● Retired instructions' order depends on code, execution, delays
● New hazards to deal with
– WAR
● Possibility of overwriting a value that has not been read yet
– Load F0, 0(R1) // a load from memory may be stalled for many cycles
– Load R1, #1 // load of a constant takes only a few cycles
– WAW
● Writing twice to the same location
– RAW hazards are still a problem
● They always occur since they are “true data dependences”
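An (M, N) predictor with M = 2 and N = 2 can be sketched like this; the global history selects one of 2^M counter tables, and the table size is again an illustrative choice:

```c
#include <assert.h>
#include <stdint.h>

/* (2, 2) correlating predictor sketch: a 2-bit global history of
 * the last M=2 branch outcomes selects one of 2^M tables; each
 * entry is an N=2-bit saturating counter. Sizes are illustrative. */
#define M_HIST  2
#define ENTRIES 256
static uint8_t cp_table[1 << M_HIST][ENTRIES];
static uint8_t cp_hist;                 /* last M outcomes, 1 bit each */

int cp_predict(uint32_t pc) {
    return cp_table[cp_hist][pc % ENTRIES] >= 2;   /* 1 = taken */
}

void cp_update(uint32_t pc, int taken) {
    uint8_t *c = &cp_table[cp_hist][pc % ENTRIES];
    if (taken && *c < 3) ++*c;
    else if (!taken && *c > 0) --*c;
    /* shift the outcome into the global history */
    cp_hist = ((cp_hist << 1) | (taken & 1)) & ((1 << M_HIST) - 1);
}
```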
Tournament Branch Predictors
● Problem
– Branches might be badly mispredicted when moving between program scopes
– The branch prediction information from the inner scope is inadequate for the outer scope
● Observation
– There is locality in branching
● Inner and outer loops, etc.
● Solution
– Combine local and global information
● Typical predictors
– Size: 8K-32K bits
– Local predictors unchanged
● Examples: DEC Alpha, AMD Phenom and Opteron

Dynamic Scheduling and Hazards
● F0 = F2 / F4
● F6 = F0 + F8      – true data dependence (RAW) on F0
● R1[0] = F6        – true data dependence (RAW) on F6
● F8 = F10 – F14    – antidependence (WAR) on F8, read by the add
● F6 = F10 * F8     – output dependence (WAW) on F6, written by the add; antidependence (WAR) on F6, read by the store
Register Renaming Example
Before renaming:          After renaming:
● F0 = F2 / F4            ● F0 = F2 / F4
● F6 = F0 + F8            ● S = F0 + F8
● R1[0] = F6              ● R1[0] = S
● F8 = F10 – F14          ● T = F10 – F14
● F6 = F10 * F8           ● F6 = F10 * T
● Only RAW hazards remain

Pipelined Execution
Clock cycles
Instruction  1      2       3       4       5       6       7       8
Instr. I     fetch  decode  exe     mem     write
Instr. I+1          fetch   decode  exe     mem     write
Instr. I+2                  fetch   decode  exe     mem     write
Instr. I+3                          fetch   decode  exe     mem
Instr. I+4                                  fetch   decode  exe
Instr. I+5                                          fetch   decode
Instr. I+6                                                  fetch

With a stall:
Instr. I     fetch  decode  exe     mem     write
Instr. I+1          fetch   decode  exe     mem     write
Instr. I+2                  fetch   decode  exe     mem     write
Instr. I+3                          stall   fetch   decode  exe
Instr. I+4                                          fetch   decode
Instr. I+5                                                  fetch

Deeper pipeline (s = stall):
I1  F1 F2 R  X1 X2 X3 D1 D2 T  W
I2  F1 F2 R  X1 X2 X3 X4 D1 D2 T  W
I3  F1 F2 R  X1 s  D1 s  s  D2 T  W
Register Renaming Details
● Register renaming provided by reservation stations (RS)
● Each entry of RS contains
– Instruction
– Buffered operand values (when available)
– References to instructions in RS that will provide values
● Operation
– RS fetches and buffers an operand when available
● Might bypass a register
– Pending instructions indicate the RS where they send their output
● Results broadcast on result bus (Common Data Bus)
– Only the last output updates the register file
– Upon instruction issue, registers are renamed with references to RS
● There may be more RS than registers!

Tomasulo's Algorithm
● Tomasulo's approach allows...
– Out-of-order execution (as in scoreboarding in ARM A8)
● Unlike scoreboarding, Tomasulo can handle anti- and output dependences by renaming (four-issue Intel i7)
– Extension to handle speculation
● In Tomasulo, each instruction goes through three steps
– Issue
● FIFO queue maintains correct data flow
● Transfer instruction to RS if available, or structural stall
● Rename registers to eliminate WAR and WAW (stall if no data)
– Execute
● Monitor bus for new data and distribute it to waiting RS (RAW)
● Execute instructions in functional units when operands available
– Write result (other RS, registers, store buffers)
● Store buffer waits for address, value and memory unit(s)
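The reservation-station fields described above can be sketched as a C structure; the field names are my own but mirror the Vj/Vk/Qj/Qk/A notation used in the tables that follow:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* One reservation-station entry, Tomasulo-style. Qj/Qk name the
 * RS that will produce an operand (0 = no producer pending);
 * Vj/Vk buffer the operand values once they are available. */
typedef struct {
    bool     busy;     /* entry in use */
    uint8_t  op;       /* opcode of the buffered instruction */
    double   vj, vk;   /* operand values, valid when qj/qk == 0 */
    uint16_t qj, qk;   /* producing RS tags; 0 means value ready */
    uint32_t addr;     /* A field: load/store address */
} rs_entry;

/* An instruction may begin executing once both operands are
 * buffered values, i.e. no producer tag is still outstanding. */
bool rs_ready(const rs_entry *e) {
    return e->busy && e->qj == 0 && e->qk == 0;
}
```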
Tomasulo Approach
● Introduced by Robert Tomasulo
– Implemented in IBM 360/91 in its floating-point unit
– IBM 360/91 had long memory and floating-point delays
● Only 4 floating-point registers
– Binary compatibility was important for IBM customers
● Modern processors use a variation of Tomasulo's approach
– Also in use is a simpler algorithm called scoreboarding

Example Dynamic Execution: All Issued
Instruction status:
                 Issue  Execute  Write result
Load f6, (r2)    x      x        x
Load f2, (r3)    x      x
Mult f0,f2,f4    x
Sub f8,f2,f6     x
Div f9,f0,f6     x
Add f6,f8,f2     x

Reservation stations:
        Busy  Op    Vj  Vk       Qj     Qk     A
Load1   no
Load2   yes   load                             Reg[r3]
Add1    yes   sub       Mem[r2]  Load2
Add2    yes   add                Add1   Load2
Add3    no
Mult1   yes   mul       Reg[f4]  Load2
Mult2   yes   div       Mem[r2]  Mult1

Register status:
f0: Mult1   f2: Load2   f6: Add2   f8: Add1   f9: Mult2
Example Dynamic Execution: Mult Ready
Instruction status:
                 Issue  Execute  Write result
Load f6, (r2)    x      x        x
Load f2, (r3)    x      x        +
Mult f0,f2,f4    x      +
Sub f8,f2,f6     x      +        +
Div f9,f0,f6     x
Add f6,f8,f2     x      +        +

Reservation stations:
        Busy  Op    Vj       Vk       Qj     Qk  A
Load1   no
Load2   no
Add1    no
Add2    no
Add3    no
Mult1   yes   mul   Mem[r3]  Reg[f4]
Mult2   yes   div            Mem[r2]  Mult1

Register status:
f0: Mult1   f2:   f6:   f8:   f9: Mult2

Reorder Buffer
● Another set of (invisible to programmer) registers for intermediate results
● ROB registers hold data after instruction completion but before instruction commit
● Each ROB entry (register) contains additional fields:
– Instruction type
● Branch (no destination), store (memory destination), register op (ALU or register destination)
– Destination
● Register number (for loads or ALU ops) or memory address (for stores)
– Value
– Ready
● Indicates whether the supplying instruction completed its execution
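The four ROB-entry fields listed above map naturally onto a small C structure; the names are illustrative, and real designs carry more state:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* A reorder-buffer entry with the four fields listed above. */
typedef enum { ROB_BRANCH, ROB_STORE, ROB_REGOP } rob_type;

typedef struct {
    rob_type type;    /* branch, store, or register operation */
    uint32_t dest;    /* register number or memory address */
    double   value;   /* result, valid once ready is set */
    bool     ready;   /* producing instruction finished executing */
} rob_entry;

/* The entry at the head of the ROB may commit only once the
 * instruction that supplies its value has completed execution. */
bool rob_can_commit(const rob_entry *head) {
    return head->ready;
}
```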
Hardware Speculation Basics
● After dealing with data dependences, control dependences become an issue
– Branch prediction is not as effective with multiple instructions in-flight
● If predicted “taken”, conditional instructions are fetched and issued
– Speculation allows the processor to proceed almost as if the branch was not there
● Conditional instructions are fetched, issued, and executed
● Hardware speculation comprises
– Dynamic branch prediction
– Speculative execution: instructions are executed and possibly undone
– Dynamic scheduling
● More basic blocks available after branches are speculated out of the instruction stream

Reorder Buffer in Action
● Issue
– Has to wait until a ROB entry is available (in addition to an RS entry)
● Execute
– Results from the Common Data Bus will have to end up in the ROB
● Write result
– Results have to be copied to the ROB
● Commit (also called completion, graduation)
– Normal commit: store result in destination, mark ROB entry as empty
– Store commit: destination is a memory location
– Branch commit:
● If correct prediction: no action needed
● If incorrect prediction: ROB result is thrown away and instructions restarted at the correct branch point
Hardware Speculation Components
● Additional step in instruction execution
– Issue, Execute, Write result, Commit
● Reorder buffer (ROB)
● Handling of...
– Mispredictions
– Mis-speculations
– Exceptions

Reorder Buffer Exception Handling
● Exceptions are not recognized until they are ready to commit
● ROB records exceptions
– On mispredictions: flush the exception
– Upon reaching the head of the ROB: raise the exception
Speculation at Compile Time
● Original code:
if (x==0) {
  a += 1
  b += 1
  c += 1
}
● Speculate a, undo if wrong:
a += 1
if (x==0) {
  b += 1
  c += 1
} else {
  a -= 1
}
● Speculate into copies, commit conditionally:
a_copy = a+1
b_copy = b+1
c_copy = c+1
if (x==0) {
  a = a_copy
  b = b_copy
  c = c_copy
}

VLIW Disadvantages
● Static parallelism
– Must be discovered and exploited early
● Preferably by the compiler
● Potential for intermediate representations, bytecodes
● Large code size
– Parallelism relies on large basic blocks
– Clever encoding or on-the-fly decompression may be needed
● Lack of hazard detection in lockstep execution
● Binary compatibility
– Take code from a 2-issue VLIW to a (next-gen) 3-issue VLIW
– Add a single ALU unit to the new processor and the old code will not take advantage of it
– New (wider, with more functions) processors could change instruction encoding
● ISA must provide for future hardware expansions
Multiple Issue Execution
● All the techniques presented so far lead to ideal CPI = 1
● For CPI to go below 1, multiple instructions need to be retired most of the time
– Too many stalls can quickly increase CPI above 1
– See Amdahl's law
● Most common flavors of multiple issue processors
– Statically scheduled processors
● In-order execution
● Examples: MIPS, ARM
– VLIW (very long instruction word) processors
● Each cycle issues multiple (fixed number of) instructions
● Examples: DSPs, Itaniums, some GPUs
– Dynamically scheduled superscalar processors
● Out-of-order execution
● Examples: Intel i3-7, AMD Phenom, IBM POWER7

Tomasulo Recap
[Figure: Tomasulo datapath — instruction queue (0x0: Load F2, 0(R1); 0x1: Mul F0, F2, F4; 0x2: ...), register file F0-F4, reservation stations (Mul v1, v2, v3; Load v1, v2; empty RS entries), a memory buffer, Multiplier/Divider and Adder/Subtractor functional units, all connected by the Common Data Bus to the memory hierarchy]
VLIW Processor Basics
● How many instructions per cycle?
– Two-issue is commonplace
– Four-issue is manageable
● Scheduling techniques
– Local
● Basic blocks
– Global
● Across branches
– Trace
● VLIW-specific
● Extensive loop unrolling to generate large basic blocks
● Disadvantages
– Static parallelism, large code size, lack of hazard detection for lockstep execution, binary compatibility

Multiple Issue Taxonomy
Common name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Mostly in the embedded space: MIPS and ARM (Cortex-A8)
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution but no speculation | None at present
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Out-of-order execution with speculation | Intel Core i3-i7; AMD Phenom; IBM POWER7
VLIW/LIW | Static | Primarily software | Static | All hazards determined and indicated by compiler (often implicitly) | Most examples are in signal processing, such as TI C6x. Also some GPUs
EPIC (Explicitly Parallel Instruction Computing) | Primarily static | Primarily software | Mostly static | All hazards determined and indicated explicitly by the compiler | Itanium
VLIW Processors Basic Design
● Package multiple operations into one instruction
– Instruction bundles
● Example VLIW processor
– One integer instruction (or branch)
– Two independent floating-point operations
– Two independent memory references
– Notice: there are restrictions on the instructions
● There must be enough parallelism in code to fill the available slots
– Compiler: aggressive loop unrolling
– Programmer: program restructuring

Return Address Predictor
● Branch prediction deals with conditional branches
● Most unconditional branches come from function returns
● But the same function can be called from multiple sites
– This may cause the branch prediction buffer to forget about the return address from previous calls
● Solution
– Create a return address buffer organized as a stack
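A return-address stack can be sketched in a few lines of C; the depth is an illustrative choice, and this version simply drops pushes on overflow rather than wrapping around:

```c
#include <assert.h>
#include <stdint.h>

/* Return-address stack sketch: calls push the return address,
 * returns predict by popping. Depth is illustrative; on overflow
 * the newest entries are dropped rather than wrapping. */
#define RAS_DEPTH 16
static uint32_t ras[RAS_DEPTH];
static int ras_top;   /* number of valid entries */

void ras_call(uint32_t return_addr) {
    if (ras_top < RAS_DEPTH)
        ras[ras_top++] = return_addr;
}

uint32_t ras_return(void) {
    return ras_top > 0 ? ras[--ras_top] : 0;   /* 0 = no prediction */
}
```

Because the stack mirrors the call nesting, a function called from several different sites still has its return correctly predicted, which a single branch-prediction buffer entry cannot do.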
Modern Microarchitectures
● Combine:
– Dynamic scheduling
– Multiple issue
– Speculation
● Two approaches to dealing with dependences
– Assign reservation stations and update pipeline control table
in half clock cycles
● Only supports 2 instructions/clock
– Design logic to handle any possible dependences between
the instructions
● Notice: design complexity
– Hybrid approaches
● New bottleneck:
– Issue logic
Modern Multiple Issue
● Limit the complexity of a single instruction “bundle”
– Limit bundle size
– Limit classes of instruction in a bundle
● One integer, two floating-point
● With limited size, all dependences in a bundle can be
examined
● Dependences from a small bundle can also be fully encoded
in RS
● Another bottleneck:
– Completion/commit unit
– Need multiple such units to keep up with incoming instructions