Computer Architecture
A Quantitative Approach, Sixth Edition
Chapter 3
Instruction-Level Parallelism
and Its Exploitation
Copyright © 2019, Elsevier Inc. All rights Reserved 1
Introduction
Pipelining became a universal technique in 1985
Overlaps execution of instructions
Exploits “Instruction Level Parallelism”
Beyond this, there are two main approaches:
Hardware-based dynamic approaches
Used in server and desktop processors
Not used as extensively in PMDs (personal mobile devices)
Compiler-based static approaches
Not as successful outside of scientific applications
Instruction-Level Parallelism
When exploiting instruction-level parallelism, the goal is to
minimize CPI (Cycles Per Instruction)
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard
stalls + Control stalls
Where:
The ideal pipeline CPI is a measure of the maximum performance attainable by the
implementation
Structural hazards arise from resource conflicts when the hardware cannot support
all possible combinations of instructions simultaneously in overlapped execution.
Data hazards arise when an instruction depends on the results of a previous
instruction in a way that is exposed by the overlapping of instructions in the
pipeline.
Control hazards arise from the pipelining of branches and other instructions that
change the PC.
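The CPI equation above is additive, so it supports a quick back-of-the-envelope calculation. The stall counts below are illustrative assumptions, not figures from the text:

```python
# Hypothetical stall counts per 1000 instructions (assumed for illustration).
ideal_cpi = 1.0          # ideal pipeline: one instruction completes per cycle
structural = 50          # stall cycles from resource conflicts
data_hazard = 300        # stall cycles from data hazards
control = 150            # stall cycles from branches
instructions = 1000

pipeline_cpi = ideal_cpi + (structural + data_hazard + control) / instructions
print(pipeline_cpi)  # 1.5 -- each hazard class adds directly to achieved CPI
```

Reducing any one stall term lowers the achieved CPI toward the ideal.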
Parallelism within a basic block is limited
a straight-line code sequence with no branches in except to the entry
and no branches out except at the exit
Typical size of basic block = 3-6 instructions
Must optimize across branches
Data Dependence
The simplest and most common way to increase the
ILP is to exploit parallelism among iterations of a loop
Loop-Level Parallelism
Unroll loop statically or dynamically
As an alternative, use SIMD (vector processors and
GPUs)
Challenges:
Data dependency
Instruction j is data dependent on instruction i if:
Instruction i produces a result that may be used by instruction j, or
Instruction j is data dependent on instruction k, and instruction k
is data dependent on instruction i (a chain of dependences)
Dependent instructions cannot be executed
simultaneously
Dependencies are a property of programs
Pipeline organization determines if dependence
is detected and if it causes a stall
Data dependence conveys:
Possibility of a hazard
Order in which results must be calculated
Upper bound on exploitable instruction level
parallelism
Dependencies that flow through memory
locations are difficult to detect
Name Dependence
Two instructions use the same register or memory location
(the same “name”), but there is no flow of information
between them
Not a true data dependence, but is a problem when
reordering instructions
two types of name dependences between an
instruction i that precedes instruction j:
Antidependence: instruction j writes a register or
memory location that instruction i reads
Initial ordering (i before j) must be preserved
Output dependence: instruction i and instruction j
write the same register or memory location
Ordering must be preserved
To resolve, use register renaming techniques
Other Factors
A hazard occurs whenever:
there is a name or data dependence between
instructions, and
The instructions are close enough that the
overlap during execution would change the
order of access to the operand involved in the
dependence.
Solution: Preserve program order (program
should execute sequentially).
The goal of both software and hardware
techniques is to exploit parallelism by
preserving program order only where it
affects the outcome of the program
Consider two instructions i and j, with i preceding j in
program order. The possible data hazards are
Read after write (RAW):j tries to read a source before i writes it.
Write after write (WAW): j tries to write an operand before it is written
by i.
Write after read (WAR): j tries to write a destination before it is read by
i.
Control Dependence
Determines the ordering of instruction i with respect to a
branch instruction so that instruction i is executed in
correct program order and only when it should be
Instruction control dependent on a branch cannot be moved before
the branch so that its execution is no longer controlled by the
branch
An instruction not control dependent on a branch cannot be moved
after the branch so that its execution is controlled by the branch
Examples
• Example 1: or is data dependent on both add and sub
add x1,x2,x3
beq x4,x0,L
sub x1,x1,x6
L: …
or x7,x1,x8
• Example 2: assume x4 isn’t used after skip; it is possible to move sub before the branch
add x1,x2,x3
beq x12,x0,skip
sub x4,x5,x6
add x5,x4,x9
skip:
or x7,x8,x9
Compiler Techniques for Exposing ILP
Pipeline scheduling
Find sequences of unrelated instructions that
can be overlapped in the pipeline.
To avoid a pipeline stall, the execution of a
dependent instruction must be separated from
the source instruction by a distance in clock
cycles equal to the pipeline latency of that
source instruction.
A compiler’s ability to perform this
scheduling depends on:
Amount of ILP available in the program.
latencies of the functional units in the pipeline.
Example:
for (i=999; i>=0; i=i-1)
  x[i] = x[i] + s;
The loop is parallel: the body of each iteration is independent.
Pipeline Stalls
(figure omitted)
Loop Unrolling
Loop unrolling
Unroll by a factor of 4 (assume # elements is divisible by 4)
Eliminate unnecessary instructions
Loop: fld    f0,0(x1)
      fadd.d f4,f0,f2
      fsd    f4,0(x1)      //drop addi & bne
      fld    f6,-8(x1)
      fadd.d f8,f6,f2
      fsd    f8,-8(x1)     //drop addi & bne
      fld    f10,-16(x1)
      fadd.d f12,f10,f2
      fsd    f12,-16(x1)   //drop addi & bne
      fld    f14,-24(x1)
      fadd.d f16,f14,f2
      fsd    f16,-24(x1)
      addi   x1,x1,-32
      bne    x1,x2,Loop
Note the number of live registers vs. the original loop; 26 clock cycles.
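The effect of the unrolled RISC-V loop can be mimicked in Python as a loose analogy (my sketch, not the book's code): four load/add/store groups per trip around the loop, with a single copy of the loop overhead.

```python
def add_scalar_unrolled(x, s):
    """Unroll-by-4 version of 'for i: x[i] += s' (assumes len(x) % 4 == 0).
    Each trip does four independent load/add/store groups, mirroring the
    unrolled RISC-V loop: more work per branch, less loop overhead."""
    assert len(x) % 4 == 0
    i = 0
    while i < len(x):
        x[i]     += s   # group 1: fld / fadd.d / fsd
        x[i + 1] += s   # group 2
        x[i + 2] += s   # group 3
        x[i + 3] += s   # group 4
        i += 4          # one addi/bne pair per four elements
    return x
```

The four groups are independent, which is what lets the scheduled version on the next slide interleave them.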
Loop Unrolling/Pipeline Scheduling
Pipeline schedule the unrolled loop:
Loop: fld    f0,0(x1)
      fld    f6,-8(x1)
      fld    f10,-16(x1)
      fld    f14,-24(x1)
      fadd.d f4,f0,f2
      fadd.d f8,f6,f2
      fadd.d f12,f10,f2
      fadd.d f16,f14,f2
      fsd    f4,0(x1)
      fsd    f8,-8(x1)
      fsd    f12,-16(x1)
      fsd    f16,-24(x1)
      addi   x1,x1,-32
      bne    x1,x2,Loop
14 cycles: 3.5 cycles per element
Strip Mining
Unknown number of loop iterations?
(upper bound on the loop is unknown)
Number of iterations = n
Goal: make k copies of the loop body
Instead of a single unrolled loop, generate a
pair of consecutive loops:
First executes n mod k times
Second executes n / k times
“Strip mining”
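The two-loop structure can be sketched directly; the helper below is my illustration, not code from the text:

```python
def strip_mine(x, s, k=4):
    """Strip mining sketch: when the trip count n is unknown at compile
    time, run a scalar prologue n mod k times, then an unrolled loop
    n // k times with k elements per trip."""
    n = len(x)
    # First loop: n mod k leftover iterations, one element at a time.
    for i in range(n % k):
        x[i] += s
    # Second loop: n // k trips of the k-way unrolled body.
    for base in range(n % k, n, k):
        for j in range(k):
            x[base + j] += s
    return x
```

Every element is touched exactly once whether or not n is a multiple of k.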
Branch Prediction
Basic 2-bit predictor:
For each branch:
Predict taken or not taken
If the prediction is wrong two consecutive times, change prediction
Correlating predictor/two-level predictors:
Branch predictors use the behavior of other branches to make a
prediction.
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes of preceding n
branches
(m,n) predictor: use the behavior of the last m branches to choose from 2^m n-bit predictors
Tournament predictor:
Combine correlating predictor with local predictor: choose among two
different predictors based on which predictor (local, global, or even
some time varying mix) was most effective in recent predictions.
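The basic 2-bit scheme above can be sketched as a saturating counter per branch; the initial counter value is an assumption of this sketch:

```python
class TwoBitPredictor:
    """2-bit saturating-counter predictor sketch (one counter per branch).
    States 0-1 predict not taken, 2-3 predict taken, so the prediction
    only flips after two consecutive mispredictions."""
    def __init__(self):
        self.counters = {}  # branch PC -> counter in 0..3

    def predict(self, pc):
        # Unseen branches start at 1 (weakly not taken) -- an assumption.
        return self.counters.get(pc, 1) >= 2  # True = predict taken

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)
```

For a loop branch that is mostly taken, one not-taken exit costs a single misprediction rather than two, which is the point of the second bit.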
(figure: gshare and tournament predictors)
Branch Prediction Performance
(figures omitted)
Tagged Hybrid Predictors
This class of branch predictors employs a
series of global predictors indexed with
different length histories.
Need a predictor for each branch and
history combination
Problem: this implies huge tables
Solution:
Use hash tables, whose hash value is based on
branch address and branch history
Longer histories may lead to increased chance of
hash collision, so use multiple tables with
increasingly shorter histories
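The hashing idea above (combining branch address with branch history to index a shared table) is the basis of gshare-style predictors. A sketch, where the table size, history length, and XOR hash are illustrative assumptions:

```python
class GsharePredictor:
    """Gshare-style global predictor sketch: the counter table is indexed
    by (PC XOR global history), so one modest table is shared by many
    (branch, history) pairs instead of one predictor per pair."""
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)  # 2-bit counters, weakly not taken
        self.history = 0                      # global branch history register

    def predict(self, pc):
        return self.table[(pc ^ self.history) & self.mask] >= 2

    def update(self, pc, taken):
        i = (pc ^ self.history) & self.mask
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        self.history = ((self.history << 1) | taken) & self.mask  # shift in outcome
```

Longer histories spread one branch across more table entries, which raises the collision risk the slide mentions; tagged predictors add tags and multiple history lengths to manage that.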
Dynamic Scheduling
Dynamic Scheduling: Rearrange order of instructions to
reduce stalls while maintaining data flow
Advantages:
code that was compiled with one pipeline in mind can run
efficiently on a different pipeline
Compiler doesn’t need to have knowledge of microarchitecture
Handles cases where dependencies are unknown at compile
time
allows the processor to tolerate unpredictable delays, such as
cache misses, by executing other code while waiting for the miss
to resolve
Disadvantage:
Substantial increase in hardware complexity
Complicates exceptions
A dynamically scheduled processor cannot change the
data flow; it tries to avoid stalling when dependences
are present.
Static pipeline scheduling by the compiler tries to
minimize stalls by separating dependent instructions
so that they will not lead to hazards
Dynamic scheduling implies:
Out-of-order execution
Out-of-order completion
Example 1:
fdiv.d f0,f2,f4
fadd.d f10,f0,f8
fsub.d f12,f8,f14
fsub.d is not dependent; it can issue before fadd.d
Example 2:
fdiv.d f0,f2,f4
fmul.d f6,f0,f8
fadd.d f0,f10,f14
fadd.d is not dependent, but the
antidependence makes it impossible to issue
earlier without register renaming
fmul.d and fadd.d: antidependence (Register
f0)
If fadd.d executes before fmul.d, the result is a
WAR hazard
Register Renaming
Example 3:
fdiv.d f0,f2,f4
fadd.d f6,f0,f8
fsd    f6,0(x1)
fsub.d f8,f10,f14
fmul.d f6,f10,f8
WAR hazards (antidependences): fsub.d writes f8, which fadd.d reads; fmul.d writes f6, which fsd reads
WAW hazard on f6: the fadd.d may finish later than the fmul.d
There are also three true data dependences:
between the fdiv.d and the fadd.d (f0),
between the fsub.d and the fmul.d (f8),
between the fadd.d and the fsd (f6).
Example 3: assume the existence of two
temporary registers, S and T.
fdiv.d f0,f2,f4
fadd.d S,f0,f8
fsd S,0(x1)
fsub.d T,f10,f14
fmul.d f6,f10,T
Now only RAW hazards remain, which can be strictly
ordered
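The S/T renaming above can be done mechanically. The sketch below applies hardware-style renaming, where every write gets a fresh name and later reads see the newest mapping; the text's hand-renamed version introduces temps only where needed (S, T), but the hazards removed are the same. The tuple encoding and the `p0, p1, …` names are my illustration:

```python
def rename(code):
    """Renaming sketch: give every destination a fresh name; remap each
    source to the newest name for that register. Removes all WAR and WAW
    hazards while preserving RAW (true) dependences."""
    mapping = {}   # architectural name -> current fresh name
    fresh = 0
    out = []
    for op, dst, srcs in code:
        srcs = tuple(mapping.get(r, r) for r in srcs)  # RAW links preserved
        mapping[dst] = f"p{fresh}"                      # fresh name per write
        fresh += 1
        out.append((op, mapping[dst], srcs))
    return out

# Example 3 encoded as (op, destination, sources); the store's destination
# is modeled as 'mem' purely for uniformity (an assumption of the sketch).
example3 = [
    ("fdiv.d", "f0",  ("f2", "f4")),
    ("fadd.d", "f6",  ("f0", "f8")),
    ("fsd",    "mem", ("f6",)),
    ("fsub.d", "f8",  ("f10", "f14")),
    ("fmul.d", "f6",  ("f10", "f8")),
]
```

After renaming, fmul.d reads the fsub.d's fresh name for f8, so the WAR on f8 and the WAW on f6 are gone; only the three RAW dependences remain.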
Tomasulo’s Approach
Tracks when operands are available: Minimize RAW
Introduces register renaming in hardware
Minimizes WAW and WAR hazards
It relies on two key principles:
dynamically determining when an instruction is ready to execute
renaming registers to avoid unnecessary hazards.
Register renaming is provided by reservation stations (RS)
The basic idea is that a RS fetches and buffers an operand
as soon as it is available, eliminating the need to get the
operand from a register
RS Contains:
The instruction
Buffered operand values (when available)
Reservation station number of instruction providing
the operand values
RS fetches and buffers an operand as soon as it
becomes available (not necessarily involving register file)
Pending instructions designate the RS to which they will
send their output
Result values broadcast on a result bus, called the common data
bus (CDB)
Only the last output updates the register file
As instructions are issued, the register specifiers are
renamed with the reservation station
May be more reservation stations than registers
Load and store buffers
Contain data and addresses, act like reservation stations
Tomasulo’s Algorithm
(figure omitted)
Three Steps:
Issue
Get next instruction from FIFO instruction queue
If an RS is available, issue the instruction to the RS, with operand
values if they are available
If operand values are not available, record which RS will produce them
If no RS is available, stall the instruction
Execute
If one or more of the operands is not yet available, monitor the
common data bus while waiting for it to be computed
When operand becomes available, store it in any reservation
stations waiting for it
When all operands are ready, execute the instruction
Loads and stores are maintained in program order through effective
address calculation
No instruction is allowed to initiate execution until all branches that
precede it in program order have completed
Write result
Write result on CDB into reservation stations and store buffers
(Stores must wait until address and value are received)
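The issue and write-result steps can be sketched as operations on a small reservation-station structure. This is my simplification of Tomasulo's algorithm (no execution timing, no load/store buffers), and the RS names like "Div1" are illustrative:

```python
class RS:
    """Reservation-station entry sketch: buffers operand values (Vj/Vk)
    when ready, or the tags (Qj/Qk) of the stations producing them."""
    def __init__(self, name, op):
        self.name, self.op = name, op
        self.Vj = self.Vk = None   # buffered operand values
        self.Qj = self.Qk = None   # tags of producing reservation stations

def issue(op, dst, src1, src2, regs, reg_status, rs_pool, name):
    """Issue step: read each ready operand from the register file, or record
    the tag of the RS that will produce it; then rename dst to this RS."""
    rs = RS(name, op)
    for src, v, q in ((src1, "Vj", "Qj"), (src2, "Vk", "Qk")):
        if src in reg_status:              # value still being computed
            setattr(rs, q, reg_status[src])
        else:                              # value available now
            setattr(rs, v, regs[src])
    reg_status[dst] = name                 # dst renamed to this RS
    rs_pool[name] = rs
    return rs

def write_result(name, value, regs, reg_status, rs_pool):
    """Write-result step: broadcast (tag, value) on the CDB. Waiting RSs
    capture the value; the register file updates only if this RS is still
    the newest writer of the renamed register."""
    for rs in rs_pool.values():
        if rs.Qj == name:
            rs.Vj, rs.Qj = value, None
        if rs.Qk == name:
            rs.Vk, rs.Qk = value, None
    for reg in [r for r, t in reg_status.items() if t == name]:
        regs[reg] = value
        del reg_status[reg]
    del rs_pool[name]
```

Issuing fdiv.d f0,f2,f4 followed by fadd.d f10,f0,f8 leaves the fadd.d holding the divider's tag for f0; the later CDB broadcast fills in the value without touching the register file in between.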
Example loop:
Loop: fld f0,0(x1)
fmul.d f4,f0,f2
fsd f4,0(x1)
addi x1,x1,8
bne x1,x2,Loop // branches if x1 != x2
Hardware-Based Speculation
Execute instructions along predicted
execution paths but only commit the
results if prediction was correct
Instruction commit: allowing an instruction
to update the register file when the instruction
is no longer speculative
Need an additional piece of hardware to
prevent any irrevocable action until an
instruction commits
i.e., updating state or taking an exception
Reorder Buffer
Reorder buffer – holds the result of an
instruction between completion and
commit
Four fields:
Instruction type: branch/store/register
Destination field: register number
Value field: output value
Ready field: completed execution?
Modify reservation stations:
Operand source is now the reorder buffer instead of the functional unit
Issue:
Allocate an RS and a ROB entry; read available
operands
Execute:
Begin execution when operand values are
available
Write result:
Write result and ROB tag on CDB
Commit:
When an instruction reaches the head of the ROB, update the
register file
When a mispredicted branch reaches the head of the ROB, flush
the ROB and restart execution at the correct target
Register values and memory values are
not written until an instruction commits
On misprediction:
Speculated entries in ROB are cleared
Exceptions:
Not recognized until it is ready to commit
Multiple Issue and Static Scheduling
To achieve CPI < 1, need to complete
multiple instructions per clock
Solutions:
Statically scheduled superscalar processors
VLIW (very long instruction word) processors
Dynamically scheduled superscalar
processors
Multiple Issue
(figure omitted)
VLIW Processors
Package multiple operations into one
instruction
Example VLIW processor:
One integer instruction (or branch)
Two independent floating-point operations
Two independent memory references
Must be enough parallelism in code to fill
the available slots
Disadvantages:
Statically finding parallelism
Code size
No hazard detection hardware
Binary code compatibility
Dynamic Scheduling, Multiple Issue, and Speculation
Modern microarchitectures:
Dynamic scheduling + multiple issue +
speculation
Two approaches:
Assign reservation stations and update
pipeline control table in half clock cycles
Only supports 2 instructions/clock
Design logic to handle any possible
dependencies between the instructions
Issue logic is the bottleneck in dynamically
scheduled superscalars
Overview of Design
(figure omitted)
Multiple Issue
Examine all the dependencies among the
instructions in the bundle
If dependencies exist in bundle, encode
them in reservation stations
Also need multiple completion/commit
To simplify RS allocation:
Limit the number of instructions of a given
class that can be issued in a “bundle”, e.g. one
FP, one integer, one load, one store
Example
Loop: ld x2,0(x1) //x2=array element
addi x2,x2,1 //increment x2
sd x2,0(x1) //store result
addi x1,x1,8 //increment pointer
bne x2,x3,Loop //branch if not last
Example (No Speculation)
(table omitted)
Example (Multiple Issue with Speculation)
(table omitted)
Branch-Target Buffer
Need high instruction bandwidth
Branch-Target buffers
Next PC prediction buffer, indexed by current PC
Branch Folding
Optimization:
Larger branch-target buffer
Add target instruction into buffer to deal with
longer decoding time required by larger buffer
“Branch folding”
Return Address Predictor
Most unconditional branches come from
function returns
The same procedure can be called from
multiple sites
Causes the buffer to potentially forget about
the return address from previous calls
Create return address buffer organized
as a stack
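Because calls and returns nest, a stack matches the return pattern exactly as long as it does not overflow. A sketch, where the depth of 8 is an illustrative assumption:

```python
class ReturnAddressStack:
    """Return-address predictor sketch: a small stack of return addresses.
    A call pushes the address of the following instruction; a return pops
    the predicted target. The fixed depth models the finite hardware buffer."""
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # oldest entry lost when full
        self.stack.append(return_pc)

    def predict_return(self):
        # None models a stack-empty miss (fall back to another predictor).
        return self.stack.pop() if self.stack else None
```

Unlike a branch-target buffer entry, the stack does not "forget" a return address just because the procedure was called from a different site in between.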
Integrated Instruction Fetch Unit
Design monolithic unit that performs:
Branch prediction
Instruction prefetch
Fetch ahead
Instruction memory access and buffering
Deal with crossing cache lines
Register Renaming
Register renaming vs. reorder buffers
Instead of virtual registers from reservation stations and reorder
buffer, create a single register pool
Contains visible registers and virtual registers
Use hardware-based map to rename registers during issue
WAW and WAR hazards are avoided
Speculation recovery occurs by copying during commit
Still need a ROB-like queue to update table in order
Simplifies commit:
Record that mapping between architectural register and physical register is no
longer speculative
Free up physical register used to hold older value
In other words: SWAP physical registers on commit
Physical register de-allocation is more difficult
Simple approach: deallocate a physical register when the next
instruction that writes its mapped architecturally visible register commits
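The map table and free list can be sketched directly. This is my simplification (no commit pipeline; the freed register is just returned to the caller), with illustrative register names:

```python
class RenameMap:
    """Merged-register-file renaming sketch: a map table from architectural
    to physical registers plus a free list. Each write takes a fresh
    physical register; the previous mapping is reported so it can be freed
    when the new write commits (the simple policy described above)."""
    def __init__(self, arch_regs, num_phys):
        self.map = {r: i for i, r in enumerate(arch_regs)}   # initial identity map
        self.free = list(range(len(arch_regs), num_phys))    # unused physicals

    def rename(self, dst, srcs):
        srcs = [self.map[r] for r in srcs]   # read current mappings (RAW kept)
        old = self.map[dst]                  # freeable once the new write commits
        self.map[dst] = self.free.pop(0)     # fresh physical register for dst
        return self.map[dst], srcs, old
```

Two back-to-back writes of the same architectural register get two different physical registers, which is why WAW and WAR hazards cannot arise after renaming.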
Integrated Issue and Renaming
Combining instruction issue with register renaming:
Issue logic pre-reserves enough physical registers for the
bundle
Issue logic finds dependencies within bundle, maps registers
as necessary
Issue logic finds dependencies between current bundle and
already in-flight bundles, maps registers as necessary
How Much?
How much to speculate
Mis-speculation degrades performance and
power relative to no speculation
May cause additional misses (cache, TLB)
Prevent speculative code from causing
costlier misses (e.g. in L2)
Speculating through multiple branches
Complicates speculation recovery
Speculation and energy efficiency
Note: speculation is only energy efficient
when it significantly improves performance
Value Prediction
Value prediction: attempt to predict the value an instruction will produce
Uses:
Loads that load from a constant pool
Instruction that produces a value from a small set
of values
Not incorporated into modern processors
Similar idea--address aliasing prediction--is
used on some processors to determine if
two stores or a load and a store reference
the same address to allow for reordering
Fallacies and Pitfalls
It is easy to predict the performance/energy
efficiency of two different versions of the same
ISA if we hold the technology constant
Processors with lower CPIs / faster clock rates
will also be faster
Pentium 4 had higher clock, lower CPI
Itanium had same CPI, lower clock
Sometimes bigger and dumber is better
Pentium 4 and Itanium were advanced designs, but
could not achieve their peak instruction throughput
because of relatively small caches as compared to i7
And sometimes smarter is better than bigger and
dumber
TAGE branch predictor outperforms gshare with fewer
stored predictions
Believing that there are large amounts of ILP available, if only we had the right techniques