Advanced Computer Architecture
Chapter 4
Advanced Pipelining
Ioannis Papaefstathiou
CS 590.25
Easter 2003
(thanks to Hennessy & Patterson)
Chapter Overview
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP
Chap. 4 - Pipelining II 2
Chapter Overview
Technique                                   Reduces                    Section
Loop unrolling                              Control stalls             4.1
Dynamic scheduling with register renaming   WAR and WAW stalls         4.2
Software pipelining and trace scheduling    Ideal CPI & data stalls    4.5
Instruction Level Parallelism

ILP is the principle that there are many instructions in code that don't depend on each other. That means it's possible to execute those instructions in parallel.

This is easier said than done. Issues include:
• Building compilers to analyze the code,
• Building hardware to be even smarter than that code.

This section looks at some of the problems to be solved.
Instruction Level Parallelism: Pipeline Scheduling and Loop Unrolling

Terminology

Basic Block - the set of instructions between entry points and between branches. A basic block has only one entry and one exit. Typically it is about 6 instructions long.

Loop Level Parallelism - the parallelism that exists within a loop. Such parallelism can cross loop iterations.
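To make the unrolling idea concrete, here is a small Python sketch (an illustration, not from the slides) of what a compiler does when it unrolls a loop by a factor of four: the loop-overhead work is paid once per four elements, and the four bodies touch different elements, so hardware could overlap them.

```python
# Sketch of loop unrolling: the "scalar" loop pays its loop-overhead
# step once per element; the unrolled loop pays it once per four.

def scalar_loop(a, s):
    # one add + one loop-overhead step per element
    for i in range(len(a)):
        a[i] += s

def unrolled_loop(a, s):
    # four independent adds per loop-overhead step
    i = 0
    n = len(a) - len(a) % 4
    while i < n:
        a[i] += s
        a[i + 1] += s
        a[i + 2] += s
        a[i + 3] += s
        i += 4                       # overhead amortized over 4 elements
    for j in range(n, len(a)):       # cleanup for leftover iterations
        a[j] += s

x = [1.0] * 10
y = [1.0] * 10
scalar_loop(x, 2.0)
unrolled_loop(y, 2.0)
assert x == y    # same result, fewer loop-overhead steps
```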
Instruction Level Parallelism: Pipeline Scheduling and Loop Unrolling

FP Loop Hazards

Loop:  LD    F0,0(R1)   ;F0=vector element
       ADDD  F4,F0,F2   ;add scalar in F2
       SD    0(R1),F4   ;store result
       SUBI  R1,R1,8    ;decrement pointer 8B (DW)
       BNEZ  R1,Loop    ;branch R1!=zero
       NOP              ;delayed branch slot

• Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address.
• Schedule the code, preserving any dependences needed to yield the same result as the original code.
Instruction Level Parallelism: Dependencies

Compiler Perspectives on Code Movement

The compiler is concerned about dependencies in the program; whether a given pipeline turns a dependence into an actual HW hazard is not its concern.
• Tries to schedule code to avoid hazards.
• Looks for data dependencies (RAW if a hazard for HW):
  – Instruction i produces a result used by instruction j, or
  – Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
• If dependent, the instructions can't execute in parallel.
• Easy to determine for registers (fixed names).
• Hard for memory:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
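The "does 100(R4) = 20(R6)?" question can be sketched as a disambiguation check. This is a hypothetical helper, not the book's algorithm: two base+offset references provably differ only when they share a base register and their offsets differ; with different bases the compiler must conservatively assume they may alias.

```python
# Hedged sketch of compiler memory disambiguation for base+offset
# addresses like 100(R4) and 20(R6).

def may_alias(base1, off1, base2, off2):
    """Return True unless the two references provably never overlap."""
    if base1 == base2:
        # same base register: the addresses are equal iff the offsets are
        return off1 == off2
    # different base registers: their values are unknown at compile time,
    # so we must conservatively assume the accesses might overlap
    return True

print(may_alias("R4", 100, "R6", 20))  # True: can't prove independence
print(may_alias("R1", 0, "R1", -8))    # False: same base, offsets differ
print(may_alias("R1", 0, "R1", 0))     # True: identical address
```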
Instruction Level Parallelism: Name Dependencies

Compiler Perspectives on Code Movement

Where are the name dependencies?

1 Loop: LD    F0,0(R1)
2       ADDD  F4,F0,F2
3       SD    0(R1),F4
4       LD    F0,-8(R1)
5       ADDD  F4,F0,F2
6       SD    -8(R1),F4
7       LD    F0,-16(R1)
8       ADDD  F4,F0,F2
9       SD    -16(R1),F4
10      LD    F0,-24(R1)
11      ADDD  F4,F0,F2
12      SD    -24(R1),F4
13      SUBI  R1,R1,#32
14      BNEZ  R1,LOOP
15      NOP

No data is passed in F0, but instruction 4 can't reuse F0 until the earlier instructions are done with it.

How can we remove these dependencies?
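One answer is register renaming: give each iteration's LD/ADDD/SD chain its own register. Here is a hypothetical Python sketch of such a renaming pass; the pool of "free" registers is an assumption for illustration (a real compiler must know they are not live elsewhere).

```python
# Sketch of register renaming: each new write to F0/F4 gets a fresh
# register, removing the WAR/WAW (name) dependencies between iterations.

def rename(instrs, renamable=("F0", "F4"), free=None):
    free = free or ["F6", "F8", "F10", "F12", "F14", "F16"]  # assumed free
    mapping = {}           # current architectural -> renamed register
    out = []
    for op, dest, srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]      # read current names
        if dest in renamable and op in ("LD", "ADDD"):
            mapping[dest] = free.pop(0)               # fresh name per write
        out.append((op, mapping.get(dest, dest), srcs))
    return out

loop = [("LD", "F0", ["0(R1)"]), ("ADDD", "F4", ["F0", "F2"]),
        ("SD", "0(R1)", ["F4"]),
        ("LD", "F0", ["-8(R1)"]), ("ADDD", "F4", ["F0", "F2"]),
        ("SD", "-8(R1)", ["F4"])]

for ins in rename(loop):
    print(ins)
# the second LD now targets a different register than the first, so the
# two iterations no longer fight over the names F0 and F4
```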
Instruction Level Parallelism: Name Dependencies

Compiler Perspectives on Code Movement

• Again, name dependencies are hard for memory accesses:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn't change, then 0(R1), -8(R1), -16(R1) and -24(R1) are all distinct addresses, so the loads and stores of different iterations can be interchanged.
Instruction Level Parallelism: Control Dependencies

Compiler Perspectives on Code Movement

Where are the control dependencies?

1 Loop: LD    F0,0(R1)
2       ADDD  F4,F0,F2
3       SD    0(R1),F4
4       SUBI  R1,R1,8
5       BEQZ  R1,exit
6       LD    F0,0(R1)
7       ADDD  F4,F0,F2
8       SD    0(R1),F4
9       SUBI  R1,R1,8
10      BEQZ  R1,exit
11      LD    F0,0(R1)
12      ADDD  F4,F0,F2
13      SD    0(R1),F4
14      SUBI  R1,R1,8
15      BEQZ  R1,exit
....
Instruction Level Parallelism: Loop Level Parallelism

When Is It Safe to Unroll a Loop?

• Example: where are the data dependencies? (A, B, C distinct & non-overlapping)

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

• Contrast this with our prior example, where each iteration was distinct.
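The dependence in S1 is loop-carried: the A[i+1] computed in one iteration is the A[i] read by the next, so the iterations cannot run in parallel. A small Python transcription (array contents are hypothetical) makes this visible.

```python
# S1's write to A[i+1] feeds the next iteration's read of A[i]:
# a loop-carried dependence, so the iterations must run in order.

N = 100
A = [1.0] * (N + 2)
B = [1.0] * (N + 2)
C = [1.0] * (N + 2)

for i in range(1, N + 1):
    A[i + 1] = A[i] + C[i]        # S1: uses the previous iteration's S1
    B[i + 1] = B[i] + A[i + 1]    # S2: uses S1 of the SAME iteration

# A accumulates C along the chain: A[N+1] = A[1] + C[1] + ... + C[N]
assert A[N + 1] == 1.0 + N * 1.0
```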
Instruction Level Parallelism: Loop Level Parallelism

When Is It Safe to Unroll a Loop?

• Example: where are the data dependencies? (A, B, C, D distinct & non-overlapping)

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
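Following the corresponding example in Hennessy & Patterson, where S1 is A[i] = A[i] + B[i], the only loop-carried dependence is S1's read of B[i], which S2 wrote in the previous iteration. There is no dependence cycle, so the loop can be transformed into one with independent iterations. A hypothetical Python check that the transformed loop matches the original:

```python
# S1 reads B[i], written by S2 of the PREVIOUS iteration: loop-carried,
# but with no cycle. Peeling the first S1 and shifting the statements
# makes every remaining iteration self-contained.

N = 100
B0 = [2.0] * (N + 2)
C = [3.0] * (N + 2)
D = [4.0] * (N + 2)

# original loop (sequential semantics)
A1 = [1.0] * (N + 2); B1 = B0[:]
for i in range(1, N + 1):
    A1[i] = A1[i] + B1[i]          # S1
    B1[i + 1] = C[i] + D[i]        # S2

# transformed loop: each iteration depends only on its own statements
A2 = [1.0] * (N + 2); B2 = B0[:]
A2[1] = A2[1] + B2[1]              # peeled first use of B[1]
for i in range(1, N):
    B2[i + 1] = C[i] + D[i]        # produce B[i+1] ...
    A2[i + 1] = A2[i + 1] + B2[i + 1]   # ... and consume it right away
B2[N + 1] = C[N] + D[N]            # final S2, peeled off the end

assert A1 == A2 and B1 == B2       # same results, parallel-friendly form
```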
Dynamic Scheduling

Dynamic Scheduling is when the hardware rearranges the order of instruction execution to reduce stalls.

Advantages:
• Dependencies unknown at compile time can be handled by the hardware.
• Code compiled for one type of pipeline can be run efficiently on another.

Disadvantages:
• The hardware is much more complex.
Dynamic Scheduling: The Idea

HW Schemes: Instruction Parallelism

• Why do this in HW at run time?
  – It works when real dependences can't be known at compile time.
  – The compiler is simpler.
  – Code for one machine runs well on another.
• Key idea: allow instructions behind a stall to proceed.
• Key idea: instructions execute in parallel. There are multiple execution units, so use them.

DIVD  F0,F2,F4
ADDD  F10,F0,F8
SUBD  F12,F8,F14

  – Here SUBD does not depend on the slow DIVD, so it can proceed past the stalled ADDD.
  – This enables out-of-order execution => out-of-order completion.
Dynamic Scheduling: The Idea

HW Schemes: Instruction Parallelism

• Out-of-order execution divides the ID stage:
  1. Issue - decode instructions, check for structural hazards.
  2. Read operands - wait until no data hazards, then read operands.
• A scoreboard allows an instruction to execute whenever 1 & 2 hold, without waiting for prior instructions.
• A scoreboard is a "data structure" that provides the information necessary for all pieces of the processor to work together.
• We will use in-order issue, out-of-order execution, out-of-order commit (also called completion).
• First used in the CDC 6600; our example is modified here for DLX.
• The CDC 6600 had 4 FP units, 5 memory reference units, 7 integer units.
• DLX has 2 FP multipliers, 1 FP adder, 1 FP divider, 1 integer unit.
Dynamic Scheduling: Using a Scoreboard

Scoreboard Implications

• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR:
  – Queue both the operation and copies of its operands.
  – Read registers only during the Read Operands stage.
• For WAW, the hazard must be detected: stall until the other instruction completes.
• Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units.
• The scoreboard keeps track of dependencies and the state of operations.
• The scoreboard replaces ID, EX, WB with 4 stages.
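A minimal sketch of the checks a scoreboard makes, assuming a much-simplified register-status table rather than the full CDC design: issue is blocked by a busy functional unit (structural hazard) or a pending write to the destination (WAW), and operand read is blocked by a pending write to a source (RAW).

```python
# Simplified scoreboard hazard checks. result[r] names the functional
# unit that will write register r (absent if no write is pending).

result = {}                                    # register -> producing FU
busy = {"Integer": False, "Add": False, "Mult1": False}

def can_issue(fu, dest):
    # structural hazard: FU busy; WAW hazard: dest already being written
    return not busy[fu] and result.get(dest) is None

def issue(fu, dest):
    busy[fu] = True
    result[dest] = fu

def operands_ready(srcs):
    # RAW hazard: a source still waiting to be written by some FU
    return all(result.get(s) is None for s in srcs)

issue("Integer", "F6")                     # LD F6, 34(R2)
print(can_issue("Integer", "F2"))          # False: integer unit is busy
print(operands_ready(["F6", "F2"]))        # False: F6 not yet written
busy["Integer"] = False; result.pop("F6")  # LD writes back, frees the unit
print(can_issue("Integer", "F2"))          # True
```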
Dynamic Scheduling: Using a Scoreboard

Four Stages of Scoreboard Control

3. Execution - operate on operands (EX). The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
Dynamic Scheduling: Using a Scoreboard

Three Parts of the Scoreboard

1. Instruction status - which of the four stages each instruction is in.
2. Functional unit status - the state of each functional unit (Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk).
3. Register result status - which functional unit, if any, will write each register.
Dynamic Scheduling: Using a Scoreboard

Detailed Scoreboard Pipeline Control

Instruction status    Wait until              Bookkeeping
Issue                 Not Busy(FU) and        Busy(FU) <- yes; Op(FU) <- op;
                      not Result('D')         Fi(FU) <- 'D'; Fj(FU) <- 'S1'; Fk(FU) <- 'S2';
                                              Qj <- Result('S1'); Qk <- Result('S2');
                                              Rj <- not Qj; Rk <- not Qk; Result('D') <- FU
Read operands         Rj and Rk               Rj <- No; Rk <- No
Execution complete    Functional unit done    (functional unit notifies the scoreboard)
Dynamic Scheduling: Using a Scoreboard

Scoreboard Example

This is the sample code we'll be working with in the example:

LD     F6, 34(R2)
LD     F2, 45(R3)
MULTD  F0, F2, F4
SUBD   F8, F6, F2
DIVD   F10, F0, F6
ADDD   F6, F8, F2
Dynamic Scheduling: Using a Scoreboard

Scoreboard Example, Cycle 2

LD #2 can't issue since the integer unit is busy. MULTD can't issue because we require in-order issue.

Instruction status    Issue  Read operands  Exec complete  Write result
LD    F6, 34(R2)        1          2
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status (Fi = dest; Fj, Fk = sources; Qj, Qk = FU producing a source; Rj, Rk = source ready?)
Time  Name     Busy  Op    Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  Yes   Load  F6      R2              Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (clock 2): F6 <- Integer
Dynamic Scheduling: Using a Scoreboard

Scoreboard Example, Cycle 3

Instruction status    Issue  Read operands  Exec complete  Write result
LD    F6, 34(R2)        1          2              3
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  Yes   Load  F6      R2              Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (clock 3): F6 <- Integer
Dynamic Scheduling: Using a Scoreboard

Scoreboard Example, Cycle 4

Instruction status    Issue  Read operands  Exec complete  Write result
LD    F6, 34(R2)        1          2              3              4
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  Yes   Load  F6      R2              Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (clock 4): F6 <- Integer
Dynamic Scheduling: Using a Scoreboard

Scoreboard Example, Cycle 5

Issue LD #2, since the integer unit is now free.

Instruction status    Issue  Read operands  Exec complete  Write result
LD    F6, 34(R2)        1          2              3              4
LD    F2, 45(R3)        5
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  Yes   Load  F2      R3              Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (clock 5): F2 <- Integer
Dynamic Scheduling: Using a Scoreboard

Scoreboard Example, Cycle 6

Issue MULTD.

Instruction status    Issue  Read operands  Exec complete  Write result
LD    F6, 34(R2)        1          2              3              4
LD    F2, 45(R3)        5          6
MULTD F0, F2, F4        6
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi  Fj  Fk  Qj       Qk  Rj  Rk
      Integer  Yes   Load  F2      R3                   Yes
      Mult1    Yes   Mult  F0  F2  F4  Integer      No  Yes
      Mult2    No
      Add      No
      Divide   No

Register result status (clock 6): F0 <- Mult1, F2 <- Integer
Dynamic Scheduling: Using a Scoreboard

Scoreboard Example, Cycle 7

MULTD can't read its operands (F2) because LD #2 hasn't finished.

Instruction status    Issue  Read operands  Exec complete  Write result
LD    F6, 34(R2)        1          2              3              4
LD    F2, 45(R3)        5          6              7
MULTD F0, F2, F4        6
SUBD  F8, F6, F2        7
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi  Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2      R3                         Yes
      Mult1    Yes   Mult  F0  F2  F4  Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8  F6  F2           Integer  Yes  No
      Divide   No

Register result status (clock 7): F0 <- Mult1, F2 <- Integer, F8 <- Add
Dynamic Scheduling: Using a Scoreboard

Scoreboard Example, Cycle 11

ADDD can't issue because the add unit is busy.

Instruction status    Issue  Read operands  Exec complete  Write result
LD    F6, 34(R2)        1          2              3              4
LD    F2, 45(R3)        5          6              7              8
MULTD F0, F2, F4        6          9
SUBD  F8, F6, F2        7          9             11
DIVD  F10, F0, F6       8
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
8     Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6  F2             Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status (clock 11): F0 <- Mult1, F8 <- Add, F10 <- Divide
Dynamic Scheduling: Using a Scoreboard

Scoreboard Example, Cycle 62

DONE!!

Instruction status    Issue  Read operands  Exec complete  Write result
LD    F6, 34(R2)        1          2              3              4
LD    F2, 45(R3)        5          6              7              8
MULTD F0, F2, F4        6          9             19             20
SUBD  F8, F6, F2        7          9             11             12
DIVD  F10, F0, F6       8         21             61             62
ADDD  F6, F8, F2       13         14             16             22

Functional unit status
Time  Name     Busy  Op  Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No

Register result status (clock 62): (empty - everything has been written back)
Dynamic Scheduling: Tomasulo's Algorithm

Another Dynamic Algorithm: Tomasulo's Algorithm

• Used in the IBM 360/91, about 3 years after the CDC 6600 (1966).
• Goal: high performance without special compilers.
• Differences between the IBM 360 & CDC 6600 ISAs:
  – IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600.
  – IBM has 4 FP registers vs. 8 in the CDC 6600.
• Why study it? It led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …
Dynamic Scheduling: Tomasulo's Algorithm

Tomasulo's Algorithm vs. a Scoreboard

• Control & buffers are distributed with the functional units (FUs) vs. centralized in the scoreboard;
  – the FU buffers are called "reservation stations" and hold pending operands.
• Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming;
  – it avoids WAR and WAW hazards;
  – there are more reservation stations than registers, so it can do optimizations compilers can't.
• Results go to the FUs from the RSs, not through the registers, over a Common Data Bus that broadcasts results to all FUs.
• Loads and stores are treated as FUs with RSs as well.
• Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue.
Dynamic Scheduling: Tomasulo's Algorithm

Tomasulo Organization (figure): the FP op queue and FP registers feed reservation stations in front of the FP adder and FP multiplier; load and store buffers sit between memory and the units; the Common Data Bus carries every result to the registers, the store buffers and all reservation stations.
Dynamic Scheduling: Tomasulo's Algorithm

Reservation Station Components

Op - the operation to perform in the unit (e.g., + or –)
Vj, Vk - the values of the source operands
  – Store buffers have a V field, the result to be stored.
Qj, Qk - the reservation stations producing the source operands (the values to be written)
  – Note: no ready flags as in the scoreboard; Qj,Qk=0 => ready.
  – Store buffers only have Qi, for the RS producing the result.
Busy - indicates that the reservation station or FU is busy.
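The tag-based wakeup can be sketched in a few lines. This is a hypothetical simplification: each reservation station holds either a value (V field) or the tag (Q field) of the station that will produce it, and one Common Data Bus broadcast fills every matching Q field at once.

```python
# Minimal reservation-station + Common Data Bus sketch: Qj/Qk hold the
# tag of the producing unit; a CDB broadcast replaces tags with values.

stations = {
    "Add1":  {"op": "SUB", "Vj": 6.0,  "Qj": None,    "Vk": None, "Qk": "Load2"},
    "Mult1": {"op": "MUL", "Vj": None, "Qj": "Load2", "Vk": 4.0,  "Qk": None},
}

def broadcast(tag, value):
    # every station waiting on `tag` captures the value from the CDB
    for rs in stations.values():
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, None

def ready(name):
    rs = stations[name]
    return rs["Qj"] is None and rs["Qk"] is None   # Qj,Qk=0 => ready

print(ready("Add1"), ready("Mult1"))   # False False: both wait on Load2
broadcast("Load2", 45.0)               # the load completes, drives the CDB
print(ready("Add1"), ready("Mult1"))   # True True: one broadcast woke both
```

Note how no architectural register was involved: the value went straight from the producing unit to both consumers, which is why WAR and WAW hazards disappear.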
Dynamic Scheduling: Tomasulo's Algorithm

Tomasulo Example, Cycle 0

Instruction status    Issue  Exec complete  Write result
LD    F6, 34(R2)
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No

Register result status (clock 0): (empty)
Dynamic Hardware Prediction

Dynamic Branch Prediction is the ability of the hardware to make an educated guess about which way a branch will go: will the branch be taken or not?

The hardware can look for clues based on the instructions, or it can use past history; we will discuss both of these directions.
Dynamic Hardware Prediction: Branch Prediction Buffers

Dynamic Branch Prediction

• Performance = f(accuracy, cost of misprediction)
• Branch History Table: the lower bits of the PC address index a table of 1-bit values
  – which say whether or not the branch was taken the last time.
• Problem: in a loop, a 1-bit BHT causes two mispredictions:
  – at the end of the loop, when it exits instead of looping as before, and
  – on the first iteration of the next pass through the loop, when it predicts exit instead of looping.

(Figure: bits 13-2 of the 32-bit branch address index a table of 1-bit prediction entries, numbered 0 to 1023.)
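The two-mispredictions-per-pass behavior is easy to reproduce. A hypothetical sketch of a single 1-bit entry predicting a loop branch that is taken four times and then falls through, repeatedly:

```python
# One 1-bit branch-history entry on a loop branch whose outcomes per
# pass are taken, taken, taken, taken, not-taken.

state = 0                      # 1 = predict taken, 0 = predict not taken
mispredicts = 0
for _ in range(3):             # execute the whole loop three times
    for outcome in [1, 1, 1, 1, 0]:   # 4 iterations, then the exit
        if state != outcome:
            mispredicts += 1
        state = outcome        # 1-bit: remember only the last outcome
print(mispredicts)             # 6: per pass, both the first iteration
                               # and the loop exit are mispredicted
```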
Dynamic Hardware Prediction: Branch Prediction Buffers

Dynamic Branch Prediction

2-bit scheme: the prediction changes only after two consecutive mispredictions.

(State diagram: two "Predict Taken" states and two "Predict Not Taken" states; a taken branch (T) moves one step toward "Predict Taken", a not-taken branch (NT) moves one step toward "Predict Not Taken".)
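Running the same experiment with the 2-bit scheme above, sketched as a saturating counter, drops the cost to one misprediction per pass: the single not-taken outcome at the loop exit no longer flips the prediction.

```python
# 2-bit saturating counter: states 0..1 predict not taken, 2..3 predict
# taken. Two consecutive mispredictions are needed to flip the prediction.

counter = 3                    # start strongly taken
mispredicts = 0
for _ in range(3):             # execute the whole loop three times
    for outcome in [1, 1, 1, 1, 0]:   # 4 iterations, then the exit
        prediction = 1 if counter >= 2 else 0
        if prediction != outcome:
            mispredicts += 1
        # move toward the outcome, saturating at 0 and 3
        counter = min(counter + 1, 3) if outcome else max(counter - 1, 0)
print(mispredicts)             # 3: only the loop exit mispredicts each pass
```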
Dynamic Hardware Prediction: Branch Prediction Buffers

Correlating Branches

Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as to that branch's own history).
  – The behavior of recent branches then selects between, say, four predictions for the next branch, updating just the prediction that was used.
  – Hardware: the branch address and the recent branch history together index a table of 2-bit per-branch predictors.

(Chart: frequency of mispredictions for the SPEC benchmarks nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott and li, ranging from 0% to 11%.)
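A correlating predictor in this style can be sketched as a table of 2-bit counters indexed by branch-address bits plus a 2-bit global history. The table size and the branch pattern below are toy assumptions; the point is that the history separates cases a single per-branch counter cannot.

```python
# Sketch of a (2,2) correlating predictor: the 2-bit global history
# selects one of four 2-bit counters per branch-address index.

TABLE_BITS = 4
table = [[2] * 4 for _ in range(1 << TABLE_BITS)]  # start weakly taken
history = 0                                        # last 2 branch outcomes

def predict(pc):
    return 1 if table[pc % (1 << TABLE_BITS)][history] >= 2 else 0

def update(pc, outcome):
    global history
    ctr = table[pc % (1 << TABLE_BITS)]
    if outcome:
        ctr[history] = min(ctr[history] + 1, 3)
    else:
        ctr[history] = max(ctr[history] - 1, 0)
    history = ((history << 1) | outcome) & 0b11    # shift in the outcome

# a branch that is taken exactly when the previous two were not taken:
# each 2-bit history value maps to a unique outcome, so it is learnable
correct = total = 0
for outcome in [0, 0, 1] * 20:
    pc = 0x40                          # hypothetical branch address
    correct += predict(pc) == outcome
    update(pc, outcome)
    total += 1
print(correct / total)   # high: it mispredicts only while warming up
```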
Multiple Issue

Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.

Flavor I: superscalar processors issue a varying number of instructions per clock (1 to 8), scheduled either statically by the compiler or dynamically by the hardware (Tomasulo).
Multiple Issue: A Superscalar Version of DLX

Issuing Multiple Instructions/Cycle

In our DLX example, we can handle 2 instructions/cycle:
• one floating point,
• anything else.

– Fetch 64 bits/clock cycle; the integer instruction on the left, the FP instruction on the right.
– The 2nd instruction can issue only if the 1st instruction issues.
– More ports are needed on the FP registers to do an FP load & an FP op as a pair.

Type              Pipe stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB

• A 1-cycle load delay stalls 3 instructions in the superscalar:
  – the instruction in the right half of the same slot can't use the loaded value, nor can the instructions in the next slot.
Multiple Issue: A Superscalar Version of DLX

Unrolled Loop That Minimizes Stalls for Scalar

Latencies: LD to ADDD, 1 cycle; ADDD to SD, 2 cycles.

1 Loop: LD    F0,0(R1)
2       LD    F6,-8(R1)
3       LD    F10,-16(R1)
4       LD    F14,-24(R1)
5       ADDD  F4,F0,F2
6       ADDD  F8,F6,F2
7       ADDD  F12,F10,F2
8       ADDD  F16,F14,F2
9       SD    0(R1),F4
10      SD    -8(R1),F8
11      SD    -16(R1),F12
12      SUBI  R1,R1,#32
13      BNEZ  R1,LOOP
14      SD    8(R1),F16   ; 8-32 = -24
Multiple Issue: Limitations

Limits to Multi-Issue Machines

• Inherent limitations of ILP:
  – 1 branch in 5 instructions => how do we keep a 5-way VLIW busy?
  – Latencies of units => many operations must be scheduled.
  – Need about pipeline depth x number of functional units independent operations to keep the machine busy.
• Difficulties in building the HW:
  – Functional units must be duplicated to get parallel execution.
  – More ports to the register file (the VLIW example needs 6 read and 3 write ports for the integer registers & 6 read and 4 write ports for the FP registers).
  – More ports to memory.
  – Superscalar decoding and its impact on clock rate and pipeline depth.
Multiple Issue: Limitations

Multiple Issue Challenges

• While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  – exactly 50% FP operations, and
  – no hazards.
• The more instructions issue at the same time, the harder decode and issue become:
  – even a 2-scalar machine must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue.
• VLIW: trade instruction space for simple decoding.
  – The long instruction word has room for many operations.
  – By definition, all the operations the compiler puts in the long instruction word are independent => they execute in parallel.
  – E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch:
    • 16 to 24 bits per field => 7x16 = 112 bits to 7x24 = 168 bits wide.
  – Needs a compiling technique that schedules across several branches.
Compiler Support For ILP

How can compilers be smart?
1. Produce a good schedule of the code.
2. Determine which loops might contain parallelism.
3. Eliminate name dependencies.

Compilers must be REALLY smart to figure out aliases; pointers in C are a real problem.

The techniques lead to:
• Symbolic loop unrolling
• Critical path scheduling
Compiler Support For ILP: Symbolic Loop Unrolling

Software Pipelining

• Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations.
• Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW).

(Figure: iterations 0 to 4 overlap in time; one software-pipelined iteration takes one instruction from each of several original iterations.)
Compiler Support For ILP: Symbolic Loop Unrolling

SW Pipelining Example

Before: unrolled 3 times
 1 Loop: LD    F0,0(R1)
 2       ADDD  F4,F0,F2
 3       SD    0(R1),F4
 4       LD    F6,-8(R1)
 5       ADDD  F8,F6,F2
 6       SD    -8(R1),F8
 7       LD    F10,-16(R1)
 8       ADDD  F12,F10,F2
 9       SD    -16(R1),F12
10       SUBI  R1,R1,#24
11       BNEZ  R1,LOOP

After: software pipelined
         LD    F0,0(R1)
         ADDD  F4,F0,F2
         LD    F0,-8(R1)
 1 Loop: SD    0(R1),F4     ; stores M[i]
 2       ADDD  F4,F0,F2     ; adds to M[i-1]
 3       LD    F0,-16(R1)   ; loads M[i-2]
 4       SUBI  R1,R1,#8
 5       BNEZ  R1,LOOP
         SD    0(R1),F4
         ADDD  F4,F0,F2
         SD    -8(R1),F4

(Pipeline diagram: in the kernel, SD reads F4 just before ADDD overwrites it, and ADDD reads F0 just before LD overwrites it.)
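The before/after code above can be mimicked in Python to check that a software-pipelined schedule computes the same result. This is a hypothetical sketch: the kernel stores element i, adds element i-1, and loads element i-2, mirroring the SD / ADDD / LD kernel on the slide, with explicit start-up and wind-down code.

```python
# Software-pipelining sketch for "add s to every element": the kernel
# mixes instructions from three different original iterations.

def add_scalar_pipelined(a, s):
    n = len(a)
    if n < 3:                      # too short for the 3-deep kernel
        return [x + s for x in a]
    out = [None] * n
    # start-up (prologue): fill the software pipeline
    f0 = a[0]              # LD
    f4 = f0 + s            # ADDD
    f0 = a[1]              # LD
    # kernel: one store, one add, one load per "iteration"
    for i in range(n - 2):
        out[i] = f4        # SD   stores element i
        f4 = f0 + s        # ADDD works on element i+1
        f0 = a[i + 2]      # LD   fetches element i+2
    # wind-down (epilogue): drain the pipeline
    out[n - 2] = f4
    out[n - 1] = f0 + s
    return out

data = [float(i) for i in range(10)]
assert add_scalar_pipelined(data, 2.0) == [x + 2.0 for x in data]
```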
Compiler Support For ILP: Symbolic Loop Unrolling

Software Pipelining vs. Loop Unrolling

Symbolic loop unrolling (software pipelining):
  – less code space;
  – the overhead is paid only once, vs. on each iteration with loop unrolling.

Trace Scheduling
• Parallelism across IF branches vs. LOOP branches.
• Two steps:
  – Trace selection
    • Find a likely sequence of basic blocks (a trace), statically or profile predicted, forming a long sequence of straight-line code.
  – Trace compaction
    • Squeeze the trace into a few VLIW instructions.
    • Bookkeeping code is needed in case the prediction is wrong.
• The compiler undoes a bad guess (discards values in registers).
• Subtle compiler bugs mean a wrong answer rather than merely poorer performance; there are no hardware interlocks.
Hardware Support For Parallelism

Software support of ILP works best when the code is predictable at compile time. But what if there's no predictability?

Here we'll talk about hardware techniques. These include:
• Conditional or predicated instructions
• Hardware speculation
Hardware Support For Parallelism: Nullified Instructions

Tell the Hardware To Ignore An Instruction

• Avoid branch prediction by turning branches into conditionally executed instructions:
  IF (x) then A = B op C else NOP
  – If the condition is false, neither store the result nor cause an exception.
  – The expanded ISAs of Alpha, MIPS, PowerPC, SPARC and x86 have conditional move. PA-RISC can annul any following instruction.
  – IA-64: 64 1-bit condition fields selected, so conditional execution of any instruction.
• Drawbacks of conditional instructions:
  – They still take a clock cycle, even if "annulled".
  – They stall if the condition is evaluated late.
  – Complex conditions reduce their effectiveness, since the condition becomes known late in the pipeline.

This can be a major win because no time is lost by taking a branch!!
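In a high-level language the same trick is turning a branch into a data movement. A hypothetical sketch of the conditional-move semantics (Python's ternary is of course itself a branch; the point is the shape of the transformation):

```python
# Conditional-move sketch: replace an if/branch by an unconditional
# compute plus a select, so there is no branch to mispredict.

def with_branch(var_a, var_s, var_t):
    if var_a == 0:          # a branch the hardware must predict
        var_s = var_t
    return var_s

def with_cmov(var_a, var_s, var_t):
    # both candidate values exist; a select keeps the right one.
    # This costs an instruction even when the move is "annulled".
    return var_t if var_a == 0 else var_s   # like CMOVZ VarS, R2, R1

for a in (0, 7):
    assert with_branch(a, 1, 2) == with_cmov(a, 1, 2)
```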
Hardware Support For Parallelism: Nullified Instructions

Tell the Hardware To Ignore An Instruction

Suppose we have the code:

if ( VarA == 0 )
    VarS = VarT;

Previous method:
    LD    R1, VarA
    BNEZ  R1, Label
    LD    R2, VarT
    SD    VarS, R2
Label:

Nullified method (compare and nullify the next instruction if not zero):
    LD      R1, VarA
    LD      R2, VarT
    CMPNNZ  R1, #0
    SD      VarS, R2
Label:

Nullified method (compare and move if zero):
    LD     R1, VarA
    LD     R2, VarT
    CMOVZ  VarS, R2, R1
Hardware Support For Parallelism: Compiler Speculation

Increasing Parallelism

The idea here is to move an instruction across a branch so as to increase the size of a basic block and thus increase parallelism.

1. Use a set of status bits (poison bits) associated with the registers. They signal that an instruction's result is invalid until some later time.
2. The result of an instruction isn't written until it's certain that the instruction is no longer speculative.
Hardware Support For Parallelism: Compiler Speculation

Increasing Parallelism

Example on page 305. Code for:

if ( A == 0 )
    A = B;
else
    A = A + 4;

Assume A is at 0(R3) and B is at 0(R2).

Original code:
    LW    R1, 0(R3)     ; load A
    BNEZ  R1, L1        ; test A
    LW    R1, 0(R2)     ; if clause
    J     L2            ; skip else
L1: ADDI  R1, R1, #4    ; else clause
L2: SW    0(R3), R1     ; store A

Speculated code:
    LW    R1, 0(R3)     ; load A
    LW    R14, 0(R2)    ; speculative load of B
    BEQZ  R1, L3        ; branch the other way
    ADDI  R14, R1, #4   ; else clause
L3: SW    0(R3), R14    ; non-speculative store

Note here that only ONE side needs to take a branch!!
Hardware Support For Parallelism: Compiler Speculation

Poison Bits

In the example on the last page, if the LW* (speculative load) produces an exception, a poison bit is set on that register. Then, if a later instruction tries to use the register, an exception is raised at that point.

Speculated code:
    LW    R1, 0(R3)     ; load A
    LW*   R14, 0(R2)    ; speculative load of B
    BEQZ  R1, L3        ; branch the other way
    ADDI  R14, R1, #4   ; else clause
L3: SW    0(R3), R14    ; non-speculative store
Hardware Support For Parallelism: Hardware Speculation

HW Support for More ILP

• Need a HW buffer for the results of uncommitted instructions: the reorder buffer.
  – The reorder buffer can be an operand source.
  – Once an operand commits, the result is found in the register.
  – 3 fields: instruction type, destination, value.
  – Use the reorder buffer number instead of the reservation station number.
  – Discard instructions on mispredicted branches or on exceptions.

(Figure: the FP op queue and FP registers feed the reservation stations and FP adders; results enter the reorder buffer before being committed to the registers.)
Hardware Support For Parallelism: Hardware Speculation

HW Support for More ILP

How is this used in practice?

(Chart: instructions that could theoretically be issued per cycle, per program. FP programs: 75 to 150; integer programs: 18 to 60. gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1.)
Studies of ILP: Impact of Realistic Branch Prediction

(Figure: a tournament predictor. Two bits of global history and the branch address index an 8K x 2-bit selector table that chooses between a correlating predictor (2048 x 4 x 2 bits) and a non-correlating 2-bit predictor; the 2-bit counter values 00 to 11 range from "not taken" to "taken".)
Studies of ILP: Impact of Realistic Branch Prediction

Limiting the type of branch prediction (Figure 4.42, page 325).

(Charts: instruction issues per cycle for each program under progressively more realistic assumptions. With realistic branch prediction, FP programs achieve roughly 15 to 45 issues per cycle and integer programs 6 to 12; under more restrictive models the integer ranges drop to 5 to 15 and then 4 to 9. A final comparison covers alias-analysis models: perfect, global/stack perfect, inspection, and none.)
Summary