Advanced Computer Architecture

Chapter 4
Advanced Pipelining

Ioannis Papaefstathiou
CS 590.25
Easter 2003
(thanks to Hennessy & Patterson)
Chapter Overview
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

Chap. 4 - Pipelining II 2
Chapter Overview
Technique                                   Reduces                        Section
Loop Unrolling                              Control stalls                 4.1
Basic Pipeline Scheduling                   RAW stalls                     4.1
Dynamic Scheduling with Scoreboarding       RAW stalls                     4.2
Dynamic Scheduling with Register Renaming   WAR and WAW stalls             4.2
Dynamic Branch Prediction                   Control stalls                 4.3
Issue Multiple Instructions per Cycle       Ideal CPI                      4.4
Compiler Dependence Analysis                Ideal CPI & data stalls        4.5
Software Pipelining and Trace Scheduling    Ideal CPI & data stalls        4.5
Speculation                                 All data & control stalls      4.6
Dynamic Memory Disambiguation               RAW stalls involving memory    4.2, 4.6

Instruction Level Parallelism
4.1 Instruction Level Parallelism: Concepts and Challenges

ILP is the principle that many instructions in a program don’t depend on each other. That means it’s possible to execute those instructions in parallel.

This is easier said than done. Issues include:
• Building compilers to analyze the code,
• Building hardware to be even smarter than that code.

This section looks at some of the problems to be solved.

Pipeline Scheduling and Loop Unrolling
Terminology
Basic Block - The set of instructions between entry points and between
branches. A basic block has only one entry and one exit. Typically
this is about 6 instructions long.

Loop Level Parallelism - The parallelism that exists within a loop. Such
parallelism can cross loop iterations.

Loop Unrolling - Replicating the loop body several times and adjusting the
loop-control code, so that either the compiler or the hardware is able to
exploit the parallelism inherent in the loop.

Simple Loop and its Assembler Equivalent

This is a clean and simple example!

for (i=1; i<=1000; i++)
    x(i) = x(i) + s;

Loop: LD   F0,0(R1)   ;F0=vector element
      ADDD F4,F0,F2   ;add scalar from F2
      SD   0(R1),F4   ;store result
      SUBI R1,R1,8    ;decrement pointer 8 bytes (DW)
      BNEZ R1,Loop    ;branch R1!=zero
      NOP             ;delayed branch slot

FP Loop Hazards
Loop: LD   F0,0(R1)   ;F0=vector element
      ADDD F4,F0,F2   ;add scalar in F2
      SD   0(R1),F4   ;store result
      SUBI R1,R1,8    ;decrement pointer 8B (DW)
      BNEZ R1,Loop    ;branch R1!=zero
      NOP             ;delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

Where are the stalls?

FP Loop Showing Stalls

1  Loop: LD   F0,0(R1)   ;F0=vector element
2        stall
3        ADDD F4,F0,F2   ;add scalar in F2
4        stall
5        stall
6        SD   0(R1),F4   ;store result
7        SUBI R1,R1,8    ;decrement pointer 8 bytes (DW)
8        stall
9        BNEZ R1,Loop    ;branch R1!=zero
10       stall           ;delayed branch slot

10 clocks per iteration. Can the code be rewritten to minimize stalls?

Scheduled FP Loop Minimizing Stalls
1  Loop: LD   F0,0(R1)
2        SUBI R1,R1,8
3        ADDD F4,F0,F2
4        stall           ;SD can’t proceed until the ADDD result is ready
5        BNEZ R1,Loop    ;delayed branch
6        SD   8(R1),F4   ;offset altered when moved past SUBI

Swap BNEZ and SD by changing the address of SD.

Now 6 clocks per iteration. Next, unroll the loop 4 times to make it faster.

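The stall counts on the last two slides can be checked with a small model. The sketch below is only an illustration under stated assumptions: in-order issue of one instruction per cycle, the latency table from the slides, and an extra integer-op-to-branch latency of 1 that is inferred from the SUBI/BNEZ stall shown above (it is not in the table). The names `LATENCY`, `issue_cycles`, `ORIGINAL` and `SCHEDULED` are made up for this example. The model places the one stall of the scheduled loop after BNEZ rather than before it, so it reproduces the cycle totals but not the exact delay-slot placement.

```python
# Hypothetical model: each instruction is (text, kind, dest, sources).
LATENCY = {  # (producer kind, consumer kind) -> stall cycles needed between them
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,  # assumption inferred from the SUBI -> BNEZ stall
}

def issue_cycles(code):
    """Return the clock cycle in which each instruction issues (in order)."""
    last_writer = {}  # register -> (issue cycle, kind) of its latest producer
    cycles = []
    clock = 0
    for _text, kind, dest, sources in code:
        earliest = clock + 1
        for reg in sources:
            if reg in last_writer:
                produced_at, producer_kind = last_writer[reg]
                gap = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, produced_at + 1 + gap)
        clock = earliest
        cycles.append(clock)
        if dest is not None:
            last_writer[dest] = (clock, kind)
    return cycles

ORIGINAL = [
    ("LD F0,0(R1)", "load", "F0", ["R1"]),
    ("ADDD F4,F0,F2", "fp_alu", "F4", ["F0", "F2"]),
    ("SD 0(R1),F4", "store", None, ["R1", "F4"]),
    ("SUBI R1,R1,8", "int", "R1", ["R1"]),
    ("BNEZ R1,Loop", "branch", None, ["R1"]),
    ("NOP", "nop", None, []),
]

SCHEDULED = [
    ("LD F0,0(R1)", "load", "F0", ["R1"]),
    ("SUBI R1,R1,8", "int", "R1", ["R1"]),
    ("ADDD F4,F0,F2", "fp_alu", "F4", ["F0", "F2"]),
    ("BNEZ R1,Loop", "branch", None, ["R1"]),
    ("SD 8(R1),F4", "store", None, ["R1", "F4"]),
]

if __name__ == "__main__":
    print(issue_cycles(ORIGINAL)[-1], "clocks for the original loop")
    print(issue_cycles(SCHEDULED)[-1], "clocks for the scheduled loop")
```

The last entry of each returned list is the cycle of the final instruction, which is the per-iteration clock count quoted on the slides.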
Unroll Loop Four Times (straightforward way)
1  Loop: LD   F0,0(R1)
2        stall
3        ADDD F4,F0,F2
4        stall
5        stall
6        SD   0(R1),F4
7        LD   F6,-8(R1)
8        stall
9        ADDD F8,F6,F2
10       stall
11       stall
12       SD   -8(R1),F8
13       LD   F10,-16(R1)
14       stall
15       ADDD F12,F10,F2
16       stall
17       stall
18       SD   -16(R1),F12
19       LD   F14,-24(R1)
20       stall
21       ADDD F16,F14,F2
22       stall
23       stall
24       SD   -24(R1),F16
25       SUBI R1,R1,#32
26       stall
27       BNEZ R1,Loop
28       NOP
15 + 4 x (1+2) + 1 = 28 clock cycles, or 7 per iteration
(15 instructions including the NOP, 1 LD stall and 2 ADDD stalls per copy of the body, and 1 stall between SUBI and BNEZ)
Assumes the iteration count is a multiple of 4 (R1 is a multiple of 32)
Rewrite loop to minimize stalls.
Unrolled Loop That Minimizes Stalls
1  Loop: LD   F0,0(R1)
2        LD   F6,-8(R1)
3        LD   F10,-16(R1)
4        LD   F14,-24(R1)
5        ADDD F4,F0,F2
6        ADDD F8,F6,F2
7        ADDD F12,F10,F2
8        ADDD F16,F14,F2
9        SD   0(R1),F4
10       SD   -8(R1),F8
11       SUBI R1,R1,#32
12       SD   16(R1),F12   ; 16-32 = -16
13       BNEZ R1,Loop
14       SD   8(R1),F16    ; 8-32 = -24
No Stalls!!
14 clock cycles, or 3.5 per iteration

What assumptions were made when the code was moved?
– OK to move a store past SUBI even though SUBI changes the register
– OK to move loads before stores: do we get the right data?
– When is it safe for the compiler to make such changes?
Summary of Loop Unrolling Example
• Determine that it was legal to move the SD after the SUBI and BNEZ,
and find the amount to adjust the SD offset.
• Determine that unrolling the loop would be useful by finding that the
loop iterations were independent, except for the loop maintenance
code.
• Use different registers to avoid unnecessary constraints that would
be forced by using the same registers for different computations.
• Eliminate the extra tests and branches and adjust the loop
maintenance code.
• Determine that the loads and stores in the unrolled loop can be
interchanged by observing that the loads and stores from different
iterations are independent. This requires analyzing the memory
addresses and finding that they do not refer to the same address.
• Schedule the code, preserving any dependences needed to yield the
same result as the original code.
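At source level, the transformation summarized above might look like the following sketch. It is an illustration rather than the compiler's actual output: the function names are invented here, Python lists stand in for the vector x, and the temporaries t0..t3 play the role of the distinct registers F4, F8, F12, F16. As on the slides, it assumes the element count is a multiple of 4.

```python
def add_scalar(x, s):
    """Original loop: one element per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + s

def add_scalar_unrolled4(x, s):
    """Loop unrolled four times; assumes len(x) is a multiple of 4."""
    n = len(x)
    if n % 4 != 0:
        raise ValueError("element count must be a multiple of 4")
    for i in range(0, n, 4):
        # All loads and adds first, using a different temporary for each
        # copy of the body (the "different registers" step).
        t0 = x[i] + s
        t1 = x[i + 1] + s
        t2 = x[i + 2] + s
        t3 = x[i + 3] + s
        # Stores last; only one loop test and one index update remain.
        x[i] = t0
        x[i + 1] = t1
        x[i + 2] = t2
        x[i + 3] = t3

if __name__ == "__main__":
    a = [float(k) for k in range(1000)]
    b = list(a)
    add_scalar(a, 2.5)
    add_scalar_unrolled4(b, 2.5)
    print(a == b)
```

Moving the four loads ahead of the four stores is legal here only because the iterations touch different elements, which is the address analysis the summary refers to.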
Dependencies
Compiler Perspectives on Code Movement
The compiler is concerned about dependencies in the program, whether or not
they cause a HW hazard on a given pipeline.
• Tries to schedule code to avoid hazards.
• Looks for data dependencies (RAW if a hazard for HW):
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data
dependent on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory:
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?

Data Dependencies
Compiler Perspectives on Code Movement

Where are the data dependencies?

1  Loop: LD   F0,0(R1)
2        ADDD F4,F0,F2
3        SUBI R1,R1,8
4        BNEZ R1,Loop    ;delayed branch
5        SD   8(R1),F4   ;offset altered when moved past SUBI

Name Dependencies
Compiler Perspectives on Code Movement

• Another kind of dependence is called name dependence:
two instructions use the same name (register or memory location) but don’t
exchange data
• Anti-dependence (WAR if a hazard for HW)
– Instruction j writes a register or memory location that instruction i reads from
and instruction i is executed first
• Output dependence (WAW if a hazard for HW)
– Instruction i and instruction j write the same register or memory location;
ordering between instructions must be preserved.

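The three kinds of dependence defined so far (data, anti, output) can be told apart by comparing the destination and source names of two instructions. The function below is a minimal sketch with invented names; it compares one pair in isolation and ignores any instruction in between that might overwrite the same register, so it is a teaching aid rather than a real dependence analyzer.

```python
def dependences(first, second):
    """Classify dependences between two instructions.

    `first` precedes `second` in program order. Each instruction is a
    (dest, sources) pair: dest is a register name or None, sources is a set.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 is not None and dest1 in srcs2:
        kinds.add("RAW")  # data dependence: second reads what first writes
    if dest2 is not None and dest2 in srcs1:
        kinds.add("WAR")  # anti-dependence: second overwrites what first reads
    if dest1 is not None and dest1 == dest2:
        kinds.add("WAW")  # output dependence: both write the same name
    return kinds

# Two iterations of the FP loop that reuse F0 and F4.
LD_1 = ("F0", {"R1"})          # LD   F0,0(R1)
ADDD_1 = ("F4", {"F0", "F2"})  # ADDD F4,F0,F2
LD_2 = ("F0", {"R1"})          # LD   F0,-8(R1)
ADDD_2 = ("F4", {"F0", "F2"})  # ADDD F4,F0,F2

if __name__ == "__main__":
    print(dependences(LD_1, ADDD_1))    # data dependence through F0
    print(dependences(ADDD_1, LD_2))    # name dependence (anti) on F0
    print(dependences(ADDD_1, ADDD_2))  # name dependence (output) on F4
```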
Where are the name dependencies?

1  Loop: LD   F0,0(R1)
2        ADDD F4,F0,F2
3        SD   0(R1),F4
4        LD   F0,-8(R1)
5        ADDD F4,F0,F2
6        SD   -8(R1),F4
7        LD   F0,-16(R1)
8        ADDD F4,F0,F2
9        SD   -16(R1),F4
10       LD   F0,-24(R1)
11       ADDD F4,F0,F2
12       SD   -24(R1),F4
13       SUBI R1,R1,#32
14       BNEZ R1,Loop
15       NOP

No data is passed in F0 between iterations, but F0 can’t be reused in cycle 4
until the earlier ADDD has read it.

How can we remove these dependencies?
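One standard answer is register renaming: give every newly written value its own register, as the unrolled loop did by hand with F6, F8, F10 and so on. The sketch below is an assumption-laden illustration, not a description of any particular compiler or processor: it renames only FP destinations, draws fresh names T0, T1, ... from an unlimited pool, and ignores the problem of values that must end up in a specific register after the loop. All names are invented for the example.

```python
def rename_registers(code):
    """Give each FP destination a fresh name and patch later readers.

    `code` is a list of (op, dest, sources) tuples with string register names.
    """
    current = {}  # architectural FP register -> most recent fresh name
    renamed = []
    counter = 0
    for op, dest, sources in code:
        # Sources read whatever name currently holds the value.
        new_sources = [current.get(reg, reg) for reg in sources]
        new_dest = dest
        if dest is not None and dest.startswith("F"):
            new_dest = f"T{counter}"
            counter += 1
            current[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed

TWO_ITERATIONS = [
    ("LD", "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD", None, ["R1", "F4"]),
    ("LD", "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD", None, ["R1", "F4"]),
]

if __name__ == "__main__":
    for instruction in rename_registers(TWO_ITERATIONS):
        print(instruction)
```

After renaming, no two instructions write the same name, so only the true data dependences (LD to ADDD to SD within each iteration) remain.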
• Again, name dependencies are hard to determine for memory accesses
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn’t change then
0(R1), -8(R1), -16(R1) and -24(R1) are all different addresses:

0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)

There were no dependencies between some loads and stores, so they
could be moved around each other.

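The reasoning about 0(R1), -8(R1) and so on can be phrased as a conservative test. The helper below is a hypothetical sketch: it assumes 8-byte accesses, assumes the base register is not modified between the two references, and answers "may alias" whenever it cannot prove the addresses differ.

```python
def may_alias(ref_a, ref_b, size=8):
    """Return False only when two references provably touch different memory.

    Each reference is an (offset, base_register) pair such as (-8, "R1").
    """
    offset_a, base_a = ref_a
    offset_b, base_b = ref_b
    if base_a == base_b:
        # Same unchanged base: the accesses overlap only if the offsets
        # are closer together than the access size.
        return abs(offset_a - offset_b) < size
    # Different base registers: their values are unknown at compile time.
    return True

if __name__ == "__main__":
    print(may_alias((0, "R1"), (-8, "R1")))    # same base, distinct offsets
    print(may_alias((100, "R4"), (20, "R6")))  # does 100(R4) = 20(R6)? unknown
```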
Control Dependencies
Compiler Perspectives on Code Movement

• The final kind of dependence is called control dependence
• Example
if p1 {S1;};
if p2 {S2;};
S1 is control dependent on p1, and S2 is control dependent on p2 but not
on p1.

• Two (obvious) constraints on control dependences:
– An instruction that is control dependent on a branch cannot be moved
before the branch so that its execution is no longer controlled by the
branch.
– An instruction that is not control dependent on a branch cannot be
moved to after the branch so that its execution is controlled by the
branch.
• Control dependencies can be relaxed to get parallelism; we get the same
effect if we preserve the order of exceptions (address in register checked by
branch before use) and the data flow (value in register depends on branch)

Where are the control dependencies?

1  Loop: LD   F0,0(R1)
2        ADDD F4,F0,F2
3        SD   0(R1),F4
4        SUBI R1,R1,8
5        BEQZ R1,exit
6        LD   F0,0(R1)
7        ADDD F4,F0,F2
8        SD   0(R1),F4
9        SUBI R1,R1,8
10       BEQZ R1,exit
11       LD   F0,0(R1)
12       ADDD F4,F0,F2
13       SD   0(R1),F4
14       SUBI R1,R1,8
15       BEQZ R1,exit
         ....
Loop Level Parallelism
When Safe to Unroll Loop?
• Example: Where are data dependencies?
(A,B,C distinct & non-overlapping)
for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

1. S2 uses the value, A[i+1], computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since
iteration i computes A[i+1] which is read in iteration i+1. The same
is true of S2 for B[i] and B[i+1].
This is a “loop-carried dependence” between iterations

• Implies that iterations are dependent, and can’t be executed in parallel
• Not the case for our prior example; each iteration was distinct

When Safe to Unroll Loop?
• Example: Where are data dependencies?
(A,B,C,D distinct & non-overlapping)
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

1. S1 depends on S2 through B (a loop-carried dependence), but there is no
dependence from S1 to S2. If there were, then there would be a
cycle in the dependencies and the loop would not be parallel. Since
this other dependence is absent, interchanging the two statements
will not affect the execution of S2.
2. On the first iteration of the loop, statement S1 depends on the value
of B[1] computed prior to initiating the loop.

Now Safe to Unroll Loop? (p. 240)

OLD:
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
No circular dependencies, but the loop carries a dependence on B.

NEW:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
The loop-carried dependence has been eliminated.

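A quick way to gain confidence in this kind of restructuring is to run both versions on the same data. The sketch below assumes S1 has the form A[i] = A[i] + B[i], which is the form the NEW code's prologue A[1] = A[1] + B[1] corresponds to. The function names and the random integer inputs are invented for the check; index 0 of each list is unused so that the subscripts match the slide.

```python
import random

def old_loop(A, B, C, D):
    for i in range(1, 101):
        A[i] = A[i] + B[i]      # S1: reads B[i] written by S2 one iteration earlier
        B[i + 1] = C[i] + D[i]  # S2

def new_loop(A, B, C, D):
    A[1] = A[1] + B[1]          # first S1, peeled off in front of the loop
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]          # S2 of iteration i
        A[i + 1] = A[i + 1] + B[i + 1]  # S1 of iteration i+1, same iteration now
    B[101] = C[100] + D[100]    # last S2, peeled off behind the loop

def run(loop, seed=0):
    """Run `loop` on four freshly generated arrays and return them."""
    rng = random.Random(seed)
    arrays = [[rng.randint(0, 99) for _ in range(102)] for _ in range(4)]
    loop(*arrays)
    return arrays

if __name__ == "__main__":
    print(run(old_loop) == run(new_loop))
```

In the new loop body the value of B[i+1] is produced and consumed in the same iteration, so no iteration depends on a result from another one.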
Dynamic Scheduling
4.2 Overcoming Data Hazards with Dynamic Scheduling

Dynamic Scheduling is when the hardware rearranges the order of
instruction execution to reduce stalls.

Advantages:
• Dependencies unknown at compile time can be handled by the hardware.
• Code compiled for one type of pipeline can be efficiently run on another.

Disadvantages:
• Hardware much more complex.

The idea: HW Schemes: Instruction Parallelism
• Why in HW at run time?
– Works when real dependences can’t be known at compile time
– Compiler simpler
– Code for one machine runs well on another
• Key Idea: Allow instructions behind a stall to proceed.
• Key Idea: Execute instructions in parallel. There are multiple
execution units, so use them.

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

– ADDD must wait for F0 from DIVD, but SUBD depends on neither and can proceed
– Enables out-of-order execution => out-of-order completion

• Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
• Scoreboards allow an instruction to execute whenever 1 & 2 hold, without
waiting for prior instructions.
• A scoreboard is a “data structure” that provides the information
necessary for all pieces of the processor to work together.
• We will use in-order issue, out-of-order execution, and out-of-order
commit (also called completion)
• First used in the CDC 6600. Our example is modified here for DLX.
• CDC had 4 FP units, 5 memory reference units, 7 integer units.
• DLX has 2 FP multiply, 1 FP adder, 1 FP divider, 1 integer.

Using A Scoreboard
Scoreboard Implications
• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR
– Queue both the operation and copies of its operands
– Read registers only during Read Operands stage
• For WAW, must detect hazard: stall until other completes
• Need to have multiple instructions in execution phase => multiple
execution units or pipelined execution units
• Scoreboard keeps track of dependencies and the state of operations
• Scoreboard replaces ID, EX, WB with 4 stages

Four Stages of Scoreboard Control

1. Issue: decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other active
instruction has the same destination register (WAW), the
scoreboard issues the instruction to the functional unit and
updates its internal data structure.
If a structural or WAW hazard exists, then the instruction issue
stalls, and no further instructions will issue until these hazards
are cleared.

2. Read operands: wait until no data hazards, then read
operands (ID2)
A source operand is available if no earlier issued active
instruction is going to write it, or if the register containing
the operand is being written by a currently active
functional unit.
When the source operands are available, the scoreboard tells
the functional unit to proceed to read the operands from
the registers and begin execution. The scoreboard
resolves RAW hazards dynamically in this step, and
instructions may be sent into execution out of order.

3. Execution: operate on operands (EX)
The functional unit begins execution upon receiving
operands. When the result is ready, it notifies the
scoreboard that it has completed execution.

4. Write result: finish execution (WB)
Once the scoreboard is aware that the functional unit has
completed execution, the scoreboard checks for WAR
hazards. If none, it writes results. If WAR, then it stalls the
instruction.
Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
The scoreboard would stall the write of SUBD until ADDD reads its operands.

Three Parts of the Scoreboard

1. Instruction status: which of 4 steps the instruction is in

2. Functional unit status: indicates the state of the functional unit (FU).
9 fields for each functional unit:
Busy: Indicates whether the unit is busy or not
Op: Operation to perform in the unit (e.g., + or –)
Fi: Destination register
Fj, Fk: Source-register numbers
Qj, Qk: Functional units producing source registers Fj, Fk
Rj, Rk: Flags indicating when Fj, Fk are ready

3. Register result status: indicates which functional unit will write each
register, if one exists. Blank when no pending instructions will write that
register

Detailed Scoreboard Pipeline Control
(FU is the functional unit used by the instruction; D is its destination
register and S1, S2 are its source registers.)

Instruction status: Issue
Wait until: not Busy(FU) and not Result(D)
Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2;
Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU

Instruction status: Read operands
Wait until: Rj and Rk
Bookkeeping: Rj ← No; Rk ← No

Instruction status: Execution complete
Wait until: Functional unit done
Bookkeeping: (none)

Instruction status: Write result
Wait until: ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes);
Result(Fi(FU)) ← 0; Busy(FU) ← No

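The Issue row of the control table maps naturally onto a small record per functional unit. The sketch below is one possible rendering, with invented class and function names: `result` plays the role of the register result status (register name to the unit that will write it), and Rj/Rk are stored as booleans.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class FunctionalUnit:
    """Functional unit status: the nine fields named on the slides."""
    name: str
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None  # units that will produce fj / fk
    qk: Optional[str] = None
    rj: bool = False          # is fj / fk ready to be read?
    rk: bool = False

def try_issue(fu: FunctionalUnit, result: Dict[str, str],
              op: str, dest: str, src1: str, src2: str) -> bool:
    """Apply the Issue row; return False on a structural or WAW hazard."""
    if fu.busy or result.get(dest) is not None:
        return False
    fu.busy, fu.op = True, op
    fu.fi, fu.fj, fu.fk = dest, src1, src2
    fu.qj, fu.qk = result.get(src1), result.get(src2)
    fu.rj, fu.rk = fu.qj is None, fu.qk is None
    result[dest] = fu.name
    return True

if __name__ == "__main__":
    # The Integer unit still has to write F2 (a pending load).
    result = {"F2": "Integer"}
    mult1 = FunctionalUnit("Mult1")
    print(try_issue(mult1, result, "Mult", "F0", "F2", "F4"))
    print(mult1)
```

Issuing MULTD F0,F2,F4 while a load of F2 is pending yields Qj = Integer, Rj = No, Rk = Yes, which matches the Mult1 row of the cycle 6 table in the example that follows.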
Scoreboard Example
This is the sample code we’ll be working with in the example:

LD F6, 34(R2)
LD F2, 45(R3)
MULT F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

What are the hazards in this code?


Latencies (clock cycles):
LD 1
MULT 10
SUBD 2
DIVD 40
ADDD 2

Scoreboard Example
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
FU
Scoreboard Example Cycle 1
Issue LD #1. Each entry in the instruction status table shows in which cycle the operation occurred.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Integer

Scoreboard Example Cycle 2
LD #2 can’t issue since the integer unit is busy. MULT can’t issue because we require in-order issue.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Integer

Scoreboard Example Cycle 3
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Integer

Scoreboard Example Cycle 4
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Integer

Scoreboard Example Cycle 5
Issue LD #2 since the integer unit is now free.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Integer

Scoreboard Example Cycle 6
Issue MULT.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6
MULTD F0 F2 F4 6
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 Integer

Scoreboard Example Cycle 7
MULT can’t read its operands (F2) because LD #2 hasn’t finished.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 Integer Add

Scoreboard Example Cycle 8a
DIVD issues. MULT and SUBD are both waiting for F2.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Integer Add Divide
Scoreboard Example Cycle 8b
LD #2 writes F2.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Add Divide
Scoreboard Example Cycle 9
Now MULT and SUBD can both read F2. How can both instructions do this at the same time??
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
10 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 Add Divide

Scoreboard Example Cycle 11
ADDD can’t issue because the add unit is busy.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
8 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 Add Divide

Scoreboard Example Cycle 12
SUBD finishes. DIVD is waiting for F0.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
7 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 Divide
Scoreboard Example Cycle 13
ADDD issues.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
6 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 Add Divide
Scoreboard Example Cycle 14
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
5 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 Add Divide
Scoreboard Example Cycle 15
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
4 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
1 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 Add Divide
Scoreboard Example Cycle 16
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
3 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU Mult1 Add Divide
Scoreboard Example Cycle 17
ADDD can’t write F6 because DIVD has not yet read it. WAR!
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
2 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
17 FU Mult1 Add Divide
Scoreboard Example Cycle 18
Nothing Happens!!
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
1 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
18 FU Mult1 Add Divide

Scoreboard Example Cycle 19
MULT completes execution.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
0 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
19 FU Mult1 Add Divide
Scoreboard Example Cycle 20
MULT writes.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU Add Divide
Scoreboard Example Cycle 21
DIVD reads its operands.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Add Divide

Scoreboard Example Cycle 22
Now ADDD can write since the WAR hazard is removed.
Instruction status
Instruction   j    k    Issue   Read operands   Execution complete   Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
40 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
22 FU Divide
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 61
DIVD completes execution.
Instruction status
Instruction       j    k    Issue  Read operands  Execution complete  Write result
LD     F6         34+  R2   1      2              3                   4
LD     F2         45+  R3   5      6              7                   8
MULTD  F0         F2   F4   6      9              19                  20
SUBD   F8         F6   F2   7      9              11                  12
DIVD   F10        F0   F6   8      21             61
ADDD   F6         F8   F2   13     14             16                  22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU Divide

Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 62
DONE!!
Instruction status
Instruction       j    k    Issue  Read operands  Execution complete  Write result
LD     F6         34+  R2   1      2              3                   4
LD     F2         45+  R3   5      6              7                   8
MULTD  F0         F2   F4   6      9              19                  20
SUBD   F8         F6   F2   7      9              11                  12
DIVD   F10        F0   F6   8      21             61                  62
ADDD   F6         F8   F2   13     14             16                  22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU
Using A Scoreboard
Dynamic Scheduling
Another Dynamic Algorithm:
Tomasulo Algorithm
• For IBM 360/91 about 3 years after CDC 6600 (1966)
• Goal: High Performance without special compilers
• Differences between IBM 360 & CDC 6600 ISA
– IBM has only 2 register specifiers / instruction vs. 3 in CDC 6600
– IBM has 4 FP registers vs. 8 in CDC 6600
• Why Study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …
Using A Scoreboard
Dynamic Scheduling
Tomasulo Algorithm vs. Scoreboard
• Control & buffers distributed with Functional Units (FUs) vs. centralized in scoreboard
– FU buffers are called “reservation stations”; they hold pending operands
• Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
– Avoids WAR, WAW hazards
– More reservation stations than registers, so can do optimizations compilers can’t
• Results go to FUs from RSs, not through registers, over the Common Data Bus that broadcasts results to all FUs
• Loads and Stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
Dynamic Scheduling Using A Scoreboard
Tomasulo Organization
[Figure: Tomasulo organization. FP Op Queue, FP Registers, Load Buffer, Store Buffer, FP Add reservation stations and FP Mul reservation stations, connected by the Common Data Bus.]
Using A Scoreboard
Dynamic Scheduling
Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Values of source operands
– Store buffers have V field, result to be stored
Qj, Qk: Reservation stations producing source registers (value to be written)
– Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready
– Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy

Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.
Using A Scoreboard
Dynamic Scheduling

Three Stages of Tomasulo Algorithm


1. Issue—get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instruction & sends operands (renames registers).
2. Execution—operate on operands (EX)
When both operands ready then execute;
if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast
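
The "come from" behaviour of the Common Data Bus can be sketched in a few lines of Python. This is a minimal illustration, not the 360/91 design: the class `ReservationStation`, the function `broadcast`, and the use of `None` (rather than 0) to mean "operand ready" are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    op: str = ""
    vj: Optional[float] = None  # value of first source operand, once known
    vk: Optional[float] = None  # value of second source operand, once known
    qj: Optional[str] = None    # station that will produce Vj (None = ready)
    qk: Optional[str] = None    # station that will produce Vk (None = ready)
    busy: bool = False

    def ready(self) -> bool:
        """Execution may start once both operands have arrived."""
        return self.busy and self.qj is None and self.qk is None


def broadcast(stations, register_status, registers, source, value):
    """Model a CDB write: the bus carries (source tag, value) to all listeners."""
    for rs in stations:
        if rs.qj == source:          # waiting on this producer for operand j
            rs.vj, rs.qj = value, None
        if rs.qk == source:          # waiting on this producer for operand k
            rs.vk, rs.qk = value, None
    for reg, producer in list(register_status.items()):
        if producer == source:       # register was waiting for this result
            registers[reg] = value
            del register_status[reg]
    for rs in stations:
        if rs.name == source:        # the producing station becomes available
            rs.busy = False


# MULTD F0,F2,F4 sits in Mult1 waiting for Load2 to deliver F2.
mult1 = ReservationStation("Mult1", op="MULTD", vk=2.0, qj="Load2", busy=True)
status = {"F2": "Load2", "F0": "Mult1"}
regs = {"F4": 2.0}
broadcast([mult1], status, regs, "Load2", 3.0)
print(mult1.ready(), mult1.vj, regs)
```

Every waiting unit matches on the producer's tag, so the value reaches consumers without going through the register file first.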

Using A Scoreboard
Dynamic Scheduling
Tomasulo Example Cycle 0
Instruction status
Instruction       j    k    Issue  Execution complete  Write result
LD     F6         34+  R2
LD     F2         45+  R3
MULTD  F0         F2   F4
SUBD   F8         F6   F2
DIVD   F10        F0   F6
ADDD   F6         F8   F2

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
0 FU

Using A Scoreboard
Dynamic Scheduling
Review: Tomasulo

• Prevents registers from becoming a bottleneck
• Avoids WAR, WAW hazards of Scoreboard
• Allows loop unrolling in HW
• Not limited to basic blocks (provided branch prediction)
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are PowerPC 604, 620; MIPS R10000; HP-PA 8000; Intel Pentium Pro
Dynamic Hardware Prediction
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction

Dynamic Branch Prediction is the ability of the hardware to make an educated guess about which way a branch will go: will the branch be taken or not.

The hardware can look for clues based on the instructions, or it can use past history. We will discuss both of these directions.
Dynamic Hardware Basic Branch Prediction:
Branch Prediction Buffers
Prediction
Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table (BHT): lower bits of PC address index a table of 1-bit values
– Says whether or not branch taken last time
• Problem: in a loop, 1-bit BHT will cause two mispredictions:
– End of loop case, when it exits instead of looping as before
– First time through loop on next time through code, when it predicts exit instead of looping

[Figure: a field of the branch address (labeled Bits 13 - 2) indexes a table of prediction entries numbered 0 to 1023.]
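
The two mispredictions per loop execution can be checked with a small simulation. This is a sketch under stated assumptions: the function name, the 10-bit index, the example PC, and the initial "not taken" table contents are all chosen for illustration.

```python
def count_mispredictions_1bit(outcomes, table_bits=10, pc=0x40):
    """Count mispredictions of a 1-bit branch history table for one branch."""
    table = [False] * (1 << table_bits)          # assumed initial state: not taken
    index = (pc >> 2) & ((1 << table_bits) - 1)  # low-order bits of the word address
    misses = 0
    for taken in outcomes:
        if table[index] != taken:                # prediction = what happened last time
            misses += 1
        table[index] = taken
    return misses


# A loop branch that is taken 9 times and then falls through,
# with the whole loop executed three times in a row.
loop = [True] * 9 + [False]
print(count_mispredictions_1bit(loop * 3))
```

Each execution of the loop costs one miss at the exit and one more on re-entry, because the single bit still remembers the exit.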
Dynamic Hardware Basic Branch Prediction:
Branch Prediction Buffers
Prediction
Dynamic Branch Prediction

• Solution: 2-bit scheme where the prediction changes only after two mispredictions (Figure 4.13, p. 264)
[Figure: state diagram with two Predict Taken states and two Predict Not Taken states; arcs are labeled T (taken) and NT (not taken).]
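
One common way to realize a 2-bit scheme is a saturating counter. The sketch below uses that variant as an assumption; its state transitions may differ in detail from the diagram in Figure 4.13, and the function name and starting state are illustrative.

```python
def count_mispredictions_2bit(outcomes, counter=0):
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Same loop branch as before: taken 9 times, then not taken, run three times.
loop = [True] * 9 + [False]
print(count_mispredictions_2bit(loop * 3, counter=3))
```

Once the counter is warmed up, only the loop exit is mispredicted, so the steady-state cost drops from two misses per loop execution to one.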
Dynamic Hardware Basic Branch Prediction:
Branch Prediction Buffers
Prediction

BHT Accuracy

• Mispredict because either:
– Wrong guess for that branch
– Got branch history of wrong branch when indexing the table
• With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• 4096 entries about as good as infinite table, but 4096 is a lot of HW
Dynamic Hardware Basic Branch Prediction:
Branch Prediction Buffers
Prediction

Correlating Branches
Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)
– Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction
[Figure: the branch address and a 2-bit global branch history select one of the 2-bit per-branch predictors, which supplies the prediction.]
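
A (2,2) correlating predictor along these lines can be sketched as follows. The class name, table size, initial counter value and indexing function are assumptions for illustration, and the 2-bit entries are modeled as saturating counters.

```python
class CorrelatingPredictor:
    """(m, 2) predictor sketch: m bits of global history pick one of 2**m
    two-bit counters inside the entry selected by the branch address."""

    def __init__(self, entries=1024, history_bits=2):
        self.entries = entries
        self.history_bits = history_bits
        self.history = 0  # outcomes of the most recent branches, newest in bit 0
        # Assumed initial state: every counter weakly predicts not taken.
        self.counters = [[1] * (1 << history_bits) for _ in range(entries)]

    def _entry(self, pc):
        return self.counters[(pc >> 2) % self.entries]

    def predict(self, pc):
        return self._entry(pc)[self.history] >= 2

    def update(self, pc, taken):
        entry = self._entry(pc)
        c = entry[self.history]
        # Update only the counter that made this prediction.
        entry[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


# A branch that strictly alternates taken / not taken.
predictor = CorrelatingPredictor()
misses = 0
for i in range(20):
    taken = (i % 2 == 0)
    if predictor.predict(0x100) != taken:
        misses += 1
    predictor.update(0x100, taken)
print(misses)
```

An alternating branch defeats a plain 2-bit counter, but here the global history distinguishes the two situations, so after a short warm-up each case gets its own counter.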
Dynamic Hardware Basic Branch Prediction:
Branch Prediction Buffers
Prediction
Accuracy of Different Schemes
(Figure 4.21, p. 272)
[Chart: frequency of mispredictions (0% to 18%) for nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott and li under three schemes: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2), i.e. 2 bits of history and 2 bits per entry.]
Dynamic Hardware Basic Branch Prediction:
Branch Target Buffers
Prediction
Branch Target Buffer
• Branch Target Buffer (BTB): Use address of branch as index to get prediction AND branch target address (if taken)
– Note: must check for branch match now, since can’t use wrong branch address (Figure 4.22, p. 273)
[Figure: each BTB entry supplies the predicted PC and the branch prediction (taken or not taken).]
• Return instruction addresses predicted with stack
Dynamic Hardware Basic Branch Prediction:
Branch Target Buffers
Prediction
Example
Instruction in buffer   Prediction   Actual branch   Penalty cycles
Yes                     Taken        Taken           0
Yes                     Taken        Not taken       2
No                                   Taken           2

Example on page 274.

Determine the total branch penalty for a BTB using the above penalties. Assume also the following:
• Prediction accuracy of 90% (so 10% incorrect predictions)
• Hit rate in the buffer of 90%
• 60% taken branch frequency.
Branch Penalty = Percent buffer hit rate × Percent incorrect predictions × 2
               + (1 - Percent buffer hit rate) × Taken branches × 2
Branch Penalty = (90% × 10% × 2) + (10% × 60% × 2)
Branch Penalty = 0.18 + 0.12 = 0.30 clock cycles
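
The same arithmetic as a small function, using the figures that appear in the calculation (90% buffer hit rate, 10% incorrect predictions, 60% taken branches, 2-cycle penalty). The function and parameter names are illustrative.

```python
def btb_branch_penalty(hit_rate, accuracy, taken_freq, penalty_cycles=2):
    """Average branch penalty in clock cycles for a branch target buffer."""
    # Branch found in the buffer but predicted incorrectly.
    mispredicted = hit_rate * (1 - accuracy) * penalty_cycles
    # Branch not in the buffer and it turns out to be taken.
    missed_taken = (1 - hit_rate) * taken_freq * penalty_cycles
    return mispredicted + missed_taken


print(round(btb_branch_penalty(hit_rate=0.90, accuracy=0.90, taken_freq=0.60), 2))
```

Varying `hit_rate` or `accuracy` shows which of the two terms dominates for a given buffer.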

Multiple Issue
4.4 Taking Advantage of More ILP with Multiple Issue

Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.

Flavor I:
Superscalar processors issue a varying number of instructions per clock (1 to 8) and can be either statically scheduled (by the compiler) or dynamically scheduled (by the hardware, e.g. Tomasulo).

Examples: IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
Multiple Issue

Issuing Multiple Instructions/Cycle


Flavor II:
VLIW - Very Long Instruction Word - issues a fixed number of instructions formatted either as one very large instruction or as a fixed packet of smaller instructions.

A fixed number of instructions (4-16) is scheduled by the compiler, which puts operators into wide templates
– Joint HP/Intel agreement in 1999/2000
– Intel Architecture-64 (IA-64) 64-bit address
– Style: “Explicitly Parallel Instruction Computer (EPIC)”
Multiple Issue

Issuing Multiple Instructions/Cycle


Flavor II - continued:
• 3 Instructions in 128 bit “groups”; field determines if instructions
dependent or independent
– Smaller code size than old VLIW, larger than x86/RISC
– Groups can be linked to show independence > 3 instr
• 64 integer registers + 64 floating point registers
– Not separate files per functional unit as in old VLIW
• Hardware checks dependencies
(interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags)
=> 40% fewer mis-predictions?
• IA-64 : name of instruction set architecture; EPIC is type
• Merced is name of first implementation (1999/2000?)

A SuperScalar Version of DLX
Multiple Issue
Issuing Multiple Instructions/Cycle
In our DLX example, we can handle 2 instructions/cycle:
• Floating Point
• Anything Else
– Fetch 64-bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair

Type              Pipe Stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB

• 1 cycle load delay causes delay to 3 instructions in Superscalar
– instruction in right half can’t use it, nor instructions in next slot
A SuperScalar Version of DLX
Multiple Issue
Unrolled Loop Minimizes Stalls for Scalar
Latencies:
LD to ADDD: 1 Cycle
ADDD to SD: 2 Cycles

1  Loop: LD   F0,0(R1)
2        LD   F6,-8(R1)
3        LD   F10,-16(R1)
4        LD   F14,-24(R1)
5        ADDD F4,F0,F2
6        ADDD F8,F6,F2
7        ADDD F12,F10,F2
8        ADDD F16,F14,F2
9        SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP
14       SD   8(R1),F16     ; 8-32 = -24

14 clock cycles, or 3.5 per iteration
A SuperScalar Version of DLX
Multiple Issue
Loop Unrolling in Superscalar
      Integer instruction   FP instruction     Clock cycle
Loop: LD   F0,0(R1)                            1
      LD   F6,-8(R1)                           2
      LD   F10,-16(R1)      ADDD F4,F0,F2      3
      LD   F14,-24(R1)      ADDD F8,F6,F2      4
      LD   F18,-32(R1)      ADDD F12,F10,F2    5
      SD   0(R1),F4         ADDD F16,F14,F2    6
      SD   -8(R1),F8        ADDD F20,F18,F2    7
      SD   -16(R1),F12                         8
      SD   -24(R1),F16                         9
      SUBI R1,R1,#40                           10
      BNEZ R1,LOOP                             11
      SD   8(R1),F20                           12
• Unrolled 5 times to avoid delays (+1 due to SS)
• 12 clocks, or 2.4 clocks per iteration
Multiple Instruction Issue &
Multiple Issue Dynamic Scheduling

Dynamic Scheduling in Superscalar

Code compiled for the scalar version will run poorly on Superscalar

May want code to vary depending on how Superscalar the machine is

Simple approach: separate Tomasulo control and separate reservation stations for Integer FU/Reg and for FP FU/Reg
Multiple Instruction Issue &
Multiple Issue Dynamic Scheduling

Dynamic Scheduling in Superscalar


• How to do instruction issue with two instructions and keep in-order
instruction issue for Tomasulo?
– Issue 2X Clock Rate, so that issue remains in order
– Only FP loads might cause dependency between integer and FP
issue:
• Replace load reservation station with a load queue;
operands must be read in the order they are fetched
• Load checks addresses in Store Queue to avoid RAW violation
• Store checks addresses in Load Queue to avoid WAR,WAW

Multiple Instruction Issue &
Multiple Issue Dynamic Scheduling

Performance of Dynamic Superscalar


(clock-cycle numbers)
Iteration no.  Instruction       Issues  Executes  Writes result
1              LD   F0,0(R1)     1       2         4
1              ADDD F4,F0,F2     1       5         8
1              SD   0(R1),F4     2       9
1              SUBI R1,R1,#8     3       4         5
1              BNEZ R1,LOOP      4       5
2              LD   F0,0(R1)     5       6         8
2              ADDD F4,F0,F2     5       9         12
2              SD   0(R1),F4     6       13
2              SUBI R1,R1,#8     7       8         9
2              BNEZ R1,LOOP      8       9
4 clocks per iteration
Branches, Decrements still take 1 clock cycle
VLIW
Multiple Issue
Loop Unrolling in VLIW
Memory reference 1  Memory reference 2  FP operation 1    FP operation 2    Int. op/branch   Clock
LD F0,0(R1)         LD F6,-8(R1)                                                             1
LD F10,-16(R1)      LD F14,-24(R1)                                                           2
LD F18,-32(R1)      LD F22,-40(R1)      ADDD F4,F0,F2     ADDD F8,F6,F2                      3
LD F26,-48(R1)                          ADDD F12,F10,F2   ADDD F16,F14,F2                    4
                                        ADDD F20,F18,F2   ADDD F24,F22,F2                    5
SD 0(R1),F4         SD -8(R1),F8        ADDD F28,F26,F2                                      6
SD -16(R1),F12      SD -24(R1),F16                                                           7
SD -32(R1),F20      SD -40(R1),F24                                          SUBI R1,R1,#48   8
SD -0(R1),F28                                                               BNEZ R1,LOOP     9

• Unrolled 7 times to avoid delays


• 7 results in 9 clocks, or 1.3 clocks per iteration
• Need more registers to effectively use VLIW
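
The per-iteration costs quoted for the three unrolled schedules (scalar, 2-issue superscalar, VLIW) follow from dividing the schedule length by the unroll factor. A small sketch, with the dictionary keys and function name chosen for illustration:

```python
def clocks_per_iteration(total_clocks, iterations_unrolled):
    """Clock cycles spent per original loop iteration."""
    return total_clocks / iterations_unrolled


# (schedule length in clocks, number of unrolled iterations), as given on the slides
schedules = {
    "scalar, unrolled 4 times": (14, 4),
    "2-issue superscalar, unrolled 5 times": (12, 5),
    "VLIW, unrolled 7 times": (9, 7),
}

for name, (clocks, iterations) in schedules.items():
    print(f"{name}: {clocks_per_iteration(clocks, iterations):.2f} clocks per iteration")
```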

Limitations With Multiple Issue
Multiple Issue
Limits to Multi-Issue Machines
• Inherent limitations of ILP
– 1 branch in 5 instructions => how to keep a 5-way VLIW busy?
– Latencies of units => many operations must be scheduled
– Need about Pipeline Depth x No. Functional Units of independent
operations to keep machines busy.

• Difficulties in building HW
– Duplicate Functional Units to get parallel execution
– Increase ports to Register File (VLIW example needs 6 read and 3 write for Int. Reg. & 6 read and 4 write for FP Reg.)
– Increase ports to memory
– Decoding SS and impact on clock rate, pipeline depth

Limitations With Multiple Issue
Multiple Issue

Limits to Multi-Issue Machines

• Limitations specific to either SS or VLIW implementation


– Decode issue in SS
– VLIW code size: unroll loops + wasted fields in VLIW
– VLIW lock step => 1 hazard & all instructions stall
– VLIW & binary compatibility

Limitations With Multiple Issue
Multiple Issue
Multiple Issue Challenges
• While Integer/FP split is simple for the HW, get CPI of 0.5 only for
programs with:
– Exactly 50% FP operations
– No hazards
• If more instructions issue at same time, greater difficulty of decode and
issue
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2
instructions can issue
• VLIW: tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word are
independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
• 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches

Compiler Support For ILP
4.5 Compiler Support for Exploiting ILP

How can compilers be smart?
1. Produce good scheduling of code.
2. Determine which loops might contain parallelism.
3. Eliminate name dependencies.

Compilers must be REALLY smart to figure out aliases -- pointers in C are a real problem.

Techniques lead to:
Symbolic Loop Unrolling
Critical Path Scheduling
Compiler Support For ILP Symbolic Loop Unrolling

Software Pipelining
• Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW)
[Figure: a software-pipelined iteration cuts across Iteration 0 through Iteration 4 of the original loop.]
Compiler Support For ILP Symbolic Loop Unrolling

SW Pipelining Example
Before: Unrolled 3 times
1   LD   F0,0(R1)
2   ADDD F4,F0,F2
3   SD   0(R1),F4
4   LD   F6,-8(R1)
5   ADDD F8,F6,F2
6   SD   -8(R1),F8
7   LD   F10,-16(R1)
8   ADDD F12,F10,F2
9   SD   -16(R1),F12
10  SUBI R1,R1,#24
11  BNEZ R1,LOOP

After: Software Pipelined
    LD   F0,0(R1)
    ADDD F4,F0,F2
    LD   F0,-8(R1)
1   SD   0(R1),F4      ; Stores M[i]
2   ADDD F4,F0,F2      ; Adds to M[i-1]
3   LD   F0,-16(R1)    ; Loads M[i-2]
4   SUBI R1,R1,#8
5   BNEZ R1,LOOP
    SD   0(R1),F4
    ADDD F4,F0,F2
    SD   -8(R1),F4

Pipeline overlap within the software-pipelined loop body:
SD    IF  ID  EX  Mem WB            (reads F4)
ADDD      IF  ID  EX  Mem WB        (reads F0, writes F4)
LD            IF  ID  EX  Mem WB    (writes F0)
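
The same transformation can be mimicked in Python to check that the software-pipelined loop computes the same result as the simple one. This is only an illustration of the reordering: the function names are made up, the array is walked upward rather than downward as in the DLX code, and short arrays fall back to the simple loop.

```python
def add_scalar_simple(x, s):
    """Original loop: load, add, store within each iteration."""
    out = list(x)
    for i in range(len(out)):
        v = out[i]       # LD
        v = v + s        # ADDD
        out[i] = v       # SD
    return out


def add_scalar_pipelined(x, s):
    """Software-pipelined loop: each pass stores element i, adds for i+1, loads i+2."""
    out = list(x)
    n = len(out)
    if n < 3:
        return add_scalar_simple(x, s)
    # Start-up code: fill the software pipeline.
    loaded = out[0]           # LD for element 0
    summed = loaded + s       # ADDD for element 0
    loaded = out[1]           # LD for element 1
    # Steady state: one store, one add, one load from three different iterations.
    for i in range(n - 2):
        out[i] = summed       # SD for element i
        summed = loaded + s   # ADDD for element i+1
        loaded = out[i + 2]   # LD for element i+2
    # Clean-up code: drain the pipeline.
    out[n - 2] = summed
    summed = loaded + s
    out[n - 1] = summed
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(add_scalar_pipelined(data, 10.0) == add_scalar_simple(data, 10.0))
```

In the steady-state body no operation depends on a value produced in the same pass, which is what lets the hardware overlap them.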
Compiler Support For ILP Symbolic Loop Unrolling

SW Pipelining Example
Symbolic Loop Unrolling
– Less code space
– Overhead paid only once vs. each iteration in loop unrolling
[Figure: comparison of Software Pipelining and Loop Unrolling.]
100 iterations = 25 loops with 4 unrolled iterations each
Compiler Support For ILP Critical Path Scheduling

Trace Scheduling
• Parallelism across IF branches vs. LOOP branches
• Two steps:
– Trace Selection
• Find likely sequence of basic blocks (trace)
of (statically predicted or profile predicted)
long sequence of straight-line code
– Trace Compaction
• Squeeze trace into few VLIW instructions
• Need bookkeeping code in case prediction is wrong
• Compiler undoes bad guess
(discards values in registers)
• Subtle compiler bugs mean wrong answer
vs. poorer performance; no hardware interlocks

Hardware Support For Parallelism
4.6 Hardware Support for Extracting more Parallelism

Software support of ILP is best when code is predictable at compile time. But what if there’s no predictability?

Here we’ll talk about hardware techniques. These include:
• Conditional or Predicated Instructions
• Hardware Speculation
Hardware Support For Nullified Instructions
Parallelism
Tell the Hardware To Ignore An Instruction
• Avoid branch prediction by turning branches into conditionally executed instructions:
IF (x) then A = B op C else NOP
– If false, then neither store result nor cause exception
– Expanded ISA of Alpha, MIPS, PowerPC, SPARC, x have conditional move. PA-RISC can annul any following instruction.
– IA-64: 64 1-bit condition fields selected, so conditional execution of any instruction
• Drawbacks to conditional instructions:
– Still takes a clock, even if “annulled”
– Stalls if condition evaluated late
– Complex conditions reduce effectiveness; condition becomes known late in pipeline.
This can be a major win because there is no time lost by taking a branch!!
Hardware Support For Nullified Instructions
Parallelism
Tell the Hardware To Ignore An Instruction
Suppose we have the code:
if ( VarA == 0 )
    VarS = VarT;

Previous Method:
        LD     R1, VarA
        BNEZ   R1, Label
        LD     R2, VarT
        SD     VarS, R2
Label:

Nullified Method (Compare and Nullify Next Instr. If Not Zero):
        LD     R1, VarA
        LD     R2, VarT
        CMPNNZ R1, #0
        SD     VarS, R2
Label:

Nullified Method (Compare and Move IF Zero):
        LD     R1, VarA
        LD     R2, VarT
        CMOVZ  VarS,R2, R1
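
The semantics of the conditional-move version can be modeled in Python to confirm it matches the branching version. The function names are illustrative, and `cmovz` is only a model of the instruction's effect: Python itself still evaluates a conditional expression, so this shows equivalence of results, not the removal of a branch.

```python
def branch_version(var_a, var_s, var_t):
    """if (VarA == 0) VarS = VarT; written with an explicit branch."""
    if var_a == 0:
        var_s = var_t
    return var_s


def cmovz(dest, src, cond):
    """Model of 'conditional move if zero': new value of dest."""
    return src if cond == 0 else dest


def conditional_move_version(var_a, var_s, var_t):
    r1 = var_a                  # LD R1, VarA
    r2 = var_t                  # LD R2, VarT (always executed)
    return cmovz(var_s, r2, r1)  # CMOVZ VarS, R2, R1


for a in (0, 1, -3):
    print(a, branch_version(a, 10, 20), conditional_move_version(a, 10, 20))
```

Note that the load of VarT now happens on both paths, which is the price paid for removing the branch.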
Hardware Support For Compiler Speculation
Parallelism
Increasing Parallelism
The theory here is to move an instruction across a branch so as to increase the size of a basic block and thus to increase parallelism.

Primary difficulty is in avoiding exceptions. For example,
if ( a != 0 ) c = b/a;
may have a divide by zero error in some cases if the divide is moved above the test.

Methods for increasing speculation include:

1. Use a set of status bits (poison bits) associated with the registers. They signal that the instruction results are invalid until some later time.
2. Result of instruction isn’t written until it’s certain the instruction is no longer speculative.
Hardware Support For Compiler Speculation
Parallelism
Increasing Parallelism
Example on Page 305.
Code for:
if ( A == 0 )
    A = B;
else
    A = A + 4;
Assume A is at 0(R3) and B is at 0(R2)

Original Code:
     LW   R1, 0(R3)      ; Load A
     BNEZ R1, L1         ; Test A
     LW   R1, 0(R2)      ; If Clause
     J    L2             ; Skip Else
L1:  ADDI R1, R1, #4     ; Else Clause
L2:  SW   0(R3), R1      ; Store A

Speculated Code:
     LW   R1, 0(R3)      ; Load A
     LW   R14, 0(R2)     ; Spec Load B
     BEQZ R1, L3         ; Other if Branch
     ADDI R14, R1, #4    ; Else Clause
L3:  SW   0(R3), R14     ; Non-Spec Store

Note here that only ONE side needs to take a branch!!
Hardware Support For Compiler Speculation
Parallelism

Poison Bits
In the example on the last page, if the LW* produces an exception, a poison bit is set on that register. Then, if a later instruction tries to use the register, an exception is THEN raised.

Speculated Code:
     LW   R1, 0(R3)      ; Load A
     LW*  R14, 0(R2)     ; Spec Load B
     BEQZ R1, L3         ; Other if Branch
     ADDI R14, R1, #4    ; Else Clause
L3:  SW   0(R3), R14     ; Non-Spec Store
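
The deferred-exception behaviour can be sketched with a toy register file. Everything here is an assumption made for illustration: the class and method names, modelling a faulting address as a key missing from a `memory` dictionary, and clearing the poison bit when a normal instruction overwrites the register.

```python
class PoisonedRegisterFile:
    """Toy register file where each register carries a poison bit."""

    def __init__(self):
        self.values = {}
        self.poison = set()   # names of registers whose poison bit is set

    def speculative_load(self, reg, memory, address):
        """LW*: a faulting load sets the poison bit instead of raising."""
        if address in memory:
            self.values[reg] = memory[address]
            self.poison.discard(reg)
        else:
            self.values[reg] = 0      # placeholder value, never legitimately read
            self.poison.add(reg)

    def write(self, reg, value):
        """Normal (non-speculative) write; assumed to clear the poison bit."""
        self.values[reg] = value
        self.poison.discard(reg)

    def read(self, reg):
        """Normal use of a register: the deferred exception surfaces here."""
        if reg in self.poison:
            raise RuntimeError(f"deferred exception: {reg} is poisoned")
        return self.values[reg]


rf = PoisonedRegisterFile()
rf.speculative_load("R14", memory={}, address=0x2000)  # faults silently
rf.write("R14", 5)         # else clause overwrites R14, so the fault never matters
print(rf.read("R14"))
```

If the else clause is skipped instead, the store at L3 reads the poisoned R14 and the exception is raised at that point, which is where the original program would have faulted.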
Hardware Support For Hardware Speculation
Parallelism
HW support for More ILP
• Need HW buffer for results of uncommitted instructions: reorder buffer
– Reorder buffer can be operand source
– Once operand commits, result is found in register
– 3 fields: instr. type, destination, value
– Use reorder buffer number instead of reservation station
– Discard instructions on mispredicted branches or on exceptions
[Figure 4.34, page 311: FP Op Queue, Reorder Buffer, FP Regs, reservation stations and FP adders.]
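
In-order commit with a reorder buffer can be sketched as below. The three fields named on the slide (instruction type, destination, value) are kept; the `done` and `mispredicted` flags, the class names and the method names are assumptions added so the example can run.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RobEntry:
    instr_type: str                 # e.g. "alu", "load", "branch"
    destination: Optional[str]      # register to update at commit, if any
    value: Any = None
    done: bool = False              # assumed flag: result has been produced
    mispredicted: bool = False      # assumed flag: only meaningful for branches


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()      # program order, oldest at the left

    def issue(self, instr_type, destination=None):
        entry = RobEntry(instr_type, destination)
        self.entries.append(entry)
        return entry

    def commit(self, registers):
        """Retire finished instructions in program order; return how many retired."""
        committed = 0
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            committed += 1
            if head.instr_type == "branch" and head.mispredicted:
                self.entries.clear()    # discard everything issued after the branch
                break
            if head.destination is not None:
                registers[head.destination] = head.value
        return committed


rob, regs = ReorderBuffer(), {}
add = rob.issue("alu", "F0")
br = rob.issue("branch")
spec = rob.issue("alu", "F2")          # issued speculatively after the branch
spec.value, spec.done = 1.0, True      # finishes early, but cannot commit yet
add.value, add.done = 2.0, True
br.done, br.mispredicted = True, True
print(rob.commit(regs), regs)
```

The speculative result for F2 never reaches the register file, because the mispredicted branch ahead of it flushes the buffer at commit time.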
Hardware Support For Hardware Speculation
Parallelism
HW support for More ILP
How is this used in practice?

Rather than predicting the direction of a branch, execute the instructions on both sides!!

We know the target of a branch early on, long before we know if it will be taken or not.

So begin fetching/executing at that new Target PC.

But also continue fetching/executing as if the branch is NOT taken.


Studies of ILP
4.7 Studies of ILP

• Conflicting studies of amount of improvement available
– Benchmarks (vectorized FP Fortran vs. integer C programs)
– Hardware sophistication
– Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep on processor performance curve?


Studies of ILP
Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming–infinite virtual registers and all WAW & WAR
hazards are avoided
2. Branch prediction–perfect; no mispredictions
3. Jump prediction–all jumps perfectly predicted => machine with
perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis–addresses are known & a store can
be moved before a load provided addresses not equal
1 cycle latency for all instructions; unlimited number of instructions
issued per clock cycle



Studies of ILP
Upper Limit to ILP: Ideal Machine
(Figure 4.38, page 319)
This is the amount of parallelism when there are no branch mis-predictions and we’re limited only by data dependencies.

Instructions that could theoretically be issued per cycle (IPC):
Program    IPC
gcc        54.8
espresso   62.6
li         17.9
fpppp      75.2
doduc      118.7
tomcatv    150.1

Integer: 18 - 60
FP: 75 - 150
Studies of ILP
Impact of Realistic Branch Prediction

What parallelism do we get when we don’t allow perfect branch prediction, as in the last picture, but assume some realistic model?
Possibilities include:

1. Perfect - all branches are perfectly predicted (the last slide)
2. Selective History Predictor - a complicated but do-able mechanism for selection.
3. Standard 2-bit history predictor with 512 2-bit entries.
4. Static prediction based on past history of the program.
5. None - Parallelism is limited to basic block.


Studies of ILP
Bonus!! Selective History Predictor
[Figure: the branch address indexes a non-correlating predictor (labeled 8096 x 2 bits) and, together with 2 bits of global history, a correlating predictor (2048 x 4 x 2 bits); an 8K x 2 bit selector chooses between the correlator and the non-correlator to give the taken/not taken prediction.]
Studies of ILP
Impact of Realistic Branch Prediction
Limiting the type of branch prediction. (Figure 4.42, Page 325)
[Chart: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doduc and tomcatv under five branch prediction schemes: Perfect, Selective History predictor, Standard 2-bit BHT (512 entries), Static (profile-based), and No prediction.]
Integer: 6 - 12
FP: 15 - 45
Studies of ILP
More Realistic HW: Register Impact
Effect of limiting the number of renaming registers. (Figure 4.44, Page 328)
[Chart: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doduc and tomcatv with Infinite, 256, 128, 64, 32 and no renaming registers.]
Integer: 5 - 15
FP: 11 - 45
Studies of ILP
More Realistic HW: Alias Impact
What happens when there may be conflicts with memory aliasing? (Figure 4.46, Page 330)
[Chart: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doduc and tomcatv under four alias analysis models: Perfect; Global/stack perfect with heap conflicts; Inspection; None.]
Integer: 4 - 9
FP: 4 - 45 (Fortran, no heap)
Summary

4.1 Instruction Level Parallelism: Concepts and Challenges


4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP
