ILP Techniques: Laxmi N. Bhuyan CS 162 Spring 2003


Lecture 6: ILP Techniques

Laxmi N. Bhuyan CS 162 Spring 2003

DAP Spr.98 UCB 1

HW Schemes: Instruction Parallelism


Why in HW at run time?
- Works when real dependences can't be known at compile time
- Compiler is simpler
- Code for one machine runs well on another

Key idea: Allow instructions behind stall to proceed


    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14

- Enables out-of-order execution => out-of-order completion
- ID stage checks for hazards. If there are no hazards, the instruction is issued for execution.
- Scoreboard dates to CDC 6600 in 1963
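The three-instruction example can be checked mechanically. The following is a minimal sketch, assuming a simple (op, dest, src1, src2) encoding; the `Instr` type and `raw_dependent` helper are illustrative names, not part of the lecture. It shows that ADDD must wait for DIVD (it reads F0), while SUBD reads nothing that the two earlier instructions write and so may proceed past the stall.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


def raw_dependent(later: Instr, earlier: Instr) -> bool:
    """True if `later` reads a register that `earlier` writes (RAW)."""
    return earlier.dest in (later.src1, later.src2)


divd = Instr("DIVD", "F0", "F2", "F4")
addd = Instr("ADDD", "F10", "F0", "F8")
subd = Instr("SUBD", "F12", "F8", "F14")

# ADDD reads F0, which DIVD produces, so it stalls behind the divide.
addd_waits = raw_dependent(addd, divd)
# SUBD reads F8 and F14, which neither earlier instruction writes.
subd_waits = raw_dependent(subd, divd) or raw_dependent(subd, addd)
```

Here `addd_waits` is true and `subd_waits` is false, which is the situation the key idea exploits.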


How ILP Works


- Issuing multiple instructions per cycle requires fetching multiple instructions from memory per cycle. The number issued per cycle is called the superscalar degree or issue width.
- To find independent instructions, we need a big pool of instructions to choose from, called the instruction buffer (IB).
- As the IB length increases, the complexity of the decoder (control) increases, which increases the datapath cycle time.
- Instructions are prefetched sequentially by an instruction fetch unit (IFU) that operates independently of datapath control.
- The IFU fetches the instruction at (PC)+L, where L is the IB size, or as directed by the branch predictor. (See Fig. 6-1, Pentium diagram.)

Pentium Datapath
- Pentium consists of two pipes (U-pipe and V-pipe) operating in parallel. The U-pipe contains an 8-stage FP pipeline (see Pentium figure).
- Two stages of decode:
  - first stage: decode and control
  - second stage: register read
- See I-cache and D-cache in Fig. 6-1. What is a TLB? How does virtual memory work?


HW Schemes: Instruction Parallelism


Two types: Scoreboard and Tomasulo

Scoreboard (EX: PENTIUM):
- Out-of-order execution divides the ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands
- A scoreboard allows an instruction to execute whenever it has no structural hazard and is not waiting on prior instructions.
- Instructions are issued in order, but can bypass waiting instructions in the read-operands stage => in-order issue, out-of-order execution => out-of-order completion
- Named after the CDC 6600 scoreboard, which developed this capability

Scoreboard Implications
- Scoreboard replaces ID, EX, WB with 4 stages
- Out-of-order completion => WAR, WAW hazards?
- Solution for WAR: wait at the WB stage until the earlier instruction has read its operands
- Solution for WAW: detect the hazard at the ID stage and stall until the other instruction completes
- Need multiple instructions in the execution phase => multiple execution units or pipelined execution units
- Scoreboard keeps track of dependencies and the state of operations


Four Stages of Scoreboard Control


1. Issue: decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.

2. Read operands: wait until no data hazards, then read operands (ID2)


A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. If the source operands are available for an instruction, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

Four Stages of Scoreboard Control


3. Execution: operate on operands (EX)
The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.

4. Write result: finish execution (WB)


Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If there are none, it writes the result. If there is a WAR hazard, it stalls the instruction.

Example:
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F8,F8,F14
The CDC 6600 scoreboard would stall SUBD until ADDD reads its operands.
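To see why SUBD is the instruction that stalls, the hazards between each pair of instructions in the example can be classified. This is a minimal sketch; the tuple encoding (op, dest, src1, src2) and the `hazards` helper are assumptions made for illustration.

```python
def hazards(earlier, later):
    """Classify the hazards `later` has with respect to `earlier`.

    Each instruction is a tuple (op, dest, src1, src2).
    """
    _, e_dest, *e_srcs = earlier
    _, l_dest, *l_srcs = later
    found = set()
    if e_dest in l_srcs:
        found.add("RAW")  # later reads what earlier writes
    if l_dest in e_srcs:
        found.add("WAR")  # later overwrites what earlier still has to read
    if l_dest == e_dest:
        found.add("WAW")  # both write the same register
    return found


divd = ("DIVD", "F0", "F2", "F4")
addd = ("ADDD", "F10", "F0", "F8")
subd = ("SUBD", "F8", "F8", "F14")

divd_addd = hazards(divd, addd)  # RAW on F0
addd_subd = hazards(addd, subd)  # WAR on F8
divd_subd = hazards(divd, subd)  # no hazard
```

ADDD is held in read operands by the RAW hazard on F0, so it has not yet read F8; SUBD's write to F8 must therefore wait in the write-result stage.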


Design of the Scoreboard


1. Instruction status: which of the 4 steps the instruction is in
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
   - Busy: indicates whether the unit is busy or not
   - Op: operation to perform in the unit (e.g., + or -)
   - Fi: destination register
   - Fj, Fk: source-register numbers
   - Qj, Qk: functional units producing source registers Fj, Fk
   - Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register
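The three parts of the scoreboard map naturally onto simple records. The sketch below is one possible layout, not the CDC 6600's actual organization: the functional unit names, the even-numbered FP register names, and the use of `None` for "blank" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass
class FUStatus:
    """The nine per-unit fields listed above."""
    busy: bool = False
    op: Optional[str] = None   # operation to perform
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # unit producing Fj (None = no producer pending)
    qk: Optional[str] = None   # unit producing Fk
    rj: bool = False           # Fj ready and not yet read
    rk: bool = False           # Fk ready and not yet read


# 1. Instruction status: instruction text -> current stage name.
instruction_status: Dict[str, str] = {}

# 2. Functional unit status (unit names are hypothetical).
fu_status: Dict[str, FUStatus] = {
    name: FUStatus() for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")
}

# 3. Register result status: which unit will write each register, if any.
register_result: Dict[str, Optional[str]] = {
    f"F{i}": None for i in range(0, 32, 2)
}

NUM_FIELDS = len(fields(FUStatus))
```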


Detailed Scoreboard Pipeline Control


Instruction status: Issue
  Wait until: not Busy(FU) and not Result(D)
  Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU

Instruction status: Read operands
  Wait until: Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No

Instruction status: Execution complete
  Wait until: Functional unit done
  Bookkeeping: (none)

Instruction status: Write result
  Wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
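The wait conditions and bookkeeping steps can be written out as small functions. This is a rough sketch under several assumptions: only two hypothetical units ("Divide" and "Add") exist, `None` stands for a blank entry, execution latency is ignored, and clearing Qj/Qk at write-back is a simplification added here rather than something the table states.

```python
def new_fu():
    """Fresh functional-unit status entry with the nine fields."""
    return dict(busy=False, op=None, fi=None, fj=None, fk=None,
                qj=None, qk=None, rj=False, rk=False)


fus = {"Divide": new_fu(), "Add": new_fu()}  # hypothetical unit names
result = {}  # register result status: register -> unit that will write it


def can_issue(fu, d):
    # Wait until: not Busy(FU) and not Result(D)
    return not fus[fu]["busy"] and result.get(d) is None


def issue(fu, op, d, s1, s2):
    qj, qk = result.get(s1), result.get(s2)
    fus[fu].update(busy=True, op=op, fi=d, fj=s1, fk=s2,
                   qj=qj, qk=qk, rj=qj is None, rk=qk is None)
    result[d] = fu


def can_read(fu):
    # Wait until: Rj and Rk
    return fus[fu]["rj"] and fus[fu]["rk"]


def read_operands(fu):
    fus[fu]["rj"] = fus[fu]["rk"] = False


def can_write(fu):
    # WAR check: no unit still has to read the old value of Fi(FU).
    fi = fus[fu]["fi"]
    return all((f["fj"] != fi or not f["rj"]) and
               (f["fk"] != fi or not f["rk"]) for f in fus.values())


def write_result(fu):
    for f in fus.values():
        if f["qj"] == fu:
            f["qj"], f["rj"] = None, True  # clearing Qj is an assumption
        if f["qk"] == fu:
            f["qk"], f["rk"] = None, True
    result[fus[fu]["fi"]] = None
    fus[fu] = new_fu()  # Busy(FU) <- No, other fields blanked


issue("Divide", "DIVD", "F0", "F2", "F4")
issue("Add", "ADDD", "F10", "F0", "F8")
add_ready_before = can_read("Add")     # F0 is still being produced
read_operands("Divide")
divide_may_write = can_write("Divide")
write_result("Divide")
add_ready_after = can_read("Add")      # F0 has now been written
waw_blocked = not can_issue("Divide", "F10")  # Add still owns F10
```

The last line shows the WAW rule at issue: even though the divide unit is free again, an instruction writing F10 cannot issue while ADDD is still pending on that register.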

CDC 6600 Scoreboard


- Speedup 1.7 from compiler; 2.5 by hand, BUT slow memory (no cache) limits benefit
- Limitations of 6600 scoreboard:
  - No forwarding hardware
  - Limited to instructions in basic block (small window)
  - Small number of functional units (structural hazards), especially integer/load store units
  - Do not issue on structural hazards
  - Wait for WAR hazards
  - Prevent WAW hazards


Summary
- Instruction Level Parallelism (ILP) in SW or HW
- Loop level parallelism is easiest to see
- SW parallelism: dependencies are defined for the program; they become hazards if HW cannot resolve them
- SW dependencies/compiler sophistication determine if compiler can unroll loops
  - Memory dependencies hardest to determine
- HW exploiting ILP
  - Works when dependence can't be known at compile time
  - Code for one machine runs well on another
- Key idea of Scoreboard: allow instructions behind stall to proceed (Decode => Issue instr & read operands)
  - Enables out-of-order execution => out-of-order completion
  - ID stage checks for both structural and data hazards

Tomasulo Algorithm
(Implemented in IBM 360/91 in 1966)
- Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard
  - FU buffers are called reservation stations; they hold pending operands
- Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming
  - Avoids WAR, WAW hazards
  - More reservation stations than registers, so it can do optimizations compilers can't
- Results go to FUs from RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
- Loads and stores are treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
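How renaming removes a WAR hazard can be shown with a small sketch. It assumes each issued instruction gets a fresh tag (written RS1, RS2, ... purely for illustration) and that a source with no pending producer is read from the register file at issue; the `rename` helper is hypothetical. The program is the earlier WAR example, where SUBD writes F8 while ADDD still needs the old F8.

```python
def rename(program):
    """Replace destinations by fresh tags and sources by pending tags."""
    table = {}  # architectural register -> tag of its latest pending producer
    renamed = []
    for n, (op, dest, s1, s2) in enumerate(program, start=1):
        # A source with a pending producer becomes a pointer to that producer;
        # otherwise it keeps its register name (value copied at issue).
        srcs = tuple(table.get(s, s) for s in (s1, s2))
        tag = f"RS{n}"
        table[dest] = tag
        renamed.append((op, tag) + srcs)
    return renamed


program = [
    ("DIVD", "F0", "F2", "F4"),
    ("ADDD", "F10", "F0", "F8"),
    ("SUBD", "F8", "F8", "F14"),
]
renamed = rename(program)
```

After renaming, ADDD waits on tag RS1 (the true RAW dependence survives) and has already captured the old F8, while SUBD's result goes to RS3. SUBD no longer has to wait for ADDD, so the WAR stall disappears.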

Tomasulo Organization
[Figure: Tomasulo organization, showing the FP Op Queue, FP Registers, Load Buffer, Store Buffer, FP Add Res. Station, FP Mul Res. Station, and the Common Data Bus]



Reservation Station Components


- Op: operation to perform in the unit (e.g., + or -)
- Vj, Vk: values of the source operands
  - Store buffers have a V field: the result to be stored
- Qj, Qk: reservation stations producing the source registers (value to be written)
  - Note: no ready flags as in Scoreboard; Qj, Qk = 0 => ready
  - Store buffers only have Qi, for the RS producing the result
- Busy: indicates reservation station or FU is busy

Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
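A reservation station entry with these fields might be sketched as follows. The class name, the use of integer tags, and the `ready` property are illustrative assumptions; the only rule taken from the slide is that Qj = Qk = 0 means both operands are present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # value of first operand, once available
    vk: Optional[float] = None  # value of second operand, once available
    qj: int = 0                 # tag of station producing Vj; 0 = Vj is valid
    qk: int = 0                 # tag of station producing Vk; 0 = Vk is valid

    @property
    def ready(self) -> bool:
        """No separate ready flags: both Q fields at 0 means ready."""
        return self.busy and self.qj == 0 and self.qk == 0


# Still waiting for station 3 to produce its first operand.
waiting = ReservationStation(busy=True, op="ADDD", vk=2.5, qj=3)
# Both operand values already captured.
runnable = ReservationStation(busy=True, op="SUBD", vj=1.0, vk=4.0)
```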

Three Stages of Tomasulo Algorithm


1. Issue: get instruction from FP Op Queue
If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).

2. Execution: operate on operands (EX)


When both operands ready then execute; if not ready, watch Common Data Bus for result

3. Write result: finish execution (WB)


Write on Common Data Bus to all awaiting units; mark reservation station available

- Normal data bus: data + destination ("go to" bus)
- Common data bus: data + source ("come from" bus)
  - 64 bits of data + 4 bits of Functional Unit source address
  - Write if it matches the expected Functional Unit (the one that produces the result)
  - Does the broadcast
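The "come from" behaviour of the Common Data Bus can be sketched as a broadcast of (source tag, value) that every waiting consumer compares against the tag it expects. The `RS` class, the integer tags, and the dictionary layout below are assumptions made for this example only.

```python
class RS:
    """Minimal reservation station: operand values plus producer tags."""

    def __init__(self, op, vj=None, vk=None, qj=0, qk=0):
        self.op, self.vj, self.vk, self.qj, self.qk = op, vj, vk, qj, qk

    def ready(self):
        return self.qj == 0 and self.qk == 0


def broadcast(stations, register_status, registers, source_tag, value):
    """Deliver (source_tag, value) to everything waiting on that tag."""
    for rs in stations.values():
        if rs.qj == source_tag:
            rs.vj, rs.qj = value, 0
        if rs.qk == source_tag:
            rs.vk, rs.qk = value, 0
    for reg, tag in register_status.items():
        if tag == source_tag:
            registers[reg] = value
            register_status[reg] = 0  # no pending writer any more


stations = {
    1: RS("DIVD", vj=6.0, vk=2.0),
    2: RS("ADDD", vk=1.0, qj=1),  # waits for station 1
}
register_status = {"F0": 1, "F10": 2}  # tag of the pending writer, 0 = none
registers = {"F0": 0.0, "F10": 0.0}

ready_before = stations[2].ready()
broadcast(stations, register_status, registers, source_tag=1, value=3.0)
ready_after = stations[2].ready()
```

The producer never names its consumers; the waiting ADDD station and register F0 both pick up the value because their stored tag matches the source tag on the bus.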

Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)


                      Tomasulo (IBM 360/91)              Scoreboard (CDC 6600)
Functional units      Pipelined functional units         Multiple functional units
                      (6 load, 3 store, 3 +, 2 x/÷)      (1 load/store, 1 +, 2 x, 1 ÷)
Window size           14 instructions                    5 instructions
Structural hazard     No issue                           same
WAR                   renaming avoids                    stall completion
WAW                   renaming avoids                    stall issue
Results               Broadcast results from FU          Write/read registers
Control               distributed reservation stations   central scoreboard


Tomasulo Drawbacks
- Complexity
  - delays of 360/91, MIPS 10000, IBM 620?
- Many associative stores (CDB) at high speed
- Performance limited by Common Data Bus
  - Multiple CDBs => more FU logic for parallel assoc stores


Tomasulo Summary
- Reservation stations: renaming to a larger set of registers + buffering source operands
  - Prevents registers from being a bottleneck
  - Avoids WAR, WAW hazards of Scoreboard
  - Allows loop unrolling in HW
- Not limited to basic blocks (integer unit gets ahead, beyond branches)
- Helps cache misses as well
- Lasting contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

HW support for More ILP


- Speculation: allow an instruction that depends on a branch predicted to be taken to issue, with no consequences (including exceptions) if the branch is not actually taken ("HW undo"); called "boosting"
- Combine branch prediction with dynamic scheduling to execute before branches are resolved
- Separate speculative bypassing of results from real bypassing of results
  - When an instruction is no longer speculative, write boosted results (instruction commit) or discard boosted results
  - Execute out-of-order but commit in-order to prevent irrevocable action (update state or exception) until instruction commits


HW support for More ILP


- Need HW buffer for results of uncommitted instructions: reorder buffer
  - 3 fields: instr, destination, value
  - Reorder buffer can be operand source => more registers, like RS
  - Use reorder buffer number instead of reservation station when execution completes
  - Supplies operands between execution complete & commit
  - Once the instruction commits, its result is put into the register
  - Instructions commit in order
  - As a result, it's easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: speculative Tomasulo organization, showing the FP Op Queue, Reorder Buffer, FP Regs, Res Stations, and FP Adder]


Four Steps of Speculative Tomasulo Algorithm


1. Issue: get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called dispatch)

2. Execution: operate on operands (EX)


When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called issue)

3. Write result: finish execution (WB)


Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

4. Commit: update register with reorder result


When the instruction at the head of the reorder buffer has its result present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer. (This stage is sometimes called "graduation".)
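In-order commit with out-of-order completion can be sketched with a queue. The `ReorderBuffer` class below is a simplified illustration, not the hardware design: entries hold the three fields (instr, destination, value) plus a hypothetical `done` flag, stores and exceptions are left out, and a flush simply empties the buffer.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left (head)

    def issue(self, instr, dest):
        entry = {"instr": instr, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        entry["value"], entry["done"] = value, True

    def commit(self, registers):
        """Retire finished instructions from the head only, in order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            registers[head["dest"]] = head["value"]
            committed.append(head["instr"])
        return committed

    def flush(self):
        """Mispredicted branch: discard all uncommitted results."""
        self.entries.clear()


regs = {}
rob = ReorderBuffer()
e1 = rob.issue("DIVD", "F0")
e2 = rob.issue("ADDD", "F10")
e3 = rob.issue("SUBD", "F12")

rob.write_result(e3, 7.0)   # SUBD finishes first (out of order)
first = rob.commit(regs)    # nothing commits: DIVD at the head is not done
rob.write_result(e1, 3.0)
second = rob.commit(regs)   # DIVD commits; ADDD now blocks the head
rob.flush()                 # e.g. a mispredicted branch
remaining = len(rob.entries)
```

SUBD's result was available early but never reached the register file, so flushing the buffer undoes it with no further work. That is the property that makes speculation cheap to reverse.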


Renaming Registers
- Common variation of speculative design
- Reorder buffer keeps instruction information but not the result
- Extend register file with extra renaming registers to hold speculative results
- Rename register allocated at issue; result goes into rename register on execution complete; rename register goes into real register on commit
- Operands read either from register file (real or speculative) or via Common Data Bus
- Advantage: operands are always from single source (extended register file)

