ILP Techniques: Laxmi N. Bhuyan CS 162 Spring 2003
Pentium Datapath
Pentium consists of two pipes (U-pipe and V-pipe) operating in parallel.
U-pipe contains an 8-stage FP pipeline (see Pentium Figure).
Two stages of decode: decode and control in the first stage, register read in the second stage.
See I-cache and D-cache in Fig. 6-1.
What is a TLB? How does virtual memory work?
Scoreboards allow an instruction to execute whenever it has no structural hazard and is not waiting on a prior instruction.
Instructions are issued in order, but can bypass waiting instructions in the read-operands stage.
=> In-order issue, out-of-order execution => out-of-order completion
Named after the CDC 6600 scoreboard, which introduced this capability.
DAP Spr.98 UCB 5
Scoreboard Implications
Scoreboard replaces ID, EX, WB with 4 stages.
Out-of-order completion => WAR, WAW hazards?
Solution for WAR: stall at the WB (write result) stage until the earlier instruction has read its operands.
Solution for WAW: detect the hazard at the ID (issue) stage and stall until the other instruction completes.
Need multiple instructions in the execution phase => multiple execution units or pipelined execution units.
Scoreboard keeps track of dependencies and the state of operations.
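The three hazard types named above can be told apart by comparing the destination and source registers of two instructions. The sketch below is only an illustration: the `Instr` record and the `hazards` helper are hypothetical names, and the register numbers are made-up examples, not taken from the slides.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, a tuple of sources."""
    name: str
    dest: str
    srcs: Tuple[str, ...]


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Hazards that `later` has against `earlier` (program order: earlier first)."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")  # later overwrites a register earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


divd = Instr("DIVD", "F0", ("F2", "F4"))
addd = Instr("ADDD", "F10", ("F0", "F8"))
subd = Instr("SUBD", "F8", ("F8", "F14"))
multd = Instr("MULTD", "F0", ("F2", "F6"))

raw = hazards(divd, addd)    # ADDD reads F0 produced by DIVD
war = hazards(addd, subd)    # SUBD writes F8, which ADDD reads
waw = hazards(divd, multd)   # both write F0
```

With in-order completion only RAW matters; WAR and WAW become real problems once a later instruction can finish before an earlier one.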
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
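A minimal sketch of how the register result status can be used at issue time, assuming the issue check is "functional unit free and no pending writer for the destination". The function names and the dictionary layout are illustrative assumptions, not the CDC 6600 hardware design.

```python
from typing import Dict, Optional


def can_issue(fu: str, dest: str,
              busy: Dict[str, bool],
              result_status: Dict[str, Optional[str]]) -> bool:
    """Issue only if the unit is free (no structural hazard) and no
    pending instruction will write `dest` (no WAW hazard)."""
    structural_hazard = busy.get(fu, False)
    waw_hazard = result_status.get(dest) is not None
    return not structural_hazard and not waw_hazard


def issue(fu: str, dest: str,
          busy: Dict[str, bool],
          result_status: Dict[str, Optional[str]]) -> bool:
    """Try to issue; on success record which unit will write `dest`."""
    if not can_issue(fu, dest, busy, result_status):
        return False  # stall: in-order issue means later instructions wait too
    busy[fu] = True
    result_status[dest] = fu
    return True


busy = {"Mult1": False, "Add": False}
result_status: Dict[str, Optional[str]] = {}

first = issue("Mult1", "F0", busy, result_status)  # e.g. MULTD F0, ...
second = issue("Add", "F0", busy, result_status)   # WAW on F0, so it stalls
```

A blank entry is modeled here as a missing key or `None`; any other value names the functional unit that will write the register.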
Issue
Read operands
Execution complete
Write result
Wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
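The write-result rule can be read as a WAR check followed by bookkeeping. The sketch below follows the field names in the table (Fi, Fj, Fk, Qj, Qk, Rj, Rk, Busy); the `FUStatus` class, the function names, and the example register values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    """One row of the functional unit status table (illustrative layout)."""
    busy: bool = False
    Fi: Optional[str] = None   # destination register
    Fj: Optional[str] = None   # source register 1
    Fk: Optional[str] = None   # source register 2
    Qj: Optional[str] = None   # unit producing Fj, if any
    Qk: Optional[str] = None   # unit producing Fk, if any
    Rj: bool = False           # Yes: Fj ready and not yet read
    Rk: bool = False           # Yes: Fk ready and not yet read


def can_write_result(fu: str, units: Dict[str, FUStatus]) -> bool:
    """Wait until no unit still has to read the old value of Fi(FU)."""
    dest = units[fu].Fi
    return all((f.Fj != dest or not f.Rj) and (f.Fk != dest or not f.Rk)
               for f in units.values())


def write_result(fu: str, units: Dict[str, FUStatus],
                 result_status: Dict[str, Optional[str]]) -> None:
    """Bookkeeping: wake up waiting units, clear the result entry, free FU."""
    for f in units.values():
        if f.Qj == fu:
            f.Rj = True
        if f.Qk == fu:
            f.Rk = True
    result_status[units[fu].Fi] = None
    units[fu].busy = False


units = {
    # Divide still has to read F6, so Add must not overwrite F6 yet (WAR).
    "Divide": FUStatus(True, "F10", "F0", "F6", Qj="Mult1", Rj=False, Rk=True),
    "Add": FUStatus(True, "F6", "F8", "F2"),
    # Mult2 waits for the value of F6 that Add produces (RAW).
    "Mult2": FUStatus(True, "F12", "F6", "F4", Qj="Add", Rj=False, Rk=True),
}
result_status = {"F10": "Divide", "F6": "Add", "F12": "Mult2"}

blocked = can_write_result("Add", units)   # False while Divide has not read F6
units["Divide"].Rk = False                 # Divide has now read its operands
allowed = can_write_result("Add", units)   # True
write_result("Add", units, result_status)
```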
Summary
Instruction Level Parallelism (ILP) in SW or HW
Loop level parallelism is easiest to see
SW parallelism: dependencies are defined for the program; they become hazards if HW cannot resolve them
SW dependencies and compiler sophistication determine whether the compiler can unroll loops
Memory dependencies hardest to determine
HW exploiting ILP
Works when dependences can't be known at compile time
Code for one machine runs well on another
Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands)
Enables out-of-order execution => out-of-order completion
ID stage checked both for structural
Tomasulo Algorithm
(Implemented in IBM 360/91 in 1966)
Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard;
FU buffers called reservation stations; have pending operands
Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming;
avoids WAR, WAW hazards
More reservation stations than registers, so can do optimizations compilers can't
Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
Loads and stores treated as FUs with RSs as well
Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
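Register renaming at issue can be sketched as follows: each source operand is captured either as a value (if the register is up to date) or as the name of the reservation station that will produce it. The `ReservationStation` class, the `issue` helper, and the field names Vj/Vk/Qj/Qk are assumptions chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ReservationStation:
    """Illustrative reservation station: values (V) or producer tags (Q)."""
    name: str
    busy: bool = False
    op: Optional[str] = None
    Vj: Optional[float] = None
    Vk: Optional[float] = None
    Qj: Optional[str] = None
    Qk: Optional[str] = None


def read_source(src: str, regs: Dict[str, float],
                reg_status: Dict[str, str]) -> Tuple[Optional[float], Optional[str]]:
    """Return (value, None) if the register is current, else (None, producer tag)."""
    tag = reg_status.get(src)
    if tag is None:
        return regs[src], None
    return None, tag


def issue(rs: ReservationStation, op: str, dest: str, src1: str, src2: str,
          regs: Dict[str, float], reg_status: Dict[str, str]) -> None:
    """Rename sources to values or RS tags, then claim the destination."""
    rs.busy, rs.op = True, op
    rs.Vj, rs.Qj = read_source(src1, regs, reg_status)
    rs.Vk, rs.Qk = read_source(src2, regs, reg_status)
    reg_status[dest] = rs.name  # later readers of dest now wait on this RS


regs = {"F0": 1.0, "F2": 2.0, "F6": 6.0, "F8": 8.0, "F10": 0.0}
reg_status = {"F0": "Mult1"}           # F0 is still being computed by Mult1
div1 = ReservationStation("Div1")
add1 = ReservationStation("Add1")

issue(div1, "DIVD", "F10", "F0", "F6", regs, reg_status)  # DIVD F10, F0, F6
issue(add1, "ADDD", "F6", "F8", "F2", regs, reg_status)   # ADDD F6, F8, F2
```

Because DIVD copied the value of F6 into its own station at issue, ADDD can overwrite F6 at any time without a WAR hazard. A second writer of F6 would simply replace the tag in `reg_status`, which is how WAW is avoided in this sketch.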
Tomasulo Organization
Figure: Tomasulo organization, showing the FP Op Queue, Load Buffer, FP Registers, and Store Buffer
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
Normal data bus: data + destination ("go to" bus)
Common data bus: data + source ("come from" bus)
64 bits of data + 4 bits of Functional Unit source address
Write if matches expected Functional Unit (produces result)
Does the broadcast
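The "come from" behavior can be sketched as a broadcast of (source tag, value): every waiting reservation station and register compares the tag with the producer it expects and captures the value on a match. The dictionary layout and the `broadcast` function are illustrative assumptions, not the 360/91 logic.

```python
from typing import Dict, List, Optional


def broadcast(tag: str, value: float,
              stations: List[dict],
              regs: Dict[str, float],
              reg_status: Dict[str, Optional[str]]) -> None:
    """Deliver `value` to everything that was waiting on producer `tag`."""
    for rs in stations:
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, None
    for reg in list(reg_status):
        if reg_status[reg] == tag:      # write only if this is the expected FU
            regs[reg] = value
            reg_status[reg] = None


stations = [
    {"name": "Add1", "Vj": None, "Vk": 2.0, "Qj": "Mult1", "Qk": None},
    {"name": "Add2", "Vj": 1.0, "Vk": None, "Qj": None, "Qk": "Load1"},
]
regs = {"F0": 0.0, "F4": 0.0}
reg_status: Dict[str, Optional[str]] = {"F0": "Mult1", "F4": "Add1"}

broadcast("Mult1", 7.5, stations, regs, reg_status)
```

In hardware all of these comparisons happen in parallel, which is where the associative-match cost of the bus comes from; the loops here only model the result.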
Tomasulo Drawbacks
Complexity
delays of 360/91, MIPS 10000, IBM 620?
Many associative stores (CDB) at high speed
Performance limited by Common Data Bus
Multiple CDBs => more FU logic for parallel assoc stores
Tomasulo Summary
Reservation stations: renaming to larger set of registers + buffering source operands
Prevents registers as bottleneck
Avoids WAR, WAW hazards of Scoreboard
Allows loop unrolling in HW
Not limited to basic blocks (integer unit gets ahead, beyond branches)
Helps cache misses as well
Lasting Contributions
Dynamic scheduling
Register renaming
Load/store disambiguation
360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
Reorder Buffer
FP Regs
Renaming Registers
Common variation of speculative design
Reorder buffer keeps instruction information but not the result
Extend register file with extra renaming registers to hold speculative results
Rename register allocated at issue; result into rename register on execution complete; rename register into real register on commit
Operands read either from register file (real or speculative) or via Common Data Bus
Advantage: operands are always from single source (extended register file)
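The allocate / complete / commit life cycle of a rename register can be sketched as below. The class and method names are hypothetical, the free list is a simple Python list, and recovery from misspeculation is deliberately left out.

```python
from typing import Dict, List, Optional


class RenamedRegisterFile:
    """Illustrative extended register file: real registers plus rename registers."""

    def __init__(self, arch_regs: Dict[str, float], n_rename: int) -> None:
        self.arch = dict(arch_regs)                       # committed (real) state
        self.rename: List[Optional[float]] = [None] * n_rename
        self.free: List[int] = list(range(n_rename))
        self.latest: Dict[str, int] = {}                  # reg -> newest rename index

    def issue(self, dest: str) -> int:
        """Allocate a rename register for `dest` at issue."""
        idx = self.free.pop(0)
        self.rename[idx] = None                           # result not produced yet
        self.latest[dest] = idx
        return idx

    def complete(self, idx: int, value: float) -> None:
        """Execution complete: result goes into the rename register."""
        self.rename[idx] = value

    def read(self, reg: str) -> Optional[float]:
        """Read the speculative value if one is pending, else the real one.
        None means the value is not ready and would arrive over the CDB."""
        idx = self.latest.get(reg)
        return self.arch[reg] if idx is None else self.rename[idx]

    def commit(self, dest: str, idx: int) -> None:
        """Commit: copy rename register into the real register, then free it."""
        self.arch[dest] = self.rename[idx]
        if self.latest.get(dest) == idx:
            del self.latest[dest]
        self.free.append(idx)


rf = RenamedRegisterFile({"F0": 1.0}, n_rename=2)
idx = rf.issue("F0")
before_complete = rf.read("F0")   # None: result still pending
rf.complete(idx, 5.0)
speculative = rf.read("F0")       # 5.0 from the rename register
committed_early = rf.arch["F0"]   # real register still holds 1.0
rf.commit("F0", idx)
after_commit = rf.arch["F0"]      # 5.0 is now architectural state
```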