A Pipelining
© 2003 by Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti
ECE 252 / CPS 220 Lecture Notes: Pipelining

Simple 5-Stage Pipeline
[figure: 5-stage datapath with PC, +4, I$, regfile, D$, pipeline registers F/D, D/X, X/M, M/W, and stages F D X M W]
• 5 stages (pipeline depth is 5)
  • fetch (F, IF): fetch instruction from I$
  • decode (D, ID): decode instruction, read input registers
  • execute (X, EX): ALU, load/store address, branch outcome
  • memory access (M, MEM): load/store to D$/DTLB
  • writeback (W, WB): write results (from ALU or ld) back to register file

Pipeline Registers (Latches)
[figure: same 5-stage datapath, highlighting PC, F/D, D/X, X/M, M/W]
• contain info for controlling flow of instructions through pipe
  • PC: PC
  • F/D: PC, undecoded instruction
  • D/X: PC, opcode, regfile[rs1], regfile[rs2], immed, rd
  • X/M: opcode (why?), regfile[rs1], ALUOUT, rd
  • M/W: ALUOUT, MEMOUT, rd

Wrong (part I): Pipeline Overhead
V := oVerhead delay per pipe stage
• cause #1: latch overhead
  • pipeline registers take time
• cause #2: clock/data skew
so, for an N-stage pipeline with overheads
• single-instruction latency T = Σ(V + tn) = N*V + Σtn
• throughput = 1/(max(tn) + V) <= N/T (and <= 1/V)
• M-instruction latency = M*(max(tn) + V) >= M*T/N
• speedup = T/(V + max(tn)) <= N
Overhead limits throughput, speedup & useful pipeline depth

Wrong (part II): Hazards
hazards: conditions that lead to incorrect behavior if not fixed
• structural: two instructions use same h/w in same cycle
• data: two instructions use same data (register/memory)
• control: one instruction affects which instruction is next
• hazards ⇒ stalls (sometimes)
  • stall: instruction stays in same stage for more than one cycle
• what if average stall per instruction = S stages?
  • latency' ⇒ T(N+S)/N = ((N+S)/N)*latency > latency
  • throughput' ⇒ N^2/(T(N+S)) = (N/(N+S))*throughput < throughput
  • M_latency' ⇒ M*T(N+S)/N^2 = ((N+S)/N)*M_latency > M_latency
  • speedup' ⇒ N^2/(N+S) = (N/(N+S))*speedup < speedup
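The overhead and stall formulas on these two slides can be checked numerically. The sketch below is only an illustration: the function names and the sample stage delays are hypothetical, and it assumes the stall factor N/(N+S) is applied on top of the overhead-adjusted numbers.

```python
def overhead_metrics(stage_delays, v):
    """Metrics for an N-stage pipeline with per-stage overhead V.

    stage_delays holds the logic delay t_n of each stage.
    """
    n = len(stage_delays)
    latency = n * v + sum(stage_delays)      # T = N*V + sum(t_n)
    cycle = max(stage_delays) + v            # clock period = max(t_n) + V
    return {
        "latency": latency,
        "throughput": 1.0 / cycle,           # <= N/T and <= 1/V
        "speedup": latency / cycle,          # <= N
    }


def with_stalls(metrics, n, s):
    """Scale metrics for an average stall of S stages per instruction."""
    scale = n / (n + s)                      # N/(N+S) < 1 when S > 0
    return {
        "latency": metrics["latency"] / scale,
        "throughput": metrics["throughput"] * scale,
        "speedup": metrics["speedup"] * scale,
    }


balanced = overhead_metrics([10, 10, 10, 10, 10], v=2)
skewed = overhead_metrics([8, 12, 10, 10, 10], v=2)
print(balanced["speedup"], skewed["speedup"])
print(with_stalls(balanced, n=5, s=1)["speedup"])
```

With balanced stages the speedup reaches the bound N; unbalanced stages or any stalls push it below N.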

Pipelining: Clock Rate vs. IPC
deeper pipeline (more stages, larger N)
+ increases clock rate
– decreases IPC (longer stalls for hazards - will see later)
• ultimate metric is execution rate = clock rate*IPC
  • (clock cycles / unit real time) * (instructions / clock cycle)
  • number of instructions is fixed, for purposes of this discussion
• how does pipeline overhead factor in?
to think about this, parameterize the clock cycle
• basic time unit is the gate-delay (time to go through a gate)
• e.g., 80 gate-delays to process (fetch, decode,...) an instruction
• let's look at an example ...

Clock Rate vs. IPC Example
• G: gate-delays to process an instruction
• V: gate-delays of overhead per stage
• S: average cycle stall per instruction per pipe stage
  – overly simplistic model for stalls
• compute optimal N (depth) given G, V, S [Smith+Pleszkun]
  • IPC = 1 / (1 + S*N)
  • clock rate (in gate-delays) = 1/(gate delays/stage) = 1/(G/N + V)
• e.g., G = 80, S = 0.16, V = 1
N    IPC := 1/(1+0.16*N)    clock := 1/(80/N+1)    execution rate
10   0.38                   0.11                   0.042
20   0.24                   0.20                   0.048
30   0.17                   0.27                   0.046
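The table can be reproduced, and the best depth located, with a few lines of code. This is a sketch under the slide's own simplified model; the function names are hypothetical. Setting the derivative of (1 + S*N)*(G/N + V) with respect to N to zero gives N = sqrt(G/(S*V)), which is a derivation added here rather than something stated on the slide, and it assumes S > 0 and V > 0.

```python
import math


def execution_rate(n, g=80.0, v=1.0, s=0.16):
    """Execution rate = IPC * clock rate for an n-stage pipeline."""
    ipc = 1.0 / (1.0 + s * n)        # IPC = 1 / (1 + S*N)
    clock = 1.0 / (g / n + v)        # clock rate = 1 / (G/N + V)
    return ipc * clock


def optimal_depth(g=80.0, v=1.0, s=0.16):
    """Continuous depth that maximizes execution_rate in this model."""
    return math.sqrt(g / (s * v))


for n in (10, 20, 30):
    print(n, round(execution_rate(n), 3))

print(round(optimal_depth(), 2))                    # about 22.36
print(max(range(1, 81), key=execution_rate))        # best integer depth
```

Computing the product at full precision can differ in the last digit from the table, which multiplies the already rounded IPC and clock columns.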

Pipeline Depth Upshot
trend is for deeper pipelines (more stages)
• why? faster clock (higher frequency)
  • clock period = f(transistor latency, gate delays per pipe stage)
  • superpipelining: add more stages to reduce gate-delays/pipe-stage
  • but increased frequency may not mean increased performance...
  • who cares? we can sell frequency!
• e.g., Intel IA-32 pipelines
  • 486: 5 stages (50+ gate-delays per clock period)
  • Pentium: 7 stages
  • Pentium II/III: 12 stages
  • Pentium 4: 22 stages (10 gate-delays per clock)
  • Gotcha! 800MHz Pentium III performs better than 1GHz Pentium 4

Managing the Pipeline
to resolve hazards, need fine pipe-stage control
• play with pipeline registers to control pipe flow
• trick #1: the stall (or the bubble)
  • effect: stops SOME instructions in current pipe-stages
  • use: make younger instructions wait for older ones to complete
  • implementation: de-assert write-enable signals to pipeline registers
• trick #2: the flush
  • effect: clears instructions out of current pipe-stages
  • use: undoes speculative work that was incorrect (see later)
  • implementation: assert clear signals on pipeline registers
• stalls & flushes must be propagated upstream (why?)
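A pipeline register with write-enable and clear inputs can be modeled in a few lines. This is a behavioral sketch, not a hardware description: the class and helper names are hypothetical, `None` stands for a bubble, and giving clear priority over write-enable is an assumption made here.

```python
class PipelineRegister:
    """One pipeline latch; None models a bubble (no instruction)."""

    def __init__(self):
        self.value = None

    def clock(self, next_value, write_enable=True, clear=False):
        """Advance one clock edge."""
        if clear:                  # flush or bubble insertion: drop contents
            self.value = None
        elif write_enable:         # normal flow: latch the upstream value
            self.value = next_value
        # write-enable de-asserted: keep the current value (stall)


def stall_decode(fd, dx, fetched):
    """Hold the instruction in decode for one cycle.

    F/D keeps its contents, so `fetched` is not latched, and D/X is
    cleared so that a bubble rather than a duplicate enters execute.
    """
    fd.clock(fetched, write_enable=False)
    dx.clock(fd.value, clear=True)


fd, dx = PipelineRegister(), PipelineRegister()
fd.clock("sub R2,R4,R1")                 # sub reaches decode
stall_decode(fd, dx, "or R1,R6,R3")      # sub must wait one cycle
print(fd.value, dx.value)                # sub still in F/D, bubble in D/X
```

In this model the fetched `or` is simply not latched, which hints at why a stall has to reach upstream: the PC must hold as well, or that instruction is lost.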

Avoiding Structural Hazards
• option #1: replicate the contended resource
  + good performance
  – increased area, slower (interconnect delay)?
  • use for cheap, divisible, or highly-contended resources (e.g., I$/D$)
• option #2: pipeline the contended resource
  + good performance, low area
  – sometimes complex (e.g., RAM)
  • useful for multicycle resources
• option #3: design ISA/pipeline to reduce structural hazards
  • key 1: each instruction uses a given resource at most once
  • key 2: each instruction uses a given resource in same pipeline stage
  • key 3: each instruction uses a given resource for one cycle
  • this is why we force ALU operations to go thru MEM stage

Data Hazards
two different instructions use the same storage location
• we must preserve the illusion of sequential execution

read-after-write (RAW)    write-after-read (WAR)    write-after-write (WAW)
add R1, R2, R3            add R1, R2, R3            add R1, R2, R3
sub R2, R4, R1            sub R2, R4, R1            sub R2, R4, R1
or R1, R6, R3             or R1, R6, R3             or R1, R6, R3
true dependence           anti-dependence           output dependence
(real)                    (artificial)              (artificial)

• RAW: sub reads R1, which add writes
• WAR: sub writes R2, which add reads
• WAW: or writes R1, which add also writes

Q: What about read-after-read dependences? (RAR)
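The three dependence types can be detected mechanically from destination and source registers. The sketch below is illustrative only; the `(dest, sources)` tuple encoding and the function name are assumptions, not part of the slides.

```python
def classify(older, younger):
    """Return the register dependences from an older to a younger instruction.

    Each instruction is a (dest, sources) pair.
    """
    older_dst, older_src = older
    younger_dst, younger_src = younger
    deps = set()
    if older_dst in younger_src:
        deps.add("RAW")        # younger reads what older writes
    if younger_dst in older_src:
        deps.add("WAR")        # younger writes what older reads
    if older_dst == younger_dst:
        deps.add("WAW")        # both write the same register
    return deps


add = ("R1", {"R2", "R3"})     # add R1, R2, R3
sub = ("R2", {"R4", "R1"})     # sub R2, R4, R1
or_ = ("R1", {"R6", "R3"})     # or  R1, R6, R3

print(sorted(classify(add, sub)))   # ['RAW', 'WAR']
print(sorted(classify(add, or_)))   # ['WAW']
print(sorted(classify(sub, or_)))   # ['WAR']
```

Both add and or read R3, but a shared source is not reported: read-after-read imposes no ordering constraint.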

[figure: X, M, W stages and D$ (bypass datapath)]
why wait until WB stage? data available at end of EX/MEM stage
• bypass (aka "forward") data directly to input of EX
+ very effective at reducing/avoiding stalls
  • in practice, a large fraction of input operands are bypassed (why?)
– complex
• does not relieve you from having to perform WB

[figure: X, M, W stages and D$ (bypass control)]
• first, detect bypass opportunity
  • tag compares in D/X latch
  • similar to but separate from stall logic in F/D latch
• then, control bypass MUX
  • if rs2(X) == rd(X/M) then ALUOUT(M)
  • else if rs2(X) == rd(M/W) then ALUOUT(W)
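The bypass MUX control above amounts to a priority selection. A minimal sketch follows, assuming that an instruction without a destination carries `rd = None` and that the register-file value is used when neither tag matches; the function and parameter names are hypothetical.

```python
def bypass_select(rs, regfile_value, xm_rd, xm_aluout, mw_rd, mw_value):
    """Pick the value fed to an EX input for source register rs.

    The X/M latch is checked first because it holds the youngest producer.
    """
    if rs is not None and rs == xm_rd:
        return xm_aluout           # forward from the instruction in MEM
    if rs is not None and rs == mw_rd:
        return mw_value            # forward from the instruction in WB
    return regfile_value           # no match: use the value read in decode


# Both older instructions wrote R1: the younger result (X/M) must win.
print(bypass_select("R1", 0, "R1", 11, "R1", 22))   # 11
print(bypass_select("R1", 0, "R3", 11, "R1", 22))   # 22
print(bypass_select("R1", 7, "R3", 11, "R4", 22))   # 7
```

The order of the two comparisons matters: reversing it would forward a stale value whenever both latches hold a write to the same register.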

Pipeline Diagrams with Bypassing
example 1
                1  2  3  4  5  6  7  8  9  10 11
add R1,R5,R3    F  D  X  M  W
sub R2,R4,R1       F  D  X  M  W

example 2
                1  2  3  4  5  6  7  8  9  10 11
load R1,24(R5)  F  D  X  M  W
add R3,R6,R7       F  D  X  M  W
sub R2,R4,R1          F  D  X  M  W

• even with full bypassing, not all RAW stalls can be avoided
• example: load to ALU in consecutive cycles

example 3
                1  2  3  4  5  6  7  8  9  10 11
load R1,24(R5)  F  D  X  M  W
sub R2,R4,R1       F  D  d* X  M  W

Pipeline Scheduling
compiler schedules (moves) instructions to reduce stall
• eliminate back-to-back load-ALU scenarios
• example code sequence a = b + c; d = e - f

before                       after
load R2, b                   load R2, b
load R3, c                   load R3, c
add R1, R2, R3 // stall      load R5, e
store R1, a                  add R1, R2, R3 // no stall
load R5, e                   load R6, f
load R6, f                   store R1, a
sub R4, R5, R6 // stall      sub R4, R5, R6 // no stall
store R4, d                  store R4, d
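The effect of scheduling can be checked by counting load-use stalls in each ordering. The sketch assumes full bypassing with a one-cycle penalty whenever an instruction uses the result of the load immediately before it; the `(op, dest, sources)` encoding and the function name are hypothetical.

```python
def load_use_stalls(program):
    """Count stalls caused by a load followed directly by a consumer."""
    stalls = 0
    for (op, dst, _), (_, _, srcs) in zip(program, program[1:]):
        if op == "load" and dst in srcs:
            stalls += 1
    return stalls


# a = b + c; d = e - f, as on the slide (memory operands omitted)
before = [
    ("load", "R2", ()), ("load", "R3", ()),
    ("add", "R1", ("R2", "R3")),          # uses R3 right after its load
    ("store", None, ("R1",)),
    ("load", "R5", ()), ("load", "R6", ()),
    ("sub", "R4", ("R5", "R6")),          # uses R6 right after its load
    ("store", None, ("R4",)),
]
after = [
    ("load", "R2", ()), ("load", "R3", ()), ("load", "R5", ()),
    ("add", "R1", ("R2", "R3")),
    ("load", "R6", ()),
    ("store", None, ("R1",)),
    ("sub", "R4", ("R5", "R6")),
    ("store", None, ("R4",)),
]

print(load_use_stalls(before), load_use_stalls(after))   # 2 0
```

The model also charges a stall to a store placed directly after the load that produces its data; neither ordering here contains that case.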

RAR: Read After Read
read-after-read (RAR)
add R1, R2, R3
sub R2, R4, R1
or R1, R6, R3
• no problem: R3 is correct even with reordering

Memory Data Hazards
have seen register hazards, can also have memory hazards

RAW               WAR               WAW
store R1,0(SP)    load R4,0(SP)     store R1,0(SP)
load R4,0(SP)     store R1,0(SP)    store R4,0(SP)

                1  2  3  4  5  6  7  8  9
store R1,0(SP)  F  D  X  M  W
load R1,0(SP)      F  D  X  M  W
• in simple pipeline, memory hazards are easy
  • in-order
  • one at a time
  • read & write in same stage
• in general, though, more difficult than register hazards

                1  2  3  4  5  6  7  8  9
store R4,0(R5)  F  D  X  M  W
bne R2,R3,loop     F  D  X  M  W
??                    c* c* F  D  X  M  W

Control Hazards: "Fast" Branches
fast branches: can be evaluated in ID (rather than EX)
+ reduce stall from 2 cycles to 1
                1  2  3  4  5  6  7  8  9
sw R4,0(R5)     F  D  X  M  W
bne R2,R3,loop     F  D  X  M  W
??                    c* F  D  X  M  W
– requires more hardware
  • dedicated ID adder for (PC + immediate) targets
– requires simple branch instructions
  • no time to compare two registers (would need full ALU)
  • comparisons with 0 are fast (beqz, bnez)

Control Hazards: Delayed Branches
delayed branch: execute next instruction whether taken or not
• instruction after branch said to be in "delay slot"
• old microcode trick stolen by RISC (MIPS)

store R4,0(R5)       ⇒    bned R2,R3,loop
bne R2,R3,loop            store R4,0(R5)
sub R1,R6,R6              sub R1,R6,R6

                1  2  3  4  5  6  7  8  9
bned R2,R3,loop F  D  X  M  W
store R4,0(R5)     F  D  X  M  W
sub R1,R6,R6          c* F  D  X  M  W

The Speculation Game
speculation: engagement in risky business transactions on the chance of quick or considerable profit
• speculative execution (control speculation)
  • execute before all parameters known with certainty
+ correct speculation
  + avoid stall/get result early, performance improves
– incorrect speculation (mis-speculation)
  – must abort/squash incorrect instructions
  – must undo incorrect changes (recover pre-speculation state)
• the speculation game: profit > penalty
  • profit = speculation accuracy * correct-speculation gain
  • penalty = (1–speculation accuracy) * mis-speculation penalty

Speculative Execution Scenarios
• correct speculation
                1  2  3  4  5
inst0/B         F  D  X  M  W
inst8              F  D  X  M
inst9                 F  D  X
inst10                   F  D
  • c1: fetch branch, predict next (inst8)
  • c2, c3: fetch inst8, inst9
  • c3: execute/verify branch ⇒ correct
  • nothing needs to be fixed or changed
• incorrect speculation: mis-speculation
                1  2  3  4  5
inst0/B         F  D  X  M  W
inst1              F  D
inst2                 F
inst8                    F  D
  (verify/flush in cycle 3)
  • c1: fetch branch, predict next (inst1)
  • c2, c3: fetch inst1, inst2
  • c3: execute/verify branch ⇒ wrong
  • c3: send correct target to IF (inst8)
  • c3: squash (abort) inst1, inst2 (flush F/D)
  • c4: fetch inst8
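The profit and penalty expressions of the speculation game are easy to evaluate. In the sketch below the function names and the sample numbers (a 2-cycle gain and a 2-cycle penalty) are hypothetical. The break-even accuracy follows from solving accuracy*gain > (1 - accuracy)*penalty for accuracy, a step added here rather than taken from the slide.

```python
def speculation_net(accuracy, gain, mis_penalty):
    """Expected cycles saved per speculation (negative means a loss)."""
    profit = accuracy * gain
    penalty = (1.0 - accuracy) * mis_penalty
    return profit - penalty


def break_even_accuracy(gain, mis_penalty):
    """Accuracy above which profit exceeds penalty."""
    return mis_penalty / (gain + mis_penalty)


print(break_even_accuracy(gain=2, mis_penalty=2))            # 0.5
print(round(speculation_net(0.9, gain=2, mis_penalty=2), 3))  # 1.6
print(round(speculation_net(0.4, gain=2, mis_penalty=2), 3))  # -0.4
```

A larger mis-speculation penalty relative to the gain raises the accuracy needed before speculation pays off.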

Dynamic Branch Prediction
[figure: 5-stage pipeline (PC, I$, regfile, D$, F/D, D/X, X/M, M/W) with a branch predictor (BP) next to I$ in the fetch stage]
hardware (BP) guesses whether and where a branch will go
0x64 bnez r1,#10
0x74 add r3,r2,r1
• start with branch PC (0x64) and produce
  • direction (Taken)
  • direction + target PC (0x74)
  • direction + target PC + target instruction (add r3, r2,r1)

Branch History Table (BHT)
branch PC ⇒ prediction (T, NT)
– need decoder/adder to compute target if taken
• branch history table (BHT)
  • read prediction with least significant bits (LSBs) of branch PC
  • change bit on misprediction
  + simple
  – multiple PCs may map to same bit (aliasing)
[figure: BHT next to I$, a column of 1-bit entries (1, 0, 1, 0) indexed by branch PC, producing T/N]
• major improvements
  • two-bit counters [Smith]
  • correlating/two-level predictors [Patt]
  • hybrid predictors [McFarling]
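A one-bit BHT is small enough to simulate directly. The sketch below is an illustration with assumed details: the class name, the 4-entry table, and the indexing that drops two low PC bits (4-byte instructions) are all hypothetical choices. Writing the actual outcome into the entry is equivalent to flipping the bit on a misprediction.

```python
class OneBitBHT:
    """Branch history table with one bit per entry (1 = taken)."""

    def __init__(self, entries=4):
        self.entries = entries
        self.bits = [0] * entries

    def index(self, pc):
        # Use low-order PC bits, skipping the byte offset within a word.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return bool(self.bits[self.index(pc)])

    def update(self, pc, taken):
        self.bits[self.index(pc)] = int(taken)


def count_mispredictions(bht, pc, outcomes):
    """Predict, compare, then update for each outcome of one branch."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses


bht = OneBitBHT(entries=4)
loop_branch = [True, True, True, False] * 2     # 4-iteration loop, run twice
print(count_mispredictions(bht, 0x64, loop_branch))    # 4
print(bht.index(0x64) == bht.index(0x74))              # True: aliasing
```

The loop branch is mispredicted twice per execution of the loop (on entry and on exit), which is the behavior two-bit counters are meant to improve.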

[figure: hybrid predictor (predictor 2, chooser)]
• ...and other is right...
– many more bits per entry than BHT

Jump Prediction
exploit behavior of different kinds of jumps to improve prediction
• function returns
  • use hardware return address stack (RAS)
  • call pushes return address on top of RAS
  • for return, predict address at top of RAS and pop
  – trouble: must manage speculatively
• indirect jumps (switches, virtual functions)
  • more than one taken target per jump
  • path-based BTB [Driesen+Holzle]

Branch Issues
issue1: how do we know at IF which instructions are branches?
• BTB: don't need to "know"
• check every instruction: BTB entry ⇒ instruction is a branch
issue2: BHR (RAS) depend on branch (call) history
• when are these updated?
  • at WB is too late (if another branch is in-flight)
  • at IF (after prediction)
  • must be able to recover BHR (RAS) on mis-speculation (nasty)
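The return address stack and its recovery problem can be sketched as follows. The class, its fixed depth, and the whole-stack checkpoint are assumptions made for illustration; copying the full stack is just the simplest recovery scheme to show, not necessarily what hardware does.

```python
class ReturnAddressStack:
    """Small hardware-style stack of predicted return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # full: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        """Predict the top entry and pop it (None if the stack is empty)."""
        return self.stack.pop() if self.stack else None

    def checkpoint(self):
        return list(self.stack)            # snapshot taken at a prediction

    def restore(self, snapshot):
        self.stack = list(snapshot)        # undo wrong-path pushes and pops


ras = ReturnAddressStack()
ras.on_call(0x104)                 # outer call
snap = ras.checkpoint()            # a branch is predicted here
ras.on_call(0x300)                 # call fetched down the wrong path
ras.restore(snap)                  # branch turns out mispredicted
print(hex(ras.predict_return()))   # 0x104
```

Without the restore, the return would be predicted to go to 0x300, an address pushed by an instruction that never architecturally executed.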

Multi-Cycle Example
                1  2  3  4  5  6  7  8  9  10
divf f0,f1,f2   F  D  E/ E/ E/ E/ W
mulf f0,f3,f4      F  D  E* E* W
addf f5,f6,f7         F  D  E+ E+ W
subf f8,f6,f7            F  D  *  E+ E+ W
mulf f9,f8,f7               F  D  *  *  E* E*
• write-after-write (WAW) hazards
• register write port structural hazards
• functional unit structural hazards
• elongated read-after-write (RAW) hazards

Another Multi-Cycle Example
example: SAXPY (math kernel)
Z[i] = A*X[i] + Y[i] // single precision
                1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
ldf f2,0(r1)    F  D  X  M  W
mulf f6,f0,f2      F  D  d* E* E* E* E* W
ldf f4,0(r2)          F  p* D  X  M  W
addf f8,f6,f4               F  D  d* d* E+ E+ W
stf f8,0(r3)                   F  p* p* D  X  M  W
add r1,r1,#4                            F  D  X  M  W
add r2,r2,#4                               F  D  X  M  W
add r3,r3,#4                                  F  D  X  M  W
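Two of the hazards in the first multi-cycle diagram show up directly in the writeback cycles. The sketch below flags them from a list of `(name, dest, wb_cycle)` entries in program order; the encoding and function name are hypothetical, and the cycle numbers are read off the diagram as drawn.

```python
def writeback_hazards(instrs):
    """Find write-port conflicts and WAW violations.

    instrs: (name, dest, wb_cycle) tuples in program order.
    Returns (port_conflicts, waw_violations).
    """
    port, waw = [], []
    for i, (older, older_dst, older_wb) in enumerate(instrs):
        for younger, younger_dst, younger_wb in instrs[i + 1:]:
            if older_wb == younger_wb:
                port.append((older, younger, older_wb))    # same W cycle
            if older_dst == younger_dst and younger_wb <= older_wb:
                waw.append((older, younger, older_dst))    # older writes last
    return port, waw


diagram = [
    ("divf", "f0", 7),
    ("mulf", "f0", 6),
    ("addf", "f5", 7),
    ("subf", "f8", 9),
]
port, waw = writeback_hazards(diagram)
print(port)   # [('divf', 'addf', 7)]
print(waw)    # [('divf', 'mulf', 'f0')]
```

The divf and mulf pair is the WAW case: if divf writes f0 after mulf, the register ends up holding the older result.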

Dealing With Interrupts
interrupts (aka faults, exceptions, traps)
• e.g., arithmetic overflow, divide by zero, protection violation
• e.g., I/O device request, OS call, page fault
classifying interrupts
• terminal (fatal) vs. restartable (control returned to program)
• synchronous (internal) vs. asynchronous (external)
• user vs. coerced
• maskable (ignorable) vs. non-maskable
• between instructions vs. within instruction

Precise Interrupts
"unobserved system can exist in any intermediate state, upon observation system collapses to well-defined state"
–2nd postulate of quantum mechanics
• system ⇒ processor, observation ⇒ interrupt
what is the "well-defined" state?
• von Neumann: "sequential, instruction atomic execution"
• precise state at interrupt
  • all instructions older than interrupt are complete
  • all instructions younger than interrupt haven't started
  • implies interrupts are taken in program order
  • necessary for VM (why?), "highly recommended" by IEEE

Posted Interrupts
posted interrupts
• set interrupt bit when condition is raised
• check interrupt bit (potentially "take" interrupt) in WB
+ interrupts are taken in order
– longer latency, more complex
                1  2  3  4  5  6  7  8  9
inst0           F  D  X  M  W              data page fault
inst1              F  D  X  M  W           instruction page fault
• what happens now?
  • c2: set inst1 bit
  • c4: set inst0 bit
  • c5: take inst0 interrupt

Interrupts and Multi-Cycle Operations
                1  2  3  4  5  6  7  8  9  10 11
divf f0,f1,f2   F  D  E/ E/ E/ E/ W              div by 0 (posted)
mulf f3,f4,f5      F  D  E* E* W
addf f6,f7,f8         F  D  E+ E+ s* W
multi-cycle operations + precise state = trouble
• #1: how to undo early writes?
  • e.g., must make it seem as if mulf hasn't executed
  • undo writes: future file, history file -> ugly!
• #2: how to take interrupts in-order if WB is not in-order?
  • force in-order WB
  – slow
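The posted-interrupt example can be traced with a small model: bits are set whenever a condition is raised, but an interrupt is only taken when its instruction reaches WB. The dictionary encoding and the function name below are hypothetical, and the model assumes WB happens in program order, as in the simple 5-stage pipeline.

```python
def first_taken_interrupt(posted, wb_cycle):
    """Return (inst, cycle) for the first interrupt taken, or None.

    posted:   {inst: cycle in which its interrupt bit was set}
    wb_cycle: {inst: cycle in which it reaches WB}
    """
    for inst in sorted(wb_cycle, key=wb_cycle.get):      # WB order
        if inst in posted and posted[inst] <= wb_cycle[inst]:
            return inst, wb_cycle[inst]
    return None


posted = {"inst1": 2, "inst0": 4}      # inst1 faults in F, inst0 in M
wb_cycle = {"inst0": 5, "inst1": 6}

print(first_taken_interrupt(posted, wb_cycle))   # ('inst0', 5)
```

Although inst1's bit is set two cycles earlier, inst0's interrupt is taken first, which matches the program-order requirement of precise interrupts.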