A Pipelining

The document discusses the principles and mechanisms of pipelining in computer architecture, highlighting its advantages over non-pipelined systems, such as increased resource utilization and throughput. It covers various types of hazards that can occur in pipelined processors, including structural, data, and control hazards, and methods to manage these hazards through techniques like stalls and flushes. Additionally, it examines the trade-offs between clock rate and instructions per cycle (IPC), emphasizing the importance of pipeline depth and overhead in optimizing performance.

Readings in Pipelining

H+P
• Appendix A (except for A.8)

Recent Research Papers
• "The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays", Hrishikesh et al., ISCA 2002.
• "Power: A First Class Design Constraint", Mudge, IEEE Computer, April 2001. (not directly related to pipelining)

Basic Pipelining

• basic := single, in-order issue
• single issue := one instruction at a time (per stage)
• in-order issue := instructions (start to) execute in order
• next unit: multiple issue
• unit after that: out-of-order issue

• pipelining principles
• tradeoff: clock rate vs. IPC
• hazards: structural, data, control
• vanilla pipeline: single-cycle operations
  • structural hazards, RAW hazards, control hazards
• dealing with multi-cycle operations
  • more structural hazards, WAW hazards, precise state

© 2003 by Sorin, Roth, Hill, Wood and Sohi, Smith, Vijaykumar, Lipasti. ECE 252 / CPS 220 Lecture Notes: Pipelining

Pipelining

observe: instruction processing consists of N sequential stages
idea: overlap different instructions at different stages

non-pipelined:  inst0.1 inst0.2 inst0.3
                                        inst1.1 inst1.2 inst1.3
pipelined:      inst0.1 inst0.2 inst0.3
                        inst1.1 inst1.2 inst1.3

+ increase resource utilization: fewer stages sitting idle
+ increase completion rate (throughput): up to 1 in 1/N time
• almost every processor built since 1970 is pipelined
• first pipelined processor: IBM Stretch [1962]

Without Pipelining

[single-cycle datapath figure: PC, +4, nPC, I$, regfile, D$; stages F D X M W]

• 5 parts of instruction execution
• fetch (F, IF): fetch instruction from I$
• decode (D, ID): decode instruction, read input registers
• execute (X, EX): ALU, load/store address, branch outcome
• memory access (M, MEM): load/store to D$/DTLB
• writeback (W, WB): write results (from ALU or ld) back to register file

Simple 5-Stage Pipeline

[pipelined datapath figure: PC, +4, I$, regfile, D$; pipeline registers F/D, D/X, X/M, M/W between stages F D X M W]

• 5 stages (pipeline depth is 5)
• fetch (F, IF): fetch instruction from I$
• decode (D, ID): decode instruction, read input registers
• execute (X, EX): ALU, load/store address, branch outcome
• memory access (M, MEM): load/store to D$/DTLB
• writeback (W, WB): write results (from ALU or ld) back to register file
• stages divided by pipeline registers/latches

Pipeline Registers (Latches)

• contain info for controlling flow of instructions through pipe
• PC: PC
• F/D: PC, undecoded instruction
• D/X: PC, opcode, regfile[rs1], regfile[rs2], immed, rd
• X/M: opcode (why?), regfile[rs1], ALUOUT, rd
• M/W: ALUOUT, MEMOUT, rd


Pipeline Diagram

       1  2  3  4  5  6  7  8   ⇐ cycles
inst0  F  D  X  M  W
inst1     F  D  X  M  W
inst2        F  D  X  M  W
inst3           F  D  X  M  W

Compared to non-pipelined case:
• Better throughput: an instruction finishes every cycle
• Same latency per instruction: each still takes 5 cycles

Principles of Pipelining

let: instruction execution require N stages, each takes tn time
• un-pipelined processor
  • single-instruction latency T = Σtn
  • throughput = 1/T = 1/Σtn
  • M-instruction latency = M*T (M>>1)
• now: N-stage pipeline
  • single-instruction latency T = Σtn (same as unpipelined)
  • throughput = 1/max(tn) <= N/T (max(tn) is the bottleneck)
    if all tn are equal (i.e., max(tn) = T/N), then throughput = N/T
  • M-instruction latency (M >> 1) = M*max(tn) <= M*T/N
  • speedup <= N
• can we choose N to get arbitrary speedup?
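The latency/throughput relations above can be sketched directly; this is a small illustration of my own (not from the notes), with made-up stage delays:

```python
# Latency and throughput of an N-stage pipeline, per the formulas above.
def pipeline_metrics(stage_times, m=1000):
    """stage_times: list of per-stage delays t_n for an N-stage pipeline."""
    t = sum(stage_times)                       # single-instruction latency T
    unpiped_throughput = 1.0 / t               # un-pipelined: 1/T
    piped_throughput = 1.0 / max(stage_times)  # bottleneck stage limits rate
    m_latency = m * max(stage_times)           # M-instruction latency, M >> 1
    speedup = piped_throughput / unpiped_throughput  # <= N
    return t, piped_throughput, m_latency, speedup

# Balanced 5-stage pipeline: each t_n = T/N, so speedup = N = 5.
t, thr, ml, sp = pipeline_metrics([2, 2, 2, 2, 2])
assert sp == 5.0

# Unbalanced pipeline: the slowest stage caps throughput, so speedup < N.
t, thr, ml, sp = pipeline_metrics([1, 1, 4, 1, 1])
assert sp == 2.0   # T = 8, max(tn) = 4, speedup = 8/4
```

The second case shows why stages are balanced in practice: one slow stage drags speedup well below N.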

Wrong (part I): Pipeline Overhead

V := oVerhead delay per pipe stage
• cause #1: latch overhead
  • pipeline registers take time
• cause #2: clock/data skew

so, for an N-stage pipeline with overheads
• single-instruction latency T = Σ(V + tn) = N*V + Σtn
• throughput = 1/(max(tn) + V) <= N/T (and <= 1/V)
• M-instruction latency = M*(max(tn) + V) <= M*V + M*T/N
• speedup = T/(V + max(tn)) <= N

Overhead limits throughput, speedup & useful pipeline depth

Wrong (part II): Hazards

hazards: conditions that lead to incorrect behavior if not fixed
• structural: two instructions use same h/w in same cycle
• data: two instructions use same data (register/memory)
• control: one instruction affects which instruction is next

• hazards ⇒ stalls (sometimes)
• stall: instruction stays in same stage for more than one cycle
• what if average stall per instruction = S stages?
  • latency' ⇒ T(N+S)/N = ((N+S)/N)*latency > latency
  • throughput' ⇒ N²/(T(N+S)) = (N/(N+S))*throughput < throughput
  • M_latency' ⇒ M*T(N+S)/N² = ((N+S)/N)*M_latency > M_latency
  • speedup' ⇒ N²/(N+S) = (N/(N+S))*speedup < speedup
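Combining the two corrections above into one number, here is a sketch (my own, assuming perfectly balanced stages tn = T/N) of how overhead V and average stall S scale the ideal speedup:

```python
def effective_speedup(T, N, V=0.0, S=0.0):
    """Speedup over an un-pipelined machine with latency T.
    T: total logic delay, N: stages (balanced), V: per-stage overhead,
    S: average stall cycles per instruction (the notes' simple model)."""
    cycle = T / N + V            # clock period = bottleneck stage + overhead
    ideal = T / cycle            # speedup with no stalls, <= N
    return ideal * N / (N + S)   # stalls scale throughput by N/(N+S)

# With no overhead and no stalls, speedup is exactly N.
assert effective_speedup(T=80, N=10) == 10.0
# Overhead caps speedup below N even as N grows: the limit is T/V.
assert effective_speedup(T=80, N=80, V=1.0) == 40.0
```

Pushing N past T/V buys almost nothing: the clock period can never drop below V, which is exactly the "useful pipeline depth" limit stated above.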


Pipelining: Clock Rate vs. IPC

deeper pipeline (more stages, larger N)
+ increases clock rate
– decreases IPC (longer stalls for hazards - will see later)
• ultimate metric is execution rate = clock rate * IPC
  • (clock cycles / unit real time) * (instructions / clock cycle)
  • number of instructions is fixed, for purposes of this discussion
• how does pipeline overhead factor in?

to think about this, parameterize the clock cycle
• basic time unit is the gate-delay (time to go through a gate)
• e.g., 80 gate-delays to process (fetch, decode, ...) an instruction
• let's look at an example ...

Clock Rate vs. IPC Example

• G: gate-delays to process an instruction
• V: gate-delays of overhead per stage
• S: average stall (cycles) per instruction per pipe stage
  – overly simplistic model for stalls
• compute optimal N (depth) given G, V, S [Smith+Pleszkun]
• IPC = 1/(1 + S*N)
• clock rate (in 1/gate-delays) = 1/(gate-delays/stage) = 1/(G/N + V)
• e.g., G = 80, S = 0.16, V = 1

N    IPC := 1/(1+0.16*N)    clock := 1/(80/N+1)    execution rate
10   0.38                   0.11                   0.042
20   0.24                   0.20                   0.048
30   0.17                   0.27                   0.046
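The table above is easy to reproduce; this sketch (mine, using the slide's G = 80, V = 1, S = 0.16) also searches for the depth that maximizes execution rate:

```python
# Execution rate = IPC * clock rate, per the model above.
def execution_rate(n, g=80, v=1.0, s=0.16):
    ipc = 1.0 / (1.0 + s * n)      # stalls grow with depth
    clock = 1.0 / (g / n + v)      # clock rate in 1/gate-delays
    return ipc * clock

# The table's three depths: N = 20 beats both 10 and 30.
assert execution_rate(20) > execution_rate(10)
assert execution_rate(20) > execution_rate(30)

# Sweep depths to find the optimum for these parameters.
best = max(range(1, 61), key=execution_rate)
assert best == 22   # optimum sits between the table's N = 20 and N = 30
```

The search confirms the table's shape: execution rate climbs, peaks in the low twenties, then falls as stalls eat the clock-rate gains.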

Pipeline Depth Upshot

trend is for deeper pipelines (more stages)
• why? faster clock (higher frequency)
• clock period = f(transistor latency, gate-delays per pipe stage)
• superpipelining: add more stages to reduce gate-delays/pipe-stage
• but increased frequency may not mean increased performance...
• who cares? we can sell frequency!
• e.g., Intel IA-32 pipelines
  • 486: 5 stages (50+ gate-delays per clock period)
  • Pentium: 7 stages
  • Pentium II/III: 12 stages
  • Pentium 4: 22 stages (10 gate-delays per clock)
• Gotcha! 800MHz Pentium III performs better than 1GHz Pentium 4

Managing the Pipeline

to resolve hazards, need fine pipe-stage control
• play with pipeline registers to control pipe flow
• trick #1: the stall (or the bubble)
  • effect: stops SOME instructions in current pipe-stages
  • use: make younger instructions wait for older ones to complete
  • implementation: de-assert write-enable signals to pipeline registers
• trick #2: the flush
  • effect: clears instructions out of current pipe-stages
  • use: undoes speculative work that was incorrect (see later)
  • implementation: assert clear signals on pipeline registers
• stalls & flushes must be propagated upstream (why?)


Structural Hazards

two different instructions need same h/w resource in same cycle
• e.g., loads/stores use the same cache port as fetch
• assume unified L1 cache (for this example)

       1  2  3  4  5  6  7  8
load   F  D  X  M  W
inst2     F  D  X  M  W
inst3        F  D  X  M  W
inst4           F  D  X  M  W   ⇐ inst4's F conflicts with load's M in cycle 4

Fixing Structural Hazards

• fix structural hazard by stalling (s* = structural stall)
+ low cost, simple
– decreases IPC
• used rarely
• Q: which one to stall, inst4 or load?
  • always safe to stall younger instruction (why?)...
  • ...but may not be the best thing to do performance-wise (why?)

       1  2  3  4  5  6  7  8  9
load   F  D  X  M  W
inst2     F  D  X  M  W
inst3        F  D  X  M  W
inst4           s* F  D  X  M  W

Avoiding Structural Hazards

• option #1: replicate the contended resource
  + good performance
  – increased area, slower (interconnect delay)?
  • use for cheap, divisible, or highly-contended resources (e.g., I$/D$)
• option #2: pipeline the contended resource
  + good performance, low area
  – sometimes complex (e.g., RAM)
  • useful for multicycle resources
• option #3: design ISA/pipeline to reduce structural hazards
  • key 1: each instruction uses a given resource at most once
  • key 2: each instruction uses a given resource in same pipeline stage
  • key 3: each instruction uses a given resource for one cycle
  • this is why we force ALU operations to go thru MEM stage

Data Hazards

two different instructions use the same storage location
• we must preserve the illusion of sequential execution

add R1, R2, R3      add R1, R2, R3      add R1, R2, R3
sub R2, R4, R1      sub R2, R4, R1      sub R2, R4, R1
or  R1, R6, R3      or  R1, R6, R3      or  R1, R6, R3

read-after-write    write-after-read    write-after-write
(RAW)               (WAR)               (WAW)
true dependence     anti-dependence     output dependence
(real)              (artificial)        (artificial)

Q: What about read-after-read dependences? (RAR)
© 2003 by Sorin, Roth, Hill, Wood, ECE 252 / CPS 220 Lecture Notes 17 © 2003 by Sorin, Roth, Hill, Wood, ECE 252 / CPS 220 Lecture Notes 18
Sohi, Smith, Vijaykumar, Lipasti Pipelining Sohi, Smith, Vijaykumar, Lipasti Pipelining

RAW

read-after-write (RAW) = true dependence (dataflow)

add R1, R2, R3
sub R2, R4, R1
or  R1, R6, R3

• problem: sub reads R1 before add has written it
• Pipelining enables this overlapping to occur
• But this violates sequential execution semantics!
• Recall: user just sees ISA and expects sequential execution

RAW: Detect and Stall

detect RAW and stall instruction at ID before it reads registers
• mechanics? disable PC, F/D write
• RAW detection? compare register names
  • notation: rs1(D) := source register #1 of instruction in D stage
  • compare rs1(D) and rs2(D) with rd(D/X), rd(X/M), rd(M/W)
  • stall (disable PC + F/D, clear D/X) on any match
• RAW detection? register busy-bits
  • set for rd(D/X) when instruction passes ID
  • clear for rd(M/W)
  • stall if rs1(D) or rs2(D) are "busy"
+ low cost, simple
– low performance (many stalls)
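The compare-based detection above can be sketched as a toy function (mine, not the notes' hardware): stall the instruction in D if either source matches a destination still in flight in D/X, X/M, or M/W.

```python
# Compare-based RAW detection for the detect-and-stall scheme above.
def must_stall(rs1, rs2, in_flight_rds):
    """in_flight_rds: destination registers of instructions in D/X, X/M, M/W.
    None entries mean a bubble or an instruction with no register result."""
    return any(rd is not None and rd in (rs1, rs2) for rd in in_flight_rds)

# add R1,R2,R3 is in D/X (rd = 1); sub R2,R4,R1 in D reads R4 and R1: stall.
assert must_stall(rs1=4, rs2=1, in_flight_rds=[1, None, None]) is True
# An independent instruction (or R5,R6,R7) does not stall.
assert must_stall(rs1=6, rs2=7, in_flight_rds=[1, None, None]) is False
```

The busy-bit variant keeps one bit per register instead of doing three comparisons per source, but stalls in exactly the same cases.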
Two Stall Timings

depend on how ID and WB stages share the register file
• each gets register file for half a cycle
• 1st half ID reads, 2nd half WB writes ⇒ 3 cycle stall

               1  2  3  4  5  6  7  8  9
add R1,R2,R3   F  D  X  M  W
sub R2,R4,R1      F  d* d* d* D  X  M  W
load R5,R6,R7        p* p* p* F  D  X  M

• 1st half WB writes, 2nd half ID reads ⇒ 2 cycle stall

               1  2  3  4  5  6  7  8  9
add R1,R2,R3   F  D  X  M  W
sub R2,R4,R1      F  d* d* D  X  M  W

Stall Signal Example (2nd Timing)

RAW: add r4,r2,r1 depends on load r2,0(r3); pipe snapshots show the
instruction held in each latch (PC | F/D | D/X | X/M | M/W)

c1: load r6,0(r4) | add r4,r2,r1 | load r2,0(r3) | add r5,r5,#4 | call func
    rs1(D) == rd(D/X) ⇒ stall: write-disable PC and F/D, clear D/X

c2: load r6,0(r4) | add r4,r2,r1 | bubble | load r2,0(r3) | add r5,r5,#4
    rs1(D) == rd(X/M) ⇒ stall: write-disable PC and F/D, clear D/X

c3: load r6,0(r4) | add r4,r2,r1 | bubble | bubble | load r2,0(r3)
    load r2,0(r3) now writes in WB's 1st half, D reads in 2nd half ⇒ go

c4: sub r6,r6,#1 | load r6,0(r4) | add r4,r2,r1 | bubble | bubble
    pipe moves again

Reducing RAW Stalls: Bypassing

[datapath figure: bypass paths from the X/M and M/W latches back to the ALU inputs]

why wait until WB stage? data available at end of EX/MEM stage
• bypass (aka "forward") data directly to input of EX
+ very effective at reducing/avoiding stalls
• in practice, a large fraction of input operands are bypassed (why?)
– complex
• does not relieve you from having to perform WB

Implementing Bypassing

• first, detect bypass opportunity
  • tag compares in D/X latch
  • similar to but separate from stall logic in F/D latch
• then, control bypass MUX
  • if rs2(X) == rd(X/M) then ALUOUT(M)
  • else if rs2(X) == rd(M/W) then ALUOUT(W)
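The MUX priority above (youngest in-flight value wins) can be written out as a small sketch of my own, not the notes' hardware:

```python
# Bypass-MUX selection for one ALU source operand, per the priority above.
def bypass_select(rs, rd_xm, rd_mw, regfile_val, aluout_m, aluout_w):
    """Pick the value feeding the ALU for source register rs."""
    if rd_xm is not None and rs == rd_xm:
        return aluout_m          # forward from end of EX (X/M latch)
    if rd_mw is not None and rs == rd_mw:
        return aluout_w          # forward from the M/W latch
    return regfile_val           # no match: use the register-file read

# add R1,... sits in X/M (rd = 1, result 42); stale R1 in regfile is 7.
assert bypass_select(1, rd_xm=1, rd_mw=None, regfile_val=7,
                     aluout_m=42, aluout_w=0) == 42
```

Checking X/M before M/W matters: if both latches hold a write to the same register, the X/M one is younger and holds the value sequential semantics demands.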
Pipeline Diagrams with Bypassing

example 1
               1  2  3  4  5  6  7
add R1,R5,R3   F  D  X  M  W
sub R2,R4,R1      F  D  X  M  W

example 2
               1  2  3  4  5  6  7
load R1,24(R5) F  D  X  M  W
add R3,R6,R7      F  D  X  M  W
sub R2,R4,R1         F  D  X  M  W

• even with full bypassing, not all RAW stalls can be avoided
• example: load to ALU in consecutive cycles

example 3
               1  2  3  4  5  6  7
load R1,24(R5) F  D  X  M  W
sub R2,R4,R1      F  D  d* X  M  W

Pipeline Scheduling

compiler schedules (moves) instructions to reduce stall
• eliminate back-to-back load-ALU scenarios
• example code sequence: a = b + c; d = e - f

before                          after
load R2, b                      load R2, b
load R3, c                      load R3, c
add R1, R2, R3   // stall       load R5, e
store R1, a                     add R1, R2, R3   // no stall
load R5, e                      load R6, f
load R6, f                      store R1, a
sub R4, R5, R6   // stall       sub R4, R5, R6   // no stall
store R4, d                     store R4, d


WAR: Write After Read

write-after-read (WAR) = artificial (name) dependence

add R1, R2, R3
sub R2, R4, R1
or  R1, R6, R3

• problem: add could use wrong value for R2
• can't happen in vanilla pipeline (reads in ID, writes in WB)
• can happen if: early writes (e.g., auto-increment) + late reads (??)
• can happen if: out-of-order reads (e.g., out-of-order execution)
• artificial: using different output register for sub would solve
• The dependence is on the name R2, but not on actual dataflow

WAW: Write After Write

write-after-write (WAW) = artificial (name) dependence

add R1, R2, R3
sub R2, R4, R1
or  R1, R6, R3

• problem: reordering could leave wrong value in R1
  • later instruction that reads R1 would get wrong value
• can't happen in vanilla pipeline (register writes are in order)
  • another reason for making ALU ops go through MEM stage
• can happen: multi-cycle operations (e.g., FP, cache misses)
• artificial: using different output register for or would solve
• Also a dependence on a name: R1

RAR: Read After Read

read-after-read (RAR)

add R1, R2, R3
sub R2, R4, R1
or  R1, R6, R3

• no problem: R3 is correct even with reordering

Memory Data Hazards

have seen register hazards, can also have memory hazards

RAW                 WAR                 WAW
store R1,0(SP)      load R4,0(SP)       store R1,0(SP)
load R4,0(SP)       store R1,0(SP)      store R4,0(SP)

                1  2  3  4  5  6
store R1,0(SP)  F  D  X  M  W
load R1,0(SP)      F  D  X  M  W

• in simple pipeline, memory hazards are easy
  • in-order
  • one at a time
  • read & write in same stage
• in general, though, more difficult than register hazards

Hazards vs. Dependences

dependence: fixed property of instruction stream (i.e., program)
hazard: property of program and processor organization
• implies potential for executing things in wrong order
• potential only exists if instructions can be simultaneously "in-flight"
• property of dynamic distance between instrs vs. pipeline depth

For example, can have RAW dependence with or without hazard
• depends on pipeline

Control Hazards

when an instruction affects which instruction executes next

store R4,0(R5)
bne R2,R3,loop
sub R1,R6,R3

• naive solution: stall until outcome is available (end of EX)
+ simple
– low performance (2 cycles here, longer in general)
• e.g., 15% branches * 2 cycle stall ⇒ 30% CPI increase!

                1  2  3  4  5  6  7  8  9
store R4,0(R5)  F  D  X  M  W
bne R2,R3,loop     F  D  X  M  W
??                    c* c* F  D  X  M  W

Control Hazards: "Fast" Branches

fast branches: can be evaluated in ID (rather than EX)
+ reduce stall from 2 cycles to 1

                1  2  3  4  5  6  7  8
sw R4,0(R5)     F  D  X  M  W
bne R2,R3,loop     F  D  X  M  W
??                    c* F  D  X  M  W

– requires more hardware
  • dedicated ID adder for (PC + immediate) targets
– requires simple branch instructions
  • no time to compare two registers (would need full ALU)
  • comparisons with 0 are fast (beqz, bnez)

Control Hazards: Delayed Branches

delayed branch: execute next instruction whether taken or not
• instruction after branch said to be in "delay slot"
• old microcode trick stolen by RISC (MIPS)

store R4,0(R5)          bned R2,R3,loop
bne R2,R3,loop    ⇒     store R4,0(R5)
sub R1,R6,R6            sub R1,R6,R6

                 1  2  3  4  5  6  7  8
bned R2,R3,loop  F  D  X  M  W
store R4,0(R5)      F  D  X  M  W
sub R1,R6,R6           c* F  D  X  M  W


What To Put In Delay Slot?

• instruction from before branch
  • when? if branch and instruction are independent
  • helps? always
• instruction from target (taken) path
  • when? if safe to execute, but may have to duplicate code
  • helps? on taken branch, but may increase code size
• instruction from fall-through (not-taken) path
  • when? if safe to execute
  • helps? on not-taken branch
• upshot: short-sighted ISA feature
  – not a big win for today's machines (why? consider pipeline depth)
  – complicates interrupt handling (later)

Control Hazards: Speculative Execution

idea: doing anything is better than waiting around doing nothing
• speculative execution
  • guess branch target ⇒ start executing at guessed position
  • execute branch ⇒ verify (check) guess
+ minimize penalty if guess is right (to zero?)
– wrong guess could be worse than not guessing
• branch prediction: guessing the branch
  • one of the "important" problems in computer architecture
  • very heavily researched area in last 15 years
• static: prediction by compiler
• dynamic: prediction by hardware
• hybrid: compiler hints to hardware predictor

The Speculation Game

speculation: engagement in risky business transactions on the
chance of quick or considerable profit
• speculative execution (control speculation)
  • execute before all parameters known with certainty
+ correct speculation
  + avoid stall/get result early, performance improves
– incorrect speculation (mis-speculation)
  – must abort/squash incorrect instructions
  – must undo incorrect changes (recover pre-speculation state)
• the speculation game: profit > penalty
  • profit = speculation accuracy * correct-speculation gain
  • penalty = (1 – speculation accuracy) * mis-speculation penalty

Speculative Execution Scenarios

• correct speculation
           1  2  3  4  5
inst0/B    F  D  X  M  W
inst8         F  D  X  M
inst9            F  D  X
inst10              F  D
  • cycle 1: fetch branch, predict next (inst8)
  • c2, c3: fetch inst8, inst9
  • c3: execute/verify branch ⇒ correct
  • nothing needs to be fixed or changed

• incorrect speculation: mis-speculation
           1  2  3  4  5
inst0/B    F  D  X  M  W
inst1         F  D
inst2            F
inst8               F  D   ⇐ verify/flush at end of c3
  • c1: fetch branch, predict next (inst1)
  • c2, c3: fetch inst1, inst2
  • c3: execute/verify branch ⇒ wrong
  • c3: send correct target to IF (inst8)
  • c3: squash (abort) inst1, inst2 (flush F/D)
  • c4: fetch inst8
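The profit-vs-penalty condition above reduces to a one-line break-even test; this sketch is mine, and the accuracy/penalty numbers are made-up examples, not measurements from the notes:

```python
# The "speculation game": speculate only if expected profit beats
# expected penalty, per the two formulas above.
def speculation_wins(accuracy, gain, mispredict_penalty):
    profit = accuracy * gain                       # cycles saved when right
    penalty = (1.0 - accuracy) * mispredict_penalty  # cycles lost when wrong
    return profit > penalty

# A 90%-accurate predictor saving 2 cycles vs. a 3-cycle flush: worth it.
assert speculation_wins(0.90, gain=2, mispredict_penalty=3) is True
# A coin-flip predictor with the same penalty is not.
assert speculation_wins(0.50, gain=2, mispredict_penalty=3) is False
```

This is why mis-speculation penalty matters as much as accuracy: deep pipelines raise the penalty, so they need much better predictors to keep winning the game.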


Static (Compiler) Branch Prediction

Some static prediction options
• predict always not-taken
  + very simple, since we already know the target (PC+4)
  – most branches (~65%) are taken (why?)
• predict always taken
  + better performance
  – more difficult, must know target before branch is decoded
• predict backward taken
  • most backward branches are taken
• predict specific opcodes
• use profiles to predict on per-static-branch basis
  • pretty good

Comparison of Some Static Schemes

CPI-penalty = %branch * [(%T * penaltyT) + (%NT * penaltyNT)]
• simple branch statistics
  • 14% PC-changing instructions ("branches")
  • 65% of PC-changing instructions are "taken"

scheme           penaltyT   penaltyNT   CPI penalty
stall            2          2           0.28
fast branch      1          1           0.14
delayed branch   1.5        1.5         0.21
not-taken        2          0           0.18
taken            0          2           0.10
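The comparison table can be checked directly from the CPI-penalty formula; this sketch (mine) plugs in the slide's statistics of 14% branches and 65% taken:

```python
# CPI penalty of a static scheme, per the formula above.
def cpi_penalty(pen_taken, pen_not_taken, branch_frac=0.14, taken_frac=0.65):
    return branch_frac * (taken_frac * pen_taken +
                          (1 - taken_frac) * pen_not_taken)

assert round(cpi_penalty(2, 2), 2) == 0.28    # stall until end of EX
assert round(cpi_penalty(1, 1), 2) == 0.14    # fast branch (resolve in ID)
assert round(cpi_penalty(2, 0), 2) == 0.18    # predict not-taken
assert round(cpi_penalty(0, 2), 2) == 0.10    # predict taken
```

Predict-taken wins here purely because 65% of branches are taken; the formula makes that asymmetry explicit.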

Dynamic Branch Prediction

[pipelined datapath figure: branch predictor (BP) sits beside I$ and feeds the fetch stage]

hardware (BP) guesses whether and where a branch will go

0x64 bnez r1,#10
0x74 add r3,r2,r1

• start with branch PC (0x64) and produce
  • direction (Taken)
  • direction + target PC (0x74)
  • direction + target PC + target instruction (add r3,r2,r1)

Branch History Table (BHT)

branch PC ⇒ prediction (T, NT)

[figure: low bits of branch PC index a table of 1-bit entries, producing T/N]

– need decoder/adder to compute target if taken
• branch history table (BHT)
  • read prediction with least significant bits (LSBs) of branch PC
  • change bit on misprediction
+ simple
– multiple PCs may map to same bit (aliasing)
• major improvements
  • two-bit counters [Smith]
  • correlating/two-level predictors [Patt]
  • hybrid predictors [McFarling]


Improvement: Two-bit Counters

example: 4-iteration inner loop branch

state/prediction  N T T T N T T T N T T T
branch outcome    T T T N T T T N T T T N
mis-prediction?   *     * *     * *     *

– problem: two mis-predictions per loop
• solution: 2-bit saturating counter to implement hysteresis
  • 4 states: strong/weak not-taken (N/n), strong/weak taken (T/t)
  • transitions: N ⇔ n ⇔ t ⇔ T

state/prediction  n t T T t T T T t T T T
branch outcome    T T T N T T T N T T T N
mis-prediction?   *     *       *       *

+ only one mis-prediction per iteration

Improvement: Correlating Predictors

different branches may be correlated
• outcome of branch depends on outcome of other branches
• makes intuitive sense (programs are written this way)
• e.g., if the first two conditions are true, then third is false

if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) { . . . }

revelation: prediction = f(branch PC, recent branch outcomes)
• revolution: BP accuracies increased dramatically
• lots of research in designing that function for best BP
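The 2-bit counter above simulates in a few lines; this sketch is mine, encoding states N, n, t, T as counter values 0..3 and replaying the slide's 4-iteration loop branch:

```python
# One 2-bit saturating counter: predict taken when counter >= 2.
def predict_and_update(counter, outcome_taken):
    prediction = counter >= 2
    if outcome_taken:
        counter = min(counter + 1, 3)   # saturate at strong taken (T)
    else:
        counter = max(counter - 1, 0)   # saturate at strong not-taken (N)
    return prediction, counter

# Loop branch pattern T T T N, three loop executions, start at weak
# not-taken (n = 1) as in the table above.
c, misses = 1, 0
for outcome in [True, True, True, False] * 3:
    pred, c = predict_and_update(c, outcome)
    misses += (pred != outcome)
assert misses == 4   # warm-up miss + one per loop-exit, matching the table
```

Compare with the 1-bit table above, which misses twice per loop: the hysteresis absorbs the single not-taken loop exit without forgetting that the branch is usually taken.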
Correlating (Two-Level) Predictors

• branch history shift register (BHR) holds recent outcomes
• combination of PC and BHR accesses BHT
• basically, multiple predictions per branch, choose based on history

[figure: branch PC and BHR combined by a function f to index the BHT, producing T/N]

design space
• number of BHRs
  • multiple BHRs ("local", Intel)
  • 1 global BHR ("global", everyone else)
• PC/BHR overlap
  • full, partial, none (concatenated?)
• popular design: Gshare [McFarling]
  • 1 global BHR, full overlap, f = XOR

Correlating Predictor Example

• example with alternating T,N (1-bit BHT, no correlation)

state/prediction  N T N T N T N T N T N T
branch outcome    T N T N T N T N T N T N
mis-prediction?   * * * * * * * * * * * *

• add 1 1-bit BHR, concatenate with PC
  • effectively, two predictors per PC; the last outcome selects the active entry

BHR=N entry   N T T T T T T T T T T T
BHR=T entry   N N N N N N N N N N N N
branch outcome  T N T N T N T N T N T N
mis-prediction? *
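A gshare-style index (global BHR XOR branch-PC bits, per the design above) can be sketched as follows. This is my own illustration with assumed parameters: a 1024-entry table of 2-bit counters and a 10-bit history register.

```python
TABLE_BITS = 10
table = [1] * (1 << TABLE_BITS)   # 2-bit counters, start at weak not-taken
ghr = 0                           # global branch history register

def gshare_predict(pc):
    # XOR history into the PC index bits; drop the 2 byte-offset bits.
    idx = ((pc >> 2) ^ ghr) & ((1 << TABLE_BITS) - 1)
    return idx, table[idx] >= 2   # predict taken if counter >= 2

def gshare_update(idx, taken):
    global ghr
    table[idx] = min(table[idx] + 1, 3) if taken else max(table[idx] - 1, 0)
    ghr = ((ghr << 1) | int(taken)) & ((1 << TABLE_BITS) - 1)

idx, pred = gshare_predict(0x400)
assert pred is False              # untrained entry: weak not-taken
gshare_update(idx, taken=True)    # two taken outcomes saturate the entry
gshare_update(idx, taken=True)
assert table[idx] == 3            # strong taken
```

Because the history register is folded into the index, the same static branch trains different counters after different outcome histories, which is exactly the "two predictors per PC" effect in the example above.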


Hybrid/Competitive/Tournament Predictors

observation: different schemes work better for different branches
idea: multiple predictors, choose on per static-branch basis

mechanics
• two (or more) predictors
• chooser

[figure: branch PC (and BHR via f) feed predictor 1, predictor 2, and the chooser; the chooser selects which prediction is used]

• if chosen predictor is wrong...
• ...and other is right...
• ...flip chooser
• popular design: Gselect [McFarling]
  • Gshare + 2-bit saturating counter

Branch Target Buffer (BTB)

branch PC ⇒ target PC
• target PC available at end of IF stage
+ no bubble for correct predictions
• branch target buffer (BTB)
  • index: branch PC
  • data: target PC (+ T/NT?)
  • tags: branch PC (why are tags needed here and not in BHT?)
– many more bits per entry than BHT
• considerations: combine with I-cache? store not-taken branches?
• branch target cache (BTC)
  • data: target PC + target instruction(s)
  • enables "branch folding" optimization (branch removed from pipe)

Jump Prediction

exploit behavior of different kinds of jumps to improve prediction
• function returns
  • use hardware return address stack (RAS)
  • call pushes return address on top of RAS
  • for return, predict address at top of RAS and pop
  – trouble: must manage speculatively
• indirect jumps (switches, virtual functions)
  • more than one taken target per jump
  • path-based BTB [Driesen+Holzle]

Branch Issues

issue 1: how do we know at IF which instructions are branches?
• BTB: don't need to "know"
• check every instruction: BTB entry ⇒ instruction is a branch

issue 2: BHR (RAS) depend on branch (call) history
• when are these updated?
  • at WB is too late (if another branch is in-flight)
  • at IF (after prediction)
• must be able to recover BHR (RAS) on mis-speculation (nasty)
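The RAS mechanism above (push on call, pop to predict a return) is simple enough to sketch; this toy model is mine, assuming 4-byte instructions and a small fixed-depth stack that overwrites its oldest entry when full:

```python
# Toy return address stack: calls push the fall-through PC, returns
# predict by popping the top of stack.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def on_call(self, call_pc, inst_bytes=4):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # full: discard the oldest entry
        self.stack.append(call_pc + inst_bytes)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x100)                       # call at 0x100: return to 0x104
ras.on_call(0x200)                       # nested call: return to 0x204
assert ras.predict_return() == 0x204     # inner return predicted first
assert ras.predict_return() == 0x104
```

The "must manage speculatively" trouble is visible here: if the pushes and pops happen at fetch time and a branch mis-speculates, the stack contents must be restored too.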


Adding Multi-Cycle Operations

RISC tenet #1: "single-cycle operations"
• why was this such a big deal?
• fact: not all operations complete in 1 cycle
  • FP add, int/FP multiply: 2–4 cycles, int/FP divide: 20–50 cycles
  • data cache misses: 10–150 cycles!
• slow clock cycle down to slowest operation?
  – can't without incurring huge performance loss
• solution: extend pipeline - add pipeline stages to EX

Extended Pipeline

[figure: F and D feed an integer pipeline (X, M, W with int RF and D$) plus parallel FP units (E+, E*, E/) with a separate FP RF]

• separate integer/FP, pipe register files
• loads/stores in integer pipeline only (why?)
• additional, parallel functional units
  • E+: FP adder (2 cycles, pipelined)
  • E*: FP/integer multiplier (4 cycles, pipelined)
  • E/: FP/integer divider (20 cycles, not pipelined)

Multi-Cycle Example

                1  2  3  4  5  6  7  8  9  10
divf f0,f1,f2   F  D  E/ E/ E/ E/ W
mulf f0,f3,f4      F  D  E* E* W
addf f5,f6,f7         F  D  E+ E+ W
subf f8,f6,f7            F  D  *  E+ E+ W
mulf f9,f8,f7               F  D  *  *  E* E*

• write-after-write (WAW) hazards
• register write port structural hazards
• functional unit structural hazards
• elongated read-after-write (RAW) hazards

Another Multi-Cycle Example

example: SAXPY (math kernel)
Z[i] = A*X[i] + Y[i]  // single precision

                1  2  3  4  5  6  7  8  9  10
ldf f2,0(r1)    F  D  X  M  W
mulf f6,f0,f2      F  D  d* E* E* E* E* W
ldf f4,0(r2)          F  p* D  X  M  W
addf f8,f6,f4            F  D  d* d* E+ E+ W
stf f8,0(r3)                F  p* p* D  X  M  W
add r1,r1,#4                   F  D  X  M  W
add r2,r2,#4                      F  D  X  M  W
add r3,r3,#4                         F  D  X  M  W


Register Write Port Structural Hazards

where are these resolved?
• multiple writeback ports?
  – not a good idea (why not?)
• in ID?
  • reserve writeback slot in ID (writeback reservation bits)
  + simple, keeps stall logic localized to ID stage
  – won't work for cache misses (why not?)
• in MEM?
  + works for cache misses, better utilization
  – two stall controls (F/D and M/W) must be synchronized
• in general: cache misses are hard
  • don't know early enough (in ID) whether they will happen

WAW Hazards

how are these dealt with?
• stall younger instruction writeback?
  + intuitive, simpler
  – lower performance (cascading writeback structural hazards)
• abort (don't do) older instruction writeback?
  + no performance loss
  – but what if intermediate instruction causes an interrupt (next)

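The "reserve writeback slot in ID" idea can be sketched as a small scheduler. This is a sketch under stated assumptions, not the hardware: the structure (a set of reserved WB cycles checked during decode) follows the slides' "writeback reservation bits", while the function name and the fixed fetch-then-decode timing are illustrative.

```python
def schedule_decodes(ex_latencies):
    """ex_latencies[i] = EX cycles of instruction i, in program order.
    Returns the cycle in which each instruction occupies D (decode)."""
    reserved = set()       # WB cycles whose write port is already claimed
    decode_cycles = []
    d = 2                  # instruction 0 decodes in cycle 2 (F in cycle 1)
    for lat in ex_latencies:
        while d + lat + 1 in reserved:   # WB cycle = D + EX latency + 1
            d += 1                       # stall in ID until the slot is free
        reserved.add(d + lat + 1)        # claim the write port for that cycle
        decode_cycles.append(d)
        d += 1                           # next instruction decodes later
    return decode_cycles

# divf (4 EX cycles), mulf (2), addf (2): addf would reach W in the same
# cycle as divf, so it stalls one cycle in ID
cycles = schedule_decodes([4, 2, 2])
```

This keeps all the stall logic in ID, which is the "+" the slide claims; the "–" is that a cache miss isn't known at decode time, so a reservation made in ID can be wrong.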
Dealing With Interrupts

interrupts (aka faults, exceptions, traps)
• e.g., arithmetic overflow, divide by zero, protection violation
• e.g., I/O device request, OS call, page fault

classifying interrupts
• terminal (fatal) vs. restartable (control returned to program)
• synchronous (internal) vs. asynchronous (external)
• user vs. coerced
• maskable (ignorable) vs. non-maskable
• between instructions vs. within instruction

Precise Interrupts

“unobserved system can exist in any intermediate state, upon
observation system collapses to well-defined state”
  – 2nd postulate of quantum mechanics
• system ⇒ processor, observation ⇒ interrupt

what is the “well-defined” state?
• von Neumann: “sequential, instruction atomic execution”
• precise state at interrupt
  • all instructions older than interrupt are complete
  • all instructions younger than interrupt haven’t started
  • implies interrupts are taken in program order
• necessary for VM (why?), “highly recommended” by IEEE
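The precise-state condition above can be stated as a predicate over a snapshot of the machine. A minimal sketch: the two conditions come straight from the slides; the flag arrays and the function name are illustrative.

```python
def is_precise(done, started, fault_at):
    """done[i]/started[i]: per-instruction flags in program order;
    fault_at: index of the interrupted instruction."""
    older_complete = all(done[:fault_at])                 # all older finished
    younger_unstarted = not any(started[fault_at + 1:])   # no younger effects
    return older_complete and younger_unstarted

# precise: inst0, inst1 complete; the faulting inst2's younger neighbors
# haven't started
assert is_precise([True, True, False, False],
                  [True, True, True, False], fault_at=2)
# imprecise: inst3 (younger than the fault) has already started
assert not is_precise([True, True, False, False],
                      [True, True, True, True], fault_at=2)
```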
Interrupt Example: Data Page Fault

        1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17
inst0   F  D  X  M  W
inst1      F  D  X  M                                        page fault
inst2         F  D  X                                        flush EX, ID, IF
inst3            F  D
inst4               F
TRAP                   F  D  X  M  W                         inject TRAP instr
                                      (OS trap handler runs)
inst1                                       F  D  X  M  W    restart faulting instruction

• squash (effects of) younger instructions
• inject fake TRAP instruction into IF
• from here, like a SYSCALL

More Interrupts

• interrupts can occur at different stages
  • IF, MEM: page fault, misaligned data, protection violation
  • ID: illegal/privileged instruction
  • EX: arithmetic exception

        1  2  3  4  5  6  7  8  9
inst0   F  D  X  M  W                data page fault
inst1      F  D  X  M  W             instruction page fault
trap0         F  D  X  M  W
inst1            F  D  X  M

• too complicated to draw what goes on here
• c2: instruction page fault, flush inst1, inject TRAP
• c4: data page fault, flush inst0, inst1, TRAP
  – can get into an infinite loop here (with help of OS page placement)
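The squash-and-inject sequence above can be sketched on a list model of the in-flight instructions. A sketch only: the policy (older instructions drain, the faulting instruction and everything younger are discarded, a TRAP is injected) is from the slides; the function and data representation are illustrative.

```python
def squash_and_trap(pipe, fault_idx):
    """pipe: in-flight instructions, oldest first.
    fault_idx: position of the faulting instruction.
    Returns (new pipe contents, squashed instructions)."""
    survivors = pipe[:fault_idx]   # older instructions drain normally
    squashed = pipe[fault_idx:]    # faulting + younger: effects discarded
    return survivors + ["TRAP"], squashed

# inst1 takes a data page fault in MEM; inst2..inst4 are younger
new_pipe, squashed = squash_and_trap(
    ["inst0", "inst1", "inst2", "inst3", "inst4"], fault_idx=1)
# the OS handler runs behind TRAP, then inst1 is re-fetched and restarted
```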
Posted Interrupts

posted interrupts
• set interrupt bit when condition is raised
• check interrupt bit (potentially “take” interrupt) in WB
+ interrupts are taken in order
– longer latency, more complex

        1  2  3  4  5  6  7  8  9
inst0   F  D  X  M  W                data page fault
inst1      F  D  X  M  W             instruction page fault

• what happens now?
  • c2: set inst1 bit
  • c4: set inst0 bit
  • c5: take inst0 interrupt

Interrupts and Multi-Cycle Operations

              1  2  3  4  5  6  7  8  9  10 11
divf f0,f1,f2 F  D  E/ E/ E/ E/ W               div by 0 (posted)
mulf f3,f4,f5    F  D  E* E* W
addf f6,f7,f8       F  D  E+ E+ s* W

multi-cycle operations + precise state = trouble
• #1: how to undo early writes?
  • e.g., must make it seem as if mulf hasn’t executed
  • undo writes: future file, history file -> ugly!
• #2: how to take interrupts in-order if WB is not in-order?
  • force in-order WB
  – slow
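The posted-interrupt mechanism can be sketched by attaching the pending bit to the instruction itself. A minimal sketch: the raise-then-check-in-WB split is from the slides; the class, method names, and cycle comments are illustrative.

```python
class Inst:
    """An in-flight instruction carrying a posted-interrupt bit."""
    def __init__(self, name):
        self.name = name
        self.posted = None            # pending interrupt cause, if any
    def post(self, cause):
        # condition raised in some stage: just record it, don't act yet
        self.posted = cause
    def writeback(self):
        # the bit is only checked when the instruction reaches WB
        if self.posted:
            return ("take", self.name, self.posted)
        return ("complete", self.name, None)

inst0, inst1 = Inst("inst0"), Inst("inst1")
inst1.post("instruction page fault")  # c2: inst1's fault raised first
inst0.post("data page fault")         # c4: inst0's fault raised later
taken = inst0.writeback()             # c5: inst0 reaches WB first
```

Even though inst1's fault was raised two cycles earlier, inst0 reaches WB first, so its interrupt is taken first: program order is preserved.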
Interrupts Are Nasty

• odd bits of state must be precise (e.g., CC)
• delayed branches
  • what if instruction in delay slot takes an interrupt?
• modes with early-writes (e.g., auto-increment)
  • must undo write (e.g., future-file, history-file)
• some machines had precise interrupts only in integer pipe
  • sufficient for implementing VM
  • e.g., VAX/Alpha

Lucky for us, there’s a nice, clean way to handle precise state
• We’ll see how this is done in a couple of lectures ...

Summary

• principles of pipelining
• pipeline depth: clock rate vs. number of stalls (CPI)
• hazards
  • structural
  • data (RAW, WAR, WAW)
  • control
• multi-cycle operations
  • structural hazards, WAW hazards
• interrupts
  • precise state

next up: dynamic ILP (chapter 3)
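The history-file fix for early writes (the auto-increment case mentioned under "Interrupts Are Nasty") can be sketched as an undo log. Illustrative only: the idea (save the old value before clobbering, roll back in reverse order on an interrupt) is the history-file scheme named in the slides; the register dictionary and function names are assumptions.

```python
def write_with_history(regs, history, reg, value):
    """Perform an early write, logging the old value first."""
    history.append((reg, regs[reg]))  # save old value before the write
    regs[reg] = value

def rollback(regs, history):
    """Interrupt taken: undo all logged writes, newest first."""
    while history:
        reg, old = history.pop()
        regs[reg] = old

regs, hist = {"r1": 100}, []
write_with_history(regs, hist, "r1", 104)  # auto-increment r1 by 4 early
rollback(regs, hist)                       # interrupt: r1 back to 100
```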