A Pipelining

The document discusses the principles and mechanisms of pipelining in computer architecture, highlighting its advantages over non-pipelined systems, such as increased resource utilization and throughput. It covers various types of hazards that can occur in pipelined processors, including structural, data, and control hazards, and methods to manage these hazards through techniques like stalls and flushes. Additionally, it examines the trade-offs between clock rate and instructions per cycle (IPC), emphasizing the importance of pipeline depth and overhead in optimizing performance.

Readings in Pipelining

H+P
• Appendix A (except for A.8)

Recent Research Papers
• "The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays", Hrishikesh et al., ISCA 2002.
• "Power: A First Class Design Constraint", Mudge, IEEE Computer, April 2001. (not directly related to pipelining)

Basic Pipelining

• basic := single, in-order issue
• single issue := one instruction at a time (per stage)
• in-order issue := instructions (start to) execute in order
• next unit: multiple issue
• unit after that: out-of-order issue

• pipelining principles
• tradeoff: clock rate vs. IPC
• hazards: structural, data, control
• vanilla pipeline: single-cycle operations
  • structural hazards, RAW hazards, control hazards
• dealing with multi-cycle operations
  • more structural hazards, WAW hazards, precise state

© 2003 by Sorin, Roth, Hill, Wood and Sohi, Smith, Vijaykumar, Lipasti. ECE 252 / CPS 220 Lecture Notes: Pipelining

Pipelining

observe: instruction processing consists of N sequential stages
idea: overlap different instructions at different stages

non-pipelined:  inst0.1 inst0.2 inst0.3
                                        inst1.1 inst1.2 inst1.3
pipelined:      inst0.1 inst0.2 inst0.3
                        inst1.1 inst1.2 inst1.3

+ increase resource utilization: fewer stages sitting idle
+ increase completion rate (throughput): up to 1 in 1/N time
• almost every processor built since 1970 is pipelined
• first pipelined processor: IBM Stretch [1962]

Without Pipelining

[single-cycle datapath figure: PC, +4, nPC, I$, regfile, D$; stages F D X M W]

• 5 parts of instruction execution
• fetch (F, IF): fetch instruction from I$
• decode (D, ID): decode instruction, read input registers
• execute (X, EX): ALU, load/store address, branch outcome
• memory access (M, MEM): load/store to D$/DTLB
• writeback (W, WB): write results (from ALU or ld) back to register file

Simple 5-Stage Pipeline

[pipelined datapath figure: PC, +4, I$, regfile, D$; pipeline registers F/D, D/X, X/M, M/W between stages F D X M W]

• 5 stages (pipeline depth is 5)
• fetch (F, IF): fetch instruction from I$
• decode (D, ID): decode instruction, read input registers
• execute (X, EX): ALU, load/store address, branch outcome
• memory access (M, MEM): load/store to D$/DTLB
• writeback (W, WB): write results (from ALU or ld) back to register file
• stages divided by pipeline registers/latches

Pipeline Registers (Latches)

• contain info for controlling flow of instructions through pipe
• PC: PC
• F/D: PC, undecoded instruction
• D/X: PC, opcode, regfile[rs1], regfile[rs2], immed, rd
• X/M: opcode (why?), regfile[rs1], ALUOUT, rd
• M/W: ALUOUT, MEMOUT, rd


Pipeline Diagram

       1  2  3  4  5  6  7  8   ⇐ cycles
inst0  F  D  X  M  W
inst1     F  D  X  M  W
inst2        F  D  X  M  W
inst3           F  D  X  M  W

Compared to non-pipelined case:
• Better throughput: an instruction finishes every cycle
• Same latency per instruction: each still takes 5 cycles

Principles of Pipelining

let: instruction execution require N stages, each takes tn time
• un-pipelined processor
  • single-instruction latency T = Σtn
  • throughput = 1/T = 1/Σtn
  • M-instruction latency = M*T (M>>1)
• now: N-stage pipeline
  • single-instruction latency T = Σtn (same as unpipelined)
  • throughput = 1/max(tn) <= N/T (max(tn) is the bottleneck)
    if all tn are equal (i.e., max(tn) = T/N), then throughput = N/T
  • M-instruction latency (M >> 1) = M*max(tn) <= M*T/N
  • speedup <= N
• can we choose N to get arbitrary speedup?
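The latency/throughput relations above can be sketched directly; this is a small illustration of my own (not from the notes), with made-up stage delays:

```python
# Latency and throughput of an N-stage pipeline, per the formulas above.
def pipeline_metrics(stage_times, m=1000):
    """stage_times: list of per-stage delays t_n for an N-stage pipeline."""
    t = sum(stage_times)                       # single-instruction latency T
    unpiped_throughput = 1.0 / t               # un-pipelined: 1/T
    piped_throughput = 1.0 / max(stage_times)  # bottleneck stage limits rate
    m_latency = m * max(stage_times)           # M-instruction latency, M >> 1
    speedup = piped_throughput / unpiped_throughput  # <= N
    return t, piped_throughput, m_latency, speedup

# Balanced 5-stage pipeline: each t_n = T/N, so speedup = N = 5.
t, thr, ml, sp = pipeline_metrics([2, 2, 2, 2, 2])
assert sp == 5.0

# Unbalanced pipeline: the slowest stage caps throughput, so speedup < N.
t, thr, ml, sp = pipeline_metrics([1, 1, 4, 1, 1])
assert sp == 2.0   # T = 8, max(tn) = 4, speedup = 8/4
```

The second case shows why stages are balanced in practice: one slow stage drags speedup well below N.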

Wrong (part I): Pipeline Overhead

V := oVerhead delay per pipe stage
• cause #1: latch overhead
  • pipeline registers take time
• cause #2: clock/data skew

so, for an N-stage pipeline with overheads
• single-instruction latency T = Σ(V + tn) = N*V + Σtn
• throughput = 1/(max(tn) + V) <= N/T (and <= 1/V)
• M-instruction latency = M*(max(tn) + V) <= M*V + M*T/N
• speedup = T/(V + max(tn)) <= N

Overhead limits throughput, speedup & useful pipeline depth

Wrong (part II): Hazards

hazards: conditions that lead to incorrect behavior if not fixed
• structural: two instructions use same h/w in same cycle
• data: two instructions use same data (register/memory)
• control: one instruction affects which instruction is next

• hazards ⇒ stalls (sometimes)
• stall: instruction stays in same stage for more than one cycle
• what if average stall per instruction = S stages?
  • latency' ⇒ T(N+S)/N = ((N+S)/N)*latency > latency
  • throughput' ⇒ N²/(T(N+S)) = (N/(N+S))*throughput < throughput
  • M_latency' ⇒ M*T(N+S)/N² = ((N+S)/N)*M_latency > M_latency
  • speedup' ⇒ N²/(N+S) = (N/(N+S))*speedup < speedup
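Combining the two corrections above into one number, here is a sketch (my own, assuming perfectly balanced stages tn = T/N) of how overhead V and average stall S scale the ideal speedup:

```python
def effective_speedup(T, N, V=0.0, S=0.0):
    """Speedup over an un-pipelined machine with latency T.
    T: total logic delay, N: stages (balanced), V: per-stage overhead,
    S: average stall cycles per instruction (the notes' simple model)."""
    cycle = T / N + V            # clock period = bottleneck stage + overhead
    ideal = T / cycle            # speedup with no stalls, <= N
    return ideal * N / (N + S)   # stalls scale throughput by N/(N+S)

# With no overhead and no stalls, speedup is exactly N.
assert effective_speedup(T=80, N=10) == 10.0
# Overhead caps speedup below N even as N grows: the limit is T/V.
assert effective_speedup(T=80, N=80, V=1.0) == 40.0
```

Pushing N past T/V buys almost nothing: the clock period can never drop below V, which is exactly the "useful pipeline depth" limit stated above.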


Pipelining: Clock Rate vs. IPC

deeper pipeline (more stages, larger N)
+ increases clock rate
– decreases IPC (longer stalls for hazards - will see later)
• ultimate metric is execution rate = clock rate * IPC
  • (clock cycles / unit real time) * (instructions / clock cycle)
  • number of instructions is fixed, for purposes of this discussion
• how does pipeline overhead factor in?

to think about this, parameterize the clock cycle
• basic time unit is the gate-delay (time to go through a gate)
• e.g., 80 gate-delays to process (fetch, decode, ...) an instruction
• let's look at an example ...

Clock Rate vs. IPC Example

• G: gate-delays to process an instruction
• V: gate-delays of overhead per stage
• S: average stall (cycles) per instruction per pipe stage
  – overly simplistic model for stalls
• compute optimal N (depth) given G, V, S [Smith+Pleszkun]
• IPC = 1/(1 + S*N)
• clock rate (in 1/gate-delays) = 1/(gate-delays/stage) = 1/(G/N + V)
• e.g., G = 80, S = 0.16, V = 1

N    IPC := 1/(1+0.16*N)    clock := 1/(80/N+1)    execution rate
10   0.38                   0.11                   0.042
20   0.24                   0.20                   0.048
30   0.17                   0.27                   0.046
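The table above is easy to reproduce; this sketch (mine, using the slide's G = 80, V = 1, S = 0.16) also searches for the depth that maximizes execution rate:

```python
# Execution rate = IPC * clock rate, per the model above.
def execution_rate(n, g=80, v=1.0, s=0.16):
    ipc = 1.0 / (1.0 + s * n)      # stalls grow with depth
    clock = 1.0 / (g / n + v)      # clock rate in 1/gate-delays
    return ipc * clock

# The table's three depths: N = 20 beats both 10 and 30.
assert execution_rate(20) > execution_rate(10)
assert execution_rate(20) > execution_rate(30)

# Sweep depths to find the optimum for these parameters.
best = max(range(1, 61), key=execution_rate)
assert best == 22   # optimum sits between the table's N = 20 and N = 30
```

The search confirms the table's shape: execution rate climbs, peaks in the low twenties, then falls as stalls eat the clock-rate gains.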

Pipeline Depth Upshot

trend is for deeper pipelines (more stages)
• why? faster clock (higher frequency)
• clock period = f(transistor latency, gate-delays per pipe stage)
• superpipelining: add more stages to reduce gate-delays/pipe-stage
• but increased frequency may not mean increased performance...
• who cares? we can sell frequency!
• e.g., Intel IA-32 pipelines
  • 486: 5 stages (50+ gate-delays per clock period)
  • Pentium: 7 stages
  • Pentium II/III: 12 stages
  • Pentium 4: 22 stages (10 gate-delays per clock)
• Gotcha! 800MHz Pentium III performs better than 1GHz Pentium 4

Managing the Pipeline

to resolve hazards, need fine pipe-stage control
• play with pipeline registers to control pipe flow
• trick #1: the stall (or the bubble)
  • effect: stops SOME instructions in current pipe-stages
  • use: make younger instructions wait for older ones to complete
  • implementation: de-assert write-enable signals to pipeline registers
• trick #2: the flush
  • effect: clears instructions out of current pipe-stages
  • use: undoes speculative work that was incorrect (see later)
  • implementation: assert clear signals on pipeline registers
• stalls & flushes must be propagated upstream (why?)


Structural Hazards

two different instructions need same h/w resource in same cycle
• e.g., loads/stores use the same cache port as fetch
• assume unified L1 cache (for this example)

       1  2  3  4  5  6  7  8
load   F  D  X  M  W
inst2     F  D  X  M  W
inst3        F  D  X  M  W
inst4           F  D  X  M  W   ⇐ inst4's F conflicts with load's M in cycle 4

Fixing Structural Hazards

• fix structural hazard by stalling (s* = structural stall)
+ low cost, simple
– decreases IPC
• used rarely
• Q: which one to stall, inst4 or load?
  • always safe to stall younger instruction (why?)...
  • ...but may not be the best thing to do performance-wise (why?)

       1  2  3  4  5  6  7  8  9
load   F  D  X  M  W
inst2     F  D  X  M  W
inst3        F  D  X  M  W
inst4           s* F  D  X  M  W

Avoiding Structural Hazards

• option #1: replicate the contended resource
  + good performance
  – increased area, slower (interconnect delay)?
  • use for cheap, divisible, or highly-contended resources (e.g., I$/D$)
• option #2: pipeline the contended resource
  + good performance, low area
  – sometimes complex (e.g., RAM)
  • useful for multicycle resources
• option #3: design ISA/pipeline to reduce structural hazards
  • key 1: each instruction uses a given resource at most once
  • key 2: each instruction uses a given resource in same pipeline stage
  • key 3: each instruction uses a given resource for one cycle
  • this is why we force ALU operations to go thru MEM stage

Data Hazards

two different instructions use the same storage location
• we must preserve the illusion of sequential execution

add R1, R2, R3      add R1, R2, R3      add R1, R2, R3
sub R2, R4, R1      sub R2, R4, R1      sub R2, R4, R1
or  R1, R6, R3      or  R1, R6, R3      or  R1, R6, R3

read-after-write    write-after-read    write-after-write
(RAW)               (WAR)               (WAW)
true dependence     anti-dependence     output dependence
(real)              (artificial)        (artificial)

Q: What about read-after-read dependences? (RAR)
© 2003 by Sorin, Roth, Hill, Wood, ECE 252 / CPS 220 Lecture Notes 17 © 2003 by Sorin, Roth, Hill, Wood, ECE 252 / CPS 220 Lecture Notes 18
Sohi, Smith, Vijaykumar, Lipasti Pipelining Sohi, Smith, Vijaykumar, Lipasti Pipelining

RAW

read-after-write (RAW) = true dependence (dataflow)

add R1, R2, R3
sub R2, R4, R1
or  R1, R6, R3

• problem: sub reads R1 before add has written it
• Pipelining enables this overlapping to occur
• But this violates sequential execution semantics!
• Recall: user just sees ISA and expects sequential execution

RAW: Detect and Stall

detect RAW and stall instruction at ID before it reads registers
• mechanics? disable PC, F/D write
• RAW detection? compare register names
  • notation: rs1(D) := source register #1 of instruction in D stage
  • compare rs1(D) and rs2(D) with rd(D/X), rd(X/M), rd(M/W)
  • stall (disable PC + F/D, clear D/X) on any match
• RAW detection? register busy-bits
  • set for rd(D/X) when instruction passes ID
  • clear for rd(M/W)
  • stall if rs1(D) or rs2(D) are "busy"
+ low cost, simple
– low performance (many stalls)
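The compare-based detection above can be sketched as a toy function (mine, not the notes' hardware): stall the instruction in D if either source matches a destination still in flight in D/X, X/M, or M/W.

```python
# Compare-based RAW detection for the detect-and-stall scheme above.
def must_stall(rs1, rs2, in_flight_rds):
    """in_flight_rds: destination registers of instructions in D/X, X/M, M/W.
    None entries mean a bubble or an instruction with no register result."""
    return any(rd is not None and rd in (rs1, rs2) for rd in in_flight_rds)

# add R1,R2,R3 is in D/X (rd = 1); sub R2,R4,R1 in D reads R4 and R1: stall.
assert must_stall(rs1=4, rs2=1, in_flight_rds=[1, None, None]) is True
# An independent instruction (or R5,R6,R7) does not stall.
assert must_stall(rs1=6, rs2=7, in_flight_rds=[1, None, None]) is False
```

The busy-bit variant keeps one bit per register instead of doing three comparisons per source, but stalls in exactly the same cases.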
Two Stall Timings

depend on how ID and WB stages share the register file
• each gets register file for half a cycle
• 1st half ID reads, 2nd half WB writes ⇒ 3 cycle stall

               1  2  3  4  5  6  7  8  9
add R1,R2,R3   F  D  X  M  W
sub R2,R4,R1      F  d* d* d* D  X  M  W
load R5,R6,R7        p* p* p* F  D  X  M

• 1st half WB writes, 2nd half ID reads ⇒ 2 cycle stall

               1  2  3  4  5  6  7  8  9
add R1,R2,R3   F  D  X  M  W
sub R2,R4,R1      F  d* d* D  X  M  W

Stall Signal Example (2nd Timing)

RAW: add r4,r2,r1 depends on load r2,0(r3); pipe snapshots show the
instruction held in each latch (PC | F/D | D/X | X/M | M/W)

c1: load r6,0(r4) | add r4,r2,r1 | load r2,0(r3) | add r5,r5,#4 | call func
    rs1(D) == rd(D/X) ⇒ stall: write-disable PC and F/D, clear D/X

c2: load r6,0(r4) | add r4,r2,r1 | bubble | load r2,0(r3) | add r5,r5,#4
    rs1(D) == rd(X/M) ⇒ stall: write-disable PC and F/D, clear D/X

c3: load r6,0(r4) | add r4,r2,r1 | bubble | bubble | load r2,0(r3)
    load r2,0(r3) now writes in WB's 1st half, D reads in 2nd half ⇒ go

c4: sub r6,r6,#1 | load r6,0(r4) | add r4,r2,r1 | bubble | bubble
    pipe moves again

Reducing RAW Stalls: Bypassing

[datapath figure: bypass paths from the X/M and M/W latches back to the ALU inputs]

why wait until WB stage? data available at end of EX/MEM stage
• bypass (aka "forward") data directly to input of EX
+ very effective at reducing/avoiding stalls
• in practice, a large fraction of input operands are bypassed (why?)
– complex
• does not relieve you from having to perform WB

Implementing Bypassing

• first, detect bypass opportunity
  • tag compares in D/X latch
  • similar to but separate from stall logic in F/D latch
• then, control bypass MUX
  • if rs2(X) == rd(X/M) then ALUOUT(M)
  • else if rs2(X) == rd(M/W) then ALUOUT(W)
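The MUX priority above (youngest in-flight value wins) can be written out as a small sketch of my own, not the notes' hardware:

```python
# Bypass-MUX selection for one ALU source operand, per the priority above.
def bypass_select(rs, rd_xm, rd_mw, regfile_val, aluout_m, aluout_w):
    """Pick the value feeding the ALU for source register rs."""
    if rd_xm is not None and rs == rd_xm:
        return aluout_m          # forward from end of EX (X/M latch)
    if rd_mw is not None and rs == rd_mw:
        return aluout_w          # forward from the M/W latch
    return regfile_val           # no match: use the register-file read

# add R1,... sits in X/M (rd = 1, result 42); stale R1 in regfile is 7.
assert bypass_select(1, rd_xm=1, rd_mw=None, regfile_val=7,
                     aluout_m=42, aluout_w=0) == 42
```

Checking X/M before M/W matters: if both latches hold a write to the same register, the X/M one is younger and holds the value sequential semantics demands.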
Pipeline Diagrams with Bypassing

example 1
               1  2  3  4  5  6  7
add R1,R5,R3   F  D  X  M  W
sub R2,R4,R1      F  D  X  M  W

example 2
               1  2  3  4  5  6  7
load R1,24(R5) F  D  X  M  W
add R3,R6,R7      F  D  X  M  W
sub R2,R4,R1         F  D  X  M  W

• even with full bypassing, not all RAW stalls can be avoided
• example: load to ALU in consecutive cycles

example 3
               1  2  3  4  5  6  7
load R1,24(R5) F  D  X  M  W
sub R2,R4,R1      F  D  d* X  M  W

Pipeline Scheduling

compiler schedules (moves) instructions to reduce stall
• eliminate back-to-back load-ALU scenarios
• example code sequence: a = b + c; d = e - f

before                          after
load R2, b                      load R2, b
load R3, c                      load R3, c
add R1, R2, R3   // stall       load R5, e
store R1, a                     add R1, R2, R3   // no stall
load R5, e                      load R6, f
load R6, f                      store R1, a
sub R4, R5, R6   // stall       sub R4, R5, R6   // no stall
store R4, d                     store R4, d


WAR: Write After Read

write-after-read (WAR) = artificial (name) dependence

add R1, R2, R3
sub R2, R4, R1
or  R1, R6, R3

• problem: add could use wrong value for R2
• can't happen in vanilla pipeline (reads in ID, writes in WB)
• can happen if: early writes (e.g., auto-increment) + late reads (??)
• can happen if: out-of-order reads (e.g., out-of-order execution)
• artificial: using different output register for sub would solve
• The dependence is on the name R2, but not on actual dataflow

WAW: Write After Write

write-after-write (WAW) = artificial (name) dependence

add R1, R2, R3
sub R2, R4, R1
or  R1, R6, R3

• problem: reordering could leave wrong value in R1
  • later instruction that reads R1 would get wrong value
• can't happen in vanilla pipeline (register writes are in order)
  • another reason for making ALU ops go through MEM stage
• can happen: multi-cycle operations (e.g., FP, cache misses)
• artificial: using different output register for or would solve
• Also a dependence on a name: R1

RAR: Read After Read

read-after-read (RAR)

add R1, R2, R3
sub R2, R4, R1
or  R1, R6, R3

• no problem: R3 is correct even with reordering

Memory Data Hazards

have seen register hazards, can also have memory hazards

RAW                 WAR                 WAW
store R1,0(SP)      load R4,0(SP)       store R1,0(SP)
load R4,0(SP)       store R1,0(SP)      store R4,0(SP)

                1  2  3  4  5  6
store R1,0(SP)  F  D  X  M  W
load R1,0(SP)      F  D  X  M  W

• in simple pipeline, memory hazards are easy
  • in-order
  • one at a time
  • read & write in same stage
• in general, though, more difficult than register hazards

Hazards vs. Dependences

dependence: fixed property of instruction stream (i.e., program)
hazard: property of program and processor organization
• implies potential for executing things in wrong order
• potential only exists if instructions can be simultaneously "in-flight"
• property of dynamic distance between instrs vs. pipeline depth

For example, can have RAW dependence with or without hazard
• depends on pipeline

Control Hazards

when an instruction affects which instruction executes next

store R4,0(R5)
bne R2,R3,loop
sub R1,R6,R3

• naive solution: stall until outcome is available (end of EX)
+ simple
– low performance (2 cycles here, longer in general)
• e.g., 15% branches * 2 cycle stall ⇒ 30% CPI increase!

                1  2  3  4  5  6  7  8  9
store R4,0(R5)  F  D  X  M  W
bne R2,R3,loop     F  D  X  M  W
??                    c* c* F  D  X  M  W

Control Hazards: "Fast" Branches

fast branches: can be evaluated in ID (rather than EX)
+ reduce stall from 2 cycles to 1

                1  2  3  4  5  6  7  8
sw R4,0(R5)     F  D  X  M  W
bne R2,R3,loop     F  D  X  M  W
??                    c* F  D  X  M  W

– requires more hardware
  • dedicated ID adder for (PC + immediate) targets
– requires simple branch instructions
  • no time to compare two registers (would need full ALU)
  • comparisons with 0 are fast (beqz, bnez)

Control Hazards: Delayed Branches

delayed branch: execute next instruction whether taken or not
• instruction after branch said to be in "delay slot"
• old microcode trick stolen by RISC (MIPS)

store R4,0(R5)          bned R2,R3,loop
bne R2,R3,loop    ⇒     store R4,0(R5)
sub R1,R6,R6            sub R1,R6,R6

                 1  2  3  4  5  6  7  8
bned R2,R3,loop  F  D  X  M  W
store R4,0(R5)      F  D  X  M  W
sub R1,R6,R6           c* F  D  X  M  W


What To Put In Delay Slot?

• instruction from before branch
  • when? if branch and instruction are independent
  • helps? always
• instruction from target (taken) path
  • when? if safe to execute, but may have to duplicate code
  • helps? on taken branch, but may increase code size
• instruction from fall-through (not-taken) path
  • when? if safe to execute
  • helps? on not-taken branch
• upshot: short-sighted ISA feature
  – not a big win for today's machines (why? consider pipeline depth)
  – complicates interrupt handling (later)

Control Hazards: Speculative Execution

idea: doing anything is better than waiting around doing nothing
• speculative execution
  • guess branch target ⇒ start executing at guessed position
  • execute branch ⇒ verify (check) guess
+ minimize penalty if guess is right (to zero?)
– wrong guess could be worse than not guessing
• branch prediction: guessing the branch
  • one of the "important" problems in computer architecture
  • very heavily researched area in last 15 years
• static: prediction by compiler
• dynamic: prediction by hardware
• hybrid: compiler hints to hardware predictor

The Speculation Game

speculation: engagement in risky business transactions on the
chance of quick or considerable profit
• speculative execution (control speculation)
  • execute before all parameters known with certainty
+ correct speculation
  + avoid stall/get result early, performance improves
– incorrect speculation (mis-speculation)
  – must abort/squash incorrect instructions
  – must undo incorrect changes (recover pre-speculation state)
• the speculation game: profit > penalty
  • profit = speculation accuracy * correct-speculation gain
  • penalty = (1 – speculation accuracy) * mis-speculation penalty

Speculative Execution Scenarios

• correct speculation
           1  2  3  4  5
inst0/B    F  D  X  M  W
inst8         F  D  X  M
inst9            F  D  X
inst10              F  D
  • cycle 1: fetch branch, predict next (inst8)
  • c2, c3: fetch inst8, inst9
  • c3: execute/verify branch ⇒ correct
  • nothing needs to be fixed or changed

• incorrect speculation: mis-speculation
           1  2  3  4  5
inst0/B    F  D  X  M  W
inst1         F  D
inst2            F
inst8               F  D   ⇐ verify/flush at end of c3
  • c1: fetch branch, predict next (inst1)
  • c2, c3: fetch inst1, inst2
  • c3: execute/verify branch ⇒ wrong
  • c3: send correct target to IF (inst8)
  • c3: squash (abort) inst1, inst2 (flush F/D)
  • c4: fetch inst8
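The profit-vs-penalty condition above reduces to a one-line break-even test; this sketch is mine, and the accuracy/penalty numbers are made-up examples, not measurements from the notes:

```python
# The "speculation game": speculate only if expected profit beats
# expected penalty, per the two formulas above.
def speculation_wins(accuracy, gain, mispredict_penalty):
    profit = accuracy * gain                       # cycles saved when right
    penalty = (1.0 - accuracy) * mispredict_penalty  # cycles lost when wrong
    return profit > penalty

# A 90%-accurate predictor saving 2 cycles vs. a 3-cycle flush: worth it.
assert speculation_wins(0.90, gain=2, mispredict_penalty=3) is True
# A coin-flip predictor with the same penalty is not.
assert speculation_wins(0.50, gain=2, mispredict_penalty=3) is False
```

This is why mis-speculation penalty matters as much as accuracy: deep pipelines raise the penalty, so they need much better predictors to keep winning the game.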


Static (Compiler) Branch Prediction

Some static prediction options
• predict always not-taken
  + very simple, since we already know the target (PC+4)
  – most branches (~65%) are taken (why?)
• predict always taken
  + better performance
  – more difficult, must know target before branch is decoded
• predict backward taken
  • most backward branches are taken
• predict specific opcodes
• use profiles to predict on per-static-branch basis
  • pretty good

Comparison of Some Static Schemes

CPI-penalty = %branch * [(%T * penaltyT) + (%NT * penaltyNT)]
• simple branch statistics
  • 14% PC-changing instructions ("branches")
  • 65% of PC-changing instructions are "taken"

scheme           penaltyT   penaltyNT   CPI penalty
stall            2          2           0.28
fast branch      1          1           0.14
delayed branch   1.5        1.5         0.21
not-taken        2          0           0.18
taken            0          2           0.10
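The comparison table can be checked directly from the CPI-penalty formula; this sketch (mine) plugs in the slide's statistics of 14% branches and 65% taken:

```python
# CPI penalty of a static scheme, per the formula above.
def cpi_penalty(pen_taken, pen_not_taken, branch_frac=0.14, taken_frac=0.65):
    return branch_frac * (taken_frac * pen_taken +
                          (1 - taken_frac) * pen_not_taken)

assert round(cpi_penalty(2, 2), 2) == 0.28    # stall until end of EX
assert round(cpi_penalty(1, 1), 2) == 0.14    # fast branch (resolve in ID)
assert round(cpi_penalty(2, 0), 2) == 0.18    # predict not-taken
assert round(cpi_penalty(0, 2), 2) == 0.10    # predict taken
```

Predict-taken wins here purely because 65% of branches are taken; the formula makes that asymmetry explicit.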

Dynamic Branch Prediction

[pipelined datapath figure: branch predictor (BP) sits beside I$ and feeds the fetch stage]

hardware (BP) guesses whether and where a branch will go

0x64 bnez r1,#10
0x74 add r3,r2,r1

• start with branch PC (0x64) and produce
  • direction (Taken)
  • direction + target PC (0x74)
  • direction + target PC + target instruction (add r3,r2,r1)

Branch History Table (BHT)

branch PC ⇒ prediction (T, NT)

[figure: low bits of branch PC index a table of 1-bit entries, producing T/N]

– need decoder/adder to compute target if taken
• branch history table (BHT)
  • read prediction with least significant bits (LSBs) of branch PC
  • change bit on misprediction
+ simple
– multiple PCs may map to same bit (aliasing)
• major improvements
  • two-bit counters [Smith]
  • correlating/two-level predictors [Patt]
  • hybrid predictors [McFarling]


Improvement: Two-bit Counters

example: 4-iteration inner loop branch

state/prediction  N T T T N T T T N T T T
branch outcome    T T T N T T T N T T T N
mis-prediction?   *     * *     * *     *

– problem: two mis-predictions per loop
• solution: 2-bit saturating counter to implement hysteresis
  • 4 states: strong/weak not-taken (N/n), strong/weak taken (T/t)
  • transitions: N ⇔ n ⇔ t ⇔ T

state/prediction  n t T T t T T T t T T T
branch outcome    T T T N T T T N T T T N
mis-prediction?   *     *       *       *

+ only one mis-prediction per iteration

Improvement: Correlating Predictors

different branches may be correlated
• outcome of branch depends on outcome of other branches
• makes intuitive sense (programs are written this way)
• e.g., if the first two conditions are true, then third is false

if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) { . . . }

revelation: prediction = f(branch PC, recent branch outcomes)
• revolution: BP accuracies increased dramatically
• lots of research in designing that function for best BP
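The 2-bit counter above simulates in a few lines; this sketch is mine, encoding states N, n, t, T as counter values 0..3 and replaying the slide's 4-iteration loop branch:

```python
# One 2-bit saturating counter: predict taken when counter >= 2.
def predict_and_update(counter, outcome_taken):
    prediction = counter >= 2
    if outcome_taken:
        counter = min(counter + 1, 3)   # saturate at strong taken (T)
    else:
        counter = max(counter - 1, 0)   # saturate at strong not-taken (N)
    return prediction, counter

# Loop branch pattern T T T N, three loop executions, start at weak
# not-taken (n = 1) as in the table above.
c, misses = 1, 0
for outcome in [True, True, True, False] * 3:
    pred, c = predict_and_update(c, outcome)
    misses += (pred != outcome)
assert misses == 4   # warm-up miss + one per loop-exit, matching the table
```

Compare with the 1-bit table above, which misses twice per loop: the hysteresis absorbs the single not-taken loop exit without forgetting that the branch is usually taken.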
Correlating (Two-Level) Predictors

• branch history shift register (BHR) holds recent outcomes
• combination of PC and BHR accesses BHT
• basically, multiple predictions per branch, choose based on history

[figure: branch PC and BHR combined by a function f to index the BHT, producing T/N]

design space
• number of BHRs
  • multiple BHRs ("local", Intel)
  • 1 global BHR ("global", everyone else)
• PC/BHR overlap
  • full, partial, none (concatenated?)
• popular design: Gshare [McFarling]
  • 1 global BHR, full overlap, f = XOR

Correlating Predictor Example

• example with alternating T,N (1-bit BHT, no correlation)

state/prediction  N T N T N T N T N T N T
branch outcome    T N T N T N T N T N T N
mis-prediction?   * * * * * * * * * * * *

• add 1 1-bit BHR, concatenate with PC
  • effectively, two predictors per PC; the last outcome selects the active entry

BHR=N entry   N T T T T T T T T T T T
BHR=T entry   N N N N N N N N N N N N
branch outcome  T N T N T N T N T N T N
mis-prediction? *
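A gshare-style index (global BHR XOR branch-PC bits, per the design above) can be sketched as follows. This is my own illustration with assumed parameters: a 1024-entry table of 2-bit counters and a 10-bit history register.

```python
TABLE_BITS = 10
table = [1] * (1 << TABLE_BITS)   # 2-bit counters, start at weak not-taken
ghr = 0                           # global branch history register

def gshare_predict(pc):
    # XOR history into the PC index bits; drop the 2 byte-offset bits.
    idx = ((pc >> 2) ^ ghr) & ((1 << TABLE_BITS) - 1)
    return idx, table[idx] >= 2   # predict taken if counter >= 2

def gshare_update(idx, taken):
    global ghr
    table[idx] = min(table[idx] + 1, 3) if taken else max(table[idx] - 1, 0)
    ghr = ((ghr << 1) | int(taken)) & ((1 << TABLE_BITS) - 1)

idx, pred = gshare_predict(0x400)
assert pred is False              # untrained entry: weak not-taken
gshare_update(idx, taken=True)    # two taken outcomes saturate the entry
gshare_update(idx, taken=True)
assert table[idx] == 3            # strong taken
```

Because the history register is folded into the index, the same static branch trains different counters after different outcome histories, which is exactly the "two predictors per PC" effect in the example above.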


Hybrid/Competitive/Tournament Predictors

observation: different schemes work better for different branches
idea: multiple predictors, choose on per static-branch basis

mechanics
• two (or more) predictors
• chooser

[figure: branch PC (and BHR via f) feed predictor 1, predictor 2, and the chooser; the chooser selects which prediction is used]

• if chosen predictor is wrong...
• ...and other is right...
• ...flip chooser
• popular design: Gselect [McFarling]
  • Gshare + 2-bit saturating counter

Branch Target Buffer (BTB)

branch PC ⇒ target PC
• target PC available at end of IF stage
+ no bubble for correct predictions
• branch target buffer (BTB)
  • index: branch PC
  • data: target PC (+ T/NT?)
  • tags: branch PC (why are tags needed here and not in BHT?)
– many more bits per entry than BHT
• considerations: combine with I-cache? store not-taken branches?
• branch target cache (BTC)
  • data: target PC + target instruction(s)
  • enables "branch folding" optimization (branch removed from pipe)

Jump Prediction

exploit behavior of different kinds of jumps to improve prediction
• function returns
  • use hardware return address stack (RAS)
  • call pushes return address on top of RAS
  • for return, predict address at top of RAS and pop
  – trouble: must manage speculatively
• indirect jumps (switches, virtual functions)
  • more than one taken target per jump
  • path-based BTB [Driesen+Holzle]

Branch Issues

issue 1: how do we know at IF which instructions are branches?
• BTB: don't need to "know"
• check every instruction: BTB entry ⇒ instruction is a branch

issue 2: BHR (RAS) depend on branch (call) history
• when are these updated?
  • at WB is too late (if another branch is in-flight)
  • at IF (after prediction)
• must be able to recover BHR (RAS) on mis-speculation (nasty)
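The RAS mechanism above (push on call, pop to predict a return) is simple enough to sketch; this toy model is mine, assuming 4-byte instructions and a small fixed-depth stack that overwrites its oldest entry when full:

```python
# Toy return address stack: calls push the fall-through PC, returns
# predict by popping the top of stack.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def on_call(self, call_pc, inst_bytes=4):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # full: discard the oldest entry
        self.stack.append(call_pc + inst_bytes)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x100)                       # call at 0x100: return to 0x104
ras.on_call(0x200)                       # nested call: return to 0x204
assert ras.predict_return() == 0x204     # inner return predicted first
assert ras.predict_return() == 0x104
```

The "must manage speculatively" trouble is visible here: if the pushes and pops happen at fetch time and a branch mis-speculates, the stack contents must be restored too.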


Adding Multi-Cycle Operations

RISC tenet #1: "single-cycle operations"
• why was this such a big deal?
• fact: not all operations complete in 1 cycle
  • FP add, int/FP multiply: 2–4 cycles, int/FP divide: 20–50 cycles
  • data cache misses: 10–150 cycles!
• slow clock cycle down to slowest operation?
  – can't without incurring huge performance loss
• solution: extend pipeline - add pipeline stages to EX

Extended Pipeline

[figure: F and D feed an integer pipeline (X, M, W with int RF and D$) plus parallel FP units (E+, E*, E/) with a separate FP RF]

• separate integer/FP, pipe register files
• loads/stores in integer pipeline only (why?)
• additional, parallel functional units
  • E+: FP adder (2 cycles, pipelined)
  • E*: FP/integer multiplier (4 cycles, pipelined)
  • E/: FP/integer divider (20 cycles, not pipelined)

Multi-Cycle Example

                1  2  3  4  5  6  7  8  9  10
divf f0,f1,f2   F  D  E/ E/ E/ E/ W
mulf f0,f3,f4      F  D  E* E* W
addf f5,f6,f7         F  D  E+ E+ W
subf f8,f6,f7            F  D  *  E+ E+ W
mulf f9,f8,f7               F  D  *  *  E* E*

• write-after-write (WAW) hazards
• register write port structural hazards
• functional unit structural hazards
• elongated read-after-write (RAW) hazards

Another Multi-Cycle Example

example: SAXPY (math kernel)
Z[i] = A*X[i] + Y[i]  // single precision

                1  2  3  4  5  6  7  8  9  10
ldf f2,0(r1)    F  D  X  M  W
mulf f6,f0,f2      F  D  d* E* E* E* E* W
ldf f4,0(r2)          F  p* D  X  M  W
addf f8,f6,f4            F  D  d* d* E+ E+ W
stf f8,0(r3)                F  p* p* D  X  M  W
add r1,r1,#4                   F  D  X  M  W
add r2,r2,#4                      F  D  X  M  W
add r3,r3,#4                         F  D  X  M  W


Register Write Port Structural Hazards

where are these resolved?
• multiple writeback ports?
  – not a good idea (why not?)
• in ID?
  • reserve writeback slot in ID (writeback reservation bits)
  + simple, keeps stall logic localized to ID stage
  – won't work for cache misses (why not?)
• in MEM?
  + works for cache misses, better utilization
  – two stall controls (F/D and M/W) must be synchronized
• in general: cache misses are hard
  • don't know early enough (in ID) whether they will happen

WAW Hazards

how are these dealt with?
• stall younger instruction writeback?
  + intuitive, simpler
  – lower performance (cascading writeback structural hazards)
• abort (don't do) older instruction writeback?
  + no performance loss
  – but what if intermediate instruction causes an interrupt (next)

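The "reserve writeback slot in ID" idea can be sketched as a small scheduler. This is a sketch under stated assumptions, not the hardware: the structure (a set of reserved WB cycles checked during decode) follows the slides' "writeback reservation bits", while the function name and the fixed fetch-then-decode timing are illustrative.

```python
def schedule_decodes(ex_latencies):
    """ex_latencies[i] = EX cycles of instruction i, in program order.
    Returns the cycle in which each instruction occupies D (decode)."""
    reserved = set()       # WB cycles whose write port is already claimed
    decode_cycles = []
    d = 2                  # instruction 0 decodes in cycle 2 (F in cycle 1)
    for lat in ex_latencies:
        while d + lat + 1 in reserved:   # WB cycle = D + EX latency + 1
            d += 1                       # stall in ID until the slot is free
        reserved.add(d + lat + 1)        # claim the write port for that cycle
        decode_cycles.append(d)
        d += 1                           # next instruction decodes later
    return decode_cycles

# divf (4 EX cycles), mulf (2), addf (2): addf would reach W in the same
# cycle as divf, so it stalls one cycle in ID
cycles = schedule_decodes([4, 2, 2])
```

This keeps all the stall logic in ID, which is the "+" the slide claims; the "–" is that a cache miss isn't known at decode time, so a reservation made in ID can be wrong.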
Dealing With Interrupts

interrupts (aka faults, exceptions, traps)
• e.g., arithmetic overflow, divide by zero, protection violation
• e.g., I/O device request, OS call, page fault

classifying interrupts
• terminal (fatal) vs. restartable (control returned to program)
• synchronous (internal) vs. asynchronous (external)
• user vs. coerced
• maskable (ignorable) vs. non-maskable
• between instructions vs. within instruction

Precise Interrupts

“unobserved system can exist in any intermediate state, upon
observation system collapses to well-defined state”
  – 2nd postulate of quantum mechanics
• system ⇒ processor, observation ⇒ interrupt

what is the “well-defined” state?
• von Neumann: “sequential, instruction atomic execution”
• precise state at interrupt
  • all instructions older than interrupt are complete
  • all instructions younger than interrupt haven’t started
  • implies interrupts are taken in program order
• necessary for VM (why?), “highly recommended” by IEEE
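The precise-state condition above can be stated as a predicate over a snapshot of the machine. A minimal sketch: the two conditions come straight from the slides; the flag arrays and the function name are illustrative.

```python
def is_precise(done, started, fault_at):
    """done[i]/started[i]: per-instruction flags in program order;
    fault_at: index of the interrupted instruction."""
    older_complete = all(done[:fault_at])                 # all older finished
    younger_unstarted = not any(started[fault_at + 1:])   # no younger effects
    return older_complete and younger_unstarted

# precise: inst0, inst1 complete; the faulting inst2's younger neighbors
# haven't started
assert is_precise([True, True, False, False],
                  [True, True, True, False], fault_at=2)
# imprecise: inst3 (younger than the fault) has already started
assert not is_precise([True, True, False, False],
                      [True, True, True, True], fault_at=2)
```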
Interrupt Example: Data Page Fault

        1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17
inst0   F  D  X  M  W
inst1      F  D  X  M                                        page fault
inst2         F  D  X                                        flush EX, ID, IF
inst3            F  D
inst4               F
TRAP                   F  D  X  M  W                         inject TRAP instr
                                      (OS trap handler runs)
inst1                                       F  D  X  M  W    restart faulting instruction

• squash (effects of) younger instructions
• inject fake TRAP instruction into IF
• from here, like a SYSCALL

More Interrupts

• interrupts can occur at different stages
  • IF, MEM: page fault, misaligned data, protection violation
  • ID: illegal/privileged instruction
  • EX: arithmetic exception

        1  2  3  4  5  6  7  8  9
inst0   F  D  X  M  W                data page fault
inst1      F  D  X  M  W             instruction page fault
trap0         F  D  X  M  W
inst1            F  D  X  M

• too complicated to draw what goes on here
• c2: instruction page fault, flush inst1, inject TRAP
• c4: data page fault, flush inst0, inst1, TRAP
  – can get into an infinite loop here (with help of OS page placement)
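The squash-and-inject sequence above can be sketched on a list model of the in-flight instructions. A sketch only: the policy (older instructions drain, the faulting instruction and everything younger are discarded, a TRAP is injected) is from the slides; the function and data representation are illustrative.

```python
def squash_and_trap(pipe, fault_idx):
    """pipe: in-flight instructions, oldest first.
    fault_idx: position of the faulting instruction.
    Returns (new pipe contents, squashed instructions)."""
    survivors = pipe[:fault_idx]   # older instructions drain normally
    squashed = pipe[fault_idx:]    # faulting + younger: effects discarded
    return survivors + ["TRAP"], squashed

# inst1 takes a data page fault in MEM; inst2..inst4 are younger
new_pipe, squashed = squash_and_trap(
    ["inst0", "inst1", "inst2", "inst3", "inst4"], fault_idx=1)
# the OS handler runs behind TRAP, then inst1 is re-fetched and restarted
```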
Posted Interrupts

posted interrupts
• set interrupt bit when condition is raised
• check interrupt bit (potentially “take” interrupt) in WB
+ interrupts are taken in order
– longer latency, more complex

        1  2  3  4  5  6  7  8  9
inst0   F  D  X  M  W                data page fault
inst1      F  D  X  M  W             instruction page fault

• what happens now?
  • c2: set inst1 bit
  • c4: set inst0 bit
  • c5: take inst0 interrupt

Interrupts and Multi-Cycle Operations

              1  2  3  4  5  6  7  8  9  10 11
divf f0,f1,f2 F  D  E/ E/ E/ E/ W               div by 0 (posted)
mulf f3,f4,f5    F  D  E* E* W
addf f6,f7,f8       F  D  E+ E+ s* W

multi-cycle operations + precise state = trouble
• #1: how to undo early writes?
  • e.g., must make it seem as if mulf hasn’t executed
  • undo writes: future file, history file -> ugly!
• #2: how to take interrupts in-order if WB is not in-order?
  • force in-order WB
  – slow
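The posted-interrupt mechanism can be sketched by attaching the pending bit to the instruction itself. A minimal sketch: the raise-then-check-in-WB split is from the slides; the class, method names, and cycle comments are illustrative.

```python
class Inst:
    """An in-flight instruction carrying a posted-interrupt bit."""
    def __init__(self, name):
        self.name = name
        self.posted = None            # pending interrupt cause, if any
    def post(self, cause):
        # condition raised in some stage: just record it, don't act yet
        self.posted = cause
    def writeback(self):
        # the bit is only checked when the instruction reaches WB
        if self.posted:
            return ("take", self.name, self.posted)
        return ("complete", self.name, None)

inst0, inst1 = Inst("inst0"), Inst("inst1")
inst1.post("instruction page fault")  # c2: inst1's fault raised first
inst0.post("data page fault")         # c4: inst0's fault raised later
taken = inst0.writeback()             # c5: inst0 reaches WB first
```

Even though inst1's fault was raised two cycles earlier, inst0 reaches WB first, so its interrupt is taken first: program order is preserved.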
Interrupts Are Nasty

• odd bits of state must be precise (e.g., CC)
• delayed branches
  • what if instruction in delay slot takes an interrupt?
• modes with early-writes (e.g., auto-increment)
  • must undo write (e.g., future-file, history-file)
• some machines had precise interrupts only in integer pipe
  • sufficient for implementing VM
  • e.g., VAX/Alpha

Lucky for us, there’s a nice, clean way to handle precise state
• We’ll see how this is done in a couple of lectures ...

Summary

• principles of pipelining
• pipeline depth: clock rate vs. number of stalls (CPI)
• hazards
  • structural
  • data (RAW, WAR, WAW)
  • control
• multi-cycle operations
  • structural hazards, WAW hazards
• interrupts
  • precise state

next up: dynamic ILP (chapter 3)
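The history-file fix for early writes (the auto-increment case mentioned under "Interrupts Are Nasty") can be sketched as an undo log. Illustrative only: the idea (save the old value before clobbering, roll back in reverse order on an interrupt) is the history-file scheme named in the slides; the register dictionary and function names are assumptions.

```python
def write_with_history(regs, history, reg, value):
    """Perform an early write, logging the old value first."""
    history.append((reg, regs[reg]))  # save old value before the write
    regs[reg] = value

def rollback(regs, history):
    """Interrupt taken: undo all logged writes, newest first."""
    while history:
        reg, old = history.pop()
        regs[reg] = old

regs, hist = {"r1": 100}, []
write_with_history(regs, hist, "r1", 104)  # auto-increment r1 by 4 early
rollback(regs, hist)                       # interrupt: r1 back to 100
```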