
Branch Prediction

The document discusses advanced concepts in branch prediction, focusing on static and dynamic methods. It outlines various prediction schemes, including static predictions based on compile-time behavior and dynamic predictors that leverage historical data to improve accuracy. Key techniques such as the Branch History Table, correlating branches, and tournament predictors are highlighted for their roles in enhancing processor performance by minimizing control hazards.

Uploaded by

Herlin L.T.

CS2354 Advanced Computer Architecture

Unit I
Branch prediction – Static & Dynamic

FK.F02 1
4.2 Static Branch Prediction
• used where branch behavior is highly predictable at
compile time
• architectural feature to support static branch prediction
– delayed branch

The instruction in the branch delay slot is executed whether or not the branch is taken (for zero cycle penalty)

Static Branch Prediction for Load Stall

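Original sequence:

        LD      R1, 0(R2)       ← Load Stall
        DSUBU   R1, R1, R3
        BEQZ    R1, L
        OR      R4, R5, R6
        DADDU   R10, R4, R3
L:      DADDU   R7, R8, R9

Branch almost always taken (target instruction moved up after the load):

        LD      R1, 0(R2)
        DADDU   R7, R8, R9
        DSUBU   R1, R1, R3
        BEQZ    R1, L
        OR      R4, R5, R6
        DADDU   R10, R4, R3
L:

Branch rarely taken (fall-through instruction moved up after the load):

        LD      R1, 0(R2)
        OR      R4, R5, R6
        DSUBU   R1, R1, R3
        BEQZ    R1, L
        DADDU   R10, R4, R3
L:      DADDU   R7, R8, R9

In both cases, assume the moved instruction is safe to execute if the branch is mispredicted.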
Static Branch Prediction Schemes
• Simplest scheme - predict branch as taken
– Average misprediction rate of 34% for SPEC programs (ranging from 59% down to 9%)
• Direction-based scheme - predict
backward-going branch as taken and
forward-going branch as not taken
– Not good for SPEC programs
– Overall misprediction rate is unlikely to be less than 30% to 40%
• Profile-based scheme – predict on the basis of
profile information collected from earlier runs
– An individual branch is often highly biased toward taken or
untaken
– Changing the input so that the profile is for a different run leads
to only a small change in the accuracy
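
The three static schemes above can be compared on a small trace. The following is a minimal sketch only: the trace, the branch addresses, and the function names are hypothetical, and the profile is built from the same run it is evaluated on, which a real profile-based scheme would not do.

```python
from collections import Counter

def predict_taken(pc, target):
    """Simplest scheme: every branch is predicted taken."""
    return True

def predict_by_direction(pc, target):
    """Direction-based: backward branches taken, forward branches not taken."""
    return target < pc

def build_profile(trace):
    """Profile-based: remember the more common outcome of each branch."""
    taken, total = Counter(), Counter()
    for pc, _target, outcome in trace:
        total[pc] += 1
        taken[pc] += outcome
    return {pc: taken[pc] * 2 >= total[pc] for pc in total}

def misprediction_rate(trace, predictor):
    wrong = sum(predictor(pc, target) != outcome for pc, target, outcome in trace)
    return wrong / len(trace)

# Hypothetical trace entries: (branch PC, target PC, taken?)
trace = (
    [(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]        # backward loop branch
    + [(0x60, 0x80, True)] * 6 + [(0x60, 0x80, False)] * 4  # forward, mostly taken
    + [(0x90, 0xA0, True)] * 2 + [(0x90, 0xA0, False)] * 8  # forward, mostly untaken
)

profile = build_profile(trace)

def predict_from_profile(pc, target):
    return profile[pc]

for name, scheme in [("taken", predict_taken),
                     ("direction", predict_by_direction),
                     ("profile", predict_from_profile)]:
    print(name, round(misprediction_rate(trace, scheme), 3))
```

On this made-up trace the profile-based scheme mispredicts least because each branch is strongly biased one way, which is the property the slide relies on.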

Profile-based Static Branch Prediction
Misprediction rate on SPEC92
• varies widely: 3% to 24%
• on average, 9% for FP programs and 15% for integer programs

Number of instructions executed between mispredicted branches (average):

                Predict taken   Profile-based
  FP                 30              173
  INT                10               46
  All                20              110
  Std. dev.          27               85

• varies widely, depending on branch frequency and prediction precision
What we have done!

• Have described techniques to overcome data hazards
• Will describe techniques to overcome control hazards
– What limits the amount of ILP? Control dependence
– Prediction is helpful in single-issue processors
– Prediction is crucial to multi-issue processors
– Why?
» Branches arrive up to n times faster in an n-issue processor
» The relative impact of control stalls is larger at the lower CPI (Amdahl’s Law)
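
The Amdahl's Law point can be checked with a little arithmetic. This is a sketch under assumed numbers: the branch frequency, misprediction rate, and penalty below are hypothetical, and the function name is illustrative only.

```python
def control_stall_share(base_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Fraction of total CPI that is spent on branch misprediction stalls."""
    stall_cpi = branch_freq * mispredict_rate * penalty_cycles
    return stall_cpi / (base_cpi + stall_cpi)

# Hypothetical workload: 20% branches, 10% mispredicted, 3-cycle penalty.
single_issue = control_stall_share(1.0, 0.20, 0.10, 3)   # ideal CPI of 1
four_issue = control_stall_share(0.25, 0.20, 0.10, 3)    # ideal CPI of 0.25

print(f"single-issue: {single_issue:.1%} of cycles are control stalls")
print(f"4-issue:      {four_issue:.1%} of cycles are control stalls")
```

The stall cycles per instruction are the same in both cases; only the base CPI shrinks, so the same stalls take a larger share of execution time on the wider machine.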
CSE 7381
Computer Architecture FK.F02 6
Basic Prediction Schemes
• Static schemes
– Predict not taken
– Predict taken
– Delayed branch
• Dynamic schemes
1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. Branch Target Buffer
6. Integrated Instruction Fetch Units
7. Return Address Predictors
(5 to 7: for high-performance delivery)

• Goal: allow the processor to resolve the outcome of a branch early
1-bit Branch History Table (BHT)
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: the lower bits of the PC address index a table of 1-bit values
– Says whether or not the branch was taken last time
– No address check (saves HW, but the entry may belong to a different branch)
• Problem: in a loop, a 1-bit BHT causes 2 mispredictions per execution of the loop (average is 9 iterations before exit):
– End-of-loop case, when it exits instead of looping as before
– First iteration the next time the loop is entered, when it predicts exit instead of looping
– Only 80% accuracy even if the branch loops 90% of the time
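
The 80% figure can be reproduced with a small simulation. This is a minimal sketch, assuming 4-byte instructions (so the PC is shifted by 2 before indexing), a 16-entry table, and a hypothetical loop branch at address 0x40; the class and function names are illustrative.

```python
class OneBitBHT:
    """1-bit branch history table indexed by the low bits of the PC."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.table = [False] * (1 << index_bits)  # False = predict not taken

    def _index(self, pc):
        return (pc >> 2) & self.mask  # no tag check, as on the slide

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, taken):
        self.table[self._index(pc)] = taken  # remember only the last outcome


def accuracy(predictor, trace):
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)


# Loop branch: taken 9 times, then not taken once on exit; loop entered 100 times.
loop_visit = [(0x40, True)] * 9 + [(0x40, False)]
trace = loop_visit * 100

result = accuracy(OneBitBHT(), trace)
print(result)
```

Each visit to the loop costs two mispredictions out of ten branches: the exit, and the first iteration of the next visit.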

2-bit Branch History Table (BHT)
• Solution: a 2-bit scheme that changes the prediction only after mispredicting twice:
[State diagram: four states, two Predict Taken and two Predict Not Taken, with transitions labelled T (taken) and NT (not taken)]
• Red: stop, not taken
• Green: go, taken
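
One common way to realise a 2-bit scheme is a saturating counter per entry. The sketch below assumes that variant, which may differ in its exact transitions from the state diagram on this slide; the table size, the PC shift for 4-byte instructions, and the loop branch at 0x40 are also assumptions.

```python
class TwoBitBHT:
    """Branch history table of 2-bit saturating counters (assumed variant)."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)  # 0, 1 = not taken; 2, 3 = taken

    def _index(self, pc):
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredictions(predictor, trace):
    wrong = 0
    for pc, taken in trace:
        wrong += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return wrong


# Same loop as before: 9 taken iterations, then one not-taken exit, 100 visits.
trace = ([(0x40, True)] * 9 + [(0x40, False)]) * 100
mispredictions = count_mispredictions(TwoBitBHT(), trace)
print(mispredictions, "mispredictions out of", len(trace))
```

After a short warm-up from the all-zero state, only the loop exit is mispredicted on each visit, so accuracy approaches the 90% that the branch behaviour allows.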
Correlating Branches
B1: if (aa == 2)
        aa = 0;
B2: if (bb == 2)
        bb = 0;
B3: if (aa != bb) { …

        LD      R1, aa
        LD      R2, bb
        DSUBUI  R3, R1, #2
        BNEZ    R3, L1          ; B1 (aa != 2)
        DADD    R1, R0, R0      ; aa = 0
L1:     DSUBUI  R3, R2, #2
        BNEZ    R3, L2          ; B2 (bb != 2)
        DADD    R2, R0, R0      ; bb = 0
L2:     DSUBUI  R3, R1, R2
        BEQZ    R3, L3          ; B3 (aa == bb)

Observation: the 3rd branch is correlated with the 1st and 2nd branches:
B1 = NT and B2 = NT ⇒ B3 = T
Correlating Branches
B1: if (d == 0)
        d = 1;
B2: if (d == 1)

        LD      R1, d
        BNEZ    R1, L1          ; B1 (d != 0)
        DADDIU  R1, R0, #1      ; d = 1
L1:     DADDIU  R3, R1, #-1
        BNEZ    R3, L2          ; B2 (d != 1)

L2:

Observation: B1 = NT ⇒ B2 = NT

Correlating Branches
• Idea: taken/not taken of recently executed branches is related to the behavior of the next branch (as well as the history of that branch's behavior)
– Then the behavior of recent branches selects between, say, 4 predictions of the next branch, updating just that prediction
• (2,2) predictor: 2-bit global, 2-bit local

[Figure: the branch address (4 bits) selects a row of 2-bits-per-branch local predictors; the 2-bit global branch history (01 = not taken then taken) selects the prediction within the row]
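
A (2,2) predictor along these lines can be sketched as follows. The trace follows the assembly of the d == 0 / d == 1 example from the earlier Correlating Branches slide; the choice of d alternating between 2 and 0, the branch addresses 0x10 and 0x14, and all class and function names are assumptions made for illustration.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m global history bits pick one of 2**m n-bit counters."""

    def __init__(self, m=2, n=2, index_bits=4):
        self.history_mask = (1 << m) - 1
        self.index_mask = (1 << index_bits) - 1
        self.max_count = (1 << n) - 1
        self.threshold = 1 << (n - 1)
        self.history = 0  # outcomes of the last m branches, newest in bit 0
        self.counters = [[0] * (1 << m) for _ in range(1 << index_bits)]

    def _row(self, pc):
        return self.counters[(pc >> 2) & self.index_mask]

    def predict(self, pc):
        return self._row(pc)[self.history] >= self.threshold

    def update(self, pc, taken):
        row = self._row(pc)
        count = row[self.history]
        # Update only the counter selected by the current global history.
        row[self.history] = min(self.max_count, count + 1) if taken else max(0, count - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def correlated_trace(iterations):
    trace = []
    for i in range(iterations):
        d = 2 if i % 2 == 0 else 0      # hypothetical input values
        b1_taken = d != 0               # BNEZ R1, L1
        if d == 0:
            d = 1
        b2_taken = d != 1               # BNEZ R3, L2
        trace += [(0x10, b1_taken), (0x14, b2_taken)]
    return trace


predictor = CorrelatingPredictor()
mispredictions = 0
for pc, taken in correlated_trace(100):
    mispredictions += predictor.predict(pc) != taken
    predictor.update(pc, taken)
print(mispredictions)
```

Both branches alternate between taken and not taken, which defeats a plain 2-bit counter, but the global history makes each outcome predictable once the selected counters have warmed up.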
Accuracy of Different Schemes
[Figure: frequency of mispredictions (0% to 20%) for three predictors: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries (2,2)]
Tournament Predictors:
An example of multilevel branch predictors
• Motivation for correlating branch predictors: the 2-bit predictor failed on important branches;
– by adding global information, performance improved
• Tournament predictors: use 2 predictors,
– 1 based on global information and
– 1 based on local information, and combine with a selector
– A 2-bit saturating counter per branch chooses between the two predictors

• The hope is to select the right predictor for the right branch
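
The selector idea can be sketched with two deliberately small component predictors. Everything here is an assumption for illustration: the table sizes, the 2-bit global history, the rule that the chooser is updated only when the two components disagree, and the single alternating branch at 0x40.

```python
def bump(counter, up, limit=3):
    """Saturating increment or decrement of a small counter."""
    return min(limit, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self, index_bits=4, history_bits=2):
        size = 1 << index_bits
        self.index_mask = size - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0
        self.local = [0] * size                        # 2-bit counters indexed by PC
        self.global_table = [0] * (1 << history_bits)  # 2-bit counters indexed by history
        self.chooser = [0] * size                      # 0, 1 = use local; 2, 3 = use global

    def _components(self, pc):
        i = (pc >> 2) & self.index_mask
        return i, self.local[i] >= 2, self.global_table[self.history] >= 2

    def predict(self, pc):
        i, local_pred, global_pred = self._components(pc)
        return global_pred if self.chooser[i] >= 2 else local_pred

    def update(self, pc, taken):
        i, local_pred, global_pred = self._components(pc)
        if local_pred != global_pred:
            # Move the chooser toward whichever component was right.
            self.chooser[i] = bump(self.chooser[i], global_pred == taken)
        self.local[i] = bump(self.local[i], taken)
        self.global_table[self.history] = bump(self.global_table[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


# A branch that alternates taken / not taken: hard for the local counter,
# easy for the history-indexed predictor.
trace = [(0x40, i % 2 == 0) for i in range(200)]
predictor = TournamentPredictor()
mispredictions = 0
for pc, taken in trace:
    mispredictions += predictor.predict(pc) != taken
    predictor.update(pc, taken)
print(mispredictions)
```

On this trace the chooser starts with the local counter, sees the global component win whenever the two disagree, and switches over after a few branches.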

The Alpha 21264 Branch Predictor
[Figure: selector state diagram with four states, two Use P1 and two Use P2 (Pi: predictor i), with transitions labelled 0/0, 0/1, 1/0, and 1/1; other labels: local branch address, 2-bit counter]

• 4K 2-bit counters choose between a global predictor and a local predictor
• A counter is incremented whenever the “predicted” predictor is correct and decremented in the reverse situation
Global Predictor
[Figure: the history of the last 12 branches indexes the 4K-entry global predictor of 2-bit predictors, which supplies the prediction]

• Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor (Ref. Slide #5)
– 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
2-Level Local Predictor
• First level: 1024 10-bit entries, each holding the most recent 10 branch outcomes
• Second level: 1K 3-bit counters, indexed by the selected 10-bit history
• Output: local prediction
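
The two-level structure can be sketched with the sizes given on this slide. The PC shift for 4-byte instructions, the counter threshold, the update rule, and the test loop (4 taken iterations, then an exit) are assumptions for illustration, not details taken from the slide.

```python
class TwoLevelLocalPredictor:
    def __init__(self, history_bits=10, history_entries=1024, counter_bits=3):
        self.history_mask = (1 << history_bits) - 1
        self.entry_mask = history_entries - 1
        self.histories = [0] * history_entries        # level 1: per-branch history
        self.counter_max = (1 << counter_bits) - 1
        self.threshold = 1 << (counter_bits - 1)
        self.counters = [0] * (1 << history_bits)     # level 2: one counter per pattern

    def predict(self, pc):
        history = self.histories[(pc >> 2) & self.entry_mask]
        return self.counters[history] >= self.threshold

    def update(self, pc, taken):
        entry = (pc >> 2) & self.entry_mask
        history = self.histories[entry]
        count = self.counters[history]
        self.counters[history] = (min(self.counter_max, count + 1) if taken
                                  else max(0, count - 1))
        # Shift the newest outcome into this branch's 10-bit history.
        self.histories[entry] = ((history << 1) | int(taken)) & self.history_mask


def count_mispredictions(predictor, trace):
    wrong = 0
    for pc, taken in trace:
        wrong += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return wrong


# Short loop: taken 4 times, then not taken on exit.
loop_visit = [(0x40, True)] * 4 + [(0x40, False)]
predictor = TwoLevelLocalPredictor()
warmup_misses = count_mispredictions(predictor, loop_visit * 50)
steady_misses = count_mispredictions(predictor, loop_visit * 50)
print(warmup_misses, steady_misses)
```

Because the 10-bit history covers two full passes of this loop, every position in the loop maps to its own counter, so even the loop exit is predicted correctly once the counters have warmed up.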

% of predictions from local predictor in Tournament Prediction Scheme

  nasa7        98%
  matrix300   100%
  tomcatv      94%
  doduc        90%
  spice        55%
  fpppp        76%
  gcc          72%
  espresso     63%
  eqntott      37%
  li           69%
Accuracy of Branch Prediction
Branch prediction accuracy:

              Profile-based   2-bit counter   Tournament
  tomcatv          99%             99%           100%
  doduc            95%             84%            97%
  fpppp            86%             82%            98%
  li               88%             77%            98%
  espresso         86%             82%            96%
  gcc              88%             70%            94%

• Profile: branch profile from last execution (static in that it is encoded in the instruction, but based on a profile)
Accuracy v. Size (SPEC89)
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for local, correlating, and tournament predictors]
Need Address at Same Time as Prediction
• Branch Target Buffer/Cache (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
– Note: must check for a branch match now, since a wrong branch address can't be used

[Figure: the PC of the instruction in FETCH is compared (=?) with the Branch PC field; each entry holds Branch PC, Predicted PC, and extra prediction state bits]
– Yes: instruction is a branch; use the predicted PC as the next PC
– No: branch not predicted; proceed normally (Next PC = PC+4)
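
The lookup described above can be sketched as a small direct-mapped table. This is a simplified illustration: it stores only taken branches and omits the extra prediction state bits, and the table size, the 4-byte instruction assumption, and the addresses used are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        # Each entry is None or a (branch_pc, predicted_pc) pair.
        self.entries = [None] * (1 << index_bits)

    def _index(self, pc):
        return (pc >> 2) & self.mask

    def next_pc(self, pc):
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:  # the "=?" branch match check
            return entry[1]                       # hit: fetch from the predicted PC
        return pc + 4                             # miss: proceed normally

    def update(self, pc, taken, target):
        i = self._index(pc)
        if taken:
            self.entries[i] = (pc, target)
        elif self.entries[i] is not None and self.entries[i][0] == pc:
            self.entries[i] = None                # stop predicting this branch as taken


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x40)))   # no entry yet, so fall through
btb.update(0x40, True, 0x20)
print(hex(btb.next_pc(0x40)))   # now redirected to the stored target
print(hex(btb.next_pc(0x80)))   # same index as 0x40, but the branch PC differs
```

The last lookup shows why the match check matters: 0x80 maps to the same entry as 0x40, and without the comparison it would be sent to the wrong target.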
Branch Folding
• In the BT-buffer, store one or more target
instructions instead of the predicted PC
– Obtain zero cycle unconditional branches
– Sometimes, zero cycle conditional branches

[Figure: pipeline diagram (IF ID EX MEM WB) of an unconditional branch and its decoded target instruction]

– Multi-issue: BT-buffer needs to supply multiple instructions
Integrated Instruction Fetch Units
• IF unit in multi-issue processors integrates:
– Integrated Branch Prediction
– Instruction Prefetch
– Instruction Memory Access and Buffering

Return Address Predictors

• Predicting indirect jumps: jumps whose destination address varies at run time
• Register-indirect branch: the target address is hard to predict
• In SPEC89, 85% of such branches are procedure returns
• Since procedures follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries give a small miss rate
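
A return address buffer of this kind can be sketched as a bounded stack. The overflow policy (discard the oldest entry) and the names below are assumptions for illustration.

```python
class ReturnAddressStack:
    def __init__(self, entries=8):
        self.entries = entries
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.entries:
            self.stack.pop(0)            # overflow: discard the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        # Returns None when the buffer is empty (no prediction available).
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(entries=8)
ras.on_call(0x104)   # main calls f, return address 0x104
ras.on_call(0x208)   # f calls g, return address 0x208
print(hex(ras.predict_return()))  # return from g
print(hex(ras.predict_return()))  # return from f
```

Mispredictions occur only when the call depth exceeds the buffer size, which is why a small number of entries is usually enough.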

Dynamic Branch Prediction Summary
• Prediction is becoming an important part of scalar execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
– Either different branches
– Or different executions of the same branch
• Tournament Predictor: devote more resources to competing solutions and pick between them
• Branch Target Buffer: includes branch address & prediction
• Predicated Execution can reduce the number of branches and the number of mispredicted branches
• Return address stack for prediction of indirect jumps
