CS252 Graduate Computer Architecture Prediction (Con't) (Dependencies, Load Values, Data Values) February 22, 2010
John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
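[Figure: branch history table (BHT) indexed by k bits of the fetch PC alongside the I-cache; the instruction opcode and offset determine "Branch?", the target PC, and taken/not taken.]
[Figure: branch target buffer indexed by k bits of the PC in parallel with IMEM; each entry holds a predicted target and BP bits.]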
BP bits are stored with the predicted target address.
IF stage: if (BP = taken) then nPC = target else nPC = PC+4
Later: check the prediction; if wrong, kill the instruction and update the BTB and BPb; otherwise update BPb only.
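The IF-stage rule above can be sketched in Python. This is only an illustration under stated assumptions: the class name `UntaggedBTB`, the 2-bit counter encoding (2 and 3 mean "taken"), the 4-entry table, and the jump at PC=132 with target 236 are all hypothetical choices made for the example.

```python
class UntaggedBTB:
    """k-bit indexed table; each entry holds a target and BP bits (assumed 2-bit counter)."""

    def __init__(self, k=2):
        self.mask = (1 << k) - 1
        self.table = [{"target": 0, "bp": 0} for _ in range(1 << k)]

    def index(self, pc):
        return (pc >> 2) & self.mask          # drop byte offset, keep k index bits

    def next_pc(self, pc):
        entry = self.table[self.index(pc)]
        # IF stage: if BP says taken, fetch the stored target, else PC+4.
        return entry["target"] if entry["bp"] >= 2 else pc + 4

    def update(self, pc, taken, target):
        entry = self.table[self.index(pc)]
        if taken:
            entry["target"] = target
            entry["bp"] = min(3, entry["bp"] + 1)
        else:
            entry["bp"] = max(0, entry["bp"] - 1)


btb = UntaggedBTB(k=2)
btb.update(132, taken=True, target=236)   # hypothetical jump, resolved taken twice
btb.update(132, taken=True, target=236)
aliased = btb.next_pc(1028)               # 1028 maps to the same entry as 132
```

Because no tag is compared, any instruction whose PC maps to the same entry inherits the prediction: here `aliased` is 236 even though nothing says the instruction at 1028 is a branch.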
2/22/10 cs252-S10, Lecture 9 4
Predicted target = 236, correct target = 1032 => kill PC=236 and fetch PC=1032
Is this a common occurrence?
Can we avoid these bubbles?
[Figure: BTB entry with valid bit and predicted target PC; a match against the fetch PC supplies the target.]
Keep both the branch PC and target PC in the BTB
PC+4 is fetched if match fails
Only predicted taken branches and jumps held in BTB
Next PC determined before branch fetched and decoded
The match for PC=1028 fails and 1028+4 is fetched => eliminates false predictions after ALU instructions
BTB contains entries only for control transfer instructions => more room to store branch targets
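A minimal Python sketch of the tagged lookup described above. The class name `TaggedBTB`, the 4-entry size, and the jump at PC=132 with target 236 are assumptions for illustration; real designs usually store a partial tag rather than the full branch PC.

```python
class TaggedBTB:
    """Each slot holds (branch_pc, target_pc); only predicted-taken transfers are kept."""

    def __init__(self, k=2):
        self.mask = (1 << k) - 1
        self.table = [None] * (1 << k)

    def next_pc(self, pc):
        slot = self.table[(pc >> 2) & self.mask]
        if slot is not None and slot[0] == pc:   # match on the stored branch PC
            return slot[1]
        return pc + 4                            # match fails: fall through

    def resolve(self, pc, taken, target):
        i = (pc >> 2) & self.mask
        if taken:
            self.table[i] = (pc, target)
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None                 # stop predicting taken for this branch


btb = TaggedBTB(k=2)
btb.resolve(132, taken=True, target=236)   # hypothetical jump sharing a slot with 1028
after_alu = btb.next_pc(1028)              # tag mismatch, so 1028+4 is fetched
```

The instruction at 1028 still indexes the slot used by the jump, but the stored branch PC does not match, so the fetch falls through to 1032.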
BHT in a later pipeline stage corrects the fetch when the BTB misses a predicted-taken branch
[Figure: misprediction frequency]
Correlating Branches
Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
Two possibilities; the current branch depends on:
  The last m most recently executed branches anywhere in the program. Produces a GA (for "global adaptive") in the Yeh and Patt classification (e.g. GAg)
  The last m most recent outcomes of the same branch. Produces a PA (for "per-address adaptive") in the same classification (e.g. PAg)
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry
  A single history table shared by all branches (appends a "g" at the end), indexed by history value
  Address is used along with history to select the table entry (appends a "p" at the end of the classification)
  If only a portion of the address is used, the classification often appends an "s" to indicate set-indexed tables (e.g. GAs)
Correlating Branches
For instance, consider a global-history, set-indexed BHT. That gives us a GAs history table.
(2,2) GAs predictor
  The first 2 means that we keep two bits of history
  The second 2 means that we have 2-bit counters in each slot
  The behavior of recent branches then selects between, say, four predictions of the next branch, updating just that prediction
  Note that the original two-bit counter solution would be a (0,2) GAs predictor
  Note also that aliasing is possible here...
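The (m,n) scheme can be sketched in Python as below. The class name `GAsPredictor`, the parameter `set_bits`, the initial counter value, and the alternating test pattern are assumptions made for the example, not details from the slides.

```python
class GAsPredictor:
    """(m, n) predictor: m bits of global history, n-bit saturating counters."""

    def __init__(self, m=2, n=2, set_bits=4):
        self.hist_mask = (1 << m) - 1
        self.set_mask = (1 << set_bits) - 1
        self.max_count = (1 << n) - 1
        self.threshold = 1 << (n - 1)       # counter >= threshold predicts taken
        self.history = 0                     # global history register
        # One row per address set, one counter per history pattern.
        self.table = [[self.threshold - 1] * (1 << m) for _ in range(1 << set_bits)]

    def _row(self, pc):
        return self.table[(pc >> 2) & self.set_mask]

    def predict(self, pc):
        return self._row(pc)[self.history] >= self.threshold

    def update(self, pc, taken):
        row = self._row(pc)
        count = row[self.history]
        # Only the counter selected by the current history is updated.
        row[self.history] = min(self.max_count, count + 1) if taken else max(0, count - 1)
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask


def mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        misses += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return misses


pattern = [True, False] * 20                 # a branch that strictly alternates
warm = GAsPredictor(m=2, n=2)
mispredictions(warm, 0x40, pattern)          # warm-up pass
steady_22 = mispredictions(warm, 0x40, pattern)
plain = GAsPredictor(m=0, n=2)               # degenerates to a plain 2-bit counter
mispredictions(plain, 0x40, pattern)
steady_02 = mispredictions(plain, 0x40, pattern)
```

On this pattern the (2,2) configuration stops mispredicting once warmed up, while the (0,2) configuration keeps flipping its single counter and misses every time. Two branches whose PCs share the same `set_bits` index share a row, which is the aliasing the slide warns about.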
Speed: Does this affect cycle time?
Space: Clearly total space matters!
  Papers which do not try to normalize across different options are playing fast and loose with data
  Try to get the best performance for the cost
[Figure: misprediction rates for three predictors (4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, 1024-entry (2,2) BHT) on the benchmarks doduc, nasa7, spice, gcc, espresso, matrix300, tomcatv, eqntott, fpppp, li; rates range from 0% to 11%.]
BHT Accuracy
With a 4096-entry table, misprediction rates vary by program from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
For SPEC92, a 4096-entry table is about as good as an infinite table
How could HW predict "this loop will execute 3 times" using a simple mechanism?
  Need to track history of just that branch
  For a given pattern, track the most likely following branch direction
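One way to picture "track history of just that branch" is a per-branch history register that indexes a shared table of counters. The sketch below is an assumption-laden illustration: the class name `LocalHistoryPredictor`, the 2-bit history length, and the taken/taken/not-taken pattern standing in for the 3-iteration loop are all hypothetical.

```python
class LocalHistoryPredictor:
    def __init__(self, hist_bits=2):
        self.hist_mask = (1 << hist_bits) - 1
        self.local_history = {}                       # one history register per branch PC
        self.pattern_table = [1] * (1 << hist_bits)   # shared 2-bit counters, one per pattern

    def predict(self, pc):
        history = self.local_history.get(pc, 0)
        return self.pattern_table[history] >= 2

    def update(self, pc, taken):
        history = self.local_history.get(pc, 0)
        count = self.pattern_table[history]
        # Learn which direction most likely follows this history pattern.
        self.pattern_table[history] = min(3, count + 1) if taken else max(0, count - 1)
        self.local_history[pc] = ((history << 1) | int(taken)) & self.hist_mask


loop_branch = [True, True, False] * 10    # assumed outcome pattern of the loop branch
predictor = LocalHistoryPredictor()
for taken in loop_branch:                 # warm-up pass
    predictor.update(0x80, taken)

steady_misses = 0
for taken in loop_branch:
    steady_misses += predictor.predict(0x80) != taken
    predictor.update(0x80, taken)
```

With two bits of history each point in the pattern has a unique predecessor (taken-taken is followed by not taken, the other two by taken), so after warm-up the loop exit is predicted correctly.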
[Figure: two-level adaptive predictors. GAg: GBHR indexes GPHT. PAg: PABHR indexes GPHT. PAp: PABHR indexes PAPHT.]
GAg: Global History Register, Global History Table
PAg: Per-Address History Register, Global History Table
PAp: Per-Address History Register, Per-Address History Table
PAp best: but uses a lot more state!
GAg not effective with 6-bit history registers
Cost:
  GAg requires 18-bit history register
  PAg requires 12-bit history register
  PAp requires 6-bit history register
Problem with GAg? It aliases results from different branches into the same table
  Issue is that different branches may take the same global pattern and resolve it differently
  GAg doesn't leave flexibility to do this
[Figure: GAs and GShare organizations; each combines the GBHR with the branch address to select a table entry.]
GAs: Global History Register, Per-Address (Set Associative) History Table
Gshare: Global History Register, Global History Table with a simple attempt at anti-aliasing
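The difference between the schemes comes down to how the table index is formed. The functions below are a sketch: the names, the bit widths, and the use of XOR for gshare (the commonly described hashing choice) are assumptions rather than details given on the slide.

```python
def gag_index(ghr, hist_bits):
    """GAg: the global history alone selects the counter."""
    return ghr & ((1 << hist_bits) - 1)


def gas_index(pc, ghr, addr_bits, hist_bits):
    """GAs: some address bits pick a set, the history picks a counter within it."""
    set_number = (pc >> 2) & ((1 << addr_bits) - 1)
    return (set_number << hist_bits) | (ghr & ((1 << hist_bits) - 1))


def gshare_index(pc, ghr, bits):
    """Gshare: address and history are XORed into one index of the same width."""
    return ((pc >> 2) ^ ghr) & ((1 << bits) - 1)


history = 0b1010
same_gag = gag_index(history, 4)                    # identical for every branch
share_a = gshare_index(0x40, history, 4)
share_b = gshare_index(0x44, history, 4)
```

Two branches that see the same global history collide in GAg by construction; with gshare the address bits perturb the index, so `share_a` and `share_b` differ here without enlarging the table.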
From: "A Comparative Analysis of Schemes for Correlated Branch Prediction," by Cliff Young, Nicolas Gloy, and Michael D. Smith
Many branches are highly biased to be taken or not taken
Path history can be used to further bias branch behavior
[Figure: BiMode and YAGS predictors, built from address, history, and tagged prediction (TAG, Pred) tables.]
Tournament Predictors
Motivation for correlating branch predictors: the 2-bit predictor failed on important branches; adding global information improved performance
Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine them with a selector
Use the predictor that tends to guess correctly
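The selector can be sketched as a table of 2-bit counters that drift toward whichever predictor has been right more often. Everything named here (`TournamentSelector`, indexing the selector by PC, the counter encoding) is an assumption for illustration; the two component predictions are passed in as plain booleans.

```python
class TournamentSelector:
    def __init__(self, entries=16):
        self.mask = entries - 1              # entries assumed to be a power of two
        self.counters = [1] * entries        # 0,1 -> trust predictor A; 2,3 -> trust predictor B

    def choose(self, pc, pred_a, pred_b):
        return pred_b if self.counters[(pc >> 2) & self.mask] >= 2 else pred_a

    def update(self, pc, pred_a, pred_b, taken):
        if pred_a == pred_b:
            return                           # no information when both agree
        i = (pc >> 2) & self.mask
        if pred_b == taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


selector = TournamentSelector()
first_choice = selector.choose(0x40, pred_a=False, pred_b=True)    # starts out trusting A
selector.update(0x40, pred_a=False, pred_b=True, taken=True)       # B was right
selector.update(0x40, pred_a=False, pred_b=True, taken=True)       # B was right again
later_choice = selector.choose(0x40, pred_a=False, pred_b=True)    # now trusts B
```

Updating only on disagreement keeps the selector from saturating on branches that both predictors handle equally well.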
[Figure: tournament predictor; addr and history feed Predictor A and Predictor B, and a selector picks between them.]
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
[Figure (fig 3.40): prediction accuracy, in percent, for doduc, fpppp, li, and espresso.]
Profile: branch profile from the last execution (static in that it is encoded in the instruction, but derived from a profile)
Implementation
Keep a queue of stores, in program order
Watch for the position of new loads relative to existing stores
Typically, this is a different buffer than the ROB!
  Could be the ROB (has the right properties), but too expensive
Address Speculation
st r1, (r2)
ld r3, (r4)
Guess that r4 != r2
Execute load before store address known
Need to hold all completed but uncommitted load/store addresses in program order
If we subsequently find r4 == r2, squash the load and all following instructions => large penalty for inaccurate address speculation
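The check performed when the store address finally resolves can be sketched as follows. The function names, the use of sequence numbers for program order, and the example addresses are assumptions made for illustration.

```python
def first_violation(store_seq, store_addr, executed_loads):
    """Return the sequence number of the oldest younger load that read store_addr, or None.

    executed_loads: list of (seq, addr) for loads that ran before this store's address was known.
    """
    younger_hits = [seq for seq, addr in executed_loads
                    if seq > store_seq and addr == store_addr]
    return min(younger_hits) if younger_hits else None


def squash_from(seq, in_flight):
    """Keep only instructions older than seq; the load and everything after it is replayed."""
    return [s for s in in_flight if s < seq]


loads = [(2, 0x100), (3, 0x200), (5, 0x200)]        # loads executed speculatively
no_conflict = first_violation(1, 0x300, loads)      # the guess r4 != r2 held
conflict = first_violation(1, 0x200, loads)         # r4 == r2 for the load with seq 3
survivors = squash_from(conflict, [1, 2, 3, 4, 5])
```

The squash discards the offending load and every younger instruction, including ones that did not depend on it, which is why a wrong guess is expensive.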
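[Figure: speculative store buffer beside the L1 data cache; the load address is checked against the tags of both, load data comes from either, and the store commit path moves data from the buffer into the cache.]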
On store execute:
mark entry valid and speculative, and save data and tag of instruction.
On store commit:
clear speculative bit and eventually move data to cache
On store abort:
clear valid bit
If data is in both the store buffer and the cache, which should we use?
  The speculative store buffer
If the same address is in the store buffer twice, which should we use?
  The youngest store older than the load
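The store-buffer rules (execute, commit, abort, and load lookup) can be sketched in Python. The class name, the dictionary standing in for the L1 data cache, the sequence numbers used for age, and draining to the cache immediately at commit are simplifying assumptions.

```python
class SpeculativeStoreBuffer:
    def __init__(self, cache):
        self.cache = cache       # dict of address -> data, standing in for the L1 data cache
        self.entries = []        # appended in program order

    def execute_store(self, seq, addr, data):
        # On store execute: mark entry valid and speculative, save data and tag.
        self.entries.append({"seq": seq, "addr": addr, "data": data,
                             "valid": True, "speculative": True})

    def commit_store(self, seq):
        for entry in self.entries:
            if entry["seq"] == seq and entry["valid"]:
                entry["speculative"] = False
                self.cache[entry["addr"]] = entry["data"]   # "eventually" is immediate here
                entry["valid"] = False                      # entry is free once drained

    def abort_store(self, seq):
        for entry in self.entries:
            if entry["seq"] == seq:
                entry["valid"] = False

    def load(self, seq, addr):
        older = [e for e in self.entries
                 if e["valid"] and e["addr"] == addr and e["seq"] < seq]
        if older:
            return max(older, key=lambda e: e["seq"])["data"]   # youngest older store wins
        return self.cache.get(addr)


cache = {0x10: 1}
sb = SpeculativeStoreBuffer(cache)
sb.execute_store(1, 0x10, 5)
sb.execute_store(3, 0x10, 7)
between = sb.load(2, 0x10)       # sees store 1 only
after_both = sb.load(4, 0x10)    # sees store 3, the youngest older store
before_any = sb.load(0, 0x10)    # no older store, falls back to the cache
sb.abort_store(3)
after_abort = sb.load(4, 0x10)
sb.commit_store(1)
```

The buffer is searched before the cache because it holds the newer data; the age comparison is what stops a load from seeing a store that follows it in program order.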
Goal of prediction:
  Avoid false dependencies and order violations
From "Memory Dependence Prediction using Store Sets," Chrysos and Emer.
28  Load B   Store set { PC 8 }
32  Load D   Store set { (null) }
36  Load C   Store set { PC 0, PC 12 }
40  Load B   Store set { PC 8 }
Idea: the store set for a load starts empty. If the load ever goes forward and this causes a violation, add the offending store to the load's store set
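The store-set idea can be sketched with a dictionary from load PC to a set of store PCs. The class and method names are hypothetical, and the sketch ignores the table structures a hardware implementation would use; the PCs are taken from the example above.

```python
class StoreSetPredictor:
    def __init__(self):
        self.store_sets = {}      # load PC -> set of store PCs it has conflicted with

    def stores_to_wait_for(self, load_pc, in_flight_store_pcs):
        """Stores the load should not bypass; an empty result lets the load go ahead."""
        return self.store_sets.get(load_pc, set()) & set(in_flight_store_pcs)

    def record_violation(self, load_pc, store_pc):
        """Called after an order violation: remember the offending store."""
        self.store_sets.setdefault(load_pc, set()).add(store_pc)


predictor = StoreSetPredictor()
predictor.record_violation(36, 0)      # Load C at PC 36 conflicted with the store at PC 0
predictor.record_violation(36, 12)     # ... and later with the store at PC 12
predictor.record_violation(28, 8)      # Load B at PC 28 conflicted with the store at PC 8

in_flight = [0, 8, 12]
free_load = predictor.stores_to_wait_for(32, in_flight)      # Load D: empty store set
held_load = predictor.stores_to_wait_for(36, in_flight)
```

A load with an empty store set is free to run ahead of every unresolved store, which avoids false dependencies, while a load with history waits only for the specific stores it has conflicted with.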
Try to predict the result of a load before going to memory
Paper: "Value locality and load value prediction," Mikko H. Lipasti, Christopher B. Wilkerson and John Paul Shen
Load Value Prediction Table (LVPT)
  Untagged, direct mapped
  Takes instructions => predicted data
How to predict?
  When n=1, easy
  When n=16? Use oracle
Load Classification Table (LCT)
  Untagged, direct mapped
  Takes instructions => single bit of whether or not to predict
How to implement?
  Uses saturating counters (2 or 1 bit)
  When prediction correct, increment
  When prediction incorrect, decrement
  With 2 bits: 0,1 not predictable; 2 predictable; 3 constant (very predictable)
  With 1 bit: 0 not predictable; 1 constant (very predictable)
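A minimal sketch of how the two tables cooperate for the n=1 case, using 2-bit LCT counters. The class name, table size, and the choice to train the counter on every load (whether or not a prediction was actually used) are assumptions for illustration.

```python
class LoadValuePredictor:
    def __init__(self, entries=16):
        self.mask = entries - 1          # entries assumed to be a power of two
        self.lvpt = [0] * entries        # last value loaded, one per entry (n = 1)
        self.lct = [0] * entries         # 2-bit saturating counters

    def _index(self, pc):
        return (pc >> 2) & self.mask     # untagged, direct mapped

    def predict(self, pc):
        """Return the predicted value, or None when the LCT says not to predict."""
        i = self._index(pc)
        return self.lvpt[i] if self.lct[i] >= 2 else None

    def update(self, pc, actual):
        i = self._index(pc)
        if self.lvpt[i] == actual:
            self.lct[i] = min(3, self.lct[i] + 1)    # prediction would have been correct
        else:
            self.lct[i] = max(0, self.lct[i] - 1)    # prediction would have been wrong
        self.lvpt[i] = actual


lvp = LoadValuePredictor()
for _ in range(3):
    lvp.update(0x100, 42)            # the load at 0x100 keeps returning 42
confident = lvp.predict(0x100)
lvp.update(0x100, 7)                 # the value changes once
after_change = lvp.predict(0x100)
```

Keeping the classification counter separate from the value table lets a single wrong value drop the load back to "do not predict" without discarding the stored value.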
Accuracy of LCT
Question of accuracy is about how well we avoid:
  Predicting unpredictable loads
  Not predicting predictable loads
Basic Principle:
  It often works better to have one structure decide on the basic predictability, independent of the prediction structure
Conclusion
Correlation: Recently executed branches correlated with next branch.
  Either different branches (GA)
  Or different executions of the same branch (PA)
Dependence Prediction: Try to predict whether load depends on stores before addresses are known
Store set: Set of stores that have had dependencies with load in past