
From the companion CD-ROM to the IEEE CS Press book,
"The Anatomy of a Microprocessor: A Systems Perspective," by Shriver & Smith.

Originally published in Proc. 24th Ann. Int'l Symp. Microarchitecture, 1991, pp. 51-61. Copyright © 1991 ACM. All rights reserved.

Two-Level Adaptive Training Branch Prediction

Tse-Yu Yeh and Yale N. Patt
Department of Electrical Engineering and Computer Science
The University of Michigan
Ann Arbor, Michigan 48109-2122

Abstract. High-performance microarchitectures use, among other structures, deep pipelines to help speed up execution. The importance of a good branch predictor to the effectiveness of a deep pipeline in the presence of conditional branches is well-known. In fact, the literature contains proposals for a number of branch prediction schemes. Some are static in that they use opcode information and profiling statistics to make predictions. Others are dynamic in that they use run-time execution history to make predictions.

This paper proposes a new dynamic branch predictor, the Two-Level Adaptive Training scheme, which alters the branch prediction algorithm on the basis of information collected at run-time.

Several configurations of the Two-Level Adaptive Training Branch Predictor are introduced, simulated, and compared to simulations of other known static and dynamic branch prediction schemes. Two-Level Adaptive Training Branch Prediction achieves 97 percent accuracy on nine of the ten SPEC benchmarks, compared to less than 93 percent for other schemes. Since a prediction miss requires flushing of the speculative execution already in progress, the relevant metric is the miss rate. The miss rate is 3 percent for the Two-Level Adaptive Training scheme vs. 7 percent (best case) for the other schemes. This represents more than a 100 percent improvement in reducing the number of pipeline flushes required.

1. Introduction

Pipelining, at least as early as [18] and continuing to the present time [6], has been one of the most effective ways to improve performance on a single processor. On the other hand, branches impede machine performance due to pipeline stalls for unresolved branches. As pipelines get deeper or issuing bandwidth becomes greater, the negative effect of branches on performance increases. Among different types of branches, conditional branches have to wait for the condition to be resolved and the target address to be calculated before the target instruction can be fetched. Unconditional branches have to wait for the target address to be calculated. In conventional computers, instruction issuing stalls until the target address is determined, resulting in pipeline bubbles. When the number of cycles taken to resolve a branch is large, the performance loss due to the pipeline stalls is considerable. There are two ways to reduce the loss: the first is to resolve the branch as early as possible to reduce the instruction fetch pipeline bubbles. The second is to provide fast fetching and decoding of the target instruction to reduce the execution pipeline bubbles. Branch prediction is a way to reduce the execution penalty due to branches by predicting, prefetching and initiating execution of the branch target before the branch is resolved.

Branch prediction schemes can be classified into static schemes and dynamic schemes depending on the information used to make predictions. Static branch prediction schemes can be as simple as predicting that all branches are not taken or predicting that all branches are taken. Predicting that all branches are taken can achieve approximately 68 percent prediction accuracy as reported by Lee and Smith [13]. In the dynamic instructions of the benchmarks used in this study, about 60 percent of conditional branches are taken. Static predictions can also be based on the opcode. Certain classes of branch instructions tend to branch more in one direction than the other.
The branch direction can also be taken into consideration, as in the Backward Taken and Forward Not Taken scheme [16], which is fairly effective in loop-bound programs because it misses only once over all iterations of a loop. However, this scheme does not work well on programs with irregular branches. Profiling [12, 5] can also be used to predict the branch path by measuring the tendencies of the branches and presetting a static prediction bit in the opcode. However, program profiling has to be performed in advance with certain sample data sets, which may have different branch tendencies than the data sets that occur at run-time.

Dynamic branch prediction takes advantage of the knowledge of branches' run-time behavior to make predictions. Lee and Smith proposed a structure they called a Branch Target Buffer [13], which uses 2-bit saturating up-down counters to collect history information which is then used to make predictions. The execution history dynamically changes the state of the branch's entry in the buffer. In their scheme, branch prediction is based on the state of the entry. The Branch Target Buffer design can also be simplified to record only the result of the last execution of the branch. Another dynamic scheme also proposed by Lee and Smith is the Static Training scheme [13], which uses the statistics collected from a pre-run of the program and a history pattern consisting of the last n run-time execution results of the branch to make a prediction. The major disadvantage of the Static Training scheme is that the program has to be run first to accumulate the statistics, and the same statistics may not be applicable to different data sets.

There is serious performance degradation in deep-pipelined and/or superscalar machines caused by prediction misses, due to the large amount of speculative work that has to be discarded [1, 8]. This is the motivation for proposing a new, higher-accuracy dynamic branch prediction scheme. The new scheme uses two levels of branch history information to make predictions. The first level is the history of the last n branches. The second is the branch behavior for the last s occurrences of that unique pattern of the last n branches. The history information is collected on the fly without executing the program beforehand, eliminating the major disadvantage of Static Training Prediction. The scheme proposed here is called Two-Level Adaptive Training Branch Prediction, because predictions are based not only on the record of the last n branches, but moreover on the record of the last s occurrences of the particular record of the last n branches.

Trace-driven simulations were used in this study. The Two-Level Adaptive Training branch prediction scheme as well as the other dynamic and static branch prediction schemes were simulated on the SPEC benchmark suite. By using Two-Level Adaptive Training Branch Prediction, the average prediction accuracy for the benchmarks reaches 97 percent, while most of the other schemes achieve under 93 percent. This represents more than a 100 percent reduction in mispredictions by using the Two-Level Adaptive Training scheme. This reduction can lead directly to a large performance gain on a high-performance processor.

Section two introduces the proposed Two-Level Adaptive Training Branch Prediction scheme. Section three describes implementation methods for the scheme. Section four discusses the methodology used in this study and the simulated prediction models. Section five reports the simulation results of a wide selection of schemes, including both the dynamic and the static branch predictors. Section six contains some concluding remarks.

2. Two-Level Adaptive Training Branch Prediction

The Two-Level Adaptive Training Branch Prediction scheme has the following characteristics:

• Branch prediction is based on the history of branches executed during the current execution of the program.

• Execution history pattern information is collected on the fly during program execution by updating the pattern history information in the branch history pattern table of the predictor. Therefore, no pre-runs of the program are necessary.

2.1 Concept of Two-Level Adaptive Training Branch Prediction

The Two-Level Adaptive Training scheme has two major data structures, the branch history register (HR) and the branch history pattern table (PT), similar to those used in the Static Training scheme of Lee and Smith [13]. In Two-Level Adaptive Training, instead of accumulating statistics by profiling the programs, the execution history information on which branch predictions are based is collected by updating the contents of the history registers and the pattern history bits in the entries of the pattern table, depending on the outcomes of the branches. The history register is a shift register which shifts in bits representing the most recent branch results. All the history registers are contained in a history register table (HRT). The pattern history bits represent the most recent branch results for the particular contents of the history register. Branch predictions are made by checking the pattern history bits in the pattern table entry indexed by the content of the history register for the particular branch that is being predicted.
Since the history register table is indexed by branch instruction addresses, it is called a per-address history register table (PHRT). The pattern table is called a global pattern table, because all the history registers access the same pattern table.

The structure of Two-Level Adaptive Training Branch Prediction is shown in Figure 1. The prediction of a branch B_i is based on the history pattern of the last k outcomes of executing the branch; therefore, k bits are needed in the history register for each branch to keep track of the history. If the branch was taken, a "1" is recorded; if not, a "0" is recorded. Since there are k bits in the history register, at most 2^k different patterns appear in the history register. In order to keep track of the history of the patterns, there are 2^k entries in the pattern table; each entry is indexed by one distinct history pattern.

When a conditional branch B_i is being predicted, the content of its history register HR_i, denoted R_{i,c-k} R_{i,c-k+1} ... R_{i,c-1} for the last k outcomes of executing the branch, is used to address the pattern table. The pattern history bits S_c in the addressed entry PT_{R_{i,c-k} R_{i,c-k+1} ... R_{i,c-1}} are then used for predicting the branch. The prediction of the branch is

    z_c = λ(S_c),    (1)

where λ is the prediction decision function.

After the conditional branch is resolved, the outcome R_{i,c} is shifted left into the history register HR_i in the least significant bit position and is also used to update the pattern history bits in the pattern table entry PT_{R_{i,c-k} R_{i,c-k+1} ... R_{i,c-1}}. After the update, the content of the history register becomes R_{i,c-k+1} R_{i,c-k+2} ... R_{i,c}, and the state represented by the pattern history bits becomes S_{c+1}. The transition of the pattern history bits in the pattern table entry is done by the state transition function δ, which takes the old pattern history bits and the outcome of the branch as inputs and generates the new pattern history bits. Therefore, the new pattern history bits S_{c+1} become

    S_{c+1} = δ(S_c, R_{i,c}).    (2)

A straightforward combinational logic circuit is used to implement the function δ to update the pattern history bits in the entries of the pattern table. The transition function δ, the pattern history bits S, and the outcome R of the branch comprise a finite-state machine, which can be characterized by equations 1 and 2. Since the prediction is based on the pattern history bits, the finite-state machine is a Moore machine, with the output z characterized by equation 1.
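To make the mechanism concrete, the following minimal C sketch models one predict/update step under the definitions above. The table sizes, the flat per-branch indexing, and the use of a 2-bit saturating counter (automaton A2, described below) for the pattern history bits are illustrative assumptions, not details fixed by the paper.

    #include <stdint.h>
    #include <stdbool.h>

    #define K 12                        /* history register length (k bits) */
    #define PT_ENTRIES (1 << K)         /* 2^k pattern table entries */

    static uint16_t hr[1024];           /* history registers; the index stands
                                           for whichever HRT entry the branch
                                           address selects (see Section 3) */
    static uint8_t  pt[PT_ENTRIES];     /* pattern history bits S, here a
                                           2-bit saturating counter */

    /* lambda: z_c = lambda(S_c).  With a 2-bit counter, predict taken
       when the counter value is two or three. */
    static bool predict(int branch) {
        uint16_t pattern = hr[branch] & (PT_ENTRIES - 1);
        return pt[pattern] >= 2;
    }

    /* delta: S_{c+1} = delta(S_c, R_{i,c}); the resolved outcome is then
       shifted into the branch's history register. */
    static void update(int branch, bool taken) {
        uint16_t pattern = hr[branch] & (PT_ENTRIES - 1);
        if (taken  && pt[pattern] < 3) pt[pattern]++;      /* saturate at 3 */
        if (!taken && pt[pattern] > 0) pt[pattern]--;      /* saturate at 0 */
        hr[branch] = (uint16_t)((hr[branch] << 1) | (taken ? 1 : 0));
    }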

Figure 1: The structure of the Two-Level Adaptive Training scheme.

Figure 2: The state transition diagrams of the finite-state machines used for updating the pattern history in
the pattern table entry.

The state transition diagrams of the finite-state machines used in this study for updating the pattern history in the pattern table entry are shown in Figure 2. The automaton Last-Time stores in the pattern history bit only the outcome of the last execution of the branch when the history pattern appeared; the next time the same history pattern appears, the prediction is what happened last time. Only one bit is needed to store the pattern history information. The automaton A1 records the results of the last two times the same history pattern appeared. Only when no taken branch is recorded will the next execution of the branch with the same history pattern be predicted as not taken; otherwise, the branch is predicted as taken. The automaton A2 is a saturating up-down counter, which is also used, but differently, in Lee and Smith's Branch Target Buffer design [13]. The counter is incremented when the branch is taken and decremented when the branch is not taken. The next execution of the branch is predicted as taken when the counter value is greater than or equal to two; otherwise, the branch is predicted as not taken. Automata A3 and A4 are both similar to A2.
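Because δ is a small combinational function, each automaton in Figure 2 can be written as a pair of functions (δ, λ). The sketch below gives one consistent reading of the three automata described in the text; the exact state encodings of Figure 2 are not recoverable from the prose, so the encodings here are assumptions.

    #include <stdbool.h>

    /* Outcome r: 0 = not taken, 1 = taken.  State s is the pattern
       history held in one pattern table entry. */

    /* A1: two bits holding the outcomes of the last two occurrences of
       the pattern.  Predict not taken only when neither recorded outcome
       was taken. */
    static int  a1_delta(int s, int r)  { return ((s << 1) | r) & 3; }
    static bool a1_lambda(int s)        { return s != 0; }

    /* A2: 2-bit saturating up-down counter.  Predict taken when the
       counter is greater than or equal to two. */
    static int  a2_delta(int s, int r)  { return r ? (s < 3 ? s + 1 : 3)
                                                   : (s > 0 ? s - 1 : 0); }
    static bool a2_lambda(int s)        { return s >= 2; }

    /* Last-Time: a single bit holding the previous outcome. */
    static int  lt_delta(int s, int r)  { (void)s; return r; }
    static bool lt_lambda(int s)        { return s != 0; }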
Both Static Training and Two-Level Adaptive Training are dynamic branch predictors, because their predictions are based on run-time information, i.e., the dynamic branch history. The major difference between the two schemes is that the pattern history information in the pattern table changes dynamically in Two-Level Adaptive Training but is preset in Static Training from profiling. In Static Training, the input to the prediction decision function λ for a given branch history pattern is determined before execution; therefore, the output of λ is determined before execution for a given branch history pattern. That is, the same branch predictions are made if the same history pattern appears at different times during execution. Two-Level Adaptive Training, on the other hand, updates the appropriate pattern history information with the actual result of each branch. As a result, given the same branch history pattern, different pattern history information can be found in the pattern table; therefore, there can be different inputs to the prediction decision function for Two-Level Adaptive Training. Predictions of Two-Level Adaptive Training change adaptively in accordance with the program's execution behavior.

Since the pattern history bits change in Two-Level Adaptive Training, the predictor can adjust to the current branch execution behavior of the program to make proper predictions. With the updates, Two-Level Adaptive Training can remain highly accurate over many different programs and data sets. Static Training, on the contrary, may not predict well if changing data sets results in different execution behavior.

3. Implementation Methods

3.1 Implementations of the Per-address History Register Table

In a real implementation it is not feasible to have a history register table big enough for each static branch to have its own history register. Therefore, two approaches are proposed for implementing the per-address history register table.

The first approach is to implement the per-address history register table as a set-associative cache. A fixed number of entries in the table are grouped together as a set. Within a set, the Least-Recently-Used (LRU) algorithm is used for replacement. The lower part of a branch address is used to index into the table, and the higher part is used as a tag, which is recorded in the entry allocated for the branch. The per-address history register table implemented in this way is called the Associative History Register Table (AHRT). When a conditional branch is to be predicted, the branch's entry in the AHRT is located first. If the branch has an entry in the AHRT, the contents of the corresponding history register are used to address the pattern table. If the branch does not have an entry in the AHRT, a new entry is allocated for the branch. There is an extra cost for implementing the tag store in this approach.
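A sketch of the AHRT lookup follows, assuming the 512-entry, 4-way configuration simulated later and simple timestamp-based LRU; the address bit split and field widths are illustrative.

    #include <stdint.h>

    #define SETS 128                         /* 512 entries / 4 ways */
    #define WAYS 4

    struct ahrt_entry {
        uint32_t tag;                        /* upper bits of the branch address */
        uint16_t hist;                       /* the branch's history register */
        uint32_t age;                        /* timestamp for LRU replacement */
        uint8_t  valid;
    };

    static struct ahrt_entry ahrt[SETS][WAYS];
    static uint32_t now;                     /* access counter driving the LRU */

    /* Find (or allocate) the history register for a branch address. */
    static uint16_t *ahrt_lookup(uint32_t addr) {
        uint32_t set = (addr >> 2) & (SETS - 1);   /* lower bits index the set */
        uint32_t tag = addr >> 9;                  /* higher bits are the tag  */
        int victim = 0;

        now++;
        for (int w = 0; w < WAYS; w++) {
            if (ahrt[set][w].valid && ahrt[set][w].tag == tag) {
                ahrt[set][w].age = now;            /* hit: most recently used */
                return &ahrt[set][w].hist;
            }
            if (ahrt[set][w].age < ahrt[set][victim].age)
                victim = w;                        /* track the LRU way */
        }
        /* Miss: allocate the LRU way.  The study initializes history
           registers to all 1's (Section 4.2); that value is reused here. */
        ahrt[set][victim] = (struct ahrt_entry){ .tag = tag, .hist = 0x0FFF,
                                                 .age = now, .valid = 1 };
        return &ahrt[set][victim].hist;
    }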
The second approach is to implement the history register table as a hash table. The address of a conditional branch is used for hashing into the table. The per-address history table using this approach is called the Hash History Register Table (HHRT). Since collisions can occur when accessing a hash table, this implementation results in more interference in the execution history. As one would expect, the prediction accuracy for this approach is lower than what would be obtained with an AHRT, but the cost of the tag store is saved.
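The HHRT counterpart is far simpler. With no tag store there is no miss detection: two branches that hash to the same entry silently share one history register, which is exactly the interference described above. The hash function here is an assumption, since the paper does not specify one.

    #include <stdint.h>

    static uint16_t hhrt_hist[512];

    /* No tag check: collisions mix the histories of different branches. */
    static uint16_t *hhrt_lookup(uint32_t addr) {
        return &hhrt_hist[(addr >> 2) % 512];
    }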
In this study, the above two practical approaches and the Ideal History Register Table (IHRT), in which there is a history register for each static conditional branch, were simulated for the Two-Level Adaptive Training Branch Predictor. The AHRT was simulated with two configurations: 512-entry 4-way set-associative and 256-entry 4-way set-associative. The HHRT was also simulated with 512 entries and 256 entries. The IHRT simulation data is provided to show how much accuracy is lost due to the history interference in the practical history register table designs.
3.2 Prediction Latency

The Two-Level Adaptive Training Branch Predictor needs two sequential table lookups to make a prediction. It is hard to squeeze the two lookups into one cycle, which is usually the requirement for a high-performance processor in determining the next instruction address. The solution to this problem is to perform the pattern table lookup with the updated history pattern of a branch at the time the history register is updated, produce a prediction from the pattern table, and store that prediction as a prediction bit in the history register table alongside the history register for the branch. The next time the branch must be predicted, the prediction is then already available in the history register table, and the pattern table does not have to be accessed that cycle.
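A sketch of this optimization, extending the earlier predict/update sketch: the update step does the pattern table lookup for the new history pattern and caches the result, so prediction itself needs only one HRT access. The pred field and its layout are assumptions; the paper says only that the bit is stored with the history register.

    #include <stdint.h>
    #include <stdbool.h>

    struct hrt_entry {
        uint16_t hist;     /* history register */
        bool     pred;     /* precomputed prediction for the next occurrence */
    };

    /* Predict: a single table access; the pattern table is idle this cycle. */
    static bool predict_fast(const struct hrt_entry *e) {
        return e->pred;
    }

    /* Update: shift in the outcome, update the pattern history bits (A2
       transition assumed), then look up the new pattern now and cache the
       prediction it yields. */
    static void update_fast(struct hrt_entry *e, bool taken,
                            uint8_t pt[], int k) {
        uint16_t mask = (uint16_t)((1u << k) - 1);
        uint16_t old  = e->hist & mask;
        if (taken  && pt[old] < 3) pt[old]++;
        if (!taken && pt[old] > 0) pt[old]--;
        e->hist = (uint16_t)(((e->hist << 1) | (taken ? 1 : 0)) & mask);
        e->pred = pt[e->hist] >= 2;        /* prediction for next time */
    }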
Another problem occurs when the prediction of a branch is required before the result of the previous execution of the branch has been confirmed. This case appears very often when a tight loop is being executed by a deep-pipelined superscalar machine, but not usually otherwise. Since this kind of branch has a high tendency to be taken, the branch is predicted taken, and the machine does not have to stall until the previous branch result is confirmed.

4. Methodology and Simulation Model

Trace-driven simulations were used in this study. A Motorola 88100 instruction-level simulator (ISIM) is used for generating instruction traces. The instruction and address traces are fed into the branch prediction simulator, which decodes instructions, predicts branches, and verifies the predictions against the branch results to collect statistics on branch prediction accuracy.

The branch instructions in the M88100 instruction set [4] are classified into four classes: conditional branches, subroutine return branches, immediate unconditional branches, and unconditional branches on registers. Instructions other than the branches are classified into the non-branch instruction class.

Conditional branches have to wait for condition codes in order to decide the branch targets. Subroutine return branches can be predicted by using a return address stack: a return address is pushed onto the stack when a subroutine is called and is popped as the prediction for the branch target address when a return instruction is detected. The return address prediction may miss when the return address stack overflows. For instruction sets without special instructions for returns from subroutines, the double-stack scheme proposed by Kaeli and Emma in [2] is able to perform the return address prediction. An immediate unconditional branch's target address is calculated by adding the offset in the instruction to the program counter; therefore, the target address can be generated immediately. Unconditional branches on registers have to wait for the register value, which is the target address, to become ready.
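A minimal sketch of such a return address stack follows. The depth and the overflow policy (dropping the oldest entry) are assumptions; the paper notes only that predictions may miss when the stack overflows.

    #include <stdint.h>

    #define RAS_DEPTH 16                 /* assumed depth; not given in the paper */

    static uint32_t ras[RAS_DEPTH];
    static int      ras_top;             /* number of valid entries */

    /* On a subroutine call: push the return address. */
    static void ras_push(uint32_t ret_addr) {
        if (ras_top == RAS_DEPTH) {      /* overflow: drop the oldest entry */
            for (int i = 1; i < RAS_DEPTH; i++)
                ras[i - 1] = ras[i];
            ras_top--;
        }
        ras[ras_top++] = ret_addr;
    }

    /* On a return instruction: pop the predicted target (0 if empty). */
    static uint32_t ras_pop(void) {
        return ras_top > 0 ? ras[--ras_top] : 0;
    }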
4.1 Description of Traces

Nine benchmarks from the SPEC benchmark suite are used in this branch prediction study. Five are floating point benchmarks and four are integer benchmarks. The floating point benchmarks are doduc, fpppp, matrix300, spice2g6, and tomcatv; the integer ones are eqntott, espresso, gcc, and li. Nasa7 is not included
because it takes too long to capture the branch behavior of all seven of its kernels. Among the five floating point benchmarks, matrix300 and tomcatv have repetitive loop execution; thus, a very high prediction accuracy is attainable. The integer benchmarks tend to have many conditional branches and irregular branch behavior. Therefore, it is on the integer benchmarks that the mettle of the branch predictor is tested.

Since this study focuses on the prediction of conditional branches, all benchmarks except fpppp and gcc were simulated for twenty million conditional branch instructions. The benchmarks fpppp and gcc finish execution before twenty million conditional branches are executed. The numbers of dynamic instructions simulated for the benchmarks range from fifty million to 1.8 billion.

The dynamic instruction distribution is shown in Figure 3. About 24 percent of the dynamic instructions for the integer benchmarks and about 5 percent of the dynamic instructions for the floating point benchmarks are branch instructions.

The distribution of the dynamic branch instructions is shown in Figure 4. As can be seen from the distribution, about 80 percent of the dynamic branch instructions are conditional branches. The conditional branch is therefore the branch class that should be studied to improve prediction accuracy. The numbers of static conditional branches in the trace tapes of the benchmarks are listed in Table 1.

Figure 3: Distribution of dynamic instructions.

Figure 4: Distribution of dynamic branch instructions.

Table 1: The number of static conditional branches in each benchmark.

    Benchmark     Static Conditional Branches
    eqntott            277
    gcc               6922
    doduc             1149
    matrix300          213
    tomcatv            370
    espresso           556
    li                 489
    fpppp              653
    spice2g6           606

4.2 Simulation Model

Several configurations were simulated for the Two-Level Adaptive Training scheme. For the per-address history register table (PHRT), two practical implementations, the associative HRT (AHRT) and the hash HRT (HHRT), along with the ideal HRT (IHRT), were simulated. In order to distinguish the different schemes, the naming convention for the branch prediction schemes is Scheme(History(Size, Entry_Content), Pattern(Size, Entry_Content), Data). Scheme specifies the scheme, for example, Two-Level Adaptive Training (AT), Static Training (ST), or Lee and Smith's Branch Target Buffer design (LS). In History(Size, Entry_Content), History is the implementation for keeping the history information of branches, for example, IHRT, AHRT, or HHRT; Size specifies the number of entries in the implementation; and Entry_Content specifies the content of each entry, which can be any automaton shown in Figure 2 or a history register. In Pattern(Size, Entry_Content), Pattern is the implementation for keeping the history information for history patterns, Size specifies the number of entries in the implementation, and Entry_Content specifies the content of each entry, which can be any automaton shown in Figure 2. For Lee and Smith's Branch Target Buffer designs, the Pattern part is not included, because no pattern history information is kept in their designs. Data specifies how the data sets are used: Same means the same data set is used for both training and testing, and Diff means different data sets are used for training and testing. If Data is not specified, no training data set is needed for the scheme, as in the Two-Level Adaptive Training schemes and Lee and Smith's Branch Target Buffer designs. For example, AT(AHRT(512, 12SR), PT(2^12, A2),) denotes a Two-Level Adaptive Training predictor with a 512-entry AHRT of 12-bit shift registers and a 2^12-entry pattern table of A2 automata. The configuration and scheme of each simulation model in this study are listed in Table 2.

Since about 60 percent of branches are taken according to our simulation results, the contents of the history register usually contain more 1's than 0's. Accordingly, all the bits in the history register of each entry in the HRT are initialized to 1's at the beginning of program execution. During execution, when an entry is re-allocated to a different static branch, the history register is not re-initialized.

The pattern history bits in the pattern table entries are also initialized at the beginning of execution. Since taken branches are more likely, for those pattern tables using automata A1, A2, A3, and A4, all entries are initialized to state 3. For Last-Time, all entries are initialized to state 1, so that branches at the beginning of execution will be more likely to be predicted taken.
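As a short sketch, the initialization policy just described (sizes and types follow the earlier sketches):

    #include <stdint.h>
    #include <stdbool.h>

    /* History registers start at all 1's; A1-A4 pattern entries start in
       state 3; Last-Time entries start in state 1.  All choices bias the
       predictor toward taken at the start of execution. */
    static void init_predictor(uint16_t hist[], int hrt_entries,
                               uint8_t pt[], int pt_entries, bool last_time) {
        for (int i = 0; i < hrt_entries; i++)
            hist[i] = 0x0FFF;                 /* 12-bit history, all taken */
        for (int p = 0; p < pt_entries; p++)
            pt[p] = last_time ? 1 : 3;
    }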
In addition to the Two-Level Adaptive Training schemes, Lee and Smith's Static Training schemes and Branch Target Buffer designs, and some other dynamic and static branch prediction schemes were simulated for comparison purposes. Lee and Smith's Static Training scheme is similar to the Two-Level Adaptive Training scheme with an IHRT, but with the important difference that the prediction for a given pattern is pre-determined by profiling. The two practical approaches for the HRT were also simulated for Static Training with the same accessing method introduced above.

Table 2: Configurations of simulated branch predictors.

    Model Name                                 HRT Entries  HRT Entry Content  PT Entries  PT Entry Content
    AT(AHRT(256, 12SR), PT(2^12, A2),)         256          12-bit SR          2^12        ATM A2
    AT(AHRT(512, 12SR), PT(2^12, A2),)         512          12-bit SR          2^12        ATM A2
    AT(AHRT(512, 12SR), PT(2^12, A3),)         512          12-bit SR          2^12        ATM A3
    AT(AHRT(512, 12SR), PT(2^12, A4),)         512          12-bit SR          2^12        ATM A4
    AT(AHRT(512, 12SR), PT(2^12, LT),)         512          12-bit SR          2^12        ATM LT
    AT(AHRT(512, 10SR), PT(2^10, A2),)         512          10-bit SR          2^10        ATM A2
    AT(AHRT(512, 8SR), PT(2^8, A2),)           512          8-bit SR           2^8         ATM A2
    AT(AHRT(512, 6SR), PT(2^6, A2),)           512          6-bit SR           2^6         ATM A2
    AT(HHRT(256, 12SR), PT(2^12, A2),)         256          12-bit SR          2^12        ATM A2
    AT(HHRT(512, 12SR), PT(2^12, A2),)         512          12-bit SR          2^12        ATM A2
    AT(IHRT(12SR), PT(2^12, A2),)              ∞            12-bit SR          2^12        ATM A2
    ST(AHRT(512, 12SR), PT(2^12, PB), Same)    512          12-bit SR          2^12        PB
    ST(HHRT(512, 12SR), PT(2^12, PB), Same)    512          12-bit SR          2^12        PB
    ST(IHRT(12SR), PT(2^12, PB), Same)         ∞            12-bit SR          2^12        PB
    ST(AHRT(512, 12SR), PT(2^12, PB), Diff)    512          12-bit SR          2^12        PB
    ST(HHRT(512, 12SR), PT(2^12, PB), Diff)    512          12-bit SR          2^12        PB
    ST(IHRT(12SR), PT(2^12, PB), Diff)         ∞            12-bit SR          2^12        PB
    LS(AHRT(512, A2),,)                        512          ATM A2
    LS(AHRT(512, LT),,)                        512          ATM LT
    LS(HHRT(512, A2),,)                        512          ATM A2
    LS(HHRT(512, LT),,)                        512          ATM LT
    LS(IHRT(A2),,)                             ∞            ATM A2
    LS(IHRT(LT),,)                             ∞            ATM LT

    (SR = shift register; ATM = automaton; LT = Last-Time; PB = prediction bit.
    The HRT columns describe the history register table and the PT columns the
    pattern table; ∞ denotes the ideal IHRT with one entry per static branch.)
Figure 5: Two-Level Adaptive Training schemes using different state transition automata.

Lee and Smith's Branch Target Buffer designs were simulated with automata A2, A3, A4, and Last-Time. The static branch prediction schemes simulated include Always Taken, Backward Taken and Forward Not Taken, and a simple profiling scheme. The profiling scheme counts the frequency of taken and not-taken outcomes for each static branch in the profiling execution; the predicted direction of a branch is the one the branch takes most frequently. Since the same data set was used for profiling and execution in this study, the prediction accuracy was calculated as the ratio of the sum, over every static branch, of the larger of the branch's two direction counts to the total number of dynamic conditional branch instructions.
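That accuracy computation can be stated compactly; the structure and field names below are illustrative:

    /* Profiling accuracy: the best static prediction for each branch is its
       more frequent direction, so it gets max(taken, not_taken) of that
       branch's executions right.  Accuracy is the sum of those maxima over
       the total number of dynamic conditional branches. */
    struct branch_counts { long taken, not_taken; };

    static double profiling_accuracy(const struct branch_counts b[], int n) {
        long correct = 0, total = 0;
        for (int i = 0; i < n; i++) {
            correct += b[i].taken > b[i].not_taken ? b[i].taken
                                                   : b[i].not_taken;
            total   += b[i].taken + b[i].not_taken;
        }
        return (double)correct / (double)total;
    }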
5. Simulation Results

The simulation results presented in this section were run with the Two-Level Adaptive Training schemes, the Static Training schemes, the Branch Target Buffer designs, and some static branch prediction schemes. Figures 5 through 10 show the prediction accuracy across the nine benchmarks. On the horizontal axis, the category labeled "Tot G Mean" shows the geometric mean across all the benchmarks, "Int G Mean" the geometric mean across all integer benchmarks, and "FP G Mean" the geometric mean across all floating point benchmarks. The vertical axis shows the prediction accuracy, scaled from 76 percent to 100 percent. This section concludes with a comparison between the different branch prediction schemes.
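For reference, the plotted means are geometric rather than arithmetic, e.g.:

    #include <math.h>

    /* Geometric mean of n per-benchmark prediction accuracies. */
    static double geo_mean(const double acc[], int n) {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(acc[i]);
        return exp(log_sum / n);
    }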
5.1 Two-Level Adaptive Training

The Two-Level Adaptive Training schemes were simulated with different state transition automata, different HRT implementations, and different history register lengths to show their effects on prediction accuracy. The simulations of the Two-Level Adaptive Training scheme using an IHRT demonstrate the accuracy the scheme can achieve without history table miss effects and serve as a comparison to Lee and Smith's Static Training scheme, which also uses the ideal history register table.

5.1.1 Effect of State Transition Automata

Figure 5 shows the efficiency of different state transition automata. Four state transition automata, A2, A3, A4, and Last-Time, were simulated. A1 is not included, because early experiments indicated it was inferior to the other four-state automata, A2, A3, and A4. The scheme using Last-Time performs about 1 percent worse than the ones using the other automata, which achieve similar accuracy, around 97 percent. The four-state finite-state machines maintain more history information than Last-Time, which records only what happened the last time; A2, A3, and A4 are therefore more tolerant of noise in the execution history.

In order to show the curves clearly in the following figures, each scheme is shown with the state transition automaton A2, which usually performs the best among the state transition automata used in this study.

5.1.2 Effect of History Register Table Implementation

Figure 6 shows the effects of the HRT implementations on the prediction accuracy of the Two-Level Adaptive Training schemes. Every scheme in the graph was simulated with the same history register length. With equivalent history register lengths, the IHRT scheme performs best, the 512-entry AHRT scheme second, the 512-entry HHRT scheme third, the 256-entry AHRT scheme fourth, and the 256-entry HHRT scheme worst, in decreasing order of HRT hit ratio. This is due to the increasing interference in the branch history as the hit ratio decreases.

5.1.3 Effect of History Register Length

Figure 7 shows the effect of history register length on the prediction accuracy of Two-Level Adaptive Training schemes. Two-Level Adaptive Training schemes using four different history register lengths were simulated. The accuracy increases by about 0.5 percent for every 2 bits added to the history registers. According to the simulation results, increasing the history register length often improves the prediction accuracy until the accuracy asymptote is reached.

5.2 Static Training

Static Training Branch Prediction examines the history pattern of the last n executions of a branch and uses statistics gathered from profiling the program with a training data set (the probabilities that the branch will be taken or not taken given that history pattern) to predict the branch path.

Although the accounting required to gather the training statistics can be done in software, the Static Training scheme needs to keep track of the execution history of every static branch in the program, which requires hardware support. History registers must be used to keep track of the branch execution history of each static branch during run-time. When a branch is being predicted, its recorded history pattern is used to index the branch pattern table, which contains preset branch prediction information. The preset prediction bit is then used for predicting the branch. Because the number of static branches varies from one program to another, the number of history registers required changes, which requires the hardware to offer a table big enough, like the IHRT, to hold all the static branches in the programs. In order to consider the effects of practical implementations, in addition to the IHRT, the two practical HRT implementations used in this study were simulated with the Static Training schemes. The cost to implement Static Training is no less than for Two-Level Adaptive Training, because the history register table and pattern table required by both schemes are similar. However, the state transition logic in the pattern table is simpler for the Static Training scheme.

Figure 6: Two-Level Adaptive Training schemes using different history register table implementations.

Figure 7: Two-Level Adaptive Training schemes using history registers of different lengths.

In order to show the effects of the training data sets, the simulation results are presented both for the schemes (with Same in their names) which were trained and tested on the same data set and for the schemes (with Diff in their names) which were trained and tested on different data sets. All the testing data sets are the same as those used by the other schemes, for a fair comparison. For the schemes which were trained and executed on the same data set, the results are the best the Static Training schemes can achieve with that data set, because the best predictions for the branches are known beforehand.

Five of the nine benchmarks were trained with other applicable data sets. The other four benchmarks, eqntott, matrix300, fpppp, and tomcatv, are excluded because there are no other applicable data sets or the applicable data sets are too similar to each other. The data sets used in training and testing are shown in Table 3.

Table 3: Training and testing data sets of each benchmark.

    Benchmark    Training Data Set      Testing Data Set
    eqntott      NA                     int_pri_3.eqn
    espresso     cps                    bca
    gcc          cexp.i                 dbxout.i
    li           tower of hanoi         eight queens
    doduc        tiny doducin           doducin
    fpppp        NA                     natoms
    matrix300    NA                     NA
    spice2g6     short greycode.in      greycode.in
    tomcatv      NA                     NA

The Static Training schemes with configurations similar to the Two-Level Adaptive Training schemes in Figure 6 are shown in Figure 8. The highest prediction accuracy of the schemes using the same data set for training and execution is about 97 percent. This is achieved by the Static Training scheme using 12-bit history registers and an IHRT. The accuracy is about the same as that achieved by the Two-Level Adaptive Training scheme using 12-bit history registers and a 512-entry 4-way AHRT. However, when different data sets are used for training and execution, the prediction accuracy for gcc and espresso is about 1 percent lower in each case. The drop in accuracy for li is more significant: it is about 5 percent lower. For the floating point benchmarks, the degradations are not so apparent, due to the regular branch behavior of the programs; they are within 0.5 percent. Since the data for the Static Training schemes using different data sets for training and testing is not complete, the average accuracy for those schemes is not graphed.

5.3 Other Schemes

Figure 9 shows the simulation results of Lee and Smith's Branch Target Buffer designs, Backward Taken and Forward Not Taken (BTFN), Always Taken, and the profiling scheme. The Branch Target Buffer designs were simulated with automata A1, A2, A3, A4, and Last-Time. Only the results of the designs using A2 and Last-Time are shown in the figure, because the results of the designs using A3 and A4 are similar to those of the designs using A2. The designs using A1 predict about 2 to 3 percent lower than those using A2. Three buffer configurations, similar to the IHRT, AHRT, and HHRT, were simulated. Using an IHRT in those schemes sets the upper bound, at 93 percent, for the same schemes with practical HRT implementations. Using Last-Time is about 4 percent lower than using A2.

BTFN and Always Taken predict poorly compared to the other schemes; some of their data points fall below 76 percent. The Backward Taken and Forward Not Taken scheme is effective for loop-bound benchmarks like matrix300 and tomcatv, but not for the other benchmarks. For the loop-bound benchmarks, the prediction accuracy is as high as 98 percent. However, for the other benchmarks, its accuracy is often lower than 70 percent. The average accuracy is approximately 69 percent.

The accuracy of the Always Taken scheme changes quite markedly from one benchmark to another. Its average is about 60 percent.

Figure 8: Prediction accuracy of Static Training schemes.

Figure 9: Prediction accuracy of Branch Target Buffer designs, BTFN, Always Taken, and the Profiling
scheme.

Figure 10: Comparison of branch prediction schemes.

The simple profiling scheme simulated here runs the program once to accumulate the statistics of how many times each branch is taken and how many times it is not taken. The prediction bit in the opcode of the branch is set or cleared depending on whether the taken branch count is larger than the not-taken branch count or not. The run-time prediction of the branch is made according to the prediction bit. The average accuracy of this scheme is about 92.5 percent. This scheme is fairly simple, but at the cost of profiling and low prediction accuracy.

5.4 Comparison of Schemes

Figure 10 illustrates the comparison between the schemes mentioned above. The 512-entry 4-way AHRT was chosen for all the uses of the HRT, because it is simple enough to be implemented. The Two-Level Adaptive and Static Training schemes are chosen on the basis of similar costs. At the top is the Two-Level Adaptive Training scheme, whose average prediction accuracy is about 97 percent. As can be seen from the graph, the Static Training scheme predicts about 1 to 5 percent

lower than the top curve. The profiling scheme predicts almost as well as Lee and Smith's Branch Target Buffer design, with accuracy around 92.5 percent. The scheme which predicts a branch with the last result of the execution of the branch achieves about 89 percent accuracy.

6. Concluding Remarks

This paper proposes a new branch predictor, Two-Level Adaptive Training. The scheme predicts a branch by examining the history of the last n branches and the branch behavior for the last s occurrences of that unique pattern of the last n branches.

The Two-Level Adaptive Training schemes were simulated with three HRT configurations: the IHRT, which is an ideal history register table large enough to hold all static branches; the AHRT, which is a set-associative cache; and the HHRT, which is a hash table. The IHRT data was included to obtain upper bounds for each of the other schemes. A scheme using an AHRT usually has higher prediction accuracy than the same scheme using an HHRT of the same size, because the AHRT has a lower miss rate than the HHRT.

Each Two-Level Adaptive Training scheme was simulated with various history register lengths. As seen from the simulation results, prediction accuracy is usually improved by lengthening the history register.

In addition to the Two-Level Adaptive Training scheme, several other dynamic or static branch prediction schemes, such as Lee and Smith's Static Training schemes, Branch Target Buffer designs, Always Taken, Backward Taken and Forward Not Taken, and a simple profiling scheme, were simulated.

The Two-Level Adaptive Training scheme has been shown to have an average prediction accuracy of 97 percent on nine benchmarks from the SPEC benchmark suite. The prediction accuracy is about 4 percent better than most of the other static or dynamic branch prediction schemes, which means more than a 100 percent reduction in the number of pipeline flushes required. Since a prediction miss causes flushing of the speculative execution already in progress, the performance improvement on a high-performance processor from using the Two-Level Adaptive Training scheme can be considerable.

Deep pipelining and superscalar execution are effective methods for exploiting instruction level parallelism to improve single processor performance. This effectiveness, however, depends critically on the accuracy of a good branch predictor. Two-Level Adaptive Training Branch Prediction is proposed as a way to support high performance processors by minimizing the penalty associated with mispredicted branches.

References

[1] M. Butler, T.-Y. Yeh, Y.N. Patt, M. Alsup, H. Scales, and M. Shebanow, "Instruction Level Parallelism is Greater Than Two," Proc. 18th Int'l Symp. Computer Architecture, 1991, pp. 276-286.
[2] D.R. Kaeli and P.G. Emma, "Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns," Proc. 18th Int'l Symp. Computer Architecture, 1991, pp. 34-42.
[3] T.-Y. Yeh, "Two-Level Adaptive Training Branch Prediction," Tech. Report, University of Michigan, 1991.
[4] Motorola Inc., "M88100 User's Manual," Phoenix, Arizona, Mar. 13, 1989.
[5] W.W. Hwu, T.M. Conte, and P.P. Chang, "Comparing Software and Hardware Schemes for Reducing the Cost of Branches," Proc. 16th Int'l Symp. Computer Architecture, May 1989.
[6] N.P. Jouppi and D. Wall, "Available Instruction-Level Parallelism for Superscalar and Super-pipelined Machines," Proc. 3rd Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 1989, pp. 272-282.
[7] D.J. Lilja, "Reducing the Branch Penalty in Pipelined Processors," Computer, July 1988, pp. 47-55.
[8] W.W. Hwu and Y.N. Patt, "Checkpoint Repair for Out-of-order Execution Machines," IEEE Trans. Computers, Dec. 1987, pp. 1496-1514.
[9] P.G. Emma and E.S. Davidson, "Characterization of Branch and Data Dependencies in Programs for Evaluating Pipeline Performance," IEEE Trans. Computers, July 1987, pp. 859-876.
[10] J.A. DeRosa and H.M. Levy, "An Evaluation of Branch Architectures," Proc. 14th Int'l Symp. Computer Architecture, 1987, pp. 1-16.
[11] D.R. Ditzel and H.R. McLellan, "Branch Folding in the CRISP Microprocessor: Reducing Branch Delay to Zero," Proc. 14th Int'l Symp. Computer Architecture, 1987, pp. 2-9.
[12] S. McFarling and J. Hennessy, "Reducing the Cost of Branches," Proc. 13th Int'l Symp. Computer Architecture, 1986, pp. 396-403.
[13] J. Lee and A.J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," Computer, Jan. 1984, pp. 6-22.
[14] T.R. Gross and J. Hennessy, "Optimizing Delayed Branches," Proc. 15th Ann. Workshop on Microprogramming, 1982, pp. 114-120.
[15] D.A. Patterson and C.H. Sequin, "RISC-I: A Reduced Instruction Set VLSI Computer," Proc. 8th Int'l Symp. Computer Architecture, 1981, pp. 443-458.
[16] J.E. Smith, "A Study of Branch Prediction Strategies," Proc. 8th Int'l Symp. Computer Architecture, 1981, pp. 135-148.
[17] L.E. Shar and E.S. Davidson, "A Multiminiprocessor System Implemented Through Pipelining," Computer, Feb. 1974, pp. 42-51.
[18] T.C. Chen, "Parallelism, Pipelining and Computer Efficiency," Computer Design, Vol. 10, No. 1, Jan. 1971, pp. 69-74.
