
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

BranchNet: A Convolutional Neural Network to Predict Hard-To-Predict Branches

Siavash Zangeneh∗, Stephen Pruett∗, Sangkug Lym†, and Yale N. Patt∗
[email protected], [email protected], [email protected], [email protected]
∗ University of Texas at Austin † Nvidia

Abstract—The state-of-the-art branch predictor, TAGE, remains inefficient at identifying correlated branches deep in a noisy global branch history. We argue this inefficiency is a fundamental limitation of runtime branch prediction and not a coincidental artifact due to the design of TAGE. To further improve branch prediction, we need to relax the constraint of runtime-only training and adopt more sophisticated prediction mechanisms. To this end, Tarsa et al. proposed using convolutional neural networks (CNNs) that are trained at compile-time to accurately predict branches that TAGE cannot. Given enough profiling coverage, CNNs learn input-independent branch correlations that can accurately predict branches when running a program with unseen inputs. We build on their work and introduce BranchNet, a CNN with a practical on-chip inference engine tailored to the needs of branch prediction. At runtime, BranchNet predicts a few hard-to-predict branches, while TAGE-SC-L predicts the remaining branches. This hybrid approach reduces the MPKI of SPEC2017 Integer benchmarks by 7.6% (and up to 15.7%) when compared to a very large (impractical) MTAGE-SC baseline, demonstrating a fundamental advantage in the prediction capabilities of BranchNet compared to TAGE-like predictors. We also propose a practical resource-constrained variant of BranchNet that improves the MPKI by 9.6% (and up to 17.7%) compared to a 64KB TAGE-SC-L without increasing the prediction latency.

I. INTRODUCTION

Branch prediction remains a major bottleneck in improving single-thread performance. Even with TAGE-SC-L [1], the state-of-the-art branch predictor, many SPEC2017 Integer benchmarks still suffer from high branch mispredictions per kilo instructions (MPKI), resulting in a significant loss of performance. Moreover, the branch misprediction penalty worsens as processors move towards deeper and wider pipelines [2]–[5]. Unfortunately, fundamental breakthroughs in branch prediction have become rare [6]. All predictors submitted to the 2016 Championship Branch Prediction competition were variants of existing TAGE and Perceptron designs [1], [7]–[10]. Branch prediction research needs new insights to further improve prediction accuracy.

Traditional branch predictors like TAGE [11] and Perceptron [12] are designed to be updated online, i.e., at run time. Thus, their update algorithms have to be simple, cheap, and quick to adapt to execution-phase behavior. While simplicity and adaptivity are necessary for predicting most branches at runtime, limitations in training time and processing power make it difficult for online branch predictors to learn complex correlations in the branch history. To learn these correlations, it is necessary to adopt more sophisticated prediction mechanisms that require more computationally heavy training algorithms and additional compiler support.

Building on the work of Tarsa et al. [13], we propose BranchNet, a convolutional neural network (CNN) that addresses a key weakness of TAGE-like predictors: identifying correlated branches in a noisy global history. When the global history is noisy (i.e., the global history contains uncorrelated branches that constantly change directions, or the positions of correlated branches in the history are nondeterministic), a TAGE-like predictor has to dedicate unique prediction counters to each possible history pattern. Thus, the number of counters needed grows exponentially with the size of the history, making a TAGE-like approach infeasible when correlated branches appear deep into a noisy history. A CNN, however, learns to ignore uncorrelated branches and identify correlated branch patterns anywhere in the history, enabling expressive prediction functions that remain efficient even with a long noisy history.

This increase in prediction capability, however, comes at the cost of computationally expensive training and the need for large training data. Therefore, it is not possible to train BranchNet at runtime. Instead, we use offline (i.e., compile-time) training by profiling targeted applications. Offline training works if a predictor can learn invariant branch relationships that are true at all phases of a program with any inputs. By profiling runs of a program with multiple inputs, one can collect diverse training examples to train powerful machine learning models that can infer such invariant relationships. After offline training, one can attach the trained models (i.e., the collection of weights that represent the branch relationships) to the program binary. At runtime, the branch predictor uses the trained models to predict the directions of these hard-to-predict branches without further training.

Fig. 1. MPKI reduction of using large CNNs to predict a few hard-to-predict branches alongside 64KB TAGE-SC-L (bars segmented into CNN improvements for the top 8, 25, and 50 branches, and remaining mispredictions).

Fig. 1 shows the potential of using large CNNs to predict the top few hard-to-predict branches that benefit the most from CNNs. Each bar shows the MPKI of 64KB TAGE-SC-L when running SPEC2017 Integer Speed benchmarks. The segments in each bar show the mispredictions that could be avoided if we use CNNs to predict up to 8, 25, or 50 static branches. The figure demonstrates that for most benchmarks, predicting 8 branches with CNNs is sufficient for significant overall MPKI reduction, and often predicting more than 25 branches with CNNs has diminishing returns. Thus, we opt for a hybrid approach, using CNNs to predict a few hard-to-predict static branches and using state-of-the-art runtime predictors for all other branches.

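The hybrid arrangement can be thought of as a simple override of the baseline predictor. The sketch below is only an illustration of that selection logic, not the paper's implementation; the tage_scl object and the branchnet_models table are hypothetical stand-ins for the runtime predictor and the per-branch CNN models attached to the binary.

# Illustrative sketch of the hybrid prediction scheme (hypothetical objects).
def hybrid_predict(pc, global_history, tage_scl, branchnet_models):
    """Return a taken/not-taken prediction for the branch at `pc`."""
    model = branchnet_models.get(pc)
    if model is not None:
        # A CNN model was attached to this static branch at compile time.
        return model.predict(global_history)
    # All other branches fall back to the runtime predictor.
    return tage_scl.predict(pc, global_history)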
Tarsa et al. [13] were the first to propose using CNNs with offline training to predict hard-to-predict branches. They showed that (1) CNN branch predictors could identify individual correlated branches in the global branch history, and (2) CNNs could be trained offline to avoid their expensive training algorithms at runtime. BranchNet builds on their approach and tailors the CNN architecture to branch prediction, resulting in higher prediction accuracy and better storage efficiency. The contributions of this paper are:

• We identify a class of branches that are hard to predict by conventional runtime branch predictors but can be predicted accurately by convolutional neural networks. These branches are correlated to the counts of other branches in a noisy global history. When the history is noisy, a table-based predictor (e.g., TAGE) relies on allocating predictor entries for each history pattern, which is infeasible for long histories. However, a CNN expresses the actual branch relationship by counting the explicitly identified correlated branches in the global history.
• We make a new case for branch prediction with offline training. We show that unlike previously proposed offline training techniques, BranchNet relies less on the representativeness of the training data and more on its coverage. The key is exposing enough control flow paths to detect input-independent branch correlations that can be generalized to unseen inputs.
• We propose a CNN architecture tailored to branch prediction requirements in two ways. One, we draw inspiration from traditional branch predictors and use geometric history lengths as inputs. Two, we use sum-pooling layers to aggressively compress the information in the global branch history. Because of its specialized design, BranchNet significantly outperforms its predecessor CNN branch predictor.
• We demonstrate a novel way to approximate wide convolution filters and sum-pooling layers. These approximations enable BranchNet to have the same prediction latency as TAGE-SC-L (4 cycles) and be more storage-efficient.

In the rest of the paper, we first motivate why CNNs with offline training can overcome a key limitation of prior work. We then describe the architecture of BranchNet, the process to train BranchNet offline, the design of an inference engine to make predictions at runtime, and the ISA/OS support needed to use BranchNet. We show that without area constraints, BranchNet reduces the average MPKI of SPEC2017 Integer benchmarks by 7.6% (up to 15.7% for the most improved benchmark) when compared to an unlimited MTAGE-SC baseline. We also show that by using our area/latency-constrained BranchNet inference engine along with a 64KB TAGE-SC-L, we can improve the MPKI by 9.6% (up to 17.7%) and IPC by 1.3% (up to 7.9%) over a 64KB TAGE-SC-L baseline.

II. LIMITATIONS OF STATE-OF-THE-ART

Current-day online (runtime) branch predictors have a key weakness: they need an exponentially growing capacity in the presence of long noisy histories. Even though we argue this weakness is a fundamental consequence of runtime training, prior branch predictors that use offline training (with profiling) do not remedy this weakness. In this section, we describe the phenomenon of noisy history and make a case for why deep learning can succeed where previous attempts at offline training failed.

A. Online Branch Predictors

All state-of-the-art branch predictors are variants of TAGE [11] and the hashed perceptron [14]. While their prediction mechanisms may differ, all conventional predictors hash the global branch and path history into one or more indices to access prediction tables. Ideally, the predictors would allocate unique table entries for each history pattern they observe. In practice, they employ storage-saving mechanisms to avoid redundant allocations for the most common branch behaviors (e.g., TAGE uses an approximation of PPM compression [15]). However, when the global history is noisy, i.e., uncorrelated branches constantly change directions or branches appear in nondeterministic positions in the history, these storage-saving mechanisms do not work well, requiring the online predictors to allocate unique entries for all possible history patterns. The number of entries required to remember all history patterns is an exponential function of the history size. When these entries are not available, the predictors cannot produce accurate predictions. Even if capacity were available, the runtime predictors would require a long time to warm up the large number of table entries, and they can never generalize their predictions to unseen history patterns.

TAGE-SC-L. TAGE-SC-L is the winner of Championship Branch Prediction 2016 [1]. Its main component, TAGE [11], hashes the global branch and path history to look up tables of tagged saturating counters that provide the prediction. It uses multiple counter tables, each corresponding to a unique history length. Longer history tables are used only when shorter history tables cannot provide accurate predictions. If prediction accuracy depends on correlated branches deep into a noisy global history, short history tables do not provide any value, resulting in high allocation pressure on the long history tables. In the worst case, a long-history TAGE table behaves similarly to a global 2-level predictor [16], which requires O(2^n) table entries for an n-bit history input, which is infeasible for a large n.

Perceptron. Perceptron-based branch predictors use a single-layer neural network to learn the correlations of the branch outcome to its history bits. To make a prediction, the predictors add the correlation factors and compare the sum to the branch bias. The Perceptron predictor [12] finds an individual correlation factor for each bit position in the global history. When the positions of branches in the global history are nondeterministic, the correlations to the history bits become unreliable. The hashed perceptron [14] (also used in the newest perceptron-based predictor, Multiperspective Perceptron [8]) learns correlation factors for hashes of the global branch and path history, which mitigates the problem of nondeterministic positions. Still, when the history is noisy, the hashed perceptron suffers from significant aliasing among history patterns and their hashes, resulting in a significant loss of accuracy.

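To make the perceptron mechanism concrete, the following is a minimal sketch of the classic scheme described above: a per-position weighted sum of the history bits is compared against the bias, and the weights are nudged at runtime. It is a generic illustration with an assumed training threshold, not the exact algorithm of any specific hardware predictor.

# Minimal sketch of a single-perceptron branch predictor.
# history: list of +1 (taken) / -1 (not taken) for the most recent branches.
# weights[0] acts as the bias; weights[1:] are per-history-position weights.
def perceptron_predict(weights, history):
    y = weights[0] + sum(w * h for w, h in zip(weights[1:], history))
    return y >= 0  # predict taken if the weighted sum clears the bias

def perceptron_update(weights, history, taken, threshold=30):
    t = 1 if taken else -1
    y = weights[0] + sum(w * h for w, h in zip(weights[1:], history))
    # Train only on a misprediction or when confidence is below the threshold.
    if (y >= 0) != taken or abs(y) <= threshold:
        weights[0] += t
        for i, h in enumerate(history):
            weights[i + 1] += t * h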
Perceptron-based branch predictors have another inherent limitation: because they use a single-layer neural network, they cannot learn non-linear relationships between branches. The fact that all prior perceptron-based predictors use a single neuron is a consequence of runtime training. The training algorithm for multi-layer perceptrons is too expensive to be implementable at runtime.

Since TAGE predictors outperform Perceptron-based predictors on our benchmarks, we use TAGE as our baseline for state-of-the-art runtime branch predictors.

B. Branch Predictors with Offline Training

Prior predictors. Many prior studies propose using offline profiling to improve branch prediction. Some train static predictors that simply learn the statistical bias of branches, which is useful for compile-time optimizations, but not for predicting hard-to-predict branches [17]–[20]. Some works use profiling to train application-specific predictors, resulting in a comparable accuracy to contemporary dynamic branch predictors [21]–[25]. The most recent proposal, Spotlight [25], is a gshare-like predictor [26] that uses profiling to identify the most useful fragment of the global branch history. However, Spotlight is still susceptible to shifts in the history and cannot identify correlated branches that appear in nondeterministic positions in the history. Spotlight's training mechanism also relies on exhaustively comparing all possible views of history, which does not scale when training more complicated predictors with long histories. Similar to Spotlight, most prior predictors are either too simple to help with hard-to-predict branches or there is no known way to use them in conjunction with state-of-the-art online predictors. The only offline method we can easily apply to TAGE-SC-L is to use static branch biases when TAGE-SC-L is not confident. However, we have observed that using static biases only slightly improves the accuracy of TAGE-SC-L (0.3% MPKI reduction for the best benchmark, 0.0% for many) and its benefits are orthogonal to the contributions of BranchNet.

Since state-of-the-art predictors do not benefit from using prior offline techniques, we only compare BranchNet to the best online predictors.

Representativeness vs. coverage. The key advantage of offline training is the removal of time and compute constraints from training, enabling arbitrarily complex training algorithms. However, prior offline predictors never fully leverage this because they only train simple prediction mechanisms that rely on the repetition of exact history patterns. Thus, prior work could only perform well when the input sets used for profiling were representative of future runs, which is challenging. For example, for Spotlight to be effective, the positions of correlated branches in the global history should be exactly the same during profiling and at runtime. However, the positions of branches that appear deep in the global history are rarely generalizable to other inputs, especially for the hard-to-predict branches of state-of-the-art runtime predictors. In contrast, deep learning does not need representative input sets; it just needs enough coverage in the training set to expose generalizable input-independent relationships between branches. As long as the training set includes enough examples of different branch behavior (i.e., different program phases that exercise different control flows), deep learning algorithms can identify input-independent correlations that are always true. Section IV provides a concrete example of this distinction.

Fig. 2. Dataflow in a simple CNN branch predictor.

III. BACKGROUND

Convolutional Neural Networks (CNNs) are state-of-the-art in both image classification [27], [28] and sequential tasks like natural language understanding [29]. When used as a branch predictor, a CNN first identifies important branch patterns in the global history and then classifies the branch as taken or not taken using the identified patterns.

In this section, we provide a high-level description of a simple CNN branch predictor. The goal is to introduce the terminology and provide an intuition for how the CNN components work together to predict branches.

A. CNN Building Blocks

Fig. 2 shows the data flow for branch prediction using a simple CNN. The CNN takes the global branch and path history (program counters and directions of branches) as input, operates on the input using a sequence of operations, and finally produces a prediction. The critical operations are referred to as layers. The layers operate using a collection of trainable parameters (weights). The combination of the CNN layers and their trained parameters forms a CNN model.

Input as one-hot vectors. CNNs assume that the magnitude of each input conveys information about the input. For example, the inputs to a CNN image classifier convey the color intensity of an image at each pixel. However, the inputs to a branch predictor are branch program counters and directions, whose magnitudes convey nothing about the branches. Thus, we need to represent branches in a format that makes it easier for CNNs to distinguish different program counters. One solution is to represent components in the history as one-hot vectors.

Input as embeddings. Alternatively, we can use embeddings to transform each history element into a vector representation trained specifically for the problem we want to solve [30]. For large-enough discrete numbers, embeddings often lead to a more efficient solution than simply using one-hot vectors.¹ For example, Hashemi et al. [31] use embeddings to represent PCs and memory addresses in a model for data prefetching, which is very similar to BranchNet representing branch PCs and directions.

¹ E.g., representing a 12-bit program counter as a one-hot vector requires 2^12 = 4096 trainable weights for a 1-wide convolution filter, but embeddings can still be effective with far fewer weights (e.g., 32).

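The two encodings can be made concrete with a small sketch. Below, each history element is packed into an integer from its direction and its low PC bits; a one-hot encoding then expands that integer into a sparse vector, while an embedding indexes a table of dense vectors. The 12-bit PC width and 32-dimensional embedding are the illustrative values from the footnote, and the embedding table here is random rather than trained.

import numpy as np

PC_BITS = 12                 # p: low PC bits kept per branch (illustrative)
EMBED_DIM = 32               # E: embedding vector size (illustrative)
VOCAB = 1 << (PC_BITS + 1)   # one extra bit for the branch direction

def encode_branch(pc, taken):
    """Pack a branch into a single integer in [0, 2^(p+1))."""
    return ((pc & ((1 << PC_BITS) - 1)) << 1) | int(taken)

def one_hot(index):
    v = np.zeros(VOCAB, dtype=np.float32)
    v[index] = 1.0
    return v

# An embedding is a learned lookup table; random here for illustration only.
embedding_table = np.random.randn(VOCAB, EMBED_DIM).astype(np.float32)

def embed(index):
    return embedding_table[index]

x = encode_branch(pc=0x4005D3, taken=True)
sparse_input = one_hot(x)    # 8192-dimensional, mostly zeros
dense_input = embed(x)       # 32-dimensional dense vector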
Convolutional layers. At a high level, a convolution layer identifies the occurrences of features in its input [30], [32]. The set of weights that are trained to identify a feature is called a filter. The convolution width controls the number of neighboring items that form a feature. For branch prediction, the neighboring items we consider are neighboring entries in the branch/path history. Applying a filter to the inputs produces an output channel. For branch prediction, each filter identifies the presence of a specific correlated branch pattern in the history and marks its location by outputting a non-zero value to the corresponding output channel for the filter.

Sum-pooling layers. A sum-pooling layer reduces the computational requirements of subsequent layers by combining the neighboring outputs of the convolution output channel into a sum [30]. The pooling width defines the number of neighboring outputs that are summed together. Effectively, the outputs (i.e., generated sums) of a sum-pooling layer indicate the occurrence counts of the feature identified in each channel. Sum-pooling reduces the computational needs of the next CNN layers at the cost of discarding fine-grained positions of identified features. As we show later in Section IV, this is often a good trade-off for branch prediction because the exact positions of correlated branches do not matter.

Fully-connected layers. A fully-connected layer is made of multiple neurons, where each neuron learns a linear function of all its inputs [30]. It is possible to cascade fully-connected layers to learn nonlinear functions of convolution outputs. For branch prediction, the fully-connected layers map the identified feature counts to a prediction.

B. Training Algorithm

We train CNNs using a large set of input and expected output pairs (the training set) that define the desired behavior of the model. Conceptually, the training algorithm constantly iterates through the examples in the training set and identifies consistent signals for producing the expected output. Since this algorithm (Stochastic Gradient Descent [33] using Backpropagation [34]) is computationally expensive, the training has to be done offline using profiling. Thus, a good training set for branch prediction should contain examples from multiple input sets and exercise different control flow paths, which enables the CNN to learn invariant branch relationships.

1   int x = 0;
2   for (int i = 0; i < N; ++i) {
3       if (random_condition(alpha)) {   // Branch A
4           // x increments if Branch A is not taken
5           x += 1;
6       }
7   }
8
9   uncorrelated_function();
10
11  for (int j = 0; j < x; ++j) {        // Branch B
12      ...
13  }  // exits when Branch B is taken

Fig. 3. A program with a hard-to-predict branch (Branch B) and a trained CNN that can accurately predict the branch.

Fig. 4. Accuracy of predicting Branch B from Fig. 3 (test set accuracy vs. α, for CNNs trained on training sets 1–3 and a 64KB TAGE-SC-L with runtime training). N ∼ rand(5, 10) in the test set.

IV. MOTIVATION

We use the source code in Fig. 3 to show how CNNs can predict otherwise hard-to-predict branches. The code is a simplified version of a hot segment of the benchmark leela, which is responsible for a significant fraction of the total number of mispredictions.

Can we predict Branch B using the global history? Branch B is the exit branch of the second loop in the source code. The number of iterations of the second loop equals the variable x, which is set by the first loop. Branch B is taken only if the variable j (the loop variable) is equal to the variable x. There is enough information in the global history to infer the values of x and j: x equals the number of not-taken instances of Branch A in the history, and j equals the number of not-taken instances of Branch B. Thus, in theory, a branch predictor should be able to predict this branch accurately.

Why do state-of-the-art predictors fail to predict Branch B? Unfortunately, state-of-the-art predictors have no way of knowing which branches in the global history are actually useful for prediction. Thus, as explained in Section II-A, they hash the whole global history and attempt to learn a prediction for the history pattern as a whole. However, due to the large number of loop iterations, the probabilistic nature of the correlated branches, and the uncorrelated branches close to Branch B, the number of observable history patterns for Branch B is beyond what online predictors can predict. For example, if N=10 and uncorrelated_function has 20 conditional branches, a TAGE-like predictor has to allocate storage for at least 10 × 2^(10+20) history patterns. This amount of storage is infeasible. As a result, Multi-Perspective Perceptron and TAGE-SC-L predict Branch B with 81% accuracy, which is only slightly more accurate than always predicting not taken with 78% accuracy.

Note that even if an online predictor has enough storage to remember all history patterns it sees, it will take a long time to warm up and can never generalize its predictions to the history patterns it has not seen.

How does a CNN predict Branch B accurately? A CNN can directly infer the values of variables x and j from the global history, allowing it to predict Branch B both accurately and efficiently. Fig. 3 shows the outputs of a manually trained CNN that predicts the direction of Branch B 100% accurately. The input on the left is a snapshot of the global history before predicting Branch B. The program counters of branches that are not involved in the prediction (i.e., uncorrelated branches) are marked as X. The history is encoded as one-hot vectors (not shown in the figure for brevity) and fed into a convolutional layer. The convolution width is 1 and there are 2 channels. Channel 0 is trained to identify the not-taken instances of Branch B. Channel 1 is trained to identify not-taken instances of Branch A. We use a sum-pooling layer as wide as the history. Thus, the outputs of sum-pooling are simply the counts of not-taken instances of Branch A and Branch B, which equal the values of variables j and x right before the branch executes. The final fully-connected neuron is trained to predict taken only if j ≥ x (sum-pooled channel 0 ≥ sum-pooled channel 1), resulting in 100% prediction accuracy.

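Because the hand-constructed CNN reduces to counting, its behavior can be reproduced directly in a few lines. The sketch below mirrors the description above: a width-1, two-channel convolution that fires only on not-taken instances of Branch B and Branch A, a sum-pool over the whole history, and a final comparison. The PC constants are hypothetical placeholders for the two correlated branches.

# Behavioral sketch of the manually trained CNN from Fig. 3.
BRANCH_A_PC = 0x400100   # hypothetical PCs for the two correlated branches
BRANCH_B_PC = 0x400200

def predict_branch_b(history):
    """history: iterable of (pc, taken) tuples, oldest to youngest."""
    # Width-1 convolution, channel 0: not-taken Branch B (this count equals j).
    channel0 = [1 if pc == BRANCH_B_PC and not taken else 0 for pc, taken in history]
    # Width-1 convolution, channel 1: not-taken Branch A (this count equals x).
    channel1 = [1 if pc == BRANCH_A_PC and not taken else 0 for pc, taken in history]
    # Sum-pooling as wide as the history turns each channel into a count.
    j = sum(channel0)
    x = sum(channel1)
    # Final fully-connected neuron: predict taken iff j >= x.
    return j >= x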
Does offline training work? Thus far, we have shown that a manually configured CNN can predict Branch B. Now, we show that we can train a CNN offline using profiling. Suppose the random condition in line 3 of Fig. 3 is set using a Bernoulli distribution that is true with probability α, and N is set using a uniform distribution with adjustable minimum and maximum. We collected three different training sets for Branch B with three program inputs: (1) N = 10, α = 1, (2) N ∼ rand(5, 10), α = 1, and (3) N ∼ rand(1, 4), α = 0.5. We then evaluated the accuracy of CNNs trained on each of the three training sets on runs of the program with N ∼ rand(5, 10) and α ranging from 0.2 to 1. We also evaluated the accuracy of a 64KB TAGE-SC-L (with normal runtime training) on the same test sets. Fig. 4 shows the results.

We see that CNNs trained using sets (1) and (2) perform even worse than TAGE-SC-L, especially when α < 1. These two training sets do not expose input-independent branch relationships to the CNN. When training with set (1), the CNN likely learns that the length of the second loop is always 10, which is not true. When training with set (2), since Branch A is always not taken, the CNN might learn that the length of the second loop equals the length of the first loop, which is true only when α = 1. However, the branch behavior in set (3) is diverse enough to expose the input-independent correlation. Thus, the CNN trained with set (3) can predict Branch B with 100% accuracy for runs with any value of α.

Is representativeness of profiling required? No! Note that the range of N in set (3) (N ∼ rand(1, 4)) does not overlap with the range of N in the evaluation runs (N ∼ rand(5, 10)) at all. Yet, the trained model still generalizes perfectly to history patterns it has not seen. The key criterion for a good training set is good coverage of different branch behaviors, not representativeness of history patterns.

Can a CNN predict all branches? A CNN is only accurate if there exist persistent branch relationships that are independent of input data and program phase behavior. Sometimes there is no branch in the global history that can provide any information about the outcome of the target branch. For example, some branches depend on data that was stored in memory long before the branch executes. In this case, there is nothing in the recent branch history that is correlated to the data in memory. Using only global branch history as input, it is impossible to learn any branch prediction strategy offline. Thus, we defer to the baseline online branch predictor to predict these branches.

As discussed earlier in Section I, Fig. 1 shows the MPKI reduction of using large CNN models to predict the top hard-to-predict branches in SPEC2017 benchmarks. Since we use the same input signals for TAGE-SC-L and the CNN models, the difference in prediction accuracy is mainly due to the capability of CNNs in identifying useful information in the global history. Thus, the 19.1% reduction in MPKI can be interpreted as an approximation for the fraction of branch mispredictions due to noisy history. The remaining mispredictions are due to data-dependent or inherently unpredictable branches.

Can other machine learning models predict branches? Any sophisticated learning model can learn invariant branch relationships from large training sets. For example, a Recurrent Neural Network can also predict the same type of hard-to-predict branches as BranchNet. However, we limit the scope of this paper to the study of CNNs for branch prediction because we see a clearer path towards low-latency and storage-efficient branch predictors with CNNs.

V. BRANCHNET

Having described the general principles behind using CNNs for branch prediction, we now present Big-BranchNet and Mini-BranchNet. Both variants of BranchNet are CNN models that we train offline to accurately predict many branches that are hard to predict for traditional branch predictors. We use Big-BranchNet to show the available headroom in using CNNs for branch prediction. Big-BranchNet does not have a practical on-chip inference engine. Mini-BranchNet is a smaller model co-designed with a practical inference engine.

Fig. 5. High-level diagram of the Big-BranchNet CNN architecture for one branch.

A. Big-BranchNet

Big-BranchNet is a pure software model and we do not propose using it as a practical branch predictor. Big-BranchNet is composed of 5 feature-extraction sub-networks and two fully-connected layers. We call each feature-extraction sub-network a slice.² Each slice uses an embedding layer, a convolution layer, and a sum-pooling layer to extract features out of the branch history. Different slices operate on different history lengths, with the history lengths forming a geometric series. The benefits of using geometric history lengths are well studied for branch predictors [35]. Finally, the outputs of the slices are concatenated and fed into two sequential fully-connected layers to make a prediction. Fig. 5 shows a high-level diagram of Big-BranchNet.

We define Big-BranchNet in terms of a set of architecture knobs. For now, we explain the functionality of all Big-BranchNet layers using these architecture knobs. We report the knob values we used for Big-BranchNet and other related CNN models in Table I.

TABLE I
BRANCHNET ARCHITECTURE KNOBS.

Knob | Big-BranchNet | Mini-BranchNet 2KB | Mini-BranchNet 1KB | Mini-BranchNet 0.5KB | Mini-BranchNet 0.25KB | Tarsa-Ternary 5.125KB
H: History sizes | 42,78,150,294,582 | 37,77,152,302,603 | 37,77,152,302,603 | 37,77,152,302,603 | 44,92,182 | 200
C: Convolution channels | 32,32,32,32,32 | 4,5,5,4,4 | 3,3,4,4,3 | 3,3,3,2,2 | 2,2,2 | 32
P: Pooling widths | 3,6,12,24,48 | 7,15,30,60,120 | 7,15,30,60,120 | 7,15,30,60,120 | 7,15,30 | N/A
Use precise pooling | N/A | Y,Y,Y,N,N | Y,Y,N,N,N | Y,Y,N,N,N | Y,Y,N | N/A
p: Branch PC width | 12 | 12 | 12 | 12 | 12 | 7
h: Convolution hash width | N/A | 8 | 8 | 7 | 7 | N/A
E: Embedding dimensions | 32 | 32 | 32 | 32 | 32 | N/A
K: Convolution width | 7 | 3 | 3 | 3 | 3 | 1
N: Hidden neurons | 128, 128 | 10 | 8 | 6 | 4 | N/A
q: Fully-connected quantization | N/A | 4 | 3 | 3 | 3 | 2

History format. We concatenate the direction and the least significant bits of the program counter of each branch to represent it as an integer. Thus, if we use p bits of PC and a history size of H for a slice, the input history is a 1-dimensional array of H integers, ranging from 0 to 2^(p+1) − 1.

Embedding layers. Embeddings transform each branch in the input history to a dense vector of numbers. The size of the embedding vectors is controlled by knob E. Note that, as mentioned in Section III, we could have used one-hot encodings instead of the embeddings, but we found that using embeddings improved the convergence and training time of BranchNet.

Convolutional layers. C_i denotes the number of output channels for slice i and K denotes the convolution width. With more output channels, BranchNet can learn more independent features of the branch history. With a larger K, BranchNet can identify longer sequences of correlated branches. We always use a convolution stride of 1. The convolution operation is followed by batch normalization³ and ReLU activations⁴. The type of activation function is not important for Big-BranchNet's accuracy.

Sum-pooling layers. In each slice, a sum-pooling layer down-samples the convolution outputs with a width and stride of P_i. We use geometric pooling sizes proportional to the history lengths of each slice. Larger pooling widths for longer history lengths work well because the history becomes noisier deeper into the history. Aggressive pooling for features found deep into the history makes BranchNet resilient against shifts in the history by eliminating fine-grained positions of the identified features in the history.

Fully-connected layers. The first fully-connected layer consists of N neurons. Each neuron is connected to the outputs of all slices. The fully-connected neurons are followed by batch normalization and ReLU activation functions. The final fully-connected layer is made of a single neuron with a Sigmoid activation function to make the final prediction.

² In deep learning, sub-networks in a larger neural network are often called branches. We avoid this terminology and use the term "slice" to avoid confusion with branch instructions.
³ Batch normalization converts each output channel to a standard normal distribution, which guides the learning algorithms towards better solutions without affecting the prediction capability [36].
⁴ Activations are non-linear element-wise functions that are applied after convolution and fully-connected operations [30].

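A software rendition of Section V-A is straightforward in a modern deep-learning framework. The sketch below is written in PyTorch, which is an assumption on our part (the paper does not name its training framework). It builds one slice (embedding, 1-D convolution with batch normalization and ReLU, sum-pooling) and combines five slices with the fully-connected layers; the knob values and the two hidden layers of 128 neurons follow the Big-BranchNet column of Table I, and details such as padding are illustrative choices.

import torch
import torch.nn as nn

# Big-BranchNet knobs from Table I (illustrative instantiation).
H = [42, 78, 150, 294, 582]   # history length per slice
P = [3, 6, 12, 24, 48]        # sum-pooling width per slice
C, E, K, N, p = 32, 32, 7, 128, 12

class Slice(nn.Module):
    def __init__(self, pool_width):
        super().__init__()
        self.embed = nn.Embedding(2 ** (p + 1), E)
        self.conv = nn.Conv1d(E, C, kernel_size=K, padding=K // 2)
        self.bn = nn.BatchNorm1d(C)
        self.pool = nn.AvgPool1d(pool_width)  # avg-pool * width = sum-pool
        self.pool_width = pool_width

    def forward(self, x):                       # x: (batch, history) of ints
        x = self.embed(x).transpose(1, 2)       # -> (batch, E, history)
        x = torch.relu(self.bn(self.conv(x)))   # -> (batch, C, history)
        x = self.pool(x) * self.pool_width      # sum-pooled feature counts
        return x.flatten(1)

class BigBranchNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.slices = nn.ModuleList([Slice(pw) for pw in P])
        feats = sum(C * (h // pw) for h, pw in zip(H, P))
        self.fc = nn.Sequential(
            nn.Linear(feats, N), nn.BatchNorm1d(N), nn.ReLU(),
            nn.Linear(N, N), nn.BatchNorm1d(N), nn.ReLU(),
            nn.Linear(N, 1), nn.Sigmoid())

    def forward(self, histories):               # one history tensor per slice
        feats = torch.cat([s(h) for s, h in zip(self.slices, histories)], dim=1)
        return self.fc(feats)                   # probability that the branch is taken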
B. Mini-BranchNet

Mini-BranchNet is a smaller variant of BranchNet that we co-design with an inference engine that could work as a practical branch predictor. For the most part, Mini-BranchNet is similar to Big-BranchNet, with architecture knobs that we tuned to minimize storage and latency overheads. In the rest of this subsection, we describe the key optimizations in designing an inference engine for Mini-BranchNet. We also explain other modifications to the BranchNet CNN architecture as they pertain to the inference engine optimizations.

Fig. 6. Mini-BranchNet inference engine (update pipeline maintaining per-branch convolutional histories, and prediction pipeline selecting a history and applying the fully-connected layers).

Optimization 1: Maintaining convolutional histories. Computing the outputs of the various slices of BranchNet feature-extraction layers involves operations on hundreds of branches in the global history. Instead of doing all these operations at prediction time, the inference engine processes incoming branches one at a time and buffers their down-sampled convolution outputs for future use. We call these buffers convolutional histories. Fig. 6 shows the block diagram of a Mini-BranchNet inference engine that can predict up to 41 static branches in a program. The update pipeline maintains the convolutional histories of all 41 Mini-BranchNet models. To make a prediction, the prediction pipeline simply selects the convolutional histories corresponding to the target branch and computes only the two fully-connected layers. Without this optimization, the predictor would need to compute 4865 convolution operations for each prediction. With this optimization, the engine computes 521 convolution operations every time that a branch is inserted into the global history.

Fig. 7. BranchNet convolutional layer: (a) Big-BranchNet training and inference, (b) Mini-BranchNet training, (c) Mini-BranchNet inference.

Optimization 2: Replacing convolutions with table lookups. A convolution operation on a single window of branches involves a dot product operation. Fig. 7a shows how Big-BranchNet computes one convolution output. Mini-BranchNet eliminates all the arithmetic operations in two steps. During training, instead of embedding each branch in the convolution window independently, it embeds a smaller hash of the branches in a window (Fig. 7b) and uses binarized sigmoid [37] activations instead of ReLU. After training is done, for each possible branch hash, we compute the convolution output (embedding + dot product + normalization + binarized sigmoid), which is exactly 0 or 1. These binary values can now be stored in small tables that the Mini-BranchNet inference engine looks up to get the convolution output for a branch hash (Fig. 7c). No arithmetic operation is needed at runtime, eliminating a 32-dimensional inner product per convolution operation.

Fig. 8. BranchNet 4-wide sum-pooling: (a) Big-BranchNet sum-pooling, (b) Mini-BranchNet inference engine precise pooling, (c) Mini-BranchNet inference engine sliding pooling.

Optimization 3: Using running sum registers. Fig. 8a shows the sum-pooling operation of Big-BranchNet. The Mini-BranchNet inference engine uses two designs to compute the sum-pooling outputs. For shorter history slices, the engine implements precise pooling (Fig. 8b). Precise pooling uses a buffer and a running sum register to constantly compute the output of the most recent pooling window and inserts the pooling outputs into a second set of buffers. As a result, this second set of buffers contains the pooling outputs of overlapping windows. At prediction time, only 1 out of P pooling outputs (recall P = pooling width) is fed into the next layer. The buffer space needed to implement precise pooling grows linearly with the history size. To reduce storage needs for longer history slices, the Mini-BranchNet inference engine uses sliding pooling (Fig. 8c). Sliding pooling accumulates the pooling output of a window over multiple cycles and inserts the output into the pooling buffer once every P cycles. The trade-off is that at prediction time, the most recent convolution outputs may not have formed a complete pooling window. Thus, some of the most recent branches in the history are not used for prediction, and in general, the pooling windows have nondeterministic boundaries. In practice, this is not a problem because we only use sliding pooling in the long-history slices of Mini-BranchNet, which do not rely on fine-grained positions of identified features because of their proportionally wide pooling widths. To account for sliding pooling during training, we randomly discard some of the most recent branches (0 to P − 1 branches) that are fed into the long-history slices. This randomization makes the training algorithm resilient against nondeterministic pooling boundaries at runtime.

Optimization 4: Quantizing fully-connected layers. Mini-BranchNet uses fixed-point arithmetic to compute the outputs of the fully-connected layers. We empirically found that using 3 or 4 bits of precision (denoted by architecture knob q) is sufficient for the sum-pooling outputs and the first fully-connected weights. The outputs of the first fully-connected layer need even less precision and can be binarized. We replace ReLU activations with Tanh to restrict the layer outputs to be between -1 and 1, which helps with quantization [37]. We also insert batch normalization and Tanh after the sum-pooling layer to stabilize the inputs to the fully-connected layers. After training is done, we fuse the batch normalization operations with the fully-connected dot products to eliminate their latency. Since the hidden fully-connected outputs are binarized, we can use a lookup table to eliminate the arithmetic operations of the last layer.

Optimal architecture knobs. It is not storage-efficient to use the same architecture knobs for all hard-to-predict branches. Some branches need larger CNN models for good prediction accuracy, while some can be predicted well with much smaller storage budgets. Thus, we evaluate four Mini-BranchNet models with varying storage budgets per branch. Table I reports the architecture knob values for each configuration.

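A behavioral model helps tie Optimizations 1–3 together: on every committed branch, the engine hashes the newest K-branch window, looks up the precomputed 0/1 convolution outputs for that hash, and folds them into a running sum that is pushed into the pooling buffer once every P branches. The Python sketch below models one sliding-pooled slice in software; the hash function, table contents, and sizes are placeholders, not the exact hardware design.

from collections import deque

class SlidingPooledSlice:
    """Software model of one Mini-BranchNet slice with sliding sum-pooling."""

    def __init__(self, conv_lut, channels, pool_width, num_pooled, hash_bits=8):
        self.conv_lut = conv_lut          # precomputed table: hash -> tuple of 0/1 per channel
        self.pool_width = pool_width      # P: branches accumulated per pooling window
        self.hash_bits = hash_bits        # h: convolution hash width
        self.window = deque(maxlen=3)     # K = 3 most recent (pc, dir) encodings
        self.running_sum = [0] * channels # one running-sum register per channel
        self.count = 0
        # Convolutional history: the last (H / P) pooled sums per channel.
        self.pooled = deque([[0] * channels] * num_pooled, maxlen=num_pooled)

    def update(self, pc, taken):
        """Called once per committed branch (the single-cycle update pipeline)."""
        self.window.append((pc << 1) | int(taken))
        h = self._hash(self.window)
        outputs = self.conv_lut[h]        # table lookup replaces the convolution
        self.running_sum = [s + o for s, o in zip(self.running_sum, outputs)]
        self.count += 1
        if self.count == self.pool_width: # close the sliding pooling window
            self.pooled.append(self.running_sum)
            self.running_sum = [0] * len(self.running_sum)
            self.count = 0

    def features(self):
        """Pooled counts consumed by the fully-connected layers at prediction time."""
        return [c for sums in self.pooled for c in sums]

    def _hash(self, window):
        # Placeholder hash of the K-branch window down to `hash_bits` bits.
        x = 0
        for v in window:
            x = (x * 31 + v) & ((1 << self.hash_bits) - 1)
        return x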
C. On-chip Constraints

Storage. Table II shows the breakdown of the storage needed to predict a single hard-to-predict branch using the Mini-BranchNet inference engine. As an example, it also shows the storage breakdown for a 1KB Mini-BranchNet model.

TABLE II
BREAKDOWN OF THE MINI-BRANCHNET INFERENCE ENGINE STORAGE REQUIREMENTS FOR ONE STATIC BRANCH.

Component | Storage using architecture knobs (bits; sums over slices i) | 1KB config
Convolution tables | Σ 2^h · C_i | 0.53 KB
Precise pooling buffers | Σ C_i · (5 + P_i + q(1 + H_i − P_i)) | 0.11 KB
Sliding pooling buffers | Σ C_i · (7 + log2(P_i) + q · H_i/P_i) | 0.04 KB
Fully-connected weights | qN · (Σ C_i · H_i/P_i + 2N) | 0.29 KB

Prediction latency. Modern processors typically have two tiers of branch predictors: a less accurate light-weight predictor that provides early single-cycle predictions and a heavy-weight predictor that can later correct the prediction if necessary [38]. We envision BranchNet as a heavy-weight predictor with multi-cycle latency.

The critical path of updating the convolutional histories consists of hashing the most recent branches, the convolution table look-up, an addition (7-bit running sum), quantization, and insertion into a convolutional history buffer. Using CACTI [39] for the table lookups and counting the gate delays of the arithmetic operations, we computed the update latency to be roughly equal to the latency of a 64-bit Kogge-Stone adder (21 gate delays). Since 64-bit additions are single-cycle operations in modern processors [40], we estimate that Mini-BranchNet updates are also single-cycle operations. The critical path of the prediction pipeline for a 2KB Mini-BranchNet model includes the weight table look-up, the selection of the convolutional history, and a forward pass of the fully-connected layers (a 4-bit multiply, a 110-input 8-bit adder tree, a comparison, and accessing a 1024-entry table). The prediction latency is roughly 4 times the latency of a 64-bit Kogge-Stone adder. The latency of a 64KB TAGE-SC-L⁵ is 1.1 times the latency of the Mini-BranchNet inference engine. Thus, we conservatively estimate that both Mini-BranchNet and 64KB TAGE-SC-L are 4-cycle predictors.

⁵ The critical path of TAGE-SC-L: accessing banked TAGE tables, tag comparisons, the TAGE mux tree (with a depth of log(n)), selection logic for the alternative prediction and the loop predictor, accessing the statistical corrector GEHL tables, a 20-input 6-bit adder tree, and final selection logic.

Recovery. At the time of a pipeline flush, the convolutional histories and accumulator registers can easily be recovered using a mechanism similar to what already exists to restore long global histories. Extra shadow space is reserved in each register to hold the n most recently shifted out entries of each register. This allows us to recover the state of the predictor by shifting back in the lost state, as long as we restrict our design to allow n branches in flight.

D. Differences with Prior Work

BranchNet builds on the CNN predictor of Tarsa et al. [13]. We refer to their proposed model as Tarsa-Ternary and define Tarsa-Ternary in terms of BranchNet architecture knobs in Table I. BranchNet is different from Tarsa-Ternary in five ways: it uses sum-pooling layers, it approximates 3-wide convolution filters, it uses multiple history lengths, it has an additional fully-connected layer, and it uses heterogeneous model sizes based on the needs of each branch. As a result, Mini-BranchNet is smaller, faster, and more accurate than Tarsa-Ternary.

The sum-pooling layers are critical in enabling BranchNet to be more storage-efficient and have lower prediction latency. Without sum-pooling, each convolutional history in Tarsa-Ternary has to buffer 200 ternary values (proportional to the history length). In contrast, Mini-BranchNet's convolutional histories using sliding sum-pooling need to buffer only five 4-bit values (independent of the history length). Because of the large storage and latency savings of using sum-pooling, Mini-BranchNet can use longer history lengths and a second fully-connected layer (necessary for higher accuracy), while remaining smaller and faster than Tarsa-Ternary.

E. Offline Training Process

We profile target programs with a diverse set of inputs to collect branch traces. We divide these traces into three mutually exclusive sets: the training set, the validation set, and the test set. We then train BranchNet using the training set and the validation set in a 3-step process. First, we select the 100 highest-MPKI branches (hard-to-predict branches) in the validation set. Then, we train one CNN model for each hard-to-predict branch using the training set. Finally, we measure the MPKI reduction of each branch on the validation set and attach the BranchNet models for the most improved branches (up to 41 branches for iso-latency Mini-BranchNet) to the program binary. To measure the final accuracy on unseen inputs, we report the accuracy of BranchNet on the test set.

F. System and ISA Requirements

BranchNet requires collaboration across the software stack for loading trained BranchNet models to the on-chip unit at runtime. We envision an approach where the trained BranchNet models are attached to the program binary and the operating system (OS) is responsible for loading these models into the on-chip BranchNet engine at load time or during context switches. The ISA should provide BranchNet instructions that the OS uses to enable, disable, or update the on-chip engine. As a design choice, these instructions may be implemented as non-blocking instructions to hide the overhead of loading BranchNet models. Lee et al. [41] proposed a similar approach for using the OS to save and restore the state of runtime branch predictors during context switches, albeit for a different goal of mitigating context switch penalties on branch prediction accuracy. We leave a more detailed analysis and evaluation of system and ISA requirements, or alternative approaches, to future work.

VI. RESULTS

In this section, we show the effectiveness of BranchNet on SPEC2017 Integer Speed benchmarks. We chose SPEC benchmarks because we could use various inputs for the same benchmark to test the generalization of offline training to unseen data. Big-BranchNet results demonstrate the available headroom of branch prediction with offline deep learning. Mini-BranchNet results show the benefits of using CNN branch predictors in practical settings.

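The three-step flow of Section V-E can be summarized as a small pipeline. The sketch below is a schematic outline under assumed helper functions; measure_mpki, train_cnn_for_branch, and attach_models_to_binary are hypothetical placeholders, not the released training infrastructure.

# Schematic outline of the offline training flow of Section V-E.
def train_branchnet(training_traces, validation_traces, max_models=41):
    # Step 1: pick the 100 highest-MPKI static branches on the validation set.
    mpki_per_branch = measure_mpki(validation_traces)   # {pc: MPKI under the baseline}
    candidates = sorted(mpki_per_branch, key=mpki_per_branch.get, reverse=True)[:100]

    # Step 2: train one CNN model per candidate branch on the training set.
    models = {pc: train_cnn_for_branch(pc, training_traces) for pc in candidates}

    # Step 3: keep only the branches whose model most improves validation MPKI.
    gains = {pc: mpki_per_branch[pc] - measure_mpki(validation_traces, model=models[pc])[pc]
             for pc in candidates}
    best = sorted(gains, key=gains.get, reverse=True)[:max_models]
    return {pc: models[pc] for pc in best if gains[pc] > 0}

# The selected models are then attached to the program binary, e.g.:
# attach_models_to_binary(binary_path, train_branchnet(train_set, valid_set))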
TABLE III
INPUTS OF SPEC WORKLOADS THAT WE USE TO EVALUATE BRANCHNET.

Dataset | Inputs | Purpose
The training set | Alberta | Training BranchNet models
The validation set | SPEC train | Identifying the best BranchNet branches
The test set | SPEC ref | Final evaluation of accuracy

A. Evaluation Methodology

We run each SPEC2017 Integer Speed benchmark using inputs provided by SPEC (train and ref inputs) and Alberta inputs [42]. We collect up to 10 branch traces from each workload's representative regions using SimPoints [43]. We then train BranchNet models using the process described in Section V-E. Table III shows how we partition the inputs to generate the datasets needed for offline training. All numbers reported in this section refer to measurements on the test set (the SPEC ref inputs), adjusted according to SimPoint weights. Depending on the configuration, our training infrastructure takes between 6 and 18 hours on 4 GPUs to train all BranchNet models for a given benchmark. Training could easily be sped up with more GPUs since BranchNet models are trained in parallel. We have open-sourced our evaluation infrastructure [44].

We make a slight adjustment to the training and validation inputs of gcc and xz. As part of their inputs, these two benchmarks have high-level control flags (optimization settings and compression level, respectively). Since these control flags likely do not change frequently in deployment, it is reasonable to train specialized CNN models targeting runs with certain execution flags. The data inputs remain different in the training, validation, and test sets.

We evaluate the IPC of benchmarks using Scarab [45], an execution-driven, cycle-level simulator for x86-64 processors, which accurately models branch misprediction behavior by fetching and executing wrong-path instructions. We use a 4KB gshare predictor as the single-cycle lightweight predictor and TAGE-SC-L and BranchNet as 4-cycle late predictors. If the prediction of the late predictor disagrees with the early predictor, we flush the frontend and re-fetch the instructions after the branch using the new prediction. We configure the processor to resemble a high-performance processor: 6-wide fetch, 512-entry ROB, 2MB LLC, 10-stage frontend pipeline, execution latency similar to an Intel Skylake processor [40], and DDR4 main memory simulated with Ramulator [46].

Fig. 9. MPKI of MTAGE-SC and Big-BranchNet on SPEC2017 benchmarks (configurations: 64KB TAGE-SC-L, Unlimited GTAGE, Unlimited GTAGE-SC without local components, Unlimited MTAGE-SC, Unlimited MTAGE-SC warmed up, and Unlimited MTAGE-SC + Big-BranchNet).

B. Measuring Headroom with Big-BranchNet

Fig. 9 shows the MPKI reduction of using Big-BranchNet along with MTAGE-SC, the best predictor in the unlimited storage category of CBP 2016 [7]. Adding Big-BranchNet to MTAGE-SC reduces the average MPKI from 3.42 to 3.16 (7.6% reduction). Big-BranchNet improves the overall MPKI by predicting only a few static branches that are hard to predict. On average, BranchNet improves the prediction accuracy of 19 static branches per benchmark, varying from 71 improved static branches in leela to no improved branches in gcc, xalancbmk, and perlbench.

There is a large variance in MPKI reduction among the ten benchmarks. In general, high-MPKI benchmarks tend to have hard-to-predict branches that are more suitable for BranchNet. In particular, the MPKI of benchmarks leela, xz, mcf, and deepsjeng is reduced significantly. On the other hand, the MPKI reduction on omnetpp is small since the main hard-to-predict branches in omnetpp are data-dependent branches, which BranchNet cannot improve. Even worse, there is almost no MPKI gain for gcc. gcc contains many static branches that equally contribute to the total MPKI because of its large code footprint and many execution phases. Our current methodology cannot improve such benchmarks significantly. exchange2, x264, perlbench, and xalancbmk do not have many hard-to-predict branches, so there is little opportunity for BranchNet.

To better understand the limitations of TAGE-SC, Fig. 9 also shows the MPKI of MTAGE-SC without certain key components (GTAGE is the global history component of MTAGE). Most of the accuracy gap between TAGE-SC-L and MTAGE-SC is due to the larger size of the global history TAGE and the Statistical Corrector. This means that high-MPKI benchmarks exert high allocation pressure on the predictor tables, which is a sign that their global histories are indeed noisy. The local history components are also significant for a few benchmarks. We also evaluated MTAGE-SC with an additional warmup phase of 20 million instructions. The MPKI improvement due to warmup is not significant.

Fig. 10. Accuracy of the most improved branches using Big-BranchNet (16 most improved branches of leela and mcf; accuracy and MPKI reduction over MTAGE-SC).

C. Characteristics of Improved Branches

To better understand why BranchNet outperforms TAGE predictors, we have examined the source code of some of the most improved branches in mcf and leela. We describe our observations on the nature of these branches.

Tarsa-Ternary Mini-BranchNet Mini-BranchNet

MPKI Reduction (%)


148.6 KB Tarsa-Float iso-storage iso-latency Big-BranchNet
5
40
4
Relative MPKI

3
Reduction (%)

30
20
2
10
1
0
0
10

put t
19.0%
10
1 inimpoin put ts put ts s
put nts
s
put nts
s
put nts
1 iinmpoin 1 inimpoin
Improvement (%)

8
2 isnimpoi 3 isnimpoi 4 isnimpoi
Relative IPC

1 s s s
3 all all all all
6
4
2
0
Fig. 12. Sensitivity of Big-BranchNet to the training set size.
mcf leel
a xz ng cc tpp e2 64 ch mk an
psje g omne xchang x2 erlben alancb me
dee e p x

MPKI Reduction (%)


12.5

SPEC2017 Average
Fig. 11. MPKI and IPC improvement of BranchNet and other CNN branch 10.0
predictors compared to 64KB TAGE-SC-L. 7.5
5.0
BranchNet does not improve these data-dependent branches. 2.5
However, there are many branches in the body of qsort that 0.0
depend on the results of these comparisons. TAGE does not 1 8 16 24 32 40 48
Total Mini-BranchNet Storage (KB)
learn these relationships because of the noisy nature of the
history when running qsort. BranchNet, on the other hand, Fig. 13. Sensitivity of iso-latency Mini-BranchNet to its storage budget on
learns to ignore the noise and learn the relationships. SPEC2017 benchmarks.
leela spends most of its execution time in evaluating the properties of a Go board. The directions of most mispredicting branches are functions of these properties. In theory, many of these branches should be predictable because there are often other branches in the global history that depend on a shared property. However, there are also many uncorrelated branches, which make the history too noisy. Again, BranchNet circumvents the noisy history by only counting the correlated branches. Although the exact forms of the trained models vary (e.g., the number of required filters, the nonlinear function, the minimum history length), they are conceptually similar to the example we provided in Section IV.
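As a concrete (and hypothetical) sketch of this kind of correlation, consider two branches that test conditions derived from the same board property; none of the function or variable names below come from leela's source. The outcome of the first branch in the global history determines the second, as long as the predictor can ignore the unrelated branches executed in between.

    # Hypothetical sketch (not leela's code): branches correlated through a shared property.
    import random

    def count_liberties(board, point):
        """Toy property of a point: the number of empty neighbors on a dict-based board."""
        x, y = point
        neighbors = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        return sum(1 for n in neighbors if board.get(n) is None)

    def evaluate_point(board, point, history):
        liberties = count_liberties(board, point)             # shared property

        history.append(("A", liberties <= 1))                 # branch A depends on the property

        for _ in range(random.randint(0, 8)):                 # noisy, uncorrelated branches that
            history.append(("noise", random.random() < 0.5))  # separate A and B in the history

        history.append(("B", liberties == 0))                 # branch B: taken implies A was taken
        return history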
Fig. 10 shows the accuracy of the 16 most improved branches of leela and mcf compared to unlimited MTAGE-SC. The branches are sorted by MPKI reduction from left to right. In many cases, Big-BranchNet improves the prediction accuracy to almost 100%. For example, take the fourth branch in leela and the top two branches in mcf: BranchNet improves their accuracies from 79.1%, 73.9%, and 67.4% to 99.98%, 98.4%, and 98.6%, respectively. Even with its large storage budget, MTAGE-SC predicts the same branches with much lower accuracy (91.4%, 78.9%, and 82.6%, respectively). Note that even if BranchNet cannot predict these branches 100% accurately, any improvement in accuracy results in a high MPKI reduction because these branches are among the most frequently mispredicted branches.

Fig. 11. MPKI and IPC improvement of BranchNet and other CNN branch predictors compared to 64KB TAGE-SC-L.

D. Practical Mini-BranchNet Results

Fig. 11 shows the MPKI and IPC improvement of BranchNet and the CNN branch predictor of Tarsa et al. [13] compared to a 64KB TAGE-SC-L baseline. We disable the local history components of the Statistical Corrector because realistic processors avoid maintaining speculative local histories because of design challenges. For each Mini-BranchNet storage budget, we try all possible assignments of the top hard-to-predict branches to configurations and use the best combination of models across all SPEC benchmarks.

We evaluated BranchNet in three settings. The iso-storage setting pairs an 8KB Mini-BranchNet (one 2KB model, one 1KB model, seven 0.5KB models, and six 0.25KB models) with a 56KB TAGE-SC-L (built by decreasing the number of table entries and tag bits of TAGE), showing a 5.5% average MPKI reduction, up to 9.5%, and a 0.6% average IPC improvement, up to 3.9%. The iso-latency setting pairs a 32KB Mini-BranchNet (eight 2KB models, seven 1KB models, ten 0.5KB models, and sixteen 0.25KB models) with the baseline 64KB TAGE-SC-L, showing a 9.6% MPKI reduction on average (up to 17.7%) and a geometric-mean IPC improvement of 1.3% (up to 7.9%). Finally, the Big-BranchNet setting shows the opportunity if it were possible to get the full benefits of floating-point BranchNet models with a 4-cycle latency at runtime: a 2.9% average IPC improvement, up to 19.0% for the best benchmark.
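As a quick arithmetic check, the stated model mixes add up to the quoted budgets: 1 x 2KB + 1 x 1KB + 7 x 0.5KB + 6 x 0.25KB = 8KB for iso-storage (so Mini-BranchNet plus the 56KB TAGE-SC-L matches the 64KB of storage of the baseline), and 8 x 2KB + 7 x 1KB + 10 x 0.5KB + 16 x 0.25KB = 32KB for iso-latency.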
We evaluated two configurations of Tarsa's CNNs. Tarsa-Float is an oracular software model, analogous to Big-BranchNet. Tarsa-Ternary is analogous to iso-latency Mini-BranchNet but with a much larger storage budget (5.125KB per branch, up to 29 static branches). As discussed in Section V-D, the Mini-BranchNet architecture and optimizations allow it to use longer histories and a deeper network with less storage. Thus, as Fig. 11 shows, BranchNet is significantly more accurate than Tarsa's CNNs.

Fig. 12. Sensitivity of Big-BranchNet to the training set size.

Fig. 13. Sensitivity of iso-latency Mini-BranchNet to its storage budget on SPEC2017 benchmarks.

E. Sensitivity Analysis

Fig. 12 shows the MPKI reduction of BranchNet over unlimited MTAGE-SC using different training set sizes. Training with all the SimPoints of one program provides much better coverage of branch behavior compared to using only one SimPoint, which improves the generalization of the trained models. Similarly, using more than one input further improves the MPKI reduction. However, once the coverage is enough to expose all input-independent correlations, using additional inputs shows diminishing returns.

Fig. 13 shows the sensitivity of iso-latency Mini-BranchNet to its storage budget. Since storage beyond 32KB shows diminishing returns, we chose 32KB as the budget for iso-latency Mini-BranchNet.

TABLE IV
PROGRESSION OF MPKI REDUCTION OF leela FROM BIG-BRANCHNET TO MINI-BRANCHNET

    Big-BranchNet: No branch capacity limit           35.8%
    Big-BranchNet: Same branches as Mini-BranchNet    25.1%
    Mini-BranchNet: Floating-point                    20.0%
    Mini-BranchNet: Quantized convolution             18.7%
    Mini-BranchNet: Fully-quantized                   15.7%

Table IV illustrates the negative impact of the various constraints and approximations needed to make Mini-BranchNet practical. Quantization of the convolution layers has the least significant impact on MPKI reduction, which agrees with our intuition that the role of the convolution layer is to simply identify correlated branch patterns, so a binary output should be sufficient.
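The intuition about quantizing the convolution output can be made concrete with a small numpy sketch. This is illustrative only and is not the Mini-BranchNet model or inference engine; the history encoding and kernel below are made up for the example.

    # Illustrative sketch: a convolution over a (simplified) branch-direction history
    # only needs to signal whether a correlated pattern occurred, so its output can
    # be quantized down to a single bit without losing that information.
    import numpy as np

    def conv1d_valid(history, kernel):
        """Plain 1-D sliding-window dot product (ML-style convolution)."""
        n, k = len(history), len(kernel)
        return np.array([np.dot(history[i:i + k], kernel) for i in range(n - k + 1)])

    def binarize(conv_out, threshold):
        """Quantize the convolution output to {0, 1}: 'pattern seen here or not'."""
        return (conv_out > threshold).astype(np.uint8)

    history = np.array([1, 1, 0, 0, 1, 1, 0, 1], dtype=np.float32)  # 1 = taken
    kernel = np.array([+1, +1, -1], dtype=np.float32)               # fires on taken, taken, not-taken
    print(binarize(conv1d_valid(history, kernel), threshold=1.5))   # -> [1 0 0 0 1 0]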
Note: these sensitivity studies were done with a slightly different training setup, resulting in a lower MPKI reduction compared to what we reported in earlier sections.

F. Weaknesses and Future Directions

The poor performance of BranchNet on gcc highlights the first weakness: if the mispredictions of a program are distributed among many static branches, BranchNet cannot significantly improve its accuracy by improving the prediction of just a few branches. Even if we can train an accurate CNN model for each mispredicting branch, we need a large storage area to keep the models. One possible direction is to use the methodology of Predictor Virtualization [47] to maintain all the models in main memory and use either a runtime mechanism or explicit BranchNet instructions to load the BranchNet models into the inference engine as needed.

The large gap between the accuracy of Big-BranchNet and Mini-BranchNet is another weakness that can be improved. Training multi-layer neural networks often relies on some degree of overparameterization, i.e., there is redundancy in trained models. Regularization and pruning are machine-learning tools to combat this inefficiency. Furthermore, static analysis and input preprocessing can help to learn even more specialized prediction functions for branch prediction [48].
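As one example of the pruning direction mentioned above, magnitude pruning simply zeroes out the smallest weights of a trained model; the sketch below is a generic illustration and is not part of the BranchNet training flow described in this paper.

    # Generic magnitude-pruning sketch (illustrative; not BranchNet's training code).
    import numpy as np

    def magnitude_prune(weights, sparsity):
        """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
        flat = np.abs(weights).ravel()
        k = int(sparsity * flat.size)
        if k == 0:
            return weights.copy()
        threshold = np.partition(flat, k - 1)[k - 1]
        pruned = weights.copy()
        pruned[np.abs(pruned) <= threshold] = 0.0
        return pruned

    w = np.random.randn(64, 32).astype(np.float32)   # e.g., a fully-connected layer
    w_pruned = magnitude_prune(w, sparsity=0.75)     # keep roughly the largest 25% of weights
    print(float((w_pruned == 0).mean()))             # ~0.75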
Finally, perhaps the biggest weakness of BranchNet is data-dependent branches. Our goal for BranchNet is to improve branch prediction using the global branch history. However, the combination of deep learning and offline training has the potential to further push branch prediction by using signals other than the global branch history that can help to predict data-dependent branches.

VII. OTHER RELATED WORK

The Store-Load-Branch (SLB) predictor [49] and Probabilistic Branch Support (PBS) [50] improve branch prediction for data-dependent and probabilistic branches. Although the goals of SLB and PBS are different from BranchNet's, they all break the runtime abstraction around branch prediction. SLB uses the compiler to identify data-dependent branches and exposes their dependence chains to the branch predictor. PBS uses programmer directives to change program semantics for probabilistic branches to simplify branch prediction. While different from profiling, these approaches agree with our general assertion that we need to revisit compile-time support for branch prediction.

Gope and Lipasti [51] propose bias-free branch predictors to remove biased and redundant branches from branch histories. BranchNet and the bias-free predictor both target the same problem: not all branches in the history matter. The bias-free predictor addresses this problem using a simple runtime filtering mechanism. However, offline deep learning allows BranchNet to be more powerful.

Evers et al. [52] describe how identifying correlated branches in the history is useful for improving branch prediction accuracy. To this end, Thomas et al. [53] use an on-chip mechanism to track dataflow dependencies to identify correlated branches in the history. However, their mechanism cannot track dataflow through memory. This is particularly problematic for identifying correlations that appear deep into a long history because dataflow dependencies through memory become more likely.

Seznec et al. [54] propose using Inner Most Loop Iteration (IMLI) counters to identify correlated branches in the history. Inspired by the Wormhole predictor [55], IMLI counters are useful for predicting branches within nested loops that are correlated to the branches in the previous iterations of the outer loop. BranchNet is compatible with using IMLI counters as inputs. We leave the study of using IMLI counters as inputs to BranchNet for future work.

VIII. CONCLUSION

BranchNet is a convolutional neural network that we train offline to predict many branches that are fundamentally hard-to-predict for state-of-the-art predictors. State-of-the-art branch predictors fail to accurately predict these branches because they need exponentially large storage to identify branch correlations that appear deep in a noisy global history. In contrast, by using the abundant data and computation available during offline training, BranchNet learns to ignore uncorrelated noise in the history and use only the correlated branches to make a prediction. To show the inherent advantage of CNNs in predicting this category of branches, we have compared Big-BranchNet to MTAGE-SC without considering practical constraints. We have shown that Big-BranchNet outperforms MTAGE-SC on some of the most mispredicting branches among the SPEC2017 benchmarks, resulting in a 7.6% MPKI reduction. Furthermore, to show the effectiveness of CNNs as practical branch predictors, we have compared Mini-BranchNet to 64KB TAGE-SC-L. Without increasing the prediction latency, Mini-BranchNet reduces the MPKI by 9.6%. While the IPC gains of Mini-BranchNet are limited (average 1.3%, up to 7.9%), these results should not be interpreted as a limit to the potential benefits of deep learning for branch prediction. The key takeaway from BranchNet is that offline deep learning is a powerful approach to address the weaknesses of state-of-the-art runtime branch predictors. Further research and innovation can complement BranchNet to remedy the remaining weaknesses of runtime predictors.

ACKNOWLEDGMENT

We thank the anonymous reviewers, members of the HPS research group, and Yongkee Kwon for their feedback and help with improving this paper. We thank Intel, Arm, and Microsoft for their financial support. We acknowledge the Texas Advanced Computing Center (TACC) for providing compute resources.

REFERENCES

[1] A. Seznec, "Tage-sc-l branch predictors again," in 5th JILP Workshop on Computer Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5), 2016.
[2] C.-K. Lin and S. J. Tarsa, "Branch prediction is not a solved problem: Measurements, opportunities, and future directions," in IEEE International Symposium on Workload Characterization, 2019.
[3] P. Michaud, A. Seznec, and S. Jourdan, "An exploration of instruction fetch requirement in out-of-order superscalar processors," International Journal of Parallel Programming, vol. 29, no. 1, pp. 35–58, Feb 2001. [Online]. Available: https://doi.org/10.1023/A:1026431920605
[4] T. S. Karkhanis and J. E. Smith, "A first-order superscalar processor model," in Proceedings of the 31st Annual International Symposium on Computer Architecture, June 2004, pp. 338–349.
[5] E. Sprangle and D. Carmean, "Increasing processor performance by implementing deeper pipelines," in Proceedings of the 29th Annual International Symposium on Computer Architecture. IEEE, 2002, pp. 25–34.
[6] P. Michaud, "An alternative tage-like conditional branch predictor," ACM Trans. Archit. Code Optim., vol. 15, no. 3, Aug. 2018. [Online]. Available: https://doi.org/10.1145/3226098
[7] A. Seznec, "Exploring branch predictability limits with the MTAGE+SC predictor," in 5th JILP Workshop on Computer Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5), Seoul, South Korea, Jun. 2016, p. 4. [Online]. Available: https://hal.inria.fr/hal-01354251
[8] D. Jiménez, "Multiperspective perceptron predictor," in 5th JILP Workshop on Computer Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5), 2016.
[9] ——, "Multiperspective perceptron predictor with tage," in 5th JILP Workshop on Computer Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5), 2016.
[10] S. Pruett, S. Zangeneh, A. Fakhrzadehgan, B. Lin, and Y. Patt, "Dynamically sizing the tage branch predictor," in 5th JILP Workshop on Computer Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5), 2016.
[11] A. Seznec and P. Michaud, "A case for (partially) tagged geometric history length branch prediction," J. Instruction-Level Parallelism, vol. 8, 2006.
[12] D. A. Jimenez and C. Lin, "Dynamic branch prediction with perceptrons," in Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture, Jan 2001, pp. 197–206.
[13] S. J. Tarsa, C.-K. Lin, G. Keskin, G. Chinya, and H. Wang, "Improving branch prediction by modeling global history with convolutional neural networks," in The 2nd International Workshop on AI-assisted Design for Architecture, 2019.
[14] D. Tarjan and K. Skadron, "Merging path and gshare indexing in perceptron branch prediction," ACM Trans. Archit. Code Optim., vol. 2, no. 3, pp. 280–300, Sep. 2005. [Online]. Available: http://doi.acm.org/10.1145/1089008.1089011
[15] J. Cleary and I. Witten, "Data compression using adaptive coding and partial string matching," IEEE Transactions on Communications, vol. 32, no. 4, pp. 396–402, April 1984.
[16] T.-Y. Yeh and Y. N. Patt, "Two-level adaptive training branch prediction," in Proceedings of the 24th Annual International Symposium on Microarchitecture, ser. MICRO 24. New York, NY, USA: ACM, 1991, pp. 51–61.
[17] A. Krall, "Improving semi-static branch prediction by code replication," SIGPLAN Not., vol. 29, no. 6, pp. 97–106, Jun. 1994. [Online]. Available: http://doi.acm.org/10.1145/773473.178252
[18] B. Calder, D. Grunwald, M. Jones, D. Lindsay, J. Martin, M. Mozer, and B. Zorn, "Evidence-based static branch prediction using machine learning," ACM Trans. Program. Lang. Syst., vol. 19, no. 1, pp. 188–222, Jan. 1997. [Online]. Available: http://doi.acm.org.ezproxy.lib.utexas.edu/10.1145/239912.239923
[19] J. R. C. Patterson, "Accurate static branch prediction by value range propagation," SIGPLAN Not., vol. 30, no. 6, pp. 67–78, Jun. 1995. [Online]. Available: http://doi.acm.org.ezproxy.lib.utexas.edu/10.1145/223428.207117
[20] C. Young and M. D. Smith, "Improving the accuracy of static branch prediction using branch correlation," SIGOPS Oper. Syst. Rev., vol. 28, no. 5, pp. 232–241, Nov. 1994. [Online]. Available: http://doi.acm.org/10.1145/381792.195549
[21] D. A. Jimenez, H. L. Hanson, and C. Lin, "Boolean formula-based branch prediction for future technologies," in Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques, Sep. 2001, pp. 97–106.
[22] T. Sherwood and B. Calder, "Automated design of finite state machine predictors for customized processors," in Proceedings 28th Annual International Symposium on Computer Architecture, June 2001, pp. 86–97.
[23] M.-D. Tarlescu, K. B. Theobald, and G. R. Gao, "Elastic history buffer: a low-cost method to improve branch prediction accuracy," in Proceedings International Conference on Computer Design VLSI in Computers and Processors, Oct 1997, pp. 82–87.
[24] J. Stark, M. Evers, and Y. N. Patt, "Variable length path branch prediction," SIGOPS Oper. Syst. Rev., vol. 32, no. 5, pp. 170–179, Oct. 1998. [Online]. Available: http://doi.acm.org/10.1145/384265.291042
[25] S. Verma, B. Maderazo, and D. M. Koppelman, "Spotlight - a low complexity highly accurate profile-based branch predictor," in 2009 IEEE 28th International Performance Computing and Communications Conference, Dec 2009, pp. 239–247.
[26] S. McFarling, "Combining branch predictors," Digital Equipment Corporation, Western Research Lab, Tech. Rep., 1993.
[27] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," 2016.
[28] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2018.00745
[29] F. Wu, A. Fan, A. Baevski, Y. N. Dauphin, and M. Auli, "Pay less attention with lightweight and dynamic convolutions," 2019.
[30] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[31] M. Hashemi, K. Swersky, J. Smith, G. Ayers, H. Litz, J. Chang, C. Kozyrakis, and P. Ranganathan, "Learning memory access patterns," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10–15 Jul 2018, pp. 1919–1928. [Online]. Available: http://proceedings.mlr.press/v80/hashemi18a.html
[32] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989. [Online]. Available: http://dx.doi.org/10.1162/neco.1989.1.4.541
[33] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, no. 3, pp. 400–407, 09 1951. [Online]. Available: https://doi.org/10.1214/aoms/1177729586
[34] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning Representations by Back-Propagating Errors. Cambridge, MA, USA: MIT Press, 1988, pp. 696–699.
[35] A. Seznec, "Analysis of the o-geometric history length branch predictor," in 32nd International Symposium on Computer Architecture (ISCA'05), June 2005, pp. 394–405.
[36] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ser. ICML'15. JMLR.org, 2015, pp. 448–456. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045118.3045167
[37] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," 2016.
[38] D. A. Jimenez, S. W. Keckler, and C. Lin, "The impact of delay on the design of branch predictors," in Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-33), Dec 2000, pp. 67–76.
[39] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing nuca organizations and wiring alternatives for large caches with cacti 6.0," in 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), Dec 2007, pp. 3–14.
[40] A. Fog, "Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for intel, amd and via cpus," Technical University of Denmark, Tech. Rep. [Online]. Available: https://www.agner.org/optimize/instruction_tables.pdf
[41] M.-S. Lee, Y.-J. Kang, J.-W. Lee, and S.-R. Maeng, "Opts: increasing branch prediction accuracy under context switch," Microprocessors and Microsystems, vol. 26, no. 6, pp. 291–300, 2002. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0141933102000418
[42] J. N. Amaral, E. Borin, D. R. Ashley, C. Benedicto, E. Colp, J. H. S. Hoffmam, M. Karpoff, E. Ochoa, M. Redshaw, and R. E. Rodrigues, "The alberta workloads for the spec cpu 2017 benchmark suite," in 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2018, pp. 159–168.
[43] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically characterizing large scale program behavior," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS X. New York, NY, USA: ACM, 2002, pp. 45–57. [Online]. Available: http://doi.acm.org/10.1145/605397.605403

[44] "Branchnet," https://github.com/siavashzk/BranchNet.
[45] "Scarab," https://github.com/hpsresearchgroup/scarab.
[46] Y. Kim, W. Yang, and O. Mutlu, "Ramulator: A fast and extensible dram simulator," IEEE Computer Architecture Letters, vol. 15, no. 1, pp. 45–49, Jan 2016.
[47] I. Burcea, S. Somogyi, A. Moshovos, and B. Falsafi, "Predictor virtualization," in Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XIII. New York, NY, USA: Association for Computing Machinery, 2008, pp. 157–167. [Online]. Available: https://doi.org/10.1145/1346281.1346301
[48] S. Zangeneh, S. Pruett, and Y. Patt, "Branch prediction with multi-layer neural networks: The value of specialization," ML for Computer Architecture and Systems, 2020.
[49] M. Farooq, K. Khubaib, and L. John, "Store-load-branch (slb) predictor: A compiler assisted branch prediction for data dependent branches," Feb. 2013, pp. 59–70.
[50] A. Adileh, D. Lilja, and L. Eeckhout, "Architectural support for probabilistic branches," in 51st Annual IEEE/ACM International Symposium on Microarchitecture, Fukuoka, Japan, Oct 2018.
[51] D. Gope and M. H. Lipasti, "Bias-free branch predictor," in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2014, pp. 521–532.
[52] M. Evers, S. J. Patel, R. S. Chappell, and Y. N. Patt, "An analysis of correlation and predictability: What makes two-level branch predictors work," in Proceedings of the 25th Annual International Symposium on Computer Architecture, ser. ISCA '98. USA: IEEE Computer Society, 1998, pp. 52–61. [Online]. Available: https://doi.org/10.1145/279358.279368
[53] R. Thomas, M. Franklin, C. Wilkerson, and J. Stark, "Improving branch prediction by dynamic dataflow-based identification of correlated branches from a large global history," in Proceedings of the 30th Annual International Symposium on Computer Architecture, ser. ISCA '03. New York, NY, USA: Association for Computing Machinery, 2003, pp. 314–323. [Online]. Available: https://doi.org/10.1145/859618.859655
[54] A. Seznec, J. S. Miguel, and J. Albericio, "The inner most loop iteration counter: A new dimension in branch history," in Proceedings of the 48th International Symposium on Microarchitecture, ser. MICRO-48. New York, NY, USA: ACM, 2015, pp. 347–357. [Online]. Available: http://doi.acm.org/10.1145/2830772.2830831
[55] J. Albericio, J. S. Miguel, N. E. Jerger, and A. Moshovos, "Wormhole: Wisely predicting multidimensional branches," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-47. Washington, DC, USA: IEEE Computer Society, 2014, pp. 509–520. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2014.40

