
Weight Uncertainty in Neural Networks

Charles Blundell CBLUNDELL@GOOGLE.COM
Julien Cornebise JUCOR@GOOGLE.COM
Koray Kavukcuoglu KORAYK@GOOGLE.COM
Daan Wierstra WIERSTRA@GOOGLE.COM

Google DeepMind

arXiv:1505.05424v2 [stat.ML] 21 May 2015

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

We introduce a new, efficient, principled and backpropagation-compatible algorithm for learning a probability distribution on the weights of a neural network, called Bayes by Backprop. It regularises the weights by minimising a compression cost, known as the variational free energy or the expected lower bound on the marginal likelihood. We show that this principled kind of regularisation yields comparable performance to dropout on MNIST classification. We then demonstrate how the learnt uncertainty in the weights can be used to improve generalisation in non-linear regression problems, and how this weight uncertainty can be used to drive the exploration-exploitation trade-off in reinforcement learning.

1. Introduction

Plain feedforward neural networks are prone to overfitting. When applied to supervised or reinforcement learning problems these networks are also often incapable of correctly assessing the uncertainty in the training data, and so make overly confident decisions about the correct class, prediction or action. We shall address both of these concerns by using variational Bayesian learning to introduce uncertainty in the weights of the network. We call our algorithm Bayes by Backprop. We suggest at least three motivations for introducing uncertainty on the weights: 1) regularisation via a compression cost on the weights, 2) richer representations and predictions from cheap model averaging, and 3) exploration in simple reinforcement learning problems such as contextual bandits.

Various regularisation schemes have been developed to prevent overfitting in neural networks, such as early stopping, weight decay, and dropout (Hinton et al., 2012). In this work, we introduce an efficient, principled algorithm for regularisation built upon Bayesian inference on the weights of the network (MacKay, 1992; Buntine and Weigend, 1991; MacKay, 1995). This leads to a simple approximate learning algorithm similar to backpropagation (LeCun, 1985; Rumelhart et al., 1988). We shall demonstrate how this uncertainty can improve predictive performance in regression problems by expressing uncertainty in regions with little or no data, and how this uncertainty can lead to more systematic exploration than ε-greedy in contextual bandit tasks.

All weights in our neural networks are represented by probability distributions over possible values, rather than having a single fixed value as is the norm (see Figure 1). Learnt representations and computations must therefore be robust under perturbation of the weights, but the amount of perturbation each weight exhibits is also learnt in a way that coherently explains variability in the training data. Thus instead of training a single network, the proposed method trains an ensemble of networks, where each network has its weights drawn from a shared, learnt probability distribution. Unlike other ensemble methods, our method typically only doubles the number of parameters yet trains an infinite ensemble using unbiased Monte Carlo estimates of the gradients.

In general, exact Bayesian inference on the weights of a neural network is intractable as the number of parameters is very large and the functional form of a neural network does not lend itself to exact integration. Instead we take a variational approximation to exact Bayesian updates. We build upon the work of Graves (2011), who in turn built upon the work of Hinton and Van Camp (1993). In contrast to this previous work, we show how the gradients of Graves (2011) can be made unbiased and further how this method can be used with non-Gaussian priors. Consequently, Bayes by Backprop attains performance comparable to that of dropout (Hinton et al., 2012). Our method is related to recent methods in deep, generative modelling (Kingma and Welling, 2014; Rezende et al., 2014; Gregor et al., 2014), where variational inference has been applied to stochastic hidden units of an autoencoder. Whilst the number of stochastic hidden units might be in the order of thousands, the number of weights in a neural network is easily two orders of magnitude larger, making the optimisation problem much larger scale. Uncertainty in the hidden units allows the expression of uncertainty about a particular observation; uncertainty in the weights is complementary in that it captures uncertainty about which neural network is appropriate, leading to regularisation of the weights and model averaging.

This uncertainty can be used to drive exploration in contextual bandit problems using Thompson sampling (Thompson, 1933; Chapelle and Li, 2011; Agrawal and Goyal, 2012; May et al., 2012). Weights with greater uncertainty introduce more variability into the decisions made by the network, leading naturally to exploration. As more data are observed, the uncertainty can decrease, allowing the decisions made by the network to become more deterministic as the environment is better understood.

The remainder of the paper is organised as follows: Section 2 introduces notation and standard learning in neural networks, Section 3 describes variational Bayesian learning for neural networks and our contributions, Section 4 describes the application to contextual bandit problems, whilst Section 5 contains empirical results on a classification, a regression and a bandit problem. We conclude with a brief discussion in Section 6.

Figure 1. Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribution, as provided by Bayes by Backprop.

2. Point Estimates of Neural Networks

We view a neural network as a probabilistic model P(y|x, w): given an input x ∈ R^p, a neural network assigns a probability to each possible output y ∈ Y, using the set of parameters or weights w. For classification, Y is a set of classes and P(y|x, w) is a categorical distribution – this corresponds to the cross-entropy or softmax loss, when the parameters of the categorical distribution are passed through the exponential function then re-normalised. For regression, Y is R and P(y|x, w) is a Gaussian distribution – this corresponds to a squared loss.

Inputs x are mapped onto the parameters of a distribution on Y by several successive layers of linear transformation (given by w) interleaved with element-wise non-linear transforms.

The weights can be learnt by maximum likelihood estimation (MLE): given a set of training examples D = (x_i, y_i)_i, the MLE weights w^MLE are given by:

w^MLE = arg max_w log P(D|w)
      = arg max_w \sum_i log P(y_i|x_i, w).

This is typically achieved by gradient descent (e.g., backpropagation), where we assume that log P(D|w) is differentiable in w.

Regularisation can be introduced by placing a prior upon the weights w and finding the maximum a posteriori (MAP) weights w^MAP:

w^MAP = arg max_w log P(w|D)
      = arg max_w [ log P(D|w) + log P(w) ].

If w are given a Gaussian prior, this yields L2 regularisation (or weight decay). If w are given a Laplace prior, then L1 regularisation is recovered.
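To make this concrete, here is a minimal sketch of the MAP objective in PyTorch; the network `Net`, the helper `map_loss` and the prior scale are our own illustrative choices rather than code from the paper. With the Gaussian prior shown, the log-prior term reduces to the familiar L2 weight-decay penalty.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """A plain feedforward network modelling P(y|x, w) via softmax logits."""
    def __init__(self, d_in, d_hidden, n_classes):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

def map_loss(model, x, y, prior_sigma=1.0):
    """Negative log-posterior: -log P(D|w) - log P(w), up to constants."""
    nll = F.cross_entropy(model(x), y, reduction="sum")       # -log P(D|w)
    log_prior = sum((-p.pow(2) / (2 * prior_sigma ** 2)).sum()
                    for p in model.parameters())              # Gaussian log P(w)
    return nll - log_prior
```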
3. Being Bayesian by Backpropagation

Bayesian inference for neural networks calculates the posterior distribution of the weights given the training data, P(w|D). This distribution answers predictive queries about unseen data by taking expectations: the predictive distribution of an unknown label ŷ of a test data item x̂ is given by P(ŷ|x̂) = E_{P(w|D)}[P(ŷ|x̂, w)]. Each possible configuration of the weights, weighted according to the posterior distribution, makes a prediction about the unknown label given the test data item x̂. Thus taking an expectation under the posterior distribution on weights is equivalent to using an ensemble of an uncountably infinite number of neural networks. Unfortunately, this is intractable for neural networks of any practical size.

Previously Hinton and Van Camp (1993) and Graves (2011) suggested finding a variational approximation to the Bayesian posterior distribution on the weights. Variational learning finds the parameters θ of a distribution on the weights q(w|θ) that minimises the Kullback-Leibler (KL) divergence with the true Bayesian posterior on the weights:

θ* = arg min_θ KL[q(w|θ) || P(w|D)]
   = arg min_θ ∫ q(w|θ) log [ q(w|θ) / (P(w) P(D|w)) ] dw
   = arg min_θ KL[q(w|θ) || P(w)] − E_{q(w|θ)}[log P(D|w)].

The resulting cost function is variously known as the variational free energy (Neal and Hinton, 1998; Yedidia et al., 2000; Friston et al., 2007) or the expected lower bound (Saul et al., 1996; Neal and Hinton, 1998; Jaakkola and Jordan, 2000). For simplicity we shall denote it as

F(D, θ) = KL[q(w|θ) || P(w)] − E_{q(w|θ)}[log P(D|w)].   (1)

The cost function of (1) is a sum of a data-dependent part, which we shall refer to as the likelihood cost, and a prior-dependent part, which we shall refer to as the complexity cost. The cost function embodies a trade-off between satisfying the complexity of the data D and satisfying the simplicity prior P(w). (1) is also readily given an information theoretic interpretation as a minimum description length cost (Hinton and Van Camp, 1993; Graves, 2011). Exactly minimising this cost naïvely is computationally prohibitive. Instead gradient descent and various approximations are used.

3.1. Unbiased Monte Carlo gradients

Under certain conditions, the derivative of an expectation can be expressed as the expectation of a derivative:

Proposition 1. Let ε be a random variable having a probability density given by q(ε), and let w = t(θ, ε) where t(θ, ε) is a deterministic function. Suppose further that the marginal probability density of w, q(w|θ), is such that q(ε)dε = q(w|θ)dw. Then for a function f with derivatives in w:

∂/∂θ E_{q(w|θ)}[f(w, θ)] = E_{q(ε)}[ (∂f(w, θ)/∂w)(∂w/∂θ) + ∂f(w, θ)/∂θ ].

Proof.

∂/∂θ E_{q(w|θ)}[f(w, θ)] = ∂/∂θ ∫ f(w, θ) q(w|θ) dw
                        = ∂/∂θ ∫ f(w, θ) q(ε) dε
                        = E_{q(ε)}[ (∂f(w, θ)/∂w)(∂w/∂θ) + ∂f(w, θ)/∂θ ].

The deterministic function t(θ, ε) transforms a sample of parameter-free noise ε and the variational posterior parameters θ into a sample from the variational posterior. Below we shall see how this transform works in practice for the Gaussian case.

We apply Proposition 1 to the optimisation problem in (1): let f(w, θ) = log q(w|θ) − log P(w)P(D|w). Using Monte Carlo sampling to evaluate the expectations, a backpropagation-like (LeCun, 1985; Rumelhart et al., 1988) algorithm is obtained for variational Bayesian inference in neural networks – Bayes by Backprop – which uses unbiased estimates of gradients of the cost in (1) to learn a distribution over the weights of a neural network.

Proposition 1 is a generalisation of the Gaussian re-parameterisation trick (Opper and Archambeau, 2009; Kingma and Welling, 2014; Rezende et al., 2014) used for latent variable models, applied to Bayesian learning of neural networks. Our work differs from this previous work in several significant ways. Bayes by Backprop operates on weights (of which there are a great many), whilst most previous work applies this method to learning distributions on stochastic hidden units (of which there are far fewer than the number of weights). Titsias and Lázaro-Gredilla (2014) considered a large-scale logistic regression task. Unlike previous work, we do not use the closed form of the complexity cost (or entropic part): not requiring a closed form of the complexity cost allows many more combinations of prior and variational posterior families. Indeed this scheme is also simple to implement and allows prior/posterior combinations to be interchanged. We approximate the exact cost (1) as:

F(D, θ) ≈ \sum_{i=1}^{n} [ log q(w^(i)|θ) − log P(w^(i)) − log P(D|w^(i)) ]   (2)

where w^(i) denotes the ith Monte Carlo sample drawn from the variational posterior q(w^(i)|θ). Note that every term of this approximate cost depends upon the particular weights drawn from the variational posterior: this is an instance of a variance reduction technique known as common random numbers (Owen, 2013). In previous work, where a closed form complexity cost or closed form entropy term are used, part of the cost is sensitive to particular draws from the posterior, whilst the closed form part is oblivious. Since each additive term in the approximate cost in (2) uses the same weight samples, the gradients of (2) are only affected by the parts of the posterior distribution characterised by the weight samples. In practice, we did not find this to perform better than using a closed form KL (where it could be computed), but we did not find it to perform worse. In our experiments, we found that a prior without an easy-to-compute closed form complexity cost performed best.
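As a minimal sketch (not the paper's implementation), the approximate cost (2) can be computed for a single vector of Gaussian weights as follows; `log_prior_fn` and `log_likelihood_fn` are hypothetical placeholders standing in for log P(w) and log P(D|w).

```python
import torch

def monte_carlo_cost(mu, rho, log_prior_fn, log_likelihood_fn, n_samples=1):
    """Estimate F(D, theta) as in (2), averaged over posterior samples."""
    sigma = torch.log1p(torch.exp(rho))          # sigma = log(1 + exp(rho)) > 0
    cost = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(mu)               # parameter-free noise, eps ~ N(0, I)
        w = mu + sigma * eps                     # w = t(theta, eps)
        log_q = torch.distributions.Normal(mu, sigma).log_prob(w).sum()
        cost = cost + log_q - log_prior_fn(w) - log_likelihood_fn(w)
    return cost / n_samples                      # (2) uses a sum; averaging only rescales
```

Because w is produced by the transform t(θ, ε), differentiating this cost with automatic differentiation recovers both terms of the unbiased gradient in Proposition 1.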
3.2. Gaussian variational posterior

Suppose that the variational posterior is a diagonal Gaussian distribution; then a sample of the weights w can be obtained by sampling a unit Gaussian, shifting it by a mean µ and scaling by a standard deviation σ. We parameterise the standard deviation pointwise as σ = log(1 + exp(ρ)), so that σ is always non-negative. The variational posterior parameters are θ = (µ, ρ). Thus the transform from a sample of parameter-free noise ε and the variational posterior parameters that yields a posterior sample of the weights w is w = t(θ, ε) = µ + log(1 + exp(ρ)) ◦ ε, where ◦ is pointwise multiplication. Each step of optimisation proceeds as follows:

1. Sample ε ∼ N(0, I).
2. Let w = µ + log(1 + exp(ρ)) ◦ ε.
3. Let θ = (µ, ρ).
4. Let f(w, θ) = log q(w|θ) − log P(w)P(D|w).
5. Calculate the gradient with respect to the mean:

   ∆_µ = ∂f(w, θ)/∂w + ∂f(w, θ)/∂µ.   (3)

6. Calculate the gradient with respect to the standard deviation parameter ρ:

   ∆_ρ = (∂f(w, θ)/∂w) · ε/(1 + exp(−ρ)) + ∂f(w, θ)/∂ρ.   (4)

7. Update the variational parameters:

   µ ← µ − α∆_µ   (5)
   ρ ← ρ − α∆_ρ.   (6)

Note that the ∂f(w, θ)/∂w term of the gradients for the mean and standard deviation is shared, and is exactly the gradient found by the usual backpropagation algorithm on a neural network. Thus, remarkably, to learn both the mean and the standard deviation we must simply calculate the usual gradients found by backpropagation, and then scale and shift them as above.
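A sketch of a single optimisation step, assuming `mu` and `rho` are leaf tensors created with `requires_grad=True` and that `log_prior_fn`/`log_likelihood_fn` are the placeholders used earlier. Automatic differentiation of f(w, θ) through the transform reproduces the gradients (3) and (4), so steps 5 and 6 need no hand-derived scaling:

```python
import torch

def bayes_by_backprop_step(mu, rho, log_prior_fn, log_likelihood_fn, lr=1e-3):
    eps = torch.randn_like(mu)                          # step 1: eps ~ N(0, I)
    sigma = torch.log1p(torch.exp(rho))
    w = mu + sigma * eps                                # step 2: posterior sample
    log_q = torch.distributions.Normal(mu, sigma).log_prob(w).sum()
    f = log_q - log_prior_fn(w) - log_likelihood_fn(w)  # step 4: f(w, theta)
    f.backward()                                        # steps 5-6: delta_mu, delta_rho
    with torch.no_grad():
        mu -= lr * mu.grad                              # update (5)
        rho -= lr * rho.grad                            # update (6)
    mu.grad.zero_()
    rho.grad.zero_()
```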
3.3. Scale mixture prior

Having liberated our algorithm from the confines of Gaussian priors and posteriors, we propose a simple scale mixture prior combined with a diagonal Gaussian posterior. The diagonal Gaussian posterior is largely free from numerical issues, and two degrees of freedom per weight only increases the number of parameters to optimise by a factor of two, whilst giving each weight its own quantity of uncertainty.

We pick a fixed-form prior and do not adjust its hyperparameters during training, instead picking them by cross-validation where possible. Empirically we found optimising the parameters of a prior P(w) (by taking derivatives of (1)) to not be useful, and to yield worse results. Graves (2011) and Titsias and Lázaro-Gredilla (2014) propose closed form updates of the prior hyperparameters. Changing the prior based upon the data that it is meant to regularise is known as empirical Bayes, and there is much debate as to its validity (Gelman, 2008). A reason why it fails for Bayes by Backprop is as follows: it can be easier to change the prior parameters (of which there are few) than it is to change the posterior parameters (of which there are many), and so very quickly the prior parameters try to capture the empirical distribution of the weights at the beginning of learning. Thus the prior learns to fit poor initial parameters quickly, and makes the cost in (1) less willing to move away from poor initial parameters. This can yield slow convergence, introduce strange local minima and result in poor performance.

We propose using a scale mixture of two Gaussian densities as the prior. Each density is zero mean, but with differing variances:

P(w) = \prod_j [ π N(w_j | 0, σ1²) + (1 − π) N(w_j | 0, σ2²) ]   (7)

where w_j is the jth weight of the network, N(x|µ, σ²) is the Gaussian density evaluated at x with mean µ and variance σ², and σ1² and σ2² are the variances of the mixture components. The first mixture component of the prior is given a larger variance than the second, σ1 > σ2, providing a heavier tail in the prior density than a plain Gaussian prior. The second mixture component has a small variance σ2 ≪ 1, causing many of the weights to a priori tightly concentrate around zero. Our prior resembles a spike-and-slab prior (Mitchell and Beauchamp, 1988; George and McCulloch, 1993; Chipman, 1996), where instead all the prior parameters are shared among all the weights. This makes the prior more amenable to use during optimisation by stochastic gradient descent, and avoids the need for prior parameter optimisation based upon training data.
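A sketch of the log-density of the prior (7), computed stably per weight; the defaults mirror one point of the hyperparameter grid reported in Section 5.1 (π = 1/2, −log σ1 = 0, −log σ2 = 6), but the function itself is our own illustrative construction:

```python
import math
import torch

def log_scale_mixture_prior(w, pi=0.5, sigma1=1.0, sigma2=math.exp(-6)):
    """log P(w) = sum_j log[pi N(w_j|0, sigma1^2) + (1 - pi) N(w_j|0, sigma2^2)]."""
    comp1 = torch.distributions.Normal(0.0, sigma1).log_prob(w)
    comp2 = torch.distributions.Normal(0.0, sigma2).log_prob(w)
    log_mix = torch.logaddexp(math.log(pi) + comp1, math.log(1.0 - pi) + comp2)
    return log_mix.sum()
```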
3.4. Minibatches and KL re-weighting

As several authors have noted, the cost in (1) is amenable to minibatch optimisation, often used with neural networks: for each epoch of optimisation the training data D is randomly split into a partition of M equally-sized subsets, D_1, D_2, ..., D_M. Each gradient is averaged over all elements in one of these minibatches; a trade-off between a fully batched gradient descent and a fully stochastic gradient descent. Graves (2011) proposes minimising the minibatch cost for minibatch i = 1, 2, ..., M:

F_i^EQ(D_i, θ) = (1/M) KL[q(w|θ) || P(w)] − E_{q(w|θ)}[log P(D_i|w)].   (8)

This is equivalent to the cost in (1) since \sum_i F_i^EQ(D_i, θ) = F(D, θ). There are many ways to weight the complexity cost relative to the likelihood cost on each minibatch. For example, if minibatches are partitioned uniformly at random, the KL cost can be distributed non-uniformly among the minibatches at each epoch. Let π ∈ [0, 1]^M with \sum_{i=1}^{M} π_i = 1, and define:

F_i^π(D_i, θ) = π_i KL[q(w|θ) || P(w)] − E_{q(w|θ)}[log P(D_i|w)].   (9)

Then E_M[\sum_{i=1}^{M} F_i^π(D_i, θ)] = F(D, θ), where E_M denotes an expectation over the random partitioning of minibatches. In particular, we found the scheme π_i = 2^(M−i)/(2^M − 1) to work well: the first few minibatches are heavily influenced by the complexity cost, whilst the later minibatches are largely influenced by the data. At the beginning of learning this is particularly useful, as for the first few minibatches changes in the weights due to the data are slight; as more data are seen, the data become more influential and the prior less influential.
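A sketch of this weighting scheme; `kl_weights` is our own helper name:

```python
def kl_weights(M):
    """pi_i = 2^(M-i) / (2^M - 1) for minibatches i = 1, ..., M; sums to 1."""
    denom = 2 ** M - 1
    return [2 ** (M - i) / denom for i in range(1, M + 1)]

# For example, kl_weights(4) == [8/15, 4/15, 2/15, 1/15]: the first minibatch
# carries over half of the KL cost and the last one almost none, so the
# expected sum of minibatch costs (9) over an epoch recovers the full cost (1).
```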

4. Contextual Bandits

Contextual bandits are simple reinforcement learning problems without persistent state (Li et al., 2010; Filippi et al., 2010). At each step an agent is presented with a context x and a choice of one of K possible actions a. Different actions yield different unknown rewards r. The agent must pick the action that yields the highest expected reward. The context is assumed to be presented independent of any previous actions, rewards or contexts.

An agent builds a model of the distribution of the rewards conditioned upon the action and the context: P(r|x, a, w). It then uses this model to pick its action. Note, importantly, that an agent does not know what reward it could have received for an action that it did not pick, a difficulty often known as "the absence of counterfactual". As the agent's model P(r|x, a, w) is trained online, based upon the actions chosen, unless exploratory actions are taken, the agent may perform suboptimally.

4.1. Thompson Sampling for Neural Networks

As in Section 2, P(r|x, a, w) can be modelled by a neural network where w are the weights of the neural network. However, if this network is simply fit to observations and the action with the highest expected reward taken at each time, the agent can under-explore, as it may miss more rewarding actions.¹

Thompson sampling (Thompson, 1933) is a popular means of picking an action that trades off between exploitation (picking the best known action) and exploration (picking what might be a suboptimal arm to learn more). Thompson sampling usually necessitates a Bayesian treatment of the model parameters. At each step, Thompson sampling draws a new set of parameters and then picks the action relative to those parameters. This can be seen as a kind of stochastic hypothesis testing: more probable parameters are drawn more often and thus refuted or confirmed the fastest. More concretely, Thompson sampling proceeds as follows:

1. Sample a new set of parameters for the model.
2. Pick the action with the highest expected reward according to the sampled parameters.
3. Update the model. Go to 1.

There is an increasing literature concerning the efficacy and justification of this means of exploration (Chapelle and Li, 2011; May et al., 2012; Kaufmann et al., 2012; Agrawal and Goyal, 2012; 2013). Thompson sampling is easily adapted to neural networks using the variational posterior found in Section 3 (a code sketch follows this list):

1. Sample weights from the variational posterior: w ∼ q(w|θ).
2. Receive the context x.
3. Pick the action a that maximises E_{P(r|x,a,w)}[r].
4. Receive reward r.
5. Update variational parameters θ according to Section 3. Go to 1.

Note that it is possible, as mentioned in Section 3.1, to decrease the variance of the gradient estimates, trading off for reduced exploration, by using more than one Monte Carlo sample, using the corresponding networks as an ensemble and picking the action by maximising the average of the expectations.

Initially the variational posterior will be close to the prior, and actions will be picked uniformly. As the agent takes actions, the variational posterior will begin to converge, uncertainty on many parameters can decrease, and so action selection will become more deterministic, focusing on the high expected reward actions discovered so far. It is known that variational methods under-estimate uncertainty (Minka, 2001; 2005; Bishop, 2006), which could lead to under-exploration and premature convergence, but we did not find this in practice.

¹ Interestingly, depending upon how w are initialised and the mean of the prior used during MAP inference, it is sometimes possible to obtain another heuristic for the exploration-exploitation trade-off: optimism-under-uncertainty. We leave this for future investigation.
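A sketch of one interaction of the adapted Thompson sampling loop above; `env`, `sample_weights`, `expected_reward` and `bayes_by_backprop_update` are hypothetical stand-ins for the environment interface, the posterior sampler, the reward network and an update as in Section 3:

```python
def thompson_sampling_step(env, theta, actions,
                           sample_weights, expected_reward,
                           bayes_by_backprop_update):
    w = sample_weights(theta)                                    # 1. w ~ q(w|theta)
    x = env.context()                                            # 2. receive context
    a = max(actions, key=lambda act: expected_reward(x, act, w)) # 3. greedy w.r.t. sample
    r = env.step(a)                                              # 4. receive reward
    return bayes_by_backprop_update(theta, x, a, r)              # 5. update theta; repeat
```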
5. Experiments

We present some empirical evaluation of the methods proposed above: on MNIST classification, on a non-linear regression task, and on a contextual bandits task.

5.1. Classification on MNIST

We trained networks of various sizes on the MNIST digits dataset (LeCun and Cortes, 1998), consisting of 60,000 training and 10,000 testing pixel images of size 28 by 28. Each image is labelled with its corresponding number (between zero and nine, inclusive). We preprocessed the pixels by dividing values by 126. Many methods have been proposed to improve results on MNIST: generative pre-training, convolutions, distortions, etc. Here we shall focus on improving the performance of an ordinary feedforward neural network without using any of these methods. We used a network of two hidden layers of rectified linear units (Nair and Hinton, 2010; Glorot et al., 2011), and a softmax output layer with 10 units, one for each possible label.

According to Hinton et al. (2012), the best published feedforward neural network classification result on MNIST (excluding those using data set augmentation, convolutions, etc.) is 1.6% (Simard et al., 2003), whilst dropout with an L2 regulariser attains errors around 1.3%. Results from Bayes by Backprop are shown in Table 1, for various sized networks, using either a Gaussian or Gaussian scale mixture prior. Performance is comparable to that of dropout, perhaps slightly better, as also seen in Figure 2. Note that we trained on 50,000 digits and used 10,000 digits as a validation set, whilst Hinton et al. (2012) trained on 60,000 digits and did not use a validation set. We used the validation set to pick the best hyperparameters (learning rate, number of gradients to average) and so we also repeated this protocol for dropout and SGD (stochastic gradient descent on the MLE objective in Section 2). We considered learning rates of 10^-3, 10^-4 and 10^-5 with minibatches of size 128. For Bayes by Backprop, we averaged over either 1, 2, 5, or 10 samples and considered π ∈ {1/4, 1/2, 3/4}, −log σ1 ∈ {0, 1, 2} and −log σ2 ∈ {6, 7, 8}.

Table 1. Classification Error Rates on MNIST. ⋆ indicates the result used an ensemble of 5 networks.

Method                                         # Units/Layer   # Weights   Test Error
SGD, no regularisation (Simard et al., 2003)   800             1.3m        1.6%
SGD, dropout (Hinton et al., 2012)                                         ≈ 1.3%
SGD, dropconnect (Wan et al., 2013)            800             1.3m        1.2%⋆
SGD                                            400             500k        1.83%
                                               800             1.3m        1.84%
                                               1200            2.4m        1.88%
SGD, dropout                                   400             500k        1.51%
                                               800             1.3m        1.33%
                                               1200            2.4m        1.36%
Bayes by Backprop, Gaussian                    400             500k        1.82%
                                               800             1.3m        1.99%
                                               1200            2.4m        2.04%
Bayes by Backprop, Scale mixture               400             500k        1.36%
                                               800             1.3m        1.34%
                                               1200            2.4m        1.32%

Figure 2 shows the learning curves on the test set for Bayes by Backprop, dropout and SGD on a network with two layers of 1200 rectified linear units. As can be seen, SGD converges the quickest, initially obtaining a low test error and then overfitting. Bayes by Backprop and dropout converge at similar rates (although each iteration of Bayes by Backprop is more expensive than dropout – around two times slower). Eventually Bayes by Backprop converges on a better test error than dropout after 600 epochs.

Figure 2. Test error on MNIST as training progresses, for Bayes by Backprop, Dropout and vanilla SGD.

Figure 3 shows density estimates of the weights. The Bayes by Backprop weights are sampled from the variational posterior, and the dropout weights are those used at test time. Interestingly the regularised networks found by dropout and Bayes by Backprop have a greater range, with fewer weights centred at zero, than those found by SGD. Bayes by Backprop uses the greatest range of weights.

Figure 3. Histogram of the trained weights of the neural network, for Dropout, plain SGD, and samples from Bayes by Backprop.
In Table 2, we examine the effect of replacing the variational posterior on some of the weights with a constant zero, so as to determine the level of redundancy in the network found by Bayes by Backprop. We took a Bayes by Backprop trained network with two layers of 1200 units² and ordered the weights by their signal-to-noise ratio (|µ_i|/σ_i). We removed the weights with the lowest signal-to-noise ratio. As can be seen in Table 2, even when 95% of the weights are removed the network still performs well, with a significant drop in performance once 98% of the weights have been removed.

Table 2. Classification Errors after Weight pruning.

Proportion removed   # Weights   Test Error
0%                   2.4m        1.24%
50%                  1.2m        1.24%
75%                  600k        1.24%
95%                  120k        1.29%
98%                  48k         1.39%

In Figure 4 we examine the distribution of the signal-to-noise ratio relative to the cut-off used in Table 2. The lower plot shows the cumulative distribution of the signal-to-noise ratio, whilst the top plot shows the density. From the density plot we see there are two modes of signal-to-noise ratios, and from the CDF we see that the 75% cut-off separates these two peaks. These two peaks coincide with the drop in performance in Table 2 from 1.24% to 1.29%, suggesting that the signal-to-noise heuristic is in fact related to the test performance.

Figure 4. Density and CDF of the Signal-to-Noise ratio over all weights in the network. The red line denotes the 75% cut-off.

It is interesting to contrast this weight removal approach to obtaining a fast, smaller, sparse network for prediction after training with the approach taken by distillation (Hinton et al., 2014), which requires an extra stage of training to obtain a compressed prediction model. As with distillation, our method begins with an ensemble (one for each possible assignment of the weights). However, unlike distillation, we can simply obtain a subset of this ensemble by using the probabilistic properties of the learnt weight distributions to gracefully prune the ensemble down into a smaller network. Thus even though networks trained by Bayes by Backprop may have twice as many weights, the number of parameters that actually need to be stored at run time can be far fewer. Graves (2011) also considered pruning weights using the signal-to-noise ratio, but demonstrated results on a network 20 times smaller and did not prune as high a proportion of weights (at most 11%) whilst still maintaining good test performance. The scale mixture prior used by Bayes by Backprop encourages a broad spread of the weights. Many of these weights can be successfully pruned without impacting performance significantly.

² We used a network from the end of training rather than picking a network with a low validation cost found during training, hence the disparity with the results in Table 1. The lowest test error observed was 1.12%.
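A sketch of the pruning heuristic behind Table 2, assuming the per-weight variational parameters µ and ρ of Section 3.2; the function name and the clamping of k are our own choices:

```python
import torch

def prune_by_signal_to_noise(mu, rho, drop_fraction=0.95):
    """Zero out the weights with the lowest |mu|/sigma; return pruned means."""
    sigma = torch.log1p(torch.exp(rho))
    snr = mu.abs() / sigma                           # signal-to-noise ratio per weight
    k = max(1, int(drop_fraction * snr.numel()))
    threshold = snr.flatten().kthvalue(k).values     # k-th smallest SNR as cut-off
    return torch.where(snr > threshold, mu, torch.zeros_like(mu))
```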
5.2. Regression curves

We generated training data from the curve

y = x + 0.3 sin(2π(x + ε)) + 0.3 sin(4π(x + ε)) + ε,

where ε ∼ N(0, 0.02). Figure 5 shows two examples of fitting a neural network to these data, minimising a conditional Gaussian loss. Note that in the regions of the input space where there are no data, the ordinary neural network reduces the variance to zero and chooses to fit a particular function, even though there are many possible extrapolations of the training data. On the left, Bayesian model averaging affects predictions: where there are no data, the confidence intervals diverge, reflecting there being many possible extrapolations. In this case Bayes by Backprop prefers to be uncertain where there are no nearby data, as opposed to a standard neural network which can be overly confident.

Figure 5. Regression of noisy data with interquartile ranges. Black crosses are training samples. Red lines are median predictions. Blue/purple region is the interquartile range. Left: Bayes by Backprop neural network. Right: standard neural network.
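A sketch of a generator for these training data; since the text does not state the input range or whether 0.02 denotes a variance or a standard deviation, both are assumptions here (we take inputs from [0, 0.5] and read N(0, 0.02) as variance 0.02):

```python
import numpy as np

def make_regression_data(n, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 0.5, size=n)               # assumed input interval
    eps = rng.normal(0.0, np.sqrt(0.02), size=n)    # eps ~ N(0, 0.02), read as variance
    y = (x + 0.3 * np.sin(2 * np.pi * (x + eps))
           + 0.3 * np.sin(4 * np.pi * (x + eps)) + eps)
    return x, y
```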
5.3. Bandits on Mushroom Task

We take the UCI Mushrooms data set (Bache and Lichman, 2013), and cast it as a bandit task, similar to Guez (2015, Chapter 6). Each mushroom has a set of features, which we treat as the context for the bandit, and is labelled as edible or poisonous. An agent can either eat or not eat a mushroom. If an agent eats an edible mushroom, then it receives a reward of 5. If an agent eats a poisonous mushroom, then with probability 1/2 it receives a reward of −35, otherwise a reward of 5. If an agent elects not to eat a mushroom, it receives a reward of 0. Thus an agent expects to receive a reward of 5 for eating an edible mushroom, but an expected reward of −15 for eating a poisonous mushroom.

Regret measures the difference between the reward achievable by an oracle and the reward received by an agent. In this case, an oracle will always receive a reward of 5 for an edible mushroom, or 0 for a poisonous mushroom. We take the cumulative sum of regret of several agents and show them in Figure 6.

Each agent uses a neural network with two hidden layers of 100 rectified linear units. The input to the network is a vector consisting of the mushroom features (context) and a one-of-K encoding of the action. The output of the network is a single scalar, representing the expected reward of the given action in the given context. For Bayes by Backprop, we sampled the weights twice and averaged two of these outputs to obtain the expected reward for action selection. We kept the last 4096 reward, context and action tuples in a buffer, and trained the networks using randomly drawn minibatches of size 64 for 64 training steps (64 × 64 = 4096) per interaction with the mushroom bandit. A common heuristic for trading off exploration vs. exploitation is to follow an ε-greedy policy: with probability ε propose a uniformly random action, otherwise pick the best action according to the neural network.

Figure 6 compares a Bayes by Backprop agent with three ε-greedy agents, for values of ε of 0% (pure greedy), 1%, and 5%. An ε of 5% appears to over-explore, whereas a purely greedy agent does poorly at the beginning, greedily electing to eat nothing, but then does much better once it has seen enough data. It seems that non-local function approximation updates allow the greedy agent to explore: for the first 1,000 steps the agent eats nothing, but after approximately 1,000 steps the greedy agent suddenly decides to eat mushrooms. The Bayes by Backprop agent explores from the beginning, both eating and ignoring mushrooms, and quickly converges on eating and not eating with an almost perfect rate (hence the almost flat regret).

Figure 6. Comparison of cumulative regret of various agents on the mushroom bandit task, averaged over five runs. Lower is better.
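A sketch of the reward scheme described above; the helper name is ours, and the expected values can be checked directly: 5 for an edible mushroom and 0.5·(−35) + 0.5·5 = −15 for a poisonous one.

```python
import random

def mushroom_reward(eat, edible, rng=random):
    if not eat:
        return 0                              # declining a mushroom always yields 0
    if edible:
        return 5                              # edible mushrooms are always worth 5
    return -35 if rng.random() < 0.5 else 5   # poisonous: E[r] = -15
```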
6. Discussion

We introduced a new algorithm for learning neural networks with uncertainty on the weights called Bayes by Backprop. It optimises a well-defined objective function to learn a distribution on the weights of a neural network. The algorithm achieves good results in several domains. When classifying MNIST digits, performance from Bayes by Backprop is comparable to that of dropout. We demonstrated on a simple non-linear regression problem that the uncertainty introduced allows the network to make more reasonable predictions about unseen data. Finally, for contextual bandits, we showed how Bayes by Backprop can automatically learn how to trade off exploration and exploitation. Since Bayes by Backprop simply uses gradient updates, it can readily be scaled using multi-machine optimisation schemes such as asynchronous SGD (Dean et al., 2012). Furthermore, all of the operations used are readily implemented on a GPU.

Acknowledgements

The authors would like to thank Ivo Danihelka, Danilo Rezende, Silvia Chiappa, Alex Graves, Remi Munos, Ben Coppin, Liam Clancy, James Kirkpatrick, Shakir Mohamed, David Pfau, and Theophane Weber for useful discussions and comments.

References
Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference On Learning Theory (COLT), volume 23, pages 39.1–39.26, 2012.

Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 99–107, 2013.

Kevin Bache and Moshe Lichman. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2013. URL http://archive.ics.uci.edu/ml.

Christopher M Bishop. Section 10.1: variational inference. In Pattern Recognition and Machine Learning. Springer, 2006. ISBN 9780387310732.

Wray L Buntine and Andreas S Weigend. Bayesian back-propagation. Complex Systems, 5(6):603–643, 1991.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NIPS), pages 2249–2257, 2011.

Hugh Chipman. Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1):17–36, 1996.

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NIPS), pages 1223–1231, 2012.

Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems (NIPS), pages 586–594, 2010.

Karl Friston, Jérémie Mattout, Nelson Trujillo-Barreto, John Ashburner, and Will Penny. Variational free energy and the Laplace approximation. NeuroImage, 34(1):220–234, 2007.

Andrew Gelman. Objections to Bayesian statistics. Bayesian Analysis, 3:445–450, 2008. ISSN 1931-6690. doi: 10.1214/08-BA318.

Edward I George and Robert E McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15, pages 315–323, 2011.

Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 2348–2356, 2011.

Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep AutoRegressive networks. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1242–1250, 2014.

Arthur Guez. Sample-Based Search Methods For Bayes-Adaptive Planning. PhD thesis, University College London, 2015.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS 2014 Deep Learning and Representation Learning Workshop, 2014.

Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory (COLT), pages 5–13. ACM, 1993.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, July 2012.

Tommi S Jaakkola and Michael I Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25–37, 2000.

Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Proceedings of the 23rd Annual Conference on Algorithmic Learning Theory (ALT), pages 199–213. Springer, 2012.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014. arXiv:1312.6114.

Yann LeCun. Une procédure d'apprentissage pour réseau à seuil asymmetrique (a learning scheme for asymmetric threshold networks). In Proceedings of Cognitiva 85, Paris, France, pages 599–604, 1985.

Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. 1998. URL http://yann.lecun.com/exdb/mnist/.

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pages 661–670. ACM, 2010. ISBN 978-1-60558-799-8. doi: 10.1145/1772690.1772758.

David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

David JC MacKay. Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.

Benedict C May, Nathan Korda, Anthony Lee, and David S Leslie. Optimistic Bayesian sampling in contextual-bandit problems. The Journal of Machine Learning Research, 13(1):2069–2106, 2012.

Thomas P Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.

Thomas P Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.

Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 807–814, 2010.

Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.

Manfred Opper and Cédric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.

Art B Owen. Monte Carlo theory, methods and examples. 2013.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1278–1286, 2014.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5, 1988.

Lawrence K Saul, Tommi Jaakkola, and Michael I Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4(1):61–76, 1996.

Patrice Y Simard, Dave Steinkraus, and John C Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), volume 2, pages 958–958. IEEE Computer Society, 2003.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285–294, 1933.

Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1971–1979, 2014.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 1058–1066, 2013.

Jonathan S Yedidia, William T Freeman, and Yair Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems (NIPS), volume 13, pages 689–695, 2000.