

Bayesian Neural Networks with Weight Sharing Using Dirichlet Processes

Wolfgang Roth and Franz Pernkopf, Senior Member, IEEE

Wolfgang Roth ([email protected]) and Franz Pernkopf ([email protected]) are with the Signal Processing and Speech Communication Laboratory, Graz University of Technology.

Abstract—We extend feed-forward neural networks with a Dirichlet process prior over the weight distribution. This enforces a sharing on the network weights, which can reduce the overall number of parameters drastically. We alternately sample from the posterior of the weights and the posterior of assignments of network connections to the weights. This results in a weight sharing that is adapted to the given data. In order to make the procedure feasible, we present several techniques to reduce the computational burden. Experiments show that our approach mostly outperforms models with random weight sharing. Our model is capable of reducing the memory footprint substantially while maintaining a good performance compared to neural networks without weight sharing and other state-of-the-art models.

Index Terms—Dirichlet processes, Bayesian neural networks, weight sharing, Gibbs sampling, hybrid Monte-Carlo, non-conjugate models

1 INTRODUCTION

Feed-forward neural networks (NNs) are traditionally trained by optimizing an objective that represents how well the NN fits the given data. A Bayesian approach assumes a prior distribution p(W) over the weights of a NN and interprets the output of a NN as the likelihood p(t|x, W). Given a data set D = {(x_1, t_1), ..., (x_N, t_N)}, the prior and the likelihood induce a posterior distribution p(W|D) over the weights. The goal is to consider the entire posterior distribution and not just use a single point estimate. In particular, we want to compute expectations of the NN output with respect to p(W|D), i.e.

    E_{p(W|D)}[p(t|x, W)] = ∫ p(t|x, W) p(W|D) dW.    (1)

However, due to the highly non-linear dependence on the weights of the NN, the above integral does not allow for analytical solutions and one has to use approximations.

The simplest solution is to approximate the posterior p(W|D) using a point mass at a mode, in which case the maximum a-posteriori (MAP) framework is recovered. Since a MAP solution is not necessarily located in a high density region, this approach is prone to overfitting. A better solution is the Laplace approximation which replaces the posterior over the weights with a Gaussian distribution around a mode [1]. However, the covariances of the Gaussian approximation are also obtained from information present in a single point of the posterior p(W|D).

More elaborate approximation schemes such as variational inference and Markov Chain Monte Carlo (MCMC) usually yield better results. Variational inference aims to find a simpler variational distribution that approximates the true posterior distribution by minimizing some distance measure such as the Kullback-Leibler divergence. The variational distribution often assumes independence among the dimensions and its form typically allows for exact solutions or better approximations of the expectation in (1). Variational inference techniques for NNs have been used in [2], [3], [4], [5].

MCMC techniques follow a different spirit. Given a mechanism that allows drawing samples from the posterior distribution over the weights p(W|D), one can approximate the expectation in (1) by a finite sum over the drawn samples. MCMC techniques became especially attractive for NNs with the hybrid Monte Carlo (HMC) method which generates new samples by using Hamiltonian dynamics in a Metropolis-Hastings scheme at low rejection rates [6], [7]. HMC uses gradient information of the posterior distribution to increase the possible distance between two consecutive samples and thus to reduce their correlation. HMC has two disadvantages. First, HMC involves a leapfrog integration whose performance depends heavily on the choice of two parameters. A promising approach to solve this issue is adaptive HMC (AHMC) [8] which uses Bayesian optimization techniques to find these parameters automatically. Second, HMC operates in batch mode, i.e. the gradient is computed using the entire data set. This might be a limiting factor in case of large data sets. In recent years, some promising extensions of MCMC methods operating on stochastic gradients computed from smaller minibatches have been proposed [9], [10], [11], [12].

In practice, a classifier using MCMC techniques requires storing a large ensemble of NNs sampled from the posterior distribution whose outputs are averaged for the final prediction. However, storing many NNs can be prohibitive, especially when the size of a single NN is already large. It is therefore desirable to develop schemes that reduce the memory footprint of MCMC based methods. In this paper, we propose weight sharing to reduce the memory footprint of the ensemble. We restrict ourselves to setups where HMC is applicable, i.e. small to medium sized data sets and reasonably sized network architectures. HMC is generally known to produce good samples, it usually does not suffer from the random walk behavior of some of the above mentioned stochastic gradient MCMC methods, and it often serves as a gold standard for assessing the performance of novel Bayesian methods. Sharing weights of a single NN does not provide much benefit in terms of memory requirements since the assignments of weights to connections must additionally be stored. Nevertheless, for many NNs we can distribute the memory requirement for the weight assignments by using the same weight assignment multiple times and only varying the shared weights.
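To make the ensemble prediction described in the preceding paragraphs concrete, the following sketch averages the class probabilities of a set of sampled networks, which is the finite-sum approximation of (1). It is a minimal illustration assuming tanh hidden layers, a softmax output, and an ensemble given as lists of (W, b) pairs; none of this is the authors' code.

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax over the last axis.
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, weights):
    # Plain feed-forward pass: tanh hidden layers, softmax output layer.
    # `weights` is a list of (W, b) pairs, one per layer.
    h = x
    for W, b in weights[:-1]:
        h = np.tanh(h @ W + b)
    W, b = weights[-1]
    return softmax(h @ W + b)

def predictive_distribution(x, weight_samples):
    # Finite-sum approximation of E_{p(W|D)}[p(t|x, W)] in (1):
    # average the outputs of all networks in the posterior ensemble.
    return np.mean([forward(x, w) for w in weight_samples], axis=0)
```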
We propose to use a Dirichlet process (DP) prior [13] on the prior distribution of the weights. The DP prior enforces a sharing on the weights such that different NN connections have the exact same weight. DPs are an elegant tool to share parameters among different entities in a Bayesian model. For instance, they have been used to generalize mixture models to infinitely many components [14]. In these mixture models, every data sample is assumed to have its own component parameters but due to the DP prior some of them will share the exact same values. Since in practice only a finite amount of data is available, the mixture model automatically selects an appropriate number of components according to the characteristics in the data. The drawback of DPs is that exact posterior inference is typically intractable and one is forced to use approximation techniques which are typically slow. A lot of current research investigates different MCMC and variational inference techniques that aim to improve computation speed and ways to overcome the common problems of sampling techniques such as long burn-in times and getting stuck in regions of low density [15], [16], [17], [18].

Our DP NN model maintains a set of weights and assigns each NN connection to a particular weight in this set. We propose an MCMC sampling algorithm that alternates between sampling the weights conditioned on the weight assignments, and sampling the weight assignments conditioned on the weights. Before sampling new weight assignments, we sample an ensemble of weights in order to distribute the memory requirement of the weight assignments. We introduce algorithmic techniques and approximations that utilize the structure of NNs to make posterior inference computationally tractable. We demonstrate the feasibility of the model in various classification experiments on MNIST [19] and variants thereof [20]. Furthermore, we performed regression experiments on several UCI data sets [21]. Our model maintains a good performance compared to Bayesian NNs without weight sharing, NNs trained with backpropagation and support vector machines (SVMs), while using only a fraction of the weights. The proposed method outperforms NNs with random weight sharing on most data sets and on some data sets even outperforms Bayesian NNs without weight sharing. This indicates that our model has a regularizing effect and can help sampling-based algorithms that typically scale poorly with the size of the sampling space, i.e. a large NN can be used while still operating in a relatively low-dimensional weight space.

[22] proposed a different method to reduce the memory footprint by "distilling" a sequence of MCMC samples in an online fashion into a single NN. However, their approach relies on stochastic MCMC methods that generate samples quickly in order to perform many weight updates in a short time. Interestingly, most weight sharing methods use a sharing that is fixed before observing any data. In contrast to existing methods like convolutional NNs that use a predefined weight sharing to utilize the structure of the input data [19] or random weight sharing to reduce the complexity of the model [23], our approach adapts the weight sharing to the given data.

The paper is organized as follows. In Section 2 we introduce the notation and define NNs with a DP prior on the weight distribution. Section 3 describes the MCMC algorithm to sample from the posterior distribution. Furthermore, techniques to reduce computation time are proposed. In Section 4 we demonstrate our model in various experiments and Section 5 concludes the paper.

2 NON-PARAMETRIC NEURAL NETWORKS

The structure of a fully connected feed-forward NN with L layers is defined by the number of neurons {d_0, d_1, ..., d_L} in each layer, where x ∈ R^{d_0}, t ∈ R^{d_L} and d_l for 1 ≤ l < L denotes the number of neurons in layer l. The parameters of the NN are given by a set of weight matrices W = {W^l}_{l=1}^L where W^l = (W^l_{i,j})_{i=1,...,d_l, j=1,...,d_{l-1}}. The NN defines a function y = f(x) on an input vector x as follows: In layer l the NN applies an affine transformation a^l := W^l x^{l-1} to its input, followed by a non-linear activation function x^l := φ^l(a^l). Here we have defined x^0 := x. The output of the NN is given by y := x^L. Common choices for the activation function φ^l for intermediate layers l < L are the sigmoid, tanh [24] and the rectified linear unit (ReLU) [25]. The output activation function φ^L depends on the type of the prediction task: For regression tasks it is the identity function, for binary classification tasks it is the sigmoid function and for multiclass classification tasks it is the softmax function. Depending on the task, the output of a NN is interpreted differently as the conditional probability p(t|x, W). In case of regression, it is common to assume a Gaussian model on the targets with x-dependent mean y = f(x) and covariance matrix βI where I denotes the identity matrix, i.e. t ∼ N(f(x), βI). For classification the outputs can be seen as class probabilities.

1. We assume that the targets t for multiclass classification problems are given as one-hot encoded vectors.
2. We assume that each d_l for 0 ≤ l < L is enlarged by one to account for the bias vectors that we do not mention explicitly.

To adopt a full Bayesian approach, a prior probability p(W) on the parameters is required. It is common to assume a Gaussian prior with zero mean and covariance matrix γI. Other approaches have been proposed, such as using a Gaussian mixture model prior p(W) which basically results in a soft weight sharing of the weights [26]. Assuming that the samples in D are iid, the prior distribution p(W) together with the likelihood p(D|W) = ∏_{n=1}^N p(t_n|x_n, W) induces a posterior distribution p(W|D) on the parameters which is subsequently used for computing the integral in (1).

2.1 Dirichlet Processes

The DP is a distribution over distributions parameterized by a base measure G_0 and a concentration parameter α. [13] defined the DP in the following nonconstructive way: Given an arbitrary finite partition (R_1, ..., R_L) of the space R on which G_0 is defined, G is drawn from a DP with parameters G_0 and α, denoted as G ∼ DP(G_0, α), if the vector of masses of G falling into the subsets R_l has a Dirichlet distribution with parameters (αG_0(R_1), ..., αG_0(R_L)).

Other equivalent definitions of the DP can be obtained that are more convenient for the use in practical algorithms. The constructive 'stick-breaking' definition of [27] shows that a distribution G drawn from a DP can be represented as an infinite mixture of point masses. Concretely, a draw from DP(G_0, α) has the form ∑_{k=1}^∞ π_k δ_{w_k} where w_k ∼ G_0 and the mixture weights π = {π_k}_{k=1}^∞ are drawn according to

    ξ_k ∼ Beta(1, α),    π_k = ξ_k ∏_{l=1}^{k-1} (1 − ξ_l).    (2)

Here δ_{w_k} denotes a point mass located at w_k. We denote that π is drawn according to (2) as π ∼ GEM(α). This definition makes the discreteness of G explicit which can be utilized to achieve the sharing of parameters in various settings.
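To make the stick-breaking construction (2) concrete, the following sketch draws a truncated approximation of G ∼ DP(G_0, α) with G_0 = N(0, γ). The truncation level T and all variable names are illustrative assumptions of this sketch, not part of the model; the process itself is infinite.

```python
import numpy as np

def stick_breaking_draw(alpha, gamma, T, rng):
    # Truncated stick-breaking draw from DP(G0, alpha) with G0 = N(0, gamma),
    # cf. (2). T is a truncation level used only for illustration.
    xi = rng.beta(1.0, alpha, size=T)             # xi_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - xi)[:-1]))
    pi = xi * remaining                            # pi_k = xi_k * prod_{l<k} (1 - xi_l)
    w = rng.normal(0.0, np.sqrt(gamma), size=T)    # atoms w_k ~ G0
    return pi, w

rng = np.random.default_rng(0)
pi, w = stick_breaking_draw(alpha=10.0, gamma=1.0, T=1000, rng=rng)
# Drawing many values from the discrete measure sum_k pi_k * delta_{w_k}
# makes the sharing explicit: the same atoms appear repeatedly.
samples = rng.choice(w, size=20, p=pi / pi.sum())
```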
Let {θ_m}_{m=1}^M be a finite sample drawn from G. It is common to store a set of indicators z = {z_m}_{m=1}^M, where z_m determines the assignment of the mth sample to a particular w_k ∈ w = {w_k}_{k=1}^∞. The prior for the indicators z is given by the probabilities π. However, storing an infinite number of mixture weights π and parameters w is impossible. In case of a finite sample size M it suffices to store only those K values w_k to which at least one sample θ_m is assigned. Furthermore, we can marginalize out the infinite vector of mixture weights π. This results in a Chinese restaurant process scheme for the weight indicators z. The distribution of z_m conditioned on all the other weight indicators z_{−m} := z \ {z_m} is then given by

    p(z_m = k) ∝ M_{−m,k}   if 1 ≤ k ≤ K,
    p(z_m = k) ∝ α           if k is unassigned,    (3)

where M_{−m,k} := |{m′ : z_{m′} = k, m′ ≠ m}| [15]. In (3) new assignments can be created with probability proportional to α. When (3) is used to sample a new value for a singleton z_m, i.e. M_{−m,k} = 0, its parameter can disappear if some other existing value is drawn. Conceptually the concrete values of the z_m are not significant. In the remainder we assume that z_m ∈ {1, ..., K}.

2.2 A Prior Distribution over the Weight Distribution

In this paper we assume that the weights of the NN are independently drawn from a distribution G which is itself drawn from a DP with concentration parameter α and base measure G_0. We take G_0 to be a Gaussian with zero mean and variance γ. The extension to arbitrary base measures is straightforward. There are essentially two ways to use the DP prior. First, we can put a single DP prior over all weights in the NN, which enforces a sharing between weights in the whole NN. Second, we can put a separate DP prior over the weights of different layers. We refer to the former as global sharing and to the latter as layerwise sharing.

Throughout this paper we stick to the layerwise sharing since we empirically observed better results using this approach. In this case the model is defined as follows: The weights are given by w = {w^l}_{l=1}^L where w^l is a vector of size K_l that stores the assigned weights in layer l. The weight indicators are given by z = {z^l}_{l=1}^L where z^l = (z^l_{i,j})_{i=1,...,d_l, j=1,...,d_{l-1}} contains the indicators of all connections in layer l. The weight matrices of the NN can be recovered by W^l_{i,j} = w^l_{z^l_{i,j}}. The number of connections in layer l is denoted as M_l := |z^l|. We call a concrete assignment to the weight indicators a configuration. Based on the stick-breaking representation, the full model with layerwise sharing can be summarized as

    w^l_k ∼ G_0 := N(0, γ)           (4)
    π^l ∼ GEM(α)                     (5)
    z^l_{i,j} ∼ Discrete(π^l)        (6)
    t_n ∼ p(·|x_n, z, w [, β]),      (7)

where the square brackets indicate that β is only relevant when performing regression. The model is illustrated in Figure 1. In this paper we assume the hyperparameters α, β and γ to be fixed but the model can be extended to include priors over them.

[Figure 1: plate diagram with nodes G_0, α, π^l, w^l_k (infinite plate), z^l_{i,j}, β, x_n, t_n and plates over the M_l connections of layer l, the L layers and the N data samples.]

Fig. 1: Graphical illustration of the DP NN model with layerwise weight sharing. Observed variables and hyperparameters are indicated as shaded circles. The dashed circle indicates that β is only relevant for regression tasks. For global weight sharing, the dependence on l is simply dropped.
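The following toy sketch illustrates the layerwise model (4)-(6) and the weight recovery W^l_{i,j} = w^l_{z^l_{i,j}} for a single layer, drawing the indicators from the Chinese restaurant process scheme (3) with the mixture weights marginalized out. Sizes, names and the use of NumPy are assumptions of this illustration, not the authors' implementation.

```python
import numpy as np

def sample_layer(d_in, d_out, alpha, gamma, rng):
    # Chinese restaurant process over the d_out * d_in connections of one
    # layer: each connection joins an existing weight with probability
    # proportional to its current count, or a new weight with probability
    # proportional to alpha, cf. (3).
    z = np.zeros(d_out * d_in, dtype=int)
    counts = []
    for m in range(d_out * d_in):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(1)          # open a new shared weight
        else:
            counts[k] += 1
        z[m] = k
    w = rng.normal(0.0, np.sqrt(gamma), size=len(counts))  # w_k ~ G0 = N(0, gamma)
    W = w[z].reshape(d_out, d_in)     # W_{i,j} = w_{z_{i,j}}
    return W, w, z.reshape(d_out, d_in)

rng = np.random.default_rng(0)
W, w, z = sample_layer(d_in=50, d_out=100, alpha=100.0, gamma=1.0, rng=rng)
print(W.shape, len(w))  # 5000 connections typically share far fewer distinct weights
```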
3 POSTERIOR INFERENCE IN DP NEURAL NETWORKS

Posterior inference using the joint distribution of the weights and the configuration p(w, z|D) is a challenging task. On the one hand, given a fixed configuration, the posterior on the weights is typically complicated and highly multimodal. On the other hand, given a fixed set of weights, there is an intractable number of configurations to consider. Rather than searching for a single assignment of the weights and the configuration, we propose to use sampling techniques for inference.

3.1 Sampling from the Posterior Distribution

We employ a sampling scheme where we alternate between sampling from the posterior of the configuration conditioned on the weights p(z|w, D) and the posterior of the weights given the configuration p(w|z, D).

Sampling the weight indicators z: We adapt the auxiliary variable Gibbs sampling scheme described as Algorithm 8 in [15]. Neal's original algorithm is used for sampling the weight indicators in non-conjugate DP mixture models. Instead of solving an intractable integral when the model is non-conjugate, the algorithm introduces auxiliary variables and only requires the ability to efficiently draw samples from the base distribution G_0.

3. Due to the complex likelihood defined by the output of the NN, we are restricted to algorithms suitable for non-conjugate models.

The pseudocode for sampling a single weight indicator z^l_m with m = (i, j) is shown in Algorithm 1. The weight of connection m is replaced by all other currently assigned weights and r ≥ 1 additional auxiliary weights which are drawn from the base distribution G_0. In case connection m is currently assigned a singleton weight, one of the r auxiliary variables is assigned the current weight w_{z^l_m} and only r − 1 of them are drawn from G_0. For each replacement, a value proportional to the conditional probability is computed which is subsequently used to sample the new weight indicator z^l_m.

Algorithm 1 Sampling the weight indicator z^l_m
  Input: D, z, w, α, γ [, β]
  K^− := |{k : z^l_{m′} = k, m′ ≠ m}|
  h := K^− + r
  if |{m′ : z^l_{m′} = z^l_m, m′ ≠ m}| = 0 then
      Rearrange z^l and w^l such that z^l_m = K^− + 1
      Draw w_k ∼ G_0(γ) for K^− + 1 < k ≤ h
  else
      Draw w_k ∼ G_0(γ) for K^− < k ≤ h
  end if
  for k = 1 to h do
      if k ≤ K^− then
          ρ := |{m′ : z^l_{m′} = k, m′ ≠ m}|
      else
          ρ := α/r
      end if
      p_k := ρ ∏_{n=1}^N p(y_n | x_n, z_{−m}, z^l_m = k, w [, β])
  end for
  Normalize p
  Draw z^l_m ∼ Discrete(p)

Our algorithm cycles through all connections of the NN and updates their weight indicators with Gibbs sampling according to Algorithm 1. The main difference between our model and a DP mixture is that the observations depend on all of the weights rather than on a single set of cluster parameters. This implies that for each connection and each assigned weight the output of the NN for the whole data set has to be computed, rather than for a single data sample as in the case of DP mixtures. This makes the sampling process computationally expensive. In Section 3.2 we show approximations and approaches to avoid redundant computations.

Sampling the weights w: After updating the weight indicators z for all connections of the NN, we propose to use AHMC [8] for sampling from the conditional distribution of the weights p(w|z, D). HMC comes with two advantages: (i) it updates all variables at once rather than sampling each weight conditioned on all the others as in Gibbs sampling; (ii) it uses gradient information of the log-density to explore the state space more systematically. A drawback of plain HMC is the need to set two parameters, L and ε, that are critical in achieving a good performance. HMC approximates an analytically intractable integral with the leapfrog discretization where ε determines the step size and L the number of steps to take. AHMC finds suitable parameters automatically by maximizing the normalized expected squared jumping distance between consecutive samples using Bayesian optimization. The conditional density of the weights in our model is proportional to

    p(w|z, D) ∝ ∏_{n=1}^N p(y_n | x_n, z, w [, β]) ∏_{l=1}^L ∏_{k=1}^{K_l} G_0(w^l_k | γ).    (8)

The gradient of the logarithm of (8) can be computed using the standard backpropagation algorithm.

3.2 Computational Tricks

As already mentioned, the computational cost of Algorithm 1 is high. Nevertheless, changing the weight of a single connection is only a local change to the NN. In the following, we suggest techniques to avoid computing a full forward pass each time a weight assignment is replaced.

3.2.1 Discretization Trick

We introduce an approximation that reduces the number of full NN evaluations drastically. Assume that the activation function of the NN has a bounded output with lower bound l and upper bound u. For instance, the commonly used sigmoid or tanh activation functions satisfy this property. Given a discretization parameter s ≥ 1, we define a set X = {l + k (u − l)/s : k = 0, ..., s} of s + 1 evenly spaced values between the lower and the upper bound. Consider updating weight indicator z^l_{i,j} where i is in the target layer. Keeping the outputs x_{i′} of the other neurons i′ ≠ i in the same layer as neuron i fixed, we replace the output x^l_i with each value x ∈ X. For each of these values, the output of the NN is computed and stored in a lookup table. When replacing the weight assignment z^l_{i,j} with each of the h different weights in Algorithm 1, we compute the output x^l_i and round it to the nearest value in X. Rather than computing a full forward pass for each of the h different weights, one can now obtain an approximation of the NN output from the previously created lookup table. This reduces the number of NN evaluations from the number of different weights h to a constant s + 1 NN evaluations. Note that only the output of neuron i is approximated and a memory overhead of N · (s + 1) for the lookup table is added. The parameter s also controls the quality of the approximation: On the one extreme, s = 1 simply rounds the output of the neuron to either the lower bound l or the upper bound u. As s goes to ∞, the approximation becomes exact.

3.2.2 Gibbs Cycling Order

By using a specific order when cycling through the connections and sampling the weight indicators z^l_{i,j}, we can further improve the computation time. We recommend sampling one layer at a time. In particular, we start sampling the configuration from the first layer and progress forward until the last layer is reached. Sampling a layer at a time has the advantage that the computation of the NN up to previous layers stays unaffected. Thus, forward passes up to the current layer need only be computed once.

Next, we propose to iterate through the neurons i of the target layer and to sample all connections (i, j) feeding into neuron i before progressing to another neuron i′ ≠ i. In combination with the discretization trick, this allows reusing the lookup table and avoids many unnecessary forward passes. This order effectively gets rid of the dependency of the number of forward passes on the size of the input layer. Using these techniques is crucial in order to make sampling from the posterior p(z|w, D) tractable. This also makes it hard to assess the impact of the discretization trick on the quality of the samples. Nevertheless, experiments (cf. Section 4.6) indicate that a relatively coarse approximation is sufficient.
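The Gibbs step of Algorithm 1 for a single connection can be sketched as follows. The likelihood of each candidate weight is supplied by the caller, which is exactly where the lookup table of the discretization trick would be plugged in; the simplified bookkeeping (no reordering of indices after the draw) and all names are assumptions of this sketch rather than the paper's implementation.

```python
import numpy as np

def gibbs_step_weight_indicator(m, z, w, counts, alpha, gamma, r,
                                log_likelihood, rng):
    # Resample the weight indicator z[m] of one connection, cf. Algorithm 1.
    # `counts[k]` is the number of connections currently assigned to w[k];
    # `log_likelihood(wm)` returns the data log-likelihood when connection m
    # is given the weight value wm (this is where the discretization-trick
    # lookup table would be used instead of full forward passes).
    counts = np.array(counts, dtype=float).copy()
    counts[z[m]] -= 1                        # remove connection m from its weight
    singleton = counts[z[m]] == 0

    # Candidate weights: all currently assigned weights plus r auxiliary
    # draws from G0; a singleton's old weight is kept as one auxiliary.
    aux = rng.normal(0.0, np.sqrt(gamma), size=r)
    if singleton:
        aux[0] = w[z[m]]
    candidates = np.concatenate([w, aux])

    # Prior term of (3): counts for existing weights, alpha/r per auxiliary
    # weight; the unnormalized posterior is prior times likelihood.
    prior = np.concatenate([counts, np.full(r, alpha / r)])
    log_post = np.where(prior > 0,
                        np.log(np.maximum(prior, 1e-300)), -np.inf)
    log_post = log_post + np.array([log_likelihood(wk) for wk in candidates])
    log_post -= log_post.max()
    p = np.exp(log_post)
    k_new = rng.choice(len(candidates), p=p / p.sum())
    return k_new, candidates[k_new]
```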
3.3 Running Time of Posterior Sampling

We now investigate the running time of Algorithm 1. Since the running time of our algorithm is largely determined by computing forward passes in the NN, our analysis is restricted to counting the number of forward passes needed.

Proposition 1. The number of forward passes needed to sample all connections (i, j) in layer l is O(N · d_{l−1} · d_l · h̄) where h̄ is the average number of weight replacements per connection.

Proof. Layer l contains d_{l−1} · d_l connections. For each of these, we have to compute on average h̄ forward passes for each of the N training samples, resulting in an overall number of O(N · d_{l−1} · d_l · h̄) forward passes.

Implementing Algorithm 1 naïvely will therefore not be tractable for problems of any decent size.

Proposition 2. With the discretization trick from Section 3.2.1 and the Gibbs cycling order from Section 3.2.2, the number of forward passes to sample all connections (i, j) in layer l is O(N · d_l · s).

Proof. By using the specific Gibbs cycling order, the lookup table of the discretization trick, which is precomputed with s + 1 forward passes per data sample, can be reused for all d_{l−1} connections feeding into neuron i.

4 EXPERIMENTS

We compare the performance of several models. We tested plain feed-forward NNs that were optimized using the L-BFGS quasi-Newton algorithm [28]. We used tanh (BFGS Tanh) and ReLU (BFGS ReLU) activation functions with one or two hidden layers. We evaluated numbers of neurons d_l ∈ {10, 25, 50, 100, 250, 500, 1000}. In case of two hidden layers, we set d_1 = d_2. We optimized the variance on the weights γ ∈ {10^−2, ..., 10^2}. The variance γ effectively determines the influence of the weight decay term in the objective function. The initial weights were chosen randomly from a zero mean Gaussian with variance 10^−2.

Furthermore, we performed experiments using SVMs with a radial basis function (RBF) kernel k(x, x′) = exp(−λ||x − x′||^2) [29]. We optimized the trade-off parameter C ∈ {2^−10, ..., 2^10} and the kernel parameter λ ∈ {2^−10, ..., 2^10}.

For the Bayesian models, we only used the tanh activation function since it is compatible with the discretization trick. We compare our model to Bayesian NNs (BNNs). We used feed-forward NNs and generated 5000 sets of weights, i.e. 5000 NNs, with AHMC after discarding the first 200 NNs as burn-in. Again, we evaluated NNs with one or two hidden layers and constrained the number of neurons in each hidden layer to be equal. We evaluated numbers of neurons d_l ∈ {50, 100, 250}. The variance on the weights was selected from γ ∈ {10^−2, ..., 10^2}. The sampling procedure was initialized using a mode in the posterior distribution. The upper and lower bounds of AHMC used for Bayesian optimization were set to b^L_l = 1 and b^L_u = 250 for L, and b^ε_l = 10^−6 and b^ε_u = 10^−1 for ε. For more details on these parameters the interested reader is referred to [8].

For our model (DP BNN), we evaluated the same variances γ on the weights and NN structures as for BNNs. The DP parameter α was evaluated for {10^0, 10^1, 10^2, 10^3}. We initialized the weight configuration using a Chinese restaurant process with parameter α and performed 100 steps of configuration sampling as burn-in. This adapts the configuration to the given data and results in better performance than starting from a random configuration. Then we generated 200 NNs with AHMC, followed by 24 iterations alternating between sampling all weight indicators according to Algorithm 1 and sampling 200 sets of weights with AHMC. This setting requires storing only 25 configurations and, for each configuration, the 200 different sets of weights, resulting in an ensemble of 5000 NNs. We fixed the number of auxiliary variables r = 100 and the discretization parameter s = 10 (cf. Section 4.6) for all experiments.

4. AHMC actually generated 400 sets of weights but the first 200 were used as burn-in and for finding suitable HMC parameters with Bayesian optimization.

We also performed experiments with randomly shared weights (RND BNN). For each NN structure, we generated 25 random configurations with the same number of weights as in the best performing DP BNN experiment. For each model we performed 400 iterations of AHMC and discarded the first 200 models as burn-in, resulting in an overall number of 5000 NNs.

4.1 Classification Results

For classification, we performed experiments on the MNIST data set for handwritten digit recognition [19] and variants of the MNIST data set [20]. In particular, we used the mnist-basic, mnist-back, mnist-back-rand, mnist-rot and the mnist-rot-back data sets where the MNIST data is modified by different transformations. Each data set contains images with 28 × 28 pixels that we transformed with PCA as in [18] to 50 dimensions and normalized to zero mean and unit variance. MNIST contains 60000 training images and 10000 test images, and the variants of MNIST contain 12000 training images and 50000 test images. For MNIST we split the training set into 50000 training samples and 10000 validation samples, and for the variants we performed a split of 10000 training samples and 2000 validation samples. Each experiment is performed five times with different initializations of the random number generator.

The best mean classification errors for each model are summarized in Table 1. BNNs and DP BNNs achieved the best results on all data sets except MNIST and mnist-basic. BFGS-optimized NNs always achieved their best results using larger NN structures than the sampling-based algorithms. In particular, on all data sets at least either BFGS Tanh or BFGS ReLU used the largest structure with two hidden layers with 1000 neurons. In contrast, on three data sets BNNs used 50 neurons with two hidden layers.
TABLE 1: Classification errors in % of different classifiers on various data sets. For NNs we report the mean classification error and the standard deviation of five runs. For SVMs we report the classification error together with the standard error.

Data set          BFGS Tanh      BFGS ReLU      SVM RBF        BNN            RND BNN        DP BNN
MNIST             1.74 ± 0.03    1.55 ± 0.04    1.66 ± 0.13    1.70 ± 0.02    1.78 ± 0.03    1.63 ± 0.03
mnist-basic       4.13 ± 0.09    3.50 ± 0.07    3.50 ± 0.08    4.43 ± 0.07    4.80 ± 0.05    4.15 ± 0.04
mnist-back        21.42 ± 0.25   18.69 ± 0.17   21.28 ± 0.18   17.60 ± 0.05   18.18 ± 0.05   17.95 ± 0.07
mnist-back-rand   9.25 ± 0.04    8.31 ± 0.08    8.61 ± 0.13    8.88 ± 0.03    9.27 ± 0.02    8.29 ± 0.05
mnist-rot         11.74 ± 0.14   13.93 ± 0.10   12.51 ± 0.15   11.49 ± 0.12   11.93 ± 0.02   11.00 ± 0.15
mnist-rot-back    48.47 ± 0.24   48.15 ± 0.41   48.33 ± 0.22   41.41 ± 0.06   43.63 ± 0.10   43.01 ± 0.05

TABLE 2: Average root mean squared errors and standard errors on various UCI regression data sets obtained using 5-fold cross-validation. Additionally, the number of data samples N and the number of input features d_0 are shown.

Data set             N      d_0   BFGS Tanh        BFGS ReLU        SVM RBF          BNN              RND BNN          DP BNN
Abalone              4177   8     2.067 ± 0.030    2.064 ± 0.025    2.119 ± 0.038    2.070 ± 0.028    2.066 ± 0.030    2.074 ± 0.023
Boston Housing       506    13    2.949 ± 0.150    2.996 ± 0.111    3.016 ± 0.157    2.847 ± 0.131    2.844 ± 0.170    2.842 ± 0.141
Concrete Strength    1030   8     4.843 ± 0.218    4.350 ± 0.162    5.576 ± 0.176    3.998 ± 0.106    4.279 ± 0.139    4.232 ± 0.133
Power Plant          9568   4     3.811 ± 0.064    3.689 ± 0.062    3.778 ± 0.059    3.527 ± 0.050    3.715 ± 0.052    3.727 ± 0.050
Wine Quality Red     1599   11    0.627 ± 0.010    0.636 ± 0.012    0.628 ± 0.014    0.599 ± 0.010    0.622 ± 0.009    0.617 ± 0.012
Wine Quality White   4898   11    0.700 ± 0.008    0.665 ± 0.008    0.665 ± 0.010    0.623 ± 0.011    0.648 ± 0.012    0.643 ± 0.011

DP BNNs consistently outperform RND BNNs and even outperform BNNs on three data sets. We believe that the reason for this might be threefold: (i) When using HMC, one might get stuck in a high density region and be unable to move to other promising regions. The transitions of configurations could possibly move the current state to some other, highly different state that is worth 'exploring', resulting in better generalization. (ii) As many sampling-based algorithms scale poorly with the dimensionality, our method provides a principled way to use large NNs while still operating in a relatively low dimensional space, possibly improving the quality of the generated samples. (iii) Weight sharing has a regularizing effect on the NNs, preventing them from overfitting. The last claim is supported by the fact that a single NN only achieves a relatively high classification error even on the training data. Figure 2c shows how the classification error of a DP BNN with two hidden layers of 50 neurons evolves as the number of samples grows. The steep decrease at the beginning indicates a good decorrelation between consecutive samples.

4.2 Regression Experiments

For regression, we used data sets from the UCI corpus [21]. For all NN models, we evaluated β ∈ {10^−2, 10^−1}. All features and target values were normalized to zero mean and unit variance. At test time, we compare the root mean squared error without target normalizations. We performed 5-fold cross-validation and report the best test performance of each classifier.

The results are shown in Table 2. The Bayesian methods perform best on all data sets except abalone. The BFGS-optimized NNs tended to select smaller network structures than in the classification experiments to avoid overfitting. This supports the fact that Bayesian methods are better suited in regimes of smaller data sets.

The gap between the performance of our model and random weight sharing is smaller than in the classification experiments. Compared to the classification data sets, the regression data sets have far fewer input features and only a single output. Consequently, the weight matrices attached to the input and output neurons are smaller and exhibit less sharing than intermediate layers. Due to the nature of DPs, smaller weight matrices use relatively many weights (cf. Section 4.3) and the DP sharing does not show its full advantage.

4.3 Reducing the Number of Weights

Next, we compare the classification error of BNNs and DP BNNs with equivalent NN architectures and report the percentage of weights used by DP BNNs with α = 10^3 on mnist-back-rand. We evaluated numbers of neurons d_l ∈ {50, 100, 250} with one and two hidden layers. We used the same parameter setting as in the previous sections. The results are shown in Table 3. DP BNNs use at most 50% of the weights compared to BNNs and much less for larger structures. For 250 neurons with two hidden layers, the memory savings per NN are 90%. The additional memory needed to store 25 sharing configurations compared to the overall number of 5000 samples is relatively small and only increases the memory requirements per NN by 0.5%. The savings grow with larger structures due to the logarithmic dependence of the number of weights on the number of connections within a layer [30].

Furthermore, the concentration parameter α can be used to trade off between the number of weights and the classification error. Figure 2b shows the influence of α on both the classification error and the number of weights on mnist-back with two hidden layers having 100 neurons each. Setting α = 1 uses only about 0.3% of the weights and increases the classification error by approximately 3% (absolute). Using our setup of 200 weight samples per configuration, a total overhead of 0.5% to store the configuration is added and thus the overall memory requirement per single NN is 0.8% compared to full BNNs.
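The memory accounting of Section 4.3 can be reproduced with a short back-of-the-envelope computation. The sketch below assumes that storing one weight indicator costs roughly as much as storing one weight and ignores biases; both are assumptions of this illustration, not statements from the paper.

```python
def relative_memory_per_net(num_connections, weight_fraction, nets_per_config=200):
    # Memory of one ensemble member relative to a full (unshared) NN: the
    # shared weights it stores plus its share of the configuration, which is
    # amortized over all NNs generated from that configuration.
    weights = weight_fraction * num_connections
    config_share = num_connections / nets_per_config
    return (weights + config_share) / num_connections

# mnist-back network from Section 4.3: 50 PCA inputs, two hidden layers of
# 100 neurons, 10 outputs (rough connection count, biases ignored).
num_connections = 50 * 100 + 100 * 100 + 100 * 10
print(round(100 * relative_memory_per_net(num_connections, weight_fraction=0.003), 1))
# -> 0.8 (%), i.e. ~0.3% for the shared weights plus ~0.5% configuration overhead
```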
[Figure 2: four panels, (a) mnist-basic, (b) mnist-back, (c) mnist-rot, (d) mnist-rot-back, plotting classification error (CE) in % and the fraction of used weights in % over α, CE over the number of averaged samples, and the runtime in seconds of AHMC weight sampling and configuration sampling (z) for s = 10 and s = ∞.]
Fig. 2: (a) Classification errors (CE) of BFGS-optimized NNs using random weight sharing and posterior samples of DP
BNNs. Additionally, the fraction of used weights compared to full NNs is shown. (b) CE of DP BNN and RND BNN over
α and the fraction of used weights. (c) CE of DP BNN over number of averaged samples. (d) Runtime for several α values.
The bars show the mean runtime of sampling 200 weight sets (AHMC) and a single Gibbs cycle of configuration sampling
(z ) for two values of s (s = ∞ corresponds to the discretization trick not being used).

4.4 Benefit over Random Weight Sharing

The next experiment compares the structures identified by sampling the configuration with random weight sharing on mnist-basic. The NN structure is fixed to two hidden layers with 50 neurons each. For several values of α we performed 200 iterations of configuration sampling, dropped the first 100 samples as burn-in and averaged the error of each single sample. Note that no explicit weight sampling is performed. This experiment was performed five times, resulting in 500 samples. For comparison, we initialized for each α five NNs with random sharing using the same number of weights as obtained by the sampling process and trained them using BFGS. The average test error is shown in Figure 2a. The average error of a single DP BNN posterior sample is almost constant for all α values whereas the error of NNs with random weight sharing drops consistently as more and more weights are allowed. Especially when only a few weights are used, DP BNNs outperform NNs with random weight sharing. Without weight sharing, the best error with BFGS is 5.7%. This is about 1.5% better than the error of a single DP BNN posterior sample using only a few weights.

We conducted a similar experiment with the full ensemble of 5000 NNs on mnist-back, shown in Figure 2b. We used two hidden layers with 100 neurons each and compared our model to RND BNNs. Especially when α is small and only a few weights are used, random weight sharing achieves only a poor performance.

TABLE 3: Comparison of BNNs and DP BNNs for different NN architectures on mnist-back-rand. The second and third column show the classification errors in % of the two models. The last column shows the fraction (%) of weights used by DP BNNs compared to BNNs.

Structure   BNN            DP BNN         % Weights
50          9.24 ± 0.09    10.54 ± 0.24   48.93 ± 0.84
100         9.36 ± 0.05    9.88 ± 0.24    38.75 ± 0.72
250         10.32 ± 0.02   10.04 ± 0.19   24.68 ± 0.37
50-50       8.88 ± 0.03    8.29 ± 0.05    48.22 ± 0.75
100-100     9.56 ± 0.02    8.55 ± 0.03    29.15 ± 0.34
250-250     10.96 ± 0.04   8.87 ± 0.25    9.78 ± 0.11

Fig. 3: (a) Approximation (red) of tanh (blue) for s = 10 and the approximation error (dashed black). (b) Mean classification errors (CE) over five experiments of averaging an ensemble of five NNs obtained with configuration sampling on mnist-back-rand.

4.5 Runtime Experiments

We compared the running time of configuration sampling and weight sampling for different α values on mnist-rot-back. We used two layers with 100 hidden neurons each. The running time only slightly depends on the data set since they are all of equal size. Moreover, the number of weights largely depends on α and the network structure and hence is similar on all data sets.

The results are shown in Figure 2d. The running time for weight sampling with AHMC is largely unaffected by α since computing gradients with backpropagation is itself only slightly affected by the number of weights. However, since configuration sampling requires each connection to be replaced by all existing weights, its running time grows with larger α. Without the discretization trick the algorithm takes up to two orders of magnitude longer, which is impractical.

4.6 Influence of the Discretization Parameter s

To assess the impact of the discretization trick on the quality of the generated samples, we evaluated several values of s on mnist-back-rand with two hidden layers with 100 neurons each. We performed 10 iterations of configuration sampling, discarded the first five iterations as burn-in, and averaged the outputs of the last five NNs. The average classification error of performing this experiment five times is shown in Figure 3b. The error drops consistently until s = 10 and then stays almost constant until s = 100. Better approximations with larger s do not seem to pay off for the increased computational effort. When performing the same experiment with intermediate runs of weight sampling using AHMC, the error is approximately constant over the whole range of s, showing that the influence of s is even less severe. Figure 3a shows the approximated tanh activation function for s = 10.

5. Running the experiment without the discretization trick is intractable, but s = 100 already gives a good approximation.
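The approximated activation function of Figure 3a can be reproduced in a few lines: the tanh output is rounded to the nearest of the s + 1 values in X with bounds l = −1 and u = 1. This is a small sketch of the rounding step only, not the authors' code.

```python
import numpy as np

def discretized_tanh(a, s=10, lower=-1.0, upper=1.0):
    # Round tanh(a) to the nearest of the s + 1 evenly spaced values in
    # X = {lower + k * (upper - lower) / s : k = 0, ..., s}, cf. Section 3.2.1.
    grid = lower + np.arange(s + 1) * (upper - lower) / s
    x = np.tanh(a)
    return grid[np.argmin(np.abs(x[..., None] - grid), axis=-1)]

a = np.linspace(-3.0, 3.0, 7)
print(np.round(discretized_tanh(a, s=10), 1))   # coarse output with 11 levels
print(np.round(np.tanh(a), 3))                  # exact values for comparison
```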
5 CONCLUSION

We introduced DPs to share the weights in NNs and to reduce the memory footprint of storing an ensemble of NNs. We use a DP prior over the weight distribution p(W) which results in a weight sharing that is adapted to the given data. Inference is performed using an MCMC scheme where we alternate between sampling from the posterior of the weights given a configuration p(w|z, D) and sampling from the posterior of the configuration given the weights p(z|w, D). Sampling from p(z|w, D) naïvely without taking the structure of NNs into account is infeasible for NNs and data sets of any decent size. We developed a sampling algorithm which samples all connections in turn, keeping the changes caused by replacing a single weight assignment as local as possible to avoid redundant computations.

In our experiments, we demonstrated the ability of our model to reduce the total number of parameters substantially. Our model mostly outperforms Bayesian NNs with random weight sharing and even achieves lower errors than Bayesian NNs without weight sharing on some data sets. When using only a few weights, even a single posterior sample of our model substantially outperforms BFGS-optimized NNs with random weight sharing having the same number of weights. The regression experiments with NNs having relatively few weights in the input and output layers suggest that different DP parameters α in different layers could be useful. From a practical perspective, disabling the sharing for small weight matrices completely could be considered. Experiments using global sharing consistently resulted in worse performance than layerwise sharing. The reason could be that different layers need to exhibit different weight scales, and constraining the weights of different layers to be equal deteriorates performance. For instance, [31] argue in a different context that initializing the weights of different layers at different scales can substantially improve convergence. We leave experiments with other network architectures such as convolutional NNs, and experiments where the hyperparameters are resampled, to future research.

ACKNOWLEDGMENTS

This work was supported by the Austrian Science Fund (FWF) under the project numbers P27803-N15 and I2706-N31.

REFERENCES

[1] D. J. C. MacKay, "A practical Bayesian framework for backpropagation networks," Neural Computation, vol. 4, no. 3, pp. 448-472, 1992.
[2] G. E. Hinton and D. van Camp, "Keeping the neural networks simple by minimizing the description length of the weights," in ACM Conference on Computational Learning Theory (COLT), 1993, pp. 5-13.
[3] A. Graves, "Practical variational inference for neural networks," in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 2348-2356.
[4] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, "Weight uncertainty in neural networks," in International Conference on Machine Learning (ICML), 2015, pp. 1613-1622.
[5] J. M. Hernandez-Lobato and R. Adams, "Probabilistic backpropagation for scalable learning of Bayesian neural networks," in International Conference on Machine Learning (ICML), 2015, pp. 1861-1869.
[6] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, "Hybrid Monte Carlo," Physics Letters B, vol. 195, no. 2, pp. 216-222, 1987.
[7] R. M. Neal, "Bayesian training of backpropagation networks by the hybrid Monte Carlo method," Dept. of Computer Science, University of Toronto, Tech. Rep., 1992.
[8] Z. Wang, S. Mohamed, and N. de Freitas, "Adaptive Hamiltonian and Riemann manifold Monte Carlo," in International Conference on Machine Learning (ICML), 2013, pp. 1462-1470.
[9] M. Welling and Y. W. Teh, "Bayesian learning via stochastic gradient Langevin dynamics," in International Conference on Machine Learning (ICML), 2011, pp. 681-688.
[10] S. Ahn, A. K. Balan, and M. Welling, "Bayesian posterior sampling via stochastic gradient Fisher scoring," in International Conference on Machine Learning (ICML), 2012, pp. 1591-1598.
[11] T. Chen, E. B. Fox, and C. Guestrin, "Stochastic gradient Hamiltonian Monte Carlo," in International Conference on Machine Learning (ICML), 2014, pp. 1683-1691.
[12] C. Li, C. Chen, D. E. Carlson, and L. Carin, "Preconditioned stochastic gradient Langevin dynamics for deep neural networks," in AAAI Conference on Artificial Intelligence, 2016, pp. 1788-1794.
[13] T. S. Ferguson, "A Bayesian analysis of some nonparametric problems," The Annals of Statistics, vol. 1, no. 2, pp. 209-230, 1973.
[14] C. E. Antoniak, "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems," The Annals of Statistics, vol. 2, no. 6, pp. 1152-1174, 1974.
[15] R. M. Neal, "Markov chain sampling methods for Dirichlet process mixture models," Journal of Computational and Graphical Statistics, vol. 9, no. 2, pp. 249-265, 2000.
[16] S. Jain and R. M. Neal, "A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model," Journal of Computational and Graphical Statistics, vol. 13, no. 1, pp. 158-182, 2004.
[17] D. M. Blei and M. I. Jordan, "Variational inference for Dirichlet process mixtures," Bayesian Analysis, vol. 1, no. 1, pp. 121-143, 2006.
[18] J. Chang and J. W. Fisher, "Parallel sampling of DP mixture models using sub-cluster splits," in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 620-628.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[20] H. Larochelle, D. Erhan, A. C. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in International Conference on Machine Learning (ICML), 2007, pp. 473-480.
[21] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[22] A. Korattikara, V. Rathod, K. P. Murphy, and M. Welling, "Bayesian dark knowledge," in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 3438-3446.
[23] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in International Conference on Machine Learning (ICML), 2015, pp. 2285-2294.
[24] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient backprop," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 9-48.
[25] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in International Conference on Machine Learning (ICML), 2010, pp. 807-814.
[26] S. J. Nowlan and G. E. Hinton, "Simplifying neural networks by soft weight-sharing," Neural Computation, vol. 4, no. 4, pp. 473-493, 1992.
[27] J. Sethuraman, "A constructive definition of Dirichlet priors," Statistica Sinica, vol. 4, pp. 639-650, 1994.
[28] J. Nocedal and S. Wright, Numerical Optimization, 2nd ed. Springer New York, 2006.
[29] C. Chang and C. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[30] S. J. Gershman and D. M. Blei, "A tutorial on Bayesian nonparametric models," Journal of Mathematical Psychology, vol. 56, no. 1, pp. 1-12, 2012.
[31] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in International Conference on Computer Vision (ICCV), 2015, pp. 1026-1034.
