
Specifying Weight Priors in Bayesian Deep Neural Networks with Empirical Bayes

Ranganath Krishnan∗, Mahesh Subedar∗, Omesh Tickoo
Intel Labs
[email protected] [email protected] [email protected]
arXiv:1906.05323v3 [cs.NE] 28 Dec 2019
∗ Equal contribution

Abstract

Stochastic variational inference for Bayesian deep neural networks (DNNs) requires specifying priors and approximate posterior distributions over neural network weights. Specifying meaningful weight priors is a challenging problem, particularly for scaling variational inference to deeper architectures involving high dimensional weight space. We propose the MOdel Priors with Empirical Bayes using DNN (MOPED) method to choose informed weight priors in Bayesian neural networks. We formulate a two-stage hierarchical modeling approach: first find the maximum likelihood estimates of weights with a DNN, and then set the weight priors using an empirical Bayes approach to infer the posterior with variational inference. We empirically evaluate the proposed approach on real-world tasks including image classification, video activity recognition and audio classification with neural network architectures of varying complexity. We also evaluate our proposed approach on a diabetic retinopathy diagnosis task and benchmark it against state-of-the-art Bayesian deep learning techniques. We demonstrate that the MOPED method enables scalable variational inference and provides reliable uncertainty quantification.

1 Introduction

Uncertainty estimation in deep neural network (DNN) predictions is essential for designing reliable and robust AI systems. Bayesian deep neural networks (Neal 1995; Gal 2016) have allowed bridging deep learning and probabilistic Bayesian theory to quantify uncertainty by borrowing the strengths of both methodologies. Variational inference (VI) (Blei, Kucukelbir, and McAuliffe 2017) is an analytical approximation technique to infer the posterior distribution of model parameters. VI methods formulate the Bayesian inference problem as an optimization-based approach which lends itself to the stochastic gradient descent based optimization used in training DNN models. VI with generalized formulations (Graves 2011; Blundell et al. 2015) has renewed interest in Bayesian neural networks.

The recent research in Bayesian Deep Learning (BDL) is focused on scaling VI to more complex models. The scalability of VI in Bayesian DNNs to practical applications involving deep models and large-scale datasets is an open problem. Hybrid Bayesian DNN architectures (Subedar et al. 2019; Krishnan, Subedar, and Tickoo 2018) are used for complex computer vision tasks to balance complexity of the model while providing benefits of Bayesian inference. DNNs are shown to have structural benefits (Bengio, Courville, and Vincent 2013) which help them in learning complex models on larger datasets. The convergence speed and performance (Goodfellow, Bengio, and Courville 2016) of DNN models heavily depend on the initialization of model weights and other hyperparameters. The transfer learning approaches (Shin et al. 2016) demonstrate the benefit of fine tuning the pretrained DNN models from adjacent domains in order to achieve faster convergence and better accuracies.

Variational inference for Bayesian DNNs involves choosing prior distributions and approximate posterior distributions over neural network weights. In a pure Bayesian approach, the prior distribution is specified before any data is observed. Specifying meaningful priors in large Bayesian DNN models with high dimensional weight space is an active area of research (Wu et al. 2019; Nalisnick, Hernández-Lobato, and Smyth 2019; Sun et al. 2019; Atanov et al. 2019), as it is practically difficult to have prior belief on millions of parameters. Empirical Bayes (Robbins 1956; Casella 1992) methods estimate the prior distribution from the data. Based on Empirical Bayes and transfer learning approaches, we propose the MOdel Priors with Empirical Bayes using DNN (MOPED) method to initialize the weight priors in Bayesian DNNs, which in our experiments has been shown to achieve better training convergence for larger models.

Our main contributions include:

• We propose the MOPED method to specify informed weight priors in Bayesian neural networks using the Empirical Bayes framework. MOPED advances the current state-of-the-art by enabling scalable variational inference for large models applied to real-world tasks.

• We demonstrate with thorough empirical experiments on multiple real-world tasks that the MOPED method helps training convergence and provides better model performance, along with reliable uncertainty estimates. We also evaluate MOPED on diabetic retinopathy diagnosis task using BDL benchmarking framework (Filos et al. 2019)
and demonstrate it outperforms state-of-the-art Bayesian deep learning methods.

The rest of the document is organized as follows. We provide background material in Section 2. The details of the proposed method for initializing the weight priors in Bayesian DNN models are presented in Section 3, and related work in Section 4, followed by empirical experiments and results supporting the claims of the proposed method in Section 5.

2 Background

2.1 Bayesian neural networks

Bayesian neural networks provide a probabilistic interpretation of deep learning models by placing distributions over the neural network weights (Neal 1995). Given training dataset D = {x, y} with inputs x = {x1, ..., xN} and their corresponding outputs y = {y1, ..., yN}, in the parametric Bayesian setting we would like to infer a distribution over weights w as a function y = fw(x) that represents the neural network model. A prior distribution p(w) is assigned over the weights that captures our prior belief as to which parameters would have likely generated the outputs before observing any data. Given the evidence data p(y|x), prior distribution and model likelihood p(y | x, w), the goal is to infer the posterior distribution over the weights p(w|D):

$$p(w \mid D) = \frac{p(y \mid x, w)\, p(w)}{\int p(y \mid x, w)\, p(w)\, dw} \tag{1}$$

Computing the posterior distribution p(w|D) is often intractable. Some of the previously proposed techniques to achieve an analytically tractable inference include Markov Chain Monte Carlo (MCMC) sampling based probabilistic inference (Neal 2012; Welling and Teh 2011), variational inference (Graves 2011; Ranganath, Gerrish, and Blei 2013; Blundell et al. 2015), expectation propagation (Minka 2001) and Monte Carlo dropout approximate inference (Gal and Ghahramani 2016).

The predictive distribution is obtained through multiple stochastic forward passes on the network while sampling from the weight posteriors using Monte Carlo estimators. Equation 2 shows the predictive distribution of the output y∗ given new input x∗:

$$p(y^* \mid x^*, D) = \int p(y^* \mid x^*, w)\, p(w \mid D)\, dw$$
$$p(y^* \mid x^*, D) \approx \frac{1}{T} \sum_{i=1}^{T} p(y^* \mid x^*, w_i), \quad w_i \sim p(w \mid D) \tag{2}$$

where T is the number of Monte Carlo samples.

2.2 Variational inference

Variational inference approximates a complex probability distribution p(w|D) with a simpler distribution qθ(w), parameterized by variational parameters θ, while minimizing the Kullback-Leibler (KL) divergence. Minimizing the KL divergence is equivalent to maximizing the log evidence lower bound (ELBO) (Bishop 2006), as shown in Equation 3.

$$\mathcal{L} := \int q_\theta(w) \log p(y \mid x, w)\, dw - \mathrm{KL}[q_\theta(w) \,\|\, p(w)] \tag{3}$$

In mean field variational inference, weights are modeled with a fully factorized Gaussian distribution parameterized by variational parameters µ and σ.

$$q_\theta(w) := \mathcal{N}(w \mid \mu, \sigma) \tag{4}$$

The variational distribution qθ(w) and its parameters µ and σ are learnt while optimizing the ELBO cost function with stochastic gradient steps.

Graves (2011) proposed fully factorized Gaussian posteriors and a differentiable loss function. Blundell et al. (2015) proposed the Bayes by Backprop method, which learns a probability distribution on the weights of the neural network by minimizing the loss function. Wen et al. (2018) proposed the Flipout method to apply pseudo-independent weight perturbations to decorrelate the gradients within mini-batches.

2.3 Empirical Bayes

Empirical Bayes (EB) (Casella 1992) methods lie in between frequentist and Bayesian statistical approaches, as they attempt to leverage strengths from both methodologies. EB methods are considered an approximation to a fully Bayesian treatment of a hierarchical Bayes model. EB methods estimate the prior distribution from the data, which is in contrast to the typical Bayesian approach. The idea of Empirical Bayes is not new: the original formulation dates back to the 1950s (Robbins 1956) and is non-parametric. Since then, many parametric formulations have been proposed and used in a wide variety of applications. We use a parametric Empirical Bayes approach in our proposed method for mean field variational inference in Bayesian deep neural networks, where weights are modeled with a fully factorized Gaussian distribution.

Parametric EB specifies a family of prior distributions p(w|λ) where λ is a hyper-parameter. Analogous to Equation 1, the posterior distribution can be obtained with EB as given by Equation 5.

$$p(w \mid D, \lambda) = \frac{p(y \mid x, w)\, p(w \mid \lambda)}{\int p(y \mid x, w)\, p(w \mid \lambda)\, dw} \tag{5}$$

2.4 Uncertainty Quantification

Uncertainty estimation is essential to build reliable and robust AI systems, and is pivotal to understanding a system's confidence in predictions and decision-making. Bayesian DNNs make it possible to capture different types of uncertainties: "Aleatoric" and "Epistemic" (Gal 2016). Aleatoric uncertainty captures noise inherent in the observation. Epistemic uncertainty, also known as model uncertainty, captures lack of knowledge in representing model parameters, specifically in the scenario of limited data.

We evaluate the model uncertainty using Bayesian active learning by disagreement (BALD) (Houlsby et al. 2011; Gal 2016), which quantifies mutual information between parameter posterior distribution and predictive distribution.

$$\mathrm{BALD} := H(y^* \mid x^*, D) - \mathbb{E}_{p(w \mid D)}[H(y^* \mid x^*, w)] \tag{6}$$

where H(y∗|x∗, D) is the predictive entropy as shown in Equation 7. Predictive entropy captures a combination of
                                                        Complexity       Validation accuracy
Dataset               Modality  Architecture            (# parameters)   DNN     MFVI    MOPED_MFVI
UCF-101               Video     ResNet-101 C3D          170,838,181      0.851   0.029   0.867
UrbanSound8K          Audio     VGGish                  144,274,890      0.817   0.143   0.819
Diabetic Retinopathy  Images    VGG                      21,242,689      0.842   0.843   0.857
CIFAR-10              Images    ResNet-56                 1,714,250      0.926   0.896   0.927
CIFAR-10              Images    ResNet-20                   546,314      0.911   0.878   0.916
MNIST                 Images    LeNet                     1,090,856      0.994   0.993   0.995
Fashion-MNIST         Images    SCNN                        442,218      0.921   0.906   0.923

Table 1: Accuracies for Bayesian DNN architectures with different complexities and input modalities. Mean field variational inference with MOPED initialization (MOPED_MFVI) obtains reliable uncertainty estimates from Bayesian DNNs while achieving similar or better accuracy as the deterministic DNNs. Mean field variational inference with random priors (MFVI) has convergence issues for complex architectures (ResNet-101 C3D and VGGish; shown in red in the original table), while the proposed method achieves model convergence. DNN and MFVI accuracy numbers for diabetic retinopathy dataset are obtained from BDL-benchmarks.
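
The uncertainty estimates that Table 1 refers to come from the Monte Carlo predictive distribution (Equation 2), the predictive entropy (Equation 7) and BALD (Equation 6). The following is a minimal NumPy sketch of those three quantities, not the authors' code. The function names, the `(T, N, K)` array layout, the small `eps` added inside the logarithm and the toy probabilities are all assumptions made for illustration.

```python
import numpy as np

def predictive_mean(mc_probs):
    """Average class probabilities over T stochastic forward passes (Equation 2).

    mc_probs has shape (T, N, K): T Monte Carlo samples, N inputs, K classes.
    """
    return mc_probs.mean(axis=0)

def predictive_entropy(mc_probs, eps=1e-12):
    """Entropy of the mean predictive distribution (Equation 7)."""
    p_mean = predictive_mean(mc_probs)
    return -np.sum(p_mean * np.log(p_mean + eps), axis=-1)

def bald(mc_probs, eps=1e-12):
    """Predictive entropy minus the expected per-sample entropy (Equation 6)."""
    per_sample_entropy = -np.sum(mc_probs * np.log(mc_probs + eps), axis=-1)
    return predictive_entropy(mc_probs, eps) - per_sample_entropy.mean(axis=0)

# Toy example (made-up numbers): T=3 passes, N=2 inputs, K=2 classes.
# Input 0: every pass agrees. Input 1: confident passes that disagree.
mc_probs = np.array([
    [[0.9, 0.1], [1.0, 0.0]],
    [[0.9, 0.1], [0.0, 1.0]],
    [[0.9, 0.1], [1.0, 0.0]],
])
model_uncertainty = bald(mc_probs)
total_uncertainty = predictive_entropy(mc_probs)
```

In this toy case the first input has non-zero predictive entropy but essentially zero BALD, because all passes agree, while the second input has high BALD because the sampled weights disagree with each other.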

input uncertainty and model uncertainty.

$$H(y^* \mid x^*, D) := -\sum_{i=0}^{K-1} p_{i\mu} \log p_{i\mu} \tag{7}$$

and p_{iµ} is the predictive mean probability of the i-th class from T Monte Carlo samples, and K is the total number of output classes.

3 MOPED: informed weight priors

MOPED advances the current state-of-the-art in variational inference for Bayesian DNNs by providing a way for specifying meaningful prior and approximate posterior distributions over weights using the Empirical Bayes framework. The Empirical Bayes framework borrows strengths from both classical (frequentist) and Bayesian statistical methodologies.

We formulate a two-stage hierarchical modeling approach: first find the maximum likelihood estimates (MLE) of weights with a DNN, and then set the weight priors using the Empirical Bayes approach to infer the posterior with variational inference.

We illustrate our proposed approach on mean-field variational inference (MFVI). For MFVI in Bayesian DNNs, weights are modeled with fully factorized Gaussian distributions parameterized by variational parameters, i.e. each weight is independently sampled from the Gaussian distribution w ∼ N(w̄, σ), where w̄ is the mean and the variance is σ = log(1 + exp(ρ)). In order to ensure non-negative variance, σ is expressed in terms of the softplus function with unconstrained parameter ρ. We propose to set the weight priors in Bayesian neural networks based on the MLE obtained from a standard DNN of equivalent architecture. We set the prior with mean equal to w_MLE and unit variance, and initialize the variational parameters in approximate posteriors as given in Equation 8.

$$\bar{w} := w_{MLE}; \quad \rho \sim \mathcal{N}(\bar{\rho}, \Delta\rho)$$
$$w \sim \mathcal{N}(w_{MLE}, \log(1 + e^{\rho})) \tag{8}$$

where w_MLE represents the maximum likelihood estimates of weights obtained from the deterministic DNN model, and (ρ̄, ∆ρ) are hyperparameters (mean and variance of the Gaussian perturbation for ρ).

For Bayesian DNNs of complex architectures involving very high dimensional weight space (hundreds of millions of parameters), the choice of ρ can be sensitive, as the values of the weights can differ from each other by a large margin. So, we propose to initialize the variational parameters in approximate posteriors as given in Equation 9.

$$\bar{w} := w_{MLE}; \quad \rho := \log(e^{\delta |w_{MLE}|} - 1)$$
$$w \sim \mathcal{N}(w_{MLE}, \delta |w_{MLE}|) \tag{9}$$

where δ is the initial perturbation factor for the weight, expressed as a percentage of the pretrained deterministic weight values.

In Section 5, we demonstrate the benefits of the MOPED method for variational inference with extensive empirical experiments. We show that the proposed MOPED method helps Bayesian DNN architectures achieve better model performance along with reliable uncertainty estimates.

4 Related Work

Deterministic pretraining (Molchanov, Ashukha, and Vetrov 2017; Sønderby et al. 2016) has been used to improve model training for variational probabilistic models. Molchanov, Ashukha, and Vetrov use a pretrained deterministic network for the Sparse Variational Dropout method. Sønderby et al. use a warm-up method for variational autoencoders by rescaling the KL-divergence term with a scalar term β, which is increased linearly from 0 to 1 during the first N epochs of training. In our method, by contrast, we use the point-estimates from a pretrained standard DNN of the same architecture to set the informed priors, and the model is optimized with MFVI using the full-scale KL-divergence term in ELBO.

Choosing weight priors in Bayesian neural networks is an active area of research. Atanov et al. propose implicit priors for variational inference in convolutional neural networks
                                  AUPR                   AUROC
Dataset         Architecture      MFVI     MOPED_MFVI    MFVI     MOPED_MFVI
UCF-101         ResNet-101 C3D    0.0174   0.9186        0.6217   0.9967
UrbanSound8K    VGGish            0.1166   0.8972        0.551    0.9811
CIFAR-10        ResNet-20         0.9265   0.9622        0.9877   0.9941
CIFAR-10        ResNet-56         0.9225   0.9799        0.987    0.9970
MNIST           LeNet             0.9996   0.9997        0.9999   0.9999
Fashion-MNIST   SCNN              0.9722   0.9784        0.9962   0.9969

Table 2: Comparison of area under the precision-recall curve (AUPR) and area under the ROC curve (AUROC) for Bayesian DNN models with varying complexities. MOPED method outperforms training with random initialization of weight priors.
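
The MOPED_MFVI columns in Table 2 rely on the initialization of Equation 9, where the posterior mean is copied from the pretrained weights and ρ is chosen so that softplus(ρ) equals δ|w_MLE|. A minimal NumPy sketch of that step follows. It is an illustration under stated assumptions rather than the authors' implementation: the names `softplus` and `moped_init`, the default `delta=0.1` and the `min_scale` floor (added here only to avoid log(0) for weights that are exactly zero) do not come from the paper.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def moped_init(w_mle, delta=0.1, min_scale=1e-6):
    """Initialize variational parameters from pretrained weights (Equation 9).

    w_mle: point-estimate weights from the deterministic DNN.
    delta: perturbation factor relative to |w_mle|.
    min_scale: hypothetical floor, not in the paper, guarding zero-valued weights.
    """
    w_mle = np.asarray(w_mle, dtype=np.float64)
    scale = np.maximum(delta * np.abs(w_mle), min_scale)
    mean = w_mle.copy()
    # Inverse softplus: rho = log(exp(scale) - 1), so softplus(rho) == scale.
    rho = np.log(np.expm1(scale))
    return mean, rho

# Toy pretrained weights (made-up values).
w_mle = np.array([0.5, -2.0, 0.0])
mean, rho = moped_init(w_mle, delta=0.1)
recovered_scale = softplus(rho)
```

Because the scale is tied to the magnitude of each pretrained weight, large and small weights receive proportionally sized initial perturbations, which is the motivation given in Section 3 for preferring Equation 9 over a single global ρ.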

[Figure 1 panels: (a) Training convergence curves, (b) AUPR curves, (c) Precision-recall]

Figure 1: Comparison of MFVI and MOPED_MFVI for Bayesian ResNet-20 and ResNet-56 architectures. (a) training convergence, (b) AUPR as a function of retained data based on model uncertainty and (c) precision-recall plots.
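
The objective behind the training convergence curves in Figure 1(a) is the ELBO of Equation 3, whose KL term has a closed form when both the approximate posterior and the prior are fully factorized Gaussians. The sketch below is illustrative only and is not taken from the paper's code. The function name `kl_factorized_gaussians` is hypothetical, the scale arguments are treated as standard deviations (the paper refers to σ as a variance), and the zero-mean unit-scale prior is an assumed baseline used purely for comparison.

```python
import numpy as np

def kl_factorized_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL[q || p] summed over independent Gaussian weights.

    sigma_q and sigma_p are standard deviations (an assumption of this sketch).
    """
    mu_q, sigma_q, mu_p, sigma_p = (
        np.asarray(a, dtype=np.float64) for a in (mu_q, sigma_q, mu_p, sigma_p)
    )
    per_weight = (
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p**2)
        - 0.5
    )
    return float(np.sum(per_weight))

# Toy pretrained weights (made up); posterior scale set to delta * |w_MLE|.
w_mle = np.array([0.5, -2.0])
sigma_q = 0.1 * np.abs(w_mle)

# Prior centered at w_MLE with unit scale vs. an assumed zero-mean prior.
kl_moped_prior = kl_factorized_gaussians(w_mle, sigma_q, w_mle, np.ones(2))
kl_zero_prior = kl_factorized_gaussians(w_mle, sigma_q, np.zeros(2), np.ones(2))
```

With the posterior mean initialized at w_MLE, the two KL values differ exactly by the mean-mismatch term, sum(w_MLE²)/2, so a prior centered at the pretrained weights starts training with a smaller KL penalty than a zero-centered one.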

that exploit generative models. Nguyen et al. use a prior with zero mean and unit variance, and initialize the optimizer at the mean of the MLE model and a very small initial variance for small-scale MNIST experiments. Wu et al. modify ELBO with a deterministic approximation of the reconstruction term and use an Empirical Bayes procedure for selecting the variance of the prior in the KL term (with zero prior mean). The authors also caution that the inherent scaling of their method could potentially limit its practical use for networks with large hidden size. All of these works have been demonstrated only on small-scale models and simple datasets like MNIST. In our method, we retain the stochastic property of the expected log-likelihood term in ELBO, and specify both mean and variance of weight priors based on a pretrained DNN with Empirical Bayes. Further, we demonstrate our method on large-scale Bayesian DNN models with complex datasets on real-world tasks.

5 Experiments

We evaluate the proposed method on real-world applications including image and audio classification, and video activity recognition. We consider multiple architectures with varying complexity to show the scalability of the method in training deep Bayesian models. Our experiments include: (i) ResNet-101 C3D (Hara, Kataoka, and Satoh 2018) for video activity classification on the UCF-101 (Soomro, Zamir, and Shah 2012) dataset, (ii) VGGish (Hershey et al. 2017) for audio classification on the UrbanSound8K (Salamon, Jacoby, and Bello 2014) dataset, (iii) a modified version of VGG (Filos et al. 2019) for diabetic retinopathy detection (Kaggle 2015), (iv) ResNet-20 and ResNet-56 (He et al. 2016) for CIFAR-10 (Krizhevsky and Hinton 2009), (v) LeNet architecture for MNIST (LeCun et al. 1998) digit classification, and (vi) a simple convolutional neural network (SCNN) consisting of two convolutional layers followed by two dense layers for image classification on the Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017) dataset.

We implemented the above Bayesian DNN models and trained them using the Tensorflow and Tensorflow-Probability (Dillon et al. 2017) frameworks. The variational layers are modeled using Flipout (Wen et al. 2018), an efficient method that decorrelates the gradients within a mini-batch by implicitly sampling pseudo-independent weight perturbations for each input. The MLE weights obtained from the pretrained DNN models are used in the MOPED method to set the priors and initialize the variational parameters in approximate posteriors (Equations 8 and 9), as described in Section 3.

During the inference phase, predictive distributions are obtained by performing multiple stochastic forward passes over the network while sampling from the posterior distribution of the weights (40 Monte Carlo samples in our experiments). We
evaluate the model uncertainty and predictive uncertainty using Bayesian active learning by disagreement (BALD) (Equation 6) and predictive entropy (Equation 7), respectively. Quantitative comparison of uncertainty estimates is made by calculating the area under the precision-recall curve (AUPR) while retaining different percentages (0.5 to 1.0) of the most certain test samples (i.e. ignoring the most uncertain predictions based on uncertainty estimates).

In Table 1, classification accuracies for architectures with various model complexity are presented. Bayesian DNNs with priors initialized with the MOPED method achieve similar or better predictive accuracies as compared to equivalent DNN models. Bayesian DNNs with random initialization of Gaussian priors have difficulty in converging to an optimal solution for larger models (ResNet-101 C3D and VGGish). It is evident from these results that MOPED method guarantees the training convergence even for the complex models.

In Figure 1, comparison of mean field variational inference with MOPED method (MOPED_MFVI) and mean field variational inference with random initialization of priors (MFVI) is shown for Bayesian ResNet-20 and ResNet-56 architectures trained on the CIFAR-10 dataset. The AUPR plots capture the precision-recall AUC values as a function of retained data based on the model uncertainty estimates. Figure 1 (b) & (c) show that MOPED_MFVI provides better performance than MFVI. AUPR increases as the most uncertain predictions are ignored based on the model uncertainty, indicating reliable uncertainty estimates. We show the results for different selection of ρ values (as shown in Equation 8).

In Figure 2, we show AUPR plots for CIFAR-10 and UCF-101 with different δ values as mentioned in Equation 9.

[Figure 2 panels: (a) Bayesian ResNet-20 (CIFAR-10), (b) Bayesian ResNet-101 C3D (UCF-101)]

Figure 2: Precision-recall AUC (AUPR) plots with different δ scale factors for initializing variance values in MOPED method.

5.1 Benchmarking uncertainty estimates

Bayesian Deep Learning (BDL) benchmarks (Filos et al. 2019) is an open-source framework for evaluating deep probabilistic machine learning models and their application to real-world problems. BDL-benchmarks assess both the scalability and effectiveness of different techniques for uncertainty estimation. The proposed MOPED_MFVI method is compared with state-of-the-art baseline methods available in the BDL-benchmarks suite on the diabetic retinopathy detection task (Kaggle 2015). The evaluation methodology assesses the techniques by their diagnostic accuracy and area under the receiver-operating-characteristic (AUC-ROC) curve, as a function of the percentage of retained data based on predictive uncertainty estimates. It is expected that models with well-calibrated uncertainty improve their performance (detection accuracy and AUC-ROC) as only the most certain data is retained.

We have followed the evaluation methodology presented in the BDL-benchmarks to compare the accuracy and uncertainty estimates obtained from our method. We used the same model architecture (VGG) and hyper-parameters as used by other baselines for evaluating MOPED_MFVI. The results for the BDL baseline methods are obtained from (Filos et al. 2019). In Table 3 and Figure 3, quantitative evaluation of AUC and accuracy values for BDL baseline methods and MOPED_MFVI are presented. The proposed MOPED_MFVI method outperforms other state-of-the-art BDL techniques.

5.2 Robustness to out-of-distribution data

We evaluate the uncertainty estimates obtained from MOPED_MFVI to detect out-of-distribution data. Out-of-distribution samples are data points which fall far off from the training data distribution. We evaluate two sets of out-of-distribution detection experiments. In the first set, we use CIFAR-10 as the in-distribution samples trained using the ResNet-56 Bayesian DNN model. TinyImageNet (Russakovsky et al. 2015) and SVHN (Goodfellow et al. 2013) datasets are used as out-of-distribution samples which were not seen during the training phase. The density histograms (area under the histogram is normalized to one) for uncertainty estimates obtained from the Bayesian DNN models are plotted in Figure 5. The density histograms in Figure 5 (a) & (b) indicate higher uncertainty estimates for the out-of-distribution samples and lower uncertainty values for the in-distribution samples. A similar trend is observed in the second set using UCF-101 (Soomro, Zamir, and Shah 2012) and Moments-in-Time (MiT) (Monfort et al. 2018) video activity recognition datasets as the in- and out-of-distribution
                       50% data retained    75% data retained    100% data retained
Method                 AUC     Accuracy     AUC     Accuracy     AUC     Accuracy
MC Dropout             0.878   0.913        0.852   0.871        0.821   0.845
Mean-field VI          0.866   0.881        0.84    0.850        0.821   0.843
Deep Ensembles         0.872   0.899        0.849   0.861        0.818   0.846
Deterministic          0.849   0.861        0.823   0.849        0.82    0.842
Ensemble MC Dropout    0.881   0.924        0.854   0.881        0.825   0.853
MOPED Mean-field VI    0.912   0.937        0.885   0.914        0.883   0.857
Random Referral        0.818   0.848        0.820   0.843        0.820   0.842

Table 3: Comparison of area under the receiver-operating characteristic curve (AUC) and classification accuracy as a function of retained data from the BDL benchmark suite. The proposed method demonstrates superior performance compared to all the baseline models.
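
The "retained data" columns in Table 3 follow a simple recipe: rank test samples by uncertainty, keep only the most certain fraction, and score the model on what remains. The sketch below illustrates that recipe for accuracy on a toy example. It is an assumption-laden illustration, not the BDL-benchmarks code: the function name `accuracy_vs_retained`, the rounding rule for the number of kept samples and the toy labels and uncertainties are all made up here.

```python
import numpy as np

def accuracy_vs_retained(y_true, y_pred, uncertainty, fractions=(0.5, 0.75, 1.0)):
    """Accuracy on the most certain fraction of samples, for each fraction."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Sort so that the least uncertain (most certain) samples come first.
    order = np.argsort(uncertainty, kind="stable")
    correct = (y_true == y_pred)[order]
    results = {}
    for frac in fractions:
        n_keep = max(1, int(round(frac * len(correct))))
        results[frac] = float(correct[:n_keep].mean())
    return results

# Toy example: the single wrong prediction also has the highest uncertainty.
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]
uncertainty = [0.1, 0.2, 0.9, 0.3]
curve = accuracy_vs_retained(y_true, y_pred, uncertainty)
```

Here accuracy is 0.75 on all data and rises to 1.0 once the most uncertain sample is referred, which is the behaviour expected from a model whose uncertainty tracks its errors. The same ranking step can feed any other metric, such as AUC-ROC, in place of accuracy.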

[Figure 3 panels: (a) Binary Accuracy, (b) AUC-ROC]

Figure 3: Benchmarking MOPED_MFVI with state-of-the-art Bayesian deep learning techniques on diabetic retinopathy diagnosis task using BDL-benchmarks. Accuracy and area under the receiver-operating characteristic curve (AUC-ROC) plots for varied percentage of retained data based on predictive uncertainty. MOPED_MFVI performs better than the other baselines from BDL-benchmarks suite. Shading shows the standard error.

data, respectively. These results confirm the uncertainty estimates obtained from proposed method are reliable and can identify out-of-distribution data.

Figure 4: Accuracy vs Confidence curves: Networks trained on MNIST and tested on both MNIST and the NotMNIST (out-of-distribution) test sets.

In order to evaluate robustness of our method (MOPED_MFVI), we compare state-of-the-art probabilistic deep learning methods for prediction accuracy as a function of model confidence. Following the experiments in (Lakshminarayanan, Pritzel, and Blundell 2017), we trained our model on MNIST training set and tested it on a mix of examples from MNIST and NotMNIST (out-of-distribution) test set. The accuracy as a function of confidence plots should increase monotonically, as higher accuracy is expected for more confident results. A robust model should provide low confidence for out-of-distribution samples while providing high confidence for correct prediction from in-distribution samples. The proposed variational inference method with MOPED priors provides more robust results as compared to the MC Dropout (Gal and Ghahramani 2016) and deep model
[Figure 5 panels: (a) Model uncertainty (Bayesian ResNet-56), (b) Predictive uncertainty (Bayesian ResNet-56), (c) Model uncertainty (Bayesian ResNet-101 C3D), (d) Predictive uncertainty (Bayesian ResNet-101 C3D)]

Figure 5: Density histograms obtained from in- and out-of-distribution samples. Bayesian DNN model uncertainty estimates indicate higher uncertainty for out-of-distribution samples as compared to the in-distribution samples.
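
Density histograms like those in Figure 5 are normalized so that the area under each histogram is one, which makes in- and out-of-distribution sets of different sizes directly comparable. A minimal sketch of that normalization with NumPy follows. The uncertainty values are synthetic draws from Beta distributions chosen only for illustration and are not the paper's results. The function name, bin count and value range are likewise assumptions.

```python
import numpy as np

def density_histogram(values, bins=20, value_range=(0.0, 1.0)):
    """Histogram whose bars integrate to one over value_range."""
    density, edges = np.histogram(values, bins=bins, range=value_range, density=True)
    return density, edges

# Synthetic stand-ins for per-sample uncertainty scores (not real data).
rng = np.random.default_rng(0)
in_dist_uncertainty = rng.beta(1.0, 8.0, size=1000)   # mass near 0
out_dist_uncertainty = rng.beta(4.0, 3.0, size=500)   # mass shifted higher

in_density, edges = density_histogram(in_dist_uncertainty)
out_density, _ = density_histogram(out_dist_uncertainty)

# Area under each histogram: sum of bar height times bin width.
in_area = float(np.sum(in_density * np.diff(edges)))
out_area = float(np.sum(out_density * np.diff(edges)))
```

Both areas equal one even though the two sets have different sizes, so the separation between the histograms reflects only where the uncertainty mass lies.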

ensembles (Lakshminarayanan, Pritzel, and Blundell 2017) approaches (shown in Figure 4).

6 Conclusions

We proposed the MOPED method that specifies informed weight priors in Bayesian deep neural networks with the Empirical Bayes approach. We demonstrated with thorough empirical experiments that MOPED enables scalable variational inference for Bayesian DNNs. We demonstrated that the proposed method outperforms state-of-the-art Bayesian deep learning techniques using the BDL-benchmarks framework. We also showed that the uncertainty estimates obtained from the proposed method are reliable to identify out-of-distribution data. The results support that the proposed approach provides better model performance and reliable uncertainty estimates on real-world tasks with large scale complex models.

References

[Atanov et al. 2019] Atanov, A.; Ashukha, A.; Struminsky, K.; Vetrov, D.; and Welling, M. 2019. The deep weight prior. In International Conference on Learning Representations.

[Bengio, Courville, and Vincent 2013] Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1798–1828.

[Bishop 2006] Bishop, C. M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

[Blei, Kucukelbir, and McAuliffe 2017] Blei, D. M.; Kucukelbir, A.; and McAuliffe, J. D. 2017. Variational inference: A review for statisticians. Journal of the American Statistical Association 112(518):859–877.

[Blundell et al. 2015] Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.

[Casella 1992] Casella, G. 1992. Illustrating empirical Bayes methods. Chemometrics and Intelligent Laboratory Systems 16(2):107–125.

[Dillon et al. 2017] Dillon, J. V.; Langmore, I.; Tran, D.; Brevdo, E.; Vasudevan, S.; Moore, D.; Patton, B.; Alemi, A.; Hoffman, M.; and Saurous, R. A. 2017. Tensorflow distributions. arXiv preprint arXiv:1711.10604.

[Filos et al. 2019] Filos, A.; Farquhar, S.; Gomez, A. N.; Rudner, T. G. J.; Kenton, Z.; Smith, L.; Alizadeh, M.; de Kroon, A.; and Gal, Y. 2019. Benchmarking Bayesian deep learning with diabetic retinopathy diagnosis.

[Gal and Ghahramani 2016] Gal, Y., and Ghahramani, Z. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050–1059.

[Gal 2016] Gal, Y. 2016. Uncertainty in deep learning. University of Cambridge.

[Goodfellow, Bengio, and Courville 2016] Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep learning. MIT Press.

[Goodfellow et al. 2013] Goodfellow, I. J.; Bulatov, Y.; Ibarz, J.; Arnoud, S.; and Shet, V. 2013. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082.

[Graves 2011] Graves, A. 2011. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2348–2356.

[Hara, Kataoka, and Satoh 2018] Hara, K.; Kataoka, H.; and Satoh, Y. 2018. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6546–6555.

[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

[Hershey et al. 2017] Hershey, S.; Chaudhuri, S.; Ellis, D. P.; Gemmeke, J. F.; Jansen, A.; Moore, R. C.; Plakal, M.; Platt, D.; Saurous, R. A.; Seybold, B.; et al. 2017. CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 131–135. IEEE.
[Houlsby et al. 2011] Houlsby, N.; Huszár, F.; Ghahramani, Z.; and Lengyel, M. 2011. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745.

[Kaggle 2015] Kaggle. 2015. Diabetic retinopathy detection challenge. https://www.kaggle.com/c/diabetic-retinopathy-detection/overview/description.

[Krishnan, Subedar, and Tickoo 2018] Krishnan, R.; Subedar, M.; and Tickoo, O. 2018. BAR: Bayesian activity recognition using variational inference. arXiv preprint arXiv:1811.03305.

[Krizhevsky and Hinton 2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

[Lakshminarayanan, Pritzel, and Blundell 2017] Lakshminarayanan, B.; Pritzel, A.; and Blundell, C. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 6402–6413.

[LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P.; et al. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

[Minka 2001] Minka, T. P. 2001. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 362–369. Morgan Kaufmann Publishers Inc.

[Molchanov, Ashukha, and Vetrov 2017] Molchanov, D.; Ashukha, A.; and Vetrov, D. 2017. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2498–2507. JMLR.org.

[Monfort et al. 2018] Monfort, M.; Zhou, B.; Bargal, S. A.; Andonian, A.; Yan, T.; Ramakrishnan, K.; Brown, L.; Fan, Q.; Gutfreund, D.; Vondrick, C.; et al. 2018. Moments in time dataset: one million videos for event understanding. arXiv preprint arXiv:1801.03150.

[Nalisnick, Hernández-Lobato, and Smyth 2019] Nalisnick, E.; Hernández-Lobato, J. M.; and Smyth, P. 2019. Dropout as a structured shrinkage prior. In International Conference on Machine Learning, 4712–4722.

[Neal 1995] Neal, R. M. 1995. Bayesian Learning for Neural Networks. Ph.D. Dissertation, Citeseer.

[Neal 2012] Neal, R. M. 2012. Bayesian learning for neural networks, volume 118. Springer Science & Business Media.

[Nguyen et al. 2017] Nguyen, C. V.; Li, Y.; Bui, T. D.; and Turner, R. E. 2017. Variational continual learning. arXiv preprint arXiv:1710.10628.

[Ranganath, Gerrish, and Blei 2013] Ranganath, R.; Gerrish, S.; and Blei, D. M. 2013. Black box variational inference. arXiv preprint arXiv:1401.0118.

[Robbins 1956] Robbins, H. 1956. An empirical Bayes approach to statistics. Herbert Robbins Selected Papers 41–47.

[Russakovsky et al. 2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252.

[Salamon, Jacoby, and Bello 2014] Salamon, J.; Jacoby, C.; and Bello, J. P. 2014. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia, 1041–1044. ACM.

[Shin et al. 2016] Shin, H.-C.; Roth, H. R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; and Summers, R. M. 2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging 35(5):1285–1298.

[Sønderby et al. 2016] Sønderby, C. K.; Raiko, T.; Maaløe, L.; Sønderby, S. K.; and Winther, O. 2016. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, 3738–3746.

[Soomro, Zamir, and Shah 2012] Soomro, K.; Zamir, A. R.; and Shah, M. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

[Subedar et al. 2019] Subedar, M.; Krishnan, R.; Meyer, P. L.; Tickoo, O.; and Huang, J. 2019. Uncertainty-aware audiovisual activity recognition using deep Bayesian variational inference. In The IEEE International Conference on Computer Vision (ICCV).

[Sun et al. 2019] Sun, S.; Zhang, G.; Shi, J.; and Grosse, R. 2019. Functional variational Bayesian neural networks. arXiv preprint arXiv:1903.05779.

[Welling and Teh 2011] Welling, M., and Teh, Y. W. 2011. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 681–688.

[Wen et al. 2018] Wen, Y.; Vicol, P.; Ba, J.; Tran, D.; and Grosse, R. 2018. Flipout: Efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386.

[Wu et al. 2019] Wu, A.; Nowozin, S.; Meeds, E.; Turner, R. E.; Hernández-Lobato, J. M.; and Gaunt, A. L. 2019. Deterministic variational inference for robust Bayesian neural networks. International Conference on Learning Representations (ICLR).

[Xiao, Rasul, and Vollgraf 2017] Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
