Table 1: Accuracies for architectures with different complexities and input modalities. Mean field variational inference with MOPED initialization (MOPED_MFVI) obtains reliable uncertainty estimates from Bayesian DNNs while achieving similar or better accuracy than the deterministic DNNs. Mean field variational inference with random priors (MFVI) has convergence issues (shown in red) for complex architectures, while the proposed method achieves model convergence. DNN and MFVI accuracy numbers for the diabetic retinopathy dataset are obtained from BDL-benchmarks.
input uncertainty and model uncertainty, where w_MLE represents the maximum likelihood estimates of the weights.
Table 2: Comparison of AUC of precision-recall (AUPR) and ROC (auROC) for models with varying complexities. The MOPED method outperforms training with randomly initialized weight priors.
Figure 1: Comparison of MFVI and MOPED_MFVI for Bayesian ResNet-20 and ResNet-56 architectures: (a) training convergence, (b) AUPR as a function of retained data based on model uncertainty, and (c) precision-recall plots.
that exploit generative models. Nguyen et al. use a prior with zero mean and unit variance, and initialize the optimizer at the mean of the MLE model with a very small initial variance for small-scale MNIST experiments. Wu et al. modify the ELBO with a deterministic approximation of the reconstruction term and use an Empirical Bayes procedure for selecting the variance of the prior in the KL term (with zero prior mean). The authors also caution that the inherent scaling of their method could potentially limit its practical use for networks with large hidden sizes. All of these works have been demonstrated only on small-scale models and simple datasets such as MNIST. In our method, we retain the stochastic property of the expected log-likelihood term in the ELBO, and we specify both the mean and variance of the weight priors from a pretrained DNN with Empirical Bayes. Further, we demonstrate our method on large-scale Bayesian DNN models with complex datasets on real-world tasks.

5 Experiments
We evaluate the proposed method on real-world applications including image and audio classification, and video activity recognition. We consider multiple architectures with varying complexity to show the scalability of the method in training deep Bayesian models. Our experiments include: (i) ResNet-101 C3D (Hara, Kataoka, and Satoh 2018) for video activity classification on the UCF-101 (Soomro, Zamir, and Shah 2012) dataset, (ii) VGGish (Hershey et al. 2017) for audio classification on the UrbanSound8K (Salamon, Jacoby, and Bello 2014) dataset, (iii) a modified version of VGG (Filos et al. 2019) for diabetic retinopathy detection (Kaggle 2015), (iv) ResNet-20 and ResNet-56 (He et al. 2016) for CIFAR-10 (Krizhevsky and Hinton 2009), (v) the LeNet architecture for MNIST (LeCun et al. 1998) digit classification, and (vi) a simple convolutional neural network (SCNN) consisting of two convolutional layers followed by two dense layers for image classification on the Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017) dataset.

We implemented the above Bayesian DNN models and trained them using the TensorFlow and TensorFlow Probability (Dillon et al. 2017) frameworks. The variational layers are modeled using Flipout (Wen et al. 2018), an efficient method that decorrelates the gradients within a mini-batch by implicitly sampling pseudo-independent weight perturbations for each input. The MLE weights obtained from the pretrained DNN models are used in the MOPED method to set the priors and to initialize the variational parameters of the approximate posteriors (Equations 8 and 9), as described in Section 3.
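As an illustrative sketch rather than the exact implementation, the snippet below shows one way to turn pretrained MLE weights into MOPED-style prior parameters and posterior initializations. Equations 8 and 9 are not reproduced in this excerpt, so the unit-variance prior, the fixed-ρ initialization, the δ-scaled initialization, and the `mle_weights` dictionary are assumptions made for illustration.

```python
import numpy as np

def moped_init(w_mle, delta=0.1, rho_init=-3.0, use_delta=True):
    """Sketch of a MOPED-style initialization from MLE weights (assumed scheme).

    Prior:     Normal(mu = w_mle, sigma = 1)  -- mean informed by the pretrained DNN.
    Posterior: mu initialized to w_mle; sigma initialized either from a fixed rho
               (softplus(rho_init)) or as delta * |w_mle|.
    """
    w_mle = np.asarray(w_mle, dtype=np.float64)

    prior_mu = w_mle.copy()                # informed prior mean
    prior_sigma = np.ones_like(w_mle)      # assumed unit prior variance

    post_mu = w_mle.copy()                 # posterior mean starts at the MLE solution
    if use_delta:
        # delta-scaled variant: posterior sigma proportional to the weight magnitude.
        post_sigma = np.maximum(delta * np.abs(w_mle), 1e-6)
    else:
        # fixed-rho variant: same small sigma for every weight via softplus(rho_init).
        post_sigma = np.log1p(np.exp(rho_init)) * np.ones_like(w_mle)

    # Invert the softplus so libraries parameterized by rho can be set directly.
    post_rho = np.log(np.expm1(post_sigma))

    return {"prior_mu": prior_mu, "prior_sigma": prior_sigma,
            "post_mu": post_mu, "post_rho": post_rho}

# Hypothetical pretrained weights for a single convolutional kernel.
mle_weights = {"conv1/kernel": 0.05 * np.random.randn(3, 3, 3, 16)}
init = {name: moped_init(w, delta=0.1) for name, w in mle_weights.items()}
```

In practice, the resulting arrays would be handed to the prior and posterior constructors of the variational (e.g., Flipout) layers; the exact wiring depends on the framework version and is omitted here.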
During the inference phase, predictive distributions are obtained by performing multiple stochastic forward passes over the network while sampling from the posterior distribution of the weights (40 Monte Carlo samples in our experiments).

We plot the precision-recall AUC values as a function of retained data based on the model uncertainty estimates. Figure 1 (b) & (c) show that MOPED_MFVI provides better performance than MFVI. AUPR increases as the most uncertain predictions are ignored based on the model uncertainty, indicating reliable uncertainty estimates. We show the results for different selections of ρ values (as shown in Equation 8).
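To make the evaluation procedure concrete, the following sketch mimics the Monte Carlo inference and retained-data analysis described above: stochastic forward passes are averaged into a predictive distribution, predictive entropy serves as the model-uncertainty score, and a metric is recomputed on the most confident fraction of examples. The `mc_model` callable, the choice of entropy, and the use of accuracy instead of AUPR are assumptions made to keep the example self-contained.

```python
import numpy as np

def predictive_distribution(mc_model, x, num_samples=40):
    """Average class probabilities over stochastic forward passes.

    `mc_model(x)` is assumed to return softmax probabilities with weights
    freshly sampled from the variational posterior on every call.
    """
    probs = np.stack([mc_model(x) for _ in range(num_samples)], axis=0)
    return probs.mean(axis=0)                      # shape: (num_examples, num_classes)

def predictive_entropy(mean_probs, eps=1e-12):
    """Entropy of the averaged predictive distribution, used as the uncertainty score."""
    return -np.sum(mean_probs * np.log(mean_probs + eps), axis=-1)

def accuracy_vs_retained(mean_probs, labels, fractions=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Accuracy on the most confident fraction of examples, per retained fraction."""
    order = np.argsort(predictive_entropy(mean_probs))   # most confident first
    preds = mean_probs.argmax(axis=-1)
    results = {}
    for frac in fractions:
        keep = order[: max(1, int(frac * len(order)))]
        results[frac] = float((preds[keep] == labels[keep]).mean())
    return results

# Toy usage: a random stand-in for a Bayesian DNN's stochastic forward pass.
rng = np.random.default_rng(0)
fake_mc_model = lambda x: rng.dirichlet(np.ones(10), size=len(x))
x_val, y_val = np.zeros((100, 32, 32, 3)), rng.integers(0, 10, size=100)
mean_probs = predictive_distribution(fake_mc_model, x_val, num_samples=40)
print(accuracy_vs_retained(mean_probs, y_val))
```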
In Figure 2, we show AUPR plots for CIFAR-10 and UCF-101 with different δ values, as given in Equation 9.
Table 3: Comparison of area under the receiver-operating characteristic curve (AUC) and classification accuracy as a function of retained data from the BDL benchmark suite. The proposed method demonstrates superior performance compared to all the baseline models.
Figure 3: Benchmarking MOPED_MFVI against state-of-the-art Bayesian deep learning techniques on the diabetic retinopathy diagnosis task using BDL-benchmarks. Accuracy and area under the receiver-operating characteristic curve (AUC-ROC) plots for varying percentages of retained data based on predictive uncertainty. MOPED_MFVI performs better than the other baselines from the BDL-benchmarks suite. Shading shows the standard error.
Figure 5: Density histograms obtained from in- and out-of-distribution samples. Bayesian DNN model uncertainty estimates indicate higher
uncertainty for out-of-distribution samples as compared to the in-distribution samples.
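As a minimal sketch of how such density histograms and the resulting separability can be computed, assume two arrays of per-example predictive-uncertainty scores (e.g., predictive entropy from the Monte Carlo passes above), one for in-distribution and one for out-of-distribution inputs; the synthetic scores and the AUROC check below are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Placeholder predictive-uncertainty scores per example; in practice these come
# from Monte Carlo forward passes of the Bayesian DNN.
unc_in = rng.gamma(shape=2.0, scale=0.05, size=1000)    # in-distribution inputs
unc_out = rng.gamma(shape=4.0, scale=0.20, size=1000)   # out-of-distribution inputs

# Density histograms over a shared set of bins, as in the figure.
bins = np.linspace(0.0, max(unc_in.max(), unc_out.max()), 50)
hist_in, _ = np.histogram(unc_in, bins=bins, density=True)
hist_out, _ = np.histogram(unc_out, bins=bins, density=True)

# How well the uncertainty score alone separates OOD from in-distribution data.
labels = np.concatenate([np.zeros(len(unc_in)), np.ones(len(unc_out))])
scores = np.concatenate([unc_in, unc_out])
print("OOD-detection AUROC from uncertainty:", roc_auc_score(labels, scores))
```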
ensembles (Lakshminarayanan, Pritzel, and Blundell 2017) approaches (shown in Figure 4).

6 Conclusions
We proposed the MOPED method, which specifies informed weight priors in Bayesian deep neural networks with an Empirical Bayes approach. We demonstrated with thorough empirical experiments that MOPED enables scalable variational inference for Bayesian DNNs. We demonstrated that the proposed method outperforms state-of-the-art Bayesian deep learning techniques using the BDL-benchmarks framework. We also showed that the uncertainty estimates obtained from the proposed method are reliable for identifying out-of-distribution data. The results support that the proposed approach provides better model performance and reliable uncertainty estimates on real-world tasks with large-scale complex models.
References
[Atanov et al. 2019] Atanov, A.; Ashukha, A.; Struminsky, K.; Vetrov, D.; and Welling, M. 2019. The deep weight prior. In International Conference on Learning Representations.
[Bengio, Courville, and Vincent 2013] Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1798–1828.
[Bishop 2006] Bishop, C. M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[Blei, Kucukelbir, and McAuliffe 2017] Blei, D. M.; Kucukelbir, A.; and McAuliffe, J. D. 2017. Variational inference: A review for statisticians. Journal of the American Statistical Association 112(518):859–877.
[Blundell et al. 2015] Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
[Casella 1992] Casella, G. 1992. Illustrating empirical Bayes methods. Chemometrics and Intelligent Laboratory Systems 16(2):107–125.
[Dillon et al. 2017] Dillon, J. V.; Langmore, I.; Tran, D.; Brevdo, E.; Vasudevan, S.; Moore, D.; Patton, B.; Alemi, A.; Hoffman, M.; and Saurous, R. A. 2017. TensorFlow Distributions. arXiv preprint arXiv:1711.10604.
[Filos et al. 2019] Filos, A.; Farquhar, S.; Gomez, A. N.; Rudner, T. G. J.; Kenton, Z.; Smith, L.; Alizadeh, M.; de Kroon, A.; and Gal, Y. 2019. Benchmarking Bayesian deep learning with diabetic retinopathy diagnosis.
[Gal and Ghahramani 2016] Gal, Y., and Ghahramani, Z. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050–1059.
[Gal 2016] Gal, Y. 2016. Uncertainty in deep learning. University of Cambridge.
[Goodfellow, Bengio, and Courville 2016] Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press.
[Goodfellow et al. 2013] Goodfellow, I. J.; Bulatov, Y.; Ibarz, J.; Arnoud, S.; and Shet, V. 2013. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082.
[Graves 2011] Graves, A. 2011. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2348–2356.
[Hara, Kataoka, and Satoh 2018] Hara, K.; Kataoka, H.; and Satoh, Y. 2018. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6546–6555.
[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
[Hershey et al. 2017] Hershey, S.; Chaudhuri, S.; Ellis, D. P.; Gemmeke, J. F.; Jansen, A.; Moore, R. C.; Plakal, M.; Platt, D.; Saurous, R. A.; Seybold, B.; et al. 2017. CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 131–135. IEEE.
[Houlsby et al. 2011] Houlsby, N.; Huszár, F.; Ghahramani, Z.; and Lengyel, M. 2011. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745.
[Kaggle 2015] Kaggle. 2015. Diabetic retinopathy detection challenge. https://www.kaggle.com/c/diabetic-retinopathy-detection/overview/description.
[Krishnan, Subedar, and Tickoo 2018] Krishnan, R.; Subedar, M.; and Tickoo, O. 2018. BAR: Bayesian activity recognition using variational inference. arXiv preprint arXiv:1811.03305.
[Krizhevsky and Hinton 2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
[Lakshminarayanan, Pritzel, and Blundell 2017] Lakshminarayanan, B.; Pritzel, A.; and Blundell, C. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 6402–6413.
[LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P.; et al. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
[Minka 2001] Minka, T. P. 2001. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 362–369. Morgan Kaufmann Publishers Inc.
[Molchanov, Ashukha, and Vetrov 2017] Molchanov, D.; Ashukha, A.; and Vetrov, D. 2017. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, 2498–2507. JMLR.org.
[Monfort et al. 2018] Monfort, M.; Zhou, B.; Bargal, S. A.; Andonian, A.; Yan, T.; Ramakrishnan, K.; Brown, L.; Fan, Q.; Gutfruend, D.; Vondrick, C.; et al. 2018. Moments in Time dataset: One million videos for event understanding. arXiv preprint arXiv:1801.03150.
[Nalisnick, Hernández-Lobato, and Smyth 2019] Nalisnick, E.; Hernández-Lobato, J. M.; and Smyth, P. 2019. Dropout as a structured shrinkage prior. In International Conference on Machine Learning, 4712–4722.
[Neal 1995] Neal, R. M. 1995. Bayesian Learning for Neural Networks. Ph.D. Dissertation, Citeseer.
[Neal 2012] Neal, R. M. 2012. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media.
[Nguyen et al. 2017] Nguyen, C. V.; Li, Y.; Bui, T. D.; and Turner, R. E. 2017. Variational continual learning. arXiv preprint arXiv:1710.10628.
[Ranganath, Gerrish, and Blei 2013] Ranganath, R.; Gerrish, S.; and Blei, D. M. 2013. Black box variational inference. arXiv preprint arXiv:1401.0118.
[Robbins 1956] Robbins, H. 1956. An empirical Bayes approach to statistics. Herbert Robbins Selected Papers 41–47.
[Russakovsky et al. 2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252.
[Salamon, Jacoby, and Bello 2014] Salamon, J.; Jacoby, C.; and Bello, J. P. 2014. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia, 1041–1044. ACM.
[Shin et al. 2016] Shin, H.-C.; Roth, H. R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; and Summers, R. M. 2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging 35(5):1285–1298.
[Sønderby et al. 2016] Sønderby, C. K.; Raiko, T.; Maaløe, L.; Sønderby, S. K.; and Winther, O. 2016. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, 3738–3746.
[Soomro, Zamir, and Shah 2012] Soomro, K.; Zamir, A. R.; and Shah, M. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
[Subedar et al. 2019] Subedar, M.; Krishnan, R.; Meyer, P. L.; Tickoo, O.; and Huang, J. 2019. Uncertainty-aware audiovisual activity recognition using deep Bayesian variational inference. In The IEEE International Conference on Computer Vision (ICCV).
[Sun et al. 2019] Sun, S.; Zhang, G.; Shi, J.; and Grosse, R. 2019. Functional variational Bayesian neural networks. arXiv preprint arXiv:1903.05779.
[Welling and Teh 2011] Welling, M., and Teh, Y. W. 2011. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 681–688.
[Wen et al. 2018] Wen, Y.; Vicol, P.; Ba, J.; Tran, D.; and Grosse, R. 2018. Flipout: Efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386.
[Wu et al. 2019] Wu, A.; Nowozin, S.; Meeds, E.; Turner, R. E.; Hernández-Lobato, J. M.; and Gaunt, A. L. 2019. Deterministic variational inference for robust Bayesian neural networks. In International Conference on Learning Representations (ICLR).
[Xiao, Rasul, and Vollgraf 2017] Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.