Bayesian Optimization with Robust Bayesian Neural Networks
Abstract
Bayesian optimization is a prominent method for optimizing expensive-to-evaluate
black-box functions that is widely applied to tuning the hyperparameters of machine
learning algorithms. Despite its successes, the prototypical Bayesian optimization
approach – using Gaussian process models – does not scale well to either many
hyperparameters or many function evaluations. Attacking this lack of scalability
and flexibility is thus one of the key challenges of the field. We present a general
approach for using flexible parametric models (neural networks) for Bayesian
optimization, staying as close to a truly Bayesian treatment as possible. We obtain
scalability through stochastic gradient Hamiltonian Monte Carlo, whose robustness
we improve via a scale adaptation. Experiments including multi-task Bayesian
optimization with 21 tasks, parallel optimization of deep neural networks and deep
reinforcement learning show the power and flexibility of this approach.
1 Introduction
Hyperparameter optimization is crucial for obtaining good performance in many machine learning
algorithms, such as support vector machines, deep neural networks, and deep reinforcement learning.
The most prominent method for hyperparameter optimization is Bayesian optimization (BO) based
on Gaussian processes (GPs), as e.g., implemented in the Spearmint system [1].
While GPs are the natural probabilistic models for BO, unfortunately, their complexity is cubic in
the number of data points and they often do not gracefully scale to high dimensions [2]. Although
alternative methods based on tree models [3, 4] or Bayesian linear regression using features from
a neural network [5] exist, they obtain scalability by partially sacrificing a principled treatment of
model uncertainties.
Here, we propose to use neural networks as a powerful and scalable parametric model, while
staying as close to a truly Bayesian treatment as possible. Crucially, we aim to keep the well-
calibrated uncertainty estimates of GPs since BO relies on them to accurately determine promising
hyperparameters. To this end we derive a more robust variant of the recent stochastic gradient
Hamiltonian Monte Carlo (SGHMC) method [6].
After providing background (Section 2), we make the following contributions: We derive a general
formulation for both single-task and multi-task BO with Bayesian neural networks that leads to
a robust, scalable, and parallel optimizer (Section 3). We derive a scale adaptation technique to
substantially improve the robustness of stochastic gradient HMC (Section 4). Finally, using our
method – which we dub Bayesian Optimization with Hamiltonian Monte Carlo Artificial Neural
Networks (BOHAMIANN) – we demonstrate state-of-the-art performance for a wide range of
optimization tasks. This includes multi-task BO, parallel optimization of deep residual networks, and
deep reinforcement learning. An implementation of our method can be found at https://github.com/automl/RoBO.
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
2 Background
2.1 Bayesian optimization for single and multiple tasks
Let f : X → R be an arbitrary function defined over a convex set X ⊂ R^d that can be evaluated at x ∈ X, yielding noisy observations y ∼ N(f(x), σ²_obs). We aim to find x* ∈ arg min_{x∈X} f(x). To solve this problem, BO (see, e.g., Brochu et al. [7]) typically starts by observing the function at an initial design D = {(x₁, y₁), …, (x_I, y_I)}. BO then repeatedly executes the following steps: (1) fit a regression model p(f | D) to the current data D; (2) use p(f | D) to select an input x_{t+1} at which to query f by maximizing an acquisition function (which trades off exploration and exploitation); (3) observe y_{t+1} ∼ N(f(x_{t+1}), σ²_obs) and add the result to the dataset: D := D ∪ {(x_{t+1}, y_{t+1})}.
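To make these steps concrete, here is a minimal sketch of the loop in Python (our illustration, not the paper's implementation): `fit_model` and `acquisition` stand in for the probabilistic model and acquisition function discussed below, and the acquisition is maximized by simple random search over candidate points.

```python
import numpy as np

def bayesian_optimization(f, fit_model, acquisition, bounds, X_init, y_init,
                          n_iters=50, n_candidates=1000):
    """Generic BO loop following steps (1)-(3) above.

    f           : noisy objective, maps a point in X to a scalar observation
    fit_model   : fits a probabilistic model p(f | D) to arrays (X, y)
    acquisition : scores candidate points given the model and the incumbent
    bounds      : array of shape (d, 2) with lower/upper bounds of X
    """
    X, y = list(X_init), list(y_init)
    for _ in range(n_iters):
        model = fit_model(np.asarray(X), np.asarray(y))              # step (1)
        # step (2): maximize the acquisition over random candidate points
        cand = np.random.uniform(bounds[:, 0], bounds[:, 1],
                                 size=(n_candidates, bounds.shape[0]))
        x_next = cand[np.argmax(acquisition(cand, model, min(y)))]
        y_next = f(x_next)                                           # step (3)
        X.append(x_next)
        y.append(y_next)
    best = int(np.argmin(y))
    return X[best], y[best]
```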
In the generalized case of multi-task Bayesian optimization [8], there are K related black-box functions, F = {f₁, …, f_K}, each with the same domain X, and the goal is to find x* ∈ arg min_{x∈X} f_t(x) for a given t.¹ In this case, the initial design is augmented with previous evaluations of the related functions. That is, D = D₁ ∪ ⋯ ∪ D_K with D_k = {(x₁ᵏ, y₁ᵏ), …, (x_{n_k}ᵏ, y_{n_k}ᵏ)}, where yᵢᵏ ∼ N(f_k(xᵢᵏ), σ²_obs) and n_k = |D_k| points have already been evaluated for function f_k. BO then requires a probabilistic model p(f | D) over the K functions, which can be used to transfer knowledge from related tasks to the target task t (and thus reduce the required number of function evaluations on t).
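For illustration, this augmented design can be represented by stacking all per-task datasets together with an integer task index per observation; a small helper (hypothetical, names are ours) might look as follows:

```python
import numpy as np

def build_multitask_design(task_datasets):
    """Stack the per-task data D_k = [(x, y), ...] into one design.

    Returns inputs X, an integer task index per observation, and targets y,
    so that every observation keeps a reference to the task it came from.
    """
    X, tasks, y = [], [], []
    for k, D_k in enumerate(task_datasets):
        for x_i, y_i in D_k:
            X.append(np.asarray(x_i, dtype=float))
            tasks.append(k)
            y.append(y_i)
    return np.stack(X), np.asarray(tasks), np.asarray(y, dtype=float)
```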
A concrete instantiation of BO is obtained by specifying the acquisition function and the probabilistic model. As the acquisition function we will here use the popular expected improvement (EI) criterion [9]; other commonly used options, such as UCB [10], could be applied directly. EI is defined as
α_EI(x; D) = σ(f(x) | D) · (γ(x) Φ(γ(x)) + φ(γ(x))), with γ(x) = (ŷ − µ(f(x) | D)) / σ(f(x) | D),   (1)
where Φ(·) and φ(·) denote the cumulative distribution function and the probability density function of a standard normal distribution, respectively, ŷ denotes the best function value observed so far, and µ(f(x) | D) and σ(f(x) | D) denote the posterior mean and standard deviation of our probabilistic model based on data D. While the prototypical probabilistic model in BO is a GP [1], we will use a Bayesian neural network (BNN).
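When the posterior over f is represented by a set of MCMC samples of the network weights (as in our method), µ(f(x) | D) and σ(f(x) | D) in Equation (1) can be approximated by the empirical mean and standard deviation of the per-sample predictions. A minimal sketch of this plug-in estimate (our illustration, not the paper's exact estimator):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(predictions, y_best):
    """EI as in Eq. (1), for minimization.

    predictions : array of shape (S, N) -- predictions of S posterior
                  weight samples at N candidate points
    y_best      : best (lowest) function value observed so far (y_hat)
    """
    mu = predictions.mean(axis=0)
    sigma = predictions.std(axis=0) + 1e-9       # guard against zero variance
    gamma = (y_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
```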
The ability to combine the flexibility and scalability of (deep) neural networks with well-calibrated
uncertainty estimates is highly desirable in many contexts. Not surprisingly, there thus exist many
approaches for this problem, including early work on (non-scalable) Hamiltonian Monte Carlo
[11], recent work on variational inference methods [12, 13] and expectation propagation [14], re-
interpretations of dropout as approximate inference [15, 16], as well as stochastic gradient MCMC
methods based on Hamiltonian Monte Carlo [6] and stochastic gradient Langevin MCMC [17].
While any of these methods could, in principle, be used for BO, we found most of them to result in
suboptimal uncertainty estimates. Our preliminary experiments – presented in the supplementary
material (Section B) – suggest these methods often conservatively estimate the uncertainty for points
far away from the data, particularly when based on little training data. This is problematic for BO,
which crucially relies on well-calibrated uncertainty estimates based on few function evaluations.
One family of methods that consistently resulted in good uncertainty estimates in our tests were
Hamiltonian Monte Carlo (HMC) methods, which we will thus use throughout this paper. Concretely,
we will build on the scalable stochastic MCMC method from Chen et al. [6].
We model observations of each function as p(y | x, t, θ) = N(y | f̂(x, t; θ_µ), θ_σ²), where θ = [θ_µ, θ_σ²]ᵀ, f̂(x, t; θ_µ) is the output of a parametric model with parameters θ_µ, and we assume a homoscedastic noise model with zero mean and variance θ_σ².² A single-task model can trivially be obtained from this definition:
Single-task model. In the single-task setting we simply model the function mean f̂(x, t; θ_µ) = h(x; θ_µ) using a neural network with output h (i.e., h implements a forward pass).
Multi-task model. For the multi-task model we use a slightly adapted network architecture. As additional input, the network is provided with a task-specific embedding vector. That is, we have f̂(x, t; θ_µ) = h([x; ψ_t]ᵀ; θ_h), where h(·), again, denotes the output of the neural network (here with parameters θ_h) and ψ_t is the t-th row of an embedding matrix ψ ∈ R^{K×L} (we choose L = 5 for our experiments). This embedding matrix is learned alongside all other parameters. Additionally, if information about the dataset (such as dataset size) is available, it can be appended to this embedding vector. The full vector of network parameters then becomes θ_µ = [θ_h, vec(ψ)], where vec(·) denotes vectorization. Instead of using a learned embedding we could have chosen to represent the tasks through a one-out-of-K encoding vector, which would be functionally equivalent but would induce a large number of additional parameters to be learned for large K.
With these definitions, the joint probability of the model parameters and the observed data is then
p(D, θ) = p(θ) ∏_{k=1}^{K} ∏_{i=1}^{|D_k|} N(y_iᵏ | f̂(x_iᵏ, k; θ_µ), θ_σ²),
where p(θ) denotes the prior over all model parameters.
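As a concrete illustration of this parameterization (layer sizes and the dictionary layout of θ are our choices for illustration, not the paper's implementation), the mean function f̂(x, t; θ_µ) = h([x; ψ_t]; θ_h) can be evaluated as follows:

```python
import numpy as np

def multitask_mean(x, task_id, theta):
    """Mean prediction f_hat(x, t; theta_mu) = h([x; psi_t]; theta_h).

    theta["psi"] is the K x L task-embedding matrix; the remaining entries
    hold the weights theta_h of a small two-hidden-layer tanh network.
    """
    psi_t = theta["psi"][task_id]                 # task embedding psi_t, shape (L,)
    z = np.concatenate([x, psi_t])                # network input [x; psi_t]
    h1 = np.tanh(theta["W1"] @ z + theta["b1"])   # first hidden layer
    h2 = np.tanh(theta["W2"] @ h1 + theta["b2"])  # second hidden layer
    return float(theta["W3"] @ h2 + theta["b3"])  # scalar mean prediction
```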
HMC introduces a set of auxiliary variables, r, and then samples from the joint distribution
p(θ, r | D) ∝ exp(−U(θ) − ½ rᵀ M⁻¹ r), with U(θ) = −log p(D, θ),   (6)
by simulating a fictitious physical system described by a set of differential equations, called Hamilton's equations. In this system, the negative log-likelihood U(θ) plays the role of a potential energy, r corresponds to the momentum of the system, and M represents the (arbitrary) mass matrix [18].
²We note that, if required, we could model heteroscedastic functions by defining the observation noise variance θ_σ² as a deterministic function of x (e.g., as the second output of the neural network).
Classically, the dynamics for θ and r depend on the gradient ∇U (θ) whose evaluation is too expensive
for our purposes, since it would involve evaluating the model on all data-points. By introducing a
user-defined friction matrix C, Chen et al. [6] showed how Hamiltonian dynamics can be modified to
sample from the correct distribution if only a noisy estimate ∇Ũ (θ), e.g. computed from a mini-batch,
is available. In particular, their discretized system of equations reads
∆θ = ε M⁻¹ r,  ∆r = −ε ∇Ũ(θ) − ε C M⁻¹ r + N(0, 2ε(C − B̂)),   (7)
where ε is the step-size and B̂ is an estimate of the noise introduced by the stochastic gradient evaluation.
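In code, one element-wise (diagonal M, C, and B̂) transcription of this update could look as follows; this is a sketch of the sampler of Chen et al. [6], not yet the scale-adapted variant derived below:

```python
import numpy as np

def sghmc_step(theta, r, grad_u_tilde, eps, C, B_hat, M_inv):
    """One discretized SGHMC update (Eq. 7), element-wise: the diagonal
    matrices M^{-1}, C, and B_hat are represented by vectors, and
    grad_u_tilde returns a mini-batch estimate of the gradient of U."""
    theta = theta + eps * M_inv * r
    noise_std = np.sqrt(np.maximum(2.0 * eps * (C - B_hat), 0.0))
    r = (r - eps * grad_u_tilde(theta) - eps * C * M_inv * r
         + noise_std * np.random.randn(*theta.shape))
    return theta, r
```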
Like many Monte Carlo methods, SGHMC does not come without caveats, namely the correct setting of the user-defined quantities: the friction term C, the estimate of the gradient noise B̂, the mass matrix M, the number of MCMC steps, and – most importantly – the step-size ε. We found the friction term and the step-size to be highly model and dataset dependent³, which is unacceptable for BO, where robust estimates are required across many different functions F with as few parameter choices as possible.
³We refer to Section 5 for a quantitative evaluation of this claim.
A closer look at Equation (7) shows why the step-size crucially impacts the robustness of SGHMC.
For the popular choice M = I, the change in the momentum is proportional to the gradient. If the
gradient elements are on vastly different scales (and potentially correlated), then the update effectively
assigns unequal importance to changes in different parameters of the model. This, in turn, can lead
to slow exploration of the target density. To correct for unequal parameter scales (and respect their
correlation), we would ideally like to use M as a pre-conditioner, reflecting the metric underlying
the model’s parameters. This would lead to a stochastic gradient analogue of Riemann Manifold
Hamiltonian Monte Carlo [19], which has been studied before by Ma et al. [20] and results in an
algorithm called generalized stochastic gradient Riemann Hamiltonian Monte Carlo (gSGRHMC).
Unfortunately, gSGRHMC requires computation (and storage) of the full Fisher information matrix
of U and its gradient, which is prohibitively expensive for our purposes.
As a pragmatic approach, we consider a pre-conditioning scheme increasing SGHMC's robustness with respect to ε and C, while avoiding the costly computations of gSGRHMC. We want to note
that recently – and directly related to our approach – adaptive pre-conditioning using ideas from
SGD methods has been combined with stochastic gradient Langevin dynamics in Li et al. [21] and to
derive a hybrid between SGD optimization and HMC sampling in Chen et al. [22]. These approaches
however either come with additional hyperparameters that need to be set or do not guarantee unbiased
sampling. The rest of this section shows how all remaining SGHMC parameters in our method are
determined automatically.
Choosing M. For the mass matrix, we take inspiration from the connection between SGHMC and SGD. Specifically, the literature [23, 24] shows how normalizing the gradient by its magnitude (estimated over the whole dataset) improves the robustness of SGD. To perform the analogous operation in SGHMC, we propose to adapt the mass matrix during the burn-in phase. We set M⁻¹ = diag(V̂_θ^{-1/2}), where V̂_θ is an estimate of the (element-wise) uncentered variance of the gradient: V̂_θ ≈ E[(∇Ũ(θ))²]. We estimate V̂_θ using an exponential moving average during the burn-in phase, yielding the update equation
∆V̂_θ = −τ⁻¹ V̂_θ + τ⁻¹ (∇Ũ(θ))²,   (8)
where τ is a free parameter vector specifying the exponential averaging windows. Note that all multiplications above are element-wise and τ is a vector with the same dimensionality as θ.
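In practice this is a simple element-wise moving average computed alongside the burn-in steps; a sketch (with τ held fixed here, and the simultaneous sampler updates of θ omitted for brevity):

```python
import numpy as np

def burn_in_estimate(theta, grad_u_tilde, n_burn_in, tau):
    """Estimate V_hat via the exponential moving average of Eq. (8).

    grad_u_tilde returns a mini-batch gradient estimate; tau is a
    per-parameter averaging window (vector of the same shape as theta).
    Note: in the full method theta is simultaneously updated by the
    sampler during burn-in; it is held fixed here for brevity.
    """
    v_hat = np.ones_like(theta)
    for _ in range(n_burn_in):
        g = grad_u_tilde(theta)
        v_hat += (-v_hat + g ** 2) / tau        # Eq. (8), element-wise
    m_inv = 1.0 / np.sqrt(v_hat)                # M^{-1} = diag(V_hat^{-1/2})
    return v_hat, m_inv
```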
Estimating B̂. While the above procedure removes the need to hand-tune M⁻¹ (and will stabilize the method for different C and ε), we have not yet defined an estimate for B̂. Ideally, B̂ should be the estimate of the empirical Fisher information matrix that, as discussed above, is too expensive to compute. We therefore resort to a diagonal approximation, yielding B̂ = ½ ε V̂_θ, which is readily available from Equation (8).
Scale adapted update equations. Finally, we can combine all parameter estimates to formulate our automatically scale adapted SGHMC method. Following Chen et al. [6], we introduce the variable substitution v = ε M⁻¹ r = ε V̂_θ^{-1/2} r, which leads us to the dynamical equations
∆θ = v,  ∆v = −ε² V̂_θ^{-1/2} ∇Ũ(θ) − ε V̂_θ^{-1/2} C v + N(0, 2ε³ V̂_θ^{-1/2} C V̂_θ^{-1/2} − ε⁴ I),   (10)
using the quantities estimated in Equations (8)-(9) during the burn-in phase, and then fixing the choices for all parameters. Note that the approximation of B̂ cancels with the square of our estimate of M⁻¹. In practice, we choose C = CI, i.e., the same independent noise for each element of θ. In this case, Equation (10) constrains the choices of C and ε, as we need them to fulfill the relation min(V̂_θ⁻¹) C ≥ ε. For the remainder of the paper, we fix ε = 10⁻² (a robust choice in our experience) and choose C such that ε V̂_θ^{-1/2} C = 0.05 I (intuitively, this corresponds to a constant decay in momentum of 0.05 per time step), potentially increasing it to satisfy the mentioned constraint at the end of the burn-in phase.
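Putting the adapted quantities together, a single post-burn-in step of the scale-adapted sampler can be sketched as follows (element-wise, with the constant momentum decay ε V̂_θ^{-1/2} C = 0.05 folded into a single parameter; this is our reading of Equation (10), not the reference implementation):

```python
import numpy as np

def scale_adapted_sghmc_step(theta, v, grad_u_tilde, v_hat,
                             eps=1e-2, mdecay=0.05):
    """One scale-adapted SGHMC step following Eq. (10), element-wise.

    v_hat  : burn-in estimate of the squared gradient (Eq. 8)
    mdecay : eps * V_hat^{-1/2} * C, the constant per-step momentum decay
    """
    inv_sqrt_v = 1.0 / np.sqrt(v_hat)                       # V_hat^{-1/2}
    noise_var = 2.0 * eps ** 2 * mdecay * inv_sqrt_v - eps ** 4
    noise = np.sqrt(np.maximum(noise_var, 1e-16)) * np.random.randn(*theta.shape)
    v = v - eps ** 2 * inv_sqrt_v * grad_u_tilde(theta) - mdecay * v + noise
    theta = theta + v
    return theta, v
```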
We want to emphasize that our estimation/adaptation of the parameters only changes the HMC procedure during the burn-in phase. After it, when actual samples are recorded, all parameters stay fixed. In particular, this entails that as long as our choice of ε and C satisfies min(V̂_θ⁻¹) C ≥ ε, our method samples from the correct distribution. Our choices are compatible with the constraints on the free parameters of the original SGHMC [6]. Further, we note that the scale adaptation technique is agnostic to the parametric form of the density we aim to sample from, and could therefore potentially also simplify SGHMC sampling for models beyond those considered in this paper.
Table 1: Log likelihood for regression benchmarks from the UCI repository. For comparison, we
include results for VI (variational inference) and PBP (probabilistic backpropagation) taken from
Hernández-Lobato and Adams [14]. We report mean ± standard deviation across 10 runs. The first
two SGHMC variants are the vanilla algorithm (without our modifications) optimized for best mean
performance (best average), and best performance on each dataset (tuned per dataset) via grid search.
Figure 1: Evaluation on common benchmark problems. (Left) Immediate regret of various optimizers averaged over 30 runs on the Branin function. For DNGO and BOHAMIANN, we denote the layer sizes of the (3 layer) networks in parentheses. (Right) A fit of the sinOne function after 20 steps of BO using BOHAMIANN. We plot the mean of the predictive posterior and ±2 standard deviations, calculated based on 50 MCMC samples.
Additionally, DNGO performed well for some high-dimensional problems (cf. Section 6.3), but it got
stuck when we used it to optimize 13 hyperparameters of a Deep RL agent (cf. Section 6.4).
Figure 2: (Left) DNGO vs. BOHAMIANN for optimizing the 8 hyperparameters of a deep residual network on CIFAR-10; we plotted each function evaluation performed over time, as well as the current best; parallel random search is included as an additional baseline. (Right) DNGO vs. BOHAMIANN for optimizing the 12 hyperparameters of an RL agent.
Next, we optimized the hyperparameters of the recently proposed residual network (ResNet) architec-
ture [28] for classification of CIFAR-10. We adopted a general parameterization of this architecture,
tuning both the parameters of the stochastic gradient descent training and key architectural choices (such as the dimensionality reduction strategy used between residual blocks). We kept the
maximum number of parameters fixed at the number used by the 32 layer ResNet [28]. Training a
single ResNet took up to 6 hours in our experiments and we therefore used the parallel BO procedure
described in Section 1 of the supplementary material (evaluating 8 ResNet configurations in parallel,
for all of DNGO, random search, and BOHAMIANN).
Interestingly, all methods quickly found good configurations of the hyperparameters as shown in
Figure 2(left), with BOHAMIANN reaching the validation performance of the manually-tuned
baseline ResNet after 104 function evaluations (or approximately 27 hours of total training time).
When re-training this model on the full dataset it obtained a classification error of 7.40 % ± 0.3,
matching the performance of the hand-tuned version from He et al. [28] (7.51 %). Perhaps surprisingly,
this result was reached with a different architecture than the one presented in He et al. [28]: (1)
it used max-pooling instead of strided convolutions for the spatial dimensionality reduction; (2)
approximately 50% of the weights in all residual blocks were shared (thus reducing the number of
parameters).
Table 2: Comparison between the original DDPG algorithm and a version optimized using BOHAMIANN on two control tasks. We show the number of episodes required to obtain successful performance in 10 consecutive test episodes (reward above -2 for CartPole, above -6 for the reaching task).
Figure 3: Learning curves (reward per episode) for DDPG on the Cartpole task with the original and the BOHAMIANN-optimized hyperparameters.
Finally, we optimized a neural reinforcement learning (RL) algorithm on two control tasks: the
Cartpole swing-up task and a two link robot arm reaching task. We used a re-implementation of the
DDPG algorithm by Lillicrap et al. [29] and aimed to minimize the interaction time with the simulated
system required to achieve stable performance (defined as: solving the task in 10 consecutive test
episodes). This is a critical performance metric for data-efficient RL.
The results of this experiment are given in Table 2. While the original DDPG hyperparameters
were set to achieve robust performance on a large set of benchmarks (and out-of-the-box DDPG
performed remarkably well on the considered problems) our experiments indicate that the number
of samples required to achieve good performance can be substantially reduced for individual tasks
by hyperparameter optimization with BOHAMIANN. In contrast, DNGO did not perform as well
on this specific task, getting stuck during optimization, see Figure 2 (right). A comparison between
the learning curves of the original and the optimized DDPG, depicted in Figure 3, confirms this
observation. The parameters that had the most influence on this improved performance were (perhaps
unsurprisingly) the learning rates of the Q- and policy networks and the number of SGD steps
performed between collected episodes. This observation was already used by domain experts in a
recent paper by Gu et al. [30] where they used 5 updates per sample (the hyperparameters found by
our method correspond to 10 updates per sample).
7 Conclusion
We proposed BOHAMIANN, a scalable and flexible Bayesian optimization method. It natively
supports multi-task optimization as well as parallel function evaluations, and scales to high dimensions
and many function evaluations. At its heart lies Bayesian inference for neural networks via stochastic
gradient Hamiltonian Monte Carlo, and we improved the robustness thereof by means of a scale
adaptation technique. In future work, we plan to implement Freeze-Thaw Bayesian optimization [31]
and Bayesian optimization across dataset sizes [32] in our framework, since both of these generate
many cheap function evaluations and thus reach the scalability limit of GPs. We thereby expect
substantial speedups in the practical hyperparameter optimization for ML algorithms on big datasets.
Acknowledgements
This work has partly been supported by the European Commission under Grant no. H2020-ICT-
645403-ROBDREAM, by the German Research Foundation (DFG), under Priority Programme
Autonomous Learning (SPP 1527, grant HU 1900/3-1), under Emmy Noether grant HU 1900/2-1,
and under the BrainLinks-BrainTools Cluster of Excellence (grant number EXC 1086).
References
[1] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms.
In Proc. of NIPS’12, 2012.
[2] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-Brown. Towards an
empirical foundation for assessing Bayesian optimization of hyperparameters. In BayesOpt’13, 2013.
[3] F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm
configuration. In LION’11, 2011.
[4] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Proc.
of NIPS’11, 2011.
[5] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. M. A. Patwary, Prabhat, and R. P.
Adams. Scalable Bayesian optimization using deep neural networks. In Proc. of ICML’15, 2015.
[6] T. Chen, E.B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proc. of ICML’14,
2014.
[7] E. Brochu, V. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions,
with application to active user modeling and hierarchical reinforcement learning. CoRR, 2010.
[8] K. Swersky, J. Snoek, and R. Adams. Multi-task Bayesian optimization. In Proc. of NIPS’13, 2013.
[9] D. Jones, M. Schonlau, and W. Welch. Efficient global optimization of expensive black box functions.
JGO, 1998.
[10] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No
regret and experimental design. In Proc. of ICML’10, 2010.
[11] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1996.
[12] A. Graves. Practical variational inference for neural networks. In Proc. of ICML’11, 2011.
[13] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In
Proc. of ICML’15, 2015.
[14] J. M. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian
neural networks. In Proc. of ICML’15, 2015.
[15] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep
learning. arXiv:1506.02142, 2015.
[16] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In
Proc. of NIPS’15, 2015.
[17] A. Korattikara, V. Rathod, K. P. Murphy, and M. Welling. Bayesian dark knowledge. In Proc. of NIPS’15.
2015.
[18] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Phys. Lett. B, 1987.
[19] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2011.
[20] Y. Ma, T. Chen, and E.B. Fox. A complete recipe for stochastic gradient MCMC. In Proc. of NIPS’15,
2015.
[21] C. Li, C. Chen, D. E. Carlson, and L. Carin. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Proc. of AAAI'16, 2016.
[22] C. Chen, D. E. Carlson, Z. Gan, C. Li, and L. Carin. Bridging the gap between stochastic gradient MCMC and stochastic optimization. In Proc. of AISTATS'16, 2016.
[23] T. Tieleman and G. Hinton. RMSprop: Divide the gradient by a running average of its recent magnitude. In
COURSERA: Neural Networks for Machine Learning. 2012.
[24] J.C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic
optimization. Journal of Machine Learning Research, 2011.
[25] T. Schaul, S. Zhang, and Y. LeCun. No More Pesky Learning Rates. In Proc. of ICML’13, 2013.
[26] J. Vanschoren, J. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning.
SIGKDD Explor. Newsl., (2), June 2014.
[27] M. Feurer, T. Springenberg, and F. Hutter. Initializing Bayesian hyperparameter optimization via meta-
learning. In Proc. of AAAI’15, 2015.
[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. of CVPR’16,
2016.
[29] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous
control with deep reinforcement learning. In Proc. of ICLR, 2016.
[30] S. Gu, T. P. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration.
In Proc. of ICML, 2016.
[31] K. Swersky, J. Snoek, and R. Adams. Freeze-thaw Bayesian optimization. CoRR, 2014.
[32] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian optimization of machine learning
hyperparameters on large datasets. CoRR, 2016.