
Batched Large-scale Bayesian Optimization in High-dimensional Spaces

Zi Wang (MIT CSAIL)    Clement Gehring (MIT CSAIL)    Pushmeet Kohli (DeepMind)    Stefanie Jegelka (MIT CSAIL)

arXiv:1706.01445v4 [stat.ML] 16 May 2018

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

Abstract

Bayesian optimization (BO) has become an effective approach for black-box function optimization problems when function evaluations are expensive and the optimum can be achieved within a relatively small number of queries. However, many cases, such as the ones with high-dimensional inputs, may require a much larger number of observations for optimization. Despite an abundance of observations thanks to parallel experiments, current BO techniques have been limited to merely a few thousand observations. In this paper, we propose ensemble Bayesian optimization (EBO) to address three current challenges in BO simultaneously: (1) large-scale observations; (2) high dimensional input spaces; and (3) selections of batch queries that balance quality and diversity. The key idea of EBO is to operate on an ensemble of additive Gaussian process models, each of which possesses a randomized strategy to divide and conquer. We show unprecedented, previously impossible results of scaling up BO to tens of thousands of observations within minutes of computation.

1 Introduction

Global optimization of black-box and non-convex functions is an important component of modern machine learning. From optimizing hyperparameters in deep models to solving inverse problems encountered in computer vision and policy search for reinforcement learning, these optimization problems have many important applications in machine learning and its allied disciplines. In the past decade, Bayesian optimization has become a popular approach for global optimization of non-convex functions that are expensive to evaluate. Recent work addresses better query strategies (Kushner, 1964; Moc̆kus, 1974; Srinivas et al., 2012; Hennig and Schuler, 2012; Hernández-Lobato et al., 2014; Wang et al., 2016a; Kawaguchi et al., 2015), techniques for batch queries (Desautels et al., 2014; González et al., 2016), and algorithms for high dimensional problems (Wang et al., 2016b; Kandasamy et al., 2015).

Despite the above-mentioned successes, Bayesian optimization remains somewhat impractical, since it is typically coupled with expensive function estimators (Gaussian processes) and non-convex acquisition functions that are hard to optimize in high dimensions and sometimes expensive to evaluate. To alleviate these difficulties, recent work explored the use of random feature approximations (Snoek et al., 2015; Lakshminarayanan et al., 2016) and sparse Gaussian processes (McIntire et al., 2016), but, while improving scalability, these methods still suffer from misestimation of confidence bounds (an essential part of the acquisition functions) and from expensive or inaccurate Gaussian process (GP) hyperparameter inference. Indeed, to the best of our knowledge, Bayesian optimization is typically limited to a few thousand evaluations (Lakshminarayanan et al., 2016). Yet, reliable search and estimation for complex functions in very high-dimensional spaces may well require more evaluations. With the increasing availability of parallel computing resources, large numbers of function evaluations are possible if the underlying approach can leverage the parallelism. Compared to the millions of evaluations possible (and needed) with local methods like stochastic gradient descent, the scalability of global Bayesian optimization leaves large room for desirable progress. In particular, the lack of scalable uncertainty estimates to guide the search is a major roadblock for huge-scale Bayesian optimization.

In this paper, we propose ensemble Bayesian optimization (EBO), a global optimization method targeted to high dimensional, large scale parameter search problems whose queries are parallelizable. Such problems are abundant in hyper and control parameter optimization in machine learning and robotics (Calandra, 2017; Snoek et al., 2012). EBO relies on two main ideas that are implemented at multiple levels: (1) we use efficient partition-based function approximators (across both data and features) that simplify and accelerate search and optimization; (2) we enhance the expressive power of these approximators by using ensembles and a stochastic approach. We maintain an evolving (posterior) distribution over the (infinite) ensemble and, in each iteration, draw one member to perform search and estimation.
In particular, we use a new combination of three types of partition-based approximations: (1-2) For improved GP estimation, we propose a novel hierarchical additive GP model based on tile coding (a.k.a. random binning or Mondrian forest features). We learn a posterior distribution over kernel width and the additive structure; here, Gibbs sampling prevents overfitting. (3) To accelerate the sampler, which depends on the likelihood of the observations, we use an efficient, randomized block approximation of the Gram matrix based on a Mondrian process. Sampling and query selection can then be parallelized across blocks, further accelerating the algorithm.

As a whole, this combination of simple, tractable structure with ensemble learning and randomization improves efficiency, uncertainty estimates and optimization. Moreover, we show that our realization of these ideas offers an alternative explanation for global optimization heuristics that have been popular in other communities, indicating possible directions for further theoretical analysis. Our empirical results demonstrate that EBO can speed up posterior inference by 2-3 orders of magnitude (400 times in one experiment) compared to the state of the art, without sacrificing quality. Furthermore, we demonstrate the ability of EBO to handle sample-intensive hard optimization problems by applying it to real-world problems with tens of thousands of observations.

Related Work  There has been a series of works addressing the three big challenges in BO: selecting batch evaluations (Contal et al., 2013; Desautels et al., 2014; González et al., 2016; Wang et al., 2017; Daxberger and Low, 2017), high-dimensional input spaces (Wang et al., 2016b; Djolonga et al., 2013; Li et al., 2016; Kandasamy et al., 2015; Wang et al., 2017; Wang and Jegelka, 2017), and scalability (Snoek et al., 2015; Lakshminarayanan et al., 2016; McIntire et al., 2016). Although these three problems tend to co-occur, this paper is the first (to the best of our knowledge) to address all three challenges jointly in one framework.

Most closely related to parts of this paper is Wang et al. (2017), but our algorithm significantly improves on that work in terms of scalability (see Sec. 4.1 for an empirical comparison) and has fundamental technical differences. First, the Gibbs sampler by Wang et al. (2017) only learns the additive structure but not the kernel parameters, while our sampler jointly learns both of them. Second, our proposed algorithm partitions the input space for scalability and parallel inference; we achieve this by a Mondrian forest. Third, as a result, our method automatically generates batch queries, while the other work needs an explicit batch strategy.

Other parts of our framework are inspired by the Mondrian forest (Lakshminarayanan et al., 2016), which partitions the input space via a Mondrian tree and aggregates trees into a forest. The closely related Mondrian kernels (Balog et al., 2016) use random features derived from Mondrian forests to construct a kernel; such a kernel, in fact, approximates a Laplace kernel. Mondrian forest features can be considered a special case of the popular tile coding features widely used in reinforcement learning (Sutton and Barto, 1998; Albus et al., 1975). Lakshminarayanan et al. (2016) showed that Mondrian forest kernels scale better than the regular GP and achieve good uncertainty estimates in many low-dimensional problems.

Besides Mondrian forests, there is a rich literature on sparse GP methods that address the scalability of GP regression (Seeger et al., 2003; Snelson and Ghahramani, 2006; Titsias, 2009; Hensman et al., 2013). However, these methods are mostly only shown to be useful when the input dimension is low and there exist redundant data points, so that inducing points can be selected to emulate the original posterior GP well; such data redundancy is usually not the case in high-dimensional Bayesian optimization. Recent applications of sparse GPs in BO (McIntire et al., 2016) only consider experiments with fewer than 80 function evaluations and do not show results on large scale observations. Another approach to tackling large scale GPs distributes the computation via local experts (Deisenroth and Ng, 2015). However, this is not very suitable for the acquisition function optimization needed in Bayesian optimization, since every valid prediction needs to synchronize the predictions from all the local experts. Our paper is also related to Gramacy and Lee (2008). While Gramacy and Lee (2008) focuses on modeling non-stationary functions with treed partitions, our work integrates tree structures and Bayesian optimization in a novel way.

2 Background and Challenges

Consider a simple but high-dimensional search space X = [0, R]^D ⊆ R^D. We aim to find a maximizer x* ∈ arg max_{x ∈ X} f(x) of a black-box function f : X → R.

Gaussian processes.  Gaussian processes (GPs) are popular priors for modeling the function f in Bayesian optimization. They define distributions over functions where any finite set of function values has a multivariate Gaussian distribution. A Gaussian process GP(µ, κ) is fully specified by a mean function µ(·) and covariance (kernel) function κ(·, ·). Let f be a function sampled from GP(0, κ). Given observations D_n = {(x_t, y_t)}_{t=1}^n where y_t ∼ N(f(x_t), σ), we obtain the posterior mean and variance of the function as

    µ_n(x) = κ_n(x)^T (K_n + σ^2 I)^{-1} y_n,                         (2.1)
    σ_n^2(x) = κ(x, x) − κ_n(x)^T (K_n + σ^2 I)^{-1} κ_n(x),          (2.2)

via the kernel matrix K_n = [κ(x_i, x_j)]_{x_i, x_j ∈ D_n} and κ_n(x) = [κ(x_i, x)]_{x_i ∈ D_n} (Rasmussen and Williams, 2006).
The log data likelihood for D_n is given by

    log p(D_n) = −(1/2) y_n^T (K_n + σ^2 I)^{-1} y_n − (1/2) log |K_n + σ^2 I| − (n/2) log 2π.    (2.3)

While GPs provide flexible, broadly applicable function estimators, the O(n^3) computation of the inverse (K_n + σ^2 I)^{-1} and determinant |K_n + σ^2 I| can become major bottlenecks as n grows, for both posterior function value predictions and data likelihood estimation.
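To make the costs in Eqs. (2.1)-(2.3) concrete, here is a minimal NumPy sketch of the exact GP posterior and log marginal likelihood. It assumes a squared exponential kernel as a stand-in (the paper's models use tile coding, Mondrian grids and Laplace kernels), and the function names are illustrative rather than taken from the released EBO code.

```python
# Minimal sketch of Eqs. (2.1)-(2.3); the O(n^3) solve/determinant is the bottleneck.
import numpy as np

def se_kernel(A, B, lengthscale=1.0):
    # Squared exponential kernel matrix between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, Xstar, sigma=0.1, lengthscale=1.0):
    # Posterior mean and variance of a zero-mean GP, Eqs. (2.1)-(2.2).
    K = se_kernel(X, X, lengthscale) + sigma ** 2 * np.eye(len(X))
    Kinv_y = np.linalg.solve(K, y)
    ks = se_kernel(Xstar, X, lengthscale)                 # kappa_n(x) for each test point
    mu = ks @ Kinv_y
    var = 1.0 - np.einsum('ij,ij->i', ks, np.linalg.solve(K, ks.T).T)
    return mu, var

def log_marginal_likelihood(X, y, sigma=0.1, lengthscale=1.0):
    # Log data likelihood of Eq. (2.3), via a Cholesky factorization of K_n + sigma^2 I.
    n = len(X)
    K = se_kernel(X, X, lengthscale) + sigma ** 2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)
```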
Additive structure.  To reduce the complexity of the vanilla GP, we assume a latent decomposition of the input dimensions [D] = {1, ..., D} into disjoint subspaces, namely, ∪_{m=1}^M A_m = [D] and A_i ∩ A_j = ∅ for all i ≠ j, i, j ∈ [M]. As a result, the function f decomposes as f(x) = ∑_{m∈[M]} f_m(x^{A_m}) (Kandasamy et al., 2015). If each component f_m is drawn independently from GP(µ^(m), κ^(m)) for all m ∈ [M], the resulting f will also be a sample from a GP: f ∼ GP(µ, κ), with µ(x) = ∑_{m∈[M]} µ_m(x^{A_m}) and κ(x, x') = ∑_{m∈[M]} κ^(m)(x^{A_m}, x'^{A_m}).

The additive structure reduces sample complexity and helps BO to search more efficiently and effectively, since the acquisition function can be optimized component-wise. But it remains challenging to learn a good decomposition structure {A_m}. Recently, Wang et al. (2017) proposed learning via Gibbs sampling. This sampler takes hours for merely a few hundred points, because it needs a vast number of expensive data likelihood computations.
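The additive kernel itself is simple to assemble once a decomposition {A_m} is fixed. The sketch below uses Laplace components, matching the kernels used in the paper's experiments; the example decomposition is made up for illustration.

```python
# Sketch of an additive kernel: each component only sees its own block of dimensions.
import numpy as np

def laplace_component(A, B, dims, lengthscale=0.1):
    # kappa^(m)(x^{A_m}, x'^{A_m}) = exp(-||x^{A_m} - x'^{A_m}||_1 / lengthscale)
    Ad, Bd = A[:, dims], B[:, dims]
    d1 = np.abs(Ad[:, None, :] - Bd[None, :, :]).sum(-1)
    return np.exp(-d1 / lengthscale)

def additive_kernel(A, B, decomposition, lengthscale=0.1):
    # kappa(x, x') = sum_m kappa^(m)(x^{A_m}, x'^{A_m})
    return sum(laplace_component(A, B, dims, lengthscale) for dims in decomposition)

# Example: D = 6 split into M = 3 groups of two dimensions each (a made-up decomposition).
decomposition = [[0, 1], [2, 3], [4, 5]]
X = np.random.rand(8, 6)
K = additive_kernel(X, X, decomposition)    # 8 x 8 Gram matrix
```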

Random features.  It is possible to use random features (Rahimi et al., 2007) to approximate the GP kernel and alleviate the O(n^3) computation in Eq. (2.1) and Eq. (2.3). Let φ : X → R^{D_R} be the (scaled) random feature operator and Φ_n = [φ(x_1), ..., φ(x_n)]^T ∈ R^{n×D_R}. The GP posterior mean and variance can be written as

    µ_n(x) = σ^{-2} φ(x)^T Σ_n Φ_n^T y_n,    (2.4)
    σ_n^2(x) = φ(x)^T Σ_n φ(x),              (2.5)

where Σ_n = (Φ_n^T Φ_n σ^{-2} + I)^{-1}. By the Woodbury matrix identity and the matrix determinant lemma, the log data likelihood becomes

    log p(D_n) = (σ^{-4}/2) y_n^T Φ_n Σ_n Φ_n^T y_n − (1/2) log |Σ_n^{-1}| − (σ^{-2}/2) y_n^T y_n − (n/2) log 2πσ^2.    (2.6)

The number of random features necessary to approximate the GP well in general increases with the number of observations (Rudi et al., 2017). Hence, for large-scale observations, we cannot expect to solely use a fixed number of features. Moreover, learning hyperparameters for random features is expensive: for Fourier features, the computation of Eq. (2.6) means re-computing the features, plus O(D_R^3) for the inverse and determinant. With Mondrian features (Lakshminarayanan et al., 2016), we can learn the kernel width efficiently by adding more Mondrian blocks, but this procedure is not well compatible with learning additive structure, since the whole structure of the sampled Mondrian features will change. In addition, we typically need a forest of trees for a good approximation.
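For concreteness, the following sketch implements the random-feature posterior of Eqs. (2.4)-(2.5) with random Fourier features for a squared exponential kernel (Rahimi et al., 2007). This is an illustrative stand-in, not the paper's feature construction, which is based on tile coding and Mondrian grids.

```python
# Sketch of the random-feature GP posterior, Eqs. (2.4)-(2.5), with Fourier features.
import numpy as np

def fourier_features(X, W, b):
    # phi(x) = sqrt(2 / D_R) * cos(W x + b), with D_R = number of random features.
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

def rf_posterior(X, y, Xstar, W, b, sigma=0.1):
    DR = W.shape[0]
    Phi = fourier_features(X, W, b)                       # n x D_R
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma ** 2 + np.eye(DR))
    phi_star = fourier_features(Xstar, W, b)
    mu = phi_star @ Sigma @ Phi.T @ y / sigma ** 2        # Eq. (2.4)
    var = np.einsum('ij,jk,ik->i', phi_star, Sigma, phi_star)   # Eq. (2.5)
    return mu, var

# Frequencies for a squared exponential kernel with unit lengthscale, D = 5 inputs.
D, DR = 5, 200
rng = np.random.default_rng(0)
W = rng.normal(size=(DR, D))
b = rng.uniform(0, 2 * np.pi, size=DR)
```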
Tile coding.  Tile coding (Sutton and Barto, 1998; Albus et al., 1975) is a k-hot encoding widely used in reinforcement learning as an efficient set of non-linear features. In its simplest form, tile coding is defined by k partitions, referred to as layers. An encoded data point becomes a binary vector with a non-zero entry for each bin containing the data point. There exist methods for sampling random partitions that allow approximating various kernels, such as the 'hat' kernel (Rahimi et al., 2007), making tile coding well suited for our purposes.
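A minimal sketch of one-dimensional tile coding on [0, R] is given below: each of L layers samples a number of cuts and a random offset, and a point is encoded by the indicator of the bin it falls into, giving a k-hot vector over layers. The sampling scheme loosely follows the TileGP generative model described later; it is not the released EBO implementation.

```python
# Sketch of 1D tile coding: L layers of uniformly spaced cuts with random offsets.
import numpy as np

def sample_tiling(R, lam, L, rng):
    # For each layer, sample the number of cuts and a random offset.
    layers = []
    for _ in range(L):
        k = max(1, rng.poisson(lam * R))       # at least one cut, an assumption for simplicity
        delta = rng.uniform(0, R / k)
        layers.append((k, delta))
    return layers

def encode(x, R, layers):
    # Concatenate, over layers, a one-hot indicator of the bin containing x.
    feats = []
    for k, delta in layers:
        onehot = np.zeros(k + 1)               # k cuts create at most k + 1 bins
        if x < delta:
            idx = 0
        else:
            idx = min(int((x - delta) // (R / k)) + 1, k)
        onehot[idx] = 1
        feats.append(onehot)
    # Scale by 1/sqrt(L) so that phi(x)^T phi(x') averages the per-layer indicators.
    return np.concatenate(feats) / np.sqrt(len(layers))

rng = np.random.default_rng(0)
layers = sample_tiling(R=1.0, lam=10.0, L=5, rng=rng)
phi = encode(0.37, 1.0, layers)                # sparse k-hot feature vector
```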

Variance starvation.  It is probably not surprising that using finite random features to learn the function distribution will result in a loss of accuracy (Forster, 2005). For example, we observed that, while the mean predictions are preserved reasonably well around regions where we have observations, both mean and confidence bound predictions can become very bad in regions where we do not have observations, once there are more observations than features. We refer to this underestimation of the variance scale compared to the mean scale, illustrated in Fig. 1, as variance starvation.

Figure 1: We use 1000 Fourier features to approximate a 1D GP with a squared exponential kernel. The observations are samples from a function f (red line) drawn from the GP with zero mean in the range [−10, 0.5]. (a) Given 100 sampled observations (red circles), the Fourier features lead to reasonable confidence bounds. (b) Given 1000 sampled observations (red circles), the quality of the variance estimates degrades. (c) With additional samples (5000 observations), the problem is exacerbated: the scale of the variance predictions relative to the mean prediction is very small. (d) For comparison, the proper predictions of the original full GP conditioned on the same 5000 observations as in (c). Variance starvation becomes a serious problem for random features when the size of the data is close to or larger than the size of the features.

3 Ensemble Bayesian Optimization

Next, we describe an approach that scales Bayesian optimization when parallel computing resources are available. We name our approach, outlined in Alg. 1, Ensemble Bayesian Optimization (EBO). At a high level, EBO uses a (stochastic) series of Mondrian trees to partition the input space, learn the kernel parameters of a GP locally, and aggregate these parameters. Our forest hence spans across BO iterations.

In the t-th iteration of EBO in Alg. 1, we use a Mondrian process to randomly partition the search space into J parts (line 4), where J can depend on the size of the observation set D_{t−1}. For the j-th partition, we have a subset D^j_{t−1} of observations. From those observations, we learn a local GP with random tile coding and additive structure, via Gibbs sampling (line 6). For conciseness, we refer to such GPs as TileGPs. The probabilistic tile coding can be replaced by a Mondrian grid that approximates a Laplace kernel (Balog and Teh, 2015). Once a TileGP is learned locally, we can run BO with the acquisition function η in each partition to generate a candidate set of points and, from those, select a batch that is both informative (high-quality) and diverse (line 14).

Algorithm 1 Ensemble Bayesian Optimization (EBO)
1: function EBO(f, D_0)
2:   Initialize z, k
3:   for t = 1, ..., T do
4:     {X_j}_{j=1}^J ← MONDRIAN([0, R]^D, z, k, J)
5:     parfor j = 1, ..., J do
6:       z^j, k^j ← GIBBSSAMPLING(z, k | D^j_{t−1})
7:       η^j_{t−1}(·) ← ACQUISITION(D^j_{t−1}, z^j, k^j)
8:       {A_m}_{m=1}^M ← DECOMPOSITION(z^j)
9:       for m = 1, ..., M do
10:        x_{tj}^{A_m} ← arg max_{x ∈ X^{A_m}} η^j_{t−1}(x)
11:      end for
12:    end parfor
13:    z ← SYNC({z^j}_{j=1}^J), k ← SYNC({k^j}_{j=1}^J)
14:    {x_{tb}}_{b=1}^B ← FILTER({x_{tj}}_{j=1}^J | z, k)
15:    parfor b = 1, ..., B do
16:      y_{tb} ← f(x_{tb})
17:    end parfor
18:    D_t ← D_{t−1} ∪ {x_{tb}, y_{tb}}_{b=1}^B
19:  end for
20: end function

Since, in each iteration, we draw an input space partition and update the kernel width and the additive structure, the algorithm may be viewed as implicitly and stochastically running BO on an ensemble of GP models. In the following, we describe our model and the procedures of Alg. 1 in detail. In the Appendix, we show an illustration of how EBO optimizes a 2D function.

3.1 Partitioning the input space via a Mondrian process

When faced with a "big" problem, a natural idea is to divide and conquer. For large scale Bayesian optimization, the question is how to divide without losing the valuable local information that gives good uncertainty measures. In EBO, we use a Mondrian process to divide the input space and the observed data, so that nearby data points remain together in one partition, preserving locality (we include the algorithm for input space partitioning in the appendix). The Mondrian process uses axis-aligned cuts to divide the input space [0, R]^D into a set of partitions {X_j}_{j=1}^J where ∪_j X_j = [0, R]^D and X_i ∩ X_j = ∅ for all i ≠ j. Each partition X_j can be conveniently described by a hyperrectangle [l_1^j, h_1^j] × ··· × [l_D^j, h_D^j], which facilitates the efficient use of tile coding and Mondrian grids in a TileGP. In the next section, we define a TileGP and introduce how its parameters are learned.

3.2 Learning a local TileGP via Gibbs sampling

For the j-th hyperrectangle partition X_j = [l_1^j, h_1^j] × ··· × [l_D^j, h_D^j], we use a TileGP to model the function f locally. We use the acronym "TileGP" to denote the Gaussian process model that uses additive kernels, with each component represented by tilings. We show the details of the generative model for TileGP in Alg. 2 and the graphical model in Fig. 2, with fixed hyper-parameters α, β_0, β_1. The main difference to the additive GP model used in (Wang et al., 2017) is that TileGP constructs a hierarchical model for the random features (and hence, the kernels), while Wang et al. (2017) do not consider the kernel parameters to be part of the generative model.

Figure 2: The graphical model for TileGP, a GP with additive and tile kernel partitioning structure. The parameter λ controls the rate for the number of cuts k of the tilings (inverse of the kernel bandwidth); the parameter z controls the additive decomposition of the input feature space.
The random features are based on tile coding or Mondrian grids, with the number of cuts generated by D Poisson processes on [l_d^j, h_d^j] for each dimension d = 1, ..., D. On the i-th layer of the tilings, tile coding samples the offset δ from a uniform distribution U[0, (h_d^j − l_d^j)/k_di] and places the cuts uniformly starting at δ + l_d^j. The Mondrian grid samples k_di cut locations uniformly randomly from [l_d^j, h_d^j]. Because of the data partition, we always have more features than observations, which can alleviate the variance starvation problem described in Section 2.

We can use Gibbs sampling to efficiently learn the cut parameter k and decomposition parameter z by marginalizing out λ and θ. Notice that both k and z take discrete values; hence, unlike other continuous GP parameterizations, we only need to sample discrete variables for Gibbs sampling.
Algorithm 2 Generative model for TileGP
1: Draw mixing proportions θ ∼ DIR(α)
2: for d = 1, ..., D do
3:   Draw additive decomposition z_d ∼ MULTI(θ)
4:   Draw Poisson rate parameter λ_d ∼ GAMMA(β_0, β_1)
5:   for i = 1, ..., L do
6:     Draw number of cuts k_di ∼ POISSON(λ_d (h_d^j − l_d^j))
7:     Tile coding: draw offset δ ∼ U[0, (h_d^j − l_d^j)/k_di];  Mondrian grids: draw cut locations b ∼ U[l_d^j, h_d^j]
8:   end for
9: end for
10: Construct the feature projection φ and the kernel κ = φ^T φ from z and the sampled tiles
11: Draw function f ∼ GP(0, κ)
12: Given input x, draw function value y ∼ N(f(x), σ)
Given the observations D_{t−1} in the j-th hyperrectangle partition, the posterior distribution of the (local) parameters λ, k, z, θ is

    p(λ, k, z, θ | D_{t−1}; α, β) ∝ p(D_{t−1} | z, k) p(z | θ) p(k | λ) p(θ; α) p(λ; β).

Marginalizing over the Poisson rate parameter λ and the mixing proportion θ gives

    p(k, z | D_{t−1}; α, β) ∝ p(D_{t−1} | z, k) ∫ p(z | θ) p(θ; α) dθ ∫ p(k | λ) p(λ; β) dλ
                            ∝ p(D_{t−1} | z, k) ∏_m [Γ(|A_m| + α_m) / Γ(α_m)] × ∏_d [Γ(β_1 + |k_d|) / ((∏_{i=1}^L k_di!) (β_0 + L)^{β_1 + |k_d|})],

where |k_d| = ∑_{i=1}^L k_di. Hence, we only need to sample k and z when learning the hyperparameters of the TileGP kernel. For each dimension d, we sample the group assignment z_d according to

    p(z_d = m | D_{t−1}, k, z_{¬d}; α) ∝ p(D_{t−1} | z, k) p(z_d | z_{¬d}) ∝ p(D_{t−1} | z, k) (|A_m| + α_m).    (3.1)

We sample the number of cuts k_di for each dimension d and each layer i from the posterior

    p(k_di | D_{t−1}, k_{¬di}, z; β) ∝ p(D_{t−1} | z, k) p(k_di | k_{¬di}) ∝ p(D_{t−1} | z, k) Γ(β_1 + |k_d|) / ((β_0 + L)^{k_di} k_di!).    (3.2)
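Below is a sketch of one Gibbs sweep over the discrete TileGP parameters following Eqs. (3.1)-(3.2). The data likelihood p(D_{t−1} | z, k) is passed in as an assumed callable (log_data_likelihood), a symmetric Dirichlet parameter α is assumed, and the Poisson support for k_di is truncated for tractability; none of these names come from the paper's code.

```python
# Sketch of one Gibbs sweep over z (group assignments) and k (cut counts).
import numpy as np
from scipy.special import gammaln

def gibbs_sweep(z, k, M, alpha, beta0, beta1, L, log_data_likelihood, rng):
    D = len(z)
    # Resample the group assignment z_d of every dimension, Eq. (3.1).
    for d in range(D):
        logp = np.empty(M)
        for m in range(M):
            z[d] = m
            size_Am = np.sum(z == m) - 1            # |A_m| excluding dimension d itself
            logp[m] = log_data_likelihood(z, k) + np.log(size_Am + alpha)
        logp -= logp.max()
        z[d] = rng.choice(M, p=np.exp(logp) / np.exp(logp).sum())
    # Resample the number of cuts k_di for every dimension and layer, Eq. (3.2).
    proposals = np.arange(1, 30)                    # truncated Poisson support (an assumption)
    for d in range(D):
        for i in range(L):
            logp = np.empty(len(proposals))
            for idx, kdi in enumerate(proposals):
                k[d, i] = kdi
                kd_total = k[d].sum()
                logp[idx] = (log_data_likelihood(z, k)
                             + gammaln(beta1 + kd_total)
                             - kdi * np.log(beta0 + L) - gammaln(kdi + 1))
            logp -= logp.max()
            k[d, i] = proposals[rng.choice(len(proposals), p=np.exp(logp) / np.exp(logp).sum())]
    return z, k
```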
If distributed computing is available, each hyperrectangle partition of the input space is assigned a worker to manage all the computations within this partition. On each worker, we use the above Gibbs sampling method to learn the additive structure and kernel bandwidth jointly. Conditioned on the observations associated with the partition on the worker, we use the learned posterior TileGP to select the most promising input point in this partition, and eventually send this candidate input point back to the main process together with the learned decomposition parameter z and the cut parameter k. In the next section, we introduce the acquisition function we use in each worker and how to filter the recommended candidates from all the partitions.

3.3 Acquisition functions and filtering

In this paper, we mainly focus on parameter search problems where the objective function is designed by an expert and the global optimum or an upper bound on the function is known. While any BO acquisition function can be used within the EBO framework, we use an acquisition function from Wang and Jegelka (2017) to exploit the knowledge of the upper bound. Let f* be such an upper bound, i.e., ∀x ∈ X, f* ≥ f(x). Given the observations D^j_{t−1} associated with the j-th partition of the input space, we minimize the acquisition function η^j_{t−1}(x) = (f* − µ^j_{t−1}(x)) / σ^j_{t−1}(x). Since the kernel is additive, we can optimize η^j_{t−1}(·) separately for each additive component: for the m-th component of the additive structure, we optimize η^j_{t−1}(·) only over the active dimensions A_m. This resembles block coordinate descent and greatly facilitates the optimization of the acquisition function.
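The block-wise optimization of η can be sketched as follows: for each additive component, search only over the dimensions in A_m while keeping the other coordinates fixed. Here posterior_m, which returns the posterior mean and standard deviation of the m-th component, is an assumed callable; the random search could be refined with L-BFGS-B as described in the appendix.

```python
# Sketch of optimizing eta(x) = (f* - mu(x)) / sigma(x) one additive block at a time.
import numpy as np

def minimize_eta_blockwise(f_star, decomposition, posterior_m, lows, highs, rng, n_cand=1000):
    x = rng.uniform(lows, highs)                     # start from a random point in the partition
    for m, dims in enumerate(decomposition):
        # Random search over the low-dimensional block A_m.
        cand = rng.uniform(lows[dims], highs[dims], size=(n_cand, len(dims)))
        mu, sd = posterior_m(m, cand)                # posterior of the m-th component, shape (n_cand,)
        eta = (f_star - mu) / np.maximum(sd, 1e-12)
        x[dims] = cand[np.argmin(eta)]               # keep the block that minimizes eta
    return x
```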
Filtering.  Once we have a proposed query point from each partition, we select B of them according to the scoring function ξ(X) = log det K_X − ∑_{b=1}^B η(x_b), where X = {x_b}_{b=1}^B. We use the log determinant term to force diversity and η to maintain quality. We maximize this function greedily. In some cases, the number of partitions J can be smaller than the batch size B; in this case, one may either use just J candidates or use batch BO on each partition. We use the latter, and discuss details in the appendix.
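A greedy maximization of ξ(X) can be sketched as follows; kernel and eta are assumed callables, and the jitter term is an implementation detail added for numerical stability rather than part of the paper's definition.

```python
# Greedy sketch of the filter: grow the batch by the candidate with the largest score.
import numpy as np

def filter_batch(candidates, B, kernel, eta, jitter=1e-6):
    chosen = []
    remaining = list(range(len(candidates)))
    eta_vals = np.array([eta(candidates[i]) for i in remaining])
    for _ in range(min(B, len(candidates))):
        best, best_score = None, -np.inf
        for i in remaining:
            idx = chosen + [i]
            X = candidates[idx]
            K = kernel(X, X) + jitter * np.eye(len(idx))      # jitter keeps K_X well conditioned
            score = np.linalg.slogdet(K)[1] - eta_vals[idx].sum()
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        remaining.remove(best)
    return candidates[chosen]
```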

3.4 Efficient data likelihood computation and parameter synchronization

For the random features, we use tile coding due to its sparsity and efficiency. Since non-zero features can be found and computed by binning, the computational cost for encoding a data point scales linearly with the number of dimensions and the number of layers. The resulting representation is sparse and convenient to use. Additionally, the number of non-zero features is quite small, which allows us to efficiently compute a sparse Cholesky decomposition of the inner product (Gram matrix) or the outer product of the data. This allows us to efficiently compute the data likelihoods.

In each iteration t, after the batch workers return the learned decomposition indicator z^b and the number of tiles k^b, b ∈ [B], we synchronize these two parameters (line 13 of Alg. 1). For the number of tiles k, we set k_d to be the rounded mean of {k_d^b}_{b=1}^B for each dimension d ∈ [D]. For the decomposition indicator, we use correlation clustering to cluster the input dimensions.
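A sketch of this synchronization step is given below. The cut parameters are synchronized by a rounded mean, as described; for the decomposition, the greedy grouping over a co-assignment matrix is only a simple stand-in for the correlation clustering used in the paper.

```python
# Sketch of the parameter synchronization step (line 13 of Alg. 1).
import numpy as np

def sync_cuts(k_workers):
    # k_workers: list of (D, L) arrays returned by the B workers; rounded per-dimension mean.
    return np.rint(np.mean(k_workers, axis=0)).astype(int)

def sync_decomposition(z_workers, threshold=0.5):
    z_workers = np.asarray(z_workers)            # shape (B, D), group ids per worker
    B, D = z_workers.shape
    # Fraction of workers that put dimensions d and d' in the same group.
    co = (z_workers[:, :, None] == z_workers[:, None, :]).mean(0)
    groups, assigned = [], np.zeros(D, dtype=bool)
    for d in range(D):
        if assigned[d]:
            continue
        members = np.where((co[d] >= threshold) & ~assigned)[0]   # greedy stand-in for clustering
        assigned[members] = True
        groups.append(members.tolist())
    return groups
```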

3.5 Relations to Mondrian kernels, random binning and additive Laplace kernels

Our model described in Section 3.2 can use tile coding and Mondrian grids to construct the kernel. Tile coding and Mondrian grids are also closely related to Mondrian features and random binning: all four kinds of random features attempt to find a sparse random feature representation for the raw input x based on a partition of the space with the help of layers. We illustrate the differences between one layer of the features constructed by tile coding, Mondrian grids, Mondrian features and random binning in the appendix. Mondrian grids, Mondrian features and random binning all converge to the Laplace kernel as the number of layers L goes to infinity. The tile coding kernel, however, does not approximate a Laplace kernel. Our model with Mondrian grids approximates an additive Laplace kernel:

Lemma 3.1. Let the random variable k_di ∼ POISSON(λ_d R) be the number of cuts in the Mondrian grids of TileGP for dimension d ∈ [D] and layer i ∈ [L]. The TileGP kernel κ_L satisfies lim_{L→∞} κ_L(x, x') = (1/M) ∑_{m=1}^M e^{−λ_d R |x^{A_m} − x'^{A_m}|}, where {A_m}_{m=1}^M is the additive decomposition.

We prove the lemma in the appendix. Balog et al. (2016) showed that in practice, the Mondrian kernel constructed from Mondrian features may perform slightly better than random binning in certain cases. Although it would be possible to use a Mondrian partition for each layer of tile coding, we only consider uniform, grid based binning with random offsets because this allows the non-zero features to be computed more efficiently (O(1) instead of O(log k)). Note that as more dimensions are discretized in this manner, the number of features grows exponentially. However, the number of non-zero entries can be independently controlled, allowing us to create sparse representations that remain computationally tractable.

3.6 Connections to evolutionary algorithms

Next, we make some observations that connect our randomized ensemble BO to ideas for global optimization heuristics that have successfully been used in other communities. In particular, these connections offer an explanation from a BO perspective and may aid further theoretical analysis.

Evolutionary algorithms (Back, 1996) maintain an ensemble of "good" candidate solutions (called chromosomes) and, from those, generate new query points via a number of operations. These methods, too, implicitly need to balance exploration with local search in areas known to have high function values. Hence, there are local operations (mutations) for generating new points, such as random perturbations or local descent methods, and global operations. While it is relatively straightforward to draw connections between those local operations and optimization methods used in machine learning, we here focus on global exploration.
A popular global operation is crossover: given two "good" points x, y ∈ R^D, this operation outputs a new point z whose coordinates are a combination of the coordinates of x and y, i.e., z_i ∈ {x_i, y_i} for all i ∈ [D]. In fact, this operation is analogous to BO with a (randomized) additive kernel: the crossover strategy implicitly corresponds to the assumption that high function values can be achieved by combining coordinates from points with high function values. For comparison, consider an additive kernel κ(x, x') = ∑_{m=1}^M κ^(m)(x^{A_m}, x'^{A_m}) and f(x) = ∑_{m=1}^M f^m(x^{A_m}). Since each sub-kernel κ^(m) is "blind" to the dimensions in the complement of A_m, any point x' that is close to an observed high-value point x in the dimensions A_m will receive a high value f^m(x'), independent of the other dimensions, and, as a result, looks like a "good" candidate.

We illustrate this reasoning with a 2D toy example. Figure 3 shows the posterior mean prediction and the GP-UCB criterion f̂(x) + 0.1σ(x) for an additive kernel with A_1 = {1}, A_2 = {2} and κ^m(x_m, y_m) = exp(−2(x_m − y_m)^2). High values of the observed points generalize along the dimensions "ignored" by the sub-kernels. After two good observations (−1, 0) and (2, 2), the "crossover" points (−1, 2) and (2, 0) are local maxima of GP-UCB and the posterior mean.

Figure 3: Posterior mean function (a, c) and GP-UCB acquisition function (b, d) for an additive GP in 2D. The maxima of the posterior mean and acquisition function are at the points resulting from an exchange of coordinates between the "good" observed points (−1, 0) and (2, 2).

In real data, we do not know the best fitting underlying grouping structure of the coordinates. Hence, crossover does a random search over such partitions by performing random coordinate combinations, whereas our adaptive BO approach maintains a posterior distribution over partitions that adapts to the data.
4 Experiments

We empirically verify the scalability of EBO and the effectiveness of its random adaptive Mondrian partitions, and finally evaluate EBO on two real-world problems. Our code is publicly available at https://github.com/zi-w/Ensemble-Bayesian-Optimization.

4.1 Scalability of EBO

We compare EBO with a recent, state-of-the-art additive kernel learning algorithm, Structural Kernel Learning (SKL) (Wang et al., 2017). EBO can make use of parallel resources both for Gibbs sampling and BO query selections, while SKL can only parallelize query selections but not sampling. Because the kernel learning part is the computationally dominating factor of large scale BO, we compare the time each method needs to run 10 iterations of Gibbs sampling with 100 to 50000 observations in 20 dimensions. We show the timing results for the Gibbs samplers in Fig. 4(a), where EBO uses 240 cores via the Batch Service of Microsoft Azure. Due to a time limit we imposed, we did not finish SKL for more than 1500 observations. EBO runs more than 390 times faster than SKL when the observation size is 1500. Comparing the quality of the learned parameter z for the additive structure, SKL has a Rand Index of 96.3% and EBO has a Rand Index of 96.8%, which are similar. In Fig. 4(b), we show speed-ups for different numbers of cores. EBO with 500 cores is not significantly faster than with 240 cores because EBO runs synchronized parallelization, whose runtime is decided by the slowest core; it is often the case that most of the cores have finished while the program is waiting for the slowest 1 or 2 cores to finish.

Figure 4: (a) Timing for the Gibbs sampler of EBO and SKL. EBO is significantly faster than SKL when the observation size N is relatively large. (b) Speed-up of EBO with 100, 240, 500 cores over EBO with 10 cores on 30,000 observations. Running EBO with 240 cores is almost 20 times faster than with 10 cores.

4.2 Effectiveness of EBO

Figure 5: Averaged results of the regret of BO-SVI, BO-Add-SVI, PBO and EBO on 4 different functions drawn from a 50D GP with an additive Laplace kernel. BO-SVI has the highest regret for all functions. Using an additive GP within SVI (BO-Add-SVI) significantly improves over the full kernel. In general, EBO finds a good point much faster than the other methods.
Optimizing synthetic functions  We verify the effectiveness of using ensemble models for BO on 4 functions randomly sampled from a 50-dimensional GP with an additive Laplace kernel. The hyperparameter of the Laplace kernel is known. In each iteration, each algorithm evaluates a batch of parameters of size B in parallel. We denote r̃_t = max_{x∈X} f(x) − max_{b∈[B]} f(x_{t,b}) as the immediate regret obtained by the batch at iteration t, and r_T = min_{t≤T} r̃_t as the regret, which captures the minimum gap between the best point found and the global optimum of the black-box function f.

We compare BO using SVI (Hensman et al., 2013) (BO-SVI), BO using SVI with an additive GP (BO-Add-SVI) and a distributed version of BO with a fixed partition (PBO) against EBO with a randomly sampled partition in each iteration. PBO has the same 1000 Mondrian partitions in all the iterations, while EBO can have at most 1000 Mondrian partitions. BO-SVI uses an isotropic Laplace kernel without any additive structure, while BO-Add-SVI, PBO and EBO all use the known prior. More detailed experimental settings can be found in the appendix. Our experimental results in Fig. 5 show that EBO is able to find a good point much faster than BO-SVI and BO-Add-SVI; and randomization and the ensemble of partitions matter: EBO is much better than PBO.

Optimizing control parameters for robot pushing  We follow Wang et al. (2017) and test our approach, EBO, on a 14 dimensional control parameter tuning problem for robot pushing. We compare EBO, BO-SVI, BO-Add-SVI and CEM (Szita and Lörincz, 2006) with the same 10^4 random observations and repeat each experiment 10 times. We run all the methods for 200 iterations, where each iteration has a batch size of 100. We plot the median of the best rewards achieved by CEM and EBO at each iteration in Fig. 6. More details on the experimental setups and the reward function can be found in the appendix. Overall, CEM and EBO performed comparably and much better than the sparse GP methods (BO-SVI and BO-Add-SVI). We noticed that among all the experiments, CEM achieved a maximum reward of 10.19 while EBO achieved 9.50; however, EBO behaved slightly better and more stably than CEM, as reflected by the standard deviation of the rewards.

Figure 6: Comparing BO-SVI, BO-Add-SVI, CEM and EBO on a control parameter tuning task with 14 parameters.

Optimizing rover trajectories  To further explore the performance of our method, we consider a trajectory optimization task in 2D, meant to emulate a rover navigation task. We describe a problem instance by defining a start position s and a goal position g as well as a cost function over the state space. Trajectories are described by a set of points on which a BSpline is to be fitted. By integrating the cost function over a given trajectory, we can compute the trajectory cost c(x) of a given trajectory solution x ∈ [0, 1]^60. We define the reward of this problem to be f(x) = c(x) + λ(‖x_{0,1} − s‖_1 + ‖x_{59,60} − g‖_1) + b. This reward function is non-smooth, discontinuous, and concave over the first two and last two dimensions of the input; these 4 dimensions represent the start and goal positions of the trajectory. The results in Fig. 7 show that CEM was able to achieve better results than the BO methods on these functions, while EBO was still much better than the BO alternatives using SVI. More details can be found in the appendix.

Figure 7: Comparing BO-SVI, BO-Add-SVI, CEM and EBO on a 60 dimensional trajectory optimization task.

5 Conclusion

Many black-box function optimization problems are intrinsically high-dimensional and may require a huge number of observations in order to be optimized well. In this paper, we propose a novel framework, ensemble Bayesian optimization, to tackle the problem of scaling Bayesian optimization to both large numbers of observations and high dimensions. To achieve this, we propose a new framework that jointly integrates randomized partitions at various levels: our method is a stochastic method over a randomized, adaptive ensemble of partitions of the input data space; for each part, we use an ensemble of TileGPs, a new GP model we propose based on tile coding and additive structure. We also developed an efficient Gibbs sampling approach to learn the latent variables. Moreover, our method automatically generates batch queries. We empirically demonstrate the effectiveness and scalability of our method on high dimensional parameter search tasks with tens of thousands of observations.
Acknowledgements

We gratefully acknowledge support from NSF CAREER award 1553284, NSF grants 1420927 and 1523767, from ONR grant N00014-14-1-0486, and from ARO grant W911NF1410433. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

References

James S Albus et al. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97(3):220-227, 1975.

Thomas Back. Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press, 1996.

Matej Balog and Yee Whye Teh. The Mondrian process for machine learning. arXiv preprint arXiv:1507.05181, 2015.

Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M Roy, and Yee Whye Teh. The Mondrian kernel. In Uncertainty in Artificial Intelligence (UAI), 2016.

Roberto Calandra. Bayesian Modeling for Optimization and Control in Robotics. PhD thesis, Technische Universität, 2017.

Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2013.

Erik A Daxberger and Bryan Kian Hsiang Low. Distributed batch Gaussian process optimization. In International Conference on Machine Learning (ICML), 2017.

Marc Peter Deisenroth and Jun Wei Ng. Distributed Gaussian processes. arXiv preprint arXiv:1502.02843, 2015.

Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 2014.

Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems (NIPS), 2013.

Malcolm R Forster. Notice: No free lunches for anyone, Bayesians included. Department of Philosophy, University of Wisconsin-Madison, USA, 2005.

Javier González, Zhenwen Dai, Philipp Hennig, and Neil D Lawrence. Batch Bayesian optimization via local penalization. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Robert B Gramacy and Herbert K H Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119-1130, 2008.

Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809-1837, 2012.

James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI), 2013.

José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems (NIPS), 2014.

Kirthevasan Kandasamy, Jeff Schneider, and Barnabas Poczos. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning (ICML), 2015.

Kenji Kawaguchi, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Bayesian optimization with exponential convergence. In Advances in Neural Information Processing Systems (NIPS), 2015.

Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Fluids Engineering, 86(1):97-106, 1964.

Balaji Lakshminarayanan, Daniel M Roy, and Yee Whye Teh. Mondrian forests for large-scale regression when uncertainty matters. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Chun-Liang Li, Kirthevasan Kandasamy, Barnabás Póczos, and Jeff Schneider. High dimensional Bayesian optimization via restricted projection pursuit models. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Mitchell McIntire, Daniel Ratner, and Stefano Ermon. Sparse Gaussian processes for Bayesian optimization. In Uncertainty in Artificial Intelligence (UAI), 2016.

J. Moc̆kus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, 1974.

Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2007.

Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning. 2006.

Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems (NIPS), 2017.
Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics 9, 2003.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems (NIPS), 2006.

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), 2012.

Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning (ICML), 2015.

Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias W Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 2012.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.

István Szita and András Lörincz. Learning Tetris using the noisy cross-entropy method. Learning, 18(12), 2006.

Michalis K Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.

Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning (ICML), 2017.

Zi Wang, Bolei Zhou, and Stefanie Jegelka. Optimization as estimation with Gaussian processes in bandit settings. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016a.

Zi Wang, Chengtao Li, Stefanie Jegelka, and Pushmeet Kohli. Batched high-dimensional Bayesian optimization via structural kernel learning. In International Conference on Machine Learning (ICML), 2017.

Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361-387, 2016b.
A An Illustration of EBO

We give an illustration of the proposed EBO algorithm on a 2D function shown in Fig. 8. This function is a sample from a 2D TileGP, where the decomposition parameter is z = [0, 1], the cut parameter (inverse bandwidth) is k = [10, 10], and the noise parameter is σ = 0.01.

Figure 8: The 2D additive function we optimized in Fig. 9. The global maximum is marked with "+".

The global maximum of this function is at (0.27, 0.41). In this example, EBO is configured to have at least 20 data points on each partition, at most 50 Mondrian partitions, and 100 layers of tiles to approximate the Laplace kernel. We run EBO for 10 iterations with 20 queries in each batch. The results are shown in Fig. 9. In the first iteration, EBO has no information about the function; hence it spreads the queries (blue dots) "evenly" in the input domain to collect information. In the 2nd iteration, based on the evaluations of the selected points (yellow dots), EBO chooses to query batch points (blue dots) that have high acquisition values, which appear to be around the global optimum and some other high valued regions. As the number of evaluations exceeds 20, the minimum number of data points on each partition, EBO partitions the input space with a Mondrian process in the following iterations. Notice that each iteration draws a different partition (shown as the black lines) from the Mondrian process, so that the results will not "over-fit" to one partition setting and the computation can remain efficient. In each partition, EBO runs the Gibbs sampling inference algorithm to fit a local TileGP and uses batched BO to select a few candidates. Then EBO uses a filter to decide the final batch of candidate queries (blue dots) among all the ones recommended from each partition, as described in Sec. C.

Figure 9: An example of 10 iterations of EBO on a 2D toy example plotted in Fig. 8. The selections in each iteration are blue and the existing observations orange. EBO quickly locates the region of the global optimum while still allocating budget to explore regions that appear promising (e.g. around the local optimum (1.0, 0.4)).

B Partitioning the input space via a Mondrian process

Alg. 3 shows the full "Mondrian partitioning" algorithm, i.e., the input space partitioning strategy mentioned in Section 3.1.

Algorithm 3 Mondrian Partitioning
1: function MONDRIANPARTITIONING(V, N_p, S)
2:   while |V| < N_p do
3:     p_j ← length(v_j) · max(0, |D_j| − S), ∀v_j ∈ V
4:     if p_j = 0, ∀j then
5:       break
6:     end if
7:     Sample v_j ∼ p_j / ∑_j p_j, v_j ∈ V
8:     Sample a dimension d ∼ (h_d^j − l_d^j) / ∑_d (h_d^j − l_d^j), d ∈ [D]
9:     Sample cut location u_d^j ∼ U[l_d^j, h_d^j]
10:    v_{j(left)} ← [l_1^j, h_1^j] × ··· × [l_d^j, u_d^j] × ··· × [l_D^j, h_D^j]
11:    v_{j(right)} ← [l_1^j, h_1^j] × ··· × [u_d^j, h_d^j] × ··· × [l_D^j, h_D^j]
12:    V ← V ∪ {v_{j(left)}, v_{j(right)}} \ v_j
13:  end while
14:  return V
15: end function

In particular, we denote the maximum number of Mondrian partitions by N_p (usually the worker pool size in the experiments) and the minimum number of data points in each partition by S. The set of partitions computed by the Mondrian tree (a.k.a. the leaves of the tree), V, is initialized to be the function domain V = {[0, R]^D}, the root of the tree. For each v_j ∈ V described by a hyperrectangle [l_1^j, h_1^j] × ··· × [l_D^j, h_D^j], the length of v_j is computed to be length(v_j) = ∑_{d=1}^D (h_d^j − l_d^j). The observations associated with v_j are D_j. Here, for all (x, y) ∈ D_j, we have x ∈ [l_1^j − ε, h_1^j + ε] × ··· × [l_D^j − ε, h_D^j + ε], where ε controls how many neighboring data points to consider for the partition v_j. In our experiments, ε is set to be 0. Alg. 3 differs from Algorithms 1 and 2 of Lakshminarayanan et al. (2016) in the stop criterion: Lakshminarayanan et al. (2016) use an exponential clock to count down the time for splitting the leaves of the tree, while we split the leaves until the number of Mondrian partitions reaches N_p or no partition has more than S data points. We designed our stop criterion this way to balance the efficiency of EBO and the quality of the selected points. Usually, EBO is faster with a larger number of partitions N_p (i.e., more parallel computing resources), and the quality of the selections is better with a larger number of observations on each partition (S).
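A NumPy sketch of Alg. 3 is given below, representing each partition as a (lower corner, upper corner, observation indices) triple. N_p and S play the same roles as above; the code is illustrative and not the released EBO implementation.

```python
# Sketch of Mondrian partitioning (Alg. 3) over the box [0, R]^D.
import numpy as np

def mondrian_partitioning(X, R, Np, S, rng):
    # Start from the root box [0, R]^D containing all observations.
    D = X.shape[1]
    parts = [(np.zeros(D), np.full(D, R), np.arange(len(X)))]
    while len(parts) < Np:
        lengths = np.array([(h - l).sum() for l, h, _ in parts])
        weights = lengths * np.maximum(0, np.array([len(idx) for _, _, idx in parts]) - S)
        if weights.sum() == 0:                       # every partition is already small enough
            break
        j = rng.choice(len(parts), p=weights / weights.sum())
        l, h, idx = parts.pop(j)
        d = rng.choice(D, p=(h - l) / (h - l).sum()) # dimension proportional to side length
        u = rng.uniform(l[d], h[d])                  # cut location
        left_mask = X[idx, d] <= u
        hl, lr = h.copy(), l.copy()
        hl[d], lr[d] = u, u
        parts.append((l, hl, idx[left_mask]))
        parts.append((lr, h, idx[~left_mask]))
    return parts
```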
C Budget allocation and batched BO

In the EBO algorithm, we first use a batch of workers to learn the local GPs and recommend potentially good candidate points from the local information. Then we aggregate the information of all the workers, and use a filter to select the points to evaluate from the set of points recommended by all the workers, based on the aggregated information on the function.

There are two important details we did not have space to discuss in the main paper: (1) how many points to recommend from each local worker (budget allocation); and (2) how to select a batch of points from the Mondrian partition on each worker. Usually, in the beginning of the iterations, we do not have a lot of Mondrian partitions (since we stop splitting a partition once it reaches a minimum number of data points). Hence, it is very likely that the number of partitions J is smaller than the size of the batch, so we need to allocate the budget of recommendations from each worker properly and use batched BO within each Mondrian partition.

Budget allocation  In our current version of EBO, we did the budget allocation using a heuristic: we would like to generate at least 2B recommendations from all the workers, and each worker gets a budget proportional to a score, the sum of the Mondrian partition volume (the volume of the domain of the partition) and the best function value of the partition.

Batched BO  For batched BO, we also use a heuristic where the points achieving the top n acquisition function values are always included and the other ones come from random points selected in that partition. For the optimization of the acquisition function over each block of dimensions, we sample 1000 points in the low dimensional space associated with the additive component and minimize the acquisition function via L-BFGS-B starting from the point that gives the best acquisition value. We add the optimized arg min to the 1000 points, sort them according to their acquisition values, then select the top n along with random ones, and combine them with the sorted selections from the other additive components. Other batched BO methods can also be used and can potentially improve upon our results.
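A sketch of the budget-allocation heuristic described above is given below. Shifting the scores to be positive is an implementation detail not specified in the text, and the function name is illustrative.

```python
# Sketch of the budget-allocation heuristic: each worker's share of the (at least) 2B
# recommendations is proportional to (partition volume + best observed value).
import numpy as np

def allocate_budget(volumes, best_values, B):
    scores = np.asarray(volumes) + np.asarray(best_values)
    scores = scores - scores.min() + 1e-9          # keep the weights positive (an assumption)
    budgets = np.ceil(2 * B * scores / scores.sum()).astype(int)
    return np.maximum(budgets, 1)                  # every worker recommends at least one point
```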

δ ∼ G AMMA(2, λR). Then, it samples the offset from [0, δ] over-estimation of variance for each additive component if
and finally places the cuts. All of the above-mentioned three inferred independently from others. We conjecture that this
types of random features can work individually for each over-estimation could result in an invalid regret bound for
dimension and then combine the cuts from all dimensions. Add-GP-UCB (Kandasamy et al., 2015). Nevertheless, we
The Mondrian feature (Mondrian forest features to be ex- found that using the block coordinate optimization for the
act), contrast, partitions the space jointly for all dimensions. acquisition function on the full GP is actually very help-
More details of Mondrian features can be found in Laksh- ful. In Figure. 11, we compare the acquisition function
minarayanan et al. (2016); Balog et al. (2016). For all of we described in Section 3.3 (denoted as BlockOpt) with
these four types of random features and for eachPL layer of the Add-GP-UCB (Kandasamy et al., 2015), Add-MES-R and
total L layers, the kernel is κL (x, x0 ) = L1 l=1 χl (x, x0 ) Add-MES-G (Wang and Jegelka, 2017) on the same ex-
where periment described in the first experiment of Section 6.5
( of (Wang and Jegelka, 2017), averaging over 20 functions.
0 1 x and x0 are in the same cell on the layer l Notice that we used the maximum value of the function as
χl (x, x ) =
0 otherwise part of our acquisition function in our approach (BlockOpt).
(D.1) Add-GP-UCB, ADD-MES-R and ADD-MES-G cannot use
this max-value information even if they have access to it,
For the case where the kernel has M additive components, because then they don’t have a strategy to deal with “credit
we simply use the tiling for each decomposition and nor- assignment”, which assigns the maximum value to each
malize by LM instead L. More precisely, we have additive component. We found that BlockOpt is able to find
PM of PM
κL (x, x0 ) = LM
1
m=1 l=1 χl (x
Am
, x0Am ). a solution as well as or even better than the best of the three
competing approaches.
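As an illustration of Eq. (D.1), the following self-contained sketch (our own code, not the paper's) estimates the one-dimensional $L$-layer kernel by averaging the same-cell indicator over sampled Mondrian-grid layers; for $M$ additive components one would do the same per component and normalize by $LM$ instead of $L$.

```python
import numpy as np

def mondrian_grid_layer(lam, R, rng):
    # k ~ Poisson(lam*R) cut locations drawn uniformly on [0, R]
    return np.sort(rng.uniform(0.0, R, size=rng.poisson(lam * R)))

def same_cell(x, y, cuts):
    """chi_l(x, y): 1 if x and y fall between the same pair of neighboring cuts, else 0."""
    return float(np.searchsorted(cuts, x) == np.searchsorted(cuts, y))

def kappa_L(x, y, layers):
    """Eq. (D.1), one-dimensional case: average of the same-cell indicators over the layers."""
    return float(np.mean([same_cell(x, y, cuts) for cuts in layers]))

rng = np.random.default_rng(0)
lam, R, L = 5.0, 1.0, 2000
layers = [mondrian_grid_layer(lam, R, rng) for _ in range(L)]
print(kappa_L(0.3, 0.4, layers))   # empirical kernel value for one pair of inputs
```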
We next prove the lemma mentioned in Section 3.5.

Lemma 3.1. Let the random variable $k_{di} \sim \mathrm{Poisson}(\lambda_d R)$ be the number of cuts in the Mondrian grids of TileGP for dimension $d \in [D]$ and layer $i \in [L]$. The kernel of TileGP $\kappa_L$ satisfies
$$\lim_{L \to \infty} \kappa_L(x, x') = \frac{1}{M}\sum_{m=1}^{M} e^{-\lambda_d R\,|x^{A_m} - x'^{A_m}|},$$
where $\{A_m\}_{m=1}^{M}$ is the additive decomposition.
Proof. When constructing the Mondrian grid for each layer and each dimension, one can think of the process of getting another cut as a Poisson point process on the interval $[0, R]$, where the time between two consecutive cuts is modeled as an exponential random variable. Similar to Proposition 1 in Balog et al. (2016), we have
$$\lim_{L \to \infty} \kappa_L^{(m)}\big(x^{A_m}, x'^{A_m}\big) = \mathbb{E}\big[\text{no cut between } x_d \text{ and } x'_d,\ \forall d \in A_m\big] = e^{-\lambda_d R\,|x^{A_m} - x'^{A_m}|}.$$
By the additivity of the kernel, we have
$$\lim_{L \to \infty} \kappa_L(x, x') = \frac{1}{M}\sum_{m=1}^{M} e^{-\lambda_d R\,|x^{A_m} - x'^{A_m}|}. \qquad \blacksquare$$
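The limiting kernel derived above is easy to check numerically; the short sketch below (our own sanity check, not part of the paper) compares the empirical $L$-layer Mondrian-grid kernel in one dimension against the Laplace-type limit $e^{-\lambda R\,|x - x'|}$, with the domain scaled to $R = 1$.

```python
import numpy as np

def empirical_kernel(x, y, lam, R, L, rng):
    """Fraction of L Mondrian-grid layers in which x and y share a cell (one dimension)."""
    hits = 0
    for _ in range(L):
        cuts = np.sort(rng.uniform(0.0, R, size=rng.poisson(lam * R)))
        hits += int(np.searchsorted(cuts, x) == np.searchsorted(cuts, y))
    return hits / L

rng = np.random.default_rng(0)
lam, R, L = 4.0, 1.0, 20000
for x, y in [(0.10, 0.15), (0.20, 0.50), (0.05, 0.90)]:
    emp = empirical_kernel(x, y, lam, R, L, rng)
    lim = np.exp(-lam * R * abs(x - y))    # limiting Laplace-type kernel
    print(f"|x-y|={abs(x - y):.2f}  empirical={emp:.3f}  limit={lim:.3f}")
```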

Budget allocation In our current version of EBO, we did the budget allocation using a heuristic: we would like to generate at least $2B$ recommendations from all the workers, and each worker gets a budget proportional to a score, the sum of the Mondrian partition volume (the volume of the domain of the partition) and the best function value of the partition.
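A minimal sketch of this allocation heuristic (our own illustration; the rounding and the guard against non-positive scores are assumptions, not taken from the paper's code):

```python
import numpy as np

def allocate_budget(volumes, best_values, batch_size):
    """Split a total budget of at least 2 * batch_size recommendations across workers.

    Each worker's share is proportional to its score, the sum of the Mondrian
    partition volume and the best function value observed in that partition.
    """
    scores = np.asarray(volumes, dtype=float) + np.asarray(best_values, dtype=float)
    scores = np.clip(scores, 1e-12, None)                 # assumption: guard against non-positive scores
    shares = scores / scores.sum()
    return np.ceil(2 * batch_size * shares).astype(int)   # ensures at least 2B recommendations in total

# Example: three Mondrian partitions competing for a batch of size B = 100.
print(allocate_budget(volumes=[0.5, 0.3, 0.2], best_values=[1.2, 0.4, 2.0], batch_size=100))
```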
Batched BO For batched BO, we also use a heuristic where the points achieving the top $n$ acquisition function values are always included and the remaining ones come from random points selected in that partition. For the optimization of the acquisition function over each block of dimensions, we sample 1000 points in the low-dimensional space associated with the additive component and minimize the acquisition function via L-BFGS-B starting from the point that gives the best acquisition value. We add the optimized arg min to the 1000 points and sort them according to their acquisition values, then select the top $n$ together with some random ones, and combine them with the sorted selections from the other additive components. Other batched BO methods can also be used.
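The per-block acquisition optimization described above could be sketched roughly as follows. This is illustrative code with a generic `acquisition` callable and hypothetical names, not the authors' implementation; it returns the top-$n$ candidates of a single block, which would then be combined with the selections from the other additive components.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_block(acquisition, bounds, n_top, n_samples=1000, rng=None):
    """Sample random candidates, refine the best with L-BFGS-B, return the top n_top by value.

    The acquisition function is minimized here (lower is better).
    """
    rng = rng or np.random.default_rng()
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    cands = rng.uniform(lo, hi, size=(n_samples, len(bounds)))
    vals = np.array([acquisition(c) for c in cands])

    # Local refinement starting from the best random sample.
    res = minimize(acquisition, cands[np.argmin(vals)], method="L-BFGS-B", bounds=bounds)
    cands = np.vstack([cands, res.x])
    vals = np.append(vals, res.fun)

    order = np.argsort(vals)
    return cands[order[:n_top]]

def toy_acquisition(z):
    return np.sin(3 * z[0]) + (z[1] - 0.5) ** 2

print(optimize_block(toy_acquisition, bounds=[(0.0, 1.0), (0.0, 1.0)], n_top=5))
```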
E Experiments

Verifying the acquisition function As introduced in Section 3.3, we used a different acquisition function optimization technique from (Kandasamy et al., 2015; Wang and Jegelka, 2017). In (Kandasamy et al., 2015; Wang and Jegelka, 2017), the authors used the fact that each additive component is by itself a GP. Hence, they did posterior inference and Bayesian optimization on each additive component independently from the other additive components. In this work, we use the full GP with the additive kernel to derive its acquisition function and optimize it with a block coordinate optimization procedure, where the blocks are selected according to the decomposition of the input dimensions. One reason we did this instead of following (Kandasamy et al., 2015; Wang and Jegelka, 2017) is that we observed over-estimation of the variance for each additive component when it is inferred independently from the others. We conjecture that this over-estimation could result in an invalid regret bound for Add-GP-UCB (Kandasamy et al., 2015). Nevertheless, we found that using the block coordinate optimization for the acquisition function on the full GP is actually very helpful. In Figure 11, we compare the acquisition function we described in Section 3.3 (denoted as BlockOpt) with Add-GP-UCB (Kandasamy et al., 2015), Add-MES-R and Add-MES-G (Wang and Jegelka, 2017) on the same experiment described in the first experiment of Section 6.5 of (Wang and Jegelka, 2017), averaging over 20 functions. Notice that we used the maximum value of the function as part of our acquisition function in our approach (BlockOpt). Add-GP-UCB, Add-MES-R and Add-MES-G cannot use this max-value information even if they have access to it, because they do not have a strategy to deal with "credit assignment", which assigns the maximum value to each additive component. We found that BlockOpt is able to find a solution as good as or even better than the best of the three competing approaches.

Figure 11: Comparing different acquisition functions for BO with an additive GP. Our strategy, BlockOpt, achieves comparable or better results than other methods. (The panels plot $r_t$ against $t$ for $d = 10, 20, 30, 50$ and $100$.)

Scalability of EBO For EBO, the maximum number of Mondrian partitions is set to be 1000 and the minimum number of data points in each Mondrian partition is 100. The function that we used to test was generated from a fully partitioned 20-dimensional GP with an additive Laplace kernel ($|A_m| = 1, \forall m$).
Effectiveness of EBO In this experiment, we sampled 4 functions from a 50-dimensional GP with an additive kernel. Each component of the additive kernel is a Laplace kernel, whose lengthscale parameter is set to be 0.1 and variance scale to be 1, and each component has around 1 to 4 active dimensions. Namely, the kernel we used is $\kappa(x, x') = \sum_{m=1}^{M} \kappa^{(m)}(x^{A_m}, x'^{A_m})$, where $\kappa^{(m)}(x^{A_m}, x'^{A_m}) = e^{-|x^{A_m} - x'^{A_m}|/0.1}, \ \forall m$. The domain of the function is $[0, 1]^{50}$. We implemented BO-SVI and BO-Add-SVI using the same acquisition function and batch selection strategy as EBO, but with SVI-GP (Hensman et al., 2013) and SVI-GP with additive kernels, respectively, instead of TileGPs. We used the SVI-GP implemented in ? and defined the additive Laplace kernel according to the priors of the tested functions. For both BO-SVI and BO-Add-SVI, we used a batch size of 100, 200 inducing points, and the parameters were optimized for 100 iterations. For EBO, we set the minimum size of data points on each Mondrian partition to be 100. We set the maximum number of Mondrian partitions to be 1000 for both EBO and PBO. The evaluations of the test functions are negligible, so the timing results in Figure 5 reflect the actual runtime of each method.
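For concreteness, one way to generate such a test kernel is sketched below. This is our own code under stated assumptions: the decomposition sampler is hypothetical, and we take $|x^{A_m} - x'^{A_m}|$ to be the $\ell_1$ distance within each group of active dimensions.

```python
import numpy as np

def sample_decomposition(dim=50, min_size=1, max_size=4, rng=None):
    """Randomly split the input dimensions into disjoint groups of roughly 1 to 4 dimensions."""
    rng = rng or np.random.default_rng()
    dims = rng.permutation(dim)
    groups, i = [], 0
    while i < dim:
        size = int(rng.integers(min_size, max_size + 1))
        groups.append(dims[i:i + size])
        i += size
    return groups

def additive_laplace_kernel(x, y, groups, lengthscale=0.1, variance=1.0):
    """kappa(x, y) = sum_m variance * exp(-|x_{A_m} - y_{A_m}| / lengthscale), with an l1 distance."""
    x, y = np.asarray(x), np.asarray(y)
    return sum(variance * np.exp(-np.abs(x[g] - y[g]).sum() / lengthscale) for g in groups)

rng = np.random.default_rng(0)
groups = sample_decomposition(rng=rng)
x, y = rng.uniform(size=50), rng.uniform(size=50)
print(len(groups), additive_laplace_kernel(x, y, groups))
```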
Optimizing control parameters for robot pushing We implemented the simulation of pushing two objects with two robot hands in the Box2D physics engine ?. The 14 parameters specify the location and rotation of the robot hands, the pushing speed, the moving direction and the pushing time. The lower limit of these parameters is $[-5, -5, -10, -10, 2, 0, -5, -5, -10, -10, 2, 0, -5, -5]$ and the upper limit is $[5, 5, 10, 10, 30, 2\pi, 5, 5, 10, 10, 30, 2\pi, 5, 5]$. Let the initial positions of the objects be $s_{i0}, s_{i1}$ and the ending positions be $s_{e0}, s_{e1}$. We use $s_{g0}$ and $s_{g1}$ to denote the goal locations for the two objects. The reward is defined to be
$$r = \|s_{g0} - s_{i0}\| + \|s_{g1} - s_{i1}\| - \|s_{g0} - s_{e0}\| - \|s_{g1} - s_{e1}\|,$$
namely, the progress made towards pushing the objects to the goal.
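The reward itself is straightforward to compute; a small sketch (our own code, taking the 2-d positions of the two objects as arrays) is:

```python
import numpy as np

def push_reward(s_init, s_end, s_goal):
    """Progress of both objects towards their goals:
    r = sum_k ( ||s_goal_k - s_init_k|| - ||s_goal_k - s_end_k|| ).
    """
    s_init, s_end, s_goal = (np.asarray(a, dtype=float) for a in (s_init, s_end, s_goal))
    before = np.linalg.norm(s_goal - s_init, axis=1)   # distances to goals before pushing
    after = np.linalg.norm(s_goal - s_end, axis=1)     # distances to goals after pushing
    return float((before - after).sum())

# Example with two objects in the plane.
print(push_reward(s_init=[[0, 0], [1, 0]], s_end=[[1, 1], [2, 1]], s_goal=[[2, 2], [3, 2]]))
```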
We compare EBO, BO-SVI, BO-Add-SVI and CEM (Szita and Lörincz, 2006) with the same $10^4$ random observations and repeat each experiment 10 times. All the methods choose a batch of 100 parameters to evaluate at each iteration. CEM uses the top 30% of the $10^4$ initial observations to fit its initial Gaussian distribution. At the end of each iteration in CEM, the 30% of the new observations with the top values are used to fit the new distribution.
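For reference, the CEM baseline as described here can be sketched as follows. This is a simplified illustration with a diagonal Gaussian and our own function names; the implementation used in the experiments may differ.

```python
import numpy as np

def cem(objective, init_X, init_y, n_iters=200, batch_size=100, elite_frac=0.3, rng=None):
    """Cross-entropy method: repeatedly fit a Gaussian to the elite samples and resample."""
    rng = rng or np.random.default_rng()
    init_X, init_y = np.asarray(init_X), np.asarray(init_y)
    elite = init_X[np.argsort(init_y)[-int(elite_frac * len(init_y)):]]
    mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    best = init_y.max()
    for _ in range(n_iters):
        X = rng.normal(mu, sigma, size=(batch_size, len(mu)))
        y = np.array([objective(x) for x in X])
        elite = X[np.argsort(y)[-int(elite_frac * batch_size):]]   # top 30% of the new batch
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        best = max(best, y.max())
    return best

# Toy usage on a 14-dimensional quadratic objective with 10^4 random initial observations.
rng = np.random.default_rng(0)
X0 = rng.uniform(-1.0, 1.0, size=(10_000, 14))
y0 = -np.square(X0).sum(axis=1)
print(cem(lambda x: -np.square(x).sum(), X0, y0, n_iters=20, rng=rng))
```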
For all the BO-based methods, we use the maximum value of the reward function in the acquisition function. The standard deviation of the observation noise in the GP models is set to be 0.1. We set EBO to have Mondrian partitions with fewer than 150 data points and constrain EBO to have no more than 200 Mondrian partitions. In EBO, we set the hyperparameters $\alpha = 1.0$, $\beta = [5.0, 5.0]$, and the Mondrian observation offset $\epsilon = 0.05$. In BO-SVI, we used a batch size of 100 in SVI, 200 inducing points and 500 iterations to optimize the data likelihood, with a step rate of 0.1 and momentum of 0.9. BO-Add-SVI used the same parameters as BO-SVI, except that BO-Add-SVI uses 3 outer loops to randomly select the decomposition parameter $z$ and, in each loop, an inner loop of 50 iterations to maximize the data likelihood over the kernel parameters. The batch BO strategy used in BO-SVI and BO-Add-SVI is identical to the one used in each Mondrian partition of EBO. We run all the methods for 200 iterations, where each iteration has a batch size of 100; in total, each method obtains $2 \times 10^4$ data points in addition to the $10^4$ initializations.

Optimizing rover trajectories We illustrate the problem in Fig. 12 with an example trajectory found by EBO. We set the trajectory cost to be $-20.0$ for any collision, $\lambda$ to be $-10.0$ and the constant $b = 5.0$. This reward function is non-smooth, discontinuous, and concave over the first two and last two dimensions of the input; these 4 dimensions represent the start and goal positions of the trajectory. We maximize the reward function $f$ over the points on the trajectory.

Figure 12: An example trajectory found by EBO.

We again compare EBO with BO-SVI, BO-Add-SVI and CEM (Szita and Lörincz, 2006). All the methods choose a batch of 500 trajectories to evaluate. Each method is initialized with $10^4$ trajectories selected uniformly at random from $[0, 1]^{60}$, together with their reward function values. The initializations are the same for each method, and we repeat the experiments 5 times. CEM uses the top 30% of the $10^4$ initial observations to fit its initial Gaussian distribution. At the end of each iteration in CEM, the 30% of the new observations with the top values are used to fit the new distribution. For all the BO-based methods, we use the maximum value of the reward function, 5.0, in the acquisition function. The standard deviation of the observation noise in the GP models is set to be 0.01. We set EBO to attempt to have Mondrian partitions with fewer than 100 data points, with a hard constraint of no more than 1000 Mondrian partitions. In EBO, we set the hyperparameters $\alpha = 1.0$, $\beta = [2.0, 5.0]$, and the Mondrian observation offset $\epsilon = 0.01$. In BO-SVI, we used a batch size of 100 in SVI, 200 inducing points and 500 iterations to optimize the data likelihood, with a step rate of 0.1 and momentum of 0.9. BO-Add-SVI used the same parameters as BO-SVI, except that BO-Add-SVI uses 3 outer loops to randomly select the decomposition parameter $z$ and, in each loop, an inner loop of 50 iterations to maximize the data likelihood over the kernel parameters. The batch BO strategy used in BO-SVI and BO-Add-SVI is identical to the one used in each Mondrian partition of EBO.
F Discussion

F.1 Failure modes of EBO

EBO is a general framework for running large-scale batched BO in high-dimensional spaces. Admittedly, we made some compromises in our design and implementation to scale up BO to a degree that conventional BO approaches cannot deal with. In the following, we list some limitations and aspects of EBO that we can improve in future work.

• EBO partitions the space into smaller regions $\{[l_j, h_j]\}_{j=1}^{J}$ and only uses the observations within $[l_j - \epsilon, h_j + \epsilon]$ to do inference and Bayesian optimization. It is hard to determine the value of $\epsilon$. If $\epsilon$ is large, we may have high computational cost for the operations within each region. But if $\epsilon$ is very small, we found that some selected BO points lie on the boundaries of the regions, partially because of the large uncertainty on the boundaries. We used $\epsilon = 0$ in our experiments, but the results could be improved with a more appropriate $\epsilon$.

• Because of the additive structure, we need to optimize the acquisition function for each additive component. As a result, EBO has increased computational cost when there are more than 50 additive components, and it becomes harder for EBO to optimize functions with more than a few hundred dimensions. One solution is to combine the additive structure with a low-dimensional projection approach (Wang et al., 2016b). We can also simply run block coordinate descent on the acquisition function, but then it is harder to ensure that the acquisition function is fully optimized.

F.2 Importance of avoiding variance starvation

Neural networks have been applied in many applications and have achieved success in tasks including regression and classification. While researchers are still working on the theoretical understanding, one hypothesis is that neural networks "overfit" ?. Due to the similarity between the test and training sets in the reported experiments in, for example, the computer vision community, overfitting may seem to be less of a problem. However, in active learning (e.g., Bayesian optimization), we do not have a "test set". We require the model to generalize well across the search space, and using a classic neural network may be detrimental to the data selection process because of variance starvation (see Section 2). Gaussian processes, on the contrary, are good at estimating confidence bounds and avoiding overfitting. However, scaling Gaussian processes is hard in general. We would like to reinforce awareness of the importance of estimating the confidence of model predictions on new queries, i.e., of avoiding variance starvation.

F.3 Future directions

Possible future directions include analyzing theoretically what the best input space partition strategy and batch worker budget distribution strategy should be, better ways of predicting variance in a principled way (not necessarily with a GP), better ways of doing small-scale BO, and how to adapt them to large-scale BO. Moreover, add-GP is only one way of reducing the function space, and there could be other suitable ones too.