distribution over the (infinite) ensemble and, in each iteration, draw one member to perform search and estimation.

In particular, we use a new combination of three types of partition-based approximations: (1-2) For improved GP estimation, we propose a novel hierarchical additive GP model based on tile coding (a.k.a. random binning or Mondrian forest features). We learn a posterior distribution over kernel width and the additive structure; here, Gibbs sampling prevents overfitting. (3) To accelerate the sampler, which depends on the likelihood of the observations, we use an efficient, randomized block approximation of the Gram matrix based on a Mondrian process. Sampling and query selection can then be parallelized across blocks, further accelerating the algorithm.

As a whole, this combination of simple, tractable structure with ensemble learning and randomization improves efficiency, uncertainty estimates and optimization. Moreover, we show that our realization of these ideas offers an alternative explanation for global optimization heuristics that have been popular in other communities, indicating possible directions for further theoretical analysis. Our empirical results demonstrate that EBO can speed up posterior inference by 2-3 orders of magnitude (400 times in one experiment) compared to the state of the art, without sacrificing quality. Furthermore, we demonstrate the ability of EBO to handle sample-intensive, hard optimization problems by applying it to real-world problems with tens of thousands of observations.

Related Work  There has been a series of works addressing the three big challenges in BO: selecting batch evaluations (Contal et al., 2013; Desautels et al., 2014; González et al., 2016; Wang et al., 2017; Daxberger and Low, 2017), high-dimensional input spaces (Wang et al., 2016b; Djolonga et al., 2013; Li et al., 2016; Kandasamy et al., 2015; Wang et al., 2017; Wang and Jegelka, 2017), and scalability (Snoek et al., 2015; Lakshminarayanan et al., 2016; McIntire et al., 2016). Although these three problems tend to co-occur, this paper is, to the best of our knowledge, the first to address all three challenges jointly in one framework. Most closely related to parts of this paper is the work of Wang et al. (2017), but our algorithm significantly improves on that work in terms of scalability (see Sec. 4.1 for an empirical comparison) and has fundamental technical differences. First, the Gibbs sampler of Wang et al. (2017) only learns the additive structure but not the kernel parameters, while our sampler jointly learns both. Second, our proposed algorithm partitions the input space, via a Mondrian forest, for scalability and parallel inference. Third, as a result, our method automatically generates batch queries, while the other work needs an explicit batch strategy.

Other parts of our framework are inspired by the Mondrian forest (Lakshminarayanan et al., 2016), which partitions the input space via a Mondrian tree and aggregates trees into a forest. The closely related Mondrian kernels (Balog et al., 2016) use random features derived from Mondrian forests to construct a kernel; such a kernel, in fact, approximates a Laplace kernel. Mondrian forest features can be considered a special case of the popular tile coding features widely used in reinforcement learning (Sutton and Barto, 1998; Albus et al., 1975). Lakshminarayanan et al. (2016) showed that Mondrian forest kernels scale better than the regular GP and achieve good uncertainty estimates in many low-dimensional problems.

Besides Mondrian forests, there is a rich literature on sparse GP methods that address the scalability of GP regression (Seeger et al., 2003; Snelson and Ghahramani, 2006; Titsias, 2009; Hensman et al., 2013). However, these methods have mostly been shown to be useful only when the input dimension is low and the data contain redundant points, so that inducing points can be selected to emulate the original posterior GP well; such redundancy is usually absent in high-dimensional Bayesian optimization. Recent applications of sparse GPs in BO (McIntire et al., 2016) only consider experiments with fewer than 80 function evaluations and do not show results on large-scale observations. Another approach to large-scale GPs distributes the computation via local experts (Deisenroth and Ng, 2015). This, however, is not well suited to the acquisition function optimization needed in Bayesian optimization, since every valid prediction needs to synchronize the predictions from all the local experts. Our paper is also related to Gramacy and Lee (2008); while that work focuses on modeling non-stationary functions with treed partitions, ours integrates tree structures and Bayesian optimization in a novel way.

2 Background and Challenges

Consider a simple but high-dimensional search space $\mathcal{X} = [0, R]^D \subseteq \mathbb{R}^D$. We aim to find a maximizer $x^* \in \arg\max_{x \in \mathcal{X}} f(x)$ of a black-box function $f : \mathcal{X} \to \mathbb{R}$.

Gaussian processes.  Gaussian processes (GPs) are popular priors for modeling the function f in Bayesian optimization. They define distributions over functions, where any finite set of function values has a multivariate Gaussian distribution. A Gaussian process GP(µ, κ) is fully specified by a mean function µ(·) and a covariance (kernel) function κ(·, ·). Let f be a function sampled from GP(0, κ). Given observations $D_n = \{(x_t, y_t)\}_{t=1}^{n}$ where $y_t \sim \mathcal{N}(f(x_t), \sigma)$, we obtain the posterior mean and variance of the function as

$$\mu_n(x) = \kappa_n(x)^\top (K_n + \sigma^2 I)^{-1} y_n, \qquad (2.1)$$
$$\sigma_n^2(x) = \kappa(x, x) - \kappa_n(x)^\top (K_n + \sigma^2 I)^{-1} \kappa_n(x), \qquad (2.2)$$

via the kernel matrix $K_n = [\kappa(x_i, x_j)]_{x_i, x_j \in D_n}$ and $\kappa_n(x) = [\kappa(x_i, x)]_{x_i \in D_n}$ (Rasmussen and Williams, 2006).
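As a concrete reference for Eqs. (2.1)-(2.2), here is a minimal NumPy sketch of the GP posterior; the squared exponential kernel in the usage example is an illustrative assumption, not the kernel the paper ultimately relies on.

```python
import numpy as np

def gp_posterior(X, y, X_star, kernel, noise):
    """Posterior mean and variance of a GP(0, kernel) at X_star, Eqs. (2.1)-(2.2)."""
    K = kernel(X, X) + noise**2 * np.eye(len(X))        # K_n + sigma^2 I
    K_s = kernel(X_star, X)                             # kappa_n(x) for each test point
    alpha = np.linalg.solve(K, y)
    mu = K_s @ alpha                                    # Eq. (2.1)
    v = np.linalg.solve(K, K_s.T)
    var = kernel(X_star, X_star).diagonal() - np.einsum('ij,ji->i', K_s, v)  # Eq. (2.2)
    return mu, var

# Illustrative squared exponential kernel (an assumption for this example only).
def se_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

X = np.random.rand(50, 3)
y = np.sin(X).sum(axis=1)
mu, var = gp_posterior(X, y, np.random.rand(5, 3), se_kernel, noise=0.1)
```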
The log data likelihood for $D_n$ is given by

$$\log p(D_n) = -\frac{1}{2} y_n^\top (K_n + \sigma^2 I)^{-1} y_n - \frac{1}{2} \log |K_n + \sigma^2 I| - \frac{n}{2} \log 2\pi. \qquad (2.3)$$

While GPs provide flexible, broadly applicable function estimators, the $O(n^3)$ computation of the inverse $(K_n + \sigma^2 I)^{-1}$ and determinant $|K_n + \sigma^2 I|$ can become major bottlenecks as n grows, for both posterior function value predictions and data likelihood estimation.
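A minimal sketch of Eq. (2.3) using a Cholesky factorization; the Cholesky-based evaluation is a standard choice rather than anything prescribed by the paper, and its cubic cost is exactly the bottleneck discussed above.

```python
import numpy as np

def gp_log_likelihood(K, y, noise):
    """log p(D_n) from Eq. (2.3); the Cholesky factorization is the O(n^3) step."""
    n = len(y)
    L = np.linalg.cholesky(K + noise**2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + s^2 I)^{-1} y
    return (-0.5 * y @ alpha                               # -1/2 y^T (K + s^2 I)^{-1} y
            - np.log(np.diag(L)).sum()                     # -1/2 log |K + s^2 I|
            - 0.5 * n * np.log(2 * np.pi))
```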
Additive structure.  To reduce the complexity of the vanilla GP, we assume a latent decomposition of the input dimensions $[D] = \{1, \ldots, D\}$ into disjoint subspaces, namely, $\bigcup_{m=1}^{M} A_m = [D]$ and $A_i \cap A_j = \emptyset$ for all $i \neq j$, $i, j \in [M]$. As a result, the function f decomposes as $f(x) = \sum_{m \in [M]} f_m(x^{A_m})$ (Kandasamy et al., 2015). If each component $f_m$ is drawn independently from $GP(\mu^{(m)}, \kappa^{(m)})$ for all $m \in [M]$, the resulting f will also be a sample from a GP: $f \sim GP(\mu, \kappa)$, with $\mu(x) = \sum_{m \in [M]} \mu_m(x^{A_m})$ and $\kappa(x, x') = \sum_{m \in [M]} \kappa^{(m)}(x^{A_m}, x'^{A_m})$.
The additive structure reduces sample complexity and helps BO to search more efficiently and effectively, since the acquisition function can be optimized component-wise. But it remains challenging to learn a good decomposition structure $\{A_m\}$. Recently, Wang et al. (2017) proposed learning the decomposition via Gibbs sampling. This sampler takes hours for merely a few hundred points, because it needs a vast number of expensive data likelihood computations.

Moreover, learning hyperparameters for random features is expensive: for Fourier features, the computation of Eq. (2.6) means re-computing the features, plus $O(D_R^3)$ for the inverse and determinant. With Mondrian features (Lakshminarayanan et al., 2016), we can learn the kernel width efficiently by adding more Mondrian blocks, but this procedure is not well compatible with learning additive structure, since the whole structure of the sampled Mondrian features will change. In addition, we typically need a forest of trees for a good approximation.

Tile coding.  Tile coding (Sutton and Barto, 1998; Albus et al., 1975) is a k-hot encoding widely used in reinforcement learning as an efficient set of non-linear features. In its simplest form, tile coding is defined by k partitions, referred to as layers. An encoded data point becomes a binary vector with a non-zero entry for each bin containing the data point. There exist methods for sampling random partitions that allow one to approximate various kernels, such as the 'hat' kernel (Rahimi et al., 2007), making tile coding well suited for our purposes.

Variance starvation.  It is probably not surprising that using finite random features to learn the function distribution will result in a loss of accuracy (Forster, 2005). For example, we observed that, while the mean predictions are preserved reasonably well around regions where we have observations, both mean and confidence bound predictions can become very bad in regions where we do not have observations, once there are more observations than features. We refer to this underestimation of the variance scale compared to the mean scale, illustrated in Fig. 1, as variance starvation.
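To make the tile coding encoding concrete, here is a minimal per-dimension tiling sketch in the spirit of the description above: each layer places equal-width bins with a random offset, and a point is encoded by a one-hot bin indicator per layer and dimension. The unit domain, equal-width bins, and per-dimension treatment are simplifying assumptions rather than the exact construction used later by TileGP.

```python
import numpy as np

def sample_tile_layers(dim, n_layers, n_bins, low=0.0, high=1.0, rng=None):
    """One random offset per (layer, dimension); equal-width bins as in simple tile coding."""
    rng = rng or np.random.default_rng()
    width = (high - low) / n_bins
    return rng.uniform(0.0, width, size=(n_layers, dim))   # random offsets delta

def tile_encode(x, offsets, n_bins, low=0.0, high=1.0):
    """k-hot encoding: one active bin per layer and per dimension."""
    n_layers, dim = offsets.shape
    width = (high - low) / n_bins
    idx = np.floor((x - low + offsets) / width).astype(int)  # bin index per layer/dim
    idx = np.clip(idx, 0, n_bins)                            # the offset adds at most one extra bin
    features = np.zeros((n_layers, dim, n_bins + 1))
    for l in range(n_layers):
        for d in range(dim):
            features[l, d, idx[l, d]] = 1.0
    return features.ravel()

offsets = sample_tile_layers(dim=3, n_layers=4, n_bins=10)
phi = tile_encode(np.array([0.15, 0.5, 0.9]), offsets, n_bins=10)   # k-hot vector, k = 4 * 3
```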
Figure 1: We use 1000 Fourier features to approximate a 1D GP with a squared exponential kernel. The observations are samples from a function f (red line) drawn from the GP with zero mean in the range [−10, 0.5]. (a) Given 100 sampled observations (red circles), the Fourier features lead to reasonable confidence bounds. (b) Given 1000 sampled observations (red circles), the quality of the variance estimates degrades. (c) With additional samples (5000 observations), the problem is exacerbated: the scale of the variance predictions relative to the mean predictions becomes very small. (d) For comparison, the predictions of the original full GP conditioned on the same 5000 observations as in (c). Variance starvation becomes a serious problem for random features when the number of observations is close to or larger than the number of features.
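The effect in Fig. 1 is easy to reproduce with a short experiment. The sketch below contrasts the predictive standard deviation of Bayesian linear regression on m random Fourier features against an exact GP with the corresponding squared exponential kernel, far away from the data; the specific function, feature count, and sample count are arbitrary choices for illustration, and the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, noise = 100, 1000, 0.1

# Random Fourier features approximating a unit-lengthscale squared exponential kernel.
omega, b = rng.normal(size=m), rng.uniform(0, 2 * np.pi, m)
def rff(x):
    return np.sqrt(2.0 / m) * np.cos(np.outer(x, omega) + b)

f = lambda x: np.sin(2 * x) + 0.5 * np.cos(5 * x)        # arbitrary test function
x_train = rng.uniform(-10.0, 0.5, n)                     # observations only on [-10, 0.5]
y_train = f(x_train) + noise * rng.normal(size=n)
x_test = np.linspace(2.0, 5.0, 50)                       # region with no observations

# Predictive variance of Bayesian linear regression on the m features (the approximate GP).
Phi = rff(x_train)
A = Phi.T @ Phi + noise**2 * np.eye(m)
Phi_s = rff(x_test)
var_feat = noise**2 * np.einsum('ij,ji->i', Phi_s, np.linalg.solve(A, Phi_s.T))

# Predictive variance of the exact GP with the squared exponential kernel (prior variance 1).
k = lambda a, c: np.exp(-0.5 * (a[:, None] - c[None, :]) ** 2)
K = k(x_train, x_train) + noise**2 * np.eye(n)
Ks = k(x_test, x_train)
var_gp = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))

print("mean predictive std far from data: %.3f (features) vs %.3f (exact GP)"
      % (np.sqrt(var_feat).mean(), np.sqrt(var_gp).mean()))
```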
A key difference from Wang et al. (2017) is that TileGP constructs a hierarchical model for the random features (and hence the kernels), while Wang et al. (2017) do not consider the kernel parameters to be part of the generative model. The random features are based on tile coding or Mondrian grids, with the number of cuts generated by D Poisson processes on $[l_d^j, h_d^j]$ for each dimension $d = 1, \cdots, D$. On the i-th layer of the tilings, tile coding samples the offset δ from a uniform distribution $U[0, \frac{h_d^j - l_d^j}{k_{di}}]$ and places the cuts uniformly starting at $\delta + l_d^j$. The Mondrian grid samples $k_{di}$ cut locations uniformly at random from $[l_d^j, h_d^j]$. Because of the data partition, we always have more features than observations, which can alleviate the variance starvation problem described in Section 2.

We can use Gibbs sampling to efficiently learn the cut parameter k and the decomposition parameter z by marginalizing out λ and θ. Notice that both k and z take discrete values; hence, unlike other continuous GP parameterizations, we only need to sample discrete variables for Gibbs sampling.

Algorithm 2 Generative model for TileGP
1: Draw mixing proportions θ ∼ DIR(α)
2: for d = 1, · · · , D do
3:   Draw additive decomposition z_d ∼ MULTI(θ)
4:   Draw Poisson rate parameter λ_d ∼ GAMMA(β_0, β_1)
5:   for i = 1, · · · , L do
6:     Draw number of cuts k_di ∼ POISSON(λ_d (h_d^j − l_d^j))
7:     Draw offset δ ∼ U[0, (h_d^j − l_d^j)/k_di]   (Tile Coding)
       or draw cut locations b ∼ U[l_d^j, h_d^j]   (Mondrian Grids)
8:   end for
9: end for
10: Construct the feature projection φ and the kernel κ = φ^T φ from z and the sampled tiles
11: Draw function f ∼ GP(0, κ)
12: Given input x, draw function value y ∼ N(f(x), σ)
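A minimal sketch of the structure-sampling part of Algorithm 2 (steps 1-9) with the Mondrian-grid option in step 7; the symmetric Dirichlet, the rate parameterization of the Gamma prior, and a single partition of width R are assumptions made for brevity.

```python
import numpy as np

def sample_tilegp_structure(D, L, M, alpha, beta0, beta1, R=1.0, rng=None):
    """Sample the decomposition z, Poisson rates lambda_d, and per-layer cut locations
    on [0, R], following steps 1-9 of Algorithm 2 (Mondrian-grid option in step 7)."""
    rng = rng or np.random.default_rng()
    theta = rng.dirichlet(alpha * np.ones(M))            # step 1: theta ~ Dir(alpha), symmetric alpha assumed
    z = rng.choice(M, size=D, p=theta)                   # step 3: z_d ~ Multi(theta)
    lam = rng.gamma(beta0, 1.0 / beta1, size=D)          # step 4: lambda_d ~ Gamma(beta0, beta1), rate form assumed
    cuts = [[np.sort(rng.uniform(0.0, R, rng.poisson(lam[d] * R)))   # steps 6-7: k_di cuts, uniform locations
             for _ in range(L)]
            for d in range(D)]
    return z, lam, cuts

z, lam, cuts = sample_tilegp_structure(D=5, L=4, M=2, alpha=1.0, beta0=2.0, beta1=1.0)
```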
Given the observations $D_{t-1}$ in the j-th hyperrectangle partition, the posterior distribution of the (local) parameters λ, k, z, θ is

$$p(\lambda, k, z, \theta \mid D_{t-1}; \alpha, \beta) \propto p(D_{t-1} \mid z, k)\, p(z \mid \theta)\, p(k \mid \lambda)\, p(\theta; \alpha)\, p(\lambda; \beta).$$

Marginalizing over the Poisson rate parameter λ and the mixing proportion θ gives

$$p(k, z \mid D_{t-1}; \alpha, \beta) \propto p(D_{t-1} \mid z, k) \int p(z \mid \theta)\, p(\theta; \alpha)\, d\theta \int p(k \mid \lambda)\, p(\lambda; \beta)\, d\lambda$$
$$\propto p(D_{t-1} \mid z, k) \prod_m \frac{\Gamma(|A_m| + \alpha_m)}{\Gamma(\alpha_m)} \times \prod_d \frac{\Gamma(\beta_1 + |k_d|)}{\big(\prod_{i=1}^{L} k_{di}!\big)\, (\beta_0 + L)^{\beta_1 + |k_d|}},$$

where $|k_d| = \sum_{i=1}^{L} k_{di}$. Hence, we only need to sample k and z when learning the hyperparameters of the TileGP kernel. For each dimension d, we sample the group assignment $z_d$ according to

$$p(z_d = m \mid D_{t-1}, k, z_{\neg d}; \alpha) \propto p(D_{t-1} \mid z, k)\, p(z_d \mid z_{\neg d}) \propto p(D_{t-1} \mid z, k)\, (|A_m| + \alpha_m). \qquad (3.1)$$

We sample the number of cuts $k_{di}$ for each dimension d and each layer i from the posterior

$$p(k_{di} \mid D_{t-1}, k_{\neg di}, z; \beta) \propto p(D_{t-1} \mid z, k)\, p(k_{di} \mid k_{\neg di}) \propto \frac{p(D_{t-1} \mid z, k)\, \Gamma(\beta_1 + |k_d|)}{(\beta_0 + L)^{k_{di}}\, k_{di}!}. \qquad (3.2)$$
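A sketch of one collapsed Gibbs update for the group assignment $z_d$ following Eq. (3.1). The data_loglik callback, which should return $\log p(D_{t-1} \mid z, k)$ under the TileGP features, is assumed to be supplied, and $|A_m|$ is counted over the other dimensions, the usual convention for collapsed Dirichlet updates; the cut counts $k_{di}$ are resampled analogously from Eq. (3.2).

```python
import numpy as np

def gibbs_update_zd(d, z, k, data_loglik, M, alpha, rng):
    """Resample z_d from p(z_d = m | ...) proportional to p(D | z, k) * (|A_m| + alpha), Eq. (3.1).
    A symmetric Dirichlet parameter alpha is assumed."""
    log_post = np.empty(M)
    for m in range(M):
        z_prop = z.copy()
        z_prop[d] = m
        A_m = int(np.sum(np.delete(z, d) == m))        # |A_m| counted over the other dimensions
        log_post[m] = data_loglik(z_prop, k) + np.log(A_m + alpha)
    p = np.exp(log_post - log_post.max())              # normalize in log space for stability
    z[d] = rng.choice(M, p=p / p.sum())
    return z
```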
If distributed computing is available, each hyperrectangle partition of the input space is assigned a worker to manage all the computations within this partition. On each worker, we use the above Gibbs sampling method to learn the additive structure and kernel bandwidth jointly. Conditioned on the observations associated with the partition on the worker, we use the learned posterior TileGP to select the most promising input point in this partition, and eventually send this candidate input point back to the main process together with the learned decomposition parameter z and the cut parameter k. In the next section, we introduce the acquisition function used in each worker and how to filter the recommended candidates from all the partitions.

3.3 Acquisition functions and filtering

In this paper, we mainly focus on parameter search problems where the objective function is designed by an expert and the global optimum or an upper bound on the function is known. While any BO acquisition function can be used within the EBO framework, we use an acquisition function from Wang and Jegelka (2017) to exploit the knowledge of the upper bound. Let $f^*$ be such an upper bound, i.e., $\forall x \in \mathcal{X},\ f^* \geq f(x)$. Given the observations $D_{t-1}^j$ associated with the j-th partition of the input space, we minimize the acquisition function $\eta_{t-1}^j(x) = \frac{f^* - \mu_{t-1}^j(x)}{\sigma_{t-1}^j(x)}$. Since the kernel is additive, we can optimize $\eta_{t-1}^j(\cdot)$ separately for each additive component: for the m-th component of the additive structure, we optimize $\eta_{t-1}^j(\cdot)$ only over the active dimensions $A_m$. This resembles block coordinate descent and greatly facilitates the optimization of the acquisition function.

Filtering.  Once we have a proposed query point from each partition, we select B of them according to the scoring function $\xi(X) = \log\det K_X - \sum_{b=1}^{B} \eta(x_b)$, where $X = \{x_b\}_{b=1}^{B}$. We use the log-determinant term to encourage diversity and η to maintain quality. We maximize this function greedily. In some cases, the number of partitions J can be smaller than the batch size B; in this case, one may either use just J candidates or use batch BO on each partition. We use the latter, and discuss details in the appendix.
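A minimal sketch of the per-partition acquisition and the greedy filtering step. Here mu, sigma, and kernel stand for the learned posterior TileGP quantities of a partition and are assumed to be given, and the small jitter added to $K_X$ is a numerical convenience not mentioned in the text.

```python
import numpy as np

def eta(x, mu, sigma, f_star):
    # Acquisition of Section 3.3: smaller is better, so each worker minimizes eta over its partition.
    return (f_star - mu(x)) / sigma(x)

def greedy_filter(candidates, eta_values, kernel, B):
    """Greedily build X maximizing xi(X) = logdet K_X - sum_b eta(x_b) over per-partition proposals."""
    chosen = []
    for _ in range(min(B, len(candidates))):
        best_i, best_score = None, -np.inf
        for i in range(len(candidates)):
            if i in chosen:
                continue
            X = np.array([candidates[j] for j in chosen + [i]])
            K = kernel(X, X) + 1e-6 * np.eye(len(X))          # jitter for numerical stability
            score = np.linalg.slogdet(K)[1] - sum(eta_values[j] for j in chosen + [i])
            if score > best_score:
                best_i, best_score = i, score
        chosen.append(best_i)
    return [candidates[i] for i in chosen]
```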
Figure 3: Posterior mean function (a, c) and GP-UCB acquisition function (b, d) for an additive GP in 2D. The maxima
of the posterior mean and acquisition function are at the points resulting from an exchange of coordinates between “good”
observed points (-1,0) and (2,2).
3.4 Efficient data likelihood computation and parameter synchronization

For the random features, we use tile coding due to its sparsity and efficiency.
3.5 Relations to Mondrian kernels, random binning and additive Laplace kernels

Our model described in Section 3.2 can use tile coding and Mondrian grids to construct the kernel. Tile coding and Mondrian grids are also closely related to Mondrian features and random binning: all four kinds of random features attempt to find a sparse random feature representation for the raw input x based on a partition of the space with the help of layers. We illustrate the differences between one layer of the features constructed by tile coding, Mondrian grids, Mondrian features and random binning in the appendix. Mondrian grids, Mondrian features and random binning all converge to the Laplace kernel as the number of layers L goes to infinity. The tile coding kernel, however, does not approximate a Laplace kernel. Our model with Mondrian grids approximates an additive Laplace kernel:

Lemma 3.1. Let the random variable $k_{di} \sim \text{POISSON}(\lambda_d R)$ be the number of cuts in the Mondrian grids of TileGP for dimension $d \in [D]$ and layer $i \in [L]$. The TileGP kernel $\kappa_L$ satisfies $\lim_{L \to \infty} \kappa_L(x, x') = \frac{1}{M} \sum_{m=1}^{M} e^{-\lambda_d R |x^{A_m} - x'^{A_m}|}$, where $\{A_m\}_{m=1}^{M}$ is the additive decomposition.
3.6 Connections to evolutionary algorithms

Next, we make some observations that connect our randomized ensemble BO to ideas for global optimization heuristics that have successfully been used in other communities. In particular, these connections offer an explanation from a BO perspective and may aid further theoretical analysis.

Evolutionary algorithms (Back, 1996) maintain an ensemble of "good" candidate solutions (called chromosomes) and, from those, generate new query points via a number of operations. These methods too, implicitly, need to balance exploration with local search in areas known to have high function values. Hence, there are local operations (mutations) for generating new points, such as random perturbations or local descent methods, and global operations. While it is relatively straightforward to draw connections between those local operations and optimization methods used in machine learning, we here focus on global exploration.
4 Experiments

4.1 Scalability of EBO

We denote by $\tilde r_t$ the immediate regret obtained by the batch at iteration t, and by $r_T = \min_{t \leq T} \tilde r_t$ the regret, which captures the minimum gap between the best point found and the global optimum of the black-box function f.

We compare BO using SVI (Hensman et al., 2013) (BO-SVI), BO using SVI with an additive GP (BO-Add-SVI) and a distributed version of BO with a fixed partition (PBO) against EBO with a randomly sampled partition in each iteration. PBO has the same 1000 Mondrian partitions in all iterations, while EBO can have at most 1000 Mondrian partitions. BO-SVI uses a Laplace isotropic kernel without any additive structure, while BO-Add-SVI, PBO and EBO all use the known prior. More detailed experimental settings can be found in the appendix. Our experimental results in Fig. 5 show that EBO is able to find a good point much faster than BO-SVI and BO-Add-SVI; and randomization and the ensemble of partitions matter: EBO is much better than PBO.

Figure 5: Regret vs. time (minutes).

Optimizing control parameters for robot pushing  We follow Wang et al. (2017) and test our approach, EBO, on a control parameter tuning task for robot pushing. We compare EBO, BO-SVI, BO-Add-SVI and CEM (Szita and Lörincz, 2006) with the same $10^4$ random observations and repeat each experiment 10 times. We run all the methods for 200 iterations, where each iteration has a batch size of 100. We plot the median of the best rewards achieved by CEM and EBO at each iteration in Fig. 6. More details on the experimental setups and the reward function can be found in the appendix. Overall, CEM and EBO performed comparably and much better than the sparse GP methods (BO-SVI and BO-Add-SVI). We noticed that, among all the experiments, CEM achieved a maximum reward of 10.19 while EBO achieved 9.50. However, EBO behaved slightly better and was more stable than CEM, as reflected by the standard deviation of the rewards.

We describe a problem instance by defining a start position s and a goal position g as well as a cost function over the state space. Trajectories are described by a set of points on which a BSpline is to be fitted. By integrating the cost function over a given trajectory, we can compute the trajectory cost c(x) of a given trajectory solution $x \in [0, 1]^{60}$. We define the reward of this problem to be $f(x) = c(x) + \lambda(\|x_{0,1} - s\|_1 + \|x_{59,60} - g\|_1) + b$. This reward function is non-smooth, discontinuous, and concave over the first two and last two dimensions of the input; these 4 dimensions represent the start and goal positions of the trajectory. The results in Fig. 7 show that CEM was able to achieve better results than the BO methods on these functions, while EBO was still much better than the BO alternatives using SVI. More details can be found in the appendix.

Figure 7: Comparing BO-SVI, BO-Add-SVI, CEM and EBO on a 60-dimensional trajectory optimization task.
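The trajectory reward above is straightforward to wrap as a black-box objective; a sketch under the stated definition, where the trajectory cost c(·), the start and goal positions s and g, and the constants λ and b are problem-specific inputs assumed to be supplied by the benchmark.

```python
import numpy as np

def trajectory_reward(x, cost, s, g, lam, b):
    """f(x) = c(x) + lam * (||x_first2 - s||_1 + ||x_last2 - g||_1) + b for x in [0, 1]^60."""
    x = np.asarray(x, dtype=float)
    start_term = np.abs(x[:2] - s).sum()    # first two coordinates vs. start position
    goal_term = np.abs(x[-2:] - g).sum()    # last two coordinates vs. goal position
    return cost(x) + lam * (start_term + goal_term) + b
```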
5 Conclusion

Many black-box function optimization problems are intrinsically high-dimensional and may require a huge number of observations in order to be optimized well.
Figure 9: An example of 10 iterations of EBO on a 2D toy example plotted in Fig. 8. The selections in each iteration are
blue and the existing observations orange. EBO quickly locates the region of the global optimum while still allocating
budget to explore regions that appear promising (e.g. around the local optimum (1.0, 0.4)).
Two practical issues were not discussed in the main paper: (1) how many points to recommend from each local worker (budget allocation); and (2) how to select a batch of points from the Mondrian partition on each worker. Usually, in the beginning of the iterations, we do not have many Mondrian partitions (since we stop splitting a partition once it reaches a minimum number of data points). Hence, it is very likely that the number of partitions J is smaller than the size of the batch, and we need to allocate the budget of recommendations from each worker properly and use batched BO for each Mondrian partition.

Budget allocation  In our current version of EBO, we did the budget allocation using a heuristic: we would like to generate at least 2B recommendations from all the workers, and each worker gets a budget proportional to a score, the sum of the Mondrian partition volume (the volume of the domain of the partition) and the best function value of the partition.

Batched BO  For batched BO, we also use a heuristic, where the points achieving the top n acquisition function values are always included and the other ones come from random points selected in that partition. For the optimization of the acquisition function over each block of dimensions, we sample 1000 points in the low-dimensional space associated with the additive component and minimize the acquisition function via L-BFGS-B starting from the point that gives the best acquisition value. We add the optimized arg min to the 1000 points and sort them according to their acquisition values, then select the top n random ones and combine them with the sorted selections from the other additive components. Other batched BO methods can also be used and can potentially improve upon our results.
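A sketch of the budget allocation heuristic described above: each worker's score is the sum of its partition volume and its best observed function value, and the (at least 2B) recommendations are split proportionally to these scores. The shift that keeps scores positive is an added assumption, since the text does not say how negative scores are handled.

```python
import numpy as np

def allocate_budget(volumes, best_values, total_budget):
    """Split total_budget recommendations across workers proportionally to
    score = partition volume + best observed function value."""
    scores = np.asarray(volumes, dtype=float) + np.asarray(best_values, dtype=float)
    scores = scores - scores.min() + 1e-12                 # assumption: shift so all scores are positive
    shares = scores / scores.sum()
    budgets = np.floor(shares * total_budget).astype(int)
    budgets[np.argmax(shares)] += total_budget - budgets.sum()  # give the remainder to the top worker
    return budgets
```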
D Relations to Mondrian Kernels and Random Binning

TileGP can use Mondrian grids or (our version of) tile coding to achieve efficient parameter inference for the decomposition z and the number of cuts k (the inverse of the kernel bandwidth). Mondrian grids and tile coding are closely related to Mondrian kernels and random binning, but there are some subtle differences. We illustrate the differences between one layer of the features constructed by tile coding, Mondrian grid, Mondrian feature and random binning in Fig. 10.

Figure 10: Illustrations of (our version of) tile coding, Mondrian grid, random binning and Mondrian feature.

For each layer of (our version of) tile coding, we sample a positive integer k (the number of cuts) from a Poisson distribution parameterized by λR, and then set the offset to be a constant uniformly randomly sampled from [0, R/k]. For each layer of the Mondrian grid, the number of cuts k is sampled as in tile coding, but instead of using an offset and uniform cuts, we put the cuts at locations drawn independently and uniformly at random from [0, R]. Random binning does not sample k cuts but samples the distance δ between neighboring cuts by drawing
δ ∼ GAMMA(2, λR). Then, it samples the offset from [0, δ] and finally places the cuts. All of the above-mentioned three types of random features can work individually for each dimension and then combine the cuts from all dimensions. The Mondrian feature (Mondrian forest features, to be exact), in contrast, partitions the space jointly for all dimensions. More details on Mondrian features can be found in Lakshminarayanan et al. (2016); Balog et al. (2016). For all of these four types of random features and for each layer of the total L layers, the kernel is $\kappa_L(x, x') = \frac{1}{L} \sum_{l=1}^{L} \chi_l(x, x')$ where

$$\chi_l(x, x') = \begin{cases} 1 & \text{if } x \text{ and } x' \text{ are in the same cell on layer } l \\ 0 & \text{otherwise} \end{cases} \qquad (D.1)$$

For the case where the kernel has M additive components, we simply use the tiling for each decomposition and normalize by LM instead of L. More precisely, we have $\kappa_L(x, x') = \frac{1}{LM} \sum_{m=1}^{M} \sum_{l=1}^{L} \chi_l(x^{A_m}, x'^{A_m})$.

We next prove the lemma mentioned in Section 3.5.

Lemma 3.1. Let the random variable $k_{di} \sim \text{POISSON}(\lambda_d R)$ be the number of cuts in the Mondrian grids of TileGP for dimension $d \in [D]$ and layer $i \in [L]$. The kernel of TileGP $\kappa_L$ satisfies $\lim_{L \to \infty} \kappa_L(x, x') = \frac{1}{M} \sum_{m=1}^{M} e^{-\lambda_d R |x^{A_m} - x'^{A_m}|}$, where $\{A_m\}_{m=1}^{M}$ is the additive decomposition.

Proof. When constructing the Mondrian grid for each layer and each dimension, one can think of the process of getting the cuts as in Balog et al. (2016); we have $\lim_{L \to \infty} \kappa_L^{(m)}(x^{A_m}, x'^{A_m}) = \mathbb{E}[\text{no cut between } x_d \text{ and } x'_d,\ \forall d \in A_m] = e^{-\lambda_d R |x^{A_m} - x'^{A_m}|}$. By the additivity of the kernel, we have $\lim_{L \to \infty} \kappa_L(x, x') = \frac{1}{M} \sum_{m=1}^{M} e^{-\lambda_d R |x^{A_m} - x'^{A_m}|}$.

There can be an over-estimation of variance for each additive component if it is inferred independently from the others. We conjecture that this over-estimation could result in an invalid regret bound for Add-GP-UCB (Kandasamy et al., 2015). Nevertheless, we found that using the block coordinate optimization for the acquisition function on the full GP is actually very helpful. In Figure 11, we compare the acquisition function we described in Section 3.3 (denoted as BlockOpt) with Add-GP-UCB (Kandasamy et al., 2015), Add-MES-R and Add-MES-G (Wang and Jegelka, 2017) on the same experiment described in the first experiment of Section 6.5 of Wang and Jegelka (2017), averaging over 20 functions. Notice that we used the maximum value of the function as part of our acquisition function in our approach (BlockOpt). Add-GP-UCB, Add-MES-R and Add-MES-G cannot use this max-value information even if they have access to it, because they do not have a strategy to deal with "credit assignment", which assigns the maximum value to each additive component. We found that BlockOpt is able to find a solution as good as or even better than the best of the three competing approaches.

Figure 11: Comparing different acquisition functions for BO with an additive GP. Our strategy, BlockOpt, achieves comparable or better results than other methods.
The SVI-based methods (BO-SVI and BO-Add-SVI) used a batch size of 100 and 200 inducing points, and the parameters were optimized for 100 iterations. For EBO, we set the minimum size of data points on each Mondrian partition to be 100. The maximum number of Mondrian partitions was 1000 for both EBO and PBO. The evaluations of the test functions are negligible, so the timing results in Figure 5 reflect the actual runtime of each method. For BO-Add-SVI, an outer loop samples the decomposition parameter z and, in each loop, it uses an inner loop of 50 iterations to maximize the data likelihood over the kernel parameters. The batch BO strategy used in BO-SVI and BO-Add-SVI is identical to the one used in each Mondrian partition of EBO.

Gaussian processes are well suited to estimating confidence bounds and avoiding overfitting. However, the scaling of Gaussian processes is hard in general. We would like to reinforce awareness of the importance of estimating the confidence of the model predictions on new queries, i.e., avoiding variance starvation.