An Empirical Study of Bayesian Optimization: Acquisition Versus Partition
Abstract
Bayesian optimization (BO) is a popular framework for black-box optimization. Two classes of BO approaches have shown promising empirical performance while providing strong theoretical guarantees. The first class optimizes an acquisition function to select points, which is typically computationally expensive and can only be done approximately. The second class of algorithms uses systematic space partitioning, which is much cheaper computationally but whose selection is typically less informed. This points to a potential trade-off between the computational complexity and empirical performance of these algorithms. The current literature, however, only provides a sparse sampling of empirical comparison points, giving little insight into this trade-off. The primary contribution of this work is to conduct a comprehensive, repeatable evaluation within a common software framework, which we provide as an open-source package. Our results give strong evidence about the relative performance of these methods and reveal a consistent top performer, even when accounting for overall computation time.
Keywords: empirical evaluation, global optimization, black-box optimization, Bayesian optimization
1. Introduction
We consider the problem of optimizing an unknown function f by selecting experiments
that each specify an input x and return a response f (x). Given an experimental budget,
the goal is to select a sequence of experiments in order to find an input that approximately
maximizes f . An effective approach to this problem is Bayesian optimization (BO) (Brochu
et al., 2010), which assumes a Bayesian prior in order to quantify the uncertainty over f
via posterior inference. This posterior can then be used to bias the experiment selection in
a variety of ways.
Perhaps the most traditional and widely used BO approach is acquisition-based BO
(ABO) (Kushner, 1964; Jones, 2001). The key idea is to define an acquisition function
in terms of the posterior, which is then optimized at each iteration to select the next
experiment. Two commonly used acquisition functions are Expected Improvement (EI)
(Mockus, 1994) and Upper Confidence Bound (UCB) (Srinivas et al., 2010), which have
both been shown to be practically effective. In addition, UCB has been shown to have
probabilistic guarantees on performance under the assumption that the acquisition function
can be perfectly optimized at each iteration. However, both UCB and EI are non-convex,
making optimization costly and inexact for higher dimensional functions. Thus, in practice,
the theoretical results for ABO do not hold and the selection of each experiment can be
computationally expensive.
A recent alternative approach to ABO completely avoids the optimization of acquisition functions. These approaches are inspired by the simultaneous optimistic optimization (SOO) algorithm (Munos et al., 2014), a non-Bayesian approach that intelligently partitions the space based on observed experiments to effectively balance exploration and exploitation of the objective. We will refer to SOO and algorithms derived from it as partition-based global optimization (PGO) algorithms. A key feature of SOO is that it provides finite-time performance guarantees with minimal assumptions about the objective function's properties. SOO does not, however, exploit the potential benefits of posterior inference. This has led to variants of SOO, such as BaMSOO (Wang et al., 2014) and IMGPO (Kawaguchi et al., 2015), that integrate posterior inference to better direct SOO's exploration. We will refer to these alternative approaches that integrate posterior inference with PGO approaches as partitioning-based BO (PBO) algorithms. Importantly, the PBO algorithms are able to select each experiment using significantly less computation than ABO. At the same time, they maintain probabilistic variants of SOO's performance guarantees.
ABO uses more computation per experiment selection than the partition-based approaches, so one might expect ABO to make higher quality decisions and outperform partition-based methods given the same number of experiments. Thus, ABO may have an advantage for applications where the number of experiments is fixed and not limited by the runtime of experiment selection. For example, when experiments involve lengthy wet lab trials or expensive simulations, the computation time required for selecting experiments may be negligible. Alternatively, there are applications where the expense of ABO can limit the number of experiments compared to partition-based methods. For example, if experiments correspond to running fast physics simulations, then the higher computational cost of ABO may be a bottleneck, leading to fewer experiments within a fixed time horizon and potentially worse performance than partitioning methods.
The above intuitions suggest a potential trade-off between acquisition-based and partition-based methods; however, the literature provides little guidance regarding this trade-off. Most comparisons between ABO, PBO, and PGO have been either indirect or on a sparse set of problems. For example, the PBO algorithm BaMSOO was shown (Wang et al., 2014) to outperform both SOO and selected ABO algorithms. However, this was on a modest number of problems and with Bayesian hyper-parameters that were selected on a per-problem basis, which is not always a realistic real-world use scenario. Later, a non-Bayesian PGO algorithm, LOGO (Kawaguchi et al., 2016), was shown to outperform SOO and BaMSOO, which, combined with the prior BaMSOO results, was taken as evidence for preferring LOGO over ABO approaches. More recent work on IMGPO (Kawaguchi et al., 2015) shows further benefits of PBO over ABO on a small set of problems. However, the ABO implementations used in those evaluations appear to yield performance inferior to what others have reported. If taken at face value, these prior results suggest that PBO approaches dominate ABO
approaches despite their much lower computational cost. Yet most applications of BO are
still using ABO approaches.
The primary contribution of this paper is to conduct a more thorough evaluation of these approaches with the aim of understanding when and if one should be preferred. We conduct this investigation using a common software framework, which will be publicly available and allows for complete reproducibility. Importantly, we do not introduce new algorithms or variants of existing algorithms, in order to avoid the potential appearance of bias in our evaluation. Our results yield fairly consistent observations across a variety of test problems, shedding light on the relative performance of some key representative acquisition-based and partition-based algorithms.
3. Description of Algorithms
This section describes the algorithms that are included in our empirical study, which cov-
ers three algorithmic classes. First, we describe two widely used algorithms from the class
of acquisition-based BO (ABO) algorithms. Second, we describe two partitioning-based
global optimization (PGO) algorithms, which do not use a Bayesian prior, but have strong
theoretical guarantees on regret. Third, we describe two partitioning-based Bayesian opti-
mization (PBO) algorithms, which incorporate a Bayesian prior into the PGO approaches.
A comparison of the most relevant properties of all the algorithms is provided in Table 1.
3.1 Acquisition-Based BO
Assume we have the observations $D_t = \{x_{1:t}, y_{1:t}\}$ and we are interested in the distribution of the output $y_*$ of a test point (another experiment) $x_*$. Since we have a GP with prior mean zero, the joint distribution of $y_{1:t}$ and $y_*$ can be written as:

$$\begin{bmatrix} y_{1:t} \\ y_* \end{bmatrix} \sim \mathcal{N}\left(0,\; \begin{bmatrix} K(x_{1:t}, x_{1:t}) & K(x_{1:t}, x_*) \\ K(x_*, x_{1:t}) & K(x_*, x_*) \end{bmatrix}\right)$$

where $K(\cdot, \cdot)$ is the corresponding covariance matrix obtained by applying the covariance function element-wise. Having the joint distribution, the conditional distribution of $y_*$ given the observed data can be derived as:

$$y_* \mid D_t \sim \mathcal{N}\big(\mu(x_*), \sigma^2(x_*)\big), \quad \mu(x_*) = K(x_*, x_{1:t})\, K(x_{1:t}, x_{1:t})^{-1}\, y_{1:t},$$
$$\sigma^2(x_*) = K(x_*, x_*) - K(x_*, x_{1:t})\, K(x_{1:t}, x_{1:t})^{-1}\, K(x_{1:t}, x_*).$$
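To make this concrete, here is a minimal NumPy sketch of the posterior computation, assuming a squared-exponential covariance function and a small noise term for numerical stability (both are illustrative assumptions, not the specific kernel configuration used in our experiments):

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential covariance between point sets A (n x d) and B (m x d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, X_star, noise=1e-6):
    # Posterior mean and variance at X_star given observations (X, y),
    # following the conditional-Gaussian formulas above.
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))
    K_s = sq_exp_kernel(X, X_star)
    K_ss = sq_exp_kernel(X_star, X_star)
    mu = K_s.T @ np.linalg.solve(K, y)
    var = np.diag(K_ss - K_s.T @ np.linalg.solve(K, K_s))
    return mu, var
```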
ABO uses an acquisition function as a selection heuristic that evaluates each candidate
point based on its mean and variance. Acquisition functions are generally designed so that
their high values correspond to potentially high values of the objective function. Starting
with some initial observed data points, the covariance matrix of the GP is calculated. Using
the posterior mean and variance of each candidate data point, the value of the acquisition
function can be obtained. The point that maximizes this value is then selected for obser-
vation. The experiment and its observed output get added to the data set and this process
can be repeated until a specified horizon or budget is exhausted. Since optimizing the hy-
perparameters of the kernel function helps fit a more accurate GP model to the data, the
kernel parameters are tuned periodically during the iterative process. Pseudocode for the
ABO algorithm is provided in Algorithm 1.
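Since Algorithm 1 is not reproduced here, the following schematic sketch captures the loop just described. It reuses `gp_posterior` from the sketch above and, purely for illustration, maximizes a UCB-style acquisition approximately over a random candidate set; the candidate count, exploration weight, and the omission of per-step kernel re-tuning are all simplifying assumptions rather than our actual experimental configuration:

```python
import numpy as np

def abo_loop(objective, lo, hi, budget, n_init=5, n_cand=512, seed=0):
    # Schematic ABO loop: observe initial points, then repeatedly compute the
    # GP posterior, (approximately) maximize an acquisition function over
    # random candidates, and run the selected experiment.
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    X = rng.uniform(lo, hi, size=(n_init, len(lo)))
    y = np.array([objective(x) for x in X])
    for _ in range(budget):
        # (Periodic kernel hyperparameter re-tuning is omitted in this sketch.)
        cand = rng.uniform(lo, hi, size=(n_cand, len(lo)))
        mu, var = gp_posterior(X, y, cand)
        score = mu + 2.0 * np.sqrt(var)      # illustrative UCB-style acquisition
        x_next = cand[np.argmax(score)]      # approximate inner maximization
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next))  # run the selected experiment
    return X[np.argmax(y)], y.max()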
In this paper, we consider two of the most popular acquisition functions, namely Expected Improvement (EI) (Mockus et al., 1978) and Gaussian Process Upper Confidence Bound (GP-UCB) (Srinivas et al., 2010). At any time step $t+1$ with prior experiments $x_{1:t}$, the acquisition function EI measures the expected improvement with respect to the current best previously observed objective value $f(x_t^+)$, where $x_t^+ = \arg\max_{x_i \in x_{1:t}} f(x_i)$. Under Gaussian processes, EI can be analytically computed as follows:

$$A_{EI}(x) = \big(\mu(x) - f(x^+)\big)\, \Phi\!\left(\frac{\mu(x) - f(x^+)}{\sigma(x)}\right) + \sigma(x)\, \phi\!\left(\frac{\mu(x) - f(x^+)}{\sigma(x)}\right)$$
where $\mu(x)$ and $\sigma^2(x)$ are the predicted mean and variance of point $x$, and $\Phi$ and $\phi$ are the CDF and PDF of the standard normal distribution, respectively.
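The closed-form EI above translates directly into code; the small floor on $\sigma$ below is our own guard against division by zero at already-observed points:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, eps=1e-12):
    # Closed-form EI under a GP posterior, matching the formula above.
    sigma = np.maximum(sigma, eps)  # guard against zero predictive deviation
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
```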
More recently, GP-UCB, which enjoys provable cumulative regret bounds with high probability, was proposed as a BO algorithm (Srinivas et al., 2010). Formally, the algorithm defines the following acquisition function:

$$A_{UCB}(x) = \mu(x) + \beta_t^{1/2}\, \sigma(x)$$
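As a minimal sketch, the acquisition is a one-liner once a $\beta_t$ schedule is chosen. The schedule below is the finite-candidate-set form $\beta_t = 2 \log(|D|\, t^2 \pi^2 / 6\delta)$ from Srinivas et al. (2010), used here purely for illustration; the schedule actually used in our experiments is discussed in Section 4:

```python
import numpy as np

def gp_ucb(mu, sigma, t, n_candidates, delta=0.5):
    # GP-UCB acquisition: posterior mean plus a scaled posterior deviation.
    beta_t = 2.0 * np.log(n_candidates * t**2 * np.pi**2 / (6.0 * delta))
    return mu + np.sqrt(beta_t) * sigma
```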
growth of the partitioning tree is what permits the algorithms their theoretical guarantees. In this section, we consider two PGO algorithms with strong theoretical guarantees, one of which (LOGO) has been claimed in prior work (Kawaguchi et al., 2016) to be competitive with ABO.
is higher than that of any larger-sized leaves. This guarantees that the leaf with the maximum upper bound according to ℓ is always selected for expansion regardless of the true definition of ℓ.
SOO takes as a parameter a depth-limiting function $h_{max}(n)$ that defines the maximum depth to which the partitioning tree can be expanded after $n$ cell refinements. In previous work (Munos et al., 2014; Kawaguchi et al., 2016), this function has always been defined as $h_{max}(n) = \sqrt{n}$, so we ignore this parameter and assume that this common setting of $h_{max}$ is a part of SOO itself.
At a high level, the SOO algorithm can be viewed as consisting of two parts: the cell selection process (lines 8-15) and the cell expansion process (lines 16-23). Since the remaining partitioning-based algorithms modify one or both of these processes, we define a simplified version of the algorithm in Algorithm 3, which enables a clearer definition of the derivative algorithms.
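For reference, here is a compact, simplified Python sketch of SOO's selection-and-expansion loop under the $h_{max}(n) = \sqrt{n}$ depth limit; the leaf representation and the handling of the middle child (which inherits its parent's center and value) are our own implementation choices:

```python
import numpy as np

def soo(objective, lo, hi, n_evals):
    # Leaves are tuples (depth, lower, upper, center, value of center).
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    c = (lo + hi) / 2.0
    leaves = [(0, lo, hi, c, objective(c))]
    n = 1
    while n < n_evals:
        v_max = -np.inf
        h_lim = int(np.sqrt(n))                       # depth limit h_max(n)
        for h in range(min(max(l[0] for l in leaves), h_lim) + 1):
            at_h = [l for l in leaves if l[0] == h]
            if not at_h:
                continue
            best = max(at_h, key=lambda l: l[4])      # best leaf at this depth
            if best[4] <= v_max:
                continue                               # optimistic selection rule
            v_max = best[4]
            leaves.remove(best)
            d = int(np.argmax(best[2] - best[1]))      # split the longest edge
            w = (best[2][d] - best[1][d]) / 3.0
            for k in range(3):                         # refine into three children
                clo, chi = best[1].copy(), best[2].copy()
                clo[d] = best[1][d] + k * w
                chi[d] = clo[d] + w
                cc = (clo + chi) / 2.0
                cv = best[4] if k == 1 else objective(cc)  # middle child reuses value
                leaves.append((best[0] + 1, clo, chi, cc, cv))
                if k != 1:
                    n += 1
    return max(leaves, key=lambda l: l[4])
```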
the value of the lower confidence bound of its center point according to the prior. This
allows the algorithm to continue as expected by effectively ‘ignoring’ the unpromising cell
instead of spending an objective evaluation to assign a value to the probably-unimportant
cell.
Algorithm 5 Bayesian Multi-Scale Optimistic Optimization
1: function ExpandCellsBaMSOO(E)
2:   D = the observations in τ not marked as GP-based
3:   for each cell x_{h,i} in E do
4:     Refine x_{h,i} into its three resultant children cells x_{h+1,j_1}, ..., x_{h+1,j_3}
5:     for each child cell x_{h+1,j} do
6:       if U(x_{h+1,j} | D) ≥ f* then
7:         g(x_{h+1,j}) = f(x_{h+1,j})
8:       else
9:         g(x_{h+1,j}) = L(x_{h+1,j} | D)
10:        Mark g(x_{h+1,j}) as GP-based
11:      end if
12:      τ_{h+1} = τ_{h+1} ∪ {x_{h+1,j}}
13:    end for
14:  end for
15: end function
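A Python rendering of this procedure might look as follows. The `Cell` class and the tree-as-dict representation are our own sketch, and `gp_bounds(x)` is assumed to return the pair (L(x|D), U(x|D)) from a GP fit to the non-GP-based observations:

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class Cell:
    depth: int
    lo: List[float]
    hi: List[float]
    g: float = math.nan     # selection value: real observation or GP lower bound
    gp_based: bool = False  # True if g came from the GP (excluded from GP fits)

    def center(self):
        return [(a + b) / 2.0 for a, b in zip(self.lo, self.hi)]

def expand_cells_bamsoo(cells, tree, objective, gp_bounds, f_star):
    # Sketch of Algorithm 5: refine each cell into three children; evaluate a
    # child only if its GP upper bound clears the incumbent value f_star.
    for cell in cells:
        d = max(range(len(cell.lo)), key=lambda i: cell.hi[i] - cell.lo[i])
        w = (cell.hi[d] - cell.lo[d]) / 3.0
        for k in range(3):
            lo, hi = list(cell.lo), list(cell.hi)
            lo[d] = cell.lo[d] + k * w
            hi[d] = lo[d] + w
            child = Cell(cell.depth + 1, lo, hi)
            lcb, ucb = gp_bounds(child.center())
            if ucb >= f_star:
                child.g = objective(child.center())  # worth a real evaluation
            else:
                child.g, child.gp_based = lcb, True  # 'ignore' unpromising cell
            tree.setdefault(cell.depth + 1, []).append(child)
```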
Table 1: Comparison of the most relevant properties of the evaluated algorithms.

Algorithm   Sample Selection   Fits GP   Optimizes Acquisition Function   Sample 'depth limit'
SOO         Grid-aligned       No        No                               Yes
LOGO        Grid-aligned       No        No                               Yes
DIRECT      Grid-aligned       No        No                               No
BaMSOO      Grid-aligned       Yes       No                               Yes
IMGPO       Grid-aligned       Yes       No                               No
BO          Unrestricted       Yes       Yes                              N/A
the PGO algorithms’ results. Accordingly, the depth limit function hmax (n) that is used in
every other SOO-derived algorithm is no longer necessary for IMGPO.
4. Experimental Setup
When reviewing the reported results of the algorithms evaluated in this paper, we found that they were rarely evaluated in a true black-box setting. Instead, the algorithms' hyperparameters were tuned after the fact, or the results were presented with non-standard metrics that best demonstrate the strengths of the proposed approach.
These differences in procedure led to confusing and seemingly contradictory results, which spurred the development of this work. Our goal is to directly compare the performance of each algorithm in a setting where it has minimal knowledge of the objective being optimized, in the hope of resolving some of the confusion that arises when trying to reason collectively about the previous works' results.
2. Modifications were made to expose previously purely internal functions and data structures as required
for the BO-aware algorithms to function. The optimization behavior of the library itself is unchanged.
between results. Additionally, we chose to include the Rastrigin, Schwefel, and Ackley test
functions (Molga and Smutnicki, 2005) in our evaluation to provide more variety among the
objectives.
The Rastrigin, Schwefel, Ackley, and Rosenbrock objective functions (Molga and Smut-
nicki, 2005) can be defined with arbitrary dimensionality D. Functions with this property
are useful in allowing us to better determine the effect that the dimensionality of the objec-
tive has on each algorithm’s performance. For these objectives, we evaluated each algorithm
on each function with D = 2, 4, 6, and 10.
A summary of the properties of each objective function can be found in Table 2.
world problems, we chose to avoid them by randomizing certain properties of the objective
functions used in our evaluation.
The cell refinement procedure shared by SOO and its derived algorithms simply splits evenly along the dimension along which the cell has the longest edge. This approach frequently results in ties between dimensions, which must be resolved using a tie-breaking procedure. The tie-breaking procedure used by the partitioning algorithms is simply to assign a fixed priority order to the dimensions and split the higher-priority dimensions first in the case of a tie. It is plausible that this approach could lead to situations where the assignment of this arbitrary ordering has a significant impact on the algorithms' performance on some objective functions. To mitigate the possibility that this behavior causes certain algorithms to exhibit non-representative behavior, we randomize the tie-break dimension ordering at the beginning of each run of each algorithm, as sketched below.
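A small illustrative sketch of this randomized tie-breaking (the names and the representation of the priority order are our own; the priority permutation is drawn once per run and reused for every split):

```python
import numpy as np

def split_dimension(cell_lo, cell_hi, priority):
    # Choose the dimension with the longest edge; break ties using the
    # randomized per-run priority order instead of a fixed one.
    edges = np.asarray(cell_hi, float) - np.asarray(cell_lo, float)
    tied = np.flatnonzero(edges == edges.max())
    return min(tied, key=lambda d: priority[d])

rng = np.random.default_rng()
priority = rng.permutation(4)  # drawn once at the start of each run (d = 4 here)
dim = split_dimension([0, 0, 0, 0], [1, 1, 1, 1], priority)
```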
Partitioning algorithms evaluate the center of the initial cell (the whole domain) first, so an objective whose optimum lies exactly at the domain's center would be solved immediately. To allow us to evaluate the performance of the partitioning algorithms on objective functions whose optimum point is at the domain's center, we randomly shrink the bounds of the hyper-rectangle that defines the objective's domain such that the optimum point is still guaranteed to be contained within the new bounds.
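One plausible implementation of this domain randomization is sketched below; the uniform per-side shrinking distribution is an assumption, since the exact distribution is an implementation detail:

```python
import numpy as np

def shrink_domain(lo, hi, x_opt, rng):
    # Randomly pull each boundary toward the optimum so that x_opt is no
    # longer centered in the domain but is still guaranteed to lie inside it.
    lo, hi, x_opt = (np.asarray(a, float) for a in (lo, hi, x_opt))
    new_lo = lo + rng.uniform(0.0, 1.0, lo.shape) * (x_opt - lo)
    new_hi = hi - rng.uniform(0.0, 1.0, hi.shape) * (hi - x_opt)
    return new_lo, new_hi
```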
These methods of randomizing objective functions are applied in a predictable and repeatable way that is identical for all algorithms, so that, for example, one algorithm is not advantaged by receiving 'easier' domains on certain functions than another algorithm.
which Srinivas et al. (2010) found to be effective. In this case, t is the number of function
observations made so far and δ is a constant set to 0.5.
SOO has no parameters to set. LOGO requires only a list of integers to use as the adaptive schedule W for its local bias parameter. In this work, we set W = {3, 4, 5, 6, 8, 30} to duplicate the value chosen by Kawaguchi et al. (2016) in the work that introduced the LOGO algorithm.
BaMSOO and IMGPO both require the use of a GP prior to estimate the upper bound on the objective function's value at certain nodes' locations. Using the GP's estimated mean function $\mu$ and standard deviation function $\sigma$, we define the LCB and UCB as $\mu(x) \pm B_N \sigma(x)$.
For BaMSOO we define $B_N = \sqrt{2 \log(\pi^2 N^2 / 6\eta)}$ as suggested by Wang et al. (2014), and for IMGPO we define $B_N = \sqrt{2 \log(\pi^2 N^2 / 12\eta)}$ as suggested by Kawaguchi et al. (2015). In both cases, $N$ is the total number of times the GP was used to evaluate a node, and the constant $\eta$ is set to 0.05 to duplicate the value used in the experimental results of Kawaguchi et al. (2015) and Wang et al. (2014).
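In code, these bounds amount to the following direct transcription of the two schedules (function and parameter names are illustrative):

```python
import numpy as np

def confidence_bounds(mu, sigma, N, eta=0.05, variant="bamsoo"):
    # LCB/UCB pair mu(x) +/- B_N * sigma(x); N is the number of GP-based
    # node evaluations so far, and the denominator selects the variant.
    denom = 6.0 if variant == "bamsoo" else 12.0  # "imgpo" uses 12
    B_N = np.sqrt(2.0 * np.log(np.pi**2 * N**2 / (denom * eta)))
    return mu - B_N * sigma, mu + B_N * sigma
```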
5. Empirical Results
In this section we present the results of our experiments using simple regret as our primary
performance metric. In particular, after each run of an optimization algorithm, the simple
regret is the difference between the best observed outcome and the optimal value. The
reported regrets are averages over 70 independent trials.
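Concretely, a single run's regret curve can be computed as a running best, as in this small sketch:

```python
import numpy as np

def simple_regret_curve(observed, f_opt):
    # Simple regret after each evaluation: the gap between the optimal value
    # and the best outcome observed so far.
    return f_opt - np.maximum.accumulate(np.asarray(observed, float))
```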
[Figure 1 appears here. Panels: (a) Ackley 4, (b) Ackley 6, (c) Ackley 10, (d) Rastrigin 4, (e) Rastrigin 6, (f) Rastrigin 10, (j) Schwefel 4, (k) Schwefel 6, (l) Schwefel 10, (m) Sin 2, (n) Branin, (o) Rosenbrock 2. Axes: regret (log scale) versus function evaluations. Curves: ABO-EI, ABO-UCB, PBO-BaMSOO, PBO-IMGPO, PGO-LOGO, PGO-SOO, DIRECT, Random.]
Figure 1: Regret curves for each algorithm on a variety of functions. Each curve shows the
regret at each time step averaged across 70 randomized runs. Error bars have been omitted
for readability. Statistical significance is considered in Section 5.2.
[Figure 2 appears here. Panels: (a) Branin, (b) Rosenbrock 6, (c) Schwefel 10. Axes: maximum regret (log scale) versus function evaluations. Curves include PBO-IMGPO, PGO-LOGO, PGO-SOO, PGO-DIRECT, and Random.]
Figure 2: Worst-case regret curves for each algorithm. Each curve shows the maximum
regret at each time step across 70 randomized runs.
Table 3: Pairwise comparison of each algorithm across all objective functions at t = 500 samples. Each cell displays the number of 'wins', 'losses', and 'ties' between the algorithm in the row and the algorithm in the column (for example, the bottom-left-most cell shows that Random beat ABO-EI 0 times, lost 23 times, and tied 0 times).
Each comparison is based on the 95% confidence interval of the mean regret of each algorithm on each objective function at the specified sample size. We considered one algorithm to beat another on a given function if the upper bound of the confidence interval of its mean regret was less than the lower bound of the 'challenger' algorithm's. If the confidence intervals of the two algorithms' performance overlapped, they were considered to tie. We use this information to examine differences between classes of algorithms, across different types of objective functions, and across different sample budgets.
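The comparison rule can be sketched as follows; the use of a Student-t interval here is our own assumption, since the exact interval construction is not material to the win/loss/tie rule itself:

```python
import numpy as np
from scipy import stats

def compare(regrets_a, regrets_b, conf=0.95):
    # 'Win' for A if A's CI on mean regret lies entirely below B's
    # (lower regret is better); any overlap counts as a tie.
    def ci(x):
        x = np.asarray(x, float)
        half = stats.t.ppf((1 + conf) / 2, len(x) - 1) * stats.sem(x)
        return x.mean() - half, x.mean() + half
    lo_a, hi_a = ci(regrets_a)
    lo_b, hi_b = ci(regrets_b)
    if hi_a < lo_b:
        return "win"
    if hi_b < lo_a:
        return "loss"
    return "tie"
```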
ABO-UCB versus ABO-EI. ABO-EI and ABO-UCB, despite being similar algorithms, performed very differently from one another compared to other algorithm pairs in the same class. EI wins over UCB more than half of the time, and loses only 3 times. While EI consistently beats every other algorithm, UCB at best ties with the PBO algorithms, and could only beat the PGO algorithms approximately half of the time.
Note that this result is based on our selected method for updating the UCB hyperparameter β. It is important to recall that this choice was based on our best effort to select from the variety of β selection methods considered in previous work during a preliminary validation period; we know of no better overall method for β selection.
It is very likely that one could tune β on a per-problem basis to outperform EI; however, this tuning methodology would not result in algorithm behavior that is representative of its efficacy in many real-world scenarios. To test this possibility, we selected a number of additional settings for the β parameter and compared UCB's performance using these settings to our previous results and to ABO-EI's. Selected results from these experiments are shown in Figure 3.
[Figure 3 appears here. Panels: (a) Ackley 4, (b) Rastrigin 6, (c) Rosenbrock 10. Axes: regret (log scale) versus function evaluations. Curves: UCB-0.1, UCB-1.0, UCB-5.0, UCB-10.0, UCB-Krause, UCB-K, UCB-Annealed, EI, Random.]
Figure 3: Regret curves for a variety of different β values for the UCB algorithm. ABO-EI
and Random are provided as benchmarks.
dominant versus Random, with both SOO and LOGO winning over Random only slightly
more often than they’re able to tie, and LOGO losing on 1 function.
Comparison to DIRECT. DIRECT clearly performs significantly better than either
SOO or LOGO in our results. It ties with ABO-EI approximately as often as the other
PGO methods lose to ABO-EI, and most notably never loses to either SOO or LOGO while
consistently outperforming them.
This is surprising, considering how similar DIRECT is to SOO in its structure and
operation. The most notable differences are DIRECT’s lack of a ‘depth limit’ when refining
its partitioning tree over the objective’s domain, and its lack of a uniform selection of cells
to refine among all depths of its partitioning tree.
While these properties are what give SOO its theoretical guarantees, the former may
prevent SOO from quickly exploiting an area of the objective that’s known to be good
while the latter forces it to ‘waste’ observations exploring areas that might otherwise seem
unappealing.
In practice, this appears to suggest that DIRECT's freedom from these potential limitations allows it to significantly outperform the other PGO algorithms. Since DIRECT's runtime is comparable to SOO's and LOGO's, our results do not suggest a use case for which SOO or LOGO would be more appropriate than DIRECT as an optimization algorithm.
new samples are restrained to by the limited depth of the refinement tree may be limiting the extent to which the partitioning algorithms are able to quickly 'hone in' on promising regions of the function once they have been identified. Modifying the algorithms' depth-limiting function hmax to allow greater tree expansion at very large numbers of samples could prevent this behavior, although it may also reduce performance by allowing PGO to spend too many samples exploiting attractive-looking regions earlier on. Investigations during our validation period did not reveal a superior overall choice for hmax.
Comparing PBO versus ABO, we see that while both PBO algorithms reduce the number of losses to ABO-EI at the larger sample size, they do not improve the number of wins over ABO-EI. The performance of the PBO algorithms improves slightly against ABO-UCB; however, ABO-UCB was already not very competitive when PBO was allowed only 500 samples.
To more directly compare the performance of PBO and ABO algorithms, we removed the sample budget and applied a time horizon to the PBO methods that allows them to execute for the same average runtime as the ABO methods on the same objective. Because ABO must optimize its increasingly expensive-to-evaluate acquisition function to select each point, at large numbers of observations ABO runs slowly enough that the PBO methods, which are not burdened by this step, are able to obtain up to thousands of extra samples of the objective in the same amount of time. We show a sample of representative results obtained with varying time horizons in Figure 4. Consistent with our previous results, despite being allowed up to thousands of extra samples of the objective over ABO, the PBO methods are unable to translate those extra samples into better final regret.
We also evaluated the algorithms’ relative performance when the objective function
takes more or less time to execute. Our data set contains the wall clock execution time (e)
and sample count (t) for each step in each optimization. Since we know the benchmark
functions take near-zero time to evaluate, we assume that the wall clock time represents the
amount of time the algorithm itself has consumed up until that point. We can then derive a
data set simulating the algorithms’ performance on objective functions that takes time o to
evaluate by replacing each wall clock time/sample count pair (e, t) in the data set with an
updated pair (e + ot, t). We hoped to discover that by adjusting the ‘objective complexity’
we would could find a trade-off where the rapid sampling pace of the partitioning algorithms
would outperform the slower optimization-bound approach of the acquisition algorithms.
However, we did not follow through on this analysis once we observed that the partitioning
methods were unable to reliably outperform the acquisition methods even when given orders
of magnitude more samples.
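The transformation itself is a one-liner over the recorded (elapsed, samples) pairs; the function name is illustrative:

```python
def simulate_objective_cost(records, o):
    # For an objective costing o seconds per evaluation, each recorded pair
    # (wall-clock elapsed e, sample count t) becomes (e + o * t, t).
    return [(e + o * t, t) for (e, t) in records]
```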
‘Flexibility’ of ABO. Although we use one set of hyperparameters on one implementation of ABO for our evaluation, it is important to consider that ABO has many implicit and explicit parameters that can be modified to achieve a desired performance/computation trade-off. For example, by tweaking the time allowed for the inner optimization algorithm that optimizes the GP's parameters, or for the algorithm that optimizes the acquisition function, ABO can be made to run much faster with some unknown penalty to its performance, depending on the objective's sensitivity to the underlying GP's accuracy or to the accuracy of the acquisition function optimization. We do not explicitly explore the details of those minor tweaks to ABO in this work. Instead, to examine the impact on ABO's performance
[Figure 4 appears here. Panels: (a) Branin, (b) Rosenbrock 6, (c) Schwefel 10. Axes: regret (log scale) versus wall clock run time (s). Curves: ABO-EI, ABO-UCB, PBO-BaMSOO, PBO-IMGPO.]
Figure 4: Regret curves for PBO and ABO methods in terms of wall clock time rather than
number of function evaluations executed. Each curve shows the regret at each time step
averaged across 70 randomized runs.
[Figure 5 appears here. Panels: (a) Branin, (b) Rosenbrock 6, (c) Schwefel 10. Axes: regret (log scale) versus function evaluations. Curves include ABO-EI-sparse and ABO-UCB-sparse alongside the standard instances.]
Figure 5: Regret curves for each instance of ABO on a representative selection of functions.
Each curve shows the regret at each time step averaged across 70 randomized runs. Error
bars have been omitted for readability.
when it is not allowed the same amount of wall clock execution time, we adjust the number of samples between GP parameter optimizations.
Figure 5 compares the performance of the instance of ABO we present in this paper to a 'sparse' instance that is identical to our implementation of ABO except that it re-learns GP parameters from the observed data one-eighth as frequently. That is, it performs the GP optimization every sixteen observations instead of every second observation.
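The only difference between the two instances is the re-fitting cadence, which can be expressed as a simple schedule check (names are illustrative):

```python
def should_refit_gp(n_observations, sparse=False):
    # 'Standard' ABO re-optimizes GP hyperparameters every 2 observations;
    # the 'sparse' instance does so every 16 (one-eighth as frequently).
    period = 16 if sparse else 2
    return n_observations % period == 0
```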
Despite an expected decrease in execution time by a factor of eight, this 'lighter' instance of ABO does not appear to perform significantly differently from our 'standard' ABO. If ABO-EI did not already appear to be the dominant algorithm in this evaluation, presenting a version of it that achieves similar results in one-eighth the time would make it appear even more attractive compared to the other methods. However, since making this change would not affect our conclusions, we do not consider this 'lighter' ABO in our results other than to demonstrate how flexible ABO approaches can be.
Dependence on Dimensionality. Because ABO is known to be effective in problems
with low dimensionality, we intentionally compared the algorithms on objective functions
with a wide range of dimensionality to better understand where and why the partitioning
methods’ performance differs from ABO’s. Table 5 shows the same comparison restricted
to only two-dimensional objective functions, while Table 6 only shows the results for the
objectives with dimension greater than two.
Table 4: Pairwise comparison of each algorithm across all objective functions when algorithms are run for the maximum number of samples considered in this paper. PGO, DIRECT, and Random are allowed 10,000 samples, ABO is allowed 500 samples, and PBO is allowed the same wall-clock run time as the ABO algorithms.
Table 5: Pairwise comparison of each algorithm across all objective functions with dimension
D = 2 at t = 500 samples.
For the two-dimensional functions we evaluated, the differences between the algorithms' performance are not as pronounced. While ABO-EI still appears to be the most effective, the PGO approaches frequently tie with its results and even beat it on one occasion each.
For objectives with dimensions greater than two, ABO is much more dominant. ABO-EI
only loses to a partitioning method once, and wins against the other algorithms much more
frequently than it ties with them. Although we expect ABO’s performance to degrade at
higher dimensions, it seems that PGO’s performance is hit much harder by the increase in
objective dimensions. This may be because having more dimensions along which to split
the cells means the partitioning tree over the space would be much deeper before PGO can
start to refine its search towards promising areas (since each cell must be split along each
dimension in some order, regardless of the values observed). This ‘refinement’ shortcoming,
combined with the depth-limiting behavior of PGO algorithms, is likely what causes the
performance of PGO algorithms to degrade so consistently on high-dimensional functions.
Table 6: Pairwise comparison of each algorithm across all objective functions with dimension
D > 2 at t = 500 samples.
Comparison to Previous Results. Several aspects of our results appear to differ from some of those reported in previous works. The experimental results presented by Wang et al. (2014) suggest that BaMSOO performs better than both SOO and ABO-UCB while being significantly faster than ABO-UCB. Those presented by Kawaguchi et al. (2015) suggest that IMGPO performs better than BaMSOO, SOO, UCB, and EI while being significantly faster than BaMSOO. Later, Kawaguchi et al. (2016) report results suggesting that LOGO is more effective than BaMSOO and significantly more so than SOO.
Taken as a whole, these results seem to suggest that LOGO and IMGPO are the dominant optimization algorithms among those considered, both handily beating the current state-of-the-art ABO methods. Our results contradict this implied ranking. Most notably, we found that ABO-EI was a dominant algorithm and that LOGO rarely performs significantly better than SOO.
Of course, we need to be cautious about reasoning about prior results in a transitive fashion. Each evaluation was performed on a different set of black-box functions, sometimes according to different metrics, and likely using different implementations of each algorithm. Still, it is surprising that there is such a discrepancy between the relative performances we observed and those derived from previous work.
For LOGO, the code used to generate prior results was not available due to contractual issues, and we have not been able to replicate those results. However, we have validated our implementation with the authors of LOGO. One potential source of the discrepancy is that our evaluation employs randomization of the objective function, which we found was important to avoid observing performance differences due to lucky initializations or grid alignment.
ABO algorithms are necessarily 'tuned' by the authors of papers that use them in comparisons; however, the selected parameters are rarely reported, since they are not considered relevant to the paper itself. Although EI is parameter-free, the Gaussian processes used in ABO methods have many parameters that can significantly affect ABO's performance. This may explain the unusually poor performance of ABO-EI in previous work. Without any motivation to maximize the ABO methods' performance in a comparison, it is unclear whether the parameters selected for an evaluation result in representative behavior from the algorithm.
6. Summary
We presented experimental results comparing PGO, PBO, and ABO methods within a common open-source evaluation framework. The results demonstrate that acquisition-based optimization approaches, specifically ABO-EI, outperform partitioning-based optimization methods when evaluated by the average regret achieved after a given number of function observations in a strict black-box setting. Even when partitioning methods are given significantly more samples of the objective function, they are frequently unable to match the results that ABO-EI can achieve with far fewer samples.
The utility of an extremely computationally cheap black-box optimization algorithm is already questionable, since the limiting factor in problems that apply these algorithms is usually evaluating the black-box function itself rather than the optimization algorithm's runtime. We demonstrate that fast partitioning-based methods tend to 'flat-line' on difficult
problems even when given significantly more observations, or at least equal runtime, than
competing expensive algorithms. This suggests that there are fewer situations in which
PGO optimization should be chosen over BO-enabled methods than one might otherwise
assume.
This apparent weakness is described in our results, but not well explained. Follow-up investigations into why and how the partitioning methods fail to improve as consistently as ABO methods after a large number of samples are warranted. If the algorithms' shortcomings are in fact due to the refinement issue we discuss above, it is possible that modifying the depth-limiting function or making other small changes to the algorithms could significantly improve their long-term performance compared to otherwise slower algorithms.
Still, the partitioning methods’ relative speed and efficiency make them a promising target
for future work.
Although UCB is commonly shown to perform well when evaluated on synthetic objective functions, we found that, without manually tuning its β parameter to maximize its performance, it regularly failed to outperform both PBO and PGO methods.
Acknowledgements
This work was supported by NSF grant IIS-1619433 and DARPA contract N66001-19-2-
4035.
References
Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv e-prints, art. arXiv:1012.2599, Dec 2010.
Donald R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001.
Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pages 295–304, 2015.
Kenji Kawaguchi, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Bayesian optimization
with exponential convergence. In Advances in Neural Information Processing Systems,
pages 2809–2817, 2015.
Kenji Kawaguchi, Yu Maruyama, and Xiaoyu Zheng. Global continuous optimization with
error bound and fast convergence. Journal of Artificial Intelligence Research, 56(1):153–
195, 2016.
Harold J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.
David J. C. MacKay. Introduction to Gaussian processes. NATO ASI Series F Computer and Systems Sciences, 168:133–166, 1998.
Jonas Mockus. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 4(4):347–365, 1994.
J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. In L. C. W. Dixon and G. P. Szego, editors, Towards Global Optimisation 2, pages 117–129. 1978.
Marcin Molga and Czeslaw Smutnicki. Test functions for optimization needs, 2005.
Rémi Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1–129, 2014.
Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022. Omnipress, 2010.
Ziyu Wang, Babak Shakibi, Lin Jin, and Nando de Freitas. Bayesian multi-scale optimistic
optimization. In AISTATS, pages 1005–1014, 2014.
Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.