
Applied Intelligence

https://doi.org/10.1007/s10489-020-01790-5

Batch Bayesian optimization via adaptive local search


Jingfei Liu1 · Chao Jiang1 · Jing Zheng1

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Bayesian optimization (BO) provides an efficient tool for solving black-box global optimization problems. In situations where multiple points can be evaluated simultaneously, batch Bayesian optimization has become a popular extension that makes full use of the available computational and experimental resources. In this paper, an adaptive local search strategy is investigated to select batch points for Bayesian optimization. First, a multi-start strategy and a gradient-based optimization method are combined to maximize the acquisition function. Second, an automatic clustering approach (e.g., X-means) is applied to adaptively identify the acquisition function's local maxima from the gradient-based optimization results. Third, the Bayesian stopping criterion is utilized to guarantee that all the local maxima can be obtained theoretically. Moreover, the lower bound confidence criterion and a frontend truncation operation are employed to select the most promising local maxima as batch points. Extensive evaluations on various synthetic functions and two hyperparameter tuning problems for deep learning models are used to verify the proposed method.

Keywords Batch Bayesian optimization · Adaptive local search · Parallel search · Hyperparameter tuning

1 School of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China

Corresponding author: Chao Jiang

This work is financially supported by the National Key R&D Program of China (2018YFB1701400).

1 Introduction

BO has become a popular approach for black-box and non-convex global optimization problems in many scientific and engineering areas, including automatic machine learning [8, 26, 64, 66], reinforcement learning [11], robotics [43, 48, 71], and information extraction [73]. It is also promising in many other applications, e.g., the distributed optimization of cooperative control [55, 70]. The efficiency of BO stems from the application of the famous "Bayes' theorem": BO achieves an efficient searching process through an elegant Bayesian updating mechanism. In BO, a prior model that contains our beliefs about the unknown function is first established based on the prior information. Then an observation model (i.e., the acquisition function) is maximized to determine the next evaluation point [26, 36, 50, 64, 67]. After that, the prior information is augmented and the prior model is updated into a more credible posterior model. The optimization proceeds by executing the above procedures cyclically.

Traditional BO is a sequential optimization method, which means only one point is evaluated at each iteration [11, 29]. This sequential selection strategy becomes less efficient when parallel computational and experimental resources are available, e.g., when tuning the hyperparameters of deep learning models in a parallel environment or running laboratory experiments with multiple setups. In order to accelerate BO, various batch Bayesian optimization (BBO) approaches have been proposed to select a batch of points at each iteration [52]. Most of the existing batch criteria sequentially update the acquisition function to select a batch of points, which means the batch points are selected one after another. Ginsbourger et al. [21] proposed to add q points by replacing the maximum that has been found with a constant value. Contal et al. [16] and Desautels et al. [18] cleverly took advantage of the characteristics of the Gaussian process (GP) to select a batch of points without evaluating the objective function. González et al. [23] proposed the Local Penalization (LP) algorithm, which picks a
batch of points by penalizing the peaks of the acquisition function through a properly estimated Lipschitz constant. Some batch approaches utilize the predictive distribution to generate a batch of points without a marginalization operation [5, 6, 8]. The above methods all belong to sequential and greedy batch selection strategies, which can introduce noisy points and may waste computing resources [53]. Other approaches advocate one-step batch selection strategies. Chevalier and Ginsbourger [14] proposed to utilize the suitability of the expected improvement (EI) to select the batch points. Shah and Ghahramani [63] proposed to select a batch of points by extending the predictive entropy search approach [26]. Wu and Frazier [76] investigated the parallel knowledge gradient method to choose a batch of points. Kathuria et al. [32] proposed to model the diversity of a batch of points through the determinantal point process [44, 65]. Wang et al. [72] also utilized the determinantal point process for the batch setting in high-dimensional problems. Lyu et al. [45] proposed to solve a multi-objective optimization problem over different acquisition functions and select the batch points from the Pareto solution set.

However, all of the abovementioned methods need a predefined batch size, which obviously restricts the flexibility of BO. Since there is no adaptive self-adjustment of the batch size, it may cause insufficient or excessive evaluation, which further leads to missing the true optimum or wasting computing resources. In order to select the batch points as well as the batch size adaptively at each iteration, Daxberger and Low [17] extended GP-UCB [67] to a novel batch variant that is amenable to a Markov approximation. Nguyen et al. [53] proposed the budgeted batch Bayesian optimization (B3O), which fits the acquisition function by an infinite Gaussian mixture model (IGMM) [61] and selects the mean values of the Gaussian components as the batch points. Although these methods can automatically determine the batch size at each iteration of BO, they are based on complex approximations with certain assumptions, e.g., the acquisition function is assumed to be a kind of probability density function (PDF) in [53], which may lead to missing truly promising points or introducing noisy points.

In this paper, an adaptive local search (ALS) strategy is proposed for BBO. The main purpose of ALS is to select all the promising local maxima of the acquisition function as batch points, which is similar to the goal of LP and B3O. The key insight of ALS is first to take advantage of the multi-start strategy [25, 47] and gradient-based optimization methods [33, 42] to optimize the acquisition function [4, 26, 36, 50, 67]. The local convergence property of gradient-based optimization guarantees that all the optimization results correspond exactly to local maxima of the acquisition function. Then X-means [57], or some other automatic clustering method [9, 19, 68, 69], can be employed to adaptively identify the local maxima from the gradient-based optimization results. Besides, the Bayesian stopping criterion (BSC) [62] is employed as the stopping condition for the multi-start assisted gradient-based optimization. After that, the point corresponding to the biggest value of the acquisition function in each cluster component is selected as a candidate batch point. Finally, the most promising candidate batch points are retained by the proposed lower bound confidence criterion (LBCC). Thus, the batch points as well as the batch size are automatically determined.

Although B3O [53] and LP [23] are also intended to select the local maxima of the acquisition function as batch points, the proposed ALS achieves this goal in a better and more robust way. First, the working mechanism of ALS is totally different from that of B3O and LP. B3O aims at fitting the acquisition function with an IGMM and choosing the mean values of the Gaussian components within the IGMM as the locations of the local maxima. LP sequentially selects the local maxima by properly penalizing the acquisition function. The proposed ALS combines the multi-start assisted gradient-based optimization, X-means and the Bayesian stopping criterion to adaptively locate the local maxima. Secondly, based on this unique working mechanism, the proposed ALS has the following merits: (1) Since fewer noisy points are selected by the proposed ALS, the computing resources can be made full use of. ALS selects local maxima from the results of gradient-based optimization, and the local convergence property of gradient-based optimization guarantees that the selected points correspond exactly to local maxima. On the contrary, B3O selects the mean values of the Gaussian components within the IGMM, which do not always correspond properly to the local maxima. Besides, the Lipschitz constant within LP also needs to be properly estimated to locate the local maxima accurately. (2) ALS is more flexible in locating the local maxima. Because X-means and the Bayesian stopping criterion are both employed in ALS, self-adjustment in searching for the local maxima can be achieved by exploring the characteristics of the acquisition function at each iteration. On the contrary, LP uses a fixed value for the number of local maxima without adaptive self-adjustment, and B3O lacks a measure for judging whether all the local maxima have been found. Therefore, the proposed ALS is a novel batch Bayesian optimization method that selects the acquisition function's local maxima as batch points.

The rest of the paper is organized as follows. The basic theory of Bayesian optimization is briefly introduced in Section 2. Details of the proposed method ALS are discussed in Section 3. In Section 4, empirical evaluations on 8 benchmark functions and two hyperparameter tuning problems for deep learning models are utilized to demonstrate the robustness of ALS, and the results are compared
with 6 other state-of-the-art algorithms. The discussions about the superiority of the proposed method are given in Section 5. In Section 6, the conclusions are summarized.

2 Background and challenge

Since batch Bayesian optimization is built upon Bayesian optimization, it is necessary to introduce the basic theory of Bayesian optimization. Therefore, in this section, we first introduce some theoretical bases of Bayesian optimization, and then turn to the challenge of batch Bayesian optimization.

2.1 Bayesian optimization

The procedure of Bayesian optimization can be summarized as follows: 1. A GP prior model for the unknown function is established based on some initial observation samples. 2. Based on the uncertainty information (the estimated mean and variance of the unknown function) provided by the GP prior, the acquisition function is maximized to locate the next evaluation point x_next. 3. After evaluating the objective function at x_next, the available samples are augmented and the GP prior is updated to the posterior model. 4. This procedure is repeated until a certain stopping criterion is satisfied. This elegant searching strategy helps to achieve the goal of optimization with a relatively small number of evaluations [64]. The most important ingredients of BO are the GP and the acquisition function, which are introduced in the following subsections.

2.1.1 Gaussian process

The GP is an extensively recognized and powerful model for building the prior model of the unknown function. It has been demonstrated that the GP performs well in many situations [12, 22, 24, 28, 46, 54, 78]. A GP is controlled by the mean function μ(·) and the covariance function (also called the kernel function) K(·), which is written as:

F(x) ∼ GP(μ(x), K(x, x'))    (1)

For any observation points X_{1:n} = {x_1, x_2, ..., x_n}, the objective function values at these points are F_{1:n} = {F(x_1), F(x_2), ..., F(x_n)} (with observation noise σ); then we get the initial observation samples D_{1:n} = {{x_1, F(x_1)}, {x_2, F(x_2)}, ..., {x_n, F(x_n)}}. With certain μ(·) and K(·), the prior model for the objective function can be obtained as:

F(x) ∼ GP_prior(μ_{1:n}, K_{1:n})    (2)

μ_{1:n} = [μ(x_1), μ(x_2), ..., μ(x_n)]    (3)

K_{1:n} = [K(x_i, x_j)]_{i,j=1,...,n} + σ × I    (4)

where I represents the identity matrix.

Then, combined with the predictive distribution of the objective function F_new = F(x_new) at an arbitrary predictive point x_new, the joint multivariate Gaussian distribution is obtained:

[F_{1:n}; F(x_new)] ∼ N(μ_{1:n+1}, [[K_{1:n}, K^T]; [K, K(x_new, x_new)]])    (5)

where K = [K(x_new, x_1), K(x_new, x_2), ..., K(x_new, x_n)]. Combining the Sherman-Morrison-Woodbury formula [74], an expression for the predictive distribution at the predictive point x_new can be easily obtained as:

P(F(x_new) | D_{1:n}, x_new) = N(μ_new(x_new), σ²_new(x_new) + σ)    (6)

μ_new(x_new) = K^T × (K_{1:n} + σ × I)^{-1} × (F_{1:n} − μ_{1:n}) + μ(x_new)    (7)

σ²(x_new) = K(x_new, x_new) − K^T × (K_{1:n} + σ × I)^{-1} × K    (8)

2.1.2 Acquisition function

The acquisition function is another important part of BO. The balance between exploitation and exploration is automatically achieved by maximizing the acquisition function [4, 67].

There are various acquisition functions available. One intuitive strategy is to maximize the probability of improvement (PI) over the current best observation F(x_best) [50] at the corresponding location x_best. Alternatively, one could choose to maximize the expected improvement (EI) over the current best observation F(x_best) [50]. In addition to PI and EI, the lower confidence bound (LCB) [4, 67] is also a popular choice. Since the GP is selected as the prior model, these acquisition functions can be computed explicitly as:

PI(x_new) = P(F(x_new) ≥ F(x_best)) = Φ((μ(x_new) − F(x_best)) / σ(x_new))    (9)

EI(x_new) = (μ(x_new) − F(x_best)) × Φ((μ(x_new) − F(x_best)) / σ(x_new)) + σ(x_new) × φ((μ(x_new) − F(x_best)) / σ(x_new))    (10)

LCB(x_new) = μ(x_new) − k × σ(x_new)    (11)


where Φ(·) represents the cumulative distribution function (CDF) of the standard Gaussian distribution, φ(·) represents the corresponding probability density function (PDF), and k ≥ 0 is a tunable parameter that balances exploitation against exploration.
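As a concrete illustration of (6)-(11), the following minimal sketch (not the authors' code; plain NumPy/SciPy with a squared exponential kernel whose lengthscale is assumed fixed rather than fitted) computes the GP posterior mean and standard deviation at a query point and evaluates PI, EI and LCB from them.

```python
import numpy as np
from scipy.stats import norm

def se_kernel(A, B, lengthscale=1.0):
    # Squared exponential covariance between the row vectors of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, F, x_new, noise=1e-6, lengthscale=1.0):
    # Posterior mean and std at x_new, eqs (6)-(8), assuming a zero prior mean.
    K = se_kernel(X, X, lengthscale) + noise * np.eye(len(X))
    k = se_kernel(X, x_new[None, :], lengthscale)           # the vector K in eq (5)
    K_inv = np.linalg.inv(K)
    mu = (k.T @ K_inv @ F).item()                           # eq (7)
    var = (se_kernel(x_new[None, :], x_new[None, :], lengthscale)
           - k.T @ K_inv @ k).item()                        # eq (8)
    return mu, np.sqrt(max(var, 1e-12))

def acquisitions(mu, sigma, f_best, k=2.0):
    # Eqs (9)-(11), written for maximization of F as in the paper.
    z = (mu - f_best) / sigma
    pi = norm.cdf(z)                                        # eq (9)
    ei = (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)  # eq (10)
    lcb = mu - k * sigma                                    # eq (11)
    return pi, ei, lcb

# Tiny 1D usage example.
X = np.array([[0.0], [0.5], [1.0]])
F = np.sin(3 * X).ravel()
mu, sigma = gp_posterior(X, F, np.array([0.25]))
print(acquisitions(mu, sigma, F.max()))
```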
2.2 Batch Bayesian optimization

Batch Bayesian optimization aims to accelerate the optimization process of BO when adequate computing resources or parallel experimental setups are available. The problem can be written explicitly as [23]:

x_{n,κ} = argmax_{x∈X} ∫ α(x; D_{n,κ−1}) ∏_{l=1}^{κ−1} P(y_{n,l} | x_{n,l}, D_{n,l−1}) × P(x_{n,l} | D_{n,l−1}) dx_{n,l} dy_{n,l}    (12)

where n represents the iteration number of BO, κ represents the batch index, x_{n,κ} represents the κ-th batch point at the n-th iteration, α(·) represents the acquisition function, D_{n,κ−1} represents the observed data at the n-th iteration after the former κ − 1 batch points are selected, P(y_{n,l} | x_{n,l}, D_{n,l−1}) = N(y_{n,l}; μ_n(x_{n,l}), σ_n²(x_{n,l})) represents the predictive distribution of the unknown function at x_{n,l} with the GP prior and n observation samples, and P(x_{n,l} | D_{n,l−1}) = δ(x_{n,l} − argmax_{x∈χ} α(x; D_{n,l−1})) represents the optimization step needed to compute x_{n,l} when the last iteration has been completed.

Equation (12) cannot be computed directly due to the complex and implicit integral. In order to solve this problem, different BBO approaches have been proposed. However, most of the existing BBO approaches need a predefined batch size, and such an inflexible setting is actually not efficient, due to the possibility of introducing noisy points or missing the true maximum [53]. In order to solve this problem, some adaptive algorithms have been proposed [17, 53] to automatically select the batch points as well as the batch size. However, those methods are based on complex approximations with certain assumptions, which may also lead to missing the truly promising points. Therefore, a novel approach is needed for exactly and adaptively selecting the batch points.

3 Adaptive local search for batch Bayesian optimization

In this section, we first discuss the details of the proposed ALS, which aims to select all the promising local maxima of the acquisition function as batch points. Then, we integrate it into the framework of Bayesian optimization to propose the batch Bayesian optimization via adaptive local search (BBO-ALS).

Besides the inspiration from LP and B3O, the intuition of the proposed approach stems from the following three facts: 1. Traditional BO usually selects the global maximum of the acquisition function as the next searching point x_next [29, 66]. 2. The acquisition function is always nonconvex and multimodal [30, 72, 75], and many of the local maxima are close in value. 3. Besides the global maximum of the acquisition function, the other local maxima may also be prospective points, since they provide the most valuable information in the areas away from the global maximum. As shown in Fig. 1, at the current searching iteration of the 1D optimization problem, there are several local maxima for almost every kind of acquisition function, and many of those local maxima are close in value. Based on the above facts, we intend to select the promising local maxima of the acquisition function as batch points; thus the batch selection problem in (12) can be changed into:

{x_{n+1}, x_{n+2}, ..., x_{n+κ}} = {argmax_{x∈χ_1} a(x | D_{1:n}), argmax_{x∈χ_2} a(x | D_{1:n}), ..., argmax_{x∈χ_κ} a(x | D_{1:n})}    (13)

where κ is the unknown number of local maxima of the acquisition function at each iteration, χ_i ∩ χ_j = ∅, i ≠ j, are the domains that contain different local maxima, and χ_1 ∪ χ_2 ∪ ... ∪ χ_κ = χ represents the domain of the optimization problem. This is a special optimization problem in which we need to locate as many of the local maxima as possible. In Subsection 3.1, the proposed solution to (13) is discussed in detail.

3.1 The framework of adaptive local search

ALS is proposed as a novel method for adaptively locating the acquisition function's local maxima. The key insight of ALS is summarized as follows:

Step 1: Optimize the acquisition function through the multi-start assisted gradient-based optimization method; the value of the acquisition function at an arbitrary point x_new can be obtained according to (5)-(11).

Step 2: The automatic clustering approach X-means [57] is utilized to adaptively identify the local maxima from the gradient-based optimization results.

Step 3: The point corresponding to the biggest value of the acquisition function in each cluster component is selected as a local maximum.

Step 4: In order to obtain as many of the local maxima as possible, the Bayesian stopping criterion [62] is employed as the necessary condition. If the Bayesian stopping criterion is not satisfied, go back to Step 1 and add more multi-start points.
Step 5: In order to select the most promising points, the lower bound confidence criterion is proposed to filter out the local maxima with relatively small acquisition function values.

Step 6: In the case where the available parallel computing resources are limited (e.g., at most B deep learning models can be trained simultaneously on one server), a frontend truncation strategy can easily be implemented to select the first B points.

The proposed batch selection strategy can automatically locate the promising local maxima of the acquisition function, thus the batch points as well as the batch size can be adaptively determined at each iteration. A simple flowchart is given in Fig. 2. Other technical details are discussed in the next subsections.

Fig. 1 An illustration of a 1D optimization problem to show that most acquisition functions are multimodal and many of the local maxima are close in value

3.1.1 Multi-start assisted gradient-based optimization method

Combining a multi-start strategy [25, 47] and a gradient-based optimization method [42] is a popular way [11, 47, 64, 75] of finding the global optimum [25]. Random sampling and Latin hypercube sampling [49, 51, 56] are both suitable for generating the multi-start points. The same strategy is utilized to optimize the acquisition function in ALS, not only to find the global maximum, but also to find the other local maxima of the acquisition function.

A typical basic formula of a gradient-based optimization method can be expressed as:

x_{k+1} = x_k + β × ∇α    (14)

where x_{k+1} represents the next searching point, β represents the searching step length, and ∇α represents the gradient of the acquisition function at the current searching point x_k.

As (14) and Fig. 3 show, when the searching point comes close to a local maximum, the gradient ∇α gets close to zero, thus the searching result converges to the local maximum.

In this paper, L-BFGS-B is employed to maximize the acquisition function under the multi-start strategy. Various other gradient-based optimization methods [33, 42] can also be integrated into the proposed framework.

3.1.2 Automatic cluster method

After the implementation of the multi-start assisted gradient-based optimization, X-means is employed to identify the local maxima.

X-means [57] extends the famous K-means algorithm [10, 13] into an automatic version. In the proposed method, we want to adaptively locate the local maxima of the acquisition function at each iteration. Since we do not know how many local maxima there are in each iteration, we cannot specify the exact number of cluster components in advance. In such a case, X-means can outperform other clustering methods in adaptively locating the local maxima without specifying the number of cluster components. That is the major reason why we choose X-means. Other clustering methods, including K-means, mean shift, spectral clustering, and recently developed deep clustering methods [15, 58, 59], can all be utilized under the premise that some modifications are implemented for adaptively locating the local maxima. In X-means, the Bayesian Information Criterion (BIC) is utilized as the measure of the cluster models' quality, and the optimal number of cluster components K is determined by optimizing the BIC.

After maximizing the acquisition function by L-BFGS-B with m multi-start points, we can obtain m optimization results D' = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}. According to the Schwarz Criterion [31], the posterior probability can be approximated as follows:

BIC(M_j) = l̂_j(D') − (p_j / 2) × log R    (15)

where M_j represents the j-th cluster model with cluster number K_j, l̂_j represents the maximum-likelihood
value of the j-th cluster model, and p_j represents the number of parameters in the j-th cluster model M_j. The logarithmic form of the likelihood of the data D' can be expressed as:

l̂(D') = Σ_i log P(x_i) = Σ_i [ log(1 / (√(2π) × σ^D)) − ‖x_i − μ_i‖² / (2σ²) + log(R_i / R) ]    (16)

where σ² is the variance of the data, D represents the data dimension, R_i represents the number of data points X_i corresponding to the i-th cluster component with mean value μ_i in the j-th cluster model M_j, R represents the total number of points available, and P(x_i) is the point probability of the i-th cluster component. Other details about X-means and K-means can be found in Pelleg and Moore [57] and Bishop [10].

Fig. 2 The framework of the proposed batch selection strategy

Fig. 3 An illustration to show the local convergence property of the gradient-based optimization method

After implementing X-means on the gradient-based optimization results D', the cluster components corresponding to different local maxima are identified adaptively. In each cluster component, the point corresponding to the biggest value of the acquisition function is chosen as one candidate batch point.

An example on a 2D Griewank function (the 2D Griewank function plays the role of the acquisition function) illustrating the mechanism of the proposed ALS is shown in Fig. 4. We first optimize the function by L-BFGS-B with 300 multi-start points, and then employ X-means to identify the local maxima from the optimization results. The searching trajectories of L-BFGS-B as well as the clustering results of X-means are clearly shown in Fig. 4; it is obvious that the local maxima of the 2D Griewank function are well located.
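A minimal sketch of this mechanism (not the authors' implementation) is given below: it restarts L-BFGS-B from random points of the standard 2D Griewank function, then groups the converged points with scikit-learn's KMeans, choosing the number of clusters by a BIC-like score as a stand-in for X-means; the cluster-count range and the scoring function are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.cluster import KMeans

def griewank(x):
    # Standard 2D Griewank function (minimized here; its optima play the
    # role of the acquisition function's local maxima in the paper's example).
    x = np.asarray(x, dtype=float)
    return 1 + np.sum(x ** 2) / 4000 - np.prod(np.cos(x / np.sqrt([1.0, 2.0])))

rng = np.random.default_rng(0)
bounds = [(-8, 8), (-8, 8)]
starts = rng.uniform([-8, -8], [8, 8], size=(300, 2))        # multi-start points

# Step 1: gradient-based local optimization from every start point.
results = np.array([minimize(griewank, s, method="L-BFGS-B", bounds=bounds).x
                    for s in starts])

# Step 2: cluster the converged points; pick the cluster number with a
# BIC-like criterion (an X-means stand-in, in the spirit of eq (15)).
def bic_like(points, labels, centers):
    resid = np.sum((points - centers[labels]) ** 2)
    sigma2 = max(resid / max(len(points) - len(centers), 1), 1e-12)
    loglik = -0.5 * len(points) * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * len(centers) * points.shape[1] * np.log(len(points))

best_k, best_score, best_model = None, -np.inf, None
for k in range(2, 20):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(results)
    score = bic_like(results, km.labels_, km.cluster_centers_)
    if score > best_score:
        best_k, best_score, best_model = k, score, km

# Step 3: in each cluster, keep the point with the best function value.
candidates = [results[best_model.labels_ == c][
                  np.argmin([griewank(p) for p in results[best_model.labels_ == c]])]
              for c in range(best_k)]
print(f"{best_k} local optima identified")
```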
Fig. 4 An example of a 2D Griewank function with 12 local maxima to show the mechanism of the proposed ALS

3.1.3 Bayesian stopping criterion and lower bound confidence criterion

In order to obtain as many of the local maxima as possible, the Bayesian stopping criterion (BSC) from the statistical literature is employed as the stopping condition for the multi-start assisted gradient-based optimization. According to Rinnooy Kan and Timmer [62], the BSC is given as [3]:

K = integer[ L(M − 1) / (L − M − 2) ],  L ≥ M + 3    (17)

where K represents the threshold of the BSC, M represents the number of local maxima that have been found, and L represents the number of multi-start points; the search for local maxima terminates when M = K.

We must admit that not all of the local maxima have the same influence on the searching process of BO. Therefore, in order to make full use of the computing resources, we need to pick out the most influential ones from the obtained local maxima. According to the definition of the acquisition function in (9)-(11), the bigger the value of the acquisition function, the more promising the corresponding point [4, 26, 36, 50, 67]. Thus, we propose the lower bound confidence criterion (LBCC) to filter out the local maxima with relatively small acquisition function values. The LBCC is written as:

P_i = d_i / d_max ≥ p    (18)

where p is the lower confidence bound. Figure 5 gives an illustration of the LBCC, in which A, B, C, D represent different local maxima of the acquisition function and d_i, i = 1, 2, 3, 4 represents the acquisition function value of each local maximum. It is obvious that the most influential points (points A and B) are selected after filtering out the less promising points (points C and D) by the LBCC.

Fig. 5 An illustration to show the mechanism of the LBCC

After the implementation of the LBCC, a simple truncation mechanism is implemented to limit the maximum number of
batch points, thus the batch points will not exceed the number of available computing resources. Specifically, the promising points selected by the LBCC are rearranged according to their acquisition function values in descending order, and the first B points are selected as the final batch points.
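The selection rules of Steps 4-6 reduce to a few lines of code. The sketch below is an illustration, not the authors' code: it checks the Bayesian stopping criterion under the reconstructed form of (17), applies the LBCC filter of (18) (assuming positive acquisition values), and truncates to the first B points.

```python
import numpy as np

def bsc_satisfied(num_found, num_starts):
    # Bayesian stopping criterion, eq (17): stop adding multi-start points
    # once the estimated number of local maxima K equals the number found M.
    M, L = num_found, num_starts
    if L < M + 3:
        return False
    K = int(L * (M - 1) / (L - M - 2))
    return M >= K

def select_batch(candidates, acq_values, p=0.8, B=5):
    # LBCC, eq (18): keep maxima whose acquisition value is at least p * d_max,
    # then frontend truncation: sort descending and keep at most B points.
    acq_values = np.asarray(acq_values, dtype=float)
    keep = acq_values / acq_values.max() >= p
    order = np.argsort(-acq_values[keep])
    return np.asarray(candidates)[keep][order][:B]

# Usage: 4 candidate local maxima with their acquisition values.
cands = np.array([[0.1, 0.2], [0.7, 0.4], [0.3, 0.9], [0.5, 0.5]])
print(bsc_satisfied(num_found=4, num_starts=300))
print(select_batch(cands, [2.0, 1.9, 0.6, 1.1], p=0.8, B=5))
```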

3.2 The proposed algorithm

Until now, we have discussed all the technical details of the proposed ALS, which is summarized in Algorithm 1. Besides, we also list the default settings of ALS: for problems of dimensionality D, the number of initial multi-start points is set to N_initial = 400D (D ≤ 5) and N_initial = 300D (D > 5), and the default value of the lower bound p is set to 0.8.

Integrating the proposed batch selection strategy into the framework of Bayesian optimization, we obtain the proposed batch Bayesian optimization approach shown in Algorithm 2.
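A sketch of how Algorithm 2 could wire these pieces together is given below. It assumes scikit-learn's Gaussian process as the surrogate, the LCB of (11) as the acquisition function, and a caller-supplied batch_selector implementing the ALS of Section 3.1 (for instance built from the sketches above); it illustrates the loop structure only and is not the paper's reference implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bbo_als(objective, bounds, batch_selector, n_iter=20, n_init=2, k=2.0, seed=0):
    """Minimal BBO loop: fit GP -> select a batch via ALS -> evaluate -> augment."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))       # initial observations
    y = np.array([objective(x) for x in X])

    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X, y)

        def lcb(x):
            mu, std = gp.predict(np.atleast_2d(x), return_std=True)
            return mu[0] - k * std[0]                         # eq (11), minimized here

        # batch_selector returns an array of promising points; the batch size
        # is whatever the adaptive local search decides at this iteration.
        batch = np.atleast_2d(batch_selector(lcb, bounds))
        y_new = np.array([objective(x) for x in batch])       # evaluate in parallel
        X, y = np.vstack([X, batch]), np.concatenate([y, y_new])
    return X[np.argmin(y)], y.min()

# Usage with a toy objective and a placeholder selector (random points stand in
# for the ALS batch selection of Section 3.1).
if __name__ == "__main__":
    rand_selector = lambda acq, b: np.random.uniform(*np.array(b, dtype=float).T,
                                                     size=(3, len(b)))
    print(bbo_als(lambda x: float(np.sum(x ** 2)), [(-5, 5), (-5, 5)],
                  rand_selector, n_iter=5))
```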

4 Experiments

In order to validate the robustness and effectiveness of the proposed ALS, we employ various empirical benchmarks, including 8 synthetic functions and two hyperparameter tuning problems: a DNN on Fashion MNIST and a CNN on CIFAR 10. Six baselines are utilized to compare against the proposed ALS, including BUCB [18], LP [23], B3O [53], qEI [14], qKG [7, 76] and MACE [45]. The source code for BUCB and B3O can be found at https://github.com/ntienvu/ICDM2016_B3O, the source code for LP at https://github.com/SheffieldML/GPyOpt, the source code for qEI and qKG at https://github.com/pytorch/botorch, and the source code for MACE at https://github.com/Alaya-in-Matrix/MACE. We employ these source codes and make some modifications for a reasonable comparison. Moreover, all of the experiments in this paper are carried out on desktop computers with an Intel 8700K CPU with 16 GB RAM and an NVIDIA GeForce GTX 1080Ti GPU with 11 GB video memory.
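Since the gain of BBO comes from evaluating the selected batch simultaneously, a small sketch of how a batch of points might be dispatched to parallel workers is given below (Python multiprocessing is only one of several options; the worker count and the toy objective are assumptions, not the setup actually used in the experiments).

```python
import numpy as np
from multiprocessing import Pool

def expensive_objective(x):
    # Placeholder for a costly evaluation such as training one deep model.
    return float(np.sum(np.asarray(x) ** 2))

def evaluate_batch(points, n_workers=5):
    # One worker per batch point, up to the available parallel resources B.
    with Pool(processes=min(n_workers, len(points))) as pool:
        return pool.map(expensive_objective, points)

if __name__ == "__main__":
    batch = np.random.uniform(-1, 1, size=(4, 3))
    print(evaluate_batch(batch.tolist()))
```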

4.1 Synthetic functions

We first compare the performance of ALS on 8 synthetic functions: Dropwave, Branin, Hartmann-3, Alpine2-5, gSobol-5, Alpine2-6, Hartmann-6, and Alpine2-10. The details of these synthetic functions are listed in Table 1; we aim to find the global minimum of each function in this section.
Table 1 Details of the synthetic functions [7, 60]

Function | Searching bound | Dimension | Global minimum
Dropwave | [−5.12, 5.12]^2 | 2 | −1
Branin | [−5, 10] × [0, 15] | 2 | 0.397887
Hartmann-3 | [0, 1]^3 | 3 | −3.86276
Alpine2-5 | [1, 10]^5 | 5 | −2.808^5
gSobol-5 | [−4, 6]^5 | 5 | 0
Hartmann-6 | [0, 1]^6 | 6 | −3.32237
Alpine2-6 | [1, 10]^6 | 6 | −2.808^6
Alpine2-10 | [1, 10]^10 | 10 | −2.808^10
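For reference, the following sketch gives NumPy forms of two of the benchmarks in Table 1 in their standard textbook definitions (the constants shown are the usual ones and are assumed, not quoted from the paper): Branin, and Alpine2-d written as a minimization problem so that its optimum matches the −2.808^d entries above.

```python
import numpy as np

def branin(x):
    # Standard Branin function on [-5, 10] x [0, 15]; global minimum 0.397887.
    x1, x2 = x
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s

def alpine2(x):
    # Alpine N.2 on [1, 10]^d, negated so the global minimum is about -2.808^d.
    x = np.asarray(x, dtype=float)
    return -np.prod(np.sqrt(x) * np.sin(x))

print(branin([np.pi, 2.275]))   # approximately 0.397887
print(alpine2([7.917] * 5))     # approximately -(2.808 ** 5)
```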

Before we implement the experiments, we would like to explain the parameter settings for batch Bayesian optimization. The parameters within batch Bayesian optimization can roughly be divided into two parts: (1) the basic parameters, including the number of initial observation samples N_ini, the batch size B, the iteration number n, the types of mean function and covariance function in the Gaussian process, the type of acquisition function, etc.; (2) the special parameters of the different approaches, e.g., the type of local penalizer and its parameters in LP [23]. The basic parameters are usually kept the same for the different approaches to make the experiment reasonable [7, 14, 18, 23, 45, 53, 76]. The settings of the special parameters within each baseline algorithm can be adjusted in different scenarios.

The experiment settings are as follows: (1) The number of initial observation samples N_ini and the batch size B are set to 2 and 5, respectively. (2) The iteration number n is set to 10D for D-dimensional problems. (3) The commonly utilized zero mean function and the squared exponential kernel are selected as the prior mean and the covariance function, respectively, and the LCB in (11) is selected as the acquisition function. (4) The lower confidence bound p of ALS is set to 0.4 and 0.8 to analyze its influence comprehensively, and the special parameters of the other algorithms are set according to MACE [45]. (5) Each experiment is repeated 5 times to obtain the mean and variance of the optimization results.

For the experiments on each synthetic function, the initial observation samples are generated randomly and kept unchanged. The average convergence plots for all the synthetic functions are shown in Fig. 6, and the statistics of the optimization results are listed in Table 2. It is clear from Fig. 6 and Table 2 that, in most cases, the convergence rate of the proposed ALS is faster than or competitive with the other baselines. For both settings of the lower confidence bound p, the proposed ALS achieves better results than the other 5 baselines (qEI, qKG, BUCB, LP and B3O) in 6 out of the 8 cases, namely Dropwave, Branin, Alpine2-5, Hartmann-6, Alpine2-6 and Alpine2-10, plotted in Fig. 6 (a), (b), (d), (f), (g) and (h). Compared with the excellent method MACE proposed by Lyu et al. [45], the proposed ALS achieves better results in 5 out of the 8 cases, namely Branin, Alpine2-5, Hartmann-6, Alpine2-6 and Alpine2-10, plotted in Fig. 6 (b), (d), (f), (g) and (h). Furthermore, the performance of the proposed ALS under the different settings of p is similar.

As we can see from Table 3, the average number of evaluations per iteration for ALS usually differs across the synthetic functions, which means the local maxima found by ALS change adaptively according to the characteristics of the acquisition function. The batch size B at each iteration is limited to a certain number, so that the batch size will not exceed this limit; therefore, the computing resources can be saved to a certain extent. Besides, as the lower confidence bound p increases, the average number of evaluations per iteration for ALS usually decreases, which exactly matches the intent of the lower bound confidence criterion.

As mentioned in Section 1, although the proposed ALS also selects the acquisition function's local maxima as the batch points, just like LP and B3O, ALS achieves this goal in a novel and robust way. We can see from the optimization results in Fig. 6 and Table 2 that the proposed ALS usually performs better than B3O and LP. This stems from the adaptive mechanism in Algorithm 1, based on which the most promising points are selected and noisy points are rejected.

Due to the automatic clustering operation needed in searching for the local maxima of the acquisition function, it usually takes several seconds to implement each iteration of the proposed ALS. Although the implementation time per iteration of the proposed ALS is a bit longer than that of the other baselines, it is negligible for practical black-box optimization problems, e.g., the training process of deep learning models usually needs several hours. Besides, for the purpose of a reasonable comparison, input normalization (i.e., the scaling operation for the input), gradient-enhanced strategies and other supplementary operations for BO are not employed, since the same operations can easily be integrated into all the algorithms.

Overall, the proposed ALS performs better than the other 6 baselines. Since most of the promising local maxima of the acquisition function can be obtained through the adaptive searching mechanism, high accuracy and efficiency can be achieved by ALS.
Fig. 6 The average optimization convergence plots for 8 synthetic functions

Table 2 Statistics of the optimization results for the synthetic functions

Function | qEI | qKG | BUCB | LP | B3O | MACE | ALS (p=0.4) | ALS (p=0.8)
Dropwave | −0.7194 ± 0.2307 | −0.6659 ± 0.190 | −0.44316 ± 0.1904 | −0.5492 ± 0.2372 | −0.7498 ± 0.2851 | −0.8676 ± 0.5239 | −0.7684 ± 0.1723 | −0.8369 ± 0.08274
Branin | 0.4363 ± 0.04407 | 0.4421 ± 0.04401 | 0.9549 ± 0.5275 | 0.4694 ± 0.04648 | 1.039 ± 0.4702 | 0.4322 ± 0.03372 | 0.4276 ± 0.0216 | 0.4103 ± 0.01003
Hartmann-3 | −3.8625 ± 0.02216 | −3.8461 ± 0.005675 | −3.6533 ± 0.2231 | −3.8600 ± 0.004489 | −3.3192 ± 0.4118 | −3.8001 ± 0.04948 | −3.6543 ± 0.2608 | −3.7100 ± 0.1088
Alpine2-5 | −54.2723 ± 23.8230 | −48.7770 ± 7.9261 | −46.0989 ± 10.3822 | −53.8201 ± 15.2424 | −62.1144 ± 6.0612 | −49.6854 ± 8.4752 | −66.7444 ± 6.0293 | −62.9714 ± 7.8148
gSobol-5 | 22.3779 ± 12.3951 | 24.5015 ± 14.6398 | 7.5186 ± 6.1823 | 16.6420 ± 28.1520 | 2.9332 ± 1.5324 | 10.4509 ± 6.0501 | 14.6700 ± 6.5825 | 20.943 ± 8.7314
Alpine2-6 | −93.2583 ± 46.2782 | −96.4559 ± 40.1921 | −113.4560 ± 38.6059 | −109.2378 ± 23.1975 | −100.2166 ± 32.7962 | −111.9516 ± 20.9302 | −252.6666 ± 18.7390 | −275.3333 ± 19.2271
Hartmann-6 | −3.0676 ± 0.1565 | −3.2093 ± 0.01063 | −3.02492 ± 0.01111 | −3.1020 ± 0.1943 | −2.5258 ± 0.09757 | −2.8394 ± 0.1487 | −3.3200 ± 0.005236 | −3.3200 ± 0.006025
Alpine2-10 | −859.5780 ± 282.976 | −761.583 ± 260.915 | −899.9728 ± 301.2332 | −1441.8000 ± 609.1320 | −976.834 ± 273.057 | −1085.9006 ± 372.9099 | −7103.1250 ± 1031.091 | −5292.8571 ± 1054.1997
Table 3 Average number of evaluations for the synthetic functions at each iteration

Method qEI qKG BUCB LP B3O MACE ALS (p=0.4) ALS (p=0.8)
Dropwave 5 5 5 5 4.75 5 4.35 4.20
Branin 5 5 5 5 4.60 5 2.85 2.70
Hartmann-3 5 5 5 5 4.20 5 2.70 2.40
Alpine2-5 5 5 5 5 4.86 5 4.95 4.92
gSobol-5 5 5 5 5 4.82 5 4.80 4.72
Alpine2-6 5 5 5 5 4.75 5 4.65 4.55
Hartmann-6 5 5 5 5 4.45 5 4.25 4.10
Alpine2-10 5 5 5 5 4.56 5 4.66 4.61

Table 4 Range over hyperparameters for DNN on Fashion MNIST

Hyperparameter Range Hyperparameter Range
Learning rate of SGD [e−4 , 1] Batch size [100, 1000]
Dropout rate 1 [0, 1] Learning epoch [10, 100]
Dropout rate 2 [0, 1] Dense number [100, 1000]
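To make the search space of Table 4 concrete, a hedged sketch of the tuning objective is shown below. It assumes TensorFlow/Keras, a plain 2-hidden-layer architecture with shared width, and that the quantity returned is the test classification error; the paper's exact training script is not available, so names such as build_and_evaluate and the choice to search the learning rate on a log scale are illustrative only.

```python
import numpy as np
import tensorflow as tf

# Ranges from Table 4; the learning rate is searched as its natural logarithm.
SPACE = {
    "log_lr":   (np.log(1e-4), 0.0),
    "dropout1": (0.0, 1.0),
    "dropout2": (0.0, 1.0),
    "batch":    (100, 1000),
    "epochs":   (10, 100),
    "dense":    (100, 1000),
}

def build_and_evaluate(h):
    (xtr, ytr), (xte, yte) = tf.keras.datasets.fashion_mnist.load_data()
    xtr, xte = xtr / 255.0, xte / 255.0
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(int(h["dense"]), activation="relu"),
        tf.keras.layers.Dropout(h["dropout1"]),
        tf.keras.layers.Dense(int(h["dense"]), activation="relu"),
        tf.keras.layers.Dropout(h["dropout2"]),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=float(np.exp(h["log_lr"]))),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(xtr, ytr, batch_size=int(h["batch"]), epochs=int(h["epochs"]), verbose=0)
    _, acc = model.evaluate(xte, yte, verbose=0)
    return 1.0 - acc   # classification error, the quantity minimized by the BBO loop
```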

Table 5 Range over hyperparameters for CNN on CIFAR 10

Hyperparameter Range Hyperparameter Range
Learning rate of SGD [e−4 , 1] Batch size [100, 1000]
Dropout rate 1 [e−4 , 1) Learning epoch [100, 1000]
Dropout rate 2 [e−4 , 1) Dense number [100, 1000]
Dropout rate 3 [e−4 , 1)

Table 6 Statistics of the optimization results of tuning the hyperparameters for DNN on Fashion MNIST and CNN on CIFAR 10

Method Classification error for DNN Classification error for CNN
qEI 0.1070 ± 0.005485 0.2113 ± 0.01803
qKG 0.1039 ± 0.003558 0.2080 ± 0.01084
BUCB 0.1158 ± 0.008340 0.2287 ± 0.02981
LP 0.1004 ± 0.001293 0.1926 ± 0.006888
B3O 0.1655 ± 0.04354 0.2734 ± 0.02112
MACE 0.1016 ± 0.004330 0.1893 ± 0.007628
ALS 0.09750 ± 0.001130 0.1722 ± 0.006524

Table 7 The optimal hyperparameters for DNN on Fashion MNIST

Hyperparameter | qEI | qKG | BUCB | LP | B3O | MACE | ALS
Learning epoch | 60 | 80 | 55 | 69 | 94 | 63 | 92
Batch size | 417 | 472 | 459 | 483 | 355 | 403 | 321
Dense number | 888 | 784 | 819 | 853 | 957 | 983 | 776
Natural logarithm of learning rate of SGD | −1.58472888 | −1.99688262 | −2.9474456 | −1.7713372 | −1.40097607 | −1.76447483 | −1.69029053
Dropout rate 1 | 0.5073586 | 0 | 0.3749547 | 0.9809 | 0.74160787 | 1 | 0.09637276
Dropout rate 2 | 0.38906713 | 0 | 0 | 0 | 0.74160787 | 1 | 0.51946041
Classification error | 0.1012 | 0.0998 | 0.1118 | 0.0991 | 0.1116 | 0.0985 | 0.0958
Table 8 The optimal hyperparameters for CNN on CIFAR 10

Hyperparameter | qEI | qKG | BUCB | LP | B3O | MACE | ALS
Learning epoch | 825 | 379 | 752 | 1000 | 534 | 224 | 348
Batch size | 469 | 845 | 721 | 597 | 814 | 100 | 162
Dense number | 793 | 131 | 1000 | 783 | 752 | 628 | 317
Natural logarithm of learning rate of SGD | −2.54794002 | −2.62169715 | −4.72341912 | −4.31394513 | −5.01999409 | −2.21050382 | −2.55698759
Dropout rate 1 | 0.582487365 | 0.35576749 | 0.34313724 | 0 | 0.09108456 | 0.3099363 | 0
Dropout rate 2 | 0.727760425 | 0.52302066 | 0.42911775 | 0.21370678 | 0.45159041 | 0.50490513 | 0.76179779
Dropout rate 3 | 0.271587631 | 0.22909382 | 0.2548414 | 0.74730406 | 0 | 0.54325948 | 0.10031283
Classification error | 0.1890 | 0.187 | 0.198 | 0.184 | 0.25 | 0.1803 | 0.1645

4.2 Hyperparameter tuning problems for DNN and CNN

The goal of hyperparameter tuning is to find the best hyperparameters that minimize the classification or regression error of deep learning models (DLMs) on certain test datasets. Although the performance of DLMs can also be improved by robust model fitting [37, 38], that is beyond the scope of this paper, and we only focus on the hyperparameter tuning problems in this section. Hyperparameter tuning is of great importance for obtaining good DLMs with low classification error [40, 66]. We apply the proposed method ALS and the 6 other baselines to 2 benchmark hyperparameter tuning problems to compare their performance. The experiment settings are as follows: the number of initial observation samples N_ini is set to 2, the number of iterations is set to 15, each experiment is repeated 5 times to obtain the statistics of the results, the lower confidence bound p for the proposed ALS is set to 0.8, and the other settings are kept the same as in Section 4.1.

DNN on Fashion MNIST. Deep neural networks (DNN) are commonly used deep learning models [27, 39, 41]. In this section, we first tune 6 hyperparameters of a 2-hidden-layer DNN on Fashion-MNIST [77], a recently proposed dataset that has 60000 images for training and 10000 images for testing. Each of those images is a 28x28 grayscale image with one label from 10 classes. The best performance of humans on this dataset is 0.835 [66], and the best performance of a DNN is 0.8833 [77]. The hyperparameters to be optimized in this experiment are listed in Table 4. In this experiment, we use the same number of neurons in the two hidden layers.

CNN on CIFAR 10. Convolutional neural networks (CNN) have played an important role in various areas [1, 2, 20, 27, 34, 35, 39]. In this experiment, the CNN is trained on the CIFAR-10 benchmark dataset, which comprises 50000 training images and 10000 testing images. We optimize 7 hyperparameters of a three-layer convolutional network; the architecture of the CNN is chosen to be the same as the example code in Keras (https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py). The ranges of the 7 hyperparameters are listed in Table 5, including 3 discrete hyperparameters (batch size, learning epochs, and dense number for the fully connected layer) and 4 continuous hyperparameters (learning rate of SGD and 3 dropout rates).

The statistics of the hyperparameter tuning results are listed in Table 6. As shown in Table 6, both the mean and the variance values of ALS are better than those of the other 6 baselines. The optimal hyperparameters found for each deep learning model by each BBO approach are provided in Table 7 and Table 8. Since SGD is employed to train the DLMs, there may be a slight oscillation in the best results. It is clear from Table 7 and Table 8 that the proposed ALS achieves the best results, e.g., the optimal hyperparameters searched by ALS achieve the smallest classification error: 0.0958 for the DNN on Fashion MNIST and 0.164 for the CNN on CIFAR 10. Therefore, the same conclusion can be obtained: ALS is a robust batch Bayesian optimization approach.

5 Discussions

Although it is difficult to give a theoretical proof mathematically, we would like to analyze in detail the superiority of the proposed method compared with the related works B3O [53] and LP [23]. As explained in Section 1, although ALS also aims at selecting the local maxima of the acquisition function as batch points, it has the following merits: (1) Compared with B3O, the proposed ALS can locate the local maxima of the acquisition function more accurately. B3O assumes the acquisition function is
a kind of PDF and utilizes the IGMM to fit it. After that, the mean values of the Gaussian components within the IGMM are selected as the locations of the local maxima. However, the number of Gaussian components as well as the locations of their mean values cannot always match those of the local maxima (i.e., the peaks of the target PDF) [61], except when the acquisition function is itself the PDF of a distribution composed of weighted sums of several independent Gaussian distributions. This is obviously not always true, since the acquisition function is not a PDF at all (the integral of a PDF equals 1 over the domain of definition). On the contrary, by taking advantage of the local convergence property of gradient-based optimization, the proposed ALS can locate the local maxima more accurately. (2) Compared with LP, the proposed ALS can locate the local maxima more adaptively. LP needs a predefined batch size, and penalizes the local maxima of the acquisition function sequentially with a local penalizer. However, the number of local maxima at each iteration is unknown in advance, and it may also change with the iteration. Thus, noisy points will be introduced if the number of local maxima is not consistent with the predefined batch size, which clearly weakens the flexibility of LP. On the contrary, by taking advantage of the automatic clustering method and the Bayesian stopping criterion, the proposed ALS can adaptively locate the local maxima. (3) The working mechanism of ALS, described in Section 3.1, contributes to the above merits. The experimental results in Section 4 validate the effectiveness of the proposed ALS.

6 Conclusions

An adaptive local search strategy is developed in this paper for batch Bayesian optimization. By combining the multi-start assisted gradient-based optimization method and the automatic clustering method, the proposed ALS can adaptively determine the batch points as well as the batch size according to the characteristics of the acquisition function at each iteration. ALS leverages the merits of both exact local search and an adaptive selection mechanism, making it a novel and robust BBO approach. Empirical evaluations on both extensive synthetic functions and hyperparameter tuning problems for deep learning models validate the effectiveness of the proposed method. Future work can be extended to the combination of Bayesian optimization and the interpretability of neural networks.

Acknowledgments The authors sincerely thank the three reviewers and the associate editors for their enthusiasm and thoughtful feedback, which helped a lot to improve this paper.

Compliance with Ethical Standards

Conflict of interests There are neither financial interests nor relationships that will influence the content reported in this paper.

References

1. Abdel-Hamid O, Mohamed Ar, Jiang H, Deng L, Penn G, Yu D (2014) Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(10):1533–1545
2. Acharya UR, Fujita H, Oh SL, Hagiwara Y, Tan JH, Adam M (2017) Application of deep convolutional neural network for automated detection of myocardial infarction using ecg signals. Inf Sci 415:190–198
3. Arora JS, Elwakeil OA, Chahande AI, Hsieh CC (1995) Global optimization methods for engineering applications: a review. Struct Optim 9:137–159
4. Auer P (2002) Using confidence bounds for exploitation-exploration trade-offs. J Mach Learn Res 3:397–422
5. Azimi J, Jalali A, Fern X (2011) Dynamic batch bayesian optimization. arXiv:1110.3347
6. Azimi J, Jalali A, Fern X (2012) Hybrid batch bayesian optimization. arXiv:1202.5597
7. Balandat M, Karrer B, Jiang DR, Daulton S, Letham B, Wilson AG, Bakshy E (2019) Botorch: programmable bayesian optimization in pytorch. arXiv:1910.06403
8. Bergstra JS, Bardenet R, Bengio Y, Kégl B (2011) Algorithms for hyper-parameter optimization. In: Advances in neural information processing systems 24, pp 2546–2554
9. Binder P, Muma M, Zoubir AM (2018) Gravitational clustering: a simple, robust and adaptive approach for distributed networks. Signal Process 149:36–48
10. Bishop CM et al (2006) Pattern recognition and machine learning
11. Brochu E, Cora VM, De Freitas N (2010) A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv:1012.2599
12. Bui DM, Nguyen HQ, Yoon Y, Jun S, Amin MB, Lee S (2015) Gaussian process for predicting cpu utilization and its application to energy efficiency. Appl Intell 43(4):874–891
13. Buja A, Tibshirani R, Hastie T, Simard P, Sackinger E, Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York; Friedman J (1994) Flexible metric nearest neighbour classification, technical report, Stanford University
14. Chevalier C, Ginsbourger D (2013) Fast computation of the multi-points expected improvement with applications in batch selection. In: Learning and intelligent optimization, pp 59–69
15. Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619
16. Contal E, Buffoni D, Robicquet A, Vayatis N (2013) Parallel gaussian process optimization with upper confidence bound and pure exploration. In: Machine learning and knowledge discovery in databases, pp 225–240
17. Daxberger EA, Low BKH (2017) Distributed batch gaussian process optimization. In: Proceedings of the 34th international conference on machine learning - volume 70, ICML'17, pp 951–960
feedbacks, which helps a lot to improve this paper. rence on machine learning - volume 70, ICML’17, pp 951–960
18. Desautels T, Krause A, Burdick JW (2014) Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. The Journal of Machine Learning Research 15(1):3873–3923
19. Feng Y, Hamerly G (2007) Pg-means: learning the number of clusters in data. In: Advances in neural information processing systems, pp 393–400
20. Fujita H, Cimr D (2019) Decision support system for arrhythmia prediction using convolutional neural network structure without preprocessing. Appl Intell 49(9):3383–3391
21. Ginsbourger D, Le Riche R, Carraro L (2008) A multi-points criterion for deterministic parallel global optimization based on gaussian processes. Tech rep
22. González J, Longworth J, James DC, Lawrence ND (2015) arXiv:1505.01627
23. González J, Dai Z, Hennig P, Lawrence ND (2016) Batch Bayesian optimization via local penalization. In: Proceedings of the nineteenth international workshop on artificial intelligence and statistics, vol 51, pp 648–657
24. González J, Dai Z, Damianou AC, Lawrence ND (2017) Preferential Bayesian optimization. In: Proceedings of the 34th international conference on machine learning, vol 70, pp 1282–1291
25. György A, Kocsis L (2011) Efficient multi-start strategies for local search algorithms. J Artif Intell Res 41:407–444
26. Hernández-Lobato JM, Hoffman MW, Ghahramani Z (2014) Predictive entropy search for efficient global optimization of black-box functions. In: Advances in neural information processing systems 27, pp 918–926
27. Hinton G, Deng L, Yu D, Dahl GE, Mohamed Ar, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN, et al (2012) Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6):82–97
28. Hoffman M, Shahriari B, Freitas N (2014) On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In: Artificial intelligence and statistics, pp 365–374
29. Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive black-box functions. J Glob Optim 13(4):455–492
30. Kandasamy K, Schneider J, Póczos B (2015) High dimensional bayesian optimisation and bandits via additive models. In: International conference on machine learning, pp 295–304
31. Kass RE, Wasserman L (1995) A reference bayesian test for nested hypotheses and its relationship to the schwarz criterion. Journal of the American Statistical Association 90(431):928–934
32. Kathuria T, Deshpande A, Kohli P (2016) Batched gaussian process bandit optimization via determinantal point processes. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in neural information processing systems 29, pp 4206–4214
33. Kelley CT (1987) Iterative methods for optimization
34. Kim Y (2014) Convolutional neural networks for sentence classification. arXiv:1408.5882
35. Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
36. Kushner HJ (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Fluids Engineering 86(1):97–106
37. Lai T, Chen R, Yang C, Li Q, Fujita H, Sadri A, Wang H (2020) Efficient robust model fitting for multistructure data using global greedy search. IEEE Trans Cybern 50(7):3294–3306
38. Lai T, Fujita H, Yang C, Li Q, Chen R (2019) Robust model fitting based on greedy search and specified inlier threshold. IEEE Trans Ind Electron 66(10):7956–7966
39. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
40. Li L, Jamieson K, DeSalvo G, Rostamizadeh A, Talwalkar A (2016) Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv:1603.06560
41. Li X, Lai T, Wang S, Chen Q, Yang C, Chen R, Lin J, Zheng F (2019) Weighted feature pyramid networks for object detection. In: 2019 IEEE Intl conf on parallel distributed processing with applications, big data cloud computing, sustainable computing communications, social computing networking (ISPA/BDCloud/SocialCom/SustainCom), pp 1500–1504
42. Liu DC, Nocedal J (1989) On the limited memory bfgs method for large scale optimization. Math Program 45(1):503–528
43. Lizotte DJ, Wang T, Bowling MH, Schuurmans D (2007) Automatic gait optimization with gaussian process regression. In: IJCAI, vol 7, pp 944–949
44. Lyons R (2003) Determinantal probability measures. Publications Mathématiques de l'IHÉS 98:167–212
45. Lyu W, Yang F, Yan C, Zhou D, Zeng X (2018) Batch bayesian optimization via multi-objective acquisition ensemble for automated analog circuit design. In: International conference on machine learning, pp 3306–3314
46. Marchant R, Ramos F (2012) Bayesian optimisation for intelligent environmental monitoring. In: 2012 IEEE/RSJ international conference on intelligent robots and systems, pp 2242–2249
47. Martí R, Aceves R, León MT, Moreno-Vega JM, Duarte A (2019) Intelligent multi-start methods, pp 221–243
48. Martinez-Cantin R, de Freitas N, Brochu E, Castellanos J, Doucet A (2009) A bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Auton Robot 27(2):93–103
49. McKay MD, Beckman RJ, Conover WJ (1979) Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21(2):239–245
50. Močkus J (1975) On bayesian methods for seeking the extremum. In: Optimization techniques IFIP technical conference, pp 400–404. Springer
51. Morris MD, Mitchell TJ (1995) Exploratory designs for computational experiments. Journal of Statistical Planning and Inference 43(3):381–402
52. Nguyen V, Gupta S, Rana S, Li C, Venkatesh S (2018) Practical batch bayesian optimization for less expensive functions. arXiv:1811.01466
53. Nguyen V, Rana S, Gupta S, Li C, Venkatesh S (2017) Budgeted batch bayesian optimization with unknown batch sizes. arXiv:1703.04842
54. Nguyen V, Gupta S, Rana S, Li C, Venkatesh S (2019) Filtering bayesian optimization approach in weakly specified search space. Knowledge and Information Systems 60(1):385–413
55. Ning B, Han QL, Zuo Z (2019) Distributed optimization for multiagent systems: An edge-based fixed-time consensus
approach. IEEE Transactions on Systems, Man, and Cybernetics 49(1):122–132
56. Park JS (1994) Optimal latin-hypercube designs for computer experiments. Journal of Statistical Planning and Inference 39(1):95–111
57. Pelleg D, Moore AW et al (2000) X-means: Extending k-means with efficient estimation of the number of clusters. In: ICML, vol 1, pp 727–734
58. Peng X, Feng J, Xiao S, Yau W, Zhou JT, Yang S (2018) Structured autoencoders for subspace clustering. IEEE Trans Image Process 27(10):5076–5086
59. Peng X, Zhu H, Feng J, Shen C, Zhang H, Zhou JT (2019) Deep clustering with sample-assignment invariance prior. IEEE Transactions on Neural Networks and Learning Systems, 1–12
60. Picheny V, Wagner T, Ginsbourger D (2013) A benchmark of kriging-based infill criteria for noisy optimization. Struct Multidiscip Optim 48(3):607–626
61. Rasmussen CE (2000) The infinite gaussian mixture model. In: Advances in neural information processing systems, pp 554–560
62. Rinnooy Kan AHG, Timmer GT (1987) Stochastic global optimization methods part i: Clustering methods. Math Program 39(1):27–56
63. Shah A, Ghahramani Z (2015) Parallel predictive entropy search for batch global optimization of expensive objective functions. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds) Advances in neural information processing systems 28, pp 3330–3338
64. Shahriari B, Swersky K, Wang Z, Adams RP, De Freitas N (2015) Taking the human out of the loop: a review of bayesian optimization. Proc IEEE 104(1):148–175
65. Shirai T, Takahashi Y (2003) Random point fields associated with certain fredholm determinants i: fermion, poisson and boson point processes. J Funct Anal 205(2):414–463
66. Snoek J, Larochelle H, Adams RP (2012) Practical bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems 25, pp 2951–2959
67. Srinivas N, Krause A, Kakade SM, Seeger M (2009) Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv:0912.3995
68. Teklehaymanot FK, Muma M, Liu J, Zoubir AM (2016) In-network adaptive cluster enumeration for distributed classification and labeling. In: 2016 24th European signal processing conference (EUSIPCO), pp 448–452
69. Vu D, Georgievska S, Szoke S, Kuzniar A, Robert V (2017) fMLC: fast multi-level clustering and visualization of large molecular datasets. Bioinformatics 34(9):1577–1579
70. Wang L, Xi J, He M, Liu G (2020) Robust time-varying formation design for multiagent systems with disturbances: Extended-state-observer method. International Journal of Robust and Nonlinear Control 30(7):2796–2808
71. Wang Z, Jegelka S, Kaelbling LP, Lozano-Pérez T (2017) Focused model-learning and planning for non-gaussian continuous state-action systems. In: 2017 IEEE International conference on robotics and automation (ICRA), pp 3754–3761
72. Wang Z, Li C, Jegelka S, Kohli P (2017) Batched high-dimensional bayesian optimization via structural kernel learning. In: Proceedings of the 34th international conference on machine learning - volume 70, ICML'17, pp 3656–3664
73. Wang Z, Shakibi B, Jin L, de Freitas N (2014) Bayesian multi-scale optimistic optimization
74. Williams CK, Rasmussen CE (2005) Gaussian processes for machine learning
75. Wilson J, Hutter F, Deisenroth M (2018) Maximizing acquisition functions for bayesian optimization. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems 31, pp 9884–9895
76. Wu J, Frazier P (2016) The parallel knowledge gradient method for batch bayesian optimization. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in neural information processing systems 29, pp 3126–3134
77. Xiao H, Rasul K, Vollgraf R (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747
78. Van Stein B, Wang H, Kowalczyk W, Emmerich M, Bäck T (2019) Cluster-based kriging approximation algorithms for complexity reduction. Appl Intell 50(3):1–14

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
