RBFOpt
A. Costa
Singapore University of Technology and Design
E-mail: [email protected]
G. Nannicini
Singapore University of Technology and Design
E-mail: [email protected]
2 Alberto Costa, Giacomo Nannicini
1 Introduction
or surrogate model. Examples of this approach are the Radial Basis Function
(RBF) method of [15] (see also [27]), the stochastic RBF method [29], and
the kriging-based Efficient Global Optimization method (EGO) of [23]. The
surrogate model constructed by these methods is a global model that uses
all available information on f , as opposed to methods that only build a local
model such as trust-region based methods [6, 7]. Other methods for black-box
optimization rely on direct search, i.e., they do not build a surrogate model of
the objective function. An overview of direct search methods can be found in
[24], and a comprehensive treatment is given in [7]. We refer the reader to [31]
and the references therein for a very recent survey on black-box methods and
an extensive computational evaluation.
In this paper we focus on problems that are nonconvex, relatively small-
dimensional, and for which only a small number of function evaluations is
allowed. For this type of problem, algorithms based on a surrogate model
are typically considered among the most effective. In particular, empirical ev-
idence [20] suggests that the RBF method is more effective on engineering
problems, despite the appealing theoretical properties of other methodologies
such as EGO. Besides this empirical evidence, there are three additional rea-
sons that make a surrogate model method more appealing than direct search
in our context. The first reason is that looking for alternative optima, or at
least a set of good solutions, is easier if we can rely on a surrogate model. The
second reason is that it is intuitively easier to “warm-start” such a method
in a context where each function evaluation (i.e. simulation) produces some
data that allows for fast recomputation of different but related objective func-
tions, because the model of these new objective functions can be quickly built.
The third reason is that the surrogate model can sometimes be used to allow
the fast (potentially inaccurate) exploration of the objective function around
the optimum: we study this possibility in Section 5.7. These properties are
important for our motivating application from a practical perspective, which
explains our choice.
In this paper, we review the RBF method and present some extensions
aimed at improving its practical performance. Our most significant contribu-
tions to the class of RBF methods are a fast procedure for automatic model
selection, and an approach to accelerate convergence in case we have access to
an additional oracle that returns noisy function values (i.e. affected by error)
but is less expensive to evaluate than the exact oracle for f . These contribu-
tions could be adapted to other surrogate model based methods and should
therefore be considered of general interest, rather than specific for the RBF
method. Our implementation of the method is an open-source library called
RBFOpt, and we show that it is competitive with state-of-the-art commercial
software on a set of test problems taken from the literature. In particular, the
two main contributions of this paper significantly decrease the number of
function evaluations for global convergence on our test set. Furthermore, we
show that on our test set, the proposed methodology for automatic model
selection yields a measure of model quality that is helpful in deciding whether
or not the surrogate model is accurate around the optimum, in order to perform
sensitivity analysis without requiring additional oracle evaluations.

$\phi(r)$                                                  $d_{\min}$
$r$ (linear)                                                0
$r^3$ (cubic)                                               1
$r^2 \log r$ (thin plate spline)                            1
$\sqrt{r^2 + \gamma^2}$ (multiquadric)                      0
$\frac{1}{\sqrt{r^2 + \gamma^2}}$ (inverse multiquadric)   -1
$e^{-\gamma r^2}$ (Gaussian)                               -1

The rest of the paper is organized as follows. In Section 2 we review the
RBF method. Sections 3 and 4 present some extensions to improve the effi-
cacy of the method. Section 5 describes our implementation and reports the
results of a computational evaluation on a set of test problems taken from the
literature. Section 6 concludes the paper.
be used to change the shape of these functions (but it is usually set to 1).
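If $\phi(r)$ is cubic or thin plate spline, $d_{\min} = 1$ and we obtain an
interpolant of the form:
\[
s_k(x) := \sum_{i=1}^{k} \lambda_i \phi(\|x - x_i\|) + h^T \begin{pmatrix} x \\ 1 \end{pmatrix}, \tag{3}
\]
where $h \in \mathbb{R}^{n+1}$. The values of $\lambda_i$, $h$ can be determined
by solving the following linear system:
\[
\begin{pmatrix} \Phi & P \\ P^T & 0_{(n+1) \times (n+1)} \end{pmatrix}
\begin{pmatrix} \lambda \\ h \end{pmatrix}
=
\begin{pmatrix} F \\ 0_{n+1} \end{pmatrix}, \tag{4}
\]
with:
\[
\Phi = \left( \phi(\|x_i - x_j\|) \right)_{i,j=1,\dots,k}, \quad
P = \begin{pmatrix} x_1^T & 1 \\ \vdots & \vdots \\ x_k^T & 1 \end{pmatrix}, \quad
\lambda = \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_k \end{pmatrix}, \quad
F = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_k) \end{pmatrix}.
\]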
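As an illustration of (3) and (4), the following is a minimal sketch in Python with NumPy for the cubic basis function. The function names and the sample data are hypothetical and chosen only for this example; they are not taken from the RBFOpt library.

```python
import numpy as np


def fit_cubic_rbf(X, f):
    """Solve the linear system (4) for the cubic RBF, phi(r) = r^3."""
    k, n = X.shape
    # Phi[i, j] = phi(||x_i - x_j||)
    Phi = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2) ** 3
    # Row i of P is (x_i^T, 1)
    P = np.hstack([X, np.ones((k, 1))])
    A = np.block([[Phi, P], [P.T, np.zeros((n + 1, n + 1))]])
    rhs = np.concatenate([f, np.zeros(n + 1)])
    sol = np.linalg.solve(A, rhs)
    return sol[:k], sol[k:]  # lambda, h


def eval_cubic_rbf(X, lam, h, x):
    """Evaluate the interpolant s_k(x) of equation (3)."""
    r = np.linalg.norm(X - x, axis=1)
    return lam @ r ** 3 + h @ np.append(x, 1.0)


# Illustrative data (not from the paper): five points in the unit square.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
f = (X[:, 0] - 0.3) ** 2 + X[:, 1]
lam, h = fit_cubic_rbf(X, f)
values = np.array([eval_cubic_rbf(X, lam, h, x) for x in X])
```

In this sketch the interpolant reproduces the sampled values at the interpolation points, and the coefficients satisfy $P^T \lambda = 0$, as required by the second block row of (4).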
$= (-1)^{d_{\min}+1} \lambda^T \Phi \lambda.$ (5)
Let us assume that after k function values f (x1 ), . . . , f (xk ) are evaluated, we
want to find a point in ΩI where it is likely that the unknown function attains
a target value fk∗ ∈ R (strategies for selecting fk∗ will be discussed in the
following). Let sy be the RBF interpolant subject to the conditions:
The assumption of the RBF method is that a likely location for the point $y$
with function value $f_k^*$ is the one that minimizes $\sigma(s_y)$. That is, we
look for the interpolant that is “the least bumpy”. A sketch of this idea is
given in Figure 1.

Fig. 1 An example with four interpolation points (circles) and a value of $f_k^*$ represented by
the horizontal dashed line. The RBF method assumes that it is more likely that a point with
value $f_k^*$ is located at the diamond rather than the square, because the resulting interpolant
is less bumpy.

Instead of computing the minimum of σ(sy ) to find the least bumpy inter-
polant, we define an equivalent optimization problem that is easier to solve.
Let $\ell_k$ be the RBF interpolant to the points $(x_i, 0)$, $\forall i \in \{1, \dots, k\}$ and $(y, 1)$.
A solution to (6)-(7) can be rewritten as:
where $u(y) = (\phi(\|y - x_1\|), \dots, \phi(\|y - x_k\|))^T$ and $\pi(y)$ is
$(y^T \; 1)^T$ when $d_{\min} = 1$, $1$ when $d_{\min} = 0$, and is not used when
$d_{\min} = -1$. With algebraic manipulations (see [15, 27]) we can obtain from
the system (8) the following expression for $\mu_k(y)$:
\[
\mu_k(y) = \frac{1}{\phi(0) - \begin{pmatrix} u(y) \\ \pi(y) \end{pmatrix}^T
\begin{pmatrix} \Phi & P \\ P^T & 0 \end{pmatrix}^{-1}
\begin{pmatrix} u(y) \\ \pi(y) \end{pmatrix}}. \tag{9}
\]
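Expression (9) can be evaluated directly once the matrix of system (4) is available. The sketch below does so for the cubic basis function, for which $\phi(0) = 0$ and $\pi(y) = (y^T \; 1)^T$. The function name `mu_k` and the sample points are assumptions made for this example only.

```python
import numpy as np


def mu_k(X, y):
    """Evaluate expression (9) for the cubic RBF (phi(0) = 0, d_min = 1)."""
    k, n = X.shape
    Phi = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2) ** 3
    P = np.hstack([X, np.ones((k, 1))])
    A = np.block([[Phi, P], [P.T, np.zeros((n + 1, n + 1))]])
    # u(y) stacked on top of pi(y) = (y^T, 1)^T
    u = np.linalg.norm(X - y, axis=1) ** 3
    v = np.concatenate([u, y, [1.0]])
    return 1.0 / (0.0 - v @ np.linalg.solve(A, v))


# Illustrative data (not from the paper).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
y = np.array([0.25, 0.75])
mu = mu_k(X, y)
```

For the cubic basis function $(-1)^{d_{\min}+1} = 1$, so the value computed here is expected to be positive whenever $y$ does not coincide with an interpolation point.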
A natural choice for the initial sample points is to pick the $2^n$ corner points of
the box Ω, but this is reasonable only for small values of n. A commonly used
strategy [15, 19, 20] for selecting the initial sample points is to choose n + 1
corner points of the box Ω, and the central point of Ω, but this could prioritize
the exploration in a part of the domain. [20] chooses $x^L$ and $x^L + e_i^T (x_i^U - x_i^L)$
To tackle the problem of selecting the target value $f_k^*$ at each Iteration step,
we employ the technique proposed in [20], which generalizes [15], as described
below. Let $y^* := \arg\min_{x \in \Omega_I} s_k(x)$, $f_{\min} := \min_{i=1,\dots,k} f(x_i)$, and
$f_{\max} := \max_{i=1,\dots,k} f(x_i)$. In particular, we employ a cyclic strategy that picks target
values $f_k^* \in [-\infty, s_k(y^*)]$ according to the following sequence of length κ + 2:
A library for black-box optimization 9
– Step −1 (InfStep): Choose $f_k^* = -\infty$. In this case the problem of finding
$x_{k+1}$ can be rewritten as:
\[
x_{k+1} = \arg\max_{x \in \Omega_I} \frac{1}{(-1)^{d_{\min}+1} \mu_k(x)}.
\]
This is an exploration phase: the algorithm tries to improve the surrogate
model in unknown parts of the domain.
– Step h ∈ {0, . . . , κ − 1} (Global search): Choose
In this case, there is a balance between improving the model quality and
finding the minimum. Notice that if $f_k^* = s_k(y^*)$, then
\[
x_{k+1} = \arg\max_{x \in \Omega_I} \frac{1}{(-1)^{d_{\min}+1} s_k(x)}. \tag{13}
\]
– [19] suggests transforming the domain of f into the unit hypercube. This
strategy is implemented in the rbfSolve function of the MATLAB toolkit
TOMLAB. In our tests, we found this transformation to be beneficial only
when the bounds of the domain are significantly skewed. When all vari-
ables are defined over an interval of approximately the same size we did not
observe any benefit from this transformation, and in fact sometimes per-
formance deteriorated. Note that the transformation cannot be applied on
integrality-constrained variables. After computational testing, our default
strategy is to transform the domain into the unit hypercube on problems
with no integer variables and such that the ratio of the lengths of the
largest to smallest variable domain exceeds a given threshold, set to 5 by
default.
– To prevent harmful oscillations of the RBF interpolant due to large dif-
ferences in the function values, [15] suggests clipping the function values
f (xi ) at the median (in other words, replacing values larger than the me-
dian by the median). This approach is also adopted by [3, 27]. We follow
this approach with one small change: function values are clipped at the
median only if the ratio of the largest to smallest absolute function value
exceeds a given threshold, set to $10^3$ by default.
– For the same reason of preventing large differences in function values, it
has been proposed to rescale the codomain of f . [30] uses the plog (paired
log) approach, which consists in replacing each function value using the
following transformation:
\[
\mathrm{plog}(x) = \begin{cases} \log(1 + x) & \text{if } x \ge 0, \\ -\log(1 - x) & \text{if } x < 0. \end{cases}
\]
[20] replaces the values $f(x_i) > \max(0, f_{\min}) + 10^5$ with $\max(0, f_{\min}) + 10^5 +
\log_{10}(f(x_i) - \max(0, f_{\min}) + 10^5)$. Our implementation offers the following
three choices for function scaling:
  – off : we employ the original, unscaled function values;
  – log scaling: if $f_{\min} \ge 1$ we replace each $f(x_i)$ with $\log(f(x_i))$, otherwise
  we replace it with $\log(f(x_i) + 1 + |f_{\min}|)$ (similar to [20]);
  – affine scaling: we replace each $f(x_i)$ with $\frac{f(x_i) - f_{\min}}{f_{\max} - f_{\min}}$.
– In the Global search step, [15, 27] replace $f_{\max}$ in equation (12) with a
dynamically chosen value $f(x_{\pi(\alpha(k))})$, defined as follows. Let $k_0$ be the number
of initial sampling points, h the index of the current Global search iteration
as in Section 2.2, π a permutation of {1, . . . , k} such that
$f(x_{\pi(1)}) \le f(x_{\pi(2)}) \le \dots \le f(x_{\pi(k)})$, and
\[
\alpha(k) = \begin{cases} k & \text{if } h = 0, \\ \alpha(k - 1) - \left\lfloor \frac{k - k_0}{\kappa} \right\rfloor & \text{otherwise.} \end{cases}
\]
As a result, fmax is used to define the target value fk∗ only at the first
step (h = 0) of each Global search cycle. In subsequent steps, we pick
progressively lower values of f (xi ), so as to stabilize search by avoiding too
large differences between the minimum of the RBF interpolant, and the
target value.
– If the initial sample points are chosen with a random strategy (for exam-
ple, a Latin Hypercube design), whenever we detect that the algorithm is
stalling, we apply a complete restart strategy [27, Sect. 5]. Restart strate-
gies have been applied to numerous combinatorial optimization problems,
such as satisfiability [14] and integer programming [1]. In the context of
the RBF algorithm, the restart strategy as introduced in [27] works by
restarting the algorithm from scratch (including the generation of new ini-
tial sample points) whenever the best known solution does not improve by
at least a specified value (0.1% by default) after a given number of opti-
mization cycles (5 by default). In our experience restarts tend to be more
useful if the initial sample points are chosen according to a randomized
strategy (otherwise the random number generator has impact only on the
solvers for the auxiliary problems).
– A known issue of the RBF method, explicitly pointed out in [27, Sect. 4],
is that large values of h in Global search do not necessarily imply that the
algorithm is performing a “relatively local” search as intended. In fact, the
next iterate can be arbitrarily far from the currently known best point, and
this can severely hamper convergence on problems where the global mini-
mum is in a steep valley. To alleviate this issue, [27, Sect. 4.3] proposes a
“restricted global minimization of the bumpiness function”. The basic idea
is to progressively restrict the search box around the best known solution
during a Global search cycle. In particular, instead of solving (13) over $\Omega_I$,
we intersect $\Omega_I$ with the box $[\arg\min_{y \in \Omega_I} s_k(y) - \beta_k (x^U - x^L), \arg\min_{y \in \Omega_I} s_k(y) +
\beta_k (x^U - x^L)]$, where $\beta_k = 0.5(1 - h/\kappa)$ if $(1 - h/\kappa) \le 0.5$, and $\beta_k = 1$
otherwise (the numerical constants indicated are the values suggested by [27]).
It is easy to verify that this restricts the global search to a box centered on
the global minimizer of the RBF interpolant: the box coincides with ΩI at
the beginning of every Global search cycle, but gets smaller as h increases.
This turns out to be very beneficial on problems with steep global minima.
– A simple strategy that we found to be effective (see Section 5) is to repeat
the Local search step in case a Local search successfully improves the best
known solution. In our experiments, it was not beneficial to perform Local
search more than twice in a row, as this runs a high risk of focusing too
much on a local minimum, forsaking global search.
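Some of the refinements listed above are simple transformations, which the following Python sketch illustrates: clipping at the median, the plog transformation, the log and affine scaling options, and the restricted search box. All function names are hypothetical, and the handling of corner cases (a zero smallest absolute value when clipping, constant data in affine scaling) is an assumption of this sketch rather than a description of RBFOpt.

```python
import numpy as np


def clip_at_median(f, threshold=1e3):
    """Clip values above the median, only if max|f| / min|f| exceeds threshold."""
    f = np.asarray(f, dtype=float)
    abs_f = np.abs(f)
    smallest = abs_f.min()
    # Assumption: a zero smallest value is treated as an infinite ratio.
    if smallest > 0 and abs_f.max() / smallest <= threshold:
        return f.copy()
    return np.minimum(f, np.median(f))


def plog(x):
    """log(1 + x) for x >= 0 and -log(1 - x) for x < 0."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))


def log_scaling(f):
    f = np.asarray(f, dtype=float)
    fmin = f.min()
    return np.log(f) if fmin >= 1 else np.log(f + 1 + abs(fmin))


def affine_scaling(f):
    f = np.asarray(f, dtype=float)
    fmin, fmax = f.min(), f.max()
    if fmax == fmin:  # assumption: constant data is mapped to zeros
        return np.zeros_like(f)
    return (f - fmin) / (fmax - fmin)


def restricted_box(x_min, lower, upper, h, kappa):
    """Intersect [lower, upper] with a box centered on the minimizer x_min."""
    frac = 1.0 - h / kappa
    beta = 0.5 * frac if frac <= 0.5 else 1.0
    width = beta * (upper - lower)
    return np.maximum(lower, x_min - width), np.minimum(upper, x_min + width)


# Illustrative values (not from the paper).
clipped = clip_at_median([1.0, 10.0, 100.0, 5000.0, 2.0])
scaled = affine_scaling([-3.0, 0.0, 6.0])
lo, up = np.zeros(2), np.ones(2)
first_box = restricted_box(np.array([0.5, 0.5]), lo, up, h=0, kappa=5)
last_box = restricted_box(np.array([0.5, 0.5]), lo, up, h=4, kappa=5)
```

With κ = 5, the box at h = 0 coincides with the full domain, while at h = 4 it shrinks to a small neighborhood of the minimizer of the interpolant.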
predicting function values for the points that have a low function value, which
are arguably the most important for local search. On the other hand, for
global search it seems reasonable to choose a model that has good predictive
performance on a larger range of function values, hence we use q̄70% . The
points with the highest function values are the farthest from the minimum
and our assumption is that they can be disregarded.
The RBF model with the lowest value of q̄10% is employed in the subsequent
optimization cycle for the Local search step and the Global search step with
h = κ − 1, while the RBF model with lowest value of q̄70% is employed for
all the remaining steps. In our experiments, the RBF models that we consider
are those with cubic, thin plate spline or multiquadric (with γ = 1) basis
functions. We exclude the linear basis function because in our experience it
sometimes leads to numerically unstable models, and its inclusion did not yield
a noticeable performance increase in terms of model quality. It is possible to
also include different scaling parameters as described in Section 2.3 in the
evaluation, but we did not pursue this possibility.
A drawback of leave-one-out cross validation is that it is typically expensive
to perform. However, in the setting of this paper the number of points k is
usually low, hence computing q̄10% and q̄70% only takes a fraction of a second.
We can show that these two values can be computed by solving a sequence of
LPs that can be efficiently warmstarted. In our experiments this turned out
to be unnecessary and not worth the overhead of communicating with an LP
solver, but we believe that this approach may be of interest for larger values
of k, therefore we give an overview of the main idea.
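For reference, the direct approach to leave-one-out cross validation, which solves system (4) once for each left-out point, can be sketched as follows for the cubic basis function. The sketch only returns the raw prediction errors; how they are aggregated into q̄10% and q̄70% is not shown here. Function names and data are hypothetical.

```python
import numpy as np


def cubic_rbf_predict(X, f, x_new):
    """Fit a cubic RBF through (X, f) by solving (4), then evaluate it at x_new."""
    k, n = X.shape
    Phi = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2) ** 3
    P = np.hstack([X, np.ones((k, 1))])
    A = np.block([[Phi, P], [P.T, np.zeros((n + 1, n + 1))]])
    sol = np.linalg.solve(A, np.concatenate([f, np.zeros(n + 1)]))
    lam, h = sol[:k], sol[k:]
    r = np.linalg.norm(X - x_new, axis=1)
    return lam @ r ** 3 + h @ np.append(x_new, 1.0)


def leave_one_out_errors(X, f):
    """Absolute prediction error at each point when it is left out of the fit."""
    k = len(f)
    errors = np.empty(k)
    for j in range(k):
        keep = np.arange(k) != j
        errors[j] = abs(cubic_rbf_predict(X[keep], f[keep], X[j]) - f[j])
    return errors


# Illustrative data (not from the paper).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [0.5, 0.5], [0.2, 0.7]])
f_linear = 2.0 * X[:, 0] - X[:, 1] + 1.0
f_curved = (X[:, 0] - 0.3) ** 2 + np.sin(3.0 * X[:, 1])
errors_linear = leave_one_out_errors(X, f_linear)
errors_curved = leave_one_out_errors(X, f_curved)
```

Since the interpolant (3) contains a linear polynomial term, a linear function is reproduced exactly and its leave-one-out errors vanish up to rounding, which makes it a convenient sanity check.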
To carry out the cross validation scheme, we must compute k RBF inter-
polants. Instead of repeatedly solving the linear system (4), we can set up an
optimization problem as follows:
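\[
\begin{array}{rll}
\min & (-1)^{d_{\min}+1} \lambda^T \Phi \lambda & \\
\text{s.t.:} & \Phi \lambda + P h + \xi = F & \\
& P^T \lambda = 0_{n+1} & \\
& 0 \le \xi_i \le 0 & \forall i \in \{1, \dots, k\} \setminus \{j\} \\
& -\infty \le \xi_j \le +\infty & \\
& \lambda_j = 0. &
\end{array} \tag{15}
\]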
In (15) we minimize the bumpiness of the interpolant, subject to the interpo-
lation conditions. Observe that in this problem, the j-th interpolation point
is ignored: the corresponding constraint is relaxed because of the free slack
variable, and λj is set to zero so the RBF centered on xj has no contribution.
(15) is a QP, but we now show that it can be reformulated into an LP. Using
the equality constraints, we can rewrite the objective function:
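\[
\lambda^T \Phi \lambda = \lambda^T (F - P h - \xi) = \lambda^T F - \lambda^T P h - \lambda^T \xi.
\]
Furthermore, $P^T \lambda = 0_{n+1}$, and we can write:
\[
\lambda^T \xi = \sum_{i=1,\, i \ne j}^{k} \lambda_i \xi_i + \lambda_j \xi_j = 0.
\]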
Notice that changing the index j in (16) involves modifications of the variable
bounds. Therefore, we can solve a sequence of LP problems of the form (16)
with the dual simplex method.
Fig. 2 Function evaluations affected by errors: the values returned by f˜ are the circles. The
dashed line interpolates exactly at those points, the solid line is a less bumpy interpolant, and
is still within the allowed error tolerances. Problem (17) would prefer the latter interpolant.
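the values $\tilde{f}(x_1), \dots, \tilde{f}(x_\chi)$ by an amount within the allowed error estimates
$\epsilon_r$, $\epsilon_a$. In particular, instead of solving (4) to determine the interpolant, we
introduce a vector of slack variables $\xi \in \mathbb{R}^k$ and solve the problem:
\[
\begin{array}{rll}
\min & (-1)^{d_{\min}+1} \lambda^T \Phi \lambda & \\
\text{s.t.:} & \Phi \lambda + P h + \xi = F & \\
& P^T \lambda = 0_{n+1} & \\
& -\epsilon_r |\tilde{f}(x_i)| - \epsilon_a \le \xi_i \le \epsilon_r |\tilde{f}(x_i)| + \epsilon_a & \forall i \in L \\
& \xi_i = 0 & \forall i \in \{1, \dots, k\} \setminus L.
\end{array} \tag{17}
\]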
Here, F is assumed to contain f˜(xi ) instead of f (xi ) for all i ∈ L. Problem (17)
minimizes the bumpiness of the RBF interpolant, subject to the interpolation
conditions. The inequalities involving ξ allow the interpolant to take any value
within the error tolerances $\epsilon_r$, $\epsilon_a$ of the noisy function values $\tilde{f}(x_i)$, i ∈ L. A
sketch of this idea is given in Figure 2. If we set $\xi_i = 0$ for all i, thereby
eliminating ξ from the problem, deriving the KKT optimality conditions recovers the
original system (4). Note that (17) admits at least one solution if (4) admits
a solution, and (17) is a convex quadratic problem because of the conditional
positive semidefiniteness of Φ. In practice, to avoid numerical difficulties in the
solution of (17) we use a local solver starting from the solution of (4).
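A small instance of problem (17) can be solved as sketched below for the cubic basis function. This is not the solution approach of the paper: as a simplification, the sketch eliminates λ and h through the equality constraints (for fixed ξ they reduce to system (4) with right-hand side F − ξ) and minimizes the resulting convex function of ξ over its bounds with SciPy's L-BFGS-B, starting from ξ = 0, i.e. from the solution of (4). The gradient −2λ_i with respect to ξ_i follows from this elimination. All names and data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def rbf_coefficients(X, values):
    """Solve system (4) for the cubic RBF with right-hand side `values`."""
    k, n = X.shape
    Phi = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2) ** 3
    P = np.hstack([X, np.ones((k, 1))])
    A = np.block([[Phi, P], [P.T, np.zeros((n + 1, n + 1))]])
    sol = np.linalg.solve(A, np.concatenate([values, np.zeros(n + 1)]))
    return sol[:k], sol[k:], Phi


def fit_noisy_cubic_rbf(X, f_tilde, noisy, eps_r, eps_a):
    """Minimize the bumpiness over the slacks of the noisy points (set L)."""
    idx = np.flatnonzero(noisy)
    radius = eps_r * np.abs(f_tilde[idx]) + eps_a

    def full_slack(xi_noisy):
        xi = np.zeros(len(f_tilde))
        xi[idx] = xi_noisy
        return xi

    def objective(xi_noisy):
        lam, _, Phi = rbf_coefficients(X, f_tilde - full_slack(xi_noisy))
        # Bumpiness for d_min = 1 and its gradient with respect to the slacks
        return lam @ Phi @ lam, -2.0 * lam[idx]

    res = minimize(objective, np.zeros(idx.size), jac=True,
                   method="L-BFGS-B", bounds=list(zip(-radius, radius)))
    xi = full_slack(res.x)
    lam, h, Phi = rbf_coefficients(X, f_tilde - xi)
    return lam, h, xi, lam @ Phi @ lam


# Illustrative data (not from the paper): a noisy linear function in 1D.
X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
f_tilde = X[:, 0] + np.array([0.02, -0.02, 0.02, -0.02, 0.02])
lam0, _, Phi0 = rbf_coefficients(X, f_tilde)
exact_bumpiness = lam0 @ Phi0 @ lam0
lam, h, xi, relaxed_bumpiness = fit_noisy_cubic_rbf(
    X, f_tilde, np.ones(5, dtype=bool), eps_r=0.0, eps_a=0.05)
```

In this example the relaxed interpolant is allowed to deviate by at most 0.05 from each noisy value, and its bumpiness is lower than that of the exact interpolant.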
A drawback of this method is that it requires an estimate of $\epsilon_r$ and $\epsilon_a$. A related
approach was adopted by [21], whereby all function values are allowed to
deviate from the given f (x1 ), . . . , f (xk ), but these deviations are penalized
in the objective function according to a pre-specified penalty parameter. The
difference between our approach and the one of [21] is that we require the user
to specify the range within which function values are allowed to vary, whereas
[21] requires specifying the value of the penalty parameter in the objective
function and computes the error terms accordingly. We believe that estimating
a penalty parameter may prove harder in practice than providing an error
range, hence our approach may be more natural for practitioners.
both of which can be computed solving (17). When we are in the Local search
phase of the target value selection strategy, we compare $\sigma(s_{k,w})$ with the value
of $\sigma(s_k^*)$ for all w ∈ L such that $f_k^* \in [\tilde{f}(x_w) - \epsilon_r |\tilde{f}(x_w)| - \epsilon_a, \tilde{f}(x_w) + \epsilon_r |\tilde{f}(x_w)| +
\epsilon_a]$. In other words, we compare the bumpiness of the RBF interpolant at the
suggested new point xk+1 , with the bumpiness of the RBF interpolant if the
function value at xw were set to the target fk∗ . We do this only for points xw
such that f could take the value fk∗ at xw , according to the specified error
bounds. This way, we can verify whether we obtain a smoother interpolant by
placing the new point at a previously unexplored location, or at one of the
previously existing points. If this is the case and σ(sk,w ) < σ(s∗k ) for some
w ∈ L, we evaluate the function f (xw ) (i.e. the exact oracle f rather than the
noisy oracle f˜), replace the corresponding value, and set L ← L \ {w}. Note
that this step will be performed at most |L| = χ times and does not affect
global convergence in the limit.
A further modification of the algorithm, that we found to be beneficial in
practice, is to evaluate f (x) at points where f˜(x) has a potentially optimal or
satisfactory function value. In particular, if a target objective function value
is known, after every evaluation of f˜(x) we check whether the returned value
could be optimal up to the optimality tolerance and the error terms εr , εa . If
that is the case, we immediately evaluate f (x) at the same point and use the
corresponding function value. In our experiments, this significantly accelerated
convergence. From a practical perspective, while we cannot expect that the
optimal objective function value be known in advance, domain knowledge can
often provide a target value and an optimality tolerance such that solutions
within the specified tolerance are considered satisfactory, hence our approach
can be applied.
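The check described above can be sketched as follows. The exact form of the test is an assumption of this example: a noisy value is treated as potentially optimal if the lower end of its error interval does not exceed the target value plus a relative optimality tolerance. The function and parameter names are hypothetical.

```python
def could_be_optimal(f_noisy, target, opt_tol, eps_r, eps_a):
    """Return True if the exact value behind f_noisy might reach the target.

    Assumed test: the smallest value compatible with the error bounds is
    compared with the target relaxed by a relative optimality tolerance.
    """
    lowest_possible = f_noisy - eps_r * abs(f_noisy) - eps_a
    return lowest_possible <= target + opt_tol * abs(target)


# Illustrative values (not from the paper): target 10, tolerance 1%, noise 10%.
reevaluate_first = could_be_optimal(10.4, 10.0, 0.01, 0.1, 0.0)
reevaluate_second = could_be_optimal(12.0, 10.0, 0.01, 0.1, 0.0)
```

When the check succeeds, the exact oracle f would be evaluated at the same point and its value would replace the noisy one.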
5 Computational experiments
5.1 Implementation
– Neumaier [25] problems: we included one problem of class “perm”, and one
of class “perm0”, generated with parameters n = 6, β = 60 and n = 8, β =
100 respectively. These problems were conceived to be challenging for global
optimization solvers, and are in our experience very difficult to solve with a
black-box approach. The global minimum of these instances is originally 0,
but achieving an optimality tolerance of 1% or 0.01 is essentially hopeless
for these problems. Hence, we translated the functions up by 1000.
An extensive computational evaluation of black-box solvers is discussed in
[31], which uses a much larger test set than ours. However, the setting of that
paper is different because the variable bounds are relatively large, the problem
dimension is typically higher, and a larger budget of function evaluations is
allowed (up to 2500, while we limit ourselves to 150). The type of problems on
which the RBF method is expected to perform better is different, and for these
reasons, [31] does not provide computational results for any implementation
of the RBF method despite discussing it.
The following list summarizes the different settings that we considered, see
Sections 2.3 and 3 for details:
– scaling [affine, log, off]: the type of scaling used;
– R: restart the algorithm after 6 cycles without improvement of the best
solution found;
– B: restricted global minimization of the bumpiness function;
– L: if the local search step improves the best solution, it is repeated a second
time;
– auto: automatic model selection using cross validation to choose the basis
function.
The “default” configuration employs the cubic basis function, a random Latin
Hypercube design (generated with the maximum minimum distance criterion)
for the selection of the first sampling points, no InfStep, and 5 global search
steps (i.e., κ = 5). This is in accordance with [15, 27]. The number of function
evaluations is capped at 150, the time limit for the NLP solver is set to 60
seconds, and for the MINLP solver to 120 seconds. We parameterize BONMIN
to repeat NLP solutions up to 20 times at the root (effectively, this acts as
a multi-start approach on nonconvex continuous problems), and 10 times at
nodes in case of infeasibility. The time limit for each run is set to 4 hours.
Typically, hitting this time limit is indicative of numerical problems in the so-
lution process, e.g. the system (4) becomes badly conditioned. If this happens,
we consider the corresponding run as a failure.
We evaluate the performance obtained with our implementation on the test
instances of Table 2. Detailed results are given in the Appendix; here we give a
summary reporting: the geometric mean of the number of function evaluations
to find a solution within 1% of the global optimum (over 20 runs with differ-
ent random seeds – a value of 150 evaluations was used for failed runs), the
geometric standard deviation in parentheses, and the total number of successful
runs. Each row represents a different configuration of the algorithm. The best
values are in boldface. The geometric means are computed as follows: first,
for each instance we compute the arithmetic average of the number of func-
tion evaluations. Then, we compute the geometric mean of these arithmetic
averages across the instances. The reason for choosing this approach is that
within the same instance, we perform several random trials to get an estimate
of the expected number of function evaluations through sampling, hence the
arithmetic average is the natural estimator. After obtaining these numbers,
we aggregate them with a geometric mean so that each instance is given equal
weight, rather than putting more emphasis on problems that require more
function evaluations (such as problems where the algorithm does not converge
within the 150 function evaluations).
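The two-level aggregation described above can be sketched as follows. The data are made up for illustration, and the geometric standard deviation is computed here as the exponential of the (population) standard deviation of the logarithms of the per-instance averages, which is an assumption of this sketch.

```python
import numpy as np


def aggregate_evaluations(evals_per_instance):
    """Geometric mean and geometric std of per-instance arithmetic averages.

    evals_per_instance maps an instance name to the list of evaluation
    counts over its runs (failed runs are assumed to be recorded as 150).
    """
    averages = np.array([np.mean(runs) for runs in evals_per_instance.values()])
    log_avg = np.log(averages)
    return np.exp(log_avg.mean()), np.exp(log_avg.std())


# Synthetic example (not results from the paper).
runs = {"instance_a": [10, 30], "instance_b": [80, 80]}
geo_mean, geo_std = aggregate_evaluations(runs)
```

In this synthetic example the per-instance averages are 20 and 80, so each instance contributes one factor to the geometric mean regardless of how many evaluations it required.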
To compare different versions of the algorithm, we perform a Friedman
test using the average number of function evaluations on each instance as
blocks (rows), and the versions of the algorithm as groups (columns). The
null hypothesis of the test is that there is no difference among the groups.
This allows us to assess if one of the algorithms is consistently better than
the others on the majority of the instances. For details on the Friedman test
and its assumptions, we refer to [8]. Note that the Friedman test does not
take into account the magnitude of the differences among the values, but
results with the Quade test (a non-parametric statistical test that takes into
account differences in magnitude) are essentially in agreement, hence we only
report results for the Friedman test. All comparisons are performed at the
5% significance level (95% confidence). If an algorithm on the row is better (i.e. fewer function
evaluations according to the Friedman test) than an algorithm on the column,
we indicate it with a “*”. This is detected using post-hoc analysis when the
p-value < 0.05. The p-value is reported in the caption, along with a reference
to the table(s) in the Appendix with the detailed results.
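A Friedman test with instances as blocks and algorithm versions as groups can be run, for example, with `scipy.stats.friedmanchisquare`. The sketch below uses synthetic averages that are made up for illustration; it covers only the omnibus test and does not include the post-hoc analysis.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Synthetic averages (made up for illustration): rows are instances (blocks),
# columns are three hypothetical algorithm configurations (groups).
avg_evals = np.array([
    [31.0, 38.5, 52.0],
    [64.2, 70.1, 95.3],
    [18.0, 22.4, 25.9],
    [101.5, 110.0, 133.8],
    [45.3, 47.0, 61.2],
    [77.7, 90.6, 120.4],
])

# Each positional argument is one group, measured on the same blocks.
statistic, p_value = friedmanchisquare(*avg_evals.T)
reject_null = p_value < 0.05
```

In this synthetic data set the ranking of the three configurations is identical on every instance, so the null hypothesis of no difference among the groups is rejected at the 5% level.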
We would like to answer the following research questions:
1. Which algorithmic configuration is the best, and in particular, are the
improvements of Section 2.3 beneficial in practice?
2. Is our approach to handle noisy function evaluations effective?
3. Is automatic model selection using cross validation beneficial in practice?
4. Is our implementation competitive with the state-of-the-art?
5. Can the surrogate models produced by the algorithm be useful to perform
sensitivity analysis around the optimum?
The first question is investigated in the rest of this section. The second and
the third questions are investigated in Sections 5.4-5.5. The fourth question is
discussed in Section 5.6. The fifth question is discussed in Section 5.7.
We report the performance obtained with different settings of the algorithm and
different scaling procedures in Tables 3-5. It appears that “off” scaling is
never worse and sometimes better than other scaling procedures. Similar
Table 3 Results obtained with the default configuration and different function value scaling
procedures. Data taken from Table 16. The Friedman test does not reject the null hypothesis
(p-value 0.111).
Table 4 Results obtained with the RB configuration and different function scaling pro-
cedures. Data taken from Table 17. According to a Friedman test, “off” scaling performs
better than “affine” (p-value 0.025).
Table 5 Results obtained with the RBL configuration and different function value scaling
procedures. Data taken from Table 18. According to a Friedman test, “off” and “affine”
scaling perform better than “log” (p-value 0.015).
Table 6 Results obtained with “off” scaling and different algorithm settings (see Tables
16-20, off scaling columns). According to a Friedman test, “BL” and “RBL” perform better
than default, “R”, “L”, “RL” (p-value 0.000).
Table 7 Results obtained with the “RBL” configuration without noise, with noise 10%, and
with noise 20% (see Table 21). According to a Friedman test, noise 10% performs better
than the other algorithms (p-value 0.006).
can become very slow. Restarts are helpful in preventing stalling. Since the
difference in terms of performance is negligible, we prefer “RBL” to “BL”. In
the rest of this paper we use the “RBL” configuration. Table 6 answers our
first research question.
Table 8 Results obtained with and without automatic model selection for “RBL”. The
Friedman test does not reject the null hypothesis (p-value 0.205).
Table 9 Results obtained with and without automatic model selection for “RBL” with noise
level 10% (see Table 23). The Friedman test does not reject the null hypothesis (p-value
0.503).
Table 10 Results obtained with and without automatic model selection for “RBL” with
noise level 20%. The Friedman test does not reject the null hypothesis (p-value 0.832).
cubic” or “RBL thin plate spline”, although the difference is not detected by
a Friedman test. This is our best performing algorithm configuration so far,
solving more instances than any other tested configuration and showing that
the automatic model selection is useful in our experiments. In particular,
automatic model selection improves over any one of the three tested basis functions,
suggesting that it is able to find the best performing model. It is interesting to
compare our “auto” configuration with the best single basis function for each
instance, i.e. the results that could be obtained in the hypothetical situation
of being able to guess the best performing basis function before solving the
instance. This “RBL best-basis-function” would require on average 54.33 func-
tion evaluations (geometric standard deviation 2.81), solving 364 instances,
and is therefore only marginally better than “RBL auto” on our test set. The
results suggest that our model selection scheme is able to correctly guess the
best surrogate model in most situations.
The same results carry over when exploiting a noisy oracle with relative
error at most 10%: automatic model selection is able to reduce the number
of function evaluations by ≈ 7%, and solves more instances. However, with a
relative noise of 20%, the benefit from using automatic model selection thins
out considerably and is hardly noticeable. This can be explained with the fact
that automatic model selection relies on the function evaluations to assess
model performance: in a context where the function evaluations are affected
by a significant relative error, assessing model quality becomes difficult, and
therefore our proposed procedure brings little advantage. Still, even with a
large noise our “auto” configuration is no worse than the default one and finds
the global optimum on a few more instances. This answers our third research
question.
Table 11 Best results obtained with “RBL auto” with Latin Hypercube sampling.
Instance best
branin 21
goldsteinprice 29
hartman3 16
hartman6 50
shekel10 37
shekel5 63
shekel7 67
Szegö functions, because they are the only functions for which results are
consistently reported. We use the version of our algorithm that seems to be
the most effective, namely “RBL auto”. For most of the papers we cite below,
we only report the best available results.
A major issue is that the settings of the computational evaluations are not
always reported in full detail. Thus, in some cases we could not retrieve exact
information about the algorithms. More importantly, many papers report a
single result for each instance, i.e. a single number of function evaluations. For
implementations of the RBF method, it seems unlikely that any algorithm can
be fully deterministic even if the initial sample points are chosen
deterministically: the auxiliary problems that have to be solved are nonconvex problems that
are not solved to global optimality, employing e.g. multistart heuristics. Unfor-
tunately, in some cases we do not know how to interpret the results reported in
the papers (i.e. if it is an average number of evaluations over repeated runs, or
the best result achieved, or a one-shot test). We report results verbatim from
these papers anyway, and we give in Table 11 the best results (over 20 runs)
obtained with our “RBL auto” configuration of the RBF algorithm, using a
Latin Hypercube for the initial sampling, instead of the averages reported in
previous sections.
A summary of the algorithms reported, their settings, and corresponding
references is given in Table 12. It is obvious that there are many different
settings employed for the different algorithms, and many differences are not
captured in Table 12. For example, in [21] the limit on function evaluations
is set to 150 (if the algorithm failed, it is not counted in the computation of
the average), whereas in [28] it is 500. ARBFMIP employs the cubic RBF
for the hartman6 and goldsteinprice instances, and thin plate spline for the
remaining instances; it also considers 7 corner points and a central point only
as initial samples for the hartman6 instance. To add to the confusion, the
same algorithm can be called in different ways in different papers: RBF in
[15], Gutmann in [21], Gutmann-RBF in [27], and RBFGLOB in [19] are the
same algorithm. Similarly, CORS-RBF in [19] is CORS-RBF (sp1) in [28], and
CORS-RBF in [27] is CORS-RBF (sp2) in [28].
We report results in different tables depending on the strategy to choose
the initial points. Table 13 reports the average number of evaluations for the
algorithms presented in Table 12 that employ an initial sampling based on
Table 12 Summary of available information on the algorithms from the literature discussed
in our paper: strategy of the initial sampling ((S)LH stands for (Symmetric) Latin Hyper-
cube) and number of points, type of basis function, Table (in this paper) where the results
are reported, and references for a description of the algorithm and for the results reported
in this paper.
Table 13 Comparison with the literature, for algorithms such that the initial sample points
are chosen as the corner points.
Instance        RBL auto-C    qualsolve-C  Gutmann-C  rbfSolve-C  EGO-C  CORS-SP1  ARBFMIP  CORS-SP2
branin          30.3          32           44         59          21     34        22       40
goldsteinprice  82.4          60           63         84          125    49        21       64
hartman3        26            46           43         18          17     25        31       61
hartman6        102.60        99           112        109         92     108       43       104
shekel10        124.30 (70%)  71           51         (0%)        (0%)   51        25       64
shekel5         119.80 (75%)  70           76         (0%)        (0%)   41        34       52
shekel7         124.15 (70%)  85           76         (0%)        (0%)   46        31       64
corner points, as well as our “RBL auto” configuration initialized with the
2^n corners as starting points. Table 14 does the same for the algorithms
employing a Latin Hypercube initial sampling strategy. Finally, Table 15 reports
results for the remaining algorithms: either the initial sampling strategy is not
required, or it is not specified in the paper.
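As an illustration, the two initial sampling strategies mentioned above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names `corner_points` and `latin_hypercube` are hypothetical, "corner points" is taken to mean the vertices of the bounding box, and the Latin Hypercube shown is the basic (non-symmetric, non-optimized) variant.

```python
import itertools
import numpy as np

def corner_points(lower, upper):
    """Return all vertices of the box [lower, upper], one per row (2^n rows)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Each coordinate takes either its lower or its upper bound.
    return np.array(list(itertools.product(*zip(lower, upper))))

def latin_hypercube(lower, upper, num_points, rng):
    """Basic Latin Hypercube: every coordinate has exactly one point per stratum."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n = lower.size
    # Independent random permutation of the strata for each coordinate.
    strata = np.column_stack([rng.permutation(num_points) for _ in range(n)])
    # Random position inside each stratum, scaled to the unit cube.
    unit = (strata + rng.random((num_points, n))) / num_points
    return lower + unit * (upper - lower)
```

The number of corner points grows exponentially with the dimension, whereas the size of a Latin Hypercube design can be chosen freely, which is one reason why results obtained with the two strategies are reported in separate tables.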
These tables show that our implementation of the RBF algorithm seems
to be competitive with existing methods from the literature on the Dixon-Szegö
test set. It is difficult to draw statistically meaningful conclusions on
such a small set of instances; however, judging from the average number of function
evaluations needed to find a global optimum, it seems that our algorithm outperforms
DIRECT, EGO, DE, GLOBAL, AQUARS, and performs similarly to the best
algorithms described in the literature. In particular, our best results seem
competitive with the commercial implementation rbfSolve.
Table 14 Comparison with the literature, for algorithms such that the initial sample points
are chosen according to a Latin Hypercube design.
Instance RBL auto qualsolve rbfSolve EGO AQUARS AQUARS CG-RBF CORS-RBF
-LH -LH -LH -CGRBF -LMSRBF -Restart -Restart
branin 30.5 26.9 62.7 23 39.43 31.73 46.60 43.90
goldsteinprice 52.5 30.4 63 (0%) 65.77 35.33 61.60 59.27
hartman3 44.5 38.8 28 31 38.13 46.23 63.17 54.03
hartman6 92.82 (85%) 50.7 (95%) 122 (0%) 129.80 178.70 214.47 (90%) 199.67 (93.3%)
shekel10 83.46 (65%) 78 (60%) (0%) (0%) 121.10 179.63 169.33 121.30
shekel5 80.71 (35%) 61 (30%) (0%) (0%) 164.67 212.77 259.77 (93.3%) 216.97 (93.3%)
shekel7 96 (25%) 66 (60%) (0%) (0%) 152.70 178.03 156.23 150.77
Table 15 Comparison with the literature, for algorithms such that the initial sample is not
necessary or is not specified.
[Plot omitted: two panels. x-axis: Threshold (0, 5, 10, 15, Infty); y-axis: Percentage (0 to 100). Curves in each panel: accurate at stepsize 0.005, 0.01, 0.02, 0.05, 0.1; model trusted.]
Fig. 3 Percentage of surrogate models that are trusted and corresponding percentage of
model errors that are below 10%, for threshold policies of the form q̄10% ≤ x (left figure)
and q̄20% ≤ x (right figure), where x is the value on the x-axis. The left axis goes from 0
to 20 with 0.5 increments, but the last point is out-of-scale and indicates x = ∞ (label:
“Infty”).
for δ = 0.005, 0.01, 0.02, 0.05, 0.1. Our main question is whether the values q̄k%
provide useful information about the accuracy of the model for some values of
k. We plotted graphs of q̄k% for k = 10, 20, . . . , 100 against the model errors.
This is a large amount of data; we report a summary of our findings.
For practical purposes, we decided that model errors of more than 10% are
not acceptable, and that we are looking for a simple threshold policy for q̄k%
to determine whether or not the surrogate model should be trusted around
the optimum. For the tested values of k%, we tried different thresholds t and
plotted the aggregated model errors. We plot these graphs for k = 10, 20 in
Figure 3, where we give the fraction of model errors (among all points within
the domain located ±Δ_i e_i away from x*, i = 1, ..., n) that are below 10%,
and the fraction of models that are trusted based on the given threshold.
Ideally, we want these values to be as high as possible; for k ≥ 30, all the
curves are shifted noticeably towards the bottom, so we do not report
the corresponding results. From the graphs, it seems that q̄20% ≤ 10% performs
relatively well in practice: up to δ = 0.02, about 75% of the time the model
errors stay below 10%.
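The evaluation of a threshold policy described above can be sketched as follows. This is only an illustrative sketch: the function name `evaluate_threshold_policy` and the array layout are hypothetical, and we assume that the fraction of accurate model errors is computed over the trusted models only.

```python
import numpy as np

def evaluate_threshold_policy(q_bar, model_errors, threshold, max_error=0.10):
    """Evaluate the policy "trust the model if q_bar <= threshold".

    q_bar: shape (m,), one cross-validation score per surrogate model.
    model_errors: shape (m, p), relative errors of each model at p probe points.
    Returns (fraction of models trusted,
             fraction of errors <= max_error among the trusted models).
    """
    q_bar = np.asarray(q_bar, dtype=float)
    model_errors = np.asarray(model_errors, dtype=float)
    trusted = q_bar <= threshold
    frac_trusted = float(trusted.mean())
    if not trusted.any():
        # No model is trusted, so the accuracy of trusted models is undefined.
        return frac_trusted, float("nan")
    frac_accurate = float((model_errors[trusted] <= max_error).mean())
    return frac_trusted, frac_accurate
```

Calling the function with `threshold=float("inf")` corresponds to trusting every model, i.e. the last point on the x-axis of Figure 3.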
A natural question is to determine if there is benefit in using the threshold
policy for q̄20% as compared to simply trusting the model every time. To answer
this question, in Figure 4 we compare the model errors for the models that
are trusted by our threshold policy, to the errors for all the models. We can
see that the reduction is significant. Using our policy, ≈ 60% of the errors are
within 1% for δ = 0.01, and more importantly, in almost 90% of the cases the
errors are smaller than 50%, whereas without the threshold policy we observe
a significant fraction of the errors above 50%.
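The curves compared in Figure 4 are empirical cumulative distribution functions of the model errors. A minimal sketch of how such a curve can be computed is given below; the function name `empirical_cdf` is hypothetical, and the errors are assumed to be expressed in percent.

```python
import numpy as np

def empirical_cdf(errors, levels):
    """Return Prob(error <= X) for every level X in `levels`."""
    errors = np.sort(np.asarray(errors, dtype=float))
    # With side="right", searchsorted counts the entries that are <= each level.
    counts = np.searchsorted(errors, levels, side="right")
    return counts / errors.size
```

Applying the function once to the errors of the trusted models and once to the errors of all models, with the same levels, gives the two sets of values that are compared in the left and right panels.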
To summarize, one should not expect that the surrogate model can always
predict the unknown objective function with high accuracy. However, in our
experiments evaluating the quality of the model via cross-validation is helpful
in assessing model accuracy: using a simple threshold policy on a measure
of model quality, we are able to identify a large number of the inaccurate
[Plot omitted: two panels. x-axis: Value X (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, +); y-axis: Prob (error <= X%), from 0 to 1. One curve per stepsize: 0.005, 0.01, 0.02, 0.05, 0.1.]
Fig. 4 Empirical cumulative distribution function of the model errors for the threshold
policy q̄20% ≤ 10% (left) and for no policy (right).
surrogate models. For small changes around the optimum, in most cases the
model errors are less than 10%, and large errors are rare. In some practical
applications, there may be value in using this approach to perform sensitivity
analysis as opposed to performing more expensive oracle evaluations.
We investigate one more possible research direction related to model quality.
Using the notation of Section 3, it is straightforward to notice that
$\lim_{k \to \infty} q_{k,j} = 0$ for all $j = 1, \dots, k$ on continuous problems,
because $\tilde{s}_{k,j}$ agrees with $f$ on a dense subset of $\Omega$. Hence,
$\bar{q}_{100\%}$ goes to zero in the limit. We are interested in determining if
$\bar{q}_{100\%}$ is strongly correlated with a more traditional measure
of model quality, namely $\int_\Omega |s_k(x) - f(x)| \, dx$, to understand if we can draw
conclusions on the global quality of the surrogate model using an inexpensive
computation. By contrast, evaluating the integral is very time-consuming, but
it can be done for $n \le 3$. Thus, we apply our algorithm to a set of 11 two- and
three-dimensional problem instances, and record $\bar{q}_{100\%}$ and
$\int_\Omega |s_k(x) - f(x)| \, dx$ after every iteration. The Pearson correlation
coefficient between the two samples is 0.599 on average over this set, but
unfortunately it is very low (and even negative) on some of the problems. While
there is on average a positive correlation between $\bar{q}_{100\%}$ and
$\int_\Omega |s_k(x) - f(x)| \, dx$ across the iterations of the
algorithm (which is not surprising since both sequences go to zero as $k$
increases), due to the unpredictability of this measure we decide not to explore
this direction further.
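To make the comparison concrete, the integral of the absolute model error and the correlation between two recorded sequences can be approximated as in the sketch below. This is an assumption-laden illustration, not the procedure used in our experiments: the names `l1_model_error` and `pearson` are hypothetical, the integral is estimated with a midpoint rule on a uniform grid, and `surrogate` and `f` are assumed to be callables taking a point as a NumPy array.

```python
import itertools
import numpy as np

def l1_model_error(surrogate, f, lower, upper, points_per_dim=50):
    """Midpoint-rule estimate of the integral of |s(x) - f(x)| over a box."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    widths = (upper - lower) / points_per_dim
    # Midpoints of the grid cells along each coordinate.
    axes = [lo + w * (np.arange(points_per_dim) + 0.5)
            for lo, w in zip(lower, widths)]
    cell_volume = float(np.prod(widths))
    total = 0.0
    for point in itertools.product(*axes):
        x = np.array(point)
        total += abs(surrogate(x) - f(x))
    return total * cell_volume

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    return float(np.corrcoef(a, b)[0, 1])
```

The grid contains `points_per_dim ** n` points, so the cost grows exponentially with the dimension; this is consistent with restricting the experiment to problems with at most three variables.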
6 Conclusions
In this paper we provided an overview of the RBF method for black-box op-
timization, which is considered one of the best surrogate model based meth-
ods for derivative-free optimization. We proposed some modifications of the
algorithm with the aim of improving practical performance. Our two main
contributions are a methodology to perform automatic model selection using
a cross-validation scheme, and an approach to exploit noisy but faster func-
tion evaluations. Computational experiments show that these contributions are
beneficial in practice, yielding a noticeable reduction in the number of function
Acknowledgements The authors are grateful for the financial support by the SUTD-MIT
International Design Center under grant IDG21300102.
References
18. Hemker, T.: Derivative free surrogate optimization for mixed-integer nonlinear black-
box problems in engineering. Master’s thesis, Technische Universität Darmstadt (2008)
19. Holmström, K.: An adaptive radial basis algorithm (ARBF) for expensive black-box
global optimization. Journal of Global Optimization 41(3), 447–464 (2008)
20. Holmström, K., Quttineh, N.H., Edvall, M.M.: An adaptive radial basis algorithm (ARBF)
for expensive black-box mixed-integer constrained global optimization. Optimization
and Engineering 9(4), 311–339 (2008)
21. Jakobsson, S., Patriksson, M., Rudholm, J., Wojciechowski, A.: A method for simulation
based optimization using radial basis functions. Optimization and Engineering 11(4),
501–532 (2010)
22. Jones, D., Perttunen, C., Stuckman, B.: Lipschitzian optimization without the Lipschitz
constant. Journal of Optimization Theory and Applications 79(1), 157–181 (1993)
23. Jones, D., Schonlau, M., Welch, W.: Efficient global optimization of expensive black-box
functions. Journal of Global Optimization 13(4), 455–492 (1998)
24. Kolda, T.G., Lewis, R.M., Torczon, V.J.: Optimization by direct search: new perspec-
tives on some classical and modern methods. SIAM Review 45(3), 385–482 (2003)
25. Neumaier, A.: Neumaier’s collection of test problems for global optimization. URL
http://www.mat.univie.ac.at/~neum/glopt/my_problems.html. Retrieved in May 2014
26. Powell, M.: Recent research at Cambridge on radial basis functions. In: New Developments
in Approximation Theory, International Series of Numerical Mathematics, vol.
132, pp. 215–232. Birkhäuser Verlag, Basel (1999)
27. Regis, R., Shoemaker, C.: Improved strategies for radial basis function methods
for global optimization. Journal of Global Optimization 37, 113–135 (2007). DOI
10.1007/s10898-006-9040-1
28. Regis, R.G., Shoemaker, C.A.: Constrained global optimization of expensive black box
functions using radial basis functions. Journal of Global Optimization 31(1), 153–171
(2005)
29. Regis, R.G., Shoemaker, C.A.: A stochastic radial basis function method for the global
optimization of expensive functions. INFORMS Journal on Computing 19(4), 497–509
(2007). DOI 10.1287/ijoc.1060.0182
30. Regis, R.G., Shoemaker, C.A.: A quasi-multistart framework for global optimization
of expensive functions using response surface models. Journal of Global Optimization
56(4), 1719–1753 (2013)
31. Rios, L.M., Sahinidis, N.V.: Derivative-free optimization: a review of algorithms and
comparison of software implementations. Journal of Global Optimization 56(3), 1247–
1293 (2013)
32. Schoen, F.: A wide class of test functions for global optimization. Journal of Global
Optimization 3(2), 133–137 (1993)
33. Törn, A., Žilinskas, A.: Global optimization. Springer (1987)
34. Wächter, A., Biegler, L.T.: On the implementation of a primal-dual interior point filter
line search algorithm for large-scale nonlinear programming. Mathematical Program-
ming 106(1), 25–57 (2006)
Tables of results
The details of the numerical results obtained with the RBF algorithm are presented in Tables
16-25. For each instance we perform 20 runs of the algorithm, changing the random seed.
A run fails if the algorithm cannot find a solution with a relative error less than or equal to
1% from the global optimum within 150 function evaluations. The relative error is computed
as |f* − F*| / |F*|, where f* is the best solution found and F* is the global optimum of the
problem. If F* = 0, the error is computed as |f* − F*|. The statistics reported
in the tables are the number of successful trials out of 20 (“#sol.”), the average number of
function evaluations, the standard deviation from the mean, and the average relative error
after 150 evaluations for those instances where the algorithm does not converge.
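The success criterion and the statistics described above can be sketched in Python as follows. This is a minimal sketch with hypothetical names (`relative_error`, `summarize_runs`); it assumes that a failed run is charged the full budget of 150 evaluations, that the population standard deviation is used, and that errors are stored as fractions rather than percentages.

```python
import statistics

def relative_error(f_best, f_opt):
    """Relative error of f_best with respect to the global optimum f_opt."""
    if f_opt == 0:
        # Fall back to the absolute error when the optimum is zero.
        return abs(f_best - f_opt)
    return abs(f_best - f_opt) / abs(f_opt)

def summarize_runs(runs, f_opt, max_evals=150, tol=0.01):
    """runs: list of (evaluations_used, best_value) pairs, one per random seed."""
    errors = [relative_error(best, f_opt) for _, best in runs]
    solved = [err <= tol for err in errors]
    # Assumption: a failed run counts as having used the whole budget.
    evals = [n if ok else max_evals for (n, _), ok in zip(runs, solved)]
    failed_errors = [err for err, ok in zip(errors, solved) if not ok]
    return {
        "num_solved": sum(solved),
        "avg_evals": statistics.mean(evals),
        "std_evals": statistics.pstdev(evals),
        "avg_error_failed": statistics.mean(failed_errors) if failed_errors else 0.0,
    }
```

For instance, `summarize_runs([(30, 1.005), (150, 1.5), (40, 1.0)], f_opt=1.0)` counts two successful runs and one failure with relative error 0.5.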
Table 16 Results obtained with the default configuration and different function value scal-
ing procedures.
Table 17 Results obtained with the “RB” configuration and different function value scaling
procedures.
Table 18 Results obtained with the “RBL” configuration and different function value scal-
ing procedures.
Table 19 Results obtained with the “R”, “B” and “L” configurations and “off” scaling
procedure.
                              R                            B                            L
Instance          #sol. avg. eval.      error  #sol. avg. eval.      error  #sol. avg. eval.      error
branin               20      38.70       0.00     20      31.80       0.00     20      35.25       0.00
camel                20      42.50       0.00     20      38.50       0.00     20      39.50       0.00
ex4_1_1              20      19.85       0.00     20      20.20       0.00     20      18.15       0.00
ex4_1_2              20       9.60       0.00     20       9.65       0.00     20       9.60       0.00
ex8_1_1              20       7.40       0.00     20       7.30       0.00     20       7.40       0.00
ex8_1_4              20      28.45       0.00     20      27.60       0.00     20      27.40       0.00
gear                 20       7.30       0.00     20       7.30       0.00     20       7.30       0.00
goldsteinprice       11     123.80      31.48     20      71.55       0.00     12     115.20      11.49
hartman3             20      35.90       0.00     20      45.85       0.00     20      40.20       0.00
hartman6              9     129.95       5.68     11     106.70       4.28     10     118.00       6.60
least                 0     150.00     264.34      0     150.00     200.53      0     150.00     204.31
nvs04                 4     131.10     194.44      8     105.90     194.44      5     125.55     194.44
nvs06                 0     150.00      28.69     11     117.70       9.32      0     150.00      27.83
nvs09                20      14.20       0.00     20      14.25       0.00     20      14.20       0.00
nvs16                 9     106.55     775.76      9     108.70    1221.82      8     107.40    1484.44
perm0_8               0     150.00     170.01      0     150.00     183.08      1     147.20     176.23
perm_6                0     150.00  116469.28      0     150.00   20224.70      0     150.00   73415.96
rbrock                1     147.45      35.22      7     132.00       9.16      3     145.05      19.10
schoen_10_1           0     150.00      43.80      9     144.45      15.16      0     150.00      52.80
schoen_10_2           0     150.00      14.77     14     143.05       1.73      0     150.00      18.10
schoen_6_1            3     143.10      15.69     20      91.00       0.00      6     136.05      16.94
schoen_6_2            4     142.00      18.69     17      89.45     110.28      6     130.30      22.65
shekel10              2     147.85      47.83     10     129.65      36.96      4     144.35      59.24
shekel5               0     150.00      39.48      5     142.25      40.78      5     138.50      60.54
shekel7               3     144.15      43.61     11     121.55      31.48      2     144.25      45.92
Table 20 Results obtained with the “RL” and “BL” configurations and “off” scaling
procedure.
                             RL                           BL
Instance          #sol. avg. eval.      error  #sol. avg. eval.      error
branin               20      35.25       0.00     20      32.80       0.00
camel                20      39.50       0.00     20      36.90       0.00
ex4_1_1              20      18.15       0.00     20      16.15       0.00
ex4_1_2              20       9.60       0.00     20       9.15       0.00
ex8_1_1              20       7.40       0.00     20       7.30       0.00
ex8_1_4              20      27.40       0.00     20      29.80       0.00
gear                 20       7.30       0.00     20       7.30       0.00
goldsteinprice        9     125.05       6.39     18      77.70       3.74
hartman3             20      41.35       0.00     20      46.05       0.00
hartman6             11     116.90       3.99     11     104.15       4.30
least                 0     150.00     271.77      0     150.00     213.07
nvs04                 5     138.50     194.44      7     112.90     194.44
nvs06                 0     150.00      24.66      6     133.95       8.58
nvs09                20      14.20       0.00     20      14.25       0.00
nvs16                11     104.60     675.56      8     113.55    1013.33
perm0_8               1     147.20     201.63      1     148.80     147.19
perm_6                0     150.00  163072.95      0     150.00   24567.01
rbrock                2     148.85      35.28      7     135.20      10.72
schoen_10_1           0     150.00      52.96     15     135.95      25.98
schoen_10_2           0     150.00      19.02     16     127.15       2.25
schoen_6_1            6     136.05      17.90     20      83.45       0.00
schoen_6_2            6     130.30      17.39     16      88.40     110.26
shekel10              2     145.70      49.02      8     136.15      47.75
shekel5               5     138.50      56.27      8     127.75      46.82
shekel7               2     144.25      41.60      9     127.95      29.08
Table 21 Results obtained with the “RBL” configuration using a noisy oracle with 10% or
20% relative error.
Table 22 Results obtained with automatic model selection for the “RBL” configuration of
the algorithm.
Table 23 Results obtained with automatic model selection for the “RBL” configuration of
the algorithm and a noisy oracle with 10% relative error.
Table 24 Results obtained with automatic model selection for the “RBL” configuration of
the algorithm and a noisy oracle with 20% relative error.
Table 25 Results obtained with the “RBL” configuration and “off” scaling procedures and
a noisy oracle with 10% relative error, but relative error estimate r of the algorithm set at
20%, and with 20% relative error but relative error estimate r set at 30%.