© Elsevier Science Publishers B.V. (North-Holland) 1989
Chapter IX
Global Optimization
A. H. G. Rinnooy Kan
Econometric Institute, Erasmus University, Rotterdam, The Netherlands
G.T. Timmer
Econometric Institute, Erasmus University, Rotterdam, and ORTEC Consultants, Rotterdam,
The Netherlands
1. Introduction
f(x*) ≤ f(x)  ∀x ∈ B ∩ S .  (1.1)
In general, however, several local optima may exist and the corresponding
We note, however, that this set may contain points whose function values differ
considerably from y*.
So far, only a few solution methods for the global optimization problem have
been developed, certainly in comparison with the multitude of methods that
aim for a local optimum. The relative difficulty of global optimization as
compared to local optimization is easy to understand. If we assume that f is
twice continuously differentiable, then all that is required to test if a point is a
local minimum is knowledge of the first and second order derivatives at this
point. If the test does not yield a positive result, the continuous differentiability
of the function ensures that a neighbouring point can be found with a lower
function value. Thus, a sequence of points converging to a local optimum can
be constructed.
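The contrast can be made concrete with a small sketch: a steepest-descent iteration (a minimal stand-in for the local search procedures discussed here) stops as soon as the first-order test ‖∇f‖ ≈ 0 is passed, and which minimum it reaches depends entirely on the starting point. The function and step size are illustrative choices, not from the text:

```python
import numpy as np

def local_descent(f, grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Crude steepest-descent sketch: follow -grad(f) until the
    first-order local-optimality test ||grad(f)|| ~ 0 is passed."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # local test: first-order condition
            break
        x = x - step * g              # neighbouring point with lower f-value
    return x

# Illustrative two-well function: local minima at x = -1 and x = +1.
f = lambda x: (x[0]**2 - 1.0)**2
grad = lambda x: np.array([4.0 * x[0] * (x[0]**2 - 1.0)])
```

Started at x = 0.5 the iteration converges to the minimum at +1, started at x = −0.5 to the one at −1; no amount of such local testing reveals which of the two is global.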
Such local tests are obviously not sufficient to verify global optimality.
Indeed, in some sense the global optimization problem as stated in (1.3) is
inherently unsolvable in a finite number of steps. For any continuously
differentiable function f, any point x̄ and any neighbourhood B of x̄, there
exists a function f′ such that f + f′ is continuously differentiable, f + f′ equals
f for all points outside B, and the global minimum of f + f′ is x̄ ((f + f′) is an
indentation of f). Thus, for any point x̄, one cannot guarantee that it is not the
global minimum without evaluating the function in at least one point in every
neighbourhood B of x̄. As B can be chosen arbitrarily small, it follows that any
method designed to solve the global optimization problem would require an
unbounded number of steps (Dixon 1978).
Of course, this argument does not directly apply to the case where one is
satisfied with an approximation of the global minimum. In particular, if an
element of A_x(ε) is sought, then enumerative strategies that only require a
finite number of function evaluations can be easily shown to exist. These
strategies, however, are of limited practical use, and it appears that the above
observation does prohibit the construction of practical methods for the global
optimization problem. Hence, either a further restriction of the class of
objective functions or a further relaxation of what is required of an algorithm
will be inevitable in what follows.
Subject to this first conclusion, the methods developed to solve the global
optimization problem can be divided into two classes, depending on whether or
not they incorporate any stochastic elements (Dixon & Szegö 1975, 1978a;
Fedorov 1985).
Deterministic methods do not involve any stochastic concepts. To provide a
rigid guarantee of success, such methods unavoidably involve additional
assumptions on f.
Most stochastic methods involve the evaluation of f in a random sample of
points from S and subsequent manipulations of the sample. As a result, we do
sacrifice the possibility of an absolute guarantee of success. However, under
very mild conditions on the sampling distribution and on f, the probability that
an element of A_x(ε), A_f(ε) or A_φ(ε) is sampled can be shown to approach 1 as
the sample size increases (Solis & Wets 1981).
Irrespective of whether a global optimization method is deterministic or
stochastic, it always aims for an appropriate convergence guarantee. In some
cases, all that can be asserted is that the method performs well in an empirical
sense. This is far from satisfactory. Ideally, one would like to be assured that
the method will find an element of A_x(ε), A_f(ε) or A_φ(ε) in a finite number of
steps, and under appropriate conditions some deterministic methods provide
such a guarantee. Frequently, the best that one can do is to establish an
asymptotic guarantee, which ensures convergence to the global minimum as the
computational effort goes to infinity. As we have seen, stochastic methods
usually provide a stochastic version of such a guarantee, i.e. convergence in
probability or with probability 1 (almost surely, almost everywhere).
The existence of any type of asymptotic guarantee immediately raises the
issue of an appropriate stopping rule for the algorithm. This is already an
important question in regular nonlinear programming (where it is usually dealt
with in an ad hoc fashion), but acquires additional significance in global
optimization, where the trade-off between reliability and effort is at the heart of
every computational experiment. The design of an appropriate stopping rule
which weighs costs and benefits of continued computation against each other,
forms a problem of great inherent difficulty, and we shall be able to report only
partial success in solving it satisfactorily.
Rather than reviewing global optimization methods according to the distinc-
tion between deterministic and stochastic methods, we prefer to use a different
classification, based on what might be called the underlying philosophy of the
method. From the literature, five such philosophies can be identified:
(a) Partition and search. Here, S is partitioned into successively smaller
subregions among which the global minimum is sought, much in the spirit of
branch-and-bound methods for combinatorial optimization. A few of these
deterministic methods are reviewed in Section 2.
(b) Approximation and search. In this approach, f is replaced by an
increasingly better approximation that is easier from a computational point of
view. Some methods of this type are reviewed in Section 3.
(c) Global decrease. These methods aim for a permanent improvement in
f-values, culminating in arrival at the global minimum. A few of them will be
reviewed in Section 4.
(d) Improvement of local minima. Exploiting the availability of an efficient
local search routine, these methods seek to generate a sequence of local
minima of decreasing value. Clearly, the global minimum is the last one
encountered in this sequence. Some examples are given in Section 5.
(e) Enumeration of local minima. Complete enumeration of local minima
(or at least of a promising subset of them) is clearly a way to solve the global
optimization problem. Some methods developed for that purpose are discussed
in Section 6.
is built up as follows:

ū(x) = u^{(1)}(x) + u^{(2)}(x) .  (2.7)

The one for which f(x) is minimal produces the required upper bound UB on
y*.
Now, each v_i defines a halfspace {x ∈ ℝⁿ | w_iᵀ(x − x̄) ≤ w_iᵀ(u_i − x̄)},
of which the intersection is a hyperrectangle containing S′. Obviously,
f achieves its minimum over this hyperrectangle at one of its 2ⁿ vertices; it is
easy to find out which one, and this yields a lower bound on y*.
In addition, we also construct a hyperrectangle inscribed in the ellipsoid
{x ∈ ℝⁿ | f(x) = UB}, again of the form ∩_i {x ∈ ℝⁿ | w_iᵀ(x − x̄) ≤ d_i} with
appropriate constants d_i that can easily be computed. Clearly, x* cannot be
contained in the interior of this hyperrectangle, and the intersection of its
exterior with S′ defines an appropriate family of subsets in which to look for
x*. Each of these at most 2n polytopes can be embedded in a simplex rooted at
the corresponding v_i, and a suitable procedure (in particular, the original one
in a recursive fashion) can be called upon to solve these subproblems.
Several variations on this theme have been proposed (Zilverberg 1983;
Kalantari 1984), each of which fully exploits the extreme point property of the
problem. Generalizations of the problem, up to the case where f is the
difference of two convex functions and S′ is an arbitrary convex set, in which some
form of Benders decomposition is clearly called for, have also been attacked.
There is no doubt that if a problem exhibits special structure of the above kind,
then specialized algorithms of this nature are the ones to turn to. Obviously,
this will not always be the case.
for all x_1, x_2 ∈ S. In the case that n = 1, the method essentially consists of
iterative updates of a piecewise linear function with directional derivatives
equal to L or −L, which forms an adaptive Lipschitzian minorant function of f.
Initially, f is evaluated at some arbitrary point x_1. A piecewise linear
function φ_1 is defined by
Hence
φ_{k−1}(x) ≤ φ_k(x) ≤ f(x)  ∀x ∈ S ,  (3.4)

r_i = (f(x_i) − M_k + ε)/L .  (3.6)

f(x) ≥ f(x_i) − L r_i = M_k − ε .  (3.7)
Hence, if the spheres V_i (i = 1, …, k) cover the whole set S, then the point x̃
for which f(x̃) = M_k is an element of A_f(ε). Thus, this result converts the
global minimization problem to the problem of covering S with spheres. In the
simplest case of one-dimensional global optimization, where S is an interval
{x ∈ ℝ | a ≤ x ≤ b}, this covering problem is solved by choosing x_1 to be equal
to a + ε/L and

x_k = x_{k−1} + (2ε + f(x_{k−1}) − M_k)/L  (k = 2, 3, …) .  (3.8)
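A minimal sketch of this one-dimensional covering scheme, under the assumption that f is Lipschitz with constant L on [a, b]; the test function below is an illustrative choice, not from the text:

```python
def lipschitz_cover(f, a, b, L, eps):
    """Sequential covering sketch for S = [a, b], assuming f is Lipschitz
    with constant L: each evaluated point x_k rules out an interval on
    which f cannot lie more than eps below the record value M_k."""
    x = a + eps / L                       # x_1 = a + eps/L
    best_x, best_f = x, f(x)              # best_f plays the role of M_k
    points = [x]
    while True:
        # step of (3.8): 2*eps/L plus the slack (f(x_{k-1}) - M_k)/L
        x = x + (2 * eps + f(points[-1]) - best_f) / L
        if x > b:
            break
        points.append(x)
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x, best_f, points

# Illustrative test function with minimum 0 at x = 0.3 (Lipschitz, L = 1).
best_x, best_f, pts = lipschitz_cover(lambda t: abs(t - 0.3), 0.0, 1.0, 1.0, 0.05)
```

By (3.7), once [a, b] is covered, the record value lies within ε of the global minimum; in this run best_f ≤ 0.05.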
If S is a hypersphere with radius ρ, then

m(S) = π^{n/2} ρⁿ / Γ(1 + ½n) ,  (3.9)

while the total volume of k hyperspheres of radius ε/L equals

k (ε/L)ⁿ π^{n/2} / Γ(1 + ½n) .  (3.10)

Thus, for the k hyperspheres to cover S we require in the worst case

k ≥ (ρL/ε)ⁿ .  (3.11)
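A small computation makes the exponential growth of this covering requirement explicit, assuming the bound k ≥ (ρL/ε)ⁿ for S a hypersphere of radius ρ; the numerical values are arbitrary illustrative choices:

```python
# Lower bound on the number of eps/L-spheres needed to cover a
# hypersphere S of radius rho: k >= (rho * L / eps)^n.
def covering_bound(rho, L, eps, n):
    return (rho * L / eps) ** n

for n in (1, 2, 5, 10):
    print(n, covering_bound(1.0, 1.0, 0.01, n))   # grows as 100^n
```

Already for n = 10 the bound exceeds 10^20 function evaluations, which is why covering methods are confined to very low dimensions.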
Two processes sharing the same family (3.12) are called equivalent. The
process is called Gaussian if each distribution (3.12) is normal. It can be shown
(Cramer & Leadbetter 1967) that the specification of such a process amounts
to the specification of an appropriate covariance function R: S × S → ℝ, defined
by
and N(μ, σ²) is the normal density function with mean μ and variance σ².
Thus, given a set of points x_1 < x_2 < ⋯ < x_k at which f has been evaluated
so far, we can compute the posterior expectation (3.14) of the minimum value
y* for every interval (x_i, x_{i+1}) (i = 1, …, k − 1), and select the interval
(x_j, x_{j+1}) for which this value y_j* is minimal. As a next point at which to
evaluate f, we can choose the expected location of the minimum given by
x_j + ⋯  (3.16)

with

⋯ + (y_j* − f(x_j) − f(x_{j+1}))² / (2σ²(x_j + x_{j+1})) ,  (3.17)

where

σ̂² = (1/(k−1)) Σ_{i=2}^{k} (f(x_i) − f(x_{i−1}))² / (x_i − x_{i−1})  (3.19)
is sufficiently small, and terminate the algorithm if all intervals except the one
containing the current best point have been eliminated.
The above description demonstrates that the simple 1-dimensional Wiener
approach already leads to fairly cumbersome computations. Moreover, its
generalization to the case n > 1 is not obvious - many attractive properties of
the one-dimensional process do not generalize to higher dimensions. And as
remarked already, the almost everywhere nondifferentiability of its realizations
continues to be unfortunate.
R(x_1, x_2) = σ² / (1 + ‖x_1 − x_2‖²) .  (3.21)
where V = (v_{ij}) = (R(x_i, x_j)). For the particular case given by (3.21), (3.22) is
equal to Σ_{i=1}^{k} a_i/(1 + ‖x − x_i‖²) for appropriate constants a_i, so that the
stationary points of (3.22) can be found by finding all roots of a set of
polynomial equations. This can be done, for example, by the method from
(Garcia & Zangwill 1979) discussed in Section 6, and hence we may take the
approximation of f equal to (3.22) to arrive at another example of an algorithm in
the current category.
Both stochastic approaches described above can be viewed as special cases of
a more general axiomatic approach (Zilinskas 1982). There, uncertainty about
the values of f(x) other than those observed at x_1, …, x_k, is assumed to be
representable by a binary relation ≽_x, where (a, a′) ≽_x (b, b′) signifies that the
event {f(x) ∈ (a, a′)} is at least as likely as the event {f(x) ∈ (b, b′)}. Under
some reasonable assumptions on this binary relation (e.g., transitivity and
completeness), there exists a unique density function p_x that is compatible with
the relation in that for every pair of countable unions of intervals (A, A′), one
has that A ≽_x A′ if and only if

∫_A p_x(t) dt ≥ ∫_{A′} p_x(t) dt .
For the special case that all densities are Gaussian and hence characterized
by their means μ_x and variances σ_x², it is then tempting to approach the
question of where f should be evaluated next in the axiomatic framework of
utility maximization. For this, one has to presuppose that a preference relation
is defined on the set of all pairs (μ_x, σ_x). Subject again to some reasonable
assumptions about this preference relation and the utility function involved,
the surprising result is that the uniquely rational choice for a next point of
evaluation is the one for which the probability of finding a function value
smaller than ȳ − ε is maximal. In the case of the Wiener process, this amounts
to the selection of the interval which maximizes (3.20) with ȳ replaced by
ȳ − ε, say (x_j, x_{j+1}), and to take the corresponding point in that interval.
This, by the way, is not necessarily the same choice as dictated by (3.16).
4. Global decrease
∏_{k=1}^{∞} (1 − G_k[A]) = 0 ,  (4.2)
for every A C S with positive volume m(A), then it is easy to show that
i.e., we have convergence to A_f(ε) in probability. To prove (4.3), all that has
to be observed is that the probability in question is equal to
1 − lim_{k→∞} ∏_{i=1}^{k} (1 − G_i(A_f(ε)))  (4.4)
introducing the notion of reversals (Lawrence & Steiglitz 1972) with generally
better computational results (Schrack & Choit 1976). Hundreds of further
references can be found in (Devroye 1979).
The correctness of these methods is easily understood: condition (4.2)
guarantees that there is no systematic bias against any subset of S with positive
volume, and in particular no such bias against A_f(ε). Only sporadic results are
available on the rate of convergence (Rastrigin 1963), and the issue of a
stopping rule has not been resolved at all. Experiments confirm that these
methods are fast but rather unreliable in practice.
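Pure random search as just described takes only a few lines; uniform sampling over S trivially satisfies condition (4.2). The test function and domain below are illustrative choices:

```python
import random

def pure_random_search(f, sample, n_points):
    """Pure random search sketch: evaluate f at n_points random draws and
    keep the record value; condition (4.2), i.e. that no subset of positive
    volume is systematically missed, is what makes the record converge."""
    best_x, best_f = None, float("inf")
    for _ in range(n_points):
        x = sample()
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

# Illustrative example: uniform sampling over S = [-1, 1]^2,
# f(x, y) = x^2 + y^2 with global minimum 0 at the origin.
random.seed(0)
f = lambda p: p[0]**2 + p[1]**2
sample = lambda: (random.uniform(-1, 1), random.uniform(-1, 1))
x_best, f_best = pure_random_search(f, sample, 5000)
```

Note that the record improves quickly at first but that each further digit of accuracy requires an exponentially growing number of samples, in line with the fast-but-unreliable behaviour reported above.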
A deterministic global descent approach is hard to imagine, unless one is
prepared to impose stringent conditions on f. There is, none the less, an
interesting class of approaches, both deterministic and stochastic, that is based
on appropriate perturbations of the steepest descent trajectory
dx(t)/dt = −g(x(t)) ,  (4.5)
where g(x) is the gradient of f at x.
Deterministically, one tries to perturb (4.5) in such a way that the method
cannot converge to local minima outside Af(e), i.e.
d²x(t)/dt² = −s(f(x(t))) g(x(t)) ,  (4.6)
dx(t)/dt = −g(x(t)) + ε(t) dw(t)/dt ,  (4.8)
where w(t) is an n-dimensional Wiener process. (If ε(t) = ε₀ for all t, this is
known among physicists as the Smoluchovski-Kramers equation.) The solution
to (4.8) is a stochastic process x(t) that has the following useful property
(Schuss 1980): if ε(t) = ε₀ for all t, then the probability density function p(z) of
The global descent techniques discussed in the previous section were partially
inspired by steepest descent, but did not really exploit the full power of such
a local search procedure in spite of the fact that the global minimum can
obviously be found among the local ones. Hence, a logical next step is to
investigate the possibility of generating a sequence of local minima with
decreasing function values. Ideally, the last local minimum in the sequence
generated by such an algorithm should demonstrably be the global one.
A well-known deterministic method belonging to this class that has great
intuitive appeal is the tunneling method (Levy & Gomez 1980). Actually, this
method can be viewed as a generalization of the deflation technique (Goldstein
& Price 1971) to find the global minimum of a one dimensional polynomial.
This latter method works as follows.
Consider a local minimum x* of a one dimensional polynomial f, and define
L(x) = (f(x) − f(x*)) / (x − x*)² ,  (5.1)
where x_1*, …, x_k* are all local minima with function value equal to f(x*) found
in previous iterations, and λ₀, λ_i are positive parameters of the method.
Subtracting f(x*) from f(x) eliminates all points satisfying f(x) > f(x*) as a
possible solution. The term ∏_{i=1}^{k} ‖x − x_i*‖^{λ_i} is introduced to prevent the
algorithm from choosing the previously found minima as a solution. To prevent
the zero finding algorithm from converging to a stationary point of

(f(x) − f(x*)) / ∏_{i=1}^{k} ‖x − x_i*‖^{λ_i} ,  (5.3)

which is not a zero of (5.2), the term ‖x − x_m‖^{λ₀} is added, with x_m chosen
appropriately.
If the global minimum has been found, then (5.2) will become positive for all
x. Therefore the method stops if no zero of (5.2) can be found.
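The role of the denominator can be illustrated on a one-dimensional sketch; the function, exponents and known minimum below are illustrative choices, not from the text:

```python
def tunneling_function(f, f_star, minima, lambdas, x):
    """Tunneling-type function sketch: positive where f(x) > f_star, zero
    at any new point where f reaches f_star, with the already known minima
    x_i* divided out so that a zero finder cannot return to them."""
    num = f(x) - f_star
    den = 1.0
    for x_i, lam in zip(minima, lambdas):
        den *= abs(x - x_i) ** lam
    return num / den

# f(x) = (x^2 - 1)^2 with known minimum x_1* = 1 (f* = 0): dividing out
# (x - 1)^2 leaves (x + 1)^2, whose zero is the remaining minimum x = -1.
f = lambda x: (x**2 - 1.0)**2
T = lambda x: tunneling_function(f, 0.0, [1.0], [2.0], x)
```

Here the tunneling function is identically (x + 1)², so a zero finder started anywhere will be driven to the as yet undiscovered minimum at x = −1.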
The tunneling method has the advantage that, provided that the local search
procedure is of the descent type, a local minimum with smaller function value
is located in each iteration. Hence, it is likely that a point with small function
value will be found relatively quickly. However, a major drawback of the
method is that it is difficult to be certain that the search for the global
minimum has been sufficiently thorough. In essence, the tunneling method
only reformulates the problem: rather than solving the original minimization
problem, one now must prove that the tunneling function does not have a zero.
This, however, is once again a global problem which is strongly related to the
original one. The information gained during the foregoing iterations is of no
obvious use in solving this new global problem, which therefore appears to be
as hard to solve as the original one. Thus, lacking any sort of guarantee, the
method is at best of some heuristic value. A decent stopping rule is hard to find
and even harder to justify.
A final method in this category can be criticized on similar grounds. It is the
method of filled functions, developed in (Ge Renpu 1983, 1984). In its simplest
form, for a given sequence of local minima x_1*, …, x_k*, the method considers
the auxiliary function f̃(x) = exp(−‖x − x_k*‖²/ρ²)/(r + f(x)). For an appropriate
choice of parameters r and ρ, it can be shown that a method of local descent
applied to f̃ will lead to a point x̄_{k+1}, starting from which a descent procedure
applied to f will arrive at a better local minimum x_{k+1}* (i.e., with f(x_{k+1}*) <
f(x_k*)), if such an improved local minimum exists. Unfortunately, appropriate
values for the parameters have to be based on information about f that is not
readily available. As in the case of the tunneling method, it seems to be
difficult to identify a significant class of objective functions for which
convergence to the global minimum can be guaranteed.
The final class of methods that we shall consider is superficially the most
naive one: it is based on the observation that enumeration of all local minima
will certainly lead to the detection of the global one. Of course, it is easy to
conceive of test functions where the number of local minima is so large as to
render such an approach inherently useless. Barring such examples, however,
this class contains some of the computationally most successful global
optimization techniques.
We start with a discussion of deterministic approaches based on this idea. As
in (Branin 1972), we consider the differential equation
dg(x(t))/dt = μ g(x(t)) , where μ ∈ {−1, +1} .  (6.1)
Clearly, if μ = −1, then the points x(t) satisfying (6.2) will tend to a stationary
point with increasing t. If μ = +1, then the points will move away from that
stationary point.
In the region of x space where the Hessian H of f is nonsingular, (6.1) is
identical to the Newton equation
dx(t)/dt = μ H⁻¹(x(t)) · g(x(t)) .  (6.3)
The differential equation (6.3) has the disadvantage that it is not defined in
regions where the determinant det H(.) is zero. To overcome this problem let
the adjoint matrix Adj H be defined by H -1 det H. A transformation of the
parameter t enables (6.3) to be written as
dx(t)/dt = μ Adj H(x(t)) · g(x(t)) .  (6.4)
Equation (6.4) has several nice features. It defines a sequence of points on the
curve x(t), where the step from the k-th to the (k + 1)-th point of this sequence
is equivalent to a Newton step with variable stepsize and alternating sign, i.e.
±λ H⁻¹(x(t)) · g(x(t)) (Branin 1972; Hardy 1975). This sequence of points on
the curve x(t) solving (6.4) has to terminate in a stationary point of f (Branin
calls this an essential singularity), or in a point where Adj H(x(t)) · g(x(t)) = 0
although g(x(t)) ≠ 0 (an extraneous singularity).
A global optimization method would now be to follow the curve x(t) solving
(6.4) in the above way for a given sign from any starting point, until a
singularity is reached. Here, some extrapolation device is adopted to pass this
singularity. If necessary, the sign of (6.4) is then changed, after which the
process of generating points on the curve solving (6.4) is continued. The
trajectory followed in this way depends, of course, on the starting point and on
g(x(0)). It can be shown that the trajectories either lead to the boundary of S
or return to the starting point along a closed curve (Gomulka 1975, 1978).
It has been conjectured that in the absence of extraneous singularities all
trajectories pass through all stationary points. The truth of this conjecture has
not yet been determined, nor is it known what conditions on f are necessary to
avoid extraneous singularities. An example of a function of simple geometrical
shape that has an extraneous singularity is described in (Treccani 1975).
To arrive at a better understanding of this method, we note again that at
each point on a trajectory the gradient is proportional to a fixed vector g(x(0))
(cf. (6.2)). Hence, for a particular choice of x(0) the trajectory points are a
subset of the set C(x(0)) = {x ∈ S | g(x) = λg(x(0)), λ ∈ ℝ}, which of course
contains all stationary points (take λ = 0). Unfortunately, C(x(0)) can consist
of several components, only one of which coincides with the trajectory that will
be followed. All stationary points could be found if it were possible to start
the method once in each component of C(x(0)), but it is not known how such
starting points could be obtained.
where g_j is a polynomial function. We can view this as a set of equations over
ℂⁿ and ask for a procedure to compute all the roots of (6.5), i.e., all stationary
points of f.
To initiate such a procedure (Garcia & Zangwill 1979), we choose integers
q(j) (j = 1, …, n) with the property that, for every sequence of complex
vectors with ‖z‖ → ∞, there exists an l ∈ {1, …, n} and a corresponding
infinite subsequence for which lim g_l(z)/(z_l^{q(l)} − 1) does not converge to a
negative real number. Intuitively, this means that z_l^{q(l)} − 1 dominates g_l(z).
Such integers q(j) can always be found.
Now, let us consider the system
For λ = 0, we obtain a trivial system with Q = ∏_{j=1}^{n} q(j) solutions z^{(h)}. For
λ = 1, we recover the original set of equations. The idea is to increase λ from 0
to 1 and trace the solution paths z^{(h)}(λ). Standard techniques from homotopy
theory can be used to show that some of these paths may diverge to infinity as
λ → 1 (the choice of q(j) precludes such behaviour for any λ ∈ (0, 1), as can
easily be seen), but the paths cannot cycle back and every root of (6.5) must
occur as the endpoint of some path. Simplicial pivoting techniques offer the
possibility to trace these paths to any degree of accuracy required.
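In one dimension the same task, enumerating all stationary points as roots of a polynomial, can be carried out directly via the companion matrix (which is what numpy.roots computes); this toy stand-in conveys the idea, while in n dimensions the homotopy method above is needed. The polynomial is an illustrative choice:

```python
import numpy as np

# All stationary points of the illustrative polynomial
# f(x) = x^4 - 2x^2 + 0.5x are the roots of f'(x) = 4x^3 - 4x + 0.5,
# obtained here as eigenvalues of the companion matrix of f'.
fprime = [4.0, 0.0, -4.0, 0.5]   # coefficients of f', highest degree first
stationary = sorted(r.real for r in np.roots(fprime) if abs(r.imag) < 1e-9)

f = lambda x: x**4 - 2.0 * x**2 + 0.5 * x
global_min = min(stationary, key=f)   # the global minimizer is among them
```

Since every stationary point is enumerated, picking the one with the lowest f-value settles global optimality, which is precisely what the enumeration philosophy exploits.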
We now turn to stochastic methods that enumerate all local minima. In most
methods of this type, a descent local search procedure P is initiated in some or
all points of a random sample drawn from S, in an effort to identify all the local
minima that are potentially global.
The simplest way to make use of the local search procedure P occurs in a
folklore method known as multistart. Here, P is applied to every point in a
sample, drawn from a uniform distribution over S, and the local minimum with
the lowest function value found in this way is the candidate value for y*.
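Multistart itself takes only a few lines; the local search procedure, test function and sample sizes below are illustrative stand-ins, not from the text:

```python
import random

def multistart(f, local_search, sample, n_starts):
    """Multistart sketch: apply the local search procedure P to every
    sampled point and keep the local minimum with lowest f-value."""
    best_x, best_f = None, float("inf")
    for _ in range(n_starts):
        x_star = local_search(sample())
        if f(x_star) < best_f:
            best_x, best_f = x_star, f(x_star)
    return best_x, best_f

# Illustrative two-well function: minima near -1 (global) and +1 (local).
f = lambda x: (x**2 - 1.0)**2 + 0.2 * x
grad = lambda x: 4.0 * x * (x**2 - 1.0) + 0.2

def local_search(x, step=0.02, iters=2000):   # crude stand-in for P
    for _ in range(iters):
        x -= step * grad(x)
    return x

random.seed(1)
x_best, f_best = multistart(f, local_search, lambda: random.uniform(-2, 2), 20)
```

With 20 uniform starts the global well near x = −1 is all but certain to be hit, but note that both basins are rediscovered many times over.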
It is easy to see that this candidate value (or, indeed, already the smallest
function value in the sample prior to application of P) converges to y* with
probability 1 under very weak smoothness assumptions on f (Rubinstein 1981).
At the same time, it is obvious that this method is very wasteful from a
computational point of view, in that the same local minimum is discovered
again and again. Before discussing possible improvements, we first consider the
question of an appropriate stopping rule.
An interesting analysis of multistart was initiated in (Zielinski 1981) and
extended in (Boender & Zielinski 1982; Boender & Rinnooy Kan 1983;
Boender 1984). It is based on a Bayesian estimate of the number of local
minima W and of the relative size of each region of attraction θ_l =
m(R(x_l*))/m(S), l = 1, …, W, where a region of attraction R(x_l*) is defined to
be the set of all points in S starting from which P will arrive at x_l*.
Why are the above parameters W and θ_l important? It suffices to observe
that, if their values are given, the outcome of an application of Multistart is
easy to analyze. We can view the procedure as a series of experiments in which
a sample from a multinomial distribution is taken. Each cell of the distribution
corresponds to a minimum x_l*; the cell probability is equal to the probability
that a uniformly sampled point will be allocated to R(x_l*) by P, i.e. equal to the
corresponding θ_l. Thus, the probability that the l-th local minimum is found b_l
times (l = 1, …, W) in N trials is

(N! / ∏_{l=1}^{W} b_l!) ∏_{l=1}^{W} θ_l^{b_l} .  (6.7)
It is impossible, however, to distinguish between outcomes that are identical
up to a relabeling of the minima. Thus, we have to restrict ourselves to
distinguishable aggregates of the random events that appear in (6.7). To
calculate the probability that w different local minima are found during N local
searches and that the i-th minimum is found a_i times (a_i > 0, Σ_{i=1}^{w} a_i = N),
let c_j be the number of a_i's equal to j and let S_W(w) denote the set of all
permutations of w different elements of {1, 2, …, W}. The required
probability is then given by (Boender 1984)

(N! / (∏_{j=1}^{N} c_j! ∏_{i=1}^{w} a_i!)) Σ_{(l_1, …, l_w) ∈ S_W(w)} ∏_{i=1}^{w} θ_{l_i}^{a_i} .  (6.8)
One could use (6.8) to obtain a maximum likelihood estimate of the unknown
parameters. A Bayesian analysis leads to a posterior expected number of local
minima equal to

w(N − 1) / (N − w − 2) .  (6.10)
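The estimate (6.10) is cheap to evaluate, and its gap above w gives a natural stopping signal for Multistart; the numbers below are arbitrary illustrative values:

```python
def expected_minima(w, N):
    """Posterior expected number of local minima (6.10) after N local
    searches have found w distinct ones; requires N > w + 2."""
    return w * (N - 1) / (N - w - 2)

# If 100 searches uncovered only 4 distinct minima, the estimate stays
# close to 4, suggesting that few minima remain undetected.
est = expected_minima(4, 100)   # 4 * 99 / 94
```

Here the estimate is about 4.21, barely above the observed w = 4, whereas 10 searches yielding 4 minima would give 9, a strong hint to keep searching.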
The basic idea behind clustering methods is to start from a uniform sample
from S, to create groups of mutually close points that correspond to the
relevant regions of attractions, and to start P no more than once in every such
region. Two ways to create such groups from the initial sample have been
proposed. The first, called reduction (Becker & Lago 1970), removes a certain
fraction, say 1 − γ, of the sample points with the highest function values.
second, called concentration (Törn 1978), transforms the sample by allowing
one or at most a few steepest descent steps from every point.
Note that these transformations do not necessarily yield groups of points that
correspond to regions of attraction of f. For instance, if we define the level set
L(y) to be {x ∈ S | f(x) ≤ y}, then the groups created by sample reduction
correspond to the connected components of a level set, and these do not
necessarily correspond to regions of attraction. We return to this problem later.
Several ways have been proposed to identify the clusters of points that result
from either sample reduction or concentration. The basic framework, however,
is always the same. Clusters are formed in a stepwise fashion, starting from a
seed point, which may be the unclustered point with lowest function value or
the local minimum found by applying P to this point. Points are added to the
cluster through application of a clustering rule until a termination criterion is
satisfied. Where the methods differ is in their choice of the clustering rule and
of the corresponding termination criterion.
As elsewhere in geometrical cluster analysis, the two most popular clustering
rules are based on recognizing clusters by means of either the density (Törn
1976) or the distances among the points. In the former case, the termination
criterion will refer to the fact that the number of points per unit volume within
the cluster should not drop below a certain level. In the latter case, it will
express that a certain critical distance between pairs of closest points within the
cluster should not be exceeded.
Unlike the usual clustering problem, we know here that the points originally
come from a uniform distribution. Can this be exploited in the analysis? If the
clusters are based on concentration of the sample, the original distribution can
no longer be recognized. However, if the clusters have been formed by
reduction of the sample, then within each cluster the original uniform
distribution still holds. Moreover, as pointed out above, the clusters correspond to
connected components of a level set of a continuously differentiable function.
These two observations provide a powerful tool in obtaining statistically correct
termination criteria.
Since the connected components of a level set can be of any geometrical
shape, the clustering procedure should be able to produce clusters of widely
varying geometrical shape. Such a procedure, single linkage clustering, was
described in (Boender et al. 1982; Timmer 1984). In this method, clusters are
formed by adding the closest unclustered point to the current cluster until the
distance of this point to the cluster exceeds the critical distance. (The distance
between a point and a cluster is defined as the distance of this point to the
closest point in the cluster.) If no more points can be added to the cluster, the
cluster is terminated. Provided that there are still points to be clustered, P is
then applied to the unclustered sample point with smallest function value, and
a new cluster is formed around the resulting local minimum.
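The clustering rule just described can be sketched as follows; for simplicity the seed is taken to be the next unclustered point rather than a local minimum, and the sample points are illustrative:

```python
import math

def single_linkage(points, r):
    """Single linkage sketch: grow a cluster by repeatedly adding the
    unclustered point closest to it (distance to the nearest member),
    until that distance exceeds the critical distance r."""
    unclustered = list(points)
    clusters = []
    while unclustered:
        cluster = [unclustered.pop(0)]        # seed of the next cluster
        while unclustered:
            dists = [min(math.dist(p, q) for q in cluster) for p in unclustered]
            i = dists.index(min(dists))
            if dists[i] > r:                  # termination criterion
                break
            cluster.append(unclustered.pop(i))
        clusters.append(cluster)
    return clusters

# Two well-separated groups of illustrative sample points.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0)]
groups = single_linkage(pts, 0.3)
```

Because only nearest-neighbour distances enter, the clusters can follow components of a level set of arbitrary geometrical shape, which is exactly the property argued for above.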
Actually, the method is implemented in an iterative fashion, where points are
sampled in groups of fixed size, say N, and the expanded sample is clustered
(after reduction) from scratch in every iteration. If X* denotes the set of local
minima that have been found in previous iterations, then the elements of X*
are used as the seed points for the first clusters. If, for some σ > 0, the critical
distance in iteration k is chosen to be

r_k = π^{−1/2} ( Γ(1 + ½n) · m(S) · (σ log kN)/(kN) )^{1/n} ,

then for any σ > 0 a local minimum will be found by single linkage in every
connected component of L(y_γ) in which a point has been sampled, within a
finite number of iterations with probability 1 (Timmer 1984; Rinnooy Kan &
Timmer 1987a).
In order to assign a point to a cluster in single linkage, it suffices that this
point is close to one point already in the cluster. In principle it should be
possible to design superior methods by using information of more than two
sample points simultaneously. In (Timmer 1984) an approach is suggested in
which S is partitioned into small hypercubes or cells. If there are kN sample
points, a cell A is said to be full if it contains more than ½ m(A)kN/m(S)
reduced sample points, i.e. more than half the expected number of sample
points in A. If a cell is not full, it is called empty. Intuitively, one would think
that if the cells get smaller with increasing sample size, then each component of
L(y_γ) can be approximated by a set of full cells, and different components of
L(y_γ) will be separated by a set of empty cells. Hence, we could let a cluster
correspond to a connected subset of S which coincides with a number of full
cells. These clusters can be found by applying a single linkage type algorithm to
the full cells, such that if two cells are neighbours, then they are assigned to the
same cluster. Note that in this approach a cluster corresponds to a set of cells
instead of a set of reduced sample points.
The mode analysis method, based on this idea and originally proposed in
(Timmer 1984), is again iterative. In iteration k, for some σ > 0, S is divided
into cells of measure (m(S)σ log kN)/(kN). After sample reduction, it is
determined which cells are full. Sequentially, a seed cell is chosen and a cluster
is grown around this seed cell such that each full cell which is a neighbour of a
cell already in the cluster is added to the cluster until there are no more such
cells.
Initially, the cells that contain local minima previously found are chosen as
seed cells. If this is no longer possible, P is applied to the point x̄ which has the
smallest function value among the reduced sample points which are in
unclustered full cells. The cell containing x̄ is then chosen as the next seed cell.
The properties of this mode analysis method are comparable to the prop-
erties of single linkage. In (Timmer 1984; Rinnooy Kan & Timmer 1987a) it is
shown that if σ > 20 (this is not the sharpest possible bound), then, even if the
sampling continues forever, the total number of local searches ever applied in
mode analysis is finite with probability 1. If σ > 0, then in every connected
component of L(y_γ) in which a point has been sampled, a local minimum will
be found by mode analysis in a finite number of iterations with probability 1.
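The cell bookkeeping behind mode analysis can be sketched as follows. This is an illustrative reconstruction, not the implementation of (Timmer 1984): the unit hypercube S = [0, 1]^n, the equally sized grid cells, and the "more than half the expected count" fullness rule are simplifying assumptions made here.

```python
import itertools
from collections import Counter

def cell_index(x, cells_per_dim):
    # Map a point of S = [0, 1]^n to the index tuple of the cell containing it.
    return tuple(min(int(c * cells_per_dim), cells_per_dim - 1) for c in x)

def full_cells(reduced_points, total_points, cells_per_dim, n):
    # A cell is "full" if it holds more than half the number of sample points
    # one would expect in it under uniform sampling; otherwise it is "empty".
    counts = Counter(cell_index(x, cells_per_dim) for x in reduced_points)
    expected = total_points / cells_per_dim ** n
    return {c for c, cnt in counts.items() if cnt > 0.5 * expected}

def neighbours(cell):
    # All cells whose index differs by at most 1 in every coordinate.
    for delta in itertools.product((-1, 0, 1), repeat=len(cell)):
        if any(delta):
            yield tuple(i + d for i, d in zip(cell, delta))

def grow_clusters(full):
    """Grow clusters of full cells: starting from a seed cell, repeatedly add
    every full cell that neighbours a cell already in the cluster."""
    clusters, unassigned = [], set(full)
    while unassigned:
        seed = unassigned.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            c = frontier.pop()
            for d in neighbours(c):
                if d in unassigned:
                    unassigned.remove(d)
                    cluster.add(d)
                    frontier.append(d)
        clusters.append(cluster)
    return clusters
```

Each resulting cluster of full cells then approximates one connected component of the level set, and P is applied once per cluster.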
Both single linkage and mode analysis share one major deficiency. The
clusters formed by these methods will (at best) correspond to the connected
components of a level set instead of regions of attraction. Although a region of
attraction cannot intersect with several components of a level set, it is
obviously possible that a component contains more than one region of attrac-
tion. Since only one local search is started in every cluster, it is therefore
possible that a local minimum will not be found although its region of
attraction contains a (reduced) sample point.
We conclude that both single linkage and mode analysis lack an important
guarantee that multistart was offering, namely that if a point is sampled in
R(x*), then the local minimum x* will be found.
We conclude this section by discussing methods that combine the computa-
tional efficiency of clustering methods with the theoretical virtues of multistart.
These multi-level methods exploit the function values in the sample points to
derive additional information about the structure of f. These methods are again
applied iteratively to an expanding sample. In the multi-level single linkage
method, the local search procedure P is applied to every sample point, except if
there is another sample point within the critical distance which has a smaller
function value. Formulated in this way, the method does not even produce
clusters and it can indeed be applied to the entire sample without reduction or
concentration. Of course, it is still possible to use the reduced sample points
only, if one believes that it is unlikely that the global minimum will be found by
applying P to a sample point which does not belong to the reduced sample. If
desired, clusters can be constructed by associating a point to a local minimum if
there exists a chain of points linking it to that minimum, such that the distance
between each successive pair is at most equal to the critical distance and the
function value is decreasing along the chain. Clearly a point could in this way
be assigned to more than one minimum. This is perfectly acceptable.
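The selection rule just described is easily stated in code. The sketch below is an illustration with hypothetical names, not the authors' implementation; it merely selects the sample points from which the local search procedure P would be started.

```python
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mlsl_start_points(points, f, r):
    """Multi-level single linkage rule: start a local search from a sample
    point unless some other sample point within the critical distance r has
    a strictly smaller function value."""
    starts = []
    for x in points:
        dominated = any(f(y) < f(x) and dist(x, y) <= r
                        for y in points if y is not x)
        if not dominated:
            starts.append(x)
    return starts
```

For example, with f(x) = (x² - 1)² and sample points spread around the two minima at ±1, only the locally best point near each minimum survives the rule, so P is started once per region rather than once per sample point.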
In spite of its simplicity, the theoretical properties of multi-level single
linkage are quite strong. If the critical distance is chosen as in (6.11) and if
σ > 4, then even if sampling continues forever, the total number of local
searches ever started by multi-level single linkage is finite with probability 1
(Timmer 1984; Rinnooy Kan & Timmer 1987b). At the same time we can
prove, however, for every σ > 0 that all local minima (including the global
one) will be located in the long run. More precisely, consider the connected
component of the level set L(y) containing a local minimum x* and define the
basin Q(x*) as the connected component corresponding to the largest value for
y for which this component contains x* as its only stationary point. It is then
possible to show that in the limit, if a point is sampled in Q(x*), then the local
minimum x* will be found by the procedure. As a result, if σ > 0, then any
local minimum x* will be found by multi-level single linkage within a finite
number of iterations with probability 1.
We finally mention a method which is called multi-level mode analysis. This
method is a generalization of mode analysis in a similar way and for the same
reasons as multi-level single linkage is a generalization of single linkage. As in
mode analysis, S is partitioned into cells of measure (m(S)σ log kN)/(kN).
After sample reduction, it is determined which cells contain more than
½σ log kN reduced sample points; these are labeled full. For each full cell
the function value of the cell is defined to be equal to the smallest function
value of any of the sample points in the cell. Finally, for every full cell, P is
applied to a point in the cell except if the cell has a neighbouring cell which is
full and has a smaller function value. Note that we still reduce the sample.
Although this is no longer strictly necessary, it creates the extra possibility that
two regions of attraction are recognized as such only because a cell on the
boundary of both regions is empty.
For multi-level mode analysis essentially the same results hold as for
multi-level single linkage.
Since the methods described in this section and multistart result in the same
set of minima with a probability that tends to 1 with increasing sample size, we
can easily modify and use the stopping rules which were designed for multistart
(Boender & Rinnooy Kan 1987).
The limited computational experience available (cf. the next section) confirms
that this final class of stochastic methods does offer an attractive mix of
reliability and speed.
7. Concluding remarks
Table 7.1
Test functions (Dixon & Szegö 1978b)
GP Goldstein and Price
BR Branin (RCOS)
H3 Hartman 3
H6 Hartman 6
S5 Shekel 5
S7 Shekel 7
S10 Shekel 10
Table 7.2
Methods tested
Description References Computational results
A Trajectory (Branin & Hoo 1972) (Gomulka 1978)
B Density clustering (concentration) (Törn 1976) (Törn 1978)
C Density clustering (de Biase & Frontini 1978) (de Biase & Frontini 1978)
D Multi-level single linkage (Timmer 1984) (Timmer 1984)
E Random direction (Bremmerman 1970) (Dixon & Szegö 1978b)
F Controlled random search (Price 1978) (Price 1978)
Table 7.3
Number of function evaluations
Function
Method GP BR H3 H6 S5 S7 S10
A – – – – 5500 5020 4860
B 2499 1558 2584 3447 3649 3606 3874
C 378 597 732 807 620 788 1160
D 148 206 197 487 404 432* 564
E 300 160 420L 515 375L 405L 336L
F 2500 1800 2400 7600 3800 4900 4400
L: The method did not find the global minimum.
*: The global minimum was not found in one of the four runs.
Table 7.4
Number of units standard time
Function
Method GP BR H3 H6 S5 S7 S10
A – – – – 9 8.5 9.5
B 4 4 8 16 10 13 15
C 15 14 16 21 23 20 30
D 0.15 0.25 0.5 2 1 1* 2
E 0.7 0.5 2L 3 1.5L 1.5L 2L
F 3 4 8 46 14 20 20
L: The method did not find the global minimum.
*: The method did not find the global minimum in one of the four runs.
be possible to improve it even further by developing an appropriate rule to
modify the sampling distribution as the procedure continues. This, however,
would seem to presuppose a specific global model of the function, as would any
attempt to involve function values directly in the stopping rule.
At the end of this chapter, it is appropriate to conclude that global
optimization as a research area is still on its way to maturity. The variety of
techniques proposed is impressive, but their relative merits have neither been
analyzed in a systematic manner nor properly investigated in computational
experiments. In view of the significance of the global optimization problem, it
is to be hoped that this challenge will meet with an appropriate response in the
near future.
References
Al-Khayyal, F., R. Kertz, and G.A. Tovey (1985), Algorithms and diffusion approximations for
nonlinear programming with simulated annealing, to appear.
Aluffi-Pentini, F., V. Parisi and F. Zirilli (1985), Global optimization and stochastic differential
equations, Journal of Optimization Theory and Applications 47, 1-17.
Anderssen, R.S. and P. Bloomfield (1975), Properties of the random search in global optimization,
Journal of Optimization Theory and Applications 16, 383-398.
Archetti, F. and B. Betro (1978a), A priori analysis of deterministic strategies for global
optimization, in (Dixon & Szegö 1978a).
Archetti, F. and B. Betro (1978b), On the effectiveness of uniform random sampling in global
optimization problems, Technical Report, University of Pisa, Pisa, Italy.
Archetti, F. and B. Betro (1979a), Stochastic models in optimization, Bollettino Matematica
Italiana.
Archetti, F. and B. Betro (1979b), A probabilistic algorithm for global optimization, Calcolo 16, 335-343.
Archetti, F. and F. Frontini (1978), The application of a global optimization method to some
technological problems, in (Dixon & Szegö 1978a).
Basso, P. (1985), Optimal search for the global maximum of functions with bounded seminorm,
SIAM Journal of Numerical Analysis 22, 888-903.
Becker, R.W. and G.V. Lago (1970), A global optimization algorithm, in: Proceedings of the 8th
Allerton Conference on Circuits and Systems Theory.
Betro, B. (1981), Bayesian testing of nonparametric hypotheses and its application to global
"optimization, Technical Report, CNR-IAMI, Italy.
Betro, B. and F. Schoen (1987), Sequential stopping rules for the multistart algorithm in global
optimization, Mathematical Programming 38, 271-280.
Boender, C.G.E. and R. Zielinski (1982), A sequential Bayesian approach to estimating the
dimension of a multinomial distribution, Technical Report, The Institute of Mathematics of the
Polish Academy of Sciences.
Boender, C.G.E., A.H.G. Rinnooy Kan, L. Stougie and G.T. Timmer (1982), A stochastic
method for global optimization, Mathematical Programming 22, 125-140.
Boender, C.G.E. and A.H.G. Rinnooy Kan (1983), A Bayesian analysis of the number of cells of
a multinomial distribution, The Statistician 32, 240-248.
Boender, C.G.E. and A.H.G. Rinnooy Kan (1987), Bayesian stopping rules for multistart global
optimization methods, Mathematical Programming 37, 59-80.
Boender, C.G.E. (1984), The generalized multinomial distribution: A Bayesian analysis and
applications, Ph.D. Dissertation, Erasmus Universiteit Rotterdam (Centrum voor Wiskunde en
Informatica, Amsterdam).
Branin, F.H. (1972), Widely convergent methods for finding multiple solutions of simultaneous
nonlinear equations, IBM Journal of Research and Development, 504-522.
Branin, F.H. and S.K. Hoo (1972), A method for finding multiple extrema of a function of n
variables, in: F.A. Lootsma (ed.), Numerical Methods of Nonlinear Optimization (Academic
Press, London).
Bremmerman, H. (1970), A method of unconstrained global optimization, Mathematical Bio-
sciences 9, 1-15.
Cramer, H. and M.R. Leadbetter (1967), Stationary and Related Stochastic Processes (Wiley, New
York).
De Biase, L. and F. Frontini (1978), A stochastic method for global optimization: its structure and
numerical performance, in (Dixon & Szegö 1978a).
Devroye, L. (1979), A bibliography on random search, Technical Report SOCS, McGill University, Montreal.
Diener, I. (1986), Trajectory nets connecting all critical points of a smooth function, Mathematical
Programming 36, 340-352.
Dixon, L.C.W. (1978), Global optima without convexity, Technical report, Numerical Optimization Centre, Hatfield Polytechnic, Hatfield, England.
Dixon, L.C.W. and G.P. Szegö (eds.) (1975), Towards Global Optimization (North-Holland,
Amsterdam).
Dixon, L.C.W. and G.P. Szegö (eds.) (1978a), Towards Global Optimization 2 (North-Holland,
Amsterdam).
Dixon, L.C.W. and G.P. Szegö (1978b), The global optimization problem, in (Dixon & Szegö
1978a).
Evtushenko, Y.P. (1971), Numerical methods for finding global extrema on a nonuniform mesh,
U.S.S.R. Computing Machines and Mathematical Physics 11, 1390-1403.
Fedorov, V.V. (ed.) (1985), Problems of Cybernetics, Models and Methods in Global Optimization.
(USSR Academy of Sciences, Council of Cybernetics, Moscow).
Garcia, C.B. and W.I. Zangwill (1978), Determining all solutions to certain systems of nonlinear
equations, Mathematics of Operations Research 4, 1-14.
Garcia, C.B. and F.J. Gould (1980), Relations between several path following algorithms and local
global Newton methods, SIAM Review 22, 263-274.
Gaviani, M. (1975), Necessary and sufficient conditions for the convergence of an algorithm in
unconstrained minimization. In (Dixon & Szegö 1975).
Ge Renpu (1983), A filled function method for finding a global minimizer of a function of several
variables, Dundee Biennial Conference on Numerical Analysis.
Ge Renpu (1984), The theory of the filled function method for finding a global minimizer of a
nonlinearly constrained minimization problem, SIAM Conference on Numerical Optimization,
Boulder, CO.
Goldstein, A.A. and J.F. Price (1971), On descent from local minima. Mathematics of Computa-
tion 25, 569-574.
Gomulka, J. (1975), Remarks on Branin's method for solving nonlinear equations, in (Dixon &
Szegö 1975).
Gomulka, J. (1978), Two implementations of Branin's method: numerical experiences, in (Dixon
& Szegö 1978a).
Griewank, A.O. (1981), Generalized descent for global optimization, Journal of Optimization
Theory and Applications 34, 11-39.
Hardy, J.W. (1975), An implemented extension of Branin's method, in (Dixon & Szegö 1975).
Horst, R. (1986), A general class of branch-and-bound methods in global optimization with some
new approaches for concave minimization, Journal of Optimization Theory and Applications 51,
271-291.
Horst, R. and H. Tuy (1987), On the convergence of global methods in multiextremal optimiza-
tion, Journal of Optimization Theory and Applications 54, 253-271.
Kalantari, B. (1984), Large scale global minimization of linearly constrained concave quadratic
functions and related problems, Ph.D. Thesis, Computer Science Department, University of
Minnesota.
Kirkpatrick, S., C.D. Gelatt Jr. and M.P. Vecchi (1983), Optimization by simulated annealing,
Science 220, 671-680.
Kushner, H.J. (1964), A new method for locating the maximum point of an arbitrary multipeak
curve in the presence of noise, Journal of Basic Engineering 86, 97-106.
Lawrence, J.P. and K. Steiglitz (1972), Randomized pattern search, IEEE Transactions on
Computers 21, 382-385.
Levy, A. and S. Gomez (1980), The tunneling algorithm for the global optimization problem of
constrained functions, Technical Report, Universidad Nacional Autonoma de Mexico.
McCormick, G.P. (1983), Nonlinear Programming: Theory, Algorithms and Applications (John
Wiley and Sons, New York).
Mentzer, S.G. (1985), Private communication.
Mladineo, R.H. (1986), An algorithm for finding the global maximum of a multimodal, multivariate function, Mathematical Programming 34, 188-200.
Pardalos, P.M., and J.B. Rosen (1986), Methods for global concave minimization: A bibliographic
survey, SIAM Review 28, 367-379.
Pinter, J. (1983), A unified approach to globally convergent one-dimensional optimization
algorithms, Technical report, CNR-IAMI, Italy.
Pinter, J. (1986a), Globally convergent methods for n-dimensional multi-extremal optimization,
Optimization 17, 187-202.
Pinter, J. (1986b), Extended univariate algorithms for n-dimensional global optimization, Comput-
ing 36, 91-103.
Pinter, J. (1988), Branch and Bound algorithms for solving global optimization problems with
Lipschitzian structure, Optimization 19, 101-110.
Price, W.L. (1978), A controlled random search procedure for global optimization, in (Dixon &
Szegö 1978a).
Rastrigin, L.A. (1963), The convergence of the random search method in the extremal control of a
many-parameter system, Automation and Remote Control 24, 1337-1342.
Ratschek, H. (1985), Inclusion functions and global optimization, Mathematical Programming 33,
300-317.
Rinnooy Kan, A.H.G. and G.T. Timmer (1987a), Stochastic global optimization methods. Part I:
Clustering methods, Mathematical Programming 39, 27-56.
Rinnooy Kan, A.H.G. and G.T. Timmer (1987b), Stochastic global optimization methods. Part II:
Multi-level methods, Mathematical Programming 39, 57-78.
Rosen, J.B. (1983), Global minimization of a linearly constrained concave function by partition of
feasible domain, Mathematics of Operations Research 8, 215-230.
Rubinstein, R.Y. (1981), Simulation and the Monte Carlo Method (John Wiley & Sons, New York).
Schrack, G. and M. Choit (1976), Optimized relative step size random searches, Mathematical
Programming 16, 230-244.
Schuss, Z. (1980), Theory and Applications of Stochastic Differential Equations (John Wiley and
Sons, New York).
Shepp, L.A. (1979), The joint density of the maximum and its location for a Wiener process with
drift, Journal of Applied Probability 16, 423-427.
Shubert, B.O. (1972), A sequential method seeking the global maximum of a function, SIAM
Journal on Numerical Analysis 9, 379-388.
Sobol, I.M. (1982), On an estimate of the accuracy of a simple multidimensional search, Soviet
Math. Dokl. 26, 398-401.
Solis, F.J. and R.J.E. Wets (1981), Minimization by random search techniques, Mathematics of Operations Research 6, 19-30.
Strongin, R.G. (1978), Numerical Methods for Multiextremal Problems (Nauka, Moscow).
Sukharev, A.G. (1971), Optimal strategies of the search for an extremum, Computational
Mathematics and Mathematical Physics 11, 119-137.
Timmer, G.T. (1984), Global optimization: A stochastic approach, Ph.D. Dissertation, Erasmus
University Rotterdam.
Törn, A.A. (1976), Cluster analysis using seed points and density determined hyperspheres with
an application to global optimization, in: Proceedings of the Third International Conference on
Pattern Recognition, Coronado, CA.
Törn, A.A. (1978), A search clustering approach to global optimization, in (Dixon & Szegö
1978a).
Treccani, G. (1975), On the critical points of continuously differentiable functions, in (Dixon &
Szegö 1975).
Tuy, H. (1987), Global optimization of a difference of two convex functions, Mathematical
Programming Study 30, 150-182.
Walster, G.W., E.R. Hansen and S. Sengupta (1985), Test results for a global optimization
algorithm, in: P.T. Boggs and R.B. Schnabel (eds.), Numerical Optimization 1984 (SIAM,
Philadelphia, PA).
Woerlee, A. (1984), A Bayesian global optimization method, Masters Thesis, Erasmus University
Rotterdam.
Wood, G.R. (1985), Multidimensional bisection and global minimization, Technical report,
University of Canterbury.
Zielinski, R. (1981), A stochastic estimate of the structure of multi-extremal problems, Mathematical Programming 21, 348-356.
Zilinskas, A. (1982), Axiomatic approach to statistical models and their use in multimodal
optimization theory, Mathematical Programming 22, 104-116.
Zilverberg, N.D. (1983), Global minimization for large scale linear constrained systems, Ph.D.
Thesis, Computer Science Dept., University of Minnesota.