
G.L. Nemhauser et al., Eds., Handbooks in OR & MS, Vol. 1
© Elsevier Science Publishers B.V. (North-Holland) 1989

Chapter IX

Global Optimization

A.H.G. Rinnooy Kan
Econometric Institute, Erasmus University, Rotterdam, The Netherlands

G.T. Timmer
Econometric Institute, Erasmus University, Rotterdam, and ORTEC Consultants, Rotterdam, The Netherlands

1. Introduction

Many problems can be posed as mathematical programming problems, i.e. problems in which an objective function, depending on a number of decision variables, has to be optimized subject to a set of constraints. This has been appreciated since the Second World War and, together with the introduction of high speed computers, it has led to major research activities in many countries. The aim of these activities has been to develop efficient methods to solve subclasses of the above problem. Many criteria according to which subclasses can be defined have been proposed: are there any stochastic elements in the problem formulation or not, are the decision variables allowed to take on any real value or are they constrained to be, say, integers, do the decision variables appear in a linear or in a nonlinear way in the problem formulation, and so on. Here, we will restrict our discussion to nonlinear programming problems in which both the objective function and the constraints, formulated in terms of functions of the decision variables that have to be equal to or smaller than some constant, depend on the decision variables in a continuous but possibly nonlinear way.
Let $f: \mathbb{R}^n \to \mathbb{R}$ be a continuous real valued objective function and let $G \subset \mathbb{R}^n$ be the feasible region, i.e. the region in which none of the constraints is violated. Our starting point is the observation that most of the many nonlinear programming methods that have been developed aim for a local optimum (say a local minimum), i.e. a point $x^* \in G$ such that there exists a neighbourhood $B$ of $x^*$ with

$$f(x^*) \leq f(x) \quad \forall x \in B \cap G. \tag{1.1}$$

In general, however, several local optima may exist and the corresponding


function values may differ substantially. The problem of designing algorithms
that distinguish between these local optima and locate the best possible one is
known as the global optimization problem, and forms the subject of this
chapter.
In the absence of reliable codes for the global optimization problem most
problems are not modelled as such. Many problems, however, are of a global
nature. This is especially true for many technical design problems (Dixon &
Szegö 1978b; Archetti & Frontini 1978). Economic applications, where mul-
timodal cost functions have to be minimized, have also been reported (Archetti
& Frontini 1978). Another global optimization problem often encountered in
econometrics is that of locating the global maximum of a likelihood function.
Thus, there is no need to dwell on the practical usefulness of quick and reliable
methods to solve the global optimization problem.
In the case where there are no constraints, the global optimization problem is to find the global optimum (say the global minimum) $x_*$ of a real valued objective function $f: \mathbb{R}^n \to \mathbb{R}$, i.e. to find a point $x_* \in \mathbb{R}^n$ such that

$$f(x_*) \leq f(x) \quad \forall x \in \mathbb{R}^n. \tag{1.2}$$

Unless stated otherwise, we will assume $f$ to be twice continuously differentiable. For obvious computational reasons, one usually assumes that a set $S \subset \mathbb{R}^n$, which is convex, compact and contains the global minimum as an interior point, is specified in advance. None the less, the problem to find

$$y_* = \min_{x \in S} f(x) \tag{1.3}$$

remains essentially one of unconstrained optimization.


Any method for global optimization has to account for the fact that a numerical procedure can never produce more than approximate answers. Thus, the global optimization problem might be considered solved if, for some $\varepsilon > 0$, an element of one of the following sets has been identified (Dixon 1978):

$$A_x(\varepsilon) = \{x \in S \mid \|x - x_*\| < \varepsilon\}, \tag{1.4}$$

$$A_f(\varepsilon) = \{x \in S \mid |f(x) - f(x_*)| \leq \varepsilon\}. \tag{1.5}$$

A disadvantage of the first mentioned possibility (1.4) is that small perturbations in the problem data may have major effects on the location of $x_*$ (Archetti & Betro 1978a). (The problem is not well posed.) A third possibility (Betro 1981) is obtained by defining

$$\phi(y) = \frac{m(\{z \in S \mid f(z) \leq y\})}{m(S)}, \tag{1.6}$$

where $m(\cdot)$ is the Lebesgue measure, and taking



$$A_\phi(\varepsilon) = \{x \in S \mid \phi(f(x)) \leq \varepsilon\}. \tag{1.7}$$

We note, however, that this set may contain points whose function values differ
considerably from $y_*$.
So far, only a few solution methods for the global optimization problem have been developed, certainly in comparison with the multitude of methods that
aim for a local optimum. The relative difficulty of global optimization as
compared to local optimization is easy to understand. If we assume that f is
twice continuously differentiable, then all that is required to test if a point is a
local minimum is knowledge of the first and second order derivatives at this
point. If the test does not yield a positive result, the continuous differentiability
of the function ensures that a neighbouring point can be found with a lower
function value. Thus, a sequence of points converging to a local optimum can
be constructed.
Such local tests are obviously not sufficient to verify global optimality.
Indeed, in some sense the global optimization problem as stated in (1.3) is inherently unsolvable in a finite number of steps. For any continuously differentiable function $f$, any point $\bar{x}$ and any neighbourhood $B$ of $\bar{x}$, there exists a function $f'$ such that $f + f'$ is continuously differentiable, $f + f'$ equals $f$ for all points outside $B$, and the global minimum of $f + f'$ is $\bar{x}$. ($f + f'$ is an indentation of $f$.) Thus, for any point $\bar{x}$, one cannot guarantee that it is not the global minimum without evaluating the function in at least one point in every neighbourhood $B$ of $\bar{x}$. As $B$ can be chosen arbitrarily small, it follows that any method designed to solve the global optimization problem would require an unbounded number of steps (Dixon 1978).
Of course, this argument does not directly apply to the case where one is satisfied with an approximation of the global minimum. In particular, if an element of $A_x(\varepsilon)$ is sought, then enumerative strategies that only require a
finite number of function evaluations can be easily shown to exist. These
strategies, however, are of limited practical use, and it appears that the above
observation does prohibit the construction of practical methods for the global
optimization problem. Hence, either a further restriction of the class of
objective functions or a further relaxation of what is required of an algorithm
will be inevitable in what follows.
Subject to this first conclusion, the methods developed to solve the global
optimization problem can be divided into two classes, depending on whether or
not they incorporate any stochastic elements (Dixon & Szegö 1975, 1978a;
Fedorov 1985).
Deterministic methods do not involve any stochastic concepts. To provide a
rigid guarantee of success, such methods unavoidably involve additional as-
sumptions on f.
Most stochastic methods involve the evaluation of f in a random sample of
points from S and subsequent manipulations of the sample. As a result, we do
sacrifice the possibility of an absolute guarantee of success. However, under
634 A.H.G. Rinnooy Kan, G.T. Timmer

very mild conditions on the sampling distribution and on f, the probability that
an element of $A_x(\varepsilon)$, $A_f(\varepsilon)$ or $A_\phi(\varepsilon)$ is sampled can be shown to approach 1 as
the sample size increases (Solis & Wets 1981).
Irrespective of whether a global optimization method is deterministic or
stochastic, it always aims for an appropriate convergence guarantee. In some
cases, all that can be asserted is that the method performs well in an empirical
sense. This is far from satisfactory. Ideally, one would like to be assured that the method will find an element of $A_x(\varepsilon)$, $A_f(\varepsilon)$ or $A_\phi(\varepsilon)$ in a finite number of steps, and under appropriate conditions some deterministic methods provide such a guarantee. Frequently, the best that one can do is to establish an
asymptotic guarantee, which ensures convergence to the global minimum as the
computational effort goes to infinity. As we have seen, stochastic methods
usually provide a stochastic version of such a guarantee, i.e. convergence in
probability or with probability 1 (almost surely, almost everywhere).
The existence of any type of asymptotic guarantee immediately raises the
issue of an appropriate stopping rule for the algorithm. This is already an
important question in regular nonlinear programming (where it is usually dealt
with in an ad hoc fashion), but acquires additional significance in global
optimization, where the trade-off between reliability and effort is at the heart of
every computational experiment. The design of an appropriate stopping rule
which weighs costs and benefits of continued computation against each other,
forms a problem of great inherent difficulty, and we shall be able to report only
partial success in solving it satisfactorily.
Rather than reviewing global optimization methods according to the distinc-
tion between deterministic and stochastic methods, we prefer to use a different
classification, based on what might be called the underlying philosophy of the
method. From the literature, five such philosophies can be identified:
(a) Partition and search. Here, S is partitioned into successively smaller
subregions among which the global minimum is sought, much in the spirit of
branch-and-bound methods for combinatorial optimization. A few of these
deterministic methods are reviewed in Section 2.
(b) Approximation and search. In this approach, f is replaced by an
increasingly better approximation that is easier from a computational point of
view. Some methods of this type are reviewed in Section 3.
(c) Global decrease. These methods aim for a permanent improvement in
f-values, culminating in arrival at the global minimum. A few of them will be
reviewed in Section 4.
(d) Improvement of local minima. Exploiting the availability of an efficient
local search routine, these methods seek to generate a sequence of local
minima of decreasing value. Clearly, the global minimum is the last one
encountered in this sequence. Some examples are given in Section 5.
(e) Enumeration of local minima. Complete enumeration of local minima
(or at least of a promising subset of them) is clearly a way to solve the global
optimization problem. Some methods developed for that purpose are discussed
in Section 6.

It would have been appropriate to conclude the chapter with a discussion of the empirical performance of the methods discussed in the various sections. Unfortunately, the computational evidence is far from complete. To the extent that experiments have been carried out at all, it is often not clear under what conditions this has been done, in particular with respect to the stopping rule involved. Certainly, none of these experiments captures the full trade-off
between reliability and effort that was alluded to above and will be a crucial
factor in choosing an appropriate global optimization technique. What little
can be said on this issue, will be said in Section 7.
Apart from some special classes (Pardalos & Rosen 1985), no attention will
be given to the constrained global optimization problem. Only very little work
has been done in this field. It is still very unclear what line of approach could
be successful. Some initial attempts using penalty functions are described in
(Timmer 1984).

2. Partition and search

A natural approach to solve the global optimization problem is through an appropriate generalization of branch and bound methods, a solution technique that is well known from the area of combinatorial optimization.

At any stage of such a procedure, we have a partition of $S$ into subsets $S_\alpha$ ($\alpha \in A$), a lower bound $\mathrm{LB}(S_\alpha)$ on $\min_{x \in S_\alpha} f(x)$ for all $\alpha \in A$, and an upper bound $\mathrm{UB} = f(\bar{x})$ ($\bar{x} \in S$) representing the smallest feasible solution value found so far.

Clearly, subsets $S_\alpha$ for which $\mathrm{LB}(S_\alpha) \geq \mathrm{UB}$ can never contain the global minimum $x_*$ and can be eliminated. If, after possible further improvement of $\mathrm{UB}$, any subset is left for which $\mathrm{UB} - \mathrm{LB}(S_\alpha) > 0$, then the partition is further refined - most naturally by dividing the subset $S_{\bar{\alpha}}$ with $\mathrm{LB}(S_{\bar{\alpha}}) = \min_{\alpha \in A} \mathrm{LB}(S_\alpha)$ into smaller subsets - and the next stage is initiated.
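In outline, such a procedure requires nothing beyond a subroutine that produces the lower bounds. The following Python sketch (ours, purely illustrative, not from the original text) implements the loop above for a hyperrectangular $S$; the crude Lipschitzian bounding routine in the example, based on an assumed constant $L = 6$, is one admissible choice.

```python
# Minimal branch-and-bound sketch (illustrative only).
# `lower_bound(lo, hi)` must underestimate min f over the box [lo, hi].
import heapq
import numpy as np

def branch_and_bound(f, lo, hi, lower_bound, eps=1e-3, max_iter=100000):
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    x_best = (lo + hi) / 2.0
    ub = f(x_best)                                   # incumbent UB = f(x_bar)
    heap = [(lower_bound(lo, hi), tuple(lo), tuple(hi))]
    for _ in range(max_iter):
        lb, blo, bhi = heapq.heappop(heap)           # subset with smallest LB
        if lb >= ub - eps:                           # nothing left to improve
            break
        blo, bhi = np.asarray(blo), np.asarray(bhi)
        mid = (blo + bhi) / 2.0
        fx = f(mid)                                  # possible improvement of UB
        if fx < ub:
            ub, x_best = fx, mid
        j = int(np.argmax(bhi - blo))                # refine: bisect longest edge
        left_hi, right_lo = bhi.copy(), blo.copy()
        left_hi[j] = right_lo[j] = mid[j]
        for nlo, nhi in ((blo, left_hi), (right_lo, bhi)):
            heapq.heappush(heap, (lower_bound(nlo, nhi), tuple(nlo), tuple(nhi)))
    return x_best, ub

# Example with a Lipschitzian lower bound (constant L = 6 assumed):
f = lambda x: np.sin(3 * x[0]) + 0.5 * x[0] ** 2
lb = lambda lo, hi: f((np.asarray(lo) + np.asarray(hi)) / 2) \
     - 6.0 * np.linalg.norm(np.asarray(hi) - np.asarray(lo)) / 2
print(branch_and_bound(f, [-3.0], [3.0], lb))
```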
There are many variations on this theme, each of which yields convergence to the global minimum under many different conditions (Horst 1986; Horst & Tuy 1987; Pinter 1983, 1986a, 1988; Strongin 1978; Tuy 1987). Typically, such conditions ensure that the lower bounding procedure is asymptotically accurate, in that $\mathrm{LB}(S_\alpha)$ approaches the global minimum of $f$ over $S_\alpha$ when the volume of $S_\alpha$ becomes sufficiently small. Of course, for arbitrary functions $f$, finding a lower bound on the global minimum value $y_*$ is of the same inherent difficulty as finding $y_*$ itself. Thus, all procedures in this category involve additional assumptions on $f$.

We propose to discuss two typical examples, each of which is attractive in that the inevitable additional assumptions on $f$ are reasonable in themselves, and are in addition cleverly exploited to speed up the procedure.
In the first example (McCormick 1983), $S$ is assumed to be a hyperrectangle $\{x \in \mathbb{R}^n \mid a \leq x \leq b\}$ and $f$ is supposed to be a factorable function, i.e., $f$ is the last in a sequence of functions $f^{(1)}, f^{(2)}, \ldots$, where this factorization sequence is built up as follows:

$$f^{(j)}(x_1, \ldots, x_n) = x_j \quad (j = 1, \ldots, n) \tag{2.1}$$

and for all $k > n$, one of the following holds:

$$f^{(k)}(x) = f^{(g)}(x) + f^{(h)}(x) \quad \text{for some } g, h < k, \tag{2.2}$$

$$f^{(k)}(x) = f^{(g)}(x) \cdot f^{(h)}(x) \quad \text{for some } g, h < k, \text{ or} \tag{2.3}$$

$$f^{(k)}(x) = F(f^{(g)}(x)) \quad \text{for some } g < k, \tag{2.4}$$

where $F$ belongs to a given class $C$ of simple functions $\mathbb{R} \to \mathbb{R}$ (e.g., $F(t) = e^t$, $F(t) = \sin t, \ldots$).

The reader can easily verify that this is a natural way to view complicated objective functions that are given in explicit algebraic form. Indeed, the order in which the $f^{(j)}$ appear usually corresponds to the order in which $f(x_1, \ldots, x_n)$ would be computed for given values of the arguments $x_j$ ($j = 1, \ldots, n$).
In the context of a branch-and-bound procedure, the idea is now to have the $S_\alpha$ correspond to appropriate hyperrectangles $\{x \in \mathbb{R}^n \mid a(\alpha) \leq x \leq b(\alpha)\}$, and to use the factorization sequence $(f^{(k)} \mid k = 1, 2, \ldots)$ to compute lower bounds, by computing for $k = 1, 2, \ldots$ convex lower bounding functions $l_\alpha^{(k)}$ and concave upper bounding functions $u_\alpha^{(k)}$ with the property that

$$l_\alpha^{(k)}(x) \leq f^{(k)}(x) \leq u_\alpha^{(k)}(x) \quad \forall x \in S_\alpha \ (\alpha \in A). \tag{2.5}$$

In doing so, we ultimately arrive at a convex lower bounding function for $f$ over $S_\alpha$, whose minimum value over $S_\alpha$ provides an appropriate lower bound $\mathrm{LB}(S_\alpha)$. If the minimum value is achieved at $x = x(\alpha)$, then $f(x(\alpha))$ provides an upper bound on $y_*$ and a possible improvement on $\mathrm{UB}$.
Any convex function $l_\alpha^{(k)}$ and concave function $u_\alpha^{(k)}$ for which (2.5) holds, and for which asymptotic accuracy can be demonstrated, yields a correct branch-and-bound procedure. There is a clear trade-off between the accuracy of the bounds and the effort required for their computation. One extreme possibility is to use the tools of interval analysis (Ratschek 1985; Walster et al. 1985), which quickly yield constants $L_\alpha^{(k)}$ and $U_\alpha^{(k)}$ so that (2.5) holds with $l_\alpha^{(k)}(x) = L_\alpha^{(k)}$, $u_\alpha^{(k)}(x) = U_\alpha^{(k)}$. At the other end of the spectrum, one would strive to find the best possible bounding functions, i.e., to take $l_\alpha^{(k)}$ equal to the convex lower envelope of $f^{(k)}$ on $S_\alpha$ and $u_\alpha^{(k)}$ equal to the concave upper envelope. Generally, given $l_\alpha^{(j)}$ and $u_\alpha^{(j)}$ for all $j < k$ (note that these functions can be taken equal to $f^{(j)}$ for $j = 1, \ldots, n$), relatively tight convex respectively concave bounds $l_\alpha^{(k)}$ and $u_\alpha^{(k)}$ can be computed by exploiting relation (2.2), (2.3) or (2.4) in the following way (McCormick 1983). If (2.2) holds, then clearly we may take

$$l_\alpha^{(k)}(x) = l_\alpha^{(g)}(x) + l_\alpha^{(h)}(x), \tag{2.6}$$

$$u_\alpha^{(k)}(x) = u_\alpha^{(g)}(x) + u_\alpha^{(h)}(x). \tag{2.7}$$

If (2.3) holds, then we have to presuppose (quite reasonably) that lower bounds $L_\alpha^{(g)}, L_\alpha^{(h)}$ and upper bounds $U_\alpha^{(g)}, U_\alpha^{(h)}$ over $S_\alpha$ are provided for $f^{(g)}(x)$ and $f^{(h)}(x)$ respectively. Assuming the lower bounds to be non-positive and the upper bounds to be nonnegative (analogous results hold for the other cases), we may take

$$l_\alpha^{(k)}(x) = \max\{U_\alpha^{(h)} l_\alpha^{(g)}(x) + U_\alpha^{(g)} l_\alpha^{(h)}(x) - U_\alpha^{(g)} U_\alpha^{(h)},\ L_\alpha^{(h)} u_\alpha^{(g)}(x) + L_\alpha^{(g)} u_\alpha^{(h)}(x) - L_\alpha^{(g)} L_\alpha^{(h)}\}, \tag{2.8}$$

$$u_\alpha^{(k)}(x) = \min\{U_\alpha^{(h)} u_\alpha^{(g)}(x) + L_\alpha^{(g)} l_\alpha^{(h)}(x) - L_\alpha^{(g)} U_\alpha^{(h)},\ U_\alpha^{(g)} u_\alpha^{(h)}(x) + L_\alpha^{(h)} l_\alpha^{(g)}(x) - L_\alpha^{(h)} U_\alpha^{(g)}\}. \tag{2.9}$$

Finally, if (2.4) holds, we presuppose (again quite reasonably) that we are given convex lower and concave upper bounding functions ($L$ and $U$ respectively) for $F$ over the interval $[L_\alpha^{(g)}, U_\alpha^{(g)}]$, and that $F$ attains its minimum and maximum on this interval in $y_{\min}$ and $y_{\max}$ respectively. We then may take

$$l_\alpha^{(k)}(x) = L(\operatorname{mid}\{l_\alpha^{(g)}(x), u_\alpha^{(g)}(x), y_{\min}\}), \tag{2.10}$$

$$u_\alpha^{(k)}(x) = U(\operatorname{mid}\{l_\alpha^{(g)}(x), u_\alpha^{(g)}(x), y_{\max}\}), \tag{2.11}$$

where the mid operator selects the middle value.


Clearly, there is a lot of room in this approach for clever ad hoc ideas, whose
ultimate value can only be established in computational experiments (Mentzer
1985).
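At the interval-analysis end of the spectrum, the recursion (2.2)-(2.4) reduces to ordinary interval arithmetic. A small self-contained illustration (ours, not McCormick's; $F = \exp$ is used as the simple function):

```python
import math

# Constant bounds (L, U) propagated through a factorization sequence,
# i.e. the crudest admissible choice l(x) = L, u(x) = U in (2.5).
def iv_add(a, b):                        # rule (2.2)
    return (a[0] + b[0], a[1] + b[1])

def iv_mul(a, b):                        # rule (2.3)
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(p), max(p))

def iv_exp(a):                           # rule (2.4) with F = exp (monotone)
    return (math.exp(a[0]), math.exp(a[1]))

# f(x1, x2) = x1 * x2 + exp(x1) over the box [-1, 1] x [0, 2]:
x1, x2 = (-1.0, 1.0), (0.0, 2.0)         # the coordinate functions (2.1)
print(iv_add(iv_mul(x1, x2), iv_exp(x1)))
# -> (-1.63..., 4.72...): valid, if loose, bounds on f over the box
```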
The second example to be discussed now is a special case of concave minimization subject to convex constraints: $f(x)$ is concave, and $S' = \{x \in \mathbb{R}^n \mid Ax = b,\ x \geq 0\}$. It is well known that the global minimum of $f$ occurs at one of the extreme points of $S'$.

There is an extensive literature on this type of problem (see (Pardalos & Rosen 1986)). For a good example of a branch-and-bound approach that truly exploits special structure, we assume that $f(x)$ is quadratic: $f(x) = p^T x - \frac{1}{2} x^T Q x$ ($Q$ positive semi-definite) (Rosen 1983). This problem is more general than it appears to be at first glance: both 0-1 programming and the general linear complementarity problem are special cases.
Prior to partitioning $S'$, we first compute lower and upper bounds on $y_*$. For an upper bound, $f$ is maximized subject to $x \in S'$. This easy calculation yields a point $\bar{x}$ and $n$ eigenvectors of the Hessian at $\bar{x}$, say $u_1, \ldots, u_n$. In an effort to move as far away as possible from $\bar{x}$, we then solve $2n$ linear programs of the form $\max\{\pm u_i^T x \mid x \in S'\}$ and find at most $2n$ different extreme points $v_i$ of $S'$. The one for which $f(x)$ is minimal produces the required upper bound $\mathrm{UB}$ on $y_*$.

Now, each $v_i$ defines a halfspace $\{x \in \mathbb{R}^n \mid w_i^T(x - \bar{x}) \leq w_i^T(v_i - \bar{x})\}$ ($w_i \in \{u_i, -u_i\}$), and the intersection of these halfspaces is a hyperrectangle containing $S'$. Obviously $f$ achieves its minimum over this hyperrectangle at one of its $2^n$ vertices; it is easy to find out which one, and this yields a lower bound on $y_*$.
In addition, we also construct a hyperrectangle inscribing the ellipsoid $\{x \in \mathbb{R}^n \mid f(x) = \mathrm{UB}\}$, again of the form $\bigcap_i \{x \in \mathbb{R}^n \mid w_i^T(x - \bar{x}) \leq d_i\}$ with appropriate constants $d_i$ that can easily be computed. Clearly, $x_*$ cannot be contained in the interior of this hyperrectangle, and the intersection of its exterior with $S'$ defines an appropriate family of subsets in which to look for $x_*$. Each of these at most $2n$ polytopes can be embedded in a simplex rooted at the corresponding $v_i$, and a suitable procedure (in particular, the original one in a recursive fashion) can be called upon to solve these subproblems.

Several variations on this theme have been proposed (Zilverberg 1983; Kalantari 1984), each of which fully exploits the extreme point property of the problem. Generalizations of the problem, up to the case where $f$ is the difference of two convex functions and $S'$ is an arbitrary convex set, in which some form of Benders decomposition is clearly called for, have also been attacked. There is no doubt that if a problem exhibits special structure of the above kind, then specialized algorithms of this nature are the ones to turn to. Obviously, this will not always be the case.

3. Approximation and search

If the global optimum of $f$ is difficult to find, it may be attractive to deal with an approximation $\bar{f}$ of $f$ instead. Whatever the form of this approximation, it should of course yield a computational advantage over the original objective function. In each iteration, one will then typically start by computing the global minimum $\bar{x}$ of the current approximation $\bar{f}$. If the approximation is found to be good enough, the algorithm stops. If not, $\bar{f}$ is updated, usually so as to satisfy the equality $\bar{f}(\bar{x}) = f(\bar{x})$.

Several well known global optimization techniques, both deterministic and stochastic ones, can be viewed as belonging in this category.

It is appropriate at this point to mention that several of the techniques described in this section combine characteristics of both the approximation and search and the partition and search philosophy. Based on the current approximation $\bar{f}$ of $f$, these methods focus on subsets of $S$ in which the global minimum is known to be located. However, since the partition of $S$ is implicit and based on the approximation of $f$, these methods are described in the current section.

In view of our discussion in the previous section, we start with a deterministic approach which is based on the observation that $\bar{f}$ can be taken equal to any asymptotically accurate lower bounding function of $f$, chosen in such a way that the computation of its global minimum can be carried out efficiently.

A popular example of such a method (Shubert 1972) is based on the assumption that a Lipschitz constant $L$ is given so that

$$|f(x_1) - f(x_2)| \leq L \|x_1 - x_2\| \tag{3.1}$$

for all $x_1, x_2 \in S$. In the case that $n = 1$, the method essentially consists of iterative updates of a piecewise linear function with directional derivatives equal to $L$ or $-L$, which forms an adaptive Lipschitzian minorant function of $f$.

Initially, $f$ is evaluated at some arbitrary point $x_1$. A piecewise linear function $\psi_1$ is defined by

$$\psi_1(x) = f(x_1) - L\|x - x_1\|. \tag{3.2}$$


Now an iterative procedure starts, where in iteration $k$ ($k \geq 2$) a global minimum of $\psi_{k-1}$ on $S$ is chosen as the point $\bar{x}_k$ where $f$ is next evaluated. A new piecewise linear function $\psi_k$ is constructed by a modification of $\psi_{k-1}$:

$$\psi_k(x) = \max\{f(\bar{x}_k) - L\|x - \bar{x}_k\|,\ \psi_{k-1}(x)\}, \quad k = 2, 3, \ldots \tag{3.3}$$

Hence

$$\psi_{k-1}(x) \leq \psi_k(x) \leq f(x) \quad \forall x \in S, \tag{3.4}$$

$$\psi_k(\bar{x}_i) = f(\bar{x}_i) \quad (i = 1, \ldots, k). \tag{3.5}$$

In each iteration, the piecewise linear approximation of $f$ will improve. The method is stopped when the difference between the global minimum of $\psi_k$, which is a lower bound on the global minimum of $f$, and the best function value found is smaller than $\varepsilon$, indicating that a member of $A_f(\varepsilon)$ has been found. To conclude the description, note that $\psi_k$ is completely determined by the location and the value of its minima. If $\psi_k$ is described in terms of these parameters, it is no problem to find one of its global minima.
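In practice the bookkeeping is indeed light: on each interval between consecutive evaluation points, the minimum of $\psi_k$ is available in closed form. A Python rendering of the method (ours; the Lipschitz constant is assumed given):

```python
import bisect, math

def shubert(f, a, b, L, eps=1e-4, max_iter=10000):
    xs, fs = [a, b], [f(a), f(b)]
    for _ in range(max_iter):
        # minimum of psi_k over each interval (x_i, x_{i+1}): value, location
        lb, t = min(
            ((fs[i] + fs[i + 1]) / 2 - L * (xs[i + 1] - xs[i]) / 2,
             (xs[i] + xs[i + 1]) / 2 + (fs[i] - fs[i + 1]) / (2 * L))
            for i in range(len(xs) - 1))
        if min(fs) - lb < eps:           # a member of A_f(eps) has been found
            break
        j = bisect.bisect(xs, t)         # evaluate f at the minimum of psi_k
        xs.insert(j, t)
        fs.insert(j, f(t))
    i = min(range(len(fs)), key=fs.__getitem__)
    return xs[i], fs[i]

print(shubert(lambda x: math.sin(3 * x) + 0.5 * x * x, -3.0, 3.0, L=6.0))
```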
The above method can be generalized to $n$ dimensions in at least two different ways. In the first approach (Mladineo 1986), $\psi_k$ is defined as in (3.3). The graph of $\psi_k$ in $\mathbb{R}^{n+1}$ is then defined by a set of intersecting cones in $\mathbb{R}^{n+1}$, whose axes of symmetry are all parallel to the $(n+1)$-th unit vector. To find the global minimum of $\psi_k$ now amounts to intersecting the cone corresponding to $\bar{x}_k$ with the previous ones, which turns out to be equivalent to solving a set of linear equations and a single quadratic one.

In the second generalization (Wood 1985), cones are replaced by inner approximating simplices. In each iteration, we have a collection of such simplices in $\mathbb{R}^{n+1}$ that bound parts of the graph of $f$ from below, always including the part containing the global minimum point $(x_*, y_*)$. The function evaluation at $\bar{x}$ allows one to intersect the current collection with the complement of a simplex whose zenith is at $(\bar{x}, f(\bar{x}))$ as well as with the halfspace defined by $y \leq f(\bar{x})$: what remains outside the simplex and below the hyperplane defines the new set of simplices.

Finally, we mention (Pinter 1986b), which describes a general framework for generalizing Lipschitz-type methods to higher dimensions.
It is appropriate in the context of approaches such as the above to refer briefly to the literature on minimax optimality of algorithms. The one-dimensional method and its $n$-dimensional cone generalization can be shown to minimize the worst case error $f(\bar{x}) - y_*$ after a fixed number of steps over all sequential sampling rules that search for the global minimum of functions satisfying (3.1) for given $L$ (Shubert 1972; Mladineo 1986). There are other results of this nature, obtained with respect to different criteria of optimality and different classes of algorithms (see (Basso 1985)). For a fundamental comparison between deterministic and stochastic methods along these lines we refer to (Sukharev 1971; Anderssen & Bloomfield 1975; Archetti & Betro 1978b; Sobol 1982).
It is also appropriate at this point to discuss briefly a related deterministic approach to the global minimization of functions with given Lipschitz constant, which was originally proposed in (Evtushenko 1971). The theoretical background of this method is very simple. Suppose that the function has been evaluated at $k$ points $x_1, \ldots, x_k$ and that $M_k = \min\{f(x_1), \ldots, f(x_k)\}$ is the smallest function value found so far. For $i = 1, 2, \ldots, k$, let $V_i$ be the sphere $\{x \in \mathbb{R}^n \mid \|x - x_i\| \leq r_i\}$, where

$$r_i = (f(x_i) - M_k + \varepsilon)/L. \tag{3.6}$$

Then, for any $x \in V_i$,

$$f(x) \geq f(x_i) - Lr_i = M_k - \varepsilon. \tag{3.7}$$

Hence, if the spheres $V_i$ ($i = 1, \ldots, k$) cover the whole set $S$, then the point $\bar{x}_i$ for which $f(\bar{x}_i) = M_k$ is an element of $A_f(\varepsilon)$. Thus, this result converts the global minimization problem to the problem of covering $S$ with spheres. In the simple case of one-dimensional global optimization where $S$ is an interval $\{x \in \mathbb{R} \mid a \leq x \leq b\}$, this covering problem is solved by choosing $x_1$ equal to $a + \varepsilon/L$ and

$$x_k = x_{k-1} + \frac{2\varepsilon + f(x_{k-1}) - M_k}{L} \quad (k = 2, 3, \ldots). \tag{3.8}$$

The method is stopped if $x_k \geq b$.


A generalization of this algorithm to $n$-dimensional problems ($n > 1$) consists of systematically covering $S$ with hypercubes whose edge length is $2r_i/\sqrt{n}$, where $r_i$ is given by (3.6), i.e. cubes inscribed in the spheres $V_i$.
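In the one-dimensional case the rule (3.8) takes only a few lines; the sketch below (ours) returns an element of $A_f(\varepsilon)$:

```python
import math

def evtushenko(f, a, b, L, eps):
    x = a + eps / L                      # first evaluation point
    fx = f(x)
    x_best, M = x, fx                    # M: smallest value found so far
    while True:
        x += (2 * eps + fx - M) / L      # step rule (3.8)
        if x >= b:                       # stopping criterion of the method
            return x_best, M
        fx = f(x)
        if fx < M:
            x_best, M = x, fx

print(evtushenko(lambda x: math.sin(3 * x) + 0.5 * x * x, -3.0, 3.0,
                 L=6.0, eps=0.01))
```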
All the deterministic techniques based on (3.1) have a major drawback: the number of function evaluations required is very large. To analyse this number, let $S$ be a hypersphere with radius $r$, so that

$$m(S) = \frac{r^n \pi^{n/2}}{\Gamma(1 + \frac{1}{2}n)}, \tag{3.9}$$

where $\Gamma$ denotes the gamma function. Furthermore, let $c$ be the maximum of $f$ over $S$ and suppose that $f$ has been evaluated in $k$ points $x_1, \ldots, x_k$. The function value in a point $x$ can only be known to be greater than the global minimum value $y_*$ if the function has been evaluated in a point $x_i$ within distance $(f(x_i) - y_*)/L$ of $x$. Hence, the hyperspheres with radii $(f(x_i) - y_*)/L$ centered at the points $x_i$, $i = 1, \ldots, k$, must cover $S$ to be sure that the global minimum has been found. The joint volume of these $k$ hyperspheres is smaller than

$$k\, \frac{((c - y_*)/L)^n \pi^{n/2}}{\Gamma(1 + \frac{1}{2}n)}. \tag{3.10}$$
Thus, for the $k$ hyperspheres to cover $S$ we require

$$k \geq \left(\frac{rL}{c - y_*}\right)^n. \tag{3.11}$$

Unless the derivative of $f$ in the direction of the global minimum equals $-L$ everywhere, $L$ is greater than $(c - y_*)/r$, and the computational effort required increases exponentially with $n$. Thus, the price paid for a finite deterministic guarantee of convergence to $A_f(\varepsilon)$ appears to be very high.
We now turn to stochastic methods that can be viewed as operating on a suitable approximation of $f$. The underlying assumption here is that $f$ is produced by a certain random mechanism. Information on the particular realization of the mechanism is obtained by evaluating $f$ at a suitably chosen set of points. In each iteration, $\bar{f}$ can then be taken equal to the expected outcome of the random mechanism, conditioned on the function evaluations that have occurred so far.

An appropriate random mechanism is provided by the notion of a stochastic process, i.e., a finite and real valued function $f(x, \omega)$, which for fixed $x$ is a random variable defined on an appropriate probability space and for fixed $\omega$ a deterministic real valued function on $S$. Thus, as desired, there corresponds an objective function $f$ to every realization of the random variable $\omega$ (Kushner 1964).
With each stochastic process, we associate a family of finite dimensional distributions

$$\Pr\{f(x_1) \leq y_1, \ldots, f(x_k) \leq y_k\} \quad (x_1, \ldots, x_k \in S;\ k = 1, 2, \ldots). \tag{3.12}$$

Two processes sharing the same family (3.12) are called equivalent. The process is called Gaussian if each distribution (3.12) is normal. It can be shown (Cramer & Leadbetter 1967) that the specification of such a process amounts to the specification of an appropriate covariance function $R: S \times S \to \mathbb{R}$, defined by

$$R(x_1, x_2) = \operatorname{cov}(f(x_1), f(x_2)), \tag{3.13}$$

which has to be symmetric and nonnegative definite.


A particularly attractive choice within the class of Gaussian distributions is obtained if we take $R(x_1, x_2) = \sigma^2 \min\{x_1, x_2\}$, where $\sigma^2$ is a given constant. This is the celebrated Wiener process, which combines the advantage of unusual analytical tractability with the disadvantage of having nowhere differentiable (albeit continuous) realizations with probability 1. Though less than an ideal choice in view of the latter property, this model does yield an attractive global optimization technique for the case that $n = 1$. This technique (Woerlee 1984), which improves on some earlier work in (Archetti & Betro 1979a, 1979b), is based on a result from the theory of Wiener processes (Shepp 1979) which allows us to compute the expectation of the minimum value $y_{\min}$ of the Wiener process in an interval $[\underline{x}, \bar{x}]$, conditioned on given function values $f(\underline{x})$ and $f(\bar{x})$:

$$E\bigl(y_{\min} \mid \underline{x}, f(\underline{x}), \bar{x}, f(\bar{x})\bigr) = \min\{f(\underline{x}), f(\bar{x})\} + C\bigl(\underline{x}, f(\underline{x}), \bar{x}, f(\bar{x})\bigr) \int_0^\infty N\bigl(\min\{f(\underline{x}), f(\bar{x})\} - \max\{f(\underline{x}), f(\bar{x})\},\ \sigma^2(\bar{x} - \underline{x})\bigr)(t)\, \mathrm{d}t, \tag{3.14}$$

where

$$C\bigl(\underline{x}, f(\underline{x}), \bar{x}, f(\bar{x})\bigr) = -\frac{\pi(\bar{x} - \underline{x})\sigma^2}{2} \exp\left(\frac{(f(\bar{x}) - f(\underline{x}))^2}{2\sigma^2(\bar{x} - \underline{x})}\right) \tag{3.15}$$

and $N(\mu, \sigma^2)$ is the normal density function with mean $\mu$ and variance $\sigma^2$.
Thus, given a set of points $x_1 < x_2 < \cdots < x_k$ at which $f$ has been evaluated so far, we can compute the posterior expectation (3.14) of the minimum value $y_i^*$ for every interval $(x_i, x_{i+1})$ ($i = 1, \ldots, k-1$), and select the interval $(x_j, x_{j+1})$ for which the value $y_j^*$ is minimal. As the next point at which to evaluate $f$, we can choose the expected location of the minimum, given by

$$\int_{x_j}^{x_{j+1}} t\, p\bigl(t \mid y_j^*, x_j, f(x_j), x_{j+1}, f(x_{j+1})\bigr)\, \mathrm{d}t \tag{3.16}$$
with

$$p\bigl(t \mid y_j^*, x_j, f(x_j), x_{j+1}, f(x_{j+1})\bigr) = C\bigl(t, y_j^*, x_j, f(x_j), x_{j+1}, f(x_{j+1})\bigr) \exp\left(-\frac{(y_j^* - f(x_j))^2}{2\sigma^2(t - x_j)} - \frac{(y_j^* - f(x_{j+1}))^2}{2\sigma^2(x_{j+1} - t)} + \frac{(2y_j^* - f(x_j) - f(x_{j+1}))^2}{2\sigma^2(x_{j+1} - x_j)}\right), \tag{3.17}$$

where

$$C\bigl(t, y_j^*, x_j, f(x_j), x_{j+1}, f(x_{j+1})\bigr) = \frac{(f(x_j) - y_j^*)(f(x_{j+1}) - y_j^*)\,(x_{j+1} - x_j)^{3/2}}{2\sqrt{2\pi}\,\sigma\,(f(x_j) + f(x_{j+1}) - 2y_j^*)\,(t - x_j)^{3/2}(x_{j+1} - t)^{3/2}}. \tag{3.18}$$

The method can be initiated by using a set of equidistant points $x_1, \ldots, x_m$ in $S$ to estimate $\sigma^2$ by the maximum likelihood estimator

$$\hat{\sigma}^2 = \frac{1}{m} \sum_{i=2}^m \frac{(f(x_i) - f(x_{i-1}))^2}{x_i - x_{i-1}}. \tag{3.19}$$

To develop an appropriate stopping rule, we could, for example, drop an interval $[x_i, x_{i+1}]$ if the probability of finding a function value better than the currently lowest one $\bar{y}$,

$$\Pr\Bigl\{\min_{x \in [x_i, x_{i+1}]} f(x) < \bar{y} \,\Big|\, f(x_i), f(x_{i+1})\Bigr\} = \exp\left(-\frac{2(\bar{y} - f(x_i))(\bar{y} - f(x_{i+1}))}{\sigma^2(x_{i+1} - x_i)}\right), \tag{3.20}$$

is sufficiently small, and terminate the algorithm if all intervals except the one containing $\bar{y}$ are eliminated.
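The estimator (3.19) and the elimination rule (3.20) are the computationally innocent parts of this scheme; only (3.14)-(3.18) are laborious. A sketch of the two simple ingredients (ours; the elimination threshold in the comment is an arbitrary choice):

```python
import math

def sigma2_hat(xs, fs):
    """Maximum likelihood estimate (3.19), from values fs at ordered points xs."""
    m = len(xs)
    return sum((fs[i] - fs[i - 1]) ** 2 / (xs[i] - xs[i - 1])
               for i in range(1, m)) / m

def p_better(y_bar, f_lo, f_hi, sigma2, width):
    """Eq. (3.20): probability that the Wiener process, conditioned on the
    endpoint values f_lo, f_hi (both above y_bar), drops below y_bar
    somewhere in an interval of the given width."""
    return math.exp(-2 * (y_bar - f_lo) * (y_bar - f_hi) / (sigma2 * width))

# An interval [x_i, x_{i+1}] could be dropped when, say,
# p_better(y_bar, f(x_i), f(x_{i+1}), sigma2, x_{i+1} - x_i) < 1e-3.
```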
The above description demonstrates that the simple 1-dimensional Wiener
approach already leads to fairly cumbersome computations. Moreover, its
generalization to the case n > 1 is not obvious - many attractive properties of
the one-dimensional process do not generalize to higher dimensions. And as
remarked already, the almost everywhere nondifferentiability of its realizations
continues to be unfortunate.

Let us describe briefly a possible way to overcome these imperfections. It is based on a different Gaussian process, for which

$$R(x_1, x_2) = \frac{\sigma^2}{1 + \|x_1 - x_2\|^2}. \tag{3.21}$$

Such a process is known to be equivalent to one whose realizations are continuously differentiable with probability 1. Its conditional expectation $Ef(x)$ subject to evaluations $f(x_1), \ldots, f(x_k)$ is given by

$$(f(x_1), \ldots, f(x_k))^T V^{-1} (R(x, x_1), \ldots, R(x, x_k)), \tag{3.22}$$

where $V = (v_{ij}) = (R(x_i, x_j))$. For the particular case given by (3.21), (3.22) is equal to $\sum_{i=1}^k a_i/(1 + \|x - x_i\|^2)$ for appropriate constants $a_i$, so that the stationary points of $Ef(x)$ can be found by finding all roots of a set of polynomial equations. This can be done, for example, by the method from (Garcia & Zangwill 1979) discussed in Section 6, and hence we may take the approximation $\bar{f}$ equal to $Ef(x)$ to arrive at another example of an algorithm in the current category.
Both stochastic approaches described above can be viewed as special cases of a more general axiomatic approach (Zilinskas 1982). There, uncertainty about the values of $f(x)$ other than those observed at $x_1, \ldots, x_k$ is assumed to be representable by a binary relation $\leq_x$, where $(a, a') \geq_x (b, b')$ signifies that the event $\{f(x) \in (a, a')\}$ is at least as likely as the event $\{f(x) \in (b, b')\}$. Under some reasonable assumptions on this binary relation (e.g., transitivity and completeness), there exists a unique density function $p_x$ that is compatible with the relation in that, for every pair of countable unions of intervals $(A, A')$, one has that $A \geq_x A'$ if and only if

$$\int_A p_x(t)\, \mathrm{d}t \geq \int_{A'} p_x(t)\, \mathrm{d}t.$$
For the special case that all densities are Gaussian and hence characterized by their means $\mu_x$ and variances $\sigma_x^2$, it is then tempting to approach the question of where $f$ should be evaluated next in the axiomatic framework of utility maximization. For this, one has to presuppose that a preference relation is defined on the set of all pairs $(\mu_x, \sigma_x)$. Subject again to some reasonable assumptions about this preference relation and the utility function involved, the surprising result is that the uniquely rational choice for the next point of evaluation is the one for which the probability of finding a function value smaller than $\bar{y} - \varepsilon$ is maximal. In the case of the Wiener process, this amounts to the selection of the interval which maximizes (3.20) with $\bar{y}$ replaced by $\bar{y} - \varepsilon$, say $(x_j, x_{j+1})$, and to taking

$$x_{k+1} = x_j + \frac{(\bar{y} - \varepsilon - f(x_j))(x_{j+1} - x_j)}{2\bar{y} - f(x_j) - f(x_{j+1}) - 2\varepsilon}. \tag{3.23}$$

This, by the way, is not necessarily the same choice as dictated by (3.16).

The brief description of the above stochastic approximation and search methods confirms our earlier impression that this class of methods combines strong theoretical properties with cumbersome computational requirements. A deterministic approach can be shown to be optimal in the worst case sense, and a stochastic approach can be shown to be optimal in the expected utility sense, but neither method can be recommended unless evaluations of the original function $f$ are outrageously expensive, in which case a Gaussian stochastic model may be appropriate.

4. Global decrease

A natural technique to find the global minimum is to generate a sequence of points with decreasing function values that converges to $x_*$. Such a sequence can be initiated by starting out in any direction of local decrease. The difficulty, obviously, is to continue the sequence properly without getting stuck at a local minimum from which the sequence cannot escape.

An interesting and much investigated way to achieve global reliability is to use random directions. In the $k$-th iteration of such a method, having arrived at $x_{k-1} \in S$, one generates a point $\xi_k \in S$ from a distribution $G_k$ and chooses $x_k$ as a function $D$ of $x_{k-1}$ and $\xi_k$ in such a way that

$$f(x_k) = f(D(x_{k-1}, \xi_k)) \leq \min\{f(x_{k-1}), f(\xi_k)\}. \tag{4.1}$$

If the distributions $G_k$ satisfy

$$\prod_{k=1}^\infty (1 - G_k[A]) = 0 \tag{4.2}$$

for every $A \subset S$ with positive volume $m(A)$, then it is easy to show that

$$\lim_{k \to \infty} \Pr\{x_k \in A_f(\varepsilon)\} = 1, \tag{4.3}$$

i.e., we have convergence to $A_f(\varepsilon)$ in probability. To prove (4.3), all that has to be observed is that the probability in question is equal to

$$1 - \lim_{k \to \infty} \prod_{i=1}^k \bigl(1 - G_i(A_f(\varepsilon))\bigr), \tag{4.4}$$

which converges to 1 because of (4.2) (Solis & Wets 1981).


The function D can be interpreted in many ways. For instance, it may
correspond to minimization of f(ax k 1 + (1 - cQ4:k) over all « @ [0, 1]
(Gavioni 1975), or to minimization of some polynomial approximating f on this
interval (Bremmerman 1970), or even to just taking the minimum of f(xk_l)
and f(~k) (Rastrigin 1963). The range of a can be extended to [0,2],
646 A.H.G. Rinnooy Kan, G.T. Timmer

introducing the notion of reversals (Lawrence & Steiglitz 1972) with generally
better computational results (Schrack & Choit 1976). Hundreds of further
references can be found in (Devroye 1979).
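Any of these choices of $D$ fits in a few lines of code. The sketch below (ours, purely illustrative) uses the uniform distribution on $S$ for $G_k$ and a crude sampling of $f$ along the segment between $x_{k-1}$ and $\xi_k$; since $\alpha = 0$ and $\alpha = 1$ are always tried, (4.1) holds by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_direction_method(f, lo, hi, n_iter=2000, n_alpha=5):
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    x = rng.uniform(lo, hi)
    fx = f(x)
    for _ in range(n_iter):
        xi = rng.uniform(lo, hi)            # xi_k drawn from G_k = U(S)
        x_prev = x                          # D: best of a few points on the
        for alpha in np.linspace(0.0, 1.0, n_alpha):   # segment [xi_k, x_{k-1}]
            y = alpha * x_prev + (1.0 - alpha) * xi
            fy = f(y)
            if fy < fx:
                x, fx = y, fy
    return x, fx

print(random_direction_method(lambda x: np.sin(3 * x[0]) + 0.5 * x[0] ** 2,
                              [-3.0], [3.0]))
```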
The correctness of these methods is easily understood: condition (4.2) guarantees that there is no systematic bias against any subset of $S$ with positive volume, and in particular no such bias against $A_f(\varepsilon)$. Only sporadic results are available on the rate of convergence (Rastrigin 1963), and the issue of a stopping rule has not been resolved at all. Experiments confirm that these methods are fast but rather unreliable in practice.
A deterministic global descent approach is hard to imagine, unless one is prepared to impose stringent conditions on $f$. There is, none the less, an interesting class of approaches, both deterministic and stochastic, that is based on appropriate perturbations of the steepest descent trajectory

$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = -g(x(t)), \tag{4.5}$$

where $g(x)$ is the gradient of $f$ at $x$.

Deterministically, one tries to perturb (4.5) in such a way that the method cannot converge to local minima outside $A_f(\varepsilon)$, i.e.

$$\frac{\mathrm{d}^2 x(t)}{\mathrm{d}t^2} = -s(f(x(t)))\, g(x(t)), \tag{4.6}$$

where the function $s$ has to be small if $f(x(t)) > y_* + \varepsilon$ and goes to infinity as $x(t)$ approaches $A_f(\varepsilon)$, so as not to restrict the curvature of the trajectory in that region. In (Griewank 1981), $s(f(x(t)))$ is taken equal to $a/(f(x(t)) - y_* - \varepsilon)$, which leads to the differential equation

$$\frac{\mathrm{d}^2 x(t)}{\mathrm{d}t^2} = -a\left(I - \frac{\mathrm{d}x(t)}{\mathrm{d}t} \frac{\mathrm{d}x(t)}{\mathrm{d}t}^T\right) \frac{g(x(t))}{f(x(t)) - y_* - \varepsilon}. \tag{4.7}$$
We omit a detailed motivation of (4.7), since it suffers from several weaknesses. Obviously, an appropriate target level $y_* + \varepsilon$ is not easily specified. In addition, convergence can only be verified empirically and not established analytically for any significant class of objective functions.
A stochastic analogue of (4.7) offers better possibilities. Here, one considers the stochastic differential equation

$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = -g(x(t)) + \varepsilon(t) \frac{\mathrm{d}w(t)}{\mathrm{d}t}, \tag{4.8}$$

where $w(t)$ is an $n$-dimensional Wiener process. (If $\varepsilon(t) = \varepsilon_0$ for all $t$, this is known among physicists as the Smoluchovski-Kramers equation.) The solution to (4.8) is a stochastic process $x(t)$ that has the following useful property (Schuss 1980): if $\varepsilon(t) = \varepsilon_0$ for all $t$, then the probability density function $p(z)$ of $x(t)$ becomes proportional to $\exp(-2f(z)/\varepsilon_0^2)$ as $t$ goes to infinity. If we now let $\varepsilon_0$ go to 0, this implies (Aluffi-Pentini et al. 1985) that $p(z)$ converges to a Dirac delta function concentrated on $x_*$ (provided that the global minimum is unique). Thus, if $t$ and $\varepsilon_0$ jointly go to $\infty$ and 0 respectively, at an appropriate rate, then the trajectory converges to the global minimum in the above probabilistic sense.
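A simple Euler-Maruyama discretization makes the idea concrete. In the sketch below (ours, not from the text), the geometric decay of $\varepsilon$ is a heuristic placeholder; the theory requires a much slower schedule:

```python
import numpy as np

rng = np.random.default_rng(1)

def perturbed_descent(grad, x0, h=1e-3, n_steps=50000, eps0=1.0, decay=0.9999):
    """Discretization of (4.8): x <- x - h g(x) + eps sqrt(h) N(0, I)."""
    x = np.asarray(x0, float)
    eps = eps0
    for _ in range(n_steps):
        x = x - h * grad(x) + eps * np.sqrt(h) * rng.standard_normal(x.shape)
        eps *= decay                     # eps(t) -> 0 as t -> infinity
    return x

# f(x) = x^4 - 4x^2 + x has two local minima; the noise lets the
# trajectory escape the shallower one:
print(perturbed_descent(lambda x: 4 * x ** 3 - 8 * x + 1, x0=[2.0]))
```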
Though attractive from a theoretical point of view, the above procedure has been found wanting computationally, since it requires a costly numerical integration. The reader may have noticed the similarities between this approach and the simulated annealing method (Kirkpatrick et al. 1983), which also involves a suitably perturbed local search procedure.

A continuous version of simulated annealing is close in spirit to the random direction method. Having arrived at some point $x$, a method of this type will generate a random neighbouring point $\bar{x}$ and continue the procedure from $\bar{x}$ if $\Delta f = f(\bar{x}) - f(x) \leq 0$ or, with probability $\exp(-\Delta f/\beta)$, if $\Delta f > 0$. Here, $\beta$ is the cooling parameter, which has to be decreased at an appropriate rate as the process continues.
Experiments with variations on this approach suggest that it is the method's willingness to accept an occasional decrease in quality that is responsible for its success, rather than a metaphysical resemblance between optimization and the annealing behaviour of physical systems. A continuous version of the approach could exploit this idea in various ways.

The main problem to cope with in such an extension is an appropriate selection mechanism for a random neighbour. One promising approach (Al-Khayyal et al. 1985) is to intersect a line going through $x$ in a random direction with the boundaries of $S$, calculate the distance between the intersection points, and travel a certain fraction $\gamma$ of that distance from $x$ (allowing reflection at the boundary), where $\gamma$ is decreased as the method progresses.
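A minimal rendering of such a step on a box-shaped $S$ (ours; the cooling and step-shrinking schedules are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

def anneal(f, lo, hi, n_iter=5000, beta=1.0, gamma=1.0):
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    x = rng.uniform(lo, hi)
    fx = f(x)
    for _ in range(n_iter):
        d = rng.standard_normal(lo.shape)
        d /= np.linalg.norm(d)                     # random direction
        # largest step t keeping x + t * d inside the box
        t = np.min(np.where(d > 0, (hi - x) / d,
                            np.where(d < 0, (lo - x) / d, np.inf)))
        y = x + gamma * t * d                      # travel a fraction gamma
        df = f(y) - fx
        if df <= 0 or rng.random() < np.exp(-df / beta):
            x, fx = y, fx + df                     # accept, possibly uphill
        beta *= 0.999                              # cooling parameter
        gamma = max(0.999 * gamma, 0.05)           # shrink the fraction
    return x, fx

print(anneal(lambda x: np.sin(3 * x[0]) + 0.5 * x[0] ** 2, [-3.0], [3.0]))
```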
The theoretical properties of this type of approach are very similar to those
obtained for the stochastic differential equation approach: convergence with
probability one to the global minimum for sufficiently smooth functions and
sufficiently slowly decreasing cooling parameter. However, convergence can be
very slow, and the asymptotic result does not really provide a useful stopping
rule. This remains a vexing open question.

5. Improvement of local minima

The global descent techniques discussed in the previous section were partially inspired by steepest descent, but did not really exploit the full power of such a local search procedure, in spite of the fact that the global minimum can obviously be found among the local ones. Hence, a logical next step is to investigate the possibility of generating a sequence of local minima with decreasing function values. Ideally, the last local minimum in the sequence generated by such an algorithm should demonstrably be the global one.

A well known deterministic method belonging to this class that has great intuitive appeal is the tunneling method (Levy & Gomez 1980). Actually, this method can be viewed as a generalization of the deflation technique (Goldstein & Price 1971) to find the global minimum of a one dimensional polynomial. This latter method works as follows.

Consider a local minimum $x^*$ of a one dimensional polynomial $f$, and define

$$f_1(x) = \frac{f(x) - f(x^*)}{(x - x^*)^2}. \tag{5.1}$$
If $f$ is a polynomial of degree $m$, then $f_1$ is a polynomial of degree $m - 2$. If, in addition, it can be shown that the global minimum of $f_1(x)$ is positive, then $x^*$ is the global minimum of $f$. In case there is a point $\bar{x}$ for which $f_1(\bar{x})$ is negative, then $f(\bar{x}) < f(x^*)$ and $x^*$ is not the global minimum. In the latter case one can continue from a new local minimum, which can be found by starting a local search in $\bar{x}$. To determine whether the global minimum is positive, we proceed iteratively, considering $f_1(x)$ as the new basic function.
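For polynomials all of this is mechanical, since the division in (5.1) is exact: $x^*$ is a double root of $f(x) - f(x^*)$ because $f'(x^*) = 0$. A sketch (ours) with numpy's polynomial arithmetic:

```python
import numpy as np
from numpy.polynomial import Polynomial

def deflate(p, x_star):
    """f_1 of (5.1): divide p - p(x_star) by (x - x_star)^2."""
    q, _ = divmod(p - p(x_star), Polynomial([x_star ** 2, -2 * x_star, 1.0]))
    return q                                       # degree m - 2

def min_over_reals(p):
    """Smallest value of p at its real stationary points."""
    r = p.deriv().roots()
    return min(p(t) for t in r[np.abs(r.imag) < 1e-9].real)

f = Polynomial([0.0, 1.0, -4.0, 0.0, 1.0])         # f(x) = x^4 - 4x^2 + x
crit = f.deriv().roots()
x_star = crit[np.abs(crit.imag) < 1e-9].real.max() # the right-hand local minimum
print(min_over_reals(deflate(f, x_star)) < 0)      # True: x_star is not global
```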
This method converges rapidly in a finite number of iterations if $f$ is a one-dimensional polynomial, but there is no reason to expect such attractive behaviour if we seek to generalize it to arbitrary objective functions and higher dimensions. Yet, this is precisely what the tunneling method sets out to do.

The tunneling method consists of two phases. In the first phase (minimization phase) the local search procedure is applied to a given point $x_0$ in order to find a local minimum $x^*$. The purpose of the second phase (tunneling phase) is to find a point $x$ different from $x^*$, but with the same function value as $x^*$, which is used as a starting point for the next minimization phase. This point is obtained by finding a zero of the tunneling function

$$T(x) = \frac{f(x) - f(x^*)}{\|x - x_m\|^{\lambda_0} \prod_{i=1}^k \|x - x_i^*\|^{\lambda_i}}, \tag{5.2}$$

where $x_1^*, \ldots, x_k^*$ are all local minima with function value equal to $f(x^*)$ found in previous iterations, and $\lambda_0, \lambda_i$ are positive parameters of the method. Subtracting $f(x^*)$ from $f(x)$ eliminates all points satisfying $f(x) > f(x^*)$ as a possible solution. The term $\prod_{i=1}^k \|x - x_i^*\|^{\lambda_i}$ is introduced to prevent the algorithm from choosing the previously found minima as a solution. To prevent the zero finding algorithm from converging to a stationary point of

$$\frac{f(x) - f(x^*)}{\prod_{i=1}^k \|x - x_i^*\|^{\lambda_i}} \tag{5.3}$$

which is not a zero of (5.2), the term $\|x - x_m\|^{\lambda_0}$ is added, with $x_m$ chosen appropriately.

If the global minimum has been found, then (5.2) will become positive for all $x$. Therefore the method stops if no zero of (5.2) can be found.
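In implementations, the tunneling phase is often handled quite pragmatically, e.g. by applying a zero finder or crude probing to (5.2). A toy rendering (ours; the probing scheme and parameter values are placeholders, not the authors' procedure):

```python
import numpy as np

rng = np.random.default_rng(3)

def tunnel(f, f_star, minima, x_m, lo, hi, lam0=1.0, lam=2.0, n_probe=20000):
    """Look for x with T(x) <= 0, i.e. f(x) <= f_star away from known minima."""
    def T(x):
        d = np.linalg.norm(x - x_m) ** lam0
        for x_i in minima:
            d *= np.linalg.norm(x - x_i) ** lam
        return (f(x) - f_star) / d
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    for _ in range(n_probe):
        x = rng.uniform(lo, hi)
        if T(x) <= 0.0:
            return x            # start the next minimization phase here
    return None                 # no zero found: accept f_star as global
```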

The tunneling method has the advantage that, provided that the local search
procedure is of the descent type, a local minimum with smaller function value
is located in each iteration. Hence, it is likely that a point with small function
value will be found relatively quickly. However, a major drawback of the
method is that it is difficult to be certain that the search for the global
minimum has been sufficiently thorough. In essence, the tunneling method
only reformulates the problem: rather than solving the original minimization
problem, one now must prove that the tunneling function does not have a zero.
This, however, is once again a global problem which is strongly related to the
original one. The information gained during the foregoing iterations is of no
obvious use in solving this new global problem, which therefore appears to be
as hard to solve as the original one. Thus, lacking any sort of guarantee, the
method is at best of some heuristic value. A decent stopping rule is hard to find
and even harder to justify.
A final method in this category can be criticized on similar grounds. It is the method of filled functions, developed in (Ge Renpu 1983, 1984). In its simplest form, for a given sequence of local minima $x_1^*, \ldots, x_k^*$, the method considers the auxiliary function $\bar{f}(x) = \exp(-\|x - x_k^*\|^2/\rho^2)/(r + f(x))$. For an appropriate choice of parameters $r$ and $\rho$, it can be shown that a method of local descent applied to $\bar{f}$ will lead to a point $\bar{x}_{k+1}$ starting from which a descent procedure applied to $f$ will arrive at a better local minimum $x_{k+1}^*$ (i.e., with $f(x_{k+1}^*) < f(x_k^*)$), if such an improved local minimum exists. Unfortunately, appropriate values for the parameters have to be based on information about $f$ that is not readily available. As in the case of the tunneling method, it seems to be difficult to identify a significant class of objective functions for which convergence to the global minimum can be guaranteed.

6. Enumeration of local minima

The final class of methods that we shall consider is superficially the most naive one: it is based on the observation that enumeration of all local minima will certainly lead to the detection of the global one. Of course, it is easy to conceive of test functions where the number of local minima is so large as to render such an approach inherently useless. Barring such examples, however, this class contains some of the computationally most successful global optimization techniques.

We start with a discussion of deterministic approaches based on this idea. As in (Branin 1972), we consider the differential equation

$$\frac{\mathrm{d}g(x(t))}{\mathrm{d}t} = \mu g(x(t)), \quad \text{where } \mu \in \{-1, +1\}. \tag{6.1}$$

The exact solution of (6.1) in terms of the gradient vector is

$$g(x(t)) = g(x(0))\, \mathrm{e}^{\mu t}. \tag{6.2}$$

Clearly, if $\mu = -1$, then the points $x(t)$ satisfying (6.2) will tend to a stationary point with increasing $t$. If $\mu = +1$, then the points will move away from that stationary point.

In the region of $x$ space where the Hessian $H$ of $f$ is nonsingular, (6.1) is identical to the Newton equation

$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = \mu H^{-1}(x(t))\, g(x(t)). \tag{6.3}$$

The differential equation (6.3) has the disadvantage that it is not defined in regions where the determinant $\det H(\cdot)$ is zero. To overcome this problem, let the adjoint matrix $\operatorname{Adj} H$ be defined by $H^{-1} \det H$. A transformation of the parameter $t$ enables (6.3) to be written as

$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = \mu \operatorname{Adj} H(x(t))\, g(x(t)). \tag{6.4}$$

Equation (6.4) has several nice features. It defines a sequence of points on the curve $x(t)$, where the step from the $k$-th to the $(k+1)$-th point of this sequence is equivalent to a Newton step with variable stepsize and alternating sign, i.e. $\pm H^{-1}(x(t))\, g(x(t))$ (Branin 1972; Hardy 1975). This sequence of points on the curve $x(t)$ solving (6.4) has to terminate in a stationary point of $f$ (Branin calls this an essential singularity), or in a point where $\operatorname{Adj} H(x(t)) \cdot g(x(t)) = 0$ although $g(x(t)) \neq 0$ (an extraneous singularity).
A global optimization method would now be to follow the curve $x(t)$ solving (6.4) in the above way for a given sign from any starting point, until a singularity is reached. Here, some extrapolation device is adopted to pass this singularity. If necessary, the sign of (6.4) is then changed, after which the process of generating points on the curve solving (6.4) is continued. The trajectory followed in this way depends, of course, on the starting point and on $g(x(0))$. It can be shown that the trajectories either lead to the boundary of $S$ or return to the starting point along a closed curve (Gomulka 1975, 1978).
It has been conjectured that in the absence of extraneous singularities all trajectories pass through all stationary points. The truth of this conjecture has not yet been determined, nor is it known what conditions on $f$ are necessary to avoid extraneous singularities. An example of a function of simple geometrical shape that has an extraneous singularity is described in (Treccani 1975).

To arrive at a better understanding of this method, we note again that in each point on a trajectory the gradient is proportional to a fixed vector $g(x(0))$ (cf. (6.2)). Hence, for a particular choice of $x(0)$ the trajectory points are a subset of the set $C(x(0)) = \{x \in S \mid g(x) = \lambda g(x(0)),\ \lambda \in \mathbb{R}\}$, which of course contains all stationary points (take $\lambda = 0$). Unfortunately, $C(x(0))$ can consist of several components, only one of which coincides with the trajectory that will be followed. All stationary points could be found if it were possible to start the method once in each component of $C(x(0))$, but it is not known how such starting points could be obtained.

Recently, it has been suggested (Diener 1986) that the components of $C(x(0))$ could be connected by, e.g., lines $\{x \in S \mid g(x(0))^T x = g(x(0))^T x_\alpha\}$, for an appropriate finite set $\{x_\alpha \mid \alpha \in A\}$. Such a finite set can be shown to exist under very general conditions, but to construct it in such a way that the trajectory connecting all stationary points can be found by using local information only is not easy. Of course, one could never expect this type of method to be immune to an indentation argument, but it may be possible to identify non-trivial classes of objective functions for which finite convergence can be guaranteed.
The relation between the trajectories of (6.1) and the set $C(x(0))$ can be exploited in a different way as well. Rather than keeping $\lambda$ fixed and tracing the solution to the differential equation, we can allow $\lambda$ to decrease from 1 (where $x = x(0)$ is a solution) to 0 (where a solution corresponds to a stationary point) and trace a solution $x$ as a function of $\lambda$. Thus, we interpret the set $\{x \in S \mid g(x) = \lambda g(x(0)),\ \lambda \in [0, 1]\}$ as giving rise to a homotopy, and may use simplicial approximation techniques to compute the path $x(\lambda)$ (Garcia & Gould 1980). By allowing $\lambda$ to vary beyond the interval $[0, 1]$, we could hope to generate all stationary points, but of course run into the same problem as above when $C(x(0))$ contains several components.
There are, however, better ways to use homotopies in the current context (Garcia & Zangwill 1979). Let us assume that the equalities $g(x) = 0$ can be rewritten as

$$\varphi_j(z) = 0 \quad (j = 1, \ldots, n), \tag{6.5}$$

where $\varphi_j$ is a polynomial function. We can view this as a set of equations over $\mathbb{C}^n$ and ask for a procedure to compute all the roots of (6.5), i.e., all stationary points of $f$.
To initiate such a procedure (Garcia & Zangwill 1979), we choose integers $q(j)$ ($j = 1, \ldots, n$) with the property that, for every sequence of complex vectors with $\|z\| \to \infty$, there exists an $l \in \{1, \ldots, n\}$ and a corresponding infinite subsequence for which $\lim \varphi_l(z)/(z_l^{q(l)} - 1)$ does not converge to a negative real number. Intuitively, this means that $z_l^{q(l)} - 1$ dominates $\varphi_l(z)$. Such integers $q(j)$ can always be found.

Now, let us consider the system

$$(1 - \lambda)(z_j^{q(j)} - 1) + \lambda \varphi_j(z) = 0 \quad (j = 1, \ldots, n). \tag{6.6}$$

For $\lambda = 0$, we obtain a trivial system with $Q = \prod_{j=1}^n q(j)$ solutions $z^{(h)}$. For $\lambda = 1$, we recover the original set of equations. The idea is to increase $\lambda$ from 0 to 1 and trace the solution paths $z^{(h)}(\lambda)$. Standard techniques from homotopy theory can be used to show that some of these paths may diverge to infinity as $\lambda \to 1$ (the choice of $q(j)$ precludes such behaviour for any $\lambda \in (0, 1)$, as can easily be seen), but the paths cannot cycle back, and every root of (6.5) must occur as the endpoint of some path. Simplicial pivoting techniques offer the possibility to trace these paths to any degree of accuracy required.

We now turn to stochastic methods that enumerate all local minima. In most methods of this type, a descent local search procedure $P$ is initiated in some or all points of a random sample drawn from $S$, in an effort to identify all the local minima that are potentially global.

The simplest way to make use of the local search procedure $P$ occurs in a folklore method known as multistart. Here, $P$ is applied to every point in a sample drawn from a uniform distribution over $S$, and the local minimum with the lowest function value found in this way is the candidate value for $y_*$.
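A sketch of multistart (ours), with scipy's bounded quasi-Newton routine standing in for the local search procedure $P$, and a crude rounding of minimizers used to recognize repeated minima:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

def multistart(f, lo, hi, n_starts=100):
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    found = {}                                   # rounded minimizer -> value
    for _ in range(n_starts):
        x0 = rng.uniform(lo, hi)                 # uniform sample point in S
        res = minimize(f, x0, bounds=list(zip(lo, hi)))   # local search P
        found[tuple(np.round(res.x, 3))] = res.fun
    x_best = min(found, key=found.get)
    # candidate for y_*, its location, and w = number of distinct minima
    return found[x_best], np.array(x_best), len(found)

print(multistart(lambda x: np.sin(3 * x[0]) + 0.5 * x[0] ** 2, [-3.0], [3.0]))
```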
It is easy to see that this candidate value (or, indeed, already the smallest function value in the sample prior to application of $P$) converges to $y_*$ with probability 1 under very weak smoothness assumptions on $f$ (Rubinstein 1981). At the same time, it is obvious that this method is very wasteful from a computational point of view, in that the same local minimum is discovered again and again. Before discussing possible improvements, we first consider the question of an appropriate stopping rule.
An interesting analysis of multistart was initiated in (Zielinski 1981) and extended in (Boender & Zielinski 1982; Boender & Rinnooy Kan 1983; Boender 1984). It is based on a Bayesian estimate of the number of local minima $W$ and of the relative size of each region of attraction $\theta_l = m(R(x_l^*))/m(S)$, $l = 1, \ldots, W$, where a region of attraction $R(x^*)$ is defined to be the set of all points in $S$ starting from which $P$ will arrive at $x^*$.
Why are the above parameters $W$ and $\theta_l$ important? It suffices to observe that, if their values are given, the outcome of an application of multistart is easy to analyze. We can view the procedure as a series of experiments in which a sample from a multinomial distribution is taken. Each cell of the distribution corresponds to a minimum $x_l^*$; the cell probability is equal to the probability that a uniformly sampled point will be allocated to $R(x_l^*)$ by $P$, i.e. equal to the corresponding $\theta_l$. Thus, the probability that the $l$-th local minimum is found $b_l$ times ($l = 1, \ldots, W$) in $N$ trials is

$$\frac{N!}{\prod_{l=1}^W b_l!} \prod_{l=1}^W \theta_l^{b_l}. \tag{6.7}$$
It is impossible, however, to distinguish between outcomes that are identical up to a relabeling of the minima. Thus, we have to restrict ourselves to distinguishable aggregates of the random events that appear in (6.7). To calculate the probability that $w$ different local minima are found during $N$ local searches and that the $i$-th minimum is found $a_i$ times ($a_i > 0$, $\sum_{i=1}^w a_i = N$), let $c_j$ be the number of $a_i$'s equal to $j$ and let $S_W(w)$ denote the set of all permutations of $w$ different elements of $\{1, 2, \ldots, W\}$. The required probability is then given by (Boender 1984)

$$\frac{N!}{\prod_{j=1}^N c_j! \prod_{i=1}^w a_i!} \sum_{(l_1, \ldots, l_w) \in S_W(w)} \prod_{i=1}^w \theta_{l_i}^{a_i}. \tag{6.8}$$
One could use (6.8) to obtain a maximum likelihood estimate of the unknown
number of local optima W. Unfortunately, (6.8) appears to attain a (possibly
non-unique) global maximum at infinity for all possible outcomes a_i (i =
1, …, w) (Boender 1984).
Formula (6.8) can be used, however, in a Bayesian approach in which the
unknowns W, θ_1, …, θ_W are assumed to be themselves random variables for
which a prior distribution can be specified. Given the outcome of an application
of multistart, Bayes's rule is used to compute the posterior distribution, which
incorporates both the prior beliefs and the sample information.
In (Boender & Rinnooy Kan 1983), it is assumed that a priori each number
of local minima between 1 and ∞ is equally probable, and that the relative sizes
of the regions of attraction θ_1, …, θ_W follow a uniform distribution on the
(W − 1)-dimensional simplex.
After lengthy calculations, surprisingly simple expressions emerge for the
posterior expectation of several interesting parameters. For instance, given that
w different local minima have been found in N local searches, the posterior
probability that there are K local minima is equal to

$$\frac{(K-1)!\,K!\,(N-1)!\,(N-2)!}{(N+K-1)!\,(K-w)!\,w!\,(w-1)!\,(N-w-2)!} \tag{6.9}$$

and the posterior expectation of the number of local minima is

$$\frac{w(N-1)}{N-w-2}\,. \tag{6.10}$$

This theoretical framework is quite an attractive one, the more so since it
can be easily extended to yield optimal Bayesian stopping rules. Such rules
incorporate assumptions about the costs and potential benefits of further
experiments and weigh these against each other probabilistically to calculate
the optimal stopping point. Several loss structures and corresponding stopping
rules are described in (Boender & Rinnooy Kan 1987). They provide a
completely satisfactory solution to the stopping problem for multistart.
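The posterior quantities (6.9) and (6.10) are straightforward to evaluate; the sketch below does so in log-space to avoid factorial overflow, and adds a simple stopping test (stop once the posterior expected number of minima exceeds the number already observed by less than one half), which only imitates the spirit of the optimal rules of (Boender & Rinnooy Kan 1987).

```python
from math import lgamma, exp

def posterior_prob(K, w, N):
    """Posterior probability (6.9) that there are K local minima after
    N local searches have revealed w distinct ones."""
    if K < w or N < w + 3:
        return 0.0
    log_p = (lgamma(K) + lgamma(K + 1) + lgamma(N) + lgamma(N - 1)
             - lgamma(N + K) - lgamma(K - w + 1) - lgamma(w + 1)
             - lgamma(w) - lgamma(N - w - 1))
    return exp(log_p)

def expected_minima(w, N):
    """Posterior expectation (6.10) of the number of local minima."""
    return w * (N - 1) / (N - w - 2)

w, N = 4, 100
print(expected_minima(w, N))                                  # about 4.21
print(sum(posterior_prob(K, w, N) for K in range(w, 10_000)))  # close to 1
print(expected_minima(w, N) < w + 0.5)  # heuristic stopping test
```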
Another Bayesian stopping rule which can be used for methods based on a
uniform sample such as multistart has been proposed in (Betro 1981; Betro &
Schoen 1987). Here, for each set C ⊂ ℝ, the probability that f(x) ∈ C for a
uniformly sampled point x is specified in a prior distribution and updated as a
result of the sample. In this way a posterior distribution is obtained for φ(y)
(see (1.6)) for every possible y. The test whether or not a point x (which may
depend on the sample) is an element of A_φ(ε), i.e. φ(f(x)) ≤ ε, can now be
formulated as a Bayesian decision problem. A Bayesian stopping rule based on
the optimal solution of this decision problem can be used to decide if additional
sampling is advisable or not.
We now return to the issue of computational efficiency. To avoid useless
local searches, the local search procedure should be initiated no more than
once, or better still exactly once, in every region of attraction. Clustering
analysis is a natural tool to consider next.

The basic idea behind clustering methods is to start from a uniform sample
from S, to create groups of mutually close points that correspond to the
relevant regions of attractions, and to start P no more than once in every such
region. Two ways to create such groups from the initial sample have been
proposed. The first, called reduction (Becker & Lago 1970), removes a certain
fraction, say 1 − γ, of the sample points with the highest function values. The
second, called concentration (Törn 1978), transforms the sample by allowing
one or at most a few steepest descent steps from every point.
Note that these transformations do not necessarily yield groups of points that
correspond to regions of attraction of f. For instance, if we define the level set
L(y) to be {x ∈ S | f(x) ≤ y}, then the groups created by sample reduction
correspond to the connected components of a level set, and these do not
necessarily correspond to regions of attraction. We return to this problem later.
Several ways have been proposed to identify the clusters of points that result
from either sample reduction or concentration. The basic framework, however,
is always the same. Clusters are formed in a stepwise fashion, starting from a
seed point, which may be the unclustered point with lowest function value or
the local minimum found by applying P to this point. Points are added to the
cluster through application of a clustering rule until a termination criterion is
satisfied. Where the methods differ, is in their choice of the clustering rule and
of the corresponding termination criterion.
As elsewhere in geometrical cluster analysis, the two most popular clustering
rules are based on recognizing clusters by means of either the density (Törn
1976) or the distances among the points. In the former case, the termination
criterion will refer to the fact that the number of points per unit volume within
the cluster should not drop below a certain level. In the latter case, it will
express that a certain critical distance between pairs of closest points within the
cluster should not be exceeded.
Unlike the usual clustering problem, we know here that the points originally
come from a uniform distribution. Can this be exploited in the analysis? If the
clusters are based on concentration of the sample, the original distribution can
no longer be recognized. However, if the clusters have been formed by
reduction of the sample, then within each cluster the original uniform distribu-
tion still holds. Moreover, as pointed out above, the clusters correspond to
connected components of a level set of a continuously differentiable function.
These two observations provide a powerful tool in obtaining statistically correct
termination criteria.
Since the connected components of a level set can be of any geometrical
shape, the clustering procedure should be able to produce clusters of widely
varying geometrical shape. Such a procedure, single linkage clustering, was
described in (Boender et al. 1982; Timmer 1984). In this method, clusters are
formed by adding the closest unclustered point to the current cluster until the
distance of this point to the cluster exceeds the critical distance. (The distance
between a point and a cluster is defined as the distance of this point to the
closest point in the cluster.) If no more points can be added to the cluster, the
cluster is terminated. Provided that there are still points to be clustered, P is
then applied to the unclustered sample point with smallest function value, and
a new cluster is formed around the resulting local minimum.
Actually, the method is implemented in an iterative fashion, where points are
sampled in groups of fixed size, say N, and the expanded sample is clustered
(after reduction) from scratch in every iteration. If X* denotes the set of local
minima that have been found in previous iterations, then the elements of X*
are used as the seed points for the first clusters. If, for some σ > 0, the critical
distance in iteration k is chosen to be

$$r_k=\pi^{-1/2}\left[\Gamma\!\Bigl(1+\frac{n}{2}\Bigr)\,m(S)\,\frac{\sigma\log kN}{kN}\right]^{1/n}, \tag{6.11}$$

then the resulting method has strong theoretical properties.


For instance, if σ > 4, then even if the sampling continues forever, the total
number of local searches ever started by single linkage can be proved to be
finite with probability 1. Furthermore, if we define y_γ such that

$$\frac{m(\{x\in S\mid f(x)\le y_\gamma\})}{m(S)}=\gamma, \tag{6.12}$$

then for any σ > 0 a local minimum will be found by single linkage in every
connected component of L(y_γ) in which a point has been sampled, within a
finite number of iterations with probability 1 (Timmer 1984; Rinnooy Kan &
Timmer 1987a).
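A minimal sketch of the two ingredients, assuming the dimension n and the measure m(S) are known: the critical distance (6.11), and a naive quadratic-time single-linkage pass in which a union-find structure replaces the incremental cluster growing described above.

```python
import numpy as np
from math import gamma, log, pi

def critical_distance(k, N, n, m_S, sigma=4.0):
    """The critical distance r_k of (6.11) for kN sample points in an
    n-dimensional region S of Lebesgue measure m_S."""
    return pi ** -0.5 * (gamma(1.0 + n / 2.0) * m_S
                         * sigma * log(k * N) / (k * N)) ** (1.0 / n)

def single_linkage_labels(points, r):
    """Assign two points to the same cluster whenever a chain of points
    connects them with successive distances at most r (union-find)."""
    m = len(points)
    parent = list(range(m))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(m):
        for j in range(i + 1, m):
            if np.linalg.norm(points[i] - points[j]) <= r:
                parent[find(i)] = find(j)
    return [find(i) for i in range(m)]
```

In the full method, the clusters would be seeded with the minima in X* and P would be started only from reduced sample points left unclustered at the current critical distance.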
In order to assign a point to a cluster in single linkage, it suffices that this
point is close to one point already in the cluster. In principle it should be
possible to design superior methods by using information of more than two
sample points simultaneously. In (Timmer 1984) an approach is suggested in
which S is partitioned into small hypercubes or cells. If there are kN sample
points, a cell A is said to be full if it contains more than ½·m(A)kN/m(S)
reduced sample points, i.e. more than half the expected number of sample
points in A. If a cell is not full, it is called empty. Intuitively, one would think
that if the cells get smaller with increasing sample size, then each component of
L(y_γ) can be approximated by a set of full cells, and different components of
L(y_γ) will be separated by a set of empty cells. Hence, we could let a cluster
correspond to a connected subset of S which coincides with a number of full
cells. These clusters can be found by applying a single linkage type algorithm to
the full cells, such that if two cells are neighbours, then they are assigned to the
same cluster. Note that in this approach a cluster corresponds to a set of cells
instead of a set of reduced sample points.
The mode analysis method, based on this idea and originally proposed in
(Timmer 1984), is again iterative. In iteration k, for some σ > 0, S is divided
into cells of measure (m(S)σ log kN)/(kN). After sample reduction, it is
determined which cells are full. Sequentially, a seed cell is chosen and a cluster
is grown around this seed cell such that each full cell which is a neighbour of a

cell already in the cluster is added to the cluster until there are no more such
cells.
Initially, the cells that contain local minima previously found are chosen as
seed cells. If this is no longer possible, P is applied to the point x̄ which has the
smallest function value among the reduced sample points which are in un-
clustered full cells. The cell containing x̄ is then chosen as the next seed cell.
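A sketch of the cell bookkeeping, assuming S is the unit cube (so m(S) = 1) and rounding the cell size to a whole number of cells per axis; the function name and the dictionary representation are illustrative only.

```python
from math import floor, log

def full_cells(reduced_points, k, N, sigma=20.0):
    """Partition the unit cube into cells of measure close to
    sigma*log(kN)/(kN) and return the cells holding more than half the
    nominal expected number of sample points, i.e. more than
    0.5*sigma*log(kN).  (The finiteness guarantee in the text needs
    sigma > 20.)"""
    n = len(reduced_points[0])
    target = sigma * log(k * N) / (k * N)          # desired cell measure
    per_axis = max(1, floor(target ** (-1.0 / n)))  # cells per coordinate axis
    counts = {}
    for x in reduced_points:
        idx = tuple(min(int(xi * per_axis), per_axis - 1) for xi in x)
        counts[idx] = counts.get(idx, 0) + 1
    threshold = 0.5 * sigma * log(k * N)
    return {idx for idx, c in counts.items() if c > threshold}
```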
The properties of this mode analysis method are comparable to the prop-
erties of single linkage. In (Timmer 1984; Rinnooy Kan & Timmer 1987a) it is
shown that if σ > 20 (this is not the sharpest possible bound), then, even if the
sampling continues forever, the total number of local searches ever applied in
mode analysis is finite with probability 1. If σ > 0, then in every connected
component of L(y_γ) in which a point has been sampled, a local minimum will
be found by mode analysis in a finite number of iterations with probability 1.
Both single linkage and mode analysis share one major deficiency. The
clusters formed by these methods will (at best) correspond to the connected
components of a level set instead of regions of attraction. Although a region of
attraction cannot intersect with several components of a level set, it is
obviously possible that a component contains more than one region of attrac-
tion. Since only one local search is started in every cluster, it is therefore
possible that a local minimum will not be found although its region of
attraction contains a (reduced) sample point.
We conclude that both single linkage and mode analysis lack an important
guarantee that multistart was offering, namely that if a point is sampled in
R(x*), then the local minimum x* will be found.
We conclude this section by discussing methods that combine the computa-
tional efficiency of clustering methods with the theoretical virtues of multistart.
These multi-level methods exploit the function values in the sample points to
derive additional information about the structure of f. These methods are again
applied iteratively to an expanding sample. In the multi-level single linkage
method, the local search procedure P is applied to every sample point, except if
there is another sample point within the critical distance which has a smaller
function value. Formulated in this way, the method does not even produce
clusters and it can indeed be applied to the entire sample without reduction or
concentration. Of course, it is still possible to use the reduced sample points
only, if one believes that it is unlikely that the global minimum will be found by
applying P to a sample point which does not belong to the reduced sample. If
desired, clusters can be constructed by associating a point to a local minimum if
there exists a chain of points linking it to that minimum, such that the distance
between each successive pair is at most equal to the critical distance and the
function value is decreasing along the chain. Clearly a point could in this way
be assigned to more than one minimum. This is perfectly acceptable.
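The rule itself takes only a few lines over the current sample; the sketch below again borrows a quasi-Newton routine for P and uses a crude distance tolerance (an arbitrary choice) to decide whether a located minimum is new.

```python
import numpy as np
from scipy.optimize import minimize

def mlsl_iteration(sample, fvals, r, f, minima, tol=1e-5):
    """One multi-level single linkage pass: apply P to a sample point
    unless another sample point lies within the critical distance r and
    has a smaller function value.  `sample` is a (kN, n) array, `fvals`
    the corresponding function values, `minima` the minima found so far."""
    for i in np.argsort(fvals):                     # ascending function value
        dists = np.linalg.norm(sample - sample[i], axis=1)
        if np.any((dists <= r) & (fvals < fvals[i])):
            continue                                # a better neighbour blocks i
        x_star = minimize(f, sample[i]).x           # the local search P
        if all(np.linalg.norm(x_star - m) > tol for m in minima):
            minima.append(x_star)
    return minima
```

Across iterations, the sample would be extended with N fresh uniform points and r recomputed from (6.11), so that the test becomes more stringent as kN grows.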
In spite of its simplicity, the theoretical properties of multi-level single
linkage are quite strong. If the critical distance is chosen as in (6.11) and if
σ > 4, then even if sampling continues forever, the total number of local
searches ever started by multi-level single linkage is finite with probability 1

(Timmer 1984; Rinnooy Kan & Timmer 1987b). At the same time we can
prove, however, for every σ > 0 that all local minima (including the global
one) will be located in the long run. More precisely, consider the connected
component of the level set L(y) containing a local minimum x* and define the
basin Q(x*) as the connected component corresponding to the largest value for
y for which this component contains x* as its only stationary point. It is then
possible to show that in the limit, if a point is sampled in Q(x*), then the local
minimum x* will be found by the procedure. As a result, if σ > 0, then any
local minimum x* will be found by multi-level single linkage within a finite
number of iterations with probability 1.
We finally mention a method which is called multi-level mode analysis. This
method is a generalization of mode analysis in a similar way and for the same
reasons as multi-level single linkage is a generalization of single linkage. As in
mode analysis, S is partitioned into cells of measure (m(S)σ log kN)/(kN).
After sample reduction, it is determined which cells contain more than
½σ log kN reduced sample points; these are labeled to be full. For each full cell
the function value of the cell is defined to be equal to the smallest function
value of any of the sample points in the cell. Finally, for every full cell, P is
applied to a point in the cell except if the cell has a neighbouring cell which is
full and has a smaller function value. Note that we still reduce the sample.
Although this is no longer strictly necessary, it creates the extra possibility that
two regions of attraction are recognized as such only because a cell on the
boundary of both regions is empty.
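Given the set of full cells and, for each, the smallest sample value inside it, the multi-level rule reads as follows; the names `full` and `cell_fvals` are hypothetical companions to the full-cell sketch given earlier.

```python
from itertools import product

def neighbour_offsets(n):
    """All nonzero offsets in {-1, 0, 1}^n (cells sharing a face or corner)."""
    return [d for d in product((-1, 0, 1), repeat=n) if any(d)]

def mlma_start_cells(full, cell_fvals):
    """Select the full cells from which P is started: a cell is skipped
    if some neighbouring full cell has a smaller cell function value.
    `full` is a set of integer index tuples; `cell_fvals[c]` is the
    smallest sample function value inside cell c."""
    starts = []
    for cell in full:
        nbrs = (tuple(c + d for c, d in zip(cell, off))
                for off in neighbour_offsets(len(cell)))
        if not any(nb in full and cell_fvals[nb] < cell_fvals[cell]
                   for nb in nbrs):
            starts.append(cell)
    return starts
```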
For multi-level mode analysis essentially the same results hold as for
multi-level single linkage.
Since the methods described in this section and multistart result in the same
set of minima with a probability that tends to 1 with increasing sample size, we
can easily modify and use the stopping rules which were designed for multistart
(Boender & Rinnooy Kan 1987).
The limited computational experience available (cf. the next section) con-
firms that this final class of stochastic methods does offer an attractive mix of
reliability and speed.

7. Concluding remarks

As mentioned earlier, it is not easy to design a representative and fair
computational experiment in which the performance of various global optimi-
zation methods is compared, if only because the inevitable trade-off between
reliability and effort is rarely discussed explicitly, let alone captured by an
appropriate stopping rule. The choice of test functions is also crucial, in that
each method can easily be fooled by some test functions and will perform
uncommonly well on appropriately chosen other ones.
An effort is currently under way to arrive at a suitable set of test problems
and to set up guidelines for a proper test which at least provides explicit

Table 7.1
Test functions (Dixon & Szegö 1978b)
GP   Goldstein and Price
BR   Branin (RCOS)
H3   Hartman 3
H6   Hartman 6
S5   Shekel 5
S7   Shekel 7
S10  Shekel 10

information on implementational details. Here, we can do little better than to
summarize the experiments carried out on a limited set of test functions that
was proposed in (Dixon & Szegö 1978b). The test functions are listed in
Table 7.1.
Initial experiments have revealed that the results of multi-level single linkage
were the most promising and therefore this method has been compared with a
few leading contenders whose computational behaviour is described in (Dixon
& Szegö 1978a). In this reference methods are compared on the basis of two
criteria: the number of function evaluations and the running time required to
solve each of the seven test problems. To eliminate the influence of the
different computer systems used, the running time required is measured in
units of standard time, where one unit corresponds to the running time needed
for 1000 evaluations of the S5 test function in the point (4, 4, 4, 4).
In Table 7.3 and Table 7.4, we summarize the computational results of the
methods listed in Table 7.2.
The computational results bear out that a stochastic global optimization
method such as multi-level single linkage provides an attractive combination of
theoretical accuracy and computational efficiency. As such, the method can be
recommended for use in situations where little is known about the objective
function, for instance, if f is given in the form of a black box subroutine. It may

Table 7.2
Methods tested
     Description                  References             Computational results
A    Trajectory                   (Branin & Hoo 1972)    (Gomulka 1978)
B    Density clustering           (Törn 1976)            (Törn 1978)
     (concentration)
C    Density clustering           (de Biase &            (de Biase &
                                  Frontini 1978)         Frontini 1978)
D    Multi-level single linkage   (Timmer 1984)          (Timmer 1984)
E    Random direction             (Bremmerman 1970)      (Dixon & Szegö 1978b)
F    Controlled random search     (Price 1978)           (Price 1978)

Table 7.3
Number of function evaluations
                          Function
Method   GP     BR     H3     H6     S5     S7     S10
A        -      -      -      -      5500   5020   4860
B        2499   1558   2584   3447   3649   3606   3874
C        378    597    732    807    620    788    1160
D        148    206    197    487    404    432*   564
E        300    160    420L   515    375L   405L   336L
F        2500   1800   2400   7600   3800   4900   4400
L: The method did not find the global minimum.
*: The global minimum was not found in one of the four runs.

Table 7.4
Number of units standard time
                          Function
Method   GP     BR     H3     H6     S5     S7     S10
A        -      -      -      -      9      8.5    9.5
B        4      4      8      16     10     13     15
C        15     14     16     21     23     20     30
D        0.15   0.25   0.5    2      1      1*     2
E        0.7    0.5    2L     3      1.5L   1.5L   2L
F        3      4      8      46     14     20     20
L: The method did not find the global minimum.
*: The method did not find the global minimum in one of the four runs.

be possible to improve it even further by developing an appropriate rule to
modify the sampling distribution as the procedure continues. This, however,
would seem to presuppose a specific global model of the function, as would any
attempt to involve function values directly in the stopping rule.
At the end of this chapter, it is appropriate to conclude that global
optimization as a research area is still on its way to maturity. The variety of
techniques proposed is impressive, but their relative merits have neither been
analyzed in a systematic manner nor properly investigated in computational
experiments. In view of the significance of the global optimization problem, it
is to be hoped that this challenge will meet with an appropriate response in the
near future.

References

Al-Khayyal, F., R. Kertz and C.A. Tovey (1985), Algorithms and diffusion approximations for
nonlinear programming with simulated annealing, to appear.
Aluffi-Pentini, F., V. Parisi and F. Zirilli (1985), Global optimization and stochastic differential
equations, Journal of Optimization Theory and Applications 47, 1-17.

Anderssen, R.S. and P. Bloomfield (1975), Properties of the random search in global optimization,
Journal of Optimization Theory and Applications 16, 383-398.
Archetti, F. and B. Betro (1978a), A priori analysis of deterministic strategies for global
optimization, in (Dixon & Szegö 1978a).
Archetti, F. and B. Betro (1978b), On the effectiveness of uniform random sampling in global
optimization problems, Technical Report, University of Pisa, Pisa, Italy.
Archetti, F. and B. Betro (1979a), Stochastic models in optimization, Bolletino Mathematica
Italiana.
Archetti, F. and B. Betro (1979b), A probabilistic algorithm for global optimization, Calcolo 16,
335-343.
Archetti, F. and F. Frontini (1978), The application of a global optimization method to some
technological problems, in (Dixon & Szegö 1978a).
Basso, P. (1985), Optimal search for the global maximum of functions with bounded seminorm,
SIAM Journal of Numerical Analysis 22, 888-903.
Becker, R.W. and G.V. Lago (1970), A global optimization algorithm, in: Proceedings of the 8th
Allerton Conference on Circuits and Systems Theory.
Betro, B. (1981), Bayesian testing of nonparametric hypotheses and its application to global
optimization, Technical Report, CNR-IAMI, Italy.
Betro, B. and F. Schoen (1987), Sequential stopping rules for the multistart algorithm in global
optimization, Mathematical Programming 38, 271-280.
Boender, C.G.E. and R. Zielinski (1982), A sequential Bayesian approach to estimating the
dimension of a multinomial distribution, Technical Report, The Institute of Mathematics of the
Polish Academy of Sciences.
Boender, C.G.E., A.H.G. Rinnooy Kan, L. Stougie and G.T. Timmer (1982), A stochastic
method for global optimization, Mathematical Programming 22, 125-140.
Boender, C.G.E. and A.H.G. Rinnooy Kan (1983), A Bayesian analysis of the number of cells of
a multinomial distribution, The Statistician 32, 240-248.
Boender, C.G.E. and A.H.G. Rinnooy Kan (1987), Bayesian stopping rules for multistart global
optimization methods, Mathematical Programming 37, 59-80.
Boender, C.G.E. (1984), The generalized multinomial distribution: A Bayesian analysis and
applications, Ph.D. Dissertation, Erasmus Universiteit Rotterdam (Centrum voor Wiskunde en
Informatica, Amsterdam).
Branin, F.H. (1972), Widely convergent method for finding multiple solutions of simultaneous
nonlinear equations, IBM Journal of Research and Development 16, 504-522.
Branin, F.H. and S.K. Hoo (1972), A method for finding multiple extrema of a function of n
variables, in: F.A. Lootsma (ed.), Numerical Methods of Nonlinear Optimization (Academic
Press, London).
Bremmerman, H. (1970), A method of unconstrained global optimization, Mathematical Bio-
sciences 9, 1-15.
Cramer, H. and M.R. Leadbetter (1967), Stationary and Related Stochastic Processes (Wiley, New
York).
De Biase, L. and F. Frontini (1978), A stochastic method for global optimization: its structure and
numerical performance, in (Dixon & Szegö 1978a).
Devroye, L. (1979), A bibliography on random search, Technical Report SOCS, McGill Univer-
sity, Montreal.
Diener, I. (1986), Trajectory nets connecting all critical points of a smooth function, Mathematical
Programming 36, 340-352.
Dixon, L.C.W. (1978), Global optima without convexity, Technical report, Numerical Optimiza-
tion Centre, Hatfield Polytechnic, Hatfield, England.
Dixon, L.C.W. and G.P. Szegö (eds.) (1975), Towards Global Optimization (North-Holland,
Amsterdam).
Dixon, L.C.W. and G.P. Szegö (eds.) (1978a), Towards Global Optimization 2 (North-Holland,
Amsterdam).
Dixon, L.C.W. and G.P. Szegö (1978b), The global optimization problem, in (Dixon & Szegö
1978a).

Evtushenko, Y.P. (1971), Numerical methods for finding global extrema (case of a nonuniform
mesh), U.S.S.R. Computational Mathematics and Mathematical Physics 11, 1390-1403.
Fedorov, V.V. (ed.) (1985), Problems of Cybernetics, Models and Methods in Global Optimization
(USSR Academy of Sciences, Council of Cybernetics, Moscow).
Garcia, C.B. and W.I. Zangwill (1979), Determining all solutions to certain systems of nonlinear
equations, Mathematics of Operations Research 4, 1-14.
Garcia, C.B. and F.J. Gould (1980), Relations between several path following algorithms and local
global Newton methods, SIAM Review 22, 263-274.
Gaviani, M. (1975), Necessary and sufficient conditions for the convergence of an algorithm in
unconstrained minimization, in (Dixon & Szegö 1975).
Ge Renpu (1983), A filled function method for finding a global minimizer of a function of several
variables, Dundee Biennial Conference on Numerical Analysis.
Ge Renpu (1984), The theory of the filled function method for finding a global minimizer of a
nonlinearly constrained minimization problem, SIAM Conference on Numerical Optimization,
Boulder, CO.
Goldstein, A.A. and J.F. Price (1971), On descent from local minima. Mathematics of Computa-
tion 25, 569-574.
Gomulka, J. (1975), Remarks on Branin's method for solving nonlinear equations, in (Dixon &
Szegö 1975).
Gomulka, J. (1978), Two implementations of Branin's method: numerical experiences, in (Dixon
& Szegö 1978a).
Griewank, A.O. (1981), Generalized descent for global optimization, Journal of Optimization
Theory and Applications 34, 11-39.
Hardy, J.W. (1975), An implemented extension of Branin's method, in (Dixon & Szegö 1975).
Horst, R. (1986), A general class of branch-and-bound methods in global optimization with some
new approaches for concave minimization, Journal of Optimization Theory and Applications 51,
271-291.
Horst, R. and H. Tuy (1987), On the convergence of global methods in multiextremal optimiza-
tion, Journal of Optimization Theory and Applications 54, 253-271.
Kalantari, B. (1984), Large scale global minimization of linearly constrained concave quadratic
functions and related problems, Ph.D. Thesis, Computer Science Department, University of
Minnesota.
Kirkpatrick, S., C.D. Gelatt Jr. and M.P. Vecchi (1983), Optimization by simulated annealing,
Science 220, 671-680.
Kushner, H.J. (1964), A new method of locating the maximum point of an arbitrary multipeak
curve in the presence of noise, Journal of Basic Engineering 86, 97-106.
Lawrence, J.P. and K. Steiglitz (1972), Randomized pattern search, IEEE Transactions on
Computers 21, 382-385.
Levy, A. and S. Gomez (1980), The tunneling algorithm for the global optimization problem of
constrained functions, Technical Report, Universidad Nacional Autonoma de Mexico.
McCormick, G.P. (1983), Nonlinear Programming: Theory, Algorithms and Applications (John
Wiley and Sons, New York).
Mentzer, S.G. (1985), Private communication.
Mladineo, R.H. (1986), An algorithm for finding the global maximum of a multimodal, mul-
tivariate function, Mathematical Programming 34, 188-200.
Pardalos, P.M. and J.B. Rosen (1986), Methods for global concave minimization: A bibliographic
survey, SIAM Review 28, 367-379.
Pinter, J. (1983), A unified approach to globally convergent one-dimensional optimization
algorithms, Technical report, CNR-IAMI, Italy.
Pinter, J. (1986a), Globally convergent methods for n-dimensional multi-extremal optimization,
Optimization 17, 187-202.
Pinter, J. (1986b), Extended univariate algorithms for n-dimensional global optimization, Comput-
ing 36, 91-103.
Pinter, J. (1988), Branch and Bound algorithms for solving global optimization problems with
Lipschitzian structure, Optimization 19, 101-110.

Price, W.L. (1978), A controlled random search procedure for global optimization, in (Dixon &
Szegö 1978a).
Rastrigin, L.A. (1963), The convergence of the random search method in the extremal control of a
many-parameter system, Automation and Remote Control 24, 1337-1342.
Ratschek, H. (1985), Inclusion functions and global optimization, Mathematical Programming 33,
300-317.
Rinnooy Kan, A.H.G. and G.T. Timmer (1987a), Stochastic global optimization methods. Part I:
Clustering methods, Mathematical Programming 39, 27-56.
Rinnooy Kan, A.H.G. and G.T. Timmer (1987b), Stochastic global optimization methods. Part II:
Multi-level methods, Mathematical Programming 39, 57-78.
Rosen, J.B. (1983), Global minimization of a linearly constrained concave function by partition of
feasible domain, Mathematics of Operations Research 8, 215-230.
Rubinstein, R.Y. (1981), Simulation and the Monte Carlo Method (John Wiley & Sons, New York).
Schrack, G. and M. Choit (1976), Optimized relative step size random searches, Mathematical
Programming 16, 230-244.
Schuss, Z. (1980), Theory and Applications of Stochastic Differential Equations (John Wiley and
Sons, New York).
Shepp, L.A. (1979), The joint density of the maximum and its location for a Wiener process with
drift, Journal of Applied Probability 16, 423-427.
Shubert, B.O. (1972), A sequential method seeking the global maximum of a function, SIAM
Journal on Numerical Analysis 9, 379-388.
Sobol, I.M. (1982), On an estimate of the accuracy of a simple multidimensional search, Soviet
Math. Dokl. 26, 398-401.
Solis, F.J. and R.J-B. Wets (1981), Minimization by random search techniques, Mathematics of
Operations Research 6, 19-30.
Strongin, R.G. (1978), Numerical Methods for Multiextremal Problems (Nauka, Moscow).
Sukharev, A.G. (1971), Optimal strategies of the search for an extremum, Computational
Mathematics and Mathematical Physics 11, 119-137.
Timmer, G.T. (1984), Global optimization: A stochastic approach, Ph.D. Dissertation, Erasmus
University Rotterdam.
Törn, A.A. (1976), Cluster analysis using seed points and density determined hyperspheres with
an application to global optimization, in: Proceedings of the Third International Conference on
Pattern Recognition, Coronado, CA.
Törn, A.A. (1978), A search clustering approach to global optimization, in (Dixon & Szegö
1978a).
Treccani, G. (1975), On the critical points of continuously differentiable functions, in (Dixon &
Szegö 1975).
Tuy, H. (1987), Global optimization of a difference of two convex functions, Mathematical
Programming Study 30, 150-182.
Walster, G.W., E.R. Hansen and S. Sengupta (1985), Test results for a global optimization
algorithm, in: P.T. Boggs and R.B. Schnabel (eds.), Numerical Optimization 1984 (SIAM,
Philadelphia, PA).
Woerlee, A. (1984), A Bayesian global optimization method, Masters Thesis, Erasmus University
Rotterdam.
Wood, G.R. (1985), Multidimensional bisection and global minimization, Technical report,
University of Canterbury.
Zielinski, R. (1981), A statistical estimate of the structure of multi-extremal problems, Mathemati-
cal Programming 21, 348-356.
Zilinskas, A. (1982), Axiomatic approach to statistical models and their use in multimodal
optimization theory, Mathematical Programming 22, 104-116.
Zilverberg, N.D. (1983), Global minimization for large scale linear constrained systems, Ph.D.
Thesis, Computer Science Dept., University of Minnesota.
