Introduction to Optimization, Gradient-Based Methods


7.10 For y' = x² − 2y and the initial condition y(0) = 0, applying the Runge-Kutta method with h = 0.1 yields y(0.1) = 0.0003175, y(0.2) = 0.0024204, y(0.3) = 0.0077979. Apply the Adams method to find y(0.4). Compare with the Euler-method result of Problem 7.2, which was 0.01284, and the analytical result, which is 0.0176678.
7.11 Use the Adams-method program to estimate the solution of

dy/dx = −y(2x e^(−x) + ln y)

at x = 0.1. Use the initial condition y(0) = 1, and choose Smin = 0.0001, Smax = 0.01, h = 0.001.
Hint: The closest printout to x = 0.1 will be x = 0.092; thus, for an initial condition of y(0.092) use the program again with h = 0.008.
Compare with the analytical answer y = exp(−x² e^(−x)).
7.12 In Section 7.8 the Runge-Kutta method was applied to solve the set of equations

dy/dx = g(x, y, y1),   dy1/dx = f(x, y, y1).

This problem will instead illustrate the application of the Euler method, which can advance the solution by y(x + h) ≈ y(x) + h g(x, y, y1) and y1(x + h) ≈ y1(x) + h f(x, y, y1). Use this approach to estimate y(0.2) and y1(0.2) for g(x, y, y1) = x + y1, f(x, y, y1) = x − 2y − 3y1, subject to the initial conditions y(0) = 0 and y1(0) = 0. Choose h = 0.1.
7.13 Express the following third-order differential equation as a set of first-order differential equations:

d³y/dx³ + 4y (dy/dx) + sin y = e^x.

What is f(x, y, y1, y2)?
7.14 The set of first-order differential equations in Problem 7.12 can be
transformed into a second-order differential equation in terms of x
and y. Find this equation and give the initial conditions. If familiar
with differential equations, find the analytical solution.
7.15 Use the program in Fig. 7.8 to find y(1.5) for

d²y/dx² + 2 (dy/dx) + 2y = 0

subject to the initial conditions y(0) = 0, y'(0) = 1. First let h = 0.3, and then let h = 0.1.

7.16 The differential equation in the above problem is said to be homogeneous, because the right side is zero. Re-solve the problem for the right side instead equal to 3. For initial conditions use y(0) = 1.5, y'(0) = 1.
7.17 Find y(1.2) for the following nonhomogeneous differential equation:

d²y/dx² + 3 (dy/dx) + 2y = 3.

Use h = 0.1, y(0) = 1, y'(0) = 0.
7.18 Legendre's differential equation is

(1 − x²) d²y/dx² − 2x (dy/dx) + n(n + 1)y = 0.

Find y(0.1) for n = 5 and the initial conditions y(0) = 0, y'(0) = 1.875. First let h = 0.01, and then let h = 0.001. For h = 0.001 modify the program so that the output is printed only for x = 0.1. Compare with the analytical answer, which is

y(x) = (63/8)x⁵ − (35/4)x³ + (15/8)x.

7.19 Find y(0.5) for the following fourth-order second-degree differential equation:

(d⁴y/dx⁴)² − x(d³y/dx³)² + d²y/dx² + (dy/dx)² − xy² = x + 1 − sin x.

Use the initial conditions y(0) = 0, y'(0) = 1, y''(0) = 0, y'''(0) = −1, and choose h = 0.1.
7.20 An example of a stiff differential equation is

d²y/dx² + 100 (dy/dx) + y = 2.

(a) For the initial conditions y(0) = 0 and y'(0) = 100, show that

y = 2 − e^(−0.01x) − e^(−99.99x).

(b) Examine the solution where the small time constant predominates by evaluating the solution in (a) for x = 0.001, 0.01, 0.02, and 0.05.
(c) Examine the solution where the large time constant predominates by evaluating the solution in (a) for x = 5, 50, and 300.
(d) Use the program of Fig. 7.8 to find y(0.05), and compare with the answer found in (b). Use h = 0.001.
(e) In the region where the large time constant predominates, one would like to use a large step size. Try the program of Fig. 7.8 for this differential equation with y(1) = 1, y'(1) = 0, and h = 0.1. Explain what happened.
Chapter Eight

Introduction to Optimization Theory

8.1 PRELIMINARY REMARKS

In our everyday experiences we often consciously or subconsciously try to optimize results. Some of the problems we encounter are rather trivial, such as finding the shortest path between two points: a straight line.
such as finding the shortest path between two points: a straight line.
However, if the problem is to reach the airport in a minimum amount of
time so as not to miss a flight, then the optimum solution is not so easily
found. Perhaps the geometrically shortest route goes through the city and
is therefore rather slow. If one decides to drive on the freeway to the
airport, other choices will have to be made: for example the speed (we do
not want to get a ticket or get in an accident), the exit to use, etc.
Optimization problems become increasingly difficult to solve as the
number of variables is increased. In practical applications there may be
many parameters that can be varied independently to improve results.
However, while optimization problems may be involved and thus require
the use of a computer, the benefits of using optimization theory can be
great.
Optimization theory has been used to great advantage in electrical
engineering. For example, an electrical network may contain many resis-
tors, inductors, and capacitors. These elements, which may be considered
as the independent variables, can be adjusted by optimization techniques
so that a desired electrical response is obtained. An example in Section 8.6
illustrates one application of this approach.
Optimization theory has influenced the design of many different types of
manufacturing processes. A chemical product, for instance, may contain
variable proportions of ingredients, may be subjected to various tempera-
tures for different amounts of time, etc. It may be possible to write
functional relations for these different parameters, but it will probably be
impossible to obtain an analytical answer for the optimum result. How-
ever, often the problem can be formulated in a manner that allows iterative
methods to adjust the parameters and obtain an optimum solution. These
optimization techniques can thus be viewed as a collection of numerical
methods that have been linked together in a specific way.

In the next section, it is shown that optimization can correspond to minimizing a function, which will be termed the error function E. If the error E is a function of only one variable x, then we will assume it can be expanded in terms of a Taylor series as

E(x + δx) = E(x) + (dE/dx) δx + (d²E/dx²) (δx)²/2 + ···.    (8.1)

In an introductory calculus course, an equation such as this could be used to prove that at a minimum the first derivative dE/dx must be zero and the second derivative d²E/dx² must be positive.
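As a quick check, consider the one-variable error function E(x) = (x − 2)². Here dE/dx = 2(x − 2), which is zero at x = 2, and d²E/dx² = 2 > 0, so x = 2 is indeed a minimum, in agreement with these conditions.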
In optimization problems, the error E may be influenced by many variables x1, x2, ..., xn and thus will often be written as E(x), where x is a vector having components x1, x2, ..., xn. As mentioned in Section 9.6, the equation that corresponds to (8.1) for this general case is

E(x + δx) = E(x) + ∇Eᵀ δx + ½ δxᵀ H δx + ···.    (8.2)

The symbol ∇E represents the gradient, which is defined in Section 9.3. The gradient is the n-dimensional analogue of the first derivative. In fact, it is shown in Section 9.3 that the gradient must be zero at a minimum.
The symbol H represents a matrix termed the Hessian, which is defined in Section 9.6. The Hessian is the n-dimensional analogue of the second derivative. In fact, by analogy it can be shown that at a minimum we must have δxᵀHδx positive for any nonzero vector δx. A matrix such as this is said to be positive definite.
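For example, for E(x) = x1² + x2² (a function examined again in Example 9.2) the Hessian has the constant entries ∂²E/∂x1² = ∂²E/∂x2² = 2 and ∂²E/∂x1∂x2 = 0, so that

δxᵀ H δx = 2δx1² + 2δx2²,

which is positive for every nonzero δx; this H is therefore positive definite.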
Four different optimization techniques are discussed in detail in this
book: simplex, steepest descent, Fletcher-Powell, and least pth. After
comparing optimization techniques in general in Section 8.3, the simplex
algorithm is described in detail in Section 8.4. Simplex was chosen for this
introductory chapter because it is unsophisticated yet can still be a
powerful optimization program. At the end of the chapter, the simplex
algorithm is applied to various test functions such as the Rosenbrock
function.
Another simple optimization technique is the steepest-descent method,
which searches for the minimum by proceeding in the direction of the
negative gradient. The Fletcher-Powell optimization technique modifies
this direction slightly by using information that is contained in the Hessian
matrix. The least-pth optimization procedure (a generalization of least
squares) also calculates the gradient to help determine the search direction,
but it calculates this very efficiently by assuming that the error function is
of a special form.
The reader may be wondering why the text discusses four different optimization techniques, when any one theoretically can be used to minimize a function. A major reason is that often one technique is not able to find a minimum, so another must be tried. Which program is best will often depend on the particular error function that must be minimized. In fact, as shown in Section 9.5, even scaling the parameters (e.g., replacing x1 by 10x1) can drastically affect the rate of convergence.
When an optimization program finds a minimum, the user may not be
happy with the minimum that is found. One reason (see Section 9.5) may
be that the point found was only a local minimum; that is, if different
initial parameter values had been chosen, then an even smaller value could
have been obtained. Another reason may be that the parameter values at
the minimum were unrealistic (e.g., if the parameters represent dimensions
of an object, then the "optimum" product may be too large to manufac-
ture). Chapter 11 indicates how parameter values can be constrained to
have practical values.
Before proceeding with this introduction to optimization theory, a warn-
ing should be given about the optimization programs that are included in
the rest of this text. As was true for the previous programs, these optimiza-
tion programs have been written with clarity as the first goal and not
efficiency. (In an optimization program, it can be worthwhile to save a few
milliseconds of computation time because the computations may be done
many times before the optimization loop is completed.) While it should be
acknowledged that these programs are not sophisticated, it also should be
emphasized that they can be used to solve practical optimization problems, but interaction might be required in certain cases. This interaction could
consist of choosing a better set of initial parameters; or if a particular
algorithm (e.g., steepest descent) does not perform satisfactorily for a
specific problem, then another algorithm (e.g., least pth) could be tried.

8.2 FORMULATION OF OPTIMIZATION PROBLEMS

Before optimization techniques can be applied to a specific problem, it must be stated in terms of mathematical relations. We will assume the problem can be formulated so that we wish to minimize an error function¹ E. The error function E may depend on many variables x1, x2, ..., xn and thus will often be written as E(x), where x is a vector having components x1, x2, ..., xn.
The fact that the optimization techniques we study are formulated so as
to minimize functions is not so restrictive as it might seem. If instead we
want to maximize a function F(x), we can let E(x) = −F(x) and minimize E(x).

¹This is often called the objective function.



Often, we will not want to minimize an error, but instead minimize the magnitude of the error. That is, a large negative error may be as objectionable as a large positive error. As an example of this, assume we want to find a solution of the nonlinear equation ln x = x. If we define an error function as e(x) = ln x − x, then a solution of the nonlinear equation can be found by adjusting x so that the error function e(x) becomes zero. If we want to make a function e(x) become zero, we can instead minimize the square of that function, i.e.,

E = [e(x)]².    (8.3)
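In FORTRAN, the language of the programs in this text, (8.3) for this example could be coded as a one-statement error function. The following is a minimal sketch, patterned after the ERROR functions expected by the simplex program of Section 8.4 (Fig. 8.3):

      FUNCTION ERROR(X)
      DIMENSION X(10)
C     SQUARED ERROR OF EQ. 8.3 FOR THE EQUATION LN X = X
C     ONLY X(1) IS USED, SINCE E DEPENDS ON A SINGLE PARAMETER
      ERROR=(ALOG(X(1))-X(1))**2
      RETURN
      END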
This idea can be generalized in numerous ways; for example, the function we want to make zero may depend on many parameters x1, x2, ..., xn. Then we write

E(x) = [e(x)]².    (8.4)

Or we may have many functions e1(x), e2(x), ..., em(x) we wish to simultaneously set equal to zero; then we can define the error function as

E(x) = Σ_{i=1}^{m} [e_i(x)]².    (8.5)

EXAMPLE 8.1

A set of nonlinear equations can be solved by using an error function of the type in (8.5). As a simple case, consider

x1² + 3 cos x2 = 2,
cos x1 + 2 x1 x2 = 4.

For this case, if we define the individual errors as

e1(x1, x2) = x1² + 3 cos x2 − 2,
e2(x1, x2) = cos x1 + 2 x1 x2 − 4,

then minimizing the following error function will produce a solution:

E(x) = e1(x1, x2)² + e2(x1, x2)².

The exponent in (8.5) was picked as 2 so that the contribution of each individual error was always positive. In general, any even positive number p could have been used:

E(x) = Σ_{i=1}^{m} [e_i(x)]^p.    (8.6)

This is called the least-pth error criterion, while the special case of p = 2 is called the least-squares error criterion.
In a practical problem, some of the individual errors may be more important than others. In this case we can assign nonnegative weighting factors wi so that

E(x) = Σ_{i=1}^{m} w_i [e_i(x)]^p.    (8.7)

The more important an error is, the larger is the weight we assign it. Of course, we could make the substitution w_i [e_i(x)]^p → [ē_i(x)]^p, where ē_i = w_i^(1/p) e_i, and transform the weighted problem into a nonweighted one.
In some problems a requirement² f(t) may be given and we may wish to adjust the parameters x1, x2, ..., xn so that the response h(t) of a system is as close as possible to f(t). For this case we could define an error as

e(t) = f(t) − h(t).    (8.8)

However, a computer can only treat discrete functions, so we will instead define

e_i = f_i − h_i,    (8.9)

where the subscript i indicates that the functions are evaluated at t = t_i. For this case, (8.6) can be rewritten as

E(x) = Σ_{i=1}^{m} [f_i − h_i(x)]^p.    (8.10)
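As an illustration of (8.10) with p = 2, the following minimal sketch compares a sampled requirement F(I) with the response of a hypothetical two-parameter model h(t) = x1 e^(−x2 t); the model, sample times, and requirement values here are invented solely for the illustration:

      FUNCTION ERROR(X)
C     LEAST-SQUARES MATCH (EQ. 8.10 WITH P=2) OF A SAMPLED
C     REQUIREMENT F(I) BY THE MODEL RESPONSE X(1)*EXP(-X(2)*T)
C     (THE MODEL AND DATA ARE HYPOTHETICAL)
      DIMENSION X(10),F(5),T(5)
      DATA F/1.,.6,.4,.25,.15/
      DATA T/0.,.5,1.,1.5,2./
      ERROR=0.
      DO 10 I=1,5
      H=X(1)*EXP(-X(2)*T(I))
   10 ERROR=ERROR+(F(I)-H)**2
      RETURN
      END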

8.3 OVERVIEW OF VARIOUS OPTIMIZATION TECHNIQUES³

Just as there are various numerical methods for finding the roots of an
equation, there are different optimization techniques that can be used to
minimize the error function E(x). Which is the best will depend on the
specific application. However, it is possible to divide optimization tech-
niques into certain categories and then to compare the categories. This
section will give a brief overview of various optimization techniques, and
the next section will give a more detailed presentation of a specific one.
The following two chapters will give detailed descriptions of additional
optimization techniques.

²The variable t can be considered to be time in this discussion, but of course it could have a different meaning.
³This is essentially the same as Section 13.4 of Daniels (1974).

1. Simple Search Methods (Nonderivative)


The sophisticated optimization techniques evaluate the first and/or higher derivatives; however, there are some procedures that do not require derivatives. Perhaps one of the simplest ways to minimize the error function is to vary one available parameter at a time: for example, first minimize the error with respect to the parameter x1, then with respect to x2, and so on; after doing this for all parameters, one would start over again at x1. Since each step in this minimization procedure is a single-parameter search, the quadratic interpolation method may be used to find the minimum, as sketched below.
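The outer structure of such a one-parameter-at-a-time search might look as follows. Here XMIN1 is a hypothetical single-parameter minimizer (it could, for instance, be built from the quadratic interpolation method of Chapter 3) that returns the value of the Ith component minimizing the error while the other components are held fixed:

C     ONE-PARAMETER-AT-A-TIME SEARCH (SKETCH)
C     XMIN1 IS A HYPOTHETICAL SINGLE-PARAMETER MINIMIZER
      DO 20 ICYCLE=1,NCYCLE
      DO 10 I=1,N
   10 X(I)=XMIN1(X,I,N)
   20 CONTINUE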
The simplex method can be used for multiparameter searches, that is, searches in which one does not examine the behavior of the error function for one parameter at a time. In n-dimensional space, a set of n + 1 points forms a simplex. In the simplex method we start with an arbitrary simplex and then let it tumble and shrink toward the region where the error is a minimum. The direction in which we allow the simplex to move is determined by evaluating the error function at each of the n + 1 points of the simplex.

2. Slope-Following Methods
Slope-following methods evaluate the first derivatives of the error function (∂E/∂x_i) and use this information to indicate how the parameters should be changed in order to minimize the error. The first derivatives
determine the gradient of the error function. The gradient points in the
direction of the greatest change in error; thus, to minimize the error one
proceeds in the direction opposite to the gradient. This is the basis of the
steepest-descent method, which uses the gradient to predict parameter
changes for error minimization. Steepest descent can be considered to be
an attempt to change parameter values so as to proceed down an error
slope most rapidly. Assuming that the error function E(x) has the special
form given in (8.6) leads to the least-pth optimization technique, which is a
generalization of the least-squares method.

3. Second-Order Methods
The slope-following methods tend to reduce the error rapidly in the
initial stages of an optimization procedure; however, their convergence is
rather slow as the minimum is approached. To improve the rate of
convergence, one can use not only the first derivatives of the error function (∂E/∂x_i), but also the second derivatives (∂²E/∂x_i∂x_j). Just as the first derivatives determined the gradient of the error function, the second derivatives determine the Hessian.

The various second-order methods differ mainly in the way they try to
approximate the second derivatives. The second derivatives are not usually
found by using a perturbation scheme based on varying elements; they are
instead approximated by using knowledge of the error function and its
gradient at previous iterations. The Fletcher-Powell minimization proce-
dure is one of the best known second-order methods.

8.4 THE SIMPLEX OPTIMIZATION TECHNIQUE⁴

As previously mentioned, we are trying to minimize an error function E(x) which is a function of n variables x1, x2, ..., xn. Because x is an n-dimensional vector, we are led to the concept of an n-dimensional space. Even though our imagination may balk at something larger than a three-dimensional space, our mathematical symbols will not.
The simplex method of nonlinear programming begins by choosing n + 1 parameter vectors to span an n-dimensional space. The geometric figure which is formed by these points is called a simplex; hence the name of the method. In particular, a two-dimensional simplex is a triangle and a three-dimensional simplex is a tetrahedron. In the simplex method we will start with an initial simplex and then cause it to approach the region where the error is a minimum.
In any optimization technique an initial guess must be made for the variables x1, x2, ..., xn. The optimization technique is then supposed to adjust the parameters to minimize E(x). In the simplex method we will identify this initial guess as point P0 in the n-dimensional space. That is,

P0 = (x1, x2, x3, ..., xn).    (8.11)

We will use this initial point to generate other points P1, P2, ..., Pn. These points will define the initial simplex.
Point P1 will be arbitrarily related to P0 via

P1 = (1.1 x1, x2, x3, ..., xn),    (8.12)

and in general

Pi = (x1, x2, ..., 1.1 xi, ..., xn),    (8.13)

where i = 1, 2, ..., n.

⁴This is not related to Dantzig's simplex algorithm of linear programming. The nonlinear simplex method described here was originally proposed by Nelder, J. A., and Mead, R. (Jan. 1965), "A Simplex Method for Function Minimization", Comput. J., pp. 308-313. The values for α, β, and γ in this section were suggested by P. E. Fleischer and D. M. Bohling (private correspondence).

The iterative part of the simplex algorithm begins by ordering the points P0, P1, ..., Pn according to the values of E(P0), E(P1), ..., E(Pn). The points that yield the lowest, highest, and next highest values of the error function are identified as PL, PH, and PNH. The simplex algorithm repetitively replaces PH, the highest point. It will be illuminating to adopt a geometric language in a description of the algorithm. We will picture the simplex as moving towards a minimum by a series of operations termed reflections, contractions, and expansions. These are illustrated in Fig. 8.1 for the case of a two-dimensional simplex.

1. Reflection
The first attempt at replacing PH is by reflection about the centroid, defined as

C = (1/n) Σ_{i=0, i≠H}^{n} Pi.    (8.14)

That is, the centroid is the average of all the points except PH, which is going to be replaced shortly.

EXAMPLE 8.2

FIGURE 8.1. Illustration of reflection, contraction, and expansion for a two-dimensional simplex (the figure marks the reflection, contraction, and expansion points relative to P0 = PH).

The error function for this example is E(x) = 2x1 − x2. If the initial point is chosen as P0 = (2, 1):

(a) Find P1 and P2.
(b) What point is PH?
(c) Find the centroid.

SOLUTION

(a) Applying (8.13) yields P1 = (2.2, 1) and P2 = (2, 1.1).
(b) To determine which point has the highest error we must evaluate E(x) for each of the points:

E(P0) = E(2, 1) = 3,
E(P1) = E(2.2, 1) = 3.4,
E(P2) = E(2, 1.1) = 2.9;

thus PH = P1.
(c) Applying (8.14) with H = 1 yields

C = ½(P0 + P2) = ½(2 + 2, 1 + 1.1) = ½(4, 2.1) = (2, 1.05).

The reflected point is defined by

PR = (1 + α)C − αPH.    (8.15)

If α = 1, then the reflected point is the same distance from the centroid as is PH, but on the opposite side (see Fig. 8.1 or Problem 8.10). However, this value will be modified slightly to α = 0.9985 to prevent any of the computed parameters from becoming zero, which could cause computational difficulties.

EXAMPLE 8.3

This example will demonstrate reflection for the data given in Example 8.2. The parameter α will be chosen as unity to simplify computations.
For α = 1, (8.15) becomes PR = 2C − PH. Substituting the results from the previous example,

PR = 2(2, 1.05) − (2.2, 1) = (1.8, 1.1).

If the reflection is moderately successful, namely if

E(PL) < E(PR) < E(PH),    (8.16)

then PR replaces PH, thereby forming a new simplex. Another reflection is then attempted for this new simplex.

The reflected point may not be described by (8.16), for one of two possible reasons. One of the possibilities can be illustrated by calculating the error function for the reflected point of Example 8.3. Since for those data the error function was defined as E(x) = 2x1 − x2, it follows that for PR = (1.8, 1.1), E(PR) = 2.5. From Example 8.2, the error at the low point was E(PL) = 2.9; thus E(PR) < E(PL). The other reason the reflected point may not be described by (8.16) is that in some cases E(PR) > E(PH).

2. Expansion
If the reflection operation is highly successful and produces a new minimum, so that E(PR) < E(PL), then it seems likely we are proceeding in a good direction, so we will go further in that direction by expanding according to the expression

PEx = βPR + (1 − β)C.    (8.17)

If β = 2, then the expanded point will be twice as far from the centroid as is the reflected point (see Fig. 8.1 or Problem 8.10). However, this value will be modified slightly to 1.95 to prevent possible instabilities. Depending on whether E(PR) or E(PEx) is smaller, either PR or PEx replaces PH. A new centroid is then calculated and the reflection process is again attempted.

3. Contraction
If the reflection operation is not successful, but results in E(PR) > E(PH), then it seems likely we are searching on the wrong side of the centroid, so contraction is performed according to

PC = (1 − γ)C + γPH.    (8.18)

If γ = ½, then the contracted point is midway between PH and the centroid (see Fig. 8.1). This value of γ will be modified slightly to 0.4985 for practical reasons. If contraction is successful and E(PC) < E(PH), then the result is the same as if reflection had been successful, and we proceed accordingly.
Contraction is usually successful; however, if it is not and E(PC) > E(PH), then drastic measures are taken.

4. Scaling
In the unlikely event that neither reflecting nor contracting can find a better point than PH, then the simplex itself is modified by a scaling process described by

Pi + k(PL − Pi) → Pi,    i = 0, 1, ..., n.    (8.19)

A two-dimensional representation of the scaling process for k = 0.5 is shown in Fig. 8.2. As demonstrated there, if k = 0.5, then the scaling process moves every point towards the best point PL; in fact, the distance is halved. In general, if k is positive (and also less than unity), then the scaling process of (8.19) causes the simplex to shrink about the point PL.
On the other hand, if k is negative, then every point moves away from PL and the simplex increases in size. For example, if k = −1, then the distance between PL and a point Pi doubles. Suggested values for the scale factor k are 0.5 to reduce the size of the simplex and −1 to increase it.
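The doubling for k = −1 follows directly from (8.19): each point is replaced by

Pi + (−1)(PL − Pi) = 2Pi − PL,

which lies on the line through PL and Pi, twice as far from PL as the original point.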
A program for the simplex algorithm is shown in Fig. 8.3. The notation in the program is very similar to that used in the preceding discussion, but one comment should be made. As usual, because FORTRAN does not allow a subscript to be zero, the subscripts have been translated by unity in this program. For example, the initial point is identified as P(1) and not P(0).
Any point in the simplex has n components; since there are n + 1 points in the simplex, this implies there is a total of n(n + 1) quantities that describe the simplex. The most convenient way to store this information is in a matrix. In the program, row 1 of the matrix represents P1, row 2 represents P2, and in general row i represents Pi. The matrix is denoted by P in the program, so the general term P(i,j) in the matrix represents the jth component of the ith simplex point Pi.
In the program L, H, and NH represent lowest, highest, and next highest. For example, if L = 4, then P4 is the simplex point that has the lowest error value and this error value is E4. It should be noted that extra memory locations are not needed for the lowest point PL or its error EL. Instead, the memory location L is used to indicate which simplex point produces the lowest error. The error at point Pi is stored as E(i); thus the lowest,
FIGURE 8.2. Illustration of scaling for a two-dimensional simplex.



01C SIMPLEX
02 PROGRAM S(INPUT,OUTPUT)
03 03 FORMAT(7H X(I)=,1P,4E14.5)
04 04 FORMAT(7H ERROR=,1P,E14.5,/)
06 DIMENSION C(10),E(10),P(10,10),R(10),X(10)
07 INTEGER H
08 REAL K
09C
10C INITIAL VALUES
11 PRINT,*N*, $READ,N
12 N1=N+1
14 PRINT,/,*X(I) I=1,2,...N*
15 READ,(X(I),I=1,N)
16 E(1)=ERROR(X)
17 PRINT 4,E(1)
19C
20C INITIALIZE SIMPLEX (EQ. 8.13)
21 DO 22 J=1,N
22 22 P(1,J)=X(J)
24 DO 28 I=2,N1
25 DO 26 J=1,N
26 26 P(I,J)=X(J)
27 P(I,I-1)=1.1*X(I-1)
28 28 IF(ABS(X(I-1)).LT.1E-12) P(I,I-1)=.0001
29C
30C FIND PL,PH
31 31 L=H=1
32 DO 38 I=1,N1
34 DO 35 J=1,N
35 35 X(J)=P(I,J)
36 E(I)=ERROR(X)
37 IF(E(I).LT.E(L)) L=I
38 38 IF(E(I).GT.E(H)) H=I
39C
40C FIND PNH
41 41 NH=L
42 DO 43 I=1,N1
43 43 IF(E(I).GE.E(NH).AND.I.NE.H) NH=I
49C
50C CALCULATE CENTROID (EQ. 8.14)
51 DO 56 J=1,N
52 C(J)=-P(H,J)
53 DO 54 I=1,N1
54 54 C(J)=C(J)+P(I,J)
56 56 C(J)=C(J)/N
59C
60C REFLECT (EQ. 8.15)
61 61 DO 62 J=1,N
62 62 R(J)=1.9985*C(J)-.9985*P(H,J)
64 ER=ERROR(R)
(a)

FIGURE 8.3. A program for simplex.



70C REFLECT AGAIN (IF MODERATELY SUCCESSFUL)


71 IF(ER.LT.E(L)) GO TO 91
73 IF(ER.GE.E(H)) GO TO 122
79 79 DO 80 J=1,N
80 80 P(H,J)=R(J)
81 E(H)=ER
83 IF(ER.GT.E(NH)) GO TO 61
85 H=NH
86 GO TO 41
89C
90C EXPAND (EQ. 8.17)
91 91 L=H $H=NH
92 DO 93 J=1,N
93 93 X(J)=1.95*R(J)-.95*C(J)
94 EX=ERROR(X)
96 IF(EX.LT.ER) GO TO 104
98 DO 99 J=1,N
99 99 P(L,J)=R(J)
100 E(L)=ER
101 GO TO 110
104 104 DO 105 J=1,N
105 105 P(L,J)=X(J)
106 E(L)=EX
109C
110 110 PRINT 3,(P(L,J),J=1,N)
114 PRINT 4,E(L)
117 GO TO 41
119C
120C CONTRACT (EQ. 8.18)
122 122 DO 123 J=1,N
123 123 R(J)=.5015*C(J)+.4985*P(H,J)
124 ER=ERROR(R)
126 IF(ER.LT.E(L)) GO TO 91
128 IF(ER.LT.E(H)) GO TO 79
132C
133C SCALE (EQ. 8.19)
134 PRINT,*K*, $READ,K
136 DO 138 I=1,N1
137 DO 138 J=1,N
138 138 P(I,J)=P(I,J)+K*(P(L,J)-P(I,J))
139 GO TO 31
140 END
897C
898 FUNCTION ERROR(X)
899 DIMENSION X(10)
900 ERROR=100.*(X(1)*X(1)-X(2))**2+(1-X(1))**2
950 RETURN
999 END
(b)

FIGURE 8.3. (Continued.)



highest, and next highest errors would be stored as E(L), E(H), and E(NH).
The location of the centroid C is found by calculating

C = (1/n) Σ_{i=1, i≠H}^{n+1} Pi = (1/n) [ −PH + Σ_{i=1}^{n+1} Pi ].    (8.20)

The second form of the equation is used because it avoids the necessity of checking each point to determine whether or not it is PH.
In the program, the point produced by contraction is identified as PR (i.e., R). Actually this is the value PC that would be calculated by applying (8.18). It is identified in the program as PR (and not as PC) because if contraction is successful we proceed exactly the same as we would have proceeded if reflection had been successful. Thus coding the result of contraction as PR eliminates an unnecessary substitution (i.e., it eliminates PC → PR).

8.5 APPLICATIONS OF SIMPLEX

This section discusses some applications of the simplex program to serve as an introduction to the use of an optimization program. Compared with most optimization problems, these examples are quite simple, but they should thus be relatively easy to understand.

EXAMPLE 8.4

This uses the simplex algorithm to solve the following nonlinear equations, which were first presented in Example 8.1:

x1² + 3 cos x2 = 2,
cos x1 + 2 x1 x2 = 4.

That example derived a suitable form for the error function, which can be coded in statements 900 and 910 as

900 ERROR=(X(1)*X(1)+3.*COS(X(2))-2.)**2
910 ERROR=ERROR+(COS(X(1))+2.*X(1)*X(2)-4.)**2

In this example, each contribution to the error function was given a separate statement, which makes for easier reading.
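For reference, here is a sketch of the complete function as it might replace the ERROR function at the end of Fig. 8.3; only the two numbered statements come from the text, and the rest follows the conventions of that listing:

      FUNCTION ERROR(X)
      DIMENSION X(10)
C     SUM OF SQUARED ERRORS FOR THE TWO NONLINEAR EQUATIONS
  900 ERROR=(X(1)*X(1)+3.*COS(X(2))-2.)**2
  910 ERROR=ERROR+(COS(X(1))+2.*X(1)*X(2)-4.)**2
      RETURN
      END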
Figure 8.4 indicates the output of the simplex program for this example. The top of the printout indicates that for the initial set of parameters x1 = 1, x2 = 1 the error function was 2.52. After a few iterations, the simplex algorithm indicated that a better set of parameters would be x1 = 1.26222, x2 = 1.44424, for which the error function was 0.00333.

N ? 2

X(I) I=1,2,...N
? 1 1
ERROR= 2.51624E+00

X(I)= 1.14735E+00 1.14735E+00
ERROR= 1.21597E+00

X(I)= 1.02242E+00 1.36449E+00
ERROR= 5.89749E-01

X(I)= 1.16965E+00 1.41160E+00
ERROR= 1.18917E-01

X(I)= 1.31655E+00 1.45862E+00
ERROR= 1.32776E-02

X(I)= 1.28842E+00 1.42986E+00
ERROR= 7.99136E-03

X(I)= 1.26222E+00 1.44424E+00
ERROR= 3.33276E-03

FIGURE 8.4. The solution of two nonlinear equations by simplex.

In most optimization problems exact knowledge about the final value of an error function will not be very enlightening. The fact that the final error function in Example 8.4 was 0.00333 does not convey much information. However, comparing this value with the initial error value, which was 2.52, we can note that the final error function is about a factor of 1000 smaller. This gives us some confidence that the final set of values for x1, x2 is much better than the initial set.
Much more insight into the accuracy of the optimization result may be obtained by returning to the set of equations we were trying to solve. Consider the first one: x1² + 3 cos x2 = 2. If x1 = 1.2622, x2 = 1.4442 is substituted into the left side, the number 1.972 results instead of 2. Similarly, substituting into the second equation yields 3.949 instead of 4.
Of course, improved accuracy could be obtained by using more iterations of the simplex algorithm. When one is finally satisfied that a set of parameter values is close enough to the optimum answer, the number of important digits in the parameter values should be determined. The mere fact that a computer has been programmed to print out six significant figures certainly does not imply that they are all important. For example, changing the solution from x1 = 1.2622, x2 = 1.4442 in the previous problem to x1 = 1.26, x2 = 1.44 changes the left sides of the equations to 1.979 and 3.935, which are probably still sufficiently close to the desired values of 2 and 4.
In a practical optimization problem, the accuracy of a parameter value may indicate the tolerance of a component that is used. For example, if x represents the value of a resistor in an electrical network, then specifying four significant figures may imply that a component with a tolerance of 0.1% should be used. Since tighter tolerances usually result in increased cost, an effort should be made to ascertain how many digits in the optimized parameter values are really important.

EXAMPLE 8.5

Figure 8.5 demonstrates the minimization of

E(x) = 2 sin x1 cos x2 + (cos x1)²(sin x2)⁴.

For the initial set of parameters x1 = 1, x2 = 1 the error function was 1.056. After many iterations the parameters were adjusted to x1 = 1.56, x2 = 3.12, which yielded an error function of −2.00.

It should be noted that the minimum that was found in Example 8.5 was a negative number. The simplex algorithm is a general procedure that minimizes a function of several variables. Often the function E(x) is formulated as a sum of squares [see (8.5)], so that the minimum will be positive; but as this example demonstrates, that is not necessary.
The processes of reflection, expansion, and contraction were used in the two previous examples. However, it was never necessary to scale the entire simplex. In fact this is generally true; the scale operation in simplex is not used often. The following example was especially concocted so that scaling would be necessary. Singularities (places where the function approaches infinity) were put near PR (the reflected point) and PC (the contracted point) so that these points would have large error values.

EXAMPLE 8.6

The error function for this example is E(x1, x2) = (x1 + 5x2)² + 1/(x1 + x2 − 2.04)² + 1/(x1 + x2 − 2.19)². Choosing the initial parameter vector as P0 = (1, 1), the remaining points of the initial simplex are calculated as P1 = (1.1, 1) and P2 = (1, 1.1). For these points the error function is E0 = 688.7, E1 = 438.4, and E2 = 443.5; thus PH = P0.
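For instance, the first of these values follows directly from the definition of the error function:

E0 = (1 + 5·1)² + 1/(1 + 1 − 2.04)² + 1/(1 + 1 − 2.19)² = 36 + 625 + 27.7 ≈ 688.7.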
Applying (8.14) yields that the centroid is C = (1.05, 1.05). Reflecting with the parameter α chosen as unity for simplicity produces PR = (1.1, 1.1), so that the error at the reflected point is ER ≈ 10083. This error is greater than the error at the high point P0, so contraction is attempted.

N ? 2

X(I) I=1,2,...N
? 1 1
ERROR= 1.05566E+00

X(I)= 8.52867E-01 1.29460E+00
ERROR= 7.81547E-01

X(I)= 9.77610E-01 1.38689E+00
ERROR= 5.95195E-01

X(I)= 7.50202E-01 1.80949E+00
ERROR= 1.54609E-01

X(I)= 8.85399E-01 2.18931E+00
ERROR= -7.21224E-01

X(I)= 6.58230E-01 2.61100E+00
ERROR= -1.01418E+00

X(I)= 7.93395E-01 2.98993E+00
ERROR= -1.40886E+00

X(I)= 1.06468E+00 2.75167E+00
ERROR= -1.61306E+00

X(I)= 1.15443E+00 3.16823E+00
ERROR= -1.82848E+00

X(I)= 1.42524E+00 2.93001E+00
ERROR= -1.93468E+00

X(I)= 1.51465E+00 3.34612E+00
ERROR= -1.95522E+00

X(I)= 1.53787E+00 3.07897E+00
ERROR= -1.99500E+00

X(I)= 1.57620E+00 3.17595E+00
ERROR= -1.99879E+00

X(I)= 1.56456E+00 3.11956E+00
ERROR= -1.99948E+00

FIGURE 8.5. Demonstration that simplex can produce a minimum that is a negative number.

N ? 2

X(I) I=1,2,...N
? 1 1
ERROR= 6.88701E+02

K ? -1
X(I)= 1.19471E+00 6.10585E-01
ERROR= 4.29520E+01

X(I)= 9.44848E-01 4.26182E-01
ERROR= 1.31858E+01

X(I)= 1.40035E+00 -4.19359E-01
ERROR= 2.06083E+00

X(I)= 1.20316E+00 -4.16675E-01
ERROR= 1.91885E+00

X(I)= 1.20648E+00 -2.31504E-01
ERROR= 1.56140E+00

X(I)= 8.24123E-01 -1.38592E-01
ERROR= 1.00409E+00

X(I)= 4.99346E-01 -1.96285E-01
ERROR= 8.44714E-01

X(I)= -2.09031E-01 9.16705E-02
ERROR= 4.64853E-01

X(I)= -5.32790E-01 3.38486E-02
ERROR= 4.25601E-01

X(I)= -1.23986E+00 3.21415E-01
ERROR= 3.52595E-01

FIGURE 8.6. Illustration of the simplex scale feature.

Contracting with γ = ½ produces PC = (1.025, 1.025), so that the error at the contracted point is EC ≈ 10089, which is still worse than the error at PH. Because all else has failed, scaling must be done.
Figure 8.6 shows the results of using the simplex program. In this computer run the scale factor k was chosen as equal to −1, so the initial simplex was increased in size about the low point PL. Simplex was then able to make the error function arbitrarily close to zero.⁵

⁵If x1 = −5x2, then the error approaches zero as x2 approaches infinity.

8.6 TEST FUNCTIONS

The simplex algorithm that was just described is but one of many
optimization techniques that can be used to minimize an error function
E(x). Numerous techniques have been described in the literature; which is
best depends on the application. However, for a specific problem one can
make meaningful comparisons between different optimization techniques.
Over the years researchers have encountered (or concocted) various
functions that are difficult to minimize. Many of these functions have been
described in the literature and are now commonly used as test functions.
In this section a few of the "classical" test functions will be discussed; then
in the following chapters we will be able to compare different optimization
techniques.
Perhaps the most famous test function is the Rosenbrock function

E(x) = 100(x1² − x2)² + (1 − x1)².    (8.21)

This is the function coded in the ERROR function at the end of the simplex program in Fig. 8.3. That this is a difficult function to optimize can be appreciated by studying Fig. 8.7.⁶ This figure indicates contours of constant E(x). Because the contours are close together it is difficult for an optimization program to search for the minimum.
The amount of computation time required by an optimization technique to reach a minimum can be greatly influenced by the values selected for the initial parameters. For the Rosenbrock function the values customarily chosen are x1 = −1.2 and x2 = 1.
Figure 8.8 shows the output of the simplex program for the Rosenbrock function. For P0 = (−1.2, 1) the initial error was 24.2. After one iteration the parameters were adjusted to x1 = −1.08, x2 = 1.10, which reduced the error to 4.77. However, after this initial reduction in the error function, the simplex algorithm was very slow in approaching the minimum. The path taken due to the simplex optimization is indicated by the dashed line in Fig. 8.7. Following the thin curving contours is time consuming, but depending on how patient one is, the algorithm can get arbitrarily close to the minimum x1 = x2 = 1, at which point E(x) is equal to zero.
A function very similar to the Rosenbrock function is "cube":

E(x) = 100(x1³ − x2)² + (1 − x1)².    (8.22)

This function is plotted in Fig. 8.9, and the simplex optimization results are given in Fig. 8.10. The initial point P0 = (−1.2, 1) was the same as for the Rosenbrock function. The initial error of 749 was worse than the corresponding Rosenbrock error, but the optimal point P = (1, 1) was approached much more quickly.

⁶The dashed line in this figure will be explained later.



FIGURE 8.7. The Rosenbrock function.


Both of the test functions that were just discussed were functions of just two variables. As the number of parameters is increased, an optimization technique may encounter difficulties. A four-parameter function due to Powell is

E(x) = (x1 + 10x2)² + 5(x3 − x4)² + (x2 − 2x3)⁴ + (10x1 − x4)⁴.    (8.23)

Customary initial values for this function are P0 = (3, −1, 0, 1). Choosing a component to be zero causes line 28 in the program to be executed. If this statement had not been included in the program, then the algorithm would never have been able to vary the third component of P0 from zero. The computer output in Fig. 8.11 indicates that the simplex algorithm was able to reduce the error function slowly from an initial value of 707,336 to 90.9 in ten iterations. A sketch of how (8.23) might be coded follows.
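In the style of the ERROR function at the end of Fig. 8.3, Powell's function (8.23) could be supplied as (a minimal sketch):

      FUNCTION ERROR(X)
      DIMENSION X(10)
C     POWELL'S FOUR-PARAMETER TEST FUNCTION (EQ. 8.23)
      ERROR=(X(1)+10.*X(2))**2+5.*(X(3)-X(4))**2
     1 +(X(2)-2.*X(3))**4+(10.*X(1)-X(4))**4
      RETURN
      END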

N ? 2

X(I) I=1,2,...N
? -1.2 1
ERROR= 2.42000E+01

X(I)= -1.08018E+00 1.09992E+00
ERROR= 4.77423E+00

X(I)= -1.08027E+00 1.19978E+00
ERROR= 4.43505E+00

X(I)= -1.07276E+00 1.14058E+00
ERROR= 4.30682E+00

X(I)= -1.05606E+00 1.10709E+00
ERROR= 4.23406E+00

X(I)= -1.03521E+00 1.09768E+00
ERROR= 4.20976E+00

X(I)= -1.00992E+00 1.01948E+00
ERROR= 4.03978E+00

X(I)= -9.30016E-01 8.53469E-01
ERROR= 3.73810E+00

X(I)= -8.02573E-01 6.04726E-01
ERROR= 3.40449E+00

X(I)= -6.41337E-01 3.50679E-01
ERROR= 3.06163E+00

X(I)= -5.48732E-01 2.51302E-01
ERROR= 2.64661E+00

FIGURE 8.8. Minimization of the Rosenbrock function by simplex.

In the next two chapters these three test functions (Rosenbrock, "cube",
and Powell) will be used to compare the efficiency of different optimiza-
tion techniques. This will give insight into how the programs would be
expected to perform on problems similar to these test functions. However,
if one wants to know how different optimization techniques compare for a
particular type of problem, then a problem of that type should be selected.
For an example of a "particular type of problem", the technique of
coefficient matching will be explained. The electrical circuit in Fig. 8.12 is

FIGURE 8.9. The "cube" function.

termed an active filter: it is made up of resistors, capacitors, and an operational amplifier. In a design problem the resistors and capacitors should have their element values chosen to yield a specified response. We will assume for this example that the response should have the form

T(s) = s / (s² + s + 40),    (8.24)

where s is a normalized variable related to the frequency of an electrical signal.
Some of the element values for the circuit can be arbitrarily selected, and the response still made to have the form shown in (8.24). For this

N ? 2

X(I) I=1,2,...N
? -1.2 1
ERROR= 7.49038E+02

X(I)= -9.66351E-01 1.14735E+00
ERROR= 4.24020E+02

X(I)= -8.55709E-01 1.02242E+00
ERROR= 2.75366E+02

X(I)= -3.48384E-01 1.25017E+00
ERROR= 1.68863E+02

X(I)= 1.07280E-01 1.11477E+00
ERROR= 1.24794E+02

X(I)= 1.31085E+00 1.49410E+00
ERROR= 5.76125E+01

X(I)= 1.23542E+00 1.33145E+00
ERROR= 3.07595E+01

X(I)= 9.83418E-01 1.33872E+00
ERROR= 1.50273E+01

X(I)= 1.09020E+00 1.29437E+00
ERROR= 8.32327E-03

X(I)= 1.07322E+00 1.23345E+00
ERROR= 6.07770E-03

X(I)= 1.05875E+00 1.18485E+00
ERROR= 3.83487E-03

FIGURE 8.10. Minimization of "cube" by simplex.

example, the following choices will be made:

Ra = 1000, Rb = 5000, C1 = C2 = 10⁻⁸.    (8.25)

It can then be shown that the response of the circuit is given by

T(s) = a1 s / (s² + b1 s + b0),    (8.26)

N ? 4

X(I) I=1,2,...N
? 3 -1 0 1
ERROR= 7.07336E+05

X(I)= 2.41588E+00 -1.07368E+00 7.36769E-05 1.07368E+00
ERROR= 2.84082E+05

X(I)= 2.03217E+00 -1.01175E+00 1.93810E-05 1.28777E+00
ERROR= 1.31328E+05

X(I)= 1.31910E+00 -1.02041E+00 2.28368E-04 1.23140E+00
ERROR= 2.05456E+04

X(I)= 3.33829E-01 -1.15493E+00 8.58329E-05 1.38571E+00
ERROR= 1.51702E+02

X(I)= 1.57127E-01 -1.07911E+00 1.62529E-04 1.51571E+00
ERROR= 1.25923E+02

X(I)= 3.12163E-01 -1.03953E+00 1.99599E-04 1.51494E+00
ERROR= 1.20975E+02

X(I)= 7.84367E-02 -1.00796E+00 1.84177E-04 1.64235E+00
ERROR= 1.15083E+02

X(I)= 1.70012E-01 -9.83617E-01 2.59339E-04 1.57024E+00
ERROR= 1.06697E+02

X(I)= 2.24448E-01 -9.24915E-01 2.01309E-04 1.71597E+00
ERROR= 9.69756E+01

X(I)= 3.54463E-01 -8.41895E-01 2.58020E-04 1.82053E+00
ERROR= 9.09423E+01

FIGURE 8.11. Minimization of "Powell" by simplex.

FIGURE 8.12. An active filter circuit that is used to illustrate coefficient matching.

where

a1 = 1.2×10⁵ / x2,
b0 = (x2 + x3) × 10¹⁰ / (x1 x2 x3),    (8.27)
b1 = 2×10⁵/x1 − 2×10⁴ (x2 + x3) / (x2 x3).

The coefficient-matching technique consists of adjusting the parameters x1, x2, x3 so that the coefficients have the specified values. For this problem (8.24) implies that the specified values are a1 = 1, b0 = 40, and b1 = 1. Optimization techniques can be applied to this problem if the following error function is defined:

E(x) = (a1 − 1)² + (b0 − 40)² + (b1 − 1)²,    (8.28)

where a1, b0, b1 are as given in (8.27).
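A minimal sketch of how (8.27) and (8.28) might be coded in the ERROR slot of Fig. 8.3 follows. As a check, for the initial values x1 = x2 = x3 = 10⁴ it gives a1 = 12, b0 = 200, and b1 = 16, so that E = 11² + 160² + 15² = 25946, in agreement with the first line of Fig. 8.13:

      FUNCTION ERROR(X)
      DIMENSION X(10)
C     COEFFICIENT MATCHING FOR THE ACTIVE FILTER (EQS. 8.27, 8.28)
      A1=1.2E5/X(2)
      B0=(X(2)+X(3))*1.E10/(X(1)*X(2)*X(3))
      B1=2.E5/X(1)-2.E4*(X(2)+X(3))/(X(2)*X(3))
      ERROR=(A1-1.)**2+(B0-40.)**2+(B1-1.)**2
      RETURN
      END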
N ? 3

X(I) I=1,2,...N
? 10000 10000 10000
ERROR= 2.59460E+04

X(I)= 1.09824E+04 1.09824E+04 1.09824E+04
ERROR= 1.61135E+04

X(I)= 1.32447E+04 1.03152E+04 1.06270E+04
ERROR= 1.10851E+04

X(I)= 1.35027E+04 1.25720E+04 1.12430E+04
ERROR= 7.36973E+03

X(I)= 1.50221E+04 1.12300E+04 1.34720E+04
ERROR= 4.89402E+03

X(I)= 1.96490E+04 1.21319E+04 1.33350E+04
ERROR= 1.72462E+03

X(I)= 2.15355E+04 1.52155E+04 1.66871E+04
ERROR= 4.17308E+02

X(I)= 2.89243E+04 1.34183E+04 2.08358E+04
ERROR= 8.06558E+01

X(I)= 3.17047E+04 1.59435E+04 2.04281E+04
ERROR= 7.48690E+01

X(I)= 3.19244E+04 1.44463E+04 2.12484E+04
ERROR= 7.48069E+01

X(I)= 3.04410E+04 1.35902E+04 1.98408E+04
ERROR= 7.13897E+01

FIGURE 8.13. Coefficient matching by simplex.



The result of applying the simplex algorithm to minimize the error function is shown in Fig. 8.13. For the initial parameters x1 = x2 = x3 = 10000 the optimization program yielded x1 = 30441, x2 = 13590, x3 = 19841 as a better set after ten iterations; but the error function was still quite large.
In conclusion, the results of applying the simplex algorithm to the four
test functions are shown in Figs. 8.8, 8.10, 8.11, and 8.13. Each of these
figures shows ten iterations of the simplex optimization technique. If more
iterations were allowed, the simplex method did converge to the minimum
for each test function. However, the number of iterations was arbitrarily
restricted to ten. This will also be done in later chapters when other
optimization techniques are applied to the same test functions.
M. J. Box (1966) applied the simplex algorithm to many other test functions not described in this chapter. In fact, one of his test functions had twenty variables. He observed that (when compared with other optimization techniques such as Fletcher-Powell⁷) simplex did well for a small number of variables, but for more than three dimensions it was progressively less successful. However, in many applications it has been successfully used even for more than ten variables.

PROBLEMS

8.1 The simplex algorithm described in this chapter minimizes a function E(x). If we instead want to maximize the function f(x) = 2 sin x1 cos x2 + (cos x1)²(sin x2)⁴, how should E(x) be chosen?
8.2 Outline how optimization techniques can be used to solve the follow-
ing set of linear equations:

x+y+z=6, x+2y+z=8, 3x-y+z=4.


8.3 Outline how optimization techniques can be used to solve the follow-
ing set of nonlinear equations:

x² + y² = 13,   y³ − xy = 21.
8.4 If the initial parameters are chosen as x1 = 1, x2 = 2, and x3 = 3, what are the four points that describe the initial simplex?

7See Chapter 9.

8.5 Let the error function for this problem be given as

E(x) = x1² + (2x2 − 3)² + (x3 − x1 + 2)².

If a simplex is defined by

P0 = (0, 0, 0), P1 = (0, −1, 0), P2 = (0, 1, 0), P3 = (1, 1, 1),

what is the centroid?
8.6 If the four points of a simplex produce the error functions E1 = 2, E2 = 5, E3 = 2, and E4 = 5, then it is not obvious which points will be chosen as PL, PH, and PNH. In fact, the choice will depend on how the program is written. For the program in Fig. 8.3, what numbers would be determined for L, H, and NH?
8.7 The four points of a simplex are P0 = (1, 1, 2), P1 = (2, 1, −1), P2 = (0, 1, 2), P3 = (1, 1, 0). Assume that PH = P3.
(a) Find the centroid.
(b) Reflect and find PR (let α = 1).
(c) If reflection is highly successful (i.e., ER < EL), an expansion is attempted. Find PEx (let β = 2).
8.8 The following three points define a triangular simplex: P0 = (1, 0, 4), P1 = (8, 2, 5), P2 = (3, 2, 6). Assume that PH = P1.
(a) Find the centroid.
(b) Reflect, and find PR (let α = 1).
(c) Assume that the reflection in (b) resulted in an error function ER described by ENH < ER < EH. This implies that the highest point PH should be replaced by PR and reflection should be done again. Find the new point produced by reflection.
(d) Assume that the reflection in (c) was unsuccessful (ER > EH). Contract and find PC (let γ = 0.5).
8.9 In this problem the initial simplex is defined by P0 = (0, 2, 4), P1 = (4, 2, 0), P2 = (2, −2, 4). Assume that PL = P1, PH = P0, and therefore PNH = P2. If the result of reflecting (let α = 1) is described by E(P1) < E(PR) < E(P2), what is the new centroid?
8.10 (a) Draw a two-dimensional vector diagram to demonstrate that PR = 2C − PH produces a reflected point that is the same distance from the centroid as is PH, but on the opposite side. (Hint: The drawing should be similar to Fig. 8.2.)
(b) Draw a two-dimensional vector diagram to illustrate PEx = 2PR − C.
(c) Draw a two-dimensional vector diagram to illustrate PC = 0.5C + 0.5PH.

8.11 For this problem the simplex is defined by P0 = (2, 4, 0), P1 = (−2, 4, 2), P2 = (0, 1, 0), P3 = (2, 2, −2). Assume that PL = P1, PH = P3, PNH = P0. If neither reflection nor contraction can produce a point better than PH, then the simplex is scaled.
(a) If k = 0.5, what is the new simplex?
(b) If k = −1, what is the new simplex? For this part, start with the original simplex and not the result of (a).
8.12 If the initial parameters are chosen as x1 = 0, x2 = 0, and x3 = 0, what does the program in Fig. 8.3 find for an initial simplex?
8.13 If the simplex algorithm is applied to the error function

E(x) = (cos x2)² + x1 (sin x2)²,

what will E(x) approach after many iterations?


8.14 Use the simplex program to solve the following set of nonlinear equations:

2x² − 4xy + y² = −7,   x⁴ + xy² − y³ = 7.

Use as the initial set of parameters x = y = 1.
8.15 Solve the following three equations by using the simplex program with the initial parameters x = y = z = 1:

x² + y² + z² = 21,   xyz = −8,   2x + y³ + z² = 22.
8.16 Apply the simplex program to minimize

x² − 4x + y² − 6y + 9.

8.17 Apply the simplex program to maximize

4x²y + 2xy² + 8x + 4y − 4x² − y² − 8xy − x²y².
Chapter Nine

Gradient Techniques

9.1 INTRODUCTION

The simplex algorithm described in the previous chapter was a collection of various rules: to find a minimum, reflection was tried first; if reflection was highly successful, an expansion was attempted; if reflection was only moderately successful, another reflection was performed; etc.
For a particular problem, the rate of convergence of an optimization
technique depends on the initial parameters that are selected. In fact, if a
poor set of initial parameters is used some optimization techniques will
converge slowly so that they are useless. However, the simplex algorithm
does not have this drawback: for any set of initial parameters it will
proceed (although sometimes rather slowly) towards a minimum. Thus, if
another optimization technique encounters convergence problems because
of a poor set of initial parameters, the simplex algorithm can be used to
provide a better set. In fact, practical optimization programs usually
contain more than one optimization technique. This allows the user to first
specify a "slow but sure" algorithm such as simplex and then automatically
switch to a faster technique when sufficiently close to the minimum.
If one is sufficiently near a minimum, there are optimization techniques
that will converge much more rapidly than the simplex algorithm. This
chapter is an introduction to a class of optimization methods that can be
viewed as a combination of two separate processes: first a decision is made as to which direction to proceed in the parameter space, and then a decision is made as to how far to proceed in that direction.
There are many different rules that are used to determine the direction.
In fact, the simplex algorithm could be modified so that it determines the
direction. If Fig. 8.1 is reexamined, we can see that contraction, reflection,
and expansion are all in the same direction from the high point PH. The
simplex algorithm instructs the computer to proceed from PH towards the centroid. How far it proceeds depends on which rule is used: reflection, expansion, or contraction (and, of course, the value of α, β, or γ).

The simplex rules are supposed to cause the parameters to proceed from
the high point PH in a "downhill" direction; that is, towards a minimum.
In this chapter, we will next discuss a technique, steepest descent, that
guarantees the direction is downhill. The Fletcher-Powell optimization
technique modifies the steepest-descent direction slightly and thus may
converge even faster. This method will also be discussed in this chapter.
There are other methods which also modify the steepest-descent direction;
they will not be discussed in detail in this book, but an important
observation should be made: no matter which rule or set of rules used to
determine the "proper" direction, the distance to proceed in that direction
can be determined by the same method. The method commonly used is
given in the next section.

9.2 QUADRATIC INTERPOLATION FOR A SPECIFIC DIRECTION

Many optimization methods seek the minimum of a function by choosing the parameter change δx to point in a certain direction. The direction S that is calculated will vary from method to method. However, once the direction of optimization has been determined, any of the optimization techniques can use the same algorithm to determine the magnitude of the parameter changes.
The following example is included to illustrate the meaning of direction
and magnitude.

EXAMPLE 9.1

Find a vector r = (r1, r2) which is in the same direction as S = (3, 4) and is twice as long (i.e., has twice the magnitude).

SOLUTION

Since r should point in the same direction as S, it follows that r = αS, where α is a constant of proportionality. The fact that the magnitude of r is twice the magnitude of S implies that α = 2, so that

r = αS = 2S = (6, 8).

This section will assume that the direction of the parameter change has been determined and is denoted as S. The magnitude of the parameter changes will then be determined by selecting the proportionality constant in the following equation:

δx = αS.    (9.1)

From this formulation of the problem it can be seen that the distance one goes in a specific direction can be determined by a single-parameter search. That is, the proportionality constant α can be found by using one of the single-parameter minimization methods that was described in Chapter 3. Because the single-parameter search can take a substantial amount of the total computational time, it should be chosen carefully. We will use the quadratic interpolation method.
The quadratic interpolation method can be used to approximate the minimum of the error function E by passing a parabola through three points. If the error function is quadratic, then the minimum of the parabola will also be the minimum of the error function (that is, the minimum in the direction S). However, the error function will not usually be quadratic, so the parabola minimum will just approximate the error minimum.
The rest of this section will be used to explain the statements shown in
Fig. 9.1. These statements will be used in the steepest-descent program to
apply quadratic interpolation. The statements are a generalization of the

49C QUADRATIC INTERPOLATION
50C
51C FIND ALPHA 1
52 A1=1.
53 53 DO 54 I=1,N
54 54 X1(I)=X(I)+A1*S(I)
55 E1=ERROR(X1)
57 IF(E1.LT.E0) GO TO 63
58 A1=.5*A1
59 GO TO 53
61C
62C FIND ALPHA 2
63 63 A2=A1
64 64 A2=2.*A2
65 DO 66 I=1,N
66 66 X2(I)=X(I)+A2*S(I)
67 E2=ERROR(X2)
68 IF(E2.GT.E1) GO TO 74
69 A1=A2 $E1=E2
70 GO TO 64
72C
73C FIND ALPHA
74 74 A=(A1*A1-A2*A2)*E0+A2*A2*E1-A1*A1*E2
76 A=.5*A/((A1-A2)*E0+A2*E1-A1*E2)
FIGURE 9.1. Some statements for quadratic interpolation in a specific direction.

original quadratic-interpolation program that was described in Chapter 3.
At the beginning of Fig. 9.1 it is assumed that the error function E0 has been calculated at the initial point x; thus one of the three points of the parabola is by assumption x, E0. The next point is found by evaluating the error function at x1 = x + α1 S with α1 first set equal to unity. If this error E1 is greater than E0, then α1 is repeatedly reduced by a factor of 2 until E1 < E0. In the original quadratic interpolation program of Fig. 3.5, α1 was repeatedly reduced by a factor of −2 instead of 2. The minus sign was included so that the algorithm would first look for a minimum on one side of x and then on the other. In the application in this chapter we will know that the minimum is in the direction S (not in the direction −S) and thus only need look on one side of x.
The last point x2 is found by doubling the distance from x until E2 > E1. A parabola is then passed through the three points. The minimum of this parabola is given by

xm = x + αS,

where

α = [ (α1² − α2²)E0 + α2²E1 − α1²E2 ] / [ 2((α1 − α2)E0 + α2E1 − α1E2) ].    (9.2)

Summarizing, if the direction S is known, then quadratic interpolation can be used to find the constant of proportionality in δx = αS. The direction S will be related to the gradient of the error function, which is discussed in the next section.
It should be noted that (9.2) only gives the exact minimum for a
quadratic function. Applying quadratic interpolation many times could get
one arbitrarily close to the minimum for the specific direction under
consideration. However, in this chapter quadratic interpolation will be
applied just once for each search direction that is considered. Then a new
"optimum" direction will be determined and a new minimum estimated.
Others have proposed algorithms that find the minimum in each direction
more accurately, but they will not be treated here.
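
To make the line search concrete, here is a minimal Python sketch of the
same logic: halve α₁ until the error decreases, double α₂ until it increases
again, and then apply (9.2). It mirrors the statements of Fig. 9.1, but the
function name quadratic_step and the calling convention are our own
assumptions, and the book's programs are in FORTRAN rather than Python.

def quadratic_step(error, x, S, E0):
    """Quadratic-interpolation line search along direction S.

    error: callable returning the scalar error at a parameter vector
    x:     current parameters (a float or NumPy array), with E0 = error(x)
    Returns the step length alpha given by the parabola through
    (0, E0), (a1, E1), and (a2, E2), i.e., Eq. (9.2).
    """
    a1 = 1.0                        # FIND ALPHA 1: halve until E1 < E0
    E1 = error(x + a1 * S)
    while E1 >= E0:
        a1 *= 0.5
        E1 = error(x + a1 * S)
    a2, E2 = a1, E1                 # FIND ALPHA 2: double until E2 > E1
    while True:
        a2 *= 2.0
        E2 = error(x + a2 * S)
        if E2 > E1:
            break
        a1, E1 = a2, E2
    num = (a1*a1 - a2*a2)*E0 + a2*a2*E1 - a1*a1*E2    # Eq. (9.2)
    den = (a1 - a2)*E0 + a2*E1 - a1*E2
    return 0.5 * num / den

# For instance, minimizing (x-3)^2 along S = 1 from x = 0 gives alpha = 3,
# the exact minimum, since the function is quadratic:
E = lambda x: (x - 3.0)**2
print(quadratic_step(E, 0.0, 1.0, E(0.0)))    # prints 3.0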

9.3 THE GRADIENT

If a function f(x) depends on only one parameter x, then the first
derivative (rate of change) of the function is called its slope. At a maximum
or minimum the slope is zero.¹
¹However, the fact that the slope is zero does not necessarily imply that the function is at a
maximum or minimum; it could be at a saddle point.

The error function E(x) is a function of many parameters. That is, x is a
vector which may be written as

x = (x₁, x₂, ..., xₙ).   (9.3)

For this multidimensional case, the concept of slope can be generalized by
defining the gradient ∇E as

∇E = (∂E/∂x₁, ∂E/∂x₂, ..., ∂E/∂xₙ).   (9.4)

The gradient is an n-dimensional vector, the ith component of which is
obtained by finding the partial derivative of the function with respect to xᵢ.

EXAMPLE 9.2

The function E(x) = x₁² + x₂² can be considered to define a set of circles
centered at the origin. For example, if E(x) = 4, then x₁² + x₂² = 4 is a circle
of radius equal to 2.
By definition, the gradient of E(x) for x = (x₁, x₂) is ∇E =
(∂E/∂x₁, ∂E/∂x₂). For this example

∂E/∂x₁ = 2x₁,   ∂E/∂x₂ = 2x₂.

Thus, the gradient is given by

∇E = (2x₁, 2x₂).

In general, the value of the gradient depends on the coordinates of the
point at which the gradient is evaluated. For example, for this particular
function

∇E|₍₂,₃₎ = (4, 6).

The concept of the gradient will be helpful when we try to find an
expansion for E(x + δx). It will let us discover how to choose δx, the
change in the parameter vector x, so as to minimize the error function E.
This will be done by analogy to the following one-dimensional case.
If the error function is a function of only one variable, then it can be
expanded in terms of a Taylor series as

E(x + δx) = E(x) + (dE/dx) δx + (d²E/dx²) (δx)²/2! + ···.   (9.5)

If δx is small, then this can be approximated as

E(x + δx) ≈ E(x) + (dE/dx) δx.   (9.6)

The approximation in (9.6) can be generalized to n dimensions. For this
case, if the n parameter changes δx₁, δx₂, ..., δxₙ are sufficiently small, then
it follows that

E(x + δx) ≈ E(x) + (∂E/∂x₁) δx₁ + (∂E/∂x₂) δx₂ + ··· + (∂E/∂xₙ) δxₙ.   (9.7)

This relation can be written in an extremely convenient form by recall-
ing that the gradient is a vector which can be written as

∇E = (∂E/∂x₁, ∂E/∂x₂, ..., ∂E/∂xₙ) = [∂E/∂x₁  ∂E/∂x₂  ···  ∂E/∂xₙ]ᵀ.   (9.8)

Here we have used the notation that a vector can be treated as a column
matrix. It follows that

∇Eᵀδx = [∂E/∂x₁  ∂E/∂x₂  ···  ∂E/∂xₙ] [δx₁, δx₂, ..., δxₙ]ᵀ
       = (∂E/∂x₁) δx₁ + (∂E/∂x₂) δx₂ + ··· + (∂E/∂xₙ) δxₙ.   (9.9)

Comparing this with (9.7), it follows that

E(x + δx) ≈ E(x) + ∇Eᵀδx.   (9.10)
The above relation will be very important to us; in fact, it can be
considered to be the foundation for the steepest-descent method which is
described in the next section. However, before discussing the steepest-
descent algorithm we will consider an example for (9.10) and also discuss
the gradient in more detail.
EXAMPLE 9.3
If the function E(x) = x₁² + x₂² is evaluated at x₁ = 2, x₂ = 3, one finds
E(2,3) = 13. If the parameters are changed slightly, choosing δx₁ = 0.2 and
δx₂ = 0.1, then (9.10) can be applied to estimate a new error. In the
following equation, the value of the gradient came from Example 9.2:

E(2.2, 3.1) ≈ E(2,3) + [4  6] [0.2, 0.1]ᵀ = 13 + 0.8 + 0.6 = 14.4.

The exact answer is 14.45.
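
As a quick numerical check of (9.10), the following Python fragment (an
illustrative sketch, not part of the text) reproduces these numbers:

import numpy as np

E = lambda x: x[0]**2 + x[1]**2       # the error function of Example 9.2
x, dx = np.array([2.0, 3.0]), np.array([0.2, 0.1])
grad = 2.0 * x                        # analytic gradient (2x1, 2x2) = (4, 6)
print(E(x) + grad @ dx)               # 14.4, the first-order estimate
print(E(x + dx))                      # 14.45, the exact value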

It was mentioned at the beginning of this section that in the one-dimen-
sional case the slope must be zero at a minimum (or at a maximum).
Similarly, in the n-dimensional case the gradient must be zero at a
minimum (or at a maximum). That this is necessary can be seen by again
examining

E(x + δx) ≈ E(x) + ∇Eᵀδx.   (9.10)

Consider the case of a minimum (a maximum could be treated in the same
manner). If the gradient is not zero, then the parameter change δx can
always be chosen so that ∇Eᵀδx is a negative number. For example, if
∇E = (1, −2), then choosing δx = (−1, 1) yields ∇Eᵀδx = −3. If ∇Eᵀδx
can be made negative, then (9.10) implies that the parameter changes δx
can be selected to reduce E(x). That is, E(x) can always be reduced unless
the gradient is zero, so the gradient must be zero at a minimum.
In the next section we will learn how to iteratively choose the parameter
change δx so as to make the gradient become zero and minimize the error
function E(x). However, before this can be done we must obtain an
approximation that can be used by a computer to calculate the gradient.
The gradient is an n-dimensional vector, and a typical component of it is
given by

∂E/∂xᵢ = lim (δxᵢ → 0)  [E(x + δxᵢ) − E(x)] / δxᵢ,   (9.11)

where δxᵢ is a vector which has all zero components except for the ith
component, which is equal to δxᵢ.
The expression for the partial derivative in (9.11) is very similar to the
expression for the derivative which was given in (3.8). We will evaluate this
partial derivative by analogy with the method that was used to evaluate the
derivative in Newton's method. That is, δxᵢ will be calculated via

δxᵢ = εxᵢ,   (9.12)

where ε is a constant of proportionality. The same value for ε will be used
here as in Chapter 3, namely ε = 10⁻⁶.
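
The gradient-finding loop used by the programs later in this chapter
(lines 22-28 of Figs. 9.3 and 9.12) can be sketched in Python as follows;
the helper name gradient and its argument list are our own assumptions:

import numpy as np

def gradient(error, x, E0, eps=1e-6):
    # Forward-difference gradient, Eqs. (9.11)-(9.12): perturb each
    # parameter by eps*x[i], falling back to eps itself when x[i] is
    # essentially zero (as in statement 24 of Fig. 9.3).
    g = np.zeros(len(x))
    for i in range(len(x)):
        delta = eps * x[i] if abs(x[i]) > 1e-12 else eps
        xp = x.copy()
        xp[i] += delta
        g[i] = (error(xp) - E0) / delta
    return g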

9.4 THE STEEPEST-DESCENT OPTIMIZATION TECHNIQUE

For any optimization procedure, one must make an initial guess x for
the parameter values. Next this guess is modified to x + δx so as to reduce
the error function. How the parameter change δx is calculated depends on
the optimization procedure. In this section we will learn how δx is chosen
for the steepest-descent technique.
Since δx is a vector, it can be described by two quantities: a direction
and a magnitude. In order to choose the direction of δx we will use the
scalar product (dot product) of vector algebra. By definition, the scalar
product of a = (a₁, a₂, ..., aₙ) and b = (b₁, b₂, ..., bₙ) is

aᵀb = a₁b₁ + a₂b₂ + ··· + aₙbₙ.   (9.13)

It can be shown that this definition implies

aᵀb = |a| |b| cos θ,   (9.14)

where |a| is the magnitude of a, |b| is the magnitude of b, and θ is the angle
between the two vectors.

EXAMPLE 9.4

If a = (3,4), choose a unit vector b such that

(a) a₁b₁ + a₂b₂ is a maximum,
(b) a₁b₁ + a₂b₂ is a minimum.

SOLUTION

(a) Comparing (9.13) and (9.14) yields a₁b₁ + a₂b₂ = |a||b|cos θ. This is a
maximum when θ = 0, i.e., when a and b are in the same direction;
thus b = α(3,4). But b is a unit vector, so that

|b| = |(3α, 4α)| = √(9α² + 16α²) = 5α = 1.

Since therefore α = 1/5, it follows that

b = (0.6, 0.8).

(b) For the sum to be a minimum we require cos θ = −1, i.e., b must point
opposite to a. Analogously to the above, we have

b = −(0.6, 0.8).

In order to see how the scalar product can be applied, consider again the
relation

E(x + δx) ≈ E(x) + ∇Eᵀδx.   (9.15)

The last term in this expression can be rewritten as

∇Eᵀδx = (∂E/∂x₁) δx₁ + (∂E/∂x₂) δx₂ + ··· + (∂E/∂xₙ) δxₙ,   (9.16)

which [by comparison with (9.13)] is the scalar product of the gradient ∇E
and the parameter change δx.
Since ∇Eᵀδx is equivalent to the scalar product of ∇E and δx, it
follows from Example 9.4 that a unit vector δx will produce a maximum
change if it points in the direction of the gradient and will produce a
minimum change (i.e., maximum negative change) if it points in the
direction of the negative gradient. The steepest-descent method utilizes this
information by choosing δx to point in the direction of the negative
gradient.
In the steepest-descent optimization technique one first chooses an
initial set of parameters x, which results in an error E(x). Next the gradient
∇E is calculated at point x. The parameter changes are then chosen to be
in the direction of the negative gradient; that is,

δx = −α∇E,   (9.17)

where α is a positive constant. Indeed, with this choice ∇Eᵀδx = −α|∇E|²,
which is negative whenever the gradient is nonzero.
As explained in Section 9.2, the proportionality constant α can be
determined by using quadratic interpolation for a specific direction. In the
steepest-descent optimization technique the direction is that of

S = −∇E.   (9.18)

In other optimization methods (e.g., the Fletcher-Powell method in Section
9.6) the direction may be calculated by an equation different from (9.18),
but once the direction has been obtained, the same quadratic interpolation
subroutine can be used to determine the constant of proportionality α.
Finally, the new parameters can be calculated from the old parameters, α,
and S by

x + αS → x.   (9.19)

EXAMPLE 9.5

A previous example considered the error function E(x) = x₁² + x₂² and
found that at x = (2,3) the error function had the gradient ∇E = (4, 6). In
this example we will consider two different parameter changes, one which
will be in the direction of the negative gradient and one in the direction of
the positive gradient.
For the parameter changes in the direction of the negative gradient,
S = −∇E = (−4, −6). If the constant α is arbitrarily chosen as 0.01, then
(9.19) yields

(2,3) + 0.01(−4, −6) = (1.96, 2.94) = (x₁, x₂).

The error function for this is 12.49, which is smaller than the original error
function E = 13.
If the parameter changes are instead in the direction of the positive
gradient, then S = ∇E = (4, 6). Again choosing α = 0.01 yields

(2,3) + 0.01(4, 6) = (2.04, 3.06).

The error function for this set of parameters is 13.53, which is larger than
the original error function.

FIGURE 9.2. A steepest-descent optimization path for elliptical contours.

Choosing the parameter changes to be in the direction of the gradient
results in a path that is orthogonal² to contours of constant error. This
follows from (9.10), which was

E(x + δx) ≈ E(x) + ∇Eᵀδx.

If δx is chosen to be along a contour of constant error, then E(x + δx) =
E(x), so that ∇Eᵀδx = 0. This implies that ∇E is orthogonal to δx, i.e., the
gradient is orthogonal to contours of constant error.
It is illustrated in Fig. 9.2 that the steepest-descent path is orthogonal to
contours of constant error. The error contours for

E(x) = (x₁ − 1)² + 4(x₂ − 2)²

are ellipses. The steepest-descent path originates on the E(x) = 100 contour
and is orthogonal to that contour. It continues in a straight line until the
E(x) = 50 contour, at which point calculating the gradient again makes the
path orthogonal to the contour. Similarly, each time the gradient is
recalculated at a contour the optimization path becomes orthogonal to that
contour.
²If two vectors are orthogonal, then their scalar product is zero. That is, |a||b|cos θ = 0.
This implies that the angle between a and b is 90°.

01C STEEPEST DESCENT
02 PROGRAM SD(INPUT,OUTPUT)
03 03 FORMAT(7H X(I)=1P,4E14.5)
04 04 FORMAT(7H ERROR=,1P,E14.5,/)
06 DIMENSION E(10),GRAD(10),S(10),X(10),X1(10),X2(10)
09C
10C INITIAL VALUES
11 PRINT,*N*, $READ,N
14 PRINT,/,*X(I) I=1,2,...N*
15 READ,(X(I),I=1,N)
18 E0=ERROR(X)
19 PRINT 4,E0
20C
21C FIND GRADIENT
22 22 DO 28 I=1,N
23 DELTA=.000001*X(I)
24 IF(ABS(X(I)).LT.1E-12) DELTA=.000001
25 XSAVE=X(I)
26 X(I)=X(I)+DELTA
27 GRAD(I)=(ERROR(X)-E0)/DELTA
28 28 X(I)=XSAVE
39C
40C FIND DIRECTION S (EQ. 9.18)
41 DO 44 I=1,N
44 44 S(I)=-GRAD(I)
48C
49C QUADRATIC INTERPOLATION
50C
51C FIND ALPHA 1
52 A1=1.
53 53 DO 54 I=1,N
54 54 X1(I)=X(I)+A1*S(I)
55 E1=ERROR(X1)
57 IF(E1.LT.E0) GO TO 63
58 A1=.5*A1
59 IF(A1.LT.1E-8) STOP
60 GO TO 53
61C
62C FIND ALPHA 2
63 63 A2=A1
64 64 A2=2.*A2
65 DO 66 I=1,N
66 66 X2(I)=X(I)+A2*S(I)
67 E2=ERROR(X2)
68 IF(E2.GT.E1) GO TO 74
69 A1=A2 $E1=E2
70 GO TO 64
72C
73C FIND ALPHA
74 74 A=(A1*A1-A2*A2)*E0+A2*A2*E1-A1*A1*E2
76 A=.5*A/((A1-A2)*E0+A2*E1-A1*E2)
80C
(a)

FIGURE 9.3. A program for steepest descent.



81C FIND NEW X
84 DO 85 I=1,N
85 85 X(I)=X(I)+A*S(I)
86 E0=ERROR(X)
87 IF(E0.LT.E1) GO TO 95
90 DO 91 I=1,N
91 91 X(I)=X(I)+(A1-A)*S(I)
93 E0=ERROR(X)
94C
95 95 PRINT 3,(X(I),I=1,N)
96 PRINT 4,E0
98 GO TO 22
99 END
897C
898 FUNCTION ERROR(X)
899 DIMENSION X(10)
900 ERROR=100.*(X(1)*X(1)-X(2))**2+(1.-X(1))**2
950 RETURN
999 END
(b)

FIGURE 9.3. (Continued.)

A program for the steepest-descent method is given in Fig. 9.3. The first
part of the program indicates how the gradient can be approximated by
applying (9.12) with the increment parameter ε equal to 0.000001. After the
gradient is calculated, the negative gradient is defined to be S. A one-
dimensional search for the minimum is then performed in this direction by
employing quadratic interpolation, which was described in Section 9.2.
This process yields the parameter α, which indicates how far to proceed in
the direction S. The parameter x is then incremented by αS.
The "minimum" that is obtained by the above process will probably not
be the minimum of the error function. This is because (9.15) assumed that
the parameter change Sx is very small (in fact, infinitesimal). At the
beginning of the steepest-descent method, Sx will probably be quite large,
because the initial guess for the parameter x may be poor. However, if the
steepest-descent method is used iteratively, then as the minimum is ap-
proached, the parameter change will become smaller and smaller, so that
the approximation in (9.15) will be very accurate.
Very infrequently it happens that the error function evaluated at x,,,=x
+ aS [where a is given by (9.2)] is larger than the original error function.
This can occur far from a minimum where the assumption of a quadratic3
error function is not sufficiently valid. In this case, line 87 of the steepest-
descent program causes the point x, to be used as the next guess for the
3As a function of a.

location of the minimum, since E(x₁) was found in such a manner that it is
guaranteed to be less than the initial error E₀.
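
Putting these pieces together, a stripped-down Python version of the
Fig. 9.3 loop might read as follows. It reuses the gradient and
quadratic_step sketches given earlier and, for brevity, omits the program's
safeguards (the STOP test on α₁ and the fallback of line 87), so it is an
illustration rather than a translation:

import numpy as np

def steepest_descent(error, x, iterations=10):
    for _ in range(iterations):
        E0 = error(x)
        S = -gradient(error, x, E0)            # direction, Eq. (9.18)
        a = quadratic_step(error, x, S, E0)    # distance, Eq. (9.2)
        x = x + a * S                          # update, Eq. (9.19)
    return x

# The Rosenbrock function of statement 900, started as in Fig. 9.4:
rosenbrock = lambda x: 100.0*(x[0]*x[0] - x[1])**2 + (1.0 - x[0])**2
print(steepest_descent(rosenbrock, np.array([-1.2, 1.0])))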
The function that is coded in statement 900 of Fig. 9.3 is the Rosenbrock
function. The results of computer runs for this and the other test functions
are given in Figs. 9.4-9.7. These figures indicate the minimum that steepest
descent was able to attain (after ten iterations) for the various test
functions.
N ? 2

X(I) I=1,2.... N
? -1.2 1
ERROR= 2.42000E+01

X(I)= -1.01284E+00 1.07639E+00


ERROR= 4.30709E+00

X(I)= -1.02891E+00 1.06651E+00


ERROR= 4.12265E+00

X(I)= -8.13910E-01 6.58198E-01


ERROR= 3.29207E+00

X(I)= -8.05820E-01 6.59570E-01


ERROR= 3.27144E+00

X(I)= -8.02957E-01 6.41045E-01


ERROR= 3.25202E+00

X(I)= -7.94927E-01 6.42283E-01


ERROR= 3.23252E+00

X(I)= -7.92468E-01 6.24776E-01


ERROR= 3.21399E+00

X(I)= -7.84461E-01 6.25899E-01


ERROR= 3.19537E+00

X(I)= -7.82332E-01 6.09208E-01


ERROR= 3.17751E+00

X(I)= -7.74318E-01 6.10229E-01


ERROR= 3.15957E+00
FIGURE 9.4. Minimization of the Rosenbrock function by steepest descent.

N ? 2

X(I) I=1,2.... N
? -1.2 1
ERROR= 7.49038E+02

X(I)= -3.72447E-02 7.31346E-01


ERROR= 5.45701E+01

X(I)= -2.38255E-02 -2.17332E-04


ERROR= 1.04822E+00

X(I)= 2.32122E-01 4.87883E-03


ERROR= 5.95455E-01

X(I)= 2.47363E-01 2.29155E-02


ERROR= 5.72515E-01

X(I)= 2.61018E-01 1.10520E-02


ERROR= 5.50625E-01
X(I)= 8.62417E-01 6.84183E-01
ERROR= 2.01683E-01

X(I)= 8.78017E-01 6.77291E-01


ERROR= 1.48973E-02

X(I)= 8.78377E-01 6.77222E-01


ERROR= 1.48158E-02

X(I)= 8.78676E-01 6.78807E-01


ERROR= 1.47362E-02

X(I)= 8.79030E-01 6.78740E-01


ERROR= 1.46569E-02
FIGURE 9.5. Minimization of the "cube" function by steepest descent.

If these results are compared with the corresponding results
obtained by the simplex algorithm (see Figs. 8.8, 8.10, 8.11, 8.13), we are
not able to conclude that one of these optimization techniques is always
better than the other: the simplex algorithm was better for the
Rosenbrock and cube functions, while steepest descent was better for the
Powell function and coefficient matching.

N ? 4

X(I) I=1,2.... N
? 3 -1 0 1
ERROR= 7.07336E+05

X(I)= 6.41034E-01 -9.99652E-01 4.83735E-06 1.23588E+00


ERROR= 8.13063E+02

X(I)= 2.22239E-01 -9.85161E-01 3.46988E-04 1.27696E+00


ERROR= 1.02617E+02

X(I)= 1.50222E-01 -1.25742E-02 2.56034E-02 1.23048E+00


ERROR= 7.26477E+00

X(I)= 1.23559E-01 -2.78714E-02 4.02785E-01 8.55751E-01


ERROR= 1.55328E+00

X(I)= 1.07730E-01 1.77109E-02 4.01929E-01 8.19492E-01


ERROR= 1.33930E+00

X(I)= 9.56487E-02 -1.84180E-02 4.04708E-01 7.79958E-01


ERROR= 1.18252E+00

X(I)= 9.52789E-02 1.63340E-02 3.97946E-01 7.47868E-01


ERROR= 1.05018E+00

X(I)= 8.69167E-02 -1.54803E-02 3.95124E-01 7.14234E-01


ERROR= 9.35799E-01

X(I)= 8.68037E-02 1.45339E-02 3.86480E-01 6.86602E-01


ERROR= 8.36207E-01

X(I)= 7.99820E-02 -1.35781E-02 3.81738E-01 6.57719E-01


ERROR= 7.48938E-01

FIGURE 9.6. Minimization of the Powell function by steepest descent.



N ? 3

X(I) I=1,2.... N
? 10000 10000 10000
ERROR= 2.59460E+04

X(I)= 2.98033E+04 1.98722E+04 1.97913E+04


ERROR= 7.70003E+01

X(I)= 2.79038E+04 1.86731E+04 1.81067E+04


ERROR= 4.64119E+01

X(I)= 2.74910E+04 1.92966E+04 1.66871E+04


ERROR= 4.39795E+01

X(I)= 2.80230E+04 1.98468E+04 1.68334E+04


ERROR= 4.16719E+01

X(I)= 2.78720E+04 2.04692E+04 1.54577E+04


ERROR= 3.94360E+01

X(I)= 2.84156E+04 2.09487E+04 1.56617E+04


ERROR= 3.73815E+01

X(I)= 2.83694E+04 2.13156E+04 1.47268E+04


ERROR= 3.57458E+01

X(I)= 2.88460E+04 2.17380E+04 1.48518E+04


ERROR= 3.43825E+01

X(I)= 2.88006E+04 2.19852E+04 1.40920E+04


ERROR= 3.31333E+01

X(I)= 2.92461E+04 2.23671E+04 1.42100E+04


ERROR= 3.19864E+01

FIGURE 9.7. Coefficient matching by steepest descent.

9.5 APPLICATIONS OF STEEPEST DESCENT

The previous chapter discussed some applications of the simplex optimi-
zation technique. Steepest descent could also be applied to any of those
applications. For some situations the simplex algorithm might be more
efficient; for others, steepest descent. In this section some additional
optimization problems will be solved by using the steepest-descent pro-
gram. These, of course, could also be solved by using the simplex algo-
rithm.
EXAMPLE 9.6
Minimize the following function by applying the steepest-descent pro-
gram of Fig. 9.3:

E(x) = x₁² + 0.1x₂² + (x₁ − 1)²(x₁ − 5)² + x₂²(x₂ − 4)².   (9.20)

SOLUTION

Figure 9.8 shows two different steepest-descent computer runs for this
error function. In Fig. 9.8(a) the initial parameters were chosen as x₁ = 10,
x₂ = 10, which yielded a minimum E(0.943, 3.975) = 2.533. In Fig. 9.8(b) the
initial parameters were instead chosen as x₁ = 1, x₂ = 1, which yielded
E(0.943, 0) = 0.943.

This example was presented to introduce the concept of a local mini-
mum versus a global minimum. The difference between these two mini-
mums can be understood by picturing the steepest-descent algorithm as
changing the parameters x in such a way that one goes down a mountain
(that is, one reduces the error function) as quickly as possible. On the way
down the mountain we may encounter a valley and conclude that we are at

N ? 2

X(I) I=1,2.... N
? 10 10
ERROR= 5.73500E+03

X(I)= 4.85948E+00 2.28120E+00


ERROR= 3.98028E+01

X(I)= 2.01683E+00 4.25819E+00


ERROR= 1.62909E+01

X(I)= 1.01915E+00 3.58429E+00


ERROR= 4.54937E+00

X(I)= 9.17705E-01 3.91928E+00


ERROR= 2.59120E+00

X(I)= 9.47973E-01 3.97099E+00


ERROR= 2.53324E+00

X(I)= 9.43220E-01 3.97426E+00


ERROR= 2.53266E+00

X(I)= 9.43484E-01 3.97466E+00


ERROR= 2.53266E+00
(a)
FIGURE 9.8. A local minimum is found in (a); the global minimum is found in (b).

N ? 2

X(I) I=1,2.... N
? 1 1
ERROR= 1.01000E+01

X(I)= 8.70582E-01 2.10555E-01


ERROR= 1.68458E+00

X(I)= 9.63341E-01 2.06959E-02


ERROR= 9.56749E-01

X(I)= 9.42366E-01 1.61174E-03


ERROR= 9.42785E-01

X(I)= 9.43547E-01 8.18966E-05


ERROR= 9.42722E-01

X(I)= 9.43450E-01 7.58370E-06


ERROR= 9.42721E-01
X(I)= 9.43453E-01 -8.32343E-07
ERROR= 9.42721E-01
(b)

FIGURE 9.8. (Continued.)

the bottom. However, it is possible that once we climb the other side of the
valley we discover the mountain descends for another mile. In this discus-
sion, the valley would be the local minimum, while the bottom of the
mountain would be the global minimum. In Example 9.6, E(0.943, 3.975) =
2.533 was a local minimum, while E(0.943, 0) = 0.943 was the global
minimum.
In an optimization problem one usually wants to find the global mini-
mum and does not want to be trapped at a local minimum. Failing to find
the global minimum is not just a difficulty associated with steepest
descent; any optimization technique can be plagued by this problem.
Sometimes one will know for physical reasons that the global minimum
must be near a certain point, which should then be picked as the initial
point. The computer will then improve on that point and hopefully attain
the global minimum. Other times (see Chapter 11) the parameters will be
constrained to remain within a certain region (perhaps for reasons of
availability of component values) and this will imply there is only one
allowable minimum.

FIGURE 9.9. Different initial parameters can yield different minimums.

However, often many different computer runs must be done before one
is satisfied with the minimum that is attained. If many different initial
parameters produce the same minimum, then one is fairly confident that it
is the lowest value that can be found. But in practical optimization
problems that have many parameters that can be varied, there may be
numerous local minimums.
A hypothetical case that has two local minimums in addition to the
global minimum is portrayed in Fig. 9.9. Choosing the initial parameter
values to be given by P₁ yields a minimum value of 5; starting at P₂ yields
10; and starting at P₃ yields the global minimum, which is zero.
In cases that have numerous local minimums, the initial parameters can
be systematically picked for good distribution over the parameter space. In
a practical problem one can never be positive that the global minimum has
been found; however, after a sufficiently low value has been located one
may decide his time can be spent more profitably on other tasks.
The rate of convergence of any search technique is highly dependent on
the given function E (x). It is for this reason that there is no optimal
procedure for an arbitrary error function. The efficiency of a particular
optimization technique is usually dependent on the scale used. To under-
stand why this is so, consider the following function:

E(x₁, x₂) = (x₁ − 2)² + 0.01(x₂ − 30)².   (9.21)



A steepest-descent optimization printout for the initial condition (1, 10) is
shown in Fig. 9.10(a). The initial error was 5, and after ten iterations this
was reduced to 0.44.
If 0.1x₂ is replaced by x₃, then (9.21) can be rewritten as

E(x₁, x₃) = (x₁ − 2)² + (x₃ − 3)².   (9.22)

This function represents a family of concentric circles centered about (2,3).
For a case such as this, steepest descent can find the minimum in one
iteration, as demonstrated in Fig. 9.10(b). The initial error at x₁ = 1, x₃ = 1
was 5 (the same as it was at the equivalent unscaled point x₁ = 1, x₂ = 10),
and after one iteration the error was essentially zero.
It has just been demonstrated that in some problems scaling can drasti-
cally affect the rate of convergence. In general, scaling should be done so
as to make the contours of constant error be as circular as possible. Note
that in (9.22) the contours were exact circles.
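
For instance, reusing the steepest_descent sketch from Section 9.4, the
two forms can be compared directly (an illustrative fragment; the unscaled
form needs many iterations, while the scaled form converges in one):

import numpy as np

unscaled = lambda x: (x[0] - 2.0)**2 + 0.01*(x[1] - 30.0)**2   # Eq. (9.21)
scaled   = lambda x: (x[0] - 2.0)**2 + (x[1] - 3.0)**2         # Eq. (9.22)
print(steepest_descent(unscaled, np.array([1.0, 10.0]), iterations=10))
print(steepest_descent(scaled,   np.array([1.0,  1.0]), iterations=1))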
For a problem that has many parameters x₁, x₂, ..., xₙ one can examine
the behavior of E(x) as just x₁ and x₂ are varied. x₂ can then be scaled so
that near the initial point x the error-function contours are nearly circular.
Next just x₁ and x₃ can be varied so as to determine how to scale x₃;
similarly for the remaining parameters x₄, x₅, ..., xₙ.
Far from a minimum one will probably not be able to scale the
parameters so that the error contours are circular. However, as the mini-
mum is approached, the error function will behave as a quadratic form and
thus can be scaled to yield circular contours. This discussion implies that
as an optimization procedure advances, one may wish to change how it is
scaled. In fact, some optimization techniques rescale the parameters after a
certain number (e.g. n + 1) of iterations.
Very elaborate schemes exist for determining how to scale parameters
for a particular problem. However, the reader of this text can usually
ignore these techniques, since for our applications the computer will
probably be able to optimize a poorly scaled problem faster than we can
determine a satisfactory scaling. In fact, instead of attempting scaling if a
particular problem is slow to converge, it may be more beneficial to apply
a different optimization technique.
It was mentioned that scaling will usually affect the rate of convergence;
however, the simplex optimization technique is an exception to this rule.
As a demonstration of this, the error function

E(x₁, x₂) = (x₁ − 2)² + 0.01(x₂ − 30)²

was reduced by using ten iterations of simplex. Similarly, ten iterations of
simplex were applied to the scaled function

E(x₁, x₃) = (x₁ − 2)² + (x₃ − 3)².
N ? 2

X(I) I=1,2,...N
? 1 10
ERROR= 5.00000E+00

X(I)= 2.03958E+00 1.02079E+01


ERROR= 3.91883E+00

X(I)= 1.21625E+00 1.43245E+01


ERROR= 3.07148E+00

X(I)= 2.03103E+00 1.44875E+01


ERROR= 2.40735E+00

X(I)= 1.38573E+00 1.77138E+01


ERROR= 1.88685E+00

X(I)= 2.02432E+00 1.78415E+01


ERROR= 1.47888E+00

X(I)= 1.51855E+00 2.03701E+01


ERROR= 1.15915E+00

X(I)= 2.01906E+00 2.04702E+01


ERROR= 9.08538E-01

X(I)= 1.62265E+00 2.24519E+01


ERROR= 7.12127E-01

X(I)= 2.01494E+00 2.25304E+01


ERROR= 5.58177E-01

X(I)= 1.70424E+00 2.40835E+01


ERROR= 4.37520E-01
(a)

N ? 2

X(I) I=1,2,...N
? 1 1
ERROR= 5.00000E+00

X(I)= 2.00000E+00 3.00000E+00


ERROR= 4.94774E-14
(b)

FIGURE 9.10. Minimization of equivalent scaled functions by steepest descent.



N ? 2

X(I) I=1,2,...N
? 1 10
ERROR= 5.00000E+00

X(I)= 1.14735E+00 1.14735E+01


ERROR= 4.15930E+00

X(I)= 1.02242E+00 1.36449E+01


ERROR= 3.63056E+00

X(I)= 1.25017E+00 1.55951E+01


ERROR= 2.63726E+00
(a)
N ? 2

X(I) I=1,2,...N
? 1 1
ERROR= 5.00000E+00

X(I)= 1.14735E+00 1.14735E+00


ERROR= 4.15930E+00

X(I)= 1.02242E+00 1.36449E+00


ERROR= 3.63056E+00

X(I)= 1.25017E+00 1.55951E+00


ERROR= 2.63726E+00
(b)

FIGURE 9.11. Minimization of equivalent scaled functions by simplex.

As indicated in Fig. 9.11, the error functions are exactly the same at
corresponding iterations. Scaling does not affect the simplex technique,
because the initial simplex is determined by varying each parameter by the
same percentage. Also, since the algorithm uses only discrete points, which
one is discarded to form a new simplex is not affected by scaling.

9.6 THE FLETCHER-POWELL OPTIMIZATION TECHNIQUE

The steepest-descent optimization technique is based on the fact that the
negative gradient points in the direction of the fastest rate of decrease.
However, this direction is guaranteed to be the optimal direction only for

an infinitesimal distance. After that infinitesimal distance, one should
really calculate a new direction, but this of course is impractical.
In order to improve the steepest-descent method, (9.7) can be continued
as

E(x + δx) ≈ E(x) + Σᵢ (∂E/∂xᵢ) δxᵢ + ½ Σᵢ Σⱼ (∂²E/∂xᵢ∂xⱼ) δxᵢ δxⱼ,   (9.23)

where the sums run over i, j = 1, ..., n. Not only does this equation contain
the first-derivative terms, it also
contains second-derivative terms. Steepest descent chose the direction to
move in parameter space by considering the slope; any method based on
(9.23) will use the slope and the curvature and thus should be more
accurate.
The above equation can be used to improve (9.10), which was the
foundation for the steepest-descent technique. The result is

E(x + δx) ≈ E(x) + ∇Eᵀδx + ½ δxᵀH δx,   (9.24)

where H is the Hessian matrix, which is defined as⁴

H = [∂²E/∂xᵢ∂xⱼ].   (9.25)

Equation (9.24) implies that in the neighborhood of an optimum, the
objective function is essentially a quadratic function. The theory of many
optimization methods is based on this assumption. In particular, the
Fletcher-Powell method leads to the minimum of a quadratic function in
n + 1 steps if n is the number of variables.⁵ However, since most error
functions are not quadratic, it usually takes longer to reach the minimum.
Also adding to the number of iterations is the fact that for each search,
quadratic interpolation finds the minimum search distance only approxi-
mately.
In the steepest-descent technique we searched for a minimum in the
direction of the negative gradient, that is,

S = −∇E.   (9.26)

Methods that are based on (9.24) modify this direction by the inverse of
the Hessian, i.e.,

S = −H⁻¹∇E.   (9.27)

⁴Recall that at a minimum ∇E must be zero, so that ∇Eᵀδx cannot be negative. Similarly
it follows that δxᵀHδx must be positive for any nonzero vector δx. That is, H must be a
positive definite matrix.
⁵This assumes that the derivatives are calculated analytically and not numerically.

This appears to be quite simple, but computationally it is very involved.


First recall that the Hessian is a matrix which contains second-order
partial derivatives. We know from the material in Chapter 5 that the
evaluation of second derivatives by numerical methods is an inaccurate
process. If we could evaluate the Hessian accurately, then the inverse could
be obtained by the methods of Chapter 2, but this would require a
substantial amount of computational time.
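
To make (9.27) concrete for a small problem where analytic derivatives
happen to be available, the following Python sketch computes the modified
direction for the Rosenbrock function. The derivative formulas are standard
for this function but are not taken from the text, and solving the linear
system avoids forming H⁻¹ explicitly:

import numpy as np

def rosen_grad(x):
    # Gradient of E = 100(x1^2 - x2)^2 + (1 - x1)^2
    return np.array([400.0*x[0]*(x[0]**2 - x[1]) - 2.0*(1.0 - x[0]),
                     -200.0*(x[0]**2 - x[1])])

def rosen_hess(x):
    # Hessian (matrix of second partial derivatives) of the same function
    return np.array([[1200.0*x[0]**2 - 400.0*x[1] + 2.0, -400.0*x[0]],
                     [-400.0*x[0], 200.0]])

x = np.array([-1.2, 1.0])
S = -np.linalg.solve(rosen_hess(x), rosen_grad(x))   # S = -H^(-1) grad E
print(S)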
For the reasons just mentioned, one usually does not attempt to directly
evaluate H⁻¹ in an optimization technique; instead, approximations are
used. The most famous approximation is used in the Fletcher-Powell
optimization technique.⁶ This approximates the inverse of the Hessian by
another matrix G which iteratively approaches H⁻¹. Initially, the ap-
proximation is chosen to be the identity matrix I, so that S in (9.27) points
in the direction of the negative gradient. Thus, the first Fletcher-Powell
iteration is equivalent to steepest descent, but the following iterations use
better and better approximations for H⁻¹ to improve the search direction.
We shall now see how the approximation G is obtained. Assume we are
at a point x and have a value for G (initially, this would be the identity
matrix). We then go in the direction S = −G GRAD(x)⁷ for a distance Δx
(the distance can be found by quadratic interpolation). Thus, the next
point is x′ = x + Δx. At this point, a better approximation G′ is found by
using the parameter change Δx and also the gradient change

y = GRAD(x′) − GRAD(x).   (9.28)

These changes are used to adjust G as follows:⁸

G′ = G + (Δx Δxᵀ)/d₁ − (Gy)(Gy)ᵀ/d₂,   (9.29)

where the scale factors d₁ and d₂ are given by

d₁ = yᵀΔx,   d₂ = yᵀGy.   (9.30)

A program that implements these equations is given in Fig. 9.12. The
following comments should help in understanding that program.
The definition

z = Gy   (9.31)

is made to allow the vector Gy to be conveniently stored by the program.
⁶This is also known as the DFP optimization technique after Davidon, Fletcher, and
Powell. Davidon originally conceived the algorithm, but Fletcher and Powell presented it in a
manner that made it famous.
⁷The notation GRAD(x) is used to indicate that the gradient is evaluated at point x.
⁸For a derivation of this equation, the interested reader is referred to Gottfried, B. S., and
J. Weisman (1973), Introduction to Optimization Theory (Englewood Cliffs, N.J.: Prentice-
Hall).
01C FLETCHER POWELL
02 PROGRAM FP(INPUT,OUTPUT)
03 03 FORMAT(7H X(I)=1P,4E14.5)
04 04 FORMAT(7H ERROR=,1P,E14.5,/)
06 DIMENSION G(10,10),GRAD(10),GRAD1(10),S(10)
07 DIMENSION X(10),X1(10),X2(10),Y(10),Z(10),DELX(10)
09C
10C INITIAL VALUES
11 PRINT,*N*, $READ,N
14 PRINT,/,*X(I) I=1,2,...N*
15 READ,(X(I),I=1,N)
18 E0=ERROR(X)
19 PRINT 4,E0
20C
21C FIND GRADIENT
22 22 DO 28 I=1,N
23 DELTA=.000001*X(I)
24 IF(ABS(X(I)).LT.1E-12) DELTA=.000001
25 XSAVE=X(I)
26 X(I)=X(I)+DELTA
27 GRAD(I)=(ERROR(X)-E0)/DELTA
28 28 X(I)=XSAVE
29C
30C INITIALIZE G
31 K=0
33 DO 36 I=1,N
34 DO 35 J=1,N
35 35 G(I,J)=0.
36 36 G(I,I)=1.
39C
40C FIND DIRECTION S (EQ. 9.27)
41 41 DO 44 I=1,N
42 S(I)=0.
43 DO 44 J=1,N
44 44 S(I)=S(I)-G(I,J)*GRAD(J)
48C
49C QUADRATIC INTERPOLATION
50C
51C FIND ALPHA 1
52 A1=1.
53 53 DO 54 I=1,N
54 54 X1(I)=X(I)+A1*S(I)
55 E1=ERROR(X1)
57 IF(E1.LT.E0) GO TO 63
58 A1=.5*A1
59 GO TO 53
61C
62C FIND ALPHA 2
63 63 A2=A1
64 64 A2=2.*A2
65 DO 66 I=1,N
66 66 X2(I)=X(I)+A2*S(I)
67 E2=ERROR(X2)
68 IF(E2.GT.E1) GO TO 74
69 A1=A2 $E1=E2
70 GO TO 64
72C
73C FIND ALPHA
74 74 A=(A1*A1-A2*A2)*E0+A2*A2*E1-A1*A1*E2
76 A=.5*A/((A1-A2)*E0+A2*E1-A1*E2)

(a)

FIGURE 9.12. A program for the Fletcher-Powell optimization technique.


81C FIND NEW X
82 DO 85 I=1,N
83 DELX(I)=A*S(I)
85 85 X(I)=X(I)+DELX(I)
86 E0=ERROR(X)
87 IF(E0.LT.E1) GO TO 95
89 DO 92 I=1,N
90 X(I)=X(I)+(A1-A)*S(I)
92 92 DELX(I)=A1*S(I)
93 E0=ERROR(X)
94C
95 95 PRINT 3,(X(I),I=1,N)
96 PRINT 4,E0
99C
100C REINITIALIZE G EVERY 5 CYCLES
101 K=K+1
102 IF(K.EQ.5) GO TO 22
110C
111C FIND NEW GRADIENT
112 DO 119 I=1,N
114 DELTA=.000001*X(I)
115 IF(ABS(X(I)).LT.1E-12) DELTA=.000001
116 XSAVE=X(I)
117 X(I)=X(I)+DELTA
118 GRAD1(I)=(ERROR(X)-E0)/DELTA
119 119 X(I)=XSAVE
122C
123C FIND Y (EQ. 9.28)
124 DO 126 I=1,N
125 Y(I)=GRAD1(I)-GRAD(I)
126 126 GRAD(I)=GRAD1(I)
129C
130C FIND D1 (EQ. 9.30)
131 D1=0.
134 DO 135 I=1,N
135 135 D1=D1+Y(I)*DELX(I)
139C
140C FIND Z (EQ. 9.31)
141 DO 148 I=1,N
142 Z(I)=0.
144 DO 148 J=1,N
148 148 Z(I)=Z(I)+G(I,J)*Y(J)
149C
150C FIND D2 (EQ. 9.30)
151 D2=0.
152 DO 155 I=1,N
154 DO 155 J=1,N
155 155 D2=D2+G(I,J)*Y(I)*Y(J)
199C
200C FIND NEW G MATRIX (EQ. 9.29)
201 DO 210 I=1,N
203 DO 210 J=1,N
210 210 G(I,J)=G(I,J)+DELX(I)*DELX(J)/D1-Z(I)*Z(J)/D2
212 GO TO 41
300 END
897C
898 FUNCTION ERROR(X)
899 DIMENSION X(10)
900 ERROR=100.*(X(1)*X(1)-X(2))**2+(1.-X(1))**2
950 RETURN
999 END

(b)
FIGURE 9.12. (Continued.)


Lines 101 and 102 cause the matrix G to be set equal to the identity
matrix every five iterations. The number five was chosen somewhat arbi-
trarily; the important fact is that periodic reinitialization is necessary to
prevent numerical inaccuracies in G from accumulating. In order to
observe how reinitialization of G can be beneficial, examine Figure 9.13,
which is a Fletcher-Powell optimization of the Rosenbrock function.

N ? 2

X(I) I=1,2,...N
? -1.2 1
ERROR= 2.42000E+01

X(I)= -1.01284E+00 1.07639E+00


ERROR= 4.30709E+00

X(I)= -8.73203E-01 7.41217E-01


ERROR= 3.55411E+00

X(I)= -8.15501E-01 6.24325E-01


ERROR= 3.46184E+00

X(I)= -6.72526E-01 3.88722E-01


ERROR= 3.20145E+00

X(I)= -7.12159E-01 4.94380E-01


ERROR= 2.94785E+00
X(I)= 1.05488E+00 1.13391E+00
ERROR= 4.77510E-02
X(I)= 1.06641E+00 1.13774E+00
ERROR= 4.43687E-03
X(I)= 1.06413E+00 1.13140E+00
ERROR= 4.20725E-03
X(I)= 1.02291E+00 1.04313E+00
ERROR= 1.55704E-03
X(I)= 1.02772E+00 1.05350E+00
ERROR= 1.49666E-03
FIGURE 9.13. Minimization of the Rosenbrock function by the Fletcher-Powell
program.

The error did not change very much between the fourth iteration (E = 3.20)
and the fifth (E = 2.95), but reinitializing G caused the error to be reduced
dramatically (E = 0.05).
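
For reference, one update of G as performed by lines 124-210 of Fig. 9.12
can be written compactly in Python (an illustrative sketch; the function
name is ours):

import numpy as np

def dfp_update(G, delta_x, y):
    # G: current approximation to the inverse Hessian
    # delta_x: parameter change between the two points
    # y: gradient change, Eq. (9.28)
    z = G @ y                         # Eq. (9.31)
    d1 = y @ delta_x                  # Eq. (9.30)
    d2 = y @ z                        # equals y^T G y since G is symmetric
    return G + np.outer(delta_x, delta_x)/d1 - np.outer(z, z)/d2   # Eq. (9.29)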
There are many other optimization techniques that are related to the
Fletcher-Powell algorithm. These algorithms are referred to as conjugate-
direction methods.⁹ They all base their search direction on the assumption
that sufficiently near a minimum the error function can be expressed as a
second-degree equation as in (9.24). However, none of the practical
methods directly evaluate the Hessian by evaluating second derivatives. In
fact, some of the methods do not even evaluate the gradient by evaluating
first derivatives. But of the conjugate-direction methods, the Fletcher-
Powell optimization technique is the most popular.
In conclusion, there are many different optimization techniques. Even in
a particular problem, the technique that works best may depend on the
initial set of parameters. For example, far from a minimum, steepest
descent may make more rapid improvements than Fletcher-Powell (be-
cause of a poor approximation to the inverse Hessian). In fact, far from a
minimum the simplex optimization technique may be better than either
steepest descent or Fletcher-Powell.
Organizations that must frequently solve optimization problems have
general-purpose optimization programs that contain many different opti-
mization algorithms. These programs can be written so they will automati-
cally change from one optimization technique to another if the original one
cannot make sufficient reductions in the error function. Thus an error
function might be reduced by steepest descent at first, and then by
Fletcher-Powell as the minimum is approached.
The next chapter discusses another algorithm, the least-pth optimization
technique, which is extremely popular. It is often used separately or as part
of a general-purpose optimization program.

⁹S₁ and S₂ are said to be conjugate with respect to the positive-definite matrix H if S₁ᵀHS₂ = 0.
PROBLEMS

9.1 If a vector x has components x₁, x₂, x₃, then its magnitude is defined
as |x| = (x₁² + x₂² + x₃²)^(1/2). Show that the magnitude of x = (2,4,6) is
twice the magnitude of y = (1,2,3).
9.2 For the error function E(x) = (x₁ − 2)² + (x₂ − 4)², what is the error at
x + αS if x = (2.1, 4.2), α = 0.01, and S = −(1,2)?
9.3 If the function f(x) has f′(x₀) = 0 and f″(x₀) = 0, then x₀ is called a
saddle point.
(a) What is the saddle point of f(x) = x³ − 6x² + 12x + 1?
(b) For the saddle point found in (a), show that f(x₀ + δ) > f(x₀) if δ
is a positive number and f(x₀ + δ) < f(x₀) if δ is a negative
number (thus x₀ is neither a maximum nor a minimum).
9.4 For E(x) = x₁² + 2x₂² + 3x₃²:
(a) Calculate the gradient at x = (1, 2, 3).
(b) Calculate the gradient at x = (1, 0, −1).
9.5 For E(x) = x₁² + 2x₁x₃ + x₂x₃²:
(a) Calculate the gradient at x = (1, 1, 1).
(b) Estimate E(1.1, 1.05, 0.95) by using (9.10).
9.6 Find where the gradient of

E(x) = x₁² − 2x₁ + x₂² − 8x₂ + 17

is equal to zero.
9.7 The gradient of E(x) = x₁² + 2x₁x₃ + x₂x₃² was found analytically in
Problem 9.5 to be (4, 1, 4) at x = (1, 1, 1). In this problem it will be
found by numerical methods.
(a) Use (9.12) with ε = 0.01 to estimate the gradient at x = (1, 1, 1).
(b) Instead of (9.12), use a central-difference formula to estimate the
gradient at x = (1, 1, 1).
(c) What is the disadvantage of using the central-difference formula
to estimate the gradient?
9.8 What is the scalar product of the two vectors a = (1, 4, −2),
b = (1, −1, 1)?
9.9 What is

9.10 For a = (1, 4, −2), choose a unit vector b such that
(a) a₁b₁ + a₂b₂ + a₃b₃ is a maximum,
(b) a₁b₁ + a₂b₂ + a₃b₃ is a minimum.
(c) What are the maximum and minimum values?
9.11 For the error function E(x) = x₁² + 3x₁x₃ + 2x₂² + 4:
(a) Calculate the error at x = (1, 1, 2).
(b) Calculate the gradient there.
(c) Let δx = 0.01∇E, and calculate the error at the new point.
(d) Let δx = −0.01∇E, and calculate the resulting error.
9.12 The Rosenbrock function was minimized by steepest descent in Fig.
9.4. This problem will verify that after one iteration x₁ = −1.0128,
x₂ = 1.0764.
(a) By analytical means, evaluate ∇E at x₁ = −1.2, x₂ = 1.
(b) In what direction does steepest descent search for a minimum?
(c) Quadratic interpolation finds it necessary to iterate to α₁ = (0.5)¹⁰
= 9.765625 × 10⁻⁴.
(1) Find x₁, E(x₁), where x₁ is as identified in statement 54 of
Fig. 9.3.
(2) Find x₂, E(x₂), where x₂ is as identified in statement 66 of
Fig. 9.3.
(3) Find α.
(d) For the value of α found in (c), calculate the new values of x₁, x₂.
9.13 What statement was included in the steepest-descent program in case
any of the initial parameters were chosen as zero?
9.14 Solve Problem 8.14 by using the steepest-descent program.
9.15 Solve Problem 8.15 by using the steepest-descent program.
9.16 Solve Problem 8.16 by using the steepest-descent program.
9.17 Solve Problem 8.17 by using the steepest-descent program.
9.18 The amount of calculation performed by the quadratic-interpolation
subroutine greatly affects the speed of the steepest-descent program.
One way to reduce the number of calculations is to change the way α₁
is calculated.
(a) Add an output statement to the steepest-descent program so that
after E₁ < E₀ the value of α₁ is printed. Apply this program to the
"cube" test function.
(b) Change the 0.5α₁→α₁ statement to 0.1α₁→α₁ and apply this
program to "cube". Did this increase the speed?

9.19 Instead of the modification suggested in Problem 9.18(b), another
possibility is to change the initial value of α₁ from 1 to 0.01. Apply
the resulting program to "cube" and determine whether or not this
increased the speed.
9.20 The following function has two minimums:

E(x₁, x₂) = x₁⁴ − 4x₁² + x₂² − 6x₂ + 20.

Use the steepest-descent program with different initial parameters so
that both minimums are located.
9.21 The functions in the two parts of this problem are equivalent, since
the variables were scaled according to x₂ = 100x₃. However, one form
will be much easier for steepest descent to optimize than the other.
(a) Use the steepest-descent program to minimize

3x₁² − 4x₁ + 6 + (1 + x₁²)(x₂ + 100)².

Choose the initial parameters as x₁ = 1, x₂ = 100.
(b) Use the steepest-descent program to minimize

3x₁² − 4x₁ + 6 + 10,000(1 + x₁²)(x₃ + 1)².

Choose the initial parameters as x₁ = 1, x₃ = 1.
9.22 (a) Calculate the Hessian at x = (−1.2, 1) for the Rosenbrock func-
tion.
(b) Find the inverse of the Hessian.
(c) Find the direction that is given by S = −H⁻¹∇E.
