Introduction To Optimization, Gradient-Based Methods
7.10 For y' = x^2 - 2y and the initial condition y(0) = 0, applying the
Runge-Kutta method with h = 0.1 yields y(0.1) = 0.0003175, y(0.2) =
0.0024204, y(0.3) = 0.0077979. Apply the Adams method to find
y(0.4). Compare with the Euler-method result of Problem 7.2, which
was 0.01284, and the analytical result, which is 0.0176678.
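The Adams step itself is easy to check with a short sketch like the following, which assumes the "Adams method" of the text is the standard four-point Adams-Bashforth predictor (an assumption; the book's own program may differ in detail):

def f(x, y):
    return x**2 - 2.0*y              # right side of y' = x^2 - 2y

h = 0.1
xs = [0.0, 0.1, 0.2, 0.3]
ys = [0.0, 0.0003175, 0.0024204, 0.0077979]   # Runge-Kutta starting values
fs = [f(x, y) for x, y in zip(xs, ys)]

# Adams-Bashforth: y4 = y3 + (h/24)(55 f3 - 59 f2 + 37 f1 - 9 f0)
y4 = ys[3] + h/24.0*(55*fs[3] - 59*fs[2] + 37*fs[1] - 9*fs[0])
print(y4)                            # about 0.01765; analytical value 0.0176678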
7.11 Use the Adams-method program to estimate the solution of

dy/dx = -y(2xe^(-x) + ln y)

at x = 0.1. Use the initial condition y(0) = 1, and choose S_min = 0.0001,
S_max = 0.01, h = 0.001.
Hint: The closest printout to x = 0.1 will be x = 0.092; thus, for an
initial condition of y(0.092) use the program again with h = 0.008.
Compare with the analytical answer y = exp(-x^2 e^(-x)).
7.12 In Section 7.8 the Runge-Kutta method was applied to solve the set
of equations

dy/dx = g(x, y, y1),  dy1/dx = f(x, y, y1).

This problem will instead illustrate the application of the Euler
method, which can advance the solution by y(x+h) ≈ y(x) +
hg(x, y, y1) and y1(x+h) ≈ y1(x) + hf(x, y, y1). Use this approach to
estimate y(0.2) and y1(0.2) for g(x, y, y1) = x + y1, f(x, y, y1) = x - 2y -
3y1, subject to the initial conditions y(0) = 0 and y1(0) = 0. Choose
h = 0.05.
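A few lines of code make the Euler recursion concrete. The g shown here (x + y1) and the step size are reconstructions from garbled text, so treat them as assumptions for illustration:

def g(x, y, y1):
    return x + y1

def f(x, y, y1):
    return x - 2.0*y - 3.0*y1

x, y, y1 = 0.0, 0.0, 0.0             # initial conditions y(0) = 0, y1(0) = 0
h = 0.05
while x < 0.2 - 1e-9:
    y, y1 = y + h*g(x, y, y1), y1 + h*f(x, y, y1)
    x += h
print(x, y, y1)                      # Euler estimates of y(0.2) and y1(0.2)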
7.13 Express the following third-order differential equation as a set of
first-order differential equations:

d^3y/dx^3 + 4y(dy/dx) + sin y = e^(-x).

What is f(x, y, y1, y2)?
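Whatever the exact right-hand side, the reduction follows the standard pattern of Section 7.8: define y1 = dy/dx and y2 = dy1/dx, so that

dy/dx = y1,  dy1/dx = y2,  dy2/dx = f(x, y, y1, y2),

where f is obtained by solving the original equation for d^3y/dx^3.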
7.14 The set of first-order differential equations in Problem 7.12 can be
transformed into a second-order differential equation in terms of x
and y. Find this equation and give the initial conditions. If familiar
with differential equations, find the analytical solution.
7.15 Use the program in Fig. 7.8 to find y(1.5) for

d^2y/dx^2 + 100(dy/dx) + y = 2.

(a) For the initial conditions y(0) = 0 and y'(0) = 100, show that

y = 2 - e^(-0.01x) - e^(-99.99x).
(b) Examine the solution where the small time constant predominates
by evaluating the solution in (a) for x=0.001, 0.01, 0.02, and 0.05.
(c) Examine the solution where the large time constant predominates
by evaluating the solution in (a) for x = 5, 50, and 300.
(d) Use the program of Fig. 7.8 to find y(0.05), and compare with the
answer found in (b). Use h=0.001.
(e) In the region where the large time constant predominates, one
would like to use a large step size. Try the program of Fig. 7.8 for
this differential equation with y(1) = 1, y'(1) = 0, and h = 0.1. Explain
what happened.
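Parts (b) and (c) only require evaluating the closed-form solution of (a); a brief sketch such as the following (an illustration, not the program of Fig. 7.8) shows how quickly the e^(-99.99x) term dies out and how slowly the e^(-0.01x) term decays:

from math import exp

def y(x):
    return 2.0 - exp(-0.01*x) - exp(-99.99*x)

for x in [0.001, 0.01, 0.02, 0.05]:      # small time constant predominates
    print(x, y(x))
for x in [5.0, 50.0, 300.0]:             # large time constant predominates
    print(x, y(x))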
Chapter Eight
Introduction to Optimization Theory

Many different programs can be used to minimize a function. A major reason is that often one technique is not able to
find a minimum, so another must be tried. Which program is best will
often depend on the particular error function that must be minimized. In
fact, as shown in Section 9.5, even scaling the parameters (e.g., replacing x1
by 10x1) can drastically affect the rate of convergence.
When an optimization program finds a minimum, the user may not be
happy with the minimum that is found. One reason (see Section 9.5) may
be that the point found was only a local minimum; that is, if different
initial parameter values had been chosen, then an even smaller value could
have been obtained. Another reason may be that the parameter values at
the minimum were unrealistic (e.g., if the parameters represent dimensions
of an object, then the "optimum" product may be too large to manufac-
ture). Chapter 11 indicates how parameter values can be constrained to
have practical values.
Before proceeding with this introduction to optimization theory, a warn-
ing should be given about the optimization programs that are included in
the rest of this text. As was true for the previous programs, these optimiza-
tion programs have been written with clarity as the first goal and not
efficiency. (In an optimization program, it can be worthwhile to save a few
milliseconds of computation time because the computations may be done
many times before the optimization loop is completed.) While it should be
acknowledged that these programs are not sophisticated, it also should be
emphasized that they can be used to solve practical optimization problems
-but interaction might be required in certain cases. This interaction could
consist of choosing a better set of initial parameters; or if a particular
algorithm (e.g., steepest descent) does not perform satisfactorily for a
specific problem, then another algorithm (e.g., least pth) could be tried.
Often, we will not want to minimize an error, but instead minimize the
magnitude of the error. That is, a large negative error may be as objection-
able as a large positive error. As an example of this, assume we want to
find a solution of the nonlinear equation ln x = x. If we define an error
function as e(x) = ln x - x, then a solution of the nonlinear equation can be
found by adjusting x so that the error function e(x) becomes zero. If we
want to make a function e(x) become zero, we can instead minimize the
square of that function, i.e.,
E = [e(x)]^2. (8.3)
This idea can be generalized in numerous ways; for example, the
function we want to make zero may depend on many parameters
x1, x2, ..., xn. Then we write

E(x) = [e(x)]^2. (8.4)

Or we may have many functions e_1(x), e_2(x), ..., e_m(x) we wish to simulta-
neously set equal to zero; then we can define the error function as

E(x) = Σ_{i=1}^{m} [e_i(x)]^2. (8.5)
EXAMPLE 8.1
If the two functions that we wish to make zero are

e_1(x1, x2) = x1^2 + 3 cos x2 - 2,
e_2(x1, x2) = cos x1 + 2 x1 x2 - 4,

then minimizing the following error function will produce a solution:

E(x) = [e_1(x)]^2 + [e_2(x)]^2.
The sum of squares in (8.5) can be generalized by raising the individual
errors to the pth power:

E(x) = Σ_{i=1}^{m} [e_i(x)]^p. (8.6)

This is called the least-pth error criterion, while the special case of p = 2 is
called the least-squares error criterion.
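For instance, the error function of Example 8.1 can be written in least-pth form in a few lines (a sketch in Python rather than the book's FORTRAN; p = 2 reproduces the least-squares value 2.51624 printed by the simplex program for x1 = x2 = 1):

from math import cos

def E(x, p=2):
    e1 = x[0]**2 + 3.0*cos(x[1]) - 2.0
    e2 = cos(x[0]) + 2.0*x[0]*x[1] - 4.0
    return abs(e1)**p + abs(e2)**p       # absolute values keep odd p meaningful

print(E([1.0, 1.0]))                     # about 2.516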
In a practical problem, some of the individual errors may be more
important than others. In this case we can assign nonnegative weighting
factors w_i, so that

E(x) = Σ_{i=1}^{m} w_i [e_i(x)]^p. (8.7)

The more important an error is, the larger is the weight we assign it. Of
course, we could make the substitution w_i [e_i(x)]^p → [ê_i(x)]^p, with
ê_i = w_i^(1/p) e_i, and transform the weighted problem to a nonweighted one.
In some problems a requirement2 f(t) may be given, and we may wish to
adjust the parameters x1, x2, ..., xn so that the response h(t) of a system is
as close as possible to f(t). For this case we could define an error as

e(t) = f(t) - h(t). (8.8)

Numerically, the error can only be evaluated at a discrete set of times
t_1, t_2, ..., t_m, so the individual errors are e_i = f(t_i) - h(t_i), where the
subscript i indicates that the functions are evaluated at t = t_i. For
this case, (8.6) can be rewritten as

E(x) = Σ_{i=1}^{m} [f(t_i) - h(t_i)]^p. (8.9)
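In code, the sampled error is just a weighted sum over the time points; all names in this sketch are illustrative:

def sampled_error(f, h, times, w=None, p=2):
    # E(x) = sum of w_i [f(t_i) - h(t_i)]^p over the sample times t_i
    w = w if w is not None else [1.0]*len(times)
    return sum(wi*abs(f(t) - h(t))**p for wi, t in zip(w, times))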
Just as there are various numerical methods for finding the roots of an
equation, there are different optimization techniques that can be used to
minimize the error function E(x). Which is the best will depend on the
specific application. However, it is possible to divide optimization tech-
niques into certain categories and then to compare the categories. This
section will give a brief overview of various optimization techniques, and
the next section will give a more detailed presentation of a specific one.
The following two chapters will give detailed descriptions of additional
optimization techniques.
2The variable t can be considered to be time in this discussion, but of course it could have
a different meaning.
3This is essentially the same as Section 13.4 of Daniels (1974).
2. Slope-Following Methods
Slope-following methods evaluate the first derivatives of the error func-
tion (∂E/∂x_i) and use this information to indicate how the parameters
should be changed in order to minimize the error. The first derivatives
determine the gradient of the error function. The gradient points in the
direction of the greatest change in error; thus, to minimize the error one
proceeds in the direction opposite to the gradient. This is the basis of the
steepest-descent method, which uses the gradient to predict parameter
changes for error minimization. Steepest descent can be considered to be
an attempt to change parameter values so as to proceed down an error
slope most rapidly. Assuming that the error function E(x) has the special
form given in (8.6) leads to the least-pth optimization technique, which is a
generalization of the least-squares method.
3. Second-Order Methods
The slope-following methods tend to reduce the error rapidly in the
initial stages of an optimization procedure; however, their convergence is
rather slow as the minimum is approached. To improve the rate of
convergence, one can use not only the first derivatives of the error function
(∂E/∂x_i), but also the second derivatives (∂^2E/∂x_i ∂x_j). Just as the first
derivatives determined the gradient of the error function, the second
derivatives determine the Hessian.
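As a point of reference (a standard relation, not one derived in this section), if the Hessian H of second derivatives were known exactly, a second-order method could step directly toward the minimum of the local quadratic model by choosing

δx = -H^(-1) ∇E,

and the methods below can be viewed as ways of building up an approximation to H^(-1) without ever computing second derivatives.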
The various second-order methods differ mainly in the way they try to
approximate the second derivatives. The second derivatives are not usually
found by using a perturbation scheme based on varying elements; they are
instead approximated by using knowledge of the error function and its
gradient at previous iterations. The Fletcher-Powell minimization proce-
dure is one of the best known second-order methods.
4This is not related to Dantzig's simplex algorithm of linear programming. The nonlinear
simplex method described here was originally proposed by Nelder, J. A., and Mead, R. (Jan.
1965), "A Simplex Method for Function Minimization", Comput. J., pp. 308-313. The values
for α, β, and γ in this section were suggested by P. E. Fleischer and D. M. Bohling (private
correspondence).
The iterative part of the simplex algorithm begins by ordering the points
P0, P1, ..., Pn according to the values of E(P0), E(P1), ..., E(Pn). The points
that yield the lowest, highest, and next highest values of the error function
are identified as PL, PH, and PNH. The simplex algorithm repetitively
replaces PH, the highest point. It will be illuminating to adopt a geometric
language in a description of the algorithm. We will picture the simplex as
moving towards a minimum by a series of operations termed reflections,
contractions, and expansions. These are illustrated in Fig. 8.1 for the case
of a two-dimensional simplex.
1. Reflection
The first attempt at replacing PH is by reflection about the centroid,
defined as

C = (1/n) Σ_{i=0, i≠H}^{n} P_i. (8.14)

That is, the centroid is the average of all the points except PH, which is
going to be replaced shortly. The reflected point lies on the opposite side of
the centroid from PH:

PR = (1 + α)C - αPH. (8.15)
FIGURE 8.1. Reflection, expansion, and contraction points for a two-dimensional simplex.

EXAMPLE 8.2
The error function for this example is E(x) = 2x1 - x2. For the initial point
P0 = (2, 1), find the initial simplex, identify PL and PH, and calculate the
centroid.

SOLUTION
Following (8.13), the initial simplex consists of P0 = (2, 1), P1 = (2.2, 1), and
P2 = (2, 1.1), with errors E(P0) = 3, E(P1) = 3.4, and E(P2) = 2.9. Thus
PH = (2.2, 1) and PL = (2, 1.1). The centroid of the points other than PH is
C = (1/2)[(2, 1) + (2, 1.1)] = (2, 1.05).
EXAMPLE 8.3
This example will demonstrate reflection for the data given in Example
8.2. The parameter α will be chosen as unity to simplify computations.
For α = 1, (8.15) becomes PR = 2C - PH. Substituting the results from the
previous example,

PR = 2(2, 1.05) - (2.2, 1) = (1.8, 1.1).
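The same arithmetic in code form, reproducing Example 8.3 (a sketch in Python rather than the book's FORTRAN):

def reflect(C, PH, alpha=1.0):
    # PR = (1 + alpha) C - alpha PH, equation (8.15)
    return tuple((1.0 + alpha)*c - alpha*p for c, p in zip(C, PH))

print(reflect((2.0, 1.05), (2.2, 1.0)))   # (1.8, 1.1)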
The reflected point may not be described by (8.16), for one of two
possible reasons. One of the possibilities can be illustrated by calculating
the error function for the reflected point of Example 8.3. Since for those
data the error function was defined as E(x) = 2x1 - x2, it follows that for
PR = (1.8, 1.1), E(PR) = 2.5. From Example 8.2, the error at the low point
was E(PL) = 2.9; thus E(PR) < E(PL). The other reason the reflected point
may not be described by (8.16) is that in some cases E(PR) > E(PH).
2. Expansion
If the reflection operation is highly successful and produces a new
minimum, so that E(PR) < E(PL), then it seems likely we are proceeding
in a good direction, so we will go further in that direction by expanding
according to the expression

PEx = βPR + (1 - β)C. (8.17)

If β = 2, then the expanded point will be twice as far from the centroid as is
the reflected point (see Fig. 8.1 or Problem 8.10). However, this value will
be modified slightly to 1.95 to prevent possible instabilities. Depending on
whether E(PR) or E(PEx) is smaller, either PR or PEx replaces PH. A new
centroid is then calculated and the reflection process is again attempted.
3. Contraction
If the reflection operation is not successful, but results in E(PR) >
E(PH), then it seems likely we are searching on the wrong side of the
centroid, so contraction is performed according to

PC = (1 - γ)C + γPH. (8.18)

If γ = 1/2, then the contracted point is midway between PH and the centroid
(see Fig. 8.1). This value of γ will be modified slightly to 0.4985 for
practical reasons. If contraction is successful and E(PC) < E(PH), then the
result is the same as if reflection had been successful, and we proceed
accordingly.
Contraction is usually successful; however, if it is not and E(PC) >
E(PH), then drastic measures are taken.
4. Scaling
In the unlikely event that neither reflecting nor contracting can find a
better point than PH, then the simplex itself is modified by a scaling
process described by
P_i + k(PL - P_i) → P_i,  i = 0, 1, ..., n. (8.19)
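The four operations fit together as in the following sketch of a single pass of the algorithm (Python rather than the book's FORTRAN; the value k = 0.5 is an assumption, since the program reads k in at run time):

def simplex_step(P, E, k=0.5, alpha=1.0, beta=1.95, gamma=0.4985):
    errs = [E(p) for p in P]
    H = errs.index(max(errs))                # highest point PH
    L = errs.index(min(errs))                # lowest point PL
    n = len(P) - 1
    C = [sum(p[j] for i, p in enumerate(P) if i != H)/n
         for j in range(n)]                  # centroid, (8.14)
    PR = [(1 + alpha)*c - alpha*ph for c, ph in zip(C, P[H])]   # reflect, (8.15)
    if E(PR) < errs[L]:                      # new minimum, so try expanding, (8.17)
        PE = [beta*r + (1 - beta)*c for r, c in zip(PR, C)]
        P[H] = PE if E(PE) < E(PR) else PR
    elif E(PR) <= errs[H]:                   # ordinary success
        P[H] = PR
    else:                                    # reflection failed, so contract, (8.18)
        PC = [(1 - gamma)*c + gamma*ph for c, ph in zip(C, P[H])]
        if E(PC) < errs[H]:
            P[H] = PC
        else:                                # contraction failed, so scale, (8.19)
            PL = list(P[L])
            for i in range(len(P)):
                P[i] = [pi + k*(pl - pi) for pi, pl in zip(P[i], PL)]
    return P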
01C SIMPLEX
02 PROGRAM S(INPUT,OUTPUT)
03 03 FORMAT(7H X(I)=,1P,4E14.5)
04 04 FORMAT(7H ERROR=,1P,E14.5,/)
06 DIMENSION C(10),E(10),P(10,10),R(10),X(10)
07 INTEGER H
08 REAL K
09C
10C INITIAL VALUES
11 PRINT,*N*, $READ,N
12 N1=N+1
14 PRINT,/,*X(I) I=1,2,...N*
15 READ,(X(I),I=1,N)
16 E(1)=ERROR(X)
17 PRINT 4,E(1)
19C
20C INITIALIZE SIMPLEX (EQ. 8.13)
21 DO 22 J=1,N
22 22 P(1,J)=X(J)
24 DO 28 I=2,N1
25 DO 26 J=1,N
26 26 P(I,J)=X(J)
27 P(I,I-1)=1.1*X(I-1)
28 28 IF(ABS(X(I-1)).LT.1E-12) P(I,I-1)=.0001
29C
30C FIND PL,PH
31 31 L=1 $H=1
32 DO 38 I=1,N1
34 DO 35 J=1,N
35 35 X(J)=P(I,J)
36 E(I)=ERROR(X)
37 IF(E(I).LT.E(L)) L=I
38 38 IF(E(I).GT.E(H)) H=I
39C
40C FIND PNH
41 41 NH=L
42 DO 43 I=1,N1
43 43 IF(E(I).GE.E(NH).AND.I.NE.H) NH=I
49C
50C CALCULATE CENTROID (EQ. 8.14)
51 DO 56 J=1,N
52 C(J)=-P(H,J)
53 DO 54 I=1,N1
54 54 C(J)=C(J)+P(I,J)
56 56 C(J)=C(J)/N
59C
60C REFLECT (EQ. 8.15)
61 61 DO 62 J=1,N
62 62 R(J)=1.9985*C(J)-.9985*P(H,J)
64 ER=ERROR(R)
(a)
In the program the lowest, highest, and next highest errors would be stored as E(L), E(H), and E(NH).
The location of the centroid C is found by calculating

C = (1/n) Σ_{i=1, i≠H}^{n+1} P_i = (1/n) [ -PH + Σ_{i=1}^{n+1} P_i ]. (8.20)

The second form of the equation is used because it avoids the necessity of
checking each point to determine whether or not it is PH.
In the program, the point produced by contraction is identified as PR
(i.e., R). Actually this is the value PC that would be calculated by applying
(8.18). It is identified in the program as PR (and not as PC) because if
contraction is successful we proceed exactly the same as we would have
proceeded if reflection had been successful. Thus coding the result of
contraction as PR eliminates an unnecessary substitution (i.e., it eliminates
PC → PR).
EXAMPLE 8.4
This uses the simplex algorithm to solve the following nonlinear equa-
tions, which were first presented in Example 8.1:
x1^2 + 3 cos x2 = 2,
cos x1 + 2 x1 x2 = 4.
That example derived a suitable form for the error function, which can be
coded in statements 900 and 910 as
900 ERROR=(X(1)*X(1)+3.*COS(X(2))-2.)**2
910 ERROR=ERROR+(COS(X(1))+2.*X(1)*X(2)-4.)**2
In this example, each contribution to the error function was given a
separate statement, which makes for ease in reading.
Figure 8.4 indicates the output of the simplex program for this example.
The top of the printout indicates that for the initial set of parameters
x1 = 1, x2 = 1 the error function was 2.52. After a few iterations, the simplex
algorithm indicated that a better set of parameters would be x1 = 1.26222,
x2 = 1.44424, for which the error function was 0.00333.
N ? 2
X(I) I=1,2,...N
? 1 1
ERROR= 2.51624E+00
Rounding the parameter values to x1 = 1.26, x2 = 1.44 changes the left sides of the equations to 1.979 and
3.935, which are probably still sufficiently close to the desired values of 2
and 4.
In a practical optimization problem, the accuracy of a parameter value
may indicate the tolerance of a component that is used. For example, if x
represents the value of a resistor in an electrical network, then specifying
four significant figures may imply that a component with a tolerance of
0.1% should be used. Since tighter tolerances usually result in increased
cost, an effort should be made to ascertain how many digits in the
optimized parameter values are really important.
EXAMPLE 8.5
It should be noted that the minimum that was found in Example 8.5 was
a negative number. The simplex algorithm is a general procedure that
minimizes a function of several variables. Often the function E(x) is
formulated as a sum of squares [see (8.5)], so that the minimum will be
positive; but as this example demonstrates, that is not necessary.
The processes of reflection, expansion, and contraction were used in the
two previous examples. However, it was never necessary to scale the entire
simplex. In fact this is generally true; the scale operation in simplex is not
used often. The following example was especially concocted so that scaling
would be necessary. Singularities (places where the function approaches
infinity) were put near PR (the reflected point) and PC (the contracted
point) so that these points would have large error values.
EXAMPLE 8.6
N ? 2
X(I) I=1,2,...N
? 1 1
ERROR= 1.05566E+00
N ? 2
X(I) I=1,2,...N
? 1 1
ERROR= 6.88701E+02
K ? -1
X(I)= 1.19471E+00 6.10585E-01
ERROR= 4.29520E+01
The simplex algorithm that was just described is but one of many
optimization techniques that can be used to minimize an error function
E(x). Numerous techniques have been described in the literature; which is
best depends on the application. However, for a specific problem one can
make meaningful comparisons between different optimization techniques.
Over the years researchers have encountered (or concocted) various
functions that are difficult to minimize. Many of these functions have been
described in the literature and are now commonly used as test functions.
In this section a few of the "classical" test functions will be discussed; then
in the following chapters we will be able to compare different optimization
techniques.
Perhaps the most famous test function is the Rosenbrock function

E(x) = 100(x1^2 - x2)^2 + (1 - x1)^2. (8.21)

That this is a difficult function to optimize can be appreciated by studying
Fig. 8.7. This figure indicates contours of constant E(x). Because the
contours are close together it is difficult for an optimization program to
search for the minimum.
The amount of computation time required by an optimization technique
to reach a minimum can be greatly influenced by the values selected for
the initial parameters. For the Rosenbrock function the values customarily
chosen are x1 = -1.2 and x2 = 1.
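For reference, the function and its customary starting point are easy to reproduce (Python sketch):

def rosenbrock(x):
    return 100.0*(x[0]**2 - x[1])**2 + (1.0 - x[0])**2

print(rosenbrock([-1.2, 1.0]))           # 24.2, the initial error of Fig. 8.8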
Figure 8.8 shows the output of the simplex program for the Rosenbrock
function. For P0 = (-1.2, 1) the initial error was 24.2. After one iteration
the parameters were adjusted to x1 = -1.08, x2 = 1.10, which reduced the
error to 4.77. However, after this initial reduction in the error function, the
simplex algorithm was very slow in approaching the minimum. The path
taken due to the simplex optimization is indicated by the dashed line in
Fig. 8.7. Following the thin curving contours is time consuming, but
depending on how patient one is, the algorithm can get arbitrarily close to
the minimum x1 = x2 = 1, at which point E(x) is equal to zero.
A function very similar to the Rosenbrock function is "cube":

E(x) = 100(x1^3 - x2)^2 + (1 - x1)^2. (8.22)
This function is plotted in Fig. 8.9, and the simplex optimization results are
given in Fig. 8.10. The initial point P0 = (-1.2, 1) was the same as for the
Rosenbrock function. The initial error of 749 was worse than the corre-
sponding Rosenbrock error, but the optimal point P = (1, 1) was ap-
proached much more quickly.
N ? 2
X(I) I=1,2,...N
? -1.2 1
ERROR= 2.42000E+01
In the next two chapters these three test functions (Rosenbrock, "cube",
and Powell) will be used to compare the efficiency of different optimization
techniques. This will give insight into how the programs would be
expected to perform on problems similar to these test functions. However,
if one wants to know how different optimization techniques compare for a
particular type of problem, then a problem of that type should be selected.
For an example of a "particular type of problem", the technique of
coefficient matching will be explained. The electrical circuit in Fig. 8.12 is
N ? 2
X(I) I=1,2,...N
? -1.2 1
ERROR= 7.49038E+02
N ? 4
X(I) I=1,2,...N
? 3 -1 0 1
ERROR= 7.07336E+05
FIGURE 8.12. An active filter circuit that is used to illustrate coefficient matching.
where

a1 = 1.2×10^5/x2,

b0 = (x2 + x3)×10^10/(x1 x2 x3), (8.27)

b1 = 2×10^5/x1 - 2×10^4 (x2 + x3)/(x2 x3).
The coefficient-matching technique consists of adjusting the parameters
x1, x2, x3 so that the coefficients have the specified values. For this
problem (8.24) implies that the specified values are a1 = 1, b0 = 40, and
b1 = 1. Optimization techniques can be applied to this problem if the
following error function is defined:

E(x) = (a1 - 1)^2 + (b0 - 40)^2 + (b1 - 1)^2, (8.28)

where a1, b0, b1 are as given in (8.27).
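A direct transcription of (8.27) and (8.28) reproduces the starting error shown in the printout below (a Python sketch of the reconstructed coefficients):

def E(x):
    x1, x2, x3 = x
    a1 = 1.2e5/x2
    b0 = (x2 + x3)*1e10/(x1*x2*x3)
    b1 = 2e5/x1 - 2e4*(x2 + x3)/(x2*x3)
    return (a1 - 1.0)**2 + (b0 - 40.0)**2 + (b1 - 1.0)**2

print(E([1e4, 1e4, 1e4]))                # 25946.0, i.e. 2.59460E+04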
N ? 3
X(I) I=1,2,...N
? 10000 10000 10000
ERROR= 2.59460E+04
PROBLEMS
x^2 + y^2 = 13,  y^3 - xy = 21.
8.4 If the initial parameters are chosen as x1 = 1, x2 = 2, and x3 = 3, what
are the four points that describe the initial simplex?
7See Chapter 9.
Chapter Nine
Gradient Techniques
9.1 INTRODUCTION
The simplex rules are supposed to cause the parameters to proceed from
the high point PH in a "downhill" direction; that is, towards a minimum.
In this chapter we will first discuss a technique, steepest descent, that
guarantees the direction is downhill. The Fletcher-Powell optimization
technique modifies the steepest-descent direction slightly and thus may
converge even faster; this method will also be discussed in this chapter.
There are other methods which also modify the steepest-descent direction;
they will not be discussed in detail in this book, but an important
observation should be made: no matter which rule or set of rules is used to
determine the "proper" direction, the distance to proceed in that direction
can be determined by the same method. The method commonly used is
given in the next section.
EXAMPLE 9.1
SOLUTION
This section will assume that the direction of the parameter change has
been determined and is denoted as S. The magnitude of the parameter
changes will then be determined by selecting the proportionality constant
α in the following equation:

δx = αS. (9.1)
From this formulation of the problem it can be seen that the distance
one goes in a specific direction can be determined by a single-parameter
search. That is, the proportionality constant a can be found by using one
of the single-parameter minimization methods that was described in
Chapter 3. Because the single-parameter search can take a substantial
amount of the total computational time, it should be chosen carefully. We
will use the quadratic interpolation method.
The quadratic interpolation method can be used to approximate the
minimum of the error function E by passing a parabola through three
points. If the error function is quadratic, then the minimum of the
parabola will also be the minimum of the error function (that is, the
minimum in the direction S). However, the error function will not usually
be quadratic, so the parabola minimum will just approximate the error
minimum.
The rest of this section will be used to explain the statements shown in
Fig. 9.1. These statements will be used in the steepest-descent program to
apply quadratic interpolation. The statements are a generalization of the
single-parameter quadratic interpolation that was described in Chapter 3.
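The heart of those statements is the vertex of the parabola through the three points (0, E0), (α1, E1), (α2, E2); the following is a sketch of the same formula that statements 74-76 of Fig. 9.12 code in FORTRAN:

def quad_interp(a1, a2, E0, E1, E2):
    # alpha at the minimum of the parabola through (0,E0), (a1,E1), (a2,E2)
    num = (a1*a1 - a2*a2)*E0 + a2*a2*E1 - a1*a1*E2
    den = (a1 - a2)*E0 + a2*E1 - a1*E2
    return 0.5*num/den

print(quad_interp(1.0, 2.0, 1.0, 0.0, 1.0))   # 1.0 for E(alpha) = (alpha - 1)^2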
EXAMPLE 9.2
Find the gradient of E(x) = x1^2 + x2^2 at the point x1 = 2, x2 = 3.

SOLUTION
The gradient is ∇E = (∂E/∂x1, ∂E/∂x2)^T = (2x1, 2x2)^T, which at (2, 3)
equals (4, 6).
Here we have used the notation that a vector can be treated as a column
matrix. It follows that

∇E^T δx = [∂E/∂x1  ∂E/∂x2  ...  ∂E/∂xn] [δx1  δx2  ...  δxn]^T
        = (∂E/∂x1) δx1 + (∂E/∂x2) δx2 + ... + (∂E/∂xn) δxn. (9.9)

Comparing this with (9.7), it follows that

E(x + δx) ≈ E(x) + ∇E^T δx. (9.10)
The above relation will be very important to us; in fact, it can be
considered to be the foundation for the steepest-descent method, which is
described in the next section. However, before discussing the steepest-
descent algorithm we will consider an example for (9.10) and also discuss
the gradient in more detail.
EXAMPLE 9.3
If the function E(x) = x1^2 + x2^2 is evaluated at x1 = 2, x2 = 3, one finds
E(2, 3) = 13. If the parameters are changed slightly, choosing δx1 = 0.2 and
δx2 = 0.1, then (9.10) can be applied to estimate a new error. In the
following equation, the value of the gradient came from Example 9.2:

E(2.2, 3.1) ≈ E(2, 3) + [4  6][0.2  0.1]^T = 13 + 0.8 + 0.6 = 14.4.

The exact answer is 14.45.
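The comparison is easy to verify numerically (illustrative sketch):

def E(x1, x2):
    return x1**2 + x2**2

grad = (4.0, 6.0)                        # gradient at (2, 3), from Example 9.2
est = E(2, 3) + grad[0]*0.2 + grad[1]*0.1
print(est, E(2.2, 3.1))                  # 14.4 (estimate) versus 14.45 (exact)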
For any optimization procedure, one must make an initial guess x for
the parameter values. Next this guess is modified to x + δx so as to reduce
the error function. How the parameter change δx is calculated depends on
the optimization procedure. In this section we will learn how δx is chosen
for the steepest-descent technique.
Since δx is a vector, it can be described by two quantities: a direction
and a magnitude. In order to choose the direction of δx we will use the
scalar product (dot product) of vector algebra. By definition, the scalar
product of a = (a1, a2, ..., an) and b = (b1, b2, ..., bn) is

a^T b = a1 b1 + a2 b2 + ... + an bn. (9.13)

It can be shown that this definition implies

a^T b = |a| |b| cos θ, (9.14)

where |a| is the magnitude of a, |b| is the magnitude of b, and θ is the angle
between the two vectors.
EXAMPLE 9.4
SOLUTION
(a) Comparing (9.13) and (9.14) yields a1 b1 + a2 b2 = |a| |b| cos θ. This is a
maximum when θ = 0, i.e., when a and b are in the same direction;
thus b = a(3, 4). But b is a unit vector, so that

|b| = |(3a, 4a)| = (9a^2 + 16a^2)^(1/2) = 5a = 1.

Since therefore a = 1/5, it follows that

b = (0.6, 0.8).
In order to see how the scalar product can be applied, consider again the
relation

E(x + δx) ≈ E(x) + ∇E^T δx. (9.15)

The last term in this expression can be rewritten as

∇E^T δx = (∂E/∂x1) δx1 + (∂E/∂x2) δx2 + ... + (∂E/∂xn) δxn,

which [by comparison with (9.13)] is the scalar product of the gradient ∇E
and the parameter change δx.
Since ∇E^T δx is equivalent to the scalar product of ∇E and δx, it
follows from Example 9.4 that a unit vector δx will produce a maximum
increase in the error if it points in the direction of the gradient, and will
produce a maximum decrease if it points in the opposite direction. The
steepest-descent method therefore chooses the direction S = -∇E.
EXAMPLE 9.5
A program for the steepest-descent method is given in Fig. 9.3. The first
part of the program indicates how the gradient can be approximated by
applying (9.12) with the increment parameter ε equal to 0.000001. After the
gradient is calculated, the negative gradient is defined to be S. A one-
dimensional search for the minimum is then performed in this direction by
employing quadratic interpolation, which was described in Section 9.2.
This process yields the parameter α, which indicates how far to proceed in
the direction S. The parameter x is then incremented by αS.
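The overall structure of the program can be summarized in a few lines (a Python sketch; a simple halving search stands in for the quadratic interpolation of Section 9.2):

def gradient(E, x, eps=1e-6):
    # forward-difference approximation to the gradient, as in (9.12)
    E0, g = E(x), []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += eps
        g.append((E(xp) - E0)/eps)
    return g

def steepest_descent(E, x, iterations=10):
    for _ in range(iterations):
        S = [-gi for gi in gradient(E, x)]       # direction S = -grad E
        alpha = 1.0                              # crude search for alpha
        while alpha > 1e-12 and E([xi + alpha*si for xi, si in zip(x, S)]) >= E(x):
            alpha *= 0.5
        x = [xi + alpha*si for xi, si in zip(x, S)]
    return x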
The "minimum" that is obtained by the above process will probably not
be the minimum of the error function. This is because (9.15) assumed that
the parameter change δx is very small (in fact, infinitesimal). At the
beginning of the steepest-descent method, δx will probably be quite large,
because the initial guess for the parameter x may be poor. However, if the
steepest-descent method is used iteratively, then as the minimum is ap-
proached, the parameter change will become smaller and smaller, so that
the approximation in (9.15) will be very accurate.
Very infrequently it happens that the error function evaluated at x_new = x
+ αS [where α is given by (9.2)] is larger than the original error function.
This can occur far from a minimum, where the assumption of a quadratic3
error function is not sufficiently valid. In this case, line 87 of the steepest-
descent program causes the point x1 to be used as the next guess for the
3As a function of α.
location of the minimum, since E(x1) was found in such a manner that it is
guaranteed to be less than the initial error E0.
The function that is coded in statement 900 of Fig. 9.3 is the Rosenbrock
function. The results of computer runs for this and the other test functions
are given in Figs. 9.4-9.7. These figures indicate the minimum that steepest
descent was able to attain (after ten iterations) for the various test functions.
N ? 2
X(I) I=1,2,...N
? -1.2 1
ERROR= 2.42000E+01
N ? 2
X(I) I=1,2,...N
? -1.2 1
ERROR= 7.49038E+02
N ? 4
X(I) I=1,2,...N
? 3 -1 0 1
ERROR= 7.07336E+05
N ? 3
X(I) I=1,2,...N
? 10000 10000 10000
ERROR= 2.59460E+04
EXAMPLE 9.6
SOLUTION
Figure 9.8 shows two different steepest-descent computer runs for this
error function. In Fig. 9.8(a) the initial parameters were chosen as x1 = 10,
x2 = 10, which yielded a minimum E(0.943, 3.975) = 2.533. In Fig. 9.8(b) the
initial parameters were instead chosen as x1 = 1, x2 = 1, which yielded
E(0.943, 0) = 0.943.
N ? 2
X(I) I=1,2,...N
? 10 10
ERROR= 5.73500E+03
N ? 2
X(I) I=1,2,...N
? 1 1
ERROR= 1.01000E+01
the bottom. However, it is possible that once we climb the other side of the
valley we discover the mountain descends for another mile. In this discus-
sion, the valley would be the local minimum, while the bottom of the
mountain would be the global minimum. In Example 9.6, E(0.943, 3.975) =
2.533 was a local minimum, while E(0.943, 0) = 0.943 was the global
minimum.
In an optimization problem one usually wants to find the global mini-
mum and does not want to be trapped at a local minimum. Failing to find
the global minimum is not just a difficulty associated with steepest
descent; any optimization technique can be plagued by this problem.
Sometimes one will know for physical reasons that the global minimum
must be near a certain point, which should then be picked as the initial
point. The computer will then improve on that point and hopefully attain
the global minimum. Other times (see Chapter 11) the parameters will be
constrained to remain within a certain region (perhaps for reasons of
availability of component values) and this will imply there is only one
allowable minimum.
However, often many different computer runs must be done before one
is satisfied with the minimum that is attained. If many different initial
parameters produce the same minimum, then one is fairly confident that it
is the lowest value that can be found. But in practical optimization
problems that have many parameters that can be varied, there may be
numerous local minimums.
A hypothetical case that has two local minimums in addition to the
global minimum is portrayed in Fig. 9.9. Choosing the initial parameter
values to be given by P, yields a minimum value of 5; starting at P2 yields
10; and starting at P3 yields the global minimum, which is zero.
In cases that have numerous local minimums, the initial parameters can
be systematically picked for good distribution over the parameter space. In
a practical problem one can never be positive that the global minimum has
been found; however, after a sufficiently low value has been located one
may decide his time can be spent more profitably on other tasks.
The rate of convergence of any search technique is highly dependent on
the given function E (x). It is for this reason that there is no optimal
procedure for an arbitrary error function. The efficiency of a particular
optimization technique is usually dependent on the scale used. To under-
stand why this is so, consider the following function:
X(I) I=1,2,...N
? 1 10
ERROR= 5.00000E+00
N ? 2
X(I) I=1,2,...N
? 1 1
ERROR= 5.00000E+00
N ? 2
X(I) I=1,2,...N
? 1 10
ERROR= 5.00000E+00
X(I) I=1,2,...N
? 1 1
ERROR= 5.00000E+00
As indicated in Fig. 9.11, the error functions are exactly the same at
corresponding iterations. Scaling does not affect the simplex technique,
because the initial simplex is determined by varying each parameter by the
same percentage. Also, since the algorithm uses only discrete points, which
one is discarded to form a new simplex is not affected by scaling.
G' = G + (Δx)(Δx)^T/d1 - (Gy)(Gy)^T/d2, (9.29)

where the scale factors d1 and d2 are given by

d1 = y^T Δx,  d2 = y^T G y. (9.30)
A program that implements these equations is given in Fig. 9.12. The
following comments should help in understanding that program.
The definition
z=Gy (9.31)
6This is also known as the DFP optimization technique after Davidon, Fletcher, and
Powell. Davidon originally conceived the algorithm, but Fletcher and Powell presented it in a
manner that made it famous.
7The notation GRAD(x) is used to indicate that the gradient is evaluated at point x.
8For a derivation of this equation, the interested reader is referred to Gottfried, B. S., and
J. Weisman (1973), Introduction to Optimization Theory (Englewood Cliffs, N.J.: Prentice-
Hall).
01C FLETCHER POWELL
02 PROGRAM FP(INPUT,OUTPUT)
03 03 FORMAT(7H X(I)=,1P,4E14.5)
04 04 FORMAT(7H ERROR=,1P,E14.5,/)
06 DIMENSION G(10,10),GRAD(10),GRAD1(10),S(10)
07 DIMENSION X(10),X1(10),X2(10),Y(10),Z(10),DELX(10)
09C
10C INITIAL VALUES
11 PRINT,*N*, $READ,N
14 PRINT,/,*X(I) I=1,2,...N*
15 READ,(X(I),I=1,N)
18 E0=ERROR(X)
19 PRINT 4,E0
20C
21C FIND GRADIENT
22 22 DO 28 I=1,N
23 DELTA=.000001*X(I)
24 IF(ABS(X(I)).LT.1E-12) DELTA=.000001
25 XSAVE=X(I)
26 X(I)=X(I)+DELTA
27 GRAD(I)=(ERROR(X)-E0)/DELTA
28 28 X(I)=XSAVE
29C
30C INITIALIZE G
31 K=0
33 DO 36 I=1,N
34 DO 35 J=1,N
35 35 G(I,J)=0.
36 36 G(I,I)=1.
39C
40C FIND DIRECTION S (EQ. 9.27)
41 41 DO 44 I=1,N
42 S(I)=0.
43 DO 44 J=1,N
44 44 S(I)=S(I)-G(I,J)*GRAD(J)
48C
49C QUADRATIC INTERPOLATION
50C
51C FIND ALPHA 1
52 A1=1.
53 53 DO 54 I=1,N
54 54 X1(I)=X(I)+A1*S(I)
55 E1=ERROR(X1)
57 IF(E1.LT.E0) GO TO 63
58 A1=.5*A1
59 GO TO 53
61C
62C FIND ALPHA 2
63 63 A2=A1
64 64 A2=2.*A2
65 DO 66 I=1,N
66 66 X2(I)=X(I)+A2*S(I)
67 E2=ERROR(X2)
68 IF(E2.GT.E1) GO TO 74
69 A1=A2 $E1=E2
70 GO TO 64
72C
73C FIND ALPHA
74 74 A=(A1*A1-A2*A2)*E0+A2*A2*E1-A1*A1*E2
76 A=.5*A/((A1-A2)*E0+A2*E1-A1*E2)
(a)
(b)
FIGURE 9.12. (Continued.)
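The update (9.29)-(9.30) itself is only a few lines; the following sketch (Python rather than the book's FORTRAN, and assuming G is kept symmetric) shows the arithmetic the program performs each iteration:

def dfp_update(G, dx, y):
    # G' = G + (dx dx^T)/d1 - (G y)(G y)^T/d2, with d1 = y^T dx, d2 = y^T G y
    n = len(dx)
    z = [sum(G[i][j]*y[j] for j in range(n)) for i in range(n)]   # z = G y, (9.31)
    d1 = sum(yi*dxi for yi, dxi in zip(y, dx))
    d2 = sum(yi*zi for yi, zi in zip(y, z))
    return [[G[i][j] + dx[i]*dx[j]/d1 - z[i]*z[j]/d2
             for j in range(n)] for i in range(n)]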
N ? 2
X(I) I=1,2,...N
? -1.2 1
ERROR= 2.42000E+01
The error did not change very much between the fourth iteration (E=3.20) and
the fifth (E=2.95), but reinitializing G caused the error to be reduced
dramatically (E = 0.05).
There are many other optimization techniques that are related to the
Fletcher-Powell algorithm. These algorithms are referred to as conjugate-
direction methods.9 They all base their search direction on the assumption
that sufficiently near a minimum the error function can be expressed as a
second-degree equation as in (9.24). However, none of the practical
methods directly evaluate the Hessian by evaluating second derivatives. In
fact, some of the methods do not even evaluate the gradient by evaluating
first derivatives. But of the conjugate-direction methods, the Fletcher-
Powell optimization technique is the most popular.
In conclusion, there are many different optimization techniques. Even in
a particular problem, the technique that works best may depend on the
initial set of parameters. For example, far from a minimum, steepest
descent may make more rapid improvements than Fletcher-Powell (be-
cause of a poor approximation to the inverse Hessian). In fact, far from a
minimum the simplex optimization technique may be better than either
steepest descent or Fletcher-Powell.
Organizations that must frequently solve optimization problems have
general-purpose optimization programs that contain many different opti-
mization algorithms. These programs can be written so they will automati-
cally change from one optimization technique to another if the original one
can not make sufficient reductions in the error function. Thus an error
function might be reduced by steepest descent at first, and then by
Fletcher-Powell as the minimum is approached.
The next chapter discusses another algorithm, the least-pth optimization
technique, that is extremely popular. It is often used separately or as part
of a general-purpose optimization program.
9S1, S2 are said to be conjugate with respect to the positive-definite matrix H if S1^T H S2 = 0.
PROBLEMS
Show that the minimum value of
E(x) = x1^2 - 2x1 + x2^2 - 8x2 + 17
is equal to zero.
9.7 The gradient of E(x) = x1^2 + 2 x1 x3 + x2 x3^2 was found analytically in
Problem 9.5 to be (4, 1, 4) at x = (1, 1, 1). In this problem it will be
found by numerical methods.
(a) Use (9.12) with ε = 0.01 to estimate the gradient at x = (1, 1, 1).
(b) Instead of (9.12), use a central-difference formula to estimate the
gradient at x = (1,1,1).
(c) What is the disadvantage of using the central-difference formula
to estimate the gradient?
9.8 What is the scalar product of the two vectors a = (1, 4, -2),
b = (1, -1, 1)?
9.9 What is