Multi-Variable Optimization Methods
\[
\frac{\partial^2 f}{\partial x_i \partial x_j}, \quad i \neq j; \qquad
\frac{\partial^2 f}{\partial x_i^2}, \quad i = j \tag{2}
\]
If the partial derivatives ∂f/∂x_i, ∂f/∂x_j and ∂²f/∂x_i∂x_j are continuous, then ∂²f/∂x_j∂x_i exists and ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i. These second order partial derivatives
can be represented by a square symmetric matrix called the Hessian matrix,
\[
\nabla^2 f(x) \equiv H(x) \equiv
\begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{bmatrix} \tag{3}
\]
If f is a quadratic function, the Hessian of f is constant and f can be expressed as
\[
f(x) = \tfrac{1}{2}\, x^T H x + g^T x + \alpha, \tag{4}
\]
where α is a scalar constant.
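As a quick numerical illustration of this quadratic form (the particular H, g and α below are arbitrary choices, not from the text), its gradient is Hx + g and can be checked against a central finite difference:

```python
import numpy as np

# Quadratic f(x) = 1/2 x^T H x + g^T x + alpha; H, g, alpha are arbitrary.
H = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric, constant Hessian
g = np.array([1.0, -2.0])
alpha = 0.5

f = lambda x: 0.5 * x @ H @ x + g @ x + alpha
grad = lambda x: H @ x + g               # analytic gradient of the quadratic

# Central finite-difference check of the gradient at a test point.
x0 = np.array([0.3, -0.7])
h = 1e-6
fd = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2 * h) for e in np.eye(2)])
print(np.allclose(fd, grad(x0), atol=1e-6))   # True
```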
As in the single-variable case, the optimality conditions can be derived from the Taylor-series expansion of f about x^*,
\[
f(x^* + \varepsilon p) = f(x^*) + \varepsilon p^T g(x^*) + \tfrac{1}{2} \varepsilon^2 p^T H(x^* + \varepsilon \theta p)\, p, \tag{5}
\]
where 0 ≤ θ ≤ 1, ε is a scalar, and p is an n-vector.
For x
to be a local minimum, then for any vector p, there must be a nite such that
f(x
+ p) f(x
+ p) f(x
) must be zero.
Now we have to consider the second-order term, (ε²/2) p^T H(x^* + εθp) p. For x^* to be a local minimum, this term must be non-negative for any p, which requires p^T H(x^*) p ≥ 0 for all p, i.e., H(x^*) must be positive semidefinite.

A general gradient-based method generates a sequence of iterates x_k using the following steps.
1. Test for convergence. If the conditions for convergence are satisfied, then we can stop and x_k is the solution. Else, go to step 2.

2. Compute a search direction. Compute the vector p_k that defines the direction in n-space along which we will search.

3. Compute the step length. Find a positive scalar, α_k, such that f(x_k + α_k p_k) < f(x_k).

4. Update the design variables. Set x_{k+1} = x_k + α_k p_k, k = k + 1 and go back to 1.
\[
x_{k+1} = x_k + \underbrace{\alpha_k p_k}_{\Delta x_k} \tag{8}
\]
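The four steps above can be sketched as a generic loop. The names (`descent`, `direction`, `step_length`) are illustrative placeholders, and the fixed step length in the usage example is chosen purely so the demo quadratic converges; a real method computes α_k at every iteration:

```python
import numpy as np

def descent(f, grad, x0, direction, step_length, tol=1e-6, max_iter=500):
    """Generic gradient-based loop: converge test, direction, step, update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:     # step 1: test for convergence
            break
        p = direction(x, g)              # step 2: search direction p_k
        a = step_length(f, x, p)         # step 3: step length alpha_k
        x = x + a * p                    # step 4: update x_{k+1}
    return x

# Usage: steepest descent direction and a fixed step on f(x) = x^T x,
# whose minimum is at the origin.
xmin = descent(lambda x: x @ x, lambda x: 2 * x, [3.0, -4.0],
               direction=lambda x, g: -g,
               step_length=lambda f, x, p: 0.5)
```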
3.3 Steepest Descent Method
The steepest descent method uses the negative of the gradient vector at each point as the
search direction for each iteration. As mentioned previously, the gradient vector is orthogonal to
the plane tangent to the isosurfaces of the function.
The gradient vector at a point, g(x_k), is also the direction of maximum rate of change (maximum increase) of the function at that point. This rate of change is given by the norm, ‖g(x_k)‖.
Steepest descent algorithm:

1. Select starting point x_0, and convergence parameters ε_g, ε_a and ε_r.

2. Compute g(x_k) ≡ ∇f(x_k). If ‖g(x_k)‖ ≤ ε_g then stop. Otherwise, compute the normalized search direction p_k = −g(x_k)/‖g(x_k)‖.

3. Find the positive step length α_k such that f(x_k + α_k p_k) is minimized.

4. Update the current point, x_{k+1} = x_k + α_k p_k.

5. Evaluate f(x_{k+1}). If the condition |f(x_{k+1}) − f(x_k)| ≤ ε_a + ε_r |f(x_k)| is satisfied for two successive iterations then stop. Otherwise, set k = k + 1 and return to step 2.
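A minimal sketch of this algorithm in Python. The Armijo backtracking helper stands in for the exact one-dimensional minimization of step 3, and the test quadratic is an arbitrary choice:

```python
import numpy as np

def backtrack(f, x, p, g, alpha=1.0, rho=0.5, c=1e-4):
    """Armijo backtracking: shrink alpha until sufficient decrease holds.
    (A simple stand-in for the exact 1-D minimization of step 3.)"""
    while f(x + alpha * p) > f(x) + c * alpha * (g @ p) and alpha > 1e-12:
        alpha *= rho
    return alpha

def steepest_descent(f, grad, x0, eps_g=1e-6, eps_a=1e-12, eps_r=1e-12,
                     max_iter=5000):
    x = np.asarray(x0, dtype=float)
    f_old, hits = f(x), 0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:          # gradient convergence test
            break
        p = -g / np.linalg.norm(g)              # normalized search direction
        x = x + backtrack(f, x, p, g) * p
        f_new = f(x)
        # the function-change test must hold on two successive iterations
        hits = hits + 1 if abs(f_new - f_old) <= eps_a + eps_r * abs(f_old) else 0
        if hits >= 2:
            break
        f_old = f_new
    return x

# Mildly ill-conditioned quadratic with minimum at (1, -2).
x_sd = steepest_descent(lambda x: (x[0] - 1)**2 + 4 * (x[1] + 2)**2,
                        lambda x: np.array([2 * (x[0] - 1), 8 * (x[1] + 2)]),
                        [0.0, 0.0])
```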
Note that the steepest descent direction at each iteration is orthogonal to the previous one, i.e., g(x_{k+1})^T g(x_k) = 0. Therefore the method zigzags in the design space and is rather inefficient.

The algorithm is guaranteed to converge, but it may take an infinite number of iterations. The rate of convergence is linear.

Usually, a substantial decrease is observed in the first few iterations, but the method is very slow after that.
3.4 Conjugate Gradient Method
A small modification to the steepest descent method takes into account the history of the gradients to move more directly towards the optimum.
1. Select starting point x_0, and convergence parameters ε_g, ε_a and ε_r.

2. Compute g(x_0) ≡ ∇f(x_0). If ‖g(x_0)‖ ≤ ε_g then stop. Otherwise, set the first direction to the steepest descent direction, p_0 = −g(x_0), and go to step 5.

3. Compute g(x_k) ≡ ∇f(x_k). If ‖g(x_k)‖ ≤ ε_g then stop. Otherwise continue.

4. Compute the new conjugate gradient direction p_k = −g(x_k) + β_k p_{k−1}, where
\[
\beta_k = \left( \frac{\|g_k\|}{\|g_{k-1}\|} \right)^2 = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}}.
\]

5. Find the positive step length α_k such that f(x_k + α_k p_k) is minimized.

6. Update the current point, x_{k+1} = x_k + α_k p_k.

7. Evaluate f(x_{k+1}). If the condition |f(x_{k+1}) − f(x_k)| ≤ ε_a + ε_r |f(x_k)| is satisfied for two successive iterations then stop. Otherwise, set k = k + 1 and return to step 3.
Usually, a restart is performed every n iterations for computational stability, i.e. we start with a
steepest descent direction.
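A sketch of the algorithm (Fletcher-Reeves form of β) with the restart every n iterations. For simplicity the line search here minimizes a quadratic model along p_k, with the curvature p^T H p estimated from a finite difference of the gradient; this is exact for quadratic functions but only approximate in general:

```python
import numpy as np

def conjugate_gradient(grad, x0, eps_g=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                                   # start with steepest descent
    n = x.size
    for k in range(max_iter):
        if np.linalg.norm(g) <= eps_g:
            break
        h = 1e-7                             # finite-difference step
        pHp = ((grad(x + h * p) - g) @ p) / h    # curvature along p
        alpha = -(g @ p) / pHp               # minimizes the quadratic model
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves beta
        # restart with steepest descent every n iterations for stability
        p = -g_new if (k + 1) % n == 0 else -g_new + beta * p
        g = g_new
    return x

# Quadratic test: the minimizer of 1/2 x^T Hq x + b^T x solves Hq x = -b.
Hq = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
x_cg = conjugate_gradient(lambda x: Hq @ x + b, [0.0, 0.0])
```

On this 2-D quadratic the method converges in two iterations, as expected of conjugate directions with an exact line search.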
3.5 Newton's Method
The steepest descent and conjugate gradient methods only use first order information, i.e. the first derivative term in the Taylor series, to obtain a local model of the function.
Newton's method uses a second-order Taylor series expansion of the function about the current design point, i.e. a quadratic model
\[
f(x_k + s_k) \approx f_k + g_k^T s_k + \tfrac{1}{2} s_k^T H_k s_k, \tag{9}
\]
where s_k is the step to the minimum. Differentiating this with respect to s_k and setting it to zero, we obtain the step that minimizes this quadratic,
\[
H_k s_k = -g_k. \tag{10}
\]
This is a linear system which yields the Newton step, s_k, as its solution.
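A small numerical check of equation (10) on an arbitrary quadratic: solving the linear system (rather than forming H⁻¹ explicitly) gives a step that reaches the minimum in one iteration:

```python
import numpy as np

# Arbitrary quadratic: gradient H x + g0, constant Hessian H.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
g0 = np.array([1.0, -2.0])
grad = lambda x: H @ x + g0

x = np.array([5.0, 5.0])                 # arbitrary starting point
s = np.linalg.solve(H, -grad(x))         # Newton step: solve H s = -g
x_new = x + s
print(np.linalg.norm(grad(x_new)))       # ~0: one step reaches the minimum
```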
If H_k is positive definite, only one iteration is required for a quadratic function, from any starting point. For a general nonlinear function, Newton's method converges quadratically if x_0 is sufficiently close to x^*.
As in the single variable case, difficulties and even failure may occur when the quadratic model is a poor approximation of f far from the current point. If H_k is not positive definite, the quadratic model might not have a minimum or even a stationary point. For some nonlinear functions, the Newton step might be such that f(x_k + p_k) > f(x_k) and the method is not guaranteed to converge.

Another disadvantage of Newton's method is the need to compute not only the gradient, but also the Hessian, i.e. n(n + 1)/2 second order derivatives.
3.5.1 Modified Newton's Method
A small modification to Newton's method is to perform a line search along the Newton direction, rather than accepting the step size that would minimize the quadratic model.
1. Select starting point x_0, and convergence parameter ε_g.

2. Compute g(x_k) ≡ ∇f(x_k). If ‖g(x_k)‖ ≤ ε_g then stop. Otherwise, continue.

3. Compute H(x_k) ≡ ∇²f(x_k) and the search direction, p_k = −H_k^{−1} g_k.

4. Find the positive step length α_k such that f(x_k + α_k p_k) is minimized (start with α_k = 1).

5. Update the current point, x_{k+1} = x_k + α_k p_k, and return to step 2.
Although this modification increases the probability that f(x_k + α_k p_k) < f(x_k), it is still vulnerable to the problem of having a Hessian that is not positive definite, and it has all the other disadvantages of the pure Newton's method.
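A sketch combining the Newton direction with a simple backtracking line search (a stand-in for the exact minimization of step 4), tried on the Rosenbrock function used later in these notes. The steepest-descent fallback for a non-descent direction is an added safeguard, not part of the algorithm above:

```python
import numpy as np

def modified_newton(f, grad, hess, x0, eps_g=1e-8, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            break
        p = np.linalg.solve(hess(x), -g)       # Newton direction
        if g @ p >= 0:                         # safeguard: H not positive
            p = -g                             # definite, fall back to -g
        alpha = 1.0                            # start the line search at 1
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p) and alpha > 1e-12:
            alpha *= 0.5                       # Armijo backtracking
        x = x + alpha * p
    return x

# Rosenbrock function; its minimum is at (1, 1).
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
hess = lambda x: np.array([[2 - 400 * x[1] + 1200 * x[0]**2, -400 * x[0]],
                           [-400 * x[0], 200.0]])
x_mn = modified_newton(f, grad, hess, [-1.2, 1.0])
```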
3.6 Quasi-Newton Methods
This class of methods uses first order information only, but builds second order information (an approximate Hessian) based on the sequence of function values and gradients from previous iterations. The method also forces H to be symmetric and positive definite, greatly improving its convergence properties.

When using quasi-Newton methods, we usually start with the Hessian initialized to the identity matrix and then update it at each iteration. Since what we actually need is the inverse of the Hessian, we work with V_k ≡ H_k^{−1}. The update at each iteration is written as ΔV_k and is added to the current one,
\[
V_{k+1} = V_k + \Delta V_k. \tag{11}
\]
Let s_k be the step taken from x_k, and consider the Taylor-series expansion of the gradient function about x_k,
\[
g(x_k + s_k) = g_k + H_k s_k + \cdots \tag{12}
\]
Truncating this series and setting the variation of the gradient to y_k = g(x_k + s_k) − g_k yields
\[
H_k s_k = y_k. \tag{13}
\]
Then, the new approximate inverse of the Hessian, V_{k+1}, must satisfy the quasi-Newton condition,
\[
V_{k+1} y_k = s_k. \tag{14}
\]
3.6.1 Davidon-Fletcher-Powell (DFP) Method
One of the first quasi-Newton methods was devised by Davidon (in 1959) and modified by Fletcher and Powell (1963).
1. Select starting point x_0, and convergence parameter ε_g. Set k = 0 and V_0 = I (the identity, i.e. H_0 = I).

2. Compute g(x_k) ≡ ∇f(x_k). If ‖g(x_k)‖ ≤ ε_g then stop. Otherwise, continue.

3. Compute the search direction, p_k = −V_k g_k.

4. Find the positive step length α_k such that f(x_k + α_k p_k) is minimized (start with α_k = 1).

5. Update the current point, x_{k+1} = x_k + α_k p_k, set s_k = α_k p_k, and compute the change in the gradient, y_k = g_{k+1} − g_k.

6. Update V_{k+1} by computing
\[
A_k = \frac{V_k y_k y_k^T V_k}{y_k^T V_k y_k}, \qquad
B_k = \frac{s_k s_k^T}{s_k^T y_k}, \qquad
V_{k+1} = V_k - A_k + B_k.
\]

7. Set k = k + 1 and return to step 2.
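The DFP update can be checked numerically: by construction V_{k+1} y_k = V_k y_k − V_k y_k + s_k = s_k, so the quasi-Newton condition (14) holds for any data with s_k^T y_k > 0. The vectors below are random test data, not from an actual optimization run:

```python
import numpy as np

rng = np.random.default_rng(0)

V = np.eye(3)                            # current approximate inverse Hessian
s = rng.standard_normal(3)               # step s_k (random test data)
y = rng.standard_normal(3)               # gradient change y_k
if s @ y < 0:
    y = -y                               # enforce the curvature condition

A = (V @ np.outer(y, y) @ V) / (y @ V @ y)
B = np.outer(s, s) / (s @ y)
V_new = V - A + B                        # DFP update

print(np.allclose(V_new @ y, s))         # True: condition (14) is satisfied
```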
3.6.2 Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method
The BFGS method is also a quasi-Newton method, with a different updating formula,
\[
V_{k+1} = \left[ I - \frac{s_k y_k^T}{s_k^T y_k} \right] V_k \left[ I - \frac{y_k s_k^T}{s_k^T y_k} \right] + \frac{s_k s_k^T}{s_k^T y_k}. \tag{15}
\]
The relative performance between the DFP and BFGS methods is problem dependent.
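The same check applies to the BFGS update (15): the second bracketed factor annihilates y_k, leaving V_{k+1} y_k = s_k. Again the data below are arbitrary test vectors:

```python
import numpy as np

rng = np.random.default_rng(1)

V = np.eye(3)
s = rng.standard_normal(3)
y = rng.standard_normal(3)
if s @ y < 0:
    y = -y                               # enforce s^T y > 0

rho = 1.0 / (s @ y)
E = np.eye(3) - rho * np.outer(s, y)     # first bracketed factor in eq. (15)
V_new = E @ V @ E.T + rho * np.outer(s, s)

print(np.allclose(V_new @ y, s))         # True: V_{k+1} y_k = s_k
```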
3.7 Trust Region Methods
Trust region, or restricted-step, methods are a different approach to resolving the weaknesses of the pure form of Newton's method, arising from a Hessian that is not positive definite or a highly nonlinear function.

One way to interpret these problems is to say that they arise from stepping outside the region for which the quadratic approximation is reasonable. Thus we can overcome these difficulties by minimizing the quadratic function within a region around x_k within which we trust the quadratic model.
1. Select starting point x_0, a convergence parameter ε_g and the initial size of the trust region, h_0.

2. Compute g(x_k) ≡ ∇f(x_k). If ‖g(x_k)‖ ≤ ε_g then stop. Otherwise, continue.

3. Compute H(x_k) ≡ ∇²f(x_k) and solve the quadratic subproblem
\[
\begin{aligned}
\text{minimize} \quad & q(s_k) = f(x_k) + g(x_k)^T s_k + \tfrac{1}{2} s_k^T H(x_k)\, s_k &\quad& (16)\\
\text{w.r.t.} \quad & s_k && (17)\\
\text{s.t.} \quad & -h_k \le s_{k_i} \le h_k, \quad i = 1, \ldots, n && (18)
\end{aligned}
\]
4. Evaluate f(x_k + s_k) and compute the ratio that measures the accuracy of the quadratic model,
\[
r_k = \frac{\Delta f}{\Delta q} = \frac{f(x_k) - f(x_k + s_k)}{f(x_k) - q(s_k)}.
\]
5. Compute the size of the new trust region as follows:
\[
\begin{aligned}
h_{k+1} &= \|s_k\|/4 &\quad& \text{if } r_k < 0.25, &\quad& (19)\\
h_{k+1} &= 2h_k && \text{if } r_k > 0.75 \text{ and } h_k = \|s_k\|, && (20)\\
h_{k+1} &= h_k && \text{otherwise.} && (21)
\end{aligned}
\]
6. Determine the new point:
\[
\begin{aligned}
x_{k+1} &= x_k &\quad& \text{if } r_k \le 0, &\quad& (22)\\
x_{k+1} &= x_k + s_k && \text{otherwise.} && (23)
\end{aligned}
\]
7. Set k = k + 1 and return to 2.
The initial value of h is usually 1. The same stopping criteria used in other gradient-based
methods are applicable.
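A sketch of the trust-region loop. Solving the box-constrained subproblem (16)-(18) exactly requires a constrained QP solver; here it is only approximated by clipping the unconstrained Newton step to the box, a simplification for illustration, and the convex test function is likewise an arbitrary choice:

```python
import numpy as np

def trust_region(f, grad, hess, x0, h0=1.0, eps_g=1e-8, max_iter=100):
    x, h = np.asarray(x0, dtype=float), h0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            break
        Hk = hess(x)
        # crude subproblem solve: clip the Newton step to the box [-h, h]
        s = np.clip(np.linalg.solve(Hk, -g), -h, h)
        dq = -(g @ s + 0.5 * s @ Hk @ s)         # predicted decrease
        r = (f(x) - f(x + s)) / dq               # accuracy ratio r_k
        ns = np.linalg.norm(s, np.inf)
        if r < 0.25:
            h = ns / 4                           # poor model: shrink region
        elif r > 0.75 and np.isclose(ns, h):
            h = 2 * h                            # good model at bound: expand
        if r > 0:
            x = x + s                            # accept the step
    return x

# Convex test function with minimum where exp(x1) + 2 x1 = 0 and x2 = 0.
f = lambda x: np.exp(x[0]) + x[0]**2 + x[1]**2
grad = lambda x: np.array([np.exp(x[0]) + 2 * x[0], 2 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]) + 2.0, 0.0], [0.0, 2.0]])
x_tr = trust_region(f, grad, hess, [2.0, 2.0])
```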
[1, 2, 3]
3.8 Convergence Characteristics for a Quadratic Function
3.9 Convergence Characteristics for the Rosenbrock Function
3.10 Convergence Characteristics for the Rosenbrock Function
References
[1] A. D. Belegundu and T. R. Chandrupatla. Optimization Concepts and Applications in
Engineering, chapter 3. Prentice Hall, 1999.
[2] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization, chapter 4. Academic
Press, 1981.
[3] C. Onwubiko. Introduction to Engineering Design Optimization, chapter 4. Prentice Hall,
2000.