3 Gradient-Based Optimization

3.1 Optimality Conditions


Consider a function $f(x)$ where $x$ is a vector, $x^T = [x_1, x_2, \ldots, x_n]$.
The gradient vector of this function is given by the partial derivatives with respect to each of
the independent variables,
$$\nabla f(x) \equiv g(x) \equiv \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix} \qquad (1)$$
In the multivariate case, the gradient vector is perpendicular to the hyperplane tangent to the contour surfaces of constant $f$.
Higher derivatives of multi-variable functions are defined as in the single-variable case, but note that the number of gradient components increases by a factor of $n$ for each differentiation.
While the gradient of a function of $n$ variables is an $n$-vector, the second derivative of an $n$-variable function is defined by $n^2$ partial derivatives of the $n$ first partial derivatives with respect to the $n$ variables:

$$\frac{\partial^2 f}{\partial x_i \, \partial x_j}, \; i \neq j; \qquad \frac{\partial^2 f}{\partial x_i^2}, \; i = j. \qquad (2)$$
If the partial derivatives $\partial f/\partial x_i$, $\partial f/\partial x_j$ and $\partial^2 f/\partial x_i \,\partial x_j$ are continuous, then $\partial^2 f/\partial x_i \,\partial x_j$ exists and $\partial^2 f/\partial x_i \,\partial x_j = \partial^2 f/\partial x_j \,\partial x_i$. These second order partial derivatives can be represented by a square symmetric matrix called the Hessian matrix,

$$\nabla^2 f(x) \equiv H(x) \equiv \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \, \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \, \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \qquad (3)$$
If f is a quadratic function, the Hessian of f is constant and f can be expressed as
$$f(x) = \frac{1}{2} x^T H x + g^T x + \alpha, \qquad (4)$$
where $\alpha$ is a scalar constant.
As in the single-variable case, the optimality conditions can be derived from the Taylor-series expansion of $f$ about $x^*$,
$$f(x^* + \varepsilon p) = f(x^*) + \varepsilon p^T g(x^*) + \frac{1}{2} \varepsilon^2 p^T H(x^* + \varepsilon \theta p)\, p, \qquad (5)$$
where $0 \le \theta \le 1$, $\varepsilon$ is a scalar, and $p$ is an $n$-vector.
For $x^*$ to be a local minimum, then for any vector $p$ there must be a finite $\varepsilon$ such that $f(x^* + \varepsilon p) \ge f(x^*)$, i.e. there is a neighborhood in which this condition holds. If this condition is satisfied, then $f(x^* + \varepsilon p) - f(x^*) \ge 0$, and the first and second order terms in the Taylor-series expansion must be greater than or equal to zero.
As in the single variable case, and for the same reason, we start by considering the first order terms. Since $p$ is an arbitrary vector and $\varepsilon$ can be either positive or negative, every component of the gradient vector $g(x^*)$ must be zero.
Now we have to consider the second order term, $\frac{1}{2}\varepsilon^2 p^T H(x^* + \varepsilon \theta p)\, p$. For this term to be non-negative, $H(x^* + \varepsilon \theta p)$ has to be positive semi-definite, and by continuity, the Hessian at the optimum, $H(x^*)$, must also be positive semi-definite.
Necessary conditions:
$$g(x^*) = 0; \quad H(x^*) \text{ is positive semi-definite.} \qquad (6)$$
Sufficient conditions:
$$g(x^*) = 0; \quad H(x^*) \text{ is positive definite.} \qquad (7)$$
3.2 General Algorithm for Smooth Functions
All algorithms for unconstrained gradient-based optimization can be described as follows. We
start with $k = 0$ and an estimate of $x^*$, $x_k$.
1. Test for convergence. If the conditions for convergence are satisfied, then we can stop and $x_k$ is the solution. Else, go to step 2.
2. Compute a search direction. Compute the vector $p_k$ that defines the direction in $n$-space along which we will search.
3. Compute the step length. Find a positive scalar, $\alpha_k$, such that $f(x_k + \alpha_k p_k) < f(x_k)$.
4. Update the design variables. Set $x_{k+1} = x_k + \alpha_k p_k$, $k = k + 1$ and go back to step 1.
$$x_{k+1} = x_k + \underbrace{\alpha_k p_k}_{\Delta x_k} \qquad (8)$$
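As a rough illustration, this general algorithm maps onto a short loop in code. The sketch below is not a specific method: direction_fn and step_fn are hypothetical placeholders that a particular algorithm (steepest descent, Newton, quasi-Newton) would supply.

```python
import numpy as np

def gradient_based_minimize(f, grad, x0, direction_fn, step_fn,
                            tol=1e-6, max_iter=200):
    """Generic loop of Section 3.2; direction_fn and step_fn are supplied
    by the specific method (steepest descent, Newton, quasi-Newton, ...)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:      # 1. convergence test
            break
        p = direction_fn(x, g)            # 2. search direction p_k
        alpha = step_fn(f, x, p)          # 3. step length alpha_k
        x = x + alpha * p                 # 4. update design variables
    return x
```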
3.3 Steepest Descent Method
The steepest descent method uses the negative of the gradient vector at each point as the
search direction for each iteration. As mentioned previously, the gradient vector is orthogonal to
the plane tangent to the isosurfaces of the function.
The gradient vector at a point, $g(x_k)$, is also the direction of maximum rate of change (maximum increase) of the function at that point. This rate of change is given by the norm, $\|g(x_k)\|$.
Steepest descent algorithm:
1. Select starting point $x_0$, and convergence parameters $\varepsilon_g$, $\varepsilon_a$ and $\varepsilon_r$.
2. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$ then stop. Otherwise, compute the normalized search direction $p_k = -g(x_k)/\|g(x_k)\|$.
3. Find the positive step length $\alpha_k$ such that $f(x_k + \alpha_k p_k)$ is minimized.
4. Update the current point, $x_{k+1} = x_k + \alpha_k p_k$.
5. Evaluate $f(x_{k+1})$. If the condition $|f(x_{k+1}) - f(x_k)| \le \varepsilon_a + \varepsilon_r |f(x_k)|$ is satisfied for two successive iterations then stop. Otherwise, set $k = k + 1$ and return to step 2.
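A minimal sketch of this algorithm follows. The exact line search of step 3 is delegated to scipy.optimize.minimize_scalar for brevity, the quadratic test function is an arbitrary example, and the two-iteration stopping test of step 5 is omitted; a careful implementation would also enforce $\alpha_k > 0$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, x0, eps_g=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:                       # step 2: gradient test
            break
        p = -g / np.linalg.norm(g)                           # normalized descent direction
        alpha = minimize_scalar(lambda a: f(x + a * p)).x    # step 3: 1-D line search
        x = x + alpha * p                                    # step 4: update
    return x

# Example on a simple quadratic: converges to [0, 0].
f = lambda x: x[0]**2 + 5 * x[1]**2
grad = lambda x: np.array([2 * x[0], 10 * x[1]])
print(steepest_descent(f, grad, [3.0, 1.0]))
```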
Note that the steepest descent direction at each iteration is orthogonal to the previous one, i.e., $g^T(x_{k+1})\, g(x_k) = 0$. Therefore the method zigzags in the design space and is rather inefficient.
The algorithm is guaranteed to converge, but it may take an infinite number of iterations. The rate of convergence is linear.
Usually, a substantial decrease is observed in the first few iterations, but the method is very slow after that.
3.4 Conjugate Gradient Method
A small modication to the steepest descent method takes into account the history of the
gradients to move more directly towards the optimum.
1. Select starting point $x_0$, and convergence parameters $\varepsilon_g$, $\varepsilon_a$ and $\varepsilon_r$.
2. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$ then stop. Otherwise, set the first search direction to the steepest descent direction, $p_k = -g(x_k)$, and go to step 5.
3. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$ then stop. Otherwise continue.
4. Compute the new conjugate gradient direction $p_k = -g(x_k) + \beta_k p_{k-1}$, where
$$\beta_k = \left( \frac{\|g_k\|}{\|g_{k-1}\|} \right)^2 = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}}.$$
5. Find the positive step length $\alpha_k$ such that $f(x_k + \alpha_k p_k)$ is minimized.
6. Update the current point, $x_{k+1} = x_k + \alpha_k p_k$.
7. Evaluate $f(x_{k+1})$. If the condition $|f(x_{k+1}) - f(x_k)| \le \varepsilon_a + \varepsilon_r |f(x_k)|$ is satisfied for two successive iterations then stop. Otherwise, set $k = k + 1$ and return to step 3.
Usually, a restart is performed every n iterations for computational stability, i.e. we start with a
steepest descent direction.
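A minimal sketch of this algorithm with the Fletcher-Reeves $\beta$ and the periodic restart mentioned above. The line search is again delegated to scipy.optimize.minimize_scalar, and the absolute/relative stopping test is omitted for brevity.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(f, grad, x0, eps_g=1e-6, max_iter=500):
    """Nonlinear conjugate gradient with a steepest descent restart every n iterations."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    g = grad(x)
    p = -g                                                   # first direction: steepest descent
    for k in range(max_iter):
        if np.linalg.norm(g) <= eps_g:
            break
        alpha = minimize_scalar(lambda a: f(x + a * p)).x    # line search along p
        x = x + alpha * p
        g_new = grad(x)
        if (k + 1) % n == 0:                                 # periodic restart for stability
            p = -g_new
        else:
            beta = (g_new @ g_new) / (g @ g)                 # Fletcher-Reeves formula
            p = -g_new + beta * p
        g = g_new
    return x
```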
3.5 Newton's Method
The steepest descent and conjugate gradient methods only use first order information, i.e. the first derivative term in the Taylor series, to obtain a local model of the function.
Newton's method uses a second-order Taylor series expansion of the function about the current design point, i.e. a quadratic model,
$$f(x_k + s_k) \approx f_k + g_k^T s_k + \frac{1}{2} s_k^T H_k s_k, \qquad (9)$$
where $s_k$ is the step to the minimum. Differentiating this with respect to $s_k$ and setting it to zero, we can obtain the step that minimizes this quadratic,
$$H_k s_k = -g_k. \qquad (10)$$
This is a linear system which yields a Newton step, $s_k$, as a solution.
If $H_k$ is positive definite, only one iteration is required for a quadratic function, from any starting point. For a general nonlinear function, Newton's method converges quadratically if $x_0$ is sufficiently close to $x^*$ and the Hessian is positive definite at $x^*$.
As in the single variable case, difficulties and even failure may occur when the quadratic model is a poor approximation of $f$ far from the current point. If $H_k$ is not positive definite, the quadratic model might not have a minimum or even a stationary point. For some nonlinear functions, the Newton step might be such that $f(x_k + p_k) > f(x_k)$ and the method is not guaranteed to converge.
Another disadvantage of Newton's method is the need to compute not only the gradient, but also the Hessian, i.e. $n(n + 1)/2$ second order derivatives.
3.5.1 Modified Newton's Method
A small modification to Newton's method is to perform a line search along the Newton direction, rather than accepting the step size that would minimize the quadratic model.
1. Select starting point $x_0$, and convergence parameter $\varepsilon_g$.
2. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$ then stop. Otherwise, continue.
3. Compute $H(x_k) \equiv \nabla^2 f(x_k)$ and the search direction, $p_k = -H^{-1} g_k$.
4. Find the positive step length $\alpha_k$ such that $f(x_k + \alpha_k p_k)$ is minimized. (Start with $\alpha_k = 1$.)
5. Update the current point, $x_{k+1} = x_k + \alpha_k p_k$, and return to step 2.
Although this modification increases the probability that $f(x_k + \alpha_k p_k) < f(x_k)$, it is still vulnerable to the problem of a Hessian that is not positive definite, and it has all the other disadvantages of the pure Newton's method.
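A minimal sketch of the modified Newton iteration, with the line search along the Newton direction again delegated to scipy.optimize.minimize_scalar; the gradient and Hessian callables are assumed to be supplied by the user.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def modified_newton(f, grad, hess, x0, eps_g=1e-6, max_iter=100):
    """Newton direction plus a line search (Section 3.5.1)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            break
        p = np.linalg.solve(hess(x), -g)                     # Newton direction
        alpha = minimize_scalar(lambda a: f(x + a * p)).x    # line search; alpha = 1 recovers pure Newton
        x = x + alpha * p
    return x
```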
3.6 Quasi-Newton Methods
This class of methods uses first order information only, but builds second order information, an approximate Hessian, based on the sequence of function values and gradients from previous iterations. The method also forces $H$ to be symmetric and positive definite, greatly improving its convergence properties.
When using quasi-Newton methods, we usually start with the Hessian initialized to the identity matrix and then update it at each iteration. Since what we actually need is the inverse of the Hessian, we will work with $V_k \equiv H_k^{-1}$. The update at each iteration is written as $\Delta V_k$ and is added to the current one,
$$V_{k+1} = V_k + \Delta V_k. \qquad (11)$$
Let $s_k$ be the step taken from $x_k$, and consider the Taylor-series expansion of the gradient function about $x_k$,
$$g(x_k + s_k) = g_k + H_k s_k + \cdots \qquad (12)$$
Truncating this series and setting the variation of the gradient to $y_k = g(x_k + s_k) - g_k$ yields
$$H_k s_k = y_k. \qquad (13)$$
Then, the new approximate inverse of the Hessian, $V_{k+1}$, must satisfy the quasi-Newton condition,
$$V_{k+1} y_k = s_k. \qquad (14)$$
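For a function with constant Hessian, relation (13), and hence the quasi-Newton condition (14), holds exactly. The short check below illustrates this; the matrix, vector, and step are arbitrary numerical examples chosen for illustration.

```python
import numpy as np

# Quadratic test function with constant Hessian H: g(x) = H x + b.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
grad = lambda x: H @ x + b

x_k = np.array([0.5, 0.5])
s_k = np.array([0.2, -0.1])                        # an arbitrary step
y_k = grad(x_k + s_k) - grad(x_k)                  # change in the gradient

print(np.allclose(H @ s_k, y_k))                   # equation (13): True for a quadratic
print(np.allclose(np.linalg.inv(H) @ y_k, s_k))    # quasi-Newton condition (14)
```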
3.6.1 Davidon-Fletcher-Powell (DFP) Method
One of the first quasi-Newton methods was devised by Davidon (in 1959) and modified by Fletcher and Powell (1963).
1. Select starting point $x_0$, and convergence parameter $\varepsilon_g$. Set $k = 0$ and $V_0 = H_0^{-1} = I$.
2. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$ then stop. Otherwise, continue.
3. Compute the search direction, $p_k = -V_k g_k$.
4. Find the positive step length $\alpha_k$ such that $f(x_k + \alpha_k p_k)$ is minimized. (Start with $\alpha_k = 1$.)
5. Update the current point, $x_{k+1} = x_k + \alpha_k p_k$, set $s_k = \alpha_k p_k$, and compute the change in the gradient, $y_k = g_{k+1} - g_k$.
6. Update $V_{k+1}$ by computing
$$A_k = \frac{V_k y_k y_k^T V_k}{y_k^T V_k y_k}, \qquad B_k = \frac{s_k s_k^T}{s_k^T y_k},$$
$$V_{k+1} = V_k - A_k + B_k.$$
7. Set $k = k + 1$ and return to step 2.
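A minimal sketch of the DFP method. The rank-two update of step 6 is written directly in terms of outer products, and the line search is again a scipy.optimize.minimize_scalar call standing in for a proper step-length procedure.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dfp(f, grad, x0, eps_g=1e-6, max_iter=200):
    """DFP quasi-Newton method, keeping V as the approximate inverse Hessian."""
    x = np.asarray(x0, dtype=float)
    V = np.eye(x.size)                                       # V_0 = I
    g = grad(x)
    for k in range(max_iter):
        if np.linalg.norm(g) <= eps_g:
            break
        p = -V @ g                                           # step 3: search direction
        alpha = minimize_scalar(lambda a: f(x + a * p)).x    # step 4: line search
        s = alpha * p
        x = x + s
        g_new = grad(x)
        y = g_new - g
        Vy = V @ y
        # Step 6: V <- V - A_k + B_k (assumes s @ y > 0, as with an exact line search)
        V = V - np.outer(Vy, Vy) / (y @ Vy) + np.outer(s, s) / (s @ y)
        g = g_new
    return x
```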
3.6.2 Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method
The BFGS method is also a quasi-Newton method, with a different updating formula,
$$V_{k+1} = \left[ I - \frac{s_k y_k^T}{s_k^T y_k} \right] V_k \left[ I - \frac{y_k s_k^T}{s_k^T y_k} \right] + \frac{s_k s_k^T}{s_k^T y_k}. \qquad (15)$$
The relative performance between the DFP and BFGS methods is problem dependent.
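In the DFP sketch above, only the update of $V$ needs to change to obtain BFGS. A minimal version of (15), written with $\rho = 1/(s_k^T y_k)$:

```python
import numpy as np

def bfgs_update(V, s, y):
    """BFGS update (15) of the approximate inverse Hessian V."""
    n = V.shape[0]
    rho = 1.0 / (s @ y)
    A = np.eye(n) - rho * np.outer(s, y)      # I - s y^T / (s^T y)
    return A @ V @ A.T + rho * np.outer(s, s)
```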
3.7 Trust Region Methods
Trust region, or restricted-step, methods are a different approach to resolving the weaknesses of the pure form of Newton's method, which arise from a Hessian that is not positive definite or a highly nonlinear function.
One way to interpret these problems is to say that they arise from the fact that we are stepping outside the region for which the quadratic approximation is reasonable. Thus we can overcome these difficulties by minimizing the quadratic function within a region around $x_k$ within which we trust the quadratic model.
1. Select starting point $x_0$, a convergence parameter $\varepsilon_g$ and the initial size of the trust region, $h_0$.
2. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$ then stop. Otherwise, continue.
3. Compute $H(x_k) \equiv \nabla^2 f(x_k)$ and solve the quadratic subproblem
$$\begin{aligned} \text{minimize} \quad & q(s_k) = f(x_k) + g(x_k)^T s_k + \tfrac{1}{2} s_k^T H(x_k)\, s_k & (16) \\ \text{w.r.t.} \quad & s_k & (17) \\ \text{s.t.} \quad & -h_k \le (s_k)_i \le h_k, \quad i = 1, \ldots, n. & (18) \end{aligned}$$
4. Evaluate $f(x_k + s_k)$ and compute the ratio that measures the accuracy of the quadratic model,
$$r_k = \frac{\Delta f}{\Delta q} = \frac{f(x_k) - f(x_k + s_k)}{f(x_k) - q(s_k)}.$$
5. Compute the size of the new trust region as follows:
$$h_{k+1} = \frac{\|s_k\|}{4} \quad \text{if } r_k < 0.25, \qquad (19)$$
$$h_{k+1} = 2 h_k \quad \text{if } r_k > 0.75 \text{ and } h_k = \|s_k\|, \qquad (20)$$
$$h_{k+1} = h_k \quad \text{otherwise.} \qquad (21)$$
6. Determine the new point:
$$x_{k+1} = x_k \quad \text{if } r_k \le 0, \qquad (22)$$
$$x_{k+1} = x_k + s_k \quad \text{otherwise.} \qquad (23)$$
7. Set $k = k + 1$ and return to step 2.
The initial value of h is usually 1. The same stopping criteria used in other gradient-based
methods are applicable.
[1, 2, 3]
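A minimal sketch of this restricted-step algorithm. Two implementation choices are made here that are not prescribed by the notes: the box-constrained subproblem (16)-(18) is solved approximately with scipy's L-BFGS-B, and the infinity norm is used to decide whether the step reached the trust-region boundary.

```python
import numpy as np
from scipy.optimize import minimize

def trust_region(f, grad, hess, x0, h0=1.0, eps_g=1e-6, max_iter=100):
    """Restricted-step method with a box-constrained quadratic subproblem."""
    x = np.asarray(x0, dtype=float)
    h = h0
    for k in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= eps_g:
            break
        q = lambda s: f(x) + g @ s + 0.5 * s @ H @ s          # quadratic model (16)
        dq = lambda s: g + H @ s
        res = minimize(q, np.zeros(x.size), jac=dq,
                       method="L-BFGS-B", bounds=[(-h, h)] * x.size)  # constraint (18)
        s = res.x
        r = (f(x) - f(x + s)) / (f(x) - q(s))                 # model accuracy ratio
        step_norm = np.linalg.norm(s, np.inf)
        if r < 0.25:                                          # shrink trust region (19)
            h = step_norm / 4.0
        elif r > 0.75 and np.isclose(h, step_norm):           # expand trust region (20)
            h = 2.0 * h
        if r > 0:                                             # accept step (22)-(23)
            x = x + s
    return x
```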
3.8 Convergence Characteristics for a Quadratic Function
3.9 Convergence Characteristics for the Rosenbrock Function
3.10 Convergence Characteristics for the Rosenbrock Function
References
[1] A. D. Belegundu and T. R. Chandrupatla. Optimization Concepts and Applications in
Engineering, chapter 3. Prentice Hall, 1999.
[2] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization, chapter 4. Academic
Press, 1981.
[3] C. Onwubiko. Introduction to Engineering Design Optimization, chapter 4. Prentice Hall,
2000.