Chapter 3

Newton-Type Methods for Unconstrained Optimization
Newton-type methods are presented and analyzed for the solution of unconstrained optimiza-
tion problems. In addition to covering the basic derivation and local convergence properties,
both line search and trust region methods are described as globalization strategies, and key
convergence properties are presented. The chapter also describes quasi-Newton methods
and focuses on derivation of symmetric rank one (SR1) and Broyden–Fletcher–Goldfarb–
Shanno (BFGS) methods, using simple variational approaches. The chapter includes a small
example that illustrates the characteristics of these methods.
3.1 Introduction
Chapter 2 concluded with the derivation of Newton’s method for the unconstrained opti-
mization problem
min_{x ∈ R^n} f(x).   (3.1)
For unconstrained optimization, Newton’s method forms the basis for the most efficient
algorithms. Derived from Taylor’s theorem, this method is distinguished by its fast perfor-
mance. As seen in Theorem 2.20, this method has a quadratic convergence rate that can
lead, in practice, to inexpensive solutions of optimization problems. Moreover, extensions
to constrained optimization rely heavily on this method; this is especially true in chemical
process engineering applications. As a result, concepts of Newton’s method form the core
of all of the algorithms discussed in this book.
On the other hand, given a solution to (3.1) that satisfies first and second order sufficient
conditions, the basic Newton method in Algorithm 2.1 may still have difficulties and may
be unsuccessful. Newton’s method can fail on problem (3.1) for the following reasons:
1. The objective function is not smooth. Here, first and second derivatives are needed
to evaluate the Newton step, and Lipschitz continuity of the second derivatives is
needed to keep them bounded.
2. The Newton step does not generate a descent direction. This is associated with
Hessian matrices that are not positive definite. A singular matrix produces New-
ton steps that are unbounded, while Newton steps with ascent directions lead to an
increase in the objective function. These arise from Hessian matrices with negative
curvature.
3. The starting point is not sufficiently close to the solution. For general unconstrained
problems, this property is the hardest to check. While estimates of regions of attraction
for Newton’s method have been developed in [113], they are not easy to apply when
the solution, and its relation to the initial point, is unknown.
These three challenges raise some important questions on how to develop reliable and
efficient optimization algorithms, based on Newton’s method. This chapter deals with these
questions in the following way:
1. In the application of Newton’s method throughout this book, we will focus only on
problems with smooth functions. Nevertheless, there is a rich literature on optimiza-
tion with nonsmooth functions. These include development of nonsmooth Newton
methods [97] and the growing field of nonsmooth optimization algorithms.
2. In Section 3.2, we describe a number of ways to modify the Hessian matrix to ensure
that the modified matrix at iteration k, B k , has a bounded condition number and
remains positive definite. This is followed by Section 3.3 that develops the concept of
quasi-Newton methods, which do not require the calculation of the Hessian matrix.
Instead, a symmetric, positive definite B k matrix is constructed from differences of
the objective function gradient at successive iterations.
3. To avoid the problem of finding a good starting point, globalization strategies are
required that ensure sufficient decrease of the objective function at each step and
lead the algorithm to converge to locally optimal solutions, even from distant starting
points. In Section 3.4, this global convergence property will be effected by line search
methods that are simple modifications of Algorithm 2.1 and require that B k be posi-
tive definite with bounded condition numbers. Moreover, these positive definiteness
assumptions can be relaxed if we apply trust region methods instead. These strategies
are developed and analyzed in Section 3.5.
3.2 Modification of the Hessian Matrix

Algorithm 3.1.
Choose a starting point x^0 and tolerances ε1, ε2 > 0.
For k ≥ 0 while ‖p^k‖ > ε1 and ‖∇f(x^k)‖ > ε2:
1. At x^k, evaluate ∇f(x^k) and B^k, a symmetric positive definite modification of ∇²f(x^k).
2. Solve the linear system B^k p^k = −∇f(x^k) for the search direction p^k.
3. Set x^{k+1} = x^k + p^k and k := k + 1.
The modified Hessian, B^k, satisfies v^T (B^k)^{-1} v > ε ‖v‖² for all vectors v ≠ 0 and for some ε > 0. The step p^k determined from B^k then leads to the descent property:

∇f(x^k)^T p^k = −∇f(x^k)^T (B^k)^{-1} ∇f(x^k) < 0.

One way to construct such a modification is through the eigenvalue decomposition of the Hessian:

B^k = ∇²f(x^k) + E^k = V^k Λ^k V^{k,T} + δI = V^k (Λ^k + δI) V^{k,T},
where the matrices Λ^k and V^k incorporate the eigenvalues λ_j and (orthonormal) eigenvectors of ∇²f(x^k), respectively, and the diagonal elements of Λ^k + δI are the eigenvalues of B^k. Choosing δ = −min_j(λ_j − ε, 0) for some tolerance ε > 0 leads to eigenvalues of B^k no less than ε. If we also assume that the largest eigenvalue is finite, then B^k is a positive definite matrix with a bounded condition number.
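To make this concrete, the following is a minimal numpy sketch of the eigenvalue modification just described; the function name and the default tolerance are illustrative choices, not part of the text.

```python
import numpy as np

def modified_hessian(H, eps=1e-4):
    # Eigenvalue decomposition of the symmetric Hessian: H = V diag(lam) V^T
    lam, V = np.linalg.eigh(H)
    # delta = -min_j(lambda_j - eps, 0) lifts all eigenvalues to at least eps
    delta = -min(lam.min() - eps, 0.0)
    # B = V (Lambda + delta I) V^T is positive definite with eigenvalues >= eps
    B = V @ np.diag(lam + delta) @ V.T
    return B, delta
```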
While such approaches modify the Hessian to allow the calculation of descent direc-
tions, it is not clear how to choose the adjustable parameters that obtain corrections and still
lead to fast convergence. In particular, one would like these corrections, E k , not to interfere
with fast convergence to the solution. For instance, if we can set E k = 0 in a neighborhood
of the solution and calculate “pure” Newton steps, we obtain quadratic convergence from
Theorem 2.20. A weaker condition that leads to superlinear convergence is given by the
following property.
Theorem 3.1 [294] Assume that f (x) is three times differentiable and that Algorithm 3.1
converges to a point that is a strict local minimum. Then x k converges at a superlinear
rate, i.e.,
lim_{k→∞} ‖x^k + p^k − x^*‖ / ‖x^k − x^*‖ = 0   (3.6)
if and only if
lim_{k→∞} ‖(B^k − ∇²f(x^k)) p^k‖ / ‖p^k‖ = 0.   (3.7)
In Section 3.5, we will see that such a judicious modification of the Hessian can be
performed together with a globalization strategy. In particular, we will consider a systematic
strategy for the Levenberg–Marquardt step that is tied to the trust region method.
3.3 Quasi-Newton Methods

To avoid the calculation of second derivatives, quasi-Newton methods construct the matrix B^k from differences of the objective function gradient at successive iterations. Defining

s = x^{k+1} − x^k,   y = ∇f(x^{k+1}) − ∇f(x^k),   (3.8)

we require the updated matrix B^{k+1} to satisfy the secant relation

B^{k+1} s = y.   (3.9)
If f (x) is a quadratic function, the secant relation is exactly satisfied when B k is the Hessian
matrix. Also, from Taylor’s theorem (2.22), we see that (3.9) can provide a reasonable
approximation to the curvature of f(x) along the direction s. Therefore we use this relation to motivate an update formula for B^k. Finally, because ∇²f(x) is a symmetric matrix, we want B^{k+1} to be symmetric as well.
The simplest way to develop an update formula for B k is to postulate the rank-one
update: B k+1 = B k + wwT . Applying (3.9) to this update (see Exercise 1) leads to
B^{k+1} = B^k + [(y − B^k s)(y − B^k s)^T] / [(y − B^k s)^T s],   (3.10)
which is the symmetric rank 1 (SR1) update formulation. The SR1 update asymptotically
converges to the (positive definite) Hessian of the objective function as long as the steps s are
linearly independent. On the other hand, the update for B k+1 can be adversely affected by
regions of negative or zero curvature and can become ill-conditioned, singular, or unbounded
in norm. In particular, care must be taken so that the denominator in (3.10) is bounded away
from zero, e.g., |(y − B^k s)^T s| ≥ C1 ‖s‖² for some C1 > 0. So, while this update can work
well, it is not guaranteed to be positive definite and may not lead to descent directions.
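As an illustration, a minimal sketch of the SR1 update with the safeguard above might look as follows; the default value of C1 is an illustrative choice.

```python
import numpy as np

def sr1_update(B, s, y, C1=1e-8):
    # SR1 update (3.10); skip when the denominator (y - B s)^T s is not
    # bounded away from zero, to avoid ill-conditioned or unbounded updates.
    r = y - B @ s
    denom = r @ s
    if abs(denom) < C1 * (s @ s):
        return B          # skip the update
    return B + np.outer(r, r) / denom
```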
Instead, we also consider a rank-two quasi-Newton update formula that allows B k to
remain symmetric and positive definite as well. To do this, we define the current Hessian
approximation as B k = J J T , where J is a square, nonsingular matrix. Note that this defi-
nition implies that B k is positive definite. To preserve symmetry, the update to B k can be
given as B k+1 = J + (J + )T , where J + is also expected to remain square and nonsingular.
By working with the matrices J and J + , it will also be easier to monitor the symmetry and
positive definiteness properties of B k .
Using the matrix J^+, the secant relation (3.9) can be split into two parts by introducing an intermediate vector v. From

B^{k+1} s = J^+ (J^+)^T s = y,   (3.11)

we obtain the pair of conditions

J^+ v = y   and   (J^+)^T s = v.   (3.12)
The derived update satisfies the secant relation and symmetry. In order to develop a unique
update formula, we also assume the update has the least change in some norm. Here we
obtain an update formula by invoking a least change strategy for J + , leading to
min ‖J^+ − J‖_F   (3.13)
s.t. J^+ v = y,   (3.14)

where ‖J‖_F is the Frobenius norm of the matrix J. Solving (3.13) (see Exercise 8 in Chapter 4)
leads to the so-called Broyden update used to solve nonlinear equations:
J^+ = J + [(y − Jv) v^T] / (v^T v).   (3.15)
Using (3.15) we can recover an update formula in terms of s, y, and B^k by using the following identities for v. From (3.12), we have v^T v = (y^T (J^+)^{-T})(J^+)^T s = s^T y. Also, postmultiplying (J^+)^T by s and using (3.15) leads to

v = (J^+)^T s = J^T s + [v (y − Jv)^T s] / (v^T v)   (3.16)
  = J^T s + v − [(v^T J^T s) / (s^T y)] v,   (3.17)
and therefore v = [(s^T y) / (v^T J^T s)] J^T s. Premultiplying v by s^T J and simplifying the expression leads to

v = [(s^T y) / (s^T B^k s)]^{1/2} J^T s.
Finally, from the definition of v, B^k, and B^{k+1} as well as (3.15), we have

B^{k+1} = [J + (y − Jv)v^T/(v^T v)] [J + (y − Jv)v^T/(v^T v)]^T
        = J J^T + (y y^T − J v v^T J^T) / (v^T v)
        = B^k + y y^T/(s^T y) − J v v^T J^T/(v^T v)
        = B^k + y y^T/(s^T y) − B^k s s^T B^k/(s^T B^k s).   (3.18)
From this derivation, we have assumed B k to be a symmetric matrix, and therefore B k+1
remains symmetric, as seen from (3.18). Moreover, it can be shown that if B k is positive
definite and s T y > 0, then the update, B k+1 , is also positive definite. In fact, the condition
that s T y be sufficiently positive at each iteration, i.e.,
s^T y ≥ C2 ‖s‖² for some C2 > 0,   (3.19)
is important in order to maintain a bounded update. As a result, condition (3.19) is checked at
each iteration, and if it cannot be satisfied, the update (3.18) is skipped. Another alternative
to skipping is known as Powell damping [313]. As described in Exercise 2, this approach
maintains positive definiteness when (3.19) fails by redefining y := θy + (1 − θ )B k s for a
calculated θ ∈ [0, 1].
The update formula (3.18) is known as the Broyden–Fletcher–Goldfarb–Shanno
(BFGS) update, and the derivation above is due to Dennis and Schnabel [110]. As a re-
sult of this updating formula, we have a reasonable approximation to the Hessian matrix
that is also positive definite.
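For illustration, a sketch of this update in numpy, including the skipping test (3.19) and the Powell damping alternative, might look as follows. The constant C2 is an illustrative choice, and the 0.2 damping threshold matches the constraint used in Exercise 2; B is assumed symmetric positive definite.

```python
import numpy as np

def bfgs_update(B, s, y, C2=1e-8, damp=True):
    # BFGS update (3.18); B is assumed symmetric positive definite.
    Bs = B @ s
    sBs = s @ Bs
    if damp and s @ y < 0.2 * sBs:
        # Powell damping: theta chosen so that s^T y_new = 0.2 s^T B s
        theta = 0.8 * sBs / (sBs - s @ y)
        y = theta * y + (1.0 - theta) * Bs
    elif s @ y < C2 * (s @ s):
        return B          # condition (3.19) fails: skip the update
    return B + np.outer(y, y) / (s @ y) - np.outer(Bs, Bs) / sBs
```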
Moreover, the BFGS update has a fast rate of convergence as summarized by the
following property.
Theorem 3.2 [294] If the BFGS algorithm converges to a strict local solution x^* with Σ_{k=0}^{∞} ‖x^k − x^*‖ < ∞, and the Hessian ∇²f(x) is Lipschitz continuous at x^*, then (3.7) holds and x^k converges at a superlinear rate.
Finally, while the BFGS update can be applied directly in step 2 of Algorithm 3.1,
calculation of the search direction p k can be made more efficient by implementing the
quasi-Newton update through the following options.
• Solution of the linear system can be performed with a Cholesky factorization of B^k = L^k (L^k)^T. On the other hand, L^k can be updated directly by applying formula (3.15) with J := L^k and v = [(s^T y)/(s^T L^k (L^k)^T s)]^{1/2} (L^k)^T s, i.e.,

J^+ = L^k + [y s^T L^k] / [(s^T y)^{1/2} (s^T L^k (L^k)^T s)^{1/2}] − [L^k (L^k)^T s s^T L^k] / [s^T L^k (L^k)^T s],   (3.20)
• The BFGS update (3.18) can also be written as a sum of rank-one matrices,

B^{k+1} = B^k + v^k (v^k)^T − u^k (u^k)^T,

where

u^k = B^k s^k / ((s^k)^T B^k s^k)^{1/2},   v^k = y^k / ((y^k)^T s^k)^{1/2}.   (3.23)
For large-scale problems (say, n > 1000), it is advantageous to store only the last m
updates and to develop the so-called limited memory update:
B^{k+1} = B^0 + Σ_{i=max(0,k−m+1)}^{k} [ v^i (v^i)^T − u^i (u^i)^T ].   (3.24)
In this way, only the most recent updates are used for B k , and the older ones are
discarded. While the limited memory update has only a linear convergence rate, it
greatly reduces the linear algebra cost for large problems. Moreover, by storing only
the updates, one can work directly with matrix-vector products instead of B k , i.e.,
B^{k+1} w = B^0 w + Σ_{i=max(0,k−m+1)}^{k} [ v^i ((v^i)^T w) − u^i ((u^i)^T w) ].   (3.25)
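A minimal sketch of the unrolled product (3.25) follows, assuming the vectors u^i and v^i of (3.23) were stored when each update was made and that B^0 is diagonal; the function name and argument layout are illustrative.

```python
import numpy as np

def limited_memory_matvec(B0_diag, us, vs, w):
    # Unrolled limited memory product (3.25); B^0 is diagonal, and the
    # lists us, vs hold the m most recent u^i, v^i vectors (oldest first).
    Bw = B0_diag * w
    for u, v in zip(us, vs):
        Bw += v * (v @ w) - u * (u @ w)
    return Bw
```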
Similar updates have been developed for H k as well. Moreover, Byrd and Nocedal
[83] discovered a particularly efficient compact form of this update as follows:
B^{k+1} = B^0 − [B^0 S_k  Y_k] M_k^{-1} [S_k^T B^0; Y_k^T],   where M_k = [S_k^T B^0 S_k, L_k; L_k^T, −D_k],   (3.26)

where D_k = diag[(s^{k−m+1})^T y^{k−m+1}, …, (s^k)^T y^k], S_k = [s^{k−m+1}, …, s^k], Y_k = [y^{k−m+1}, …, y^k], and

(L_k)_{i,j} = (s^{k−m+i})^T (y^{k−m+j}) for i > j, and 0 otherwise.
The compact limited memory form (3.26) is more efficient to apply than the unrolled
form (3.25), particularly when m is large and B 0 is initialized as a diagonal matrix.
Similar compact representations have been developed for the inverse BFGS update
H k as well as the SR1 update.
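As a sketch, the compact form (3.26) can be applied to a vector as follows, assuming a diagonal B^0 and the m stored pairs as columns of S and Y (oldest first); the names are illustrative.

```python
import numpy as np

def compact_bfgs_matvec(B0_diag, S, Y, w):
    # Compact limited memory form (3.26) applied to a vector w.
    B0S = B0_diag[:, None] * S            # B^0 S_k for diagonal B^0
    SY = S.T @ Y
    L = np.tril(SY, k=-1)                 # (L_k)_{ij} = (s^i)^T y^j for i > j
    D = np.diag(np.diag(SY))              # D_k = diag[(s^i)^T y^i]
    M = np.block([[S.T @ B0S, L], [L.T, -D]])
    U = np.hstack([B0S, Y])               # [B^0 S_k  Y_k]
    return B0_diag * w - U @ np.linalg.solve(M, U.T @ w)
```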
3.4 Line Search Methods
Figure 3.1. Example that shows cycling of the basic Newton method.
Note that the Hessian is always positive definite (f (z) is a convex function) and a unique
optimum exists at z∗ = 0. However, as seen in Figure 3.1, applying Algorithm 3.1 with a
starting point z0 = 10 leads the algorithm to cycle indefinitely between 10 and −10.
To avoid cycling, convergence requires a sufficient decrease of the objective function.
If we choose the step p k = −(B k )−1 ∇f (x k ) with B k positive definite and bounded in
condition number, we can modify the selection of the next iterate using a positive step
length α with
x k+1 = x k + αpk . (3.27)
Using Taylor’s theorem, one can show for sufficiently small α that

f(x^k + αp^k) = f(x^k) − α∇f(x^k)^T (B^k)^{-1} ∇f(x^k) + O(α²),

so that a decrease in f(x) is obtained from the negative second term.
On the other hand, restriction to only small α leads to inefficient algorithms. Instead, a
systematic line search method is needed that allows larger steps to be taken with a sufficient
decrease of f (x). The line search method therefore consists of three tasks:
1. At iteration k, start with a sufficiently large value for α. While a number of methods can be applied to determine this initial step length [294, 134], for the purpose of developing this method, we choose an initial α set to one.
2. Set a criterion that determines whether the chosen value of α provides a sufficient decrease of f(x).
3. If this criterion fails, reduce α and repeat until a sufficient decrease is found.
Sufficient decrease of f (x) can be seen from the last term in (3.30). Clearly a value of
α can be chosen that reduces f (x). As the iterations proceed, one would expect decreases
in f (x) to taper off as ∇f (x) → 0. On the other hand, convergence would be impeded if
α k → 0 and the algorithm stalls.
An obvious line search option is to perform a single variable optimization in α, i.e.,
minα f (x k + αpk ), and to choose α k := α ∗ , as seen in Figure 3.2. However, this option is
expensive, and far away from the solution it is not clear that this additional effort would
reduce the number of iterations to converge to x ∗ . Instead, we consider three popular criteria
to determine sufficient decrease during the line search. To illustrate the criteria for sufficient
decrease, we consider the plot of f (x k + αpk ) shown in Figure 3.2.
All of these criteria require the following decrease in the objective function:
f (x k + α k p k ) ≤ f (x k ) + ηα k ∇f (x k )T p k , (3.31)
where η ∈ (0, 1/2]. This is also known as the Armijo condition. As seen in Figure 3.2, α ∈ (0, α_a]
satisfies this condition. The following additional conditions are also required so that the
chosen value of α is not too short:
• The Wolfe conditions require that (3.31) be satisfied as well as
∇f (x k + α k p k )T p k ≥ ζ ∇f (x k )T p k (3.32)
for ζ ∈ (η, 1). From Figure 3.2, we see that α ∈ [αw , αa ] satisfies these conditions.
• The strong Wolfe conditions are more restrictive and require satisfaction of

|∇f(x^k + α^k p^k)^T p^k| ≤ ζ |∇f(x^k)^T p^k|   (3.33)

for ζ ∈ (η, 1). From Figure 3.2, we see that α ∈ [α_w, α_sw] satisfies these conditions.
• The Goldstein or Goldstein–Armijo conditions require that (3.31) be satisfied as well as
f (x k + α k p k ) ≥ f (x k ) + (1 − η)α k ∇f (x k )T p k . (3.34)
From Figure 3.2, we see that α ∈ [αg , αa ] satisfies these conditions. (Note that the
relative locations of αg and αw change if (1 − η) > ζ .)
The Wolfe and strong Wolfe conditions lead to methods that have desirable conver-
gence properties that are analyzed in [294, 110]. The Goldstein conditions are similar but
do not require evaluation of the directional derivatives ∇f (x k + α k p k )T p k during the line
search. Moreover, in using a backtracking line search, where α is reduced if (3.31) fails, the
Goldstein condition (3.34) is easier to check.
We now consider a global convergence proof for the Goldstein conditions. Based on
the result by Zoutendijk [422], a corresponding proof is also given for the Wolfe conditions
in [294].
Theorem 3.3 (Global Convergence of Line Search Method). Consider an iteration: x k+1 =
x k + α k p k , where pk = −(B k )−1 ∇f (x k ) and α k satisfies the Goldstein–Armijo conditions
(3.31), (3.34). Suppose that f (x) is bounded below for x ∈ Rn , that f (x) is continuously
differentiable, and that ∇f (x) is Lipschitz continuous in an open set containing the level
set {x | f(x) ≤ f(x^0)}. Then by defining the angle between p^k and −∇f(x^k) as

cos θ^k = −∇f(x^k)^T p^k / (‖∇f(x^k)‖ ‖p^k‖),   (3.35)

we have

lim_{k→∞} cos θ^k ‖∇f(x^k)‖ = 0.
Since θk is the angle at x k between the search direction p k and the steepest descent
direction −∇f (x k ), this theorem leads to the result that either ∇f (x k ) approaches zero
or that ∇f(x^k) and p^k become orthogonal to each other. However, in the case where we have a positive definite B^k with a bounded condition number κ(B^k), then

cos θ^k = |∇f(x^k)^T p^k| / (‖∇f(x^k)‖ ‖p^k‖) = ∇f(x^k)^T (B^k)^{-1} ∇f(x^k) / (‖∇f(x^k)‖ ‖(B^k)^{-1} ∇f(x^k)‖) ≥ 1/κ(B^k),

so that cos θ^k is bounded away from zero and the theorem yields lim_{k→∞} ‖∇f(x^k)‖ = 0.
With this result, we now state the basic Newton-type algorithm for unconstrained
optimization with a backtracking line search.
Algorithm 3.2.
Choose a starting point x^0 and tolerances ε1, ε2 > 0.
For k ≥ 0 while ‖p^k‖ > ε1 and ‖∇f(x^k)‖ > ε2:
1. At x^k, evaluate ∇f(x^k) and B^k, a symmetric positive definite matrix with bounded condition number.
2. Solve the linear system B^k p^k = −∇f(x^k).
3. Set α = 1.
4. While the Armijo condition (3.31) is not satisfied, set α := ρα for some ρ ∈ (0, 1).
5. Set x^{k+1} = x^k + αp^k and k := k + 1.
The value of ρ can be chosen in a number of ways. It can be a fixed fraction (e.g., ρ = 1/2), or it can be determined by minimizing a quadratic (see Exercise 4) or cubic interpolant based on previous line search information. In addition, if α < 1 is chosen, then the Goldstein condition (3.34) should be checked to ensure that α is not too short.
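The backtracking portion of Algorithm 3.2 (steps 3–4) can be sketched as follows; the parameter values and the failure guard are illustrative choices, not prescribed by the text.

```python
import numpy as np

def backtracking(f, gradf, x, p, eta=1e-4, rho=0.5, alpha_min=1e-12):
    # Steps 3-4 of Algorithm 3.2: start with alpha = 1 and reduce by rho
    # until the Armijo condition (3.31) is satisfied.
    fx = f(x)
    slope = gradf(x) @ p        # negative for a descent direction
    alpha = 1.0
    while f(x + alpha * p) > fx + eta * alpha * slope:
        alpha *= rho
        if alpha < alpha_min:
            raise RuntimeError("p may not be a descent direction")
    return alpha
```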
Finally, in addition to global convergence in Theorem 3.3 we would like Algorithm 3.2
to perform well, especially in the neighborhood of the optimum. The following theorem is
useful for this feature.
Theorem 3.4 Assume that f (x) is three times differentiable and that Algorithm 3.2 con-
verges to a point that is a strict local minimum x ∗ and
lim_{k→∞} ‖(B^k − ∇²f(x^k)) p^k‖ / ‖p^k‖ = 0
is satisfied. Then there exists a finite k0 , where α k = 1 is admissible for all k > k0 and x k
converges superlinearly to x ∗ .
Proof: Applying Taylor’s theorem for a full step p^k (α = 1) gives, for some t ∈ (0, 1),

f(x^{k+1}) = f(x^k) + ∇f(x^k)^T p^k + (1/2) p^{k,T} ∇²f(x^k + t p^k) p^k
          = f(x^k) + ∇f(x^k)^T p^k + (1/2) p^{k,T} (∇²f(x^k + t p^k) − B^k) p^k + (1/2) p^{k,T} B^k p^k
          = f(x^k) + (1/2) ∇f(x^k)^T p^k + (1/2) p^{k,T} (∇²f(x^k + t p^k) − B^k) p^k
          = f(x^k) + η ∇f(x^k)^T p^k + (1/2 − η) ∇f(x^k)^T p^k + (1/2) p^{k,T} (∇²f(x^k + t p^k) − B^k) p^k
          ≤ f(x^k) + η ∇f(x^k)^T p^k − (1/2 − η) |∇f(x^k)^T p^k| + o(‖p^k‖²),

where the third line uses B^k p^k = −∇f(x^k), and the last line uses the descent property ∇f(x^k)^T p^k < 0 together with the limit condition assumed in the theorem.
From the proof of Theorem 3.3 we know that α^k is bounded away from zero, and because lim_{k→∞} ‖x^k − x^{k+1}‖ = 0, we have p^k → 0. Also, because x^* is a strict local optimum, we have from Taylor’s theorem that |∇f(x^k)^T p^k| > ε′ ‖p^k‖² for some ε′ > 0 and k sufficiently large. Consequently, for η < 1/2 there exists a k0 such that

(1/2) |p^{k,T} (∇²f(x^k + t p^k) − B^k) p^k| < (1/2 − η) |∇f(x^k)^T p^k|,
leading to

f(x^k + p^k) ≤ f(x^k) + η ∇f(x^k)^T p^k,

which satisfies the Armijo condition for α = 1. Superlinear convergence then follows from this result and Theorem 3.1.
Example 3.5 Consider the problem examined in Chapter 2 (see Figure 2.5), with a^T = [0.3, 0.6, 0.2], b^T = [5, 26, 3], and c^T = [40, 1, 10]. The minimizer occurs at x^* = [0.73950, 0.31436]^T with f(x^*) = −5.08926.² As seen from Figure 2.5, this problem
has only a small region around the solution where the Hessian is positive definite. As a result,
we saw in Example 2.21 that Newton’s method has difficulty converging from a starting
point far away from the solution. To deal with this issue, let’s consider the line search
algorithm with BFGS updates, starting with an initial B 0 = I and from a starting point close
to the solution. Applying Algorithm 3.2 to this problem, with a termination tolerance of
‖∇f(x)‖ ≤ 10⁻⁶, leads to the iteration sequence given in Table 3.1. Here it is clear that
the solution can be found very quickly, although it requires more iterations than Newton’s
method (see Table 2.3). Also, as predicted by Theorem 3.4, the algorithm chooses step sizes
with α k = 1 as convergence proceeds in the neighborhood of the solution. Moreover, based
on the error ‖∇f(x^k)‖ we observe superlinear convergence rates, as predicted by Theorem
3.1. Finally, from Figure 2.5 we again note that the convergence path remains in the region
where the Hessian is positive definite.
If we now choose a starting point farther away from the minimum, then applying
Algorithm 3.2, with a termination tolerance of ‖∇f(x)‖ ≤ 10⁻⁶, leads to the iteration
sequence given in Table 3.2. At this starting point, the Hessian is indefinite, and, as seen
in Example 2.21, Newton’s method was unable to converge to the minimum. On the other
hand, the line search method with BFGS updates converges relatively quickly to the optimal
solution. The first three iterations show that large search directions are generated, but the
² To prevent undefined objective and gradient functions, the square root terms are replaced by f(ξ) = (max(10⁻⁶, ξ))^{1/2}. While this approximation violates the smoothness assumptions for these methods, it
affects only large search steps which are immediately reduced by the line search in the early stages of the
algorithm. Hence first and second derivatives are never evaluated at these points.
Table 3.1. Iteration sequence with BFGS line search method with starting point close to solution.

Iteration (k)   x1^k     x2^k     f(x^k)    ‖∇f(x^k)‖       α^k
0               0.8000   0.3000   −5.0000   3.0000          0.0131
1               0.7606   0.3000   −5.0629   3.5620          0.0043
2               0.7408   0.3159   −5.0884   0.8139          1.0000
3               0.7391   0.3144   −5.0892   2.2624 × 10⁻²   1.0000
4               0.7394   0.3143   −5.0892   9.7404 × 10⁻⁴   1.0000
5               0.7395   0.3143   −5.0892   1.5950 × 10⁻⁵   1.0000
6               0.7395   0.3143   −5.0892   1.3592 × 10⁻⁷   —
Table 3.2. Iteration sequence with BFGS line search method with starting point far from solution.

Iteration (k)   x1^k     x2^k     f(x^k)    ‖∇f(x^k)‖        α^k
0               1.0000   0.5000   −1.1226   9.5731           0.0215
1               0.9637   0.2974   −3.7288   12.9460          0.0149*
2               0.8101   0.2476   −4.4641   19.3569          0.0140
3               0.6587   0.3344   −4.9359   4.8700           1.0000
4               0.7398   0.3250   −5.0665   4.3311           1.0000
5               0.7425   0.3137   −5.0890   0.1779           1.0000
6               0.7393   0.3144   −5.0892   8.8269 × 10⁻³    1.0000
7               0.7395   0.3143   −5.0892   1.2805 × 10⁻⁴    1.0000
8               0.7395   0.3143   −5.0892   3.1141 × 10⁻⁶    1.0000
9               0.7395   0.3143   −5.0892   5.4122 × 10⁻¹²   —
*BFGS update was reinitialized to I.
line search leads to very small step sizes. In fact, the first BFGS update generates a poor
descent direction, and the matrix B k had to be reinitialized to I . Nevertheless, the algorithm
continues and takes full steps after the third iteration. Once this occurs, we see that the
method converges superlinearly toward the optimum.
The example demonstrates that the initial difficulties that occurred with Newton’s
method are overcome by line search methods as long as positive definite Hessian approx-
imations are applied. Also, the performance of the line search method on this example
confirms the convergence properties shown in this section. Note, however, that these prop-
erties apply only to convergence to stationary points and do not guarantee convergence to
points that also satisfy second order conditions.
3.5 Trust Region Methods
Trust region methods provide another globalization strategy. In contrast to line search methods, where only the step length is adjusted along a fixed search direction, here the calculated search direction also changes as a function of the step length. This added flexibility leads to methods that have convergence properties superior to line search methods. On the other hand, the computational expense for each trust region iteration may be greater than for line search iterations.
We begin the discussion of this method by defining a trust region for the optimization step, e.g., ‖p‖ ≤ Δ, and a model function m_k(p) that is expected to provide a good approximation to f(x^k + p) within a trust region of size Δ. Any norm can be used to characterize the trust region, although the Euclidean norm is often used for unconstrained optimization. Also, a quadratic model is often chosen for m_k(p), and the optimization step at iteration k is determined by the following optimization problem:

min_p  m_k(p) = ∇f(x^k)^T p + (1/2) p^T B^k p   (3.49)
s.t.   ‖p‖ ≤ Δ,
where B k = ∇ 2 f (x k ) or its quasi-Newton approximation. The basic trust region algorithm
can be given as follows.
Algorithm 3.3.
Given parameters Δ̄ > 0, Δ0 ∈ (0, Δ̄], 0 < κ1 < κ2 < 1, γ ∈ (0, 1/4), and tolerances ε1, ε2 > 0. Choose a starting point x^0.
For k ≥ 0 while ‖p^k‖ > ε1 and ‖∇f(x^k)‖ > ε2:
1. Solve the model problem (3.49) (possibly approximately) to obtain the step p^k.
2. Calculate the ratio of actual to predicted reduction, ρ^k = [f(x^k) − f(x^k + p^k)] / [m_k(0) − m_k(p^k)].
3. If ρ^k < κ1, set Δ_{k+1} = Δ_k/4; else if ρ^k > κ2 and ‖p^k‖ = Δ_k, set Δ_{k+1} = min(2Δ_k, Δ̄); otherwise, set Δ_{k+1} = Δ_k.
4. If ρ^k > γ, accept the step and set x^{k+1} = x^k + p^k; otherwise, set x^{k+1} = x^k.
Typical values of κ1 and κ2 are 1/4 and 3/4, respectively. Algorithm 3.3 lends itself to a
number of variations that will be outlined in the remainder of this section. In particular, if
second derivatives are available, then problem (3.49) is exact up to second order, although
one may need to deal with an indefinite Hessian and nonconvex model problem. On the
other hand, if second derivatives are not used, one may instead use an approximation for
B k such as the BFGS update. Here the model problem is convex, but without second order
information it is less accurate.
Levenberg–Marquardt Steps
Levenberg–Marquardt steps were discussed in Section 3.2 as a way to correct Hessian
matrices that were not positive definite. For trust region methods, the application of these
steps is further motivated by the following property.
Theorem 3.6 Consider the model problem given by (3.49). The solution is given by p k if
and only if there exists a scalar λ ≥ 0 such that the following conditions are satisfied:
(B^k + λI) p^k = −∇f(x^k),   (3.50)
λ(Δ − ‖p^k‖) = 0,
(B^k + λI) is positive semidefinite.
Note that when λ = 0, we have the same Newton-type step as with a line search method,
p N = −(B k )−1 ∇f (x k ).
On the other hand, as λ becomes large,
p(λ) = −(B k + λI )−1 ∇f (x k )
approaches a small step in the steepest descent direction, p^S ≈ −(1/λ)∇f(x^k). As Δ_k is adjusted in Algorithm 3.3 and if Δ_k < ‖p^N‖, then one can find a suitable value of λ by solving the equation ‖p(λ)‖ − Δ_k = 0. With this equation, however, ‖p(λ)‖ depends nonlinearly on λ, thus
leading to difficult convergence with an iterative method. The alternate form
1/‖p(λ)‖ − 1/Δ_k = 0,   (3.51)
suggested in [281, 294], is therefore preferred because it is nearly linear in λ. As a result,
with an iterative solver, such as Newton’s method, a suitable λ is often found for (3.51)
in just 2–3 iterations. Figure 3.4 shows how the Levenberg–Marquardt step changes with increasing values of Δ. Note that for λ = 0 we have the Newton step p^N. Once λ increases, the step takes on a length given by the value of Δ from (3.51). The steps p(λ) then decrease in size with Δ (and increasing λ) and trace out an arc shown by the dotted lines in the figure. Finally, as Δ vanishes, p(λ) points to the steepest descent direction.
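A minimal sketch of the Newton iteration on (3.51), written in the Moré–Sorensen form, follows. It assumes B + λI remains positive definite (i.e., no hard case); the function name, starting value, and tolerances are illustrative.

```python
import numpy as np

def lm_lambda(B, g, Delta, lam=1.0, tol=1e-8, max_iter=20):
    # Newton's method on phi(lam) = 1/||p(lam)|| - 1/Delta, eq. (3.51),
    # assuming B + lam*I stays positive definite (no hard case).
    n = len(g)
    p = np.zeros(n)
    for _ in range(max_iter):
        L = np.linalg.cholesky(B + lam * np.eye(n))    # B + lam I = L L^T
        p = -np.linalg.solve(L.T, np.linalg.solve(L, g))
        pnorm = np.linalg.norm(p)
        if abs(pnorm - Delta) <= tol * Delta:
            break
        q = np.linalg.solve(L, p)                      # q = L^{-1} p
        # Newton step on (3.51) reduces to the More-Sorensen update:
        lam = max(lam + (pnorm / np.linalg.norm(q))**2
                  * (pnorm - Delta) / Delta, 0.0)
    return lam, p
```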
Figure 3.4. Levenberg–Marquardt (dotted lines) and Powell dogleg (dashed lines) steps for different trust regions.
Powell Dogleg Steps

The dogleg method approximates the solution of (3.49) using two line segments based on the Cauchy step p^C and the Newton step p^N. If B^k is positive definite, the Cauchy step can be derived by inserting a trial solution p = −τ∇f(x^k) into the model problem (3.49) and solving for τ with a large value of Δ. Otherwise, if B^k is not positive definite, then the Cauchy step is taken to the trust region bound and has length Δ.
For the dogleg method, we assume that B k is positive definite and we adopt the
Cauchy step for the first case. We also have a well-defined Newton step given by p N =
−(B k )−1 ∇f (x k ). As a result, the solution to (3.49) is given by the following cases:
• p^k = p^N if Δ ≥ ‖p^N‖;
• p^k = (Δ/‖p^C‖) p^C if Δ ≤ ‖p^C‖;
• p^k = p^C + η(p^N − p^C) otherwise, where the scalar η ∈ (0, 1) is selected so that ‖p^k‖ = Δ.
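A sketch of these three cases in numpy is given below, assuming B is positive definite; the function name is an illustrative choice.

```python
import numpy as np

def dogleg_step(g, B, Delta):
    # Powell dogleg step for (3.49), assuming B is positive definite.
    pN = -np.linalg.solve(B, g)                   # Newton step
    if np.linalg.norm(pN) <= Delta:
        return pN
    pC = -(g @ g) / (g @ B @ g) * g               # Cauchy step (B pos. def.)
    nC = np.linalg.norm(pC)
    if nC >= Delta:
        return (Delta / nC) * pC                  # scaled-back Cauchy step
    # Move from p^C toward p^N until the trust region bound is reached
    d = pN - pC
    a, b, c = d @ d, 2.0 * (pC @ d), pC @ pC - Delta**2
    eta = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return pC + eta * d
```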
Both of these methods provide search directions that address the model problem (3.49).
It is also important to note that while the Levenberg–Marquardt steps provide an exact
solution to the model problem, the dogleg method solves (3.49) only approximately. In both
cases, the following property holds.
Theorem 3.7 [294] Assume that the model problem (3.49) is solved using Levenberg–Marquardt or dogleg steps with ‖p^k‖ ≤ Δ_k; then for some c1 > 0,

m_k(0) − m_k(p) ≥ c1 ‖∇f(x^k)‖ min( ‖∇f(x^k)‖ / ‖B^k‖, Δ ).   (3.53)
The relation (3.53) can be seen as an analogue to the descent property in line search methods
as it relates improvement in the model function back to ∇f (x). With this condition, one
can show convergence properties similar to (and under weaker conditions than) line search
methods [294, 100]. These are summarized by the following theorems.
Theorem 3.8 [294] Let γ ∈ (0, 1/4), ‖B^k‖ ≤ β < ∞, and let f(x) be Lipschitz continuously differentiable and bounded below on a level set {x | f(x) ≤ f(x^0)}. Also, in solving (3.49) (approximately), assume that p^k satisfies (3.53) and that ‖p^k‖ ≤ c2 Δ_k for some constant c2 ≥ 1. Then the algorithm generates a sequence of points with lim_{k→∞} ‖∇f(x^k)‖ = 0.
The step acceptance condition f (x k ) − f (x k + pk ) ≥ γ (mk (0) − mk (p)) > 0 with
γ > 0 and Lipschitz continuity of ∇f can also be relaxed, but with a weakening of the
above property.
Theorem 3.9 [294] Let γ = 0, ‖B^k‖ ≤ β < ∞, and let f(x) be continuously differentiable and bounded below on a level set {x | f(x) ≤ f(x^0)}. Let p^k satisfy (3.53) and ‖p^k‖ ≤ c2 Δ_k for some constant c2 ≥ 1. Then the algorithm generates a sequence of points with

lim inf_{k→∞} ‖∇f(x^k)‖ = 0.
This lim inf property states that without a strict step acceptance criterion (i.e., γ = 0), the sequence ‖∇f(x^k)‖ is not bounded away from zero. On the other hand, only a subsequence, indexed by k′, with ‖∇f(x^{k′})‖ is guaranteed to converge to zero. A more detailed
description of this property can be found in [294].
Finally, note that Theorems 3.8 and 3.9 deal with convergence to stationary points
that may not be local optima if f(x) is nonconvex. Here, with B^k forced to be positive definite, the dogleg approach may converge to a point where ∇²f(x^*) is not positive definite.
Similarly, the Levenberg–Marquardt method may also converge to such a point if λ remains
positive. The stronger property of convergence to a local minimum requires consideration of
second order conditions for (3.49), as well as a more general approach with B k = ∇ 2 f (x k )
for the nonconvex model problem (3.49).
Writing the eigenvalue decomposition of the Hessian as ∇²f(x^k) = Σ_i λ_i v_i v_i^T, with eigenvalues λ_i and orthonormal eigenvectors v_i, the Levenberg–Marquardt step can be expanded as

p(λ) = −(∇²f(x^k) + λI)^{-1} ∇f(x^k) = −Σ_i [v_i^T ∇f(x^k) / (λ_i + λ)] v_i.   (3.54)

To adjust p(λ) to satisfy ‖p(λ)‖ = Δ_k, we see from (3.54) that we can make ‖p(λ)‖ small by increasing λ. Also, if ∇²f(x^k) is positive definite, then if λ = 0 satisfies the acceptance criterion, we can recover the Newton step, and fast convergence is assured.
In the case of negative or zero curvature, we have an eigenvalue λ_{i*} ≤ 0 for a particular index i*. As long as v_{i*}^T ∇f(x^k) ≠ 0, we can still make ‖p(λ)‖ large by letting λ approach |λ_{i*}|. Thus, ‖p(λ)‖ could be adjusted so that its length matches Δ_k. However, if we have v_{i*}^T ∇f(x^k) = 0, then no positive value of λ can be found which increases the length ‖p(λ)‖ to Δ_k. This case is undesirable, as it precludes significant improvement of f(x) along the direction of negative curvature v_{i*} and could lead to premature termination with large values of λ and very small values of Δ.
This phenomenon is called the hard case [282, 100, 294]. For this case, an additional
term is needed that includes a direction of negative curvature, z. Here the corresponding
eigenvector for a negative eigenvalue, vi∗ , is an ideal choice. Because it is orthogonal both
to the gradient vector and to all of the other eigenvectors, it can independently exploit the
negative curvature direction up to the trust region boundary with a step given by
p^k = −(∇²f(x^k) + λI)^{-1} ∇f(x^k) + τz,   (3.55)

where τ is chosen so that ‖p^k‖ = Δ_k. Finding the appropriate eigenvector requires an eigenvalue decomposition of ∇²f(x^k) and is only suitable for small problems. Nevertheless, the approach based on (3.55) can be applied in a systematic way to find the global minimum of the trust region problem (3.49). More details of this method can be found in [282, 100].
For large-scale problems, there is an inexact way to solve (3.49) based on the truncated
Newton method. Here we attempt to solve the linear system B k p k = −∇f (x k ) with the
method of conjugate gradients. However, if B k is indefinite, this method “fails” by generating
a large step, which can be shown to be a direction of negative curvature and can be used
directly in (3.55). The truncated Newton algorithm for the inexact solution of (3.49) is given
as follows.
Algorithm 3.4.
Given parameters ε > 0, set p_0 = 0, r_0 = ∇f(x^k), and d_0 = −r_0. If ‖r_0‖ ≤ ε, stop with p^k = p_0.
For j ≥ 0:
• If d_j^T B^k d_j ≤ 0 (d_j is a direction of negative curvature), find τ so that p = p_j + τd_j minimizes m(p) with ‖p‖ = Δ_k and return with the solution p^k = p.
• Else, set α_j = r_j^T r_j / (d_j^T B^k d_j) and p_{j+1} = p_j + α_j d_j.
• If ‖p_{j+1}‖ ≥ Δ_k, find τ ≥ 0 so that ‖p_j + τd_j‖ = Δ_k and return with the solution p^k = p_j + τd_j.
• Set r_{j+1} = r_j + α_j B^k d_j. If ‖r_{j+1}‖ ≤ ε, return with the solution p^k = p_{j+1}.
• Else, set β_{j+1} = r_{j+1}^T r_{j+1} / (r_j^T r_j) and d_{j+1} = −r_{j+1} + β_{j+1} d_j.
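A compact sketch of this truncated (Steihaug-type) CG iteration follows; the helper for the boundary crossing and the iteration cap are illustrative choices.

```python
import numpy as np

def _boundary_tau(p, d, Delta):
    # Positive tau with ||p + tau d|| = Delta
    a, b, c = d @ d, 2.0 * (p @ d), p @ p - Delta**2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def truncated_cg(B, g, Delta, eps=1e-8, max_iter=200):
    # Inexact solution of the model problem (3.49), following Algorithm 3.4
    p, r = np.zeros(len(g)), g.copy()
    if np.linalg.norm(r) <= eps:
        return p
    d = -r
    for _ in range(max_iter):
        Bd = B @ d
        if d @ Bd <= 0:                           # negative curvature
            return p + _boundary_tau(p, d, Delta) * d
        alpha = (r @ r) / (d @ Bd)
        p_new = p + alpha * d
        if np.linalg.norm(p_new) >= Delta:        # step crosses the boundary
            return p + _boundary_tau(p, d, Delta) * d
        r_new = r + alpha * Bd
        if np.linalg.norm(r_new) <= eps:
            return p_new
        d = -r_new + ((r_new @ r_new) / (r @ r)) * d
        p, r = p_new, r_new
    return p
```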
The conjugate gradient (CG) steps generated to solve (3.49) for a particular can be
seen in Figure 3.5. Note that the first step taken by the method is exactly the Cauchy step.
The subsequently generated CG steps increase in size until they exceed the trust region or
converge to the Newton step inside the trust region.
Figure 3.5. Solution steps for model problem generated by the truncated Newton
method: truncated step (left), Newton step (right).
While treatment of negative curvature requires more expensive trust region algorithms, the resulting convergence properties are stronger than those obtained from the solution of convex model problems. In particular, these nonconvex methods can find
limit points that are truly local minima, not just stationary points, as stated by the following
property.
Theorem 3.10 [294] Let p^* be the exact solution of the model problem (3.49), γ ∈ (0, 1/4), and B^k = ∇²f(x^k). Also, let the approximate solution for Algorithm 3.3 satisfy ‖p^k‖ ≤ c2 Δ_k and achieve at least a fixed fraction of the model reduction obtained by p^*. Then the algorithm generates a sequence of points with

lim_{k→∞} ‖∇f(x^k)‖ = 0.

Also, if the level set {x | f(x) ≤ f(x^0)} is closed and bounded, then either the algorithm terminates at a point that satisfies second order necessary conditions, or there is a limit point x^* that satisfies second order necessary conditions.
Finally, as with line search methods, global convergence alone does not guarantee
efficient algorithms. To ensure fast convergence, we would like to take pure Newton steps
at least in the neighborhood of the optimum. For trust region methods, this requires that the trust region not shrink to zero upon convergence, i.e., lim_{k→∞} Δ_k must remain bounded away from zero. This property is stated by the following theorem.
Theorem 3.11 [294] Let f (x) be twice Lipschitz continuously differentiable and suppose
that the sequence {x k } converges to a point x ∗ that satisfies second order sufficient conditions.
Also, for sufficiently large k, problem (3.49) is solved asymptotically exactly with B k →
∇ 2 f (x ∗ ), and with at least the same reduction as a Cauchy step. Then the trust region bound
becomes inactive for k sufficiently large.
Example 3.12 To evaluate trust region methods with exact second derivatives, we again
consider the problem described in Example 3.5. This problem has only a small region
Table 3.3. Iteration sequence with trust region (TR) method and exact Hessian with starting point close to solution. NFE_k denotes the number of function evaluations needed to adjust the trust region in iteration k.

TR Iteration (k)   x1^k      x2^k      f(x^k)     ‖∇f(x^k)‖       NFE_k
0                  0.8000    0.3000    −5.0000    3.0000          3
1                  0.7423    0.3115    −5.0882    0.8163          1
2                  0.7396    0.3143    −5.0892    6.8524 × 10⁻³   1
3                  0.73950   0.31436   −5.08926   2.6845 × 10⁻⁶   —
Table 3.4. Iteration sequence with trust region (TR) method and exact Hessian with starting point far from solution. NFE_k denotes the number of function evaluations to adjust the trust region in iteration k.

TR Iteration (k)   x1^k      x2^k      f(x^k)     ‖∇f(x^k)‖        NFE_k
0                  1.0000    0.5000    −1.1226    9.5731           11
1                  0.9233    0.2634    −4.1492    11.1073          1
2                  0.7621    0.3093    −5.0769    1.2263           1
3                  0.7397    0.3140    −5.0892    0.1246           1
4                  0.7395    0.3143    −5.0892    1.0470 × 10⁻⁴    1
5                  0.73950   0.31436   −5.08926   5.8435 × 10⁻¹⁰   —
around the solution where the Hessian is positive definite, and a method that takes full
Newton steps has difficulty converging from starting points far away from the solution.
Here we apply the trust region algorithm of Gay [156]. This algorithm is based on the
exact trust region algorithm [282] described above. More details of this method can also be
found in [154, 110]. A termination tolerance of ‖∇f(x)‖ ≤ 10⁻⁶ is chosen and the initial trust region is determined from the length of the initial Cauchy step, ‖p^C‖. Choosing a starting point
close to the solution generates the iteration sequence given in Table 3.3. Here it is clear
that the solution can be found very quickly, with only three trust region iterations. The first
trust region step requires three function evaluations to determine a proper trust region size.
After this, pure Newton steps are taken, the trust region becomes inactive, as predicted by
Theorem 3.11, and we see quadratic convergence to the optimum solution.
If we choose a starting point farther away from the minimum, then the trust region
algorithm, with a termination tolerance of ‖∇f(x)‖ ≤ 10⁻⁶, generates the iteration sequence
given in Table 3.4. Here only five trust region iterations are required, but the initial trust
region requires 11 function evaluations to determine a proper size. After this, pure Newton
steps are taken, the trust region becomes inactive, as predicted by Theorem 3.11, and we
can observe quadratic convergence to the optimum solution.
This example demonstrates that the initial difficulties that occur with Newton’s method
are overcome by trust region methods that still use exact Hessians, even if they are indefinite.
Also, the performance of the trust region method on this example confirms the trust region
convergence properties described in this section.
3.7 Exercises
1. Using an SR1 update, derive the following formula:
B^{k+1} = B^k + [(y − B^k s)(y − B^k s)^T] / [(y − B^k s)^T s].
2. Consider the Powell damping procedure described in Section 3.3, where y is replaced by ȳ = θy + (1 − θ)B^k s with θ determined from

max θ
s.t. θ s^T y + (1 − θ) s^T B^k s ≥ 0.2 s^T B^k s,
0 ≤ θ ≤ 1.

Show that this modification maintains a positive definite BFGS update.
6. Consider the least change problem for the inverse Hessian approximation H^k = W W^T:

min ‖W^+ − W‖_F
s.t. W^+ v = s,   (W^+)^T y = v.   (3.56)

Using the inverse update formula (3.21), derive the DFP update for B^k.
7. Download and apply the L-BFGS method to Example 3.5. How does the method
perform as a function of the number of updates?