Nonlinear Programming (Concepts, Algorithms, and Applications To Chemical Processes) - 5. Newton Methods For Equality Constrained Optimization (2010)
Chapter 5

Newton Methods for Equality Constrained Optimization
This chapter extends the Newton-based algorithms in Chapter 3 to deal with equality
constrained optimization. It generalizes the concepts of Newton iterations and associ-
ated globalization strategies to the optimality conditions for this constrained optimiza-
tion problem. As with unconstrained optimization we focus on two important aspects:
solving for the Newton step and ensuring convergence from poor starting points. For
the first aspect, we focus on properties of the KKT system and introduce both full- and
reduced-space approaches to deal with equality constraint satisfaction and constrained
minimization of the objective function. For the second aspect, the important properties
of penalty-based merit functions and filter methods will be explored, both for line search
and trust region methods. Several examples are provided to illustrate the concepts devel-
oped in this chapter and to set the stage for the nonlinear programming codes discussed in
Chapter 6.
In this chapter we consider the equality constrained optimization problem
\[
\min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t. } h(x) = 0, \tag{5.1}
\]
and we assume that the functions f(x) : R^n → R and h(x) : R^n → R^m have continuous
first and second derivatives. First, we note that if the constraints h(x) = 0 are linear, then
(5.1) is a convex problem if and only if f (x) is convex. On the other hand, as discussed in
Chapter 4, nonlinear equality constraints imply nonconvex problems even if f (x) is convex.
As a result, the presence of nonlinear equalities destroys convexity, so local solutions to (5.1) are not guaranteed to be global minima. Therefore, unless
additional information is known about (5.1), we will be content in this chapter to determine
only local minima.
Finding a local solution of (5.1) can therefore be realized by solving the first order KKT conditions (5.2), given by ∇L(x^*, v^*) = ∇f(x^*) + ∇h(x^*)v^* = 0 and h(x^*) = 0, and then checking the second order conditions (5.3). In this chapter we develop Newton-based strategies for this task. As with the Newton methods in Chapter 3, a number of important concepts need to be developed.
Solution of (5.2) with Newton's method relies on the generation of Newton steps from the following linear system:
\[
\begin{bmatrix} W^k & A^k \\ (A^k)^T & 0 \end{bmatrix}
\begin{bmatrix} d_x \\ d_v \end{bmatrix}
= - \begin{bmatrix} \nabla L(x^k, v^k) \\ h(x^k) \end{bmatrix}, \tag{5.4}
\]
where W^k = ∇_{xx} L(x^k, v^k) and A^k = ∇h(x^k). Defining the new multiplier estimate as v̄ = v^k + d_v and substituting into (5.4) leads to the equivalent system:
\[
\begin{bmatrix} W^k & A^k \\ (A^k)^T & 0 \end{bmatrix}
\begin{bmatrix} d_x \\ \bar v \end{bmatrix}
= - \begin{bmatrix} \nabla f(x^k) \\ h(x^k) \end{bmatrix}. \tag{5.5}
\]
Note that (5.5) also gives the first order KKT conditions of the following quadratic programming problem:
\[
\min_{d_x}\; \nabla f(x^k)^T d_x + \tfrac{1}{2} d_x^T W^k d_x \quad \text{s.t. } h(x^k) + (A^k)^T d_x = 0. \tag{5.6}
\]
Using either (5.4) or (5.5), the basic Newton method can then be stated as follows.
Using either (5.4) or (5.5), the basic Newton method can then be stated as follows.
Algorithm 5.1.
Choose a starting point (x^0, v^0). For k ≥ 0, solve the linear system (5.4) (or (5.5)) at (x^k, v^k) for (d_x, d_v), set x^{k+1} = x^k + d_x and v^{k+1} = v^k + d_v (or v^{k+1} = v̄), and repeat until the KKT residuals ‖∇L(x^k, v^k)‖ and ‖h(x^k)‖ fall below a convergence tolerance.
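As a minimal sketch (using NumPy), the iteration can be written as follows; the small test problem, the starting point, and the tolerance are illustrative choices, not from the text:

```python
import numpy as np

# Illustrative problem (not from the text): min x1^2 + x2^2  s.t.  x1^2 + x2 - 1 = 0
def grad_f(x): return np.array([2.0 * x[0], 2.0 * x[1]])
def h(x):      return np.array([x[0] ** 2 + x[1] - 1.0])
def A(x):      return np.array([[2.0 * x[0]], [1.0]])                   # A = grad h
def W(x, v):   return np.array([[2.0 + 2.0 * v[0], 0.0], [0.0, 2.0]])  # Hessian of L

def newton_kkt(x, v, tol=1e-10, max_iter=50):
    """Basic Newton iteration of Algorithm 5.1 using the KKT system (5.4)."""
    for _ in range(max_iter):
        gL = grad_f(x) + A(x) @ v            # gradient of the Lagrangian
        if max(np.linalg.norm(gL), np.linalg.norm(h(x))) <= tol:
            break
        Ak = A(x)
        K = np.block([[W(x, v), Ak], [Ak.T, np.zeros((1, 1))]])
        d = np.linalg.solve(K, -np.concatenate([gL, h(x)]))
        x, v = x + d[:2], v + d[2:]
    return x, v

x, v = newton_kkt(np.array([1.0, 1.0]), np.array([0.0]))
print(x, v)   # converges to (1/sqrt(2), 1/2) with v = -1
```

From this starting point the iterates converge quadratically to a local solution; convergence from poor starting points is the subject of the globalization strategies later in the chapter.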
Theorem 5.1 Consider a solution x^*, v^*, which satisfies the sufficient second order conditions, and assume that ∇h(x^*) has full column rank (LICQ). Moreover, assume that f(x) and h(x) are twice differentiable and that ∇²f(x) and ∇²h(x) are Lipschitz continuous in a neighborhood of this solution. Then, by applying Algorithm 5.1, and with x^0 and v^0 sufficiently close to x^* and v^*, there exists a constant Ĉ > 0 such that
\[
\left\| \begin{matrix} x^{k+1} - x^* \\ v^{k+1} - v^* \end{matrix} \right\|
\le \hat C \left\| \begin{matrix} x^{k} - x^* \\ v^{k} - v^* \end{matrix} \right\|^2, \tag{5.7}
\]
i.e., the convergence rate for {x^k, v^k} is quadratic.
The proof of this theorem follows the proof of Theorem 2.20 as long as the
KKT matrix in (5.4) is nonsingular at the solution (see Exercise 1). In the remainder of
this section we consider properties of the KKT matrix and its application in solving (5.1).
or simply
\[
\begin{bmatrix}
(Y^*)^T W^* Y^* & (Y^*)^T W^* Z^* & (Y^*)^T A^* \\
(Z^*)^T W^* Y^* & (Z^*)^T W^* Z^* & 0 \\
(A^*)^T Y^* & 0 & 0
\end{bmatrix}
\begin{bmatrix} p_Y \\ p_Z \\ d_v \end{bmatrix}
= - \begin{bmatrix}
(Y^*)^T \nabla L(x^*, v^*) \\
(Z^*)^T \nabla L(x^*, v^*) \\
h(x^*)
\end{bmatrix}. \tag{5.9}
\]
• Because the KKT conditions (5.2) are satisfied at x ∗ , v ∗ , the right-hand side of (5.9)
equals zero.
As a result, we can use the bottom row of (5.9) to solve uniquely for pY = 0. Then, from
the second row of (5.9), we can solve uniquely for pZ = 0. Finally, from the first row of
(5.9), we solve uniquely for dv = 0. This unique solution implies that the matrix in (5.9)
is nonsingular, and from Sylvester’s law of inertia we have that the KKT matrix in (5.4) is
nonsingular as well at x ∗ and v ∗ . Note that nonsingularity is a key property required for the
proof of Theorem 5.1.
Consider, for example, the problem
\[
\min \; \tfrac{1}{2}(x_1^2 + x_2^2) \quad \text{s.t. } x_1 + x_2 = 1.
\]
The solution can be found by solving the first order KKT conditions:
\[
x_1 + v = 0, \qquad x_2 + v = 0, \qquad x_1 + x_2 = 1,
\]
with the solution x_1^* = 1/2, x_2^* = 1/2, and v^* = −1/2. We can define a null space basis as Z^* = [1  −1]^T, so that (A^*)^T Z^* = 0. The reduced Hessian at the optimum is (Z^*)^T ∇_{xx} L(x^*, v^*) Z^* = 2 > 0, and the sufficient second order conditions are satisfied. Moreover, the KKT matrix at the solution is given by
\[
K = \begin{bmatrix} W^* & A^* \\ (A^*)^T & 0 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{bmatrix}
\]
with eigenvalues −1, 1, and 2. The inertia of this system is therefore In(K) = (2, 1, 0) = (n, m, 0) as required.
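This example can be checked numerically. The sketch below assembles K, verifies the inertia, and solves the KKT system (5.5) in a single step (exact here, because the problem is a quadratic program with linear constraints):

```python
import numpy as np

# Worked example from the text: min 1/2*(x1^2 + x2^2)  s.t.  x1 + x2 = 1.
W = np.eye(2)                 # Hessian of the Lagrangian (here the identity)
A = np.array([[1.0], [1.0]])  # gradient of h(x) = x1 + x2 - 1

# Assemble the KKT matrix K = [[W, A], [A^T, 0]]
K = np.block([[W, A], [A.T, np.zeros((1, 1))]])

# Inertia: counts of (positive, negative, zero) eigenvalues
eigs = np.linalg.eigvalsh(K)
inertia = (int(np.sum(eigs > 1e-10)), int(np.sum(eigs < -1e-10)),
           int(np.sum(np.abs(eigs) <= 1e-10)))
print(inertia)  # (2, 1, 0) = (n, m, 0), as required

# One Newton/KKT solve (5.5) from x^0 = 0 solves this quadratic program exactly
x0 = np.zeros(2)
grad_f = x0                   # gradient of f at x^0
h = np.array([x0.sum() - 1.0])
sol = np.linalg.solve(K, -np.concatenate([grad_f, h]))
print(sol)                    # [0.5, 0.5, -0.5] -> (x*, v*)
```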
A number of direct solvers are available for symmetric linear systems. In particular, for positive definite symmetric matrices, efficient Cholesky factorizations can be applied. These can be represented as LDL^T, with L a lower triangular matrix and D diagonal. On the other hand, because the KKT matrix in (5.4) is indefinite, possibly singular, and frequently sparse, another symmetric factorization (such as the Bunch–Kaufman [74] factorization) needs to be applied. Denoting the KKT matrix as K, and defining P as a permutation matrix, the indefinite factorization allows us to represent K as P^T K P = L B L^T, where the block diagonal matrix B is determined to have 1 × 1 or 2 × 2 blocks. From Sylvester's law of inertia we can obtain the inertia of K cheaply by examining the blocks of B and evaluating the signs of their eigenvalues.
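The invariance that makes this work can be illustrated numerically: Sylvester's law states that a congruence transformation V^T K V with nonsingular V preserves the inertia. The eigenvalue-based inertia function below is for illustration only; production codes read the inertia directly off the 1 × 1 and 2 × 2 blocks of the LBL^T factorization.

```python
import numpy as np

def inertia(M, tol=1e-10):
    """(positive, negative, zero) eigenvalue counts of a symmetric matrix."""
    eigs = np.linalg.eigvalsh(M)
    return (int((eigs > tol).sum()), int((eigs < -tol).sum()),
            int((np.abs(eigs) <= tol).sum()))

rng = np.random.default_rng(0)
K = rng.standard_normal((5, 5)); K = K + K.T   # random symmetric (indefinite) matrix
V = rng.standard_normal((5, 5))                # nonsingular with probability one

# Sylvester's law of inertia: congruence preserves the inertia
print(inertia(K) == inertia(V.T @ K @ V))      # True
```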
To ensure that the iteration matrix is nonsingular, we first check whether the KKT
matrix at iteration k has the correct inertia, (n, m, 0), i.e., n positive, m negative, and no
zero eigenvalues. If the inertia of this matrix is incorrect, we can modify (5.4) to form the
following linear system:
\[
\begin{bmatrix} W^k + \delta_W I & A^k \\ (A^k)^T & -\delta_A I \end{bmatrix}
\begin{bmatrix} d_x \\ d_v \end{bmatrix}
= - \begin{bmatrix} \nabla L(x^k, v^k) \\ h(x^k) \end{bmatrix}. \tag{5.12}
\]
Here, different trial values can be selected for the scalars δW , δA ≥ 0 until the inertia is
correct. To see how this works, we first prove that such values of δW and δA can be found,
and then sketch a prototype algorithm.
Theorem 5.4 Assume that matrices W k and Ak are bounded in norm; then for any δA > 0
there exist suitable values of δW such that the matrix in (5.12) has the inertia (n, m, 0).
Proof: In (5.12), A^k has column rank r ≤ m. Because A^k may be rank deficient, we represent it by the factorization A^k L^T = [U^T | 0], with upper triangular U ∈ R^{r×n} of full rank and nonsingular lower triangular L ∈ R^{m×m}. This leads to the following factorization of the KKT matrix:
\[
\bar V \begin{bmatrix} W^k + \delta_W I & A^k \\ (A^k)^T & -\delta_A I \end{bmatrix} \bar V^T
= \begin{bmatrix} W^k + \delta_W I & [U^T \,|\, 0] \\ \begin{bmatrix} U \\ 0 \end{bmatrix} & -\delta_A L L^T \end{bmatrix}, \tag{5.13}
\]
where
\[
\bar V = \begin{bmatrix} I & 0 \\ 0 & L \end{bmatrix}.
\]
From Theorem 5.2, it is clear that the right-hand matrix has the same inertia as the KKT
matrix in (5.12). Moreover, defining the nonsingular matrix
\[
\tilde V = \begin{bmatrix} I & \delta_A^{-1} [U^T \,|\, 0] L^{-T} L^{-1} \\ 0 & I \end{bmatrix}
\]
and applying it to the right-hand matrix of (5.13) leads to
\[
\tilde V \begin{bmatrix} W^k + \delta_W I & [U^T \,|\, 0] \\ \begin{bmatrix} U \\ 0 \end{bmatrix} & -\delta_A L L^T \end{bmatrix} \tilde V^T
= \begin{bmatrix} \hat W & 0 \\ 0 & -\delta_A L L^T \end{bmatrix}, \tag{5.14}
\]
where
\[
\hat W = W^k + \delta_W I + \delta_A^{-1} [U^T \,|\, 0] L^{-T} L^{-1} \begin{bmatrix} U \\ 0 \end{bmatrix}
= W^k + \delta_W I + \delta_A^{-1} A^k (A^k)^T.
\]
We note that if δA > 0, the matrix −δA LLT has inertia (0, m, 0) and we see that the right-
hand matrix of (5.14) has an inertia given by I n(Ŵ ) + (0, m, 0). Again, we note that this
matrix has the same inertia as the KKT matrix in (5.12).
To obtain the desired inertia, we now examine Ŵ more closely. Similar to the decomposition in (5.9), we define full rank range- and null-space basis matrices, Y ∈ R^{n×r} and Z ∈ R^{n×(n−r)}, respectively, with Y^T Z = 0. From Z^T A^k = 0, it is clear that Z^T [U^T | 0] L^{-T} = 0 and also that Y^T A^k ∈ R^{r×m} has full row rank r. For any vector p ∈ R^n with p = Y p_Y + Z p_Z, we can choose positive constants ã and c̃ that satisfy
\[
\tilde c\, \|p_Y\| \|p_Z\| \ge -p_Z^T Z^T (W^k + \delta_W I) Y p_Y = -p_Z^T Z^T W^k Y p_Y,
\]
\[
p_Y^T Y^T A^k (A^k)^T Y p_Y \ge \tilde a\, \|p_Y\|^2.
\]
For a given δ_A > 0, we can choose δ_W large enough so that for all p_Z ∈ R^{n−r} and p_Y ∈ R^r with
\[
p_Z^T Z^T (W^k + \delta_W I) Z p_Z \ge \tilde m\, \|p_Z\|^2 \quad \text{for } \tilde m > 0,
\]
\[
p_Y^T Y^T (W^k + \delta_W I) Y p_Y \ge -\tilde w\, \|p_Y\|^2 \quad \text{for } \tilde w \ge 0,
\]
we have ã δ_A^{-1} > w̃ and m̃ ≥ c̃² / (ã δ_A^{-1} − w̃). With these quantities, we obtain the following relations:
\[
\begin{aligned}
p^T \hat W p &= [\,p_Y^T \,|\, p_Z^T\,]
\begin{bmatrix}
Y^T (W^k + \delta_W I + \delta_A^{-1} A^k (A^k)^T) Y & Y^T W^k Z \\
Z^T W^k Y & Z^T (W^k + \delta_W I) Z
\end{bmatrix}
\begin{bmatrix} p_Y \\ p_Z \end{bmatrix} \\
&= p_Y^T Y^T \hat W Y p_Y + 2 p_Y^T Y^T W^k Z p_Z + p_Z^T Z^T (W^k + \delta_W I) Z p_Z \\
&\ge (\tilde a \delta_A^{-1} - \tilde w) \|p_Y\|^2 - 2 \tilde c \|p_Y\| \|p_Z\| + \tilde m \|p_Z\|^2 \\
&= \left( (\tilde a \delta_A^{-1} - \tilde w)^{1/2} \|p_Y\| - \frac{\tilde c}{(\tilde a \delta_A^{-1} - \tilde w)^{1/2}} \|p_Z\| \right)^2
+ \left( \tilde m - \frac{\tilde c^2}{\tilde a \delta_A^{-1} - \tilde w} \right) \|p_Z\|^2 > 0
\end{aligned}
\]
for all p ≠ 0. As a result, Ŵ is positive definite, the inertia of the KKT matrix in (5.12) is (n, m, 0), and we have the desired result.
These observations motivate the following algorithm (simplified from [404]) for
choosing δA and δW for a particular Newton iteration k.
Algorithm 5.2.
Given constants 0 < δ̄_W^min < δ̄_W^0 < δ̄_W^max; δ̄_A > 0; and 0 < κ_l < 1 < κ_u. (From [404], recommended values of the scalar parameters are δ̄_W^min = 10^{−20}, δ̄_W^0 = 10^{−4}, δ̄_W^max = 10^{40}, δ̄_A = 10^{−8}, κ_u = 8, and κ_l = 1/3.) Also, for the first iteration, set δ_W^last := 0.
At each iteration k:
1. Attempt to factorize the matrix in (5.12) with δW = δA = 0. For instance, this can be
done with an LBLT factorization; the diagonal (1 × 1 and 2 × 2) blocks of B are then
used to determine the inertia. If the matrix has correct inertia, then use the resulting
search direction as the Newton step. Otherwise, continue with step 2.
2. If the inertia calculation reveals zero eigenvalues, set δ_A := δ̄_A. Otherwise, set δ_A := 0.

3. If δ_W^last = 0, set δ_W := δ̄_W^0; else set δ_W := max{δ̄_W^min, κ_l δ_W^last}.

4. With the updated values of δ_W and δ_A, attempt to factorize the matrix in (5.12). If the inertia is now correct, set δ_W^last := δ_W and use the resulting search direction in the line search. Otherwise, continue with step 5.

5. Increase δ_W: set δ_W := κ_u δ_W and continue with step 6.

6. If δ_W > δ̄_W^max, abort the search direction computation; the matrix is severely ill-conditioned. Otherwise, return to step 4.
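A sketch of this inertia-correction loop follows; the eigenvalue-based inertia test is for clarity only (implementations such as [404] read the inertia off the LBL^T factorization blocks), and the test matrices at the end are illustrative.

```python
import numpy as np

def inertia(M, tol=1e-8):
    """(positive, negative, zero) eigenvalue counts; for illustration only."""
    eigs = np.linalg.eigvalsh(M)
    return ((eigs > tol).sum(), (eigs < -tol).sum(), (np.abs(eigs) <= tol).sum())

def regularized_kkt_step(W, A, gL, h, dW_last=0.0, dbar_W0=1e-4, dbar_Wmin=1e-20,
                         dbar_Wmax=1e40, dbar_A=1e-8, kappa_u=8.0, kappa_l=1.0 / 3.0):
    """Inertia correction of Algorithm 5.2 applied to the system (5.12)."""
    n, m = W.shape[0], A.shape[1]
    dW, dA = 0.0, 0.0
    while True:
        K = np.block([[W + dW * np.eye(n), A], [A.T, -dA * np.eye(m)]])
        if inertia(K) == (n, m, 0):                       # steps 1 and 4
            return np.linalg.solve(K, -np.concatenate([gL, h])), dW
        if dW == 0.0:                                     # steps 2 and 3
            dA = dbar_A if inertia(K)[2] > 0 else 0.0
            dW = dbar_W0 if dW_last == 0.0 else max(dbar_Wmin, kappa_l * dW_last)
        else:                                             # steps 5 and 6
            dW *= kappa_u
            if dW > dbar_Wmax:
                raise RuntimeError("severely ill-conditioned KKT matrix")

# W is indefinite in the wrong way here, so a positive delta_W is required
step, dW = regularized_kkt_step(np.diag([-1.0, -1.0]), np.array([[1.0], [0.0]]),
                                np.zeros(2), np.zeros(1))
print(dW > 1.0)   # True
```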
Figure 5.1. Normal and tangential steps for the solution d_x of (5.4) or (5.6). Note also that if a coordinate basis (Y_c) is chosen, the normal and tangential steps may not be orthogonal and the steps Y_c p_{Y_c} are longer than Y p_Y.
In contrast to the full-space method, this decomposition requires that Ak have full column
rank for all k. From the bottom row in (5.16) we obtain
\[
h(x^k) + (A^k)^T Y^k p_Y = 0. \tag{5.17}
\]
To determine the tangential component, pZ , we substitute (5.17) in the second row of (5.16).
The following linear system:
\[
(Z^k)^T W^k Z^k p_Z = -\left[ (Z^k)^T \nabla f(x^k) + (Z^k)^T W^k Y^k p_Y \right] \tag{5.19}
\]
can then be solved if (Z k )T W k Z k is positive definite, and this property can be verified
through a successful Cholesky factorization. Otherwise, if (Z k )T W k Z k is not positive def-
inite, this matrix can be modified either by adding a sufficiently large diagonal term, say
δR I , or by using a modified Cholesky factorization, as discussed in Section 3.2.
Once p_Z is calculated from (5.19), we use (5.18) to obtain d_x. Finally, the top row in (5.16) can then be used to update the multipliers:
\[
\bar v = -[(Y^k)^T A^k]^{-1} (Y^k)^T (\nabla f(x^k) + W^k d_x). \tag{5.20}
\]
Because d_x^k → 0 as the algorithm converges, a first order multiplier calculation may be used instead, i.e.,
\[
\bar v = -[(Y^k)^T A^k]^{-1} (Y^k)^T \nabla f(x^k), \tag{5.21}
\]
and we can avoid the calculation of (Y k )T W k Y k and (Y k )T W k Z k . Note that the multipliers
from (5.21) are still asymptotically correct.
A dominant part of the Newton step is the computation of the null-space and range-
space basis matrices, Z and Y , respectively. There are many possible choices for these basis
matrices, including the following three options.
• By computing a QR factorization of A, we can define Z and Y so that they have orthonormal columns, i.e., Z^T Z = I_{n−m}, Y^T Y = I_m, and Z^T Y = 0. This gives a well-conditioned representation of the null space and range space of A. However, Y and Z are dense matrices, and this can lead to expensive computation when the number of variables (n) is large.
• A more economical alternative for Z and Y can be found through a simple elimination
of dependent variables [134, 162, 294]. Here we permute the components of x into m
dependent or basic variables (without loss of generality, we select the first m variables)
and n − m independent or superbasic variables. Similarly, the columns of (Ak )T are
permuted and partitioned accordingly to yield
(Ak )T = [AkB | AkS ]. (5.22)
We assume that the m × m basis matrix A_B^k is nonsingular, and we define the coordinate bases
\[
Z^k = \begin{bmatrix} -(A_B^k)^{-1} A_S^k \\ I \end{bmatrix}
\quad\text{and}\quad
Y^k = \begin{bmatrix} I \\ 0 \end{bmatrix}. \tag{5.23}
\]
Note that in practice Z^k is not formed explicitly; instead we can compute and store the sparse LU factors of A_B^k. Due to the choice (5.23) of Y^k, the normal component determined in (5.17) and the multipliers determined in (5.21) simplify to
\[
p_Y = -(A_B^k)^{-1} h(x^k), \tag{5.24}
\]
\[
\bar v = -(A_B^k)^{-T} (Y^k)^T \nabla f(x^k). \tag{5.25}
\]
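A small self-contained sketch of the coordinate bases (5.23) and the simplifications (5.24)–(5.25); the Jacobian, gradient, and constraint values are randomly generated stand-ins, not from the text:

```python
import numpy as np

# Coordinate bases (5.23) for A^T = [A_B | A_S] with nonsingular m x m basis A_B.
rng = np.random.default_rng(1)
m, n = 2, 4
AT = rng.standard_normal((m, n))          # A^T, assumed full row rank
AB, AS = AT[:, :m], AT[:, m:]             # dependent / independent partition

ABinv_AS = np.linalg.solve(AB, AS)
Z = np.vstack([-ABinv_AS, np.eye(n - m)])         # (5.23): null-space basis
Y = np.vstack([np.eye(m), np.zeros((n - m, m))])  # (5.23): range-space basis

h = np.array([0.3, -0.7])                 # hypothetical constraint values
gf = rng.standard_normal(n)               # hypothetical objective gradient

pY = -np.linalg.solve(AB, h)              # (5.24): normal step
v = -np.linalg.solve(AB.T, Y.T @ gf)      # (5.25): first order multipliers

print(np.allclose(AT @ Z, 0))             # True: A^T Z = 0
print(np.allclose(h + AT @ (Y @ pY), 0))  # True: normal equation (5.17) holds
```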
• Numerical accuracy and algorithmic performance are often improved if the tangential and normal directions, Z p_Z and Y p_Y, can be maintained orthogonal and the length of the normal step is minimized. This can, of course, be obtained through a QR factorization, but the directions can be obtained more economically by modifying the coordinate basis decomposition and defining the orthogonal bases
\[
Z^k = \begin{bmatrix} -(A_B^k)^{-1} A_S^k \\ I \end{bmatrix}
\quad\text{and}\quad
Y^k = \begin{bmatrix} I \\ (A_S^k)^T (A_B^k)^{-T} \end{bmatrix}. \tag{5.26}
\]
Note that (Z^k)^T Y^k = 0, and from the choice (5.26) of Y^k, the calculation in (5.17) can be written as
\[
\begin{aligned}
p_Y &= -\left[ I + (A_B^k)^{-1} A_S^k (A_S^k)^T (A_B^k)^{-T} \right]^{-1} (A_B^k)^{-1} h(x^k) \\
&= -\left( I - (A_B^k)^{-1} A_S^k \left[ I + (A_S^k)^T (A_B^k)^{-T} (A_B^k)^{-1} A_S^k \right]^{-1} (A_S^k)^T (A_B^k)^{-T} \right) (A_B^k)^{-1} h(x^k),
\end{aligned} \tag{5.27}
\]
where the second equation follows from the application of the Sherman–Morrison–Woodbury formula [294]. An analogous expression can be derived for v̄. As with coordinate bases, this calculation requires a factorization of A_B^k, but it also requires a factorization of the (n − m) × (n − m) matrix I + (A_S^k)^T (A_B^k)^{-T} (A_B^k)^{-1} A_S^k.
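The properties of the orthogonal bases (5.26) and both forms of (5.27) can be verified on a small random example (all data below are illustrative stand-ins):

```python
import numpy as np

# Orthogonal bases (5.26) and the normal step (5.27), whose second form uses
# the Sherman-Morrison-Woodbury formula.
rng = np.random.default_rng(2)
m, n = 2, 5
AT = rng.standard_normal((m, n))
AB, AS = AT[:, :m], AT[:, m:]
B = np.linalg.solve(AB, AS)                   # (A_B)^{-1} A_S

Z = np.vstack([-B, np.eye(n - m)])            # same Z as the coordinate basis
Y = np.vstack([np.eye(m), B.T])               # (5.26): now (Z)^T Y = 0

h = rng.standard_normal(m)
# (5.27), first form: pY = -[I + B B^T]^{-1} (A_B)^{-1} h
pY1 = -np.linalg.solve(np.eye(m) + B @ B.T, np.linalg.solve(AB, h))
# (5.27), second form via Sherman-Morrison-Woodbury
ABinv_h = np.linalg.solve(AB, h)
pY2 = -(ABinv_h - B @ np.linalg.solve(np.eye(n - m) + B.T @ B, B.T @ ABinv_h))

print(np.allclose(Z.T @ Y, 0))                # True: the two bases are orthogonal
print(np.allclose(pY1, pY2))                  # True: both forms of (5.27) agree
print(np.allclose(h + AT @ (Y @ pY1), 0))     # True: pY satisfies (5.17)
```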
• The normal and tangential search directions can also be determined implicitly through an appropriate modification of the full-space KKT system (5.4). As shown in Exercise 3, the tangential step d_t = Z^k p_Z can be found from the following linear system:
\[
\begin{bmatrix} W^k & A^k \\ (A^k)^T & 0 \end{bmatrix}
\begin{bmatrix} d_t \\ v \end{bmatrix}
= - \begin{bmatrix} \nabla f(x^k) \\ 0 \end{bmatrix}. \tag{5.28}
\]
For large problems, evaluating W^k can be costly, and a quasi-Newton approximation B^k may be applied instead, based on the secant relation
\[
B^{k+1} s = y,
\]
but with
\[
s = x^{k+1} - x^k, \qquad y = \nabla_x L(x^{k+1}, v^{k+1}) - \nabla_x L(x^k, v^{k+1}).
\]
Note that because we approximate the Hessian with respect to x, both terms in the definition of y require the multiplier evaluated at v^{k+1}.
With these definitions of s and y, we can directly apply the BFGS update (3.18):
\[
B^{k+1} = B^k + \frac{y y^T}{s^T y} - \frac{B^k s s^T B^k}{s^T B^k s}, \tag{5.31}
\]
or, alternatively, the SR1 update:
\[
B^{k+1} = B^k + \frac{(y - B^k s)(y - B^k s)^T}{(y - B^k s)^T s}. \tag{5.32}
\]
However, unlike the Hessian matrix for unconstrained optimization, discussed in Chap-
ter 3, W (x, v) is not required to be positive definite at the optimum. Only its projection,
i.e., Z(x ∗ )T W (x ∗ , v ∗ )Z(x ∗ ), needs to be positive definite to satisfy sufficient second order
conditions. As a result, some care is needed in applying the update matrices using (5.31) or
(5.32).
As discussed in Chapter 3, these updates are well defined when s^T y is sufficiently positive for BFGS, and when (y − B^k s)^T s is sufficiently far from zero for SR1. Otherwise, the updates are skipped for a given iteration, or Powell damping (see Section 3.3) can be applied for BFGS.
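A minimal sketch of the BFGS update (5.31) with the skipping safeguard (Powell damping would instead modify y); the curvature pairs below are randomly generated for illustration:

```python
import numpy as np

def bfgs_update(B, s, y, tol=1e-8):
    """BFGS update (5.31); skip the update if the curvature condition fails."""
    sy = s @ y
    if sy <= tol * np.linalg.norm(s) * np.linalg.norm(y):
        return B                      # skip: s^T y not sufficiently positive
    Bs = B @ s
    return B + np.outer(y, y) / sy - np.outer(Bs, Bs) / (s @ Bs)

rng = np.random.default_rng(3)
B = np.eye(4)
for _ in range(10):
    s = rng.standard_normal(4)
    y = s + 0.1 * rng.standard_normal(4)   # s^T y > 0 with high probability
    B = bfgs_update(B, s, y)

# Accepted updates preserve positive definiteness (skipped ones leave B unchanged)
print(np.all(np.linalg.eigvalsh(B) > 0))   # True
```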
However, because the updated quasi-Newton matrix B k is dense, handling this matrix di-
rectly leads to factorizations of the Newton step that are O(n3 ). For large problems, this
calculation can be prohibitively expensive. Instead we would like to exploit the fact that
Ak is sparse and B k has the structure given by (5.31) or (5.32). We can therefore store the
updates y and s for successive iterations and incorporate these into the solution of (5.33).
For this update, we store only the last q updates and apply the compact limited memory
representation from [83] (see Section 3.3), along with a sparse factorization of the initial
quasi-Newton matrix, B 0 , in (5.33). Also, we assume that B 0 is itself sparse; often it is
chosen to be diagonal. For the BFGS update (5.31), this compact representation is given by
\[
B^{k+1} = B^0 - [\,B^0 S_k \;\; Y^k\,]
\begin{bmatrix} S_k^T B^0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}
\begin{bmatrix} S_k^T B^0 \\ (Y^k)^T \end{bmatrix}, \tag{5.34}
\]
where
\[
S_k = [\,s^{k-q+1}, \dots, s^k\,], \quad
Y^k = [\,y^{k-q+1}, \dots, y^k\,], \quad
D_k = \mathrm{diag}\{(s^{k-q+i})^T y^{k-q+i}\}, \tag{5.35}
\]
and
\[
(L_k)_{i,j} = \begin{cases} (s^{k-q+i})^T (y^{k-q+j}), & i > j, \\ 0, & \text{otherwise.} \end{cases}
\]
By writing (5.34) as
\[
B^{k+1} = B^0 - [\,Y^k \;\; B^0 S_k\,]
\begin{bmatrix} -D_k^{1/2} & D_k^{-1/2} L_k^T \\ 0 & J_k^T \end{bmatrix}^{-1}
\begin{bmatrix} D_k^{1/2} & 0 \\ -L_k D_k^{-1/2} & J_k \end{bmatrix}^{-1}
\begin{bmatrix} (Y^k)^T \\ S_k^T B^0 \end{bmatrix}, \tag{5.36}
\]
where J_k is a lower triangular factorization constructed from the Cholesky factorization that satisfies
\[
J_k J_k^T = S_k^T B^0 S_k + L_k D_k^{-1} L_k^T, \tag{5.37}
\]
we now define V_k = Y^k D_k^{-1/2}, U_k = (B^0 S_k + Y^k D_k^{-1} L_k^T) J_k^{-T}, Ũ^T = [U_k^T  0], and Ṽ^T = [V_k^T  0], and consider the matrices
\[
K = \begin{bmatrix} B^k & A^k \\ (A^k)^T & 0 \end{bmatrix}
= \begin{bmatrix} B^0 + V_k V_k^T - U_k U_k^T & A^k \\ (A^k)^T & 0 \end{bmatrix},
\qquad
K_0 = \begin{bmatrix} B^0 & A^k \\ (A^k)^T & 0 \end{bmatrix}.
\]
Moreover, we assume that K_0 is sparse and can be factorized cheaply using a sparse or structured matrix decomposition. (Usually such a factorization requires O(n^β) operations, with the exponent β ∈ [1, 2].) Factorization of K can then be made by two applications of the Sherman–Morrison–Woodbury formula, with K_1 = K_0 + Ṽ Ṽ^T and K = K_1 − Ũ Ũ^T, yielding
\[
K_1^{-1} = K_0^{-1} - K_0^{-1} \tilde V \left[ I + \tilde V^T K_0^{-1} \tilde V \right]^{-1} \tilde V^T K_0^{-1}, \tag{5.38}
\]
\[
K^{-1} = K_1^{-1} + K_1^{-1} \tilde U \left[ I - \tilde U^T K_1^{-1} \tilde U \right]^{-1} \tilde U^T K_1^{-1}. \tag{5.39}
\]
By carefully structuring these calculations so that K_0^{-1}, K_1^{-1}, and K^{-1} are applied through factorizations and backsolves, these matrices are never explicitly created, and we can obtain the solution to (5.33), i.e.,
\[
\begin{bmatrix} d_x \\ d_v \end{bmatrix} = -K^{-1} \begin{bmatrix} \nabla L(x^k, v^k) \\ h(x^k) \end{bmatrix}, \tag{5.40}
\]
in only O(nβ + q 3 + q 2 n) operations. Since only a few updates are stored (q is typically less
than 20), this limited memory approach leads to a much more efficient implementation of
quasi-Newton methods. An analogous approach can be applied to the SR1 method as well.
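The compact representation (5.34) can be checked against q sequential BFGS updates (5.31) applied to B^0; the curvature pairs below are generated from an assumed SPD "model Hessian" so that s^T y > 0:

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 6, 4
B0 = np.eye(n)
M = rng.standard_normal((n, n)); M = M @ M.T + n * np.eye(n)  # SPD model Hessian

S = rng.standard_normal((n, q))
Ymat = M @ S                                  # y_i = M s_i ensures s_i^T y_i > 0

# Sequential BFGS updates (5.31)
B = B0.copy()
for i in range(q):
    s, y = S[:, i], Ymat[:, i]
    Bs = B @ s
    B = B + np.outer(y, y) / (s @ y) - np.outer(Bs, Bs) / (s @ Bs)

# Compact form (5.34): B = B0 - [B0 S, Y] Mid^{-1} [S^T B0; Y^T]
SY = S.T @ Ymat
L = np.tril(SY, k=-1)                         # (L_k)_{ij} = s_i^T y_j for i > j
D = np.diag(np.diag(SY))
Mid = np.block([[S.T @ B0 @ S, L], [L.T, -D]])
F = np.hstack([B0 @ S, Ymat])
B_compact = B0 - F @ np.linalg.solve(Mid, F.T)

print(np.allclose(B, B_compact))              # True
```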
As seen in Chapter 3, quasi-Newton methods are slower to converge than Newton’s
method, and the best that can be expected is a superlinear convergence rate. For equality
constrained problems, the full-space quasi-Newton implementation has a local convergence
property that is similar to Theorem 3.1, but with additional restrictions.
Theorem 5.5 [63] Assume that f (x) and h(x) are twice differentiable, their second deriva-
tives are Lipschitz continuous, and Algorithm 5.1 modified by a quasi-Newton approxi-
mation converges to a KKT point that satisfies the LICQ and the sufficient second order
optimality conditions. Then x k converges to x ∗ at a superlinear rate, i.e.,
\[
\lim_{k\to\infty} \frac{\|x^k + d_x - x^*\|}{\|x^k - x^*\|} = 0 \tag{5.41}
\]
if and only if
\[
\lim_{k\to\infty} \frac{\|Z(x^k)^T (B^k - W(x^*, v^*)) d_x\|}{\|d_x\|} = 0, \tag{5.42}
\]
where Z(x) is a representation of the null space of A(x)T that is Lipschitz continuous in a
neighborhood of x ∗ .
Note that the theorem does not state whether a specific quasi-Newton update, e.g.,
BFGS or SR1, satisfies (5.42), as this depends on whether B k remains bounded and well-
conditioned, or whether updates need to be modified through damping or skipping. On the
other hand, under the more restrictive assumption where W (x ∗ , v ∗ ) is positive definite, the
following property can be shown.
Theorem 5.6 [63] Assume that f (x) and h(x) are twice differentiable, their second deriva-
tives are Lipschitz continuous, and Algorithm 5.1 modified by a BFGS update converges to
a KKT point that satisfies the LICQ and the sufficient second order optimality conditions.
If, in addition, W (x ∗ , v ∗ ) and B 0 are positive definite, and B 0 − W (x ∗ , v ∗ ) and x 0 − x ∗
are sufficiently small, then x k converges to x ∗ at a superlinear rate, i.e.,
\[
\lim_{k\to\infty} \frac{\|x^k + d_x - x^*\|}{\|x^k - x^*\|} = 0. \tag{5.43}
\]
This linear system is solved via two related subsystems of order m, for p_Y and v̄, respectively, and a third subsystem of order n − m for p_Z. For this last subsystem, we can
approximate the reduced Hessian, i.e., B̄ k ≈ (Z k )T W k Z k , by using the BFGS update (5.31).
By noting that
d_x = Y^k p_Y + Z^k p_Z
and assuming that a full Newton step is taken (i.e., x k+1 = x k + dx ), we can apply the secant
condition based on the following first order approximation to the reduced Hessian:
we can write the secant formula B̄ k+1 sk = yk and use this definition for the BFGS update
(5.31). However, note that for this update, some care is needed in the choice of Z(x k )
so that it remains continuous with respect to x. Discontinuous changes in Z(x) (due to
different variable partitions or QR rotations) could lead to serious problems in the Hessian
approximation [162].
In addition, we modify the Newton step from (5.44). Using the normal step pY cal-
culated from (5.17), we can write the tangential step as follows:
where the second part follows from writing (or approximating) (Z k )T W k Y k pY as w k . Note
that the BFGS update eliminates the need to evaluate W k and to form (Z k )T W k Z k explicitly.
In addition, to avoid the formation of (Z k )T W k Y k pY , it is appealing to approximate the
vectors w k in (5.49) and (5.47). This can be done in a number of ways.
• Many reduced Hessian methods [294, 100, 134] typically approximate (Z^k)^T W^k Y^k p_Y by w^k = 0. Neglecting this term often works well, especially when normal steps are
much smaller than tangential steps. This approximation is particularly useful when the
constraints are linear and pY remains zero once a feasible iterate is encountered. More-
over, when Z and Y have orthogonal representations, the length of the normal step
Y k pY is minimized. Consequently, neglecting this term in (5.49) and (5.47) still can
lead to good performance. On the other hand, coordinate representations (5.23) of Y
can lead to larger normal steps (see Figure 5.1), a nonnegligible value for (Z^k)^T W^k Y^k p_Y,
and possibly erratic convergence behavior of the method.
• To ensure that good search directions are generated in (5.44) that are independent of
the representations of Z and Y , we can compute a finite difference approximation of
(Z k )T W k Y k along pY , for example
Figure 5.2. Regions that determine criteria for choosing w^k, the approximation to (Z^k)^T W^k Y^k p_Y. The quasi-Newton approximation can be used in R_1 or R_3, while R_2 requires a finite difference approximation.

Here {γ_k} is a positive sequence with ∑_{k=1}^∞ γ_k < ∞, and σ_k = ‖(Z^k)^T ∇f(x^k)‖ + ‖h(x^k)‖, which is related directly to the distance to the solution, ‖x^k − x^*‖. In the space of ‖p_Y‖ and ‖p_Z‖, we now define three regions given by
\[
\begin{aligned}
R_1 &: \; 0 \le \|p_Y\| \le \gamma_k^2 \|p_Z\|, \\
R_2 &: \; \gamma_k^2 \|p_Z\| < \|p_Y\| \le \kappa \|p_Z\| / \sigma_k^{1/2}, \\
R_3 &: \; \|p_Y\| > \kappa \|p_Z\| / \sigma_k^{1/2},
\end{aligned}
\]
as shown in Figure 5.2. In order to obtain a fast convergence rate, we only need
to resort to the finite difference approximation (5.50) when the calculated step lies
in region R2 . Otherwise, in regions R1 and R3 , the tangential and the normal steps,
respectively, are dominant and the less expensive quasi-Newton approximation can
be used for w k . In [54], these hybrid concepts were incorporated into an algorithm
with a line search along with safeguards to ensure that the updates S k and B̄ k remain
uniformly bounded. A detailed presentation of this hybrid algorithm along with anal-
ysis of global and local convergence properties and numerical performance is given
in [54].
To summarize, the reduced-space quasi-Newton algorithm does not require the com-
putation of the Hessian of the Lagrangian W k and only makes use of first derivatives of f (x)
and h(x). The reduced Hessian matrix (Z k )T W k Z k is approximated by a positive definite
quasi-Newton matrix B̄^k, using the BFGS formula, and the (Z^k)^T W^k Y^k p_Y term is approximated by a vector w^k, which is computed either by means of a finite difference formula or
via a quasi-Newton approximation. The method is therefore well suited for large problems
with relatively few degrees of freedom.
The local convergence properties for this reduced-space method are analogous to The-
orems 5.5 and 5.6, but they are modified because B̄ k is now a quasi-Newton approximation
to the reduced Hessian and also because of the choice of approximation wk . First, we note
from (5.15), (5.47), and the definitions for B̄^k and w^k that the condition (5.51) implies (5.42). Therefore, Theorem 5.5 holds and the algorithm converges superlinearly in x.
On the other hand, (5.51) does not hold for all choices of w k or quasi-Newton updates. For
instance, if wk is a poor approximation to (Z k )T W k Y k pY , the convergence rate is modified
as follows.
Theorem 5.7 [63] Assume that f (x) and h(x) are twice differentiable, their second deriva-
tives are Lipschitz continuous, and the reduced-space quasi-Newton algorithm converges to
a KKT point that satisfies the LICQ and the sufficient second order optimality conditions.
If Z(x) and v(x) (obtained from (5.21)) are Lipschitz continuous in a neighborhood of x ∗ ,
Y (x) and Z(x) are bounded, and [Y (x) | Z(x)] has a bounded inverse, then x k converges to
x^* at a 2-step superlinear rate, i.e.,
\[
\lim_{k\to\infty} \frac{\|x^{k+2} - x^*\|}{\|x^k - x^*\|} = 0 \tag{5.52}
\]
if and only if
\[
\lim_{k\to\infty} \frac{\|(\bar B^k - (Z^*)^T W(x^*, v^*) Z^*)\, p_Z\|}{\|d_x\|} = 0. \tag{5.53}
\]
Finally, we consider the hybrid approach that monitors and evaluates an accurate ap-
proximation of w k based on the regions in Figure 5.2. Moreover, since (Z ∗ )T W (x ∗ , v ∗ )Z ∗ is
positive definite, the BFGS update to the reduced Hessian provides a suitable approximation.
As a result, the following, stronger property can be proved.
Theorem 5.8 [54] Assume that f (x) and h(x) are twice differentiable, their second deriva-
tives are Lipschitz continuous, and we apply the hybrid reduced-space algorithm in [54]
with the BFGS update (5.31), (5.48). If
• the algorithm converges to a KKT point that satisfies the LICQ and the sufficient
second order optimality conditions,
• Y (x) and Z(x) are bounded and [Y (x) | Z(x)] has a bounded inverse in a neighborhood
of x ∗ ,
then x^k converges to x^* at a superlinear rate, i.e.,
\[
\lim_{k\to\infty} \frac{\|x^k + d_x - x^*\|}{\|x^k - x^*\|} = 0. \tag{5.54}
\]
Because the augmented Lagrange function is exact for finite values of ρ and suitable values
of v, it forms the basis of a number of NLP methods (see [294, 100, 134]). On the other
hand, the suitability of LA as a merit function is complicated by the following:
• Finding a sufficiently large value of ρ is not straightforward; a suitably large value needs to be determined iteratively.
• Suitable values for v need to be determined through iterative updates as the algorithm
proceeds, or by direct calculation using (5.58). Either approach can be expensive and
may lead to ill-conditioning, especially if ∇h(x) is rank deficient.
Nonsmooth Exact Penalty Functions
Because a smooth function of the form (5.56) cannot be exact, we finally turn to nonsmooth penalty functions. The most common choice for merit functions is the class of ℓ_p penalty functions, given by
\[
\phi_p(x; \rho) = f(x) + \rho \|h(x)\|_p. \tag{5.59}
\]
These functions have the following key properties:
• Let x̄ be a stationary point of φ_p(x; ρ) for all values of ρ above a positive threshold; then if h(x̄) = 0, x̄ is a KKT point for (5.1) (see [294]).
• Let x̄ be a stationary point of φ_p(x; ρ) for all values of ρ above a positive threshold; then if h(x̄) ≠ 0, x̄ is an infeasible stationary point for the penalty function (see [294]).
Compared to (5.55) and (5.57) the nonsmooth exact penalty is easy to apply as a merit
function. Moreover, under the threshold and feasibility conditions, the above properties
show an equivalence between local minimizers of φp (x, ρ) and local solutions of (5.1). As
a result, further improvement of φp (x; ρ) cannot be made from a local optimum of (5.1).
Moreover, even if a feasible solution is not found, the minimizer of φp (x; ρ) leads to a
“graceful failure” at a local minimum of the infeasibility. This point may be useful to the
practitioner to flag incorrect constraint specifications in (5.1) or to reinitialize the algorithm.
As a result, φp (x; ρ), with p = 1 or 2, is widely used in globalization strategies discussed
in the remainder of the chapter and implemented in the algorithms in Chapter 6.
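The threshold behavior can be illustrated on the worked example from this chapter (min ½(x₁² + x₂²) s.t. x₁ + x₂ = 1, with x* = (½, ½) and v* = −½); the sampling radius and penalty values below are illustrative:

```python
import numpy as np

# l1 exact penalty phi_1(x; rho) = f(x) + rho*|h(x)| for the worked example.
# The exactness threshold here is ||v*||_inf = 0.5.
def phi1(x, rho):
    return 0.5 * (x @ x) + rho * abs(x[0] + x[1] - 1.0)

x_star = np.array([0.5, 0.5])
rng = np.random.default_rng(5)
samples = x_star + 0.5 * rng.standard_normal((1000, 2))

# Above the threshold, x* minimizes phi_1 (checked here by sampling)
print(all(phi1(x, 1.0) >= phi1(x_star, 1.0) for x in samples))  # True

# Below the threshold, the penalty prefers an infeasible point (e.g., x = 0)
print(phi1(np.zeros(2), 0.1) < phi1(x_star, 0.1))               # True
```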
The main concern with this penalty function is to find a reasonable value for ρ. While
ρ has a well-defined threshold value, it is unknown a priori, and the resulting algorithm
can suffer from poor performance and ill-conditioning if ρ is set too high. Careful attention
is needed in updating ρ, which is generally chosen after a step is computed, in order to
promote acceptance. For instance, the value of ρ can be adjusted by monitoring the multiplier estimates v^k and ensuring that ρ > ‖v^k‖_q, with the dual norm satisfying 1/p + 1/q = 1, as the algorithm proceeds. Recently, more efficient
updates for ρ have been developed [78, 79, 294] that are based on allowing ρ to lie within a
bounded interval and using the smallest value of ρ in that interval to determine an acceptable
step. As described below, this approach provides considerable flexibility in choosing larger
steps. Concepts of exact penalty functions will be applied to both the line search and trust
region strategies developed in the next section.
A trial point x̂ (e.g., generated by a Newton-like step coupled to the line search or
trust region approach) is acceptable to the filter if it is not in the forbidden region, Fk ,
and a sufficient improvement is realized corresponding to a small fraction of the current infeasibility. In other words, a trial point is accepted if
\[
\theta(\hat x) \le (1 - \gamma_\theta)\, \theta(x^k) \quad \text{or} \quad f(\hat x) \le f(x^k) - \gamma_f\, \theta(x^k) \tag{5.60}
\]
for some small positive values γ_f, γ_θ. These inequalities correspond to the trial point (f(x̂), θ(x̂)) that lies below the dotted lines in Figure 5.3.
The filter strategy also requires two further enhancements in order to guarantee con-
vergence.
1. Relying solely on criterion (5.60) allows the acceptance of a sequence {x k } that only
provides sufficient reduction of the constraint violation, but not the objective function.
For instance, this can occur if a filter pair is placed at a feasible point with θ(x^l) = 0, and acceptance through (5.60) may lead to convergence to a feasible, but nonoptimal, point.
point. In order to prevent this, we monitor a switching criterion to test whenever the
constraint infeasibility is too small to promote a sufficient decrease in the objective
function. If so, we require that the trial point satisfy a sufficient decrease of the
objective function alone.
2. Ill-conditioning at infeasible points can lead to new steps that may be too small to
satisfy the filter criteria; i.e., trial points could be “caught” between the dotted and
solid lines in Figure 5.3. This problem can be corrected with a feasibility restoration
phase which attempts to find a less infeasible point at which to restart the algorithm.
With these elements, the filter strategy, coupled either with a line search or trust region
approach, can be shown to converge to either a local solution of (5.1) or a stationary point
of the infeasibility, θ (x). To summarize, the filter strategy is outlined as follows.
• Generate a new trial point, x̂. Continue if the trial point is not within the forbidden
region, Fk .
• If a switching condition (see (5.85)) is triggered, then accept x̂ as x k+1 if a sufficient
reduction is found in the objective function.
• Else, accept x̂ as x k+1 if the trial point is acceptable to filter with a margin given by
θ (x k ). Update the filter.
• If the trial point is not acceptable, find another trial point that is closer to x k .
• If there is insufficient progress in finding trial points, invoke the restoration phase to find a new point, x^R, with a smaller θ(x^R).
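The acceptance test in the outline above can be sketched in a few lines of code. This is an illustrative sketch only, not the algorithm from the text: the function name, the margin constants, and the representation of the filter as an explicit list of (θ, f) pairs are assumptions made here.

```python
def acceptable_to_filter(theta_hat, f_hat, filter_pairs, g_theta=1e-5, g_f=1e-5):
    """Return True if the trial pair (theta_hat, f_hat) lies outside the
    forbidden region: for every filter pair it must improve either the
    infeasibility or the objective by a small margin."""
    for theta_l, f_l in filter_pairs:
        if not (theta_hat <= (1.0 - g_theta) * theta_l
                or f_hat <= f_l - g_f * theta_l):
            return False  # trial point falls in the forbidden region
    return True
```

A trial point that improves neither measure against some stored pair is rejected, which mirrors the margin test sketched in Figure 5.3.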
Figure 5.4. Illustration of merit function concepts with a flexible penalty parameter.
Curtis and Nocedal [105], and these can be interpreted and compared in the context of filter
methods.
Figure 5.4 shows how merit functions can mimic the performance of filter methods
in the space of f (x) and h(x) if we allow the penalty parameter to vary between upper
and lower bounds. To see this, consider the necessary condition for acceptance of x k+1 ,
f(x^{k+1}) + ρ‖h(x^{k+1})‖ < f(x^k) + ρ‖h(x^k)‖,   (5.61)

or, if ‖h(x^k)‖ ≠ ‖h(x^{k+1})‖,

ρ > ρ_v = (f(x^k) − f(x^{k+1})) / (‖h(x^{k+1})‖ − ‖h(x^k)‖).
Note that the bound on ρ defines the slopes of the lines in Figure 5.4. Clearly, R1 does not
satisfy (5.61), and x k+1 is not acceptable in this region. Instead, acceptable regions for x k+1
are R2 and R3 for ρ = ρu , while R3 and R4 are acceptable regions for ρ = ρl . If we now
allow ρ ∈ [ρl , ρu ], then all three regions R2 , R3 , and R4 are acceptable regions for x k+1 and
these regions behave like one element of the filter method. This is especially true if ρu can
be kept arbitrarily large and ρl can be kept close to zero. Of course, the limits on ρ cannot
be set arbitrarily; they are dictated by global convergence properties. As will be seen later,
it is necessary to update ρu based on a predicted descent property. This property ensures
decrease of the exact penalty function as the step size or trust region becomes small. On the
other hand, due to nonlinearity, smaller values of ρ may be allowed for satisfaction of (5.61)
for larger steps. This relation will guide the choice of ρl . More detail on this selection and
the resulting convergence properties is given in the next sections. We now consider how
both merit function and filter concepts are incorporated within globalization strategies.
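The flexible acceptance region described above can be checked at the two endpoint values of ρ alone, because the penalty difference is linear in ρ. The following sketch (hypothetical function name; scalar stand-ins f and θ for f(x) and ‖h(x)‖) illustrates the test:

```python
def accepted_flexible(f_old, theta_old, f_new, theta_new, rho_l, rho_u):
    """Accept the trial point if the exact penalty f + rho*theta decreases
    for some rho in [rho_l, rho_u].  The penalty difference is linear in
    rho, so testing the two endpoints suffices."""
    def improves(rho):
        return f_new + rho * theta_new < f_old + rho * theta_old
    return improves(rho_l) or improves(rho_u)
```

With a large ρ_u and a ρ_l near zero, this test accepts points in regions R2, R3, and R4 of Figure 5.4, behaving like one element of the filter method.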
also be addressed in this section in the context of constrained optimization, for both merit
function and filter approaches.
where x^{k+1} = x^k + α^k d_x and D_{d_x}φ_p(x^k; ρ) is the directional derivative of the merit function along d_x. To evaluate the directional derivative, we first consider its definition:
D_{d_x}φ_p(x^k; ρ) = lim_{α→0} [φ_p(x^k + α d_x; ρ) − φ_p(x^k; ρ)] / α.   (5.63)
For α ≥ 0 we can apply Taylor’s theorem and bound the difference in merit functions by
φ_p(x^k + α d_x; ρ) − φ_p(x^k; ρ) = (f(x^k + α d_x) − f(x^k)) + ρ(‖h(x^k + α d_x)‖_p − ‖h(x^k)‖_p)
    ≤ α∇f(x^k)^T d_x + ρ(‖h(x^k) + α∇h(x^k)^T d_x‖_p − ‖h(x^k)‖_p) + b_1 α² ‖d_x‖²   for some b_1 > 0.   (5.64)
Using the quadratic term from Taylor's theorem, one can also derive a corresponding lower bound, leading to (5.65) and (5.66).
Dividing (5.65) and (5.66) by α and applying the definition (5.63) as α goes to zero leads to the following expression for the directional derivative:

D_{d_x}φ_p(x^k; ρ) = ∇f(x^k)^T d_x − ρ‖h(x^k)‖_p.   (5.67)
The next step is to tie the directional derivative to the line search strategy to enforce
the optimality conditions. In particular, the choices that determine dx and ρ play important
roles. In Sections 5.3 and 5.4 we considered full- and reduced-space representations as well
as the use of quasi-Newton approximations to generate dx . These features also extend to the
implementation of the line search strategy, as discussed next.
We start by noting that the Newton steps dx from (5.5) and (5.16) are equivalent.
Substituting (5.15) into the directional derivative (5.67) leads to
D_{d_x}φ_p(x^k; ρ) = ∇f(x^k)^T (Z^k p_Z + Y^k p_Y) − ρ‖h(x^k)‖_p,   (5.68)
and from (5.17) we have
D_{d_x}φ_p(x^k; ρ) = ∇f(x^k)^T Z^k p_Z − ∇f(x^k)^T Y^k [(A^k)^T Y^k]^{−1} h(x^k) − ρ‖h(x^k)‖_p.   (5.69)
Substituting this expression into (5.75) and choosing the constants appropriately leads
directly to (5.72).
• Finally, for flexible choices of ρ ∈ [ρl , ρu ] as shown in Figure 5.4, one can choose ρu
to satisfy (5.72), based on (5.71), (5.73), or (5.74). This approach finds a step size α k
that satisfies the modified Armijo criterion:
φ_p(x^k + α^k d_x; ρ) ≤ φ_p(x^k; ρ) + η α^k D_{d_x}φ_p(x^k; ρ_m^k).   (5.77)
On the other hand, ρ_l^k is chosen as a slowly increasing lower bound, given by (5.78).
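A backtracking search against an Armijo criterion of the form (5.77) might be sketched as follows. Here phi is the merit function along the step, dderiv its (negative) directional derivative, and the constants eta and tau are placeholder values chosen for this sketch, not those prescribed by Algorithm 5.3.

```python
def armijo_backtrack(phi, dderiv, alpha0=1.0, eta=1e-4, tau=0.5, max_iter=30):
    """Backtrack from alpha0 until the Armijo condition holds; phi(alpha)
    is the merit function along the step and dderiv < 0 its directional
    derivative at alpha = 0."""
    phi0 = phi(0.0)
    alpha = alpha0
    for _ in range(max_iter):
        if phi(alpha) <= phi0 + eta * alpha * dderiv:
            return alpha
        alpha *= tau   # shrink the step and try again
    return alpha
```

Because η is small, the full step α = 1 is accepted whenever it gives even a modest fraction of the predicted decrease.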
Algorithm 5.3.
Choose constants η ∈ (0, 1/2), ε_1, ε_2 > 0, and τ, τ̄ with 0 < τ < τ̄ < 1. Set k := 0 and choose starting points x^0, v^0, and an initial value ρ^0 > 0 (or ρ_l^0, ρ_u^0 > 0) for the penalty parameter.
3. Update ρ k to satisfy (5.71), (5.73), or (5.74), or update ρlk and ρuk for the flexible
choice of penalty parameters.
4. Set α k = 1.
2. The matrix A(x) has full column rank for all x ∈ D, and there exist constants γ_0 and β_0 such that

‖Y(x)[A(x)^T Y(x)]^{−1}‖ ≤ γ_0,  ‖Z(x)‖ ≤ β_0,

for all x ∈ D.
3. The Hessian (or approximation) W (x, v) is positive definite on the null space of the
Jacobian A(x)T and uniformly bounded.
Lemma 5.9 [54] If Assumptions I hold and if ρ k = ρ is constant for all sufficiently large k,
then there is a positive constant γρ such that for all large k,
φ_p(x^k) − φ_p(x^{k+1}) ≥ γ_ρ [‖(Z^k)^T ∇f(x^k)‖² + ‖h(x^k)‖_p].   (5.81)
The proof of Lemma 5.9 is a modification from the proof of Lemma 3.2 in [82] and
Lemma 4.1 in [54]. Key points in the proof (see Exercise 4) include showing that α k is
always bounded away from zero, using (5.64). Applying (5.72) and the fact that h(x) has
an upper bound in D leads to the desired result.
Finally, Assumptions I and Lemma 5.9 now allow us to prove convergence to a KKT
point for a class of line search strategies that use nonsmooth exact penalty merit functions
applied to Newton-like steps.
Theorem 5.10 If Assumptions I hold, then the weights {ρ k } are constant for all sufficiently
large k, lim_{k→∞} (‖(Z^k)^T ∇f(x^k)‖ + ‖h(x^k)‖) = 0, and lim_{k→∞} x^k = x*, a point that satisfies
the KKT conditions (5.2).
Proof: By Assumptions I, we note that the quantities that define ρ in (5.71), (5.73), or (5.74)
are bounded. Therefore, since the procedure increases ρ^k by at least a fixed positive amount whenever it changes
the penalty parameter, it follows that there is an index k0 and a value ρ such that for all
k > k0 , ρ k = ρ always satisfies (5.71), (5.73), or (5.74).
This argument can be extended to the flexible updates of ρ. Following the proof in
[78], we note that ρuk is increased in the same way ρ k is increased for the other methods,
so ρuk eventually approaches a constant value. For ρlk we note that this quantity is bounded
above by ρuk and through (5.78) it increases by a finite amount to this value at each iteration.
As a result, it also attains a constant value.
Now that ρ k is constant for k > k0 , we have by Lemma 5.9 and the fact that φp (x k )
decreases at each iterate, that
φ_p(x^{k_0}; ρ) − φ_p(x^{k+1}; ρ) = Σ_{j=k_0}^{k} (φ_p(x^j; ρ) − φ_p(x^{j+1}; ρ))

    ≥ γ_ρ Σ_{j=k_0}^{k} [‖(Z^j)^T ∇f(x^j)‖² + ‖h(x^j)‖_p].
By Assumptions I, φp (x, ρ) is bounded below for all x ∈ D, so the last sum is finite, and
thus
lim_{k→∞} [‖(Z^k)^T ∇f(x^k)‖² + ‖h(x^k)‖_p] = 0.   (5.82)
with constants δ > 0, sθ > 1, sf ≥ 1, where mk (α) := α∇f (x k )T dx . If (5.85) holds, then
step dx is a descent direction for the objective function and we require that α k,l satisfies the
Armijo-type condition
where η_f ∈ (0, 1/2) is a fixed constant. Note that several trial step sizes may be tried with (5.85)
satisfied, but not (5.86). Moreover, for smaller step sizes the f -type switching condition
(5.85) may no longer be valid and we revert to the acceptance criterion (5.84).
Note that the second part of (5.85) ensures that the progress for the objective function
enforced by the Armijo condition (5.86) is sufficiently large compared to the current con-
straint violation. Enforcing (5.85) is essential for points that are near the feasible region.
Also, the choices of sf and sθ allow some flexibility in performance of the algorithm. In
particular, if sf > 2sθ (see [403]), both (5.85) and (5.86) will hold for a full step, possibly
improved by a second order correction (see Section 5.8) [402], and rapid local convergence
is achieved.
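A sketch of the f-type switching test, following the form used in [403], is given below. The constant values for δ, s_θ, and s_f are arbitrary placeholders satisfying δ > 0, s_θ > 1, s_f ≥ 1; they are not values prescribed by the text.

```python
def f_type_switching(grad_dot_d, alpha, theta_k, delta=1.0, s_theta=1.1, s_f=2.3):
    """Return True if the predicted decrease m_k(alpha) = alpha * grad_f^T d_x
    is negative and dominates the current infeasibility theta_k."""
    m_k = alpha * grad_dot_d
    return m_k < 0.0 and (-m_k) ** s_f * alpha ** (1.0 - s_f) > delta * theta_k ** s_theta
```

When the test returns True, the Armijo condition (5.86) is enforced instead of the filter margin test (5.84).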
During the optimization we make sure that the current iterate x k is always acceptable
to the current filter Fk . At the beginning of the optimization, the forbidden filter region is
normally initialized to F_0 := {(θ, f) ∈ R² : θ ≥ θ_max} for some upper bound on infeasibility,
θmax > θ(x 0 ). Throughout the optimization, the filter is then augmented in some iterations
after the new iterate x k+1 has been accepted. For this, the updating formula
F_{k+1} := F_k ∪ {(θ, f) ∈ R² : θ ≥ (1 − γ_θ)θ(x^k) and f ≥ f(x^k) − γ_f θ(x^k)}   (5.87)
is used. On the other hand, the filter is augmented only if the current iteration is not an f -type
iteration, i.e., if for the accepted trial step size α k , the f -type switching condition (5.85)
does not hold. Otherwise, the filter region is not augmented, i.e., Fk+1 := Fk . Instead, the
Armijo condition (5.86) must be satisfied, and the value of the objective function is strictly
decreased. This is sufficient to prevent cycling.
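The augmentation rule (5.87) adds the (slightly relaxed) corner of the current iterate to the forbidden region. With the filter stored as a list of pairs, a minimal sketch, with margin constants chosen arbitrarily here, looks like:

```python
def augment_filter(filter_pairs, theta_k, f_k, g_theta=1e-5, g_f=1e-5):
    """Add the relaxed corner of the current iterate to the filter, as in
    (5.87); returns a new list of (theta, f) pairs."""
    return filter_pairs + [((1.0 - g_theta) * theta_k, f_k - g_f * theta_k)]
```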
Finally, finding a trial step size α k,l > 0 that provides sufficient reduction as defined by
criterion (5.84) is not guaranteed. In this situation, the filter method switches to a feasibility
restoration phase, whose purpose is to find a new iterate x k+1 that satisfies (5.84) and is also
acceptable to the current filter by trying to decrease the constraint violation. Any iterative
algorithm can be applied to find a less infeasible point, and different methods could even
be used at different stages of the optimization procedure. To detect the situation where no
admissible step size can be found, one can linearize (5.84) (or (5.85) in the case of an f -type
iteration) and estimate the smallest step size, α k,min , that allows an acceptable trial point.
The algorithm then switches to the feasibility restoration phase when α k,l becomes smaller
than α k,min .
If the feasibility restoration phase terminates successfully by delivering an admissible
point, the filter is augmented according to (5.87) to avoid cycling back to the problematic
point x k . On the other hand, if the restoration phase is unsuccessful, it should converge
to a stationary point of the infeasibility, θ (x). This “graceful exit” may be useful to the
practitioner to flag incorrect constraint specifications in (5.1) or to reinitialize the algorithm.
Combining these elements leads to the following filter line search algorithm.
Algorithm 5.4.
Given a starting point x^0; constants θ_max ∈ (θ(x^0), ∞]; γ_θ, γ_f ∈ (0, 1); δ > 0; γ_α ∈ (0, 1]; s_θ > 1; s_f ≥ 1; η_f ∈ (0, 1/2); 0 < τ ≤ τ̄ < 1.
3. Compute search direction. Calculate the search step dx using either the full-space or
reduced-space linear systems, with exact or quasi-Newton Hessian information. If
this system is detected to be too ill-conditioned, go to feasibility restoration phase in
step 8.
6. Augment filter if necessary. If k is not an f -type iteration, augment the filter using
(5.87); otherwise leave the filter unchanged, i.e., set Fk+1 := Fk .
7. Continue with next iteration. Increase the iteration counter k ← k + 1 and go back to
step 2.
8. Feasibility restoration phase. Compute a new iterate x k+1 by decreasing the infeasi-
bility measure θ, so that x k+1 satisfies the sufficient decrease conditions (5.84) and is
acceptable to the filter, i.e., (θ(x^{k+1}), f(x^{k+1})) ∉ F_k. Augment the filter using (5.87)
(for x k ) and continue with the regular iteration in step 7.
The filter line search method is clearly more complicated than the merit function
approach and is also more difficult to analyze. Nevertheless, the key convergence result for
this method can be summarized as follows.
Theorem 5.11 [403] Let Assumption I(1) hold for all iterates and Assumptions I(1,2,3) hold
for those iterates that are within a neighborhood of the feasible region (‖h(x^k)‖ ≤ θ_inc for
some θinc > 0). In this region successful steps can be taken and the restoration phase need
not be invoked. Then Algorithm 5.4, with no unsuccessful terminations of the feasibility
restoration phase, has the following properties:
In other words, all limit points are feasible, and if {x k } is bounded, then there exists a limit
point x ∗ of {x k } which is a first order KKT point for the equality constrained NLP (5.1).
The filter line search method avoids the need to monitor and update penalty param-
eters over the course of the iterations. This also avoids the case where a poor update of
the penalty parameter may lead to small step sizes and poor performance of the algorithm.
On the other hand, the “lim inf ” result in Theorem 5.11 is not as strong as Theorem 5.7, as
only convergence of a subsequence can be shown. The reason for this weaker result is that,
even for a good trial point, there may be filter information from previous iterates that could
invoke the restoration phase. Nevertheless, one should be able to strengthen this property
to lim_{k→∞} ‖∇_x L(x^k, v^k)‖ = 0 through careful design of the restoration phase algorithm
(see [404]).
min_{d_x}  ∇f(x^k)^T d_x + (1/2) d_x^T W^k d_x   (5.89)
s.t.  h(x^k) + (A^k)^T d_x = 0,
      ‖d_x‖ ≤ Δ.
Here ‖d_x‖ could be measured in the max-norm, leading to simple bound constraints. However, if the trust region is small, there may be no feasible solution for (5.89). As discussed
below, this problem can be overcome by using either a merit function or filter approach.
With the merit function approach, the search direction dx is split into two components with
separate trust region problems for each, and a composite-step trust region method is devel-
oped. In contrast, the filter method applies its restoration phase if the trust region is too
small to yield a feasible solution to (5.89).
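The possible infeasibility of (5.89) is easy to detect for a 2-norm trust region: the minimum-norm step satisfying the linearized constraints is d_x = −A(A^T A)^{−1}h, so no feasible point exists when its norm exceeds Δ. A small sketch (assuming A has full column rank; function name invented here):

```python
import numpy as np

def tr_subproblem_feasible(A, h, Delta):
    """Check feasibility of the linearized constraints h + A^T dx = 0
    inside a 2-norm trust region.  The minimum-norm solution is
    dx = -A (A^T A)^{-1} h; feasibility holds iff its norm is <= Delta."""
    dx_mn = -A @ np.linalg.solve(A.T @ A, h)
    return np.linalg.norm(dx_mn) <= Delta, dx_mn
```

When this test fails, the merit function approach relaxes the constraint via the composite-step decomposition, while the filter method invokes its restoration phase.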
This approach focuses on improving each term separately, and with p = 2 quadratic models
of each term can be constructed and solved as separate trust region problems. In particular,
a trust region problem that improves the quadratic model of the objective function can be
shown to give a tangential step, while the trust region problem that reduces a quadratic
model of the infeasibility leads to a normal step. Each trust region problem is then solved
using the algorithms developed in Chapter 3.
Using the ℓ_2 penalty function, we now postulate a composite-step trust region model
at iteration k given by
m_2(x; ρ) = ∇f(x^k)^T d_x + (1/2) d_x^T W^k d_x + ρ‖h(x^k) + (A^k)^T d_x‖_2.
Decomposing dx = Y k pY + Z k pZ into normal and tangential components leads to the fol-
lowing subproblems.
s.t.  ‖Y^k p_Y‖_2 ≤ ξ_N Δ.   (5.90)
If acceptable steps have been found that provide sufficient reduction in the merit
function, we select our next iterate as
x k+1 = x k + Y k pY + Z k pZ . (5.92)
If not, we reject this step, reduce the size of the trust region, and repeat the calculation of
the normal and tangential steps. These steps are shown in Figure 5.5. In addition, we obtain
Lagrange multiplier estimates at iteration k using the first order approximation (5.93).
The criteria for accepting new points and adjusting the trust region are based on
predicted and actual reductions of the merit function. The predicted reduction in the quadratic
models contains the predicted reduction in the model for the tangential step (qk ),
q_k = −((Z^k)^T ∇f(x^k) + w_k)^T p_Z − (1/2) p_Z^T B̄^k p_Z,   (5.94)
and the predicted reduction in the model for the normal step, ϑ_k, given in (5.95).
The actual reduction is just the difference in the merit function at two successive iterates. The penalty parameter ρ^k is then chosen to satisfy

ρ^k = max{ ρ^{k−1}, [−(q_k − ∇f(x^k)^T Y^k p_Y − (1/2) p_Y^T (Y^k)^T W^k Y^k p_Y)] / [(1 − ζ)ϑ_k] },   (5.98)

where ζ ∈ (0, 1/2). Again, if we use quasi-Newton updates, we would drop the term (1/2) p_Y^T (Y^k)^T W^k Y^k p_Y in (5.98). A more flexible alternative update for ρ is also discussed in [294].
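The update (5.98) can be sketched directly; the argument names below are invented for this sketch, and passing 0.0 for the p_Y^T(Y^k)^T W^k Y^k p_Y term corresponds to the quasi-Newton variant mentioned above.

```python
def update_rho(rho_prev, q_k, grad_f_Yp, pY_W_pY, vartheta_k, zeta=0.25):
    """Penalty update modeled on (5.98); zeta must lie in (0, 1/2).
    For the quasi-Newton variant, pass pY_W_pY = 0.0."""
    candidate = -(q_k - grad_f_Yp - 0.5 * pY_W_pY) / ((1.0 - zeta) * vartheta_k)
    return max(rho_prev, candidate)
```

Because the update takes a max with ρ^{k−1}, the penalty parameter is monotonically nondecreasing over the iterations.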
With these elements, the composite-step trust region algorithm follows.
Algorithm 5.5.
Set all tolerances and constants; initialize B^0 if using quasi-Newton updates, as well as x^0, v^0, ρ^0, and Δ^0.
Typical values for this algorithm are κ_0 = 10^{−4}, κ_1 = 0.25, κ_2 = 0.75, and γ_1 = 0.5, and we note that the trust region algorithm is similar to Algorithm 3.3 from Chapter 3.
The trust region algorithm is stated using reduced-space steps. Nevertheless, for large
sparse problems, the tangential and normal steps can be calculated more cheaply using only
full-space information based on the systems (5.29) and (5.28). Moreover, the tangential
problem allows either exact or quasi-Newton Hessian information. The following conver-
gence property holds for all of these cases.
1. The functions f : Rn → R and h : Rn → Rm are twice differentiable and their first and
second derivatives are Lipschitz continuous and uniformly bounded in norm over D.
2. The matrix A(x) has full column rank for all x ∈ D, and there exist constants γ0 and
β0 such that
‖Y(x)[A(x)^T Y(x)]^{−1}‖ ≤ γ_0,  ‖Z(x)‖ ≤ β_0,   (5.99)
for all x ∈ D.
Theorem 5.12 [100] Suppose Assumptions II hold; then Algorithm 5.5 has the following
properties:
In other words, all limit points are feasible and x ∗ is a first order stationary point for the
equality constrained NLP (5.1).
min_{d_x}  m_f(d_x) = ∇f(x^k)^T d_x + (1/2) d_x^T W^k d_x   (5.101)
s.t.  h(x^k) + (A^k)^T d_x = 0,  ‖d_x‖ ≤ Δ^k.
The filter trust region method uses similar concepts as in the line search case and can
be described by the following algorithm.
Algorithm 5.6.
Initialize all constants and choose a starting point (x^0, v^0) and initial trust region Δ^0. Initialize the filter F_0 := {(θ, f) ∈ R² : θ ≥ θ_max} and the iteration counter k ← 0.
2. For the current trust region, check whether a solution exists to (5.101). If so, continue
to the next step. If not, add x k to the filter and invoke the feasibility restoration phase
min_{d_x}  m_f(d_x) = ∇f(x^r)^T d_x + (1/2) d_x^T W(x^r) d_x   (5.102)
s.t.  h(x^r) + ∇h(x^r)^T d_x = 0,
      ‖d_x‖ ≤ Δ^k.
• If the trial point x^k + d_x does not satisfy

θ(x^k + d_x) ≤ (1 − γ_θ)θ(x^k)  or  f(x^k + d_x) ≤ f(x^k) − γ_f θ(x^k),

then reject the trial step, set x^{k+1} = x^k and Δ^{k+1} ∈ [γ_0 Δ^k, γ_1 Δ^k], and go to step 1.
• If

m_f(x^k) − m_f(x^k + d_x) ≥ κ_θ θ(x^k)²   (5.103)

and

π_k = (f(x^k) − f(x^k + d_x)) / (m_f(x^k) − m_f(x^k + d_x)) < η_1,   (5.104)

then we have a rejected f-type step. Set x^{k+1} = x^k and Δ^{k+1} ∈ [γ_0 Δ^k, γ_1 Δ^k] and go to step 1.
• If (5.103) fails, add x k to the filter. Otherwise, we have an f -type step.
Typical values for these constants are γ_0 = 0.1, γ_1 = 0.5, γ_2 = 2, η_1 = 0.01, η_2 = 0.9, and κ_θ = 10^{−4}. Because the filter trust region method shares many of the concepts from the
corresponding line search method, it is not surprising that it shares the same convergence
property.
Example 5.14 (Slow Convergence Near Solution). Consider the following equality constrained problem:

min  f(x) = τ(x_1² + x_2² − 1) − x_1   (5.105)
s.t.  h(x) = x_1² + x_2² − 1 = 0   (5.106)

with the constant τ > 1. As shown in Figure 5.6, the solution to this problem is x* = [1, 0]^T,
v ∗ = 1/2 − τ and it can easily be seen that ∇xx L(x ∗ , v ∗ ) = I at the solution. To illustrate the
effect of slow convergence, we proceed from a feasible point x k = [cos θ, sin θ ]T and choose
θ small enough to start arbitrarily close to the solution. Using the Hessian information at
x* and linearizing (5.5) at x^k allows the search direction to be determined from the linear system

[ 1        0        2x_1^k ] [ d_1 ]   [ 1 − 2τ x_1^k             ]
[ 0        1        2x_2^k ] [ d_2 ] = [ −2τ x_2^k                ]   (5.107)
[ 2x_1^k   2x_2^k   0      ] [ v̄   ]   [ 1 − (x_1^k)² − (x_2^k)²  ]
yielding d_1 = sin² θ, d_2 = −sin θ cos θ, and x^k + d = [cos θ + sin² θ, sin θ(1 − cos θ)]^T. As seen in Figure 5.6, x^k + d is much closer to the optimum solution, and one can show that the ratio

‖x^k + d − x*‖ / ‖x^k − x*‖² = 1/2   (5.108)
is characteristic of quadratic convergence for all values of θ. However, with the ℓ_p merit function, there is no positive value of the penalty parameter ρ that allows this point to be accepted. Moreover, none of the globalization methods described in this chapter accept this point either, because the infeasibility increases from h(x^k) = 0 to h(x^k + d) = sin² θ > 0.
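The behavior in Example 5.14 can be checked numerically by solving the linear system (5.107) directly. The values τ = 2 and θ = 0.1 are arbitrary choices for this sketch; the ratio (5.108) and the resulting infeasibility are independent of τ.

```python
import numpy as np

tau, theta = 2.0, 0.1          # placeholder values for this sketch
xk = np.array([np.cos(theta), np.sin(theta)])   # feasible starting point
K = np.array([[1.0, 0.0, 2.0 * xk[0]],
              [0.0, 1.0, 2.0 * xk[1]],
              [2.0 * xk[0], 2.0 * xk[1], 0.0]])
rhs = np.array([1.0 - 2.0 * tau * xk[0],
                -2.0 * tau * xk[1],
                1.0 - xk @ xk])
d1, d2, v_bar = np.linalg.solve(K, rhs)
d = np.array([d1, d2])

x_star = np.array([1.0, 0.0])
ratio = np.linalg.norm(xk + d - x_star) / np.linalg.norm(xk - x_star) ** 2
infeas = abs((xk + d) @ (xk + d) - 1.0)   # grows from 0 to sin^2(theta)
```

Here ratio evaluates to 1/2, matching (5.108), while infeas is positive, so the step would be rejected by the merit function despite its excellent progress toward x*.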
To avoid small steps that "creep toward the solution," we develop a second order correction at iteration k. The second order correction is an additional step taken in the range space of the constraint gradients in order to reduce the infeasibility from x^k + d_x. Here we define the correction by

d̄_x = −Y^k [(A^k)^T Y^k]^{−1} h(x^k + d_x).   (5.110)
Example 5.15 (Maratos Example Revisited). To see the influence of the second order correction, we again refer to Figure 5.6. From (5.110) with Y^k = A(x^k), we have d̄_x = −(sin² θ/2)[cos θ, sin θ]^T, and we can see that x^k + d_x + d̄_x is even closer to the optimum solution, with

‖x^k + d_x − x*‖ = 1 − cos θ
    ≥ 1 − cos θ − (cos θ + 5)(sin² θ)/4
    = ‖x^k + d_x + d̄_x − x*‖,
which is also characteristic of quadratic convergence from (5.108). Moreover, while the infeasibility still increases from h(x^k) = 0 to h(x^k + d_x + d̄_x) = sin⁴ θ/4, it is considerably less than at x^k + d_x. In addition, for (1 − cos θ/2)/sin² θ > (τ + ρ)/4, the objective function clearly decreases from f(x^k) = −cos θ; this point also reduces the exact penalty and is acceptable to the filter.
Note that as θ → 0, this relation holds true for any bounded values of τ and ρ. As a
result, the second order correction allows full steps to be taken and fast convergence to be
obtained.
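Continuing the same numerical sketch (θ = 0.1 remains an arbitrary placeholder), the second order correction with Y^k = A(x^k) reduces the infeasibility from sin²θ to sin⁴θ/4 and brings the trial point much closer to x* = [1, 0]^T:

```python
import numpy as np

theta = 0.1                                # placeholder angle for the sketch
c, s = np.cos(theta), np.sin(theta)
xk = np.array([c, s])                      # feasible iterate on the circle
d = np.array([s ** 2, -s * c])             # Newton step from (5.107)
A = 2.0 * xk.reshape(2, 1)                 # constraint gradient at x^k

h_trial = (xk + d) @ (xk + d) - 1.0        # infeasibility sin^2(theta)
d_bar = -A @ np.linalg.solve(A.T @ A, np.array([h_trial]))
x_soc = xk + d + d_bar                     # corrected trial point

x_star = np.array([1.0, 0.0])
```

The corrected point satisfies h(x_soc) = sin⁴θ/4 and lies far closer to x* than x^k + d, so the full step can now be accepted.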
Implementation of the second order correction is a key part of all of the globalization algorithms discussed in this chapter. These algorithms are easily modified by first detecting whether the current iterate is within a neighborhood of the solution. This can be done by checking whether the optimality measures, say max(‖h(x^k)‖, ‖(Z^k)^T ∇f(x^k)‖) or max(‖h(x^k)‖, ‖∇L(x^k, v^k)‖), are reasonably small. If x^k is within this neighborhood, then d̄_x is calculated and the trial point x^k + d_x + d̄_x is examined for sufficient improvement. If the trial point is accepted, we set x^{k+1} = x^k + d_x + d̄_x. Otherwise, we discard d̄_x and continue either by restricting the trust region or by line search backtracking.
5.11 Exercises
1. Prove Theorem 5.1 by modifying the proof of Theorem 2.20.
2. Consider the penalty function given by (5.56) with a finite value of ρ, and compare the
KKT conditions of (5.1) with a stationary point of (5.56). Argue why these conditions
cannot yield the same solution.
3. Show that the tangential step dt = Z k pZ and normal step dn = Y k pY can be found
from (5.28) and (5.29), respectively.
4. Prove Lemma 5.9 by modifying the proof of Lemma 3.2 in [82] and Lemma 4.1
in [54].
5. Fill in the steps used to obtain the results in Example 5.15. In particular, show that for
(1 − cos θ/2)/sin² θ > τ + ρ, the second order correction allows a full step to be taken for either the ℓ_p penalty or the filter strategy.
the filter strategy.
6. Consider the problem
min  f(x) = (1/2)(x_1² + x_2²)   (5.112)
s.t.  h(x) = x_1(x_2 − 1) − θ x_2 = 0,   (5.113)
where θ is an adjustable parameter. The solution to this problem is x_1* = x_2* = v* = 0 and ∇_{xx} L(x*, v*) = I. For θ = 1 and θ = 100, perform the following experiments with x_i^0 = 1/θ, i = 1, 2.
• Setting B k = I , solve this problem with full-space Newton steps and no glob-
alization.
• Setting B̄ k = (Z k )T Z k and w k = 0, solve the above problem with reduced-space
Newton steps and no globalization, using orthogonal bases (5.26).
• Setting B̄ k = (Z k )T Z k and w k = 0, solve the above problem with reduced-space
Newton steps and no globalization, using coordinate basis (5.23).
• Setting B̄ k = (Z k )T Z k and w k = (Z k )T Y k pY , solve the above problem with
reduced-space Newton steps and no globalization, using coordinate basis (5.23).
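As a possible starting point for the first of these experiments, a full-space Newton iteration with B^k = I might be sketched as follows; the function name and stopping rule are choices made here, not prescribed by the exercise. Note that for θ = 1 the constraint gradient vanishes at x^0 = [1, 1]^T, so the KKT matrix is singular there and the linear solve will fail without modification.

```python
import numpy as np

def newton_full_space(theta, x0, v0=0.0, tol=1e-10, max_iter=50):
    """Full-space Newton iteration with B^k = I for problem (5.112)-(5.113).
    Returns the final (x, v) and the number of iterations taken."""
    x = np.array(x0, dtype=float)
    v = v0
    for k in range(max_iter):
        grad_f = x.copy()                          # gradient of (x1^2 + x2^2)/2
        h = x[0] * (x[1] - 1.0) - theta * x[1]
        A = np.array([x[1] - 1.0, x[0] - theta])   # gradient of h
        if np.linalg.norm(np.r_[grad_f + v * A, h]) < tol:
            break
        K = np.block([[np.eye(2), A.reshape(2, 1)],
                      [A.reshape(1, 2), np.zeros((1, 1))]])
        step = np.linalg.solve(K, -np.r_[grad_f + v * A, h])
        x += step[:2]
        v += step[2]
    return x, v, k
```

For θ = 100 this iteration converges rapidly to the solution x* = 0, v* = 0 from x^0 = [0.01, 0.01]^T.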