Numerical Solution of Nonlinear Systems and Optimization
2. One-point iteration
For many iterative methods, xi+1 depends only on xi via some formula that doesn’t
depend on i: xi+1 = G(xi ). Such a method is called a (stationary) one-point iteration.
Before considering specific iterations to solve F (x) = 0, we consider one-point iterations in
general.
Assuming the iteration function G is continuous, we obviously have that if the iterates
xi+1 = G(xi ) converge to some limit x∗ , then x∗ = G(x∗ ), i.e., x∗ is a fixed point of G.
A basic result is the contraction mapping theorem. Recall that a map G : B → Rn
(B ⊂ Rn ) is called a contraction (with respect to some norm on Rn ) if G is Lipschitz with
Lipschitz constant strictly less than 1.
Theorem 4.1. Suppose G maps a closed subset B of Rn to itself, and suppose that G
is a contraction (with respect to some norm). Then G has a unique fixed point x∗ in B.
Moreover, if x0 ∈ B is any point, then the iteration xi+1 = G(xi ) converges to x∗ .
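As a small illustration of the theorem, the following Python sketch iterates a scalar contraction. The helper name `fixed_point`, the tolerance, and the choice $G(x) = \cos x$ on $B = [0, 1]$ are assumptions made for this example only.

```python
import math

def fixed_point(G, x0, tol=1e-12, max_iter=200):
    """Iterate x_{i+1} = G(x_i) until successive iterates differ by less than tol."""
    x = x0
    for i in range(max_iter):
        x_new = G(x)
        if abs(x_new - x) < tol:
            return x_new, i + 1
        x = x_new
    raise RuntimeError("no convergence within max_iter iterations")

# G(x) = cos(x) maps [0, 1] into itself and |G'(x)| = |sin x| <= sin(1) < 1 there,
# so the contraction mapping theorem applies on B = [0, 1].
x_star, n_iter = fixed_point(math.cos, 0.5)
```

The returned `x_star` satisfies $x = \cos x$ to within rounding, and any other starting point in $[0, 1]$ leads to the same limit.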
If $G \in C^1$, a practical way to check whether $G$ is a contraction (with respect to some
norm on $\mathbb{R}^n$) is to consider $\|G'(x)\|$ (in the associated matrix norm). If
$\|G'(x)\| \le \lambda < 1$ on some convex set $\Omega$ (e.g., some ball), then $G$ is a
contraction there. In one dimension this is an immediate consequence of the mean value theorem.
In $n$ dimensions we don’t have the mean value theorem, but we can use the fundamental theorem
of calculus to the same end. Given $x, y \in \Omega$ we let $g(t) = G(x + t(y - x))$, so
$g'(t) = G'(x + t(y - x))(y - x)$. From the fundamental theorem of calculus we get
$g(1) - g(0) = \int_0^1 g'(t)\,dt$, or
$$G(y) - G(x) = \int_0^1 G'(x + t(y - x))\,dt\,(y - x),$$
whence
$$\|G(y) - G(x)\| \le \sup_{0 \le t \le 1} \|G'(x + t(y - x))\|\,\|y - x\| \le \lambda\|y - x\|,$$
and so $G$ is a contraction.
If we assume that $x_*$ is a fixed point of $G$, $G \in C^1$, and $r = \|G'(x_*)\| < 1$, then we
can conclude that the iteration $x_{i+1} = G(x_i)$ converges for any starting iterate $x_0$
sufficiently close to $x_*$. This is called a locally convergent iteration. The above argument
also shows that convergence is (at least) linear with rate $r$.
In this connection, the following theorem, which connects $\|A\|$ to $\rho(A)$ (the spectral
radius of $A$, i.e., the maximum modulus of its eigenvalues), is very useful.
Theorem 4.2. Let $A \in \mathbb{R}^{n\times n}$. Then
1. For any operator matrix norm, $\|A\| \ge \rho(A)$.
2. If $A$ is symmetric, then $\|A\|_2 = \rho(A)$.
3. If $A$ is diagonalizable, then there exists an operator norm so that $\|A\| = \rho(A)$.
4. For any $A$ and any $\epsilon > 0$, there exists an operator norm so that
$\rho(A) \le \|A\| \le \rho(A) + \epsilon$.
Proof. 1. If $Ax = \lambda x$ where $x \ne 0$ and $|\lambda| = \rho(A)$, then from
$\|Ax\| = |\lambda|\|x\|$ we see that $\|A\| \ge \rho(A)$.
2. $\|A\|_2 = \sqrt{\rho(A^T A)} = \sqrt{\rho(A^2)} = \rho(A)$.
3. First note that if $S \in \mathbb{R}^{n\times n}$ is nonsingular and $\|\cdot\|_0$ any vector
norm, then $\|x\| := \|Sx\|_0$ is another vector norm, and the associated matrix norms satisfy
$\|A\| = \|SAS^{-1}\|_0$. Now if $A$ is diagonalizable, then there exists $S$ nonsingular so that
$SAS^{-1}$ is a diagonal matrix with the eigenvalues of $A$ on the diagonal (the columns of
$S^{-1}$ are the eigenvectors of $A$). Hence if we apply the above relation beginning with the
$\infty$-norm for $\|\cdot\|_0$, we get $\|A\| = \rho(A)$.
4. The proof is similar in this case, but we use the Jordan canonical form to write
$SAS^{-1} = J$ where $J$ has the eigenvalues of $A$ on the diagonal, 0’s and $\epsilon$’s above
the diagonal, and 0’s everywhere else. (The usual Jordan canonical form is the case
$\epsilon = 1$, but if we conjugate a Jordan block by the matrix
$\operatorname{diag}(1, \epsilon, \epsilon^2, \ldots)$ the 1’s above the diagonal are changed to
$\epsilon$.) Thus for the matrix norm associated to $\|x\| := \|Sx\|_\infty$, we have
$\|A\| = \|J\|_\infty \le \rho(A) + \epsilon$.
Corollary 4.3. If $G$ is $C^1$ in a neighborhood of a fixed point $x_*$ and
$r = \rho(G'(x_*)) < 1$, the one-point iteration with iteration function $G$ is locally
convergent to $x_*$ with rate $r$.
Although we don’t need it immediately, we note another useful corollary of the preceding
theorem.
Corollary 4.4. Let $A \in \mathbb{R}^{n\times n}$. Then $\lim_{k\to\infty} A^k = 0$ if and only
if $\rho(A) < 1$, and in this case the convergence is linear with rate $\rho(A)$.
Proof. $\|A^k\| \ge \rho(A^k) = \rho(A)^k$, so if $\rho(A) \ge 1$, then $A^k$ does not converge
to 0. Conversely, if $\rho(A) < 1$, then for any $\bar\rho \in (\rho(A), 1)$ we can find an
operator norm so that $\|A\| \le \bar\rho$, and then $\|A^k\| \le \|A\|^k \le \bar\rho^k \to 0$.
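The gap between $\|A\|$ and $\rho(A)$, and the rescaling used in part 4 of the proof, can be checked numerically. The sketch below assumes NumPy; the matrix, the value of `eps`, and the variable names are illustrative choices, not taken from the text.

```python
import numpy as np

A = np.array([[0.5, 1.0],
              [0.0, 0.5]])                 # a Jordan block: rho(A) = 0.5, not diagonalizable
rho = max(abs(np.linalg.eigvals(A)))
norm_inf = np.linalg.norm(A, np.inf)       # ordinary infinity-norm, 1.5 here, so > 1

eps = 0.1
D = np.diag([1.0, eps])                    # conjugating by diag(1, eps) turns the 1 into eps
A_scaled = np.linalg.inv(D) @ A @ D
norm_scaled = np.linalg.norm(A_scaled, np.inf)   # equals rho + eps up to rounding

# Although ||A||_inf > 1, the powers of A still tend to zero because rho(A) < 1.
power_norm = np.linalg.norm(np.linalg.matrix_power(A, 50), np.inf)
```

Here `norm_scaled` is the norm of $A$ induced by the vector norm $\|x\| := \|D^{-1}x\|_\infty$, which is the construction of the proof with $S = D^{-1}$.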
Finally, let us consider the case $G'(x_*) = 0$. Then clearly the iteration is superlinearly
convergent. If $G$ is $C^2$, or, less, if $G'$ is Lipschitz, then we can show that the
convergence is in fact quadratic. First note that for any $C^1$ function $G$,
$$G(y) - G(x) - G'(x)(y - x) = \int_0^1 [G'(x + t(y - x)) - G'(x)]\,dt\,(y - x).$$
Hence, if $G'$ is Lipschitz,
$$\|G(y) - G(x) - G'(x)(y - x)\| \le \frac{C}{2}\|y - x\|^2,$$
where $C$ is the Lipschitz constant. Applying this with $x = x_*$ and $y = x_i$ and using the
fact that $G(x_*) = x_*$, $G'(x_*) = 0$, we get
$$\|x_{i+1} - x_*\| \le \frac{C}{2}\|x_i - x_*\|^2,$$
which is quadratic convergence. In the same way we can treat the case of $G$ with several
vanishing derivatives.
Theorem 4.5. Suppose that G maps a neighborhood of x∗ in Rn into Rn and that x∗
is a fixed point of G. Suppose also that all the derivatives of G of order up to p exist, are
Lipschitz continuous, and vanish at x∗ . Then the iteration xi+1 = G(xi ) is locally convergent
to x∗ with order p + 1.
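A quick numerical check of the quadratic case ($p = 1$): the sketch below uses the iteration function $G(x) = (x + 2/x)/2$, an example chosen here (not in the text) because $G(\sqrt 2) = \sqrt 2$ and $G'(\sqrt 2) = 0$.

```python
import math

def G(x):
    # Iteration function with fixed point sqrt(2) and G'(sqrt(2)) = 0.
    return 0.5 * (x + 2.0 / x)

x_star = math.sqrt(2.0)
x = 2.0
errors = [abs(x - x_star)]
for _ in range(4):
    x = G(x)
    errors.append(abs(x - x_star))

# The ratios e_{i+1} / e_i^2 stay bounded, the signature of quadratic convergence.
ratios = [errors[i + 1] / errors[i] ** 2 for i in range(3)]
```

The number of correct digits roughly doubles at each step, so four iterations already bring the error near machine precision.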
3. Newton’s method
An important example of a one-point iteration is Newton’s method for root-finding. Let
$F : \Omega \to \mathbb{R}^n$ be $C^1$ with $\Omega \subset \mathbb{R}^n$. We wish to find a root
$x_*$ of $F$ in $\Omega$. If $x_0 \in \Omega$ is an initial guess of the root, we approximate
$F$ by the linear part of its Taylor series near $x_0$:
$$F(x) \approx F(x_0) + F'(x_0)(x - x_0).$$
The left-hand side vanishes when $x$ is a root, so setting the right-hand side equal to zero
gives us an equation for a new approximate root, which we take to be $x_1$. Thus $x_1$ is
determined by the equation
$$F(x_0) + F'(x_0)(x_1 - x_0) = 0,$$
or, equivalently,
$$x_1 = x_0 - F'(x_0)^{-1}F(x_0).$$
Continuing in this way we get Newton’s method:
$$x_{i+1} = x_i - F'(x_i)^{-1}F(x_i), \quad i = 0, 1, \ldots.$$
(Of course it could happen that some $x_i \notin \Omega$ or that some $F'(x_i)$ is singular, in
which case Newton’s method breaks down. We shall see that under appropriate conditions this
doesn’t occur.)
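The iteration can be sketched in a few lines of Python. This is a minimal sketch assuming NumPy; the function name `newton`, the stopping test on $\|F(x_i)\|$, and the test problem (intersecting the unit circle with the line $x = y$) are choices made for the example.

```python
import numpy as np

def newton(F, J, x0, tol=1e-12, max_iter=50):
    """Newton's method x_{i+1} = x_i - F'(x_i)^{-1} F(x_i) for F: R^n -> R^n."""
    x = np.asarray(x0, dtype=float)
    for i in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            return x, i
        # Solve F'(x) dx = -F(x) rather than forming the inverse explicitly.
        dx = np.linalg.solve(J(x), -Fx)   # raises LinAlgError if F'(x) is singular
        x = x + dx
    raise RuntimeError("Newton's method did not converge")

# Hypothetical test problem: the unit circle intersected with the line x = y.
F = lambda v: np.array([v[0] ** 2 + v[1] ** 2 - 1.0, v[0] - v[1]])
J = lambda v: np.array([[2.0 * v[0], 2.0 * v[1]], [1.0, -1.0]])
root, n_iter = newton(F, J, [1.0, 0.5])
```

Solving the linear system at each step, instead of computing $F'(x_i)^{-1}$, is the usual design choice: it is cheaper and numerically better behaved.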
Note that Newton’s method is simply iteration of the function
$$G(x) = x - F'(x)^{-1}F(x).$$
Now if $x_*$ is a root of $F$ and $F'(x_*)$ is nonsingular (i.e., if $x_*$ is a simple root),
then $G$ is continuous in a neighborhood of $x_*$, and clearly $x_*$ is a fixed point of $G$.
We have that
$$G'(x) = I - K(x)F(x) - F'(x)^{-1}F'(x) = -K(x)F(x),$$
where $K$ is the derivative of the function $x \mapsto F'(x)^{-1}$ (this function maps a
neighborhood of $x_*$ in $\mathbb{R}^n$ into $\mathbb{R}^{n\times n}$). It is an easy (and
worthwhile) exercise to derive the formula for $K(x)$ in terms of $F'(x)$ and $F''(x)$, but we
don’t need it here. It suffices to note that $K$ exists and is Lipschitz continuous if $F'$ and
$F''$ are. In any case, we have that $G'(x_*) = 0$. Thus, assuming that $F$ is $C^2$ with $F''$
Lipschitz (e.g., if $F$ is $C^3$), we have all the hypotheses necessary for local quadratic
convergence. Thus we have proved:
Theorem 4.6. Suppose that $F : \Omega \to \mathbb{R}^n$, $\Omega \subset \mathbb{R}^n$, is $C^2$
with $F''$ Lipschitz continuous, and that $F(x_*) = 0$ and $F'(x_*)$ is nonsingular for some
$x_* \in \Omega$. Then if $x_0 \in \Omega$ is sufficiently close to $x_*$, the sequence of points
defined by Newton’s method is well-defined and converges quadratically to $x_*$.
The hypothesis that the root be simple is necessary for the quadratic convergence of Newton’s
method, as can easily be seen by a 1-dimensional example. However, the smoothness assumption
can be weakened. The following theorem requires only that $F'$ (rather than $F''$) be Lipschitz
continuous (which holds if $F$ is $C^2$). In the statement of the theorem any vector norm and
corresponding operator matrix norm can be used.
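One such 1-dimensional example, chosen here for illustration, is $f(x) = x^2$, whose root $x_* = 0$ is double. The sketch below compares it with the simple root of $f(x) = x^2 - 1$; the helper name `newton_scalar` is hypothetical.

```python
def newton_scalar(f, df, x0, steps):
    """Return the Newton iterates x_0, ..., x_steps for a scalar equation f(x) = 0."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / df(x))
    return xs

# f(x) = x^2 has the double root x* = 0 (f'(x*) = 0); Newton reduces to x_{i+1} = x_i / 2,
# so the convergence is only linear with rate 1/2.
double = newton_scalar(lambda x: x * x, lambda x: 2.0 * x, 1.0, 10)

# f(x) = x^2 - 1 has the simple root x* = 1; the error is roughly squared at each step.
simple = newton_scalar(lambda x: x * x - 1.0, lambda x: 2.0 * x, 2.0, 6)
```

After ten steps the first sequence has only reached $2^{-10}$, while the second is accurate to machine precision after six.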
Theorem 4.7. Suppose that $F(x_*) = 0$ and that $F'$ is Lipschitz continuous with Lipschitz
constant $\gamma$ in a ball of radius $r$ around $x_*$. Also suppose that $F'(x_*)$ is
nonsingular with $\|F'(x_*)^{-1}\| \le \beta$. If $\|x_0 - x_*\| \le \min[r, 1/(2\beta\gamma)]$,
then the sequence determined by Newton’s method is well-defined, converges to $x_*$, and
satisfies
$$\|x_{i+1} - x_*\| \le \beta\gamma\|x_i - x_*\|^2.$$
Proof. First we show that $F'(x_0)$ is nonsingular. Indeed,
$$\|F'(x_*)^{-1}[F'(x_0) - F'(x_*)]\| \le \beta\gamma\|x_0 - x_*\| \le 1/2,$$
from which follows the nonsingularity and the estimate
$$\|F'(x_0)^{-1}\| \le \frac{1}{1 - 1/2}\|F'(x_*)^{-1}\| \le 2\beta.$$
Thus $x_1$ is well-defined and
$$\begin{aligned}
x_1 - x_* &= x_0 - x_* - F'(x_0)^{-1}F(x_0) \\
&= x_0 - x_* - F'(x_0)^{-1}[F(x_0) - F(x_*)] \\
&= F'(x_0)^{-1}[F(x_*) - F(x_0) - F'(x_0)(x_* - x_0)].
\end{aligned}$$
We have previously bounded the norm of the bracketed quantity by $\gamma\|x_* - x_0\|^2/2$ and
$\|F'(x_0)^{-1}\| \le 2\beta$, so
$$\|x_1 - x_*\| \le \beta\gamma\|x_0 - x_*\|^2.$$
This is the kind of quadratic bound we need, but first we need to show that the $x_i$ are indeed
converging to $x_*$. Using again that $\|x_0 - x_*\| \le 1/(2\beta\gamma)$, we have the linear
estimate
$$\|x_1 - x_*\| \le \|x_* - x_0\|/2.$$
Thus $x_1$ also satisfies $\|x_1 - x_*\| \le \min[r, 1/(2\beta\gamma)]$, and the identical
argument shows
$$\|x_2 - x_*\| \le \beta\gamma\|x_1 - x_*\|^2 \quad\text{and}\quad \|x_2 - x_*\| \le \|x_1 - x_*\|/2.$$
Continuing in this way we get the theorem.
The theorem gives a precise sufficient condition on how close the initial iterate $x_0$ must
be to $x_*$ to ensure convergence. Of course it is not a condition that one can apply
practically, since one cannot check if $x_0$ satisfies it without knowing $x_*$. There are
several variant results which weaken the hypotheses necessary to show quadratic convergence for
Newton’s method. A well-known, but rather complicated one is Kantorovich’s theorem (1948).
Unlike the above theorems, it does not assume the existence of a root $x_*$ of $F$, but rather
states that if an initial point $x_0$ satisfies certain conditions, then there is a root, and
Newton’s method will converge to it. Basically it states: if $F'$ is Lipschitz near $x_0$ and
nonsingular at $x_0$, and if the value of $F(x_0)$ is sufficiently small (how small depending on
the Lipschitz constant for $F'$ and the norm of $F'(x_0)^{-1}$), then Newton’s method beginning
from $x_0$ is well-defined and converges quadratically to a root $x_*$. The exact statement is
rather complicated, so I’ll omit it. In principle, one could pick a starting iterate $x_0$,
compute the norms of $F(x_0)$ and $F'(x_0)^{-1}$, and check whether they fulfil the conditions of
Kantorovich’s theorem (if one knew a bound for the Lipschitz constant of $F'$ in a neighborhood
of $x_0$), and thus tell in advance whether Newton’s method would converge. In practice this is
difficult to do and would rule out many acceptable choices of initial guess, so it is rarely
used.
4. Quasi-Newton methods
Each iteration of Newton’s method requires the following operations: evaluate the function
$F$ at the current approximation, evaluate the derivative $F'$ at the current approximation,
solve a system of equations with the latter as matrix and the former as right-hand side, and
update the approximation. The evaluation of $F'$ and the linear solve are often the most
expensive parts. In some applications, no formula for $F'$ is available, and exact evaluation
of $F'$ is not possible. There are many variations of Newton’s method that attempt to maintain
good local convergence properties while avoiding the evaluation of $F'$, and/or simplifying
the linear solve step. We shall refer to all of these as quasi-Newton methods, although some
authors restrict that term to specific types of modification to Newton’s method.
Consider the iteration
$$x_{i+1} = x_i - B_i^{-1}F(x_i),$$
where for each $i$, $B_i$ is a nonsingular matrix to be specified. If $B_i = F'(x_i)$, this is
Newton’s method. The following theorem states that if $B_i$ is sufficiently close to $F'(x_i)$
then this method is still locally convergent. With a stronger hypothesis on the closeness of
$B_i$ to $F'(x_i)$ the convergence is quadratic. Under a somewhat weaker hypothesis, the method
still converges superlinearly.
Theorem 4.8. Suppose $F'$ is Lipschitz continuous in a neighborhood of a root $x_*$ and that
$F'(x_*)$ is nonsingular.
1. Then there exists $\delta > 0$ such that if $\|B_i - F'(x_i)\| \le \delta$ and
$\|x_0 - x_*\| \le \delta$, then the generalized Newton iterates are well-defined by the above
formula, and converge to $x_*$.
2. If further $\|B_i - F'(x_i)\| \to 0$, then the convergence is superlinear.
3. If there is a constant $c$ such that $\|B_i - F'(x_i)\| \le c\|F(x_i)\|$, then the
convergence is quadratic.
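Proof. Set $\beta = \|F'(x_*)^{-1}\| < \infty$. Choosing $\delta$ small enough, we can easily
achieve
$$\|x - x_*\| \le \delta,\ \|B - F'(x)\| \le \delta \implies \|B^{-1}\| \le 2\beta.$$
Let $\gamma$ be a Lipschitz constant for $F'$. Decreasing $\delta$ if necessary we can further
achieve
$$2\beta(\gamma/2 + 1)\delta \le 1/2.$$
Now let $x_0$ and $B_0$ be chosen in accordance with this $\delta$. Then
$$\begin{aligned}
x_1 - x_* &= x_0 - x_* - B_0^{-1}F(x_0) = B_0^{-1}[F(x_*) - F(x_0) - B_0(x_* - x_0)] \\
&= B_0^{-1}\int_0^1 [F'((1 - t)x_0 + tx_*) - B_0]\,dt\,(x_* - x_0).
\end{aligned}$$
Now $\|F'((1 - t)x_0 + tx_*) - B_0\| \le \gamma t\|x_0 - x_*\| + \delta$, by the triangle
inequality, the Lipschitz condition, and the condition on $B_0$. Thus
$$\|x_1 - x_*\| \le 2\beta(\gamma\|x_0 - x_*\|/2 + \delta)\|x_0 - x_*\|
\le 2\beta(\gamma/2 + 1)\delta\,\|x_0 - x_*\| \le \|x_0 - x_*\|/2.$$
In particular $\|x_1 - x_*\| \le \delta$, so the argument can be repeated, and the iterates are
well-defined and converge to $x_*$. Keeping $\|B_i - F'(x_i)\|$ in place of its bound $\delta$,
the same computation gives
$$\|x_{i+1} - x_*\| \le 2\beta(\gamma\|x_i - x_*\|/2 + \|B_i - F'(x_i)\|)\|x_i - x_*\|,$$
which gives superlinearity under the additional hypothesis of (2). From the additional
hypothesis of (3), we get
$$\|B_i - F'(x_i)\| \le c\|F(x_i) - F(x_*)\| \le c'\|x_i - x_*\|,$$
which gives the quadratic convergence.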
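One simple way to obtain matrices $B_i$ close to $F'(x_i)$ without a formula for $F'$ is a forward-difference approximation. The following is a minimal sketch assuming NumPy; the names `fd_jacobian` and `quasi_newton`, the step `h`, and the test problem are assumptions made for illustration and are not prescribed by the theorem.

```python
import numpy as np

def fd_jacobian(F, x, h):
    """Forward-difference approximation B of F'(x); the error is O(h) if F' is Lipschitz."""
    Fx = F(x)
    B = np.empty((Fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        B[:, j] = (F(x + step) - Fx) / h
    return B

def quasi_newton(F, x0, h=1e-6, tol=1e-10, max_iter=50):
    """Iterate x_{i+1} = x_i - B_i^{-1} F(x_i) with B_i a finite-difference Jacobian."""
    x = np.asarray(x0, dtype=float)
    for i in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            return x, i
        x = x - np.linalg.solve(fd_jacobian(F, x, h), Fx)
    raise RuntimeError("quasi-Newton iteration did not converge")

# Hypothetical test problem: the unit circle intersected with the line x = y.
F = lambda v: np.array([v[0] ** 2 + v[1] ** 2 - 1.0, v[0] - v[1]])
root, n_iter = quasi_newton(F, [1.0, 0.5])
```

With a fixed `h` the bound $\|B_i - F'(x_i)\| \le \delta$ of part 1 holds with $\delta = O(h)$, so in exact arithmetic the theorem guarantees local convergence but not superlinearity; letting `h` shrink with $\|F(x_i)\|$ would correspond to part 3.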
5. Broyden’s method
Broyden’s method (published by C. G. Broyden in 1965) is an important example of
a quasi-Newton method. It is one possible generalization to n-dimensions of the secant
method. For a single nonlinear equation, the secant method replaces $f'(x_i)$ in Newton’s
method with the approximation $[f(x_i) - f(x_{i-1})]/(x_i - x_{i-1})$, to obtain the iteration
$$x_{i+1} = x_i - \frac{x_i - x_{i-1}}{f(x_i) - f(x_{i-1})}\,f(x_i).$$
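For reference, the scalar secant iteration can be sketched as follows. The function name `secant`, the stopping rule, and the test equation $x^3 - 2 = 0$ are illustrative assumptions.

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant iteration: Newton's method with f'(x_i) replaced by a difference quotient."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:
            raise ZeroDivisionError("secant slope vanished")
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        # Shift the pair (x_{i-1}, x_i) forward, reusing the stored function values.
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    raise RuntimeError("secant method did not converge")

root = secant(lambda x: x ** 3 - 2.0, 1.0, 2.0)   # approximates the cube root of 2
```

Only one new function evaluation is needed per step, and no derivative is ever evaluated.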
Of course, we cannot directly generalize this idea to $\mathbb{R}^n$, since we can’t divide by
the vector $x_i - x_{i-1}$. Instead, we can consider the equation
$$B_i(x_i - x_{i-1}) = F(x_i) - F(x_{i-1}).$$
However, this does not determine the matrix $B_i$, only its action on multiples of
$x_i - x_{i-1}$. To complete the specification of $B_i$, Broyden’s method sets the action on
vectors orthogonal to $x_i - x_{i-1}$ to be the same as $B_{i-1}$. Broyden’s method is an update
method in the sense that $B_i$ is determined as a modification of $B_{i-1}$.
In order to implement Broyden’s method, we note:
Choose $x_0 \in \mathbb{R}^n$, $B_0 \in \mathbb{R}^{n\times n}$
for $i = 0, 1, \ldots$
    $x_{i+1} = x_i - B_i^{-1}F(x_i)$
    $s_i = x_{i+1} - x_i$
    $v_i = F(x_{i+1}) - F(x_i)$
    $B_{i+1} = B_i + \frac{1}{s_i^T s_i}(v_i - B_i s_i)s_i^T$
end
Choose $x_0 \in \mathbb{R}^n$, $H_0 \in \mathbb{R}^{n\times n}$
for $i = 0, 1, \ldots$
    $x_{i+1} = x_i - H_i F(x_i)$
    $s_i = x_{i+1} - x_i$
    $v_i = F(x_{i+1}) - F(x_i)$
    $H_{i+1} = H_i + \frac{1}{s_i^T H_i v_i}(s_i - H_i v_i)s_i^T H_i$
end
Note that if $H_0 = B_0^{-1}$ this algorithm is mathematically equivalent to the basic Broyden
algorithm.
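The second form of the algorithm translates almost line by line into code. The sketch below assumes NumPy; the function name `broyden`, the stopping test on $\|F(x_i)\|$, and the test problem with $H_0 = F'(x_0)^{-1}$ are assumptions made for the example.

```python
import numpy as np

def broyden(F, x0, H0, tol=1e-10, max_iter=100):
    """Broyden's method storing H_i = B_i^{-1}, updated by a rank-one formula."""
    x = np.asarray(x0, dtype=float)
    H = np.array(H0, dtype=float)
    Fx = F(x)
    for i in range(max_iter):
        if np.linalg.norm(Fx) < tol:
            return x, i
        s = -H @ Fx                       # s_i = x_{i+1} - x_i
        x = x + s
        F_new = F(x)
        v = F_new - Fx                    # v_i = F(x_{i+1}) - F(x_i)
        Hv = H @ v
        # H_{i+1} = H_i + (s_i - H_i v_i) s_i^T H_i / (s_i^T H_i v_i)
        H = H + np.outer(s - Hv, s @ H) / (s @ Hv)
        Fx = F_new
    raise RuntimeError("Broyden's method did not converge")

# Hypothetical test problem: the unit circle intersected with the line x = y.
F = lambda v: np.array([v[0] ** 2 + v[1] ** 2 - 1.0, v[0] - v[1]])
x0 = np.array([1.0, 0.5])
H0 = np.linalg.inv(np.array([[2.0, 1.0], [1.0, -1.0]]))   # inverse of F'(x0)
root, n_iter = broyden(F, x0, H0)
```

Each step costs one evaluation of $F$ and a few matrix-vector products; no derivative evaluation and no linear solve is needed after the initial $H_0$.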
5.1. Convergence of Broyden’s method. Denote by $x_*$ the solution of $F(x_*) = 0$,
and let $x_i$ and $B_i$ denote the sequences of vectors and matrices produced by Broyden’s
method. Set
$$e_i = x_i - x_*, \qquad M_i = B_i - F'(x_*).$$
Roughly speaking, the keys to the convergence of Broyden’s method are the facts that (1)
$e_{i+1}$ will be small compared to $e_i$ if $M_i$ is not large, and (2) $M_{i+1}$ will not be
much larger than $M_i$ if the $e_i$’s are small. Precise results will be based on the following
identities, which follow directly from the definitions of $x_i$ and $B_i$,
$$e_{i+1} = -B_i^{-1}[F(x_i) - F(x_*) - F'(x_*)(x_i - x_*)] + B_i^{-1}M_i e_i, \tag{4.2}$$
$$M_{i+1} = M_i\Big(I - \frac{1}{s_i^T s_i}s_i s_i^T\Big) + \frac{1}{s_i^T s_i}(v_i - F'(x_*)s_i)s_i^T. \tag{4.3}$$
Our first result gives the local convergence of Broyden’s method, with a rate of convergence
that is at least linear. The norms are all the 2-norm.
Theorem 4.11. Let $F$ be differentiable in a ball $\Omega$ about a root $x_* \in \mathbb{R}^n$,
with derivative $F'$ having Lipschitz constant $\gamma$ on the ball. Suppose that $F'(x_*)$ is
invertible, with $\|F'(x_*)^{-1}\| \le \beta$. Let $x_0 \in \Omega$ and
$B_0 \in \mathbb{R}^{n\times n}$ be given satisfying
$$\|M_0\| + 2\gamma\|e_0\| \le \frac{1}{8\beta}.$$
Then the iterates $x_i$, $B_i$ given by Broyden’s method are well defined, and the errors
satisfy
$$\|e_{i+1}\| \le \|e_i\|/2, \quad i = 0, 1, \ldots.$$
Proof. Claim 1: If $x_i$ and $B_i$ are well-defined and $\|M_i\| \le 1/(2\beta)$, then $B_i$ is
invertible (so $x_{i+1}$ is well-defined), and
$$\|B_i^{-1}\| \le 2\beta, \qquad \|e_{i+1}\| \le (\gamma\beta\|e_i\| + 2\beta\|M_i\|)\|e_i\|.$$
Indeed, $F'(x_*)^{-1}B_i = I + F'(x_*)^{-1}M_i$, and since $\|M_i\| \le 1/(2\beta)$,
$\|F'(x_*)^{-1}M_i\| \le 1/2$, so $B_i$ is invertible with $\|B_i^{-1}\| \le 2\beta$. Therefore,
$x_{i+1}$ is well-defined. The estimate on $\|e_{i+1}\|$ follows easily from the bound on
$B_i^{-1}$ and (4.2).
Note that from claim 1 and the hypotheses of the theorem we know that $x_1$ is well-defined
and $\|e_1\| \le \|e_0\|/2$.
Claim 2: If $B_0, \ldots, B_i$ are defined and invertible, then
$$\|M_{i+1}\| \le \|M_i\| + \gamma\max(\|e_i\|, \|e_{i+1}\|).$$
To prove this, we use (4.3). The first term on the right-hand side is the product of $M_i$ with
the orthogonal projection onto the orthogonal complement of $s_i$, so its 2-norm is bounded
by $\|M_i\|$. For the second term, note that
$$v_i - F'(x_*)s_i = \int_0^1 [F'((1 - t)x_{i+1} + tx_i) - F'(x_*)]\,dt\,s_i.$$
Since $\|F'((1 - t)x_{i+1} + tx_i) - F'(x_*)\| \le \gamma\max(\|e_i\|, \|e_{i+1}\|)$,
$$\|v_i - F'(x_*)s_i\| \le \gamma\max(\|e_i\|, \|e_{i+1}\|)\|s_i\|,$$
and the second term on the right-hand side of (4.3) is bounded in norm by
$$\gamma\max(\|e_i\|, \|e_{i+1}\|),$$
which establishes the claim.
We are now ready to prove the theorem. We shall show, by induction on $i$, that
$x_0, \ldots, x_{i+1}$ are well-defined and
$$\|e_i\| \le \frac{1}{8\gamma\beta}, \qquad \|M_i\| \le \frac{1}{8\beta}, \qquad \|e_{i+1}\| \le \|e_i\|/2.$$
This is clearly true for $i = 0$. Assuming it true for $i$ and all smaller indices, we
immediately get the first inequality with $i$ replaced by $i + 1$. Using claim 2 repeatedly
(and noting that $\|e_{i+1}\| \le \|e_i\| \le \cdots$), we have
$$\begin{aligned}
\|M_{i+1}\| &\le \|M_0\| + \gamma(\|e_0\| + \|e_1\| + \cdots + \|e_{i+1}\|) \\
&\le \|M_0\| + \gamma\|e_0\|(1 + 1/2 + \cdots) = \|M_0\| + 2\gamma\|e_0\| \le 1/(8\beta),
\end{aligned}$$
which establishes the second inequality, and then applying claim 1 gives the third inequality.
Notice that the constant 1/2 in the linear convergence estimate arose from the proof
rather than anything inherent to Broyden’s method. Rearranging the proof, one could
change this constant to any positive number. Thus the convergence is actually superlinear.
It would be natural to try to prove this as an application of Theorem 4.8, but this is not
possible, because it can be shown by example that $B_i$ need not converge to $F'(x_*)$. The
superlinear convergence of Broyden’s method was first proved in 1973 by Broyden, Dennis,
and Moré. They proved slightly more, namely that $\|x_{i+1} - x_*\| \le r_i\|x_i - x_*\|$ where
$r_i \to 0$.
6. Unconstrained minimization
We now turn to the problem of minimizing a real-valued function F defined on Rn . (The
problem of minimizing F over a subset of Rn , e.g., a subspace or submanifold, is known as
constrained minimization and is an important subject, which, however, we will not consider
in this course.) We shall sometimes refer to F as the cost function. Usually we will have
to content ourselves with finding a local minimum of the cost function since most methods
cannot distinguish local from global minima. Note that the word “local” comes up in two
distinct senses when describing the behavior of minimization methods: methods are often
only locally convergent (they converge only for initial iterate x0 sufficiently near x∗ ), and
often the limit x∗ is only a local minimum of the cost function.
If $F : \mathbb{R}^n \to \mathbb{R}$ is smooth, then at each $x$ its gradient $F'(x)$ is a row
vector and its Hessian $F''(x)$ is a symmetric matrix. If $F$ achieves a local minimum at
$x_*$, then $F'(x_*) = 0$ and $F''(x_*)$ is positive semidefinite. Moreover, if $F'(x_*) = 0$
and $F''(x_*)$ is positive definite, then $F$ definitely achieves a local minimum at $x_*$.
There is a close connection between the problem of minimizing a smooth real-valued function
of $n$ variables and that of finding a root of an $n$-vector-valued function of $n$ variables.
Namely, if $x_*$ is a minimizer of $F : \mathbb{R}^n \to \mathbb{R}$, then $x_*$ is a root of
$F' : \mathbb{R}^n \to \mathbb{R}^n$. Another connection is that a point is a root of the
function $K : \mathbb{R}^n \to \mathbb{R}^n$ if and only if it is a minimizer of
$F(x) = \|K(x)\|^2$ (we usually use the 2-norm or a weighted 2-norm for this purpose, since
then $F(x)$ is smooth if $K$ is).
7. Newton’s method
The idea of Newton’s method for minimization problems is to approximate $F(x)$ by its
quadratic Taylor polynomial, and minimize that. Thus
$$F(x) \approx F(x_i) + F'(x_i)(x - x_i) + \frac{1}{2}(x - x_i)^T F''(x_i)(x - x_i).$$
The quadratic on the right-hand side achieves a unique minimum value if and only if the
matrix $F''(x_i)$ is positive definite, and in that case the minimum is given by the solution to
the equation
$$F''(x_i)(x - x_i) + F'(x_i)^T = 0.$$
Thus we are led to the iteration
$$x_{i+1} = x_i - F''(x_i)^{-1}F'(x_i)^T.$$
Note that this is exactly the same as Newton’s method for solving the equation $F'(x) = 0$.
Thus we know that this method is locally quadratically convergent (to a root of $F'$, which
might be only a local minimum of $F$).
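A minimal sketch of this iteration, assuming NumPy and a user-supplied gradient (as a column vector) and Hessian. The function name `newton_minimize` and the convex test function $F(x, y) = e^x - x + y^2$ are hypothetical choices for illustration.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Iterate x_{i+1} = x_i - F''(x_i)^{-1} F'(x_i)^T until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for i in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, i
        x = x - np.linalg.solve(hess(x), g)
    raise RuntimeError("Newton iteration did not converge")

# Hypothetical smooth convex cost F(x, y) = exp(x) - x + y^2, minimized at (0, 0).
grad = lambda v: np.array([np.exp(v[0]) - 1.0, 2.0 * v[1]])
hess = lambda v: np.array([[np.exp(v[0]), 0.0], [0.0, 2.0]])
x_min, n_iter = newton_minimize(grad, hess, [1.0, 1.0])
```

Since the cost is quadratic in $y$, that component is exact after one step, while the $x$ component shows the quadratic convergence of Newton’s method applied to $F'(x) = 0$.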
Newton’s method for minimization requires the construction and “inversion” of the entire
Hessian matrix. Thus, as for systems, there is motivation for using quasi-Newton methods in
which the Hessian is only approximated. In addition, there is the fact that Newton’s method
is only locally convergent. We shall return to both of these points below.
There is a great deal of freedom in choosing the direction and the step length. The major
criterion for the search direction is that a lower value of $F$ than $F(x_i)$ exist nearby on
the line $x_i + \lambda s_i$. We may as well assume that the step $\lambda_i$ is positive, in
which case this criterion is that $s_i$ is a descent direction, i.e., that
$F(x_i + \lambda s_i)$ decreases as $\lambda$ increases from 0. In terms of derivatives this
condition is that $F'(x_i)s_i < 0$. Geometrically, this means that $s_i$ should make an acute
angle with the negative gradient vector $-F'(x_i)^T$. An obvious choice is $s_i = -F'(x_i)^T$
(or $-F'(x_i)^T/\|F'(x_i)\|$ if we normalize), the direction of steepest descents.
For the choice of step length, one possibility is exact line search. This means that
$\lambda_i$ is chosen to minimize $F(x_i + \lambda s_i)$ as a function of $\lambda$. In
combination with the steepest descent direction we get the method of steepest descents:
    choose $\lambda_i > 0$ minimizing $F(x_i - \lambda F'(x_i)^T)$ for $\lambda > 0$
    set $x_{i+1} = x_i - \lambda_i F'(x_i)^T$
This method can be shown to be globally convergent to a local minimizer under fairly
general circumstances. However, it may not be fast. To understand the situation better
consider the minimization of a quadratic functional $F(x) = x^TAx/2 - x^Tb$ where
$A \in \mathbb{R}^{n\times n}$ is symmetric positive definite and $b \in \mathbb{R}^n$. The
unique minimizer of $F$ is then the solution $x_*$ to $Ax = b$. In this case, the descent
direction at any point $x$ is simply $-F'(x)^T = b - Ax$, the residual. Moreover, for any search
direction $s$, the step length $\lambda$ minimizing $F(x + \lambda s)$ (exact line search) can
be computed analytically in this case:
$$F(x + \lambda s) = \frac{\lambda^2}{2}s^TAs + \lambda(s^TAx - s^Tb) + \frac{1}{2}x^TAx - x^Tb,$$
$$\frac{d}{d\lambda}F(x + \lambda s) = \lambda s^TAs + s^T(Ax - b),$$
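so that at the minimum $\lambda = s^T(b - Ax)/s^TAs$, and if $s = b - Ax$, the direction of
steepest descent, $\lambda = s^Ts/s^TAs$. Thus the steepest descent algorithm for minimizing
$x^TAx/2 - x^Tb$, i.e., for solving $Ax = b$, is
choose initial iterate $x_0$
for $i = 0, 1, \ldots$
    $r_i = b - Ax_i$
    $\lambda_i = \frac{r_i^T r_i}{r_i^T A r_i}$
    $x_{i+1} = x_i + \lambda_i r_i$
end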
It can be shown that this algorithm is globally convergent to the unique solution $x_*$ as
long as the matrix $A$ is positive definite. However the convergence order is only linear and
the rate is $(\kappa - 1)/(\kappa + 1)$ where $\kappa = \kappa(A)$ is the 2-norm condition
number of $A$, i.e., the ratio of the largest to the smallest eigenvalues of $A$. Thus the
convergence will be very slow if $A$ is not well-conditioned.
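The slow convergence is easy to observe numerically. The sketch below, assuming NumPy, implements steepest descent with exact line search for the quadratic functional; the function name `steepest_descent`, the tolerance, and the $2 \times 2$ test matrix with $\kappa = 10$ are illustrative assumptions.

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10000):
    """Steepest descent with exact line search for F(x) = x^T A x / 2 - x^T b."""
    x = np.asarray(x0, dtype=float)
    for i in range(max_iter):
        r = b - A @ x                     # residual = steepest descent direction
        if np.linalg.norm(r) < tol:
            return x, i
        lam = (r @ r) / (r @ (A @ r))     # exact line search step
        x = x + lam * r
    raise RuntimeError("steepest descent did not converge")

A = np.array([[10.0, 0.0], [0.0, 1.0]])   # condition number kappa = 10
b = np.array([1.0, 1.0])
x, n_iter = steepest_descent(A, b, np.zeros(2))

kappa = np.linalg.cond(A)
rate_bound = (kappa - 1.0) / (kappa + 1.0)   # predicted linear rate
```

Even for this tiny, mildly ill-conditioned system the iteration needs many steps to reduce the residual to the tolerance, in line with a linear rate close to $(\kappa - 1)/(\kappa + 1) = 9/11$.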
This highlights a weakness of the steepest descent direction. It will be even more
pronounced for a difficult non-quadratic cost function, such as Rosenbrock’s example in
$\mathbb{R}^2$
$$F(x, y) = (y - x^2)^2 + 0.01(1 - x)^2.$$
Figure 4.2. Some contours of the Rosenbrock function. Minimum is at (1, 1).
While exact line search is possible for quadratic cost functions, in general it is a scalar
minimization problem which can be expensive or impossible to solve. Moreover, as illustrated
by the performance of steepest descents above, since the minimum may not be very near
the search line, it is often not worth the effort to search too carefully. Thus many methods
incorporate more or less sophisticated approximate line search algorithms. As we shall see,
it is possible to devise an approximate line search method which, when used in conjunction
with a reasonable choice of search direction, is globally convergent.
We begin our analysis with a simple calculus lemma.
Lemma 4.12. Let $f : \mathbb{R} \to \mathbb{R}$ be $C^1$ and bounded below and suppose that
$f'(0) < 0$. For any $0 < \alpha < 1$ there exists a non-empty open interval
$J \subset (0, \infty)$ such that
$$f(x) < f(0) + \alpha x f'(0), \qquad f'(x) > \alpha f'(0), \tag{4.4}$$
for all $x \in J$.
Proof. Since $f'(0) < 0$ and $0 < \alpha < 1$, we have $0 > \alpha f'(0) > f'(0)$. Thus the
line $y = f(0) + \alpha f'(0)x$ lies above the curve $y = f(x)$ for sufficiently small positive
$x$. But, since $f$ is bounded below, the line lies below the curve for $x$ sufficiently large.
Thus $x_1 := \inf\{\, x > 0 \mid f(x) \ge f(0) + \alpha f'(0)x \,\} > 0$. Choose any
$0 < x_0 < x_1$. By the mean value theorem there exists $x$ between $x_0$ and $x_1$ such that
$$f'(x) = \frac{f(x_1) - f(x_0)}{x_1 - x_0}.$$
For this point $x$ we clearly have (4.4), and by continuity they must hold on an open interval
around the point.
Now suppose we use a line search method subject to the following restrictions on the
search directions $s_i$ and the step lengths $\lambda_i$. We suppose that there exist constants
$\eta \in (0, 1]$, $\alpha \in (0, 1)$, and $\beta \in (0, 1)$ such that for all $i$:
(H1) $-F'(x_i)s_i \ge \eta\|F'(x_i)\|\|s_i\|$
(H2) $F(x_i + \lambda_i s_i) \le F(x_i) + \alpha\lambda_i F'(x_i)s_i$
(H3) $F'(x_i + \lambda_i s_i)s_i \ge \beta F'(x_i)s_i$
We shall show below that any line-search method meeting these conditions is, essentially,
globally convergent. Before doing so, let us discuss the three conditions. The first condition
concerns the choice of search direction. If $\eta = 0$ were permitted it would say that the
search direction is a direction of non-ascent. By insisting on $\eta$ positive we ensure that
the search direction is a direction of descent ($F(x_i + \lambda s_i)$ is a decreasing function
of $\lambda$ at $\lambda = 0$). However the condition also enforces a uniformity with respect to
$i$. Specifically, it says that the angle between $s_i$ and the steepest descent direction
$-F'(x_i)^T$ is bounded above by $\arccos\eta < \pi/2$. The steepest descent direction satisfies
(H1) for all $\eta \le 1$ and so if $\eta < 1$ there is an open set of directions satisfying
this condition. If the Hessian is positive definite, then the Newton direction
$-F''(x_i)^{-1}F'(x_i)^T$ satisfies (H1) for $\eta \le 1/\kappa_2(F''(x_i))$, with $\kappa_2$
the condition number with respect to the 2-norm, i.e., the ratio of the largest to smallest
eigenvalues (verify!). One possible strategy to obtain the fast local convergence of Newton’s
method without sacrificing global convergence is to use the Newton direction for $s_i$ whenever
it satisfies (H1) (so whenever $F''(x_i)$ is positive definite and not too badly conditioned),
otherwise to use steepest descents. A better approach when the Newton direction fails
(H1) may be to use a convex combination of the Newton direction and the steepest descent
direction: $s_i = -[\theta F''(x_i)^{-1}F'(x_i)^T + (1 - \theta)F'(x_i)^T]$, which will satisfy
(H1) if $\theta > 0$ is small enough. Or similarly, one can take
$s_i = -[F''(x_i) + \nu I]^{-1}F'(x_i)^T$ with $\nu$ large enough to ensure that the bracketed
matrix is positive definite. This is the Levenberg–Marquardt search direction.
Conditions (H2) and (H3) concern the step length. Roughly, (H2) ensures that it is not
too large, and in particular ensures that $F(x_{i+1}) < F(x_i)$. It is certainly satisfied if
$\lambda_i$ is sufficiently small. On the other hand (H3) ensures that the step is not too
small, since it is not satisfied for $\lambda_i = 0$. It is however satisfied at a minimizing
$\lambda_i$ if one exists. The lemma assures us that if $0 < \alpha < \beta < 1$, then there is
an open interval of values of $\lambda_i$ satisfying (H2) and (H3), and hence it is possible to
design line-search algorithms which find a suitable $\lambda_i$ in a finite number of steps.
See, e.g., R. Fletcher, Practical Methods of Optimization or J. Dennis & R. Schnabel, Numerical
methods for unconstrained optimization and nonlinear equations. Fletcher also discusses typical
choices for $\alpha$ and $\beta$. Typically $\beta$ is fixed somewhere between 0.9 and 0.1, the
former resulting in a faster line search while the latter in a more exact line search. Fletcher
says that $\alpha$ is generally taken to be quite small, e.g., 0.01, but that the value of
$\alpha$ is not important in most cases, since it is usually the value of $\beta$ which
determines point acceptability.
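One simple way to find a step satisfying (H2) and (H3) is a bracketing and bisection search along the line. The sketch below is not the algorithm from the references above; it is a minimal illustration written in terms of the scalar function $\varphi(\lambda) = F(x_i + \lambda s_i)$, and the names `wolfe_step`, `phi`, and `dphi` are assumptions made for the example.

```python
import math

def wolfe_step(phi, dphi, alpha=0.01, beta=0.9, max_iter=100):
    """Find lam > 0 satisfying (H2) and (H3) for phi(lam) = F(x_i + lam * s_i).

    (H2): phi(lam) <= phi(0) + alpha * lam * phi'(0)
    (H3): phi'(lam) >= beta * phi'(0)
    Assumes phi'(0) < 0 and 0 < alpha < beta < 1.
    """
    phi0, dphi0 = phi(0.0), dphi(0.0)
    lo, hi, lam = 0.0, math.inf, 1.0
    for _ in range(max_iter):
        if phi(lam) > phi0 + alpha * lam * dphi0:
            hi = lam                      # (H2) fails: the step is too long
        elif dphi(lam) < beta * dphi0:
            lo = lam                      # (H3) fails: the step is too short
        else:
            return lam
        # Expand while no upper bound is known, otherwise bisect the bracket.
        lam = 2.0 * lo if math.isinf(hi) else 0.5 * (lo + hi)
    raise RuntimeError("no acceptable step found")

phi = lambda lam: (lam - 3.0) ** 2        # hypothetical cost along the search line
dphi = lambda lam: 2.0 * (lam - 3.0)
lam = wolfe_step(phi, dphi, alpha=0.01, beta=0.1)
```

With the small value $\beta = 0.1$ used in this call, the trial steps 1 and 2 are rejected by (H3) as too short before a later trial step is accepted, which illustrates how $\beta$ controls the exactness of the search.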
We now state the global convergence theorem.
Theorem 4.13. Suppose that $F : \mathbb{R}^n \to \mathbb{R}$ is $C^1$ and bounded below, that
$x_0 \in \mathbb{R}^n$ is such that $\{\, x \in \mathbb{R}^n : F(x) \le F(x_0) \,\}$ is bounded,
and that $x_1, x_2, \ldots$ is defined by a line search method with descent search directions
and positive step lengths satisfying the three conditions above. Then
$\lim_{i\to\infty} F(x_i)$ exists and $\lim_{i\to\infty} F'(x_i) = 0$.
Hence for $i \in S$,
$$\|F'(x_{i+1}) - F'(x_i)\| \ge \eta(1 - \beta) > 0,$$
which gives the contradiction.
The next theorem shows that if the point xi is sufficiently close to a minimum, then
choosing si to be the Newton direction and λi = 1 satisfies (H1)–(H3). This means that
it is possible to construct algorithms which are globally convergent, but which are also
quadratically convergent, since they eventually coincide with Newton’s method.
Theorem 4.14. Suppose that $F$ is smooth, $x_*$ is a local minimum of $F$, and $F''(x_*)$
is positive definite. Let $0 < \alpha < 1/2$, $\alpha < \beta < 1$. Then there exists
$\epsilon > 0$ such that if $\|x_i - x_*\| \le \epsilon$, $s_i = -F''(x_i)^{-1}F'(x_i)^T$, and
$\lambda_i = 1$, then (H1)–(H3) are satisfied with $\eta = 1/\{4\kappa_2[F''(x_*)]\}$.
Proof. Let $D$ denote the ball about $x_*$ of radius $\epsilon$, where $\epsilon > 0$ will be
chosen below. From our analysis of Newton’s method we know that by taking $\epsilon$
sufficiently small, $x_i \in D \implies x_i + s_i \in D$. By continuity of $F''$, we may also
arrange that whenever $x \in D$, $F''(x)$ is positive definite, $\|F''(x)\| \le 2\|F''(x_*)\|$,
and $\|F''(x)^{-1}\| \le 2\|F''(x_*)^{-1}\|$.
Then
Now
$$\|F'(x_i)\| \ge \frac{1}{\|F''(x_i)^{-1}\|}\|s_i\| \ge \frac{1}{2\|F''(x_*)^{-1}\|}\|s_i\|,$$
9. Conjugate gradients
Now we return to the case of minimization of a positive definite quadratic function
$F(x) = x^TAx/2 - x^Tb$ with $A \in \mathbb{R}^{n\times n}$ symmetric positive definite and
$b \in \mathbb{R}^n$. So the unique minimizer $x_*$ is the solution to the linear system
$Ax = b$. Consider now a line search method with exact line search:
choose initial iterate $x_0$
for $i = 0, 1, \ldots$
    choose search direction $s_i$
    $\lambda_i = \frac{s_i^T(b - Ax_i)}{s_i^T A s_i}$
    $x_{i+1} = x_i + \lambda_i s_i$
end
But
$$F(y + \lambda s_i) = \frac{1}{2}y^TAy + \lambda s_i^TAy + \frac{\lambda^2}{2}s_i^TAs_i - y^Tb - \lambda s_i^Tb.$$
The second term on the right-hand side appears to couple the minimizations with respect
to $y$ and $\lambda$, but in fact this is not so. Indeed, $x_i \in x_0 + W_i$, so for
$y \in x_0 + W_i$, $y - x_i \in W_i$ and so is $A$-orthogonal to $s_i$. That is,
$s_i^TAy = s_i^TAx_i$, whence
$$F(y + \lambda s_i) = \Big[\frac{1}{2}y^TAy - y^Tb\Big] + \Big[\frac{\lambda^2}{2}s_i^TAs_i + \lambda s_i^T(Ax_i - b)\Big],$$
and the minimization problem decouples. By induction the minimum of the first term in
brackets over $x_0 + W_i$ is achieved by $y = x_i$, and clearly the second term is minimized by
$\lambda = s_i^T(b - Ax_i)/s_i^TAs_i$, i.e., the exact line search. Thus
$x_{i+1} = x_i + \lambda_i s_i$ minimizes $F$ over $x_0 + W_{i+1}$.
Any method which uses A-orthogonal (also called “conjugate”) search directions has the
nice property of the theorem. However it is not so easy to construct such directions. By
far the most useful method is the method of conjugate gradients, or the CG method, which
defines the search directions by $A$-orthogonalizing the residuals $r_i = b - Ax_i$:
• $s_0 = r_0$
• $s_i = r_i - \sum_{j=0}^{i-1} \dfrac{s_j^T A r_i}{s_j^T A s_j}\, s_j$.
The last formula (which is just the Gram-Schmidt procedure) appears to be quite expensive
to implement, but fortunately we shall see that it may be greatly simplified.
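To make the cost of the unsimplified formula concrete, here is a minimal Python sketch that builds each $s_i$ by explicit Gram-Schmidt in the $A$-inner product and then performs the exact line search. The function name `cg_by_gram_schmidt`, the helpers, and the 3 × 3 test matrix are assumptions made for illustration; the sketch also assumes `steps` does not exceed the number of nonzero residuals, since otherwise a zero direction would cause a division by zero.

```python
def dot(u, v):
    """Euclidean inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Product A x for a dense matrix stored as a list of rows."""
    return [dot(row, x) for row in A]


def cg_by_gram_schmidt(A, b, x0, steps):
    """Exact line search with directions obtained by A-orthogonalizing the residuals."""
    x = list(x0)
    directions = []                                        # s_0, ..., s_{i-1}
    for _ in range(steps):
        r = [bi - ax for bi, ax in zip(b, matvec(A, x))]   # r_i = b - A x_i
        Ar = matvec(A, r)
        s = list(r)
        for sj in directions:                              # subtract the A-projection onto each s_j
            coef = dot(sj, Ar) / dot(sj, matvec(A, sj))    # s_j^T A r_i / s_j^T A s_j
            s = [si - coef * sji for si, sji in zip(s, sj)]
        lam = dot(s, r) / dot(s, matvec(A, s))             # exact line search step
        x = [xi + lam * si for xi, si in zip(x, s)]
        directions.append(s)
    return x, directions


# Example (assumed data): a 3 x 3 symmetric positive definite system.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_gs, S = cg_by_gram_schmidt(A, b, [0.0, 0.0, 0.0], steps=3)

# The directions should be pairwise A-orthogonal up to rounding error.
for i in range(len(S)):
    for j in range(i + 1, len(S)):
        print(i, j, dot(S[i], matvec(A, S[j])))
```

Each new direction here requires a sum over all previous directions, so step $i$ costs on the order of $i$ extra matrix-vector products and inner products as written.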
Proof. The first statement comes directly from the definitions. To verify the second
statement, note that, for $0 \le j < i$, $F(x_i + t r_j)$ is minimal when $t = 0$, which gives
$r_j^T (A x_i - b) = 0$, which is the desired orthogonality. For the third statement, certainly there
is a least integer $m \in [1, n]$ so that $W_m = W_{m+1}$. Then $r_m = 0$ since it both belongs to
$W_m$ and is orthogonal to $W_m$. This implies that $x_m = x_*$ and that $s_m = 0$. Since $s_m = 0$,
$W_{m+1} = W_m$ and $x_{m+1} = x_m = x_*$. Therefore $r_{m+1} = 0$, which implies that $s_{m+1} = 0$,
therefore $W_{m+2} = W_{m+1}$, $x_{m+2} = x_*$, etc.
The fourth statement is an immediate consequence of the preceding ones. For the last
statement, we use the orthogonality of the residuals to see that $s_i^T r_i = r_i^T r_i$. But, if $0 \le j \le i$, then
\[ s_i^T r_j - s_i^T r_0 = s_i^T A (x_0 - x_j) = 0, \]
since $x_0 - x_j \in W_i$.
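Since $s_i \in W_{i+1}$ and the $r_j$, $j \le i$, are an orthogonal basis for that space for $i < m$, we
have
\[ s_i = \sum_{j=0}^{i} \frac{s_i^T r_j}{r_j^T r_j}\, r_j. \]
Using $s_i^T r_j = s_i^T r_i = r_i^T r_i$ for $0 \le j \le i$, this becomes
\[ s_i = r_i^T r_i \sum_{j=0}^{i} \frac{r_j}{r_j^T r_j} = r_i + r_i^T r_i \sum_{j=0}^{i-1} \frac{r_j}{r_j^T r_j}, \]
whence
\[ s_i = r_i + \frac{r_i^T r_i}{r_{i-1}^T r_{i-1}}\, s_{i-1}. \]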
This is the formula which is used to compute the search direction. In implementing this
formula it is useful to compute the residual from the formula $r_{i+1} = r_i - \lambda_i A s_i$ (since
$x_{i+1} = x_i + \lambda_i s_i$). Putting things together we obtain the following implementation of CG:
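A minimal Python sketch of the resulting iteration, combining the two-term update for $s_i$, the residual recurrence $r_{i+1} = r_i - \lambda_i A s_i$, and the identity $s_i^T r_i = r_i^T r_i$ in the step length. The function name `conjugate_gradient`, the helpers, the `tol` and `max_iter` parameters, and the test system are assumptions for illustration; the stopping test on $r_i^T r_i$ is one common choice, not a prescribed one.

```python
def dot(u, v):
    """Euclidean inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Product A x for a dense matrix stored as a list of rows."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, x0, tol=1e-20, max_iter=None):
    """Conjugate gradients for A x = b with A symmetric positive definite.

    Stops when r^T r <= tol or after max_iter iterations (default: n).
    """
    x = list(x0)
    r = [bi - ax for bi, ax in zip(b, matvec(A, x))]   # r_0 = b - A x_0
    s = list(r)                                         # s_0 = r_0
    rr = dot(r, r)
    if max_iter is None:
        max_iter = len(b)
    iterations = 0
    while iterations < max_iter and rr > tol:
        As = matvec(A, s)                               # the only matrix-vector product per step
        lam = rr / dot(s, As)                           # lambda_i = r_i^T r_i / s_i^T A s_i
        x = [xi + lam * si for xi, si in zip(x, s)]     # x_{i+1} = x_i + lambda_i s_i
        r = [ri - lam * asi for ri, asi in zip(r, As)]  # r_{i+1} = r_i - lambda_i A s_i
        rr_new = dot(r, r)
        s = [ri + (rr_new / rr) * si for ri, si in zip(r, s)]  # s_{i+1} = r_{i+1} + beta s_i
        rr = rr_new
        iterations += 1
    return x, iterations


# Example (assumed data): exact solution is (2/9, 1/9, 13/9).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_cg, n_iter = conjugate_gradient(A, b, [0.0, 0.0, 0.0])
print(x_cg, n_iter)
```

Only one product with $A$ is needed per iteration, and no earlier search directions have to be stored, which is the practical payoff of the simplified formula.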
The conjugate gradient method gives the exact solution in $n$ iterations, but it is most
commonly terminated after far fewer iterations. A typical stopping criterion would be to
test whether $r_i^T r_i$ is below a given tolerance. To justify this, we shall show that the method is linearly
convergent and we shall establish the rate of convergence. For analytical purposes, it is
most convenient to use the vector norm $\|x\|_A := (x^T A x)^{1/2}$, and its associated matrix norm.
Lemma 4.17. $W_i = \operatorname{span}[r_0, Ar_0, \ldots, A^{i-1} r_0]$ for $i = 1, 2, \ldots, m$.
Proof. Since $\dim W_i = i$, it is enough to show that $W_i \subset \operatorname{span}[r_0, Ar_0, \ldots, A^{i-1} r_0]$,
which we do by induction. This is certainly true for $i = 1$. Assume it holds for some $i$.
Then, since $x_i \in x_0 + W_i$, $r_i = b - Ax_i \in r_0 + AW_i \subset \operatorname{span}[r_0, Ar_0, \ldots, A^i r_0]$, and therefore
$W_{i+1}$, which is spanned by $W_i$ and $r_i$, is contained in $\operatorname{span}[r_0, Ar_0, \ldots, A^i r_0]$, which completes the
induction.
The space $\operatorname{span}[r_0, Ar_0, \ldots, A^{i-1} r_0]$ is called the Krylov space generated by the matrix $A$
and the vector $r_0$. Note that we have as well
Since $x_i - x_0 \in W_i$,
\[ \inf_{w \in W_i} \|x_* - x_i + w\|_A = \inf_{w \in W_i} \|x_* - x_0 + w\|_A. \]
Applying the obvious bound $\|p(A)(x_* - x_0)\|_A \le \|p(A)\|_A \|x_* - x_0\|_A$ we see that we can
obtain an error estimate for the conjugate gradient method by estimating
\[ C = \inf_{\substack{p \in P_i \\ p(0) = 1}} \|p(A)\|_A. \]
Now if $0 < \rho_1 < \cdots < \rho_n$ are the eigenvalues of $A$, then the eigenvalues of $p(A)$ are $p(\rho_j)$,
$j = 1, \ldots, n$, and $\|p(A)\|_A = \max_j |p(\rho_j)|$ (this is left as exercise 6). Thus$^1$
\[ C = \inf_{\substack{p \in P_i \\ p(0) = 1}} \max_j |p(\rho_j)| \le \inf_{\substack{p \in P_i \\ p(0) = 1}} \max_{\rho_1 \le \rho \le \rho_n} |p(\rho)|. \]
The final infimum can be calculated explicitly using the Chebyshev polynomials, see Figure 4.3 and (1.16). The minimum value is precisely
\[ \frac{2}{\left(\dfrac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1}\right)^i + \left(\dfrac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^i} \le 2 \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^i, \]
where $\kappa = \rho_n/\rho_1$ is the condition number of $A$. (To get the right-hand side, we suppressed
the second term in the denominator of the left-hand side, which is less than 1 and tends to
zero with $i$, and kept only the first term, which is greater than 1 and tends to infinity with
$i$.) We have thus proven that
\[ \|x_i - x_*\|_A \le 2 \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^i \|x_0 - x_*\|_A, \]
$^1$ Here we bound $\max_j |p(\rho_j)|$ by $\max_{\rho_1 \le \rho \le \rho_n} |p(\rho)|$ simply because we can minimize the latter quantity
explicitly. However, this does not necessarily lead to the best possible estimate, and the conjugate gradient
method is often observed to converge faster than the result derived here. Better bounds can sometimes be
obtained by taking into account the distribution of the spectrum of $A$, rather than just its minimum and
maximum.
Of course, the same result holds if we replace $x_0$ by $x_i$ and $x_1$ by $x_{i+1}$. Thus steepest descents
converges linearly, with rate $(\kappa - 1)/(\kappa + 1)$. Notice that the estimates indicate that a large
value of $\kappa$ will slow the convergence of both steepest descents and conjugate gradients, but,
since the dependence is on $\sqrt{\kappa}$ rather than $\kappa$, the convergence of conjugate gradients will
usually be much faster.
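To get a feel for the difference between the two rates, the following short Python sketch evaluates the bounds $(\kappa - 1)/(\kappa + 1)$ and $(\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$ and counts how many iterations each bound needs to guarantee a given error reduction in the $A$-norm. The function names and the value $\kappa = 1400$ are assumptions chosen for illustration, and the counts come from the upper bounds only, so the actual iterations may converge faster.

```python
import math


def sd_rate(kappa):
    """Bound on the linear convergence rate of steepest descents."""
    return (kappa - 1.0) / (kappa + 1.0)


def cg_rate(kappa):
    """Rate appearing in the conjugate gradient error bound."""
    root = math.sqrt(kappa)
    return (root - 1.0) / (root + 1.0)


def iterations_for_reduction(rate, reduction, prefactor=1.0):
    """Smallest integer i with prefactor * rate**i <= reduction."""
    return math.ceil(math.log(reduction / prefactor) / math.log(rate))


kappa = 1400.0  # illustrative condition number (assumption)
n_sd = iterations_for_reduction(sd_rate(kappa), 1e-6)
n_cg = iterations_for_reduction(cg_rate(kappa), 1e-6, prefactor=2.0)  # CG bound carries a factor 2
print(sd_rate(kappa), n_sd)
print(cg_rate(kappa), n_cg)
```

For this value of $\kappa$ the steepest descent bound requires thousands of iterations, while the conjugate gradient bound requires a few hundred.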
The figure shows a plot of the norm of the residual versus the number of iterations for
the conjugate gradient method and the method of steepest descents applied to a matrix
of size 233 arising from a finite element simulation. The matrix is irregular, but sparse
(averaging about 6 nonzero elements per row), and has a condition number of about 1,400.
A logarithmic scale is used on the y-axis so the near linearity of the graph reflects linear
convergence behavior. For conjugate gradients, the observed rate of linear convergence is
about 0.8, and it takes 80 iterations to reduce the initial residual by a factor of about $10^6$.
The convergence of steepest descents is too slow to be useful: in 400 iterations the residual
is not even reduced by a factor of 2.
[Figure: norm of the residual (logarithmic scale) versus iterations, curve labeled CG; left panel shows iterations 0 to 300, right panel shows iterations 0 to 50.]
Remark. 1. The conjugate gradient algorithm can be generalized to apply to the minimization
of general (non-quadratic) functionals. The Fletcher–Reeves method is such a
generalization. However, in the non-quadratic case the method is significantly more complicated,
both to implement and to analyze.
2. There are a variety of conjugate-gradient-like iterative methods that apply to matrix
problems $Ax = b$ where $A$ is either indefinite, non-symmetric, or both. Many share the idea
of approximation of the solution in a Krylov space.
9.1. Preconditioning. The idea is to choose a symmetric positive definite matrix $M \approx A$ such that the system
$Mz = c$ is relatively easy to solve. We then consider the preconditioned system $M^{-1}Ax = M^{-1}b$.
The new matrix $M^{-1}A$ is SPD with respect to the $M$-inner product, and we solve the
preconditioned system using conjugate gradients but using the $M$-inner product in place of
the $l^2$-inner product. Thus to obtain the preconditioned conjugate gradient algorithm, or
PCG, we substitute $M^{-1}A$ for $A$ everywhere and change expressions of the form $x^T y$ into
$x^T M y$. Note that the $A$-inner product $x^T A y$ remains invariant under these two changes.
Thus we obtain the algorithm:
Note that the term $s_i^T A s_i$ arises as the $M$-inner product of $s_i$ with $M^{-1}As_i$. The quantity
$\bar r_i$ is the residual in the preconditioned equation, which is related to the regular residual,
$r_i = b - Ax_i$, by $r_i = M \bar r_i$. Writing PCG in terms of $r_i$ rather than $\bar r_i$ we get
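In terms of $r_i$ and the auxiliary vector $z_i := M^{-1} r_i$, the iteration can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the name `preconditioned_cg`, the callback `solve_M` (which returns $M^{-1}r$ for a given $r$), the `tol` and `max_iter` parameters, and the test system are all hypothetical, and the diagonal preconditioner is used only because it is the simplest symmetric positive definite choice to write down.

```python
def dot(u, v):
    """Euclidean inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Product A x for a dense matrix stored as a list of rows."""
    return [dot(row, x) for row in A]


def preconditioned_cg(A, b, x0, solve_M, tol=1e-20, max_iter=None):
    """PCG for A x = b; solve_M(r) must return M^{-1} r for an SPD preconditioner M."""
    x = list(x0)
    r = [bi - ax for bi, ax in zip(b, matvec(A, x))]   # r_0 = b - A x_0
    z = solve_M(r)                                      # z_0 = M^{-1} r_0
    s = list(z)                                         # s_0 = M^{-1} r_0
    rz = dot(r, z)                                      # r^T M^{-1} r
    if max_iter is None:
        max_iter = len(b)
    iterations = 0
    while iterations < max_iter and rz > tol:
        As = matvec(A, s)
        lam = rz / dot(s, As)                           # lambda_i = r_i^T M^{-1} r_i / s_i^T A s_i
        x = [xi + lam * si for xi, si in zip(x, s)]
        r = [ri - lam * asi for ri, asi in zip(r, As)]  # r_{i+1} = r_i - lambda_i A s_i
        z = solve_M(r)                                  # one preconditioner solve per iteration
        rz_new = dot(r, z)
        s = [zi + (rz_new / rz) * si for zi, si in zip(z, s)]
        rz = rz_new
        iterations += 1
    return x, iterations


# Example (assumed data) with the diagonal of A as preconditioner.
A = [[10.0, 1.0, 0.0], [1.0, 5.0, 1.0], [0.0, 1.0, 1.0]]
b = [1.0, 2.0, 3.0]
diag_solve = lambda r: [ri / A[k][k] for k, ri in enumerate(r)]
x_pcg, n_pcg = preconditioned_cg(A, b, [0.0, 0.0, 0.0], diag_solve)
print(x_pcg, n_pcg)
```

With `solve_M` equal to the identity map the sketch reduces to ordinary conjugate gradients.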
Thus we need to compute $M^{-1} r_i$ at each iteration. Otherwise the work is essentially the
same as for ordinary conjugate gradients. Since the algorithm is just conjugate gradients for
[Figure: norm of residual (logarithmic scale) versus iterations for steepest descents (SD), conjugate gradients (CG), and preconditioned conjugate gradients (PCG); left panel shows iterations 0 to 300, right panel shows iterations 0 to 50.]
Exercises
1. Let $f : \mathbb{R} \to \mathbb{R}$ be a $C^2$ function with a root $x_*$ such that neither $f'$ nor $f''$ has a root. Prove
that Newton's method converges to $x_*$ for any initial guess $x_0 \in \mathbb{R}$.
2. Consider the 2 × 2 system of nonlinear equations
\[ f(x, y) = 0, \quad g(x, y) = 0, \quad x, y \in \mathbb{R}. \]
The Jacobi iteration for solving this system beginning from an initial guess $x_0$, $y_0$ is
for i = 0, 1, 2, ...
    solve $f(x_{i+1}, y_i) = 0$ for $x_{i+1}$
    solve $g(x_i, y_{i+1}) = 0$ for $y_{i+1}$
end
Thus each step of the iteration requires the solution of 2 scalar nonlinear equations. (N.B.: Of
course the method extends to systems of $n$ equations in $n$ unknowns.) If we combine the
Jacobi iteration with Newton's method to solve the scalar equations, we get the Newton–Jacobi iteration:
choose initial guess $x_0$, $y_0$
for i = 0, 1, 2, ...
    $x_{i+1} = x_i - \left[\dfrac{\partial f}{\partial x}(x_i, y_i)\right]^{-1} f(x_i, y_i)$
    $y_{i+1} = y_i - \left[\dfrac{\partial g}{\partial y}(x_i, y_i)\right]^{-1} g(x_i, y_i)$
end
Let $x_0, x_1, \ldots$ be the sequence of iterates produced by Newton's method. Show that all the
iterates after the initial guess satisfy the linear equations exactly. Show the same result is
true when the $x_i$ are determined by Broyden's method with $B_0$ chosen to be $F'(x_0)$.
6. Prove that if $A$ is a symmetric positive-definite matrix with eigenvalues $\rho_1, \ldots, \rho_n$, and $p$ is
a polynomial, then $\|p(A)\|_A = \max_{1 \le j \le n} |p(\rho_j)|$.
7. Prove that for the conjugate gradient method the search directions $s_i$ and the errors $e_i :=
x_* - x_i$ satisfy $s_i^T e_{i+1} \ge 0$ (in fact $s_i^T e_j \ge 0$ for all $i$, $j$). Use this to show that the $l^2$-norm
of the error $\|e_i\|$ is a non-increasing function of $i$.
8. We analyzed preconditioned conjugate gradients, with a symmetric positive definite preconditioner
$M$, as ordinary conjugate gradients applied to the problem $M^{-1}Ax = M^{-1}b$ but
with the $M$-inner product rather than the $l^2$-inner product in $\mathbb{R}^n$. An alternative approach
which doesn't require switching inner products in $\mathbb{R}^n$ is to consider the ordinary conjugate
gradient method applied to the symmetric positive definite problem $(M^{-1/2} A M^{-1/2}) z =
M^{-1/2} b$ for which the solution is $z = M^{1/2} x$. Show that this approach leads to exactly the
same preconditioned conjugate gradient algorithm.
9. The Matlab command A=delsq(numgrid('L',n)) is a quick way to generate a symmetric
positive definite sparse test matrix: it is the matrix arising from the 5-point finite difference
approximation to the Laplacian on an L-shaped domain using an n × n grid (e.g., if n = 40,
A will be a 1,083 × 1,083 sparse matrix with 5,263 nonzero elements and a condition number
of about 325). Implement the conjugate gradient algorithm for the system Ax = b for this
A (and an arbitrary vector b, e.g., all 1's). Diagonal preconditioning does no good for this
problem. (Why?) Try two other possibilities: tridiagonal preconditioning and incomplete
Cholesky preconditioning (Matlab comes equipped with an incomplete Cholesky routine, so
you don't have to write your own). Study and report on the convergence in each case.