
Numerical Solution of Nonlinear Systems and Optimization

1) The document discusses numerical methods for solving nonlinear systems of equations. 2) It introduces Newton's method, which iteratively finds better approximations of the root by taking the linear approximation of the function around the current point. 3) One-point iterations are discussed, where the next iterate depends only on the current iterate via a function G. If G is a contraction mapping, the iteration is guaranteed to converge to a unique fixed point of G.

Uploaded by

Lei Lin


CHAPTER 4

Numerical solution of nonlinear systems and optimization

1. Introduction and Preliminaries

In this chapter we consider the solution of systems of $n$ nonlinear equations in $n$ unknowns. That is, with $\Omega$ an open subset of $\mathbb{R}^n$ and $F : \Omega \to \mathbb{R}^n$ a continuous function, we wish to find $x^* \in \Omega$ such that $F(x^*) = 0$.
For nonlinear systems there is rarely a direct method of solution (an algorithm which terminates at the exact solution), so we must use iterative methods which produce a sequence of approximate solutions $x_0, x_1, \ldots$ in $\Omega$ for which, hopefully, $\lim x_i$ exists and equals a root $x^*$ of $F$.
First, some definitions relating to the speed of convergence of sequences in $\mathbb{R}^n$. Let $x_i$ be a sequence in $\mathbb{R}^n$ which converges to $0$. For $p > 1$ we say that the sequence converges to $0$ with order $p$ if there exist a constant $C$ and a number $N$ such that $\|x_{i+1}\| \le C\|x_i\|^p$ for all $i \ge N$. This definition doesn't depend on the particular norm: if a sequence converges to $0$ with order $p$ in one norm, it converges with order $p$ in all norms. Of course we extend this definition to sequences that converge to an arbitrary $x^*$ by saying that $x_i$ converges to $x^*$ with order $p$ if and only if $x_i - x^*$ converges to $0$ with order $p$.
For $p = 1$ it is common to use the same definition except with the requirement that the constant be less than unity: a sequence would then be said to converge linearly to $0$ if there exist $r < 1$ and $N$ such that $\|x_{i+1}\| \le r\|x_i\|$ for all $i \ge N$. However, this notion is norm-dependent. According to this definition, the sequence in $\mathbb{R}^2$
$$(1, 1),\ (1, 0),\ (1/4, 1/4),\ (1/4, 0),\ (1/16, 1/16),\ (1/16, 0),\ \ldots$$
converges linearly to $0$ in the 1-norm, but does not converge linearly to $0$ with respect to the $\infty$-norm. To avoid the norm-dependence, we note that the above definition implies that there exists a constant $C$ such that $\|x_i\| \le Cr^i$ for all $i$. (Proof: $\|x_i\| \le r^{i-N}\|x_N\|$ for all $i \ge N$. Equivalently, $\|x_i\| \le C_0 r^i$ for $i \ge N$, where $C_0 = r^{-N}\|x_N\|$. Setting $C = \max(C_0, \max_{0 \le i < N} \|x_i\|/r^i)$, we obtain the result.) We take this inequality as our definition of linear convergence: $x_i$ converges to $0$ linearly if there exist a constant $C$ and a number $r < 1$ such that $\|x_i\| \le Cr^i$. This notion is independent of norm (if it holds for one norm, then it holds for another with the same value of $r$, but possibly a different value of $C$). Note also that if this definition of linear convergence holds for some $r < 1$, then it also holds for all larger $r$. The infimum of all such $r$ is called the rate of the linear convergence. If the infimum is $0$, we speak of superlinear convergence.
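Note that if $1 < p_1 < p_2$ and $0 < r_1 < r_2 < 1$, then
$$\text{convergence with order } p_2 \Longrightarrow \text{convergence with order } p_1 \Longrightarrow \text{superlinear convergence} \Longrightarrow \text{linear convergence with rate } r_1 \Longrightarrow \text{linear convergence with rate } r_2.$$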
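The norm-dependence of the ratio-based definition can be checked directly on the sequence in $\mathbb{R}^2$ given above. The following is only a small sketch; the helper names (`sequence`, `successive_ratios`) are ours, and exact rational arithmetic is used so that the ratios are not blurred by rounding.

```python
from fractions import Fraction

def sequence(num_pairs):
    """Return the points (4^-k, 4^-k), (4^-k, 0) for k = 0, ..., num_pairs - 1."""
    points = []
    for k in range(num_pairs):
        a = Fraction(1, 4 ** k)
        points.append((a, a))
        points.append((a, Fraction(0)))
    return points

def norm_1(x):
    return sum(abs(c) for c in x)

def norm_inf(x):
    return max(abs(c) for c in x)

def successive_ratios(points, norm):
    """Ratios ||x_{i+1}|| / ||x_i|| for consecutive terms of the sequence."""
    return [norm(q) / norm(p) for p, q in zip(points, points[1:])]

points = sequence(4)
ratios_1 = successive_ratios(points, norm_1)
ratios_inf = successive_ratios(points, norm_inf)
print("1-norm ratios:  ", [str(r) for r in ratios_1])
print("inf-norm ratios:", [str(r) for r in ratios_inf])
```

In the 1-norm every ratio is below 1, while in the $\infty$-norm every other ratio equals 1, so no $r < 1$ works for that norm.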

2. One-point iteration
For many iterative methods, $x_{i+1}$ depends only on $x_i$ via some formula that doesn't depend on $i$: $x_{i+1} = G(x_i)$. Such a method is called a (stationary) one-point iteration. Before considering specific iterations to solve $F(x) = 0$, we consider one-point iterations in general.
Assuming the iteration function $G$ is continuous, we obviously have that if the iterates $x_{i+1} = G(x_i)$ converge to some limit $x^*$, then $x^* = G(x^*)$, i.e., $x^*$ is a fixed point of $G$.
A basic result is the contraction mapping theorem. Recall that a map $G : B \to \mathbb{R}^n$ ($B \subset \mathbb{R}^n$) is called a contraction (with respect to some norm on $\mathbb{R}^n$) if $G$ is Lipschitz with Lipschitz constant strictly less than 1.
Theorem 4.1. Suppose $G$ maps a closed subset $B$ of $\mathbb{R}^n$ to itself, and suppose that $G$ is a contraction (with respect to some norm). Then $G$ has a unique fixed point $x^*$ in $B$. Moreover, if $x_0 \in B$ is any point, then the iteration $x_{i+1} = G(x_i)$ converges to $x^*$.
If $G \in C^1$, a practical way to check whether $G$ is a contraction (with respect to some norm on $\mathbb{R}^n$) is to consider $\|G'(x)\|$ (in the associated matrix norm). If $\|G'(x)\| \le \lambda < 1$ on some convex set $\Omega$ (e.g., some ball), then $G$ is a contraction there. In one dimension this is an immediate consequence of the mean value theorem. In $n$ dimensions we don't have the mean value theorem, but we can use the fundamental theorem of calculus to the same end. Given $x, y \in \Omega$ we let $g(t) = G(x + t(y - x))$, so $g'(t) = G'(x + t(y - x))(y - x)$. From the fundamental theorem of calculus we get $g(1) - g(0) = \int_0^1 g'(t)\,dt$, or
$$G(y) - G(x) = \int_0^1 G'(x + t(y - x))\,dt\,(y - x),$$
whence
$$\|G(y) - G(x)\| \le \sup_{0 \le t \le 1} \|G'(x + t(y - x))\|\,\|y - x\| \le \lambda\|y - x\|,$$
and so $G$ is a contraction.
If we assume that $x^*$ is a fixed point of $G$, $G \in C^1$, and $r = \|G'(x^*)\| < 1$, then we can conclude that the iteration $x_{i+1} = G(x_i)$ converges for any starting iterate $x_0$ sufficiently close to $x^*$. This is called a locally convergent iteration. The above argument also shows that convergence is (at least) linear with rate $r$.
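As a one-dimensional illustration of a contraction and its linear rate, consider $G(x) = \cos x$ on $B = [0, 1]$ (our choice of example, not taken from the text): $G$ maps $B$ into itself and $|G'(x)| = |\sin x| \le \sin 1 < 1$ there. The sketch below iterates $G$ and compares the observed error ratio with $|G'(x^*)|$; the function name `fixed_point_iteration` and the tolerance are assumptions of the sketch.

```python
import math

def fixed_point_iteration(G, x0, tol=1e-12, max_iter=200):
    """Iterate x_{i+1} = G(x_i) until successive iterates differ by at most tol."""
    xs = [x0]
    for _ in range(max_iter):
        x_next = G(xs[-1])
        xs.append(x_next)
        if abs(x_next - xs[-2]) <= tol:
            break
    return xs

xs = fixed_point_iteration(math.cos, 1.0)
x_star = xs[-1]                                  # approximate fixed point of cos
errors = [abs(x - x_star) for x in xs[:-1]]
observed_rate = errors[11] / errors[10]          # ratio of successive errors
predicted_rate = abs(-math.sin(x_star))          # |G'(x*)|
print(x_star, observed_rate, predicted_rate)
```

The observed ratio of successive errors should settle near $|G'(x^*)| = \sin x^*$, in line with the statement that convergence is linear with rate $r$.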
In this connection, the following theorem, which connects $\|A\|$ to $\rho(A)$ (the spectral radius of $A$, i.e., the maximum modulus of its eigenvalues), is very useful.
Theorem 4.2. Let $A \in \mathbb{R}^{n\times n}$. Then
1. For any operator matrix norm, $\|A\| \ge \rho(A)$.
2. If $A$ is symmetric, then $\|A\|_2 = \rho(A)$.
3. If $A$ is diagonalizable, then there exists an operator norm so that $\|A\| = \rho(A)$.
4. For any $A$ and any $\epsilon > 0$, there exists an operator norm so that $\rho(A) \le \|A\| \le \rho(A) + \epsilon$.
Proof. 1. If $Ax = \lambda x$ where $x \ne 0$ and $|\lambda| = \rho(A)$, then from $\|Ax\| = |\lambda|\|x\|$ we see that $\|A\| \ge \rho(A)$.
2. $\|A\|_2 = \sqrt{\rho(A^T A)} = \sqrt{\rho(A^2)} = \rho(A)$.

3. First note that if $S \in \mathbb{R}^{n\times n}$ is nonsingular and $\|\cdot\|_0$ is any vector norm, then $\|x\| := \|Sx\|_0$ is another vector norm, and the associated matrix norms satisfy $\|A\| = \|SAS^{-1}\|_0$. Now if $A$ is diagonalizable, then there exists $S$ nonsingular so that $SAS^{-1}$ is a diagonal matrix with the eigenvalues of $A$ on the diagonal (the columns of $S^{-1}$ are the eigenvectors of $A$). Hence if we apply the above relation beginning with the $\infty$-norm for $\|\cdot\|_0$, we get $\|A\| = \rho(A)$.
4. The proof is similar in this case, but we use the Jordan canonical form to write $SAS^{-1} = J$ where $J$ has the eigenvalues of $A$ on the diagonal, $0$'s and $\epsilon$'s above the diagonal, and $0$'s everywhere else. (The usual Jordan canonical form is the case $\epsilon = 1$, but if we conjugate a Jordan block by the matrix $\operatorname{diag}(1, \epsilon, \epsilon^2, \ldots)$ the $1$'s above the diagonal are changed to $\epsilon$.) Thus for the matrix norm associated to $\|x\| := \|Sx\|_\infty$, we have $\|A\| = \|J\|_\infty \le \rho(A) + \epsilon$.
Corollary 4.3. If $G$ is $C^1$ in a neighborhood of a fixed point $x^*$ and $r = \rho(G'(x^*)) < 1$, then the one-point iteration with iteration function $G$ is locally convergent to $x^*$ with rate $r$.
Although we don't need it immediately, we note another useful corollary of the preceding theorem.
Corollary 4.4. Let $A \in \mathbb{R}^{n\times n}$. Then $\lim_{n\to\infty} A^n = 0$ if and only if $\rho(A) < 1$, and in this case the convergence is linear with rate $\rho(A)$.
Proof. $\|A^n\| \ge \rho(A^n) = \rho(A)^n$, so if $\rho(A) \ge 1$, then $A^n$ does not converge to $0$. Conversely, if $\rho(A) < 1$, then for any $\bar\rho \in (\rho(A), 1)$ we can find an operator norm so that $\|A\| \le \bar\rho$, and then $\|A^n\| \le \|A\|^n \le \bar\rho^n \to 0$.
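A small numerical check of Corollary 4.4 may help here. The matrix below is our own example: its $\infty$-norm exceeds 1, yet its spectral radius is $1/2$, so its powers still tend to zero. The helper names are assumptions of this sketch.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def norm_inf(A):
    """Operator infinity-norm: maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

A = [[0.5, 1.0],
     [0.0, 0.5]]          # triangular, eigenvalues 0.5 and 0.5, so rho(A) = 0.5

power = A
norms = [norm_inf(power)]             # norms[k - 1] holds ||A^k||_inf
for _ in range(59):
    power = matmul(power, A)
    norms.append(norm_inf(power))

print(norms[0], norms[1], norms[-1])
print(norms[-1] / norms[-2])          # ratio of successive norms, close to rho(A)
```

The first few norms are larger than 1, but the ratio of successive norms approaches $\rho(A) = 1/2$, which is the linear rate asserted by the corollary.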
Finally, let us consider the case $G'(x^*) = 0$. Then clearly the iteration is superlinearly convergent. If $G$ is $C^2$, or, under the weaker assumption that $G'$ is Lipschitz, we can show that the convergence is in fact quadratic. First note that for any $C^1$ function $G$,
$$G(y) - G(x) - G'(x)(y - x) = \int_0^1 [G'(x + t(y - x)) - G'(x)]\,dt\,(y - x).$$
Hence, if $G'$ is Lipschitz,
$$\|G(y) - G(x) - G'(x)(y - x)\| \le \frac{C}{2}\|y - x\|^2,$$
where $C$ is the Lipschitz constant. Applying this with $x = x^*$ and $y = x_i$ and using the fact that $G(x^*) = x^*$, $G'(x^*) = 0$, we get
$$\|x_{i+1} - x^*\| \le \frac{C}{2}\|x_i - x^*\|^2,$$
which is quadratic convergence. In the same way we can treat the case of $G$ with several vanishing derivatives.
Theorem 4.5. Suppose that $G$ maps a neighborhood of $x^*$ in $\mathbb{R}^n$ into $\mathbb{R}^n$ and that $x^*$ is a fixed point of $G$. Suppose also that all the derivatives of $G$ of order up to $p$ exist, are Lipschitz continuous, and vanish at $x^*$. Then the iteration $x_{i+1} = G(x_i)$ is locally convergent to $x^*$ with order $p + 1$.

3. Newton’s method
An important example of a one-point iteration is Newton's method for root-finding. Let $F : \Omega \to \mathbb{R}^n$ be $C^1$ with $\Omega \subset \mathbb{R}^n$. We wish to find a root $x^*$ of $F$ in $\Omega$. If $x_0 \in \Omega$ is an initial guess of the root, we approximate $F$ by the linear part of its Taylor series near $x_0$:
$$F(x) \approx F(x_0) + F'(x_0)(x - x_0).$$
The left-hand side vanishes when $x$ is a root, so setting the right-hand side equal to zero gives us an equation for a new approximate root, which we take to be $x_1$. Thus $x_1$ is determined by the equation
$$F(x_0) + F'(x_0)(x_1 - x_0) = 0,$$
or, equivalently,
$$x_1 = x_0 - F'(x_0)^{-1}F(x_0).$$
Continuing in this way we get Newton's method:
$$x_{i+1} = x_i - F'(x_i)^{-1}F(x_i), \quad i = 0, 1, \ldots.$$
(Of course it could happen that some $x_i \notin \Omega$ or that some $F'(x_i)$ is singular, in which case Newton's method breaks down. We shall see that under appropriate conditions this doesn't occur.)
Note that Newton's method is simply iteration of the function
$$G(x) = x - F'(x)^{-1}F(x).$$
Now if $x^*$ is a root of $F$ and $F'(x^*)$ is nonsingular (i.e., if $x^*$ is a simple root), then $G$ is continuous in a neighborhood of $x^*$, and clearly $x^*$ is a fixed point of $G$. We have that
$$G'(x) = I - K(x)F(x) - F'(x)^{-1}F'(x) = -K(x)F(x),$$
where $K$ is the derivative of the function $x \mapsto F'(x)^{-1}$ (this function maps a neighborhood of $x^*$ in $\mathbb{R}^n$ into $\mathbb{R}^{n\times n}$). It is an easy (and worthwhile) exercise to derive the formula for $K(x)$ in terms of $F'(x)$ and $F''(x)$, but we don't need it here. It suffices to note that $K$ exists and is Lipschitz continuous if $F'$ and $F''$ are. In any case, we have that $G'(x^*) = 0$. Thus, assuming that $F$ is $C^2$ with $F''$ Lipschitz (e.g., if $F$ is $C^3$), we have all the hypotheses necessary for local quadratic convergence. Thus we have proved:
Theorem 4.6. Suppose that $F : \Omega \to \mathbb{R}^n$, $\Omega \subset \mathbb{R}^n$, is $C^2$ with $F''$ Lipschitz continuous, and that $F(x^*) = 0$ and $F'(x^*)$ is nonsingular for some $x^* \in \Omega$. Then if $x_0 \in \Omega$ is sufficiently close to $x^*$, the sequence of points defined by Newton's method is well-defined and converges quadratically to $x^*$.
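To see the quadratic convergence numerically, here is a minimal sketch of Newton's method for a system of two equations. The test system $F(x, y) = (x^2 + y^2 - 1,\ x - y)$, with simple root $(1/\sqrt2, 1/\sqrt2)$, and the starting point are our own choices, and the $2 \times 2$ linear solve by Cramer's rule is used only to keep the example self-contained.

```python
import math

def F(x, y):
    return (x * x + y * y - 1.0, x - y)

def jacobian(x, y):
    """F'(x, y) as a pair of rows."""
    return ((2.0 * x, 2.0 * y),
            (1.0, -1.0))

def solve_2x2(J, rhs):
    """Solve J d = rhs by Cramer's rule; raises if J is singular."""
    (a, b), (c, d) = J
    det = a * d - b * c
    if det == 0.0:
        raise ZeroDivisionError("singular Jacobian: Newton's method breaks down")
    return ((rhs[0] * d - b * rhs[1]) / det,
            (a * rhs[1] - rhs[0] * c) / det)

def newton(x, y, steps):
    """Run a fixed number of Newton steps, recording the error at each iterate."""
    root = math.sqrt(0.5)
    errors = [math.hypot(x - root, y - root)]
    for _ in range(steps):
        f = F(x, y)
        dx, dy = solve_2x2(jacobian(x, y), (-f[0], -f[1]))   # F'(x_i) d = -F(x_i)
        x, y = x + dx, y + dy
        errors.append(math.hypot(x - root, y - root))
    return (x, y), errors

(x, y), errors = newton(1.0, 0.5, 6)
for e in errors:
    print(e)
```

Once the iterates are close to the root, each printed error should be roughly proportional to the square of the previous one, until rounding error takes over.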
The hypothesis that the root be simple is necessary for the quadratic convergence of Newton's method, as can easily be seen by a 1-dimensional example. However, the smoothness assumption can be weakened. The following theorem requires only that $F'$ (rather than $F''$) be Lipschitz continuous (which holds if $F$ is $C^2$). In the statement of the theorem any vector norm and corresponding operator matrix norm can be used.

Theorem 4.7. Suppose that $F(x^*) = 0$ and that $F'$ is Lipschitz continuous with Lipschitz constant $\gamma$ in a ball of radius $r$ around $x^*$. Also suppose that $F'(x^*)$ is nonsingular with $\|F'(x^*)^{-1}\| \le \beta$. If $\|x_0 - x^*\| \le \min[r, 1/(2\beta\gamma)]$, then the sequence determined by Newton's method is well-defined, converges to $x^*$, and satisfies
$$\|x_{i+1} - x^*\| \le \beta\gamma\|x_i - x^*\|^2.$$
Proof. First we show that $F'(x_0)$ is nonsingular. Indeed,
$$\|F'(x^*)^{-1}[F'(x_0) - F'(x^*)]\| \le \beta\gamma\|x_0 - x^*\| \le 1/2,$$
from which follows the nonsingularity and the estimate
$$\|F'(x_0)^{-1}\| \le \frac{1}{1 - 1/2}\|F'(x^*)^{-1}\| \le 2\beta.$$
Thus $x_1$ is well-defined and
$$x_1 - x^* = x_0 - x^* - F'(x_0)^{-1}F(x_0) = x_0 - x^* - F'(x_0)^{-1}[F(x_0) - F(x^*)] = F'(x_0)^{-1}[F(x^*) - F(x_0) - F'(x_0)(x^* - x_0)].$$
We have previously bounded the norm of the bracketed quantity by $\gamma\|x^* - x_0\|^2/2$, and $\|F'(x_0)^{-1}\| \le 2\beta$, so
$$\|x_1 - x^*\| \le \beta\gamma\|x_0 - x^*\|^2.$$
This is the kind of quadratic bound we need, but first we need to show that the $x_i$ are indeed converging to $x^*$. Using again that $\|x_0 - x^*\| \le 1/(2\beta\gamma)$, we have the linear estimate
$$\|x_1 - x^*\| \le \|x^* - x_0\|/2.$$
Thus $x_1$ also satisfies $\|x_1 - x^*\| \le \min[r, 1/(2\beta\gamma)]$, and the identical argument shows
$$\|x_2 - x^*\| \le \beta\gamma\|x_1 - x^*\|^2 \quad\text{and}\quad \|x_2 - x^*\| \le \|x_1 - x^*\|/2.$$
Continuing in this way we get the theorem.
The theorem gives a precise sufficient condition on how close the initial iterate $x_0$ must be to $x^*$ to ensure convergence. Of course it is not a condition that one can apply practically, since one cannot check whether $x_0$ satisfies it without knowing $x^*$. There are several variant results which weaken the hypotheses necessary to show quadratic convergence for Newton's method. A well-known, but rather complicated, one is Kantorovich's theorem (1948). Unlike the above theorems, it does not assume the existence of a root $x^*$ of $F$, but rather states that if an initial point $x_0$ satisfies certain conditions, then there is a root, and Newton's method will converge to it. Basically it states: if $F'$ is Lipschitz near $x_0$ and nonsingular at $x_0$, and if the value of $F(x_0)$ is sufficiently small (how small depending on the Lipschitz constant for $F'$ and the norm of $F'(x_0)^{-1}$), then Newton's method beginning from $x_0$ is well-defined and converges quadratically to a root $x^*$. The exact statement is rather complicated, so I'll omit it. In principle, one could pick a starting iterate $x_0$, compute the norms of $F(x_0)$ and $F'(x_0)^{-1}$, and check whether they fulfil the conditions of Kantorovich's theorem (if one knew a bound for the Lipschitz constant of $F'$ in a neighborhood of $x_0$), and thus tell in advance whether Newton's method would converge. In practice this is difficult to do and would rule out many acceptable choices of initial guess, so it is rarely used.

4. Quasi-Newton methods
Each iteration of Newton's method requires the following operations: evaluate the function $F$ at the current approximation, evaluate the derivative $F'$ at the current approximation, solve a system of equations with the latter as matrix and the former as right-hand side, and update the approximation. The evaluation of $F'$ and the linear solve are often the most expensive parts. In some applications, no formula for $F'$ is available, and exact evaluation of $F'$ is not possible. There are many variations of Newton's method that attempt to maintain good local convergence properties while avoiding the evaluation of $F'$, and/or simplifying the linear solve step. We shall refer to all of these as quasi-Newton methods, although some authors restrict that term to specific types of modification to Newton's method.
Consider the iteration
$$x_{i+1} = x_i - B_i^{-1}F(x_i),$$
where for each $i$, $B_i$ is a nonsingular matrix to be specified. If $B_i = F'(x_i)$, this is Newton's method. The following theorem states that if $B_i$ is sufficiently close to $F'(x_i)$ then this method is still locally convergent. With a stronger hypothesis on the closeness of $B_i$ to $F'(x_i)$ the convergence is quadratic. Under a somewhat weaker hypothesis, the method still converges superlinearly.
Theorem 4.8. Suppose $F'$ is Lipschitz continuous in a neighborhood of a root $x^*$ and that $F'(x^*)$ is nonsingular.
1. Then there exists $\delta > 0$ such that if $\|B_i - F'(x_i)\| \le \delta$ and $\|x_0 - x^*\| \le \delta$, then the generalized Newton iterates are well-defined by the above formula, and converge to $x^*$.
2. If further $\|B_i - F'(x_i)\| \to 0$, then the convergence is superlinear.
3. If there is a constant $c$ such that $\|B_i - F'(x_i)\| \le c\|F(x_i)\|$, then the convergence is quadratic.
Proof. Set $\beta = \|F'(x^*)^{-1}\| < \infty$. Choosing $\delta$ small enough, we can easily achieve
$$\|x - x^*\| \le \delta,\ \|B - F'(x)\| \le \delta \implies \|B^{-1}\| \le 2\beta.$$
Let $\gamma$ be a Lipschitz constant for $F'$. Decreasing $\delta$ if necessary we can further achieve
$$2\beta(\gamma/2 + 1)\delta \le 1/2.$$
Now let $x_0$ and $B_0$ be chosen in accordance with this $\delta$. Then
$$x_1 - x^* = x_0 - x^* - B_0^{-1}F(x_0) = B_0^{-1}[F(x^*) - F(x_0) - B_0(x^* - x_0)] = B_0^{-1}\int_0^1 [F'((1 - t)x_0 + tx^*) - B_0]\,dt\,(x^* - x_0).$$
Now $\|F'((1 - t)x_0 + tx^*) - B_0\| \le \gamma t\|x_0 - x^*\| + \delta$, by the triangle inequality, the Lipschitz condition, and the condition on $B_0$. Thus
$$\|x_1 - x^*\| \le 2\beta(\gamma\|x_0 - x^*\|/2 + \delta)\|x_0 - x^*\| \le 2\beta(\gamma/2 + 1)\delta\|x_0 - x^*\| \le \|x_0 - x^*\|/2.$$
In particular $\|x_1 - x^*\| \le \delta$, so this process may be repeated. (1) follows easily. Note that we obtained linear convergence with rate $1/2$, but by choosing $\delta$ sufficiently small we could obtain linear convergence with any desired rate $r \in (0, 1)$.

From the above reasoning we get
$$\|x_{i+1} - x^*\| \le 2\beta(\gamma\|x_i - x^*\|/2 + \|B_i - F'(x_i)\|)\|x_i - x^*\|,$$
which gives superlinearity under the additional hypothesis of (2). From the additional hypothesis of (3), we get
$$\|B_i - F'(x_i)\| \le c\|F(x_i) - F(x^*)\| \le c'\|x_i - x^*\|,$$
which gives the quadratic convergence.

Some examples of quasi-Newton methods:

• Replace $\partial F_i(x)/\partial x_j$ by $[F_i(x + he_j) - F_i(x)]/h$ for small $h$, or by a similar difference quotient. From the theorem, convergence is guaranteed if $h$ is small enough.
• Use a single evaluation of $F'$ for several iterations. (Then one can factor $F'$ one time, and back solve for the other iterations.) Other methods, including Broyden's method which we consider next, use some sort of procedure to "update" a previous Jacobian.
• Use $B_i = \theta_i^{-1}F'(x_i)$, where $\theta_i$ is a relaxation parameter. Generally $\theta_i$ is chosen in $(0, 1]$, so that $x_{i+1} = x_i - \theta_i F'(x_i)^{-1}F(x_i)$ is a more "conservative" step than a true Newton step. This is used to stabilize Newton iterations when not sufficiently near the root. From the theorem, convergence is guaranteed for $\theta_i$ sufficiently near 1, and is superlinear if $\theta_i \to 1$. If $|\theta_i - 1| \le c\|F(x_i)\|$ for all $i$ sufficiently large, convergence is quadratic.
• Another possibility, which we shall study when we consider the minimization methods, is $B_i = \theta_i F'(x_i) + (1 - \theta_i)I$. Convergence statements similar to those for the relaxed method hold.
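The first variant in the list above, replacing the Jacobian by forward difference quotients, can be sketched as follows. This is an illustrative sketch only: the step size $h = 10^{-7}$, the stopping test, the helper names, and the test system (the same one-circle-one-line system $x^2 + y^2 = 1$, $x = y$ that we chose for illustration) are all assumptions, and the small Gaussian elimination routine is included only so that the example is self-contained.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ZeroDivisionError("singular matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fd_jacobian(F, x, h=1e-7):
    """Forward differences: B[i][j] = (F_i(x + h e_j) - F_i(x)) / h."""
    n = len(x)
    fx = F(x)
    B = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xh = x[:]
        xh[j] += h
        fxh = F(xh)
        for i in range(n):
            B[i][j] = (fxh[i] - fx[i]) / h
    return B

def quasi_newton(F, x0, tol=1e-10, max_iter=50):
    """Iterate x_{i+1} = x_i - B_i^{-1} F(x_i) with B_i the difference Jacobian."""
    x = x0[:]
    for _ in range(max_iter):
        fx = F(x)
        if max(abs(c) for c in fx) <= tol:
            break
        step = solve(fd_jacobian(F, x), [-c for c in fx])
        x = [xi + si for xi, si in zip(x, step)]
    return x

def F(x):
    return [x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]]

root = quasi_newton(F, [1.0, 0.5])
print(root)
```

Note that each iteration costs $n + 1$ evaluations of $F$ instead of one evaluation of $F$ and one of $F'$, which is the trade-off this variant makes.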

5. Broyden’s method
Broyden's method (published by C. G. Broyden in 1965) is an important example of a quasi-Newton method. It is one possible generalization to $n$ dimensions of the secant method. For a single nonlinear equation, the secant method replaces $f'(x_i)$ in Newton's method with the approximation $[f(x_i) - f(x_{i-1})]/(x_i - x_{i-1})$, to obtain the iteration
$$x_{i+1} = x_i - \frac{x_i - x_{i-1}}{f(x_i) - f(x_{i-1})}\,f(x_i).$$
Of course, we cannot directly generalize this idea to $\mathbb{R}^n$, since we can't divide by the vector $x_i - x_{i-1}$. Instead, we can consider the equation
$$B_i(x_i - x_{i-1}) = F(x_i) - F(x_{i-1}).$$
However, this does not determine the matrix $B_i$, only its action on multiples of $x_i - x_{i-1}$. To complete the specification of $B_i$, Broyden's method sets the action on vectors orthogonal to $x_i - x_{i-1}$ to be the same as that of $B_{i-1}$. Broyden's method is an update method in the sense that $B_i$ is determined as a modification of $B_{i-1}$.
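For reference, the scalar secant iteration described above can be sketched in a few lines. The function name, the stopping rule, and the test equation $x^2 - 2 = 0$ are our own choices for illustration.

```python
def secant(f, x_prev, x_curr, tol=1e-12, max_iter=50):
    """Secant method: Newton's method with f'(x_i) replaced by a difference quotient."""
    for _ in range(max_iter):
        f_prev, f_curr = f(x_prev), f(x_curr)
        if abs(f_curr) <= tol or f_curr == f_prev:
            break                      # converged, or difference quotient undefined
        x_next = x_curr - f_curr * (x_curr - x_prev) / (f_curr - f_prev)
        x_prev, x_curr = x_curr, x_next
    return x_curr

root = secant(lambda x: x * x - 2.0, 1.0, 2.0)
print(root)
```

Unlike Newton's method, the iteration needs two starting values and no derivative evaluations.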
In order to implement Broyden’s method, we note:

Theorem 4.9. Given vectors $s \ne 0$, $v$ in $\mathbb{R}^n$, and $C \in \mathbb{R}^{n\times n}$, there is a unique matrix $B \in \mathbb{R}^{n\times n}$ such that
$$Bs = v,$$
$$Bz = Cz \quad\text{for all } z \text{ such that } s^T z = 0.$$
To see this, we note that there is certainly at most one such $B$, since the two conditions determine the action of $B$ on $s$ and on its orthogonal complement. To see that such a $B$ exists, we give the formula:
$$B = C + \frac{1}{s^T s}(v - Cs)s^T.$$
It is important to note that $B$ is derived from $C$ by the addition of a matrix of rank 1. In a certain sense $B$ is the closest matrix to $C$ which takes $s$ to $v$ (see the exercises).
We are now ready to give Broyden's method:

Choose $x_0 \in \mathbb{R}^n$, $B_0 \in \mathbb{R}^{n\times n}$
for $i = 0, 1, \ldots$
    $x_{i+1} = x_i - B_i^{-1}F(x_i)$
    $s_i = x_{i+1} - x_i$
    $v_i = F(x_{i+1}) - F(x_i)$
    $B_{i+1} = B_i + \frac{1}{s_i^T s_i}(v_i - B_i s_i)s_i^T$
end

Remark. For example, $B_0$ can be taken to be $F'(x_0)$. In one dimension, Broyden's method reduces to the secant method.
Key to the effectiveness of Broyden's method is that the matrix $B_{i+1}$ differs from $B_i$ only by a matrix of rank 1. But, as shown in the following theorem, once one can compute the action of the inverse of a matrix on a vector efficiently (e.g., by forward and back substitution once the matrix has been factored into triangular matrices), then one can compute the action of the inverse of any rank 1 perturbation of the matrix.
Theorem 4.10 (Sherman-Morrison-Woodbury formula). Let $B \in \mathbb{R}^{n\times n}$, $y, v \in \mathbb{R}^n$, and suppose that both $B$ and $\tilde B := B + vy^T$ are nonsingular. Then $1 + y^T B^{-1}v \ne 0$ and
$$\tilde B^{-1} = B^{-1} - \frac{1}{1 + y^T B^{-1}v}B^{-1}vy^T B^{-1}.$$
Proof. Given any $u \in \mathbb{R}^n$, let $x = \tilde B^{-1}u$, so
$$(4.1)\qquad Bx + (y^T x)v = u.$$
Multiplying on the left by $y^T B^{-1}$ gives $(y^T x)(1 + y^T B^{-1}v) = y^T B^{-1}u$. In particular, if we take $u = By$, then the right-hand side is $y^T y$, and so $1 + y^T B^{-1}v \ne 0$ and we obtain
$$y^T x = \frac{y^T B^{-1}u}{1 + y^T B^{-1}v}.$$

Combining this expression and (4.1) we see that
$$Bx = u - \frac{y^T B^{-1}u}{1 + y^T B^{-1}v}\,v.$$
Multiplying by $B^{-1}$ and recalling that $x = \tilde B^{-1}u$ we obtain the Sherman–Morrison–Woodbury formula.
Thus to compute the action of $\tilde B^{-1}$ on a vector, we just need to know the action of $B^{-1}$ on that vector and on $v$ and $y$, and to compute some inner products and simple expressions.
A good way to implement this formula for Broyden's method is to store $H_i := B_i^{-1}$ rather than $B_i$. The algorithm then becomes:

Choose $x_0 \in \mathbb{R}^n$, $H_0 \in \mathbb{R}^{n\times n}$
for $i = 0, 1, \ldots$
    $x_{i+1} = x_i - H_i F(x_i)$
    $s_i = x_{i+1} - x_i$
    $v_i = F(x_{i+1}) - F(x_i)$
    $H_{i+1} = H_i + \frac{1}{s_i^T H_i v_i}(s_i - H_i v_i)s_i^T H_i$
end

Note that if $H_0$ is $B_0^{-1}$ this algorithm is mathematically equivalent to the basic Broyden algorithm.
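The inverse-update form of the algorithm translates almost line by line into code, since it needs only matrix-vector products and a rank 1 update. The sketch below is written under our own assumptions: the stopping tolerance, the helper names, the illustrative test system $x^2 + y^2 = 1$, $x = y$, and the choice $H_0 = F'(x_0)^{-1}$ (computed by hand for the starting point $(1, 0.5)$) are not prescribed by the text.

```python
def matvec(H, v):
    """Matrix-vector product H v."""
    return [sum(h * c for h, c in zip(row, v)) for row in H]

def vecmat(v, H):
    """Row vector times matrix, v^T H."""
    n = len(v)
    return [sum(v[k] * H[k][j] for k in range(n)) for j in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def broyden(F, x0, H0, tol=1e-12, max_iter=100):
    """Broyden's method storing H_i = B_i^{-1}, updated by the rank 1 formula."""
    n = len(x0)
    x = x0[:]
    H = [row[:] for row in H0]
    fx = F(x)
    for _ in range(max_iter):
        if max(abs(c) for c in fx) <= tol:
            break
        s = [-c for c in matvec(H, fx)]             # s_i = -H_i F(x_i)
        x = [xi + si for xi, si in zip(x, s)]       # x_{i+1} = x_i + s_i
        f_new = F(x)
        v = [a - b for a, b in zip(f_new, fx)]      # v_i = F(x_{i+1}) - F(x_i)
        Hv = matvec(H, v)
        sTH = vecmat(s, H)                          # s_i^T H_i
        denom = dot(s, Hv)                          # s_i^T H_i v_i
        if denom == 0.0:
            raise ZeroDivisionError("update undefined: s^T H v = 0")
        u = [(si - hvi) / denom for si, hvi in zip(s, Hv)]
        H = [[H[i][j] + u[i] * sTH[j] for j in range(n)] for i in range(n)]
        fx = f_new
    return x

def F(x):
    return [x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]]

# H_0 = F'(x_0)^{-1} at x_0 = (1, 0.5), where F'(x_0) = [[2, 1], [1, -1]].
H0 = [[1.0 / 3.0, 1.0 / 3.0],
      [1.0 / 3.0, -2.0 / 3.0]]
root = broyden(F, [1.0, 0.5], H0)
print(root)
```

After the initial $H_0$, no derivative of $F$ is evaluated and no linear system is solved; each iteration costs one evaluation of $F$ and $O(n^2)$ arithmetic.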
5.1. Convergence of Broyden's method. Denote by $x^*$ the solution of $F(x^*) = 0$, and let $x_i$ and $B_i$ denote the sequences of vectors and matrices produced by Broyden's method. Set
$$e_i = x_i - x^*, \qquad M_i = B_i - F'(x^*).$$
Roughly speaking, the key to the convergence of Broyden's method lies in the facts that (1) $e_{i+1}$ will be small compared to $e_i$ if $M_i$ is not large, and (2) $M_{i+1}$ will not be much larger than $M_i$ if the $e_i$'s are small. Precise results will be based on the following identities, which follow directly from the definitions of $x_i$ and $B_i$:
$$(4.2)\qquad e_{i+1} = -B_i^{-1}[F(x_i) - F(x^*) - F'(x^*)(x_i - x^*)] + B_i^{-1}M_i e_i,$$
$$(4.3)\qquad M_{i+1} = M_i\left(I - \frac{1}{s_i^T s_i}s_i s_i^T\right) + \frac{1}{s_i^T s_i}(v_i - F'(x^*)s_i)s_i^T.$$
Our first result gives the local convergence of Broyden's method, with a rate of convergence that is at least linear. The norms are all the 2-norm.
Theorem 4.11. Let $F$ be differentiable in a ball $\Omega$ about a root $x^* \in \mathbb{R}^n$, and suppose that its derivative $F'$ is Lipschitz continuous on the ball with Lipschitz constant $\gamma$. Suppose that $F'(x^*)$ is invertible, with $\|F'(x^*)^{-1}\| \le \beta$. Let $x_0 \in \Omega$ and $B_0 \in \mathbb{R}^{n\times n}$ be given satisfying
$$\|M_0\| + 2\gamma\|e_0\| \le \frac{1}{8\beta}.$$
Then the iterates $x_i$, $B_i$ given by Broyden's method are well defined, and the errors satisfy
$$\|e_{i+1}\| \le \|e_i\|/2 \quad\text{for } i = 0, 1, \ldots.$$
Proof. Claim 1: If $x_i$ and $B_i$ are well-defined and $\|M_i\| \le 1/(2\beta)$, then $B_i$ is invertible (so $x_{i+1}$ is well-defined), and
$$\|B_i^{-1}\| \le 2\beta, \qquad \|e_{i+1}\| \le (\gamma\beta\|e_i\| + 2\beta\|M_i\|)\|e_i\|.$$
Indeed, $F'(x^*)^{-1}B_i = I + F'(x^*)^{-1}M_i$, and since $\|M_i\| \le 1/(2\beta)$, $\|F'(x^*)^{-1}M_i\| \le 1/2$, so $B_i$ is invertible with $\|B_i^{-1}\| \le 2\beta$. Therefore, $x_{i+1}$ is well-defined. The estimate on $\|e_{i+1}\|$ follows easily from the bound on $B_i^{-1}$ and (4.2).
Note that from claim 1 and the hypotheses of the theorem we know that $x_1$ is well-defined and $\|e_1\| \le \|e_0\|/2$.
Claim 2: If $B_0, \ldots, B_i$ are defined and invertible, then
$$\|M_{i+1}\| \le \|M_i\| + \gamma\max(\|e_i\|, \|e_{i+1}\|).$$
To prove this, we use (4.3). The first term on the right-hand side is the product of $M_i$ with the orthogonal projection onto the orthogonal complement of $s_i$, so its 2-norm is bounded by $\|M_i\|$. For the second term, note that
$$v_i - F'(x^*)s_i = \int_0^1 [F'((1 - t)x_{i+1} + tx_i) - F'(x^*)]\,dt\,s_i.$$
Since $\|F'((1 - t)x_{i+1} + tx_i) - F'(x^*)\| \le \gamma\max(\|e_i\|, \|e_{i+1}\|)$,
$$\|v_i - F'(x^*)s_i\| \le \gamma\max(\|e_i\|, \|e_{i+1}\|)\|s_i\|,$$
and the second term on the right-hand side of (4.3) is bounded in norm by
$$\gamma\max(\|e_i\|, \|e_{i+1}\|),$$
which establishes the claim.
We are now ready to prove the theorem. We shall show, by induction on $i$, that $x_0, \ldots, x_{i+1}$ are well-defined and
$$\|e_i\| \le \frac{1}{8\gamma\beta}, \qquad \|M_i\| \le \frac{1}{8\beta}, \qquad \|e_{i+1}\| \le \|e_i\|/2.$$
This is clearly true for $i = 0$. Assuming it true for $i$ and all smaller indices, we immediately get the first inequality with $i$ replaced by $i + 1$. Using claim 2 repeatedly (and noting that $\|e_{i+1}\| \le \|e_i\| \le \cdots$), we have
$$\|M_{i+1}\| \le \|M_0\| + \gamma(\|e_0\| + \|e_1\| + \cdots + \|e_{i+1}\|) \le \|M_0\| + \gamma\|e_0\|(1 + 1/2 + \cdots) = \|M_0\| + 2\gamma\|e_0\| \le 1/(8\beta),$$
which establishes the second inequality, and then applying claim 1 gives the third inequality.

Notice that the constant $1/2$ in the linear convergence estimate arose from the proof rather than anything inherent to Broyden's method. Rearranging the proof, one could change this constant to any positive number. Thus the convergence is actually superlinear. It would be natural to try to prove this as an application of Theorem 4.8, but this is not possible, because it can be shown by example that $B_i$ need not converge to $F'(x^*)$. The superlinear convergence of Broyden's method was first proved in 1973 by Broyden, Dennis, and Moré. They proved slightly more, namely that $\|x_{i+1} - x^*\| \le r_i\|x_i - x^*\|$ where $r_i \to 0$.

6. Unconstrained minimization
We now turn to the problem of minimizing a real-valued function $F$ defined on $\mathbb{R}^n$. (The problem of minimizing $F$ over a subset of $\mathbb{R}^n$, e.g., a subspace or submanifold, is known as constrained minimization and is an important subject, which, however, we will not consider in this course.) We shall sometimes refer to $F$ as the cost function. Usually we will have to content ourselves with finding a local minimum of the cost function since most methods cannot distinguish local from global minima. Note that the word "local" comes up in two distinct senses when describing the behavior of minimization methods: methods are often only locally convergent (they converge only for initial iterate $x_0$ sufficiently near $x^*$), and often the limit $x^*$ is only a local minimum of the cost function.
If $F : \mathbb{R}^n \to \mathbb{R}$ is smooth, then at each $x$ its gradient $F'(x)$ is a row vector and its Hessian $F''(x)$ is a symmetric matrix. If $F$ achieves a local minimum at $x^*$, then $F'(x^*) = 0$ and $F''(x^*)$ is positive semidefinite. Moreover, if $F'(x^*) = 0$ and $F''(x^*)$ is positive definite, then $F$ definitely achieves a local minimum at $x^*$.
There is a close connection between the problem of minimizing a smooth real-valued function of $n$ variables and that of finding a root of an $n$-vector-valued function of $n$ variables. Namely, if $x^*$ is a minimizer of $F : \mathbb{R}^n \to \mathbb{R}$, then $x^*$ is a root of $F' : \mathbb{R}^n \to \mathbb{R}^n$. Another connection is that a point is a root of the function $K : \mathbb{R}^n \to \mathbb{R}^n$ if and only if it is a minimizer of $F(x) = \|K(x)\|^2$ (we usually use the 2-norm or a weighted 2-norm for this purpose, since then $F(x)$ is smooth if $K$ is).

7. Newton’s method
The idea of Newton's method for minimization problems is to approximate $F(x)$ by its quadratic Taylor polynomial, and minimize that. Thus
$$F(x) \approx F(x_i) + F'(x_i)(x - x_i) + \frac{1}{2}(x - x_i)^T F''(x_i)(x - x_i).$$
The quadratic on the right-hand side achieves a unique minimum value if and only if the matrix $F''(x_i)$ is positive definite, and in that case the minimum is given by the solution to the equation
$$F''(x_i)(x - x_i) + F'(x_i)^T = 0.$$
Thus we are led to the iteration
$$x_{i+1} = x_i - F''(x_i)^{-1}F'(x_i)^T.$$
Note that this is exactly the same as Newton's method for solving the equation $F'(x) = 0$. Thus we know that this method is locally quadratically convergent (to a root of $F'$, which might be only a local minimum of $F$).
Newton's method for minimization requires the construction and "inversion" of the entire Hessian matrix. Thus, as for systems, there is motivation for using quasi-Newton methods in which the Hessian is only approximated. In addition, there is the fact that Newton's method is only locally convergent. We shall return to both of these points below.
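As a small illustration of the iteration $x_{i+1} = x_i - F''(x_i)^{-1}F'(x_i)^T$, the sketch below minimizes the function $F(x, y) = x^2 + xy + y^2 + e^x$. This cost function is our own choice: its Hessian is positive definite everywhere, so every Newton step is well-defined. The function names, starting point, and tolerance are likewise assumptions of the sketch.

```python
import math

def gradient(x, y):
    """F'(x, y)^T for F(x, y) = x^2 + x*y + y^2 + exp(x)."""
    return (2.0 * x + y + math.exp(x), x + 2.0 * y)

def hessian(x, y):
    """F''(x, y); positive definite for every (x, y)."""
    return ((2.0 + math.exp(x), 1.0),
            (1.0, 2.0))

def newton_minimize(x, y, tol=1e-12, max_iter=50):
    for _ in range(max_iter):
        g = gradient(x, y)
        if math.hypot(g[0], g[1]) <= tol:
            break
        (a, b), (c, d) = hessian(x, y)
        det = a * d - b * c                    # positive, since the Hessian is definite
        # Solve F''(x_i) step = -F'(x_i)^T with the explicit 2x2 inverse.
        dx = (-g[0] * d + b * g[1]) / det
        dy = (-a * g[1] + c * g[0]) / det
        x, y = x + dx, y + dy
    return x, y

x_min, y_min = newton_minimize(0.0, 0.0)
print(x_min, y_min, gradient(x_min, y_min))
```

The same code is exactly Newton's method for the system $F'(x) = 0$, as noted above; only the interpretation of the root as a minimizer is new.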

8. Line search methods

Line search methods, of which there are many, take the form

choose initial iterate $x_0$
for $i = 0, 1, \ldots$
    choose search direction vector $s_i \in \mathbb{R}^n$
    choose step length $\lambda_i \in \mathbb{R}$
    $x_{i+1} = x_i + \lambda_i s_i$
end

There is a great deal of freedom in choosing the direction and the step length. The major criterion for the search direction is that a lower value of $F$ than $F(x_i)$ exist nearby on the line $x_i + \lambda s_i$. We may as well assume that the step $\lambda_i$ is positive, in which case this criterion is that $s_i$ is a descent direction, i.e., that $F(x_i + \lambda s_i)$ decreases as $\lambda$ increases from $0$. In terms of derivatives this condition is that $F'(x_i)s_i < 0$. Geometrically, this means that $s_i$ should make an acute angle with the negative gradient vector $-F'(x_i)^T$. An obvious choice is $s_i = -F'(x_i)^T$ (or $-F'(x_i)^T/\|F'(x_i)\|$ if we normalize), the direction of steepest descents.
For the choice of step length, one possibility is exact line search. This means that $\lambda_i$ is chosen to minimize $F(x_i + \lambda s_i)$ as a function of $\lambda$. In combination with the steepest descent direction we get the method of steepest descents:
choose $\lambda_i > 0$ minimizing $F(x_i - \lambda F'(x_i)^T)$ for $\lambda > 0$
set $x_{i+1} = x_i - \lambda_i F'(x_i)^T$
This method can be shown to be globally convergent to a local minimizer under fairly general circumstances. However, it may not be fast. To understand the situation better consider the minimization of a quadratic functional $F(x) = x^T Ax/2 - x^T b$ where $A \in \mathbb{R}^{n\times n}$ is symmetric positive definite and $b \in \mathbb{R}^n$. The unique minimizer of $F$ is then the solution $x^*$ to $Ax = b$. In this case, the descent direction at any point $x$ is simply $-F'(x)^T = b - Ax$, the residual. Moreover, for any search direction $s$, the step length $\lambda$ minimizing $F(x + \lambda s)$ (exact line search) can be computed analytically in this case:
$$F(x + \lambda s) = \frac{\lambda^2}{2}s^T As + \lambda(s^T Ax - s^T b) + \frac{1}{2}x^T Ax - x^T b,$$
$$\frac{d}{d\lambda}F(x + \lambda s) = \lambda s^T As + s^T(Ax - b),$$
so that at the minimum $\lambda = s^T(b - Ax)/s^T As$, and if $s = b - Ax$, the direction of steepest descent, $\lambda = s^T s/s^T As$. Thus the steepest descent algorithm for minimizing $x^T Ax/2 - x^T b$, i.e., for solving $Ax = b$, is

choose initial iterate $x_0$
for $i = 0, 1, \ldots$
    $s_i = b - Ax_i$
    $\lambda_i = \frac{s_i^T s_i}{s_i^T As_i}$
    $x_{i+1} = x_i + \lambda_i s_i$
end

It can be shown that this algorithm is globally convergent to the unique solution $x^*$ as long as the matrix $A$ is positive definite. However the convergence order is only linear and the rate is $(\kappa - 1)/(\kappa + 1)$ where $\kappa = \kappa(A)$ is the 2-norm condition number of $A$, i.e., the ratio of the largest to the smallest eigenvalue of $A$. Thus the convergence will be very slow if $A$ is not well-conditioned.
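The dependence on the condition number is easy to observe in a small experiment. The sketch below runs steepest descent with exact line search on $A = \operatorname{diag}(1, \kappa)$ for $\kappa = 2$ and $\kappa = 10$; the diagonal test matrix, the right-hand side, the starting point (chosen so that the initial error is $(\kappa, 1)$, for which the error contracts by exactly $(\kappa - 1)/(\kappa + 1)$ per step in exact arithmetic), and the stopping tolerance are all our own assumptions.

```python
def steepest_descent(A, b, x0, tol=1e-10, max_iter=10000):
    """Minimize x^T A x / 2 - x^T b (A symmetric positive definite) by steepest
    descent with exact line search. Returns (x, number of iterations used)."""
    n = len(b)
    x = x0[:]
    for it in range(max_iter):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        s = [b[i] - Ax[i] for i in range(n)]              # residual = search direction
        ss = sum(c * c for c in s)
        if ss ** 0.5 <= tol:
            return x, it
        As = [sum(A[i][j] * s[j] for j in range(n)) for i in range(n)]
        lam = ss / sum(s[i] * As[i] for i in range(n))    # lambda = s^T s / s^T A s
        x = [x[i] + lam * s[i] for i in range(n)]
    return x, max_iter

def run(kappa):
    A = [[1.0, 0.0], [0.0, kappa]]      # condition number kappa
    b = [1.0, kappa]                    # exact solution x* = (1, 1)
    x0 = [1.0 + kappa, 2.0]             # initial error (kappa, 1)
    return steepest_descent(A, b, x0)

x2, it2 = run(2.0)
x10, it10 = run(10.0)
print("kappa = 2: ", it2, "iterations, x =", x2)
print("kappa = 10:", it10, "iterations, x =", x10)
```

The iteration count for $\kappa = 10$ should come out several times larger than for $\kappa = 2$, consistent with the rates $9/11$ and $1/3$.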

Figure 4.1. Convergence of steepest descents with a quadratic cost function. Left: condition number 2; right: condition number 10. (Plots not reproduced; both axes run from −1 to 3.)

This highlights a weakness of the steepest descent direction. It will be even more pronounced for a difficult non-quadratic cost function, such as Rosenbrock's example in $\mathbb{R}^2$
$$F(x, y) = (y - x^2)^2 + 0.01(1 - x)^2.$$

Figure 4.2. Some contours of the Rosenbrock function. Minimum is at (1, 1). (Plot not reproduced; both axes run from −1 to 3.)

While exact line search is possible for a quadratic cost functions, in general it is a scalar
minimization problem which can be expensive or impossible to solve. Moreover, as illustrated
by the performance of steepest descents above, since the minimum may not be very near
the search line, it is often not worth the effort to search too carefully. Thus many methods
incorporate more or less sophisticated approximate line search algorithms. As we shall see,
it is possible to devise an approximate line search method which, when used in conjunction
with a reasonable choice of search direction, is globally convergent.
We begin our analysis with a simple calculus lemma.
Lemma 4.12. Let $f : \mathbb{R} \to \mathbb{R}$ be $C^1$ and bounded below and suppose that $f'(0) < 0$. For any $0 < \alpha < 1$ there exists a non-empty open interval $J \subset (0, \infty)$ such that
$$f(x) < f(0) + \alpha x f'(0), \qquad f'(x) > \alpha f'(0), \tag{4.4}$$
for all $x \in J$.
Proof. Since $f'(0) < 0$ and $0 < \alpha < 1$, we have $0 > \alpha f'(0) > f'(0)$. Thus the line $y = f(0) + \alpha f'(0)x$ lies above the curve $y = f(x)$ for sufficiently small positive $x$. But, since $f$ is bounded below, the line lies below the curve for $x$ sufficiently large. Thus $x_1 := \inf\{\, x > 0 \mid f(x) \ge f(0) + \alpha f'(0)x \,\} > 0$, and by continuity $f(x_1) = f(0) + \alpha f'(0)x_1$. Choose any $0 < x_0 < x_1$, so that $f(x_0) < f(0) + \alpha f'(0)x_0$. By the mean value theorem there exists $x$ between $x_0$ and $x_1$ such that
$$f'(x) = \frac{f(x_1) - f(x_0)}{x_1 - x_0} > \alpha f'(0).$$
Since $x < x_1$, the first inequality of (4.4) holds at this point $x$ as well, and by continuity both inequalities must hold on an open interval around the point.
Now suppose we use a line search method subject to the following restrictions on the search directions $s_i$ and the step lengths $\lambda_i$. We suppose that there exist constants $\eta \in (0, 1]$ and $\alpha, \beta \in (0, 1)$ such that for all $i$:
(H1) $-F'(x_i)s_i \ge \eta \|F'(x_i)\| \|s_i\|$,
(H2) $F(x_i + \lambda_i s_i) \le F(x_i) + \alpha \lambda_i F'(x_i)s_i$,
(H3) $F'(x_i + \lambda_i s_i)s_i \ge \beta F'(x_i)s_i$.
We shall show below that any line-search method meeting these conditions is, essentially,
globally convergent. Before doing so, let us discuss the three conditions. The first condition
concerns the choice of search direction. If η = 0 were permitted it would say that the search
direction is a direction of non-ascent. By insisting on η positive we insure that the search
direction is a direction of descent (F (xi + λsi ) is a decreasing function of λ at λ = 0).
However the condition also enforces a uniformity with respect to i. Specifically, it says
that the angle between $s_i$ and the steepest descent direction $-F'(x_i)^T$ is bounded above by $\arccos \eta < \pi/2$. The steepest descent direction satisfies (H1) for all $\eta \le 1$ and so if $\eta < 1$ there is an open set of directions satisfying this condition. If the Hessian is positive definite, then the Newton direction $-F''(x_i)^{-1}F'(x_i)^T$ satisfies (H1) for $\eta \le 1/\kappa_2(F''(x_i))$, with $\kappa_2$ the condition number with respect to the 2-norm, i.e., the ratio of the largest to smallest eigenvalues (verify!). One possible strategy to obtain the fast local convergence of Newton's
method without sacrificing global convergence is to use the Newton direction for si whenever
it satisfies (H1) (so whenever F 00 (xi ) is positive definite and not too badly conditioned),
otherwise to use steepest descents. A better approach when the Newton direction fails
(H1) may be to use a convex combination of the Newton direction and the steepest descent
direction: $s_i = -[\theta F''(x_i)^{-1}F'(x_i)^T + (1 - \theta)F'(x_i)^T]$, which will satisfy (H1) if $\theta > 0$ is small enough. Or similarly, one can take $s_i = -[F''(x_i) + \nu I]^{-1}F'(x_i)^T$ with $\nu \ge 0$ large enough to ensure that the bracketed matrix is positive definite. This is the Levenberg–Marquardt search direction.
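The first strategy described above (use the Newton direction whenever it satisfies (H1), otherwise fall back to steepest descent) can be sketched as follows. This is only an illustration, assuming NumPy; the function name `choose_direction` and the default value of `eta` are hypothetical choices, not taken from the text.

```python
import numpy as np

def choose_direction(g, H, eta=1e-3):
    """Return a search direction satisfying (H1) with constant eta.

    g is the gradient F'(x)^T and H the Hessian F''(x) at the current point.
    The Newton direction is used if it passes the (H1) test; otherwise the
    steepest descent direction -g is returned.
    """
    g = np.asarray(g, dtype=float)
    try:
        s = -np.linalg.solve(H, g)               # Newton direction
    except np.linalg.LinAlgError:                # singular Hessian
        return -g
    # (H1): -F'(x) s >= eta * ||F'(x)|| * ||s||
    if -(g @ s) >= eta * np.linalg.norm(g) * np.linalg.norm(s):
        return s
    return -g

if __name__ == "__main__":
    g = np.array([1.0, 2.0])
    print(choose_direction(g, np.diag([2.0, 1.0])))    # Newton direction accepted
    print(choose_direction(g, np.diag([1.0, -1.0])))   # indefinite: falls back to -g
```

A Levenberg–Marquardt variant would instead increase a shift $\nu$ until the shifted Hessian yields a direction passing the same test.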
Conditions (H2) and (H3) concern the step length. Roughly, (H2) insures that it is not
too large, and in particular insures that F (xi+1 ) < F (xi ). It is certainly satisfied if λi is
sufficiently small. On the other hand (H3) ensures that the step is not too small, since it is
not satisfied for λi = 0. It is however satisfied at a minimizing λi if one exists. The lemma
insures us that if 0 < α < β < 1, then there is an open interval of values of λi satisfying
(H2) and (H3), and hence it is possible to design line-search algorithms which find a suitable
λi in a finite number of steps. See, e.g., R. Fletcher, Practical Methods of Optimization or
J. Dennis & R. Schnabel, Numerical methods for unconstrained optimization and nonlinear
equations. Fletcher also discusses typical choices for α and β. Typically β is fixed somewhere
between 0.9 and 0.1, the former resulting in a faster line search while the latter in a more
exact line search. Fletcher says that α is generally taken to be quite small, e.g., 0.01, but that the value of α is not important in most cases, since it is usually the value of β which determines point acceptability.
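As an illustration of how a step length satisfying (H2) and (H3) might be found in finitely many steps, here is a minimal sketch of a simple bracketing and bisection search, assuming NumPy. It is not the algorithm of Fletcher or of Dennis and Schnabel; the names `wolfe_line_search`, `phi`, and `dphi` are hypothetical, with `phi(lam)` standing for $F(x_i + \lambda s_i)$ and `dphi(lam)` for $F'(x_i + \lambda s_i)s_i$. The defaults for `alpha` and `beta` are the typical values quoted above.

```python
import math
import numpy as np

def wolfe_line_search(phi, dphi, alpha=0.01, beta=0.9, lam=1.0, max_iter=100):
    """Find lam > 0 with phi(lam) <= phi(0) + alpha*lam*dphi(0)   (H2)
    and dphi(lam) >= beta*dphi(0)                                 (H3).
    Assumes dphi(0) < 0 and 0 < alpha < beta < 1."""
    phi0, dphi0 = phi(0.0), dphi(0.0)
    lo, hi = 0.0, math.inf                  # current bracket for the step length
    for _ in range(max_iter):
        if phi(lam) > phi0 + alpha * lam * dphi0:    # (H2) fails: step too long
            hi = lam
            lam = 0.5 * (lo + hi)
        elif dphi(lam) < beta * dphi0:               # (H3) fails: step too short
            lo = lam
            lam = 2.0 * lo if math.isinf(hi) else 0.5 * (lo + hi)
        else:
            return lam
    raise RuntimeError("no acceptable step length found")

def rosenbrock(p):
    x, y = p
    return (y - x**2) ** 2 + 0.01 * (1.0 - x) ** 2

def rosenbrock_grad(p):
    x, y = p
    return np.array([-4.0 * x * (y - x**2) - 0.02 * (1.0 - x), 2.0 * (y - x**2)])

if __name__ == "__main__":
    x = np.array([-1.0, 2.0])
    s = -rosenbrock_grad(x)                 # steepest descent direction
    lam = wolfe_line_search(lambda t: rosenbrock(x + t * s),
                            lambda t: rosenbrock_grad(x + t * s) @ s)
    print("accepted step length:", lam)
```

The search halves the step while (H2) fails and doubles or bisects while (H3) fails, so it terminates once the trial step lands in an interval of acceptable values such as the one guaranteed by the lemma.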
We now state the global convergence theorem.
Theorem 4.13. Suppose that $F : \mathbb{R}^n \to \mathbb{R}$ is $C^1$ and bounded below, that $x_0 \in \mathbb{R}^n$ is such that $\{\, x \in \mathbb{R}^n : F(x) \le F(x_0) \,\}$ is bounded, and that $x_1, x_2, \ldots$ is defined by a line search method with descent search directions and positive step lengths satisfying the three conditions above. Then $\lim_{i\to\infty} F(x_i)$ exists and $\lim_{i\to\infty} F'(x_i) = 0$.

Remark. If there is only one critical point $x_*$ of $F$ in the region $\{\, x \in \mathbb{R}^n : F(x) \le F(x_0) \,\}$, then the theorem guarantees that $\lim x_i = x_*$. In general the theorem does not quite guarantee that the $x_i$ converge to anything, but by compactness the $x_i$ must have one or more accumulation points, and these must be critical points.
Proof. Since the sequence $F(x_i)$ is decreasing and bounded below, it converges. Hence $F(x_i) - F(x_{i+1}) \to 0$. By (H2),
$$F(x_i) - F(x_{i+1}) \ge -\alpha \lambda_i F'(x_i)s_i \ge 0,$$
so $\lambda_i F'(x_i)s_i \to 0$. By (H1), this implies $\|F'(x_i)\| \lambda_i \|s_i\| \to 0$. There are now two possibilities: either $\|F'(x_i)\| \to 0$, in which case we are done, or else there exists $\epsilon > 0$ and a subsequence $S$ with $\|F'(x_i)\| \ge \epsilon$, $i \in S$. In view of the previous limit, $\lambda_i s_i \to 0$, i.e., $x_i - x_{i+1} \to 0$, for $i \in S$. Since all the iterates belong to the compact set $\{\, x \in \mathbb{R}^n : F(x) \le F(x_0) \,\}$, we may invoke uniform continuity of $F'$ to conclude that $F'(x_{i+1}) - F'(x_i) \to 0$ as $i \to \infty$, $i \in S$. We shall show that this is a contradiction.
Using (H3) and (H1), we have for all $i$
$$\|F'(x_{i+1}) - F'(x_i)\| \|s_i\| \ge [F'(x_{i+1}) - F'(x_i)]s_i \ge (1 - \beta)[-F'(x_i)s_i] \ge \eta(1 - \beta)\|F'(x_i)\| \|s_i\|.$$
Hence for $i \in S$,
$$\|F'(x_{i+1}) - F'(x_i)\| \ge \eta(1 - \beta)\epsilon > 0,$$
which gives the contradiction.
The next theorem shows that if the point xi is sufficiently close to a minimum, then
choosing si to be the Newton direction and λi = 1 satisfies (H1)–(H3). This means that
it is possible to construct algorithms which are globally convergent, but which are also
quadratically convergent, since they eventually coincide with Newton’s method.
Theorem 4.14. Suppose that $F$ is smooth, $x_*$ is a local minimum of $F$, and $F''(x_*)$ is positive definite. Let $0 < \alpha < 1/2$, $\alpha < \beta < 1$. Then there exists $\epsilon > 0$ such that if $\|x_i - x_*\| \le \epsilon$, $s_i = -F''(x_i)^{-1}F'(x_i)^T$, and $\lambda_i = 1$, then (H1)–(H3) are satisfied with $\eta = 1/\{4\kappa_2[F''(x_*)]\}$.
Proof. Let $D$ denote the ball about $x_*$ of radius $\epsilon$, where $\epsilon > 0$ will be chosen below. From our analysis of Newton's method we know that by taking $\epsilon$ sufficiently small, $x_i \in D \implies x_i + s_i \in D$. By continuity of $F''$, we may also arrange that whenever $x \in D$, $F''(x)$ is positive definite, $\|F''(x)\| \le 2\|F''(x_*)\|$, and $\|F''(x)^{-1}\| \le 2\|F''(x_*)^{-1}\|$.
Then
$$-F'(x_i)s_i = F'(x_i)F''(x_i)^{-1}F'(x_i)^T \ge \lambda_{\min}[F''(x_i)^{-1}]\|F'(x_i)\|^2 = \frac{1}{\|F''(x_i)\|}\|F'(x_i)\|^2 \ge \frac{1}{2\|F''(x_*)\|}\|F'(x_i)\|^2.$$
Now
$$\|F'(x_i)\| \ge \frac{1}{\|F''(x_i)^{-1}\|}\|s_i\| \ge \frac{1}{2\|F''(x_*)^{-1}\|}\|s_i\|,$$
and (H1) follows from the last two estimates.

By Taylor's theorem,
$$F(x_i + s_i) - F(x_i) = F'(x_i)s_i + \frac{1}{2}s_i^T F''(\bar x)s_i,$$
for some $\bar x \in D$. Thus
$$F(x_i + s_i) - F(x_i) = \frac{1}{2}F'(x_i)s_i + \frac{1}{2}s_i^T[F'(x_i)^T + F''(x_i)s_i] + \frac{1}{2}s_i^T[F''(\bar x) - F''(x_i)]s_i.$$
Now the second term on the right-hand side vanishes by the choice of $s_i$, and the third term can be bounded by a Lipschitz condition on $F''$, so
$$F(x_i + s_i) - F(x_i) \le \frac{1}{2}F'(x_i)s_i + \frac{\gamma}{2}\|s_i\|^2.$$
Since we have already established (H1), we have
$$\|s_i\|^2 \le 2\|F''(x_*)^{-1}\| \|F'(x_i)\| \|s_i\| \le -\frac{2}{\eta}\|F''(x_*)^{-1}\| F'(x_i)s_i. \tag{4.5}$$
Combining the last two estimates and choosing $\epsilon$ sufficiently small gives (H2) with any desired $\alpha < 1/2$.
For (H3), we note that
$$F'(x_i + s_i) = F'(x_i) + s_i^T F''(x_i) + s_i^T[F''(\tilde x) - F''(x_i)] = s_i^T[F''(\tilde x) - F''(x_i)],$$
for some $\tilde x \in D$. Using the Lipschitz condition and (4.5) we get
$$F'(x_i + s_i)s_i \ge -\gamma\|s_i\|^2 \ge \gamma\frac{2}{\eta}\|F''(x_*)^{-1}\| F'(x_i)s_i,$$
and the desired estimate holds for $\epsilon$ sufficiently small.

9. Conjugate gradients
Now we return to the case of minimization of a positive definite quadratic function
F (x) = xT Ax/2 − xT b with A ∈ Rn×n symmetric positive definite and b ∈ Rn . So the unique
minimizer x∗ is the solution to the linear system Ax = b. Consider now a line search method
with exact line search:
choose initial iterate x_0
for i = 0, 1, . . .
    choose search direction s_i
    λ_i = s_i^T (b − A x_i) / (s_i^T A s_i)
    x_{i+1} = x_i + λ_i s_i
end

Thus $x_1 = x_0 + \lambda_0 s_0$ minimizes $F$ over the 1-dimensional affine space $x_0 + \operatorname{span}[s_0]$, and then $x_2 = x_0 + \lambda_0 s_0 + \lambda_1 s_1$ minimizes $F$ over the 1-dimensional affine space $x_0 + \lambda_0 s_0 + \operatorname{span}[s_1]$. However, $x_2$ does not in general minimize $F$ over the 2-dimensional affine space $x_0 + \operatorname{span}[s_0, s_1]$. If it always did, then for 2-dimensional problems we would have $x_2 = x_*$, and we saw that this is not the case for steepest descents.
However, it turns out that there is a simple condition on the search directions si that
ensures that x2 is the minimizer of F over x0 + span[s0 , s1 ], and more generally that xi is the
minimizer of F over x0 + span[s0 , . . . , si−1 ]. In particular (as long as the search directions
are linearly independent), this implies that xn = x∗ .
Theorem 4.15. Suppose that $x_i$ are defined by exact line search using search directions which are $A$-orthogonal: $s_i^T A s_j = 0$ for $i \ne j$. Then
$$F(x_i) = \min\{\, F(x) \mid x \in x_0 + \operatorname{span}[s_0, \ldots, s_{i-1}] \,\}.$$
Proof. By induction on $i$, the case $i = 1$ being clear. Write $W_i$ for $\operatorname{span}[s_0, \ldots, s_{i-1}]$. Now
$$\min_{x_0 + W_{i+1}} F = \min_{y \in x_0 + W_i}\ \min_{\lambda \in \mathbb{R}} F(y + \lambda s_i).$$
But
$$F(y + \lambda s_i) = \frac{1}{2}y^T A y + \lambda s_i^T A y + \frac{\lambda^2}{2}s_i^T A s_i - y^T b - \lambda s_i^T b.$$
The second term on the right-hand side appears to couple the minimizations with respect to $y$ and $\lambda$, but in fact this is not so. Indeed, $x_i \in x_0 + W_i$, so for $y \in x_0 + W_i$, $y - x_i \in W_i$ and so is $A$-orthogonal to $s_i$. That is, $s_i^T A y = s_i^T A x_i$, whence
$$F(y + \lambda s_i) = \Big[\frac{1}{2}y^T A y - y^T b\Big] + \Big[\frac{\lambda^2}{2}s_i^T A s_i + \lambda s_i^T(Ax_i - b)\Big],$$
and the minimization problem decouples. By induction the minimum of the first term in brackets over $x_0 + W_i$ is achieved by $y = x_i$, and clearly the second term is minimized by $\lambda = s_i^T(b - Ax_i)/s_i^T A s_i$, i.e., the exact line search. Thus $x_{i+1} = x_i + \lambda_i s_i$ minimizes $F$ over $x_0 + W_{i+1}$.
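The theorem can be checked numerically. The sketch below, assuming NumPy, builds $A$-orthogonal directions by Gram-Schmidt in the $A$-inner product applied to the standard basis vectors (an arbitrary illustrative choice of starting vectors) and then performs exact line searches along them; the function names are hypothetical. After $n$ steps the iterate should agree with the solution of $Ax = b$ up to rounding.

```python
import numpy as np

def a_orthogonalize(A, V):
    """Gram-Schmidt in the A-inner product applied to the columns of V.

    Returns a list of mutually A-orthogonal vectors spanning the same space.
    """
    S = []
    for v in V.T:                                   # iterate over columns of V
        s = v.astype(float).copy()
        for t in S:
            s -= (t @ A @ v) / (t @ A @ t) * t      # remove the A-component along t
        S.append(s)
    return S

def conjugate_direction_solve(A, b, x0, directions):
    """Exact line search along each of the given search directions in turn."""
    x = np.array(x0, dtype=float)
    for s in directions:
        lam = s @ (b - A @ x) / (s @ A @ s)         # exact line search step length
        x = x + lam * s
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((5, 5))
    A = B @ B.T + 5.0 * np.eye(5)                   # symmetric positive definite
    b = rng.standard_normal(5)
    S = a_orthogonalize(A, np.eye(5))
    x = conjugate_direction_solve(A, b, np.zeros(5), S)
    print("residual norm after n steps:", np.linalg.norm(b - A @ x))
```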

Any method which uses A-orthogonal (also called “conjugate”) search directions has the
nice property of the theorem. However it is not so easy to construct such directions. By
far the most useful method is the method of conjugate gradients, or the CG method, which
defines the search directions by A-orthogonalizing the residuals ri = b − Axi :
• $s_0 = r_0$,
• $s_i = r_i - \sum_{j=0}^{i-1} \dfrac{s_j^T A r_i}{s_j^T A s_j}\, s_j$.

The last formula (which is just the Gram-Schmidt procedure) appears to be quite expensive
to implement, but fortunately we shall see that it may be greatly simplified.

Lemma 4.16.
1. $W_i = \operatorname{span}[s_0, \ldots, s_{i-1}] = \operatorname{span}[r_0, \ldots, r_{i-1}]$.
2. The residuals are $l^2$-orthogonal: $r_i^T r_j = 0$ for $i \ne j$.
3. There exists $m \le n$ such that $W_1 \subsetneq W_2 \subsetneq \cdots \subsetneq W_m = W_{m+1} = \cdots$ and $x_0 \ne x_1 \ne \cdots \ne x_m = x_{m+1} = \cdots = x_*$.
4. For $i \le m$, $\{ s_0, \ldots, s_{i-1} \}$ is an $A$-orthogonal basis for $W_i$ and $\{ r_0, \ldots, r_{i-1} \}$ is an $l^2$-orthogonal basis for $W_i$.
5. $s_i^T r_j = r_i^T r_i$ for $0 \le j \le i$.

Proof. The first statement comes directly from the definitions. To verify the second statement, note that, for $0 \le j < i$, $F(x_i + t r_j)$ is minimal when $t = 0$, which gives $r_j^T(Ax_i - b) = 0$, which is the desired orthogonality. For the third statement, certainly there is a least integer $m \in [1, n]$ so that $W_m = W_{m+1}$. Then $r_m = 0$ since it both belongs to $W_m$ and is orthogonal to $W_m$. This implies that $x_m = x_*$ and that $s_m = 0$. Since $s_m = 0$, $W_{m+1} = W_m$ and $x_{m+1} = x_m = x_*$. Therefore $r_{m+1} = 0$, which implies that $s_{m+1} = 0$, therefore $W_{m+2} = W_{m+1}$, $x_{m+2} = x_*$, etc.
The fourth statement is an immediate consequence of the preceding ones. For the last statement, we use the orthogonality of the residuals to see that $s_i^T r_i = r_i^T r_i$. But, if $0 \le j \le i$, then
$$s_i^T r_j - s_i^T r_0 = s_i^T A(x_0 - x_j) = 0,$$
since $x_0 - x_j \in W_i$.
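Since $s_i \in W_{i+1}$ and the $r_j$, $j \le i$, are an orthogonal basis for that space for $i < m$, we have
$$s_i = \sum_{j=0}^{i} \frac{s_i^T r_j}{r_j^T r_j}\, r_j.$$
In view of part 5 of the lemma, we can simplify
$$s_i = r_i^T r_i \sum_{j=0}^{i} \frac{r_j}{r_j^T r_j} = r_i + r_i^T r_i \sum_{j=0}^{i-1} \frac{r_j}{r_j^T r_j}.$$
Applying the same formula with $i$ replaced by $i - 1$ shows that the last sum equals $s_{i-1}/(r_{i-1}^T r_{i-1})$, whence
$$s_i = r_i + \frac{r_i^T r_i}{r_{i-1}^T r_{i-1}}\, s_{i-1}.$$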

This is the formula which is used to compute the search direction. In implementing this
formula it is useful to compute the residual from the formula ri+1 = ri − λi Asi (since
xi+1 = xi + λi si ). Putting things together we obtain the following implementation of CG:

choose initial iterate x_0, set s_0 = r_0 = b − A x_0
for i = 0, 1, . . .
    λ_i = (r_i^T r_i) / (s_i^T A s_i)
    x_{i+1} = x_i + λ_i s_i
    r_{i+1} = r_i − λ_i A s_i
    s_{i+1} = r_{i+1} + [(r_{i+1}^T r_{i+1}) / (r_i^T r_i)] s_i
end

At each step we have to perform one multiplication of a vector by A, two dot-products, and three SAXPYs. When A is sparse, so that multiplication by A is inexpensive, the conjugate gradient method is most useful. Here is the algorithm written out in full in pseudocode:
choose initial iterate x
r ← b − Ax
r2 ← r^T r
s ← r
for i = 0, 1, . . .
    t ← As                  (matrix multiplication)
    s2 ← s^T t              (dot product)
    λ ← r2/s2
    x ← x + λs              (SAXPY)
    r2old ← r2
    r ← r − λt              (SAXPY)
    r2 ← r^T r              (dot product)
    s ← r + (r2/r2old)s     (SAXPY)
end
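The pseudocode translates almost line by line into Python. The following is a sketch, assuming NumPy and a dense matrix for simplicity; the function name `conjugate_gradient`, the stopping test on the residual norm, and the default iteration cap are illustrative additions that are not part of the pseudocode.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Conjugate gradients for A x = b with A symmetric positive definite.

    Returns the approximate solution and the number of iterations performed.
    """
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b) if x0 is None else np.array(x0, dtype=float)
    if max_iter is None:
        max_iter = 10 * len(b)
    r = b - A @ x
    r2 = r @ r
    s = r.copy()
    for i in range(max_iter):
        if np.sqrt(r2) <= tol:          # stopping criterion on the residual norm
            return x, i
        t = A @ s                       # matrix multiplication
        lam = r2 / (s @ t)              # dot product
        x += lam * s                    # SAXPY
        r2_old = r2
        r -= lam * t                    # SAXPY
        r2 = r @ r                      # dot product
        s = r + (r2 / r2_old) * s       # SAXPY
    return x, max_iter

if __name__ == "__main__":
    n = 50
    # Tridiagonal test matrix with 2 on the diagonal and -1 off the diagonal.
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x, iters = conjugate_gradient(A, b)
    print(iters, np.linalg.norm(A @ x - b))
```

For a genuinely sparse matrix one would replace the dense product `A @ s` by a sparse matrix-vector product; the rest of the loop is unchanged.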

In exact arithmetic the conjugate gradient method gives the exact solution in at most $n$ iterations, but it is most commonly terminated after far fewer iterations. A typical stopping criterion would be to test if r2 is below a given tolerance. To justify this, we shall show that the method is linearly convergent and we shall establish the rate of convergence. For analytical purposes, it is most convenient to use the vector norm $\|x\|_A := (x^T A x)^{1/2}$, and its associated matrix norm.
Lemma 4.17. $W_i = \operatorname{span}[r_0, Ar_0, \ldots, A^{i-1}r_0]$ for $i = 1, 2, \ldots, m$.
Proof. Since $\dim W_i = i$, it is enough to show that $W_i \subset \operatorname{span}[r_0, Ar_0, \ldots, A^{i-1}r_0]$, which we do by induction. This is certainly true for $i = 1$. Assume it holds for some $i$. Then, since $x_i \in x_0 + W_i$, $r_i = b - Ax_i \in r_0 + AW_i \subset \operatorname{span}[r_0, Ar_0, \ldots, A^i r_0]$, and therefore $W_{i+1}$, which is spanned by $W_i$ and $r_i$, is contained in $\operatorname{span}[r_0, Ar_0, \ldots, A^i r_0]$, which completes the induction.

The space $\operatorname{span}[r_0, Ar_0, \ldots, A^{i-1}r_0]$ is called the Krylov space generated by the matrix $A$ and the vector $r_0$. Note that we have as well
$$W_i = \operatorname{span}[r_0, Ar_0, \ldots, A^{i-1}r_0] = \{\, p(A)r_0 \mid p \in P_{i-1} \,\} = \{\, q(A)(x_* - x_0) \mid q \in P_i,\ q(0) = 0 \,\}.$$
Since $r_i$ is $l^2$-orthogonal to $W_i$, $x_* - x_i$ is $A$-orthogonal to $W_i$ so
$$\|x_* - x_i\|_A = \inf_{w \in W_i} \|x_* - x_i + w\|_A.$$
Since $x_i - x_0 \in W_i$,
$$\inf_{w \in W_i} \|x_* - x_i + w\|_A = \inf_{w \in W_i} \|x_* - x_0 + w\|_A.$$
Combining the last three equations, we get
$$\|x_* - x_i\|_A = \inf_{\substack{q \in P_i \\ q(0) = 0}} \|x_* - x_0 + q(A)(x_* - x_0)\|_A = \inf_{\substack{p \in P_i \\ p(0) = 1}} \|p(A)(x_* - x_0)\|_A.$$

Figure 4.3. The quintic polynomial equal to 1 at 0 with the smallest $L^\infty$ norm on [2, 10]. This is a scaled Chebyshev polynomial, and so the norm can be computed exactly.
[Figure: two panels, the left on the interval [0, 12], the right a close-up on [2, 10].]

Applying the obvious bound $\|p(A)(x_* - x_0)\|_A \le \|p(A)\|_A \|x_* - x_0\|_A$ we see that we can obtain an error estimate for the conjugate gradient method by estimating
$$C = \inf_{\substack{p \in P_i \\ p(0) = 1}} \|p(A)\|_A.$$
Now if $0 < \rho_1 < \cdots < \rho_n$ are the eigenvalues of $A$, then the eigenvalues of $p(A)$ are $p(\rho_j)$, $j = 1, \ldots, n$, and $\|p(A)\|_A = \max_j |p(\rho_j)|$ (this is left as exercise 6). Thus (see footnote 1)
$$C = \inf_{\substack{p \in P_i \\ p(0) = 1}} \max_j |p(\rho_j)| \le \inf_{\substack{p \in P_i \\ p(0) = 1}} \max_{\rho_1 \le \rho \le \rho_n} |p(\rho)|.$$
The final infimum can be calculated explicitly using the Chebyshev polynomials, see Figure 4.3 and (1.16). The minimum value is precisely
$$\frac{2}{\left(\dfrac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1}\right)^i + \left(\dfrac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^i} \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^i,$$
where $\kappa = \rho_n/\rho_1$ is the condition number of $A$. (To get the right-hand side, we suppressed the second term in the denominator of the left-hand side, which is less than 1 and tends to zero with $i$, and kept only the first term, which is greater than 1 and tends to infinity with
$i$.) We have thus proven that
$$\|x_i - x_*\|_A \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^i \|x_0 - x_*\|_A,$$
which is linear convergence with rate
$$r = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}.$$
Note that $r \sim 1 - 2/\sqrt{\kappa}$ for large $\kappa$. So the convergence deteriorates when the condition number is large.
Footnote 1. Here we bound $\max_j |p(\rho_j)|$ by $\max_{\rho_1 \le \rho \le \rho_n} |p(\rho)|$ simply because we can minimize the latter quantity explicitly. However this does not necessarily lead to the best possible estimate, and the conjugate gradient method is often observed to converge faster than the result derived here. Better bounds can sometimes be obtained by taking into account the distribution of the spectrum of $A$, rather than just its minimum and maximum.
Let us compare this convergence estimate with the analogous one for the method of steepest descents. To derive an estimate for steepest descents, we use the fact that the first step of conjugate gradients coincides with steepest descents, and so
$$\|x_* - x_1\|_A \le \frac{2}{\dfrac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1} + \dfrac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}}\, \|x_* - x_0\|_A = \frac{\kappa - 1}{\kappa + 1}\, \|x_* - x_0\|_A.$$
Of course, the same result holds if we replace $x_0$ by $x_i$ and $x_1$ by $x_{i+1}$. Thus steepest descents converges linearly, with rate $(\kappa - 1)/(\kappa + 1)$. Notice that the estimates indicate that a large value of $\kappa$ will slow the convergence of both steepest descents and conjugate gradients, but, since the dependence is on $\sqrt{\kappa}$ rather than $\kappa$, the convergence of conjugate gradients will usually be much faster.
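To get a feel for the two estimates, one can compute how many iterations each bound requires to guarantee a given error reduction. The sketch below uses only the Python standard library; the function names are hypothetical, and the sample values $\kappa = 1400$ and a reduction factor of $10^{-6}$ are chosen only for illustration.

```python
import math

def cg_rate(kappa):
    """Rate in the conjugate gradient bound: (sqrt(kappa) - 1) / (sqrt(kappa) + 1)."""
    return (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)

def sd_rate(kappa):
    """Rate in the steepest descent bound: (kappa - 1) / (kappa + 1)."""
    return (kappa - 1.0) / (kappa + 1.0)

def iterations_needed(rate, factor, reduction):
    """Smallest integer i with factor * rate**i <= reduction."""
    return math.ceil(math.log(reduction / factor) / math.log(rate))

if __name__ == "__main__":
    kappa, reduction = 1400.0, 1e-6
    # The CG bound carries the leading factor 2; the SD bound has factor 1.
    print("CG bound:", iterations_needed(cg_rate(kappa), 2.0, reduction))
    print("SD bound:", iterations_needed(sd_rate(kappa), 1.0, reduction))
```

These are worst-case counts derived from the bounds alone; the iteration counts observed in practice can be considerably smaller.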
Figure 4.4 shows a plot of the norm of the residual versus the number of iterations for the conjugate gradient method and the method of steepest descents applied to a matrix of size 233 arising from a finite element simulation. The matrix is irregular, but sparse (averaging about 6 nonzero elements per row), and has a condition number of about 1,400. A logarithmic scale is used on the y-axis so the near linearity of the graph reflects linear convergence behavior. For conjugate gradients, the observed rate of linear convergence is about 0.8, and it takes 80 iterations to reduce the initial residual by a factor of about $10^6$. The convergence of steepest descents is too slow to be useful: in 400 iterations the residual is not even reduced by a factor of 2.

Figure 4.4. Convergence of conjugate gradients for solving a finite element system of size 233. On the left 300 iterations are shown, on the right the first 50. Steepest descents is shown for comparison.
[Figure: two semilogarithmic plots of the norm of the residual versus iterations, with curves labeled SD and CG.]

Remark. 1. The conjugate gradient algorithm can be generalized to apply to the min-
imization of general (non-quadratic) functionals. The Fletcher–Reeves method is such a
generalization. However in the non-quadratic case the method is significantly more compli-
cated, both to implement and to analyze.
2. There are a variety of conjugate-gradient-like iterative methods that apply to matrix
problems Ax = b where A is either indefinite, non-symmetric, or both. Many share the idea
of approximation of the solution in a Krylov space.
9.1. Preconditioning. The idea is to choose a symmetric positive definite matrix $M \approx A$ such that the system $Mz = c$ is relatively easy to solve. We then consider the preconditioned system $M^{-1}Ax = M^{-1}b$. The new matrix $M^{-1}A$ is SPD with respect to the $M$-inner product, and we solve the preconditioned system using conjugate gradients but using the $M$-inner product in place of the $l^2$-inner product. Thus to obtain the preconditioned conjugate gradient algorithm, or PCG, we substitute $M^{-1}A$ for $A$ everywhere and change expressions of the form $x^T y$ into $x^T M y$. Note that the $A$-inner product $x^T A y$ remains invariant under these two changes. Thus we obtain the algorithm:

choose initial iterate x_0, set s_0 = r̄_0 = M^{-1} b − M^{-1} A x_0
for i = 0, 1, . . .
    λ_i = (r̄_i^T M r̄_i) / (s_i^T A s_i)
    x_{i+1} = x_i + λ_i s_i
    r̄_{i+1} = r̄_i − λ_i M^{-1} A s_i
    s_{i+1} = r̄_{i+1} + [(r̄_{i+1}^T M r̄_{i+1}) / (r̄_i^T M r̄_i)] s_i
end

Note that the term $s_i^T A s_i$ arises as the $M$-inner product of $s_i$ with $M^{-1}As_i$. The quantity $\bar r_i$ is the residual in the preconditioned equation, which is related to the regular residual, $r_i = b - Ax_i$, by $r_i = M\bar r_i$. Writing PCG in terms of $r_i$ rather than $\bar r_i$ we get

choose initial iterate x_0, set r_0 = b − A x_0, s_0 = M^{-1} r_0
for i = 0, 1, . . .
    λ_i = (r_i^T M^{-1} r_i) / (s_i^T A s_i)
    x_{i+1} = x_i + λ_i s_i
    r_{i+1} = r_i − λ_i A s_i
    s_{i+1} = M^{-1} r_{i+1} + [(r_{i+1}^T M^{-1} r_{i+1}) / (r_i^T M^{-1} r_i)] s_i
end

Thus we need to compute M −1 ri at each iteration. Otherwise the work is essentially the
same as for ordinary conjugate gradients. Since the algorithm is just conjugate gradients for
the preconditioned equation we immediately have an error estimate:
$$\|x_i - x_*\|_A \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^i \|x_0 - x_*\|_A,$$
where $\kappa$ now is the ratio of the largest to the least eigenvalue of $M^{-1}A$. To the extent that $M$ approximates $A$, this ratio will be close to 1 and so the algorithm will converge quickly.
The matrix M is called the preconditioner. A good preconditioner should have two prop-
erties. First, it must be substantially easier to solve systems with the matrix M than with
the original matrix A, since we will have to solve such a system at each step of the precon-
ditioned conjugate gradient algorithm. Second, the matrix M −1 A should be substantially
better conditioned than A, so that PCG converges faster than ordinary CG. In short, M
should be near A, but much easier to invert. One simple possibility is to take M to be the
diagonal matrix with the same diagonal entries as A. This certainly fulfils the first criterion
(easy invertibility), and for some matrices A, the second criterion is met as well. A similar
possibility is to take M to be a tridiagonal matrix with its nonzero entries taken from A. A
third possibility which is often applied when A is sparse is to determine M via the incomplete
Cholesky factorization. This means that a triangular matrix $L$ is computed by the Cholesky algorithm applied to $A$, except that no fill-in is allowed: only the non-zero elements of $A$ are altered, and the zero elements left untouched. One then takes $M = LL^T$, so that $M^{-1}$ is easy to apply. Other preconditioners take into account the source of the matrix problem. For example, if a matrix arises from the discretization of a complex partial differential equation, we might precondition it by the discretization matrix for a simpler related differential equation (if that leads to a linear system which is easier to solve). In fact the derivation of good preconditioners for important classes of linear systems remains a very active research area.
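A sketch of PCG written in terms of the regular residual $r_i$, with the diagonal preconditioner as an example, might look as follows (assuming NumPy). The function name `pcg`, the callable `apply_Minv` standing for the action $r \mapsto M^{-1}r$, the relative stopping test, and the badly scaled test matrix are all illustrative assumptions rather than part of the text.

```python
import numpy as np

def pcg(A, b, apply_Minv, x0=None, tol=1e-10, max_iter=None):
    """Preconditioned conjugate gradients for A x = b, A symmetric positive definite.

    apply_Minv(r) must return M^{-1} r for a symmetric positive definite M.
    Stops when ||r|| <= tol * ||b||. Returns (x, number of iterations).
    """
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b) if x0 is None else np.array(x0, dtype=float)
    if max_iter is None:
        max_iter = 10 * len(b)
    r = b - A @ x
    z = apply_Minv(r)                    # z = M^{-1} r
    s = z.copy()
    rz = r @ z                           # r^T M^{-1} r
    b_norm = np.linalg.norm(b)
    for i in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            return x, i
        t = A @ s
        lam = rz / (s @ t)
        x += lam * s
        r -= lam * t
        z = apply_Minv(r)
        rz_old, rz = rz, r @ z
        s = z + (rz / rz_old) * s
    return x, max_iter

if __name__ == "__main__":
    n = 30
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    d = np.logspace(0.0, 1.0, n)
    A = d[:, None] * T * d[None, :]      # badly scaled SPD matrix D T D
    b = np.ones(n)
    diag = np.diag(A).copy()
    x_pcg, it_pcg = pcg(A, b, lambda r: r / diag)      # diagonal preconditioner
    x_cg, it_cg = pcg(A, b, lambda r: r.copy())        # M = I gives ordinary CG
    print("PCG iterations:", it_pcg, " CG iterations:", it_cg)
```

Passing the preconditioner as a function rather than a matrix reflects the fact that only the action of $M^{-1}$ on a vector is ever needed, for instance via two triangular solves when $M = LL^T$.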
We close with numerical results for the simplest preconditioner: the diagonal preconditioner. Figure 4.5 reproduces the results shown in Figure 4.4, together with the norm of the residual for PCG. An error reduction of $10^{-6}$ occurs with 44 iterations of PCG, as opposed to 80 of CG.

Figure 4.5. Convergence of preconditioned conjugate gradients for solving a finite element system of size 233. On the left 300 iterations are shown, on the right the first 50. Unpreconditioned CG and steepest descents are shown for comparison.
[Figure: two semilogarithmic plots of the norm of the residual versus iterations, with curves labeled SD, CG, and PCG.]

Exercises

1. Let $f : \mathbb{R} \to \mathbb{R}$ be a $C^2$ function with a root $x_*$ such that neither $f'$ nor $f''$ has a root. Prove that Newton's method converges to $x_*$ for any initial guess $x_0 \in \mathbb{R}$.
2. Consider the 2 × 2 system of nonlinear equations
$$f(x, y) = 0, \qquad g(x, y) = 0, \qquad x, y \in \mathbb{R}.$$
The Jacobi iteration for solving this system beginning from an initial guess $x_0$, $y_0$ is

for i = 0, 1, 2, . . .
    solve f(x_{i+1}, y_i) = 0 for x_{i+1}
    solve g(x_i, y_{i+1}) = 0 for y_{i+1}
end

Thus each step of the iteration requires the solution of 2 scalar nonlinear equations. (N.B.: Of course the method extends to systems of n equations in n unknowns.) If we combine the Jacobi iteration with Newton's method to solve the scalar equations, we get the Newton–Jacobi iteration:
choose initial guess x_0, y_0
for i = 0, 1, . . .
    x_{i+1} = x_i − [∂f/∂x (x_i, y_i)]^{-1} f(x_i, y_i)
    y_{i+1} = y_i − [∂g/∂y (x_i, y_i)]^{-1} g(x_i, y_i)
end

Determine under what conditions this algorithm is locally convergent.
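For experimenting with the iteration before analyzing it, a minimal Python sketch is given below. The function name `newton_jacobi` and the sample system, which has the root (1, 1), are made-up illustrations; the sketch only shows the mechanics of the update and does not by itself settle the question of when the iteration converges.

```python
def newton_jacobi(f, g, fx, gy, x, y, n_iter=50):
    """Newton-Jacobi iteration for f(x, y) = 0, g(x, y) = 0.

    fx and gy are the partial derivatives df/dx and dg/dy.
    Both updates use the values (x, y) from the previous step.
    """
    for _ in range(n_iter):
        x_new = x - f(x, y) / fx(x, y)
        y_new = y - g(x, y) / gy(x, y)    # uses the old x, as in the Jacobi scheme
        x, y = x_new, y_new
    return x, y

if __name__ == "__main__":
    # Hypothetical test system with root (1, 1).
    f = lambda x, y: x**2 + 0.1 * y - 1.1
    g = lambda x, y: 0.1 * x + y**2 - 1.1
    fx = lambda x, y: 2.0 * x
    gy = lambda x, y: 2.0 * y
    print(newton_jacobi(f, g, fx, gy, 1.2, 0.8))
```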

3. The Gauss–Seidel iteration for a 2 × 2 system of nonlinear equations differs from the Ja-
cobi iteration in that the equation determining yi+1 is g(xi+1 , yi+1 ) = 0. Formulate the
Newton–Gauss–Seidel iteration, determine conditions under which it is locally convergent,
and compare the conditions to those for the Newton–Jacobi iteration.
4. Recall that in Broyden's method we update a matrix $B$ to obtain a matrix $\tilde B$ which satisfies $\tilde B s = v$ for given vectors $s \ne 0$, $v$. Show that $\tilde B$ is the closest matrix to $B$ which satisfies this equation, that is, that $\|\tilde B - B\| \le \|\bar B - B\|$ for any matrix $\bar B$ satisfying $\bar B s = v$, where the norm is the matrix 2-norm. Show that the same result holds if the norm is the Frobenius norm, which is defined by $\|A\| = (\sum_{i,j} a_{ij}^2)^{1/2}$, and that in this case $\tilde B$ is the unique nearest matrix to $B$ satisfying the desired equation.
5. Consider a system of n equations in n unknowns consisting of m linear equations and n − m nonlinear equations
$$Ax - b = 0, \qquad g(x) = 0, \qquad A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m,\ g : \mathbb{R}^n \to \mathbb{R}^{n-m}.$$
Let $x_0, x_1, \ldots$ be the sequence of iterates produced by Newton's method. Show that all the iterates after the initial guess satisfy the linear equations exactly. Show the same result is true when the $x_i$ are determined by Broyden's method with $B_0$ chosen to be $F'(x_0)$.
6. Prove that if $A$ is a symmetric positive-definite matrix with eigenvalues $\rho_1, \ldots, \rho_n$, and $p$ is a polynomial, then $\|p(A)\|_A = \max_{1 \le j \le n} |p(\rho_j)|$.

7. Prove that for the conjugate gradient method the search directions $s_i$ and the errors $e_i := x_* - x_i$ satisfy $s_i^T e_{i+1} \ge 0$ (in fact $s_i^T e_j \ge 0$ for all $i$, $j$). Use this to show that the $l^2$-norm of the error $\|e_i\|$ is a non-increasing function of $i$.
8. We analyzed preconditioned conjugate gradients, with a symmetric positive definite preconditioner $M$, as ordinary conjugate gradients applied to the problem $M^{-1}Ax = M^{-1}b$ but with the $M$-inner product rather than the $l^2$-inner product in $\mathbb{R}^n$. An alternative approach which doesn't require switching inner products in $\mathbb{R}^n$ is to consider the ordinary conjugate gradient method applied to the symmetric positive definite problem $(M^{-1/2}AM^{-1/2})z = M^{-1/2}b$, for which the solution is $z = M^{1/2}x$. Show that this approach leads to exactly the same preconditioned conjugate gradient algorithm.
9. The Matlab command A=delsq(numgrid('L',n)) is a quick way to generate a symmetric positive definite sparse test matrix: it is the matrix arising from the 5-point finite difference approximation to the Laplacian on an L-shaped domain using an n × n grid (e.g., if n = 40, A will be a 1,083 × 1,083 sparse matrix with 5,263 nonzero elements and a condition number of about 325). Implement the conjugate gradient algorithm for the system Ax = b for this A (and an arbitrary vector b, e.g., all 1's). Diagonal preconditioning does no good for this problem. (Why?) Try two other possibilities: tridiagonal preconditioning and incomplete Cholesky preconditioning (Matlab comes equipped with an incomplete Cholesky routine, so you don't have to write your own). Study and report on the convergence in each case.

No ratings yet
Module 6: Solving Ordinary Differential Equations - Initial Value Problems (Ode-Ivps) Section 1: Introduction
9 pages
Lecture 22
No ratings yet
Lecture 22
4 pages
Lecture Wise Teaching Plan Name of Faculty: T P Singh Name of Course: B.Tech (Computer Science) Name of Subject: Graph Theory (CMP-XXX)
No ratings yet
Lecture Wise Teaching Plan Name of Faculty: T P Singh Name of Course: B.Tech (Computer Science) Name of Subject: Graph Theory (CMP-XXX)
1 page
James Notes
No ratings yet
James Notes
83 pages
Introduction To Hilbert Spaces. I.: (KS) G. Sparr, A Sparr: "Kontinuerliga System", Studentliterature, Lund (2000)
No ratings yet
Introduction To Hilbert Spaces. I.: (KS) G. Sparr, A Sparr: "Kontinuerliga System", Studentliterature, Lund (2000)
5 pages
Ch01 Lecture Notes On Numerical Analysis of Nonlinear Equations
No ratings yet
Ch01 Lecture Notes On Numerical Analysis of Nonlinear Equations
49 pages
PH PG
No ratings yet
PH PG
6 pages
Tenta17mars PDF
No ratings yet
Tenta17mars PDF
8 pages
A Note On Resolvent Convergence On A Thin Domain, Ricardo Parreira Da Silva
No ratings yet
A Note On Resolvent Convergence On A Thin Domain, Ricardo Parreira Da Silva
8 pages
On The Application of A Newton Raphson'S Iterative Method of The Fixed Point Theory To The Solution of A Chemical Equilibrium Problem
No ratings yet
On The Application of A Newton Raphson'S Iterative Method of The Fixed Point Theory To The Solution of A Chemical Equilibrium Problem
20 pages
Lesson in Indefinite Integrals
No ratings yet
Lesson in Indefinite Integrals
3 pages
Methodos Numericos
No ratings yet
Methodos Numericos
34 pages
4 Iterative Methods: 4.1 What A Two Year Old Child Can Do
No ratings yet
4 Iterative Methods: 4.1 What A Two Year Old Child Can Do
15 pages
Krylov Subspace Methods For Solving Large Unsymmetric Linear Systems (Saad)
No ratings yet
Krylov Subspace Methods For Solving Large Unsymmetric Linear Systems (Saad)
22 pages
Linear Algebra 2005
No ratings yet
Linear Algebra 2005
3 pages
S3 4 PDF
No ratings yet
S3 4 PDF
28 pages
Separable Equations & Linear Equations: This Report Is Submitted To Fulfill The Requirements of The Numerical Analysis
No ratings yet
Separable Equations & Linear Equations: This Report Is Submitted To Fulfill The Requirements of The Numerical Analysis
9 pages
1 The Bisection Method
No ratings yet
1 The Bisection Method
20 pages
7.6 Second and Third Order Determinants
No ratings yet
7.6 Second and Third Order Determinants
5 pages
G8 Functions and Relations
No ratings yet
G8 Functions and Relations
18 pages
Uniform Boundedness (Gliding Hump)
No ratings yet
Uniform Boundedness (Gliding Hump)
6 pages
Solution Set 6: To Some Problems Given For TMA4230 Functional Analysis
No ratings yet
Solution Set 6: To Some Problems Given For TMA4230 Functional Analysis
2 pages
Uniform Convergence Malik Arora
No ratings yet
Uniform Convergence Malik Arora
42 pages
The Adjacency Matrix
No ratings yet
The Adjacency Matrix
4 pages
Functional Analysis Week03 PDF
No ratings yet
Functional Analysis Week03 PDF
16 pages
Differential Forms
From Everand
Differential Forms
Henri Cartan
5/5 (2)
Lectures on Integral Equations
From Everand
Lectures on Integral Equations
Harold Widom
3.5/5 (1)
A Short Course in Automorphic Functions
From Everand
A Short Course in Automorphic Functions
Joseph Lehner
No ratings yet
A-level Maths Revision: Cheeky Revision Shortcuts
From Everand
A-level Maths Revision: Cheeky Revision Shortcuts
Scool Revision
3.5/5 (8)
Algebraic Equations
From Everand
Algebraic Equations
Demetrios P. Kanoussis
No ratings yet


$\|x_1 - x^*\| \le 2\beta\,(\gamma\|x_0 - x^*\|/2 + \delta)\,\|x_0 - x^*\| \le 2\beta\,(\gamma/2 + 1)\,\delta\,\|x_0 - x^*\| \le \|x_0 - x^*\|/2.$

From the above reasoning we get

Some examples of quasi-Newton methods:

Theorem 4.9. Given vectors $s \ne 0$, $v$ in $\mathbb{R}^n$, $C \in \mathbb{R}^{n \times n}$, there is a unique matrix

Remark. For example, $B_0$ can be taken to be $F'(x_0)$. In one dimension, Broyden’s

Combining this expression and (4.1) we see that

8. Line search methods

choose initial iterate $x_0$

choose initial iterate $x_0$

Figure 4.1. Convergence of steepest descents with a quadratic cost function.
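
The behaviour in the figure can be reproduced with a few lines of code. The following is a minimal sketch, assuming the quadratic cost $F(x) = \tfrac12 x^T A x - b^T x$ with $A$ symmetric positive definite and an exact line search along the negative gradient. The helper names `matvec`, `dot` and `steepest_descent`, the tolerance and the iteration cap are illustrative choices, not taken from the notes.

```python
def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def steepest_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    """Minimize F(x) = x^T A x / 2 - b^T x for symmetric positive definite A.

    Returns the final iterate and the number of steps taken.
    """
    x = list(x0)
    for k in range(max_iter):
        # The negative gradient of F at x is the residual r = b - A x.
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        rr = dot(r, r)
        if rr ** 0.5 <= tol:
            return x, k
        # Exact line search: lam minimizes F(x + lam * r) over lam.
        lam = rr / dot(r, matvec(A, r))
        x = [xi + lam * ri for xi, ri in zip(x, r)]
    return x, max_iter
```

Calling `steepest_descent([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])` returns an approximation to the solution of $Ax = b$ together with the step count; making $A$ more ill-conditioned should visibly increase that count.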

Remark. If there is only one critical point $x^*$ of $F$ in the region $\{\, x \in \mathbb{R}^n : F(x) \le$

$\|F'(x_{i+1}) - F'(x_i)\|\,\|s_i\|$

$-F'(x_i)s_i = F'(x_i)F''(x_i)^{-1}F'(x_i)^T$

and (H1) follows from the last two estimates.

For (H3), we note that

Thus $x_1 = x_0 + \lambda_0 s_0$ minimizes $F$ over the 1-dimensional affine space $x_0 + \operatorname{span}[s_0]$, and

Lemma 4.16. 1. $W_i = \operatorname{span}[s_0, \ldots, s_{i-1}] = \operatorname{span}[r_0, \ldots, r_{i-1}]$.

$s_i^T r_j - s_i^T r_0 = s_i^T A(x_0 - x_j) = 0,$

In view of part 5 of the lemma, we can simplify

choose initial iterate $x_0$, set $s_0 = r_0 = b - Ax_0$
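
The initialization above starts the conjugate gradient iteration for $Ax = b$ with $A$ symmetric positive definite. As a hedged illustration, the sketch below follows the standard textbook form of the method rather than reproducing the algorithm box from these notes. The function and helper names, the tolerance and the default iteration cap are assumptions made for the example.

```python
def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(A, b, x0, tol=1e-12, max_iter=None):
    """Solve A x = b for symmetric positive definite A.

    Returns the final iterate and the number of steps taken.
    """
    x = list(x0)
    # Initial residual and initial search direction: s0 = r0 = b - A x0.
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    s = list(r)
    rr = dot(r, r)
    n_steps = 10 * len(b) if max_iter is None else max_iter
    for k in range(n_steps):
        if rr ** 0.5 <= tol:
            return x, k
        As = matvec(A, s)          # the only matrix-vector product per step
        lam = rr / dot(s, As)      # first dot product: exact step length
        x = [xi + lam * si for xi, si in zip(x, s)]
        r = [ri - lam * yi for ri, yi in zip(r, As)]
        rr_new = dot(r, r)         # second dot product
        # New direction: residual plus a multiple of the old direction.
        s = [ri + (rr_new / rr) * si for ri, si in zip(r, s)]
        rr = rr_new
    return x, n_steps
```

In this form each pass through the loop uses one product of a vector with $A$ and two dot products. In exact arithmetic the iteration terminates after at most $n$ steps for an $n \times n$ system; the cap of `10 * len(b)` is only a safeguard against rounding effects.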

At each step we have to perform one multiplication of a vector by A, two dot-products,

choose initial iterate x

$W_i = \operatorname{span}[r_0, Ar_0, \ldots, A^{i-1}r_0] = \{\, p(A)r_0 \mid p \in P_{i-1} \,\} = \{\, q(A)(x^* - x_0) \mid q \in P_i,\ q(0) = 0 \,\}.$

Combining the last three equations, we get

Figure 4.3. The quintic polynomial equal to 1 at 0 with the smallest $L^\infty$

which is linear convergence with rate

Figure 4.4. Convergence of conjugate gradients for solving a finite element

choose initial iterate $x_0$, set $s_0 = \bar r_0 = M^{-1}b - M^{-1}Ax_0$

choose initial iterate $x_0$, set $r_0 = b - Ax_0$, $s_0 = M^{-1}r_0$
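
The second initialization above, which works with the residual $r_0$ of the original system and applies $M^{-1}$ only to residuals, can be sketched as follows. This is the standard textbook form of preconditioned conjugate gradients, not a copy of the algorithm box in these notes. The callable `apply_Minv`, which returns $M^{-1}r$ for a given vector $r$, is a hypothetical interface chosen for the example, as are the other names and default values.

```python
def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def preconditioned_cg(A, b, x0, apply_Minv, tol=1e-12, max_iter=None):
    """Solve A x = b for symmetric positive definite A and preconditioner M.

    apply_Minv(r) must return M^{-1} r, with M symmetric positive definite.
    Returns the final iterate and the number of steps taken.
    """
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]   # r0 = b - A x0
    z = apply_Minv(r)
    s = list(z)                                        # s0 = M^{-1} r0
    rz = dot(r, z)
    n_steps = 10 * len(b) if max_iter is None else max_iter
    for k in range(n_steps):
        if dot(r, r) ** 0.5 <= tol:
            return x, k
        As = matvec(A, s)
        lam = rz / dot(s, As)
        x = [xi + lam * si for xi, si in zip(x, s)]
        r = [ri - lam * yi for ri, yi in zip(r, As)]
        z = apply_Minv(r)          # one preconditioner solve per step
        rz_new = dot(r, z)
        s = [zi + (rz_new / rz) * si for zi, si in zip(z, s)]
        rz = rz_new
    return x, n_steps
```

A simple choice for experiments is the diagonal (Jacobi) preconditioner, for example `apply_Minv = lambda r: [r[0] / 4.0, r[1] / 3.0]` for the matrix `[[4.0, 1.0], [1.0, 3.0]]`. Whether such a preconditioner helps depends on the problem.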

the preconditioned equation we immediately have an error estimate:

Figure 4.5. Convergence of preconditioned conjugate gradients for solving a

Determine under what conditions this algorithm is locally convergent.
