Numerical Methods For Nonlinear Equations
Václav Kučera
Prague, 2022
To Monika and Terezka
for their endless love and support
These materials have been produced within and supported by the project “Increasing
the quality of education at Charles University and its relevance to the needs of the
labor market” kept under number CZ.02.2.69/0.0/0.0/16 015/0002362.
Contents
1 Nonlinear equations 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Mathematical background . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Rate of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Scalar equations 8
2.1 Bisection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Fixed point iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Cobweb plots . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 What could possibly go wrong? . . . . . . . . . . . . . . . . . 19
2.3.2 Stopping criteria . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Secant method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.1 Approximation of the derivative in Newton’s method . . . . . 30
2.4.2 Secant method . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5 Various more sophisticated methods . . . . . . . . . . . . . . . . . . . 39
2.5.1 Methods based on quadratic interpolation . . . . . . . . . . . 40
2.5.2 Hybrid algorithms . . . . . . . . . . . . . . . . . . . . . . . . 41
3 Systems of equations 44
3.1 Tools from differential calculus . . . . . . . . . . . . . . . . . . . . . . 45
3.1.1 Fixed point iteration . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Newton’s method in C . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.1 Affine invariance and contravariance . . . . . . . . . . . . . . 48
3.3.2 Local quadratic convergence of Newton’s method . . . . . . . 50
3.3.3 Variations on Newton’s method . . . . . . . . . . . . . . . . . 52
3.4 Quasi-Newton methods . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Continuation methods . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 1
Nonlinear equations
1.1 Introduction
These lecture notes represent a brief introduction to the topic of numerical methods
for nonlinear equations. Sometimes the term ‘nonlinear algebraic equations’ can be
found in the literature, which evokes polynomial equations. This is however not
the case, the term ‘algebraic’ is used to distinguish our equations from nonlinear
differential equations, which is a completely different world. In our case, we shall
look for roots of a given function f , which will be a general, at least continuous
function. Here ‘roots’ means points where the function f attains the value zero.
From the historical perspective, originally the focus was on finding roots, ideally
of polynomial equations of higher and higher degree. However, in the past century,
the focus shifted from roots to fixed points. From the point of view of modern
mathematics, there is nothing special about a root: it is just a point where f
attains a certain value, zero, which we consider 'special', but we might as well have
chosen 1 or π or −17 as the special value. On the other hand, a fixed point is a
topological invariant and therefore many powerful tools of modern mathematics can
be used to prove nonconstructive existence of fixed points. As we shall see later on,
from the viewpoint of numerical methods for nonlinear (algebraic) equations, fixed
points are the natural perspective.
If we want to look more closely at the history of root finding, polynomial equa-
tions are very interesting. Formulas for the roots of general quadratic, cubic and
quartic equations were known by the first half of the 16th century. Then the race to
find similar formulas for the quintic (5th degree) equation began. Three centuries
later, the Abel-Ruffini theorem was proved in 1824, which states that the roots of a
general quintic equation cannot be expressed in terms of radicals, i.e. finite combi-
nations of +, −, ·, /, and n-th roots. Galois theory is the modern viewpoint and one can show
e.g. that the real root of x^5 − x − 1 = 0 is not expressible using radicals over the
rational numbers. However, the key here is to define exactly what is meant by ‘for-
mula for the roots’. Abel and Ruffini say that radicals are hopeless for quintic and
higher equations. Radicals rely on our ability to extract the n-th root, i.e. for n = 5
solve the auxiliary equation x^5 − a = 0 for a given a. However, if one admits the use
of so-called ultraradicals (Bring radicals), where one solves the modified equation
x^5 + x − a = 0 for a given a, then one can solve any quintic equation. Another
approach to obtain formulas for general quintic roots is through the use of elliptic
modular functions. Felix Klein gives an elegant solution using the symmetry group
of the icosahedron. Later others came and solved the case of general equations of
degree 7, then 11, etc. using more and more special functions. The final chapter was
written by Hiroshi Umemura in 1984, when he published formulas for roots of gen-
eral polynomial equations of arbitrary degree. If he had done so in 1850, he would
have been celebrated as one of the greatest mathematicians of all time. But today
his result is a practically unknown curiosity sometimes mentioned in a footnote.
The reason is that he expresses the roots using so-called Siegel modular functions,
which are defined as infinite series of matrix exponentials. Nobody knows how to
practically evaluate or even reasonably approximate these functions. So from the
point of view of practice, the formulas are completely useless. Also from the point
of view of theory, they bring nothing new. The idea that you can solve any problem
analytically if you define more and more complicated special functions with special
properties that nobody knows how to evaluate belongs in the 19th century. Modern
mathematics uses other tools to show existence, uniqueness, desired properties, etc.
indirectly or non-constructively. If we want to know what the value is numerically,
we turn to numerical mathematics instead of formulas.
In this textbook, we shall describe and analyze the basic numerical methods for
nonlinear equations and their systems. We will spend a lot of time in the seemingly
simple case of 1D, i.e. scalar nonlinear equations. The reason is that the mathe-
matical tools and the proofs are more intuitive and can often be demonstrated using
pictures. This is not the case of systems of equations, where the theory becomes
much more technical. Bear in mind that the literature concerning numerics for non-
linear equations is huge, with hundreds of methods being developed and analyzed.
Here we barely scratch the surface.
$$ x_{n+1} = G(x_n) \quad\longrightarrow\quad x^* = G(x^*), $$
century and 20th century mathematics – while roots and fixed points are fundamen-
tally connected, the latter is a more natural notion from the topological viewpoint,
therefore topological tools can be used in proofs, etc. Fixed point theorems are one
of the basic tools in mathematical analysis.
For completeness, we shall state several basic theorems from mathematical anal-
ysis concerning the existence of fixed points or roots.
Theorem 1 (Banach fixed point theorem). Let X be a complete metric space and
let G : X → X be a contraction, i.e. there exists α ∈ [0, 1) s.t. ‖G(x) − G(y)‖ ≤
α‖x − y‖ for all x, y ∈ X. Then there exists a unique fixed point x∗ ∈ X of G.
Moreover, let the sequence xn ∈ X, n ∈ N0 be defined by xn+1 = G(xn) for arbitrary
x0 ∈ X. Then xn → x∗ and we have the estimate
$$ \|x_n - x^*\| \le \frac{\alpha^n}{1-\alpha}\,\|x_1 - x_0\|. \qquad (1.1) $$
Proof. We proceed in several steps:
1. The entire sequence {xn} is well defined, since x0 ∈ X and G maps X to X.
2. For arbitrary m ∈ N, we can estimate by induction
$$ \|x_m - x_{m-1}\| = \|G(x_{m-1}) - G(x_{m-2})\| \le \alpha\|x_{m-1} - x_{m-2}\| \le \ldots \le \alpha^{m-1}\|x_1 - x_0\|. $$
3. {xn} is a Cauchy sequence: let m ≥ n, then by the previous estimate
Lemma 2. Let I ⊂ R be a closed interval. Let g ∈ C¹(I) and let there exist
α ∈ [0, 1) such that |g'(x)| ≤ α for all x ∈ I. Then g is a contraction on I.
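To make Theorem 1 and Lemma 2 concrete, here is a minimal sketch (in Python; not part of the original text) of the fixed point iteration for g(x) = cos(x) on I = [0, 1], where |g'(x)| ≤ sin(1) < 1, together with the a priori estimate (1.1); the 30-th iterate serves as a stand-in for the exact fixed point.

import math

def fixed_point_iteration(g, x0, n_steps):
    """Iterate x_{n+1} = g(x_n) and return the whole sequence."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(g(xs[-1]))
    return xs

# g(x) = cos(x) maps [0, 1] into itself and |g'(x)| = |sin(x)| <= sin(1) =: alpha < 1.
alpha = math.sin(1.0)
xs = fixed_point_iteration(math.cos, 0.5, 30)
x_star = xs[-1]                      # good proxy for the true fixed point
for n in (1, 5, 10, 20):
    a_priori = alpha**n / (1.0 - alpha) * abs(xs[1] - xs[0])   # estimate (1.1)
    print(n, abs(xs[n] - x_star), a_priori)

The printed actual errors stay below the bound (1.1), which is rather pessimistic here, since α = sin(1) is a worst-case estimate on I.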
Exercises
which (ideally) converges to the true solution x∗ . In order to compare these methods
among themselves, it is useful to measure how fast such sequences converge.
Definition 8 (Rate of convergence). Let {xn} ⊂ R^N be a given sequence which
converges to x∗. We say that the sequence has convergence rate (or order) p ≥ 1
if there exists C > 0 such that
$$ \lim_{n\to\infty} \frac{\|x_{n+1} - x^*\|}{\|x_n - x^*\|^p} = C. \qquad (1.3) $$
We say that a numerical method has order p, if all sequences generated by the method
have convergence rate at least p for the considered problem and set of initial condi-
tions, and at least one such sequence has convergence rate exactly p.
• p = 2 – Quadratic convergence.
where the supremum is taken over all sequences generated by the method. This
rather technical definition avoids the subtle issue with (1.3) concerning special cases,
where there does not exist a p such that the limit in (1.3) is a nonzero finite constant
C. This is similar to the case of superlinear convergence, as defined above, where
C = 0 for p = 1. We do not wish to go too deeply into this subject; we refer to
[OR70], where a whole chapter is devoted to the study of (1.4) and its relation to a
similar concept – the R-factor (root factor).
We conclude with several remarks:
• Definition 8 only takes into account how much the error decreases from itera-
tion to iteration. It does not take into account how expensive or cheap each
iteration is. We will address this question when comparing Newton’s method
with the secant method on page 38. Newton’s method is quadratically con-
vergent, while the secant method has convergence rate approximately 1.618.
From the viewpoint of Definition 8, Newton is the clear winner. However each
iteration is twice as expensive as in the secant method (in terms of function
evaluations). Taking this into account, it is not that clear which of the methods
is the winner.
Chapter 2
Scalar equations
In this chapter, we will be dealing with scalar nonlinear equations. We will consider
the equation f (x) = 0 with the root x∗ , where f will be at least continuous on some
interval containing x∗ . We will distinguish between three basic types of methods:
its values at the endpoints of an interval. Therefore, if f(a)f(b) ≤ 0 then there
exists a root between a and b. Given a, b, we take the midpoint m = ½(a + b) and,
based on the sign changes of f(a), f(b) and f(m), we choose either [a, m] or [m, b]
as the new smaller interval containing x∗. Written as an algorithm:
Given I0 = [a0, b0] and a tolerance tol. Set n = 0.
While (bn − an) > tol:
    mn = ½(an + bn).
    If f(an)f(mn) ≤ 0
        an+1 := an, bn+1 := mn,
    else
        an+1 := mn, bn+1 := bn.
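A direct transcription of the algorithm into code might look as follows (a Python sketch; the tolerance and the test function are arbitrary choices, the test equation being x^3 − 2x − 5 = 0 from the historical note in Section 2.3).

def bisection(f, a, b, tol=1e-10):
    """Bisection method: f must be continuous with f(a)*f(b) <= 0."""
    if f(a) * f(b) > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    while (b - a) > tol:
        m = 0.5 * (a + b)
        if f(a) * f(m) <= 0:
            b = m            # the root lies in [a, m]
        else:
            a = m            # the root lies in [m, b]
    return 0.5 * (a + b)

print(bisection(lambda x: x**3 - 2*x - 5, 2.0, 3.0))   # approx. 2.0945514815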
as in the scalar case. The question is, whether they are useful. Consider a
system of 100 equations for 100 unknowns. You start with an initial hypercube
I0 ⊂ R^100, perform bisection and test which one of the smaller cubes contains
a root. In R, splitting an interval in half gives two sub-intervals. However, in
R^100, splitting the edges of a cube in half results in 2^100 sub-cubes. This is
approximately 10^30 sub-cubes in each iteration, each of which must be tested
to see whether it contains a root – just to gain 'one bit of information'. It is clear that
such methods can in principle be practical only for very small systems. They
are completely out of the question, for example, for finite element solvers
for nonlinear partial differential equations where one must solve millions of
equations for millions of unknowns.
where the terms in the sum are zero due to the assumption g^(j)(x∗) = 0 for j =
1, . . . , p − 1. It follows that
$$ \lim_{n\to\infty} \frac{|x_{n+1} - x^*|}{|x_n - x^*|^p} = C > 0. \qquad (2.2) $$
Let there exist some J ∈ {1, . . . , p − 1} such that g^(J)(x∗) ≠ 0 and let J be the
smallest index with this property. Then by the Taylor expansion, we have
$$ x_{n+1} - x^* = g(x_n) - g(x^*) = \underbrace{\sum_{j=1}^{J-1} \frac{g^{(j)}(x^*)}{j!}(x_n - x^*)^j}_{=0} + \frac{g^{(J)}(\xi_n)}{J!}(x_n - x^*)^J, $$
therefore
$$ \lim_{n\to\infty} \frac{|x_{n+1} - x^*|}{|x_n - x^*|^p} = \lim_{n\to\infty} \frac{|g^{(J)}(\xi_n)|}{J!}\cdot\frac{|x_n - x^*|^J}{|x_n - x^*|^p} = \infty, $$
since p > J, |xn − x∗| → 0 and |g^(J)(ξn)| → |g^(J)(x∗)| > 0. This is a contradiction
with the finiteness of the limit due to (2.2).
We conclude with several remarks:
• Here we were concerned with one-point iterative processes, which means that
xn+1 depends only on the previous iterate via (2.1). We have seen that given
sufficient regularity of g, such iterative processes (methods) can have only
integer convergence rates. For multi-point methods, where xn+1 depends on
several previous iterates (e.g. xn+1 = g(xn , xn−1 )), we can have general real
convergence rates. We will see this in the secant method.
• There exist general 'recipes' for obtaining a suitable g with arbitrarily high
convergence rate p. These are the so-called Householder methods. These
methods are, however, considered impractical, since they need to evaluate
derivatives of f up to order p − 1 and the complexity of the resulting formulas
grows rapidly with growing p.
Exercises
This function maps the interval [0, 1] onto itself and iterating this function leads to
surprisingly complex and chaotic behavior, as can be seen from the cobweb plot in
Figure 2.2. As usual define the iterative procedure xn+1 = g(xn ) for a chosen x0 . We
call x0 a periodic point of period P , if xP = x0 , which implies that xn+P = xn
for all n. Show that for the tent map defined above, the set of periodic points of all
periods is dense in the interval [0, 1]. This means that even the smallest change in
x0 will dramatically change the behavior of the resulting sequence {xn }.
Hint: To find a fixed point of g (which is a periodic point with period P = 1),
draw the graph of g and a graph of the identity function f (x) = x and look where
they intersect. Similarly, to find points of period P = 2, draw the graph of g ◦ g and
look at the intersections with f (x) = x. And so on. The result can be seen from
realizing what the graphs of g ◦ g ◦ . . . ◦ g look like.
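A short numerical experiment (a Python sketch; the starting value and the perturbation 1e-10 are arbitrary choices) illustrates this sensitivity: two initial values that agree to ten decimal places produce completely different trajectories after a few dozen iterations of the tent map.

def tent(x):
    """Tent map g: [0, 1] -> [0, 1]."""
    return 2.0 * x if x <= 0.5 else 2.0 * (1.0 - x)

x, y = 0.2, 0.2 + 1e-10          # two nearby starting points
for n in range(41):
    if n % 10 == 0:
        print(n, x, y, abs(x - y))
    x, y = tent(x), tent(y)
# The tiny initial difference roughly doubles in each iteration and is of
# order one after about 35 iterations.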
usually Newton followed by one or more names from the list Raphson, Simpson,
Fourier and Kantorovich (in Banach spaces). This is due to historical reasons which
we shall mention later on.
The basic derivation is as follows: Given x0 , we do a first order Taylor expansion:
We intuitively expect that if the neglected remainder R was small enough then the
right-hand side of (2.3) could be a better approximation of x∗ than x0 . Thus we
denote it as x1 and apply the procedure iteratively.
$$ 0 = x^* + \underbrace{\frac{f(x_n)}{f'(x_n)} - x_n}_{=-x_{n+1}} + \frac{1}{2}\,\frac{f''(\xi_n)}{f'(x_n)}(x^* - x_n)^2, $$
hence
$$ x_{n+1} - x^* = \frac{1}{2}\,\frac{f''(\xi_n)}{f'(x_n)}(x^* - x_n)^2. $$
If we denote the error of the method as en = x∗ − xn, we obtain the error equation
for Newton's method:
$$ e_{n+1} = -\frac{f''(\xi_n)}{2f'(x_n)}\,e_n^2. \qquad (2.4) $$
Convergence: Since f ∈ C²(U) and f'(x∗) ≠ 0, we have that the factor from (2.4) is
uniformly bounded on some neighborhood V of x∗: there exists M > 0 such that
$$ \left|\frac{f''(x)}{2f'(y)}\right| \le M \quad \text{for all } x, y \in V. $$
Now if we choose x0 ∈ V close enough to x∗ so that M|e0| ≤ 1/2, we get from (2.4)
We note that the basis of the proof is the derivation of an identity relating the
current error to the previous error(s) – the error equation (or inequality). From this
we deduce convergence and the order of convergence. This will be the case in the
proofs of similar theorems for the other methods we shall consider.
Now we give an improved version of Theorem 14 under weaker assumptions. The
main difference is that instead of f ∈ C², we assume f' to be Lipschitz continuous.
This is a seemingly trivial difference which however becomes important in R^N and
in Banach spaces, where f'' is a more complicated object than f'.
Remark 15. If a function is Lipschitz continuous on a set U, then its derivative
exists almost everywhere in U and the derivative is in the Lebesgue space L∞(U).
Therefore, assuming f' to be Lipschitz continuous means that f'' exists almost ev-
erywhere and is in L∞. Compare this to Theorem 14, where f'' exists everywhere
and is a continuous function. In other words, the difference is assuming f ∈ C²
versus f ∈ W^{2,∞}.
First, we need a substitute lemma for the remainder of Taylor's polynomial under
the weaker assumptions. Here by "γ-Lipschitz" we mean "Lipschitz continuous with
Lipschitz constant γ".
Lemma 16 (Substitute for Taylor). Let f : (a, b) → R be such that f' is γ-Lipschitz.
Then for all x, y ∈ (a, b)
$$ \bigl|f(y) - f(x) - f'(x)(y - x)\bigr| \le \tfrac{1}{2}\gamma(y - x)^2. \qquad (2.6) $$
Proof. Simple substitution gives us
$$ f(y) - f(x) = \int_x^y f'(z)\,dz = \left[\begin{array}{c} z = x + t(y - x) \\ dz = (y - x)\,dt \\ (0,1)\to(x,y) \end{array}\right] = \int_0^1 f'\bigl(x + t(y - x)\bigr)(y - x)\,dt, $$
therefore, due to the γ-Lipschitz continuity of f', we get
$$ \bigl|f(y) - f(x) - f'(x)(y - x)\bigr| = \left|\int_0^1 \bigl[f'\bigl(x + t(y - x)\bigr) - f'(x)\bigr](y - x)\,dt\right| \le |y - x|\int_0^1 \gamma t\,|y - x|\,dt = \tfrac{1}{2}\gamma|y - x|^2. $$
We note that the right-hand side of (2.6) is essentially an estimate of the re-
mainder of a first order Taylor polynomial. If we take e.g. the Lagrange form of the
remainder ½f''(ξ)(y − x)², we can estimate it by ½ max|f''|(y − x)². In Lemma 16,
we have the Lipschitz constant γ of f' instead of max|f''|, which is closely related
but weaker, as explained in Remark 15.
$$ x_{n+1} - x^* = x_n - \frac{f(x_n)}{f'(x_n)} - x^* = \frac{1}{f'(x_n)}\Bigl(\underbrace{f(x^*)}_{=0} - f(x_n) - f'(x_n)(x^* - x_n)\Bigr). \qquad (2.7) $$
Since f'(x∗) ≠ 0 and f' is continuous, there exists ρ > 0 such that |f'(x)| ≥ ρ > 0
for all x ∈ U. Using this estimate and Lemma 16 with x = xn, y = x∗, we can
estimate (2.7) on U:
$$ |x_{n+1} - x^*| \le \frac{1}{\rho}\cdot\frac{\gamma}{2}\,(x_n - x^*)^2. $$
This is the analogue of error equation (2.4) from which we derived quadratic conver-
gence of the method. The rest of the proof thus continues similarly as in Theorem
14 and we shall omit it.
There is another type of theorem concerning the convergence of Newton’s method
other than the local fixed point view from the previous theorems. The theorem gives
more global convergence under more restrictive assumptions of convexity (concavity)
of f . The statement and proof of the theorem can be easily seen in Figure 2.3, its
formalization is only a technical issue.
Theorem 18 (Fourier). Let f ∈ C²[a, b] with f(a)f(b) < 0 such that f'' does not
change sign in [a, b]. Let x0 be such that f(x0) has the same sign as f''. Then for
Newton's method xn → x∗ monotonically.
Proof. The formal proof simply follows the intuition from Figure 2.3, however it
is somewhat lengthy and has little ‘added value’, thus we rather move on to more
important topics. We refer the interested reader to [Seg00].
Historical note
We end this section with a short overview of the history of Newton’s method. In
1669, Newton considered the equation
$$ x^3 - 2x - 5 = 0 \qquad (2.8) $$
which has a root in the interval (2, 3) due to opposite signs of the polynomial at
these points. Newton wrote the root as x∗ = 2 + p which he substituted into (2.8)
to obtain the new equation
$$ p^3 + 6p^2 + 10p - 1 = 0. \qquad (2.9) $$
Newton assumed that p is small, hence the higher order terms p³ + 6p² ≈ 0 are very
small and he neglected them. Thus equation (2.9) reduces to 10p − 1 = 0 with the
solution p = 0.1. Next, Newton wrote p = 0.1 + q, substituted into (2.9) to obtain
the equation
$$ q^3 + 6.3q^2 + 11.23q + 0.061 = 0. \qquad (2.10) $$
Again, Newton neglected the quadratic and cubic terms to obtain 11.23q +0.061 = 0
with the solution q = −0.0054. This procedure is then applied iteratively, writing
q = −0.0054 + r and substituting into (2.10), and so on. We note that already the
second approximation is 2 + p + q = 2.0946, which is a good approximation of the
true root 2.094551482... Newton then proceeds to use this procedure on Kepler’s
equation x − e sin(x) = M from astronomy which is the relation between mean
anomaly M and eccentric anomaly x. Newton used the technique he developed for
polynomials by taking the Taylor expansion of sin x. He noticed what we now call
quadratic convergence of the iterates.
In 1690, Joseph Raphson improved the procedure by substituting the corrections
back into the original equation instead of producing a new equation in each iterate.
The advantage is that one saves a lot of work when performing the calculations
by hand, since one can reuse some of the already computed quantities. Raphson
thought he invented a new method, although in fact it is equivalent to Newton’s
original approach.
Notice that up to now, there is no mention of derivatives in the procedure. These
were introduced into the linearization process by Thomas Simpson in 1740. Finally,
it was Joseph Fourier, who wrote down Newton’s method as we know it today. One
can see that the development of the method was not straightforward and that it is,
ironically, hard to recognize Newton’s method in the tedious procedure that Newton
himself used. It is for these reasons that the method is sometimes called Newton,
Newton-Raphson, Newton-Raphson-Simpson, Newton-Raphson-Simpson-Fourier or
some other combination of the mentioned names, possibly including Kantorovich in
Banach spaces.
Exercises
• If you start from the equation x² = A and apply Newton's method, you end
up with the so-called Babylonian method for computing square roots which
was derived geometrically in ancient Babylonia some 3500 years ago and redis-
covered 2000 years ago in ancient Greece (Hero’s method). As a challenge, try
deriving the resulting iterative procedure using only simple geometry, without
any symbolic calculations or calculus, as was done in Babylonia.
• Another possibility is to start from the equation 1/x² = A to compute 1/√A
and then multiply the final approximation by A. The resulting method has the
advantage that it uses only multiplication without any division. Thus it can
be efficiently implemented (in more sophisticated form, e.g. Goldschmidt's
algorithm) directly in the hardware using logic gates. Both iterations are
sketched in code after this list.
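A minimal sketch of both iterations (Python; A = 2 is an arbitrary test value and the starting guesses are ad hoc): the Babylonian/Hero iteration comes from applying Newton's method to x² − A = 0, the division-free iteration from applying it to 1/x² − A = 0.

A = 2.0   # arbitrary test value

# Babylonian (Hero's) method: Newton's method for x^2 - A = 0,
# which simplifies to x_{n+1} = (x_n + A/x_n) / 2.
x = 1.0
for _ in range(6):
    x = 0.5 * (x + A / x)
print("sqrt(A) via the Babylonian method:   ", x)

# Division-free variant: Newton's method for 1/x^2 - A = 0,
# which simplifies to x_{n+1} = x_n (3 - A x_n^2) / 2 and converges to 1/sqrt(A).
y = 0.7   # reasonable starting guess for 1/sqrt(2)
for _ in range(6):
    y = y * (3.0 - A * y * y) / 2.0
print("sqrt(A) via the division-free method:", A * y)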
3. Divergence to ±∞.
6. Chaotic behavior.
We demonstrate these phenomena in 1D, where one can gain at least some in-
tuition from pictures and where the analysis is not too technical. The situation
becomes much more complicated in RN , where the problems we mention here some-
times render the method unusable, since the neighborhood of (quadratic) conver-
gence is impractically small. We shall discuss the individual points from the list
above in more detail.
which is linear convergence (in fact exactly the same convergence as for the bisec-
tion method). After some experimenting one notices that if we modified Newton's
method to be xn+1 = xn − 2f(xn)/f'(xn) we would have
$$ x_{n+1} = x_n - \frac{2x_n^2}{2x_n} = 0, $$
hence we get x∗ exactly in the first iteration, at least for this particular equation.
More generally, this modification gives x∗ in one iteration for f(x) = a(x − x∗)²,
which is the general quadratic case of a double root:
$$ x_{n+1} = x_n - 2\,\frac{a(x_n - x^*)^2}{2a(x_n - x^*)} = x_n - (x_n - x^*) = x^*. $$
Even more generally, for a root of multiplicity r ∈ N and the model equation a(x −
x∗)^r = 0, the modification of Newton's method xn+1 = xn − r f(xn)/f'(xn) gives x∗
in one iteration:
$$ x_{n+1} = x_n - r\,\frac{a(x_n - x^*)^r}{ra(x_n - x^*)^{r-1}} = x^*. $$
It turns out that this modification works in general, not only for the model problems:
Theorem 19 (Roots with multiplicity). Let x∗ be a root of multiplicity r ∈ N of f,
i.e. f^(j)(x∗) = 0 for j = 0, . . . , r − 1 and f^(r)(x∗) ≠ 0. Let f ∈ C^r(U) where U is
some neighborhood of x∗. Then the modified Newton method
$$ x_{n+1} = x_n - r\,\frac{f(x_n)}{f'(x_n)} $$
converges locally quadratically to x∗.
Proof. The proof is a technical calculation without much added value, thus we omit
it and refer the interested reader to [RR78, Section 8.6].
In the theorem above, one needs to know the multiplicity of x∗ in advance. This
is not always the case; in fact, one might not even notice a priori that x∗ is a root of
higher multiplicity (consider the equation e^x − x − 1 = 0 with x∗ = 0). One universal
solution is to consider the equation f̄(x) := f(x)/f'(x) = 0 instead of f(x) = 0.
Then x∗ is always a root of multiplicity 1 for f̄, irrespective of its multiplicity for f.
One can then apply one's favorite method to the modified equation.
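As a small illustration (a Python sketch with an arbitrary starting point), consider the example e^x − x − 1 = 0 mentioned above, where x∗ = 0 has multiplicity r = 2: plain Newton only converges linearly, while the modified method xn+1 = xn − r f(xn)/f'(xn) with r = 2 restores fast convergence.

import math

def f(x):
    return math.exp(x) - x - 1.0        # double root at x* = 0

def df(x):
    return math.exp(x) - 1.0

x_plain, x_mod = 1.0, 1.0
for n in range(4):
    x_plain = x_plain - f(x_plain) / df(x_plain)        # plain Newton
    x_mod = x_mod - 2.0 * f(x_mod) / df(x_mod)          # modified Newton, r = 2
    print(n + 1, x_plain, x_mod)
# Plain Newton roughly halves the error in each step (linear convergence),
# while the modified method is accurate to many digits after four steps.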
3. Divergence to ±∞.
Now we get to the cases when Newton’s method fails altogether. Among these,
divergence to infinity is the ‘nicest’ because it is at least noticeable very early and
can be easily detected in the implementation, unlike the other cases.
Consider the model equation f(x) := arctan(x) = 0. Since f'(x) → 0 as x →
±∞, one can expect that if we choose x0 sufficiently far away from x∗ = 0, then x1
will be even larger in magnitude due to the division by f'(xn) in Newton's method. This is indeed
the case, as can be seen by a simple analysis or by a simple numerical experiment.
In fact, there exists a neighborhood of x∗ , such that outside of this neighborhood
Newton’s method diverges to ±∞. The size of this neighborhood can be computed,
as we shall do in Exercise 9.
In general, divergence to infinity is usually caused by the denominator f'(xn)
being very close to zero, in which case xn+1 is very large or even causes an overflow.
There are tricks how to overcome this issue, which fall under the general category
of “globalization strategies”. Here we only briefly mention one possible strategy.
Damped Newton method. The idea is to do smaller steps than Newton's
method recommends. In the simplest case, we can write this as
$$ x_{n+1} = x_n - \lambda_n\,\frac{f(x_n)}{f'(x_n)}, \qquad (2.11) $$
where we apply the damping parameter λn ∈ (0, 1], which tells us how much of
the Newton correction to take. Obviously, we should not destroy the quadratic
convergence rate of Newton’s method, once we are sufficiently close to x∗ . Thus we
require λn → 1 when xn → x∗ (or perhaps λn = 1 for sufficiently large n). One
can try to come up with an explicit formula for λn , such as λn = 1/(1 + |f (xn )|)
which tries to use the residual of the equation to measure ‘closeness’ to x∗ . A more
sophisticated idea is to use a backtracking strategy : try (2.11) with λn = 1,
i.e. Newton’s method. If the residual increases, i.e. |f (xn+1 )| ≥ |f (xn )|, we try
λn = 1/2 instead. Again, we test for decrease of the residual and if needed, try
λn = 1/4 instead. We can write this procedure as
Given xn, compute x̃ = xn − f(xn)/f'(xn).
While |f(x̃)| ≥ |f(xn)|:
    x̃ := ½(x̃ + xn).
Set xn+1 := x̃.
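The backtracking procedure might be coded as follows (a Python sketch; the test function arctan(x) is the divergence example discussed above, and x0 = 10 lies well outside the region where plain Newton converges). Two safeguards not present in the pseudocode are added: a stopping test on the residual and a cap on the number of halvings, so that the sketch cannot loop forever once the residual is already zero.

import math

def damped_newton(f, df, x0, tol=1e-12, n_iter=50):
    """Newton's method with the simple residual-based backtracking above."""
    x = x0
    for _ in range(n_iter):
        if abs(f(x)) < tol:                   # residual already small enough
            break
        x_new = x - f(x) / df(x)              # full Newton step
        for _ in range(60):                   # backtrack at most 60 times
            if abs(f(x_new)) < abs(f(x)):
                break
            x_new = 0.5 * (x_new + x)         # halve the step towards x_n
        x = x_new
    return x

# f(x) = arctan(x): plain Newton diverges from x0 = 10, this version converges to 0.
print(damped_newton(math.atan, lambda x: 1.0 / (1.0 + x * x), 10.0))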
The danger of simple procedures like the one above is that the iterates can
converge to a local extreme, instead of the root. This happens because if we start
close to a local extreme, then f 0 (xn ) ≈ 0 and the Newton correction is detected to
be too large. Therefore the step is taken to be much smaller, perhaps so small that
we only get closer to the local extreme and the situation is much worse in the next
iterate.
More sophisticated strategies similar to the one above are important in numerical
optimization, where they fall under the category of line search methods, where much
more sophisticated strategies for the choice of the new iterate x̃ and the necessary
condition it must satisfy are considered.
We note that the backtracking procedure above can be viewed as a rudimentary
hybrid method, which consists of a fast converging method combined with a (ideally)
globally convergent method/strategy. More on these methods in Section 2.5.2.
From the mathematical point of view, it can be shown that the set of 'bad' initial
conditions {x0 : ∃n ∈ N s.t. f'(xn) = 0} is at most countable. In other words, this
set is very small from the point of view of R (measure zero) and it is in general
unlikely to choose such an x0 by chance.
This means that if we start near 0, the sequence from Newton’s method converges
to 0, 1, 0, 1, 0, 1, . . . . Moreover, for this example, the odd/even terms will converge
to 0 or 1 very quickly (quadratically)! It is then hard to dismiss this example of
periodic cycling as a rare thing which we will never encounter – there is a whole
open set of initial values for which Newton converges to a 2-periodic sequence very
quickly. And choosing an x0 from an open set is something that can easily happen
in practice – this is not one isolated point which we will easily avoid.
In general, the situation is much more complicated. The following theorem can
be obtained as a special case of the Sharkovskii or Li & Yorke theorems known from
the theory of chaotic dynamical systems.
It is good to consider the implications of Theorem 21. The function f is not some
'wild' counterexample from measure theory, it is a simple polynomial (degree 4 is
sufficient). Even so, Newton's method has extremely complicated behavior when we
look beyond the neighborhood where we have local quadratic convergence – for any
period we choose, there is an x0 such that Newton will have that period. There will
be a point with period 5 and 17 and 123456789 and 10^365. Moreover, there will also
be preperiodic points x0. This means that xn will first 'jump around' for a while
before settling on e.g. period 17. This means there is a large (although countable)
set of initial values for which Newton's method will (eventually) periodically cycle.
We note that the situation is much more complicated in general, since the phe-
nomena from Theorem 21 and from Lemma 20 can combine. Then we can have also
open sets from which we will converge to periodic cycling with some period.
6. Chaotic behavior.
Even more generally, we may consider the set of all points x0 for which Newton’s
method does not converge to any root, even though we never hit a point where
f 0 (xn ) = 0 (i.e. the whole sequence is well defined). We have the following theorem.
Figure 2.5: Simplest case of the Saari-Urenko theorem (Theorem 23) and its proof.
1 2 1 1 2 1 1 1 2 1 1 1 1 2...,
we get an x0 such that the consecutive iterates x0 , x1 , x2 , . . . lie in the intervals
I1 , I2 , I1 , I1 , I2 , I1 , I1 , I1 , I2 , I1 , I1 , I1 , I1 , I2 , . . . .
It is clear that such a sequence cannot converge to a root. It also cannot be a periodic
sequence and it cannot converge to a periodic sequence, as in Exercise 10, since the
iterates xn jump between the intervals I1 and I2 aperiodically. The only possibility
for convergence would be convergence to the common endpoint α2 . However this is
also not possible, since f 0 (α2 ) = 0, therefore if xn would be too close to α2 , due to
the size of the Newton update the next iterate xn+1 would be very far away, certainly
it would lie outside of the finite intervals I1 and I2 , thus violating the theorem.
One can choose even 'wilder' sequences {Sn} – we might choose a random
sequence. Thus Newton's method will jump around 'randomly' (even though it is
a fully deterministic process). We might choose the sequence to correspond to the
digits of the binary expansion of √2 or π. These are irrational numbers, so again, the
iterates cannot converge to a periodic sequence. How many such possible aperiodic
sequences {Sn} are there? The answer is uncountably many, since the set of
irrational numbers is uncountable.
We call the behavior above chaotic, since the iterates can ‘jump around’ seem-
ingly randomly, never converging to some reasonable behavior. Even so, the theorem
on local quadratic convergence is still valid – there are neighborhoods around the
roots, where Newton will converge. This is typical of chaotic dynamical systems –
the mixing of very regular behavior (e.g. periodicity or convergence) with seemingly
random behavior. There is an entire field of research devoted to this topic.
We note that the proof of Theorem 23 is not difficult and is in fact (relatively)
constructive and intuitive. We indicate the basic idea in Figure 2.5.
Proof of Theorem 23 (basic idea). Consider the simplest case of a polynomial of de-
gree 4, as in Figure 2.5. Let g(x) = x − f(x)/f'(x) be the mapping defining Newton's
method. Since lim_{x→α1+} g(x) = +∞ and lim_{x→α2−} g(x) = −∞, we have g(I1) = R.
Similarly g(I2 ) = R. Therefore, for any point (or interval) in R, we can find its
preimage under g in both I1 and I2 . For example, there exist intervals J1 , J2 ⊂ I1
such that g(J1 ) = I1 and g(J2 ) = I2 . These can easily be found ‘graphically’, as
in Figure 2.5 (depicted in green). Altogether, in our case, if we choose an inter-
val I ⊂ (α1, α3) then g^{−1}(I) consists of two intervals, one of which is in I1, while
the other is in I2. The whole 'trick' of the theorem is choosing which of the two
preimages we take, according to the prescribed sequence.
Assume for example that the sequence {Sn} is 1 2 . . . Taking the first two
symbols, we seek a point x0 ∈ I1 such that x1 = g(x0) ∈ I2. The set of these points
is exactly the aforementioned interval J2. In other words, J2 is one component of
g^{−1}(I2). We would choose the other component (which lies in I2), if the sequence
started 2 2 . . . In general, we consider g^{−1}(g^{−1}(. . . g^{−1}(Ii) . . .)) for i = 1, 2, where we
take the individual preimages from I1 or I2 according to the prescribed sequence.
The successive preimages are smaller and smaller intervals and if one proceeds care-
fully, in the limit one obtains a single point x0 . This is the rough idea behind the
proof.
Exercises
The problem with this (and other) absolute criteria is that it does not take into
account the typical magnitude of the numbers that we are dealing with. If in our
problem we expect xn and x∗ to typically be on the order of 10^10, it is not reasonable
to prescribe the same tolerance ε as if the typical numbers in our problem are on
the level of 10^{−10}. Choosing e.g. ε = 10^{−10} leads to a criterion that may not even
be satisfiable in finite precision arithmetic in the first example, while in the second
example the criterion may be satisfied even if we do not have a single relevant digit
of accuracy in the approximate solution. From this point of view it seems more
reasonable to choose some form of relative criterion, e.g.
$$ \frac{|x_n - x_{n-1}|}{|x_n|} < \varepsilon. \qquad (2.14) $$
This is a more robust approach, however it can also fail in certain situations.
Example. Consider Newton's method for the equation x² = 0. The iterative
process is xn = ½xn−1. Evaluating the left-hand side of (2.14) in this case gives us
$$ \frac{|x_n - x_{n-1}|}{|x_n|} = \frac{\tfrac{1}{2}|x_{n-1}|}{\tfrac{1}{2}|x_{n-1}|} = 1. $$
Therefore, criterion (2.14) can never be satisfied for any ε < 1, even though xn → x∗.
We note that changing the denominator in (2.14) to |xn−1| does not help, as the
expression then only evaluates to ½ instead of 1.
One recommendation for how to avoid the problem in the previous example is to
use the following criterion
$$ \frac{|x_n - x_{n-1}|}{\max\{|x_n|, x_{typ}\}} < \varepsilon. \qquad (2.15) $$
Here, xtyp is a nonzero user-specified quantity representing the ‘typical’ magnitude
of the argument x that we are working with (we shall encounter this quantity again
in (2.23), where we deal with the effect of rounding errors). The exact value of
xtyp is not really important, as long as it has roughly the correct magnitude we
are working with in our problem. This of course limits the usefulness of (2.15) in
‘black-box’ solvers, however when our problem comes e.g. from physics, we have at
least a general idea of what to expect of the quantities involved.
It is obvious that measuring the distance of two iterates in general has nothing
in common with the distance to the solution. The method could be caught in a
temporary stagnation phase, where progress is slow for some reason, even though
we are far from the solution.
Finally, one can combine the mentioned criteria into, for example:
$$ \text{IF } |f(x_n)| < \varepsilon_1|f(x_0)| + \varepsilon_2 \;\text{ OR }\; \frac{|x_n - x_{n-1}|}{\max\{|x_n|, x_{typ}\}} < \varepsilon_3 \text{ STOP.} $$
One might also try an AND instead of an OR in the above, but this might run into
satisfiability issues.
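As an illustration, the combined criterion might be wired into Newton's method as follows (a Python sketch; the tolerances ε1, ε2, ε3 and x_typ below are illustrative values, not recommendations, and the test equation is again x³ − 2x − 5 = 0).

def newton_with_stopping(f, df, x0, x_typ=1.0,
                         eps1=1e-12, eps2=1e-15, eps3=1e-12, max_iter=100):
    """Newton's method stopped by the combined residual/step criterion above."""
    x_old = x0
    x = x0
    f0 = abs(f(x0))
    for _ in range(max_iter):
        x = x_old - f(x_old) / df(x_old)
        small_residual = abs(f(x)) < eps1 * f0 + eps2
        small_step = abs(x - x_old) / max(abs(x), x_typ) < eps3
        if small_residual or small_step:
            break
        x_old = x
    return x

print(newton_with_stopping(lambda x: x**3 - 2*x - 5, lambda x: 3*x**2 - 2, 2.0))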
Remarks:
• If one has some additional information, this can be used. For example, once
Newton’s method actually starts converging quadratically, we have
$$ x_n - x^* = \underbrace{x_n - x_{n+1}}_{e_n} + \underbrace{x_{n+1} - x^*}_{O(|e_n|^2)}, $$
therefore |xn − xn+1 | and |en | can be expected to have the same magnitude,
once we are in the quadratically convergent regime. Therefore testing the
Cauchy property is justified in this case and gives a relevant estimate of the
error.
• linear: if there exists c > 0 such that |hn| > c for all n,
• superlinear: if lim_{n→∞} hn = 0,
$$ x_{n+1} - x^* = x_n - \frac{f(x_n)}{a_n} - x^* = \frac{1}{a_n}\Bigl(\underbrace{f(x^*)}_{=0} - f(x_n) - a_n(x^* - x_n)\Bigr) $$
$$ = \frac{1}{a_n}\Bigl(\underbrace{f(x^*) - f(x_n) - f'(x_n)(x^* - x_n)}_{(*)} + \underbrace{\bigl(f'(x_n) - a_n\bigr)(x^* - x_n)}_{(**)}\Bigr). \qquad (2.18) $$
Using Lemma 16, both terms can be estimated as
$$ |(*)| \le \tfrac{1}{2}\gamma|e_n|^2, \qquad |(**)| \le \tfrac{1}{2}\gamma|h_n||e_n|. \qquad (2.19) $$
One might ask whether there can be a practical choice of hn so that we get a
quadratically convergent method, since Theorem 25 requires |hn| ≤ C|en| and |en| is
an unknown quantity which we can only estimate (if we knew en, we would immedi-
ately have the exact solution x∗ = xn + en). There is however one seemingly strange
choice of hn which naturally fulfills this requirement. Steffensen's method is
obtained by taking hn = f(xn) in (2.16):
$$ x_{n+1} = x_n - \frac{f(x_n)^2}{f(x_n + f(x_n)) - f(x_n)}. $$
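A minimal sketch of Steffensen's method (Python; the test equation x³ − 2x − 5 = 0 and the tolerance are arbitrary choices). Note that no derivative of f appears anywhere.

def steffensen(f, x0, tol=1e-12, max_iter=50):
    """Steffensen's method: derivative-free, locally quadratically convergent."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        denom = f(x + fx) - fx        # difference approximation with h_n = f(x_n)
        x = x - fx * fx / denom
    return x

print(steffensen(lambda x: x**3 - 2*x - 5, 2.0))   # approx. 2.0945514815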
We wish to take |h| so that the right-hand side estimate in (2.22) is minimal. If
we minimize the expression |h| + ε̄/|h| with respect to |h|, we get the minimum at
|h| = √ε̄. Therefore the general recommendation is to take h on the order of √ε̄ or
√ε_mach. We say 'on the order of', because we only have a rough bound on ε(x) and
it is essentially useless to try to carefully evaluate the constants in (2.22).
In general, one also has to take into account the error relative to the current and
'typical' value of xn. One possible recommendation from the literature is taking
$$ |h_n| = \sqrt{\varepsilon_{mach}}\,\max\{|x_n|, x_{typ}\}. \qquad (2.23) $$
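In code, Newton's method with a difference approximation of the derivative and the step choice (2.23) might look as follows (a Python sketch; x_typ is the user-supplied typical magnitude and the test equation is as before).

import math

def fd_newton(f, x0, x_typ=1.0, tol=1e-10, max_iter=50):
    """Newton's method with a forward-difference derivative, step given by (2.23)."""
    sqrt_eps = math.sqrt(2.0**-52)                # square root of machine epsilon
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        h = sqrt_eps * max(abs(x), x_typ)         # recommendation (2.23)
        a = (f(x + h) - fx) / h                   # difference quotient a_n ~ f'(x_n)
        x = x - fx / a
    return x

print(fd_newton(lambda x: x**3 - 2*x - 5, 2.0))   # approx. 2.0945514815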
Remarks:
• We have only one evaluation of f per iteration, since we know fn−1 from the
previous iteration. Compare to Newton or Newton with differences, where we
need two function evaluations per iteration.
• By Theorem 12, one-point iteration methods have only integer rates of con-
vergence. By ‘one-point’ method we mean methods of the form xn+1 = g(xn ),
where the next iteration depends only on one previous iterate. The secant
method is a two-point method, i.e. it can be written in the form xn+1 =
g(xn , xn−1 ). Such methods can have general non-integer rates of conver-
gence, as is the case for the secant method.
The rest of this part of the proof lies in expressing the term (⋆). Since f(x∗) = 0,
we have
$$ (\star) = \frac{\dfrac{f_n - f(x^*)}{x_n - x^*} - \dfrac{f_{n-1} - f(x^*)}{x_{n-1} - x^*}}{f_n - f_{n-1}} = \frac{F(x_n) - F(x_{n-1})}{f_n - f_{n-1}}, \qquad (2.26) $$
where we have defined the auxiliary function F as
$$ F(x) = \frac{f(x) - f(x^*)}{x - x^*}. $$
We can express the numerator in (2.26) by the mean value theorem:
$$ F(x_n) - F(x_{n-1}) = F'(\xi_n)(x_n - x_{n-1}) \qquad (2.27) $$
for some ξn between xn and xn−1. The derivative F' from (2.27) can be explicitly
computed as
$$ F'(x) = \frac{f'(x)(x - x^*) - f(x) + f(x^*)}{(x - x^*)^2} = \frac{\tfrac{1}{2}f''(\tilde\xi_n)(x - x^*)^2}{(x - x^*)^2} = \tfrac{1}{2}f''(\tilde\xi_n), \qquad (2.28) $$
for some ξ̃n between x and x∗ (the x here is the ξn from (2.27), hence the dependence
of ξ̃n on n). The second equality in (2.28) follows from Taylor's expansion: f(x∗) =
f(x) + f'(x)(x∗ − x) + ½f''(ξ̃n)(x∗ − x)².
By combining (2.27) with (2.28), we get
$$ (\star) = \frac{\tfrac{1}{2}f''(\tilde\xi_n)\,(x_n - x_{n-1})}{f_n - f_{n-1}} = \frac{f''(\tilde\xi_n)}{2f'(\hat\xi_n)}, $$
which follows from the mean value theorem fn − fn−1 = f'(ξ̂n)(xn − xn−1).
Having expressed the term (⋆) from (2.25), we finally arrive at the error equation
for the secant method in the form
$$ e_{n+1} = \frac{f''(\tilde\xi_n)}{2f'(\hat\xi_n)}\,e_n e_{n-1}, \qquad (2.29) $$
get |en | → 0 by induction similarly as in the proof of Theorem 14. What remains is
to derive the convergence rate.
Let us assume that the secant method has rate of convergence p. If we define
$$ S_n = \frac{|e_{n+1}|}{|e_n|^p}, \qquad (2.30) $$
then by the definition of convergence rate p, there exists C > 0 such that
Now, we turn our attention to the error equation (2.29). We take its absolute value
and divide by |en||en−1| in order to have all the error terms on one side. We get
$$ \frac{|f''(\tilde\xi_n)|}{2|f'(\hat\xi_n)|} = \frac{|e_{n+1}|}{|e_n||e_{n-1}|}. \qquad (2.33) $$
Since the method converges, we have ξ̃n, ξ̂n → x∗. The left-hand side term from
(2.33) therefore converges to a constant which is in general a finite nonzero number:
$$ \lim_{n\to\infty} \frac{|f''(\tilde\xi_n)|}{2|f'(\hat\xi_n)|} = \frac{|f''(x^*)|}{2|f'(x^*)|} \ne 0. \qquad (2.34) $$
Concerning the right-hand side of the error relation (2.33), we can express it using
(2.32) and take the limit to obtain
$$ \lim_{n\to\infty} \frac{|e_{n+1}|}{|e_n||e_{n-1}|} = \lim_{n\to\infty} \frac{S_n S_{n-1}^p\,|e_{n-1}|^{p^2}}{S_{n-1}|e_{n-1}|^p\,|e_{n-1}|} = \lim_{n\to\infty} S_n S_{n-1}^{p-1}\,|e_{n-1}|^{p^2-p-1} = C^p \lim_{n\to\infty} |e_{n-1}|^{p^2-p-1}, \qquad (2.35) $$
Since |en−1| → 0, the value of the last limit in (2.36) depends on the sign of the
exponent:
$$ \lim_{n\to\infty} |e_{n-1}|^{p^2-p-1} = \begin{cases} 0, & p^2 - p - 1 > 0, \\ \infty, & p^2 - p - 1 < 0, \\ 1, & p^2 - p - 1 = 0. \end{cases} \qquad (2.37) $$
The first case is not possible, since (2.36) would reduce to 0 ≠ 0. The second
case is also not possible, since the right-hand side of (2.36) would be infinite (here
it is important that C^p ≠ 0), while the left-hand side is a finite (but nonzero)
number. Therefore, the only possibility for (2.36) to be satisfied is the third case
p² − p − 1 = 0. This equation has a single positive root p = (1 + √5)/2 ≈ 1.618, which is
the desired convergence rate.
When reading the proof of Theorem 27, it is not easy to gain an intuitive insight
as to why the error equation (2.29) leads to the convergence rate (1 + √5)/2. Compare
this to Newton's method, where the error equation is simply e_{n+1} ≈ e_n², from which
we immediately see the convergence rate of 2. Let us now try to gain some intuition
in the case of the secant method.
We write the error equation for the secant method (2.29) in the simplified form
Let us assume for simplicity that the initial errors are on the order
e0 , e1 ≈ 10−1 .
We can notice that the exponents are the Fibonacci numbers Fn defined by the
recurrence Fn+1 = Fn + Fn−1 with F0 = F1 = 1. Altogether, we have
en ≈ 10−Fn . (2.39)
There are many classical results relating properties of Fibonacci numbers to the
'golden ratio'. For example, one can derive that
$$ \lim_{n\to\infty} \frac{F_{n+1}}{F_n} = \frac{1+\sqrt{5}}{2}. \qquad (2.40) $$
This follows e.g. from Binet's formula, which is an explicit formula for Fn. If we
combine (2.39) with (2.40) for sufficiently large n, we get
$$ e_{n+1} \approx 10^{-F_{n+1}} = \bigl(10^{-F_n}\bigr)^{\frac{F_{n+1}}{F_n}} \approx e_n^{\frac{1+\sqrt{5}}{2}}. $$
From this we can see the convergence rate (1 + √5)/2 of the secant method. Of course
these simple considerations only serve to gain intuitive insight and do not constitute
a rigorous proof.
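This intuition can also be checked numerically. The following sketch (Python; the test equation, initial guesses and reference root are arbitrary choices) runs the secant method and prints the empirical orders log|e_{n+1}|/log|e_n|; the first few values are polluted by the constants, the later ones approach (1 + √5)/2 ≈ 1.618.

import math

def f(x):
    return x**3 - 2.0 * x - 5.0       # test equation, root approx. 2.0945514815

x_star = 2.0945514815423265           # reference root for measuring the errors
x_prev, x = 2.0, 3.0                  # two initial guesses
errors = [abs(x_prev - x_star), abs(x - x_star)]
for _ in range(6):
    f_prev, f_cur = f(x_prev), f(x)
    x_prev, x = x, x - f_cur * (x - x_prev) / (f_cur - f_prev)   # secant step
    errors.append(abs(x - x_star))

# Empirical convergence order log|e_{n+1}| / log|e_n|.
for e_n, e_next in zip(errors[2:], errors[3:]):
    if e_n > 0 and e_next > 0:
        print(math.log(e_next) / math.log(e_n))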
Remark 28. There is one more issue with the secant method. Once we are very
close to x∗ , the differences xn − xn−1 and f (xn ) − f (xn−1 ) become very small. This
may present a problem in finite precision arithmetic, as we have discussed in Section
2.4.1. It may happen that xn − xn−1 is too small with respect to the recommended
value of √ε_mach and the difference approximation of the derivative can no longer be
trusted. Newton's method does not suffer from this issue.
Exercises
Muller’s method
Muller's method is a direct generalization of the secant method to the quadratic
case. Due to its simplicity, it seems to have been independently discovered and re-
discovered many times by various people; however, it is credited to David E. Muller
in 1956. The idea is simple:
1. Use xn , xn−1 , xn−2 and the values of f at these points to construct a quadratic
polynomial p such that
p(xn−i ) = f (xn−i ), i = 0, 1, 2.
• A quadratic function need not have a real root. If this happens for the poly-
nomial p in some iteration, the next xn+1 is not defined and the iteration
process stops. One can fix this e.g. by switching to secants in that iteration
or neglecting the imaginary parts of the roots.
• Another possibility is to accept the complex roots of p and use the method to
search for complex roots of f (e.g. when f is a polynomial), which is actually
a good idea.
• The convergence rate is approximately 1.839, which is the real root of x³ − x² −
x − 1 = 0 and stems from the error equation en+1 = Cn en en−1 en−2, where
Cn is some quantity involving various derivatives of f. Compare this to the
convergence rate 1.618 of the secant method, which is the root of x² − x − 1 = 0
stemming from the error equation en+1 = Cn en en−1.
• Muller's method needs only one function evaluation per iteration. Therefore,
similarly as in the secant method on page 38, if we wish to compare Newton and
Muller, it is more fair to compare two iterations of Muller with one iteration
of Newton. Two iterations of Muller's method have a combined convergence
rate of 1.839² ≈ 3.383, which is much higher than the quadratic convergence
rate of Newton.
1. Use xn, xn−1, xn−2 and the values f(xn), f(xn−1), f(xn−2) to construct a quad-
ratic polynomial p̃ such that
$$ \tilde p\bigl(f(x_{n-i})\bigr) = x_{n-i}, \qquad i = 0, 1, 2. \qquad (2.43) $$
Again, we do not write down explicit formulas for xn+1, which can be found
elsewhere. Obviously the method can break down if two of the values f(xn),
f(xn−1), f(xn−2) coincide, since the interpolation problem (2.43) is not well
posed in this case. This will obviously happen very rarely, but may cause instabilities
in the algorithm when some of these values are very close to each other.
Otherwise, the basic properties of IQI are the same as those of Muller's method:
the rate of convergence is the same (1.839), it also uses one function evaluation per
iteration and it also needs three initial values.
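One step of inverse quadratic interpolation can be written down directly via the Lagrange form of the interpolating polynomial in the variable y, evaluated at y = 0 (a Python sketch with arbitrary initial values; the breakdown when two f-values coincide is visible in the denominators).

def iqi_step(x0, x1, x2, f):
    """One IQI step: evaluate at y = 0 the quadratic polynomial in y that
    interpolates the points (f(x_i), x_i), i = 0, 1, 2."""
    f0, f1, f2 = f(x0), f(x1), f(x2)
    return (x0 * f1 * f2 / ((f0 - f1) * (f0 - f2))
            + x1 * f0 * f2 / ((f1 - f0) * (f1 - f2))
            + x2 * f0 * f1 / ((f2 - f0) * (f2 - f1)))

f = lambda x: x**3 - 2.0 * x - 5.0
xs = [1.0, 3.0, 2.0]                      # three initial values
for _ in range(6):
    xs = [xs[1], xs[2], iqi_step(xs[0], xs[1], xs[2], f)]
print(xs[-1])                             # approx. 2.0945514815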
slowly, for any continuous function on any interval with a sign change in the end-
points. The idea of hybrid methods is to combine two or more methods, typically a
very fast locally convergent method with a slow globally convergent method, along
with a set of criteria for switching between the methods in every iteration. Typ-
ically this is done by combining open methods (Newton) and bracketing methods
(bisection), however we can view the damped Newton method with backtracking
(page 22) as a rudimentary example of such a method.
Dekker’s method
The first sophisticated hybrid method was developed in 1969 by Theodorus Dekker
(although variants of this method appeared in earlier works). The idea is to combine
the bisection method with the secant method. We produce a sequence of brackets
[an , bn ] containing x∗ , where the points bn are the actual approximations of the root
(bn need not necessarily be the right endpoint of the bracket, it can be the left
endpoint as well). We find two candidates for approximating x∗ – those given by
the secant and bisection methods. Then we choose which one will be our next
approximation bn+1. Finally, we choose an appropriate an+1 from all the available
values to have a new bracket [an+1 , bn+1 ]. In detail, we have:
$$ b = \frac{a_n + b_n}{2} $$
7. Set n := n + 1 and go to 1.
The criterion from point number 4 in the algorithm guarantees that the new
approximation bn+1 does not jump too far from the current approximation bn , which
was one of the causes of non-convergence in Newton’s method. Point number 5
ensures we will have a bracket containing the root in the next iteration. In point
number 6 we assume that the residual of the equation is an indicator of the quality
of approximation (which might or might not be true) and possibly change the role
of an+1 and bn+1 so that bn+1 is the ‘better’ approximation of x∗ .
It can be proven that the method always converges and that x∗ ∈ [an , bn ] for all
n. Sometimes the method can converge very slowly, even more slowly than bisection.
Brent’s method
In 1973, Richard P. Brent published an improved version of Dekker’s method. With-
out going into the technical detail, we only state that Brent’s method chooses be-
tween bisection, secants and IQI in each iteration based on more sophisticated crite-
ria. For general functions, the method can actually be slower than bisection – Brent
shows that the method converges to the desired precision in at most N² iterations,
if bisection needs N iterations. However, for ‘nice’ functions, the method typically
performs IQI in most of the iterations, yielding superlinear convergence. We note
that Brent’s method is used in the fzero command in MATLAB.
Exercises
Chapter 3
Systems of equations
• Much more complicated behavior and properties of the systems and numerical
methods, causing the analysis to be much more technical and complex.
The situation is also much more complicated in comparison to the case of linear
systems Ax = b and even in this case there are no universally functioning numer-
ical methods. Consider the simple question of how many solutions the system of
equations can have. In the linear case, the answer is 0, 1 or ∞. For nonlinear
equations, the answer can be 0, any natural number, countably many or uncountably many.
Moreover, even the number of roots can depend very sensitively on slight changes
in the system.
Example. Consider the system
$$ x_1^2 - x_2 + \gamma = 0, $$
$$ -x_1 + x_2^2 + \gamma = 0, $$
Proof. Since (3.2) is a vector identity, it is sufficient to prove it for each component
separately. Define the scalar function φi(t) = fi(x + t(y − x)). Then φi'(t) =
∇fi(x + t(y − x)) · (y − x). Therefore
$$ f_i(y) - f_i(x) = \varphi_i(1) - \varphi_i(0) = \int_0^1 \varphi_i'(t)\,dt = \int_0^1 \nabla f_i\bigl(x + t(y - x)\bigr)\cdot(y - x)\,dt. $$
Remark 36. There are specific situations when Newton’s method is more practical
in the original form (3.4) than (3.5). Consider for example the situation, when
we have a nonlinear system of two equations for two unknowns with a prescribed
constant right-hand side and suppose we need to quickly solve it many times with
different right-hand sides. The formula for the inverse of a 2 by 2 matrix is simple
and if we explicitly evaluate formula (3.4), it may happen that many cancellations
and simplifications occur, resulting in a simple closed formula for the function G
to be iterated, without the need to solve the linear algebraic systems from (3.5).
Or perhaps we are in a (rare) situation when we actually know (F')^{−1} or a good
approximation of it. This might happen when F' has a very simple structure or when we
have some additional information from the problem itself (its derivation, the physical
model it describes, etc.).
This means that we are transforming both the source and target spaces RN by an
affine and linear mapping, respectively. We note that the reason we do not transform
affinely also in the target space, i.e. G = AF + a for some vector a ∈ RN , is that
this would fundamentally change the problem, as we would no longer seek the root
G(y) = 0 but the solution of G(y) = a.
Now we use Newton's method to solve both problems. Newton for F(x) = 0
reads:
$$ F'(x_n)\Delta x_n = -F(x_n), \qquad x_{n+1} = x_n + \Delta x_n. \qquad (3.7) $$
which can be rewritten using (3.6) and the rule for differentiating composite func-
tions as
$$ A F'(B y_n + b)\,B\,\Delta y_n = -A F(B y_n + b), \qquad y_{n+1} = y_n + \Delta y_n. $$
Since A is regular, we can eliminate it from the equation for the update Δyn to get
But this is the same equation as (3.8), i.e. Newton for G – only the solution to (3.8)
is BΔy0 while the solution of (3.9) is Δx0. Therefore these two solutions must be
equal and we have Δx0 = BΔy0 (the matrix F'(x0) is tacitly assumed to be regular,
since x1 would otherwise be undefined). Now if we update x0, we get
Exercises
Use Ostrowski's Theorem 31 to show that the iterates will converge for x0 sufficiently
close to x∗. Verify the criterion for quadratic convergence.
Hint: We need to show that G'(x∗) = 0. If one is not comfortable with com-
puting G' directly, one can look at its individual entries ∂gi/∂xj. Also, in order to
avoid differentiating the matrix inverse, consider calculating G'(x∗) for the more
general function G(x) = x − B(x)F(x), where B(x) is some matrix.
$$ x^3 + y = 1, $$
$$ y^3 - x = 1. $$
Find the unique real solution to this system using Newton’s method and also ana-
lytically. How robust is the convergence of the method with respect to the choice of
(x0 , y0 )?
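A sketch of Newton's method (3.7) for this system, written in Python with NumPy (assumed available); the Jacobi matrix is written out by hand and the linear system for the update is solved in every iteration. The initial guess is an arbitrary choice.

import numpy as np

def F(v):
    x, y = v
    return np.array([x**3 + y - 1.0, y**3 - x - 1.0])

def J(v):
    x, y = v
    return np.array([[3.0 * x**2, 1.0],
                     [-1.0, 3.0 * y**2]])       # note: det J = 9x^2 y^2 + 1 > 0

v = np.array([0.5, 0.5])                        # initial guess
for _ in range(20):
    dv = np.linalg.solve(J(v), -F(v))           # linear system for the update
    v = v + dv
    if np.linalg.norm(dv) < 1e-14:
        break
print(v)                                        # the real solution of the system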
Lemma 40 (Banach perturbation lemma). Let A ∈ R^{N,N} with ‖A‖ < 1 for some
consistent matrix norm ‖·‖. Then (I + A)^{−1} exists and
$$ \|(I + A)^{-1}\| \le \frac{1}{1 - \|A\|}. \qquad (3.10) $$
Proof. First we prove existence of the inverse. Let 0 ≠ x ∈ R^N. Then the reverse
triangle inequality gives us
If we express ‖(I + A)^{−1}‖ from the inequality above, we immediately get (3.10).
Next, we need an analogue to Lemma 16, the substitute for the remainder of
Taylor’s polynomial.
Lemma 41. Let F : D ⊂ R^N → R^N be differentiable on a convex open set D and
let F' be γ-Lipschitz on D. Then for all x, y ∈ D
We note that the estimate (3.14) makes intuitive sense: if, by assumption 2, we
estimate ‖F'(x∗)^{−1}‖ by β, then at a nearby point ‖F'(x0)^{−1}‖ can be estimated a little
bit worse, by 2β.
2. Error estimate. Now we can estimate e1 = x1 − x∗. By expressing x1 from
Newton's method and adding the 'smart zero' F(x∗), we get
$$ x_1 - x^* = x_0 - x^* - F'(x_0)^{-1}\bigl(F(x_0) - F(x^*)\bigr) $$
due to estimate (3.14) and Lemma 41. Thus we have proved that
$$ \|e_1\| \le \beta\gamma\|e_0\|^2. $$
The estimate of en for general n can be obtained similarly, by induction.
Chord method
The idea of the chord method is to reuse the matrix in the linear system for the
Newton update (3.5) for all or several iterations. So instead of solving (3.5), we
solve
$$ F'(x_0)\Delta x_n = -F(x_n) \qquad (3.16) $$
in each Newton iteration. This can be advantageous from several points of view: we
save computational time evaluating the matrix F'(xn), and if we spend more time
on pre-computing a more sophisticated expensive preconditioner for F'(x0), we can
solve systems (3.16) much more efficiently. Perhaps we could even go as far as
pre-computing an LU or Cholesky factorization of F'(x0). The price we pay is that
Newton with the update given by (3.16) converges only linearly and is suitable only
for 'mildly' nonlinear problems. One can then balance the mentioned advantages
and disadvantages by updating the system matrix F' every K iterations for some
K. Such methods can be analyzed and, for example, if we take K = 2, the resulting
convergence rate is √3 ≈ 1.732.
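A sketch of the chord method in Python, using NumPy and SciPy (both assumed available) so that the LU factorization of F'(x0) is computed only once and then reused; the test problem is the 2×2 system from the exercise above. Note that the initial guess has to be reasonably close to the solution, since the frozen Jacobi matrix only yields a (linearly convergent) contraction locally.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def F(v):
    x, y = v
    return np.array([x**3 + y - 1.0, y**3 - x - 1.0])

def J(v):
    x, y = v
    return np.array([[3.0 * x**2, 1.0],
                     [-1.0, 3.0 * y**2]])

x = np.array([0.1, 0.9])           # initial guess, already close to the solution
lu_piv = lu_factor(J(x))           # factorize F'(x0) once ...
for n in range(100):
    dx = lu_solve(lu_piv, -F(x))   # ... and reuse the factorization every iteration
    x = x + dx
    if np.linalg.norm(dx) < 1e-12:
        break
print(n + 1, x)                    # more iterations than Newton, but cheaper ones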
$$ x_{n+1} = x_n - B_n^{-1}F(x_n), \qquad (3.17) $$
where the matrix Bn will (a) be additively updated in each iteration and (b) should
somehow correspond to the secant method in 1D. Concerning the secant method,
we can write it in the form
$$ b_n = \frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}}, \qquad x_{n+1} = x_n - b_n^{-1} f(x_n). \qquad (3.18) $$
Remark 43. We note that condition (3.19) is natural for a matrix which should
play a similar role as F'(xn). The mean value theorem (Lemma 30) states that
$$ \left(\int_0^1 F'\bigl(x_{n-1} + t(x_n - x_{n-1})\bigr)\,dt\right)(x_n - x_{n-1}) = F(x_n) - F(x_{n-1}), \qquad (3.20) $$
which has a similar form as (3.19). Therefore, Bn plays a similar role as the Jacobi
matrix averaged along the line between xn and xn−1. This is not claiming that the result-
ing Bn will be an approximation of F' in some sense, only that it satisfies similar
requirements.
Broyden’s method
Now we must supply additional requirements on Bn to get a uniquely determined
method. The first such method was Broyden’s method published in 1965. The
basic additional assumptions are these: let Bn be given; then Bn+1 is constructed as
follows:
Bn+1 = Bn + Cn . (3.21)
2. Cn is a rank-1 matrix:
$$ C_n = a_n b_n^T, \qquad a_n, b_n \in \mathbb{R}^N. \qquad (3.22) $$
$$ x_{n+1} = x_n - B_n^{-1} F(x_n) $$
$$ B_{n+1} = B_n + \frac{1}{\|\Delta x_n\|^2}\bigl(\Delta F_n - B_n \Delta x_n\bigr)\Delta x_n^T, $$
Remarks:
• The algorithm needs only one evaluation of F per iteration, similarly as the
secant method in 1D.
• The choice bn = Bn^T Δxn gives the so-called 'bad' Broyden method.
• 'Good' Broyden is affine invariant but not contravariant. For 'bad' Broyden
it is exactly the opposite.
matrix satisfies the quasi-Newton condition. Here ‘smallest’ means ‘smallest in the
Frobenius norm’. Since the Frobenius norm is the L2 norm of all components of
the matrix, one can say that the rank 1 update is the smallest elementwise update
satisfying the Quasi-Newton condition. This means that the rank 1 update is the
most conservative – make the smallest possible change to Bn so that we satisfy
the basic assumption, the Quasi-Newton condition. This is reasonable, because, as
noted earlier, all the other assumptions are to some extent artificial and we want to
minimize their influence to some extent.
Lemma 45. Let ∆xn , ∆Fn ∈ RN be given. Then the matrix Cn given by (3.26) is
the smallest update of Bn in the Frobenius norm, such that Bn+1 = Bn + Cn satisfies
the quasi-Newton condition Bn+1 ∆xn = ∆Fn .
Proof. Let C̃n be another update such that B̃n+1 = Bn + C̃n also satisfies B̃n+1 ∆xn =
∆Fn . Then
\[
\begin{aligned}
\|C_n\|_F &= \Bigl\| \frac{1}{\|\Delta x_n\|^2}\bigl(\Delta F_n - B_n \Delta x_n\bigr)\Delta x_n^T \Bigr\|_F \\
&= \Bigl\| \frac{1}{\|\Delta x_n\|^2}\bigl(\tilde{B}_{n+1}\Delta x_n - B_n \Delta x_n\bigr)\Delta x_n^T \Bigr\|_F \\
&= \frac{1}{\|\Delta x_n\|^2}\bigl\| \bigl(\tilde{B}_{n+1} - B_n\bigr)\Delta x_n \Delta x_n^T \bigr\|_F \\
&\le \underbrace{\bigl\| \tilde{B}_{n+1} - B_n \bigr\|_F}_{\|\tilde{C}_n\|_F}\,
\underbrace{\frac{1}{\|\Delta x_n\|^2}\bigl\| \Delta x_n \Delta x_n^T \bigr\|_F}_{=1}
= \|\tilde{C}_n\|_F ,
\end{aligned}
\]
which we wanted to prove. Here we have used the property of the Frobenius norm that \|u\|^2 = \|uu^T\|_F for any u \in \mathbb{R}^N, which can be proved directly:
\[
\|u\|^2 = \sum_{i=1}^N u_i^2 = \sqrt{\Bigl(\sum_{i=1}^N u_i^2\Bigr)^{\!2}} = \sqrt{\sum_{i,j=1}^N u_i^2 u_j^2} = \sqrt{\sum_{i,j=1}^N (u_i u_j)^2} = \|uu^T\|_F .
\]
Finally, we come to the last reason why a rank 1 update is a wise choice. Using the so-called Sherman–Morrison formula, one can directly compute the inverse B_{n+1}^{-1} as an update of B_n^{-1}.

Lemma 46 (Sherman–Morrison formula). Let A \in \mathbb{R}^{N\times N} be nonsingular and let u, v \in \mathbb{R}^N be such that 1 + v^T A^{-1} u \neq 0. Then A + uv^T is nonsingular and
\[
(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u} . \qquad (3.27)
\]
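Formula (3.27) is easy to verify numerically; the following throwaway Python snippet (random data, chosen only for illustration) checks it against a directly computed inverse.

import numpy as np

# Sanity check of the Sherman-Morrison formula (3.27) on random data.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)    # comfortably nonsingular A
u = rng.standard_normal(4)
v = rng.standard_normal(4)

Ainv = np.linalg.inv(A)
direct  = np.linalg.inv(A + np.outer(u, v))
formula = Ainv - (Ainv @ np.outer(u, v) @ Ainv) / (1.0 + v @ Ainv @ u)
print(np.allclose(direct, formula))                  # expected output: True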
Lemma 46 allows us to simplify Broyden's method significantly, as we can now compute B_{n+1}^{-1} directly by updating B_n^{-1}. A direct application of (3.27) to the 'good' Broyden update (3.26) yields
\[
\begin{aligned}
x_{n+1} &= x_n - B_n^{-1} F(x_n), \\
B_{n+1}^{-1} &= B_n^{-1} + \frac{\Delta x_n - B_n^{-1}\Delta F_n}{\Delta x_n^T B_n^{-1} \Delta F_n}\, \Delta x_n^T B_n^{-1} ,
\end{aligned}
\qquad (3.28)
\]
which requires only O(N^2) operations per iteration, without the necessity to solve linear systems with the matrix B_n.
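A minimal Python sketch of the inverse form (3.28) follows; the test problem is again a placeholder, and B_0^{-1} is taken, as one possible choice, to be the inverse of the exact Jacobian at x_0.

import numpy as np

def broyden_good_inverse(F, x0, Binv0, tol=1e-10, max_iter=50):
    # 'Good' Broyden method in the form (3.28): only B_n^{-1} is stored
    # and updated via Sherman-Morrison; no linear systems are solved.
    x, Binv = np.asarray(x0, dtype=float), np.array(Binv0, dtype=float)
    Fx = F(x)
    for _ in range(max_iter):
        dx = -Binv @ Fx                        # x_{n+1} = x_n - B_n^{-1} F(x_n)
        x = x + dx
        Fx_new = F(x)
        dFx = Fx_new - Fx
        BinvdF = Binv @ dFx
        # rank-1 update of B_n^{-1}, cf. (3.28); costs O(N^2) operations
        Binv += np.outer(dx - BinvdF, dx @ Binv) / (dx @ BinvdF)
        Fx = Fx_new
        if np.linalg.norm(Fx) < tol:
            break
    return x

F  = lambda x: np.array([x[0]**2 + x[1] - 3.0, x[0] + x[1]**2 - 5.0])
dF = lambda x: np.array([[2.0 * x[0], 1.0], [1.0, 2.0 * x[1]]])
print(broyden_good_inverse(F, [1.5, 1.5], np.linalg.inv(dF(np.array([1.5, 1.5])))))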
Remarks:
• In the algorithm (3.28), we never need the matrices B_n; we only work with B_n^{-1}.
• The BFGS (Broyden, Fletcher, Goldfarb, Shanno) method uses the update
Bn+1 = Bn + αaaT + βbbT which automatically preserves symmetry, and after
choosing the parameters α, β, it also preserves positive definiteness. The BFGS
method is one of the most popular methods in optimization, where the role
of the Jacobi matrix F' is taken by the Hessian matrix H of the minimized functional. The point is that Hessian matrices are naturally symmetric, while there is no reason for symmetry of Jacobi matrices. Thus BFGS is more popular in the optimization community; a sketch of the standard update is given below.
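For reference, here is a short sketch of the BFGS update in the rank-2 form B_{n+1} = B_n + \alpha a a^T + \beta b b^T mentioned in the last remark. The concrete choices of a, b, \alpha, \beta below are the ones standard in the optimization literature and are not derived in these notes.

import numpy as np

def bfgs_update(B, dx, dg):
    # Standard BFGS update of a symmetric positive definite approximation B
    # of the Hessian, written as B + alpha*a*a^T + beta*b*b^T with
    # a = dg (difference of gradients) and b = B dx (dx = step difference).
    # Positive definiteness is preserved provided dg^T dx > 0.
    a, b = dg, B @ dx
    alpha = 1.0 / (dg @ dx)
    beta = -1.0 / (dx @ b)
    return B + alpha * np.outer(a, a) + beta * np.outer(b, b)

By construction the update maps a symmetric B to a symmetric B_{n+1}, as stated in the remark.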
3.5 Continuation methods

The idea of continuation methods is to embed the problem F(x) = 0 into a family of problems depending on a parameter. In many applications such a parameter is naturally present: for example, if we let the Reynolds number in the Navier–Stokes equations go to zero, we get the Stokes problem, which is linear and thus much simpler than high Reynolds flows with turbulence etc. Perhaps our specific problem contains such a parameter naturally. If not, we will introduce such a parameter artificially.
In any case, we will denote the auxiliary parameter by t and assume that for t = 0 we get a simpler (e.g. linear) problem F_0(x) = 0, while for t = 1 we get the original problem F(x) = 0 which was to be solved. So instead of a single function F : D \subset \mathbb{R}^N \to \mathbb{R}^N we have a class of problems H : D \times [0, 1] \subset \mathbb{R}^{N+1} \to \mathbb{R}^N such that
\[
H(\cdot, 0) = F_0, \qquad H(\cdot, 1) = F . \qquad (3.29)
\]
If F does not naturally contain such a tuning parameter t, we can take for example the convex combination
\[
H(x, t) = t\,F(x) + (1 - t)\,F_0(x), \qquad (3.30)
\]
where F_0 is some 'simple' mapping, for which we ideally know the solution to F_0(x) = 0. However, F_0 should not be chosen arbitrarily, as the resulting method could run into difficulties. The function F_0 should somehow be related to F. One possible choice is to pick some x_0 and take F_0(x) = F(x) - F(x_0). Then x_0 is trivially the solution to F_0(x) = 0. Taking such an F_0 in (3.30) results in
\[
H(x, t) = t\,F(x) + (1 - t)\bigl(F(x) - F(x_0)\bigr) = F(x) - (1 - t)\,F(x_0) . \qquad (3.31)
\]
This definition of H clearly satisfies the basic assumption (3.29), since we know the
solution x0 to H(x, 0) = 0. In the following we will always keep (3.31) in mind as a
basic choice of H.
Now instead of the single problem F (x) = 0 we will be solving a whole class of
problems
H(x, t) = 0, ∀t ∈ [0, 1]. (3.32)
We assume that (3.32) has a solution x(t) for any t ∈ [0, 1] and that it depends
continuously on t. In other words, we assume the existence of a continuous mapping x : [0, 1] \to \mathbb{R}^N such that
\[
H\bigl(x(t), t\bigr) = 0 \quad \text{for all } t \in [0, 1].
\]
In this case, x(\cdot) defines a curve in \mathbb{R}^N such that x(0) = x_0 is prescribed and
x(1) = x∗ is the desired solution to our original problem F (x∗ ) = 0.
Remark 47. The standard implicit function theorem gives local existence and uniqueness of x(\cdot). The fundamental assumption (apart from smoothness of H) is the regularity, i.e. invertibility, of \partial_x H – the Jacobi matrix of H with respect to the x-variables. This can be improved to get a global existence result under additional assumptions. For example, we get the following.
Lemma 48. Let F : RN → RN be continuously differentiable and let kF 0 (x)−1 k ≤
β < ∞ for all x ∈ RN . Then for all x0 ∈ RN there exists a continuously differentiable
mapping x : [0, 1] → RN such that H(x(t), t) = 0 for all t ∈ [0, 1], where H is defined
by (3.31).
We choose a partition 0 = t_0 < t_1 < \cdots < t_M = 1 of the interval [0, 1] and consider the finite family of problems
\[
H(x, t_i) = 0, \qquad i = 1, \ldots, M.
\]
We will denote the solution of the i-th problem as x_i, so that H(x_i, t_i) = 0. Now
since x(·) is continuous, if ti − ti−1 is sufficiently small then xi−1 is close to xi .
So close in fact that xi−1 can be used as an initial point in a locally convergent
method for xi . In other words, each xi has a neighborhood where e.g. Newton’s
method will converge. If these neighborhoods overlap sufficiently, then xi−1 (or its
approximation) can be used as a good initial guess in Newton’s method for solving
H(x, ti ) = 0 with the solution xi . The whole process then goes as follows: start with
x0 and do several iterations of Newton for x1 . Then use the current approximation
of x1 and do several iterations of Newton for x2 . And so on. In the end xM −1 is
a good initial approximation for xM = x∗ . The point is that x0 is a bad initial
guess in Newton for xM , but we use the intermediate solutions xi to bridge this
gap via (possibly very small) neighborhoods of convergence of all the individual xi .
Of course, in practice we only do a finite number N_i of iterations of Newton for each x_i. The result is that we need two indices: x_{i,n}, where the first index denotes the problem solved and the second one denotes the Newton iteration index. The resulting algorithm is this:
General continuation algorithm with Newton's method (3.33)
Given x_0, set x_{1,0} = x_0.
For i = 1, \ldots, M - 1:
    For n = 0, \ldots, N_i - 1:
        x_{i,n+1} = x_{i,n} - [\partial_x H(x_{i,n}, t_i)]^{-1} H(x_{i,n}, t_i),   (Newton for H(x, t_i) = 0)
    x_{i+1,0} = x_{i,N_i}
For n = 0, 1, \ldots (until a stopping criterion is satisfied):
    x_{M,n+1} = x_{M,n} - [\partial_x H(x_{M,n}, 1)]^{-1} H(x_{M,n}, 1).   (Newton for H(x, 1) = 0)
Here the last line is simply Newton's method for F(x) = 0. Since we do not need to compute the auxiliary x_i to high precision, one could simply take N_i = 1, i.e. do just one iteration of Newton for each x_i, while taking a finer partition of [0, 1]. In this case we can drop the second iteration index and the previous algorithm (3.33)
with the choice of H given by (3.31) reduces to
\[
x_{i+1} = x_i - F'(x_i)^{-1}\bigl(F(x_i) - (1 - t_{i+1})\,F(x_0)\bigr), \qquad i = 0, 1, \ldots, M - 1,
\]
followed by Newton's method for F(x) = 0 itself.
We note that this resulting algorithm is very similar to Newton's method, only with a slight change of the Newton formula in the first M iterations. This will also be the case in other similar methods, which result in some seemingly strange, simple modification of Newton's method with the ultimate goal of making it converge more globally.
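As an illustration, here is a minimal Python sketch of this simplified continuation algorithm with the choice (3.31), N_i = 1 and a uniform partition of [0, 1], followed by plain Newton for t = 1; the test problem and all parameters are placeholder choices.

import numpy as np

def continuation_newton(F, dF, x0, M=10, tol=1e-10, max_newton=50):
    # Continuation with H(x, t) = F(x) - (1 - t) F(x0), cf. (3.31):
    # one Newton step for each t_i, then plain Newton for t = 1.
    x = np.asarray(x0, dtype=float)
    Fx0 = F(x)                                   # F(x0) stays fixed
    for t in np.linspace(0.0, 1.0, M + 1)[1:]:   # t_1, ..., t_M = 1
        H = F(x) - (1.0 - t) * Fx0               # note: the x-Jacobian of H is F'(x)
        x = x - np.linalg.solve(dF(x), H)        # one Newton step for H(., t) = 0
    for _ in range(max_newton):                  # plain Newton for F(x) = 0
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            break
        x = x - np.linalg.solve(dF(x), Fx)
    return x

F  = lambda x: np.array([x[0]**2 + x[1] - 3.0, x[0] + x[1]**2 - 5.0])
dF = lambda x: np.array([[2.0 * x[0], 1.0], [1.0, 2.0 * x[1]]])
print(continuation_newton(F, dF, [1.5, 1.5]))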
Concerning the “General continuation algorithm with Newton’s method” above,
we have the following convergence result.
Predictor-corrector approach
Up to now we have simply taken the current approximation of x_i as an initial guess in Newton's method for x_{i+1}. The question is whether we can do better. What we want to do is take x_i, somehow produce a better initial guess \hat{x}_{i+1} (predictor phase) and use that as an initial guess for Newton's method (corrector phase).
Figure 3.1: Various problems with the continuation path x(·): 1. Escape to infinity, 2. Discontinuity, 3. Turning point, 4. Bifurcation.
The simplest predictor is one step of Euler's method for the curve x(t): differentiating H(x(t), t) = 0 with respect to t gives \partial_x H\, x'(t) + \partial_t H = 0, and one Euler step from t_i to t_{i+1} yields
\[
\hat{x}_{i+1} = x_i - (t_{i+1} - t_i)\,\bigl[\partial_x H(x_i, t_i)\bigr]^{-1} \partial_t H(x_i, t_i),
\]
which for the choice (3.31) becomes \hat{x}_{i+1} = x_i - (t_{i+1} - t_i)\,F'(x_i)^{-1} F(x_0). Again this looks like some variation on Newton's method with a damping factor (t_{i+1} - t_i) and F(x_0) instead of F(x_n).
Of course, one can use more sophisticated methods than Euler's method for the predictor, for example Runge–Kutta methods, etc. Such methods appear in the literature and can be analyzed.
We conclude this section by noting the basic obstacles in the continuation-based approaches described above. The main problem is the nonexistence of a continuous curve x(·) satisfying the basic assumptions used above. If H(·, ·) is chosen poorly, one of several things can go wrong, preventing the successful application of the methods above:
1. x(·) escapes to infinity and returns back.
2. x(·) is discontinuous.
3. x(·) has a turning point.
4. x(·) has a bifurcation point.
These situations are depicted in 1D in Figure 3.1, in the presented order. As was mentioned earlier, there are no universal recipes for treating such problems, and there is an extensive literature dealing with these issues.
Exercises