0% found this document useful (0 votes)
10 views

Numerical Methods For Nonlinear Equations

The document provides an introduction to numerical methods for solving nonlinear equations. It discusses how the focus has shifted from finding roots of polynomials to finding fixed points of functions, as fixed points are a more natural concept from a topological perspective. Various theorems regarding the existence of fixed points are also presented, such as the Banach fixed point theorem which guarantees the existence and uniqueness of a fixed point for a contraction mapping. The document serves as an overview and introduction to the topic before going into more technical details in subsequent chapters.

Uploaded by

Ahmed Rabie
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views

Numerical Methods For Nonlinear Equations

The document provides an introduction to numerical methods for solving nonlinear equations. It discusses how the focus has shifted from finding roots of polynomials to finding fixed points of functions, as fixed points are a more natural concept from a topological perspective. Various theorems regarding the existence of fixed points are also presented, such as the Banach fixed point theorem which guarantees the existence and uniqueness of a fixed point for a contraction mapping. The document serves as an overview and introduction to the topic before going into more technical details in subsequent chapters.

Uploaded by

Ahmed Rabie
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 67

Numerical Methods

for Nonlinear Equations

Václav Kučera

Prague, 2022
To Monika and Terezka
for their endless love and support

These materials have been produced within and supported by the project “Increasing
the quality of education at Charles University and its relevance to the needs of the
labor market” kept under number CZ.02.2.69/0.0/0.0/16 015/0002362.
Contents

1 Nonlinear equations 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Mathematical background . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Rate of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Scalar equations 8
2.1 Bisection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Fixed point iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Cobweb plots . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 What could possibly go wrong? . . . . . . . . . . . . . . . . . 19
2.3.2 Stopping criteria . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Secant method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.1 Approximation of the derivative in Newton’s method . . . . . 30
2.4.2 Secant method . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5 Various more sophisticated methods . . . . . . . . . . . . . . . . . . . 39
2.5.1 Methods based on quadratic interpolation . . . . . . . . . . . 40
2.5.2 Hybrid algorithms . . . . . . . . . . . . . . . . . . . . . . . . 41

3 Systems of equations 44
3.1 Tools from differential calculus . . . . . . . . . . . . . . . . . . . . . . 45
3.1.1 Fixed point iteration . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Newton’s method in C . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.1 Affine invariance and contravariance . . . . . . . . . . . . . . 48
3.3.2 Local quadratic convergence of Newton’s method . . . . . . . 50
3.3.3 Variations on Newton’s method . . . . . . . . . . . . . . . . . 52
3.4 Quasi-Newton methods . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Continuation methods . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 1

Nonlinear equations

1.1 Introduction
These lecture notes represent a brief introduction to the topic of numerical methods
for nonlinear equations. Sometimes the term ‘nonlinear algebraic equations’ can be
found in the literature, which evokes polynomial equations. This is however not
the case, the term ‘algebraic’ is used to distinguish our equations from nonlinear
differential equations, which is a completely different world. In our case, we shall
look for roots of a given function f , which will be a general, at least continuous
function. Here ‘roots’ means points where the function f attains the value zero.
From the historical perspective, originally the focus was on finding roots, ideally
of polynomial equations of higher and higher degree. However, in the past century,
the focus shifted from roots to fixed points. From the point of view of modern
mathematics, there is nothing special about a root, it is just a point, where f
attains a certain value, zero, which we consider ‘special’, but we might as well have
chosen this special value as 1 or π or −17. On the other hand a fixed point is a
topological invariant and therefore many powerful tools of modern mathematics can
be used to prove nonconstructive existence of fixed points. As we shall see later on,
from the viewpoint of numerical methods for nonlinear (algebraic) equations, fixed
points are the natural perspective.
If we want to look more closely at the history of root finding, polynomial equa-
tions are very interesting. Formulas for the roots of general quadratic, cubic and
quartic equations were known by the first half of the 16th century. Then the race to
find similar formulas for the quintic (5th degree) equation began. Three centuries
later, the Abel-Ruffini theorem was proved in 1824, which states that the roots of a
general quintic equation
√ cannot be expressed in terms of radicals, i.e. finite combi-
nations of +, −, ∗, /, n ·. Galois theory is the modern viewpoint and one can show
e.g. that the real root of x5 − x − 1 = 0 is not expressible using radicals over the
rational numbers. However, the key here is to define exactly what is meant by ‘for-
mula for the roots’. Abel and Ruffini say that radicals are hopeless for quintic and
higher equations. Radicals rely on our ability to extract the n-th root, i.e. for n = 5
solve the auxiliary equation x5 − a = 0 for a given a. However, if one admits the use
of so-called ultraradicals (Bring radicals), where one solves the modified equation
x5 + x − a = 0 for a given a, then one can solve any quintic equation. Another

1
CHAPTER 1. NONLINEAR EQUATIONS 2

approach to obtain formulas for general quintic roots is through the use of elliptic
modular functions. Felix Klein gives an elegant solution using the symmetry group
of the icosahedron. Later others came and solved the case of general equations of
degree 7, then 11, etc. using more and more special functions. The final chapter was
written by Hiroshi Umemura in 1984, when he published formulas for roots of gen-
eral polynomial equations of arbitrary degree. If he had done so in 1850, he would
have been celebrated as one of the greatest mathematicians of all time. But today
his result is a practically unknown curiosity sometimes mentioned in a footnote.
The reason is that he expresses the roots using so-called Siegel modular functions,
which are defines as infinite series of matrix exponentials. Nobody knows how to
practically evaluate or even reasonably approximate these functions. So from the
point of view of practice, the formulas are completely useless. Also from the point
of view of theory, they bring nothing new. The idea that you can solve any problem
analytically if you define more and more complicated special functions with special
properties that nobody knows how to evaluate belongs in the 19th century. Modern
mathematics uses other tools to show existence, uniqueness, desired properties, etc.
indirectly or non-constructively. If we want to know what the value is numerically,
we turn to numerical mathematics instead of formulas.
In this textbook, we shall describe and analyze the basic numerical methods for
nonlinear equations and their systems. We will spend a lot of time in the seemingly
simple case of 1D, i.e. scalar nonlinear equations. The reason is that the mathe-
matical tools and the proofs are more intuitive and can often be demonstrated using
pictures. This is not the case of systems of equations, where the theory becomes
much more technical. Bear in mind that the literature concerning numerics for non-
linear equations os huge, with hundreds of methods being developed and analyzed.
Here we barely scratch the surface.

1.2 Mathematical background


We will consider a function f : R → R (or perhaps f might be defined only on an
interval) and we seek x∗ ∈ R which is the root of f , i.e. f (x∗ )=0. This is the scalar
case. For systems of equations, we will consider F : RN → RN and its root x∗ ∈ RN ,
i.e. F (x∗ ) = 0.
We will be dealing with iterative methods for root finding. These construct a
sequence xn , n ∈ N0 , which (ideally) converges to x∗ . We can usually write the
considered numerical method as xn+1 = G(xn ), where G : RN → RN is somehow
derived from F . Assuming that G is continuous and that the method works (xn →
x∗ ) we have

xn+1 = G(xn )
↓ ↓
x∗ = G(x∗ ),

therefore x∗ , which is a root of F , is a fixed point of G. Therefore, in our context, it


is more natural to look at fixed points of mappings rather than roots. This was also
the general trend in mathematics and highlights one of the differences between 19th
CHAPTER 1. NONLINEAR EQUATIONS 3

century and 20th century mathematics – while roots and fixed points are fundamen-
tally connected, the latter is a more natural notion from the topological viewpoint,
therefore topological tools can be used in proofs, etc. Fixed point theorems are one
of the basic tools in mathematical analysis.
For completeness, we shall state several basic theorems from mathematical anal-
ysis concerning the existence of fixed points or roots.
Theorem 1 (Banach fixed point theorem). Let X be a complete metric space and
let G : X → X be a contraction, i.e. there exists α ∈ [0, 1) s.t. kG(x) − G(y)k ≤
αkx − yk for all x, y ∈ X. Then there exists a unique fixed point x∗ ∈ X of G.
Moreover, let the sequence xn ∈ X, n ∈ N0 be defined by xn+1 = G(xn ) for arbitrary
x0 ∈ X. Then xn → x∗ and we have the estimate
αn
kxn − x∗ k ≤ kx1 − x0 k. (1.1)
1−α
Proof. We proceed in several steps:
1. The entire sequence {xn }∞
n=0 is well defined, since x0 ∈ X and G maps X to X.
2. For arbitrary m ∈ N, we can estimate by induction
kxm − xm−1 k = kG(xm−1 ) − G(xm−2 )k ≤ αkxm−1 − xm−2 k ≤ . . . ≤ αm−1 kx1 − x0 k.
3. {xn }∞
n=0 is a Cauchy sequence: Let m ≥ n, then by the previous estimate

kxm − xn k ≤ kxm − xm−1 k + kxm−1 − xm−2 k + . . . + kxn+1 − xn k


≤ αm−1 kx1 − x0 k + αm−2 kx1 − x0 k + αn kx1 − x0 k
m−n−1
X ∞
X
n
= α kx1 − x0 k k n
α ≤ α kx1 − x0 k αk (1.2)
k=0 k=0
αn
= kx1 − x0 k −→ 0, as n → ∞,
1−α
since α ∈ [0, 1).
4. X is complete, therefore {xn }∞
n=0 has a limit x∗ . Since G is continuous, we have

0 ≤ kG(x∗ ) − x∗ k = lim kG(xn ) − xn k ≤ lim αn−1 kx1 − x0 k = 0,


n→∞ n→∞

hence G(x∗ ) = x∗ and x∗ is a fixed point of G.


5. To prove estimate (1.1), we simply take the limit as m → ∞ in (1.2), noting that
the right-hand side is independent of m and that xm → x∗ in the left-hand side.

Theorem 1 is a perfect theorem from the viewpoint of numerical analysis. It gives


existence and uniqueness of the exact solution, constructs an iterative numerical
method, proves convergence of the method to the exact solution and provides an
apriori error estimate.
We note that the error estimate indicates that the closer α is to zero, the faster
convergence of xn we can expect. This will be an important observation in the
context of Newton’s method. In general, it is not easy to straightforwardly prove
that a mapping is a contraction. For differentiable functions in 1D, we can use the
following simple observation.
CHAPTER 1. NONLINEAR EQUATIONS 4

Lemma 2. Let I ⊂ R be a closed interval. Let g ∈ C 1 (I) and let there exist
α ∈ [0, 1) such that |g 0 (x)| ≤ α for all x ∈ I. Then g is a contraction on I.

Proof. Let x, y ∈ I be arbitrary. By the mean value theorem, there exists ξ ∈ I


such that
|g(x) − g(y)| = |g 0 (ξ)||x − y| ≤ α|x − y|.

Provided g is continuously differentiable, it is sufficient to have |g 0 (x∗ )| ≤ 1 to be


a contraction on some neighborhood of x∗ . This ensures that the iterative process
from Banach’s fixed point theorem converges for xn sufficiently close to x∗ . This is
called local convergence and is usually the best we can generally hope to prove
for our considered numerical methods.

Corollary 3. Let g be continuously differentiable on some neighborhood U of x∗ .


Let |g 0 (x∗ )| ≤ 1. Then there exists a closed neighborhood V of x∗ such that g is a
contraction on V .

The situation in RN is more complicated, however Corollary 3 can be generalized


– This is Ostrowski’s theorem 31.
Contractivity is a very useful, yet rare property. The following result is one of
the more general fixed point results.

Theorem 4 (Brouwer’s fixed point theorem). Let B ⊂ RN be a closed ball. Let


F : B → B be continuous. Then F has a fixed point.

Classical proofs of Brouwer’s theorem are extremely nonconstructive. This is


ironic, because Luitzen Egbertus Jan Brouwer was a strong figure in the construc-
tivist school of mathematics which believed that only constructive mathematics is
valid and rejected things like proof by contradiction. There exist modern proofs that
are constructive (and relatively simple), but not as explicitly and simply constructive
as Banach’s fixed point theorem.
We note that the ball B in Theorem 4 can be replaced by any convex compact
set (or a set homeomorphic to such a set). The assumption of convexity and com-
pactness is also sufficient in general Banach spaces (Schauder fixed point theorem).
Unlike Banach’s theorem, the fixed point is in general not unique for Brouwer.
In terms of roots, not fixed points, one can prove for example the following
general results.

Theorem 5. Let BR = {x ∈ RN ; kxk ≤ R} and let F : BR → RN be continuous.


Assume F (x) · x > 0 for all x : kxk = R. Then there exists a root x∗ of F in BR .

Theorem 6. Let F : RN → RN be continuous and let


F (x) · x
lim = ∞.
kxk→∞ kxk

Then for all y ∈ RN there exists a solution x of F (x) = y.


CHAPTER 1. NONLINEAR EQUATIONS 5

Theorem 7 (Zarantonello). Let F be a continuous mapping from the Hilbert space


H to itself such that F maps bounded sets to bounded sets. Let there exist c > 0 such
that (F (x) − F (y), x − y) ≥ ckx − yk2 for all x, y ∈ H. Then for all y ∈ H there
exists a unique solution x of F (x) = y. Moreover x depends continuously on y.
We note that the existing literature is full of such results, which have various
applications, typically in nonlinear partial differential equations. However, since
nonlinearities are difficult, there are no universal results covering all possible situa-
tions.

Exercises

Exercise 1 (Going from roots to fixed points).


Let f (x) = x2 − x − 2. We wish to solve the equation f (x∗ ) = 0 by reformulating
it as a fixed point problem g(x∗ ) = x∗ which we will solve using the simple iteration
procedure from Theorem 1. There are many possible choice of g derived from the
equation f (x) = 0 to obtain g(x) = x by various manipulations:
1. g(x) = x2 − 2

2. g(x) = x + 2
3. g(x) = 1 + 2/x
x2 + 2
4. g(x) =
2x − 1
Check that these choices of g really correspond to the equation f (x) = 0. Try each
possibility whether or not the iterative process xn+1 = g(xn ) converges and justify
the results by checking whether or not g is a contraction in each case.

1.3 Rate of convergence


We will be concerned with iterative numerical methods for the solution of nonlinear
equations. Such methods generate a sequence of approximations {xn }∞ n=0 ⊂ R
N

which (ideally) converges to the true solution x∗ . In order to compare these methods
among themselves, it is useful to measure how fast such sequences converge.
Definition 8 (Rate of convergence). Let {xn }∞n=0 ⊂ R
N
be a given sequence which
converges to x∗ . We say that the sequence has convergence rate (or order) p ≥ 1
if there exists C > 0 such that
kxn+1 − x∗ k
lim = C. (1.3)
n→∞ kxn − x∗ kp

We say that a numerical method has order p, if all sequences generated by the method
have convergence rate at least p for the considered problem and set of initial condi-
tions, and at least one such sequence has convergence rate exactly p.
CHAPTER 1. NONLINEAR EQUATIONS 6

If we denote the error of the method at the current iteration as en = x∗ − xn , the


relation (1.3) states that (in the limit) we expect the next error to be approximately
ken+1 k ≈ Cken kp . Obviously, under ideal circumstances it is desired to have methods
with higher p.

Definition 9. We distinguish several basic cases of convergence rate:

• p = 1, C ∈ (0, 1) – Linear convergence,

• p = 1, C = 0, or p > 1 – Superlinear convergence,

• p = 2 – Quadratic convergence.

Obviously, the list could go on – superquadratic, cubic, etc. We will have no


need for such terminology. We note that Definition 8 is a simplified version of the
so-called Q-factor (quotient factor), which defines the rate of convergence as
 
kxn+1 − x∗ k
inf sup lim sup =∞ , (1.4)
p≥1 {xn }n∈N0 n∈N0 kxn − x∗ kp
xn →x∗

where the supremum is taken over all sequences generated by the method. This
rather technical definition avoids the subtle issue with (1.3) concerning special cases,
where there does not exist a p such that the limit in (1.3) is a nonzero finite constant
C. This is similar to the case of superlinear convergence, as defined above, where
C = 0 for p = 1. We do not wish to go too deeply into this subject, as for example
[OR70], where a whole chapter is devoted to the study of (1.4) and its relation to a
similar concept – the R-factor (root factor).
We conclude with several remarks:

• Roughly speaking, a linearly convergent method adds a constant number of


correct digits to the result. Let |xn − x∗ | ≈ 10−l . Then in the limit, we expect
|xn+1 − x∗ | ≈ C.10−l . For example, if C = 0.1, we expect |xn+1 − x∗ | ≈
C.10−(l+1) . More generally, in each iteration we expect to obtain − logK (C)
digits of the correct result in number base K.

• On the other hand, a quadratically convergent method will approximately


double the number of correct digits in each iteration. Since if |xn −x∗ | ≈ 10−l ,
we expect |xn+1 − x∗ | ≈ C.10−2l . Here the role of the constant C becomes
marginal with respect to the squaring of the previous error. Once a quadrati-
cally convergent method (e.g. Newton’s method) really starts converging, we
reach very high precision in only a few iterations. The problem with these
methods is the so-called pre-asymptotic phase, where this doubling is not yet
present and it might take a while (or perhaps never), before the method actu-
ally starts converging quadratically.
In general a method with convergence rate p > 1 multiplies the number of
correct digits by the factor p in each iteration (at least in the limit).
CHAPTER 1. NONLINEAR EQUATIONS 7

• For more complicated methods, it is sometimes hard to prove the existence of


the limit in the form of equality (1.3), one usually proceeds by estimating the
errors rather than proving a series of equalities. That is why many authors
define convergence rate p with an inequality in (1.3), i.e. lim . . . ≤ C. Strictly
speaking one is then proving that the method has convergence rate at least p
and should also provide a lower bound, at least for some simple example, to
show that the convergence rate is exactly p.

• Defintion 8 only takes into account how much the error decreases from itera-
tion to iteration. It does not take into account how expensive or cheap each
iteration is. We will address this question when comparing Newton’s method
with the secant method on page 38. Newton’s method is quadratically con-
vergent, while the secant method has convergence rate approximately 1.618.
From the viewpoint of Definition 8, Newton is the clear winner. However each
iteration is twice as expensive as in the secant method (in terms of function
evaluations). Taking this into account, it is not that clear which of the methods
is the winner.
Chapter 2

Scalar equations

In this chapter, we will be dealing with scalar nonlinear equations. We will consider
the equation f (x) = 0 with the root x∗ , where f will be at least continuous on some
interval containing x∗ . We will distinguish between three basic types of methods:

• Open methods. These methods construct a sequence {xn }∞ n=0 of approxima-


tions which should converge to x∗ . Examples include the Newton and secant
methods. In general they converge only locally (x0 must be close enough to x∗
for convergence), however they can achieve higher convergence rates and are
thus faster (provided they converge).

• Bracketing methods. Instead of a sequence of approximations xn , we will


construct a sequence of intervals {In }∞
n=0 , where x∗ ∈ In for all n. Examples of
these methods include the bisection and false position (regula falsi) methods.
They are typically very simple and slow, but converge (almost) always. The
mathematical justification for bracketing is that the exact solution x∗ might
not be exactly representable in finite precision arithmetic. Thus it makes sense
to look for a small interval containing x∗ instead of a single approximation.

• Hybrid methods. These methods use combinations of open and bracketing


methods to obtain more robust (perhaps even globally convergent) methods,
which may exhibit high convergence rates under ideal circumstances. Sophis-
ticated criteria are used to switch between several methods in each iteration
to choose the best and/or safest next iteration. Dekker’s method and Brent’s
method are examples of hybrid approaches.

We note that there is an entire field of research concerning globalization strate-


gies, the aim of which is to make locally convergent methods globally convergent.
We will demonstrate one such approach in Section 3.5. However, as usual with
nonlinearities, there are no universal recipes – methods that always work.

2.1 Bisection method


Bisection is the simplest bracketing method. It is a straightforward application of
the intermediate value theorem – a continuous function attains all values between

8
CHAPTER 2. SCALAR EQUATIONS 9

its values at the endpoints of an interval. Therefore, if f (a)f (b) ≤ 0 then there
exists a root between a and b. —Given a, b we take the midpoint m = 21 (a + b) and
based on the sign changes of f (a), f (b) and f (m), we choose either [a, m] or [m, b]
as the new smaller interval containing x∗ . Written as an algorithm:
Given I0 = [a0 , b0 ] and a tolerance tol. Set n = 0.
While (bn − an ) > tol:
mn = 12 (an + bn ).
If f (an )f (mn ) ≤ 0
an+1 := an , bn+1 := mn ,
else
an+1 := mn , bn+1 := bn .

Bisection is linearly convergent method in the sense of Definition 8, where instead


of the errors |xn+1 − x∗ | and |xn − x∗ | in the definition of the limit, we take |In+1 |
and |In |.
Theorem 10. The bisection method is linearly convergent. It is globally con-
vergent: for any I0 = [a0 , b0 ] such that f (a0 )f (b0 ) ≤ 0, we have convergence, i.e.
an → x∗ , bn → x∗ . In each iteration we gain one bit of information.
Proof. All of the statements follow immediately from the fact that |In+1 | = |In |/2.

We conclude with several remarks:


• It is more efficient to compare the signs of f (a), f (m) in the implementation.
It is more efficient than multiplying f (a)f (m) only to test the sign of the
product.

• Bisection cannot be directly used to approximate even roots, e.g. to solve


x2 = 0 (x∗ is the root of f and f 0 ). This is because there are no sign changes
near x∗ in this case. A simple workaround is the fact that an even root of f
is an odd root of f 0 . Therefore use bisection to find a root of f 0 and check
whether it is also a root of f .

• Starting from an initial interval of length one, we always need 52 iterations in


double precision arithmetic to obtain In whose endpoints are two neighboring
double numbers. This is irrespective of whether f is a ‘simple’ or ‘complicated’
function. The method ignores any information about f other than simple sign
changes. For example, Newtons method takes into account the derivative of
f . Thus Newton’s method gives the exact solution of a linear equation in one
iteration, while bisection doesn’t ‘care’ whether or not f is linear.

• Surprisingly, there exist generalizations of the bisection method to systems of


equations. All one needs is some criterion whether an interval (hypercube or
perhaps a simplex) in RN contains a root. These can be based on some more or
less practically implementable version of the topological index of a mapping.
The resulting formulas and mathematical background are very complex. Such
methods are tempting, since they would in principle always converge, similarly
CHAPTER 2. SCALAR EQUATIONS 10

as in the scalar case. The question is, whether they are useful. Consider a
system of 100 equations for 100 unknowns. You start with an initial hypercube
I0 ∈ R100 , perform bisection and test which one of the smaller cubes contains
a root. In R splitting an interval in half gives two sub-intervals. However, in
R100 , splitting the edges of a cube in half results in 2100 sub-cubes. This is
approximately 1030 sub-cubes in each iteration, each of which must be tested
if it contains a root – just to gain ‘one bit of information’. It is clear that
such methods can in principle be practical only for very small systems. They
are completely out of the question, for example, for finite element solvers
for nonlinear partial differential equations where one must solve millions of
equations for millions of unknowns.

2.2 Fixed point iteration


Here we will take a closer look at the basic fixed point iteration procedure from
Banach’s fixed point theorem. As a reminder, instead of f (x) = 0 with the root x∗ ,
we consider g(x) = x, for which x∗ is a fixed point. For a given x0 , we consider the
iterative procedure
xn+1 = g(xn ). (2.1)
Theorem 11 (Linear convergence). Let g ∈ C 1 (U ) for some neighborhood U of x∗ .
Let 0 < |g 0 (x∗ )| < 1. Then there exists a neighborhood V of x∗ such that for all
x0 ∈ V the sequence defined by (2.1) converges to x∗ linearly.
Proof. Since g ∈ C 1 (U ) and |g 0 (x∗ )| < 1, there exists a (closed) neighborhood V of
x∗ such that |g 0 (x)| < 1 for all x ∈ V . By Banach’s fixed point theorem xn → x∗ .
Concerning the linear convergence rate, the mean value theorem gives us

xn+1 − x∗ = g(xn ) − g(x∗ ) = g 0 (ξn )(xn − x∗ )

for some ξn between xn and x∗ . Since xn → x∗ , also ξn → x∗ and we have


|xn+1 − x∗ |
lim = lim |g 0 (ξn )| = |g 0 (x∗ )| > 0,
n→∞ |xn − x∗ | n→∞

which is linear convergence, by definition (1.3).


Theorem 12 (Higher order convergence). Let p ∈ N and let g ∈ C p (U ) for some
neighborhood U of x∗ . Then the following are equivalent:
(i) xn → x∗ with rate p for all x0 in some neighborhood V of x∗ ,
(ii) g (j) (x∗ ) = 0 for j = 1, . . . , p − 1 and g (p) (x∗ ) 6= 0.
Proof. (ii) ⇒ (i). Banach’s fixed point theorem gives us convergence. By Taylor’s
expansion, there exists ξn between xn and x∗ such that
p−1 (j)
X g (x∗ ) g (p) (ξn )
xn+1 − x∗ = g(xn ) − g(x∗ ) = (xn − x∗ )j + (xn − x∗ )p ,
j=1
j! p!
| {z }
=0
CHAPTER 2. SCALAR EQUATIONS 11

where the terms in the sum are zero due to the assumption g (j) (x∗ ) = 0 for j =
1, . . . , p − 1. It follows that

|xn+1 − x∗ | |g (p) (ξn )| |g (p) (x∗ )|


lim = lim = > 0,
n→∞ |xn − x∗ |p n→∞ p! p!

which is convergence rate p, by definition (1.3).


(i) ⇒ (ii). We prove the assertion by contradiction. By definition of convergence
rate p we have the existence and finiteness of the limit

|xn+1 − x∗ |
lim = C > 0. (2.2)
n→∞ |xn − x∗ |p

Let there exist some J ∈ {1, . . . , p − 1} such that g (J) (x∗ ) 6= 0 and let J be the
smallest index with this property. Then by the Taylor expansion, we have
J−1 (j)
X g (x∗ ) g (J) (ξn )
xn+1 − x∗ = g(xn ) − g(x∗ ) = (xn − x∗ )j + (xn − x∗ )J ,
j=1
j! J!
| {z }
=0

therefore
|xn+1 − x∗ | |g (J) (ξn )| |xn − x∗ |J
lim = lim . = ∞,
n→∞ |xn − x∗ |p n→∞ J! |xn − x∗ |p
since p > J, |xn − x∗ | → 0 and |g (J) (ξn )| → |g (J) (x∗ )| > 0. This is a contradiction
with the finiteness of the limit due to (2.2).
We conclude with several remarks:

• Here we were concerned with one-point iterative processes, which means that
xn+1 depends only on the previous iterate via (2.1). We have seen that given
sufficient regularity of g, such iterative processes (methods) can have only
integer convergence rates. For multi-point methods, where xn+1 depends on
several previous iterates (e.g. xn+1 = g(xn , xn−1 )), we can have general real
convergence rates. We will see this in the secant method.

• We can view Newton’s method as a “recipe” how to obtain from a given f a


new function g such that the fixed point iteration has quadratic convergence
rate, i.e. g satisfies g 0 (x∗ ) = 0, see Exercise 5.

• There exist general “recipes” how to obtain suitable g with arbitrarily high
convergence rate p. These are the so-called Householder methods. These
methods are however considered as impractical, since they need to evaluate
derivatives of f up to order p − 1 and the complexity of the resulting formulas
rapidly grows with growing p.
CHAPTER 2. SCALAR EQUATIONS 12

x2 b

g
x1 = g(x0 ) b

b b b b

x0 x1 x∗ x2

Figure 2.1: Cobweb plot.

2.2.1 Cobweb plots


There is a simple and elegant graphical method, how to gain insight and intuition
on the iterative process (2.1) called cobweb plots or Verhulst diagrams (after
the discoverer, Pierre François Verhulst, a Belgian mathematician from the first half
of the 19th century).
The procedure is simple (cf. Figure 2.1). Draw the graph of g and the graph of
the identity mapping y = x (i.e. the diagonal). Choose some x0 . Then x1 = g(x0 )
can be visualized as going vertically from x0 to the graph of g. Next, we want to
evaluate x2 = g(x1 ), however x1 ‘lives’ on the y-axis and in order to evaluate g(x1 ),
we need to transfer it to the x-axis. Without measuring, this can be done simply by
going horizontally from the value x1 on the y-axis until we intersect the diagonal.
After this we go down to the x-axis to obtain the value x1 on this ‘correct’ axis.
Then we simply iterate the previous procedure:
1) from the current point go vertically to the graph of g,
2) go horizontally to the diagonal.
The iterates xn can then be read off the x-axis where we go vertically.

Exercises

Exercise 2 (Banach is not sharp).


The assumptions of Banach’s fixed point theorem do not need to be satisfied in
order to have convergence to a unique fixed point. Take g(x) = sin(x). Prove that
for all x0 ∈ (−π/2, π/2), we have xn → x∗ = 0, even though g 0 (x∗ ) = 1.
CHAPTER 2. SCALAR EQUATIONS 13

Figure 2.2: Cobweb plot of the ‘tent map’.

Exercise 3 (Punching the ‘cos’ button on a calculator).


Type any real number into a calculator. Push the cosine button repeatedly many
times. Prove that you will always converge to the same number (0.739... in radians).

Exercise 4 (Iteration is complicated beyond Banach).


Consider the so-called tent map, which is the piecewise linear function
(
2x, x ∈ [0, 12 ],
g(x) =
2 − 2x, x ∈ ( 12 , 1].

This function maps the interval [0, 1] onto itself and iterating this function leads to
surprisingly complex and chaotic behavior, as can be seen from the cobweb plot in
Figure 2.2. As usual define the iterative procedure xn+1 = g(xn ) for a chosen x0 . We
call x0 a periodic point of period P , if xP = x0 , which implies that xn+P = xn
for all n. Show that for the tent map defined above, the set of periodic points of all
periods is dense in the interval [0, 1]. This means that even the smallest change in
x0 will dramatically change the behavior of the resulting sequence {xn }.
Hint: To find a fixed point of g (which is a periodic point with period P = 1),
draw the graph of g and a graph of the identity function f (x) = x and look where
they intersect. Similarly, to find points of period P = 2, draw the graph of g ◦ g and
look at the intersections with f (x) = x. And so on. The result can be seen from
realizing what the graphs of g ◦ g ◦ . . . ◦ g looks like.

2.3 Newton’s method


Newton’s method is one of the most used basic methods for the numerical solution of
nonlinear equations. The method can be found under various names in the literature:
CHAPTER 2. SCALAR EQUATIONS 14

x∗
b b b b

x2 x1 x0

Figure 2.3: Newton’s method.

usually Newton succeeded by one or more names from the list Raphson, Simpson,
Fourier and Kantorovich (in Banach spaces). This is due to historical reasons which
we shall mention later on.
The basic derivation is as follows: Given x0 , we do a first order Taylor expansion:

0 = f (x∗ ) = f (x0 ) + f 0 (x0 )(x∗ − x0 ) + R.

Next, we neglect the remainder R and obtain the approximate identity

0 ≈ f (x0 ) + f 0 (x0 )(x∗ − x0 ),

which we solve for the desired x∗ :


f (x0 )
x∗ ≈ x0 − . (2.3)
f 0 (x0 )

We intuitively expect that if the neglected remainder R was small enough then the
right-hand side of (2.3) could be a better approximation of x∗ than x0 . Thus we
denote it as x1 and apply the procedure iteratively.

Definition 13 (Newton’s method). Let x0 be given. Newton’s method consists


of the iterative procedure
f (xn )
xn+1 = xn − 0 .
f (xn )
A graphical interpretation of Newton’s method can be seen in Figure 2.3. We
draw a tangent to f at the point x0 and see where that tangent intersects the x-axis.
This is the new iterate x1 and we repeat the procedure iteratively. We note that
for ‘nice’ functions, the convergence is quite fast – we did not draw x3 in Figure
2.3, as it would already be visually almost indistinguishable from x∗ . This graphical
interpretation of Newton’s method is the reason it is sometimes called the method
of tangents.
CHAPTER 2. SCALAR EQUATIONS 15

Another interpretation of Newton’s method is that it is the fixed point iteration


procedure with g(x) = x − f (x)/f 0 (x). We can thus apply the results of Section 2.2
to analyze the method (see Exercise 5).
We note that if f is a linear function, then for arbitrary x0 , Newton’s method
gives the root exactly in the first iteration: x1 = x∗ . Compare this to the bisection
method, which takes the same number of iterations irrespective of the simplicity of
f . This is due to the fact that Newton’s method takes into account more information
about f than the simple sign change in two points.
We will now prove the basic theorem on local convergence of Newton’s method.
We shall do so in two versions – the basic version under stronger assumptions and
an improved version under weaker assumptions. The reason we do this is that
the first version has a simpler, more straightforward proof. However, this proof
does not generalize well to systems of equations, for reasons we shall explain in
Section 3.1. On the other hand, the second, slightly more technical proof generalizes
straightforwardly to RN and even general Banach spaces.
Theorem 14 (Local quadratic convergence of Newton I). Let f ∈ C 2 (U ) for some
neighborhood U of x∗ . Let f 0 (x∗ ) 6= 0 and let x0 be sufficiently close to x∗ . Then
Newton’s method satisfies xn → x∗ quadratically.
Proof. We proceed as in the derivation of Newton’s method, but this time we do
not neglect the remainder of the Taylor expansion but write it down explicitly in
Lagrange’s form:
1
0 = f (x∗ ) = f (xn ) + f 0 (xn )(x∗ − xn ) + f 00 (ξn )(x∗ − xn )2 ,
2
where ξn lies between xn and x0 . We divide by f 0 (xn ) and rearrange to get

f (xn ) 1 f 00 (ξn )
0 = x∗ + − x n + (x∗ − xn )2 ,
f 0 (xn ) 2 f 0 (xn )
| {z }
=−xn+1

hence
1 f 00 (ξn )
xn+1 − x∗ = (x∗ − xn )2 .
2 f 0 (xn )
If we denote the error of the method as en = x∗ − xn , we obtain the error equation
for Newton’s method:
f 00 (ξn ) 2
en+1 = − 0 e . (2.4)
2f (xn ) n
Convergence: Since f ∈ C 2 (U ), f 0 (x∗ ) 6= 0, we have that the factor from (2.4) is
uniformly bounded on some neighborhood V of x∗ : there exists M > 0

f 00 (x)
≤ M for all x, y ∈ V.
2f 0 (y)

Now if we choose x0 ∈ V close enough to x∗ so that M |e0 | ≤ 1/2, we get from (2.4)

|e1 | ≤ M e20 = M |e0 | |e0 | ≤ 21 |e0 |.



CHAPTER 2. SCALAR EQUATIONS 16

Therefore, x1 ∈ V , since it is closer to x∗ than x0 . Moreover, M |e1 | ≤ M |e0 | ≤ 1/2.


We can therefore proceed by induction to get
|en+1 | ≤ M e2n = M |en | |en | ≤ 21 |en | ≤ . . . ≤ 2n+1
1

|e0 |. (2.5)
Therefore, en → 0, i.e. xn → x∗ . We note that the crude estimate (2.5) would
correspond to only linear convergence.
Quadratic convergence: From (2.5) and the fact that xn → x∗ , we immediately
have
|en+1 | f 00 (ξn ) f 00 (x∗ )
lim = lim = = C.
n→∞ |en |2 n→∞ 2f 0 (xn ) 2f 0 (x∗ )
We note that C > 0 if f 00 (x∗ ) 6= 0. On the other hand, we see that in the special case
when f 00 (x∗ ) = 0, we can expect even faster than quadratic convergence of Newton’s
method.

We note that the basis of the proof is the derivation of an identity relating the
current error to the previous error(s) – the error equation (or inequality). From this
we deduce convergence and the order of convergence. This will be the case in the
proofs of similar theorems for the other methods we shall consider.
Now we give an improved version of Theorem 14 under weaker assumptions. The
main difference is that instead of f ∈ C 2 , we assume f 0 to be Lipschitz continuous.
This is a seemingly trivial differences which however becomes important in RN and
in Banach spaces, where f 00 is a more complicated object than f 0 .
Remark 15. If a function is Lipschitz continuous on a set U , then its derivative
exists almost everywhere in U and the derivative is in the Lebesgue space L∞ (U ).
Therefore, assuming f 0 to be Lipschitz continuous means that f 00 exists almost ev-
erywhere and is in L∞ . Compare this to Theorem 14, where f 00 exists everywhere
and is a continuous function. In other words, the difference is assuming f ∈ C 2
versus f ∈ W 2,∞ .
First, we need a substitute lemma for the remainder of Taylor’s polynomial under
the weaker assumptions. Here by “ γ-Lipschitz” we mean “Lipschitz continuous with
Lipschitz constant γ”.
Lemma 16 (Substitute for Taylor). Let f : (a, b) → R be such that f 0 is γ-Lipschitz.
Then for all x, y ∈ (a, b)
f (y) − f (x) − f 0 (x)(y − x) ≤ 21 γ(y − x)2 . (2.6)
Proof. Simple substitution gives us
Z y Z 1
z=x+t(y−x)
0
f 0 x + t(y − x) (y − x) dt,

f (y) − f (x) = f (z) dz = dz=(y−x) dt =
x (0,1)→(x,y) 0
0
therefore, due to the γ-Lipschitz continuity of f , we get
Z 1h i
0
f 0 x + t(y − x) − f 0 (x) (y − x)dt

f (y) − f (x) − f (x)(y − x) =
0
Z 1
≤ |y − x| γt|y − x| dt = 12 γ|y − x|2 .
0
CHAPTER 2. SCALAR EQUATIONS 17

We note that the right-hand side of (2.6) is essentially an estimate of the re-
mainder of a first order Taylor polynomial. If we take e.g. the Lagrange form of the
remainder 12 f 00 (ξ)(y − x)2 , we can estimate it by 12 max |f 00 |(y − x)2 . In Lemma 16,
we have the Lipschitz constant γ of f 0 instead of max |f 00 |, which is closely related
but weaker, as explained in Remark 15.

Theorem 17 (Local quadratic convergence of Newton II). Let f 0 be γ-Lipschitz on


some neighborhood U of x∗ . Let f 0 (x∗ ) 6= 0 and let x0 be sufficiently close to x∗ .
Then Newton’s method satisfies xn → x∗ quadratically.

Proof. By definition of xn+1 , we have

f (xn ) 1 0

xn+1 − x∗ = xn − − x ∗ = f (x ∗ ) −f (x n ) − f (x n )(x ∗ − x n ) . (2.7)
f 0 (xn ) f 0 (xn ) | {z }
=0

Since f 0 (x∗ ) 6= 0 and f 0 is continuous, there exists ρ > 0 such that |f 0 (x)| ≥ ρ > 0
for all x ∈ U . Using this estimate and Lemma 16 with x = xn , y = x∗ , we can
estimate (2.7) on U :
1 γ
|xn+1 − x∗ | ≤ · (xn − x∗ )2 .
ρ 2
This is the analogue of error equation (2.4) from which we derived quadratic conver-
gence of the method. The rest of the proof thus continues similarly as in Theorem
14 and we shall omit it.
There is another type of theorem concerning the convergence of Newton’s method
other than the local fixed point view from the previous theorems. The theorem gives
more global convergence under more restrictive assumptions of convexity (concavity)
of f . The statement and proof of the theorem can be easily seen in Figure 2.3, its
formalization is only a technical issue.

Theorem 18 (Fourier). Let f ∈ C 2 [a, b] with f (a)f (b) < 0 such that f 00 does not
change sign in [a, b]. Let x0 be such that f (x0 ) has the same sign as f 00 . Then for
Newton’s method xn → x∗ monotonically.

Proof. The formal proof simply follows the intuition from Figure 2.3, however it
is somewhat lengthy and has little ‘added value’, thus we rather move on to more
important topics. We refer the interested reader to [Seg00].

Historical note
We end this section with a short overview of the history of Newton’s method. In
1669, Newton considered the equation

x3 − 2x − 5 = 0 (2.8)
CHAPTER 2. SCALAR EQUATIONS 18

which has a root in the interval (2, 3) due to opposite signs of the polynomial at
these points. Newton wrote the root as x∗ = 2 + p which he substituted into (2.8)
to obtain the new equation

p3 + 6p2 + 10p − 1 = 0. (2.9)

Newton assumed that p is small, hence the higher order terms p3 + 6p2 ≈ 0 are very
small and he neglected them. Thus equation (2.9) reduces to 10p − 1 = 0 with the
solution p = 0.1. Next, Newton wrote p = 0.1 + q, substituted into (2.9) to obtain
the equation
q 3 + 6.3q 2 + 11.23q + 0.061 = 0. (2.10)
Again, Newton neglected the quadratic and cubic terms to obtain 11.23q +0.061 = 0
with the solution q = −0.0054. This procedure is then applied iteratively, writing
q = −0.0054 + r and substituting into (2.10), and so on. We note that already the
second approximation is 2 + p + q = 2, 0946, which is a good approximation of the
true root 2.094551482... Newton then proceeds to use this procedure on Kepler’s
equation x − e sin(x) = M from astronomy which is the relation between mean
anomaly M and eccentric anomaly x. Newton used the technique he developed for
polynomials by taking the Taylor expansion of sin x. He noticed what we now call
quadratic convergence of the iterates.
In 1690, Joseph Raphson improved the procedure by substituting the corrections
back into the original equation instead of producing a new equation in each iterate.
The advantage is that one saves a lot of work when performing the calculations
by hand, since one can reuse some of the already computed quantities. Raphson
thought he invented a new method, although in fact it is equivalent to Newton’s
original approach.
Notice that up to now, there is no mention of derivatives in the procedure. These
were introduced into the linearization process by Thomas Simpson in 1740. Finally,
it was Joseph Fourier, who wrote down Newton’s method as we know it today. One
can see that the development of the method was not straightforward and that it is,
ironically, hard to recognize Newton’s method in the tedious procedure that Newton
himself used. It is for these reasons that the method is sometimes called Newton,
Newton-Raphson, Newton-Raphson-Simpson, Newton-Raphson-Simpson-Fourier or
some other combination of the mentioned names, possibly including Kantorovich in
Banach spaces.

Exercises

Exercise 5 (Newton as a fixed point iteration procedure).


Use Theorem 12 to prove that Newton’s method is locally quadratically conver-
gent by verifying the assumption for g – the function which is iterated in Newton’s
method.
CHAPTER 2. SCALAR EQUATIONS 19

Exercise 6 (Computing the square root).



Given A > 0, compute A using Newton’s method. Observe the extremely fast
convergence for this simple problem. Prove convergence for any x0 > 0.
Remark: There are two basic possibilities:

• If you start from the equation x2 = A and apply Newton’s method, you end
up with the so-called Babylonian method for computing square roots which
was derived geometrically in ancient Babylonia some 3500 years ago and redis-
covered 2000 years ago in ancient Greece (Hero’s method). As a challenge, try
deriving the resulting iterative procedure using only simple geometry, without
any symbolic calculations or calculus, as was done in Babylonia.

• Another possibility is to start from the equation 1/x2 = A to compute 1/ A
and then multiply the final approximation by A. The final method has the
advantage that it uses only multiplication without any division. Thus it is can
be efficiently implemented (in more sophisticated form, e.g. Goldschmidt’s
algorithm) directly in the hardware using logic gates.

Exercise 7 (Division without division).


Given A 6= 0, compute 1/A using only the operations +, −, ∗.
Hint: Try Newton’s method for a suitable equation with the solution x∗ = 1/A.
Choose f so that the resulting formula from Newton’s method contains only addition
(subtraction) and multiplication. Analyze for which x0 the method converges to 1/A.
Remark: Improved versions of this method (e.g. Goldschmidt’s algorithm) are
actually hardware-implemented in many microprocessors to calculate division. The
cost of such a calculation is of the same order as the cost of multiplication. This
also plays an important role in special applications such as extremely high precision
calculations (millions of digits) or cryptography (calculations with very large integers
with thousands or millions of digits).
Remark: In the matrix setting, the method can be adapted to produce a quickly
converging algorithm for the approximation of the inverse A−1 of a matrix A which
uses only matrix addition and multiplication. This is the so-called Schulz iterative
method for the matrix inverse from 1933: Xn+1 = Xn (2I − AXn ). A reasonable
initial value is X0 = AT /(kAk1 kAk∞ ).

2.3.1 What could possibly go wrong?


In this section we shall take a look at what can happen in Newton’s method (and
other locally convergent methods) beyond the neighborhood where Theorem 17 guar-
antees quadratic convergence. We shall see that the method can have (and typically
has) very wild behavior. This is the price we pay for the very fast local convergence.
Here is a basic list of the things that can go wrong if x0 is not close enough to x∗ :

1. Convergence to a different root than the one we desire.


CHAPTER 2. SCALAR EQUATIONS 20

2. Slower than quadratic convergence.

3. Divergence to ±∞.

4. f 0 (xn ) is undefined or is equal to zero.

5. Periodic cycling of the iterates.

6. Chaotic behavior.

We demonstrate these phenomena in 1D, where one can gain at least some in-
tuition from pictures and where the analysis is not too technical. The situation
becomes much more complicated in RN , where the problems we mention here some-
times render the method unusable, since the neighborhood of (quadratic) conver-
gence is impractically small. We shall discuss the individual points from the list
above in more detail.

1. Convergence to a different root than the one we desire


Sometimes this is not a problem, sometimes it is a huge problem. An example of the
latter case can be taken from finite element solvers for partial differential equations
of physics. For example, for nonlinear first order hyperbolic conservation laws (such
as the Euler equations describing compressible fluid flows) the equations themselves
have many solutions, only one of which – the entropy solution – is physically admis-
sible. It may happen that your Newton’s method for your chosen finite element or
volume method converges to an unphysical solution and you are unable to come up
with an initial x0 for which Newton’s method would converge to the correct entropy
solution. There is no general recipe how to fix this issue and sometimes elaborate
techniques are required to find the correct solution..

2. Slower than quadratic convergence.


One can expect this to happen outside the neighborhood guaranteed by Theorem
17, but at least we have some form of convergence, albeit not quadratic. However,
we may have slower than quadratic convergence even for all x0 arbitrarily close to
x∗ . Clearly this happens when some assumption of Theorem 17 is violated. This
may happen for two reasons:
Insufficient regularity. If f 0 is not Lipschitz continuous near x∗ , one may
observe slower than quadratic convergence. See Exercise 8 where Newton’s method
for x + x4/3 = 0 is considered, which results in a fractional convergence rate between
1 and 2.
Zero derivative. One can also observe slower than quadratic convergence when
0
f (x∗ ) = 0. This means that x∗ is a root with multiplicity greater than one. We
illustrate this on the simple equation x2 = 0 with x∗ = 0. Newton’s method for this
equation is
x2 1 1
xn+1 = xn − n = xn =⇒ |en+1 | = |en |,
2xn 2 2
CHAPTER 2. SCALAR EQUATIONS 21

which is linear convergence (in fact exactly the same convergence as for the bisec-
tion method). After some experimenting one notices that if we modified Newton’s
method to be xn+1 = xn − 2f 0 (xn )/f (xn ) we would have

2x2n
xn+1 = xn − = 0,
2xn
hence we get x∗ exactly in the first iteration, at least for this particular equation.
More generally, this modification gives x∗ in one iteration for f (x) = a(x − x∗ )2 ,
which is the general quadratic case of a double root:
a(xn − x∗ )2
xn+1 = xn − 2 = xn − (xn − x∗ ) = x∗ .
2a(xn − x∗ )
Even more generally, for a root of multiplicity r ∈ N and the model equation a(x −
x∗ )r = 0, the modification of Newton’s method xn+1 = xn − rf 0 (xn )/f (xn ) gives x∗
in one iteration:
a(xn − x∗ )r
xn+1 = xn − r = x∗ .
ra(xn − x∗ )r−1
It turns out that this modification works in general, not only for the model problems:
Theorem 19 (Roots with multiplicity). Let x∗ be a root of multiplicity r ∈ N of f ,
i.e. f (j) (x∗ ) = 0 for j = 0, . . . , r − 1 and f (r) (x∗ ) 6= 0. Let f ∈ C r (U ) where U is
some neighborhood of x∗ . Then the modified Newton method
f (xn )
xn+1 = xn − r
f 0 (xn )
converges locally quadratically to x∗ .
Proof. The proof is a technical calculation without much added value, thus we omit
it and refer the interested reader to [RR78, Section 8.6].
In the theorem above, one needs to know the multiplicity of x∗ in advance. This
is not always the case, in fact one might not even notice apriori that x∗ is a root of
higher multiplicity (consider the equation ex −x−1 = 0 with x∗ = 0). One universal
solution is to consider the equation f¯(x) := f (x)/f 0 (x) = 0 instead of f (x) = 0.
Then x∗ is always a root of multiplicity 1 for f¯ irrespective of its multiplicity for f .
One can then apply his/her favorite method to the modified equation.

3. Divergence to ±∞.
Now we get to the cases when Newton’s method fails altogether. Among these,
divergence to infinity is the ‘nicest’ because it is at least noticeable very early and
can be easily detected in the implementation, unlike the other cases.
Consider the model equation f (x) := arctan(x) = 0. Since f 0 (x) → 0 as x →
±∞, one can expect that if we choose x0 sufficiently far away from x∗ = 0, then x1
will be even larger due to the division by f 0 (xn ) in Newton’s method. This is indeed
the case, as can be seen by a simple analysis or by a simple numerical experiment.
In fact, there exists a neighborhood of x∗ , such that outside of this neighborhood
CHAPTER 2. SCALAR EQUATIONS 22

Newton’s method diverges to ±∞. The size of this neighborhood can be computed,
as we shall do in Exercise 9.
In general, divergence to infinity is usually caused by the denominator f 0 (xn )
being very close to zero, in which case xn+1 is very large or even causes an overflow.
There are tricks how to overcome this issue, which fall under the general category
of “globalization strategies”. Here we only briefly mention one possible strategy.
Damped Newton method. The idea is to do smaller steps than Newton’s
method recommends. In the simplest case, we can write this as

f (xn )
xn+1 = xn − λn , (2.11)
f 0 (xn )

where we apply the damping parameters λn ∈ (0, 1] which tells us how much of
the Newton correction to take. Obviously, we should not destroy the quadratic
convergence rate of Newton’s method, once we are sufficiently close to x∗ . Thus we
require λn → 1 when xn → x∗ (or perhaps λn = 1 for sufficiently large n). One
can try to come up with an explicit formula for λn , such as λn = 1/(1 + |f (xn )|)
which tries to use the residual of the equation to measure ‘closeness’ to x∗ . A more
sophisticated idea is to use a backtracking strategy : try (2.11) with λn = 1,
i.e. Newton’s method. If the residual increases, i.e. |f (xn+1 )| ≥ |f (xn )|, we try
λn = 1/2 instead. Again, we test for decrease of the residual and if needed, try
λn = 1/4 instead. We can write this procedure as
f (xn )
Given xn , compute x̃ = xn − f 0 (xn )
.
While |f (x̃)| ≥ |f (xn )|:
x̃ := 12 (x̃ + xn ).
Set xn+1 := x̃.
The danger of simple procedures like the one above is that the iterates can
converge to a local extreme, instead of the root. This happens because if we start
close to a local extreme, then f 0 (xn ) ≈ 0 and the Newton correction is detected to
be too large. Therefore the step is taken to be much smaller, perhaps so small that
we only get closer to the local extreme and the situation is much worse in the next
iterate.
More sophisticated strategies similar to the one above are important in numerical
optimization, where they fall under the category of line search methods, where much
more sophisticated strategies for the choice of the new iterate x̃ and the necessary
condition it must satisfy are considered.
We note that the backtracking procedure above can be viewed as a rudimentary
hybrid method, which consists of a fast converging method combined with a (ideally)
globally convergent method/strategy. More on these methods in Section 2.5.2.

4. Derivative undefined or equal to zero.


This case is quite similar to the previous case. From the practical point of view,
one rarely encounters an exact zero in floating point arithmetic. Usually, due to
rounding errors, we will ‘only’ have f 0 (xn ) ≈ 0, not exactly f 0 (xn ) = 0. In this case
we are in the previous case discussed above.
CHAPTER 2. SCALAR EQUATIONS 23

From the mathematical point of view, it can be shown that the set of ‘bad’ initial
conditions {x0 : ∃n ∈ N s.t. f 0 (xn ) = 0} is at most countable. In other words, this
set is very small from the point of view of R (measure zero) and it is in general
unlikely to choose such a x0 by chance.

5. Periodic cycling of the iterates.


We say that the sequence {xn }∞ n=0 has period P , if xn+P = xn for all n. It seems
reasonable to assume that one can come up with an example of a function f and
an initial estimate x0 , such that the resulting sequence from Newton’s method has,
for example, period 2. One may then say that such an artificial example will be
very rare and it will be virtually impossible to choose the one special point x0 by
accident. However, one can go a bit further and construct a function f , such that
for any x0 6= x∗ , Newton’s method has always period 2, see Figure 2.4. Indeed, we
want an f such that Newton’s method satisfies
xn+1 − x∗ = x∗ − xn , ∀n ∈ N,
which we can rewrite using the definition of Newton’s method as
f (xn )
xn − − x∗ = x∗ − xn . (2.12)
f 0 (xn )
We want this equation to be satisfies for all xn , so we can omit the subscript n and
rewrite (2.12) as
f 0 (x) 1
= .
f (x) 2(x − x∗ )
This is a simple ordinary differential equation for the unknown function f , which
has the solution p
f (x) = C sgn(x − x∗ ) |x − x∗ |, (2.13)
for any C ∈ R, cf. Figure 2.4. Indeed, one may verify that for any x0 6= x∗ , Newton’s
method jumps periodically back and forth between two values. We note that this
is possible, since the function from (2.13) violates the assumptions of Theorem 17,
since f 0 (x∗ ) is infinite, hence f 0 cannot be Lipschitz continuous on any neighborhood
of x∗ .
One can then dismiss the previous example by saying that it is one very special
function which we will never solve by Newton’s method anyway, since we can easily
solve f (x) = 0 analytically in this case. However a more interesting situation is
described in Exercise 10, where the function f (x) = x3 − 2x + 2 is considered. If
we choose x0 = 0, Newton’s method gives the sequence 0, 1, 0, 1, 0, 1, . . . . One can
say that this is very nice but if we choose any other x0 , we will not get a periodic
sequence anymore. This is true, but the interesting phenomenon behind this example
is contained in the following lemma:
Lemma 20. Let f(x) = x³ − 2x + 2. There exists a neighborhood U of 0, such that
for any x_0 ∈ U, the sequence from Newton’s method satisfies

    lim_{n→∞} x_{2n} = 0,    lim_{n→∞} x_{2n+1} = 1.

Figure 2.4: Periodic cycling of Newton’s method, two different cycles.

Moreover, the convergence in the limits above is quadratic.

This means that if we start near 0, the sequence from Newton’s method converges
to 0, 1, 0, 1, 0, 1, . . . . Moreover, for this example, the odd/even terms will converge
to 0 or 1 very quickly (quadratically)! It is then hard to dismiss this example of
periodic cycling as a rare thing which we will never encounter – there is a whole
open set of initial values for which Newton converges to a 2-periodic sequence very
quickly. And choosing an x0 from an open set is something that can easily happen
in practice – this is not one isolated point which we will easily avoid.
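To see this behavior numerically, the following minimal Python sketch iterates Newton’s
method for f(x) = x³ − 2x + 2 (the starting point 0.1 is chosen only as an illustration of
a point inside the attracting neighborhood U):

    def f(x):
        return x**3 - 2*x + 2

    def df(x):
        return 3*x**2 - 2

    x = 0.1                      # start near 0, inside the attracting neighborhood U
    for n in range(12):
        x = x - f(x) / df(x)     # one Newton step
        print(n + 1, x)          # the iterates quickly settle into the 2-cycle, alternating near 1 and near 0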
In general, the situation is much more complicated. The following theorem can
be obtained as a special case of the Sharkovskii or Li & Yorke theorems known from
the theory of chaotic dynamical systems.

Theorem 21 (Sharkovskii, Li & Yorke). Let f : R → R be a polynomial with at least


four distinct real roots. Then for any P ∈ N there exists an x0 such that Newton’s
method has period P .

It is good to consider the implications of Theorem 21. The function f is not some
‘wild’ counterexample from measure theory, it is a simple polynomial (degree 4 is
sufficient). Even so, Newton’s method has extremely complicated behavior when we
look beyond the neighborhood where we have local quadratic convergence – for any
period we choose, there is an x_0 such that Newton will have that period. There will
be a point with period 5 and 17 and 123456789 and 10^{365}. Moreover, there will also
be preperiodic points x_0. This means that x_n will first ‘jump around’ for a while
before settling on e.g. period 17. This means there is a large (although countable)
set of initial values for which Newton’s method will (eventually) periodically cycle.
We note that the situation is much more complicated in general, since the phe-
nomena from Theorem 21 and from Lemma 20 can combine. Then we can also have
open sets of initial values from which the iterates converge to periodic cycling with some period.

6. Chaotic behavior.
Even more generally, we may consider the set of all points x0 for which Newton’s
method does not converge to any root, even though we never hit a point where
f 0 (xn ) = 0 (i.e. the whole sequence is well defined). We have the following theorem.

Figure 2.5: Simplest case of the Saari-Urenko theorem (Theorem 23) and its proof.

Theorem 22 (Barna). Let f : R → R be a polynomial of degree D ≥ 4 with D
distinct real roots. Let W be the set of x_0 such that:

1. the whole sequence {x_n}_{n=0}^∞ from Newton’s method is well defined,

2. {x_n}_{n=0}^∞ does not converge to any root of f.

Then W is homeomorphic to a Cantor discontinuum, i.e. it is uncountable,
closed, has empty interior and has no isolated points.
The set W from Theorem 22 is uncountable, hence it is large from the set-
theoretic point of view. Obviously it contains the periodic points from Theorem
21. However this is ‘only’ a countable subset. So what do all the other points look
like? Simply stated, these are the points x0 for which Newton “chaotically” jumps
around. Specifically, we have the following result if f is a polynomial as in Theorem
22.
Theorem 23 (Saari, Urenko). Let f : R → R be a polynomial of degree D ≥ 4 with
D distinct real roots. Let α_1 < α_2 < . . . < α_{D−1} be the roots of f' (i.e. the local
extremes). Define the intervals I_j = (α_j, α_{j+1}) for j = 1, . . . , D − 2. Choose an
arbitrary sequence {S_n}_{n=0}^∞ such that each S_n ∈ {1, . . . , D − 2}. Then there exists
an x_0 ∈ R such that Newton’s method satisfies

    x_n ∈ I_{S_n}   for all n = 0, 1, . . . .

To fully appreciate Theorem 23, we illustrate it in the simplest case. Consider


a quartic (D = 4) polynomial f as in Figure 2.5. We have three points α_1, α_2, α_3,
where f' is zero. These define two intervals I_1 and I_2. Now we can choose any
sequence of ones and twos and the theorem gives an x_0 such that the iterates of
Newton’s method jump between I_1 and I_2 accordingly. For example, if we choose
the sequence {S_n}_{n=0}^∞ as

1 2 1 1 2 1 1 1 2 1 1 1 1 2...,
we get an x0 such that the consecutive iterates x0 , x1 , x2 , . . . lie in the intervals
I1 , I2 , I1 , I1 , I2 , I1 , I1 , I1 , I2 , I1 , I1 , I1 , I1 , I2 , . . . .

It is clear that such a sequence cannot converge to a root. It also cannot be a periodic
sequence and it cannot converge to a periodic sequence, as in Exercise 10, since the
iterates xn jump between the intervals I1 and I2 aperiodically. The only possibility
for convergence would be convergence to the common endpoint α_2. However this is
also not possible: since f'(α_2) = 0, if x_n were too close to α_2, then due to the size
of the Newton update the next iterate x_{n+1} would be very far away; certainly
it would lie outside of the finite intervals I_1 and I_2, thus violating the theorem.
One can choose even ‘wilder’ sequences {S_n}_{n=0}^∞ – we might choose a random
sequence. Thus Newton’s method will jump around ‘randomly’ (even though it is
a fully deterministic process). We might choose the sequence to correspond to the
digits of the binary expansion of √2 or π. These are irrational numbers, so again, the
iterates cannot converge to a periodic sequence. How many such aperiodic sequences
{S_n}_{n=0}^∞ are there? The answer is uncountably many, since the set of
irrational numbers is uncountable.
We call the behavior above chaotic, since the iterates can ‘jump around’ seem-
ingly randomly, never converging to some reasonable behavior. Even so, the theorem
on local quadratic convergence is still valid – there are neighborhoods around the
roots, where Newton will converge. This is typical of chaotic dynamical systems –
the mixing of very regular behavior (e.g. periodicity or convergence) with seemingly
random behavior. There is an entire field of research devoted to this topic.
We note that the proof of Theorem 23 is not difficult and is in fact (relatively)
constructive and intuitive. We indicate the basic idea in Figure 2.5.
Proof of Theorem 23 (basic idea). Consider the simplest case of a polynomial of de-
gree 4, as in Figure 2.5. Let g(x) = x − f(x)/f'(x) be the mapping defining Newton’s
method. Since lim_{x→α_1+} g(x) = +∞ and lim_{x→α_2−} g(x) = −∞, we have g(I_1) = R.
Similarly g(I2 ) = R. Therefore, for any point (or interval) in R, we can find its
preimage under g in both I1 and I2 . For example, there exist intervals J1 , J2 ⊂ I1
such that g(J1 ) = I1 and g(J2 ) = I2 . These can easily be found ‘graphically’, as
in Figure 2.5 (depicted in green). Altogether, in our case, if we choose an inter-
val I ⊂ (α1 , α3 ) then g −1 (I) consists of two intervals, one of which is in I1 , while
the other is in I2 . The whole ‘trick’ of the theorem is choosing which of the two
preimages we take, according to the prescribed sequence.
Assume for example that the sequence {S_n}_{n=0}^∞ is 1 2 . . . Taking the first two
symbols, we seek a point x_0 ∈ I_1 such that x_1 = g(x_0) ∈ I_2. The set of these points
is exactly the aforementioned interval J2 . In other words, J2 is one component of
g −1 (I2 ). We would choose the other component (which lies in I2 ), if the sequence
started 2 2 . . . In general, we consider g^{−1}(g^{−1}(. . . g^{−1}(I_i) . . .)) for i = 1, 2, where we
take the individual preimages from I1 or I2 according to the prescribed sequence.
The successive preimages are smaller and smaller intervals and if one proceeds care-
fully, in the limit one obtains a single point x0 . This is the rough idea behind the
proof.
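The itineraries can also be observed experimentally. The following Python sketch
(the quartic and the starting point are only an example, not taken from the text)
records in which of the intervals I_1, I_2 each Newton iterate lies; for most starting
points the itinerary eventually leaves I_1 ∪ I_2 and the method converges to a root,
in agreement with the fact that the chaotic set W has empty interior:

    import numpy as np

    # example quartic with four real roots 1, 2, 3, 4:
    # f(x) = x^4 - 10x^3 + 35x^2 - 50x + 24
    coeffs  = [1.0, -10.0, 35.0, -50.0, 24.0]
    dcoeffs = [4.0, -30.0, 70.0, -50.0]
    alphas = np.sort(np.roots(dcoeffs).real)    # critical points alpha_1 < alpha_2 < alpha_3
    I1, I2 = (alphas[0], alphas[1]), (alphas[1], alphas[2])

    def itinerary(x0, steps=20):
        x, symbols = x0, []
        for _ in range(steps):
            x = x - np.polyval(coeffs, x) / np.polyval(dcoeffs, x)   # Newton step
            if I1[0] < x < I1[1]:
                symbols.append(1)
            elif I2[0] < x < I2[1]:
                symbols.append(2)
            else:
                symbols.append(0)               # 0 = the iterate left I1 and I2
        return symbols

    print(itinerary(2.35))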

Exercises

Exercise 8 (Insufficient regularity).


Consider the equation x + x^{4/3} = 0. Write down Newton’s method for this
equation. Derive the rate of convergence of x_n to the root x_* = 0 directly from the
definition of convergence rate. Why is it not two? What assumption of Theorem 17
is violated? Try to find a generalization of Theorem 17 for functions with weaker
regularity.
Hint: Instead of Lipschitz continuity of f', consider Hölder continuity: A
function h is α-Hölder continuous if there exist constants C, α > 0 such that
|h(x) − h(y)| ≤ C|x − y|^α for all x, y. Is f' from the exercise α-Hölder continuous
for some α? Go through the proof of Theorem 17 with the assumption of
Hölder continuity instead of Lipschitz.

Exercise 9 (Local convergence).


Use Newton’s method to solve the equation arctan(x) = 0. Due to Theorem
17 there exists xc > 0 such that Newton’s method converges for all x0 ∈ (−xc , xc ).
However, if |x0 | > xc , Newton’s method will diverge to ±∞. Calculate xc explicitly
using Newton’s method.
Hint: If we choose x0 = xc , Newton’s method gives us x1 = −xc , x2 = xc , etc.
(draw a picture!). Therefore, if we write Newton’s method as xn+1 = g(xn ), then
xc solves the equation g(xc ) = −xc . This equation (probably) cannot be solved
analytically, so use Newton’s method to calculate the solution xc .

Exercise 10 (Periodic cycling).


Consider f (x) = x3 − 2x + 2. The equation f (x) = 0 has a single real root
x∗ ≈ −1.7693. Starting from x0 = 0, Newton’s method periodically cycles between
the two points 0 and 1. Show that this behavior is locally attracting: There exists a
neighborhood U of 0 (and also an analogous neighborhood of 1), such that for any
x0 ∈ U , the sequence from Newton’s method satisfies
    lim_{n→∞} x_{2n} = 0,    lim_{n→∞} x_{2n+1} = 1.

Moreover, the convergence in the limits above is quadratic.


Hint: If we write Newton’s method as xn+1 = g(xn ), consider what happens
when we iterate g ◦ g in the situation above. Apply the results of Section 2.2.

2.3.2 Stopping criteria


This section discusses the seemingly simple question of when to stop iterating when
seeking a root with sufficient accuracy. What we discuss here is not necessarily

restricted to Newton’s method, it applies to general iterative processes for nonlinear


equations. Moreover, by changing the absolute value | · | to a norm ‖ · ‖, we can apply
these ideas to systems of equations.
On a general level, when testing whether a sequence x_0, x_1, . . . converges to a
solution of f(x_*) = 0, one has, in principle, only two basic possibilities:
1. Testing convergence.

2. Testing if we satisfy the equation.


In the first case, we essentially test if the sequence is a Cauchy sequence. In the
second case, we measure the residual of the equation. In either case, we want xn to
satisfy some given tolerance ε.

1. Testing for a Cauchy sequence


The simplest criterion for a Cauchy sequence is the absolute criterion

|xn − xn−1 | < ε.

The problem with this (and other) absolute criteria is that it does not take into
account the typical magnitude of the numbers that we are dealing with. If in our
problem we expect xn and x∗ to typically be on the order of 1010 , it is not reasonable
to prescribe the same tolerance ε as if the typical numbers in our problem are on
the level of 10−10 . Choosing e.g. ε = 10−10 leads to a criterion that may not even
be satisfiable in finite precision arithmetic in the first example, while in the second
example the criterion may be satisfied even if we do not have a single relevant digit
of accuracy in the approximate solution. From this point of view it seems more
reasonable to choose some form of relative criterion, e.g.
    |x_n − x_{n−1}| / |x_n| < ε.    (2.14)
This is a more robust approach, however it can also fail in certain situations.
Example. Consider Newton’s method for the equation x² = 0. The iterative
process is x_n = ½ x_{n−1}. Evaluating the left-hand side of (2.14) in this case gives us

    |x_n − x_{n−1}| / |x_n| = ( ½|x_{n−1}| ) / ( ½|x_{n−1}| ) = 1.

Therefore, criterion (2.14) can never be satisfied for any ε < 1, even though x_n → x_*.
We note that changing the denominator in (2.14) to |x_{n−1}| does not help, as the
expression then only evaluates to ½ instead of 1.
One recommendation how to avoid the problem in the previous example is to
use the following criterion
    |x_n − x_{n−1}| / max{|x_n|, x_typ} < ε.    (2.15)
Here, xtyp is a nonzero user-specified quantity representing the ‘typical’ magnitude
of the argument x that we are working with (we shall encounter this quantity again

in (2.23), where we deal with the effect of rounding errors). The exact value of
xtyp is not really important, as long as it has roughly the correct magnitude we
are working with in our problem. This of course limits the usefulness of (2.15) in
‘black-box’ solvers, however when our problem comes e.g. from physics, we have at
least a general idea of what to expect of the quantities involved.
It is obvious that measuring the distance of two iterates in general has nothing
in common with the distance to the solution. The method could be caught in a
temporary stagnation phase, where progress is slow for some reason, even though
we are far from the solution.

2. Testing the residual


The other approach is to test how much we satisfy the equation, i.e. measure the
magnitude of |f (xn )|. The problem is that in general, the size of the residual can
have little in common with the distance to the solution. This is true especially for
systems of equations, and even for systems of linear equations. However, we do not
have much else to work with in general. The absolute criterion in this case would
be
|f (xn )| < ε.
In a relative criterion, we can measure e.g. the decrease of the residual relative
to the initial one:
|f (xn )| < ε|f (x0 )|.
This condition can however prove to be unsatisfiable, if |f (x0 )| was already a small
number. In this case one can use a combination of the relative and absolute criteria:

|f (xn )| < ε1 |f (x0 )| + ε2 .

Finally, one can combine the mentioned criteria to, for example:

    IF  |f(x_n)| < ε_1|f(x_0)| + ε_2  OR  |x_n − x_{n−1}| / max{|x_n|, x_typ} < ε_3  THEN STOP.

One might also try an AND instead of an OR in the above, but this might run into
satisfiability issues.
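For concreteness, such a combined test could be sketched in Python as follows (the
function name, the default tolerances and x_typ are purely illustrative):

    def should_stop(x_new, x_old, f_new, f_0, x_typ=1.0,
                    eps1=1e-12, eps2=1e-14, eps3=1e-10):
        residual_ok = abs(f_new) < eps1 * abs(f_0) + eps2               # relative + absolute residual test
        step_ok = abs(x_new - x_old) / max(abs(x_new), x_typ) < eps3    # Cauchy-type test
        return residual_ok or step_ok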
Remarks:

• As usual, there are no universal recipes in numerical mathematics.

• The choice of ε should be a reasonable trade-off between accuracy, attainability
and also rounding errors on the level of ‘machine precision’ ε_mach. A reasonable
choice is to set e.g. ε ∼ √ε_mach. This is a quantity we will also encounter
when balancing approximation and rounding errors on page 32.

• If one has some additional information, this can be used. For example, once
Newton’s method actually starts converging quadratically, we have

    e_n = x_n − x_* = (x_n − x_{n+1}) + (x_{n+1} − x_*),   where x_{n+1} − x_* = O(|e_n|²),

therefore |xn − xn+1 | and |en | can be expected to have the same magnitude,
once we are in the quadratically convergent regime. Therefore testing the
Cauchy property is justified in this case and gives a relevant estimate of the
error.

2.4 Secant method


So far we have considered the situation when f is explicitly given by a formula.
This is the purely mathematical setting of Newton’s and other methods. However
this is often not the case – f could be e.g. given by the output from another calcu-
lation implemented in some lengthy code, or perhaps as output from an experiment.
In this setting, we can evaluate the values of f , but we do not have easy access to
the values of its derivative f 0 . What then can we do in Newton’s method where we
need to evaluate f 0 ? The basic idea is to approximate the derivative somehow – we
will treat this approach in the following section. This is also the basis of the secant
method, cf. Section 2.4.2.
For completeness, let us mention one other possibility – automatic differentia-
tion, which can be applied when f is given by a computer program. The goal of
automatic differentiation is to take the program for evaluating f and turn it into
a program for evaluating f 0 . There are various techniques, tools, and libraries that
one can use to achieve this, here we only briefly describe the basic idea behind one
of them. The general idea is to take variables in the program, e.g. double v, and
replace it with pairs of numbers (variables), e.g. double v[2]. The first component
v[0] corresponds to the value of the variable and the second component v[1] cor-
responds to the derivative. One can then define new rules for operators, functions,
etc., which work on these new ‘extended’ variables according to the rules of differ-
entiation, e.g. (v*w)[0]=v[0]*w[0], (v*w)[1]=v[1]*w[0]+v[0]*w[1]. One can
thus rewrite the code for f to obtain the code for f 0 . This can even be as simple as
overloading the definitions of data types, operators, functions, etc., while the code
itself stays essentially the same. As mentioned, there are many approaches to this,
which differ e.g. by the order in which we evaluate the differentiation of composite
functions (forward mode or reverse mode), which can differ by their computational
or memory efficiency.
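To illustrate the idea, the following is a minimal forward-mode sketch in Python (not
a description of any particular library; the class name and the chosen function are
illustrative). Each value carries the pair (v, v′) and the arithmetic operators apply
the usual differentiation rules:

    class Dual:
        def __init__(self, value, deriv=0.0):
            self.v, self.d = value, deriv       # the pair (value, derivative)
        def __add__(self, other):
            return Dual(self.v + other.v, self.d + other.d)
        def __sub__(self, other):
            return Dual(self.v - other.v, self.d - other.d)
        def __mul__(self, other):
            # product rule: (vw)' = v'w + vw'
            return Dual(self.v * other.v, self.d * other.v + self.v * other.d)

    def f(x):                                   # any code built from the overloaded operations
        return x * x * x - Dual(2.0) * x + Dual(2.0)

    x = Dual(1.5, 1.0)                          # seed the derivative dx/dx = 1
    y = f(x)
    print(y.v, y.d)                             # prints f(1.5) and f'(1.5)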

2.4.1 Approximation of the derivative in Newton’s method


The basic idea here is to approximate f 0 by a simple difference with a step hn in
each iteration. This leads to the following:
Definition 24 (Newton’s method with differences). Let x_0 be given. Let h_k ≠ 0,
k = 0, 1, . . . be given. Newton’s method with differences consists of the
iterative procedure

    a_n = ( f(x_n + h_n) − f(x_n) ) / h_n,
    x_{n+1} = x_n − f(x_n) / a_n.    (2.16)

Obviously, a_n ≈ f'(x_n) is a first order approximation with respect to h_n. Specifically,
by taking y = x_n + h_n and x = x_n in Lemma 16 we get

    |a_n − f'(x_n)| ≤ ½ γ|h_n|,    (2.17)

where γ is the Lipschitz constant of f'.


We note that Newton’s method with differences only requires the evaluation of
f and not f 0 , unlike Newton’s method. The question is then what happens to the
convergence rate.

Theorem 25 (Convergence of Newton with differences). Let f' be γ-Lipschitz on
some neighborhood U of x_*. Let f'(x_*) ≠ 0. Then there exists η > 0 such that
if {h_n}_{n=0}^∞ satisfies 0 < |h_n| < η, then for any x_0 sufficiently close to x_* Newton’s
method with differences (2.16) converges to x_*. The convergence of the method is:

• linear: if there exists c > 0 such that |hn | > c for all n,

• superlinear: if limn→∞ hn = 0,

• quadratic: if there exists C > 0 such that |hn | ≤ C|xn − x∗ |.

Proof. By definition of x_{n+1}, we have

    x_{n+1} − x_* = x_n − f(x_n)/a_n − x_*
                  = (1/a_n) [ f(x_*) − f(x_n) − a_n(x_* − x_n) ]        (using f(x_*) = 0)
                  = (1/a_n) [ ( f(x_*) − f(x_n) − f'(x_n)(x_* − x_n) ) + ( f'(x_n) − a_n )(x_* − x_n) ],    (2.18)

where we denote the first bracketed term by (∗) and the second by (∗∗).

Lemma 16 and (2.17) give us

    |(∗)| ≤ ½ γ|e_n|²,    |(∗∗)| ≤ ½ γ|h_n||e_n|.    (2.19)

It remains to estimate a_n in the denominator of (2.18). Since f'(x_*) ≠ 0 and f' is
continuous, there exists ρ > 0 such that |f'(x)| ≥ ρ > 0 for all x ∈ U. We use the
triangle inequality in the form |A| ≥ |B| − |A − B| to get

    |a_n| ≥ |f'(x_n)| − |a_n − f'(x_n)| ≥ ρ − ½ γ|h_n| ≥ ½ ρ,    (2.20)

for |hn | ≤ η sufficiently small.


Altogether, if we apply the estimates (2.19) and (2.20) in (2.18), we get the error
inequality

    |e_{n+1}| ≤ ( 1/(½ρ) ) · ½ γ ( |e_n|² + |h_n||e_n| ) = (γ/ρ) ( |e_n| + |h_n| ) |e_n|.    (2.21)
All the statements of the theorem follow from this error inequality.

One might ask whether there can be a practical choice of h_n so that we get a
quadratically convergent method, since Theorem 25 requires |h_n| ≤ C|e_n| and |e_n| is
an unknown quantity which we can only estimate (if we knew e_n, we would immediately
know the exact solution x_*). There is however one seemingly strange
choice of h_n which naturally fulfills this requirement. Steffensen’s method is
obtained by taking h_n = f(x_n) in (2.16):

    x_{n+1} = x_n − f(x_n)² / ( f(x_n + f(x_n)) − f(x_n) ).

The method is quadratically convergent by Theorem 25, since

|hn | = |f (xn )| = |f (xn ) − f (x∗ )| ≤ L|xn − x∗ |,

where L is the Lipschitz constant of f (we assume even f' Lipschitz continuous,


hence f is also Lipschitz). We note that the method works relatively well, with all
of the strengths and weaknesses of Newton’s method, although at first it might seem
strange to have f composed with itself in the definition of the method. Nevertheless
it is a quadratically convergent method.
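A minimal implementation of Steffensen’s method might look as follows (the stopping
test, tolerances and the test equation are chosen only for illustration):

    def steffensen(f, x0, tol=1e-12, max_iter=50):
        x = x0
        for _ in range(max_iter):
            fx = f(x)
            if fx == 0.0:
                return x
            denom = f(x + fx) - fx              # difference with step h_n = f(x_n)
            x_new = x - fx * fx / denom
            if abs(x_new - x) < tol:
                return x_new
            x = x_new
        return x

    print(steffensen(lambda x: x**2 - 2.0, 1.0))    # ~ 1.414213562...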

Choice of hn in finite precision arithmetic


From the purely mathematical viewpoint, one should choose hn very small in or-
der to get a good approximation of f 0 by the difference and recover the quadratic
convergence of Newton’s method from Theorem 25. However in practice we cannot
choose hn arbitrarily small due to errors in floating point operations. Then even the
evaluation of f is subject to rounding or approximation errors. Specifically, instead
of f (x), we are actually computing some approximation f (x) + ε(x). Not much
can be said about the error ε(x), except that it is bounded from above for some
range of the variable x: |ε(x)| ≤ ε̄. This upper bound would usually be assumed to
be proportional to the ‘machine epsilon’ εmach and to |x|, but we will not get into
such details. Under these assumptions, we can estimate the error of the difference
approximation of the derivative in finite precision arithmetic as

    | f'(x) − [ ( f(x+h) + ε(x+h) ) − ( f(x) + ε(x) ) ] / h |
        ≤ | f'(x) − ( f(x+h) − f(x) ) / h | + | ( ε(x+h) − ε(x) ) / h |    (2.22)
        ≤ ½ γ|h| + 2ε̄/|h| = O( |h| + ε̄/|h| ).

We wish to take |h| so that the right-hand side estimate in (2.22) is minimal. If
we minimize the expression |h| + ε̄/|h| with respect to |h|, we get the minimum at
|h| = √ε̄. Therefore the general recommendation is to take h on the order of √ε̄ or
√ε_mach. We say ‘on the order of’, because we only have a rough bound on ε(x) and
it is essentially useless to try to carefully evaluate the constants in (2.22).
In general, one also has to take into account the error relative to the current and
‘typical’ value of xn . One possible recommendation from the literature is taking

    |h_n| = √ε_mach · max{|x_n|, x_typ}.    (2.23)

Here, xtyp is a user-specified quantity representing the ‘typical’ magnitude of the


argument x that we are working with (we have already encountered this quantity
in Section 2.3.2). The reasoning behind (2.23) is this: it is reasonable to have hn
proportional to xn – taking the perturbation h = 0.1 is clearly not the same if xn = 1
and if x_n = 10^{10}. Also, we usually know where the problem comes from and what
is the typical magnitude of x – if the equation comes from physics and x represents
atmospheric pressure in pascals, we expect typically x ≈ 10^5. If the equation comes
from electrical engineering and x represents capacitance in farads, we expect x ≈
10^{−6} or smaller. Hence it is reasonable to take |h_n| = √ε_mach |x_n|. However this can
fail. Let us assume that we typically expect x ≈ 1. It may happen that Newton’s
method in some iteration gives x_n = 0 (perhaps because of some chaotic phase before
the method starts converging). Then we would take |h_n| = √ε_mach |x_n| = 0, which
is nonsense. Or if Newton gives x_n very small, then |h_n| would be unrealistically
small and result in a bad approximation due to round-off errors. The quantity
xtyp is added into (2.23) to circumvent this obstacle. Since it is only a fail-safe, the
specific value of xtyp is not important, only a rough approximation is sufficient for
the purpose.
Finally we note that we can use other approximations of the derivative f'. For
example one can take the central difference

    f'(x) ≈ ( f(x + h) − f(x − h) ) / (2h),    (2.24)

which has an error of O(|h|²). This can be advantageous when the error ε̄ in the
evaluation of f is too large and the choice of h_n from (2.23) results in a large error.
If we use the approximation (2.24) instead, then the optimal choice of |h_n| is on the
order of ∛ε̄ and the approximation improves. The price we pay is three evaluations
of f per Newton iteration: f(x_n + h_n), f(x_n), f(x_n − h_n).
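This behavior is easy to observe numerically. The following small Python experiment
(the function sin and the point x = 1 are just an example) prints the errors of the
forward and central differences for a range of step sizes; in double precision the forward
difference is most accurate roughly at h ≈ 10⁻⁸ ≈ √ε_mach and the central difference
roughly at h ≈ 10⁻⁵ to 10⁻⁶ ≈ ∛ε_mach:

    import math

    x, exact = 1.0, math.cos(1.0)
    for k in range(1, 13):
        h = 10.0 ** (-k)
        forward = (math.sin(x + h) - math.sin(x)) / h
        central = (math.sin(x + h) - math.sin(x - h)) / (2.0 * h)
        print(k, abs(forward - exact), abs(central - exact))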

2.4.2 Secant method


In Newton’s method with differences, we need to evaluate the term f (xn + hn ) in
each iteration. The idea behind the secant method is to reuse an already computed
value of f . Namely, if we set hn = xn−1 − xn then f (xn + hn ) = f (xn−1 ) which is
a quantity that we have already computed in the previous iteration. Thus we save
one function evaluation per iteration.
If we set hn = xn−1 − xn in the definition of Newton’s method with differences
(2.16), we get
    a_n = ( f(x_n + h_n) − f(x_n) ) / h_n = ( f(x_{n−1}) − f(x_n) ) / ( x_{n−1} − x_n ),

    x_{n+1} = x_n − f(x_n)/a_n = x_n − f(x_n)( x_{n−1} − x_n ) / ( f(x_{n−1}) − f(x_n) )
            = ( x_{n−1}f(x_n) − x_nf(x_{n−1}) ) / ( f(x_n) − f(x_{n−1}) ).
This is the secant method.
Definition 26 (Secant method). Let x_0, x_1 be given. The secant method consists
of the iterative procedure

    x_{n+1} = ( x_{n−1}f(x_n) − x_nf(x_{n−1}) ) / ( f(x_n) − f(x_{n−1}) ).

Figure 2.6: Secant method.

Remarks:

• We have only one evaluation of f per iteration, since we know fn−1 from the
previous iteration. Compare to Newton or Newton with differences, where we
need two function evaluations per iteration.

• We do not need to know f 0 .

• We need two initial values x0 and x1 instead of one.

• A more classical derivation and interpretation of the secant method is con-


tained in Figure 2.6. We define a linear approximation of f by passing a line
through the points (xn−1 , f (xn−1 )), (xn , f (xn )) in the plane and define xn+1 as
the root (or intersection with the x-axis) of this linear approximation. Hence
the name secant method instead of method of tangents (a less common name
for Newton’s method).

• By Theorem 12, one-point iteration methods have only integer rates of con-
vergence. By ‘one-point’ method we mean methods of the form xn+1 = g(xn ),
where the next iteration depends only on one previous iterate. The secant
method is a two-point method, i.e. it can be written in the form xn+1 =
g(xn , xn−1 ). Such methods can have general non-integer rates of conver-
gence, as is the case for the secant method.
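For completeness, a minimal Python implementation of the secant method might look
as follows (the tolerance-based stopping test and the test equation are only illustrative):

    def secant(f, x0, x1, tol=1e-12, max_iter=50):
        f0, f1 = f(x0), f(x1)
        for _ in range(max_iter):
            x2 = (x0 * f1 - x1 * f0) / (f1 - f0)    # the secant step from Definition 26
            if abs(x2 - x1) < tol:
                return x2
            x0, f0 = x1, f1
            x1, f1 = x2, f(x2)                      # only one new function evaluation
        return x1

    print(secant(lambda x: x**3 - 2.0, 1.0, 2.0))   # ~ 2^(1/3) = 1.259921...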

Theorem 27 (Local superlinear convergence of the secant method). Let f ∈ C²(U)
for some neighborhood U of x_*. Let f'(x_*) ≠ 0 and let x_0, x_1 be sufficiently close to
x_*. Then the secant method converges superlinearly to x_*. The convergence rate
is (1+√5)/2 ≈ 1.618.

Proof. In order to simplify the notation, we define fn := f (xn ), fn−1 := f (xn−1 ),


etc. We divide the proof into two parts.

1. Derivation of the error equation. By definition of x_{n+1}, we have

    e_{n+1} = ( x_{n−1}f_n − x_nf_{n−1} ) / ( f_n − f_{n−1} ) − x_*
            = ( e_{n−1}f_n − e_nf_{n−1} ) / ( f_n − f_{n−1} )
            = e_n e_{n−1} · ( f_n/e_n − f_{n−1}/e_{n−1} ) / ( f_n − f_{n−1} ),    (2.25)

where we denote the last fraction by (⋆).

The rest of this part of the proof lies in expressing the term (⋆). Since f(x_*) = 0,
we have

    (⋆) = [ ( f_n − f(x_*) )/( x_n − x_* ) − ( f_{n−1} − f(x_*) )/( x_{n−1} − x_* ) ] / ( f_n − f_{n−1} )
        = ( F(x_n) − F(x_{n−1}) ) / ( f_n − f_{n−1} ),    (2.26)

where we have defined the auxiliary function F as

    F(x) = ( f(x) − f(x_*) ) / ( x − x_* ).
We can express the numerator in (2.26) by the mean value theorem:

    F(x_n) − F(x_{n−1}) = F'(ξ_n)( x_n − x_{n−1} ),    (2.27)

for some ξ_n between x_n and x_{n−1}. The derivative F' from (2.27) can be explicitly
computed as

    F'(x) = ( f'(x)(x − x_*) − f(x) + f(x_*) ) / (x − x_*)²
          = ( ½ f''(ξ̃_n)(x − x_*)² ) / (x − x_*)² = ½ f''(ξ̃_n),    (2.28)

for some ξ̃_n between x (which is ξ_n in (2.27), hence the dependence of ξ̃_n on n) and x_*. The
second equality in (2.28) follows from Taylor’s expansion: f(x_*) = f(x) + f'(x)(x_* − x) + ½ f''(ξ̃_n)(x_* − x)².
By combining (2.27) with (2.28), we get

    F(x_n) − F(x_{n−1}) = ½ f''(ξ̃_n)( x_n − x_{n−1} ),

which we can substitute into (2.26) to obtain

    (⋆) = ½ f''(ξ̃_n) · ( x_n − x_{n−1} ) / ( f_n − f_{n−1} ) = ½ f''(ξ̃_n) · 1 / f'(ξ̂_n),

where the last equality follows from the mean value theorem f_n − f_{n−1} = f'(ξ̂_n)( x_n − x_{n−1} ).
Having expressed the term (⋆) from (2.25), we finally arrive at the error equation
for the secant method in the form

    e_{n+1} = ( f''(ξ̃_n) / ( 2f'(ξ̂_n) ) ) e_n e_{n−1},    (2.29)

where ξ̃_n and ξ̂_n lie between x_n and x_{n−1}.


2. Convergence rate. From the error equation (2.29) we can easily derive convergence
of the method: the factor f''(ξ̃_n) / ( 2f'(ξ̂_n) ) can be bounded by some constant M on
some neighborhood of x_*, and if e_0, e_1 are sufficiently small with respect to M, we
get |e_n| → 0 by induction similarly as in the proof of Theorem 14. What remains is
to derive the convergence rate.
Let us assume that the secant method has rate of convergence p. If we define

    S_n = |e_{n+1}| / |e_n|^p,    (2.30)

then by the definition of convergence rate p, there exists C > 0 such that

    lim_{n→∞} S_n = C > 0.    (2.31)

From the definition (2.30) of S_n, we can express

    |e_n| = S_{n−1} |e_{n−1}|^p,
    |e_{n+1}| = S_n |e_n|^p = S_n ( S_{n−1} |e_{n−1}|^p )^p = S_n S_{n−1}^p |e_{n−1}|^{p²}.    (2.32)

Now, we turn our attention to the error equation (2.29). We take its absolute value
and divide by |en ||en−1 | in order to have all the error terms on one side. We get

    |f''(ξ̃_n)| / ( 2|f'(ξ̂_n)| ) = |e_{n+1}| / ( |e_n||e_{n−1}| ).    (2.33)

Since the method converges, we have ξ˜n , ξˆn → x∗ . The left-hand side term from
(2.33) therefore converges to a constant which is in general a finite nonzero number:

    lim_{n→∞} |f''(ξ̃_n)| / ( 2|f'(ξ̂_n)| ) = |f''(x_*)| / ( 2|f'(x_*)| ) ≠ 0.    (2.34)

Concerning the right-hand side of the error relation (2.33), we can express it using
(2.32) and take the limit to obtain
    lim_{n→∞} |e_{n+1}| / ( |e_n||e_{n−1}| )
        = lim_{n→∞} S_n S_{n−1}^p |e_{n−1}|^{p²} / ( S_{n−1} |e_{n−1}|^p |e_{n−1}| )
        = lim_{n→∞} S_n S_{n−1}^{p−1} |e_{n−1}|^{p²−p−1}
        = C^p lim_{n→∞} |e_{n−1}|^{p²−p−1},    (2.35)

since Sn , Sn−1 → C > 0, by the definition of convergence rate (2.31). Altogether,


if we take the limit of the error relation (2.33) and substitute (2.34) and (2.35), we
get

    0 ≠ |f''(x_*)| / ( 2|f'(x_*)| ) = lim_{n→∞} |f''(ξ̃_n)| / ( 2|f'(ξ̂_n)| )
      = lim_{n→∞} |e_{n+1}| / ( |e_n||e_{n−1}| ) = C^p lim_{n→∞} |e_{n−1}|^{p²−p−1}.    (2.36)

Since |en−1 | → 0, the value of the last limit in (2.36) depends on the sign of the
exponent:

    lim_{n→∞} |e_{n−1}|^{p²−p−1}  =  { 0,   if p² − p − 1 > 0,
                                     { ∞,   if p² − p − 1 < 0,    (2.37)
                                     { 1,   if p² − p − 1 = 0.

The first case is not possible, since (2.36) would reduce to 0 ≠ 0. The second
case is also not possible, since the right-hand side of (2.36) would be infinite (here
it is important that C^p ≠ 0), while the left-hand side is a finite (but nonzero)
number. Therefore, the only possibility for (2.36) to be satisfied is the third case
p² − p − 1 = 0. This equation has a single positive root p = (1+√5)/2 ≈ 1.618, which is
the desired convergence rate.

When reading the proof of Theorem 27, it is not easy to gain an intuitive insight
as to why the error equation (2.29) leads to the convergence rate (1+√5)/2. Compare
this to Newton’s method, where the error equation is simply e_{n+1} ≈ e_n², from which
we immediately see the convergence rate of 2. Let us now try to gain some intuition
on the case of the secant method.
We write the error equation for the secant method (2.29) in the simplified form

    e_{n+1} ≈ e_n e_{n−1}.    (2.38)

Let us assume for simplicity that the initial errors are on the order

    e_0, e_1 ≈ 10^{−1}.

By repeatedly applying the error equation (2.38), we get

    e_2 ≈ 10^{−2},  e_3 ≈ 10^{−3},  e_4 ≈ 10^{−5},  e_5 ≈ 10^{−8},  e_6 ≈ 10^{−13},  . . .

We can notice that the exponents are the Fibonacci numbers F_n defined by the
recurrence F_{n+1} = F_n + F_{n−1} with F_0 = F_1 = 1. Altogether, we have

    e_n ≈ 10^{−F_n}.    (2.39)

There are many classical results relating properties of Fibonacci numbers to the
‘golden ratio’. For example, one can derive that

    lim_{n→∞} F_{n+1}/F_n = (1+√5)/2.    (2.40)

This follows e.g. from Binet’s formula, which is an explicit formula for F_n. If we
combine (2.39) with (2.40) for sufficiently large n, we get

    e_{n+1} ≈ 10^{−F_{n+1}} = ( 10^{−F_n} )^{F_{n+1}/F_n} ≈ e_n^{(1+√5)/2}.

From this we can see the convergence rate (1+√5)/2 of the secant method. Of course
these simple considerations only serve to gain intuitive insight and do not constitute
a rigorous proof.
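The golden-ratio behavior can also be seen in a small numerical experiment (a sketch;
the test equation x³ = 2 and the number of iterations are chosen only so that rounding
errors do not yet interfere). We run the secant method and print log|e_{n+1}| / log|e_n|,
which should approach (1+√5)/2 ≈ 1.618:

    import math

    f = lambda x: x**3 - 2.0
    root = 2.0 ** (1.0 / 3.0)
    x0, x1 = 1.0, 2.0
    errors = [abs(x0 - root), abs(x1 - root)]
    for _ in range(6):
        x0, x1 = x1, (x0 * f(x1) - x1 * f(x0)) / (f(x1) - f(x0))    # secant step
        errors.append(abs(x1 - root))
    for e0, e1 in zip(errors[2:], errors[3:]):
        print(math.log(e1) / math.log(e0))          # tends towards ~1.618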

Which is better – Newton or secants?


At first glance, it may seem that the quadratically convergent Newton method is
clearly better than the ‘merely’ superlinear secant method. This is true if we simply
compare the convergence rates. However, Newton is twice as expensive as the secant
method in terms of function evaluations per iteration – Newton needs two function
evaluations, while secants only need one. Therefore, it would be more fair to compare
one iteration of Newton with two iterations of the secant method. One iteration
of Newton gives us
    |e_{n+1}| ≤ C|e_n|².    (2.41)

On the other hand, if we denote p = (1+√5)/2, two iterations of the secant method give
us

    |e_{n+2}| ≤ C|e_{n+1}|^p ≤ C ( C|e_n|^p )^p = C̃|e_n|^{p²}.    (2.42)
Now in order to compare (2.41) and (2.42), we need to know whether p2 is bigger
or smaller than 2. But if we go back to (2.37), p was obtained as the solution of
p2 − p − 1 = 0. Therefore, p2 = p + 1 and if we remember that p ≈ 1.618, then we
immediately see that p2 ≈ 2.618, which is larger than two.
Altogether, if we compare Newton with secants in terms of ‘convergence rate
with respect to number of function evaluations’, the secant method actually slightly
beats Newton’s method.

Remark 28. There is one more issue with the secant method. Once we are very
close to x∗ , the differences xn − xn−1 and f (xn ) − f (xn−1 ) become very small. This
may present a problem in finite precision arithmetic, as we have discussed in Section
2.4.1. It may happen that x_n − x_{n−1} is too small with respect to the recommended
value of √ε_mach and the difference approximation of the derivative can no longer be
trusted. Newton’s method does not suffer from this issue.

False position method (regula falsi)


The false position method is a very old method, which can be seen from the fact
that in some countries it is known by its Latin name ‘regula falsi’. The idea is to
take the bisection method and instead of evaluating f at the midpoint of (an , bn )
and comparing signs, we evaluate f at the point obtained from the application of
the secant method to an , bn . Thus we bring more information about f into play,
than just taking the midpoint universally for all functions. Here is the resulting
algorithm:
Given I0 = [a0 , b0 ] and a tolerance tol. Set n = 0.
While (bn − an ) > tol:
    s_n = ( a_nf(b_n) − b_nf(a_n) ) / ( f(b_n) − f(a_n) ).
    If f(a_n)f(s_n) ≤ 0:
        a_{n+1} := a_n,  b_{n+1} := s_n,
    else:
        a_{n+1} := s_n,  b_{n+1} := b_n.
    Set n := n + 1.

Figure 2.7: False position method (regula falsi).

We note that unlike in the bisection method, we do not have |bn − an | → 0 in


general. This can be seen from the example in Figure 2.7. For such a function,
|b_n − a_n| ↛ 0, however a_n → x_*. In this sense, regula falsi converges to x_* for any
continuous function – either |b_n − a_n| → 0 or a_n → x_* or b_n → x_*, cf. [YG88,
Chapter 4.5] or [Seg00, Theorem 6.5]. Moreover, it is truly a bracketing method, i.e.
(an , bn ) always contains a root.
Regula falsi has only linear convergence in general. Sometimes it may happen
that the convergence is even slower than for bisection, although this is usually not
the case.
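A direct Python transcription of the algorithm might look as follows (a sketch; the
iteration cap is added because, as noted above, b_n − a_n need not tend to zero, so the
while-condition alone need not terminate):

    def regula_falsi(f, a, b, tol=1e-10, max_iter=100):
        fa, fb = f(a), f(b)                     # assumes f(a) * f(b) < 0
        for _ in range(max_iter):
            if b - a <= tol:
                break
            s = (a * fb - b * fa) / (fb - fa)   # secant point
            fs = f(s)
            if fa * fs <= 0:
                b, fb = s, fs
            else:
                a, fa = s, fs
        return a, b

    print(regula_falsi(lambda x: x**2 - 2.0, 0.0, 2.0))   # a_n tends to sqrt(2), b_n stays at 2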

Exercises

Exercise 11 (Regula falsi as a fixed point iteration procedure).


Consider the example depicted in Figure 2.7. Specifically, let f'' > 0, f' > 0 on
(a_0, b_0), where f(a_0) < 0 and f(b_0) > 0. Prove that regula falsi satisfies a_n → x_*.
Hint: In this case bn = b0 for all n and only an changes according to an+1 = g(an )
for some g. Find the formula for g and show that it is a contraction, given the
assumptions on f .

2.5 Various more sophisticated methods


Current literature on the topic is full of many diverse numerical methods for the
solution of nonlinear equations. Even a brief overview would be beyond the scope
of these notes. Here we only present several basic examples of more sophisticated
methods.

Figure 2.8: Muller’s method (left) and IQI (right).

2.5.1 Methods based on quadratic interpolation


Newton’s method was based on linearization of the nonlinear equation and solving
the resulting linear equation. It is a natural idea to generalize this approach to the
next level - quadratic approximations.

Muller’s method
Muller’s method is a direct generalization of the secant method to the quadratic
case. Due to its simplicity, it seems to have been independently discovered and re-
discovered many times by various people, however it is credited to David E. Muller
in 1956. The idea is simple:

1. Use xn , xn−1 , xn−2 and the values of f at these points to construct a quadratic
polynomial p such that

p(xn−i ) = f (xn−i ), i = 0, 1, 2.

2. Find the roots of p.

3. Define xn+1 as the root of p which is closest to xn .

A graphical representation of Muller’s method can be found in Figure 2.8. We


will not bother to write down the final formula for xn+1 , this can be found elsewhere.
Remarks:

• A quadratic function need not have a real root. If this happens for the poly-
nomial p in some iteration, the next xn+1 is not defined and the iteration
process stops. One can fix this e.g. by switching to secants in that iteration
or neglecting the imaginary parts of the roots.

• Another possibility is to accept the complex roots of p and use the method to
search for complex roots of f (e.g. when f is a polynomial), which is actually
a good idea.

• The convergence rate is approximately 1.839, which is the real root of x³ − x² −
x − 1 = 0 and stems from the error equation e_{n+1} = C_n e_n e_{n−1} e_{n−2}, where
C_n is some quantity involving various derivatives of f. Compare this to the
convergence rate 1.618 of the secant method, which is the root of x² − x − 1 = 0
stemming from the error equation e_{n+1} = C_n e_n e_{n−1}.

• Muller’s method needs only one function evaluation per iteration. Therefore,
similarly as in the secant method on page 38, if we wish to compare Newton and
Muller, it is more fair to compare two iterations of Muller with one iteration
of Newton. Two iterations of Muller’s method have a combined convergence
rate of 1.839² ≈ 3.383, which is much higher than the quadratic convergence
rate of Newton.

• We now need three initial values x0 , x1 and x2 .

Inverse quadratic interpolation (IQI)


A clever possibility how to avoid the problem with complex roots in Muller’s method
is the so-called Inverse quadratic interpolation (IQI) method introduced in 1826
by the French mathematician Germinal Pierre Dandelin (so technically it predates
Muller’s method). The idea is simple and elegant: interpolate the points (xi , f (xi ))
in the plane not as a function “p of x”, but rather as a function “p̃ of y”, i.e. rotate
the interpolation problem by 90◦ , cf. Figure 2.8. Now instead of seeking for the
roots of the quadratic polynomial p from Muller’s method, in IQI we get xn+1 by
simply evaluating p̃(0). The procedure is thus:

1. Use xn , xn−1 , xn−2 and the values f (xn ), f (xn−1 ), f (xn−2 ) to construct a quad-
ratic polynomial p̃ such that

    p̃( f(x_{n−i}) ) = x_{n−i},   i = 0, 1, 2.    (2.43)

2. Define xn+1 = p̃(0).

Again, we do not write down explicit formulas for xn+1 , which can be found
elsewhere. Obviously the method can break down if two of the values f(x_n), f(x_{n−1}),
f(x_{n−2}) are equal, since the interpolation problem (2.43) is not well
posed in this case. This will obviously happen very rarely, but may cause instabilities
in the algorithm when some of these values are very close to each other.
Otherwise, the basic properties of IQI are the same as of Muller: The rate of
convergence of IQI is the same as in Muller’s method (1.839), it also uses one function
evaluation per iteration and also needs three initial values.
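One step of IQI is easy to write down explicitly using the Lagrange form of the
interpolating quadratic in y (a sketch; breakdown when two of the f-values coincide
is not handled, and the test equation is only an example):

    def iqi_step(f, x0, x1, x2):
        y0, y1, y2 = f(x0), f(x1), f(x2)
        # interpolate x as a quadratic function of y and evaluate it at y = 0
        return (x0 * y1 * y2 / ((y0 - y1) * (y0 - y2))
                + x1 * y0 * y2 / ((y1 - y0) * (y1 - y2))
                + x2 * y0 * y1 / ((y2 - y0) * (y2 - y1)))

    xs = [1.0, 2.0, 1.5]
    for _ in range(5):
        xs = [xs[1], xs[2], iqi_step(lambda t: t**3 - 2.0, *xs)]
    print(xs[-1])       # ~ 1.259921, the real root of x^3 = 2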

2.5.2 Hybrid algorithms


As we have seen in Section 2.3, Newton’s method is very fast if we are sufficiently
close to the root, however many bad things can happen outside of this neighbor-
hood. On the other hand, bisection is guaranteed to converge to a root, albeit very

slowly, for any continuous function on any interval with a sign change in the end-
points. The idea of hybrid methods is to combine two or more methods, typically a
very fast locally convergent method with a slow globally convergent method along
with a set of criteria how to switch between the methods in every iteration. Typ-
ically this is done by combining open methods (Newton) and bracketing methods
(bisection), however we can view the damped Newton method with backtracking
(page 22) as a rudimentary example of such a method.

Dekker’s method
The first sophisticated hybrid method was developed in 1969 by Theodorus Dekker
(although variants of this method appeared in earlier works). The idea is to combine
the bisection method with the secant method. We produce a sequence of brackets
[an , bn ] containing x∗ , where the points bn are the actual approximations of the root
(bn need not necessarily be the right endpoint of the bracket, it can be the left
endpoint as well). We find two candidates for approximating xs – those given by
the secant and bisection methods. Then we choose which one we will be our next
approximation bn+1 . Finally, we choose an appropriate an+1 from all the available
values to have a new bracket [an+1 , bn+1 ]. In detail, we have:

1. Let an , bn and bn−1 be given such that f (an )f (bn ) < 0.

2. Compute the secant approximation from bn , bn−1 :

    s = ( b_{n−1}f(b_n) − b_nf(b_{n−1}) ) / ( f(b_n) − f(b_{n−1}) ).

3. Compute the bisection approximation (midpoint) from an , bn :

    m = ( a_n + b_n ) / 2.

4. If s lies between m and bn , set bn+1 := s. Otherwise set bn+1 := m.

5. Set an+1 to either an or bn , so that f (an+1 )f (bn+1 ) < 0.

6. If |f (an+1 )| < |f (bn+1 )|, exchange an+1 and bn+1 .

7. Set n := n + 1 and go to 1.

The criterion from point number 4 in the algorithm guarantees that the new
approximation bn+1 does not jump too far from the current approximation bn , which
was one of the causes of non-convergence in Newton’s method. Point number 5
ensures we will have a bracket containing the root in the next iteration. In point
number 6 we assume that the residual of the equation is an indicator of the quality
of approximation (which might or might not be true) and possibly change the role
of an+1 and bn+1 so that bn+1 is the ‘better’ approximation of x∗ .
It can be proven that the method always converges and that x∗ ∈ [an , bn ] for all
n. Sometimes the method can converge very slowly, even more slowly than bisection.
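A compact Python sketch of Dekker’s iteration following the steps above might look
as follows (bookkeeping details differ between published variants, so this is only an
illustration, not a reference implementation; the test equation is the cubic from
Exercise 10):

    def dekker(f, a, b, tol=1e-12, max_iter=100):
        fa, fb = f(a), f(b)             # assumes f(a) * f(b) < 0; b is the current best point
        b_old, fb_old = a, fa
        for _ in range(max_iter):
            if abs(b - a) < tol or fb == 0.0:
                return b
            m = 0.5 * (a + b)                                   # bisection candidate (step 3)
            if fb != fb_old:
                s = (b_old * fb - b * fb_old) / (fb - fb_old)   # secant candidate (step 2)
            else:
                s = m
            b_new = s if min(b, m) < s < max(b, m) else m       # step 4
            f_new = f(b_new)
            b_old, fb_old = b, fb
            if fa * f_new < 0:                                  # step 5: keep a sign change
                b, fb = b_new, f_new
            else:
                a, fa = b, fb
                b, fb = b_new, f_new
            if abs(fa) < abs(fb):                               # step 6: b is the 'better' point
                a, b, fa, fb = b, a, fb, fa
        return b

    print(dekker(lambda x: x**3 - 2.0 * x + 2.0, -3.0, 0.0))    # ~ -1.7693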

Brent’s method
In 1973, Richard P. Brent published an improved version of Dekker’s method. With-
out going into the technical detail, we only state that Brent’s method chooses be-
tween bisection, secants and IQI in each iteration based on more sophisticated crite-
ria. For general functions, the method can actually be slower than bisection – Brent
shows that the method converges to the desired precision in at most N 2 iterations,
if bisection needs N iterations. However, for ‘nice’ functions, the method typically
performs IQI in most of the iterations, yielding superlinear convergence. We note
that Brent’s method is used in the fzero command in MATLAB.
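If SciPy is available in Python, a Brent-type hybrid method can be called directly as
scipy.optimize.brentq; this is shown here only as a usage illustration on the cubic from
Exercise 10:

    from scipy.optimize import brentq

    root = brentq(lambda x: x**3 - 2.0 * x + 2.0, -3.0, 0.0)    # needs a sign change on [-3, 0]
    print(root)                                                 # ~ -1.7693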

Exercises

Exercise 12 (Falling out of an airplane).


A person falling from an airplane without a parachute falls y(t) meters in t
seconds, where

    y(t) = ln( cosh( t√(gk) ) ) / k,

where g = 9.8065 m/s² and k = 0.00341 m⁻¹. How long does it take for the person to
fall one kilometer? Use Newton’s method or the secant method to find the solution.
Remark: The formula for y(t) can be obtained by solving the corresponding
ordinary differential equation describing free fall in a gravity field with air friction
proportional to the square of the velocity.
Chapter 3

Systems of equations

In this chapter, we will consider systems of N nonlinear equations for N unknowns.


In other words, we will have a mapping F : RN → RN and we seek a root x∗ ∈ RN ,
i.e. the solution of F (x∗ ) = 0. Since we are in RN , we can write everything in terms
of individual components, namely F = (f1 , . . . , fN )T and x = (x1 , . . . , xN )T .
In comparison to the scalar case things start to get qualitatively more complex:

• Much more complicated behavior and properties of the systems and numerical
methods, causing the analysis to be much more technical and complex.

• A practical bracketing algorithm does not exist for high N .

• Computational costs tend to grow exponentially in terms of N in the worst


case, or at least polynomially (so-called curse of dimensionality).

The situation is also much more complicated in comparison to the case of linear
systems Ax = b and even in this case there are no universally functioning numer-
ical methods. Consider the simple question of how many solutions the system of
equations can have. In the linear case, the answer is 0, 1 or ∞. For nonlinear equa-
tions, the answer is 0, any natural number, countably many and uncountably many.
Moreover, even the number of roots can depend very sensitively on slight changes
in the system.
Example. Consider the system

    x_1² − x_2 + γ = 0,
    −x_1 + x_2² + γ = 0,

where γ ∈ R is a parameter. Note that each of the equations describes a parabola


in the plane and γ determines their mutual position (draw a picture!). Depending
on the value of γ, we have the following number of solutions:

    γ = 0.5:   0 solutions,
    γ = 0.25:  1 solution,
    γ = −0.5:  2 solutions,
    γ = −1.0:  4 solutions.


Example. Consider the system


    x_1^5 x_2^7 − x_3^5 = 1,
    x_2^4 x_3^8 − x_1^4 = 2,
    x_1^8 x_3^4 − x_2^6 = 3.

The question is how many real solutions the system has. I do not know the answer,
but one can estimate the maximum number. Since each equation is a polynomial,
Bézout’s theorem gives an upper bound on the number of roots as the product of
the degrees of the multivariate polynomials. In our case this means that the system
above can have up to 12³ = 1728 roots. We can see that even small, seemingly
simple systems can have surprisingly complex behavior.

3.1 Tools from differential calculus


Here we will review the basic tools that we will need from differential and integral
calculus in RN .
Definition 29. A function F : RN → RM is differentiable at x ∈ RN if there
exists a linear mapping A : RN → RM such that
    lim_{h→0, h∈R^N} ‖F(x + h) − F(x) − Ah‖ / ‖h‖ = 0.    (3.1)

The mapping A is called the derivative or differential of F at x and we will


denote it as F 0 (x).
Remarks:
• The definition is independent of the specific choice of the norms in (3.1).
• If F 0 (x) exists, then it is determined uniquely.

• If F'(x) exists, then so do the partial derivatives ∂f_i(x)/∂x_j, and the differential
F' can be represented by the Jacobi matrix DF/Dx = { ∂f_i/∂x_j }_{i,j} in the sense that
F'(x)v = (DF/Dx)(x)v for all v ∈ R^N.
We will need the following version of the mean value theorem.
Lemma 30. Let F : D ⊂ R^N → R^M be differentiable on a convex open set D and
let F' be continuous on D. Then for all x, y ∈ D

    F(y) − F(x) = ∫₀¹ F'( x + t(y − x) )( y − x ) dt.    (3.2)

Proof. Since (3.2) is a vector identity, it is sufficient to prove it for each component
separately. Define the scalar function ϕ_i(t) = f_i( x + t(y − x) ). Then ϕ_i'(t) =
∇f_i( x + t(y − x) ) · ( y − x ). Therefore

    f_i(y) − f_i(x) = ϕ_i(1) − ϕ_i(0) = ∫₀¹ ϕ_i'(t) dt = ∫₀¹ ∇f_i( x + t(y − x) ) · ( y − x ) dt.

This is the i-th component of (3.2).



We note that Definition 29 can be directly extended to mappings F : X → Y ,


where X, Y are normed vector spaces – this is the Fréchet derivative. Then the
differential is a bounded linear operator from X to Y : F 0 (x) ∈ L(X, Y ), hence F 0 :
X → L(X, Y ). Therefore, if we want to go one step further and define F 00 = (F 0 )0 ,
then we get F 00 (x) ∈ L(X, L(X, Y )). In the finite dimensional case, F 0 (x) ∈ L(X, Y )
can be represented by a matrix, but F''(x) ∈ L(X, L(X, Y)) is representable by a
tensor with components ∂²f_i/(∂x_j ∂x_k). This is the reason why we avoid working with
second derivatives (e.g. in the convergence theorem for Newton’s method), as one
must either be comfortable with tensor notation, or write everything in terms of
the individual components, which requires three indices. It is then much easier to
work with e.g. Lipschitz continuity of F' (i.e. F'' ∈ L^∞, see Remark 15) than with
continuity of F''. Obviously, things get much worse when working with even higher
derivatives.

3.1.1 Fixed point iteration


If we take a mapping G : RN → RN with the fixed point x∗ and define the simple
fixed point procedure
xn+1 = G(xn ), x0 ∈ RN ,
then Banach’s fixed point theorem ensures convergence to a fixed point x∗ if G is
contractive, however contractivity is hard to prove directly for complicated map-
pings. In 1D, Corollary 3 gave a simple sufficient condition for local contractivity in
the form of the assumption |g'(x_*)| < 1 and continuity of g'. One can ask what is an
analogous condition in RN . Obviously, if G is (locally) contractive in some norm,
then Banach’s fixed point theorem is valid and xn → x∗ . However contractivity is
norm-dependent, hence G may be contractive in some norm, but not in another and
choosing the correct norm may be a difficult task. In terms of Corollary 3, we might
ask if the norm of G'(x_*) is less than one. But again – which matrix norm should we
take? Obviously, there has to be some norm-independent criterion – Ostrowski’s
theorem states that we should look at the spectral radius ρ of G0 .
Theorem 31 (Ostrowski). Let x∗ be the fixed point of G. Let G be differentiable at
x∗ . If ρ(G0 (x∗ )) < 1 then xn → x∗ for x0 sufficiently close to x∗ .
Proof. The full proof can be found in [OR70]. Here we only indicate the basic idea.
The basis is to construct a suitable norm in which G is (locally) contractive, which
after some technicalities boils down to finding a norm in which kG0 (x∗ )k < 1. The
norm is constructed using the following lemma from linear algebra: Let A ∈ CN,N
be a matrix and let ε > 0, then there exists a consistent matrix norm k · kA,ε such
that kAkA,ε ≤ ρ(A) + ε. The rest of the proof is simple: denote A := G0 (x∗ ). Then
ρ(A) < 1, hence there exists ε > 0 such that ρ(A) + ε < 1. The mentioned lemma
gives a specific norm k · kA,ε such that kAkA,ε ≤ ρ(A) + ε < 1, i.e. a norm in which
G is contractive. We note that the norm k · kA,ε is explicitly constructed using the
Jordan decomposition of A. Also note that the definition of the norm k·kA,ε depends
on the specific choice of A, ε and ‘degenerates’ as ε → 0.
In the scalar case, Theorem 12 gives conditions on higher order convergence
rates of the fixed point iteration process. Namely, if g 0 (x∗ ) = 0, then we have at

least quadratic convergence. And if g 00 (x∗ ) 6= 0, then we have exactly quadratic


convergence. This can be generalized to systems.
Theorem 32. Let x_* be the fixed point of G. Let G be continuously differentiable on
a neighborhood of x_* and let G''(x_*) exist. Let G'(x_*) = 0. Then x_n → x_* at least
quadratically. If moreover G''(x_*)hh ≠ 0 for all 0 ≠ h ∈ R^N, then the convergence
is exactly quadratic.

3.2 Newton’s method in C


3.3 Newton’s method
Now we shall derive and analyze Newton’s method for systems of equations of the
form F (x∗ ) = 0. Similarly as in 1D, the basis will be a suitable linearization, namely
the first order Taylor expansion of F .
Lemma 33. Let f : R^N → R be twice differentiable, let x, a ∈ R^N. Then

    f(x) = f(a) + ∇f(a) · (x − a) + O(‖x − a‖²).    (3.3)

Proof. Define g(t) = f( a + t(x − a) ), then g : R → R and we have g'(t) =
∇f( a + t(x − a) ) · (x − a). Thus we can apply the standard Taylor expansion to get

    g(1) = g(0) + g'(0) + ½ g''(ξ),

where ξ ∈ [0, 1], and the four terms are f(x), f(a), ∇f(a) · (x − a) and O(‖x − a‖²), respectively.


Now we can apply (3.3) to the individual components f_i of F. Noticing that the
i-th row of F'(a) is ∇f_i(a)^T, we can write everything together as

    F(x) = F(a) + F'(a)(x − a) + R,

where ‖R‖ = O(‖x − a‖²). Similarly as in the 1D case, Newton’s method is based
on neglecting the remainder R and taking x = x_*, a = x_0:

    0 = F(x_*) ≈ F(x_0) + F'(x_0)(x_* − x_0).

Finally, we express the desired x_*:

    x_* ≈ x_0 − (F'(x_0))^{−1} F(x_0).
Definition 34 (Newton’s method). Let x0 be given. Newton’s method consists
of the iterative procedure
    x_{n+1} = x_n − (F'(x_n))^{−1} F(x_n).    (3.4)
In numerical mathematics, there are only a handful of cases, when one actually
needs to approximate a matrix inverse. In formula (3.4) we certainly do not need to
invert F' just to multiply the inverse by the vector F(x_n). Thus it is useful to reformulate Newton’s
method using the solution of a linear-algebraic system, which can be obtained by
multiplying (3.4) by F 0 (xn ).

Definition 35 (Newton’s method – alternative form). Let x0 be given. Denote the


Newton update as ∆xn := xn+1 − xn . Then the alternative form of Newton’s
method consists of the iterative procedure: Given xn , find ∆xn and update xn such
that
    F'(x_n)∆x_n = −F(x_n),
    x_{n+1} = x_n + ∆x_n.    (3.5)
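In Python with NumPy, one iteration of (3.5) is a single call to a linear solver; the
following sketch applies it to the small system from Exercise 14 below (the starting
point and stopping test are illustrative):

    import numpy as np

    def F(x):
        return np.array([x[0]**3 + x[1] - 1.0,
                         x[1]**3 - x[0] - 1.0])

    def dF(x):                                  # the Jacobi matrix of F
        return np.array([[3.0 * x[0]**2, 1.0],
                         [-1.0, 3.0 * x[1]**2]])

    x = np.array([0.5, 0.5])
    for _ in range(20):
        dx = np.linalg.solve(dF(x), -F(x))      # solve F'(x_n) dx_n = -F(x_n)
        x = x + dx
        if np.linalg.norm(dx) < 1e-12:
            break
    print(x)                                    # converges to (0, 1), which indeed solves the system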

Remark 36. There are specific situations when Newton’s method is more practical
in the original form (3.4) than (3.5). Consider for example the situation, when
we have a nonlinear system of two equations for two unknowns with a prescribed
constant right-hand side and suppose we need to quickly solve it many times with
different right-hand sides. The formula for the inverse of a 2 by 2 matrix is simple
and if we explicitly evaluate formula (3.4), it may happen that many cancellations
and simplifications occur, resulting in a simple closed formula for the function G
to be iterated, without the need to solve the linear algebraic systems from (3.5).
Or perhaps we are in a (rare) situation, when we actually know (F 0 )−1 or its good
approximation. This might happen when F 0 has a very simple structure or when we
have some additional information from the problem itself (its derivation, the physical
model it describes, etc.).

3.3.1 Affine invariance and contravariance


Here we describe an interesting and important property of Newton’s method – affine
invariance and contravariance. The basic question is what happens when we perform
a simple linear or affine transform of the system F (x) = 0. This could correspond
e.g. to a scaling and translation of the unknowns or scaling and permutation of the
equations. Ideally, one hopes that such operations do not fundamentally change the
behavior of Newton’s method – for example that swapping the first two equations
in the system does not cause Newton’s method to converge more slowly or cease
to converge at all. Then the performance of the method would rely on such ad hoc
things as proper scaling of the equations. As we shall prove in this section, Newton’s
method is ‘invariant’ under such transforms.
Let A, B ∈ RN,N be given regular matrices and let b ∈ RN be a given vector. We
transform the original system F (x) = 0 to the new system G(y) = 0, where the new
and old variables and systems are related by

G(y) = AF (By + b) and x = By + b. (3.6)

This means that we are transforming both the source and target spaces RN by an
affine and linear mapping, respectively. We note that the reason we do not transform
affinely also in the target space, i.e. G = AF + a for some vector a ∈ RN , is that
this would fundamentally change the problem, as we would no longer seek the root
G(y) = 0 but the solution of G(y) = a.
Now we use Newton’s method to solve both the problems. Newton for F (x) = 0
reads:
F 0 (xn )∆xn = −F (xn ), xn+1 = xn + ∆xn . (3.7)

Newton for G(y) = 0 reads

G0 (yn )∆yn = −G(yn ), yn+1 = yn + ∆yn ,

which can be rewritten using (3.6) and the rule for differentiating composite func-
tions as
AF 0 (Byn + b)B∆yn = −AF (Byn + b), yn+1 = yn + ∆yn .
Since A is regular, we can eliminate it from the equation for the update ∆yn to get

F 0 (Byn + b)B∆yn = −F (Byn + b). (3.8)

This means that Newton’s method for G(y) = 0 is independent of A, i.e. it is


independent of the transformation of the target space. This property is called affine
invariance and we have just proven the following lemma:
Lemma 37. Newton’s method is affine invariant: The sequence yn , n ∈ N, from
Newton’s method for G is independent of A.
The other property we shall prove is affine contravariance, which is defined
as follows. Let us solve F (x) = 0 and G(y) = 0 by Newton’s method starting from
the initial conditions x0 and y0 , respectively, which satisfy x0 = By0 +b. This ensures
that the two versions of Newton start from the ‘same’ initial point, only transformed
appropriately via (3.6). Affine contravariance then means that the entire trajectories
of the two versions of Newton are also related via (3.6), i.e. that xn = Byn + b for
all n.
Lemma 38. Newton’s method is affine contravariant: Let x0 = By0 + b, then
xn = Byn + b for all n, where xn and yn are generated by Newton’s method for
finding the roots of F and G, respectively.
Proof. We only prove the case of n = 1, the rest is obtained similarly by induction.
The first step of (3.7), i.e. Newton’s method for F , is

F 0 (x0 )∆x0 = −F (x0 ), x1 = x0 + ∆x0 .

Substituting x0 = By0 + b into the first equation for ∆x0 gives

F 0 (By0 + b)∆x0 = −F (By0 + b). (3.9)

But this is the same equation as (3.8), i.e. Newton for G – only the solution to (3.8)
is B∆y0 while the solution of (3.9) is ∆x0 . Therefore these two solutions must be
equal and we have ∆x0 = B∆y0 (the matrix F 0 (x0 ) is tacitly assumed to be regular,
since x1 would otherwise be undefined). Now if we update x0 , we get

x1 = x0 + ∆x0 = By0 + b + B∆y0 = B(y0 + ∆y0 ) + b = By1 + b,

so that x1 = By1 + b, which we wanted to prove.
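
As a quick numerical check of the two lemmas, the following Python sketch runs Newton's
method for F and for the transformed system G(y) = AF (By + b) side by side and verifies
that xn = Byn + b up to rounding errors. The particular test system and the matrices A, B
and vector b are illustrative choices made only for this sketch, not anything prescribed by
the text above.

import numpy as np

# Check of affine invariance/contravariance of Newton's method on a toy system.
def F(x):
    return np.array([x[0]**2 + x[1]**2 - 4.0, x[0]*x[1] - 1.0])

def dF(x):
    return np.array([[2*x[0], 2*x[1]],
                     [x[1],   x[0]]])

A = np.array([[2.0, 1.0], [0.0, 3.0]])    # regular matrix acting on the equations
B = np.array([[1.0, -1.0], [2.0, 1.0]])   # regular matrix acting on the unknowns
b = np.array([0.5, -0.25])

def G(y):
    return A @ F(B @ y + b)

def dG(y):
    return A @ dF(B @ y + b) @ B          # chain rule

x = np.array([2.0, 0.5])                  # x0
y = np.linalg.solve(B, x - b)             # y0 chosen so that x0 = B y0 + b

for n in range(8):
    x = x - np.linalg.solve(dF(x), F(x))  # Newton for F
    y = y - np.linalg.solve(dG(y), G(y))  # Newton for G
    print(n, np.linalg.norm(x - (B @ y + b)))   # stays at rounding-error level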


Remark 39. Newton's method has both the affine invariance and contravariance properties;
however, this is a rare occurrence. Other methods usually have one property or the
other, but not both – see for example the ‘good’ and ‘bad’ Broyden methods from
Section 3.4.

Exercises

Exercise 13 (Quadratic convergence of Newton via Ostrowski).


Consider Newton’s method in RN , where we iterate the function

G(x) = x − (F 0 (x))−1 F (x).

Use Ostrowski’s Theorem 31 to show that the iterates will converge for x0 sufficiently
close to x∗ . Verify the criterion for quadratic convergence.
Hint: We need to show that G0 (x∗ ) = 0. If one is not comfortable with computing
G0 directly, one can look at its individual entries ∂gi /∂xj . Also, in order to avoid
differentiating the matrix inverse, consider calculating G0 (x∗ ) for the more general
function G(x) = x − B(x)F (x), where B(x) is some matrix.

Exercise 14 (Simple 2 by 2 system).


Consider the system of equations

x3 + y = 1,
y 3 − x = 1.

Find the unique real solution to this system using Newton’s method and also ana-
lytically. How robust is the convergence of the method with respect to the choice of
(x0 , y0 )?
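
For experimenting with the robustness, the following minimal Python sketch of Newton's
method for this particular system may be used; the starting point and the iteration limit are
arbitrary choices, and the computed result should be compared with the solution found
analytically.

import numpy as np

def F(v):
    x, y = v
    return np.array([x**3 + y - 1.0, y**3 - x - 1.0])

def J(v):
    x, y = v
    return np.array([[3*x**2, 1.0],
                     [-1.0,   3*y**2]])

v = np.array([0.5, 0.5])          # the starting point (x0, y0) - vary this
for n in range(50):
    dv = np.linalg.solve(J(v), -F(v))
    v = v + dv
    if np.linalg.norm(dv) < 1e-14:
        break
print(v, np.linalg.norm(F(v)))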

3.3.2 Local quadratic convergence of Newton’s method


Here we will prove local quadratic convergence of Newton’s method for systems of
equations. The theorem we prove is a simplified version of the so-called Kantorovich
theorem. We will need to prepare two auxiliary tools – first we prove an estimate
of inverses of perturbed operators. We note that this lemma can be immediately
extended to Banach spaces, however we state the result in finite dimension.

Lemma 40 (Banach perturbation lemma). Let A ∈ RN,N with kAk < 1 for some
consistent matrix norm k · k. Then (I + A)−1 exists and
k(I + A)−1 k ≤ 1/(1 − kAk). (3.10)

Proof. First we prove existence of the inverse. Let x ∈ RN , x ≠ 0. Then the reverse
triangle inequality gives us

k(I + A)xk ≥ kxk − kAxk ≥ (1 − kAk)kxk > 0,

therefore I + A is not singular, hence it has an inverse.



Now we can estimate the inverse. Since k · k is a consistent norm, we have


1 = kIk = k(I + A)(I + A)−1 k = k(I + A)−1 + A(I + A)−1 k
≥ k(I + A)−1 k − kAkk(I + A)−1 k = (1 − kAk) k(I + A)−1 k.

If we express k(I + A)−1 k from the inequality above, we immediately get (3.10).
Next, we need an analogue to Lemma 16, the substitute for the remainder of
Taylor’s polynomial.
Lemma 41. Let F : D ⊂ RN → RN be differentiable on a convex open set D and
let F 0 be γ-Lipschitz on D. Then for all x, y ∈ D

kF (y) − F (x) − F 0 (x)(y − x)k ≤ (γ/2) ky − xk2 . (3.11)

Proof. The Mean value theorem (Lemma 30) gives

F (y) − F (x) = ∫_0^1 F 0 (x + t(y − x))(y − x) dt.

Therefore, by Lipschitz continuity of F 0 ,

kF (y) − F (x) − F 0 (x)(y − x)k = k ∫_0^1 [F 0 (x + t(y − x)) − F 0 (x)](y − x) dt k
≤ ∫_0^1 γ t ky − xk2 dt = (γ/2) ky − xk2 .

Theorem 42 (Local quadratic convergence of Newton). Let F : D ⊂ RN → RN be
differentiable on a convex open set D. Let the following be satisfied:
1. F has a root x∗ ∈ D,
2. F 0 (x∗ ) is invertible and kF 0 (x∗ )−1 k ≤ β < ∞,
3. F 0 is γ-Lipschitz on D,
4. Kantorovich condition: x0 ∈ D is such that βγkx0 − x∗ k ≤ 1/2.
Then Newton's method starting from x0 converges quadratically to x∗ in the sense
that ken+1 k ≤ βγken k2 for all n, where en = xn − x∗ .
Proof. We only consider the case n = 0, the rest follows by induction. We split the
proof into two parts. First, we need to prepare an estimate of kF 0 (x0 )−1 k.
1. Estimate of kF 0 (x0 )−1 k. We have an estimate for kF 0 (x∗ )−1 k (assumption 2)
and we produce the desired estimate by Banach's perturbation Lemma 40. We have

kF 0 (x0 )−1 k = k[F 0 (x∗ ) + F 0 (x0 ) − F 0 (x∗ )]−1 k
= k[F 0 (x∗ ) (I + F 0 (x∗ )−1 (F 0 (x0 ) − F 0 (x∗ )))]−1 k
≤ kF 0 (x∗ )−1 k k[I + F 0 (x∗ )−1 (F 0 (x0 ) − F 0 (x∗ ))]−1 k, (3.12)

where kF 0 (x∗ )−1 k ≤ β and we set A := F 0 (x∗ )−1 (F 0 (x0 ) − F 0 (x∗ )).

Now we need to estimate kAk in order to apply Lemma 40:


kAk = kF 0 (x∗ )−1 (F 0 (x0 ) − F 0 (x∗ ))k
≤ kF 0 (x∗ )−1 k kF 0 (x0 ) − F 0 (x∗ )k (3.13)
≤ βγkx0 − x∗ k ≤ 1/2,

due to assumptions 2, 3, and 4 (in that order). If we substitute (3.13) into (3.12)
via Lemma 40, we get

kF 0 (x0 )−1 k ≤ β k(I + A)−1 k ≤ β/(1 − kAk) ≤ β/(1 − 1/2) = 2β. (3.14)

We note that the estimate (3.14) makes intuitive sense: If, by assumption 2, we
estimate F 0 (x∗ )−1 by β, then at a nearby point F 0 (x0 )−1 can be estimated a little
bit worse, by 2β.
2. Error estimate. Now we can estimate e1 = x1 − x∗ . By expressing x1 from
Newton's method and adding the ‘smart zero’ F (x∗ ), we get

x1 − x∗ = x0 − x∗ − F 0 (x0 )−1 [F (x0 ) − F (x∗ )]
= F 0 (x0 )−1 [F 0 (x0 )(x0 − x∗ ) − F (x0 ) + F (x∗ )],

where kF 0 (x0 )−1 k ≤ 2β due to estimate (3.14) and
kF 0 (x0 )(x0 − x∗ ) − F (x0 ) + F (x∗ )k ≤ (γ/2)kx0 − x∗ k2 by Lemma 41. Thus we have
proved that ke1 k ≤ βγke0 k2 . The estimate of en for general n can be obtained similarly,
by induction.
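
The quadratic estimate of Theorem 42 can also be observed numerically. The following
Python sketch (the test problem with a known root is an illustrative choice made only for
this sketch) prints the ratios ken+1 k/ken k2 , which should remain bounded as predicted.

import numpy as np

def F(v):
    x, y = v
    return np.array([x**2 + y**2 - 4.0, x*y - 1.0])

def J(v):
    x, y = v
    return np.array([[2*x, 2*y], [y, x]])

# root of the test problem, known in closed form: x* = sqrt(2 + sqrt(3)), y* = 1/x*
xs = np.sqrt(2.0 + np.sqrt(3.0))
xstar = np.array([xs, 1.0/xs])

x = np.array([2.0, 0.3])                       # x0 reasonably close to xstar
e_old = np.linalg.norm(x - xstar)
for n in range(6):
    x = x - np.linalg.solve(J(x), F(x))        # Newton step
    e_new = np.linalg.norm(x - xstar)
    print(n, e_new, e_new / e_old**2)          # the last column stays bounded
    if e_new < 1e-12:                          # stop before rounding errors dominate
        break
    e_old = e_new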

3.3.3 Variations on Newton’s method


Inexact Newton
In practice, we do not solve the linear systems for the Newton update F 0 (xn )∆xn =
−F (xn ) from (3.5) exactly, but we apply some iterative method (preconditioned
conjugate gradients, GMRES, etc.). Thus we solve the linear systems with some
error. Specifically, we seek ∆xn such that the relative residual of the linear system
is small in some sense, for example such that
kF 0 (xn )∆xn + F (xn )k ≤ ηn kF (xn )k
with some tolerance ηn . If one analyzes the error of the resulting inexact Newton
method, one obtains the error inequality
ken+1 k ≤ C(ken k + ηn )ken k. (3.15)
We note that this is essentially the same estimate as (2.21) in the proof of Theorem
25 (here ηn plays a similar role as |hn |) and the proof of these two estimates follows
essentially the same lines. The inequality (3.15) also has similar implications as in
Theorem 25: If ηn → 0 then we get superlinear convergence, etc. We note that
the residual condition above and the resulting estimate (3.15) constitute a ‘black box’
result independent of the iterative method used for the linear systems. Of course, there
are many papers written on the analysis of specific combinations, such as Newton-CG,
applied to specific problems.
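
The behavior described by (3.15) can be illustrated without committing to any particular
linear solver: in the following Python sketch the exact Newton step is perturbed so that the
relative residual of the linear system equals exactly the prescribed tolerance ηn. The test
problem and the two choices of ηn are assumptions made only for this illustration.

import numpy as np

def F(v):
    x, y = v
    return np.array([x**2 + y**2 - 4.0, x*y - 1.0])

def J(v):
    x, y = v
    return np.array([[2*x, 2*y], [y, x]])

def inexact_newton(x0, eta, nsteps=10):
    x = np.array(x0, dtype=float)
    rng = np.random.default_rng(0)
    for n in range(nsteps):
        Fx, Jx = F(x), J(x)
        dx = np.linalg.solve(Jx, -Fx)             # exact Newton step
        r = rng.standard_normal(dx.size)          # perturbation direction
        r *= eta(n) * np.linalg.norm(Fx) / np.linalg.norm(Jx @ r)
        dx = dx + np.linalg.solve(Jx, r)          # now ||J dx + F|| = eta_n ||F||
        x = x + dx
        print(n, np.linalg.norm(F(x)))
    return x

inexact_newton([2.0, 0.3], eta=lambda n: 1e-1)         # fixed eta_n: linear decrease
inexact_newton([2.0, 0.3], eta=lambda n: 2.0**(-n-2))  # eta_n -> 0: superlinear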

Chord method
The idea of the chord method is to reuse the matrix in the linear system for the
Newton update (3.5) for all or several iterations. So instead of solving (3.5), we
solve
F 0 (x0 )∆xn = −F (xn ) (3.16)
in each Newton iteration. This can be advantageous from several points of view: we
save computational time evaluating the matrix F 0 (xn ) and if we spend more time
on pre-computing a more sophisticated expensive preconditioner for F 0 (x0 ), we can
solve systems (3.16) much more efficiently. Perhaps, we could even go as far as
pre-computing an LU or Cholesky factorization of F 0 (x0 ). The price we pay is that
Newton with the update given by (3.16) converges only linearly and is suitable only
for ‘mildly’ nonlinear problems. One can then balance the mentioned advantages
and disadvantages by updating the system matrix F 0 every K iterations for some
K. Such methods can be analyzed and, for example, if we take K = 2, the resulting
convergence rate is √3 ≈ 1.732.
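
A minimal Python sketch of the chord method follows; the test problem, the starting point
and the use of an LU factorization of F 0 (x0 ) as the ‘expensive preparation step’ are
illustrative assumptions. The printed residuals should decrease only linearly.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def F(v):
    x, y = v
    return np.array([x**2 + y**2 - 4.0, x*y - 1.0])

def J(v):
    x, y = v
    return np.array([[2*x, 2*y], [y, x]])

x = np.array([2.0, 0.5])              # initial guess x0
lu, piv = lu_factor(J(x))             # factor F'(x0) once and reuse it
for n in range(30):
    dx = lu_solve((lu, piv), -F(x))   # solve F'(x0) dx = -F(x_n), cf. (3.16)
    x = x + dx
    print(n, np.linalg.norm(F(x)))
    if np.linalg.norm(dx) < 1e-13:
        break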

Newton with differences


We discussed the use of difference approximations for the derivative in Newton's
method in Section 2.4.1. Here the situation is more complicated, since we would
have to approximate the whole Jacobi matrix F 0 using differences, i.e. approximate
each partial derivative ∂fi /∂xj for all i, j = 1, . . . , N , which is clearly impractical for
large N .
However, there is an elegant approach to reducing the workload. As mentioned
earlier, in practical applications, one solves the linear system for the Newton update
by some numerical method. Often this will be some Krylov subspace method, e.g.
conjugate gradients, etc. Such methods do not need the system matrix A to be
explicitly given. Instead they require the repeated evaluation of Ap for some given
vector p. In our case, such a method will need the evaluation of F 0 (xn )p for a given
p ∈ RN . But F 0 (xn )p is the directional derivative of F in the direction p, which can
be approximated straightforwardly as
F 0 (xn )p ≈ (1/h) [F (xn + hp) − F (xn )]
for h small. This requires only one extra evaluation of F (xn + hp) and is therefore
cheap compared to approximating the whole Jacobi matrix F 0 . In the end we do not
even need a data structure to store matrices, only vectors, and such methods are
called matrix-free algorithms. These can be advantageous for example in the set-
ting of finite element methods for partial differential equations, where the unknowns
are often stored not in classical vector format (arrays), but as degrees of freedom
in some strange data structure corresponding to the (unstructured) computational
mesh. Creating a data structure for a matrix based on such a mesh-based data
structure for vectors is sometimes rather technical and tedious implementation
work and can be avoided altogether with matrix-free methods.
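
The following Python sketch illustrates one matrix-free Newton step: the action of F 0 (xn )
on a vector is approximated by the difference formula above and handed to a Krylov solver
as an abstract linear operator, so no Jacobi matrix is ever formed or stored. The test
problem, the step h and the use of SciPy's GMRES are assumptions made only for this
example.

import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def F(v):
    x, y = v
    return np.array([x**2 + y**2 - 4.0, x*y - 1.0])

def newton_krylov_step(F, x, h=1e-7):
    Fx = F(x)
    def matvec(p):
        # directional derivative: F'(x) p ~ (F(x + h p) - F(x)) / h
        return (F(x + h * p) - Fx) / h
    Jop = LinearOperator((x.size, x.size), matvec=matvec, dtype=float)
    dx, info = gmres(Jop, -Fx)        # the Krylov solver only needs J-vector products
    return x + dx

x = np.array([2.0, 0.5])
for n in range(8):
    x = newton_krylov_step(F, x)
    print(n, np.linalg.norm(F(x)))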

3.4 Quasi-Newton methods


In this section we will present a generalization of the one-dimensional secant method
to systems of equations. These methods replace the matrix F 0 by matrices that are
more easily evaluated and/or inverted. The method is derived under a series of
simplifying assumptions, which we shall motivate and discuss.
The basic idea is to take the following modification of Newton’s method:

xn+1 = xn − (Bn )−1 F (xn ), (3.17)

where the matrix Bn will (a) be additively updated in each iteration and (b) should
somehow correspond to the secant method in 1D. Concerning the secant method,
we can write it in the form
bn = (f (xn ) − f (xn−1 ))/(xn − xn−1 ), xn+1 = xn − (bn )−1 f (xn ). (3.18)

The method (3.17) is clearly a generalization of the second equation in (3.18) to RN .


However, it is unclear how to directly generalize the first equation in (3.18) to the
case when x, F ∈ RN and B ∈ RN,N . The key is to reformulate the relation for bn
to the form bn (xn − xn−1 ) = f (xn ) − f (xn−1 ). This form can be directly generalized
for matrices and vectors to get the so-called quasi-Newton condition (or secant
condition)
Bn (xn − xn−1 ) = F (xn ) − F (xn−1 ). (3.19)
We note that this condition does not determine the matrix Bn uniquely, since (3.19)
represents N equations for the N 2 unknown entries of Bn . It is therefore necessary
to add additional conditions on Bn to obtain a uniquely determined Bn .

Remark 43. We note that condition (3.19) is natural for a matrix which should
play a similar role as F 0 (xn ). The Mean value theorem (Lemma 30) states that
[ ∫_0^1 F 0 (xn−1 + t(xn − xn−1 )) dt ] (xn − xn−1 ) = F (xn ) − F (xn−1 ), (3.20)

which has a similar form as (3.19). Therefore, Bn plays a similar role as the Jacobi
matrix averaged along the line between xn , xn−1 . This is not claiming that the result-
ing Bn will be an approximation of F 0 in some sense, only that it satisfies similar
requirements.

Broyden’s method
Now we must supply additional requirements on Bn to get a uniquely determined
method. The first such method was Broyden’s method published in 1965. The
basic additional assumptions are such: Let Bn be given, then Bn+1 is constructed as
follows:

1. The matrices Bn are constructed by additive updates by some matrices Cn :

Bn+1 = Bn + Cn . (3.21)

2. Cn is a rank-1 matrix:
Cn = an bTn , an , bn ∈ RN . (3.22)

3. Bn+1 also satisfies the quasi-Newton condition (3.19)


Bn+1 ∆xn = ∆Fn , (3.23)
where ∆xn = xn+1 − xn and ∆Fn = F (xn+1 ) − F (xn ).
While assumptions 1 and 3 seem natural, assumption 2 seems arbitrary. The
idea is to take Cn ‘as simple as possible’, however that is a very vague notion. A
less arbitrary explanation of the choice of rank-1 updates is Lemma 45, as we shall
see.
If we substitute (3.21) and (3.22) into (3.23), we get the quasi-Newton condition
written in terms of the yet unspecified vectors an , bn :
an bTn ∆xn = ∆Fn − Bn ∆xn . (3.24)
If we notice that bTn ∆xn is a number, we can divide (3.24) by it to obtain
an = (1/(bTn ∆xn )) (∆Fn − Bn ∆xn ). (3.25)
Therefore, it is sufficient to only specify the vector bn , because an will then be given
by formula (3.25).
So it remains to choose a suitable vector bn . There are many possibilities, here
we present two of them. The basic idea is that in the current iteration, the update
Cn should be based on the current information, which is the points xn , xn+1 and the
values of F at these points. Therefore, under ideal circumstances, we are building a
good approximation along the line joining xn , xn+1 but have no information about
the problem in other directions. In other words, Cn should work as expected ‘along
the line joining xn , xn+1 ’, i.e. when applied to the vector ∆xn , but it should do noth-
ing for other vectors, especially vectors in the perpendicular direction. Altogether,
the effect of Cn on vectors y ∈ RN such that y T ∆xn = 0 should be zero (we try not
to ‘invent’ some information about the behavior of the problem where we do not
know anything). Written mathematically, we require
Cn y = an bTn y = 0 for all y such that y T ∆xn = 0.
The simplest natural choice satisfying the requirement is bn = ∆xn , since then
Cn y = an ∆xTn y = an y T ∆xn = 0,
since y T ∆xn = 0. This choice of bn = ∆xn leads to the so-called Good Broyden
method. A more general possibility to ensure (3.25) is the choice bn = D∆xn for
some matrix D – a specific choice of D will lead to the so-called bad Broyden
method.
If we take bn = ∆xn in (3.25), we get
Cn = an bTn = (1/k∆xn k2 ) (∆Fn − Bn ∆xn ) ∆xTn . (3.26)
Thus we can write down the final algorithm:

Definition 44 (‘Good’ Broyden method ). Given x0 , B0 , compute for n = 0, 1, . . .

xn+1 = xn − (Bn )−1 F (xn ),
Bn+1 = Bn + (1/k∆xn k2 ) (∆Fn − Bn ∆xn ) ∆xTn ,

where ∆xn = xn+1 − xn and ∆Fn = F (xn+1 ) − F (xn ).

Remarks:

• The algorithm needs only one evaluation of F per iteration, similarly as the
secant method in 1D.

• The choice bn = BTn ∆xn gives the so-called ‘bad’ Broyden method.

• Broyden's method has superlinear convergence, however, unlike Newton's
method, it does not have a uniquely defined convergence rate for all functions
F with given regularity. One can show that the best convergence rate in RN
is the unique positive root of the equation ρN+1 − ρN − 1 = 0 (compare with
the secant method for N = 1), which can be shown to behave approximately
as the N -th root of N as N → ∞. This implies that as N → ∞, the best convergence rate
tends to 1 from above. The decrease of the convergence rate with growing
dimension is intuitive: The secant equation determines Bn uniquely only for
N = 1. As N grows, so does the gap between N (number of equations in the
quasi-Newton condition) and N 2 (components of Bn ). To bridge this larger
and larger gap, we needed to ‘make up’ simplifying assumptions (Cn is rank 1,
etc.) which are however somewhat artificial, thus decreasing the convergence
rate. The highest convergence rate is in 1D, where these additional artificial
assumptions are not needed.

• If F (x) = Ax − b is a linear system, then Broyden is a finite method – for


any x0 we get the exact solution in at most N iterations. This is worse than
Newton (exact solution in the first iteration), however it is a finite method
nonetheless.

• ‘Good’ Broyden is affine invariant but not contravariant. For ‘Bad’ Broyden
it is exactly the opposite.

• The terminology ‘good’ and ’bad’ Broyden method is misleading. It origi-


nates from Broyden’s original paper, where several numerical experiments are
considered. The ’good’ version worked better then the ‘bad’ one on these
examples, so Broyden introduced these names. As it turned out later, the
situation is not so clear and for other cases the ‘bad’ variant can actually be
better. There are papers on this subject.

Now we return to the question of why we chose the update to be rank 1. We
motivated this by simplicity, but this is a relative concept – why not diagonal or
unitary or circulant, etc.? A possible answer is given by the following lemma: The
rank 1 update turns out to be the smallest possible update such that the updated

matrix satisfies the quasi-Newton condition. Here ‘smallest’ means ‘smallest in the
Frobenius norm’. Since the Frobenius norm is the L2 norm of all components of
the matrix, one can say that the rank 1 update is the smallest elementwise update
satisfying the Quasi-Newton condition. This means that the rank 1 update is the
most conservative – make the smallest possible change to Bn so that we satisfy
the basic assumption, the Quasi-Newton condition. This is reasonable, because, as
noted earlier, all the other assumptions are to some extent artificial and we want to
minimize their influence.

Lemma 45. Let ∆xn , ∆Fn ∈ RN be given. Then the matrix Cn given by (3.26) is
the smallest update of Bn in the Frobenius norm, such that Bn+1 = Bn + Cn satisfies
the quasi-Newton condition Bn+1 ∆xn = ∆Fn .

Proof. Let C̃n be another update such that B̃n+1 = Bn + C̃n also satisfies B̃n+1 ∆xn =
∆Fn . Then

kCn kF = (1/k∆xn k2 ) k(∆Fn − Bn ∆xn ) ∆xTn kF
= (1/k∆xn k2 ) k(B̃n+1 ∆xn − Bn ∆xn ) ∆xTn kF
= (1/k∆xn k2 ) k(B̃n+1 − Bn ) ∆xn ∆xTn kF
≤ kB̃n+1 − Bn kF (1/k∆xn k2 ) k∆xn ∆xTn kF
= kC̃n kF ,

which we wanted to prove. Here we have used the property of the Frobenius norm
that kuk2 = kuuT kF for any u ∈ RN , which can be proved directly:
kuk2 = Σi u2i = ( (Σi u2i )2 )1/2 = ( Σi,j u2i u2j )1/2 = ( Σi,j (ui uj )2 )1/2 = kuuT kF ,

where all sums run over the range 1, . . . , N .

Finally, we come to the last reason why a rank 1 update is a wise choice.
Using the so-called Sherman-Morrison formula, one can directly compute the inverse
(Bn+1 )−1 as an update of (Bn )−1 .

Lemma 46 (Sherman-Morrison formula). Let A ∈ RN,N be an invertible matrix and
let u, v ∈ RN . Then A + uv T is invertible iff 1 + v T A−1 u ≠ 0 and

(A + uv T )−1 = A−1 − (A−1 uv T A−1 )/(1 + v T A−1 u). (3.27)
Lemma 46 allows us to simplify Broyden's method significantly, as we can now
directly compute (Bn+1 )−1 as a rank-1 update of (Bn )−1 . A direct application of (3.27) to the ‘good’

Broyden method with u = an , v = bn gives us the following algorithm:

xn+1 = xn − (Bn )−1 F (xn ),
(Bn+1 )−1 = (Bn )−1 + ( (∆xn − (Bn )−1 ∆Fn ) ∆xTn (Bn )−1 ) / ( ∆xTn (Bn )−1 ∆Fn ), (3.28)

which requires only O(N 2 ) operations per iteration without the necessity to solve
linear systems with the matrix Bn .
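
A hedged Python sketch of this inverse-update form of the ‘good’ Broyden method is given
below; the test problem, the choice B0 = I and the iteration limits are illustrative
assumptions made only for this sketch.

import numpy as np

def F(v):
    x, y = v
    return np.array([x**2 + y**2 - 4.0, x*y - 1.0])

def broyden_good(F, x0, nmax=30, tol=1e-12):
    x = np.array(x0, dtype=float)
    Binv = np.eye(x.size)                   # (B0)^{-1} = I
    Fx = F(x)
    for n in range(nmax):
        dx = -Binv @ Fx                     # x_{n+1} = x_n - (B_n)^{-1} F(x_n)
        x = x + dx
        Fnew = F(x)
        dF = Fnew - Fx
        BinvdF = Binv @ dF
        # Sherman-Morrison update of the inverse, formula (3.28)
        Binv = Binv + np.outer(dx - BinvdF, dx @ Binv) / (dx @ BinvdF)
        Fx = Fnew
        print(n, np.linalg.norm(Fx))
        if np.linalg.norm(dx) < tol:
            break
    return x

broyden_good(F, [2.0, 0.5])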
Remarks:
• In the algorithm (3.28), we never need the matrices Bn , we only use (Bn )−1 .

• The general recommendation is to take B0 = I, which is somewhat surprising,


since in general I has nothing in common with F 0 . However, Broyden very
quickly starts accumulating information about F . Alternatively, one can take
B0 as some easily invertible approximation of F 0 , for example the diagonal
part of F 0 .

• Unfortunately, the update of Bn does not preserve sparsity, so we end up with


full matrices. This can be a problem in e.g. finite element methods. There
exist sparse versions of Broyden’s method which preserve the sparsity structure
of F 0 .

• Unfortunately, the Broyden update of Bn does not preserve symmetry or positive
definiteness: if Bn is symmetric and/or positive definite, then Bn+1 in general is
not.

• The BFGS (Broyden, Fletcher, Goldfarb, Shanno) method uses the update
Bn+1 = Bn + αaaT + βbbT which automatically preserves symmetry, and after
choosing the parameters α, β, it also preserves positive definiteness. The BFGS
method is one of the most popular methods in optimization, where the role
of the Jacobi matrix F 0 is taken by the Hessian matrix H of the minimized
functional. The point is that Hessian matrices are naturally symmetric, while
there is no reason for symmetry of Jacobi matrices. Thus the BFGS is more
popular in the optimization community.

3.5 Continuation methods


As we have seen throughout this text, the main disadvantage of more sophisticated
methods is their local convergence. Here we present one possible ‘globalization’
strategy for enlarging the region of convergence, namely the basic
strategy of continuation (or homotopy) methods.
The basic idea is to find or introduce an auxiliary parameter on which the prob-
lem depends and which controls how ‘hard’ the problem is. Sometimes this is natu-
rally contained in our equations – a parameter which for some values gives a much
simpler problem, while we want to compute the solution to a hard problem for some
other value of the parameter. For example, in computational fluid dynamics, the
viscosity (or Reynolds number) is such a parameter: for Reynolds number equal to

zero, we get the Stokes problem which is linear and thus much simpler than high
Reynolds flows with turbulence etc. Perhaps our specific problem contains such a
parameter naturally. If not, we will introduce such a parameter artificially.
In any case, we will call the auxiliary parameter t and we assume that for t = 0
we get a simpler (e.g. linear) problem F0 (x) = 0 and for t = 1, we get the original
problem F (x) = 0 which was to be solved. So instead of a single function F : D ⊂
RN → RN we have a class of problems H : D × [0, 1] ⊂ RN +1 → RN such that

H(x, 0) = F0 (x), which is a simple problem,
H(x, 1) = F (x), which is the original problem. (3.29)

If F does not naturally contain such a tuning parameter t, we can take for example
the convex combination

H(x, t) = tF (x) + (1 − t)F0 (x), (3.30)

where F0 is some ‘simple’ mapping, for which we ideally know the solution to F0 (x) =
0. However, F0 should not be chosen arbitrarily, as the resulting method would have
problems. The function F0 should somehow be related to F . One possible choice is
to choose some x0 and take F0 (x) = F (x) − F (x0 ). Then x0 is trivially the solution
to F0 (x) = 0. Taking such an F0 in (3.30) results in

H(x, t) = F (x) + (t − 1)F (x0 ). (3.31)

This definition of H clearly satisfies the basic assumption (3.29), since we know the
solution x0 to H(x, 0) = 0. In the following we will always keep (3.31) in mind as a
basic choice of H.
Now instead of the single problem F (x) = 0 we will be solving a whole class of
problems
H(x, t) = 0, ∀t ∈ [0, 1]. (3.32)
We assume that (3.32) has a solution x(t) for any t ∈ [0, 1] and that it depends
continuously on t. In other words, we assume the existence of a continuous mapping
x : [0, 1] → RN such that

H(x(t), t) = 0, ∀t ∈ [0, 1].

In this case, x(·) defines a curve in RN such that x(0) = x0 is prescribed and
x(1) = x∗ is the desired solution to our original problem F (x∗ ) = 0.
Remark 47. The standard implicit function theorem gives local existence and unique-
ness of x(·). The fundamental assumption (apart from smoothness) is the regularity of
∂x H – the Jacobi matrix of H with respect to the x-variables. This can be improved
to get a global existence result under additional assumptions. For example, we get
the following.
Lemma 48. Let F : RN → RN be continuously differentiable and let kF 0 (x)−1 k ≤
β < ∞ for all x ∈ RN . Then for all x0 ∈ RN there exists a continuously differentiable
mapping x : [0, 1] → RN such that H(x(t), t) = 0 for all t ∈ [0, 1], where H is defined
by (3.31).

Now we come to the basic idea of continuation methods. We want to approximate


the curve x(·) from its known starting point x0 to its unknown endpoint x∗ . To this
end we choose a partition 0 = t0 < t1 < . . . < tM = 1 of [0, 1] and we solve the
sequence of intermediate problems

H(x, ti ) = 0, i = 1, . . . , M.

We will denote the solution of the i-th problem as xi , so that H(xi , ti ) = 0. Now
since x(·) is continuous, if ti − ti−1 is sufficiently small then xi−1 is close to xi .
So close in fact that xi−1 can be used as an initial point in a locally convergent
method for xi . In other words, each xi has a neighborhood where e.g. Newton’s
method will converge. If these neighborhoods overlap sufficiently, then xi−1 (or its
approximation) can be used as a good initial guess in Newton’s method for solving
H(x, ti ) = 0 with the solution xi . The whole process then goes as follows: start with
x0 and do several iterations of Newton for x1 . Then use the current approximation
of x1 and do several iterations of Newton for x2 . And so on. In the end xM −1 is
a good initial approximation for xM = x∗ . The point is that x0 is a bad initial
guess in Newton for xM , but we use the intermediate solutions xi to bridge this
gap via (possibly very small) neighborhoods of convergence of all the individual xi .
Of course, in practice we only do a finite number Ni of iterations of Newton for
each xi . The result is that we need two indices: xi,n , where the first index denotes
the problem solved and the second one denotes the Newton iteration index. The
resulting algorithm is this:
General continuation algorithm with Newton’s method (3.33)
Given x0 , set x1,0 = x0 .
For i = 1, . . . , M − 1:
For n = 0, . . . , Ni − 1:
xi,n+1 = xi,n − [∂x H(xi,n , ti )]−1 H(xi,n , ti ), (Newton for H(x, ti ) = 0)
xi+1,0 = xi,Ni
For n = 0, . . . (until stopping criterion is satisfied) do
xM,n+1 = xM,n − [∂x H(xM,n , 1)]−1 H(xM,n , 1). (Newton for H(x, 1) = 0)

Here the last line is simply just Newton’s method for F (x) = 0. Since we do not
need to compute the auxiliary xi to huge precision, one could simply take Ni = 1, i.e.
do just one iteration of Newton for each xi , while taking a finer partition of [0, 1]. In
this case we can drop the second iteration index and the previous algorithm (3.33)
with the choice of H given by (3.31) reduces to

xn+1 = xn − F 0 (xn )−1 [F (xn ) + (tn − 1)F (x0 )], n = 0, 1, . . . , M − 1,
xn+1 = xn − F 0 (xn )−1 F (xn ), n = M, M + 1, . . . .

We note that the resulting algorithm is very similar to Newton's method, with a
slight change of the Newton formula in the first M iterations. This will also be
the case for other similar methods, which result in some simple modification
of Newton's formula with the ultimate goal of making the method converge more globally.
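
To make the previous paragraph concrete, here is a hedged Python sketch of the
continuation algorithm with the embedding (3.31): a few Newton steps are performed for
each intermediate parameter value, and ordinary Newton finishes the job at t = 1. The test
system (built from arctan so that the derivatives stay bounded), the far-away starting point,
the partition of [0, 1] and the iteration counts are all illustrative assumptions.

import numpy as np

def F(v):
    x, y = v
    return np.array([np.arctan(x) + y, np.arctan(y) - x + 2.0])

def J(v):
    x, y = v
    return np.array([[1.0/(1.0 + x**2), 1.0],
                     [-1.0, 1.0/(1.0 + y**2)]])

x0 = np.array([30.0, -30.0])           # a deliberately far-away initial guess
Fx0 = F(x0)
x = x0.copy()
M, Ni = 20, 2                          # number of parameter values, Newton steps per t_i
for i in range(1, M):
    t = i / M
    for n in range(Ni):
        H = F(x) + (t - 1.0) * Fx0     # H(x, t_i) from (3.31)
        x = x - np.linalg.solve(J(x), H)
for n in range(10):                    # final Newton iterations for F(x) = 0
    x = x - np.linalg.solve(J(x), F(x))
print(x, np.linalg.norm(F(x)))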
Concerning the “General continuation algorithm with Newton’s method” above,
we have the following convergence result.

Theorem 49. Let H : D × [0, 1] ⊂ RN+1 → RN be continuously differentiable
with respect to x. Let there exist a continuous mapping x : [0, 1] → D solving
H(x(t), t) = 0 for all t ∈ [0, 1]. Let ∂x H(x(t), t) be regular for all t ∈ [0, 1]. Then
there exists a partition {ti }, i = 0, . . . , M , of [0, 1] and numbers Ni , i = 1, . . . , M − 1,
such that the sequence {xi,n } defined by algorithm (3.33) is well defined and
limn→∞ xM,n = x∗ .

Predictor-corrector approach
Up to now we have simply taken the current approximation of xi as an initial guess
in Newton’s method for xi+1 . The question is whether we can do better. What we
want to do is take xi , somehow produce a better initial guess x̂i+1 (predictor phase)
and use that as an initial guess for Newton (corrector phase):

1. Predictor: Given xi , find x̂i+1 close to xi+1 .

2. Corrector: Given x̂i+1 , use a numerical method to find a better approxima-


tion of xi+1 .

What we have done up to now is called classical continuation, where the


predictor phase is simply x̂i+1 = xi and the corrector phase consists of Newton’s
method. Another possibility is the following.

Tangent continuation method


The idea is to locally approximate the curve x(·) by its tangent and find x̂i+1 on the
tangent:
x̂i+1 = xi + x0 (ti )(ti+1 − ti ). (3.34)
Now the question is how to obtain x0 (·)? To this end we simply take the basic
relation H(x(t), t) = 0 and differentiate with respect to t. We get the so-called
Davidenko differential equation

∂x H(x(t), t) x0 (t) + ∂t H(x(t), t) = 0,

where ∂t H is the partial derivative of H with respect to the second variable t. If we


assume regularity of the Jacobi matrix ∂x H(x(t), t), we can express

x0 (t) = −∂x H(x(t), t)−1 ∂t H(x(t), t). (3.35)

In the case of H defined by (3.31), we get

x0 (t) = −F 0 (x(t))−1 F (x0 ),

therefore the predictor phase (3.34) reduces to

x̂i+1 = xi − F 0 (xi )−1 F (x0 )(ti+1 − ti ),

which again resembles some simple variation on Newton’s formula.
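
A hedged sketch of the tangent predictor combined with a short Newton corrector, under
the same illustrative assumptions (test system, starting point, partition, iteration counts)
as the classical continuation sketch above, could look as follows.

import numpy as np

def F(v):
    x, y = v
    return np.array([np.arctan(x) + y, np.arctan(y) - x + 2.0])

def J(v):
    x, y = v
    return np.array([[1.0/(1.0 + x**2), 1.0],
                     [-1.0, 1.0/(1.0 + y**2)]])

x0 = np.array([30.0, -30.0])
Fx0 = F(x0)
ts = np.linspace(0.0, 1.0, 21)
x = x0.copy()
for t_old, t_new in zip(ts[:-1], ts[1:]):
    # predictor (3.34): x'(t) = -F'(x)^{-1} F(x0) for the embedding (3.31)
    x = x - (t_new - t_old) * np.linalg.solve(J(x), Fx0)
    # corrector: two Newton steps for H(x, t_new) = 0
    for n in range(2):
        x = x - np.linalg.solve(J(x), F(x) + (t_new - 1.0) * Fx0)
print(x, np.linalg.norm(F(x)))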


Figure 3.1: Various problems with the continuation path x(·): 1. Escape to infinity,
2. Discontinuity, 3. Turning point, 4. Bifurcation.

Methods based on the Davidenko equation


The original goal was to approximate the endpoint x∗ of the curve x(·) given its
known initial point x0 . Since we now have a differential equation for x(·), we can
discretize the differential equation directly. For example, if we use the forward
(explicit) Euler method to discretize (3.35) on the partition {ti }M i=0 , we get the
scheme
xi+1 − xi
= −∂x H(xi , ti )−1 ∂t H(xi , ti ), i = 0, . . . M − 1,
ti+1 − ti
or after reformulation as

xi+1 = xi − (ti+1 − ti )∂x H(xi , ti )−1 ∂t H(xi , ti ).

Again, in the case of H given by (3.31), this reduces to

xi+1 = xi − (ti+1 − ti )F 0 (xi )−1 F (x0 ).

Again this looks like some variation on Newton’s method with a damping factor
(ti+1 − ti ) and F (x0 ) instead of F (xn ).
Of course one can use more sophisticated methods than Euler's method, for
example Runge-Kutta, etc. Such methods appear in the literature and can be ana-
lyzed.

We conclude this section by noting the basic obstacles in the continuation-based
approaches described above. The main problem is the nonexistence of
a continuous curve x(·) satisfying the basic assumptions used above. If H(·, ·) is
chosen poorly, one of several things can go wrong, preventing the successful
application of the methods above:
1. x(·) escapes to infinity and returns back.

2. x(·) is discontinuous.

3. x(·) undergoes a turning point.

4. x(·) undergoes a bifurcation.

These situations are depicted in 1D in Figure 3.1, in the presented order. As was
mentioned earlier, there are no universal recipes how to treat such problems and
there is an extensive literature dealing with these issues.

Exercises

Exercise 15 (Globalization of arctan).


As we have seen in Exercise 9, Newton’s method converges only locally to the
solution of arctan(x) = 0. Apply any of the continuation strategies described in this
section to get a globally convergent method. Try it out.
