
Numerical Analysis

Radu T. Trîmbiţaş
Preface

Lloyd N. Trefethen [40] proposed the following definition of Numerical Analysis:


Numerical Analysis is the study of algorithms for the problems of continuous mathematics.

The keyword here is algorithms. Although many papers do not highlight this aspect, the core of numerical analysis is devising and analyzing numerical algorithms for a given class of problems.
These are the problems of continuous mathematics. "Continuous" means here that real and complex variables are involved; its opposite is discrete. In short, one could say that numerical analysis is continuous algorithmics, as opposed to classical, that is, discrete, algorithmics.
Since real and complex numbers cannot be represented exactly on computers, they must be approximated by a finite representation. This is where rounding errors come in, and their study is clearly one of the important goals of Numerical Analysis. There have been, and still are, many opinions stating that it is the most important one. An argument supporting this idea, besides the ubiquity of error, is given by exact methods for the solution of linear algebraic systems, such as Gaussian elimination.
However, most problems of continuous mathematics cannot be solved by so-called finite algorithms, even if we assume infinite-precision arithmetic. A first example is the solution of nonlinear algebraic equations. This becomes even clearer for eigenvalue and eigenvector problems. The same conclusion extends to virtually any problem with a nonlinear term or a derivative in it – zero finding, numerical quadrature, differential equations, integral equations, optimization, and so on.
Even if rounding errors vanished, Numerical Analysis would remain. Approximating mere numbers, the task of floating-point arithmetic, is a tedious topic. The deeper business of Numerical Analysis is approximating unknowns, not knowns. Rapid convergence of approximations is the aim, and the pride of our field is that, for many problems, we have invented algorithms that converge exceedingly fast. The development of symbolic software like Maple or Mathematica diminished the importance of rounding errors without diminishing the importance of the algorithms' speed of convergence.
The above definition fails to capture some important matters: that these algorithms are implemented on computers, whose architecture may be an important part of the problem; that reliability and efficiency are paramount goals; that some numerical analysts write programs and others prove theorems¹; and, most important, that all this work is applied, applied daily and successfully to thousands of applications on millions of computers around the world. "The problems of continuous mathematics" are the problems that science and engineering are built upon; without numerical methods, science and engineering as practiced today would quickly come to a halt. They are also the problems that preoccupied most mathematicians from the time of Newton to the twentieth century. As much as pure mathematicians, numerical analysts are the heirs to the great traditions of Euler, Lagrange, Gauss and others.

Radu Tiberiu Trîmbiţaş


Cluj-Napoca, July 2003

¹ Some of them do both activities.
Contents

1 Errors and Floating Point Arithmetic 1


1.1 Numerical Problems . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Error Measuring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Propagated error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Floating-Point Representation . . . . . . . . . . . . . . . . . . . . 4
1.4.1 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4.2 Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 IEEE Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.1 Special Quantities . . . . . . . . . . . . . . . . . . . . . . 9
1.6 The Condition of a Problem . . . . . . . . . . . . . . . . . . . . . 10
1.7 The Condition of an algorithm . . . . . . . . . . . . . . . . . . . . 12
1.8 Overall error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.9 Ill-Conditioned Problems and Ill-Posed Problems . . . . . . . . . . 15
1.10 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.10.1 Asymptotical notations . . . . . . . . . . . . . . . . . . . . 15
1.10.2 Accuracy and stability . . . . . . . . . . . . . . . . . . . . 17
1.10.3 Backward Error Analysis . . . . . . . . . . . . . . . . . . . 19

2 Numerical Solution of Linear Algebraic Systems 21


2.1 Notions of Matrix Analysis . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Condition of a linear system . . . . . . . . . . . . . . . . . . . . . 27
2.3 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Factorization based methods . . . . . . . . . . . . . . . . . . . . . 35
2.4.1 LU decomposition . . . . . . . . . . . . . . . . . . . . . . 35
2.4.2 LUP decomposition . . . . . . . . . . . . . . . . . . . . . 36
2.4.3 Cholesky factorization . . . . . . . . . . . . . . . . . . . . 37
2.4.4 QR decomposition . . . . . . . . . . . . . . . . . . . . . . 39
2.5 Strassen’s algorithm for matrix multiplication . . . . . . . . . . . . 41
2.6 Iterative refinement . . . . . . . . . . . . . . . . . . . . . . . . . . 43


2.7 Iterative solution of Linear Algebraic Systems . . . . . . . . . . . . 43

3 Function Approximation 49
3.1 Least Squares approximation . . . . . . . . . . . . . . . . . . . . . 52
3.1.1 Inner products . . . . . . . . . . . . . . . . . . . . . . . . 53
3.1.2 The normal equations . . . . . . . . . . . . . . . . . . . . . 54
3.1.3 Least square error; convergence . . . . . . . . . . . . . . . 56
3.2 Examples of orthogonal systems . . . . . . . . . . . . . . . . . . . 59
3.3 Examples of orthogonal polynomials . . . . . . . . . . . . . . . . . 62
3.3.1 Legendre polynomials . . . . . . . . . . . . . . . . . . . . 62
3.3.2 First kind Chebyshev polynomials . . . . . . . . . . . . . . 64
3.3.3 Second kind Chebyshev polynomials . . . . . . . . . . . . 68
3.3.4 Laguerre polynomials . . . . . . . . . . . . . . . . . . . . 69
3.3.5 Hermite polynomials . . . . . . . . . . . . . . . . . . . . . 69
3.3.6 Jacobi polynomials . . . . . . . . . . . . . . . . . . . . . . 70
3.4 The Space H n [a, b] . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5 Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . 73
3.5.1 Lagrange interpolation . . . . . . . . . . . . . . . . . . . . 73
3.5.2 Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . 76
3.5.3 Interpolation error . . . . . . . . . . . . . . . . . . . . . . 80
3.6 Efficient Computation of Interpolation Polynomials . . . . . . . . . 83
3.6.1 Aitken-type methods . . . . . . . . . . . . . . . . . . . . . 83
3.6.2 Divided difference method . . . . . . . . . . . . . . . . . . 85
3.6.3 Multiple nodes divided differences . . . . . . . . . . . . . . 88
3.7 Convergence of polynomial interpolation . . . . . . . . . . . . . . 90
3.8 Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.8.1 Interpolation by cubic splines . . . . . . . . . . . . . . . . 95
3.8.2 Minimality properties of cubic spline interpolants . . . . . . 99

4 Linear Functional Approximation 103


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.1.1 Method of interpolation . . . . . . . . . . . . . . . . . . . 106
4.1.2 Method of undetermined coefficients . . . . . . . . . . . . 107
4.2 Numerical Differentiation . . . . . . . . . . . . . . . . . . . . . . . 107
4.3 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.3.1 The composite trapezoidal and Simpson’s rule . . . . . . . 110
4.3.2 Weighted Newton-Cotes and Gauss formulae . . . . . . . . 113
4.3.3 Properties of Gaussian quadrature rules . . . . . . . . . . . 116
4.4 Adaptive Quadratures . . . . . . . . . . . . . . . . . . . . . . . . 121
4.5 Iterated Quadratures. Romberg Method . . . . . . . . . . . . . . . 122

4.6 Adaptive Quadratures II . . . . . . . . . . . . . . . . . . . . . . . . 126

5 Numerical Solution of Nonlinear Equations 129


5.1 Nonlinear Equations . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.2 Iterations, Convergence, and Efficiency . . . . . . . . . . . . . . . 130
5.3 Sturm Sequences Method . . . . . . . . . . . . . . . . . . . . . . 132
5.4 Method of False Position . . . . . . . . . . . . . . . . . . . . . . . 135
5.5 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.6 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.7 Fixed Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.8 Newton’s Method for Multiple zeros . . . . . . . . . . . . . . . . . 145
5.9 Algebraic Equations . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.10 Newton’s method for systems of nonlinear equations . . . . . . . . 148
5.11 Quasi-Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . 149
5.11.1 Linear Interpolation . . . . . . . . . . . . . . . . . . . . . 151
5.11.2 Modification Method . . . . . . . . . . . . . . . . . . . . . 151

6 Eigenvalues and Eigenvectors 155


6.1 Eigenvalues and Polynomial Roots . . . . . . . . . . . . . . . . . . 155
6.2 Basic Terminology and Schur Decomposition . . . . . . . . . . . . 157
6.3 Vector Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.4 QR Method – the Theory . . . . . . . . . . . . . . . . . . . . . . . 163
6.5 QR Method – the Practice . . . . . . . . . . . . . . . . . . . . . . . 167
6.5.1 Classical QR method . . . . . . . . . . . . . . . . . . . . . 167
6.5.2 Spectral shift . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.5.3 Double shift QR method . . . . . . . . . . . . . . . . . . . 176

7 Numerical Solution of Ordinary Differential Equations 179


7.1 Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.2 Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.3 Local Description of One-Step Methods . . . . . . . . . . . . . . . 182
7.4 Examples of One-Step Methods . . . . . . . . . . . . . . . . . . . 183
7.4.1 Euler’s method . . . . . . . . . . . . . . . . . . . . . . . . 183
7.4.2 Method of Taylor expansion . . . . . . . . . . . . . . . . . 185
7.4.3 Improved Euler methods . . . . . . . . . . . . . . . . . . . 186
7.5 Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . 188
7.6 Global Description of One-Step Methods . . . . . . . . . . . . . . 193
7.6.1 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.6.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.6.3 Asymptotics of global error . . . . . . . . . . . . . . . . . 199

7.7 Error Monitoring and Step Control . . . . . . . . . . . . . . . . . . 202


7.7.1 Estimation of global error . . . . . . . . . . . . . . . . . . 202
7.7.2 Truncation error estimates . . . . . . . . . . . . . . . . . . 204
7.7.3 Step control . . . . . . . . . . . . . . . . . . . . . . . . . . 207

Bibliography 215

Index 218
List of Algorithms

2.1 Solution of the system Ax = b by Gaussian elimination . . . . . . 34


2.2 Cholesky Factorization . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 QR factorization using Householder reflections . . . . . . . . . . . 41
2.4 Computation of the product QT b . . . . . . . . . . . . . . . . . . . 41
2.5 Computation of the product Qx . . . . . . . . . . . . . . . . . . . . 41
4.1 Adaptive quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.2 Adaptive quadrature based on Simpson method and extrapolation . . 127
5.1 Secant method for nonlinear equations in R . . . . . . . . . . . . . 141
5.2 Newton’s method for nonlinear equations in R . . . . . . . . . . . . 144
5.3 Newton method for nonlinear systems . . . . . . . . . . . . . . . . 150
5.4 Broyden’s method for nonlinear systems . . . . . . . . . . . . . . . 153
6.1 RQ transformation of a Hessenberg matrix H, that is H∗ = RQ,
where H = QR is a QR decomposition of H . . . . . . . . . . . . . 169
6.2 Reduction to upper Hessenberg form . . . . . . . . . . . . . . . . . 171
6.3 Pure (simple) QR Method . . . . . . . . . . . . . . . . . . . . . . . 171
6.4 QRSplit1a – QR method with partition and treatment of 2×2 matrices . . . 173
6.5 QR iterations on a Hessenberg matrix ; used by Algorithm 6.4 – call
[H1 , H2, it] = QRIter(H, t) . . . . . . . . . . . . . . . . . . . . 173
6.6 Spectral shift QR method, partition and treatment of complex eigen-
values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.7 QR iteration and partitioning . . . . . . . . . . . . . . . . . . . . . 175
6.8 Double shift QR method with partition and treating 2 × 2 matrices . 178
6.9 Double shift QR iterations and Hessenberg transformation . . . . . 178
7.1 4th order Runge-Kutta method . . . . . . . . . . . . . . . . . . . . 191
7.2 Pseudo-code fragment that illustrates the implementation of a variable-
step RK method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212

Chapter 1

Errors and Floating Point Arithmetic

The evaluation of computation errors is one of the main goals of Numerical Analysis. Several types of error which can affect the accuracy may occur:

1. Input data error;

2. Rounding error;

3. Approximation error.

Input data errors are beyond the control of the computation. They are due, for example, to the inherent imperfections of physical measurements.
Rounding errors arise because, as usual, we carry out our computations in a finite representation.
For the third error type, many methods do not provide the exact solution of a given problem P, even if the computation is carried out exactly (without rounding), but rather the solution of a simpler problem P̃ which approximates P. As an example we consider the summation of an infinite series,

e = 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \cdots,

which could be replaced by the simpler problem P̃ consisting of the summation of a finite number of terms of the series. Such an error is called a truncation error (nevertheless, this name is also used for rounding errors obtained by removing the last digits of the representation – chopping). Many approximation problems result from "discretizing" the original problem P: definite integrals are approximated by finite sums, derivatives by differences, and so on. Some authors extend the term "truncation error" to cover also the discretization error.
The aim of this chapter is to study the overall effect of input errors and rounding errors on a computational result. The approximation errors will be discussed when we present the individual numerical methods.

1.1 Numerical Problems


A numerical problem is a combination of a constructive mathematical problem (MP)
and a precision specification (PS).

Example 1.1.1. Let f : R −→ R and x ∈ R. We wish to compute y = f (x).


Generally, x is not representable inside the computer; for this reason we shall use an approximation x∗ ≈ x. Also, it is possible that f cannot be computed exactly; we then replace f by an approximation fA. The computed value will be fA(x∗). So our numerical problem is:

MP. Given x and f , compute f (x).

PS. |f (x) − fA (x∗ )| < ε, for a given ε. ♦

1.2 Error Measuring


Definition 1.2.1. Let X be a normed linear space, A ⊆ X and x ∈ X. An element
x∗ ∈ A is an approximation of x from A (notation x∗ ≈ x).

Definition 1.2.2. If x∗ is an approximation of x the difference ∆x = x − x∗ is called


error, and

k∆xk = kx∗ − xk (1.2.1)


is an absolute error.

Definition 1.2.3. The quantity

\delta_x = \frac{\|\Delta x\|}{\|x\|}, \qquad x \ne 0,   (1.2.2)

is called a relative error.

Remark 1.2.4.

1. Since x is unknown in practice, one uses the approximation \delta_x = \frac{\|\Delta x\|}{\|x^*\|}. If ‖∆x‖ is small relative to x∗, then the approximation is accurate.

2. If X = R, it is usual to take \delta_x = \frac{\Delta x}{x} and ∆x = x∗ − x. ♦
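As a small added illustration of Definitions 1.2.2 and 1.2.3 (not part of the original text; the numbers and the helper function errors are chosen only for the example), the absolute and relative errors can be computed as follows in Python:

import math

def errors(x, x_star):
    """Return (absolute error, relative error) of the approximation x_star ~ x."""
    abs_err = abs(x_star - x)      # ||Delta x||, here the usual absolute value on R
    rel_err = abs_err / abs(x)     # delta_x = ||Delta x|| / ||x||, x != 0
    return abs_err, rel_err

# x = pi approximated by the classical 22/7
abs_err, rel_err = errors(math.pi, 22 / 7)
print(f"absolute error = {abs_err:.3e}")   # about 1.264e-03
print(f"relative error = {rel_err:.3e}")   # about 4.025e-04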

1.3 Propagated error


Let f : Rⁿ −→ R, x = (x₁, . . . , xₙ) and x∗ = (x∗₁, . . . , x∗ₙ). We want to evaluate the absolute error ∆f and the relative error δf, respectively, when f(x) is approximated by f(x∗). These are propagated errors, because they describe how the initial error (absolute or relative) is propagated during the computation of f. Let us suppose x = x∗ + ∆x, where ∆x = (∆x₁, . . . , ∆xₙ). For the absolute error, using Taylor's formula we obtain

\Delta f = f(x_1^*+\Delta x_1,\ldots,x_n^*+\Delta x_n) - f(x_1^*,\ldots,x_n^*)
 = \sum_{i=1}^{n}\Delta x_i\,\frac{\partial f}{\partial x_i^*}(x_1^*,\ldots,x_n^*) + \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\Delta x_i\,\Delta x_j\,\frac{\partial^2 f}{\partial x_i^*\,\partial x_j^*}(\theta),

where θ ∈ [(x∗₁, . . . , x∗ₙ), (x∗₁ + ∆x₁, . . . , x∗ₙ + ∆xₙ)].
If the quantities ∆xᵢ are sufficiently small, then the products ∆xᵢ∆xⱼ are negligible with respect to the ∆xᵢ, and we have

\Delta f \approx \sum_{i=1}^{n} \Delta x_i\,\frac{\partial f}{\partial x_i^*}(x_1^*,\ldots,x_n^*).   (1.3.1)

Analogously, for the relative error,

\delta_f = \frac{\Delta f}{f} \approx \sum_{i=1}^{n}\Delta x_i\,\frac{\frac{\partial f}{\partial x_i^*}(x^*)}{f(x^*)} = \sum_{i=1}^{n}\Delta x_i\,\frac{\partial}{\partial x_i^*}\ln f(x^*) = \sum_{i=1}^{n} x_i^*\,\delta_{x_i}\,\frac{\partial}{\partial x_i^*}\ln f(x^*).

Thus

\delta_f = \sum_{i=1}^{n} x_i^*\,\frac{\partial}{\partial x_i^*}\ln f(x^*)\,\delta_{x_i}.   (1.3.2)

The inverse problem is also of great importance: what accuracy is needed for the input data such that the result has a desired accuracy? That is, given ε > 0, how large may ∆xᵢ or δxᵢ, i = 1, . . . , n, be such that ∆f or δf < ε? A solution method is based on the principle of equal effects: one supposes that all the terms appearing in (1.3.1), respectively (1.3.2), have the same effect, i.e.

\frac{\partial f}{\partial x_1^*}(x^*)\,\Delta x_1 = \cdots = \frac{\partial f}{\partial x_n^*}(x^*)\,\Delta x_n.

Then (1.3.1) implies

\Delta x_i \approx \frac{\Delta f}{n\,\frac{\partial f}{\partial x_i^*}(x^*)}.   (1.3.3)

Analogously,

\delta_{x_i} = \frac{\delta_f}{n\,x_i^*\,\frac{\partial}{\partial x_i^*}\ln f(x^*)}.   (1.3.4)
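The linearized estimate (1.3.1) is easy to check numerically. The short sketch below (an illustration added here, not from the original text) compares the first-order estimate with the actual propagated error for f(x₁, x₂) = x₁x₂, whose partial derivatives are known in closed form.

# Propagated-error estimate (1.3.1) for f(x1, x2) = x1 * x2:
# Delta f ~ x2 * Delta x1 + x1 * Delta x2 (partial derivatives evaluated at x*).
x1s, x2s = 2.0, 3.0            # approximate inputs x*
dx1, dx2 = 1e-4, -2e-4         # input errors Delta x

f_star = x1s * x2s
f_exact = (x1s + dx1) * (x2s + dx2)

estimate = x2s * dx1 + x1s * dx2   # first-order estimate (1.3.1)
actual = f_exact - f_star          # true propagated error

print(f"estimated Delta f = {estimate: .6e}")
print(f"actual    Delta f = {actual: .6e}")   # differs only by the second-order term dx1*dx2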

1.4 Floating-Point Representation


1.4.1 Parameters
Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation. The parameters of a floating-point representation are a base β (which is always assumed to be even), a precision p, and a largest and a smallest allowable exponent, emax and emin, all integers. In general, a floating-point number will be represented as

x = ±d0 .d1 d2 . . . dp−1 × β e , 0 ≤ di < β (1.4.1)

where d0 .d1 d2 . . . dp−1 is called the significand or mantissa, and e is the exponent.
The value of x is

±(d0 + d1 β −1 + d2 β −2 + · · · + dp−1 β −(p−1) )β e . (1.4.2)

In order to achieve uniqueness of the representation, floating-point numbers are normalized, that is, we change the representation, not the value, so that d₀ ≠ 0. Zero is represented as 1.0 × β^{emin−1}. Thus, the numerical ordering of nonnegative real numbers corresponds to the lexicographical ordering of their floating-point representations with the exponent stored to the left of the significand.
The term floating-point number will be used to mean a real number that can be exactly represented in this format. Each interval [β^e, β^{e+1}) in R contains exactly (β − 1)β^{p−1} floating-point numbers (the number of all possible normalized significands). The interval (0, β^{emin}) contains no such numbers; for this reason the denormalized numbers are introduced, i.e. numbers whose significand has the form 0.d₁d₂ . . . d_{p−1} and whose stored exponent is emin − 1. The availability of denormalization is an additional parameter of the representation. The set of floating-point numbers for a fixed set of representation parameters will be denoted F(β, p, emin, emax, denorm), with denorm ∈ {true, false}.

Figure 1.1: The distribution of normalized floating-point numbers on the real axis, without denormalization
This set is not equal to R because:
1. it is a finite subset of Q;
2. for x ∈ R it is possible to have |x| > β × β^{emax} (overflow) or |x| < 1.0 × β^{emin} (underflow).
The usual arithmetic operations on F(β, p, emin, emax, denorm) are denoted by ⊕, ⊖, ⊗, ⊘, and the names of the usual functions are capitalized: SIN, COS, EXP, LN, SQRT, and so on. (F, ⊕, ⊗) is not a field, since in general
(x ⊕ y) ⊕ z ≠ x ⊕ (y ⊕ z),   (x ⊗ y) ⊗ z ≠ x ⊗ (y ⊗ z),   (x ⊕ y) ⊗ z ≠ x ⊗ z ⊕ y ⊗ z.
In order to measure the error one uses the relative error and ulps – units in the last place. If the number z is represented as d₀.d₁d₂ . . . d_{p−1} × β^e, then the error is
|d₀.d₁d₂ . . . d_{p−1} − z/β^e| β^{p−1} ulps.

Figure 1.2: The distribution of normalized floating point numbers on the real axis
with denormalization

The relative error that corresponds to 1/2 ulp is

\frac{1}{2}\beta^{-p} \le \tfrac{1}{2}\,\mathrm{ulp} \le \frac{\beta}{2}\beta^{-p},

since 1/2 ulp corresponds to the number \underbrace{0.0\ldots 0}_{p}\beta' \times \beta^e with β′ = β/2. The value eps = (β/2)β^{−p} is referred to as the machine epsilon.
The default rounding obeys the even-digit rule: if x = d₀.d₁ . . . d_{p−1}d_p . . . and d_p > β/2, the rounding is upward; if d_p < β/2, the rounding is downward; if d_p = β/2 and among the removed digits there is a nonzero one, the rounding is upward; otherwise, one rounds so that the last preserved digit is even. If fl denotes the rounding operation, we can define the floating-point arithmetic operations by

x ⊛ y = fl(x ◦ y),   (1.4.3)

where ⊛ ∈ {⊕, ⊖, ⊗, ⊘} denotes the machine counterpart of the exact operation ◦. Other rounding modes can be chosen: toward −∞, toward +∞, toward 0 (chopping). When reasoning about floating-point operations we shall use the following model:

∀x, y ∈ F, ∃δ with |δ| < eps such that x ⊛ y = (x ◦ y)(1 + δ).   (1.4.4)

Intuitively, each floating-point arithmetic operation is exact to within a relative error of at most eps.
The formula (1.4.4) is called the fundamental axiom of floating-point arithmetic.
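The following lines (an added illustration, not from the original text) show the double-precision machine epsilon and verify the model (1.4.4) on one addition. For IEEE 754 double precision β = 2 and p = 53, so the eps defined above equals 2⁻⁵³; Python's runtime reports the spacing between 1.0 and the next float, which is twice that value.

import sys
from fractions import Fraction

eps = 2.0 ** -53
print(eps)                        # 1.1102230246251565e-16, the eps = (beta/2)*beta**(-p) of the text
print(sys.float_info.epsilon)     # 2.22e-16 = 2*eps, the spacing between 1.0 and the next float

# Check the model (1.4.4): x (+) y = (x + y)(1 + delta) with |delta| < eps,
# where x + y denotes the EXACT sum of the stored operands.
x, y = 0.1, 0.2
computed = x + y                               # the machine result fl(x + y)
exact = Fraction(x) + Fraction(y)              # exact rational sum of the two stored doubles
delta = (Fraction(computed) - exact) / exact
print(abs(delta) < eps)                        # True
print(float(delta))                            # about 9e-17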

1.4.2 Cancellation
From formula (1.3.2) for the relative error, if x and y carry the relative errors δx and δy, we obtain for the floating-point arithmetic operations:

\delta_{xy} = \delta_x + \delta_y   (1.4.5)
\delta_{x/y} = \delta_x - \delta_y   (1.4.6)
\delta_{x+y} = \frac{x}{x+y}\,\delta_x + \frac{y}{x+y}\,\delta_y   (1.4.7)

Only the subtraction of two nearby quantities x ≈ y is critical; in this case δ_{x−y} → ∞. This phenomenon is called cancellation and is depicted in Figure 1.3. There b, b′, b″ stand for binary digits which are reliable, and the g's represent binary digits contaminated by errors (garbage digits). Note that garbage − garbage = garbage, but, more importantly, the normalization of the result moves the first garbage digit from the 12th position to the 3rd.

x = 1 0 1 1 0 0 1 0 1 b b g g g g e
y = 1 0 1 1 0 0 1 0 1 b0 b0 g g g g e
x-y = 0 0 0 0 0 0 0 0 0 b00 b00 g g g g e
= b00 b00 g g g g ? ? ? ? ? ? ? ? ? e-9

Figure 1.3: The cancellation phenomenon

We have two kinds of cancellation: benign, when subtracting exactly known quantities, and catastrophic, when the subtraction operands are themselves subject to rounding errors. The programmer must be aware of the possibility of its occurrence and must try to avoid it. Expressions which lead to cancellation must be rewritten, so that a catastrophic cancellation is converted into a benign one. We shall give some examples in the sequel.

Example 1.4.1. If a ≈ b, then the expression a² − b² is rewritten as (a − b)(a + b). The initial form is preferred when a ≫ b or b ≫ a. ♦

Example 1.4.2. If cancellation appears within an expression containing square roots, then we rewrite:

\sqrt{x+\delta} - \sqrt{x} = \frac{\delta}{\sqrt{x+\delta} + \sqrt{x}}, \qquad \delta \approx 0. ♦

Example 1.4.3. The difference of two values of the same function for nearby arguments is rewritten using a Taylor expansion:

f(x+\delta) - f(x) = \delta f'(x) + \frac{\delta^2}{2} f''(x) + \cdots, \qquad f \in C^n[a,b]. ♦

Example 1.4.4. The solution of a quadratic equation ax² + bx + c = 0 can involve catastrophic cancellation when b² ≫ 4ac. The usual formulae

x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}   (1.4.8)
x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}   (1.4.9)

can lead to cancellation as follows: for b > 0 the cancellation affects the computation of x₁; for b < 0, that of x₂. We can correct the situation using the conjugate expressions

x_1 = \frac{2c}{-b - \sqrt{b^2 - 4ac}}   (1.4.10)
x_2 = \frac{2c}{-b + \sqrt{b^2 - 4ac}}   (1.4.11)

For the first case we use formulae (1.4.10) and (1.4.9); for the second case, (1.4.8) and (1.4.11). ♦
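A minimal Python sketch of the idea in Example 1.4.4 (added here for illustration; the function name quadratic_roots is not from the text): for b ≥ 0 the root x₁ is computed from the conjugate form (1.4.10), while x₂ is safe in its original form (1.4.9), and symmetrically for b < 0.

import math

def quadratic_roots(a, b, c):
    """Roots of a*x^2 + b*x + c = 0, avoiding cancellation (real roots assumed)."""
    sq = math.sqrt(b * b - 4.0 * a * c)
    if b >= 0:
        x1 = (2.0 * c) / (-b - sq)   # conjugate form (1.4.10): no cancellation
        x2 = (-b - sq) / (2.0 * a)   # original form (1.4.9)
    else:
        x1 = (-b + sq) / (2.0 * a)   # original form (1.4.8)
        x2 = (2.0 * c) / (-b + sq)   # conjugate form (1.4.11)
    return x1, x2

# b^2 >> 4ac: the naive formula loses digits in the small root,
# the rewritten one recovers it to essentially full accuracy.
a, b, c = 1.0, 1e8, 1.0
naive_small = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
stable_small, _ = quadratic_roots(a, b, c)
print(naive_small)    # about -7.45e-09, badly wrong
print(stable_small)   # about -1e-08, accurate (the exact small root is about -1e-08)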

1.5 IEEE Standard


There are two different standards for floating-point computation: IEEE 754, which requires β = 2, and IEEE 854, which allows either β = 2 or β = 10 and is more permissive concerning the representation.
We deal only with the first standard. Table 1.1 gives its parameters.

Why extended formats?

1. Better precision.

2. The conversion from binary to decimal and then back to binary needs 9 digits in single precision and 17 digits in double precision.

Parameter        Single    Single Extended    Double    Double Extended
p                24        ≥ 32               53        ≥ 64
emax             +127      ≥ +1023            +1023     ≥ +16383
emin             −126      ≤ −1022            −1022     ≤ −16382
Exponent width   8         ≥ 11               11        ≥ 15
Number width     32        ≥ 43               64        ≥ 79

Table 1.1: IEEE 754 format parameters

The relation |emin| < emax is motivated by the fact that 1/2^{emin} must not lead to overflow.
The operations ⊕, ⊖, ⊗, ⊘ must be exactly rounded. This accuracy is achieved using two guard digits and a third, sticky, bit.
The exponent is biased, i.e. instead of e the standard stores e + D, where D is fixed when the format is chosen. For IEEE 754 single precision, D = 127.

1.5.1 Special Quantities


The IEEE standard specifies the following special quantities:

Exponent              Significand    Represents
e = emin − 1          f = 0          ±0
e = emin − 1          f ≠ 0          0.f × 2^{emin}
emin ≤ e ≤ emax       any f          1.f × 2^e
e = emax + 1          f = 0          ±∞
e = emax + 1          f ≠ 0          NaN

NaN. In fact we have a family of NaNs. Illegal and indeterminate operations lead to NaN: ∞ + (−∞), 0 × ∞, 0/0, ∞/∞, x REM 0, ∞ REM y, √x for x < 0. If one operand is a NaN, the result is a NaN too.
Infinity. 1/0 = ∞, −1/0 = −∞. The infinite values allow the continuation of the computation when an overflow occurs. This is safer than aborting or returning the largest representable number. For example, x/(1 + x²) for x = ∞ gives 0.
Signed Zero. We have two zeros, +0 and −0; the relations +0 = −0 and −0 < +∞ hold. Advantages: simpler treatment of underflow and of discontinuities. We can make a distinction between log 0 = −∞ and log x = NaN for x < 0. Without a signed zero we could not distinguish the logarithm of a negative number that underflows to zero from the logarithm of 0.
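These special quantities can be observed directly from Python (an added illustration; Python's float is an IEEE 754 double, and NumPy is used where Python itself would raise an exception instead of following the IEEE rules):

import numpy as np

with np.errstate(divide="ignore", invalid="ignore"):
    inf = np.float64("inf")
    print(inf - inf)                                  # nan: an indeterminate operation
    print(np.float64("nan") == np.float64("nan"))     # False: NaN is unequal even to itself
    print(np.float64(1.0) / np.float64(0.0))          # inf
    print(np.float64(-1.0) / np.float64(0.0))         # -inf
    # Signed zero: +0 and -0 compare equal, but 1/x and log recover the sign.
    print(np.float64(0.0) == np.float64(-0.0))        # True
    print(np.float64(1.0) / np.float64(-0.0))         # -inf
    print(np.log(np.float64(0.0)), np.log(np.float64(-1.0)))   # -inf nan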

1.6 The Condition of a Problem


We may think of a problem as a map

f : Rᵐ → Rⁿ,  y = f(x).   (1.6.1)

We are interested in the sensitivity of the map f at some given point x to a small
perturbation of x, that is, how much bigger (or smaller) the perturbation in y is com-
pared to the perturbation in x. In particular, we wish to measure the degree of sen-
sitivity by a single number – the condition number of the map f at the point x. The
function f is assumed to be evaluated exactly, with infinite precision, as we perturb
x. The condition of f , therefore, is an inherent property of the map f and does not
depend on any algorithmic considerations concerning its implementation.
It does not mean that the knowledge of the condition of a problem is irrelevant
to any algorithmic solution of the problem. On the contrary! The reason is that
quite often the computed solution y ∗ of (1.6.1) (computed in floating point machine
arithmetic, using a specific algorithm) can be demonstrated to be the exact solution
of a “nearby” problem; that is
y ∗ = f (x∗ ) (1.6.2)

where
x∗ = x + δ (1.6.3)

and moreover, the distance kδk = kx∗ − xk can be estimated in terms of the machine
precision. Therefore, if we know how strongly or weakly the map f reacts to small
perturbation, such as δ in (1.6.3), we can say something about the error y ∗ − y in the
solution caused by the perturbation.
We can consider more general spaces for f , but for practical implementation the
finite dimensional spaces are sufficient.
Let

x = [x_1, \ldots, x_m]^T \in \mathbb{R}^m, \quad y = [y_1, \ldots, y_n]^T \in \mathbb{R}^n, \quad y_\nu = f_\nu(x_1, \ldots, x_m),\ \nu = 1, \ldots, n.

We think of yν as a function of the single variable xµ and define

\gamma_{\nu\mu} = (\operatorname{cond}_{\nu\mu} f)(x) = \left|\frac{x_\mu\,\frac{\partial f_\nu}{\partial x_\mu}}{f_\nu(x)}\right|.   (1.6.4)

These give us a matrix of condition numbers

\Gamma(x) = \begin{bmatrix} \left|\frac{x_1\frac{\partial f_1}{\partial x_1}}{f_1(x)}\right| & \cdots & \left|\frac{x_m\frac{\partial f_1}{\partial x_m}}{f_1(x)}\right| \\ \vdots & \ddots & \vdots \\ \left|\frac{x_1\frac{\partial f_n}{\partial x_1}}{f_n(x)}\right| & \cdots & \left|\frac{x_m\frac{\partial f_n}{\partial x_m}}{f_n(x)}\right| \end{bmatrix} = [\gamma_{\nu\mu}(x)]   (1.6.5)

and we shall take as condition number

(\operatorname{cond} f)(x) = \|\Gamma(x)\|.   (1.6.6)

Another approach. We consider the norm ‖·‖∞. Since

\Delta y_\nu \approx \sum_{\mu=1}^{m} \frac{\partial f_\nu}{\partial x_\mu}\,\Delta x_\mu \quad (= f_\nu(x+\Delta x) - f_\nu(x)),

we have

|\Delta y_\nu| \le \sum_{\mu=1}^{m} \left|\frac{\partial f_\nu}{\partial x_\mu}\right| |\Delta x_\mu| \le \max_\mu |\Delta x_\mu| \sum_{\mu=1}^{m} \left|\frac{\partial f_\nu}{\partial x_\mu}\right| \le \max_\mu |\Delta x_\mu|\, \max_\nu \sum_{\mu=1}^{m} \left|\frac{\partial f_\nu}{\partial x_\mu}\right|.

Therefore

\|\Delta y\|_\infty \le \left\|\frac{\partial f}{\partial x}\right\|_\infty \|\Delta x\|_\infty,   (1.6.7)

where

J(x) = \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_m} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_m} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial x_1} & \frac{\partial f_n}{\partial x_2} & \cdots & \frac{\partial f_n}{\partial x_m} \end{bmatrix} \in \mathbb{R}^{n \times m}   (1.6.8)

is the Jacobian matrix of f, and

\frac{\|\Delta y\|_\infty}{\|y\|_\infty} \le \frac{\|x\|_\infty \left\|\frac{\partial f}{\partial x}\right\|_\infty}{\|f(x)\|_\infty} \cdot \frac{\|\Delta x\|_\infty}{\|x\|_\infty}.   (1.6.9)

If m = n = 1, then both approaches lead to

(\operatorname{cond} f)(x) = \left|\frac{x f'(x)}{f(x)}\right|,

for x ≠ 0, y ≠ 0.
If x = 0 and y ≠ 0, then we take the absolute error for x and the relative error for y:

(\operatorname{cond} f)(x) = \left|\frac{f'(x)}{f(x)}\right|.

For y = 0 and x ≠ 0 we take the absolute error for y and the relative error for x. For x = y = 0,

(\operatorname{cond} f)(x) = |f'(x)|.

Example 1.6.1 (Systems of linear algebraic equations). Given a nonsingular square matrix A ∈ Rⁿˣⁿ and a vector b ∈ Rⁿ, solve the system

Ax = b.   (1.6.10)

Here the input data are the elements of A and b, and the result is the vector x. To simplify matters let us assume that A is a fixed matrix not subject to change, and only b is undergoing perturbations. We have a map f : Rⁿ → Rⁿ given by

x = f(b) := A^{-1}b,

which is linear. Therefore ∂f/∂b = A⁻¹ and, using (1.6.9),

(\operatorname{cond} f)(b) = \frac{\|b\|\,\|A^{-1}\|}{\|A^{-1}b\|} = \frac{\|Ax\|\,\|A^{-1}\|}{\|x\|}, \qquad \max_{b \ne 0} (\operatorname{cond} f)(b) = \max_{x \ne 0} \frac{\|Ax\|}{\|x\|}\,\|A^{-1}\| = \|A\|\,\|A^{-1}\|.   (1.6.11) ♦

The number ‖A‖‖A⁻¹‖ is called the condition number of the matrix A and we denote it by cond A:

\operatorname{cond} A = \|A\|\,\|A^{-1}\|.
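As a quick numerical check (an added sketch, not from the original text), cond A = ‖A‖‖A⁻¹‖ can be computed directly with NumPy and compared with the built-in np.linalg.cond; the matrix chosen below is Wilson's matrix, which reappears in Chapter 2.

import numpy as np

A = np.array([[10.0, 7.0,  8.0,  7.0],
              [ 7.0, 5.0,  6.0,  5.0],
              [ 8.0, 6.0, 10.0,  9.0],
              [ 7.0, 5.0,  9.0, 10.0]])

cond_manual  = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)
cond_builtin = np.linalg.cond(A, 2)
print(cond_manual, cond_builtin)   # both about 2984: small data errors may be amplified ~3000 times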

1.7 The Condition of an algorithm


Let us consider the problem

f : Rm → Rn , y = f (x). (1.7.1)

Along with the problem f , we are also given an algorithm A that solves the
problem. That is, given a vector x ∈ F(β, p, emin , emax , denorm), the algorithm A
produces a vector yA (in floating-point arithmetic), that is supposed to approximate

y = f (x). Thus we have another map fA describing how the problem f is solved by
the algorithm A
fA : Fm (. . . ) → Fn (. . . ), yA = fA (x).
In order to be able to analyze fA in these general terms, we must make a basic assumption, namely, that

(BA) ∀ x ∈ Fm ∃ xA ∈ Fm : fA (x) = f (xA ). (1.7.2)

That is, the computed solution corresponding to some input x is the exact solution
for some different input xA (not necessarily a machine vector and not necessarily
uniquely determined) that we hope is close to x. The closer we can find an xA to x,
the more confidence we should place in the algorithm A.
We define the condition of A at x by comparing the relative error with eps:

(\operatorname{cond} A)(x) = \inf_{x_A} \frac{\|x_A - x\|}{\|x\|} \Big/ \mathrm{eps}.

Motivation:

\delta_y = \frac{f_A(x) - f(x)}{f(x)} = \frac{(x_A - x)\,f'(\xi)}{f(x)} \approx \frac{x_A - x}{x\,\mathrm{eps}} \cdot \frac{x f'(x)}{f(x)}\,\mathrm{eps}.

The infimum is taken over all xA satisfying yA = f(xA). In practice one can take any such xA and thus obtain an upper bound for the condition number:

(\operatorname{cond} A)(x) \le \frac{\|x_A - x\|/\|x\|}{\mathrm{eps}}.   (1.7.3)

1.8 Overall error


The problem to be solved is again

f : Rm → Rn , y = f (x). (1.8.1)

This is the mathematical (idealized) problem, where the data are exact real num-
bers, and the solution is the mathematically exact solution. When solving such a
problem on a computer, in floating-point arithmetic with precision eps, and using
some algorithm A, one first of all rounds the data, and then applies to these rounded
data not f , but fA .

x^* = x \text{ rounded}, \qquad \frac{\|x^* - x\|}{\|x\|} = \varepsilon, \qquad y_A^* = f_A(x^*).

Here ε represents the rounding error in the data. (It could also be due to sources other than rounding, e.g., measurement.) The total error that we wish to estimate is

\frac{\|y_A^* - y\|}{\|y\|}.

By the basic assumption (1.7.2, BA) made on the algorithm A, and choosing xA optimally, we have

f_A(x^*) = f(x_A^*), \qquad \frac{\|x_A^* - x^*\|}{\|x^*\|} = (\operatorname{cond} A)(x^*)\,\mathrm{eps}.   (1.8.2)

Let y* = f(x*). Using the triangle inequality, we have

\frac{\|y_A^* - y\|}{\|y\|} \le \frac{\|y_A^* - y^*\|}{\|y\|} + \frac{\|y^* - y\|}{\|y\|} \approx \frac{\|y_A^* - y^*\|}{\|y^*\|} + \frac{\|y^* - y\|}{\|y^*\|},

where we supposed ‖y‖ ≈ ‖y*‖. By virtue of (1.8.2) we now have, for the first term on the right,

\frac{\|y_A^* - y^*\|}{\|y^*\|} = \frac{\|f_A(x^*) - f(x^*)\|}{\|f(x^*)\|} = \frac{\|f(x_A^*) - f(x^*)\|}{\|f(x^*)\|} \le (\operatorname{cond} f)(x^*)\,\frac{\|x_A^* - x^*\|}{\|x^*\|} = (\operatorname{cond} f)(x^*)(\operatorname{cond} A)(x^*)\,\mathrm{eps},

and for the second,

\frac{\|y^* - y\|}{\|y\|} = \frac{\|f(x^*) - f(x)\|}{\|f(x)\|} \le (\operatorname{cond} f)(x)\,\frac{\|x^* - x\|}{\|x\|} = (\operatorname{cond} f)(x)\,\varepsilon.

Assuming finally that (cond f)(x*) ≈ (cond f)(x), we get

\frac{\|y_A^* - y\|}{\|y\|} \le (\operatorname{cond} f)(x)\,[\varepsilon + (\operatorname{cond} A)(x^*)\,\mathrm{eps}].   (1.8.3)

Interpretation: The data error and eps contribute together towards the total error.
Both are amplified by the condition of the problem, but the latter is further amplified
by the condition of the algorithm.

1.9 Ill-Conditioned Problems and Ill-Posed Problems


If the condition number of a problem is large ((cond f)(x) ≫ 1), then even for small (relative) errors in the input data, huge errors in the output can be expected. Such problems are called ill-conditioned problems. It is not possible to draw a clear separation line between well-conditioned and ill-conditioned problems; the classification depends on the precision specification. If we wish

\frac{\|y_A^* - y\|}{\|y\|} \le \tau

and in (1.8.3) we have (cond f)(x)ε ≥ τ, then the problem is surely ill-conditioned.
It is important to choose a reasonable error bound, since otherwise, even if we increase the number of iterations, we cannot achieve the desired accuracy.
If the result of a mathematical problem depends discontinuously on continuous
input data, then it is impossible to obtain an accurate numerical solution in a neigh-
borhood of the discontinuity. In such cases the result is significantly perturbed, even
if the input data are accurate and the computation is performed using multiple preci-
sion. These problems are called ill-posed problems. An ill-posed problem can appear
if, for example, an integer result is computed from real input data (which vary con-
tinuously). As examples we can cite the number of real zeros of a polynomial and
the rank of a matrix.
Example 1.9.1 (The number of real zeros of a polynomial). The equation

P₃(x, c₀) = c₀ + x − 2x² + x³ = 0

can have one, two or three real zeros, depending on whether c₀ is strictly positive, zero, or strictly negative. Therefore, if c₀ is close to zero, the number of real zeros of P₃ is an ill-posed problem. ♦
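The following added sketch illustrates Example 1.9.1 numerically (the helper real_zero_count and its tolerances are illustrative choices, not from the text): tiny changes in c₀ around zero change the count of real zeros, which is exactly the discontinuous, integer-valued dependence on continuous data that makes the problem ill-posed.

import numpy as np

def real_zero_count(c0, tol=1e-6):
    """Approximate number of distinct real zeros of P3(x) = x^3 - 2x^2 + x + c0."""
    roots = np.roots([1.0, -2.0, 1.0, c0])        # coefficients in decreasing degree
    real = roots[np.abs(roots.imag) < tol].real   # keep (numerically) real roots
    return len(np.unique(np.round(real, 3)))

for c0 in (1e-6, 0.0, -1e-6):
    print(c0, real_zero_count(c0))   # 1, 2 and 3 real zeros, respectively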

1.10 Stability
1.10.1 Asymptotical notations
We shall introduce here basic notations and some common abuses.
For a given function g(n), Θ(g(n)) will denote the set of functions

\Theta(g(n)) = \{\, f(n) : \exists\, c_1, c_2, n_0 > 0 \text{ such that } 0 \le c_1 g(n) \le f(n) \le c_2 g(n),\ \forall n \ge n_0 \,\}.

Although Θ(g(n)) is a set, we write f(n) = Θ(g(n)) instead of f(n) ∈ Θ(g(n)). This abuse has some advantages. g(n) will be called an asymptotically tight bound for f(n).

Figure 1.4: An ill-posed problem

The definition of Θ(g(n)) requires that every member of it be asymptotically nonnegative, that is, f(n) ≥ 0 for sufficiently large n.
For a given function g(n), O(g(n)) will denote the set

O(g(n)) = \{\, f(n) : \exists\, c, n_0 \text{ such that } 0 \le f(n) \le c\,g(n),\ \forall n \ge n_0 \,\}.

Also, for f(n) ∈ O(g(n)) we shall write f(n) = O(g(n)). Note that f(n) = Θ(g(n)) implies f(n) = O(g(n)), since the Θ notation is stronger than the O notation. In set-theoretic terms, Θ(g(n)) ⊆ O(g(n)). One of the funny properties of the O notation is n = O(n²). g(n) will be called an asymptotic upper bound for f.
For a given function g(n), Ω(g(n)) is defined as the set of functions

Ω(g(n)) = {f (n) : ∃c, n0 0 ≤ cg(n) ≤ f (n), ∀n ≥ n0 } .

This notation provides an asymptotic lower bound. The definitions of the asymptotic notations imply immediately:

f(n) = Θ(g(n)) ⟺ f(n) = O(g(n)) ∧ f(n) = Ω(g(n)).

The functions f and g : N −→ R are asymptotically equivalent (notation f ∼ g) if

\lim_{n \to \infty} \frac{f(n)}{g(n)} = 1.

The extension of asymptotic notations to real numbers is obvious. For example,


f (t) = O(g(t)) means that there exists a positive constant C such that for all t
sufficiently close to an understood limit (e.g., t → 0 or t → ∞),

|f (t)| ≤ Cg(t). (1.10.1)

1.10.2 Accuracy and stability


In this section, we think of a problem as a map f : X −→ Y, where X and Y are normed linear spaces (for our purpose, finite-dimensional spaces are sufficient). We are interested in the behavior of the problem at a particular point x ∈ X (the behavior may vary from one point to another). A combination of a problem f with prescribed input data x might be called a problem instance, but it is usual, though occasionally confusing, to use the term problem for both notions.
Since complex numbers are represented as pairs of floating-point numbers, the axiom (1.4.4) also holds for complex numbers, except that for ⊗ and ⊘ the value of eps must be enlarged by factors of the order 2^{3/2} and 2^{5/2}, respectively.
An algorithm can be viewed as another map fA : X −→ Y , where X and Y
are as above. Let us consider a problem f , a computer whose floating-point number
system satisfies (1.4.4), but not necessarily (1.4.3), an algorithm fA for f , an imple-
mentation of this algorithm as a computer program, A, all fixed. Given input data
x ∈ X, we round it to a floating point number and then supply it to the program. The
result is a collection of floating-point numbers forming a vector from Y (since the
algorithm was designed to solve f ). Let this computer result be called fA (x).
Except in trivial cases, fA cannot be continuous. One might say that an algorithm fA for the problem f is accurate if, for each x ∈ X, its relative error satisfies

\frac{\|f_A(x) - f(x)\|}{\|f(x)\|} = O(\mathrm{eps}).   (1.10.2)

If the problem f is ill-conditioned, the goal of accuracy, as defined by (1.10.2), is unreasonably ambitious. Rounding errors in the input data are unavoidable on a digital computer, and even if all the subsequent computation could be carried out perfectly, this perturbation alone might lead to a significant change in the result. Instead of aiming at accuracy in all cases, it is more appropriate to ask, in general, for stability. We say that an algorithm fA for a problem f is stable if, for each x ∈ X,

\frac{\|f_A(x) - f(\tilde{x})\|}{\|f(\tilde{x})\|} = O(\mathrm{eps})   (1.10.3)

for some x̃ with

\frac{\|\tilde{x} - x\|}{\|x\|} = O(\mathrm{eps}).   (1.10.4)

In words,
A stable algorithm gives nearly the right answer to nearly the right ques-
tion.
Many algorithms of Numerical Linear Algebra satisfy a condition that is both
stronger and simpler than stability. We say that an algorithm fA for the problem f is
backward stable if
\forall x \in X\ \exists \tilde{x} \text{ with } \frac{\|\tilde{x} - x\|}{\|x\|} = O(\mathrm{eps}) \text{ such that } f_A(x) = f(\tilde{x}).   (1.10.5)
This is a tightening of the definition of stability in that the O(eps) in (1.10.3) was
replaced by zero. In words
A backward stable algorithm gives exactly the right answer to nearly the
right question.
Remark 1.10.1. The notation
||computed quantity|| = O(eps) (1.10.6)
has the following meaning:

• ||computed quantity|| represents the norm of some number or collection of


numbers determined by an algorithm fA for a problem f , depending on both
the input data x ∈ X for f and eps. An example is the relative error.
• The implicit limit process is eps → 0 (i.e. eps corresponds to t in (1.10.1)).
• The O applies uniformly to all data x ∈ X. This uniformity is implicit in the statement of stability results.
• In any particular machine arithmetic, eps is a fixed quantity. Speaking of the
limit eps → 0, we are considering an idealization of a computer or a family
of computers. Equation (1.10.6) means that if we were to run the algorithm in
question on computers satisfying (1.4.3) and (1.4.4) for a sequence of values
of eps decreasing to zero, then ||computed quantity|| would be guaranteed to
decrease in proportion to eps or faster. These ideal computers are required to
satisfy (1.4.3) and (1.4.4), but nothing else.
• The constant implied by the O may also depend on the size of the problem (e.g., for the solution of a nonsingular system Ax = b, on the dimensions of A and b). Generally, in practice, the error growth due to the problem size is slow; but there are situations with factors like 2^m, which make such bounds useless in practice. ♦

Due to the equivalence of norms on finite dimensional linear spaces, for problems
f and algorithms fA defined on such spaces, the properties of accuracy, stability and
backward stability all hold or fail to hold independently of the choice of norms in X
and Y .

1.10.3 Backward Error Analysis


Backward stability and well-conditioning imply accuracy in the relative sense.

Theorem 1.10.2. Suppose a backward stable algorithm fA is applied to solve a problem f : X −→ Y with condition number (cond f)(x) on a computer satisfying the axioms (1.4.3) and (1.4.4). Then the relative error satisfies

\frac{\|f_A(x) - f(x)\|}{\|f(x)\|} = O\big((\operatorname{cond} f)(x)\,\mathrm{eps}\big).   (1.10.7)

Proof. By the definition (1.10.5) of backward stability we have f_A(x) = f(x̃) for some x̃ ∈ X satisfying

\frac{\|\tilde{x} - x\|}{\|x\|} = O(\mathrm{eps}).

By the definitions (1.6.5) and (1.6.6) of (cond f)(x), this implies

\frac{\|f_A(x) - f(x)\|}{\|f(x)\|} \le \big((\operatorname{cond} f)(x) + o(1)\big)\,\frac{\|\tilde{x} - x\|}{\|x\|},   (1.10.8)

where o(1) denotes a quantity that converges to zero as eps → 0. Combining these bounds gives (1.10.7). □

The process just carried out in proving Theorem 1.10.2 is known as backward error analysis. We obtained an accuracy estimate in two steps. One step is to investigate the condition of the problem; the other is to investigate the stability of the algorithm. By Theorem 1.10.2, if the algorithm is backward stable, the final accuracy reflects the condition number of the problem.
There is also a forward error analysis. Here, the rounding errors introduced at each step of the calculation are estimated and, somehow, a total is maintained of how they compound from step to step (section 1.3).
Experience has shown that, for most of the algorithms of numerical linear algebra, forward error analysis is harder to carry out than backward error analysis. The best algorithms of linear algebra do no more, in general, than compute exact solutions for slightly perturbed data. Backward error analysis is a method of reasoning fitted neatly to this backward reality.

Chapter 2

Numerical Solution of Linear Algebraic Systems

There are two classes of methods for the solution of algebraic linear systems (ALS):

• direct or exact methods – they provide the solution in a finite number of steps, under the assumption that all computations are carried out exactly (Cramer, Gaussian elimination, Cholesky);

• iterative methods – they approximate the solution by generating a sequence converging to that solution (Jacobi, Gauss-Seidel, SOR).

2.1 Notions of Matrix Analysis


Let A ∈ Kn×n .

• The polynomial p(λ) = det(A − λI) – characteristic polynomial of A;

• zeros of p – eigenvalues of A;

• If λ is an eigenvalue of A, a vector x ≠ 0 such that (A − λI)x = 0 is an eigenvector of A corresponding to the eigenvalue λ;

• ρ(A) = max{|λ| : λ eigenvalue of A} – the spectral radius of the matrix A.

AT – the transpose of A; A∗ the conjugate transpose (adjoint) of A.

Definition 2.1.1. A matrix A is called:

1. normal, if AA∗ = A∗ A;


2. unitary, if AA∗ = A∗ A = I;

3. hermitian, if A = A∗ ;

4. orthogonal, if AAT = AT A = I, A real;

5. symmetric, if A = AT , A real.

Definition 2.1.2. A matrix norm is a map k · k : Km×n → R that, for all A, B ∈ Km×n and α ∈ K, satisfies

(NM1) kAk ≥ 0, kAk = 0 ⇔ A = On ;

(NM2) kαAk = |α|kAk;

(NM3) kA + Bk ≤ kAk + kBk;

(NM4) kABk ≤ kAkkBk.

A simple way to obtain a matrix norm is the following: given a vector norm ‖·‖ on Cⁿ, the map ‖·‖ : Cⁿˣⁿ → R,

\|A\| = \sup_{v \in \mathbb{C}^n,\, v \ne 0} \frac{\|Av\|}{\|v\|} = \sup_{\|v\| \le 1} \|Av\| = \sup_{\|v\| = 1} \|Av\|,

is a matrix norm, called the subordinate matrix norm (to the given vector norm) or natural norm (induced by the given vector norm).

Remark 2.1.3. A subordinate matrix norm verifies kIk = 1. ♦

The norms subordinate to vector norms k · k1 , k · k2 , k · k∞ are given by the


following result.

Theorem 2.1.4. Let A ∈ Cⁿˣⁿ. Then

\|A\|_1 := \sup_{v \in \mathbb{C}^n \setminus \{0\}} \frac{\|Av\|_1}{\|v\|_1} = \max_j \sum_i |a_{ij}|,
\|A\|_2 := \sup_{v \in \mathbb{C}^n \setminus \{0\}} \frac{\|Av\|_2}{\|v\|_2} = \sqrt{\rho(A^*A)} = \sqrt{\rho(AA^*)} = \|A^*\|_2,
\|A\|_\infty := \sup_{v \in \mathbb{C}^n \setminus \{0\}} \frac{\|Av\|_\infty}{\|v\|_\infty} = \max_i \sum_j |a_{ij}|.

The norm ‖·‖₂ is invariant under unitary transformations:

U U^* = I \;\Rightarrow\; \|A\|_2 = \|AU\|_2 = \|UA\|_2 = \|U^*AU\|_2.

If A is normal, then

AA^* = A^*A \;\Rightarrow\; \|A\|_2 = \rho(A).

Proof. For any vector v we have

\|Av\|_1 = \sum_i \Big|\sum_j a_{ij} v_j\Big| \le \sum_j |v_j| \sum_i |a_{ij}| \le \Big(\max_j \sum_i |a_{ij}|\Big)\,\|v\|_1.

In order to show that \max_j \sum_i |a_{ij}| is actually the smallest α having the property ‖Av‖₁ ≤ α‖v‖₁ for all v ∈ Cⁿ, it is sufficient to construct a vector u (depending on A) such that

\|Au\|_1 = \Big(\max_j \sum_i |a_{ij}|\Big)\,\|u\|_1.

If j₀ is a subscript such that

\max_j \sum_i |a_{ij}| = \sum_i |a_{ij_0}|,

then the vector u with entries uᵢ = 0 for i ≠ j₀ and u_{j₀} = 1 works.
Similarly,

\|Av\|_\infty = \max_i \Big|\sum_j a_{ij} v_j\Big| \le \Big(\max_i \sum_j |a_{ij}|\Big)\,\|v\|_\infty.

Let i₀ be a subscript such that

\max_i \sum_j |a_{ij}| = \sum_j |a_{i_0 j}|.

The vector u with u_j = \overline{a_{i_0 j}}/|a_{i_0 j}| for a_{i_0 j} ≠ 0 and u_j = 1 for a_{i_0 j} = 0 verifies

\|Au\|_\infty = \Big(\max_i \sum_j |a_{ij}|\Big)\,\|u\|_\infty.

Since A*A is a Hermitian matrix, there exists an eigendecomposition A*A = QΛQ*, where Q is a unitary matrix whose columns are eigenvectors and Λ is a diagonal matrix whose entries are the eigenvalues of A*A (all of them real). These eigenvalues cannot be negative: if λ were a negative eigenvalue with eigenvector q, then 0 ≤ ‖Aq‖₂² = q*A*Aq = q*λq = λ‖q‖₂², a contradiction. So,

\|A\|_2 = \max_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} = \max_{x \ne 0} \frac{(x^*A^*Ax)^{1/2}}{\|x\|_2} = \max_{x \ne 0} \frac{(x^*Q\Lambda Q^*x)^{1/2}}{\|x\|_2} = \max_{x \ne 0} \frac{\big((Q^*x)^*\Lambda (Q^*x)\big)^{1/2}}{\|Q^*x\|_2} = \max_{y \ne 0} \frac{(y^*\Lambda y)^{1/2}}{\|y\|_2} = \max_{y \ne 0} \sqrt{\frac{\sum_i \lambda_i |y_i|^2}{\sum_i |y_i|^2}} \le \sqrt{\lambda_{\max}};

the equality holds if y is a suitably chosen column of the identity matrix.
Let us prove now that ρ(A*A) = ρ(AA*). If ρ(A*A) > 0, there exists p such that p ≠ 0, A*Ap = ρ(A*A)p and Ap ≠ 0 (since ρ(A*A) > 0). Since Ap ≠ 0 and AA*(Ap) = ρ(A*A)Ap, it follows that 0 < ρ(A*A) ≤ ρ(AA*), and therefore ρ(AA*) = ρ(A*A) (because (A*)* = A). If ρ(A*A) = 0, then ρ(AA*) = 0. Hence, in all cases ‖A‖₂² = ρ(A*A) = ρ(AA*) = ‖A*‖₂².
The invariance of the ‖·‖₂ norm under unitary transformations follows from the relations

\rho(A^*A) = \rho(U^*A^*AU) = \rho(A^*U^*UA) = \rho(U^*A^*U\,U^*AU).

Finally, if A is normal, there exists a unitary matrix U such that

U^*AU = \operatorname{diag}(\lambda_i(A)) \stackrel{\mathrm{def}}{=} \Lambda.

In this case

A^*A = (U\Lambda U^*)^*(U\Lambda U^*) = U\Lambda^*\Lambda U^*,

which shows that

\rho(A^*A) = \rho(\Lambda^*\Lambda) = \max_i |\lambda_i(A)|^2 = (\rho(A))^2. \qquad \square

Remark 2.1.5. 1) If A is Hermitian or symmetric (therefore normal),

\|A\|_2 = \rho(A).

2) If A is unitary or orthogonal (therefore normal),

\|A\|_2 = \sqrt{\rho(A^*A)} = \sqrt{\rho(I)} = 1.

3) Theorem 2.1.4 states that, for normal matrices and the ‖·‖₂ norm,

\|A\|_2 = \rho(A).

4) ‖·‖∞ is called the Chebyshev norm or m-norm, ‖·‖₁ is called the Minkowski norm or l-norm, and ‖·‖₂ is the Euclidean norm. ♦
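A small NumPy check of Theorem 2.1.4 (an added illustration, not from the original text): the three subordinate norms of a random matrix, computed from their column-sum, spectral and row-sum characterizations, agree with the library values.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

norm1    = np.abs(A).sum(axis=0).max()                           # max column sum
norm_inf = np.abs(A).sum(axis=1).max()                           # max row sum
norm2    = np.sqrt(np.max(np.linalg.eigvalsh(A.conj().T @ A)))   # sqrt(rho(A* A))

print(np.allclose(norm1,    np.linalg.norm(A, 1)))       # True
print(np.allclose(norm_inf, np.linalg.norm(A, np.inf)))  # True
print(np.allclose(norm2,    np.linalg.norm(A, 2)))       # True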

Theorem 2.1.6. (1) Let A be an arbitrary square matrix and k · k a certain matrix
norm (subordinate or not). Then

ρ(A) ≤ kAk. (2.1.1)

(2) Given a matrix A and a number ε > 0, there exists a subordinate matrix norm
such that
kAk ≤ ρ(A) + ε. (2.1.2)

Proof. (1) Let p be a vector verifying p ≠ 0, Ap = λp, |λ| = ρ(A), and let q be a vector such that pqᵀ ≠ 0. Since

\rho(A)\,\|pq^T\| = \|\lambda pq^T\| = \|Apq^T\| \le \|A\|\,\|pq^T\|,

(2.1.1) follows immediately.
(2) Let A be a given matrix. There exists an invertible matrix U such that U⁻¹AU is upper triangular (in fact, U can be taken unitary):

U^{-1}AU = \begin{bmatrix} \lambda_1 & t_{12} & t_{13} & \cdots & t_{1,n} \\ & \lambda_2 & t_{23} & \cdots & t_{2,n} \\ & & \ddots & & \vdots \\ & & & \lambda_{n-1} & t_{n-1,n} \\ & & & & \lambda_n \end{bmatrix};

the scalars λᵢ are the eigenvalues of A. To any scalar δ ≠ 0 we associate the matrix

D_\delta = \operatorname{diag}(1, \delta, \delta^2, \ldots, \delta^{n-1}),

such that

(UD_\delta)^{-1} A (UD_\delta) = \begin{bmatrix} \lambda_1 & \delta t_{12} & \delta^2 t_{13} & \cdots & \delta^{n-1} t_{1n} \\ & \lambda_2 & \delta t_{23} & \cdots & \delta^{n-2} t_{2n} \\ & & \ddots & & \vdots \\ & & & \lambda_{n-1} & \delta t_{n-1,n} \\ & & & & \lambda_n \end{bmatrix}.

Given ε > 0, we fix δ such that

\sum_{j=i+1}^{n} |\delta^{j-i} t_{ij}| \le \varepsilon, \qquad 1 \le i \le n-1.

Then the map

\|\cdot\| : B \in \mathbb{K}^{n \times n} \mapsto \|B\| = \|(UD_\delta)^{-1} B (UD_\delta)\|_\infty,   (2.1.3)

which depends on A and ε, solves the problem. Indeed,

\|A\| \le \rho(A) + \varepsilon,

and, according to the choice of δ and the definition of the ‖·‖∞ norm (\|[c_{ij}]\|_\infty = \max_i \sum_j |c_{ij}|), the norm given by (2.1.3) is a matrix norm subordinate to the vector norm

v \in \mathbb{K}^n \mapsto \|(UD_\delta)^{-1} v\|_\infty. \qquad \square

An important matrix norm which is not a subordinate matrix norm is the Frobenius norm:

\|A\|_E = \Big(\sum_i \sum_j |a_{ij}|^2\Big)^{1/2} = \{\operatorname{tr}(A^*A)\}^{1/2}.

It is not a subordinate norm, since ‖I‖_E = √n ≠ 1 for n > 1.

Theorem 2.1.7. Let B be a square matrix. The following statements are equivalent:

(1) \lim_{k \to \infty} B^k = 0;

(2) \lim_{k \to \infty} B^k v = 0, ∀ v ∈ Kⁿ;

(3) ρ(B) < 1;

(4) there exists a subordinate matrix norm such that ‖B‖ < 1.

Proof. (1) ⇒ (2): ‖B^k v‖ ≤ ‖B^k‖ ‖v‖ ⇒ \lim_{k \to \infty} B^k v = 0.
(2) ⇒ (3): If ρ(B) ≥ 1, we can find p such that p ≠ 0, Bp = λp, |λ| ≥ 1. Then the vector sequence (B^k p)_{k∈N} cannot converge to 0.
(3) ⇒ (4): ρ(B) < 1 ⇒ for every ε > 0 there exists a subordinate norm ‖·‖ such that ‖B‖ ≤ ρ(B) + ε; choosing ε < 1 − ρ(B) gives ‖B‖ < 1.
(4) ⇒ (1): It is sufficient to apply the inequality ‖B^k‖ ≤ ‖B‖^k. □

2.2 Condition of a linear system


We are interested in the conditioning of the problem: given the matrix A ∈ Kn×n
and the vector b ∈ Kn×1 , solve the system

Ax = b.

Consider the system (this example is due to Wilson)

\begin{bmatrix} 10 & 7 & 8 & 7 \\ 7 & 5 & 6 & 5 \\ 8 & 6 & 10 & 9 \\ 7 & 5 & 9 & 10 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 32 \\ 23 \\ 33 \\ 31 \end{bmatrix},

having the solution (1, 1, 1, 1)ᵀ, and consider the perturbed system in which the right-hand side is slightly modified, the system matrix remaining unchanged,

\begin{bmatrix} 10 & 7 & 8 & 7 \\ 7 & 5 & 6 & 5 \\ 8 & 6 & 10 & 9 \\ 7 & 5 & 9 & 10 \end{bmatrix} \begin{bmatrix} x_1 + \delta x_1 \\ x_2 + \delta x_2 \\ x_3 + \delta x_3 \\ x_4 + \delta x_4 \end{bmatrix} = \begin{bmatrix} 32.1 \\ 22.9 \\ 33.1 \\ 30.9 \end{bmatrix},

having the solution (9.2, −12.6, 4.5, −1.1)ᵀ. In other words, a relative error of about 1/200 in the input data causes a relative error of about 10/1 in the result – an amplification of the relative error by a factor of roughly 2000!
Consider now the system with a perturbed matrix,

\begin{bmatrix} 10 & 7 & 8.1 & 7.2 \\ 7.08 & 5.04 & 6 & 5 \\ 8 & 5.98 & 9.89 & 9 \\ 6.99 & 4.99 & 9 & 9.98 \end{bmatrix} \begin{bmatrix} x_1 + \Delta x_1 \\ x_2 + \Delta x_2 \\ x_3 + \Delta x_3 \\ x_4 + \Delta x_4 \end{bmatrix} = \begin{bmatrix} 32 \\ 23 \\ 33 \\ 31 \end{bmatrix},

having the solution (−81, 137, −34, 22)ᵀ. Again, a small variation in the input data (here, the matrix elements) modifies the result dramatically. Yet the original matrix has a "good" shape: it is symmetric, its determinant is equal to 1, and its inverse,

\begin{bmatrix} 25 & -41 & 10 & -6 \\ -41 & 68 & -17 & 10 \\ 10 & -17 & 5 & -3 \\ -6 & 10 & -3 & 2 \end{bmatrix},

is also "nice".
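The Wilson example is easy to reproduce with NumPy (an added sketch, not from the original text): the exact data give the solution (1, 1, 1, 1)ᵀ, the slightly perturbed right-hand side gives a completely different solution, and the condition number explains the amplification.

import numpy as np

A = np.array([[10., 7.,  8.,  7.],
              [ 7., 5.,  6.,  5.],
              [ 8., 6., 10.,  9.],
              [ 7., 5.,  9., 10.]])
b      = np.array([32.0, 23.0, 33.0, 31.0])
b_pert = np.array([32.1, 22.9, 33.1, 30.9])

x      = np.linalg.solve(A, b)        # [1, 1, 1, 1]
x_pert = np.linalg.solve(A, b_pert)   # about [9.2, -12.6, 4.5, -1.1]
print(x)
print(x_pert)

# Amplification of the relative error, compared with the bound cond(A):
amp = (np.linalg.norm(x_pert - x) / np.linalg.norm(x)) / \
      (np.linalg.norm(b_pert - b) / np.linalg.norm(b))
print(amp, np.linalg.cond(A, 2))      # roughly 2.5e3 and about 3.0e3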


Let us now consider the system parameterized by t,

(A + t\,\Delta A)\,x(t) = b + t\,\Delta b, \qquad x(0) = x.

A being nonsingular, the function x is differentiable at t = 0:

\dot{x}(0) = A^{-1}(\Delta b - \Delta A\, x).

The Taylor expansion of x(t) is given by

x(t) = x + t\,\dot{x}(0) + O(t^2).

It thus follows that the absolute error can be estimated by

\|\Delta x(t)\| = \|x(t) - x\| \le |t|\,\|\dot{x}(0)\| + O(t^2) \le |t|\,\|A^{-1}\|\,(\|\Delta b\| + \|\Delta A\|\,\|x\|) + O(t^2),

and (due to ‖b‖ ≤ ‖A‖‖x‖) we get for the relative error

\frac{\|\Delta x(t)\|}{\|x\|} \le |t|\,\|A^{-1}\|\left(\frac{\|\Delta b\|}{\|x\|} + \|\Delta A\|\right) + O(t^2) \le \|A\|\,\|A^{-1}\|\left(|t|\,\frac{\|\Delta b\|}{\|b\|} + |t|\,\frac{\|\Delta A\|}{\|A\|}\right) + O(t^2).   (2.2.1)

By introducing the notations

\rho_A(t) := |t|\,\frac{\|\Delta A\|}{\|A\|}, \qquad \rho_b(t) := |t|\,\frac{\|\Delta b\|}{\|b\|}

for the relative errors in A and b, the relative error estimate can be written as

\frac{\|\Delta x(t)\|}{\|x\|} \le \|A\|\,\|A^{-1}\|\,(\rho_A + \rho_b) + O(t^2).   (2.2.2)
Definition 2.2.1. If A is nonsingular, the number

\operatorname{cond}(A) = \|A\|\,\|A^{-1}\|   (2.2.3)

is called the condition number of the matrix A.

The relation (2.2.2) can be rewritten as

\frac{\|\Delta x(t)\|}{\|x\|} \le \operatorname{cond}(A)\,(\rho_A + \rho_b) + O(t^2).   (2.2.4)

Example 2.2.2 (Ill-conditioned matrix). Consider the n-th order Hilbert¹ matrix, Hₙ = (h_{ij}), given by

h_{ij} = \frac{1}{i+j-1}, \qquad i, j = 1, \ldots, n.

This is a symmetric positive definite matrix, so it is nonsingular. For various values of n one gets, in the Euclidean norm,

n            10          20           40
cond₂(Hₙ)    1.6·10¹³    2.45·10²⁸    7.65·10⁵⁸

A system of order n = 10, for example, cannot be solved with any reliability in single precision on a 14-decimal-digit computer. Double precision will be "exhausted" by the time we reach n = 20. The Hilbert matrix is thus a prototype of an ill-conditioned matrix. From a result of G. Szegő² it can be seen that

\operatorname{cond}_2(H_n) \sim \frac{(\sqrt{2}+1)^{4n+4}}{2^{15/4}\sqrt{\pi n}}. ♦

¹ David Hilbert (1862–1943) was the most prominent member of the Göttingen school of mathematics. Hilbert's fundamental contributions to almost all parts of mathematics — algebra, number theory, geometry, integral equations, calculus of variations, and foundations — and in particular the 23 now famous problems he proposed in 1900 at the International Congress of Mathematicians in Paris, gave a new impetus, and new directions, to 20th-century mathematics.
² Gabor Szegő (1895–1985), Hungarian mathematician. Szegő's most important work was in the area of extremal problems and Toeplitz matrices. His Orthogonal Polynomials appeared in 1939, published by the American Mathematical Society; it has proved highly successful, running to four editions and many reprints over the years. He cooperated with Pólya on the joint problem book Aufgaben und Lehrsätze aus der Analysis, volumes I and II (Problems and Theorems in Analysis, 1925), which has since gone through many editions and has had an enormous impact on later generations of mathematicians.
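For illustration (an added sketch, not from the original text), the growth of cond₂(Hₙ) can be observed directly with NumPy; beyond n ≈ 13 the computed values are themselves unreliable in double precision, which is exactly the point of the example.

import numpy as np

def hilbert(n):
    i = np.arange(1, n + 1)
    return 1.0 / (i[:, None] + i[None, :] - 1.0)   # h_ij = 1/(i + j - 1)

for n in (5, 10, 13):
    print(n, np.linalg.cond(hilbert(n), 2))
# The condition number grows roughly like e^{3.5 n}:
# about 4.8e5 for n = 5 and about 1.6e13 for n = 10, matching the table above.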

Example 2.2.3 (Ill-conditioned matrix). Vandermonde matrices are of the form

V_n = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ t_1 & t_2 & \cdots & t_n \\ \vdots & \vdots & \ddots & \vdots \\ t_1^{n-1} & t_2^{n-1} & \cdots & t_n^{n-1} \end{bmatrix},

where the tᵢ are real parameters. If the tᵢ are equally spaced in [−1, 1], then

\operatorname{cond}_\infty(V_n) \sim \frac{1}{\pi}\, e^{-\pi/4}\, e^{\,n\left(\frac{\pi}{4} + \frac{1}{2}\ln 2\right)}.

For tᵢ = 1/i, i = 1, . . . , n,

\operatorname{cond}_\infty(V_n) > n^{n+1}. ♦

2.3 Gaussian Elimination


Let us consider the linear system having n equations and n unknowns
Ax = b, (2.3.1)
where A ∈ Kn×n , b ∈ Kn×1 are given, and x ∈ Kn×1 must be determined, or written
in a detailed fashion:

a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1 \quad (E_1)
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_2 \quad (E_2)
\qquad \vdots   (2.3.2)
a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n = b_n \quad (E_n)


The Gaussian³ elimination method has two stages:

e1) Transforming the given system into a triangular one.

e2) Solving the triangular system using back substitution.

During the solution of the system (2.3.1) or (2.3.2) the following transforms are
allowed:

1. The equation Ei can be multiplied by λ ∈ K∗ . This operation will be denoted


by (λEi ) → (Ei ).

2. The equation Ej can be multiplied by λ ∈ K∗ and added to the equation Ei ,


the result replacing Ei . Notation (Ei + λEj ) → (Ei ).

3. The equations Ei and Ej can be interchanged; notation (Ei) ←→ (Ej).

In order to express conveniently the transformation into a triangular system we shall use the extended matrix

$$\tilde{A} = [A, b] = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} & a_{1,n+1}\\ a_{21} & a_{22} & \ldots & a_{2n} & a_{2,n+1}\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ a_{n1} & a_{n2} & \ldots & a_{nn} & a_{n,n+1}\end{pmatrix}$$

with $a_{i,n+1} = b_i$.
³ Johann Carl Friedrich Gauss (1777-1855) was one of the greatest mathematicians of the 19th century — and perhaps of all time. He spent almost his entire life in Göttingen, where he was the director of the observatory for some 40 years. Already as a student in Göttingen, Gauss discovered that the 17-gon can be constructed by compass and ruler, thereby settling a problem that had been open since antiquity. His dissertation gave the first proof of the Fundamental Theorem of Algebra. He went on to make fundamental contributions to number theory, differential and non-Euclidean geometry, elliptic and hypergeometric functions, celestial mechanics and geodesy, and various branches of physics, notably magnetism and optics. His computational efforts in celestial mechanics and geodesy, based on the principle of least squares, required the solution (by hand) of large systems of linear equations, for which he used what today are known as Gaussian elimination and relaxation methods. Gauss’s work on quadrature builds upon the earlier work of Newton and Cotes.

Assuming a₁₁ ≠ 0, we shall eliminate the coefficients of x₁ in Eⱼ, for j = 2, n, using the operation (Eⱼ − (a_{j1}/a_{11})E₁) → (Eⱼ). We proceed similarly for the coefficients of xᵢ, for i = 2, n−1, j = i+1, n. This requires aᵢᵢ ≠ 0.
The procedure can be described as follows: one builds the sequence of extended matrices Ã⁽¹⁾, Ã⁽²⁾, ..., Ã⁽ⁿ⁾, where Ã⁽¹⁾ = Ã and the elements $a_{ij}^{(k)}$ of Ã⁽ᵏ⁾ are obtained through

$$\Bigl(E_i - \frac{a_{i,k-1}^{(k-1)}}{a_{k-1,k-1}^{(k-1)}}\,E_{k-1}\Bigr) \longrightarrow (E_i).$$

Remark 2.3.1. $a_{ij}^{(p)}$ denotes the value of aᵢⱼ at the p-th step. ♦
Thus

$$\tilde{A}^{(k)} = \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & a_{13}^{(1)} & \ldots & a_{1,k-1}^{(1)} & a_{1k}^{(1)} & \ldots & a_{1n}^{(1)} & a_{1,n+1}^{(1)}\\
0 & a_{22}^{(2)} & a_{23}^{(2)} & \ldots & a_{2,k-1}^{(2)} & a_{2k}^{(2)} & \ldots & a_{2n}^{(2)} & a_{2,n+1}^{(2)}\\
\vdots & & \ddots & & \vdots & \vdots & & \vdots & \vdots\\
\vdots & & & & a_{k-1,k-1}^{(k-1)} & a_{k-1,k}^{(k-1)} & \ldots & a_{k-1,n}^{(k-1)} & a_{k-1,n+1}^{(k-1)}\\
\vdots & & & & 0 & a_{kk}^{(k)} & \ldots & a_{kn}^{(k)} & a_{k,n+1}^{(k)}\\
\vdots & & & & \vdots & \vdots & & \vdots & \vdots\\
0 & \ldots & \ldots & \ldots & 0 & a_{nk}^{(k)} & \ldots & a_{nn}^{(k)} & a_{n,n+1}^{(k)}
\end{pmatrix}$$

represents an equivalent linear system in which the variable x_{k−1} has been eliminated from the equations Eₖ, E_{k+1}, ..., Eₙ. The system corresponding to the matrix Ã⁽ⁿ⁾ is triangular and equivalent to

$$\begin{cases}
a_{11}^{(1)}x_1 + a_{12}^{(1)}x_2 + \cdots + a_{1n}^{(1)}x_n = a_{1,n+1}^{(1)}\\
\phantom{a_{11}^{(1)}x_1 +\ } a_{22}^{(2)}x_2 + \cdots + a_{2n}^{(2)}x_n = a_{2,n+1}^{(2)}\\
\phantom{a_{11}^{(1)}x_1 + a_{22}^{(2)}x_2 +\ }\ddots\qquad\vdots\\
\phantom{a_{11}^{(1)}x_1 + a_{22}^{(2)}x_2 + \cdots +\ } a_{nn}^{(n)}x_n = a_{n,n+1}^{(n)}
\end{cases}$$

One obtains

$$x_n = \frac{a_{n,n+1}^{(n)}}{a_{nn}^{(n)}}$$

and, generally,

$$x_i = \frac{1}{a_{ii}^{(i)}}\Bigl(a_{i,n+1}^{(i)} - \sum_{j=i+1}^{n} a_{ij}^{(i)}x_j\Bigr), \quad i = n-1, \ldots, 1.$$

The procedure is applicable only if $a_{ii}^{(i)} \neq 0$, i = 1, n. The element $a_{ii}^{(i)}$ is called the pivot. If during the elimination process, at the k-th step, one obtains $a_{kk}^{(k)} = 0$, one can perform the line interchange (Eₖ) ↔ (Eₚ), where k+1 ≤ p ≤ n is the smallest integer satisfying $a_{pk}^{(k)} \neq 0$. In practice, such interchanges are necessary even if the pivot is nonzero. The reason is that a small pivot causes large rounding errors and even cancellation. The remedy is to choose as pivot the subdiagonal element in the same column having the largest absolute value. That is, we must find a p such that

$$|a_{pk}^{(k)}| = \max_{k\le i\le n}|a_{ik}^{(k)}|,$$

and then perform the interchange (Eₖ) ↔ (Eₚ). This technique is called column maximal pivoting or partial pivoting.
Another technique which decreases errors and prevents floating-point cancellation is scaled column pivoting. In a first step we define a scaling factor for each line,

$$s_i = \max_{j=1,n}|a_{ij}| \quad\text{or}\quad s_i = \sum_{j=1}^{n}|a_{ij}|.$$

If an i such that sᵢ = 0 exists, the matrix is singular. The next steps establish which interchange is to be done. In the i-th step one finds the smallest integer p, i ≤ p ≤ n, such that

$$\frac{|a_{pi}|}{s_p} = \max_{i\le j\le n}\frac{|a_{ji}|}{s_j}$$

and then (Eᵢ) ↔ (Eₚ). Scaling guarantees that the largest element in each line has relative magnitude 1 before the comparisons needed for line interchange are made. Scaling is performed only for comparison purposes, so the division by the scaling factor does not introduce any rounding error. The third method is total pivoting or maximal pivoting. In this method, at the k-th step one finds

$$\max\{|a_{ij}| : i = k, n,\ j = k, n\}$$

and both line and column interchanges are carried out.
Pivoting was introduced for the first time by Goldstine and von Neumann, 1947
[19].
Remark 2.3.2. Some suggestions which speed up the running time.

1. Pivoting does not require physical row or column interchanges. One can maintain one (or two) permutation vector(s) p (q); p[i] (q[i]) records the line (column) that was interchanged with the i-th line (column). This is a good solution if matrices are stored row by row or column by column; for other representations or memory hierarchies, physical interchange could yield better results.

2. The subdiagonal elements (that vanish) need not be computed.

3. A matrix A can be inverted by solving the systems Ax = eₖ, k = 1, n, where the eₖ are the vectors of the canonical basis of 𝕂ⁿ – the simultaneous equations method. ♦
Analysis of Gaussian elimination. The method is given by Algorithm 2.1. Our

Algorithm 2.1 Solution of the system Ax = b by Gaussian elimination


Input: The extended matrix A = (aij ), i = 1, n, j = 1, n + 1
Output: Solution x1 , . . . , xn or an error message
1: for i := 1 to n − 1 do
2: Let p be the smallest integer such that i ≤ p ≤ n and api 6= 0
3:    if no such p exists then
4:        error (’no unique solution exists’); STOP
5:    end if
6:    if p ≠ i then
7:        (Eₚ) ↔ (Eᵢ)
8:    end if
9:    for j := i + 1 to n do
10:       m_{ji} := a_{ji}/a_{ii};
11:       (Eⱼ − m_{ji}Eᵢ) → (Eⱼ);
12:   end for
13: end for
14: if a_{nn} = 0 then
15:    error (’no unique solution exists’); STOP
16: end if
17: xₙ := a_{n,n+1}/a_{nn};
18: for i := n − 1 downto 1 do
19:    xᵢ := (a_{i,n+1} − Σ_{j=i+1}^{n} a_{ij}xⱼ)/a_{ii};
20: end for
21: Output (x1 , . . . , xn ) {success} STOP.

complexity measure is the number of floating point operations or, shortly, flops. In the innermost loop, lines 10–11, we have 2n − 2i + 3 flops, for a total of (n − i)(2n − 2i + 3) flops. For the outer loop, the total is

$$\sum_{i=1}^{n-1}(n-i)(2n-2i+3) \sim \frac{2n^3}{3}.$$

Back substitution takes Θ(n²) flops. Overall total: Θ(n³).
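To make the algorithm concrete, here is an illustrative Python/NumPy transcription of Algorithm 2.1 with column maximal (partial) pivoting; it is a didactic sketch that mirrors the pseudocode, not an optimized routine (library functions such as `numpy.linalg.solve` should be preferred in practice). The function name and the test data are ours.

```python
import numpy as np

def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    n = len(b)
    Ae = np.hstack([A, b.reshape(n, 1)])      # extended matrix [A, b]
    for i in range(n - 1):
        # partial pivoting: row p >= i with the largest |a_pi|
        p = i + np.argmax(np.abs(Ae[i:, i]))
        if Ae[p, i] == 0.0:
            raise ValueError("no unique solution")
        if p != i:
            Ae[[i, p]] = Ae[[p, i]]           # (E_p) <-> (E_i)
        for j in range(i + 1, n):
            m = Ae[j, i] / Ae[i, i]
            Ae[j, i:] -= m * Ae[i, i:]        # (E_j - m E_i) -> (E_j)
    if Ae[n - 1, n - 1] == 0.0:
        raise ValueError("no unique solution")
    x = np.zeros(n)                           # back substitution
    for i in range(n - 1, -1, -1):
        x[i] = (Ae[i, n] - Ae[i, i + 1:n] @ x[i + 1:]) / Ae[i, i]
    return x

A = np.array([[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]])
b = np.array([5.0, -2.0, 9.0])
print(gauss_solve(A, b), np.linalg.solve(A, b))   # the two results agree
```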



2.4 Factorization based methods


2.4.1 LU decomposition
Theorem 2.4.1. If the Gaussian elimination for the system Ax = b can be carried out without line interchanges, then A can be factored as A = LU, where L is a lower triangular matrix and U is an upper triangular matrix. The pair (L, U) is the LU decomposition of the matrix A.

Advantages. Ax = b ⇔ LUx = b ⇔ Ly = b ∧ Ux = y. If we have to solve several systems Ax = bᵢ, i = 1, m, each takes Θ(n³), for a total of Θ(mn³); if we begin with an LU decomposition, which takes Θ(n³), and then solve each system in Θ(n²), we need only Θ(n³ + mn²) time.

Remark 2.4.2. U is the upper triangular matrix generated by Gaussian elimination,


and L is the matrix of multipliers mij . ♦

If Gaussian elimination is carried out with line interchanges, it still holds that A = LU, but L is no longer lower triangular.
The method is called LU factorization.
We give two examples where Gaussian elimination can be carried out without interchanges:

- A is row diagonally dominant, that is,

$$|a_{ii}| > \sum_{\substack{j=1\\ j\neq i}}^{n}|a_{ij}|, \quad i = 1, n;$$

- A is positive definite (∀ x ≠ 0, x*Ax > 0).

Proof of Theorem 2.4.1 (sketch). For n > 1 we split A in the following way:

$$A = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n}\\ a_{21} & a_{22} & \ldots & a_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \ldots & a_{nn}\end{pmatrix} = \begin{pmatrix} a_{11} & w^*\\ v & A'\end{pmatrix},$$

where v is an (n−1) column vector and w* is an (n−1) row vector. We can factor A as

$$A = \begin{pmatrix} a_{11} & w^*\\ v & A'\end{pmatrix} = \begin{pmatrix} 1 & 0\\ v/a_{11} & I_{n-1}\end{pmatrix}\begin{pmatrix} a_{11} & w^*\\ 0 & A' - vw^*/a_{11}\end{pmatrix}.$$

The matrix A' − vw*/a₁₁ is called the Schur complement of A with respect to a₁₁. Then we proceed with the recursive decomposition of the Schur complement:

$$A' - vw^*/a_{11} = L'U'.$$

$$A = \begin{pmatrix} 1 & 0\\ v/a_{11} & I_{n-1}\end{pmatrix}\begin{pmatrix} a_{11} & w^*\\ 0 & A' - vw^*/a_{11}\end{pmatrix} = \begin{pmatrix} 1 & 0\\ v/a_{11} & I_{n-1}\end{pmatrix}\begin{pmatrix} a_{11} & w^*\\ 0 & L'U'\end{pmatrix} = \begin{pmatrix} 1 & 0\\ v/a_{11} & L'\end{pmatrix}\begin{pmatrix} a_{11} & w^*\\ 0 & U'\end{pmatrix}.$$

We have several choices for uᵢᵢ and lᵢᵢ, i = 1, n. For example, if lᵢᵢ = 1 we have the Doolittle factorization, and if uᵢᵢ = 1 we have the Crout factorization.
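The recursion in the proof sketch translates almost verbatim into code. The following Python/NumPy function is an illustrative sketch (ours, not from the original text) of the Doolittle variant obtained by recursing on the Schur complement; it assumes no pivoting is needed, e.g. A diagonally dominant or positive definite.

```python
import numpy as np

def lu_schur(A):
    """Doolittle LU factorization (L unit lower triangular) by recursion
    on the Schur complement; assumes all leading pivots are nonzero."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return np.array([[1.0]]), A.copy()
    a11, w = A[0, 0], A[0, 1:]
    v, A1 = A[1:, 0], A[1:, 1:]
    S = A1 - np.outer(v, w) / a11          # Schur complement A' - v w*/a11
    L1, U1 = lu_schur(S)
    L = np.eye(n)
    L[1:, 0] = v / a11
    L[1:, 1:] = L1
    U = np.zeros((n, n))
    U[0, 0], U[0, 1:] = a11, w
    U[1:, 1:] = U1
    return L, U

A = np.array([[4.0, 3.0, 2.0], [2.0, 4.0, 1.0], [1.0, 2.0, 3.0]])
L, U = lu_schur(A)
print(np.allclose(L @ U, A))               # True
```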

2.4.2 LUP decomposition


The idea behind the LUP decomposition is to find three square matrices L, U and P, where L is lower triangular, U is upper triangular, and P is a permutation matrix, such that PA = LU.
The triple (L, U, P) is called the LUP decomposition of the matrix A.
The solution of the system Ax = b is equivalent to the solution of two triangular systems, since

$$Ax = b \;\Leftrightarrow\; LUx = Pb \;\Leftrightarrow\; Ly = Pb \ \wedge\ Ux = y$$

and

$$Ax = P^{-1}LUx = P^{-1}Ly = P^{-1}Pb = b.$$

We shall choose as pivot a_{k1} instead of a_{11}. The effect is a multiplication by a permutation matrix Q:

$$QA = \begin{pmatrix} a_{k1} & w^*\\ v & A'\end{pmatrix} = \begin{pmatrix} 1 & 0\\ v/a_{k1} & I_{n-1}\end{pmatrix}\begin{pmatrix} a_{k1} & w^*\\ 0 & A' - vw^*/a_{k1}\end{pmatrix}.$$

Then we compute the LUP-decomposition of the Schur complement,

$$P'(A' - vw^*/a_{k1}) = L'U'.$$

We define

$$P = \begin{pmatrix} 1 & 0\\ 0 & P'\end{pmatrix}Q,$$

which is a permutation matrix too. We have now

$$PA = \begin{pmatrix}1 & 0\\ 0 & P'\end{pmatrix}QA = \begin{pmatrix}1 & 0\\ 0 & P'\end{pmatrix}\begin{pmatrix}1 & 0\\ v/a_{k1} & I_{n-1}\end{pmatrix}\begin{pmatrix}a_{k1} & w^*\\ 0 & A'-vw^*/a_{k1}\end{pmatrix} = \begin{pmatrix}1 & 0\\ P'v/a_{k1} & P'\end{pmatrix}\begin{pmatrix}a_{k1} & w^*\\ 0 & A'-vw^*/a_{k1}\end{pmatrix}$$
$$= \begin{pmatrix}1 & 0\\ P'v/a_{k1} & I_{n-1}\end{pmatrix}\begin{pmatrix}a_{k1} & w^*\\ 0 & P'(A'-vw^*/a_{k1})\end{pmatrix} = \begin{pmatrix}1 & 0\\ P'v/a_{k1} & I_{n-1}\end{pmatrix}\begin{pmatrix}a_{k1} & w^*\\ 0 & L'U'\end{pmatrix} = \begin{pmatrix}1 & 0\\ P'v/a_{k1} & L'\end{pmatrix}\begin{pmatrix}a_{k1} & w^*\\ 0 & U'\end{pmatrix}.$$

Note that in this reasoning both the column vector and the Schur complement are multiplied by the permutation matrix P'.

2.4.3 Cholesky factorization


Hermitian positive definite matrices can be decomposed into triangular factors twice
as quickly as general matrices. The standard algorithm for this, Cholesky factoriza-
tion, is a variant of Gaussian elimination, which operates on both the left and the right
of the matrix at once, preserving and exploiting the symmetry.
Systems having hermitian positive definite matrices play an important role in
Numerical Linear Algebra and its applications. Many matrices that arise in physical
systems are hermitian positive definite because of the fundamental physical laws.
Properties of hermitian matrices. Let A be an m × m hermitian positive definite matrix.
1. If X is a full-rank m × n matrix, then X*AX is hermitian positive definite;
2. Any principal submatrix of A is positive definite;
3. Any diagonal element of A is a positive real number;
4. The eigenvalues of A are positive real numbers;
5. Eigenvectors corresponding to distinct eigenvalues of a hermitian matrix are
orthogonal.
A Cholesky factorization of a matrix A is a decomposition
A = R∗ R, rjj > 0, (2.4.1)
where R is an upper triangular matrix.

Theorem 2.4.3. Every hermitian positive definite matrix A ∈ Cm×m has a unique
Cholesky factorization (2.4.1).

Proof. (Existence) Since A is hermitian and positive definite, a₁₁ > 0 and we may set α = √a₁₁. Note that

$$A = \begin{pmatrix} a_{11} & w^*\\ w & K\end{pmatrix} = \begin{pmatrix} \alpha & 0\\ w/\alpha & I\end{pmatrix}\begin{pmatrix} 1 & 0\\ 0 & K - ww^*/a_{11}\end{pmatrix}\begin{pmatrix} \alpha & w^*/\alpha\\ 0 & I\end{pmatrix} = R_1^*A_1R_1. \qquad (2.4.2)$$

This is the basic step that is repeated in the Cholesky factorization. The matrix K − ww*/a₁₁, being the (m−1)×(m−1) principal submatrix of the positive definite matrix R₁⁻*AR₁⁻¹, is positive definite, and hence its upper left element is positive. By induction, all matrices that appear during the factorization are positive definite, and thus the process cannot break down. We proceed to the factorization A₁ = R₂*A₂R₂, and thus A = R₁*R₂*A₂R₂R₁; the process can be continued until the lower right corner is reached, getting

$$A = \underbrace{R_1^*R_2^*\cdots R_m^*}_{R^*}\;\underbrace{R_m\cdots R_2R_1}_{R};$$

this decomposition has the desired form.

(Uniqueness) In fact, the above process also establishes uniqueness. At each step (2.4.2), the value α = √a₁₁ is determined by the form of the factorization R*R, and once α is determined, the first row of R₁* is also determined. Since the analogous quantities are determined at each step, the entire factorization is unique. □

Since only half the matrix needs to be stored, it follows that half of the arithmetic
operations can be avoided. The inner loop dominates the work. A single execution of

Algorithm 2.2 Cholesky Factorization


Input: A symmetric positive definite matrix A
Output: The upper triangular matrix R
1: R := A;
2: for k := 1 to m do
3:    for j := k + 1 to m do
4:        R_{j,j:m} := R_{j,j:m} − R_{k,j:m}R_{k,j}/R_{k,k}
5:    end for
6:    R_{k,k:m} := R_{k,k:m}/√(R_{k,k})
7: end for

line 4 requires one division, m − j + 1 multiplications, and m − j + 1 subtractions, for a total of ∼ 2(m − j) flops. This calculation is repeated once for each j from k + 1 to m, and that loop is repeated for each k from 1 to m. The sum is straightforward to evaluate:

$$\sum_{k=1}^{m}\sum_{j=k+1}^{m} 2(m-j) \sim \sum_{k=1}^{m}\sum_{j=1}^{k} 2j \sim \sum_{k=1}^{m} k^2 \sim \frac{1}{3}m^3 \ \text{flops}.$$

Thus, Cholesky factorization involves half as many operations as Gaussian elimination.
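Algorithm 2.2 operates on the rows R_{k,k:m} in place. A direct Python/NumPy transcription, given here as an illustrative sketch for real symmetric positive definite matrices and without any safeguards, is:

```python
import numpy as np

def cholesky_upper(A):
    """Return upper triangular R with A = R^T R (Algorithm 2.2, real case)."""
    R = np.triu(np.array(A, dtype=float))   # only the upper half is used
    m = R.shape[0]
    for k in range(m):
        for j in range(k + 1, m):
            R[j, j:] -= R[k, j:] * (R[k, j] / R[k, k])
        R[k, k:] /= np.sqrt(R[k, k])
    return R

A = np.array([[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]])
R = cholesky_upper(A)
print(np.allclose(R.T @ R, A))             # True
```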

2.4.4 QR decomposition
Theorem 2.4.4. Let A ∈ Rm×n , with m ≥ n. Then, there exists a unique m × n
orthogonal matrix Q and a unique n×n upper triangular matrix R, having a positive
diagonal (rii > 0) such that A = QR.

Proof. It is a consequence of the algorithm 2.3 (to be given in this section). 

Orthogonal and unitary matrices are desirable for numerical computation because
they preserve length, preserve angles, and do not magnify errors.
A Householder⁴ transform (or reflection) is a matrix of the form P = I − 2uuᵀ, where ‖u‖₂ = 1. One easily checks that P = Pᵀ and

$$PP^T = (I - 2uu^T)(I - 2uu^T) = I - 4uu^T + 4uu^Tuu^T = I,$$

hence P is a symmetric orthogonal matrix. It is called a reflection since Px is the reflection of x with respect to the hyperplane H which passes through the origin and is orthogonal to u (Figure 2.1).
Given a vector x, it is easy to find a Householder reflection P = I − 2uuT to
zero out all but the first entry of x: P x = [c, 0, . . . , 0]T = ce1 . We do this as follows.

⁴ Alston S. Householder (1904-1993), American mathematician. He made important contributions to mathematical biology and mainly to numerical linear algebra. His well known book “The Theory of Matrices in Numerical Analysis” has had a great impact on the development of numerical analysis and computer science.

Figure 2.1: A Householder reflector

Write Px = x − 2u(uᵀx) = ce₁, so that $u = \frac{1}{2(u^Tx)}(x - ce_1)$, i.e., u is a linear combination of x and e₁. Since ‖x‖₂ = ‖Px‖₂ = |c|, u must be parallel to the vector ũ = x ± ‖x‖₂e₁, and so u = ũ/‖ũ‖₂. One can verify that any choice of sign yields a u satisfying Px = ce₁, as long as ũ ≠ 0. We will use ũ = x + sign(x₁)‖x‖₂e₁, since this means that there is no cancellation in computing the first component of ũ. If x₁ = 0, we conventionally choose sign(x₁) = 1. In summary, we get

$$\tilde u = \begin{pmatrix} x_1 + \operatorname{sign}(x_1)\|x\|_2\\ x_2\\ \vdots\\ x_n\end{pmatrix}, \qquad u = \frac{\tilde u}{\|\tilde u\|_2}.$$

We write this as u = House(x). In practice, we can store ũ instead of u, to save the work of computing u, and use the formula $P = I - \frac{2}{\|\tilde u\|_2^2}\tilde u\tilde u^T$ instead of P = I − 2uuᵀ. The matrix Pᵢ' need not be built explicitly; rather, we can apply it directly:

$$A_{i:m,i:n} = P_i'A_{i:m,i:n} = (I - 2u_iu_i^T)A_{i:m,i:n} = A_{i:m,i:n} - 2u_i(u_i^TA_{i:m,i:n}).$$

Starting from the relation

Ax = b ⇔ QRx = b ⇔ Rx = QT b,

we can choose the following strategy for the solution of linear system Ax = b:

Algorithm 2.3 QR factorization using Householder reflections


1: for i := 1 to min(m − 1, n) do
2: ui := House(Ai:m,i );
3: Ai:m,i:n := Ai:m,i:n − 2ui (uTi Ai:m,i:n );
4: end for

1. Determine the factorization A = QR of A;

2. Compute y = QT b;

3. Solve the upper triangular system Rx = y.

The computation of QT b can be performed by QT b = Pn Pn−1 . . . P1 b, so we


need to store the product of b by P1 , P2 , . . . , Pn – see Algorithm 2.4.

Algorithm 2.4 Computation of the product QT b


1: for i := 1 to n do
2: b_{i:m} := b_{i:m} − 2uᵢ(uᵢᵀ b_{i:m});
3: end for

Similarly, the computation of Qx can be performed by the same process, done in


reverted order (see Algorithm 2.5).

Algorithm 2.5 Computation of the product Qx


1: for k := n downto 1 do
2: x_{k:m} := x_{k:m} − 2uₖ(uₖᵀ x_{k:m});
3: end for

The cost of the QR decomposition A = QR is $2mn^2 - \frac{2}{3}n^3$ flops, and the costs of forming Qᵀb and Qx are both O(mn).
If we wish to compute the matrix Q explicitly, we can use Algorithm 2.5 to build QI, by computing the columns Qe₁, Qe₂, ..., Q e_m of QI.
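Putting together House(x), Algorithm 2.3 and Algorithm 2.4, a possible Python/NumPy sketch of the whole solution strategy for a square nonsingular system (reduce to R, apply the reflectors to b, then back substitute) is shown below. It stores the Householder vectors uᵢ instead of forming Q; the function names are ours and the code is illustrative only.

```python
import numpy as np

def house(x):
    """Householder vector u (||u||_2 = 1) with (I - 2 u u^T) x = c e_1."""
    u = x.astype(float).copy()
    sign = 1.0 if x[0] == 0.0 else np.sign(x[0])
    u[0] += sign * np.linalg.norm(x)
    return u / np.linalg.norm(u)

def qr_solve(A, b):
    """Solve Ax = b (A square, nonsingular) via Householder QR."""
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    m, n = A.shape
    us = []
    for i in range(min(m - 1, n)):
        u = house(A[i:, i])
        us.append(u)
        A[i:, i:] -= 2.0 * np.outer(u, u @ A[i:, i:])   # Algorithm 2.3
    for i, u in enumerate(us):                          # y = Q^T b (Algorithm 2.4)
        b[i:] -= 2.0 * u * (u @ b[i:])
    x = np.zeros(n)                                     # solve R x = y
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:n] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]])
b = np.array([5.0, -2.0, 9.0])
print(np.allclose(qr_solve(A, b), np.linalg.solve(A, b)))   # True
```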

2.5 Strassen’s algorithm for matrix multiplication


Let A, B ∈ ℝⁿˣⁿ. We wish to compute C = AB. Suppose n = 2ᵏ. We split A, B and C into n/2 × n/2 blocks:

$$A = \begin{pmatrix} a_{11} & a_{12}\\ a_{21} & a_{22}\end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12}\\ b_{21} & b_{22}\end{pmatrix}, \qquad C = \begin{pmatrix} c_{11} & c_{12}\\ c_{21} & c_{22}\end{pmatrix}.$$

Classical algorithm requires 8 multiplications and 4 additions for one step; the run-
ning time is T (n) = Θ(n3 ), since T (n) = 8T (n/2) + Θ(n2 ).
We are interested in reducing the number of multiplications. Volker Strassen
[39] discovered a method to reduce the number of multiplications to 7 per step. One
computes the following quantities

p1 = (a11 + a22 )(b11 + b22 )


p2 = (a21 + a22 )b11
p3 = a11 (b12 − b22 )
p4 = a22 (b21 − b11 )
p5 = (a11 + a12 )b22
p6 = (a21 − a11 )(b11 + b12 )
p7 = (a12 − a22 )(b21 + b22 )

c11 = p1 + p4 − p5 + p7
c12 = p3 + p5
c21 = p2 + p4
c22 = p1 + p3 − p2 + p6 .
Since we have 7 multiplications and 18 additions per step, the running time satisfies the recurrence

$$T(n) = 7\,T(n/2) + \Theta(n^2).$$

The solution is

$$T(n) = \Theta(n^{\log_2 7}) \sim 28\,n^{\log_2 7}.$$
The algorithm can be extended to matrices of size n = m · 2ᵏ (each recursive halving reduces the size from m · 2^{k+1} to m · 2ᵏ). If n is odd, the last row and column of the result can be computed using the standard method, and Strassen's algorithm is then applied to the remaining (n − 1) × (n − 1) matrices.
The pᵢ's can be computed in parallel, and so can the cᵢⱼ's.
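A recursive Python/NumPy sketch of Strassen's scheme for n a power of two is given below (with the classical product as the base case); it is meant only to illustrate the recursion, not to compete with optimized BLAS routines, and the cutoff value is an arbitrary choice of ours.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply two n x n matrices, n a power of two, by Strassen's recursion."""
    n = A.shape[0]
    if n <= cutoff:                      # fall back to the classical product
        return A @ B
    h = n // 2
    a11, a12, a21, a22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    b11, b12, b21, b22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    p1 = strassen(a11 + a22, b11 + b22, cutoff)
    p2 = strassen(a21 + a22, b11, cutoff)
    p3 = strassen(a11, b12 - b22, cutoff)
    p4 = strassen(a22, b21 - b11, cutoff)
    p5 = strassen(a11 + a12, b22, cutoff)
    p6 = strassen(a21 - a11, b11 + b12, cutoff)
    p7 = strassen(a12 - a22, b21 + b22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = p1 + p4 - p5 + p7
    C[:h, h:] = p3 + p5
    C[h:, :h] = p2 + p4
    C[h:, h:] = p1 + p3 - p2 + p6
    return C

A = np.random.rand(128, 128); B = np.random.rand(128, 128)
print(np.allclose(strassen(A, B), A @ B))    # True
```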


The theoretical speed-up of matrix multiplication translates into a speed-up of matrix inversion, and hence of the solution of linear algebraic systems. If M(n) is the time for multiplying two n × n matrices and I(n) is the time for inverting an n × n matrix, then M(n) = Θ(I(n)). This can be proven in two steps: we show that M(n) = O(I(n)) and then that I(n) = O(M(n)).

Theorem 2.5.1 (Multiplication is not harder than inversion). If we can invert a n×


n matrix in I(n) time, where I(n) = Ω(n2 ) satisfies the regularity condition I(3n) =
O(I(n)), then we can multiply two nth order matrices in O(I(n)) time.

Note that I(n) satisfies the regularity condition only if I(n) does not have large jumps in its values. For example, if I(n) = Θ(nᶜ logᵈ n) for some constants c > 0, d ≥ 0, then I(n) satisfies the regularity condition.

Theorem 2.5.2 (Inversion is not harder than multiplication). If we can multiply


two real n × n matrices in M (n) time, where M (n) = Ω(n2 ), M (n) satisfies the
regularity conditions M (n) = O(M (n + k)) for each k, 0 ≤ k ≤ n, and M (n/2) ≤
cM (n), for any constant c < 1/2, then we can invert a real n × n nonsingular matrix
in O(M (n)) time.

2.6 Iterative refinement


If the solution method for Ax = b is unstable, then Ax ≠ b, where x is the computed value. We shall compute a correction ∆x₁ such that

$$A(x + \Delta x_1) = b \;\Rightarrow\; A\,\Delta x_1 = b - Ax.$$

We solve this system and obtain a new x, x := x + ∆x₁. If again Ax ≠ b, we compute a new correction, and so on, until

$$\|\Delta x_i - \Delta x_{i-1}\| < \varepsilon \quad\text{or}\quad \|Ax - b\| < \varepsilon.$$

The computation of the residual vector r = b − Ax will be performed in double precision.
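A sketch of the refinement loop in Python follows; it is illustrative only. The residual is accumulated in extended precision (NumPy's `longdouble` stands in for "double the working precision"), and one LU factorization of A (via SciPy) is reused for every correction.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def refine(A, b, tol=1e-12, maxit=10):
    """Iterative refinement of the solution of Ax = b, reusing one LU factorization."""
    lu_piv = lu_factor(A)
    x = lu_solve(lu_piv, b)
    for _ in range(maxit):
        # residual r = b - A x, accumulated in higher precision
        r = np.array(b, dtype=np.longdouble) - np.array(A, dtype=np.longdouble) @ x
        dx = lu_solve(lu_piv, np.asarray(r, dtype=float))
        x = x + dx
        if np.linalg.norm(dx) < tol * np.linalg.norm(x):
            break
    return x

A = np.array([[1.0, 1.0 / 2], [1.0 / 2, 1.0 / 3]])   # a mildly ill-conditioned matrix
b = A @ np.array([1.0, 1.0])
print(refine(A, b))                                   # close to [1, 1]
```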

2.7 Iterative solution of Linear Algebraic Systems


We wish to compute the solution

Ax = b, (2.7.1)

when A is invertible. Suppose we have found a matrix T and a vector c such that
I − T is invertible and the unique fixpoint of the equation

x = Tx + c (2.7.2)

equates the solution of the system Ax = b. Let x∗ be the solution of (2.7.1) or,
equivalently, of (2.7.2).
Iteration: x(0) given; one defines the sequence (x(k) ) by

x(k+1) = T x(k) + c, k ∈ N. (2.7.3)



Lemma 2.7.1. If ρ(X) < 1, then (I − X)⁻¹ exists and

$$(I - X)^{-1} = I + X + X^2 + \cdots + X^k + \cdots$$

Proof. Let Sₖ = I + X + ⋯ + Xᵏ. Then (I − X)Sₖ = I − X^{k+1}, and

$$\lim_{k\to\infty}(I - X)S_k = I \;\Rightarrow\; \lim_{k\to\infty}S_k = (I - X)^{-1},$$

since X^{k+1} → 0 ⇔ ρ(X) < 1. □

Theorem 2.7.2. The following statements are equivalent

(1) method (2.7.3) is convergent;

(2) ρ(T ) < 1;

(3) kT k < 1 for at least a matrix norm.


Proof.

$$x^{(k)} = Tx^{(k-1)} + c = T(Tx^{(k-2)} + c) + c = \cdots = T^kx^{(0)} + (I + T + \cdots + T^{k-1})c.$$

(2.7.3) convergent ⇔ I − T invertible ⇔ ρ(T) < 1 ⇔ ∃ ‖·‖ such that ‖T‖ < 1 (from Theorem 2.1.7). □

Banach’s fixpoint theorem implies:

Theorem 2.7.3. If there exists ‖·‖ such that ‖T‖ < 1, the sequence (x^{(k)}) given by (2.7.3) is convergent for any x^{(0)} ∈ ℝⁿ and the following estimates hold:

$$\|x^* - x^{(k)}\| \le \|T\|^k\,\|x^{(0)} - x^*\|, \qquad (2.7.4)$$

$$\|x^* - x^{(k)}\| \le \frac{\|T\|^k}{1-\|T\|}\,\|x^{(1)} - x^{(0)}\| \le \frac{\|T\|}{1-\|T\|}\,\|x^{(1)} - x^{(0)}\|. \qquad (2.7.5)$$
An iterative method for the solution of an linear algebraic system Ax = b starts
from an initial approximation x(0) ∈ Rn (Cn ) and generates a sequence of vectors
{x(k) }, that converges to the solution of the system x∗ . Such techniques transform
the initial system into an equivalent system, having the form x = T x + c, T ∈ Kn×n ,
c ∈ Kn . One generates a sequence x(k) = T x(k−1) + c.

The stopping criterion is

$$\|x^{(k)} - x^{(k-1)}\| \le \frac{1-\|T\|}{\|T\|}\,\varepsilon. \qquad (2.7.6)$$

It is based on the following result:

Proposition 2.7.4. If x* is the solution of (2.7.2) and ‖T‖ < 1, then

$$\|x^* - x^{(k)}\| \le \frac{\|T\|}{1-\|T\|}\,\|x^{(k)} - x^{(k-1)}\|. \qquad (2.7.7)$$

Proof. Let p ∈ ℕ*. We have

$$\|x^{(k+p)} - x^{(k)}\| \le \|x^{(k+1)} - x^{(k)}\| + \cdots + \|x^{(k+p)} - x^{(k+p-1)}\|. \qquad (2.7.8)$$

On the other hand, (2.7.3) implies

$$\|x^{(m+1)} - x^{(m)}\| \le \|T\|\,\|x^{(m)} - x^{(m-1)}\|$$

or, for k < m,

$$\|x^{(m+1)} - x^{(m)}\| \le \|T\|^{m-k+1}\,\|x^{(k)} - x^{(k-1)}\|.$$

By applying these inequalities successively for m = k, ..., k + p − 1, the relation (2.7.8) becomes

$$\|x^{(k+p)} - x^{(k)}\| \le (\|T\| + \cdots + \|T\|^p)\,\|x^{(k)} - x^{(k-1)}\| \le (\|T\| + \cdots + \|T\|^p + \cdots)\,\|x^{(k)} - x^{(k-1)}\|.$$

Since ‖T‖ < 1, we have

$$\|x^{(k+p)} - x^{(k)}\| \le \frac{\|T\|}{1-\|T\|}\,\|x^{(k)} - x^{(k-1)}\|,$$

which, when passing to the limit with respect to p, yields (2.7.7). □

If ‖T‖ ≤ 1/2, inequality (2.7.7) becomes

$$\|x^* - x^{(k)}\| \le \|x^{(k)} - x^{(k-1)}\|,$$

and the stopping criterion becomes

$$\|x^{(k)} - x^{(k-1)}\| \le \varepsilon.$$

Iterative methods are seldom used for the solution of small systems, since the time required to attain the desired accuracy exceeds that of Gaussian elimination. For large sparse systems (i.e., systems whose matrix has many zero entries), iterative methods are efficient both in time and in space.
Consider the system Ax = b and suppose we can split A as A = M − N. If M can be easily inverted (diagonal, triangular, and so on), it is more convenient to carry out the computation in the following manner:

$$Ax = b \;\Leftrightarrow\; Mx = Nx + b \;\Leftrightarrow\; x = M^{-1}Nx + M^{-1}b.$$

The last equation is of the form x = Tx + c, where T = M⁻¹N = I − M⁻¹A. One obtains the sequence

$$x^{(k+1)} = M^{-1}Nx^{(k)} + M^{-1}b, \quad k \in \mathbb{N},$$

where x^{(0)} is an arbitrary vector.


The first splitting we consider is A = D − L − U, where

$$(D)_{ij} = a_{ij}\delta_{ij}, \qquad (-L)_{ij} = \begin{cases} a_{ij}, & i > j\\ 0, & \text{otherwise,}\end{cases} \qquad (-U)_{ij} = \begin{cases} a_{ij}, & i < j\\ 0, & \text{otherwise.}\end{cases}$$

Taking M = D, N = L + U, one successively obtains

$$Ax = b \;\Leftrightarrow\; Dx = (L+U)x + b \;\Leftrightarrow\; x = D^{-1}(L+U)x + D^{-1}b.$$

Thus T = T_J = D⁻¹(L + U), c = c_J = D⁻¹b – the Jacobi method (D is invertible, why?), due to Carl Jacobi⁵ [23].
Another decomposition is A = D − L − U, M = D − L, N = U, which yields T_GS = (D − L)⁻¹U and c_GS = (D − L)⁻¹b – the Gauss-Seidel method (D − L is invertible, why?).

⁵ Carl Gustav Jacob Jacobi (1804-1851) was a contemporary of Gauss, and with him one of the most important 19th-century mathematicians in Germany. His name is connected with elliptic functions, partial differential equations of dynamics, calculus of variations, celestial mechanics; functional determinants also bear his name. In his work on celestial mechanics he invented what is now called the Jacobi method for solving linear algebraic systems.
Let us examine the Jacobi iteration,

$$x_i^{(k)} = \frac{1}{a_{ii}}\Bigl(b_i - \sum_{\substack{j=1\\ j\neq i}}^{n} a_{ij}x_j^{(k-1)}\Bigr).$$

The computation of $x_i^{(k)}$ uses all components of x^{(k−1)} (simultaneous substitution). Since for i > 1 the components $x_1^{(k)}, \ldots, x_{i-1}^{(k)}$ have already been computed, and we suppose they are better approximations of the solution components than $x_1^{(k-1)}, \ldots, x_{i-1}^{(k-1)}$, it seems reasonable to compute $x_i^{(k)}$ using the most recent values, i.e.

$$x_i^{(k)} = \frac{1}{a_{ii}}\Bigl(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k-1)}\Bigr).$$

One can state necessary and sufficient conditions for the convergence of the Jacobi and Gauss-Seidel methods,

$$\rho(T_J) < 1, \qquad \rho(T_{GS}) < 1,$$

and sufficient conditions: there exists a norm ‖·‖ such that

$$\|T_J\| < 1, \qquad \|T_{GS}\| < 1.$$
We can improve the Gauss-Seidel method by introducing a parameter ω and splitting

$$M = \frac{D}{\omega} - L.$$

We have

$$A = \Bigl(\frac{D}{\omega} - L\Bigr) - \Bigl(\frac{1-\omega}{\omega}D + U\Bigr),$$

and the iteration is

$$\Bigl(\frac{D}{\omega} - L\Bigr)x^{(k+1)} = \Bigl(\frac{1-\omega}{\omega}D + U\Bigr)x^{(k)} + b.$$

Finally, we obtain the matrix

$$T = T_\omega = \Bigl(\frac{D}{\omega} - L\Bigr)^{-1}\Bigl(\frac{1-\omega}{\omega}D + U\Bigr) = (D - \omega L)^{-1}\bigl((1-\omega)D + \omega U\bigr).$$

The method is called relaxation method. We have the following variants:



– ω > 1 overrelaxation (SOR - Successive Over Relaxation)

– ω < 1 subrelaxation

– ω = 1 Gauss-Seidel

In the sequel, we state two theorems on the convergence of relaxation method.

Theorem 2.7.5 (Kahan). If aii 6= 0, i = 1, n, ρ(Tω ) ≥ |ω − 1|. This implies the


following necessary conditions ρ(Tω ) < 1 ⇒ 0 < ω < 2.

Theorem 2.7.6 (Ostrowski-Reich). If A is a positive definite matrix, and 0 < ω <


2, then SOR converges for any choice of the initial approximation x(0) .

Remark 2.7.7. For the Jacobi (and Gauss-Seidel) method a sufficient condition for convergence is

$$|a_{ii}| > \sum_{\substack{j=1\\ j\neq i}}^{n}|a_{ij}| \quad (A \text{ row diagonally dominant})$$

or

$$|a_{ii}| > \sum_{\substack{j=1\\ j\neq i}}^{n}|a_{ji}| \quad (A \text{ column diagonally dominant}). \qquad \diamond$$

The optimal value for ω is

$$\omega_O = \frac{2}{1+\sqrt{1-(\rho(T_J))^2}}.$$
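To make the splittings concrete, here is a small illustrative Python/NumPy sketch (ours, not from the original text) that runs the Jacobi and SOR iterations in their componentwise form on a diagonally dominant example; ω = 1 in the SOR routine gives Gauss-Seidel, and the stopping test and tolerances are purely illustrative choices.

```python
import numpy as np

def jacobi(A, b, x0=None, tol=1e-10, maxit=500):
    """Jacobi iteration x^(k+1) = D^{-1}((L+U)x^(k) + b)."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    D = np.diag(A)
    for _ in range(maxit):
        x_new = (b - (A @ x - D * x)) / D      # subtract off-diagonal part only
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return x_new
        x = x_new
    return x

def sor(A, b, omega=1.0, x0=None, tol=1e-10, maxit=500):
    """SOR iteration; omega = 1 is Gauss-Seidel, omega > 1 overrelaxation."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    for _ in range(maxit):
        x_old = x.copy()
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x_old[i + 1:]
            x[i] = (1 - omega) * x_old[i] + omega * (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old, np.inf) < tol:
            break
    return x

A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = np.array([2.0, 4.0, 10.0])
print(jacobi(A, b), sor(A, b, omega=1.1))
```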
Chapter 3

Function Approximation

The function to be approximated can be defined:


• On a continuum (typically a finite interval) – special functions (elementary or
transcendental) that one wishes to evaluate as part of a subroutine.
• On a finite set of points – instance frequently encountered in the physical sci-
ences when measurements are taken of a certain physical quantity as a function
of other physical quantity (such as time).
In either case one wants to approximate the given function “as well as possible” in
terms of other simpler functions. Since such an evaluation must be reduced to a finite
number of arithmetical operations, the simpler functions should be polynomial or
rational functions.
The general scheme of approximation can be described as:
• A given function f ∈ X to be approximated.
• A class Φ of “approximations”.
• A “norm” k · k measuring the overall magnitude of functions.
We are looking for an approximation $\hat\varphi \in \Phi$ of f such that

$$\|f - \hat\varphi\| \le \|f - \varphi\| \quad \text{for all } \varphi \in \Phi. \qquad (3.0.1)$$

This problem is called the best approximation problem for f from the class Φ, and the function $\hat\varphi$ is called the best approximation element of f, relative to the norm ‖·‖.
Given a basis $\{\pi_j\}_{j=1}^{n}$ of Φ, we can express any ϕ ∈ Φ, and Φ itself, as

$$\Phi = \Phi_n = \Bigl\{\varphi : \varphi(t) = \sum_{j=1}^{n} c_j\pi_j(t),\ c_j \in \mathbb{R}\Bigr\}. \qquad (3.0.2)$$


Φ is a finite-dimensional linear space or a proper subset of it.

Example 3.0.8. Φ = Pm - the set of polynomials of degree at most m. A basis of


this space is ej (t) = tj , j = 0, 1, . . . , m. So, dim Pm = m + 1. Polynomials are the
most frequently used “general-purpose” approximations for dealing with functions
on bounded domains (finite intervals or finite set of points). One reason – Weier-
strass’ theorem – any function from C[a, b] can be approximated on a finite interval
as closely as one wishes by a polynomial of sufficiently high degree. ♦

Example 3.0.9. Φ = Skm (∆) the space of polynomial spline functions of degree m
and smoothness class k on the subdivision

∆ : a = t1 < t2 < t3 < · · · < tN −1 < tN = b

of the interval [a, b]. These are piecewise polynomials of degree ≤ m, pieced together
at the “joints” t1 , . . . , tN −1 , in such a way that all derivatives up to and including
the kth are continuous on the whole interval [a, b] including the joints. We assume
0 ≤ k < m. For k = m this space equates Pm . We set k = −1 if we allow
discontinuities at the joints. ♦

Example 3.0.10. Φ = Tm [0, 2π] the space of trigonometric polynomials of degree


≤ m on [0, 2π]. These are linear combinations of the functions

$$\pi_k(t) = \cos(k-1)t, \quad k = 1, m+1, \qquad \pi_{m+1+k}(t) = \sin kt, \quad k = 1, m.$$

The dimension of this space is n = 2m + 1. Such approximations are natural choices


when the function f to be approximated is periodic with period 2π. (If f has period
p, one makes a change of variables t 7→ tp/2π.) ♦

The class of rational functions

Φ = Rr,s = {ϕ : ϕ = p/q, p ∈ Pr , q ∈ Ps },

is not a linear space.


Possible choice of norms – both for continuous and discrete functions – and the
approximation they generate are summarized in Table 3.1. The continuous case in-
volve an interval [a, b] and a weight function w(t) (possibly w(t) ≡ 1) defined on
[a, b] and positive except for isolate zeros. The discrete case involve a set of N
distinct points t1 , t2 , . . . , tN along with positive weight factors w1 , w2 , . . . , wN (pos-
sibly wi = 1, i = 1, N ). The interval [a, b] may be unbounded if the weight function
w is such that the integral extended over [a, b], which defines the norm makes sense.

continuous norm                                       type    discrete norm
‖u‖∞ = max_{a≤t≤b} |u(t)|                             L∞      ‖u‖∞ = max_{1≤i≤N} |u(tᵢ)|
‖u‖_{1,w} = ∫ₐᵇ |u(t)| w(t) dt                        L¹_w    ‖u‖_{1,w} = Σ_{i=1}^{N} wᵢ |u(tᵢ)|
‖u‖_{2,w} = (∫ₐᵇ |u(t)|² w(t) dt)^{1/2}               L²_w    ‖u‖_{2,w} = (Σ_{i=1}^{N} wᵢ |u(tᵢ)|²)^{1/2}

Table 3.1: Types of approximations and associated norms

Hence, we may take any one of the norms in Table 3.1 and combine it with any of
the preceding linear spaces Φ to arrive at a meaningful best approximation problem
(3.0.1). In the continuous case, the given function f and the functions ϕ ∈ Φ must
be defined on [a, b] and such that the norm kf − ϕk makes sense. Likewise, f and ϕ
must be defined at the points ti in the discrete case.
Note that if the best approximant $\hat\varphi$ in the discrete case is such that $\|f - \hat\varphi\| = 0$, then $\hat\varphi(t_i) = f(t_i)$ for i = 1, 2, ..., N. We then say that $\hat\varphi$ interpolates f at the points tᵢ, and we refer to this kind of approximation as an interpolation problem.
The simplest approximation problems are the least squares problem and the in-
terpolation problem and the easiest space is the space of polynomials.
Before we start with the least square problem we introduce a notational device
that allows us to treat the continuous and the discrete case simultaneously. We define
in the continuous case

$$\lambda(t) = \begin{cases} 0, & \text{if } t < a \ (\text{when } -\infty < a),\\ \displaystyle\int_a^t w(\tau)\,d\tau, & \text{if } a \le t \le b,\\ \displaystyle\int_a^b w(\tau)\,d\tau, & \text{if } t > b \ (\text{when } b < \infty);\end{cases} \qquad (3.0.3)$$

then we can write, for any continuous function u,

$$\int_{\mathbb{R}} u(t)\,d\lambda(t) = \int_a^b u(t)w(t)\,dt, \qquad (3.0.4)$$

since dλ(t) ≡ 0 outside [a, b] and dλ(t) = w(t) dt inside. We call dλ a continuous (positive) measure. The discrete measure (also called “Dirac measure”) associated to the point set {t₁, t₂, ..., t_N} is a measure dλ that is nonzero only at the points tᵢ and has the value wᵢ there. Thus, in this case,

$$\int_{\mathbb{R}} u(t)\,d\lambda(t) = \sum_{i=1}^{N} w_i\,u(t_i). \qquad (3.0.5)$$

A more precise definition can be given in terms of Stieltjes integrals, if we define λ(t) to be a step function having the jump wᵢ at tᵢ. In particular, we can define the L² norm as

$$\|u\|_{2,d\lambda} = \Bigl(\int_{\mathbb{R}} |u(t)|^2\,d\lambda(t)\Bigr)^{1/2} \qquad (3.0.6)$$

and obtain the continuous or the discrete norm depending on whether λ is taken as in (3.0.3) or as a step function as in (3.0.5).
We call the support of dλ – denoted by supp dλ – the interval [a, b] in the continuous case (assuming w is positive on [a, b] except for isolated zeros) and the set {t₁, t₂, ..., t_N} in the discrete case. We say that the set of functions πⱼ in (3.0.2) is linearly independent on supp dλ if

$$\sum_{j=1}^{n} c_j\pi_j(t) = 0 \ \ \forall\, t \in \operatorname{supp} d\lambda \;\Rightarrow\; c_1 = c_2 = \cdots = c_n = 0. \qquad (3.0.7)$$

3.1 Least Squares approximation


We specialize the best approximation problem (3.0.1) by taking as norm the L² norm

$$\|u\|_{2,d\lambda} = \Bigl(\int_{\mathbb{R}} |u(t)|^2\,d\lambda(t)\Bigr)^{1/2}, \qquad (3.1.1)$$

where dλ is either a continuous measure (cf. (3.0.3)) or a discrete measure (cf. (3.0.5)), and using approximants ϕ from an n-dimensional linear space

$$\Phi = \Phi_n = \Bigl\{\varphi : \varphi(t) = \sum_{j=1}^{n} c_j\pi_j(t),\ c_j \in \mathbb{R}\Bigr\}, \qquad (3.1.2)$$

with the πⱼ linearly independent on supp dλ; the integral in (3.1.1) is meaningful whenever u = πⱼ, j = 1, ..., n, or u = f.
The specialized problem is called the least squares approximation problem or mean square approximation problem. Its solution (beginning of the 19th century) is due to Gauss and Legendre¹.

¹ Adrien Marie Legendre (1752-1833) was a French mathematician active in Paris, best known for his treatise on elliptic integrals, but also famous for his work in number theory and geometry. He is considered the originator (in 1805) of the method of least squares, although Gauss had already used it in 1794, but published it only in 1809.

3.1.1 Inner products


Given a discrete or continuous measure dλ, and given any two functions u and v having a finite norm (3.1.1), we can define the inner (scalar) product

$$(u, v) = \int_{\mathbb{R}} u(t)v(t)\,d\lambda(t). \qquad (3.1.3)$$

The Cauchy-Buniakowski-Schwarz inequality

$$|(u, v)| \le \|u\|_{2,d\lambda}\,\|v\|_{2,d\lambda}$$

tells us that the integral in (3.1.3) is well defined.


A real inner product has the following properties:
(i) symmetry (u, v) = (v, u);

(ii) homogeneity (αu, v) = α(u, v), α ∈ R;

(iii) additivity (u + v, w) = (u, w) + (v, w);

(iv) positive definiteness (u, u) ≥ 0 and (u, u) = 0 ⇔ u ≡ 0 on suppdλ.


(i)+(ii) ⇒ linearity

(α1 u1 + α2 u2 , v) = α1 (u1 , v) + α2 (u2 , v) (3.1.4)

(3.1.4) extends to finite linear combinations. Also

kuk22,dλ = (u, u). (3.1.5)

We say that u and v are orthogonal if

(u, v) = 0. (3.1.6)

More generally, we may consider an orthogonal system {uk }nk=1 :

(ui , uj ) = 0 if i 6= j, uk 6= 0 on supp dλ; i, j = 1, n, k = 1, n. (3.1.7)

For such a system we have the Generalized Theorem of Pythagoras,

$$\Bigl\|\sum_{k=1}^{n} \alpha_k u_k\Bigr\|^2 = \sum_{k=1}^{n} |\alpha_k|^2\,\|u_k\|^2. \qquad (3.1.8)$$

(3.1.8) implies that every orthogonal system is linearly independent on suppdλ.


Indeed, if the left-hand side of (3.1.8) vanishes, then so does the right-hand side, and
this, since kuk k2 > 0, by assumption, implies α1 = α2 = · · · = αn = 0.

3.1.2 The normal equations


By (3.1.5) we can write the square of the L² error in the form

$$E^2[\varphi] := \|\varphi - f\|^2 = (\varphi - f, \varphi - f) = (\varphi,\varphi) - 2(\varphi,f) + (f,f).$$

Inserting ϕ from (3.1.2) gives

$$E^2[\varphi] = \int_{\mathbb{R}}\Bigl(\sum_{j=1}^{n} c_j\pi_j(t)\Bigr)^2 d\lambda(t) - 2\int_{\mathbb{R}}\Bigl(\sum_{j=1}^{n} c_j\pi_j(t)\Bigr)f(t)\,d\lambda(t) + \int_{\mathbb{R}} f^2(t)\,d\lambda(t). \qquad (3.1.9)$$

The squared L² error is a quadratic function of the coefficients c₁, ..., cₙ of ϕ. The problem of best L² approximation thus amounts to minimizing this quadratic function; one solves it by setting the partial derivatives to zero. One obtains

$$\frac{\partial}{\partial c_i}E^2[\varphi] = 2\int_{\mathbb{R}}\Bigl(\sum_{j=1}^{n} c_j\pi_j(t)\Bigr)\pi_i(t)\,d\lambda(t) - 2\int_{\mathbb{R}}\pi_i(t)f(t)\,d\lambda(t) = 0,$$

that is,

$$\sum_{j=1}^{n}(\pi_i,\pi_j)c_j = (\pi_i,f), \quad i = 1, 2, \ldots, n. \qquad (3.1.10)$$

These are called the normal equations for the least squares problem. They form a system of the form

$$Ac = b, \qquad (3.1.11)$$

where the matrix A and the vector b have elements

$$A = [a_{ij}],\ a_{ij} = (\pi_i,\pi_j), \qquad b = [b_i],\ b_i = (\pi_i,f). \qquad (3.1.12)$$

By symmetry of the inner product, A is a symmetric matrix. Moreover, A is positive definite; that is,

$$x^TAx = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}x_ix_j > 0 \quad \text{if } x \neq [0, 0, \ldots, 0]^T. \qquad (3.1.13)$$

The quadratic function in (3.1.13) is called a quadratic form (since it is homogeneous


of degree 2). The positive definiteness of A says that the quadratic form whose
coefficients are the elements of A is always nonnegative, and it is zero only if all
variables xi vanish.

To prove (3.1.13), all we have to do is insert the definition of aᵢⱼ and use the properties (i)-(iv) of the inner product:

$$x^TAx = \sum_{i=1}^{n}\sum_{j=1}^{n} x_ix_j(\pi_i,\pi_j) = \sum_{i=1}^{n}\sum_{j=1}^{n}(x_i\pi_i, x_j\pi_j) = \Bigl\|\sum_{i=1}^{n} x_i\pi_i\Bigr\|^2.$$

This is clearly nonnegative. It is zero only if $\sum_{i=1}^{n} x_i\pi_i \equiv 0$ on supp dλ, which, by the assumption of linear independence of the πᵢ, implies x₁ = x₂ = ⋯ = xₙ = 0.
It is a well-known fact of linear algebra that a symmetric positive definite matrix A is nonsingular. Indeed, its determinant, as well as its leading principal minors, are strictly positive. It follows that the system (3.1.10) of normal equations has a unique solution. Does this solution correspond to a minimum of E[ϕ] in (3.1.9)? The Hessian matrix H = [∂²E²/∂cᵢ∂cⱼ] has to be positive definite. But H = 2A, since E² is a quadratic function. Therefore H, with A, is indeed positive definite, and the solution of the normal equations gives us the desired minimum. The least squares approximation problem thus has a unique solution, given by

$$\hat\varphi(t) = \sum_{j=1}^{n}\hat c_j\pi_j(t), \qquad (3.1.14)$$

where $\hat c = [\hat c_1, \hat c_2, \ldots, \hat c_n]^T$ is the solution of the normal equations (3.1.10).
This completely settles the least squares approximation problem in theory. What about practice? For a general set of linearly independent basis functions, we can identify the following difficulties.
(1) The system (3.1.10) may be ill-conditioned. A simple example is provided by supp dλ = [0, 1], dλ(t) = dt on [0, 1], and πⱼ(t) = t^{j−1}, j = 1, 2, ..., n. Then

$$(\pi_i,\pi_j) = \int_0^1 t^{i+j-2}\,dt = \frac{1}{i+j-1}, \quad i, j = 1, 2, \ldots, n,$$

that is, A is precisely the Hilbert matrix. The resulting severe ill-conditioning of the normal equations is entirely due to an unfortunate choice of the basis functions. These become almost linearly dependent as the exponent grows. Another source of degradation lies in the elements $b_j = \int_0^1\pi_j(t)f(t)\,dt$ of the right-hand side vector. When j is large, πⱼ(t) = t^{j−1} behaves on [0, 1] almost like a discontinuous function. A polynomial πⱼ that oscillates rapidly on [0, 1] would seem to be preferable from this point of view, since it would “engage” the function f more vigorously over the whole interval [0, 1], in contrast to a canonical monomial, which shoots up from almost zero to 1 at the right endpoint.
(2) The second disadvantage is that all the coefficients $\hat c_j$ in (3.1.14) depend on n, i.e. $\hat c_j = \hat c_j^{(n)}$, j = 1, 2, ..., n. Increasing n gives an enlarged system of normal equations with a completely new solution vector. We refer to this as the nonpermanence of the coefficients $\hat c_j$.
Both defects (1) and (2) can be eliminated (or at least attenuated) by choosing for the basis functions πⱼ an orthogonal system,

$$(\pi_i,\pi_j) = 0 \ \text{ if } i \neq j, \qquad (\pi_j,\pi_j) = \|\pi_j\|^2 > 0. \qquad (3.1.15)$$

Then the system of normal equations becomes diagonal and is solved immediately by

$$\hat c_j = \frac{(\pi_j, f)}{(\pi_j,\pi_j)}, \quad j = 1, 2, \ldots, n. \qquad (3.1.16)$$

Clearly, each of these coefficients $\hat c_j$ is independent of n and, once computed, remains the same for any larger n. We now have permanence of the coefficients. We need not solve a system of normal equations; instead we can use the formula (3.1.16) directly.
Any system {π̂ⱼ} that is linearly independent on supp dλ can be orthogonalized with respect to the measure dλ by the Gram-Schmidt procedure. One takes

$$\pi_1 = \hat\pi_1$$

and then, for j = 2, 3, ..., recursively computes

$$\pi_j = \hat\pi_j - \sum_{k=1}^{j-1} c_k\pi_k, \qquad c_k = \frac{(\hat\pi_j,\pi_k)}{(\pi_k,\pi_k)}, \quad k = 1, j-1.$$

Then each πⱼ so determined is orthogonal to all preceding ones.
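A discrete least squares fit with an orthogonalized basis can be sketched as follows in Python/NumPy (an illustrative sketch of ours, with the basis functions represented by their point values): the monomials are orthogonalized by Gram-Schmidt with respect to the discrete inner product (u, v) = Σᵢ wᵢ u(tᵢ)v(tᵢ), and the coefficients are then obtained directly from (3.1.16).

```python
import numpy as np

# discrete measure: points t_i with weights w_i
t = np.linspace(0.0, 1.0, 21)
w = np.ones_like(t)
f = np.exp(t)                                  # function to be approximated

def inner(u, v):
    return np.sum(w * u * v)

n = 4                                          # dimension of Phi_n
P = [t**j for j in range(n)]                   # basis 1, t, ..., t^(n-1) (point values)
pi = []
for phat in P:                                 # Gram-Schmidt w.r.t. (., .)
    p = phat.copy()
    for pk in pi:
        p -= inner(phat, pk) / inner(pk, pk) * pk
    pi.append(p)

chat = [inner(pj, f) / inner(pj, pj) for pj in pi]   # Fourier coefficients (3.1.16)
phi = sum(c * pj for c, pj in zip(chat, pi))         # least squares approximant
print(np.sqrt(inner(f - phi, f - phi)))              # weighted L2 error
```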

3.1.3 Least square error; convergence


We have seen that if Φ = Φₙ consists of n functions πⱼ, j = 1, 2, ..., n, that are linearly independent on supp dλ, then the least squares problem for dλ,

$$\min_{\varphi\in\Phi_n}\|f - \varphi\|_{2,d\lambda} = \|f - \hat\varphi\|_{2,d\lambda}, \qquad (3.1.17)$$

has a unique solution $\hat\varphi = \hat\varphi_n$, given by (3.1.14). There are many ways to select a basis {πⱼ} in Φₙ and, therefore, many ways the solution $\hat\varphi_n$ can be represented. Nevertheless, it is always one and the same function. The least squares error – the quantity on the right of (3.1.17) – is independent of the choice of basis functions (although the calculation of the least squares solution, as mentioned previously, is not). In studying this error we may assume, without restricting generality, that the basis πⱼ is an

orthogonal system. (Every linearly independent system can be orthogonalized by the Gram-Schmidt orthogonalization procedure.) Then we have (cf. (3.1.16))

$$\hat\varphi_n(t) = \sum_{j=1}^{n}\hat c_j\pi_j(t), \qquad \hat c_j = \frac{(\pi_j,f)}{(\pi_j,\pi_j)}. \qquad (3.1.18)$$

We first note that the error $f - \hat\varphi_n$ is orthogonal to the space Φₙ; that is,

$$(f - \hat\varphi_n, \varphi) = 0, \quad \forall\,\varphi\in\Phi_n, \qquad (3.1.19)$$

where the inner product is the one in (3.1.3). Since ϕ is a linear combination of the πₖ, it suffices to show (3.1.19) for each ϕ = πₖ, k = 1, 2, ..., n. Inserting $\hat\varphi_n$ from (3.1.18) in the left of (3.1.19), we find indeed

$$(f - \hat\varphi_n, \pi_k) = \Bigl(f - \sum_{j=1}^{n}\hat c_j\pi_j, \pi_k\Bigr) = (f,\pi_k) - \hat c_k(\pi_k,\pi_k) = 0,$$

the last equation following from the formula for $\hat c_k$ in (3.1.18). The result (3.1.19) has a simple geometric interpretation. If we picture functions as vectors, and the space Φₙ as a plane, then for any function f that “sticks out” of the plane Φₙ, the least squares approximant $\hat\varphi_n$ is the orthogonal projection of f onto Φₙ; see Figure 3.1.

the last equation following from the formula for ĉk in (3.1.18). The result (3.1.19) has
a simple geometric interpretation. If we picture functions as vectors, and the space
Φn as a plane, then for any function f that “sticks out” of the plane Φn , the least
square approximant ϕ̂n is the orthogonal projection of f onto Φn ; see Figure 3.1.

Figure 3.1: Least square approximation as orthogonal projection

In particular, choosing $\varphi = \hat\varphi_n$ in (3.1.19), we get

$$(f - \hat\varphi_n, \hat\varphi_n) = 0$$

and, therefore, since $f = (f - \hat\varphi_n) + \hat\varphi_n$, by the theorem of Pythagoras and its generalization (3.1.8),

$$\|f\|^2 = \|f - \hat\varphi_n\|^2 + \|\hat\varphi_n\|^2 = \|f - \hat\varphi_n\|^2 + \Bigl\|\sum_{j=1}^{n}\hat c_j\pi_j\Bigr\|^2 = \|f - \hat\varphi_n\|^2 + \sum_{j=1}^{n}|\hat c_j|^2\,\|\pi_j\|^2.$$

Solving for the first term on the right, we get

$$\|f - \hat\varphi_n\| = \Bigl\{\|f\|^2 - \sum_{j=1}^{n}|\hat c_j|^2\,\|\pi_j\|^2\Bigr\}^{1/2}, \qquad \hat c_j = \frac{(\pi_j,f)}{(\pi_j,\pi_j)}. \qquad (3.1.20)$$

Note that the expression in braces must necessarily be nonnegative.


The formula (3.1.20) is interesting theoretically, but of limited practical use. Note, indeed, that as the error approaches the level of the machine precision eps, computing the error from the right-hand side of (3.1.20) cannot produce anything smaller than eps, because of the inevitable rounding errors committed during the subtraction in the radicand. (They may even produce a negative result for the radicand.) Using instead the definition,

$$\|f - \hat\varphi_n\| = \Bigl(\int_{\mathbb{R}}[f(t) - \hat\varphi_n(t)]^2\,d\lambda(t)\Bigr)^{1/2},$$

along, perhaps, with a suitable (positive) quadrature rule, is guaranteed to produce a nonnegative result that may potentially be as small as O(eps).
If now we are given a sequence of linear spaces Φn , n = 1, 2, 3, . . . , then clearly

kf − ϕ
b1 k ≥ kf − ϕ
b2 k ≥ kf − ϕ
b3 k ≥ . . . ,

which follows not only from (3.1.20), but more directly from the fact that

Φ1 ⊂ Φ2 ⊂ Φ3 ⊂ . . . .

If there are infinitely many such spaces, then the sequence of L² errors, being monotonically decreasing, must converge to a limit. Is this limit zero? If so, we say that the least squares approximation process converges (in the mean) as n → ∞. It is obvious from (3.1.20) that a necessary and sufficient condition for this is

$$\sum_{j=1}^{\infty}|\hat c_j|^2\,\|\pi_j\|^2 = \|f\|^2. \qquad (3.1.21)$$
j=1

An equivalent way of stating convergence is as follows: given any f with kf k <


∞, that is ∀ f ∈ L2dλ and given any ε > 0 no matter how small, there exists an integer
n = nε and a function ϕ∗ ∈ Φn such that kf − ϕ∗ k ≤ ε. A class of spaces Φn having
this property is said to be complete with respect to the norm k · k = k · k2,dλ . One
therefore calls the relation (3.1.21) the completeness relation or Parseval-Liapunov
relation.

3.2 Examples of orthogonal systems


The prototype of all orthogonal systems is the system of trigonometric functions
known from Fourier analysis. Other widely used systems involve orthogonal alge-
braic polynomials.
(1) The trigonometric system consists of the functions

1, cos t, cos 2t, cos 3t, . . . , sin t, sin 2t, sin 3t, . . .

It is orthogonal on [0, 2π] with respect to the equally weighted measure

$$d\lambda(t) = \begin{cases} dt & \text{on } [0, 2\pi],\\ 0 & \text{otherwise.}\end{cases}$$

We have

$$\int_0^{2\pi}\sin kt\,\sin\ell t\,dt = \begin{cases} 0, & k\neq\ell\\ \pi, & k=\ell\end{cases} \qquad k,\ell = 1, 2, 3, \ldots,$$

$$\int_0^{2\pi}\cos kt\,\cos\ell t\,dt = \begin{cases} 0, & k\neq\ell\\ 2\pi, & k=\ell=0\\ \pi, & k=\ell>0\end{cases} \qquad k,\ell = 0, 1, 2, \ldots,$$

$$\int_0^{2\pi}\sin kt\,\cos\ell t\,dt = 0, \qquad k = 1, 2, 3, \ldots,\ \ell = 0, 1, 2, \ldots$$
The form of the approximation is

$$f(t) = \frac{a_0}{2} + \sum_{k=1}^{\infty}(a_k\cos kt + b_k\sin kt). \qquad (3.2.1)$$

Using (3.1.16) we get

$$a_k = \frac{1}{\pi}\int_0^{2\pi} f(t)\cos kt\,dt, \quad k = 0, 1, 2, \ldots, \qquad b_k = \frac{1}{\pi}\int_0^{2\pi} f(t)\sin kt\,dt, \quad k = 1, 2, \ldots, \qquad (3.2.2)$$

which are known as the Fourier coefficients of f. They are precisely the coefficients (3.1.16) for the trigonometric system. By extension, the coefficients (3.1.16) for any orthogonal system (πⱼ) will be called the Fourier coefficients of f relative to this system. In particular, we recognize in the Fourier series truncated at k = m the best approximation of f from the class of trigonometric polynomials of degree ≤ m relative to the norm

$$\|u\|_2 = \Bigl(\int_0^{2\pi}|u(t)|^2\,dt\Bigr)^{1/2}.$$

(2) Orthogonal polynomials. Given a measure dλ, we know that any finite num-
ber of consecutive powers 1, t, t2 , . . . are linearly independent on [a, b], if supp dλ =
[a, b], whereas the finite set 1, t, . . . , tn−1 is linearly independent on supp dλ =
{t1 , t2 , . . . , tN }. Since a linearly independent set can be orthogonalized by Gram-
Schmidt procedure, any measure dλ of the type considered generates a unique set of
monic² polynomials πⱼ(·; dλ), j = 0, 1, 2, ..., satisfying

$$\deg\pi_j = j, \quad j = 0, 1, 2, \ldots, \qquad \int_{\mathbb{R}}\pi_k(t)\pi_\ell(t)\,d\lambda(t) = 0 \ \text{ if } k\neq\ell. \qquad (3.2.3)$$

These are called orthogonal polynomials relative to the measure dλ. Let the index
j start from zero. The set {πj } is infinite if suppdλ = [a, b], and consists of exactly
N polynomials π0 , π1 , . . . , πN −1 if supp dλ = {t1 , . . . , tN }. The latter are referred
to as discrete orthogonal polynomials.
Three consecutive orthogonal polynomials are linearly related. Specifically, there exist real constants αₖ = αₖ(dλ) and βₖ = βₖ(dλ) > 0 (depending on the measure dλ) such that

$$\pi_{k+1}(t) = (t - \alpha_k)\pi_k(t) - \beta_k\pi_{k-1}(t), \quad k = 0, 1, 2, \ldots, \qquad \pi_{-1}(t) = 0,\ \pi_0(t) = 1. \qquad (3.2.4)$$

(It is understood that (3.2.4) holds for all k ∈ ℕ if supp dλ = [a, b] and only for k = 0, ..., N − 2 if supp dλ = {t₁, t₂, ..., t_N}.)
To prove (3.2.4) and, at the same time, identify the coefficients αₖ, βₖ, we note that

$$\pi_{k+1}(t) - t\pi_k(t)$$

² A polynomial is called monic if its leading coefficient is equal to 1.

is a polynomial of degree ≤ k, and it can be expressed as a linear combination of


π0 , π1 , . . . , πk . We write this linear combination in the form

$$\pi_{k+1}(t) - t\pi_k(t) = -\alpha_k\pi_k(t) - \beta_k\pi_{k-1}(t) + \sum_{j=0}^{k-2}\gamma_{k,j}\pi_j(t) \qquad (3.2.5)$$

(with the understanding that empty sums are zero). Multiplying both sides of (3.2.5) by πₖ, in the sense of the inner product defined in (3.1.3), we get

$$(-t\pi_k, \pi_k) = -\alpha_k(\pi_k, \pi_k);$$

that is,

$$\alpha_k = \frac{(t\pi_k,\pi_k)}{(\pi_k,\pi_k)}, \quad k = 0, 1, 2, \ldots \qquad (3.2.6)$$

Similarly, forming the inner product of (3.2.5) with π_{k−1} gives

$$(-t\pi_k, \pi_{k-1}) = -\beta_k(\pi_{k-1}, \pi_{k-1}).$$

Since (tπₖ, π_{k−1}) = (πₖ, tπ_{k−1}) and tπ_{k−1} differs from πₖ by a polynomial of degree < k, we obtain by orthogonality (tπₖ, π_{k−1}) = (πₖ, πₖ); hence

$$\beta_k = \frac{(\pi_k,\pi_k)}{(\pi_{k-1},\pi_{k-1})}, \quad k = 1, 2, \ldots \qquad (3.2.7)$$

Multiplication of (3.2.5) by π_ℓ, ℓ < k − 1, yields

$$\gamma_{k,\ell} = 0, \quad \ell = 0, 1, \ldots, k-2. \qquad (3.2.8)$$

The recursion (3.2.4) provides us with a practical scheme of generating orthog-


onal polynomials. Since π0 = 1, we can compute α0 by (3.2.6) with k = 0. This
allows us to compute π1 , using (3.2.4), with k = 0. Knowing π0 , π1 we can go
back to (3.2.6) and (3.2.7) and compute α1 and β1 , respectively. This allow us to
compute π2 via (3.2.4) with k = 1. Proceeding in this fashion, using alternatively
(3.2.6), (3.2.7) and (3.2.4), we can generate as many orthogonal polynomials as are

desired. This procedure, called Stieltjes’s 3 procedure – is particularly well suited


for discrete orthogonal polynomials, since the inner product is then a finite sum. In
the continuous case, the computation of the inner product requires integration, which
complicates matters. Fortunately, for many important special measures dλ(t) = w(t) dt the recursion coefficients are explicitly known.
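For a discrete measure, the Stieltjes procedure is just a loop that alternates (3.2.6), (3.2.7), and (3.2.4). A Python/NumPy sketch working with the point values of the πₖ is given below; it is illustrative code of ours, not from the original text, and the test measure is an arbitrary choice.

```python
import numpy as np

def stieltjes(t, w, n):
    """Recursion coefficients alpha_k, beta_k, k = 0..n-1, of the monic
    orthogonal polynomials for the discrete measure with nodes t, weights w."""
    alpha, beta = np.zeros(n), np.zeros(n)
    pi_prev, pi_cur = np.zeros_like(t), np.ones_like(t)   # pi_{-1}, pi_0
    nk_prev = 0.0
    for k in range(n):
        nk = np.sum(w * pi_cur**2)                 # (pi_k, pi_k)
        alpha[k] = np.sum(w * t * pi_cur**2) / nk  # (t pi_k, pi_k)/(pi_k, pi_k)
        beta[k] = np.sum(w) if k == 0 else nk / nk_prev
        pi_next = (t - alpha[k]) * pi_cur - beta[k] * pi_prev
        pi_prev, pi_cur, nk_prev = pi_cur, pi_next, nk
    return alpha, beta

# equally weighted points in [-1, 1]: alpha_k is ~0 by symmetry
t = np.linspace(-1.0, 1.0, 41)
w = np.full_like(t, 2.0 / len(t))
print(stieltjes(t, w, 5))
```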
The special case of symmetry (i.e., dλ(t) = w(t) dt with w(−t) = w(t) and supp dλ symmetric with respect to the origin) deserves special attention. In this case αₖ = 0 for all k ∈ ℕ, due to (3.2.6), since

$$(t\pi_k, \pi_k) = \int_{\mathbb{R}} t\pi_k^2(t)\,d\lambda(t) = \int_a^b w(t)\,t\,\pi_k^2(t)\,dt = 0,$$

because the integrand is an odd function and the domain is symmetric with respect to the origin.

3.3 Examples of orthogonal polynomials

3.3.1 Legendre polynomials

They are defined by means of the so-called Rodrigues formula

$$\pi_k(t) = \frac{k!}{(2k)!}\frac{d^k}{dt^k}(t^2 - 1)^k. \qquad (3.3.1)$$

Let us check first the orthogonality on [−1, 1] relative to the measure dλ(t) = dt.

³ Thomas Jan Stieltjes (1856-1894), born in the Netherlands, studied at the Technical Institute of Delft, but never finished his degree because of a deep-seated aversion to examinations. He nevertheless got a job at the Observatory of Leiden as a “computer assistant for astronomical calculation”. His early publications caught the attention of Hermite, who was eventually able to secure a university position for Stieltjes in Toulouse. A life-long friendship evolved between these two great men, of which two volumes of their correspondence give vivid testimony (and still make fascinating reading). Stieltjes is best known for his work on continued fractions and the moment problem, which, among other things, led him to invent a new concept of integral which now bears his name. He died very young of tuberculosis, at the age of 38.

For any 0 ≤ ℓ < k, repeated integration by parts gives

$$\int_{-1}^{1} t^\ell\frac{d^k}{dt^k}(t^2-1)^k\,dt = \sum_{m=0}^{\ell}(-1)^m\,\ell(\ell-1)\cdots(\ell-m+1)\,t^{\ell-m}\,\frac{d^{k-m-1}}{dt^{k-m-1}}(t^2-1)^k\Big|_{-1}^{1} = 0, \qquad (3.3.2)$$

the last relation holding since 0 ≤ k − m − 1 < k. Thus,

$$(\pi_k, p) = 0, \quad \forall\, p \in \mathcal{P}_{k-1},$$

proving orthogonality. Writing (by symmetry)

$$\pi_k(t) = t^k + \mu_k t^{k-2} + \cdots, \quad k \ge 2,$$

and noting (again by symmetry) that the recurrence relation has the form

$$\pi_{k+1}(t) = t\pi_k(t) - \beta_k\pi_{k-1}(t),$$

we obtain

$$\beta_k = \frac{t\pi_k(t) - \pi_{k+1}(t)}{\pi_{k-1}(t)},$$

which is valid for all t. In particular, as t → ∞,

$$\beta_k = \lim_{t\to\infty}\frac{t\pi_k(t) - \pi_{k+1}(t)}{\pi_{k-1}(t)} = \lim_{t\to\infty}\frac{(\mu_k - \mu_{k+1})t^{k-1} + \cdots}{t^{k-1} + \cdots} = \mu_k - \mu_{k+1}.$$

(If k = 1, set µ₁ = 0.) From Rodrigues’s formula we find

$$\pi_k(t) = \frac{k!}{(2k)!}\frac{d^k}{dt^k}\bigl(t^{2k} - kt^{2k-2} + \cdots\bigr) = \frac{k!}{(2k)!}\bigl(2k(2k-1)\cdots(k+1)t^k - k(2k-2)(2k-3)\cdots(k-1)t^{k-2} + \cdots\bigr) = t^k - \frac{k(k-1)}{2(2k-1)}t^{k-2} + \cdots,$$

so that

$$\mu_k = -\frac{k(k-1)}{2(2k-1)}, \quad k \ge 2.$$

Therefore,

$$\beta_k = \mu_k - \mu_{k+1} = \frac{k^2}{(2k-1)(2k+1)};$$

that is, since µ₁ = 0,

$$\beta_k = \frac{1}{4 - k^{-2}}, \quad k \ge 1. \qquad (3.3.3)$$

3.3.2 First kind Chebyshev polynomials


The Chebyshev⁴ polynomials of the first kind (Chebyshev #1) can be defined by the formula
Tn (x) = cos(n arccos x), n ∈ N. (3.3.4)
The trigonometric identity
cos(k + 1)θ + cos(k − 1)θ = 2 cos θ cos kθ
and (3.3.4), by setting θ = arccos x give us
Tk+1 (x) = 2xTk (x) − Tk−1 (x) k = 1, 2, 3, . . .
(3.3.5)
T0 (x) = 1, T1 (x) = x.
For example,
T2 (x) = 2x2 − 1,
T3 (x) = 4x3 − 3x,
T4 (x) = 8x4 − 8x2 + 1,
and so on.
It is evident from (3.3.5) that the leading coefficient of Tₙ is 2^{n−1} (if n ≥ 1); the first kind monic Chebyshev polynomial is

$$\mathring T_n(x) = \frac{1}{2^{n-1}}T_n(x), \quad n \ge 1, \qquad \mathring T_0 = T_0. \qquad (3.3.6)$$
From (3.3.4) we obtain immediately the zeros of Tₙ:

$$x_k^{(n)} = \cos\theta_k^{(n)}, \qquad \theta_k^{(n)} = \frac{2k-1}{2n}\pi, \quad k = 1, n. \qquad (3.3.7)$$
They are the projections onto the real line of equally spaced points on the unit circle;
see Figure 3.2 for n=4.
On [−1, 1], Tₙ oscillates between +1 and −1, attaining these extreme values at

$$y_k^{(n)} = \cos\eta_k^{(n)}, \qquad \eta_k^{(n)} = \frac{k\pi}{n}, \quad k = 0, n.$$

Figure 3.3 gives the graphs of some first kind Chebyshev polynomials.

⁴ Pafnuty Lvovich Chebyshev (1821-1894) was the most prominent member of the St. Petersburg school of mathematics. He made pioneering contributions to number theory, probability theory, and approximation theory. He is regarded as the founder of constructive function theory, but also worked in mechanics, notably the theory of mechanisms, and in ballistics.

Figure 3.2: The Cebyshev polynomial T4 and its root

Figure 3.3: The Cebyshev #1 polynomials T3 , T4 , T7 , T8 on [-1,1]



First kind Chebyshev polynomials are orthogonal relative to the measure

$$d\lambda(x) = \frac{dx}{\sqrt{1-x^2}} \quad \text{on } [-1, 1].$$

One easily checks from (3.3.4) that

$$\int_{-1}^{1}T_k(x)T_\ell(x)\frac{dx}{\sqrt{1-x^2}} = \int_0^{\pi}T_k(\cos\theta)T_\ell(\cos\theta)\,d\theta = \int_0^{\pi}\cos k\theta\,\cos\ell\theta\,d\theta = \begin{cases} 0 & \text{if } k\neq\ell,\\ \pi & \text{if } k=\ell=0,\\ \pi/2 & \text{if } k=\ell\neq0.\end{cases} \qquad (3.3.8)$$

The Fourier expansion in Chebyshev polynomials (essentially the Fourier cosine expansion) is given by

$$f(x) = \sum_{j=0}^{\infty}{}' c_jT_j(x) := \frac{1}{2}c_0 + \sum_{j=1}^{\infty}c_jT_j(x), \qquad (3.3.9)$$

where

$$c_j = \frac{2}{\pi}\int_{-1}^{1}f(x)T_j(x)\frac{dx}{\sqrt{1-x^2}}, \quad j \in \mathbb{N}.$$

Truncating (3.3.9) at the term of degree n gives a useful polynomial approximation of degree n,

$$\tau_n(x) = \sum_{j=0}^{n}{}' c_jT_j(x) := \frac{c_0}{2} + \sum_{j=1}^{n}c_jT_j(x), \qquad (3.3.10)$$

having the error

$$f(x) - \tau_n(x) = \sum_{j=n+1}^{\infty}c_jT_j(x) \approx c_{n+1}T_{n+1}(x). \qquad (3.3.11)$$

The approximation on the far right is better the faster the Fourier coefficients cⱼ tend to zero. The error (3.3.11) essentially oscillates between +c_{n+1} and −c_{n+1} and thus is of “uniform” size. This is in stark contrast to Taylor’s expansion at x = 0, where the nth degree polynomial partial sum has an error proportional to x^{n+1} on [−1, 1].
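As an illustration, the truncated expansion (3.3.10) is easy to compute numerically: with the substitution x = cos θ, the coefficient integral becomes cⱼ = (2/π)∫₀^π f(cos θ) cos jθ dθ, which the sketch below approximates by the midpoint rule in θ. The Python/NumPy code is ours and purely illustrative; the number of quadrature points N is an arbitrary choice.

```python
import numpy as np

def cheb_coeffs(f, n, N=200):
    """Approximate c_0..c_n in (3.3.10) via the midpoint rule applied to
    c_j = (2/pi) * int_0^pi f(cos th) cos(j th) dth."""
    th = (2 * np.arange(1, N + 1) - 1) * np.pi / (2 * N)   # midpoints in (0, pi)
    fv = f(np.cos(th))
    return np.array([2.0 / N * np.sum(fv * np.cos(j * th)) for j in range(n + 1)])

def tau(c, x):
    """Evaluate tau_n(x) = c_0/2 + sum_{j>=1} c_j T_j(x) by the recurrence (3.3.5)."""
    Tprev, Tcur = np.ones_like(x), x
    s = c[0] / 2 + c[1] * x
    for j in range(2, len(c)):
        Tprev, Tcur = Tcur, 2 * x * Tcur - Tprev
        s += c[j] * Tcur
    return s

f = np.exp
c = cheb_coeffs(f, 8)
x = np.linspace(-1, 1, 5)
print(np.max(np.abs(f(x) - tau(c, x))))    # roughly of the size of |c_9|
```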
With respect to the inner product

$$(f, g)_T := \sum_{k=1}^{n+1} f(\xi_k)g(\xi_k), \qquad (3.3.12)$$

where {ξ₁, ..., ξ_{n+1}} is the set of zeros of T_{n+1}, the following discrete orthogonality property holds:

$$(T_i, T_j)_T = \begin{cases} 0, & i \neq j,\\ \frac{n+1}{2}, & i = j \neq 0,\\ n+1, & i = j = 0.\end{cases}$$

Indeed, we have $\arccos\xi_k = \frac{2k-1}{2n+2}\pi$, k = 1, n+1. Let us compute the inner product:

$$(T_i, T_j)_T = (\cos i\arccos t, \cos j\arccos t)_T = \sum_{k=1}^{n+1}\cos(i\arccos\xi_k)\cos(j\arccos\xi_k) = \sum_{k=1}^{n+1}\cos\Bigl(i\,\frac{2k-1}{2(n+1)}\pi\Bigr)\cos\Bigl(j\,\frac{2k-1}{2(n+1)}\pi\Bigr)$$
$$= \frac{1}{2}\sum_{k=1}^{n+1}\Bigl[\cos\Bigl((i+j)\frac{2k-1}{2(n+1)}\pi\Bigr) + \cos\Bigl((i-j)\frac{2k-1}{2(n+1)}\pi\Bigr)\Bigr] = \frac{1}{2}\sum_{k=1}^{n+1}\cos(2k-1)\frac{(i+j)\pi}{2(n+1)} + \frac{1}{2}\sum_{k=1}^{n+1}\cos(2k-1)\frac{(i-j)\pi}{2(n+1)}.$$

One introduces the notations $\alpha := \frac{i+j}{2(n+1)}\pi$, $\beta := \frac{i-j}{2(n+1)}\pi$ and

$$S_1 := \frac{1}{2}\sum_{k=1}^{n+1}\cos(2k-1)\alpha, \qquad S_2 := \frac{1}{2}\sum_{k=1}^{n+1}\cos(2k-1)\beta.$$

Since

$$2\sin\alpha\,S_1 = \sin 2(n+1)\alpha, \qquad 2\sin\beta\,S_2 = \sin 2(n+1)\beta,$$

one obtains S₁ = 0 and S₂ = 0.
With respect to the inner product

$$(f, g)_U := \tfrac{1}{2}f(\eta_0)g(\eta_0) + f(\eta_1)g(\eta_1) + \cdots + f(\eta_{n-1})g(\eta_{n-1}) + \tfrac{1}{2}f(\eta_n)g(\eta_n) = \sum_{k=0}^{n}{}'' f(\eta_k)g(\eta_k), \qquad (3.3.13)$$

where {η₀, ..., ηₙ} is the set of extremal points of Tₙ, a similar property holds:

$$(T_i, T_j)_U = \begin{cases} 0, & i \neq j,\\ \frac{n}{2}, & i = j \neq 0,\\ n, & i = j = 0.\end{cases}$$

The monic polynomial $\mathring T_n$ has the least uniform norm among all monic polynomials of degree n.

Theorem 3.3.1 (Chebyshev). For an arbitrary monic polynomial $\mathring p_n$ of degree n, there holds

$$\max_{-1\le x\le 1}\bigl|\mathring p_n(x)\bigr| \ge \max_{-1\le x\le 1}\bigl|\mathring T_n(x)\bigr| = \frac{1}{2^{n-1}}, \quad n \ge 1, \qquad (3.3.14)$$

where $\mathring T_n(x)$ is the monic Chebyshev polynomial (3.3.6) of degree n.

Proof (by contradiction). Assume, contrary to (3.3.14), that

$$\max_{-1\le x\le 1}\bigl|\mathring p_n(x)\bigr| < \frac{1}{2^{n-1}}. \qquad (3.3.15)$$

Then the polynomial $d_n(x) = \mathring T_n(x) - \mathring p_n(x)$ (of degree ≤ n − 1) satisfies

$$d_n\bigl(y_0^{(n)}\bigr) > 0, \quad d_n\bigl(y_1^{(n)}\bigr) < 0, \quad d_n\bigl(y_2^{(n)}\bigr) > 0, \ \ldots,\ (-1)^n d_n\bigl(y_n^{(n)}\bigr) > 0. \qquad (3.3.16)$$

Since dₙ changes sign at least n times, it must vanish identically; this contradicts (3.3.16); thus (3.3.15) cannot be true. □

The result (3.3.14) can be given the following interesting interpretation: the best uniform approximation on [−1, 1] to f(x) = xⁿ from P_{n−1} is given by $x^n - \mathring T_n(x)$, that is, by the aggregate of terms of degree ≤ n − 1 in $\mathring T_n$ taken with the minus sign. From the theory of uniform polynomial approximation it is known that the best approximant is unique. Therefore equality in (3.3.14) can hold only if $\mathring p_n(x) = \mathring T_n(x)$.

3.3.3 Second kind Chebyshev polynomials


Chebyshev #2 polynomials are defined by

$$Q_n(t) = \frac{\sin[(n+1)\arccos t]}{\sqrt{1-t^2}}, \quad t \in [-1, 1].$$

They are orthogonal on [−1, 1] relative to the measure dλ(t) = w(t) dt, w(t) = √(1 − t²).
The recurrence relation is

$$Q_{n+1}(t) = 2tQ_n(t) - Q_{n-1}(t), \qquad Q_0(t) = 1,\ Q_1(t) = 2t.$$

3.3.4 Laguerre polynomials


This Laguerre 5 polynomials are orthogonal on [0, ∞) with respect to the weight
w(t) = tα e−t . They are defined by

et t−α dn n+α −t
lnα (t) = (t e ) for α > 1
n! dtn

The recurrence relation for monic polynomials ˜lnα is

˜lα (t) = (t − αn )˜lα (t) − (2n + α + 1)˜lα (t),


n+1 n n−1

where α0 = Γ(1 + α) and αk = k(k + α), for k > 0.

3.3.5 Hermite polynomials


Hermite polynomials are defined by

2 dn −t2
Hn (t) = (−1)n et (e ).
dtn
2
They are orthogonal on (−∞, ∞) with respect to the weight w(t) = e−t and the
recurrence relation is for monic polynomials H̃n (t) is

H̃n+1 (t) = tH̃n (t) − βn H̃n−1 (t),



where β0 = π and βk = k/2, for k > 0.

Edmond Laguerre (1834-1886) was a French mathematician


5
active in Paris, who made essential contributions to geometry,
algebra, and analysis.
70 Function Approximation

3.3.6 Jacobi polynomials


They are orthogonal on [−1, 1] relative to the weight

w(t) = (1 − t)α (1 + t)β .

Jacobi polynomials are generalizations of other orthogonal polynomials:

• For α = β = 0 we obtain Legendre polynomials.

• For α = β = −1/2 we obtain Cebyshev #1 polynomials.

• For α = β = 1/2 we obtain Cebyshev #2 polynomials.

Remark 3.3.2. For Jacobi polynomials we have

β 2 − α2
αk =
(2k + α + β)(2k + α + β + 2)

and

β0 =2α+β+1 B(α + 1, β + 1),


4k(k + α)(k + α + β)(k + β)
βk = , k > 0. ♦
(2k + α + β − 1)(2k + α + β)2 (2k + α + β + 1)

We conclude this section with a table of some classical weight functions, their
corresponding orthogonal polynomials, and the recursion coefficients αk , βk for gen-
erating orthogonal polynomials (see Table 3.2).

3.4 The Space H n [a, b]


For n ∈ N∗ , we define

H n [a, b] = {f : [a, b] → R : f ∈ C n−1 [a, b], f (n−1) absolute continuous on [a, b]}.
(3.4.1)
Each function f ∈ H n [a, b] admits a Taylor-type representation with the remain-
der in integral form
n−1 x
(x − a)k (k) (x − t)n−1 (n)
X Z
f (x) = f (a) + f (t)dt. (3.4.2)
k! a (n − 1)!
k=0

H n [a, b] is a linear space.


3.4. The Space H n [a, b] 71

Polynomials Notation Weight interval αk βk


Legendre Pn (ln ) 1 [-1,1] 0 2 (k=0)
(4−k−2 )−1 (k>0)
−1
Cebyshev #1 Tn (1−t2 ) 2 [−1,1] 0 π (k=0)
1
2
π (k=1)
1
4
(k>0)
1 1
Cebyshev #2 un (Qn ) (1−t2 ) 2 [−1,1] 0 2
π (k=0)
1
4
(k>0)
(α)
Laguerre Ln tα e−t α>−1 [0,∞) 2k+α+1 Γ(1+α) (k=0)
k(k+α) (k>0)
2 √
Hermite Hn e−t R 0 π (k=0)
1
2
k (k>0)
(α,β)
Jacobi Pn (1−t)α (1−t)β [−1,1] See Remark 3.3.2
α>−1, β>−1 page 70

Table 3.2: Orthogonal Polynomials

Remark 3.4.1. A function f : I → R, I interval, is called absolute continuous on


I if ∀ ε > 0 ∃ δ > 0 such that forPeach finite system of disjoint subinterval in I
{(ak , bk )}k=1,n having the property nk=1 (bk − ak ) < δ it holds
n
X
|f (bk ) − f (ak )| < ε.
k=1 ♦

The next theorem, due to Peano 6 , extremely important for Numerical Analysis,
gives a representation of real linear functionals, defined on H n [a, b].
Theorem 3.4.2 (Peano). Let L be a real continuous linear functional, defined on
H n [a, b]. If KerL = Pn−1 then
Z b
Lf = K(t)f (n) (t)dt, (3.4.3)
a

Giuseppe Peano (1858-1932), an Italian mathematician ac-


tive in Turin, made fundamental contributions to mathematical
logic, set theory, and the foundations of mathematics. Gen-
6 eral existence theorems in ordinary differential equations also
bear his name. He created his own mathematical language, us-
ing symbols of the algebra and logic, and even promoted (and
used) a simplified Latin (his “latino”) as a world language for
scientific publication.
72 Function Approximation

where
1
K(t) = L[(· − t)n−1
+ ] (Peano kernel). (3.4.4)
(n − 1)!
Remark 3.4.3. The function

z, z ≥ 0
z+ =
0, z < 0
n is called truncated power.
is called positive part, and z+ ♦
Proof. f admits a Taylor representation with the remainder in integral form
f (x) = Tn−1 (x) + Rn−1 (x)
where
x Z b
(x − t)n−1 (n)
Z
1
Rn−1 (x) = f (t)dt = (x − t)n−1
+ f
(n)
(t)dt
a (n − 1)! (n − 1)! a
By applying L to both sides we get
Z b 
1 n−1 (n)
Lf = LTn−1 +LRn−1 ⇒ Lf = L (· − t)+ f (t)dt =
| {z } (n − 1)! a
0
Z b
cont 1
= L(· − t)n−1
+ f
(n)
(t)dt.
(n − 1)! a

Remark 3.4.4. The conclusion of the theorem remains valid if L is not continuous,
but it has the form
n−1
XZ b
Lf = f (i) (x)dµi (x), µi ∈ BV [a, b].
i=0 a ♦

Corollary 3.4.5. If K does not change sign on [a, b] and f (n) is continuous on [a, b],
then there exists ξ ∈ [a, b] such that
1
Lf = f (n) (ξ)Len , (3.4.5)
n!
where ek (x) = xk , k ∈ N.
Proof. Since K does not change sign we may apply in (3.4.3) the second mean value
theorem of integral calculus
Z b
Lf = f (n) (ξ) Kn (t)dt, ξ ∈ [a, b].
a
Setting f = en we get precise (3.4.5). 
3.5. Polynomial Interpolation 73

3.5 Polynomial Interpolation


We now wish to approximate functions by matching their values at given points.

Problem 3.1. Given m + 1 distinct points x0 , x1 , . . . , xm and values fi = f (xi ) of


some function f ∈ X at this points, find a function ϕ ∈ Φ such that

ϕ(xi ) = fi , i = 1, m.

Suppose Φ is a (m + 1)-dimensional linear space. Since we have to satisfy m + 1


conditions, and have at our disposal m + 1 degrees of freedom – the coefficients of
ϕ relative to a base of Φ – we expect the problem to have a unique solution. Other
question of interest, in addition to existence and uniqueness, are different ways of
representing and computing ϕ, what can be said about the error e(x) = f (x) − p(x)
when x 6= xi , i = 1, m and the quality of approximation f (x) ≈ ϕ(x) when the
number of points, and hence the “degree” of ϕ, is allowed to increase indefinitely.
Although these question are not of the utmost interest in themselves, the result dis-
cussed in the sequel are widely used in the development of approximate methods for
important practical tasks (numerical integration, equation solving and so on).
Interpolation to function values is referred to as Lagrange-type interpolation.
More generally, we may wish to interpolate to function and derivative values of some
function. This is called Hermite-type interpolation.
When Φ = Pn we have to deal with polynomial interpolation. In this case in-
terpolation problem is called Lagrange interpolation and Hermite interpolation, re-
spectively. For example, the Lagrange interpolation problem is stated as follows.

Problem 3.2. Given m + 1 distinct points x0 , x1 , . . . , xm and values fi = f (xi ) of


some function f ∈ X at this points, find a polynomial ϕ of minimum degree such
that
ϕ(xi ) = fi , i = 1, m.

3.5.1 Lagrange interpolation


Let [a, b] ⊂ R a closed interval, a set of m + 1 distinct points {x0 , x1 , . . . , xm } ⊂
[a, b] and a function f : [a, b] 7→ R. We wish to determine a polynomial of minimum
degree reproducing the values of i f at xk , k = 0, m.

Theorem 3.5.1. There exists one polynomial and only one Lm f ∈ Pm such that

∀ i = 0, 1, . . . , m, (Lm f )(xi ) = f (xi ); (3.5.1)


74 Function Approximation

this polynomial can be written in the form


m
X
(Lm f )(x) = f (xi )`i (x), (3.5.2)
i=0

where
m
Y x − xj
`i (x) = . (3.5.3)
j=0
x i − xj
j6=i

Definition 3.5.2. The polynomial Lm f defined in Theorem 3.5.1 is called Lagrange


7 interpolation polynomial of frelative to the points x0 , x1 , . . . , xm , and the functions
`i (x), i = 0, m, are called elementary (fundamental, basic) Lagrange polynomials
associated to those points.

Proof. One proves immediately that `i ∈ Pi and that `i (xj ) = δij (Krönecker’s
symbol); it results that the polynomial Lm f defined by (3.5.1) is of degree at most
m and it satisfies (3.5.2). Suppose that there is another polynomial p∗m ∈ Pm which
also verifies (3.5.2) and we set qm = Lm − p∗m ; we have qm ∈ Pm and ∀ i = 0, m,
qm (xi ) = 0; so qm , having (m + 1) distinct roots vanishes identically, therefore the
uniqueness result. 

Remark 3.5.3. The basic polynomial `i is thus the unique polynomial satisfying

`i ∈ Pm and ∀ j = 0, 1, . . . , m, `i (xj ) = δij

Setting
m
Y
u(x) = (x − xj )
j=0
u(x)
from (3.5.3) we obtain that ∀ x 6= xi , `i (x) = (x−xi )u0 (xi ) . ♦
Joseph Louis Lagrange (1736-1813), born in Turin, became,
through correspondence with Euler, his protégé. In 1766 he
indeed succeeded Euler in Berlin. He returned to Paris in
1787. Clairaut wrote of the young Lagrange: “... a Young
man, no less remarkable for his talents than for his modesty;
his temperament is mild and melancholic; he knows no other
7
pleasure than study”. Lagrange made fundamental contribu-
tions to the calculus of variations and to number theory, but
worked also on many problems in analysis. He is widely
known for his representation of the remainder term in Tay-
lor’s formula. The interpolation formula appeared in 1794.
His Mécanique Analytique, published in 1788, made him one
of the founders of analytic mechanics.
3.5. Polynomial Interpolation 75

The proof of 3.5.1 proves in fact the existence and the uniqueness of the solution
of general Lagrange interpolation problem:

(PGIL) Given the data b0 , b1 , . . . , bm ∈ R, determine

pm ∈ Pm such that ∀ i = 0, 1, . . . , n, pm (xi ) = bi . (3.5.4)

Problem (3.5.4) leads us to a linear system of (m + 1) equations with (m + 1)


unknowns (the coefficients of pm ).
It is a well-known result from linear algebra

{Existence of a solution ∀ b0 , b1 , . . . , bm } ⇔ {uniqueness of the solution} ⇔

{(b0 = b1 = · · · = bm = 0) ⇒ pm ≡ 0}
We set pm = a0 + a1 x + · · · + am xm

a = (a0 , a1 , . . . , am )T , b = (b0 , b1 , . . . , bm )T

and let V = (vij ) be the m + 1 by m + 1 square matrix with elements vij = xji . The
equation (3.5.4) can be rewritten in the form

Va=b

The matrix V is invertible P(it is Vandermonde); one can prove that V −1 = U T


where U = (uij ) with `i (x) = m k
k=0 uik x ; in this way we obtain a not so expensive
procedure to invert the Vandermonde matrix and thus to solve the system (3.5.4).
Example 3.5.4. The Lagrange interpolation polynomial of a function f relative to
the nodes x0 and x1 is
x − x1 x − x0
(L1 f ) (x) = f (x0 ) + f (x1 ),
x0 − x1 x1 − x0
that is, the line passing through the points (x0 , f (x0 )) and (x1 , f (x1 )). Analogously,
the Lagrange interpolation polynomial of a function f relative to the nodes x0 , x1
and x2 is
(x − x1 )(x − x2 ) (x − x0 )(x − x2 )
(L2 f ) (x) = f (x0 ) + f (x1 )+
(x0 − x1 )(x0 − x2 ) (x1 − x0 )(x1 − x2 )
(x − x0 )(x − x1 )
f (x2 ),
(x2 − x0 )(x2 − x1 )
that is, the parabola passing through the points of coordinates (x0 , f (x0 )), (x1 , f (x1 ))
and (x2 , f (x2 )). Their geometric interpretation is given in Figure 3.4. ♦
76 Function Approximation

(a) (L1 f ) (b) (L2 f )

Figure 3.4: Geometric interpretation of L1 f (left) and L2 f

3.5.2 Hermite Interpolation


Instead of making f and the interpolation polynomial to agree at points xi in [a, b],
we could make that f and the interpolation polynomial to agree together with their
derivatives up to the order ri at points xi . One obtains:

Theorem 3.5.5. Given (m + 1) distinct points x0 , x1 , . . . , xm in [a, b] and (m + 1)


natural numbers r0 , r1 , . . . , rm , we set n = m + r0 + r1 + · · · + rm . Then, given a
function f , defined on [a, b] and having ri th order derivative at point xi , there exists
one polynomial and only one Hn f of degree ≤ n such that

∀ (i, `), 0 ≤ i ≤ m, 0 ≤ ` ≤ ri (Hn f )(`) (xi ) = f (`) (xi ), (3.5.5)

where f (`) (xi ) is the `th order derivative of f at xi .

Definition 3.5.6. The polynomial defined as above is called Hermite 8 interpolation


polynomial of the function f relative to the points x0 , x1 , . . . , xm and integers r0 ,
r1 , . . . , rm .

Charles Hermite (1822-1901) was a leading French math-


ematicians, Academician in Paris, known for his extensive
8
work in number theory, algebra, and analysis. He is famous
for his proof in 1873 of the transcendental nature of the num-
ber e.
3.5. Polynomial Interpolation 77

Proof. Equation (3.5.5) leads us to a linear system having (n + 1) equations and


(n + 1) unknowns (the coefficients of Hn f ), so it is sufficient to show that the corre-
sponding homogeneous system has only the null solution, that is, the relations
Hn f ∈ Pn and ∀ (i, `), 0 ≤ i ≤ k, 0 ≤ ` ≤ ri , (Hn f )(`) (xi ) = 0
guarantee us that for each i = 0, 1, . . . , m xi is a (ri + 1)th order multiple root of
Hn f ; therefore Hn f has the form
m
Y
(Hn f )(x) = q(x) (x − xi )ri +1 ,
i=0
Pm
where q is a polynomial. Since i=0 (αi + 1) = n + 1, the above relation is incom-
patible to the membership of Hn to Pn , excepting the situation when q ≡ 0, hence
Hn ≡ 0. 
Remark 3.5.7. 1) Given the real numbers bi` , for each pair (i, `) such that 0 ≤
i ≤ k and 0 ≤ ` ≤ ri , we proved that the general Hermite interpolation
problem
determine pn ∈ Pn such that ∀ (i, `) 0 ≤ i ≤ m and
(`) (3.5.6)
0 ≤ ` ≤ ri , pn (xi ) = bi`
has a solution and only one. In particular, if we choose a given pair (i, `),
bi` = 1 and bjn = 0, ∀ (j, m) 6= (i, `) one obtains a basic (fundamental)
Hermite interpolation polynomial relative to the points x0 , x1 , . . . , xm and in-
tegers r0 , r1 , . . . , rm . The Hermite interpolation polynomial defined by (3.5.5)
can be obtained using the basic polynomials
ri
m X
X
(Hn f )(x) = f (`) (x)hi` (x). (3.5.7)
i=0 l=0
Setting
k 
x − xj rj+1
Y 
qi (x) =
j=0
xi − xj
j6=i

one checks easily that the basic polynomials hi` are defined by the recurrences
(x − xi )ri
hiri (x) = qi (x)
ri !
and for ` = ri−1 , ri−2 , . . . , 1, 0
ri  
(x − xi )` X j (j−`)
hi` (x) = qi (x) − q (xi )hij (x).
`! ` i
j=`+1
78 Function Approximation

2) The matrix V of the linear system (3.5.6) is called generalized Vandermonde


matrix; it is invertible and the elements of its inverse are the coefficients of
polynomials hil .

3) Lagrange interpolation is a particular case of Hermite interpolation (for ri = 0,


i = 0, 1, . . . , m); Taylor’s polynomial is a particular case for m = 0 and
r0 = n. ♦

We shall give a more convenient expression for Hermite basic polynomials due
to Dimitrie D. Stancu. They verify
(p)
hkj (xν ) = 0, ν 6= k, p = 0, rν (3.5.8)
(p)
hkj (xk ) = δjp , p = 0, rk

for j = 0, rk and ν, k = 0, m. Setting


m
Y
u(x) = (x − xk )rk +1
k=0

and
u(x)
uk (x) = ,
(x − xk )rk +1
it results from (3.5.8) that hkj is of the form

hkj (x) = uk (x)(x − xk )j gkj (x), gkj ∈ Prk −j . (3.5.9)

Applying Taylor’s formula, we get

k −j
rX
(x − xk )ν ν
gkj (x) = gkj (xk ); (3.5.10)
ν!
ν=0

ν (x ), ν = 0, r − j. Rewriting (3.5.9) in the


now we must determine the values gkj k k
form
1
(x − xk )j gkj (x) = hkj (x) ,
uk (x)
and applying Leibnitz’s formula for the (j + ν)th order derivative of the product one
gets
j+ν   j+ν    (s)
X j+ν h j
i(j+ν−s)
(s)
X j + ν (j+ν−s) 1
(x − xk ) gkj (x) = hkj (x) .
s s uk (x)
s=0 s=0
3.5. Polynomial Interpolation 79

Taking x = xk , all terms in both sides, excepting those corresponding to s = ν will


vanish. Thus, we have
    (ν)
j+ν (ν) j+ν 1
j!gkj (xk ) = , ν = 0, rk − j.
ν ν uk (x) x=xk

We got
 (ν)
(ν) 1 1
gkj (xk ) = ,
j! uk (x) x=xk
and from (3.5.10) and (3.5.9) we finally have

k −j
rX (ν)
(x − xk )j (x − xk )ν

1
hkj (x) = uk (x) .
j! ν! uk (x) x=xk
ν=0

Proposition 3.5.8. The operator Hn is a projector, i.e.

• it is linear (Hn (αf + βg) = αHn f + βHn g);

• it is idempotent (Hn ◦ Hn = Hn ).

Proof. Linearity results from (3.5.7). Due to the uniqueness of Hermite interpolation
polynomial, Hn (Hn f ) − Hn f vanishes identically, hence Hn (Hn f ) = Hn f , that is,
it is idempotent. 

Example 3.5.9. The Hermite interpolation polynomial corresponding to a function


f and double nodes 0 and 1 is

(H3 f ) (x) = h00 (x)f (0) + h10 (x)f (1) + h01 (x)f 0 (0) + h11 (x)f 0 (1),

where

h00 (x) = (x − 1)2 (2x + 1),


h01 (x) = x(x − 1)2 ,
h10 (x) = x2 (3 − 2x),
h11 (x) = x2 (x − 1).

If we add the node x = 21 , then the quality of approximation increases (see Figure
3.5). ♦
80 Function Approximation

(a) (H3 f ) (b) (H3 f )

Figure 3.5: Hermite interpolation polynomial (H3 f ) (black) of the function f :


[0, 1] → R , f (x) = sin πx (red) and double nodes x0 = 0 and x1 = 1 (left)
and (H5 f ) (black) of the function f : [0, 1] → R , f (x) = sin πx (red) and double
nodes x0 = 0, x1 = 12 and x2 = 1

3.5.3 Interpolation error


If we wish to use Lagrange or Hermite interpolation polynomial to approximate a
function f at a point x ∈ [a, b], x 6= xk , k = 0, m, we need to estimate the error
(Rn f )(x) = f (x) − (Hn f )(x). If we have not any information about f excepting
the values at xi , we can say nothing about (Rn f )(x); we can change f everywhere
excepting the points xi without modify (Hn f ) (x). We need some supplementary
assumptions (regularity conditions) on f . Let C m [a, b] be the space of real functions
m times continuous-differentiable on [a, b]. We have the following theorem about
error in Hermite interpolation.
Theorem 3.5.10. Suppose f ∈ C n [α, β] and there exists f (n+1) on (α, β), where
α = min{x, x0 , . . . , xm } and β = max{x, x0 , . . . , xm }; then, for each x ∈ [α, β],
there exists a ξx ∈ (α, β) such that
1
(Rn f )(x) = un (x)f (n+1) (ξx ), (3.5.11)
(n + 1)!
where
m
Y
un (x) = (x − xi )ri+1 .
i=0

Proof. If x = xi , (Rn f )(x) = 0 and (3.5.11) holds trivially. Suppose x 6= xi ,


i = 0, m and for a fixed x, we introduce the auxiliary function

un (z) (Rn f )(z)
F (z) = .
un (x) (Rn f )(x)
3.5. Polynomial Interpolation 81

Note that F ∈ C n [α, β], ∃ F (n+1) on (α, β), F (x) = 0 and F (j) (xk ) = 0 for
k = 0, m, j = 0, rk . Thus, F has (n + 2) zeros, considering their multiplicities.
Applying successively Rolle generalized Theorem, it results that there exists at least
one ξ ∈ (α, β) such that F (n+1) (ξ) = 0, i.e.

(m+1)
(n + 1)! f (n+1) (ξ)
F (ξ) = = 0, (3.5.12)
un (x) (Rn f )(x)

where we used the relation (Rn f )(n+1) = f (n+1) − (Hn f )(n+1) = f (n+1) . Express-
ing (Rn f )(x) from (3.5.12) one obtains (3.5.11). 

Corollary 3.5.11. We set Mn+1 = max |f (n+1) (x)|; an upper bound of interpola-
x∈[a,b]
tion error (Rn f )(x) = f (x) − (Hn f )(x) is given by

Mn+1
|(Rn f )(x)| ≤ |un (x)|.
(n + 1)!

Since Hn is a projector, Rn is also a projector and additionally KerRn = Pn ,


because Rn f = f − Hn f = f − f = 0, ∀f ∈ Pn . Thus, we can apply Peano’s
Theorem to Rn .

Theorem 3.5.12. If f ∈ C n+1 [a, b], then


Z b
(Rn f ) (x) = Kn (x; t)f (n+1) (t)dt, (3.5.13)
a

where
 
rk
m X
1  X (j) 
(x − t)n+ − hkj (x) (xk − t)n+

Kn (x; t) = . (3.5.14)
n!  
k=0 j=0

Proof. Applying Peano’s Theorem, we have


Z b
(Rn f ) (x) = Kn (x; t)f (n+1) (t)dt
a

and taking into account that

(x − t)n+ (x − t)n+ (x − t)n+


   
Kn (x; t) = Rn = − Hn ,
n! n! n!

the theorem follows immediately. 


82 Function Approximation

Since Lagrange interpolation is a particular case of Hermite interpolation for ri =


0, i = 0, 1, . . . , m we have from Theorem 3.5.10:

Corollary 3.5.13. Suppose f ∈ C m [α, β] and there exists f (m+1) on (α, β), where
α = min{x, x0 , . . . , xm } and β = max{x, x0 , . . . , xm }; then, for each x ∈ [α, β],
there exists a ξx ∈ (α, β) such that

1
(Rm f )(x) = um (x)f (m+1) (ξx ), (3.5.15)
(n + 1)!

where
m
Y
um (x) = (x − xi ).
i=0

Also, it follows from Peano’s Theorem 3.5.12:

Corollary 3.5.14. If f ∈ C m+1 [a, b], then


Z b
(Rm f ) (x) = Km (x; t)f (m+1) (t)dt (3.5.16)
a

where
m
" #
1 X
Km (x; t) = (x − t)m
+ − lk (x)(xk − t)m
+ . (3.5.17)
m!
k=0

Example 3.5.15. For interpolation polynomials in example 3.5.4 the corresponding


remainders are
(x − x0 )(x − x1 ) 00
(R1 f )(x) = f (ξ),
2
and
(x − x0 )(x − x1 )(x − x2 ) 000
(R2 f )(x) = f (ξ),
6
respectively. ♦

Example 3.5.16. The remainder for the Hermite interpolation formula with double
nodes 0 and 1, for f ∈ C 4 [α, β], is

x2 (x − 1)2 (4)
(R3 f )(x) = f (ξ). ♦
6!
3.6. Efficient Computation of Interpolation Polynomials 83

Example 3.5.17. Let f (x) = ex . For x ∈ [a, b], we have Mn+1 = eb and for every
choice of the points xi , |un (x)| ≤ (b − a)n+1 , which implies

(b − a)n+1 b
max |(Rn f )(x)| ≤ e.
x∈[a,b] (n + 1)!
One gets  
lim max |(Rn f )(x)| = lim k(Rn f )(x)k = 0,
n→∞ x∈[a,b] n→∞

that is, Hn f converges uniformly to f on [a, b], when n tends to ∞. In fact, we can
prove an analogous result for any function which can be developed into a Taylor in a
disk centered in x = a+b 3
2 and with the radius of convergence r > 2 (b − a). ♦

3.6 Efficient Computation of Interpolation Polynomials


3.6.1 Aitken-type methods
Many times, the degree required to attain the desired accuracy in polynomial inter-
polation is not known. It can be obtained from the remainder expression, but this
require kf (m+1) k∞ to be known. Pm1 ,m2 ,...,mk will denote the Lagrange interpola-
tion polynomial with nodes xm1 , . . . , xmk .
Proposition 3.6.1. If f is defined at x0 , . . . , xk , xj 6= xi , 0 ≤ i, j ≤ k, then
(x − xj )P0,1,...,j−1,j+1,...,k (x) − (x − xi )P0,1,...,i−1,i+1,...,k (x)
P0,1,...,k = =
xi − xj

1 x − xj P0,1,...,i−1,i+1,...,k (x)
= (3.6.1)
xi − xj x − xi P0,1,...,j−1,j+1,...,k (x)

Proof. Q = P0,1,...,i−1,i+1,...,k , Q
b = P0,1,...,j−1,j+1,k

(x − xj )Q(x)
b − (x − xi )Q(x)
P (x) =
xi − xj

(xr − xj )Q(x
b r ) − (xr − xi )Q(xr ) xi − xj
P (xr ) = = f (xr ) = f (xr )
xi − xj xi − xj

for r 6= i ∧ r 6= j, since Q(xr ) = Q(x


b r ) = f (xr ). But,

(xi − xj )Q(x
b i ) − (xi − xj )Q(xi )
P (xi ) = = f (xi )
xi − xj
84 Function Approximation

and
(xj − xi )Q(x
b j ) − (xj − xi )Q(xj )
P (xj ) = = f (xj ),
xi − xj
hence P = P0,1,...,k . 

Thus we established a recurrence relation between a Lagrange interpolation poly-


nomial of degree k and two Lagrange interpolation polynomials of degree k − 1. The
computation could be organized in a tabular fashion

x0 P0
x1 P1 P0,1
x2 P2 P1,2 P0,1,2
x3 P3 P2,3 P1,2,3 P0,1,2,3
x4 P4 P3,4 P2,3,4 P1,2,3,4 P0,1,2,3,4

And now, suppose P0,1,2,3,4 does not provide the desired accuracy. One can select
a new node and add a new line to the table

x5 P5 P4,5 P3,4,5 P2,3,4,5 P1,2,3,4,5 P0,1,2,3,4,5

and neighbor elements on row, column and diagonal could be compared to check if
the desired accuracy was achieved.
The method is called Neville method.
We can simplify the notations

Qi,j := Pi−j,i−j+1,...,i−1,i ,

Qi,j−1 = Pi−j+1,...,i−1,i ,
Qi−1,j−1 := Pi−j,i−j+1,...,i−1 .
Formula (3.6.1) implies

(x − xi−j )Qi,j−1 − (x − xi )Qi−1,j−1


Qi,j = ,
xi − xi−j

for j = 1, 2, 3, . . . , i = j + 1, j + 2, . . .
Moreover, Qi,0 = f (xi ). We obtain

x0 Q0,0
x1 Q1,0 Q1,1
x2 Q2,0 Q2,1 Q2,2
x3 Q3,0 Q3,1 Q3,2 Q3,3
3.6. Efficient Computation of Interpolation Polynomials 85

If the interpolation procedure converges, then the sequence Qi,i also converges
and a stopping criterion could be
|Qi,i − Qi−1,i−1 | < ε.
The algorithm speeds-up by sorting the nodes on ascending order over |xi − x|.
Aitken methods is similar to Neville method. It builds the table
x0 P0
x1 P1 P0,1
x2 P2 P0,2 P0,1,2
x3 P3 P0,3 P0,1,3 P0,1,2,3
x4 P4 P0,4 P0,1,4 P0,1,2,4 P0,1,2,3,4
To compute a new value one takes the value in top of the preceding column and
the value from the current line and the preceding column.

3.6.2 Divided difference method


Let Lk f denotes the Lagrange interpolation polynomial with nodes x0 , x1 , . . . , xk
for k = 0, 1, . . . , n. We shall construct Lm by recurrence. We have
(L0 f )(x) = f (x0 )
for k ≥ 1, the polynomial Lk − Lk−1 is of degree k, vanish at x0 , x1 , . . . , xk , so its
form is
(Lk f )(x) = (Lk−1 f )(x)+f [x0 , x1 , . . . , xk ](x−x0 )(x−x1 ) . . . (x−xk−1 ), (3.6.2)
where f [x0 , x1 , . . . , xk ] denotes the coefficient of xk in (Lk f )(x). One derives the
expression of the interpolation polynomial Lm f with nodes x0 , x1 , . . . , xn
m
X
(Lm f )(x) = f (x0 ) + f [x0 , x1 , . . . , xk ](x − x0 )(x − x1 ) . . . (x −xk−1 ), (3.6.3)
k=1

called Newton’s 9 form of Lagrange interpolation polynomial.

Sir Isaac Newton (1643 - 1727) was an eminent figure of the


17th century mathematics and physics. Not only did he lay
the foundations of modern physics, but he was also one of the
co-inventors of the differential calculus. Another was Leib-
9
niz, with whom he became entangled in a biter and life-long
priority dispute. His most influential work was the Principia,
which not only contains his ideas on interpolation, but also his
suggestion to use the interpolating polynomial for purposes of
integration.
86 Function Approximation

Formula (3.6.3) reduces the computation by recurrence of Lm f to that of the


coefficients f [x0 , x1 , . . . , xk ], k = 0, m.
It holds

Lemma 3.6.2.

f [x1 , x2 , . . . , xk ] − f [x0 , x1 , . . . , xk−1 ]


∀k≥1 f [x0 , x1 , . . . , xk ] = (3.6.4)
xk − x0

and
f [xi ] = f (xi ), i = 0, 1, . . . , k.

Proof. For k ≥ 1 let L∗k−1 f be the interpolation polynomial for f , having the degree
k − 1 and the nodes x1 , x2 , . . . , xk ; the coefficient of xk−1 is f [x1 , x2 , . . . , xk ]. The
polynomial qk of degree k defined by

(x − x0 )(L∗k−1 f )(x) − (x − xk )(Lk−1 f )(x)


qk (x) =
xk − x0

equates f at points x0 , x1 , . . . , xk , hence qk (x) ≡ (Lk f )(x). Formula (3.6.4) is


obtaining by identification of xk coefficients in both sides. 

Definition 3.6.3. The quantity f [x0 , x1 , . . . , xk ] is called kth divided difference of f


relative to the nodes x0 , x1 , . . . , xk .

An alternative notation is [x0 , . . . , xk ; f ].


The definition implies that f [x0 , x1 , . . . , xk ] is independent of x’s order and it
could be computed as a function of f (x0 ), . . . , f (xm ). Indeed, the Lagrange inter-
polation polynomial of degree ≤ m relative to the nodes x0 , . . . , xm can be written
as
X m
(Lm f )(x) = li f (xi )
i=0

and the coefficient of xm is


m
X f (xi )
f [x0 , . . . , xm ] = m . (3.6.5)
Y
i=0 (xi − xj )
j=0
j6=i
3.6. Efficient Computation of Interpolation Polynomials 87

The formula (3.6.4) can be used to generate the table of divided differences

x0 f [x0 ] - f [x0 , x1 ] - f [x0 , x1 , x2 ] - f [x0 , x1 , x2 , x3 ]


* *
 *

  
  
  
x1 f [x1 ] - f [x1 , x2 ] - f [x1 , x2 , x3 ]
*
 *

   
 
x2 f [x2 ] - f [x2 , x3 ]
*
 

x3 f [x3 ]

The first column contains the values of function f , the second contains the 1st or-
der divided difference and so on; we pass from a column to the next using formula
(3.6.4): each entry is the difference of the entry to the left and below it and the one
immediately to the left, divided by the difference of the x-value found by going di-
agonally down and the x-value horizontally to the left. The divided differences that
occur in the Newton formula (3.6.3) are precisely the m + 1 entries in the first line
of the table of divided differences. Their computation requires n(n + 1) additions
and 21 n(n + 1) divisions. Adding another data point (xm+1 , f [xm+1 ]) requires the
generation of the next diagonal. Lm+1 f can be obtained from Lm f by adding to it
the term f [x0 , . . . , xm+1 ](x − x0 ) . . . (x − xm+1 ).

Remark 3.6.4. The interpolation error is given by

f (x) − (Lm f )(x) = um (x)f [x0 , x1 , . . . , xm , x]. (3.6.6)

Indeed, it is sufficient to note that

(Lm f )(t) + um (t)f [x0 , . . . , xm ; x]

is, according to (3.6.3) the interpolation polynomial (in t) of f relative to the points
x0 , x1 , . . . , xm , x. The theorem on the remainder of Lagrange interpolation formula
(3.5.11) implies the existence of a ξ ∈ (a, b) such that

1 (m)
f [x0 , x1 , . . . , xm ] = f (ξ) (3.6.7)
m!
(mean formula for divided differences). ♦
88 Function Approximation

A divided difference could be written as the quotient of two determinants.


Theorem 3.6.5. It holds
(W f )(x0 , . . . , xm )
f [x0 , . . . , xm ] = (3.6.8)
V (x0 , . . . , xm )
where
. . . xm−1

x20


1 x0 0 f (x0 )

m−1
1 x1 x21 . . . x1 f (x1 )
(W f )(x0 , . . . , xn ) = , (3.6.9)

.. .. .. .. .. ..

. . . . . .

1 xm x2m m−1
. . . xm f (xm )

and V (x0 , . . . , xm ) is the Vandermonde determinant.


Proof. One expands (W f )(x0 , . . . , xm ) over the last columns; taking into account
that every algebraic complement is a Vandermonde determinant, one gets
m
1 X
f [x0 , . . . , xm ] = V (x0 , . . . , xi−1 , xi+1 , . . . , xm )f (xi ) =
V (x0 , . . . , xm )
i=0
m
X f (xi )
= (−1)m−i ,
(xi − x0 ) . . . (xi − xi−1 )(xi − xi+1 ) . . . (xn − xi )
i=0
that after the sign changing of the last m − i terms implies (3.6.5). 

3.6.3 Multiple nodes divided differences


Formula (3.6.8) allows us to introduce the notion of a multiple nodes divided differ-
ence: if f ∈ C m [a, b] and α ∈ [a, b], then
f m (ξ) f (m) (α)
lim [x0 , . . . , xn ; f ] = lim =
x0 ,...,xn →α ξ→α m! m!
This suggests the relation
1 (m)
[α, . . . , α; f ] = f (α).
| {z } m!
m+1

Expressing this as a quotient of two determinants one obtains


α α2 . . . αm−1

  1 f (α)
m−2 0 (α)

0 1 2α . . . (m − 1)α f
(W f ) α, . . . , α =
... ... ... ... ... ...
| {z }

m+1 0 0 0 ... (m − 1)! f (m−1) (α)
3.6. Efficient Computation of Interpolation Polynomials 89

and
α α2 . . . αm

  1
1 2α . . . mαm−1

0
V α, . . . , α =
  ,
| {z } ... ... ... ... ...

m+1 0 0 0 ... m!

that is, the two determinants are built from the line of the node α and its successive
derivatives with respect to α up to the mth order.
The generalization to several nodes is:

Definition 3.6.6. Let rk ∈ N, k = 0, m, n = r0 + · · · + rm . Suppose that f (j) (xk ),


k = 0, m, j = 0, rk − 1 exist. The quantity

(Wf )(x0 , . . . , x0 , . . . , xm , . . . , xm )
[x0 , . . . , x0 , x1 , . . . , x1 , . . . , xm , . . . , xm ;f ] =
| {z } | {z } | {z } V (x0 , . . . , x0 , . . . , xm , . . . , xm )
r0 r1 rm

where
(W f )(x0 , . . . , x0 , . . . , xm , . . . , xm ) =
xr00 −1 x0n−1

1 x0 ... ... f (x0 )
(r0 − 1)xr00 −2 f 0 (x0 )

0 1 ... ...
.. .. .. .. .. ..

..
. . . . . . .
Qr0−1 n−r0 (r −1)

0 0 ... (r0 − 1)! ... p=1 (n − p)x0 f 0 (x0 )
= r −1 n−1
1 xm ... xmm ... xm f (xm )

0 1 . . . (rm − 1)xm rm −2 ... (n − 1)xn−2 f 0 (xm )
m
.. .. .. .. .. .. ..

. . . . . . .

Qrm−1 n−rm (r −1)
0 0 ... (rm − 1)! ... p=1 (n − p)xm f n (xn )
and V (x0 , . . . , x0 , . . . , xm , . . . , xm ) is as above, excepting the last column which is

0 −2
rY rm
Y −2
(xn0 , nxn−1
0 ,..., (n − p)xn−r
0
0 +1
, . . . , xnm , nxm
n−1
,..., xn−r
m
m +1 T
)
p=0 p=0

is called divided difference with multiple nodes xk , k = 0, m and orders of multi-


plicity rk , k = 0, m.

By generalization of Newton’s form for Lagrange interpolation polynomial one


obtains a method for computing Hermite interpolation polynomial based on multiple
nodes divided difference.
Suppose nodes xi , i = 0, m and values f (xi ), f 0 (xi ) are given. We define the
sequence of nodes z0 , z1 , . . . , z2n+1 by z2i = z2i+1 = xi , i = 0, m. We build
90 Function Approximation

z0 = x 0 f [z0 ]
f [z0 , z1 ] = f 0 (x0 )
f [z1 ,z2 ]−f [z0 ,z1 ]
z 1 = x0 f [z1 ] f [z0 , z1 , z2 ] = z2 −z0
f (z2 )−f (z1 )
f [z1 , z2 ] = z2 −z1
f [z3 ,z2 ]−f [z2 ,z1 ]
z 2 = x1 f [z2 ] f [x1 , z2 , z3 ] = z3 −z1
f [z2 , z3 ] = f 0 (x 1)
f [z4 ,z3 ]−f [z3 ,z2 ]
z 3 = x1 f [z3 ] f [z2 , z3 , z4 ] = z4 −z2
f (z4 )−f (z3 )
f [z3 , z4 ] = z4 −z3
f [z5 ,z4 ]−f [z4 ,z3 ]
z 4 = x2 f [z4 ] f [z3 , z4 , z5 ] = z5 −z3
f [z4 , z5 ] = f 0 (x2 )
z 5 = x2 f [z5 ]

Table 3.3: A divided difference table with double nodes

the divided difference table relative to the nodes zi , i = 0, 2m + 1. Since z2i =


z2i+1 = xi for every i, f [x2i , x2i+1 ] is a divided difference with a double node and
it equates f 0 (xi ); therefore we shall use f 0 (x0 ), f 0 (x1 ), . . . , f 0 (xm ) instead of first
order divided differences

f [z0 , z1 ], f [z2 , z3 ], . . . , f [z2m , z2m+1 ].

The other divided differences are obtained as usual, as the Table 3.3 shows. This idea
could be extended to another Hermite interpolation problems. The method is due to
Powell.

3.7 Convergence of polynomial interpolation


Let’s explain first what we mean by “convergence”. We assume that we are given a
(m)
triangular array of interpolation nodes xi = xi , exactly m + 1 distinct nodes for
each m = 0, 1, 2, . . . .
(0)
x0
(1) (1)
x0 x1
(2) (2) (2)
x0 x1 x2
.. .. .. .. (3.7.1)
. . . .
(m) (m) (m) (m)
x0 x1 x2 . . . xm
.. .. .. ..
. . . .
3.7. Convergence of polynomial interpolation 91

We assume further that all nodes are contained in some finite interval [a, b]. Then,
for each m we define
(m) (m)
Pm (x) = Lm (f ; x0 , x1 , . . . , x(m)
m ; x), x ∈ [a, b]. (3.7.2)

We say that Lagrange interpolation based on the triangular array of nodes (3.7.1)
converges if
pm (x) ⇒ f (x), când n → ∞ pe [a, b]. (3.7.3)

Example 3.7.1 (Runge’s example).


1
f (x) = , x ∈ [−5, 5],
1 + x2
(m) k
xk = −5 + 10 , k = 0, m. (3.7.4)
m
Here the nodes are equally spaced in [−5, 5]. Note that f has two poles at z = ±i. It
has been shown, indeed, that

0 if |x| < 3.633 . . .
lim |f (x) − pm (f ; x)| = (3.7.5)
m→∞ ∞ if |x| > 3.633 . . .

The graph for m = 10, 13, 16 is given in Figure 3.6. ♦

Example 3.7.2 (Bernstein’s example). Let us consider the function

f (x) = |x|, x ∈ [−1, 1],

and the nodes


(m) 2k
xk = −1 + , k = 0, 1, 2, . . . , m. (3.7.6)
m
Here analyticity of f is completely gone, f being not differentiable at x = 0. One
finds that
lim |f (x) − Lm (f ; x)| = ∞ ∀x ∈ [−1, 1]
m→∞

excepting the points x = −1, x = 0 and x = 1. See figure 3.7(a), for m = 20.
The convergence in x = ±1 is trivial, since they are interpolation nodes, where the
error is zero. The same is true for x = 0 when m is even, but not if m is odd. The
failure of the convergence in the last two examples can only in part be blamed on
insufficient regularity of f . Another culprit is the equidistribution of nodes. There
are better distributions such as Chebyshev nodes. Figure 3.7(b) gives the graph for
m = 17. ♦
92 Function Approximation

Figure 3.6: A graphical illustration of Runge’s counterexample

(a) Equispaced nodes, m = 20 (b) Cebyshev nodes, m = 17

Figure 3.7: Behavior of Lagrange interpolation for f : [−1, 1] → R, f (x) = |x|.


3.8. Spline Interpolation 93

The problem of convergence was solved for the general case by Faber and Bern-
stein during 1914 and 1916. Faber has proved that for each triangular array of nodes
of type 3.7.1 in [a, b] there exists a function f ∈ C[a, b] such that the sequence of
(m)
Lagrange interpolation polynomials Lm f for the nodes xi (row wise) does not
converge uniformly to f on [a, b].
Bernstein 10 has proved that for any array of nodes as before there exists a func-
tion f ∈ C[a, b] such that the corresponding sequence (Lm f ) is divergent.
Remedies:
• Local approach – the interval [a, b] is taken very small – the approach used to
numerical solution of differential equations;
• Spline interpolation – the interpolant is piecewise polynomial.

3.8 Spline Interpolation


Let ∆ be a subdivision upon the interval [a, b]
∆ : a = x1 < x2 < · · · < xn−1 < xn = b (3.8.1)
We shall use low-degree polynomials on each subinterval [xi , xi+1 ], i = 1, n − 1.
The rationale behind this is the recognition that on a sufficiently small interval, func-
tions can be approximated arbitrarily by polynomials of low degree, even degree 1,
or 0 for that matter.
We have already introduced the space
Skm (∆) = {s : s ∈ C k [a, b], s|[xi ,xi+1 ] ∈ Pm , i = 1, 2, . . . , n − 1} (3.8.2)
m ≥ 0, k ∈ N ∪ {−1}, of spline functions of degree m and smoothness class k
relative to the subdivision ∆. If k = m, then functions s ∈ Sm
m (∆) are polynomials.
For m = 1 and k = 0 one obtains linear splines.
We wish to find s ∈ S01 (∆) such that
s(xi ) = fi , where fi = f (xi ), i = 1, 2, . . . , n.

Sergi Natanovitch Bernstein (1880-1968) made major contri-


bution to polynomial approximation, continuing the tradition
of Chebyshev. In 1911 he introduced what are now called the
Bernstein polynomials to give a constructive proof of Weier-
10
strass’s theorem (1885), namely that a continuous function on
a finite subinterval of the real line can be uniformly approx-
imated as closely as we wish by a polynomial. He is also
known for his works on differential equations and probability
theory.
94 Function Approximation

Figure 3.8: Piecewise linear interpolation

The solution is trivial, see Figure 3.8. On the interval [xi , xi+1 ]

s(f ; x) = fi + (x − xi )f [xi , xi+1 ], (3.8.3)


and
(∆xi )2
|f (x) − s(f (x))| ≤ max |f 00 (x)|. (3.8.4)
8 x∈[xi ,xi+1 ]

It follows that
1
kf (·) − s(f, ·)k∞ ≤ |∆|2 kf 00 k∞ . (3.8.5)
8
The dimension of S01 (∆) can be computed in the following way: since we have
n − 1 subintervals, each linear piece has 2 coefficients (2 degrees of freedom) and
each continuity condition reduces the degree of freedom by 1, we have finally

dim S01 (∆) = 2n − 2 − (n − 2) = n.

A basis of this space is given by the so-called B-spline functions:


We let x0 = x1 , xn+1 = xn , for i = 1, n
 x−x
i−1
 , for xi−1 ≤ x ≤ xi
 xi − xi−1


Bi (x) = xi+1 − x (3.8.6)
, for xi ≤ x ≤ xi+1
 xi+1 − xi



0, otherwise
3.8. Spline Interpolation 95

Note that the first equation, when i = 1, and the second when i = n are to be
ignored. The functions Bi may be referred to as “hat functions” (Chinese hats), but
note that the first and the last hat is cut in half. The functions Bi are depicted in
Figure 3.9.

Figure 3.9: First degree B-spline functions

They have the property


Bi (xj ) = δij ,

are linear independent, since


n
X
s(x) = ci Bi (x) = 0 ∧ x 6= xj ⇒ cj = 0.
i=1

and
hBi ii=1,n = S10 (∆),

Bi plays the same role as elementary Lagrange polynomials `i .

3.8.1 Interpolation by cubic splines


Cubic spline are the most widely used. We first discuss the interpolation problem for
s ∈ S13 (∆). Continuity of the first derivative of any cubic spline interpolant s3 (f ; ·)
96 Function Approximation

can be enforced by prescribing the values of the first derivative at each point xi ,
i = 1, 2, . . . , n. Thus, let m1 , m2 , . . . , mn be arbitrary given numbers, and denote

s3 (f ; ·)|[xi ,xi+1 ] = pi (x), i = 1, 2, . . . , n − 1 (3.8.7)

Then we enforce s3 (f ; xi ) = mi , i = 1, n, by selecting each piece pi of s3 (f, ·)


to be the (unique) solution of a Hermite interpolation problem, namely,

pi (xi ) = fi , pi (xi+1 ) = fi+1 , i = 1, n − 1, (3.8.8)


p0i (xi ) = mi , p0i (xi+1 ) = mi+1

We solve the problem by Newton’s interpolation formula. The required divided dif-
ferences are
f [xi ,xi+1 ]−mi mi+1 +mi −2f [xi ,xi+1 ]
xi fi mi ∆xi (∆xi )2
mi+1 −f [xi ,xi+1 ]
xi fi f [xi , xi+1 ] ∆xi
xi+1 fi+1 mi+1
xi+1 fi+1
and the Hermite interpolation polynomial (in Newton form) is

f [xi , xi+1 ] − mi
pi (x) = fi + (x − xi )mi + (x − xi )2 +
∆xi
mi+1 + mi − 2f [xi , xi+1 ]
+ (x − xi )2 (x − xi+1 ) .
(∆xi )2
Alternatively, in Taylor’s form, we can write for xi ≤ x ≤ xi+1

pi (x) = ci,0 + ci,1 (x − xi ) + ci,2 (x − xi )2 + ci,3 (x − xi )3 (3.8.9)

and since x − xi+1 = x − xi − ∆xi , by identification we have

ci,0 = fi
ci,1 = mi
f [xi , xi+1 ] − mi
ci,2 = − ci,3 ∆xi (3.8.10)
∆xi
mi+1 + mi − 2f [xi , xi+1 ]
ci,3 =
(∆xi )2
Thus to compute s3 (f ; x) for any given x ∈ [a, b] that is not an interpolation node,
one first locates the interval [xi , xi+1 ] 3 x and then computes the corresponding piece
(3.8.7) by (3.8.9) and (3.8.10).
We now discuss some possible choices of the parameters m1 , m2 , . . . , mn .
3.8. Spline Interpolation 97

Piecewise cubic Hermite interpolation


Here one selects mi = f 0 (xi ) (assuming that these derivative values are known).
This gives rise to a strictly local scheme, in that each piece pi can be determined
independently from the others. Furthermore, the interpolation error is, for
4
|f (4) (x)|

1
|f (x) − pi (x)| ≤ ∆xi max , xi ≤ x ≤ xi+1 . (3.8.11)
2 x∈[xi ,xi+1 ] 4!
Hence,
1
kf (·) − s3 (f ; ·)k∞ ≤ |∆|4 kf (4) k∞ . (3.8.12)
384
For equally spaced points xi , one has |∆| = (b − a)/(n − 1) and, therefore

kf (·) − s3 (f ; ·)k∞ = O(n−4 ), n → ∞. (3.8.13)

Cubic spline interpolation


Here we require s3 (f ; ·) ∈ S23 (∆), that is, continuity of the second derivatives. In
terms of pieces (3.8.7) of s3 (f ; ·), this means that

p00i−1 (xi ) = p00i (xi ), i = 2, n − 1, (3.8.14)

and translates into a condition for the Taylor coefficients in (3.8.9), namely

2ci−1,2 + 6ci−1,3 ∆xi−1 = 2ci,2 , i = 2, n − 1.

Plugging in explicit values (3.8.10) for these coefficients, we arived at the linear
system

∆xi mi−1 + 2(∆xi−1 + ∆xi )mi + (∆xi−1 )mi+1 = bi , i = 2, n − 1 (3.8.15)

where
bi = 3{∆xi f [xi−1 , xi ] + ∆xi−1 f [xi , xi+1 ]} (3.8.16)
These are n − 2 linear equations in the n unknowns m1 , m2 , . . . , mn . Once m1
and mn have been chosen in some way, the system becomes again tridiagonal in
the remaining unknowns and hence is readily solved by Gaussian elimination, by
factorization or by an iterative method.
Here are some possible choices of m1 and mn .
Complete (clamped) spline. We take m1 = f 0 (a), mn = f 0 (b). It is known that
for this spline, if f ∈ C 4 [a, b],

kf (r) (·) − s(r) (f ; ·)k∞ ≤ cr |∆|4−r kf (n) k∞ , r = 0, 1, 2, 3 (3.8.17)


98 Function Approximation

5 1
where c0 = 384 , c1 = 24 , c2 = 38 , and c3 is a constant depending on the ratio
|∆|
mini ∆xi .
Matching of the second derivatives at the endpoints. We enforce the condi-
tions s003 (f ; a) = f 00 (a); s003 (f ; b) = f 00 (b). Each of these conditions gives rise to an
additional equation, namely,

2m1 + m2 = 3f [x1 , x2 ] − 12 f 00 (a)∆x1


(3.8.18)
mn−1 + 2mn = 3f [xn−1 , xn ] + 21 f 00 (b)∆xn−1

One conveniently adjoins the first equation to the top of the system (3.8.15), and the
second to the bottom, thereby preserving the tridiagonal structure of the system.
Natural cubic spline. Enforcing s00 (f ; a) = s00 (f ; b) = 0, one obtains two
additional equations, which can be obtained from (3.8.18) by putting there f 00 (a) =
f 00 (b) = 0. The nice thing about this spline is that it requires only function values of
f – no derivatives – but the price one pays is a degradation of the accuracy to O(|∆|2 )
near the endpoints (unless indeed f 00 (a) = f 00 (b) = 0).
”Not-a-knot spline”. (C. deBoor). Here we impose the conditions p1 (x) ≡
p2 (x) and pn−2 (x) ≡ pn−1 (x); that is, the first two pieces and the last two should
be the same polynomial. In effect, this means that the first interior knot x2 , and the
last one xn−1 both are inactive. This again gives rise to two supplementary equations
expressing the continuity of s0003 (f ; x) in x = x2 and x = xn−1 . The continuity
condition of s3 (f, .) at x2 and xn−1 implies the equality of the leading coefficients
c1,3 = c2,3 and cn−2,3 = cn−1,3 . This gives rise to the equations

(∆x2 )2 m1 + [(∆x2 )2 − (∆x1 )2 ]m2 − (∆x1 )2 m3 = β1


(∆x2 )2 mn−2 + [(∆x2 )2 − (∆x1 )2 ]mn−1 − (∆x1 )2 mn = β2 ,

where

β1 = 2{(∆x2 )2 f [x1 , x2 ] − (∆x1 )2 f [x2 , x3 ]}


β2 = 2{(∆xn−1 )2 f [xn−2 , xn−1 ] − (∆xn−2 )2 f [xn−1 , xn ]}.

The first equation adjoins to the top of the system n − 2 equations in n unknowns
given by (3.8.15) and (3.8.16) and the second to the bottom. The system is no more
tridiagonal, but it can be turn into a tridiagonal one, by combining equations 1 and
2, and n − 1 and n, respectively. After this transformations, the first and the last
equations become

∆x2 m1 + (∆x2 + ∆x1 )m2 = γ1 (3.8.19)


(∆xn−1 + ∆xn−2 )mn−1 + ∆xn−2 mn = γ2 , (3.8.20)
3.8. Spline Interpolation 99

where
1
f [x1 , x2 ]∆x2 [∆x1 + 2(∆x1 + ∆x2 )] + (∆x1 )2 f [x2 , x3 ]

γ1 =
∆x2 + ∆x1
1  2
γ2 = ∆xn−1 f [xn−2 , xn−1 ] +
∆xn−1 + ∆xn−2

[2(∆xn−1 + ∆xn−2 ) + ∆xn−1 ]∆xn−2 f [xn−1 , xn ] .

3.8.2 Minimality properties of cubic spline interpolants


The complete and natural splines have interesting optimality properties. To formulate
them, it is convenient to consider not only the subdivision ∆ in (3.8.1), but also the
subdivision

∆0 : a = x0 = x1 < x2 < x3 < · · · < xn−1 < xn = xn+1 = b, (3.8.21)

in which the endpoints are double knots. This means that whenever we interpolate
on ∆0 , we interpolate to function values at all interior points, but to the functions as
well as first derivative values at the endpoints. The first of the two theorems relates
to the complete cubic spline interpolant, scompl (f ; ·).
Theorem 3.8.1. For any function g ∈ C 2 [a, b] that interpolates f on ∆0 , there holds
Z b Z b
00 2
[g (x)] dx ≥ [s00compl (f ; x)]2 dx, (3.8.22)
a a
with equality if and only if g(·) = scompl (f ; ·).

Remark 3.8.2. scompl (f ; ·) in Theorem 3.8.1 also interpolates f on ∆0 and among


all such interpolants its second derivative has the smallest L2 norm. ♦

Proof. We write (for short) scompl = s. The theorem follows once we have shown
that
Z b Z b Z b
00 00 00
2
[g (x)] dx = 2
[g (x) − s (x)] dx + [s00 (x)]2 dx. (3.8.23)
a a a

Indeed, this immediately implies (3.8.22), and equality in (3.8.22) holds if and only
if g 00 (x)−s00 (x) ≡ 0, which, integrating twice from a to x and using the interpolation
properties of s and g at x = a gives g(x) ≡ s(x).
To complete the proof, note that the relation (3.8.23) is equivalent to
Z b
s00 (x)[g 00 (x) − s00 (x)]dx = 0. (3.8.24)
a
100 Function Approximation

Integrating by parts and taking into account that s0 (b) = g 0 (b) = f 0 (b) and s0 (a) =
g 0 (a) = f 0 (a), we get
Z b
s00 (x)[g 00 (x) − s00 (x)]dx =
a
b Z b
00 0 0
= s (x)[g (x) − s (x)] − s000 (x)[g 0 (x) − s0 (x)]dx (3.8.25)

a a
Z b
=− s000 (x)[g 0 (x) − s0 (x)]dx.
a

But s000 is piecewise constant, so


Z b n−1
X Z xν+1
s000 (x)[g 0 (x) − s0 (x)]dx = s000 (xν + 0) [g 0 (x) − s0 (x)]dx =
a ν−1 xν

n−1
X
= s000 (xν+0 )[g(xν+1 ) − s(xν+1 ) − (g(xν ) − s(xν ))] = 0
ν=1

since both s and g interpolate f on ∆. This proves (3.8.24) and hence the theorem.


For interpolation on ∆, the distinction of being optimal goes to the natural cubic
spline interpolant snat (f ; ·).

Theorem 3.8.3. For any function g ∈ C 2 [a, b] that interpolates f on ∆ (not ∆0 ),


there holds Z b Z b
00 2
[g (x)] dx ≥ [s00nat (f ; x)]2 dx (3.8.26)
a a
with equality if and only if g(·) = snat (f ; ·).

The proof of Theorem 3.8.3 is virtually the same as that of Theorem 3.8.1, since
(3.8.23) holds again, this time because s00 (b) = s00 (a) = 0.
Putting g(·) = scompl (f ; ·) in Theorem 3.8.3 immediately gives
Z b Z b
[s00compl (f ; x)]2 dx ≥ [s00nat (f ; x)]2 dx. (3.8.27)
a a

Therefore, in a sense, the natural cubic spline is the “smoothest” interpolant.


The property expressed in Theorem 3.8.3 is the origin of the name “spline”. A
spline is a flexible strip of wood used in drawing curves. If its shape is given by the
3.8. Spline Interpolation 101

equation y = g(x), x ∈ [a, b] and if the spline is constrained to pass through the
points (xi , gi ), then it assumes a form that minimizes the bending energy
b
[g 00 (x)]2 dx
Z
,
a (1 + [g 0 (x)]2 )3

over all functions g similarly constrained. For slowly varying g (kg 0 k∞  1) this is
nearly the same as the minimum property of Theorem 3.8.3.
102 Function Approximation
Chapter 4

Linear Functional Approximation

4.1 Introduction
Let X be a linear space, L1 , . . . , Lm real linear functional, that are linear indepen-
dent, defined on X and L : X → R be a real linear functional such that L, L1 , . . . ,
Lm are linear independent.

Definition 4.1.1. A formula for approximation of a linear functional L with respect


to linear functionals L1 , . . . , Lm is a formula having the form
m
X
L(f ) = Ai Li (f ) + R(f ), f ∈ X. (4.1.1)
i=1

Real parameters Ai are called coefficients (weights) of the formula, and R(f ) is the
remainder term.

For a formula of form (4.1.1), given Li , we wish to determine the weights Ai and
to study the remainder term corresponding to these coefficients.

Remark 4.1.2. The form of Li depends on information on f available (they really


express these information, but also on the nature of the approximation problem, that
is, the form of L. ♦

Example 4.1.3. If X = {f | f : [a, b] → R}, Li (f ) = f (xi ), i = 0, m, xi ∈ [a, b] şi


L(f ) = f (α), α ∈ [a, b], the Lagrange interpolation formula
m
X
f (α) = li (α)f (xi ) + (Rf )α
i=0

103
104 Linear Functional Approximation

provides us an example of type (4.1.1), having the coefficients Ai = li (α), and a


possible representation of the remainder is
u(α)
(Rf )(α) = f (m+1) (ξ), ξ ∈ [a, b],
(m + 1)!

if f (m+1) exists [a, b]. ♦

Example 4.1.4. If X and Li are like in Example 4.1.3 and f (k) (α) exists, α ∈ [a, b],
k ∈ N∗ , and L(f ) = f (k) (α) one obtains a formula for the approximation of the kth
derivative of f at α
m
X
f (k) (α) = Ai f (xi ) + R(f ),
i=0
called numerical differentiation formula . ♦

Example 4.1.5. If X is a space of functions which are defined on [a, b], integrable
on [a, b] and there exists f (j) (xk ), k = 0, m, j ∈ Ik , with xk ∈ [a, b] and Ik are given
sets of indices
Lkj (f ) = f (j) (xk ), k = 0, m, j ∈ Ik ,
and Z b
L(f ) = f (x)dx,
a
one obtains a formula
Z b m X
X
f (x)dx = Akj f (j) (xk ) + R(f ),
a k=0 j∈Ik

called numerical integration formula. ♦

Definition 4.1.6. If Pr ⊂ X, then the number r ∈ N such that Ker(R) = Pr is


called degree of exactness of the approximation formula (4.1.1).

Remark 4.1.7. Since R is a linear functional, the property Ker(R) = Pr is equiva-


lent to R(ek ) = 0, k = 0, r şi R(er+1 ) 6= 0, where ek (x) = xk . ♦

We are now ready to formulate the general approximation problem: given a linear
functional L on X, m linear functional L1 , L2 , . . . , Lm on X and their values (the
“data”) li = Li f , i = 1, m applied to some function f and given a linear subspace
Φ ⊂ X with dim Φ = m, we want to find an approximation formula of the type
m
X
Lf ≈ ai Li f (4.1.2)
i=1
4.1. Introduction 105

that is exact (i.e., holds with equality), whenever f ∈ Φ.


It is natural (since we want to interpolate) to make the following
Assumption: the “interpolation problem”
find ϕ ∈ Φ such that
Li ϕ = si , i = 1, m (4.1.3)
has a unique solution ϕ(·) = ϕ(s, ·), for arbitrary s = [s1 , . . . , sm ]T .
We can express our assumption more explicitly in terms of a given basis ϕ1 , ϕ2 ,
. . . , ϕm of Φ and the associated Gram 1 matrix

L1 ϕ1 L1 ϕ2 . . . L1 ϕm

L2 ϕ1 L2 ϕ2 . . . L2 ϕm
G = [Li ϕj ] =
∈ Rm×m . (4.1.4)
. . . . . . . . . . . .

Lm ϕ1 Lm ϕ2 . . . Lm ϕm

What we require is that


det G 6= 0. (4.1.5)
It is easily seen that this condition is independent of the particular choice of basis.
To show that unique solvability of (4.1.3) and (4.1.5) are equivalent, we express
ϕ in (4.1.3) as a linear combination of the basis functions
nm
X
ϕ= cj ϕj (4.1.6)
j=1

and note that the interpolation conditions


 
Xm
Li  cj ϕj  = si , i = 1, m
j=1

by the linearity of the functionals Li , can be written in the form


m
X
cj Li ϕj = si , i = 1, m,
j=1

Jórgen Pedersen Gram (1850-1916), Danish mathematician


who studied at the University of Copenhagen. After gradu-
ation, he entered an insurance company as computer assistant
1 and, moving up the ranks, eventually became its director. He
was interested in series expansions of special functions and
also contributed to Chebyshev and least squares approxima-
tion. The “Gram determinant” was introduced by him in con-
nection with his study of linear independence.
106 Linear Functional Approximation

that is,

Gc = s, c = [c1 , c2 , . . . , cm ]T , s = [s1 , s2 , . . . , sm ]T . (4.1.7)

This has a unique solution for arbitrary s if and only if (4.1.5) holds.
We have two approaches for the solution of this problem.

4.1.1 Method of interpolation


We solve the general approximation problem by interpolation

Lf ≈ Lϕ(`; ·), ` = [`1 , `2 , . . . , `m ]T , `i = Li f (4.1.8)

In other words, we apply L not to f , but to ϕ(l; ·) — the solution of the interpola-
tion problem (4.1.3) in which s = `, the given “data”. Our assumption guarantees
that ϕ(`; ·) is uniquely determined. In particular, if f ∈ Φ, then (4.1.8) holds with
equality, since trivially ϕ(l; ·) = f (·). Thus, our approximation (4.1.8) already satis-
fies the exactness condition required for (4.1.2). It remains only to show that (4.1.8)
produces indeed an approximation of the form (4.1.2).
To do so, observe that the interpolant in (4.1.8) is
m
X
ϕ(`; ·) = cj ϕj (·)
j=1

where the vector c = [c1 , c2 , . . . , cm ]T satisfies (4.1.7) with s = `

Gc = `, ` = [L1 f, L2 f, . . . , Lm f ]T .

Writing
λj = Lϕj , j = 1, m, λ = [λ1 , λ2 , . . . , λm ]T , (4.1.9)
we have by the linearity of L
m
X
Lϕ(`; ·) = cj Lϕj = λT c = λT G−1 ` = [(GT )−1 λ]T `,
j=1

that is,
m
X
Lϕ(`; ·) = ai Li f, a = [a1 , a2 , . . . , am ]T = (GT )−1 λ. (4.1.10)
i=1
4.2. Numerical Differentiation 107

4.1.2 Method of undetermined coefficients


Here we determined the coefficients ai in (4.1.3) such that the equality holds ∀ f ∈ Φ,
which, by the linearity of L and Li is equivalent to equality for f = ϕ1 , f = ϕ2 , . . . ,
f = ϕm ; that is  
Xm
 aj Lj  ϕi = Lϕi , i = 1, m,
j=1

or by (4.1.8)
m
X
aj Lj ϕi = λi , i = 1, m.
j=1

Evidently, the matrix of this system is GT , so that


a = [a1 , a2 , . . . , am ]T = (GT )−1 λ,
in agreement with (4.1.10). Thus, the method of interpolation and the method of
undetermined coefficients are mathematically equivalent — they produce exactly the
same approximation.
It seems that, at least in the case of polynomial (i.e. Φ = Pd ), that the method of
interpolation is more powerful than the method of undetermined coefficients, because
it also yields an expression for the error term (if we carry along the remainder term
of interpolation). But, the method of undetermined coefficients is allowed, using the
condition of exactness to find the remainder term by the Peano Theorem.

4.2 Numerical Differentiation


For simplicity we consider only the first derivative; analogous techniques apply to
higher order derivatives. We solve the problem by means of interpolation: instead to
differentiate f ∈ C m+1 [a, b], we shall differentiate its interpolation polynomial:
f (x) = (Lm f )(x) + (Rm f )(x). (4.2.1)
We write the interpolation polynomial in Newton form
(Lm f )(x) = (Nm f )(x) = f0 + (x − x0 )f [x0 , x1 ] + · · · +

+(x − x0 ) . . . (x − xm−1 )f [x0 , x1 , . . . , xm ] (4.2.2)


and the remainder in the form
f (m+1) (ξ(x))
(Rm f )(x) = (x − x0 ) . . . (x − xm ) . (4.2.3)
(m + 1)!
108 Linear Functional Approximation

Differentiating (4.2.2) with respect to x and then putting x = x0 gives


(Lm f )(x0 ) = f [x0 , x1 ] + (x0 − x1 )f [x0 , x1 , x2 ] + · · · +

+(x0 − x1 )(x0 − x2 ) . . . (x0 − xm−1 )f [x0 , x1 , . . . , xm ]. (4.2.4)


Assuming that f is has n + 2 continuous derivatives in an appropriate interval we get

f (m+1) (ξ(x0 ))
(Rm f )0 (x0 ) = (x0 − x1 ) . . . (x0 − xm ) . (4.2.5)
(m + 1)!
Therefore, differentiating (4.2.4), we find
f 0 (x0 ) = (Lm f )0 (x0 ) + (Rm f )0 (x0 ) . (4.2.6)
| {z }
em

If H = max |x0 − xi |, the error has the form em = O(H m ), when H → 0.


i
We can thus get approximation formulae of arbitrarily high order, but those with large
n are of limited practical use.
Remark 4.2.1. Numerical differentiation is a critical operation; for this reason it
must be avoided as much as possible, since even good approximation lead to poor
approximation of the derivative (see Figure 4.1). This also follows from Example
4.2.2. ♦

Figure 4.1: The drawbacks of numerical differentiation

Example 4.2.2. Let the function


1
f (x) = g(x) + sin n2 (x − a), g ∈ C 1 [a, b].
n
We see that d(f, g) → 0 (n → ∞), but d(f 0 , g 0 ) = n 9 0. ♦
4.3. Numerical Integration 109

The most important uses of differentiation formulae are made in the discretization
of differential equations — ordinary or partial. In these applications, the spacing
of the points is usually uniform, but unequally distribution points arise when partial
differential operators are to be discretized near the boundary of the domain of interest.
We can also use another interpolation procedures such as: Taylor, Hermite, spline,
least squares.

4.3 Numerical Integration


The basic problem is to calculate the definite integral of a given function f over a
finite interval [a, b]. If f is well behaved, this is a routine problem, for which the
simplest integration rules, such as the composite trapezoidal or Simpson’s rule will
be quite adequate, the former having an edge over the later if f is periodic with period
b − a.
Complications arise if f has an integrable singularity, or the interval of integration
extends to infinity (which is just other manifestation of the singular behavior). By
breaking up the integral, if necessary, into several pieces, it can be assumed that the
singularity, if its location is known, is at one (or both) ends of the interval [a, b].
Such improper integrals can usually be treated by weighted quadrature; that is one
incorporates the singularity into a weight function, which then becomes one factor of
the integrand, leaving the other factor well behaved. The most important example of
this is Gaussian quadrature relative to such a weight function. Finally, it is possible to
accelerate the convergence of quadrature schemes by suitable recombinations. The
best-known example of this is Romberg integration.
Let f : [a, b] → R be a function integrable on [a, b], Fk (f ), k = 0, m information
on f (usually linear functionals) and w : [a, b] → R+ is a weight function, integrable
over [a, b].

Definition 4.3.1. A formula of the form


Z b
w(x)f (x)dx = Q(f ) + R(f ), (4.3.1)
a

where
m
X
Q(f ) = Aj Fj (f ),
j=0

is called a numerical integration formula for the function f or a quadrature formula.


Parameters Aj , j = 0, m are called weights or coefficients of the formula, and R(f )
is its remainder term. Q is called quadrature functional.
110 Linear Functional Approximation

Definition 4.3.2. The natural number d = d(Q) with the property that R(f) = 0 for all
f ∈ P_d and R(g) ≠ 0 for some g ∈ P_{d+1} is called the degree of exactness of the
quadrature formula.
Since R is linear, a quadrature formula has degree of exactness d if and only
if R(e_j) = 0, j = 0, d, and R(e_{d+1}) ≠ 0.
If the degree of exactness of a quadrature formula is known, the remainder can
be determined using Peano's theorem.

4.3.1 The composite trapezoidal and Simpson’s rule


These formulae are called by Gautschi in [16] “the workhorses of numerical integra-
tion”. They will do the job when the interval is finite and the integrand is unproblem-
atic. The trapezoidal rule is sometimes surprisingly effective on infinite intervals.
Both rules are obtained by applying the simplest kind of interpolation on subin-
tervals of the decomposition
b−a
a = x0 < x1 < x2 < · · · < xn−1 < xn = b, xk = a+kh, h= . (4.3.2)
n
of the interval [a, b]. In the trapezoidal rule, one interpolates linearly on each subin-
terval [xk , xk+1 ], and obtains
∫_{x_k}^{x_{k+1}} f(x)dx = ∫_{x_k}^{x_{k+1}} (L_1 f)(x)dx + ∫_{x_k}^{x_{k+1}} (R_1 f)(x)dx,   f ∈ C²[a, b],   (4.3.3)
where
(L1 f )(x) = fk + (x − xk )f [xk , xk+1 ].
Integrating, we have
Z xk+1
h
f (x)dx = (fk + fk+1 ) + R1 (f ),
xk 2
where (using Peano's Theorem)

R_1(f) = ∫_{x_k}^{x_{k+1}} K_1(t) f''(t) dt,

and

K_1(t) = (x_{k+1} − t)²/2 − (h/2)[(x_k − t)_+ + (x_{k+1} − t)_+]
       = (x_{k+1} − t)²/2 − h(x_{k+1} − t)/2
       = (1/2)(x_{k+1} − t)(x_k − t) ≤ 0,   t ∈ [x_k, x_{k+1}].

So

R_1(f) = −(h³/12) f''(ξ_k),   ξ_k ∈ (x_k, x_{k+1}),

and

∫_{x_k}^{x_{k+1}} f(x)dx = (h/2)(f_k + f_{k+1}) − (1/12) h³ f''(ξ_k).   (4.3.4)
This is the elementary trapezoidal rule. Summing over all subintervals gives the
composite trapezoidal rule:

∫_a^b f(x)dx = h [ (1/2)f_0 + f_1 + · · · + f_{n−1} + (1/2)f_n ] − (1/12) h³ Σ_{k=0}^{n−1} f''(ξ_k).

Since f'' is continuous on [a, b], the remainder term can be written as

R_{1,n}(f) = −((b − a)h²/12) f''(ξ) = −((b − a)³/(12n²)) f''(ξ).   (4.3.5)
Since f 00 is bounded in absolute value on [a, b] we have

R1,n (f ) = O(h2 ),

when h → 0 and so the composite trapezoidal rule converges when h → 0 (or


equivalently, n → ∞), provided that f ∈ C 2 [a, b].
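As a minimal sketch of the composite trapezoidal rule in Python (the function name, the test integrand exp, and the value n = 64 are illustrative choices, not part of the original text):

import math

def trapezoid(f, a, b, n):
    # composite trapezoidal rule with n subintervals of length h
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        s += f(a + k * h)
    return h * s

print(trapezoid(math.exp, 0.0, 1.0, 64))   # approximates e - 1
# halving h should reduce the error roughly by a factor of 4, consistent with O(h^2)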
If instead of linear interpolation one uses quadratic interpolation over two consecutive
subintervals, one obtains the composite Simpson formula. Its "elementary"
version, called Simpson's rule or Simpson's formula, is

∫_{x_k}^{x_{k+2}} f(x)dx = (h/3)(f_k + 4f_{k+1} + f_{k+2}) − (1/90) h⁵ f^{(4)}(ξ_k),   x_k ≤ ξ_k ≤ x_{k+2},   (4.3.6)

where it has been assumed that f ∈ C⁴[a, b].
Let us prove the formula for the remainder of Simpson's rule. Since the degree of
exactness is 3, Peano's theorem yields

R_2(f) = ∫_{x_k}^{x_{k+2}} K_2(t) f^{(4)}(t) dt,

where

K_2(t) = (1/3!) { (x_{k+2} − t)⁴/4 − (h/3) [ (x_k − t)_+³ + 4(x_{k+1} − t)_+³ + (x_{k+2} − t)_+³ ] },

that is,

K_2(t) = (1/6) { (x_{k+2} − t)⁴/4 − (h/3) [ 4(x_{k+1} − t)³ + (x_{k+2} − t)³ ] },   t ∈ [x_k, x_{k+1}],
K_2(t) = (1/6) { (x_{k+2} − t)⁴/4 − (h/3) (x_{k+2} − t)³ },   t ∈ [x_{k+1}, x_{k+2}].
One easily checks that K_2(t) ≤ 0 for t ∈ [x_k, x_{k+2}], so we can apply the corollary
of Peano's Theorem:

R_2(f) = (1/4!) f^{(4)}(ξ_k) R_2(e_4),

where

R_2(e_4) = (x_{k+2}⁵ − x_k⁵)/5 − (h/3)(x_k⁴ + 4x_{k+1}⁴ + x_{k+2}⁴)
         = h [ (2/5)(x_{k+2}⁴ + x_{k+2}³x_k + x_{k+2}²x_k² + x_{k+2}x_k³ + x_k⁴)
               − (1/12)(5x_k⁴ + 4x_k³x_{k+2} + 6x_k²x_{k+2}² + 4x_kx_{k+2}³ + 5x_{k+2}⁴) ]
         = −(h/60)(x_{k+2} − x_k)⁴ = −(4/15) h⁵,

using x_{k+2} − x_k = 2h and 4x_{k+1}⁴ = (x_k + x_{k+2})⁴/4. Thus,

R_2(f) = −(h⁵/90) f^{(4)}(ξ_k).
For the composite Simpson rule we get

∫_a^b f(x)dx = (h/3)(f_0 + 4f_1 + 2f_2 + 4f_3 + 2f_4 + · · · + 4f_{n−1} + f_n) + R_{2,n}(f)   (4.3.7)

(n even), with

R_{2,n}(f) = −(1/180)(b − a) h⁴ f^{(4)}(ξ) = −((b − a)⁵/(180 n⁴)) f^{(4)}(ξ),   ξ ∈ (a, b).   (4.3.8)

Thomas Simpson (1710-1761) was an English mathematician, self-educated, and author of many textbooks popular at the time. Simpson published his formula in 1743, but it was already known to Cavalieri (1639), Gregory (1668), and Cotes (1722), among others.

One notes that R_{2,n}(f) = O(h⁴), which assures convergence when n → ∞. We
also gain one extra order of accuracy, owing to the fact that Simpson's rule has degree
of exactness 3 rather than the 2 guaranteed by quadratic interpolation. This is the
reason why Simpson's rule has long been, and continues to be, one of the most popular
general-purpose integration methods.
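A minimal sketch of the composite Simpson rule (4.3.7) in Python (the function name, the test integrand exp, and the even value n = 64 are illustrative):

import math

def simpson(f, a, b, n):
    # composite Simpson rule (4.3.7); n must be even
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    s = f(a) + f(b)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * f(a + k * h)
    return h * s / 3

print(simpson(math.exp, 0.0, 1.0, 64))   # error behaves like O(h^4)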

4.3.2 Weighted Newton-Cotes and Gauss formulae


A weighted quadrature formula is a formula of the type
∫_a^b f(t) w(t) dt = Σ_{k=1}^{n} w_k f(t_k) + R_n(f),   (4.3.9)

where the weight function w is nonnegative and integrable on (a, b). The interval
(a, b) may be finite or infinite. If it is infinite, we must make sure that the integral in
(4.3.9) is well defined, at least when f is a polynomial. We achieve this by requiring
that all moments of the weight function,
µ_s = ∫_a^b t^s w(t) dt,   s = 0, 1, 2, . . .   (4.3.10)

exist and are finite.


We call (4.3.9) interpolatory, if it has the degree of exactness d = n − 1. Inter-
polatory formulae are precisely those “obtained by interpolation”, that is, for which
Σ_{k=1}^{n} w_k f(t_k) = ∫_a^b L_{n−1}(f; t_1, . . . , t_n; t) w(t) dt   (4.3.11)

or equivalently,
w_k = ∫_a^b l_k(t) w(t) dt,   k = 1, 2, . . . , n,   (4.3.12)
where
l_k(t) = ∏_{l=1, l≠k}^{n} (t − t_l)/(t_k − t_l)   (4.3.13)

are the elementary Lagrange interpolation polynomials associated with the nodes t_1,
t_2, . . . , t_n. The fact that (4.3.9) with w_k given by (4.3.12) has the degree of exactness
d = n − 1 is evident, since for any f ∈ P_{n−1} we have L_{n−1}(f; ·) ≡ f(·) in (4.3.11). Conversely,
if (4.3.9) has the degree of exactness d = n − 1, then putting f(t) = l_r(t) in (4.3.9) gives

∫_a^b l_r(t) w(t) dt = Σ_{k=1}^{n} w_k l_r(t_k) = w_r,   r = 1, 2, . . . , n,

that is, (4.3.12).


We see that given any n distinct nodes t1 , . . . , tn it is always possible to construct
a formula of type (4.3.9) which is exact for all polynomials of degree ≤ n − 1. In
the case w(t) ≡ 1 on [−1, 1] and tk equally spaced on [−1, 1], the feasibility of
such a construction was already alluded to by Newton in 1687 and implemented in
detail by Cotes 3 around 1712. By extension, we call the formula (4.3.9), with the tk
prescribed and the wk given by (4.3.12) a Newton-Cotes formula.
The question naturally arises whether we can do better, that is, whether we can
achieve the degree of exactness d > n − 1 by a judicious choice of the nodes tk (the
weights wk being necessarily given by (4.3.12)). The answer is surprisingly simple
and direct. To formulate it we introduce the node polynomial
u_n(t) = ∏_{k=1}^{n} (t − t_k).   (4.3.14)

Theorem 4.3.3. Given an integer k, with 0 ≤ k ≤ n, the quadrature formula (4.3.9)


has the degree of exactness d = n−1+k if and only if both of the following conditions
are satisfied.

(a) The formula (4.3.9) is interpolatory;

(b) The node polynomial un in (4.3.14) satisfies


Z b
un (t)p(t)w(t) dt = 0, ∀ p ∈ Pk−1 .
a

The condition in (b) imposes k conditions on the nodes t1 , t2 , . . . , tn of (4.3.9). (If


k = 0, there is no restriction since, as we know, we can always get d = n − 1). In
effect, un must be orthogonal to Pk−1 relative to the weight function w. Since w(t) ≥
0, we have necessarily k ≤ n; otherwise, un would have to be orthogonal to Pn , in
particular, orthogonal to itself, which is impossible. Thus k = n is optimal, giving
rise to a quadrature rule of maximum degree of exactness dmax = 2n − 1. Condition

Roger Cotes (1682-1716), precocious son of an English country pastor, was entrusted with the preparation of the second edition of Newton's Principia. He worked out in detail Newton's idea of numerical integration and published the coefficients — now known as Cotes numbers — of the n-point formula for all n < 11. Upon his death at the early age of 33, Newton said of him: "If he had lived, we might have known something."

(b) then amounts to orthogonality of un to all polynomials of lower degree; that is


un (·) = πn (·, w) is precisely the nth-degree orthogonal polynomial with respect
to the weight function w. This optimal formula is called the Gaussian quadrature
formula associated with the weight function w. Its nodes, therefore, are the nodes of
πn (·, w), and the weights (coefficients) wk are given as in (4.3.12); thus

πn (tk ; w) = 0
Z b
πn (t, w) k = 1, 2, . . . , n. (4.3.15)
wk = 0
w(t)dt,
a (t − tk )πn (tk , w)

The formula was developed in 1814 by Gauss for the special case w(t) ≡ 1 on
[−1, 1], and extended to more general weight functions by Christoffel in 1877. It
is, therefore, also referred to as the Gauss-Christoffel quadrature formula.

Proof of theorem 4.3.3. Necessity. Since the degree of exactness is d = n − 1 + k ≥


n − 1, condition (a) is trivial. Condition (b) also follows immediately, since for any
p ∈ P_{k−1}, u_n p ∈ P_{n−1+k}. Hence,

∫_a^b u_n(t) p(t) w(t) dt = Σ_{k=1}^{n} w_k u_n(t_k) p(t_k),

which vanishes, since un (tk ) = 0 for k = 1, 2, . . . , n.


Sufficiency. We must show that for any p ∈ Pn−1+k we have Rn (p) = 0 in
(4.3.9). Given any such p, divide it by un , such that

p = qun + r, q ∈ Pk−1 , r ∈ Pn−1

where q is the quotient and r the remainder. There follows


∫_a^b p(t) w(t) dt = ∫_a^b q(t) u_n(t) w(t) dt + ∫_a^b r(t) w(t) dt.

Elwin Bruno Christoffel (1829-1900) was active for a short period of time in Berlin and Zurich and, for the rest of his life, in Strasbourg. He is best known for his work in geometry, in particular tensor analysis, which became important in Einstein's theory of relativity.

The first integral on the right, by (b), is zero, since q ∈ P_{k−1}, whereas the second, by
(a), since r ∈ P_{n−1}, equals

Σ_{k=1}^{n} w_k r(t_k) = Σ_{k=1}^{n} w_k [p(t_k) − q(t_k) u_n(t_k)] = Σ_{k=1}^{n} w_k p(t_k),

the last equality following again from un (tk ) = 0, k = 1, 2, . . . , n, which completes


the proof. 

The case k = n is discussed further in §4.3.3. Here we still mention two special
cases with k < n, which are of some practical interest. The first is the Gauss-Radau
quadrature formula, in which one endpoint, say a, is finite and serves as a
quadrature node, say t_1 = a. The maximum degree of exactness attainable then is
d = 2n − 2 and corresponds to k = n − 1 in Theorem 4.3.3. Part (b) of that theorem
tells us that the remaining nodes t_2, . . . , t_n must be the zeros of π_{n−1}(·; w_a), where
w_a(t) = (t − a)w(t).
Similarly, in the Gauss-Lobatto formula both endpoints are finite and serve as
nodes, say t_1 = a, t_n = b, and the remaining nodes t_2, . . . , t_{n−1} are taken to be the
zeros of π_{n−2}(·; w_{a,b}), where w_{a,b}(t) = (t − a)(b − t)w(t), thus achieving maximum degree
of exactness d = 2n − 3.

4.3.3 Properties of Gaussian quadrature rules


The Gaussian quadrature rule (4.3.9) and (4.3.15), in addition to being optimal (i.e.
has maximum degree of exactness), has some interesting and useful properties.
(i) All nodes tk are real, distinct, and contained in the open interval (a, b). This
is a well-known property satisfied by the zeros of orthogonal polynomials.
(ii) All the weights (coefficients) w_k are positive. An ingenious observation of
Stieltjes proves it almost immediately. Indeed,

0 < ∫_a^b l_j²(t) w(t) dt = Σ_{k=1}^{n} w_k l_j²(t_k) = w_j,   j = 1, 2, . . . , n,

the first equality following since l_j² ∈ P_{2n−2} and the degree of exactness is d = 2n − 1.
(iii) If [a, b] is a finite interval, then the Gauss formula converges for any continuous
function; that is, R_n(f) → 0 as n → ∞, for any f ∈ C[a, b]. This is
basically a consequence of the Weierstrass Approximation Theorem, which implies
that, if p̂_{2n−1}(f; ·) denotes the best polynomial approximation of degree 2n − 1 to f
on [a, b] in the uniform norm, then

lim_{n→∞} ‖f(·) − p̂_{2n−1}(f; ·)‖_∞ = 0.

Since R_n(p̂_{2n−1}) = 0 (since d = 2n − 1), it follows that

|R_n(f)| = |R_n(f − p̂_{2n−1})|
         = | ∫_a^b [f(t) − p̂_{2n−1}(f; t)] w(t) dt − Σ_{k=1}^{n} w_k [f(t_k) − p̂_{2n−1}(f; t_k)] |
         ≤ ∫_a^b |f(t) − p̂_{2n−1}(f; t)| w(t) dt + Σ_{k=1}^{n} w_k |f(t_k) − p̂_{2n−1}(f; t_k)|
         ≤ ‖f(·) − p̂_{2n−1}(f; ·)‖_∞ [ ∫_a^b w(t) dt + Σ_{k=1}^{n} w_k ].

Here the positivity of the weights w_k has been used crucially. Noting that

Σ_{k=1}^{n} w_k = ∫_a^b w(t) dt = µ_0,

we conclude

|R_n(f)| ≤ 2µ_0 ‖f − p̂_{2n−1}‖_∞ → 0,   as n → ∞.

The next property is the background for an efficient algorithm for computing
Gaussian quadrature formulae.
(iv) Let α_k = α_k(w) and β_k = β_k(w) be the recursion coefficients for the
orthogonal polynomials π_k(·) = π_k(·; w), that is,

π_{k+1}(t) = (t − α_k)π_k(t) − β_k π_{k−1}(t),   k = 0, 1, 2, . . . ,
π_0(t) = 1,   π_{−1}(t) = 0,   (4.3.16)

where

α_k = (tπ_k, π_k)/(π_k, π_k),
β_k = (π_k, π_k)/(π_{k−1}, π_{k−1}),   (4.3.17)
with β0 defined (as is customary) by
Z b
β0 = w(t)dt (= µ0 ).
a

The nth order Jacobi matrix for the weight function w is the symmetric tridiagonal
matrix

J_n(w) =
  [ α_0      √β_1                                ]
  [ √β_1     α_1      √β_2                       ]
  [          √β_2     α_2      ...               ]
  [                   ...      ...     √β_{n−1}  ]
  [                           √β_{n−1}  α_{n−1}  ],

all other entries being zero.
Theorem 4.3.4. The nodes t_k of a Gauss-type quadrature formula are the eigenvalues
of J_n,

J_n v_k = t_k v_k,   v_k^T v_k = 1,   k = 1, 2, . . . , n,   (4.3.18)

and the weights w_k are expressible in terms of the first components v_{k,1} of the
corresponding (normalized) eigenvectors by

w_k = β_0 v_{k,1}²,   k = 1, 2, . . . , n.   (4.3.19)

Thus, to compute the Gauss formula, we must solve an eigenvalue/eigenvector


problem for a symmetric tridiagonal matrix. This is a routine problem in linear al-
gebra, and very efficient methods (e.g. the QR algorithm) are known for solving it.
Thus, the eigenvalue-based approach is more efficient than the classical one. More-
over, the classical approach is based on two ill-conditioned problems: the solution of
a polynomial equation and the solution of a Vandermonde system of linear equations.
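As a sketch of this eigenvalue approach in Python/numpy, for the Legendre weight w(t) ≡ 1 on [−1, 1]; the recursion coefficients α_k = 0, β_k = k²/(4k² − 1) and β_0 = µ_0 = 2 used below are the standard ones for this weight, but are stated here as an assumption rather than taken from the text:

import numpy as np

def gauss_legendre(n):
    # Jacobi matrix J_n for the Legendre weight, then Theorem 4.3.4
    k = np.arange(1, n)
    beta = k**2 / (4.0 * k**2 - 1.0)                 # beta_1, ..., beta_{n-1}
    J = np.diag(np.sqrt(beta), 1) + np.diag(np.sqrt(beta), -1)   # alpha_k = 0
    t, V = np.linalg.eigh(J)                         # nodes = eigenvalues of J_n
    w = 2.0 * V[0, :]**2                             # w_k = beta_0 * v_{k,1}^2, beta_0 = 2
    return t, w

t, w = gauss_legendre(5)
print(np.dot(w, t**8), 2.0 / 9.0)   # exact for all polynomials of degree <= 9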

Proof of theorem 4.3.4. Let π̃_k(·) = π̃_k(·; w) denote the normalized orthogonal
polynomials, so that π_k = (π_k, π_k)^{1/2} π̃_k. Inserting this into (4.3.16), dividing by
(π_{k+1}, π_{k+1})^{1/2}, and using (4.3.17), we obtain

π̃_{k+1}(t) = (t − α_k) π̃_k(t)/√β_{k+1} − √β_k π̃_{k−1}(t)/√β_{k+1},

or, multiplying through by √β_{k+1} and rearranging,

In terms of the Jacobi matrix J_n we can write these relations in vector form as

t π̃(t) = J_n π̃(t) + √β_n π̃_n(t) e_n,   (4.3.21)

where π̃(t) = [π̃_0(t), π̃_1(t), . . . , π̃_{n−1}(t)]^T and e_n = [0, 0, . . . , 0, 1]^T are vectors
in R^n. Since t_k is a zero of π̃_n, it follows from (4.3.21) that

t_k π̃(t_k) = J_n π̃(t_k),   k = 1, 2, . . . , n.   (4.3.22)



This proves the first relation in Theorem 4.3.4, since π̃(t_k) is a nonzero vector, its first
component being

π̃_0 = β_0^{−1/2}.   (4.3.23)

To prove the second relation, note from (4.3.22) that the normalized eigenvector
v_k is

v_k = π̃(t_k) / [π̃(t_k)^T π̃(t_k)]^{1/2} = ( Σ_{µ=1}^{n} π̃_{µ−1}²(t_k) )^{−1/2} π̃(t_k).

Comparing the first components on the far left and right, and squaring, gives, by virtue of
(4.3.23),

1 / Σ_{µ=1}^{n} π̃_{µ−1}²(t_k) = β_0 v_{k,1}²,   k = 1, 2, . . . , n.   (4.3.24)

On the other hand, letting f(t) = π̃_{µ−1}(t) in (4.3.9), one gets, by orthogonality, using
(4.3.23) again, that

β_0^{1/2} δ_{µ−1,0} = Σ_{k=1}^{n} w_k π̃_{µ−1}(t_k),   µ = 1, 2, . . . , n,

or, in matrix form,

P w = β_0^{1/2} e_1,   (4.3.25)

where δ_{µ−1,0} is Kronecker's delta, P ∈ R^{n×n} is the matrix of (unnormalized) eigenvectors,
with entries P_{µk} = π̃_{µ−1}(t_k), w ∈ R^n is the vector of Gaussian weights, and
e_1 = [1, 0, . . . , 0]^T ∈ R^n. Since the columns of P are orthogonal, we have

P^T P = D,   D = diag(d_1, d_2, . . . , d_n),   d_k = Σ_{µ=1}^{n} π̃_{µ−1}²(t_k).

Now multiply (4.3.25) from the left by P^T to obtain

D w = β_0^{1/2} P^T e_1 = β_0^{1/2} β_0^{−1/2} e = e,   e = [1, 1, . . . , 1]^T.

Therefore, w = D^{−1} e, that is,

w_k = 1 / Σ_{µ=1}^{n} π̃_{µ−1}²(t_k),   k = 1, 2, . . . , n.

Comparing this with (4.3.24) establishes the desired result. 



For details on algorithmic aspects concerning orthogonal polynomials and Gaussian


quadratures see [17].
(v) Markov observed in 1885 that the Gauss quadrature formula can also be
obtained by Hermite interpolation on the nodes, each counted twice. Writing the nodes as
x_1, . . . , x_n, we have

f(x) = (H_{2n−1} f)(x) + u_n²(x) f[x, x_1, x_1, . . . , x_n, x_n],

∫_a^b w(x)f(x)dx = ∫_a^b w(x)(H_{2n−1} f)(x)dx + ∫_a^b w(x) u_n²(x) f[x, x_1, x_1, . . . , x_n, x_n] dx.

But the degree of exactness 2n − 1 implies

∫_a^b w(x)(H_{2n−1} f)(x)dx = Σ_{i=1}^{n} w_i (H_{2n−1} f)(x_i) = Σ_{i=1}^{n} w_i f(x_i),

hence

∫_a^b w(x)f(x)dx = Σ_{i=1}^{n} w_i f(x_i) + ∫_a^b w(x) u_n²(x) f[x, x_1, x_1, . . . , x_n, x_n] dx,

so

R_n(f) = ∫_a^b w(x) u_n²(x) f[x, x_1, x_1, . . . , x_n, x_n] dx.

Since w(x)u_n²(x) ≥ 0, applying the Mean Value Theorem for integrals and
the Mean Value Theorem for divided differences, we get

R_n(f) = f[η, x_1, x_1, . . . , x_n, x_n] ∫_a^b w(x) u_n²(x) dx
       = (f^{(2n)}(ξ)/(2n)!) ∫_a^b w(x) [π_n(x; w)]² dx,   ξ ∈ [a, b].
For orthogonal polynomials and their recursion coefficients αk , βk see Table 3.2,
page 71.

Andrei Andrejevich Markov (1856-1922) was a Russian mathematician active in St. Petersburg who made important contributions to probability theory, number theory, and constructive approximation theory.

4.4 Adaptive Quadratures


In a numerical integration method the error depends not only on the size of the interval,
but also on the values of certain higher order derivatives of the function to be integrated.
This implies that the methods do not work well for functions having large values
of higher order derivatives — especially for functions with large oscillations on
the whole interval or on some subintervals. It is reasonable to use small subintervals
where the derivatives have large values and large subintervals where the derivatives
have small values. A method which does this systematically is called an adaptive
quadrature.
The general approach in an adaptive quadrature is to use two different methods
for each interval, to compare the results, and to divide the interval when the difference
is large. There are situations in which both methods give poor results and yet their
difference is small. We can avoid this situation by taking one method which overestimates
the result and another which underestimates it. We shall give an example of the
general structure of a recursive adaptive quadrature (Algorithm 4.1).
Let us suppose that

metint(a, b : real; f : function; n : integer) : real

is a function that approximates ∫_a^b f(x)dx using a composite quadrature rule with n
subintervals. For m one chooses a small value (4 or 5). The algorithm has a divide and
conquer structure.
Algorithm 4.1 Adaptive quadrature


Input: f - the integrand, a, b - endpoints, ε - tolerance, metint - a composite quadra-
ture rule
Output: Value of the integral
function adapt(f, a, b, ε, metint)
if |metint(a, b, f, 2m) − metint(a, b, f, m)| < ε then
adapt := metint(a, b, f, 2m);
else
adapt := adapt(f, a, (a + b)/2, ε, metint) + adapt(f, (a + b)/2, b, ε, metint);
end if

In contrast to other methods, where one must decide in advance what amount of work
is needed to achieve the desired accuracy, an adaptive quadrature computes only as much
as is necessary. The tolerance ε must therefore be chosen so as to avoid an infinite loop
when one asks for an accuracy that cannot be attained. The number of steps depends on
the behavior of the function to be integrated.
Possible improvements: avoid calling metint(a, b, f, 2m) twice; the accuracy can
be scaled by the ratio of the current interval length to the whole interval length. For
supplementary details see [15].
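A direct Python transcription of Algorithm 4.1 follows as a sketch; here trapezoid plays the role of metint and m = 4, both illustrative choices (as is the test integrand exp):

import math

def trapezoid(f, a, b, n):
    h = (b - a) / n
    return h * (0.5 * f(a) + 0.5 * f(b) + sum(f(a + k * h) for k in range(1, n)))

def adapt(f, a, b, eps, metint=trapezoid, m=4):
    # divide and conquer: accept the finer estimate if the two estimates agree
    coarse, fine = metint(f, a, b, m), metint(f, a, b, 2 * m)
    if abs(fine - coarse) < eps:
        return fine
    c = (a + b) / 2
    return adapt(f, a, c, eps, metint, m) + adapt(f, c, b, eps, metint, m)

print(adapt(math.exp, 0.0, 1.0, 1e-8))   # approximates e - 1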

4.5 Iterated Quadratures. Romberg Method


A drawback of the previous adaptive quadrature is that it repeatedly computes function
values at the same nodes; there is also a run-time penalty due to recursion,
or to stack management in an iterative implementation. Iterated quadratures remove
these drawbacks. They apply at the first step a composite quadrature rule and then
they divide the interval into equal parts, reusing at each step the previously computed
approximations. We shall illustrate this technique using a method that starts from the
composite trapezoidal rule and then improves the convergence using Richardson
extrapolation.
The first step involves applying the composite trapezoidal rule for n_1 = 1, n_2 =
2, . . . , n_p = 2^{p−1}, where p ∈ N^*. The step size h_k corresponding to n_k is

h_k = (b − a)/n_k = (b − a)/2^{k−1}.
Using these notations the trapezoidal rule becomes

∫_a^b f(x)dx = (h_k/2) [ f(a) + f(b) + 2 Σ_{i=1}^{2^{k−1}−1} f(a + i h_k) ] − ((b − a)/12) h_k² f''(µ_k),   (4.5.1)

with µ_k ∈ (a, b).
Let R_{k,1} denote the result of the approximation in (4.5.1). Then

R_{1,1} = (h_1/2)[f(a) + f(b)] = ((b − a)/2)[f(a) + f(b)],   (4.5.2)

R_{2,1} = (h_2/2)[f(a) + f(b) + 2f(a + h_2)]
        = ((b − a)/4)[f(a) + f(b) + 2f(a + (b − a)/2)]
        = (1/2)[R_{1,1} + h_1 f(a + h_1/2)],

and, in general,

R_{k,1} = (1/2) [ R_{k−1,1} + h_{k−1} Σ_{i=1}^{2^{k−2}} f(a + (i − 1/2) h_{k−1}) ],   k = 2, n.   (4.5.3)

Now comes the improvement by Richardson extrapolation (an idea which, in fact, goes
back to Archimedes). We have

I = ∫_a^b f(x)dx = R_{k−1,1} − ((b − a)/12) h_{k−1}² f''(µ) + O(h_{k−1}⁴),   µ ∈ (a, b).

We shall eliminate the h_{k−1}² term by combining the two equations

I = R_{k−1,1} − ((b − a)/12) h_{k−1}² f''(µ) + O(h_{k−1}⁴),
I = R_{k,1} − ((b − a)/48) h_{k−1}² f''(µ) + O(h_{k−1}⁴).

We get

I = (4R_{k,1} − R_{k−1,1})/3 + O(h⁴).

We define

R_{k,2} = (4R_{k,1} − R_{k−1,1})/3.   (4.5.4)
Lewis Fry Richardson (1881-1953), born, educated, and active in England, did pioneering work in numerical weather prediction, proposing to solve the hydrodynamical and thermodynamical equations of meteorology by finite difference methods. He also did a penetrating study of atmospheric turbulence, where a nondimensional quantity introduced by him is now called "Richardson's number". At the age of 50 he earned a degree in psychology and began to develop a scientific theory of international relations. He was elected fellow of the Royal Society in 1926.
Archimedes (287 B.C. - 212 B.C.), Greek mathematician from Syracuse. The achievements of Archimedes are quite outstanding. He is considered by most historians of mathematics as one of the greatest mathematicians of all time. He perfected a method of integration which allowed him to find areas, volumes and surface areas of many bodies. Archimedes was able to apply the method of exhaustion, which is the early form of integration. He also gave an accurate approximation to π and showed that he could approximate square roots accurately. He invented a system for expressing large numbers. In mechanics Archimedes discovered fundamental theorems concerning the centre of gravity of plane figures and solids. His most famous theorem gives the weight of a body immersed in a liquid, called Archimedes' principle. He defended his town during the Romans' siege.

One applies Richardson extrapolation to these values. If f ∈ C^{2n+2}[a, b],
then for k = 1, n we may write

∫_a^b f(x)dx = (h_k/2) [ f(a) + f(b) + 2 Σ_{i=1}^{2^{k−1}−1} f(a + i h_k) ] + Σ_{i=1}^{k} K_i h_k^{2i} + O(h_k^{2k+2}),   (4.5.5)

where K_i does not depend on h_k.


Formula (4.5.5) can be argued as follows. Let a_0 = ∫_a^b f(x)dx and

A(h) = (h/2) [ f(a) + 2 Σ_{k=1}^{n−1} f(a + kh) + f(b) ],   h = (b − a)/n

(the approximation obtained using the composite trapezoidal rule).


If f ∈ C^{2K+1}[a, b], K ∈ N^*, then the following formula, due to Euler and

Leonhard Euler (1707-1783) was the son of a minister interested in mathematics who followed lectures of Jakob Bernoulli at the University of Basel. Euler himself was allowed to see Johann Bernoulli on Saturday afternoons for private tutoring. At the age of 20, after he was unsuccessful in obtaining a professorship in physics at the University of Basel because of a lottery system then in use (Euler lost), he emigrated to St. Petersburg; later, he moved on to Berlin, and then back to St. Petersburg. Euler unquestionably was the most prolific mathematician of the 18th century, working in virtually all branches of the differential and integral calculus and, in particular, being one of the founders of the calculus of variations. He also did pioneering work in the applied sciences, notably hydrodynamics, mechanics of deformable materials and rigid bodies, optics, astronomy and the theory of the spinning top. Not even his blindness at the age of 59 managed to break his phenomenal productivity. Euler's collected works are still being edited, 71 volumes having already been published.

Maclaurin, holds:

A(h) = a_0 + a_1 h² + a_2 h⁴ + · · · + a_K h^{2K} + O(h^{2K+1}),   h → 0,   (4.5.6)

where

a_k = (B_{2k}/(2k)!) [f^{(2k−1)}(b) − f^{(2k−1)}(a)],   k = 1, 2, . . . , K.

The quantities B_k are the coefficients in the expansion

z/(e^z − 1) = Σ_{k=0}^{∞} (B_k/k!) z^k,   |z| < 2π;

they are called Bernoulli numbers.


Formula (4.5.6) is called Euler-MacLaurin formula. By eliminating successively
the powers of h in (4.5.5) one obtains
4j−1 Rk,j−1 − Rk−1,j−1
Rk,j = , k = 2, n, j = 2, i.
4j−1 − 1
The computation could be performed in a tabular fashion:
R1,1
R2,1 R2,2
R3,1 R3,2 R3,3
.. .. .. ..
. . ..
Rn,1 Rn,2 Rn,3 . . . Rn,n

Colin Maclaurin (1698-1768) was a Scottish mathematician who applied the new infinitesimal calculus to various problems in geometry. He is best known for his power series expansion, but also contributed to the theory of equations.

Jacob Bernoulli (1654-1705), the elder brother of Johann Bernoulli, was active in Basel. He was one of the first to appreciate Leibniz's and Newton's differential and integral calculus and enriched it by many original contributions of his own, often in (not always amicable) competition with his younger brother. He is also known in probability theory for his "law of large numbers".

Since (Rn,1 ) is convergent, (Rn,n ) is also convergent, but faster than (Rn,1 ). One
may choose as stopping criterion |Rn−1,n−1 − Rn,n | ≤ ε.
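A compact Python sketch of the Romberg tableau described above (the function name, the test integrand exp, and the values of nmax and eps are illustrative assumptions):

import math

def romberg(f, a, b, nmax=20, eps=1e-10):
    R = [[0.0] * nmax for _ in range(nmax)]
    h = b - a
    R[0][0] = h * (f(a) + f(b)) / 2
    for k in range(1, nmax):
        h /= 2
        # (4.5.3): reuse R[k-1][0] and add only the new midpoints
        R[k][0] = 0.5 * R[k - 1][0] + h * sum(f(a + (2 * i - 1) * h)
                                              for i in range(1, 2**(k - 1) + 1))
        # Richardson extrapolation across the row (0-based column index j)
        for j in range(1, k + 1):
            R[k][j] = (4**j * R[k][j - 1] - R[k - 1][j - 1]) / (4**j - 1)
        if abs(R[k][k] - R[k - 1][k - 1]) <= eps:
            return R[k][k]
    return R[nmax - 1][nmax - 1]

print(romberg(math.exp, 0.0, 1.0), math.e - 1)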

4.6 Adaptive Quadratures II


The second column of the Romberg tableau corresponds to a composite Simpson's rule
approximation. We introduce the notation

S_{k,1} = R_{k,2}.

The third column is a combination of two Simpson approximations:

S_{k,2} = S_{k,1} + (S_{k,1} − S_{k−1,1})/15 = R_{k,2} + (R_{k,2} − R_{k−1,2})/15.

We shall use the relation

S_{k,2} = S_{k,1} + (S_{k,1} − S_{k−1,1})/15,   (4.6.1)

to devise an adaptive quadrature.
Let c = (a + b)/2. The elementary Simpson formula is

S = (h/6)(f(a) + 4f(c) + f(b)),   h = b − a.

For two subintervals one obtains

S2 = (h/12)(f(a) + 4f(d) + 2f(c) + 4f(e) + f(b)),

where d = (a + c)/2 and e = (c + b)/2. Applying (4.6.1) to S and S2 yields

Q = S2 + (S2 − S)/15.

Now we are able to give a recursive algorithm for the approximation of our integral.
The function adquad initializes the computation and calls quadstep, which applies
Simpson's rule, recurses where needed, and applies the extrapolation. The description
appears in Algorithm 4.2.
4.6. Adaptive Quadratures II 127

Algorithm 4.2 Adaptive quadrature based on Simpson method and extrapolation


Input: The function f , interval [a, b], tolerance ε
Rb
Output: Approximate value of a f (x) dx
function adquad(f, a, b, ε) : real
c := (a + b)/2;
f a = f (a); f b := f (b); f c := f (c);
Q := quadstep(f, a, b, ε, f a, f c, f b);
return Q;
function quadstep(f, a, b, ε, f a, f c, f b) : real
h := b − a; c := (a + b)/2;
f d := f ((a + c)/2); f e := f ((c + b)/2);
Q1 := h/6 ∗ (f a + 4 ∗ f c + f b);
Q2 := h/12 ∗ (f a + 4 ∗ f d + 2 ∗ f c + 4 ∗ f e + f b);
if |Q1 − Q2| < ε then
Q := Q2 + (Q2 − Q1)/15;
else
Qa := quadstep(f, a, c, ε, f a, f d, f c);
Qb := quadstep(f, c, b, ε, f c, f e, f b);
Q := Qa + Qb;
end if
return Q;
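A Python transcription of Algorithm 4.2 follows as a sketch; the test integrand and the default tolerance are illustrative choices:

import math

def adquad(f, a, b, eps=1e-8):
    c = (a + b) / 2
    return _quadstep(f, a, b, eps, f(a), f(c), f(b))

def _quadstep(f, a, b, eps, fa, fc, fb):
    h, c = b - a, (a + b) / 2
    d, e = (a + c) / 2, (c + b) / 2
    fd, fe = f(d), f(e)
    Q1 = h / 6 * (fa + 4 * fc + fb)                    # Simpson on [a, b]
    Q2 = h / 12 * (fa + 4 * fd + 2 * fc + 4 * fe + fb) # Simpson on the two halves
    if abs(Q2 - Q1) < eps:
        return Q2 + (Q2 - Q1) / 15                     # extrapolation (4.6.1)
    return _quadstep(f, a, c, eps, fa, fd, fc) + _quadstep(f, c, b, eps, fc, fe, fb)

print(adquad(lambda x: math.sqrt(x), 0.0, 1.0), 2 / 3)  # adaptivity handles the endpoint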
Chapter 5

Numerical Solution of Nonlinear Equations

5.1 Nonlinear Equations


The problems discussed in this chapter may be written generically in the form

f (x) = 0, (5.1.1)

but allow different interpretations depending on the meaning of x and f. The simplest
case is a single equation in a single unknown, in which case f is a given function
of a real or complex variable, and we are trying to find values of this variable for
which f vanishes. Such values are called roots of the equation (5.1.1) or zeros of
the function f. If x in (5.1.1) is a vector, say x = [x_1, x_2, . . . , x_d]^T ∈ R^d, and f is
also a vector, each component of which is a function of the d variables x_1, x_2, . . . , x_d,
then (5.1.1) represents a system of equations. The system is nonlinear if at least one
component of f depends nonlinearly on at least one of the variables x_1, x_2, . . . , x_d.
If all components of f are linear functions of x1 , . . . , xd we call (5.1.1) a system of
linear algebraic equations. Still more generally, (5.1.1) could represent a functional
equation, if x is an element of some function space and f a (linear or nonlinear)
operator acting on this space. In each of these interpretations, the zero on the right of
(5.1.1) has a different meaning: the number zero in the first case, the zero vector in
the second, and the function identically equal to zero in the last case.
Much of this chapter is devoted to single nonlinear equations. Such equations are
often encountered in the analysis of vibrating systems, where the roots correspond
to critical frequencies (resonance). The special case of algebraic equations, where f
in (5.1.1) is a polynomial, is also of considerable importance and deserves a special
treatment.


5.2 Iterations, Convergence, and Efficiency


Even the simplest of nonlinear equations — for example, algebraic equations — are
known not to admit solutions that are expressible rationally in terms of the data. It is
therefore impossible, in general, to compute roots of nonlinear equations in a finite
number of arithmetic operations. What is required is an iterative method, that is, a
procedure that generates an infinite sequence of approximations {x_n}_{n∈N} such that

lim_{n→∞} x_n = α,   (5.2.1)

for some root α of the equation. In the case of a system of equations, both x_n and α are
vectors of appropriate dimension, and convergence is to be understood in the sense of
componentwise convergence.
Although convergence of an iterative process is certainly desirable, it takes more
than just convergence to make it practical. What one wants is fast convergence. A
basic concept to measure the speed of convergence is the order of convergence.
Definition 5.2.1. One says that xn converge to α (at least) linearly if

|xn − α| ≤ en (5.2.2)

where {en } is a positive sequence satisfying


en+1
lim = c, 0 < c < 1. (5.2.3)
n→∞ en

If (5.2.2) and (5.2.3) hold with equality in (5.2.2) then c is called asymptotic error
constant.

The phrase “at least” in this definition relates to the fact that we have only inequal-
ity in (5.2.2), which in practice is all we can usually ascertain. So, strictly speaking,
it is the bounds en that converge linearly, meaning that (e.g. for n large enough) each
of these error bounds is approximately a constant fraction of the preceding one.
Definition 5.2.2. One says that x_n converges to α with (at least) order p ≥ 1 if
(5.2.2) holds with

lim_{n→∞} e_{n+1}/e_n^p = c,   c > 0.   (5.2.4)

Thus, convergence of order 1 is the same as linear convergence, whereas conver-


gence of order p > 1 is faster. Note that in this latter case there is no restriction on
the constant c: once en is small enough, it will be the exponent p that takes care of the
convergence. If we have equality in (5.2.2), c is again referred to as the asymptotic
error constant.

The same definitions apply also to vector-valued sequences; one only needs to
replace absolute values in (5.2.2) by (any) vector norm.
The classification of convergence with respect to order is still rather crude, as
there are types of convergence that "fall between the cracks". Thus, a sequence {e_n}
may converge to zero more slowly than linearly, for example such that c = 1 in (5.2.3).
We call this type of convergence sublinear. Likewise, c = 0 in (5.2.3) gives rise to
superlinear convergence, if (5.2.4) does not hold for any p > 1.
It is instructive to examine the behavior of e_n if, instead of the limit relations
(5.2.3) and (5.2.4), we had strict equality from some n on, say,

e_{n+1}/e_n^p = c,   n = n_0, n_0 + 1, n_0 + 2, . . .   (5.2.5)

For n_0 large enough, this is almost true. A simple induction argument then shows
that

e_{n_0+k} = c^{(p^k−1)/(p−1)} e_{n_0}^{p^k},   (5.2.6)

which certainly holds for p > 1, but also for p = 1 in the limit as p ↓ 1:

e_{n_0+k} = c^k e_{n_0},   k = 0, 1, 2, . . .   (p = 1).   (5.2.7)

Assuming then e_{n_0} sufficiently small so that the approximation x_{n_0} has several
correct decimal digits, we write e_{n_0+k} = 10^{−δ_k} e_{n_0}. Then δ_k, according to (5.2.2),
approximately represents the number of additional correct digits in the approximation
x_{n_0+k} (as opposed to x_{n_0}). Taking logarithms in (5.2.6) and (5.2.7) gives

δ_k = k log(1/c),   if p = 1,
δ_k = p^k [ ((1 − p^{−k})/(p − 1)) log(1/c) + (1 − p^{−k}) log(1/e_{n_0}) ],   if p > 1;

hence, as k → ∞,

δ_k ∼ c_1 k   (p = 1),   δ_k ∼ c_p p^k   (p > 1),   (5.2.8)

where c_1 = log(1/c) > 0 if p = 1, and

c_p = (1/(p − 1)) log(1/c) + log(1/e_{n_0}).
(We assume here that n0 is large enough, and hence en0 small enough, to have
cp > 0). This shows that the number of correct decimal digits increases linearly
with k when p = 1, but exponentially when p > 1. In the latter case, δk+1 /δk ∼ p
meaning that ultimately (for large k) the number of correct decimal digits increases,
per iteration step, by a factor of p.

If each iteration requires m units of work (a "unit of work" typically is the work
involved in computing a function value or a value of one of its derivatives), then the
efficiency index of the iteration may be defined by

lim_{k→∞} [δ_{k+1}/δ_k]^{1/m} = p^{1/m}.

It provides a common basis on which to compare different iterative methods with one
another. Methods that converge linearly have efficiency index 1.
Practical computation requires the employment of a stopping rule that terminates
the iteration once the desired accuracy is (or is believed to be) obtained. Ideally, one
stops as soon as kxn − αk < tol, where tol is a prescribed accuracy. Since α is not
known, one commonly replaces xn − α by xn − xn−1 and requires

kxn − xn−1 k ≤ tol (5.2.9)

where
tol = kxn kεr + εa (5.2.10)

with ε_r, ε_a prescribed tolerances. As a safety measure, one might require (5.2.9) not
just for one, but for a few consecutive values of n. Choosing ε_r = 0 or ε_a = 0 makes
(5.2.10) an absolute (resp. relative) error tolerance. It is prudent, however, to
use a "mixed error tolerance", say ε_r = ε_a = ε. Then, if ‖x_n‖ is small or moderately
large, one effectively controls the absolute error, whereas for ‖x_n‖ very large, it is
in effect the relative error that is controlled. One can combine the above tests with
‖f(x_n)‖ ≤ ε. In the algorithms given in this chapter we shall suppose that a function
stopping_criterion, implementing such a stopping rule, is available.

5.3 Sturm Sequences Method

There are situations in which it is desirable to be able to select one particular root
among many others and have the iterative scheme converge to it. This is the case,
for example, for orthogonal polynomials, where we know that all zeros are real and
distinct. It may well be that we are interested in the second-largest or third-largest
zero, and we should be able to compute it without computing any of the others. This
is indeed possible if we combine the bisection method with the theorem of Sturm.


Thus, consider
f (x) := πd (x) = 0, (5.3.1)
where πd is a polynomial of degree d, orthogonal with respect to some positive mea-
sure. We know that πd is the characteristic polynomial of a symmetric tridiagonal
matrix and can be computed recursively by a three term recurrence relation

π0 (x) = 1, π1 (x) = x − α0
(5.3.2)
πk+1 (x) = (x − αk )πk (x) − βk πk−1 (x), k = 1, 2, . . . , d − 1

with all βk positive. The recursion (5.3.2) is not only useful to compute πd (x), but
has also the property due to Sturm.

Proposition 5.3.1. Let σ(x) be the number of sign changes (zeros do not count) in the
sequence of numbers

π_d(x), π_{d−1}(x), . . . , π_1(x), π_0(x).   (5.3.3)

Then, for any two numbers a, b with a < b, the number of real zeros of π_d in the
interval a < x ≤ b is equal to σ(a) − σ(b).

Since πk (x) = xk + . . . , it is clear that σ(−∞) = d, σ(+∞) = 0, so that indeed


the number of real zeros of πd is σ(−∞) − σ(∞) = d. Moreover, if ξ1 > ξ2 > · · · >
ξd denote the zeros of πd in decreasing order, we have the behavior of σ(x) as shown
in Figure 5.1.
It is now easy to see that

σ(x) ≤ r − 1 ⇐⇒ x ≥ ξr . (5.3.4)

Indeed, suppose that x ≥ ξ_r. Then #{zeros ≤ x} ≥ d + 1 − r; hence, by Sturm's
theorem, σ(−∞) − σ(x) = d − σ(x) = #{zeros ≤ x} ≥ d + 1 − r, that is,
σ(x) ≤ r − 1. Conversely, if σ(x) ≤ r − 1, then, again by Sturm's theorem,

Jacques Charles François Sturm (1803-1855), a Swiss analyst and theoretical physicist, is best known for his theorem on Sturm sequences, discovered in 1829, and his theory of the Sturm-Liouville differential equation. He also contributed significantly to differential and projective geometry.

Figure 5.1: Sturm’s theorem

{# zeros ≤ x} = d − σ(x) ≥ d + 1 − r, which implies x ≥ ξr (see Figure


5.1).
The basic idea is to control the bisection process not, as in the bisection method,
by checking the sign of π_d(x), but rather by checking the inequality (5.3.4), to see
whether we are on the right or on the left side of the zero ξ_r. In order to initialize the
procedure, we need two values a_1 = a, b_1 = b such that a < ξ_d and b > ξ_1.
These are trivially obtained as the endpoints of the interval of orthogonality for π_d, if it
is finite. More generally, one can apply Gershgorin's theorem to the Jacobi matrix J_d
associated with the recursion (5.3.2),

 √ 
α0 β1 0
 β1 α1 √β2
√ 

 
 .. 
Jn = 
 β 2 α2 . 

 .. .. p 
 . . βn−1 
p
0 βn−1 αn−1

and taking into account that the zeros of π_d are precisely the eigenvalues of J_d.
Gershgorin's theorem states that the eigenvalues of a matrix A = [a_{ij}] of order d
5.4. Method of False Position 135

are located in the union of the disks

{ z ∈ C : |z − a_{ii}| ≤ r_i },   r_i = Σ_{j≠i} |a_{ij}|,   i = 1, d.

In this way, b can be chosen to be the largest of the d numbers α_0 + √β_1,
α_1 + √β_1 + √β_2, . . . , α_{d−2} + √β_{d−2} + √β_{d−1}, α_{d−1} + √β_{d−1}, and a the smallest
of the corresponding numbers with the square roots subtracted instead of added. The
method of Sturm sequences then proceeds as follows, for any given r with 1 ≤ r ≤
d:
for n := 1, 2, 3, . . . do
xn := 21 (an + bn );
if σ(xn ) > r − 1 then
an+1 := xn ; bn+1 := bn ;
else
an+1 := an ; bn+1 = xn ;
end if
end for
Since initially σ(a) = d > r − 1, σ(b) = 0 ≤ r − 1, it follows by construction that
σ(an ) > r − 1, σ(bn ) ≤ r − 1, n = 1, 2, 3, . . .
meaning that ξr ∈ [an , bn ], for all n = 1, 2, 3, . . . . Moreover, as in the bisection
method, bn − an = 2−(n−1) (b − a), so that |xn − ξr | ≤ εn with εn = 2−n (b − a).
The method converges (at least) linearly to the root ξr . A computer implementation
can be obtained by modifying the if-else statement appropriately.
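A sketch in Python of σ(x) and of the bisection controlled by (5.3.4); the recursion coefficients alpha, beta are assumed to be given (the Legendre values α_k = 0, β_k = k²/(4k² − 1) used in the example are an illustrative assumption, as are d = 6 and the tolerance):

def sigma(x, alpha, beta):
    # sign changes (zeros do not count) in pi_d(x), ..., pi_0(x), via (5.3.2)
    d = len(alpha)
    p_prev, p = 1.0, x - alpha[0]
    vals = [p_prev, p]
    for k in range(1, d):
        p_prev, p = p, (x - alpha[k]) * p - beta[k] * p_prev
        vals.append(p)
    count, last = 0, None
    for v in reversed(vals):          # order pi_d, ..., pi_0
        if v == 0.0:
            continue
        s = v > 0
        if last is not None and s != last:
            count += 1
        last = s
    return count

def sturm_bisect(r, alpha, beta, a, b, tol=1e-12):
    # find the r-th largest zero xi_r using (5.3.4): sigma(x) <= r-1  <=>  x >= xi_r
    while b - a > tol:
        x = (a + b) / 2
        if sigma(x, alpha, beta) > r - 1:
            a = x
        else:
            b = x
    return (a + b) / 2

d = 6
alpha = [0.0] * d
beta = [0.0] + [k * k / (4.0 * k * k - 1.0) for k in range(1, d)]
print(sturm_bisect(1, alpha, beta, -1.0, 1.0))   # largest zero of the Legendre pi_6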

5.4 Method of False Position


As in the method of bisection, we assume two numbers a < b such that
f ∈ C[a, b], f (a)f (b) < 0 (5.4.1)
and generate a sequence of nested intervals [an , bn ], n = 1, 2, 3, . . . with a1 = a,
b1 = b such that f (an )f (bn ) < 0. Unlike the bisection method, however, we are not
taking the midpoint of [an , bn ] to determine the next interval, but rather the solution
x = xn of the linear equation
(L1 f )(x; an , bn ) = 0.
This would appear to be more flexible than bisection, as xn will come to lie closer to
the endpoint at which |f | is smaller.
More explicitly, the method proceeds as follows: define a1 = a, b1 = b. Then

for n := 1, 2, . . . do
x_n := a_n − [(a_n − b_n)/(f(a_n) − f(b_n))] f(a_n);
if f (an )f (xn ) > 0 then
an+1 := xn ; bn+1 := bn ;
else
an+1 := an ; bn+1 := xn ;
end if
end for
One may terminate the iteration as soon as min(xn − an , bn − xn ) ≤ tol, where
tol is a prescribed error tolerance, although this is not entirely fool-proof.
The convergence behavior is most easily analyzed if we assume that f is convex
or concave on [a, b]. To fix ideas, suppose f is convex, say

f 00 (x) > 0, x ∈ [a, b], f (a) < 0, f (b) > 0. (5.4.2)

Then f has exactly one zero, α, in [a, b]. Moreover, the secant connecting f(a) and
f(b) lies entirely above the graph of y = f(x), and hence intersects the real line to the
left of α. This will be the case for all subsequent secants, which means that the point
x = b remains fixed while the other endpoint a gets continuously updated, producing
a monotonically increasing sequence of approximations. The sequence defined by

x_{n+1} = x_n − [(x_n − b)/(f(x_n) − f(b))] f(x_n),   n ∈ N^*,   x_1 = a   (5.4.3)

is a monotonically increasing sequence bounded from above by α, therefore convergent
to a limit x̄, and f(x̄) = 0, so x̄ = α (see Figure 5.2).
To determine the speed of convergence, we subtract α from both side of (5.4.3)
and use the fact that f (α) = 0:

xn − b
xn+1 − α = xn − α − [f (xn ) − f (α)].
f (xn ) − f (b)

Now divide by xn − α to get

xn+1 − α xn − b f (xn ) − f (α)


=1− .
xn − α f (xn ) − f (b) xn − α

Letting here n → ∞ and using the fact that xn → α, we obtain

xn+1 − α f 0 (α)
lim = 1 − (b − α) . (5.4.4)
n→∞ xn − α f (b)

Figure 5.2: Method of false position

Thus, we have linear convergence with asymptotic error constant equal to

c = 1 − (b − α) f'(α)/f(b).

Due to the assumption of convexity, c ∈ (0, 1). The proof when f is concave is
analogous. If f is neither convex nor concave on [a, b], but f ∈ C²[a, b] and f''(α) ≠
0, then f'' has a constant sign in a neighborhood of α; for n large enough x_n will
eventually come to lie in this neighborhood, and we can proceed as above.
Drawbacks. (i) Slow convergence; (ii) the fact that one of the endpoints remains
fixed. If f is very flat near α, with the point a nearby and b farther away, the convergence
is exceptionally slow.
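A sketch of the method of false position in Python (the test equation x³ − 2 = 0, the tolerance, and the iteration limit are illustrative choices):

def false_position(f, a, b, tol=1e-10, nmax=200):
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "f must change sign on [a, b]"
    x = a
    for _ in range(nmax):
        x = a - (a - b) / (fa - fb) * fa        # zero of the secant through (a,fa), (b,fb)
        if min(x - a, b - x) <= tol:            # stopping rule from the text (not fool-proof)
            return x
        fx = f(x)
        if fx == 0:
            return x
        if fa * fx > 0:
            a, fa = x, fx
        else:
            b, fb = x, fx
    return x

print(false_position(lambda x: x**3 - 2, 1.0, 2.0))   # cube root of 2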

5.5 Secant Method


The secant method is a simple variant of the method of false position in which it is
no longer required that the function f have opposite signs at the endpoints of each
interval generated, not even the initial interval. One starts with two arbitrary initial

approximations x0 , x1 and continues with


xn − xn−1
xn+1 = xn − f (xn ), n ∈ N∗ (5.5.1)
f (xn ) − f (xn−1 )

This precludes the formation of a fixed false position, as in the method of false position,
and hence suggests potentially faster convergence. Unfortunately, the "global
convergence" no longer holds; the method converges only "locally", that is, only if
the initial approximations x_0 and x_1 are sufficiently close to a root.
We need a relation between three consecutive errors
 
f (xn ) f (xn ) − f (α)
xn+1 − α = xn − α − = (xn − α) 1 −
f [xn−1 , xn ] (xn − α)f [xn−1 , xn ]
 
f [xn , α] f [xn−1 , xn ] − f [xn , α]
= (xn − α) 1 − = (xn − α)
f [xn−1 , xn ] f [xn−1 , xn ]
f [xn , xn−1 , α]
= (xn − α)(xn−1 − α) .
f [xn−1 , xn ]

Hence,

f [xn , xn−1 , α]
(xn+1 − α) = (xn − α)(xn−1 − α) , n ∈ N∗ (5.5.2)
f [xn−1 , xn ]

From (5.5.2) it follows that if α is a simple root (f(α) = 0, f'(α) ≠ 0) and if
x_n → α, then convergence is faster than linear, at least if f ∈ C² near α. How fast is
the convergence?
We replace the ratio of divided differences in (5.5.2) by a constant, which is almost
true when n is large. Letting then e_k = |x_k − α|, we have

e_{n+1} = e_n e_{n−1} C,   C > 0.

Multiplying both sides by C and defining En = Cen gives

En+1 = En En−1 , En → 0.

Taking logarithms on both sides, and defining yn = log E1n we obtain

yn+1 = yn + yn−1 , (5.5.3)

the well-known difference equation for the Fibonacci sequence.


The solution is
yn = c1 tn1 + c2 tn2 ,

where c_1, c_2 are constants and

t_1 = (1 + √5)/2,   t_2 = (1 − √5)/2.
Since y_n → ∞, we have c_1 ≠ 0 and y_n ∼ c_1 t_1^n as n → ∞, since |t_2| < 1.
Putting this back, 1/E_n ∼ e^{c_1 t_1^n}, 1/e_n ∼ C e^{c_1 t_1^n}, so

e_{n+1}/e_n^{t_1} ∼ (C e^{c_1 t_1^n})^{t_1} / (C e^{c_1 t_1^{n+1}}) = C^{t_1 − 1},   n → ∞.

The order of convergence, therefore, is t_1 = (1 + √5)/2 ≈ 1.61803 . . . (the golden
ratio).
Theorem 5.5.1. Let α be a simple zero of f. Let I_ε = {x ∈ R : |x − α| < ε} and
assume f ∈ C²[I_ε]. Define, for sufficiently small ε,

M(ε) = max_{s,t ∈ I_ε} | f''(s) / (2f'(t)) |.   (5.5.4)

Assume ε so small that

ε M(ε) < 1.   (5.5.5)

Then the secant method converges to the unique root α ∈ I_ε for any starting values
x_0 ≠ x_1 with x_0 ∈ I_ε, x_1 ∈ I_ε.

Remark 5.5.2. Note that lim_{ε→0} M(ε) = |f''(α)/(2f'(α))| < ∞, so that (5.5.5) can certainly
be satisfied for ε small enough. The local nature of convergence is thus quantified by
the requirement x_0, x_1 ∈ I_ε. ♦

Proof. First of all, observe that α is the only zero of f in Iε . This follows from
Taylor’s formula applied at x = α:
f(x) = f(α) + (x − α)f'(α) + ((x − α)²/2) f''(ξ),

where f(α) = 0 and ξ ∈ (x, α) (or (α, x)). Thus, if x ∈ I_ε, then also ξ ∈ I_ε, and we
have

f(x) = (x − α) f'(α) [ 1 + ((x − α)/2) (f''(ξ)/f'(α)) ].

Here, if x ≠ α, all three factors are different from zero, the last one since by assumption

| ((x − α)/2) (f''(ξ)/f'(α)) | ≤ ε M(ε) < 1.

Thus, f on Iε can only vanish at x = α.


Next we show that, for all n, x_n ∈ I_ε and two consecutive iterates are distinct,
unless f(x_n) = 0 for some n, in which case x_n = α and the method converges in a
finite number of steps. We prove this by induction: assume that x_{n−1}, x_n ∈ I_ε and
x_n ≠ x_{n−1}. (By assumption this is true for n = 1.) Then, from known properties of
divided differences, and by our assumption that f ∈ C²[I_ε], we have

f[x_{n−1}, x_n] = f'(ξ_1),   f[x_{n−1}, x_n, α] = (1/2) f''(ξ_2),   ξ_i ∈ I_ε, i = 1, 2.

Therefore, by (5.5.2),

|x_{n+1} − α| ≤ ε² | f''(ξ_2) / (2f'(ξ_1)) | ≤ ε · ε M(ε) < ε,

that is, x_{n+1} ∈ I_ε. Furthermore, by the relation (5.5.2) between three consecutive
errors, x_{n+1} ≠ x_n unless f(x_n) = 0, and hence x_n = α.
Finally, using again (5.5.2) we have

|xn+1 − α| ≤ |xn − α|εM (ε)

which, applied repeatedly, yields

|xn+1 − α| ≤ |xn − α|εM (ε) ≤ · · · ≤ [εM (ε)]n−1 |x1 − α|.

Since εM (ε) < 1, it follows that the method converges and xn → α as n → ∞. 

The pseudocode is given in Algorithm 5.1. Since only one evaluation of f is required
in each iteration step, the secant method has the efficiency index p^{1/1} = (1 + √5)/2 ≈
1.61803 . . . .

5.6 Newton’s method


Newton’s method can be thought of as a limit case of the secant method, when
xn−1 → xn . The result is
x_{n+1} = x_n − f(x_n)/f'(x_n),   (5.6.1)
where x0 is some appropriate initial approximation. Another, more fruitful interpre-
tation is that of linearization of the equation f (x) = 0 at x = xn :

f (x) ≈ f (xn ) + (x − xn )f 0 (xn ) = 0.



Algorithm 5.1 Secant method for nonlinear equations in R


Input: Function f , starting values x0 and x1 , maximum number of iterations,
N max, tolerance information tol
Output: An approximation of a root or an error message
1: xc := x1 ; xv = x0 ;
2: f c := f (x1 ); f v := f (x0 );
3: for k := 1 to N max do
4: xn := xc − f c ∗ (xc − xv)/(f c − f v);
5: if stopping criterion(tol) then
6: return xn;{Success}
7: end if
8: xv := xc; f v := f c; xc := xn; f c = f (xn);
9: end for
10: error(”Maximum number of iterations exceeded”).
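A Python sketch of Algorithm 5.1, with a simple mixed absolute/relative stopping rule in the spirit of (5.2.9)-(5.2.10); the test equation x² − 2 = 0 and the tolerances are illustrative:

def secant(f, x0, x1, nmax=100, eps_a=1e-12, eps_r=1e-12):
    xv, xc = x0, x1
    fv, fc = f(xv), f(xc)
    for _ in range(nmax):
        xn = xc - fc * (xc - xv) / (fc - fv)
        if abs(xn - xc) <= abs(xn) * eps_r + eps_a:   # stopping rule (5.2.9)-(5.2.10)
            return xn
        xv, fv, xc, fc = xc, fc, xn, f(xn)
    raise RuntimeError("maximum number of iterations exceeded")

print(secant(lambda x: x * x - 2, 1.0, 2.0))   # sqrt(2)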

Viewed in this manner, Newton's method can be vastly generalized to equations of all
kinds (systems of nonlinear equations, functional equations), in which case the
derivative f' is to be understood as a Fréchet derivative, and the iteration is

xn+1 = xn − [f 0 (xn )]−1 f (xn ) (5.6.2)

The study of error in Newton’s method is virtually the same as the one for the
secant method.
f (xn )
xn+1 − α = xn − α − 0
f (xn )
 
f (xn ) − f (α)
= (xn − α) 1 −
(xn − α)f 0 (xn )
  (5.6.3)
f [xn , α]
= (xn − α) 1 −
f [xn , xn ]
f [xn , xn , α]
= (xn − α)2 .
f [xn , xn ]

Therefore, if x_n → α, then

lim_{n→∞} (x_{n+1} − α)/(x_n − α)² = f''(α)/(2f'(α)),

that is, Newton's method has order of convergence p = 2 if f''(α) ≠ 0. For the
convergence of Newton's method we have the following result.

Theorem 5.6.1. Let α be a simple root of the equation f(x) = 0 and I_ε = {x ∈ R :
|x − α| ≤ ε}. Assume that f ∈ C²[I_ε]. Define

M(ε) = max_{s,t ∈ I_ε} | f''(s) / (2f'(t)) |.   (5.6.4)

If ε is so small that

2 ε M(ε) < 1,   (5.6.5)

then for every x_0 ∈ I_ε Newton's method is well defined and converges quadratically
to the only root α ∈ I_ε.

The extra factor 2 in (5.6.5) comes from the requirement that f'(x) ≠ 0 for x ∈ I_ε.
The stopping criterion for Newton’s method

|xn − xn−1 | < ε

is based on the following result.

Proposition 5.6.2. Let (xn ) be the sequence of approximations generated by New-


ton’s method. If α is a simple root in [a, b], f ∈ C 2 [a, b] and the method is convergent,
then there exists an n0 ∈ N such that

|xn − α| ≤ |xn − xn−1 |, n > n0 .

Proof. We shall first show that

|x_n − α| ≤ (1/m_1) |f(x_n)|,   where 0 < m_1 ≤ inf_{x∈[a,b]} |f'(x)|.   (5.6.6)

Using Lagrange's Theorem, f(α) − f(x_n) = f'(ξ)(α − x_n), where ξ ∈ (α, x_n)
(or (x_n, α)). The relations f(α) = 0 and |f'(x)| ≥ m_1 for x ∈ (a, b) imply that
|f(x_n)| ≥ m_1 |α − x_n|, that is, (5.6.6).
Based on Taylor's formula, we have

f(x_n) = f(x_{n−1}) + (x_n − x_{n−1}) f'(x_{n−1}) + (1/2)(x_n − x_{n−1})² f''(µ),   (5.6.7)

where µ ∈ (x_{n−1}, x_n) or µ ∈ (x_n, x_{n−1}). Due to the way in which an
approximation is obtained in Newton's method, we have f(x_{n−1}) + (x_n − x_{n−1}) f'(x_{n−1}) = 0,
and from (5.6.7) we obtain

|f(x_n)| = (1/2)(x_n − x_{n−1})² |f''(µ)| ≤ (1/2)(x_n − x_{n−1})² ‖f''‖_∞,

and based on (5.6.6) it follows that

|α − x_n| ≤ (‖f''‖_∞ / (2m_1)) (x_n − x_{n−1})².

Since we assumed the convergence of the method, there exists n_0 ∈ N such that

(‖f''‖_∞ / (2m_1)) |x_n − x_{n−1}| < 1,   n > n_0,

and hence

|x_n − α| ≤ |x_n − x_{n−1}|,   n > n_0.


The geometric interpretation of Newton’s method is given in Figure 5.3, and its
description in Algorithm 5.2.

Figure 5.3: Newton’s method

The choice of starting value is, in general, a difficult task. In practice, one chooses
a value and, if after a fixed maximum number of iterations the desired accuracy (tested
by a usual stopping criterion) is not attained, another starting value is chosen. For
example, if the root is isolated in a certain interval [a, b] and f''(x) ≠ 0 for x ∈ (a, b),
a choice criterion is f(x_0)f''(x_0) > 0.

Algorithm 5.2 Newton’s method for nonlinear equations in R


Input: Function f , derivative f 0 , starting value x0 , the maximum number of itera-
tions N max, tolerance information tol
Output: An approximation of the root ar an error message
1: for k := 0 to N max do
2: xk+1 := xk − ff0(x k)
(xk ) ;
3: if stopping criterion(tol) then
4: return xk+1 ;{Success}
5: end if
6: end for
7: error(”Maximum number of iterations exceeded”).
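A Python sketch of Algorithm 5.2 (the test equation cos x = 0, the starting value, and the tolerances are illustrative assumptions):

import math

def newton(f, df, x0, nmax=50, eps_a=1e-12, eps_r=1e-12):
    x = x0
    for _ in range(nmax):
        xn = x - f(x) / df(x)
        if abs(xn - x) <= abs(xn) * eps_r + eps_a:   # mixed stopping rule
            return xn
        x = xn
    raise RuntimeError("maximum number of iterations exceeded")

print(newton(math.cos, lambda x: -math.sin(x), 1.0))   # converges to pi/2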

5.7 Fixed Point Iteration


Often, in applications, a nonlinear equation presents itself in the form of a fixed point
problem: find x such that
x = ϕ(x). (5.7.1)
A number α satisfying this equation is called a a fixed point of ϕ. Any equation
f (x) = 0 can, in fact, (in many different ways) be written equivalently in the form
(5.7.1). For example, if f 0 (x) 6= 0 in the interval of interest, we can take

ϕ(x) = x − f(x)/f'(x).   (5.7.2)

If x_0 is an initial approximation of a fixed point α of (5.7.1), the fixed point
iteration generates a sequence of approximations by

x_{n+1} = ϕ(x_n).   (5.7.3)

If it converges and ϕ is continuous, it clearly converges to a fixed point of ϕ. Note that
(5.7.3) is precisely Newton's method for solving f(x) = 0 if ϕ is defined by (5.7.2).
So Newton's method can be viewed as a fixed point iteration, but the secant method
cannot (it uses two previous iterates).
For any iteration of the form (5.7.3), assuming that xn → α when n → ∞, it
is straightforward to determine the order of convergence. Suppose indeed that at the
fixed point α we have

ϕ0 (α) = ϕ00 (α) = · · · = ϕ(p−1) (α) = 0, ϕp (α) 6= 0 (5.7.4)

We assume that ϕ ∈ C p on a neighborhood of α. This defines the integer p ≥ 1. We



We then have, by Taylor's theorem,

ϕ(x_n) = ϕ(α) + (x_n − α)ϕ'(α) + · · · + ((x_n − α)^{p−1}/(p − 1)!) ϕ^{(p−1)}(α)
         + ((x_n − α)^p/p!) ϕ^{(p)}(ξ_n) = ϕ(α) + ((x_n − α)^p/p!) ϕ^{(p)}(ξ_n),

where ξ_n is between α and x_n. Since ϕ(x_n) = x_{n+1} and ϕ(α) = α, we get

(x_{n+1} − α)/(x_n − α)^p = (1/p!) ϕ^{(p)}(ξ_n).

As x_n → α, since ξ_n is trapped between x_n and α, we conclude by the continuity of
ϕ^{(p)} at α that

lim_{n→∞} (x_{n+1} − α)/(x_n − α)^p = (1/p!) ϕ^{(p)}(α) ≠ 0.   (5.7.5)
This shows that convergence is exactly of order p, and the asymptotic error constant
is

c = (1/p!) ϕ^{(p)}(α).   (5.7.6)
Combining this with the usual local convergence argument, we obtain the following
result.
Theorem 5.7.1. Let α be a fixed point of ϕ and Iε = {x ∈ R : |x − α| ≤ ε}.
Assume ϕ ∈ C p [Iε ] satisfies (5.7.4). If

M(ε) := max_{t∈I_ε} |ϕ'(t)| < 1,   (5.7.7)

then the fixed point iteration converges to α, for any x0 ∈ Iε . The order of conver-
gence is p, and the asymptotic error constant is given by (5.7.6).
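A sketch of the fixed point iteration in Python; the map ϕ(x) = cos x, the starting value, and the tolerance are illustrative choices (this ϕ is a contraction near its fixed point, so p = 1 here):

import math

def fixed_point(phi, x0, nmax=1000, tol=1e-12):
    x = x0
    for _ in range(nmax):
        xn = phi(x)
        if abs(xn - x) <= tol:
            return xn
        x = xn
    raise RuntimeError("maximum number of iterations exceeded")

print(fixed_point(math.cos, 1.0))   # ~0.739085, the fixed point of cos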

5.8 Newton’s Method for Multiple zeros


If α is a zero of multiplicity m > 1, then the order of convergence of Newton's method
is only one. Indeed, let

ϕ(x) = x − f(x)/f'(x).

Since

ϕ'(x) = f(x)f''(x)/[f'(x)]²,

we have ϕ'(α) = 1 − 1/m < 1, so the process is still convergent, but only linearly.

One way to deal with multiple zeros is to solve the modified equation

u(x) := f(x)/f'(x) = 0,

which has the same roots as f, but simple ones. Newton's method for the modified problem
has the form

x_{k+1} = x_k − u(x_k)/u'(x_k) = x_k − f(x_k)f'(x_k) / ([f'(x_k)]² − f(x_k)f''(x_k)).   (5.8.1)
Since α is a simple zero of u, the convergence of (5.8.1) is always quadratic. The only
theoretical disadvantage of (5.8.1) is the additionally required second derivative of f
and the slightly higher cost of the determination of xk+1 from xk . In practice, this is
a weakness, since the denominator of (5.8.1) could be very small on a neighborhood
of α when xk → α.
Quadratic convergence for zeros of higher multiplicity can be achieved not only
by modifying the problem, but also by modifying the method. In a neighborhood of
a zero α with multiplicity m, the relation

f(x) = (x − α)^m ϕ(x) ≈ (x − α)^m · c   (5.8.2)

holds. This leads to

f(x)/f'(x) ≈ (x − α)/m   ⇒   α ≈ x − m f(x)/f'(x).
The accordingly modified iteration

x_{k+1} := x_k − m f(x_k)/f'(x_k),   k = 0, 1, 2, . . .   (5.8.3)

converges quadratically even at multiple zeros, provided the correct value of the
multiplicity m is used in (5.8.3).
The efficiency of the Newton variant (5.8.3) depends critically on the correctness
of the value m used. If this value cannot be determined analytically, then a good
estimate should at least be used.
Provided that

|xk − α| < |xk−1 − α| ∧ |xk − α| < |xk−2 − α|

xk can be substituted for α in (5.8.2):

f (xk−1 ) ≈ (xk−1 − xk )m · c
f (xk−2 ) ≈ (xk−2 − xk )m · c.

This system is then solved with respect to m:

m ≈ log[f(x_{k−1})/f(x_{k−2})] / log[(x_{k−1} − x_k)/(x_{k−2} − x_k)].

This estimate of the multiplicity can be used, for example, in (5.8.3).
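A Python sketch of the modified iteration (5.8.3), here with the multiplicity m supplied by the caller; the test function (x − 1)³ e^x, which has a triple zero at 1, and the other parameters are illustrative assumptions:

import math

def newton_multiple(f, df, x0, m, nmax=100, tol=1e-12):
    # modified Newton step (5.8.3): x_{k+1} = x_k - m f(x_k)/f'(x_k)
    x = x0
    for _ in range(nmax):
        xn = x - m * f(x) / df(x)
        if abs(xn - x) <= tol:
            return xn
        x = xn
    return x

f  = lambda x: (x - 1.0)**3 * math.exp(x)
df = lambda x: ((x - 1.0)**3 + 3.0 * (x - 1.0)**2) * math.exp(x)
print(newton_multiple(f, df, 2.0, 3))   # quadratic convergence to 1
print(newton_multiple(f, df, 2.0, 1))   # plain Newton: only linear convergence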

5.9 Algebraic Equations


There are many iterative methods specifically designed to solve algebraic equations.
Here we only describe how Newton’s method applies to this context, essentially con-
fining ourselves to a discussion of an efficient way to evaluate simultaneously the
value of a polynomial and its first derivative. In the special case where all zeros of
the polynomial are known to be real and simple, we describe an improved variant of
Newton’s method.
Newton’s method applied to algebraic equations. We consider an algebraic
equation of degree d,

f (x) = 0, f (x) = xd + ad−1 xd−1 + · · · + a0 , (5.9.1)

where the leading coefficient is assumed (without restricting generality) to be 1 and
where we may also assume a_0 ≠ 0 without loss of generality. For simplicity we
assume all coefficients to be real.
To apply Newton’s method to (5.9.1), one needs good methods for evaluating a
polynomial and its derivative.
Horner’s scheme is good for this purpose:
bd := 1; cd := 1;
for k = d − 1 downto 1 do
bk := tbk+1 + ak ;
ck := tck+1 + bk ;
end for
b0 := tb1 + a0 ;
Then f (t) = b0 , f 0 (t) = c1 .
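A Python sketch of this double Horner scheme, for a polynomial given by its coefficients a_0, . . . , a_{d−1} (leading coefficient 1); the example polynomial x³ − 2x − 5 and the evaluation point are illustrative:

def horner2(a, t):
    # a = [a_0, a_1, ..., a_{d-1}], leading coefficient assumed to be 1
    b = 1.0        # value accumulator, ends with f(t)
    c = 1.0        # derivative accumulator, ends with f'(t)
    for k in range(len(a) - 1, 0, -1):
        b = t * b + a[k]
        c = t * c + b
    b = t * b + a[0]
    return b, c    # f(t), f'(t)

# f(x) = x^3 - 2x - 5  ->  a = [-5, -2, 0]
print(horner2([-5.0, -2.0, 0.0], 2.0))   # (-1.0, 10.0): f(2) = -1, f'(2) = 10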
We proceed as follows: one applies Newton's method, computing simultaneously f(x_n)
and f'(x_n),

x_{n+1} = x_n − f(x_n)/f'(x_n).

Then, once a root α has been found, we apply Newton's method to the deflated
polynomial f(x)/(x − α). For complex roots, one begins with x_0 complex and all
computations are done in complex arithmetic. It

is possible to divide by quadratic factors and to compute entirely in real arithmetic
(Bairstow's method). Deflation, that is, decreasing the degree in this way, could lead
to large errors. A way of improvement is to use the approximate roots as starting values
for Newton's method applied to the original polynomial.

5.10 Newton’s method for systems of nonlinear equations


Newton’s method can be easily adapted to deal with systems of nonlinear equations

F (x) = 0, (5.10.1)

where F : Ω ⊂ R^n → R^n and x, F(x) ∈ R^n. The system (5.10.1) can be written
explicitly as

F_1(x_1, . . . , x_n) = 0,
  ...
F_n(x_1, . . . , x_n) = 0.

Let F'(x^{(k)}) be the Jacobian matrix of F at x^{(k)}:
$$J := F'(x^{(k)}) = \begin{pmatrix} \dfrac{\partial F_1}{\partial x_1}(x^{(k)}) & \cdots & \dfrac{\partial F_1}{\partial x_n}(x^{(k)})\\ \vdots & \ddots & \vdots\\ \dfrac{\partial F_n}{\partial x_1}(x^{(k)}) & \cdots & \dfrac{\partial F_n}{\partial x_n}(x^{(k)}) \end{pmatrix}. \qquad (5.10.2)$$

The quantity 1/f'(x) is replaced by the inverse of the Jacobian at x^{(k)}:
$$x^{(k+1)} = x^{(k)} - [F'(x^{(k)})]^{-1} F(x^{(k)}). \qquad (5.10.3)$$
We write the iteration in the form
$$x^{(k+1)} = x^{(k)} + w^{(k)}. \qquad (5.10.4)$$
Note that w^{(k)} is the solution of the system of n equations in n unknowns
$$F'(x^{(k)})\, w^{(k)} = -F(x^{(k)}). \qquad (5.10.5)$$
It is more efficient and convenient, instead of computing the inverse of the Jacobian, to solve the system (5.10.5) and to use the iteration in the form (5.10.4).
Theorem 5.10.1. Let α be a solution of the equation F(x) = 0 and suppose that in the closed ball B(δ) ≡ {x : ‖x − α‖∞ ≤ δ} the Jacobian matrix of F : R^n → R^n exists, is nonsingular, and satisfies a Lipschitz condition
$$\|F'(x) - F'(y)\|_\infty \le c\,\|x - y\|_\infty, \qquad \forall\, x, y \in B(\delta), \quad c > 0.$$
We set γ = c·max{‖[F'(x)]^{-1}‖∞ : ‖α − x‖∞ ≤ δ} and 0 < ε < min{δ, γ^{-1}}. Then, for any initial approximation x^{(0)} ∈ B(ε) := {x : ‖x − α‖∞ ≤ ε}, Newton's method is convergent, and the vectors e^{(k)} := α − x^{(k)} satisfy the following inequalities:

(a) ‖e^{(k+1)}‖∞ ≤ γ‖e^{(k)}‖²∞;

(b) ‖e^{(k)}‖∞ ≤ γ^{-1}(γ‖e^{(0)}‖∞)^{2^k}.

Proof. If F' is continuous on the segment joining the points x, y ∈ R^n, Lagrange's theorem (applied componentwise) implies
$$F(x) - F(y) = J_k(x - y), \qquad J_k = \begin{pmatrix} \dfrac{\partial F_1}{\partial x_1}(\xi_1) & \cdots & \dfrac{\partial F_1}{\partial x_n}(\xi_1)\\ \vdots & \ddots & \vdots\\ \dfrac{\partial F_n}{\partial x_1}(\xi_n) & \cdots & \dfrac{\partial F_n}{\partial x_n}(\xi_n) \end{pmatrix},$$
with intermediate points ξ_i on the segment. Taking x = α, y = x^{(k)} and using F(α) = 0,
$$e^{(k+1)} = e^{(k)} - [F'(x^{(k)})]^{-1}\bigl(F(\alpha) - F(x^{(k)})\bigr) = e^{(k)} - [F'(x^{(k)})]^{-1} J_k e^{(k)} = [F'(x^{(k)})]^{-1}\bigl(F'(x^{(k)}) - J_k\bigr)e^{(k)},$$
and (a) follows. From the Lipschitz condition one gets
$$\|F'(x^{(k)}) - J_k\|_\infty \le c \max_{j=1,n} \|x^{(k)} - \xi^{(j)}\|_\infty \le c\,\|x^{(k)} - \alpha\|_\infty.$$
Thus, if ‖α − x^{(k)}‖∞ ≤ ε, then ‖α − x^{(k+1)}‖∞ ≤ (γε)ε ≤ ε. Since (a) holds for any k, (b) follows immediately. □

Algorithm 5.3 describes Newton's method for systems of nonlinear equations.

Algorithm 5.3 Newton's method for nonlinear systems
Input: Function F, Fréchet derivative F', initial vector x^{(0)}, maximum number of iterations Nmax, tolerance information tol
Output: An approximation for the root or an error message
1: for k := 0 to Nmax do
2:   Compute the Jacobian matrix J = F'(x^{(k)});
3:   Solve the system Jw = −F(x^{(k)});
4:   x^{(k+1)} := x^{(k)} + w;
5:   if stopping criterion(tol) then
6:     return x^{(k+1)}; {Success}
7:   end if
8: end for
9: error("Maximum number of iterations exceeded").

5.11 Quasi-Newton Methods

An important weakness of Newton's method for the solution of systems of nonlinear equations is the necessity to compute the Jacobian matrix and to solve a linear n × n system with this matrix at every step. To illustrate the size of this weakness, let us evaluate the amount of computation associated with one iteration of Newton's method. The Jacobian matrix associated with a system of n nonlinear equations F(x) = 0 requires the evaluation of the n² partial derivatives of the n components of F. In most situations, evaluating the partial derivatives is inconvenient and often impossible. The computational effort required by an iteration of Newton's method is at least n² + n scalar
function evaluations (n² for the Jacobian and n for F) and O(n³) flops for the solution of the linear system. This amount of computation is prohibitive, except for small values of n and for scalar functions that can be evaluated easily. It is therefore natural to focus our attention on reducing the number of evaluations and on avoiding the solution of a linear system at each step.
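A minimal Python/NumPy sketch of Algorithm 5.3 (an illustration, not the book's implementation); the Jacobian is passed as a function and the stopping criterion ‖w‖∞ < tol is an assumption.

```python
import numpy as np

def newton_system(F, J, x0, tol=1e-10, maxit=50):
    """Newton's method for F(x) = 0, F: R^n -> R^n, J(x) = Jacobian of F at x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        w = np.linalg.solve(J(x), -F(x))   # solve J w = -F(x) instead of inverting J
        x = x + w
        if np.linalg.norm(w, np.inf) < tol:
            return x
    raise RuntimeError("Maximum number of iterations exceeded")

# Example: x^2 + y^2 = 1, x - y = 0  ->  (sqrt(2)/2, sqrt(2)/2)
F = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[0] - v[1]])
J = lambda v: np.array([[2*v[0], 2*v[1]], [1.0, -1.0]])
print(newton_system(F, J, [1.0, 0.5]))
```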
With the scalar secant method, the next iterate x^{(k+1)} is obtained as the solution of the linear equation
$$\bar{l}_k(x) = f(x^{(k)}) + (x - x^{(k)})\,\frac{f(x^{(k)} + h_k) - f(x^{(k)})}{h_k} = 0.$$
Here the linear function \bar{l}_k can be interpreted in two ways:
1. \bar{l}_k is an approximation of the tangent equation
$$l_k(x) = f(x^{(k)}) + (x - x^{(k)})f'(x^{(k)});$$
2. \bar{l}_k is the linear interpolant of f between the points x^{(k)} and x^{(k)} + h_k.
By extending the scalar secant method to n dimensions, different generalizations of the secant method are obtained depending on the interpretation of \bar{l}_k.
pretation leads to the discrete Newton method, and the second one to interpolation
methods.
The discrete Newton method is obtained replacing F 0 (x) in Newton’s method
(5.10.3) by a discrete approximation A(x, h). The partial derivatives in the Jacobian
matrix (5.10.2) are replaced by divided differences
A(x, h)ei := [F (x + hi ei ) − F (x)]/hi , i = 1, n, (5.11.1)

where e_i ∈ R^n is the i-th unit vector and h_i = h_i(x) is the step length of the discretization. A possible choice of step length is
$$h_i := \begin{cases} \varepsilon\,|x_i|, & \text{if } x_i \neq 0;\\ \varepsilon, & \text{otherwise,} \end{cases}$$

with ε := eps, where eps is the machine epsilon.
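The approximation (5.11.1) is easy to code; the following Python sketch (an assumed concrete layout, not the book's code) builds the discrete Jacobian column by column. Note that the default step scale √eps used here is a common practical choice and a deviation from the ε = eps stated above.

```python
import numpy as np

def discrete_jacobian(F, x, eps=np.sqrt(np.finfo(float).eps)):
    """Forward-difference approximation A(x, h) of F'(x), built column by column (5.11.1)."""
    x = np.asarray(x, dtype=float)
    Fx = F(x)
    n = x.size
    A = np.empty((Fx.size, n))
    for i in range(n):
        h = eps * abs(x[i]) if x[i] != 0 else eps   # step length h_i
        e = np.zeros(n); e[i] = h
        A[:, i] = (F(x + e) - Fx) / h               # i-th column of A(x, h)
    return A
```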

5.11.1 Linear Interpolation


In linear interpolation, each of the tangent planes is replaced by a (hyper-)plane which
interpolates the component function Fi of F at n + 1 given points xk,j , j = 0, n,
located in a neighborhood of x(k) , that is one determines vectors a(i) and scalars αi ,
in such a way that for
Li (x) = αi + a(i)T x, i = 1, n (5.11.2)
the following relations hold
Li (xk,j ) = Fi (xk,j ), i = 1, n, j = 0, n.
The next iterate x^{(k+1)} is obtained by intersecting the n hyperplanes (5.11.2) in R^{n+1} with the hyperplane y = 0; x^{(k+1)} is thus the solution of the linear system of equations
$$L_i(x) = 0, \qquad i = 1, n. \qquad (5.11.3)$$
Depending on the selection of interpolation points xk,j , numerous different methods
are derived, and among them the best known are Brown’s method and Brent’s method.
Brown’s method combines the processes of approximating F 0 and that of solving the
system of linear equations (5.11.3) using Gaussian elimination. Brent’s method uses
a QR factorization for solving (5.11.3). Both methods are members of a class of
methods quadratically convergent (like Newton’s method) but require only (n2 +
3n)/2 function evaluations per iteration.
In a comparative study Moré and Cosnard [27] found that Brent’s method is often
better than Brown’s method, and that the discrete Newton method is usually the most
efficient if the F-evaluations do not require too much computational effort.

5.11.2 Modification Method


Iterative methods of exceptionally efficiency can be constructed by using an approx-
imation Ak of F 0 (x(k) ), which is derived from Ak−1 by a rank-1 modification, i.e.,
by adding a matrix of rank 1:
$$A_{k+1} := A_k + u^{(k)}\bigl[v^{(k)}\bigr]^T, \qquad u^{(k)}, v^{(k)} \in \mathbb{R}^n, \quad k = 0, 1, 2, \ldots$$
According to the Sherman–Morrison formula (see [12]),
$$\bigl(A + uv^T\bigr)^{-1} = A^{-1} - \frac{1}{1 + v^T A^{-1} u}\,A^{-1} u v^T A^{-1},$$
for B_{k+1} := A_{k+1}^{-1} the recursion
$$B_{k+1} = B_k - \frac{B_k u^{(k)} \bigl[v^{(k)}\bigr]^T B_k}{1 + \bigl[v^{(k)}\bigr]^T B_k u^{(k)}}, \qquad k = 0, 1, 2, \ldots,$$
holds, provided that 1 + [v^{(k)}]^T B_k u^{(k)} ≠ 0. Thus, the necessity of solving a system of linear equations in every iteration is avoided; a matrix–vector operation suffices. Accordingly, the computational effort is reduced from O(n³) to O(n²). There is, however, a major drawback: the convergence is no longer quadratic (as it is for Newton's, Brown's or Brent's method); it is only superlinear:

$$\lim_{k\to\infty} \frac{\|x^{(k+1)} - \alpha\|}{\|x^{(k)} - \alpha\|} = 0. \qquad (5.11.4)$$

Broyden's method chooses the vectors u^{(k)} and v^{(k)} using the principle of secant approximation. In the scalar case, the approximation a_{k+1} ≈ f'(x^{(k+1)}) is uniquely defined by
$$a_{k+1}\,(x^{(k+1)} - x^{(k)}) = f(x^{(k+1)}) - f(x^{(k)}).$$
However, for n > 1, the approximation
$$A_{k+1}\,(x^{(k+1)} - x^{(k)}) = F(x^{(k+1)}) - F(x^{(k)}) \qquad (5.11.5)$$
(called the quasi-Newton equation) no longer defines A_{k+1} uniquely; any other matrix of the form
$$\bar A_{k+1} := A_{k+1} + pq^T,$$
where p, q ∈ R^n and q^T(x^{(k+1)} − x^{(k)}) = 0, also satisfies equation (5.11.5). On the other hand,
$$y_k := F(x^{(k)}) - F(x^{(k-1)}) \qquad\text{and}\qquad s_k := x^{(k)} - x^{(k-1)}$$
only contain information about the derivative of F in the direction of s_k, not about the derivative in directions orthogonal to s_k. In those directions, the effect of A_{k+1} and A_k should be the same:

Ak+1 q = Ak q, ∀q ∈ {v : v 6= 0, v T sk = 0}. (5.11.6)

Starting from an initial approximation A_0 ≈ F'(x^{(0)}) (the difference quotient given by (5.11.1) could serve as such), one generates the sequence A_1, A_2, …, uniquely determined by formulas (5.11.5) and (5.11.6) (Broyden [6], Dennis and Moré [12]).
For the corresponding sequence B_0 = A_0^{-1} ≈ [F'(x^{(0)})]^{-1}, B_1, B_2, …, the Sherman-Morrison formula can be used to obtain the recursion
$$B_{k+1} := B_k + \frac{(s_{k+1} - B_k y_{k+1})\,s_{k+1}^T B_k}{s_{k+1}^T B_k y_{k+1}}, \qquad k = 0, 1, 2, \ldots,$$

which requires only matrix–vector multiplications and thus only O(n²) computational work per step. With the matrices B_k one defines Broyden's method by the iteration
$$x^{(k+1)} := x^{(k)} - B_k F(x^{(k)}), \qquad k = 0, 1, 2, \ldots$$
This method converges superlinearly in the sense of (5.11.4) if the update vectors s_k converge (as k → ∞) to the update vectors of Newton's method. This is a good illustration of the importance of the local linearization principle for the solution of nonlinear equations.
Broyden’s method is described in Algorithm 5.4.

Algorithm 5.4 Broyden’s method for nonlinear systems


Input: Function F , starting vector x(0) , maximum number of iterations, N max,
tolerance information tol
Output: An approximation for the root or an error message
1: B_0 := F'(x^{(0)}); v := F(x); B := B_0^{-1};
2: s := −Bv; x := x + s;
3: for k := 1 to Nmax do
4:   w := v; v := F(x); y := v − w;
5:   z := −By; {z = −B_{k−1} y_k}
6:   p := −s^T z; {p = s_k^T B_{k−1} y_k}
7:   C := pI + (s + z)s^T; {C = s_k^T B_{k−1} y_k · I + (s_k − B_{k−1} y_k) s_k^T}
8:   B := (1/p) C B; {B = B_k}
9:   s := −Bv; {s = −B_k F(x^{(k)})}
10:  x := x + s;
11:  if stopping criterion(tol) then
12:    return x; {success}
13:  end if
14: end for
15: error("Maximum number of iterations exceeded")
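A compact Python/NumPy sketch of Broyden's method in the inverse-update form of Algorithm 5.4 (illustrative; taking the initial B_0^{-1} as the inverse of a rough Jacobian is an assumption).

```python
import numpy as np

def broyden(F, x0, B0inv, tol=1e-10, maxit=100):
    """Broyden's (good) method with direct update of B_k ~ [F'(x^(k))]^{-1}."""
    x = np.asarray(x0, dtype=float)
    B = np.asarray(B0inv, dtype=float)
    v = F(x)
    s = -B @ v
    x = x + s
    for _ in range(maxit):
        w, v = v, F(x)
        y = v - w                                    # y_k = F(x^(k)) - F(x^(k-1))
        By = B @ y
        B = B + np.outer(s - By, s @ B) / (s @ By)   # Sherman-Morrison update of the inverse
        s = -B @ v
        x = x + s
        if np.linalg.norm(s, np.inf) < tol:
            return x
    raise RuntimeError("Maximum number of iterations exceeded")

F = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[0] - v[1]])
B0inv = np.linalg.inv(np.array([[2.0, 1.0], [1.0, -1.0]]))   # rough Jacobian inverse at x0
print(broyden(F, [1.0, 0.5], B0inv))
```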
Chapter 6

Eigenvalues and Eigenvectors

In the following we study the problem of determining eigenvalues (and eigenvectors)


of a square matrix A ∈ Rn×n , that is to find the numbers λ ∈ C and the vectors
x ∈ Cn such that
Ax = λx. (6.0.1)

Definition 6.0.1. A number λ ∈ C is called an eigenvalue of the matrix A ∈ Rn×n ,


if there is a vector x ∈ Cn \ {0} called an eigenvector such that Ax = λx.
Remark 6.0.2. 1. The requirement x 6= 0 is important, since the null vector is an
eigenvector corresponding to any eigenvalue.

2. Even if A is real, some of its eigenvalues may be complex. For real matrices,
these occur always in conjugated pairs. ♦

6.1 Eigenvalues and Polynomial Roots


Any eigenvalue problem can be reduced to a problem of finding zeros of a polynomial: the eigenvalues of a matrix A ∈ R^{n×n} are the roots of the characteristic polynomial
$$p_A(\lambda) = \det(A - \lambda I), \qquad \lambda \in \mathbb{C},$$
since the above determinant is zero if and only if the system (A − λI)x = 0 has a nontrivial solution, that is, λ is an eigenvalue.
A naive method for the solution of eigenproblems is the computation of the characteristic polynomial followed by the computation of its roots. But the computation of a determinant is, in general, an expensive and unstable operation, so working by matrix transformations is more appropriate. Conversely, the problem of computing polynomial roots can be formulated as an eigenvalue problem. Let p ∈ P_n be a polynomial with real coefficients, which can be written (by means of its roots z_1, …, z_n, which could be complex) as
$$p(x) = a_n x^n + \cdots + a_0 = a_n(x - z_1)\cdots(x - z_n), \qquad a_n \in \mathbb{R},\ a_n \neq 0.$$

In the vector space P_{n-1}, “multiplication by x modulo p”,
$$P_{n-1} \ni q \mapsto r, \qquad x\,q(x) = \alpha p(x) + r(x), \quad r \in P_{n-1}, \qquad (6.1.1)$$
is a linear transformation, and since
$$x^n = \frac{1}{a_n}\,p(x) - \sum_{j=0}^{n-1} \frac{a_j}{a_n}\,x^j, \qquad x \in \mathbb{R},$$

we shall represent this multiplication in the basis 1, x, …, x^{n−1} by means of the so-called Frobenius companion matrix (of size n × n)
$$M = \begin{pmatrix} 0 & & & & -\frac{a_0}{a_n}\\ 1 & 0 & & & -\frac{a_1}{a_n}\\ & \ddots & \ddots & & \vdots\\ & & 1 & 0 & -\frac{a_{n-2}}{a_n}\\ & & & 1 & -\frac{a_{n-1}}{a_n} \end{pmatrix}. \qquad (6.1.2)$$

Let v_j = (v_{jk} : k = 1, n) ∈ C^n, j = 1, n, be chosen so that
$$\ell_j(x) = \frac{p(x)}{x - z_j} = a_n \prod_{k \ne j} (x - z_k) = \sum_{k=1}^{n} v_{jk}\,x^{k-1}, \qquad j = 1, n;$$
then, working modulo p,
$$\sum_{k=1}^{n} (M v_j - z_j v_j)_k\, x^{k-1} = x\ell_j(x) - z_j \ell_j(x) = (x - z_j)\ell_j(x) = p(x) \equiv 0,$$
and thus M v_j = z_j v_j, j = 1, n.
Hence, eigenvalues of M are roots of p.
The Frobenius matrix given by (6.1.2) is only a way (there exists many other) to
represent the “multiplication” in (6.1.1); any other basis of Pn−1 provides a matrix
M whose eigenvalues are roots of p. The only device for polynomial handling is
“remainder division”.
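In NumPy this reduction of root finding to an eigenvalue problem takes only a few lines (a sketch with an assumed coefficient ordering, not the text's code):

```python
import numpy as np

def poly_roots_via_companion(a):
    """Roots of p(x) = a[n]x^n + ... + a[1]x + a[0] as eigenvalues of the companion matrix."""
    a = np.asarray(a, dtype=complex)
    n = len(a) - 1
    M = np.zeros((n, n), dtype=complex)
    M[1:, :-1] = np.eye(n - 1)          # ones on the first subdiagonal
    M[:, -1] = -a[:-1] / a[-1]          # last column: -a_j / a_n
    return np.linalg.eigvals(M)

# p(x) = x^3 - 6x^2 + 11x - 6 has roots 1, 2, 3
print(np.sort(poly_roots_via_companion([-6, 11, -6, 1]).real))
```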

6.2 Basic Terminology and Schur Decomposition


As the following example shows,
$$A = \begin{pmatrix} 0 & 1\\ -1 & 0 \end{pmatrix}, \qquad p_A(\lambda) = \lambda^2 + 1 = (\lambda + i)(\lambda - i),$$

a real matrix may have complex eigenvalues. Therefore (at least from theoretical
point of view) it is convenient to deal with complex matrices A ∈ Cn×n .

Definition 6.2.1. Two matrices, A, B ∈ Cn×n are called similar if there exists a
nonsingular matrix T ∈ Cn×n , such that

A = T BT −1 .

Lemma 6.2.2. If A, B ∈ Cn×n are similar, their eigenvalues are the same.

Proof. Let λ ∈ C be an eigenvalue of A = T BT −1 and x ∈ Cn its eigenvector.


Then
B(T −1 x) = T −1 AT T −1 x = T −1 Ax = λT −1 x
and hence, λ is also an eigenvalue of B. 

The following important result from linear algebra holds, which we state without
proof. (For a proof see [21].)

Theorem 6.2.3 (Jordan normal form). Any matrix A ∈ C^{n×n} is similar to a matrix
$$J = \begin{pmatrix} J_1 & & \\ & \ddots & \\ & & J_k \end{pmatrix}, \qquad J_\ell = \begin{pmatrix} \lambda_\ell & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1\\ & & & \lambda_\ell \end{pmatrix} \in \mathbb{C}^{n_\ell\times n_\ell}, \qquad \sum_{\ell=1}^{k} n_\ell = n,$$
called the Jordan normal form of A.

Definition 6.2.4. A matrix is called diagonalizable, if all its Jordan blocks J` have
their dimension equal to one, that is, n` = 1, ` = 1, n. A matrix is called nonderoga-
tory if each eigenvalue λ` appears in exactly one Jordan block, on diagonal.

Remark 6.2.5. If a matrix A ∈ Rn×n has n simple eigenvalues, then it is diagonal-


izable and also nonderogatory, and suitable for numerical treatment. ♦

Theorem 6.2.6 (Schur decomposition). For every matrix A ∈ C^{n×n} there exist a unitary matrix U ∈ C^{n×n} and an upper triangular matrix
$$R = \begin{pmatrix} \lambda_1 & * & \cdots & *\\ & \ddots & \ddots & \vdots\\ & & \ddots & *\\ & & & \lambda_n \end{pmatrix} \in \mathbb{C}^{n\times n},$$
such that A = U R U^*.
Remark 6.2.7. 1. The diagonal elements of R are eigenvalues of A. Since A and
R are similar, they have the same eigenvalues.
2. We have a stronger form of similarity between A and R: they are unitary-
similar. ♦

Proof of Theorem 6.2.6. By induction on n. The case n = 1 is trivial. Suppose the theorem is true for n ∈ N and let A ∈ C^{(n+1)×(n+1)}. Let λ ∈ C be an eigenvalue of A and x ∈ C^{n+1}, ‖x‖₂ = 1, a corresponding eigenvector. We take u_1 = x and choose u_2, …, u_{n+1} such that u_1, …, u_{n+1} form an orthonormal basis of C^{n+1}, or equivalently, such that the matrix U = [u_1, …, u_{n+1}] is unitary. Thus,
$$U^* A U e_1 = U^* A u_1 = U^* A x = \lambda U^* x = \lambda e_1,$$
that is,
$$U^* A U = \begin{pmatrix} \lambda & *\\ 0 & B \end{pmatrix}, \qquad B \in \mathbb{C}^{n\times n}.$$
By the induction hypothesis, there exists a unitary matrix V ∈ C^{n×n} such that B = V S V^*, where S ∈ C^{n×n} is upper triangular. Therefore,
$$A = U \begin{pmatrix} \lambda & *\\ 0 & VSV^* \end{pmatrix} U^* = \underbrace{U \begin{pmatrix} 1 & 0\\ 0 & V \end{pmatrix}}_{=:\widehat U}\; \underbrace{\begin{pmatrix} \lambda & *\\ 0 & S \end{pmatrix}}_{=:R}\; \underbrace{\begin{pmatrix} 1 & 0\\ 0 & V^* \end{pmatrix} U^*}_{=\widehat U^*},$$
which completes the proof. □


Let us give now two direct consequences of Schur decomposition.
Corollary 6.2.8. To each Hermitian matrix A ∈ C^{n×n} there corresponds a unitary matrix U ∈ C^{n×n} such that
$$A = U \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} U^*, \qquad \lambda_j \in \mathbb{R}, \quad j = 1, n.$$

Proof. The matrix R in Theorem 6.2.6 verifies R = U^* A U. Since
$$R^* = (U^* A U)^* = U^* A^* U = U^* A U = R,$$
R must be diagonal, and its diagonal elements are real (R being Hermitian). □

In other words, Corollary 6.2.8 guarantees that any Hermitian matrix is unitary-
diagonalisable and has a basis which consists of orthonormal eigenvectors. More-
over, all eigenvalues of a Hermitian matrix are real. It is interesting, not only from
theoretical point of view, what kind of matrices are unitary-diagonalisable.

Theorem 6.2.9. A matrix A ∈ C^{n×n} is unitary-diagonalisable, that is, there exists a unitary matrix U ∈ C^{n×n} such that
$$A = U \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} U^*, \qquad \lambda_j \in \mathbb{C}, \quad j = 1, \ldots, n, \qquad (6.2.1)$$
if and only if A is normal, i.e.,
$$AA^* = A^*A. \qquad (6.2.2)$$

Proof. We set Λ = diag(λ_1, …, λ_n). By (6.2.1), A has the form A = UΛU^*, so
$$AA^* = U\Lambda U^* U\Lambda^* U^* = U|\Lambda|^2 U^* \quad\text{and}\quad A^*A = U\Lambda^* U^* U\Lambda U^* = U|\Lambda|^2 U^*,$$
that is, (6.2.2) holds. For the converse, we use the Schur decomposition of A in the form R = U^*AU; R is then normal as well. We have
$$|\lambda_1|^2 = (R^*R)_{11} = (RR^*)_{11} = |\lambda_1|^2 + \sum_{k=2}^{n} |r_{1k}|^2,$$
which implies r_{12} = ⋯ = r_{1n} = 0. By induction, for j = 2, …, n,
$$(R^*R)_{jj} = |\lambda_j|^2 + \sum_{k=1}^{j-1} |r_{kj}|^2 = (RR^*)_{jj} = |\lambda_j|^2 + \sum_{k=j+1}^{n} |r_{jk}|^2,$$
and for this reason R must be diagonal. □

A Schur decomposition for real matrices, that is, the so-called real Schur decompo-
sition is a little bit more complicated.
Theorem 6.2.10. For any matrix A ∈ R^{n×n} there exists an orthogonal matrix U ∈ R^{n×n} such that
$$A = U \begin{pmatrix} R_1 & * & \cdots & *\\ & \ddots & \ddots & \vdots\\ & & \ddots & *\\ & & & R_k \end{pmatrix} U^T, \qquad (6.2.3)$$
where either R_j ∈ R^{1×1}, or R_j ∈ R^{2×2} with two complex conjugated eigenvalues, j = 1, k.
In particular, a real Schur decomposition transforms A into an upper Hessenberg matrix
$$U^T A U = \begin{pmatrix} * & \cdots & \cdots & *\\ * & \ddots & & \vdots\\ & \ddots & \ddots & \vdots\\ & & * & * \end{pmatrix}.$$
Proof. If all the eigenvalues of A are real, then we proceed in the same way as for the complex Schur decomposition. Thus, let λ = α + iβ, β ≠ 0, be a complex eigenvalue of A and x + iy its eigenvector. Then
$$A(x + iy) = \lambda(x + iy) = (\alpha + i\beta)(x + iy) = (\alpha x - \beta y) + i(\beta x + \alpha y),$$
or, in matrix form,
$$A\underbrace{[x\;\; y]}_{\in\mathbb{R}^{n\times 2}} = [x\;\; y]\underbrace{\begin{pmatrix} \alpha & \beta\\ -\beta & \alpha \end{pmatrix}}_{:=R}.$$
Since det R = α² + β² > 0 (β ≠ 0), span{x, y} is an A-invariant two-dimensional subspace of R^n. We then choose u_1, u_2 to form a basis of this space, complete it with u_3, …, u_n so that all these vectors form an orthonormal basis of R^n, and after a reasoning analogous to that of the complex case we get
$$U^T A U = \begin{pmatrix} R & *\\ 0 & B \end{pmatrix},$$
and the induction proceeds as for the complex Schur decomposition. □

6.3 Vector Iteration


Vector iteration (also called power method) is the simplest method when an eigen-
value and its eigenvector is needed.

Starting with an arbitrary y^{(0)} ∈ C^n, one constructs the sequence y^{(k)}, k ∈ N, based on the following iteration:
$$z^{(k)} = A y^{(k-1)}, \qquad y^{(k)} = \frac{z^{(k)}}{z^{(k)}_{j^*}}, \qquad j^* = \min\left\{1 \le j \le n : \frac{|z^{(k)}_j|}{\|z^{(k)}\|_\infty} \ge 1 - \frac{1}{k}\right\}. \qquad (6.3.1)$$
Under certain conditions, this sequence converges to the eigenvector corresponding to the dominant eigenvalue.
Proposition 6.3.1. Let A ∈ Cn×n a diagonalizable matrix whose eigenvalues verify
the condition
$$|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n|.$$
Then the sequence y (k) , k ∈ N, converges to a multiple of the normed eigenvector
corresponding to the eigenvalue λ1 , for almost every starting vector y (0) .
Proof. Let x_1, …, x_n be the orthonormal eigenvectors of A corresponding to the eigenvalues λ_1, …, λ_n – they exist, since A is diagonalizable. We express y^{(0)} as
$$y^{(0)} = \sum_{j=1}^{n} \alpha_j x_j, \qquad \alpha_j \in \mathbb{C}, \quad j = 1, n,$$
and we note that
$$A^k y^{(0)} = \sum_{j=1}^{n} \alpha_j A^k x_j = \sum_{j=1}^{n} \alpha_j \lambda_j^k x_j = \lambda_1^k \sum_{j=1}^{n} \alpha_j \left(\frac{\lambda_j}{\lambda_1}\right)^{k} x_j.$$
Since |λ_1| > |λ_j|, j = 2, …, n, this implies
$$\lim_{k\to\infty} \lambda_1^{-k} A^k y^{(0)} = \alpha_1 x_1 + \lim_{k\to\infty} \sum_{j=2}^{n} \alpha_j \left(\frac{\lambda_j}{\lambda_1}\right)^{k} x_j = \alpha_1 x_1,$$
and also
$$\lim_{k\to\infty} |\lambda_1|^{-k}\, \bigl\|A^k y^{(0)}\bigr\|_\infty = \lim_{k\to\infty}\left\|\sum_{j=1}^{n} \alpha_j \left(\frac{\lambda_j}{\lambda_1}\right)^{k} x_j\right\|_\infty = |\alpha_1|\,\|x_1\|_\infty.$$
If α_1 = 0, that is, if y^{(0)} lies in the hyperplane
$$x_1^{\perp} = \{x \in \mathbb{C}^n : x^* x_1 = 0\},$$
then both limits are zero and we cannot draw any conclusion on the convergence of y^{(k)}, k ∈ N; this hyperplane has measure zero, so in the sequel we shall suppose α_1 ≠ 0.
Equation (6.3.1) implies y^{(k)} = γ_k A^k y^{(0)}, γ_k ∈ C, and moreover ‖y^{(k)}‖_∞ → 1 (by the choice of j^*), so
$$\lim_{k\to\infty} |\lambda_1|^{k}\,|\gamma_k| = \lim_{k\to\infty} \frac{1}{|\lambda_1|^{-k}\,\|A^k y^{(0)}\|_\infty} = \frac{1}{|\alpha_1|\,\|x_1\|_\infty}.$$
Thus
$$y^{(k)} = \gamma_k A^k y^{(0)} = \underbrace{\frac{\gamma_k \lambda_1^{k}}{|\gamma_k \lambda_1^{k}|}}_{=:e^{-2\pi i\theta_k}}\;\underbrace{\frac{\alpha_1 x_1}{|\alpha_1|\,\|x_1\|_\infty}}_{=:\alpha x_1} + O\!\left(\frac{|\lambda_2|^{k}}{|\lambda_1|^{k}}\right), \qquad k \in \mathbb{N}, \qquad (6.3.2)$$
where θ_k ∈ [0, 1]. Now it is time to use the “strange” relation (6.3.1): let j be the least subscript such that |(αx_1)_j| = ‖αx_1‖_∞; then, by (6.3.2), for sufficiently large k, (6.3.1) selects j^* = j as well. Therefore
$$\lim_{k\to\infty} y_j^{(k)} = 1 \quad\Rightarrow\quad \lim_{k\to\infty} e^{2\pi i\theta_k} = \lim_{k\to\infty} \frac{y_j^{(k)}}{(\alpha x_1)_j} = \frac{1}{(\alpha x_1)_j}.$$
Substituting this into (6.3.2), we conclude the convergence of y^{(k)}, k ∈ N. □
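A short Python sketch of the power method with a simpler scaling (the normalization (6.3.1) is replaced by division by the largest-modulus component, an assumption made for brevity):

```python
import numpy as np

def power_method(A, y0, maxit=200, tol=1e-12):
    """Approximate the dominant eigenvalue/eigenvector of A by vector iteration."""
    y = np.asarray(y0, dtype=float)
    lam = 0.0
    for _ in range(maxit):
        z = A @ y
        j = np.argmax(np.abs(z))        # index of the largest-modulus component
        lam_new, y = z[j], z / z[j]     # scale so that component j equals 1
        if abs(lam_new - lam) < tol * abs(lam_new):
            break
        lam = lam_new
    return lam, y

A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(power_method(A, [1.0, 1.0])[0])   # dominant eigenvalue (5 + sqrt(5))/2 ~ 3.618
```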

We could also apply vector iteration to compute all eigenvalues and eigenvectors,
provided that the eigenvalues of A have different modulus. For this purpose we find
the largest modulus eigenvalue λ1 of A and the corresponding eigenvector x1 , and
we proceed to
A(1) = A − λ1 x1 xT1 .
The matrix A(1) is diagonalizable and has the same orthonormal eigenvectors as A,
excepting that x1 is the eigenvector corresponding to the eigenvalue 0, and does not
play any role for the iteration, provided that the starting vector is not a multiple of
x1 . By applying once more vector iteration to A(1) one obtains the second largest
modulus eigenvalue λ2 and the corresponding eigenvector; the iteration

A(j) = A(j−1) − λj xj xTj , j = 1, n, A(0) = A

computes successively all the eigenvalues and eigenvectors of A, if its eigenvalues


are different in modulus.

Remark 6.3.2 (Drawbacks of vector iteration). 1. The method works only if


there exists a dominant eigenvector, that is only when there exists a unique

eigenvector corresponding to the dominant eigenvalue. For example, the matrix
$$A = \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix}$$
transforms the vector [x_1\ x_2]^T into the vector [x_2\ x_1]^T, and the convergence holds only if the starting vector is an eigenvector.

2. The method works well only for “suitable” starting vectors. It sounds gorgeous
that all vectors which are not in a certain hyperplane are good, but the things are
more complicated. If the dominant eigenvalue of a real matrix is complex and
the starting values are real, then the iteration run indefinitely, without finding
an eigenvector.

3. We could perform all computation in complex, but this grows seriously the
computational complexity (with a factor of two for addition and six for multi-
plication, respectively).

4. The speed of convergence depends on the ratio
$$\frac{|\lambda_2|}{|\lambda_1|} < 1,$$
which may be arbitrarily close to 1. If the dominant eigenvalue is not sufficiently dominant, the convergence is very slow. ♦

Taking into account the above remarks, we conclude that the vector iteration method is not good enough.

6.4 QR Method – the Theory


The practical method for eigenproblems is the QR method, due to Francis [14] and
Kublanovskaya [25], a unitary extension of Rutishauser’s LR method [30]. We begin
with the complex case.
The iterative method is very simple: one starts with A(0) = A and one computes
iteratively, using QR decomposition,

A(k) = Qk Rk , A(k+1) = Rk Qk , k ∈ N0 . (6.4.1)

With a bit of luck, or as mathematicians say, under certain hypothesis, this sequence
will converge to a matrix whose diagonal elements are the eigenvalues of A.
Lemma 6.4.1. The matrices A(k) , built by (6.4.1), k ∈ N, are orthogonal-similar to
A (and, obviously have the same eigenvalues as A).

Proof. The following chain of equalities holds:
$$A^{(k+1)} = Q_k^* Q_k R_k Q_k = Q_k^* A^{(k)} Q_k = \cdots = \underbrace{Q_k^* \cdots Q_0^*}_{=:U_k^*}\; A\; \underbrace{Q_0 \cdots Q_k}_{=:U_k}. \qquad\square$$


In order to prove the convergence, we shall interpret QR iteration as a gener-
alization of vector iteration (6.3.1) (without the strange norming process) to vector
spaces. For this purpose, we shall write the orthonormal base u1 , . . . , um ∈ Cn of
a m-dimensional subspace U ⊂ Cn , m ≤ n, as column vectors of a unitary matrix
U ∈ Rn×m and we shall iterate the subspace (i.e. matrices) over the QR decomposi-
tion
Uk+1 Rk = AUk , k ∈ N0 , U0 ∈ Cn . (6.4.2)
This implies immediately
Uk+1 (Rk . . .R0 ) = AUk (Rk−1 . . .R0 ) = A2 Uk−1 (Rk−2 . . .R0 ) = . . . = Ak+1 U0 .
(6.4.3)
If we define, for m = n, A^{(k)} = U_k^* A U_k, then by (6.4.2) the following relations hold:
$$A^{(k)} = U_k^* A U_k = U_k^* U_{k+1} R_k, \qquad A^{(k+1)} = U_{k+1}^* A U_{k+1} = U_{k+1}^* A U_k U_k^* U_{k+1},$$
and setting Q_k := U_k^* U_{k+1}, we obtain the iteration rule (6.4.1). We choose U_0 = I as starting matrix.
Definition 6.4.2. A phase matrix Θ ∈ C^{n×n} is a diagonal matrix of the form
$$\Theta = \begin{pmatrix} e^{-i\theta_1} & & \\ & \ddots & \\ & & e^{-i\theta_n} \end{pmatrix}, \qquad \theta_j \in [0, 2\pi), \quad j = 1, n.$$

Proposition 6.4.3. Suppose A ∈ C^{n×n} has eigenvalues with distinct moduli, |λ_1| > |λ_2| > ⋯ > |λ_n| > 0. If the matrix X^{-1} in the Jordan normal form A = XΛX^{-1} of A has an LU decomposition
$$X^{-1} = ST, \qquad S = \begin{pmatrix} 1 & & \\ * & \ddots & \\ * & \cdots & 1 \end{pmatrix}, \qquad T = \begin{pmatrix} * & \cdots & *\\ & \ddots & \vdots\\ & & * \end{pmatrix},$$
then there exist phase matrices Θ_k, k ∈ N_0, such that the matrix sequence (Θ_k U_k), k ∈ N, is convergent.

Remark 6.4.4 (On proposition 6.4.3). 1. The convergence of the matrix sequen-
ce (Θk Uk ) means that if the corresponding orthonormal bases converge to an
orthonormal basis of Cn , we have also the convergence of the corresponding
vector spaces.

2. The existence of a LU decomposition for X −1 introduces no additional con-


straints: since X −1 is invertible, there exists a permutation P , such that

X −1 P T = (P X)−1 = LU

and PX is invertible. This means that the matrix Â = P^T A P, which is the result of row and column permutations of A and has the same eigenvalues as A, fulfills the hypothesis of Proposition 6.4.3.

3. The proof of Proposition 6.4.3 is a modification of proof in [37, pag. 54–


56] for the convergence of LR, whose origin can be found in Wilkinson’s 1
book [44]. What is the LR method? It is analogous to QR method, but the
QR decomposition of A is replaced by an LU decomposition, A(k) = Lk Rk ,
and then one builds A(k+1) = Rk Lk . Under certain conditions, this method
converges to an upper triangular matrix. ♦

Before the proof of Proposition 6.4.3, let us see why the convergence of the sequence (U_k) implies the convergence of the QR method. Namely, if we have ‖U_{k+1} − U_k‖₂ ≤ ε, or equivalently
$$U_{k+1} = U_k + E, \qquad \|E\|_2 \le \varepsilon,$$
then
$$Q_k = U_k^* U_{k+1} = U_k^*(U_k + E) = I + U_k^* E = I + F, \qquad \|F\|_2 \le \underbrace{\|U_k\|_2}_{=1}\,\|E\|_2 \le \varepsilon,$$
¹ James Hardy Wilkinson (1919–1986), English mathematician. Contributions to numerical analysis, numerical linear algebra and computer science. He received many awards for his outstanding work. He was elected a Fellow of the Royal Society in 1969. He received the A. M. Turing Award from the Association for Computing Machinery and the J. von Neumann Award from the Society for Industrial and Applied Mathematics, both in 1970. Besides the large number of papers on his theoretical work in numerical analysis, Wilkinson developed computer software, working on the production of libraries of numerical routines. The NAG (Numerical Algorithms Group) began work in 1970 and much of the linear algebra routines were due to Wilkinson.

and simultaneously

A(k+1) = Rk Qk = Rk (I + F ) = Rk + G, kGk2 ≤ εkRk k2 ,

hence A(k) , k ∈ N, converges also to an upper triangular matrix, only if the norms of
Rk , k ∈ N are uniformly bounded. This is the case, since

kRk k2 = kQ∗k A(k) k2 = kA(k) k2 = kQ∗k−1 . . . Q∗0 AQ0 . . . Qk−1 k = kAk2 .

We need also an auxiliary result on the “uniqueness” of QR decomposition.


Lemma 6.4.5. Let U, V ∈ Cn×n be unitary matrices and R, S ∈ Cn×n be invertible
upper triangular matrices. Then U R = V S if and only if there exists a phase matrix
$$\Theta = \begin{pmatrix} e^{-i\theta_1} & & \\ & \ddots & \\ & & e^{-i\theta_n} \end{pmatrix}, \qquad \theta_j \in [0, 2\pi), \quad j = 1, n,$$

such that U = V Θ∗ , R = ΘS.


Proof. Since U R = V Θ∗ ΘS = V S, the sufficiency is trivial. For the necessity,
from U R = V S it follows that V ∗ U = SR−1 must be an upper triangular matrix
such that (V ∗ U )∗ = U ∗ V = RS −1 . Hence Θ = V ∗ U is a unitary diagonal matrix
and it holds U = V V ∗ U = V Θ. 

Proof of Proposition 6.4.3. Let A = XΛX^{-1} be the Jordan normal form of A, where Λ = diag(λ_1, …, λ_n). For U_0 = I and k ∈ N_0,
$$U_k \left(\prod_{j=k-1}^{0} R_j\right) = A^k = X\Lambda^k X^{-1} = X\Lambda^k S T = X \underbrace{(\Lambda^k S \Lambda^{-k})}_{=:L_k}\,\Lambda^k T,$$
where L_k is a lower triangular matrix with entries
$$(L_k)_{jm} = s_{jm}\left(\frac{\lambda_j}{\lambda_m}\right)^{k}, \qquad 1 \le m \le j \le n, \qquad (6.4.4)$$
such that for k ∈ N
$$|L_k - I| \le \left(\max_{1\le m<j\le n} |s_{jm}|\right)\left(\max_{1\le m<j\le n} \left|\frac{\lambda_j}{\lambda_m}\right|^{k}\right)\begin{pmatrix} 0 & & & \\ 1 & \ddots & & \\ \vdots & \ddots & \ddots & \\ 1 & \cdots & 1 & 0 \end{pmatrix}. \qquad (6.4.5)$$
Let \widehat U_k \widehat R_k = X L_k be the QR decomposition of X L_k, which, due to (6.4.5) and Lemma 6.4.5, converges, up to a phase matrix, to a QR decomposition X = UR of X. Now we apply Lemma 6.4.5 to the identity
$$U_k \left(\prod_{j=k-1}^{0} R_j\right) = \widehat U_k \widehat R_k \Lambda^k T;$$
there exist phase matrices Θ_k such that
$$U_k = \widehat U_k \Theta_k^* \qquad\text{and}\qquad \prod_{j=k-1}^{0} R_j = \Theta_k \widehat R_k \Lambda^k T,$$
hence there exist phase matrices \widehat\Theta_k such that U_k\widehat\Theta_k \to U as k → ∞. □

Let us briefly examine the “error term” in (6.4.4), whose sub-diagonal entries satisfy
$$|(L_k)_{jm}| \le |s_{jm}|\left(\frac{|\lambda_j|}{|\lambda_m|}\right)^{k}, \qquad 1 \le m < j \le n.$$
Therefore, the following holds:
The farther the sub-diagonal element is from the diagonal, the faster that element converges to zero.

6.5 QR Method – the Practice


6.5.1 Classical QR method
We have seen that QR method generates a matrix sequence A(k) that under certain
conditions converges to an upper triangular matrix with eigenvalues of A on diagonal.
We may apply this method to real matrices.
Example 6.5.1. Let
$$A = \begin{pmatrix} 1 & 1 & 1\\ 1 & 2 & 3\\ 1 & 2 & 1 \end{pmatrix}.$$
Its eigenvalues are
λ1 ≈ 4.56155, λ2 = −1, λ3 ≈ 0.43845.
Using a rough MATLAB implementation of the QR method we obtain the values in Table 6.1 for the subdiagonal entries. Note that after k iterative steps, the entries a^{(k)}_{m\ell}, ℓ < m, approach 0 like |λ_m/λ_ℓ|^k does. ♦

#iterations a21 a31 a32


10 6.64251e-007 -2.26011e-009 0.00339953
20 1.70342e-013 -1.52207e-019 8.9354e-007
30 4.36711e-020 -1.02443e-029 2.34578e-010
40 1.11961e-026 -6.89489e-040 6.15829e-014

Table 6.1: Results for Example 6.5.1

Example 6.5.2. The matrix
$$A = \begin{pmatrix} 1 & 5 & 7\\ 3 & 0 & 6\\ 4 & 3 & 1 \end{pmatrix}$$
has the eigenvalues
$$\lambda_1 \approx 9.7407, \qquad \lambda_2 \approx -3.8703 + 0.6480i, \qquad \lambda_3 \approx -3.8703 - 0.6480i.$$
In this case, the QR method does not converge to an upper triangular matrix. After 100 iterations we obtain the matrix
$$A^{(100)} \approx \begin{pmatrix} 9.7407 & -4.3355 & 0.94726\\ 8.552\mathrm{e}{-}039 & -4.2645 & 0.7236\\ 3.3746\mathrm{e}{-}039 & -0.79491 & -3.4762 \end{pmatrix},$$
which correctly provides the real eigenvalue. Additionally, the lower right 2 × 2 block provides the complex eigenvalues −3.8703 ± 0.6480i. ♦

The second example leads us to the following strategy: if the sub-diagonal entries do
not disappear, it is recommendable to examine the corresponding 2 × 2 matrix.

Definition 6.5.3. If A ∈ Rn×n has the QR decomposition A = QR, then a RQ


transformation of A is defined by A∗ = RQ.

What problems appear in a practical usage of the QR method? Since the complexity of a full QR decomposition is Θ(n³), it is not advisable to use a method based on such an iterative step directly. In order to avoid the problem, we convert the initial matrix into one whose QR decomposition can be computed faster. Such examples are upper Hessenberg matrices, whose QR decomposition can be computed using n − 1 Givens rotations, a total of O(n²) flops: since only the sub-diagonal entries h_{j,j−1}, j = 2, …, n, must be eliminated, we shall find the angles φ_2, …, φ_n such that
$$G(n-1, n; \varphi_n) \cdots G(1, 2; \varphi_2)\,H = R$$

and it holds
H∗ = RGT (1, 2; φ2 ) . . . GT (n − 1, n; φn ). (6.5.1)

This is the idea of Algorithm 6.1. Following [32], the motto must be “once Hessen-
berg, always Hessenberg”.

Algorithm 6.1 RQ transformation of a Hessenberg matrix H, that is H∗ = RQ,


where H = QR is a QR decomposition of H
for k := 1 to n − 1 do
  [c_k, s_k] := givens(H_{kk}, H_{k+1,k});
  H_{k:k+1, k:n} := [c_k, s_k; −s_k, c_k]^T · H_{k:k+1, k:n};
end for
for k := 1 to n − 1 do
  H_{1:k+1, k:k+1} := H_{1:k+1, k:k+1} · [c_k, s_k; −s_k, c_k];
end for
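A Python/NumPy sketch of Algorithm 6.1 (illustrative; full 2×2 rotations are applied to row and column slices rather than an optimized in-place update, and the givens() convention below is an assumption):

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, -s], [s, c]] @ [a, b]^T = [r, 0]^T."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, -b / r)

def rq_transform(H):
    """One step H -> R Q for an upper Hessenberg H, where H = Q R (Algorithm 6.1)."""
    H = H.copy()
    n = H.shape[0]
    rots = []
    for k in range(n - 1):                      # H := G_{n-1,n}^T ... G_{1,2}^T H = R
        c, s = givens(H[k, k], H[k + 1, k])
        rots.append((c, s))
        G = np.array([[c, s], [-s, c]])
        H[k:k + 2, k:] = G.T @ H[k:k + 2, k:]
    for k in range(n - 1):                      # H := R G_{1,2} ... G_{n-1,n} = R Q
        c, s = rots[k]
        G = np.array([[c, s], [-s, c]])
        H[:k + 2, k:k + 2] = H[:k + 2, k:k + 2] @ G
    return H

H = np.array([[4., 1., 2.], [3., 2., 1.], [0., 1., 5.]])
print(np.sort(np.linalg.eigvals(rq_transform(H))))   # same eigenvalues as H
```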

Lemma 6.5.4. If H ∈ Rn×n is an upper Hessenberg matrix, the matrix H∗ is upper


Hessenberg, too.

Proof. The conclusion is a direct consequence of (6.5.1) representation. Right multi-


plication by a Givens matrix, GT (j, j + 1, φj+1 ), j = 1, n − 1 means a combination
of jth and (j + 1)th columns that creates nonzero values only in the first sub-diagonal
– R is upper triangular. 

Let us see how to convert the initial matrix to Hessenberg form. For this purpose we shall use (for variation) Householder transformations. Suppose we have already found a matrix Q_k such that the first k columns of the transformed matrix are already in Hessenberg form, that is,
$$Q_k^T A Q_k = \begin{pmatrix} * & \cdots & * & * & * & \cdots & *\\ * & \ddots & \vdots & \vdots & \vdots & & \vdots\\ & \ddots & * & \vdots & \vdots & & \vdots\\ & & * & * & * & \cdots & *\\ & & & a_1^{(k)} & * & \cdots & *\\ & & & \vdots & \vdots & & \vdots\\ & & & a_{n-k-1}^{(k)} & * & \cdots & * \end{pmatrix}.$$

Then we determine ŷ ∈ R^{n−k−1} and α ∈ R (which results automatically) such that
$$H(\hat y)\begin{pmatrix} a_1^{(k)}\\ \vdots\\ a_{n-k-1}^{(k)} \end{pmatrix} = \begin{pmatrix} \alpha\\ 0\\ \vdots\\ 0 \end{pmatrix} \quad\Rightarrow\quad U_{k+1} := \begin{pmatrix} I_{k+1} & \\ & H(\hat y) \end{pmatrix},$$
and we get
$$\underbrace{U_{k+1} Q_k^T}_{=:Q_{k+1}^T}\; A\; \underbrace{Q_k U_{k+1}}_{=:Q_{k+1}} = \begin{pmatrix} * & \cdots & * & * & * & \cdots & *\\ * & \ddots & \vdots & \vdots & \vdots & & \vdots\\ & \ddots & * & \vdots & \vdots & & \vdots\\ & & * & * & * & \cdots & *\\ & & & \alpha & * & \cdots & *\\ & & & 0 & * & \cdots & *\\ & & & \vdots & \vdots & & \vdots\\ & & & 0 & * & \cdots & * \end{pmatrix};$$
the upper left identity block I_{k+1} in U_{k+1} ensures that the first k + 1 columns now have Hessenberg structure. Algorithm 6.2 gives a method for the conversion of a matrix into Hessenberg form. To conclude, our QR method will be a two-step method:

1. Convert A into Hessenberg form using an orthogonal transformation:

H (0) = QAQT , QT Q = QQT = I.

2. Do QR iterations
$$H^{(k+1)} = \bigl(H^{(k)}\bigr)_*, \qquad k \in \mathbb{N}_0,$$
hoping that all the sub-diagonal elements converge to zero.

Since sub-diagonal entries converge slowest, we can use the maximum of modu-
lus as stopping criterion. This leads us to the simple QR method, see Algorithm 6.3.
Of course, for complex eigenvalues this method iterates infinitely.

Example 6.5.5. We apply the new method to the matrix in Example 6.5.1. For var-
ious given tolerances ε, we get the results given in Table 6.2. Note that one gains a
new decimal digit for sub-diagonal entries at each three iterations. ♦

Algorithm 6.2 Reduction to upper Hessenberg form


Input: Matrix A ∈ Rn×n
Output: H, the Hessenberg form of A, and, if desired, Q such that H = QAQ^T
for i := 1 to n − 2 do
ui := House(Ai+1:n,i );
Pi := I − 2ui uTi ; {Qi = diag(Ii , Pi )}
Ai+1:n,i:n := Pi Ai+1:n,i:n ;
A1:n,i+1:n := A1:n,i+1:n Pi ;
end for
if one wishes Q then
Q := I;
for i := 1 to n − 2 do
Qi+1:n,i+1:n := Pi Qi+1:n,i+1:n ;
end for
end if
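A Python/NumPy sketch of Algorithm 6.2 (illustrative; the helper house() builds the Householder vector and is an assumption about the House routine used above):

```python
import numpy as np

def house(x):
    """Householder vector u (||u|| = 1) such that (I - 2uu^T)x is a multiple of e_1."""
    u = x.astype(float).copy()
    u[0] += np.sign(u[0] or 1.0) * np.linalg.norm(x)
    nrm = np.linalg.norm(u)
    return u / nrm if nrm > 0 else u

def hessenberg(A):
    """Orthogonal reduction of A to upper Hessenberg form H = Q A Q^T."""
    H = A.astype(float).copy()
    n = H.shape[0]
    for i in range(n - 2):
        u = house(H[i + 1:, i])
        P = np.eye(n - i - 1) - 2.0 * np.outer(u, u)   # Householder reflector
        H[i + 1:, i:] = P @ H[i + 1:, i:]              # zero below the subdiagonal in column i
        H[:, i + 1:] = H[:, i + 1:] @ P                # keep the similarity
    return H

A = np.array([[1., 5., 7.], [3., 0., 6.], [4., 3., 1.]])
print(np.round(hessenberg(A), 4))   # zeros below the first subdiagonal
```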

Algorithm 6.3 Pure (simple) QR Method


Input: Matrix A, tolerance tol
Output: A vector λ of eigenvalues and the actual number of iterations it
H := Hessenberg(A); {Hessenberg form of A}
it := 0;
while kdiag(H, −1)k∞ > tol do
H := H∗ ; {RQ transform of H}
it := it + 1;
end while
λ := diag(H);

ε #iterations λ1 λ2 λ3
10−3 11 4.56155 -0.999834 0.438281
10−4 14 4.56155 -1.00001 0.438461
10−5 17 4.56155 -0.999999 0.438446
10−10 31 4.56155 -1 0.438447

Table 6.2: Results for Example 6.5.5
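A bare-bones Python sketch of the simple QR method of Algorithm 6.3, using numpy.linalg.qr for each step (illustrative only; no Hessenberg preprocessing, so every step costs O(n³)):

```python
import numpy as np

def qr_simple(A, tol=1e-10, maxit=500):
    """Pure QR iteration: A^(k+1) = R_k Q_k; the diagonal tends to the eigenvalues (real case)."""
    H = A.astype(float).copy()
    for it in range(maxit):
        if np.max(np.abs(np.diag(H, -1))) <= tol:
            break
        Q, R = np.linalg.qr(H)
        H = R @ Q
    return np.diag(H), it

A = np.array([[1., 1., 1.], [1., 2., 3.], [1., 2., 1.]])
print(qr_simple(A))   # approx. 4.56155, -1, 0.43845 (cf. Example 6.5.1)
```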



We can try to speed up our method by decomposing a problem into subproblems. If we have a Hessenberg matrix of the form
$$H = \begin{pmatrix} H_1 & *\\ 0 & H_2 \end{pmatrix},$$
where H_1 and H_2 are themselves upper Hessenberg (i.e., one of the sub-diagonal entries of H vanishes), then the eigenvalue problem for H may be decomposed into an eigenvalue problem for H_1 and one for H_2.
According to [20], a sub-diagonal entry h_{j+1,j} is considered to be “small enough” if
$$|h_{j+1,j}| \le \mathrm{eps}\,\bigl(|h_{jj}| + |h_{j+1,j+1}|\bigr). \qquad (6.5.2)$$
We shall do something simpler, namely we shall decompose a matrix if its least
modulus sub-diagonal entries is less than a given tolerance. The procedure is as
follows: the function for computing eigenvalues using QR iterations finds a decom-
position into two matrices H1 and H2 that calls itself recursively for each of these
matrices.
If one of these matrices is 1 × 1, the eigenvalue is trivially computed, and if it is 2 × 2, then its characteristic polynomial is
$$p_A(x) = \det(A - xI) = x^2 - \operatorname{trace}(A)\,x + \det(A) = x^2 \underbrace{-(a_{11} + a_{22})}_{=:b}\,x + \underbrace{(a_{11}a_{22} - a_{12}a_{21})}_{=:c}.$$
If its discriminant b² − 4c is positive, then A has two real and distinct eigenvalues
$$x_1 = \frac{1}{2}\Bigl(-b - \operatorname{sgn}(b)\sqrt{b^2 - 4c}\Bigr) \qquad\text{and}\qquad x_2 = \frac{c}{x_1},$$
otherwise its eigenvalues are complex, namely
$$\frac{1}{2}\Bigl(-b \pm i\sqrt{4c - b^2}\Bigr);$$
thus we can deal with complex eigenvalues. The function Eigen2x2 returns the
eigenvalues of a 2 × 2 matrix. The idea is implemented in Algorithm 6.4. Effective
QR iterations are given in Algorithm 6.5.
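The 2 × 2 helper can be coded directly from the formulas above (a Python sketch; the name matches the Eigen2x2 of the pseudocode, the rest is an assumption):

```python
import math

def eigen2x2(a11, a12, a21, a22):
    """Eigenvalues of a real 2x2 matrix via the stable quadratic formula."""
    b = -(a11 + a22)                 # p(x) = x^2 + b x + c
    c = a11 * a22 - a12 * a21
    disc = b * b - 4.0 * c
    if disc >= 0:                    # two real eigenvalues
        x1 = 0.5 * (-b - math.copysign(math.sqrt(disc), b))
        return x1, (c / x1 if x1 != 0 else 0.0)
    return (0.5 * (-b + 1j * math.sqrt(-disc)),   # complex conjugate pair
            0.5 * (-b - 1j * math.sqrt(-disc)))

# lower-right 2x2 block of A^(100) in Example 6.5.2
print(eigen2x2(-4.2645, 0.7236, -0.79491, -3.4762))   # ~ -3.87 +/- 0.648i
```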

Algorithm 6.4 QRSplit1a – QR method with partition and treatment of 2 × 2


matrices
Input: Matrix A ∈ Rn×n and tolerance tol
Output: Eigenvalues λ and number of iterations it
if n = 1 then
it := 0; λ := A;
return;
else
if n = 2 then
it := 0;
λ := Eigen2x2(A);
return;
else
H := Hessenberg(A); {Hessenberg form}
[H1 , H2 , it] := QRIter(H, tol);
[λ1 , it1 ] := QRSplit1a(H1 , tol);{recursive calls}
[λ2 , it2 ] := QRSplit1a(H2 , tol);
it := it + it1 + it2 ;
λ := [λ1 , λ2 ];
end if
end if

Algorithm 6.5 QR iterations on a Hessenberg matrix ; used by Algorithm 6.4 – call


[H1 , H2, it] = QRIter(H, t)
Input: Hessenberg matrix H, tolerance tol
Output: Matrices H1 , H2 representing a decomposition of H over the minimum
modulus sub-diagonal entry and number of iterations It
it := 0;
Find the minimum m of modulus of subdiagonal entries in H and its position j;
while m > tol do
it := it + 1;
H := H∗ ; {RQ transformation of H}
Find the minimum m of modulus of sub-diagonal entries in H and its position
j;
end while
H1 := H1:j,1:j ;
H2 := Hj+1:n,j+1:n ;

ε #iterations λ1 λ2 λ3
10−3 12 9.7406 −3.8703 + 0.6479i −3.8703 − 0.6479i
10−4 14 9.7407 −3.8703 + 0.6479i −3.8703 − 0.6479i
10−5 17 9.7407 −3.8703 + 0.6480i −3.8703 − 0.6480i
10−5 19 9.7407 −3.8703 + 0.6480i −3.8703 − 0.6480i
10−5 22 9.7407 −3.8703 + 0.6480i −3.8703 − 0.6480i

Table 6.3: Results for Example 6.5.6

Example 6.5.6. Let us consider again the matrix in Example 6.5.2 and apply Algorithm 6.4 to it. The results are given in Table 6.3. ♦

6.5.2 Spectral shift


Hessenberg matrices allow us to execute each iteration in a shorter time. We shall now try to reduce the number of iterations, that is, to increase the speed of convergence, since:
The convergence rate of the sub-diagonal entries h_{j+1,j} has order of growth
$$\left(\frac{\lambda_{j+1}}{\lambda_j}\right)^{k}, \qquad j = 1, \ldots, n - 1.$$

The keyword here is spectral shift. One observes that for µ ∈ R the matrix A − µI has the eigenvalues λ_1 − µ, …, λ_n − µ. For an arbitrary invertible matrix B the matrix B(A − µI)B^{-1} + µI has the eigenvalues λ_1, …, λ_n – one may shift the spectrum of a matrix forward and backwards by means of a similarity transformation. One sorts the eigenvalues µ_1, …, µ_n such that
$$|\mu_1 - \mu| > |\mu_2 - \mu| > \cdots > |\mu_n - \mu|, \qquad \{\mu_1, \ldots, \mu_n\} = \{\lambda_1, \ldots, \lambda_n\};$$
if µ is close to µ_n and the QR method starts with H^{(0)} = A − µI, then the sub-diagonal entry h^{(k)}_{n,n-1} converges very fast to zero. It is even better if the spectral shift is performed at each step individually. In addition, we may choose heuristically as approximation for µ the value h^{(k)}_{nn}. One gets the following iterative scheme
 
$$H^{(k+1)} = \bigl(H^{(k)} - \mu_k I\bigr)_* + \mu_k I, \qquad \mu_k := h^{(k)}_{nn}, \quad k \in \mathbb{N}_0,$$
with the starting matrix H^{(0)} = QAQ^T. Algorithm 6.6 gives a variant of the method which treats complex eigenvalues. It uses Algorithm 6.7. Within the latter algorithm, (H − H_{n,n}I_n)_* in line 6 means the RQ transformation of the matrix H − H_{n,n}I_n.

Algorithm 6.6 Spectral shift QR method, partition and treatment of complex eigen-
values
Input: Matrix A ∈ Rn×n and tolerance tol
Output: Eigenvalues λ of A and number of iterations It
It := 0;
if n = 1 then
λ := A;
return
else if n = 2 then
λ := Eigen2x2(A);
return
else
H := Hessenberg(A); {convert to Hessenberg form}
[H1 , H2 , It] := QRIter2(H, t)
[λ1 , It1 ] := QRSplit2(H1 , tol) {recursive call}
[λ2 , It2 ] := QRSplit2(H2 , tol) {recursive call}
It := It + It1 + It2 ;
λ = [λ1 , λ2 ];
end if

Algorithm 6.7 QR iteration and partitioning


It := 0;
Find the minimum m of modulus of sub-diagonal entries in H and its position j;
while m > tol do
It := It + 1;
H := (H − Hn,n In )∗ + Hn,n In ;
Find the minimum m of modulus of sub-diagonal entries in H and its position
j;
end while
H1 := H1:j,1:j ;
H2 := Hj+1:n,j+1:n ;

Remark 6.5.7. If the shift value µ is sufficiently close to an eigenvalue λ, then the
matrix could be decomposed in a single iterative step. ♦
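In code, the spectral shift changes only one line of the iteration (a Python sketch under the same assumptions as before; the shift µ_k = h^{(k)}_{nn} is the heuristic from the text, and no deflation is performed):

```python
import numpy as np

def qr_shifted(A, tol=1e-10, maxit=500):
    """QR iteration with the shift mu_k = h_nn^(k)."""
    H = A.astype(float).copy()
    n = H.shape[0]
    I = np.eye(n)
    for it in range(maxit):
        if np.max(np.abs(np.diag(H, -1))) <= tol:
            break
        mu = H[n - 1, n - 1]                 # heuristic shift
        Q, R = np.linalg.qr(H - mu * I)
        H = R @ Q + mu * I                   # similar to H, spectrum shifted back
    return np.diag(H), it

A = np.array([[1., 1., 1.], [1., 2., 3.], [1., 2., 1.]])
print(qr_shifted(A))   # compare the iteration count with the unshifted iteration
```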

6.5.3 Double shift QR method


It can be shown that spectral shift QR method converges quadratically, that is the
error is, for ρ < 1,
O(ρ2k ) instead of O(ρk ).
This nice idea works only for real eigenvalues; for complex eigenvalue it is prob-
lematic. Nevertheless, we can exploit the fact that eigenvalues appear in conjugated
pairs. This leads us to “double shift methods”:
Instead of shifting the spectrum with an eigenvalue, approximated heuris-
(k)
tically by hn,n , we could rather perform two shifts in a step, namely with
eigenvalues of " #
(k) (k)
hn−1,n−1 hn−1,n
B= (k) (k) .
hn−1,n hn,n

There are two possibilities: either both eigenvalues µ and µ′ of B are real and we proceed as above, or we have a pair of complex conjugated eigenvalues, µ and µ̄. As we shall see, the second case can also be treated in real arithmetic. Let Q_k, Q′_k ∈ C^{n×n} and R_k, R′_k ∈ C^{n×n} be the matrices of the complex QR decompositions
$$Q_k R_k = H^{(k)} - \mu I, \qquad Q'_k R'_k = R_k Q_k + (\mu - \bar\mu) I.$$
Then it holds
$$\begin{aligned} H^{(k+1)} &:= R'_k Q'_k + \bar\mu I = (Q'_k)^*\bigl(R_k Q_k + (\mu - \bar\mu)I\bigr)Q'_k + \bar\mu I\\ &= (Q'_k)^* R_k Q_k Q'_k + \mu I = (Q'_k)^* Q_k^*\bigl(H^{(k)} - \mu I\bigr)Q_k Q'_k + \mu I\\ &= \underbrace{(Q_k Q'_k)^*}_{=U^*}\, H^{(k)}\, \underbrace{Q_k Q'_k}_{=U}. \end{aligned}$$
Using the matrix S = R′_k R_k we have
$$\begin{aligned} US &= Q_k Q'_k R'_k R_k = Q_k\bigl(R_k Q_k + (\mu - \bar\mu)I\bigr)R_k\\ &= Q_k R_k Q_k R_k + (\mu - \bar\mu)Q_k R_k = (H^{(k)} - \mu I)^2 + (\mu - \bar\mu)(H^{(k)} - \mu I)\\ &= (H^{(k)})^2 - 2\mu H^{(k)} + \mu^2 I + (\mu - \bar\mu)H^{(k)} - (\mu^2 - \mu\bar\mu)I\\ &= (H^{(k)})^2 - (\mu + \bar\mu)H^{(k)} + \mu\bar\mu I =: X. \end{aligned} \qquad (6.5.3)$$

#iterations in R #iterations in C
ε alg. 6.6 alg. 6.8 alg. 6.6 alg. 6.8
1e-010 1 1 9 4
1e-020 9 2 17 5
1e-030 26 3 45 5

Table 6.4: Comparations in Example 6.5.9

If µ = α + iβ, then µ + µ̄ = 2α and µµ̄ = |µ|² = α² + β², hence the matrix X on the right-hand side of (6.5.3) is real, so it has a real QR decomposition X = QR, and by Lemma 6.4.5 there exists a phase matrix Θ ∈ C^{n×n} such that U = ΘQ. If we perform the iteration further in real arithmetic, we obtain the double shift QR method
$$Q_k R_k = (H^{(k)})^2 - \bigl(h^{(k)}_{n-1,n-1} + h^{(k)}_{n,n}\bigr)H^{(k)} + \bigl(h^{(k)}_{n-1,n-1}h^{(k)}_{n,n} - h^{(k)}_{n-1,n}h^{(k)}_{n,n-1}\bigr)I, \qquad (6.5.4)$$
$$H^{(k+1)} = Q_k^T H^{(k)} Q_k.$$

Remark 6.5.8 (Double shift QR method). 1. The matrix X in (6.5.3) is no longer a Hessenberg matrix, since it has one additional nonzero subdiagonal. Nevertheless, one can easily compute the QR decomposition of X using only 2n − 3 Givens rotations, instead of the n − 1 required for a Hessenberg matrix.

2. Because of its high complexity, the multiplication QTk H (k) Qk is no more an


effective method for our iteration; we can fix this drawback, see for example
[14] or [33, pages 272–278].

3. Naturally, H (k+1) could be converted to Hessenberg form.

4. Double shift QR method is useful only when A has complex eigenvalues; for
symmetric matrices it is not advantageous. ♦

Double shift QR method with partitioning and treatment of 2 × 2 matrices is given in


Algorithm 6.8. It calls Algorithm 6.9.

Example 6.5.9. We apply Algorithms 6.6 and 6.8 to matrices in Examples 6.5.1 and
6.5.2. One gets the results in Table 6.4. The good behavior of double shift QR method
can be explained by the idea to obtain two eigenvalues simultaneously. ♦

Algorithm 6.8 Double shift QR method with partition and treating 2 × 2 matrices
Input: Matrix A ∈ Rn×n and tolerance tol
Output: Eigenvalues λ of A and number of iterations It
It := 0;
if n = 1 then
λ := A;
return
else if n = 2 then
λ := Eigen2x2(A);
return
else
H := Hessenberg(A); {convert to Hessenberg form}
[H1 , H2 , It] := QRDouble(H, t)
[λ1 , It1 ] := QRSplit2(H1 , tol) {recursive call}
[λ2 , It2 ] := QRSplit2(H2 , tol) {recursive call}
It := It + It1 + It2 ;
λ = [λ1 , λ2 ];
end if

Algorithm 6.9 Double shift QR iterations and Hessenberg transformation


It := 0;
Find the minimum m of modulus of sub-diagonal entries in H and its position j;
while m > tol do
It := It + 1;
X := H 2 − (Hn−1,n−1 + Hn,n )H + (Hn−1,n−1 Hn,n − Hn,n−1 Hn−1,n )In ;
Find QR decomposition X = QR of X;
H := Hessenberg(QT HQ);
Find the minimum m of modulus of sub-diagonal entries in H and its position
j;
end while
H1 := H1:j,1:j ;
H2 := Hj+1:n,j+1:n ;
Chapter 7

Numerical Solution of Ordinary


Differential Equations

7.1 Differential Equations


Let us consider the initial value (Cauchy¹) problem: determine a vector-valued function y ∈ C¹[a, b], y : [a, b] → R^d, such that
$$(CP)\qquad \begin{cases} \dfrac{dy}{dx} = f(x, y), & x \in [a, b],\\[4pt] y(a) = y_0. \end{cases} \qquad (7.1.1)$$

We shall emphasize two important classes of such problems:

(i) for d = 1 we have a single first-order differential equation

y 0 = f (x, y),


y(a) = y0 .

¹ Augustin Louis Cauchy (1789–1857), French mathematician, active in Paris, is considered to be the father of modern analysis. He provided a firm foundation for analysis by basing it on a rigorous concept of limit. He is also the creator of complex analysis, where “Cauchy's formula” plays a central role. In addition, Cauchy's name is attached to pioneering contributions to the theory of ordinary and partial differential equations, mainly in existence and uniqueness problems. Like other great mathematicians of the 18th and 19th centuries, his work encompasses geometry, algebra, number theory, mechanics and theoretical physics.


(ii) for d > 1 we have a system of first-order differential equations
$$\begin{cases} \dfrac{dy^i}{dx} = f^i(x, y^1, y^2, \ldots, y^d), & i = 1, \ldots, d,\\[4pt] y^i(a) = y_0^i, & i = 1, \ldots, d. \end{cases}$$

Remark 7.1.1. Let us consider a single d-th order differential equation,
$$u^{(d)} = g(x, u, u', \ldots, u^{(d-1)}),$$
with the initial conditions u^{(i)}(a) = u_0^i, i = 0, …, d − 1. This problem is easily brought into the form (7.1.1) by defining
$$y^i = u^{(i-1)}, \qquad i = 1, \ldots, d.$$
Then
$$\begin{aligned} \frac{dy^1}{dx} &= y^2, & y^1(a) &= u_0^0,\\ \frac{dy^2}{dx} &= y^3, & y^2(a) &= u_0^1,\\ &\;\;\vdots\\ \frac{dy^{d-1}}{dx} &= y^d, & y^{d-1}(a) &= u_0^{d-2},\\ \frac{dy^d}{dx} &= g(x, y^1, y^2, \ldots, y^d), & y^d(a) &= u_0^{d-1}, \end{aligned} \qquad (7.1.2)$$
which has the form (7.1.1) with very special (linear) functions f^1, f^2, …, f^{d-1}, and f^d(x, y) = g(x, y). ♦

We recall from the theory of differential equation the following basic existence
and uniqueness.

Theorem 7.1.2. Assume that f (x, y) is continuous in the first variable for x ∈ [a, b]
and with respect to the second satisfies a uniform Lipschitz condition

kf (x, y) − f (x, y ∗ )k ≤ Lky − y ∗ k, y, y ∗ ∈ Rd , (7.1.3)

where k · k is some vector norm. Then the initial value problem (CP) has a unique
solution y(x), a ≤ x ≤ b, ∀ y0 ∈ Rd . Moreover, y(x) depends continuously on a
and y0 .

The Lipschitz condition (7.1.3) certainly holds if all functions ∂f^i/∂y^j(x, y), i, j = 1, …, d, are continuous in the y-variables and bounded on [a, b] × R^d. This is the case for linear systems of differential equations, where
$$f^i(x, y) = \sum_{j=1}^{d} a_{ij}(x)\,y^j + b_i(x), \qquad i = 1, \ldots, d,$$
and a_{ij}(x), b_i(x) are continuous functions on [a, b].
Often the Lipschitz condition (7.1.3) holds only locally, on some compact neighborhood in which y(x) remains within a compact set D.

7.2 Numerical Methods


One can distinguish between analytic approximation methods and discrete variable
methods. In the former one tries to find approximations ya (x) ≈ y(x) to the exact
solutions, valid for all x ∈ [a, b]. This usually takes the form of a truncated series expansion, either in powers of x, in Chebyshev polynomials, or in some other system
of basis functions. In discrete-variable methods, one attempts to find approximations
un ∈ Rd of y(xn ) only at discrete points xn ∈ [a, b]. The abscissas xn may be prede-
termined (e.g., equally spaced on [a, b]), or, more likely, are generated dynamically
as a part of the integration process.
If desired, one can then from these discrete approximations {un } obtain again an
approximation ya (x) defined for all x ∈ [a, b], either by interpolation or, by a contin-
uation mechanism built into the approximation method itself. We are concerned only
with discrete one step methods, that is methods in which un+1 is determined solely
from a knowledge of xn , un and the step h to proceed from xn to xn+1 = xn + h.
In a k-step method (k > 1) knowledge of k − 1 additional points (xn−j , un−j ),
j = 1, 2, . . . , k − 1, is required to advance the solution.
When describing a single step of a one-step method, it suffices to show how
one proceeds from a generic point (x, y), x ∈ [a, b], y ∈ Rd to the “next” point
(x + h, ynext ). We refer to this as the local description of the one-step method. This
also includes a discussion of the local accuracy, that is how closely ynext agrees at
x + h with the solution of the differential equation. A one-step method solving the
initial value problem (7.1.1) effectively generates a grid function {un }, un ∈ Rd , on
a grid a = x0 < x1 < x2 < · · · < xN −1 < xN = b covering the interval [a, b],
whereby un is intended to approximate the exact solution y(x) at x = xn . The point
(xn+1 , un+1 ) is obtained from the point (xn , un ) by applying a one-step method with
an appropriate step hn = xn+1 − xn . This is referred to as the global description
of a one-step method. Questions of interest here are the behavior of the global error

un − y(xn ), in particular stability and convergence, and the choice of hn to proceed


from one grid point xn to the next, xn+1 = xn + hn .

7.3 Local Description of One-Step Methods


Given a generic point x ∈ [a, b], y ∈ Rd , we define a single step of a one step method
by
ynext = y + hΦ(x, y; h), h > 0. (7.3.1)
The function Φ : [a, b]×Rd ×R+ → Rd may be thought as the approximate increment
per unit step, or the approximate difference quotient, and it defines the method. Along
with (7.3.1), we consider the solution u(t) of the differential equation (7.1.1) passing
through the point (x, y), that is, the local initial value problem
 du
dt = f (t, u) (7.3.2)
u(t) = y t ∈ [t, t + h]
We call u(t) the reference solution. The vector y_next in (7.3.1) is intended to approximate u(x + h). How successfully this is done is measured by the truncation error, defined as follows.
Definition 7.3.1. The truncation error of the method Φ at the point (x, y) is defined by
$$T(x, y; h) = \frac{1}{h}\bigl[y_{\mathrm{next}} - u(x + h)\bigr]. \qquad (7.3.3)$$
Thus the truncation error is a vector valued function of d + 2 variables. Using
(7.3.1) and (7.3.2), we can write for it alternatively,
$$T(x, y; h) = \Phi(x, y; h) - \frac{1}{h}\bigl[u(x + h) - u(x)\bigr], \qquad (7.3.4)$$
showing that T is the difference between the approximate and exact increment per
unit step.
Definition 7.3.2. The method Φ is called consistent if
T (x, y; h) → 0 as h → 0, (7.3.5)
uniformly for (x, y) ∈ [a, b] × Rd .
By (7.3.4) and (7.3.3)) we have consistency if and only if
Φ(x, y; 0) = f (x, y), x ∈ [a, b], y ∈ Rd . (7.3.6)
A finer description of local accuracy is provided by the next definition based on
the notion of a local truncation error.

Definition 7.3.3. The method Φ is said to have order p if for some vector norm k · k,

kT (x, y; h)k ≤ Chp , (7.3.7)

uniformly on [a, b] × Rd , with a constant C not depending on x, y and h.

We express briefly this property as

T (x, y; h) = O(hp ), h → 0. (7.3.8)

Note that p > 0 implies consistency. Usually, p ∈ N∗ . It is called the exact order,
if (7.3.7) does not hold for any larger p.
Definition 7.3.4. A function τ : [a, b] × Rd → Rd that satisfies τ (x, y) 6≡ 0 and

T (x, y; h) = τ (x, y)hp + O(hp+1 ), h→0 (7.3.9)

is called the principal error function .

The principal error function determines the leading term in the truncation error.
The number p in (7.3.9) is the exact order of the method since τ 6≡ 0.
All the preceding definitions are made with the idea in mind that h > 0 is a small
number. Then the larger is p, the more accurate is the method.

7.4 Examples of One-Step Methods


Some of the oldest methods are motivated by simple geometric considerations based
on the slope field defined by the right-hand side of the differential equation. This in-
clude the Euler and modified Euler methods. More accurate and sophisticated meth-
ods are based on Taylor expansion.

7.4.1 Euler’s method


Euler proposed his method in 1768, in the early days of calculus. It consist of simply
following the slope at the generic point (x, y) over an interval of length h

ynext = y + hf (x, y). (7.4.1)

(See Figure 7.1).


Thus, Φ(x, y; h) = f (x, y) does not depend on h and by (7.3.6) the method is
consistent. For the truncation error we have by (7.3.3)
1
T (x, y; h) = f (x, y) − [u(x + h) − u(x)], (7.4.2)
h

Figure 7.1: Euler’s method – the exact solution (continuous line) and the approximate
solution (dashed line)

where u(t) is the reference solution defined in (7.3.2). Since u'(x) = f(x, u(x)) = f(x, y), we can write, using Taylor's theorem,
$$\begin{aligned} T(x, y; h) &= u'(x) - \frac{1}{h}\bigl[u(x + h) - u(x)\bigr]\\ &= u'(x) - \frac{1}{h}\Bigl[u(x) + h u'(x) + \tfrac{1}{2}h^2 u''(\xi) - u(x)\Bigr]\\ &= -\tfrac{1}{2}\,h\,u''(\xi), \qquad \xi \in (x, x + h), \end{aligned} \qquad (7.4.3)$$

assuming u ∈ C²[x, x + h]. This is certainly true if f ∈ C¹([a, b] × R^d). Now, differentiating (7.3.2) totally with respect to t and then setting t = ξ yields
$$T(x, y; h) = -\frac{1}{2}\,h\,[f_x + f_y f](\xi, u(\xi)), \qquad (7.4.4)$$
where fx is the partial derivative of f with respect to x and fy the Jacobian of f with
respect to the y-variables. If, in the spirit of Theorem 7.1.2, we assume that f and all
its first partial derivatives are uniformly bounded in [a, b]×Rd , there exists a constant
C independent of x, y and h such that

kT (x, y; h)k ≤ Ch. (7.4.5)



Thus, Euler’s method has the order p = 1. If we make the same assumption
about all second-order partial derivatives of f we have u00 (ξ) = u00 (x) + O(h) and,
therefore from (7.4.3),

1
T (x, y; h) = − h[fx + fy f ](x, y) + O(h2 ), h → 0, (7.4.6)
2

showing that the principal error function is given by

1
τ (x, y) = − [fx + fy f ](x, y). (7.4.7)
2

Unless fx + fy f ≡ 0, the order of Euler’s method is exactly p = 1.
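A minimal Python sketch of Euler's method (7.4.1) on a grid with constant step (an illustration; the scalar test problem is an assumption):

```python
import numpy as np

def euler(f, a, b, y0, N):
    """Solve y' = f(x, y), y(a) = y0 on [a, b] with N Euler steps."""
    h = (b - a) / N
    x = a + h * np.arange(N + 1)
    y = np.empty(N + 1)
    y[0] = y0
    for n in range(N):
        y[n + 1] = y[n] + h * f(x[n], y[n])     # y_next = y + h f(x, y)
    return x, y

# y' = -2y, y(0) = 1; exact solution exp(-2x)
x, y = euler(lambda x, y: -2.0 * y, 0.0, 1.0, 1.0, 100)
print(abs(y[-1] - np.exp(-2.0)))    # O(h) global error
```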

7.4.2 Method of Taylor expansion


We have seen that Euler’s method basically amounts to truncating the Taylor ex-
pansion of the reference solution after its second term. It is a natural idea, already
proposed by Euler, to use more terms of the Taylor expansion. This requires the
computation of successive “total derivatives” of f ,

$$f^{[0]}(x, y) = f(x, y), \qquad f^{[k+1]}(x, y) = f_x^{[k]}(x, y) + f_y^{[k]}(x, y)\,f(x, y), \quad k = 0, 1, 2, \ldots \qquad (7.4.8)$$
which determine the successive derivatives of the reference solution u(t) of (7.3.2) by virtue of
$$u^{(k+1)}(t) = f^{[k]}(t, u(t)), \qquad k = 0, 1, 2, \ldots \qquad (7.4.9)$$

These, for t = x, become

u(k+1) (x) = f [k] (x, y), k = 0, 1, 2, . . . (7.4.10)

and are used to form the Taylor series approximation according to
$$y_{\mathrm{next}} = y + h\left[f^{[0]}(x, y) + \frac{1}{2}h f^{[1]}(x, y) + \cdots + \frac{1}{p!}h^{p-1} f^{[p-1]}(x, y)\right], \qquad (7.4.11)$$
that is,
$$\Phi(x, y; h) = f^{[0]}(x, y) + \frac{1}{2}h f^{[1]}(x, y) + \cdots + \frac{1}{p!}h^{p-1} f^{[p-1]}(x, y). \qquad (7.4.12)$$

For the truncation error, using (7.4.10) and (7.4.12) and assuming f ∈ C^p([a, b] × R^d), we obtain from Taylor's theorem
$$\begin{aligned} T(x, y; h) &= \Phi(x, y; h) - \frac{1}{h}\bigl[u(x + h) - u(x)\bigr]\\ &= \Phi(x, y; h) - \sum_{k=0}^{p-1} \frac{h^k}{(k+1)!}\,u^{(k+1)}(x) - \frac{h^p}{(p+1)!}\,u^{(p+1)}(\xi)\\ &= -u^{(p+1)}(\xi)\,\frac{h^p}{(p+1)!}, \qquad \xi \in (x, x + h), \end{aligned}$$
so that
$$\|T(x, y; h)\| \le \frac{C_p}{(p+1)!}\,h^p,$$
where C_p is a bound on the p-th total derivative of f. Thus, the method has the exact order p (unless f^{[p]}(x, y) ≡ 0), and the principal error function is
$$\tau(x, y) = -\frac{1}{(p+1)!}\,f^{[p]}(x, y). \qquad (7.4.13)$$
The necessity of computing many partial derivatives in (7.4.8) was a discouraging
factor in the past, when this had to be done by hand. But nowadays, this task can be
delegated to the computer, so that the method has become again a viable option.

7.4.3 Improved Euler methods


There is too much inertia in Euler’s method: one should not follow the same initial
slope over the whole interval of length h, since along this line segment the slope
defined by the slope field of the differential equation changes. This suggests several
alternatives. For example, we may wish to reevaluate the slope halfway through the
line segment — retake the pulse of the differential equation, as it were — and then
follow this revised slope over the whole interval (cf. Figure 7.2). In formula,
 
$$y_{\mathrm{next}} = y + h\,f\!\left(x + \tfrac{1}{2}h,\; y + \tfrac{1}{2}h f(x, y)\right) \qquad (7.4.14)$$
or
$$\Phi(x, y; h) = f\!\left(x + \tfrac{1}{2}h,\; y + \tfrac{1}{2}h f(x, y)\right). \qquad (7.4.15)$$
Note the characteristic “nesting” of f that is required here. For programming purposes it may be desirable to undo the nesting and write
$$\begin{aligned} K_1(x, y) &= f(x, y),\\ K_2(x, y; h) &= f\!\left(x + \tfrac{1}{2}h,\; y + \tfrac{1}{2}h K_1\right),\\ y_{\mathrm{next}} &= y + h K_2. \end{aligned} \qquad (7.4.16)$$

Figure 7.2: Modified Euler method

In other words, we are taking two trial slopes, K1 and K2 , one at the initial point and
the other nearby, and then taking the latter as the final slope. The method is called
modified Euler method.

We could equally well take the second trial slope at (x + h, y + hf (x, y)), but
then, having waited too long before reevaluating the slope, take as the final
slope the average of the two slopes:

K1 (x, y) = f (x, y)
K2 (x, y; h) = f (x + h, y + hK1 ) (7.4.17)
ynext = y + (1/2) h (K1 + K2 ).

This is sometimes referred to as the Heun method. The effect of both modifications is to
raise the order by 1, as shown in the sequel.
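A minimal sketch in Python of the two increments (the names are illustrative; f may be scalar- or vector-valued):

def modified_euler_step(f, x, y, h):
    # reevaluate the slope halfway through the interval, cf. (7.4.14)-(7.4.16)
    K1 = f(x, y)
    K2 = f(x + 0.5 * h, y + 0.5 * h * K1)
    return y + h * K2

def heun_step(f, x, y, h):
    # average the slopes at the two ends of the interval, cf. (7.4.17)
    K1 = f(x, y)
    K2 = f(x + h, y + h * K1)
    return y + 0.5 * h * (K1 + K2)

Both steps use two evaluations of f; as shown in the sequel, both raise the order to p = 2.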

7.5 Runge-Kutta Methods


We look for Φ of the form:
Φ(x, y; h) = Σ_{s=1}^{r} αs Ks
K1 (x, y) = f (x, y)    (7.5.1)
Ks (x, y; h) = f ( x + µs h, y + h Σ_{j=1}^{s−1} λsj Kj ),    s = 2, 3, . . . , r

It is natural in (7.5.1) to impose the conditions

µs = Σ_{j=1}^{s−1} λsj ,    s = 2, 3, . . . , r,        Σ_{s=1}^{r} αs = 1,    (7.5.2)

where the first set is equivalent to

Ks (x, y; h) = u0 (x + µs h) + O(h2 ), s ≥ 2,

and the second is nothing but the consistency condition (cf. (7.3.6)) (i.e. Φ(x, y; 0) =
f (x, y)).
We call (7.5.1) an explicit r-stage Runge-Kutta method; it requires r evaluations
of the right-hand side f of the differential equation. The conditions for attaining a given
order lead to a nonlinear system. Let p∗ (r) be the maximum attainable order (for arbitrary
sufficiently smooth f ) of an explicit r-stage Runge-Kutta method. Kutta 2 has shown in 1901
that
p∗ (r) = r,    r = 1, 4.

2 Wilhelm Martin Kutta (1867-1944) was a German applied
mathematician. He is best known for his work on the numerical
solution of ODEs. He made important contributions to the
application of conformal mapping to hydro- and aerodynamical
problems (the Kutta-Joukowski formula for the lift exerted on an airfoil).

We can consider implicit r-stage Runge-Kutta methods


Φ(x, y; h) = Σ_{s=1}^{r} αs Ks (x, y; h),
Ks = f ( x + µs h, y + h Σ_{j=1}^{r} λsj Kj ),    s = 1, r,    (7.5.3)

in which the last r equations form a system of (in general nonlinear) equations in
the unknowns K1 , K2 , . . . , Kr . Since each of these is a vector in Rd , before we
can form the approximate increment Φ we must solve a system of rd equations in rd
unknowns. Semi-implicit Runge-Kutta methods, where the summation in the formula
for Ks extends from j = 1 to j = s, require less work. This yields r systems of
equations, each having only d unknowns, the components of Ks .
Already in the case of explicit Runge-Kutta methods, and even more so in im-
plicit methods, we have at our disposal a large number of parameters which we can
choose to achieve the maximum possible order for all sufficiently smooth f . The
considerable computational expenses involved in implicit and semi-implicit methods
can only be justified in special circumstances, for example, stiff problems. The rea-
son is that implicit methods can be made not only to have higher order than explicit
methods, but also to have better stability properties.
Example 7.5.1. Let
Φ(x, y; h) = α1 K1 + α2 K2 ,
where

K1 (x, y) = f (x, y),


K2 (x, y; h) = f (x + µ2 h, y + λ21 hK1 ),
λ21 = µ2 .

We now have three parameters, α1 , α2 , and µ := µ2 . A systematic way of determining
the maximum order p is to expand both Φ(x, y; h) and h−1 [u(x + h) − u(x)] in
powers of h and to match as many terms as we can, without imposing constraints on
f.
To expand Φ, we need Taylor’s expansion for (vector-valued) functions of several
variables
f (x + ∆x, y + ∆y) = f + fx ∆x + fy ∆y+
+ (1/2)[fxx (∆x)2 + 2fxy ∆x∆y + (∆y)T fyy (∆y)] + · · · ,    (7.5.4)

where fy denotes the Jacobian of f , and fyy = [fyy i ] is the vector of Hessian matrices

of f . In (7.5.4), all functions and partial derivatives are understood to be evaluated at


(x, y). Letting ∆x = µh, ∆y = µhf then gives
K2 (x, y; h) = f + µh(fx + fy f ) + (1/2) µ2 h2 (fxx + 2fxy f + f T fyy f ) + O(h3 ),    (7.5.5)

(1/h)[u(x + h) − u(x)] = u′(x) + (1/2) h u′′(x) + (1/6) h2 u′′′(x) + O(h3 ),    (7.5.6)
where
u′(x) = f
u′′(x) = f [1] = fx + fy f
u′′′(x) = f [2] = (f [1] )x + (f [1] )y f = fxx + fxy f + fy fx + (fxy + (fy f )y )f =
        = fxx + 2fxy f + f T fyy f + fy (fx + fy f ),
and where in the last equation we have used
(fy f )y f = f T fyy f + fy2 f.
Now,
T (x, y; h) = α1 K1 + α2 K2 − (1/h)[u(x + h) − u(x)],
wherein we substitute the expansions (7.5.5) and (7.5.6). We find
 
T (x, y; h) = (α1 + α2 − 1)f + (α2 µ − 1/2) h (fx + fy f ) +
    + (1/2) h2 [ (α2 µ2 − 1/3)(fxx + 2fxy f + f T fyy f ) − (1/3) fy (fx + fy f ) ] + O(h3 )
    (7.5.7)

We cannot enforce the condition that the h2 coefficient be zero without imposing
severe restrictions on f . Thus, the maximum order is 2 and we obtain it for

α1 + α2 = 1,    α2 µ = 1/2.

The solution

α1 = 1 − α2 ,    µ = 1/(2α2 )

depends upon an arbitrary parameter, α2 ≠ 0.
For α2 = 1 we obtain the modified Euler method, and for α2 = 1/2 Heun's method. ♦
We shall mention the classical Runge-Kutta formula of order p = 4.

Φ(x, y; h) = (1/6)(K1 + 2K2 + 2K3 + K4 )
K1 (x, y; h) = f (x, y)
K2 (x, y; h) = f ( x + (1/2) h, y + (1/2) h K1 )    (7.5.8)
K3 (x, y; h) = f ( x + (1/2) h, y + (1/2) h K2 )
K4 (x, y; h) = f (x + h, y + h K3 )

When f does not depend on y, then (7.5.8) becomes Simpson's formula. Runge's 3
idea was to generalize Simpson's quadrature formula to ordinary differential equations.
He succeeded only partially; his formula had r = 4 and p = 3. The method
(7.5.8) was discovered by Kutta in 1901 through a systematic search.
The classical 4th order Runge-Kutta method on a grid of N + 1 equally spaced
points is given by Algorithm 7.1.

Algorithm 7.1 4th order Runge-Kutta method


Input: endpoints a, b; number of steps N ; initial value α.
Output: N + 1 abscissae t and the approximations w for y at t.
h := (b − a)/N ;
t0 := a;
w0 := α;
for i := 0 to N − 1 do
K1 := hf (ti , wi );
K2 := hf (ti + h/2, wi + K1 /2);
K3 := hf (ti + h/2, wi + K2 /2);
K4 := hf (ti + h, wi + K3 );
wi+1 := wi + (K1 + 2 ∗ K2 + 2 ∗ K3 + K4 )/6;
ti+1 := ti + h;
end for
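A direct transcription of Algorithm 7.1 into Python (a sketch; NumPy is used so that f may return a float or an array, and the names rk4, alpha are ours):

import numpy as np

def rk4(f, a, b, alpha, N):
    # classical 4th order Runge-Kutta on N equal steps, cf. Algorithm 7.1
    h = (b - a) / N
    t = np.linspace(a, b, N + 1)
    w = np.empty((N + 1,) + np.shape(alpha))
    w[0] = alpha
    for i in range(N):
        K1 = h * f(t[i], w[i])
        K2 = h * f(t[i] + h / 2, w[i] + K1 / 2)
        K3 = h * f(t[i] + h / 2, w[i] + K2 / 2)
        K4 = h * f(t[i] + h, w[i] + K3)
        w[i + 1] = w[i] + (K1 + 2 * K2 + 2 * K3 + K4) / 6
    return t, w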

3 Carle David Tolmé Runge (1856-1927) was active in the famous
Göttingen school of mathematics and is one of the pioneers
of numerical mathematics. He is best known for the
Runge-Kutta formula in ordinary differential equations, for
which he provided the basic idea. He also made notable contributions
to approximation theory in the complex plane.

ti    Approximations    Exact values    Error


0.0 1 1 0
0.1 1.00483750000 1.00483741804 8.19640e-008
0.2 1.01873090141 1.01873075308 1.48328e-007
0.3 1.04081842200 1.04081822068 2.01319e-007
0.4 1.07032028892 1.07032004604 2.42882e-007
0.5 1.10653093442 1.10653065971 2.74711e-007
0.6 1.14881193438 1.14881163609 2.98282e-007
0.7 1.19658561867 1.19658530379 3.14880e-007
0.8 1.24932928973 1.24932896412 3.25617e-007
0.9 1.30656999120 1.30656965974 3.31459e-007
1.0 1.36787977441 1.36787944117 3.33241e-007

Table 7.1: Numerical results for Example 7.5.2

Example 7.5.2. Using the 4th order Runge-Kutta method for the initial value problem
y 0 = −y + t + 1, t ∈ [0, 1]
y(0) = 1,
with h = 0.1, N = 10, and ti = 0.1i we obtain the results given in Table 7.1. ♦
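The table can be reproduced with the rk4 sketch given after Algorithm 7.1; the exact solution of this linear problem is y(t) = t + e−t , so, for instance, the last row is obtained by

import numpy as np
t, w = rk4(lambda t, y: -y + t + 1, 0.0, 1.0, 1.0, 10)
print(w[-1], 1.0 + np.exp(-1.0), abs(w[-1] - (1.0 + np.exp(-1.0))))   # error about 3.3e-7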

It is usual to associate to each r-stage Runge-Kutta method (7.5.3) the tableau

µ1    λ11   λ12   . . .   λ1r
µ2    λ21   λ22   . . .   λ2r
 ⋮      ⋮     ⋮             ⋮
µr    λr1   λr2   . . .   λrr
      α1    α2    . . .   αr

(in matrix form: the vector µ and the matrix Λ in the upper block, the row αT below),
called the Butcher table. For an explicit method µ1 = 0 and Λ is lower triangular with
a null main diagonal. We can associate to the first r lines of a Butcher table the
quadrature formulas ∫_0^{µs} u(t) dt ≈ Σ_{j=1}^{r} λsj u(µj ),  s = 1, r,  and to the last line the
quadrature formula ∫_0^1 u(t) dt ≈ Σ_{s=1}^{r} αs u(µs ). If the corresponding degrees of exactness
are ds = qs − 1, 1 ≤ s ≤ r + 1 (ds = ∞ if µs = 0 and all λsj = 0), then
Peano's Theorem implies that the representation of the remainder involves the qs th
derivative of u; setting u(t) = y′(x + th) one obtains

[y(x + µs h) − y(x)]/h − Σ_{j=1}^{r} λsj y′(x + µj h) = O(hqs ),    s = 1, r

and

[y(x + h) − y(x)]/h − Σ_{s=1}^{r} αs y′(x + µs h) = O(h^{q_{r+1}}).

For the classical 4th order Runge-Kutta method (7.5.8) the Butcher table is:

0      0
1/2    1/2    0
1/2    0      1/2    0
1      0      0      1      0
       1/6    2/6    2/6    1/6
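Since an explicit method is completely specified by the data (µ, Λ, α) of its Butcher table, a single stepper can be driven by the table. A minimal sketch in Python (illustrative names; Λ is assumed strictly lower triangular, as for explicit methods):

def rk_step(f, x, y, h, mu, Lam, alpha):
    # one step of the explicit r-stage method (7.5.1) defined by its Butcher table
    r = len(alpha)
    K = [f(x, y)]
    for s in range(1, r):
        ys = y + h * sum(Lam[s][j] * K[j] for j in range(s))
        K.append(f(x + mu[s] * h, ys))
    return y + h * sum(alpha[s] * K[s] for s in range(r))

# classical RK4, cf. the table above
mu    = [0, 1/2, 1/2, 1]
Lam   = [[0, 0, 0, 0], [1/2, 0, 0, 0], [0, 1/2, 0, 0], [0, 0, 1, 0]]
alpha = [1/6, 2/6, 2/6, 1/6]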

7.6 Global Description of One-Step Methods


Global description of one-step methods is best done in terms of grid and grid func-
tions.
A grid on the interval [a, b] is a set of points {xn }_{n=0}^{N} such that

a = x0 < x1 < x2 < · · · < xN −1 < xN = b, (7.6.1)


with grid lengths hn defined by
hn = xn+1 − xn , n = 0, 1, . . . , N − 1. (7.6.2)
The fineness of the grid is measured by
|h| = max_{0≤n≤N −1} hn .    (7.6.3)

We shall use the letter h to denote the collection of lengths h = {hn }. If h0 =
h1 = · · · = hN −1 = (b − a)/N , we call (7.6.1) a uniform grid, otherwise a nonuniform
grid. The letter h is also used to designate the common grid length h = (b − a)/N . A
vector valued function v = {vn }, vn ∈ Rd , defined on the grid (7.6.1) is called a grid
function. Thus, vn is the value of v at the gridpoint xn . Every function v(x) defined
on [a, b] induces a grid function by restriction. We denote the set of grid functions on
[a, b] by Γh [a, b], and for each grid function v = {vn } define its norm by
kvk∞ = max_{0≤n≤N} kvn k,    v ∈ Γh [a, b]    (7.6.4)

A one-step method – indeed, any discrete-variable method – is a method produc-


ing a grid function u = {un } such that u ≈ y, where y = {yn } is the grid function
induced by the exact solution y(x) of the initial value problem (7.1.1)

Let the method


xn+1 = xn + hn
(7.6.5)
un+1 = un + hn Φ(xn , un ; hn )
where x0 = a, u0 = y0 .
To bring out the analogy between (7.1.1) and (7.6.5), we introduce operators R
and Rh acting on C 1 [a, b] and Γh [a, b], respectively. These are the residual operators
(Rv)(x) := v 0 (x) − f (x, v(x)), v ∈ C 1 [a, b] (7.6.6)
(Rh v)n := (1/hn )(vn+1 − vn ) − Φ(xn , vn ; hn ),    n = 0, 1, . . . , N − 1,    (7.6.7)
where v = {vn } ∈ Γh [a, b]. (The grid function {(Rh v)n } is not defined for n = N ,
but we may arbitrarily set (Rh v)N = (Rh v)N −1 ). Then the initial value problem
(7.1.1) and its discrete analogue (7.6.5) can be written transparently as
Ry = 0 on [a, b], y(a) = y0 (7.6.8)
Rh u = 0 on [a, b], u0 = y0 (7.6.9)
Note that the discrete residual operator (7.6.7) is closely related to the truncation
error (7.3.3) when we apply the operator at a point (xn , y(xn )) on the exact solution
trajectory. Then indeed the reference solution u(t) coincides with the solution y(t)
and
(Rh y)n = (1/hn )[y(xn+1 ) − y(xn )] − Φ(xn , y(xn ); hn ) = −T (xn , y(xn ); hn ).    (7.6.10)

7.6.1 Stability
Stability is a property of the numerical scheme (7.6.5) alone and has nothing to do
with its approximation power. It characterizes the robustness of the scheme with
respect to small perturbations. Nevertheless, stability combined with consistency
yields convergence of the numerical solution to the true solution.
We define stability in terms of the discrete residual operators Rh in (7.6.7). As
usual we assume Φ(x, y; h) to be defined on [a, b] × Rd × [0, h0 ], where h0 > 0 is
some suitable positive number.
Definition 7.6.1. The method (7.6.5) is called stable on [a, b] if there exists a constant
K > 0 not depending on h such that for an arbitrary grid h on [a, b], and for two
arbitrary grid functions v, w ∈ Γh [a, b], there holds
kv − wk∞ ≤ K (kv0 − w0 k∞ + kRh v − Rh wk∞ ) , v, w ∈ Γh [a, b] (7.6.11)
for all h with |h| sufficiently small. In (7.6.11) the norm is defined by (7.6.4).

We refer to (7.6.11) as the stability inequality. The motivation for it is as follows.


Suppose we have two grid functions u, w satisfying

Rh u = 0, u0 = y0 (7.6.12)
Rh w = ε, w0 = y0 + η0 , (7.6.13)

where ε = {εn } ∈ Γh [a, b] is a grid function with small kεn k, and kη0 k is also
small. We may interpret u ∈ Γh [a, b] as the result of applying the numerical scheme
in (7.6.5) in infinite precision, whereas w ∈ Γh [a, b] could be the solution of (7.6.5)
in floating-point arithmetic. Then, if stability holds, we have

ku − wk∞ ≤ K(kη0 k∞ + kεk∞ ), (7.6.14)

that is, the global change in u is of the same order of magnitude as the local resid-
ual errors {εn } and the initial error η0 . It should be appreciated, however, that the first
equation in (7.6.13) says

wn+1 − wn − hn Φ(xn , wn ; hn ) = hn εn ,

meaning that the rounding errors must go to zero as |h| → 0.


Interestingly enough, a Lipschitz condition on Φ is all that is required for stability.

Theorem 7.6.2. If Φ(x, y; h) satisfies a Lipschitz condition with respect to the y-


variables

kΦ(x, y; h) − Φ(x, y ∗ ; h)k ≤ M ky − y ∗ k on [a, b] × Rd × [0, h0 ], (7.6.15)

then the method (7.6.5) is stable.

For the proof we need the following lemma.

Lemma 7.6.3. Let {en } be a sequence of numbers en ∈ R, satisfying

en+1 ≤ an en + bn , n = 0, 1, . . . , N − 1 (7.6.16)

where an > 0 and bn ∈ R. Then


en ≤ En ,    En = ( Π_{k=0}^{n−1} ak ) e0 + Σ_{k=0}^{n−1} ( Π_{l=k+1}^{n−1} al ) bk ,    n = 0, 1, . . . , N    (7.6.17)

We adopt here the usual convention that an empty product has the value 1 and an
empty sum has the value 0.

Proof of lemma 7.6.3. It is readily verified that


En+1 = an En + bn , n = 0, 1, . . . , N − 1, E0 = e0 .
Subtracting this from (7.6.16), we get
en+1 − En+1 ≤ an (en − En ), n = 0, 1, . . . , N − 1.
Now, e0 − E0 = 0, so that e1 − E1 ≤ 0, since a0 > 0. By induction, more generally,
en − En ≤ 0, since an−1 > 0. 

Proof of Theorem 7.6.2. Let h = {hn } be an arbitrary grid on [a, b] and v, w ∈


Γh [a, b] two arbitrary (vector-valued) grid functions. By definitions of Rh , we can
write
vn+1 = vn + hn Φ(xn , vn ; hn ) + hn (Rh v)n , n = 0, 1, . . . , N − 1
and similarly for wn+1 . Subtracting then gives

vn+1 − wn+1 = vn − wn + hn [Φ(xn , vn ; hn ) − Φ(xn , wn ; hn )]+


+ hn [(Rh v)n − (Rh w)n ], n = 0, 1, . . . , N − 1. (7.6.18)
Define now
en = kvn − wn k, dn = k(Rh v)n − (Rh w)n k, δ = kdn k∞ . (7.6.19)
Then, using the triangle inequality in (7.6.18) and the Lipschitz condition (7.6.15) for
Φ, we obtain
en+1 ≤ (1 + hn M )en + hn δ, n = 0, 1, . . . , N − 1 (7.6.20)
This is inequality (7.6.16) with an = 1+hn M , bn = hn δ. Since for k = 0, 1, . . . , n−
1, n ≤ N we have
Π_{ℓ=k+1}^{n−1} aℓ ≤ Π_{ℓ=0}^{N−1} aℓ = Π_{ℓ=0}^{N−1} (1 + hℓ M ) ≤ Π_{ℓ=0}^{N−1} e^{hℓ M} = e^{(h0 +h1 +···+hN −1 )M} = e^{(b−a)M} ,
where the classical result 1 + x ≤ ex has been used in the second inequality, we
obtain from lemma 7.6.3 that
en ≤ e^{(b−a)M} e0 + e^{(b−a)M} Σ_{k=0}^{n−1} hk δ ≤ e^{(b−a)M} (e0 + (b − a)δ),    n = 0, 1, . . . , N − 1.

Therefore

kek∞ = kv − wk∞ ≤ e^{(b−a)M} (kv0 − w0 k + (b − a)kRh v − Rh wk∞ ),

which is (7.6.11) with K = e^{(b−a)M} max{1, b − a}. 

We have actually proved stability for all |h| ≤ h0 , not only for h sufficiently
small.
All one-step methods used in practice satisfy a Lipschitz condition if f does, and
the constant M for Φ can be expressed in terms of the Lipschitz constant L for f . This
is obvious for Euler's method, and not difficult to prove for others. It is useful to note
that Φ need not be continuous in x; piecewise continuity suffices, as long as
(7.6.15) holds for all x ∈ [a, b], taking one-sided limits at points of discontinuity.
The following application of Lemma 7.6.3, relative to a grid function v ∈ Γh [a, b]
satisfying

vn+1 = vn + hn (An vn + bn ), n = 0, 1, ..., N − 1, (7.6.21)

where An ∈ Rd×d , bn ∈ Rd , and hn is an arbitrary grid on [a, b] is also useful.

Lemma 7.6.4. Suppose in (7.6.21) that

kAn k ≤ M, kbn k ≤ δ, n = 0, 1, . . . , N − 1, (7.6.22)

where the constants M , δ do not depend on h. Then, there exists a constant K > 0
independent of h, but depending on kv0 k, such that

kvk∞ ≤ K. (7.6.23)

Proof. The lemma follows observing that

kvn+1 k ≤ (1 + hn M )kvn k + hn δ,    n = 0, 1, . . . , N − 1,

which is precisely the inequality (7.6.20) in the proof of Theorem 7.6.2, hence

kvn k ≤ e^{(b−a)M} {kv0 k + (b − a)δ}.    (7.6.24)



7.6.2 Convergence
Stability is a powerful concept. It implies almost immediately convergence, and it is
also instrumental in deriving asymptotic global error estimates. We begin by defining
precisely what we mean by convergence.

Definition 7.6.5. Let a = x0 < x1 < x2 < · · · < xN = b be a grid on [a, b] with
grid length |h| = max_{1≤n≤N} (xn − xn−1 ). Let u = {un } be the grid function defined
by applying the method (7.6.5) on [a, b] and y = {yn } the grid function induced by
the exact solution of the initial value problem (7.1.1). The method (7.6.5) is said to
converge on [a, b] if there holds

ku − yk∞ → 0 as |h| → 0 (7.6.25)

Theorem 7.6.6. If the method (7.6.5) is consistent and stable on [a, b], then it con-
verges. Moreover, if Φ has order p, then

ku − yk∞ = O(|h|p ) as |h| → 0. (7.6.26)

Proof. By the stability inequality (7.6.11) applied to the grid functions v = u and
w = y of Definition 7.6.5, we have for |h| sufficiently small

ku − yk∞ ≤ K(ku0 − y(x0 )k + kRh u − Rh yk∞ ) = KkRh yk∞    (7.6.27)

since u0 = y(x0 ) and Rh u = 0 by (7.6.5). But, by (7.6.10),

kRh yk∞ = kT (·, y; h)k∞ (7.6.28)

where T is the truncation error of the method Φ. By definition of consistency

kT (·, y; h)k∞ → 0, as |h| → 0,

which proves the first part of the theorem. The second part follows immediately from
(7.6.27) and (7.6.28), since order p means, by definition, that

kT (·, y; h)k∞ = O(|h|p ), as |h| → 0. (7.6.29)
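The estimate (7.6.26) is easy to check numerically: halving |h| should reduce the maximum error by a factor of about 2^p. A minimal sketch in Python for Euler's method (p = 1) on the problem of Example 7.5.2 (illustrative code, not part of the text):

import numpy as np

def euler(f, a, b, y0, N):
    h = (b - a) / N
    t = np.linspace(a, b, N + 1)
    w = np.empty(N + 1); w[0] = y0
    for i in range(N):
        w[i + 1] = w[i] + h * f(t[i], w[i])
    return t, w

f = lambda t, y: -y + t + 1                  # exact solution y(t) = t + exp(-t)
for N in (10, 20, 40, 80):
    t, w = euler(f, 0.0, 1.0, 1.0, N)
    print(N, np.max(np.abs(w - (t + np.exp(-t)))))
# the errors drop by a factor of about 2 per halving of h; the rk4 sketch of
# Section 7.5 would show a factor of about 16, consistent with p = 4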



7.6.3 Asymptotics of global error


Since the principal error function describes the leading contribution of the local trun-
cation error, it is of interest to identify the leading term in the global error un −y(xn ).
To simplify matters, we assume a constant grid length h, although it is not difficult to
deal with a variable grid length of the form hn = ϑ(xn )h, where ϑ(x) is piecewise
continuous and 0 < ϑ(x) < θ for a ≤ x ≤ b. Thus, we consider our one-step method
to have the form

xn+1 = xn + h
un+1 = un + hΦ(xn , un ; h); n = 0, 1, . . . , N − 1 (7.6.30)
x0 = a, u0 = y0 ,

defining a grid function u = {un } on a uniform grid on [a, b]. We are interested in
the asymptotic behavior of un − y(xn ) as h → 0, where y(x) is the exact solution of
the initial value problem
dy/dx = f (x, y),    x ∈ [a, b]
y(a) = y0    (7.6.31)

Theorem 7.6.7. Assume that


(1) Φ(x, y, h) ∈ C 2 [a, b] × Rd × [0, h0 ] ;


(2) Φ is a method of order p ≥ 1 admitting a principal error function τ (x, y) ∈
C([a, b] × Rd );

(3) e(x) is the solution of the linear initial value problem


de/dx = fy (x, y(x)) e + τ (x, y(x)),    a ≤ x ≤ b    (7.6.32)
e(a) = 0

Then, for n = 0, N ,

un − y(xn ) = e(xn )hp + O(hp+1 ), as h → 0. (7.6.33)

Before we prove the theorem, we make the following remarks:

1. The precise meaning of (7.6.33) is

ku − y − hp ek∞ = O(hp+1 ),

where u, y, e are the grid functions u = {un }, y = {y(xn )} and e = {e(xn )}.

2. Since by consistency Φ(x, y; 0) = f (x, y), assumption (1) implies f is of class


C 2 on ([a, b] × Rd ), which is more than enough to guarantee the existence and
uniqueness of the solution e(x) of (7.6.32) on the whole interval [a, b].

3. The fact that some, but not all, components of τ (x, y) may vanish identically
does not imply that the corresponding components of e(x) also vanish, since
(7.6.32) is a coupled system of differential equations.

Proof of Theorem 7.6.7. We begin with an auxiliary computation, an estimate for

Φ(xn , un ; h) − Φ(xn , y(xn ); h). (7.6.34)

By Taylor's theorem (for functions of several variables), applied to the ith component of
(7.6.34), we have

Φi (xn , un ; h) − Φi (xn , y(xn ); h) = Σ_{j=1}^{d} Φi_{yj} (xn , y(xn ); h) [ujn − y j (xn )]
    + (1/2) Σ_{j,k=1}^{d} Φi_{yj yk} (xn , ūn ; h) [ujn − y j (xn )][ukn − y k (xn )],
    (7.6.35)

where ūn is on the line segment connecting un and y(xn ). Using Taylor's theorem
once more, in the variable h, we can write

Φi_{yj} (xn , y(xn ); h) = Φi_{yj} (xn , y(xn ); 0) + h Φi_{yj h} (xn , y(xn ); h̄),




where 0 < h̄ < h. Since, by consistency, Φ(x, y; 0) ≡ f (x, y) on [a, b] × Rd , we


have
Φiyj (x, y; 0) = fyi j (x, y), x ∈ [a, b], y ∈ Rd ,
and assumption (1) allows us to write

Φiyj (xn , y(xn ); h) = fyi j (xn , y(xn )) + O(h), h → 0. (7.6.36)

Now observing that un − y(xn ) = O(hp ), by virtue of Theorem 7.6.6 and using
(7.6.36) in (7.6.35), we get, again by assumption (1),
Φi (xn , un ; h) − Φi (xn , y(xn ); h) = Σ_{j=1}^{d} f i_{yj} (xn , y(xn )) [ujn − y j (xn )] + O(hp+1 ) + O(h2p ).

But O(h2p ) is also of order O(hp+1 ), since p ≥ 1. Thus, in vector notation,


Φ(xn , un ; h)−Φ(xn , y(xn ); h) = fy (xn , y(xn )) [un − y(xn )]+O(hp+1 ). (7.6.37)
Now, to highlight the leading term in the global error, we define the grid function
r = {rn } by
r = h−p (u − y). (7.6.38)
Then
(1/h)(rn+1 − rn ) = (1/h)[ h−p (un+1 − y(xn+1 )) − h−p (un − y(xn )) ] =
    = h−p [ (1/h)(un+1 − un ) − (1/h)(y(xn+1 ) − y(xn )) ] =
    = h−p {Φ(xn , un ; h) − [Φ(xn , y(xn ); h) − T (xn , y(xn ); h)]} ,
where we have used (7.6.30) and the relation (7.6.10) for the truncation error T .
Therefore, expressing T in terms of the principal error function τ , we get
(1/h)(rn+1 − rn ) = h−p [ Φ(xn , un ; h) − Φ(xn , y(xn ); h) + τ (xn , y(xn )) hp + O(hp+1 ) ]

For the first two terms in brackets we use (7.6.37) and the definition of r in (7.6.38)
to obtain
(1/h)(rn+1 − rn ) = fy (xn , y(xn )) rn + τ (xn , y(xn )) + O(h),    n = 0, N − 1,
r0 = 0.    (7.6.39)
Now letting
g(x, y) := fy (x, y(x))y + τ (x, y(x)) (7.6.40)
we can interpret (7.6.39) by writing
 
(RhEuler,g r)n = εn ,    n = 0, N − 1,    εn = O(h),

where RhEuler,g is the discrete residual operator (7.6.7) that goes with Euler’s method
applied to e0 = g(x, e), e(a) = 0. Since Euler’s method is stable on [a, b] and g being
linear in y satisfies a uniform Lipschitz condition, we have by the stability inequality
(7.6.11)
kr − ek∞ = O(h),
and hence, by (7.6.38)
ku − y − hp ek∞ = O(hp+1 ),
as was to be shown. 

7.7 Error Monitoring and Step Control


Most production codes currently available for solving ODEs monitor local truncation
errors and control the step length on the basis of estimates for these errors. Here we
attempt to monitor global error, at least asymptotically, by implementing the asymp-
totic result of Theorem 7.6.7. This necessitates the evaluation of the Jacobian matrix
fy (x, y) along or near the solution trajectory; but this is only natural, since fy , in a
first approximation, governs the effect of perturbations via the variational differential
equation (7.6.32). This equation is driven by the principal error function evaluated
along the trajectory, so that estimates of local truncation errors (more precisely, of the
principal error function) are needed also in this approach. For simplicity we again
assume constant grid length.

7.7.1 Estimation of global error


The idea of our estimation is to integrate the “variational equation” (7.6.32) along
with the main equation (7.6.31). Since we need e(x) in (7.6.33) only to within an
accuracy of O(h) (any O(h) error term in e(xn ), multiplied by hp , being absorbed by
the O(hp+1 ) term), we can use Euler's method for that purpose, which will provide
the desired approximation vn ≈ e(xn ).

Theorem 7.7.1. Assume that

(1) Φ(x, y; h) ∈ C 2 [a, b] × Rd × [0, h0 ] ;




(2) Φ is a method of order p ≥ 1 admitting a principal error function τ (x, y) ∈
C 1 ([a, b] × Rd );

(3) an estimate r(x, y; h) is available for the principal error function that satisfies

r(x, y; h) = τ (x, y) + O(h), h → 0, (7.7.1)

uniformly on [a, b] × Rd ;

(4) along with the grid function u = {un } we generate the grid function v = {vn }
in the following manner.

xn+1 = xn + h;
un+1 = un + hΦ(xn , un ; h)
(7.7.2)
vn+1 = vn + h [fy (xn , un ) vn + r(xn , un ; h)]
x0 = a, u0 = y0 , v0 = 0.

Then, for n = 0, N − 1,

un − y(xn ) = vn hp + O(hp+1 ),    as h → 0.    (7.7.3)

Proof. The proof begins by establishing the following estimates

fy (xn , un ) = fy (xn , y(xn )) + O(h),    (7.7.4)


r(xn , un ; h) = τ (xn , y(xn )) + O(h). (7.7.5)

From assumption (1) we note, by consistency (f (x, y) = Φ(x, y; 0)), that f (x, y) ∈
C 2 ([a, b] × Rd ). Taking into account Theorem 7.6.6, we have un = y(xn ) +
O(hp ), and therefore,

fy (xn , un ) = fy (xn , yn ) + O(hp ),

which implies (7.7.4), since p ≥ 1. Next, since τ (x, y) ∈ C 1 ([a, b] × Rd ) by
assumption (2), we have

τ (xn , un ) = τ (xn , y(xn )) + τy (xn , ūn )(un − y(xn ))


= τ (xn , y(xn )) + O(hp )

so that by assumption (3),

r(xn , un ; h) = τ (xn , un ) + O(h) = τ (xn , y(xn )) + O(hp ) + O(h),

which implies (7.7.5) immediately.


Let (cf. (7.6.40))

g(x, y) = fy (x, y(x))y + τ (x, y(x)). (7.7.6)

The equation for vn+1 in (7.7.2) has the form

vn+1 = vn + h(An vn + bn ),

where An are bounded matrices and bn bounded vectors. By Lemma 7.6.4, we
have boundedness of vn ,
vn = O(1), h → 0. (7.7.7)
Substituting (7.7.4) and (7.7.5) into the equation for vn+1 and noting (7.7.7), we
obtain

vn+1 = vn + h [fy (xn , y(xn ))vn + τ (xn , y(xn )) + O(h)]


= vn + hg(xn , vn ) + O(h2 ).

Thus, in the notation used in the proof of Theorem 7.6.7


 
(RhEuler,g v)n = O(h),    v0 = 0.

Since Euler's method is stable, we conclude

vn − e(xn ) = O(h),

where e(x) is, as before, the solution of

e′ = g(x, e),    e(a) = 0.

Therefore, by (7.6.33),

un − y(xn ) = e(xn )hp + O(hp+1 ) = vn hp + O(hp+1 ),

which is (7.7.3). 
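A minimal sketch in Python of the recursion (7.7.2) for Euler's method (p = 1) on the scalar problem y′ = −y + x + 1, y(0) = 1, where the principal error function τ = −(fx + fy f )/2 = −(y − x)/2 is coded by hand and used as the estimator r (all names illustrative):

import math

f  = lambda x, y: -y + x + 1
fy = lambda x, y: -1.0                 # Jacobian of f with respect to y
r  = lambda x, y, h: -(y - x) / 2      # estimator of the principal error function

a, b, y0, N = 0.0, 1.0, 1.0, 20
h = (b - a) / N
x, u, v = a, y0, 0.0
for n in range(N):
    # update u by Euler's method and v by the recursion (7.7.2), both with old values
    u, v, x = u + h * f(x, u), v + h * (fy(x, u) * v + r(x, u, h)), x + h

print(u - (b + math.exp(-b)), v * h)   # global error and its estimate v_N * h^p

The two printed numbers agree up to O(h2 ), in accordance with (7.7.3).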

7.7.2 Truncation error estimates


In order to apply Theorem 7.7.1 we need estimates r(x, y; h) of the principal error
function τ (x, y) which are O(h) accurate. We shall describe two of them in increas-
ing order of efficiency.

Local Richardson extrapolation to zero


This works for any one-step method Φ, but is usually considered too expensive. If Φ
has the order p, the procedure is as follows

yh = y + h Φ(x, y; h),
yh/2 = y + (1/2) h Φ( x, y; (1/2) h ),
yh∗ = yh/2 + (1/2) h Φ( x + (1/2) h, yh/2 ; (1/2) h ),    (7.7.8)
r(x, y; h) = [1/(1 − 2−p )] (1/hp+1 ) (yh − yh∗ ).

Note that yh∗ is the result of applying Φ over two consecutive steps of length h/2
each, whereas yh is the result of one application over the whole step length h.

We now verify that r(x, y; h) in (7.7.8) is an acceptable error estimator. To do


this, we need to assume that τ (x, y) ∈ C 1 [a, b] × Rd . In terms of the reference


solution u(t) through (x, y) we have (cf. (7.3.4) and (7.3.8))


Φ(x, y; h) = (1/h)[u(x + h) − u(x)] + τ (x, y) hp + O(hp+1 ).    (7.7.9)
Furthermore,
 
(1/h)(yh − yh∗ ) = (1/h)(yh − yh/2 ) − (1/2) Φ( x + (1/2)h, yh/2 ; (1/2)h )
    = Φ(x, y; h) − (1/2) Φ( x, y; (1/2)h ) − (1/2) Φ( x + (1/2)h, yh/2 ; (1/2)h ).

Applying (7.7.9) to each of the three terms on the right, we find

(1/h)(yh − yh∗ ) = (1/h)[u(x + h) − u(x)] + τ (x, y)hp + O(hp+1 )
    − (1/2) { (1/((1/2)h)) [u(x + (1/2)h) − u(x)] + τ (x, y)((1/2)h)p + O(hp+1 ) }
    − (1/2) { (1/((1/2)h)) [u(x + h) − u(x + (1/2)h)] + [τ (x + (1/2)h, y) + O(h)] ((1/2)h)p }
    + O(hp+1 ) = τ (x, y)(1 − 2−p )hp + O(hp+1 ).

Consequently
[1/(1 − 2−p )] (1/h) (yh − yh∗ ) = τ (x, y) hp + O(hp+1 ),    (7.7.10)
as required.
Subtracting (7.7.10) from (7.7.9) shows, incidentally, that

Φ∗ (x, y; h) := Φ(x, y; h) − [1/(1 − 2−p )] (1/h) (yh − yh∗ )    (7.7.11)
defines a one-step method of order p + 1.
Procedure (7.7.8) is rather expensive. For a fourth-order Runge-Kutta process,
it requires a total of 11 evaluations of f per step, almost three times the effort for a
single Runge-Kutta step. Therefore, Richardson extrapolation is normally used only
after two steps of Φ, that is, one proceeds according to

yh = y + h Φ(x, y; h),
y2h∗ = yh + h Φ(x + h, yh ; h),    (7.7.12)
y2h = y + 2h Φ(x, y; 2h).

Then (7.7.10) gives

[1/(2(2p − 1))] (1/hp+1 ) (y2h − y2h∗ ) = τ (x, y) + O(h),    (7.7.13)

so that the expression on the left is an acceptable estimator r(x, y; h). If the two
steps in (7.7.12) yield acceptable accuracy (cf. §7.7.3), then again for a fourth-order
Runge-Kutta process, the procedure requires only three additional evaluations of f ,
since yh and y2h∗ would have to be computed anyhow. There are still more efficient
schemes, as we shall see.
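A minimal sketch in Python of the two-step estimate (7.7.12)-(7.7.13), here applied to Euler's method (p = 1, Φ = f ); the same code would accept any increment function Φ (names illustrative):

def richardson_estimate(Phi, x, y, h, p):
    yh   = y + h * Phi(x, y, h)               # first regular step
    y2hs = yh + h * Phi(x + h, yh, h)         # second regular step, y*_{2h}
    y2h  = y + 2 * h * Phi(x, y, 2 * h)       # one double step of length 2h
    r = (y2h - y2hs) / (2 * (2**p - 1) * h**(p + 1))   # estimate of tau(x, y), cf. (7.7.13)
    return y2hs, r

Phi_euler = lambda x, y, h: -y + x + 1        # Euler increment for y' = -y + x + 1
_, r = richardson_estimate(Phi_euler, 0.0, 1.0, 0.05, 1)
print(r)   # about -0.5, which is tau(0, 1) = -(f_x + f_y f)/2 = -(y - x)/2 at (0, 1)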

Embedded methods
The basic idea of this approach is very simple: if the given method Φ has order p,
take any other one step method Φ∗ of order p∗ = p + 1 and define
r(x, y; h) = (1/hp ) [Φ(x, y; h) − Φ∗ (x, y; h)]    (7.7.14)
This is indeed an acceptable estimator, as follows by subtracting the two relations
Φ(x, y; h) − (1/h)[u(x + h) − u(x)] = τ (x, y) hp + O(hp+1 )
Φ∗ (x, y; h) − (1/h)[u(x + h) − u(x)] = O(hp+1 )
and dividing the result by hp .
The tricky part is to make this procedure efficient. Following an idea of Fehlberg,
one can try to do this by embedding one Runge-Kutta process (of order p) into another
(of order p + 1). Specifically, let Φ be some explicit r-stage Runge-Kutta method.
K1 (x, y) = f (x, y)
Ks (x, y; h) = f ( x + µs h, y + h Σ_{j=1}^{s−1} λsj Kj ),    s = 2, 3, . . . , r
Φ(x, y; h) = Σ_{s=1}^{r} αs Ks

Then for Φ∗ choose a similar r∗ -stage process, with r∗ > r, in such a way that
µ∗s = µs , λ∗sj = λsj , for s = 2, 3, . . . , r.
The estimate (7.7.14) then costs only r∗ − r extra evaluations of f . If r∗ = r + 1 one
might even attempt to save the additional evaluation by selecting (if possible)
µ∗r∗ = 1,    λ∗r∗ j = αj  for j = 1, r∗ − 1    (r∗ = r + 1)    (7.7.15)

Then indeed, Kr∗ will be identical with K1 for the next step.
Pairs of such embedded (p, p + 1) Runge-Kutta formulae have been developed
in the late 1960’s by E. Fehlberg. There is a considerable degree of freedom in
choosing the parameters. Fehlberg’s choices were guided by an attempt to reduce the
magnitude of the coefficients of all the partial derivative aggregates that enter into
the principal error function τ (x, y) of Φ. He succeeded in obtaining pairs with the
following values of parameters p, r, r∗ , given in Table 7.2.

p 3 4 5 6 7 8
r 4 5 6 8 11 15
r∗ 5 6 8 10 13 17

Table 7.2: Embedded Runge-Kutta formulae

For the third-order process (and only for that one) one can choose the parameters
for (7.7.15) to hold.

7.7.3 Step control


Any estimate r(x, y; h) of the principal error function τ (x, y) implies an estimate

hp r(x, y; h) = T (x, y; h) + O(hp+1 ) (7.7.16)

for the truncation error, which can be used to monitor the local truncation error during
the integration process. However, one has to keep in mind that the local truncation
error is quite different from the global error, that one really wants to control. To get
more insight into the relationship between these two errors, we recall the following
theorem, which quantifies the continuity of solution of an initial value problem with
respect to initial values.

Theorem 7.7.2. Let f (x, y) be continuous in x ∈ [a, b] and satisfy a Lipschitz condition
uniformly on [a, b] × Rd , with Lipschitz constant L, that is

kf (x, y) − f (x, y ∗ )k ≤ L ky − y ∗ k .

Then the initial value problem

dy
= f (x, y), x ∈ [a, b],
dx (7.7.17)
y(c) = yc

has a unique solution on [a, b] for any c ∈ [a, b] and for any yc ∈ Rd . Let y(x, s)
and y(x; s∗ ) be the solutions of (7.7.17) corresponding to yc = s and yc = s∗ ,
respectively. Then for any vector norm k.k,

ky(x; s) − y(x; s∗ )k ≤ eL|x−c| ks − s∗ k . (7.7.18)

“Solving the given initial value problem (7.6.31) numerically by a one-step meth-
od (not necessarily with constant step) means in reality that one follows a sequence
of “solution tracks”, whereby at each grid point xn one jumps from one track to the
next by an amount determined by the truncation error at xn ” [16] (see Figure 7.3).
This results from the definition of the truncation error, the reference solution being one of
the solution tracks. Specifically, the nth track, n = 0, N , is given by the solution of
the initial value problem

dvn /dx = f (x, vn ),    x ∈ [xn , b],    (7.7.19)
vn (xn ) = un ,

and

un+1 = vn (xn+1 ) + hn T (xn , un ; hn ),    n = 0, N − 1.    (7.7.20)
Since by (7.7.19) we have un+1 = vn+1 (xn+1 ), we can apply Theorem 7.7.2 to the
solutions vn+1 and vn , letting c = xn+1 , s = un+1 , s∗ = un+1 − hn T (xn , un ; hn )
(by (7.7.20)), and thus obtain

kvn+1 (x) − vn (x)k ≤ hn e^{L|x−xn+1 |} kT (xn , un ; hn )k ,    n = 0, N − 1.    (7.7.21)

Figure 7.3: Error accumulation in a one-step method

Now
Σ_{n=0}^{N−1} [vn+1 (x) − vn (x)] = vN (x) − v0 (x) = vN (x) − y(x),    (7.7.22)

and since vN (xN ) = uN , letting x = xN , we get from (7.7.21) and (7.7.22) that

kuN − y(xN )k ≤ Σ_{n=0}^{N−1} kvn+1 (xN ) − vn (xN )k ≤ Σ_{n=0}^{N−1} hn e^{L|xN −xn+1 |} kT (xn , un ; hn )k .

Therefore, if we make sure that

kT (xn , un ; hn )k ≤ εT , n = 0, N − 1, (7.7.23)

then
kuN − y(xN )k ≤ εT Σ_{n=0}^{N−1} (xn+1 − xn ) e^{L|xN −xn+1 |} .
Interpreting the sum on the right as a Riemann sum for a definite integral, we finally
obtain, approximately,

kuN − y(xN )k ≤ εT ∫_a^b e^{L(b−x)} dx = (εT /L) ( e^{L(b−a)} − 1 ).

Thus, knowing an estimate for L would allow us to set an appropriate εT , namely


εT = [ L/(e^{L(b−a)} − 1) ] ε,    (7.7.24)
to guarantee an error kuN − y(xN )k ≤ ε. What holds for the whole grid on [a, b] of
course, holds for any grid on a subinterval [a, x], a ≤ x ≤ b. So, in principle, given
the desired accuracy ε for the solution y(x), we can determine a “local tolerance
level” εT (cf. (7.7.24)) and achieve the desired accuracy by keeping the local trunca-
tion error below εT (cf. (7.7.23)). Note that as L → 0, we have εT → ε/(b − a).

This limit value of εT would be appropriate for a quadrature problem but definitely
not for a true differential equation problem, where εT , in general, has to be chosen
considerably smaller than the target error tolerance ε.
Considerations such as these motivate the following step control mechanism:
each integration step (from xn to xn+1 = xn + hn ) consists of these parts:

1. Estimate hn .

2. Compute un+1 = un + hn Φ(xn , un ; hn ) and r(xn , un ; hn ).

3. Test hpn kr(xn , un ; hn )k ≤ εT (cf. (7.7.16) and (7.7.23)). If the test passes,
proceed with the next step; if not, repeat the step with a smaller hn , say, half
as large, until the test passes.

To estimate hn , assume first that n ≥ 1, so that the estimator from the previous
step, r(xn−1 , un−1 ; hn−1 ) (or at least its norm) is available. Then, neglecting terms
of O(h),
kτ (xn−1 , un−1 )k ≈ kr(xn−1 , un−1 ; hn−1 )k,
and since τ (xn , un ) ≈ τ (xn−1 , un−1 ), likewise

kτ (xn , un )k ≈ kr(xn−1 , un−1 ; hn−1 )k.

What we want is
kτ (xn , un )khpn ≈ θεT ,
where θ is a “safety factor”, say, θ = 0.8. Eliminating τ (xn , un ), we find

hn ≈ [ θεT / kr(xn−1 , un−1 ; hn−1 )k ]^{1/p} .
Note that from the previous step we have

hpn−1 kr(xn−1 , un−1 ; hn−1 )k ≤ εT ,

so that
hn ≥ θ1/p hn−1 ,
and the tendency is to increase the step.
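In code, this step proposal is a one-liner; a small illustrative sketch (the numbers θ, εT , p and the previous estimate are made-up values, not taken from the text):

theta, epsT, p = 0.8, 1e-6, 4
r_prev = 2.5e-7                              # norm of r(x_{n-1}, u_{n-1}; h_{n-1})
h_next = (theta * epsT / r_prev) ** (1.0 / p)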
If n = 0, we proceed similarly, using some initial guess h0^{(0)} of h0 and the associated
r(x0 , y0 ; h0^{(0)} ) to obtain

h0^{(1)} = [ θεT / kr(x0 , y0 ; h0^{(0)} )k ]^{1/p} .

The process may be repeated once or twice to get the final estimate of both h0 and
r(x0 , y0 ; h0^{(0)} ).
For a synthetic description of variable-step Runge-Kutta methods, the Butcher table
is completed by a supplementary line used for the computation of Φ∗ (and thus of
r(x, y; h)):

µ1    λ11   λ12   . . .   λ1r
µ2    λ21   λ22   . . .   λ2r
 ⋮      ⋮     ⋮             ⋮
µr    λr1   λr2   . . .   λrr
      α1    α2    . . .   αr
      α1∗   α2∗   . . .   αr∗   αr+1∗

As an example, Table 7.3 is the Butcher table for a 2-3 method. For the derivation
of this table see [38, pages 451–452].

µj      λij
0
1/4     1/4
27/40   −189/800    729/800
1       214/891     1/33        650/891
αi      214/891     1/33        650/891     0
αi∗     533/2106    0           800/1053    −1/78

Table 7.3: A 2-3 pair

Table 7.4 is the Butcher table for the Bogacki-Shampine method [5]. It is the basis
for the MATLAB ode23 solver.
Another important example is DORPRI5 or RK5(4)7FM, a pair of orders 4-5 with
7 stages (Table 7.5). This is a very efficient pair; it is the basis for the MATLAB ode45
solver and other important solvers.
Algorithm 7.2 gives implementation hints for a variable-step Runge-Kutta method
when the Butcher table is given. Here ttol is the product of tol and a safety factor (0.8 or 0.9).
For applications of the numerical solution of differential equations and of other
numerical methods in mechanics, see [24].

Algorithm 7.2 Pseudo-code fragment that illustrates the implementation of a


variable-step RK method
done := false;
loop
K:,1 := f (x, y);
for i = 2 to s do
w := y + hK:,1:i−1 λTi,1:i−1 ;
K:,i := f (x + µi h, w);
end for
δ := h kK(α∗ − α)T k∞ ; {error estimation}


β := (δ/ttol)^{1/(1+p)} ; {step length ratio}


if δ < tol then
{accept step}
y := y + h(KαT ); {update y}
x := x + h;
if done then
EXIT {terminate and exit}
end if
h := h/ max(β, 0.1); {predict next step}
if x + h > xend then
h := xend − x; {decrease the step at the end}
done := true;
end if
else
{reject step}
h := h/ min(β, 10); {reduce step size}
if done then
done := false;
end if
end if
end loop
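A minimal sketch in Python of such a loop, using the Bogacki-Shampine 3(2) coefficients of Table 7.4 below; the acceptance test, the safety factor 0.9 and the growth/shrink caps are illustrative choices of this sketch, not prescriptions of the text:

import numpy as np

def ode23_sketch(f, a, b, y0, tol=1e-6, h=0.1):
    x, y = a, np.asarray(y0, dtype=float)
    xs, ys = [x], [y.copy()]
    k1 = f(x, y)                                   # FSAL: k1 of the next step equals k4
    while x < b:
        h = min(h, b - x)
        k2 = f(x + h/2,   y + h/2 * k1)
        k3 = f(x + 3*h/4, y + 3*h/4 * k2)
        y3 = y + h * (2*k1 + 3*k2 + 4*k3) / 9      # third-order result
        k4 = f(x + h, y3)
        err = np.max(np.abs(h * (-5*k1/72 + k2/12 + k3/9 - k4/8)))   # y3 minus 2nd-order result
        if err <= tol:                             # accept the step
            x, y, k1 = x + h, y3, k4
            xs.append(x); ys.append(y.copy())
        h *= min(5.0, max(0.1, 0.9 * (tol / max(err, 1e-16)) ** (1/3)))
    return np.array(xs), np.array(ys)

For a system, f should return a NumPy array; the loop mirrors Algorithm 7.2 with p = 2 and with the safety factor folded into the step-size update.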

µj     λij
0
1/2    1/2
3/4    0       3/4
1      2/9     3/9     4/9
αi     2/9     3/9     4/9     0
αi∗    7/24    1/4     1/3     1/8

Table 7.4: The Butcher table for the Bogacki-Shampine method

µj     λij
0
1/5    1/5
3/10   3/40          9/40
4/5    44/45         −56/15        32/9
8/9    19372/6561    −25360/2187   64448/6561    −212/729
1      9017/3168     −355/33       46732/5247    49/176         −5103/18656
1      35/384        0             500/1113      125/192        −2187/6784     11/84
αi     35/384        0             500/1113      125/192        −2187/6784     11/84      0
αi∗    5179/57600    0             7571/16695    393/640        −92097/339200  187/2100   1/40

Table 7.5: RK5(4)7FM (DORPRI5) embedded pair



Bibliography
[1] Octavian Agratini, Ioana Chiorean, Gheorghe Coman, and Radu Trı̂mbiţaş,
Analiză numerică şi teoria aproximării, vol. III, Presa Universitară Clujeană,
2002, coordinated by D. D. Stancu and Gh. Coman.

[2] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Ei-


jkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solu-
tion of Linear Systems: Building Blocks for Iterative Methods, 2nd ed., SIAM,
Philadelphia, PA, 1994, available at http://www.netlib.org/templates.

[3] Å. Björk, Numerical Methods for Least Squares Problem, SIAM, Philadelphia,
1996.

[4] E. Blum, Numerical Computing: Theory and Practice, Addison-Wesley, 1972.

[5] P. Bogacki and L. F. Shampine, A 3(2) pair of Runge-Kutta formulas, Appl.


Math. Lett. 2 (1989), no. 4, 321–325.

[6] C. G. Broyden, A Class of Methods for Solving Nonlinear Simultaneous Equa-


tions, Math. Comp. 19 (1965), 577–593.

[7] P. G. Ciarlet, Introduction à l’analyse numérique matricielle et à l’optimisation,


Masson, Paris, Milan, Barcelone, Mexico, 1990.

[8] Gheorghe Coman, Analiză numerică, Editura Libris, Cluj-Napoca, 1995.

[9] M. Crouzeix and A. L. Mignot, Analyse numerique des équations differentielles,


Masson, Paris, Milan, Barcelone, Mexico, 1989.

[10] I. Cuculescu, Analiză numerică, Editura Tehnică, Bucureşti, 1967.

[11] James Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997.

[12] J. E. Dennis and J. J. Moré, Quasi-Newton Metods, Motivation and Theory,


SIAM Review 19 (1977), 46–89.

[13] J. Dormand, Numerical Methods for Differential Equations. A Computational


Approach, CRC Press, Boca Raton New York, 1996.

[14] J. G. F. Francis, The QR transformation: A unitary analogue to the LR transfor-


mation, Computer J. 4 (1961), 256–272, 332–345, parts I and II.

[15] W. Gander and W. Gautschi, Adaptive quadrature - revisited, BIT 40 (2000),


84–101.

[16] W. Gautschi, Numerical Analysis, an Introduction, Birkhäuser, Basel, 1997.

[17] Walther Gautschi, Orthogonal polynomials: applications and computation,


Acta Numerica 5 (1996), 45–119.

[18] D. Goldberg, What every computer scientist should know about floating-point
arithmetic, Computing Surveys 23 (1991), no. 1, 5–48.

[19] H. H. Goldstine and J. von Neumann, Numerical inverting of matrices of high


order, Amer. Math. Soc. Bull. 53 (1947), 1021–1099.

[20] Gene H. Golub and Charles van Loan, Matrix Computations, 3rd ed., John Hop-
kins University Press, Baltimore and London, 1996.

[21] P. R. Halmos, Finite-Dimensional Vector Spaces, Springer Verlag, New York,


1958.

[22] Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM,


Philadelphia, 1996.

[23] C. G. J. Jacobi, Über eine neue Auflösungsart der bei der Methode der kleinsten
Quadrate vorkommenden linearen Gleichungen, Astronomische Nachrichten
22 (1845), 9–12, Issue no. 523.

[24] Mirela Kohr and Ioan Pop, Viscous Incompressible Flow for Low Reynolds
Numbers, WIT Press, Southampton(UK) - Boston, 2004.

[25] V. N. Kublanovskaya, On some algorithms for the solution of the complete


eigenvalue problem, USSR Comp. Math. Phys. 3 (1961), 637–657.

[26] Cleve Moler, Numerical Computing in MATLAB, SIAM, 2004, available via
www at http://www.mathworks.com/moler.

[27] J. J. Moré and M. Y. Cosnard, Numerical Solutions of Nonlinear Equations,


ACM Trans. Math. Softw. 5 (1979), 64–85.

[28] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical


Recipes in C, Cambridge University Press, Cambridge, New York, Port Chester,
Melbourne, Sydney, 1996, available at http://www.nr.com/.

[29] I. A. Rus, Ecuaţii diferenţiale, ecuaţii integrale şi sisteme dinamice, Transilva-
nia Press, Cluj-Napoca, 1996.

[30] H. Rutishauser, Solution of the eigenvalue problems with the LR transformation,


Nat. Bur. Stand. App. Math. Ser. 49 (1958), 47–81.
[31] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing, Boston,
1996, available at http://www-users.cs.umn.edu/~saad/books.html.
[32] Thomas Sauer, Numerische Mathematik II, Universität Erlangen-Nurnberg, Er-
langen, 2000, Vorlesungskript.
[33] R. Schwarz, H., Numerische Mathematik, B. G. Teubner, Stuttgart, 1988.
[34] D. D. Stancu, Analiză numerică – Curs şi culegere de probleme, Lito UBB,
Cluj-Napoca, 1977.
[35] D. D. Stancu, G. Coman, and P. Blaga, Analiză numerică şi Teoria Aproximării,
vol. II, Presa Universitară Clujeană, Cluj-Napoca, 2002, D. D. Stancu, Gh. Co-
man, (coord.).
[36] D. D. Stancu, Gh. Coman, O. Agratini, and R. Trı̂mbiţaş, Analiză numerică şi
Teoria aproximării, vol. I, Presa Universitară Clujeană, Cluj-Napoca, 2001, D.
D. Stancu, Gh. Coman, (coord.).
[37] J. Stoer and R. Bulirsch, Einführung in die Numerische Mathematik, vol. II,
Springer Verlag, Berlin, Heidelberg, 1978.
[38] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, 2nd ed., Springer
Verlag, 1992.
[39] Volker Strassen, Gaussian elimination is not optimal, Numer. Math. 13 (1969),
354–356.
[40] Lloyd N. Trefethen, The Definition of Numerical Analysis, SIAM News ?
(1992), no. 3, 1–5.
[41] Lloyd N. Trefethen and David Bau III, Numerical Linear Algebra, SIAM,
Philadelphia, 1996.
[42] C. Überhuber, Computer-Numerik, vol. 1, 2, Springer Verlag, Berlin, Heidel-
berg, New-York, 1995.
[43] C. Ueberhuber, Numerical Computation. Methods, Software and Analysis, vol.
I, II, Springer Verlag, Berlin, Heidelberg, New York, 1997.
[44] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford,
1965.
Index

adaptive quadratures, 121 matrix


asymptotic error, 130 characteristic polynomial of a ∼,
155
Butcher table, 192, 211 companion, 156
diagonalisable ∼, 157
Cholesky factorization, 37
eigenvalue of a ∼, 155
composite Simpson formula, 111
eigenvector of a ∼, 155
composite trapezoidal rule, 111
Jordan normal form of a ∼, 157
degree of exactness, 104, 110 nonderogatory ∼, 157
divided difference, 86 real Schur decomposition of a ∼,
159
efficiency index, 132 RQ transformation of a ∼, 168
upper Hessenberg, 160
formula maximal pivoting, see ttal pivoting33
Euler-MacLaurin ∼, 125 method
Simpson ∼, 111 Broyden’s ∼, 152
Euler ∼, 183
Gauss-Christoffel quadrature formula,
false position ∼, 135
see Gaussian quadrature for-
fixed point iteration ∼, 144
mula
Heun ∼, 187, 191
Gaussian quadrature formula, 115
modified Euler ∼, 187, 191
grid, 193
Newton ∼, 140
grid function, 193
power ∼, see vector iteration
Lagrange interpolation QR
Aitken method, 85 double shift, 177
Neville method, 84 simple, 170
linear convergence, 130 spectral shift, 174
QR ∼, 163
matrix
Schur decomposition of a ∼, 158 Romberg ∼, 122
matrices Runge-Kutta ∼, 188
similar, 157 secant ∼, 137


semi-implicit Runge-Kutta ∼, 189 trapezoidal formula, see trapezoidal rule


SOR, 48 trapezoidal rule, 110
Sturm ∼, 132 truncation error, 182
Taylor expansion ∼, 185
vector iteration, 160
Newton-Cotes formulae, 114
notation
Ω, 16
numerical differentiation formula, 104
numerical integration formula, 109
numerical quadrature formula, see nu-
merical integration formula
numerical solution of differential equa-
tions
one-step methods, 182

one step method


stable ∼, 194
one-step method
consistent ∼, 182
convergent ∼, 198
exact order, 183
order, 183
principal error function, 183
order of convergence, 130

Runge-Kutta method
implicit ∼, 189

spline
complete, 97
Not-a-knot, 98
stability inequality, 195

theorem
Peano, 71
total pivoting, 33
transform
Householder, 39
trapezes rule, see composite trapezoidal
rule
