Curseng
Curseng
Radu T. Trı̂mbiţaş
ii
Preface
iii
iv Preface
1
some of them do both activities
Contents
v
vi Contents
3 Function Approximation 49
3.1 Least Squares approximation . . . . . . . . . . . . . . . . . . . . . 52
3.1.1 Inner products . . . . . . . . . . . . . . . . . . . . . . . . 53
3.1.2 The normal equations . . . . . . . . . . . . . . . . . . . . . 54
3.1.3 Least square error; convergence . . . . . . . . . . . . . . . 56
3.2 Examples of orthogonal systems . . . . . . . . . . . . . . . . . . . 59
3.3 Examples of orthogonal polynomials . . . . . . . . . . . . . . . . . 62
3.3.1 Legendre polynomials . . . . . . . . . . . . . . . . . . . . 62
3.3.2 First kind Chebyshev polynomials . . . . . . . . . . . . . . 64
3.3.3 Second kind Chebyshev polynomials . . . . . . . . . . . . 68
3.3.4 Laguerre polynomials . . . . . . . . . . . . . . . . . . . . 69
3.3.5 Hermite polynomials . . . . . . . . . . . . . . . . . . . . . 69
3.3.6 Jacobi polynomials . . . . . . . . . . . . . . . . . . . . . . 70
3.4 The Space H n [a, b] . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5 Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . 73
3.5.1 Lagrange interpolation . . . . . . . . . . . . . . . . . . . . 73
3.5.2 Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . 76
3.5.3 Interpolation error . . . . . . . . . . . . . . . . . . . . . . 80
3.6 Efficient Computation of Interpolation Polynomials . . . . . . . . . 83
3.6.1 Aitken-type methods . . . . . . . . . . . . . . . . . . . . . 83
3.6.2 Divided difference method . . . . . . . . . . . . . . . . . . 85
3.6.3 Multiple nodes divided differences . . . . . . . . . . . . . . 88
3.7 Convergence of polynomial interpolation . . . . . . . . . . . . . . 90
3.8 Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.8.1 Interpolation by cubic splines . . . . . . . . . . . . . . . . 95
3.8.2 Minimality properties of cubic spline interpolants . . . . . . 99
Bibliography 215
Index 218
List of Algorithms
ix
x LIST OF ALGORITHMS
Chapter 1
Computation errors evaluation is one of the main goal of Numerical Analysis. Several
type of error which can affect the accuracy may occur:
2. Rounding error;
3. Approximation error.
Input data errors are out of computation control. They are due, for example, to the
inherent imperfections of physical measures.
Rounding errors are caused since we perform our computation using a finite rep-
resentation, as usual.
For the third error type, many methods do not provide the exact solution of a given
problem P, even if the computation is carried out exactly (without rounding), but
rather the solution of a simpler problem, P,
e which approximates P. As an example
we consider the summation of an infinite series:
1 1 1
e=1+ + + + ···
1! 2! 3!
which could be replaced with a simpler problem P e which consist of a summation of
a finite number of series terms. Such an error is called truncation error (nevertheless,
this name is also use for rounding errors obtained by removing the last digits of the
representation – chopping). Many approximation problems result by “discretising”
the original problem P: definite integrals are approximated by finite sums, derivatives
1
2 Errors and Floating Point Arithmetic
by differences, and so on. Some authors extend the term “truncation error” to cover
also the discretization error.
The aim of this chapter is to study the overall effect of input error and rounding
error on a computational result. The approximation errors will be discussed when we
expose the numerical methods individually.
Remark 1.2.4.
1.3. Propagated error 3
k∆xk
1. Since x is unknown in practice, one uses the approximation δx = kx∗ k . If
k∆xk is small relatively to x∗ , then the approximation is accurate.
2. If X = R, then it is to use δx = ∆x
x and ∆x = x∗ − x. ♦
n ∂f ∗ n
∆f X ∂x∗i (x ) X ∂
δf = ≈ ∆xi ∗
= ∆xi ∗ ln f (x∗ ) =
f f (x ) ∂xi
i=1 i=1
n
X ∂
= x∗i δxi ln f (x∗ ).
∂x∗i
i=1
Thus
n
X ∂
δf = x∗i ln f (x∗ )δxi . (1.3.2)
∂x∗i
i=1
The inverse problem has also a great importance: what accuracy is needed for the
input data such that the result be of a desired accuracy? That is, given ε > 0, how
4 Errors and Floating Point Arithmetic
∂f ∗ ∂f ∗
∗ (x )∆x1 = . . . = (x )∆xn .
∂x1 ∂x∗n
(1.3.1) implies
∆f
∆xi ≈ ∂f ∗ .
(1.3.3)
n ∂x∗ (x )
i
Analogously,
δf
δxi = . (1.3.4)
n x∗i ∂x∂ ∗ ln f (x∗ )
i
where d0 .d1 d2 . . . dp−1 is called the significand or mantissa, and e is the exponent.
The value of x is
Figure 1.1: The distribution of normalized floating point numbers on the real axis
without denormalization
whose significand has the form 0.d1 d2 . . . dp−1 and whose exponent is β emin −1 . The
availability of denormalization is an additional parameter of the representation. The
set of floating-numbers for a set fixed parameters of representation will be denoted
F(β, p, emin , emax , denorm), denorm ∈ {true, f alse}.
This set is not equal to R because:
1. is a finite subset of Q;
2. for x ∈ R it is possible to have |x| > β × β emax (overflow) or |x| < 1.0 × β emin
(underflow).
The usual arithmetic operation on F(β, p, emin , emax , denorm) are denoted by
⊕, , ⊗, , and the name of usual functions are capitalized: SIN, COS, EXP, LN,
SQRT, and so on. (F, ⊕, ⊗) is not a field since
(x ⊕ y) ⊕ z 6= x ⊕ (y ⊕ z) (x ⊗ y) ⊗ z 6= x ⊗ (y ⊗ z)
(x ⊕ y) ⊗ z 6= x ⊗ z ⊕ y ⊗ z.
In order to measure the error one uses the relative error and ulps – units in the last
place. If the number z is represented as d0 .d1 d2 . . . dp−1 × β e , then the error is
|d0 .d1 d2 . . . dp−1 − z/β e | β p−1 ulps.
6 Errors and Floating Point Arithmetic
Figure 1.2: The distribution of normalized floating point numbers on the real axis
with denormalization
1.4.2 Cancellation
From formulae (1.3.2) for the relative error, if x ≈ x(1 + δx ) and y ≈ y(1 + δy ), we
obtain the relative error for the floating-point arithmetic:
δxy = δx + δy (1.4.5)
δx/y = δx − δy (1.4.6)
x y
δx+y = δx + δy (1.4.7)
x+y x+y
x = 1 0 1 1 0 0 1 0 1 b b g g g g e
y = 1 0 1 1 0 0 1 0 1 b0 b0 g g g g e
x-y = 0 0 0 0 0 0 0 0 0 b00 b00 g g g g e
= b00 b00 g g g g ? ? ? ? ? ? ? ? ? e-9
We have two kind of cancellation: benign, when subtracting exactly known quan-
tities and catastrophic, when the subtraction operands are subject to rounding errors.
The programmer must be aware of the possibility of its occurrence and he/she must
try to avoid it. The expressions which lead to cancellation must be rewritten, and a
catastrophic cancellation must be converted into a benign one. We shall give some
examples in the sequel.
Example 1.4.3. The difference of two values of the same function for nearby argu-
ments is rewritten using Taylor expansion:
δ 2 00
f (x + δ) − f (x) = δf 0 (x) + f (x) + · · · f ∈ C n [a, b]. ♦
2
√
−b +b2 − 4ac
x1 = (1.4.8)
√2a
−b − b2 − 4ac
x2 = (1.4.9)
2a
can lead to cancellation as follows: for b > 0 the cancellation affects the computation
of x1 ; for b < 0 x2 . We can correct the situation using the conjugate
2c
x1 = √ (1.4.10)
−b − b2 − 4ac
2c
x2 = √ . (1.4.11)
−b + b2 − 4ac
For the first case we use formulae (1.4.10) and (1.4.9); for the second case (1.4.8)
and (1.4.11). ♦
1. A better precision.
2. The conversion from binary to decimal and then back to binary needs 9 digits
in single precision and 17 digits in double precision.
1.5. IEEE Standard 9
Format
Parameter Single Single Extended Double Double extended
p 24 ≥ 32 53 ≥ 64
emax +127 ≥ +1023 +1023 ≥ +16383
emin -126 ≤ −1022 -1022 ≤ −16382
Exponent width 8 ≥ 11 11 ≥ 15
Number width 32 ≥ 43 64 ≥ 79
The relation |emin | < emax is motivated by the fact that 1/2emin must not lead to
overflow.
The operations ⊕, , ⊗, must be exactly rounded. The accuracy is achieved
using two guard digit and a third sticky bit.
The exponent is biased, i.e. instead of e the standard represents e + D, where D
is fixed when the format is chosen.
For IEEE 754 single precision, D = 127.
f : Rm → Rn , y = f (x). (1.6.1)
We are interested in the sensitivity of the map f at some given point x to a small
perturbation of x, that is, how much bigger (or smaller) the perturbation in y is com-
pared to the perturbation in x. In particular, we wish to measure the degree of sen-
sitivity by a single number – the condition number of the map f at the point x. The
function f is assumed to be evaluated exactly, with infinite precision, as we perturb
x. The condition of f , therefore, is an inherent property of the map f and does not
depend on any algorithmic considerations concerning its implementation.
It does not mean that the knowledge of the condition of a problem is irrelevant
to any algorithmic solution of the problem. On the contrary! The reason is that
quite often the computed solution y ∗ of (1.6.1) (computed in floating point machine
arithmetic, using a specific algorithm) can be demonstrated to be the exact solution
of a “nearby” problem; that is
y ∗ = f (x∗ ) (1.6.2)
where
x∗ = x + δ (1.6.3)
and moreover, the distance kδk = kx∗ − xk can be estimated in terms of the machine
precision. Therefore, if we know how strongly or weakly the map f reacts to small
perturbation, such as δ in (1.6.3), we can say something about the error y ∗ − y in the
solution caused by the perturbation.
We can consider more general spaces for f , but for practical implementation the
finite dimensional spaces are sufficient.
Let
x = [x1 , . . . , xm ]T ∈ Rm , y = [y1 , . . . , yn ]T ∈ Rn ,
yν = fν (x1 , . . . , xm ), ν = 1, n.
n m
X ∂fν X ∂fν
|∆yν | ≤ ∆xµ ≤ max |∆xµ |
∂xµ ∂xµ ≤
µ
µ=1 µ=1
m
X ∂fν
≤ max |∆xµ | max
∂xµ
µ ν
µ=1
Therefore
∂f
k∆yk∞ ≤ k∆xk∞
∂x
(1.6.7)
∞
where
∂f1 ∂f1 ∂f1
∂x1 ∂x2 ... ∂xm
∂f2 ∂f2 ∂f2
...
∂f ∂x1 ∂x2 ∂xm ∈ Rn × Rm
J(x) = = .. .. .. .. (1.6.8)
∂x . . . .
∂fn ∂fn ∂fn
∂x1 ∂x2 ... ∂xm
for x 6= 0, y 6= 0.
If x = 0 ∧ y 6= 0, then we take the absolute error for x and the relative error for y
0
f (x)
(cond f )(x) = .
f (x)
For y = 0 ∧ x 6= 0 we take the absolute error for y and the relative error for x.
For x = y = 0
(cond f )(x) = f 0 (x) .
Ax = b. (1.6.10)
Here the input data are the elements of A and b, and the result is the vector x. To
simplify matters let’s assume that A is a fixed matrix not subject to change, and only
b is undergoing perturbations. We have a map f : Rn → Rn given by
x = f (b) := A−1 b,
∂f
which is linear. Therefore ∂b = A−1 and using (1.6.9),
kbkkA−1 k kAxkkA−1 k
(cond f )(b) = = ,
kA−1 bk kA−1 bk
kAxk −1 (1.6.11)
max (cond f )(b) = max kA k = kAkkA−1 k. ♦
b∈R n
b6=0
x∈R n
b6=0
kxk
The number kAkkA−1 k is called the condition number of the matrix A and we denote
it by cond A.
cond A = kAkkA−1 k.
f : Rm → Rn , y = f (x). (1.7.1)
Along with the problem f , we are also given an algorithm A that solves the
problem. That is, given a vector x ∈ F(β, p, emin , emax , denorm), the algorithm A
produces a vector yA (in floating-point arithmetic), that is supposed to approximate
1.8. Overall error 13
y = f (x). Thus we have another map fA describing how the problem f is solved by
the algorithm A
fA : Fm (. . . ) → Fn (. . . ), yA = fA (x).
In order to be able to analyze fA in this general terms, we must make a basic
assumption, namely, that
That is, the computed solution corresponding to some input x is the exact solution
for some different input xA (not necessarily a machine vector and not necessarily
uniquely determined) that we hope is close to x. The closer we can find an xA to x,
the more confidence we should place in the algorithm A.
We define the condition of A at x by comparing the relative error with eps:
kxA − xk .
(cond A)(x) = inf eps .
xA kxk
Motivation:
fA (x) − f (x) (xA − x)f 0 (ξ) xA − x 1 xf 0 (x)
δy = = ≈ · eps .
f (x) f (x) x eps f (x)
The infimum is over all xA satisfying yA = f (xA ). In practice one can take any
such xA and then obtain an upper bound for the condition number
kxA −xk
kxk
(cond A)(x) ≤ . (1.7.3)
eps
f : Rm → Rn , y = f (x). (1.8.1)
This is the mathematical (idealized) problem, where the data are exact real num-
bers, and the solution is the mathematically exact solution. When solving such a
problem on a computer, in floating-point arithmetic with precision eps, and using
some algorithm A, one first of all rounds the data, and then applies to these rounded
data not f , but fA .
kx∗ − xk
x∗ = x rounded, = ε, ∗
yA = fA (x∗ ).
kxk
14 Errors and Floating Point Arithmetic
Here ε represents the rounding error in the data. (It could also be due to sources other
than rounding, e.g., measurement.) The total error that we wish to estimate is
∗ − yk
kyA
.
kyk
By the basic assumption (1.7.2, BA) made on the algorithm A, and choosing xA
optimally, we have
fA (x∗ ) = f (x∗A ),
kx∗A − x∗ k
= (cond A)(x∗ ) eps . (1.8.2)
kx∗ k
We supposed kyk ≈ ky ∗ k. By virtue of (1.8.2) we now have for the first term on
the right,
∗ − y∗k kf (x∗A ) − f (x∗ )k
kyA kfA (x∗ ) − f (x∗ )k
= = ≤
ky ∗ k kf (x∗ )k kf (x∗ )k
kx∗A − xk
≤ (cond f )(x∗ ) = (cond f )(x∗ )(cond A)(x∗ ) eps,
kx∗ k
and for the second
Interpretation: The data error and eps contribute together towards the total error.
Both are amplified by the condition of the problem, but the latter is further amplified
by the condition of the algorithm.
1.9. Ill-Conditioned Problems and Ill-Posed Problems 15
1.10 Stability
1.10.1 Asymptotical notations
We shall introduce here basic notations and some common abuses.
For a given function g(n), Θ(g(n)) will denote the set of functions
Θ(g(n)) = {f (n) : ∃c1 , c2 , n0 > 0 0 ≤ c1 g(n) ≤ f (n) ≤ c2 g(n) ∀n ≤ n0 } .
Although Θ(g(n)) is a set we write f (n) = Θ(g(n)) instead of f (n) ∈ Θ(g(n)).
These abuse has some advantages. g(n) will be called an asymptotically tight bound
for f (n).
16 Errors and Floating Point Arithmetic
Also for f (n) ∈ O(g(n)) we shall use f (n) = O(g(n)). Note that f (n) = Θ(g(n))
implies f (n) = O(g(n), since the Θ notation is stronger than the O notation. In set
theory terms, Θ(g(n)) ⊆ O(g(n)). One of the funny properties of the O notation is
n = O(n2 ). g(n) will be called an asymptotically upper bound for f .
For a given function g(n), Ω(g(n)) is defined as the set of functions
f (n)
lim = 1.
n→∞ g(n)
1.10. Stability 17
In words,
A stable algorithm gives nearly the right answer to nearly the right ques-
tion.
Many algorithms of Numerical Linear Algebra satisfy a condition that is both
stronger and simpler than stability. We say that an algorithm fA for the problem f is
backward stable if
ke
x − xk
∀x ∈ X ∃e
x with = O(eps) such that fA (x) = f (e
x). (1.10.5)
kxk
This is a tightening of the definition of stability in that the O(eps) in (1.10.3) was
replaced by zero. In words
A backward stable algorithm gives exactly the right answer to nearly the
right question.
Remark 1.10.1. The notation
||computed quantity|| = O(eps) (1.10.6)
has the following meaning:
Due to the equivalence of norms on finite dimensional linear spaces, for problems
f and algorithms fA defined on such spaces, the properties of accuracy, stability and
backward stability all hold or fail to hold independently of the choice of norms in X
and Y .
where o(1) denotes a quantity which converges to zero as eps → 0. Combining these
bounds gives (1.10.7).
The process just carried out in proving Theorem 1.10.2 is known as backward
error analysis. We obtained an accuracy estimate by two steps. One step is to in-
vestigate the condition of the problem. The other is to investigate the stability of the
algorithm. By Theorem 1.10.2, if the algorithm is backward stable, then the final
accuracy reflects that condition number.
There exists also a forward error analysis. Here, the rounding error introduced at
each step of the calculation are estimated, and somehow, a total is maintained of how
they compound from step to step (section 1.3).
Experience has shown that for the most of the algorithms of numerical linear
algebra, forward error analysis is harder to carry out than the backward error anal-
ysis. The best algorithms of linear algebra do no more, in general, than to compute
exact solutions for slightly perturbed data. Backward error analysis is a method of
reasoning fitted neatly to this backward reality.
20 Errors and Floating Point Arithmetic
Chapter 2
There are two classes of methods for the solution of algebraic linear systems (ALS):
• zeros of p – eigenvalues of A;
1. normal, if AA∗ = A∗ A;
21
22 Numerical Solution of Linear Algebraic Systems
2. unitary, if AA∗ = A∗ A = I;
3. hermitian, if A = A∗ ;
5. symmetric, if A = AT , A real.
A simple way to obtain matrix norm is: given a vector norm k · k on Cn , the map
k · k : Cn×n → R
kAvk
kAk = sup = sup kAvk = sup kAvk
v∈Cn kvk v∈Cn v∈Cn
v6=0 kvk≤1 kvk=1
is a matrix norm called subordinate matrix norm(to the given vector norm) or natural
norm (induced by the given vector norm).
kAvk1 X
kAk1 := sup = max |aij |,
v∈Cn \{0} kvk 1 j
i
kAvk2 p
= ρ(A∗ A) = ρ(AA∗ ) = kA∗ k2 ,
p
kAk2 := sup
v∈Cn \{0} kvk2
kAvk∞ X
kAk∞ := sup = max |aij |,
v∈Cn \{0} kvk∞ i
j
2.1. Notions of Matrix Analysis 23
If A is normal, then
AA∗ = A∗ A ⇒ kAk2 = ρ(A).
ai0 j
The vector u such that uj = for ai0 j 6= 0, uj = 1 for ai0 j = 0 verifies
|ai0 j |
X
kAuk∞ = max |aij | kuk∞ .
i
j
24 Numerical Solution of Linear Algebraic Systems
def
U ∗ AU = diag(λi (A)) = Λ.
In this case
A∗ A = (U ΛU ∗ )∗ U ΛU = U D∗ ΛU ∗ ,
which shows us that
kAk2 = ρ(A).
2.1. Notions of Matrix Analysis 25
kAk2 = ρ(A).
(4) ||.||∞ is called Cebyshev norm or m-norm, ||.||1 is called Minkowski norm or
l-norm, and ||.||2 is the Euclidian norm. ♦
Theorem 2.1.6. (1) Let A be an arbitrary square matrix and k · k a certain matrix
norm (subordinate or not). Then
(2) Given a matrix A and a number ε > 0, there exists a subordinate matrix norm
such that
kAk ≤ ρ(A) + ε. (2.1.2)
Proof. (1) Let p be a vector verifying p 6= 0, Ap = λp, |λ| = ρ(A) and q a vector
such that pq T 6= 0. Since
Dδ = diag(1, δ, δ 2 , . . . , δ n−1 ),
26 Numerical Solution of Linear Algebraic Systems
such that
λ1 δt12 δ 2 t13 . . . δ n−1 t1n
λ2 δt23 . . . δ n−2 t2n
(U Dδ )−1 A(U Dδ ) = .. ..
.
. .
λn−1 δtn−1n
λn
kAk ≤ ρ(A) + ε
P
and according to the choice of δ and the definition of k·k∞ (kcij k∞ = maxi j |cij |)
norm, the norm given by 2.1.3 is a matrix norm subordinated to the vector norm
An important matrix norm, which is not a subordinate matrix norm is the Frobe-
nius norm: 1/2
X X
kAkE = |aij |2 = {tr(A∗ A)}1/2
i j
√
It is not a subordinate norm, since kIkE = n.
Theorem 2.1.7. Let B a square matrix. The following statements are equivalent:
(1) lim B k = 0;
k→∞
(2) lim B k v = 0, ∀ v ∈ Kn ;
k→∞
(4) There exists a subordinate matrix norm such that kBk < 1.
(2) ⇒ (3) If ρ(B) ≥ 1, we can find p such that p 6= 0, Bp = λp , |λ| ≥ 1. Then the
vector sequence (B k p)k∈N could not converge to 0.
(3) ⇒ (4) ρ(B) < 1 ⇒ ∃k · k such that kBk ≤ ρ(B) + ε, ∀ ε > 0 hence kBk < 1.
(4) ⇒ (1) It is sufficient to apply the inequality kB k k ≤ kBkk .
Ax = b.
having the solution (1, 1, 1, 1)T and we consider the perturbed system where the
right-hand side is slightly modified, the system matrix remaining unchanged
10 7 8 7 x1 + δx1 32.1
7 5 6 5 x2 + δx2 22.9
8 6 10 9 x3 + δx4 = 33.1 ,
7 5 9 10 x4 + δx4 30.9
having the solution (9.2, −12.6, 4.5, −1.1)T . In other words a 1/200 error in input
data causes 10/1 relative error on result, hence an approx. 2000 times growing of the
relative error!
Let now the system with the perturbed matrix
10 7 8.1 7.2 x1 + ∆x1 32
7.08 5.04 6 5 x2 + ∆x2 = 23 ,
8 5.98 9.89 9 x3 + ∆x4 33
6.99 4.99 9 9.98 x4 + ∆x4 31
28 Numerical Solution of Linear Algebraic Systems
having the solution (−81, 137, −34, 22)T . Again, a small variation on input data
(here, matrix elements) modifies dramatically the output result. The matrix has a
“good” shape, is symmetric, its determinant is equal to 1, and its inverse is
25 −41 10 −6
−41 68 −17 10
,
10 −17 5 −3
−6 10 −3 2
k∆Ak k∆bk
ρA (t) := |t| , ρb (t) := |t|
kAk kbk
for the relative errors in A and b, the relative error estimate can be written as
k∆x(t)k
≤ kAk
A−1
(ρA + ρb ) + O(t2 ).
(2.2.2)
kxk
2.2. Condition of a linear system 29
k∆x(t)k
≤ cond(A) (ρA + ρb ) + O(t2 ). (2.2.4)
kxk
Example 2.2.2 (Ill-conditioned matrix). Consider the n-th order Hilbert 1 matrix,
Hn = (hij ), given by
1
hij = , i, j = 1, n.
i+j−1
n 10 20 40
cond2 (Hn ) 1.6 · 1013 2.45 · 1028 7.65 · 1058
A system of order n = 10, for example, cannot be solved with any reliability in
single precision on a 14-decimal computer. Double precision will be “exhausted” by
the time we reach n = 20. The Hilbert matrix is thus a prototype of an ill-conditioned
During the solution of the system (2.3.1) or (2.3.2) the following transforms are
allowed:
with ai,n+1 = bi .
Johann Carl Friedrich Gauss (1777-1855) was one of the
greatest mathematicians of the 19th century — and perhaps
of all time. He spent almost his entire life in Göttingen, where
he was the director of the observatory for some 40 years. Al-
ready as a student in Göttingen, Gauss discovered that the 17-
gon can be constructed by compass and ruler, thereby settling
a problem that had been open since antiquity. His dissertation
gave the first proof of the Fundamental Theorem of Algebra.
3 He went on to make fundamental contributions to number the-
ory, differential and non-Euclidean Geometry, elliptic and hy-
pergeometric functions, celestial mechanics and geodesy, and
various branches of physics, notably magnetism and optics.
His computational efforts in celestial mechanics and geodesy,
based on the principle of least squares, required the solution
(by hand) of large systems of linear equations, for which he
used what today are known as Gaussian elimination and re-
laxation methods. Gauss’s work on quadrature builds upon
the earlier work of Newton and Cotes.
32 Numerical Solution of Linear Algebraic Systems
One obtains
(n)
an,n+1
xn = (n)
an,n
and, generally
n
1 (i) X(i)
xi = (i)
ai,n+1 − aij xj , i = n − 1, 1
aii j=i+1
2.3. Gaussian Elimination 33
(i) (i)
The procedure is applicable only if aii 6= 0, i = 1, n. The element aii is called
(k)
pivot. If during the elimination process, at the kth step one obtains akk = 0, one can
perform the line interchange (Ek ) ↔ (Ep ), where k + 1 ≤ p ≤ n is the smallest
(k)
integer satisfying apk 6= 0. In practice, such operations are necessary even if the
pivot is nonzero. The reason is that a pivot which is small cause large rounding errors
and even cancellation. The remedy is to choose for pivoting the subdiagonal element
on the same column having the largest absolute value. That is, we must find a p such
that
(k) (k)
|apk | = max |aik |,
k≤i≤n
and then perform the interchange (Ek ) ↔ (Ep ). This technique is called column
maximal pivoting or partial pivoting.
Another technique which decreases errors and prevents from the floating-point
cancellation is scaled column pivoting. We define in a first step a scaling factor for
each line
n
X
si = max |aij | or si = |aij |.
j=1,n
j=1
If an i such that si = 0 does exist, the matrix is singular. The next steps will establish
what interchange is to be done. In the i-th one finds the smallest integer p, i ≤ p ≤ n,
such that
|api | |aji |
= max
sp i≤j≤n sj
and then, (Ei ) ↔ (Ep ). Scaling guarantees us that the largest element in each col-
umn has the relative magnitude 1, before doing the comparisons needed for line in-
terchange. Scaling is performed only for comparison purpose, so that the division by
the scaling factor does not introduce any rounding error. The third method is total
pivoting or maximal pivoting. In this method, at the kth step one finds
max{|aij |, i = k, n, j = k, n}
and line and columns interchange are carried out.
Pivoting was introduced for the first time by Goldstine and von Neumann, 1947
[19].
Remark 2.3.2. Some suggestions which speeds-up the running time.
1. The pivoting need not physically row or column interchange. One can manage
one (or two) permutation vector(s) p(q); p[i](q[i]) means the line (column) that
was interchanged to the ith line(column). This is a good solution if matrices are
stored row by row or column by column; for other representation or memory
hierarchies, physical interchange could yield better results.
34 Numerical Solution of Linear Algebraic Systems
Proof of Theorem 2.4.1. (sketch) For n > 1 we split A in the following way:
a11 a12 . . . a1n
a21 a22 . . . a2n
a11 w∗
A= = ,
.. .. .. .. v A0
. . . .
an1 an2 . . . ann
a11 w∗ w∗
1 0 a11
A= = .
v A0 v/a11 In−1 0 A0 − vw∗ /a11
36 Numerical Solution of Linear Algebraic Systems
The matrix A0 − vw∗ /a11 is called a Schur complement of A with respect to a11 .
Then, we proceed with the recursive decomposition of Schur complement:
A0 − vw∗ /a11 = L0 U 0 .
w∗
1 0 a11
A = 0 ∗ =
v/a11 In−1 0 A ∗− vw /a11
1 0 a11 w
= =
v/a 11 In−1 0 L0 U 0
a11 w∗
1 0
= .
v/a11 L0 0 U0
We have several choices for uii and lii , i = 1, n. For example, if lii = 1, we have
Doolittle factorization , and if uii = 1, we have Crout factorization.
Ax = b ⇔ LU x = P b ⇔ Ly = P b ∧ U x = y
and
Ax = P −1 LU x = P −1 Ly = P −1 P b = b.
We shall choose as pivot ak1 instead of a11 . The effect is a multiplication by a
permutation matrix Q:
ak1 w∗ w∗
1 0 ak1
QA = = .
v A0 v/ak1 In−1 0 A0 − vw∗ /ak1
Then, we compute the LU P -decomposition of the Schur complement.
We define
1 0
P = Q,
0 P0
2.4. Factorization based methods 37
Theorem 2.4.3. Every hermitian positive definite matrix A ∈ Cm×m has a unique
Cholesky factorization (2.4.1).
Proof. (Existence) Since A is hermitian and positive definite a11 > 0 and we may
√
set α = a11 . Note that
a11 w∗
A=
w K
(2.4.2)
α w∗ /α
α 0 1 0
= = R1∗ A1 R1 .
w/α I 0 K − ww∗ /a11 0 I
This is the basic step that is repeated in Cholesky factorization. The matrix K −
ww∗ /a11 being a (m−1)×(m−1) principal submatrix of the positive definite matrix
R1∗ AR1−1 is positive definite and hence his upper left element is positive. By induc-
tion, all matrices that appear during the factorization are positive definite and thus the
process cannot break down. We proceed to the factorization of A1 = R2∗ A2 R2 , and
thus, A = R1∗ R2∗ A2 R2 R1 ; the process can be employed until the lower right corner
is reached, getting
A = R1∗ R2∗ . . . Rm
∗
R ...R R ;
| {z } | m {z 2 }1
R∗ R
Since only half the matrix needs to be stored, it follows that half of the arithmetic
operations can be avoided. The inner loop dominates the work. A single execution of
the line 4 requires one division, m−j +1 multiplications, and m−j +1 subtractions,
for a total of ∼ 2(m−j) flops. This calculation is repeated once for each j from k +1
to m, and that loop is repeated for each k from 1 to m. The sum is straightforward to
evaluate:
m X
m m X
k m
X X X 1
2(m − j) ∼ 2 j∼ k 2 ∼ m3 flops.
3
k=1 j=k+1 k=1 j=1 k=1
2.4.4 QR decomposition
Theorem 2.4.4. Let A ∈ Rm×n , with m ≥ n. Then, there exists a unique m × n
orthogonal matrix Q and a unique n×n upper triangular matrix R, having a positive
diagonal (rii > 0) such that A = QR.
Orthogonal and unitary matrices are desirable for numerical computation because
they preserve length, preserve angles, and do not magnify errors.
A Householder 4 transform (or a reflection) is a matrix of form P = I − 2uuT ,
where kuk2 = 1. One easily checks that P = P T and
Ax = b ⇔ QRx = b ⇔ Rx = QT b,
we can choose the following strategy for the solution of linear system Ax = b:
2.5. Strassen’s algorithm for matrix multiplication 41
2. Compute y = QT b;
Classical algorithm requires 8 multiplications and 4 additions for one step; the run-
ning time is T (n) = Θ(n3 ), since T (n) = 8T (n/2) + Θ(n2 ).
We are interested in reducing the number of multiplications. Volker Strassen
[39] discovered a method to reduce the number of multiplications to 7 per step. One
computes the following quantities
c11 = p1 + p4 − p5 + p7
c12 = p3 + p5
c21 = p2 + p4
c22 = p1 + p3 − p2 + p6 .
Since we have 7 multiplications and 18 additions per step, the running times verifies
the following recurrence
The solution is
T (n) = Θ(nlog2 7 ) ∼ 28nlog2 7 .
The algorithm can be extended to matrices of n = m · 2k size. If n is odd, then
the last column of the result can be computed using standard method; then Strassen’s
algorithm is applied to n − 1 by n − 1 matrices.
m · 2k+1 → m · 2k
Note that I(n) satisfies the regularity condition only if I(n) has not large jumps
in its values. For example, if I(n) = Θ(nc logd n), for any constants c > 0, d ≥ 0,
then I(n) satisfies the regularity conditions.
Ax = b, (2.7.1)
when A is invertible. Suppose we have found a matrix T and a vector c such that
I − T is invertible and the unique fixpoint of the equation
x = Tx + c (2.7.2)
equates the solution of the system Ax = b. Let x∗ be the solution of (2.7.1) or,
equivalently, of (2.7.2).
Iteration: x(0) given; one defines the sequence (x(k) ) by
(I − X)−1 = I + X + X 2 + · · · + X k + . . .
Proof. Let
Sk = I + X + · · · + X k
(I − X)Sk = I − X k+1
lim (I − X)Sk = I ⇒ lim Sk = (I − X)−1
k→∞ k→∞
= T k x(0) + (I + T + · · · + T n−1 )
Theorem 2.7.3. If there exists k · k such that kT k < 1, the sequence (x(k) ) given by
(2.7.3) is convergent for any x(0) ∈ Rn and the following estimations hold
1 − kT k
kx(k) − x(k−1) k ≤ ε. (2.7.6)
kT k
kT k
kx∗ − x(k) k ≤ kx(k) − x(k−1) k. (2.7.7)
1 − kT k
kT k
kx(k+p) − x(k) k ≤ kx(k) − x(k−1) k,
1 − kT k
Iterative methods are seldom used to the solution of small systems since the time
required to attain the desired accuracy exceeds the time required for Gaussian elimi-
nation. For large sparse systems (i.e. systems whose matrix has many zeros), iterative
methods are efficient both in time and space.
Let the system Ax = b. Suppose we can split A as A = M − N . If M can be
easily inverted (diagonal, triangular, and so on) it is more convenient to carry out the
computation in the following manner
Ax = b ⇔ M x = N x + b ⇔ x = M −1 N x + M −1 b
x(k+1) = M −1 N x(k) + M −1 b, k ∈ N,
Ax = b ⇔ Dx = (L + U )x + b ⇔ x = D−1 (L + U )x + D−1 b
n
(k) 1 X (k−1)
Let us examine Jacobi iteration xi = bi − aij xj .
aii j=1
j6=i
(k)
Computation of xi uses all components of x(k−1) (simultaneous substitution).
(k) (k)
Since for i > 1, x1 , . . . , xi−1 have already been computed, and we suppose they are
(k−1) (k−1)
better approximations of the solution components than x1 , . . . , xi−1 it seems
(k)
reasonable to compute xi using the most recent values, i.e.
k−1 n
(k) 1 X (k)
X (k−1)
xi = bi − aij xj − aij xj .
aii
j=1 k=i+1
One can state necessary and sufficient conditions for the convergence of Jacobi
and Gauss-Seidel methods
ρ(TJ ) < 1
ρ(TGS ) < 1
and sufficient conditions: there exists k · k such that
kTJ k < 1
kTGS k < 1.
We can improve Gauss-Seidel method introducing a parameter ω and splitting
D
M= − L.
ω
We have
D 1−ω
A= −L − D+U ,
ω ω
and the iteration is
D (k+1) 1−ω
−L x = D + U x(k) + b
ω ω
Finally, we obtain the matrix
−1
D 1−ω
T = Tω = −L D+U
ω ω
= (D − ωL)−1 ((1 − ω)D + ωU ).
– ω < 1 subrelaxation
– ω = 1 Gauss-Seidel
Remark 2.7.7. For Jacobi (and Gauss-Seidel) method a sufficient condition for con-
vergence is
n
X
|aii | > |aij | (A row diagonal dominant)
j=1
j6=i
n
X
|aii | > |aji | (A column diagonal dominant) ♦
j=1
j6=i
Function Approximation
49
50 Function Approximation
Example 3.0.9. Φ = Skm (∆) the space of polynomial spline functions of degree m
and smoothness class k on the subdivision
of the interval [a, b]. These are piecewise polynomials of degree ≤ m, pieced together
at the “joints” t1 , . . . , tN −1 , in such a way that all derivatives up to and including
the kth are continuous on the whole interval [a, b] including the joints. We assume
0 ≤ k < m. For k = m this space equates Pm . We set k = −1 if we allow
discontinuities at the joints. ♦
Φ = Rr,s = {ϕ : ϕ = p/q, p ∈ Pr , q ∈ Ps },
Hence, we may take any one of the norms in Table 3.1 and combine it with any of
the preceding linear spaces Φ to arrive at a meaningful best approximation problem
(3.0.1). In the continuous case, the given function f and the functions ϕ ∈ Φ must
be defined on [a, b] and such that the norm kf − ϕk makes sense. Likewise, f and ϕ
must be defined at the points ti in the discrete case.
Note that if the best approximant ϕ b in the discrete case is such that kf − ϕk
b = 0,
then ϕ(t
b i ) = f (ti ), for i = 1, 2, . . . , N . We then say that ϕ b interpolates f at the
points ti and we refer to this kind of approximation as an interpolation problem.
The simplest approximation problems are the least squares problem and the in-
terpolation problem and the easiest space is the space of polynomials.
Before we start with the least square problem we introduce a notational device
that allows us to treat the continuous and the discrete case simultaneously. We define
in the continuous case
0, if t < a (when − ∞ < a),
Z t
w(τ ) dτ, if a ≤ t ≤ b,
λ(t) = a (3.0.3)
Z b
w(τ ) dτ, if t > b (when b < ∞).
a
since dλ(t) ≡ 0 outside [a, b] and dλ(t) = w(t) dt inside. We call dλ a continuous
(positive) measure. The discrete measure (also called “Dirac measure”) associated to
the point set {t1 , t2 , . . . , tN } is a measure dλ that is nonzero only at the points ti and
has the value wi there. Thus in this case
Z N
X
u(t) dλ(t) = wi u(ti ). (3.0.5)
R i=1
52 Function Approximation
(u, v) = 0. (3.1.6)
that is,
n
X
(πi , πj )cj = (πi , f ), i = 1, 2, . . . , n. (3.1.10)
j=1
These are called normal equations for the least squares problem. They form a
system having the form
Ac = b, (3.1.11)
where the matrix A and the vector b have elements
To prove (3.1.13), all we have to do is insert the definition of aij and to use the
property (i)-(iv) of the inner product
2
Xn X n n X
X n
Xn
T
x Ax = xi xj (πi , πj ) = (xi πi , xj πj ) =
xi π i
.
i=1 j=1 i=1 j=1 i=1
Pn
This is clearly nonnegative. It is zero only if i=1 xi πi ≡ 0 on supp dλ, which, by
the assumption of linear independence of the πi , implies x1 = x2 = · · · = xn = 0.
It is a well-known fact of linear algebra that a symmetric positive definite ma-
trix A is nonsingular. Indeed, its determinant, as well as its leading principal minor
determinants are strictly positive. If follows that the system (3.1.10) of normal equa-
tion has a unique solution. Does this solution correspond to a minimum of E[ϕ] in
(3.1.9)? The hessian matrix H = [∂ 2 E 2 /∂ci ∂cj ] has to be positive definite. But
H = 2A, since E 2 is a quadratic function. Therefore, H, with A, is indeed positive
definite, and the solution of the normal equations gives us the desired minimum. The
least squares approximation problem thus has a unique solution, given by
n
X
ϕ(t)
b = cj πj (t)
b (3.1.14)
j=1
where ĉ = [ĉ1 , ĉ2 , . . . , ĉn ]T is the solution of the normal equation (3.1.10).
This completely settles the least square approximation problem in theory. How
in practice? For a general set of linearly independent basis function, we can see the
following difficulties.
(1) The system (3.1.10) may be ill-conditioned. A simple example is provided by
suppdλ = [0, 1], dλ(t) = dt on [0, 1] and πj (t) = tj−1 , j = 1, 2, . . . , n. Then
Z 1
1
(πi , πj ) = ti+j−2 dt = , i, j = 1, 2, . . . , n,
0 i+j−1
that is A is precisely the Hilbert matrix. The resulting severe ill-conditioning of
the normal equations is entirely due to an unfortunate choice of the basis function.
These become almost linearly dependent, R 1 as the exponent grows. Another source of
degradation lies in the element bj = 0 πj (t)f (t)dt of the right-hand side vector.
When j is large πj (t) = tj−1 behaves on [0, 1] like a discontinuous function. A
polynomial πj that oscillates rapidly on [0, 1] would seem to be preferable from this
point of view, since it would ”engage“ more vigorously the function f over all the
interval [0, 1], in contrast to a canonical monomial which shoots from almost zero to
1 at the right endpoint.
(2) The second disadvantage is that all the coefficients b cj in (3.1.14) depends
(n)
on n, i.e. b cj = b cj , j = 1, 2, . . . , n. Increasing n will give an enlarged system
56 Function Approximation
of normal equations with a completely new solution vector. We refer to this as the
nonpermanence of the coefficients b cj .
Both defects (1) and (2) can be eliminated (or at least attenuated) by choosing for
the basis functions πj an orthogonal system,
Then the system of normal equations becomes diagonal and is solved immediately
by
(πj , f )
cj =
b , j = 1, 2, . . . , n. (3.1.16)
(πi , πj )
Clearly, each of these coefficients ĉj are independent of n and once computed, they
remain the same for any larger n. We now have permanence of the coefficients. We
must not solve a system of normal equations, but instead we can use the formula
(3.1.16) directly.
Any system {π̂j } that is linearly independent on suppdλ can be orthogonalized
with respect to the measure dλ by the Gram-Schmidt procedure. One takes
π = π̂1
min kf − ϕk2,dλ = kf − ϕk
b 2,dλ (3.1.17)
ϕ∈φn
has a unique solution ϕ b=ϕ bn , given by (3.1.14). There are many ways to select a
basis {πj } in Φn and, therefore, many ways the solution ϕ̂n be represented. Never-
theless, is always one and the same function. The least squares error – the quantity
on the right of (3.1.17) – is independent of the choice of basis functions (although the
calculation of the least square solution, as mentioned previously, is not). In study-
ing this error we may assume, without restricting generality, that the basis πj is an
3.1. Least Squares approximation 57
(f − ϕ
cn , ϕ) = 0, ∀ ϕ ∈ Φn (3.1.19)
where the inner product is the one in (3.1.3). Since ϕ is a linear combination of the
πk , it suffices to show (3.1.19) for each ϕ = πk , k = 1, 2, . . . , n. Inserting ϕ̂n from
(3.1.18) in the left of (3.1.19), we find indeed
Xn
(f − ϕbn , πk ) = f − cj πk , πk = (f, πk ) − b
b ck (πk , πk ) = 0,
j=1
the last equation following from the formula for ĉk in (3.1.18). The result (3.1.19) has
a simple geometric interpretation. If we picture functions as vectors, and the space
Φn as a plane, then for any function f that “sticks out” of the plane Φn , the least
square approximant ϕ̂n is the orthogonal projection of f onto Φn ; see Figure 3.1.
(f − ϕ
bn , ϕ
bn ) = 0
58 Function Approximation
kf − ϕ
b1 k ≥ kf − ϕ
b2 k ≥ kf − ϕ
b3 k ≥ . . . ,
which follows not only from (3.1.20), but more directly from the fact that
Φ1 ⊂ Φ2 ⊂ Φ3 ⊂ . . . .
If there are infinitely many such spaces, then the sequence of L2 errors, being monot-
onically decreasing, must converge to a limit. Is this limit zero? If so, we say that the
least square approximation process converges (in the mean) as n → ∞. It is obvious
from (3.1.20) that a necessary and sufficient condition for this is
∞
X
cj |2 kπj k2 = kf k2 .
|b (3.1.21)
j=1
3.2. Examples of orthogonal systems 59
1, cos t, cos 2t, cos 3t, . . . , sin t, sin 2t, sin 3t, . . .
We have
Z 2π
0, if 6 `
k=
sin kt sin `t dt = k, ` = 1, 2, 3, . . .
0 π, if k=`
Z 2π 0, k 6= `
cos kt cos `t dt = 2π, k = ` = 0 k, ` = 0, 1, 2, . . .
0
π, k = ` > 0
Z 2π
sin kt cos `t dt = 0, k = 1, 2, 3, . . . , ` = 0, 1, 2, . . .
0
The form of approximation is
∞
a0 X
f (t) = + (ak cos kt + bk sin kt). (3.2.1)
2
k=1
Z 2π
1
bk = f (t) sin kt dt, k = 1, 2, . . . (3.2.2)
π 0
which are known as Fourier coefficients of f . They are precisely the coefficients
(3.1.16) for the trigonometric system. By extension, the coefficients (3.1.16) for any
orthogonal system (πj ) will be called Fourier coefficients of f relative to this system.
In particular, we recognize the truncated Fourier series at k = m the best approxima-
tion of f from the class of trigonometric polynomials of degree ≤ n relative to the
norm
Z 2π 1/2
kuk2 = |u(t)|2 dt .
0
(2) Orthogonal polynomials. Given a measure dλ, we know that any finite num-
ber of consecutive powers 1, t, t2 , . . . are linearly independent on [a, b], if supp dλ =
[a, b], whereas the finite set 1, t, . . . , tn−1 is linearly independent on supp dλ =
{t1 , t2 , . . . , tN }. Since a linearly independent set can be orthogonalized by Gram-
Schmidt procedure, any measure dλ of the type considered generates a unique set of
monic2 polynomials πj (t, dλ), j = 0, 1, 2, . . . satisfying
degree πj = j, j = 0, 1, 2, . . .
(3.2.3)
Z
πk (t)π` (t) dλ(t) = 0, if k 6= `
R
These are called orthogonal polynomials relative to the measure dλ. Let the index
j start from zero. The set {πj } is infinite if suppdλ = [a, b], and consists of exactly
N polynomials π0 , π1 , . . . , πN −1 if supp dλ = {t1 , . . . , tN }. The latter are referred
to as discrete orthogonal polynomials.
Three consecutive orthogonal polynomials are linearly related. Specifically, there
exists real constants αk = αk ( dλ) and βk = βk ( dλ) > 0 (depending on the measure
dλ) such that
(It is understood that (3.2.4) holds for all k ∈ N if supp dλ = [a, b] and only for
k = 0, N − 2 if supp dλ = {t1 , t2 , . . . , tN }).
To prove (3.2.4) and, at the same time identify the coefficients αk , βk we note
that
πk+1 (t) − tπk (t)
2
A polynomial is called monic if its leading coefficient is equal to 1.
3.2. Examples of orthogonal systems 61
k−2
X
πk+1 − tπk (t) = −αk πk (t) − βk πk−1 (t) + γk,j πj (t) (3.2.5)
j=0
(with the understanding that empty sums are zero). Now multiply both sides of (3.2.5)
by πk in the sense of inner product defined in (3.1.3) we get
that is
(tπk , πk )
αk = , k = 0, 1, 2, . . . (3.2.6)
(πk , πk )
Since (tπk , πk−1 ) = (πk , tπk−1 ) and tπk−1 differs from πk by a polynomial of de-
gree < k, we obtain by orthogonality (tπk , πk−1 ) = (πk , πk ); hence
(πk , πk )
βk = , k = 1, 2, . . . (3.2.7)
(πk−1 , πk−1 )
γk,` = 0, ` = 0, 1, . . . , k − 1 (3.2.8)
Z Z b
(tπk , πk ) = tπk2 (t) dλ(t) = w(t)tπk2 (t) dt = 0,
R a
because the integrand is an odd function and the domain is symmetric with respect to
the origin.
k! dk 2
πk (t) = (t − 1)k . (3.3.1)
(2k)! dtk
Let us check first the orthogonality on [−1, 1] relative to the measure dλ(t) = dt.
The Fourier expansion in Chebyshev polynomials (essentially the Fourier cosine ex-
pansion) is given by
∞ ∞
X 0 1 X
f (x) = cj Tj (x) := c0 + cj Tj (x), (3.3.9)
2
j=0 j=1
where Z 1
2 dx
cj = f (x)Tj (x) √ , j ∈ N.
π −1 1 − x2
Truncating (3.3.9) with the term of degree n gives a useful polynomial approximation
of degree n
n n
X 0 c0 X
τn (x) = cj Tj (x) := + cj Tj (x), (3.3.10)
2
j=0 j=1
having an error
∞
X
f (x) − τn (x) = cj Tj (x) ≈ cn+1 Tn+1 (x). (3.3.11)
j=n+1
The approximation on the far right is better the faster the Fourier coefficients cj tend
to zero. The error (3.3.11), essentially oscillates between +cn+1 and −cn+1 and thus
is of “uniform” size. This is in stark contrast to Taylor’s expansion at x = 0, where
the nth degree polynomial partial sum has an error proportional to xn+1 on [−1, 1].
With respect to the inner product
n+1
X
(f, g)T := f (ξk )g(ξk ), (3.3.12)
k=1
3.3. Examples of orthogonal polynomials 67
where {ξ1 , . . . , ξn+1 } is the set of zeros of Tn+1 , the following discrete orthogonality
property holds
0, i 6= j
n+1
(Ti , Tj )T = , i = j 6= 0 .
2
n + 1, i = j = 0
2k−1
Indeed, we have arccos ξk = 2n+2 π, k = 1, n + 1. Let us compute now the inner
product:
(Ti , Tj )T = (cos i arccos t, cos j arccos t)T =
n+1
X
= cos(i arccos ξk ) cos(j arccos ξk ) =
k=1
n+1
X 2k − 1 2k − 1
= cos i π cos j π =
2(n + 1) 2(n + 1)
k=1
n+1
1X 2k − 1 2k − 1
= cos(i + j) π + cos(i − j) π =
2 2(n + 1) 2(n + 1)
k=1
n+1 n+1
1 X i+j 1X i−j
= cos(2k − 1) π+ cos(2k − 1) π.
2 2(n + 1) 2 2(n + 1)
k=1 k=1
i+j i−j
One introduces the notations α := 2(n+1) π, β := 2(n+1) π and
n+1
1X
S1 := cos(2k − 1)α,
2
k=1
n+1
1 X
S2 := cos(2k − 1)β.
2
k=1
Since
2 sin αS1 = sin 2(n + 1)α,
2 sin βS2 = sin 2(n + 1)β,
one obtains S1 = 0 şi S2 = 0.
With respect to the inner product
1 1
(f, g)U := f (η0 )g(η0 ) + f (η1 )g(η1 ) + · · · + f (ηn−1 )g(ηn−1 ) + f (ηn )g(ηn )
2 2
n
X
00
= f (ηk )g(ηk ),
k=0
(3.3.13)
68 Function Approximation
The result (3.3.14) can be given the following interesting interpretation: the best
◦
uniform approximation on [−1, 1] to f (x) = xn from Pn−1 is given by xn − Tn (x),
◦
that is, by the aggregate of terms of degree ≤ n − 1 in Tn taken with the minus
sign. From the theory of uniform polynomial approximation it is known that the
◦
best approximant is unique. Therefore equality in (3.3.14) can hold only if pn (x) =
◦
Tn (x).
sin[(n + 1) arccos t]
Qn (t) = √ , t ∈ [−1, 1]
1 − t2
3.3. Examples of orthogonal polynomials 69
√ They are orthogonal on [−1, 1] relative to the measure dλ(t) = w(t)dt, w(t) =
1 − t2 .
The recurrence relation is
et t−α dn n+α −t
lnα (t) = (t e ) for α > 1
n! dtn
2 dn −t2
Hn (t) = (−1)n et (e ).
dtn
2
They are orthogonal on (−∞, ∞) with respect to the weight w(t) = e−t and the
recurrence relation is for monic polynomials H̃n (t) is
β 2 − α2
αk =
(2k + α + β)(2k + α + β + 2)
and
We conclude this section with a table of some classical weight functions, their
corresponding orthogonal polynomials, and the recursion coefficients αk , βk for gen-
erating orthogonal polynomials (see Table 3.2).
H n [a, b] = {f : [a, b] → R : f ∈ C n−1 [a, b], f (n−1) absolute continuous on [a, b]}.
(3.4.1)
Each function f ∈ H n [a, b] admits a Taylor-type representation with the remain-
der in integral form
n−1 x
(x − a)k (k) (x − t)n−1 (n)
X Z
f (x) = f (a) + f (t)dt. (3.4.2)
k! a (n − 1)!
k=0
The next theorem, due to Peano 6 , extremely important for Numerical Analysis,
gives a representation of real linear functionals, defined on H n [a, b].
Theorem 3.4.2 (Peano). Let L be a real continuous linear functional, defined on
H n [a, b]. If KerL = Pn−1 then
Z b
Lf = K(t)f (n) (t)dt, (3.4.3)
a
where
1
K(t) = L[(· − t)n−1
+ ] (Peano kernel). (3.4.4)
(n − 1)!
Remark 3.4.3. The function
z, z ≥ 0
z+ =
0, z < 0
n is called truncated power.
is called positive part, and z+ ♦
Proof. f admits a Taylor representation with the remainder in integral form
f (x) = Tn−1 (x) + Rn−1 (x)
where
x Z b
(x − t)n−1 (n)
Z
1
Rn−1 (x) = f (t)dt = (x − t)n−1
+ f
(n)
(t)dt
a (n − 1)! (n − 1)! a
By applying L to both sides we get
Z b
1 n−1 (n)
Lf = LTn−1 +LRn−1 ⇒ Lf = L (· − t)+ f (t)dt =
| {z } (n − 1)! a
0
Z b
cont 1
= L(· − t)n−1
+ f
(n)
(t)dt.
(n − 1)! a
Remark 3.4.4. The conclusion of the theorem remains valid if L is not continuous,
but it has the form
n−1
XZ b
Lf = f (i) (x)dµi (x), µi ∈ BV [a, b].
i=0 a ♦
Corollary 3.4.5. If K does not change sign on [a, b] and f (n) is continuous on [a, b],
then there exists ξ ∈ [a, b] such that
1
Lf = f (n) (ξ)Len , (3.4.5)
n!
where ek (x) = xk , k ∈ N.
Proof. Since K does not change sign we may apply in (3.4.3) the second mean value
theorem of integral calculus
Z b
Lf = f (n) (ξ) Kn (t)dt, ξ ∈ [a, b].
a
Setting f = en we get precise (3.4.5).
3.5. Polynomial Interpolation 73
ϕ(xi ) = fi , i = 1, m.
Theorem 3.5.1. There exists one polynomial and only one Lm f ∈ Pm such that
where
m
Y x − xj
`i (x) = . (3.5.3)
j=0
x i − xj
j6=i
Proof. One proves immediately that `i ∈ Pi and that `i (xj ) = δij (Krönecker’s
symbol); it results that the polynomial Lm f defined by (3.5.1) is of degree at most
m and it satisfies (3.5.2). Suppose that there is another polynomial p∗m ∈ Pm which
also verifies (3.5.2) and we set qm = Lm − p∗m ; we have qm ∈ Pm and ∀ i = 0, m,
qm (xi ) = 0; so qm , having (m + 1) distinct roots vanishes identically, therefore the
uniqueness result.
Remark 3.5.3. The basic polynomial `i is thus the unique polynomial satisfying
Setting
m
Y
u(x) = (x − xj )
j=0
u(x)
from (3.5.3) we obtain that ∀ x 6= xi , `i (x) = (x−xi )u0 (xi ) . ♦
Joseph Louis Lagrange (1736-1813), born in Turin, became,
through correspondence with Euler, his protégé. In 1766 he
indeed succeeded Euler in Berlin. He returned to Paris in
1787. Clairaut wrote of the young Lagrange: “... a Young
man, no less remarkable for his talents than for his modesty;
his temperament is mild and melancholic; he knows no other
7
pleasure than study”. Lagrange made fundamental contribu-
tions to the calculus of variations and to number theory, but
worked also on many problems in analysis. He is widely
known for his representation of the remainder term in Tay-
lor’s formula. The interpolation formula appeared in 1794.
His Mécanique Analytique, published in 1788, made him one
of the founders of analytic mechanics.
3.5. Polynomial Interpolation 75
The proof of 3.5.1 proves in fact the existence and the uniqueness of the solution
of general Lagrange interpolation problem:
{(b0 = b1 = · · · = bm = 0) ⇒ pm ≡ 0}
We set pm = a0 + a1 x + · · · + am xm
a = (a0 , a1 , . . . , am )T , b = (b0 , b1 , . . . , bm )T
and let V = (vij ) be the m + 1 by m + 1 square matrix with elements vij = xji . The
equation (3.5.4) can be rewritten in the form
Va=b
one checks easily that the basic polynomials hi` are defined by the recurrences
(x − xi )ri
hiri (x) = qi (x)
ri !
and for ` = ri−1 , ri−2 , . . . , 1, 0
ri
(x − xi )` X j (j−`)
hi` (x) = qi (x) − q (xi )hij (x).
`! ` i
j=`+1
78 Function Approximation
We shall give a more convenient expression for Hermite basic polynomials due
to Dimitrie D. Stancu. They verify
(p)
hkj (xν ) = 0, ν 6= k, p = 0, rν (3.5.8)
(p)
hkj (xk ) = δjp , p = 0, rk
and
u(x)
uk (x) = ,
(x − xk )rk +1
it results from (3.5.8) that hkj is of the form
k −j
rX
(x − xk )ν ν
gkj (x) = gkj (xk ); (3.5.10)
ν!
ν=0
We got
(ν)
(ν) 1 1
gkj (xk ) = ,
j! uk (x) x=xk
and from (3.5.10) and (3.5.9) we finally have
k −j
rX (ν)
(x − xk )j (x − xk )ν
1
hkj (x) = uk (x) .
j! ν! uk (x) x=xk
ν=0
• it is idempotent (Hn ◦ Hn = Hn ).
Proof. Linearity results from (3.5.7). Due to the uniqueness of Hermite interpolation
polynomial, Hn (Hn f ) − Hn f vanishes identically, hence Hn (Hn f ) = Hn f , that is,
it is idempotent.
(H3 f ) (x) = h00 (x)f (0) + h10 (x)f (1) + h01 (x)f 0 (0) + h11 (x)f 0 (1),
where
If we add the node x = 21 , then the quality of approximation increases (see Figure
3.5). ♦
80 Function Approximation
Note that F ∈ C n [α, β], ∃ F (n+1) on (α, β), F (x) = 0 and F (j) (xk ) = 0 for
k = 0, m, j = 0, rk . Thus, F has (n + 2) zeros, considering their multiplicities.
Applying successively Rolle generalized Theorem, it results that there exists at least
one ξ ∈ (α, β) such that F (n+1) (ξ) = 0, i.e.
(m+1)
(n + 1)! f (n+1) (ξ)
F (ξ) = = 0, (3.5.12)
un (x) (Rn f )(x)
where we used the relation (Rn f )(n+1) = f (n+1) − (Hn f )(n+1) = f (n+1) . Express-
ing (Rn f )(x) from (3.5.12) one obtains (3.5.11).
Corollary 3.5.11. We set Mn+1 = max |f (n+1) (x)|; an upper bound of interpola-
x∈[a,b]
tion error (Rn f )(x) = f (x) − (Hn f )(x) is given by
Mn+1
|(Rn f )(x)| ≤ |un (x)|.
(n + 1)!
where
rk
m X
1 X (j)
(x − t)n+ − hkj (x) (xk − t)n+
Kn (x; t) = . (3.5.14)
n!
k=0 j=0
Corollary 3.5.13. Suppose f ∈ C m [α, β] and there exists f (m+1) on (α, β), where
α = min{x, x0 , . . . , xm } and β = max{x, x0 , . . . , xm }; then, for each x ∈ [α, β],
there exists a ξx ∈ (α, β) such that
1
(Rm f )(x) = um (x)f (m+1) (ξx ), (3.5.15)
(n + 1)!
where
m
Y
um (x) = (x − xi ).
i=0
where
m
" #
1 X
Km (x; t) = (x − t)m
+ − lk (x)(xk − t)m
+ . (3.5.17)
m!
k=0
Example 3.5.16. The remainder for the Hermite interpolation formula with double
nodes 0 and 1, for f ∈ C 4 [α, β], is
x2 (x − 1)2 (4)
(R3 f )(x) = f (ξ). ♦
6!
3.6. Efficient Computation of Interpolation Polynomials 83
Example 3.5.17. Let f (x) = ex . For x ∈ [a, b], we have Mn+1 = eb and for every
choice of the points xi , |un (x)| ≤ (b − a)n+1 , which implies
(b − a)n+1 b
max |(Rn f )(x)| ≤ e.
x∈[a,b] (n + 1)!
One gets
lim max |(Rn f )(x)| = lim k(Rn f )(x)k = 0,
n→∞ x∈[a,b] n→∞
that is, Hn f converges uniformly to f on [a, b], when n tends to ∞. In fact, we can
prove an analogous result for any function which can be developed into a Taylor in a
disk centered in x = a+b 3
2 and with the radius of convergence r > 2 (b − a). ♦
Proof. Q = P0,1,...,i−1,i+1,...,k , Q
b = P0,1,...,j−1,j+1,k
(x − xj )Q(x)
b − (x − xi )Q(x)
P (x) =
xi − xj
(xr − xj )Q(x
b r ) − (xr − xi )Q(xr ) xi − xj
P (xr ) = = f (xr ) = f (xr )
xi − xj xi − xj
(xi − xj )Q(x
b i ) − (xi − xj )Q(xi )
P (xi ) = = f (xi )
xi − xj
84 Function Approximation
and
(xj − xi )Q(x
b j ) − (xj − xi )Q(xj )
P (xj ) = = f (xj ),
xi − xj
hence P = P0,1,...,k .
x0 P0
x1 P1 P0,1
x2 P2 P1,2 P0,1,2
x3 P3 P2,3 P1,2,3 P0,1,2,3
x4 P4 P3,4 P2,3,4 P1,2,3,4 P0,1,2,3,4
And now, suppose P0,1,2,3,4 does not provide the desired accuracy. One can select
a new node and add a new line to the table
and neighbor elements on row, column and diagonal could be compared to check if
the desired accuracy was achieved.
The method is called Neville method.
We can simplify the notations
Qi,j := Pi−j,i−j+1,...,i−1,i ,
Qi,j−1 = Pi−j+1,...,i−1,i ,
Qi−1,j−1 := Pi−j,i−j+1,...,i−1 .
Formula (3.6.1) implies
for j = 1, 2, 3, . . . , i = j + 1, j + 2, . . .
Moreover, Qi,0 = f (xi ). We obtain
x0 Q0,0
x1 Q1,0 Q1,1
x2 Q2,0 Q2,1 Q2,2
x3 Q3,0 Q3,1 Q3,2 Q3,3
3.6. Efficient Computation of Interpolation Polynomials 85
If the interpolation procedure converges, then the sequence Qi,i also converges
and a stopping criterion could be
|Qi,i − Qi−1,i−1 | < ε.
The algorithm speeds-up by sorting the nodes on ascending order over |xi − x|.
Aitken methods is similar to Neville method. It builds the table
x0 P0
x1 P1 P0,1
x2 P2 P0,2 P0,1,2
x3 P3 P0,3 P0,1,3 P0,1,2,3
x4 P4 P0,4 P0,1,4 P0,1,2,4 P0,1,2,3,4
To compute a new value one takes the value in top of the preceding column and
the value from the current line and the preceding column.
Lemma 3.6.2.
and
f [xi ] = f (xi ), i = 0, 1, . . . , k.
Proof. For k ≥ 1 let L∗k−1 f be the interpolation polynomial for f , having the degree
k − 1 and the nodes x1 , x2 , . . . , xk ; the coefficient of xk−1 is f [x1 , x2 , . . . , xk ]. The
polynomial qk of degree k defined by
The formula (3.6.4) can be used to generate the table of divided differences
The first column contains the values of function f , the second contains the 1st or-
der divided difference and so on; we pass from a column to the next using formula
(3.6.4): each entry is the difference of the entry to the left and below it and the one
immediately to the left, divided by the difference of the x-value found by going di-
agonally down and the x-value horizontally to the left. The divided differences that
occur in the Newton formula (3.6.3) are precisely the m + 1 entries in the first line
of the table of divided differences. Their computation requires n(n + 1) additions
and 21 n(n + 1) divisions. Adding another data point (xm+1 , f [xm+1 ]) requires the
generation of the next diagonal. Lm+1 f can be obtained from Lm f by adding to it
the term f [x0 , . . . , xm+1 ](x − x0 ) . . . (x − xm+1 ).
is, according to (3.6.3) the interpolation polynomial (in t) of f relative to the points
x0 , x1 , . . . , xm , x. The theorem on the remainder of Lagrange interpolation formula
(3.5.11) implies the existence of a ξ ∈ (a, b) such that
1 (m)
f [x0 , x1 , . . . , xm ] = f (ξ) (3.6.7)
m!
(mean formula for divided differences). ♦
88 Function Approximation
and
α α2 . . . αm
1
1 2α . . . mαm−1
0
V α, . . . , α =
,
| {z } ... ... ... ... ...
m+1 0 0 0 ... m!
that is, the two determinants are built from the line of the node α and its successive
derivatives with respect to α up to the mth order.
The generalization to several nodes is:
(Wf )(x0 , . . . , x0 , . . . , xm , . . . , xm )
[x0 , . . . , x0 , x1 , . . . , x1 , . . . , xm , . . . , xm ;f ] =
| {z } | {z } | {z } V (x0 , . . . , x0 , . . . , xm , . . . , xm )
r0 r1 rm
where
(W f )(x0 , . . . , x0 , . . . , xm , . . . , xm ) =
xr00 −1 x0n−1
1 x0 ... ... f (x0 )
(r0 − 1)xr00 −2 f 0 (x0 )
0 1 ... ...
.. .. .. .. .. ..
..
. . . . . . .
Qr0−1 n−r0 (r −1)
0 0 ... (r0 − 1)! ... p=1 (n − p)x0 f 0 (x0 )
= r −1 n−1
1 xm ... xmm ... xm f (xm )
0 1 . . . (rm − 1)xm rm −2 ... (n − 1)xn−2 f 0 (xm )
m
.. .. .. .. .. .. ..
. . . . . . .
Qrm−1 n−rm (r −1)
0 0 ... (rm − 1)! ... p=1 (n − p)xm f n (xn )
and V (x0 , . . . , x0 , . . . , xm , . . . , xm ) is as above, excepting the last column which is
0 −2
rY rm
Y −2
(xn0 , nxn−1
0 ,..., (n − p)xn−r
0
0 +1
, . . . , xnm , nxm
n−1
,..., xn−r
m
m +1 T
)
p=0 p=0
z0 = x 0 f [z0 ]
f [z0 , z1 ] = f 0 (x0 )
f [z1 ,z2 ]−f [z0 ,z1 ]
z 1 = x0 f [z1 ] f [z0 , z1 , z2 ] = z2 −z0
f (z2 )−f (z1 )
f [z1 , z2 ] = z2 −z1
f [z3 ,z2 ]−f [z2 ,z1 ]
z 2 = x1 f [z2 ] f [x1 , z2 , z3 ] = z3 −z1
f [z2 , z3 ] = f 0 (x 1)
f [z4 ,z3 ]−f [z3 ,z2 ]
z 3 = x1 f [z3 ] f [z2 , z3 , z4 ] = z4 −z2
f (z4 )−f (z3 )
f [z3 , z4 ] = z4 −z3
f [z5 ,z4 ]−f [z4 ,z3 ]
z 4 = x2 f [z4 ] f [z3 , z4 , z5 ] = z5 −z3
f [z4 , z5 ] = f 0 (x2 )
z 5 = x2 f [z5 ]
The other divided differences are obtained as usual, as the Table 3.3 shows. This idea
could be extended to another Hermite interpolation problems. The method is due to
Powell.
We assume further that all nodes are contained in some finite interval [a, b]. Then,
for each m we define
(m) (m)
Pm (x) = Lm (f ; x0 , x1 , . . . , x(m)
m ; x), x ∈ [a, b]. (3.7.2)
We say that Lagrange interpolation based on the triangular array of nodes (3.7.1)
converges if
pm (x) ⇒ f (x), când n → ∞ pe [a, b]. (3.7.3)
excepting the points x = −1, x = 0 and x = 1. See figure 3.7(a), for m = 20.
The convergence in x = ±1 is trivial, since they are interpolation nodes, where the
error is zero. The same is true for x = 0 when m is even, but not if m is odd. The
failure of the convergence in the last two examples can only in part be blamed on
insufficient regularity of f . Another culprit is the equidistribution of nodes. There
are better distributions such as Chebyshev nodes. Figure 3.7(b) gives the graph for
m = 17. ♦
92 Function Approximation
The problem of convergence was solved for the general case by Faber and Bern-
stein during 1914 and 1916. Faber has proved that for each triangular array of nodes
of type 3.7.1 in [a, b] there exists a function f ∈ C[a, b] such that the sequence of
(m)
Lagrange interpolation polynomials Lm f for the nodes xi (row wise) does not
converge uniformly to f on [a, b].
Bernstein 10 has proved that for any array of nodes as before there exists a func-
tion f ∈ C[a, b] such that the corresponding sequence (Lm f ) is divergent.
Remedies:
• Local approach – the interval [a, b] is taken very small – the approach used to
numerical solution of differential equations;
• Spline interpolation – the interpolant is piecewise polynomial.
The solution is trivial, see Figure 3.8. On the interval [xi , xi+1 ]
It follows that
1
kf (·) − s(f, ·)k∞ ≤ |∆|2 kf 00 k∞ . (3.8.5)
8
The dimension of S01 (∆) can be computed in the following way: since we have
n − 1 subintervals, each linear piece has 2 coefficients (2 degrees of freedom) and
each continuity condition reduces the degree of freedom by 1, we have finally
Note that the first equation, when i = 1, and the second when i = n are to be
ignored. The functions Bi may be referred to as “hat functions” (Chinese hats), but
note that the first and the last hat is cut in half. The functions Bi are depicted in
Figure 3.9.
and
hBi ii=1,n = S10 (∆),
can be enforced by prescribing the values of the first derivative at each point xi ,
i = 1, 2, . . . , n. Thus, let m1 , m2 , . . . , mn be arbitrary given numbers, and denote
We solve the problem by Newton’s interpolation formula. The required divided dif-
ferences are
f [xi ,xi+1 ]−mi mi+1 +mi −2f [xi ,xi+1 ]
xi fi mi ∆xi (∆xi )2
mi+1 −f [xi ,xi+1 ]
xi fi f [xi , xi+1 ] ∆xi
xi+1 fi+1 mi+1
xi+1 fi+1
and the Hermite interpolation polynomial (in Newton form) is
f [xi , xi+1 ] − mi
pi (x) = fi + (x − xi )mi + (x − xi )2 +
∆xi
mi+1 + mi − 2f [xi , xi+1 ]
+ (x − xi )2 (x − xi+1 ) .
(∆xi )2
Alternatively, in Taylor’s form, we can write for xi ≤ x ≤ xi+1
ci,0 = fi
ci,1 = mi
f [xi , xi+1 ] − mi
ci,2 = − ci,3 ∆xi (3.8.10)
∆xi
mi+1 + mi − 2f [xi , xi+1 ]
ci,3 =
(∆xi )2
Thus to compute s3 (f ; x) for any given x ∈ [a, b] that is not an interpolation node,
one first locates the interval [xi , xi+1 ] 3 x and then computes the corresponding piece
(3.8.7) by (3.8.9) and (3.8.10).
We now discuss some possible choices of the parameters m1 , m2 , . . . , mn .
3.8. Spline Interpolation 97
and translates into a condition for the Taylor coefficients in (3.8.9), namely
Plugging in explicit values (3.8.10) for these coefficients, we arived at the linear
system
where
bi = 3{∆xi f [xi−1 , xi ] + ∆xi−1 f [xi , xi+1 ]} (3.8.16)
These are n − 2 linear equations in the n unknowns m1 , m2 , . . . , mn . Once m1
and mn have been chosen in some way, the system becomes again tridiagonal in
the remaining unknowns and hence is readily solved by Gaussian elimination, by
factorization or by an iterative method.
Here are some possible choices of m1 and mn .
Complete (clamped) spline. We take m1 = f 0 (a), mn = f 0 (b). It is known that
for this spline, if f ∈ C 4 [a, b],
5 1
where c0 = 384 , c1 = 24 , c2 = 38 , and c3 is a constant depending on the ratio
|∆|
mini ∆xi .
Matching of the second derivatives at the endpoints. We enforce the condi-
tions s003 (f ; a) = f 00 (a); s003 (f ; b) = f 00 (b). Each of these conditions gives rise to an
additional equation, namely,
One conveniently adjoins the first equation to the top of the system (3.8.15), and the
second to the bottom, thereby preserving the tridiagonal structure of the system.
Natural cubic spline. Enforcing s00 (f ; a) = s00 (f ; b) = 0, one obtains two
additional equations, which can be obtained from (3.8.18) by putting there f 00 (a) =
f 00 (b) = 0. The nice thing about this spline is that it requires only function values of
f – no derivatives – but the price one pays is a degradation of the accuracy to O(|∆|2 )
near the endpoints (unless indeed f 00 (a) = f 00 (b) = 0).
”Not-a-knot spline”. (C. deBoor). Here we impose the conditions p1 (x) ≡
p2 (x) and pn−2 (x) ≡ pn−1 (x); that is, the first two pieces and the last two should
be the same polynomial. In effect, this means that the first interior knot x2 , and the
last one xn−1 both are inactive. This again gives rise to two supplementary equations
expressing the continuity of s0003 (f ; x) in x = x2 and x = xn−1 . The continuity
condition of s3 (f, .) at x2 and xn−1 implies the equality of the leading coefficients
c1,3 = c2,3 and cn−2,3 = cn−1,3 . This gives rise to the equations
where
The first equation adjoins to the top of the system n − 2 equations in n unknowns
given by (3.8.15) and (3.8.16) and the second to the bottom. The system is no more
tridiagonal, but it can be turn into a tridiagonal one, by combining equations 1 and
2, and n − 1 and n, respectively. After this transformations, the first and the last
equations become
where
1
f [x1 , x2 ]∆x2 [∆x1 + 2(∆x1 + ∆x2 )] + (∆x1 )2 f [x2 , x3 ]
γ1 =
∆x2 + ∆x1
1 2
γ2 = ∆xn−1 f [xn−2 , xn−1 ] +
∆xn−1 + ∆xn−2
[2(∆xn−1 + ∆xn−2 ) + ∆xn−1 ]∆xn−2 f [xn−1 , xn ] .
in which the endpoints are double knots. This means that whenever we interpolate
on ∆0 , we interpolate to function values at all interior points, but to the functions as
well as first derivative values at the endpoints. The first of the two theorems relates
to the complete cubic spline interpolant, scompl (f ; ·).
Theorem 3.8.1. For any function g ∈ C 2 [a, b] that interpolates f on ∆0 , there holds
Z b Z b
00 2
[g (x)] dx ≥ [s00compl (f ; x)]2 dx, (3.8.22)
a a
with equality if and only if g(·) = scompl (f ; ·).
Proof. We write (for short) scompl = s. The theorem follows once we have shown
that
Z b Z b Z b
00 00 00
2
[g (x)] dx = 2
[g (x) − s (x)] dx + [s00 (x)]2 dx. (3.8.23)
a a a
Indeed, this immediately implies (3.8.22), and equality in (3.8.22) holds if and only
if g 00 (x)−s00 (x) ≡ 0, which, integrating twice from a to x and using the interpolation
properties of s and g at x = a gives g(x) ≡ s(x).
To complete the proof, note that the relation (3.8.23) is equivalent to
Z b
s00 (x)[g 00 (x) − s00 (x)]dx = 0. (3.8.24)
a
100 Function Approximation
Integrating by parts and taking into account that s0 (b) = g 0 (b) = f 0 (b) and s0 (a) =
g 0 (a) = f 0 (a), we get
Z b
s00 (x)[g 00 (x) − s00 (x)]dx =
a
b Z b
00 0 0
= s (x)[g (x) − s (x)] − s000 (x)[g 0 (x) − s0 (x)]dx (3.8.25)
a a
Z b
=− s000 (x)[g 0 (x) − s0 (x)]dx.
a
n−1
X
= s000 (xν+0 )[g(xν+1 ) − s(xν+1 ) − (g(xν ) − s(xν ))] = 0
ν=1
since both s and g interpolate f on ∆. This proves (3.8.24) and hence the theorem.
For interpolation on ∆, the distinction of being optimal goes to the natural cubic
spline interpolant snat (f ; ·).
The proof of Theorem 3.8.3 is virtually the same as that of Theorem 3.8.1, since
(3.8.23) holds again, this time because s00 (b) = s00 (a) = 0.
Putting g(·) = scompl (f ; ·) in Theorem 3.8.3 immediately gives
Z b Z b
[s00compl (f ; x)]2 dx ≥ [s00nat (f ; x)]2 dx. (3.8.27)
a a
equation y = g(x), x ∈ [a, b] and if the spline is constrained to pass through the
points (xi , gi ), then it assumes a form that minimizes the bending energy
b
[g 00 (x)]2 dx
Z
,
a (1 + [g 0 (x)]2 )3
over all functions g similarly constrained. For slowly varying g (kg 0 k∞ 1) this is
nearly the same as the minimum property of Theorem 3.8.3.
102 Function Approximation
Chapter 4
4.1 Introduction
Let X be a linear space, L1 , . . . , Lm real linear functional, that are linear indepen-
dent, defined on X and L : X → R be a real linear functional such that L, L1 , . . . ,
Lm are linear independent.
Real parameters Ai are called coefficients (weights) of the formula, and R(f ) is the
remainder term.
For a formula of form (4.1.1), given Li , we wish to determine the weights Ai and
to study the remainder term corresponding to these coefficients.
103
104 Linear Functional Approximation
Example 4.1.4. If X and Li are like in Example 4.1.3 and f (k) (α) exists, α ∈ [a, b],
k ∈ N∗ , and L(f ) = f (k) (α) one obtains a formula for the approximation of the kth
derivative of f at α
m
X
f (k) (α) = Ai f (xi ) + R(f ),
i=0
called numerical differentiation formula . ♦
Example 4.1.5. If X is a space of functions which are defined on [a, b], integrable
on [a, b] and there exists f (j) (xk ), k = 0, m, j ∈ Ik , with xk ∈ [a, b] and Ik are given
sets of indices
Lkj (f ) = f (j) (xk ), k = 0, m, j ∈ Ik ,
and Z b
L(f ) = f (x)dx,
a
one obtains a formula
Z b m X
X
f (x)dx = Akj f (j) (xk ) + R(f ),
a k=0 j∈Ik
We are now ready to formulate the general approximation problem: given a linear
functional L on X, m linear functional L1 , L2 , . . . , Lm on X and their values (the
“data”) li = Li f , i = 1, m applied to some function f and given a linear subspace
Φ ⊂ X with dim Φ = m, we want to find an approximation formula of the type
m
X
Lf ≈ ai Li f (4.1.2)
i=1
4.1. Introduction 105
that is,
This has a unique solution for arbitrary s if and only if (4.1.5) holds.
We have two approaches for the solution of this problem.
In other words, we apply L not to f , but to ϕ(l; ·) — the solution of the interpola-
tion problem (4.1.3) in which s = `, the given “data”. Our assumption guarantees
that ϕ(`; ·) is uniquely determined. In particular, if f ∈ Φ, then (4.1.8) holds with
equality, since trivially ϕ(l; ·) = f (·). Thus, our approximation (4.1.8) already satis-
fies the exactness condition required for (4.1.2). It remains only to show that (4.1.8)
produces indeed an approximation of the form (4.1.2).
To do so, observe that the interpolant in (4.1.8) is
m
X
ϕ(`; ·) = cj ϕj (·)
j=1
Gc = `, ` = [L1 f, L2 f, . . . , Lm f ]T .
Writing
λj = Lϕj , j = 1, m, λ = [λ1 , λ2 , . . . , λm ]T , (4.1.9)
we have by the linearity of L
m
X
Lϕ(`; ·) = cj Lϕj = λT c = λT G−1 ` = [(GT )−1 λ]T `,
j=1
that is,
m
X
Lϕ(`; ·) = ai Li f, a = [a1 , a2 , . . . , am ]T = (GT )−1 λ. (4.1.10)
i=1
4.2. Numerical Differentiation 107
or by (4.1.8)
m
X
aj Lj ϕi = λi , i = 1, m.
j=1
f (m+1) (ξ(x0 ))
(Rm f )0 (x0 ) = (x0 − x1 ) . . . (x0 − xm ) . (4.2.5)
(m + 1)!
Therefore, differentiating (4.2.4), we find
f 0 (x0 ) = (Lm f )0 (x0 ) + (Rm f )0 (x0 ) . (4.2.6)
| {z }
em
The most important uses of differentiation formulae are made in the discretization
of differential equations — ordinary or partial. In these applications, the spacing
of the points is usually uniform, but unequally distribution points arise when partial
differential operators are to be discretized near the boundary of the domain of interest.
We can also use another interpolation procedures such as: Taylor, Hermite, spline,
least squares.
where
m
X
Q(f ) = Aj Fj (f ),
j=0
Definition 4.3.2. The natural number d = d(Q) having the property ∀ f ∈ Pd,
R(f ) = 0 and ∃ g ∈ Pd + 1 such that R(g) 6= 0 is called degree of exactness of the
quadrature formula..
Since R is linear, a quadrature formula has the degree of exactness d if and only
if R(ej ) = 0, j = 0, d and R(ed+1 ) 6= 0.
If the degree of exactness of a quadrature formula is known, the remainder could
be determined using Peano theorem.
So
h3 00
R1 (f ) = − f (ξk ), ξk ∈ (xk , xk+1 )
12
and Z xk+1
h 1
f (x)dx = (fk + fk+1 ) − h3 f 00 (ξk ). (4.3.4)
xk 2 12
This is the elementary trapezoidal rule. Summing over all subinterval gives the
trapezes rule or the composite trapezoidal rule.
Z b n−1
1 1 1 3 X 00
f (x)dx = h f0 + f1 + · · · + fn−1 + fn − h f (ξk ).
a 2 2 12
k=0
(b − a)h2 00 (b − a)3 00
R1,n (f ) = − f (ξ) = − f (ξ). (4.3.5)
12 12n2
Since f 00 is bounded in absolute value on [a, b] we have
R1,n (f ) = O(h2 ),
where
(xk+1 − t)4 h h
i
1 3 3 3
K2 (t) = − (xk − t)+ + 4 (xk+1 − t)+ + (xk+2 − t)+ ,
3! 4 3
112 Linear Functional Approximation
that is,
(xk+2 −t)4
h i
h 3
1 4 − 3 4 (xk+1 − t) + (xk+2 − t)3 , t ∈ [xk , xk+1 ] ,
K2 (t) = (xk+2 −t)4
6 − h 3
4 3 (xk+2 − t) , t ∈ [xk+1 , xk+2 ] .
One easily checks that for t ∈ [a, b], K2 (t) ≤ 0, so we can apply Peano’s Theorem
corollary.
1
R2 (f ) = f (4) (ξk )R2 (e4 ),
4!
x5k+2 − x5k h 4
xk + 4x4k+1 + x4k+1
R2 (e4 ) = −
" 5 3
x4 + xk+2 xk + x2k+2 x2k + xk+2 x3k + x4k
3
= h 2 k+2
5
#
5x4k + 4x3k xk+2 + 6x2k x2k+2 + 4xk x3k+2 + 5x4k+2
−
12
h
−x4k + 4x3k xk+2 + 6x2k x2k+2 + 4xk x3k+2 − x4k+2
=
60
h h5
= − (xk+2 − xk )4 = −4 .
60 15
Thus,
h5 (4)
R2 (f ) = −
f (ξk ).
90
For the composite Simpson 2 rule we get
Z b
h
f (x)dx = (f0 + 4f1 + 2f2 + 4f3 + 2f4 + · · · + 4fn−1 + fn ) + R2,n (f ) (4.3.7)
a 3
with
1 (b − a)5 (4)
R2,n (f ) = − (b − a)h4 f (4) (ξ) = − f (ξ), ξ ∈ (a, b). (4.3.8)
180 2880n4
where the weight function w is nonnegative and integrable on (a, b). The interval
(a, b) may be finite or infinite. If it is infinite, we must make sure that the integral in
(4.3.9) is well defined, at least when f is a polynomial. We achieve this by requiring
that all moments of the weight function,
Z b
µs = ts w(t)dt, s = 0, 1, 2, . . . (4.3.10)
a
or equivalently,
Z b
wk = lk (t)w(t)dt, k = 1, 2, . . . , n, (4.3.12)
a
where
n
Y t − tl
lk (t) = (4.3.13)
t − tl
l=1 k
l6=k
are the elementary Lagrange interpolation polynomials associated with the nodes t1 ,
t2 , . . . , tn . The fact that (4.3.9) with wk given by (4.3.12) has the degree of exactness
d = n − 1 is evident, since for any f ∈ Pn−1 Ln−1 (f ; ·) ≡ f (·) ı̂n (4.3.11). Con-
versely, if (4.3.9) has the degree of exactness d = n − 1, then putting f (t) = lr (t) in
(4.3.10) gives
Z b Xn
lr (t)w(t)dt = wk lr (tk ) = wr , r = 1, 2, . . . , n,
a k=1
114 Linear Functional Approximation
πn (tk ; w) = 0
Z b
πn (t, w) k = 1, 2, . . . , n. (4.3.15)
wk = 0
w(t)dt,
a (t − tk )πn (tk , w)
The formula was developed in 1814 by Gauss for the special case w(t) ≡ 1 on
[−1, 1], and extended to more general weight functions by Christoffel 4 in 1877. It
is, therefore, also refered to as Gauss-Christoffel quadrature formula.
The first integral on the right, by (b), is zero, since q ∈ Pk−1 , whereas the second, by
(a), since r ∈ Pn−1 equals
n
X n
X n
X
wk r(tk ) = wk [p(tk ) − q(tk )un (tk )] = wk p(tk )
k=1 k=1 k=1
The case k = n is discussed further in §4.3.3. Here we still mention two special
cases with k < n, which are of some practical interest. The first is the Gauss-
Radau quadrature formula in which one endpoint, say a, is finite and serves as a
quadrature node, say t1 = a. The maximum degree of exactness attainable then is
d = 2n−2 and corresponds to k = n−1 in Theorem (4.3.3). Part (b) of that theorem
tells us that the remaining nodes t2 , . . . , tn must be the zeroes of πn−1 (·, wa ), where
wa (t) = (t − a)w(t).
Similarly, in the Gauss-Lobatto formula, both endpoints are finite and serves as
nodes, say, t1 = a, tn = b, and the remaining nodes t2 , . . . , tn−1 are taken to be the
zeros of πn−2 (·; wa,b ), wa,b (t) = (t − a)(b − t)w(t), thus achieving maximum degree
of exactness d = 2n − 3.
the first equality following since lj2 ∈ P2n−2 and the degree of exactness is d =
2n − 1.
(iii) If [a, b] is a finite interval, then the Gauss formula converges for any con-
tinuous function; that is, Rn (f ) → 0, when n → ∞, for any f ∈ C[a, b]. This is
basically a consequence of the Weierstrass Approximation Theorem, which implies
that, if pb2n−1 (f ; ·) denotes the best polynomial approximation of degree 2n − 1 of f
on [a, b] in the uniform norm, then
lim kf (·) − pb2k−1 (f ; ·)k∞ = 0.
n→∞
4.3. Numerical Integration 117
Here the positivity of weights wk has been used crucially. Noting that
n
X Z b
wk = w(t)dt = µ0 ,
k=1 a
we conclude
The next property is the background for an efficient algorithm for computing
Gaussian quadrature formulae.
(iv) Let αk = αk (w) and βk = βk (w) be the recursion coefficients for the or-
thogonal polynomials πk (·) = πk (·; w), that is
where
(tπk , πk )
αk =
(πk , πk )
(4.3.17)
(πk , πk )
βk = ,
(πk−1 , πk−1 )
with β0 defined (as is customary) by
Z b
β0 = w(t)dt (= µ0 ).
a
118 Linear Functional Approximation
The nth order Jacobi matrix for the weight function w is a tridiagonal symmetric
matrix defined by
√
α0 β1 √ 0
√β 1 α1 β2
√
..
Jn (w) =
β2 . .
.. .. p
. . βn−1
p
0 βn−1 αn−1
Theorem 4.3.4. The nodes tk of a Gauss-type quadrature formula are the eigenval-
ues of Jn
Jn vk = tk vk , vkT vk = 1, k = 1, 2, . . . , n, (4.3.18)
and the weights wk are expressible in terms of the first component vk of the corre-
sponding (normalized) eigenvectors by
2
wk = β0 vk,1 , k = 1, 2, . . . , n (4.3.19)
Proof of theorem 4.3.4. Let p π̃k (·) = π̃k (·, w) denote the normalized orthogonal poly-
nomials,
p so that π k = (πk , πk ) dλ π̃k . Inserting this into (4.3.16), dividing by
(πk , πk ) dλ , and using (4.3.17), we obtain
π̃k π̃k−1
π̃k+1 (t) = (t − αk ) p − βk p ,
βk+1 βk+1 βk
p
or, multiplying through by βk+1 and rearranging
p p
tπ̃k (t) = αk π̃k (t) + βk π̃k−1 (t) + βk+1 π̃k+1 (t), k = 0, 1, . . . n − 1. (4.3.20)
In terms of the Jacobi matrix Jn we can write these relations in vector form as
p
tπ̃(t) = Jn π̃(t) + βn π̃n (t)en , (4.3.21)
where π̃(t) = [π̃0 (t), π̃1 (t), . . . , π̃n−1 (t)]T and en (t) = [0, 0, . . . , 0, 1]T are vectors
in Rn . Since tk is a zero of π̃n , it follows from (4.3.21) that
This proves the first relation in Theorem 4.3.4, since π̃ is a nonzero vector, its first
component being
−1/2
π̃0 = β0 . (4.3.23)
To prove the second relation, note from (4.3.22) that the normalized eigenvector
vk is
−1/2
n
1 X
2
vk = π̃(tk ) = π̃µ−1 (tk ) π̃(tk ).
[π̃(tk )T π̃(tk )]
µ=1
Comparing the first component on far left and right, and squaring, gives, by virtue of
(4.3.23)
1 2
n = β0 vk,1 , k = 1, 2, . . . , n. (4.3.24)
X 2
˜ µ−1 (tk )
pi
µ=1
On the other hand, letting f (t) = π̃µ−1 (t) in (4.3.9), one gets, by orthogonality, using
(4.3.23) again that
n
1/2
X
β0 δµ−1,0 = wk π̃µ−1 (tk )
k=0
or in matrix form
1/2
P w = β0 e1 , (4.3.25)
where δµ−1,0 is Kronecker’s delta, P ∈ Rn×n is the matrix of eigenvectors, w ∈ Rn
is the vector of Gaussian weights, and e1 = [1, 0, . . . , 0]T ∈ Rn . Since the columns
of P are orthogonal, we have
n
X
T 2
P P = D, D = diag(d1 , d2 , . . . , dn ), dk = π̃µ−1 (tk ).
µ=1
Z b Z b
w(x)f (x)dx = w(x)(H2n−1 f )(x)dx+
a a
Z b
+ w(x)u2n (x)f [x, x1 , x1 , . . . , xn , xn ]dx.
a
But the degree of exactness 2n − 1 implies
Z b Xn n
X
w(x)(H2n−1 f )(x)dx = wi (H2n−1 f )(xi ) = wi f (xi ),
a i=1 i=1
Z b n
X Z b
w(x)f (x)dx = wi f (xi ) + w(x)u2 (x)f [x, x1 , x1 , . . . , xn , xn ]dx,
a i=1 a
so Z b
Rn (f ) = w(x)u2n (x)f [x, x1 , x1 , . . . , xn , xn ]dx.
a
Since w(x)u2 (x) ≥ 0, applying the second Mean Value Theorem for integrals and
the Mean Value Theorem for divided differences, we get
Z b
Rn (f ) = f [η, x1 , x1 , . . . , xn , xn ] w(x)u2 (x)dx
a
f (2n) (ξ) b
Z
= w(x)[πn (x, w)]2 dx, ξ ∈ [a, b].
(2n)! a
For orthogonal polynomials and their recursion coefficients αk , βk see Table 3.2,
page 71.
AND CONQUER.
In contrast to other methods, which decide what amount of work is needed to
achieve the desired accuracy, an adaptive quadrature compute only as much as it is
necessary. This means that the absolute error ε must be chosen so that to avoid an
infinite loop when one tries to achieve a precision which could not be achieved. The
number of steps depends on the behavior of the function to be integrated.
Possible improvements: metint(a, b, f, 2m) is called twice, the accuracy can
be scaled by the ratio of current interval length and the whole interval length. For
122 Linear Functional Approximation
µk ∈ (a, b).
Let Rk,1 denote the result of approximation in accordance to (4.5.1).
h1 b−a
R1,1 = [f (a) + f (b)] = [f (a) + f (b)] (4.5.2)
2 2
h2
R2,1 = [f (a) + f (b) + 2f (a + h2 )] =
2
b−a b−a
= f (a) + f (b) + 2f a +
4 2
1 1
= R1,1 + h1 f a + h1 .
2 2
and generally
k−2
2X
1 1
Rk,1 = Rk−1,1 + hk−1 f a+ i− hk−1 , k = 2, n (4.5.3)
2 2
i=1
4.5. Iterated Quadratures. Romberg Method 123
b
(b − a) 2 00
Z
I= f (x)dx = Rk−1,1 − hk f (a) + O(h4k ).
a 12
(b − a) 2 00
I =Rk−1,1 − hk f (a) + O(h4k ),
12
b − a 2 00
I =Rk,1 − h f (a) + O(h4k ).
48 k
We get
4Rk,1 − Rk−1,1
I= + O(h4 ).
3
We define
4Rk,1 − Rk−1,1
Rk,2 = . (4.5.4)
3
Lewis Fry Richardson (1881-1953), born, educated, and ac-
tive in England, did pioneering work in numerical weather
prediction, proposing to solve the hydrodynamical and ther-
modynamical equations of meteorology by finite difference
6 methods. He also did a penetrating study of atmospheric tur-
bulence, where a nondimensional quantity introduced by him
is now called “Richardson’s number”. At the age of 50 he
earned a degree in psychology and began to develop a scien-
tific theory of international relations. He was elected fellow of
the Royal Society in 1926.
Archimedes (287 B.C. - 212 B.C.) Greek mathematician from
Syracuse. The achievements of Archimedes are quite out-
standing. He is considered by most historians of mathematics
as one of the greatest mathematicians of all time. He perfected
a method of integration which allowed him to find areas, vol-
umes and surface areas of many bodies. Archimedes was able
to apply the method of exhaustion, which is the early form of
7
integration. He also gave an accurate approximation to π and
showed that he could approximate square roots accurately. He
invented a system for expressing large numbers. In mechan-
ics Archimedes discovered fundamental theorems concerning
the centre of gravity of plane figures and solids. His most fa-
mous theorem gives the weight of a body immersed in a liquid,
called Archimedes’ principle. He defended his town during
the Romans’ siege.
124 Linear Functional Approximation
One applies the Richardson extrapolation to these values. If f ∈ C 2n+2 [a, b],
then for k = 1, n we may write
Z b 2k−1
X−1
hk
f (x)dx = f (a) + f (b) + 2 f (a + ihk )
a 2
i=1
(4.5.5)
k
X
2k+2
+ Ki h2i
k + O(hk ),
i=1
n−1
" #
h X b−a
A(h) = f (a) + 2 f (a + kh) + f (b) , h= .
2 k
k=1
Maclaurin 9 holds
A(h) = a0 + a1 h2 + a2 h4 + · · · + ak h2k + O(h2k+1 ), h→0 (4.5.6)
where
B2k (2k−1)
ak = [f (b) − f (2k−1) (a)], k = 1, 2, . . . , K.
(2k)!
The quantities Bk are the coefficients in the expansion
∞
z X Bk
= zk , |z| < 2π;
ez − 1 k!
k=0
Since (Rn,1 ) is convergent, (Rn,n ) is also convergent, but faster than (Rn,1 ). One
may choose as stopping criterion |Rn−1,n−1 − Rn,n | ≤ ε.
Sk,1 = Rk,2 .
h
S= (f (a) + 4f (c) + f (b)) .
6
For two subintervals one obtains
h
S2 = (f (a) + 4f (d) + 2f (c) + 4f (e) + f (b)) ,
12
where d = (a + c)/2 and e = (c + b)/2. Applying (4.6.1) to S1 and S2 yields
Q = S2 + (S2 − S)/15.
Now, we are able to give a recursive algorithm for the approximation of our in-
tegral. The function adquad evaluates the integral by applying Simpson. It calls
quadstep recursively and apply extrapolation. The description appears in Algorithm
4.2.
4.6. Adaptive Quadratures II 127
f (x) = 0, (5.1.1)
but allow different interpretations depending on the meaning of x and f . The simplest
case is a single equation in a single unknown, in which case f is a given function
of a real or complex variable, and we are trying to find values of this variables for
which f vanishes. Such values are called roots of the equation (5.1.1) or zeros of
the function f . If x in (5.1.1) is a vector, say x = [x1 , x2 , . . . , xd ]T ∈ Rd , and f is
also a vector, each component of which is a function of d variables x1 , x2 , . . . , xd ,
then (5.1.1) represents a system of equations. The system is nonlinear if at least one
component of f depends nonlinearly of at least one of the variables x1 , x2 , . . . , xd .
If all components of f are linear functions of x1 , . . . , xd we call (5.1.1) a system of
linear algebraic equations. Still more generally, (5.1.1) could represent a functional
equation, if x is an element of some function space and f a (linear or nonlinear)
operator acting on this space. In each of these interpretations, the zero on the right of
(5.1.1) has a different meaning: the number zero in the first case, the zero vector in
the second, and the function identically equal to zero in the last case.
Much of this chapter is devoted to single nonlinear equations. Such equations are
often encountered in the analysis of vibrating systems, where the roots correspond
to critical frequencies (resonance). The special case of algebraic equations, where f
in (5.1.1) is a polynomial, is also of considerable importance and deserves a special
treatment.
129
130 Numerical Solution of Nonlinear Equations
lim xn = α, (5.2.1)
n→∞
for some root α of the equation. In case of a system of equations, both xk and α are
vectors of appropriate dimension, and convergence is to be understood is sense of a
componentwise convergence.
Although convergence of an iterative process is certainly desirable, it takes more
than just convergence to make it practical. What one wants is fast convergence. A
basic concept to measure the speed of convergence is the order of convergence.
Definition 5.2.1. One says that xn converge to α (at least) linearly if
|xn − α| ≤ en (5.2.2)
If (5.2.2) and (5.2.3) hold with equality in (5.2.2) then c is called asymptotic error
constant.
The phrase “at least” in this definition relates to the fact that we have only inequal-
ity in (5.2.2), which in practice is all we can usually ascertain. So, strictly speaking,
it is the bounds en that converge linearly, meaning that (e.g. for n large enough) each
of these error bounds is approximately a constant fraction of the preceding one.
Definition 5.2.2. One says that xn converges to α with (at least) order p ≥ 1 If
(5.2.2) holds with
en+1
lim p = c, c>0 (5.2.4)
n→∞ en
The same definitions apply also to vector-valued sequences; one only needs to
replace absolute values in (5.2.2) by (any) vector norm.
The classification of convergence with respect to order is still rather crude, as
there are types of convergence that “fall between the cracks”. Thus, a sequence {en }
may converge to zero slowly than linearly, for example such that c = 1 in (5.2.3).
We call this type of convergence sublinear. Likewise, c = 0 in (5.2.3) gives rise to
superlinear convergence, if (5.2.4) does not hold for any p > 1.
It is instructive to examine the behavior of en , if instead of the limit relations
(5.2.3) and (5.2.4) we had strict equality from some n, say,
en+1
= c, n = n0 , n0 + 1, n0 + 2, . . . (5.2.5)
epn
For n0 large enough, this is almost true. A simple induction argument then shows
that
pk −1 k
en0 +k = c p−1 epn0 , (5.2.6)
which certainly holds for p > 1, but also for p = 1 in the limits as p ↓ 1:
Assuming then en0 sufficiently small so that the approximation xn0 has several
correct decimal digits, we write en0 +k = 10−δk en0 . Then δk , according to (5.2.2)
approximately represents the number of additional correct digits in the approximation
xn0 +k (as opposed to xn0 ). Taking logarithms in (5.2.6) and (5.2.7) gives
1
(
h c , −k
k log i if p = 1
δk = k 1−p 1 −k 1
p p−1 log c + (1 − p ) log en , if p > 1
0
hence as k → ∞
1 1 1
cp = log + log .
p−1 c en0
(We assume here that n0 is large enough, and hence en0 small enough, to have
cp > 0). This shows that the number of correct decimal digits increases linearly
with k when p = 1, but exponentially when p > 1. In the latter case, δk+1 /δk ∼ p
meaning that ultimately (for large k) the number of correct decimal digits increases,
per iteration step, by a factor of p.
132 Numerical Solution of Nonlinear Equations
If each iteration requires m units of work (a “unit of works” typically is the work
involved in computing a function value or a value of one of its derivatives), then the
efficiency index of the iteration may be defined by
It provides a common basis on which to compare different iterative methods with one
another. Methods that converge linearly have efficiency index 1.
Practical computation requires the employment of a stopping rule that terminates
the iteration once the desired accuracy is (or is believed to be) obtained. Ideally, one
stops as soon as kxn − αk < tol, where tol is a prescribed accuracy. Since α is not
known, one commonly replaces xn − α by xn − xn−1 and requires
where
tol = kxn kεr + εa (5.2.10)
with εr , εa prescribed tolerances. As a safety measure, one might require (5.2.9) not
just for one, but a few consecutive values of n. Choosing εr = 0 or εa = 0 will
make (5.2.10) a relative (resp., absolute) error tolerance. It is prudent, however, to
use a “mixed error tolerance”, say εe = εa = ε. Then, if kxn k is small or moderately
large, one effectively controls the absolute error, whereas for kxn k very large, it is
in effect the relative error that is controlled. One can combine the above tests with
||f (x)|| ≤ ε. In algorithms given in this chapter we shall suppose that a function,
stopping criterion, that implement a stopping criterion (rule) is available.
There are situations in which it is desirable to be able to select one particular root
among many others and have the iterative scheme converge to it. This is the case,
for example, in orthogonal polynomials, where we know that all zeros are real and
distinct. It may well be that all zeros are real and distinct. It may well be that we are
interested in the second-largest or third-largest zero, and should be able to compute
it without computing any of the others. This is indeed possible if we combine the
5.3. Sturm Sequences Method 133
π0 (x) = 1, π1 (x) = x − α0
(5.3.2)
πk+1 (x) = (x − αk )πk (x) − βk πk−1 (x), k = 1, 2, . . . , d − 1
with all βk positive. The recursion (5.3.2) is not only useful to compute πd (x), but
has also the property due to Sturm.
Proposition 5.3.1. Let σ(x) be the number of sign changes zeros do not count in the
sequence of numbers
Then, for any two numbers a, b with a < b, the number of real zeros of πd in the
interval a < x ≤ b is equal to σ(a) − σ(b).
σ(x) ≤ r − 1 ⇐⇒ x ≥ ξr . (5.3.4)
√
α0 β1 0
β1 α1 √β2
√
√
..
Jn =
β 2 α2 .
.. .. p
. . βn−1
p
0 βn−1 αn−1
and taking into account that the zeros of πd are precisely the eigenvalues of Jd .
Gershgorin’s theorem states that the eigenvalue of a matrix A = [aij ] of order d
5.4. Method of False Position 135
for n := 1, 2, . . . do
an − bn
xn := an − f (an );
f (an ) − f (bn )
if f (an )f (xn ) > 0 then
an+1 := xn ; bn+1 := bn ;
else
an+1 := an ; bn+1 := xn ;
end if
end for
One may terminate the iteration as soon as min(xn − an , bn − xn ) ≤ tol, where
tol is a prescribed error tolerance, although this is not entirely fool-proof.
The convergence behavior is most easily analyzed if we assume that f is convex
or concave on [a, b]. To fix ideas, suppose f is convex, say
Then f has exactly one zero, α, in [a, b. Moreover, the secant connecting f (a) and
f (b) lies entirely above the graph of y = f (x), and hence intersects the real line to the
left of α. This will be the case of all subsequent secants, which means that the point
x = b remains fixed while the other endpoint a gets continuously updated, producing
a monotonically increasing sequence of approximation. The sequence defined by
xn − b
xn+1 = xn − f (xn ), n ∈ N∗ , x1 = a (5.4.3)
f (xn ) − f (b)
xn − b
xn+1 − α = xn − α − [f (xn ) − f (α)].
f (xn ) − f (b)
xn+1 − α f 0 (α)
lim = 1 − (b − α) . (5.4.4)
n→∞ xn − α f (b)
5.5. Secant Method 137
f 0 (α)
c = 1 − (b − a) .
f (b)
Due to the assumption of convexity, c ∈ (0, 1). The proof when f is concave is
analogous. If f is neither convex nor concave on [a, b], but f ∈ C 2 [a, b] and f 00 (α) 6=
0, f 00 has a constant sign in a neighborhood of α and for n large enough xn will
eventually come to lie in this neighborhood, and we can proceed as above.
Drawbacks. (i) Slow convergence; (ii) The fact that one of the endpoints remain
fixed. If f is very flat near α, the point a is nearby and b further away, the convergence
is exceptionally slow.
This precludes the formation of a fixed false position, as in the method of false po-
sition, and hence suggest potentially faster convergence. Unfortunately, the “global
convergence” no longer holds; the method converges only “locally”, that is only if
the initial approximations x0 and x1 are sufficiently close to a root.
We need a relation between three consecutive errors
f (xn ) f (xn ) − f (α)
xn+1 − α = xn − α − = (xn − α) 1 −
f [xn−1 , xn ] (xn − α)f [xn−1 , xn ]
f [xn , α] f [xn−1 , xn ] − f [xn , α]
= (xn − α) 1 − = (xn − α)
f [xn−1 , xn ] f [xn−1 , xn ]
f [xn , xn−1 , α]
= (xn − α)(xn−1 − α) .
f [xn−1 , xn ]
Hence,
f [xn , xn−1 , α]
(xn+1 − α) = (xn − α)(xn−1 − α) , n ∈ N∗ (5.5.2)
f [xn−1 , xn ]
En+1 = En En−1 , En → 0.
Proof. First of all, observe that α is the only zero of f in Iε . This follows from
Taylor’s formula applied at x = α:
(x − α)2 00
f (x) = f (α) + (x − α)f 0 (α) + f (ξ)
2
where f (α) = 0 and ξ ∈ (x, α) (or (α, x)). Thus, if x ∈ Iε , then also ξ ∈ Iε , and we
have
x − α f 00 (ξ)
0
f (x) = (x − α)f (α) 1 +
2 f 0 (α)
Here, if x 6= α, all three factors are different from zero, the last one since by assump-
tion
x − α f 00 (ξ)
2 f 0 (α) ≤ εM (ε) < 1.
140 Numerical Solution of Nonlinear Equations
1
f [xn−1 , xn ] = f 0 (ξ1 ), f [xn−1 , xn , α] = f 00 (ξ2 ), ξi ∈ Iε , i = 1, 2.
2
Therefore, by (5.5.2),
00
2f
(ξn )
|xn+1 − α| ≤ ε 0 ≤ εεM (ε) < ε,
2f (ξ1 )
that is, xn+1 ∈ Iε . Furthermore, by the relation between three consecutive errors
(5.5.2)), xn+1 6= xn unless f (xn ) = 0 , hence xn = α.
Finally, using again (5.5.2) we have
The pseudocode is given in Algorithm 5.1. Since only one evaluation of f √is re-
quired in each iteration step, the secant method has the efficiency index p = 1+2 5 ≈
1.61803 . . . .
Viewed in this manner, Newton’s method can be vastly generalized to nonlinear equa-
tions of all kinds (nonlinear equations, functional equations, in which case the deriva-
tive f 0 is to be understood as a Fréchet derivative, and the iteration is
The study of error in Newton’s method is virtually the same as the one for the
secant method.
f (xn )
xn+1 − α = xn − α − 0
f (xn )
f (xn ) − f (α)
= (xn − α) 1 −
(xn − α)f 0 (xn )
(5.6.3)
f [xn , α]
= (xn − α) 1 −
f [xn , xn ]
f [xn , xn , α]
= (xn − α)2 .
f [xn , xn ]
Therefore, if xn → α, then
xn+1 − α f 00 (α)
lim =
n→∞ (xn − α)2 2f 0 (α)
that is, Newton’s method has the order of convergence p = 2 if f 00 (α) 6= 0. For the
convergence of Newton’s method we have the following result.
142 Numerical Solution of Nonlinear Equations
If ε is so small that
2εM (ε) < 1, (5.6.5)
then for every x0 ∈ Iε , Newton’s method is well defined and converges quadratically
to the only root α ∈ Iε .
The extra factor 2 in (5.6.5) comes from the requirement that f 0 (x) 6= 0 for x ∈ Iε .
The stopping criterion for Newton’s method
The geometric interpretation of Newton’s method is given in Figure 5.3, and its
description in Algorithm 5.2.
The choice of starting value is, in general, a difficult task. In practice, one chooses
a value, and if after a fixed maximum number the desired accuracy, tested by an
usual stopping criterion, another starting value is chosen. For example, if the root is
isolated in a certain interval [a, b], and f 00 (x) 6= 0, x ∈ (a, b), a choice criterion is
f (x0 )f 00 (x0 ) > 0.
144 Numerical Solution of Nonlinear Equations
f (x)
ϕ(x) = x − . (5.7.2)
f 0 (x)
then the fixed point iteration converges to α, for any x0 ∈ Iε . The order of conver-
gence is p, and the asymptotic error constant is given by (5.7.6).
f (xk−1 ) ≈ (xk−1 − xk )m · c
f (xk−2 ) ≈ (xk−2 − xk )m · c.
5.9. Algebraic Equations 147
F (x) = 0, (5.10.1)
The quantity 1/f 0 (x) is replaced by the inverse of the Jacobian in x(k) :
Note that wk is the solution of the system having n equations and n unknowns
It is more efficient and convenient that, instead of computing the inverse Jacobian to
solve the system (5.10.5) and of using the form (5.10.4) of iteration.
Theorem 5.10.1. Let α be a solution of equation F (x) = 0 and suppose that in
closed ball B(δ) ≡ {x : kx − αk ≤ δ}, there exists the Jacobi matrix of F : Rn →
Rn , it is nonsingular and satisfies a Lipschitz condition
e(k+1) = e(k) − [F 0 (x(k) )]−1 (F (α) − F (x(k) )) = e(k) − [F 0 (x(k) )]−1 Jk e(k)
= [F 0 (x(k) )]−1 (F 0 (x(k) ) − Jk )e(k)
function evaluations (n2 for Jacobian and n for F ) and O(n3 ) flops for the solution of
nonlinear system. This amount of computation is prohibitive, excepting small values
of n, and scalar functions which can be evaluated easily. So, it is natural to focus
our attention to reduce the number of evaluation and to avoid the solution of a linear
system at each step.
With the scalar secant method, the next iteration, x(k+1) , is obtained as the solu-
tion of the linear equation
(k) + hk ) − f (x(k) )
¯lk = f (x(k) ) + (x − x(k) ) f (x = 0.
hk
Here the linear function ¯lk can be interpreted in two ways:
1. ¯lk is an approximation of the tangent equation
lk (x) = f (x(k) ) + (x − x(k) )f 0 (x(k) );
2. ¯lk is the linear interpolation of f between the points x(k) and x(k+1) .
By extending the scalar secant method to n dimensions, different generalization of
secant method are obtained depending on the interpretation of ¯lk . The first inter-
pretation leads to the discrete Newton method, and the second one to interpolation
methods.
The discrete Newton method is obtained replacing F 0 (x) in Newton’s method
(5.10.3) by a discrete approximation A(x, h). The partial derivatives in the Jacobian
matrix (5.10.2) are replaced by divided differences
A(x, h)ei := [F (x + hi ei ) − F (x)]/hi , i = 1, n, (5.11.1)
5.11. Quasi-Newton Methods 151
where ei ∈ Rn is the i-th unit vector and hi = hi (x) is the step length of the dis-
cretization. A possible choice of step length is
ε|xi |, if xi 6= 0;
hi :=
ε, otherwise,
√
with ε := eps, where eps is the machine epsilon.
kx(k+1) − αk
lim = 0. (5.11.4)
k→∞ kx(k) − αk
Broyden’s chooses the vectors u(k) and v (k) using the principle of secant approxi-
mation. For the scalar case, the approximation ak ≈ f 0 (x(k) ) is uniquely defined
by
ak+1 (x(k+1) − x(k) ) = f (x(k+1) − f (x(k) ).
However, for n > 1, the approximation
(called quasi-Newton equation) is no longer uniquely defined; any other matrix of the
form
Āk+1 := Ak+1 + pq T ,
where p, q ∈ Rn and q T (x(k+1) − x(k) checks the equations (5.11.5). On the other
hand,
yk := F (x(k) ) − F (x(k−1) ) and sk := x(k) − x(k−1)
only contain information about the partial derivative of F in the direction of sk , not
about the partial derivative in directions orthogonal to sk . On this direction, the effect
of Ak+1 and Ak should be the same
uniquely determined by formulas (5.11.5) and (5.11.6) (Broyden [6], Dennis şi Moré
[12]).
For the corresponding sequence B0 = A−1 0 ≈ [F (x(0) )]−1 , B1 , B2 , . . . the
Sherman-Morrison formula can be used to obtain the recursion
(sk+1 − Bk yk+1 )sTk+1 Bk
Bk+1 := Bk + , k = 0, 1, 2, . . .
sTk+1 Bk yk+1
which requires only matrix-vector multiplication operations and thus only O(n2 )
computational work. With the matrices Bk one can define the Broyden’s method
by the iteration
2. Even if A is real, some of its eigenvalues may be complex. For real matrices,
these occur always in conjugated pairs. ♦
155
156 Eigenvalues and Eigenvectors
p(x) = an xn + · · · + a0 = an (x − z1 ) . . . (x − zn ), an ∈ R, an 6= 0.
0 − aan0
1 0
− aan1
M = ... .. ..
, (6.1.2)
. .
an−2
1 0 − an
1 − an−1
an
then
n
X
(M vj − zj vj )k xk−1 = x`j (x) − zj `j (x) = (x − zj )`j (x) = p(x) ≈ 0,
k=1
and thus M vj = zj vj , j = 1, n.
Hence, eigenvalues of M are roots of p.
The Frobenius matrix given by (6.1.2) is only a way (there exists many other) to
represent the “multiplication” in (6.1.1); any other basis of Pn−1 provides a matrix
M whose eigenvalues are roots of p. The only device for polynomial handling is
“remainder division”.
6.2. Basic Terminology and Schur Decomposition 157
a real matrix may have complex eigenvalues. Therefore (at least from theoretical
point of view) it is convenient to deal with complex matrices A ∈ Cn×n .
Definition 6.2.1. Two matrices, A, B ∈ Cn×n are called similar if there exists a
nonsingular matrix T ∈ Cn×n , such that
A = T BT −1 .
Lemma 6.2.2. If A, B ∈ Cn×n are similar, their eigenvalues are the same.
The following important result from linear algebra holds, which we state without
proof. (For a proof see [21].)
Theorem 6.2.3 (Jordan normal form). Any matrix A ∈ Cn×n is similar to a matrix
λ` 1
J1 .. .. k
. . . n` ×n`
X
J =
. . , J` =
∈C
, n` = n,
..
Jk
. 1 `=1
λ`
Definition 6.2.4. A matrix is called diagonalizable, if all its Jordan blocks J` have
their dimension equal to one, that is, n` = 1, ` = 1, n. A matrix is called nonderoga-
tory if each eigenvalue λ` appears in exactly one Jordan block, on diagonal.
Theorem 6.2.6 (Schur decomposition). For every matrix A ∈ Cn×n there exists a
unitary matrix U ∈ Cn×n and an upper triangular matrix
λ1 ∗ . . . ∗
.. .. .
. ..
. ∈ Cn×n ,
R=
..
. ∗
λn
such that A = U RU ∗ .
Remark 6.2.7. 1. The diagonal elements of R are eigenvalues of A. Since A and
R are similar, they have the same eigenvalues.
2. We have a stronger form of similarity between A and R: they are unitary-
similar. ♦
R∗ = (U ∗ AU ) = U ∗ A∗ U = U ∗ AU = R,
R must be diagonal, and its diagonal elements are real (being Hermitian).
In other words, Corollary 6.2.8 guarantees that any Hermitian matrix is unitary-
diagonalisable and has a basis which consists of orthonormal eigenvectors. More-
over, all eigenvalues of a Hermitian matrix are real. It is interesting, not only from
theoretical point of view, what kind of matrices are unitary-diagonalisable.
AA∗ = A∗ A. (6.2.2)
that is, (6.2.2). For the converse, we use the Schur decomposition of A in form
R = U ∗ AU . Then
n
X
|λ1 |2 = (R∗ R)11 = (RR∗ )11 = |λ1 |2 + |r1k |2 ,
k=2
A Schur decomposition for real matrices, that is, the so-called real Schur decompo-
sition is a little bit more complicated.
160 Eigenvalues and Eigenvectors
Theorem 6.2.10. For any matrix A ∈ Rn×n there exists an orthogonal matrix U ∈
Rn×n such that
R1 ∗ ... ∗
.. .. .
. ..
. U ∗,
A=U .. (6.2.3)
. ∗
Rk
where either Rj ∈ R1×1 , or Rj ∈ R2×2 , with two complex conjugated eigenvalues,
j = 1, k.
A real Schur decomposition transforms A into an upper Hessenberg matrix
∗ ... ... ∗
..
∗ . . .
T .
U AU = . . . . ..
.
. . .
∗ ∗
Proof. If all eigenvalues of A are real then we proceeds the same way as for complex
Schur decomposition. Thus, let λ = α + iβ, β 6= 0, a complex eigenvalue of A and
x + iy its eigenvector. Then
A(x + iy) = λ(x + iy) = (α + iβ)(x + iy) = (αx − βy) + i(βx + αy)
or in matrix form
α β
A [x y] = [x y] .
|{z} −β α
∈Rn×2 | {z }
:=R
z (k) = Ay (k−1) ,
(k)
z (k) |zj∗ |
(k) 1
y (k) = (k) (k)
, j ∗ = min 1 ≤ j ≤ n : z
j ≥ 1 − kz (k)
k ∞ .
zj∗ kz k∞ k
(6.3.1)
Under certain condition, this sequence converges to the eigenvector corresponding to
the dominant eigenvalue.
Proposition 6.3.1. Let A ∈ Cn×n a diagonalizable matrix whose eigenvalues verify
the condition
|λ1 | > |λ2 | ≥ . . . |λn |.
Then the sequence y (k) , k ∈ N, converges to a multiple of the normed eigenvector
corresponding to the eigenvalue λ1 , for almost every starting vector y (0) .
Proof. Let x1 , . . . , xn be the orthonormal eigenvectors of A corresponding to the
eigenvalues λ1 , . . . , λn – they exist, A is diagonalisable. We express y (0) as
n
X
y (0) = αj xj , αj ∈ C, j = 1, n,
j=1
and also
n k
X λ j
lim |λ1 |−k
Ak y (0)
=
αj
= |α1 | kx1 k∞ .
αj xj
k→∞ ∞
j=1 λ1
then both limits are zero, and we cannot derive any conclusion on the convergence of
y k , k ∈ N; this hyperplane has the measure zero, so in the sequel we shall suppose
α1 6= 0.
Equation (6.3.1) implies y k = γk Ak y (0) , γk ∈ C and moreover ky k k∞ = 1, so
1 1
lim |λ1 |k |γk | = lim = .
k→∞ k→∞ |λ1 |−1 kA(k) y (0) k |α1 |kx1 k∞
Thus
γk λk1 |λ2 |k
α1 x1
y (k) = γk Ak y (0) = k |α |kx k
+O , k ∈ N, (6.3.2)
|γk λ1 | 1 1 ∞ |λ1 |k
| {z } | {z }
=:e−2πiθk =:αx1
where θk ∈ [0, 1]. Now, it is the time to use the ”strange” relation (6.3.1): let j be the
least subscript such that |(αx1 )j | = kαx1 k∞ ; then, by (6.3.2), for a sufficiently large
k, it holds in (6.3.1) j ∗ = j too. Therefore it holds
(k)
(k) yj 1
lim yj = 1 ⇒ lim e2πiθk = lim = .
k→∞ k→∞ k→∞ (αx1 )j (αx1 )j
We could also apply vector iteration to compute all eigenvalues and eigenvectors,
provided that the eigenvalues of A have different modulus. For this purpose we find
the largest modulus eigenvalue λ1 of A and the corresponding eigenvector x1 , and
we proceed to
A(1) = A − λ1 x1 xT1 .
The matrix A(1) is diagonalizable and has the same orthonormal eigenvectors as A,
excepting that x1 is the eigenvector corresponding to the eigenvalue 0, and does not
play any role for the iteration, provided that the starting vector is not a multiple of
x1 . By applying once more vector iteration to A(1) one obtains the second largest
modulus eigenvalue λ2 and the corresponding eigenvector; the iteration
2. The method works well only for “suitable” starting vectors. It sounds gorgeous
that all vectors which are not in a certain hyperplane are good, but the things are
more complicated. If the dominant eigenvalue of a real matrix is complex and
the starting values are real, then the iteration run indefinitely, without finding
an eigenvector.
3. We could perform all computation in complex, but this grows seriously the
computational complexity (with a factor of two for addition and six for multi-
plication, respectively).
With a bit of luck, or as mathematicians say, under certain hypothesis, this sequence
will converge to a matrix whose diagonal elements are the eigenvalues of A.
Lemma 6.4.1. The matrices A(k) , built by (6.4.1), k ∈ N, are orthogonal-similar to
A (and, obviously have the same eigenvalues as A).
164 Eigenvalues and Eigenvectors
In order to prove the convergence, we shall interpret QR iteration as a gener-
alization of vector iteration (6.3.1) (without the strange norming process) to vector
spaces. For this purpose, we shall write the orthonormal base u1 , . . . , um ∈ Cn of
a m-dimensional subspace U ⊂ Cn , m ≤ n, as column vectors of a unitary matrix
U ∈ Rn×m and we shall iterate the subspace (i.e. matrices) over the QR decomposi-
tion
Uk+1 Rk = AUk , k ∈ N0 , U0 ∈ Cn . (6.4.2)
This implies immediately
Uk+1 (Rk . . .R0 ) = AUk (Rk−1 . . .R0 ) = A2 Uk−1 (Rk−2 . . .R0 ) = . . . = Ak+1 U0 .
(6.4.3)
If we define, for m = n, A(k) = Uk∗ AUk , then by (6.4.2), the following relation
holds
A(k) = Uk∗ AUk = Uk∗ Uk+1 Rk
∗ ∗
A(k+1) = Uk+1 AUk+1 = Uk+1 AUk Uk∗ Uk+1
and setting Qk := Uk∗ Uk+1 , we obtain the iteration rule (6.4.1). We choose U0 = I
as starting matrix.
Definition 6.4.2. A phase matrix Θ ∈ Cn×n is a diagonal matrix with form
−iθ
e 1
Θ=
.. ,
θj ∈ [0, 2π), j = 1, n.
.
e −iθ n
Proposition 6.4.3. Suppose A ∈ Cn×n has eigenvalues with distinct moduli, |λ1 | >
|λ2 | > · · · > |λn | > 0. If the matrix X −1 in normal Jordan form A = XΛX −1 of A
has a LU decomposition
1
∗ 1 ∗ . . . ∗
X −1 = ST, S = . . , T = . . ..
. . ,
. . . .
. . .
∗
∗ ... ∗ 1
the there exists phase matrices Θk , k ∈ N0 , such that the matrix sequence (Θk Uk ),
k ∈ N is convergent.
6.4. QR Method – the Theory 165
Remark 6.4.4 (On proposition 6.4.3). 1. The convergence of the matrix sequen-
ce (Θk Uk ) means that if the corresponding orthonormal bases converge to an
orthonormal basis of Cn , we have also the convergence of the corresponding
vector spaces.
X −1 P T = (P X)−1 = LU
b = P T AP , which is the
and P X is invertible. This means that the matrix A
result of line and column permutation of A has the same eigenvalues of A,
fulfill the hypothesis of Proposition 6.4.3.
Before the proof of 6.4.3, let us see why the convergence of the sequence (Uk )
implies the convergence of QR method. Namely, if we have kUk+1 − Uk k ≤ ε or
equivalently
Uk+1 = Uk + E, kEk2 ≤ ε,
then
∗
Qk = Uk+1 Uk = (Uk + E)∗ Uk = I + E ∗ Uk = I + F, kF k2 ≤ kEk2 kUk k2 ≤ ε,
| {z }
=1
and simultaneously
hence A(k) , k ∈ N, converges also to an upper triangular matrix, only if the norms of
Rk , k ∈ N are uniformly bounded. This is the case, since
Proof of Proposition 6.4.3. Let A = XΛX −1 be the Jordan normal form of A, where
Λ = diag(λ1 , . . . , λn ). For U0 = I and k ∈ N0
0
Y k
Uk Rj = X −1 ΛX = XΛk X −1 = XΛk ST = X (Λk SΛ−k ) Λk T,
| {z }
j=k−1 =:Lk
λj k
(Lk )jm = , 1≤m≤j≤n (6.4.4)
λm
such that for k ∈ N
0
λj k 1 . . .
|Lk − I| ≤ max |sjm | max
. . .
, (k ∈ N).
1≤m<j≤n 1≤m<j≤n λm . . .
. . .
1 ... 1 0
(6.4.5)
6.5. QR Method – the Practice 167
Let U
bk R
bk = XLk be the QR decomposition of XLk , that, due to (6.4.5) and Lemma
6.4.5 converges, up to a phase matrix, to a QR decomposition X = U R of X. Now
we apply Lemma 6.4.5 to the identity
Y0
Uk Rj Q bk Λk T ;
bk R
j=k−1
Let us examine shortly the “error term” in (6.4.4), whose sub-diagonal entries
verifies
|λj |
|Lk |jm ≤ |sjm , 1 ≤ m < j ≤ n.
|λm |
Therefore, it holds
The farther is the sub-diagonal element to the diagonal, the faster is the
convergence of that element to zero.
In this case, QR method does not converge to an upper triangular matrix. After 100
iterations we obtain the matrix
9.7407 −4.3355 0.94726
A(100) ≈ 8.552e − 039 −4.2645 0.7236 ,
3.3746e − 039 −0.79491 −3.4762
that correctly provides the real eigenvalue. Additionally, the lower right 2 × 2 matrix
provide the complex eigenvalues −3.8703 ± 0.6480i. ♦
The second example leads us to the following strategy: if the sub-diagonal entries do
not disappear, it is recommendable to examine the corresponding 2 × 2 matrix.
G(n − 1, n; φn ) . . . G(1, 2; φ2 )H = R
6.5. QR Method – the Practice 169
and it holds
H∗ = RGT (1, 2; φ2 ) . . . GT (n − 1, n; φn ). (6.5.1)
This is the idea of Algorithm 6.1. Following [32], the motto must be “once Hessen-
berg, always Hessenberg”.
Let see how to convert the initial matrix to a Hessenberg form. For this purpose
we shall use (for variation) Householder transformations. Let us suppose we have
already found a matrix Qk , such that the first k columns of the transformed matrix
are already in Hessenberg form, that is,
∗ ... ∗ ∗ ∗ ... ∗
∗ ... ∗ ∗ ∗ ... ∗
. . .. .. .. . . ..
. . . . . .
T
Qk AQk =
∗ ∗ ∗ ... ∗ .
(k)
a1 ∗ ... ∗
.. .. . . ..
. . . .
(k)
an−k−1 ∗ . . . ∗
170 Eigenvalues and Eigenvectors
and we get
∗ ... ∗ ∗ ∗ ... ∗
∗ ... ∗ ∗ ∗ ... ∗
. . .. .. .. . . ..
. . . . . .
∗ ∗ ∗ ... ∗
Uk+1 Qk A Qk Uk+1 = Uk+1 ;
| {z } | {z } α ∗ ... ∗
=:Qk+1 =QT 0 ∗ ... ∗
k+1
.. .. ..
. . .
0 ∗ ... ∗
the upper left unit matrix Ik+1 in matrix Uk+1 takes care to have on the first k + 1
columns a Hessenberg structure. Algorithm 6.2 gives a method for conversion of a
matrix into Hessenberg form. To conclude, our QR method will be a two step method:
2. Do QR iterations
(k)
H (k+1) = H∗ , k ∈ N0 ,
Since sub-diagonal entries converge slowest, we can use the maximum of modu-
lus as stopping criterion. This leads us to the simple QR method, see Algorithm 6.3.
Of course, for complex eigenvalues this method iterates infinitely.
Example 6.5.5. We apply the new method to the matrix in Example 6.5.1. For var-
ious given tolerances ε, we get the results given in Table 6.2. Note that one gains a
new decimal digit for sub-diagonal entries at each three iterations. ♦
6.5. QR Method – the Practice 171
ε #iterations λ1 λ2 λ3
10−3 11 4.56155 -0.999834 0.438281
10−4 14 4.56155 -1.00001 0.438461
10−5 17 4.56155 -0.999999 0.438446
10−10 31 4.56155 -1 0.438447
If its discriminant b2 − 4c is positive, then A have two real and distinct eigenvalues
1 p c
x1 = −b − sgn(b) b2 − 4c şi x2 = ,
2 x1
otherwise its eigenvalues are complex, namely
1 p
−b ± i 4c − b2 ;
2
thus we can deal with complex eigenvalues. The function Eigen2x2 returns the
eigenvalues of a 2 × 2 matrix. The idea is implemented in Algorithm 6.4. Effective
QR iterations are given in Algorithm 6.5.
6.5. QR Method – the Practice 173
ε #iteraţii λ1 λ2 λ3
10−3 12 9.7406 −3.8703 + 0.6479i −3.8703 − 0.6479i
10−4 14 9.7407 −3.8703 + 0.6479i −3.8703 − 0.6479i
10−5 17 9.7407 −3.8703 + 0.6480i −3.8703 − 0.6480i
10−5 19 9.7407 −3.8703 + 0.6480i −3.8703 − 0.6480i
10−5 22 9.7407 −3.8703 + 0.6480i −3.8703 − 0.6480i
Example 6.5.6. Let us consider again the matrix in Exemple 6.5.2; we apply it to
Algorithm 6.4. The results are given in Table 6.3. ♦
λj+1 k
, j = 1, n − 1.
λj
The keyword is here spectral shift. One observes that for µ ∈ R the matrix A−µI has
the eigenvalues λ1 − µ, . . . , λn − µ. For an arbitrary invertible matrix B the matrix
B(A − µI)B −1 + µI has the eigenvalues λ1 , . . . , λn – one may shift the spectrum of
matrices forward and backwards by means of a similarity transformation. One sorts
the eigenvalues µ1 , . . . , µn such that
with the starting matrix H 0 = QAQT . Algorithm 6.6 gives a variant of the method
which treats complex eigenvalues. It uses Algorithm 6.7. Within the last algorithm,
(H − Hn,n In )∗ in line 6 means the RQ transformation of the matrix H − Hn,n In
6.5. QR Method – the Practice 175
Algorithm 6.6 Spectral shift QR method, partition and treatment of complex eigen-
values
Input: Matrix A ∈ Rn×n and tolerance tol
Output: Eigenvalues λ of A and number of iterations It
It := 0;
if n = 1 then
λ := A;
return
else if n = 2 then
λ := Eigen2x2(A);
return
else
H := Hessenberg(A); {convert to Hessenberg form}
[H1 , H2 , It] := QRIter2(H, t)
[λ1 , It1 ] := QRSplit2(H1 , tol) {recursive call}
[λ2 , It2 ] := QRSplit2(H2 , tol) {recursive call}
It := It + It1 + It2 ;
λ = [λ1 , λ2 ];
end if
Remark 6.5.7. If the shift value µ is sufficiently close to an eigenvalue λ, then the
matrix could be decomposed in a single iterative step. ♦
There are two possibilities: either both eigenvalues µ and µ0 of B are real and we
proceed as above, or we have a pair of complex conjugated eigenvalues, µ and µ̄.
As we shall see, the second case could be also treated in real arithmetic. Let Qk ,
Q0k ∈ Cn×n şi Rk , Rk0 ∈ Cn×n the matrices of complex QR decomposition
Qk Rk = H (k) − µI,
Q0k Rk0 = Rk Qk + (µ − µ̄)I.
Then it holds
#iterations in R #iterations in C
ε alg. 6.6 alg. 6.8 alg. 6.6 alg. 6.8
1e-010 1 1 9 4
1e-020 9 2 17 5
1e-030 26 3 45 5
4. Double shift QR method is useful only when A has complex eigenvalues; for
symmetric matrices it is not advantageous. ♦
Example 6.5.9. We apply Algorithms 6.6 and 6.8 to matrices in Examples 6.5.1 and
6.5.2. One gets the results in Table 6.4. The good behavior of double shift QR method
can be explained by the idea to obtain two eigenvalues simultaneously. ♦
178 Eigenvalues and Eigenvectors
Algorithm 6.8 Double shift QR method with partition and treating 2 × 2 matrices
Input: Matrix A ∈ Rn×n and tolerance tol
Output: Eigenvalues λ of A and number of iterations It
It := 0;
if n = 1 then
λ := A;
return
else if n = 2 then
λ := Eigen2x2(A);
return
else
H := Hessenberg(A); {convert to Hessenberg form}
[H1 , H2 , It] := QRDouble(H, t)
[λ1 , It1 ] := QRSplit2(H1 , tol) {recursive call}
[λ2 , It2 ] := QRSplit2(H2 , tol) {recursive call}
It := It + It1 + It2 ;
λ = [λ1 , λ2 ];
end if
dy
(
= f (x, y), x ∈ [a, b],
(CP ) dx (7.1.1)
y(a) = y0 .
y 0 = f (x, y),
y(a) = y0 .
179
180 Numerical Solution of Ordinary Differential Equations
with the initial condition u(i) (a) = ui0 , i = 0, d − 1. This problem is easily brought
into the form (7.1.1) by defining
y i = u(i−1) , i = 1, d.
Then
dy 1
= y2, y 1 (a) = u00 ,
dx
dy 2
= y3, y 2 (a) = u10 ,
dx
... (7.1.2)
dy d−1
= yd, y d−1 (a) = ud−2
0 ,
dx
dy d
= g(x, y 1 , y 2 , . . . , y d ), y d (a) = ud−1
0 .
dx
which has the form (7.1.1) with very special (linear) functions f 1 , f 2 , . . . , f d−1 , and
f d (x, y) = g(x, y). ♦
We recall from the theory of differential equation the following basic existence
and uniqueness.
Theorem 7.1.2. Assume that f (x, y) is continuous in the first variable for x ∈ [a, b]
and with respect to the second satisfies a uniform Lipschitz condition
where k · k is some vector norm. Then the initial value problem (CP) has a unique
solution y(x), a ≤ x ≤ b, ∀ y0 ∈ Rd . Moreover, y(x) depends continuously on a
and y0 .
7.2. Numerical Methods 181
∂f i
The Lipschitz condition (7.1.3) certainly holds if all functions ∂y j
(x, y), i, j =
1, d are continuous in the y-variables and bounded on [a, b] × Rd . This is the case for
linear systems of differential equations, where
d
X
f i (x, y) = aij (x)y j + bi (x), i = 1, d
j=1
Definition 7.3.3. The method Φ is said to have order p if for some vector norm k · k,
Note that p > 0 implies consistency. Usually, p ∈ N∗ . It is called the exact order,
if (7.3.7) does not hold for any larger p.
Definition 7.3.4. A function τ : [a, b] × Rd → Rd that satisfies τ (x, y) 6≡ 0 and
The principal error function determines the leading term in the truncation error.
The number p in (7.3.9) is the exact order of the method since τ 6≡ 0.
All the preceding definitions are made with the idea in mind that h > 0 is a small
number. Then the larger is p, the more accurate is the method.
Figure 7.1: Euler’s method – the exact solution (continuous line) and the approximate
solution (dashed line)
where u(t) is the reference solution defined in (7.3.2). Since u'(x) = f(x, u(x)) =
f(x, y), we can write, using Taylor's theorem,

  T(x, y; h) = u'(x) − (1/h)[u(x + h) − u(x)]
             = u'(x) − (1/h)[u(x) + h u'(x) + (1/2) h² u''(ξ) − u(x)]          (7.4.3)
             = −(1/2) h u''(ξ),   ξ ∈ (x, x + h),

that is, since u''(t) = [f_x + f_y f](t, u(t)) along the reference solution,

  T(x, y; h) = −(1/2) h [f_x + f_y f](ξ, u(ξ)),                                 (7.4.4)
where f_x is the partial derivative of f with respect to x and f_y the Jacobian of f with
respect to the y-variables. If, in the spirit of Theorem 7.1.2, we assume that f and all
its first partial derivatives are uniformly bounded on [a, b] × R^d, there exists a constant
C independent of x, y and h such that

  ‖T(x, y; h)‖ ≤ C h.                                                           (7.4.5)

Thus, Euler's method has order p = 1. If we make the same assumption
about all second-order partial derivatives of f, we have u''(ξ) = u''(x) + O(h) and
therefore, from (7.4.3),
  T(x, y; h) = −(1/2) h [f_x + f_y f](x, y) + O(h²),   h → 0,                  (7.4.6)

so that the principal error function is

  τ(x, y) = −(1/2) [f_x + f_y f](x, y).                                         (7.4.7)
which determine the successive derivatives of the reference solution u(t) of (7.3.2)
by virtue of
  u^{(k+1)}(t) = f^{[k]}(t, u(t)),   k = 0, 1, 2, ...                           (7.4.9)
that is,

  Φ(x, y; h) = f^{[0]}(x, y) + (h/2) f^{[1]}(x, y) + ··· + (h^{p−1}/p!) f^{[p−1]}(x, y).   (7.4.12)
For the truncation error, using (7.4.10) and (7.4.12) and assuming f ∈ C p ([a, b]×
Rd ), we obtain from Taylor’s theorem
  T(x, y; h) = Φ(x, y; h) − (1/h)[u(x + h) − u(x)]
             = Φ(x, y; h) − Σ_{k=0}^{p−1} (h^k/(k + 1)!) u^{(k+1)}(x) − (h^p/(p + 1)!) u^{(p+1)}(ξ)
             = −u^{(p+1)}(ξ) h^p/(p + 1)!,   ξ ∈ (x, x + h),
so that

  ‖T(x, y; h)‖ ≤ (C_p/(p + 1)!) h^p,

where C_p is a bound on the p-th total derivative of f. Thus, the method has the exact
order p (unless f^{[p]}(x, y) ≡ 0), and the principal error function is

  τ(x, y) = −(1/(p + 1)!) f^{[p]}(x, y).                                        (7.4.13)
The necessity of computing many partial derivatives in (7.4.8) was a discouraging
factor in the past, when this had to be done by hand. But nowadays this task can be
delegated to the computer, so that the method has again become a viable option.
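As an illustration (our sketch, not from the text), here is the Taylor method of order p = 2, Φ = f^{[0]} + (h/2) f^{[1]}, applied to the test problem y' = −y + t + 1 of Example 7.5.2; for this f one finds by hand f^{[1]} = f_t + f_y f = 1 − (−y + t + 1) = y − t.

% Taylor method of order 2: u_{n+1} = u_n + h*( f0 + (h/2)*f1 ),
% for y' = -y + t + 1, y(0) = 1, with f1 = f^[1] = y - t computed by hand.
f0 = @(t, y) -y + t + 1;
f1 = @(t, y)  y - t;
a = 0;  b = 1;  N = 10;  h = (b - a)/N;
t = a + (0:N)'*h;
u = zeros(N+1, 1);  u(1) = 1;
for n = 1:N
  u(n+1) = u(n) + h*( f0(t(n), u(n)) + (h/2)*f1(t(n), u(n)) );
end
disp(max(abs(u - (t + exp(-t)))))         % global error, O(h^2)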
In other words, we are taking two trial slopes, K_1 and K_2, one at the initial point and
the other nearby, and then taking the latter as the final slope. The method is called the
modified Euler method.
We could equally well take the second trial slope at (x + h, y + hf(x, y)), but
then, having waited this long before reevaluating the slope, take now as the final
slope the average of the two slopes:
  K_1(x, y) = f(x, y),
  K_2(x, y; h) = f(x + h, y + h K_1),                                           (7.4.17)
  y_next = y + (h/2)(K_1 + K_2).
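A MATLAB sketch of (7.4.17), again on the test problem of Example 7.5.2 (the test problem and step size are our choice):

% Method (7.4.17): K1 = f(x,y), K2 = f(x+h, y+h*K1), ynext = y + h/2*(K1+K2).
f = @(t, y) -y + t + 1;                   % test problem (our choice)
a = 0;  b = 1;  N = 10;  h = (b - a)/N;
t = a + (0:N)'*h;
u = zeros(N+1, 1);  u(1) = 1;
for n = 1:N
  K1 = f(t(n),     u(n));
  K2 = f(t(n) + h, u(n) + h*K1);
  u(n+1) = u(n) + (h/2)*(K1 + K2);
end
disp(max(abs(u - (t + exp(-t)))))         % global error, O(h^2)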
  μ_s = Σ_{j=1}^{s−1} λ_sj,   s = 2, 3, ..., r,        Σ_{s=1}^{r} α_s = 1.      (7.5.2)

The first of these conditions ensures that

  K_s(x, y; h) = u'(x + μ_s h) + O(h²),   s ≥ 2,

and the second is nothing but the consistency condition (cf. (7.3.6)), i.e.
Φ(x, y; 0) = f(x, y).
We call (7.5.1) an explicit r-stage Runge-Kutta method; it requires r evaluations
of the right-hand side f of the differential equation. Conditions (7.5.2) lead to a
nonlinear system. Let p*(r) be the maximum attainable order (for arbitrary sufficiently
smooth f) of an explicit r-stage Runge-Kutta method. Kutta showed in 1901
that

  p*(r) = r,   r = 1, 4.
in which the last r equations form a system of (in general nonlinear) equations in
the unknowns K1 , K2 , . . . , Kr . Since each of these is a vector in Rd , before we
can form the approximate increment Φ we must solve a system of rd equations in rd
unknowns. Semi-implicit Runge-Kutta methods, where the summation in the formula
for Ks extends from j = 1 to j = s, require less work. This yields r systems of
equations, each having only d unknowns, the components of Ks .
Already in the case of explicit Runge-Kutta methods, and even more so in im-
plicit methods, we have at our disposal a large number of parameters which we can
choose to achieve the maximum possible order for all sufficiently smooth f . The
considerable computational expenses involved in implicit and semi-implicit methods
can only be justified in special circumstances, for example, stiff problems. The reason
is that implicit methods can be made not only to have higher order than explicit
methods, but also to have better stability properties.
Example 7.5.1. Let

  Φ(x, y; h) = α_1 K_1 + α_2 K_2,

with K_1 = f(x, y) and K_2 = f(x + μh, y + μh K_1), as in (7.5.1) with r = 2. Expanding
the truncation error in powers of h, the h² coefficient involves f_y, the Jacobian of f,
and f_yy = [f^i_yy], the vector of Hessian matrices of the components of f.
We cannot enforce the condition that this coefficient be zero without imposing
severe restrictions on f. Thus, the maximum order is 2 and we obtain it for
  α_1 + α_2 = 1,   α_2 μ = 1/2.

The solution is

  α_1 = 1 − α_2,   μ = 1/(2α_2).
When f does not depend on y, (7.5.8) becomes Simpson's formula. Runge's
idea was to generalize Simpson's quadrature formula to ordinary differential equa-
tions. He succeeded only partially; his formula had r = 4 and p = 3. The method
(7.5.8) was discovered by Kutta in 1901 through a systematic search.
The classical 4th order Runge-Kutta method on a grid of N + 1 equally spaced
points is given by Algorithm 7.1.
Example 7.5.2. Using the 4th order Runge-Kutta method for the initial value problem

  y' = −y + t + 1,   t ∈ [0, 1],
  y(0) = 1,

with h = 0.1, N = 10, and t_i = 0.1 i, we obtain the results given in Table 7.1. ♦
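For reference, a short MATLAB sketch of this computation (the exact solution y(t) = t + e^{−t} is used only to display the comparison):

% Classical RK4 for y' = -y + t + 1, y(0) = 1, h = 0.1, as in Example 7.5.2.
f = @(t, y) -y + t + 1;
h = 0.1;  N = 10;  t = (0:N)'*h;
u = zeros(N+1, 1);  u(1) = 1;
for n = 1:N
  K1 = f(t(n),       u(n));
  K2 = f(t(n) + h/2, u(n) + (h/2)*K1);
  K3 = f(t(n) + h/2, u(n) + (h/2)*K2);
  K4 = f(t(n) + h,   u(n) + h*K3);
  u(n+1) = u(n) + (h/6)*(K1 + 2*K2 + 2*K3 + K4);
end
disp([t u (t + exp(-t))])                 % nodes, RK4 values, exact values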
and

  (1/h)[y(x + h) − y(x)] − Σ_{s=1}^{r} α_s y'(x + μ_s h) = O(h^{q_r + 1}).
For the classical 4th order Runge-Kutta method (7.5.8) the Butcher table is:

  0    |
  1/2  |  1/2
  1/2  |  0     1/2
  1    |  0     0     1
  -----+------------------------
       |  1/6   2/6   2/6   1/6
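The Butcher table is also a convenient data structure for implementation. The MATLAB sketch below (the function name and argument order are ours) performs one step of an arbitrary explicit r-stage Runge-Kutta method specified by its coefficients mu, lambda, alpha; with the table above, i.e. mu = [0; 1/2; 1/2; 1], alpha = [1/6; 2/6; 2/6; 1/6] and lambda(2,1) = lambda(3,2) = 1/2, lambda(4,3) = 1 (all other entries zero), it reproduces one step of the classical method.

function ynext = rk_step(f, x, y, h, mu, lambda, alpha)
% One step of an explicit Runge-Kutta method given by its Butcher table:
% mu (r x 1), lambda (r x r, strictly lower triangular), alpha (r x 1);
% y must be a column vector.
  r = length(alpha);  d = length(y);
  K = zeros(d, r);
  for s = 1:r
    K(:, s) = f(x + mu(s)*h, y + h*K(:, 1:s-1)*lambda(s, 1:s-1)');
  end
  ynext = y + h*K*alpha(:);
end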
7.6.1 Stability
Stability is a property of the numerical scheme (7.6.5) alone and has nothing to do
with its approximation power. It characterizes the robustness of the scheme with
respect to small perturbations. Nevertheless, stability combined with consistency
yields convergence of the numerical solution to the true solution.
We define stability in terms of the discrete residual operators Rh in (7.6.7). As
usual we assume Φ(x, y; h) to be defined on [a, b] × Rd × [0, h0 ], where h0 > 0 is
some suitable positive number.
Definition 7.6.1. The method (7.6.5) is called stable on [a, b] if there exists a constant
K > 0 not depending on h such that, for an arbitrary grid h on [a, b] and for two
arbitrary grid functions v, w ∈ Γ_h[a, b], there holds

  ‖v − w‖_∞ ≤ K (‖v_0 − w_0‖ + ‖R_h v − R_h w‖_∞),   v, w ∈ Γ_h[a, b],          (7.6.11)

for all h with |h| sufficiently small. In (7.6.11) the norm is defined by (7.6.4).
  R_h u = 0,   u_0 = y_0,                                                        (7.6.12)
  R_h w = ε,   w_0 = y_0 + η_0,                                                  (7.6.13)

where ε = {ε_n} ∈ Γ_h[a, b] is a grid function with small ‖ε_n‖, and ‖η_0‖ is also
small. We may interpret u ∈ Γ_h[a, b] as the result of applying the numerical scheme
(7.6.5) in infinite precision, whereas w ∈ Γ_h[a, b] could be the solution of (7.6.5)
in floating-point arithmetic. Then, if stability holds, we have

  ‖u − w‖_∞ ≤ K (‖η_0‖ + ‖ε‖_∞),

that is, the global change in u is of the same order of magnitude as the local resid-
ual errors {ε_n} and the initial error η_0. It should be appreciated, however, that the first
equation in (7.6.13) says

  w_{n+1} − w_n − h_n Φ(x_n, w_n; h_n) = h_n ε_n,
  e_{n+1} ≤ a_n e_n + b_n,   n = 0, 1, ..., N − 1.                               (7.6.16)
We adopt here the usual convention that an empty product has the value 1 and an
empty sum has the value 0.
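As a worked example of how (7.6.16) is used (assuming a_n ≥ 0, as is the case for a_n = 1 + h_n M in the application further below), unwinding the recursion step by step gives

  e_n ≤ ( ∏_{k=0}^{n−1} a_k ) e_0 + Σ_{k=0}^{n−1} ( ∏_{l=k+1}^{n−1} a_l ) b_k,   n = 0, 1, ..., N;

for n = 0 the product and the sum are empty, so the bound reduces to e_0, consistent with the convention just stated.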
Therefore
We have actually proved stability for all |h| ≤ h0 , not only for h sufficiently
small.
All one-step methods used in practice satisfy a Lipschitz condition if f does, and
the constant M for Φ can be expressed in terms of the Lipschitz constant L for f. This
is obvious for Euler's method, and not difficult to prove for the others. It is useful to note
that Φ need not be continuous in x; piecewise continuity suffices, as long as
(7.6.15) holds for all x ∈ [a, b], taking one-sided limits at the points of discontinuity.
The following application of Lemma 7.6.3, relative to a grid function v ∈ Γ_h[a, b]
satisfying
where the constants M, δ do not depend on h. Then, there exists a constant K > 0,
independent of h but depending on ‖v_0‖, such that

  ‖v‖_∞ ≤ K.                                                                     (7.6.23)

Indeed,

  ‖v_{n+1}‖ ≤ (1 + h_n M) ‖v_n‖ + h_n δ,   n = 0, 1, ..., N − 1,

which is precisely the inequality (7.6.19) in the proof of Theorem 7.6.2, hence
7.6.2 Convergence
Stability is a powerful concept. It implies almost immediately convergence, and it is
also instrumental in deriving asymptotic global error estimates. We begin by defining
precisely what we mean by convergence.
Definition 7.6.5. Let a = x_0 < x_1 < x_2 < ··· < x_N = b be a grid on [a, b] with
grid length |h| = max_{1≤n≤N} (x_n − x_{n−1}). Let u = {u_n} be the grid function defined
by applying the method (7.6.5) on [a, b] and y = {y_n} the grid function induced by
the exact solution of the initial value problem (7.1.1). The method (7.6.5) is said to
converge on [a, b] if there holds
Theorem 7.6.6. If the method (7.6.5) is consistent and stable on [a, b], then it con-
verges. Moreover, if Φ has order p, then
Proof. By the stability inequality (7.6.11) applied to the grid functions v = u and
w = y of Definition 7.6.5, we have for |h| sufficiently small
which proves the first part of the theorem. The second part follows immediately from
(7.6.27) and (7.6.28), since order p means, by definition, that
  x_{n+1} = x_n + h,
  u_{n+1} = u_n + h Φ(x_n, u_n; h),   n = 0, 1, ..., N − 1,                      (7.6.30)
  x_0 = a,   u_0 = y_0,
defining a grid function u = {un } on a uniform grid on [a, b]. We are interested in
the asymptotic behavior of un − y(xn ) as h → 0, where y(x) is the exact solution of
the initial value problem
  dy/dx = f(x, y),   x ∈ [a, b],
  y(a) = y_0.                                                                    (7.6.31)
Then, for n = 0, N,

  ‖u − y − h^p e‖_∞ = O(h^{p+1}),

where u, y, e are the grid functions u = {u_n}, y = {y(x_n)} and e = {e(x_n)}.
3. The fact that some, but not all, components of τ (x, y) may vanish identically
does not imply that the corresponding components of e(x) also vanish, since
(7.6.32) is a coupled system of differential equations.
  Φ^i(x_n, u_n; h) − Φ^i(x_n, y(x_n); h) = Σ_{j=1}^{d} Φ^i_{y^j}(x_n, y(x_n); h) (u_n^j − y^j(x_n))
      + (1/2) Σ_{j,k=1}^{d} Φ^i_{y^j y^k}(x_n, ū_n; h) (u_n^j − y^j(x_n)) (u_n^k − y^k(x_n)),        (7.6.35)
where ūn is on the line segment connecting un and y(xn ). Using Taylor’s theorem
once more, in the variable h, we can write
Now observing that un − y(xn ) = O(hp ), by virtue of Theorem 7.6.6 and using
(7.6.36) in (7.6.35), we get, again by assumption (1),
  Φ^i(x_n, u_n; h) − Φ^i(x_n, y(x_n); h) = Σ_{j=1}^{d} f^i_{y^j}(x_n, y(x_n)) (u_n^j − y^j(x_n))
      + O(h^{p+1}) + O(h^{2p}).
For the first two terms in brackets we use (7.6.37) and the definition of r in (7.6.38)
to obtain

  (1/h)(r_{n+1} − r_n) = f_y(x_n, y(x_n)) r_n + τ(x_n, y(x_n)) + O(h),   n = 0, N − 1,
  r_0 = 0.                                                                       (7.6.39)
Now, letting

  g(x, y) := f_y(x, y(x)) y + τ(x, y(x)),                                        (7.6.40)

we can interpret (7.6.39) by writing

  (R_h^{Euler,g} r)_n = ε_n,   n = 0, N − 1,   ε_n = O(h),

where R_h^{Euler,g} is the discrete residual operator (7.6.7) that goes with Euler's method
applied to e' = g(x, e), e(a) = 0. Since Euler's method is stable on [a, b] and g, being
linear in y, satisfies a uniform Lipschitz condition, we have by the stability inequality
(7.6.11)

  ‖r − e‖_∞ = O(h),

and hence, by (7.6.38),

  ‖u − y − h^p e‖_∞ = O(h^{p+1}),

as was to be shown.
(2) Φ is a method of order p ≥ 1 admitting a principal error function τ(x, y) ∈
C^1([a, b] × R^d);

(3) an estimate r(x, y; h) is available for the principal error function that satisfies

  r(x, y; h) = τ(x, y) + O(h),   h → 0,

uniformly on [a, b] × R^d;

(4) along with the grid function u = {u_n} we generate the grid function v = {v_n}
in the following manner:

  x_{n+1} = x_n + h,
  u_{n+1} = u_n + h Φ(x_n, u_n; h),
  v_{n+1} = v_n + h [f_y(x_n, v_n) v_n + r(x_n, u_n; h)],                        (7.7.2)
  x_0 = a,   u_0 = y_0,   v_0 = 0.
Then, for n = 0, N − 1,
vn+1 = vn + h(An vn + bn ),
where the A_n are bounded matrices and the b_n bounded vectors. By Lemma 7.6.4, we
have the boundedness of v_n,
vn = O(1), h → 0. (7.7.7)
Substituting (7.7.4) and (7.7.5) into the equation for vn+1 and noting (7.7.7), we
obtain
vn − e(xn ) = O(h),
e0 = g(x, e)
e(a) = 0.
Therefore, by (7.6.33)
  y_h = y + h Φ(x, y; h),
  y_{h/2} = y + (1/2) h Φ(x, y; (1/2) h),
  y*_h = y_{h/2} + (1/2) h Φ(x + (1/2) h, y_{h/2}; (1/2) h),                     (7.7.8)
  r(x, y; h) = 1/(1 − 2^{−p}) · (y_h − y*_h)/h^{p+1}.
Note that yh∗ is the result of applying Φ over two consecutive steps of length h/2
each, whereas yh is the result of one application over the whole step length h.
  (1/h)(y_h − y*_h) = (1/h)[u(x + h) − u(x)] + τ(x, y) h^p + O(h^{p+1})
      − (1/2)·(1/(h/2))·[u(x + (1/2)h) − u(x)] − (1/2) τ(x, y) ((1/2)h)^p + O(h^{p+1})
      − (1/2)·(1/(h/2))·[u(x + h) − u(x + (1/2)h)] − (1/2) [τ(x + (1/2)h, y) + O(h)] ((1/2)h)^p
      + O(h^{p+1}) = τ(x, y)(1 − 2^{−p}) h^p + O(h^{p+1}).
Consequently,

  1/(1 − 2^{−p}) · (1/h)(y_h − y*_h) = τ(x, y) h^p + O(h^{p+1}),                 (7.7.10)
as required.
Subtracting (7.7.10) from (7.7.9) shows, incidentally, that

  Φ*(x, y; h) := Φ(x, y; h) − 1/(1 − 2^{−p}) · (1/h)(y_h − y*_h)                 (7.7.11)

defines a one-step method of order p + 1.
Procedure (7.7.8) is rather expensive. For a fourth-order Runge-Kutta process,
it requires a total of 11 evaluations of f per step, almost three times the effort for a
single Runge-Kutta step. Therefore, Richardson extrapolation is normally used only
after two steps of Φ, that is, one proceeds according to

  y_h = y + h Φ(x, y; h),
  y*_{2h} = y_h + h Φ(x + h, y_h; h),                                            (7.7.12)
  y_{2h} = y + 2h Φ(x, y; 2h),
so that the expression on the left is an acceptable estimator r(x, y; h). If the two
steps in (7.7.12) yield acceptable accuracy (cf. §7.7.3), then, again for a fourth-order
Runge-Kutta process, the procedure requires only three additional evaluations of f,
since y_h and y*_{2h} would have to be computed anyhow. There are still more efficient
schemes, as we shall see.
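A MATLAB sketch of the cheaper variant (7.7.12) for the classical fourth-order Runge-Kutta method (p = 4). The test problem, the helper rk4_step, and the particular scaling of the estimate (obtained by the same reasoning that led to (7.7.10), with h replaced by 2h) are our own choices.

function demo_step_doubling
% Local error estimate via (7.7.12) for the classical RK4 method (p = 4).
  f = @(t, y) -y + t + 1;                 % test problem (our choice)
  p = 4;  x = 0;  y = 1;  h = 0.1;
  yh      = rk4_step(f, x, y, h);         % first step of length h
  ystar2h = rk4_step(f, x + h, yh, h);    % second step of length h: y*_{2h}
  y2h     = rk4_step(f, x, y, 2*h);       % one step of length 2h: y_{2h}
  r = (y2h - ystar2h) / ((1 - 2^(-p))*(2*h)^(p+1));   % estimates tau(x, y)
  fprintf('estimated |T(x,y;h)| ~ %.3e\n', abs(r)*h^p);
end

function ynext = rk4_step(f, t, y, h)     % helper: one classical RK4 step
  K1 = f(t, y);                   K2 = f(t + h/2, y + (h/2)*K1);
  K3 = f(t + h/2, y + (h/2)*K2);  K4 = f(t + h, y + h*K3);
  ynext = y + (h/6)*(K1 + 2*K2 + 2*K3 + K4);
end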
Embedded methods
The basic idea of this approach is very simple: if the given method Φ has order p,
take any other one-step method Φ* of order p* = p + 1 and define

  r(x, y; h) = (1/h^p) [Φ(x, y; h) − Φ*(x, y; h)].                               (7.7.14)
This is indeed an acceptable estimator, as follows by subtracting the two relations

  Φ(x, y; h) − (1/h)[u(x + h) − u(x)] = τ(x, y) h^p + O(h^{p+1}),
  Φ*(x, y; h) − (1/h)[u(x + h) − u(x)] = O(h^{p+1}),

and dividing the result by h^p.
The tricky part is to make this procedure efficient. Following an idea of Fehlberg,
one can try to do this by embedding one Runge-Kutta process (of order p) into another
(of order p + 1). Specifically, let Φ be some explicit r-stage Runge-Kutta method.
  K_1(x, y) = f(x, y),
  K_s(x, y; h) = f(x + μ_s h, y + h Σ_{j=1}^{s−1} λ_sj K_j),   s = 2, 3, ..., r,
  Φ(x, y; h) = Σ_{s=1}^{r} α_s K_s.
Then for Φ* choose a similar r*-stage process, with r* > r, in such a way that

  μ*_s = μ_s,   λ*_sj = λ_sj,   for s = 2, 3, ..., r.

The estimate (7.7.14) then costs only r* − r extra evaluations of f. If r* = r + 1, one
might even attempt to save the additional evaluation by selecting (if possible)

  μ*_{r*} = 1,   λ*_{r*j} = α_j   for j = 1, r* − 1   (r* = r + 1).              (7.7.15)
Then, indeed, K_{r*} will be identical with K_1 for the next step.
Pairs of such embedded (p, p + 1) Runge-Kutta formulae have been developed
in the late 1960s by E. Fehlberg. There is a considerable degree of freedom in
choosing the parameters. Fehlberg's choices were guided by an attempt to reduce the
magnitude of the coefficients of all the partial derivative aggregates that enter into
the principal error function τ(x, y) of Φ. He succeeded in obtaining pairs with the
values of the parameters p, r, r* given in Table 7.2.
  p    3   4   5   6   7   8
  r    4   5   6   8  11  15
  r*   5   6   8  10  13  17

  Table 7.2: Orders p and stage numbers r, r* of Fehlberg's embedded pairs
For the third-order process (and only for that one) one can choose the parameters
for (7.7.15) to hold.
for the truncation error, which can be used to monitor the local truncation error during
the integration process. However, one has to keep in mind that the local truncation
error is quite different from the global error, which is what one really wants to control. To get
more insight into the relationship between these two errors, we recall the following
theorem, which quantifies the continuous dependence of the solution of an initial value
problem on the initial values.
Theorem 7.7.2. Let f(x, y) be continuous in x ∈ [a, b] and satisfy a Lipschitz con-
dition uniformly on [a, b] × R^d, with Lipschitz constant L, that is,

  ‖f(x, y) − f(x, y*)‖ ≤ L ‖y − y*‖.

Then the initial value problem

  dy/dx = f(x, y),   x ∈ [a, b],
  y(c) = y_c                                                                     (7.7.17)

has a unique solution on [a, b] for any c ∈ [a, b] and for any y_c ∈ R^d. Let y(x; s)
and y(x; s*) be the solutions of (7.7.17) corresponding to y_c = s and y_c = s*,
respectively. Then, for any vector norm ‖·‖,

  ‖y(x; s) − y(x; s*)‖ ≤ e^{L|x − c|} ‖s − s*‖,   x ∈ [a, b].
“Solving the given initial value problem (7.6.31) numerically by a one-step meth-
od (not necessarily with constant step) means in reality that one follows a sequence
of “solution tracks”, whereby at each grid point xn one jumps from one track to the
next by an amount determined by the truncation error at xn ” [16] (see Figure 7.3).
This results from the definition of the truncation error, the reference solution being one of
the solution tracks. Specifically, the nth track, n = 0, N, is given by the solution of
the initial value problem
  dv_n/dx = f(x, v_n),   x ∈ [x_n, b],
  v_n(x_n) = u_n,                                                                (7.7.19)
and

  u_{n+1} = v_n(x_{n+1}) + h_n T(x_n, u_n; h_n),   n = 0, N − 1.                 (7.7.20)

Since by (7.7.19) we have u_{n+1} = v_{n+1}(x_{n+1}), we can apply Theorem 7.7.2 to the
Now

  Σ_{n=0}^{N−1} [v_{n+1}(x) − v_n(x)] = v_N(x) − v_0(x) = v_N(x) − y(x),         (7.7.22)

and since v_N(x_N) = u_N, letting x = x_N, we get from (7.7.21) and (7.7.22) that

  ‖u_N − y(x_N)‖ ≤ Σ_{n=0}^{N−1} ‖v_{n+1}(x_N) − v_n(x_N)‖
                 ≤ Σ_{n=0}^{N−1} h_n e^{L|x_N − x_{n+1}|} ‖T(x_n, u_n; h_n)‖.
If

  ‖T(x_n, u_n; h_n)‖ ≤ ε_T,   n = 0, N − 1,                                      (7.7.23)

then

  ‖u_N − y(x_N)‖ ≤ ε_T Σ_{n=0}^{N−1} (x_{n+1} − x_n) e^{L|x_N − x_{n+1}|}.

Interpreting the sum on the right as a Riemann sum for a definite integral, we finally
obtain, approximately,

  ‖u_N − y(x_N)‖ ≤ ε_T ∫_a^b e^{L(b−x)} dx = (ε_T/L) (e^{L(b−a)} − 1).
This limit value of εT would be appropriate for a quadrature problem but definitely
not for a true differential equation problem, where εT , in general, has to be chosen
considerably smaller than the target error tolerance ε.
Considerations such as these motivate the following step control mechanism:
each integration step (from x_n to x_{n+1} = x_n + h_n) consists of these parts:
1. Estimate h_n.
2. Carry out the step and compute the error estimate r(x_n, u_n; h_n).
3. Test h_n^p ‖r(x_n, u_n; h_n)‖ ≤ ε_T (cf. (7.7.16) and (7.7.23)). If the test passes,
proceed with the next step; if not, repeat the step with a smaller h_n, say, half
as large, until the test passes.
To estimate h_n, assume first that n ≥ 1, so that the estimator from the previous
step, r(x_{n−1}, u_{n−1}; h_{n−1}) (or at least its norm), is available. Then, neglecting terms
of O(h),

  ‖τ(x_{n−1}, u_{n−1})‖ ≈ ‖r(x_{n−1}, u_{n−1}; h_{n−1})‖,

and since τ(x_n, u_n) ≈ τ(x_{n−1}, u_{n−1}), likewise

  ‖τ(x_n, u_n)‖ ≈ ‖r(x_{n−1}, u_{n−1}; h_{n−1})‖.

What we want is

  ‖τ(x_n, u_n)‖ h_n^p ≈ θ ε_T,

where θ is a "safety factor", say, θ = 0.8. Eliminating τ(x_n, u_n), we find

  h_n ≈ (θ ε_T / ‖r(x_{n−1}, u_{n−1}; h_{n−1})‖)^{1/p}.

Note that from the previous step we have

  h_{n−1}^p ‖r(x_{n−1}, u_{n−1}; h_{n−1})‖ ≤ ε_T,

so that

  h_n ≥ θ^{1/p} h_{n−1},

and the tendency is to increase the step.
If n = 0, we proceed similarly, using some initial guess h_0^{(0)} of h_0 and the associated
r(x_0, y_0; h_0^{(0)}) to obtain

  h_0^{(1)} = (θ ε_T / ‖r(x_0, y_0; h_0^{(0)})‖)^{1/p}.
The process may be repeated once or twice to get the final estimate of both h_0 and
r(x_0, y_0; h_0).
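The mechanism above can be condensed into a short MATLAB sketch (ours). The routine estimator(f, x, y, h), assumed to return the tentative value y_{n+1} together with ‖r(x_n, u_n; h_n)‖, stands for any estimator of the principal error function, e.g. one built from an embedded pair as in (7.7.14).

function [xs, ys] = ode_adaptive(f, a, b, y0, h0, p, epsT, estimator)
% Variable-step driver: accept the step if h^p*norm(r) <= epsT, otherwise
% halve h; after an accepted step predict the next h from (theta*epsT/norm(r))^(1/p).
  theta = 0.8;                            % safety factor
  x = a;  y = y0;  h = h0;
  xs = x;  ys = y.';
  while x < b
    h = min(h, b - x);                    % do not step past b
    [ynext, rnorm] = estimator(f, x, y, h);
    if h^p * rnorm <= epsT                % test of part 3
      x = x + h;  y = ynext;              % accept the step
      xs(end+1, 1) = x;  ys(end+1, :) = y.';
      h = (theta*epsT / max(rnorm, realmin))^(1/p);   % next step size
    else
      h = h/2;                            % reject and retry
    end
  end
end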
For a synthetic description of variable-step Runge-Kutta methods, the Butcher table
is completed by a supplementary line used for the computation of Φ* (and thus of
r(x, y; h)). As an example, Table 7.3 is the Butcher table for a 2-3 method. For the derivation
of this table see [38, pages 451–452].
  μ_j   |  λ_ij
  0     |
  1/4   |  1/4
  27/40 | −189/800    729/800
  1     |  214/891    1/33       650/891
  ------+---------------------------------------------
  α_i   |  214/891    1/33       650/891     0
  α*_i  |  533/2106   0          800/1053   −1/78

  Table 7.3: Butcher table of a 2-3 embedded Runge-Kutta pair
Table 7.4 is the Butcher table for the Bogacki-Shampine method [5]. It is the back-
ground for the MATLAB ode23 solver.
Another important example is DOPRI5, or RK5(4)7FM, a pair of orders 4 and 5 with
7 stages (Table 7.5). This is a very efficient pair; it is the basis for the MATLAB ode45
solver, among other important solvers.
Algorithm 7.2 gives implementation hints for a variable-step Runge-Kutta method
when the Butcher table is given. ttol is the product of tol and a safety factor (0.8 or 0.9).
For applications of the numerical solution of differential equations and of other numer-
ical methods in mechanics, see [24].
  μ_j  |  λ_ij
  0    |
  1/2  |  1/2
  3/4  |  0      3/4
  1    |  2/9    3/9    4/9
  -----+------------------------------
  α_i  |  2/9    3/9    4/9    0
  α*_i |  7/24   1/4    1/3    1/8

  Table 7.4: Butcher table of the Bogacki-Shampine 2-3 pair (MATLAB ode23)
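As an illustration, here is a MATLAB sketch (ours) of one step of the pair in Table 7.4. The α row gives a third-order approximation and the α* row a second-order one; the norm of their difference divided by h estimates h^p‖r(x, y; h)‖ (cf. (7.7.14) with p = 2), the quantity compared with the tolerance in the step control above. Note the FSAL structure: the last stage is the K_1 of the next step.

function [ynext, errest, K4] = bs23_step(f, x, y, h)
% One step of the Bogacki-Shampine pair of Table 7.4 (basis of ode23).
% ynext : third-order value (row alpha);  errest : norm of the difference
% between the embedded values;  K4 : last stage, equal to K1 of the next step.
  K1 = f(x,         y);
  K2 = f(x + h/2,   y + h*(1/2)*K1);
  K3 = f(x + 3*h/4, y + h*(3/4)*K2);
  ynext = y + h*((2/9)*K1 + (1/3)*K2 + (4/9)*K3);                 % alpha row
  K4 = f(x + h, ynext);                                           % FSAL stage
  ystar = y + h*((7/24)*K1 + (1/4)*K2 + (1/3)*K3 + (1/8)*K4);     % alpha* row
  errest = norm(ynext - ystar);
end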
  μ_j  |  λ_ij
  0    |
  1/5  |  1/5
  3/10 |  3/40         9/40
  4/5  |  44/45       −56/15       32/9
  8/9  |  19372/6561  −25360/2187  64448/6561  −212/729
  1    |  9017/3168   −355/33      46732/5247   49/176      −5103/18656
  1    |  35/384       0           500/1113     125/192     −2187/6784     11/84
  -----+-----------------------------------------------------------------------------------
  α_i  |  35/384       0           500/1113     125/192     −2187/6784     11/84      0
  α*_i |  5179/57600   0           7571/16695   393/640     −92097/339200  187/2100   1/40

  Table 7.5: Butcher table of the Dormand-Prince pair DOPRI5, RK5(4)7FM (MATLAB ode45)
Bibliography
[1] Octavian Agratini, Ioana Chiorean, Gheorghe Coman, and Radu Trîmbiţaş,
Analiză numerică şi teoria aproximării, vol. III, Presa Universitară Clujeană,
2002 (coordinators D. D. Stancu and Gh. Coman).
[3] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia,
1996.
[11] James Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997.
[18] D. Goldberg, What every computer scientist should know about floating-point
arithmetic, Computing Surveys 23 (1991), no. 1, 5–48.
[20] Gene H. Golub and Charles Van Loan, Matrix Computations, 3rd ed., Johns Hop-
kins University Press, Baltimore and London, 1996.
[23] C. G. J. Jacobi, Über eine neue Auflösungsart der bei der Methode der kleinsten
Quadrate vorkommenden linearen Gleichungen, Astronomische Nachrichten
22 (1845), 9–12, Issue no. 523.
[24] Mirela Kohr and Ioan Pop, Viscous Incompressible Flow for Low Reynolds
Numbers, WIT Press, Southampton(UK) - Boston, 2004.
[26] Cleve Moler, Numerical Computing in MATLAB, SIAM, 2004, available via
www at http://www.mathworks.com/moler.
[29] I. A. Rus, Ecuaţii diferenţiale, ecuaţii integrale şi sisteme dinamice, Transilva-
nia Press, Cluj-Napoca, 1996.
Index
Runge-Kutta method
implicit ∼, 189
spline
complete, 97
Not-a-knot, 98
stability inequality, 195
theorem
Peano, 71
total pivoting, 33
transform
Householder, 39
trapezoid rule, see composite trapezoidal rule