A Concise Introduction To Numerical Analysis
A. C. Faul
University of Cambridge, UK
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not
warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® soft-
ware or related products does not constitute endorsement or sponsorship by The MathWorks of a particular
pedagogical approach or particular use of the MATLAB® software.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
Contents

List of Figures xi
Preface xiii
Acknowledgments xv
Chapter 1 Fundamentals 1
1.1 Floating Point Arithmetic 1
1.2 Overflow and Underflow 3
1.3 Absolute, Relative Error, Machine Epsilon 4
1.4 Forward and Backward Error Analysis 6
1.5 Loss of Significance 8
1.6 Robustness 11
1.7 Error Testing and Order of Convergence 12
1.8 Computational Complexity 14
1.9 Condition 15
1.10 Revision Exercises 19
2.13 Relaxation 51
2.14 Steepest Descent Method 52
2.15 Conjugate Gradients 56
2.16 Krylov Subspaces and Pre-Conditioning 59
2.17 Eigenvalues and Eigenvectors 63
2.18 The Power Method 63
2.19 Inverse Iteration 67
2.20 Deflation 69
2.21 Revision Exercises 72
Bibliography 285
Index 287
List of Figures
6.2 Stability domains for θ = 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2,
and 0.1 167
6.3 The stability domains for various Adams–Bashforth methods 171
6.4 The stability domains for various Adams–Moulton methods 172
6.5 The stability domains of various BDF methods in grey. The
instability regions are in white. 174
6.6 Butcher tableau 181
6.7 Stability domain of the methods given by (6.17) 182
6.8 The stability domain of the original Runge–Kutta method
given by (6.18) 184
6.9 Stability domain of Kutta’s and Nystrom’s method 185
6.10 Tableau of the Bogacki–Shampine method 185
6.11 Tableau of the Fehlberg method 186
6.12 Tableau of the Cash–Karp method 187
6.13 Tableau of the Dormand–Prince method 187
6.14 The 2-stage Radau and Lobatto methods 191
6.15 The 3-stage Radau methods 192
6.16 The 3-stage Lobatto methods 193
6.17 Stability domain of the method given in (6.20). The instability
region is white. 195
6.18 Fourth order elementary differentials 199
8.1 Gibbs effect when approximating the line y = x for the choices
N = 4, 8, 16, 32 259
Preface
This book has been developed from the notes of lectures on Numerical Analysis
held within the MPhil in Scientific Computing at the University of Cambridge,
UK. This course has been taught successfully since 2011. Terms at Cambridge
University are very short, only eight weeks in length. Therefore lectures are
concise, concentrating on the essential content, but also conveying the under-
lying connecting principles. On the other hand, the Cambridge Mathematical
Tripos was established in around 1790. Thus, knowledge has been distilled
literally over centuries.
I have tried to carry over this spirit into the book. Students will get a
concise, but thorough introduction to numerical analysis. In addition the al-
gorithmic principles are emphasized to encourage a deeper understanding of
why an algorithm is suitable (and sometimes unsuitable) for a particular
problem.
The book is also intended for graduate students coming from other sub-
jects, but who will use numerical techniques extensively in their graduate
studies. The intake of the MPhil in Scientific Computing is very varied. Mathe-
maticians are actually in the minority and share the classroom with physicists,
engineers, chemists, computer scientists, and others. Also some of the MPhil
students return to university after a professional career wanting to further
and deepen their knowledge of techniques they have found they are lacking in
their professional life. Because of this the book makes the difficult balance be-
tween being mathematically comprehensive, but also not overwhelming with
mathematical detail. In some places where further detail was felt to be out of
scope of this book, the reader is referred to further reading.
Techniques are illustrated by MATLAB®1 implementations. The main purpose is to show the workings of the method and thus MATLAB®'s own implementations are avoided (unless they are used as building blocks of an algorithm). In some cases the listings are printed in the book, but all are available
online at https://www.crcpress.com/A-Concise-Introduction-to-Numerical-
Analysis/Faul/9781498712187 as part of the package K25104_Downloads.zip.
Most implementations are in the form of functions returning the outcome of
the algorithm. Also examples for the use of the functions are given.
Exercises are put inline with the text where appropriate. Each chapter ends
with a selection of revision exercises which are suitable exam questions. A PDF
entitled “Solutions to Odd-Numbered Exercises for a Concise Introduction
to Numerical Analysis” can be downloaded at https://www.crcpress.com/A-
Concise-Introduction-to-Numerical-Analysis/Faul/9781498712187 as part of
the package K25104_Downloads.zip.
Students will find the index comprehensive, making it easy to find the
information they are after. Hopefully this will prove the book useful as a ref-
erence and make it an essential on the bookshelves of anybody using numerical
algorithms.
1 MATLAB and Simulink are registered trademarks of The MathWorks, Inc.

Acknowledgments
First and foremost I have to thank Dr Nikos Nikiforakis without whom this
book would not have come into being. I first contacted him in 2011 when I
was looking for work which I could balance with being a mum of three. My
intention was to supervise small groups of students in Numerical Analysis at
Cambridge University. He asked whether I could also lecture the subject for
the students undertaking the MPhil in Scientific Computing. My involvement
in the MPhil and his research group grew from there. Aside from his unwa-
vering belief in what I can accomplish, he has become one of my best friends
supporting me through difficult times.
Next thanks are also due to my PhD supervisor Professor Mike Powell who
sadly passed away in April 2014. From him I learnt that one should strive for
understanding and simplicity. Some of his advice was, “Never cite a paper
you haven’t understood.” My citation lists have been short ever since. I more
often saw him working through an algorithm with pen and paper than sitting
at a computer. He wanted to know why a particular algorithm was successful.
I now ask my students, “How can you improve on something, if you do not
understand it?”
Of course I also need to thank all the MPhil students whose questions and
quest to find typos have improved this book over several years, especially Peter
Wirnsberger, Will Drazin, and Chongli Qin. In addition, Geraint Harcombe
contributed significantly to the MATLAB® examples of this book.
I would also like to express my gratitude to Cambridge University and the
staff and fellows at Selwyn College for creating such a wonderful atmosphere
in which to learn and teach.
This book would not have been written without the support of many people
in my private life, foremost my parents Helmut and Marieluise Faul, who
instilled a love for knowledge in me. Next, my many friends of whom I would
like to mention, especially the ones helping out with childcare, and by name,
Annemarie Moore (née Clemence) who was always there when I needed help
to clear my head, Sybille Hutter for emotional support, and Karen Higgins
who is brilliant at sorting out anything practical.
A.C. Faul
CHAPTER 1
Fundamentals
If the exponents of two floating point numbers are the same, they are said
to be of the same magnitude. Let’s look at two floating point numbers of
the same magnitude which also have the same digits apart from the digit in
position p, which has index p − 1. We assume that they only differ by one in
that digit. These floating point numbers are neighbours in the representation
and differ by
$$1 \times \beta^{-(p-1)} \times \beta^{e} = \beta^{e-p+1}.$$
Thus, if the exponent is large the difference between neighbouring floating
point numbers is large, while if the exponent is small the difference between
neighbouring floating point numbers is small. This means floating point num-
bers are more dense around zero.
As an example consider the decimal number 3.141592, the first seven digits
of π. Converting this into binary, we could approximate it by
$$1.1001001000011111101011111100100011 \times 2^1 = \left(1 + 1\cdot\tfrac{1}{2} + 0\cdot\tfrac{1}{4} + 0\cdot\tfrac{1}{8} + 1\cdot\tfrac{1}{16} + 0\cdot\tfrac{1}{32} + 0\cdot\tfrac{1}{64} + \ldots\right) \times 2^1 \approx 3.14159200003.$$
With one binary digit fewer in the significand we get
$$1.100100100001111110101111110010001 \times 2^1 \approx 3.14159199991,$$
while in a hexadecimal representation ($\beta = 16$) we could approximate it by
$$3.243F5F9 \approx 3.14159199968.$$
Thus the amount of accuracy lost varies with the underlying representation.
The largest and smallest allowable exponents are denoted $e_{\max}$ and $e_{\min}$, respectively. Note that $e_{\max}$ is positive, while $e_{\min}$ is negative. Thus there are $e_{\max} - e_{\min} + 1$ possible exponents, the +1 standing for the zero exponent. Since there are $\beta^p$ possible significands, a floating-point number can be encoded in $\lceil\log_2(e_{\max} - e_{\min} + 1)\rceil + \lceil\log_2(\beta^p)\rceil + 1$ bits, where the final +1 is for the sign bit.
The storage of floating point numbers varies between machine architec-
tures. A particular computer may store the number using 24 bits for the
significand, 1 bit for the sign (of the significand), and 7 bits for the exponent
in order to store each floating-point number in 4 bytes ( 1 byte = 8 bits). This
format may be used by two different machines, but they may store the expo-
nent and significand in the opposite order. Calculations will produce the same
answer, but the internal bit pattern in each word will be different. Therefore
operations on bytes should be avoided and such code optimization left to the
compiler.
There are two reasons why a real number cannot be exactly represented
as a floating-point number. The first one is illustrated by the decimal number
0.1. Although it has a finite decimal representation, in binary it has an infinite
repeating representation. Thus in this representation it lies strictly between
two floating point numbers and is neither of them.
Another situation is that a real number is too large or too small. This is also known as being out of range. That is, its absolute value is larger than or equal to $\beta \times \beta^{e_{\max}}$ or smaller than $1.0 \times \beta^{e_{\min}}$. We use the terms overflow and underflow for these numbers. These will be discussed in the next section.
Here NaN stands for not a number and shows that the result of an operation is undefined.
Overflow is caused by any operation whose result is too large in absolute value to be represented. This can be the result of exponentiation or multiplication or division or, just possibly, addition or subtraction. It is better to highlight the occurrence of overflow with the quantity ∞ than returning the largest representable number. As an example, consider computing $\sqrt{x^2 + y^2}$, when $\beta = 10$, $p = 3$, and $e_{\max} = 100$. If $x = 3.00 \times 10^{60}$ and $y = 4.00 \times 10^{60}$, then $x^2$, $y^2$, and $x^2 + y^2$ will each overflow in turn, and be replaced by $9.99 \times 10^{100}$. So the final result will be $\sqrt{9.99 \times 10^{100}} = 3.16 \times 10^{50}$, which is drastically wrong, since the correct answer is $5.00 \times 10^{60}$.
Exercise 1.1. Write a C-routine which implements the calculation of $\sqrt{x^2 + y^2}$ in a way which avoids overflow. Consider the cases where x and y differ largely in magnitude.
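One standard way to meet the requirement of the exercise is to factor out the larger of |x| and |y| before squaring, so that only a ratio of magnitude at most one is squared. The following MATLAB sketch (rather than a C routine) illustrates the idea; the function name safe_hypot is an illustrative choice and not taken from the text.

function r = safe_hypot(x, y)
% Computes sqrt(x^2 + y^2) without overflowing in the squares
% by factoring out the larger of |x| and |y|.
a = abs(x); b = abs(y);
if a < b                      % ensure a >= b
    [a, b] = deal(b, a);
end
if a == 0                     % both arguments are zero
    r = 0;
else
    t = b / a;                % t is at most 1, so t^2 cannot overflow
    r = a * sqrt(1 + t^2);
end
end

For example, safe_hypot(3e200, 4e200) returns 5e200, although squaring either argument directly would already overflow in double precision.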
Underflow is caused by any calculation where the result is too small to be distinguished from zero. As with overflow this can be caused by different operations, although addition and subtraction are less likely to cause it. In many circumstances continuing the calculation with zero is sufficient; often it is safe to treat an underflowing value as zero. However, there are several exceptions. For example, suppose a loop terminates after some fixed time has elapsed and the calculation uses a variable time step δt which is used to update the elapsed time t by assignments of the form
$$\delta t := \delta t \times \mathrm{update}, \qquad t := t + \delta t. \tag{1.1}$$
If δt underflows to zero, t is never increased any further and the loop never terminates.
$$x^* = x(1 + \delta) = x + x\delta.$$
Thus the absolute error is $\epsilon = x^* - x = x\delta$ or, if $x \neq 0$, the relative error is $\delta = \epsilon/x$.
The absolute and relative error are zero if and only if x can be represented
exactly in the chosen floating point representation.
In floating-point arithmetic, relative error seems appropriate because each
number is represented to a similar relative accuracy. For example consider $\beta = 10$ and $p = 3$ and the numbers $x = 1.001 \times 10^3$ and $y = 1.001 \times 10^0$ with representations $x^* = 1.00 \times 10^3$ and $y^* = 1.00 \times 10^0$. For x we have an absolute error of $\epsilon_x = 0.001 \times 10^3 = 1$ and for y, $\epsilon_y = 0.001 \times 10^0 = 0.001$.
$$|\delta| = \left|\frac{\epsilon}{x}\right| \le \frac{\frac{1}{2}\beta^{e-p+1}}{1 \times \beta^{e}} = \frac{1}{2}\beta^{-p+1}.$$
Generally, the smallest number u such that |δ| ≤ u, for all x (excluding x
values very close to overflow or underflow) is called the unit round off .
On most computers $1^* = 1$. The smallest positive $\epsilon_m$ such that $(1 + \epsilon_m)^* > 1$ is called machine epsilon.
$$(x^*)^2 = x^2(1 + \delta)^2 = x^2(1 + 2\delta + \delta^2) \approx x^2(1 + 2\delta),$$
where ρ denotes the relative error in the output such that $|\rho| \le u$. As ρ is small, $1 + \rho > 0$. Thus there exists $\tilde{\rho}$ such that $(1 + \tilde{\rho})^2 = 1 + \rho$ with $|\tilde{\rho}| < |\rho| \le u$, since $(1 + \tilde{\rho})^2 = 1 + 2\tilde{\rho} + \tilde{\rho}^2 = 1 + \tilde{\rho}(2 + \tilde{\rho})$. We can now write
$$[f(x)]^* = x^2(1 + \tilde{\rho})^2 = f[x(1 + \tilde{\rho})].$$
If the backward error is small, we accept the solution, since it is the correct
solution to a nearby problem.
Another reason for the preference which is given to backward error analysis
is that often the inverse of a calculation can be performed much easier than
the calculation itself. Take for example $f(x) = \sqrt{x}$: squaring the computed result is much easier than evaluating the square root itself (see Exercise 1.4 below). As another example, consider approximating the exponential function by the truncated series
$$f(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!}.$$
The forward error is simply $f(x) - e^x$. For the backward error we need to find $x^*$ such that $e^{x^*} = f(x)$. In particular,
$$x^* = \ln(f(x)).$$
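The following MATLAB lines make this concrete by evaluating the truncated series at an arbitrary test point and printing both errors (the choice x = 0.1 is purely illustrative):

x = 0.1;                                            % sample input
f = 1 + x + x^2/factorial(2) + x^3/factorial(3);    % truncated series for exp(x)
forward_error  = f - exp(x);                        % f(x) - e^x
xstar          = log(f);                            % input for which e^(x*) equals f(x)
backward_error = xstar - x;                         % perturbation of the input
fprintf('forward error  = %g\n', forward_error);
fprintf('backward error = %g\n', backward_error);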
Next we consider how errors build up using the basic floating point op-
erations: multiplication ×, division /, and exponentiation ↑. As a practical
example, consider double precision in IBM System/370 arithmetic. Here the
value of u is approximately 0.22 × 10−15 . We simplify this by letting all num-
bers be represented with the same relative error 10−15 .
Starting with multiplication, we write
$$x_1^* = x_1(1 + \delta_1), \qquad x_2^* = x_2(1 + \delta_2).$$
Then
$$x_1^* \times x_2^* = x_1x_2(1 + \delta_1)(1 + \delta_2) = x_1x_2(1 + \delta_1 + \delta_2 + \delta_1\delta_2).$$
The term $\delta_1\delta_2$ can be neglected, since it is of magnitude $10^{-30}$ in our example. The worst case is when $\delta_1$ and $\delta_2$ have the same sign, i.e., the relative error in $x_1^* \times x_2^*$ is no worse than $|\delta_1| + |\delta_2|$. If we perform one million floating-point multiplications, then at worst the relative error will have built up to $10^6 \times 10^{-15} = 10^{-9}$.
Division can be easily analyzed in the same way by using the binomial expansion to write
$$\frac{1}{x_2^*} = \frac{1}{x_2}(1 + \delta_2)^{-1} = \frac{1}{x_2}(1 - \delta_2 + \cdots).$$
The omitted terms are of magnitude $10^{-30}$ or smaller and can be neglected. Again, the relative error in $x_1^*/x_2^*$ is no worse than $|\delta_1| + |\delta_2|$.
We can compute $x_1^* \uparrow n$ for any integer n by repeated multiplication or division. Consequently we can argue that the relative error in $x_1^* \uparrow n$ is no worse than $n|\delta_1|$.
This leaves addition + and subtraction − with which we will deal in the
next section. Here the error build-up depends on absolute accuracy, rather
than relative accuracy.
The problem is solved by the introduction of a guard digit. That is, after
the smaller number is shifted to have the same exponent as the larger number,
it is truncated to p+1 digits. Then the subtraction is performed and the result
rounded to p digits.
Theorem 1.2. If x and y are positive floating-point numbers in a format with parameters β and p, and if subtraction is done with $p+1$ digits (i.e., one guard digit), then the relative error in the result is less than $\left(\frac{\beta}{2} + 1\right)\beta^{-p}$.
The lower row gives the power of β associated with the position of the digit.
Let ŷ be y truncated to p + 1 digits. Then
From the definition of guard digit, the computed value of x−y is x− ŷ rounded
to be a floating-point number, that is, (x − ŷ) + α, where the rounding error
α satisfies
$$|\alpha| \le \frac{\beta}{2}\beta^{-p}.$$
The exact difference is $x - y$, so the absolute error is $|(x-y) - (x - \hat{y} + \alpha)| = |\hat{y} - y - \alpha|$. The relative error is the absolute error divided by the true solution,
$$\frac{|\hat{y} - y - \alpha|}{x - y}.$$
We now find a bound for the relative error. If x − y ≥ 1, then the relative
error is bounded by
1.6 Robustness
An algorithm is described as robust if for any valid data input which is rea-
sonably representable, it completes successfully. This is often achieved at the
expense of time. Robustness is best illustrated by example. We consider the
quadratic equation ax2 + bx + c = 0. Solving it seems to be a very elementary
problem. Since it is often part of a larger calculation, it is important that it
is implemented in a robust way, meaning that it will not fail and give reason-
ably accurate answers for any coefficients a, b and c which are not too close
to overflow or underflow. The standard formula for the two roots is
$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$
A problem arises if b2 is much larger than |4ac|. In the worst case the difference
in the magnitudes of b2 and |4ac| is so large that b2 − 4ac evaluates to b2 and
the square root evaluates to b, and one of the calculated roots lies at zero.
Even if the difference in magnitude is not that large, one root is still small.
Without loss of generality we assume b > 0 and the small root is given by
$$x = \frac{-b + \sqrt{b^2 - 4ac}}{2a}. \tag{1.3}$$
We note that there is loss of significance in the numerator. As we have seen
before, this can lead to a large relative error in the result compared to the
relative error in the input. The problem can be averted by manipulating the
formula in the following way:
$$x = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \times \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{b^2 - (b^2 - 4ac)}{2a(-b - \sqrt{b^2 - 4ac})} = \frac{-2c}{b + \sqrt{b^2 - 4ac}}. \tag{1.4}$$
Now quantities of similar size are added instead of subtracted.
Taking $a = 1$, $b = 100000$ and $c = 1$ and as accuracy $2 \times 10^{-10}$, Equation (1.3) gives $x = -1.525878906 \times 10^{-5}$ for the smaller root while (1.4) gives $x = -1.000000000 \times 10^{-5}$, which is the best this accuracy allows.
In general, adequate analysis has to be conducted to find cases where
numerical difficulties will be encountered, and a robust algorithm must use an
appropriate method in each case.
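The effect is easy to reproduce in MATLAB; the sketch below evaluates both formulas in single precision, which forces the same kind of cancellation as the limited decimal accuracy used in the text (the numerical values therefore differ from those above):

a = single(1); b = single(100000); c = single(1);
d = sqrt(b^2 - 4*a*c);        % b^2 - 4ac evaluates to b^2 in single precision
x_naive  = (-b + d) / (2*a);  % formula (1.3): catastrophic cancellation, gives 0
x_robust = -2*c / (b + d);    % formula (1.4): adds quantities of similar size
fprintf('naive root:  %g\n', x_naive);   % prints 0
fprintf('robust root: %g\n', x_robust);  % prints approximately -1e-5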
$$|\epsilon_{n+1}| \approx C|\epsilon_n|^p$$
for sufficiently large n. Thus the error in the next iteration step is approxi-
mately the pth power of the current iteration error times a constant C. For
p = 1 we need C < 1 to have a reduction in error. This is not necessary
for p > 1, because as long as the current error is less than one, taking any
power greater or equal to one leads to a reduction. For p = 0, the O-notation
becomes O(1), which signifies that something remains constant. Of course in
this case there is no convergence.
The following categorization for various rates of convergence is in use:
1. p = 1: linear convergence. Each iteration reduces the error by roughly the same factor C. This is generally regarded as being too slow for practical methods.
1.9 Condition
The condition of a problem is inherent to the problem whichever algorithm is
used to solve it. The condition number of a numerical problem measures the
asymptotically worst case of how much the outcome can change in proportion
to small perturbations in the input data. A problem with a low condition
number is said to be well-conditioned, while a problem with a high condition
number is said to be ill-conditioned. The condition number is a property of
the problem and not of the different algorithms that can be used to solve the
problem.
As an example consider the problem of finding where a graph crosses the x-axis, i.e., the line y = 0. Naively one could draw the graph and measure the coordinates of the crossover points. Figure 1.1 illustrates two cases. In the left-hand problem it would be easier to measure the crossover points, while in the right-hand problem the crossover points can only be located within a whole region of candidates. A better (or worse) algorithm would be to use a higher (or lower) resolution. In the chapter on non-linear systems we will encounter many methods to find the roots of a function, that is the points where the graph of a function crosses the line y = 0.
Exercise 1.4. Perform a forward and backward error analysis for $f(x) = \sqrt{x}$. You should find that the relative error is reduced by half in the process.
However, when $f(x) = \frac{1}{1-x}$ we get
$$K(x) = \left|\frac{xf'(x)}{f(x)}\right| = \left|\frac{x\,[1/(1-x)^2]}{1/(1-x)}\right| = \left|\frac{x}{1-x}\right|.$$
Thus for values of x close to 1, $K(x)$ can get large. For example if $x = 1.000001$, then $K(1.000001) = 1.000001 \times 10^6$ and thus the relative error will increase by a factor of about $10^6$.
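A quick numerical check of this magnification factor in MATLAB; the sample point and the perturbation size of 1e-8 are arbitrary illustrative choices:

f = @(x) 1./(1 - x);
x  = 1.000001;                 % point close to 1
dx = 1e-8;                     % relative perturbation of the input
xp = x*(1 + dx);
rel_out = abs((f(xp) - f(x))/f(x));      % relative change of the output
K_est   = rel_out/dx;                    % observed magnification factor
K_exact = abs(x/(1 - x));                % condition number derived above
fprintf('estimated K = %g, exact K = %g\n', K_est, K_exact);

Both printed values are close to 1e6, confirming that a relative input error is magnified about a million times.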
Exercise 1.5. Examine the condition of evaluating $\cos x$.
where k · k denotes any vector norm. We will see below how different norms
lead to different condition numbers.
Assuming that A is invertible, then x is given by $A^{-1}b$ and the perturbation in x is $A^{-1}e$. Thus the relative change in x is $\|A^{-1}e\|/\|x\|$. Hence the condition number is
$$K(A) = \max_{x,e\neq 0} \frac{\|A^{-1}e\|}{\|x\|} \Big/ \frac{\|e\|}{\|Ax\|} = \max_{e\neq 0}\frac{\|A^{-1}e\|}{\|e\|} \times \max_{x\neq 0}\frac{\|Ax\|}{\|x\|}.$$
We see that the calculation of the condition number involves the definition of
the matrix norm. Thus the condition number of an invertible matrix is $K(A) = \|A^{-1}\|\,\|A\|$.

1. If $\|\cdot\|$ is the Euclidean norm defined as
$$\|x\|_2 := \left(\sum_{i=1}^{n} x_i^2\right)^{1/2},$$
then
$$K_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)},$$
where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the maximal and minimal singular values of A, respectively. (The singular values of a matrix A are the square roots of the eigenvalues of the matrix $A^*A$, where $A^*$ denotes the complex conjugate transpose of A.) In particular, A is called a normal matrix if $A^*A = AA^*$. In this case
$$K_2(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)},$$
where $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ are the maximum and minimum modulus eigenvalues of A. If A is unitary, i.e., multiplying the matrix with its conjugate transpose results in the identity matrix, then $K_2(A) = 1$.
2. If $\|\cdot\|$ is the ∞-norm defined as
Now assuming the error e on the right-hand side is aligned with the eigenvector which belongs to the smaller eigenvalue, then this is multiplied by a factor of $|1/\lambda_2| = 1/4.8$. On the other hand a small error in the solution which is aligned with the eigenvector belonging to the larger eigenvalue takes us away from the right-hand side b by a factor of 412.8. This is the worst-case scenario and gives
$$K(A) = \frac{\lambda_1}{\lambda_2} \approx \frac{412.8}{4.8}.$$
Exercise 1.7. Let
$$A = \begin{pmatrix} 1000 & 999 \\ 999 & 998 \end{pmatrix}.$$
Calculate $A^{-1}$, the eigenvalues and eigenvectors of A, $K_2(A)$, and $K_\infty(A)$. What is special about the vectors $\begin{pmatrix}1\\1\end{pmatrix}$ and $\begin{pmatrix}-1\\1\end{pmatrix}$?
If n is the size of the matrix, then the condition number is of order $O((1+\sqrt{2})^{4n}/\sqrt{n})$.
Exercise 1.9. (a) Define absolute error, relative error, and state their rela-
tionship.
(b) Show how the relative error builds up in multiplication and division.
(c) Explain forward and backward error analysis using the example of approx-
imating
cos x ≈ f (x) = 1 − x2 /2.
(g) How are the numbers 2k for positive and negative k represented?
Exercise 1.10. (a) Define absolute error, relative error, machine epsilon,
and unit round-off.
(b) Although machine epsilon is defined in terms of absolute error, which
assumption makes it useful as a measurement of relative error?
(c) What does it mean if a floating point number is said to be normalized?
What is the hidden bit and how is it used?
(d) What does NaN stand for? Give an example of an operation which yields
an NaN value.
(e) Given $e_{\max} = 127$ and $e_{\min} = -126$, one bit for the sign and 23 bits for the significand, show the bit pattern representing each of the following numbers. State the sign, the exponent, and the significand. You may use 0 . . . 0 to represent a long string of zeros.
0
−∞
−1.0
1.0 + machine epsilon
4.0
4.0 + machine epsilon
NaN
$x^*$, the smallest representable number greater than $2^{16}$

In the last case, what is the value of the least significant bit in the significand of $x^*$ and what is the relative error if rounding errors cause $x = 2^{16}$ to become $x^*$?
Exercise 1.11. (a) Define absolute error, relative error, and state their re-
lationship.
(b) Explain absolute error test and relative error test and give examples of
circumstances when they are unsuitable. What is a mixed error test?
(c) Explain loss of significance.
(d) Let $x_1 = 3.0001$ be the true value approximated by $x_1^* = 3.0001 + 10^{-5}$ and $x_2 = -3.0000$ be the true value approximated by $x_2^* = -3.0000 + 10^{-5}$. State the absolute and relative errors in $x_1^*$ and $x_2^*$. Calculate the absolute error and relative error in approximating $x_1 + x_2$ by $x_1^* + x_2^*$. How many times bigger is the relative error in the sum compared to the relative error in $x_1^*$ and $x_2^*$?
(e) Let
$$f(x) = x - \sqrt{x^2 + 1}, \qquad x \ge 0. \tag{1.10}$$
Explain when and why loss of significance occurs in the evaluation of f .
(f ) Derive an alternative formula for evaluating f which avoids loss of sig-
nificance.
(g) Test your alternative by considering a decimal precision p = 16 and $x = 10^8$. What answer does your alternative formula give compared to the original formula?
(h) Explain condition and condition number in general terms.
(i) Derive the condition number for evaluating a differentiable function f at
a point x, i.e., calculating f (x).
(j) Considering f (x) as defined in (1.10), find the smallest interval in
which the condition number lies. Is the problem well-conditioned or ill-
conditioned?
CHAPTER 2
Linear Systems
$$Ax = b, \tag{2.1}$$
The solution of (2.1) is not unique if there are nonzero vectors v with
$$Av = 0.$$
These vectors lie in the null space of A, that is the space of all vectors mapped to zero when multiplied by A. If x is a solution of (2.1) then so is $x + v$, since $A(x + v) = Ax + Av = b + 0 = b$.
If any of the diagonal elements $a_{i,i}$ of a triangular matrix is zero, then the matrix is singular and there is no unique solution. There might be no solution or infinitely many.
Considering the upper triangular form, the solution is obtained by the
back substitution algorithm. The last equation contains only one unknown xn ,
which can be calculated by
$$x_n = \frac{b_n}{a_{n,n}}.$$
Having determined xn , the second to last equation contains only the unknown
xn−1 , which can then be calculated and so on. The algorithm can be summa-
rized as
$$x_i = \frac{b_i - \sum_{j=i+1}^{n} a_{i,j}x_j}{a_{i,i}}, \qquad i = n, n-1, \ldots, 1.$$
For the lower triangular form the forward substitution algorithm of similar
form is used. Here is an implementation in MATLAB.
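The original listing is not reproduced in this excerpt. A minimal sketch of forward substitution for a lower triangular system Lx = b, mirroring the back substitution formula above, could look as follows (the function name is illustrative, not necessarily the one used in the book's listing):

function x = forward_substitution(L, b)
% Solves L*x = b for a nonsingular lower triangular matrix L
% by computing the components x(1), x(2), ..., x(n) in turn.
n = length(b);
x = zeros(n, 1);
for i = 1:n
    x(i) = (b(i) - L(i, 1:i-1)*x(1:i-1)) / L(i, i);
end
end

Back substitution is obtained by running the loop from i = n down to 1 and summing over the trailing columns instead.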
which we convert into upper triangular form to use back substitution. The
first equation cannot form the first row of the upper triangle because its first
coefficient is zero. Both the second and third row are suitable since their first
element is 1. However, we also need a non-zero element in the (2, 2) position
and therefore the first step is to exchange the first two equations, hence
$$\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 \\ 4 \end{pmatrix}.$$
After subtracting the new first and the second equation from the third, we
arrive at
$$\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & -2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix}.$$
Back substitution then gives $x_1 = \frac{5}{2}$, $x_2 = \frac{3}{2}$, and $x_3 = -\frac{1}{2}$.
This trivial example shows that a subroutine to deal with the general case
is much less straightforward, since any number of coefficients may be zero at
any step.
The second example shows that not only zero coefficients cause problems.
Consider the pair of equations
$$\begin{pmatrix} 0.0002 & 1.044 \\ 0.2302 & -1.624 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1.046 \\ 0.678 \end{pmatrix}.$$
the modulus of the largest coefficient of each. Then the pivotal equation is
chosen as the one with the largest (scaled) coefficient of xk and moved into
the k th row. In total (or complete or maximal ) pivoting the pivotal equation
and pivotal variable are selected by choosing the largest (unscaled) coefficient
of any of the remaining variables. This is moved into the (k, k) position. This
can involve the exchange of columns as well as rows. If columns are exchanged
the order of the unknowns needs to be adjusted accordingly. Total pivoting
is more expensive, but for certain systems it may be required for acceptable
accuracy.
2.3 LU Factorization
Another possibility to solve a linear system is to factorize A into a lower triangular matrix L (i.e., $L_{i,j} = 0$ for $i < j$) and an upper triangular matrix U (i.e., $U_{i,j} = 0$ for $i > j$), that is, $A = LU$. This is called LU factorization. The linear system then becomes $L(Ux) = b$, which we decompose into $Ly = b$ and $Ux = y$. These two systems can be solved easily by forward and back substitution, respectively.
Other applications of the LU factorization are
1. Calculation of determinant:
$$\det A = (\det L)(\det U) = \left(\prod_{k=1}^{n} L_{k,k}\right)\left(\prod_{k=1}^{n} U_{k,k}\right).$$
The first row and column of A1 are zero and it follows that uT2 is the second
row of A1 , while l2 is the second column of A1 scaled so that L2,2 = 1.
To formalize the LU algorithm, set A0 := A. For all k = 1, . . . , n set uTk to
the k th row of Ak−1 and lk to the k th column of Ak−1 , scaled so that Lk,k = 1.
Further calculate Ak := Ak−1 − lk uTk before incrementing k by 1.
Exercise 2.2. Calculate the LU factorization of the matrix
$$A = \begin{pmatrix} 8 & 6 & -2 & 1 \\ 8 & 8 & -3 & 0 \\ -2 & 2 & -2 & 1 \\ 4 & 3 & -2 & 5 \end{pmatrix},$$
where all the diagonal elements of L are 1. Choose one of these factorizations to find the solution to $Ax = b$ for $b^T = (-2\ 0\ 2\ -1)$.
All elements of the first k rows and columns of $A_k$ are zero. Therefore we can use the storage of the original A to accumulate L and U. The full LU accumulation requires $O(n^3)$ operations.
Looking closer at the equation Ak = Ak−1 − lk uTk , we see that the j th row
of Ak is the j th row of Ak−1 minus Lj,k times uTk which is the k th row of Ak−1 .
This is an elementary row operation. Thus calculating Ak = Ak−1 − lk uTk is
equivalent to performing n−k row operations on the last n−k rows. Moreover,
the elements of lk which are the multipliers Lk,k , Lk+1,k , . . . , Ln,k are chosen
so that the k th column of Ak is zero. Hence we see that the LU factorization
is analogous to Gaussian elimination for solving Ax = b. The main difference
however is that the LU factorization does not consider b until the end. This
is particularly useful when there are many different vectors b, some of which
might not be known at the outset. For each different b, Gaussian elimination
would require O(n3 ) operations, whereas with LU factorization O(n3 ) opera-
tions are necessary for the initial factorization, but then the solution for each
b requires only O(n2 ) operations.
The algorithm can be rewritten in terms of the elements of A, L, and U. Since
$$A_{i,j} = \sum_{a=1}^{n}\left(l_a u_a^T\right)_{i,j} = \sum_{a=1}^{n} L_{i,a}U_{a,j},$$
and since $U_{a,j} = 0$ for $a > j$ and $L_{i,a} = 0$ for $a > i$, we have
$$A_{i,j} = \sum_{a=1}^{\min(i,j)} L_{i,a}U_{a,j}.$$
At the kth step the elements $L_{i,a}$ are known for $a < i$ and the elements $U_{a,j}$ are known for $a < j$. For $i = j$,
$$A_{i,i} = \sum_{a=1}^{i} L_{i,a}U_{a,i},$$
and since $L_{i,i} = 1$ this determines $U_{i,i} = A_{i,i} - \sum_{a=1}^{i-1} L_{i,a}U_{a,i}$.
If Ui,i = 0, then U has a zero on the diagonal and thus is singular, since U
is upper triangular. The matrix U inherits the rank of A, while L is always
non-singular, since its diagonal consists entirely of 1.
For $j > i$, $A_{i,j}$ lies above the diagonal and
$$A_{i,j} = \sum_{a=1}^{i} L_{i,a}U_{a,j} = \sum_{a=1}^{i-1} L_{i,a}U_{a,j} + U_{i,j},$$
which gives $U_{i,j} = A_{i,j} - \sum_{a=1}^{i-1} L_{i,a}U_{a,j}$. Similarly, for $j < i$, $A_{i,j}$ lies to the left of the diagonal and
$$A_{i,j} = \sum_{a=1}^{j} L_{i,a}U_{a,j} = \sum_{a=1}^{j-1} L_{i,a}U_{a,j} + L_{i,j}U_{j,j},$$
which gives
$$L_{i,j} = \frac{A_{i,j} - \sum_{a=1}^{j-1} L_{i,a}U_{a,j}}{U_{j,j}}.$$
Note that the last formula is only valid when $U_{j,j}$ is non-zero, that is, when A is non-singular. If A is singular, other strategies are necessary, such as pivoting, which is described below.
We have already seen the equivalence of LU factorization and Gaussian
elimination. Therefore the concept of pivoting also exists for LU factorization.
It is necessary for such cases as when, for example, A1,1 = 0. In this case
we need to exchange rows of A to be able to proceed with the factorization.
Specifically, pivoting means that having obtained Ak−1 , we exchange two rows
of Ak−1 so that the element of largest magnitude in the k th column is in the
pivotal position (k, k), i.e.,
function [L,U]=LU(A)
% Computes the LU factorisation
% A input argument, square matrix
% L square matrix of the same dimensions as A containing the lower
% triangular portion of the LU factorisation
% U square matrix of the same dimensions as A containing the upper
% triangular portion of the LU factorisation
% (no pivoting; assumes every pivot U(k,k) is nonzero)
n = size(A,1);
L = eye(n);
U = zeros(n);
for k = 1:n
    U(k,k:n) = A(k,k:n) - L(k,1:k-1)*U(1:k-1,k:n);
    L(k+1:n,k) = (A(k+1:n,k) - L(k+1:n,1:k-1)*U(1:k-1,k)) / U(k,k);
end
end
There has been much research on how best to exploit sparsity with the
help of graph theory. However, this is beyond this course. The above theorem
can also be applied to banded matrices.
Definition 2.1. The matrix A is a banded matrix if there exists an integer
r < n such that Ai,j = 0 for |i − j| > r, i, j = 1, . . . , n. In other words, all
the nonzero elements of A reside in a band of width 2r + 1 along the main
diagonal.
For banded matrices, the factorization $A = LU$ implies that $L_{i,j} = U_{i,j} = 0$ for $|i-j| > r$ and the banded structure is inherited by both L and U.
In general, the expense of calculating an LU factorization of an $n \times n$ dense matrix A is $O(n^3)$ operations and the expense of solving $Ax = b$ using that factorization is $O(n^2)$. However, in the case of a banded matrix, we need just $O(r^2n)$ operations to factorize and $O(rn)$ operations to solve the linear system. If r is a lot smaller than n, then this is a substantial saving.
$$\begin{pmatrix} 4 & 1 \\ 1 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \frac{1}{4} & 1 \end{pmatrix}\begin{pmatrix} 4 & 0 \\ 0 & \frac{15}{4} \end{pmatrix}\begin{pmatrix} 1 & \frac{1}{4} \\ 0 & 1 \end{pmatrix}.$$
Since the first $k-1$ rows and columns of $A_{k-1}$ and the last $n-k$ components of x vanish and $x_k = 1$, we have
$$(A_{k-1})_{k,k} = x^TA_{k-1}x = x^T\Big(A - \sum_{j=1}^{k-1} D_{j,j}\,l_jl_j^T\Big)x = x^TAx - \sum_{j=1}^{k-1} D_{j,j}(l_j^Tx)^2 = x^TAx > 0.$$
Since the diagonal elements of D are positive, we can write
$$A = (LD^{1/2})(D^{1/2}L^T) = (LD^{1/2})(LD^{1/2})^T.$$
Letting $\tilde{L} := LD^{1/2}$, $A = \tilde{L}\tilde{L}^T$ is known as the Cholesky factorization.
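For illustration, a compact MATLAB sketch of the Cholesky factorization computed column by column, assuming A is symmetric and positive definite (this is a sketch, not the book's own listing):

function Lt = cholesky(A)
% Computes the lower triangular matrix Lt with A = Lt*Lt'
% for a symmetric positive definite matrix A.
n = size(A, 1);
Lt = zeros(n);
for j = 1:n
    % diagonal entry: remove contributions of earlier columns
    Lt(j, j) = sqrt(A(j, j) - Lt(j, 1:j-1)*Lt(j, 1:j-1)');
    % entries below the diagonal of column j
    Lt(j+1:n, j) = (A(j+1:n, j) - Lt(j+1:n, 1:j-1)*Lt(j, 1:j-1)') / Lt(j, j);
end
end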
Exercise 2.6. Calculate the Cholesky factorization of the matrix
$$\begin{pmatrix}
1 & 1 & 0 & \cdots & \cdots & 0 \\
1 & 2 & 1 & \ddots & & \vdots \\
0 & 1 & 2 & 1 & \ddots & \vdots \\
\vdots & \ddots & 1 & 3 & 1 & 0 \\
\vdots & & \ddots & 1 & 3 & 1 \\
0 & \cdots & \cdots & 0 & 1 & \lambda
\end{pmatrix}$$
Deduce from the factorization the value of λ which makes the matrix singular.
2.5 QR Factorization
In the following we examine another way to factorize a matrix. However, first
we need to recall a few concepts.
For all $x, y \in \mathbb{R}^n$, the scalar product is defined by
$$\langle x, y\rangle = \langle y, x\rangle = \sum_{i=1}^{n} x_iy_i = x^Ty = y^Tx.$$
The Euclidean norm is then given by
$$\|x\| = \left(\sum_{i=1}^{n} x_i^2\right)^{1/2} = \langle x, x\rangle^{1/2} \ge 0,$$
and two vectors x and y are called orthogonal if
$$\langle x, y\rangle = 0.$$
we can find a row of Q where the sum of the squares is less than 1, in other words there exists $l \in \{1, \ldots, n\}$ such that $\sum_{j=1}^{m} Q_{l,j}^2 < 1$. Let $e_l$ denote the lth unit vector. Then $Q_{l,j} = \langle q_j, e_l\rangle$. Further, set $w = e_l - \sum_{j=1}^{m}\langle q_j, e_l\rangle q_j$. The scalar product of w with each $q_i$ for $i = 1, \ldots, m$ is then
$$\langle q_i, w\rangle = \langle q_i, e_l\rangle - \sum_{j=1}^{m}\langle q_j, e_l\rangle\langle q_i, q_j\rangle = 0.$$
We see that
$$a_k = \sum_{i=1}^{k} R_{i,k}q_i, \qquad k = 1, \ldots, n. \tag{2.2}$$
1. Set $k := 0$, $j := 0$.

2. Increase j by 1. If $k = 0$, set $w := a_j$, otherwise (i.e., when $k \ge 1$) set $w := a_j - \sum_{i=1}^{k}\langle q_i, a_j\rangle q_i$. By this construction w is orthogonal to $q_1, \ldots, q_k$.

3. If $w = 0$, then $a_j$ lies in the space spanned by $q_1, \ldots, q_k$ or is zero. If $a_j$ is zero, set the jth column of R to zero. Otherwise, set $R_{i,j} := \langle q_i, a_j\rangle$ for $i = 1, \ldots, k$ and $R_{i,j} := 0$ for $i = k+1, \ldots, n$. Note that in this case no new column of Q is constructed. If $w \neq 0$, increase k by one and set $q_k := w/\|w\|$, $R_{i,j} := \langle q_i, a_j\rangle$ for $i = 1, \ldots, k-1$, $R_{k,j} := \|w\|$ and $R_{i,j} := 0$ for $i = k+1, \ldots, n$. By this construction, each column of Q has unit length and $a_j = \sum_{i=1}^{k} R_{i,j}q_i$ as required and R is upper triangular, since $k \le j$.

4. Terminate if $j = m$, otherwise go to 2.
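For the common case where the columns of A are linearly independent, so that k is increased in every pass through step 3, the procedure reduces to the following MATLAB sketch of classical Gram–Schmidt (the rank-deficient bookkeeping of step 3 is omitted; the function name is illustrative):

function [Q, R] = gram_schmidt_qr(A)
% QR factorisation of an n-by-m matrix A with linearly independent
% columns by the Gram-Schmidt procedure: A = Q*R with orthonormal
% columns in Q and upper triangular R.
[n, m] = size(A);
Q = zeros(n, m);
R = zeros(m, m);
for j = 1:m
    w = A(:, j);
    for i = 1:j-1
        R(i, j) = Q(:, i)' * A(:, j);   % R(i,j) = <q_i, a_j>
        w = w - R(i, j) * Q(:, i);      % subtract the component along q_i
    end
    R(j, j) = norm(w);
    Q(:, j) = w / R(j, j);
end
end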
$$\Omega^{[p,q]}_{p,p} = \Omega^{[p,q]}_{q,q} = \cos\theta, \qquad \Omega^{[p,q]}_{p,q} = \sin\theta, \qquad \Omega^{[p,q]}_{q,p} = -\sin\theta$$
Note that if $A_{q,j} = 0$, $A_{i,j}$ is already zero, since $i = q$. In this case $\cos\theta = 1$ and $\sin\theta = 0$ and $\Omega^{[p,q]}$ is the identity matrix. On the other hand, if $A_{p,j} = 0$, then $\cos\theta = 0$ and $\sin\theta = 1$ and $\Omega^{[p,q]}$ is the permutation matrix which swaps the pth and qth rows to bring the already existing zero in the desired position. Let $A_{p,j} \neq 0$ and $A_{q,j} \neq 0$. Considering the qth row of $\Omega^{[p,q]}A$, we see
Since $(\Omega^{[1,2]}A)_{3,1}$ is already zero, we do not need $\Omega^{[1,3]}$ to introduce a zero there. Next we pick $\Omega^{[2,3]}$ such that $(\Omega^{[2,3]}\Omega^{[1,2]}A)_{3,2} = 0$. Thus
$$\Omega^{[2,3]} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{2}{3}\sqrt{2} & \frac{1}{3} \\ 0 & -\frac{1}{3} & \frac{2}{3}\sqrt{2} \end{pmatrix}.$$
The final matrix A is the required matrix R. Since the number of leading
zeros increases strictly monotonically until all rows of R are zero, R is in
standard form.
The number of rotations needed is at most the number of elements below the diagonal, which is
$$(n-1) + (n-2) + \cdots + (n-m) = mn - \sum_{j=1}^{m} j = mn - \frac{m(m+1)}{2} = O(mn).$$
(For $m = n$ this becomes $n(n-1)/2 = O(n^2)$.) Each rotation replaces two rows by their linear combinations, which requires $O(m)$ operations. Hence the total cost is $O(m^2n)$.
When solving a linear system, we multiply the right-hand side by the same
rotations. The cost for this is O(mn), since for each rotation two elements of
the vector are combined.
However, if the orthogonal matrix Q is required explicitly, we begin by
letting Ω be the m × m unit matrix. Each time A is pre-multiplied by Ω[p,q] ,
Ω is also pre-multiplied by the same rotation. Thus the final Ω is the product
of all rotations and we have $Q = \Omega^T$. The cost for obtaining Q explicitly is $O(m^2n)$, since in this case the rows have length m.
In the next section we encounter another class of orthogonal matrices.
$$I - 2\frac{uu^T}{\|u\|^2}$$
is called a Householder reflection.

A Householder reflection describes a reflection about a hyperplane which is orthogonal to the vector $u/\|u\|$ and which contains the origin. Each such matrix is symmetric and orthogonal, since
$$\left(I - 2\frac{uu^T}{\|u\|^2}\right)^T\left(I - 2\frac{uu^T}{\|u\|^2}\right) = \left(I - 2\frac{uu^T}{\|u\|^2}\right)^2 = I - 4\frac{uu^T}{\|u\|^2} + 4\frac{u(u^Tu)u^T}{\|u\|^4} = I.$$
We can use Householder reflections instead of Givens rotations to calculate a QR factorization.
We can use Householder reflections instead of Givens rotations to calculate a
QR factorization.
With each multiplication of an n×m matrix A by a Householder reflection
we want to introduce zeros under the diagonal in an entire column. To start
with we construct a reflection which transforms the first nonzero column a ∈
Rn of A into a multiple of the first unit vector e1 . In other words we want to
choose $u \in \mathbb{R}^n$ such that the last $n-1$ entries of
$$\left(I - 2\frac{uu^T}{\|u\|^2}\right)a = a - 2\frac{u^Ta}{\|u\|^2}u \tag{2.3}$$
vanish. Since we are free to choose the length of u, we normalize it such that $\|u\|^2 = 2u^Ta$, which is possible since $a \neq 0$. The right side of Equation (2.3) then simplifies to $a - u$ and we have $u_i = a_i$ for $i = 2, \ldots, n$. Using this we can rewrite the normalization as
$$2u_1a_1 + 2\sum_{i=2}^{n} a_i^2 = u_1^2 + \sum_{i=2}^{n} a_i^2.$$
to transform the kth column such that the first $k-1$ columns remain in this form. Therefore we let the first $k-1$ entries of u be zero. With this choice the first $k-1$ rows and columns of the outer product $uu^T$ are zero and the top left $(k-1)\times(k-1)$ submatrix of the Householder reflection is the identity matrix. Let $u_k = a_k \pm \sqrt{\sum_{i=k}^{n} a_i^2}$ and $u_i = a_i$ for $i = k+1, \ldots, n$. This introduces zeros below the diagonal in the kth column.
The end result after processing all columns of A in sequence is an upper
triangular matrix R in standard form.
For large n no explicit matrix multiplications are executed. Instead we use
$$\left(I - 2\frac{uu^T}{\|u\|^2}\right)A = A - 2\frac{u(u^TA)}{\|u\|^2}.$$
So we first calculate $w^T := u^TA$ and then $A - \frac{2}{\|u\|^2}uw^T$.
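Assembled into a routine, a Householder QR reduction might look like the following MATLAB sketch. It uses the general reflection formula with 2/(u'*u) and chooses the sign of u_k to avoid cancellation; the function name and output convention are illustrative assumptions, not the book's listing.

function [R, U] = householder_qr(A)
% Reduces the n-by-m matrix A (n >= m) to upper triangular form R by
% Householder reflections I - 2*u*u'/norm(u)^2 applied from the left.
% The reflection vectors are returned as the columns of U.
[n, m] = size(A);
U = zeros(n, m);
for k = 1:m
    a = A(k:n, k);
    s = norm(a);
    if a(1) < 0, s = -s; end          % sign chosen to avoid cancellation
    u = a; u(1) = a(1) + s;           % remaining entries u_i = a_i
    if norm(u) > 0
        w = u' * A(k:n, k:m);                          % w' = u' * A
        A(k:n, k:m) = A(k:n, k:m) - (2/(u'*u)) * u * w;
    end
    U(k:n, k) = u;
end
R = A;
end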
$$A = U_1D^{1/2}V_1^T,$$
$$U = (U_1\ U_2)$$
$$A = USV^T.$$
$$SV^Tx = U^Tb.$$
For overdetermined systems $n > m$, consider the linear least squares problem, which we can rewrite using the orthogonality of U and V.
What are the necessary and sufficient conditions for convergence? Suppose that A is non-singular so that the system has a unique solution $x^*$. Since $x^*$ solves
Ax = b, it also satisfies (A − B)x∗ = −Bx∗ + b. Subtracting this equation
from (2.4) gives
(A − B)(x(k+1) − x∗ ) = −B(x(k) − x∗ ).
Theorem 2.6. We have limk→∞ x(k) = x∗ for all x(0) ∈ Rn if and only if
ρ(H) < 1.
Proof. For the first direction we assume $\rho(H) \ge 1$. Let λ be an eigenvalue of H such that $|\lambda| = \rho(H)$ and let v be the corresponding eigenvector, that is $Hv = \lambda v$. If v is real, we let $x^{(0)} = x^* + v$, hence $e^{(0)} = v$. It follows by induction that $e^{(k)} = \lambda^kv$. This cannot tend to zero since $|\lambda| \ge 1$.
If λ is complex, then v is complex. In this case we have a complex pair of eigenvalues, λ and its complex conjugate $\bar{\lambda}$, which has $\bar{v}$ as its eigenvector. The vectors v and $\bar{v}$ are linearly independent. We let $x^{(0)} = x^* + v + \bar{v} \in \mathbb{R}^n$, hence $e^{(0)} = v + \bar{v} \in \mathbb{R}^n$. Again by induction we have $e^{(k)} = \lambda^kv + \bar{\lambda}^k\bar{v} \in \mathbb{R}^n$. Now
$$\|e^{(k)}\| = \|\lambda^kv + \bar{\lambda}^k\bar{v}\| = |\lambda^k|\,\|e^{i\theta_k}v + e^{-i\theta_k}\bar{v}\|,$$
where we have changed to polar coordinates. Now $\theta_k$ lies in the closed interval $[-\pi, \pi]$ for all $k = 0, 1, \ldots$. The function in $\theta \in [-\pi, \pi]$ given by $\|e^{i\theta}v + e^{-i\theta}\bar{v}\|$ is continuous and has a minimum with value, say, µ. This has to be positive since v and $\bar{v}$ are linearly independent. Hence $\|e^{(k)}\| \ge \mu|\lambda^k|$ and $e^{(k)}$ cannot tend to zero, since $|\lambda| \ge 1$.
The other direction is beyond the scope of this course, but can be found
in [20] R. S. Varga Matrix Iterative Analysis, which is regarded as a classic in
its field.
where α, β, γ ∈ R are constant. Find all values for α, β, γ such that the se-
quence converges for every x(0) and b. What happens when α = β = γ = −1
and α = β = 0?
In some cases, however, iteration matrices can arise where we will have convergence, but it will be very, very slow. An example for this situation is an upper triangular matrix H whose diagonal entries (and hence eigenvalues) are at most 0.99 in modulus, but which has an off-diagonal entry of order $10^{20}$.
Gauss–Seidel method
We set $A - B = L + D$, the lower triangular portion of A, or in other words $B = U$. The sequence $x^{(k)}$, $k = 1, \ldots$, is generated by
$$(L + D)x^{(k+1)} = -Ux^{(k)} + b,$$
or componentwise
$$x_i^{(k+1)} = \frac{1}{a_{i,i}}\Big(b_i - \sum_{j=1}^{i-1} a_{i,j}x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{i,j}x_j^{(k)}\Big)$$
for $i = 1, \ldots, n$.
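A minimal MATLAB sketch of such a splitting iteration, here Gauss–Seidel with the componentwise update above (the stopping tolerance and iteration limit are illustrative choices; the Jacobi method would instead use only values from the previous sweep):

function x = gauss_seidel(A, b, x, tol, maxit)
% Gauss-Seidel iteration for A*x = b: sweeps through the components,
% always using the most recently updated values.
n = length(b);
for k = 1:maxit
    for i = 1:n
        x(i) = (b(i) - A(i, [1:i-1, i+1:n]) * x([1:i-1, i+1:n])) / A(i, i);
    end
    if norm(A*x - b) < tol
        return
    end
end
end

A typical call would be x = gauss_seidel(A, b, zeros(length(b),1), 1e-10, 1000).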
For the first class, the following theorem holds:
Theorem 2.7 (Householder–John theorem). If A and B are real matrices such that both A and $A - B - B^T$ are symmetric and positive definite, then the spectral radius of $H = -(A - B)^{-1}B$ is strictly less than one.

Proof. Let λ be an eigenvalue of H and $v \neq 0$ the corresponding eigenvector. Note that λ and thus v might have nonzero imaginary parts. From $Hv = \lambda v$ we deduce $-Bv = \lambda(A - B)v$. λ cannot equal one, since otherwise A would map v to zero and be singular. Thus we deduce
$$\bar{v}^TBv = \frac{\lambda}{\lambda - 1}\bar{v}^TAv. \tag{2.5}$$
Writing $v = u + iw$, where $u, w \in \mathbb{R}^n$, we deduce $\bar{v}^TAv = u^TAu + w^TAw$. Hence, positive definiteness of A and $A - B - B^T$ implies $\bar{v}^TAv > 0$ and $\bar{v}^T(A - B - B^T)v > 0$. Inserting (2.5) and its conjugate transpose into the latter, we arrive at
$$0 < \bar{v}^TAv - \bar{v}^TBv - \bar{v}^TB^Tv = \left(1 - \frac{\lambda}{\lambda - 1} - \frac{\bar{\lambda}}{\bar{\lambda} - 1}\right)\bar{v}^TAv = \frac{1 - |\lambda|^2}{|\lambda - 1|^2}\,\bar{v}^TAv.$$
The denominator does not vanish since $\lambda \neq 1$, hence $|\lambda - 1|^2 > 0$. Since $\bar{v}^TAv > 0$, $1 - |\lambda|^2$ has to be positive. Therefore we have $|\lambda| < 1$ for every eigenvalue of H as required.
For the second class we use the following simple, but very useful theorem.
Proof. For the Gauss–Seidel method the eigenvalues of the iteration matrix
HGS = −(L + D)−1 U are solutions to
We have already seen one example where convergence can be very slow.
The next section shows how to improve convergence.
2.13 Relaxation
The efficiency of the splitting method can be improved by relaxation. Here,
instead of iterating $(A - B)x^{(k+1)} = -Bx^{(k)} + b$, we first calculate $(A - B)\tilde{x}^{(k+1)} = -Bx^{(k)} + b$ as an intermediate value and then let
$$x^{(k+1)} = \omega\tilde{x}^{(k+1)} + (1 - \omega)x^{(k)},$$
where ω is the relaxation parameter. The iteration matrix of the relaxed method can then be written as
$$H_\omega = \omega H + (1 - \omega)I.$$
It follows that the eigenvalues of Hω and H are related by λω = ωλ + (1 − ω).
The best choice for ω would be to minimize max{|ωλi + (1 − ω)|, i = 1, . . . , n}
where λ1 , . . . , λn are the eigenvalues of H. However, the eigenvalues of H
are often unknown, but sometimes there is information (for example, derived
from the Gerschgorin theorem), which makes it possible to choose a good if not
optimal value for ω. For example, it might be known that all the eigenvalues
are real and lie in the interval [a, b], where −1 < a < b < 1. Then the interval
containing the eigenvalues of Hω is given by [ωa + (1 − ω), ωb + (1 − ω)].
An optimal choice for ω is the one which centralizes this interval around the
origin:
−[ωa + (1 − ω)] = ωb + (1 − ω).
It follows that
$$\omega = \frac{2}{2 - (a+b)}.$$
The eigenvalues of the relaxed iteration matrix lie in the interval $\left[-\frac{b-a}{2-(a+b)}, \frac{b-a}{2-(a+b)}\right]$. Note that if the interval $[a, b]$ is already symmetric about
zero, i.e., a = −b, then ω = 1 and no relaxation is performed. On the other
hand consider the case where all eigenvalues lie in a small interval close to 1. More specifically, let $a = 1 - 2\epsilon$ and $b = 1 - \epsilon$; then $\omega = \frac{2}{3\epsilon}$ and the new interval is $[-\frac{1}{3}, \frac{1}{3}]$.
Exercise 2.13. The Gauss–Seidel method is used to solve $Ax = b$, where
$$A = \begin{pmatrix} 100 & -11 \\ 9 & 1 \end{pmatrix}.$$
Find the eigenvalues of the iteration matrix. Then show that with relaxation the spectral radius can be reduced by nearly a factor of 3. In addition show that after one iteration with the relaxed method the error $\|x^{(k)} - x^*\|$ is reduced by more than a factor of 3. Estimate the number of iterations the original Gauss–Seidel method would need to achieve a similar decrease in the error.
The steepest descent method is based on minimizing the quadratic function $F(x) = \frac{1}{2}x^TAx - b^Tx = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}A_{ij}x_ix_j - \sum_{i=1}^{n}b_ix_i$ with A symmetric and positive definite. Note that the first sum is a double sum. A multivariate function has an
extremum at the point where the derivatives in each of the directions xi ,
i = 1, . . . , n vanish. The vector of derivatives is called the gradient and is de-
noted ∇F (x). So the extremum occurs when the gradient vanishes, or in other
words when x satisfies ∇F (x) = 0. The derivative of F (x) in the direction of
$x_k$ is
$$\frac{d}{dx_k}F(x_1,\ldots,x_n) = \frac{1}{2}\left(\sum_{i=1}^{n} A_{ik}x_i + \sum_{j=1}^{n} A_{kj}x_j\right) - b_k = \sum_{j=1}^{n} A_{kj}x_j - b_k,$$
where we used the symmetry of A in the last step. This is one component of
the gradient vector and thus
∇F (x) = Ax − b.
That is, the value of F at the new approximation should be less than the value
of F at the current approximation, since we are looking for a minimum. Both
Jacobi and Gauss–Seidel methods do so.
Generally decent methods have the following form.
1. Pick any starting vector x(0) ∈ Rn .
2. For any k = 0, 1, 2, . . . the calculation stops if the norm of the gradient
kAx(k) − bk = k∇F (x(k) )k is acceptably small.
3. Otherwise a search direction d(k) is generated that satisfies the descent
condition
$$\left.\frac{d}{d\omega}F(x^{(k)} + \omega d^{(k)})\right|_{\omega=0} < 0.$$
In other words, if we are walking in the search direction, the values of
F become smaller.
4. The value $\omega^{(k)} > 0$ which minimizes $F(x^{(k)} + \omega d^{(k)})$ is calculated and the new approximation is
$$x^{(k+1)} = x^{(k)} + \omega^{(k)}d^{(k)}.$$
Return to 2.
We will see specific choices for the search direction d(k) later. First we look
and
$$\omega^{(k)} = \frac{{g^{(k)}}^Tg^{(k)}}{{g^{(k)}}^TAg^{(k)}} = \frac{\|g^{(k)}\|^2}{\|g^{(k)}\|_A^2},$$
which is the square of the Euclidean norm of the gradient divided by the
square of the norm defined by A of the gradient.
It can be proven that the infinite sequence x(k) , k = 0, 1, 2, . . . converges
to the solution of Ax = b (see for example [9] R. Fletcher Practical Methods
of Optimization). However, convergence can be unacceptably slow.
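For reference, the steepest descent iteration for a symmetric positive definite A takes only a few MATLAB lines (the starting vector and tolerance are illustrative choices):

function x = steepest_descent(A, b, tol)
% Steepest descent for A*x = b with A symmetric positive definite.
% The search direction is the negative gradient d = b - A*x = -g.
x = zeros(size(b));
g = A*x - b;                      % gradient of F at the current point
while norm(g) > tol
    d = -g;                       % steepest descent direction
    w = (g'*g) / (d'*(A*d));      % exact line search: minimises F(x + w*d)
    x = x + w*d;
    g = A*x - b;
end
end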
We look at the contour lines of F in two dimensions (n = 2). These are
lines where F takes the same value. They are ellipses with the minimum
lying at the intersection of the axes of the ellipses. The gradients and thus
the search directions are perpendicular to the contour lines. That is, they
are perpendicular to the tangent to the contour line at that point. When the
current search direction becomes tangential to another contour line, the new
approximation is reached and the next search direction is perpendicular to the
previous one. Figure 2.1 illustrates this. The resultant zigzag search path is
typical. The value F (x(k+1) ) is decreased locally relative to F (x(k) ), but the
global decrease with respect to F (x(0) ) can be small.
We assume that the assertions are true for some k ≥ 1 and prove that they
remain true when k is increased by one.
For (1), the definition of d(k) = −g(k) + β (k) d(k−1) and the inductive
hypothesis show that any vector in the span of d(0) , . . . , d(k) also lies in the
span of g(0) , . . . , g(k) .
We have seen that the gradients satisfy $g^{(k+1)} = g^{(k)} + \omega^{(k)}Ad^{(k)}$. Multiplying this by ${d^{(j)}}^T$ from the left for $j = 0, 1, \ldots, k-1$, the first term of the sum vanishes due to (2) and the second term vanishes due to (3). For $j = k$, the choice of $\omega^{(k)}$ ensures orthogonality. Thus ${d^{(j)}}^Tg^{(k+1)} = 0$ for $j = 0, 1, \ldots, k$.
This also proves (4), since $d^{(0)}, \ldots, d^{(k)}$ span the same space as $g^{(0)}, \ldots, g^{(k)}$.
where d(k−1) = −g(k−1) + β (k−1) d(k−2) and the orthogonality properties (2)
and (4) of Theorem 2.10 are used.
We write −r(k) instead of g(k) , where r(k) is the residual b − Ax(k) . The
zero vector is chosen as the initial approximation x(0) .
The algorithm is then as follows
1. Set k = 0, x(0) = 0, r(0) = b and d(0) = r(0) .
2. Stop if kr(k) k is sufficiently small.
3. If $k \ge 1$, set $d^{(k)} = r^{(k)} + \beta^{(k)}d^{(k-1)}$, where $\beta^{(k)} = \|r^{(k)}\|^2/\|r^{(k-1)}\|^2$.

4. Calculate $v^{(k)} = Ad^{(k)}$ and $\omega^{(k)} = \|r^{(k)}\|^2/{d^{(k)}}^Tv^{(k)}$.
5. Set $x^{(k+1)} = x^{(k)} + \omega^{(k)}d^{(k)}$ and $r^{(k+1)} = r^{(k)} - \omega^{(k)}v^{(k)}$.

6. Increase k by 1 and go to 2.

The number of operations per iteration is dominated by the matrix multiplication $Ad^{(k)}$, which is $O(n^2)$ if A is dense.
sparse this can be reduced and the conjugate gradient method becomes highly
suitable.
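The steps above translate directly into the following MATLAB sketch, which stops once the residual norm falls below a user-chosen tolerance:

function x = conjugate_gradient(A, b, tol)
% Standard conjugate gradient method for A*x = b with A symmetric
% positive definite, following the algorithm in the text.
x = zeros(size(b));              % step 1: x^(0) = 0
r = b;                           % r^(0) = b
d = r;                           % d^(0) = r^(0)
rho = r'*r;
while sqrt(rho) > tol            % step 2: stop when ||r|| is small
    v = A*d;                     % step 4: v^(k) = A*d^(k)
    w = rho / (d'*v);            % omega^(k) = ||r^(k)||^2 / (d^(k)' * v^(k))
    x = x + w*d;                 % step 5: update approximation
    r = r - w*v;                 %         and residual
    rho_new = r'*r;
    beta = rho_new / rho;        % step 3: beta^(k+1) = ||r^(k+1)||^2/||r^(k)||^2
    d = r + beta*d;
    rho = rho_new;
end
end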
Only in exact arithmetic is the number of iterations at most n. The con-
jugate gradient method is sensitive to even small perturbations. In practice,
most directions will not be conjugate and the exact solution is not reached.
An acceptable approximation is however usually reached within a small (com-
pared to the problem size) number of iterations. The speed of convergence is
typically linear, but it depends on the condition number of the matrix A. The
larger the condition number the slower the convergence. In the following sec-
tion we analyze this further and show how to improve the conjugate gradient
method.
Exercise 2.14. Let A be positive definite and let the standard conjugate gradient method be used to solve $Ax = b$. Express $d^{(k)}$ in terms of $r^{(j)}$ and $\beta^{(j)}$, $j = 0, 1, \ldots, k$. Using $x^{(k+1)} = \sum_{j=0}^{k}\omega^{(j)}d^{(j)}$, $\omega^{(j)} > 0$ and Theorem 2.10, show that the sequence $\|x^{(j)}\|$, $j = 0, 1, \ldots, k+1$, increases monotonically.
Exercise 2.15. Use the standard form of the conjugate gradient method to
solve
$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}x = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$$
starting with x(0) = 0. Show that the residuals r(0) , r(1) and r(2) are mutu-
ally orthogonal and that the search directions d(0) , d(1) and d(2) are mutually
conjugate and that x(3) is the solution.
$$A^{s+r}b = \sum_{j=0}^{s-1} a_jA^{j+r}b$$
for any positive r. This means that also the spaces $K_{s+r+1}(A, b)$ and $K_{s+r}(A, b)$ are the same. Therefore, for every $m \ge s$, $K_m(A, b) = K_s(A, b)$.
Suppose now that $b = \sum_{i=1}^{t} c_iv_i$, where $v_1, \ldots, v_t$ are eigenvectors of A corresponding to distinct eigenvalues $\lambda_i$. Then for every $j = 1, \ldots, s$
$$A^jb = \sum_{i=1}^{t} c_i\lambda_i^jv_i.$$
Since the eigenvectors are linearly independent and all $c_i$ are nonzero, we have
$$\sum_{j=0}^{t-1} a_j\lambda_i^j = 0$$
$$\omega^{(k)} = \frac{\tilde{r}^{(k)T}\tilde{r}^{(k)}}{\tilde{d}^{(k)T}P^TAP\tilde{d}^{(k)}}, \qquad \tilde{x}^{(k+1)} = \tilde{x}^{(k)} + \omega^{(k)}\tilde{d}^{(k)},$$
$$\beta^{(k+1)} = \frac{\tilde{r}^{(k+1)T}\tilde{r}^{(k+1)}}{\tilde{r}^{(k)T}\tilde{r}^{(k)}}, \qquad \tilde{d}^{(k+1)} = \tilde{r}^{(k+1)} + \beta^{(k+1)}\tilde{d}^{(k)}.$$
The number of iterations is at most the dimension of the Krylov subspace $K_n(P^TAP, P^Tb)$. This space is spanned by $(P^TAP)^jP^Tb = P^T(APP^T)^jb$, $j = 0, \ldots, n-1$. Since P is nonsingular, so is $P^T$. It follows that the dimension of $K_n(P^TAP, P^Tb)$ is the same as the dimension of $K_n(APP^T, b)$.
It is undesirable in this method that P has to be computed. However, with
a few careful changes of variables P can be eliminated. Instead, the matrix
S = P P T is used. We see later why this is advantageous.
Firstly, we use $\tilde{x}^{(k)} = P^{-1}x^{(k)}$, or equivalently $x^{(k)} = P\tilde{x}^{(k)}$. Setting $r^{(k)} = P^{-T}\tilde{r}^{(k)}$ and $d^{(k)} = P\tilde{d}^{(k)}$, we derive the untransformed preconditioned
from the (n, n) entry. Prove that the preconditioned gradient method converges
in just two iterations then.
As a closing remark on conjugate gradients, we have seen that the method
of normal equations solves AT Ax = AT b for which conjugate gradients can
be used as long as Ax = b is not underdetermined, because only then is
AT A nonsingular. However, the condition number of AT A is the square of
the condition number of A, so convergence can be significantly slower. An
important technical point is that AT A is never computed explicitly, since it
is less sparse. Instead when calculating AT Ad(k) , first Ad(k) is computed and
this is then multiplied by AT . It is also numerically more stable to calculate
T
d(k) AT Ad(k) as the inner product of Ad(k) with itself.
since in every iteration the new approximation is scaled to have length 1. Since
kx(k) k = kvn k = 1, it follows that x(k) = ±vn + O(|λn−1 /λn |k ), where the
sign is determined by cn λkn , since we scale by a positive factor in each iteration.
The fraction |λn−1 /λn | characterizes the rate of convergence. Thus if λn−1 is
similar in size to λn convergence is slow. However, if λn is considerably larger
than the other eigenvalues in modulus, convergence is fast.
Termination occurs, since
This proves the first assertion, the second one follows similarly.
If kx(k+2) + αx(k+1) + βx(k) k = 0, then the vectors x(k) and x(k+1) span
an eigenspace of A. This means that if A is applied to any linear combination
of x(k) and x(k+1) , then the result is again a linear combination of x(k) and
x(k+1) . Since kx(k+2) + αx(k+1) + βx(k) k = 0, it is easy to see that
u + αv + βw)T (u + αv + βw =
100.5, the first n − 1 eigenvalues lie in [−0.5, 0.5] and the largest eigenvalue is
1.5. Now the reduction factor is at least 0.5/1.5 = 1/3. Occasionally some prior
knowledge (Gerschgorin theorem) is available, which provides good choices for
the shift.
This gives the power method with shifts.
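A sketch of the shifted power method in MATLAB; the starting vector, the number of iterations, and the use of the Rayleigh quotient as the eigenvalue estimate are illustrative choices (s = 0 recovers the plain power method):

function [lambda, x] = power_method(A, s, maxit)
% Power method applied to the shifted matrix A - s*I.  Returns an
% estimate of the eigenvalue of A furthest from s and the
% corresponding eigenvector, normalised to length 1.
n = size(A, 1);
x = randn(n, 1); x = x / norm(x);       % random starting vector
for k = 1:maxit
    y = (A - s*eye(n)) * x;
    x = y / norm(y);                    % scale to length 1
end
lambda = x' * (A * x);                  % Rayleigh quotient estimate
end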
The following method also uses shifts, but with a different intention.
(A − sI)x(k+1) = x(k) , k = 0, 1, . . . ,
where s is a scalar that may depend on k. Thus the inverse power method
is the power method applied to the matrix (A − sI)−1 . If s is close to an
eigenvalue, then the matrix A − sI has an eigenvalue close to zero, but this
implies that (A − sI)−1 has a very large eigenvalue and we have seen that in
this case the power method converges fast.
In every iteration x(k+1) is scaled such that kx(k+1) k = 1. We see that the
calculation of x(k+1) requires the solution of an n × n system of equations.
If s is constant in every iteration such that A − sI is nonsingular,
Pn then
x(k+1) is a multiple of (A − sI)−k−1 x(0) . As before we let x(0) = j=1 cj vj ,
where vj , j = 1, . . . , n, are the linearly independent eigenvectors. The eigen-
value equation then implies (A − sI)vj = (λj − s)vj . For the inverse we have
then (A − sI)−1 vj = (λj − s)−1 vj . It follows that x(k+1) is a multiple of
n
X n
X
(A − sI)−k−1 x(0) = cj (A − sI)−k−1 vj = cj (λj − s)−k−1 vj .
j=0 j=0
inverse iterations are very efficient, because in this case the LU factorization
requires O(n2 ) operations when A is nonsymmetric and O(n) operations if A
is symmetric.
We have seen earlier when examining how to arrive at a QR factorization,
that Givens rotations and Householder reflections can be used to introduce
zeros below the diagonal. These can be used to transform A into an upper
Hessenberg matrix. They are also used in the next section.
2.20 Deflation
Suppose we have found one solution of the eigenvector equation Av = λv
(or possibly a pair of complex conjugate eigenvalues with their corresponding
eigenvectors), where A is an n × n matrix. Deflation constructs an (n − 1) ×
(n−1) (or (n−2)×(n−2)) matrix, say B, whose eigenvalues are the other n−1
(or n − 2) eigenvalues of A. The concept is based on the following theorem.
Theorem 2.12. Let A and S be n × n matrices, S being nonsingular. Then
v is an eigenvector of A with eigenvalue λ if and only if Sv is an eigenvector
of SAS −1 with the same eigenvalue. S is called a similarity transformation.
Proof.
Av = λv ⇔ AS −1 (Sv) = λv ⇔ (SAS −1 )(Sv) = λ(Sv).
Let’s assume one eigenvalue λ and its corresponding eigenvector have been
found. In deflation we apply a similarity transformation S to A such that the
first column of SAS −1 is λ times the first standard unit vector e1 ,
(SAS −1 )e1 = λe1 .
Then we can let B be the bottom right (n − 1) × (n − 1) submatrix of SAS −1 .
We see from the above theorem that it is sufficient to let S have the property
Sv = ae1 , where a is any nonzero scalar.
If we know a complex conjugate pair of eigenvalues, then there is a two-
dimensional eigenspace associated with them. Eigenspace means that if A is
applied to any vector in the eigenspace, then the result will again lie in the
eigenspace. Let v1 and v2 be vectors spanning the eigenspace. For example
these could have been found by the two-stage power method. We need to find
a similarity transformation S which maps the eigenspace to the space spanned
by the first two standard basis vectors e1 and e2 . Let S1 be such that S1 v1 = ae1 .
In addition let v̂ be the vector composed of the last n − 1 components of S1 v2 .
We then let S2 be of the form

    S2 = ( 1   0  · · ·  0 )
         ( 0                )
         ( ⋮        Ŝ       )
         ( 0                ) ,

that is, the first row and the first column of S2 agree with the identity matrix and the bottom right (n − 1) × (n − 1) block is Ŝ.
orthogonal matrix S such that Sv is a multiple of the first standard unit vector
e1 . Calculate SAS. The resultant matrix is suitable for deflation and hence
identify the remaining eigenvalues and eigenvectors.
We could achieve the same form of SAS −1 using successive Givens rota-
tions instead of one Householder reflection. However, this makes sense only if
there are already many zero elements in the first column of A.
The following algorithm for deflation can be used for nonsymmetric matri-
ces as well as symmetric ones. Let vi , i = 1, . . . , n, be the components of the
eigenvector v. We can assume v1 6= 0, since otherwise the variables could be
reordered. Let S be the n × n matrix which is identical to the n × n identity
matrix except for the off-diagonal elements of the first column of S, which are
Si,1 = −vi /v1 , i = 2, . . . , n.
0 ··· ··· 0
1
..
−v2 /v1 1 . . .
.
S= .
.. . .. . . . ... .
0
.. .. . . ..
. . . . 0
−vn /v1 0 · · · 0 1
Then S is nonsingular, has the property Sv = v1 e1 , and thus is suitable for our
purposes. The inverse S −1 is also identical to the identity matrix except for the
off-diagonal elements of the first column of S −1 which are (S −1 )i,1 = +vi /v1 ,
i = 2, . . . , n. Because of this form of S and S −1 , SAS −1 and hence B can
be calculated in only O(n2 ) operations. Moreover, the last n − 1 columns of
SAS −1 and SA are the same, since the last n − 1 columns of S −1 are taken
from the identity matrix, and thus B is just the bottom (n − 1) × (n − 1)
submatrix of SA. Therefore, for every integer i = 1, . . . , n − 1 we calculate the
ith row of B by subtracting vi+1 /v1 times the first row of A from the (i + 1)th
row of A and deleting the first component of the resultant row vector.
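A sketch of this deflation step (the function name is an assumption); it forms B in O(n²) operations by exactly these row operations, assuming v1 ≠ 0:

function B = deflate(A, v)
% Deflation with S equal to the identity except S(i,1) = -v(i)/v(1).
% B is the bottom (n-1) x (n-1) block of S*A*S^{-1}, which equals the
% corresponding block of S*A.
n = size(A, 1);
B = zeros(n-1, n-1);
for i = 1:n-1
    % subtract v(i+1)/v(1) times the first row of A from the (i+1)th row
    row = A(i+1, :) - (v(i+1)/v(1)) * A(1, :);
    B(i, :) = row(2:n);           % delete the first component
end
end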
The following algorithm provides deflation in the form of block matrices. It
is known as the QR algorithm, since QR factorizations are calculated again and
again. Set A0 = A. For k = 0, 1, . . . calculate the QR factorization Ak = Qk Rk ,
where Qk is orthogonal and Rk is upper triangular. Set Ak+1 = Rk Qk . The
eigenvalues of Ak+1 are the same as the eigenvalues of Ak , since
    Ak+1 = Rk Qk = Qk^{−1} Qk Rk Qk = Qk^{−1} Ak Qk
all entries in D are close to zero. We can then calculate the eigenvalues of
B and E separately, possibly again with the QR algorithm, except for 1 × 1
and 2 × 2 blocks where the eigenvalue problem is trivial. The space spanned
by e1 , . . . , em can be regarded as an eigenspace of Ak+1 , since, if D = 0,
Ak+1 ei , i = 1, . . . , m, again lies in this space. Equally the space spanned by
em+1 , . . . , en can be regarded as an eigenspace of Ak+1 .
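A minimal MATLAB sketch of the QR algorithm (the iteration limit and the test on the entries below the first subdiagonal are illustrative assumptions):

function Ak = qr_algorithm(A, maxit, tol)
% Unshifted QR algorithm: A_{k+1} = R_k*Q_k where A_k = Q_k*R_k.
% The eigenvalues are preserved since A_{k+1} = Q_k^{-1}*A_k*Q_k.
Ak = A;
for k = 1:maxit
    [Q, R] = qr(Ak);
    Ak = R*Q;
    if norm(tril(Ak, -2), 'fro') < tol
        break;                    % only 1x1 and 2x2 diagonal blocks remain
    end
end
end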
The concept of eigenspaces is important when dealing with a complex conjugate pair of eigenvalues λ and λ̄ with corresponding eigenvectors v and v̄ in Cn . However, we are operating in Rn . The real and imaginary parts
Re(v) and Im(v) form an eigenspace of Rn ; that is, A applied to any linear
combination of Re(v) and Im(v) will again be a linear combination of Re(v)
and Im(v).
In this situation we choose S such that S applied to any vector in the space spanned by Re(v) and Im(v) is a linear combination of e1 and e2 . The
matrix SAS −1 , then, consists of a 2 × 2 block in the top left corner and an
(n − 2) × (n − 2) block B in the bottom right, and the last (n − 2) elements in
the first two columns are zero. The search for eigenvalues can then continue
with B. The following exercise illustrates this.
Exercise 2.20. Show that the vectors x, Ax, and A2 x are linearly dependent
for
1
1 0 − 34
4 1
1
3
2 2 − 12 4
A= and x = 1 .
0 − 34 1 1
4
2 −1 3 1 4
2 2
(b) Define the Gauss–Seidel and Jacobi iterations and state their iteration
matrices, respectively.
(c) Describe relaxation and briefly consider the cases when the relaxation pa-
rameter ω equals 0 and 1.
(d) Show how the iteration matrix Hω of the relaxed method is related to the
iteration matrix H of the original method and thus how the eigenvalues
are related. How should ω be chosen?
(e) We now consider the tridiagonal matrix A with diagonal elements Ai,i = 1
and off-diagonal elements Ai,i−1 = Ai,i+1 = 1/4. Calculate the iteration
matrices H of the Jacobi method and Hω of the relaxed Jacobi method.
(f ) The eigenvectors of both H and Hω are v1 , . . . , vn where the ith component of vk is given by (vk )i = sin(πik/(n + 1)). Calculate the eigenvalues of H by evaluating Hvk (Hint: sin(x ± y) = sin x cos y ± cos x sin y).
(g) Using the formula for the eigenvalues of Hω derived earlier, state the
eigenvalues of Hω and show that the relaxed method converges for 0 <
ω ≤ 4/3.
(a) Describe the power method to generate a single eigenvector and eigenvalue
of A. Define the Rayleigh quotient in the process.
(b) Which assumption is crucial for the power method to converge? What
characterizes the rate of convergence? By expressing the starting vector
x(0) as a linear combination of eigenvectors, give an expression for x(k) .
(c) Given

          ( 1  1  1 )                  (  2 )
     A =  ( 1  1  0 )    and  x(0) =   (  1 ) ,
          ( 1  0  1 )                  ( −1 )
calculate x(1) and x(2) and evaluate the Rayleigh quotient λ(k) for k =
0, 1, 2.
(d) Suppose now that for a different matrix A the eigenvalues of largest mod-
ulus are a complex conjugate pair of eigenvalues, λ and λ̄. In this case the
vectors x(k) , Ax(k) and A2 x(k) tend to be linearly dependent. Assuming
that they are linearly dependent, show how this can be used to calculate
the eigenvalues λ and λ̄.
the first column of HA is a multiple of the first standard unit vector and
calculate HA.
(f ) Having found H in the previous part, calculate Hb for b = (1, 5, 1)T .
(g) Using the results of the previous two parts, find the x ∈ R2 which minimizes ‖Ax − b‖ and calculate the minimum.
Exercise 2.25. (a) Explain the technique of splitting for solving the linear
system Ax = b iteratively where A is an n × n, non-singular matrix.
Define the iteration matrix H and state the property it has to satisfy to
ensure convergence.
(b) Define what it means for a matrix to be positive definite. Show that all
diagonal elements of a positive definite matrix are positive.
(c) State the Householder-John theorem and explain how it can be used to
design iterative methods for solving Ax = b.
(d) Let the iteration matrix H have a real eigenvector v with real eigenvalue
λ. Show that the condition of the Householder-John theorem implies that
|λ| < 1.
(c) The next approximation x(k+1) is calculated as x(k+1) = x(k) + ω (k) d(k) .
Derive an expression for ω (k) using the gradient g(k) = ∇F (x(k) ) =
Ax(k) − b.
(d) Derive an expression for the new gradient g(k+1) and a relationship be-
tween it and the search direction d(k) .
(e) Explain how the search direction d(k) is chosen in the steepest descent
method and give a motivation for this choice.
(f ) Define the concept of conjugacy.
(g) How are the search directions chosen in the conjugate gradient method?
Derive an explicit formula for the search direction d(k) stating the conju-
gacy condition in the process. What is d(0) ?
(h) Which property do the gradients in the conjugate gradient method satisfy?
(c) For v = (0, 1, −1)T calculate H and Hv such that the last two components
of Hv are zero.
(d) Let A be an n × n real matrix with a real eigenvector v ∈ Rn with real eigenvalue λ. Explain how a similarity transformation can be used to obtain an (n − 1) × (n − 1) matrix B whose n − 1 eigenvalues are the other n − 1 eigenvalues of A.
(e) The matrix
          ( 1  1  1 )
     A =  ( 1  1  2 )
          ( 1  2  1 )
has the eigenvector v = (0, 1, −1)T with eigenvalue −1. Using the H ob-
tained in (c) calculate HAH T and thus calculate two more eigenvalues.
Exercise 2.29. (a) Explain the technique of splitting for solving the linear
system Ax = b iteratively where A is an n × n, non-singular matrix.
Define the iteration matrix H and state the property it has to satisfy to
ensure convergence.
(b) Define the Gauss–Seidel and Jacobi iterations and state their iteration
matrices, respectively.
(c) Let

          (   2     √3/2   1/2  )
     A =  ( √3/2      2    √3/2 ) .
          (  1/2    √3/2    2   )
Derive the iteration matrix for the Jacobi iterations and state the eigen-
value equation. Check that the numbers −3/4, 1/4, 1/2 satisfy the eigen-
value equation and thus are the eigenvalues of the iteration matrix.
(d) The matrix given in (c) is positive definite. State the Householder–John
theorem and apply it to show that the Gauss–Seidel iterations for this
matrix converge.
(e) Describe relaxation and show how the iteration matrix Hω of the relaxed method is related to the iteration matrix H of the original method and thus how the eigenvalues are related. How should ω be chosen?
(f ) For the eigenvalues given in (c) calculate the best choice of ω and the
eigenvalues of the relaxed method.
CHAPTER 3
Interpolation and
Approximation Theory
Consider fitting a straight line p(x) = a1 x + a0 through a pair of points given by (x0 , f0 ) and (x1 , f1 ). This means solving 2
equations, one for each data point. Thus we have 2 degrees of freedom. For a
quadratic curve there are 3 degrees of freedom, fitting a cubic curve we have
4 degrees of freedom, etc.
Let Pn [x] denote the linear space of all real polynomials of degree at most
n. Each p ∈ Pn [x] is uniquely defined by its n + 1 coefficients. This gives
n + 1 degrees of freedom, while interpolating x0 , x1 , . . . , xn gives rise to n + 1
conditions.
As we have mentioned above, in determining the polynomial interpolant
we can solve a linear system of equations. However, this can be done more
easily.
Definition 3.1 (Lagrange cardinal polynomials). These are given by

    Lk (x) := ∏_{l=0, l≠k}^n (x − xl)/(xk − xl) ,    x ∈ R.
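A short MATLAB sketch evaluating the cardinal polynomials and the interpolant p(x) = Σ_k fk Lk(x) at a single point x (the function and variable names are assumptions, not from the text):

function p = lagrange_eval(xdata, fdata, x)
% Evaluate the interpolating polynomial in Lagrange form at the point x.
n = length(xdata) - 1;
p = 0;
for k = 1:n+1
    Lk = 1;                                      % build the cardinal polynomial
    for l = 1:n+1
        if l ~= k
            Lk = Lk * (x - xdata(l)) / (xdata(k) - xdata(l));
        end
    end
    p = p + fdata(k) * Lk;
end
end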
Exercise 3.1. Let the function values f (0), f (1), f (2), and f (3) be given. We
want to estimate

    f (−1),   f ′(1)   and   ∫_0^3 f (x) dx.
To this end, we let p be the cubic polynomial that interpolates these function
values, and then approximate by
    p(−1),   p′(1)   and   ∫_0^3 p(x) dx.
Using the Lagrange formula, show that every approximation is a linear com-
bination of the function values with constant coefficients and calculate these
coefficients. Show that the approximations are exact if f is any cubic polyno-
mial.
Lemma 3.1 (Uniqueness). The polynomial interpolant is unique.
Proof. Suppose that two polynomials p, q ∈ Pn [x] satisfy p(xi ) = q(xi ) = fi ,
i = 0, . . . , n. Then the nth degree polynomial p − q vanishes at n + 1 distinct
points. However, the only nth degree polynomial with n + 1 or more zeros is
the zero polynomial. Therefore p = q.
Exercise 3.2 (Birkhoff–Hermite interpolation). Let a, b, and c be distinct
real numbers, and let f (a), f (b), f 0 (a), f 0 (b), and f 0 (c) be given. Because there
are five function values, a possibility is to approximate f by a polynomial of
degree at most four that interpolates the function values. Show by a general
argument that this interpolation problem has a solution and that the solution
is unique if and only if there is no nonzero polynomial p ∈ P4 [x] that satisfies
p(a) = p(b) = p0 (a) = p0 (b) = p0 (c) = 0. Hence, given a and b, show that there
exists a possible value of c ≠ a, b such that there is no unique solution.
Let [a, b] be a closed interval of R. C[a, b] is the space of continuous func-
tions from [a, b] to R and we denote by C k [a, b] the set of such functions which
have continuous k th derivatives.
Theorem 3.1 (The error of polynomial interpolation). Given f ∈ C n+1 [a, b]
and f (xi ) = fi , where x0 , . . . , xn are pairwise distinct, let p ∈ Pn [x] be the
interpolating polynomial. Then for every x ∈ [a, b], there exists ξ ∈ [a, b] such
that
    f (x) − p(x) = (f^{(n+1)}(ξ)/(n + 1)!) ∏_{i=0}^n (x − xi ).        (3.1)
For t = xj the first term vanishes, since f (xj ) = fj = p(xj ), and by construc-
tion the product in the second term vanishes. We also have φ(x) = 0, since
then the two terms cancel. Hence φ has at least n + 2 distinct zeros in [a, b].
By Rolle’s theorem if a function with continuous derivative vanishes at two
distinct points, then its derivative vanishes at an intermediate point. Since
φ ∈ C n+1 [a, b], we can deduce that φ0 vanishes at n + 1 distinct points in
[a, b]. Applying Rolle again, we see that φ00 vanishes at n distinct points in
[a, b]. By induction, φ(n+1) vanishes once, say at ξ ∈ [a, b]. Since p is an nth
degree polynomial, we have p(n+1) ≡ 0. On the other hand,
    d^{n+1}/dt^{n+1} ∏_{i=0}^n (t − xi ) = (n + 1)!.

Hence

    0 = φ^{(n+1)}(ξ) = f^{(n+1)}(ξ) ∏_{i=0}^n (x − xi ) − [f (x) − p(x)](n + 1)!
Lemma 3.2. The maximum absolute value of ∏_{i=0}^n (x − xi ) on [−1, 1] is minimal if it is the normalized (n + 1)th Chebyshev polynomial, i.e., ∏_{i=0}^n (x − xi ) = 2^{−n} Tn+1 (x). The maximum absolute value is then 2^{−n}.

Proof. ∏_{i=0}^n (x − xi ) describes an (n + 1)th degree polynomial with leading
coefficient 1. From the recurrence relation we see that the leading coefficient of
Tn+1 is 2n and thus 2−n Tn+1 has leading coefficient one. Let p be a polynomial
of degree n + 1 with leading coefficient 1 with maximum absolute value m <
2−n on [−1, 1]. Tn+1 has n+2 extreme points. At these points we have |p(x)| ≤
m < |2^{−n} Tn+1 (x)|. Moreover, for x = cos(2kπ/(n + 1)), 0 ≤ 2k ≤ n + 1, where Tn+1 has a maximum, we have

    2^{−n} Tn+1 (x) − p(x) ≥ 2^{−n} − m > 0.

Thus the function 2^{−n} Tn+1 (x) − p(x) changes sign between the points cos(kπ/(n + 1)),
−n
0 ≤ k ≤ n + 1. From the intermediate value theorem 2 Tn+1 − p has at least
n + 1 roots. However, this is impossible, since 2−n Tn+1 − p is a polynomial of
degree n, since the leading coefficients cancel. Thus we have a contradiction
and 2−n Tn+1 (x) gives the minimal value.
For a general interval [a, b] the Chebyshev points are xk = (a + b + (b − a) cos(π(2k + 1)/(2(n + 1))))/2.
Another interesting fact about Chebyshev polynomials is that they form
a set of polynomials which are orthogonal with respect to the weight function
(1 − x2 )−1/2 on (−1, 1). More specifically we have
    ∫_{−1}^1 Tm (x) Tn (x) dx/√(1 − x²) =  { π,     m = n = 0,
                                           { π/2,   m = n ≥ 1,
                                           { 0,     m ≠ n.
    p̌0 = (1/π) ∫_{−1}^1 p(x) dx/√(1 − x²) ,      p̌k = (2/π) ∫_{−1}^1 p(x) Tk (x) dx/√(1 − x²) ,    k = 1, . . . , n.
We will encounter Chebyshev polynomials again in spectral methods for the solution of partial differential equations.
    f [x0 , x1 , . . . , xn ] = c = (1/n!) f^{(n)}(ξ).
Exercise 3.3. Let f be a real valued function and let p be the polynomial of de-
gree at most n that interpolates f at the pairwise distinct points x0 , x1 , . . . , xn .
Furthermore, let x be any real number that is not an interpolation point. De-
duce for the error at x
    f (x) − p(x) = f [x0 , . . . , xn , x] ∏_{j=0}^n (x − xj ).
(Hint: Use the definition for the divided difference f [x0 , . . . , xn , x].)
We might ask what the divided difference of degree zero is. It is the co-
efficient of the zero degree interpolating polynomial, i.e., a constant. Hence
f [xi ] = f (xi ). Using the formula for linear interpolation between two points
(xi , f (xi )) and (xj , f (xj )), the interpolating polynomial is given by
    p(x) = f (xi ) (x − xj )/(xi − xj ) + f (xj ) (x − xi )/(xj − xi ).
This recursive formula gives a fast way to calculate the divided difference
table shown in Figure 3.3. This requires O(n2 ) operations and calculates the
numbers f [x0 , . . . xl ] for l = 0, . . . , n. These are needed for the alternative
representation of the interpolating polynomial.
Exercise 3.4. Implement a routine calculating the divided difference table.
f [x0 ]
           f [x0 , x1 ]
f [x1 ]                  f [x0 , x1 , x2 ]
           f [x1 , x2 ]                       · · ·
f [x2 ]                        ⋮                       f [x0 , . . . , xn ]
  ⋮            ⋮               ⋮                · · ·
           f [xn−2 , xn−1 , xn ]
f [xn−1 ]
           f [xn−1 , xn ]
f [xn ]
return;
end
if n ~= k
    disp('input dimensions do not agree')
    return;
end
m = size(x); % number of evaluation points
y = d(1) * ones(m); % first term of the sum in the Newton form
temp = x - inter(1) * ones(m); % temp holds the product of the (x - x_i) factors
% Note that y and temp are vectors
for i=2:n % add the terms of the sum in the Newton form one after
          % the other
    y = y + d(i) * temp;
    temp = temp .* (x - inter(i)); % Note: .* is element-wise
                                   % multiplication
end
end
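As a sketch for Exercise 3.4 (the function name is an assumption), the table can be built column by column in O(n²) operations; the routine returns the coefficients f [x0 ], f [x0 , x1 ], . . . , f [x0 , . . . , xn ], which can then be handed to the evaluation routine above.

function d = divided_differences(xdata, fdata)
% Build the divided difference table; d(l+1) holds f[x_0, ..., x_l] on return.
n = length(xdata);
d = fdata(:);                        % zeroth divided differences f[x_i]
for j = 2:n                          % column j of the table
    for i = n:-1:j                   % overwrite from the bottom upwards
        d(i) = (d(i) - d(i-1)) / (xdata(i) - xdata(i-j+1));
    end
end
end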
Exercise 3.5. Given f, g ∈ C[a, b], let h := f g. Prove by induction that the
divided differences of h satisfy the relation
    h[x0 , . . . , xn ] = Σ_{j=0}^n f [x0 , . . . , xj ] g[xj , . . . , xn ].
we see that f [x0 , x1 ] = h−1 ∆f (x0 ). We will encounter finite differences again
when deriving solutions to partial differential equations. Since
where we use the recurrence formula for divided differences, we can deduce by
induction that
∆j f (x0 ) = j!hj f [x0 , . . . , xj ].
Hence we can rewrite the Newton formula as
    pn (x0 + sh) = f (x0 ) + Σ_{j=1}^n (1/j!) ∏_{i=0}^{j−1} (s − i) Δ^j f (x0 ) = f (x0 ) + Σ_{j=1}^n ( s choose j ) Δ^j f (x0 ).
In this form the formula looks suspiciously like a finite analog of the Taylor
expansion. The Taylor expansion tells where a function will go based on the
values of the function and its derivatives (its rate of change and the rate of
change of its rate of change, etc.) at one given point x. Newton's formula is based on finite differences instead of instantaneous rates of change.
If the points are reordered as xn , . . . , x0 , we can again rewrite the Newton
formula for x = xn + sh. Then x − xi = (s + n − i)h, since xi = xn − (n − i)h,
and we have
    pn (xn + sh) = f [xn ] + f [xn , xn−1 ] sh + · · · + f [xn , . . . , x1 ] ∏_{i=0}^{n−2} (s + i) h^{n−1} + f [xn , . . . , x0 ] ∏_{i=0}^{n−1} (s + i) h^n .
Lemma 3.3. Let xi1 , xi2 , . . . , xim be m distinct points and denote by
pi1 ,i2 ,...,im the polynomial of degree m − 1 satisfying
Proof. For n = 0, there are no additional points xiν and pj (x) ≡ f (xj ) and
pk (x) ≡ f (xk ). It can be easily seen that the right-hand side defines a poly-
nomial of degree at most n + 1, which takes the values f (xiν ) at xiν for
ν = 1, . . . , n, f (xj ) at xj and f (xk ) at xk . Hence the right-hand side is the
unique interpolation polynomial.
We see that with iterated linear interpolation, points can be added
anywhere. The variety of methods differ in the order in which the values
(xj , f (xj )) are employed. For many applications, additional function values
are generated on the fly and thus their number are not known in advance. For
such cases we may always employ the latest pair of values as in
x0    f (x0 )
x1    f (x1 )    p0,1 (x)
x2    f (x2 )    p1,2 (x)      p0,1,2 (x)
 ⋮       ⋮          ⋮             ⋮
xn    f (xn )    pn−1,n (x)    pn−2,n−1,n (x)    · · ·    p0,1,...,n (x).
The rows are computed sequentially. Any p··· (x) is calculated using the two
quantities to its left and diagonally above it. To determine the (n + 1)th row
only the nth row needs to be known. As more points are generated, rows of
greater lengths need to be calculated and stored. The algorithm stops when-
ever |p0,1,...,n (x) − p0,1,...,n−1 (x)| < ε for some prescribed tolerance ε. This scheme is known as Neville's iterated interpolation.
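A MATLAB sketch of Neville's scheme at a single point x (the function name and the tolerance argument are assumptions); each row is computed from the previous one and the iteration stops when two successive diagonal entries agree:

function p = neville(xdata, fdata, x, eps_tol)
% Neville's iterated interpolation: row n holds the values of the
% interpolating polynomials built from the data points seen so far.
row = fdata(1);                       % p_0(x) = f(x_0)
for n = 2:length(xdata)
    newrow = zeros(1, n);
    newrow(1) = fdata(n);
    for j = 2:n
        % combine the entry to the left and the one diagonally above
        newrow(j) = ((x - xdata(n)) * row(j-1) ...
                   - (x - xdata(n-j+1)) * newrow(j-1)) ...
                   / (xdata(n-j+1) - xdata(n));
    end
    if abs(newrow(n) - row(n-1)) < eps_tol
        p = newrow(n);
        return;
    end
    row = newrow;
end
p = row(end);
end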
For V = C[a, b], the space of all continuous functions on the interval [a, b],
we can define a scalar product using a fixed positive function w ∈ C[a, b], the
weight function, in the following way
    ⟨f, g⟩ := ∫_a^b w(x) f (x) g(x) dx    for all f, g ∈ C[a, b].
All three axioms are easily verified for this scalar product. The associated
norm is

    ‖f‖2 = √⟨f, f⟩ = ( ∫_a^b w(x) f²(x) dx )^{1/2} .
The vector space of functions for which this integral exists is denoted by Lp . Unless p = 2, this is a normed space, but not an inner product space, because this norm does not satisfy the parallelogram equality given by

    ‖f + g‖² + ‖f − g‖² = 2‖f‖² + 2‖g‖².

Another norm in frequent use is the maximum norm

    ‖f‖∞ = max_{x∈[a,b]} |f (x)|.
Theorem 3.5. For every n ≥ 0 there exists a unique monic orthogonal poly-
nomial pn of degree n. Any p ∈ Pn [x] can be expressed as a linear combination
of p0 , p1 , . . . pn .
Proof. We first consider uniqueness. Assume that there are two monic orthog-
onal polynomials pn , p̃n ∈ Pn [x]. Let p = pn − p̃n which is of degree n − 1,
since the xn terms cancel, because in both polynomials the leading coefficient
is 1. By definition of orthogonal polynomials, hpn , pi = 0 = hp̃n , pi. Thus we
can write
0 = hpn , pi − hp̃n , pi = hpn − p̃n , pi = hp, pi,
and hence p ≡ 0.
We provide a constructive proof for the existence by induction on n. We
let p0 (x) ≡ 1. Assume that p0 , p1 , . . . , pn have already been constructed, con-
sistent with both statements of the theorem. Let q(x) := xn+1 ∈ Pn+1 [x].
Following the Gram–Schmidt algorithm, we construct
    pn+1 (x) = q(x) − Σ_{k=0}^n (⟨q, pk⟩ / ⟨pk , pk⟩) pk (x).
It is of degree n + 1 and it is monic, since all the terms in the sum are of
degree less than or equal to n. Let m ∈ {0, 1, . . . , n}. Then

    ⟨pn+1 , pm⟩ = ⟨q, pm⟩ − Σ_{k=0}^n (⟨q, pk⟩ / ⟨pk , pk⟩) ⟨pk , pm⟩ = ⟨q, pm⟩ − (⟨q, pm⟩ / ⟨pm , pm⟩) ⟨pm , pm⟩ = 0.
since from the definition of the inner product, ⟨(x − αn )pn , pl⟩ = ⟨pn , (x − αn )pl⟩, and since pn−1 , pn and pn+1 are orthogonal polynomials. Moreover,
    ∂/∂cj ⟨f − p, f − p⟩ = −2⟨pj , f⟩ + 2cj ⟨pj , pj⟩ ,    j = 0, . . . , n.

Setting these derivatives to zero gives

    cj = ⟨pj , f⟩ / ⟨pj , pj⟩

and thus

    p = Σ_{j=0}^n (⟨pj , f⟩ / ⟨pj , pj⟩) pj .
In the fraction we see the usual inner product of the vector (pi (x1 ), . . . , pi (xm ))T
with itself and with the vector (xk1 , . . . , xkm )T .
Once the orthogonal polynomials are constructed, we find p by
    p(x) = Σ_{k=0}^n (⟨pk , f⟩ / ⟨pk , pk⟩) pk (x).
For each k the work to find pk is bounded by a multiple of m and thus the
complete cost is O(mn). The only difference to the continuous case is that
we cannot keep adding terms, since we only have enough data to construct
p0 , p1 , . . . , pm−1 .
this could be the maximum of the absolute difference between the function
and the approximation over the interval [a, b]. In the case of a quadrature it
is the absolute difference between the value given by the quadrature and the
value of the integral. Thus L maps the space of functions C k+1 [a, b] to R. We
assume that L is a linear functional , i.e., L(αf + βg) = αL(f ) + βL(g) for
all α, β ∈ R. We also assume that the approximation is constructed in such a
way that it is correct for all polynomials of degree at most k, i.e., L(p) = 0
for all p ∈ Pk [x].
At a given point x ∈ [a, b], f (x) can be written as its Taylor polynomial
with integral remainder term
    f (x) = f (a) + (x − a)f ′(a) + ((x − a)²/2!) f ″(a) + · · · + ((x − a)^k/k!) f^{(k)}(a) + (1/k!) ∫_a^x (x − θ)^k f^{(k+1)}(θ) dθ.
since the first terms of the Taylor expansion are polynomials of degree at most
k. To make the integration independent of x, we can use the notation
    (x − θ)^k_+ := { (x − θ)^k ,   x ≥ θ,
                   { 0,            x ≤ θ.
Then

    L(f ) = L( (1/k!) ∫_a^b (x − θ)^k_+ f^{(k+1)}(θ) dθ ).
We now make the important assumption that the order of the integral and
functional can be exchanged. For most approximations, calculating L involves
differentiation, integration and linear combination of function values. In the
case of quadratures (which we will encounter later and which are a form of
numerical integration), L is the difference of the quadrature rule which is a
linear combination of function values and the integral. Both these operations
can be exchanged with the integral.
Definition 3.3 (Peano kernel). The Peano kernel K of L is the function
defined by
K(θ) := L[(x − θ)k+ ] for x ∈ [a, b].
Theorem 3.7 (Peano kernel theorem). Let L be a linear functional such that
L(p) = 0 for all p ∈ Pk [x]. Provided that the exchange of L with the integration
is valid, then for f ∈ C k+1 [a, b]
    L(f ) = (1/k!) ∫_a^b K(θ) f^{(k+1)}(θ) dθ.
Theorem 3.8. If K does not change sign in (a, b), then for f ∈ C^{k+1}[a, b]

    L(f ) = [ (1/k!) ∫_a^b K(θ) dθ ] f^{(k+1)}(ξ)

for some ξ ∈ (a, b).
    L(f ) = (1/2) ∫_0^2 K(θ) f ‴(θ) dθ.
The Peano kernel is

    K(θ) = L[(x − θ)²_+ ] = 2(0 − θ)_+ − [ −(3/2)(0 − θ)²_+ + 2(1 − θ)²_+ − (1/2)(2 − θ)²_+ ].
Using the definition of (x − θ)k+ , we obtain
Hence L(f ) = (1/2)(2/3) f ‴(ξ) = (1/3) f ‴(ξ) for some ξ ∈ (0, 2). Thus the error in approximating the derivative at zero is (1/3) f ‴(ξ) for some ξ ∈ (0, 2).
assuming that f ∈ C 3 [0, 3]. Sketch the kernel function K(θ) for θ ∈ [0, 3]. By
integrating K(θ) and using the mean value theorem, show that
    f [0, 1, 2, 3] = (1/6) f ‴(ξ)
for some point ξ ∈ [0, 3].
Exercise 3.10. We approximate the function value of f ∈ C²[0, 1] at 1/2 by p(1/2) = (1/2)[f (0) + f (1)]. Find the least constants c0 , c1 and c2 such that

    |f (1/2) − p(1/2)| ≤ ck ‖f^{(k)}‖∞ ,    k = 0, 1, 2.
For k = 0, 1 work from first principles and for k = 2 apply the Peano kernel
theorem.
3.7 Splines
The problem with polynomial interpolation is that with increasing degree the
polynomial ’wiggles’ from data point to data point. Low-order polynomials do
not display this behaviour. Let’s interpolate data by fitting two cubic poly-
nomials, p1 (x) and p2 (x), to different parts of the data meeting at the point
x∗ . Each cubic polynomial has four coefficients and thus we have 8 degrees
of freedom and hence can fit 8 data points. However, the two polynomial
pieces are unlikely to meet at x∗ . We need to ensure some continuity. If we let
p1 (x∗ ) = p2 (x∗ ), then the curve is at least continuous, but we are losing one
degree of freedom. The fit of the first and of the second derivative, p1′(x∗ ) = p2′(x∗ ) and p1″(x∗ ) = p2″(x∗ ), takes up one degree of freedom each. This gives a smooth curve, but we are
only left with five degrees of freedom and thus can only fit 5 data points. If
we also require the third derivative to be continuous, the two cubics become
the same cubic which is uniquely specified by the four data points.
The point x∗ is called a knot point. To fit more data we specify n + 1 such
knots and fit a curve consisting of n separate cubics between them, which is
continuous and also has continuity of the first and second derivative. This has
n + 3 degrees of freedom. This curve is called a cubic spline.
The two-dimensional equivalent of the cubic spline is called thin-plate
spline. A linear combination of thin-plate splines passes through the data
points exactly while minimizing the so-called bending energy, which is defined
as the integral over the squares of the second derivatives
    ∫∫ ( fxx² + 2 fxy² + fyy² ) dx dy ,

where fx = ∂f/∂x. The name thin-plate spline refers to bending a thin sheet of
metal being held in place by bolts as in the building of a ship’s hull. However,
here we will consider splines in one dimension.
Definition 3.4 (Spline function). The function s is called a spline function
of degree k if there exist points a = x0 < x1 < . . . < xn = b such that
s is a polynomial of degree at most k on each of the intervals [xj−1 , xj ] for
j = 1, . . . , n and such that s has continuous k − 1 derivatives. In other words,
s ∈ C k−1 [a, b]. We call s a linear, quadratic, cubic, or quartic spline for k =
1, 2, 3, or 4. The points x0 , . . . , xn are called knots and the points x1 , . . . , xn−1
are called interior knots.
A spline of degree k can be written as
    s(x) = Σ_{i=0}^k ci x^i + Σ_{j=1}^{n−1} dj (x − xj )^k_+ ,        (3.2)

where

    (x − xj )^k_+ := { (x − xj )^k ,   x ≥ xj ,
                     { 0,              x ≤ xj ,

is the function already introduced for the Peano kernel. The first sum defines a general polynomial
on [x0 , x1 ] of degree k. Each of the functions (x − xj )k+ is a spline itself with
only one knot at xj and continuous k − 1 derivatives. These derivatives all
vanish at xj . Thus (3.2) describes all possible spline functions.
Theorem 3.9. Let S be the set of splines of degree k with fixed knots
x0 , . . . , xn , then S is a linear space of dimension n + k.
Proof. Linearity is implied since differentiation and continuity are linear. The
notation in (3.2) implies that S has dimension at most n + k. Hence it is
sufficient to show that if s ≡ 0, then all the coefficients ci , i = 0, . . . , k, and dj
j = 1, . . . , n − 1, vanish. Considering the interval [x0 , x1 ], then s(x) is equal to Σ_{i=0}^k ci x^i on this interval and has an infinite number of zeros. Hence the
coefficients ci , i = 0, . . . , k have to be zero. To deduce dj = 0, j = 1, . . . , n − 1,
we consider each interval [xj , xj+1 ] in turn. The polynomial there has again infinitely many zeros, from which dj = 0 follows.
Definition 3.5 (Spline interpolation). Let f ∈ C[a, b] be given. The spline
interpolant to f is obtained by constructing the spline s that satisfies s(xi ) =
f (xi ) for i = 0, . . . , n.
The derivative of s is
    (s′(xi−1 ) + 2s′(xi ))/(xi − xi−1 ) + (2s′(xi ) + s′(xi+1 ))/(xi+1 − xi ) = (3s(xi ) − 3s(xi−1 ))/(xi − xi−1 )² + (3s(xi+1 ) − 3s(xi ))/(xi+1 − xi )².        (3.5)
Assume we are given the function values f (xi ), i = 0, . . . , n, and the deriva-
tives f 0 (a) and f 0 (b) at the endpoints a = x0 and b = xn . Thus we know s0 (x0 )
and s0 (xn ). We seek the cubic spline that interpolates the augmented data.
Note that now we have n + 3 conditions consistent with the dimension of the
space of cubic splines. Equation (3.5) describes a system of n − 1 equations
in the unknowns s0 (xi ), i = 1, . . . , n − 1, specified by a tridiagonal matrix S
where the diagonal elements are
    Si,i = 2/(xi − xi−1 ) + 2/(xi+1 − xi )
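A MATLAB sketch assembling and solving this system (the off-diagonal entries 1/(xi − xi−1 ) and 1/(xi+1 − xi ) and the right-hand side follow directly from (3.5); the function name is an assumption):

function sp = cubic_spline_derivatives(x, f, spa, spb)
% Solve (3.5) for the interior derivatives s'(x_1), ..., s'(x_{n-1}) of the
% interpolating cubic spline, given the data f(x_i) and the end derivatives
% spa = s'(x_0), spb = s'(x_n).  MATLAB indices are shifted by one.
n = length(x) - 1;
S = zeros(n-1, n-1);
r = zeros(n-1, 1);
for i = 1:n-1
    hl = x(i+1) - x(i);        % x_i - x_{i-1}
    hr = x(i+2) - x(i+1);      % x_{i+1} - x_i
    S(i, i) = 2/hl + 2/hr;
    r(i) = 3*(f(i+1) - f(i))/hl^2 + 3*(f(i+2) - f(i+1))/hr^2;
    if i > 1
        S(i, i-1) = 1/hl;
    else
        r(i) = r(i) - spa/hl;  % known s'(x_0) moves to the right-hand side
    end
    if i < n-1
        S(i, i+1) = 1/hr;
    else
        r(i) = r(i) - spb/hr;  % known s'(x_n) moves to the right-hand side
    end
end
sp = S \ r;                    % s'(x_1), ..., s'(x_{n-1})
end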
If the coefficients α and β are chosen as α = 1, β = 0 or α = 0, β = (√3 − 2)^n, we obtain two splines with values of the derivative at x0 , . . . , xn given by

    s′(xi ) = (√3 − 2)^i    and    s′(xi ) = (√3 − 2)^{n−i} ,    i = 0, . . . , n.
These two splines are a convenient basis of S0 . One decays rapidly across the
interval [a, b] when moving from left to right, while the other decays rapidly
when moving from right to left as can be seen in Figure 3.4. This basis implies
that, for equally spaced knots, the freedom in a cubic spline interpolant is
greatest near the end points of the interval [a, b]. Therefore it is important
to take up this freedom by imposing an extra condition at each end of the
interval as we have done by letting s0 (a) = f 0 (a) and s0 (b) = f 0 (b). The
following exercise illustrates how the solution deteriorates if this is not done.
Exercise 3.11. Let S be the set of cubic splines with knots xi = ih for
i = 0, . . . , n, where h = 1/n. An inexperienced user obtains an approximation
to a twice-differentiable function f by satisfying the conditions s0 (0) = f 0 (0),
s00 (0) = f 00 (0), and s(xi ) = f (xi ), i = 0, . . . , n. Show how the changes in the
first derivatives s′(xi ) propagate if s′(0) is increased by a small perturbation ε, i.e., s′(0) = f ′(0) + ε, but the remaining data remain the same.
Another possibility to take up this freedom is known as the not-a-knot
technique. Here we require the third derivative s000 to be continuous at x1
and xn−1 . It is called not-a-knot since there is no break between the two
polynomial pieces at these points. Hence you can think of these knots as not
being knots at all.
Definition 3.6 (Lagrange form of spline interpolation). For j = 0, . . . , n, let sj be an element of S that satisfies

    sj (xi ) = δij ,    i = 0, . . . , n,        (3.7)

where δij is the Kronecker delta (i.e., δjj = 1 and δij = 0 for i ≠ j). Figure
3.5 shows an example of a Lagrange cubic spline. Note that the splines sj are
not unique, since any element from S0 can be added. Any spline interpolant
to the data f0 , . . . , fn can then be written as
    s(x) = Σ_{j=0}^n fj sj (x) + ŝ(x),
Theorem 3.10. Let S be the space of cubic splines with equally spaced knots
a = x0 < x1 < . . . < xn = b where xi − xi−1 = h for i = 1, . . . , n. Then for
each integer j = 0, . . . , n there is a cubic spline sj that satisfies the Lagrange
conditions (3.7) and that has the first derivative values
    s′j (xi ) = { −(3/h)(√3 − 2)^{j−i} ,    i = 0, . . . , j − 1,
               { 0,                          i = j,
               { (3/h)(√3 − 2)^{i−j} ,      i = j + 1, . . . , n.
It is easily verified that the values for s0j (xi ) given in the theorem satisfy these.
For example, considering the last two equations, we have
    s′j (xj−2 ) + 4s′j (xj−1 ) + s′j (xj ) = −(3/h)(√3 − 2)² − 4(3/h)(√3 − 2)
                                          = −(3/h)(7 − 4√3 + 4√3 − 8) = 3/h

and

    s′j (xj ) + 4s′j (xj+1 ) + s′j (xj+2 ) = 4(3/h)(√3 − 2) + (3/h)(√3 − 2)²
                                          = (3/h)(−8 + 4√3 − 4√3 + 7) = −3/h.

Since √3 − 2 ≈ −0.268, it can be deduced that sj (x) decays rapidly as
|x − xj | increases. Thus the contribution of fj to the cubic spline interpolant
s decays rapidly as |x − xj | increases. Hence for x ∈ [a, b], the value of s(x) is
determined mainly by the data fj for which |x − xj | is relatively small.
However, the Lagrange functions of quadratic spline interpolation do not
enjoy these decay properties if the knots coincide with the interpolation points.
Generally, on the interval [xi , xi+1 ], the quadratic polynomial piece can be
derived from the fact that the derivative is a linear polynomial and thus given
by
    s′(x) = ((s′(xi+1 ) − s′(xi ))/(xi+1 − xi )) (x − xi ) + s′(xi ).

Integrating over x and using s(xi ) = fi we get

    s(x) = ((s′(xi+1 ) − s′(xi ))/(2(xi+1 − xi ))) (x − xi )² + s′(xi )(x − xi ) + fi .
Requiring s(xi+1 ) = fi+1 then gives

    s′(xi+1 ) = 2 (fi+1 − fi )/(xi+1 − xi ) − s′(xi ).
3.8 B-Spline
In this section we generalize the concept of splines.
This being identically 0 means that there are infinitely many zeros, and this in turn means that the coefficients of x^{k−i} have to vanish for all i = 0, . . . , k.
Therefore

    Σ_{j=p}^{p+k+1} λj xj^i = 0 ,    i = 0, . . . , k.
A solution to this problem exists, since k + 2 coefficients have to satisfy only
k + 1 conditions. The matrix describing the system of equations is
          ( 1         1          · · ·    1           )
     A =  ( xp        xp+1       · · ·    xp+k+1      )
          ( ⋮         ⋮                   ⋮           )
          ( xp^k      xp+1^k     · · ·    xp+k+1^k    ) .
If λp+k+1 is zero, then the system reduces to a (k + 1) × (k + 1) matrix where
the last column of A is removed. This however is a Vandermonde matrix ,
which is non-singular, since all xi , i = p . . . , p + k + 1 are distinct. It follows
then that all the other coefficients λp , . . . , λp+k are also zero.
Therefore λp+k+1 has to be nonzero. This can be chosen and the system
can be solved uniquely for the remaining k+1 coefficients, since A with the last
column removed is nonsingular. Therefore, apart from scaling by a constant,
the B-spline of degree k with knots xp < xp+1 < . . . < xp+k+1 that vanishes
outside (xp , xp+k+1 ) is unique.
Theorem 3.11. Apart from scaling, the coefficients λj , j = p, . . . , p + k + 1,
in (3.8) are given by
    λj = ( ∏_{i=p, i≠j}^{p+k+1} (xj − xi ) )^{−1} .
Proof. The function Bp^k (x) is a polynomial of degree at most k for x ≥ xp+k+1 :

    Bp^k (x) = Σ_{j=p}^{p+k+1} ( ∏_{i=p, i≠j}^{p+k+1} (xj − xi ) )^{−1} Σ_{l=0}^k ( k choose l ) x^{k−l} (−xj )^l
            = Σ_{l=0}^k (−1)^l ( k choose l ) [ Σ_{j=p}^{p+k+1} ( ∏_{i=p, i≠j}^{p+k+1} (xj − xi ) )^{−1} xj^l ] x^{k−l} .
The right-hand side is a factor of the coefficient of xk−l in Bpk (x) for x ≥
xp+k+1 . It follows that the coefficient of xk−l in Bpk (x), x ≥ xp+k+1 , is zero
for l = 0, 1, . . . , k. Thus Bpk (x) vanishes for x ≥ xp+k+1 and is the required
B-spline.
The advantage of B-splines is that the nonzero part is confined to an
interval which contains only k + 2 consecutive knots. This is also known as
the spline having finite support.
As an example we consider the (n + 1)-dimensional space of linear splines
with the usual knots a = x0 < . . . < xn = b. We introduce extra knots x−1 and
xn+1 outside the interval and we let B_{j−1}^1 be the linear spline which satisfies the conditions B_{j−1}^1 (xi ) = δij , i = 0, . . . , n, where δij denotes the Kronecker delta. Then every s in the space of linear splines can be written as

    s(x) = Σ_{j=0}^n s(xj ) B_{j−1}^1 (x) ,    a ≤ x ≤ b.
These basis functions are often called hat functions because of their shape, as
Figure 3.7 shows.
For general k we can generate a set of B-splines for the (n+k)- dimensional
space S of splines of degree k with the knots a = x0 < x1 < . . . < xn = b. We
add k additional knots both to the left of a and to the right of b. Thus the
full list of knots is x−k < x−k+1 < . . . < xn+k . We let Bpk be the function as
defined in (3.9) for p = −k, −k+1, . . . , n−1, where we restrict the range of x to
the interval [a, b] instead of x ∈ R. Therefore for each p = −k, −k +1, . . . , n−1
the function Bpk (x), a ≤ x ≤ b, lies in S and vanishes outside the interval
(xp , xp+k+1 ). Figure 3.8 shows these splines for k = 3.
Theorem 3.12. The B-splines Bpk , p = −k, −k + 1, . . . , n − 1, form a basis
of S.
Proof. The number of B-splines is n + k and this equals the dimension of
S. To show that the B-splines form a basis, it is sufficient to show that a
nontrivial linear combination of them cannot vanish identically in the interval
[a, b]. Assume otherwise and let
    s(x) = Σ_{j=−k}^{n−1} sj Bj^k (x) ,    x ∈ R,
where s(x) ≡ 0 for a ≤ x ≤ b. We know that s(x) also has to be zero for
The advantage of this form is that Bjk (ξi ) is nonzero if and only if ξi is in
the interval (xj , xj+k+1 ). Therefore there are at most k + 1 nonzero elements
in each row of the matrix. Thus the matrix is sparse.
The explicit form of a B-spline given in (3.9) is not very suitable for eval-
uation if x is close to xp+k+1 , since all (·)+ terms will be nonzero. However, a
different representation of B-splines can be used for evaluation purposes. The
following definition is motivated by formula (3.9) for k = 0.
Definition 3.8. The B-spline of degree 0, Bp^0 , is defined as

    Bp^0 (x) = { 0,                          x < xp or x > xp+1 ,
               { (xp+1 − xp )^{−1} ,         xp < x < xp+1 ,
               { (1/2)(xp+1 − xp )^{−1} ,    x = xp or x = xp+1 .
Theorem 3.13 (B-spline recurrence relation). For k ≥ 1 the B-splines satisfy
the following recurrence relation
    Bp^k (x) = [ (x − xp ) Bp^{k−1}(x) + (xp+k+1 − x) B_{p+1}^{k−1}(x) ] / (xp+k+1 − xp ) ,    x ∈ R.        (3.10)
Proof. In the proof one needs to show that the resulting function has the
required properties to be a B-spline and this is left to the reader.
Let s ∈ S be a linear combination of B-splines. If s needs to be evaluated
for x ∈ [a, b], we pick j between 0 and n − 1 such that x ∈ [xj , xj+1 ]. It is only
necessary to calculate Bpk (x) for p = j − k, j − k + 1, . . . , j, since the other
B-splines of order k vanish on this interval. The calculation starts with the
one nonzero value of Bp0 (x), p = 0, 1, . . . , n. (It could be two nonzero values if
x happens to coincide with a knot.) Then for l = 1, 2, . . . , k the values Bpl (x)
for p = j − l, j − l + 1, . . . , j are generated by the recurrence relation given
in (3.10), keeping in mind that B_{j−l−1}^l (x) and B_{j+1}^l (x) are zero. Note that as
the order of the B-splines increases, the number of B-splines not zero on the
interval [xj , xj+1 ] increases.
Schematically, the calculation proceeds level by level:

    Bj^0 (x)  →  B_{j−1}^1 (x), Bj^1 (x)  →  B_{j−2}^2 (x), B_{j−1}^2 (x), Bj^2 (x)  →  · · ·  →  B_{j−k}^k (x), . . . , Bj^k (x).
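A MATLAB sketch of this evaluation (the function name is an assumption): given the knot vector t, the degree k and an index j with x in [t(j), t(j+1)] and at least k knots available on either side, it returns the values of the k + 1 B-splines that are nonzero there, starting from Definition 3.8 and using the recurrence (3.10).

function b = bspline_values(t, k, j, x)
% b(i) holds B_{j-l+i-1}^l(x) at level l; on return l = k.
b = 1 / (t(j+1) - t(j));              % the single nonzero B_j^0(x)
for l = 1:k
    bnew = zeros(1, l+1);             % will hold B_{j-l}^l(x), ..., B_j^l(x)
    for i = 1:l+1
        p = j - l + (i-1);            % index of B_p^l(x)
        left = 0; right = 0;
        if i > 1                      % B_p^{l-1}(x) is nonzero only if p >= j-l+1
            left = (x - t(p)) * b(i-1);
        end
        if i <= l                     % B_{p+1}^{l-1}(x) is nonzero only if p+1 <= j
            right = (t(p+l+1) - x) * b(i);
        end
        bnew(i) = (left + right) / (t(p+l+1) - t(p));   % recurrence (3.10)
    end
    b = bnew;
end
end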
The orthogonal polynomials are the Chebyshev polynomials of the first kind given by Qk (x) = cos(k arccos x), k ≥ 0. Using the substitution x = cos θ, calculate the inner products ⟨Qk , Qk⟩ for k ≥ 0. (Hint: 2 cos² x = 1 + cos 2x.)

(c) For the inner product given above and the Chebyshev polynomials, calculate the inner products ⟨Qk , f⟩ for k ≥ 0, k ≠ 1, where f is given by f (x) = (1 − x²)^{1/2}. (Hint: cos x sin y = (1/2)[sin(x + y) − sin(x − y)].)
Exercise 3.14. (a) Given a set of real values f0 , f1 , . . . , fn at real data points
x0 , x1 , . . . , xn , give a formula for the Lagrange cardinal polynomials and
state their properties. Write the polynomial interpolant in the Lagrange
form.
(b) Define the divided difference of degree n, f [x0 , x1 , . . . , xn ], and give a for-
mula for it derived from the Lagrange form of the interpolant. What is the
divided difference of degree zero?
(c) State the recursive formula for the divided differences and prove it.
(d) State the Newton form of polynomial interpolation and show how it can
be evaluated in just O(n) operations.
(e) Let x0 = 4, x1 = 6, x2 = 8, and x3 = 10 with data values f0 = 1, f1 =
3, f2 = 8, and f3 = 20. Give the Lagrange form of the polynomial inter-
polant.
(d) Using

    s(x) = s(xi−1 ) + s′(xi−1 )(x − xi−1 ) + c2 (x − xi−1 )² + c3 (x − xi−1 )³

on the interval [xi−1 , xi ], where

    c2 = 3 (s(xi ) − s(xi−1 ))/(xi − xi−1 )² − (2s′(xi−1 ) + s′(xi ))/(xi − xi−1 ),
    c3 = 2 (s(xi−1 ) − s(xi ))/(xi − xi−1 )³ + (s′(xi−1 ) + s′(xi ))/(xi − xi−1 )²,

deduce that s can be nonzero with a bounded first derivative.
(e) Further show that such an s is bounded, if µ is at most (1/2)(3 + √5).
Exercise 3.17. (a) Given a set of real values f0 , f1 , . . . , fn at real data points
x0 , x1 , . . . , xn , give a formula for the Lagrange cardinal polynomials and
state their properties. Write the polynomial interpolant in the Lagrange
form.
(b) How many operations are necessary to evaluate the polynomial interpolant
in the Lagrange form at x?
(c) Prove that the polynomial interpolant is unique.
(d) Using the Lagrange form of interpolation, compute the polynomial p(x)
that interpolates the data x0 = 0, x1 = 1, x2 = 2 and f0 = 1, f1 = 2,
f2 = 3. What is the degree of p(x)?
(e) What is a divided difference and a divided difference table and for which
form of interpolant is it used? Give the formula for the interpolant, how
many operations are necessary to evaluate the polynomial in this form?
(f ) Prove the relation used in a divided difference table.
(g) Write down the divided difference table for the interpolation problem given
in (d). How does it change with the additional data f3 = 5 at x3 = 3?
CHAPTER 4
Non-Linear Systems
illustrated in Figure 4.1. Loss of significance is unlikely since f (a) and f (b)
have opposite signs. This method is sometimes called regula falsi or rule of
false position, since we take a guess as to the position of the root.
At first glance it seems superior to the bisection method, since an approx-
imation to the root is used. However, asymptotically one of the end-points
will converge to the root, while the other one remains fixed. Thus only one
end-point of the interval gets updated. We illustrate this behaviour with the
function
f (x) = 2x3 − 4x2 + 3x,
which has a root for x = 0. We start with the initial interval [−1, 1]. The
left end-point −1 is never replaced while the right end-point moves towards
zero. Thus the length of the interval is always at least 1. In the first iteration
the right end-point becomes 3/4 and in the next iteration it is 159/233. It
converges to zero at a linear rate similar to the bisection method.
The value m is a weighted average of the function values. The method can
be modified by adjusting the weights of the function values in the case where
the same endpoint is retained twice in a row. The value m is then for example
calculated according to

    m = ( (1/2) f (b) a − f (a) b ) / ( (1/2) f (b) − f (a) )    or    m = ( f (b) a − (1/2) f (a) b ) / ( f (b) − (1/2) f (a) ).
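A MATLAB sketch of the modified false position method (a hypothetical routine; the stopping rule is an assumption): whenever the same end-point is retained twice in a row, its function value is halved when computing m, which is exactly the weighting above.

function m = false_position(f, a, b, maxit, tol)
% Modified regula falsi: keep a bracket [a,b] with f(a)*f(b) < 0 and
% replace one end-point by the weighted average m in every iteration.
fa = f(a); fb = f(b);
side = 0;                             % which end-point was kept last time
for k = 1:maxit
    m = (fb*a - fa*b) / (fb - fa);
    fm = f(m);
    if abs(fm) < tol
        return;
    end
    if fa*fm < 0                      % root lies in [a, m]: replace b
        b = m; fb = fm;
        if side == -1, fa = fa/2; end % a retained twice in a row
        side = -1;
    else                              % root lies in [m, b]: replace a
        a = m; fa = fm;
        if side == 1, fb = fb/2; end  % b retained twice in a row
        side = 1;
    end
end
end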
side of the root. Initializing x(0) = a and x(1) = b the method calculates

    x(n+1) = x(n) − f (x(n) ) (x(n) − x(n−1) ) / (f (x(n) ) − f (x(n−1) )).

This is the intersection of the secant through the points (x(n−1) , f (x(n−1) )) and (x(n) , f (x(n) )) with the x-axis. There is now the possibility of loss of significance,
since f (x(n−1) ) and f (x(n) ) can have the same sign. Hence the denominator
can become very small leading to large values, possibly leading away from the
root.
Also, convergence may be lost, since the root is not bracketed by an interval
anymore. If, for example, f is differentiable and the derivative vanishes at some
point in the initial interval, the algorithm may not converge, since the secant
can become close to horizontal and the intersection point will be far away
from the root.
To analyze the properties of the secant method further, in particular to
find the order of convergence, we assume that f is twice differentiable and
that the derivative of f is bounded away from zero in a neighbourhood of the
root. Let x∗ be the root and let e(n) denote the error e(n) = x(n) − x∗ at the
nth iteration. We can then express the error at the (n + 1)th iteration in terms
of the error in the previous two iterations:
    e(n+1) = e(n) − f (x∗ + e(n) ) (e(n) − e(n−1) ) / (f (x∗ + e(n) ) − f (x∗ + e(n−1) ))
           = (f (x∗ + e(n) ) e(n−1) − f (x∗ + e(n−1) ) e(n) ) / (f (x∗ + e(n) ) − f (x∗ + e(n−1) ))
           = e(n) e(n−1) ( f (x∗ + e(n) )/e(n) − f (x∗ + e(n−1) )/e(n−1) ) / (f (x∗ + e(n) ) − f (x∗ + e(n−1) )).
Taylor expansion about x∗ , using f (x∗ ) = 0, gives

    f (x∗ + e(n) )/e(n) = f ′(x∗ ) + (1/2) f ″(x∗ ) e(n) + O([e(n) ]²).
Doing the same for f (x∗ + e(n−1) )/e(n−1) and assuming that x(n) and x(n−1)
are close enough to the root such that the terms O([e(n) ]2 ) and O([e(n−1) ]2 )
can be neglected, the expression for e(n+1) becomes
    e(n+1) ≈ (1/2) e(n) e(n−1) f ″(x∗ ) (e(n) − e(n−1) ) / (f (x∗ + e(n) ) − f (x∗ + e(n−1) ))
           ≈ e(n) e(n−1) f ″(x∗ ) / (2f ′(x∗ )),
where we again used the Taylor expansion in the last step. Letting C = |f ″(x∗ )/(2f ′(x∗ ))|, the modulus of the error in the next iteration is then approximately C |e(n) | |e(n−1) |.
    |e(n+1) | / |e(n) |^p = c = O(1).
    x(n+1) = x(n) − f (x(n) ) / f ′(x(n) ).        (4.2)
Geometrically the secant through two points of a curve becomes the tan-
gent to the curve in the limit of the points coinciding. The tangent to the
curve f at the point (x(n) , f (x(n) )) has the equation

    y = f (x(n) ) + f ′(x(n) )(x − x(n) ).

The point x(n+1) is the point of intersection of this tangent with the x-axis.
This is illustrated in Figure 4.2.
Let x∗ be the root. The Taylor expansion of f (x∗ ) about x(n) is
    f (x∗ ) = f (x(n) ) + f ′(x(n) )(x∗ − x(n) ) + (1/2!) f ″(ξ (n) )(x∗ − x(n) )²,
where ξ (n) lies between x(n) and x∗ . Since x∗ is the root, this equates to zero:
    f (x(n) ) + f ′(x(n) )(x∗ − x(n) ) + (1/2!) f ″(ξ (n) )(x∗ − x(n) )² = 0.
Let’s assume that f 0 is bounded away from zero in a neighbourhood of x∗
and x(n) lies in this neighbourhood. We can then divide the above equation
by f 0 (x(n) ). After rearranging, this becomes
    f (x(n) )/f ′(x(n) ) + (x∗ − x(n) ) = −(1/2) (f ″(ξ (n) )/f ′(x(n) )) (x∗ − x(n) )².
Using (4.2) we can relate the error in the next iteration to the error in the
current iteration
    x∗ − x(n+1) = −(1/2) (f ″(ξ (n) )/f ′(x(n) )) (x∗ − x(n) )².
This shows that under certain conditions the convergence of Newton’s method
is quadratic. The conditions are that there exists a neighbourhood U of the
root where f 0 is bounded away from zero and where f 00 is finite and that the
starting point lies sufficiently close to x∗ . More specifically, let
    C = sup_{x∈U} |f ″(x)| / inf_{x∈U} |f ′(x)| .
Here the notation supx∈U |f 00 (x)| is the smallest upper bound of the values
|f 00 | takes in U , while inf x∈U |f 0 (x)| stands for the largest lower bound of the
values |f 0 | takes in U . Then
    |x∗ − x(n+1) | ≤ (1/2) C (x∗ − x(n) )².

Here we can see that sufficiently close to x∗ means that (1/2) C |x∗ − x(0) | < 1, otherwise the distance to the root may not decrease.
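A minimal MATLAB sketch of Newton's method (the function name, iteration cap and stopping test are assumptions):

function x = newton(f, fprime, x, maxit, tol)
% Newton's method: x_{n+1} = x_n - f(x_n)/f'(x_n).
for n = 1:maxit
    fx = f(x);
    if abs(fx) < tol
        return;
    end
    x = x - fx / fprime(x);    % fails if fprime(x) vanishes
end
end

For example, newton(@(x) x.^2 - 2, @(x) 2*x, 1, 20, 1e-12) converges to √2 in a handful of iterations.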
We take a closer look at the situations where the method fails. In some
cases the function in question satisfies the conditions for convergence, but the
point chosen as starting point is not sufficiently close to the root. In this case
other methods such as bisection are used to obtain a better starting point.
The method fails if any of the iteration points happens to be a stationary
point, i.e., a point where the first derivative vanishes. In this case the next
iteration step is undefined, since the tangent there will be parallel to the x-
axis and not intersect it. Even if the derivative is nonzero, but small, the next
approximation may be far away from the root.
For some functions it can happen that the iteration points enter an infinite
cycle. Take for example the polynomial f (x) = x3 −2x+2. If 0 is chosen as the
starting point, the first iteration produces 1, while the next iteration produces
0 again and so forth. The behaviour of the sequence produced by Newton’s
method is illustrated by Newton fractals in the complex plane C. If z ∈ C
is chosen as a starting point and if the sequence produced converges to a
specific root then z is associated with this root. The set of initial points z
that converge to the same root is called the basin of attraction for that root. The fractals illustrate beautifully that a slight perturbation of the starting value can result in the algorithm converging to a different root.
Exercise 4.1. Write a program which takes a polynomial of degree between
2 and 7 as input and colours the basins of attraction for each root a different
colour. Try it out for the polynomial z n − 1 for n = 2, . . . , 7.
A simple example where Newton's method diverges is the cube root f (x) = ∛x = x^{1/3}, which is continuous and infinitely differentiable except for the root x = 0, where the derivative is undefined. For any approximation x(n) ≠ 0 the next approximation will be

    x(n+1) = x(n) − x(n)^{1/3} / ((1/3) x(n)^{1/3−1}) = x(n) − 3x(n) = −2x(n) .
In every iteration the algorithm overshoots the solution onto the other side
further away than it was initially. The distance to the solution is doubled in
each iteration.
There are also cases where Newton’s method converges, but the rate of
convergence is not quadratic. For example take f (x) = x2 . Then for every
approximation x(n) the next approximation is
    x(n+1) = x(n) − x(n)² / (2x(n) ) = (1/2) x(n) .
Thus the distance to the root is halved in every iteration comparable to the
bisection method.
Newton’s method readily generalizes to higher dimensional problems.
Given a function f : Rm → Rm , we consider f as a vector of m functions
    f (x) = ( f1 (x), . . . , fm (x) )^T ,
    f ′(x(n) ) ≈ (f (x(n) ) − f (x(n−1) )) / (x(n) − x(n−1) ).
Rearranging gives
In Broyden’s method the Jacobian is only calculated once in the first iteration
as A(0) = Jf (x(0) ). In all the subsequent iterations the matrix is updated in
such a way that it satisfies (4.4). That is, if the matrix multiplies x(n) −x(n−1) ,
the result is f (x(n) ) − f (x(n−1) ). This determines how the matrix acts on the
one-dimensional subspace of Rm spanned by the vector x(n) −x(n−1) . However,
this does not determine how the matrix acts on the (m − 1)-dimensional
complementary subspace. Or in other words (4.4) provides only m equations
to specify an m×m matrix. The remaining degrees of freedom are taken up by
letting A(n) be a minimal modification to A(n−1) , minimal in the sense that
A(n) acts the same as A(n−1) on all vectors orthogonal to x(n) − x(n−1) . These
vectors are the (m−1)-dimensional complementary subspace. The matrix A(n)
is then given by
    A(n) = A(n−1) + [ f (x(n) ) − f (x(n−1) ) − A(n−1) (x(n) − x(n−1) ) ] (x(n) − x(n−1) )^T / [ (x(n) − x(n−1) )^T (x(n) − x(n−1) ) ] .
If A(n) is applied to x(n) − x(n−1) , most terms vanish and the desired result
f (x(n) )−f (x(n−1) ) remains. The second term vanishes whenever A(n) is applied
to a vector v orthogonal to x(n) −x(n−1) , since in this case (x(n) −x(n−1) )T v =
0, and A(n) acts on these vectors exactly as A(n−1) .
The next approximation is then given by
x(n+1) = x(n) − (A(n) )−1 f (x(n) ).
Just as in Newton’s method, we do not calculate the inverse directly. Instead
we solve
A(n) h = −f (x(n) )
for some perturbation h ∈ Rm and let x(n+1) = x(n) + h.
However, if the inverse of the initial Jacobian A(0) has been calculated the
inverse can be updated in only O(m2 ) operations using the Sherman–Morrison
formula, which states that for a non-singular matrix A and vectors u and v
such that v^T A^{−1} u ≠ −1 we have

    (A + uv^T )^{−1} = A^{−1} − A^{−1} u v^T A^{−1} / (1 + v^T A^{−1} u).
Letting

    A = A(n−1) ,
    u = [ f (x(n) ) − f (x(n−1) ) − A(n−1) (x(n) − x(n−1) ) ] / [ (x(n) − x(n−1) )^T (x(n) − x(n−1) ) ] ,
    v = x(n) − x(n−1) ,
we have a fast update formula.
Broyden’s method is not as fast as the quadratic convergence of Newton’s
method. But the smaller operation count per iteration is often worth it. In
the next section we return to the one-dimensional case.
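A sketch of Broyden's method in MATLAB (function names are assumptions; jac supplies the initial Jacobian); the inverse of the approximation A(n) is kept and updated with the Sherman–Morrison formula, assuming the denominator 1 + v^T A^{−1} u never vanishes:

function x = broyden(f, jac, x, maxit, tol)
% Broyden's method: the Jacobian is computed only once and the inverse
% of the approximation A(n) is updated in O(m^2) operations per iteration.
Ainv = inv(jac(x));                   % inverse of the initial Jacobian A(0)
fx = f(x);
for n = 1:maxit
    h = -Ainv * fx;                   % solve A(n-1) h = -f(x(n-1))
    x = x + h;
    fnew = f(x);
    if norm(fnew) < tol
        return;
    end
    s = h;                            % s = x(n) - x(n-1)
    % Since A(n-1)*s = -fx, the update vector simplifies to
    %   u = (fnew - fx - A(n-1)*s) / (s'*s) = fnew / (s'*s),   v = s.
    u = fnew / (s'*s);
    Ainv = Ainv - (Ainv*u)*(s'*Ainv) / (1 + s'*(Ainv*u));   % Sherman-Morrison
    fx = fnew;
end
end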
    g(x) = f (x) / √|f ′(x)| ,

to which Newton's method is applied, i.e.,

    x(n+1) = x(n) − g(x(n) ) / g ′(x(n) ).

Carrying out the differentiation leads to

    x(n+1) = x(n) − 2 f (x(n) ) f ′(x(n) ) / ( 2[f ′(x(n) )]² − f (x(n) ) f ″(x(n) ) ) .
The formula can be rearranged to show the similarity between Halley’s method
and Newton’s method
    x(n+1) = x(n) − (f (x(n) )/f ′(x(n) )) [ 1 − (f (x(n) )/f ′(x(n) )) · (f ″(x(n) )/(2f ′(x(n) ))) ]^{−1} .
We see that when the second derivative is close to zero near x∗ then the itera-
tion is nearly the same as Newton’s method. The expression f (x(n) )/f 0 (x(n) )
is only calculated once. This form is particularly useful when f 00 (x(n) )/f 0 (x(n) )
can be simplified.
The technique is also known as Bailey’s method when written in the fol-
lowing form:
    x(n+1) = x(n) − f (x(n) ) [ f ′(x(n) ) − f (x(n) ) f ″(x(n) ) / (2f ′(x(n) )) ]^{−1} .
    b = { (x(n−2) − x(n) )² [f (x(n−1) ) − f (x(n) )] − (x(n−1) − x(n) )² [f (x(n−2) ) − f (x(n) )] } / { (x(n−2) − x(n) )(x(n−1) − x(n) )(x(n−2) − x(n−1) ) }
The next approximation x(n+1) is one of the roots of p and the one closer to
x(n) is chosen. To avoid errors due to loss of significance we use the alternative
formula for the roots derived in Section 1.4,

    x(n+1) − x(n) = −2c / (b + sgn(b) √(b² − 4ac)) ,        (4.5)
where sgn(b) denotes the sign of b. This way the root which gives the largest
denominator and thus is closest to x(n) is chosen.
Note that x(n+1) can be complex even if all previous approximations have
been real. This is in contrast to previous root-finding methods where the
iterates remain real if the starting value is real. This behaviour can be an
advantage, if complex roots are to be found or a disadvantage if the roots are
known to be real.
An alternative representation uses the Newton form of the interpolating
polynomial
where f [x(n) , x(n−1) ] and f [x(n) , x(n−1) , x(n−2) ] denote divided differences. Af-
ter some manipulation using the recurrence relation for divided differences, one
can see that
since x(1) − x(0) = g(x(0) ) − x(0) . Thus for each new iterate
    |x(n+1) − x(0) | ≤ Σ_{k=0}^n |x(k+1) − x(k) | ≤ Σ_{k=0}^n λ^k (1 − λ)δ ≤ (1 − λ^{n+1} )δ ≤ δ.
Hence all iterates lie in the interval I. This also means that the sequence of
iterates is bounded. Moreover,
    |x(n+p) − x(n) | ≤ Σ_{k=n}^{n+p−1} |x(k+1) − x(k) | ≤ Σ_{k=n}^{n+p−1} λ^k (1 − λ)δ ≤ (1 − λ^p ) λ^n δ → 0    as n → ∞.
Since λ < 1, the sequence is a Cauchy sequence and hence converges.
Suppose the solution is not unique, i.e., there exist x∗ ≠ x̃∗ such that x∗ = g(x∗ ) and x̃∗ = g(x̃∗ ). Then
where b−1 is taken to be a0 . The first value m is the midpoint of the interval,
while the second value s is the approximation to the root given by the secant
method. If s lies between bn and m, it becomes the next iterate, that is bn+1 =
s, otherwise the midpoint is chosen, bn+1 = m. If f (an ) and f (bn+1 ) have
opposite signs, the new contrapoint is an+1 = an , otherwise an+1 = bn , since
f (bn ) and f (bn+1 ) must have opposite signs in this case, since f (an ) and
f (bn ) had opposite signs in the previous iteration. Additionally, if the modulus
of f (an+1 ) is less than the modulus of f (bn+1 ), an+1 is considered a better
approximation to the root and it becomes the new iterate while bn+1 becomes
the new contrapoint. Thus the iterate is always the better approximation.
This method performs generally well, but there are situations where every
iteration employs the secant method and convergence is very slow, requiring
far more iterations than the bisection method. In particular, bn − bn−1 might
become arbitrarily small while the length of the interval given by an and bn
decreases very slowly. The following method tries to alleviate this problem.
Brent’s method combines the bisection method, the secant method, and
inverse quadratic interpolation. It is also known as the Wijngaarden–Dekker–
Brent method . At every iteration, Brent’s method decides which method out
of these three is likely to do best, and proceeds by doing a step according to
that method. This gives a robust and fast method.
A numerical tolerance ε is chosen. The method ensures that the bisection method is used if consecutive iterates are too close together. More specifically, if the previous step performed the bisection method and |b_n − b_{n−1}| ≤ ε, then the bisection method will be performed again. Similarly, if the previous step performed interpolation (either linear interpolation for the secant method or inverse quadratic interpolation) and |b_{n−1} − b_{n−2}| ≤ ε, then the bisection method will be performed again. Thus b_n and b_{n−1} are allowed to become arbitrarily close at most two times in a row.
Additionally, the intersection s from interpolation (either linear or inverse quadratic) is only accepted as new iterate if |s − b_n| < ½|b_n − b_{n−1}| if the previous step used bisection, or if |s − b_n| < ½|b_{n−1} − b_{n−2}| if the previous step used interpolation (linear or inverse quadratic). These conditions enforce that consecutive interpolation steps halve the step size every two iterations until the step size becomes less than ε after at most 2 log₂(|b_{n−1} − b_{n−2}|/ε) iterations, which invokes a bisection.
Brent’s algorithm uses linear interpolation, that is, the secant method,
if any of f (bn ), f (an ), or f (bn−1 ) coincide. If they are all distinct, inverse
quadratic interpolation is used. However, the requirement for s to lie between
m and bn is changed: s has to lie between (3an + bn )/4 and bn .
Exercise 4.3. Implement Brent’s algorithm. It should terminate if either
f (bn ) or f (s) is zero or if |bn − an | is small enough. Use the bisection rule if
s is not between (3an + bn )/4 and bn for both linear and inverse quadratic
interpolation or if any of Brent’s conditions arises. Try your program on
f (x) = x3 − x2 − 4x + 4, which has zeros at −2, 1, and 2. Start with the
interval [−4, 2.5], which contains all roots. List which method is used in each
iteration.
(b) Let x∗ be the root and let e(n) denote the error e(n) = x(n) − x∗ at the
nth iteration. Express the error at the (n + 1)th iteration in terms of the
errors in the previous two iterations.
(c) Approximate f (x∗ + e(n−1) )/e(n−1) , f (x∗ + e(n) )/e(n) , and f (x∗ + e(n) ) −
f (x∗ + e(n−1) ) using Taylor expansion. You may assume that x(n) and
x(n−1) are close enough to the root such that the terms O([e(n) ]2 ) and
O([e(n−1) ]2 ) can be neglected.
(d) Using the derived approximation and the expression derived for e(n+1) ,
show that the error at (n + 1)th iteration is approximately
f 00 (x∗ )
e(n+1) ≈ e(n) e(n−1) .
2f 0 (x∗ )
(e) From |e(n+1) | = O(|e(n) ||e(n−1) |) derive p such that |e(n+1) | = O(|e(n) |p ).
(f ) Derive the Newton method from the secant method.
(g) Let f(x) = x². Letting x^{(1)} = ½x^{(0)}, for both the secant and the Newton method express x^{(2)}, x^{(3)}, and x^{(4)} in terms of x^{(0)}.
Exercise 4.5. Newton’s method for finding the solution of f (x) = 0 is given
by
x^{(n+1)} = x^{(n)} - \frac{f(x^{(n)})}{f'(x^{(n)})},
where x(n) is the approximation to the root x∗ in the nth iteration. The starting
point x(0) is already close enough to the root.
(a) By means of a sketch graph describe how the method works in a simple
case and give an example where it might fail to converge.
(b) Using the Taylor expansion of f (x∗ ) = 0 about x(n) , relate the error in
the next iteration to the error in the current iteration and show that the
convergence of Newton’s method is quadratic.
(c) Generalize Newton’s method to higher dimensions.
(d) Let
f(x) = f(x, y) = \begin{pmatrix} \tfrac{1}{2}x^2 + y \\ \tfrac{1}{2}y^2 + x \end{pmatrix}.
The roots lie at (0, 0) and (−2, −2). Calculate the Jacobian of f and its inverse.
(e) Why does Newton’s method fail near (1, 1) and (−1, −1)?
(f ) Let x(0) = (1, 0). Calculate x(1) , x(2) and x(3) , and their Euclidean norms.
(g) The approximations converge to (0, 0). Show that the speed of convergence
agrees with the theoretical quadratic speed of convergence.
Sketch the graph of f (x) and sketch the first iteration for cases (i) and
(ii) to show why (i) converges faster than (ii).
(e) In a separate (rough) sketch, show the first two iterations for case (iii).
(f ) Now consider f (x) = x4 − 3x2 − 2. Calculate two Newton–Raphson it-
erations from the starting value x = 1. Comment on the prospects for
convergence in this case.
(g) Give further examples where the method might fail to converge or con-
verges very slowly.
Exercise 4.7. The following reaction occurs when water vapor is heated:
H₂O ⇌ H₂ + ½ O₂.
CHAPTER 5
Numerical Integration
The points xi , i = 1, . . . , n are called the abscissae chosen such that a ≤ x1 <
. . . < xn ≤ b. The coefficients wi are called the weights. Quadrature rules are
derived by integrating a polynomial interpolating the function values at the
abscissae. Usually only positive weights are allowed since whether something
is added or subtracted should be determined by the sign of the function at
this point. This also avoids loss of significance.
Proof. We use Taylor expansion of f around the midpoint (a + b)/2.
Theorem 5.2. The trapezium rule has the following error term
\int_a^b f(x)\,dx = \frac{b-a}{2}\,[f(a) + f(b)] - \frac{1}{12} f''(\xi)(b-a)^3,
where ξ is some point in the interval [a, b].
Proof. We use Taylor expansion of f(a) around the point x,
f(a) = f(x) + (a - x)f'(x) + \tfrac{1}{2}(a - x)^2 f''(\xi),
is the difference between the value of the integral and the value given by the
quadrature,
L(f) = \int_a^b f(x)\,dx - \sum_{i=1}^{n} w_i f(x_i).
Thus L maps the space of functions C^{k+1}[a, b] to ℝ. L is a linear functional, i.e., L(αf + βg) = αL(f) + βL(g) for all α, β ∈ ℝ, since integration and weighted summation are linear operations themselves. We assume that the quadrature is constructed in such a way that it is correct for all polynomials of degree at most k, i.e., L(p) = 0 for all p ∈ P_k[x]. Recall
Definition 5.1 (Peano kernel). The Peano kernel K of L is the function
defined by
K(θ) := L[(x − θ)_+^k] for θ ∈ [a, b].
and
Theorem 5.3 (Peano kernel theorem). Let L be a linear functional such that
L(p) = 0 for all p ∈ Pk [x]. Provided that the exchange of L with the integration
is valid, then for f ∈ C k+1 [a, b]
L(f) = \frac{1}{k!}\int_a^b K(θ)\,f^{(k+1)}(θ)\,dθ.
Here the order of L and the integration can be swapped, since L consists
of an integration and a weighted sum of function evaluations. The theorem
has the following extension:
Theorem 5.4. If K does not change sign in (a, b), then for f ∈ C^{k+1}[a, b]
L(f) = \left[\frac{1}{k!}\int_a^b K(θ)\,dθ\right] f^{(k+1)}(ξ)
for some ξ ∈ (a, b),
where we used the fact that (a − θ)+ = 0, since a ≤ θ for all θ ∈ [a, b] and
(b − θ)+ = b − θ, since b ≥ θ for all θ ∈ [a, b].
The kernel K(θ) does not change sign for θ ∈ (a, b), since b − θ > 0 and
a − θ < 0 for θ ∈ (a, b). The integral over [a, b] of the kernel is given by
\int_a^b K(θ)\,dθ = \int_a^b \frac{(b − θ)(a − θ)}{2}\,dθ = -\frac{1}{12}(b − a)^3,
which can be easily verified. Thus the error for the trapezium rule is
L(f) = \left[\frac{1}{1!}\int_a^b K(θ)\,dθ\right] f''(ξ) = -\frac{1}{12}(b − a)^3 f''(ξ).
Proof. We apply Peano’s kernel theorem to prove this result. Firstly, Simp-
son’s rule is correct for all quadratic polynomials by construction. However,
it is also correct for cubic polynomials, which can be proven by applying it
to the monomial x3 . The value of the integral of x3 over the interval [a, b] is
(b4 − a4 )/4. Simpson’s rule applied to x3 gives
\frac{b-a}{6}\left[a^3 + 4\left(\tfrac{a+b}{2}\right)^3 + b^3\right]
= \frac{b-a}{12}\left[2a^3 + a^3 + 3a^2b + 3ab^2 + b^3 + 2b^3\right]
= \frac{b-a}{4}\left[a^3 + a^2b + ab^2 + b^3\right]
= \frac{1}{4}\left(a^3b + a^2b^2 + ab^3 + b^4 - a^4 - a^3b - a^2b^2 - ab^3\right)
= \frac{b^4 - a^4}{4}.
So Simpson's rule is correct for all cubic polynomials, since they form a linear space, and we have k = 3. However, it is not correct for polynomials of degree four. Let, for example, a = 0 and b = 1; then the integral of x⁴ over the interval [0, 1] is 1/5 while the approximation by Simpson's rule is (1/6)(0⁴ + 4(1/2)⁴ + 1⁴) = 5/24.
The kernel is given by
For θ ∈ [ a+b
2 , b], the result can be simplified as
The first term (b − θ)3 is always positive, since θ < b. The expression in
the square brackets is decreasing and has a zero at the point 23 a + 13 b =
1 1 1
2 (a + b) − 6 (b − a) < 2 (a + b) and thus is negative. Hence K(θ) is negative
a+b
on [ 2 , b].
For θ ∈ [a, a+b 3
2 ] the additional term 2/3(b − a)[(a + b)/2 − θ] , which is
positive in this interval, is subtracted. Thus, K(θ) is also negative here.
Hence the kernel does not change sign. We need to integrate K(θ) to obtain
High-order Newton–Cotes Rules are rarely used, for two reasons. Firstly,
for larger n some of the weights are negative, which leads to numerical instabil-
ity. Secondly, methods based on high-order polynomials with equally spaced
points have the same disadvantages as high-order polynomial interpolation,
as we have seen with Runge’s phenomenon.
Theorem 5.6. All roots of pn are real, have multiplicity one, and lie in (a, b).
Proof. Let x1 , . . . , xm be the places where pn changes sign in (a, b). These are
roots of pn , but pn does not necessarily change sign at every root (e.g., if the
root has even multiplicity where the curve just touches the x-axis, but does
not cross it). For pn to change sign the root must have odd multiplicity. There
could be no places in (a, b) where pn changes sign, in which case m = 0. We
know that m ≤ n, since pn has at most n real roots.
The polynomial (x − x1 )(x − x2 ) · · · (x − xm ) changes sign in the same way
as pn . Hence the product of the two does not change sign at all in (a, b). Now
pn is orthogonal to all polynomials of degree less than n. Thus,
\int_a^b (x − x_1)(x − x_2)\cdots(x − x_m)\,p_n(x)\,w(x)\,dx = 0,
and we have
\int_a^b f(x)\,w(x)\,dx \approx \sum_{i=1}^{n} w_i f(x_i).
Exercise 5.2. Calculate the weights w0 and w1 and the abscissae x0 and x1
such that the approximation
\int_0^1 f(x)\,dx \approx w_0 f(x_0) + w_1 f(x_1)
is exact when f is a cubic polynomial. You may use the fact that x0 and x1
are the zeros of a quadratic polynomial which is orthogonal to all linear poly-
nomials. Verify your calculation by testing the formula when f (x) = 1, x, x2
and x3 .
The consequence of the theorem is that using equally spaced points, the
resulting method is only necessarily exact for polynomials of degree at most
n − 1. By picking the abscissae carefully, however, a method results which
is exact for polynomials of degree up to 2n − 1. For the price of storing the
same number of points, one gets much more accuracy for the same number
of function evaluations. However, if the values of the integrand are given as
empirical data, where it was not possible to choose the abscissae, Gaussian
quadrature is not appropriate.
The weights of Gaussian quadrature rules are positive. Consider L_k^2(x), which is a polynomial of degree 2n − 2, and thus the quadrature formula is exact for it and we have
0 < \int_a^b L_k^2(x)\,w(x)\,dx = \sum_{i=1}^{n} w_i L_k^2(x_i) = w_k.
Gaussian quadrature rules based on a weight function w(x) work very well
for functions that behave like a polynomial times the weight, something which
occurs in physical problems. However, a change of variables may be necessary
for this condition to hold.
The most common Gaussian quadrature formula is the case where (a, b) =
(−1, 1) and w(x) ≡ 1. The orthogonal polynomials are then called Legendre
polynomials. To construct the quadrature rule, one must determine the roots
of the Legendre polynomial of degree n and then calculate the associated
weights.
As an example, let n = 2. The quadratic Legendre polynomial is x² − 1/3 with roots ±1/√3. The two interpolating Lagrange polynomials are (√3 x + 1)/2 and (−√3 x + 1)/2 and both integrate to 1 over [−1, 1]. Thus the two-point Gauss–Legendre rule is given by
\int_{-1}^{1} f(x)\,dx \approx f\!\left(-\tfrac{1}{\sqrt{3}}\right) + f\!\left(\tfrac{1}{\sqrt{3}}\right)
and it is correct for all cubic polynomials.
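This is easily checked numerically. The following MATLAB fragment verifies the rule on one cubic; the test polynomial is an arbitrary assumption chosen for demonstration.

% Check that the two-point Gauss-Legendre rule integrates cubics exactly.
f = @(x) x.^3 + x.^2;                    % exact integral over [-1,1] is 2/3
Q = f(-1/sqrt(3)) + f(1/sqrt(3));        % both weights are 1
fprintf('rule = %.15f, exact = %.15f\n', Q, 2/3);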
For n ≤ 5, the value 2n − 1 (the largest degree of polynomial for which the quadrature is correct), the abscissae, and the corresponding weights are given in the following table,
n   2n − 1   abscissae                    weights
2   3        ±1/√3                        1
3   5        0                            8/9
             ±√(3/5)                      5/9
4   7        ±√((3 − 2√(6/5))/7)          (18 + √30)/36
             ±√((3 + 2√(6/5))/7)          (18 − √30)/36
5   9        0                            128/225
             ±(1/3)√(5 − 2√(10/7))        (322 + 13√70)/900
             ±(1/3)√(5 + 2√(10/7))        (322 − 13√70)/900
Notice that the abscissae are not uniformly distributed in the interval
[−1, 1]. They are symmetric about zero and cluster near the end points.
Exercise 5.3. Implement the Gauss–Legendre quadrature for n = 2, . . . , 5, approximate \int_{-1}^{1} x^j\,dx for j = 1, . . . , 10, and compare the results to the true solution. Interpret your results.
Other choices of weight functions are listed in the following table, where α, β > −1:
Name                      Notation       Interval     Weight function
Legendre                  P_n            [−1, 1]      w(x) ≡ 1
Jacobi                    P_n^{(α,β)}    (−1, 1)      w(x) = (1 − x)^α (1 + x)^β
Chebyshev (first kind)    T_n            (−1, 1)      w(x) = (1 − x²)^{−1/2}
Chebyshev (second kind)   U_n            [−1, 1]      w(x) = (1 − x²)^{1/2}
Laguerre                  L_n            [0, ∞)       w(x) = e^{−x}
Hermite                   H_n            (−∞, ∞)      w(x) = e^{−x²}
Next we turn to the estimation of the error of Gaussian quadrature rules.
for some ξ ∈ (a, b), where p̂n is the nth orthogonal polynomial with respect to
w(x), scaled such that the leading coefficient is 1.
This is possible, since these are 2n conditions and q has 2n degrees of freedom.
Since the degree of q is at most 2n − 1, the quadrature rule is exact
\int_a^b q(x)\,w(x)\,dx = \sum_{i=1}^{n} w_i q(x_i) = \sum_{i=1}^{n} w_i f(x_i).
f(x) - q(x) = \frac{f^{(2n)}(\xi(x))}{(2n)!}\,[\hat{p}_n(x)]^2.
numerator with the same multiplicity. We can apply the mean value theorem
of integral calculus:
\int_a^b (f(x) - q(x))\,w(x)\,dx = \int_a^b \frac{f^{(2n)}(\xi(x))}{(2n)!}\,[\hat{p}_n(x)]^2\,w(x)\,dx = \frac{f^{(2n)}(\xi)}{(2n)!}\int_a^b [\hat{p}_n(x)]^2\,w(x)\,dx
for some ξ ∈ (a, b).
There are two variations of Gaussian quadrature rules. The Gauss–Lobatto
rules, also known as Lobatto quadrature, explicitly include the end points of the
interval as abscissae, x1 = a and xn = b, while the remaining n − 2 abscissae
are chosen optimally. The quadrature is then accurate for polynomials up to
degree 2n − 3. For w(x) ≡ 1 and [a, b] = [−1, 1], the remaining abscissae are
the zeros of the derivative of the (n − 1)th Legendre polynomial Pn−1 (x). The
Lobatto quadrature of f (x) on [−1, 1] is
\int_{-1}^{1} f(x)\,dx \approx \frac{2}{n(n-1)}\,[f(1) + f(-1)] + \sum_{i=2}^{n-1} w_i f(x_i),
n   2n − 3   abscissae    weights
3   3        0            4/3
             ±1           1/3
4   5        ±1/√5        5/6
             ±1           1/6
5   7        0            32/45
             ±√(3/7)      49/90
             ±1           1/10
The second variation are the Gauss–Radau rules or Radau quadratures.
Here one end point is included as abscissa. Therefore we distinguish left and
right Radau rules. The remaining n − 1 abscissae are chosen optimally. The
quadrature is accurate for polynomials of degree up to 2n − 2. For w(x) ≡ 1
and [a, b] = [−1, 1] and x1 = −1, the remaining abscissae are the zeros of the
polynomial given by
\frac{P_{n-1}(x) + P_n(x)}{1 + x}.
The following table lists the left Radau rules with their abscissae and weights
until n = 5 and the degree of polynomial they are correct for. For n = 4 and
5 only approximations to the abscissae and weights are given.
n   2n − 2   abscissae        weights
2   2        −1               1/2
             1/3              3/2
3   4        −1               2/9
             (1 ± √6)/5       (16 ∓ √6)/18
4   6        −1               1/8
             −0.575319        0.657689
             0.181066         0.776387
             0.822824         0.440924
5   8        −1               2/25
             −0.72048         0.446208
             −0.167181        0.623653
             0.446314         0.562712
             0.885792         0.287427
One drawback of Gaussian quadrature is the need to pre-compute the
necessary abscissae and weights. Often the abscissae and weights are given
in look-up tables for specific intervals. If one has a quadrature rule for the
interval [c, d], it can be adapted to the interval [a, b] with a simple change of
variables. Let t(x) be the linear transformation taking [c, d] to [a, b] and t−1 (y)
its inverse,
y = t(x) = a + \frac{b-a}{d-c}(x - c),
x = t^{-1}(y) = c + \frac{d-c}{b-a}(y - a),
\frac{dy}{dx} = \frac{b-a}{d-c}.
The integration is then transformed:
\int_a^b f(y)\,w(y)\,dy = \int_{t^{-1}(a)}^{t^{-1}(b)} f(t(x))\,w(t(x))\,t'(x)\,dx = \frac{b-a}{d-c}\int_c^d f(t(x))\,w(t(x))\,dx.
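For illustration, this change of variables can be coded directly. The sketch below re-uses the two-point Gauss–Legendre rule tabulated on [−1, 1] on a general interval; the integrand and the interval [a, b] are assumptions chosen for demonstration.

% Re-using a rule tabulated on [c,d] = [-1,1] on a general interval [a,b].
x = [-1/sqrt(3), 1/sqrt(3)];  w = [1, 1]; % two-point Gauss-Legendre on [-1,1]
a = 0; b = pi; f = @(t) sin(t);           % example integrand (exact value 2)
y = a + (b - a)/2 * (x + 1);              % t(x) maps [-1,1] onto [a,b]
Q = (b - a)/2 * sum(w .* f(y));           % scaled by (b-a)/(d-c)
fprintf('Q = %.6f\n', Q);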
because the function values at the interior abscissae are needed twice to ap-
proximate the integrals on the intervals on the left and right of them. Since the
On each sub-interval the error is O(h⁵), and since there are N sub-intervals, the overall error is O(h⁴).
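For illustration, a composite Simpson rule can be sketched in MATLAB as follows; the integrand, the interval and the number of sub-intervals N are assumptions chosen for demonstration.

% Minimal sketch of the composite Simpson rule on N sub-intervals of [a,b].
f = @(x) exp(-x.^2);  a = 0; b = 1; N = 10;
h = (b - a)/N;                            % length of each sub-interval
x = a:h/2:b;                              % 2N+1 equally spaced points
fx = f(x);
% endpoints once, interior sub-interval endpoints twice, midpoints four times
Q = h/6 * (fx(1) + fx(end) + 4*sum(fx(2:2:end-1)) + 2*sum(fx(3:2:end-2)));
fprintf('composite Simpson = %.10f\n', Q);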
Because evaluating an arbitrary function can be potentially expensive, the
efficiency of quadrature rules is usually measured by the number of func-
tion evaluations required to achieve a desired accuracy. In composite rules it
is therefore advantageous, if the endpoints are abscissae, since the function
value at the end-point of one sub-interval will be used again in the next sub-
interval. Therefore Lobatto rules play an important role in the construction
of composite rules. In the following we put this on a more theoretical footing.
Definition 5.2 (Riemann integral). For each n ∈ N let there be a set of
numbers a = ξ0 < ξ1 < . . . < ξn = b. A Riemann integral is defined by
\int_a^b f(x)\,dx = \lim_{n\to\infty,\ \Delta\xi\to 0} \sum_{i=1}^{n} (\xi_i - \xi_{i-1})\,f(x_i),
where xi ∈ [ξi−1 , ξi ] and ∆ξ = max1≤i≤n |ξi −ξi−1 |. The sum on the right-hand
side is called a Riemann sum.
Some simple quadrature rules are clearly Riemann sums. For example,
take the composite rectangle rule, which approximates the value on each sub-
interval by the function value at the right end-point times the length of the
interval
Q_N(f) = h \sum_{i=1}^{N} f(a + ih),
where xij is the j th abscissa in the ith sub-interval calculated as xij = ti (xj ),
where ti is the transformation taking [c, d] of length d − c to [a + (i − 1)(b −
a)/M, a + i(b − a)/M ] of length (b − a)/M .
Theorem 5.9. Let Q_n be a quadrature rule that integrates constants exactly, i.e., Q_n(1) = \int_c^d 1\,dx = d − c. If f is bounded on [a, b] and is Riemann integrable, then
\lim_{M\to\infty} (M \times Q_n)(f) = \int_a^b f(x)\,dx.
That is, the weights sum to d − c. Swapping the summations and taking
everything independent of M out of the limit in (5.1) leads to
\lim_{M\to\infty} (M \times Q_n)(f) = \frac{1}{d-c}\sum_{j=1}^{n} w_j \left[\lim_{M\to\infty} \frac{b-a}{M}\sum_{i=1}^{M} f(x_{ij})\right] = \int_a^b f(x)\,dx,
We have already seen that the error can be expressed in such a form for
all quadrature rules we have encountered so far.
Theorem 5.10. Let Qn be a simplex rule as defined above and let EM ×Qn (f )
denote the error of (M × Qn )(f ). Then
\lim_{M\to\infty} \left[ M^k E_{M\times Q_n}(f) \right] = C(b-a)^k \left[ f^{(k-1)}(b) - f^{(k-1)}(a) \right].
That is, (M \times Q_n)(f) converges to \int_a^b f(x)\,w(x)\,dx like M^{-k} for sufficiently large M.
Proof. The error of the composite rule is the sum of the errors in each sub-
interval. Thus
E_{M\times Q_n}(f) = C \sum_{i=1}^{M} \left(\frac{b-a}{M}\right)^{k+1} f^{(k)}(\xi_i),
where ξ_i lies inside the ith sub-interval. Multiplying by M^k and taking the limit gives
\lim_{M\to\infty} M^k E_{M\times Q_n}(f) = C(b-a)^k \lim_{M\to\infty}\left[\frac{b-a}{M}\sum_{i=1}^{M} f^{(k)}(\xi_i)\right].
where there are d integrals and where the boundaries of each integral may
depend on variables not used in that integral. In the k th dimension the inter-
val of integration is [lk (x1 , . . . , xk−1 ), uk (x1 , . . . , xk−1 )]. Such problems often
arise in practice, mostly for two or three dimensions, but sometimes for 10
or 20 dimensions. The problem becomes considerably more expensive with
each extra dimension. Therefore different methods have been developed for
different ranges of dimensions.
We first consider the transformation into standard regions with the hyper-
cube as an example. Other standard regions are the hypersphere, the surface
of the hypersphere, or a simplex, where a simplex is the generalization of the
triangle or tetrahedron to higher dimensions. Different methods have been
developed for different standard regions. Returning to the hypercube, it can
be transformed to the region of the integral given in (5.2) by
x_i = \frac{1}{2}\left[u_i(x_1, \ldots, x_{i-1}) + l_i(x_1, \ldots, x_{i-1})\right] + \frac{1}{2}\,y_i \left[u_i(x_1, \ldots, x_{i-1}) - l_i(x_1, \ldots, x_{i-1})\right].
If y_i = −1, then x_i = l_i(x_1, \ldots, x_{i-1}), the lower limit of the integration. If y_i = 1, then x_i = u_i(x_1, \ldots, x_{i-1}), the upper limit of the integration. For y_i = 0, x_i is the midpoint of the interval. The derivative of this transformation is given by
\frac{dx_i}{dy_i} = \frac{1}{2}\left[u_i(x_1, \ldots, x_{i-1}) - l_i(x_1, \ldots, x_{i-1})\right].
Using the transformation, we write f (x1 , . . . , xd ) = g(y1 , . . . , yd ) and the in-
tegral I becomes
\int_{-1}^{1} \cdots \int_{-1}^{1} \frac{1}{2^d} \prod_{i=1}^{d} \left[u_i(x_1, \ldots, x_{i-1}) - l_i(x_1, \ldots, x_{i-1})\right] g(y_1, \ldots, y_d)\,dy_1 \cdots dy_d.
where x1i ∈ [a, b], i = 1, . . . , n, are the abscissae of Q1 and x2j ∈ [c, d],
j = 1, . . . , m, are the abscissae of Q2 .
Definition 5.4. The product rule Q1 × Q2 to integrate a function F : I1 ×
I2 → R is defined by
(Q_1 \times Q_2)(F) = \sum_{i=1}^{n} \sum_{j=1}^{m} w_{1i}\,w_{2j}\,F(x_{1i}, x_{2j}).
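For illustration, the definition translates directly into two nested loops. In the sketch below both one-dimensional factors are the two-point Gauss–Legendre rule on [−1, 1]; the integrand is an assumption chosen for demonstration.

% Minimal sketch of the product rule Q1 x Q2.
x1 = [-1/sqrt(3), 1/sqrt(3)];  w1 = [1, 1];   % rule Q1 on I1 = [-1,1]
x2 = x1;                       w2 = w1;       % rule Q2 on I2 = [-1,1]
F  = @(x,y) x.^2 .* y.^2;                     % exact integral over the square is 4/9
Q  = 0;
for i = 1:numel(x1)
    for j = 1:numel(x2)
        Q = Q + w1(i)*w2(j)*F(x1(i), x2(j));
    end
end
fprintf('product rule = %.15f, exact = %.15f\n', Q, 4/9);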
Exercise 5.4. Let Q1 integrate f exactly over the interval I1 and let Q2
integrate g exactly over the interval I2 . Prove that Q1 ×Q2 integrates f (x)g(y)
over I1 × I2 exactly.
A consequence of the above definition and exercise is that we can combine
all the one-dimensional quadrature rules we encountered before to create rules
in two dimensions. If Q1 is correct for polynomials of degree at most k1 and
Q2 is correct for polynomials of degree at most k2 , then the product rule is
correct for any linear combination of the monomials xi y j , where i = 0, . . . , k1 ,
and j = 0, . . . , k2 . As an example let Q1 and Q2 both be Simpson’s rule. The
product rule is then given by
F(a, d) + 4F\!\left(\tfrac{a+b}{2}, d\right) + F(b, d),
the order of integration or even just reducing the number of dimensions by one
can have a considerable effect. In high dimensional problems, high accuracy
is often not required. Often merely the magnitude of the integral is sufficient.
Note that high dimensional integration is a problem well-suited to parallel
computing.
is one. That is, using truly random abscissae and equal weights, convergence
is almost sure.
Statistical error estimates are available, but these depend on the variance
of the function f . As an example for the efficiency of the Monte Carlo method,
to obtain an error less than 0.01 with 99% certainty in the estimation of I,
we need to average 6.6 × 104 function values. To gain an extra decimal place
with the same certainty requires 6.6 × 106 function evaluations.
The advantage of the Monte Carlo method is that the error behaves like
n−1/2 and not like n−k/d , that is, it is independent of the number of dimen-
sions. However, there is still a dimensional effect, since higher dimensional
functions have larger variances.
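For illustration, a Monte Carlo estimate over the unit hypercube is only a few lines of MATLAB; the integrand, the dimension and the sample size are assumptions chosen for demonstration.

% Minimal sketch of Monte Carlo integration over [0,1]^d with equal weights 1/n.
d = 10;  n = 1e5;
f = @(x) exp(-sum(x.^2, 2));              % integrand evaluated row-wise
X = rand(n, d);                           % n pseudo-random abscissae
I = mean(f(X));                           % Monte Carlo estimate
err = std(f(X))/sqrt(n);                  % statistical error estimate, ~ n^(-1/2)
fprintf('estimate = %.4f +/- %.4f\n', I, err);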
As a final algorithm the Korobov–Conroy method needs to be mentioned.
Here the abscissae are not chosen pseudo-randomly, but are in some sense op-
timal. They are derived from results in number theory (See for example [15] H.
Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods).
(a) Define Gaussian quadrature and state how the abscissae are obtained. Give
a formula for the weights. If f is a polynomial, what is the maximum degree
of f for which the Gaussian quadrature rule is correct?
(b) In the following let the interval be [a, b] = [−2, 2] and w(x) = 4 − x2 . Thus
we want to approximate the integral
\int_{-2}^{2} (4 - x^2)\,f(x)\,dx.
(c) If K does not change sign in (a, b), how can the expression for L(f ) be
further simplified?
(d) In the following we let a = 0 and b = 2 and let
L(f) = \int_0^2 f(x)\,dx - \frac{1}{3}\,[f(0) + 4f(1) + f(2)].
Find the highest degree of polynomials for which this approximation is
correct.
(e) Calculate K(θ) for θ ∈ [0, 2].
(f ) Given that K(θ) is negative for θ ∈ [0, 2], obtain c such that
(f ) Let [c, d] = [−1, 1]. Give the constant, linear, and quadratic monic poly-
nomials which are orthogonal with respect to the inner product given by
\langle f, g \rangle = \int_{-1}^{1} f(x)\,g(x)\,dx
(g) Give the abscissae of the two-point Gauss–Legendre rule on the interval
[−1, 1].
(h) The weights of the two-point Gauss–Legendre rule are 1 for both abscis-
sae. State the two-point Gauss–Legendre rule and give the formula for the
composite rule on [a, b] employing the two-point Gauss–Legendre rule.
Exercise 5.8. The integral
\int_{-1}^{1} (1 - x^2)\,f(x)\,dx
which is exact for all f (x) that are polynomials of degree less than or equal to
2n − 1.
(a) Explain how the weights wi are calculated, writing down explicit expres-
sions in terms of integrals.
(b) Explain why it is necessary that the x_i are the zeros of a (monic) polynomial p_n of degree n that satisfies \int_{-1}^{1} (1 - x^2)\,p_n(x)\,q(x)\,dx = 0 for any polynomial q(x) of degree less than n.
(c) The first such polynomials are p_0 = 1, p_1 = x, p_2 = x^2 - \tfrac{1}{5}, p_3 = x^3 - \tfrac{3}{7}x. Show that the Gaussian quadrature formulae for n = 2, 3 are
n = 2: \quad \frac{2}{3}\left[ f\!\left(-\tfrac{1}{\sqrt{5}}\right) + f\!\left(\tfrac{1}{\sqrt{5}}\right) \right],
n = 3: \quad \frac{14}{45}\left[ f\!\left(-\sqrt{\tfrac{3}{7}}\right) + f\!\left(\sqrt{\tfrac{3}{7}}\right) \right] + \frac{32}{45} f(0).
(b) Calculate the zeros of the polynomial found in (a) and explain how they
are used to construct a Gaussian quadrature rule.
(c) Describe how the weights are calculated for a Gaussian quadrature rule and calculate the weights to approximate \int_0^2 f(x)\,dx.
(d) For which polynomials is the constructed quadrature rule correct?
(e) State the functional L(f) acting on f describing the error when the integral \int_0^2 f(x)\,dx is approximated by the quadrature rule.
(f ) Define the Peano kernel and state the Peano kernel theorem.
(g) Calculate the Peano kernel for the functional L(f ) in (e).
(h) The Peano kernel does not change sign in [0, 2] (not required to be proven).
Derive an expression for L(f ) of the form constant times a derivative of
f . (Hint: (a + b)4 = a4 + 4a3 b + 6a2 b2 + 4ab3 + b4 .)
CHAPTER 6
ODEs
Lipschitz continuity means that the slopes of all secant lines to the function
between possible points v and w are bounded above by a positive constant.
Thus a Lipschitz continuous function is limited in how much and how fast
it can change. In the theory of differential equations, Lipschitz continuity is
the central condition of the Picard–Lindelöf theorem, which guarantees the
existence and uniqueness of a solution to an initial value problem.
For our analysis of numerical solutions we henceforth assume that f is
analytic and we are always able to expand locally into a Taylor series.
We want to calculate yn+1 ≈ y(tn+1 ), n = 0, 1, . . . , from y0 , y1 , . . . , yn ,
where tn = nh and the time step h > 0 is small.
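For illustration, the simplest such scheme, Euler's method y_{n+1} = y_n + h f(t_n, y_n), can be coded in a few lines of MATLAB; the right-hand side, the initial value and the step size are assumptions chosen for demonstration.

% Minimal sketch of the forward Euler method.
f  = @(t, y) -2*y + exp(-t);               % example scalar ODE
h  = 0.01;  tstar = 1;  N = round(tstar/h);
y  = zeros(1, N+1);  t = (0:N)*h;
y(1) = 1;                                  % y(0) = y_0
for n = 1:N
    y(n+1) = y(n) + h*f(t(n), y(n));       % one Euler step
end
plot(t, y)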
Looking at the expansion of e^{hλ}, we see that 1 + hλ ≤ e^{hλ}, since hλ > 0. Thus
(1 + hλ)^{n+1}\,\frac{ch}{λ} \le \frac{ch}{λ}\,e^{(n+1)hλ} \le \frac{c\,e^{t^*λ}}{λ}\,h = Ch,
where we use the fact that (n + 1)h ≤ t^*. Thus ‖e_n‖ converges uniformly to zero and the theorem is true.
Exercise 6.2. Assuming that f is Lipschitz continuous and possesses a
bounded third derivative in [0, t∗ ], use the same method of analysis to show
that the trapezoidal rule
y_{n+1} = y_n + \tfrac{1}{2}h\,[f(t_n, y_n) + f(t_{n+1}, y_{n+1})]
converges and that ‖y_n − y(t_n)‖ ≤ ch² for some c > 0 and all n such that 0 ≤ nh ≤ t^*.
Note that the arguments in φh are the true function values of y. Thus
the local truncation error is the difference between the true solution and the
method applied to the true solution. Hence it only gives an indication of the
error if all previous steps have been exact.
Definition 6.4 (Consistency). The numerical method given by (6.3) to obtain
solutions for (6.1) is called consistent if
\lim_{h\to 0} \frac{δ_{n+1,h}}{h} = 0.
Thinking back to the definition of the O-notation, consistency is equivalent
to saying that the order is at least one. For convergence, p ≥ 1 is necessary.
For Euler's method we have φ_h(t_n, y_n) = y_n + h f(t_n, y_n). Using again Taylor series expansion,
y(t_{n+1}) − [y(t_n) + h f(t_n, y(t_n))] = [y(t_n) + h y'(t_n) + \tfrac{1}{2}h^2 y''(t_n) + \cdots] − [y(t_n) + h y'(t_n)] = O(h^2),
3. For θ = \tfrac{1}{2} we have the trapezoidal rule
y_{n+1} = y_n + \tfrac{1}{2}h\,[f(t_n, y_n) + f(t_{n+1}, y_{n+1})].
Therefore all theta methods are of order 1, except that the trapezoidal rule
(θ = 1/2) is of order 2.
If θ < 1, then the theta method is implicit. That means each time step
requires the solution of N (generally non-linear) algebraic equations to find
the unknown vector yn+1 . This can be done by iteration and generally the
[0]
first estimate yn+1 for yn+1 is set to yn , assuming that the function does not
change rapidly between time steps.
To obtain further estimates for yn+1 one can use direct iteration;
y^{[j+1]}_{n+1} = φ_h(t_n, y_0, y_1, \ldots, y_n, y^{[j]}_{n+1}).
Other methods arise by viewing the problem of finding yn+1 as finding the
zero of the function F : RN → RN defined by
F (y) = y − φh (tn , y0 , y1 , . . . yn , y).
This is the subject of the chapter on non-linear systems. Assume we already
have an estimate y^{[j]} for the zero. Let
F(y) = (F_1(y), \ldots, F_N(y))^T
and let h = (h1 , . . . hN )T be a small perturbation vector. The multidimen-
sional Taylor expansion of each function component Fi , i = 1, . . . , N is
F_i(y^{[j]} + h) = F_i(y^{[j]}) + \sum_{k=1}^{N} \frac{\partial F_i(y^{[j]})}{\partial y_k}\,h_k + O(\|h\|^2).
The Jacobian matrix J_F(y^{[j]}) has the entries \partial F_i(y^{[j]})/\partial y_k and thus we can write in matrix notation
F(y^{[j]} + h) = F(y^{[j]}) + J_F(y^{[j]})\,h + O(\|h\|^2).
Neglecting the O(\|h\|^2) term, we equate this to zero (since we are looking for a better approximation of the zero) and solve for h. We let the new estimate be
y^{[j+1]} = y^{[j]} + h = y^{[j]} − [J_F(y^{[j]})]^{-1} F(y^{[j]}).
This method is called the Newton–Raphson method.
Of course the inverse of the Jacobian is not computed explicitly; instead the equation
[J_F(y^{[j]})]\,h = −F(y^{[j]})
is solved for h.
The method can be simplified further by using the same Jacobian JF (y[0] )
in the computation of the new estimate y[j+1] , which is then called modified
Newton–Raphson.
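For illustration, one implicit step solved in this way can be sketched in MATLAB as follows. The example takes the backward Euler method, y_{n+1} = y_n + h f(t_{n+1}, y_{n+1}), so that F(y) = y − y_n − h f(t_{n+1}, y); the right-hand side, its Jacobian and the data are assumptions chosen for demonstration.

% Minimal sketch of one backward Euler step solved by Newton-Raphson.
f  = @(t, y) [-50*y(1) + y(2); y(1) - y(2)];       % example stiff system
Jf = @(t, y) [-50, 1; 1, -1];                       % Jacobian of f w.r.t. y
h  = 0.1;  tn1 = 0.1;  yn = [1; 1];
y  = yn;                                            % first estimate y^[0] = y_n
for j = 1:10
    F  = y - yn - h*f(tn1, y);
    JF = eye(2) - h*Jf(tn1, y);                     % Jacobian of F
    dy = -JF \ F;                                   % solve J_F * dy = -F
    y  = y + dy;
    if norm(dy) < 1e-12, break; end
end
disp(y)                                             % y_{n+1}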
Exercise 6.3. Implement the backward Euler method in MATLAB or a dif-
ferent programming language of your choice.
ρ(e^z) − zσ(e^z) = \left[1 + 2z + 2z^2 + \tfrac{4}{3}z^3\right] − \left[1 + z + \tfrac{1}{2}z^2 + \tfrac{1}{6}z^3\right] − \tfrac{3}{2}z\left[1 + z + \tfrac{1}{2}z^2\right] + \tfrac{1}{2}z + O(z^4) = \tfrac{5}{12}z^3 + O(z^4).
Hence the method is of order 2.
Exercise 6.4. Calculate the coefficients of the multistep method
Proof. The proof of this result is long and technical. Details can be found
in [10] W. Gautschi, Numerical Analysis or [11] P. Henrici, Discrete Variable
Methods in Ordinary Differential Equations.
Exercise 6.6. Show that the multistep method given by
\sum_{j=0}^{3} ρ_j y_{n+j} = h \sum_{j=0}^{2} σ_j f(t_{n+j}, y_{n+j})
Proof. Again the proof is technical and beyond the scope of this course. See
again [11] P. Henrici, Discrete Variable Methods in Ordinary Differential Equa-
tions.
the exact solution of the ODE can be represented explicitly as y(t) = etA y0 .
We solve the ODE with the forward Euler method. Then
yn+1 = (I + hA)yn ,
and therefore
yn = (I + hA)n y0 .
Let the eigenvalues of A be λ1 , . . . , λN with corresponding linear indepen-
dent eigenvectors v1 , . . . , vN . Further let D be the diagonal matrix with the
eigenvalues being the entries on the diagonal and V = [v1 , . . . , vN ], whence
A = V DV −1 .
We assume further that Reλl < 0, l = 1, . . . , N . In this case limt→∞ y(t) =
lie within the unit circle for all hλ ∈ C− and the roots on the unit circle are
simple roots (these are roots where the function vanishes, but not its deriva-
tive).
Figure 6.2 Stability domains for θ = 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2,
and 0.1
Proof. When the s-step method given by (6.4) is applied to the test equation
y 0 = λy, y(0) = 1, it reads
\sum_{l=0}^{s} (ρ_l − hλσ_l)\,y_{n+l} = 0.
This recurrence relation has the characteristic polynomial given by (6.7). Let
its zeros be w1 (hλ), . . . , wN (hλ) (hλ) with multiplicities µ1 (hλ), . . . , µN (hλ) (hλ),
respectively, where the multiplicities sum to the order of the polynomial τ . If
the root of a function has multiplicity k, then it and its first k − 1 derivatives
vanish there. The solutions of the recurrence relation are given by
y_n = \sum_{j=1}^{N(hλ)} \sum_{i=0}^{μ_j(hλ)-1} n^i\,w_j(hλ)^n\,α_{ij}(hλ),
where αij (hλ) are independent of n but depend on the starting values
y0 , . . . , ys−1 . Hence the linear stability domain is the set of all hλ ∈ C such
that all the zeros of (6.7) satisfy |wj (hλ)| ≤ 1 and if |wj (hλ)| = 1, then
µj (hλ) = 1.
The theorem implies that hλ ∈ C is in the stability region if the roots of
the polynomial ρ(w) − hλσ(w) lie within the unit circle. It follows that if hλ
is on the boundary of the stability region, then ρ(w) − hλσ(w) must have at
least one root with magnitude exactly equal to 1. Let this root be eiα for some
value α in the interval [0, 2π]. Since eiα is a root we have
ρ(e^{iα}) − hλσ(e^{iα}) = 0
and hence
hλ = \frac{ρ(e^{iα})}{σ(e^{iα})}.
Since every point hλ on the boundary of the stability domain has to be of this
form, we can determine the parametrized curve
z(α) = \frac{ρ(e^{iα})}{σ(e^{iα})}
for 0 ≤ α ≤ 2π which are all points that are potentially on the boundary of the
stability domain. For simple methods this yields the stability domain directly
after one determines on which side of the boundary the stability domain lies.
This is known as the boundary locus method .
We illustrate this with the Theta methods, which are given by
yn+1 − yn = h[(1 − θ)f (tn+1 , yn+1 ) + θf (tn , yn )].
Thus ρ(w) = w − 1 and σ(w) = (1 − θ)w + θ and the parametrized curve is
z(α) = \frac{ρ(e^{iα})}{σ(e^{iα})} = \frac{e^{iα} − 1}{(1 − θ)e^{iα} + θ}.
For various values of θ, these curves were used to generate Figure 6.2.
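For illustration, such boundary locus curves are easily plotted in MATLAB; the values of θ chosen here are arbitrary assumptions.

% Minimal sketch of the boundary locus method for the theta methods.
alpha = linspace(0, 2*pi, 400);
w = exp(1i*alpha);
hold on
for theta = [1, 0.5, 0.1]
    z = (w - 1) ./ ((1 - theta)*w + theta);   % rho(w) = w-1, sigma(w) = (1-theta)w + theta
    plot(real(z), imag(z))
end
axis equal, grid on, hold off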
Theorem 6.6 (the second Dahlquist barrier). A-stability implies that the
order p has to be less than or equal to 2. Moreover, the second order A-stable
method with the least truncation error is the trapezoidal rule.
So no multistep method of p ≥ 3 may be A-stable, but there are methods
which are satisfactory for most stiff equations. The point is that in many stiff
linear systems in real world applications the eigenvalues are not just in C− , but
also well away from iR, that is, the imaginary axis. Therefore relaxed stability
concepts are sufficient. Requiring stability only across a wedge in C− of angle
α results in A(α)-stability. A-stability is equivalent to A(90◦ )-stability. A(α)-
stability is sufficient for most purposes. High-order A(α)-stable methods exist
for α < 90◦ . For α → 90◦ the coefficients of high-order A(α)-stable methods
begin to diverge. If λ in the test equation is purely imaginary, then the solution
is a linear combination of sin(λt) and cos(λt) and it oscillates a lot for large
λ. Therefore numerical pointwise solutions are useless anyway.
= 1 + \tfrac{3}{2}v + \tfrac{5}{12}v^2 + O(v^3)
= 1 + \tfrac{3}{2}(w − 1) + \tfrac{5}{12}(w − 1)^2 + O(|w − 1|^3)
= −\tfrac{1}{12} + \tfrac{2}{3}w + \tfrac{5}{12}w^2 + O(|w − 1|^3).
Therefore the 2-step, 3rd order Adams–Moulton method is
y_{n+2} − y_{n+1} = h\left[ −\tfrac{1}{12} f(t_n, y_n) + \tfrac{2}{3} f(t_{n+1}, y_{n+1}) + \tfrac{5}{12} f(t_{n+2}, y_{n+2}) \right].
Exercise 6.9. Calculate the actual values of the coefficients of the 3-step
Adams–Bashforth method.
Exercise 6.10 (Recurrence relation for Adams–Bashforth). Let ρ_s and σ_s denote the polynomials generating the s-step Adams–Bashforth method. Prove that
σ_s(w) = wσ_{s-1}(w) + α_{s-1}(w − 1)^{s-1},
where α_s ≠ 0, s = 1, 2, \ldots, is a constant such that ρ_s(w) − \log(w)\,σ_s(w) = α_s(w − 1)^{s+1} + O(|w − 1|^{s+2}) for w close to 1.
The Adams–Bashforth methods are as follows:
and partitioning the interval equally into t0 < t1 < · · · < tn < · · · with
step size h. Having already approximated yn , . . . , yn+s−1 , we use polynomial
interpolation to find the polynomial p of degree s − 1 such that p(tn+i ) =
f (tn+i , yn+i ) for i = 0, . . . , s − 1. Locally p is a good approximation to the
right-hand side of y 0 = f (t, y) that is to be solved, so we consider y 0 = p(t)
instead. This can be solved explicitly by
y_{n+s} = y_{n+s-1} + \int_{t_{n+s-1}}^{t_{n+s}} p(τ)\,dτ.
BDF are especially used for the solution of stiff differential equations.
To derive the explicit form of the s-step BDF we again employ the technique introduced in (6.8), this time solving for ρ(w), since σ(w) is given. Thus ρ(w) = σ_s w^s \log w + O(|w − 1|^{s+1}). Dividing by w^s and setting v = 1/w, this becomes
\sum_{l=0}^{s} ρ_l v^{s-l} = −σ_s \log v + O(|v − 1|^{s+1}).
The simple change from w to v in the O-term is possible since we are considering w close to 1, or written mathematically w = O(1), and
O(|w − 1|^{s+1}) = O\!\left(|w|^{s+1}\left|1 − \tfrac{1}{w}\right|^{s+1}\right) = O(|w|^{s+1})\,O(|1 − v|^{s+1}) = O(|v − 1|^{s+1}).
Now \log v = \log(1 + (v − 1)) = \sum_{l=1}^{\infty} (−1)^{l-1}(v − 1)^l / l. Consequently we want
We expand
(1 − w)^l = 1 − \binom{l}{1}w + \binom{l}{2}w^2 − \ldots + (−1)^l w^l
y_{n+2} − \tfrac{4}{3}y_{n+1} + \tfrac{1}{3}y_n = \tfrac{2}{3}h f(t_{n+2}, y_{n+2}).
The BDF are as follows
• 4-step: ρ(w) = w^4 − \tfrac{48}{25}w^3 + \tfrac{36}{25}w^2 − \tfrac{16}{25}w + \tfrac{3}{25}, \qquad σ(w) = \tfrac{12}{25}w^4.
It can be proven that BDF are convergent if and only if s ≤ 6. For higher
values of s they must not be used. Figure (6.5) shows the stability domain for
various BDF methods. As the number of steps increases the stability domain
shrinks but it remains unbounded for s ≤ 6. In particular, the 3-step BDF is A(86°2′)-stable.
Figure 6.5 The stability domains of various BDF methods in grey. The
instability regions are in white.
Each multi-step method has its own error constant. For example, the 2nd-order 2-step Adams–Bashforth method (AB2)
y_{n+1} − y_n = \tfrac{1}{2}h\,[3f(t_n, y_n) − f(t_{n-1}, y_{n-1})],
has the error constant c_{AB2} = \tfrac{5}{12}.
The idea behind the Milne device is to use two multistep methods of the
same order, one explicit and the second implicit, to estimate the local error
of the implicit method. For example, locally,
y^{AB2}_{n+1} \approx y(t_{n+1}) − c_{AB2}\,h^3 y'''(t_n) = y(t_{n+1}) − \tfrac{5}{12}\,h^3 y'''(t_n),
y^{TR}_{n+1} \approx y(t_{n+1}) − c_{TR}\,h^3 y'''(t_n) = y(t_{n+1}) + \tfrac{1}{12}\,h^3 y'''(t_n).
Subtracting, we obtain the estimate
The predictor is employed not just to estimate the error of the corrector,
but also to provide an initial guess in the solution of the implicit corrector
equations. Typically, for nonstiff equations, we iterate correction equations at
most twice, while stiff equations require iteration to convergence, otherwise
the typically superior stability features of the corrector are lost.
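For illustration, an AB2 predictor with a trapezoidal-rule corrector and the Milne error estimate can be sketched as follows; the test problem, the step size, the starting procedure, and the factor 1/6 in the error estimate (which follows from subtracting the two local-error expansions above) are assumptions made for this sketch.

% Minimal sketch of an AB2 predictor / trapezoidal corrector pair (PECE).
f = @(t, y) -y;  h = 0.1;  N = 20;
t = (0:N)*h;  y = zeros(1, N+1);
y(1) = 1;  y(2) = y(1) + h*f(t(1), y(1));                     % start-up by one Euler step
for n = 2:N
    yP = y(n) + h/2*(3*f(t(n), y(n)) - f(t(n-1), y(n-1)));    % AB2 predictor
    yC = y(n) + h/2*(f(t(n), y(n)) + f(t(n+1), yP));          % corrector, one iteration
    est = abs(yC - yP)/6;      % Milne estimate of the corrector's local error
    y(n+1) = yC;
end
plot(t, y, t, exp(-t))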
Exercise 6.11. Consider the predictor–corrector pair given by
y^{P}_{n+3} = −\tfrac{1}{2}y_n + 3y_{n+1} − \tfrac{3}{2}y_{n+2} + 3h f(t_{n+2}, y_{n+2}),
y^{C}_{n+3} = \tfrac{1}{11}\left[2y_n − 9y_{n+1} + 18y_{n+2} + 6h f(t_{n+3}, y_{n+3})\right].
There are two important observations with regard to this differential equa-
tion. Firstly, it is a small perturbation on the original ODE, since the
term p'(t) − f(t, p) is usually small since locally p(t) − y(t) = O(h^{p+1}) and y'(t) = f(t, y(t)). Secondly, the exact solution of (6.12) is obviously
z(t) = p(t). Now we calculate zn+1 using exactly the same method and imple-
mentation details. We then evaluate the error in zn+1 , namely zn+1 −p(tn+1 ),
and use it as an estimate of the error in yn+1 . The error estimate can then be
used to assess the step and adjust the step size if necessary.
f_0(y) = y, \quad f_1(y) = f(y), \quad f_2(y) = \frac{\partial f(y)}{\partial y}\,f(y), \quad \ldots
This then motivates the Taylor method
y_{n+1} = \sum_{k=0}^{p} \frac{1}{k!}\,h^k f_k(y_n), \quad n \in \mathbb{Z}^+. \qquad (6.13)
For example,
Theorem 6.7. The Taylor method given by (6.13) has error O(hp+1 ).
Proof. The proof is easily done by induction. Firstly we have y0 = y(0) =
y(0) + O(hp+1 ). Now assume yn = y(tn ) + O(hp+1 ) = y(nh) + O(hp+1 ). It
follows that fk (yn ) = fk (y(nh)) + O(hp+1 ), since f is analytic. Hence yn+1 =
y((n + 1)h) + O(hp+1 ) = y(tn+1 ) + O(hp+1 ).
Recalling the differentiation operator D_t y(t) = y'(t), we see that D_t^k y = f_k(y).
Let R(z) = \sum_{k=0}^{\infty} r_k z^k be an analytic function such that R(z) = e^z + O(z^{p+1}), i.e., r_k = \tfrac{1}{k!} for k = 0, 1, \ldots, p. Then the formal method defined by
y_{n+1} = R(hD_t)\,y_n = \sum_{k=0}^{\infty} r_k h^k f_k(y_n), \quad n \in \mathbb{Z}^+,
is of order p. Indeed, the Taylor method is one such method by dint of letting R(z) be the pth section of the Taylor expansion of e^z.
We can let R be a rational function of the form
R(z) = \frac{\sum_{k=0}^{M} p_k z^k}{\sum_{k=0}^{N} q_k z^k},
is of order M + N .
k_1 = f(t_n, y_n),
k_2 = f(t_n + c_2 h,\; y_n + h c_2 k_1),
k_3 = f(t_n + c_3 h,\; y_n + h(a_{3,1}k_1 + a_{3,2}k_2)), \qquad a_{3,1} + a_{3,2} = c_3,
\vdots
k_ν = f\!\left(t_n + c_ν h,\; y_n + h\sum_{j=1}^{ν-1} a_{ν,j}k_j\right), \qquad \sum_{j=1}^{ν-1} a_{ν,j} = c_ν,
y_{n+1} = y_n + h\sum_{l=1}^{ν} b_l k_l.
b = (b_1, \ldots, b_ν)^T are called the Runge–Kutta weights and satisfy the condition \sum_{l=1}^{ν} b_l = 1. c = (c_1, \ldots, c_ν)^T are called the Runge–Kutta nodes. The method is called consistent if
\sum_{j=1}^{ν-1} a_{i,j} = c_i, \qquad i = 1, \ldots, ν.
0    |
c_2  | a_{2,1}
c_3  | a_{3,1}  a_{3,2}                                 c | A
 :   |    :        :      .                    ⇔      ---+-----
c_ν  | a_{ν,1}  a_{ν,2}  ···  a_{ν,ν−1}                   | b^T
-----+----------------------------------
     | b_1      b_2      ···  b_{ν−1}   b_ν
k_2 = f(t_n + h, y_n + hk_1).
The Runge–Kutta method uses the average of these two increments, i.e.,
y_{n+1} = y_n + \tfrac{1}{2}h(k_1 + k_2).
The corresponding tableau is
0  |
1  |  1
---+-----------
   | 1/2   1/2
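For illustration, a single step of this two-stage method takes only a few lines of MATLAB; the right-hand side and the data are assumptions chosen for demonstration.

% Minimal sketch of one step of the two-stage (Heun) Runge-Kutta method.
f = @(t, y) -2*y + exp(-t);
h = 0.1;  tn = 0;  yn = 1;
k1 = f(tn, yn);
k2 = f(tn + h, yn + h*k1);
ynext = yn + h/2*(k1 + k2);
disp(ynext)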
Both of these methods belong to the family of explicit methods given by
y_{n+1} = y_n + h\left[\left(1 − \tfrac{1}{2α}\right) f(t_n, y_n) + \tfrac{1}{2α}\, f(t_n + αh,\; y_n + αh f(t_n, y_n))\right]. \qquad (6.17)
The choice α = ½ recovers the mid-point rule and α = 1 is Heun's rule. All these methods have the same stability domain which is shown in Figure 6.7.
The choice of the RK coefficients al,j is motivated at the first instance by
order considerations. Thus we again have to perform Taylor expansions. As
an example we derive a 2-stage Runge–Kutta method. We have k1 = f (tn , yn )
and to examine k2 we Taylor-expand about (tn , yn ),
k_2 = f(t_n + c_2 h,\; y_n + hc_2 f(t_n, y_n))
    = f(t_n, y_n) + hc_2\left[\frac{\partial f(t_n, y_n)}{\partial t} + \frac{\partial f(t_n, y_n)}{\partial y}\, f(t_n, y_n)\right] + O(h^2).
Exercise 6.13. Show that the truncation error of methods given by (6.17) is minimal for α = 2/3. Also show that no such method has order 3 or above.
Different categories of Runge–Kutta methods are abbreviated as follows
Kutta:
0    |
1/2  | 1/2
1    | −1    2
-----+---------------
     | 1/6  2/3  1/6

Nystrom:
0    |
2/3  | 2/3
2/3  | 0    2/3
-----+---------------
     | 1/4  3/8  3/8
The one in the tableau of Kutta’s method shows that this method explicitly
employs an estimate of f at tn + h. Both methods have the same stability
domain shown in Figure 6.9.
An error control device specific to Runge–Kutta methods is embedding
with adaptive step size. We embed a method in a larger method. For example,
let
à 0 c̃
A= , c= ,
aT a c
such that the method given by
c A
bT
Comparison of the two yields an estimate of the local error. We use the method
with the smaller truncation error to estimate the error in the other method.
More specifically, kyn+1 − y(tn+1 )k ≈ kyn+1 − ỹn+1 k. This is then used to
adjust the step size to achieve a desired accuracy. The error estimate is used
to improve the solution. Often the matrix à and vector c̃ are actually not
extended. Both methods use the same matrix of coefficients and nodes. How-
ever, the weights b̃ differ. The methods are described with an extended Butcher
tableau, which is the Butcher tableau of the higher-order method with another
row added for the weights of the lower-order method.
c_1  | a_{1,1}  a_{1,2}  ···  a_{1,ν}
c_2  | a_{2,1}  a_{2,2}  ···  a_{2,ν}
 :   |    :        :      .      :
c_ν  | a_{ν,1}  a_{ν,2}  ···  a_{ν,ν}
-----+--------------------------------
     | b_1      b_2      ···  b_ν
     | b̃_1      b̃_2      ···  b̃_ν
The simplest adaptive Runge–Kutta method involves combining the Heun
method, which is order 2, with the forward Euler method, which is order 1.
The result is the Heun–Euler method and its extended Butcher tableau is
0  |
1  |  1
---+-----------
   | 1/2   1/2
   |  1     0
The zero in the bottom line of the tableau shows that the forward Euler
method does not use the estimate k2 .
The Bogacki–Shampine method has two methods of orders 3 and 2. Its
extended Butcher tableau is shown in Figure 6.10
0    |
1/2  | 1/2
3/4  | 0     3/4
1    | 2/9   1/3   4/9
-----+---------------------------
     | 2/9   1/3   4/9    0
     | 7/24  1/4   1/3    1/8
There are several things to note about this method: firstly, the zero in the
The extended Butcher tableau of the Fehlberg method is
0      |
1/4    | 1/4
3/8    | 3/32        9/32
12/13  | 1932/2197   −7200/2197   7296/2197
1      | 439/216     −8           3680/513     −845/4104
1/2    | −8/27       2            −3544/2565   1859/4104    −11/40
-------+-------------------------------------------------------------------------
       | 16/135      0            6656/12825   28561/56430  −9/50   2/55
       | 25/216      0            1408/2565    2197/4104    −1/5    0
The Cash–Karp method takes the concept of error control and adaptive step size to a whole new level: rather than embedding just one lower-order method, methods of order 1, 2, 3, and 4 are all embedded in a fifth-order method. The
tableau is as displayed in Figure 6.12. Note that the order-one method is the
forward Euler method.
The Cash–Karp method was motivated to deal with the situation when
certain derivatives of the solution are very large for part of the region. These
are, for example, regions where the solution has a sharp front or some derivative of the solution is discontinuous in the limit. In these circumstances the step
is possible to detect sharp fronts or discontinuities before all the function eval-
uations defining the full Runge–Kutta step have been computed. We can then
either accept a lower-order solution or abort the step (and try again with a
smaller step-size), depending on which course of action seems appropriate. J.
Cash provides the code for this algorithm in Fortran, C, and MATLAB on his
homepage.
The Dormand–Prince method is similar to the Fehlberg method. It also
embeds a fourth-order method in a fifth-order method. The coefficients are,
however, chosen so that the error of the fifth-order solution is minimized and
Figure 6.12 The extended Butcher tableau of the Cash–Karp method:
0     |
1/5   | 1/5
3/10  | 3/40         9/40
3/5   | 3/10         −9/10        6/5
1     | −11/54       5/2          −70/27       35/27
7/8   | 1631/55296   175/512      575/13824    44275/110592   253/4096
------+-----------------------------------------------------------------------------------
      | 37/378       0            250/621      125/594        0           512/1771   Order 5
      | 2825/27648   0            18575/48384  13525/55296    277/14336   1/4        Order 4
      | 19/54        0            −10/27       55/54          0           0          Order 3
      | −3/2         5/2          0            0              0           0          Order 2
      | 1            0            0            0              0           0          Order 1
the difference between the solutions is used to estimate the error in the fourth-
order method. The Dormand–Prince method has seven stages, but it uses
only six function evaluations per step, because it has the FSAL property.
The tableau is given in Figure 6.13. This method is currently the default in
MATLAB’s ode45 solver.
Figure 6.13 The extended Butcher tableau of the Dormand–Prince method:
0     |
1/5   | 1/5
3/10  | 3/40          9/40
4/5   | 44/45         −56/15        32/9
8/9   | 19372/6561    −25360/2187   64448/6561    −212/729
1     | 9017/3168     −355/33       46732/5247    49/176         −5103/18656
1     | 35/384        0             500/1113      125/192        −2187/6784     11/84
------+----------------------------------------------------------------------------------------
      | 35/384        0             500/1113      125/192        −2187/6784     11/84     0
      | 5179/57600    0             7571/16695    393/640        −92097/339200  187/2100  1/40
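For illustration, invoking this pair through MATLAB's ode45 requires only a function handle and tolerances; the ODE, the tolerances and the initial condition in the sketch below are assumptions chosen for demonstration.

% Example use of ode45 (Dormand-Prince with embedded error control).
odefun = @(t, y) [y(2); -y(1)];                  % simple harmonic oscillator
opts   = odeset('RelTol', 1e-8, 'AbsTol', 1e-10);
[t, y] = ode45(odefun, [0, 10], [1; 0], opts);   % the step size is chosen adaptively
plot(t, y(:,1))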
Obviously, a_{l,j} = 0 for all j ≥ l yields the standard explicit RK. Otherwise, an RK method is said to be implicit.
One way to derive implicit Runge–Kutta methods is via collocation methods. However, not all Runge–Kutta methods are collocation methods. The idea is to
choose a finite-dimensional space of candidate solutions (usually, polynomials
up to a certain degree) and a number of points in the domain (called collocation
points), and to select that solution which satisfies the given equation at the
collocation points. More precisely, we want to find a ν-degree polynomial p
such that p(tn ) = yn and
p0 (tn + cl h) = f (tn + cl h, p(tn + cl h)), l = 1, . . . , ν,
where cl , l = 1, . . . , ν are the collocation points. This gives ν + 1 conditions,
which matches the ν + 1 parameters needed to specify a polynomial of degree
ν. The new estimate yn+1 is defined to be p(tn+1 ).
As an example pick the two collocation points c1 = 0 and c2 = 1 at the
beginning and the end of the interval [tn , tn+1 ]. The collocation conditions are
p(t_n) = y_n,
p'(t_n) = f(t_n, p(t_n)),
p'(t_n + h) = f(t_n + h, p(t_n + h)).
For these three conditions we need a polynomial of degree 2, which we write
in the form
p(t) = α(t − tn )2 + β(t − tn ) + γ.
The collocation conditions can be solved to give the coefficients
α = \frac{1}{2h}\,[f(t_n + h, p(t_n + h)) − f(t_n, p(t_n))],
β = f(t_n, p(t_n)),
γ = y_n.
Putting these coefficients back into the definition of p and evaluating it at
t = tn+1 gives the method
y_{n+1} = p(t_n + h) = y_n + \tfrac{1}{2}h\,(f(t_n + h, p(t_n + h)) + f(t_n, p(t_n)))
        = y_n + \tfrac{1}{2}h\,(f(t_n + h, y_{n+1}) + f(t_n, y_n)),
and we have recovered Heun’s method.
Now we have
p(t_n + c_k h) = y_n + h \sum_{l=1}^{ν} f(t_n + c_l h, p(t_n + c_l h))\, \frac{1}{w_l(c_l)} \int_0^{c_k} w_l(τ)\,dτ
             = y_n + h \sum_{l=1}^{ν} a_{k,l}\, f(t_n + c_l h, p(t_n + c_l h)).
This and defining k_l = f(t_n + c_l h, p(t_n + c_l h)) gives the intermediate stages of the Runge–Kutta method. Additionally we have
y_{n+1} = p(t_n + h) = y_n + h \sum_{l=1}^{ν} k_l\, \frac{1}{w_l(c_l)} \int_0^1 w_l(τ)\,dτ = p(t_n) + h \sum_{l=1}^{ν} b_l k_l.
This and the definition of the Runge–Kutta method proves the theorem.
1/2 − √3/6  |  1/4           1/4 − √3/6
1/2 + √3/6  |  1/4 + √3/6    1/4
------------+---------------------------
            |  1/2           1/2
√ √
• c1 = 1
2 − 15
10 , c2 = 1
2 , c3 = 1
2 + 15
10 order 6 with tableau
√ √ √
1 15 5 2 15 5 15
2 − 10 36 9 − 15 36 − 30
√ √
1 5 15 2 5 15
2 36 + 24 9 36 − 24
√ √ √
1 15 5 15 2 15 5
2 + 10 36 + 30 9 + 15 36
5 4 5
18 9 18
III    \frac{d^{s-2}}{dx^{s-2}}\left[x^{s-1}(x − 1)^{s-1}\right]    Lobatto
The methods are of order 2s − 1 in the Radau case and of order 2s − 2 in
the Lobatto case. Note that the Radau I methods have 0 as one of their
collocation points, while the Radau II method has 1 as one of its collocation
points. This means that for Radau I the first row in the tableau always consists
entirely of zeros while for Radau II the last row is identical to the vector of
weights. The 2-stage methods are given in Figure 6.14. The letters correspond to certain conditions imposed on A which are however beyond this course (for further information see [3] J. C. Butcher, The numerical analysis of ordinary differential equations: Runge–Kutta and general linear methods).
Figure 6.14 The 2-stage Radau and Lobatto methods:
Radau I                         Lobatto III(A)
0    | 0     0                  0  | 0     0
2/3  | 1/3   1/3                1  | 1/2   1/2
     | 1/4   3/4                   | 1/2   1/2

Radau IA                        Lobatto IIIB
0    | 1/4   −1/4               0  | 1/2   0
2/3  | 1/4   5/12               1  | 1/2   0
     | 1/4   3/4                   | 1/2   1/2

Radau II(A)                     Lobatto IIIC
1/3  | 5/12  −1/12              0  | 1/2   −1/2
1    | 3/4   1/4                1  | 1/2   1/2
     | 3/4   1/4                   | 1/2   1/2
For the three stages we have the Radau methods as specified in Figure
6.15 and the Lobatto methods as in Figure 6.16.
Next we examine the stability domain of Runge–Kutta methods by con-
sidering the test equation y 0 = λy = f (t, y). Firstly, we get for the internal
stages the relations
k = λ (yn 1 + hAk) .
Figure 6.15 The 3-stage Radau methods:
Radau I
0           | 0                 0                 0
(6−√6)/10   | (9+√6)/75         (24+√6)/120       (168−73√6)/600
(6+√6)/10   | (9−√6)/75         (168+73√6)/600    (24−√6)/120
            | 1/9               (16+√6)/36        (16−√6)/36

Radau IA
0           | 1/9               (−1−√6)/18        (−1+√6)/18
(6−√6)/10   | 1/9               (88+7√6)/360      (88−43√6)/360
(6+√6)/10   | 1/9               (88+43√6)/360     (88−7√6)/360
            | 1/9               (16+√6)/36        (16−√6)/36

Radau II(A)
(4−√6)/10   | (88−7√6)/360      (296−169√6)/1800  (−2+3√6)/225
(4+√6)/10   | (296+169√6)/1800  (88+7√6)/360      (−2−3√6)/225
1           | (16−√6)/36        (16+√6)/36        1/9
            | (16−√6)/36        (16+√6)/36        1/9
Solving for k,
k = λ y_n (I − hλA)^{-1} \mathbf{1}.
Further, we have
y_{n+1} = y_n + h \sum_{l=1}^{ν} b_l k_l = y_n + h\,b^T k,
Figure 6.16 The 3-stage Lobatto methods:
Lobatto III(A)
0    | 0      0     0
1/2  | 5/24   1/3   −1/24
1    | 1/6    2/3   1/6
     | 1/6    2/3   1/6

Lobatto IIIB
0    | 1/6    −1/6  0
1/2  | 1/6    1/3   0
1    | 1/6    5/6   0
     | 1/6    2/3   1/6

Lobatto IIIC
0    | 1/6    −1/3  1/6
1/2  | 1/6    5/12  −1/12
1    | 1/6    2/3   1/6
     | 1/6    2/3   1/6
therefore
y_{n+1} = y_n + \tfrac{1}{4}hk_1 + \tfrac{3}{4}hk_2 = \frac{1 + \tfrac{1}{3}hλ}{1 − \tfrac{2}{3}hλ + \tfrac{1}{6}(hλ)^2}\,y_n.
The corresponding stability function is
R(z) = \frac{1 + \tfrac{1}{3}z}{1 − \tfrac{2}{3}z + \tfrac{1}{6}z^2}.
Figure 6.17 illustrates the stability region given by this stability function.
We prove A-stability by the following technique. According to the maximum modulus principle, if g is analytic in the closed complex domain V, then |g| attains its maximum on the boundary ∂V. We let g = R. This is a rational function, hence its only singularities are the poles 2 ± i√2, which are the roots of the denominator, and g is analytic in V = cl C⁻ = {z ∈ ℂ : Re z ≤ 0}. Therefore it attains its maximum on ∂V = iℝ and the following statements are equivalent
Figure 6.17 Stability domain of the method given in (6.20). The insta-
bility region is white.
k_1 = f(t_n, y_n),
k_2 = f(t_n + \tfrac{1}{3}h,\; y_n + \tfrac{1}{3}hk_1),
k_3 = f(t_n + \tfrac{2}{3}h,\; y_n − \tfrac{1}{3}hk_1 + hk_2),
k_4 = f(t_n + h,\; y_n + hk_1 − hk_2 + hk_3),
y_{n+1} = y_n + \tfrac{1}{8}h(k_1 + 3k_2 + 3k_3 + k_4).
By applying it to the equation y' = y, show that the order is at most four.
Then, for scalar functions, prove that the order is at least four in the easy
case when f is independent of y, and that the order is at least three in the
relatively easy case when f is independent of t. (Thus you are not expected to
do Taylor expansions when f depends on both y and t.)
A solution to this system gives a solution for (6.1) by removing the first
component.
Suppose zn = (tn , yn )T . Now let lj = (1, kj ).
l_i = g\!\left(z_n + h\sum_{j=1}^{ν} a_{i,j} l_j\right)
    = g\!\left(t_n + h\sum_{j=1}^{ν} a_{i,j},\; y_n + h\sum_{j=1}^{ν} a_{i,j} k_j\right)
    = g\!\left(t_n + hc_i,\; y_n + h\sum_{j=1}^{ν} a_{i,j} k_j\right)
    = \left(1,\; f\!\left(t_n + hc_i,\; y_n + h\sum_{j=1}^{ν} a_{i,j} k_j\right)\right)^T = (1, k_i)^T.
Additionally we have
z_{n+1} = z_n + h\sum_{i=1}^{ν} b_i l_i = (t_n, y_n)^T + h\sum_{i=1}^{ν} b_i (1, k_i)^T
        = \left(t_n + h\sum_{i=1}^{ν} b_i,\; y_n + h\sum_{i=1}^{ν} b_i k_i\right)^T
        = \left(t_{n+1},\; y_n + h\sum_{i=1}^{ν} b_i k_i\right)^T,
since \sum_{i=1}^{ν} b_i = 1 and t_{n+1} = t_n + h.
For simplification we restrict ourselves to one-dimensional, autonomous
systems in the following analysis. The Taylor expansion will produce terms of the form y' = f, y'' = f_y f, y''' = f_{yy} f² + f_y(f_y f) and y^{(4)} = f_{yyy} f³ + 4f_{yy} f_y f² + f_y³ f. The f, f_y f, f_y² f, f_{yy} f², etc., are called elementary differentials. Every derivative of y is a linear combination with positive integer coefficients of elementary differentials.
A convenient way to represent elementary differentials is by rooted trees. For example, f is represented by the single-vertex tree T_0, while f_y f is represented by the two-vertex tree T_1. The elementary differentials f_{yy} f² and f_y(f_y f) are represented by the two three-vertex trees T_2 and T_3.
With the fourth derivative of y, it becomes more interesting, since the elementary differential f_{yy} f_y f² arises in the differentiation of the first term as well as the second term of y'''. This corresponds to two different trees and the distinction is important since in several variables these correspond to differentiation matrices which do not commute. In Figure 6.18 we see the two different trees for the same elementary differential.
Φ(T ) together with the usual Taylor series coefficients give the coefficients
of the elementary differentials when expanding the Runge–Kutta method. In
order for the Taylor expansion of the true solution and the Taylor expansion of
the Runge–Kutta method to match up to the hp terms, we need the coefficients
of the elementary differentials to be the same. This implies for all trees T with
p vertices or less,
Φ(T) = \frac{1}{γ(T)},
where γ(T ) is the density of the tree which is defined to be the product of the
number of vertices of T with the number of vertices of all possible trees after
order p                1   2   3   4   5    6    7    8    ...   12
number of conditions   1   2   4   8   17   37   85   200  ...   7813
k_1 = f(y_n),
k_2 = f(y_n + (1 − α)hk_1 + αhk_2),
y_{n+1} = y_n + \frac{h}{2}(k_1 + k_2),
where α is a real parameter.
Exercise 6.16. Consider the multistep method for numerical solution of the
differential equation y0 = f (t, y):
\sum_{l=0}^{s} ρ_l y_{n+l} = h \sum_{l=0}^{s} σ_l f(t_{n+l}, y_{n+l}), \qquad n = 0, 1, \ldots.
Exercise 6.17. Consider the multistep method for numerical solution of the
differential equation y0 = f (t, y):
\sum_{l=0}^{s} ρ_l y_{n+l} = h \sum_{l=0}^{s} σ_l f(t_{n+l}, y_{n+l}), \qquad n = 0, 1, \ldots.
(f) Give the conditions on ρ(w) = \sum_{l=0}^{s} ρ_l w^l that ensure convergence.
(g) Hence determine for what values of θ and σ_0, σ_1, σ_2 the two-step method
y_{n+2} − (1 − θ)y_{n+1} − θy_n = h[σ_0 f(t_n, y_n) + σ_1 f(t_{n+1}, y_{n+1}) + σ_2 f(t_{n+2}, y_{n+2})]
assuming that yn , yn+1 , . . . , yn+s−1 are available. The following complex poly-
nomials are defined:
ρ(w) = \sum_{l=0}^{s} ρ_l w^l, \qquad σ(w) = \sum_{l=0}^{s} σ_l w^l.
(a) When is the method given by (6.21) explicit and when implicit?
(b) Derive a condition involving the polynomials ρ and σ which is equivalent
to the s-step method given in (6.21) being of order p.
(c) Define another polynomial and state (no proof required) a condition for
the method in (6.21) to be A-stable.
(d) Describe the boundary locus method to find the boundary of the stability
domain for the method given in (6.21).
(e) What is ρ for the Adams methods and what is the difference between
Adams–Bashforth and Adams–Moulton methods?
(f ) Let s = 1. Derive the Adams–Moulton method of the form
(a) Express the second and third derivative of y in terms of f and its deriva-
tives. Write the Taylor expansion of y(t + h) in terms of f and its deriva-
tives up to O(h4 ).
(b) The differential equation is solved by the Runge–Kutta scheme
k_1 = hf(y_n),
k_2 = hf(y_n + k_1),
y_{n+1} = y_n + \tfrac{1}{2}(k_1 + k_2).
(d) Apply the Runge–Kutta scheme given in (b) to the linear test equation
from part (c) and find an expression for the linear stability domain of the
method. Is the method A-stable?
(e) We now modify the Runge–Kutta scheme in the following way
k_1 = hf(y_n),
k_2 = hf(y_n + a(k_1 + k_2)),
y_{n+1} = y_n + \tfrac{1}{2}(k_1 + k_2),
CHAPTER 7
Numerical Differentiation
f'(x) \approx \frac{f(x + h) - f(x)}{h}.
This approximation is generally called a difference quotient. The important
question is how should h be chosen.
Suppose f (x) can be differentiated at least three times, then from Taylor’s
theorem we can write
f(x + h) = f(x) + hf'(x) + \frac{h^2}{2} f''(x) + O(h^3).
After rearranging, we have
f'(x) = \frac{f(x + h) - f(x)}{h} - \frac{h}{2} f''(x) + O(h^2).
The first term is the approximation to the derivative. Thus the absolute value of the discretization error or local truncation error is approximately \tfrac{h}{2}|f''(x)|.
We now turn to the rounding error. The difference quotient uses the floating point representations
f(x + h)^* = f(x + h) + ε_{x+h},
f(x)^* = f(x) + ε_x.
Thus the representation of the approximation is given by
\frac{f(x + h) - f(x)}{h} + \frac{ε_{x+h} - ε_x}{h},
where we assume for simplicity that the difference and division have been calculated exactly. If f(x) can be evaluated with a relative error of approximately macheps, we can assume that
|ε_{x+h} - ε_x| \le \mathrm{macheps}\,(|f(x)| + |f(x + h)|).
Thus the rounding error is at most \mathrm{macheps}\,(|f(x)| + |f(x + h)|)/h.
The main point to note is that as h decreases, the discretization error
decreases, but the rounding error increases, since we are dividing by h. The
ideal choice of h is the one which minimizes the total error. This is the case
where the absolute values of the discretization error and the rounding error
become the same
(h/2)|f″(x)| = macheps (|f(x)| + |f(x + h)|)/h.
However, this involves unknown quantities. Assuming that (1/2)|f″(x)| and |f(x)| + |f(x + h)| are of order O(1), a good choice for h would satisfy h² ≈ macheps, or in other words h ≈ √macheps. The total absolute error in the approximation is then of order O(√macheps). However, since we assumed that |f(x)| = O(1) in the above analysis, a more realistic choice for h would be h ≈ √(macheps |f(x)|). This, however, does not deal with the assumption on |f″(x)|.
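The trade-off between discretization and rounding error is easy to observe numerically. The following MATLAB lines are a minimal sketch (the test function sin and the range of step sizes are illustrative choices, not taken from the text); they evaluate the difference quotient for decreasing h and report where the total error is smallest, which lies near √macheps:

  % Forward difference error for f(x) = sin(x) at x = 1 (illustrative choice)
  f = @sin; df = @cos; x = 1;
  h = 10.^(-1:-1:-15);                      % decreasing step sizes
  approx = (f(x + h) - f(x)) ./ h;          % difference quotients
  err = abs(approx - df(x));                % total absolute error
  [~, k] = min(err);
  fprintf('best h = %.1e, error = %.1e, sqrt(eps) = %.1e\n', h(k), err(k), sqrt(eps));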
Exercise 7.1. List the assumptions made in the analysis and give an example
where at least one of these assumptions does not hold. What does this mean
in practice for the approximation of derivatives?
E = e^{hD}
and thus
∆₊ = (1/h)(e^{hD} − I) = D + O(h),
∆₋ = (1/h)(I − e^{−hD}) = D + O(h).
It follows that the forward and backward difference operators approximate
the differential operator, or in other words the first derivative with an error
of O(h).
Both the averaging and the central difference operator are not well-defined
on a grid. However, we have
δ²f(x) = f(x + h) − 2f(x) + f(x − h),
δµ₀ f(x) = (1/2)(f(x + h) − f(x − h)).
Now (1/h²)δ² approximates the second derivative with an error of O(h²), because
(1/h²)δ² = (1/h²)(e^{hD} − 2I + e^{−hD}) = D² + O(h²).
On the other hand, we have
(1/h)δµ₀ = (1/2h)(e^{hD} − e^{−hD}) = D + O(h²).
Hence the combination of central difference and averaging operator gives a
better approximation to the first derivative. We can also achieve higher accu-
racy by using the sum of the forward and backward difference
(1/2)(∆₊ + ∆₋) = (1/2h)(e^{hD} − I + I − e^{−hD}) = D + O(h²).
For odd n the nth central difference is again not well-defined on a grid, but
this can be alleviated as before by combining one central difference with the
averaging operator. After dividing by hn the nth order forward and backward
differences approximate the nth derivative with an error of O(h), while the nth
order central difference (if necessary combined with the averaging operator)
approximates the nth derivative with an error of O(h2 ).
Combination of higher-order differences can also be used to construct bet-
ter approximations. For example,
(∆₊ − (h/2)∆₊²)f(x) = −(1/2h)(f(x + 2h) − 4f(x + h) + 3f(x))
can be written with the shift and differential operator as
−(1/2h)(E² − 4E + 3I) = −(1/2h)(e^{2hD} − 4e^{hD} + 3I) = D + O(h²).
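As a numerical illustration (not from the original text; the test function exp and the step sizes are arbitrary choices), the following MATLAB lines compare the simple forward difference with the one-sided combination above, confirming the O(h) versus O(h²) behaviour:

  % One-sided approximations to f'(x) for f = exp at x = 0 (illustrative choice)
  f = @exp; x = 0; exact = 1;
  for h = [1e-1 1e-2 1e-3]
      d1 = (f(x+h) - f(x)) / h;                          % error O(h)
      d2 = (-f(x+2*h) + 4*f(x+h) - 3*f(x)) / (2*h);      % error O(h^2)
      fprintf('h=%.0e  O(h) err=%.2e  O(h^2) err=%.2e\n', h, abs(d1-exact), abs(d2-exact));
  end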
The drawback is that more grid points need to be employed; the number of points used is called the bandwidth. Special schemes are then necessary at the boundaries.
PDEs
D^α u = ∂^{|α|}u/∂x^α = ∂^{|α|}u/(∂x_1^{α_1} ⋯ ∂x_n^{α_n}).
Further, let D^k u = {D^α u : |α| = k}, the collection of all derivatives of order k. For k = 1, Du = (∂u/∂x_1, …, ∂u/∂x_n) is the gradient of u. For k = 2, D²u is the Hessian matrix of second derivatives given by
[ ∂²u/∂x_1²       ∂²u/∂x_1∂x_2   ⋯   ∂²u/∂x_1∂x_n ]
[ ∂²u/∂x_2∂x_1    ∂²u/∂x_2²      ⋯   ∂²u/∂x_2∂x_n ]
[      ⋮               ⋮          ⋱        ⋮      ]
[ ∂²u/∂x_n∂x_1    ∂²u/∂x_n∂x_2   ⋯   ∂²u/∂x_n²    ]
Note that the Hessian matrix is symmetric, since it does not matter in which
order partial derivatives are taken. (The Hessian matrix of a scalar valued
function should not be confused with the Jacobian matrix of a vector valued
function which we encountered in the chapter on non-linear systems.)
In other words, the PDE is linear with regard to the derivatives of highest degree, but nonlinear for lower derivatives.
Definition 8.3 (Quasilinear PDE). The PDE is called quasilinear if it has the form
∑_{|α|=k} c_α(x, u(x), Du(x), …, D^{k−1}u(x)) D^α u(x) + G(x, u(x), Du(x), …, D^{k−1}u(x)) = 0.
For further classification, we restrict ourselves to quasilinear PDEs of order 2, of the form
∑_{i,j=1}^{n} a_{ij} ∂²u/(∂x_i ∂x_j) − f = 0,                  (8.3)
where the coefficients a_{ij} and f are allowed to depend on x, u, and the gradient of u. Without loss of generality we can assume that the matrix A = (a_{ij}) is symmetric, since otherwise, using the equality of mixed partial derivatives, we could rewrite the PDE according to
∑_{i,j=1}^{n} a_{ij} ∂²u/(∂x_i ∂x_j) = ∑_{i,j=1}^{n} (1/2)(a_{ij} + a_{ji}) ∂²u/(∂x_i ∂x_j),
and the matrix B = (b_{ij}) with coefficients b_{ij} = (1/2)(a_{ij} + a_{ji}) is symmetric.
Definition 8.4. Let λ1 (x), . . . , λn (x) ∈ R be the eigenvalues of the symmetric
coefficient matrix A = (aij ) of the PDE given by (8.3) at a point x ∈ Rn . The
PDE is
parabolic at x, if there exists j ∈ {1, …, n} for which λ_j(x) = 0,
elliptic at x, if λ_i(x) > 0 for all i = 1, …, n,
hyperbolic at x, if λ_j(x) > 0 for one j ∈ {1, …, n} and λ_i(x) < 0 for all i ≠ j, or if λ_j(x) < 0 for one j ∈ {1, …, n} and λ_i(x) > 0 for all i ≠ j.
Exercise 8.1. Consider the PDE
∇2 denotes the Laplace operator (in three dimensions). For the mathematical
treatment it is sufficient to consider the case α = 1. We restrict ourselves
further and consider the one-dimensional case u(x, t) specified by
∂u/∂t = ∂²u/∂x²,                  (8.4)
for 0 ≤ x ≤ 1 and t ≥ 0. Initial conditions u(x, 0) describe the state of the
system at the beginning and Dirichlet boundary conditions u(0, t) and u(1, t)
show how the system changes at the boundary over time. The most common
example is that of a metal rod heated at a point in the middle. After a long
enough time the rod will have constant temperature everywhere.
The initial condition gives the vector u0 . Using the boundary conditions
un0 and unm+1 when necessary, (8.7) can be advanced from un to un+1 for
n = 0, 1, 2, . . ., since from one time step to the next the right-hand side of
Equation (8.7) is known. This is an example of a time marching scheme.
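A minimal MATLAB sketch of such a time marching scheme is given below (the initial condition and the number of time steps are illustrative choices, not taken from the text; the boundary values are zero):

  % Explicit time marching for u_t = u_xx on [0,1] with zero boundary values
  M = 49; dx = 1/(M+1); mu = 0.5; dt = mu*dx^2;    % mu <= 1/2 for convergence
  x = (1:M)'*dx;
  u = sin(pi*x);                                   % illustrative initial condition
  for n = 1:200
      u = u + mu*([0; u(1:end-1)] - 2*u + [u(2:end); 0]);  % u_{m-1} - 2u_m + u_{m+1}
  end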
Keeping µ fixed and letting ∆x → 0 (which also implies that ∆t → 0, since
∆t = µ(∆x)2 ), the question is: Does for every T > 0 the point approximation
unm converge uniformly to u(x, t) for m∆x → x ∈ [0, 1] and n∆t → t ∈ [0, T ]?
The method here has an extra parameter, µ. It is entirely possible for a method
to converge for some choice of µ and diverge otherwise.
Theorem 8.1. µ ≤ 1/2 ⇒ convergence.
for every constant T > 0. Since u satisfies the heat equation, we can equate (8.5) and (8.6), which gives
(1/∆t)[u(x, t + ∆t) − u(x, t)] + O(∆t) = (1/(∆x)²)[u(x − ∆x, t) − 2u(x, t) + u(x + ∆x, t)] + O((∆x)²).
Rearranging yields
Subtracting this from (8.7) and using ∆t = µ(∆x)², it follows that there exists C > 0 such that
|e_m^{n+1}| ≤ |e_m^n + µ(e_{m−1}^n − 2e_m^n + e_{m+1}^n)| + C(∆x)⁴,
where we used the triangle inequality and properties of the O-notation. Let e^n_max := max_{m=1,…,M} |e_m^n|. Then
|e_m^{n+1}| ≤ |µe_{m−1}^n + (1 − 2µ)e_m^n + µe_{m+1}^n| + C(∆x)⁴ ≤ (2µ + |1 − 2µ|)e^n_max + C(∆x)⁴,
and since 1 − 2µ ≥ 0,
|e_m^{n+1}| ≤ (2µ + 1 − 2µ)e^n_max + C(∆x)⁴ = e^n_max + C(∆x)⁴.
for all x ∈ Rn .
Returning to the previous example, the PDE is
(D_t − D_x²)u(x, t) = 0,
while F is given by
F u(x, t) = [E_{∆t} − I − µ(E_{∆x}^{−1} − 2I + E_{∆x})] u(x, t),
where E_{∆t} and E_{∆x} denote the shift operators in the t and x directions respectively. Since µ = ∆t/(∆x)², ∆x = √(∆t/µ) is a fixed function of ∆t. Denoting the differentiation operator in the t direction by D_t and the differentiation operator in the x direction by D_x, we have
E_{∆t} = e^{∆t D_t} and E_{∆x} = e^{∆x D_x}.
Then F becomes
F u(x, t) = [e^{∆t D_t} − I − (∆t/(∆x)²)(e^{−∆x D_x} − 2I + e^{∆x D_x})] u(x, t)
          = [∆t D_t + (1/2)(∆t D_t)² + ⋯ − ∆t(D_x² + (1/12)(∆x)² D_x⁴ + ⋯)] u(x, t)
          = ∆t(D_t − D_x²)u(x, t) + O((∆t)²),
since (∆x)2 = O(∆t).
Note, since ∆x and ∆t are linked by a fixed function, the local error can
be expressed both in terms of ∆x or ∆t.
T qk = λk qk
hold for k = 1, . . . , M .
Thus the eigenvalues of A are given by
(1 − 2µ) + 2µ cos(πk/(M + 1)) = 1 − 4µ sin²(πk/(2M + 2)),   k = 1, …, M.
Note that
0 < sin²(πk/(2M + 2)) < 1.
For µ ≤ 1/2, the maximum modulus of the eigenvalues satisfies
|1 − 4µ sin²(πM/(2M + 2))| ≤ 1.
This is the spectral radius of A and thus the matrix norm ‖A‖. Hence
du_m(t)/dt = (1/(∆x)²)(u_{m−1}(t) − 2u_m(t) + u_{m+1}(t)),   m = 1, …, M.
If this system is solved by the forward Euler method the resulting scheme
is (8.7), while backward Euler yields
u_m^{n+1} − µ(u_{m−1}^{n+1} − 2u_m^{n+1} + u_{m+1}^{n+1}) = u_m^n.
B = tridiag(−µ/2, 1 + µ, −µ/2),   C = tridiag(µ/2, 1 − µ, µ/2),
i.e., B is the tridiagonal symmetric Toeplitz matrix with 1 + µ on the diagonal and −µ/2 on both off-diagonals, and C is the tridiagonal symmetric Toeplitz matrix with 1 − µ on the diagonal and µ/2 on both off-diagonals.
| (1 − 2µ sin²(πk/(2(M + 1)))) / (1 + 2µ sin²(πk/(2(M + 1)))) | ≤ 1,   k = 1, …, M.
Thus we can deduce that the Crank–Nicolson method is stable for all µ > 0
and we only need to consider accuracy in our choice of ∆t versus ∆x.
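One Crank–Nicolson step amounts to solving the tridiagonal system Bu^{n+1} = Cu^n. A short MATLAB sketch follows (an illustration under assumed data: the initial condition and the value of µ are arbitrary choices):

  % Crank-Nicolson steps B*u_new = C*u_old for u_t = u_xx, zero boundary values
  M = 49; dx = 1/(M+1); mu = 2;                 % stable for every mu > 0
  e = ones(M,1);
  T = spdiags([e -2*e e], -1:1, M, M);          % TST second-difference matrix
  B = speye(M) - mu/2*T;                        % diagonal 1+mu, off-diagonals -mu/2
  C = speye(M) + mu/2*T;                        % diagonal 1-mu, off-diagonals  mu/2
  u = sin(pi*(1:M)'*dx);                        % illustrative initial condition
  for n = 1:50
      u = B \ (C*u);                            % tridiagonal solve in each time step
  end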
This technique is the eigenvalue analysis of stability. More generally, sup-
pose that a numerical method (for a PDE with zero boundary conditions) can
be written in the form
u^{n+1}_{∆x} = A_{∆x} u^n_{∆x},
Stability can now be defined as preserving the boundedness of un∆x with re-
spect to the chosen norm k · k, and it follows from the inequality above that
the method is stable if
‖A_{∆x}‖ ≤ 1 as ∆x → 0.
‖u_{∆x}‖_{∆x} = (∆x ∑_{i=1}^{M} |u_i|²)^{1/2}.
Note that the dimensions depend on ∆x. The reason for the factor ∆x^{1/2} is because of
‖u_{∆x}‖_{∆x} = (∆x ∑_{i=1}^{M} |u_i|²)^{1/2}  →  (∫_0^1 |u(x)|² dx)^{1/2}   as ∆x → 0,
where Ii−1 denotes the (i−1)×(i−1) identity matrix, B (i) is a (M −i)×(M −i)
submatrix, bii is a scalar, and bi is a (M − i) × 1 vector. For i = 1, I0 is a
matrix of size 0, i.e., it is not there. We let
L_i := [ I_{i−1}   0            0       ]
       [ 0         √(b_ii)      0       ]
       [ 0         (1/√(b_ii)) b_i   I_{n−i} ]
Since bi bTi is an outer product, this algorithm is also called the outer prod-
uct version (other names are the Cholesky–Banachiewicz algorithm and the
Cholesky–Crout algorithm). After M steps, we get A(M +1) = I, since B (M )
and bM are of size 0. Hence, the lower triangular matrix L of the factorization
is L := L1 L2 . . . LM .
For TST matrices we have
A^{(1)} = [ a_{11}  b_1^T ;  b_1  B^{(1)} ],
where b_1^T = (a_{12}, 0, …, 0). Note that B^{(1)} is a tridiagonal symmetric matrix. So
L_1 = [ √(a_{11})  0  0 ;  a_{12}/√(a_{11})  1  0 ;  0  0  I_{n−2} ]   and   A^{(2)} = [ 1  0 ;  0  B^{(1)} − (1/a_{11}) b_1 b_1^T ].
Now the matrix formed by b_1 b_1^T has only one non-zero entry in the top left corner, which is a_{12}². It follows that B^{(1)} − (1/a_{11}) b_1 b_1^T is again a tridiagonal
symmetric positive-definite matrix. Thus the algorithm calculates successively
smaller tridiagonal symmetric positive-definite matrices where only the top
left element has to be updated. Additionally the matrices Li differ from the
identity matrix in only two entries. Thus the factorization can be calculated
in O(M ) operations. We will see later when discussing the Hockney algorithm
how the structure of the eigenvectors can be used to obtain solutions to this
system of equations using the fast Fourier transform.
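As an illustration of the outer product description (a sketch, not code from the text; the example matrix and the storage by diagonal and subdiagonal are assumptions), the following MATLAB lines compute the Cholesky factor of a symmetric positive-definite tridiagonal matrix in O(M) operations, updating only the top left element of the remaining block in every step:

  % O(M) Cholesky factorization A = L*L' of an SPD tridiagonal matrix
  M = 6;
  d = 2*ones(M,1); s = -ones(M-1,1);       % e.g. the TST matrix tridiag(-1, 2, -1)
  A = diag(d) + diag(s,-1) + diag(s,1);    % kept only for the check at the end
  ld = zeros(M,1); ls = zeros(M-1,1);      % diagonal and subdiagonal of L
  for i = 1:M
      ld(i) = sqrt(d(i));                  % square root of the current top left element
      if i < M
          ls(i) = s(i)/ld(i);
          d(i+1) = d(i+1) - ls(i)^2;       % only the next diagonal entry is updated
      end
  end
  L = diag(ld) + diag(ls,-1);
  disp(norm(L*L' - A))                     % of the order of machine epsilon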
Lemma 8.1 (Parseval's identity). For any sequence v, we have the identity ‖v‖ = ‖v̂‖.
Proof. We have
∫_{−π}^{π} e^{−ilθ} dθ = 2π if l = 0, and 0 if l ≠ 0.
So by definition,
‖v̂‖² = (1/2π) ∫_{−π}^{π} |∑_{m∈Z} v_m e^{−imθ}|² dθ = (1/2π) ∫_{−π}^{π} ∑_{m∈Z} ∑_{k∈Z} v_m v̄_k e^{−i(m−k)θ} dθ
      = ∑_{m∈Z} ∑_{k∈Z} v_m v̄_k (1/2π) ∫_{−π}^{π} e^{−i(m−k)θ} dθ = ∑_{m∈Z} ∑_{k∈Z} v_m v̄_k δ_{mk} = ‖v‖².
To prove the first direction, let's assume |H(θ)| ≤ 1 for all θ ∈ [−π, π]. Then, by the above equation, |û^n(θ)| ≤ |û⁰(θ)| and it follows that
‖û^n‖² = (1/2π) ∫_{−π}^{π} |û^n(θ)|² dθ ≤ (1/2π) ∫_{−π}^{π} |H(θ)|^{2n} |û⁰(θ)|² dθ ≤ (1/2π) ∫_{−π}^{π} |û⁰(θ)|² dθ = ‖û⁰‖²,
which cannot be at θ∗ , since it takes a finite value there. Hence there exist
θ⁻ < θ* < θ⁺ in [−π, π] such that |H(θ)| ≥ 1 + ε/2 for all θ ∈ [θ⁻, θ⁺]. Let
û⁰(θ) = √(2π/(θ⁺ − θ⁻)) for θ⁻ ≤ θ ≤ θ⁺, and û⁰(θ) = 0 otherwise.
This is a step function over the interval [θ⁻, θ⁺]. We can calculate the sequence which gives rise to this step function by
u⁰_m = (1/2π) ∫_{−π}^{π} û⁰(θ) e^{imθ} dθ = (1/2π) ∫_{θ⁻}^{θ⁺} √(2π/(θ⁺ − θ⁻)) e^{imθ} dθ.
is well defined and continuous, since for x → 0 it tends to the value for x = 0
(which can be verified by using the expansions of the exponentials). It is also
square integrable, since it tends to zero for x → ∞ due to x being in the
denominator. Therefore it is a suitable choice for an initial condition.
On the other hand,
‖û^n‖ = ((1/2π) ∫_{−π}^{π} |H(θ)|^{2n} |û⁰(θ)|² dθ)^{1/2}
      = ((1/2π) ∫_{θ⁻}^{θ⁺} |H(θ)|^{2n} |û⁰(θ)|² dθ)^{1/2}
      ≥ (1 + ε/2)^n ((1/2π) ∫_{θ⁻}^{θ⁺} 2π(θ⁺ − θ⁻)^{−1} dθ)^{1/2}
      = (1 + ε/2)^n → ∞ as n → ∞,
since the bracketed expression in the last line equates to 1. Thus the method is then unstable.
Consider the Cauchy problem for the heat equation and recall the first
method based on solving the semi-discretized problem with forward Euler.
u_m^{n+1} = u_m^n + µ(u_{m−1}^n − 2u_m^n + u_{m+1}^n).
Therefore
H(θ) = 1 + µ(e^{−iθ} − 2 + e^{iθ}) = 1 − 4µ sin²(θ/2),   θ ∈ [−π, π],
and thus 1 ≥ H(θ) ≥ H(π) = 1 − 4µ. Hence the method is stable if and only if µ ≤ 1/2.
On the other hand, for the backward Euler method we have
u_m^{n+1} − µ(u_{m−1}^{n+1} − 2u_m^{n+1} + u_{m+1}^{n+1}) = u_m^n
and therefore
H(θ) = [1 − µ(e^{−iθ} − 2 + e^{iθ})]^{−1} = [1 + 4µ sin²(θ/2)]^{−1} ∈ (0, 1],
which implies stability for all µ > 0.
The Crank–Nicolson scheme given by
u_m^{n+1} − (1/2)µ(u_{m−1}^{n+1} − 2u_m^{n+1} + u_{m+1}^{n+1}) = u_m^n + (1/2)µ(u_{m−1}^n − 2u_m^n + u_{m+1}^n)
results in
H(θ) = (1 + (1/2)µ(e^{−iθ} − 2 + e^{iθ})) / (1 − (1/2)µ(e^{−iθ} − 2 + e^{iθ})) = (1 − 2µ sin²(θ/2)) / (1 + 2µ sin²(θ/2)),
which lies in (−1, 1] for all θ ∈ [−π, π] and all µ > 0.
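The three amplification factors can be compared directly. The following MATLAB lines are a small illustration (the values of µ are arbitrary choices); they evaluate each H(θ) over [−π, π] and print the largest modulus:

  % Amplification factors of Euler, backward Euler and Crank-Nicolson for u_t = u_xx
  theta = linspace(-pi, pi, 1001);
  s2 = sin(theta/2).^2;
  for mu = [0.4 0.6 2]
      He  = 1 - 4*mu*s2;                          % forward Euler
      Hbe = 1 ./ (1 + 4*mu*s2);                   % backward Euler
      Hcn = (1 - 2*mu*s2) ./ (1 + 2*mu*s2);       % Crank-Nicolson
      fprintf('mu=%.1f  max|H|: Euler %.2f, backward Euler %.2f, CN %.2f\n', ...
              mu, max(abs(He)), max(abs(Hbe)), max(abs(Hcn)));
  end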
Exercise 8.2. Apply the Fourier stability test to the difference equation
u_m^{n+1} = (1/2)(2 − 5µ + 6µ²)u_m^n + (2/3)µ(2 − 3µ)(u_{m−1}^n + u_{m+1}^n) − (1/12)µ(1 − 6µ)(u_{m−2}^n + u_{m+2}^n).
Deduce that the test is satisfied if and only if 0 ≤ µ ≤ 2/3.
The eigenvalue stability analysis and the Fourier stability analysis are
tackling two fundamentally different problems. In the eigenvalue framework,
boundaries are incorporated, while in the Fourier analysis we have m ∈ Z,
which corresponds to x ∈ R in the underlying PDE. It is no trivial task to
translate Fourier analysis to problems with boundaries. When either r ≥ 2 or
s ≥ 2 there are not enough boundary values to satisfy the recurrence equations
near the boundary. This means the discretized equations need to be amended
near the boundary and the identity (8.11) is no longer valid. It is not enough
to extend the values u_m^n with zeros for m ∉ {1, 2, …, M}. In general a great
deal of care needs to be taken to combine Fourier analysis with boundary
conditions. How to treat the problem at the boundaries has to be carefully
considered to avoid the instability which then propagates from the boundary
inwards.
With many parabolic PDEs, e.g., the heat equation, the Euclidean norm
of the exact solution decays (for zero boundary conditions) and good methods
share this behaviour. Hence they are robust enough to cope with inwards error
propagation from the boundary into the solution domain, which might occur
when discretized equations are amended there. The situation is more difficult
for many hyperbolic equations, e.g., the wave equation, since the exact solution
keeps the energy (a.k.a. the Euclidean norm) constant and so do many good
methods. In that case any error propagation from the boundary delivers a false
result. Additional mathematical techniques are necessary in this situation.
Exercise 8.3. The Crank–Nicolson formula is applied to the heat equa-
tion ut = uxx on a rectangular mesh (m∆x, n∆t), m = 0, 1, ..., M + 1,
n = 0, 1, 2, ..., where ∆x = 1/(M + 1). We assume zero boundary conditions
u(0, t) = u(1, t) = 0 for all t ≥ 0. Prove that the estimates u_m^n ≈ u(m∆x, n∆t) satisfy the equation
∑_{m=1}^{M} [(u_m^{n+1})² − (u_m^n)²] = −(1/2)µ ∑_{m=1}^{M+1} (u_m^{n+1} + u_m^n − u_{m−1}^{n+1} − u_{m−1}^n)².
This shows that ∑_{m=1}^{M} (u_m^n)² is monotonically decreasing with increasing n and the numerical solution mimics the decaying behaviour of the exact solution. (Hint: Substitute the value of u_m^{n+1} − u_m^n that is given by the Crank–Nicolson formula into the elementary equation ∑_{m=1}^{M} [(u_m^{n+1})² − (u_m^n)²] = ∑_{m=1}^{M} (u_m^{n+1} − u_m^n)(u_m^{n+1} + u_m^n). It is also helpful occasionally to change the index m of the summation by one.)
where Ω is an open connected domain in R2 with boundary ∂Ω. For all (x, y) ∈
∂Ω we have the Dirichlet boundary condition u(x, y) = φ(x, y). We assume
that f is continuous in Ω and that φ is twice differentiable. We lay a square
grid over Ω with uniform spacing of ∆x in both the x and y direction. Further,
we assume that ∂Ω is part of the grid. For our purposes Ω is a rectangle, but
the results hold for other domains.
which produces a local error of O((∆x)2 ). This gives rise to the five-point
method
where fl,m = f (l∆x, m∆x) and ul,m approximates u(l∆x, m∆x). A compact
notation is the computational stencil (also known as computational molecule)
(1/(∆x)²) · [  ·   1   ·
               1  −4   1
               ·   1   ·  ]  u_{l,m} = f_{l,m}                       (8.12)
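For a rectangular Ω the linear system of the five-point method can be assembled compactly with Kronecker products. The following MATLAB sketch is an illustration only (the right-hand side f is an arbitrary choice for which the exact solution is sin(πx)sin(πy)):

  % Five-point method on the unit square with zero boundary values
  M = 20; dx = 1/(M+1);
  e = ones(M,1);
  T = spdiags([e -2*e e], -1:1, M, M);          % 1D second difference
  A = kron(speye(M), T) + kron(T, speye(M));    % five-point Laplacian times (dx)^2
  [X, Y] = meshgrid((1:M)*dx);
  f = -2*pi^2*sin(pi*X).*sin(pi*Y);             % illustrative right-hand side
  u = A \ (dx^2*f(:));                          % approximates sin(pi x) sin(pi y)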
∂²u(x, y)/∂x² = (1/(∆x)²)[−(1/12)u(x − 2∆x, y) + (4/3)u(x − ∆x, y) − (5/2)u(x, y) + (4/3)u(x + ∆x, y) − (1/12)u(x + 2∆x, y)] + O((∆x)⁴)
(1/(∆x)²) · [   ·      ·    −1/12     ·      ·
                ·      ·     4/3      ·      ·
              −1/12   4/3    −5      4/3   −1/12
                ·      ·     4/3      ·      ·
                ·      ·    −1/12     ·      ·  ]  u_{l,m} = f_{l,m}       (8.13)
which produces a local error of O((∆x)4 ). However, the implementation of this
method is more complicated, since at the boundary, values of points outside
the boundary are needed. These values can be approximated by nearby values.
For example, if we require an approximation to u(l∆x, m∆x), where m∆x lies
outside the boundary, we can set
u(l∆x, m∆x) ≈ (1/4)u((l + 1)∆x, (m − 1)∆x) + (1/2)u(l∆x, (m − 1)∆x) + (1/4)u((l − 1)∆x, (m − 1)∆x).       (8.14)
The set of linear algebraic equations has to be modified accordingly to take
these adjustments into account.
Now consider the approximation
u(x − ∆x, y) + u(x + ∆x, y) = [2 + (∆x)² D_x² + O((∆x)⁴)] u(x, y),
where D_x denotes the differential operator in the x-direction. Applying the analogous approximation in the y-direction as well, we obtain
[2 + (∆x)² ∂²/∂x² + O((∆x)⁴)] × [2 + (∆x)² ∂²/∂y² + O((∆x)⁴)] u(x, y) =
4u(x, y) + 2(∆x)² ∂²u/∂x²(x, y) + 2(∆x)² ∂²u/∂y²(x, y) + O((∆x)⁴).
This motivates the computational stencil
(1/(∆x)²) · [ 1/2   ·   1/2
               ·   −2    ·
              1/2   ·   1/2 ]  u_{l,m} = f_{l,m}                     (8.15)
(1/(∆x)²) · [ 1/6   2/3   1/6
              2/3  −10/3  2/3
              1/6   2/3   1/6 ]  u_{l,m} = f_{l,m}                    (8.16)
(1/(∆x)²) · [ 1/6   2/3   1/6
              2/3  −10/3  2/3
              1/6   2/3   1/6 ]  u_{l,m} =
[  ·    1/12    ·
  1/12  −1/3   1/12
   ·    1/12    ·  ]  f_{l,m} + f_{l,m},
which has a local error of O((∆x)4 ), since the (∆x)2 error term is canceled.
Exercise 8.5. Determine the order of the local error of the finite difference
(1/(∆x)²) · [ −1/4   0   1/4
                0    0    0
               1/4   0  −1/4 ]  u_{l,m}
We have seen that the first error term in the nine-point method is
(1/12)(∆x)² (∂⁴u/∂x⁴ + 2 ∂⁴u/∂x²∂y² + ∂⁴u/∂y⁴)(x, y) = (1/12)(∆x)² ∇⁴u(x, y),
while the method given by (8.13) has no (∆x)² error term. Now we can combine these two methods to generate an approximation for ∇⁴. In particular, let the new method be given by 12×(8.16) − 12×(8.13), and dividing by (∆x)² gives
(1/(∆x)⁴) · [  ·    ·    1    ·    ·
               ·    2   −8    2    ·
               1   −8   20   −8    1
               ·    2   −8    2    ·
               ·    ·    1    ·    ·  ]
Exercise 8.6. Verify that the above approximation of ∇4 has a local error of
O((∆x)2 ) and identify the first error term.
Thus, by knowing the first error term, we can combine different finite dif-
ference schemes to obtain new approximations to different partial differential
equations.
So far, we have only looked at equispaced square grids. The boundary,
however, often fails to fit exactly into a square grid. Thus we sometimes need
to approximate derivatives using non-equispaced points at the boundary. In
the interior the grid remains equispaced. For example, suppose we try to
approximate the second directional derivative and that the grid points have
the spacing ∆x in the interior and α∆x at the boundary, where 0 < α ≤ 1.
Using the Taylor expansion, one can see that
(1/(∆x)²) [ (2/(α + 1)) g(x − ∆x) − (2/α) g(x) + (2/(α(α + 1))) g(x + α∆x) ]
= g″(x) + (1/3)(α − 1) g‴(x) ∆x + O((∆x)²),
with error of O(∆x). Note that α = 1 recovers the finite difference with error
O((∆x)2 ) that we have already used. Better approximation can be obtained
by taking two equispaced points on the interior side.
(1/(∆x)²) [ ((α − 1)/(α + 2)) g(x − 2∆x) − (2(α − 2)/(α + 1)) g(x − ∆x) + ((α − 3)/α) g(x) + (6/(α(α + 1)(α + 2))) g(x + α∆x) ]
= g″(x) + O((∆x)²).
Note that B is a TST matrix. The vector b is given by the right-hand sides fl,m
(following the same ordering) and the boundary conditions. More specifically,
if ul,m is such that for example ul,m+1 lies on the boundary, we let ul,m+1 =
φ(l∆x, (m + 1)∆x) and the right-hand side bl,m = fl,m − φ(l∆x, (m + 1)∆x).
The method specified by (8.13) has the associated matrix A which is block pentadiagonal, with B on the diagonal, (4/3)I on the first block off-diagonals and −(1/12)I on the second block off-diagonals, where B is the pentadiagonal matrix with −5 on the diagonal, 4/3 on the first off-diagonals and −1/12 on the second off-diagonals.
Again the boundary conditions need to be incorporated into the right-hand
side, using the approximation given by (8.14) for points lying outside Ω.
Now the nine-point method has the associated block tridiagonal matrix
A = [ B  C
      C  B  C
         ⋱  ⋱  ⋱                                   (8.17)
            C  B  C
               C  B ]
Q D_B Q^T,   Q D_C Q^T
the portion v̂1 is made out of the first components of each of the portions
v1 , . . . , vM , the portion v̂2 is made out of the second components of each
of the portions v1 , . . . , vM , and so on. Permutations come essentially at no
computational cost since in practice we store c, v as 2D arrays (which are
addressed accordingly) and not in one long vector. This yields a new system
diag(Λ_1, Λ_2, …, Λ_M) (v̂_1, v̂_2, …, v̂_M)^T = (ĉ_1, ĉ_2, …, ĉ_M)^T,
where Λ_k is the tridiagonal matrix with λ_k^B on the diagonal and λ_k^C on both off-diagonals.
the even and odd portions of x̂. Note that both x̂E and x̂O have period
n/2 = 2L−1 . Suppose we already know the inverse DFT of both the short
sequences xE and xO . Then it is possible to assemble x in a small number of
operations. Remembering wnn = 1,
x_l = ∑_{j=0}^{2^L−1} w_{2^L}^{jl} x̂_j = ∑_{j=0}^{2^{L−1}−1} w_{2^L}^{2jl} x̂_{2j} + ∑_{j=0}^{2^{L−1}−1} w_{2^L}^{(2j+1)l} x̂_{2j+1}
    = ∑_{j=0}^{2^{L−1}−1} w_{2^{L−1}}^{jl} x̂_j^E + ω_{2^L}^{l} ∑_{j=0}^{2^{L−1}−1} w_{2^{L−1}}^{jl} x̂_j^O = x_l^E + ω_{2^L}^{l} x_l^O.
[Diagram: butterfly combination of the two half-length transforms, with the input taken in the bit-reversed order 0, 4, 2, 6, 1, 5, 3, 7.]
The eigenvalues of Λ_i are λ_i^B + 2 cos(jπ/(M + 1)), j = 1, …, M. We deduce that the eigenvalues of the system are
λ_{i,j} = (1/4)(2 cos(iπ/(M + 1)) + 2 cos(jπ/(M + 1))) = 1 − (sin²(iπ/(2(M + 1))) + sin²(jπ/(2(M + 1)))).
Hence all the eigenvalues are smaller than 1 in modulus, since i and j range from 1 to M, guaranteeing convergence; however, the spectral radius is close to 1, being 1 − 2 sin²(π/(2(M + 1))) ≈ 1 − π²/(2M²). The larger M, the closer the spectral radius is to 1.
Let e(k) be the error in the k th iteration and let vi,j be the orthonormal
eigenvectors. We can expand the error with respect to this basis.
e^{(k)} = ∑_{i,j=1}^{M} e_{i,j}^{(k)} v_{i,j}.
Iterating, we have
e^{(k)} = H^k e^{(0)}  ⇒  |e_{i,j}^{(k)}| = |λ_{i,j}|^k |e_{i,j}^{(0)}|.
Thus the components of the error (with respect to the basis of eigenvectors)
decay at a different rate for different values of i, j, which are the frequencies.
We separate those into low frequencies (LF) where both i and j lie in [1, (M+1)/2), which results in the angles lying between zero and π/4, high frequencies (HF) where both i and j lie in [(M+1)/2, M], which results in the angles lying between π/4 and π/2, and mixed frequencies (MF) where one of i and j lies in [1, (M+1)/2) and the other lies in [(M+1)/2, M]. Let us determine the least factor by which the amplitudes of the mixed frequencies are damped. Either i or j is at least (M+1)/2 while the other is at most (M+1)/2, and thus sin²(iπ/(2(M+1))) ∈ [0, 1/2] while sin²(jπ/(2(M+1))) ∈ [1/2, 1]. It follows that
1 − (sin²(iπ/(2(M + 1))) + sin²(jπ/(2(M + 1)))) ∈ [−1/2, 1/2]
for i, j in the mixed frequency range. Hence the amplitudes are damped by at least a factor of 1/2, which matches the observations. But how can this be used to speed up convergence?
Firstly, let’s look at the high frequencies. Good damping of those is
achieved by using the Jacobi method with relaxation (also known as damped
Jacobi) which gives a damping factor of 3/5. The proof of this is left to the
reader in the following exercise.
Exercise 8.8. Find a formula for the eigenvalues of the iteration matrix of
the Jacobi with relaxation and an optimal value for the relaxation parameter
for the MF and HF components combined.
From the analysis and the results from the exercise, we can deduce that
the damped Jacobi method converges fast for HF and MF. This is also true
for the damped Gauss–Seidel method. For the low frequencies we note that
these are the high frequencies if we consider a grid with spacing 2∆x instead
of ∆x.
To examine this further we restrict ourselves to one space dimension. The
matrix arising from approximating the second derivative is given by
[ −2   1
   1  −2   1
       ⋱   ⋱   ⋱
           1  −2   1
               1  −2 ]
for j = 1, . . . , n/2 − 1. This is now the eigenvector of a matrix with the same
form as in (8.18) but with dimension (n/2 − 1) × (n/2 − 1). The corresponding
eigenvalue is
λ_k = cos(kπ/(n/2)).
For k = n/4, where we had a slow reduction in error on the fine grid, we now have
λ_k = cos((n/4)π/(n/2)) = cos(π/2) = 0,
the fastest reduction possible.
The idea of the multigrid method is to cover the square domain by a range
of nested grids, of increasing coarseness, say,
Ω∆x ⊂ Ω2∆x ⊂ · · · ,
the equations directly), and back to the finest. Whenever we coarsen the grid,
we compute the residual
where ∆x is the size of the grid we are coarsening and x∆x is the current
solution on this grid. The values of the residual are restricted to the coarser
grid by combining nine fine values by the restriction operator R
[ 1/4  1/2  1/4
  1/2   1   1/2
  1/4  1/2  1/4 ]
Thus the new value on the coarse grid is an average of the fine value at this
point and its eight neighbouring fine values. Then we solve for the residual,
i.e., we iterate to solve the equations
• if the point has two coarse neighbours at each side, the value is the
average of those two,
• if the point has four coarse neighbours (top left, top right, bottom left,
bottom right), the value is the average of those four.
We then correct x∆x by P x2∆x .
Usually only a moderate number of iterations (3 to 5) is employed in each
restriction to solve (8.19). At each prolongation one or two iterations are
necessary to remove high frequencies which have been reintroduced by the
prolongation. We check for convergence at the end of each sweep. We repeat
the sweeps until convergence has occurred.
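Restricting to one space dimension, as in the analysis above, the restriction–solve–prolong–correct pattern can be sketched in a few MATLAB lines. This is an illustration only: the model problem, the damping parameter w and the numbers of smoothing steps are arbitrary choices, and only two grids are used rather than a full hierarchy.

  % One-dimensional two-grid correction for u'' = -pi^2 sin(pi x) on (0,1), zero BCs
  n = 64; dx = 1/n; x = (1:n-1)'*dx;
  o  = ones(n-1,1);
  A  = spdiags([o -2*o o], -1:1, n-1, n-1)/dx^2;
  oc = ones(n/2-1,1);
  Ac = spdiags([oc -2*oc oc], -1:1, n/2-1, n/2-1)/(2*dx)^2;   % coarse-grid operator
  f = -pi^2*sin(pi*x); u = zeros(n-1,1);
  w = 2/3;                                        % damping parameter (illustrative)
  for sweep = 1:10
      for k = 1:3, u = u - w*(dx^2/2)*(f - A*u); end          % damped Jacobi smoothing
      r  = f - A*u;                                           % fine-grid residual
      rc = (r(1:2:end-2) + 2*r(2:2:end-1) + r(3:2:end))/4;    % restriction (1/4,1/2,1/4)
      ec = Ac \ rc;                                           % solve on the coarse grid
      ef = zeros(n-1,1);                                      % prolong by interpolation
      ef(2:2:end-1) = ec;
      ef(1:2:end)   = ([0; ec] + [ec; 0])/2;
      u = u + ef;                                             % coarse-grid correction
      u = u - w*(dx^2/2)*(f - A*u);                           % post-smoothing
  end
  disp(norm(u - sin(pi*x), inf))       % small, up to the discretization error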
Before the first multigrid sweep, however, we need to obtain good starting
values for the finest grid. This is done by starting from the coarsest grid solving
the system of equations there and prolonging the solution to the finest grid
in a zig-zag fashion. That means we do not go directly to the finest grid, but
return after each finer grid to the coarsest grid to obtain better initial values
for the solution.
Exercise 8.9. The function u(x, y) = 18x(1 − x)y(1 − y), 0 ≤ x, y ≤ 1, is the
solution of the Poisson equation uxx + uyy = 36(x2 + y 2 − x − y) = f (x, y),
subject to zero boundary conditions. Let ∆x = 1/6 and seek the solution of
the five-point method
u_{l,m}^{n+1} = u_{l,m}^n + µ(u_{l−1,m}^n + u_{l+1,m}^n + u_{l,m−1}^n + u_{l,m+1}^n − 4u_{l,m}^n),        (8.21)
where µ = ∆t/(∆x)². Again, in matrix form this is
isometry.
(∑_{l,m∈Z} |u_{l,m}|²)^{1/2} =: ‖u‖ = ‖û‖ := ((1/4π²) ∫_{−π}^{π} ∫_{−π}^{π} |û(θ, ψ)|² dθ dψ)^{1/2},
and the method is stable if and only if |H(θ, ψ)| ≤ 1 for all θ, ψ ∈ [−π, π].
For the method given in (8.21) we have
For the method given in (8.21) we have
H(θ, ψ) = 1 + µ(e^{−iθ} + e^{iθ} + e^{−iψ} + e^{iψ} − 4) = 1 − 4µ(sin²(θ/2) + sin²(ψ/2)),
and we again deduce stability if and only if µ ≤ 1/4.
If we apply the trapezoidal rule instead of the forward Euler method to
our semi-discretization (8.20), we obtain the two-dimensional Crank–Nicolson
method
u_{l,m}^{n+1} − (1/2)µ(u_{l−1,m}^{n+1} + u_{l+1,m}^{n+1} + u_{l,m−1}^{n+1} + u_{l,m+1}^{n+1} − 4u_{l,m}^{n+1}) = u_{l,m}^n + (1/2)µ(u_{l−1,m}^n + u_{l+1,m}^n + u_{l,m−1}^n + u_{l,m+1}^n − 4u_{l,m}^n).
8.4.1 Splitting
Solving parabolic equations with explicit methods typically leads to restric-
tions of the form ∆t ∼ ∆x2 , and this is generally not acceptable. Instead,
implicit methods are used, for example, Crank–Nicolson. However, this means
that in each time step a system of linear equations needs to be solved. This
can become very costly for several space dimensions. The matrix I − 12 µA∗ is
in structure similar to A∗ , so we can apply the Hockney algorithm.
However, since the two-dimensional Crank–Nicolson method already car-
ries a local truncation error of O((∆t)³ + ∆t(∆x)²) = O((∆t)²) (because of ∆t = µ(∆x)²), the system does not need to be solved exactly. It is enough
to solve it within this error. Using the following operator notation (central
difference operator applied twice),
δx2 ul,m = ul−1,m − 2ul,m + ul+1,m , δy2 ul,m = ul,m−1 − 2ul,m + ul,m+1 ,
We know that δx2 /(∆x)2 and δy2 /(∆x)2 are approximations to the second par-
tial derivatives in the space directions carrying an error of O(∆x)2 . Thus we
e = ((∆t)²/4) (1/(∆x)²)δ_x² (1/(∆x)²)δ_y² ∆t (∂/∂t) u_{l,m}(t) + O((∆t)²)
  = ((∆t)³/4) (∂²/∂x²)(∂²/∂y²)(∂/∂t) u(x, y, t) + O((∆t)³(∆x)²) + O((∆t)²) = O((∆t)²).
In matrix form, the new method is equivalent to splitting the matrix A* into the sum of two matrices, A_x and A_y, where A_x is the block tridiagonal matrix with −2I on the diagonal and I on the first block off-diagonals, and A_y is the block diagonal matrix with the block B on the diagonal, where
B = [ −2   1
       1  −2   ⋱
           ⋱   ⋱   1
               1  −2 ].
We then solve the uncoupled system
(I − (1/2)µA_x)(I − (1/2)µA_y)u^{n+1} = (I + (1/2)µA_x)(I + (1/2)µA_y)u^n
in two steps, first solving
(I − (1/2)µA_x)u^{n+1/2} = (I + (1/2)µA_x)(I + (1/2)µA_y)u^n,
then solving
(I − (1/2)µA_y)u^{n+1} = u^{n+1/2}.
The matrix I − (1/2)µA_y is block diagonal, where each block is I − (1/2)µB. Thus solving the above system is equivalent to solving the same tridiagonal system (I − (1/2)µB)u_i^{n+1} = u_i^{n+1/2} for different right-hand sides M times, for which the same method can be reused. Here the vector u has been divided into vectors u_i of size M for i = 1, …, M. The matrix I − (1/2)µA_x is of the same form after a reordering of the grid which changes the right-hand sides. Thus we first have to calculate (I + (1/2)µA_x)(I + (1/2)µA_y)u^n, then reorder, solve the first system, then reorder and solve the second system.
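As a sketch (not the book's code), one split step can be written in a few MATLAB lines. For brevity the reordering is left to the sparse backslash operator instead of being carried out explicitly, and which Kronecker factor corresponds to the x or the y direction depends on the chosen grid ordering; the data are arbitrary:

  % One step of (I - mu/2 Ax)(I - mu/2 Ay) u_new = (I + mu/2 Ax)(I + mu/2 Ay) u_old
  M = 30; mu = 1;                                  % illustrative values
  o = ones(M,1);
  B  = spdiags([o -2*o o], -1:1, M, M);            % 1D second difference
  Ax = kron(B, speye(M));                          % differences in one direction
  Ay = kron(speye(M), B);                          % differences in the other direction
  u  = rand(M^2,1);                                % illustrative data, zero boundary values
  rhs   = (speye(M^2) + mu/2*Ax)*((speye(M^2) + mu/2*Ay)*u);
  uhalf = (speye(M^2) - mu/2*Ax) \ rhs;            % first family of tridiagonal systems
  unew  = (speye(M^2) - mu/2*Ay) \ uhalf;          % second family of tridiagonal systems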
Speaking more generally, suppose the method of lines results after dis-
cretization in space in the linear system of ODEs given by
u0 = Au, u(0) = u0 ,
and the solution to the system of ODEs is u(t) = e^{tA} u_0, or at time t_{n+1},
Many methods for ODEs are actually approximations to the matrix expo-
nential. For example, applying the forward Euler method to the ODE results
in
un+1 = (I + ∆tA)un .
The corresponding approximation to the exponential is 1 + z = ez + O(z 2 ).
On the other hand, if the trapezoidal rule is used instead, we have
u^{n+1} = (I − (1/2)∆tA)^{−1}(I + (1/2)∆tA)u^n,   and   (1 + z/2)/(1 − z/2) = e^z + O(z³).
The advantage is that up to reordering all matrices involved are tridiagonal (if
only neighbouring points are used) and the system of equations can be solved
cheaply.
However, the assumption e^{t(A_x+A_y)} = e^{tA_x} e^{tA_y} is generally false. Taking the first few terms of the definition of the matrix exponential, we have
e^{t(A_x+A_y)} = I + t(A_x + A_y) + (1/2)t²(A_x² + A_xA_y + A_yA_x + A_y²) + O(t³)
while
e^{tA_x} e^{tA_y} = [I + tA_x + (1/2)t²A_x² + O(t³)] × [I + tA_y + (1/2)t²A_y² + O(t³)]
                = I + t(A_x + A_y) + (1/2)t²(A_x² + 2A_xA_y + A_y²) + O(t³).
cos(kπ/(M + 1))) for k = 1, …, M, where each eigenvalue is M-fold. It is easy to see
that these eigenvalues are nonpositive. So as long as r specifies an A-stable
method, that is, |r(z)| < 1 for all z ∈ C− , we have stability.
Exercise 8.11. Let F (t) = etA etB be the first order Beam–Warming splitting
of et(A+B) . Generally the splitting error is of the form t2 C for some matrix
C. If C has large eigenvalues the splitting error can be large even for small t.
Show that
F(t) = e^{t(A+B)} + ∫_0^t e^{(t−τ)(A+B)} (e^{τA}B − Be^{τA}) e^{τB} dτ.
(Hint: Find explicitly G(t) = F 0 (t)−(A+B)F (t) and use variation of constants
to find the solution of the linear matrix ODE F 0 = (A + B)F + G, F (0) = I.)
Suppose that a matrix norm ‖·‖ is given and that there exist real constants c_A, c_B and c_{A+B} such that
‖e^{tA}‖ ≤ e^{c_A t},   ‖e^{tB}‖ ≤ e^{c_B t},   ‖e^{t(A+B)}‖ ≤ e^{c_{A+B} t},   t ≥ 0.
Prove that
‖F(t) − e^{t(A+B)}‖ ≤ 2‖B‖ (e^{(c_A + c_B)t} − e^{c_{A+B} t}) / (c_A + c_B − c_{A+B}).
Hence, for cA , cB ≤ 0, the splitting error remains relatively small even for
large t. (ecA+B t is an intrinsic error.)
So far we have made it easy for ourselves by assuming zero boundary
conditions. We now consider the splitting of inhomogeneous systems where
the boundary conditions are also allowed to vary over time. In general, the
linear ODE system is of the form
(A_x + A_y)e^{t(A_x+A_y)} c(t) + e^{t(A_x+A_y)} c′(t) = (A_x + A_y)e^{t(A_x+A_y)} c(t) + b(t)
and thus
c′(t) = e^{−t(A_x+A_y)} b(t)  ⇒  c(t) = ∫_0^t e^{−τ(A_x+A_y)} b(τ) dτ + c_0.
Using the initial condition u(0) = u_0, the exact solution of (8.23) is provided by
u(t) = e^{t(A_x+A_y)} [u_0 + ∫_0^t e^{−τ(A_x+A_y)} b(τ) dτ]
     = e^{t(A_x+A_y)} u_0 + ∫_0^t e^{(t−τ)(A_x+A_y)} b(τ) dτ,   t ≥ 0.
Often, we can evaluate the integral explicitly; for example, when b(t) is a linear combination of exponential and polynomial terms. If b is constant, then
u((n + 1)∆t) = e^{∆t(A_x+A_y)} u(n∆t) + (A_x + A_y)^{−1} (e^{∆t(A_x+A_y)} − I) b.
However, this observation does not get us any further, since, even if we split
the exponential, an equivalent technique to split (Ax + Ay )−1 does not exist.
The solution is not to compute the integral explicitly but to use quadrature
rules instead. One of those is the trapezium rule given by
∫_0^h g(τ) dτ = (1/2)h[g(0) + g(h)] + O(h³).
This gives
where a(x, y) > 0 and f (x, y) are given, as are initial conditions on the unit
square and Dirichlet boundary conditions on ∂[0, 1]2 × [0, ∞). Every space
derivative is replaced by a central difference at the midpoint, for example
(∂/∂x)u(x, y) = (1/∆x) δ_x u(x, y) + O((∆x)²)
             = (1/∆x) [u(x + (1/2)∆x, y) − u(x − (1/2)∆x, y)] + O((∆x)²).
This yields the ODE system
u′_{l,m} = (1/(∆x)²) [ a_{l−1/2,m} u_{l−1,m} + a_{l+1/2,m} u_{l+1,m} + a_{l,m−1/2} u_{l,m−1} + a_{l,m+1/2} u_{l,m+1} − (a_{l−1/2,m} + a_{l+1/2,m} + a_{l,m−1/2} + a_{l,m+1/2}) u_{l,m} ] + f_{l,m}.
The resulting matrix A is split in such a way that Ax consists of all the
al± 12 ,m terms, while Ay includes the remaining al,m± 12 terms. Again, if the
grid is ordered by columns, Ay is tridiagonal; if it is ordered by rows, Ax
is tridiagonal. The vector b consists of fl,m and incorporates the boundary
conditions.
What we have looked at so far is known as dimensional splitting. In ad-
dition there also exists operational splitting, to resolve non-linearities. As an
example, consider the reaction-diffusion equation in one space dimension
∂u/∂t = ∂²u/∂x² + αu(1 − u).
For simplicity we assume zero boundary conditions at x = 0 and x = 1.
Discretizing in space, we arrive at
u′_m = (1/(∆x)²)(u_{m−1} − 2u_m + u_{m+1}) + αu_m(1 − u_m).
We separate the diffusion from the reaction part by keeping one part constant
and advancing the other by half a time step. We add the superscript n to the
part which is kept constant. In particular we advance by 12 ∆t solving
u′_m = (1/(∆x)²)(u_{m−1} − 2u_m + u_{m+1}) + αu_m^n(1 − u_m^n),
i.e., keeping the reaction part constant. This can be done, for example, by
Crank–Nicolson. Then we advance another half time step solving
u′_m = (1/(∆x)²)(u_{m−1}^{n+1/2} − 2u_m^{n+1/2} + u_{m+1}^{n+1/2}) + αu_m(1 − u_m),
this time keeping the diffusion part constant. The second ODE is a Riccati equation, i.e., its right-hand side is a quadratic in u_m, and it can be solved explicitly (see for example [21] D. Zwillinger, Handbook of Differential Equations).
As time passes, the initial condition retains its shape while shifting with ve-
locity c to the right or left depending on the sign of c (to the right for positive
c, to the left for negative c). This has been likened to a wind blowing from left
to right or vice versa. For simplicity let c = −1, which gives ut (x, t) = ux (x, t),
and let the support of φ lie in [0, 1]. We restrict ourselves to the interval [0, 1]
by imposing the boundary condition u(1, t) = φ(t + 1).
Let ∆x = 1/(M + 1). We start by semidiscretizing the right-hand side by the sum of the forward and backward difference
(∂/∂t)u_m(t) = (1/(2∆x))[u_{m+1}(t) − u_{m−1}(t)] + O((∆x)²).            (8.24)
Solving the resulting ODE u′_m(t) = (2∆x)^{−1}[u_{m+1}(t) − u_{m−1}(t)] by forward Euler results in
u_m^{n+1} = u_m^n + (1/2)µ(u_{m+1}^n − u_{m−1}^n),   m = 1, …, M,  n ≥ 0,
where µ = ∆t/∆x is the Courant number. The overall local error is O((∆t)² +
The eigenvalues of such a tridiagonal matrix, with α on the diagonal, β on the superdiagonal and −β on the subdiagonal, are given by λ_k = α + 2iβ cos(kπ/(M + 1)), with corresponding eigenvector v_k, whose jth component is i^j sin(jkπ/(M + 1)), j = 1, …, M. So for A, λ_k = 1 + iµ cos(kπ/(M + 1)) and |λ_k|² = 1 + µ² cos²(kπ/(M + 1)) > 1. Hence we have instability for any µ.
It is, however, sufficient to have a local error of O(∆x) when discretizing
in space, since it is multiplied by ∆t, which is then O((∆x)2 ) for a fixed µ.
Thus if we discretized in space by the forward difference
(∂/∂t)u_m(t) = (1/∆x)[u_{m+1}(t) − u_m(t)] + O(∆x)
and solve the resulting ODE again by the forward Euler method, we arrive at
u_m^{n+1} = u_m^n + µ(u_{m+1}^n − u_m^n),   m = 1, …, M,  n ≥ 0.        (8.25)
This method is known as the upwind method . It takes its name because we
are taking additional information from the point m + 1 which is against the
wind, which is blowing from right to left since c is negative. It makes logical
sense to take information from the direction the wind is blowing from. This
implies that the method has to be adjusted for positive c to use unm−1 instead
of unm+1 . It also explains the instability of the first scheme we constructed,
since there information was taken from both sides of um in the form of um−1
and um+1 .
In matrix form, the upwind method is u^{n+1} = Au^n, where A is the upper bidiagonal matrix with 1 − µ on the diagonal and µ on the superdiagonal.
Now the matrix A is not normal and thus its 2-norm is not equal to its spectral
radius, but equal to the square root of the spectral radius of AAT . Now
AAᵀ is the tridiagonal matrix with (1 − µ)² + µ² on the diagonal, except for the entry (1 − µ)² in the bottom right corner, and µ(1 − µ) on both off-diagonals,
which is not TST, since the entry in the bottom right corner differs. The
eigenvalues can be calculated solving a three term recurrence relation (see for
example [18]). However, defining ‖u^n‖_∞ = max_m |u_m^n|, it follows from (8.25) that
‖u^{n+1}‖_∞ = max_m |u_m^{n+1}| ≤ max_m {|1 − µ||u_m^n| + µ|u_{m+1}^n|} ≤ (|1 − µ| + µ)‖u^n‖_∞.
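A short MATLAB sketch of the upwind method for c = −1 is given below (the initial profile, the Courant number and the number of steps are illustrative choices; the value fed in at the right boundary is taken to be zero):

  % Upwind method u_m^{n+1} = u_m^n + mu (u_{m+1}^n - u_m^n) for u_t = u_x
  M = 200; dx = 1/(M+1); mu = 0.8; dt = mu*dx;
  x = (1:M)'*dx;
  u = exp(-200*(x - 0.5).^2);                   % illustrative initial profile
  for n = 1:100
      u = u + mu*([u(2:end); 0] - u);           % information taken from the right
  end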
u_m^{n+1} = µ(u_{m+1}^n − u_{m−1}^n) + u_m^{n−1}.
We have stability if |û_±(θ)| ≤ 1 for all θ ∈ [−π, π] and we do not have a double root on the unit circle. For µ > 1 the square root is imaginary at θ = ±π/2 and then
|û_+(π/2)| = µ + √(µ² − 1) > 1   and   |û_−(−π/2)| = |−µ − √(µ² − 1)| > 1.
For µ = 1 we have a double root for both θ = π/2 and θ = −π/2, since the square root vanishes. In this case û_±(±π/2) = ±i, which lies on the unit circle. Thus we have instability for µ ≥ 1. For |µ| < 1 we have stability, because in this case
|û_±(θ)|² = µ² sin²θ + 1 − µ² sin²θ = 1.
The leapfrog method is a good example of how instability can be introduced from the boundary. Calculating u_m^{n+1} for m = 0 we are lacking the value u_{−1}^n. Setting u_{−1}^n = 0 introduces instability which propagates inwards. However, stability can be recovered by letting u_0^{n+1} = u_1^n.
If the original method for the advection equation is stable for all µ ∈ [a, b]
where a < 0 < b, then the method for the system of advection equations
is stable as long as a ≤ λµ ≤ b for all eigenvalues λ of A. Again for the
wave equation the eigenvalues are ±1 with corresponding eigenvectors (1, 1)T
and (1, −1)ᵀ. Thus using the upwind method (8.25) (for which we have the condition |µ| ≤ 1) we calculate v_m^n and w_m^n according to
v_m^{n+1} = v_m^n + µ(w_{m+1}^n − w_m^n),   w_m^{n+1} = w_m^n + µ(v_{m+1}^n − v_m^n).
Eliminating the w_m^n's and letting u_m^n = v_m^n, we obtain
u_m^{n+1} − 2u_m^n + u_m^{n−1} = µ²(u_{m+1}^n − 2u_m^n + u_{m−1}^n),
which is the leapfrog scheme. Note that we could also have obtained the
method by using the usual finite difference approximating the second deriva-
tive.
Since this is intrinsically a two-step method in the time direction, we need
to calculate u1m . One possibility is to use the forward Euler method and let
u1m = u(m∆x, 0) + ∆tut (m∆x, 0) where both terms on the right-hand side are
given by the initial conditions. This carries an error of O((∆t)2 ). However,
considering the Taylor expansion,
We see that approximating according to the last line has better accuracy.
The Fourier stability analysis of the leapfrog method for the Cauchy prob-
lem yields
û^{n+1}(θ) − 2û^n(θ) + û^{n−1}(θ) = µ(e^{iθ} − 2 + e^{−iθ})û^n(θ) = −4µ sin²(θ/2) û^n(θ).
This recurrence relation has the characteristic equation
x² − 2(1 − 2µ sin²(θ/2))x + 1 = 0
with roots x_± = (1 − 2µ sin²(θ/2)) ± √((1 − 2µ sin²(θ/2))² − 1). The product of the
roots is 1. For stability we require the moduli of both roots to be less than
or equal to 1 and if a root lies at 1 it has to be a single root. Thus the roots
must be a complex conjugate pair and this leads to the inequality
(1 − 2µ sin²(θ/2))² ≤ 1.
This condition is fulfilled if and only if µ = (∆t/∆x)² ≤ 1.
f̂_n = (1/2) ∫_{−1}^{1} f(τ) e^{−iπnτ} dτ,   n ∈ Z.
Note that the above theorem explicitly excludes the endpoints of the in-
terval. This is due to the Gibbs phenomenon. Figure 8.1 illustrates this. The
Gibbs effect involves both the fact that Fourier sums overshoot at a discon-
tinuity, and that this overshoot does not die out as the frequency increases.
With increasing N the point where that overshoot happens moves closer and
closer to the discontinuity. So once the overshoot has passed by a particular
x, convergence at the value of x is possible. However, convergence at the end-
points −1 and 1 cannot be guaranteed. It is possible to show (as a consequence
of the Dirichlet–Jordan theorem) that
φ_N(±1) → (1/2)[f(−1) + f(1)]   as N → ∞,
and hence there is no convergence unless f is periodic.
For proofs and more in-depth analysis, see [13] T. W. Körner, Fourier
Analysis.
Theorem 8.5. Let f be an analytic function in [−1, 1], which can be extended
analytically to a complex domain Ω and which is periodic with period 2, i.e.,
f (m) (1) = f (m) (−1) for all m = 1, 2, . . .. Then the Fourier coefficients fˆn are
f̂_n = (1/2) ∫_{−1}^{1} f(τ) e^{−iπnτ} dτ
     = (1/2) [f(τ) e^{−iπnτ}/(−iπn)]_{−1}^{1} − (1/2) ∫_{−1}^{1} f′(τ) e^{−iπnτ}/(−iπn) dτ.
The first term vanishes, since f(−1) = f(1) and e^{±iπn} = cos nπ ± i sin nπ = cos nπ. Thus
f̂_n = (1/(πin)) f̂′_n.
Using f^{(m)}(1) = f^{(m)}(−1) for all m = 1, 2, … and multiple integration by parts gives
f̂_n = (1/(πin)) f̂′_n = (1/(πin))² f̂″_n = (1/(πin))³ f̂‴_n = … .
Hence
f̂_n = (1/(πin))^m f̂^{(m)}_n,   m = 1, 2, … .
For n = 0 the ODE simplifies to û′_0(t) = 0 and thus û_0(t) = c_0. The constant c_0 is determined by the normalization condition
∫_{−1}^{1} ∑_{n=−N/2+1}^{N/2} c_n e^{−π²n²t} e^{iπnx} dx = ∑_{n=−N/2+1}^{N/2} c_n e^{−π²n²t} ∫_{−1}^{1} e^{iπnx} dx = 2c_0,
we have
f(x) + g(x) = ∑_{n=−∞}^{∞} (f̂_n + ĝ_n) e^{iπnx},   a f(x) = ∑_{n=−∞}^{∞} a f̂_n e^{iπnx}.
Moreover,
f(x)g(x) = ∑_{n=−∞}^{∞} (∑_{m=−∞}^{∞} f̂_{n−m} ĝ_m) e^{iπnx} = ∑_{n=−∞}^{∞} ({f̂_n} ∗ {ĝ_n}) e^{iπnx},
Since {fˆn } decays faster than O(n−m ) for all m ∈ Z+ , it follows that all
derivatives of f have rapidly convergent Fourier expansions.
We now have the tools at our disposal to solve the heat equation with
non-constant coefficient α(x). In particular,
∂u(x, t)/∂t = (∂/∂x)(α(x) ∂u(x, t)/∂x),
we have
∂u(x, t)/∂x = ∑_{n=−∞}^{∞} û_n(t) iπn e^{iπnx}
and by convolution,
α(x) ∂u(x, t)/∂x = ∑_{n=−∞}^{∞} (∑_{m=−∞}^{∞} α̂_{n−m} iπm û_m(t)) e^{iπnx}.
or in vector form,
û^{k+1} = (I + ∆tÂ)û^k,
where Â has elements Â_{nm} = −π²nm α̂_{n−m}, m, n = −N/2 + 1, …, N/2. Note that every row and every column in Â has a common factor. If α(x) is constant, i.e., α(x) ≡ α_0, then α̂_n = 0 for all n ≠ 0 and Â is the diagonal matrix
Â = −π² α_0 diag((−N/2 + 1)², …, 0, …, (N/2)²).
analyticity and f (m) (1) = f (m) (−1) for all m = 1, 2, . . . is crucial. What to
do, however, if the latter does not hold? We can force values at the endpoints
to be equal. Consider the function
g(x) = f(x) − (1/2)(1 − x)f(−1) − (1/2)(1 + x)f(1),
which satisfies g(±1) = 0. Now the Fourier coefficients ĝn are O(n−1 ). Ac-
cording to the de la Vallée Poussin theorem, the rate of convergence of the
N -terms truncated Fourier expansion of g is hence O(N −1 ). This idea can be
iterated. By letting
h(x) = g(x) + a(1 − x)(1 + x) + b(1 − x)(1 + x)2 ,
which already satisfies h(±1) = g(±1) = 0, and choosing a and b appropri-
ately, we achieve h0 (±1) = 0. Here the Fourier coefficients ĥn are O(n−2 ).
However, the values of the derivatives at the boundaries need to be known
and with every step the degree of the polynomial which needs to be added to
achieve zero boundary conditions increases by 2.
Another possibility to deal with the lack of periodicity is the use of
Chebyshev polynomials (of the first kind), which are defined by Tn (x) =
cos(n arccos x), n ≥ 0. Each Tn is a polynomial of degree n, i.e.,
T0 (x) = 1, T1 (x) = x, T2 (x) = 2x2 − 1, T3 (x) = 4x3 − 3x, ...
The sequence Tn obeys the three-term recurrence relation
Tn+1 (x) = 2xTn (x) − Tn−1 (x), n = 1, 2, . . . .
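The recurrence gives a simple and stable way to evaluate T_n numerically; the following MATLAB lines are a minimal illustration (the degree and the evaluation points are arbitrary choices):

  % Evaluate T_0, ..., T_N at points x by the three-term recurrence
  N = 5; x = linspace(-1, 1, 7);
  T = zeros(N+1, numel(x));
  T(1,:) = 1; T(2,:) = x;                       % T_0 and T_1
  for n = 2:N
      T(n+1,:) = 2*x.*T(n,:) - T(n-1,:);        % T_{n+1} = 2x T_n - T_{n-1}
  end
  max(abs(T(N+1,:) - cos(N*acos(x))))           % agrees with cos(N arccos x)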
Moreover, they form a sequence of orthogonal polynomials, which are orthogonal with respect to the weight function (1 − x²)^{−1/2} in (−1, 1). More specifically,
∫_{−1}^{1} T_m(x)T_n(x) dx/√(1 − x²) = π if m = n = 0,  π/2 if m = n ≥ 1,  0 if m ≠ n.        (8.26)
This can be proven by letting x = cos θ and using the identity Tn (cos θ) =
cos nθ.
Now since the Chebyshev polynomials are mutually orthogonal, a general
integrable function f can be expanded in
f(x) = ∑_{n=0}^{∞} f̌_n T_n(x).                    (8.27)
since the general Fourier transform of a function g defined in the interval [a, b], a < b, and which is periodic with period b − a, is given by the sequence
ĝ_n = (1/(b − a)) ∫_a^b g(τ) e^{−2πinτ/(b−a)} dτ.
In particular, letting g(x) = f(cos x) and [a, b] = [−π, π], we have
ĝ_n = (1/2π) ∫_{−π}^{π} g(τ) e^{−inτ} dτ
and the result follows. Thus we can deduce
f̌_0 = f̂(cos θ)_0,   f̌_n = f̂(cos θ)_{−n} + f̂(cos θ)_n,   n = 1, 2, … ,
where f̂(cos θ)_n denotes the nth Fourier coefficient of f(cos θ).
we have
f(x)g(x) = ∑_{m=0}^{∞} f̌_m T_m(x) ∑_{n=0}^{∞} ǧ_n T_n(x)
        = (1/2) ∑_{m,n=0}^{∞} f̌_m ǧ_n [T_{|m−n|}(x) + T_{m+n}(x)]
        = (1/2) ∑_{m,n=0}^{∞} f̌_m (ǧ_{m+n} + ǧ_{|m−n|}) T_n(x),
Proof. We only prove (8.28), since the proof of (8.29) follows a similar argument. Since T_{2n}(x) = cos(2n arccos x), we have
T′_{2n}(x) = 2n sin(2n arccos x) / √(1 − x²).
Letting x = cos θ and rearranging, it follows that
sin θ · T′_{2n}(cos θ) = 2n sin(2nθ).
thus summarizing
û_{k,l} = −f̂_{k,l} / (π²(k² + l²))   for (k, l) ∈ Z²\{(0, 0)},   and   û_{0,0} = 0.
This solution is not representative for its application to general PDEs.
The reason is the special structure of the Poisson equation, because φk,l =
eiπ(kx+ly) are the eigenfunctions of the Laplace operator with eigenvalue
−π 2 (k 2 + l2 ), since
∇2 φk,l = −π 2 (k 2 + l2 )φk,l ,
and they obey periodic boundary conditions.
The concept can be extended to general second-order elliptic PDEs speci-
fied by the equation
∇T (a∇u) = f, −1 ≤ x, y ≤ 1,
where a is a positive analytic function and f is an analytic function, and both
a and f are periodic. We again impose the boundary conditions (8.31) and
the normalization condition (8.32). Writing
a(x, y) = ∑_{k,l=−∞}^{∞} â_{k,l} e^{iπ(kx+ly)},
f(x, y) = ∑_{k,l=−∞}^{∞} f̂_{k,l} e^{iπ(kx+ly)},
u(x, y) = ∑_{k,l=−∞}^{∞} û_{k,l} e^{iπ(kx+ly)},
and rewriting the PDE using the fact that the Laplacian ∇2 is the divergence
∇T of the gradient ∇u as
∇ᵀ(a∇u) = (∂/∂x)(au_x) + (∂/∂y)(au_y) = a∇²u + a_x u_x + a_y u_y,
we get
−π² (∑_{k,l=−∞}^{∞} â_{k,l} e^{iπ(kx+ly)}) (∑_{m,n=−∞}^{∞} (m² + n²) û_{m,n} e^{iπ(mx+ny)})
−π² (∑_{k,l=−∞}^{∞} k â_{k,l} e^{iπ(kx+ly)}) (∑_{m,n=−∞}^{∞} m û_{m,n} e^{iπ(mx+ny)})
−π² (∑_{k,l=−∞}^{∞} l â_{k,l} e^{iπ(kx+ly)}) (∑_{m,n=−∞}^{∞} n û_{m,n} e^{iπ(mx+ny)})
= ∑_{k,l=−∞}^{∞} f̂_{k,l} e^{iπ(kx+ly)}.
Truncating and comparing coefficients yields, for each pair (k, l),
−π² ∑_{m,n=−N/2+1}^{N/2} [(m² + n²) + (k − m)m + (l − n)n] â_{k−m,l−n} û_{m,n} = f̂_{k,l}.
The main difference between methods arising from computational stencils
and spectral methods is that the former leads to large sparse matrices, while
the latter leads to small but dense matrices.
The first integral is I(u) and the last integral is always non-negative. For the second integral, integrating by parts, we obtain
∫_0^1 u′(x)v′(x) + v(x)f(x) dx = [u′(x)v(x)]_0^1 − ∫_0^1 u″(x)v(x) dx + ∫_0^1 v(x)f(x) dx = 0,
since v(0) = v(1) = 0, since u and w have the same boundary conditions, and since u″(x) = f(x). Hence
I(w) = I(u) + ∫_0^1 [v′(x)]² dx.
I(w) = I(u + v)
     = ∫_0^1∫_0^1 (u_x(x, y) + v_x(x, y))² + (u_y(x, y) + v_y(x, y))² + 2(u(x, y) + v(x, y))f(x, y) dx dy
     = ∫_0^1∫_0^1 [u_x(x, y)]² + [u_y(x, y)]² + 2u(x, y)f(x, y) dx dy
       + 2 ∫_0^1∫_0^1 u_x(x, y)v_x(x, y) + u_y(x, y)v_y(x, y) + v(x, y)f(x, y) dx dy
       + ∫_0^1∫_0^1 [v_x(x, y)]² + [v_y(x, y)]² dx dy.
since v(x, 0) = v(x, 1) = v(0, y) = v(1, y) = 0 for all x, y ∈ [0, 1] and f (x, y) =
u_xx(x, y) + u_yy(x, y) for all (x, y) ∈ [0, 1] × [0, 1]. Hence
I(w) = I(u) + ∫_0^1∫_0^1 [v_x(x, y)]² + [v_y(x, y)]² dx dy.
This implies v(x, y) ≡ 0 and thus the solution to the Poisson equation also
minimizes the integral.
We have seen two examples of the first step of the finite element method.
The first step is to rephrase the problem as a variational problem. The true
solution lies in an infinite dimensional space of functions and is minimal with
respect to a certain functional. The next step is to choose a finite dimen-
sional subspace S and find the element of that subspace which minimizes the
functional.
To be more formal, let the functional be of the form
In the following we assume that the functional A(u, v) satisfies the follow-
ing properties:
Symmetry
A(u, v) = A(v, u).
Bi-linearity
A(u, v) is a linear function of u for fixed v.
A(u, v) + ⟨v, f⟩ = 0.
Proof. Let u ∈ S minimize I(u), then for any non-zero v ∈ S and scalars λ,
A(u, v) + ⟨v, f⟩ = 0.
To prove the other direction let u ∈ S be such that A(u, v) + ⟨v, f⟩ = 0 for all non-zero functions v ∈ S. Then for any v ∈ S,
We continue our one-dimensional example, solving u00 (x) = f (x) with zero
boundary conditions. Let x0 , . . . , xn+1 be nodes on the interval [0, 1] with
x0 = 0 and xn+1 = 1. Let hi = xi − xi−1 be the spacing. For i = 1, . . . , n we
choose
v_i(x) = (x − x_{i−1})/h_i for x ∈ (x_{i−1}, x_i),
v_i(x) = (x_{i+1} − x)/h_{i+1} for x ∈ [x_i, x_{i+1}),
v_i(x) = 0 otherwise,
as basis functions. These are similar to the linear B-splines displayed in Fig-
ure 3.7, the difference being that the nodes are not equally spaced and we
restricted the basis to those splines evaluating to zero at the boundary, be-
cause of the zero boundary conditions.
Recall that in this example
A(u, v) = ∫_0^1 u′(x)v′(x) dx
and hence
A_{i,j} = A(v_i, v_j) = −1/h_i for j = i − 1,
                        1/h_i + 1/h_{i+1} for j = i,
                        −1/h_{i+1} for j = i + 1,
                        0 otherwise.
This looks very similar to the finite difference approximation to the solution
of the Poisson equation in one dimension on an equally spaced grid. Let um
approximate u(xm ). We approximate the second derivative by applying the
central difference operator twice and dividing by h2 . Then, using the zero
boundary conditions, the differential equation can be approximated on the
grid by
(1/h²)(−2u_1 + u_2) = f(x_1),
(1/h²)(u_{m−1} − 2u_m + u_{m+1}) = f(x_m),
(1/h²)(u_{n−1} − 2u_n) = f(x_n).
However, the two methods are two completely different approaches to the
same problem. The finite difference solution calculates approximations to func-
tion values on grid points, while the finite element method produces a function
as a solution which is the linear combination of basis functions. The right-hand
sides in the finite difference technique are f evaluated at the grid points. The
right-hand sides in the finite element method are scalar products of the basis
functions with f . Finite element methods are chosen, when it is important
to have a continuous representation of the solution. By choosing appropriate
basis functions such as higher-order B-splines, the solution can also be forced
to have continuous derivatives up to a certain degree.
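A minimal MATLAB assembly of the Galerkin system with hat functions on possibly non-uniform nodes is sketched below (an illustration only: the nodes, the right-hand side f and the midpoint quadrature for the load vector are assumptions, not choices made in the text; the system solved is A c = −b, corresponding to A(u, v_i) + ⟨v_i, f⟩ = 0):

  % 1D finite elements for u'' = f, u(0) = u(1) = 0, with hat basis functions
  f = @(x) -pi^2*sin(pi*x);                     % exact solution u(x) = sin(pi x)
  xnodes = [0; 0.1; 0.25; 0.4; 0.55; 0.7; 0.85; 1];   % x_0 < x_1 < ... < x_{n+1}
  h = diff(xnodes); n = numel(xnodes) - 2;
  A = zeros(n); b = zeros(n,1);
  for i = 1:n
      A(i,i) = 1/h(i) + 1/h(i+1);
      if i > 1, A(i,i-1) = -1/h(i); A(i-1,i) = -1/h(i); end
      % load vector by midpoint quadrature on the two elements supporting v_i
      b(i) = h(i)  * 0.5 * f((xnodes(i)   + xnodes(i+1))/2) ...
           + h(i+1)* 0.5 * f((xnodes(i+1) + xnodes(i+2))/2);
  end
  c = A \ (-b);                                 % coefficients of the hat functions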
∂u/∂t = ∂²u/∂x²
is discretized by the finite difference method
u_m^{n+1} − (1/2)(µ − α)(u_{m−1}^{n+1} − 2u_m^{n+1} + u_{m+1}^{n+1}) = u_m^n + (1/2)(µ + α)(u_{m−1}^n − 2u_m^n + u_{m+1}^n),
where u_m^n approximates u(m∆x, n∆t) and µ = ∆t/(∆x)² and α are constant.
(a) Show that the order of magnitude (as a power of ∆x) of the local error is
O((∆x)4 ) for general α and derive the value of α for which it is O((∆x)6 ).
State which expansions and substitutions you are using.
(b) Define the Fourier transform of a sequence unm , m ∈ Z. Investigate the
stability of the given finite difference method by Fourier technique and its
dependence on α. In the process define the amplification factor. (Hint:
Express the amplification factor as 1 − . . .)
Exercise 8.13. The diffusion equation
∂u/∂t = (∂/∂x)(a(x) ∂u/∂x),   0 ≤ x ≤ 1,  t ≥ 0,
(a) Assuming sufficient smoothness of the function a, show that the local error
of the method is at least O((∆x)3 ). State which expansions and substitu-
tions you are using.
u^{n+1} = Au^n
giving a formula for the entries Ak,l of A. From the structure of A, what
can you say about the eigenvalues of A?
(1/(∆x)²) · [ c  b  c
              b  a  b
              c  b  c ]  u_{l,m}
is used to approximate
∂²u/∂x² + ∂²u/∂y².

∂²u/∂x² + ∂²u/∂y² = 0.

∂²u/∂x² + ∂²u/∂y² = f,   −1 ≤ x, y ≤ 1,
where f is analytic and obeys periodic boundary conditions
where f is analytic and obeys periodic boundary conditions
u_m^{n+1} = µ(u_{m+1}^n − u_{m−1}^n) + u_m^{n−1},

(1 − 2µ)u_{m−1}^{n+1} + 4µu_m^{n+1} + (1 − 2µ)u_{m+1}^{n+1} = u_{m−1}^n + u_{m+1}^n
∂u/∂t = ∂²u/∂x².
Express the local error as a power of ∆x.
Exercise 8.18. (a) Define the central difference operator δ and show how it
is used to approximate the second derivative of a function. What is the
approximation error?
(b) Explain the method of lines, applying it to the diffusion equation
∂u/∂t = ∂²u/∂x²
for x ∈ [0, 1], t ≥ 0 using the results from (a).
(c) Given space discretization step ∆x and time discretization step ∆t, the
diffusion equation is approximated by the scheme
u_m^{n+1} − u_m^n − µ(u_{m+1}^{n+1} − 2u_m^{n+1} + u_{m−1}^{n+1}) = 0,
on the left side of the equation. How does this change the error term
calculated in (c)? For which choice of α depending on µ can a higher
order be achieved?
(e) Perform a Fourier stability analysis on the scheme given in (c) with arbi-
trary value of µ stating for which values of µ the method is stable. (Hint:
cos θ − 1 = −2 sin2 θ/2.)
Exercise 8.19. We consider the diffusion equation with variable diffusion
coefficient
∂u/∂t = (∂/∂x)(a ∂u/∂x),
where a(x), x ∈ [−1, 1] is a given differentiable function. The initial condi-
tion for t = 0 is given, that is, u(x, 0) = u0 (x), and we have zero boundary
conditions for x = −1 and x = 1, that is, u(−1, t) = 0 and u(1, t) = 0, t ≥ 0.
(a) Given space discretization step ∆x and time discretization step ∆t, the
following finite difference method is used,
(c) Since the boundary conditions are zero, the solution may be expressed in
terms of periodic functions. Therefore the differential equation is solved
by spectral methods letting
u(x, t) = ∑_{n=−∞}^{∞} û_n(t) e^{iπnx}   and   a(x) = ∑_{n=−∞}^{∞} â_n e^{iπnx}.
a(x) ∂u(x, t)/∂x.
(e) By differentiating the result in (d) again with regards to x and truncating,
deduce the system of ODEs for the coefficients ûn (t). Specify the matrix
B such that
(d/dt) û(t) = B û(t).
(f ) Let a(x) be constant, that is, a(x) = â0 . What are the matrices A and B
with this choice of a(x)?
(g) Let
a(x) = cos πx = (1/2)(e^{iπx} + e^{−iπx}).
What are the matrices A and B with this choice of a(x)? (Hint: cos(x −
π) = − cos x and cos(x − y) + cos(x + y) = 2 cos x cos y.)
Exercise 8.20. We consider the square [0, 1] × [0, 1], which is divided into M + 1 intervals in both directions, with a grid spacing of ∆x in both directions. The computational stencil given by
(1/(∆x)²) · [ 1/2   0   1/2
               0   −2    0
              1/2   0   1/2 ]  u_{l,m}

∂²u/∂x² + ∂²u/∂y² = f.