
Numerical Analysis & Mathematical Computation

A Concise Introduction to Numerical Analysis


This textbook provides an accessible and concise introduction to numerical analysis for upper undergraduate and beginning graduate students from various backgrounds. It was developed from the lecture notes of four successful courses on numerical analysis taught within the MPhil of Scientific Computing at the University of Cambridge. The book is easily accessible, even to those with limited knowledge of mathematics.
Students will get a concise, but thorough introduction to numerical
analysis. In addition the algorithmic principles are emphasized to en-
courage a deeper understanding of why an algorithm is suitable, and
sometimes unsuitable, for a particular problem.
A Concise Introduction to Numerical Analysis strikes a balance be-
tween being mathematically comprehensive, but not overwhelming
with mathematical detail. In some places where further detail was felt
to be out of scope of the book, the reader is referred to further reading.
The book uses MATLAB® implementations to demonstrate the work-
ings of the method and thus MATLAB’s own implementations are
avoided, unless they are used as building blocks of an algorithm. In
some cases the listings are printed in the book, but all are available
online on the book’s page at www.crcpress.com.
Most implementations are in the form of functions returning the out-
come of the algorithm. Also, examples for the use of the functions are
given. Exercises are included in line with the text where appropriate,
and each chapter ends with a selection of revision exercises. Solutions
to odd-numbered exercises are also provided on the book’s page at
www.crcpress.com.
This textbook is also an ideal resource for graduate students coming
from other subjects who will use numerical techniques extensively in
their graduate studies.


A Concise Introduction to Numerical Analysis

A. C. Faul
University of Cambridge, UK
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not
warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® soft-
ware or related products does not constitute endorsement or sponsorship by The MathWorks of a particular
pedagogical approach or particular use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2016 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works


Version Date: 20160224

International Standard Book Number-13: 978-1-4987-1219-4 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at


http://www.crcpress.com
To Philip, Rosemary,
and Sheila.
Contents

List of Figures xi
Preface xiii
Acknowledgments xv

Chapter 1  Fundamentals 1
1.1 Floating Point Arithmetic 1
1.2 Overflow and Underflow 3
1.3 Absolute, Relative Error, Machine Epsilon 4
1.4 Forward and Backward Error Analysis 6
1.5 Loss of Significance 8
1.6 Robustness 11
1.7 Error Testing and Order of Convergence 12
1.8 Computational Complexity 14
1.9 Condition 15
1.10 Revision Exercises 19

Chapter 2  Linear Systems 23


2.1 Simultaneous Linear Equations 23
2.2 Gaussian Elimination and Pivoting 25
2.3 LU Factorization 27
2.4 Cholesky Factorization 32
2.5 QR Factorization 34
2.6 The Gram–Schmidt Algorithm 36
2.7 Givens Rotations 38
2.8 Householder Reflections 41
2.9 Linear Least Squares 42
2.10 Singular Value Decomposition 43
2.11 Iterative Schemes and Splitting 46
2.12 Jacobi and Gauss–Seidel Iterations 48


2.13 Relaxation 51
2.14 Steepest Descent Method 52
2.15 Conjugate Gradients 56
2.16 Krylov Subspaces and Pre-Conditioning 59
2.17 Eigenvalues and Eigenvectors 63
2.18 The Power Method 63
2.19 Inverse Iteration 67
2.20 Deflation 69
2.21 Revision Exercises 72

Chapter 3  Interpolation and Approximation Theory 79


3.1 Lagrange Form of Polynomial Interpolation 79
3.2 Newton Form of Polynomial Interpolation 84
3.3 Polynomial Best Approximations 90
3.4 Orthogonal polynomials 91
3.5 Least-Squares Polynomial Fitting 94
3.6 The Peano Kernel Theorem 95
3.7 Splines 98
3.8 B-Spline 105
3.9 Revision Exercises 110

Chapter 4  Non-Linear Systems 113


4.1 Bisection, Regula Falsi, and Secant Method 113
4.2 Newton’s Method 116
4.3 Broyden’s Method 119
4.4 Householder Methods 121
4.5 Müller’s Method 122
4.6 Inverse Quadratic Interpolation 123
4.7 Fixed Point Iteration Theory 124
4.8 Mixed Methods 126
4.9 Revision Exercises 127

Chapter 5  Numerical Integration 131


5.1 Mid-Point and Trapezium Rule 131
5.2 The Peano Kernel Theorem 133
5.3 Simpson’s Rule 135

5.4 Newton–Cotes Rules 137


5.5 Gaussian Quadrature 138
5.6 Composite Rules 145
5.7 Multi-Dimensional Integration 150
5.8 Monte Carlo Methods 152
5.9 Revision Exercises 153

Chapter 6  ODEs 157


6.1 One-Step Methods 157
6.2 Multistep Methods, Order, and Consistency 159
6.3 Order Conditions 162
6.4 Stiffness and A-Stability 164
6.5 Adams Methods 169
6.6 Backward Differentiation Formulae 172
6.7 The Milne and Zadunaisky Device 174
6.8 Rational Methods 177
6.9 Runge–Kutta Methods 179
6.10 Revision Exercises 201

Chapter 7  Numerical Differentiation 205


7.1 Finite Differences 206
7.2 Differentiation of Incomplete or Inexact Data 209

Chapter 8  PDEs 211


8.1 Classification of PDEs 211
8.2 Parabolic PDEs 213
8.2.1 Finite Differences 214
8.2.2 Stability and Its Eigenvalue Analysis 217
8.2.3 Cauchy Problems and the Fourier Analysis
of Stability 222
8.3 Elliptic PDEs 227
8.3.1 Computational Stencils 228
8.3.2 Sparse Algebraic Systems Arising from
Computational Stencils 233
8.3.3 Hockney Algorithm 235
8.3.4 Multigrid Methods 238
8.4 Parabolic PDEs in Two Dimensions 243

8.4.1 Splitting 246


8.5 Hyperbolic PDEs 253
8.5.1 Advection Equation 253
8.5.2 The Wave Equation 256
8.6 Spectral Methods 258
8.6.1 Spectral Solution to the Poisson Equation 267
8.7 Finite Element Method 269
8.8 Revision Exercises 277

Bibliography 285

Index 287
List of Figures

1.1 Example of a well-conditioned and an ill-conditioned problem 16

2.1 Worst-case scenario for the steepest descent method 55


2.2 Conjugate gradient method applied to the same problem as in
Figure 2.1 58

3.1 Interpolation of Runge’s example with polynomials of degree


8 (top) and degree 16 (bottom) 82
3.2 The 10th Chebyshev Polynomial 83
3.3 Divided difference table 86
3.4 Basis of the subspace S0 with 8 knots 102
3.5 Example of a Lagrange cubic spline with 9 knots 103
3.6 Lagrange function for quadratic spline interpolation 105
3.7 Hat functions generated by linear B-splines on equally spaced
nodes 108
3.8 Basis of cubic B-splines 108

4.1 The rule of false position 114


4.2 Newton’s method 117
4.3 Illustration of the secant method on the left and Müller’s
method on the right 123
4.4 Cobweb plot for fixed point iteration 126

5.1 The mid-point rule 132


5.2 The trapezium rule 133
5.3 Simpson’s rule 136
5.4 The composite midpoint rule 146
5.5 The composite trapezium rule 147
5.6 The composite rectangle rule 148

6.1 The forward Euler method 158


6.2 Stability domains for θ = 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2,
and 0.1 167
6.3 The stability domains for various Adams–Bashforth methods 171
6.4 The stability domains for various Adams–Moulton methods 172
6.5 The stability domains of various BDF methods in grey. The
instability regions are in white. 174
6.6 Butcher tableau 181
6.7 Stability domain of the methods given by (6.17) 182
6.8 The stability domain of the original Runge–Kutta method
given by (6.18) 184
6.9 Stability domain of Kutta’s and Nystrom’s method 185
6.10 Tableau of the Bogacki–Shampine method 185
6.11 Tableau of the Fehlberg method 186
6.12 Tableau of the Cash–Karp method 187
6.13 Tableau of the Dormand–Prince method 187
6.14 The 2-stage Radau and Lobatto methods 191
6.15 The 3-stage Radau methods 192
6.16 The 3-stage Lobatto methods 193
6.17 Stability domain of the method given in (6.20). The instability
region is white. 195
6.18 Fourth order elementary differentials 199

7.1 Inexact and incomplete data with a fitted curve 208

8.1 Gibbs effect when approximating the line y = x for the choices
N = 4, 8, 16, 32 259
Preface

This book has been developed from the notes of lectures on Numerical Analysis
held within the MPhil in Scientific Computing at the University of Cambridge,
UK. This course has been taught successfully since 2011. Terms at Cambridge
University are very short, only eight weeks in length. Therefore lectures are
concise, concentrating on the essential content, but also conveying the under-
lying connecting principles. On the other hand, the Cambridge Mathematical
Tripos was established in around 1790. Thus, knowledge has been distilled
literally over centuries.
I have tried to carry over this spirit into the book. Students will get a
concise, but thorough introduction to numerical analysis. In addition the al-
gorithmic principles are emphasized to encourage a deeper understanding of
why an algorithm is suitable (and sometimes unsuitable) for a particular
problem.
The book is also intended for graduate students coming from other sub-
jects, but who will use numerical techniques extensively in their graduate
studies. The intake of the MPhil in Scientific Computing is very varied. Mathe-
maticians are actually in the minority and share the classroom with physicists,
engineers, chemists, computer scientists, and others. Also some of the MPhil
students return to university after a professional career wanting to further
and deepen their knowledge of techniques they have found they are lacking in
their professional life. Because of this the book strikes a difficult balance be-
tween being mathematically comprehensive, but also not overwhelming with
mathematical detail. In some places where further detail was felt to be out of
scope of this book, the reader is referred to further reading.
Techniques are illustrated by MATLAB®¹ implementations. The main purpose is to show the workings of the method and thus MATLAB®'s own implementations are avoided (unless they are used as building blocks of an algorithm). In some cases the listings are printed in the book, but all are available online at https://www.crcpress.com/A-Concise-Introduction-to-Numerical-Analysis/Faul/9781498712187 as part of the package K25104_Downloads.zip.
Most implementations are in the form of functions returning the outcome of
the algorithm. Also examples for the use of the functions are given.
Exercises are put inline with the text where appropriate. Each chapter ends
with a selection of revision exercises which are suitable exam questions. A PDF
entitled “Solutions to Odd-Numbered Exercises for a Concise Introduction
to Numerical Analysis” can be downloaded at https://www.crcpress.com/A-Concise-Introduction-to-Numerical-Analysis/Faul/9781498712187 as part of
the package K25104_Downloads.zip.
Students will find the index comprehensive, making it easy to find the
information they are after. Hopefully this will prove the book useful as a ref-
erence and make it an essential on the bookshelves of anybody using numerical
algorithms.

1 MATLAB and Simulink are registered trademarks of The Mathworks, Inc. For product

information please contact:


The Mathworks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098 USA
Tel: 508-647-7000
Fax: 508-647-7001
E–mail: [email protected]
Web: www.mathworks.com
Acknowledgments

First and foremost I have to thank Dr Nikos Nikiforakis without whom this
book would not have come into being. I first contacted him in 2011 when I
was looking for work which I could balance with being a mum of three. My
intention was to supervise small groups of students in Numerical Analysis at
Cambridge University. He asked whether I could also lecture the subject for
the students undertaking the MPhil in Scientific Computing. My involvement
in the MPhil and his research group grew from there. Aside from his unwa-
vering belief in what I can accomplish, he has become one of my best friends
supporting me through difficult times.
Next thanks are also due to my PhD supervisor Professor Mike Powell who
sadly passed away in April 2014. From him I learnt that one should strive for
understanding and simplicity. Some of his advice was, “Never cite a paper
you haven’t understood.” My citation lists have been short ever since. I more
often saw him working through an algorithm with pen and paper than sitting
at a computer. He wanted to know why a particular algorithm was successful.
I now ask my students, “How can you improve on something, if you do not
understand it?”
Of course I also need to thank all the MPhil students whose questions and
quest to find typos have improved this book over several years, especially Peter
Wirnsberger, Will Drazin, and Chongli Qin. In addition, Geraint Harcombe
contributed significantly to the MATLAB® examples of this book.
I would also like to express my gratitude to Cambridge University and the
staff and fellows at Selwyn College for creating such a wonderful atmosphere
in which to learn and teach.
This book would not have been written without the support of many people
in my private life, foremost my parents Helmut and Marieluise Faul, who
instilled a love for knowledge in me. Next, my many friends of whom I would
like to mention, especially the ones helping out with childcare, and by name,
Annemarie Moore (née Clemence) who was always there when I needed help
to clear my head, Sybille Hutter for emotional support, and Karen Higgins
who is brilliant at sorting out anything practical.

A.C. Faul

CHAPTER 1

Fundamentals

1.1 Floating Point Arithmetic


We live in a continuous world with infinitely many real numbers. However,
a computer has only a finite number of bits. This requires an approximate
representation. In the past, several different representations of real numbers
have been suggested, but now the most widely used by far is the floating
point representation. Each floating point representation has a base β (which
is always assumed to be even) which is typically 2 (binary), 8 (octal), 10
(decimal), or 16 (hexadecimal), and a precision p which is the number of
digits (of base β) held in a floating point number. For example, if β = 10
and p = 5, the number 0.1 is represented as 1.0000 × 10−1 . On the other
hand, if β = 2 and p = 20, the decimal number 0.1 cannot be represented
exactly but is approximately 1.1001100110011001100 × 2−4 . We can write
the representation as ±d0 .d1 · · · dp−1 × β e , where d0 .d1 · · · dp−1 is called the
significand (or mantissa) and has p digits and e is the exponent. If the leading
digit d0 is non-zero, the number is said to be normalized . More precisely
±d0 .d1 · · · dp−1 × β e is the number

±(d0 + d1 β −1 + d2 β −2 + · · · + dp−1 β −(p−1) )β e , 0 ≤ di < β.

If the exponents of two floating point numbers are the same, they are said
to be of the same magnitude. Let’s look at two floating point numbers of
the same magnitude which also have the same digits apart from the digit in
position p, which has index p − 1. We assume that they only differ by one in
that digit. These floating point numbers are neighbours in the representation
and differ by
1 × β −(p−1) × β e = β e−p+1 .
Thus, if the exponent is large the difference between neighbouring floating
point numbers is large, while if the exponent is small the difference between
neighbouring floating point numbers is small. This means floating point num-
bers are more dense around zero.
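This can be observed directly in MATLAB, where the built-in function eps(x) returns the spacing between x and the next larger double precision number. The following fragment is only an illustrative sketch, not one of the book's listings:

% The spacing between neighbouring double precision numbers grows with
% the magnitude of the numbers, so floats are densest around zero.
for e = [0 4 8 16]
    x = 10^e;
    fprintf('spacing near 1e%-2d is %.3e\n', e, eps(x));
end
fprintf('spacing just above 0 is %.3e\n', eps(0));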

As an example consider the decimal number 3.141592, the first seven digits
of π. Converting this into binary, we could approximate it by

1.1001001000011111101011111100100011 × 2^1
= (1 + 1·(1/2) + 0·(1/4) + 0·(1/8) + 1·(1/16) + 0·(1/32) + 0·(1/64) + ...) × 2^1 ≈ 3.14159200003.

If we omit the last digit we arrive at

1.100100100001111110101111110010001 × 2^1 ≈ 3.14159199991.

Looking at the conversion to a hexadecimal representation, we get the approx-


imation
3.243F5F92
= 3 + 2·(1/16) + 4·(1/16^2) + 3·(1/16^3) + 15·(1/16^4) + 5·(1/16^5) + ... ≈ 3.14159200001.

Omitting the last digit again gives

3.243F5F9 ≈ 3.14159199968.

Thus the amount of accuracy lost varies with the underlying representation.
The largest and smallest allowable exponents are denoted emax and emin ,
respectively. Note that emax is positive, while emin is negative. Thus there are
emax −emin +1 possible exponents, the +1 standing for the zero exponent. Since
there are β p possible significands, a floating-point number can be encoded in
[log2 (emax − emin + 1)] + [log2 (β p )] + 1 bits where the final +1 is for the sign
bit.
The storage of floating point numbers varies between machine architec-
tures. A particular computer may store the number using 24 bits for the
significand, 1 bit for the sign (of the significand), and 7 bits for the exponent
in order to store each floating-point number in 4 bytes ( 1 byte = 8 bits). This
format may be used by two different machines, but they may store the expo-
nent and significand in the opposite order. Calculations will produce the same
answer, but the internal bit pattern in each word will be different. Therefore
operations on bytes should be avoided and such code optimization left to the
compiler.
There are two reasons why a real number cannot be exactly represented
as a floating-point number. The first one is illustrated by the decimal number
0.1. Although it has a finite decimal representation, in binary it has an infinite
repeating representation. Thus in this representation it lies strictly between
two floating point numbers and is neither of them.
Another situation is that a real number is too large or too small. This is
also known as being out of range. That is, its absolute value is larger than or
equal to β × β emax or smaller than 1.0 × β emin . We use the terms overflow and
underflow for these numbers. These will be discussed in the next chapter.

1.2 Overflow and Underflow


Both overflow and underflow present difficulties but in rather different ways.
The representation of the exponent is chosen in the IEEE binary standard
with this in mind. It uses a biased representation (as opposed to sign/magni-
tude and two’s complement, for which see [12] I. Koren Computer Arithmetic
Algorithms). In the case of single precision, where the exponent is stored in 8
bits, the bias is 127 (for double precision, which uses 11 bits, it is 1023). If the
exponent bits are interpreted as an unsigned integer k, then the exponent of
the floating point number is k −127. This is often called the unbiased exponent
to distinguish it from the biased exponent k.
In single precision the maximum and minimum allowable values for the
unbiased exponent are emax = 127 and emin = −126. The reason for having
|emin| < emax is so that the reciprocal of the smallest number (i.e., 1/2^emin)
will not overflow. However, the reciprocal of the largest number will underflow,
but this is considered less serious than overflow.
The exponents emax + 1 and emin − 1 are used to encode special quantities
as we will see below. This means that the unbiased exponents range between
emin − 1 = −127 and emax + 1 = 128, whereas the biased exponents range
between 0 and 255, which are the non-negative numbers that can be repre-
sented using 8 bits. Since floating point numbers are always normalized, the
most significant bit of the significand is always 1 when using base β = 2, and
thus this bit does not need to be stored. It is known as the hidden bit. Using
this trick the significand of the number 1 is entirely zero. However, the signif-
icand of the number 0 is also entirely zero. This requires a special convention
to distinguish 0 from 1. The method is that an exponent of emin − 1 and a
significand of all zeros represents 0. The following table shows which other
special quantities are encoded using emax + 1 and emin − 1.

Exponent              Significand    Represents
e = emin − 1          s = 0          ±0
e = emin − 1          s ≠ 0          0.s × 2^emin
emin ≤ e ≤ emax       any s          1.s × 2^e
e = emax + 1          s = 0          ±∞
e = emax + 1          s ≠ 0          NaN

Here NaN stands for “not a number” and shows that the result of an operation
is undefined.
Overflow is caused by any operation whose result is too large in absolute
value to be represented. This can be the result of exponentiation or multipli-
cation or division or, just possibly, addition or subtraction. It is better to high-
light the occurrence of overflow with the quantity ∞ than returning the largest
representable number. As an example, consider computing √(x² + y²), when
β = 10, p = 3, and emax = 100. If x = 3.00 × 10^60 and y = 4.00 × 10^60, then
x², y², and x² + y² will each overflow in turn, and be replaced by 9.99 × 10^100.
So the final result will be √(9.99 × 10^100) = 3.16 × 10^50, which is drastically
wrong: the correct answer is 5 × 10^60. In IEEE arithmetic, the result of x² is ∞
and so is y², x² + y², and √(x² + y²). Thus the final result is ∞, indicating that
the problems should be dealt with programmatically. A well-written routine
will remove possibilities for overflow occurring in the first place.

Exercise 1.1. Write a C-routine which implements the calculation of
√(x² + y²) in a way which avoids overflow. Consider the cases where x and
y differ largely in magnitude.
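A common way of removing the overflow, sketched here in MATLAB rather than in the C asked for in the exercise, is to factor out the larger magnitude before squaring; the function name safe_hypot is only illustrative.

function r = safe_hypot(x, y)
% Compute sqrt(x^2 + y^2) without overflow in the intermediate squares
% by scaling with the larger magnitude.
    a = max(abs(x), abs(y));
    b = min(abs(x), abs(y));
    if a == 0
        r = 0;                % both arguments are zero
        return;
    end
    t = b / a;                % 0 <= t <= 1, so t^2 cannot overflow
    r = a * sqrt(1 + t^2);
end

With x = 3.00 × 10^60 and y = 4.00 × 10^60 this returns 5 × 10^60, as expected.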
Underflow is caused by any calculation where the result is too small to be
distinguished from zero. As with overflow this can be caused by different op-
erations, although addition and subtraction are less likely to cause it. However,
in many circumstances continuing the calculation with zero is sufficient. Often
it is safe to treat an underflowing value as zero. However, there are several
exceptions. For example, suppose a loop terminates after some fixed time has
elapsed and the calculation uses a variable time step δt which is used to update
the elapsed time t by assignments of the form

δt := δt × update
(1.1)
t := t + δt

Updates such as in Equation (1.1) occur in stiff differential equations. If the


variable δt ever underflows, then the calculation may go into an infinite loop.
A well-thought-through algorithm would anticipate and avoid such problems.
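A sketch of the kind of safeguard such an algorithm might employ is given below; the numbers are purely illustrative, and realmin is MATLAB's smallest normalized positive number.

% Variable step size loop that guards against delta_t underflowing
% (or becoming too small to change t) instead of looping forever.
t = 0;  t_end = 1;
delta_t = 1e-3;
update = 0.5;                         % illustrative update factor
while t < t_end
    delta_t = delta_t * update;       % as in Equation (1.1)
    if delta_t < realmin || t + delta_t == t
        error('Time step too small: t no longer advances.');
    end
    t = t + delta_t;
end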

1.3 Absolute, Relative Error, Machine Epsilon


Suppose that x, y are real numbers well away from overflow or underflow. Let
x∗ denote the floating-point representation of x. We define the absolute error ε by

x∗ = x + ε

and the relative error δ by

x∗ = x(1 + δ) = x + xδ.

Thus
ε = xδ   or, if x ≠ 0,   δ = ε/x.
The absolute and relative error are zero if and only if x can be represented
exactly in the chosen floating point representation.
In floating-point arithmetic, relative error seems appropriate because each
number is represented to a similar relative accuracy. For example consider
β = 10 and p = 3 and the numbers x = 1.001 × 10^3 and y = 1.001 × 10^0
with representations x∗ = 1.00 × 10^3 and y∗ = 1.00 × 10^0. For x we have an
absolute error of εx = 0.001 × 10^3 = 1 and for y, εy = 0.001 × 10^0 = 0.001.
However, the relative errors are

δx = εx/x = 1/(1.001 × 10^3) ≈ 0.999 × 10^−3,
δy = εy/y = 0.001/(1.001 × 10^0) = (1 × 10^−3)/1.001 ≈ 0.999 × 10^−3.
When x = 0 or x is very close to 0, we will need to consider absolute error
as well. In the latter case, when dividing by x to obtain the relative error, the
resultant relative error will be very large, since the reciprocal of a very small
number is very large.
Writing x∗ = x(1 + δ) we see that δ depends on x. Consider two neighbouring numbers in the floating point representation of magnitude β^e. We have seen earlier that they differ by β^(e−p+1). Any real number x lying between these two floating point numbers is represented by the floating point number it is closer to. Thus it is represented with an absolute error ε of less than 1/2 × β^(e−p+1) in modulus. To obtain the relative error we need to divide by x. However it is sufficient to divide by 1 × β^e to obtain a bound for the relative error, since |x| > 1 × β^e. Thus

|δ| = |ε/x| < (1/2 × β^(e−p+1))/(1 × β^e) = 1/2 × β^(−p+1).

Generally, the smallest number u such that |δ| ≤ u, for all x (excluding x
values very close to overflow or underflow) is called the unit round off .
On most computers 1∗ = 1. The smallest positive εm such that

(1 + εm)∗ > 1

is called machine epsilon or macheps. It is often assumed that u ≈ εm and the terms machine epsilon and unit round off are used interchangeably. Indeed, often people refer to machine epsilon, but mean unit round off.
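The definition suggests a direct way of computing machine epsilon by repeated halving. A minimal MATLAB sketch for double precision (the result can be compared with the built-in eps):

% Find the smallest power of two eps_m for which (1 + eps_m) > 1 holds
% in floating point arithmetic.
eps_m = 1;
while (1 + eps_m/2) > 1
    eps_m = eps_m / 2;
end
fprintf('computed machine epsilon: %.3e\n', eps_m);
fprintf('built-in eps:             %.3e\n', eps);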
Suppose β = 10 and p = 3. Consider, for example, the number π (whose value is the non-terminating, non-repeating decimal 3.1415926535 . . .), whose representation is 3.14 × 10^0. There is said to be an error of 0.15926535 . . . units in the last place. This is abbreviated ulp (plural ulps). In general, if the floating-point number d0.d1 · · · dp−1 × β^e is used to represent x, then it is in error by

|d0.d1 · · · dp−1 − x/β^e| × β^(p−1)   units in the last place.

It can be viewed as a fraction of the least significant digit.


In particular, when a real number is approximated by the closest floating-
point number d0 .d1 · · · dp−1 × β e , we have already seen that the absolute error
might be as large as 1/2 × β e−p+1 . Numbers of the form d0 .d1 · · · dp−1 × β e
represent real numbers in the interval [β e , β × β e ), where the round bracket

indicates that this number is not included. Thus the relative error δ corre-
sponding to 0.5 ulps ranges between

(1/2 × β^(e−p+1))/β^(e+1) < δ ≤ (1/2 × β^(e−p+1))/β^e   ⇒   (1/2)β^(−p) < δ ≤ (1/2)β^(−p+1)      (1.2)
In particular, the relative error can vary by a factor of β for a fixed absolute
error. This factor is called the wobble. We can set the unit round off u to
1/2 × β −p+1 . With the choice β = 10 and p = 3 we have u = 1/2 × 10−2 =
0.005. The quantities u and ulp can be viewed as measuring units. The absolute
error is measured in ulps and the relative error in u.
Continuing with β = 10 and p = 3, consider the real number x = 12.35. It
is approximated by x∗ = 1.24×101 . The absolute error is 0.5 ulps, the relative
error is
δ = (0.005 × 10^1)/(1.235 × 10^1) = 0.05/12.35 ≈ 0.004 = (0.004/0.005)u = 0.8u.
Next we multiply by 8. The exact value is 8x = 98.8, while the computed
value is 8x∗ = 8 × 1.24 × 101 = 9.92 × 101 . The error is now 4.0 ulps. On the
other hand, the relative error is

|9.92 × 10^1 − 98.8|/98.8 = 0.4/98.8 ≈ 0.004 = (0.004/0.005)u = 0.8u.
The error measured in ulps has grown 8 times larger. The relative error, how-
ever, is the same, because the scaling factor to obtain the relative error has
also been multiplied by 8.

1.4 Forward and Backward Error Analysis


Forward error analysis examines how perturbations of the input propagate.
For example, consider the function f (x) = x2 . Let x∗ = x(1 + δ) be the
representation of x, then squaring both sides gives

(x∗ )2 = x2 (1 + δ)2
= x2 (1 + 2δ + δ 2 )
≈ x2 (1 + 2δ),

because δ 2 is small. This means the relative error is approximately doubled.


Forward error analysis often leads to pessimistic overestimates of the error,
especially when a sequence of calculations is considered and in each calcu-
lation the error of the worst case is assumed. When error analyses were first
performed, it was feared that the final error could be unacceptable, because of
the accumulation of intermediate errors. In practice, however, errors average
out. An error in one calculation gets reduced by an error of opposite sign in
the next calculation.
Backward error analysis examines the question: How much error in input
would be required to explain all output error? It assumes that an approximate

solution to a problem is good if it is the exact solution to a nearby problem.


Returning to our example, the output error can be written as

[f (x)]∗ = (x2 )∗ = x2 (1 + ρ),

where ρ denotes the relative error in the output such that ρ ≤ u. As ρ is small,
1 + ρ > 0. Thus there exists ρ̃ such that (1 + ρ̃)2 = 1 + ρ with |ρ̃| < |ρ| ≤ u,
since (1 + ρ̃)2 = 1 + 2ρ̃ + ρ̃2 = 1 + ρ̃(2 + ρ̃). We can now write

[f (x)]∗ = x2 (1 + ρ̃)2
= f [x(1 + ρ̃)].

If the backward error is small, we accept the solution, since it is the correct
solution to a nearby problem.
Another reason for the preference which is given to backward error analysis
is that often the inverse of a calculation can be performed much easier than
the calculation itself. Take for example

f(x) = √x

with the inverse

f^(−1)(y) = y².

The square root of x can be calculated iteratively by the Babylonian method, letting

xn+1 = (1/2)(xn + x/xn).

We can test whether a good-enough approximation has been reached by checking the difference

xn² − x,

which is the backward error of the approximation to √x. Note that in the
above analysis we did not use the representation x∗ of x. The assumption is
that x is represented correctly. We will continue to do so and concentrate on
the errors introduced by performing approximate calculations.
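A minimal MATLAB sketch of the Babylonian method with the backward error used as the stopping test; the function name and the tolerance are illustrative and x > 0 is assumed.

function xn = babylonian_sqrt(x, tol)
% Approximate sqrt(x) by the Babylonian method, stopping when the
% backward error xn^2 - x is small relative to x (assumes x > 0).
    xn = x;                            % simple starting guess
    while abs(xn^2 - x) > tol * abs(x)
        xn = 0.5 * (xn + x / xn);
    end
end

For example, babylonian_sqrt(2, 1e-12) agrees with sqrt(2) essentially to machine precision.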
Another example is approximating e^x by

f(x) = 1 + x + x²/2! + x³/3!.

The forward error is simply f(x) − e^x. For the backward error we need to find x∗ such that e^(x∗) = f(x). In particular,

x∗ = ln(f(x)).

At the point x = 1 we have (to seven decimal places) e^1 = 2.718282, whereas f(1) = 1 + 1 + 1/2 + 1/6 = 2.666667. Thus the forward error is 2.666667 − 2.718282 = −0.051615, while the backward error is ln(f(1)) − 1 = 0.980829 − 1 = −0.019171. They are different to each other and cannot be compared.

Next we consider how errors build up using the basic floating point op-
erations: multiplication ×, division /, and exponentiation ↑. As a practical
example, consider double precision in IBM System/370 arithmetic. Here the
value of u is approximately 0.22 × 10−15 . We simplify this by letting all num-
bers be represented with the same relative error 10−15 .
Starting with multiplication, we write
x∗1 = x1 (1 + δ1 )
x∗2 = x2 (1 + δ2 ).
Then
x∗1 × x∗2 = x1 x2 (1 + δ1 )(1 + δ2 )
= x1 x2 (1 + δ1 + δ2 + δ1 δ2 ).
The term δ1 δ2 can be neglected, since it is of magnitude 10−30 in our example.
The worst case is, when δ1 and δ2 have the same sign, i.e., the relative error
in x∗1 × x∗2 is no worse than |δ1 | + |δ2 |. If we perform one million floating-
point multiplications, then at worst the relative error will have built up to
106 × 10−15 = 10−9 .
Division can be easily analyzed in the same way by using the binomial
expansion to write
1/x∗2 = (1/x2)(1 + δ2)^(−1) = (1/x2)(1 − δ2 + ...).
The omitted terms are of magnitude 10−30 or smaller and can be neglected.
Again, the relative error in x∗1 /x∗2 is no worse than |δ1 | + |δ2 |.
We can compute x∗1 ↑ n, for any integer n by repeated multiplication or
division. Consequently we can argue that the relative error in x∗1 ↑ n is no
worse than n|δ1 |.
This leaves addition + and subtraction − with which we will deal in the
next section. Here the error build-up depends on absolute accuracy, rather
than relative accuracy.

1.5 Loss of Significance


Consider

x∗1 + x∗2 = x1(1 + δ1) + x2(1 + δ2)
         = x1 + x2 + (x1δ1 + x2δ2)
         = x1 + x2 + (ε1 + ε2).

Note how the error build-up in addition and subtraction depends on the absolute errors ε1 and ε2 in representing x1, x2, respectively. In the worst case scenario ε1 and ε2 have the same sign, i.e., the absolute error in x∗1 + x∗2 is no worse than |ε1| + |ε2|.
Using the fact that (−x2)∗ = −x2 − ε2 we get that the absolute error in x∗1 − x∗2 is also no worse than |ε1| + |ε2|. However, the relative error is

(|ε1| + |ε2|)/(x1 − x2).

Suppose we calculate √10 − π using a computer with precision p = 6. Then √10 ≈ 3.16228 with absolute error of about 2 × 10^−6 and π ≈ 3.14159 with absolute error of about 3 × 10^−6. The absolute error in the result √10 − π ≈ 0.02069 is about 5 × 10^−6. However, calculating the relative error, we get approximately

5 × 10^−6 / 0.02068 . . . ≈ 2 × 10^−4.

This means that the relative error in the subtraction is about 100 times as big as the relative error in √10 or π.
This problem is known as loss of significance. It can occur whenever two
similar numbers of equal sign are subtracted (or two similar numbers of op-
posite sign are added). If possible it should be avoided programmatically. We
will see an example of how to do so later when discussing robustness.
As another example with the same precision p = 6, consider the numbers
x1 = 1.00000 and x2 = 9.99999 × 10^−1. The true value of x1 − x2 is
0.000001. However, when calculating the difference, the computer first adjusts
the magnitude such that both x1 and x2 have the same magnitude. This way
x2 becomes 0.99999. Note that we have lost one digit in x2 . The computed
result is 1.00000 − 0.99999 = 0.00001 and the absolute error is |0.000001 −
0.00001| = 0.000009. The relative error is 0.000009/0.000001 = 9. We see that
the relative error has become as large as the base minus one. The following
theorem generalizes this for any base.
Theorem 1.1. Using a floating-point format with parameters β and p, and
computing differences using p digits, the relative error of the result can be as
large as β − 1.
Proof. Consider the expression x − y when x = 1.00 . . . 0 and y = ρ.ρρ . . . ρ ×
β −1 , where ρ = β − 1. Here y has p digits (all equal to ρ). The exact difference
is x − y = β −p . However, when computing the answer using only p digits, the
rightmost digit of y gets shifted off, and so the computed difference is β −p+1 .
Thus the absolute error is β −p+1 − β −p = β −p (β − 1), and the relative error
is
β^(−p)(β − 1) / β^(−p) = β − 1.

The problem is solved by the introduction of a guard digit. That is, after
the smaller number is shifted to have the same exponent as the larger number,
it is truncated to p+1 digits. Then the subtraction is performed and the result
rounded to p digits.
Theorem 1.2. If x and y are positive floating-point numbers in a format with
parameters β and p, and if subtraction is done with p+1 digits (i.e., one guard
digit), then the relative error in the result is less than (β/2 + 1)β^(−p).

Proof. Without loss of generality, x > y, since otherwise we can exchange


their roles. We can also assume that x is represented by x0 .x1 . . . xp−1 × β 0 ,
since both numbers can be scaled by the same factor. If y is represented as
y0 .y1 . . . yp−1 , then the difference is exact, since both numbers have the same
magnitude. So in this case there is no error.
Let y be represented by y0.y1 . . . yp−1 × β^(−k−1), where k ≥ 0. That is, y is at least one magnitude smaller than x. To perform the subtraction the digits of y are shifted to the right: after the point come k zeros (in the positions β^(−1), . . . , β^(−k)), then the digits y0, . . . , yp−k−1 (in the positions β^(−k−1), . . . , β^(−p)), and finally yp−k, . . . , yp−1 (in the positions β^(−p−1), . . . , β^(−p−k)). The first p + 1 digits, that is those in the positions β^0 down to β^(−p), can be kept.
Let ŷ be y truncated to p + 1 digits. Then

y − ŷ = yp−k β −p−1 + yp−k+1 β −p−2 + · · · + yp−1 β −p−k

≤ (β − 1)(β −p−1 + β −p−2 + . . . + β −p−k ).

From the definition of guard digit, the computed value of x−y is x− ŷ rounded
to be a floating-point number, that is, (x − ŷ) + α, where the rounding error
α satisfies
|α| ≤ (β/2)β^(−p).
The exact difference is x − y, so the absolute error is |(x − y) − (x − ŷ + α)| =
|ŷ − y − α|. The relative error is the absolute error divided by the true solution

|ŷ − y − α| / (x − y).
We now find a bound for the relative error. If x − y ≥ 1, then the relative
error is bounded by

|ŷ − y − α|/(x − y) ≤ (|y − ŷ| + |α|)/1
    ≤ β^(−p)[(β − 1)(β^(−1) + . . . + β^(−k)) + β/2]
    ≤ β^(−p)[1 + β^(−1) + . . . + β^(−k+1) − β^(−1) − . . . − β^(−k) + β/2]
    ≤ β^(−p)(1 + β/2),

which is the bound given in the theorem.



If x − y < 1, we need to consider two cases. Firstly, if x − ŷ < 1, then


no rounding was necessary, because the first digit in x − ŷ is zero and can
be shifted off to the left, making room to keep the p + 1th digit. In this case
α = 0. Defining ρ = β − 1, we find the smallest x − y can be by letting x
be as small as possible, which is x = 1, and y as large as possible, which is
y = ρ.ρ . . . ρ × β^(−k−1). The difference is then

1 − 0.0 . . . 0ρ . . . ρ = 0.ρ . . . ρ0 . . . 01 > (β − 1)(β^(−1) + . . . + β^(−k)),

where on the left-hand side the digit ρ appears p times after k zeros, and on the right-hand side ρ appears k times, followed by p − 1 zeros and a final 1.

In this case the relative error is bounded by


(|y − ŷ| + |α|) / [(β − 1)(β^(−1) + . . . + β^(−k))] < [(β − 1)β^(−p)(β^(−1) + . . . + β^(−k))] / [(β − 1)(β^(−1) + . . . + β^(−k))] = β^(−p).
The other case is when x − y < 1, but x − ŷ ≥ 1. The only way this could
happen is if x − ŷ = 1, in which case α = 0, then the above equation applies
and the relative error is bounded by β^(−p) < β^(−p)(1 + β/2).
Thus we have seen that the introduction of guard digits alleviates the loss
of significance. However, careful algorithm design can be much more effective,
as the following exercise illustrates.
Exercise 1.2. Consider the function sin x, which has the series expansion
sin x = x − x³/3! + x⁵/5! − . . . ,
which converges, for any x, to a value in the range −1 ≤ sin x ≤ 1. Write
a MATLAB-routine which examines the behaviour when summing this series
of terms with alternating signs as it stands for different starting values of x.
Decide on a convergence condition for the sum. For different x, how many
terms are needed to achieve convergence, and how large is the relative error
then? (It gets interesting when x is relatively large, e.g., 20). How can this
sum be safely implemented?

1.6 Robustness
An algorithm is described as robust if for any valid data input which is rea-
sonably representable, it completes successfully. This is often achieved at the
expense of time. Robustness is best illustrated by example. We consider the
quadratic equation ax2 + bx + c = 0. Solving it seems to be a very elementary
problem. Since it is often part of a larger calculation, it is important that it
is implemented in a robust way, meaning that it will not fail and give reason-
ably accurate answers for any coefficients a, b and c which are not too close
to overflow or underflow. The standard formula for the two roots is

x = (−b ± √(b² − 4ac)) / (2a).

A problem arises if b2 is much larger than |4ac|. In the worst case the difference
in the magnitudes of b2 and |4ac| is so large that b2 − 4ac evaluates to b2 and
the square root evaluates to b, and one of the calculated roots lies at zero.
Even if the difference in magnitude is not that large, one root is still small.
Without loss of generality we assume b > 0 and the small root is given by

x = (−b + √(b² − 4ac)) / (2a).      (1.3)
We note that there is loss of significance in the numerator. As we have seen
before, this can lead to a large relative error in the result compared to the
relative error in the input. The problem can be averted by manipulating the
formula in the following way:

x = [(−b + √(b² − 4ac)) / (2a)] × [(−b − √(b² − 4ac)) / (−b − √(b² − 4ac))]
  = [b² − (b² − 4ac)] / [2a(−b − √(b² − 4ac))]      (1.4)
  = −2c / (b + √(b² − 4ac)).
Now quantities of similar size are added instead of subtracted.
Taking a = 1, b = 100000 and c = 1 and as accuracy 2 × 10−10 , Equation
(1.3) gives x = −1.525878906 × 10−5 for the smaller root while (1.4) gives
x = −1.000000000 × 10−5 , which is the best this accuracy allows.
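A MATLAB sketch combining (1.3) and (1.4) so that the small root is always computed without cancellation; the function name is illustrative, and real roots with a ≠ 0 are assumed.

function [x1, x2] = robust_quadratic(a, b, c)
% Roots of a*x^2 + b*x + c = 0, avoiding loss of significance when
% b^2 is much larger than |4*a*c| (assumes a real discriminant).
    s = sqrt(b^2 - 4*a*c);
    if b >= 0
        x1 = (-b - s) / (2*a);        % larger magnitude root, no cancellation
        x2 = (-2*c) / (b + s);        % smaller root via formula (1.4)
    else
        x1 = (-b + s) / (2*a);
        x2 = (-2*c) / (b - s);
    end
end

With a = 1, b = 100000 and c = 1 the second output is −1.0000 × 10^−5, as in the discussion above.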
In general, adequate analysis has to be conducted to find cases where
numerical difficulties will be encountered, and a robust algorithm must use an
appropriate method in each case.

1.7 Error Testing and Order of Convergence


Often an algorithm first generates an approximation to the solution and then
improves this approximation again and again. This is called an iterative nu-
merical process. Often the calculations in each iteration are the same. How-
ever, sometimes the calculations are adjusted to reach the solution faster. If
the process is successful, the approximate solutions will converge to a solu-
tion. Note that it is a solution, not the solution. We will see this beautifully
illustrated when considering the fractals generated by Newton’s and Halley’s
methods.
More precisely, convergence of a sequence is defined as follows. Let x0 , x1 ,
x2 , . . . be a sequence (of approximations) and let x be the true solution. We
define the absolute error in the nth iteration as
εn = xn − x.

The sequence converges to the limit x of the sequence if

lim_{n→∞} εn = 0.

Note that convergence of a sequence is defined in terms of absolute error.


There are two forms of error testing, one using a target absolute accuracy εt, the other using a target relative error δt. In the first case the calculation is terminated when

|εn| ≤ εt.      (1.5)

In the second case the calculation is terminated when

|εn| ≤ δt|xn|.      (1.6)

Both methods are flawed under certain circumstances. If x is large, say 10^20, and u = 10^−16, then εn is never likely to be much less than 10^4, so condition (1.5) is unlikely to be satisfied if εt is chosen too small even when the process converges. On the other hand, if |xn| is very small, then δt|xn| may underflow and test (1.6) may never be satisfied (unless the error becomes exactly zero).
As (1.5) is useful when (1.6) is not, and vice versa, they are combined into a mixed error test. A target error ηt is prescribed and the calculation is terminated when

|εn| ≤ ηt(1 + |xn|).      (1.7)

If |xn| is small, ηt is regarded as target absolute error, or if |xn| is large ηt is regarded as target relative error.
Tests such as (1.7) are used in modern numerical software, but we have not addressed the problem of estimating εn, since the true value x is unknown. The simplest formula is

εn ≈ xn − xn−1.      (1.8)
However, theoretical research has shown that in a wide class of numerical
methods, cases arise where adjacent values in an approximation sequence have
the same value, but are both the incorrect answer. Test (1.8) will cause the
algorithm to terminate too early with an incorrect solution.
A safer estimate is
εn ≈ |xn − xn−1| + |xn−1 − xn−2|,      (1.9)
but again research has shown that even the approximations of three consec-
utive iterations can all be the same for certain methods, so (1.9) might not
work either. However, in many problems convergence can be tested indepen-
dently, for example when the inverse of a function can be easily calculated
(calculating the kth power as compared to taking the kth root). Error and
convergence testing should always be fitted to the underlying problem.
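As an illustration, the mixed test (1.7), together with the simple estimate (1.8) for εn, might be wired into an iteration as in the following MATLAB sketch (the fixed point map g and the target error are purely illustrative):

% Fixed point iteration x_{n+1} = g(x_n) with a mixed error test.
g = @(x) cos(x);                     % illustrative fixed point map
eta_t = 1e-10;                       % target error
x_old = 1;                           % starting guess
for n = 1:1000                       % cap the number of iterations
    x_new = g(x_old);
    err_est = abs(x_new - x_old);              % estimate (1.8) of |eps_n|
    if err_est <= eta_t * (1 + abs(x_new))     % mixed test (1.7)
        break;
    end
    x_old = x_new;
end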
We now turn our attention to the speed of convergence. It is defined by
comparing the errors of two subsequent iterations. If there exist constants p
and C such that
lim_{n→∞} εn+1/εn^p = C,

then the process has order of convergence p, where p ≥ 1. This is often expressed as

|εn+1| = O(|εn|^p),

known as the O-notation. Or in other words, order p convergence implies

|εn+1| ≈ C|εn|^p

for sufficiently large n. Thus the error in the next iteration step is approxi-
mately the pth power of the current iteration error times a constant C. For
p = 1 we need C < 1 to have a reduction in error. This is not necessary
for p > 1, because as long as the current error is less than one, taking any
power greater or equal to one leads to a reduction. For p = 0, the O-notation
becomes O(1), which signifies that something remains constant. Of course in
this case there is no convergence.
The following categorization for various rates of convergence is in use:
1. p = 1: linear convergence. Each iteration reduces the absolute error by roughly the same factor C. This is generally regarded as being too slow for practical methods.

2. p = 2: quadratic convergence. Each iteration squares the absolute error.


3. 1 < p < 2: superlinear convergence. This is not as good as quadratic but
the minimum acceptable rate in practice.
4. Exponential rate of convergence.
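When the true solution is known, for instance in a test problem, the order can be estimated from three consecutive errors, since |εn+1| ≈ C|εn|^p gives p ≈ log(|εn+1|/|εn|) / log(|εn|/|εn−1|). A MATLAB sketch using the Babylonian iteration of Section 1.4, which should give p close to 2:

% Estimate the order of convergence of the Babylonian iteration for sqrt(2).
x_true = sqrt(2);
xn = 2;                              % starting guess
err = zeros(1, 4);
for n = 1:4
    xn = 0.5 * (xn + 2/xn);          % Babylonian step
    err(n) = abs(xn - x_true);       % absolute error eps_n
end
p = log(err(3)/err(2)) / log(err(2)/err(1));
fprintf('estimated order of convergence: p = %.2f\n', p);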

So far we have only considered the concept of convergence for numbers


x1 , x2 , x3 , . . .and x. However, in numerical calculations the true solution might
be a complex structure. For example, it might be approximations to the so-
lution of a partial differential equation on an irregular grid. In this case a
measurement of how close the approximation is to the true solution has to
be defined. This can look very different depending on the underlying problem
which is being solved.

1.8 Computational Complexity


A well-designed algorithm should not only be robust, and have a fast rate
of convergence, but should also have a reasonable computational complexity.
That is, the computation time shall not increase prohibitively with the size
of the problem, because the algorithm is then too slow to be used for large
problems.
Suppose that some operation, call it ∗, is the most expensive in a particular algorithm. Let n be the size of the problem. If the number of operations of the algorithm can be expressed as O[f(n)] operations of type ∗, then we say
that the computational complexity is f (n). In other words, we neglect the less
expensive operations. However, less expensive operations cannot be neglected,
if a large number of them need to be performed for each expensive operation.
For example, in matrix calculations the most expensive operations are
multiplications of array elements and array references. Thus in this case the

operation may be defined to be a combination of one multiplication and one


or more array references. Let’s consider the multiplication of n × n matrices
A = (Aij ) and B = (Bij ) to form a product C = (Cij ). For each element in
C, we have to calculate
Cij = Σ_{k=1}^{n} Aik Bkj,

which requires n multiplications (plus two array references per multiplication).


Since there are n2 elements in C, the computational complexity is n2 ×n = n3 .
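The n³ behaviour can be made visible by writing out the triple loop and timing it for a few values of n; MATLAB's built-in product A*B is of course much faster, the loops below merely expose the operation count. A sketch:

% Triple loop matrix multiplication: n^2 elements of C, each needing
% n multiplications, giving n^3 operations overall.
n = 200;
A = rand(n);  B = rand(n);
C = zeros(n);
tic
for i = 1:n
    for j = 1:n
        s = 0;
        for k = 1:n
            s = s + A(i,k) * B(k,j);
        end
        C(i,j) = s;
    end
end
toc                                   % doubling n increases this roughly 8-fold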
Note that processes of lower complexity are absorbed into higher com-
plexity ones and do not change the overall computational complexity of an
algorithm. This is the case, unless the processes of lower complexity are per-
formed a large number of times.
For example, if an n2 process is performed each time an n3 process is
performed then, because of

O(n2 ) + O(n3 ) = O(n3 )

the overall computational complexity is still n3 . If, however, the n2 process


was performed n2 times each time the n3 process was performed, then the
computational complexity would be n2 × n2 = n4 .
Exercise 1.3. Let A be an n × n nonsingular band matrix that satisfies the
condition Aij = 0 for |i−j| > r, where r is small, and let Gaussian elimination
(introduced in Linear Systems 2.2) be used to solve Ax = b. Deduce that the
total number of additions and multiplications of the complete calculation can
be bounded by a constant multiple of nr2 .

1.9 Condition
The condition of a problem is inherent to the problem whichever algorithm is
used to solve it. The condition number of a numerical problem measures the
asymptotically worst case of how much the outcome can change in proportion
to small perturbations in the input data. A problem with a low condition
number is said to be well-conditioned, while a problem with a high condition
number is said to be ill-conditioned. The condition number is a property of
the problem and not of the different algorithms that can be used to solve the
problem.
As an example consider the problem of finding where a graph crosses the x-axis.
Naively one could draw the graph and measure the coordinates of the crossover
points. Figure 1.1 illustrates two cases. In the left-hand problem it would be
easier to measure the crossover points, while in the right-hand problem the
crossover points lie in a region of candidates. A better (or worse) algorithm
would be to use a higher (or lower) resolution. In the chapter on non-linear systems we will encounter many methods to find the roots of a function, that is, the points where the graph of the function crosses the x-axis.

Figure 1.1 Example of a well-conditioned and an ill-conditioned problem

As a further example we consider the problem of evaluating a differentiable


function f at a given point x, i.e., calculating f(x). Let x̂ be a point close to x. The relative change in x is

(x − x̂)/x

and the relative change in f(x) is

(f(x) − f(x̂))/f(x).

The condition number K is defined as the limit of the relative change in f(x) divided by the relative change in x as x̂ tends to x:

K(x) = lim_{x̂→x} [(f(x) − f(x̂))/f(x)] / [(x − x̂)/x]
     = lim_{x̂→x} [(f(x) − f(x̂))/f(x)] × [x/(x − x̂)]
     = [x/f(x)] lim_{x̂→x} (f(x) − f(x̂))/(x − x̂)
     = x f′(x)/f(x),

since the limit is exactly the definition of the derivative of f at x.
To illustrate we consider the functions f(x) = √x and f(x) = 1/(1 − x). In the first case we get

K(x) = x f′(x)/f(x) = x((1/2)x^(−1/2))/√x = 1/2.
So K is constant for all non-negative x and thus taking square roots has the
same condition for all inputs x. Whatever the value of x is, a perturbation
in the input leads to a perturbation of the output which is half of the input
perturbation.


Exercise 1.4. Perform a forward and backward error analysis for f(x) = √x. You should find that the relative error is reduced by half in the process.
However, when f(x) = 1/(1 − x) we get

K(x) = x f′(x)/f(x) = x[1/(1 − x)²]/[1/(1 − x)] = x/(1 − x).

Thus for values of x close to 1, K(x) can get large. For example, if x = 1.000001, then |K(1.000001)| ≈ 1.000001 × 10^6 and thus the relative error will increase by a factor of about 10^6.
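The predicted amplification is easy to observe numerically; the following MATLAB sketch compares the relative change in the output of f(x) = 1/(1 − x) with a small relative change in the input near x = 1 (the perturbation size is illustrative):

% Relative error amplification for f(x) = 1/(1-x) near x = 1.
f = @(x) 1 ./ (1 - x);
x = 1.000001;
delta = 1e-8;                        % relative perturbation of the input
x_pert = x * (1 + delta);
rel_change = abs((f(x_pert) - f(x)) / f(x));
fprintf('relative change in input:  %.1e\n', delta);
fprintf('relative change in output: %.1e\n', rel_change);
fprintf('amplification factor:      %.2e\n', rel_change / delta);

The amplification factor comes out close to the condition number of about 10^6.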
Exercise 1.5. Examine the condition of evaluating cos x.

As another example we examine the condition number of an n × n matrix


A = (Aij ) associated with the linear equation Ax = b. Here the right-hand
side b can be any non-zero vector. Many numerical algorithms reduce to solv-
ing a system of equations for many different right-hand sides b. Therefore,
knowing how accurately this can be done is essential. Remember that the
condition number is a property of the matrix, not the algorithm or the ma-
chine accuracy of the computer used to solve the corresponding system. The
condition number is the rate at which the solution x will change with changes
in the right-hand side b. A large condition number implies that even a small
error in b may cause a large error in x.
The condition number is defined to be the maximum ratio of the rela-
tive change in x divided by the relative change in b for any b. Let e be a
perturbation in b. Thus the relative change in b is

‖e‖/‖b‖ = ‖e‖/‖Ax‖,

where ‖ · ‖ denotes any vector norm. We will see below how different norms lead to different condition numbers.
Assuming that A is invertible, x is given by A^(−1)b and the perturbation in x is A^(−1)e. Thus the relative change in x is ‖A^(−1)e‖/‖x‖. Hence the condition number is

K(A) = max_{x,e≠0} (‖A^(−1)e‖/‖x‖) / (‖e‖/‖Ax‖) = max_{e≠0} (‖A^(−1)e‖/‖e‖) × max_{x≠0} (‖Ax‖/‖x‖).

Now the definition of a matrix norm derived from a vector norm is

‖A‖ = max_{v≠0} ‖Av‖/‖v‖.

We see that the calculation of the condition number involves the definition of the matrix norm. Thus the condition number of an invertible matrix is

K(A) = ‖A^(−1)‖ × ‖A‖.




Of course, this definition depends on the choice of norm. We just give a


brief outline of different vector norms and the condition numbers induced by
these. For more information, see [7] J. W. Demmel Applied Numerical Linear
Algebra.

1. If ‖ · ‖ is the standard 2-norm, also known as the Euclidean norm, defined as

   ‖x‖2 := (Σ_{i=1}^n xi²)^(1/2),

   then the induced matrix norm is

   ‖A‖2 = σmax(A),

   and

   K2(A) = σmax(A)/σmin(A),
where σmax (A) and σmin (A) are the maximal and minimal singular val-
ues of A, respectively. (The singular values of a matrix A are the square
roots of the eigenvalues of the matrix A∗ A where A∗ denotes the com-
plex conjugate transpose of A.) In particular A is called a normal matrix
if A∗ A = AA∗ . In this case

K2(A) = λmax(A)/λmin(A),

where λmax (A) and λmin (A) are the maximum and minimum modulus
eigenvalues of A. If A is unitary, i.e., multiplying the matrix with its
conjugate transpose results in the identity matrix, then K2 (A) = 1.
2. If ‖ · ‖ is the ∞-norm defined as

   ‖x‖∞ := max_{i=1,...,n} |xi|,

   then the induced matrix norm is

   ‖A‖∞ = max_{i=1,...,n} Σ_{j=1}^n |Aij|,

which is the maximum absolute row sum of the matrix.


Exercise 1.6. If A is lower triangular and non-singular, and using the ∞-
norm, show that
K∞(A) ≥ max_{i=1,...,n} |Aii| / min_{i=1,...,n} |Aii|.

As an example we consider the matrix equation

( 404   60 ) ( x1 )   ( b1 )
(  60    4 ) ( x2 ) = ( b2 ).

The matrix has eigenvalues 204 ± 20√109, that is λ1 ≈ 412.8 and λ2 ≈ −4.8.
Recall that the condition number is defined as the maximum ratio of the
relative change in x over the relative change e in b:
−1 −1
A e kek A e kAxk
K(A) = max / = max × max .
x,e6=0 kxk kAxk e6=0 kek x6=0 kxk

Now assuming the error e on the right-hand side is aligned with the eigenvector
which belongs to the smaller eigenvalue, then this is multiplied by a factor of
|1/λ2 | = 1/4.8. On the other hand a small error in the solution which is
aligned with the eigenvector belonging to the larger eigenvalue takes us away
from the right-hand side b by a factor of 412.8. This is the worst-case scenario
and gives
K(A) = |λ1/λ2| ≈ 412.8/4.8.
Exercise 1.7. Let

A = ( 1000  999 )
    (  999  998 ).

Calculate A^(−1), the eigenvalues and eigenvectors of A, K2(A), and K∞(A). What is special about the vectors (1, 1)^T and (−1, 1)^T?

An example of notoriously ill-conditioned matrices are the Hilbert matri-


ces. These are square matrices with entries

Hi,j = 1/(i + j − 1).

If n is the size of the matrix, then the condition number is of order O((1 + √2)^(4n)/√n).
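Their rapid deterioration can be seen with MATLAB's built-in functions hilb and cond (used here only to inspect the matrices, in line with the policy of using built-ins as building blocks); a sketch:

% Growth of the 2-norm condition number of the Hilbert matrices.
for n = 4:4:16
    H = hilb(n);                      % H(i,j) = 1/(i+j-1)
    fprintf('n = %2d   K_2(H) = %.3e\n', n, cond(H));
end
% Once K_2(H) exceeds 1/eps (around n = 12), solving H*x = b in double
% precision cannot be expected to give any correct digits.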

1.10 Revision Exercises


Exercise 1.8. (a) Define overflow and underflow.
(b) Describe loss of significance and give an example.
(c) Show that using a floating-point format with base β and p digits in the
significand and computing differences using p digits, the relative error of
the result can be as large as β − 1.
(d) Describe (without proof ) how computing differences can be improved.

(e) Consider the quadratic expression $Q(x) = ax^2 + bx + c$ in which a, b, c
and x are all represented with the same relative error δ. In computing bx
and $ax^2$, estimate the relative error, and hence the absolute error of both
expressions. Hence deduce an estimate for the absolute error in computing
Q(x).
(f) Comment on the suitability of the formula
\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]
for computing the roots of Q(x) in floating point arithmetic. Derive an
alternative formula and describe how it can be used in practice.

Exercise 1.9. (a) Define absolute error, relative error, and state their rela-
tionship.
(b) Show how the relative error builds up in multiplication and division.
(c) Explain forward and backward error analysis using the example of approximating
$\cos x \approx f(x) = 1 - x^2/2$.
(d) Considering the binary floating point representation of numbers, explain
the concept of the hidden bit.
(e) Explain the biased representation of the exponent in binary floating point
representation.
(f) How are 0, ∞, and NaN represented?
(g) How are the numbers $2^k$ for positive and negative k represented?
Exercise 1.10. (a) Define absolute error, relative error, machine epsilon,
and unit round-off.
(b) Although machine epsilon is defined in terms of absolute error, which
assumption makes it useful as a measurement of relative error?
(c) What does it mean if a floating point number is said to be normalized?
What is the hidden bit and how is it used?
(d) What does NaN stand for? Give an example of an operation which yields
an NaN value.

(e) Given emax = 127 and emin = −126, one bit for the sign and 23 bits
for the significand, show the bit pattern representing each of the following

numbers. State the sign, the exponent, and the significand. You may use
0 . . . 0 to represent a long string of zeros.
0
−∞
−1.0
1.0 + machine epsilon
4.0
4.0 + machine epsilon
NaN
$x^*$, the smallest representable number greater than $2^{16}$
In the last case, what is the value of the least significant bit in the significand of $x^*$ and what is the relative error if rounding errors cause $x = 2^{16}$
to become $x^*$?
Exercise 1.11. (a) Define absolute error, relative error, and state their re-
lationship.
(b) Explain absolute error test and relative error test and give examples of
circumstances when they are unsuitable. What is a mixed error test?
(c) Explain loss of significance.
(d) Let $x_1 = 3.0001$ be the true value approximated by $x_1^* = 3.0001 + 10^{-5}$ and
$x_2 = -3.0000$ be the true value approximated by $x_2^* = -3.0000 + 10^{-5}$.
State the absolute and relative errors in $x_1^*$ and $x_2^*$. Calculate the absolute
error and relative error in approximating $x_1 + x_2$ by $x_1^* + x_2^*$. How many
times bigger is the relative error in the sum compared to the relative error
in $x_1^*$ and $x_2^*$?
(e) Let
\[ f(x) = x - \sqrt{x^2 + 1}, \qquad x \geq 0. \tag{1.10} \]
Explain when and why loss of significance occurs in the evaluation of f .
(f ) Derive an alternative formula for evaluating f which avoids loss of sig-
nificance.
(g) Test your alternative by considering a decimal precision p = 16 and
$x = 10^8$. What answer does your alternative formula give compared to
the original formula?
(h) Explain condition and condition number in general terms.
(i) Derive the condition number for evaluating a differentiable function f at
a point x, i.e., calculating f (x).
(j) Considering f (x) as defined in (1.10), find the smallest interval in
which the condition number lies. Is the problem well-conditioned or ill-
conditioned?
CHAPTER 2

Linear Systems

2.1 Simultaneous Linear Equations


Here we consider the solution of simultaneous linear equations of the form

Ax = b, (2.1)

where A is a matrix of coefficients, b is a given vector, and x is the vector of
unknowns to be determined. In the first instance, we assume A is square with
n rows and n columns, and $x, b \in \mathbb{R}^n$. At least one element of b is non-zero.
The equations have a unique solution if and only if A is a non-singular matrix,
i.e., the inverse $A^{-1}$ exists. The solution is then $x = A^{-1}b$. There is no need to
calculate $A^{-1}$ explicitly, since only the vector $A^{-1}b$ is required and forming $A^{-1}$
would merely be an intermediate step. The calculation of a matrix
inverse is usually avoided unless the elements of the inverse itself are required
for other purposes, since it can lead to unnecessary loss of accuracy.
If A is singular, there exist non-zero vectors v such that

Av = 0.

These vectors lie in the null space of A. That is the space of all vectors mapped
to zero when multiplied by A. If x is a solution of (2.1) then so is x + v, since

A(x + v) = Ax + Av = b + 0 = b.

In this case there are infinitely many solutions.


The result of A applied to all vectors in Rn is called the image of A. If b
does not lie in the image of A, then no vector x satisfies (2.1) and there is no
solution. In this case the equations are inconsistent. This situation can also
occur when A is singular.
The solution of Equation (2.1) is trivial if the matrix A is either lower

triangular or upper triangular, i.e.,
\[
\begin{pmatrix}
a_{1,1} & 0 & \cdots & 0 \\
\vdots & \ddots & \ddots & \vdots \\
a_{n-1,1} & \cdots & a_{n-1,n-1} & 0 \\
a_{n,1} & \cdots & \cdots & a_{n,n}
\end{pmatrix}
\quad \text{or} \quad
\begin{pmatrix}
a_{1,1} & \cdots & \cdots & a_{1,n} \\
0 & a_{2,2} & \cdots & a_{2,n} \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & a_{n,n}
\end{pmatrix}.
\]

If any of the diagonal elements $a_{i,i}$ is zero, then A is singular and there is no
unique solution. There might be no solution or infinitely many.
Considering the upper triangular form, the solution is obtained by the
back substitution algorithm. The last equation contains only one unknown $x_n$,
which can be calculated by
\[ x_n = \frac{b_n}{a_{n,n}}. \]
Having determined $x_n$, the second to last equation contains only the unknown
$x_{n-1}$, which can then be calculated and so on. The algorithm can be summarized as
\[ x_i = \frac{b_i - \sum_{j=i+1}^{n} a_{i,j} x_j}{a_{i,i}}, \qquad i = n, n-1, \dots, 1. \]
For the lower triangular form the forward substitution algorithm of similar
form is used. Here is an implementation in MATLAB.

function [x] = Forward(A,b)


% Solves the lower triangular system of equations Ax = b
% A input argument, square lower triangular matrix
% b input argument
% x solution

[n,m]=size(A); % finding the size of A


if n ~= m
error('input is not a square matrix');
end
if size(b,1) ~= n
error('input dimensions do not match');
end
x = zeros(n,1); % initialise x to the same dimension
if abs(A(1,1)) > 1e-12 % not comparing to zero because of possible
% rounding errors
x(1) = b(1)/A(1,1); % solve for the first element of x
else
disp('input singular'); % A is singular if any of the diagonal
% elements are zero
return;
end
for k=2:n % the loop considers one row after the other
if abs(A(k,k))>1e-12 % not comparing to zero because of possible
% rounding errors
temp = 0;
for j=1:k-1
temp = temp + A(k,j) * x(j); % Multiply the elements of
% the k-th row of A before the
% diagonal by the elements of x
% already calculated
end
x(k) = (b(k)-temp)/A(k,k); % solve for the k-th element of x
else
error('input singular'); % A is singular if any of the
% diagonal elements are zero
end
end
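As a quick illustration of how the function is called (the matrix and right-hand side below are made up for this purpose and are not from the text):

A = [2 0 0; 1 3 0; 4 -1 5];
b = [2; 7; 11];
x = Forward(A,b)   % returns [1; 2; 1.8], so that A*x reproduces b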

Exercise 2.1. Implement backward substitution.

2.2 Gaussian Elimination and Pivoting


Given a set of simultaneous equations Ax = b, the solution x is unchanged if
any of the following operations is performed:
1. Multiplication of an equation by a non-zero constant.

2. Addition of the multiple of one equation to another.


3. Interchange of two equations.
The same operations have to be performed on both sides of the equal sign.
These operations are used to convert the system of equations to the trivial
case, i.e., upper or lower triangular form. This is called Gaussian elimination.
By its nature there are many different ways to go about this. The usual
strategy is called a pivotal strategy, and we see below that this in general avoids
the accumulation of errors and in some situations is crucially important.
A pivot entry is usually required to be at least distinct from zero and
often well away from it. Finding this element is called pivoting. Once the pivot
element is found, interchange of rows (and possibly columns) may follow to
bring the pivot element into a certain position. Pivoting can be viewed as
sorting rows (and possibly columns) in a matrix. The swapping of rows is
equivalent to multiplying A by a permutation matrix. In practice the matrix
elements are, however, rarely moved, since this would cost too much time.
Instead the algorithms keep track of the permutations. Pivoting increases the
overall computational cost of an algorithm. However, sometimes pivoting is
necessary for the algorithm to work at all, at other times it increases the
numerical stability. We illustrate this with two examples.
Consider the three simultaneous equations where the diagonal of the matrix consists entirely of zeros,
\[
\begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 1 \\ 2 \\ 4 \end{pmatrix},
\]

which we convert into upper triangular form to use back substitution. The
first equation cannot form the first row of the upper triangle because its first
coefficient is zero. Both the second and third row are suitable since their first
element is 1. However, we also need a non-zero element in the (2, 2) position
and therefore the first step is to exchange the first two equations, hence
\[
\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 2 \\ 1 \\ 4 \end{pmatrix}.
\]

After subtracting the new first and the second equation from the third, we
arrive at
\[
\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & -2 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix}.
\]
Back substitution then gives $x_1 = \frac{5}{2}$, $x_2 = \frac{3}{2}$, and $x_3 = -\frac{1}{2}$.
This trivial example shows that a subroutine to deal with the general case
is much less straightforward, since any number of coefficients may be zero at
any step.
The second example shows that not only zero coefficients cause problems.
Consider the pair of equations
\[
\begin{pmatrix} 0.0002 & 1.044 \\ 0.2302 & -1.624 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =
\begin{pmatrix} 1.046 \\ 0.678 \end{pmatrix}.
\]

The exact solution is x1 = 10, x2 = 1. However, if we assume the accuracy


is restricted to four digits in every operation and multiply the first equation
by 0.2302/0.0002 = 1151 and subtract from the second, the last equation
becomes
−1204x2 = −1203,
which gives the solution x2 = 0.9992. Using this number to solve for x1 in the
first equation gives the answer x1 = 14.18, which is quite removed from the
true solution. A small non-zero number as pivotal value was not suitable. In
fact, the problem is that 0.0002 is small compared to 1.044.
A successful pivotal strategy requires the comparison of the relative size of
a coefficient to the other coefficients in the same equation. We can calculate
the relative size by dividing each coefficient by the largest absolute value in
the row. This is known as scaled pivoting.
To summarize, suppose an n × n set of linear equations is solved by Gaus-
sian elimination. At the first step there are n possible equations and one is
chosen as pivot and moved to the first row. Then zeros are introduced in the
first column below the first row. This leaves at the second step n − 1 equations
to be transformed. At the start of the k th step there are n − k + 1 equations
remaining. From these a pivot is selected. There are different ways to go about
this. In partial pivoting these n − k + 1 equations are scaled by dividing by

the modulus of the largest coefficient of each. Then the pivotal equation is
chosen as the one with the largest (scaled) coefficient of xk and moved into
the k th row. In total (or complete or maximal ) pivoting the pivotal equation
and pivotal variable are selected by choosing the largest (unscaled) coefficient
of any of the remaining variables. This is moved into the (k, k) position. This
can involve the exchange of columns as well as rows. If columns are exchanged
the order of the unknowns needs to be adjusted accordingly. Total pivoting
is more expensive, but for certain systems it may be required for acceptable
accuracy.

2.3 LU Factorization
Another possibility to solve a linear system is to factorize A into a lower
triangular matrix L (i.e., Li,j = 0 for i < j) and a upper triangular matrix U
(i.e., Ui,j = 0 for i > j), that is, A = LU . This is called LU factorization. The
linear system then becomes L(U x) = b, which we decompose into Ly = b
and U x = y. Both these systems can be solved easily by back substitution.
Other applications of the LU factorization are
1. Calculation of the determinant:
\[ \det A = (\det L)(\det U) = \Big( \prod_{k=1}^{n} L_{k,k} \Big) \Big( \prod_{k=1}^{n} U_{k,k} \Big). \]

2. Non-singularity testing: A = LU is non-singular if and only if all the


diagonal elements of L and U are nonzero.
3. Calculating the inverse: The inverse of triangular matrices can be easily
calculated directly. Subsequently A−1 = U −1 L−1 .

In the following we derive how to obtain the LU factorization. We denote
the columns of L by $l_1, \dots, l_n$ and the rows of U by $u_1^T, \dots, u_n^T$. Thus
\[ A = LU = (l_1 \dots l_n) \begin{pmatrix} u_1^T \\ \vdots \\ u_n^T \end{pmatrix} = \sum_{k=1}^{n} l_k u_k^T. \]

Assume that A is nonsingular and that the factorization exists. Hence


the diagonal elements of L are non-zero. Since lk uTk stays the same if lk is
replaced by αlk and uk by α−1 uk , where α 6= 0, we can assume that all
diagonal elements of L are equal to 1.
Since the first k − 1 components of lk and uk are zero, each matrix lk uTk
has zeros in the first k − 1 rows and columns. It follows that uT1 is the first
row of A and $l_1$ is the first column of A divided by $A_{1,1}$ so that $L_{1,1} = 1$.
Having found $l_1$ and $u_1$, we form the matrix $A_1 = A - l_1 u_1^T = \sum_{k=2}^{n} l_k u_k^T$.

The first row and column of A1 are zero and it follows that uT2 is the second
row of A1 , while l2 is the second column of A1 scaled so that L2,2 = 1.
To formalize the LU algorithm, set A0 := A. For all k = 1, . . . , n set uTk to
the k th row of Ak−1 and lk to the k th column of Ak−1 , scaled so that Lk,k = 1.
Further calculate Ak := Ak−1 − lk uTk before incrementing k by 1.
Exercise 2.2. Calculate the LU factorization of the matrix
\[ A = \begin{pmatrix} 8 & 6 & -2 & 1 \\ 8 & 8 & -3 & 0 \\ -2 & 2 & -2 & 1 \\ 4 & 3 & -2 & 5 \end{pmatrix}, \]
where all the diagonal elements of L are 1. Choose one of these factorizations
to find the solution to Ax = b for $b^T = (-2\ 0\ 2\ -1)$.
All elements of the first k rows and columns of Ak are zero. Therefore we
can use the storage of the original A to accumulate L and U . The full LU
accumulation requires O(n3 ) operations.
Looking closer at the equation Ak = Ak−1 − lk uTk , we see that the j th row
of Ak is the j th row of Ak−1 minus Lj,k times uTk which is the k th row of Ak−1 .
This is an elementary row operation. Thus calculating Ak = Ak−1 − lk uTk is
equivalent to performing n−k row operations on the last n−k rows. Moreover,
the elements of lk which are the multipliers Lk,k , Lk+1,k , . . . , Ln,k are chosen
so that the k th column of Ak is zero. Hence we see that the LU factorization
is analogous to Gaussian elimination for solving Ax = b. The main difference
however is that the LU factorization does not consider b until the end. This
is particularly useful when there are many different vectors b, some of which
might not be known at the outset. For each different b, Gaussian elimination
would require $O(n^3)$ operations, whereas with LU factorization $O(n^3)$ operations are necessary for the initial factorization, but then the solution for each
b requires only $O(n^2)$ operations.
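For example, once the factors are available, each right-hand side costs only two triangular solves. A hedged sketch (b1 and b2 are hypothetical right-hand sides; MATLAB's backslash is used here purely as a triangular solver on L and U):

y = L\b1;  x1 = U\y;   % first right-hand side
y = L\b2;  x2 = U\y;   % further right-hand sides reuse L and U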
The algorithm can be rewritten in terms of the elements of A, L, and U .
Since
\[ A_{i,j} = \Big( \sum_{a=1}^{n} l_a u_a^T \Big)_{i,j} = \sum_{a=1}^{n} L_{i,a} U_{a,j}, \]
and since $U_{a,j} = 0$ for $a > j$ and $L_{i,a} = 0$ for $a > i$, we have
\[ A_{i,j} = \sum_{a=1}^{\min(i,j)} L_{i,a} U_{a,j}. \]

At the k-th step the elements $L_{i,a}$ are known for $a < i$ and the elements $U_{a,j}$
are known for $a < j$. For $i = j$,
\[ A_{i,i} = \sum_{a=1}^{i} L_{i,a} U_{a,i}, \]

and since $L_{i,i} = 1$, we can solve for
\[ U_{i,i} = A_{i,i} - \sum_{a=1}^{i-1} L_{i,a} U_{a,i}. \]

If Ui,i = 0, then U has a zero on the diagonal and thus is singular, since U
is upper triangular. The matrix U inherits the rank of A, while L is always
non-singular, since its diagonal consists entirely of 1.
For $j > i$, $A_{i,j}$ lies above the diagonal and
\[ A_{i,j} = \sum_{a=1}^{i} L_{i,a} U_{a,j} = \sum_{a=1}^{i-1} L_{i,a} U_{a,j} + U_{i,j} \]
and we can solve for
\[ U_{i,j} = A_{i,j} - \sum_{a=1}^{i-1} L_{i,a} U_{a,j}. \]

Similarly, for $j < i$, $A_{i,j}$ lies to the left of the diagonal and
\[ A_{i,j} = \sum_{a=1}^{j} L_{i,a} U_{a,j} = \sum_{a=1}^{j-1} L_{i,a} U_{a,j} + L_{i,j} U_{j,j}, \]
which gives
\[ L_{i,j} = \frac{A_{i,j} - \sum_{a=1}^{j-1} L_{i,a} U_{a,j}}{U_{j,j}}. \]
Note that the last formula is only valid when Uj,j is non-zero. That means
when A is non-singular. If A is singular, other strategies are necessary, such
as pivoting, which is described below.
We have already seen the equivalence of LU factorization and Gaussian
elimination. Therefore the concept of pivoting also exists for LU factorization.
It is necessary for such cases as when, for example, A1,1 = 0. In this case
we need to exchange rows of A to be able to proceed with the factorization.
Specifically, pivoting means that having obtained Ak−1 , we exchange two rows
of Ak−1 so that the element of largest magnitude in the k th column is in the
pivotal position (k, k), i.e.,
\[ |(A_{k-1})_{k,k}| \geq |(A_{k-1})_{j,k}|, \qquad j = 1, \dots, n. \]

Since the exchange of rows can be regarded as the pre-multiplication of the
relevant matrix by a permutation matrix, we need to do the same exchange in
the portion of L that has been formed already (i.e., the first k − 1 columns):
\[ A^{\mathrm{new}}_{k-1} = P A_{k-1} = P A - \sum_{j=1}^{k-1} P\, l_j u_j^T = P A - \sum_{j=1}^{k-1} (P l_j)\, u_j^T. \]

We also need to record the permutations of rows to solve for b.


Pivoting copes with zeros in the pivot position unless the entire k th column
of Ak−1 is zero. In this particular case we let lk be the k th unit vector while uTk
is the k th row of Ak−1 as before. With this choice we retain that the matrix
lk uTk has the same k th row and column as Ak−1 .
An important advantage of pivoting is that |Li,j | ≤ 1 for all i, j = 1, . . . , n.
This avoids the chance of large numbers occurring that might lead to ill con-
ditioning and to accumulation of round-off error.
Exercise 2.3. By using pivoting if necessary an LU factorization is calculated
of an n × n matrix A, where L has ones on the diagonal and the moduli of
all off-diagonal elements do not exceed 1. Let α be the largest modulus of the
elements of A. Prove by induction that the elements of U satisfy $|U_{i,j}| \leq 2^{i-1}\alpha$.
Construct 2 × 2 and 3 × 3 nonzero matrices A that give $|U_{2,2}| = 2\alpha$ and
$|U_{3,3}| = 4\alpha$.
Depending on the architecture different formulations of the algorithm are
easier to implement. For example MATLAB and its support for matrix calcu-
lations and manipulations lends itself to the first formulation, as the following
example shows.

function [L,U]=LU(A)
% Computes the LU factorisation
% A input argument, square matrix
% L square matrix of the same dimensions as A containing the lower
% triangular portion of the LU factorisation
% U square matrix of the same dimensions as A containing the upper
% triangular portion of the LU factorisation

[n,m]=size(A); % finding the size of A


if n ~= m
error('input is not a square matrix');
end
L=eye(n); % initialising L to the identity matrix
U=A; % initialising U to be A
for k=1:n % loop calculates one column of L and one row of U at a time
% Note that U holds in its lower portion a modified portion of A
for j=k+1:n
% if U(k,k) = 0, do nothing, because L is already initialised
% to the identity matrix and thus the k−th column is the k−th
% standard basis vector
if abs(U(k,k)) > 1e-12 % not comparing to zero because of
% possible rounding errors
L(j,k)=U(j,k)/U(k,k); % let the k−th column of L be the k−th
% column of the current U scaled by
% the diagonal element
end
U(j,:)=U(j,:)-L(j,k)*U(k,:); % adjust U by subtracting the
% outer product of the k-th
% column of L and the k-th row
% of U
end
end
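A small check of the listing (assuming it is saved as LU.m; the matrix is made up and chosen so that no pivoting is needed):

A = [4 3; 6 3];
[L,U] = LU(A);     % L = [1 0; 1.5 1], U = [4 3; 0 -1.5]
norm(A - L*U)      % of the order of rounding error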

Exercise 2.4. Implement the LU factorization with pivoting.


It is often required to solve very, very large systems of equations Ax = b
where nearly all of the elements of A are zero (for example, arising in the
solution of partial differential equations). A system of the size n = 105 would
be a small system in this context. Such a matrix is called sparse and this
sparsity should be exploited for an efficient solution. In particular, we wish
the matrices L and U to inherit as much as possible of the sparsity of A and for
the cost of computation to be determined by the number of nonzero entries,
rather than by n.
Exercise 2.5. Let A be a real nonsingular n × n matrix that has the factor-
ization A = LU , where L is lower triangular with ones on its diagonal and U
is upper triangular. Show that for k = 1, . . . , n the first k rows of U span the
same subspace as the first k rows of A. Show also that the first k columns of
A are in the k-dimensional subspace spanned by the first k columns of L.
From the above exercise, this useful theorem follows.
Theorem 2.1. Let A = LU be an LU factorization (without pivoting) of
a sparse matrix. Then all leading zeros in the rows of A to the left of the
diagonal are inherited by L and all the leading zeros in the columns of A
above the diagonal are inherited by U.
Therefore we should use the freedom to exchange rows and columns in
a preliminary calculation so that many of the zero elements are leading zero
elements in rows and columns. This will minimize fill-in. We illustrate this with
the following example where we first calculate the LU factorization without
exchange of rows and columns.
\[
\begin{pmatrix}
-3 & 1 & 1 & 2 & 0 \\
1 & -3 & 0 & 0 & 1 \\
1 & 0 & 2 & 0 & 0 \\
2 & 0 & 0 & 3 & 0 \\
0 & 1 & 0 & 0 & 3
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
-\frac{1}{3} & 1 & 0 & 0 & 0 \\
-\frac{1}{3} & -\frac{1}{8} & 1 & 0 & 0 \\
-\frac{2}{3} & -\frac{1}{4} & \frac{6}{19} & 1 & 0 \\
0 & -\frac{3}{8} & \frac{1}{19} & \frac{4}{81} & 1
\end{pmatrix}
\begin{pmatrix}
-3 & 1 & 1 & 2 & 0 \\
0 & -\frac{8}{3} & \frac{1}{3} & \frac{2}{3} & 1 \\
0 & 0 & \frac{19}{8} & \frac{3}{4} & \frac{1}{8} \\
0 & 0 & 0 & \frac{81}{19} & \frac{4}{19} \\
0 & 0 & 0 & 0 & \frac{272}{81}
\end{pmatrix}.
\]

We see that the fill-in is significant. If, however, we symmetrically exchange
rows and columns, swapping the first and third, second and fourth, and fourth
and fifth, we get
\[
\begin{pmatrix}
2 & 0 & 1 & 0 & 0 \\
0 & 3 & 2 & 0 & 0 \\
1 & 2 & -3 & 0 & 1 \\
0 & 0 & 0 & 3 & 1 \\
0 & 0 & 1 & 1 & -3
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
\frac{1}{2} & \frac{2}{3} & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & -\frac{6}{29} & \frac{1}{3} & 1
\end{pmatrix}
\begin{pmatrix}
2 & 0 & 1 & 0 & 0 \\
0 & 3 & 2 & 0 & 0 \\
0 & 0 & -\frac{29}{6} & 0 & 1 \\
0 & 0 & 0 & 3 & 1 \\
0 & 0 & 0 & 0 & -\frac{272}{87}
\end{pmatrix}.
\]

There has been much research on how best to exploit sparsity with the
help of graph theory. However, this is beyond this course. The above theorem
can also be applied to banded matrices.
Definition 2.1. The matrix A is a banded matrix if there exists an integer
r < n such that Ai,j = 0 for |i − j| > r, i, j = 1, . . . , n. In other words, all
the nonzero elements of A reside in a band of width 2r + 1 along the main
diagonal.
For banded matrices, the factorization A = LU implies that Li,j = Ui,j = 0
for |i − j| > r and the banded structure is inherited by both L and U .
In general, the expense of calculating an LU factorization of an n × n
dense matrix A is $O(n^3)$ operations and the expense of solving Ax = b using
that factorization is $O(n^2)$. However, in the case of a banded matrix, we need
just $O(r^2 n)$ operations to factorize and $O(rn)$ operations to solve the linear
system. If r is a lot smaller than n, then this is a substantial saving.

2.4 Cholesky Factorization


Let A be an n × n symmetric matrix, i.e., Ai,j = Aj,i . We can take advantage
of the symmetry by expressing A in the form of A = LDLT where L is lower
triangular with ones on its diagonal and D is a diagonal matrix. More explic-
itly, we can write the factorization which is known as Cholesky factorization
as
\[
A = (l_1 \dots l_n)
\begin{pmatrix}
D_{1,1} & 0 & \cdots & 0 \\
0 & D_{2,2} & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & D_{n,n}
\end{pmatrix}
\begin{pmatrix}
l_1^T \\ l_2^T \\ \vdots \\ l_n^T
\end{pmatrix}
= \sum_{k=1}^{n} D_{k,k}\, l_k l_k^T.
\]

Again lk denotes the k th column of L. The analogy to the LU algorithm is ob-


vious when letting U = DLT . However, this algorithm exploits the symmetry
and requires roughly half the storage. To be more specific, we let A0 = A at
the beginning and for k = 1, . . . , n we let lk be the k th column of Ak−1 scaled
such that Lk,k = 1. Set Dk,k = (Ak−1 )k,k and calculate Ak = Ak−1 −Dk,k lk lTk .
An example of such a factorization is
\[
\begin{pmatrix} 4 & 1 \\ 1 & 4 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ \frac{1}{4} & 1 \end{pmatrix}
\begin{pmatrix} 4 & 0 \\ 0 & \frac{15}{4} \end{pmatrix}
\begin{pmatrix} 1 & \frac{1}{4} \\ 0 & 1 \end{pmatrix}.
\]
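A minimal MATLAB sketch of this procedure is given below; it assumes A is symmetric positive definite so that no pivot vanishes, and the function name is our own choice, not from the text:

function [L,D] = LDLfact(A)
% Sketch of the LDL^T factorization loop described above
n = size(A,1);
L = eye(n); D = zeros(n);
for k = 1:n
    D(k,k) = A(k,k);                 % current pivot
    L(:,k) = A(:,k)/A(k,k);          % k-th column scaled so that L(k,k) = 1
    A = A - D(k,k)*L(:,k)*L(:,k)';   % subtract the rank-one term
end

Applied to the 2 x 2 example above it reproduces the factors shown.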

Recall that A is positive definite if $x^T A x > 0$ for all $x \neq 0$.


Theorem 2.2. Let A be a real n × n symmetric matrix. It is positive definite
if and only if an LDLT factorization exists where the diagonal elements of D
are all positive.

Proof. One direction is straightforward. Suppose $A = LDL^T$ with $D_{i,i} > 0$
for $i = 1, \dots, n$ and let $x \neq 0 \in \mathbb{R}^n$. Since L is nonsingular (ones on the
diagonal), $y := L^T x$ is nonzero. Hence $x^T A x = y^T D y = \sum_{i=1}^{n} D_{i,i} y_i^2 > 0$
and A is positive definite.
For the other direction, suppose A is positive definite. Our aim is to show
that an $LDL^T$ factorization exists. Let $e_k \in \mathbb{R}^n$ denote the k-th unit vector.
Firstly, $e_1^T A e_1 = A_{1,1} > 0$ and $l_1$ and $D_{1,1}$ are well-defined. In the following
we show that $(A_{k-1})_{k,k} > 0$ for $k = 1, \dots, n$ and that hence $l_k$ and $D_{k,k}$ are
well-defined and the factorization exists. We have already seen that this is true
for $k = 1$. We continue by induction, assuming that $A_{k-1} = A - \sum_{i=1}^{k-1} D_{i,i} l_i l_i^T$
has been computed successfully. Let $x \in \mathbb{R}^n$ be such that $x_{k+1} = x_{k+2} = \cdots = x_n = 0$, $x_k = 1$ and $x_j = -\sum_{i=j+1}^{k} L_{i,j} x_i$ for $j = k-1, k-2, \dots, 1$. With this
choice of x we have
\[ l_j^T x = \sum_{i=1}^{n} L_{i,j} x_i = \sum_{i=j}^{k} L_{i,j} x_i = x_j + \sum_{i=j+1}^{k} L_{i,j} x_i = 0, \qquad j = 1, \dots, k-1. \]

Since the first k − 1 rows and columns of $A_{k-1}$ and the last n − k components
of x vanish and $x_k = 1$, we have
\[
(A_{k-1})_{k,k} = x^T A_{k-1} x = x^T \Big( A - \sum_{j=1}^{k-1} D_{j,j} l_j l_j^T \Big) x
= x^T A x - \sum_{j=1}^{k-1} D_{j,j} (l_j^T x)^2 = x^T A x > 0.
\]

The conclusion of this theorem is that we can check whether a symmetric


matrix is positive definite by attempting to calculate its LDLT factorization.
Definition 2.2 (Cholesky factorization). Define $D^{1/2}$ as the diagonal matrix
where $(D^{1/2})_{k,k} = \sqrt{D_{k,k}}$. Hence $D^{1/2} D^{1/2} = D$. Then for positive definite
A, we can write
\[ A = (LD^{1/2})(D^{1/2} L^T) = (LD^{1/2})(LD^{1/2})^T. \]
Letting $\tilde{L} := LD^{1/2}$, $A = \tilde{L}\tilde{L}^T$ is known as the Cholesky factorization.
Exercise 2.6. Calculate the Cholesky factorization of the matrix
\[
\begin{pmatrix}
1 & 1 & 0 & \cdots & \cdots & 0 \\
1 & 2 & 1 & \ddots & & \vdots \\
0 & 1 & 2 & 1 & \ddots & \vdots \\
\vdots & \ddots & 1 & 3 & 1 & 0 \\
\vdots & & \ddots & 1 & 3 & 1 \\
0 & \cdots & \cdots & 0 & 1 & \lambda
\end{pmatrix}.
\]

Deduce from the factorization the value of λ which makes the matrix singular.

2.5 QR Factorization
In the following we examine another way to factorize a matrix. However, first
we need to recall a few concepts.
For all $x, y \in \mathbb{R}^n$, the scalar product is defined by
\[ \langle x, y \rangle = \langle y, x \rangle = \sum_{i=1}^{n} x_i y_i = x^T y = y^T x. \]

The scalar product is a linear operation, i.e., for $x, y, z \in \mathbb{R}^n$ and $\alpha, \beta \in \mathbb{R}$,
\[ \langle \alpha x + \beta y, z \rangle = \alpha \langle x, z \rangle + \beta \langle y, z \rangle. \]

The norm or Euclidean length of $x \in \mathbb{R}^n$ is defined as
\[ \|x\| = \Big( \sum_{i=1}^{n} x_i^2 \Big)^{1/2} = \langle x, x \rangle^{1/2} \geq 0. \]

The norm of x is zero if and only if x is the zero vector.


Two vectors $x, y \in \mathbb{R}^n$ are called orthogonal to each other if
\[ \langle x, y \rangle = 0. \]

Of course the zero vector is orthogonal to every vector including itself.


A set of vectors $q_1, \dots, q_m \in \mathbb{R}^n$ is called orthonormal if
\[ \langle q_k, q_l \rangle = \begin{cases} 1, & k = l, \\ 0, & k \neq l, \end{cases} \qquad k, l = 1, \dots, m. \]

Let Q = (q1 . . . qn ) be an n × n real matrix. It is called orthogonal if its


columns are orthonormal. It follows from (QT Q)k,l = hqk , ql i that QT Q = I
where I is the unit or identity matrix . Thus Q is nonsingular and the inverse
exists, Q−1 = QT . Furthermore, QQT = QQ−1 = I. Therefore the rows of an
orthogonal matrix are also orthonormal and QT is also an orthogonal matrix.
Further, 1 = det I = det(QQT ) = det Q det QT = (det Q)2 and we deduce
det Q = ±1.
Lemma 2.1. If P, Q are orthogonal, then so is P Q.
Proof. Since P T P = QT Q = I, we have

(P Q)T (P Q) = (QT P T )(P Q) = QT (P T P )Q = QT Q = I

and hence P Q is orthogonal.


We will require the following lemma to construct orthogonal matrices.

Lemma 2.2. Let q1 , . . . , qm ∈ Rn be orthonormal and m < n. Then there


exists qm+1 ∈ Rn such that q1 , . . . , qm+1 is orthonormal.
Proof. The proof is constructive. Let Q be the n × m matrix whose columns
are $q_1, \dots, q_m$. Since
\[ \sum_{i=1}^{n} \sum_{j=1}^{m} Q_{i,j}^2 = \sum_{j=1}^{m} \|q_j\|^2 = m < n, \]

we can find a row of Q where the sum of the squares is less than 1; in other
words there exists $l \in \{1, \dots, n\}$ such that $\sum_{j=1}^{m} Q_{l,j}^2 < 1$. Let $e_l$ denote the
l-th unit vector. Then $Q_{l,j} = \langle q_j, e_l \rangle$. Further, set $w = e_l - \sum_{j=1}^{m} \langle q_j, e_l \rangle q_j$.
The scalar product of w with each $q_i$ for $i = 1, \dots, m$ is then
\[ \langle q_i, w \rangle = \langle q_i, e_l \rangle - \sum_{j=1}^{m} \langle q_j, e_l \rangle \langle q_i, q_j \rangle = 0. \]

Thus w is orthogonal to $q_1, \dots, q_m$. Calculating the norm of w we get
\[
\|w\|^2 = \langle e_l, e_l \rangle - 2 \sum_{j=1}^{m} \langle q_j, e_l \rangle^2 + \sum_{j=1}^{m} \langle q_j, e_l \rangle \sum_{k=1}^{m} \langle q_k, e_l \rangle \langle q_j, q_k \rangle
= \langle e_l, e_l \rangle - \sum_{j=1}^{m} \langle q_j, e_l \rangle^2 = 1 - \sum_{j=1}^{m} Q_{l,j}^2 > 0.
\]
Thus w is nonzero and we define $q_{m+1} = w/\|w\|$.


Definition 2.3 (QR factorization). The QR factorization of an n × m matrix
A (n > m) has the form A = QR, where Q is an n × n orthogonal matrix and
R is an n × m upper triangular matrix (i.e., Ri,j = 0 for i > j). The matrix
R is said to be in the standard form if the number of leading zeros in each
row increases strictly monotonically until all the rows of R are zero.
Denoting the columns of A by $a_1, \dots, a_m \in \mathbb{R}^n$ and the columns of Q by
$q_1, \dots, q_n \in \mathbb{R}^n$, we can write the factorization as
\[
(a_1 \cdots a_m) = (q_1 \cdots q_n)
\begin{pmatrix}
R_{1,1} & R_{1,2} & \cdots & R_{1,m} \\
0 & R_{2,2} & & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & R_{m,m} \\
0 & \cdots & \cdots & 0 \\
\vdots & & & \vdots \\
0 & \cdots & \cdots & 0
\end{pmatrix}.
\]

We see that
\[ a_k = \sum_{i=1}^{k} R_{i,k} q_i, \qquad k = 1, \dots, m. \tag{2.2} \]

In other words, the k th column of A is a linear combination of the first k


columns of Q.
As with the factorizations we have encountered before, the QR factoriza-
tion can be used to solve Ax = b by factorizing first A = QR (note m = n
here). Then we solve Qy = b by y = QT b and Rx = y by back substitution.
In the following sections we will examine three algorithms to generate a
QR factorization.

2.6 The Gram–Schmidt Algorithm


The first algorithm follows the construction process in the proof of Lemma
2.2. Assuming $a_1$ is not the zero vector, we obtain $q_1$ and $R_{1,1}$ from Equation
(2.2) for $k = 1$. Since $q_1$ is required to have unit length we let $q_1 = a_1/\|a_1\|$
and $R_{1,1} = \|a_1\|$.
Next we form the vector $w = a_2 - \langle q_1, a_2 \rangle q_1$. It is by construction orthogonal to $q_1$, since we subtract the component in the direction of $q_1$. If $w \neq 0$, we
set $q_2 = w/\|w\|$. With this choice $q_1$ and $q_2$ are orthonormal. Furthermore,
\[ \langle q_1, a_2 \rangle q_1 + \|w\| q_2 = \langle q_1, a_2 \rangle q_1 + w = a_2. \]
Hence we let $R_{1,2} = \langle q_1, a_2 \rangle$ and $R_{2,2} = \|w\|$.


This idea can be extended to all columns of A. More specifically let A
be an n × m matrix which is not entirely zero. We have two counters, k is
the number of columns which have already be generated, j is the number of
columns of A which have already been considered. The individual steps are as
follows:

1. Set k := 0, j := 0.
2. Increase j by 1. If $k = 0$, set $w := a_j$, otherwise (i.e., when $k \geq 1$)
set $w := a_j - \sum_{i=1}^{k} \langle q_i, a_j \rangle q_i$. By this construction w is orthogonal to
$q_1, \dots, q_k$.
3. If $w = 0$, then $a_j$ lies in the space spanned by $q_1, \dots, q_k$ or is zero. If $a_j$
is zero, set the j-th column of R to zero. Otherwise, set $R_{i,j} := \langle q_i, a_j \rangle$
for $i = 1, \dots, k$ and $R_{i,j} = 0$ for $i = k+1, \dots, n$. Note that in this case
no new column of Q is constructed. If $w \neq 0$, increase k by one and
set $q_k := w/\|w\|$, $R_{i,j} := \langle q_i, a_j \rangle$ for $i = 1, \dots, k-1$, $R_{k,j} := \|w\|$ and
$R_{i,j} := 0$ for $i = k+1, \dots, n$. By this construction, each column of Q has
unit length and $a_j = \sum_{i=1}^{k} R_{i,j} q_i$ as required, and R is upper triangular,
since $k \leq j$.
4. Terminate if j = m, otherwise go to 2.

Since the columns of Q are orthonormal, there are at most n of them. In


other words k cannot exceed n. If it is less than n, then additional columns
can be chosen such that Q becomes an n × n orthogonal matrix.
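A minimal MATLAB sketch of these steps for the common case where the columns of A are linearly independent (so that w never vanishes and step 3 simplifies) might look as follows; the function name is our own:

function [Q,R] = GramSchmidt(A)
% Classical Gram-Schmidt as described above, full column rank assumed
[n,m] = size(A);
Q = zeros(n,m); R = zeros(m,m);
for j = 1:m
    w = A(:,j);
    for i = 1:j-1
        R(i,j) = Q(:,i)'*A(:,j);   % component of a_j along q_i
        w = w - R(i,j)*Q(:,i);     % subtract it off
    end
    R(j,j) = norm(w);
    Q(:,j) = w/R(j,j);             % normalize
end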

The following example illustrates the workings of the Gram–Schmidt algorithm when A is singular. Let
\[ A = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}. \]
The first column of A already has length 1, and thus no normalization is
necessary. It becomes the first column $q_1$ of Q, and we set $R_{1,1} = 1$ and
$R_{2,1} = R_{3,1} = 0$. Next we calculate
\[ w = a_2 - \langle a_2, q_1 \rangle q_1 = \begin{pmatrix} 2 \\ 0 \\ 0 \end{pmatrix} - 2 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = 0. \]
In this case, we set $R_{1,2} = 2$ and $R_{2,2} = R_{3,2} = 0$. No column of Q has been
generated. Next
\[ w = a_3 - \langle a_3, q_1 \rangle q_1 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} - \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}. \]
Now the second column of Q can be generated as $q_2 = w/\sqrt{2}$. We set $R_{1,3} = 1$,
$R_{2,3} = \sqrt{2}$ and $R_{3,3} = 0$. Since we have considered all columns of A, but Q
is not square yet, we need to pad it out. The vector $(0, 1/\sqrt{2}, -1/\sqrt{2})^T$ has
length 1 and is orthogonal to $q_1$ and $q_2$. We can check
\[
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1/\sqrt{2} & 1/\sqrt{2} \\ 0 & 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}
\begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & \sqrt{2} \\ 0 & 0 & 0 \end{pmatrix}
=
\begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}.
\]

Exercise 2.7. Let $a_1$, $a_2$ and $a_3$ denote the columns of the matrix
\[ A = \begin{pmatrix} 3 & 6 & -1 \\ -6 & -6 & 1 \\ 2 & 1 & -1 \end{pmatrix}. \]
Using the Gram–Schmidt procedure, generate orthonormal vectors $q_1$, $q_2$ and
$q_3$ and real numbers $R_{i,j}$ such that $a_i = \sum_{j=1}^{i} R_{i,j} q_j$, $i = 1, 2, 3$. Thus express
A as the product A = QR, where Q is orthogonal and R is upper triangular.
The disadvantage of this algorithm is that it is ill-conditioned. Small errors
in the calculation of inner products spread rapidly, which then can lead to loss
of orthogonality. Errors accumulate fast and the off-diagonal elements of QT Q
(which should be the identity matrix) may become large.
However, orthogonality conditions are retained well, if two given orthog-
onal matrices are multiplied. Therefore algorithms which compute Q as the
product of simple orthogonal matrices are very effective. In the following we
encounter one of these algorithms.

2.7 Givens Rotations


Given a real n × m matrix A, we let A0 = A and seek a sequence Ω1 , . . . , Ωk of
n × n orthogonal matrices such that the matrix Ai := Ωi Ai−1 has more zeros
below the diagonal than Ai−1 for i = 1, . . . , k. The insertion of zeros shall
be in such a way that Ak is upper triangular. We then set R = Ak . Hence
Ωk · · · Ω1 A = R and Q = (Ωk · · · Ω1 )−1 = (Ωk · · · Ω1 )T = ΩT1 · · · ΩTk . Therefore
A = QR and Q is orthogonal and R is upper triangular.
Definition 2.4 (Givens rotation). An n × n orthogonal matrix Ω is called a
Givens rotation, if it is the same as the identity matrix except for four elements
and we have det Ω = 1. Specifically we write Ω[p,q] , where 1 ≤ p < q ≤ n, for
a matrix such that
\[ \Omega^{[p,q]}_{p,p} = \Omega^{[p,q]}_{q,q} = \cos\theta, \qquad \Omega^{[p,q]}_{p,q} = \sin\theta, \qquad \Omega^{[p,q]}_{q,p} = -\sin\theta \]

for some θ ∈ [−π, π].


Letting n = 4 we have for example
\[
\Omega^{[1,2]} = \begin{pmatrix} \cos\theta & \sin\theta & 0 & 0 \\ -\sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},
\qquad
\Omega^{[2,4]} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & 0 & \sin\theta \\ 0 & 0 & 1 & 0 \\ 0 & -\sin\theta & 0 & \cos\theta \end{pmatrix}.
\]

Geometrically these matrices correspond to the underlying coordinate system


being rotated in a two-dimensional plane, which is called an Euler rotation in
mechanics. Orthogonality is easily verified using the identity $\cos^2\theta + \sin^2\theta = 1$.
Theorem 2.3. Let A be an n × m matrix. Then for every 1 ≤ p < q ≤ n,
we can choose indices i and j, where i is either p or q and j is allowed to
range over 1, . . . , m, such that there exists θ ∈ [−π, π] such that (Ω[p,q] A)i,j =
0. Moreover, all the rows of Ω[p,q] A, except for the pth and the q th , remain
unchanged, whereas the pth and the q th rows of Ω[p,q] A are linear combinations
of the pth and q th rows of A.
Proof. If $A_{p,j} = A_{q,j} = 0$, there are already zeros in the desired positions and
no action is needed. Let $i = q$ and set
\[ \cos\theta := A_{p,j} \Big/ \sqrt{A_{p,j}^2 + A_{q,j}^2}, \qquad \sin\theta := A_{q,j} \Big/ \sqrt{A_{p,j}^2 + A_{q,j}^2}. \]

Note that if Aq,j = 0, Ai,j is already zero, since i = q. In this case cos θ = 1
and sin θ = 0 and Ω[p,q] is the identity matrix. On the other hand, if Ap,j = 0,
then cos θ = 0 and sin θ = 1 and Ω[p,q] is the permutation matrix which swaps
the pth and q th rows to bring the already existing zero in the desired position.
Let Ap,j 6= 0 and Aq,j 6= 0. Considering the q th row of Ω[p,q] A, we see

(Ω[p,q] A)q,k = − sin θAp,k + cos θAq,k ,



for $k = 1, \dots, m$. It follows that for $k = j$
\[ (\Omega^{[p,q]} A)_{q,j} = \frac{-A_{q,j} A_{p,j} + A_{p,j} A_{q,j}}{\sqrt{A_{p,j}^2 + A_{q,j}^2}} = 0. \]
On the other hand, when $i = p$, we let
\[ \cos\theta := A_{q,j} \Big/ \sqrt{A_{p,j}^2 + A_{q,j}^2}, \qquad \sin\theta := -A_{p,j} \Big/ \sqrt{A_{p,j}^2 + A_{q,j}^2}. \]
Looking at the p-th row of $\Omega^{[p,q]} A$, we have
\[ (\Omega^{[p,q]} A)_{p,k} = \cos\theta\, A_{p,k} + \sin\theta\, A_{q,k}, \]
for $k = 1, \dots, m$. Therefore for $k = j$, $(\Omega^{[p,q]} A)_{p,j} = 0$.


Since Ω[p,q] equals the identity matrix apart from the (p, p), (p, q), (q, p)
and (q, q) entries it follows that only the pth and q th rows change and are
linear combinations of each other.
As an example we look at a 3 × 3 matrix
\[ A = \begin{pmatrix} 4 & -2\sqrt{2} & 5 \\ 3 & \sqrt{2} & 5 \\ 0 & 1 & \sqrt{2} \end{pmatrix}. \]
We first pick $\Omega^{[1,2]}$ such that $(\Omega^{[1,2]} A)_{2,1} = 0$. Therefore
\[ \Omega^{[1,2]} = \begin{pmatrix} \frac{4}{5} & \frac{3}{5} & 0 \\ -\frac{3}{5} & \frac{4}{5} & 0 \\ 0 & 0 & 1 \end{pmatrix}. \]
The resultant matrix is then
\[ \Omega^{[1,2]} A = \begin{pmatrix} 5 & -\sqrt{2} & 7 \\ 0 & 2\sqrt{2} & 1 \\ 0 & 1 & \sqrt{2} \end{pmatrix}. \]
Since $(\Omega^{[1,2]} A)_{3,1}$ is already zero, we do not need $\Omega^{[1,3]}$ to introduce a zero
there. Next we pick $\Omega^{[2,3]}$ such that $(\Omega^{[2,3]} \Omega^{[1,2]} A)_{3,2} = 0$. Thus
\[ \Omega^{[2,3]} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{2}{3}\sqrt{2} & \frac{1}{3} \\ 0 & -\frac{1}{3} & \frac{2}{3}\sqrt{2} \end{pmatrix}. \]
The final triangular matrix is then
\[ \Omega^{[2,3]} \Omega^{[1,2]} A = \begin{pmatrix} 5 & -\sqrt{2} & 7 \\ 0 & 3 & \sqrt{2} \\ 0 & 0 & 1 \end{pmatrix}. \]

Exercise 2.8. Calculate the QR factorization of the matrix in Exercise 2.7


by using three Givens rotations.
For a general n × m matrix A, let li be the number of leading zeros in the
ith row of A for i = 1, . . . , n. We increase the number of leading zeros until A
is upper triangular. The individual steps of the Givens algorithm are

1. Stop if the sequence of leading zeros l1 , . . . , ln is strictly monotone for


li ≤ m. That is, every row has at least one more leading zero than the
row above it until all entries are zero.
2. Pick any row indices 1 ≤ p < q ≤ n such that either lp > lq or lp =
lq < m. In the first case the pth row has more leading zeros than the q th
row while in the second case both rows have the same number of leading
zeros.
3. Replace A by Ω[p,q] A such that the (q, lq + 1) entry becomes zero. In the
case lp > lq , we let θ = ±π, such that the pth and q th row are effectively
swapped. In the case lp = lq this calculates a linear combination of the
pth and q th row.

4. Update the values of lp and lq and go to step 1.

The final matrix A is the required matrix R. Since the number of leading
zeros increases strictly monotonically until all rows of R are zero, R is in
standard form.
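A hedged MATLAB sketch of a QR factorization by Givens rotations is given below; for simplicity it zeros the subdiagonal entries column by column rather than maintaining the leading-zero counts described above, and the function name is our own:

function [Q,R] = GivensQR(A)
% QR factorization by Givens rotations (simple column-by-column ordering)
[n,m] = size(A);
Q = eye(n); R = A;
for j = 1:min(m,n)
    for i = j+1:n
        if abs(R(i,j)) > 1e-14          % nothing to do if already zero
            r = hypot(R(j,j), R(i,j));
            c = R(j,j)/r;  s = R(i,j)/r;
            G = [c s; -s c];            % rotation acting on rows j and i
            R([j i],:) = G*R([j i],:);  % introduces a zero in position (i,j)
            Q(:,[j i]) = Q(:,[j i])*G'; % accumulate Q so that Q*R stays equal to A
        end
    end
end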
The number of rotations needed is at most the number of elements below
the diagonal, which is
\[ (n-1) + (n-2) + \cdots + (n-m) = mn - \sum_{j=1}^{m} j = mn - \frac{m(m+1)}{2} = O(mn). \]
(For m = n this becomes $n(n-1)/2 = O(n^2)$.) Each rotation replaces two
rows by their linear combinations, which requires O(m) operations. Hence the
total cost is $O(m^2 n)$.
When solving a linear system, we multiply the right-hand side by the same
rotations. The cost for this is O(mn), since for each rotation two elements of
the vector are combined.
However, if the orthogonal matrix Q is required explicitly, we begin by
letting Ω be the m × m unit matrix. Each time A is pre-multiplied by Ω[p,q] ,
Ω is also pre-multiplied by the same rotation. Thus the final Ω is the product
of all rotations and we have Q = ΩT . The cost for obtaining Q explicitly is
O(m2 n), since in this case the rows have length m.
In the next section we encounter another class of orthogonal matrices.

2.8 Householder Reflections


Definition 2.5 (Householder reflections). Let $u \in \mathbb{R}^n$ be a non-zero vector.
The n × n matrix of the form
\[ I - 2 \frac{u u^T}{\|u\|^2} \]
is called a Householder reflection.
A Householder reflection describes a reflection about a hyperplane which
is orthogonal to the vector $u/\|u\|$ and which contains the origin. Each such
matrix is symmetric and orthogonal, since
\[
\Big( I - 2\frac{uu^T}{\|u\|^2} \Big)^T \Big( I - 2\frac{uu^T}{\|u\|^2} \Big) = \Big( I - 2\frac{uu^T}{\|u\|^2} \Big)^2
= I - 4\frac{uu^T}{\|u\|^2} + 4\frac{u(u^T u)u^T}{\|u\|^4} = I.
\]
We can use Householder reflections instead of Givens rotations to calculate a
QR factorization.
With each multiplication of an n×m matrix A by a Householder reflection
we want to introduce zeros under the diagonal in an entire column. To start
with we construct a reflection which transforms the first nonzero column a ∈
Rn of A into a multiple of the first unit vector e1 . In other words we want to
choose $u \in \mathbb{R}^n$ such that the last n − 1 entries of
\[ \Big( I - 2\frac{uu^T}{\|u\|^2} \Big) a = a - 2\frac{u^T a}{\|u\|^2} u \tag{2.3} \]
vanish. Since we are free to choose the length of u, we normalize it such that
$\|u\|^2 = 2 u^T a$, which is possible since $a \neq 0$. The right side of Equation (2.3)
then simplifies to $a - u$ and we have $u_i = a_i$ for $i = 2, \dots, n$. Using this we
can rewrite the normalization as
\[ 2 u_1 a_1 + 2 \sum_{i=2}^{n} a_i^2 = u_1^2 + \sum_{i=2}^{n} a_i^2. \]

Gathering the terms and extending the sum, we have
\[ u_1^2 - 2u_1 a_1 + a_1^2 - \sum_{i=1}^{n} a_i^2 = 0 \quad \Leftrightarrow \quad (u_1 - a_1)^2 = \sum_{i=1}^{n} a_i^2. \]
Thus $u_1 = a_1 \pm \|a\|$. In numerical applications it is usual to let the sign be
the same sign as $a_1$ to avoid $\|u\|$ becoming too small, since a division by a
very small number can lead to numerical difficulties.
Assume the first k − 1 columns have been transformed such that they have
zeros below the diagonal. We need to choose the next Householder reflection
to transform the k th column such that the first k − 1 columns remain in this
form. Therefore we let the first k − 1 entries of u be zero. With this choice the
first k − 1 rows and columns of the outer product $uu^T$ are zero and the top left
$(k-1) \times (k-1)$ submatrix of the Householder reflection is the identity matrix.
Let $u_k = a_k \pm \sqrt{\sum_{i=k}^{n} a_i^2}$ and $u_i = a_i$ for $i = k+1, \dots, n$. This introduces
zeros below the diagonal in the k-th column.
The end result after processing all columns of A in sequence is an upper
triangular matrix R in standard form.
For large n no explicit matrix multiplications are executed. Instead we use
\[ \Big( I - 2\frac{uu^T}{\|u\|^2} \Big) A = A - 2\frac{u(u^T A)}{\|u\|^2}. \]
So we first calculate $w^T := u^T A$ and then $A - \frac{2}{\|u\|^2} u w^T$.
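A hedged MATLAB sketch of the resulting Householder QR factorization (the function name is our own, and Q is accumulated explicitly, which is often unnecessary in practice):

function [Q,R] = HouseholderQR(A)
% QR factorization by Householder reflections as outlined above
[n,m] = size(A);
Q = eye(n); R = A;
for k = 1:min(m,n-1)
    u = zeros(n,1);
    u(k:n) = R(k:n,k);
    s = sign(R(k,k)); if s == 0, s = 1; end
    u(k) = u(k) + s*norm(R(k:n,k));   % u_k = a_k +/- ||a||, same sign as a_k
    if norm(u) > 1e-14
        w = (u'*R)/(u'*u);            % w^T = u^T R / ||u||^2
        R = R - 2*u*w;                % apply the reflection to R
        Q = Q - 2*(Q*u)*(u'/(u'*u));  % accumulate Q*(I - 2uu^T/||u||^2)
    end
end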

Exercise 2.9. Calculate the QR factorization of the matrix in Exercise 2.7


by using two Householder reflections. Show that for a general n × m matrix A
the computational cost is O(m2 n).
How do we choose between Givens rotations and Householder reflections?
If A is dense, Householder reflections are generally more effective. However,
Givens rotations should be chosen if A has many leading zeros. For example, if
an n × n matrix A has zeros under the first subdiagonal, these can be removed
by just n − 1 Givens rotations, which costs only O(n2 ) operations.
QR factorizations are used to solve over-determined systems of equations
as we will see in the next section.

2.9 Linear Least Squares


Consider a system of linear equations Ax = b where A is an n × m matrix
and b ∈ Rn .
In the case n < m there are not enough equations to define a unique
solution. The system is called under-determined . All possible solutions form
a vector space of dimension r, where r ≤ m − n. This problem seldom arises
in practice, since generally we choose a solution space in accordance with
the available data. An example, however, are cubic splines, which we will
encounter later.
In the case n > m there are more equations than unknowns. The sys-
tem is called over-determined . This situation may arise where a simple data
model is fitted to a large number of data points. Problems of this form occur
frequently when we collect n observations which often carry measurement er-
rors, and we want to build an m-dimensional linear model where generally m
is much smaller than n. In statistics this is known as linear regression. Many
machine learning algorithms have been developed to address this problem (see
for example [2] C. M. Bishop Pattern Recognition and Machine Learning).
We consider the simplest approach, that is, we seek x ∈ Rm that minimizes
the Euclidean norm kAx − bk. This is known as the least-squares problem.

Theorem 2.4. Let A be any n × m matrix (n > m) and let b ∈ Rn . The


vector x ∈ Rm minimizes kAx − bk if and only if it minimizes kΩAx − Ωbk
for an arbitrary n × n orthogonal matrix Ω.
Proof. Given an arbitrary vector $v \in \mathbb{R}^n$, we have
\[ \|\Omega v\|^2 = v^T \Omega^T \Omega v = v^T v = \|v\|^2. \]
Hence multiplication by orthogonal matrices preserves the length. In particular, $\|\Omega A x - \Omega b\| = \|Ax - b\|$.
Suppose for simplicity that the rank of A is m, the largest it can be. That
is, there are no linear dependencies in the model space. Suppose that we have a
QR factorization of A with R in standard form. Because of the theorem above,
letting $\Omega := Q^T$, we have $\|Ax - b\|^2 = \|Q^T(Ax - b)\|^2 = \|Rx - Q^T b\|^2$. We
can write R as
\[ R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix}, \]
where $R_1$ is $m \times m$ upper triangular. Since R and thus $R_1$ has the same rank as
A, it follows that $R_1$ is nonsingular and therefore has no zeros on the diagonal.
Further, let us partition $Q^T b$ as
\[ Q^T b = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}, \]
where $b_1$ is a vector of length m and $b_2$ is a vector of length n − m. The
minimization problem then becomes
\[ \|Rx - Q^T b\|^2 = \|R_1 x - b_1\|^2 + \|b_2\|^2. \]
In order to obtain a minimum we can force the first term to be zero. The
second term remains. Therefore the solution to the least squares problem can
be calculated from $R_1 x = b_1$ and the minimum is
\[ \min_{x \in \mathbb{R}^m} \|Ax - b\| = \|b_2\|. \]

We cannot do any better since multiplication with any orthogonal matrix


preserves the Euclidean norm. Thus, QR factorization is a convenient way to
solve the linear least squares problem.

2.10 Singular Value Decomposition


In this section we examine another possible factorization known as the singular
value decomposition (SVD), but we see that it is not a good choice in many
circumstances.
For any non-singular square n × n matrix A, the matrix $A^T A$ is symmetric
and positive definite, which can easily be seen from
\[ x^T A^T A x = \|Ax\|^2 > 0 \]

for any nonzero x ∈ Rn .


This result generalizes as follows: for any n × m real matrix A, the m × m
matrix AT A is symmetric and positive semi-definite. The latter means that
there might be vectors such that xT AT Ax = 0. Hence it has real, non-negative
eigenvalues λ1 , . . . , λm , which we arrange in decreasing order. Note that some
of the eigenvalues might be zero. Let D be the diagonal matrix with entries
Di,i = λi , i = 1, . . . , m. From the spectral theorem (see for example [4] C. W.
Curtis Linear Algebra: an introductory approach) we know that an orthonormal
basis of eigenvectors v1 , . . . , vm can be constructed for AT A. These are also
known as right singular vectors. We write these as an orthogonal matrix V =
$(v_1, \dots, v_m)$. Therefore we have
\[ V^T A^T A V = D = \begin{pmatrix} D_1 & 0 \\ 0 & 0 \end{pmatrix}, \]

where $D_1$ is a $k \times k$ matrix ($k \leq \min\{m,n\}$) and contains the nonzero portion
of D. Since $D_1$ is a diagonal matrix with positive elements, $D_1^{1/2}$ and its inverse
$D_1^{-1/2}$ can be formed easily.
We partition V accordingly into an $m \times k$ matrix $V_1$ and an $m \times (m-k)$ matrix
$V_2$. That is, $V_1$ contains the eigenvectors corresponding to nonzero eigenvalues,
while $V_2$ contains the eigenvectors corresponding to zero eigenvalues. We arrive
at
\[ \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} A^T A \, (V_1\ V_2) = \begin{pmatrix} V_1^T A^T A V_1 & V_1^T A^T A V_2 \\ V_2^T A^T A V_1 & V_2^T A^T A V_2 \end{pmatrix}, \]
where $V_1^T A^T A V_1 = D_1$ and $V_2^T A^T A V_2 = 0$ (which means $A V_2 = 0$).
Now let $U_1 = A V_1 D_1^{-1/2}$, which is an $n \times k$ matrix. We then have
\[ U_1 D_1^{1/2} V_1^T = A V_1 D_1^{-1/2} D_1^{1/2} V_1^T = A. \]

Additionally, the columns of $U_1$ are orthonormal, since
\[ U_1^T U_1 = (D_1^{-1/2})^T V_1^T A^T A V_1 D_1^{-1/2} = D_1^{-1/2} D_1 D_1^{-1/2} = I. \]

Moreover, $A A^T$ is an $n \times n$ matrix with
\[
U_1^T A A^T U_1 = (D_1^{-1/2})^T (V_1^T A^T A)(A^T A V_1) D_1^{-1/2}
= D_1^{-1/2} (V_1 D_1)^T (V_1 D_1) D_1^{-1/2}
= D_1^{-1/2} D_1 V_1^T V_1 D_1 D_1^{-1/2} = D_1,
\]

since the columns of $V_1$ are orthonormal eigenvectors of $A^T A$. Thus the
columns of $U_1$ are eigenvectors of $A A^T$. They are also called left singular
vectors.
Thus we have arrived at a factorization of A of the form
\[ A = U_1 D_1^{1/2} V_1^T, \]

where $U_1$ and $V_1$ are rectangular matrices whose columns are orthonormal to
each other. However, it is more desirable to have square orthogonal matrices.
$V_1$ is already part of such a matrix and we extend $U_1$ in a similar fashion.
More specifically, we choose an $n \times (n-k)$ matrix $U_2$ such that
\[ U = (U_1\ U_2) \]
is an $n \times n$ matrix for which $U^T U = I$, i.e., U is orthogonal.
Since the eigenvalues are non-negative, we can define $\sigma_i = \sqrt{\lambda_i}$ for $i = 1, \dots, k$. Let S be the $n \times m$ matrix with diagonal $S_{i,i} = \sigma_i$, $i = 1, \dots, k$. Note
that the last diagonal entries of S may be zero.
We have now constructed the singular value decomposition of A,
\[ A = U S V^T. \]
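The factors can be inspected numerically; a quick illustrative check with MATLAB's built-in svd (random data, not from the text):

A = randn(5,3);
[U,S,V] = svd(A);
norm(A - U*S*V')      % close to zero
norm(U'*U - eye(5))   % U is orthogonal
norm(V'*V - eye(3))   % V is orthogonal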

The singular value decomposition can be used to solve Ax = b for a general
$n \times m$ matrix A and $b \in \mathbb{R}^n$, since it is equivalent to
\[ S V^T x = U^T b. \]

For overdetermined systems n > m, consider the linear least squares problem, which we can rewrite using the orthogonality of U and V:
\[
\|Ax - b\|^2 = \|U^T(A V V^T x - b)\|^2 = \|S V^T x - U^T b\|^2
= \sum_{i=1}^{k} \big(\sigma_i (V^T x)_i - u_i^T b\big)^2 + \sum_{i=k+1}^{n} (u_i^T b)^2,
\]
where $u_i$ denotes the i-th column of U.


We achieve a minimum if we force the first sum to become zero:
\[ (V^T x)_i = \frac{u_i^T b}{\sigma_i}. \]
Using the orthogonality of V again, we deduce
\[ x = \sum_{i=1}^{k} \frac{u_i^T b}{\sigma_i} v_i, \]

where vi denotes the ith column of V . Thus x is a linear combination of the


first k eigenvectors of AT A with the given coefficients. This method is however
in general not suitable to solve a system of linear equations, since it involves
finding the eigenvalues and eigenvectors of AT A, which in itself is a difficult
problem.
The following theorem shows another link between the linear least squares
problem with the matrix AT A.
Theorem 2.5. x ∈ Rm is a solution to the linear least squares problem if and
only if we have AT (Ax − b) = 0.

Proof. If x is a solution then it minimizes the function
\[ f(x) = \|Ax - b\|^2 = x^T A^T A x - 2 x^T A^T b + b^T b. \]
Hence the gradient $\nabla f(x) = 2 A^T A x - 2 A^T b$ vanishes. Therefore $A^T(Ax - b) = 0$.
Conversely, suppose that $A^T(Ax - b) = 0$ and let $z \in \mathbb{R}^m$. We show that
the norm of $Az - b$ is greater than or equal to the norm of $Ax - b$. Letting $y = z - x$,
we have
\[
\|Az - b\|^2 = \|A(x+y) - b\|^2
= \|Ax - b\|^2 + 2 y^T A^T (Ax - b) + \|Ay\|^2
= \|Ax - b\|^2 + \|Ay\|^2 \geq \|Ax - b\|^2
\]
and x is indeed optimal.
Corollary 2.1. x ∈ Rm is a solution to the linear least squares problem if
and only if the vector Ax − b is orthogonal to all columns of A.
Hence another way to solve the linear least squares problem is to solve the
m × m system of linear equations AT Ax = AT b. This is called the method of
normal equations.
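In MATLAB the normal equations amount to a single line (with A and b as in the least-squares setting above); the drawbacks of forming $A^T A$ are discussed next:

x = (A'*A)\(A'*b);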
There are several disadvantages with this approach and also the singu-
lar value decomposition. Firstly, AT A might be singular, secondly a sparse A
might be replaced by a dense AT A, and thirdly calculating AT A might lead
to loss of accuracy; for example, due to overflow when large elements are mul-
tiplied. These are problems with forming AT A and this is before we attempt
to calculate the eigenvalues and eigenvectors of AT A for the singular value
decomposition.
In the next sections we will encounter iterative methods for linear systems.

2.11 Iterative Schemes and Splitting


Given a linear system of the form Ax = b, where A is an n × n matrix and
x, b ∈ Rn , solving it by factorization is frequently very expensive for large n.
However, we can rewrite it in the form
(A − B)x = −Bx + b,
where the matrix B is chosen in such a way that A − B is non-singular and
the system (A − B)x = y is easily solved for any right-hand side y. A simple
iterative scheme starts with an estimate x(0) ∈ Rn of the solution (this could
be arbitrary) and generates the sequence x(k) , k = 1, 2, . . . , by solving
(A − B)x(k+1) = −Bx(k) + b. (2.4)
This technique is called splitting. If the sequence converges to a limit,
limk→∞ x(k) = x̂, then taking the limit on both sides of Equation (2.4) gives
(A − B)x̂ = −B x̂ + b. Hence x̂ is a solution of Ax = b.

What are the necessary and sufficient conditions for convergence? Suppose
that A is non-singular and therefore has a unique solution x∗ . Since x∗ solves
Ax = b, it also satisfies (A − B)x∗ = −Bx∗ + b. Subtracting this equation
from (2.4) gives

(A − B)(x(k+1) − x∗ ) = −B(x(k) − x∗ ).

We denote x(k) − x∗ by e(k) . It is the error in the k th iteration. Since A − B


is non-singular, we can write

e(k+1) = −(A − B)−1 Be(k) .

The matrix H := −(A − B)−1 B is known as the iteration matrix . In practical


applications H is not calculated. We analyze its properties theoretically in
order to determine whether or not we have convergence. We will encounter
such analyses later on.

Definition 2.6 (Spectral radius). Let $\lambda_1, \dots, \lambda_n$ be the eigenvalues of the
$n \times n$ matrix H. Then its spectral radius $\rho(H)$ is defined as
\[ \rho(H) = \max_{i=1,\dots,n} |\lambda_i|. \]

Note that even if H is real, its eigenvalues might be complex.

Theorem 2.6. We have limk→∞ x(k) = x∗ for all x(0) ∈ Rn if and only if
ρ(H) < 1.
Proof. For the first direction we assume ρ(H) ≥ 1. Let λ be an eigenvalue of
H such that |λ| = ρ(H) and let v be the corresponding eigenvector, that is
Hv = λv. If v is real , we let x(0) = x∗ + v, hence e(0) = v. It follows by
induction that e(k) = λk v. This cannot tend to zero since |λ| ≥ 1.
If λ is complex, then v is complex. In this case we have a complex pair
of eigenvalues, λ and its complex conjugate $\bar{\lambda}$, which has $\bar{v}$ as its eigenvector.
The vectors v and $\bar{v}$ are linearly independent. We let $x^{(0)} = x^* + v + \bar{v} \in \mathbb{R}^n$,
hence $e^{(0)} = v + \bar{v} \in \mathbb{R}^n$. Again by induction we have $e^{(k)} = \lambda^k v + \bar{\lambda}^k \bar{v} \in \mathbb{R}^n$.
Now
\[ \|e^{(k)}\| = \|\lambda^k v + \bar{\lambda}^k \bar{v}\| = |\lambda^k| \, \|e^{i\theta_k} v + e^{-i\theta_k} \bar{v}\|, \]
where we have changed to polar coordinates. Now $\theta_k$ lies in the closed interval
$[-\pi, \pi]$ for all $k = 0, 1, \dots$. The function in $\theta \in [-\pi, \pi]$ given by $\|e^{i\theta} v + e^{-i\theta} \bar{v}\|$
is continuous and has a minimum with value, say, µ. This has to be positive
since v and $\bar{v}$ are linearly independent. Hence $\|e^{(k)}\| \geq \mu |\lambda^k|$ and $e^{(k)}$ cannot
tend to zero, since $|\lambda| \geq 1$.
The other direction is beyond the scope of this course, but can be found
in [20] R. S. Varga Matrix Iterative Analysis, which is regarded as a classic in
its field.

Exercise 2.10. The iteration $x^{(k+1)} = H x^{(k)} + b$ is calculated for $k = 0, 1, \dots$,
where H is given by
\[ H = \begin{pmatrix} \alpha & \gamma \\ 0 & \beta \end{pmatrix}, \]
with $\alpha, \beta, \gamma \in \mathbb{R}$ and γ large and $|\alpha| < 1$, $|\beta| < 1$. Calculate $H^k$ and show that
its elements tend to zero as $k \to \infty$. Further deduce the equation $x^{(k)} - x^* = H^k(x^{(0)} - x^*)$ where $x^*$ satisfies $x^* = H x^* + b$. Hence deduce that the sequence
$x^{(k)}$, $k = 0, 1, \dots$, tends to $x^*$.
Exercise 2.11. Starting with an arbitrary $x^{(0)}$ the sequence $x^{(k)}$, $k = 1, 2, \dots$,
is calculated by
\[ \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix} x^{(k+1)} + \begin{pmatrix} 0 & 0 & 0 \\ \alpha & 0 & 0 \\ \gamma & \beta & 0 \end{pmatrix} x^{(k)} = b \]
in order to solve the linear system
\[ \begin{pmatrix} 1 & 1 & 1 \\ \alpha & 1 & 1 \\ \gamma & \beta & 1 \end{pmatrix} x = b, \]
where $\alpha, \beta, \gamma \in \mathbb{R}$ are constant. Find all values for α, β, γ such that the sequence converges for every $x^{(0)}$ and b. What happens when $\alpha = \beta = \gamma = -1$
and $\alpha = \beta = 0$?
In some cases, however, iteration matrices can arise where we will have
convergence, but it will be very, very slow. An example for this situation is
\[ H = \begin{pmatrix} 0.99 & 10^6 & 10^{12} \\ 0 & 0 & 10^{20} \\ 0 & 0 & 0.99 \end{pmatrix}. \]

2.12 Jacobi and Gauss–Seidel Iterations


Both the Jacobi and Gauss–Seidel methods are splitting methods and can
be used whenever A has nonzero diagonal elements. We write A in the form
A = L + D + U , where L is the subdiagonal (or strictly lower triangular),
D is the diagonal, and U is the superdiagonal (or strictly upper triangular)
portion of A.
Jacobi method
We choose A − B = D, the diagonal part of A, or in other words we let
B = L + U. The iteration step is given by
\[ D x^{(k+1)} = -(L+U) x^{(k)} + b. \]



Gauss–Seidel method
We set A − B = L + D, the lower triangular portion of A, or in other
words B = U. The sequence $x^{(k)}$, $k = 1, \dots$, is generated by
\[ (L + D) x^{(k+1)} = -U x^{(k)} + b. \]
Note that there is no need to invert L + D; we calculate the components
of $x^{(k+1)}$ in sequence, using the components we have just calculated, by
forward substitution:
\[ A_{i,i} x_i^{(k+1)} = -\sum_{j<i} A_{i,j} x_j^{(k+1)} - \sum_{j>i} A_{i,j} x_j^{(k)} + b_i, \qquad i = 1, \dots, n. \]
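A minimal MATLAB sketch of this update is given below (the function name, the fixed number of sweeps and the absence of a convergence test are our own simplifications; the starting vector x must be a column vector):

function x = GaussSeidel(A, b, x, sweeps)
% Gauss-Seidel iteration as described above
n = length(b);
for k = 1:sweeps
    for i = 1:n
        s = A(i,1:i-1)*x(1:i-1) + A(i,i+1:n)*x(i+1:n);
        x(i) = (b(i) - s)/A(i,i);   % uses the components updated in this sweep
    end
end

The Jacobi iteration differs only in that the whole of the previous iterate is kept fixed while the new one is computed.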

As we have seen in the previous section, the sequence x(k) converges to


the solution if the spectral radius of the iteration matrix, HJ = −D−1 (L + U )
for Jacobi or HGS = −(L + D)−1 U for Gauss–Seidel, is less than one. We
will show this is true for two important classes of matrices. One is the class
of positive definite matrices and the other is given in the following definition.
Definition 2.7 (Strictly diagonally dominant matrices). A matrix A is called
strictly diagonally dominant by rows if
\[ |A_{i,i}| > \sum_{\substack{j=1 \\ j \neq i}}^{n} |A_{i,j}| \]
for $i = 1, \dots, n$.
For the first class, the following theorem holds:
Theorem 2.7 (Householder–John theorem). If A and B are real matrices
such that both A and A − B − B T are symmetric and positive definite, then
the spectral radius of H = −(A − B)−1 B is strictly less than one.
Proof. Let λ be an eigenvalue of H and $v \neq 0$ the corresponding eigenvector.
Note that λ and thus v might have nonzero imaginary parts. From $Hv = \lambda v$
we deduce $-Bv = \lambda(A - B)v$. λ cannot equal one since otherwise A would
map v to zero and be singular. Thus we deduce
\[ \bar{v}^T B v = \frac{\lambda}{\lambda - 1} \bar{v}^T A v. \tag{2.5} \]
Writing $v = u + iw$, where $u, w \in \mathbb{R}^n$, we deduce $\bar{v}^T A v = u^T A u + w^T A w$.
Hence, positive definiteness of A and $A - B - B^T$ implies $\bar{v}^T A v > 0$ and
$\bar{v}^T (A - B - B^T) v > 0$. Inserting (2.5) and its conjugate transpose into the
latter, we arrive at
\[ 0 < \bar{v}^T A v - \bar{v}^T B v - \bar{v}^T B^T v = \Big( 1 - \frac{\lambda}{\lambda - 1} - \frac{\bar{\lambda}}{\bar{\lambda} - 1} \Big) \bar{v}^T A v = \frac{1 - |\lambda|^2}{|\lambda - 1|^2} \bar{v}^T A v. \]

The denominator does not vanish since λ ≠ 1. Hence |λ − 1|^2 > 0. Since
v̄^T Av > 0, 1 − |λ|^2 has to be positive. Therefore we have |λ| < 1 for every
eigenvalue of H as required.
For the second class we use the following simple, but very useful theorem.

Theorem 2.8 (Gerschgorin theorem). All eigenvalues of an n × n matrix A


are contained in the union of the Gerschgorin discs Γi , i = 1, . . . , n defined in
the complex plane by
        Γ_i := { z ∈ C : |z − A_{i,i}| ≤ Σ_{j=1, j≠i}^{n} |A_{i,j}| }.

It follows from the Gerschgorin theorem that strictly diagonally dominant


matrices are nonsingular, since then none of the Gerschgorin discs contain
zero.
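As a small hedged sketch (hypothetical helper, not from the book), the disc centres and radii can be read straight off the rows of A, which also gives a convenient test for strict diagonal dominance:

function [centres, radii, dominant] = gerschgorin(A)
% Gerschgorin disc centres and radii, plus a strict diagonal dominance check.
centres  = diag(A);
radii    = sum(abs(A), 2) - abs(centres);   % off-diagonal row sums
dominant = all(abs(centres) > radii);       % true if strictly diagonally dominant
end

If dominant is true, no disc contains zero, so A is nonsingular and, by Theorem 2.9 below, both the Jacobi and Gauss–Seidel methods converge.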
Theorem 2.9. If A is strictly diagonally dominant, then both the Jacobi and
the Gauss–Seidel methods converge.

Proof. For the Gauss–Seidel method the eigenvalues of the iteration matrix
HGS = −(L + D)−1 U are solutions to

det[HGS − λI] = det[−(L + D)−1 U − λI] = 0.

Since L + D is non-singular, det(L + D) ≠ 0 and the equation can be multiplied


by this to give
det[U + λD + λL] = 0. (2.6)
Now for |λ| ≥ 1 the matrix λL + λD + U is strictly diagonally dominant, since
A is strictly diagonally dominant. Hence λL + λD + U is non-singular and
thus its determinant does not vanish. Therefore Equation (2.6) does not have
a solution with |λ| ≥ 1. Therefore |λ| < 1 and we have convergence.
The same argument holds for the iteration matrix for the Jacobi method.

Exercise 2.12. Prove that the Gauss–Seidel method to solve Ax = b con-


verges whenever the matrix A is symmetric and positive definite. Show how-
ever by a 3 × 3 counterexample that the Jacobi method for such an A does not
necessarily converge.
Corollary 2.2. 1. If A is symmetric and positive definite, then the Gauss–
Seidel method converges.
2. If both A and 2D−A are symmetric and positive definite, then the Jacobi
method converges.
Proof. 1. This is the subject of the above exercise.

2. For the Jacobi method we have B = A − D. If A is symmetric, then


A − B − B T = 2D − A. This matrix is the same as A, just the off
diagonal elements have the opposite sign. If both A and 2D − A are
positive definite, the Householder–John theorem applies.

We have already seen one example where convergence can be very slow.
The next section shows how to improve convergence.

2.13 Relaxation
The efficiency of the splitting method can be improved by relaxation. Here,
instead of iterating (A − B)x(k+1) = −Bx(k) + b, we first calculate (A −
B)x̃(k+1) = −Bx(k) + b as an intermediate value and then let

x(k+1) = ωx̃(k+1) + (1 − ω)x(k)

for k = 0, 1, . . ., where ω ∈ R is called the relaxation parameter . Of course


ω = 1 corresponds to the original method without relaxation. The parameter
ω is chosen such that the spectral radius of the relaxed method is smaller.
The smaller the spectral radius, the faster the iteration converges. Letting
c = (A − B)−1 b, the relaxation iteration matrix Hω can then be deduced
from

x(k+1) = ωx̃(k+1) + (1 − ω)x(k) = ωHx(k) + (1 − ω)x(k) + ωc

as
Hω = ωH + (1 − ω)I.
It follows that the eigenvalues of Hω and H are related by λω = ωλ + (1 − ω).
The best choice for ω would be to minimize max{|ωλi + (1 − ω)|, i = 1, . . . , n}
where λ1 , . . . , λn are the eigenvalues of H. However, the eigenvalues of H
are often unknown, but sometimes there is information (for example, derived
from the Gerschgorin theorem), which makes it possible to choose a good if not
optimal value for ω. For example, it might be known that all the eigenvalues
are real and lie in the interval [a, b], where −1 < a < b < 1. Then the interval
containing the eigenvalues of Hω is given by [ωa + (1 − ω), ωb + (1 − ω)].
An optimal choice for ω is the one which centralizes this interval around the
origin:
−[ωa + (1 − ω)] = ωb + (1 − ω).
It follows that

        ω = 2 / (2 − (a + b)).
The eigenvalues of the relaxed iteration matrix lie in the interval
[−(b − a)/(2 − (a + b)), (b − a)/(2 − (a + b))]. Note that if the interval [a, b] is already symmetric about
zero, i.e., a = −b, then ω = 1 and no relaxation is performed. On the other

hand consider the case where all eigenvalues lie in a small interval close to
1. More specifically, let a = 1 − 2ε and b = 1 − ε; then ω = 2/(3ε) and the new
interval is [−1/3, 1/3].
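A relaxed splitting step is easy to express in code; the sketch below (hypothetical function, assuming the splitting A = (A − B) + B of the text) performs the two-stage update with relaxation parameter w:

function x = relaxed_splitting(A, B, b, x, w, nsteps)
% One splitting iteration per step: (A - B)*xt = -B*x + b, then relaxation.
M = A - B;                         % the part of A that is solved with
for k = 1:nsteps
    xt = M \ (b - B*x);            % intermediate value x~
    x  = w*xt + (1 - w)*x;         % relaxed update; w = 1 recovers the original method
end
end

For the Jacobi splitting one would pass B = A − diag(diag(A)), so that M is the diagonal of A.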
Exercise 2.13. The Gauss–Seidel method is used to solve Ax = b, where
 
        A = ( 100  −11
                9    1  ) .

Find the eigenvalues of the iteration matrix. Then show that with relaxation
the spectral radius can be reduced by nearly a factor of 3. In addition show that
after one iteration with the relaxed method the error ‖x(k) − x∗ ‖ is reduced
by more than a factor of 3. Estimate the number of iterations the original
Gauss–Seidel would need to achieve a similar decrease in the error.

2.14 Steepest Descent Method


In this section we look at an alternative approach to construct iterative meth-
ods to solve Ax = b in the case where A is symmetric and positive definite.
We consider the quadratic function
        F(x) = (1/2) x^T Ax − x^T b,     x ∈ R^n .
It is a multivariate function which can be written as
        F(x_1 , . . . , x_n ) = (1/2) Σ_{i,j=1}^{n} A_{ij} x_i x_j − Σ_{i=1}^{n} b_i x_i .

Note that the first sum is a double sum. A multivariate function has an
extremum at the point where the derivatives in each of the directions xi ,
i = 1, . . . , n vanish. The vector of derivatives is called the gradient and is de-
noted ∇F (x). So the extremum occurs when the gradient vanishes, or in other
words when x satisfies ∇F (x) = 0. The derivative of F (x) in the direction of
xk is
 
        (d/dx_k) F(x_1 , . . . , x_n ) = (1/2) ( Σ_{i=1}^{n} A_{ik} x_i + Σ_{j=1}^{n} A_{kj} x_j ) − b_k
                                       = Σ_{j=1}^{n} A_{kj} x_j − b_k ,

where we used the symmetry of A in the last step. This is one component of
the gradient vector and thus

∇F (x) = Ax − b.

Thus finding the extremum is equivalent to x being a solution of Ax = b. In


our case the extremum is a minimum, since A is positive definite.
Let x∗ denote the solution. The location of the minimum of F does not
change if the constant (1/2) x∗^T Ax∗ is added. Using b = Ax∗ we see that it is
equivalent to minimize

        (1/2) x^T Ax − x^T Ax∗ + (1/2) x∗^T Ax∗ = (1/2) (x∗ − x)^T A(x∗ − x).

The latter expression can be viewed as the square of a norm defined by
‖x‖_A := √(x^T Ax), which is well-defined, since A is positive definite. Thus
we are minimizing ‖x∗ − x‖_A . Every iteration constructs an approximation
x(k) which is closer to x∗ in the norm defined by A. Since this is equivalent
to minimizing F , the constructed sequence should satisfy the condition

F (x(k+1) ) < F (x(k) ).

That is, the value of F at the new approximation should be less than the value
of F at the current approximation, since we are looking for a minimum. Both
Jacobi and Gauss–Seidel methods do so.
Generally descent methods have the following form.
1. Pick any starting vector x(0) ∈ Rn .
2. For any k = 0, 1, 2, . . . the calculation stops if the norm of the gradient
kAx(k) − bk = k∇F (x(k) )k is acceptably small.
3. Otherwise a search direction d(k) is generated that satisfies the descent
condition

        [ (d/dω) F(x(k) + ωd(k) ) ]_{ω=0} < 0.
In other words, if we are walking in the search direction, the values of
F become smaller.

4. The value ω (k) > 0 which minimizes F (x(k) + ωd(k) ) is calculated and
the new approximation is

x(k+1) = x(k) + ω (k) d(k) . (2.7)

Return to 2.
We will see specific choices for the search direction d(k) later. First we look

at which value ω (k) takes. Using the definition of F gives


        F(x(k) + ωd(k) ) = (1/2) (x(k) + ωd(k) )^T A(x(k) + ωd(k) ) − (x(k) + ωd(k) )^T b
                         = (1/2) [ x(k)^T Ax(k) + ω x(k)^T Ad(k) + ω d(k)^T Ax(k)
                                   + ω^2 d(k)^T Ad(k) ] − x(k)^T b − ω d(k)^T b
                         = F(x(k) ) + ω d(k)^T g(k) + (1/2) ω^2 d(k)^T Ad(k) ,        (2.8)
where we used the symmetry of A and where g(k) = Ax(k) − b denotes the
gradient ∇F (x(k) ). Differentiating with respect to ω and equating to zero,
leads to
        ω(k) = − (d(k)^T g(k)) / (d(k)^T Ad(k)).        (2.9)
Looking at (2.8) more closely, the descent direction has to satisfy
d(k)^T g(k) < 0, otherwise no reduction will be achieved. It is possible to satisfy
this condition, since the method terminates when g(k) is zero.
Multiplying both sides of (2.7) by A from the left and subtracting b,
successive gradients satisfy

g(k+1) = g(k) + ω (k) Ad(k) . (2.10)


Multiplying this now by d(k)^T from the left and using (2.9), we see that
this equates to zero. Thus the new gradient is orthogonal to the previous
search direction. Thus the descent method follows the search direction until
the gradient becomes perpendicular to the search direction.
Making the choice d(k) = −g(k) leads to the steepest descent method .
Locally the gradient of a function shows the direction of the sharpest increase
of F at this point. With this choice we have

x(k+1) = x(k) − ω (k) g(k)

and
        ω(k) = (g(k)^T g(k)) / (g(k)^T Ag(k)) = ‖g(k)‖^2 / ‖g(k)‖_A^2 ,
which is the square of the Euclidean norm of the gradient divided by the
square of the norm defined by A of the gradient.
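As a minimal MATLAB sketch of the steepest descent method just described (assuming a symmetric positive definite A; the function name is hypothetical):

function [x, k] = steepest_descent(A, b, x, tol, maxit)
% Steepest descent: the search direction is the negative gradient g = A*x - b.
g = A*x - b;
k = 0;
while norm(g) > tol && k < maxit
    w = (g'*g) / (g'*(A*g));       % exact line search along -g
    x = x - w*g;
    g = A*x - b;                   % new gradient, orthogonal to the old direction
    k = k + 1;
end
end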
It can be proven that the infinite sequence x(k) , k = 0, 1, 2, . . . converges
to the solution of Ax = b (see for example [9] R. Fletcher Practical Methods
of Optimization). However, convergence can be unacceptably slow.
We look at the contour lines of F in two dimensions (n = 2). These are
lines where F takes the same value. They are ellipses with the minimum
lying at the intersection of the axes of the ellipses. The gradients and thus

the search directions are perpendicular to the contour lines. That is, they
are perpendicular to the tangent to the contour line at that point. When the
current search direction becomes tangential to another contour line, the new
approximation is reached and the next search direction is perpendicular to the
previous one. Figure 2.1 illustrates this. The resultant zigzag search path is
typical. The value F (x(k+1) ) is decreased locally relative to F (x(k) ), but the
global decrease with respect to F (x(0) ) can be small.

Figure 2.1 Worst-case scenario for the steepest descent method

Let λn be the largest eigenvalue of A and λ1 the smallest. The rate of


convergence is then
        ( (λ_n − λ_1) / (λ_n + λ_1) )^2
(see [16] J. Nocedal, S. Wright Numerical Optimization). Returning to the
example in two dimensions, if λ1 = λ2 , then the rate of convergence is zero
and the minimum is reached in one step. This is because in this case the
contour lines are circles and thus all gradients point directly to the centre
where the minimum lies. The bigger the difference in the biggest and smallest
eigenvalue, the more elongated the ellipses are and the slower the convergence.
In the following section we see that the use of conjugate directions improves
the steepest descent method and performs very well.

2.15 Conjugate Gradients


We have already seen how the positive definiteness of A can be used to define
a norm. This can be extended to have a concept similar to orthogonality.
Definition 2.8. The vectors u and v are conjugate with respect to the positive
definite matrix A, if they are nonzero and satisfy uT Av = 0.
Conjugacy plays such an important role, because through it search direc-
tions can be constructed such that the new gradient g(k+1) is not just orthog-
onal to the current search direction d(k) but to all previous search directions.
This avoids revisiting search directions as in Figure 2.1.
Specifically, the first search direction is chosen as d(0) = −g(0) . The fol-
lowing search directions are then constructed by

d(k) = −g(k) + β (k) d(k−1) , k = 1, 2, . . . ,


where β(k) is chosen such that the conjugacy condition d(k)^T Ad(k−1) = 0 is
satisfied. This yields

        β(k) = (g(k)^T Ad(k−1)) / (d(k−1)^T Ad(k−1)),     k = 1, 2, . . . .
This gives the conjugate gradient method .
The values of x(k+1) and ω (k) are calculated as before in (2.7) and (2.9).
These search directions satisfy the descent condition. From Equation (2.8)
we have seen that the descent condition is equivalent to d(k)^T g(k) < 0. Using
the formula for d(k) above and the fact that the new gradient is orthogonal
to the previous search direction, i.e., d(k−1)^T g(k) = 0, we see that

        d(k)^T g(k) = −‖g(k)‖^2 < 0.

Theorem 2.10. For every integer k ≥ 1 until ‖g(k)‖ is small enough, we
have the following properties:
1. The space spanned by the gradients g(j) , j = 0, . . . , k − 1, is the same as
the space spanned by the search directions d(j) , j = 0, . . . , k − 1,
2. d(j)^T g(k) = 0 for j = 0, 1, . . . , k − 1,
3. d(j)^T Ad(k) = 0 for j = 0, 1, . . . , k − 1, and
4. g(j)^T g(k) = 0 for j = 0, 1, . . . , k − 1.
Proof. We use induction on k ≥ 1. The assertions are easy to verify for k =
1. Indeed,(1) follows from d(0) = −g(0) , and (2) and (4) follow, since the
new gradient g(1) is orthogonal to the search direction d(0) = −g(0) ; and (3)
follows, since d(1) is constructed in such a way that it is conjugate to d(0) .

We assume that the assertions are true for some k ≥ 1 and prove that they
remain true when k is increased by one.
For (1), the definition of d(k) = −g(k) + β (k) d(k−1) and the inductive
hypothesis show that any vector in the span of d(0) , . . . , d(k) also lies in the
span of g(0) , . . . , g(k) .
We have seen that the search directions satisfy g(k+1) = g(k) + ω (k) Ad(k) .
Multiplying this by d(j)^T from the left for j = 0, 1, . . . , k − 1, the first term
of the sum vanishes due to (2) and the second term vanishes due to (3). For
j = k, the choice of ω(k) ensures orthogonality. Thus d(j)^T g(k+1) = 0 for
j = 0, 1, . . . , k.
This also proves (4), since d(0) , . . . , d(k) span the same space as
g(0) , . . . , g(k) .

Turning to (3), the next search direction is given by d(k+1) = −g(k+1) +
β(k+1) d(k) . The value of β(k+1) gives d(k)^T Ad(k+1) = 0. It remains to show
that d(j)^T Ad(k+1) = 0 for j = 0, . . . , k − 1. Inserting the definition of d(k+1)
and using d(j)^T Ad(k) = 0, it is sufficient to show d(j)^T Ag(k+1) = 0. However,

        Ad(j) = (1/ω(j)) (g(j+1) − g(j) )
and the assertion follows from the mutual orthogonality of the gradients.
The last assertion of the theorem establishes that if the algorithm is ap-
plied in exact arithmetic, then termination occurs after at most n iterations,
since there can be at most n mutually orthogonal non-zero vectors in an
n-dimensional space. Figure 2.2 illustrates this, showing that the conjugate
gradient method converges after two steps in two dimensions.
In the following we reformulate and simplify the conjugate gradient method
to show it in standard form. Firstly, since g(k) − g(k−1) = ω(k−1) Ad(k−1) , the
expression for β(k) becomes

        β(k) = (g(k)^T (g(k) − g(k−1) )) / (d(k−1)^T (g(k) − g(k−1) )) = ‖g(k)‖^2 / ‖g(k−1)‖^2 ,

where d(k−1) = −g(k−1) + β (k−1) d(k−2) and the orthogonality properties (2)
and (4) of Theorem 2.10 are used.
We write −r(k) instead of g(k) , where r(k) is the residual b − Ax(k) . The
zero vector is chosen as the initial approximation x(0) .
The algorithm is then as follows
1. Set k = 0, x(0) = 0, r(0) = b and d(0) = r(0) .
2. Stop if ‖r(k)‖ is sufficiently small.
3. If k ≥ 1, set d(k) = r(k) + β(k) d(k−1) , where β(k) = ‖r(k)‖^2 /‖r(k−1)‖^2 .
4. Calculate v(k) = Ad(k) and ω(k) = ‖r(k)‖^2 / (d(k)^T v(k) ).

Figure 2.2 Conjugate gradient method applied to the same problem as in Figure 2.1

5. Set x(k+1) = x(k) + ω (k) d(k) and r(k+1) = r(k) − ω (k) v(k) .

6. Increase k by one and go back to 2.
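Transcribing steps 1–6 directly into MATLAB gives the following sketch (not the book's own listing; it assumes a symmetric positive definite A):

function [x, k] = conjugate_gradient(A, b, tol, maxit)
% Standard form of the conjugate gradient method with x(0) = 0.
x = zeros(size(b));
r = b;                            % r(0) = b - A*x(0) = b
d = r;                            % d(0) = r(0)
rho = r'*r;                       % squared residual norm
k = 0;
while sqrt(rho) > tol && k < maxit
    v = A*d;
    w = rho / (d'*v);             % omega(k)
    x = x + w*d;
    r = r - w*v;
    rho_new = r'*r;
    d = r + (rho_new/rho)*d;      % beta(k+1) = ||r_new||^2 / ||r||^2
    rho = rho_new;
    k = k + 1;
end
end

In exact arithmetic the loop terminates after at most n passes; in floating point one simply iterates until the residual norm is acceptably small.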


The number of operations per iteration is 2n each to calculate d(k)^T v(k) ,
‖r(k)‖^2 , d(k) = r(k) + β(k) d(k−1) , x(k+1) = x(k) + ω(k) d(k) and r(k+1) =
r(k) − ω(k) v(k) . However, the number of operations is dominated by the matrix
multiplication Ad(k) , which is O(n2 ) if A is dense. In the cases where A is
sparse this can be reduced and the conjugate gradient method becomes highly
suitable.
Only in exact arithmetic is the number of iterations at most n. The con-
jugate gradient method is sensitive to even small perturbations. In practice,
most directions will not be conjugate and the exact solution is not reached.
An acceptable approximation is however usually reached within a small (com-
pared to the problem size) number of iterations. The speed of convergence is
typically linear, but it depends on the condition number of the matrix A. The
larger the condition number the slower the convergence. In the following sec-
tion we analyze this further and show how to improve the conjugate gradient
method.

Exercise 2.14. Let A be positive definite and let the standard conjugate gra-
dient method be used to solve Ax = b. Express d(k) in terms of r(j) and β(j) ,
j = 0, 1, . . . , k. Using x(k+1) = Σ_{j=0}^{k} ω(j) d(j) , ω(j) > 0, and Theorem 2.10
show that the sequence ‖x(j)‖, j = 0, 1, . . . , k + 1, increases monotonically.
Exercise 2.15. Use the standard form of the conjugate gradient method to
solve

        ( 1  0  0 )       ( 1 )
        ( 0  2  0 ) x  =  ( 1 )
        ( 0  0  3 )       ( 1 )
starting with x(0) = 0. Show that the residuals r(0) , r(1) and r(2) are mutu-
ally orthogonal and that the search directions d(0) , d(1) and d(2) are mutually
conjugate and that x(3) is the solution.

2.16 Krylov Subspaces and Pre-Conditioning


Definition 2.9. Let A be an n × n matrix, b ∈ Rn a non-zero vector, then
for a number m the space spanned by Aj b, j = 0, . . . , m − 1 is the mth Krylov
subspace of Rn and is denoted by Km (A, b).
In our analysis of the conjugate gradient method we saw that in the k th
iteration the space spanned by the search directions d(j) and the space spanned
by the gradients g(j) , j = 0, . . . , k, are the same.
Lemma 2.3. The space spanned by g(j) (or d(j) ), j = 0, . . . , k, is the same
as the k + 1th Krylov subspace.
Proof. For k = 0 we have d(0) = −g(0) = b ∈ K1 (A, b).
We assume that the space spanned by g(j) , j = 0, . . . , k, is the same as
the space spanned by b, Ab, . . . , Ak b and increase k by one.
In the formula g(k+1) = g(k) + ω (k) Ad(k) both g(k) and d(k) can be ex-
pressed as linear combinations of b, Ab, . . . , Ak b by the inductive hypothesis.
Thus g(k+1) can be expressed as a linear combination of b, Ab, . . . , Ak+1 b.
Equally, using d(k+1) = −g(k+1) + β (k) d(k) we show that d(k+1) lies in the
space spanned by b, Ab, . . . , Ak+1 b, which is Kk+2 (A, b). This completes the
proof.
In the following lemma we show some properties of the Krylov subspaces.
Lemma 2.4. Given A and b. Then Km (A, b) is a subspace of Km+1 (A, b)
and there exists a positive integer s ≤ n such that for every m ≥ s, K_m(A, b) =
K_s(A, b). Furthermore, if we express b as b = Σ_{i=1}^{t} c_i v_i , where v_1 , . . . , v_t
are eigenvectors of A corresponding to distinct eigenvalues and all coefficients
c_i , i = 1, . . . , t, are non-zero, then s = t.
Proof. Clearly, Km (A, b) ⊆ Km+1 (A, b). The dimension of Km (A, b) is less
than or equal to m, since it is spanned by m vectors. It is also at most n since

Km (A, b) is a subspace of Rn . The first Krylov subspace has dimension 1. We


let s be the greatest integer such that the dimension of Ks (A, b) is s. Then
the dimension of Ks+1 (A, b) cannot be s + 1 by the choice of s. It has to be s,
since Ks (A, b) ⊆ Ks+1 (A, b). Hence the two spaces are the same. This means
that A^s b ∈ K_s(A, b), i.e., A^s b = Σ_{j=0}^{s−1} a_j A^j b. But then

        A^{s+r} b = Σ_{j=0}^{s−1} a_j A^{j+r} b

for any positive r. This means that also the spaces Ks+r+1 (A, b) and
K_{s+r}(A, b) are the same. Therefore, for every m ≥ s, K_m(A, b) = K_s(A, b).
Suppose now that b = Σ_{i=1}^{t} c_i v_i , where v_1 , . . . , v_t are eigenvectors of A
corresponding to distinct eigenvalues λi . Then for every j = 1, . . . , s
        A^j b = Σ_{i=1}^{t} c_i λ_i^j v_i .

Thus Ks (A, b) is a subspace of the space spanned by the eigenvectors, which


has dimension t, since the eigenvectors are linearly independent. Hence s ≤ t.
Now assume that s < t. The dimension of Kt (A, b) is s. Thus there exists
a linear combination of b, Ab, . . . , At−1 b which equates to zero,
        0 = Σ_{j=0}^{t−1} a_j A^j b = Σ_{j=0}^{t−1} a_j A^j Σ_{i=1}^{t} c_i v_i = Σ_{i=1}^{t} c_i ( Σ_{j=0}^{t−1} a_j λ_i^j ) v_i .

Since the eigenvectors are linearly independent and all ci are nonzero, we have
        Σ_{j=0}^{t−1} a_j λ_i^j = 0

for the distinct eigenvalues λi , i = 1, . . . , t. But a polynomial of degree t − 1


can have at most t − 1 roots. So we have a contradiction and s = t.
From this lemma it follows that the number of iterations of the conjugate
gradient method is at most the number of distinct eigenvalues of A. We can
tighten this bound even more: if b is expressed as a linear combination of
eigenvectors of A with distinct eigenvalues, then the number of iterations is
at most the number of nonzero coefficients in this linear combination.
By changing variables, x̃ = P −1 x, where P is a nonsingular n × n matrix,
we can significantly reduce the work of the conjugate gradient method. Instead
of solving Ax = b we solve the system P T AP x̃ = P T b. P T AP is also sym-
metric and positive definite, since A is symmetric and positive definite and P
is nonsingular. Hence we can apply the conjugate gradient method to obtain
the solution x̃, which then in turn gives x∗ = P x̃. This procedure is called the
preconditioned conjugate gradient method and P is called the preconditioner .

The speed of convergence of the conjugate gradient method depends on


the condition number of A, which for symmetric, positive definite A is the
ratio between the modulus of its largest and its least eigenvalue. Therefore,
P should be chosen so that the condition number of P T AP is much smaller
than the one of A.
This gives the transformed preconditioned conjugate gradient method with
formulae:
        d̃(0) = r̃(0) = P^T b − P^T AP x̃(0) ,
        ω(k) = (r̃(k)^T r̃(k)) / (d̃(k)^T P^T AP d̃(k)),
        x̃(k+1) = x̃(k) + ω(k) d̃(k) ,
        r̃(k+1) = r̃(k) − ω(k) P^T AP d̃(k) ,
        β(k+1) = (r̃(k+1)^T r̃(k+1)) / (r̃(k)^T r̃(k)),
        d̃(k+1) = r̃(k+1) + β(k+1) d̃(k) .
The number of iterations is at most the dimension of the Krylov subspace
Kn (P T AP, P T b). This space is spanned by (P T AP )j P T b = P T (AP P T )j b,
j = 0, . . . , n−1. Since P is nonsingular, so is P T . It follows that the dimension
of Kn (P T AP, P T b) is the same as the dimension of Kn (AP P T , b).
It is undesirable in this method that P has to be computed. However, with
a few careful changes of variables P can be eliminated. Instead, the matrix
S = P P T is used. We see later why this is advantageous.
Firstly, we use x̃(k) = P^{−1} x(k) , or equivalently x(k) = P x̃(k) . Setting
r(k) = P^{−T} r̃(k) and d(k) = P d̃(k) , we derive the untransformed preconditioned
conjugate gradient method:

        r(0) = P^{−T} r̃(0) = b − Ax(0) ,
        d(0) = P d̃(0) = P P^T b − P P^T Ax(0) = Sr(0) ,
        ω(k) = ((P^T r(k) )^T P^T r(k) ) / ((P^{−1} d(k) )^T P^T AP P^{−1} d(k) ) = (r(k)^T Sr(k) ) / (d(k)^T Ad(k) ),
        x(k+1) = P x̃(k+1) = P x̃(k) + ω(k) P d̃(k) = x(k) + ω(k) d(k) ,
        r(k+1) = P^{−T} r̃(k+1) = P^{−T} r̃(k) − ω(k) P^{−T} P^T AP d̃(k) = r(k) − ω(k) Ad(k) ,
        β(k+1) = ((P^T r(k+1) )^T P^T r(k+1) ) / ((P^T r(k) )^T P^T r(k) ) = (r(k+1)^T Sr(k+1) ) / (r(k)^T Sr(k) ),
        d(k+1) = P d̃(k+1) = P r̃(k+1) + β(k+1) P d̃(k) = Sr(k+1) + β(k+1) d(k) .

The effectiveness of the preconditioner S is determined by the condition



number of AS = AP P T (and occasionally by its clustering of eigenvalues).


The problem remains of finding S which is close enough to A to improve
convergence, but the cost of computing Sr(k) is low.
The perfect preconditioner would be S = A−1 , since for this precondi-
tioner AS = I has the condition number 1. Unfortunately, finding this pre-
conditioner is equivalent to solving Ax = b, the original problem for which
we seek a preconditioner. Hence this fails to be a useful preconditioner at all.
The simplest useful choice for the matrix S is a diagonal matrix whose
inverse has the same diagonal entries as A. Making the diagonal entries of AS
equal to one often causes the eigenvalues of P T AP to be close to one. This is
known as diagonal preconditioning or Jacobi preconditioning. It is trivial to
invert a diagonal matrix, but often this is a mediocre preconditioner.
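As a hedged sketch of the untransformed scheme above with this diagonal (Jacobi) preconditioner, note that Sr(k) is simply a componentwise division, so S never has to be formed explicitly (the function name is hypothetical):

function [x, k] = pcg_jacobi(A, b, tol, maxit)
% Preconditioned conjugate gradients with S = inv(diag(A)).
x = zeros(size(b));
r = b - A*x;
dA = diag(A);                 % diagonal of A
z = r ./ dA;                  % z = S*r
d = z;
rho = r'*z;
k = 0;
while norm(r) > tol && k < maxit
    v = A*d;
    w = rho / (d'*v);
    x = x + w*d;
    r = r - w*v;
    z = r ./ dA;
    rho_new = r'*z;
    d = z + (rho_new/rho)*d;
    rho = rho_new;
    k = k + 1;
end
end

MATLAB's own pcg routine accepts general preconditioners, but in keeping with the spirit of the text the loop is written out explicitly here.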
Another possibility is to let P be the inverse of the lower triangular part
of A, possibly changing the diagonal elements in the hope that S is close to
the inverse of A then.
A more elaborate preconditioner is incomplete Cholesky preconditioning.
We express A as S −1 + E, where S −1 is symmetric, positive definite, close
to A (such that the error E is small) and can be factorized easily into a
Cholesky factorization, S −1 = LLT . For example, S might be a banded matrix
with small band width. Once the Cholesky factorization of S −1 is known, the
main expense is calculating v = Sr(k) . However, this is equivalent to solving
S −1 v = LLT v = r(k) , which can be done efficiently by back substitution.
Unfortunately, incomplete Cholesky preconditioning is not always stable.
Many preconditioners, some quite sophisticated, have been developed. In
general, conjugate gradients should nearly always be used with a precondi-
tioner for large-scale applications. For moderate n the standard form of the
conjugate gradients method usually converges with an acceptably small value
of kr(k) k in far fewer than n iterations and there is no need for preconditioning.
Exercise 2.16. Let A be an n × n tridiagonal matrix of the form
 
        A = ( α  β
              β  α  β
                 .  .  .
                    β  α  β
                       β  α ) .
Verify that α > 2β > 0 implies that the matrix is positive definite. We now
precondition the system with the lower triangular, bidiagonal matrix P being
the inverse of

        P^{−1} = ( γ
                   δ  γ
                      .  .
                         δ  γ ) .
Determine γ and δ such that the inverse of S = P P T is the same as A apart

from the (n, n) entry. Prove that the preconditioned conjugate gradient method
then converges in just two iterations.
As a closing remark on conjugate gradients, we have seen that the method
of normal equations solves AT Ax = AT b for which conjugate gradients can
be used as long as Ax = b is not underdetermined, because only then is
AT A nonsingular. However, the condition number of AT A is the square of
the condition number of A, so convergence can be significantly slower. An
important technical point is that AT A is never computed explicitly, since it
is less sparse. Instead when calculating AT Ad(k) , first Ad(k) is computed and
this is then multiplied by AT . It is also numerically more stable to calculate
d(k)^T A^T Ad(k) as the inner product of Ad(k) with itself.

2.17 Eigenvalues and Eigenvectors


So far we only considered eigenvalues when analyzing the properties of numer-
ical methods. In the following sections we look at how to determine eigenvalues
and eigenvectors. Let A be a real n × n matrix. The eigenvalue equation is
given by
Av = λv,
where λ is scalar. It may be complex if A is not symmetric. There exists v ∈ Rn
satisfying the eigenvalue equation if and only if the determinant det(A − λI)
is zero. The function p(λ) = det(A − λI), λ ∈ C, is a polynomial of degree
n. However, calculating the eigenvalues by finding the roots of p is generally
unsuitable because of loss of accuracy due to rounding errors. In Chapter 1,
Fundamentals, we have seen how even finding the roots of a quadratic can be
difficult due to loss of significance.
If the polynomial has some multiple roots and if A is not symmetric, then
A might have fewer than n linearly independent eigenvectors. However, there
are always n mutually orthogonal real eigenvectors when A is symmetric. In
the following we assume that A has n linearly independent eigenvectors vj
for each eigenvalue λj , j = 1, . . . , n. This can be achieved by perturbing A
slightly if necessary. The task is now to find vj and λj , j = 1, . . . , n.

2.18 The Power Method


The power method forms the basis of many iterative algorithms for the cal-
culation of eigenvalues and eigenvectors. It generates a single eigenvector and
eigenvalue of A.

1. Pick a starting vector x(0) ∈ R^n satisfying ‖x(0)‖ = 1. Set k = 0 and
choose a tolerance ε > 0.
2. Calculate x(k+1) = Ax(k) and find the real number λ that minimizes

‖x(k+1) − λx(k)‖. This is given by the Rayleigh quotient

        λ = (x(k)^T Ax(k) ) / (x(k)^T x(k) ).
x(k) x(k)

3. Accept λ as an eigenvalue and x(k) as an eigenvector, if ‖x(k+1) − λx(k)‖ ≤ ε.
4. Otherwise, replace x(k+1) by x(k+1)/‖x(k+1)‖, increase k by one, and go
back to step 2.
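A compact MATLAB version of these four steps might look as follows (a sketch under the stated assumptions; the tolerance test mirrors step 3):

function [lambda, x, k] = power_method(A, x, tol, maxit)
% Power method: approximate dominant eigenvalue and eigenvector of A.
x = x / norm(x);
for k = 1:maxit
    y = A*x;
    lambda = x'*y;                   % Rayleigh quotient, since norm(x) = 1
    if norm(y - lambda*x) <= tol
        return                       % accept lambda and x
    end
    x = y / norm(y);                 % rescale to unit length
end
end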

Theorem 2.11. If there is one eigenvalue of A whose modulus is larger than


the moduli of the other eigenvalues, then the power method terminates with
an approximation to this eigenvalue and the corresponding eigenvector as long
as the starting vector has a component of the largest eigenvector (in exact
arithmetic).
Proof. Let |λ1 | ≤ |λ2 | ≤ . . . ≤ |λn−1 | < |λn | be the eigenvalues ordered by
modulus and v_1 , . . . , v_n the corresponding eigenvectors of unit length. Let the
starting vector x(0) = Σ_{i=1}^{n} c_i v_i be chosen such that c_n is nonzero. Then x(k)
is a multiple of
        A^k x(0) = Σ_{i=1}^{n} c_i λ_i^k v_i = c_n λ_n^k ( v_n + Σ_{i=1}^{n−1} (c_i/c_n) (λ_i/λ_n)^k v_i ),

since in every iteration the new approximation is scaled to have length 1. Since
kx(k) k = kvn k = 1, it follows that x(k) = ±vn + O(|λn−1 /λn |k ), where the
sign is determined by cn λkn , since we scale by a positive factor in each iteration.
The fraction |λn−1 /λn | characterizes the rate of convergence. Thus if λn−1 is
similar in size to λn convergence is slow. However, if λn is considerably larger
than the other eigenvalues in modulus, convergence is fast.
Termination occurs, since

        ‖x(k+1) − λx(k)‖ = min_λ ‖x(k+1) − λx(k)‖ ≤ ‖x(k+1) − λ_n x(k)‖
                         = ‖Ax(k) − λ_n x(k)‖ = ‖Av_n − λ_n v_n‖ + O(|λ_{n−1}/λ_n|^k)
                         = O(|λ_{n−1}/λ_n|^k) → 0   as k → ∞.

Exercise 2.17. Let A be the bidiagonal n × n matrix


 
        A = ( λ  1
                 .  .
                   .  .
                     λ  1
                        λ ) .

Find an explicit expression for Ak . Letting n = 3, the sequence x(k+1) , k =



0, 1, 2, . . . , is generated by the power method x(k+1) = Ax(k) /kx(k) k, starting


with some x(0) ∈ R3 . From the expression for Ak , deduce that the second and
third components of x(k) tend to zero as k tends to infinity. Further show that
this implies Ax(k) − λx(k) tends to zero.
The power method usually works well if the modulus of one of the eigen-
values is substantially larger than the moduli of the other n − 1 eigenvalues.
However, it can be unacceptably slow, if there are eigenvalues of similar size.
In the case cn = 0, the method should find the eigenvector vm for the largest
m for which cm is nonzero in exact arithmetic. Computer rounding errors can,
however, introduce a small nonzero component of vn , which will grow and the
method converges to vn eventually.
Two other methods are the Arnoldi and Lanczos methods, which are Kry-
lov subspace methods and exploit the advantages associated with Krylov
subspaces; that is, convergence is guaranteed in a finite number of iterations
(in exact arithmetic). They are described in [6] Numerical Linear Algebra and
Applications by B. Datta. In the following we examine a variation of the power
method dealing with complex conjugate pairs of eigenvalues.
When A is real and not symmetric, then some eigenvalues occur in com-
plex conjugate pairs. In this case we might have two eigenvalues with largest
modulus. The two-stage power method is used then.
1. Pick a starting vector x(0) ∈ Rn satisfying kx(0) k = 1. Set k = 0 and
choose a tolerance ε > 0. Calculate x(1) = Ax(0) .
2. Calculate x(k+2) = Ax(k+1) and find the real numbers α and β that
minimize kx(k+2) + αx(k+1) + βx(k) k.
3. Let λ+ and λ− be the roots of the quadratic equation λ2 + αλ + β = 0.
If ‖x(k+2) + αx(k+1) + βx(k)‖ ≤ ε holds, then accept λ+ as eigenvalue
with eigenvector x(k+1) − λ− x(k) and λ− as eigenvalue with eigenvector
x(k+1) − λ+ x(k) .
4. Otherwise, replace the vector x(k+1) by x(k+1) /kx(k+2) k and x(k+2) by
x(k+2) /kx(k+2) k, keeping the relationship x(k+2) = Ax(k+1) , then in-
crease k by one and go back to step 2.
Lemma 2.5. If kx(k+2) + αx(k+1) + βx(k) k = 0 in step 3 of the two-stage
power method, then x(k+1) −λ− x(k) and x(k+1) −λ+ x(k) satisfy the eigenvalue
equations
A(x(k+1) − λ− x(k) ) = λ+ (x(k+1) − λ− x(k) ),
A(x(k+1) − λ+ x(k) ) = λ− (x(k+1) − λ+ x(k) ).
Proof. Using Ax(k+1) = x(k+2) and Ax(k) = x(k+1) , we have
A(x(k+1) − λ− x(k) ) − λ+ (x(k+1) − λ− x(k) )

= x(k+2) − (λ+ + λ− )x(k+1) + λ+ λ− x(k)

= x(k+2) + αx(k+1) + βx(k) = 0.



This proves the first assertion, the second one follows similarly.
If kx(k+2) + αx(k+1) + βx(k) k = 0, then the vectors x(k) and x(k+1) span
an eigenspace of A. This means that if A is applied to any linear combination
of x(k) and x(k+1) , then the result is again a linear combination of x(k) and
x(k+1) . Since kx(k+2) + αx(k+1) + βx(k) k = 0, it is easy to see that

x(k+2) = Ax(k+1) = −αx(k+1) − βx(k)

is a linear combination of x(k) and x(k+1) . Let v = ax(k+1) + bx(k) be any


linear combination, then
 
Av = aAx(k+1) +bAx(k) = ax(k+2) +bx(k+1) = a −αx(k+1) − βx(k) +bx(k+1)

is a linear combination of x(k) and x(k+1) .


In the two-stage power method a norm of the form ‖u + αv + βw‖ has to
be minimized. This is equivalent to minimizing

        (u + αv + βw)^T (u + αv + βw) =
        u^T u + α^2 v^T v + β^2 w^T w + 2α u^T v + 2β u^T w + 2αβ v^T w.

Differentiating with respect to α and β, we see that the gradient vanishes if


α and β satisfy the following system of equations
        ( v^T v   v^T w ) ( α )     ( −u^T v )
        ( v^T w   w^T w ) ( β )  =  ( −u^T w ) ,

which can be solved easily.


The two-stage power method can find complex eigenvectors, although most
of its operations are in real arithmetic. The complex values λ+ and λ− and
their eigenvectors are only calculated if kx(k+2) + αx(k+1) + βx(k) k is accept-
ably small. The main task in each operation is a matrix by vector product.
Therefore, both power methods benefit greatly from sparsity in A. However,
in practice the following method is usually preferred, since it is more efficient.
When solving systems of linear equations iteratively, we have seen that
convergence can be sped up by relaxation. Relaxation essentially shifts the
eigenvalues of the iteration matrix so that they lie in an interval symmetrically
about zero, which leads to faster convergence. The eigenvectors of A − sI for
s ∈ R are also the eigenvectors of A. The eigenvalues of A − sI are the same
as the eigenvalues of A shifted by s. We use this fact by choosing a shift s
to achieve faster convergence. However, the shift is not chosen such that the
interval containing all eigenvalues is symmetric about zero. The component of
the j th eigenvector is reduced by the factor |λj /λn | in each iteration. If |λj | is
close to λn the reduction is small. For example, if A has n − 1 eigenvalues in
the interval [100, 101] and λn = 102, the reduction factors lie in the interval
[100/102, 101/102] ≈ [0.98, 0.99]. Thus convergence is slow. However, for s =

100.5, the first n − 1 eigenvalues lie in [−0.5, 0.5] and the largest eigenvalue is
1.5. Now the reduction factor is at least 0.5/1.5 = 1/3. Occasionally some prior
knowledge (Gerschgorin theorem) is available, which provides good choices for
the shift.
This gives the power method with shifts.

1. Pick a starting vector x(0) ∈ Rn satisfying kx(0) k = 1. Set k = 0 and


choose a tolerance ε > 0 and a shift s ∈ R.
2. Calculate x(k+1) = (A − sI)x(k) . Find the real number λ that minimizes
kx(k+1) − λx(k) k.
3. Accept λ + s as an eigenvalue and x(k) as an eigenvector, if we have
‖x(k+1) − λx(k)‖ ≤ ε.
4. Otherwise, replace x(k+1) by x(k+1) /kx(k+1) k, increase k by one and go
back to step 2.

The following method also uses shifts, but with a different intention.

2.19 Inverse Iteration


The method described in this section is called inverse iteration and is very
effective in practice. It is similar to the power method with shifts, except that,
instead of x(k+1) being a multiple of (A − sI)x(k) , it is calculated as a scalar
multiple of the solution to

(A − sI)x(k+1) = x(k) , k = 0, 1, . . . ,

where s is a scalar that may depend on k. Thus the inverse power method
is the power method applied to the matrix (A − sI)−1 . If s is close to an
eigenvalue, then the matrix A − sI has an eigenvalue close to zero, but this
implies that (A − sI)−1 has a very large eigenvalue and we have seen that in
this case the power method converges fast.
In every iteration x(k+1) is scaled such that kx(k+1) k = 1. We see that the
calculation of x(k+1) requires the solution of an n × n system of equations.
If s is constant in every iteration such that A − sI is nonsingular, then
x(k+1) is a multiple of (A − sI)^{−k−1} x(0) . As before we let x(0) = Σ_{j=1}^{n} c_j v_j ,
where vj , j = 1, . . . , n, are the linearly independent eigenvectors. The eigen-
value equation then implies (A − sI)vj = (λj − s)vj . For the inverse we have
then (A − sI)−1 vj = (λj − s)−1 vj . It follows that x(k+1) is a multiple of
        (A − sI)^{−k−1} x(0) = Σ_{j=1}^{n} c_j (A − sI)^{−k−1} v_j = Σ_{j=1}^{n} c_j (λ_j − s)^{−k−1} v_j .

Let m be the index of the smallest of the numbers |λ_j − s|, j = 1, . . . , n. If c_m is
nonzero, then x(k+1) tends to a multiple of v_m as k tends to infinity. The

speed of convergence can be excellent if s is close to λm . It can be improved


even more if s is adjusted during the iterations as the following implementation
shows.

1. Set s to an estimate of an eigenvalue of A. Either pick a starting vector


x(0) ≠ 0 or let it be chosen automatically in step 4. Set k = 0 and choose
a tolerance ε > 0.

2. Calculate (with pivoting if necessary) the LU factorization of (A − sI) =


LU .
3. Stop if U is singular (in this case one or more diagonal elements are
zero), because then s is an eigenvalue of A, while the corresponding
eigenvector lies in the null space of U and can be found easily by back
substitution.
4. If k = 0 and x(0) has not been chosen, let i be the index of the smallest
diagonal element of U , i.e., |U_{i,i}| ≤ |U_{j,j}|, i ≠ j. We define x(1) by
U x(1) = ei and let x(0) = Lei , where ei is the ith standard unit vector.
Then (A−sI)x(1) = x(0) . Otherwise, solve (A−sI)x(k+1) = LU x(k+1) =
x(k) by back substitution to obtain x(k+1) .
5. Let λ be the number which minimizes kx(k) − λx(k+1) k.
6. Stop if ‖x(k) − λx(k+1)‖ ≤ ε‖x(k+1)‖. Since

kx(k) −λx(k+1) k = k(A−sI)x(k+1) −λx(k+1) k = kAx(k+1) −(s+λ)x(k+1) k,

we let s + λ and x(k+1) /kx(k+1) k be the approximation to the eigenvalue


and its corresponding eigenvector.
7. Otherwise, replace x(k+1) by x(k+1) /kx(k+1) k, increase k by one and
either return to step 4 without changing s or to step 2 after replacing s
by s + λ.
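A stripped-down version with a fixed shift conveys the idea (a sketch only: the scheme above additionally reuses or updates s and handles a singular U; the names are hypothetical):

function [mu, x] = inverse_iteration(A, s, x, tol, maxit)
% Inverse iteration with fixed shift s: the power method applied to inv(A - s*I).
n = size(A, 1);
[Lf, Uf, P] = lu(A - s*eye(n));          % factorize once, reuse in every step
x = x / norm(x);
for k = 1:maxit
    y = Uf \ (Lf \ (P*x));               % solve (A - s*I)*y = x
    lambda = (y'*x) / (y'*y);            % minimizes norm(x - lambda*y)
    if norm(x - lambda*y) <= tol*norm(y)
        break                            % step 6: accept the approximation
    end
    x = y / norm(y);
end
mu = s + lambda;                         % approximate eigenvalue of A
x = y / norm(y);                         % corresponding unit eigenvector
end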

The order of convergence is illustrated in the following exercise.

Exercise 2.18. Let A be a symmetric 2 × 2 matrix with distinct eigenvalues


λ1 > λ2 with normalized corresponding eigenvectors v1 and v2 . Starting with
x(0) ≠ 0, the sequence x(k+1) , k = 0, 1, . . . , is generated by inverse iteration. In
every iteration we let s(k) be the Rayleigh quotient s(k) = x(k)^T Ax(k) /‖x(k)‖^2 .
Show that if x(k) = (v_1 + ε(k) v_2 )/√(1 + ε(k)^2 ), where |ε(k)| is small, then |ε(k+1)|
is of magnitude |ε(k)|^3 . That is, the order of convergence is 3.
Adjusting s and calculating an LU-factorization in every iteration seems
excessive. The same s can be retained for a few iterations and adjusted when
necessary. However, if A is an upper Hessenberg matrix, that is, if every el-
ement of A below the first subdiagonal is zero (i.e., Aij = 0, j < i − 1),

inverse iterations are very efficient, because in this case the LU factorization
requires O(n2 ) operations when A is nonsymmetric and O(n) operations if A
is symmetric.
We have seen earlier when examining how to arrive at a QR factorization,
that Givens rotations and Householder reflections can be used to introduce
zeros below the diagonal. These can be used to transform A into an upper
Hessenberg matrix. They are also used in the next section.

2.20 Deflation
Suppose we have found one solution of the eigenvector equation Av = λv
(or possibly a pair of complex conjugate eigenvalues with their corresponding
eigenvectors), where A is an n × n matrix. Deflation constructs an (n − 1) ×
(n−1) (or (n−2)×(n−2)) matrix, say B, whose eigenvalues are the other n−1
(or n − 2) eigenvalues of A. The concept is based on the following theorem.
Theorem 2.12. Let A and S be n × n matrices, S being nonsingular. Then
v is an eigenvector of A with eigenvalue λ if and only if Sv is an eigenvector
of SAS −1 with the same eigenvalue. S is called a similarity transformation.
Proof.
Av = λv ⇔ AS −1 (Sv) = λv ⇔ (SAS −1 )(Sv) = λ(Sv).

Let’s assume one eigenvalue λ and its corresponding eigenvector have been
found. In deflation we apply a similarity transformation S to A such that the
first column of SAS −1 is λ times the first standard unit vector e1 ,
(SAS −1 )e1 = λe1 .
Then we can let B be the bottom right (n − 1) × (n − 1) submatrix of SAS −1 .
We see from the above theorem that it is sufficient to let S have the property
Sv = ae1 , where a is any nonzero scalar.
If we know a complex conjugate pair of eigenvalues, then there is a two-
dimensional eigenspace associated with them. Eigenspace means that if A is
applied to any vector in the eigenspace, then the result will again lie in the
eigenspace. Let v1 and v2 be vectors spanning the eigenspace. For example
these could have been found by the two-stage power method. We need to find
a similarity transformation S which maps the eigenspace to the space spanned
by the first two standard basis vectors e1 and e2 . Let S1 such that S1 v1 = ae1 .
In addition let v̂ be the vector composed of the last n − 1 components of S1 v2 .
We then let S2 be of the form
 
        S_2 = ( 1   0  · · ·  0
                0
                .       Ŝ
                0              ) ,

where Ŝ is a (n − 1) × (n − 1) matrix such that the last n − 2 components


of Ŝ v̂ are zero. The matrix S = S2 S1 then maps the eigenspace to the space
spanned by the first two standard basis vectors e1 and e2 , since Sv1 = ae1
and Sv2 is a linear combination of e1 and e2 . The matrix SAS −1 then has
zeros in the last n − 2 entries of the first two columns.
For symmetric A we want B to be symmetric also. This can be achieved if S
is an orthogonal matrix, since then S −1 = S T and SAS −1 remains symmetric.
A Householder reflection is the most suitable choice:
        S = I − 2 uu^T / ‖u‖^2 .
As in the section on Householder reflections we let u_i = v_i for i = 2, . . . , n
and choose u_1 such that 2u^T v = ‖u‖^2 , if v is a known eigenvector. This gives
u_1 = v_1 ± ‖v‖. The sign is chosen such that loss of significance is avoided.
With this choice we have
        Sv = ( I − 2 uu^T / ‖u‖^2 ) v = ±‖v‖ e_1 .
Since the last n − 1 components of u and v are the same, the calculation of
u only requires O(n) operations. Further using the fact that S −1 = S T = S,
since S is not only orthogonal, but also symmetric, SAS −1 can be calculated
as
        SAS^{−1} = SAS = ( I − 2 uu^T/‖u‖^2 ) A ( I − 2 uu^T/‖u‖^2 )
                 = A − 2 uu^T A/‖u‖^2 − 2 Auu^T/‖u‖^2 + 4 uu^T Auu^T/‖u‖^4 .
The number of operations is O(n2 ) to form Au and then AuuT . The matrix
uuT A is just the transpose of AuuT , since A is symmetric. Further uT Au
is a scalar which can be calculated in O(n) operations once Au is known. It
remains to calculate uu^T in O(n^2) operations. Thus the overall number of
operations is O(n^2).
Once an eigenvector ŵ ∈ Rn−1 of B is found, we extend it to a vector
w ∈ Rn by letting the first component equal zero and the last n−1 components
equal ŵ. This is an eigenvector of SAS −1 . The corresponding eigenvector of
A is S −1 w = Sw.
Further eigenvalue/eigenvector pairs can be found by deflating B and con-
tinuing the process.
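The symmetric deflation step can be sketched in a few lines of MATLAB (a hedged illustration with a hypothetical function name, assuming v is a known eigenvector of the symmetric matrix A):

function [B, S] = deflate_symmetric(A, v)
% Householder-based deflation: S*v is a multiple of e1, and B holds the
% remaining eigenvalues of A in its (n-1) x (n-1) trailing block.
sgn = sign(v(1));
if sgn == 0, sgn = 1; end
u = v;
u(1) = v(1) + sgn*norm(v);            % sign chosen to avoid loss of significance
S = eye(length(v)) - 2*(u*u')/(u'*u); % symmetric and orthogonal
T = S*A*S;                            % similarity transformation, since S^{-1} = S
B = T(2:end, 2:end);
end

Once an eigenvector ŵ of B is known, prepending a zero component and multiplying by S recovers the corresponding eigenvector of A, as described above.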
Exercise 2.19. The symmetric matrix
 
        A = ( 3  2  4
              2  0  2
              4  2  3 )

has the eigenvector v = (2, 1, 2)T . Use a Householder reflection to find an



orthogonal matrix S such that Sv is a multiple of the first standard unit vector
e1 . Calculate SAS. The resultant matrix is suitable for deflation and hence
identify the remaining eigenvalues and eigenvectors.
We could achieve the same form of SAS −1 using successive Givens rota-
tions instead of one Householder reflection. However, this makes sense only if
there are already many zero elements in the first column of A.
The following algorithm for deflation can be used for nonsymmetric matri-
ces as well as symmetric ones. Let vi , i = 1, . . . , n, be the components of the
eigenvector v. We can assume v_1 ≠ 0, since otherwise the variables could be
reordered. Let S be the n × n matrix which is identical to the n × n identity
matrix except for the off-diagonal elements of the first column of S, which are
Si,1 = −vi /v1 , i = 2, . . . , n.

        S = (    1        0    · · ·  · · ·  0
              −v_2/v_1    1                  .
                  .              .           .
                  .                    .     0
              −v_n/v_1    0    · · ·   0     1 ) .
Then S is nonsingular, has the property Sv = v1 e1 , and thus is suitable for our
purposes. The inverse S −1 is also identical to the identity matrix except for the
off-diagonal elements of the first column of S −1 which are (S −1 )i,1 = +vi /v1 ,
i = 2, . . . , n. Because of this form of S and S −1 , SAS −1 and hence B can
be calculated in only O(n2 ) operations. Moreover, the last n − 1 columns of
SAS −1 and SA are the same, since the last n − 1 columns of S −1 are taken
from the identity matrix, and thus B is just the bottom (n − 1) × (n − 1)
submatrix of SA. Therefore, for every integer i = 1, . . . , n − 1 we calculate the
ith row of B by subtracting vi+1 /v1 times the first row of A from the (i + 1)th
row of A and deleting the first component of the resultant row vector.
The following algorithm provides deflation in the form of block matrices. It
is known as the QR algorithm, since QR factorizations are calculated again and
again. Set A_0 = A. For k = 0, 1, . . . calculate the QR factorization A_k = Q_k R_k ,
where Qk is orthogonal and Rk is upper triangular. Set Ak+1 = Rk Qk . The
eigenvalues of Ak+1 are the same as the eigenvalues of Ak , since

        A_{k+1} = R_k Q_k = Q_k^{−1} Q_k R_k Q_k = Q_k^{−1} A_k Q_k

is a similarity transformation. Furthermore, Q_k^{−1} = Q_k^T , since Q_k is orthogo-
nal. So if A_k is symmetric so is A_{k+1} . Surprisingly often the matrix A_{k+1} can
be regarded as deflated. That is, it has the block structure
 
        A_{k+1} = ( B  C
                    D  E ) ,

where B and E are square m × m and (n − m) × (n − m) matrices and where



all entries in D are close to zero. We can then calculate the eigenvalues of
B and E separately, possibly again with the QR algorithm, except for 1 × 1
and 2 × 2 blocks where the eigenvalue problem is trivial. The space spanned
by e1 , . . . , em can be regarded as an eigenspace of Ak+1 , since, if D = 0,
Ak+1 ei , i = 1, . . . , m, again lies in this space. Equally the space spanned by
em+1 , . . . , en can be regarded as an eigenspace of Ak+1 .
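The basic (unshifted) QR iteration itself is only a few lines of MATLAB; the sketch below is illustrative and omits the deflation test and the shifts used in practice:

function Ak = qr_algorithm(A, nsteps)
% Repeated QR factorization: each iterate is similar to A, so the
% eigenvalues are preserved while off-diagonal blocks often shrink.
Ak = A;
for k = 1:nsteps
    [Q, R] = qr(Ak);       % Ak = Q*R, Q orthogonal, R upper triangular
    Ak = R*Q;              % = Q'*Ak*Q, a similarity transformation
end
end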
The concept of eigenspaces is important when dealing with a complex
conjugate pair of eigenvalues λ and λ with corresponding eigenvectors v and
v in Cn . However, we are operating in Rn . The real and imaginary parts
Re(v) and Im(v) form an eigenspace of Rn ; that is, A applied to any linear
combination of Re(v) and Im(v) will again be a linear combination of Re(v)
and Im(v).
In this situation we choose S such that S applied to any vector in the
space spanned by Re(v) and Im(v) is a linear combination e1 and e2 . The
matrix SAS −1 , then, consists of a 2 × 2 block in the top left corner and an
(n − 2) × (n − 2) block B in the bottom right, and the last (n − 2) elements in
the first two columns are zero. The search for eigenvalues can then continue
with B. The following exercise illustrates this.
Exercise 2.20. Show that the vectors x, Ax, and A2 x are linearly dependent
for
1
1 0 − 34
   
4 1
1
 3
2 2 − 12   4 
A= and x =  1 .
  
 0 − 34 1 1 
4

2 −1 3 1 4
2 2

From this, calculate two eigenvalues of A. Obtain by deflation a 2 × 2 matrix


whose eigenvalues are the remaining eigenvalues of A.

2.21 Revision Exercises


Exercise 2.21. (a) Explain the technique of splitting for solving the linear
system Ax = b iteratively where A is an n × n, non-singular matrix.
Define the iteration matrix H and state the property it has to satisfy to
ensure convergence.

(b) Define the Gauss–Seidel and Jacobi iterations and state their iteration
matrices, respectively.
(c) Describe relaxation and briefly consider the cases when the relaxation pa-
rameter ω equals 0 and 1.
(d) Show how the iteration matrix Hω of the relaxed method is related to the
iteration matrix H of the original method and thus how the eigenvalues
are related. How should ω be chosen?

(e) We now consider the tridiagonal matrix A with diagonal elements Ai,i = 1
and off-diagonal elements Ai,i−1 = Ai,i+1 = 1/4. Calculate the iteration
matrices H of the Jacobi method and Hω of the relaxed Jacobi method.
(f ) The eigenvectors of both H and Hω are v_1 , . . . , v_n where the ith component
of v_k is given by (v_k)_i = sin(πik/(n + 1)). Calculate the eigenvalues of H by
evaluating Hv_k (Hint: sin(x ± y) = sin x cos y ± cos x sin y).
(g) Using the formula for the eigenvalues of Hω derived earlier, state the
eigenvalues of Hω and show that the relaxed method converges for 0 <
ω ≤ 4/3.

Exercise 2.22. Let A be an n × n matrix with n linearly independent eigen-


vectors. The eigenvectors are normalized to have unit length.

(a) Describe the power method to generate a single eigenvector and eigenvalue
of A. Define the Rayleigh quotient in the process.
(b) Which assumption is crucial for the power method to converge? What
characterizes the rate of convergence? By expressing the starting vector
x(0) as a linear combination of eigenvectors, give an expression for x(k) .
(c) Given
            ( 1  1  1 )                 ( 2 )
        A = ( 1  1  0 )   and   x(0) =  ( 1 ) ,
            ( 1  0  1 )                 ( −1 )
calculate x(1) and x(2) and evaluate the Rayleigh quotient λ(k) for k =
0, 1, 2.

(d) Suppose now that for a different matrix A the eigenvalues of largest mod-
ulus are a complex conjugate pair of eigenvalues, λ and λ̄. In this case the
vectors x(k) , Ax(k) and A2 x(k) tend to be linearly dependent. Assuming
that they are linearly dependent, show how this can be used to calculate
the eigenvalues λ and λ̄.

(e) For large k, the iterations yielded the following vectors


     
               ( 1 )              ( 2 )                   ( 2 )
        x(k) = ( 1 ) ,   Ax(k) =  ( 3 )   and  A^2 x(k) = ( 4 ) .
               ( 1 )              ( 4 )                   ( 6 )

Find the coefficients of the linear combination of these vectors which


equates to zero. Thus deduce two eigenvalues of A.

Exercise 2.23. (a) Given an n × n matrix A, define the concept of LU fac-


torization and how it can be used to solve the system of equations Ax = b.
(b) State two other applications of the LU factorization.

(c) Describe the algorithm to obtain an LU factorization. How many opera-


tions does this generally require?
(d) Describe the concept of pivoting in the context of solving the system of
equations Ax = b by LU factorization.
(e) How does the algorithm need to be adjusted if in the process we encounter
a column with all entries equal to zero? What does it mean if there is a
column consisting entirely of zeros in the process?
(f ) How can sparsity be exploited in the LU factorization?
(g) Calculate the LU factorization with pivoting of the matrix
 
        A = ( 2  1  1  0
              4  3  3  1
              8  7  9  5
              6  7  9  8 ) .

Exercise 2.24. (a) Define the QR factorization of an n × n matrix A ex-


plaining what Q and R are. Show how the QR factorization can be used
to solve the system of equations Ax = b.
(b) How is the QR factorization defined, if A is an m × n (m ≠ n) rectangular
matrix? What does it mean if R is said to be in standard form?
(c) For an m × n matrix A and b ∈ Rm , explain how the QR factorization
can be used to solve the least squares problem of finding x∗ ∈ Rn such that

        ‖Ax∗ − b‖ = min_{x∈R^n} ‖Ax − b‖,

where the norm is the Euclidean distance ‖v‖ = √( Σ_{i=1}^{m} |v_i|^2 ).
(d) Define a Householder reflection H in general and prove that H is an
orthogonal matrix.
(e) Find a Householder reflection H, such that for
 
        A = ( 2   4
              1  −1
              2   1 )

the first column of HA is a multiple of the first standard unit vector and
calculate HA.
(f ) Having found H in the previous part, calculate Hb for
 
        b = ( 1
              5
              1 ) .

(g) Using the results of the previous two parts, find the x ∈ R2 which mini-
mizes kAx − bk and calculate the minimum.
Exercise 2.25. (a) Explain the technique of splitting for solving the linear
system Ax = b iteratively where A is an n × n, non-singular matrix.
Define the iteration matrix H and state the property it has to satisfy to
ensure convergence.
(b) Define what it means for a matrix to be positive definite. Show that all
diagonal elements of a positive definite matrix are positive.

(c) State the Householder-John theorem and explain how it can be used to
design iterative methods for solving Ax = b.
(d) Let the iteration matrix H have a real eigenvector v with real eigenvalue
λ. Show that the condition of the Householder-John theorem implies that
|λ| < 1.

(e) We write A in the form A = L + D + U , where L is the subdiagonal (or


strictly lower triangular), D is the diagonal and U is the superdiagonal
(or strictly upper triangular) portion of A. The following iterative scheme
is suggested

(L + ωD)x(k+1) = −[(1 − ω)D + U ]x(k) + b.

Using the Householder-John theorem, for which values of ω does the


scheme converge in the case when A is symmetric and positive definite?
Exercise 2.26. Let A be an n × n matrix which is symmetric and positive
definite and let b ∈ Rn .

(a) Explain why solving Ax = b is equivalent to minimizing the quadratic


function
        F(x) = (1/2) x^T Ax − x^T b.
By considering x∗ + v where x∗ denotes the vector where F takes its ex-
tremum, show why the extremum has to be a minimum.
(b) Having calculated x(k) in the k th iteration, the descent methods pick a
search direction d(k) in the next iteration which satisfies the descent condi-
tion. Define the descent condition.

(c) The next approximation x(k+1) is calculated as x(k+1) = x(k) + ω (k) d(k) .
Derive an expression for ω (k) using the gradient g(k) = ∇F (x(k) ) =
Ax(k) − b.
(d) Derive an expression for the new gradient g(k+1) and a relationship be-
tween it and the search direction d(k) .

(e) Explain how the search direction d(k) is chosen in the steepest descent
method and give a motivation for this choice.
(f ) Define the concept of conjugacy.

(g) How are the search directions chosen in the conjugate gradient method?
Derive an explicit formula for the search direction d(k) stating the conju-
gacy condition in the process. What is d(0) ?
(h) Which property do the gradients in the conjugate gradient method satisfy?

(i) Perform the conjugate gradient method to solve the system


   
        (  2  −1 )        ( 1 )
        ( −1   2 )  x  =  ( 0 )

starting from x(0) = 0.

Exercise 2.27. (a) Use Gaussian elimination with backwards substitution to


solve the linear system:

5x1 + 10x2 + 9x3 = 4


10x1 + 26x2 + 26x3 = 10
15x1 + 54x2 + 66x3 = 27

(b) How is the LU factorization defined, if A is an n × n square matrix, and


how can it be used to solve the system of equations Ax = b?
(c) Describe the algorithm to obtain an LU factorization.
(d) By which factor does the number of operations increase to obtain an LU
factorization if n is increased by a factor of 10?

(e) What needs to be done if during Gaussian elimination or LU factorization


a zero entry is encountered on the diagonal? Distinguish two different
cases.
(f ) Describe scaled and total pivoting. Explain why it is necessary under cer-
tain circumstances.

(g) Perform an LU factorization on the matrix arising from the system of


equations given in (a).
Exercise 2.28. (a) Define an n × n Householder reflection H in general and
prove that H is a symmetric and orthogonal matrix.

(b) For a general, non-zero vector v ∈ Rn , describe the construction of a


Householder reflection which transforms v into a multiple of the first unit
vector e1 = (1, 0, . . . , 0)T .

(c) For v = (0, 1, −1)T calculate H and Hv such that the last two components
of Hv are zero.
(d) Let A be an n × n real matrix with a real eigenvector v ∈ Rn with real
eigenvalue λ. Explain how a similarity transformation can be used to
obtain an (n − 1) × (n − 1) matrix B whose n − 1 eigenvalues are the other
n − 1 eigenvalues of A.
(e) The matrix

        [ 1  1  1 ]
    A = [ 1  1  2 ]
        [ 1  2  1 ]
has the eigenvector v = (0, 1, −1)T with eigenvalue −1. Using the H ob-
tained in (c) calculate HAH T and thus calculate two more eigenvalues.
Exercise 2.29. (a) Explain the technique of splitting for solving the linear
system Ax = b iteratively where A is an n × n, non-singular matrix.
Define the iteration matrix H and state the property it has to satisfy to
ensure convergence.
(b) Define the Gauss–Seidel and Jacobi iterations and state their iteration
matrices, respectively.
(c) Let

        [   2     √3/2   1/2  ]
    A = [  √3/2    2     √3/2 ] .
        [  1/2    √3/2    2   ]
Derive the iteration matrix for the Jacobi iterations and state the eigen-
value equation. Check that the numbers −3/4, 1/4, 1/2 satisfy the eigen-
value equation and thus are the eigenvalues of the iteration matrix.

(d) The matrix given in (c) is positive definite. State the Householder–John
theorem and apply it to show that the Gauss–Seidel iterations for this
matrix converge.
(e) Describe relaxation and show how the iteration matrix Hω of the re-
laxed method is related to the iteration matrix H of the original method
and thus how the eigenvalues are related. How should ω be chosen?
(f ) For the eigenvalues given in (c) calculate the best choice of ω and the
eigenvalues of the relaxed method.
CHAPTER 3

Interpolation and
Approximation Theory

Interpolation describes the problem of finding a curve (called the interpolant)


that passes through a given set of real values f0 , f1 , . . . , fn at real data points
x0 , x1 , . . . , xn , which are sometimes called abscissae or nodes. Different forms
of interpolant exist. The theory of interpolation is important as a basis for
numerical integration known also as quadrature. Approximation theory on the
other hand seeks an approximation such that an error norm is minimized.

3.1 Lagrange Form of Polynomial Interpolation


The simplest case is to find a straight line

p(x) = a1 x + a0

through a pair of points given by (x0 , f0 ) and (x1 , f1 ). This means solving 2
equations, one for each data point. Thus we have 2 degrees of freedom. For a
quadratic curve there are 3 degrees of freedom, fitting a cubic curve we have
4 degrees of freedom, etc.
Let Pn [x] denote the linear space of all real polynomials of degree at most
n. Each p ∈ Pn [x] is uniquely defined by its n + 1 coefficients. This gives
n + 1 degrees of freedom, while interpolating x0 , x1 , . . . , xn gives rise to n + 1
conditions.
As we have mentioned above, in determining the polynomial interpolant
we can solve a linear system of equations. However, this can be done more
easily.
Definition 3.1 (Lagrange cardinal polynomials). These are given by
Lk (x) := ∏_{l=0, l≠k}^{n} (x − xl )/(xk − xl ),   x ∈ R.

Note that the Lagrange cardinal polynomials lie in Pn [x]. It is easy to


verify that Lk (xk ) = 1 and Lk (xj ) = 0 for j ≠ k. The interpolant is then
given by the Lagrange formula
p(x) = ∑_{k=0}^{n} fk Lk (x) = ∑_{k=0}^{n} fk ∏_{l=0, l≠k}^{n} (x − xl )/(xk − xl ).
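The Lagrange formula translates directly into code. The following function is a small sketch in the spirit of the book's listings (it is not one of them; the name LagrangeEval is chosen here for illustration) which evaluates the Lagrange form at an array of points x.

function [y] = LagrangeEval(inter, f, x)
% Sketch: evaluates the Lagrange form of the interpolating polynomial
% inter vector of pairwise distinct interpolation points x_0,...,x_n
% f     vector of the corresponding function values f_0,...,f_n
% x     array of evaluation points
n = length(inter);
y = zeros(size(x));
for k = 1:n
    L = ones(size(x));      % build the kth cardinal polynomial L_k(x)
    for l = 1:n
        if l ~= k
            L = L .* (x - inter(l)) / (inter(k) - inter(l));
        end
    end
    y = y + f(k) * L;       % add the term f_k L_k(x) of the Lagrange formula
end
end

Note that this costs O(n²) operations per evaluation point; the Newton form of Section 3.2 reduces this to O(n) once the divided differences are known.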

Exercise 3.1. Let the function values f (0), f (1), f (2), and f (3) be given. We
want to estimate

f (−1),   f ′(1)   and   ∫_0^3 f (x) dx.

To this end, we let p be the cubic polynomial that interpolates these function
values, and then approximate by

p(−1),   p′(1)   and   ∫_0^3 p(x) dx.

Using the Lagrange formula, show that every approximation is a linear com-
bination of the function values with constant coefficients and calculate these
coefficients. Show that the approximations are exact if f is any cubic polyno-
mial.
Lemma 3.1 (Uniqueness). The polynomial interpolant is unique.
Proof. Suppose that two polynomials p, q ∈ Pn [x] satisfy p(xi ) = q(xi ) = fi ,
i = 0, . . . , n. Then the nth degree polynomial p − q vanishes at n + 1 distinct
points. However, the only nth degree polynomial with n + 1 or more zeros is
the zero polynomial. Therefore p = q.
Exercise 3.2 (Birkhoff–Hermite interpolation). Let a, b, and c be distinct
real numbers, and let f (a), f (b), f 0 (a), f 0 (b), and f 0 (c) be given. Because there
are five function values, a possibility is to approximate f by a polynomial of
degree at most four that interpolates the function values. Show by a general
argument that this interpolation problem has a solution and that the solution
is unique if and only if there is no nonzero polynomial p ∈ P4 [x] that satisfies
p(a) = p(b) = p′(a) = p′(b) = p′(c) = 0. Hence, given a and b, show that there
exists a possible value of c ≠ a, b such that there is no unique solution.
Let [a, b] be a closed interval of R. C[a, b] is the space of continuous func-
tions from [a, b] to R and we denote by C k [a, b] the set of such functions which
have continuous k th derivatives.
Theorem 3.1 (The error of polynomial interpolation). Given f ∈ C n+1 [a, b]
and f (xi ) = fi , where x0 , . . . , xn are pairwise distinct, let p ∈ Pn [x] be the
interpolating polynomial. Then for every x ∈ [a, b], there exists ξ ∈ [a, b] such
that
f (x) − p(x) = (1/(n + 1)!) f^{(n+1)}(ξ) ∏_{i=0}^{n} (x − xi ).                  (3.1)

Proof. Obviously, the formula (3.1) is true when x = xj for j = 0, . . . , n, since


both sides of the equation vanish. Let x be any other fixed point in the interval
and for t ∈ [a, b] let
φ(t) := [f (t) − p(t)] ∏_{i=0}^{n} (x − xi ) − [f (x) − p(x)] ∏_{i=0}^{n} (t − xi ).

For t = xj the first term vanishes, since f (xj ) = fj = p(xj ), and by construc-
tion the product in the second term vanishes. We also have φ(x) = 0, since
then the two terms cancel. Hence φ has at least n + 2 distinct zeros in [a, b].
By Rolle’s theorem if a function with continuous derivative vanishes at two
distinct points, then its derivative vanishes at an intermediate point. Since
φ ∈ C^{n+1}[a, b], we can deduce that φ′ vanishes at n + 1 distinct points in
[a, b]. Applying Rolle again, we see that φ″ vanishes at n distinct points in
[a, b]. By induction, φ^{(n+1)} vanishes once, say at ξ ∈ [a, b]. Since p is an nth
degree polynomial, we have p^{(n+1)} ≡ 0. On the other hand,

d^{n+1}/dt^{n+1} ∏_{i=0}^{n} (t − xi ) = (n + 1)!.

Hence

0 = φ^{(n+1)}(ξ) = f^{(n+1)}(ξ) ∏_{i=0}^{n} (x − xi ) − [f (x) − p(x)](n + 1)!

and the result follows.


This gives us an expression to estimate the error. Runge gives an example
where a polynomial interpolation to an apparently well-behaved function is
not suitable. Runge’s example is
R(x) = 1/(1 + 25x²)   on [−1, 1].

It behaves like a polynomial near the centre x = 0,

R(x) = 1 − 25x² + O(x⁴),

but it behaves like a quadratic hyperbola near the endpoints x = ±1,

R(x) = 1/(25x²) + O(1/(25x²)²).
Thus a polynomial interpolant will perform badly at the endpoints, where
the behaviour is not like a polynomial. Figure 3.1 illustrates this when using
equally spaced interpolation points. Adding more interpolation points actually
makes the largest error worse. The growth in the error is explained by the
product in (3.1). One can help matters by clustering points towards the ends
of the interval by letting x_k = cos((π/2)·(2k + 1)/(n + 1)), k = 0, . . . , n. These are the
so-called Chebyshev points or nodes.

Figure 3.1 Interpolation of Runge's example with polynomials of degree 8 (top) and degree 16 (bottom)
These Chebyshev points are the zeros of the (n + 1)th Chebyshev polyno-
mial , which is defined on the interval [−1, 1] by the trigonometric definition
Tn+1 (x) = cos((n + 1) arccos x), n ≥ −1.
From this definition it is easy to see that the roots of Tn+1 are
x_k = cos((π/2)·(2k + 1)/(n + 1)) ∈ (−1, 1), k = 0, . . . , n. Moreover, all the extrema of Chebyshev
polynomials have values of either −1 or 1, two of which lie at the endpoints

Tn+1 (1) = 1,
Tn+1 (−1) = (−1)^{n+1} .

In between we have extrema at x_j = cos(jπ/(n + 1)), j = 1, . . . , n. Figure 3.2 shows
the Chebyshev polynomial for n = 9. It is trivial to show that the Chebyshev
polynomials satisfy the recurrence relation
T0 (x) = 1
T1 (x) = x
Tn+1 (x) = 2xTn (x) − Tn−1 (x).
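Both the Chebyshev points and the recurrence are easily generated numerically. The lines below are a sketch (not a listing from the book; the name ChebyshevSketch is illustrative) returning the Chebyshev points for a given n and the values of Tn+1 at an array of points by the three-term recurrence.

function [xk, T] = ChebyshevSketch(n, x)
% Sketch: Chebyshev points and evaluation of T_{n+1} by the recurrence
% n  the points returned are the n+1 zeros of T_{n+1}
% x  array of points in [-1,1] at which T_{n+1} is evaluated
k = (0:n)';
xk = cos(pi/2 * (2*k+1) / (n+1));   % Chebyshev points
Told = ones(size(x));               % T_0(x) = 1
T = x;                              % T_1(x) = x
for j = 1:n
    Tnew = 2*x.*T - Told;           % T_{j+1}(x) = 2x T_j(x) - T_{j-1}(x)
    Told = T;
    T = Tnew;
end
end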

Figure 3.2 The 10th Chebyshev Polynomial

Lemma 3.2. The maximum absolute value of ∏_{i=0}^{n} (x − xi ) on [−1, 1] is
minimal if it is the normalized (n + 1)th Chebyshev polynomial, i.e., ∏_{i=0}^{n} (x − xi ) = 2^{−n} Tn+1 (x). The maximum absolute value is then 2^{−n} .
Proof. ∏_{i=0}^{n} (x − xi ) describes an (n + 1)th degree polynomial with leading
coefficient 1. From the recurrence relation we see that the leading coefficient of
Tn+1 is 2^n and thus 2^{−n} Tn+1 has leading coefficient one. Let p be a polynomial
of degree n + 1 with leading coefficient 1 with maximum absolute value m <
2^{−n} on [−1, 1]. Tn+1 has n + 2 extreme points. At these points we have |p(x)| ≤
m < |2^{−n} Tn+1 (x)|. Moreover, for x = cos(2kπ/(n + 1)), 0 ≤ 2k ≤ n + 1, where Tn+1
has a maximum, we have

2^{−n} Tn+1 (x) − p(x) ≥ 2^{−n} Tn+1 (x) − m > 0,

while for x = cos((2k + 1)π/(n + 1)), 0 ≤ 2k + 1 ≤ n + 1, where Tn+1 has a minimum,

2^{−n} Tn+1 (x) − p(x) ≤ 2^{−n} Tn+1 (x) + m < 0.

Thus the function 2^{−n} Tn+1 (x) − p(x) changes sign between the points cos(kπ/(n + 1)),
0 ≤ k ≤ n + 1. From the intermediate value theorem 2^{−n} Tn+1 − p has at least
n + 1 roots. However, this is impossible, since 2^{−n} Tn+1 − p is a polynomial of
degree n, since the leading coefficients cancel. Thus we have a contradiction
and 2^{−n} Tn+1 (x) gives the minimal value.
For a general interval [a, b] the Chebyshev points are
xk = (a + b + (b − a) cos((π/2)·(2k + 1)/(n + 1)))/2.
Another interesting fact about Chebyshev polynomials is that they form
a set of polynomials which are orthogonal with respect to the weight function
(1 − x2 )−1/2 on (−1, 1). More specifically we have

∫_{−1}^{1} Tm (x)Tn (x) (1 − x²)^{−1/2} dx =  { π,    m = n = 0,
                                               { π/2,  m = n ≥ 1,
                                               { 0,    m ≠ n.

We will study orthogonal polynomials in much more detail when considering


polynomial best approximations. Because of the orthogonality every polyno-
mial p of degree n can be expressed as
p(x) = ∑_{k=0}^{n} p̌k Tk (x),

where the coefficients p̌k are given by

p̌0 = (1/π) ∫_{−1}^{1} p(x)(1 − x²)^{−1/2} dx,    p̌k = (2/π) ∫_{−1}^{1} p(x)Tk (x)(1 − x²)^{−1/2} dx,   k = 1, . . . , n.
We will encounter Chebyshev polynomials also again in spectral methods for
the solution of partial differential equations.

3.2 Newton Form of Polynomial Interpolation


Another way to describe the polynomial interpolant was introduced by New-
ton. First we need to introduce some concepts, however.

Definition 3.2 (Divided difference). Given pairwise distinct points


x0 , x1 , . . . , xn ∈ [a, b], let p ∈ Pn [x] interpolate f ∈ C n [a, b] at these points.
The coefficient of xn in p is called the divided difference of degree n and de-
noted by f [x0 , x1 , . . . , xn ].

From the Lagrange formula we see that


f [x0 , x1 , . . . , xn ] = ∑_{k=0}^{n} f (xk ) ∏_{l=0, l≠k}^{n} 1/(xk − xl ).

Theorem 3.2. There exists ξ ∈ [a, b] such that


f [x0 , x1 , . . . , xn ] = (1/n!) f^{(n)}(ξ).
Proof. Let p be the polynomial interpolant. The difference f − p has at least
n + 1 zeros. By applying Rolle’s theorem n times, it follows that the nth
derivative f (n) − p(n) is zero at some ξ ∈ [a, b]. Since p is of degree n, p(n) is
constant, say c, and we have f^{(n)}(ξ) = c. On the other hand the coefficient of
x^n in p is given by (1/n!)c, since the nth derivative of x^n is n!. Hence we have

f [x0 , x1 , . . . , xn ] = (1/n!)c = (1/n!) f^{(n)}(ξ).

Thus, divided differences can be used to approximate derivatives.



Exercise 3.3. Let f be a real valued function and let p be the polynomial of de-
gree at most n that interpolates f at the pairwise distinct points x0 , x1 , . . . , xn .
Furthermore, let x be any real number that is not an interpolation point. De-
duce for the error at x
f (x) − p(x) = f [x0 , . . . , xn , x] ∏_{j=0}^{n} (x − xj ).

(Hint: Use the definition for the divided difference f [x0 , . . . , xn , x].)

We might ask what the divided difference of degree zero is. It is the co-
efficient of the zero degree interpolating polynomial, i.e., a constant. Hence
f [xi ] = f (xi ). Using the formula for linear interpolation between two points
(xi , f (xi )) and (xj , f (xj )), the interpolating polynomial is given by
   
p(x) = f (xi ) (x − xj )/(xi − xj ) + f (xj ) (x − xi )/(xj − xi )

and thus we obtain


f [xi , xj ] = (f [xi ] − f [xj ])/(xi − xj ).
More generally, we have the following theorem
Theorem 3.3. Let x0 , x1 , . . . , xk+1 be pairwise distinct points, where k ≥ 0.
Then
f [x0 , x1 , . . . , xk , xk+1 ] = (f [x1 , . . . , xk+1 ] − f [x0 , . . . , xk ])/(xk+1 − x0 ).
Proof. Let p, q ∈ Pk [x] be the polynomials that interpolate f at x0 , . . . , xk
and x1 , . . . , xk+1 respectively. Let

r(x) := [(x − x0 )q(x) + (xk+1 − x)p(x)] / (xk+1 − x0 ) ∈ Pk+1 [x].

It can be easily seen that r(xi ) = f (xi ) for i = 0, . . . , k + 1. Hence r is the


unique interpolating polynomial of degree k + 1 and the coefficient of xk+1 in
r is given by the formula in the theorem.

This recursive formula gives a fast way to calculate the divided difference
table shown in Figure 3.3. This requires O(n2 ) operations and calculates the
numbers f [x0 , . . . xl ] for l = 0, . . . , n. These are needed for the alternative
representation of the interpolating polynomial.
Exercise 3.4. Implement a routine calculating the divided difference table.

Theorem 3.4 (Newton Interpolation Formula). Let x0 , x1 , . . . , xn be pairwise



f [x0 ]
          f [x0 , x1 ]
f [x1 ]                  f [x0 , x1 , x2 ]
          f [x1 , x2 ]                        . . .
f [x2 ]                      . . .                      f [x0 , . . . , xn ]
  .                          . . .
  .                      f [xn−2 , xn−1 , xn ]
          f [xn−1 , xn ]
f [xn ]

Figure 3.3 Divided difference table

distinct. The polynomial


pn (x) := f [x0 ] + f [x0 , x1 ](x − x0 ) + · · · + f [x0 , . . . , xn−1 ] ∏_{i=0}^{n−2} (x − xi )
          + f [x0 , . . . , xn ] ∏_{i=0}^{n−1} (x − xi ) ∈ Pn [x]

satisfies pn (xi ) = f (xi ), i = 0, 1, . . . , n.


Proof. The proof is done by induction on n. The statement is obvious for
n = 0 since the interpolating polynomial is the constant f (x0 ). Suppose that
the assertion is true for n. Let p ∈ Pn+1 [x] be the interpolating polynomial
at x0 , . . . , xn+1 . The difference p − pn vanishes at xi for i = 0, . . . , n and
hence it is a multiple of ∏_{i=0}^{n} (x − xi ). By definition of divided differences the
coefficient of x^{n+1} in p is f [x0 , . . . , xn+1 ]. Since pn ∈ Pn [x], it follows that
p(x) − pn (x) = f [x0 , . . . , xn+1 ] ∏_{i=0}^{n} (x − xi ). The explicit form of pn+1 follows
by adding pn to p − pn .
Here is a MATLAB implementation of evaluating the interpolating poly-
nomial given the interpolation points and divided difference table.

function [y] = NewtonEval(inter, d, x)
% Calculates the values of the polynomial in Newton form at the
% points given in x
% inter input parameter, the interpolation points
% d input parameter, divided differences prescribing the polynomial

[n,m] = size(inter);   % finding the size of inter
[k,l] = size(d);       % finding the size of d
if m~=1 || l~=1
    disp('inputs need to be column vectors');
    return;
end
if n~=k
    disp('input dimensions do not agree');
    return;
end
m = size(x);                    % size of the array of evaluation points
y = d(1) * ones(m);             % first term of the sum in the Newton form
temp = x - inter(1) * ones(m);  % temp holds the product of the factors (x - x_i)
% Note that y and temp are arrays
for i=2:n  % add the terms of the sum in the Newton form one after the other
    y = y + d(i) * temp;
    temp = temp .* (x - inter(i));  % Note: .* is element-wise multiplication
end
end
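The divided differences d which NewtonEval expects can be generated with the recurrence of Theorem 3.3. A possible routine, in the spirit of Exercise 3.4, is sketched below (it is not the book's solution; the name DividedDifferences is illustrative).

function [d] = DividedDifferences(inter, f)
% Sketch: computes f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n] from the data
% inter vector of pairwise distinct interpolation points
% f     vector of the corresponding function values
inter = inter(:); T = f(:);   % work with column vectors
n = length(inter);
d = zeros(n,1);
d(1) = T(1);                  % f[x_0]
for l = 2:n                   % build the divided difference table column by column
    % after this step T(i) holds the divided difference of the l consecutive
    % points starting at the ith interpolation point
    T = (T(2:end) - T(1:end-1)) ./ (inter(l:n) - inter(1:n-l+1));
    d(l) = T(1);              % top entry of the column, f[x_0,...,x_{l-1}]
end
end

A polynomial interpolant can then be evaluated by, for example, y = NewtonEval(inter, DividedDifferences(inter, f), x).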

Exercise 3.5. Given f, g ∈ C[a, b], let h := f g. Prove by induction that the
divided differences of h satisfy the relation
h[x0 , . . . , xn ] = ∑_{j=0}^{n} f [x0 , . . . , xj ] g[xj , . . . , xn ].

By using the representation as derivatives of the differences and by letting the


points x0 , . . . , xn coincide, deduce the Leibniz formula for the nth derivative
of a product of two functions.
The Newton interpolation formula has several advantages over the La-
grange formula. Provided that the divided differences are known, it can be
evaluated at a given point x in just O(n) operations as long as we employ the
nested multiplication as in the Horner scheme

pn (x) = {· · · {{f [x0 , . . . , xn ](x − xn−1 ) + f [x0 , . . . , xn−1 ]} × (x − xn−2 )


+f [x0 , . . . , xn−2 ]} × (x − xn−3 ) + · · · } + f [x0 ].

The MATLAB implementation using the Horner scheme is as follows

function [y] = NewtonHorner(inter, d, x )
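% Only the signature of this listing survives in this copy; the body below is a
% reconstruction sketch of the nested multiplication described above and is not
% necessarily identical to the listing available on the book's web page.
% inter input parameter, the interpolation points
% d input parameter, divided differences prescribing the polynomial
% x points at which the polynomial is evaluated
n = length(inter);
y = d(n) * ones(size(x));        % innermost bracket of the Horner scheme
for i = n-1:-1:1                 % work outwards through the brackets
    y = y .* (x - inter(i)) + d(i);
end
end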

However, both representations of the interpolating polynomials have their


advantages. For example the Lagrange formula is often better when the in-
terpolating polynomial is part of a larger mathematical expression such as in
Gaussian quadrature.
If x0 , . . . , xn are equally spaced and arranged consecutively, letting h =
xi+1 − xi for each i = 0, . . . , n − 1, we can rewrite the Newton formula for

x = x0 + sh. Then x − xi = (s − i)h, and we have


pn (x0 + sh) = f [x0 ] + f [x0 , x1 ]sh + · · · + f [x0 , . . . , xn−1 ] ∏_{i=0}^{n−2} (s − i) h^{n−1}
               + f [x0 , . . . , xn ] ∏_{i=0}^{n−1} (s − i) h^n .

This is called the Newton Forward Divided Difference Formula.


Defining the finite difference

∆f (x0 ) := f (x1 ) − f (x0 )

we see that f [x0 , x1 ] = h−1 ∆f (x0 ). We will encounter finite differences again
when deriving solutions to partial differential equations. Since

∆2 f (x0 ) = ∆f (x1 ) − ∆f (x0 ) = hf [x1 , x2 ] − hf [x0 , x1 ] = 2h2 f [x0 , x1 , x2 ],

where we use the recurrence formula for divided differences, we can deduce by
induction that
∆j f (x0 ) = j!hj f [x0 , . . . , xj ].
Hence we can rewrite the Newton formula as
pn (x0 + sh) = f (x0 ) + ∑_{j=1}^{n} (1/j!) ∏_{i=0}^{j−1} (s − i) ∆^j f (x0 ) = f (x0 ) + ∑_{j=1}^{n} (s choose j) ∆^j f (x0 ).

In this form the formula looks suspiciously like a finite analog of the Taylor
expansion. The Taylor expansion tells where a function will go based on the
values of the function and its derivatives (its rate of change and the rate of
change of its rate of change, etc.) at one given point x. Newton's formula is
based on finite differences instead of instantaneous rates of change.
If the points are reordered as xn , . . . , x0 , we can again rewrite the Newton
formula for x = xn + sh. Then x − xi = (s + n − i)h, since xi = xn − (n − i)h,
and we have
pn (xn + sh) = f [xn ] + f [xn , xn−1 ]sh + · · · + f [xn , . . . , x1 ] ∏_{i=0}^{n−2} (s + i) h^{n−1}
               + f [xn , . . . , x0 ] ∏_{i=0}^{n−1} (s + i) h^n .

This is called the Newton Backward Divided Difference Formula.


The degree of an interpolating polynomial can be increased by adding
more points and more terms. In Newton’s form points can simply be added
at one end. Newton’s forward formula can add new points to the right, and
Newton’s backward formula can add new points to the left. The accuracy of

an interpolation polynomial depends on how far the point of interest is from


the middle of the interpolation points used. Since points are only added at
the same end, the accuracy only increases at that end. There are formulae by
Gauss, Stirling, and Bessel to remedy this problem (see for example [17] A.
Ralston, P. Rabinowitz First Course in Numerical Analysis).
The Newton form of the interpolating polynomial can be viewed as one
of a class of methods for generating successively higher order interpolation
polynomials. These are known as iterated linear interpolation and are iterative.
They are all based on the following lemma.

Lemma 3.3. Let xi1 , xi2 , . . . , xim be m distinct points and denote by
pi1 ,i2 ,...,im the polynomial of degree m − 1 satisfying

pi1 ,i2 ,...,im (xiν ) = f (xiν ), ν = 1, . . . , m.

Then for n ≥ 0, if xj , xk and xiν , ν = 1, . . . , n, are any n + 2 distinct points,


then
p_{i1 ,i2 ,...,in ,j,k}(x) = [(x − xk ) p_{i1 ,i2 ,...,in ,j}(x) − (x − xj ) p_{i1 ,i2 ,...,in ,k}(x)] / (xj − xk ).

Proof. For n = 0, there are no additional points xiν and pj (x) ≡ f (xj ) and
pk (x) ≡ f (xk ). It can be easily seen that the right-hand side defines a poly-
nomial of degree at most n + 1, which takes the values f (xiν ) at xiν for
ν = 1, . . . , n, f (xj ) at xj and f (xk ) at xk . Hence the right-hand side is the
unique interpolation polynomial.
We see that with iterated linear interpolation, points can be added
anywhere. The variety of methods differ in the order in which the values
(xj , f (xj )) are employed. For many applications, additional function values
are generated on the fly and thus their number is not known in advance. For
such cases we may always employ the latest pair of values as in

x0 f (x0 )
x1 f (x1 ) p0,1 (x)
x2 f (x2 ) p1,2 (x) p0,1,2 (x)
.. .. .. ..
. . . .
xn f (xn ) pn−1,n (x) pn−2,n−1,n (x) · · · p0,1,...,n (x).

The rows are computed sequentially. Any p··· (x) is calculated using the two
quantities to its left and diagonally above it. To determine the (n + 1)th row
only the nth row needs to be known. As more points are generated, rows of
greater lengths need to be calculated and stored. The algorithm stops when-
ever
|p0,1,...,n (x) − p0,1,...,n−1 (x)| < ε, for some prescribed tolerance ε > 0.
This scheme is known as Neville’s iterated interpolation.
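A compact implementation overwrites a single row of the tableau in place. The function below is a sketch of Neville's scheme for a scalar evaluation point x (not a listing from the book; the name NevilleEval is illustrative).

function [p] = NevilleEval(inter, f, x)
% Sketch of Neville's iterated interpolation at a single point x
% inter vector of pairwise distinct interpolation points
% f     vector of the corresponding function values
n = length(inter);
Q = f(:);              % Q(i) starts as the constant interpolant through (x_i, f(x_i))
for l = 1:n-1          % l+1 is the number of points in each index range
    for i = n:-1:l+1   % overwrite from the bottom so previous-level values survive
        Q(i) = ((x - inter(i-l))*Q(i) - (x - inter(i))*Q(i-1)) / (inter(i) - inter(i-l));
    end
end
p = Q(n);              % value of the full interpolating polynomial at x
end

In the adaptive setting described above, where points arrive one at a time, one would instead store the previous row and stop once the last two computed values agree to within ε.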

An alternative scheme is Aitken’s iterated interpolation given by


x0 f (x0 )
x1 f (x1 ) p0,1 (x)
x2 f (x2 ) p0,2 (x) p0,1,2 (x)
.. .. .. ..
. . . .
xn f (xn ) p0,n (x) p0,1,n (x) · · · p0,1,...,n (x).
The basic difference between the two methods is that in Aitken’s method the
interpolants on the row with xk use points with subscripts nearest 0, while in
Neville’s they use points with subscripts nearest to k.

3.3 Polynomial Best Approximations


We now turn our attention to best approximations where best is defined by
a norm (possibly introduced by a scalar product) which we try to minimize.
Recall that a scalar or inner product is any function V × V → R, where V is
a real vector space, subject to the three axioms
Symmetry:
hx, yi = hy, xi for all x, y ∈ V,
Non-negativity:
hx, xi ≥ 0 for all x ∈ V and hx, xi = 0 if and only if x = 0, and
Linearity:
hax + by, zi = ahx, zi + bhy, zi for all x, y, z ∈ V, a, b ∈ R.
We already encountered the vector space Rn and its scalar product with the
QR factorization of matrices. Another example of a vector space is the space
of polynomials of degree n, Pn [x], but no scalar product has been defined for
it so far.
Once a scalar product is defined, we can define orthogonality: x, y ∈ V are
orthogonal if hx, yi = 0. A norm can be defined by
kxk = √(hx, xi),   x ∈ V.

For V = C[a, b], the space of all continuous functions on the interval [a, b],
we can define a scalar product using a fixed positive function w ∈ C[a, b], the
weight function, in the following way
hf, gi := ∫_a^b w(x)f (x)g(x) dx   for all f, g ∈ C[a, b].

All three axioms are easily verified for this scalar product. The associated
norm is

kf k2 = √(hf, f i) = ( ∫_a^b w(x)f (x)² dx )^{1/2} .

For w(x) ≡ 1 this is known as the L2 -norm. Note that Pn is a subspace of


C[a, b].
Generally Lp norms are defined by
kf kp = ( ∫_a^b |f (x)|^p dx )^{1/p} .

The vector space of functions for which this integral exists is denoted by Lp .
Unless p = 2, this is a normed space, but not an inner product space, because
this norm does not satisfy the parallelogram equality given by

2kf k2p + 2kgk2p = kf + gk2p + kf − gk2p

required for a norm to have an associated inner product.


Let g be an approximation to the function f . Often g is chosen to lie in
a certain subspace, for example the space of polynomials of a certain degree.
The best approximation chooses g such that the norm of the error kf − gk is
minimized. Different choices of norm give different approximations.
For p → ∞ the norm becomes

kf k∞ = max_{x∈[a,b]} |f (x)|.

We actually have already seen the best L∞ approximation from Pn [x]. It is


the interpolating polynomial where the interpolation points are chosen to be
the Chebyshev points. This is why the best approximation with regards to the
L∞ norm is sometimes called the Chebyshev approximation. It is also known
as Minimax approximation, since the problem can be rephrased as finding g
such that
min_g max_{x∈[a,b]} |f (x) − g(x)|.

3.4 Orthogonal polynomials


Given an inner product, we say that pn ∈ Pn [x] is the nth orthogonal polyno-
mial if hpn , pi = 0 for all p ∈ Pn−1 [x]. We have already seen the orthogonal
polynomials with regards to the weight function (1 − x2 )−1/2 . These are the
Chebyshev polynomials. Different scalar products lead to different orthogo-
nal polynomials. Orthogonal polynomials stay orthogonal if multiplied by a
constant. We therefore introduce a normalization by requiring the leading co-
efficient to equal one. These polynomials are called monic, the noun being
monicity.

Theorem 3.5. For every n ≥ 0 there exists a unique monic orthogonal poly-
nomial pn of degree n. Any p ∈ Pn [x] can be expressed as a linear combination
of p0 , p1 , . . . pn .

Proof. We first consider uniqueness. Assume that there are two monic orthog-
onal polynomials pn , p̃n ∈ Pn [x]. Let p = pn − p̃n which is of degree n − 1,
since the xn terms cancel, because in both polynomials the leading coefficient
is 1. By definition of orthogonal polynomials, hpn , pi = 0 = hp̃n , pi. Thus we
can write
0 = hpn , pi − hp̃n , pi = hpn − p̃n , pi = hp, pi,
and hence p ≡ 0.
We provide a constructive proof for the existence by induction on n. We
let p0 (x) ≡ 1. Assume that p0 , p1 , . . . , pn have already been constructed, con-
sistent with both statements of the theorem. Let q(x) := xn+1 ∈ Pn+1 [x].
Following the Gram–Schmidt algorithm, we construct
pn+1 (x) = q(x) − ∑_{k=0}^{n} (hq, pk i / hpk , pk i) pk (x).

It is of degree n + 1 and it is monic, since all the terms in the sum are of
degree less than or equal to n. Let m ∈ 0, 1, . . . , n.
hpn+1 , pm i = hq, pm i − ∑_{k=0}^{n} (hq, pk i / hpk , pk i) hpk , pm i = hq, pm i − (hq, pm i / hpm , pm i) hpm , pm i = 0.

Hence pn+1 is orthogonal to p0 , . . . , pn and consequently to all p ∈ Pn [x] due


to the second statement of the theorem.
Finally, to prove that any p ∈ Pn+1 [x] is a linear combination of
p0 . . . . , pn , pn+1 , we note that p can be written as p = cpn+1 + p̃, where c
is the coefficient of xn+1 in p and where p̃ ∈ Pn [x]. Due to the induction
hypothesis, p̃ is a linear combination of p0 , . . . , pn and hence p is a linear
combination of p0 , . . . , pn , pn+1 .
Well-known examples of orthogonal polynomials are (α, β > −1):

Name                        Notation      Interval       Weight function
Legendre                    Pn            [−1, 1]        w(x) ≡ 1
Jacobi                      Pn^{(α,β)}    (−1, 1)        w(x) = (1 − x)^α (1 + x)^β
Chebyshev (first kind)      Tn            (−1, 1)        w(x) = (1 − x²)^{−1/2}
Chebyshev (second kind)     Un            [−1, 1]        w(x) = (1 − x²)^{1/2}
Laguerre                    Ln            [0, ∞)         w(x) = e^{−x}
Hermite                     Hn            (−∞, ∞)        w(x) = e^{−x²}

Polynomials can be divided by each other. For example, the polynomial


x2 − 1 has zeros at ±1. Thus x − 1 and x + 1 are factors of x2 − 1 or written
as a division
(x² + 0x − 1) / (x − 1) = x + 1
 x² − x
 ------
     x − 1.

Exercise 3.6. The polynomial p(x) = x4 − x3 − x2 − x − 2 is zero for x = −1


and x = 2. Find two factors q1 (x) and q2 (x) and divide p(x) by those to
obtain a quadratic polynomial. Find two further zeros of p(x) as the zeros of
that quadratic polynomial.
Exercise 3.7. The functions p0 , p1 , p2 , . . . are generated by the Rodrigues
formula
pn (x) = e^x d^n/dx^n (x^n e^{−x}),   x ∈ R+ .
Show that these functions are polynomials and prove by integration by parts
that for every p ∈ Pn−1 [x] we have the orthogonality condition hpn , pi = 0
with respect to the scalar product given by
hf, gi := ∫_0^∞ e^{−x} f (x)g(x) dx.

Thus these polynomials are the Laguerre polynomials. Calculate p3 , p4 , and p5


from the Rodrigues formula.
The proof of Theorem 3.5 is constructive, but it suffers from loss of accu-
racy in the calculation of the inner products. With Chebyshev polynomials we
have seen that they satisfy a three-term recurrence relation. This is also true
for monic orthogonal polynomials of other weight functions, as the following
theorem shows.
Theorem 3.6. Monic orthogonal polynomials are given by the recurrence
relation
p−1 (x) ≡ 0,
p0 (x) ≡ 1,
pn+1 (x) = (x − αn )pn (x) − βn pn−1 (x),
where
αn = hpn , xpn i / hpn , pn i ,     βn = hpn , pn i / hpn−1 , pn−1 i > 0.
Proof. For n ≥ 0 let p(x) := pn+1 (x) − (x − αn )pn (x) + βn pn−1 (x). We will
show that p is actually zero. Since pn+1 and xpn are monic, it follows that
p ∈ Pn [x]. Furthermore,

hp, pl i = hpn+1 , pl i − h(x − αn )pn , pl i + βn hpn−1 , pl i = 0, l = 0, . . . , n − 2,

since from the definition of the inner product, h(x − αn )pn , pl i = hpn , (x −
αn )pl i, and since pn−1 , pn and pn+1 are orthogonal polynomials. Moreover,

hp, pn−1 i = hpn+1 , pn−1 i − h(x − αn )pn , pn−1 i + βn hpn−1 , pn−1 i


= −hpn , xpn−1 i + βn hpn−1 , pn−1 i.

Now because of monicity, xpn−1 = pn + q, where q ∈ Pn−1 [x], we have

hp, pn−1 i = −hpn , pn i + βn hpn−1 , pn−1 i = 0



due to the definition of βn . Finally,

hp, pn i = hpn+1 , pn i−h(x−αn )pn , pn i+βn hpn−1 , pn i = −hxpn , pn i+αn hpn , pn i = 0

from the definition of αn . It follows that p is orthogonal to p0 , . . . , pn which


form a basis of Pn [x] which is only possible for the zero polynomial. Hence
p ≡ 0 and the assertion is true.
Exercise 3.8. Continuing from the previous exercise, show that the coeffi-
cients of p3 , p4 and p5 are compatible with a three-term recurrence relation of
the form
p5 (x) = (γx − α)p4 (x) − βp3 (x), x ∈ R+ .

3.5 Least-Squares Polynomial Fitting


We now have the tools at our hands to find the best polynomial approxima-
tion p ∈ Pn [x], which minimizes hf − p, f − pi subject to a given inner prod-
uct hg, hi = ∫_a^b w(x)g(x)h(x) dx, where w(x) > 0 is the weight function. Let
p0 , p1 , . . . , pn be the orthogonal polynomials, pl ∈ Pl [x], which form a basis of
Pn [x]. Therefore there exist constants c0 , c1 , . . . , cn such that p = ∑_{j=0}^{n} cj pj .
Because of orthogonality the inner product simplifies to
hf − p, f − pi = hf − ∑_{j=0}^{n} cj pj , f − ∑_{j=0}^{n} cj pj i
               = hf, f i − 2 ∑_{j=0}^{n} cj hf, pj i + ∑_{j=0}^{n} cj² hpj , pj i.

This is a quadratic function in the cj s and we can minimize it to find optimal


values for the cj s. Differentiating with respect to cj gives


∂/∂cj hf − p, f − pi = −2hpj , f i + 2cj hpj , pj i,   j = 0, . . . , n.

Setting the gradient to zero, we obtain

cj = hpj , f i / hpj , pj i

and thus
p = ∑_{j=0}^{n} (hpj , f i / hpj , pj i) pj .

The value of the error norm is then


hf − p, f − pi = hf, f i − ∑_{j=0}^{n} hf, pj i² / hpj , pj i .

The coefficients cj , j = 0, . . . , n, are independent of n. Thus, more and


more terms can be added until hf − p, f − pi is below a given tolerance ε or,
in other words, until ∑_{j=0}^{n} hf, pj i² / hpj , pj i becomes close to hf, f i.
So far our analysis was concerned with f ∈ C[a, b] and the inner product
was defined by an integral over [a, b]. However, what if we only have discrete
function values f1 , . . . , fm , m > n at pairwise distinct points x1 , . . . , xm avail-
able?
An inner product can be defined in the following way
hg, hi := ∑_{j=1}^{m} g(xj )h(xj ).

We then seek p ∈ Pn [x] that minimizes hf − p, f − pi. A straightforward ap-
proach is to express p as ∑_{k=0}^{n} ck x^k and to find optimal values for c0 , . . . , cn .
An alternative approach, however, is to construct orthogonal polynomials
p0 , p1 , . . . with regards to this inner product. Letting p0 (x) ≡ 1, we construct

p1 (x) = x − (hx, p0 i / hp0 , p0 i) p0 (x) = x − (1/m) ∑_{j=1}^{m} xj .

We continue to construct orthogonal polynomials according to


pk (x) = x^k − ∑_{i=0}^{k−1} (hx^k , pi i / hpi , pi i) pi (x) = x^k − ∑_{i=0}^{k−1} [ ∑_{j=1}^{m} xj^k pi (xj ) / ∑_{j=1}^{m} pi (xj )² ] pi (x).

In the fraction we see the usual inner product of the vector (pi (x1 ), . . . , pi (xm ))T
with itself and with the vector (xk1 , . . . , xkm )T .
Once the orthogonal polynomials are constructed, we find p by
p(x) = ∑_{k=0}^{n} (hpk , f i / hpk , pk i) pk (x).

For each k the work to find pk is bounded by a multiple of m and thus the
complete cost is O(mn). The only difference to the continuous case is that
we cannot keep adding terms, since we only have enough data to construct
p0 , p1 , . . . , pm−1 .
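The same construction is easy to carry out numerically, working only with the values of the orthogonal polynomials at the data points. The following function is a sketch (not a listing from the book; the name DiscreteLeastSquares is illustrative) which returns the coefficients ck = hpk , f i/hpk , pk i and the values of the degree-n least-squares fit at the data points.

function [pfit, c] = DiscreteLeastSquares(xdata, fdata, n)
% Sketch: least-squares polynomial fit of degree n < m using discrete
% orthogonal polynomials with respect to <g,h> = sum_j g(x_j) h(x_j)
% xdata, fdata column vectors of the m data points and values
m = length(xdata);
P = zeros(m, n+1);                  % P(:,k+1) holds p_k evaluated at the data points
P(:,1) = ones(m,1);                 % p_0 = 1
for k = 1:n
    v = xdata.^k;                   % start from q(x) = x^k
    for i = 0:k-1                   % subtract the components along p_0,...,p_{k-1}
        v = v - (sum(xdata.^k .* P(:,i+1)) / sum(P(:,i+1).^2)) * P(:,i+1);
    end
    P(:,k+1) = v;
end
c = (fdata' * P)' ./ sum(P.^2)';    % c_k = <p_k, f> / <p_k, p_k>
pfit = P * c;                       % values of the fit at the data points
end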

3.6 The Peano Kernel Theorem


In the following we study a general abstract result by Giuseppe Peano, which
applies to the error of a wide class of numerical approximations. Suppose we
are given an approximation, e.g., to a function, a derivative at a given point, or
a quadrature as an approximation to an integral. For f ∈ C k+1 [a, b] let L(f )
be the approximation error. For example, when approximating the function

this could be the maximum of the absolute difference between the function
and the approximation over the interval [a, b]. In the case of a quadrature it
is the absolute difference between the value given by the quadrature and the
value of the integral. Thus L maps the space of functions C k+1 [a, b] to R. We
assume that L is a linear functional , i.e., L(αf + βg) = αL(f ) + βL(g) for
all α, β ∈ R. We also assume that the approximation is constructed in such a
way that it is correct for all polynomials of degree at most k, i.e., L(p) = 0
for all p ∈ Pk [x].
At a given point x ∈ [a, b], f (x) can be written as its Taylor polynomial
with integral remainder term
f (x) = f (a) + (x − a)f ′(a) + ((x − a)²/2!) f ″(a) + · · · + ((x − a)^k/k!) f^{(k)}(a)
        + (1/k!) ∫_a^x (x − θ)^k f^{(k+1)}(θ) dθ.

This can be verified by integration by parts. Applying the functional L to


both sides of the equation, we obtain
L(f ) = L( (1/k!) ∫_a^x (x − θ)^k f^{(k+1)}(θ) dθ ),   x ∈ [a, b],

since the first terms of the Taylor expansion are polynomials of degree at most
k. To make the integration independent of x, we can use the notation

(x − θ)^k_+ := { (x − θ)^k ,   x ≥ θ,
               { 0,            x ≤ θ.

Then

L(f ) = L( (1/k!) ∫_a^b (x − θ)^k_+ f^{(k+1)}(θ) dθ ).

We now make the important assumption that the order of the integral and
functional can be exchanged. For most approximations, calculating L involves
differentiation, integration and linear combination of function values. In the
case of quadratures (which we will encounter later and which are a form of
numerical integration), L is the difference of the quadrature rule which is a
linear combination of function values and the integral. Both these operations
can be exchanged with the integral.
Definition 3.3 (Peano kernel). The Peano kernel K of L is the function
defined by
K(θ) := L[(x − θ)k+ ] for x ∈ [a, b].
Theorem 3.7 (Peano kernel theorem). Let L be a linear functional such that
L(p) = 0 for all p ∈ Pk [x]. Provided that the exchange of L with the integration
is valid, then for f ∈ C k+1 [a, b]
L(f ) = (1/k!) ∫_a^b K(θ) f^{(k+1)}(θ) dθ.

Theorem 3.8. If K does not change sign in (a, b), then for f ∈ C k+1 [a, b]
"Z #
b
1
L(f ) = K(θ)dθ f (k+1) (ξ)
k! a

for some ξ ∈ (a, b).


Proof. Without loss of generality K is positive on [a, b]. Then
L(f ) ≥ (1/k!) ∫_a^b K(θ) min_{x∈[a,b]} f^{(k+1)}(x) dθ = ( (1/k!) ∫_a^b K(θ) dθ ) min_{x∈[a,b]} f^{(k+1)}(x).

Similarly we can obtain an upper bound employing the maximum of f (k+1)


over [a, b]. Hence
min_{x∈[a,b]} f^{(k+1)}(x) ≤ L(f ) / ( (1/k!) ∫_a^b K(θ) dθ ) ≤ max_{x∈[a,b]} f^{(k+1)}(x).
The result of the theorem follows by the intermediate value theorem.
As an example for the application of the Peano kernel theorem we consider
the approximation of a derivative by a linear combination of function values.
More specifically, f ′(0) ≈ −(3/2)f (0) + 2f (1) − (1/2)f (2). Hence, L(f ) := f ′(0) −
[−(3/2)f (0) + 2f (1) − (1/2)f (2)]. It can be easily verified by letting p(x) = 1, x, x²
and using linearity that L(p) = 0 for all p ∈ P2 [x]. Therefore, for f ∈ C³[a, b],

L(f ) = (1/2) ∫_0^2 K(θ) f ‴(θ) dθ.
The Peano kernel is

K(θ) = L[(x − θ)²_+ ] = 2(0 − θ)_+ − [ −(3/2)(0 − θ)²_+ + 2(1 − θ)²_+ − (1/2)(2 − θ)²_+ ].
Using the definition of (x − θ)k+ , we obtain

K(θ) = { −2θ + (3/2)θ² − 2(1 − θ)² + (1/2)(2 − θ)² ≡ 0,    θ ≤ 0,
       { −2(1 − θ)² + (1/2)(2 − θ)² = θ(2 − (3/2)θ),       0 ≤ θ ≤ 1,
       { (1/2)(2 − θ)²,                                    1 ≤ θ ≤ 2,
       { 0,                                                θ ≥ 2.

We see that K ≥ 0. Moreover,


∫_0^2 K(θ) dθ = ∫_0^1 θ(2 − (3/2)θ) dθ + ∫_1^2 (1/2)(2 − θ)² dθ = 2/3.

Hence L(f ) = (1/2)(2/3) f ‴(ξ) = (1/3) f ‴(ξ) for some ξ ∈ (0, 2). Thus the error in
approximating the derivative at zero is (1/3) f ‴(ξ) for some ξ ∈ (0, 2).
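The constant can be checked numerically; the following lines are a quick sketch (not from the book). For f (x) = x³ the third derivative is the constant 6, so the error should be exactly (1/3) · 6 = 2.

f = @(x) x.^3;                           % f'''(x) = 6 everywhere
approx = -3/2*f(0) + 2*f(1) - 1/2*f(2);  % the approximation to f'(0)
L = 0 - approx                           % exact f'(0) is 0; this displays 2, as predicted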

Exercise 3.9. Express the divided difference f [0, 1, 2, 3] in the form


L(f ) = f [0, 1, 2, 3] = (1/2) ∫_0^3 K(θ) f ‴(θ) dθ,

assuming that f ∈ C 3 [0, 3]. Sketch the kernel function K(θ) for θ ∈ [0, 3]. By
integrating K(θ) and using the mean value theorem, show that
f [0, 1, 2, 3] = (1/6) f ‴(ξ)
for some point ξ ∈ [0, 3].
Exercise 3.10. We approximate the function value of f ∈ C²[0, 1] at 1/2 by
p(1/2) = (1/2)[f (0) + f (1)]. Find the least constants c0 , c1 and c2 such that

|f (1/2) − p(1/2)| ≤ ck kf (k) k∞ ,   k = 0, 1, 2.
For k = 0, 1 work from first principles and for k = 2 apply the Peano kernel
theorem.

3.7 Splines
The problem with polynomial interpolation is that with increasing degree the
polynomial ’wiggles’ from data point to data point. Low-order polynomials do
not display this behaviour. Let’s interpolate data by fitting two cubic poly-
nomials, p1 (x) and p2 (x), to different parts of the data meeting at the point
x∗ . Each cubic polynomial has four coefficients and thus we have 8 degrees
of freedom and hence can fit 8 data points. However, the two polynomial
pieces are unlikely to meet at x∗ . We need to ensure some continuity. If we let
p1 (x∗ ) = p2 (x∗ ), then the curve is at least continuous, but we are losing one
degree of freedom. The fit

p′1 (x∗ ) = p′2 (x∗ )
p″1 (x∗ ) = p″2 (x∗ )

take up one degree of freedom each. This gives a smooth curve, but we are
only left with five degrees of freedom and thus can only fit 5 data points. If
we also require the third derivative to be continuous, the two cubics become
the same cubic which is uniquely specified by the four data points.
The point x∗ is called a knot point. To fit more data we specify n + 1 such
knots and fit a curve consisting of n separate cubics between them, which is
continuous and also has continuity of the first and second derivative. This has
n + 3 degrees of freedom. This curve is called a cubic spline.
The two-dimensional equivalent of the cubic spline is called thin-plate
spline. A linear combination of thin-plate splines passes through the data

points exactly while minimizing the so-called bending energy, which is defined
as the integral over the squares of the second derivatives
∫∫ (fxx² + 2fxy² + fyy²) dx dy

where fx = ∂f /∂x. The name thin-plate spline refers to bending a thin sheet of
metal being held in place by bolts as in the building of a ship’s hull. However,
here we will consider splines in one dimension.
Definition 3.4 (Spline function). The function s is called a spline function
of degree k if there exist points a = x0 < x1 < . . . < xn = b such that
s is a polynomial of degree at most k on each of the intervals [xj−1 , xj ] for
j = 1, . . . , n and such that s has continuous k − 1 derivatives. In other words,
s ∈ C k−1 [a, b]. We call s a linear, quadratic, cubic, or quartic spline for k =
1, 2, 3, or 4. The points x0 , . . . , xn are called knots and the points x1 , . . . , xn−1
are called interior knots.
A spline of degree k can be written as
s(x) = ∑_{i=0}^{k} ci x^i + ∑_{j=1}^{n−1} dj (x − xj )^k_+                    (3.2)

for x ∈ [a, b]. Recall the notation

(x − xj )^k_+ := { (x − xj )^k ,   x ≥ xj ,
                 { 0,              x ≤ xj .

introduced for the Peano kernel. The first sum defines a general polynomial
on [x0 , x1 ] of degree k. Each of the functions (x − xj )k+ is a spline itself with
only one knot at xj and continuous k − 1 derivatives. These derivatives all
vanish at xj . Thus (3.2) describes all possible spline functions.
Theorem 3.9. Let S be the set of splines of degree k with fixed knots
x0 , . . . , xn , then S is a linear space of dimension n + k.
Proof. Linearity is implied since differentiation and continuity are linear. The
notation in (3.2) implies that S has dimension at most n + k. Hence it is
sufficient to show that if s ≡ 0, then all the coefficients ci , i = 0, . . . , k, and dj
j = 1, . . . , n − 1, vanish. Considering the interval [x0 , x1 ], then s(x) is equal
to ∑_{i=0}^{k} ci x^i on this interval and has an infinite number of zeros. Hence the
coefficients ci , i = 0, . . . , k have to be zero. To deduce dj = 0, j = 1, . . . , n − 1,
we consider each interval [xj , xj+1 ] in turn. The polynomial there has again
infinite zeros from which dj = 0 follows.
Definition 3.5 (Spline interpolation). Let f ∈ C[a, b] be given. The spline
interpolant to f is obtained by constructing the spline s that satisfies s(xi ) =
f (xi ) for i = 0, . . . , n.

For k = 1 this gives us piecewise linear interpolation. The spline s is a linear


(or constant) polynomial between adjacent function values. These conditions
define s uniquely.
For k = 3 we have cubic spline interpolation. Here the dimension of S is
n + 3, but we have only n + 1 conditions given by the function values. Thus
the data cannot define s uniquely.
In the following we show how the cubic spline is constructed and how
uniqueness is achieved. If we have not only the values of s, but also of s0
available at all the knots, then on [xi−1 , xi ] for i = 1, . . . , n the polynomial
piece of s can be written as

s(x) = s(xi−1 ) + s0 (xi−1 )(x − xi−1 ) + c2 (x − xi−1 )2 + c3 (x − xi−1 )3 . (3.3)

The derivative of s is

s0 (x) = s0 (xi−1 ) + 2c2 (x − xi−1 ) + 3c3 (x − xi−1 )2 .

Obviously s and s0 take the right value at x = xi−1 . Considering s and s0 at


x = xi , we get two equations for the coefficients c2 and c3 . Solving these we
obtain
c2 = 3 [s(xi ) − s(xi−1 )]/(xi − xi−1 )² − [2s′(xi−1 ) + s′(xi )]/(xi − xi−1 ),
c3 = 2 [s(xi−1 ) − s(xi )]/(xi − xi−1 )³ + [s′(xi−1 ) + s′(xi )]/(xi − xi−1 )².        (3.4)
The second derivative of s is

s00 (x) = 2c2 + 6c3 (x − xi−1 ).

Considering two polynomial pieces in adjacent intervals [xi−1 , xi ] and


[xi , xi+1 ], we have to ensure second derivative continuity at xi . Calculating
s00 (xi ) from the polynomial piece on the left interval and on the right interval
gives two expressions which must be equal. This implies for i = 1, . . . , n − 1
the equation

[s′(xi−1 ) + 2s′(xi )]/(xi − xi−1 ) + [2s′(xi ) + s′(xi+1 )]/(xi+1 − xi )
    = [3s(xi ) − 3s(xi−1 )]/(xi − xi−1 )² + [3s(xi+1 ) − 3s(xi )]/(xi+1 − xi )².       (3.5)
Assume we are given the function values f (xi ), i = 0, . . . , n, and the deriva-
tives f 0 (a) and f 0 (b) at the endpoints a = x0 and b = xn . Thus we know s0 (x0 )
and s0 (xn ). We seek the cubic spline that interpolates the augmented data.
Note that now we have n + 3 conditions consistent with the dimension of the
space of cubic splines. Equation (3.5) describes a system of n − 1 equations
in the unknowns s0 (xi ), i = 1, . . . , n − 1, specified by a tridiagonal matrix S
where the diagonal elements are
Si,i = 2/(xi − xi−1 ) + 2/(xi+1 − xi )

and the off-diagonal elements are


Si,i−1 = 1/(xi − xi−1 )    and    Si,i+1 = 1/(xi+1 − xi ).
This matrix is nonsingular, since it is diagonally dominant. That means that
for each row the absolute value of the diagonal element is larger than the
sum of the absolute values of the off-diagonal elements. Here the diagonal and
elements on the first subdiagonal and superdiagonal are all positive, since the
knots are ordered from smallest to largest. Thus the diagonal elements are
twice the sum of the off-diagonal elements. Nonsingularity implies that the
system has a unique solution and thus the cubic interpolation spline exists
and is unique.
The right-hand side of the system of equations is given by the right-hand
side of Equation (3.5) for i = 2, . . . , n − 2 with s(xi ) = f (xi ). For i = 1 the
right-hand side is
[3f (x1 ) − 3f (x0 )]/(x1 − x0 )² + [3f (x2 ) − 3f (x1 )]/(x2 − x1 )² − f ′(a)/(x1 − x0 ).
For i = n − 1 the right-hand side is
[3f (xn−1 ) − 3f (xn−2 )]/(xn−1 − xn−2 )² + [3f (xn ) − 3f (xn−1 )]/(xn − xn−1 )² − f ′(b)/(xn − xn−1 ).
Once the values s′(xi ), i = 1, . . . , n − 1, are obtained from the system of
equations, they can be inserted in (3.4) to calculate c2 and c3 . We then have
all the values to calculate s(x) on [xi−1 , xi ] given by Equation (3.3).
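The whole construction fits into a short routine. The sketch below (not a listing from the book; the name CubicSplineSlopes is illustrative) assembles and solves the system for the interior derivatives with the end conditions s′(a) = f ′(a) and s′(b) = f ′(b); MATLAB's backslash stands in for a dedicated tridiagonal solver.

function [ds] = CubicSplineSlopes(x, f, dfa, dfb)
% Sketch: s'(x_i), i = 0,...,n, of the interpolating cubic spline (assumes n >= 2)
% x, f      column vectors of the n+1 knots and function values
% dfa, dfb  the prescribed end derivatives f'(a) and f'(b)
n = length(x) - 1;
h = x(2:end) - x(1:end-1);            % spacings h_i = x_i - x_{i-1}
S = zeros(n-1,n-1); r = zeros(n-1,1);
for i = 1:n-1                         % equation (3.5) at the interior knot x_i
    S(i,i) = 2/h(i) + 2/h(i+1);
    if i > 1,   S(i,i-1) = 1/h(i);   end
    if i < n-1, S(i,i+1) = 1/h(i+1); end
    r(i) = 3*(f(i+1)-f(i))/h(i)^2 + 3*(f(i+2)-f(i+1))/h(i+1)^2;
end
r(1)   = r(1)   - dfa/h(1);           % known s'(x_0) moved to the right-hand side
r(n-1) = r(n-1) - dfb/h(n);           % known s'(x_n) moved to the right-hand side
ds = [dfa; S\r; dfb];                 % s'(x_0), ..., s'(x_n)
end

The values returned can be substituted into (3.4) and (3.3) to evaluate the spline on each interval.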
So far we have taken up the extra two degrees of freedom by requiring
s0 (a) = f 0 (a) and s0 (b) = f 0 (b). In the following we examine how to take up
the extra degrees of freedom further.
Let S be the space of cubic splines with knots a = x0 < x1 < . . . < xn = b
and let ŝ and š be two different cubic splines that satisfy ŝ(xi ) = š(xi ) = f (xi ).
The difference ŝ − š is a non-zero element of S that is zero at all the knots.
These elements of S form a subspace of S, say S0 . How to take up the extra
degree of freedom is equivalent to which element from S0 to choose.
To simplify the argument we let the knots be equally spaced, i.e., xi −
xi−1 = h for i = 1, . . . , n. Equation (3.5) then becomes
s′(xi−1 ) + 4s′(xi ) + s′(xi+1 ) = (3/h)[s(xi+1 ) − s(xi−1 )]                  (3.6)
for i = 1, . . . , n − 1. For s ∈ S0 , the right-hand side is zero. Thus (3.6) is a
recurrence relation where the solutions are given by αλ1^i + βλ2^i , where λ1 , λ2
are the roots of λ² + 4λ + 1 = 0. These are

λ1 = √3 − 2,
λ2 = −√3 − 2 = −(√3 + 2)(√3 − 2)/(√3 − 2) = 1/λ1 .

Figure 3.4 Basis of the subspace S0 with 8 knots


If the coefficients α and β are chosen as α = 1, β = 0 or α = 0, β = (√3 − 2)^n
we obtain two splines with values of the derivative at x0 , . . . , xn given by

s′(xi ) = (√3 − 2)^i    and    s′(xi ) = (√3 − 2)^{n−i} ,    i = 0, . . . , n.

These two splines are a convenient basis of S0 . One decays rapidly across the
interval [a, b] when moving from left to right, while the other decays rapidly
when moving from right to left as can be seen in Figure 3.4. This basis implies
that, for equally spaced knots, the freedom in a cubic spline interpolant is
greatest near the end points of the interval [a, b]. Therefore it is important
to take up this freedom by imposing an extra condition at each end of the
interval as we have done by letting s0 (a) = f 0 (a) and s0 (b) = f 0 (b). The
following exercise illustrates how the solution deteriorates if this is not done.
Exercise 3.11. Let S be the set of cubic splines with knots xi = ih for
i = 0, . . . , n, where h = 1/n. An inexperienced user obtains an approximation
to a twice-differentiable function f by satisfying the conditions s0 (0) = f 0 (0),
s00 (0) = f 00 (0), and s(xi ) = f (xi ), i = 0, . . . , n. Show how the changes in the
first derivatives s0 (xi ) propagate if s0 (0) is increased by a small perturbation
, i.e., s0 (0) = f 0 (0) + , but the remaining data remain the same.
Another possibility to take up this freedom is known as the not-a-knot
technique. Here we require the third derivative s000 to be continuous at x1
and xn−1 . It is called not-a-knot since there is no break between the two
polynomial pieces at these points. Hence you can think of these knots as not
being knots at all.
Definition 3.6 (Lagrange form of spline interpolation). For j = 0, . . . , n, let
sj be an element of S that satisfies

sj (xi ) = δij , i = 0, . . . , n, (3.7)



Figure 3.5 Example of a Lagrange cubic spline with 9 knots

where δij is the Kronecker delta (i.e., δjj = 1 and δij = 0 for i ≠ j). Figure
3.5 shows an example of a Lagrange cubic spline. Note that the splines sj are
not unique, since any element from S0 can be added. Any spline interpolant
to the data f0 , . . . , fn can then be written as
n
X
s(x) = fj sj (x) + ŝ(x),
j=0

where ŝ ∈ S0 . This expression is the Lagrange form of spline interpolation.

Theorem 3.10. Let S be the space of cubic splines with equally spaced knots
a = x0 < x1 < . . . < xn = b where xi − xi−1 = h for i = 1, . . . , n. Then for
each integer j = 0, . . . , n there is a cubic spline sj that satisfies the Lagrange
conditions (3.7) and that has the first derivative values

s′j (xi ) = { −(3/h)(√3 − 2)^{j−i} ,    i = 0, . . . , j − 1,
           { 0,                         i = j,
           { (3/h)(√3 − 2)^{i−j} ,      i = j + 1, . . . , n.

Proof. We know that to achieve second derivative continuity, the derivatives


have to satisfy (3.6). For sj this means

s′j (xi−1 ) + 4s′j (xi ) + s′j (xi+1 ) = 0,       i = 1, . . . , n − 1, i ≠ j − 1, j + 1,
s′j (xj−2 ) + 4s′j (xj−1 ) + s′j (xj ) = 3/h,     i = j − 1,
s′j (xj ) + 4s′j (xj+1 ) + s′j (xj+2 ) = −3/h,    i = j + 1.



It is easily verified that the values for s′j (xi ) given in the theorem satisfy these.
For example, considering the last two equations, we have

s′j (xj−2 ) + 4s′j (xj−1 ) + s′j (xj ) = −(3/h)(√3 − 2)² − 4(3/h)(√3 − 2)
                                      = −(3/h)(7 − 4√3 + 4√3 − 8) = 3/h

and

s′j (xj ) + 4s′j (xj+1 ) + s′j (xj+2 ) = 4(3/h)(√3 − 2) + (3/h)(√3 − 2)²
                                      = (3/h)(−8 + 4√3 − 4√3 + 7) = −3/h.

Since √3 − 2 ≈ −0.268, it can be deduced that sj (x) decays rapidly as
|x − xj | increases. Thus the contribution of fj to the cubic spline interpolant
s decays rapidly as |x − xj | increases. Hence for x ∈ [a, b], the value of s(x) is
determined mainly by the data fj for which |x − xj | is relatively small.
However, the Lagrange functions of quadratic spline interpolation do not
enjoy these decay properties if the knots coincide with the interpolation points.
Generally, on the interval [xi , xi+1 ], the quadratic polynomial piece can be
derived from the fact that the derivative is a linear polynomial and thus given
by
s′(x) = [s′(xi+1 ) − s′(xi )]/(xi+1 − xi ) · (x − xi ) + s′(xi ).
Integrating over x and using s(xi ) = fi we get

s(x) = [s′(xi+1 ) − s′(xi )]/[2(xi+1 − xi )] · (x − xi )² + s′(xi )(x − xi ) + fi .

Using s(xi+1 ) = fi+1 we can solve for s0 (xi+1 )

s′(xi+1 ) = 2(fi+1 − fi )/(xi+1 − xi ) − s′(xi ),

which is a recurrence relation. The extra degree of freedom in quadratic splines


can be taken up by letting s0 (x0 ) = 0 which gives the natural quadratic spline.
Another possibility is letting s′(x0 ) = s′(x1 ), then s′(x0 ) = (f1 − f0 )/(x1 − x0 ).

For the Lagrange functions sj , j = 1, . . . , n, of quadratic spline interpo-


lation the recurrence relation implies s0j (xi+1 ) = −s0j (xi ) for i = 1, . . . , j − 2
and i = j + 1, . . . , n. Thus the function will continue to oscillate. This can be
seen in Figure 3.6.
Figure 3.6 Lagrange function for quadratic spline interpolation

Exercise 3.12. Another strategy to use quadratic splines to interpolate equally
spaced function values f (jh), j = 0, . . . , n, is to let s have the interior knots
(j + 1/2)h, j = 1, . . . , n − 2. Verify that the values s((j + 1/2)h), j = 1, . . . , n − 2,
and the interpolation conditions s(jh) = f (jh), j = 0, . . . , n, define the
quadratic polynomial pieces of s. Further, prove that the continuity of s′ at
the interior knots implies the equations

s((j − 1/2)h) + 6s((j + 1/2)h) + s((j + 3/2)h) = 4[f (jh) + f (jh + h)],   j = 2, . . . , n − 3.

3.8 B-Spline
In this section we generalize the concept of splines.

Definition 3.7 (B-splines). Let S be the linear space of splines of degree k


that have the knots a = xo < x1 < . . . < xn = b, and let 0 ≤ p ≤ n − k − 1 be
an integer. The spline
s(x) = ∑_{j=p}^{p+k+1} λj (x − xj )^k_+ ,    x ∈ R                    (3.8)

is called a B-spline if s(x) ≡ 0 for x ≤ xp and s(x) ≡ 0 for x ≥ xp+k+1 , but


the real coefficients λj , j = p, . . . , p + k + 1 do not vanish.
Recall that the notation (·)+ means that the expression is zero, if the term
in the brackets becomes negative. Thus the spline defined by (3.8) satisfies
s(x) ≡ 0 for x ≤ xp as required. For x ≥ xp+k+1 , s(x) is given by
s(x) = ∑_{j=p}^{p+k+1} λj (x − xj )^k = ∑_{j=p}^{p+k+1} λj ∑_{i=0}^{k} (k choose i) x^{k−i} (−xj )^i .

This being identical to 0, means that there are infinitely many zeros, and this
in turn means that the coefficients of xk−i have to vanish for all i = 0, . . . , k.

Therefore
∑_{j=p}^{p+k+1} λj xj^i = 0,    i = 0, . . . , k.
A solution to this problem exists, since k + 2 coefficients have to satisfy only
k + 1 conditions. The matrix describing the system of equations is

A = [ 1        1         · · ·   1
      xp       xp+1      · · ·   xp+k+1
      . . .                      . . .
      xp^k     xp+1^k    · · ·   xp+k+1^k ] .
If λp+k+1 is zero, then the system reduces to a (k + 1) × (k + 1) matrix where
the last column of A is removed. This however is a Vandermonde matrix ,
which is non-singular, since all xi , i = p . . . , p + k + 1 are distinct. It follows
then that all the other coefficients λp , . . . , λp+k are also zero.
Therefore λp+k+1 has to be nonzero. This can be chosen and the system
can be solved uniquely for the remaining k+1 coefficients, since A with the last
column removed is nonsingular. Therefore, apart from scaling by a constant,
the B-spline of degree k with knots xp < xp+1 < . . . < xp+k+1 that vanishes
outside (xp , xp+k+1 ) is unique.
Theorem 3.11. Apart from scaling, the coefficients λj , j = p, . . . , p + k + 1,
in (3.8) are given by

λj = ( ∏_{i=p, i≠j}^{p+k+1} (xj − xi ) )^{−1} .

Therefore the B-spline is

Bpk (x) = ∑_{j=p}^{p+k+1} ( ∏_{i=p, i≠j}^{p+k+1} (xj − xi ) )^{−1} (x − xj )^k_+ ,    x ∈ R.        (3.9)

Proof. The function Bpk (x) is a polynomial of degree at most k for x ≥ xp+k+1 :

Bpk (x) = ∑_{j=p}^{p+k+1} ( ∏_{i=p, i≠j}^{p+k+1} (xj − xi ) )^{−1} ∑_{l=0}^{k} (k choose l) x^{k−l} (−xj )^l
        = ∑_{l=0}^{k} (−1)^l (k choose l) [ ∑_{j=p}^{p+k+1} ( ∏_{i=p, i≠j}^{p+k+1} (xj − xi ) )^{−1} xj^l ] x^{k−l} .

On the other hand, let 0 ≤ l ≤ k be an integer. We use the Lagrange interpo-


lation formula to interpolate x^l at the points xp , . . . , xp+k+1

x^l = ∑_{j=p}^{p+k+1} ( ∏_{i=p, i≠j}^{p+k+1} (x − xi )/(xj − xi ) ) xj^l .

Comparing the coefficient of x^{k+1} on both sides of the equation gives

0 = ∑_{j=p}^{p+k+1} ( ∏_{i=p, i≠j}^{p+k+1} 1/(xj − xi ) ) xj^l     for l = 0, . . . , k.

The right-hand side is a factor of the coefficient of xk−l in Bpk (x) for x ≥
xp+k+1 . It follows that the coefficient of xk−l in Bpk (x), x ≥ xp+k+1 , is zero
for l = 0, 1, . . . , k. Thus Bpk (x) vanishes for x ≥ xp+k+1 and is the required
B-spline.
The advantage of B-splines is that the nonzero part is confined to an
interval which contains only k + 2 consecutive knots. This is also known as
the spline having finite support.
As an example we consider the (n + 1)-dimensional space of linear splines
with the usual knots a = x0 < . . . < xn = b. We introduce extra knots x−1 and
xn+1 outside the interval and we let B^1_{j−1} be the linear spline which satisfies
the conditions B^1_{j−1}(xi ) = δij , i = 0, . . . , n, where δij denotes the Kronecker
delta. Then every s in the space of linear splines can be written as
s(x) = ∑_{j=0}^{n} s(xj ) B^1_{j−1}(x),    a ≤ x ≤ b.

These basis functions are often called hat functions because of their shape, as
Figure 3.7 shows.
For general k we can generate a set of B-splines for the (n+k)- dimensional
space S of splines of degree k with the knots a = x0 < x1 < . . . < xn = b. We
add k additional knots both to the left of a and to the right of b. Thus the
full list of knots is x−k < x−k+1 < . . . < xn+k . We let Bpk be the function as
defined in (3.9) for p = −k, −k+1, . . . , n−1, where we restrict the range of x to
the interval [a, b] instead of x ∈ R. Therefore for each p = −k, −k +1, . . . , n−1
the function Bpk (x), a ≤ x ≤ b, lies in S and vanishes outside the interval
(xp , xp+k+1 ). Figure 3.8 shows these splines for k = 3.
Theorem 3.12. The B-splines Bpk , p = −k, −k + 1, . . . , n − 1, form a basis
of S.
Proof. The number of B-splines is n + k and this equals the dimension of
S. To show that the B-splines form a basis, it is sufficient to show that a
nontrivial linear combination of them cannot vanish identically in the interval
[a, b]. Assume otherwise and let
s(x) = ∑_{j=−k}^{n−1} sj Bjk (x),    x ∈ R,

where s(x) ≡ 0 for a ≤ x ≤ b.

Figure 3.7 Hat functions generated by linear B-splines on equally spaced nodes

Figure 3.8 Basis of cubic B-splines

We know that s(x) also has to be zero for
x ≤ x−k and x ≥ xn+k . Considering x ≤ b, then s(x) is a B-spline which


is identical to zero outside the interval (x−k , x0 ). However, this interval only
contains k + 1 knots and we have seen that it is not possible to construct a B-
spline with fewer than k +2 knots. Thus s(x) does not only vanish for x ≤ x−k
but also on the intervals [xj , xj+1 ] for j = −k, . . . , n − 1. By considering these
in sequence we can deduce that sj = 0 for j = −k, . . . , n − 1 and the linear
combination is the trivial linear combination.
To use B-splines for interpolation at general points, let the distinct points
ξj ∈ [a, b] and data f (ξj ) be given for j = 1, . . . , n + k. If s is expressed as
   s(x) = Σ_{j=−k}^{n−1} s_j B_j^k(x),   x ∈ [a, b],

then the coefficients s_j, j = −k, . . . , n − 1 have to satisfy the (n + k) × (n + k) system of equations given by

   Σ_{j=−k}^{n−1} s_j B_j^k(ξ_i) = f(ξ_i),   i = 1, . . . , n + k.

The advantage of this form is that Bjk (ξi ) is nonzero if and only if ξi is in
the interval (xj , xj+k+1 ). Therefore there are at most k + 1 nonzero elements
in each row of the matrix. Thus the matrix is sparse.
The explicit form of a B-spline given in (3.9) is not very suitable for eval-
uation if x is close to xp+k+1 , since all (·)+ terms will be nonzero. However, a
different representation of B-splines can be used for evaluation purposes. The
following definition is motivated by formula (3.9) for k = 0.
Definition 3.8. The B-spline of degree 0 Bp0 is defined as
   B_p^0(x) = 0                       for x < x_p or x > x_{p+1},
   B_p^0(x) = (x_{p+1} − x_p)^{−1}    for x_p < x < x_{p+1},
   B_p^0(x) = ½ (x_{p+1} − x_p)^{−1}  for x = x_p or x = x_{p+1}.
Theorem 3.13 (B-spline recurrence relation). For k ≥ 1 the B-splines satisfy
the following recurrence relation
   B_p^k(x) = [(x − x_p) B_p^{k−1}(x) + (x_{p+k+1} − x) B_{p+1}^{k−1}(x)] / (x_{p+k+1} − x_p),   x ∈ R.   (3.10)
Proof. In the proof one needs to show that the resulting function has the
required properties to be a B-spline and this is left to the reader.
Let s ∈ S be a linear combination of B-splines. If s needs to be evaluated
for x ∈ [a, b], we pick j between 0 and n − 1 such that x ∈ [xj , xj+1 ]. It is only
necessary to calculate Bpk (x) for p = j − k, j − k + 1, . . . , j, since the other
B-splines of order k vanish on this interval. The calculation starts with the
one nonzero value of Bp0 (x), p = 0, 1, . . . , n. (It could be two nonzero values if
x happens to coincide with a knot.) Then for l = 1, 2, . . . , k the values Bpl (x)
for p = j − l, j − l + 1, . . . , j are generated by the recurrence relation given
l l
in (3.10), keeping in mind that Bj−l−1 (x) and Bj+1 (x) are zero. Note that as
the order of the B-splines increases, the number of B-splines not zero on the
interval [xj , xj+1 ] increases.
   B_j^0(x) → B_{j−1}^1(x), B_j^1(x) → B_{j−2}^2(x), B_{j−1}^2(x), B_j^2(x) → · · · → B_{j−k}^k(x), . . . , B_j^k(x)

This means that s can be evaluated at x ∈ [a, b] in O(k²) operations.
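To make the evaluation scheme concrete, the following MATLAB function is a minimal sketch (not one of the book's listings) of the recurrence (3.10). The indexing convention is our own: the knots x_{j−k}, . . . , x_{j+k+1} are passed as a vector t with t(i) = x_{j−k−1+i}, and x is assumed to lie in the open interval (x_j, x_{j+1}).

function B = bsplinevals(t, k, x)
% Values of the k+1 B-splines of degree k that can be nonzero at x,
% generated with the recurrence (3.10).  Sketch only, with our own
% indexing: t(1),...,t(2k+2) hold the knots x_{j-k},...,x_{j+k+1} and
% x lies in (t(k+1), t(k+2)), i.e. in (x_j, x_{j+1}).
B = zeros(1, k+1);               % B(i) will hold B_{j-k-1+i}^l(x)
B(k+1) = 1/(t(k+2) - t(k+1));    % degree 0: only B_j^0 is nonzero on (x_j, x_{j+1})
for l = 1:k
    for i = k+1-l : k+1          % p = j-l, ..., j
        left = (x - t(i))*B(i);  % (x - x_p) B_p^{l-1}(x)
        if i < k+1
            right = (t(i+l+1) - x)*B(i+1);   % (x_{p+l+1} - x) B_{p+1}^{l-1}(x)
        else
            right = 0;           % B_{j+1}^{l-1}(x) vanishes on (x_j, x_{j+1})
        end
        B(i) = (left + right)/(t(i+l+1) - t(i));
    end
end
end

For example, bsplinevals(0:7, 3, 3.5) returns the values of the four cubic B-splines that are nonzero on (3, 4) for the equally spaced knots 0, 1, . . . , 7.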

3.9 Revision Exercises


Exercise 3.13. (a) Let Qk , k = 0, 1, . . ., be a set of polynomials orthogonal
with respect to some inner product h·, ·i in the interval [a, b]. Let f be a
continuous function in [a, b]. Write explicitly the least-squares polynomial
approximation to f by a polynomial of degree n in terms of the polynomials
Qk , k = 0, 1, . . ..
(b) Let an inner product be defined by the formula
Z 1
hg, hi = (1 − x2 )−1/2 g(x)h(x)dx.
−1

The orthogonal polynomials are the Chebyshev polynomials of the first kind
given by Qk (x) = cos(k arccos x), k ≥ 0. Using the substitution x = cos θ,
calculate the inner products hQk , Qk i for k ≥ 0. (Hint: 2 cos2 x = 1 +
cos 2x.)
(c) For the inner product given above and the Chebyshev polynomials, calcu-
late the inner products hQk , f i for k ≥ 0, k 6= 1, where f is given by
f (x) = (1 − x2 )1/2 . (Hint: cos x sin y = 12 [sin(x + y) − sin(x − y)].)

(d) Now for k = 1, calculate the inner product hQ1 , f i.


(e) Thus for even n write the least squares polynomial approximation to f as
linear combination of the Chebyshev polynomials with the correct coeffi-
cients.

Exercise 3.14. (a) Given a set of real values f0 , f1 , . . . , fn at real data points
x0 , x1 , . . . , xn , give a formula for the Lagrange cardinal polynomials and
state their properties. Write the polynomial interpolant in the Lagrange
form.
(b) Define the divided difference of degree n, f [x0 , x1 , . . . , xn ], and give a for-
mula for it derived from the Lagrange form of the interpolant. What is the
divided difference of degree zero?
(c) State the recursive formula for the divided differences and prove it.
(d) State the Newton form of polynomial interpolation and show how it can
be evaluated in just O(n) operations.
(e) Let x0 = 4, x1 = 6, x2 = 8, and x3 = 10 with data values f0 = 1, f1 =
3, f2 = 8, and f3 = 20. Give the Lagrange form of the polynomial inter-
polant.
(f ) Calculate the divided difference table for x0 = 4, x1 = 6, x2 = 8, and


x3 = 10 with data values f0 = 1, f1 = 3, f2 = 8, and f3 = 20.
(g) Thus give the Newton form of the polynomial interpolant.
Exercise 3.15. (a) Define the divided difference of degree n, f [x0 , x1 , . . . , xn ].
What is the divided difference of degree zero?
(b) Prove the recursive formula for divided differences

   f[x_0, x_1, . . . , x_n, x_{n+1}] = (f[x_1, . . . , x_{n+1}] − f[x_0, . . . , x_n]) / (x_{n+1} − x_0).

(c) By considering the polynomials p, q ∈ Pk [x] that interpolate f


at x0 , . . . , xi−1 , xi+1 , . . . , xn and x0 , . . . , xj−1 , xj+1 , . . . , xn , respectively,
where i ≠ j, construct a polynomial r, which interpolates f at x_0, . . . , x_n.
For the constructed r show that r(xk ) = f (xk ) for k = 0, . . . , n.
(d) Deduce that, for any i ≠ j, we have

   f[x_0, . . . , x_n] = (f[x_0, . . . , x_{i−1}, x_{i+1}, . . . , x_n] − f[x_0, . . . , x_{j−1}, x_{j+1}, . . . , x_n]) / (x_j − x_i).

(e) Calculate the divided difference table for x0 = 0, x1 = 1, x2 = 2, and


x3 = 3 with data values f0 = 0, f1 = 1, f2 = 8, and f3 = 27.
(f ) Using the above formula, calculate the divided differences f [x0 , x2 ],
f [x0 , x2 , x3 ], and f [x0 , x1 , x3 ].
Exercise 3.16. (a) Give the definition of a spline function of degree k.
(b) Prove that the set of splines of degree k with fixed knots x0 , . . . , xn is a
linear space of dimension n + k.
(c) Turning to cubic splines. Let s(x), 1 ≤ x < ∞, be a cubic spline that has
the knots xj = µj : j = 0, 1, 2, 3, . . ., where µ is a constant greater than 1.
Prove that if s is zero at every knot, then its first derivatives satisfy the
recurrence relation
   µ s′(x_{j−1}) + 2(µ + 1) s′(x_j) + s′(x_{j+1}) = 0,   j = 1, 2, 3, . . . .

(d) Using

   s(x) = s(x_{i−1}) + s′(x_{i−1})(x − x_{i−1}) + c_2 (x − x_{i−1})² + c_3 (x − x_{i−1})³

on the interval [x_{i−1}, x_i], where

   c_2 = 3 (s(x_i) − s(x_{i−1})) / (x_i − x_{i−1})² − (2 s′(x_{i−1}) + s′(x_i)) / (x_i − x_{i−1}),
   c_3 = 2 (s(x_{i−1}) − s(x_i)) / (x_i − x_{i−1})³ + (s′(x_{i−1}) + s′(x_i)) / (x_i − x_{i−1})²,
deduce that s can be nonzero with a bounded first derivative.
(e) Further show that such an s is bounded, if µ is at most ½(3 + √5).
Exercise 3.17. (a) Given a set of real values f0 , f1 , . . . , fn at real data points
x0 , x1 , . . . , xn , give a formula for the Lagrange cardinal polynomials and
state their properties. Write the polynomial interpolant in the Lagrange
form.
(b) How many operations are necessary to evaluate the polynomial interpolant
in the Lagrange form at x?
(c) Prove that the polynomial interpolant is unique.

(d) Using the Lagrange form of interpolation, compute the polynomial p(x)
that interpolates the data x0 = 0, x1 = 1, x2 = 2 and f0 = 1, f1 = 2,
f2 = 3. What is the degree of p(x)?
(e) What is a divided difference and a divided difference table and for which
form of interpolant is it used? Give the formula for the interpolant, how
many operations are necessary to evaluate the polynomial in this form?
(f ) Prove the relation used in a divided difference table.
(g) Write down the divided difference table for the interpolation problem given
in (d). How does it change with the additional data f3 = 5 at x3 = 3?
CHAPTER 4

Non-Linear Systems

4.1 Bisection, Regula Falsi, and Secant Method


We consider the solution of the equation f (x) = 0 for a suitably smooth
function f : R → R, i.e., we want to find a root of the function f . Any
non-linear system can be expressed in this form.
If f is continuous and, for a given interval [a, b], f(a) and f(b) have opposite signs, then f must have at least one zero in the interval by the intermediate value theorem. The method of bisection can be used to find the zero. It is
robust, i.e., it is guaranteed to converge although at a possibly slow rate. It is
also known as binary search method .
We repeatedly bisect the interval and select the interval in which the root
must lie. At each step we calculate the midpoint m = (a+b)/2 and the function
value f (m). Unless m is itself a root (improbable, but not impossible), there
are two cases: If f (a) and f (m) have opposite signs, then the method sets m
as the new value for b. Otherwise if f (m) and f (b) have opposite signs, then
the method sets m as the new a. The algorithm terminates when b − a is sufficiently small.
Suppose the calculation is performed in binary. In every step the width of
the interval containing a zero is reduced by 50% and thus the distance of the
end points to the zero is also at least halved. Therefore at worst the method
will add one binary digit of accuracy in each step. So the iterations are linearly
convergent.
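A minimal MATLAB sketch of this loop (not one of the book's listings) might look as follows; it assumes f is continuous and that f(a) and f(b) have opposite signs.

function m = bisection(f, a, b, tol)
% Bisection method (sketch).  Assumes sign(f(a)) ~= sign(f(b)).
fa = f(a);
while (b - a)/2 > tol
    m = (a + b)/2;
    fm = f(m);
    if fm == 0
        return                       % m happens to be an exact root
    elseif sign(fm) ~= sign(fa)
        b = m;                       % the root lies in [a, m]
    else
        a = m; fa = fm;              % the root lies in [m, b]
    end
end
m = (a + b)/2;
end

For example, bisection(@(x) x.^3 - 2, 1, 2, 1e-12) approximates the cube root of 2.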
The bisection method always chooses the mid-point of the current interval.
It can be improved by considering the straight line between f (a) and f (b)
which is given by
   f(a) (x − b)/(a − b) + f(b) (x − a)/(b − a)

and where it crosses the axis. Then m is calculated by

   m = (f(b) a − f(a) b) / (f(b) − f(a))
and the interval containing the root is updated as before. The method is illustrated in Figure 4.1. Loss of significance is unlikely since f(a) and f(b) have opposite signs. This method is sometimes called regula falsi or rule of false position, since we take a guess as to the position of the root.

Figure 4.1 The rule of false position
At first glance it seems superior to the bisection method, since an approx-
imation to the root is used. However, asymptotically one of the end-points
will converge to the root, while the other one remains fixed. Thus only one
end-point of the interval gets updated. We illustrate this behaviour with the
function
f (x) = 2x3 − 4x2 + 3x,
which has a root for x = 0. We start with the initial interval [−1, 1]. The
left end-point −1 is never replaced while the right end-point moves towards
zero. Thus the length of the interval is always at least 1. In the first iteration
the right end-point becomes 3/4 and in the next iteration it is 159/233. It
converges to zero at a linear rate similar to the bisection method.
The value m is a weighted average of the function values. The method can
be modified by adjusting the weights of the function values in the case where
the same endpoint is retained twice in a row. The value m is then for example
calculated according to
   m = (½ f(b) a − f(a) b) / (½ f(b) − f(a))   or   m = (f(b) a − ½ f(a) b) / (f(b) − ½ f(a)).

This adjustment guarantees superlinear convergence. It and similar modifica-


tions to the regula falsi method are known as the Illinois algorithm. For more
information on this modification of the regula falsi method see for example [5]
G. Dahlquist, Å. Björck, Numerical Methods.
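A MATLAB sketch of regula falsi with the Illinois modification (not one of the book's listings; the function name and the bookkeeping variable keep are ours) could be written as follows.

function m = regulafalsi(f, a, b, tol)
% Regula falsi with the Illinois modification (sketch).
% Assumes f(a) and f(b) have opposite signs.
fa = f(a); fb = f(b);
keep = 0;                            % +1: a was retained last, -1: b was retained last
while abs(b - a) > tol
    m = (fb*a - fa*b)/(fb - fa);     % intersection of the secant with the axis
    fm = f(m);
    if fm == 0
        return
    elseif sign(fm) ~= sign(fa)      % root in [a, m]: replace b, retain a
        b = m; fb = fm;
        if keep == 1, fa = fa/2; end % a retained twice in a row: halve its weight
        keep = 1;
    else                             % root in [m, b]: replace a, retain b
        a = m; fa = fm;
        if keep == -1, fb = fb/2; end
        keep = -1;
    end
end
m = (fb*a - fa*b)/(fb - fa);
end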
So far we always chose the interval containing the root, but a further
development is to abandon the use of intervals. The secant method always
retains the last two values instead of making sure to keep one point on either
side of the root. Initializing x(0) = a and x(1) = b the method calculates

   x^(n+1) = (f(x^(n)) x^(n−1) − f(x^(n−1)) x^(n)) / (f(x^(n)) − f(x^(n−1)))
           = x^(n) − f(x^(n)) (x^(n) − x^(n−1)) / (f(x^(n)) − f(x^(n−1))).   (4.1)

This is the intersection of the secant through the points (x^(n−1), f(x^(n−1))) and (x^(n), f(x^(n))) with the axis. There is now the possibility of loss of significance,
since f (x(n−1) ) and f (x(n) ) can have the same sign. Hence the denominator
can become very small leading to large values, possibly leading away from the
root.
Also, convergence may be lost, since the root is not bracketed by an interval
anymore. If, for example, f is differentiable and the derivative vanishes at some
point in the initial interval, the algorithm may not converge, since the secant
can become close to horizontal and the intersection point will be far away
from the root.
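A MATLAB sketch of the iteration (4.1) (not one of the book's listings); it stops when successive iterates agree to within a tolerance and makes no attempt to detect the failure modes just described.

function x1 = secant(f, x0, x1, tol)
% Secant method (sketch), starting from the two points x0 and x1.
f0 = f(x0); f1 = f(x1);
while abs(x1 - x0) > tol && f1 ~= f0
    x2 = x1 - f1*(x1 - x0)/(f1 - f0);   % intersection of the secant with the axis
    x0 = x1; f0 = f1;                   % shift the two retained values
    x1 = x2; f1 = f(x1);
end
end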
To analyze the properties of the secant method further, in particular to
find the order of convergence, we assume that f is twice differentiable and
that the derivative of f is bounded away from zero in a neighbourhood of the
root. Let x∗ be the root and let e(n) denote the error e(n) = x(n) − x∗ at the
nth iteration. We can then express the error at the (n + 1)th iteration in terms
of the error in the previous two iterations:

   e^(n+1) = e^(n) − f(x∗ + e^(n)) (e^(n) − e^(n−1)) / (f(x∗ + e^(n)) − f(x∗ + e^(n−1)))
           = (f(x∗ + e^(n)) e^(n−1) − f(x∗ + e^(n−1)) e^(n)) / (f(x∗ + e^(n)) − f(x∗ + e^(n−1)))
           = e^(n) e^(n−1) (f(x∗ + e^(n))/e^(n) − f(x∗ + e^(n−1))/e^(n−1)) / (f(x∗ + e^(n)) − f(x∗ + e^(n−1))).

Using the Taylor expansion of f (x∗ + e(n) ) we can write

   f(x∗ + e^(n)) / e^(n) = f'(x∗) + ½ f''(x∗) e^(n) + O([e^(n)]²).

Doing the same for f (x∗ + e(n−1) )/e(n−1) and assuming that x(n) and x(n−1)
are close enough to the root such that the terms O([e(n) ]2 ) and O([e(n−1) ]2 )
can be neglected, the expression for e(n+1) becomes

   e^(n+1) ≈ ½ e^(n) e^(n−1) f''(x∗) (e^(n) − e^(n−1)) / (f(x∗ + e^(n)) − f(x∗ + e^(n−1)))
           ≈ e^(n) e^(n−1) f''(x∗) / (2 f'(x∗)),

where we again used the Taylor expansion in the last step. Letting C = |f''(x∗) / (2 f'(x∗))|, the modulus of the error in the next iteration is then approximately

   |e^(n+1)| ≈ C |e^(n)| |e^(n−1)|

or in other words |e(n+1) | = O(|e(n) ||e(n−1) |). By definition the method is of


order p if |e(n+1) | = O(|e(n) |p ). We can write this also as |e(n+1) | = c|e(n) |p
for some constant c which is equivalent to

   |e^(n+1)| / |e^(n)|^p = c = O(1).

The O-notation helps us avoid noting down the constants C and c.


On the other hand,

   |e^(n+1)| / |e^(n)|^p = O(|e^(n)| |e^(n−1)|) / |e^(n)|^p = O(|e^(n)|^{1−p} |e^(n−1)|).

Using the fact that also e(n) = O(|e(n−1) |p ), it follows that


   O(|e^(n−1)|^{p − p² + 1}) = O(1).

This is only possible if p − p² + 1 = 0. The positive solution of this quadratic is p = (√5 + 1)/2 ≈ 1.618. Thus we have calculated the order of convergence and it is superlinear.

4.2 Newton’s Method


Looking closer at Equation (4.1), f(x^(n)) is divided by (f(x^(n)) − f(x^(n−1))) / (x^(n) − x^(n−1)). Taking the theoretical limit x^(n−1) → x^(n) this becomes the derivative f'(x^(n)) at x^(n). This is known as Newton's or Newton–Raphson method,

   x^(n+1) = x^(n) − f(x^(n)) / f'(x^(n)).   (4.2)

Geometrically the secant through two points of a curve becomes the tan-
gent to the curve in the limit of the points coinciding. The tangent to the
curve f at the point (x(n) , f (x(n) )) has the equation

   y = f(x^(n)) + f'(x^(n)) (x − x^(n)).

The point x(n+1) is the point of intersection of this tangent with the x-axis.
This is illustrated in Figure 4.2.
Figure 4.2 Newton's method

Let x∗ be the root. The Taylor expansion of f(x∗) about x^(n) is

   f(x∗) = f(x^(n)) + f'(x^(n)) (x∗ − x^(n)) + (1/2!) f''(ξ^(n)) (x∗ − x^(n))²,

where ξ^(n) lies between x^(n) and x∗. Since x∗ is the root, this equates to zero:

   f(x^(n)) + f'(x^(n)) (x∗ − x^(n)) + (1/2!) f''(ξ^(n)) (x∗ − x^(n))² = 0.
Let's assume that f' is bounded away from zero in a neighbourhood of x∗ and x^(n) lies in this neighbourhood. We can then divide the above equation by f'(x^(n)). After rearranging, this becomes

   f(x^(n)) / f'(x^(n)) + (x∗ − x^(n)) = −(1/2) (f''(ξ^(n)) / f'(x^(n))) (x∗ − x^(n))².

Using (4.2) we can relate the error in the next iteration to the error in the current iteration

   x∗ − x^(n+1) = −(1/2) (f''(ξ^(n)) / f'(x^(n))) (x∗ − x^(n))².

This shows that under certain conditions the convergence of Newton’s method
is quadratic. The conditions are that there exists a neighbourhood U of the
root where f' is bounded away from zero and where f'' is finite and that the starting point lies sufficiently close to x∗. More specifically, let

   C = sup_{x∈U} |f''(x)| / inf_{x∈U} |f'(x)|.

Here the notation sup_{x∈U} |f''(x)| is the smallest upper bound of the values |f''| takes in U, while inf_{x∈U} |f'(x)| stands for the largest lower bound of the values |f'| takes in U. Then

   |x∗ − x^(n+1)| ≤ ½ C (x∗ − x^(n))².

Here we can see that sufficiently close to x∗ means that ½ C |x∗ − x^(0)| < 1, otherwise the distance to the root may not decrease.
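A minimal MATLAB sketch of the iteration (4.2) (not one of the book's listings); the derivative is supplied by the caller and the iteration stops once the Newton step becomes small.

function x = newton(f, df, x, tol, maxit)
% Newton's method (sketch); df is the derivative of f.
for n = 1:maxit
    step = f(x)/df(x);
    x = x - step;
    if abs(step) < tol
        return
    end
end
warning('Newton iteration did not converge within maxit steps.')
end

For example, newton(@(x) x.^3 + x.^2 - 2, @(x) 3*x.^2 + 2*x, 1.5, 1e-12, 50) converges to the root 1 in a handful of iterations.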
We take a closer look at the situations where the method fails. In some
cases the function in question satisfies the conditions for convergence, but the
point chosen as starting point is not sufficiently close to the root. In this case
other methods such as bisection are used to obtain a better starting point.
The method fails if any of the iteration points happens to be a stationary
point, i.e., a point where the first derivative vanishes. In this case the next
iteration step is undefined, since the tangent there will be parallel to the x-
axis and not intersect it. Even if the derivative is nonzero, but small, the next
approximation may be far away from the root.
For some functions it can happen that the iteration points enter an infinite
cycle. Take for example the polynomial f (x) = x3 −2x+2. If 0 is chosen as the
starting point, the first iteration produces 1, while the next iteration produces
0 again and so forth. The behaviour of the sequence produced by Newton’s
method is illustrated by Newton fractals in the complex plane C. If z ∈ C
is chosen as a starting point and if the sequence produced converges to a
specific root then z is associated with this root. The set of initial points z
that converge to the same root is called basin of attraction for that root. The
fractals illustrate beautifully that a slight perturbation of the starting value
can result into the algorithm converging to a different root.
Exercise 4.1. Write a program which takes a polynomial of degree between
2 and 7 as input and colours the basins of attraction for each root a different
colour. Try it out for the polynomial z n − 1 for n = 2, . . . , 7.

A simple example where Newton's method diverges is the cube root f(x) = x^{1/3}, which is continuous and infinitely differentiable except for the root x = 0, where the derivative is undefined. For any approximation x^(n) ≠ 0 the next approximation will be

   x^(n+1) = x^(n) − (x^(n))^{1/3} / ((1/3)(x^(n))^{1/3 − 1}) = x^(n) − 3 x^(n) = −2 x^(n).

In every iteration the algorithm overshoots the solution onto the other side
further away than it was initially. The distance to the solution is doubled in
each iteration.
There are also cases where Newton’s method converges, but the rate of
convergence is not quadratic. For example take f (x) = x2 . Then for every
approximation x(n) the next approximation is
   x^(n+1) = x^(n) − (x^(n))² / (2 x^(n)) = ½ x^(n).
Thus the distance to the root is halved in every iteration comparable to the
bisection method.
Newton’s method readily generalizes to higher dimensional problems.
Given a function f : R^m → R^m, we consider f as a vector of m functions

   f(x) = (f_1(x), . . . , f_m(x))^T,

where x ∈ R^m. Let h = (h_1, . . . , h_m)^T be a small perturbation vector. The multidimensional Taylor expansion of each function component f_i, i = 1, . . . , m is

   f_i(x + h) = f_i(x) + Σ_{k=1}^{m} (∂f_i(x)/∂x_k) h_k + O(‖h‖²).

The Jacobian matrix J_f(x) has the entries [J_f(x)]_{i,k} = ∂f_i(x)/∂x_k and thus we can write in matrix notation

   f(x + h) = f(x) + J_f(x) h + O(‖h‖²).   (4.3)

Assuming that x is an approximation to the root, we want to improve the approximation by choosing h. Ignoring the higher-order terms gathered in O(‖h‖²) and setting f(x + h) = 0, we can solve (4.3) for h. The new approximation is set to x + h. More formally the Newton iteration in higher dimensions is

   x^(n+1) = x^(n) − J_f(x^(n))^{−1} f(x^(n)).
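In MATLAB one would solve the linear system J_f(x^(n)) h = −f(x^(n)) in each step rather than forming the inverse. The following is a sketch (not one of the book's listings); the test function and Jacobian in the comment are a hypothetical example.

function x = newtonsys(f, J, x, tol, maxit)
% Newton's method for a system f: R^m -> R^m (sketch).
% J(x) returns the Jacobian matrix of f at x.
for n = 1:maxit
    h = -J(x)\f(x);          % solve J_f(x) h = -f(x)
    x = x + h;
    if norm(h) < tol
        return
    end
end
end

% Example (hypothetical test problem):
% f = @(x) [x(1)^2 + x(2)^2 - 1; x(1) - x(2)];
% J = @(x) [2*x(1), 2*x(2); 1, -1];
% newtonsys(f, J, [1; 0], 1e-12, 20)    % converges to (1/sqrt(2), 1/sqrt(2))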

However, computing the Jacobian can be a difficult and expensive opera-


tion. The next method computes an approximation to the Jacobian in each
iteration step.

4.3 Broyden’s Method


We return to the one-dimensional case for a comparison. Newton’s method
and the secant method are related in that the secant method replaces the
derivative in Newton’s method by a finite difference.

   f'(x^(n)) ≈ (f(x^(n)) − f(x^(n−1))) / (x^(n) − x^(n−1)).

Rearranging gives

   f'(x^(n)) (x^(n) − x^(n−1)) ≈ f(x^(n)) − f(x^(n−1)).

This can be generalized to

   J_f(x^(n)) (x^(n) − x^(n−1)) ≈ f(x^(n)) − f(x^(n−1)).   (4.4)


In Broyden’s method the Jacobian is only calculated once in the first iteration
as A(0) = Jf (x(0) ). In all the subsequent iterations the matrix is updated in
such a way that it satisfies (4.4). That is, if the matrix multiplies x(n) −x(n−1) ,
the result is f (x(n) ) − f (x(n−1) ). This determines how the matrix acts on the
one-dimensional subspace of Rm spanned by the vector x(n) −x(n−1) . However,
this does not determine how the matrix acts on the (m − 1)-dimensional
complementary subspace. Or in other words (4.4) provides only m equations
to specify an m×m matrix. The remaining degrees of freedom are taken up by
letting A(n) be a minimal modification to A(n−1) , minimal in the sense that
A(n) acts the same as A(n−1) on all vectors orthogonal to x(n) − x(n−1) . These
vectors are the (m−1)-dimensional complementary subspace. The matrix A(n)
is then given by
   A^(n) = A^(n−1) + [(f(x^(n)) − f(x^(n−1)) − A^(n−1)(x^(n) − x^(n−1))) / ((x^(n) − x^(n−1))^T (x^(n) − x^(n−1)))] (x^(n) − x^(n−1))^T.

If A(n) is applied to x(n) − x(n−1) , most terms vanish and the desired result
f (x(n) )−f (x(n−1) ) remains. The second term vanishes whenever A(n) is applied
to a vector v orthogonal to x(n) −x(n−1) , since in this case (x(n) −x(n−1) )T v =
0, and A(n) acts on these vectors exactly as A(n−1) .
The next approximation is then given by
x(n+1) = x(n) − (A(n) )−1 f (x(n) ).
Just as in Newton’s method, we do not calculate the inverse directly. Instead
we solve
A(n) h = −f (x(n) )
for some perturbation h ∈ Rm and let x(n+1) = x(n) + h.
However, if the inverse of the initial Jacobian A(0) has been calculated the
inverse can be updated in only O(m²) operations using the Sherman–Morrison formula, which states that for a non-singular matrix A and vectors u and v such that v^T A^{−1} u ≠ −1 we have

   (A + u v^T)^{−1} = A^{−1} − (A^{−1} u v^T A^{−1}) / (1 + v^T A^{−1} u).

Letting

   A = A^(n−1),
   u = (f(x^(n)) − f(x^(n−1)) − A^(n−1)(x^(n) − x^(n−1))) / ((x^(n) − x^(n−1))^T (x^(n) − x^(n−1))),
   v = x^(n) − x^(n−1),
we have a fast update formula.
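Putting the pieces together gives the following MATLAB sketch of Broyden's method (not one of the book's listings), which keeps and updates the inverse of A^(n) with the Sherman–Morrison formula. Note that, because the step satisfies A^(n−1) h = −f(x^(n−1)), the numerator of u simplifies to f(x^(n)).

function x = broyden(f, x, A0, tol, maxit)
% Broyden's method with Sherman-Morrison update of the inverse (sketch).
% A0 is the Jacobian at the starting point, e.g. computed analytically
% or by finite differences.
Ainv = inv(A0);
fx = f(x);
for n = 1:maxit
    h = -Ainv*fx;                  % solve A^{(n-1)} h = -f(x^{(n-1)})
    x = x + h;
    fnew = f(x);
    if norm(h) < tol, return, end
    % update A^{(n)} = A^{(n-1)} + u v^T with v = h and
    % u = (fnew - fx - A^{(n-1)} h)/(h'*h) = fnew/(h'*h)
    u = fnew/(h'*h);
    v = h;
    Ainv = Ainv - (Ainv*u)*(v'*Ainv)/(1 + v'*(Ainv*u));
    fx = fnew;
end
end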
Broyden’s method is not as fast as the quadratic convergence of Newton’s
method. But the smaller operation count per iteration is often worth it. In
the next section we return to the one-dimensional case.
4.4 Householder Methods


Householder methods are Newton-type methods with higher order of conver-
gence. The first of these is Halley’s method with cubic order of convergence if
the starting point is close enough to the root x∗ . As with Newton’s method it
can be applied in the case where the root x∗ is simple; that is, the derivative
of f is bounded away from zero in a neighbourhood of x∗ . In addition f needs
to be three times continuously differentiable.
The method is derived by noting that the function defined by

   g(x) = f(x) / √|f'(x)|

also has a root at x∗. Applying Newton's method to g gives

   x^(n+1) = x^(n) − g(x^(n)) / g'(x^(n)).

The derivative of g(x) is given by

   g'(x) = (2 [f'(x)]² − f(x) f''(x)) / (2 f'(x) √|f'(x)|).

With this the update formula is

   x^(n+1) = x^(n) − 2 f(x^(n)) f'(x^(n)) / (2 [f'(x^(n))]² − f(x^(n)) f''(x^(n))).

The formula can be rearranged to show the similarity between Halley's method and Newton's method

   x^(n+1) = x^(n) − (f(x^(n)) / f'(x^(n))) [1 − (f(x^(n)) / f'(x^(n))) (f''(x^(n)) / (2 f'(x^(n))))]^{−1}.

We see that when the second derivative is close to zero near x∗ then the itera-
tion is nearly the same as Newton’s method. The expression f (x(n) )/f 0 (x(n) )
is only calculated once. This form is particularly useful when f 00 (x(n) )/f 0 (x(n) )
can be simplified.
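A MATLAB sketch of this form of Halley's iteration (not one of the book's listings); the first and second derivatives are supplied by the caller.

function x = halley(f, df, d2f, x, tol, maxit)
% Halley's method (sketch); df and d2f are the first and second derivatives.
for n = 1:maxit
    fx = f(x); dfx = df(x);
    ratio = fx/dfx;                              % the Newton correction f/f'
    step = ratio/(1 - ratio*d2f(x)/(2*dfx));     % Halley correction
    x = x - step;
    if abs(step) < tol
        return
    end
end
end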
The technique is also known as Bailey’s method when written in the fol-
lowing form:
   x^(n+1) = x^(n) − f(x^(n)) [f'(x^(n)) − f(x^(n)) f''(x^(n)) / (2 f'(x^(n)))]^{−1}.

As with Newton’s method the convergence behaviour of Halley’s method


depends strongly on the position of the starting point as the following exercise
shows.
Exercise 4.2. Write a program which takes a polynomial of degree between 2


and 7 as input, applies Halley’s root finding method, and colours the basins of
attraction for each root a different colour. Try it out for the polynomial z n − 1
for n = 2, . . . , 7.
More generally, the Householder methods of order k ∈ N are given by the
formula
   x^(n+1) = x^(n) + k (1/f)^{(k−1)}(x^(n)) / (1/f)^{(k)}(x^(n)),

where (1/f)^{(k)} denotes the kth derivative of 1/f. If f is k + 1 times continuously
differentiable and the root at x∗ is a simple root then the rate of convergence
is k + 1, provided that the starting point is sufficiently close to x∗ .
Of course all these methods depend on the ability to calculate the deriva-
tives of f . The following method however does not.

4.5 Müller’s Method


The methods in this section and the following are motivated by the secant
method. The secant method interpolates two function values linearly and takes
the intersection of this interpolant with the x-axis as the next approximation
to the root. Müller’s method uses three function values and quadratic inter-
polation instead. Figure 4.3 illustrates this.
Having calculated x^(n−2), x^(n−1) and x^(n), a polynomial p(x) = a(x − x^(n))² + b(x − x^(n)) + c is fitted to the data:

f (x(n−2) ) = a(x(n−2) − x(n) )2 + b(x(n−2) − x(n) ) + c,


f (x(n−1) ) = a(x(n−1) − x(n) )2 + b(x(n−1) − x(n) ) + c,
f (x(n) ) = c.

Solving for a and b, we have


   a = [(x^(n−1) − x^(n))(f(x^(n−2)) − f(x^(n))) − (x^(n−2) − x^(n))(f(x^(n−1)) − f(x^(n)))] / [(x^(n−2) − x^(n))(x^(n−1) − x^(n))(x^(n−2) − x^(n−1))],

   b = [(x^(n−2) − x^(n))²(f(x^(n−1)) − f(x^(n))) − (x^(n−1) − x^(n))²(f(x^(n−2)) − f(x^(n)))] / [(x^(n−2) − x^(n))(x^(n−1) − x^(n))(x^(n−2) − x^(n−1))].

The next approximation x(n+1) is one of the roots of p and the one closer to
x(n) is chosen. To avoid errors due to loss of significance we use the alternative
formula for the roots derived in 1.4,
−2c
x(n+1) − x(n) = √ , (4.5)
b + sgn(b) b2 − 4ac

where sgn(b) denotes the sign of b. This way the root which gives the largest
denominator and thus is closest to x(n) is chosen.
Note that x(n+1) can be complex even if all previous approximations have
been real. This is in contrast to previous root-finding methods where the
iterates remain real if the starting value is real. This behaviour can be an advantage, if complex roots are to be found or a disadvantage if the roots are known to be real.

Figure 4.3 Illustration of the secant method on the left and Müller's method on the right
An alternative representation uses the Newton form of the interpolating
polynomial

p(x) = f (x(n) ) + (x − x(n) )f [x(n) , x(n−1) ]

+(x − x(n) )(x − x(n−1) )f [x(n) , x(n−1) , x(n−2) ],

where f [x(n) , x(n−1) ] and f [x(n) , x(n−1) , x(n−2) ] denote divided differences. Af-
ter some manipulation using the recurrence relation for divided differences, one
can see that

a = f [x(n) , x(n−1) , x(n−2) ],


b = f [x(n) , x(n−1) ] + f [x(n) , x(n−2) ] − f [x(n−1) , x(n−2) ].

Thus (4.5) can be evaluated fast using divided differences.


The order of convergence is approximately 1.84 if the initial approxima-
tions x(0) , x(1) , and x(2) are close enough to the simple root x∗ . This is better
than the secant method with 1.62 and worse than Newton’s method with 2.
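One step of the method can be sketched in MATLAB using the divided-difference form of a, b and c (not one of the book's listings; the helper name is ours). The sign in (4.5) is chosen here so that the denominator has the larger modulus, which reduces to sgn(b) for real b; because of the square root the returned value may be complex.

function xnew = muller_step(f, xa, xb, xc)
% One step of Mueller's method (sketch); xa, xb, xc play the roles of
% x^(n-2), x^(n-1), x^(n).
fa = f(xa); fb = f(xb); fc = f(xc);
d1 = (fc - fb)/(xc - xb);            % f[x^(n), x^(n-1)]
d2 = (fc - fa)/(xc - xa);            % f[x^(n), x^(n-2)]
d3 = (fb - fa)/(xb - xa);            % f[x^(n-1), x^(n-2)]
a = (d1 - d3)/(xc - xa);             % f[x^(n), x^(n-1), x^(n-2)]
b = d1 + d2 - d3;
c = fc;
r = sqrt(b^2 - 4*a*c);               % may be complex
if abs(b + r) >= abs(b - r)          % pick the larger denominator, cf. (4.5)
    xnew = xc - 2*c/(b + r);
else
    xnew = xc - 2*c/(b - r);
end
end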

4.6 Inverse Quadratic Interpolation


The occurrence of complex approximants can be avoided by interpolating
the inverse of f . This method is known as the inverse quadratic interpolation
method . To interpolate the inverse one uses the Lagrange interpolation formula
swapping the roles of f and x

   p(y) = x^(n−2) (y − f(x^(n−1)))(y − f(x^(n))) / [(f(x^(n−2)) − f(x^(n−1)))(f(x^(n−2)) − f(x^(n)))]
        + x^(n−1) (y − f(x^(n−2)))(y − f(x^(n))) / [(f(x^(n−1)) − f(x^(n−2)))(f(x^(n−1)) − f(x^(n)))]
        + x^(n) (y − f(x^(n−2)))(y − f(x^(n−1))) / [(f(x^(n)) − f(x^(n−2)))(f(x^(n)) − f(x^(n−1)))].
Now we are interested in a root of f and thus substitute y = 0 and use
p(0) = x(n+1) . Since the inverse of f is interpolated this immediately gives a
formula for x(n+1)
   x^(n+1) = x^(n−2) f(x^(n−1)) f(x^(n)) / [(f(x^(n−2)) − f(x^(n−1)))(f(x^(n−2)) − f(x^(n)))]
           + x^(n−1) f(x^(n−2)) f(x^(n)) / [(f(x^(n−1)) − f(x^(n−2)))(f(x^(n−1)) − f(x^(n)))]
           + x^(n) f(x^(n−2)) f(x^(n−1)) / [(f(x^(n)) − f(x^(n−2)))(f(x^(n)) − f(x^(n−1)))].
If the starting point is close to the root then the order of convergence is approximately 1.8. However, the algorithm can fail completely if any two of the function values f(x^(n−2)), f(x^(n−1)), and f(x^(n)) coincide. Nevertheless, it plays an important role in mixed algorithms, which are considered in the last section.
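The corresponding MATLAB sketch of one step is a direct transcription of this formula (not one of the book's listings); as noted, it breaks down when any two of the three function values coincide.

function xnew = iqi_step(f, xa, xb, xc)
% One step of inverse quadratic interpolation (sketch); xa, xb, xc play
% the roles of x^(n-2), x^(n-1), x^(n).
fa = f(xa); fb = f(xb); fc = f(xc);
xnew = xa*fb*fc/((fa - fb)*(fa - fc)) ...
     + xb*fa*fc/((fb - fa)*(fb - fc)) ...
     + xc*fa*fb/((fc - fa)*(fc - fb));
end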

4.7 Fixed Point Iteration Theory


All the algorithms we have encountered so far are examples of fixed point
iterative methods. In these methods each iteration has the form

x(n+1) = g(x(n) ). (4.6)

If we have convergence to some limit x∗ , then x∗ is a fixed point of the function


g, i.e., x∗ = g(x∗ ), hence the name.
Theorem 4.1. Suppose for the iteration given by (4.6), we have

   |g(x) − g(x′)| ≤ λ |x − x′|   (4.7)

for all x and x′ in an interval I = [x^(0) − δ, x^(0) + δ] for some constant λ.


This means g is Lipschitz continuous on I. Suppose further that λ < 1 and
|x(0) − g(x(0) )| < (1 − λ)δ. Then
1. all iterates lie in I,
2. the iterates converge and
3. the solution is unique.


Proof. We have
|x(n+1) − x(n) | = |g(x(n) ) − g(x(n−1) )|
≤ λ|x(n) − x(n−1) |
≤ λn |x(1) − x(0) |
≤ λn (1 − λ)δ,

since x(1) − x(0) = g(x(0) ) − x(0) . Thus for each new iterate
   |x^(n+1) − x^(0)| ≤ Σ_{k=0}^{n} |x^(k+1) − x^(k)| ≤ Σ_{k=0}^{n} λ^k (1 − λ)δ ≤ (1 − λ^{n+1})δ ≤ δ.
Hence all iterates lie in the interval I. This also means that the sequence of
iterates is bounded. Moreover,
   |x^(n+p) − x^(n)| ≤ Σ_{k=n}^{n+p−1} |x^(k+1) − x^(k)| ≤ Σ_{k=n}^{n+p−1} λ^k (1 − λ)δ ≤ (1 − λ^p) λ^n δ → 0   as n → ∞.
Since λ < 1, the sequence is a Cauchy sequence and hence converges.
Suppose the solution is not unique, i.e., there exist x∗ ≠ x̃∗ such that x∗ = g(x∗) and x̃∗ = g(x̃∗). Then

|x∗ − x̃∗ | = |g(x∗ ) − g(x̃∗ )|


≤ λ|x∗ − x̃∗ |
< |x∗ − x̃∗ |,
which is a contradiction.
If in (4.7) x is close to x′ and g(x) is differentiable, then it is reasonable to assume that λ ≈ |g′(x)|. Indeed it can be shown that if |g′(x∗)| < 1 for a fixed point x∗ then there exists an interval of convergence. Such a fixed point x∗ is called a point of attraction for the iterative method.
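A generic MATLAB sketch of the iteration (4.6) (not one of the book's listings):

function x = fixedpoint(g, x, tol, maxit)
% Fixed point iteration x <- g(x) (sketch).
for n = 1:maxit
    xnew = g(x);
    if abs(xnew - x) < tol
        x = xnew;
        return
    end
    x = xnew;
end
end

For example, fixedpoint(@(x) cos(x), 1, 1e-10, 200) converges to the fixed point x∗ ≈ 0.739 of the cosine, since |g′(x∗)| = |sin(x∗)| < 1 there.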
A good way to visualize fixed point iteration is through a cobweb plot
(Figure 4.4) where the function as well as the line y = x is plotted. The last
function value becomes the new x-value illustrated by the horizontal lines.
Then the function is evaluated there illustrated by the vertical lines.
Figure 4.4 Cobweb plot for fixed point iteration

4.8 Mixed Methods


The Bus and Dekker algorithm is a hybrid of the secant and the bisection
method. Starting from an initial interval [a0 , b0 ], where f (a0 ) and f (b0 ) have
opposite signs, the algorithm constructs a sequence of sub-intervals, each con-
taining the zero. The current approximation to the root is bn called the iterate,
and an is called its contrapoint. In each iteration two values are calculated
   m = (a_n + b_n) / 2,
   s = b_n − f(b_n) (b_n − b_{n−1}) / (f(b_n) − f(b_{n−1})),

where b−1 is taken to be a0 . The first value m is the midpoint of the interval,
while the second value s is the approximation to the root given by the secant
method. If s lies between bn and m, it becomes the next iterate, that is bn+1 =
s, otherwise the midpoint is chosen, bn+1 = m. If f (an ) and f (bn+1 ) have
opposite signs, the new contrapoint is an+1 = an , otherwise an+1 = bn , since
f (bn ) and f (bn+1 ) must have opposite signs in this case, since f (an ) and
f (bn ) had opposite signs in the previous iteration. Additionally, if the modulus
of f (an+1 ) is less than the modulus of f (bn+1 ), an+1 is considered a better
approximation to the root and it becomes the new iterate while bn+1 becomes
the new contrapoint. Thus the iterate is always the better approximation.
This method performs generally well, but there are situations where every
iteration employs the secant method and convergence is very slow, requiring
far more iterations than the bisection method. In particular, bn − bn−1 might
become arbitrarily small while the length of the interval given by an and bn
decreases very slowly. The following method tries to alleviate this problem.
Brent’s method combines the bisection method, the secant method, and
inverse quadratic interpolation. It is also known as the Wijngaarden–Dekker–
Brent method . At every iteration, Brent’s method decides which method out
of these three is likely to do best, and proceeds by doing a step according to
that method. This gives a robust and fast method.
A numerical tolerance ε is chosen. The method ensures that the bisection method is used if consecutive iterates are too close together. More specifically, if the previous step performed the bisection method and |b_n − b_{n−1}| ≤ ε, then the bisection method will be performed again. Similarly, if the previous step performed interpolation (either linear interpolation for the secant method or inverse quadratic interpolation) and |b_{n−1} − b_{n−2}| ≤ ε, then the bisection method will be performed again. Thus b_n and b_{n−1} are allowed to become arbitrarily close at most two times in a row.
Additionally, the intersection s from interpolation (either linear or inverse quadratic) is only accepted as new iterate if |s − b_n| < ½|b_n − b_{n−1}|, if the previous step used bisection, or if |s − b_n| < ½|b_{n−1} − b_{n−2}|, if the previous step used interpolation (linear or inverse quadratic). These conditions enforce that consecutive interpolation steps halve the step size every two iterations until the step size becomes less than ε after at most 2 log₂(|b_{n−1} − b_{n−2}|/ε) iterations, which invokes a bisection.
Brent's algorithm uses linear interpolation, that is, the secant method, if any two of f(b_n), f(a_n), or f(b_{n−1}) coincide. If they are all distinct, inverse quadratic interpolation is used. However, the requirement for s to lie between m and b_n is changed: s has to lie between (3a_n + b_n)/4 and b_n.
Exercise 4.3. Implement Brent’s algorithm. It should terminate if either
f (bn ) or f (s) is zero or if |bn − an | is small enough. Use the bisection rule if
s is not between (3an + bn )/4 and bn for both linear and inverse quadratic
interpolation or if any of Brent’s conditions arises. Try your program on
f (x) = x3 − x2 − 4x + 4, which has zeros at −2, 1, and 2. Start with the
interval [−4, 2.5], which contains all roots. List which method is used in each
iteration.

4.9 Revision Exercises


Exercise 4.4. The secant method for finding the solution of f (x) = 0 is given
by
   x^(n+1) = (f(x^(n)) x^(n−1) − f(x^(n−1)) x^(n)) / (f(x^(n)) − f(x^(n−1)))
           = x^(n) − f(x^(n)) (x^(n) − x^(n−1)) / (f(x^(n)) − f(x^(n−1))).
(a) By means of a sketch graph describe how the method works in a simple
case and give an example where it might fail to converge.
(b) Let x∗ be the root and let e(n) denote the error e(n) = x(n) − x∗ at the
nth iteration. Express the error at the (n + 1)th iteration in terms of the
errors in the previous two iterations.
(c) Approximate f (x∗ + e(n−1) )/e(n−1) , f (x∗ + e(n) )/e(n) , and f (x∗ + e(n) ) −
f (x∗ + e(n−1) ) using Taylor expansion. You may assume that x(n) and
x(n−1) are close enough to the root such that the terms O([e(n) ]2 ) and
O([e(n−1) ]2 ) can be neglected.
(d) Using the derived approximation and the expression derived for e^(n+1), show that the error at the (n + 1)th iteration is approximately

   e^(n+1) ≈ e^(n) e^(n−1) f''(x∗) / (2 f'(x∗)).

(e) From |e(n+1) | = O(|e(n) ||e(n−1) |) derive p such that |e(n+1) | = O(|e(n) |p ).
(f ) Derive the Newton method from the secant method.
(g) Let f(x) = x². Letting x^(1) = ½ x^(0), for both the secant and the Newton method express x^(2), x^(3), and x^(4) in terms of x^(0).
Exercise 4.5. Newton’s method for finding the solution of f (x) = 0 is given
by
   x^(n+1) = x^(n) − f(x^(n)) / f'(x^(n)),
where x(n) is the approximation to the root x∗ in the nth iteration. The starting
point x(0) is already close enough to the root.

(a) By means of a sketch graph describe how the method works in a simple
case and give an example where it might fail to converge.

(b) Using the Taylor expansion of f (x∗ ) = 0 about x(n) , relate the error in
the next iteration to the error in the current iteration and show that the
convergence of Newton’s method is quadratic.
(c) Generalize Newton’s method to higher dimensions.
(d) Let

   f(x) = f(x, y) = (½x² + y, ½y² + x)^T.
The roots lie at (0, 0) and (−2, −2). Calculate the Jacobian of f and its
inverse.
(e) Why does Newton’s method fail near (1, 1) and (−1, −1)?

(f ) Let x(0) = (1, 0). Calculate x(1) , x(2) and x(3) , and their Euclidean norms.
(g) The approximations converge to (0, 0). Show that the speed of convergence
agrees with the theoretical quadratic speed of convergence.

Exercise 4.6. The Newton–Raphson iteration for solving f (x) = 0 is


   x̂ = x − f(x) / f'(x).
(a) By drawing a carefully labeled graph, explain the graphical interpretation
of this formula.
(b) What is the order of convergence?
(c) Under which conditions can this order be achieved?
(d) Consider f (x) = x3 +x2 −2. The following table shows successive iterations
for each of the three starting values (i) x = 1.5, (ii) x = 0.2, (iii ) x =
−0.5. Note that, to the accuracy shown, each iteration finds the root at
x = 1.
 n      (i)                 (ii)                 (iii)
 0      1.50000 × 10^0      2.00000 × 10^−1      −5.00000 × 10^−1
 1      1.12821 × 10^0      3.95384 × 10^0       −8.00000 × 10^0
 2      1.01152 × 10^0      2.57730 × 10^0       −5.44318 × 10^0
 3      1.00010 × 10^0      1.70966 × 10^0       −3.72976 × 10^0
 4      1.00000 × 10^0      1.22393 × 10^0       −2.56345 × 10^0
 5      1.00000 × 10^0      1.03212 × 10^0       −1.72202 × 10^0
 6                          1.00079 × 10^0       −9.62478 × 10^−1
 7                          1.00000 × 10^0       1.33836 × 10^0
 8                          1.00000 × 10^0       1.06651 × 10^0
 9                                               1.00329 × 10^0
 10                                              1.00000 × 10^0
 11                                              1.00000 × 10^0

Sketch the graph of f (x) and sketch the first iteration for cases (i) and
(ii) to show why (i) converges faster than (ii).
(e) In a separate (rough) sketch, show the first two iterations for case (iii).
(f ) Now consider f (x) = x4 − 3x2 − 2. Calculate two Newton–Raphson it-
erations from the starting value x = 1. Comment on the prospects for
convergence in this case.
(g) Give further examples where the method might fail to converge or con-
verges very slowly.
Exercise 4.7. The following reaction occurs when water vapor is heated
   H₂O ⇌ H₂ + ½ O₂.
The fraction x ∈ [0, 1] of H2 O that is consumed satisfies the equation


   K = (x / (1 − x)) √(2 p_t / (2 + x)),   (4.8)

where K and p_t are given constants. The following figure illustrates this:

(a) Rephrase the problem of determining x as finding the root of a function


f (x) and state f (x). Sketch a graph illustrating the rephrased problem.
(b) Describe the bisection method to find the root of a function. Comment on
the robustness and speed of convergence of the method.
(c) Given an approximation x(n) to the root x∗ of the function f (x), give the
formula how Newton’s method calculates the next approximation x(n+1) ,
explain what this means geometrically, and expand your sketch with an
example of how Newton’s method works.

(d) What is the order of convergence of Newton’s method?


(e) What happens to the right-hand side of Equation (4.8) if x approaches
1 and what does this mean for Newton’s method, if the starting point is
chosen close to 1?

(f ) The derivative of f (x) at 0 is 1. What is the next approximation if 0 is


chosen as the starting point? Depending on K, what problem might this
cause?
(g) Give another example to demonstrate when Newton's method might fail to converge.
CHAPTER 5

Numerical Integration

In this part of the course we consider the numerical evaluation of an integral of the form

   I = ∫_a^b f(x) dx.

A quadrature rule approximates it by

   I ≈ Q_n(f) = Σ_{i=1}^{n} w_i f(x_i).

The points xi , i = 1, . . . , n are called the abscissae chosen such that a ≤ x1 <
. . . < xn ≤ b. The coefficients wi are called the weights. Quadrature rules are
derived by integrating a polynomial interpolating the function values at the
abscissae. Usually only positive weights are allowed since whether something
is added or subtracted should be determined by the sign of the function at
this point. This also avoids loss of significance.

5.1 Mid-Point and Trapezium Rule


The simplest rule arises from interpolating the function by a constant, which
is a polynomial of degree zero, and it is reasonable to let this constant take
the value of the function at the midpoint of the interval [a, b]. This is known
as the mid-point rule or mid-ordinate rule. It is illustrated in Figure 5.1. The formula is

   I ≈ (b − a) f((a + b)/2).
The mid-point is the only abscissa and its weight is b − a.
Theorem 5.1. The mid-point rule has the following error term
   ∫_a^b f(x) dx = (b − a) f((a + b)/2) + (1/24) f''(ξ)(b − a)³,

where ξ is some point in the interval [a, b].

Figure 5.1 The mid-point rule

Proof. We use Taylor expansion of f around the midpoint (a + b)/2:

   f(x) = f((a + b)/2) + (x − (a + b)/2) f'((a + b)/2) + ½ (x − (a + b)/2)² f''(ξ),

where ξ lies in the interval [a, b]. Integrating this over the interval [a, b] gives

   [x f((a + b)/2) + ½ (x − (a + b)/2)² f'((a + b)/2) + (1/6)(x − (a + b)/2)³ f''(ξ)]_a^b
   = (b − a) f((a + b)/2) + (1/24) f''(ξ)(b − a)³,
since the term involving the first derivative becomes zero.
Alternatively, we can integrate the linear polynomial fitted to the two
endpoints of the interval as Figure 5.2 illustrates. This rule is the trapezium
rule described by the formula
   I ≈ (b − a)/2 [f(a) + f(b)].

Theorem 5.2. The trapezium rule has the following error term
   ∫_a^b f(x) dx = (b − a)/2 [f(a) + f(b)] − (1/12) f''(ξ)(b − a)³,

where ξ is some point in the interval [a, b].
Proof. We use Taylor expansion of f (a) around the point x
   f(a) = f(x) + (a − x) f'(x) + ½ (a − x)² f''(ξ),
Figure 5.2 The trapezium rule

where ξ lies in the interval [a, b]. This can be rearranged as

   f(x) = x f'(x) + f(a) − a f'(x) − ½ (x − a)² f''(ξ).

Integrating this over the interval [a, b] gives

   ∫_a^b f(x) dx = ∫_a^b x f'(x) dx + [x f(a) − a f(x) − (1/6)(x − a)³ f''(ξ)]_a^b
                 = [x f(x)]_a^b − ∫_a^b f(x) dx + (b − a) f(a) − a[f(b) − f(a)] − (1/6)(b − a)³ f''(ξ),

where we used integration by parts for the first integral on the right-hand side. Solving for ∫_a^b f(x) dx this becomes

   ∫_a^b f(x) dx = ½ [b f(b) − a f(a) + b f(a) − a f(b) − (1/6)(b − a)³ f''(ξ)]
                 = (b − a)/2 [f(a) + f(b)] − (1/12)(b − a)³ f''(ξ).
2 12

5.2 The Peano Kernel Theorem


So far we deduced the error term from first principles, but the Peano kernel
theorem introduced in the chapter on interpolation proves very useful. In the
case of a quadrature the functional L(f ) describing the approximation error
is the difference between the value of the integral and the value given by the
quadrature,
   L(f) = ∫_a^b f(x) dx − Σ_{i=1}^{n} w_i f(x_i).

Thus L maps the space of functions C^{k+1}[a, b] to R. L is a linear functional,
i.e., L(αf + βg) = αL(f ) + βL(g) for all α, β ∈ R, since integration and
weighted summation are linear operations themselves. We assume that the
quadrature is constructed in such a way that it is correct for all polynomials
of degree at most k, i.e., L(p) = 0 for all p ∈ Pk [x]. Recall
Definition 5.1 (Peano kernel). The Peano kernel K of L is the function
defined by
   K(θ) := L[(x − θ)_+^k]   for θ ∈ [a, b].
and
Theorem 5.3 (Peano kernel theorem). Let L be a linear functional such that
L(p) = 0 for all p ∈ Pk [x]. Provided that the exchange of L with the integration
is valid, then for f ∈ C k+1 [a, b]
   L(f) = (1/k!) ∫_a^b K(θ) f^(k+1)(θ) dθ.

Here the order of L and the integration can be swapped, since L consists
of an integration and a weighted sum of function evaluations. The theorem
has the following extension:
Theorem 5.4. If K does not change sign in (a, b), then for f ∈ C^{k+1}[a, b]

   L(f) = (1/k!) [∫_a^b K(θ) dθ] f^(k+1)(ξ)

for some ξ ∈ (a, b).


We derive the result about the error term for the trapezium rule again this
time applying the Peano kernel theorem. By construction the trapezium rule
is exact for all linear polynomials, thus k = 1. The kernel is given by
   K(θ) = L[(x − θ)_+] = ∫_a^b (x − θ)_+ dx − (b − a)/2 [(a − θ)_+ + (b − θ)_+]
        = ∫_θ^b (x − θ) dx − (b − a)(b − θ)/2
        = [(x − θ)²/2]_θ^b − (b − a)(b − θ)/2
        = (b − θ)²/2 − (b − a)(b − θ)/2 = (b − θ)(a − θ)/2,
where we used the fact that (a − θ)+ = 0, since a ≤ θ for all θ ∈ [a, b] and
(b − θ)+ = b − θ, since b ≥ θ for all θ ∈ [a, b].
The kernel K(θ) does not change sign for θ ∈ (a, b), since b − θ > 0 and
a − θ < 0 for θ ∈ (a, b). The integral over [a, b] of the kernel is given by
   ∫_a^b K(θ) dθ = ∫_a^b (b − θ)(a − θ)/2 dθ = −(1/12)(b − a)³,
which can be easily verified. Thus the error for the trapezium rule is
"Z #
b
1 1
L(f ) = K(θ)dθ f 00 (ξ) = − (b − a)3 f 00 (ξ)
1! a 12

for some ξ ∈ [a, b].

5.3 Simpson’s Rule


The error term can be reduced significantly, if a quadratic polynomial is fitted
through the end-points and the mid-point. This is known as Simpson’s rule
and the formula is
   I ≈ (b − a)/6 [f(a) + 4 f((a + b)/2) + f(b)].
The rule is illustrated in Figure 5.3.
Theorem 5.5. Simpson’s rule has the following error term
   ∫_a^b f(x) dx = (b − a)/6 [f(a) + 4 f((a + b)/2) + f(b)] − (1/90) f^(4)(ξ) ((b − a)/2)⁵,

where ξ is some point in the interval [a, b].

Proof. We apply Peano’s kernel theorem to prove this result. Firstly, Simp-
son’s rule is correct for all quadratic polynomials by construction. However,
it is also correct for cubic polynomials, which can be proven by applying it
to the monomial x3 . The value of the integral of x3 over the interval [a, b] is
(b4 − a4 )/4. Simpson’s rule applied to x3 gives
   (b − a)/6 [a³ + 4(a + b)³/2³ + b³]
   = (b − a)/12 [2a³ + a³ + 3a²b + 3ab² + b³ + 2b³]
   = (b − a)/4 [a³ + a²b + ab² + b³]
   = ¼ (a³b + a²b² + ab³ + b⁴ − a⁴ − a³b − a²b² − ab³)
   = (b⁴ − a⁴)/4.
Figure 5.3 Simpson’s rule

So Simpson’s rule is correct for all cubic polynomials, since they form a linear
space, and we have k = 3. However, it is not correct for polynomials of degree
four. Let for example a = 0 and b = 1, then the integral of x⁴ over the interval [0, 1] is 1/5 while the approximation by Simpson's rule is (1/6)(0⁴ + 4(1/2)⁴ + 1⁴) = 5/24.
The kernel is given by

   K(θ) = L[(x − θ)_+^3]
        = ∫_a^b (x − θ)_+^3 dx − (b − a)/6 [(a − θ)_+^3 + 4((a + b)/2 − θ)_+^3 + (b − θ)_+^3]
        = ∫_θ^b (x − θ)³ dx − (b − a)/6 [4((a + b)/2 − θ)_+^3 + (b − θ)³]
        = (b − θ)⁴/4 − (b − a)/6 [4((a + b)/2 − θ)³ + (b − θ)³]   for θ ∈ [a, (a + b)/2],
        = (b − θ)⁴/4 − (b − a)/6 (b − θ)³                         for θ ∈ [(a + b)/2, b].

For θ ∈ [(a + b)/2, b], the result can be simplified as

   K(θ) = (b − θ)³ [(b − θ)/4 − (b − a)/6] = (b − θ)³ [b/12 − θ/4 + a/6].

The first term (b − θ)³ is always positive, since θ < b. The expression in the square brackets is decreasing and has a zero at the point (2/3)a + (1/3)b = ½(a + b) − (1/6)(b − a) < ½(a + b), and thus it is negative on [(a + b)/2, b]. Hence K(θ) is negative on [(a + b)/2, b].
For θ ∈ [a, (a + b)/2] the additional term (2/3)(b − a)[(a + b)/2 − θ]³, which is positive in this interval, is subtracted. Thus, K(θ) is also negative here.
Hence the kernel does not change sign. We need to integrate K(θ) to obtain
the error term

   ∫_a^b K(θ) dθ = ∫_a^b [(b − θ)⁴/4 − (b − a)/6 (b − θ)³] dθ − (b − a)/6 ∫_a^{(a+b)/2} 4((a + b)/2 − θ)³ dθ
                 = [−(b − θ)⁵/20]_a^b − (b − a)/6 [−(b − θ)⁴/4]_a^b − (b − a)/6 [−((a + b)/2 − θ)⁴]_a^{(a+b)/2}
                 = (b − a)⁵/20 − (b − a)⁵/24 − (b − a)/6 ((b − a)/2)⁴
                 = (8/5 − 4/3 − 1/3) ((b − a)/2)⁵ = −(1/15) ((b − a)/2)⁵.

Thus the error for Simpson’s rule is


"Z # 5
b 
1 (4) 1 b−a
L(f ) = K(θ)dθ f (ξ) = − f (4) (ξ)
3! a 90 2

for some ξ ∈ [a, b].
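The three simple rules can be compared numerically with a few lines of MATLAB (a sketch, not one of the book's listings); the test integral ∫_0^1 e^x dx = e − 1 is an arbitrary smooth example.

% compare mid-point, trapezium and Simpson's rule on int_0^1 exp(x) dx
f = @(x) exp(x);  a = 0;  b = 1;
exact = exp(1) - 1;
Imid  = (b - a)*f((a + b)/2);
Itrap = (b - a)/2*(f(a) + f(b));
Isimp = (b - a)/6*(f(a) + 4*f((a + b)/2) + f(b));
disp([Imid - exact, Itrap - exact, Isimp - exact])

The mid-point error is roughly half the trapezium error and of opposite sign, while Simpson's rule is far more accurate, in line with the error terms derived above.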


Exercise 5.1. The mid-point rule (b − a) f(½(a + b)) is exact for polynomials
of degree 1. Use Peano’s kernel theorem to find a formula for L(f ). (Hint:
This is similar to the trapezium rule, except that it is harder to prove that
K(θ) does not change sign in [a, b].)

5.4 Newton–Cotes Rules


The idea of approximating f by its interpolating polynomial at equally spaced
points and then integrating can be extended to n points. These rules are called
Newton–Cotes rules or Newton–Cotes formulae. We distinguish two types of
Newton–Cotes rules, the closed type, which uses the function values at the
end points of the interval, and the open type, which does not use the function
values at the end points of the interval.
The closed Newton–Cotes rules are

Trapezium (b − a)/2[f (a) + f (b)]


Simpson’s (b − a)/6[f (a) + 4f ((a + b)/2) + f (b)]
Simpson’s 3/8 (b − a)/8[f (a) + 3f ((2a + b)/3) + 3f ((a + 2b)/3) + f (b)]
Boole’s (b − a)/90[7f (a) + 32f ((3a + b)/4) + 12f ((a + b)/2)
+32f ((a + 3b)/4) + 7f (b)]
and their error terms are

Trapezium −(b − a)3 /12f 00 (ξ)


Simpson’s −(b − a)5 /2880f (4) (ξ)
Simpson’s 3/8 −(b − a)5 /6480f (4) (ξ)
Boole’s −(b − a)7 /1935360f (6) (ξ)

The open Newton–Cotes rules are

Mid-point (b − a)f ((a + b)/2)


unnamed (b − a)/2[f ((2a + b)/3) + f ((a + 2b)/3)]
Milne’s (b − a)/3[2f ((3a + b)/4) − f ((a + b)/2) + 2f ((a + 3b)/4)]

and their error terms are

Mid-point (b − a)3 /24f 00 (ξ)


unnamed (b − a)3 /36f 00 (ξ)
Milne’s 7(b − a)5 /23040f (4) (ξ)

High-order Newton–Cotes Rules are rarely used, for two reasons. Firstly,
for larger n some of the weights are negative, which leads to numerical instabil-
ity. Secondly, methods based on high-order polynomials with equally spaced
points have the same disadvantages as high-order polynomial interpolation,
as we have seen with Runge’s phenomenon.

5.5 Gaussian Quadrature


Since equally spaced abscissae cause problems in quadrature rules, the next
step is to choose the abscissae x1 , . . . , xn such that the formula is accurate for
the highest possible degree of polynomial. These rules are known as Gaussian
quadrature or Gauss–Christoffel quadrature.
To construct these quadrature rules, we need to revisit orthogonal polyno-
mials. Let Pn [x] be the space of polynomials of degree at most n. We say that
pn ∈ Pn [x] is the nth orthogonal polynomial over (a, b) with respect to the
weight function w(x), where w(x) > 0 for all x ∈ [a, b], if for all p ∈ Pn−1 [x]
   ∫_a^b p_n(x) p(x) w(x) dx = 0.

Theorem 5.6. All roots of pn are real, have multiplicity one, and lie in (a, b).
Proof. Let x1 , . . . , xm be the places where pn changes sign in (a, b). These are
roots of pn , but pn does not necessarily change sign at every root (e.g., if the
root has even multiplicity where the curve just touches the x-axis, but does
not cross it). For pn to change sign the root must have odd multiplicity. There
could be no places in (a, b) where pn changes sign, in which case m = 0. We
know that m ≤ n, since pn has at most n real roots.
The polynomial (x − x1 )(x − x2 ) · · · (x − xm ) changes sign in the same way
as pn . Hence the product of the two does not change sign at all in (a, b). Now
pn is orthogonal to all polynomials of degree less than n. Thus,
   ∫_a^b (x − x_1)(x − x_2) · · · (x − x_m) p_n(x) w(x) dx = 0,

unless m = n. However, the integrand is non-zero and never changes sign.


Therefore we must have n = m, and x1 , . . . , xn are the n distinct zeros of pn
in (a, b).

Gaussian quadrature based on orthogonal polynomials is developed as fol-


lows. Let x1 , . . . , xn be the roots of the nth orthogonal polynomial pn . Let Li
be the ith Lagrange interpolating polynomial for these points, i.e., Li is the
unique polynomial of degree n − 1 such that Li (xi ) = 1 and Li (xk ) = 0 for
k ≠ i. The weights for the quadrature rule are then calculated according to

   w_i = ∫_a^b L_i(x) w(x) dx

and we have

   ∫_a^b f(x) w(x) dx ≈ Σ_{i=1}^{n} w_i f(x_i).

Theorem 5.7. The above approximation is exact if f is a polynomial of degree at most 2n − 1.
Proof. Suppose f is a polynomial of degree at most 2n − 1. Then f = pn q + r,
where q, r ∈ Pn−1 [x]. Since pn is orthogonal to all polynomials of degree at
most n − 1, we have
   ∫_a^b f(x) w(x) dx = ∫_a^b [p_n(x) q(x) + r(x)] w(x) dx = ∫_a^b r(x) w(x) dx.
On the other hand, r(x) = Σ_{i=1}^{n} r(x_i) L_i(x) and f(x_i) = p_n(x_i) q(x_i) + r(x_i) = r(x_i), since p_n vanishes at x_1, . . . , x_n. Therefore

   ∫_a^b r(x) w(x) dx = Σ_{i=1}^{n} f(x_i) ∫_a^b L_i(x) w(x) dx = Σ_{i=1}^{n} w_i f(x_i)

and the assertion follows.


Exercise 5.2. Calculate the weights w0 and w1 and the abscissae x0 and x1
such that the approximation
   ∫_0^1 f(x) dx ≈ w_0 f(x_0) + w_1 f(x_1)

is exact when f is a cubic polynomial. You may use the fact that x0 and x1
are the zeros of a quadratic polynomial which is orthogonal to all linear poly-
nomials. Verify your calculation by testing the formula when f (x) = 1, x, x2
and x3 .
The consequence of the theorem is that using equally spaced points, the
resulting method is only necessarily exact for polynomials of degree at most
n − 1. By picking the abscissae carefully, however, a method results which
is exact for polynomials of degree up to 2n − 1. For the price of storing the
same number of points, one gets much more accuracy for the same number
of function evaluations. However, if the values of the integrand are given as
empirical data, where it was not possible to choose the abscissae, Gaussian
quadrature is not appropriate.
The weights of Gaussian quadrature rules are positive. Consider L_k²(x), which is a polynomial of degree 2n − 2, and thus the quadrature formula is exact for it and we have

   0 < ∫_a^b L_k²(x) w(x) dx = Σ_{i=1}^{n} w_i L_k²(x_i) = w_k.

Gaussian quadrature rules based on a weight function w(x) work very well
for functions that behave like a polynomial times the weight, something which
occurs in physical problems. However, a change of variables may be necessary
for this condition to hold.
The most common Gaussian quadrature formula is the case where (a, b) =
(−1, 1) and w(x) ≡ 1. The orthogonal polynomials are then called Legendre
polynomials. To construct the quadrature rule, one must determine the roots
of the Legendre polynomial of degree n and then calculate the associated
weights.
As an example, let n = 2. The quadratic Legendre polynomial is x² − 1/3 with roots ±1/√3. The two interpolating Lagrange polynomials are (√3 x + 1)/2 and (−√3 x + 1)/2 and both integrate to 1 over [−1, 1]. Thus the two-point Gauss–Legendre rule is given by

   ∫_{−1}^{1} f(x) dx ≈ f(−1/√3) + f(1/√3)

and it is correct for all cubic polynomials.
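A quick MATLAB check on monomials (a sketch, not one of the book's listings) confirms that this two-point rule is exact up to degree three and first fails for x⁴.

% two-point Gauss-Legendre rule applied to x^j on [-1, 1]
for j = 0:4
    f = @(x) x.^j;
    Q = f(-1/sqrt(3)) + f(1/sqrt(3));        % both weights are 1
    exact = (1 - (-1)^(j+1))/(j + 1);        % exact value of the integral
    fprintf('j = %d:  rule = %.6f,  exact = %.6f\n', j, Q, exact)
end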
For n ≤ 5, the largest degree of polynomial for which the quadrature is
correct, the abscissae and corresponding weights are given in the following
table,
 n    2n − 1    abscissae                      weights
 2    3         ±1/√3                          1
 3    5         0                              8/9
                ±√(3/5)                        5/9
 4    7         ±√((3 − 2√(6/5))/7)            (18 + √30)/36
                ±√((3 + 2√(6/5))/7)            (18 − √30)/36
 5    9         0                              128/225
                ±(1/3)√(5 − 2√(10/7))          (322 + 13√70)/900
                ±(1/3)√(5 + 2√(10/7))          (322 − 13√70)/900
Notice that the abscissae are not uniformly distributed in the interval
[−1, 1]. They are symmetric about zero and cluster near the end points.
Exercise 5.3. Implement the Gauss–Legendre quadrature for n = 2, . . . , 5, approximate ∫_{−1}^{1} x^j dx for j = 1, . . . , 10, and compare the results to the true solution. Interpret your results.
Other choices of weight functions are listed in the following table
 Name                       Notation       Interval      Weight function
 Legendre                   P_n            [−1, 1]       w(x) ≡ 1
 Jacobi                     P_n^(α,β)      (−1, 1)       w(x) = (1 − x)^α (1 + x)^β
 Chebyshev (first kind)     T_n            (−1, 1)       w(x) = (1 − x²)^(−1/2)
 Chebyshev (second kind)    U_n            [−1, 1]       w(x) = (1 − x²)^(1/2)
 Laguerre                   L_n            [0, ∞)        w(x) = e^(−x)
 Hermite                    H_n            (−∞, ∞)       w(x) = e^(−x²)
where α, β > −1.
Next we turn to the estimation of the error of Gaussian quadrature rules.

Theorem 5.8. If f ∈ C^{2n}[a, b] and the integral is approximated by a Gaussian quadrature rule, then

   ∫_a^b f(x) w(x) dx − Σ_{i=1}^{n} w_i f(x_i) = (f^(2n)(ξ) / (2n)!) ∫_a^b [p̂_n(x)]² w(x) dx

for some ξ ∈ (a, b), where p̂_n is the nth orthogonal polynomial with respect to w(x), scaled such that the leading coefficient is 1.

Proof. Let q be the polynomial of degree at most 2n − 1, which satisfies the


conditions

   q(x_i) = f(x_i) and q'(x_i) = f'(x_i),   i = 1, . . . , n.


142  A Concise Introduction to Numerical Analysis

This is possible, since these are 2n conditions and q has 2n degrees of freedom.
Since the degree of q is at most 2n − 1, the quadrature rule is exact
$$\int_a^b q(x)w(x)\,dx = \sum_{i=1}^{n} w_i q(x_i) = \sum_{i=1}^{n} w_i f(x_i).$$

Therefore we can rewrite the error term as
$$\int_a^b f(x)w(x)\,dx - \sum_{i=1}^{n} w_i f(x_i) = \int_a^b \big(f(x) - q(x)\big)\,w(x)\,dx.$$

We know that f(x) − q(x) vanishes at $x_1, \ldots, x_n$. Let x be any other fixed
point in the interval, say between $x_k$ and $x_{k+1}$. For t ∈ [a, b] we define the function
$$\phi(t) := [f(t) - q(t)] \prod_{i=1}^{n} (x - x_i)^2 - [f(x) - q(x)] \prod_{i=1}^{n} (t - x_i)^2.$$

For $t = x_j$, j = 1, . . . , n, the first term vanishes, since $f(x_j) = q(x_j)$, and by
construction the product in the second term vanishes. We also have φ(x) = 0,
since then the two terms cancel. Hence φ has at least n + 1 distinct zeros in
[a, b]. By Rolle's theorem, if a function with continuous derivative vanishes at
two distinct points, then its derivative vanishes at an intermediate point. We
deduce that φ' has at least one zero between $x_j$ and $x_{j+1}$ for j = 1, . . . , k − 1
and j = k + 1, . . . , n − 1. It also has two further zeros, one between $x_k$ and x and
one between x and $x_{k+1}$. These are n zeros. However, since $f'(x_i) = q'(x_i)$ for
i = 1, . . . , n, we have a further n zeros and thus altogether 2n zeros. Applying
Rolle's theorem again, the second derivative of φ has at least 2n − 1 zeros.
Continuing in this manner, the (2n)th derivative has at least one zero, say at
ξ(x). There is a dependence on x since x was fixed.
Therefore
$$0 = \phi^{(2n)}(\xi(x)) = \left[f^{(2n)}(\xi(x)) - q^{(2n)}(\xi(x))\right] \prod_{i=1}^{n} (x - x_i)^2 - [f(x) - q(x)]\,(2n)!,$$
or, after rearranging, using $(x - x_1)^2 \cdots (x - x_n)^2 = [\hat{p}_n(x)]^2$ and noting that
$q^{(2n)} \equiv 0$ because q has degree at most 2n − 1,
$$f(x) - q(x) = \frac{f^{(2n)}(\xi(x))}{(2n)!}\, [\hat{p}_n(x)]^2.$$

Next the function
$$\frac{f^{(2n)}(\xi(x))}{(2n)!} = \frac{f(x) - q(x)}{[\hat{p}_n(x)]^2}$$
is continuous on [a, b], since the zeros of the denominator are also zeros of the
numerator with the same multiplicity. We can apply the mean value theorem
of integral calculus:
$$\int_a^b \big(f(x) - q(x)\big)w(x)\,dx = \int_a^b \frac{f^{(2n)}(\xi(x))}{(2n)!}\,[\hat{p}_n(x)]^2 w(x)\,dx = \frac{f^{(2n)}(\xi)}{(2n)!} \int_a^b [\hat{p}_n(x)]^2 w(x)\,dx$$
for some ξ ∈ (a, b).
There are two variations of Gaussian quadrature rules. The Gauss–Lobatto
rules, also known as Lobatto quadrature, explicitly include the end points of the
interval as abscissae, x1 = a and xn = b, while the remaining n − 2 abscissae
are chosen optimally. The quadrature is then accurate for polynomials up to
degree 2n − 3. For w(x) ≡ 1 and [a, b] = [−1, 1], the remaining abscissae are
the zeros of the derivative of the (n − 1)th Legendre polynomial Pn−1 (x). The
Lobatto quadrature of f(x) on [−1, 1] is
$$\int_{-1}^{1} f(x)\,dx \approx \frac{2}{n(n-1)}\left[f(1) + f(-1)\right] + \sum_{i=2}^{n-1} w_i f(x_i),$$
where the weights are given by
$$w_i = \frac{2}{n(n-1)\,[P_{n-1}(x_i)]^2}, \qquad i = 2, \ldots, n-1.$$
Simpson's rule is the simplest Lobatto rule. The following table lists the Lobatto
rules with their abscissae and weights until n = 5 and the degree 2n − 3 of
polynomial they are correct for.

  n | 2n−3 | abscissae | weights
  3 |  3   | 0         | 4/3
    |      | ±1        | 1/3
  4 |  5   | ±1/√5     | 5/6
    |      | ±1        | 1/6
  5 |  7   | 0         | 32/45
    |      | ±√(3/7)   | 49/90
    |      | ±1        | 1/10
The second variation are the Gauss–Radau rules or Radau quadratures.
Here one end point is included as abscissa. Therefore we distinguish left and
right Radau rules. The remaining n − 1 abscissae are chosen optimally. The
quadrature is accurate for polynomials of degree up to 2n − 2. For w(x) ≡ 1
and [a, b] = [−1, 1] and x1 = −1, the remaining abscissae are the zeros of the
polynomial given by
$$\frac{P_{n-1}(x) + P_n(x)}{1 + x}.$$
The Radau quadrature of f(x) on [−1, 1] is
$$\int_{-1}^{1} f(x)\,dx \approx \frac{2}{n^2} f(-1) + \sum_{i=2}^{n} w_i f(x_i),$$
where the other weights are given by
$$w_i = \frac{1 - x_i}{n^2\,[P_{n-1}(x_i)]^2}, \qquad i = 2, \ldots, n.$$

The following table lists the left Radau rules with their abscissae and weights
until n = 5 and the degree of polynomial they are correct for. For n = 4 and
5 only approximations to the abscissae and weights are given.

  n | 2n−2 | abscissae    | weights
  2 |  2   | −1           | 1/2
    |      | 1/3          | 3/2
  3 |  4   | −1           | 2/9
    |      | (1 ± √6)/5   | (16 ∓ √6)/18
  4 |  6   | −1           | 1/8
    |      | −0.575319    | 0.657689
    |      |  0.181066    | 0.776387
    |      |  0.822824    | 0.440924
  5 |  8   | −1           | 2/25
    |      | −0.72048     | 0.446208
    |      | −0.167181    | 0.623653
    |      |  0.446314    | 0.562712
    |      |  0.885792    | 0.287427
One drawback of Gaussian quadrature is the need to pre-compute the
necessary abscissae and weights. Often the abscissae and weights are given
in look-up tables for specific intervals. If one has a quadrature rule for the
interval [c, d], it can be adapted to the interval [a, b] with a simple change of
variables. Let t(x) be the linear transformation taking [c, d] to [a, b] and t−1 (y)
its inverse,
$$y = t(x) = a + \frac{b-a}{d-c}(x - c), \qquad x = t^{-1}(y) = c + \frac{d-c}{b-a}(y - a), \qquad \frac{dy}{dx} = \frac{b-a}{d-c}.$$
The integration is then transformed:
$$\int_a^b f(y)w(y)\,dy = \int_{t^{-1}(a)}^{t^{-1}(b)} f(t(x))\,w(t(x))\,t'(x)\,dx = \frac{b-a}{d-c}\int_c^d f(t(x))\,w(t(x))\,dx.$$
The integral is then approximated by
$$\int_a^b f(x)w(x)\,dx \approx \frac{b-a}{d-c}\sum_{i=1}^{n} w_i f(t(x_i)),$$
where $x_i$ and $w_i$, i = 1, . . . , n, are the abscissae and weights of a quadrature
approximating integrals over [c, d] with weight function w(t(x)). Thus the
abscissae and weights of the quadrature over [a, b] are
$$\hat{x}_i = t(x_i) \quad\text{and}\quad \hat{w}_i = \frac{b-a}{d-c}\,w_i.$$
It is important to note that the change of variables alters the weight func-
tion. This does not play a role in the Gauss–Legendre quadrature, since there
w(x) ≡ 1, but it does for the other Gaussian quadrature rules.
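As an illustration of this change of variables, the following MATLAB lines are a sketch (not one of the book's listings) for the Gauss–Legendre case w(x) ≡ 1 only; the interval [a, b] and the integrand are arbitrary choices.

    % Map the two-point Gauss-Legendre rule from [c,d] = [-1,1] to [a,b].
    a = 0; b = pi;                     % illustrative interval
    x = [-1/sqrt(3), 1/sqrt(3)];       % abscissae on [-1,1]
    w = [1, 1];                        % weights on [-1,1]
    xhat = a + (b - a)/2*(x + 1);      % transformed abscissae t(x_i)
    what = (b - a)/2*w;                % scaled weights (b-a)/(d-c)*w_i
    f = @(x) sin(x);
    Q = sum(what.*f(xhat))             % approximates the integral of sin over [0,pi], i.e. 2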
The change of variables technique is important for decomposing the inter-
val of integration into smaller intervals, over each of which Gaussian quadra-
ture rules can be applied. The strategy of splitting the integral into a set of
panels is the subject of the next section.

5.6 Composite Rules


A composite rule is also known as a repeated, compound, iterated, or extended
rule. It is constructed by splitting the integral into smaller integrals (usually
of the same length) and applying (usually) the same quadrature rule in each
sub-interval and summing the results. For example, consider the simple case
of the midpoint rule. Figure 5.4 illustrates its composite rule.
Let there be N sub-intervals of equal length and let h denote the width of
each sub-interval, i.e., h = (b − a)/N. The composite midpoint rule then has
the formula
$$\int_a^b f(x)\,dx \approx h \sum_{i=1}^{N} f\!\left(a + \big(i - \tfrac{1}{2}\big)h\right).$$
The error term is
$$\sum_{i=1}^{N} \frac{1}{24} f''(\xi_i)\,h^3 \le \frac{1}{24}\max_{\xi\in[a,b]} |f''(\xi)|\,N h^3 = O(h^2),$$
where each $\xi_i$ lies in the interval $[a + (i-1)h,\, a + ih]$.


Figure 5.5 illustrates the composite trapezium rule, where the sub-intervals
are generated in the same way. It has the formula
$$\int_a^b f(x)\,dx \approx \frac{h}{2} f(a) + h\sum_{i=1}^{N-1} f(a + ih) + \frac{h}{2} f(b),$$

because the function values at the interior abscissae are needed twice to ap-
proximate the integrals on the intervals on the left and right of them. Since the
error in each sub-interval is of magnitude $O(h^3)$ and there are N sub-intervals,
the overall error is $O(h^2)$.

Figure 5.4 The composite midpoint rule
The composite Simpson’s rule has the formula
$$\int_a^b f(x)\,dx \approx \frac{h}{6}\left[ f(a) + 2\sum_{i=1}^{N-1} f(a + ih) + 4\sum_{i=1}^{N} f\!\left(a + \big(i - \tfrac{1}{2}\big)h\right) + f(b) \right].$$

On each sub-interval the error is O(h5 ), and since there are N sub-intervals,
the overall error is O(h4 ).
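A minimal MATLAB implementation of the composite Simpson's rule is given below; it is a sketch in the spirit of the book's function listings, not one of them, and the name compositeSimpson is an arbitrary choice.

    function Q = compositeSimpson(f, a, b, N)
    % Composite Simpson's rule with N sub-intervals of width h = (b-a)/N.
    h = (b - a)/N;
    xEnd = a + (1:N-1)*h;              % interior end-points, each used twice
    xMid = a + ((1:N) - 0.5)*h;        % midpoints of the sub-intervals
    Q = h/6*(f(a) + 2*sum(f(xEnd)) + 4*sum(f(xMid)) + f(b));
    end

For example, compositeSimpson(@sin, 0, pi, 10) is close to 2, and doubling N reduces the error by roughly a factor of 16, consistent with the O(h^4) behaviour.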
Because evaluating an arbitrary function can be potentially expensive, the
efficiency of quadrature rules is usually measured by the number of func-
tion evaluations required to achieve a desired accuracy. In composite rules it
is therefore advantageous, if the endpoints are abscissae, since the function
value at the end-point of one sub-interval will be used again in the next sub-
interval. Therefore Lobatto rules play an important role in the construction
of composite rules. In the following we put this on a more theoretical footing.
Definition 5.2 (Riemann integral). For each n ∈ N let there be a set of
numbers $a = \xi_0 < \xi_1 < \ldots < \xi_n = b$. A Riemann integral is defined by
$$\int_a^b f(x)\,dx = \lim_{n\to\infty,\,\Delta\xi\to 0} \sum_{i=1}^{n} (\xi_i - \xi_{i-1})\, f(x_i),$$
where $x_i \in [\xi_{i-1}, \xi_i]$ and $\Delta\xi = \max_{1\le i\le n} |\xi_i - \xi_{i-1}|$. The sum on the right-hand
side is called a Riemann sum.
Figure 5.5 The composite trapezium rule

Some simple quadrature rules are clearly Riemann sums. For example,
take the composite rectangle rule, which approximates the value on each sub-interval
by the function value at the right end-point times the length of the interval,
$$Q_N(f) = h \sum_{i=1}^{N} f(a + ih),$$
where h = (b − a)/N. This is illustrated in Figure 5.6.


Other quadrature rules of the form
$$Q_n(f) = \sum_{i=1}^{n} w_i f(x_i)$$
can be expressed as Riemann sums as long as we can set $\xi_0 = a$ and $\xi_n = b$ and
there exist $\xi_1 < \xi_2 < \ldots < \xi_{n-1}$ such that $x_i \in [\xi_{i-1}, \xi_i]$ and $w_i = \xi_i - \xi_{i-1}$.
To show that a sequence of quadrature rules Qn (f ), n = 1, 2, . . . converges
to the integral of f , it is sufficient to prove that the sequence of quadrature
rules is a sequence of Riemann sums where ∆ξ → 0 as n → ∞.
Let Qn be an n-point quadrature rule. We denote by (M ×Qn ) the rule Qn
applied to M sub-intervals. If Qn is an open rule, that is, it does not include
the endpoints, then (M × Qn ) uses M n points. However, if Qn is a closed
rule, that is, it includes both end-points, then M × Qn uses only (n − 1)M + 1
points, which is M − 1 less function evaluations.
Let the quadrature rule be given on [c, d] by
$$Q_n(f) = \sum_{j=1}^{n} w_j f(x_j).$$

Figure 5.6 The composite rectangle rule

Then the composite rule on [a, b] is given by
$$(M \times Q_n)(f) = \frac{b-a}{M(d-c)} \sum_{i=1}^{M} \sum_{j=1}^{n} w_j f(x_{ij}), \qquad (5.1)$$
where $x_{ij}$ is the jth abscissa in the ith sub-interval, calculated as $x_{ij} = t_i(x_j)$,
where $t_i$ is the transformation taking [c, d] of length d − c to
[a + (i − 1)(b − a)/M, a + i(b − a)/M] of length (b − a)/M.
Theorem 5.9. Let $Q_n$ be a quadrature rule that integrates constants exactly,
i.e., $Q_n(1) = \int_c^d 1\,dx = d - c$. If f is bounded on [a, b] and is Riemann integrable,
then
$$\lim_{M\to\infty} (M \times Q_n)(f) = \int_a^b f(x)\,dx.$$

Proof. Firstly note that
$$Q_n(1) = \sum_{j=1}^{n} w_j = d - c.$$
That is, the weights sum to d − c. Swapping the summations and taking
everything independent of M out of the limit in (5.1) leads to
$$\lim_{M\to\infty} (M \times Q_n)(f) = \frac{1}{d-c}\sum_{j=1}^{n} w_j \left[\lim_{M\to\infty} \frac{b-a}{M}\sum_{i=1}^{M} f(x_{ij})\right] = \int_a^b f(x)\,dx,$$
since the expression in the square brackets is a Riemann sum with $\xi_i = a + i(b-a)/M$ for i = 1, . . . , M.
Definition 5.3. The quadrature rule $Q_n$ on [a, b] is called a simplex rule if
the error can be expressed as
$$E_{Q_n}(f) = \int_a^b f(x)w(x)\,dx - Q_n(f) = C(b-a)^{k+1} f^{(k)}(\xi),$$
for $f \in C^k[a, b]$, ξ ∈ (a, b), and some constant C.

We have already seen that the error can be expressed in such a form for
all quadrature rules we have encountered so far.
Theorem 5.10. Let $Q_n$ be a simplex rule as defined above and let $E_{M\times Q_n}(f)$
denote the error of $(M \times Q_n)(f)$. Then
$$\lim_{M\to\infty} M^k E_{M\times Q_n}(f) = C(b-a)^k \left[ f^{(k-1)}(b) - f^{(k-1)}(a) \right].$$
That is, $(M \times Q_n)(f)$ converges to $\int_a^b f(x)w(x)\,dx$ like $M^{-k}$ for sufficiently
large M.
Proof. The error of the composite rule is the sum of the errors in each sub-interval. Thus
$$E_{M\times Q_n}(f) = C \sum_{i=1}^{M} \left(\frac{b-a}{M}\right)^{k+1} f^{(k)}(\xi_i),$$
where $\xi_i$ lies inside the ith sub-interval. Multiplying by $M^k$ and taking the limit gives
$$\lim_{M\to\infty} M^k E_{M\times Q_n}(f) = C(b-a)^k \lim_{M\to\infty}\left[ \frac{b-a}{M} \sum_{i=1}^{M} f^{(k)}(\xi_i) \right].$$
The expression in the square brackets is a Riemann sum and therefore
$$\lim_{M\to\infty} M^k E_{M\times Q_n}(f) = C(b-a)^k \int_a^b f^{(k)}(x)\,dx = C(b-a)^k \left[ f^{(k-1)}(b) - f^{(k-1)}(a) \right].$$

So far we have only looked at one-dimensional integration. The next section


looks at several dimensions.
5.7 Multi-Dimensional Integration


Consider the problem of evaluating
$$I = \int_{l_1}^{u_1} \int_{l_2(x_1)}^{u_2(x_1)} \cdots \int_{l_d(x_1,\ldots,x_{d-1})}^{u_d(x_1,\ldots,x_{d-1})} f(x_1, x_2, \ldots, x_d)\, dx_d\, dx_{d-1} \cdots dx_1, \qquad (5.2)$$

where there are d integrals and where the boundaries of each integral may
depend on variables not used in that integral. In the k th dimension the inter-
val of integration is [lk (x1 , . . . , xk−1 ), uk (x1 , . . . , xk−1 )]. Such problems often
arise in practice, mostly for two or three dimensions, but sometimes for 10
or 20 dimensions. The problem becomes considerably more expensive with
each extra dimension. Therefore different methods have been developed for
different ranges of dimensions.
We first consider the transformation into standard regions with the hyper-
cube as an example. Other standard regions are the hypersphere, the surface
of the hypersphere, or a simplex, where a simplex is the generalization of the
triangle or tetrahedron to higher dimensions. Different methods have been
developed for different standard regions. Returning to the hypercube, it can
be transformed to the region of the integral given in (5.2) by
$$x_i = \tfrac{1}{2}\left[u_i(x_1,\ldots,x_{i-1}) + l_i(x_1,\ldots,x_{i-1})\right] + \tfrac{1}{2}\,y_i\left[u_i(x_1,\ldots,x_{i-1}) - l_i(x_1,\ldots,x_{i-1})\right].$$
If yi = −1, then xi = li (x1 , . . . , xi−1 ), the lower limit of the integration. If
yi = 1, then xi = ui (x1 , . . . , xi−1 ), the upper limit of the integration. For
$y_i = 0$, $x_i$ is the midpoint of the interval. The derivative of this transformation
is given by
$$\frac{dx_i}{dy_i} = \tfrac{1}{2}\left[u_i(x_1,\ldots,x_{i-1}) - l_i(x_1,\ldots,x_{i-1})\right].$$
Using the transformation, we write $f(x_1,\ldots,x_d) = g(y_1,\ldots,y_d)$ and the integral I becomes
$$\int_{-1}^{1}\cdots\int_{-1}^{1} \frac{1}{2^d}\prod_{i=1}^{d}\left[u_i(x_1,\ldots,x_{i-1}) - l_i(x_1,\ldots,x_{i-1})\right] g(y_1,\ldots,y_d)\, dy_1\cdots dy_d.$$

We can now use a standard method. In the following we give an example of


how such a method can be derived.
If I1 = [a, b] and I2 = [c, d] are intervals, then the product I1 × I2 denotes
the rectangle a ≤ x ≤ b, c ≤ y ≤ d. Let Q1 and Q2 denote quadrature rules
over I1 and I2 respectively,
$$Q_1(f) = \sum_{i=1}^{n} w_{1i} f(x_{1i}) \quad\text{and}\quad Q_2(g) = \sum_{j=1}^{m} w_{2j} g(x_{2j}),$$
where x1i ∈ [a, b], i = 1, . . . , n, are the abscissae of Q1 and x2j ∈ [c, d],
j = 1, . . . , m, are the abscissae of Q2 .
Definition 5.4. The product rule Q1 × Q2 to integrate a function F : I1 ×
I2 → R is defined by
$$(Q_1 \times Q_2)(F) = \sum_{i=1}^{n}\sum_{j=1}^{m} w_{1i}\, w_{2j}\, F(x_{1i}, x_{2j}).$$

Exercise 5.4. Let Q1 integrate f exactly over the interval I1 and let Q2
integrate g exactly over the interval I2 . Prove that Q1 ×Q2 integrates f (x)g(y)
over I1 × I2 exactly.
A consequence of the above definition and exercise is that we can combine
all the one-dimensional quadrature rules we encountered before to create rules
in two dimensions. If Q1 is correct for polynomials of degree at most k1 and
Q2 is correct for polynomials of degree at most k2 , then the product rule is
correct for any linear combination of the monomials xi y j , where i = 0, . . . , k1 ,
and j = 0, . . . , k2 . As an example let Q1 and Q2 both be Simpson’s rule. The
product rule is then given by

$$(Q_1 \times Q_2)(F) = \frac{(b-a)(d-c)}{36}\Big[ F(a,c) + 4F\big(\tfrac{a+b}{2},c\big) + F(b,c) + 4F\big(a,\tfrac{c+d}{2}\big) + 16F\big(\tfrac{a+b}{2},\tfrac{c+d}{2}\big) + 4F\big(b,\tfrac{c+d}{2}\big) + F(a,d) + 4F\big(\tfrac{a+b}{2},d\big) + F(b,d) \Big],$$
and it is correct for $1, x, y, x^2, xy, y^2, x^3, x^2y, xy^2, y^3, x^3y, x^2y^2, xy^3, x^3y^2, x^2y^3$,
and $x^3y^3$.
Note that the product of two Gauss rules is not a Gauss rule. The optimal
abscissae in more than one dimension have different positions. Gauss rules in
more than one dimension exist, but are rarely used, since Gauss rules in higher
dimensions do not always have positive weights and the abscissae might lie
outside the region of integration, which causes a problem if the integrand is
not defined there.
The idea of product rules generalizes to d dimensions, but they become increasingly
inefficient due to the dimensional effect. The error of most methods
behaves like $n^{-k/d}$ as the number of dimensions increases, where k is a
constant depending on the method. Product rules are generally not used for
more than three dimensions.
To improve accuracy in one dimension we normally halve the sub-intervals.
Thus n sub-intervals become 2n sub-intervals. If we have n hypercubes in d
dimensions which subdivide the region and we halve each of their sides, then we
have $2^d n$ hypercubes.
tions rises quickly, even if the abscissae lie on the boundaries and the function
values there can be reused. Consequently for large d different methods are re-
quired. In a large number of dimensions analytical strategies such as changing
the order of integration or even just reducing the number of dimensions by one
can have a considerable effect. In high dimensional problems, high accuracy
is often not required. Often merely the magnitude of the integral is sufficient.
Note that high dimensional integration is a problem well-suited to parallel
computing.

5.8 Monte Carlo Methods


In the following we will shortly introduce Monte Carlo methods. However,
their probabilistic analysis is beyond this course and we will just state a few
facts. Monte Carlo methods cannot be used to obtain high accuracy, but
they are still very effective when dealing with many dimensions. Abscissae are
generated pseudo-randomly. The function values at these abscissae are then
averaged and multiplied by the volume of the region to estimate the integral.
The accuracy of Monte Carlo methods is generally very poor, even in one
dimension. However, the dimensional effect, i.e., the loss of efficiency when
extending to many dimensions, is less severe than in other methods.
As an example consider the one-dimensional case where we seek to calculate
$I = \int_0^1 f(x)\,dx$. Let $x_i$, i = 1, . . . , n, be a set of pseudo-random variables.
Note that pseudo-random numbers are not really random, but are sets of num-
bers which are generated by an entirely deterministic causal process, which
is easier than producing genuinely random numbers. The advantage is that
the resulting numbers are reproducible, which is important for testing and
fixing software. Pseudo randomness is defined as the set of numbers being
non-distinguishable from the uniform distribution by any of several tests. It
is an open question whether a test could be developed to distinguish a set of
pseudo-random numbers from the uniform distribution.
The strong law of large numbers states that the probability of
$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} f(x_i) = I$$

is one. That is, using truly random abscissae and equal weights, convergence
is almost sure.
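A basic Monte Carlo estimate can be coded in a few MATLAB lines. The fragment below is a sketch only; the integrand and sample size are arbitrary choices, and rng fixes the seed so that the pseudo-random abscissae are reproducible.

    % Monte Carlo estimate of I = integral of f over [0,1].
    rng(0);                            % fix the seed for reproducibility
    f = @(x) exp(-x.^2);               % illustrative integrand
    n = 1e5;
    x = rand(n, 1);                    % pseudo-random abscissae in [0,1]
    fx = f(x);
    I = mean(fx);                      % the volume of [0,1] is 1
    err = std(fx)/sqrt(n);             % statistical error estimate, O(n^(-1/2))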
Statistical error estimates are available, but these depend on the variance
of the function f . As an example for the efficiency of the Monte Carlo method,
to obtain an error less than 0.01 with 99% certainty in the estimation of I,
we need to average 6.6 × 104 function values. To gain an extra decimal place
with the same certainty requires 6.6 × 106 function evaluations.
The advantage of the Monte Carlo method is that the error behaves like
n−1/2 and not like n−k/d , that is, it is independent of the number of dimen-
sions. However, there is still a dimensional effect, since higher dimensional
functions have larger variances.
As a final algorithm the Korobov–Conroy method needs to be mentioned.
Here the abscissae are not chosen pseudo-randomly, but are in some sense optimal.
They are derived from results in number theory (see for example [15] H.
Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods).

5.9 Revision Exercises


Exercise 5.5. Consider the numerical evaluation of an integral of the form
$$I = \int_a^b f(x)w(x)\,dx.$$

(a) Define Gaussian quadrature and state how the abscissae are obtained. Give
a formula for the weights. If f is a polynomial, what is the maximum degree
of f for which the Gaussian quadrature rule is correct?
(b) In the following let the interval be [a, b] = [−2, 2] and $w(x) = 4 - x^2$. Thus
we want to approximate the integral
$$\int_{-2}^{2} (4 - x^2) f(x)\,dx.$$

Let the number of abscissae be 2. Calculate the abscissae.


(c) Calculate the weights.
(d) To approximate the integral
$$\int_{-1}^{1} (1 - x^2) f(x)\,dx$$

by a Gaussian quadrature the orthogonal polynomials are the Jacobi polynomials
for α = 1 and β = 1. For n = 2 the abscissae are $x_1 = -1/\sqrt{5}$ and
$x_2 = 1/\sqrt{5}$. The weights are $w_1 = w_2 = 2/3$. The interval of integration
is changed from [−1, 1] to [−2, 2]. What are the new abscissae and
weights? Explain why the weights are different from the weights derived in
the previous part.
Exercise 5.6. Let $f \in C^{k+1}[a, b]$, that is, f is k + 1 times continuously
(a) Expand f (x) in a Taylor series using the integral form of the remainder.
(b) The integral of f over [a, b] is approximated numerically in such a way
that the approximation is correct whenever f is a polynomial of degree k
or less. Let L(f ) denote the approximation error. Show that L(f ) can be
calculated as
$$L(f) = \frac{1}{k!}\int_a^b K(\theta)\, f^{(k+1)}(\theta)\,d\theta.$$
In the process specify K(θ) and state which conditions L(f ) has to satisfy.
(c) If K does not change sign in (a, b), how can the expression for L(f ) be
further simplified?
(d) In the following we let a = 0 and b = 2 and let
$$L(f) = \int_0^2 f(x)\,dx - \frac{1}{3}\left[f(0) + 4f(1) + f(2)\right].$$
Find the highest degree of polynomials for which this approximation is
correct.
(e) Calculate K(θ) for θ ∈ [0, 2].
(f ) Given that K(θ) is negative for θ ∈ [0, 2], obtain c such that

L(f ) = cf (4) (ξ)

for some ξ ∈ (0, 2).


(g) The above is the Simpson rule of numerical integration on the interval
[0, 2]. State two other rules of numerical integration on the interval [0, 2].
Exercise 5.7. (a) Describe what is meant by a composite rule of integration.
(b) Give two examples of composite rules and their formulae.
(c) Let a quadrature rule be given on [c, d] by
$$Q_n(f) = \sum_{j=1}^{n} w_j f(x_j) \approx \int_c^d f(x)\,dx.$$

We denote by (M × Qn ) the composite rule Qn applied to M subintervals


of [a, b]. Give the formula for (M × Qn ).
(d) Describe the difference between open and closed quadrature rules and how
this affects the composite rule.
(e) Show that if Qn is a quadrature rule that integrates constants exactly, i.e.,
$Q_n(1) = \int_c^d 1\,dx = d - c$, and if f is bounded on [a, b] and is Riemann
integrable, then
$$\lim_{M\to\infty} (M \times Q_n)(f) = \int_a^b f(x)\,dx.$$

(f ) Let [c, d] = [−1, 1]. Give the constant, linear, and quadratic monic poly-
nomials which are orthogonal with respect to the inner product given by
$$\langle f, g\rangle = \int_{-1}^{1} f(x)g(x)\,dx$$

and check that they are orthogonal to each other.


(g) Give the abscissae of the two-point Gauss–Legendre rule on the interval
[−1, 1].
(h) The weights of the two-point Gauss–Legendre rule are 1 for both abscis-
sae. State the two-point Gauss–Legendre rule and give the formula for the
composite rule on [a, b] employing the two-point Gauss–Legendre rule.
Exercise 5.8. The integral
$$\int_{-1}^{1} (1 - x^2) f(x)\,dx$$

is approximated by a Gaussian quadrature rule of the form


$$\sum_{i=1}^{n} w_i f(x_i),$$

which is exact for all f (x) that are polynomials of degree less than or equal to
2n − 1.
(a) Explain how the weights wi are calculated, writing down explicit expres-
sions in terms of integrals.
(b) Explain why it is necessary that the $x_i$ are the zeros of a (monic) polynomial
$p_n$ of degree n that satisfies $\int_{-1}^{1} (1 - x^2)p_n(x)q(x)\,dx = 0$ for any
polynomial q(x) of degree less than n.
(c) The first such polynomials are $p_0 = 1$, $p_1 = x$, $p_2 = x^2 - \tfrac{1}{5}$, $p_3 = x^3 - \tfrac{3}{7}x$.
Show that the Gaussian quadrature formulae for n = 2, 3 are
$$n = 2: \quad \frac{2}{3}\left[ f\!\left(-\tfrac{1}{\sqrt{5}}\right) + f\!\left(\tfrac{1}{\sqrt{5}}\right) \right],$$
$$n = 3: \quad \frac{14}{45}\left[ f\!\left(-\sqrt{\tfrac{3}{7}}\right) + f\!\left(\sqrt{\tfrac{3}{7}}\right) \right] + \frac{32}{45} f(0).$$

(d) Verify the result for n = 3 by considering f (x) = 1, x2 , x4 .


Exercise 5.9. The integral
$$\int_0^2 f(x)\,dx$$
shall be approximated by a two point Gaussian quadrature formula.
(a) Find the monic quadratic polynomial g(x) which is orthogonal to all linear
polynomials with respect to the scalar product
$$\langle f, g\rangle = \int_0^2 f(x)g(x)\,dx,$$

where f (x) denotes an arbitrary linear polynomial.


(b) Calculate the zeros of the polynomial found in (a) and explain how they
are used to construct a Gaussian quadrature rule.
(c) Describe how the weights are calculated for a Gaussian quadrature rule
and calculate the weights to approximate $\int_0^2 f(x)\,dx$.
(d) For which polynomials is the constructed quadrature rule correct?

(e) State the functional L(f ) acting on f describing the error when the integral
$\int_0^2 f(x)\,dx$ is approximated by the quadrature rule.

(f ) Define the Peano kernel and state the Peano kernel theorem.
(g) Calculate the Peano kernel for the functional L(f ) in (e).
(h) The Peano kernel does not change sign in [0, 2] (not required to be proven).
Derive an expression for L(f ) of the form constant times a derivative of
f . (Hint: (a + b)4 = a4 + 4a3 b + 6a2 b2 + 4ab3 + b4 .)
CHAPTER 6

ODEs

We wish to approximate the exact solution of the ordinary differential equation


(ODE)
∂y
= y0 = f (t, y), t ≥ 0, (6.1)
∂t
where y ∈ RN and the function f : R × RN → RN is sufficiently well behaved.
The equation is accompanied by the initial condition y(0) = y0 .
The following definition is central in the analysis of ODEs.
Definition 6.1 (Lipschitz continuity). f is Lipschitz continuous if there exists
a bound λ ≥ 0 such that
$$\|f(t, v) - f(t, w)\| \le \lambda \|v - w\|, \qquad t \in [0, t^*],\ v, w \in \mathbb{R}^N.$$

Lipschitz continuity means that the slopes of all secant lines to the function
between possible points v and w are bounded above by a positive constant.
Thus a Lipschitz continuous function is limited in how much and how fast
it can change. In the theory of differential equations, Lipschitz continuity is
the central condition of the Picard–Lindelöf theorem, which guarantees the
existence and uniqueness of a solution to an initial value problem.
For our analysis of numerical solutions we henceforth assume that f is
analytic and we are always able to expand locally into a Taylor series.
We want to calculate yn+1 ≈ y(tn+1 ), n = 0, 1, . . . , from y0 , y1 , . . . , yn ,
where tn = nh and the time step h > 0 is small.

6.1 One-Step Methods


In a one-step method $y_{n+1}$ is only allowed to depend on $t_n$, $y_n$, h and the
ODE given by Equation (6.1). The slope of y at t = 0 is given by $y'(0) = f(0, y(0)) = f(0, y_0)$.
The most obvious approach is to truncate the Taylor series expansion
$y(h) = y(0) + h y'(0) + \tfrac{1}{2}h^2 y''(0) + \cdots$ before the $h^2$ term.
Thus we set $y_1 = y_0 + hf(0, y_0)$. Following the same principle, we advance
from h to 2h by letting $y_2 = y_1 + hf(t_1, y_1)$. However, note that the second
term is an approximation to $y'(h)$, since $y_1$ is an approximation to $y(h)$.

Figure 6.1 The forward Euler method


Carrying on, we obtain the Euler method

yn+1 = yn + hf (tn , yn ), n = 0, 1, . . . . (6.2)

Figure 6.1 illustrates the method.
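A straightforward MATLAB implementation of the method is shown below; it is a sketch in the style of the book's function listings, not one of them, and the name forwardEuler is an arbitrary choice.

    function [t, y] = forwardEuler(f, tspan, y0, h)
    % Forward Euler for y' = f(t,y), y(tspan(1)) = y0, with fixed step size h.
    t = (tspan(1):h:tspan(2)).';       % time grid t_n = t_0 + n*h
    N = length(t);
    y = zeros(N, length(y0));          % one row per time step
    y(1,:) = y0(:).';
    for n = 1:N-1
        y(n+1,:) = y(n,:) + h*f(t(n), y(n,:).').';   % y_{n+1} = y_n + h f(t_n, y_n)
    end
    end

For instance, [t, y] = forwardEuler(@(t,y) -y./(1+t), [0 1], 1, 0.01) integrates the first equation of the following exercise numerically.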


Exercise 6.1. Let h = 1/M, where M is a positive integer. The following ODEs
are given,
$$y' = -\frac{y}{1+t} \quad\text{and}\quad y' = \frac{2y}{1+t}, \qquad 0 \le t \le 1,$$
with starting conditions y0 = y(0) = 1 in both cases. Forward Euler is used to
calculate the estimates yn , n = 1, . . . M . By using induction and by canceling
as many terms as possible in the resultant products, deduce simple explicit
expressions for yn , n = 1, . . . M , which should be free from summations and
products. By considering the limit for h → 0, deduce the exact solutions of the
equations. Verify that the error |yn − y(tn )| is at most O(h).

Definition 6.2 (Convergence). Let $t^* > 0$ be given. A method which produces
for every h > 0 the solution sequence $y_n = y_n(h)$, $n = 0, 1, \ldots, \lfloor t^*/h\rfloor$, converges
if, as $h \to 0$ and $n_k(h)h \to t$ for $k \to \infty$, it is true that $y_{n_k} \to y(t)$, the exact
solution of Equation (6.1), uniformly for $t \in [0, t^*]$.

Theorem 6.1. Suppose that f is Lipschitz continuous as in definition 6.1.


Then the Euler method given by (6.2) converges.
Proof. Let $e_n = y_n - y(t_n)$ be the error at step n, where $0 \le n \le \lfloor t^*/h\rfloor$. Thus
$$e_{n+1} = y_{n+1} - y(t_{n+1}) = [y_n + hf(t_n, y_n)] - [y(t_n) + h y'(t_n) + O(h^2)].$$
The $O(h^2)$ term can be bounded uniformly for all $[0, t^*]$ by $ch^2$ for some c > 0.
Using (6.1) and the triangle inequality, we can deduce
$$\|e_{n+1}\| \le \|y_n - y(t_n)\| + h\|f(t_n, y_n) - f(t_n, y(t_n))\| + ch^2 \le (1 + h\lambda)\|e_n\| + ch^2,$$
where we used the Lipschitz condition in the second inequality.
By induction, and bearing in mind that $e_0 = 0$, we have
$$\|e_{n+1}\| \le ch^2 \sum_{j=0}^{n} (1 + h\lambda)^j = ch^2\, \frac{(1 + h\lambda)^{n+1} - 1}{(1 + h\lambda) - 1} \le \frac{ch}{\lambda}(1 + h\lambda)^{n+1}.$$
Looking at the expansion of $e^{h\lambda}$, we see that $1 + h\lambda \le e^{h\lambda}$, since hλ > 0. Thus
$$\frac{ch}{\lambda}(1 + h\lambda)^{n+1} \le \frac{ch}{\lambda}\, e^{(n+1)h\lambda} \le \frac{c\, e^{t^*\lambda}}{\lambda}\, h = Ch,$$
where we use the fact that $(n+1)h \le t^*$. Thus $\|e_n\|$ converges uniformly to
zero and the theorem is true.
Exercise 6.2. Assuming that f is Lipschitz continuous and possesses a
bounded third derivative in [0, t∗ ], use the same method of analysis to show
that the trapezoidal rule
$$y_{n+1} = y_n + \tfrac{1}{2} h\left[f(t_n, y_n) + f(t_{n+1}, y_{n+1})\right]$$
converges and that kyn − y(tn )k ≤ ch2 for some c > 0 and all n such that
0 ≤ nh ≤ t∗ .

6.2 Multistep Methods, Order, and Consistency


Generally we define a multistep method in terms of a map φh such that the
next approximation is given by
yn+1 = φh (tn , y0 , y1 , . . . yn+1 ). (6.3)
If yn+1 appears on the right-hand side, then the method is called implicit,
otherwise explicit.
Definition 6.3 (Order, local truncation/discretization error). The order of
a numerical method given by (6.3) to obtain solutions for (6.1) is the largest
integer p ≥ 0 such that
δn+1,h = y(tn+1 ) − φh (tn , y(t0 ), y(t1 ), . . . y(tn+1 )) = O(hp+1 )
for all h > 0, n ≥ 0 and all sufficiently smooth functions f . The expression
δn+1,h is called the local truncation error or local discretization error.
Note that the arguments in φh are the true function values of y. Thus
the local truncation error is the difference between the true solution and the
method applied to the true solution. Hence it only gives an indication of the
error if all previous steps have been exact.
Definition 6.4 (Consistency). The numerical method given by (6.3) to obtain
solutions for (6.1) is called consistent if
$$\lim_{h\to 0} \frac{\delta_{n+1,h}}{h} = 0.$$
Thinking back to the definition of the O-notation, consistency is equivalent
to saying that the order is at least one. For convergence, p ≥ 1 is necessary.
For Euler's method we have $\phi_h(t_n, y_n) = y_n + h f(t_n, y_n)$. Using again
Taylor series expansion,
$$y(t_{n+1}) - [y(t_n) + h f(t_n, y(t_n))] = \left[y(t_n) + h y'(t_n) + \tfrac{1}{2}h^2 y''(t_n) + \cdots\right] - \left[y(t_n) + h y'(t_n)\right] = O(h^2),$$
and we deduce that Euler's method is of order 1.


Definition 6.5 (Theta methods). These are methods of the form

yn+1 = yn + h[θf (tn , yn ) + (1 − θ)f (tn+1 , yn+1 )], n = 0, 1, . . . ,

where θ ∈ [0, 1] is a parameter.

1. For θ = 1, we recover the Euler method.


2. The choice θ = 0 is known as backward Euler

yn+1 = yn + hf (tn+1 , yn+1 ).

3. For θ = 1/2 we have the trapezoidal rule
$$y_{n+1} = y_n + \tfrac{1}{2} h\left[f(t_n, y_n) + f(t_{n+1}, y_{n+1})\right].$$

The order of the theta method can be determined as follows:
$$y(t_{n+1}) - y(t_n) - h\big(\theta y'(t_n) + (1-\theta) y'(t_{n+1})\big)$$
$$= \big(y(t_n) + h y'(t_n) + \tfrac{1}{2}h^2 y''(t_n) + \tfrac{1}{6}h^3 y'''(t_n)\big) + O(h^4) - y(t_n) - h\theta y'(t_n) - (1-\theta)h\big(y'(t_n) + h y''(t_n) + \tfrac{1}{2}h^2 y'''(t_n)\big)$$
$$= \big(\theta - \tfrac{1}{2}\big)h^2 y''(t_n) + \big(\tfrac{1}{2}\theta - \tfrac{1}{3}\big)h^3 y'''(t_n) + O(h^4).$$
Therefore all theta methods are of order 1, except that the trapezoidal rule
(θ = 1/2) is of order 2.
If θ < 1, then the theta method is implicit. That means each time step
requires the solution of N (generally non-linear) algebraic equations to find
the unknown vector $y_{n+1}$. This can be done by iteration and generally the
first estimate $y_{n+1}^{[0]}$ for $y_{n+1}$ is set to $y_n$, assuming that the function does not
change rapidly between time steps.
To obtain further estimates for $y_{n+1}$ one can use direct iteration,
$$y_{n+1}^{[j+1]} = \phi_h\big(t_n, y_0, y_1, \ldots, y_n, y_{n+1}^{[j]}\big).$$
Other methods arise by viewing the problem of finding yn+1 as finding the
zero of the function F : RN → RN defined by
F (y) = y − φh (tn , y0 , y1 , . . . yn , y).
This is the subject of the chapter on non-linear systems. Assume we already
have an estimate $y^{[j]}$ for the zero. Let
$$F(y) = \begin{pmatrix} F_1(y) \\ \vdots \\ F_N(y) \end{pmatrix}$$
and let $h = (h_1, \ldots, h_N)^T$ be a small perturbation vector. The multidimensional
Taylor expansion of each function component $F_i$, i = 1, . . . , N, is
$$F_i(y^{[j]} + h) = F_i(y^{[j]}) + \sum_{k=1}^{N} \frac{\partial F_i(y^{[j]})}{\partial y_k}\, h_k + O(\|h\|^2).$$
The Jacobian matrix $J_F(y^{[j]})$ has the entries $\frac{\partial F_i(y^{[j]})}{\partial y_k}$ and thus we can write
in matrix notation
$$F(y^{[j]} + h) = F(y^{[j]}) + J_F(y^{[j]})\,h + O(\|h\|^2).$$
Neglecting the $O(\|h\|^2)$ term, we equate this to zero (since we are looking for a
better approximation of the zero) and solve for h. We let the new estimate be
$$y^{[j+1]} = y^{[j]} + h = y^{[j]} - \left[J_F(y^{[j]})\right]^{-1} F(y^{[j]}).$$
This method is called the Newton–Raphson method. Of course the inverse of
the Jacobian is not computed explicitly; instead the equation
$$\left[J_F(y^{[j]})\right] h = -F(y^{[j]})$$
is solved for h.
The method can be simplified further by using the same Jacobian JF (y[0] )
in the computation of the new estimate y[j+1] , which is then called modified
Newton–Raphson.
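The following MATLAB fragment sketches one way to code these iterations for the backward Euler equations. It is an assumption-laden sketch, not the book's implementation: f, its Jacobian dfdy, the previous value yold, the new time tnew, the step size h and the iteration count maxit are all assumed to be supplied by the surrounding code.

    % Modified Newton-Raphson for F(y) = y - yold - h*f(tnew, y) = 0
    % (a backward Euler step); the Jacobian of F is I - h*df/dy.
    ynew = yold;                                   % initial guess y^[0] = y_n
    J = eye(length(yold)) - h*dfdy(tnew, ynew);    % Jacobian kept fixed (modified NR)
    for j = 1:maxit
        Fval = ynew - yold - h*f(tnew, ynew);
        ynew = ynew - J\Fval;                      % solve the linear system, do not invert J
    end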
Exercise 6.3. Implement the backward Euler method in MATLAB or a dif-
ferent programming language of your choice.
6.3 Order Conditions


We now consider multistep methods which can be expressed as linear combina-
tions of approximations to the function values and first derivatives. Assuming
that yn , yn+1 , . . . , yn+s−1 are available, where s ≥ 1, we say that
$$\sum_{l=0}^{s} \rho_l\, y_{n+l} = h \sum_{l=0}^{s} \sigma_l\, f(t_{n+l}, y_{n+l}), \qquad n = 0, 1, \ldots \qquad (6.4)$$
is an s-step method. Here $\rho_s = 1$. If $\sigma_s = 0$, the method is explicit, otherwise
implicit. If s ≥ 2 we need to obtain additional starting values $y_1, \ldots, y_{s-1}$ by
a different time-stepping method.
We define the following complex polynomials
$$\rho(w) = \sum_{l=0}^{s} \rho_l w^l, \qquad \sigma(w) = \sum_{l=0}^{s} \sigma_l w^l.$$

Theorem 6.2. The s-step method (6.4) is of order p ≥ 1 if and only if


$$\rho(e^z) - z\sigma(e^z) = O(z^{p+1}), \qquad z \to 0.$$
Proof. Define the operator $D_t$ to be differentiation with respect to t. Substituting
the exact solution and expanding into Taylor series about $t_n$, we obtain
$$\sum_{l=0}^{s} \rho_l\, y(t_{n+l}) - h\sum_{l=0}^{s} \sigma_l\, y'(t_{n+l}) = \sum_{l=0}^{s} \rho_l \sum_{k=0}^{\infty} \frac{(lh)^k}{k!}\, y^{(k)}(t_n) - h\sum_{l=0}^{s} \sigma_l \sum_{k=0}^{\infty} \frac{(lh)^k}{k!}\, y^{(k+1)}(t_n)$$
$$= \sum_{l=0}^{s} \rho_l \sum_{k=0}^{\infty} \frac{(lhD_t)^k}{k!}\, y(t_n) - h\sum_{l=0}^{s} \sigma_l \sum_{k=0}^{\infty} \frac{(lhD_t)^k}{k!}\, D_t\, y(t_n) = \sum_{l=0}^{s} \rho_l\, e^{lhD_t} y(t_n) - hD_t \sum_{l=0}^{s} \sigma_l\, e^{lhD_t} y(t_n)$$
$$= \left[\rho(e^{hD_t}) - hD_t\, \sigma(e^{hD_t})\right] y(t_n).$$
Sorting the terms with regards to derivatives we have
$$\sum_{l=0}^{s} \rho_l\, y(t_{n+l}) - h\sum_{l=0}^{s} \sigma_l\, y'(t_{n+l}) = \left(\sum_{l=0}^{s}\rho_l\right) y(t_n) + \sum_{k=1}^{\infty} \frac{1}{k!}\left(\sum_{l=0}^{s} l^k \rho_l - k\sum_{l=0}^{s} l^{k-1}\sigma_l\right) h^k y^{(k)}(t_n).$$
Thus, to achieve $O(h^{p+1})$ regardless of the choice of y, it is necessary and sufficient that
$$\sum_{l=0}^{s}\rho_l = 0, \qquad \sum_{l=0}^{s} l^k \rho_l = k\sum_{l=0}^{s} l^{k-1}\sigma_l, \qquad k = 1, 2, \ldots, p. \qquad (6.5)$$
This is equivalent to saying that $\rho(e^z) - z\sigma(e^z) = O(z^{p+1})$.


Definition 6.6 (Order conditions). The formulae given in Equation (6.5) are
the order conditions to achieve order p.

We illustrate this with the 2-step Adams–Bashforth method defined by
$$y_{n+2} - y_{n+1} = h\left[\tfrac{3}{2} f(t_{n+1}, y_{n+1}) - \tfrac{1}{2} f(t_n, y_n)\right].$$
Here we have $\rho(w) = w^2 - w$ and $\sigma(w) = \tfrac{3}{2}w - \tfrac{1}{2}$, and therefore
$$\rho(e^z) - z\sigma(e^z) = \left[1 + 2z + 2z^2 + \tfrac{4}{3}z^3\right] - \left[1 + z + \tfrac{1}{2}z^2 + \tfrac{1}{6}z^3\right] - \tfrac{3}{2}z\left[1 + z + \tfrac{1}{2}z^2\right] + \tfrac{1}{2}z + O(z^4) = \tfrac{5}{12}z^3 + O(z^4).$$
Hence the method is of order 2.
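The order conditions (6.5) are easy to verify numerically. The following MATLAB check is a sketch, not from the book; it confirms that the 2-step Adams–Bashforth coefficients satisfy the conditions for k = 1, 2 but not for k = 3.

    % Check the order conditions (6.5) for the 2-step Adams-Bashforth method.
    rho   = [0, -1, 1];        % rho_0, rho_1, rho_2
    sigma = [-1/2, 3/2, 0];    % sigma_0, sigma_1, sigma_2
    l = 0:2;
    fprintf('sum of rho_l = %g\n', sum(rho));       % should be 0
    for k = 1:3
        lhs = sum(l.^k.*rho);
        rhs = k*sum(l.^(k-1).*sigma);
        fprintf('k = %d: %g vs %g\n', k, lhs, rhs); % equal for k = 1,2 only
    end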
Exercise 6.4. Calculate the coefficients of the multistep method

yn+3 + ρ2 yn+2 + ρ1 yn+1 + ρ0 yn = σ2 f (tn+2 , yn+2 )

such that it is of order 3.


Let us consider the 2-step method
$$y_{n+2} - 3y_{n+1} + 2y_n = \tfrac{1}{12} h\left[13 f(t_{n+2}, y_{n+2}) - 20 f(t_{n+1}, y_{n+1}) - 5 f(t_n, y_n)\right]. \qquad (6.6)$$
Exercise 6.5. Show that the above method is at least of order 2 just like the
2-step Adams–Bashforth method.
However, we consider the trivial ODE given by y' = 0 and y(0) = 1, which
has the exact solution y(t) ≡ 1. The right-hand side in (6.6) vanishes and thus
a single step reads yn+2 − 3yn+1 + 2yn = 0. This linear recurrence relation can
be solved by considering the roots of the quadratic x2 −3x+2, which are x1 = 1
and x2 = 2. A general solution is given by yn = c1 xn1 + c2 xn2 = c1 + c2 2n ,
n = 0, 1, . . ., where c1 and c2 are constants determined by y0 = 1 and the
choice of y1 . If c2 6= 0, then for h → 0, nh → t and thus n → ∞, we have
|yn | → ∞ and we are moving away from the exact solution. A choice of
y1 = 1 yields c2 = 0, but even then this method poses problems because of
the presence of round-off errors if the right-hand side is nonzero.
Thus method (6.6) does not converge. The following theorem provides a
theoretical tool to allow us to check for convergence.
Theorem 6.3 (The Dahlquist equivalence theorem). The multistep method


(6.4) is convergent if and only if it is of order p ≥ 1 and the polynomial ρ obeys
the root condition, which means all its zeros lie within |w| ≤ 1 and all zeros
of unit modulus are simple zeros. In this case the method is called zero-stable.

Proof. The proof of this result is long and technical. Details can be found
in [10] W. Gautschi, Numerical Analysis or [11] P. Henrici, Discrete Variable
Methods in Ordinary Differential Equations.
Exercise 6.6. Show that the multistep method given by
$$\sum_{j=0}^{3} \rho_j\, y_{n+j} = h \sum_{j=0}^{2} \sigma_j\, f(t_{n+j}, y_{n+j})$$

is fourth order only if the conditions ρ0 + ρ2 = 8 and ρ1 = −9 are satisfied.


Hence deduce that this method cannot be both fourth order and satisfy the root
condition.
Theorem 6.4 (The first Dahlquist barrier). Convergence implies that the
order p can be at most s + 2 for even s and s + 1 for odd s.

Proof. Again the proof is technical and beyond the scope of this course. See
again [11] P. Henrici, Discrete Variable Methods in Ordinary Differential Equa-
tions.

6.4 Stiffness and A-Stability


Consider the linear system $y' = Ay$ for a general N × N constant matrix A.
By defining
$$e^{tA} = \sum_{k=0}^{\infty} \frac{1}{k!}\, t^k A^k,$$
the exact solution of the ODE can be represented explicitly as $y(t) = e^{tA} y_0$.
We solve the ODE with the forward Euler method. Then

yn+1 = (I + hA)yn ,

and therefore
yn = (I + hA)n y0 .
Let the eigenvalues of A be λ1 , . . . , λN with corresponding linear indepen-
dent eigenvectors v1 , . . . , vN . Further let D be the diagonal matrix with the
eigenvalues being the entries on the diagonal and V = [v1 , . . . , vN ], whence
A = V DV −1 .
We assume further that $\mathrm{Re}\,\lambda_l < 0$, l = 1, . . . , N. In this case $\lim_{t\to\infty} y(t) = 0$,
since $e^{tA} = V e^{tD} V^{-1}$ and
$$e^{tD} = \begin{pmatrix} e^{t\lambda_1} & 0 & \cdots & 0 \\ 0 & e^{t\lambda_2} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & e^{t\lambda_N} \end{pmatrix}.$$

On the other hand, however, the forward Euler method gives $y_n = V(I + hD)^n V^{-1} y_0$
and therefore $\lim_{n\to\infty} y_n = 0$ for all initial values $y_0$ if and
only if $|1 + h\lambda_l| < 1$ for l = 1, . . . , N. We illustrate this with a concrete example,
$$A = \begin{pmatrix} -\tfrac{1}{10} & 1 \\ 0 & -100 \end{pmatrix}.$$
The exact solution is a linear combination of $e^{-t/10}$ and $e^{-100t}$: the first
decays gently, whereas the second becomes practically zero almost at once.
Thus we require $|1 - \tfrac{1}{10}h| < 1$ and $|1 - 100h| < 1$, hence $h < \tfrac{1}{50}$. This
restriction on h has nothing to do with local accuracy. Its purpose is solely to
prevent an unbounded growth in the numerical solution.
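The instability is easy to reproduce in MATLAB; the lines below are a sketch, with a step size and number of steps chosen arbitrarily so that the bound h < 1/50 is violated.

    % Forward Euler on the stiff example y' = A*y blows up for h > 1/50.
    A = [-1/10, 1; 0, -100];
    h = 1/40;                 % violates |1 - 100*h| < 1
    y = [1; 1];
    for n = 1:200
        y = y + h*A*y;        % forward Euler step
    end
    disp(norm(y))             % huge; with h < 1/50 the iteration decays instead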
Definition 6.7 (Stiffness). We say that the ODE y0 = f (t, y) is stiff if (for
some methods) we need to depress h to maintain stability well beyond require-
ments of accuracy.
The opposite of stiff is called non-stiff .
An important example of stiff systems is when an equation is linear and
$\mathrm{Re}\,\lambda_l < 0$, l = 1, . . . , N, and the quotient $\max_{l=1,\ldots,N}|\lambda_l| / \min_{l=1,\ldots,N}|\lambda_l|$ is large. A
ratio of $10^{20}$ is not unusual in real-life problems. Non-linear stiff equations oc-
cur throughout applications where we have two (or more) different timescales
in the ODE, i.e., if different parts of the system change with different speeds.
A typical example are equations of chemical kinetics where each timescale is
determined by the speed of reaction between two compounds. Such speeds can
differ by many orders of magnitude.
Exercise 6.7. The stiff differential equation

$$y'(t) = -10^6\left(y - t^{-1}\right) - t^{-2}, \qquad t \ge 1, \qquad y(1) = 1,$$
has the analytical solution $y(t) = t^{-1}$, t ≥ 1. Let it be solved numerically


by forward Euler yn+1 = yn + hn f (tn , yn ) and by backward Euler yn+1 =
yn + hn f (tn+1 , yn+1 ), where hn = tn+1 − tn is allowed to depend on n and to
be different for the two methods. Suppose that at a point tn ≥ 1 an accuracy of
$|y_n - y(t_n)| \le 10^{-6}$ is achieved and that we want to achieve the same accuracy
in the next step, i.e., $|y_{n+1} - y(t_{n+1})| \le 10^{-6}$. Show that forward Euler can
fail if $h_n = 2\times 10^{-6}$, but that backward Euler always achieves the desired
accuracy if $h_n \le t_n t_{n+1}^2$. (Hint: Find relations between $y_{n+1} - y(t_{n+1})$ and
yn − y(tn ).)
Definition 6.8 (A-stability). Suppose that a numerical method is applied to
the test equation $y' = \lambda y$ with initial condition y(0) = 1 and produces the
solution sequence $\{y_n\}_{n\in\mathbb{Z}^+}$ for constant h. We call the set
$$D = \{h\lambda \in \mathbb{C} : \lim_{n\to\infty} y_n = 0\}$$
the linear stability domain of the method. The set of $\lambda \in \mathbb{C}$ for which $y(t) \to 0$
as $t \to \infty$ is the exact stability domain and is the left half-plane $\mathbb{C}^- = \{z \in \mathbb{C} : \mathrm{Re}\, z < 0\}$.
We say that the method is A-stable if $\mathbb{C}^- \subseteq D$.
Note that A-stability does not mean that any step size can be chosen. We
need to choose h small enough to achieve the desired accuracy, but we do not
need to make it smaller to prevent instability.
We have already seen that for the forward Euler method yn → 0 if and only
if |1 + hλ| < 1. Therefore the stability domain is D = {z ∈ C : |1 + z| < 1}.
Solving $y' = \lambda y$ with the trapezoidal rule $y_{n+1} = y_n + \tfrac{1}{2}h\lambda(y_n + y_{n+1})$ gives
$$y_{n+1} = \frac{1 + \tfrac{1}{2}h\lambda}{1 - \tfrac{1}{2}h\lambda}\, y_n = \left(\frac{1 + \tfrac{1}{2}h\lambda}{1 - \tfrac{1}{2}h\lambda}\right)^{n+1} y_0.$$
Therefore the trapezoidal rule has the stability domain
$$D = \left\{z \in \mathbb{C} : \left|\frac{1 + \tfrac{1}{2}z}{1 - \tfrac{1}{2}z}\right| < 1\right\} = \mathbb{C}^-$$
and the method is A-stable.
Similarly it can be proven that the stability domain of the backward Euler
method is D = {z ∈ C : |1 − z| > 1} and hence the method is also A-stable.
The stability domains of the Theta methods (to which forward, backward
Euler, and the trapezoidal rule belong) are shown in Figure 6.2 for the values
θ = 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1. As θ decreases from 1, the dark
stability region increases until the circle opens up and becomes the imaginary
axis for θ = 12 , the trapezoidal rule. Then the white instability region continues
to shrink until it is the circle given by the backward Euler method (θ = 0).
Exercise 6.8. Find stability domain D for the explicit mid-point rule yn+2 =
yn + 2hf (tn+1 , yn+1 ).
The requirement of a method to be A-stable, however, limits the achievable
order.
Theorem 6.5. An s-step method given by (6.4) is A-stable if and only if the
zeros of the polynomial given by
$$\tau(w) = \sum_{l=0}^{s} (\rho_l - h\lambda\sigma_l)\, w^l \qquad (6.7)$$

lie within the unit circle for all hλ ∈ C− and the roots on the unit circle are
simple roots (these are roots where the function vanishes, but not its deriva-
tive).
Figure 6.2 Stability domains for θ = 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2,
and 0.1
Proof. When the s-step method given by (6.4) is applied to the test equation
$y' = \lambda y$, y(0) = 1, it reads
$$\sum_{l=0}^{s} (\rho_l - h\lambda\sigma_l)\, y_{n+l} = 0.$$
This recurrence relation has the characteristic polynomial given by (6.7). Let
its zeros be $w_1(h\lambda), \ldots, w_{N(h\lambda)}(h\lambda)$ with multiplicities $\mu_1(h\lambda), \ldots, \mu_{N(h\lambda)}(h\lambda)$,
respectively, where the multiplicities sum to the order of the polynomial τ. If
the root of a function has multiplicity k, then it and its first k − 1 derivatives
vanish there. The solutions of the recurrence relation are given by
$$y_n = \sum_{j=1}^{N(h\lambda)} \sum_{i=0}^{\mu_j(h\lambda)-1} n^i\, w_j(h\lambda)^n\, \alpha_{ij}(h\lambda),$$
where $\alpha_{ij}(h\lambda)$ are independent of n but depend on the starting values
$y_0, \ldots, y_{s-1}$. Hence the linear stability domain is the set of all $h\lambda \in \mathbb{C}$ such
that all the zeros of (6.7) satisfy $|w_j(h\lambda)| \le 1$ and if $|w_j(h\lambda)| = 1$, then
$\mu_j(h\lambda) = 1$.
The theorem implies that hλ ∈ C is in the stability region if the roots of
the polynomial ρ(w) − hλσ(w) lie within the unit circle. It follows that if hλ
is on the boundary of the stability region, then ρ(w) − hλσ(w) must have at
least one root with magnitude exactly equal to 1. Let this root be $e^{i\alpha}$ for some
value α in the interval [0, 2π]. Since $e^{i\alpha}$ is a root we have
$$\rho(e^{i\alpha}) - h\lambda\,\sigma(e^{i\alpha}) = 0$$
and hence
$$h\lambda = \frac{\rho(e^{i\alpha})}{\sigma(e^{i\alpha})}.$$
Since every point hλ on the boundary of the stability domain has to be of this
form, we can determine the parametrized curve
$$z(\alpha) = \frac{\rho(e^{i\alpha})}{\sigma(e^{i\alpha})}$$
for 0 ≤ α ≤ 2π which are all points that are potentially on the boundary of the
stability domain. For simple methods this yields the stability domain directly
after one determines on which side of the boundary the stability domain lies.
This is known as the boundary locus method .
We illustrate this with the Theta methods, which are given by
yn+1 − yn = h[(1 − θ)f (tn+1 , yn+1 ) + θf (tn , yn )].
Thus ρ(w) = w − 1 and σ(w) = (1 − θ)w + θ and the parametrized curve is
$$z(\alpha) = \frac{\rho(e^{i\alpha})}{\sigma(e^{i\alpha})} = \frac{e^{i\alpha} - 1}{(1-\theta)e^{i\alpha} + \theta}.$$
For various values of θ, these curves were used to generate Figure 6.2.
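The boundary locus is straightforward to plot in MATLAB; the sketch below is not one of the book's listings, and the value of θ is an arbitrary choice.

    % Boundary locus z(alpha) = rho(e^{i alpha})/sigma(e^{i alpha}) for the theta method.
    theta = 0.4;                               % illustrative value of theta
    alpha = linspace(0, 2*pi, 400);
    w = exp(1i*alpha);
    z = (w - 1)./((1 - theta)*w + theta);
    plot(real(z), imag(z)); axis equal; grid on
    xlabel('Re(h\lambda)'); ylabel('Im(h\lambda)');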
Theorem 6.6 (the second Dahlquist barrier). A-stability implies that the
order p has to be less than or equal to 2. Moreover, the second order A-stable
method with the least truncation error is the trapezoidal rule.
So no multistep method of p ≥ 3 may be A-stable, but there are methods
which are satisfactory for most stiff equations. The point is that in many stiff
linear systems in real world applications the eigenvalues are not just in C− , but
also well away from iR, that is, the imaginary axis. Therefore relaxed stability
concepts are sufficient. Requiring stability only across a wedge in C− of angle
α results in A(α)-stability. A-stability is equivalent to A(90◦ )-stability. A(α)-
stability is sufficient for most purposes. High-order A(α)-stable methods exist
for α < 90◦ . For α → 90◦ the coefficients of high-order A(α)-stable methods
begin to diverge. If λ in the test equation is purely imaginary, then the solution
is a linear combination of sin(λt) and cos(λt) and it oscillates a lot for large
λ. Therefore numerical pointwise solutions are useless anyway.

6.5 Adams Methods


Here we consider a technique to generate multistep methods which are con-
vergent and of high order. In the proof of Theorem (6.2) we have seen that a
necessary condition to achieve an order p ≥ 1 is
$$\sum_{l=0}^{s} \rho_l = \rho(1) = 0.$$

The technique chooses an arbitrary s-degree polynomial ρ that obeys the


root condition and has a simple root at 1 in order to achieve convergence.
To achieve the maximum order, we let σ be the polynomial of degree s for
implicit methods or the polynomial of degree s − 1 for explicit methods which
arises from the truncation of the Taylor series expansion of $\frac{\rho(w)}{\log w}$ about the
point w = 1. Thus, for example, for an implicit method,
$$\sigma(w) = \frac{\rho(w)}{\log w} + O(|w-1|^{s+1}) \quad\Rightarrow\quad \rho(e^z) - z\sigma(e^z) = O(z^{s+2}) \qquad (6.8)$$
and the order is at least s + 1.
The Adams methods correspond to the choice
$$\rho(w) = w^{s-1}(w - 1).$$
For σs = 0 we obtain the explicit Adams–Bashforth methods of order s. Oth-
erwise we obtain the implicit Adams–Moulton methods of order s + 1.
For example, letting s = 2 and v = w − 1 (equivalent w = v + 1) we expand
$$\frac{w(w-1)}{\log w} = \frac{v + v^2}{\log(1+v)} = \frac{v + v^2}{v - \tfrac{1}{2}v^2 + \tfrac{1}{3}v^3 - \cdots} = \frac{1 + v}{1 - \tfrac{1}{2}v + \tfrac{1}{3}v^2 - \cdots}$$
$$= (1 + v)\left[1 + \left(\tfrac{1}{2}v - \tfrac{1}{3}v^2\right) + \left(\tfrac{1}{2}v - \tfrac{1}{3}v^2\right)^2 + O(v^3)\right] = 1 + \tfrac{3}{2}v + \tfrac{5}{12}v^2 + O(v^3)$$
$$= 1 + \tfrac{3}{2}(w-1) + \tfrac{5}{12}(w-1)^2 + O(|w-1|^3) = -\tfrac{1}{12} + \tfrac{2}{3}w + \tfrac{5}{12}w^2 + O(|w-1|^3).$$
Therefore the 2-step, 3rd order Adams–Moulton method is
 
$$y_{n+2} - y_{n+1} = h\left[-\tfrac{1}{12} f(t_n, y_n) + \tfrac{2}{3} f(t_{n+1}, y_{n+1}) + \tfrac{5}{12} f(t_{n+2}, y_{n+2})\right].$$

Exercise 6.9. Calculate the actual values of the coefficients of the 3-step
Adams–Bashforth method.
Exercise 6.10 (Recurrence relation for Adams–Bashforth). Let ρs and σs
denote the polynomials generating the s-step Adams–Bashforth method. Prove
that
$$\sigma_s(w) = w\,\sigma_{s-1}(w) + \alpha_{s-1}(w-1)^{s-1},$$
where $\alpha_s \ne 0$, s = 1, 2, . . ., is a constant such that $\rho_s(w) - \log w\,\sigma_s(w) = \alpha_s(w-1)^{s+1} + O(|w-1|^{s+2})$ for w close to 1.
The Adams–Bashforth methods are as follows:

1. 1-step: $\rho(w) = w - 1$, $\sigma(w) \equiv 1$, which is the forward Euler method.

2. 2-step: $\rho(w) = w^2 - w$, $\sigma(w) = \tfrac{1}{2}(3w - 1)$.

3. 3-step: $\rho(w) = w^3 - w^2$, $\sigma(w) = \tfrac{1}{12}(23w^2 - 16w + 5)$.

The Adams–Moulton methods are

1. 1-step: $\rho(w) = w - 1$, $\sigma(w) = \tfrac{1}{2}(w + 1)$, which is the trapezoidal rule.

2. 2-step: $\rho(w) = w^2 - w$, $\sigma(w) = \tfrac{1}{12}(5w^2 + 8w - 1)$.

3. 3-step: $\rho(w) = w^3 - w^2$, $\sigma(w) = \tfrac{1}{24}(9w^3 + 19w^2 - 5w + 1)$.
Figure 6.3 depicts the stability domain for the 1-,2-, and 3-step Adams–
Bashforth methods, while Figure 6.4 depicts the stability domain for the 1-,2-,
and 3-step Adams–Moulton methods.
Another way to derive the Adams–Bashforth methods is by transforming
the initial value problem given by

$$y' = f(t, y), \qquad y(t_0) = y_0,$$
into its integral form
$$y(t) = y_0 + \int_{t_0}^{t} f(\tau, y(\tau))\,d\tau$$
Figure 6.3 The stability domains for various Adams–Bashforth methods

and partitioning the interval equally into t0 < t1 < · · · < tn < · · · with
step size h. Having already approximated yn , . . . , yn+s−1 , we use polynomial
interpolation to find the polynomial p of degree s − 1 such that p(tn+i ) =
f (tn+i , yn+i ) for i = 0, . . . , s − 1. Locally p is a good approximation to the
right-hand side of y 0 = f (t, y) that is to be solved, so we consider y 0 = p(t)
instead. This can be solved explicitly by
$$y_{n+s} = y_{n+s-1} + \int_{t_{n+s-1}}^{t_{n+s}} p(\tau)\,d\tau.$$

The coefficients of the Adams–Bashforth method can be calculated by substi-


tuting the Lagrange formula for p given by
$$p(t) = \sum_{j=0}^{s-1} f(t_{n+j}, y_{n+j}) \prod_{\substack{i=0\\ i\ne j}}^{s-1} \frac{t - t_{n+i}}{t_{n+j} - t_{n+i}} = \sum_{j=0}^{s-1} f(t_{n+j}, y_{n+j}) \prod_{\substack{i=0\\ i\ne j}}^{s-1} \frac{t - t_{n+i}}{(j-i)h} = \sum_{j=0}^{s-1} \frac{(-1)^{s-j-1} f(t_{n+j}, y_{n+j})}{j!\,(s-j-1)!\, h^{s-1}} \prod_{\substack{i=0\\ i\ne j}}^{s-1} (t - t_{n+i}),$$
since $\prod_{i\ne j}(j-i) = j!\,(-1)^{s-j-1}(s-j-1)!$.

The one-step Adams–Bashforth method uses a zero-degree, i.e., constant poly-


nomial interpolating the single value f (tn , yn ).
The Adams–Moulton methods arise in a similar way; however, the interpo-
lating polynomial is of degree s and not only uses the points tn , . . . tn+s−1 , but
also the point tn+s . In this framework the backward Euler method is also often
Figure 6.4 The stability domains for various Adams–Moulton methods

regarded as an Adams–Moulton method, since the interpolating polynomial


is the constant interpolating the single value f (tn+1 , yn+1 ).

6.6 Backward Differentiation Formulae


The backward differentiation formulae (BDF) are a family of implicit multi-
step methods. Here all but one coefficient on the right-hand side of (6.4) are
zero, i.e., $\sigma(w) = \sigma_s w^s$ with $\sigma_s \ne 0$. In other words,
$$\sum_{l=0}^{s} \rho_l\, y_{n+l} = h\sigma_s f(t_{n+s}, y_{n+s}), \qquad n = 0, 1, \ldots,$$

BDF are especially used for the solution of stiff differential equations.
To derive the explicit form of the s-step BDF we again employ the tech-
nique introduced in (6.8), this time solving for ρ(w), since σ(w) is given. Thus
$\rho(w) = \sigma_s w^s \log w + O(|w-1|^{s+1})$. Dividing by $w^s$ and setting v = 1/w, this becomes
$$\sum_{l=0}^{s} \rho_l\, v^{s-l} = -\sigma_s \log v + O(|v-1|^{s+1}).$$
The simple change from w to v in the O-term is possible since we are considering
w close to 1, or written mathematically w = O(1), and
$$O(|w-1|^{s+1}) = O\!\left(|w|^{s+1}\left|1 - \tfrac{1}{w}\right|^{s+1}\right) = O(|w|^{s+1})\,O(|1-v|^{s+1}) = O(|v-1|^{s+1}).$$
Now $\log v = \log(1 + (v-1)) = \sum_{l=1}^{\infty} (-1)^{l-1}(v-1)^l/l$. Consequently we want
to choose the coefficients ρ1 , . . . , ρs such that


$$\sum_{l=0}^{s} \rho_l\, v^{s-l} = \sigma_s \sum_{l=1}^{s} \frac{(-1)^l}{l}(v-1)^l + O(|v-1|^{s+1}).$$

Restoring $w = v^{-1}$ and multiplying by $w^s$ yields
$$\sum_{l=0}^{s} \rho_l\, w^l = \sigma_s w^s \sum_{l=1}^{\infty} \frac{(-1)^l}{l}\, w^{-l}(1-w)^l = \sigma_s \sum_{l=1}^{\infty} \frac{(-1)^l}{l}\, w^{s-l}(1-w)^l. \qquad (6.9)$$

We expand
     
$$(1-w)^l = 1 - \binom{l}{1}w + \binom{l}{2}w^2 - \cdots + (-1)^l\binom{l}{l}w^l$$

and then pick σs so that ρs = 1 by collecting the powers of ws on the right-


hand side. This gives
s
!−1
X 1
σs = . (6.10)
l
l=1
2
As an example we let s = 2. Substitution in (6.10) gives σ2 = 3 and using
(6.9) we have ρ(w) = w2 − 43 w + 13 . Hence the 2-step BDF is

4 1 2
yn+2 − yn+1 + yn = hf (tn+2 , yn+2 ).
3 3 3
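The construction via (6.9) and (6.10) is mechanical and can be automated. The MATLAB sketch below is not from the book; polynomial coefficients are stored with the highest power of w first, and the example reproduces the 2-step BDF.

    % Generate the coefficients of the s-step BDF from (6.9) and (6.10).
    s = 2;
    sigma_s = 1/sum(1./(1:s));                % equation (6.10)
    rho = zeros(1, s+1);                      % coefficients of w^s, ..., w^0
    for l = 1:s
        q = (-1)^l*poly(ones(1, l));          % (1 - w)^l = (-1)^l (w - 1)^l
        q = [q, zeros(1, s - l)];             % multiply by w^(s-l)
        rho = rho + sigma_s*(-1)^l/l*q;
    end
    % For s = 2 this gives sigma_s = 2/3 and rho = [1, -4/3, 1/3].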
The BDF are as follows:

• 1-step: $\rho(w) = w - 1$, $\sigma(w) = w$, which is the backward Euler method (again!).

• 2-step: $\rho(w) = w^2 - \tfrac{4}{3}w + \tfrac{1}{3}$, $\sigma(w) = \tfrac{2}{3}w^2$.

• 3-step: $\rho(w) = w^3 - \tfrac{18}{11}w^2 + \tfrac{9}{11}w - \tfrac{2}{11}$, $\sigma(w) = \tfrac{6}{11}w^3$.

• 4-step: $\rho(w) = w^4 - \tfrac{48}{25}w^3 + \tfrac{36}{25}w^2 - \tfrac{16}{25}w + \tfrac{3}{25}$, $\sigma(w) = \tfrac{12}{25}w^4$.

• 5-step: $\rho(w) = w^5 - \tfrac{300}{137}w^4 + \tfrac{300}{137}w^3 - \tfrac{200}{137}w^2 + \tfrac{75}{137}w - \tfrac{12}{137}$, $\sigma(w) = \tfrac{60}{137}w^5$.

• 6-step: $\rho(w) = w^6 - \tfrac{360}{147}w^5 + \tfrac{450}{147}w^4 - \tfrac{400}{147}w^3 + \tfrac{225}{147}w^2 - \tfrac{72}{147}w + \tfrac{10}{147}$, $\sigma(w) = \tfrac{60}{147}w^6$.

It can be proven that BDF are convergent if and only if s ≤ 6. For higher
values of s they must not be used. Figure (6.5) shows the stability domain for
various BDF methods. As the number of steps increases the stability domain
shrinks but it remains unbounded for s ≤ 6. In particular, the 3-step BDF is
A(86°2′)-stable.
Figure 6.5 The stability domains of various BDF methods in grey. The
instability regions are in white.

6.7 The Milne and Zadunaisky Device


The step size h is not some fixed quantity: it is a parameter of the method
which can vary from step to step. The basic input of a well-designed computer
package for ODEs is not the step size but the error tolerance, as required
by the user. With the choice of h > 0 we can keep a local estimate of the
error beneath the required tolerance in the solution interval. We don’t just
need a time-stepping algorithm, but also mechanisms for error control and for
adjusting the step size.
Suppose we wish to estimate in each step the error of the trapezoidal rule
(TR)
$$y_{n+1} = y_n + \tfrac{1}{2}h\left[f(t_n, y_n) + f(t_{n+1}, y_{n+1})\right].$$
Substituting the true solution, we deduce that
$$y(t_{n+1}) - \left\{y(t_n) + \tfrac{1}{2}h\left[y'(t_n) + y'(t_{n+1})\right]\right\} = -\tfrac{1}{12}h^3 y'''(t_n) + O(h^4)$$
and the order is 2. The error constant of TR is $c_{\mathrm{TR}} = -\tfrac{1}{12}$. To estimate the
error in a single step we assume that $y_n = y(t_n)$ and this yields
$$y(t_{n+1}) - y_{n+1} = c_{\mathrm{TR}}\, h^3 y'''(t_n) + O(h^4). \qquad (6.11)$$

Each multi-step method has its own error constant. For example, the 2nd-order
2-step Adams–Bashforth method (AB2),
$$y_{n+1} - y_n = \tfrac{1}{2}h\left[3f(t_n, y_n) - f(t_{n-1}, y_{n-1})\right],$$
has the error constant $c_{\mathrm{AB2}} = \tfrac{5}{12}$.
The idea behind the Milne device is to use two multistep methods of the
same order, one explicit and the second implicit, to estimate the local error
of the implicit method. For example, locally,
$$y_{n+1}^{\mathrm{AB2}} \approx y(t_{n+1}) - c_{\mathrm{AB2}}\, h^3 y'''(t_n) = y(t_{n+1}) - \tfrac{5}{12}\, h^3 y'''(t_n),$$
$$y_{n+1}^{\mathrm{TR}} \approx y(t_{n+1}) - c_{\mathrm{TR}}\, h^3 y'''(t_n) = y(t_{n+1}) + \tfrac{1}{12}\, h^3 y'''(t_n).$$
Subtracting, we obtain the estimate
$$h^3 y'''(t_n) \approx -2\left(y_{n+1}^{\mathrm{AB2}} - y_{n+1}^{\mathrm{TR}}\right),$$
which we can insert back into (6.11) to give
$$y(t_{n+1}) - y_{n+1} \approx \tfrac{1}{6}\left(y_{n+1}^{\mathrm{AB2}} - y_{n+1}^{\mathrm{TR}}\right)$$
and we use the right-hand side as an estimate of the local error.
The trapezoidal rule is the superior method, since it is A-stable and its
global behaviour is hence better. The Adams–Bashforth method is solely em-
ployed to estimate the local error. Since it is explicit this adds little overhead.
In this example, the trapezoidal method was the corrector , while the
Adams–Bashforth method was the predictor . The Adams–Bashforth and
Adams–Moulton methods of the same order are often employed as predictor–
corrector pairs. For example, for the order 3 we have the predictor
$$y_{n+2} = y_{n+1} + h\left[\tfrac{5}{12} f(t_{n-1}, y_{n-1}) - \tfrac{4}{3} f(t_n, y_n) + \tfrac{23}{12} f(t_{n+1}, y_{n+1})\right],$$
and the corrector
$$y_{n+2} = y_{n+1} + h\left[-\tfrac{1}{12} f(t_n, y_n) + \tfrac{2}{3} f(t_{n+1}, y_{n+1}) + \tfrac{5}{12} f(t_{n+2}, y_{n+2})\right].$$
The predictor is employed not just to estimate the error of the corrector,
but also to provide an initial guess in the solution of the implicit corrector
equations. Typically, for nonstiff equations, we iterate correction equations at
most twice, while stiff equations require iteration to convergence, otherwise
the typically superior stability features of the corrector are lost.
Exercise 6.11. Consider the predictor–corrector pair given by
$$y_{n+3}^{P} = -\tfrac{1}{2}\, y_n + 3\, y_{n+1} - \tfrac{3}{2}\, y_{n+2} + 3h f(t_{n+2}, y_{n+2}),$$
$$y_{n+3}^{C} = \tfrac{1}{11}\left[2 y_n - 9 y_{n+1} + 18 y_{n+2} + 6h f(t_{n+3}, y_{n+3})\right].$$

The predictor is as in Exercise 6.4. The corrector is the three-step backward


differentiation formula. Show that both methods are third order, and that the
estimate of the error of the corrector formula by Milne’s device has the value
$\tfrac{6}{17}\left|y_{n+3}^{P} - y_{n+3}^{C}\right|$.

Let ε > 0 be a user-specified tolerance, i.e., this is the maximal error
allowed. After completion of a single step with a local error estimate e, there
are three outcomes.

1. (1/10)ε ≤ ‖e‖ ≤ ε: Accept the step and continue to calculate the next
   estimate with the same step size.

2. ‖e‖ ≤ (1/10)ε: Accept the step and increase the step length.

3. ‖e‖ > ε: Reject the step, return to t_n and use a smaller step size.

The quantity e/h is the estimate for the global error in an interval of unit
length. This is usually required not to exceed ε, since good implementations
of numerical ODEs should monitor the accumulation of global error. This is
called error estimation per unit step.
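
A minimal MATLAB sketch of this control logic might look as follows; the factor 2 used to enlarge or shrink the step and the use of norm are illustrative choices, not prescribed by the text.

% Given a local error estimate e, a tolerance tol and the current step h,
% decide whether to accept the step and how to change h (illustrative factors).
err = norm(e);
if err > tol                  % outcome 3: reject, retry from t_n with a smaller step
    accept = false; h = h/2;
elseif err <= tol/10          % outcome 2: accept and increase the step length
    accept = true;  h = 2*h;
else                          % outcome 1: accept and keep the step size
    accept = true;
end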
Adjusting the step size can be done with polynomial interpolation to obtain
estimates of the function values required, although this means that we need to
store past values well in excess of what is necessary for simple implementation
of both multistep methods.

Exercise 6.12. Let p be the cubic polynomial that is defined by p(t_j) = y_j,
j = n, n+1, n+2, and by p'(t_{n+2}) = f(t_{n+2}, y_{n+2}). Show that the predictor
formula of the previous exercise is y^P_{n+3} = p(t_{n+2} + h). Further, show that the
corrector formula is equivalent to the equation

y^C_{n+3} = p(t_{n+2}) + (5/11) h p'(t_{n+2}) − (1/22) h^2 p''(t_{n+2})
            − (7/66) h^3 p'''(t_{n+2}) + (6/11) h f(t_{n+2} + h, y_{n+3}).

The point is that p can be derived from available data, and then the above forms
of the predictor and corrector can be applied for any choice of h = t_{n+3} − t_{n+2}.

The Zadunaisky device is another way to obtain an error estimate. Sup-


pose we have calculated the (not necessarily equidistant) solution values
yn , yn−1 , . . . , yn−p with an arbitrary numerical method of order p. We form
an interpolating pth degree (vector valued) polynomial p such that p(tn−i ) =
yn−i , i = 0, 1, . . . , p, and consider the differential equation

z' = f(t, z) + p'(t) − f(t, p),    z(t_n) = y_n.        (6.12)

There are two important observations with regard to this differential equation.
Firstly, it is a small perturbation of the original ODE, because the term
p'(t) − f(t, p) is usually small: locally p(t) − y(t) = O(h^{p+1})
and y'(t) = f(t, y(t)). Secondly, the exact solution of (6.12) is obviously
z(t) = p(t). Now we calculate zn+1 using exactly the same method and imple-
mentation details. We then evaluate the error in zn+1 , namely zn+1 −p(tn+1 ),
and use it as an estimate of the error in yn+1 . The error estimate can then be
used to assess the step and adjust the step size if necessary.

6.8 Rational Methods


In the following we consider methods which use higher derivatives. Repeatedly
differentiating (6.1) we obtain formulae of the form y^{(k)} = f_k(y), where

f_0(y) = y,    f_1(y) = f(y),    f_2(y) = (∂f(y)/∂y) f(y),    ...
This then motivates the Taylor method
y_{n+1} = ∑_{k=0}^{p} (1/k!) h^k f_k(y_n),    n ∈ Z^+.        (6.13)

For example,

p = 1 :  y_{n+1} = y_n + h f(y_n)                                   (forward Euler),

p = 2 :  y_{n+1} = y_n + h f(y_n) + (1/2) h^2 (∂f(y_n)/∂y) f(y_n).
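
As an illustration, the following MATLAB sketch (not one of the book's listings) advances the p = 2 Taylor method for a scalar autonomous equation; supplying the derivative ∂f/∂y as a second function handle dfdy is an assumption made for this example.

function y = taylor2(f, dfdy, y0, h, N)
% p = 2 Taylor method for the scalar autonomous ODE y' = f(y).
% f and dfdy are function handles, y0 the initial value, h the step, N the
% number of steps; returns the vector of approximations y_0, ..., y_N.
y = zeros(N+1, 1); y(1) = y0;
for n = 1:N
    y(n+1) = y(n) + h*f(y(n)) + 0.5*h^2*dfdy(y(n))*f(y(n));
end
end

For example, y = taylor2(@(y) -y, @(y) -1, 1, 0.1, 10) approximates y' = −y, y(0) = 1.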

Theorem 6.7. The Taylor method given by (6.13) has error O(hp+1 ).
Proof. The proof is easily done by induction. Firstly we have y0 = y(0) =
y(0) + O(hp+1 ). Now assume yn = y(tn ) + O(hp+1 ) = y(nh) + O(hp+1 ). It
follows that fk (yn ) = fk (y(nh)) + O(hp+1 ), since f is analytic. Hence yn+1 =
y((n + 1)h) + O(hp+1 ) = y(tn+1 ) + O(hp+1 ).
Recalling the differentiation operator D_t y(t) = y'(t), we see that

D_t^k y = f_k(y).

Let R(z) = ∑_{k=0}^{∞} r_k z^k be an analytic function such that R(z) = e^z + O(z^{p+1}),
i.e., r_k = 1/k! for k = 0, 1, . . . , p. Then the formal method defined by

y_{n+1} = R(hD_t) y_n = ∑_{k=0}^{∞} r_k h^k f_k(y_n),    n ∈ Z^+,

is of order p. Indeed, the Taylor method is one such method by dint of letting
R(z) be the pth section of the Taylor expansion of ez .
We can let R be a rational function of the form
R(z) = (∑_{k=0}^{M} p_k z^k) / (∑_{k=0}^{N} q_k z^k),

which corresponds to the numerical method


∑_{k=0}^{N} q_k h^k f_k(y_{n+1}) = ∑_{k=0}^{M} p_k h^k f_k(y_n),        (6.14)

where q0 = 1. For N ≥ 1 this is an algebraic system of equations which has


to be solved by a suitable method; for example, Newton–Raphson.
Definition 6.9 (Padé approximants). Given a function f which is analytic
at the origin, the [M/N ] Padé approximant is the quotient of an M th degree
polynomial over an N th degree polynomial which matches the Taylor expansion
of f to the highest possible order.
For f(z) = e^z we have the Padé approximant R_{M/N} = P_{M/N}/Q_{M/N},
where

P_{M/N}(z) = ∑_{k=0}^{M} \binom{M}{k} ((M+N−k)!/(M+N)!) z^k,

Q_{M/N}(z) = ∑_{k=0}^{N} \binom{N}{k} ((M+N−k)!/(M+N)!) (−z)^k = P_{N/M}(−z).

Lemma 6.1. RM/N (z) = ez + O(z M +N +1 ) and no other rational function of


this form can do better.
Proof. Omitted. For further information on Padé approximants please see [1]
G. A. Baker Jr. and P. Graves–Morris, Padé Approximants.
It follows directly from the lemma above that the Padé method
∑_{k=0}^{N} (−1)^k \binom{N}{k} ((M+N−k)!/(M+N)!) h^k f_k(y_{n+1}) = ∑_{k=0}^{M} \binom{M}{k} ((M+N−k)!/(M+N)!) h^k f_k(y_n)

is of order M + N .

For example, if M = 1, N = 0 we have P1/0 = 1 + z and Q1/0 = 1 and this


yields the forward Euler method
yn+1 = yn + hf (yn ).
On the other hand, if M = 0, N = 1 we have P0/1 = 1 and Q0/1 = 1 − z and
this yields the backward Euler method
yn+1 = yn + hf (yn+1 ).
For M = N = 1, the polynomials are P_{1/1} = 1 + (1/2)z and Q_{1/1} = 1 − (1/2)z and
we have the trapezoidal rule

y_{n+1} = y_n + (h/2)(f(y_n) + f(y_{n+1})).

On the other hand, M = 0, N = 2 gives P_{0/2} = 1 and Q_{0/2} = 1 − z + (1/2)z^2 with
the method

y_{n+1} − h f(y_{n+1}) + (1/2) h^2 (∂f(y_{n+1})/∂y) f(y_{n+1}) = y_n,

which could be described as a backward Taylor method.
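
The coefficients of P_{M/N} and Q_{M/N} are easily generated. The MATLAB sketch below (an illustration, not one of the book's listings; the function name is hypothetical) builds them with nchoosek and factorial and can be used to check that the [1/1] approximant reproduces the stability function of the trapezoidal rule.

function [P, Q] = pade_exp(M, N)
% Coefficients of the [M/N] Pade approximant of exp(z), returned in
% MATLAB's polynomial convention (highest power first).
P = zeros(1, M+1); Q = zeros(1, N+1);
for k = 0:M
    P(M+1-k) = nchoosek(M, k)*factorial(M+N-k)/factorial(M+N);
end
for k = 0:N
    Q(N+1-k) = nchoosek(N, k)*factorial(M+N-k)/factorial(M+N)*(-1)^k;
end
end

For M = N = 1 this returns P = [0.5 1] and Q = [-0.5 1], i.e., (1 + z/2)/(1 − z/2), in agreement with the trapezoidal rule above.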
Considering the test equation y' = λy, y(0) = 1, the method given
by (6.14) calculates the approximations according to

∑_{k=0}^{N} q_k h^k λ^k y_{n+1} = ∑_{k=0}^{M} p_k h^k λ^k y_n.

Hence y_{n+1} = R(hλ) y_n = (R(hλ))^{n+1} y_0. Thus the stability domain of this
method is
{z ∈ C : |R(z)| < 1}.
Theorem 6.8. The rational method given by (6.14) is A-stable if and only if
firstly all the poles of R reside in the right half-plane C^+ := {z ∈ C : Re z > 0}
and secondly |R(iy)| ≤ 1 for all y ∈ R.
Proof. This is proven by the maximum modulus principle. Since R is analytic
in C− and all poles are in C+ , R takes its maximum modulus in C− on the
boundary of C− , which is the line iy, y ∈ R.
Moreover, according to a theorem of Wanner, Hairer, and Nørsett, the
Padé method is A-stable if and only if M ≤ N ≤ M + 2.

6.9 Runge–Kutta Methods


In the section on Adams methods we have seen how they can be derived by
approximating the derivative by an interpolating polynomial which is then
integrated. This was inspired by recasting the initial value problem given by
y' = f(t, y),    y(t_0) = y_0,

into its integral form


y(t_{n+1}) = y(t_n) + ∫_{t_n}^{t_{n+1}} f(τ, y(τ)) dτ.        (6.15)

The integral can be approximated by a quadrature formula. A quadrature


formula is a linear combination of function values in a given interval which
approximates the integral of the function over the interval. Quadrature for-
mulae are often chosen so that they are exact for polynomials up to a certain
degree ν.
More precisely, we approximate (6.15) by
y_{n+1} = y_n + h ∑_{l=1}^{ν} b_l f(t_n + c_l h, y(t_n + c_l h)),        (6.16)

except that, of course, the vectors y(tn + cl h) are unknown! Runge–Kutta


methods are a means of implementing (6.16) by replacing unknown values
of y by suitable linear combinations. The general form of a ν-stage explicit
Runge–Kutta method (RK) is

k_1 = f(t_n, y_n),
k_2 = f(t_n + c_2 h, y_n + h c_2 k_1),
k_3 = f(t_n + c_3 h, y_n + h(a_{3,1} k_1 + a_{3,2} k_2)),    a_{3,1} + a_{3,2} = c_3,
  ⋮
k_ν = f(t_n + c_ν h, y_n + h ∑_{j=1}^{ν−1} a_{ν,j} k_j),    ∑_{j=1}^{ν−1} a_{ν,j} = c_ν,

y_{n+1} = y_n + h ∑_{l=1}^{ν} b_l k_l.

b = (b_1, . . . , b_ν)^T are called the Runge–Kutta weights and satisfy the
condition ∑_{l=1}^{ν} b_l = 1. c = (c_1, . . . , c_ν)^T are called the Runge–Kutta nodes. The
method is called consistent if

∑_{j=1}^{ν−1} a_{i,j} = c_i,    i = 1, . . . , ν.

To specify a particular method, one needs to provide the integer ν (the


number of stages), and the coefficients ai,j (for 1 ≤ i, j ≤ ν), bi (for i =
1, 2, ..., ν), and ci (for i = 1, 2, . . . , ν). These data are usually arranged in a
mnemonic device, known as a Butcher tableau shown in Figure 6.6.
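
The tableau translates directly into code. The sketch below (an illustrative MATLAB helper, not one of the book's listings) performs a single step of an arbitrary explicit Runge–Kutta method specified by its Butcher tableau (A, b, c) for a system y' = f(t, y).

function ynew = erk_step(f, t, y, h, A, b, c)
% One step of an explicit Runge-Kutta method with Butcher tableau (A, b, c).
% A is strictly lower triangular, b and c are column vectors, y is a column vector.
nu = length(b);
K = zeros(length(y), nu);
for i = 1:nu
    % stage i uses only the stages already computed
    K(:, i) = f(t + c(i)*h, y + h*K(:, 1:i-1)*A(i, 1:i-1)');
end
ynew = y + h*K*b;
end

For instance, the classical fourth-order method of (6.18) corresponds to A = [0 0 0 0; 1/2 0 0 0; 0 1/2 0 0; 0 0 1 0], b = [1/6;1/3;1/3;1/6], c = [0;1/2;1/2;1].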
The simplest Runge–Kutta method is the forward Euler method, which
we can rewrite as

yn+1 = yn + hk1 , k1 = f (tn , yn ),



0    |
c_2  | a_{2,1}
c_3  | a_{3,1}   a_{3,2}
 ⋮   |    ⋮         ⋮      ⋱                                        c | A
c_ν  | a_{ν,1}   a_{ν,2}   · · ·   a_{ν,ν−1}                ⇔         | b^T
     | b_1       b_2       · · ·   b_{ν−1}     b_ν

Figure 6.6 Butcher tableau

where k1 is an increment based on an estimate of y0 at the point (tn , yn ).


This is the only consistent explicit Runge–Kutta method of one stage. The
corresponding tableau is:
0
1
A 2-stage Runge–Kutta method can be derived by re-estimating y' at
the point (t_{n+1}, y_{n+1}). This gives an increment, say k_2, where

k_2 = f(t_n + h, y_n + h k_1).

The Runge–Kutta method uses the average of these two increments, i.e.,

y_{n+1} = y_n + (h/2)(k_1 + k_2).

The corresponding tableau is

0 |
1 | 1
  | 1/2   1/2

This method is known as Heun's method.
Another 2-stage method is provided by the midpoint method specified by

y_{n+1} = y_n + h f(t_n + (1/2)h, y_n + (1/2)h f(t_n, y_n)),

with tableau

0   |
1/2 | 1/2
    | 0     1
Both of these methods belong to the family of explicit methods given by
 
y_{n+1} = y_n + h[(1 − 1/(2α)) f(t_n, y_n) + (1/(2α)) f(t_n + αh, y_n + αh f(t_n, y_n))].

Figure 6.7 Stability domain of the methods given by (6.17)

They have the tableau


0 |
α | α                                                        (6.17)
  | 1 − 1/(2α)    1/(2α)

The choice α = 1/2 recovers the midpoint rule and α = 1 is Heun's rule. All
these methods have the same stability domain which is shown in Figure 6.7.
The choice of the RK coefficients al,j is motivated at the first instance by
order considerations. Thus we again have to perform Taylor expansions. As
an example we derive a 2-stage Runge–Kutta method. We have k1 = f (tn , yn )
and to examine k2 we Taylor-expand about (tn , yn ),
k_2 = f(t_n + c_2 h, y_n + h c_2 f(t_n, y_n))
    = f(t_n, y_n) + h c_2 [∂f(t_n, y_n)/∂t + (∂f(t_n, y_n)/∂y) f(t_n, y_n)] + O(h^2).

On the other hand we have

y' = f(t, y)    ⇒    y'' = ∂f(t, y)/∂t + (∂f(t, y)/∂y) f(t, y).

Therefore, substituting the exact solution y_n = y(t_n), we obtain k_1 = y'(t_n)
and k_2 = y'(t_n) + h c_2 y''(t_n) + O(h^2). Consequently, the local error is

y(t_{n+1}) − y_{n+1} = [y(t_n) + h y'(t_n) + (1/2)h^2 y''(t_n) + O(h^3)]
                      − [y(t_n) + h(b_1 + b_2) y'(t_n) + h^2 b_2 c_2 y''(t_n) + O(h^3)].

We deduce that the Runge–Kutta method is of order 2 if b_1 + b_2 = 1 and
b_2 c_2 = 1/2. The methods given by (6.17) all fulfill this criterion.

Exercise 6.13. Show that the truncation error of the methods given by (6.17) is
minimal for α = 2/3. Also show that no such method has order 3 or above.
Different categories of Runge–Kutta methods are abbreviated as follows

1. Explicit RK (ERK): A is strictly lower triangular;


2. Diagonally implicit RK (DIRK): A lower triangular;
3. Singly diagonally implicit RK (SDIRK): A lower triangular, a_{i,i} ≡ const ≠ 0;

4. Implicit RK (IRK): Otherwise.


The original Runge–Kutta method is fourth-order and is given by the
tableau
0   |
1/2 | 1/2
1/2 | 0     1/2                                              (6.18)
1   | 0     0     1
    | 1/6   1/3   1/3   1/6

Its stability domain is shown in Figure 6.8.


Two 3-stage explicit methods of order 3 are named after Kutta and Nyström,
respectively.

Kutta:                                Nyström:
0   |                                 0   |
1/2 | 1/2                             2/3 | 2/3
1   | −1    2                         2/3 | 0     2/3
    | 1/6   2/3   1/6                     | 1/4   3/8   3/8

The one in the tableau of Kutta’s method shows that this method explicitly
employs an estimate of f at tn + h. Both methods have the same stability
domain shown in Figure 6.9.
An error control device specific to Runge–Kutta methods is embedding
with adaptive step size. We embed a method in a larger method. For example,
let
     ( Ã    0 )            ( c̃ )
A =  ( a^T  a ),      c =  ( c  ),

such that the method given by

c | A
  | b^T

is of higher order than

c̃ | Ã
  | b̃^T.

Figure 6.8 The stability domain of the original Runge–Kutta method given by (6.18)

Comparison of the two yields an estimate of the local error. We use the method
with the smaller truncation error to estimate the error in the other method.
More specifically, ‖y_{n+1} − y(t_{n+1})‖ ≈ ‖y_{n+1} − ỹ_{n+1}‖. This is then used to
adjust the step size to achieve a desired accuracy. The error estimate is used
to improve the solution. Often the matrix à and vector c̃ are actually not
extended. Both methods use the same matrix of coefficients and nodes. How-
ever, the weights b̃ differ. The methods are described with an extended Butcher
tableau, which is the Butcher tableau of the higher-order method with another
row added for the weights of the lower-order method.
c_1 | a_{1,1}   a_{1,2}   · · ·   a_{1,ν}
c_2 | a_{2,1}   a_{2,2}   · · ·   a_{2,ν}
 ⋮  |    ⋮         ⋮       ⋱         ⋮
c_ν | a_{ν,1}   a_{ν,2}   · · ·   a_{ν,ν}
    | b_1       b_2       · · ·   b_ν
    | b̃_1       b̃_2       · · ·   b̃_ν
The simplest adaptive Runge–Kutta method involves combining the Heun
method, which is order 2, with the forward Euler method, which is order 1.

Figure 6.9 Stability domain of Kutta’s and Nystrom’s method

The result is the Heun–Euler method and its extended Butcher tableau is
0 |
1 | 1
  | 1/2   1/2
  | 1     0

The zero in the bottom line of the tableau shows that the forward Euler
method does not use the estimate k2 .
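
In code, the embedded estimate costs almost nothing, since the two methods share their stages. The sketch below (illustrative, not one of the book's listings) computes one Heun step together with the Heun–Euler error estimate.

function [ynew, errest] = heun_euler_step(f, t, y, h)
% One step of Heun's method (order 2) with the embedded forward Euler
% method (order 1) used to estimate the local error.
k1 = f(t, y);
k2 = f(t + h, y + h*k1);
ynew = y + h/2*(k1 + k2);    % order-2 solution
yeul = y + h*k1;             % order-1 solution, uses k1 only
errest = norm(ynew - yeul);  % estimate of the local error
end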
The Bogacki–Shampine method has two methods of orders 3 and 2. Its
extended Butcher tableau is shown in Figure 6.10

0   |
1/2 | 1/2
3/4 | 0      3/4
1   | 2/9    1/3   4/9
    | 2/9    1/3   4/9   0
    | 7/24   1/4   1/3   1/8

Figure 6.10 Tableau of the Bogacki–Shampine method

There are several things to note about this method: firstly, the zero in the

vector of weights of the higher-order method: the higher-order method therefore
does not employ k_4, whereas the lower-order method uses all four stages.
Secondly, the calculation of k_4 uses the same weights as the calculation
of y_{n+1}, so k_4 actually equals f(t_n + h, y_{n+1}). This is known as the First
Same As Last (FSAL) property, since k4 in this step will be k1 in the next
step. So this method uses three function evaluations of f per step. This method
is implemented in the ode23 function in MATLAB.
The Runge–Kutta–Fehlberg method (or Fehlberg method) combines two
methods of order 4 and 5. Figure 6.11 shows the tableau. This method is
often implemented as RKF45 in collections of numerical methods. The coeffi-
cients are chosen so that the error in the fourth-order method is minimized.

0     |
1/4   | 1/4
3/8   | 3/32         9/32
12/13 | 1932/2197    −7200/2197    7296/2197
1     | 439/216      −8            3680/513      −845/4104
1/2   | −8/27        2             −3544/2565    1859/4104     −11/40
      | 16/135       0             6656/12825    28561/56430   −9/50    2/55
      | 25/216       0             1408/2565     2197/4104     −1/5     0

Figure 6.11 Tableau of the Fehlberg method

The Cash–Karp method takes the concept of error control and adaptive
step size further: instead of embedding just a single lower-order method,
methods of order 1, 2, 3, and 4 are all embedded in a fifth-order method. The
tableau is displayed in Figure 6.12. Note that the order-one method is the
forward Euler method.
The Cash–Karp method was designed to deal with situations in which
certain derivatives of the solution are very large for part of the region, for
example regions where the solution has a sharp front or some derivative of
the solution is discontinuous in the limit. In these circumstances the step
size has to be adjusted. By computing solutions at several different orders, it
is possible to detect sharp fronts or discontinuities before all the function eval-
uations defining the full Runge–Kutta step have been computed. We can then
either accept a lower-order solution or abort the step (and try again with a
smaller step-size), depending on which course of action seems appropriate. J.
Cash provides the code for this algorithm in Fortran, C, and MATLAB on his
homepage.
The Dormand–Prince method is similar to the Fehlberg method. It also
embeds a fourth-order method in a fifth-order method. The coefficients are,
however, chosen so that the error of the fifth-order solution is minimized and

0    |
1/5  | 1/5
3/10 | 3/40          9/40
3/5  | 3/10          −9/10      6/5
1    | −11/54        5/2        −70/27        35/27
7/8  | 1631/55296    175/512    575/13824     44275/110592   253/4096
     | 37/378        0          250/621       125/594        0           512/1771   Order 5
     | 2825/27648    0          18575/48384   13525/55296    277/14336   1/4        Order 4
     | 19/54         0          −10/27        55/54          0           0          Order 3
     | −3/2          5/2        0             0              0           0          Order 2
     | 1             0          0             0              0           0          Order 1

Figure 6.12 Tableau of the Cash–Karp method

the difference between the solutions is used to estimate the error in the fourth-
order method. The Dormand–Prince method has seven stages, but it uses
only six function evaluations per step, because it has the FSAL property.
The tableau is given in Figure 6.13. This method is currently the default in
MATLAB’s ode45 solver.
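
A typical call to ode45 therefore looks as follows; the tolerances shown are illustrative choices and the right-hand side can be any function handle returning a column vector.

% Solve y' = -y on [0, 10] with y(0) = 1 using the Dormand-Prince pair.
opts = odeset('RelTol', 1e-6, 'AbsTol', 1e-9);   % error tolerances (illustrative)
[t, y] = ode45(@(t, y) -y, [0 10], 1, opts);
plot(t, y);                                      % plot the adaptive-step solution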

0    |
1/5  | 1/5
3/10 | 3/40          9/40
4/5  | 44/45         −56/15         32/9
8/9  | 19372/6561    −25360/2187    64448/6561    −212/729
1    | 9017/3168     −355/33        46732/5247    49/176        −5103/18656
1    | 35/384        0              500/1113      125/192       −2187/6784      11/84
     | 35/384        0              500/1113      125/192       −2187/6784      11/84      0
     | 5179/57600    0              7571/16695    393/640       −92097/339200   187/2100   1/40

Figure 6.13 Tableau of the Dormand–Prince method

Runge–Kutta methods are a field of ongoing research, and new high-order
methods are still being developed. For recent work see, for example, that of
J. H. Verner on the derivation of Runge–Kutta pairs.
Next we will turn our attention to implicit Runge–Kutta methods.

A general ν-stage Runge–Kutta method is

k_l = f(t_n + c_l h, y_n + h ∑_{j=1}^{ν} a_{l,j} k_j),    ∑_{j=1}^{ν} a_{l,j} = c_l,
                                                              (6.19)
y_{n+1} = y_n + h ∑_{l=1}^{ν} b_l k_l.

Obviously, a_{l,j} = 0 for all l ≤ j yields the standard explicit RK; otherwise,
an RK method is said to be implicit.
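
For an implicit method the stage equations in (6.19) form an algebraic system. For nonstiff problems a simple fixed-point iteration often suffices, as in the following illustrative MATLAB sketch (stiff problems would instead require a Newton-type iteration); the function name and the fixed iteration count are assumptions made for this example.

function ynew = irk_step(f, t, y, h, A, b, c, iters)
% One step of a nu-stage (possibly implicit) Runge-Kutta method, solving the
% stage equations by fixed-point iteration (suitable for nonstiff f).
nu = length(b);
K = repmat(f(t, y), 1, nu);            % initial guess: all stages equal f(t_n, y_n)
for it = 1:iters
    for l = 1:nu
        K(:, l) = f(t + c(l)*h, y + h*K*A(l, :)');
    end
end
ynew = y + h*K*b;
end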
One way to derive implicit Runge–Kutta methods is via collocation.
However, not all Runge–Kutta methods are collocation methods. The idea is to
choose a finite-dimensional space of candidate solutions (usually, polynomials
up to a certain degree) and a number of points in the domain (called collocation
points), and to select the solution which satisfies the given equation at the
collocation points. More precisely, we want to find a ν-degree polynomial p
such that p(t_n) = y_n and
p'(t_n + c_l h) = f(t_n + c_l h, p(t_n + c_l h)),    l = 1, . . . , ν,
where cl , l = 1, . . . , ν are the collocation points. This gives ν + 1 conditions,
which matches the ν + 1 parameters needed to specify a polynomial of degree
ν. The new estimate yn+1 is defined to be p(tn+1 ).
As an example pick the two collocation points c1 = 0 and c2 = 1 at the
beginning and the end of the interval [tn , tn+1 ]. The collocation conditions are
p(t_n) = y_n,
p'(t_n) = f(t_n, p(t_n)),
p'(t_n + h) = f(t_n + h, p(t_n + h)).

For these three conditions we need a polynomial of degree 2, which we write
in the form
p(t) = α(t − t_n)^2 + β(t − t_n) + γ.
The collocation conditions can be solved to give the coefficients

α = (1/(2h))[f(t_n + h, p(t_n + h)) − f(t_n, p(t_n))],
β = f(t_n, p(t_n)),
γ = y_n.

Putting these coefficients back into the definition of p and evaluating it at
t = t_{n+1} gives the method

y_{n+1} = p(t_n + h) = y_n + (h/2)(f(t_n + h, p(t_n + h)) + f(t_n, p(t_n)))
        = y_n + (h/2)(f(t_n + h, y_{n+1}) + f(t_n, y_n)),

and we have recovered the trapezoidal rule.

Theorem 6.9 (Guillou and Soulé). Let the collocation points c_l, l = 1, . . . , ν,
be distinct. Let w(t) be the ν-degree polynomial defined by w(t) := ∏_{l=1}^{ν} (t − c_l)
and let w_l(t) be the (ν − 1)-degree polynomial defined by w_l(t) := w(t)/(t − c_l),
l = 1, . . . , ν. The collocation method is then identical to the ν-stage Runge–Kutta
method with coefficients

a_{k,l} = (1/w_l(c_l)) ∫_0^{c_k} w_l(τ) dτ,    b_l = (1/w_l(c_l)) ∫_0^1 w_l(τ) dτ,    k, l = 1, . . . , ν.

Proof. The polynomial p' is the (ν − 1)th degree Lagrange interpolation
polynomial. Define the lth Lagrange cardinal polynomial of degree ν − 1 as

L_l(t) = w_l(t)/w_l(c_l).

Thus L_l(c_l) = 1 and L_l(c_i) = 0 for all i ≠ l. We then have

p'(t) = ∑_{l=1}^{ν} L_l((t − t_n)/h) f(t_n + c_l h, p(t_n + c_l h))
      = ∑_{l=1}^{ν} (w_l((t − t_n)/h)/w_l(c_l)) f(t_n + c_l h, p(t_n + c_l h)),

and by integration we obtain

p(t) = p(t_n) + ∫_{t_n}^{t} p'(τ̃) dτ̃
     = p(t_n) + ∑_{l=1}^{ν} f(t_n + c_l h, p(t_n + c_l h)) ∫_{t_n}^{t} (w_l((τ̃ − t_n)/h)/w_l(c_l)) dτ̃
     = y_n + h ∑_{l=1}^{ν} f(t_n + c_l h, p(t_n + c_l h)) ∫_0^{(t−t_n)/h} (w_l(τ)/w_l(c_l)) dτ.

Now we have

p(t_n + c_k h) = y_n + h ∑_{l=1}^{ν} f(t_n + c_l h, p(t_n + c_l h)) (1/w_l(c_l)) ∫_0^{c_k} w_l(τ) dτ
               = y_n + h ∑_{l=1}^{ν} a_{k,l} f(t_n + c_l h, p(t_n + c_l h)).

This and defining k_l = f(t_n + c_l h, p(t_n + c_l h)) gives the intermediate stages
of the Runge–Kutta method. Additionally we have

y_{n+1} = p(t_n + h) = y_n + h ∑_{l=1}^{ν} k_l (1/w_l(c_l)) ∫_0^1 w_l(τ) dτ = p(t_n) + h ∑_{l=1}^{ν} b_l k_l.

This and the definition of the Runge–Kutta method proves the theorem.

Collocation methods have the advantage that we obtain a continuous ap-


proximation to the solution y(t) in each of the intervals [tn , tn+1 ]. Different
choices of collocation points lead to different Runge–Kutta methods. For this
we have to look more closely at the concept of numerical quadrature. In terms
of order the optimal choice is to let the collocation points be the roots of a
shifted Legendre polynomial . The Legendre polynomials ps , s = 0, 1, 2, . . ., are
defined on the interval [−1, 1] by
p_s(x) = (1/(2^s s!)) d^s/dx^s [(x^2 − 1)^s].

Thus p_s is an s-degree polynomial. Shifting the Legendre polynomials from
[−1, 1] to [0, 1] by x ↦ 2x − 1 leads to

p̃_s(x) = p_s(2x − 1) = (1/s!) d^s/dx^s [x^s (x − 1)^s]
and using its roots as collocation points gives the Gauss–Legendre Runge–
Kutta methods which are of order 2s, if the roots of ps are employed. For
s = 1, 2, and 3, the collocation points in this case are
• c_1 = 1/2, order 2, with tableau

  1/2 | 1/2
      | 1

  which is the implicit midpoint rule,

• c_1 = 1/2 − √3/6, c_2 = 1/2 + √3/6, order 4, with tableau

  1/2 − √3/6 | 1/4           1/4 − √3/6
  1/2 + √3/6 | 1/4 + √3/6    1/4
             | 1/2           1/2

• c_1 = 1/2 − √15/10, c_2 = 1/2, c_3 = 1/2 + √15/10, order 6, with tableau

  1/2 − √15/10 | 5/36            2/9 − √15/15    5/36 − √15/30
  1/2          | 5/36 + √15/24   2/9             5/36 − √15/24
  1/2 + √15/10 | 5/36 + √15/30   2/9 + √15/15    5/36
               | 5/18            4/9             5/18

Further methods were developed by considering the roots of Lobatto and


Radau polynomials. Butcher introduced a classification of type I, II, or III
which still survives in the naming conventions.
I      d^{s−1}/dx^{s−1} [x^s (x − 1)^{s−1}]        left Radau

II     d^{s−1}/dx^{s−1} [x^{s−1} (x − 1)^s]        right Radau

III    d^{s−2}/dx^{s−2} [x^{s−1} (x − 1)^{s−1}]    Lobatto
The methods are of order 2s − 1 in the Radau case and of order 2s − 2 in
the Lobatto case. Note that the Radau I methods have 0 as one of their
collocation points, while the Radau II method has 1 as one of its collocation
points. This means that for Radau I the first row in the tableau always consists
entirely of zeros while for Radau II the last row is identical to the vector of
weights. The 2-stage methods are given in Figure 6.14. The letters correspond
to certain conditions imposed on A which are, however, beyond the scope of this
book (for further information see [3] J. C. Butcher, The Numerical Analysis of
Ordinary Differential Equations: Runge–Kutta and General Linear Methods).

Radau I:                                  Lobatto III(A):
0   | 0     0                             0 | 0     0
2/3 | 1/3   1/3                           1 | 1/2   1/2
    | 1/4   3/4                             | 1/2   1/2

Radau IA:                                 Lobatto IIIB:
0   | 1/4   −1/4                          0 | 1/2   0
2/3 | 1/4   5/12                          1 | 1/2   0
    | 1/4   3/4                             | 1/2   1/2

Radau II(A):                              Lobatto IIIC:
1/3 | 5/12  −1/12                         0 | 1/2   −1/2
1   | 3/4   1/4                           1 | 1/2   1/2
    | 3/4   1/4                             | 1/2   1/2

Figure 6.14 The 2-stage Radau and Lobatto methods

For the three stages we have the Radau methods as specified in Figure
6.15 and the Lobatto methods as in Figure 6.16.
Next we examine the stability domain of Runge–Kutta methods by considering
the test equation y' = λy = f(t, y). Firstly, we get for the internal
stages the relations

k_i = f(t_n + c_i h, y_n + h(a_{i,1} k_1 + a_{i,2} k_2 + · · · + a_{i,ν} k_ν))
    = λ[y_n + h(a_{i,1} k_1 + a_{i,2} k_2 + · · · + a_{i,ν} k_ν)].

Defining k = (k_1, . . . , k_ν)^T as the vector of stages and the ν × ν matrix
A = (a_{i,j}), and denoting (1, . . . , 1)^T by 1, this can be rewritten in matrix form as

k = λ(y_n 1 + hAk).


Radau I:
0           | 0                 0                  0
(6−√6)/10   | (9+√6)/75         (24+√6)/120        (168−73√6)/600
(6+√6)/10   | (9−√6)/75         (168+73√6)/600     (24−√6)/120
            | 1/9               (16+√6)/36         (16−√6)/36

Radau IA:
0           | 1/9               (−1−√6)/18         (−1+√6)/18
(6−√6)/10   | 1/9               (88+7√6)/360       (88−43√6)/360
(6+√6)/10   | 1/9               (88+43√6)/360      (88−7√6)/360
            | 1/9               (16+√6)/36         (16−√6)/36

Radau II(A):
(4−√6)/10   | (88−7√6)/360      (296−169√6)/1800   (−2+3√6)/225
(4+√6)/10   | (296+169√6)/1800  (88+7√6)/360       (−2−3√6)/225
1           | (16−√6)/36        (16+√6)/36         1/9
            | (16−√6)/36        (16+√6)/36         1/9

Figure 6.15 The 3-stage Radau methods

Solving for k,

k = λ y_n (I − hλA)^{−1} 1.

Further, we have

y_{n+1} = y_n + h ∑_{l=1}^{ν} b_l k_l = y_n + h b^T k,

if we define b = (b_1, . . . , b_ν)^T. Using the above equation for k, we obtain

y_{n+1} = y_n (1 + hλ b^T (I − hλA)^{−1} 1).

Since ∑_{l=1}^{ν} b_l = b^T 1 = 1, this can be rewritten as

y_{n+1} = y_n b^T [I + hλ (I − hλA)^{−1}] 1
        = y_n b^T (I − hλA)^{−1} [I − hλ(A − I)] 1
        = y_n (1/det(I − hλA)) b^T adj(I − hλA) [I − hλ(A − I)] 1.

Lobatto III(A):
0   | 0      0      0
1/2 | 5/24   1/3    −1/24
1   | 1/6    2/3    1/6
    | 1/6    2/3    1/6

Lobatto IIIB:
0   | 1/6    −1/6   0
1/2 | 1/6    1/3    0
1   | 1/6    5/6    0
    | 1/6    2/3    1/6

Lobatto IIIC:
0   | 1/6    −1/3   1/6
1/2 | 1/6    5/12   −1/12
1   | 1/6    2/3    1/6
    | 1/6    2/3    1/6

Figure 6.16 The 3-stage Lobatto methods

Now det(I − hλA) is a polynomial in hλ of degree ν, each entry in the adjoint


matrix of I − hλA is a polynomial in hλ of degree ν − 1, and each entry
in I − hλ(A − I) is a polynomial in hλ of degree 1. So we can deduce that
yn+1 = R(hλ)yn , where R = P/Q is a rational function with P and Q being
polynomials of degree ν. R is called the stability function of the Runge–Kutta
method. For explicit methods Q ≡ 1, since the matrix A is strictly lower
triangular and thus the inverse of I − hλA can be easily calculated without
involving the determinant and the adjoint matrix.
By induction we deduce yn = [R(hλ)]n y0 and we deduce that the stability
domain is given by
D = {z ∈ C : |R(z)| < 1}.
For A-stability it is therefore necessary that |R(z)| < 1 for every z ∈ C− . For
explicit methods we have Q ≡ 1 and the stability domain of an explicit method
is a bounded set. We have seen the stability regions for various Runge–Kutta
methods in Figures 6.7, 6.8, 6.9, and 6.17.
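
Stability domains like those in the figures can be drawn by evaluating |R(z)| on a grid and plotting the level curve |R(z)| = 1. The fragment below (an illustrative sketch, not one of the book's listings) does this for the stability function of the classical fourth-order method, for which R(z) = 1 + z + z^2/2 + z^3/6 + z^4/24.

% Plot the boundary of the stability domain {z : |R(z)| < 1} for the
% classical fourth-order Runge-Kutta method (explicit, so Q = 1).
[x, y] = meshgrid(-4:0.01:2, -4:0.01:4);
z = x + 1i*y;
R = 1 + z + z.^2/2 + z.^3/6 + z.^4/24;
contour(x, y, abs(R), [1 1], 'k');     % the level curve |R(z)| = 1
xlabel('Re z'); ylabel('Im z'); axis equal;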

To illustrate we consider the 3rd-order method given by


k_1 = f(t_n, y_n + (1/4)h(k_1 − k_2)),
k_2 = f(t_n + (2/3)h, y_n + (1/12)h(3k_1 + 5k_2)),            (6.20)
y_{n+1} = y_n + (1/4)h(k_1 + 3k_2).
Applying it to y' = λy, we have

h k_1 = hλ(y_n + (1/4)h k_1 − (1/4)h k_2),
h k_2 = hλ(y_n + (1/4)h k_1 + (5/12)h k_2).

This is a linear system which has the solution

( h k_1 )   ( 1 − (1/4)hλ    (1/4)hλ      )^{−1} ( hλ y_n )
( h k_2 ) = ( −(1/4)hλ       1 − (5/12)hλ )      ( hλ y_n )

          = (hλ y_n / (1 − (2/3)hλ + (1/6)(hλ)^2)) ( 1 − (2/3)hλ )
                                                   ( 1           ),

therefore

y_{n+1} = y_n + (1/4)h k_1 + (3/4)h k_2 = ((1 + (1/3)hλ) / (1 − (2/3)hλ + (1/6)(hλ)^2)) y_n.

Hence the stability function is given by

R(z) = (1 + (1/3)z) / (1 − (2/3)z + (1/6)z^2).

Figure 6.17 illustrates the stability region given by this stability function.
We prove A-stability by the following technique. According to the maximum
modulus principle, if g is analytic in the closed complex domain V, then
|g| attains its maximum on the boundary ∂V. We let g = R. This is a rational
function, hence its only singularities are the poles 2 ± i√2, which are the roots
of the denominator, and g is analytic in V = cl C^− = {z ∈ C : Re z ≤ 0}.
Therefore it attains its maximum on ∂V = iR and the following statements
are equivalent:

A-stability  ⇔  |R(z)| < 1, z ∈ C^−  ⇔  |R(it)| ≤ 1, t ∈ R.

In this example we have

|R(it)| ≤ 1  ⇔  |1 − (2/3)it − (1/6)t^2|^2 − |1 + (1/3)it|^2 ≥ 0.

Figure 6.17 Stability domain of the method given in (6.20). The insta-
bility region is white.

But |1 − (2/3)it − (1/6)t^2|^2 − |1 + (1/3)it|^2 = (1 − (1/6)t^2)^2 + ((2/3)t)^2 − 1 − ((1/3)t)^2 = (1/36)t^4 ≥ 0,
and it follows that the method is A-stable.
As another example we consider the 2-stage Gauss–Legendre method given by

k_1 = f(t_n + (1/2 − √3/6)h, y_n + (1/4)h k_1 + (1/4 − √3/6)h k_2),
k_2 = f(t_n + (1/2 + √3/6)h, y_n + (1/4 + √3/6)h k_1 + (1/4)h k_2),
y_{n+1} = y_n + (1/2)h(k_1 + k_2).

It is possible to prove that it is of order 4. [You can do this for y' = f(y)
by expansion, but it becomes messy for y' = f(t, y).] It can easily be verified
that for y' = λy we have y_n = [R(hλ)]^n y_0, where
R(z) = (1 + (1/2)z + (1/12)z^2)/(1 − (1/2)z + (1/12)z^2). The poles of R are at
3 ± i√3, and thus lie in C^+. In addition, |R(it)| ≡ 1, t ∈ R, because the
denominator is the complex conjugate of the numerator. We can again use the
maximum modulus principle to argue that D ⊇ C^− and the 2-stage
Gauss–Legendre method is A-stable, indeed D = C^−.
Unlike multistep methods, implicit high-order Runge–Kutta methods may be
A-stable, which makes them particularly suitable for stiff problems.
In the following we will consider the order of Runge–Kutta methods. As

an example we consider the 2-stage method given by


k_1 = f(t_n, y_n + (1/4)h(k_1 − k_2)),
k_2 = f(t_n + (2/3)h, y_n + (1/12)h(3k_1 + 5k_2)),
y_{n+1} = y_n + (1/4)h(k_1 + 3k_2).

In order to analyze the order of this method, we restrict our attention to scalar,
autonomous equations of the form y' = f(y). (This procedure simplifies the
process but might lead to loss of generality for methods of order ≥ 5.) For
brevity, we use the convention that all functions are evaluated at y = y_n, e.g.,
f_y = df(y_n)/dy. Thus,

k_1 = f + (1/4)h(k_1 − k_2) f_y + (1/32)h^2 (k_1 − k_2)^2 f_yy + O(h^3),
k_2 = f + (1/12)h(3k_1 + 5k_2) f_y + (1/288)h^2 (3k_1 + 5k_2)^2 f_yy + O(h^3).

We see that both k_1 and k_2 equal f + O(h), and substituting this in the above
equations yields k_1 = f + O(h^2) (because of the factor (k_1 − k_2) in the second
term) and k_2 = f + (2/3)h f_y f + O(h^2). Substituting this result again, we obtain

k_1 = f − (1/6)h^2 f_y^2 f + O(h^3),
k_2 = f + (2/3)h f_y f + h^2 ((5/18) f_y^2 f + (2/9) f_yy f^2) + O(h^3).

Inserting these results into the definition of y_{n+1} gives

y_{n+1} = y + hf + (1/2)h^2 f_y f + (1/6)h^3 (f_y^2 f + f_yy f^2) + O(h^4).

But y' = f ⇒ y'' = f_y f ⇒ y''' = f_y^2 f + f_yy f^2, and we deduce from Taylor's
theorem that the method is at least of order 3. (By applying it to the equation
y' = λy, it is easy to verify that it is not of order 4.)
Exercise 6.14. The following four-stage Runge–Kutta method has order four,

k_1 = f(t_n, y_n),
k_2 = f(t_n + (1/3)h, y_n + (1/3)h k_1),
k_3 = f(t_n + (2/3)h, y_n − (1/3)h k_1 + h k_2),
k_4 = f(t_n + h, y_n + h k_1 − h k_2 + h k_3),
y_{n+1} = y_n + (1/8)h(k_1 + 3k_2 + 3k_3 + k_4).
By applying it to the equation y 0 = y, show that the order is at most four.

Then, for scalar functions, prove that the order is at least four in the easy
case when f is independent of y, and that the order is at least three in the
relatively easy case when f is independent of t. (Thus you are not expected to
do Taylor expansions when f depends on both y and t.)

Any Runge–Kutta method can be analyzed using Taylor series expansions.


The following lemma is helpful.
Lemma 6.2. Every non-autonomous system dependent on t can be transformed
into an autonomous system independent of t. If ∑_{i=1}^{ν} b_i = 1 and
c_i = ∑_{j=1}^{ν} a_{ij}, and if the Runge–Kutta method gives a unique approximation,
then this equivalence of autonomous and non-autonomous systems is
preserved by this method.
Proof. Set

τ := t,    τ_0 := t_0,
z := (t, y)^T,
g(z) := (1, f(t, y))^T = (1, f(z))^T,
z(τ_0) = (τ_0, y(τ_0))^T = (t_0, y_0)^T = z_0.

Note that dt/dτ = 1. Then

dz/dτ = (dt/dτ, dy/dτ)^T = (1, dy/dt)^T = (1, f(t, y))^T = g(z).

A solution to this system gives a solution for (6.1) by removing the first
component.
Suppose z_n = (t_n, y_n)^T. Now let l_j = (1, k_j)^T. Then

l_i = g(z_n + h ∑_{j=1}^{ν} a_{i,j} l_j)
    = g(t_n + h ∑_{j=1}^{ν} a_{i,j}, y_n + h ∑_{j=1}^{ν} a_{i,j} k_j)
    = g(t_n + h c_i, y_n + h ∑_{j=1}^{ν} a_{i,j} k_j)
    = (1, f(t_n + h c_i, y_n + h ∑_{j=1}^{ν} a_{i,j} k_j))^T = (1, k_i)^T,

since we assumed that the Runge–Kutta method gives a unique solution.



Additionally we have

z_{n+1} = z_n + h ∑_{i=1}^{ν} b_i l_i = (t_n, y_n)^T + h ∑_{i=1}^{ν} b_i (1, k_i)^T
        = (t_n + h ∑_{i=1}^{ν} b_i, y_n + h ∑_{i=1}^{ν} b_i k_i)^T
        = (t_{n+1}, y_n + h ∑_{i=1}^{ν} b_i k_i)^T,

since ∑_{i=1}^{ν} b_i = 1 and t_{n+1} = t_n + h.
For simplification we restrict ourselves to one-dimensional, autonomous
systems in the following analysis. The Taylor expansion will produce terms
of the form y' = f, y'' = f_y f, y''' = f_yy f^2 + f_y(f_y f) and y^{(4)} = f_yyy f^3 +
4 f_yy f_y f^2 + f_y^3 f. The expressions f, f_y f, f_y^2 f, f_yy f^2, etc., are called elementary differentials.
Every derivative of y is a linear combination with positive integer coefficients
of elementary differentials.
A convenient way to represent elementary differentials is by rooted trees.
For example, f is represented by the single-vertex tree T_0, while f_y f is
represented by the two-vertex tree T_1, whose root f_y has a single child f. For
the third derivative we have the trees T_2 and T_3: in T_2 the root f_yy has two
children f, corresponding to f_yy f^2, while T_3 is the chain f_y → f_y → f,
corresponding to f_y(f_y f).

With the fourth derivative of y it becomes more interesting, since the elementary
differential f_yy f_y f^2 arises in the differentiation of the first term as
well as the second term of y'''. This corresponds to two different trees, and
the distinction is important since in several variables these correspond to
differentiation matrices which do not commute. In Figure 6.18 we see the two
different trees for the same elementary differential: the four fourth-order trees
are T_4 (corresponding to f_yyy f^3), T_5 and T_6 (both corresponding to
f_yy f_y f^2), and T_7 (corresponding to f_y^3 f).

Figure 6.18 Fourth order elementary differentials

The correspondence to Runge–Kutta methods is created by labeling the
vertices in the tree with indices, say j and k. Every edge from j to k is
associated with a factor a_{j,k}, and the root, labeled say i, is associated with a
factor b_i; summing over all indices from 1 to ν then gives, for every tree T, an
expression Φ(T). The identity ∑_{k=1}^{ν} a_{j,k} = c_j can be used to simplify Φ(T).

For the trees we have considered so far, we have

Φ(T_0) = ∑_{i=1}^{ν} b_i,
Φ(T_1) = ∑_{i,j=1}^{ν} b_i a_{i,j} = ∑_{i=1}^{ν} b_i c_i,
Φ(T_2) = ∑_{i,j,k=1}^{ν} b_i a_{i,j} a_{i,k} = ∑_{i=1}^{ν} b_i c_i^2,
Φ(T_3) = ∑_{i,j,k=1}^{ν} b_i a_{i,j} a_{j,k} = ∑_{i,j=1}^{ν} b_i a_{i,j} c_j,
Φ(T_4) = ∑_{i,j,k,l=1}^{ν} b_i a_{i,j} a_{i,k} a_{i,l} = ∑_{i=1}^{ν} b_i c_i^3,
Φ(T_5) = ∑_{i,j,k,l=1}^{ν} b_i a_{i,j} a_{j,k} a_{i,l} = ∑_{i,j=1}^{ν} b_i c_i a_{i,j} c_j,
Φ(T_6) = ∑_{i,j,k,l=1}^{ν} b_i a_{i,j} a_{j,k} a_{j,l} = ∑_{i,j=1}^{ν} b_i a_{i,j} c_j^2,
Φ(T_7) = ∑_{i,j,k,l=1}^{ν} b_i a_{i,j} a_{j,k} a_{k,l} = ∑_{i,j,k=1}^{ν} b_i a_{i,j} a_{j,k} c_k.

Φ(T ) together with the usual Taylor series coefficients give the coefficients
of the elementary differentials when expanding the Runge–Kutta method. In
order for the Taylor expansion of the true solution and the Taylor expansion of
the Runge–Kutta method to match up to the hp terms, we need the coefficients
of the elementary differentials to be the same. This implies for all trees T with
p vertices or less,
Φ(T) = 1/γ(T),
where γ(T ) is the density of the tree which is defined to be the product of the
number of vertices of T with the number of vertices of all possible trees after

successively removing roots. This gives the order conditions.


∑_{i=1}^{ν} b_i = 1,
∑_{i=1}^{ν} b_i c_i = 1/2,
∑_{i=1}^{ν} b_i c_i^2 = 1/3,
∑_{i,j=1}^{ν} b_i a_{i,j} c_j = 1/6,
∑_{i=1}^{ν} b_i c_i^3 = 1/4,
∑_{i,j=1}^{ν} b_i c_i a_{i,j} c_j = 1/8,
∑_{i,j=1}^{ν} b_i a_{i,j} c_j^2 = 1/12,
∑_{i,j,k=1}^{ν} b_i a_{i,j} a_{j,k} c_k = 1/24.

The number of order conditions explodes with the number of stages.

order p               | 1   2   3   4   5    6    7    8    ···   12
number of conditions  | 1   2   4   8   17   37   85   200  ···   7813
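
As a sanity check, the eight conditions for order 4 can be verified numerically for a given tableau; the MATLAB fragment below (illustrative, not one of the book's listings) does this for the classical fourth-order method.

% Check the eight order-4 conditions for the classical Runge-Kutta method.
A = [0 0 0 0; 1/2 0 0 0; 0 1/2 0 0; 0 0 1 0];
b = [1/6; 1/3; 1/3; 1/6];
c = A*ones(4, 1);                       % consistency: c_i = sum_j a_ij
lhs = [sum(b), b'*c, b'*c.^2, b'*(A*c), b'*c.^3, ...
       b'*(c.*(A*c)), b'*(A*c.^2), b'*(A*(A*c))];
rhs = [1, 1/2, 1/3, 1/6, 1/4, 1/8, 1/12, 1/24];
disp(max(abs(lhs - rhs)))               % should be zero up to rounding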

6.10 Revision Exercises


Exercise 6.15. Consider the scalar ordinary differential equation y' = f(y), that is,
f is independent of t. It is solved by the following Runge–Kutta method,

k_1 = f(y_n),
k_2 = f(y_n + (1 − α)h k_1 + αh k_2),
y_{n+1} = y_n + (h/2)(k_1 + k_2),
where α is a real parameter.

(a) Express the first, second, and third derivative of y in terms of f .


(b) Perform the Taylor expansion of y(tn+1 ) using the expressions found in
the previous part and explain what it means for the method to be of order
p.

(c) Determine p for the given Runge–Kutta method.



(d) Define A-stability, stating explicitly the linear test equation.


(e) Suppose the Runge–Kutta method is applied to the linear test equation.
Show that then
yn+1 = R(hλ)yn
and derive R(hλ) explicitly.
(f ) Show that the method is A-stable if and only if α = 1/2.

Exercise 6.16. Consider the multistep method for numerical solution of the
differential equation y' = f(t, y):

∑_{l=0}^{s} ρ_l y_{n+l} = h ∑_{l=0}^{s} σ_l f(t_{n+l}, y_{n+l}),    n = 0, 1, . . . .

(a) Give the definition of the order of the method.


(b) State the test equation, definition of stability region, and A-stability in
general.
(c) Using Taylor expansions, show that the method is of order p if
∑_{l=0}^{s} ρ_l = 0,    ∑_{l=0}^{s} l^k ρ_l = k ∑_{l=0}^{s} l^{k−1} σ_l,    k = 1, 2, . . . , p,

(d) State the Dahlquist equivalence theorem.


(e) Determine the order of the two-step method
y_{n+2} − (1+α)y_{n+1} + αy_n = h[(1/12)(5+α)f_{n+2} + (2/3)(1−α)f_{n+1} − (1/12)(1+5α)f_n]
for different choices of α.
(f ) For which α is the method convergent?

Exercise 6.17. Consider the multistep method for numerical solution of the
differential equation y' = f(t, y):

∑_{l=0}^{s} ρ_l y_{n+l} = h ∑_{l=0}^{s} σ_l f(t_{n+l}, y_{n+l}),    n = 0, 1, . . . .

(a) Describe in general what it means that a method is of order p.


(b) Define generally the convergence of a method.
(c) Define the stability region and A-stability in general.
(d) Describe how to determine the stability region of the multistep method.

(e) Show that the method is of order p if


∑_{l=0}^{s} ρ_l = 0,    ∑_{l=0}^{s} l^k ρ_l = k ∑_{l=0}^{s} l^{k−1} σ_l,    k = 1, 2, . . . , p,

(f ) Give the conditions on ρ(w) = ∑_{l=0}^{s} ρ_l w^l that ensure convergence.
(g) Hence determine for what values of θ and σ0 , σ1 , σ2 the two-step method

yn+2 −(1−θ)yn+1 −θyn = h[σ0 f (tn , yn )+σ1 f (tn+1 , yn+1 )+σ2 f (tn+2 , yn+2 )]

is convergent and of order 3.

Exercise 6.18. We approximate the solution of the ordinary differential equa-


tion
∂y/∂t = y' = f(t, y),    t ≥ 0,

by the s-step method

∑_{l=0}^{s} ρ_l y_{n+l} = h ∑_{l=0}^{s} σ_l f(t_{n+l}, y_{n+l}),    n = 0, 1, . . . ,        (6.21)

assuming that yn , yn+1 , . . . , yn+s−1 are available. The following complex poly-
nomials are defined:
ρ(w) = ∑_{l=0}^{s} ρ_l w^l,    σ(w) = ∑_{l=0}^{s} σ_l w^l.

(a) When is the method given by (6.21) explicit and when implicit?
(b) Derive a condition involving the polynomials ρ and σ which is equivalent
to the s-step method given in (6.21) being of order p.
(c) Define another polynomial and state (no proof required) a condition for
the method in (6.21) to be A-stable.
(d) Describe the boundary locus method to find the boundary of the stability
domain for the method given in (6.21).

(e) What is ρ for the Adams methods and what is the difference between
Adams–Bashforth and Adams–Moulton methods?
(f ) Let s = 1. Derive the Adams–Moulton method of the form

yn+1 = yn + h (Af (tn+1 , yn+1 ) + Bf (tn , yn )) .

(g) Is the one-step Adams–Moulton method A-stable? Prove your answer.



Exercise 6.19. We consider the autonomous scalar differential equation


(d/dt) y(t) = y'(t) = f(y(t)),    y(0) = y_0.
Note that f is independent of t.

(a) Express the second and third derivative of y in terms of f and its deriva-
tives. Write the Taylor expansion of y(t + h) in terms of f and its deriva-
tives up to O(h4 ).
(b) The differential equation is solved by the Runge–Kutta scheme

k1 = hf (yn ),
k2 = hf (yn + k1 ),
y_{n+1} = y_n + (1/2)(k_1 + k_2).

Show that the scheme is of order 2.


(c) Define the linear stability domain and A-stability for a general numerical
method, stating explicitly the linear test equation on which the definitions
are based.

(d) Apply the Runge–Kutta scheme given in (b) to the linear test equation
from part (c) and find an expression for the linear stability domain of the
method. Is the method A-stable?
(e) We now modify the Runge–Kutta scheme in the following way

k1 = hf (yn ),
k2 = hf (yn + a(k1 + k2 )),
y_{n+1} = y_n + (1/2)(k_1 + k_2),

where a ∈ R. Apply it to the test equation and find a rational function R


such that yn+1 = R(hλ)yn .
(f ) Explain the maximum modulus principle and use it to find the values of a
such that the method given in (e) is A-stable.
CHAPTER 7

Numerical Differentiation

Numerical differentiation is an ill-conditioned problem in the sense that small


perturbations in the input can lead to significant differences in the outcome. It
is, however, important, since many problems require approximations to deriva-
tives. It introduces the concept of discretization error , which is the error oc-
curring when a continuous function is approximated by a set of discrete values.
This is also often known as the local truncation error . It is different from the
rounding error , which is the error due to the inherent nature of floating point
representation. For a given function f : R → R the derivative f'(x) is
defined as

f'(x) = lim_{h→0} (f(x + h) − f(x))/h.

Thus f'(x) can be estimated by choosing a small h and letting

f'(x) ≈ (f(x + h) − f(x))/h.
This approximation is generally called a difference quotient. The important
question is how should h be chosen.
Suppose f (x) can be differentiated at least three times, then from Taylor’s
theorem we can write
f(x + h) = f(x) + h f'(x) + (h^2/2) f''(x) + O(h^3).

After rearranging, we have

f'(x) = (f(x + h) − f(x))/h − (h/2) f''(x) + O(h^2).

The first term is the approximation to the derivative. Thus the absolute value
of the discretization error or local truncation error is approximately (h/2)|f''(x)|.

We now turn to the rounding error. The difference quotient uses the float-
ing point representations
f(x + h)* = f(x + h) + ε_{x+h},
f(x)* = f(x) + ε_x.

Thus the representation of the approximation is given by

(f(x + h) − f(x))/h + (ε_{x+h} − ε_x)/h,

where we assume for simplicity that the difference and division have been calculated
exactly. If f(x) can be evaluated with a relative error of approximately
macheps, we can assume that

|ε_{x+h} − ε_x| ≤ macheps (|f(x)| + |f(x + h)|).

Thus the rounding error is at most macheps (|f(x)| + |f(x + h)|)/h.
The main point to note is that as h decreases, the discretization error
decreases, but the rounding error increases, since we are dividing by h. The
ideal choice of h is the one which minimizes the total error. This is the case
where the absolute values of the discretization error and the rounding error
become the same
(h/2)|f''(x)| = macheps (|f(x)| + |f(x + h)|)/h.
However, this involves unknown quantities. Assuming that (1/2)|f''(x)| and
|f(x)| + |f(x + h)| are of order O(1), a good choice for h would satisfy h^2 ≈ macheps,
or in other words h ≈ √macheps. The total absolute error in the approximation
is then of order O(√macheps). However, since we assumed that
|f(x)| = O(1) in the above analysis, a more realistic choice for h would be
h ≈ √(macheps |f(x)|). This, however, does not deal with the assumption on
|f''(x)|.
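
The trade-off between discretization and rounding error is easy to observe numerically. The MATLAB fragment below (an illustrative sketch, not one of the book's listings) evaluates the forward difference quotient of exp at x = 1 for a range of step sizes; the total error is smallest near h ≈ √macheps, i.e., near the square root of MATLAB's eps.

% Error of the forward difference quotient of f(x) = exp(x) at x = 1.
x = 1; h = 10.^(-(1:16));
approx = (exp(x + h) - exp(x))./h;
err = abs(approx - exp(x));            % the exact derivative is exp(1)
loglog(h, err, 'o-'); xlabel('h'); ylabel('absolute error');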
Exercise 7.1. List the assumptions made in the analysis and give an example
where at least one of these assumptions does not hold. What does this mean
in practice for the approximation of derivatives?

7.1 Finite Differences


In general, finite differences are mathematical expressions of the form f (x +
b) − f (x + a). If a finite difference is divided by b − a one gets a difference
quotient. As we have seen above, these can be used to approximate derivatives.
The main basic operators of finite differences are
forward difference operator ∆+ f (x) = f (x + h) − f (x),
backward difference operator ∆− f (x) = f (x) − f (x − h),
central difference operator δf(x) = f(x + (1/2)h) − f(x − (1/2)h).

Additionally, we can define


shift operator E f(x) = f(x + h),
averaging operator µ_0 f(x) = (1/2)(f(x + (1/2)h) + f(x − (1/2)h)),
differential operator D f(x) = (d/dx) f(x) = f'(x).

All operators commute and can be expressed in terms of each other. We can
easily see that

∆_+ = E − I,
∆_− = I − E^{−1},
δ = E^{1/2} − E^{−1/2},
µ_0 = (1/2)(E^{1/2} + E^{−1/2}).
Most importantly, it follows from Taylor expansion that

E = e^{hD}

and thus

(1/h)∆_+ = (1/h)(e^{hD} − I) = D + O(h),
(1/h)∆_− = (1/h)(I − e^{−hD}) = D + O(h).
It follows that the forward and backward difference operators approximate
the differential operator, or in other words the first derivative with an error
of O(h).
Both the averaging and the central difference operator are not well-defined
on a grid. However, we have
δ^2 f(x) = f(x + h) − 2f(x) + f(x − h),
δµ_0 f(x) = (1/2)(f(x + h) − f(x − h)).

Now (1/h^2)δ^2 approximates the second derivative with an error of O(h^2), because

(1/h^2)δ^2 = (1/h^2)(e^{hD} − 2I + e^{−hD}) = D^2 + O(h^2).

On the other hand, we have

(1/h)δµ_0 = (1/(2h))(e^{hD} − e^{−hD}) = D + O(h^2).

Hence the combination of central difference and averaging operator gives a
better approximation to the first derivative. We can also achieve higher accuracy
by using the sum of the forward and backward differences

(1/(2h))(∆_+ + ∆_−) = (1/(2h))(e^{hD} − I + I − e^{−hD}) = D + O(h^2).
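
These orders are easily confirmed numerically; the fragment below (illustrative, not one of the book's listings) compares the forward difference and the averaged central difference (f(x + h) − f(x − h))/(2h) for f = sin at x = 1.

% Compare first-derivative approximations of f(x) = sin(x) at x = 1.
x = 1; h = 10.^(-(1:6));
fwd  = (sin(x + h) - sin(x))./h;           % forward difference, error O(h)
cent = (sin(x + h) - sin(x - h))./(2*h);   % averaged central difference, error O(h^2)
disp([abs(fwd - cos(x)); abs(cent - cos(x))])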

Figure 7.1 Inexact and incomplete data with a fitted curve

Higher-order derivatives can be approximated by applying the difference


operators several times. For example the nth order forward, backward, and
central difference operators are given by
∆_+^n f(x) = ∑_{i=0}^{n} \binom{n}{i} (e^{hD})^{n−i} (−I)^i f(x) = ∑_{i=0}^{n} (−1)^i \binom{n}{i} f(x + (n − i)h),

∆_−^n f(x) = ∑_{i=0}^{n} \binom{n}{i} I^{n−i} (−e^{−hD})^i f(x) = ∑_{i=0}^{n} (−1)^i \binom{n}{i} f(x − ih),

δ^n f(x) = ∑_{i=0}^{n} \binom{n}{i} (e^{hD/2})^{n−i} (−e^{−hD/2})^i f(x) = ∑_{i=0}^{n} (−1)^i \binom{n}{i} f(x + (n/2 − i)h).

For odd n the nth central difference is again not well-defined on a grid, but
this can be alleviated as before by combining one central difference with the
averaging operator. After dividing by h^n the nth order forward and backward
differences approximate the nth derivative with an error of O(h), while the nth
order central difference (if necessary combined with the averaging operator)
approximates the nth derivative with an error of O(h2 ).
Combination of higher-order differences can also be used to construct bet-
ter approximations. For example,
1 1 −1
(∆+ − ∆2+ )f (x) = (f (x + 2h) − 4f (x + h) + 3f (x))
h 2 2h
can be written with the shift and differential operator as
−1 2 −1 2hD
(E − 4E + 3I) = (e − 4ehD + 3I) = D + O(h2 ).
2h 2h
The drawback is that more grid points need to be employed. This is called
the bandwidth. Special schemes at the boundaries are then necessary.

7.2 Differentiation of Incomplete or Inexact Data


If the function f is only known at a finite number of points or if the func-
tion values are known to be inexact, it does not make sense to calculate the
derivatives from the data as it stands. A much better approach is to fit a curve
through the data taking into account the inexactness of the data which can
be done, for example, by least squares fitting. Figure 7.1 illustrates this. The
derivatives are then approximated by the derivatives of the fitted curve.
No precise statement of the accuracy of the resulting derivatives can be
made, since the accuracy of the fitted curve would need to be assessed first.
CHAPTER 8

PDEs

8.1 Classification of PDEs


A partial differential equation (PDE) is an equation of a function of several
variables and its partial derivatives. The partial derivatives of a scalar valued
function u : Rn → R are often abbreviated as
u_{x_i} = ∂u/∂x_i,    u_{x_i x_j} = ∂^2u/(∂x_i ∂x_j),    ...

The PDE has the structure

F (x1 , . . . , xn , u, ux1 , . . . , uxn , ux1 x1 , ux1 x2 , . . . , uxn xn , . . .) = 0. (8.1)

Notation can be compacted by using multi-indices α = (α1 , . . . , αn ) ∈ Nn0 ,


where N0 denotes the natural numbers including 0. The notation |α| is used
to denote the length α1 + · · · + αn . Then

D^α u = ∂^{|α|}u/∂x^α = ∂^{|α|}u/(∂x_1^{α_1} · · · ∂x_n^{α_n}).

Further, let D^k u = {D^α u : |α| = k}, the collection of all derivatives of degree
k. For k = 1, Du = (∂u/∂x_1, . . . , ∂u/∂x_n) is the gradient of u. For k = 2, D^2 u is the
Hessian matrix of second derivatives given by

( ∂^2u/∂x_1^2        ∂^2u/(∂x_1∂x_2)    · · ·   ∂^2u/(∂x_1∂x_n) )
( ∂^2u/(∂x_2∂x_1)    ∂^2u/∂x_2^2        · · ·   ∂^2u/(∂x_2∂x_n) )
(       ⋮                   ⋮            ⋱             ⋮         )
( ∂^2u/(∂x_n∂x_1)    ∂^2u/(∂x_n∂x_2)    · · ·   ∂^2u/∂x_n^2     )

Note that the Hessian matrix is symmetric, since it does not matter in which
order partial derivatives are taken. (The Hessian matrix of a scalar valued

function should not be confused with the Jacobian matrix of a vector valued
function which we encountered in the chapter on non-linear systems.) Then

F (x, u(x), Du(x), . . . , Dk u(x)) = 0,

describes a PDE of order k. We will only consider PDE of order 2 in this


chapter, but the principles extend to higher orders.
Definition 8.1 (Linear PDE). The PDE is called linear if it has the form
∑_{|α|≤k} c_α(x) D^α u(x) − f(x) = 0.        (8.2)

If f ≡ 0, the PDE is homogeneous, otherwise inhomogeneous.


Definition 8.2 (Semi-linear PDE). The PDE is called semi-linear if it has
the form
∑_{|α|=k} c_α(x) D^α u(x) + G(x, u(x), Du(x), . . . , D^{k−1}u(x)) = 0.

In other words, the PDE is linear with regards to the derivatives of highest
degree, but nonlinear for lower derivatives.
Definition 8.3 (Quasilinear PDE). The PDE is called quasilinear if it has
the form

∑_{|α|=k} c_α(x, u(x), Du(x), . . . , D^{k−1}u(x)) D^α u(x) + G(x, u(x), Du(x), . . . , D^{k−1}u(x)) = 0.
For further classification, we restrict ourselves to quasilinear PDEs of order
2, of the form
∑_{i,j=1}^{n} a_{ij} ∂^2u/(∂x_i ∂x_j) − f = 0,        (8.3)

where the coefficients aij and f are allowed to depend on x, u, and the gradient
of u. Without loss of generality we can assume that the matrix A = (aij ) is
symmetric, since otherwise we could rewrite the PDE according to

a_{ij} ∂^2u/(∂x_i ∂x_j) + a_{ji} ∂^2u/(∂x_j ∂x_i) = (a_{ij} + a_{ji}) ∂^2u/(∂x_i ∂x_j)
= (1/2)(a_{ij} + a_{ji}) ∂^2u/(∂x_i ∂x_j) + (1/2)(a_{ij} + a_{ji}) ∂^2u/(∂x_j ∂x_i),

and the matrix B = (b_{ij}) with coefficients b_{ij} = (1/2)(a_{ij} + a_{ji}) is symmetric.
Definition 8.4. Let λ1 (x), . . . , λn (x) ∈ R be the eigenvalues of the symmetric
coefficient matrix A = (aij ) of the PDE given by (8.3) at a point x ∈ Rn . The
PDE is

parabolic at x, if there exists j ∈ {1, . . . , n} for which λ_j(x) = 0,

elliptic at x, if λ_i(x) > 0 for all i = 1, . . . , n,

hyperbolic at x, if λ_j(x) > 0 for one j ∈ {1, . . . , n} and λ_i(x) < 0 for all i ≠ j, or
if λ_j(x) < 0 for one j ∈ {1, . . . , n} and λ_i(x) > 0 for all i ≠ j.
Exercise 8.1. Consider the PDE

auxx + 2buxy + cuyy = f,

where a > 0, b, c > 0, and f are functions of x, y, u, ux , and uy . At (x, y) the


PDE is
parabolic, if b2 − ac = 0,

elliptic, if b2 − ac < 0 and


hyperbolic, if b2 − ac > 0.
Show that this definition is equivalent to the above definition.

An example of a PDE which changes classification is the Euler–Tricomi


equation
uxx = xuyy ,
which is the simplest model of a transonic flow. It is hyperbolic in the half
plane x > 0, parabolic at x = 0, and elliptic in the half plane x < 0.

8.2 Parabolic PDEs


Partial differential equations of evolution are second-order equations with a
time component t where only t and the first derivative with regards to t
occur in the equation. Second derivatives with regards to t do not feature in
the equation and therefore have coefficient zero. Hence the eigenvalue with
regards to this component is zero and these are parabolic partial differential
equations. They describe a wide family of problems in science including heat
diffusion, ocean acoustic propagation, and stock option pricing.
In the following sections we focus on the heat equation to illustrate the
principles. For a function u(x, y, z, t) of three spatial variables (x, y, z) and
the time variable t, the heat equation is
∂u/∂t = α(∂^2u/∂x^2 + ∂^2u/∂y^2 + ∂^2u/∂z^2) = α∇^2 u,

where α is a positive constant known as the thermal diffusivity and where



∇2 denotes the Laplace operator (in three dimensions). For the mathematical
treatment it is sufficient to consider the case α = 1. We restrict ourselves
further and consider the one-dimensional case u(x, t) specified by
∂u/∂t = ∂^2u/∂x^2,        (8.4)
for 0 ≤ x ≤ 1 and t ≥ 0. Initial conditions u(x, 0) describe the state of the
system at the beginning and Dirichlet boundary conditions u(0, t) and u(1, t)
show how the system changes at the boundary over time. The most common
example is that of a metal rod heated at a point in the middle. After a long
enough time the rod will have constant temperature everywhere.

8.2.1 Finite Differences


We already encountered finite differences in the chapter on numerical dif-
ferentiation. They are an essential tool in the solution of PDEs. In the fol-
lowing we denote the time step by ∆t and the discretization in space by
∆x = 1/(M + 1); that is, we divide the interval [0, 1] into M + 1 subinter-
vals. Denote u(m∆x, n∆t) by unm . Then u0m is given by the initial conditions
u(m∆x, 0) and we have the known boundary conditions un0 = u(0, n∆t) and
unM +1 = u(1, n∆t). Consider the Taylor expansions
∂u(x, t) 1 ∂ 2 u(x, t)
u(x − ∆x, t) = u(x, t) − ∆x + (∆x)2
∂x 2 ∂x2
3
1 ∂ u(x, t)
− (∆x)3 + O((∆x)4 ),
6 ∂x3
∂u(x, t) 1 ∂ 2 u(x, t)
u(x + ∆x, t) = u(x, t) + ∆x + (∆x)2
∂x 2 ∂x2
3
1 ∂ u(x, t)
+ (∆x)3 + O((∆x)4 ).
6 ∂x3
Adding both together, we deduce
∂^2u(x, t)/∂x^2 = (1/(∆x)^2)[u(x − ∆x, t) − 2u(x, t) + u(x + ∆x, t)] + O((∆x)^2).        (8.5)

Similarly, in the time direction we have

∂u(x, t)/∂t = (1/∆t)[u(x, t + ∆t) − u(x, t)] + O(∆t).        (8.6)
Inserting the approximations (8.5) and (8.6) into the heat equation (8.4) mo-
tivates the numerical scheme
u_m^{n+1} = u_m^n + µ(u_{m−1}^n − 2u_m^n + u_{m+1}^n),    m = 1, . . . , M,        (8.7)

where µ = ∆t/(∆x)^2 is the Courant number, which is kept constant. The
method of constructing numerical schemes by approximations to the derivatives
is known as finite differences. Let u^n denote the vector (u_1^n, . . . , u_M^n)^T.

The initial condition gives the vector u^0. Using the boundary conditions
u_0^n and u_{M+1}^n when necessary, (8.7) can be advanced from u^n to u^{n+1} for
n = 0, 1, 2, . . ., since from one time step to the next the right-hand side of
Equation (8.7) is known. This is an example of a time marching scheme.
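
A minimal MATLAB implementation of the time-marching scheme (8.7) with zero Dirichlet boundary conditions might look as follows (an illustrative sketch, not one of the book's listings; the function name is hypothetical).

function u = heat_explicit(u0, mu, nsteps)
% March the explicit scheme (8.7) for the heat equation with zero
% Dirichlet boundary conditions; u0 is the vector (u_1^0, ..., u_M^0).
u = u0(:);
for n = 1:nsteps
    ue = [0; u; 0];                               % append boundary values u_0, u_{M+1}
    u = u + mu*(ue(1:end-2) - 2*u + ue(3:end));   % u_m^{n+1} from (8.7)
end
end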
Keeping µ fixed and letting ∆x → 0 (which also implies that ∆t → 0, since
∆t = µ(∆x)^2), the question is: does, for every T > 0, the point approximation
u_m^n converge uniformly to u(x, t) for m∆x → x ∈ [0, 1] and n∆t → t ∈ [0, T]?
The method here has an extra parameter, µ. It is entirely possible for a method
to converge for some choice of µ and diverge otherwise.
Theorem 8.1. µ ≤ 1/2 ⇒ convergence.

Proof. Define the error as e_m^n := u_m^n − u(m∆x, n∆t), m = 0, . . . , M + 1, n ≥ 0.
Convergence is equivalent to

lim_{∆x→0}  max_{m=1,...,M; 0<n≤⌊T/∆t⌋} |e_m^n| = 0

for every constant T > 0. Since u satisfies the heat equation, we can equate
(8.5) and (8.6), which gives
(1/∆t)[u(x, t + ∆t) − u(x, t)] + O(∆t) = (1/(∆x)^2)[u(x − ∆x, t) − 2u(x, t) + u(x + ∆x, t)] + O((∆x)^2).

Rearranging yields

u(x, t + ∆t) = u(x, t) + µ[u(x − ∆x, t) − 2u(x, t) + u(x + ∆x, t)] + O((∆t)^2) + O(∆t(∆x)^2).

Subtracting this from (8.7) and using ∆t = µ(∆x)^2, it follows that there exists
C > 0 such that

|e_m^{n+1}| ≤ |e_m^n + µ(e_{m−1}^n − 2e_m^n + e_{m+1}^n)| + C(∆x)^4,

where we used the triangle inequality and properties of the O-notation. Let
e_max^n := max_{m=1,...,M} |e_m^n|. Then

|e_m^{n+1}| ≤ |µ e_{m−1}^n + (1 − 2µ) e_m^n + µ e_{m+1}^n| + C(∆x)^4
            ≤ (2µ + |1 − 2µ|) e_max^n + C(∆x)^4,

where we used the triangle inequality again. When µ ≤ 1/2, we have 1 − 2µ ≥ 0
and the modulus sign is not necessary. Hence

|e_m^{n+1}| ≤ (2µ + 1 − 2µ) e_max^n + C(∆x)^4 = e_max^n + C(∆x)^4.

Continuing by induction and using the fact that e_max^0 = 0, we deduce

e_max^n ≤ C n (∆x)^4 ≤ C (T/∆t)(∆x)^4 = C (T/µ)(∆x)^2 → 0

as ∆x → 0.
The restriction on µ has practical consequences, since it follows that ∆t ≤
(1/2)(∆x)^2 must hold. Thus the time step ∆t has to be tiny compared to ∆x in
practice. Like the forward Euler method for ODEs, the method given by (8.7)
is easily derived, explicit, easy to execute, and simple – but of little use in
practice.
In the following we will use finite differences extensively to approximate
derivatives occurring in PDEs. To this end we define the order of accuracy
and consistency of a finite difference scheme.
Definition 8.5 (Order of accuracy). Let u(x) be any sufficiently smooth solution
to a linear PDE as given in (8.2). Let F be an operator constructed from
finite difference operators using stepsizes ∆x and ∆t, where ∆x is a fixed
function of ∆t. F has order of accuracy p if

F u(x) = (∆t)^p (∑_{|α|≤k} c_α(x) D^α u(x) − f(x)) + O((∆t)^{p+1})

for all x ∈ R^n.
Returning to the previous example, the PDE is

(D_t − D_x^2) u(x, t) = 0,

while F is given by

F u(x, t) = [E_∆t − I − µ(E_∆x − 2I + E_∆x^{−1})] u(x, t),

where E_∆t and E_∆x denote the shift operators in the t and x directions respectively.
Since µ = ∆t/(∆x)^2, ∆x = √(∆t/µ) is a fixed function of ∆t. Denoting
the differentiation operator in the t direction by D_t and the differentiation
operator in the x direction by D_x, we have

E_∆t = e^{∆t D_t}    and    E_∆x = e^{∆x D_x}.

Then F becomes

F u(x, t) = [e^{∆t D_t} − I − µ(e^{−∆x D_x} − 2I + e^{∆x D_x})] u(x, t)
          = [∆t D_t + (1/2)(∆t D_t)^2 + . . . − ∆t(D_x^2 + (1/12)(∆x)^2 D_x^4 + . . .)] u(x, t)
          = ∆t(D_t − D_x^2) u(x, t) + O((∆t)^2),

since (∆x)^2 = O(∆t).
Note that, since ∆x and ∆t are linked by a fixed function, the local error can
be expressed in terms of either ∆x or ∆t.

Definition 8.6 (Consistency). If the order of accuracy p is greater than or equal
to one, then the finite difference scheme is consistent.

8.2.2 Stability and Its Eigenvalue Analysis


In our numerical treatment of PDEs we need to ensure that we do not try to
achieve the impossible. For example, we describe the solution by point values.
However, if the underlying function oscillates, point values can only describe
it adequately if the grid spacing is small enough to capture the oscillations. This
becomes a greater problem if oscillations increase with time. The following
defines PDEs where such problems do not arise.

Definition 8.7 (Well-posedness). A PDE problem is said to be well-posed if


1. a solution to the problem exists,
2. the solution is unique, and
3. the solution (in a compact time interval) depends in a uniformly bounded
and continuous manner on the initial conditions.
An example of an ill-posed PDE is the backward heat equation. It has
the same form as the heat equation, but the thermal diffusivity is negative.
Essentially it asks the question: If we know the temperature distribution now,
can we work out what it has been? Returning to our example of the metal
rod heated at one point: If there is another, but smaller, heat source, this
will soon disappear because of diffusion and the solution of the heat equation
will be indistinguishable from the solution without the additional heat source.
Therefore the solution to the backwards heat equation cannot be unique, since
we cannot tell whether we started with one or two heat sources. It can also
be shown that the solutions of the backward heat equation blow up at a rate
which is unbounded. For more information on the backward heat equation,
see Lloyd Nick Trefethen, The (Unfinished) PDE Coffee Table Book.
Definition 8.8 (Stability in the context of time marching algorithms for PDEs
of evolution). A numerical method for a well-posed PDE of evolution is stable
if for zero boundary conditions it produces a uniformly bounded approximation
of the solution in any bounded interval of the form 0 ≤ t ≤ T , when ∆x → 0
and the Courant number (or a generalization thereof ) is constant.
Theorem 8.2 (The Lax equivalence theorem). For a well-posed underlying
PDE which is solved by a consistent numerical method, we have stability ⇔
convergence.

The above theorem is also known as the Lax–Richtmyer equivalence the-
orem. A proof can be found in J. C. Strikwerda, Finite Difference Schemes
and Partial Differential Equations, [19].
Stability of (8.7) for µ ≤ 1/2 is implied by the Lax equivalence theorem, since
we have proven convergence. However, it is often easier to prove stability and
then deduce convergence. Stability can be proven directly by the eigenvalue
analysis of stability, as we will show, continuing with our example.
Recall that u^n = (u_1^n, . . . , u_M^n)^T is the vector of approximations at time
t = n∆t. The recurrence in (8.7) can be expressed as u^{n+1} = Au^n, where
A is an M × M Toeplitz symmetric tri-diagonal (TST) matrix with entries
1 − 2µ on the main diagonal and µ on the two adjacent diagonals. In general
Toeplitz matrices are matrices with constant entries along the diagonals. All
M × M TST matrices T have the same set of eigenvectors q_1, . . . , q_M. The
ith component of q_k is given by

    (q_k)_i = a sin(πik/(M + 1)),

where a = √(2/(M + 1)) is a normalization constant. The corresponding eigenvalue is

    λ_k = α + 2β cos(πk/(M + 1)),

where α is the diagonal entry and β is the off-diagonal entry. Moreover,
the eigenvectors form an orthogonal set. That these are the eigenvectors and
eigenvalues can be verified by checking whether the eigenvalue equations

    T q_k = λ_k q_k

hold for k = 1, . . . , M.
Thus the eigenvalues of A are given by

    (1 − 2µ) + 2µ cos(πk/(M + 1)) = 1 − 4µ sin^2(πk/(2M + 2)),   k = 1, . . . , M.

Note that

    0 < sin^2(πk/(2M + 2)) < 1.

For µ ≤ 1/2, the maximum modulus of the eigenvalues satisfies

    |1 − 4µ sin^2(πM/(2M + 2))| ≤ 1.
This is the spectral radius of A and thus, since A is symmetric, the Euclidean matrix norm ‖A‖. Hence

    ‖u^{n+1}‖ ≤ ‖A‖ ‖u^n‖ ≤ . . . ≤ ‖A‖^{n+1} ‖u^0‖ ≤ ‖u^0‖

for every n and every initial condition u^0, and the approximation is uniformly
bounded.
For µ > 1/2 the maximum modulus of the eigenvalues is

    4µ sin^2(πM/(2M + 2)) − 1 > 1

for M large enough. If the initial condition u^0 happens to be the eigenvector
corresponding to the largest (in modulus) eigenvalue λ, then by induction
u^n = λ^n u^0, which becomes unbounded as n → ∞.
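The eigenvalue formula for TST matrices is easy to check numerically; the following MATLAB snippet (with an arbitrary choice of M and µ) compares it with eig for the iteration matrix A.

    % Compare the TST eigenvalue formula with numerically computed eigenvalues.
    M  = 20;  mu = 0.5;
    A  = (1-2*mu)*eye(M) + mu*(diag(ones(M-1,1),1) + diag(ones(M-1,1),-1));
    k  = (1:M)';
    lam_formula = sort(1 - 4*mu*sin(pi*k/(2*M+2)).^2);
    lam_numeric = sort(eig(A));
    max(abs(lam_formula - lam_numeric))  % close to machine precision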
The method of lines (MOL) refers to the construction of schemes for partial
differential equations that proceeds by first discretizing the spatial derivatives
only and leaving the time variable continuous. This leads to a system of ordi-
nary differential equations to which one of the numerical methods for ODEs
we have seen already can be applied. In particular, using the approximation
to the second derivative given by (8.5), the semidiscretization

    du_m(t)/dt = (1/(∆x)^2) (u_{m−1}(t) − 2u_m(t) + u_{m+1}(t)),   m = 1, . . . , M

carries a local truncation error of O((∆x)^2). This is an ODE system which
can be written as

    u'(t) = (1/(∆x)^2) [ tridiag(1, −2, 1) u(t) + (u_0(t), 0, . . . , 0, u_{M+1}(t))^T ],      (8.8)

where tridiag(1, −2, 1) denotes the M × M TST matrix with −2 on the main diagonal
and 1 on the two adjacent diagonals, and u_0(t) = u(0, t) and u_{M+1}(t) = u(1, t) are
the known boundary conditions. The initial condition is given by

    u(0) = (u(∆x, 0), u(2∆x, 0), . . . , u(M∆x, 0))^T.

If this system is solved by the forward Euler method the resulting scheme
is (8.7), while backward Euler yields

    u_m^{n+1} − µ(u_{m−1}^{n+1} − 2u_m^{n+1} + u_{m+1}^{n+1}) = u_m^n.

Taking this two-stage approach (first semi-discretizing, then applying an
ODE solver) is conceptually easier than discretizing in both time and space in
one step (so-called full discretization).
Now, solving the ODE given by (8.8) with the trapezoidal rule gives
    u_m^{n+1} − (1/2)µ(u_{m−1}^{n+1} − 2u_m^{n+1} + u_{m+1}^{n+1}) = u_m^n + (1/2)µ(u_{m−1}^n − 2u_m^n + u_{m+1}^n),

which is known as the Crank–Nicolson method. The error is O((∆t)^3 +
∆t(∆x)^2), where the (∆t)^3 term comes from the trapezoidal rule and the
∆t(∆x)^2 term is inherited from the discretization in space. The Crank–Nicolson
scheme has superior qualities, as we will see in the analysis.
The Crank–Nicolson scheme can be written as Bu^{n+1} = Cu^n (assuming
zero boundary conditions), where B is the TST matrix with 1 + µ on the main
diagonal and −µ/2 on the two adjacent diagonals, and C is the TST matrix
with 1 − µ on the main diagonal and µ/2 on the two adjacent diagonals.
Both B and C are TST matrices. The eigenvalues of B are λ_k^B = 1 + µ −
µ cos(πk/(M+1)) = 1 + 2µ sin^2(πk/(2(M+1))), while the eigenvalues of C are λ_k^C = 1 − µ +
µ cos(πk/(M+1)) = 1 − 2µ sin^2(πk/(2(M+1))). Let Q = (q_1 . . . q_M) be the matrix whose
columns are the eigenvectors. Thus we can write

    QD_B Q^T u^{n+1} = QD_C Q^T u^n,

where D_B and D_C are diagonal matrices with the eigenvalues of B and C as
entries. Multiplying by QD_B^{−1} Q^T and using the fact that Q^T Q = I gives

    u^{n+1} = QD_B^{−1} D_C Q^T u^n = Q(D_B^{−1} D_C)^{n+1} Q^T u^0.

The moduli of the eigenvalues of D_B^{−1} D_C are

    |1 − 2µ sin^2(πk/(2(M+1)))| / (1 + 2µ sin^2(πk/(2(M+1)))) ≤ 1,   k = 1, . . . , M.

Thus we can deduce that the Crank–Nicolson method is stable for all µ > 0
and we only need to consider accuracy in our choice of ∆t versus ∆x.
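One Crank–Nicolson time step only requires a tridiagonal solve; a minimal MATLAB sketch, with an assumed grid, Courant number and initial profile, is:

    % One hundred Crank-Nicolson steps B*u_new = C*u_old with zero boundaries.
    M  = 49;  mu = 2;                    % mu may exceed 1/2
    e  = ones(M,1);
    B  = spdiags([-mu/2*e, (1+mu)*e, -mu/2*e], -1:1, M, M);
    C  = spdiags([ mu/2*e, (1-mu)*e,  mu/2*e], -1:1, M, M);
    u  = sin(pi*(1:M)'/(M+1));           % assumed initial data
    for n = 1:100
        u = B \ (C*u);                   % sparse tridiagonal solve per step
    end

Here accuracy, not stability, dictates the choice of ∆t.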
This technique is the eigenvalue analysis of stability. More generally, sup-
pose that a numerical method (for a PDE with zero boundary conditions) can
be written in the form
    u_∆x^{n+1} = A_∆x u_∆x^n,

where u_∆x^n ∈ R^M and A_∆x is an M × M matrix. We use the subscript ∆x
here to emphasize that the dimensions of u^n and A depend on ∆x, since as
∆x decreases M increases. By induction we have u_∆x^n = (A_∆x)^n u_∆x^0. For any
vector norm ‖ · ‖ and the induced matrix norm ‖A‖ = sup ‖Ax‖/‖x‖, we have

    ‖u_∆x^n‖ = ‖(A_∆x)^n u_∆x^0‖ ≤ ‖(A_∆x)^n‖ ‖u_∆x^0‖ ≤ ‖A_∆x‖^n ‖u_∆x^0‖.

Stability can now be defined as preserving the boundedness of u_∆x^n with re-
spect to the chosen norm ‖ · ‖, and it follows from the inequality above that
the method is stable if

    ‖A_∆x‖ ≤ 1   as ∆x → 0.
Usually the norm ‖ · ‖ is chosen to be the averaged Euclidean length

    ‖u_∆x‖_∆x = ( ∆x Σ_{i=1}^{M} |u_i|^2 )^{1/2}.

Note that the dimensions depend on ∆x. The reason for the factor ∆x^{1/2} is
because of

    ‖u_∆x‖_∆x = ( ∆x Σ_{i=1}^{M} |u_i|^2 )^{1/2}  →  ( ∫_0^1 |u(x)|^2 dx )^{1/2}   as ∆x → 0,

where u is a square-integrable function such that (u_∆x)_m = u(m∆x). The sum
∆x Σ_{i=1}^{M} |u_i|^2 is the Riemann sum approximating the integral. The averaging
does not affect the induced Euclidean matrix norm (since the averaging factor
cancels). For normal matrices (i.e., matrices which have a complete set of or-
thonormal eigenvectors), the Euclidean norm of the matrix equals the spectral
radius, i.e., ‖A‖ = ρ(A), which is the maximum modulus of the eigenvalues,
and we are back at the eigenvalue analysis.
Since the Crank–Nicolson method is implicit, we need to solve a system
of equations. However, the matrix of the system is TST and its solution by
sparse Cholesky factorization can be done in O(M ) operations. Recall that
generally the Cholesky decomposition or Cholesky triangle is a decomposition
of a real symmetric, positive-definite matrix A into the unique product of
a lower triangular matrix L and its transpose, where L has strictly positive
diagonal entries. Thus we want to find A = LL^T. The algorithm starts with
i = 1 and A^(1) := A. At step i, the matrix A^(i) has the form

    A^(i) = [ I_{i−1}   0      0
              0         b_ii   b_i^T
              0         b_i    B^(i) ],

where I_{i−1} denotes the (i−1)×(i−1) identity matrix, B^(i) is an (M−i)×(M−i)
submatrix, b_ii is a scalar, and b_i is an (M−i)×1 vector. For i = 1, I_0 is a
matrix of size 0, i.e., it is not there. We let

    L_i := [ I_{i−1}   0               0
             0         √b_ii           0
             0         (1/√b_ii) b_i   I_{M−i} ],

then A^(i) = L_i A^(i+1) L_i^T, where

    A^(i+1) = [ I_{i−1}   0   0
                0         1   0
                0         0   B^(i) − (1/b_ii) b_i b_i^T ].
Since b_i b_i^T is an outer product, this algorithm is also called the outer prod-
uct version (other names are the Cholesky–Banachiewicz algorithm and the
Cholesky–Crout algorithm). After M steps, we get A^(M+1) = I, since B^(M)
and b_M are of size 0. Hence, the lower triangular matrix L of the factorization
is L := L_1 L_2 . . . L_M.
For TST matrices we have

    A^(1) = [ a_11   b_1^T
              b_1    B^(1) ],

where b_1^T = (a_12, 0, . . . , 0). Note that B^(1) is a tridiagonal symmetric matrix.
So

    L_1 = [ √a_11         0   0
            a_12/√a_11    1   0
            0             0   I_{M−2} ]   and   A^(2) = [ 1   0
                                                          0   B^(1) − (1/a_11) b_1 b_1^T ].

Now the matrix formed by b_1 b_1^T has only one non-zero entry in the top left
corner, which is a_12^2. It follows that B^(1) − (1/a_11) b_1 b_1^T is again a tridiagonal
symmetric positive-definite matrix. Thus the algorithm calculates successively
smaller tridiagonal symmetric positive-definite matrices where only the top
left element has to be updated. Additionally the matrices L_i differ from the
identity matrix in only two entries. Thus the factorization can be calculated
in O(M) operations. We will see later when discussing the Hockney algorithm
how the structure of the eigenvectors can be used to obtain solutions to this
system of equations using the fast Fourier transform.
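For a tridiagonal symmetric positive-definite matrix the outer product version reduces to a single pass over the diagonal; a MATLAB sketch (the function name and the storage of the two diagonals as vectors are choices made here) is:

    function [ld, ls] = tridiag_cholesky(d, s)
    % Outer product Cholesky factorization of a tridiagonal SPD matrix with
    % diagonal d (length M) and sub/superdiagonal s (length M-1), so that
    % L = diag(ld) + diag(ls,-1) satisfies A = L*L'. Costs O(M) operations.
    M  = length(d);
    ld = zeros(M,1);  ls = zeros(M-1,1);
    for i = 1:M-1
        ld(i)  = sqrt(d(i));             % square root of the current pivot
        ls(i)  = s(i)/ld(i);             % only one entry of b_i is non-zero
        d(i+1) = d(i+1) - ls(i)^2;       % update the next diagonal entry
    end
    ld(M) = sqrt(d(M));
    end

For the Crank–Nicolson matrix B one would pass d with entries 1 + µ and s with entries −µ/2.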

8.2.3 Cauchy Problems and the Fourier Analysis of Stability


So far we have looked at PDEs where explicit boundary conditions are given.
We now consider PDEs where the spatial variable extends over the whole
range of R. Again, we have to ensure that a solution exists, and therefore
restrict ourselves to so-called Cauchy problems.
Definition 8.9 (Cauchy problem). A PDE is known as a Cauchy problem,
if there are no explicit boundary conditions but the initial condition u(x, 0)
must be square-integrable in (−∞, ∞); that is, the integral of the square of the
absolute value is finite.
To solve a Cauchy problem, we assume a recurrence of the form

    Σ_{k=−r}^{s} α_k u_{m+k}^{n+1} = Σ_{k=−r}^{s} β_k u_{m+k}^n,   m ∈ Z, n ∈ Z^+,      (8.9)

where the coefficients α_k and β_k are independent of m and n, but typically
depend on µ. For example, for the Crank–Nicolson method we have

    r = 1, s = 1,   α_0 = 1 + µ, α_{±1} = −µ/2,   β_0 = 1 − µ, β_{±1} = µ/2.
The stability of (8.9) is investigated by Fourier analysis. This is independent
of the underlying PDE. The numerical stability is a feature of the algebraic
recurrences.

Definition 8.10 (Fourier transform). Let v = (v_m)_{m∈Z} be a sequence of
numbers in C. The Fourier transform of this sequence is the function

    v̂(θ) = Σ_{m∈Z} v_m e^{−imθ},   −π ≤ θ ≤ π.      (8.10)

The elements of the sequence can be retrieved from v̂(θ) by calculating

    v_m = (1/2π) ∫_{−π}^{π} v̂(θ) e^{imθ} dθ

for m ∈ Z, which is the inverse transform.


The sequences and functions are equipped with the following norms

    ‖v‖ = ( Σ_{m∈Z} |v_m|^2 )^{1/2}   and   ‖v̂‖ = ( (1/2π) ∫_{−π}^{π} |v̂(θ)|^2 dθ )^{1/2}.

Lemma 8.1 (Parseval’s identity). For any sequence v, we have the identity ‖v‖ = ‖v̂‖.

Proof. We have

    ∫_{−π}^{π} e^{−ilθ} dθ = 2π for l = 0,   and 0 for l ≠ 0.

So by definition,

    ‖v̂‖^2 = (1/2π) ∫_{−π}^{π} | Σ_{m∈Z} v_m e^{−imθ} |^2 dθ = (1/2π) ∫_{−π}^{π} Σ_{m∈Z} Σ_{k∈Z} v_m v̄_k e^{−i(m−k)θ} dθ
          = Σ_{m∈Z} Σ_{k∈Z} v_m v̄_k (1/2π) ∫_{−π}^{π} e^{−i(m−k)θ} dθ = Σ_{m∈Z} Σ_{k∈Z} v_m v̄_k δ_{mk} = ‖v‖^2.

The implication of Parseval’s identity is that the Fourier transform is an
isometry of the Euclidean norm, which is an important reason for its many
applications.
For the Fourier analysis of stability we have for every time step n a se-
quence u^n = (u_m^n)_{m∈Z}. For θ ∈ [−π, π], let û^n(θ) = Σ_{m∈Z} u_m^n e^{−imθ} be the
Fourier transform of the sequence u^n. Multiplying (8.9) by e^{−imθ}, and sum-
ming over m ∈ Z, gives for the left-hand side

    Σ_{m∈Z} e^{−imθ} Σ_{k=−r}^{s} α_k u_{m+k}^{n+1} = Σ_{k=−r}^{s} α_k Σ_{m∈Z} e^{−imθ} u_{m+k}^{n+1}
                                                    = Σ_{k=−r}^{s} α_k Σ_{m∈Z} e^{−i(m−k)θ} u_m^{n+1}
                                                    = ( Σ_{k=−r}^{s} α_k e^{ikθ} ) Σ_{m∈Z} e^{−imθ} u_m^{n+1}
                                                    = ( Σ_{k=−r}^{s} α_k e^{ikθ} ) û^{n+1}(θ).

Similarly, the right-hand side is ( Σ_{k=−r}^{s} β_k e^{ikθ} ) û^n(θ), and we deduce

    û^{n+1}(θ) = H(θ) û^n(θ)   where   H(θ) = ( Σ_{k=−r}^{s} β_k e^{ikθ} ) / ( Σ_{k=−r}^{s} α_k e^{ikθ} ).      (8.11)

The function H is called the amplification factor of the recurrence given by
(8.9).
Theorem 8.3. The method given by (8.9) is stable if and only if |H(θ)| ≤ 1
for all θ ∈ [−π, π].
Proof. Since we are solving a Cauchy problem and m ranges over the whole
of Z, the equations are identical for all ∆x and there is no need to insist
explicitly that ‖u^n‖ remains uniformly bounded for ∆x → 0. The definition
of stability says that there exists C > 0 such that ‖u^n‖ ≤ C for all n ∈ Z^+.
Since the Fourier transform is an isometry, this is equivalent to ‖û^n‖ ≤ C for
all n ∈ Z^+. Iterating (8.11), we deduce

    û^{n+1}(θ) = [H(θ)]^{n+1} û^0(θ),   θ ∈ [−π, π], n ∈ Z^+.
To prove the first direction, let’s assume |H(θ)| ≤ 1 for all θ ∈ [−π, π]. Then
by the above equation |û^n(θ)| ≤ |û^0(θ)|, and it follows that

    ‖û^n‖^2 = (1/2π) ∫_{−π}^{π} |û^n(θ)|^2 dθ ≤ (1/2π) ∫_{−π}^{π} |H(θ)|^{2n} |û^0(θ)|^2 dθ
            ≤ (1/2π) ∫_{−π}^{π} |û^0(θ)|^2 dθ = ‖û^0‖^2,

and hence we have stability.


The other direction is more technical, since we have to construct an exam-
ple of an initial condition where the solution will become unbounded. Suppose
that there exists θ* ∈ (−π, π) such that |H(θ*)| = 1 + ε > 1. Since H is ratio-
nal, it is continuous everywhere apart from where the denominator vanishes,
which cannot be at θ*, since it takes a finite value there. Hence there exist
θ^− < θ* < θ^+ ∈ [−π, π] such that |H(θ)| ≥ 1 + ε/2 for all θ ∈ [θ^−, θ^+]. Let

    û^0(θ) = √(2π/(θ^+ − θ^−))   for θ^− ≤ θ ≤ θ^+,   and û^0(θ) = 0 otherwise.

This is a step function over the interval [θ^−, θ^+]. We can calculate the sequence
which gives rise to this step function by

    u_m^0 = (1/2π) ∫_{−π}^{π} û^0(θ) e^{imθ} dθ = (1/2π) ∫_{θ^−}^{θ^+} √(2π/(θ^+ − θ^−)) e^{imθ} dθ.

For m = 0 we have e^{imθ} = 1 and therefore

    u_0^0 = √((θ^+ − θ^−)/(2π)).

For m ≠ 0 we get

    u_m^0 = (1/√(2π(θ^+ − θ^−))) [e^{imθ}/(im)]_{θ^−}^{θ^+} = (1/(im√(2π(θ^+ − θ^−)))) (e^{imθ^+} − e^{imθ^−}).

The complex-valued function

    u(x) = √((θ^+ − θ^−)/(2π))                                  for x = 0,
    u(x) = (1/(ix√(2π(θ^+ − θ^−)))) (e^{ixθ^+} − e^{ixθ^−})     for x ≠ 0,

is well defined and continuous, since for x → 0 it tends to the value for x = 0
(which can be verified by using the expansions of the exponentials). It is also
square integrable, since it tends to zero for x → ∞ due to x being in the
denominator. Therefore it is a suitable choice for an initial condition.
On the other hand,

    ‖û^n‖ = ( (1/2π) ∫_{−π}^{π} |H(θ)|^{2n} |û^0(θ)|^2 dθ )^{1/2}
          = ( (1/2π) ∫_{θ^−}^{θ^+} |H(θ)|^{2n} |û^0(θ)|^2 dθ )^{1/2}
          ≥ ( (1/2π) (1 + ε/2)^{2n} ∫_{θ^−}^{θ^+} 2π(θ^+ − θ^−)^{−1} dθ )^{1/2}
          = (1 + ε/2)^n → ∞ as n → ∞,

since the last integral equates to 1. Thus the method is unstable.
Consider the Cauchy problem for the heat equation and recall the first
method based on solving the semi-discretized problem with forward Euler,

    u_m^{n+1} = u_m^n + µ(u_{m−1}^n − 2u_m^n + u_{m+1}^n).

Putting this into the new context, we have

    r = 1, s = 1,   α_0 = 1, α_{±1} = 0,   β_0 = 1 − 2µ, β_{±1} = µ.

Therefore

    H(θ) = 1 + µ(e^{−iθ} − 2 + e^{iθ}) = 1 − 4µ sin^2(θ/2),   θ ∈ [−π, π],

and thus 1 ≥ H(θ) ≥ H(π) = 1 − 4µ. Hence the method is stable if and only
if µ ≤ 1/2.
On the other hand, for the backward Euler method we have

    u_m^{n+1} − µ(u_{m−1}^{n+1} − 2u_m^{n+1} + u_{m+1}^{n+1}) = u_m^n

and therefore

    H(θ) = [1 − µ(e^{−iθ} − 2 + e^{iθ})]^{−1} = [1 + 4µ sin^2(θ/2)]^{−1} ∈ (0, 1],

which implies stability for all µ > 0.
The Crank–Nicolson scheme given by

    u_m^{n+1} − (1/2)µ(u_{m−1}^{n+1} − 2u_m^{n+1} + u_{m+1}^{n+1}) = u_m^n + (1/2)µ(u_{m−1}^n − 2u_m^n + u_{m+1}^n)

results in

    H(θ) = (1 + (1/2)µ(e^{−iθ} − 2 + e^{iθ})) / (1 − (1/2)µ(e^{−iθ} − 2 + e^{iθ})) = (1 − 2µ sin^2(θ/2)) / (1 + 2µ sin^2(θ/2)),

which lies in (−1, 1] for all θ ∈ [−π, π] and all µ > 0.
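The Fourier stability test is also easy to carry out numerically by sampling |H(θ)| on [−π, π]; the MATLAB fragment below does this for the Crank–Nicolson amplification factor with an arbitrary µ.

    % Fourier stability test for Crank-Nicolson: sample |H(theta)| on [-pi,pi].
    mu    = 10;                          % illustrative value; any mu > 0
    theta = linspace(-pi, pi, 1001);
    H     = (1 + mu/2*(exp(-1i*theta) - 2 + exp(1i*theta))) ./ ...
            (1 - mu/2*(exp(-1i*theta) - 2 + exp(1i*theta)));
    max(abs(H))                          % does not exceed 1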
Exercise 8.2. Apply the Fourier stability test to the difference equation

    u_m^{n+1} = (1/2)(2 − 5µ + 6µ^2)u_m^n + (2/3)µ(2 − 3µ)(u_{m−1}^n + u_{m+1}^n)
                − (1/12)µ(1 − 6µ)(u_{m−2}^n + u_{m+2}^n).

Deduce that the test is satisfied if and only if 0 ≤ µ ≤ 2/3.
The eigenvalue stability analysis and the Fourier stability analysis are
tackling two fundamentally different problems. In the eigenvalue framework,
boundaries are incorporated, while in the Fourier analysis we have m ∈ Z,
which corresponds to x ∈ R in the underlying PDE. It is no trivial task to
translate Fourier analysis to problems with boundaries. When either r ≥ 2 or
s ≥ 2 there are not enough boundary values to satisfy the recurrence equations
near the boundary. This means the discretized equations need to be amended
near the boundary and the identity (8.11) is no longer valid. It is not enough
to extend the values u_m^n with zeros for m ∉ {1, 2, . . . , M}. In general a great
conditions. How to treat the problem at the boundaries has to be carefully
considered to avoid the instability which then propagates from the boundary
inwards.
With many parabolic PDEs, e.g., the heat equation, the Euclidean norm
of the exact solution decays (for zero boundary conditions) and good methods
share this behaviour. Hence they are robust enough to cope with inwards error
propagation from the boundary into the solution domain, which might occur
when discretized equations are amended there. The situation is more difficult
for many hyperbolic equations, e.g., the wave equation, since the exact solution
keeps the energy (a.k.a. the Euclidean norm) constant and so do many good
methods. In that case any error propagation from the boundary delivers a false
result. Additional mathematical techniques are necessary in this situation.
Exercise 8.3. The Crank–Nicolson formula is applied to the heat equa-
tion u_t = u_xx on a rectangular mesh (m∆x, n∆t), m = 0, 1, ..., M + 1,
n = 0, 1, 2, ..., where ∆x = 1/(M + 1). We assume zero boundary conditions
u(0, t) = u(1, t) = 0 for all t ≥ 0. Prove that the estimates u_m^n ≈ u(m∆x, n∆t)
satisfy the equation

    Σ_{m=1}^{M} [(u_m^{n+1})^2 − (u_m^n)^2] = −(1/2)µ Σ_{m=1}^{M+1} (u_m^{n+1} + u_m^n − u_{m−1}^{n+1} − u_{m−1}^n)^2.

This shows that Σ_{m=1}^{M} (u_m^n)^2 is monotonically decreasing with increasing n
and the numerical solution mimics the decaying behaviour of the exact solu-
tion. (Hint: Substitute the value of u_m^{n+1} − u_m^n that is given by the Crank–
Nicolson formula into the elementary equation Σ_{m=1}^{M} [(u_m^{n+1})^2 − (u_m^n)^2] =
Σ_{m=1}^{M} (u_m^{n+1} − u_m^n)(u_m^{n+1} + u_m^n). It is also helpful occasionally to change the
index m of the summation by one.)

8.3 Elliptic PDEs


To move from the one-dimensional case to higher dimensional cases, let’s first
look at elliptic partial differential equations. Since the heat equation in two
dimensions is given by ut = ∇2 u, let’s look at the Poisson equation

∇2 u(x, y) = uxx (x, y) + uyy (x, y) = f (x, y) for all (x, y) ∈ Ω,

where Ω is an open connected domain in R2 with boundary ∂Ω. For all (x, y) ∈
∂Ω we have the Dirichlet boundary condition u(x, y) = φ(x, y). We assume
that f is continuous in Ω and that φ is twice differentiable. We lay a square
grid over Ω with uniform spacing of ∆x in both the x and y direction. Further,
we assume that ∂Ω is part of the grid. For our purposes Ω is a rectangle, but
the results hold for other domains.

8.3.1 Computational Stencils


The finite difference given by Equation (8.5) gives an O(∆x2 ) approximation
to the second derivative. We employ this in both the x and y direction to
obtain the approximation
    ∇^2 u(x, y) ≈ (1/(∆x)^2) [u(x − ∆x, y) + u(x + ∆x, y) + u(x, y − ∆x)
                              + u(x, y + ∆x) − 4u(x, y)],

which produces a local error of O((∆x)^2). This gives rise to the five-point
method

    u_{l−1,m} + u_{l+1,m} + u_{l,m−1} + u_{l,m+1} − 4u_{l,m} = (∆x)^2 f_{l,m},   (l∆x, m∆x) ∈ Ω,

where f_{l,m} = f(l∆x, m∆x) and u_{l,m} approximates u(l∆x, m∆x). A compact
notation is the computational stencil (also known as computational molecule)

                     [       1        ]
    (1/(∆x)^2)       [   1  −4   1    ]  u_{l,m} = f_{l,m}                    (8.12)
                     [       1        ]

Whenever (l∆x, m∆x) ∈ ∂Ω, we substitute the appropriate value


φ(l∆x, m∆x). This procedure leads to a (sparse) set of linear algebraic equa-
tions, whose solution approximates the true solution at the grid points.
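Applying the five-point stencil to a grid function is a one-line operation in MATLAB; the sketch below (the function name and the convention that U stores the boundary values in its outer rows and columns are assumptions) evaluates the discrete Laplacian at all interior points.

    function L = five_point_laplacian(U, dx)
    % Approximate the Laplacian at the interior points of the grid function U,
    % where U also stores the boundary values in its outer rows and columns.
    [P, Q] = size(U);
    i = 2:P-1;  j = 2:Q-1;
    L = (U(i-1,j) + U(i+1,j) + U(i,j-1) + U(i,j+1) - 4*U(i,j)) / dx^2;
    end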
By using the approximation

    ∂^2 u(x, y)/∂x^2 = (1/(∆x)^2) [−(1/12)u(x − 2∆x, y) + (4/3)u(x − ∆x, y) − (5/2)u(x, y)
                                   + (4/3)u(x + ∆x, y) − (1/12)u(x + 2∆x, y)] + O((∆x)^4),

we obtain the computational stencil

                     [                −1/12                ]
                     [                 4/3                 ]
    (1/(∆x)^2)       [  −1/12   4/3   −5     4/3   −1/12   ]  u_{l,m} = f_{l,m}      (8.13)
                     [                 4/3                 ]
                     [                −1/12                ]

which produces a local error of O((∆x)^4). However, the implementation of this
method is more complicated, since at the boundary, values of points outside
the boundary are needed. These values can be approximated by nearby values.
For example, if we require an approximation to u(l∆x, m∆x), where m∆x lies
outside the boundary, we can set

    u(l∆x, m∆x) ≈ (1/4)u((l + 1)∆x, (m − 1)∆x) + (1/2)u(l∆x, (m − 1)∆x)
                  + (1/4)u((l − 1)∆x, (m − 1)∆x).                                     (8.14)
The set of linear algebraic equations has to be modified accordingly to take
these adjustments into account.
Now consider the approximation

    u(x + ∆x, y) + u(x − ∆x, y) = (e^{∆xD_x} + e^{−∆xD_x}) u(x, y)
                                = 2u(x, y) + (∆x)^2 ∂^2u/∂x^2 (x, y) + O((∆x)^4),

where D_x denotes the differential operator in the x-direction. Applying the
same principle in the y direction, we have

    u(x + ∆x, y + ∆x) + u(x − ∆x, y + ∆x) + u(x + ∆x, y − ∆x) + u(x − ∆x, y − ∆x)
        = (e^{∆xD_x} + e^{−∆xD_x})(e^{∆xD_y} + e^{−∆xD_y}) u(x, y)
        = [2 + (∆x)^2 ∂^2/∂x^2 + O((∆x)^4)] × [2 + (∆x)^2 ∂^2/∂y^2 + O((∆x)^4)] u(x, y)
        = 4u(x, y) + 2(∆x)^2 ∂^2u/∂x^2 (x, y) + 2(∆x)^2 ∂^2u/∂y^2 (x, y) + O((∆x)^4).

This motivates the computational stencil

                     [  1/2    0   1/2  ]
    (1/(∆x)^2)       [   0    −2    0   ]  u_{l,m} = f_{l,m}                    (8.15)
                     [  1/2    0   1/2  ]

Both (8.12) and (8.15) give O((∆x)^2) approximations to ∇^2 u. Computa-
tional stencils can be added: taking 2× (8.12) + (8.15) and dividing by 3 results in
the nine-point method:

                     [  1/6   2/3   1/6  ]
    (1/(∆x)^2)       [  2/3  −10/3  2/3  ]  u_{l,m} = f_{l,m}                   (8.16)
                     [  1/6   2/3   1/6  ]

This method has advantages, as we will see in the following.


Exercise 8.4. We have seen that the nine-point method provides a numerical
solution to the Poisson equation with local error O((∆x)2 ). Show that for the
Laplace equation ∇2 u = 0 the local error is O((∆x)4 ).
Generally the nine-point formula is an O((∆x)^2) approximation. However,
this can be increased to O((∆x)^4). In the above exercise the (∆x)^2 error term
is

    (1/12)(∆x)^2 ( ∂^4u/∂x^4 + 2 ∂^4u/∂x^2∂y^2 + ∂^4u/∂y^4 ),

which vanishes for ∇^2 u = 0. This can be rewritten as

    (1/12)(∆x)^2 ( ∂^2/∂x^2 (∂^2u/∂x^2 + ∂^2u/∂y^2) + ∂^2/∂y^2 (∂^2u/∂x^2 + ∂^2u/∂y^2) )
        = (1/12)(∆x)^2 ( ∂^2f/∂x^2 + ∂^2f/∂y^2 ) = (1/12)(∆x)^2 ∇^2 f.

The Laplacian of f can be approximated by the five-point method (8.12) and
adding this to the right-hand side of (8.16) gives the scheme

                     [  1/6   2/3   1/6  ]                [   0     1/12    0   ]
    (1/(∆x)^2)       [  2/3  −10/3  2/3  ]  u_{l,m}   =   [  1/12   −1/3   1/12 ]  f_{l,m} + f_{l,m},
                     [  1/6   2/3   1/6  ]                [   0     1/12    0   ]

which has a local error of O((∆x)^4), since the (∆x)^2 error term is canceled.
Exercise 8.5. Determine the order of the local error of the finite difference
approximation to ∂^2u/∂x∂y, which is given by the computational stencil

                     [ −1/4   0   1/4  ]
    (1/(∆x)^2)       [   0    0    0   ]
                     [  1/4   0  −1/4  ]

We have seen that the first error term in the nine-point method is

    (1/12)(∆x)^2 ( ∂^4u/∂x^4 (x, y) + 2 ∂^4u/∂x^2∂y^2 (x, y) + ∂^4u/∂y^4 (x, y) ) = (1/12)(∆x)^2 ∇^4 u(x, y),

while the method given by (8.13) has no (∆x)^2 error term. Now we can com-
bine these two methods to generate an approximation for ∇^4. In particular,
let the new method be given by 12×(8.16) − 12×(8.13), and dividing by (∆x)^2
gives

                     [            1           ]
                     [       2   −8    2      ]
    (1/(∆x)^4)       [  1   −8   20   −8   1  ]
                     [       2   −8    2      ]
                     [            1           ]
Exercise 8.6. Verify that the above approximation of ∇4 has a local error of
O((∆x)2 ) and identify the first error term.
Thus, by knowing the first error term, we can combine different finite dif-
ference schemes to obtain new approximations to different partial differential
equations.
So far, we have only looked at equispaced square grids. The boundary,
however, often fails to fit exactly into a square grid. Thus we sometimes need
to approximate derivatives using non-equispaced points at the boundary. In
the interior the grid remains equispaced. For example, suppose we try to
approximate the second directional derivative and that the grid points have
the spacing ∆x in the interior and α∆x at the boundary, where 0 < α ≤ 1.
Using the Taylor expansion, one can see that

    (1/(∆x)^2) [ (2/(α + 1)) g(x − ∆x) − (2/α) g(x) + (2/(α(α + 1))) g(x + α∆x) ]
        = g''(x) + (1/3)(α − 1)g'''(x)∆x + O((∆x)^2),

with error of O(∆x). Note that α = 1 recovers the finite difference with error
O((∆x)^2) that we have already used. A better approximation can be obtained
by taking two equispaced points on the interior side:

    (1/(∆x)^2) [ ((α − 1)/(α + 2)) g(x − 2∆x) − (2(α − 2)/(α + 1)) g(x − ∆x)
                 + ((α − 3)/α) g(x) + (6/(α(α + 1)(α + 2))) g(x + α∆x) ]
        = g''(x) + O((∆x)^2).

8.3.2 Sparse Algebraic Systems Arising from Computational Stencils


For simplicity we restrict our attention to the case of Ω being a square where
the sides are divided into pieces of length ∆x by M + 1. Thus we need to
estimate M 2 unknown function values u(l∆x, m∆x), l, m = 1, . . . , M . Let
N = M 2 . The computational stencils then yield an N × N system of linear
equations. How this is represented in matrix form depends of course on the
way the grid points are arranged into a one-dimensional array. In the natural
ordering the grid points are arranged by columns. Using this ordering, the
five-point method gives the linear system (∆x)−2 Au = b where A is the
block tridiagonal matrix

        [ B  I          ]                 [ −4   1            ]
        [ I  B  I       ]                 [  1  −4   1        ]
    A = [    .. .. ..   ]   and    B =    [     ..  ..  ..    ] .
        [       I  B  I ]                 [        1  −4   1  ]
        [          I  B ]                 [            1  −4  ]

Note that B is a TST matrix. The vector b is given by the right-hand sides f_{l,m}
(following the same ordering) and the boundary conditions. More specifically,
if ul,m is such that for example ul,m+1 lies on the boundary, we let ul,m+1 =
φ(l∆x, (m + 1)∆x) and the right-hand side bl,m = fl,m − φ(l∆x, (m + 1)∆x).
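The block structure translates directly into Kronecker products; a MATLAB sketch of assembling and solving the five-point system (with an assumed right-hand side and zero boundary data) is:

    % Assemble the five-point matrix A in the natural ordering with Kronecker
    % products and solve (dx)^(-2) A u = b (zero boundary data assumed).
    M  = 50;  dx = 1/(M+1);
    e  = ones(M,1);
    B  = spdiags([e, -4*e, e], -1:1, M, M);            % TST block
    T  = spdiags([e, 0*e, e], -1:1, M, M);             % block coupling pattern
    A  = kron(speye(M), B) + kron(T, speye(M));
    [X, Y] = meshgrid((1:M)*dx);
    b  = reshape(sin(pi*X).*sin(pi*Y), [], 1);         % assumed right-hand side f
    u  = (A/dx^2) \ b;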
The method specified by (8.13) has the associated matrix

        [    B      (4/3)I   −(1/12)I                                  ]
        [  (4/3)I     B       (4/3)I   −(1/12)I                        ]
    A = [ −(1/12)I  (4/3)I      B       (4/3)I   −(1/12)I              ]
        [              ..       ..        ..        ..        ..       ]
        [                    −(1/12)I   (4/3)I      B       (4/3)I     ]
        [                              −(1/12)I   (4/3)I      B        ]

where B is the pentadiagonal symmetric Toeplitz matrix with −5 on the main
diagonal, 4/3 on the first sub- and superdiagonals, and −1/12 on the second
sub- and superdiagonals.
Again the boundary conditions need to be incorporated into the right-hand
side, using the approximation given by (8.14) for points lying outside Ω.
Now the nine-point method has the associated block tridiagonal matrix

        [ B  C           ]
        [ C  B  C        ]
    A = [    .. .. ..    ]                                              (8.17)
        [       C  B  C  ]
        [          C  B  ]

where the blocks are given by the TST matrices B, with −10/3 on the main
diagonal and 2/3 on the two adjacent diagonals, and C, with 2/3 on the main
diagonal and 1/6 on the two adjacent diagonals.
With these examples, we see that the symmetry of computational stencils
yields symmetric block matrices with constant blocks only on the main diag-
onal and one or two subdiagonals where the blocks themselves are symmetric
Toeplitz matrices with only the main diagonal, and one or two subdiagonals
nonzero. This special structure lends itself to efficient algorithms to solve the
large system of equations.

8.3.3 Hockney Algorithm


For simplicity we restrict ourselves to tridiagonal systems, which results in the
Hockney algorithm. Recall that the normalized eigenvectors and the eigenval-
ues for a general M × M tridiagonal symmetric Toeplitz (TST) matrix B, with
α on the main diagonal and β on the two adjacent diagonals, are

    λ_k^B = α + 2β cos(kπ/(M + 1)),   (q_k)_j = a sin(jkπ/(M + 1)),   k, j = 1, . . . , M,

where a is a normalization constant. Let Q = (q_1 . . . q_M) be the matrix whose
columns are the eigenvectors. We consider the matrices B and C occurring in
the nine-point method. The matrix A can then be written as the block tridiagonal
matrix with QD_B Q^T on the diagonal blocks and QD_C Q^T on the two adjacent
block diagonals.

Setting v_k = (∆x)^{−2} Q^T u_k and multiplying by a diagonal block matrix where
each block equals Q^T, our system becomes

    [ D_B  D_C                     ] [ v_1     ]   [ c_1     ]
    [ D_C  D_B  D_C                ] [ v_2     ]   [ c_2     ]
    [       ..   ..   ..           ] [  ..     ] = [  ..     ]
    [           D_C  D_B  D_C      ] [ v_{M−1} ]   [ c_{M−1} ]
    [                D_C  D_B      ] [ v_M     ]   [ c_M     ]

where c_k = Q^T b_k. At this stage we reorder the grid by rows instead of
columns. In other words, we permute v ↦ v̂ = P v, c ↦ ĉ = P c, such that
the portion v̂_1 is made out of the first components of each of the portions
v_1, . . . , v_M, the portion v̂_2 is made out of the second components of each
of the portions v_1, . . . , v_M, and so on. Permutations come essentially at no
computational cost since in practice we store c, v as 2D arrays (which are
addressed accordingly) and not in one long vector. This yields a new system

    [ Λ_1                 ] [ v̂_1 ]   [ ĉ_1 ]
    [      Λ_2            ] [ v̂_2 ]   [ ĉ_2 ]
    [           ...       ] [  .. ] = [  .. ] ,
    [                Λ_M  ] [ v̂_M ]   [ ĉ_M ]

where Λ_k is the TST matrix with λ_k^B on the main diagonal and λ_k^C on the
two adjacent diagonals.

These are M uncoupled systems, Λk v̂k = ĉk for k = 1, . . . , M . Since these


systems are tridiagonal, each can be solved fast in O(M ) operations. Hence
the steps of the algorithm and their computational cost are as follows:

• Form the products c_k = Q^T b_k for k = 1, . . . , M, taking O(M^3) operations.
• Solve the M × M tridiagonal systems Λ_k v̂_k = ĉ_k for k = 1, . . . , M, taking O(M^2) operations.
• Form the products u_k = Qv_k for k = 1, . . . , M, taking O(M^3) operations.
The computational bottleneck consists of the 2M matrix-vector products
by the matrices Q and Q^T. The elements of Q are (q_k)_j = a sin(jkπ/(M + 1)). Now
sin(jkπ/(M + 1)) is the imaginary part of exp(ijkπ/(M + 1)). This observation lends itself to a
considerable speedup.
Definition 8.11 (Discrete Fourier Transform). Let x = (x_l)_{l∈Z} be a sequence
of complex numbers with period n, i.e., x_{l+n} = x_l. Set ω_n = exp(2πi/n), the
primitive root of unity of degree n. The discrete Fourier transform (DFT) x̂
of x is given by

    x̂_j = (1/n) Σ_{l=0}^{n−1} ω_n^{−jl} x_l,   j = 0, . . . , n − 1.

It can be easily proven that the discrete Fourier transform is an isomor-
phism and that the inverse is given by

    x_l = Σ_{j=0}^{n−1} ω_n^{jl} x̂_j,   l = 0, . . . , n − 1.

For l = 1, . . . , M the lth component of the product Qy is

    (Qy)_l = a Σ_{j=1}^{M} y_j sin(ljπ/(M + 1)) = a Σ_{j=0}^{M} y_j sin(ljπ/(M + 1)) = a Im Σ_{j=0}^{M} y_j exp(iljπ/(M + 1)),

since sin(l·0·π/(M + 1)) = 0. Here Im denotes the imaginary part of the expression. To
obtain a formula similar to the discrete Fourier transform we need a factor of
2 multiplying π. Subsequently, we need to extend the sum, which we can do
by setting y_{M+1} = . . . = y_{2M+1} = 0:

    (Qy)_l = a Im Σ_{j=0}^{M} y_j exp(ilj2π/(2M + 2)) = a Im Σ_{j=0}^{2M+1} y_j exp(ilj2π/(2M + 2))
           = a Im Σ_{j=0}^{2M+1} y_j ω_{2M+2}^{lj}.

Thus, multiplication by Q can be reduced to calculating an inverse DFT.


The calculation can be sped up even more by using the fast Fourier trans-
form (FFT). Suppose that n is a power of 2, i.e., n = 2L , and denote by

x̂E = (x̂2l )l∈Z and x̂O = (x̂2l+1 )l∈Z

the even and odd portions of x̂. Note that both x̂^E and x̂^O have period
n/2 = 2^{L−1}. Suppose we already know the inverse DFTs x^E and x^O of the two
short sequences x̂^E and x̂^O. Then it is possible to assemble x in a small number of
operations. Remembering ω_n^n = 1,

    x_l = Σ_{j=0}^{2^L−1} ω_{2^L}^{jl} x̂_j = Σ_{j=0}^{2^{L−1}−1} ω_{2^L}^{2jl} x̂_{2j} + Σ_{j=0}^{2^{L−1}−1} ω_{2^L}^{(2j+1)l} x̂_{2j+1}
        = Σ_{j=0}^{2^{L−1}−1} ω_{2^{L−1}}^{jl} x̂_j^E + ω_{2^L}^l Σ_{j=0}^{2^{L−1}−1} ω_{2^{L−1}}^{jl} x̂_j^O = x_l^E + ω_{2^L}^l x_l^O.

Thus it only costs n products to calculate x, provided that x^E and x^O are
known. This can be reduced even further to n/2, since the second half of the
sequence can be calculated as

    x_{l+2^{L−1}} = x_{l+2^{L−1}}^E + ω_{2^L}^{l+2^{L−1}} x_{l+2^{L−1}}^O = x_l^E + ω_{2^L}^{2^{L−1}} ω_{2^L}^l x_l^O = x_l^E − ω_{2^L}^l x_l^O.

Thus the products ω_{2^L}^l x_l^O only need to be evaluated for l = 0, . . . , n/2 − 1.


To execute the FFT we start with vectors with only one element and in
the ith stage, i = 1, . . . , L, assemble 2^{L−i} vectors of length 2^i from vectors of
length 2^{i−1}. Altogether the cost of the FFT is (1/2)n log_2 n products. For example,
if n = 1024 = 2^{10}, the cost is ≈ 5 × 10^3 compared to ≈ 10^6 for the matrix
multiplication. For n = 2^{20} the numbers become ≈ 1.05 × 10^7 for the FFT
compared to ≈ 1.1 × 10^{12}, which represents a factor of more than 10^5.
The following schematic shows the structure of the FFT algorithm for
n = 8. The bottom row shows the indices forming the vectors consisting of
only one element. The branches show how the vectors are combined with which
factors ±ω_{2^L}^l (here ω_8 = √i).

    +(√i)^0  +(√i)^1  +(√i)^2  +(√i)^3   −(√i)^0  −(√i)^1  −(√i)^2  −(√i)^3

      +i^0     +i^1     −i^0     −i^1      +i^0     +i^1     −i^0     −i^1

       +1       −1       +1       −1        +1       −1       +1       −1

        0        4        2        6         1        5        3        7

The FFT was discovered by Gauss (and forgotten), rediscovered by Lanc-
zos (and forgotten), and, finally, rediscovered by Cooley and Tukey, under
whose names it is implemented in many packages. It is a classic divide-and-
conquer algorithm.
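In MATLAB the multiplication by Q can therefore be delegated to the built-in FFT; a sketch (assuming the normalization a = √(2/(M + 1)) used earlier) is:

    function v = mult_by_Q(y)
    % Multiply by the sine eigenvector matrix Q via the FFT; MATLAB's ifft
    % computes (1/n)*sum_j z_j*w_n^(jl), hence the rescaling by n.
    M = length(y);
    n = 2*M + 2;
    a = sqrt(2/(M+1));                   % assumed normalization constant
    z = [0; y(:); zeros(M+1,1)];         % y_0 = y_{M+1} = ... = y_{2M+1} = 0
    w = n * ifft(z);                     % w(l+1) = sum_j z_j * w_n^(jl)
    v = a * imag(w(2:M+1));              % (Qy)_l for l = 1,...,M
    end

Since Q is symmetric, the same routine also computes the products Q^T b_k, so the O(M^3) steps of the Hockney algorithm are reduced to O(M^2 log M) operations.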
Exercise 8.7. Let (x̂_0, x̂_1, x̂_2, x̂_3, x̂_4, x̂_5, x̂_6, x̂_7) = (2, 0, 6, −2, 6, 0, 6, 2). By
applying the inverse of the FFT algorithm, calculate x_l = Σ_{j=0}^{7} ω_8^{jl} x̂_j for
l = 0, 2, 4, 6, where ω_8 = exp(2iπ/8).

8.3.4 Multigrid Methods


In the previous section, we have exploited the special structure in the matrices
of computational stencils we encountered so far. Another approach is that of
multigrid methods, which were developed from the computational observation that
when the Gauss–Seidel method is applied to solve the five-point formula on a
square M ×M grid, the error decays substantially in each iteration for the first
few iterations. Subsequently, the rate slows down and the method settles to
its slow asymptotic rate of convergence. Note that this is the error associated
with the Gauss–Seidel solution to the linear equations, not the error in the
solution of the original Laplacian equation.
PDEs  239

Recall that for a system of linear equations given by Ax = (D + L + U )x =


b, where D, L, U are the diagonal, strictly lower triangular, and strictly upper
triangular parts of A, respectively, the Gauss–Seidel and Jacobi methods are:
Gauss–Seidel
(D + L)x(k+1) = −U x(k) + b,
Jacobi
Dx(k+1) = −(U + L)x(k) + b.
Specifically for the five-point method, this results in (using the natural
ordering)
Gauss–Seidel
    u_{l−1,m}^{(k+1)} + u_{l,m−1}^{(k+1)} − 4u_{l,m}^{(k+1)} = −u_{l+1,m}^{(k)} − u_{l,m+1}^{(k)} + (∆x)^2 f_{l,m},
Jacobi
    −4u_{l,m}^{(k+1)} = −u_{l−1,m}^{(k)} − u_{l,m−1}^{(k)} − u_{l+1,m}^{(k)} − u_{l,m+1}^{(k)} + (∆x)^2 f_{l,m}.
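A single Gauss–Seidel sweep for the five-point formula can be sketched in MATLAB as follows; the storage convention (boundary values kept in the outer rows and columns of U, and F holding (∆x)^2 f) is an assumption.

    function U = gauss_seidel_sweep(U, F)
    % One Gauss-Seidel sweep for the five-point formula. U stores boundary
    % values in its outer rows and columns; F holds (dx)^2 * f.
    [P, Q] = size(U);
    for m = 2:Q-1                        % sweep in the natural ordering
        for l = 2:P-1
            U(l,m) = (U(l-1,m) + U(l+1,m) + U(l,m-1) + U(l,m+1) - F(l,m))/4;
        end
    end
    end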
So why does the rate of convergence change so dramatically? In our analysis
we consider the Jacobi method. The iteration matrix H is 1/4 times the block
TST matrix with blocks B on the main diagonal and I on the two adjacent
block diagonals, where B is the M × M TST matrix with 0 on the main diagonal
and 1 on the two adjacent diagonals.
We know from our previous results that the eigenvalues of B are λ_i^B =
2 cos(iπ/(M + 1)), i = 1, . . . , M. Changing the ordering from columns to rows, we
obtain the block diagonal matrix (1/4) diag(Λ_1, . . . , Λ_M), where Λ_i is the TST
matrix with λ_i^B on the main diagonal and 1 on the two adjacent diagonals.
The eigenvalues of Λ_i are λ_i^B + 2 cos(jπ/(M + 1)), j = 1, . . . , M. We deduce that the
eigenvalues of the system are

    λ_{i,j} = (1/4)( 2 cos(iπ/(M + 1)) + 2 cos(jπ/(M + 1)) )
            = 1 − ( sin^2(iπ/(2(M + 1))) + sin^2(jπ/(2(M + 1))) ).

Hence all the eigenvalues are smaller than 1 in modulus, since i and j range
from 1 to M, guaranteeing convergence; however, the spectral radius is close
to 1, being 1 − 2 sin^2(π/(2(M + 1))) ≈ 1 − π^2/(2M^2). The larger M, the closer the spectral
radius is to 1.
Let e^(k) be the error in the kth iteration and let v_{i,j} be the orthonormal
eigenvectors. We can expand the error with respect to this basis,

    e^(k) = Σ_{i,j=1}^{M} e_{i,j}^{(k)} v_{i,j}.

Iterating, we have

    e^(k) = H^k e^(0)   ⇒   |e_{i,j}^{(k)}| = |λ_{i,j}|^k |e_{i,j}^{(0)}|.

Thus the components of the error (with respect to the basis of eigenvectors)
decay at a different rate for different values of i, j, which are the frequencies.
We separate those into low frequencies (LF) where both i and j lie in [1, (M+1)/2),
which results in the angles lying between zero and π/4, high frequencies (HF)
where both i and j lie in [(M+1)/2, M], which results in the angles lying between
π/4 and π/2, and mixed frequencies (MF) where one of i and j lies in [1, (M+1)/2)
and the other lies in [(M+1)/2, M]. Let us determine the least factor by which the
amplitudes of the mixed frequencies are damped. Either i or j is at least
(M+1)/2 while the other is at most (M+1)/2, and thus sin^2(iπ/(2(M+1))) ∈ [0, 1/2] while
sin^2(jπ/(2(M+1))) ∈ [1/2, 1]. It follows that

    1 − ( sin^2(iπ/(2(M+1))) + sin^2(jπ/(2(M+1))) ) ∈ [−1/2, 1/2]

for i, j in the mixed frequency range. Hence the amplitudes are damped by at
least a factor of 1/2, which corresponds to the observations. But how can this
be used to speed up convergence?
Firstly, let’s look at the high frequencies. Good damping of those is
achieved by using the Jacobi method with relaxation (also known as damped
Jacobi), which gives a damping factor of 3/5. The proof of this is left to the
reader in the following exercise.
Exercise 8.8. Find a formula for the eigenvalues of the iteration matrix of
the Jacobi method with relaxation and an optimal value for the relaxation
parameter for the MF and HF components combined.
From the analysis and the results from the exercise, we can deduce that
the damped Jacobi method converges fast for HF and MF. This is also true
for the damped Gauss–Seidel method. For the low frequencies we note that
these are the high frequencies if we consider a grid with spacing 2∆x instead
of ∆x.
To examine this further we restrict ourselves to one space dimension. The
matrix arising from approximating the second derivative is given by the TST
matrix with −2 on the main diagonal and 1 on the two adjacent diagonals.
For the Jacobi method, the corresponding iteration matrix is

    (1/2) [ 0  1           ]
          [ 1  0  1        ]
          [    ..  ..  ..  ]                                            (8.18)
          [        1  0  1 ]
          [           1  0 ]

Assume that both matrices have dimension (n − 1) × (n − 1) and that n is
even. For k = 1, . . . , n − 1, the eigenvalues of the iteration matrix are

    λ_k = (1/2)(0 + 2 cos(kπ/n)) = cos(kπ/n),

and the corresponding eigenvector v_k has the components

    (v_k)_j = sin(kjπ/n).

|λ_k| is largest for k = 1 and k = n − 1, while for k = n/2 we have λ_k = 0,
since cos(π/2) = 0. We deduce that we have a fast reduction of error for k ∈
(n/4, 3n/4), while for k ∈ [1, n/4] ∪ [3n/4, n − 1] the error reduces slowly. For
k ∈ [3n/4, n − 1] a faster reduction in error is usually obtained via relaxation.
To consider k ∈ [1, n/4], take every second component of v_k, forming a new
vector v̂_k with components

    (v̂_k)_j = sin(k2jπ/n) = sin(kjπ/(n/2))

for j = 1, . . . , n/2 − 1. This is now the eigenvector of a matrix with the same
form as in (8.18) but with dimension (n/2 − 1) × (n/2 − 1). The corresponding
eigenvalue is

    λ_k = cos(kπ/(n/2)).

For k = n/4, where we had a slow reduction in error on the fine grid, we now
have

    λ_k = cos((n/4)π/(n/2)) = cos(π/2) = 0,

the fastest reduction possible.
The idea of the multigrid method is to cover the square domain by a range
of nested grids, of increasing coarseness, say,

Ω∆x ⊂ Ω2∆x ⊂ · · · ,

so that on each grid we remove the contribution of high frequencies relative
to this grid. A typical multigrid sweep starts at the finest grid, travels to the
coarsest (where the number of variables is small and we can afford to solve
the equations directly), and back to the finest. Whenever we coarsen the grid,
we compute the residual

r∆x = b∆x − A∆x x∆x ,

where ∆x is the size of the grid we are coarsening and x∆x is the current
solution on this grid. The values of the residual are restricted to the coarser
grid by combining nine fine values by the restriction operator R

1 1 1
4 2 4

1 1
2 1 2

1 1 1
4 2 4

Thus the new value on the coarse grid is an average of the fine value at this
point and its eight neighbouring fine values. Then we solve for the residual,
i.e., we iterate to solve the equations

A2∆x x2∆x = r2∆x = Rr∆x . (8.19)

(Note that the residual becomes the new b.)


On the way back, refinement entails a prolongation P by linear interpola-
tion, which is the exact opposite of restriction. The values of the coarse grid
are distributed to the fine grid such that
• if the points coincide, the value is the same,

• if the point has two coarse neighbours (one on each side), the value is the
average of those two,
• if the point has four coarse neighbours (top left, top right, bottom left,
bottom right), the value is the average of those four.
We then correct x∆x by P x2∆x .
Usually only a moderate number of iterations (3 to 5) is employed in each
restriction to solve (8.19). At each prolongation one or two iterations are
necessary to remove high frequencies which have been reintroduced by the
prolongation. We check for convergence at the end of each sweep. We repeat
the sweeps until convergence has occurred.
Before the first multigrid sweep, however, we need to obtain good starting
values for the finest grid. This is done by starting from the coarsest grid, solving
the system of equations there, and prolonging the solution to the finest grid
in a zig-zag fashion. That means we do not go directly to the finest grid, but
return after each finer grid to the coarsest grid to obtain better initial values
for the solution.
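A sketch of the restriction operator in MATLAB is given below; the indexing assumes that the residual is stored at the interior points only and that the fine grid has 2k + 1 interior points in each direction when the coarse grid has k. Prolongation distributes the values to the fine grid in the exact opposite way.

    function rc = restrict(rf)
    % Restrict a fine-grid residual (interior values only, (2k+1) x (2k+1))
    % to the coarse grid (k x k) with the weights 1/4, 1/2, 1 shown above.
    nf = size(rf,1);  nc = (nf-1)/2;
    rc = zeros(nc);
    for i = 1:nc
        for j = 1:nc
            I = 2*i;  J = 2*j;           % coarse point coincides with fine (I,J)
            rc(i,j) = rf(I,J) ...
                + 0.5 *(rf(I-1,J) + rf(I+1,J) + rf(I,J-1) + rf(I,J+1)) ...
                + 0.25*(rf(I-1,J-1) + rf(I-1,J+1) + rf(I+1,J-1) + rf(I+1,J+1));
        end
    end
    end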
Exercise 8.9. The function u(x, y) = 18x(1 − x)y(1 − y), 0 ≤ x, y ≤ 1, is the
solution of the Poisson equation uxx + uyy = 36(x2 + y 2 − x − y) = f (x, y),
subject to zero boundary conditions. Let ∆x = 1/6 and seek the solution of
the five-point method

um−1,n +um+1,n +um,n−1 +um,n+1 −4um,n = (∆x)2 f (mh, nh), 1 ≤ m, n ≤ 5,

where um,n is zero if one of m, n is 0 or 6. Let the multigrid method be applied,


using only this fine grid and a coarse grid of mesh size 1/3, and let every um,n
be zero initially. Calculate the 25 residuals of the starting vector on the fine
grid. Then, following the restriction procedure, find the residuals for the initial
calculation on the coarse grid. Solve the equations on the coarse grid exactly.
The resultant estimates of u at the four interior points of the coarse grid all
have the value 5/6. By applying the prolongation operator to these estimates,
find the 25 starting values of um,n for the subsequent iterations of Jacobi on
the fine grid. Further, show that if one Jacobi iteration is performed, then
u3,3 = 23/24 occurs, which is the estimate of u(1/2, 1/2) = 9/8.

8.4 Parabolic PDEs in Two Dimensions


We are now combining the analysis of elliptic partial differential equations
from the previous section with the analysis of parabolic partial differential
equations. Let’s look at the heat equation on the unit square

    ∂u/∂t = ∇^2 u,   0 ≤ x, y ≤ 1,  t ≥ 0,

where u = u(x, y, t). We are given an initial condition at t = 0 and zero bound-
ary conditions on ∂Ω, where Ω = [0, 1]^2 × [0, ∞). We generalize the method
of lines which was introduced earlier to derive an algorithm. To this pur-
pose we lay a square grid over the unit square with mesh size ∆x. Let
u_{l,m}(t) ≈ u(l∆x, m∆x, t) and let u_{l,m}^n ≈ u_{l,m}(n∆t). The five-point method
approximating the right-hand side of the PDE results in

    u'_{l,m} = (1/(∆x)^2) (u_{l−1,m} + u_{l+1,m} + u_{l,m−1} + u_{l,m+1} − 4u_{l,m}).

Using the previous analysis, in matrix form this is

    u' = (1/(∆x)^2) A∗ u,                                               (8.20)

where A∗ is the block TST matrix with blocks B on the main diagonal and I on
the two adjacent block diagonals, and B is the M × M TST matrix with −4 on
the main diagonal and 1 on the two adjacent diagonals.
Using the forward Euler method to discretize in time yields

    u_{l,m}^{n+1} = u_{l,m}^n + µ(u_{l−1,m}^n + u_{l+1,m}^n + u_{l,m−1}^n + u_{l,m+1}^n − 4u_{l,m}^n),      (8.21)

where µ = ∆t/(∆x)^2. Again, in matrix form this is

    u^{n+1} = Au^n,   A = I + µA∗.

The local error is O((∆t)^2 + ∆t(∆x)^2) = O((∆x)^4), when µ is held constant,
since the forward Euler method itself carries an error of O((∆t)^2) and the
discretization in space carries an error of O((∆x)^2), which gets multiplied by
∆t when the forward Euler method is applied.
As we have seen, the eigenvalues of A∗ are given by

    λ_{i,j}(A∗) = −4 + 2 cos(iπ/(M + 1)) + 2 cos(jπ/(M + 1))
                = −4 ( sin^2(iπ/(2(M + 1))) + sin^2(jπ/(2(M + 1))) )

and thus the eigenvalues of A are

    λ_{i,j}(A) = 1 − 4µ ( sin^2(iπ/(2(M + 1))) + sin^2(jπ/(2(M + 1))) ).

To achieve stability, the spectral radius of A has to be less than or equal to one.
This is satisfied if and only if µ ≤ 1/4. Thus, compared to the one-dimensional
case where µ ≤ 1/2, we are further restricted in our choice of ∆t ≤ (1/4)(∆x)^2.
Generally, using the same discretization in space for all space dimensions and
employing the forward Euler method, we have µ ≤ 1/(2d), where d is the number
of space dimensions.
The Fourier analysis of stability generalizes equally to two dimensions
when the spatial domain is the whole plane. Of course, the range and in-
dices have to be extended to two dimensions. For a given sequence of numbers
u = (u_{l,m})_{l,m∈Z}, the 2-D Fourier transform of this sequence is

    û(θ, ψ) = Σ_{l,m∈Z} u_{l,m} e^{−i(lθ+mψ)},   −π ≤ θ, ψ ≤ π.

All the previous results generalize. In particular, the Fourier transform is an
isometry,

    ( Σ_{l,m∈Z} |u_{l,m}|^2 )^{1/2} =: ‖u‖ = ‖û‖ := ( (1/4π^2) ∫_{−π}^{π} ∫_{−π}^{π} |û(θ, ψ)|^2 dθ dψ )^{1/2}.

Assume our numerical method takes the form

    Σ_{k,j=−r}^{s} α_{k,j} u_{l+k,m+j}^{n+1} = Σ_{k,j=−r}^{s} β_{k,j} u_{l+k,m+j}^n.

We can, analogous to (8.11), define the amplification factor

    H(θ, ψ) = ( Σ_{k,j=−r}^{s} β_{k,j} e^{i(kθ+jψ)} ) / ( Σ_{k,j=−r}^{s} α_{k,j} e^{i(kθ+jψ)} ),

and the method is stable if and only if |H(θ, ψ)| ≤ 1 for all θ, ψ ∈ [−π, π].
For the method given in (8.21) we have

    H(θ, ψ) = 1 + µ(e^{−iθ} + e^{iθ} + e^{−iψ} + e^{iψ} − 4) = 1 − 4µ ( sin^2(θ/2) + sin^2(ψ/2) ),

and we again deduce stability if and only if µ ≤ 1/4.
If we apply the trapezoidal rule instead of the forward Euler method to
our semi-discretization (8.20), we obtain the two-dimensional Crank–Nicolson
method
    u_{l,m}^{n+1} − (1/2)µ(u_{l−1,m}^{n+1} + u_{l+1,m}^{n+1} + u_{l,m−1}^{n+1} + u_{l,m+1}^{n+1} − 4u_{l,m}^{n+1}) =
        u_{l,m}^n + (1/2)µ(u_{l−1,m}^n + u_{l+1,m}^n + u_{l,m−1}^n + u_{l,m+1}^n − 4u_{l,m}^n),

or in matrix form

    (I − (1/2)µA∗) u^{n+1} = (I + (1/2)µA∗) u^n.                        (8.22)

The local error is O((∆t)^3 + ∆t(∆x)^2), since the trapezoidal rule carries an
error of O((∆t)^3). Similarly to the one-dimensional case, the method is stable
if and only if the moduli of the eigenvalues of A = (I − (1/2)µA∗)^{−1}(I + (1/2)µA∗) are
less than or equal to 1. Both matrices I − (1/2)µA∗ and I + (1/2)µA∗ are block TST
matrices and share the same set of eigenvectors with A∗. Thus the eigenvalue of
A corresponding to a particular eigenvector is easily calculated as the inverse
of the eigenvalue of I − (1/2)µA∗ times the eigenvalue of I + (1/2)µA∗,

    |λ_{i,j}(A)| = |1 + (1/2)µλ_{i,j}(A∗)| / |1 − (1/2)µλ_{i,j}(A∗)|.
This is always less than one, because the numerator is always less than the
denominator, since all the eigenvalues of A∗ lie in (−8, 0).
Exercise 8.10. Deduce the amplification factor for the two-dimensional
Crank–Nicolson method.
8.4.1 Splitting
Solving parabolic equations with explicit methods typically leads to restric-
tions of the form ∆t ∼ ∆x2 , and this is generally not acceptable. Instead,
implicit methods are used, for example, Crank–Nicolson. However, this means
that in each time step a system of linear equations needs to be solved. This
can become very costly for several space dimensions. The matrix I − 12 µA∗ is
in structure similar to A∗ , so we can apply the Hockney algorithm.
However, since the two-dimensional Crank–Nicolson method already car-
ries a local truncation error of O((∆t)3 + ∆t(∆x)2 ) = O((∆t)2 ) (because of
∆t = (µ∆x)2 ), the system does not need to be solved exactly. It is enough
to solve it within this error. Using the following operator notation (central
difference operator applied twice),

δx2 ul,m = ul−1,m − 2ul,m + ul+1,m , δy2 ul,m = ul,m−1 − 2ul,m + ul,m+1 ,

the Crank–Nicolson method becomes


1 1
[I − µ(δx2 + δy2 )]un+1 2 2 n
l,m = [I + µ(δx + δy )]ul,m .
2 2
We have, however, the same magnitude of local error if this formula is replaced
by
1 1 1 2 1 2 n
[I − µδx2 ][I − µδy2 ]un+1
l,m = [I + µδx ][I + µδy ]ul,m ,
2 2 2 2
which is called the split version of Crank–Nicolson. Note that this modifi-
cation decouples the x and y direction, since operator multiplication means
that they can be applied after each other. Therefore this technique is called
dimensional splitting.We will see below what practical impact that has, but
we first examine the error introduced by the modification.
Multiplying the split version of Crank–Nicolson out,
1 1 1 1 2 2 2 n
[I − µ(δx2 + δy2 ) + µ2 δx2 δy2 ]un+1 2 2
l,m = [I + µ(δx + δy ) + µ δx δy ]ul,m ,
2 4 2 4
we see that on each side a term of the form 14 µ2 δx2 δy2 is introduced. The extra
error introduced into the modified scheme is therefore
1 2 2 2 n+1
e= µ δx δy (ul,m − unl,m ).
4
Now the term in the bracket is an approximation to the first derivative in the
time direction times ∆t. In fact, the difference between the two schemes is
 
1 2 2 2 ∂ 1 ∂
e = µ δx δy ∆t ul,m (t) + O((∆t) ) = µ2 ∆tδx2 δy2 ul,m (t) + O((∆t)2 ).
2
4 ∂t 4 ∂t

We know that δ_x^2/(∆x)^2 and δ_y^2/(∆x)^2 are approximations to the second par-
tial derivatives in the space directions carrying an error of O((∆x)^2). Thus we
can write, using µ = ∆t/(∆x)^2,

    e = ((∆t)^2/4) (δ_x^2/(∆x)^2)(δ_y^2/(∆x)^2) ∆t ∂u_{l,m}(t)/∂t + O((∆t)^2)
      = ((∆t)^3/4) (∂^2/∂x^2)(∂^2/∂y^2)(∂/∂t) u(x, y, t) + O((∆t)^3(∆x)^2) + O((∆t)^2) = O((∆t)^2).
In matrix form, the new method is equivalent to splitting the matrix A∗ into
the sum of two matrices, A_x and A_y, where A_x is the block TST matrix with
−2I on the diagonal blocks and I on the two adjacent block diagonals, and A_y
is the block diagonal matrix whose diagonal blocks all equal B, the M × M TST
matrix with −2 on the main diagonal and 1 on the two adjacent diagonals.
We then solve the uncoupled system

    (I − (1/2)µA_x)(I − (1/2)µA_y) u^{n+1} = (I + (1/2)µA_x)(I + (1/2)µA_y) u^n

in two steps, first solving

    (I − (1/2)µA_x) u^{n+1/2} = (I + (1/2)µA_x)(I + (1/2)µA_y) u^n,

then solving

    (I − (1/2)µA_y) u^{n+1} = u^{n+1/2}.

The matrix I − (1/2)µA_y is block diagonal, where each block is I − (1/2)µB. Thus
solving the above system is equivalent to solving the same tridiagonal system
(I − (1/2)µB)u_i^{n+1} = u_i^{n+1/2} for different right-hand sides M times, for which the
same method can be reused. Here the vector u has been divided into vectors
u_i of size M for i = 1, . . . , M. The matrix I − (1/2)µA_x is of the same form after a
reordering of the grid which changes the right-hand sides. Thus we first have
to calculate (I + (1/2)µA_x)(I + (1/2)µA_y)u^n, then reorder, solve the first system,
then reorder and solve the second system.
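Because all solves are tridiagonal, one time step of the split Crank–Nicolson scheme can be sketched in a few lines of MATLAB; storing the grid function as an M × M array takes the place of the explicit reordering (the grid size, µ and initial data below are assumptions).

    % One step of split Crank-Nicolson on an M x M interior grid, zero boundaries.
    M  = 50;  mu = 1;                    % illustrative values
    e  = ones(M,1);
    B  = spdiags([e, -2*e, e], -1:1, M, M);   % one-dimensional second difference
    Ip = speye(M) + mu/2*B;              % I + (1/2)*mu*delta^2
    Im = speye(M) - mu/2*B;              % I - (1/2)*mu*delta^2
    [X, Y] = meshgrid((1:M)/(M+1));
    U  = sin(pi*X).*sin(pi*Y);           % assumed initial data
    V  = Ip*U;                           % apply I + mu/2*delta_x^2 (first index)
    V  = (Ip*V')';                       % apply I + mu/2*delta_y^2 (second index)
    W  = Im \ V;                         % solve in the x direction
    U  = (Im \ W')';                     % reorder and solve in the y direction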
Speaking more generally, suppose the method of lines results after dis-
cretization in space in the linear system of ODEs given by

    u'(t) = Au(t),   u(0) = u_0,

where u_0 is derived from the initial condition.
We formally define the matrix exponential by its Taylor series

    e^B = Σ_{k=0}^{∞} (1/k!) B^k.

Using this definition to differentiate e^{tA}, we have

    d(e^{tA})/dt = Σ_{k=0}^{∞} (1/k!) d((tA)^k)/dt = A Σ_{k=0}^{∞} (1/k!) (tA)^k = Ae^{tA},

and the solution to the system of ODEs is u(t) = e^{tA} u_0, or at time t_{n+1},

    u(t_{n+1}) = e^{∆tA} u(t_n).

Many methods for ODEs are actually approximations to the matrix expo-
nential. For example, applying the forward Euler method to the ODE results
in

    u^{n+1} = (I + ∆tA)u^n.

The corresponding approximation to the exponential is 1 + z = e^z + O(z^2).
On the other hand, if the trapezoidal rule is used instead, we have

    u^{n+1} = (I − (1/2)∆tA)^{−1}(I + (1/2)∆tA) u^n

with the corresponding approximation to the exponential

    (1 + (1/2)z)/(1 − (1/2)z) = e^z + O(z^3).

Now we apply dimensional splitting by A = A_x + A_y, where A_x and A_y
contain the contributions in the x and y direction. If A_x and A_y commute,
then e^{tA} = e^{t(A_x+A_y)} = e^{tA_x} e^{tA_y} and we could solve the system of ODEs
by approximating each exponential independently. Using the approximation
to the exponential given by the forward Euler method gives

    u^{n+1} = (I + ∆tA_x)(I + ∆tA_y)u^n,

while the trapezoidal approximation yields

    u^{n+1} = (I − (1/2)∆tA_x)^{−1}(I + (1/2)∆tA_x)(I − (1/2)∆tA_y)^{−1}(I + (1/2)∆tA_y) u^n.
2 2 2 2

The advantage is that up to reordering all matrices involved are tridiagonal (if
only neighbouring points are used) and the system of equations can be solved
cheaply.
However, the assumption e^{t(A_x+A_y)} = e^{tA_x} e^{tA_y} is generally false. Taking
the first few terms of the definition of the matrix exponential, we have

    e^{t(A_x+A_y)} = I + t(A_x + A_y) + (1/2)t^2(A_x^2 + A_xA_y + A_yA_x + A_y^2) + O(t^3),

while

    e^{tA_x} e^{tA_y} = [I + tA_x + (1/2)t^2A_x^2 + O(t^3)] × [I + tA_y + (1/2)t^2A_y^2 + O(t^3)]
                      = I + t(A_x + A_y) + (1/2)t^2(A_x^2 + 2A_xA_y + A_y^2) + O(t^3).

Hence the difference is (1/2)t^2(A_xA_y − A_yA_x) + O(t^3), which does not vanish if
the matrices A_x and A_y do not commute. Still, splitting is, when suitably im-
plemented, a powerful technique to drastically reduce computational expense.
Common splitting techniques are
Beam and Warming’s splitting
    e^{t(A_x+A_y)} = e^{tA_x} e^{tA_y} + O(t^2),
Strang’s splitting
    e^{t(A_x+A_y)} = e^{(1/2)tA_x} e^{tA_y} e^{(1/2)tA_x} + O(t^3),
Parallel splitting
    e^{t(A_x+A_y)} = (1/2)e^{tA_x} e^{tA_y} + (1/2)e^{tA_y} e^{tA_x} + O(t^3).
Let r = p/q be a rational function where p and q are polynomials, which ap-
proximates the exponential function. As long as the order of the error in this
approximation is the same as the order of the error in the splitting, approx-
imating e∆tAx and e∆tAy by r(∆tAx ) and r(∆tAy ) results in the same local
error. Note that if r(B) for a matrix B is evaluated, we calculate q(B)−1 p(B)
where q(B)−1 is the inverse of the matrix formed by computing q(B). The
ordering of the multiplication is important, since matrices generally do not
commute.
For example, if r(z) = e^z + O(z^2), and employing Beam and Warming’s
splitting, we can construct the time stepping method

    u^{n+1} = r(∆tA_x) r(∆tA_y) u^n,

which produces an error of O((∆t)^2). The choice r(z) = (1 + (1/2)z)/(1 − (1/2)z)
produces the split Crank–Nicolson scheme. On the other hand, as long as
r(z) = e^z + O(z^3), using Strang’s splitting to obtain

    u^{n+1} = r((1/2)∆tA_x) r(∆tA_y) r((1/2)∆tA_x) u^n

carries an error of O((∆t)^3).
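These orders are easily observed numerically; the MATLAB fragment below compares the Beam–Warming and Strang splittings with expm for two small random non-commuting matrices.

    % Observe the splitting orders numerically with two non-commuting matrices.
    rng(1);  n = 6;
    Ax = randn(n);  Ay = randn(n);
    for t = [0.1, 0.05, 0.025]
        E  = expm(t*(Ax+Ay));
        bw = norm(expm(t*Ax)*expm(t*Ay) - E);                  % O(t^2)
        st = norm(expm(t/2*Ax)*expm(t*Ay)*expm(t/2*Ax) - E);   % O(t^3)
        fprintf('t = %5.3f   Beam-Warming %.2e   Strang %.2e\n', t, bw, st);
    end

Halving t should reduce the Beam–Warming error by roughly a factor of four and the Strang error by roughly a factor of eight.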
Stability depends on the eigenvalues of A_x and A_y as well as on the
properties of r. In the case of the split Crank–Nicolson we examined above, A_x
and A_y are (up to a re-ordering) TST matrices with eigenvalues (∆x)^{−2}(−2 +
2 cos(kπ/(M + 1))) for k = 1, . . . , M, where each eigenvalue is M-fold. It is easy to see
that these eigenvalues are nonpositive. So as long as r specifies an A-stable
method, that is, |r(z)| < 1 for all z ∈ C^−, we have stability.
method, that is, |r(z)| < 1 for all z ∈ C− , we have stability.
Exercise 8.11. Let F (t) = etA etB be the first order Beam–Warming splitting
of et(A+B) . Generally the splitting error is of the form t2 C for some matrix
C. If C has large eigenvalues the splitting error can be large even for small t.
Show that
\[ F(t) = e^{t(A+B)} + \int_0^t e^{(t-\tau)(A+B)} \bigl(e^{\tau A} B - B e^{\tau A}\bigr) e^{\tau B}\, d\tau. \]

(Hint: Find explicitly G(t) = F'(t) − (A+B)F(t) and use variation of constants to find the solution of the linear matrix ODE F' = (A+B)F + G, F(0) = I.)
Suppose that a matrix norm ‖·‖ is given and that there exist real constants c_A, c_B and c_{A+B} such that
\[ \|e^{tA}\| \le e^{c_A t}, \qquad \|e^{tB}\| \le e^{c_B t}, \qquad \|e^{t(A+B)}\| \le e^{c_{A+B} t}. \]
Prove that
\[ \|F(t) - e^{t(A+B)}\| \le 2\|B\| \, \frac{e^{(c_A+c_B)t} - e^{c_{A+B}t}}{c_A + c_B - c_{A+B}}. \]
Hence, for c_A, c_B ≤ 0, the splitting error remains relatively small even for large t. (e^{c_{A+B}t} is an intrinsic error.)
So far we have made it easy for ourselves by assuming zero boundary
conditions. We now consider the splitting of inhomogeneous systems where
the boundary conditions are also allowed to vary over time. In general, the
linear ODE system is of the form

\[ u'(t) = (A_x + A_y)u(t) + b(t), \qquad u(0) = u_0, \tag{8.23} \]

where b originates in the boundary conditions and also possibly a forcing


term in the original PDE. We have seen that the solution to the homogeneous
system (where b(t) = 0) is

u(t) = et(Ax +Ay ) u0 .

We assume that the solution of the inhomogeneous system is of the form

u(t) = et(Ax +Ay ) c(t).

Inserting this into the inhomogeneous differential equation, we obtain
\[ (A_x + A_y) e^{t(A_x+A_y)} c(t) + e^{t(A_x+A_y)} c'(t) = (A_x + A_y) e^{t(A_x+A_y)} c(t) + b(t) \]
and thus
\[ c'(t) = e^{-t(A_x+A_y)} b(t) \quad\Rightarrow\quad c(t) = \int_0^t e^{-\tau(A_x+A_y)} b(\tau)\, d\tau + c_0. \]

Using the initial condition u(0) = u_0, the exact solution of (8.23) is provided by
\[ u(t) = e^{t(A_x+A_y)} \Bigl( u_0 + \int_0^t e^{-\tau(A_x+A_y)} b(\tau)\, d\tau \Bigr) = e^{t(A_x+A_y)} u_0 + \int_0^t e^{(t-\tau)(A_x+A_y)} b(\tau)\, d\tau, \qquad t \ge 0. \]

This technique of deriving the solution to an inhomogeneous differential equa-


tion is called variation of constants.
In particular, for each time step n = 0, 1, . . .
\[ u((n+1)\Delta t) = e^{\Delta t(A_x+A_y)} u(n\Delta t) + \int_{n\Delta t}^{(n+1)\Delta t} e^{((n+1)\Delta t - \tau)(A_x+A_y)} b(\tau)\, d\tau. \]

Often, we can evaluate the integral explicitly; for example, when b(t) is a linear
combination of exponential and polynomial terms. If b is constant, then
 
\[ u((n+1)\Delta t) = e^{\Delta t(A_x+A_y)} u(n\Delta t) + (A_x + A_y)^{-1}\bigl(e^{\Delta t(A_x+A_y)} - I\bigr) b. \]

However, this observation does not get us any further, since, even if we split
the exponential, an equivalent technique to split (Ax + Ay )−1 does not exist.
The solution is not to compute the integral explicitly but to use quadrature
rules instead. One of those is the trapezium rule given by
\[ \int_0^h g(\tau)\, d\tau = \tfrac{1}{2} h\,[g(0) + g(h)] + O(h^3). \]

This gives
\[ u((n+1)\Delta t) = e^{\Delta t(A_x+A_y)} u(n\Delta t) + \tfrac{1}{2}\Delta t \Bigl[ e^{\Delta t(A_x+A_y)} b(n\Delta t) + b((n+1)\Delta t) \Bigr] + O((\Delta t)^3). \]
We then gather the exponentials together and replace them by their splittings
and use an approximation r to the exponential. For example, using Strang’s
splitting results in
 
\[ u^{n+1} = r\bigl(\tfrac{1}{2}\Delta t A_x\bigr)\, r(\Delta t A_y)\, r\bigl(\tfrac{1}{2}\Delta t A_x\bigr) \Bigl( u^n + \tfrac{1}{2}\Delta t\, b^n \Bigr) + \tfrac{1}{2}\Delta t\, b^{n+1}. \]

Again, we have tridiagonal systems which are inexpensive to solve.
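As an illustration, a minimal MATLAB sketch of one such time step is given below. It is not a listing from the book: it assumes a square grid with zero boundary conditions, builds A_x and A_y with kron (so each is tridiagonal up to reordering), and applies r(z) = (1 + z/2)/(1 − z/2) by solving the corresponding linear systems; the grid size, step sizes and the constant source samples b^n, b^{n+1} are placeholder choices.

M = 20; dx = 1/(M+1); dt = 1e-3;
e = ones(M,1); T = spdiags([e -2*e e], -1:1, M, M);   % 1D second differences
Ax = kron(speye(M), T)/dx^2;          % differences in one space direction
Ay = kron(T, speye(M))/dx^2;          % differences in the other direction
r  = @(B, v) (speye(M^2) - 0.5*B) \ ((speye(M^2) + 0.5*B)*v); % r(B)v, r(z)=(1+z/2)/(1-z/2)
[X, Y] = ndgrid((1:M)*dx, (1:M)*dx);
u = sin(pi*X(:)).*sin(pi*Y(:));       % current solution u^n (assumed data)
bn = ones(M^2,1); bnp1 = ones(M^2,1); % source samples b^n, b^{n+1} (assumed)
u = r(0.5*dt*Ax, r(dt*Ay, r(0.5*dt*Ax, u + 0.5*dt*bn))) + 0.5*dt*bnp1;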


As an example, let's look at the general diffusion equation in two space dimensions
\[ \frac{\partial u}{\partial t} = \nabla^T (a \nabla u) + f = \frac{\partial}{\partial x}(a u_x) + \frac{\partial}{\partial y}(a u_y) + f, \qquad -1 \le x, y \le 1, \]

where a(x, y) > 0 and f (x, y) are given, as are initial conditions on the unit
square and Dirichlet boundary conditions on ∂[0, 1]2 × [0, ∞). Every space
derivative is replaced by a central difference at the midpoint, for example
\[ \frac{\partial}{\partial x} u(x,y) = \frac{1}{\Delta x} \delta_x u(x,y) + O((\Delta x)^2) = \frac{1}{\Delta x} \Bigl[ u\bigl(x + \tfrac{1}{2}\Delta x, y\bigr) - u\bigl(x - \tfrac{1}{2}\Delta x, y\bigr) \Bigr] + O((\Delta x)^2). \]
This yields the ODE system
\[ u'_{l,m} = \frac{1}{(\Delta x)^2} \Bigl[ a_{l-\frac{1}{2},m} u_{l-1,m} + a_{l+\frac{1}{2},m} u_{l+1,m} + a_{l,m-\frac{1}{2}} u_{l,m-1} + a_{l,m+\frac{1}{2}} u_{l,m+1} - \bigl(a_{l-\frac{1}{2},m} + a_{l+\frac{1}{2},m} + a_{l,m-\frac{1}{2}} + a_{l,m+\frac{1}{2}}\bigr) u_{l,m} \Bigr] + f_{l,m}. \]

The resulting matrix A is split in such a way that Ax consists of all the
al± 12 ,m terms, while Ay includes the remaining al,m± 12 terms. Again, if the
grid is ordered by columns, Ay is tridiagonal; if it is ordered by rows, Ax
is tridiagonal. The vector b consists of fl,m and incorporates the boundary
conditions.
What we have looked at so far is known as dimensional splitting. In ad-
dition there also exists operational splitting, to resolve non-linearities. As an
example, consider the reaction–diffusion equation in one space dimension
\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \alpha u (1 - u). \]
For simplicity we assume zero boundary conditions at x = 0 and x = 1.
Discretizing in space, we arrive at
\[ u'_m = \frac{1}{(\Delta x)^2} (u_{m-1} - 2u_m + u_{m+1}) + \alpha u_m (1 - u_m). \]
We separate the diffusion from the reaction part by keeping one part constant
and advancing the other by half a time step. We add the superscript n to the
part which is kept constant. In particular we advance by ½∆t solving
\[ u'_m = \frac{1}{(\Delta x)^2} (u_{m-1} - 2u_m + u_{m+1}) + \alpha u^n_m (1 - u^n_m), \]
i.e., keeping the reaction part constant. This can be done, for example, by
Crank–Nicolson. Then we advance another half time step solving
\[ u'_m = \frac{1}{(\Delta x)^2} \bigl(u^{n+\frac{1}{2}}_{m-1} - 2u^{n+\frac{1}{2}}_m + u^{n+\frac{1}{2}}_{m+1}\bigr) + \alpha u_m (1 - u_m), \]
this time keeping the diffusion part constant. The second ODE is a linear
Riccati equation, i.e., the right-hand side is a quadratic in um which can be
solved explicitly (see for example [21] D. Zwillinger, Handbook of Differential
Equations).
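A rough MATLAB sketch of one full time step of this operational splitting is given below. The grid, the value of α and the initial data are assumed for illustration; Crank–Nicolson is used for the first half step, and MATLAB's ode45 is used only as a building block to advance the Riccati-type second half step.

M = 99; dx = 1/(M+1); x = (1:M)'*dx;         % interior grid points
alpha = 5; dt = 1e-3;
e = ones(M,1);
L = spdiags([e -2*e e], -1:1, M, M)/dx^2;    % discrete diffusion operator
u = sin(pi*x);                               % assumed initial data

% first half step: diffusion active, reaction frozen at u^n
r_frozen = alpha*u.*(1-u);
I = speye(M);
u_half = (I - 0.25*dt*L) \ ((I + 0.25*dt*L)*u + 0.5*dt*r_frozen);  % CN over dt/2

% second half step: reaction active, diffusion frozen at u^{n+1/2}
d_frozen = L*u_half;
rhs = @(t, v) d_frozen + alpha*v.*(1-v);     % Riccati-type right-hand side
[~, V] = ode45(rhs, [0 0.5*dt], u_half);
u_new = V(end,:)';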

8.5 Hyperbolic PDEs


Hyperbolic PDEs are qualitatively different from elliptic and parabolic PDEs.
A perturbation of the initial or boundary data of an elliptic or parabolic PDE
changes the solution at all points in the domain instantly. Solutions of hyper-
bolic PDEs are, however, wave-like. If the initial data of a hyperbolic PDE is
disturbed, the effect is not felt at once at every point in the domain. Distur-
bances travel along the characteristics of the equation with finite propagation
speed. Good numerical methods mimic this behaviour. Hyperbolic PDEs have
been studied extensively, since they are part of many engineering and scien-
tific areas. As an example see [14] R. J. LeVeque, Finite Volume Methods for
Hyperbolic Problems. We have already studied how to construct algorithms
based on finite differences and the order of error these carry, as well as stabil-
ity. We consider these principles examining the advection and wave equation
as examples.

8.5.1 Advection Equation


A useful example of hyperbolic PDEs is the advection equation
\[ \frac{\partial}{\partial t} u(x,t) = -c \frac{\partial}{\partial x} u(x,t), \]
with initial condition u(x, 0) = φ(x), where φ(x) has finite support, that is,
it is zero outside a finite interval. By the chain rule an exact solution of the
advection equation is given by

u(x, t) = φ(x − ct).

As time passes, the initial condition retains its shape while shifting with ve-
locity c to the right or left depending on the sign of c (to the right for positive
c, to the left for negative c). This has been likened to a wind blowing from left
to right or vice versa. For simplicity let c = −1, which gives ut (x, t) = ux (x, t),
and let the support of φ lie in [0, 1]. We restrict ourselves to the interval [0, 1]
by imposing the boundary condition u(1, t) = φ(t + 1).
Let ∆x = 1/(M+1). We start by semidiscretizing the right-hand side by the sum of the forward and backward difference
\[ \frac{\partial}{\partial t} u_m(t) = \frac{1}{2\Delta x} [u_{m+1}(t) - u_{m-1}(t)] + O((\Delta x)^2). \tag{8.24} \]
Solving the resulting ODE u'_m(t) = (2∆x)^{-1}[u_{m+1}(t) − u_{m-1}(t)] by forward Euler results in
\[ u^{n+1}_m = u^n_m + \tfrac{1}{2}\mu (u^n_{m+1} - u^n_{m-1}), \qquad m = 1, \ldots, M, \quad n \ge 0, \]
where µ = ∆t/∆x is the Courant number. The overall local error is O((∆t)² + ∆t(∆x)²). In matrix form this is u^{n+1} = A u^n where
\[ A = \begin{pmatrix} 1 & \tfrac{1}{2}\mu & & \\ -\tfrac{1}{2}\mu & 1 & \ddots & \\ & \ddots & \ddots & \tfrac{1}{2}\mu \\ & & -\tfrac{1}{2}\mu & 1 \end{pmatrix}. \]
Now the matrix A is tridiagonal and Toeplitz. However, it is not symmetric, but skew-symmetric. Similar to TST matrices, the eigenvalues and eigenvectors of
\[ \begin{pmatrix} \alpha & \beta & & \\ -\beta & \alpha & \ddots & \\ & \ddots & \ddots & \beta \\ & & -\beta & \alpha \end{pmatrix} \]
are given by λ_k = α + 2iβ cos(kπ/(M+1)), with corresponding eigenvector v_k, which has as j-th component i^j sin(jkπ/(M+1)), j = 1, . . . , M. So for A, λ_k = 1 + iµ cos(kπ/(M+1)) and |λ_k|² = 1 + µ² cos²(kπ/(M+1)) > 1. Hence we have instability for any µ.
It is, however, sufficient to have a local error of O(∆x) when discretizing
in space, since it is multiplied by ∆t, which is then O((∆x)2 ) for a fixed µ.
Thus if we discretize in space by the forward difference
\[ \frac{\partial}{\partial t} u_m(t) = \frac{1}{\Delta x} [u_{m+1}(t) - u_m(t)] + O(\Delta x) \]
and solve the resulting ODE again by the forward Euler method, we arrive at
\[ u^{n+1}_m = u^n_m + \mu (u^n_{m+1} - u^n_m), \qquad m = 1, \ldots, M, \quad n \ge 0. \tag{8.25} \]

This method is known as the upwind method . It takes its name because we
are taking additional information from the point m + 1 which is against the
wind, which is blowing from right to left since c is negative. It makes logical
sense to take information from the direction the wind is blowing from. This
implies that the method has to be adjusted for positive c to use unm−1 instead
of unm+1 . It also explains the instability of the first scheme we constructed,
since there information was taken from both sides of um in the form of um−1
and um+1 .
In matrix form, the upwind method is u^{n+1} = A u^n where
\[ A = \begin{pmatrix} 1-\mu & \mu & & \\ & 1-\mu & \ddots & \\ & & \ddots & \mu \\ & & & 1-\mu \end{pmatrix}. \]

Now the matrix A is not normal and thus its 2-norm is not equal to its spectral radius, but equal to the square root of the spectral radius of AAᵀ. Now
\[ AA^T = \begin{pmatrix} (1-\mu)^2 + \mu^2 & \mu(1-\mu) & & \\ \mu(1-\mu) & (1-\mu)^2 + \mu^2 & \ddots & \\ & \ddots & \ddots & \mu(1-\mu) \\ & & \mu(1-\mu) & (1-\mu)^2 \end{pmatrix}, \]
which is not TST, since the entry in the bottom right corner differs. The eigenvalues can be calculated by solving a three-term recurrence relation (see for example [18]). However, defining ‖u^n‖_∞ = max_m |u^n_m|, it follows from (8.25) that
\[ \|u^{n+1}\|_\infty = \max_m |u^{n+1}_m| \le \max_m \{ |1-\mu|\,|u^n_m| + \mu |u^n_{m+1}| \} \le (|1-\mu| + \mu) \|u^n\|_\infty. \]

As long as µ ∈ (0, 1] we have ‖u^{n+1}‖_∞ ≤ ‖u^n‖_∞ ≤ · · · ≤ ‖u^0‖_∞ and hence stability.
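A minimal MATLAB sketch of the upwind method (8.25) for c = −1 might look as follows; the bump-shaped initial condition, grid size and Courant number are assumed choices, and the boundary value at x = 1 is taken as zero since φ(t + 1) vanishes for t > 0.

M = 199; dx = 1/(M+1); mu = 0.8; dt = mu*dx;
x = (1:M)'*dx;
phi = @(x) exp(-200*(x-0.5).^2);   % smooth bump, essentially supported in [0,1]
u = phi(x);
for n = 1:round(0.4/dt)
    uR = [u(2:end); 0];            % value at the point to the right; zero at x = 1
    u = u + mu*(uR - u);           % upwind step (8.25)
end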
If we keep with our initial space discretization as in (8.24), but now solve
the resulting ODE with the second order mid-point rule
yn+1 = yn−1 + 2∆tf (tn , yn ),
the outcome is the two-step leapfrog method

\[ u^{n+1}_m = \mu (u^n_{m+1} - u^n_{m-1}) + u^{n-1}_m. \]

The local truncation error is O((∆t)3 + ∆t(∆x)2 ) = O((∆x)3 ), because ∆t =


µ∆x. Since it is a two-step method another method has to be chosen to
compute the first time step.
We analyze the stability by the Fourier technique. Since it is a two-step
method, we have
\[ \hat u^{n+1}(\theta) = \mu (e^{i\theta} - e^{-i\theta}) \hat u^n(\theta) + \hat u^{n-1}(\theta) = 2i\mu \sin\theta\, \hat u^n(\theta) + \hat u^{n-1}(\theta). \]

Thus we are looking for solutions to the recurrence relation


ûn+1 (θ) − 2iµ sin θûn (θ) − ûn−1 (θ) = 0.
Generally for a recurrence relation given by axn+1 + bxn + cxn−1 = 0, n ≥ 1,
we let x± be the roots of the characteristic equation ax2 + bx + c = 0. For
x− 6= x+ the general solution is xn = αxn+ +βxn− , where α and β are constants
derived from the initial values x0 and x1 . For x− = x+ , the solution is xn =
(α + βn)xn+ , where again α and β are constants derived from the initial values
x_0 and x_1. In our case
\[ \hat u_\pm(\theta) = \frac{1}{2}\Bigl( 2i\mu\sin\theta \pm \sqrt{4i^2\mu^2\sin^2\theta - 4\cdot(-1)} \Bigr) = i\mu\sin\theta \pm \sqrt{1 - \mu^2\sin^2\theta}. \]

We have stability if |û_±(θ)| ≤ 1 for all θ ∈ [−π, π] and we do not have a double root on the unit circle. For µ > 1 the square root is imaginary at θ = ±π/2 and then
\[ |\hat u_+(\pi/2)| = \mu + \sqrt{\mu^2 - 1} > 1 \quad\text{and}\quad |\hat u_-(-\pi/2)| = |-\mu - \sqrt{\mu^2 - 1}| > 1. \]
For µ = 1 we have a double root for both θ = π/2 and θ = −π/2, since the square root vanishes. In this case û_±(±π/2) = ±i, which lies on the unit circle. Thus we have instability for µ ≥ 1. For |µ| < 1 we have stability, because in this case
\[ |\hat u_\pm(\theta)|^2 = \mu^2\sin^2\theta + 1 - \mu^2\sin^2\theta = 1. \]
The leapfrog method is a good example of how instability can be introduced from the boundary. Calculating u^{n+1}_m for m = 0 we are lacking the value u^n_{-1}. Setting u^n_{-1} = 0 introduces instability which propagates inwards. However, stability can be recovered by letting u^{n+1}_0 = u^n_1.
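The following MATLAB sketch (assumed data and parameters, not a listing from the book) implements the leapfrog method with the first step computed by the upwind method and the boundary recipe u_0^{n+1} = u_1^n:

M = 199; dx = 1/(M+1); mu = 0.5; dt = mu*dx;
x = (0:M)'*dx;                              % store m = 0,...,M; value at x = 1 is zero for t > 0
u0 = exp(-200*(x-0.5).^2);                  % assumed initial bump
u1 = u0 + mu*([u0(2:end); 0] - u0);         % first step by the upwind method
uprev = u0; u = u1;
for n = 1:round(0.3/dt)
    uR = [u(2:end); 0]; uL = [0; u(1:end-1)];
    unew = mu*(uR - uL) + uprev;            % leapfrog step
    unew(1) = u(2);                         % boundary recipe: u_0^{n+1} = u_1^n
    uprev = u; u = unew;
end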

8.5.2 The Wave Equation


Once we have seen solutions for the advection equation, it is easy to derive
solutions for the wave equation
\[ \frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2} \]
in an appropriate domain of R × R+ with boundary conditions and initial
conditions for u and ∂u/∂t, since they are fundamentally linked. Specifically, let
(v, w) be solutions to the system of advection equations given by
\[ \frac{\partial v}{\partial t} = \frac{\partial w}{\partial x}, \qquad \frac{\partial w}{\partial t} = \frac{\partial v}{\partial x}. \]
Then
\[ \frac{\partial^2 v}{\partial t^2} = \frac{\partial^2 w}{\partial t \partial x} = \frac{\partial^2 w}{\partial x \partial t} = \frac{\partial^2 v}{\partial x^2}. \]
Then imposing the correct initial and boundary conditions on v and w, we
have u = v.
More generally speaking, once we have a method for the advection equa-
tion, this can be generalized to the system of advection equations
\[ \frac{\partial}{\partial t} u = A \frac{\partial}{\partial x} u, \]
where all the eigenvalues of A are real and nonzero to ensure the system of
equations is hyperbolic. For the wave equation, A is given by
 
\[ A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \]

If the original method for the advection equation is stable for all µ ∈ [a, b]
where a < 0 < b, then the method for the system of advection equations
is stable as long as a ≤ λµ ≤ b for all eigenvalues λ of A. Again for the
wave equation the eigenvalues are ±1 with corresponding eigenvectors (1, 1)T
and (1, −1)T . Thus using the upwind method (8.25) (for which we have the
condition |µ| ≤ 1) we calculate v^n_m and w^n_m according to
\[ v^{n+1}_m = v^n_m + \mu (w^n_{m+1} - w^n_m), \qquad w^{n+1}_m = w^n_m + \mu (v^n_{m+1} - v^n_m). \]
Eliminating the w^n_m and letting u^n_m = v^n_m, we obtain
\[ u^{n+1}_m - 2u^n_m + u^{n-1}_m = \mu^2 (u^n_{m+1} - 2u^n_m + u^n_{m-1}), \]

which is the leapfrog scheme. Note that we could also have obtained the
method by using the usual finite difference approximating the second deriva-
tive.
Since this is intrinsically a two-step method in the time direction, we need
to calculate u1m . One possibility is to use the forward Euler method and let
u1m = u(m∆x, 0) + ∆tut (m∆x, 0) where both terms on the right-hand side are
given by the initial conditions. This carries an error of O((∆t)2 ). However,
considering the Taylor expansion,

u(m∆x, ∆t) = u(m∆x, 0) + ∆tut (m∆x, 0)


1
+ (∆t)2 utt (m∆x, 0) + O((∆t)3 )
2
= u(m∆x, 0) + ∆tut (m∆x, 0)
1
+ (∆t)2 uxx (m∆x, 0) + O((∆t)3 )
2
= u(m∆x, 0) + ∆tut (m∆x, 0)
1
+ µ(u((m − 1)∆x, 0) − 2u(m∆x, 0) + u((m + 1)∆x, 0))
2
+O((∆t)2 (∆x)2 + (∆t)3 ).

We see that approximating according to the last line has better accuracy.
The Fourier stability analysis of the leapfrog method for the Cauchy problem yields
\[ \hat u^{n+1}(\theta) - 2\hat u^n(\theta) + \hat u^{n-1}(\theta) = \mu^2 (e^{i\theta} - 2 + e^{-i\theta}) \hat u^n(\theta) = -4\mu^2 \sin^2\tfrac{\theta}{2}\, \hat u^n(\theta). \]
This recurrence relation has the characteristic equation
\[ x^2 - 2\bigl(1 - 2\mu^2\sin^2\tfrac{\theta}{2}\bigr) x + 1 = 0 \]
with roots x_± = (1 − 2µ² sin²(θ/2)) ± \sqrt{(1 − 2µ² sin²(θ/2))² − 1}. The product of the roots is 1. For stability we require the moduli of both roots to be less than or equal to 1, and any root of modulus 1 has to be a simple root. Thus the roots must be a complex conjugate pair and this leads to the inequality
\[ \bigl(1 - 2\mu^2\sin^2\tfrac{\theta}{2}\bigr)^2 \le 1. \]
This condition is fulfilled if and only if µ² = (∆t/∆x)² ≤ 1.
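A short MATLAB sketch of this leapfrog scheme with the improved first step is given below; the initial data, the boundary treatment (zero values outside the grid) and the Courant number are assumed choices.

M = 199; dx = 1/(M+1); mu = 0.9; dt = mu*dx;      % mu = dt/dx
x = (1:M)'*dx;
u0 = sin(pi*x); v0 = zeros(M,1);                  % u(x,0) and u_t(x,0) (assumed)
lap = @(w) [w(2:end); 0] - 2*w + [0; w(1:end-1)]; % second differences (times dx^2)
u1 = u0 + dt*v0 + 0.5*mu^2*lap(u0);               % improved first step
uprev = u0; u = u1;
for n = 1:round(1/dt)
    unew = 2*u - uprev + mu^2*lap(u);             % leapfrog step
    uprev = u; u = unew;
end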

8.6 Spectral Methods


Let f be a function on the interval [−1, 1]. Its Fourier series is given by
\[ f(x) = \sum_{n=-\infty}^{\infty} \hat f_n e^{i\pi n x}, \qquad x \in [-1, 1], \]

where the Fourier coefficients \hat f_n are given by
\[ \hat f_n = \frac{1}{2} \int_{-1}^{1} f(\tau) e^{-i\pi n \tau}\, d\tau, \qquad n \in \mathbb{Z}. \]
Letting θ = πτ, which implies dτ = dθ/π, the Fourier coefficients can also be calculated by the following formula
\[ \hat f_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} f\Bigl(\frac{\theta}{\pi}\Bigr) e^{-in\theta}\, d\theta, \qquad n \in \mathbb{Z}. \]

This formula is also widely used.


As examples for Fourier series take cos(πx) and sin(πx). Both have only two non-zero coefficients in their Fourier expansion:
\[ \cos(\pi x) = \frac{1}{2}\bigl( e^{i\pi x} + e^{-i\pi x} \bigr), \qquad \sin(\pi x) = \frac{1}{2i}\bigl( e^{i\pi x} - e^{-i\pi x} \bigr). \]
We define the N-point truncated Fourier approximation φ_N by
\[ \varphi_N(x) = \sum_{n=-N/2+1}^{N/2} \hat f_n e^{i\pi n x}, \qquad x \in [-1, 1], \]
where here and elsewhere in this section N ≥ 2 is an even integer.


Theorem 8.4 (The de la Vallée Poussin theorem). If the function f is Riemann integrable and the coefficients \hat f_n are of order O(n⁻¹) for large n, then φ_N(x) = f(x) + O(N⁻¹) as N → ∞ for every point x in the open interval (−1, 1), where f is Lipschitz.

Figure 8.1  Gibbs effect when approximating the line y = x for the choices N = 4, 8, 16, 32.

Note that the above theorem explicitly excludes the endpoints of the in-
terval. This is due to the Gibbs phenomenon. Figure 8.1 illustrates this. The
Gibbs effect involves both the fact that Fourier sums overshoot at a discon-
tinuity, and that this overshoot does not die out as the frequency increases.
With increasing N the point where that overshoot happens moves closer and
closer to the discontinuity. So once the overshoot has passed by a particular
x, convergence at the value of x is possible. However, convergence at the end-
points −1 and 1 cannot be guaranteed. It is possible to show (as a consequence
of the Dirichlet–Jordan theorem) that
\[ \varphi_N(\pm 1) \to \tfrac{1}{2}\,[f(-1) + f(1)] \quad\text{as}\quad N \to \infty \]
and hence there is no convergence unless f is periodic.
For proofs and more in-depth analysis, see [13] T. W. Körner, Fourier
Analysis.
Theorem 8.5. Let f be an analytic function in [−1, 1], which can be extended analytically to a complex domain Ω and which is periodic with period 2, i.e., f^{(m)}(1) = f^{(m)}(−1) for all m = 1, 2, . . .. Then the Fourier coefficients \hat f_n are O(n^{−m}) for any m = 1, 2, . . .. Moreover, the Fourier approximation φ_N is of infinite order, that is, |φ_N − f| = O(N^{−p}) for any p = 1, 2, . . ..
Proof. We only sketch the proof. Using integration by parts we can deduce
\[ \hat f_n = \frac{1}{2}\int_{-1}^{1} f(\tau) e^{-i\pi n\tau}\, d\tau = \frac{1}{2}\Bigl[ f(\tau)\frac{e^{-i\pi n\tau}}{-i\pi n} \Bigr]_{-1}^{1} - \frac{1}{2}\int_{-1}^{1} f'(\tau)\frac{e^{-i\pi n\tau}}{-i\pi n}\, d\tau. \]
The first term vanishes, since f(−1) = f(1) and e^{±iπn} = cos nπ ± i sin nπ = cos nπ. Thus
\[ \hat f_n = \frac{1}{\pi i n} \widehat{f'}_n. \]
Using f^{(m)}(1) = f^{(m)}(−1) for all m = 1, 2, . . . and multiple integration by parts gives
\[ \hat f_n = \frac{1}{\pi i n} \widehat{f'}_n = \Bigl(\frac{1}{\pi i n}\Bigr)^2 \widehat{f''}_n = \Bigl(\frac{1}{\pi i n}\Bigr)^3 \widehat{f'''}_n = \ldots. \]
Hence
\[ \hat f_n = \Bigl(\frac{1}{\pi i n}\Bigr)^m \widehat{f^{(m)}}_n, \qquad m = 1, 2, \ldots. \]

Now using Cauchy’s theorem of complex analysis |fd(m) | can be bounded by


n
m
cm!α for some constants c, α > 0. Then

N/2 ∞
X X
ˆ iπnx ˆ iπnx

|φN (x) − f (x)| =
fn e − fn e
n=−N/2+1 n=−∞
−N/2 ∞
X X
≤ |fˆn | + |fˆn |
−∞ N/2+1
−N/2 ∞
X |fd (m) | X |fd(m) |
n n
= m
+ m
−∞
(−πn) (πn)
 N/2+1 
 α m ∞
1 X 1 
≤ cm!  +2
π (N/2)m nm
" n=N/2+1 #
 α m Z ∞
1 dτ
≤ cm! +2
π (N/2)m N/2+1 τ
m
 α m  
1 2
= cm! +
π (N/2)m (m − 1)(N/2 + 1)m−1
−m+1
≤ Cm N .

Definition 8.12 (Convergence at spectral speed). An N -point approximation


φN of a function f converges to f at spectral speed if |φN −f | decays pointwise
faster than O(N −p ) for any p = 1, 2, . . ..
The fast convergence of Fourier approximations rests on two properties of
the underlying function: analyticity and periodicity. If one is not satisfied the
rate of convergence in general drops to polynomial. In general, the speed of
convergence of the truncated Fourier series of a function f depends on the
smoothness of the function. In fact, the smoother the function the faster the convergence, i.e., for f ∈ C^p(−1, 1) (i.e., the derivatives up to order p exist and are continuous), we obtain an O(N^{−p}) order of convergence.
We now consider the heat equation ut = uxx on the interval [−1, 1] with
given initial condition u(x, 0) = g(x), periodic boundary conditions, and a
normalization condition, i.e.,

\[ u(-1, t) = u(1, t), \qquad u_x(-1, t) = u_x(1, t), \qquad t \ge 0, \]
\[ \int_{-1}^{1} u(x, t)\, dx = 0, \qquad t \ge 0. \]

We approximate u by its N-th order Fourier expansion in x
\[ u(x, t) \approx \sum_{n=-N/2+1}^{N/2} \hat u_n(t) e^{i\pi n x}. \]

Differentiating once with respect to t and on the other hand differentiating twice with respect to x gives
\[ u_t \approx \sum_{n=-N/2+1}^{N/2} \hat u_n'(t) e^{i\pi n x}, \qquad u_{xx} \approx \sum_{n=-N/2+1}^{N/2} \hat u_n(t) (i\pi n)^2 e^{i\pi n x}. \]

Equating yields for each coefficient \hat u_n the ODE
\[ \hat u_n'(t) = -\pi^2 n^2 \hat u_n(t), \qquad n = -N/2+1, \ldots, N/2. \]


This has the exact solution \hat u_n(t) = c_n e^{-\pi^2 n^2 t} for n ≠ 0. The constant coefficients c_n are given by the initial condition, since
\[ g(x) = u(x, 0) \approx \sum_{n=-N/2+1}^{N/2} \hat u_n(0) e^{i\pi n x} = \sum_{n=-N/2+1}^{N/2} c_n e^{i\pi n x}. \]

Approximating g by its N th order Fourier expansion, we see that cn = ĝn is


the appropriate choice.

For n = 0 the ODE simplifies to \hat u_0'(t) = 0 and thus \hat u_0(t) = c_0. The constant c_0 is determined by the normalization condition
\[ \int_{-1}^{1} \sum_{n=-N/2+1}^{N/2} c_n e^{-\pi^2 n^2 t} e^{i\pi n x}\, dx = \sum_{n=-N/2+1}^{N/2} c_n e^{-\pi^2 n^2 t} \int_{-1}^{1} e^{i\pi n x}\, dx = 2c_0, \]
since the integral is zero for n ≠ 0. Thus \hat u_0(t) = 0.
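The following MATLAB sketch illustrates the procedure; the initial condition g is an assumed periodic, zero-mean example, its coefficients are approximated by applying the trapezoidal rule to the defining integral, and the truncated expansion is then evaluated at a later time.

N = 32;                                    % even number of modes
xq = linspace(-1, 1, 401);                 % quadrature grid for the coefficients
g  = @(x) sin(pi*x) + 0.5*sin(2*pi*x);     % assumed periodic initial condition
n  = (-N/2+1:N/2)';
ghat = zeros(size(n));
for k = 1:length(n)                        % \hat g_n = (1/2) int_{-1}^{1} g(x) e^{-i pi n x} dx
    ghat(k) = 0.5*trapz(xq, g(xq).*exp(-1i*pi*n(k)*xq));
end
t = 0.1; x = linspace(-1, 1, 201);
uhat = ghat.*exp(-pi^2*n.^2*t);            % \hat u_n(t) = \hat g_n e^{-pi^2 n^2 t}
u = real(exp(1i*pi*x'*n')*uhat);           % evaluate the truncated expansion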


For more general problems we need to examine the algebra of Fourier expansions. The set of all analytic functions f : [−1, 1] → C, which are periodic with period 2 and which can be extended analytically into the complex plane, form a linear space. The Fourier series of sums and scalar products of such functions are well-defined. In particular for
\[ f(x) = \sum_{n=-\infty}^{\infty} \hat f_n e^{i\pi n x}, \qquad g(x) = \sum_{n=-\infty}^{\infty} \hat g_n e^{i\pi n x} \]
we have
\[ f(x) + g(x) = \sum_{n=-\infty}^{\infty} (\hat f_n + \hat g_n) e^{i\pi n x}, \qquad a f(x) = \sum_{n=-\infty}^{\infty} a \hat f_n e^{i\pi n x}. \]
Moreover,
\[ f(x) g(x) = \sum_{n=-\infty}^{\infty} \Bigl( \sum_{m=-\infty}^{\infty} \hat f_{n-m} \hat g_m \Bigr) e^{i\pi n x} = \sum_{n=-\infty}^{\infty} \bigl( \{\hat f_n\} * \{\hat g_n\} \bigr) e^{i\pi n x}, \]
where ∗ denotes the convolution operator acting on sequences. In addition, the derivative of f is given by
\[ f'(x) = i\pi \sum_{n=-\infty}^{\infty} n \hat f_n e^{i\pi n x}. \]

Since {fˆn } decays faster than O(n−m ) for all m ∈ Z+ , it follows that all
derivatives of f have rapidly convergent Fourier expansions.
We now have the tools at our disposal to solve the heat equation with non-constant coefficient α(x). In particular,
\[ \frac{\partial u(x,t)}{\partial t} = \frac{\partial}{\partial x}\Bigl( \alpha(x) \frac{\partial u(x,t)}{\partial x} \Bigr) \]
for −1 ≤ x ≤ 1, t ≥ 0 and given initial conditions u(x, 0). Letting
\[ u(x, t) = \sum_{n=-\infty}^{\infty} \hat u_n(t) e^{i\pi n x} \qquad\text{and}\qquad \alpha(x) = \sum_{n=-\infty}^{\infty} \hat\alpha_n e^{i\pi n x} \]
we have
\[ \frac{\partial u(x,t)}{\partial x} = \sum_{n=-\infty}^{\infty} \hat u_n(t)\, i\pi n\, e^{i\pi n x} \]
and by convolution,
\[ \alpha(x) \frac{\partial u(x,t)}{\partial x} = \sum_{n=-\infty}^{\infty} \Bigl( \sum_{m=-\infty}^{\infty} \hat\alpha_{n-m}\, i\pi m\, \hat u_m(t) \Bigr) e^{i\pi n x}. \]
Differentiating again and truncating, we deduce the following system of ODEs for the coefficients \hat u_n
\[ \hat u_n'(t) = -\pi^2 \sum_{m=-N/2+1}^{N/2} n m\, \hat\alpha_{n-m} \hat u_m(t), \qquad n = -N/2+1, \ldots, N/2. \]

We can now proceed to discretize in time by applying a suitable ODE solver. For example, the forward Euler method results in
\[ \hat u_n^{k+1} = \hat u_n^k - \Delta t\, \pi^2 \sum_{m=-N/2+1}^{N/2} n m\, \hat\alpha_{n-m} \hat u_m^k, \]
or in vector form,
\[ \hat u^{k+1} = (I + \Delta t \hat A)\, \hat u^k, \]
where \hat A has elements \hat A_{nm} = −π² n m \hat\alpha_{n−m}, m, n = −N/2+1, . . . , N/2. Note that every row and every column in \hat A has a common factor. If α(x) is constant, i.e., α(x) ≡ α₀, then \hat\alpha_n = 0 for all n ≠ 0 and \hat A is the diagonal matrix
\[ -\pi^2\, \mathrm{diag}\bigl( (-N/2+1)^2\alpha_0, \ldots, \alpha_0, 0, \alpha_0, \ldots, (N/2)^2\alpha_0 \bigr). \]

It is easy to see that the eigenvalues of (I + ∆tÂ) are given by 1, 1 −


∆tπ 2 α0 , . . . , 1 − ∆tπ 2 (N/2 − 1)2 α0 , 1 − ∆tπ 2 (N/2)2 α0 . Thus for the method
to be numerically stable, ∆t must scale like N −2 .
Speaking more generally, the maximum eigenvalue of the matrix approxi-
mating the k th derivative is (N/2)k . This means that ∆t in spectral approxi-
mations for linear PDEs with constant coefficients must scale like N −k , where
k is the maximal order of differentiation.
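As an illustration, the sketch below (assumed coefficient choice a(x) = 1 + cos(πx)/2, assumed initial coefficients and step size; not the book's listing) assembles Â and performs forward Euler steps in coefficient space; note how ∆t is scaled like N^{-2}.

N = 16; idx = (-N/2+1:N/2);
ahat = @(k) (k==0) + 0.25*(abs(k)==1);      % Fourier coefficients of 1 + cos(pi x)/2
A = zeros(N);
for p = 1:N
    for q = 1:N
        A(p,q) = -pi^2*idx(p)*idx(q)*ahat(idx(p)-idx(q));
    end
end
dt = 0.5/(pi^2*(N/2)^2);                    % scale dt like N^{-2} for stability
uhat = double(idx' == 1);                   % start from a single mode (assumed)
for k = 1:100
    uhat = uhat + dt*(A*uhat);              % forward Euler in coefficient space
end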
In the analysis of the spectral speed of convergence we have seen that analyticity and f^{(m)}(1) = f^{(m)}(−1) for all m = 1, 2, . . . are crucial. What to do, however, if the latter does not hold? We can force values at the endpoints to be equal. Consider the function
\[ g(x) = f(x) - \tfrac{1}{2}(1-x) f(-1) - \tfrac{1}{2}(1+x) f(1), \]
which satisfies g(±1) = 0. Now the Fourier coefficients ĝ_n are O(n^{-1}). According to the de la Vallée Poussin theorem, the rate of convergence of the N-terms truncated Fourier expansion of g is hence O(N^{-1}). This idea can be iterated. By letting
\[ h(x) = g(x) + a(1-x)(1+x) + b(1-x)(1+x)^2, \]
which already satisfies h(±1) = g(±1) = 0, and choosing a and b appropriately, we achieve h'(±1) = 0. Here the Fourier coefficients ĥ_n are O(n^{-2}).
However, the values of the derivatives at the boundaries need to be known
and with every step the degree of the polynomial which needs to be added to
achieve zero boundary conditions increases by 2.
Another possibility to deal with the lack of periodicity is the use of
Chebyshev polynomials (of the first kind), which are defined by Tn (x) =
cos(n arccos x), n ≥ 0. Each Tn is a polynomial of degree n, i.e.,
T0 (x) = 1, T1 (x) = x, T2 (x) = 2x2 − 1, T3 (x) = 4x3 − 3x, ...
The sequence Tn obeys the three-term recurrence relation
Tn+1 (x) = 2xTn (x) − Tn−1 (x), n = 1, 2, . . . .
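This recurrence gives a convenient way to evaluate the polynomials; a small MATLAB helper (the name cheb_eval is a hypothetical choice, not from the book) might look as follows.

function T = cheb_eval(n, x)
    % returns T(k+1,:) = T_k(x) for k = 0,...,n and a vector of points x
    x = x(:)';
    T = zeros(n+1, numel(x));
    T(1,:) = 1;                              % T_0(x) = 1
    if n >= 1, T(2,:) = x; end               % T_1(x) = x
    for k = 2:n
        T(k+1,:) = 2*x.*T(k,:) - T(k-1,:);   % three-term recurrence
    end
end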
Moreover, they form a sequence of orthogonal polynomials, which are orthogonal with respect to the weight function (1−x²)^{−1/2} in (−1, 1). More specifically,
\[ \int_{-1}^{1} T_m(x) T_n(x) \frac{dx}{\sqrt{1-x^2}} = \begin{cases} \pi & m = n = 0, \\ \frac{\pi}{2} & m = n \ge 1, \\ 0 & m \ne n. \end{cases} \tag{8.26} \]

This can be proven by letting x = cos θ and using the identity T_n(cos θ) = cos nθ.
Now since the Chebyshev polynomials are mutually orthogonal, a general integrable function f can be expanded in
\[ f(x) = \sum_{n=0}^{\infty} \check f_n T_n(x). \tag{8.27} \]

The formulae for the coefficients can be derived by multiplying (8.27) by T_m(x)(1−x²)^{−1/2} and integrating over (−1, 1). Further using the orthogonality property (8.26) results in the explicit expression for the coefficients
\[ \check f_0 = \frac{1}{\pi}\int_{-1}^{1} f(x) \frac{dx}{\sqrt{1-x^2}}, \qquad \check f_n = \frac{2}{\pi}\int_{-1}^{1} f(x) T_n(x) \frac{dx}{\sqrt{1-x^2}}, \quad n = 1, 2, \ldots. \]

The connection to Fourier expansions can be easily seen by letting x = cos θ,
\[ \int_{-1}^{1} f(x) T_n(x) \frac{dx}{\sqrt{1-x^2}} = \int_0^{\pi} f(\cos\theta) T_n(\cos\theta)\, d\theta = \frac{1}{2}\int_{-\pi}^{\pi} f(\cos\theta) \cos n\theta\, d\theta \]
\[ = \frac{1}{2}\int_{-\pi}^{\pi} f(\cos\theta)\, \frac{1}{2}\bigl(e^{in\theta} + e^{-in\theta}\bigr)\, d\theta = \frac{\pi}{2}\Bigl( \widehat{f(\cos\theta)}_n + \widehat{f(\cos\theta)}_{-n} \Bigr), \]

since the general Fourier transform of a function g defined in the interval [a, b], a < b, and which is periodic with period b − a, is given by the sequence
\[ \hat g_n = \frac{1}{b-a}\int_a^b g(\tau)\, e^{-\frac{2\pi i n \tau}{b-a}}\, d\tau. \]
In particular, letting g(x) = f(cos x) and [a, b] = [−π, π], we have
\[ \hat g_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} g(\tau) e^{-in\tau}\, d\tau \]
and the result follows. Thus we can deduce
\[ \check f_n = \begin{cases} \widehat{f(\cos\theta)}_0, & n = 0, \\ \widehat{f(\cos\theta)}_{-n} + \widehat{f(\cos\theta)}_n, & n = 1, 2, \ldots. \end{cases} \]
(cos θ)n , n = 1, 2, . . . .

Thus for a general integrable function f the computation of its Chebyshev


expansion is equivalent to the Fourier expansion of the function f (cos θ). The
latter is periodic with period 2π. In particular, if f can be extended analyti-
cally, then fˇn decays spectrally fast for large n. So the Chebyshev expansion
inherits the rapid convergence of spectral methods without ever assuming that
f is periodic.
Similar to Fourier expansions, we have an algebra of Chebyshev expansions. The Chebyshev expansions of sums and scalar products are well-defined. In particular for
\[ f(x) = \sum_{n=0}^{\infty} \check f_n T_n(x), \qquad g(x) = \sum_{n=0}^{\infty} \check g_n T_n(x) \]
we have
\[ f(x) + g(x) = \sum_{n=0}^{\infty} (\check f_n + \check g_n) T_n(x), \qquad a f(x) = \sum_{n=0}^{\infty} a \check f_n T_n(x). \]
Moreover, since
\[ T_m(x) T_n(x) = \cos(m \arccos x)\cos(n \arccos x) = \tfrac{1}{2}\bigl[ \cos((m-n)\arccos x) + \cos((m+n)\arccos x) \bigr] = \tfrac{1}{2}\bigl( T_{|m-n|}(x) + T_{m+n}(x) \bigr), \]
we have
\[ f(x) g(x) = \sum_{m=0}^{\infty} \check f_m T_m(x) \sum_{n=0}^{\infty} \check g_n T_n(x) = \frac{1}{2}\sum_{m,n=0}^{\infty} \check f_m \check g_n \bigl[ T_{|m-n|}(x) + T_{m+n}(x) \bigr] = \frac{1}{2}\sum_{m,n=0}^{\infty} \check f_m \bigl( \check g_{m+n} + \check g_{|m-n|} \bigr) T_n(x), \]
where the last equation is due to a change in summation.


Lemma 8.2 (Derivatives of Chebyshev polynomials). The derivatives of Chebyshev polynomials can be explicitly expressed as the linear combinations of Chebyshev polynomials
\[ T_{2n}'(x) = 4n \sum_{l=0}^{n-1} T_{2l+1}(x), \tag{8.28} \]
\[ T_{2n+1}'(x) = (2n+1) T_0(x) + 2(2n+1) \sum_{l=1}^{n} T_{2l}(x). \tag{8.29} \]

Proof. We only prove (8.28), since the proof of (8.29) follows a similar argument. Since T_{2n}(x) = cos(2n arccos x), we have
\[ T_{2n}'(x) = 2n \sin(2n \arccos x) \frac{1}{\sqrt{1-x^2}}. \]
Letting x = cos θ and rearranging, it follows that
\[ \sin\theta\, T_{2n}'(\cos\theta) = 2n \sin(2n\theta). \]
Multiplying the right-hand side of (8.28) by sin θ, we have
\[ 4n \sin\theta \sum_{l=0}^{n-1} T_{2l+1}(\cos\theta) = 4n \sum_{l=0}^{n-1} \cos((2l+1)\theta) \sin\theta = 2n \sum_{l=0}^{n-1} \bigl( \sin((2l+2)\theta) - \sin 2l\theta \bigr) = 2n \sin 2n\theta. \]
This concludes the proof.


We can use this lemma to express the coefficients of the Chebyshev expansion of the derivative in terms of the original Chebyshev coefficients \check f_n,
\[ \check f_n' = \frac{2}{c_n} \sum_{\substack{p = n+1 \\ n+p \text{ odd}}}^{\infty} p\, \check f_p, \tag{8.30} \]
where c_n is 2 for n = 0 and 1 for n ≥ 1. We now have the tools at our


hand to use Chebyshev expansions in a similar way as Fourier expansions.
However, this results in much more complicated relations, since (8.30) shows
that a Chebyshev coefficient of the derivative is linked to infinitely many
original Chebyshev coefficients, while the equivalent relation between Fourier
coefficients is one-to-one.

8.6.1 Spectral Solution to the Poisson Equation


We consider the Poisson equation
∇2 u = f, −1 ≤ x, y ≤ 1,
where f is analytic and obeys periodic boundary conditions
f (−1, y) = f (1, y), −1 ≤ y ≤ 1, f (x, −1) = f (x, 1), −1 ≤ x ≤ 1.
Additionally we impose the following periodic boundary conditions on u
\[ u(-1, y) = u(1, y), \quad u_x(-1, y) = u_x(1, y), \quad -1 \le y \le 1, \]
\[ u(x, -1) = u(x, 1), \quad u_y(x, -1) = u_y(x, 1), \quad -1 \le x \le 1. \tag{8.31} \]
These conditions alone only define the solution up to an additive constant. As we have already seen in the spectral solution to the heat equation, we add a normalization condition:
\[ \int_{-1}^{1}\int_{-1}^{1} u(x, y)\, dx\, dy = 0. \tag{8.32} \]

Since f is analytic, its Fourier expansion
\[ f(x, y) = \sum_{k,l=-\infty}^{\infty} \hat f_{k,l}\, e^{i\pi(kx+ly)} \]

is spectrally convergent. From the Fourier expansion of f we can calculate the coefficients of the Fourier expansion of u
\[ u(x, y) = \sum_{k,l=-\infty}^{\infty} \hat u_{k,l}\, e^{i\pi(kx+ly)}. \]

From the normalization condition we obtain
\[ 0 = \int_{-1}^{1}\int_{-1}^{1} u(x, y)\, dx\, dy = \sum_{k,l=-\infty}^{\infty} \hat u_{k,l} \int_{-1}^{1}\int_{-1}^{1} e^{i\pi(kx+ly)}\, dx\, dy = 4\hat u_{0,0}. \]

For the other coefficients,
\[ \nabla^2 u(x, y) = -\pi^2 \sum_{k,l=-\infty}^{\infty} (k^2 + l^2)\, \hat u_{k,l}\, e^{i\pi(kx+ly)} = \sum_{k,l=-\infty}^{\infty} \hat f_{k,l}\, e^{i\pi(kx+ly)}, \]
thus summarizing
\[ \hat u_{k,l} = \begin{cases} -\dfrac{1}{\pi^2(k^2+l^2)}\, \hat f_{k,l}, & (k, l) \ne (0, 0), \\[1ex] 0, & (k, l) = (0, 0). \end{cases} \]
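In code this amounts to a simple componentwise division; the MATLAB sketch below uses placeholder coefficients for \hat f_{k,l} purely to illustrate the indexing.

N = 16; idx = (-N/2+1:N/2);
fhat = randn(N, N) + 1i*randn(N, N);       % placeholder coefficients \hat f_{k,l}
[K, L] = ndgrid(idx, idx);
uhat = zeros(N, N);
nz = (K.^2 + L.^2) > 0;
uhat(nz) = -fhat(nz)./(pi^2*(K(nz).^2 + L(nz).^2));   % \hat u_{0,0} stays zero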
This solution is not representative for its application to general PDEs.
The reason is the special structure of the Poisson equation, because φk,l =
eiπ(kx+ly) are the eigenfunctions of the Laplace operator with eigenvalue
−π 2 (k 2 + l2 ), since
∇2 φk,l = −π 2 (k 2 + l2 )φk,l ,
and they obey periodic boundary conditions.
The concept can be extended to general second-order elliptic PDEs specified by the equation
\[ \nabla^T (a\nabla u) = f, \qquad -1 \le x, y \le 1, \]
where a is a positive analytic function and f is an analytic function, and both a and f are periodic. We again impose the boundary conditions (8.31) and the normalization condition (8.32). Writing
\[ a(x, y) = \sum_{k,l=-\infty}^{\infty} \hat a_{k,l}\, e^{i\pi(kx+ly)}, \qquad f(x, y) = \sum_{k,l=-\infty}^{\infty} \hat f_{k,l}\, e^{i\pi(kx+ly)}, \qquad u(x, y) = \sum_{k,l=-\infty}^{\infty} \hat u_{k,l}\, e^{i\pi(kx+ly)}, \]
and rewriting the PDE using the fact that the Laplacian ∇² is the divergence ∇ᵀ of the gradient ∇u as
\[ \nabla^T(a\nabla u) = \frac{\partial}{\partial x}(a u_x) + \frac{\partial}{\partial y}(a u_y) = a\nabla^2 u + a_x u_x + a_y u_y, \]
we get
\[ -\pi^2 \Bigl( \sum_{k,l=-\infty}^{\infty} \hat a_{k,l} e^{i\pi(kx+ly)} \Bigr) \Bigl( \sum_{m,n=-\infty}^{\infty} (m^2+n^2)\, \hat u_{m,n} e^{i\pi(mx+ny)} \Bigr) \]
\[ -\pi^2 \Bigl( \sum_{k,l=-\infty}^{\infty} k\, \hat a_{k,l} e^{i\pi(kx+ly)} \Bigr) \Bigl( \sum_{m,n=-\infty}^{\infty} m\, \hat u_{m,n} e^{i\pi(mx+ny)} \Bigr) \]
\[ -\pi^2 \Bigl( \sum_{k,l=-\infty}^{\infty} l\, \hat a_{k,l} e^{i\pi(kx+ly)} \Bigr) \Bigl( \sum_{m,n=-\infty}^{\infty} n\, \hat u_{m,n} e^{i\pi(mx+ny)} \Bigr) = \sum_{k,l=-\infty}^{\infty} \hat f_{k,l}\, e^{i\pi(kx+ly)}. \]

Again the normalization condition yields \hat u_{0,0} = 0. Replacing the products by convolutions using the bivariate version
\[ \Bigl( \sum_{k,l=-\infty}^{\infty} \hat f_{k,l} e^{i\pi(kx+ly)} \Bigr) \Bigl( \sum_{m,n=-\infty}^{\infty} \hat g_{m,n} e^{i\pi(mx+ny)} \Bigr) = \sum_{k,l=-\infty}^{\infty} \Bigl( \sum_{m,n=-\infty}^{\infty} \hat f_{k-m,l-n}\, \hat g_{m,n} \Bigr) e^{i\pi(kx+ly)} \]
and truncating the sums to −N/2 + 1 ≤ k, l ≤ N/2, we get a system of N² − 1 linear algebraic equations in the unknowns \hat u_{k,l}, −N/2 + 1 ≤ k, l ≤ N/2, (k, l) ≠ (0, 0),
\[ -\pi^2 \sum_{m,n=-N/2+1}^{N/2} \bigl[ \hat a_{k-m,l-n}(m^2+n^2) + (k-m)\hat a_{k-m,l-n}\, m + (l-n)\hat a_{k-m,l-n}\, n \bigr]\, \hat u_{m,n} = \hat f_{k,l}. \]
The main difference between methods arising from computational stencils
and spectral methods is that the former leads to large sparse matrices, while
the latter leads to small but dense matrices.

8.7 Finite Element Method


So far we have calculated estimates of the values of the solution to a differential
equation at discrete grid points. We now take a different approach by choosing
an element from a finite dimensional space of functions as an approximation to
the true solution. For example, if the true solution is a function of one variable,
this space may be made up of piecewise polynomials such as linear or cubic
splines. There are analogous piecewise polynomials of several variables.
This technique was developed because many differential equations arise
from variational calculations. For example, let u describe a physical system
and let I(u) be an integral that gives the total energy in the system. The
system is in a steady state if u is the function that minimizes I(u). Often it
can then be deduced that u is the solution to a differential equation. Solving
this differential equation might be the best way of calculating u.
On the other hand, it is sometimes possible to show that the solution u of
a differential equation is the function in an infinite dimensional space which
minimizes a certain integral I(u). The task is to choose a finite dimensional
subspace V and then to find the element ũ in this subspace that minimizes
I(ũ).
We will not go into detail, but only give an introduction to finite elements.
Ern and Guermond cover the topic extensively in [8].
We will consider a one-dimensional example. Let u be the solution to the differential equation
\[ u''(x) = f(x), \qquad 0 \le x \le 1, \]
subject to the boundary conditions u(0) = u_0 and u(1) = u_1, where f is


a smooth function from [0, 1] to R. Moreover, let w be the solution of the minimization of the integral
\[ I(w) = \int_0^1 [w'(x)]^2 + 2w(x)f(x)\, dx \]

subject to the same boundary conditions w(0) = u0 and w(1) = u1 . We will


show that u = w. That is, the solution to the differential equation is also the
minimizer of I(w). In other words, we prove v = w − u = 0. We have
\[ I(w) = I(u+v) = \int_0^1 [u'(x) + v'(x)]^2 + 2[u(x) + v(x)]f(x)\, dx \]
\[ = \int_0^1 [u'(x)]^2 + 2u(x)f(x)\, dx + 2\int_0^1 u'(x)v'(x) + v(x)f(x)\, dx + \int_0^1 [v'(x)]^2\, dx. \]

The first integral is I(u) and the last integral is always non-negative. For the second integral, integrating by parts, we obtain
\[ \int_0^1 u'(x)v'(x) + v(x)f(x)\, dx = [u'(x)v(x)]_0^1 - \int_0^1 u''(x)v(x)\, dx + \int_0^1 v(x)f(x)\, dx \]
\[ = u'(1)v(1) - u'(0)v(0) + \int_0^1 v(x)\bigl(f(x) - u''(x)\bigr)\, dx = 0, \]
since v(0) = v(1) = 0 (u and w have the same boundary conditions) and u''(x) = f(x). Hence
\[ I(w) = I(u) + \int_0^1 [v'(x)]^2\, dx. \]

Since w minimizes I(w) we have I(w) ≤ I(u). It follows that
\[ \int_0^1 [v'(x)]^2\, dx = 0 \]
and thus v'(x) ≡ 0, so v is constant, and v(x) ≡ 0 since v(0) = 0.


The above solution can be adjusted for any boundary condition u(0) = a
and u(1) = b, because the linear polynomial a − u0 + (b − u1 − a + u0 )x can
be added to the solution. This does not change the second derivative, but
achieves the desired boundary conditions. Thus we can assume zero boundary
conditions w(0) = 0 and w(1) = 0 when minimizing I(w).

This example can be extended to two dimensions. We consider the Poisson equation
\[ \nabla^2 u(x, y) = u_{xx}(x, y) + u_{yy}(x, y) = f(x, y) \]
on the square [0, 1] × [0, 1] with zero boundary conditions. We assume that f is continuous on [0, 1] × [0, 1]. The solution to the Poisson equation is equivalent to minimizing
\[ I(w) = \int_0^1\!\!\int_0^1 (w_x(x, y))^2 + (w_y(x, y))^2 + 2w(x, y)f(x, y)\, dx\, dy. \]

As before, we let v = w − u and hence
\[ I(w) = I(u+v) = \int_0^1\!\!\int_0^1 (u_x + v_x)^2 + (u_y + v_y)^2 + 2(u + v)f\, dx\, dy \]
\[ = \int_0^1\!\!\int_0^1 [u_x]^2 + [u_y]^2 + 2uf\, dx\, dy + 2\int_0^1\!\!\int_0^1 u_x v_x + u_y v_y + vf\, dx\, dy + \int_0^1\!\!\int_0^1 [v_x]^2 + [v_y]^2\, dx\, dy, \]
where all functions are evaluated at (x, y).

Simplifying the second integral,
\[ \int_0^1\!\!\int_0^1 u_x v_x + u_y v_y + v f\, dx\, dy = \int_0^1 [u_x v]_{x=0}^{x=1}\, dy - \int_0^1\!\!\int_0^1 u_{xx} v\, dx\, dy + \int_0^1 [u_y v]_{y=0}^{y=1}\, dx - \int_0^1\!\!\int_0^1 u_{yy} v\, dy\, dx + \int_0^1\!\!\int_0^1 v f\, dx\, dy \]
\[ = \int_0^1 u_x(1, y)v(1, y) - u_x(0, y)v(0, y)\, dy + \int_0^1 u_y(x, 1)v(x, 1) - u_y(x, 0)v(x, 0)\, dx + \int_0^1\!\!\int_0^1 v\,(f - u_{xx} - u_{yy})\, dx\, dy = 0, \]

since v(x, 0) = v(x, 1) = v(0, y) = v(1, y) = 0 for all x, y ∈ [0, 1] and f(x, y) = u_{xx}(x, y) + u_{yy}(x, y) for all (x, y) ∈ [0, 1] × [0, 1]. Hence
\[ I(w) = I(u) + \int_0^1\!\!\int_0^1 [v_x(x, y)]^2 + [v_y(x, y)]^2\, dx\, dy. \]

On the other hand, I(w) is minimal and thus
\[ \int_0^1\!\!\int_0^1 [v_x(x, y)]^2 + [v_y(x, y)]^2\, dx\, dy = 0. \]

This implies v(x, y) ≡ 0 and thus the solution to the Poisson equation also
minimizes the integral.
We have seen two examples of the first step of the finite element method.
The first step is to rephrase the problem as a variational problem. The true
solution lies in an infinite dimensional space of functions and is minimal with
respect to a certain functional. The next step is to choose a finite dimen-
sional subspace S and find the element of that subspace which minimizes the
functional.
To be more formal, let the functional be of the form
\[ I(u) = A(u, u) + 2\langle u, f\rangle, \tag{8.33} \]

where ⟨u, f⟩ is a scalar product. In the first example
\[ A(u, v) = \int_0^1 u'(x)v'(x)\, dx \]
and the scalar product is
\[ \langle u, f\rangle = \int_0^1 u(x)f(x)\, dx. \]
In the second example, on the other hand,
\[ A(u, v) = \int_0^1\!\!\int_0^1 u_x(x, y)v_x(x, y) + u_y(x, y)v_y(x, y)\, dx\, dy \]
and the scalar product is
\[ \langle u, f\rangle = \int_0^1\!\!\int_0^1 u(x, y)f(x, y)\, dx\, dy. \]

In the following we assume that the functional A(u, v) satisfies the follow-
ing properties:
Symmetry
A(u, v) = A(v, u).

Bi-linearity
A(u, v) is a linear function of u for fixed v:

A(λu + µw, v) = λA(u, v) + µA(w, v).

Due to symmetry, A(u, v) is also a linear functional of v for fixed u.


Non-negativity
A(u, u) ≥ 0 for all u and A(u, u) = 0 if and only if u ≡ 0.
Theorem 8.6. If the above assumptions hold, then u ∈ S minimizes I(u) as in (8.33) if and only if for all non-zero functions v ∈ S,
\[ A(u, v) + \langle v, f\rangle = 0. \]

Proof. Let u ∈ S minimize I(u), then for any non-zero v ∈ S and scalars λ,
\[ I(u) \le I(u + \lambda v) = A(u + \lambda v, u + \lambda v) + 2\langle u + \lambda v, f\rangle = I(u) + 2\lambda\bigl[ A(u, v) + \langle v, f\rangle \bigr] + \lambda^2 A(v, v). \]
It follows that 2λ[A(u, v) + ⟨v, f⟩] + λ²A(v, v) has to be non-negative for all λ. This is only possible if the linear term in λ vanishes, that is,
\[ A(u, v) + \langle v, f\rangle = 0. \]
To prove the other direction, let u ∈ S be such that A(u, v) + ⟨v, f⟩ = 0 for all non-zero functions v ∈ S. Then for any v ∈ S,
\[ I(u + v) = A(u + v, u + v) + 2\langle u + v, f\rangle = I(u) + 2\bigl[ A(u, v) + \langle v, f\rangle \bigr] + A(v, v) = I(u) + A(v, v) \ge I(u). \]
Thus u minimizes I(u) in S.


We are looking for the element u in the finite dimensional space S that minimizes I(u) as in (8.33). Let the dimension of S be n and let v_i, i = 1, . . . , n be a basis of S. Remember that S is a space of functions; for example, the elements of S could be piecewise polynomials and the basis could be B-splines. We can express u as a linear combination of the basis functions
\[ u = \sum_{i=1}^{n} a_i v_i. \]
We know that A(u, v) + ⟨v, f⟩ = 0 for all non-zero functions in S. In particular, the equation holds for all basis functions,
\[ A(u, v_j) + \langle v_j, f\rangle = A\Bigl( \sum_{i=1}^n a_i v_i, v_j \Bigr) + \langle v_j, f\rangle = \sum_{i=1}^n A(v_i, v_j)\, a_i + \langle v_j, f\rangle = 0. \]

Defining an n × n matrix A with elements A_{ij} = A(v_i, v_j), 1 ≤ i, j ≤ n, and a vector b with entries b_j = −⟨v_j, f⟩, j = 1, . . . , n, the coefficients a_i, i = 1, . . . , n can be found by solving the system of equations given by
\[ A\mathbf{a} = \mathbf{b}. \]
A is a symmetric matrix, since A satisfies the symmetry property, and A is also positive definite, since A also satisfies the non-negativity property. Depending on f, numerical techniques might be necessary to evaluate ⟨v_j, f⟩ to calculate b, since no analytical solution for the integral may exist.
Choosing the basis functions carefully, we can introduce sparseness into A
and exploit this when solving the system of equations. In particular, selecting
basis functions which have small, finite support is desirable. The support of
a function is the closure of the set of points where the function has non-zero
values. If vi and vj are functions whose supports do not overlap, then
Ai,j = A(vi , vj ) = 0,
since A is defined as the integration over a given region of the product of
derivatives of the two functions and these products vanish, if the supports are
separated.
Because of the properties of symmetry, bi-linearity and non-negativity,
A can be viewed as a distance measure between two functions u and v by
calculating A(u−v, u−v). Let û be the true solution to the partial differential
equation, or equivalently to the variational problem which lies in the infinite
dimensional space for which I(·) is well defined. We are interested in how far
the solution u ∈ S is from the true solution û.
Theorem 8.7. Let u be the element of S that minimizes I(u) for all elements
of S. Further, let û be the element of the infinite dimensional space minimizing
I(u) over this space. Then u also minimizes
A(û − u, û − u)
for all u ∈ S. That is, u is the best solution in S with regards to the distance
measure induced by A.
Proof. Consider
\[ I(\hat u - \lambda(\hat u - u)) = I(\hat u) + 2\lambda\bigl[ A(\hat u, \hat u - u) + \langle \hat u - u, f\rangle \bigr] + \lambda^2 A(\hat u - u, \hat u - u). \]
This is quadratic in λ and has a minimum for λ = 0. This implies that the linear term has to vanish, and thus
\[ I(\hat u - \lambda(\hat u - u)) = I(\hat u) + \lambda^2 A(\hat u - u, \hat u - u), \]
especially when letting λ = 1,
\[ I(u) = I(\hat u - (\hat u - u)) = I(\hat u) + A(\hat u - u, \hat u - u). \]
Since I(û) is already minimal, minimizing I(u) minimizes A(û − u, û − u) at the same time.

We continue our one-dimensional example, solving u''(x) = f(x) with zero boundary conditions. Let x_0, . . . , x_{n+1} be nodes on the interval [0, 1] with x_0 = 0 and x_{n+1} = 1. Let h_i = x_i − x_{i−1} be the spacing. For i = 1, . . . , n we choose
\[ v_i(x) = \begin{cases} \dfrac{x - x_{i-1}}{h_i}, & x \in (x_{i-1}, x_i), \\[1ex] \dfrac{x_{i+1} - x}{h_{i+1}}, & x \in [x_i, x_{i+1}), \\[1ex] 0 & \text{otherwise} \end{cases} \]

as basis functions. These are similar to the linear B-splines displayed in Fig-
ure 3.7, the difference being that the nodes are not equally spaced and we
restricted the basis to those splines evaluating to zero at the boundary, be-
cause of the zero boundary conditions.
Recall that in this example
\[ A(u, v) = \int_0^1 u'(x)v'(x)\, dx \]
and the scalar product is
\[ \langle u, f\rangle = \int_0^1 u(x)f(x)\, dx. \]

For the chosen basis functions
\[ v_i'(x) = \begin{cases} \dfrac{1}{h_i}, & x \in (x_{i-1}, x_i), \\[1ex] \dfrac{-1}{h_{i+1}}, & x \in [x_i, x_{i+1}), \\[1ex] 0 & \text{otherwise} \end{cases} \]
and hence
\[ A_{i,j} = A(v_i, v_j) = \begin{cases} -\dfrac{1}{h_i}, & j = i-1, \\[1ex] \dfrac{1}{h_i} + \dfrac{1}{h_{i+1}}, & j = i, \\[1ex] -\dfrac{1}{h_{i+1}}, & j = i+1, \\[1ex] 0 & \text{otherwise.} \end{cases} \]
Thus A is tridiagonal and symmetric.


The scalar product of the basis functions with f is, on the other hand,
\[ \langle v_j, f\rangle = \int_{x_{j-1}}^{x_{j+1}} v_j(x)f(x)\, dx. \]
If f is continuous, we can write
\[ \langle v_j, f\rangle = f(\xi_j) \int_{x_{j-1}}^{x_{j+1}} v_j(x)\, dx \]
for some ξ_j ∈ (x_{j−1}, x_{j+1}). We can simplify
\[ \langle v_j, f\rangle = f(\xi_j)\Bigl[ \int_{x_{j-1}}^{x_j} \frac{x - x_{j-1}}{h_j}\, dx + \int_{x_j}^{x_{j+1}} \frac{x_{j+1} - x}{h_{j+1}}\, dx \Bigr] = f(\xi_j)\Bigl[ \frac{(x - x_{j-1})^2}{2h_j} \Bigr]_{x_{j-1}}^{x_j} - f(\xi_j)\Bigl[ \frac{(x_{j+1} - x)^2}{2h_{j+1}} \Bigr]_{x_j}^{x_{j+1}} = \frac{h_j + h_{j+1}}{2} f(\xi_j). \]
The solution is
\[ u(x) = \sum_{j=1}^{n} a_j v_j(x), \]

where the coefficients a_j are determined by the equations
\[ -\Bigl(\frac{1}{h_1} + \frac{1}{h_2}\Bigr) a_1 + \frac{1}{h_2} a_2 = \frac{h_1 + h_2}{2} f(\xi_1), \]
\[ \frac{1}{h_j} a_{j-1} - \Bigl(\frac{1}{h_j} + \frac{1}{h_{j+1}}\Bigr) a_j + \frac{1}{h_{j+1}} a_{j+1} = \frac{h_j + h_{j+1}}{2} f(\xi_j), \qquad j = 2, \ldots, n-1, \]
\[ \frac{1}{h_n} a_{n-1} - \Bigl(\frac{1}{h_n} + \frac{1}{h_{n+1}}\Bigr) a_n = \frac{h_n + h_{n+1}}{2} f(\xi_n). \]

If h_j = h for j = 1, . . . , n + 1 this becomes
\[ \tfrac{1}{h}(-2a_1 + a_2) = h f(\xi_1), \qquad \tfrac{1}{h}(a_{j-1} - 2a_j + a_{j+1}) = h f(\xi_j), \qquad \tfrac{1}{h}(a_{n-1} - 2a_n) = h f(\xi_n). \]

This looks very similar to the finite difference approximation to the solution of the Poisson equation in one dimension on an equally spaced grid. Let u_m approximate u(x_m). We approximate the second derivative by applying the central difference operator twice and dividing by h². Then, using the zero boundary conditions, the differential equation can be approximated on the grid by
\[ \tfrac{1}{h^2}(-2u_1 + u_2) = f(x_1), \qquad \tfrac{1}{h^2}(u_{m-1} - 2u_m + u_{m+1}) = f(x_m), \qquad \tfrac{1}{h^2}(u_{n-1} - 2u_n) = f(x_n). \]

However, the two methods are two completely different approaches to the
same problem. The finite difference solution calculates approximations to func-
tion values on grid points, while the finite element method produces a function
as a solution which is the linear combination of basis functions. The right-hand
sides in the finite difference technique are f evaluated at the grid points. The
right-hand sides in the finite element method are scalar products of the basis
functions with f . Finite element methods are chosen, when it is important
to have a continuous representation of the solution. By choosing appropriate
basis functions such as higher-order B-splines, the solution can also be forced
to have continuous derivatives up to a certain degree.
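A compact MATLAB sketch of the assembly and solution of this system for arbitrary nodes is given below; the node vector, the right-hand side f and the choice ξ_j = x_j are assumptions made for illustration.

xnodes = [0; sort(rand(20,1)); 1];          % x_0 = 0 < x_1 < ... < x_{n+1} = 1 (assumed)
f = @(x) sin(3*pi*x);                       % assumed right-hand side
h = diff(xnodes);                           % h_i = x_i - x_{i-1}
n = length(xnodes) - 2;
A = diag(1./h(1:n) + 1./h(2:n+1)) ...       % A_{i,i} = 1/h_i + 1/h_{i+1}
  + diag(-1./h(2:n), 1) + diag(-1./h(2:n), -1);   % off-diagonals -1/h_{i+1}
b = -0.5*(h(1:n) + h(2:n+1)).*f(xnodes(2:n+1));   % b_j = -<v_j, f> with xi_j = x_j
a = A \ b;                                  % coefficients of u = sum_j a_j v_j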

8.8 Revision Exercises


Exercise 8.12. The diffusion equation

\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} \]
is discretized by the finite difference method
\[ u^{n+1}_m - \tfrac{1}{2}(\mu - \alpha)\bigl( u^{n+1}_{m-1} - 2u^{n+1}_m + u^{n+1}_{m+1} \bigr) = u^n_m + \tfrac{1}{2}(\mu + \alpha)\bigl( u^n_{m-1} - 2u^n_m + u^n_{m+1} \bigr), \]
where unm approximates u(m∆x, n∆t) and µ = ∆t/(∆x)2 and α are constant.

(a) Show that the order of magnitude (as a power of ∆x) of the local error is
O((∆x)4 ) for general α and derive the value of α for which it is O((∆x)6 ).
State which expansions and substitutions you are using.
(b) Define the Fourier transform of a sequence unm , m ∈ Z. Investigate the
stability of the given finite difference method by Fourier technique and its
dependence on α. In the process define the amplification factor. (Hint:
Express the amplification factor as 1 − . . .)
Exercise 8.13. The diffusion equation
\[ \frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\Bigl( a(x) \frac{\partial u}{\partial x} \Bigr), \qquad 0 \le x \le 1, \quad t \ge 0, \]

with the initial condition u(x, 0) = φ(x), 0 ≤ x ≤ 1 and zero boundary


conditions for x = 0 and x = 1, is solved by the finite difference method (m = 1, . . . , M)
\[ u^{n+1}_m = u^n_m + \mu\bigl[ a_{m-\frac{1}{2}} u^n_{m-1} - (a_{m-\frac{1}{2}} + a_{m+\frac{1}{2}}) u^n_m + a_{m+\frac{1}{2}} u^n_{m+1} \bigr], \]
where µ = ∆t/(∆x)² is constant, ∆x = 1/(M+1), and u^n_m approximates u(m∆x, n∆t). The notation a_α = a(α∆x) is employed.

(a) Assuming sufficient smoothness of the function a, show that the local error
of the method is at least O((∆x)3 ). State which expansions and substitu-
tions you are using.

(b) Remembering the zero boundary conditions, write the method as

un+1 = Aun

giving a formula for the entries Ak,l of A. From the structure of A, what
can you say about the eigenvalues of A?

(c) Describe the eigenvalue analysis of stability.


(d) Assume that there exist finite positive constants a− and a+ such that for
0 ≤ x ≤ 1, a(x) lies in the interval [a− , a+ ]. Prove that the method is
stable for 0 < µ ≤ 1/(2a_+). (Hint: You may use without proof the Gerschgorin theorem: All eigenvalues of the matrix A are contained in the union of the Gerschgorin discs given for each k = 1, . . . , M by
\[ \Bigl\{ z \in \mathbb{C} : |z - A_{k,k}| \le \sum_{l=1, l\ne k}^{M} |A_{k,l}| \Bigr\}.) \]
l=1,l6=k

Exercise 8.14. (a) The computational stencil given by

c b c

1 ul,m
(∆x)2 b a b

c b c

is used to approximate
∂2u ∂2u
+ 2.
∂x2 ∂y

(i) Translate the computational stencil into a formula.


(ii) Express a and b in terms of c such that the computational stencil
provides an approximation with error O((∆x)2 ).
(iii) Find a value of c such that the approximation has an error of
O((∆x)4 ) in the case where the function u satisfies

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0. \]

(b) We now consider the partial differential equation

∂2u ∂2u
+ 2 = f, −1 ≤ x, y ≤ 1
∂x2 ∂y
where f is analytic and obeys periodic boundary conditions

f (−1, y) = f (1, y), −1 ≤ y ≤ 1, f (x, −1) = f (x, 1), −1 ≤ x ≤ 1.



Additionally, we impose the following periodic boundary conditions on u

u(−1, y) = u(1, y), −1 ≤ x ≤ 1, ux (−1, y) = ux (1, y), −1 ≤ y ≤ 1,


u(x, −1) = u(x, 1), −1 ≤ x ≤ 1, uy (x, −1) = uy (x, 1), −1 ≤ y ≤ 1

and a normalization condition:


\[ \int_{-1}^{1}\int_{-1}^{1} u(x, y)\, dx\, dy = 0. \]

(i) Write down the Fourier expansion for both f and u.


(ii) Define convergence at spectral speed and state the two properties it
depends on.
(iii) Which Fourier coefficient can be calculated from the normalization
condition and what is its value?
(iv) Assuming that the Fourier coefficients of f are known, calculate the
Fourier coefficients of u.
Exercise 8.15. Consider the advection equation
\[ \frac{\partial u}{\partial t} = \frac{\partial u}{\partial x} \]
for x ∈ [0, 1] and t ∈ [0, T ].
(a) What does it mean if a partial differential equation is well-posed?

(b) Define stability for time-marching algorithms for PDEs.


(c) Derive the eigenvalue analysis of stability.
(d) Define the forward difference operator ∆+ , the central difference operator
δ, and the averaging operator µ0 , and calculate the operator defined by
δµ0 .
(e) In the solution of partial differential equations, matrices often occur which
are constant on the diagonals. Let A be an M × M matrix of the form
 
a b
 −b a 
A= ,
 
.. ..
 . . b 
−b a

that is, Ai,i = a, Ai,i+1 = b, Ai+1,i = −b and Ai,j = 0 otherwise. The


where the j th component of vk is given
eigenvectors of A are v1 , . . . , vM √
j πjk
by (vk )j = ı sin M +1 , where ı = −1. Calculate the eigenvalues of A by
evaluating Avk (Hint: sin(x ± y) = sin x cos y ± cos x sin y).

(f ) The advection equation is approximated by the following Crank–Nicolson scheme
\[ u^{n+1}_m - u^n_m = \tfrac{1}{4}\mu\bigl( u^{n+1}_{m+1} - u^{n+1}_{m-1} \bigr) + \tfrac{1}{4}\mu\bigl( u^n_{m+1} - u^n_{m-1} \bigr), \]
    where µ = ∆t/∆x and ∆x = 1/(M + 1). Assuming zero boundary conditions, that is, u(0, t) = u(1, t) = 0, show that the scheme can be written in the form
\[ B u^{n+1} = C u^n, \]
    where u^n = (u^n_1, . . . , u^n_M)^T. Specify the matrices B and C.
(g) Calculate the eigenvalues of A = B −1 C and their moduli.

(h) Deduce the range of µ for which the method is stable.


Exercise 8.16. The advection equation
\[ \frac{\partial u}{\partial t} = \frac{\partial u}{\partial x} \]
for x ∈ R and t ≥ 0 is solved by the leapfrog scheme
\[ u^{n+1}_m = \mu (u^n_{m+1} - u^n_{m-1}) + u^{n-1}_m, \]

where µ = ∆t/∆x is the Courant number.

(a) Looking at the scheme, give the finite difference approximations to ∂u/∂t and ∂u/∂x which are employed, and state the order of these approximations.

(b) Determine the local error for the leapfrog scheme.


(c) The approximations u^n_m, m ∈ Z, are an infinite sequence of numbers.
Define the Fourier transform ûn (θ) of this sequence.
(d) Define a norm for the sequence unm , m ∈ Z, and for the Fourier transform
ûn (θ). State Parseval’s identity and prove it.
(e) How can Parseval’s identity be used in the stability analysis of a numerical
scheme?
(f ) Apply the Fourier transform to the leapfrog scheme and derive a three-
term recurrence relation for ûn (θ).
(g) Letting θ = π/2 and µ = 1, express û2 (π/2), û3 (π/2) and û4 (π/2) in terms
of û0 (π/2) and û1 (π/2). Hence deduce that the scheme is not stable for
µ = 1.

Exercise 8.17. Assume a numerical scheme is of the form


s
X s
X
αk un+1
m+k = βk unm+k , m ∈ Z, n ∈ Z+ ,
k=−r k=−r

where the coefficients αk and βk are independent of m and n.

(a) The approximations u^n_m, m ∈ Z, are an infinite sequence of numbers.


Define the Fourier transform ûn (θ) of this sequence.
(b) Derive the Fourier analysis of stability. In the process give a definition of
the amplification factor.
(c) Prove that the method is stable if the amplification factor is less than or
equal to 1 in modulus.
(d) Find the range of parameters µ such that the method

(1 − 2µ)un+1 n+1
m−1 + 4µum + (1 − 2µ)un+1 n n
m+1 = um−1 + um+1

is stable, where µ = ∆t/∆x2 > 0 is the Courant number. (Hint: Sub-


stitute x = cos θ and check whether the amplification factor can become
unbounded and consider the gradient of the amplification factor.)
(e) Suppose the above method is used to solve the heat equation
\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}. \]
Express the local error as a power of ∆x.

Exercise 8.18. (a) Define the central difference operator δ and show how it
is used to approximate the second derivative of a function. What is the
approximation error?
(b) Explain the method of lines, applying it to the diffusion equation
\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} \]
for x ∈ [0, 1], t ≥ 0 using the results from (a).
(c) Given space discretization step ∆x and time discretization step ∆t, the
diffusion equation is approximated by the scheme

un+1
m − unm − µ(un+1 n+1
m+1 − 2um + un+1
m−1 ) = 0,

where unm approximates u(m∆x, n∆t) and µ = ∆t/(∆x)2 is constant.


Show that the local truncation error of the method is O(∆x4 ), stating the
first error term explicitly. Thus deduce that a higher order can be achieved
for a certain choice of µ.

(d) The scheme given in (c) is modified by adding the term
\[ \alpha\bigl( u^n_{m+2} - 4u^n_{m+1} + 6u^n_m - 4u^n_{m-1} + u^n_{m-2} \bigr) \]
    on the left side of the equation. How does this change the error term
calculated in (c)? For which choice of α depending on µ can a higher
order be achieved?
(e) Perform a Fourier stability analysis on the scheme given in (c) with arbi-
trary value of µ stating for which values of µ the method is stable. (Hint:
cos θ − 1 = −2 sin2 θ/2.)
Exercise 8.19. We consider the diffusion equation with variable diffusion coefficient
\[ \frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\Bigl( a \frac{\partial u}{\partial x} \Bigr), \]
where a(x), x ∈ [−1, 1] is a given differentiable function. The initial condi-
tion for t = 0 is given, that is, u(x, 0) = u0 (x), and we have zero boundary
conditions for x = −1 and x = 1, that is, u(−1, t) = 0 and u(1, t) = 0, t ≥ 0.

(a) Given space discretization step ∆x and time discretization step ∆t, the following finite difference method is used,
\[ u^{n+1}_m = u^n_m + \mu\bigl( a_{m-1/2} u^n_{m-1} - (a_{m-1/2} + a_{m+1/2}) u^n_m + a_{m+1/2} u^n_{m+1} \bigr), \]
    where a_{m±1/2} = a(−1 + m∆x ± ∆x/2) and u^n_m approximates u(−1 + m∆x, n∆t), and µ = ∆t/(∆x)² is constant. Show that the local error is
at least O(∆x4 ).
(b) Derive the matrix A such that the numerical method given in (a) is written
as
un+1 = Aun .

(c) Since the boundary conditions are zero, the solution may be expressed in
terms of periodic functions. Therefore the differential equation is solved
by spectral methods letting
\[ u(x, t) = \sum_{n=-\infty}^{\infty} \hat u_n(t) e^{i\pi n x} \qquad\text{and}\qquad a(x) = \sum_{n=-\infty}^{\infty} \hat a_n e^{i\pi n x}. \]
n=−∞ n=−∞

Calculate the first derivative of u with regards to x.


(d) Using convolution, calculate the product

∂u(x, t)
a(x) .
∂x

(e) By differentiating the result in (d) again with regards to x and truncating, deduce the system of ODEs for the coefficients ûn(t). Specify the matrix B such that
\[ \frac{d}{dt} \hat u(t) = B \hat u(t). \]
(f ) Let a(x) be constant, that is, a(x) = â0 . What are the matrices A and B
with this choice of a(x)?
(g) Let
\[ a(x) = \cos \pi x = \tfrac{1}{2}\bigl( e^{\imath\pi x} + e^{-\imath\pi x} \bigr). \]
What are the matrices A and B with this choice of a(x)? (Hint: cos(x −
π) = − cos x and cos(x − y) + cos(x + y) = 2 cos x cos y.)

Exercise 8.20. We consider the square [0, 1] × [0, 1], which is divided by
M + 1 in both directions, with a grid spacing of ∆x in both directions. The
computational stencil given by
\[ \frac{1}{(\Delta x)^2} \begin{pmatrix} \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ 0 & -2 & 0 \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} \end{pmatrix} u_{l,m} \]

is used to approximate the right hand side of

∂2u ∂2u
+ 2 = f.
∂x2 ∂y

The function u(x, y) satisfies zero boundary conditions.

(a) Translate the computational stencil into a formula.


(b) Show that the computational stencil provides an approximation with error
O(∆x2 ) and state the ∆x2 term explicitly.
(c) What is meant by natural ordering? Write down the system of equations
arising from the computational stencil in this ordering. Use block matrices
to simplify notation stating the matrix dimensions explicitly.

(d) Using the vectors q_1, . . . , q_M, where the j-th component of q_k is given by (q_k)_j = sin(πjk/(M+1)), calculate the eigenvalues of the block matrices in (c). (Hint: sin(x ± y) = sin x cos y ± cos x sin y.)
(e) Describe how the result in (d) can be used to transform the system of
equations in (c) into several uncoupled systems of equations.
(f ) Define the inverse discrete Fourier transform of a sequence (xl )l∈Z of
complex numbers with period n, i.e., xl+n = xl . Show how this is related
to calculating qTk v for some general vector v.
Bibliography

[1] George A. Baker and Peter Graves-Morris. Padé Approximants. Cam-


bridge University Press, 1996.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning.
Springer (SIE), 2013.
[3] John C. Butcher. The numerical analysis of ordinary differential equa-
tions: Runge–Kutta and general linear methods. Wiley, 1987.

[4] Charles W. Curtis. Linear Algebra: an introductory approach. Springer,


1999.
[5] Germund Dahlquist and Åke Björck. Numerical Methods. Dover Publications, 2003.

[6] Biswa N. Datta. Numerical Linear Algebra and Applications. SIAM,


2010.
[7] James W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[8] Alexandre Ern and Jean-Luc Guermond. Theory and Practice of Finite
Elements. Springer, 2004.
[9] Roger Fletcher. Practical Methods of Optimization. Wiley, 2000.
[10] Walter Gautschi. Numerical Analysis. Birkhäuser, 2012.
[11] Peter Henrici. Discrete Variable Methods in Ordinary Differential Equa-
tions. Wiley, 1962.
[12] Israel Koren. Computer Arithmetic Algorithms. A K Peters/CRC Press,
2001.
[13] Thomas W. Körner. Fourier Analysis. Cambridge University Press, 1989.

[14] Randall J. LeVeque. Finite Volume Methods for Hyperbolic Problems.


Cambridge University Press, 2002.
[15] Harald Niederreiter. Random Number Generation and Quasi-Monte
Carlo Methods. SIAM, 1987.


[16] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer,


2006.
[17] Anthony Ralston and Philip Rabinowitz. A First Course in Numerical
Analysis. Dover Publications, 2001.

[18] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer,


2002.
[19] John C. Strikwerda. Finite Difference Schemes and Partial Differential
Equations. SIAM, 2004.

[20] Richard S. Varga. Matrix Iterative Analysis. Springer, 2000.


[21] Daniel Zwillinger. Handbook of Differential Equations. Academic Press,
1997.