0% found this document useful (0 votes)

95 views167 pages

Applied Numerical Analysis (AE2220-I) : R. Klees and R.P. Dwight

This document provides an overview of numerical analysis techniques for approximating solutions to problems that cannot be solved analytically. It covers topics like computer representation of numbers, Taylor series, solving nonlinear equations iteratively, polynomial interpolation, least squares regression, numerical differentiation and integration, and solving ordinary differential equations. The document is a table of contents for a course in applied numerical analysis, outlining major sections and subsections within each topic area at a high level.

Uploaded by

Pythonraptor

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

95 views167 pages

Applied Numerical Analysis (AE2220-I) : R. Klees and R.P. Dwight

Uploaded by

Pythonraptor

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 167

Applied Numerical Analysis (AE2220-I)

R. Klees and R.P. Dwight

February 2018
2
Contents

1 Preliminaries: Motivation, Computer arithmetic, Taylor series 1

1.1 Numerical Analysis Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Computer Representation of Numbers . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Real numbers - Fixed-point arithmetic . . . . . . . . . . . . . . . . . . 3
1.2.3 Real numbers - Floating-point arithmetic . . . . . . . . . . . . . . . . 4
1.3 Taylor Series Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Truncation error versus Rounding error . . . . . . . . . . . . . . . . . 9

2 Iterative Solution of Non-linear Equations 11

2.1 Recursive Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Fixed-point iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Polynomial Interpolation in 1d 21
3.1 The Monomial Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Why interpolation with polynomials? . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Newton polynomial basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Lagrange polynomial basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Interpolation Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5.1 Chebychev polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 Advanced Interpolation: Splines, Multi-dimensions and Radial Bases 39

4.1 Spline interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 Linear Splines (d=1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.2 Cubic Splines (d=3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Bivariate interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Tensor product polynomial interpolation . . . . . . . . . . . . . . . . . 47
4.2.2 Patch interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.3 Radial function interpolation . . . . . . . . . . . . . . . . . . . . . . . 55

3
4 CONTENTS

4.2.4 Bicubic spline interpolation (?? not examined) . . . . . . . . . . . . . 58

5 Least-squares Regression 63
5.1 Least-squares basis functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Least-squares approximation - Example . . . . . . . . . . . . . . . . . . . . . 65
5.3 Least-squares approximation - The general case . . . . . . . . . . . . . . . . . 67
5.4 Weighted least-squares (?? not examined) . . . . . . . . . . . . . . . . . . . . 71

6 Numerical Differentiation 73
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2 Numerical differentiation using Taylor series . . . . . . . . . . . . . . . . . . . 73
6.2.1 Approximation of derivatives of 2nd degree . . . . . . . . . . . . . . . 79
6.2.2 Balancing truncation error and rounding error . . . . . . . . . . . . . 81
6.3 Richardson extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.4 Difference formulae from interpolating polynomials (?? - not examined) . . . 83

7 Numerical Integration 89
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.2 Solving for quadrature weights . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.3 Numerical integration error – Main results . . . . . . . . . . . . . . . . . . . . 93
7.4 Newton-Cotes formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.4.1 Closed Newton-Cotes (s=2) – Trapezoidal rule . . . . . . . . . . . . . 95
7.4.2 Closed Newton-Cotes (s=3) – Simpson’s rule . . . . . . . . . . . . . . 96
7.4.3 Closed Newton-Cotes (s=4) – Simpson’s 3/8-rule . . . . . . . . . . . . 96
7.4.4 Closed Newton-Cotes (s=5) – Boules’s rule . . . . . . . . . . . . . . . 97
7.4.5 Open Newton-Cotes Rules . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.5 Composite Newton-Cotes formulas . . . . . . . . . . . . . . . . . . . . . . . . 98
7.5.1 Composite mid-point rule . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.5.2 Composite trapezoidal rule . . . . . . . . . . . . . . . . . . . . . . . . 99
7.6 Interval transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.7 Gauss quadrature (?? – not examined) . . . . . . . . . . . . . . . . . . . . . . 101
7.8 Numerical integration error – Details (?? – not examined) . . . . . . . . . . . 105
7.9 Two-dimensional integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.9.1 Cartesian products and product rules . . . . . . . . . . . . . . . . . . 113
7.9.2 Some remarks on 2D-interpolatory formulas . . . . . . . . . . . . . . . 116

8 Numerical Methods for Solving Ordinary Differential Equations 119

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.2 Basic concepts and classification . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.3 Single-step methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.3.1 The methods of Euler-Cauchy . . . . . . . . . . . . . . . . . . . . . . . 125
CONTENTS 5

8.3.2 The method of Heun . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

8.3.3 Classical Runge-Kutta method . . . . . . . . . . . . . . . . . . . . . . 131
8.4 Multistep methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8.5 Stability and convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.5.1 Stability of the ordinary differential equation . . . . . . . . . . . . . . 138
8.5.2 Stability of the numerical algorithm . . . . . . . . . . . . . . . . . . . 139
8.6 How to choose a suitable method? . . . . . . . . . . . . . . . . . . . . . . . . 141

9 Numerical Optimization 145

9.1 Statement of the problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
9.2 Global and local minima . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
9.3 Golden-section search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
9.4 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
9.5 Steepest descent method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
9.6 Nelder-Mead simplex method . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

Bibliography 161
6 CONTENTS
Chapter 1

Preliminaries: Motivation,
Computer arithmetic, Taylor series

1.1 Numerical Analysis Motivation

Numerical analysis is a toolbox of methods to find solutions of analysis problems by purely

arithmetic operations, that is +, −, ×, and ÷. For example, imagine you are performing
some structural analysis problem, that requires you to evaluate the definite integral:
Z π
I˜ = x sin x dx. (1.1)
0

You can solve this exactly by applying the chain-rule, to discover I = π.

One numerical approach to the same problem is known as the Trapezoidal rule: divide
the interval [0, π] into n smaller intervals, and approximate the area under the curve in each
interval by the area of a trapezoid, see Figure 1.1.
Writing this symbolically we have

n−1
X f (xi + h) + f (xi )
In = h· ,
2
i=0
iπ
xi = ,
n
π
h=
n

where f (x) = x sin x, the integrand. The xi define the edges of the subintervals, and h the
width of each subinterval. The accuracy of the approximation In ≈ I˜ = 3.14159265359 · · ·
depends on n:

1
2 Chapter 1. Preliminaries: Motivation, Computer arithmetic, Taylor series

Figure 1.1: Trapezoidal rule for integral (1.1).

n In ˜
= |In − I|
10 3.11 2.6 × 10−2
100 3.1413 2.6 × 10−4
1000 3.141590 2.6 × 10−6
10000 3.14159262 2.6 × 10−8

We want efficient methods, where the error → 0 rapidly as n → ∞. It is often the case
that evaluation of f (x) is expensive, and then using n = 10000 might not be practical. In
the above as n is increased by a factor of 10, h is reduced by a factor of 10, but the error
is reduced by 102 = 100. Because of the exponent 2 the method is said to be 2nd-order
accurate.
In the above integral, an analytic solution was possible. Now what about:
Z πp
˜
I= 1 + cos2 x dx?
0

With conventional analysis there exists no closed-form solution. With numerical analysis the
procedure is exactly the same as before! In engineering practice integrands are substantially
more complicated than this, and may have no closed-form expression themselves.

1.2 Computer Representation of Numbers

A fundamental question is how to represent infinite fields such as the integers Z and real
numbers R, with a finite and small number of bits. Consider that even individual numbers
such as π can require an infinite decimal representation. How we describe real numbers in
1.2. Computer Representation of Numbers 3

the computer will have consequences for the accuracy of the numerical methods we develop
in this course.

1.2.1 Integers
To represent integers we use a binary number system. Assume we have N bits

b = (b0 , b1 , · · · , bN −1 )

taking values 0 or 1. A given b represents the natural number

N
X −1
z= bi · 2i .
i=0

E.g. to represent the number 19:

i 6 5 4 3 2 1 0
2i 26 = 64 5
2 = 32 4
2 = 16 3
2 =8 2
2 =4 1
2 =2 20 = 1
bi 0 0 1 0 0 1 1
z 0 × 64 + 0 × 32 + 1 × 16 + 0 × 8 + 0 × 4 + 1 × 2 + 1 × 1 = 19
PN −1 i
Note that z ≥ 0 and z ≤ i=0 2 = 2N − 1, and we can represent every integer in that
range with a choice of b. For 32-bit computers (CPU register size), z < 232 = 4294967296.
Not very large - more people in the world than this. For signed integers, 1 bit is used for
the sign, z < 231 = 2147483648. For 64-bit computers z < 264 ≈ 1.8 × 1019 - much better.
Possible errors with integers:

• Overflow - trying to represent a number larger than zmax . E.g. typically for unsigned
32-bit integers (232 − 1) + 1 → 0, for signed 32-bit integers (231 − 1) + 1 → −231 ,
depending on the exact system.

1.2.2 Real numbers - Fixed-point arithmetic

To represent real numbers, can use fixed-point arithmetic. One integer is assigned to each
real number with a fixed interval h:
N
X −1
r =h· bi · 2i .
i=0

E.g. with 32-bits, 1 bit to represent the sign, and a interval h of 1 × 10−4 , we can repre-
sent numbers between ±231 · 1 × 10−4 ≈ ±200000 with a resolution of 0.0001. This range
and accuracy is obviously very limited. It is used primarily on embedded systems, for e.g.
video/audio decoding where accuracy is not critical. Possible errors:

• Overflow - as for integers.

4 Chapter 1. Preliminaries: Motivation, Computer arithmetic, Taylor series

P10
• Accumulation of rounding error - e.g. in the above system i=1 0.00011 gives
0.0010, rather than the exact 0.0011.

A real-life example of the last error is failures in the Patriot missile system. A 24-bit
1
fixed-point number contained the current time in seconds, which was incremented every 10 th
1
of a second. Key point - 10 = 0.0001100110011... has an non-terminating expansion in binary,
which was truncated after the 24th bit. So each increment we make an error of 9.5 × 10−8 s.
After 100 hours cumulative error is 100 × 60 × 60 × 10 × 9.5 × 10−8 s = 0.34 s - in which time
a target missile travels ≈ 0.5km. Quick fix: reboot every few days.

1.2.3 Real numbers - Floating-point arithmetic

Floating-point representations are the modern default for real-numbers. A real number x is
written as a combination of a mantissa s and exponent e, given a fixed base b:

x = s × be .

In particular:

• b - base, usually 2 or 10, fixed for system.

• s - significand (or mantissa), 1 ≤ s < b, with n-digits - a fixed-point number. E.g. in

base 10 with 5-digits 1.0000 ≤ s < 9.9999. A

• e - exponent, an integer emin ≤ e ≤ emax .

For example, 5-digit mantissa, b = 10, −8 ≤ e ≤ 8. Then 10 · π = 3.1416 × 101 . The system
as described does not contain zero, this is added explicitly. Negative numbers are defined
using a sign bit, as for integers. The resulting sampling of the real-number line is shown in
Figure 1.2.
Possible errors:

• Overflow - trying to represent a number larger/smaller than ±smax × bemax . For ex-
ample 9.9999 × 108 + 0.0001 = inf. Special value inf.

• Underflow - trying to represent a number closer to zero than 1 × bemin . For example
2
1. × 10−8 = 0.

• Undefined operation - such as divide-by-zero, sqrt of −1. Special value, not-an-

number nan.

• Rounding error - 1 + 1 × 10−5 = 1 in above system. We define the machine epsilon,

machine , the smallest number when added to 1, gives a number distinct from 1. Also
√
1 + 1 × 10−4 = 1. In the above system machine = 0.0001.
1.3. Taylor Series Review 5

Floating-point concept
Decimal system with 5-digit manitssa (sign = +1)
0.01 0.1 1 10
1.0001
0.099999 0.99999 9.9999 99.999

x
Machine ε
0
e=-2 e=-1 e=0 e=1
0.01 0.1 1 10
1.0001
0.099999 0.99999 9.9999 99.999

log x

e=-2 e=-1 e=0 e=1

Figure 1.2: Graphical representation of a floating-point number system in base 10 (deci-

mal), with a 5-digit mantissa 1.0000 ≤ s ≤ 9.9999.

IEEE754 is a technical standard for floating-point arithmetic, defining not only the rep-
resentation (see Figure 1.3), but also rounding behaviour under arithmetic operations.

• 64-bits, b = 2, 11-bit exponent, giving emax = 1024, emin = −1023, 53-bit mantissa
(including 1 sign bit). This corresponds approximately to a decimal exponent between
53
− and 308 (since 10308 ≈ 21024 ), and about 16 decimal-digits (since 21 ≈ 1.1×10−16 .

• Overflow at ≈ ±1.7 × 10308 (atoms in universe ≈ 1080 ).

• Underflow at ≈ ±2.2 × 10−308 (Planck scale ≈ 1.6 × 10−35 m).

• Machine epsilon ≈ 1.1 × 10−16 .

• Smallest integer not in system 9007199254740993 = 253 + 1.

1
Note that floating-point with base-2 still doesn’t represent 10 exactly. This can be seen
in e.g. Matlab by evaluating 0.3 − 0.1 − 0.1 − 0.1. The answer under IEEE754 is something
like −2.7755575615628914 × 10−17 .

1.3 Taylor Series Review

The main idea of this course to approximate complex functions by a sequence of polynomials.
Once we have a polynomial representation, differentiation, integration, interpolation etc. can
6 Chapter 1. Preliminaries: Motivation, Computer arithmetic, Taylor series

single-precision IEEE 754 (32-bit)

sign exponent mantissa (23-bit)

0 0 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 = 0.15625

double-precision IEEE 754 (64-bit)

Figure 1.3: Bit layout in the IEEE754 standard.

be performed easily and exactly. So to find e.g.

Z 1
f (x) dx,
0

we first approximate f (x) by a polynomial p(x), and then posit that the integral of p(x) will
be an approximation of the integral of f (x):
Z 1 Z 1
f (x) ≈ p(x), ∀x ∈ [0, 1] ⇒ f (x) dx ≈ p(x) dx.
0 0

One simple polynomial approximation that comes from basic calculus is Taylor’s theorem:
Theorem 1.1 (Taylor’s theorem with Lagrange remainder)
Let f (x) be a f : R → R which is N + 1-times continuously differentiable on the interval
[x0 , x] (the first N + 1 derivatives exist and are continuous). Then the Taylor expansion of
f (x) about x0 is
N
X f (n) (x0 )
f (x) = (x − x0 )n + O(x − x0 )N +1 .
n!
n=0
For small (x − x0 ) we expect the last term, the truncation error to be small, and therefore
the sum to be a good approximation to f (x). Note that the sum contains only powers of x -
it is therefore a polynomial in x.
Furthermore we can write the truncation error in a specific form: there exists a ξ ∈ [x0 , x]
such that
N
X f (n) (x0 ) f (N +1) (ξ)
f (x) = (x − x0 )n + (x − x0 )N +1 . (1.2)
n! (N + 1)!
n=0
This is the Lagrange form of the remainder, and a generalization of the Mean-Value Theorem.
It is important as it gives us an estimate for the error in the expansion, though in practice
we never know ξ.
1.3. Taylor Series Review 7

A convenient rewriting of (1.2) is obtained by defining the step-size

h = x − x0 ,

given which
N
X f (n) (x0 ) f (N +1) (ξ) N +1
f (x0 + h) = hn + h . (1.3)
n! (N + 1)!
n=0

and the series is a good approximation for small h. The Taylor series including terms up to
and including hN is called an N th-order Taylor expansion, or alternatively an (N + 1)th-term
Taylor expansion.

Example 1.2
Expand f (x) = cos(x) about x0 = 0 in an 4th-order Taylor expansion, plus remainder.

cos(0) = 1
cos0 (0) = − sin(0) = 0
cos00 (0) = − cos(0) = −1
cos000 (0) = sin(0) = 0
cos(4) (0) = cos(0) = 1
cos(5) (ξ) = − sin(ξ)

Substituting into (1.2) gives

x2 x4 x5

cos(x) = 1 − + − sin(ξ)
2! 4! 5!

for some ξ ∈ [0, x].

Example 1.3
Consider the case of expanding a polynomial f (x) = ax4 + bx3 + cx2 + dx + e as a Taylor
series about 0:

f (x) = ax4 + bx3 + cx2 + dx + e

f 0 (x) = 4ax3 + 3bx2 + 2!cx + d
f 00 (x) = 4 · 3ax2 + 3!bx + 2!c
f 000 (x) = 4!ax + 3!b
f 0000 (x) = 4!a
8 Chapter 1. Preliminaries: Motivation, Computer arithmetic, Taylor series

1.5
Taylor 1-term: p(x) =1 1.5
Taylor 2-term: p(x) =1−x2!2 1.5
Taylor 3-term: p(x) =1−x2!2 + x4!4
f(x)
p(x)
1.0 1.0 1.0

0.5 0.5 0.5

0.0 0.0 0.0

0.5 0.5 0.5

1.0 1.0 1.0

1.5 1.5 1.5

6 4 2 0 2 4 6 6 4 2 0 2 4 6 6 4 2 0 2 4 6
x x x

Figure 1.4: Successive approximations to cos(x) at 0 with a Taylor expansion.

then

f (0) = 0!e
f 0 (0) = 1!d
f 00 (0) = 2!c
f 000 (0) = 3!b
f 0000 (0) = 4!a

0000
and rearranging each term shows a = f 4!(0) etc. So an N -term Taylor series reproduces
polynomials exactly, given sufficient terms. If you forget Taylor’s expansion, you can derive
it like this. Also for a general function f (x), the Taylor series selects the polynomial with
the same derivatives as f (x) at x0 .

We never know ξ. The important consequence of this remainder is that the rate at which
the error goes to zero as h → 0 is known: ∝ hN +1 . Using big-O notation = O(hN +1 ).
We do not know the exact error (if we did we would know the exact value of the function at
x), but we know how quickly it gets smaller, and that we can make it as small as we like by
reducing h.

Exercise 1.4
Is the condition on differentiability of f (x) important? Is the theorem still true if f (x) is not
continuously differentiable? Consider the case f (x) = |x| expanded about x = 1.
1.3. Taylor Series Review 9

1.3.1 Truncation error versus Rounding error

Consider approximating f 0 (x) at x0 = 0. Start with the 1st-order Taylor expansion:

f (h) = f (0) + f 0 (0)h + O(h2 ),

and rearrange for f 0 (0):

f (h) − f (0)
f 0 (0) = + O(h).
h
This is a simple way of finding an approximation to the derivative - this is the forward
difference scheme. The truncation error of f 0 (0) is O(h) - so it makes sense to choose h as
small as possible to get the best approximation, e.g. h = 1 × 10−15 , a number which can be
represented by floating-point arithmetic. However we run into rounding error.
For example consider f (x) = ex , so that f 0 (0) = 1. For h = 1 × 10−15 in floating-point
arithmetic:
eh − e0 1.0000000000000011 − 1.
f 0 (0) ≈ = = 1.1102230246251565.
h 1 × 10−15
√
The error is more than 10%! Now let h = machine ≈ 1 × 10−8 , then

1.0000000099999999 − 1.
f 0 (0) ≈ = 0.99999999392252903.
1 × 10−8
10 Chapter 1. Preliminaries: Motivation, Computer arithmetic, Taylor series
Chapter 2

Iterative Solution of Non-linear

Equations

One of the fundamental problems of mathematics is the solution of an equation in a single

unknown
f (x) = 0.
Only for limited special cases can we solve the equation explicitly. For instance, if f (x) is a
simple algebraic polynomial of degree n, i.e.
n
X
f (x) = ai xi = a0 + a1 x + a2 x2 + · · · an xn ,
i=0

then we can write down the solution of f (x) = 0 if the polynomial is of degree one or two.
It is also possible to obtain the zeros of a cubic polynomial but the formula is already very
complicated and hardly used in practice. For degree 5 and higher it can be shown that no
such formula exists. Then, we have to resort to numerical approximation techniques. For
functions f (x) which are not polynomial in nature, numerical techniques are virtually the
only approach to find the solution of the equation f (x) = 0. For instance, some elementary
mathematical functions such as the square root and the reciprocal are evaluated on any
√
computer or calculator by an equation-solving approach: a is computed by solving the
2 1 1
equation x − a = 0 and a is computed by solving x − a = 0.
The algorithms we are going to study are all iterative in nature. That means we start our
solution process with a guess at the solution being sought and then refine this guess following
specific rules. In that way, a sequence of estimates of the solution is generated x0 , x1 , · · · xN
which should converge to the true solution x̃, meaning that

lim xN = x̃,
N →∞

where f (x̃) = 0. However this is guaranteed only under certain conditions, depending on the
algorithm.

11
12 Chapter 2. Iterative Solution of Non-linear Equations

Therefore simply describing the algorithm, and thereby the sequence, is not sufficient.
We wish to know in advance:

1. Under what conditions the algorithm converges.

2. A bound on the error of the estimate xN .

3. How rapidly the algorithm converges (the rate at which the error in xN decreases).

2.1 Recursive Bisection

Let us first illustrate a very simple method to find a zero of the continuous function f (x).
We start with an interval [a, b] in which we know a root exists. Then we half the interval,
and choose the half which contains the root. We repeat the procedure on the new interval.

Algorithm 2.1 (Recursive Bisection)

Assume an interval [a, b], and a continuous function f (x), such that f (a) and f (b) have oppo-
site sign. Then pseudocode for Nmax iterations of recursive bisection is:
a0 ← a; b0 ← b
for i = [0 : Nmax ] do
ci ← 21 (ai + bi )
if f (ai ) · f (ci ) < 0 then
ai+1 ← ai ; bi+1 ← ci
else
ai+1 ← ci ; bi+1 ← bi
end if
end for
return ci
Example 2.2
We apply this method to the polynomial

f (x) := x3 − x − 1 = 0, (2.1)

starting from the interval [1, 2]. We first note that f (1) = −1 and f (2) = 5, and since f (x) is
continuous on [1, 2], it must vanish somewhere in the interval [1, 2] by the intermediate value
theorem for continuous functions. Now, we take the midpoint of the interval [1, 2] as the
initial guess of the zero, i.e. x0 = 1.5. The error in that guess 0 := |x0 − x̃| is at most half of
the length of the interval, i.e. ≤ 0.5. Since f (1.5) = 0.875 > 0 the zero must lye in the smaller
interval [1, 1.5]. Again, we take the midpoint of that interval as the next guess of the solution,
i.e. x1 = 1.25, and the error of that guess is ≤ 0.25. We obtain f (1.25) = −0.296 < 0, thus
a yet smaller interval where the solution must lie is [1.25, 1.5], and the next guess of the
solution is x2 = 1.375; it has an error ≤ 0.125. See Figure 2.1.
2.1. Recursive Bisection 13

Figure 2.1: Progress of the Recursive Bisection Method.

It is easy to prove that for any continuous function f (·) and initial end points satisfying
f (xleft ) · f (xright ) < 0 the sequence of midpoints generated by the bisection method converges
to the solution of f (x) = 0. Obviously, the midpoint of the corresponding interval differs
from the solution by at most half the length of the interval. This gives a simple expression
for the upper-bound on the error.
Let the error in the root after the ith iteration be

i := |xi − x̃|.

On the first iteration we know that 0 ≤ (b − a)/2, and for every subsequent iteration the
interval size halves, so
b−a
N ≤ EN = N +1 ,
2
where EN is the upper-bound on the error in xN . Note that
1
EN +1 = EN ,
2
so the error at each iteration is reduced by a constant factor of 0.5. This is an example of a
linear rate of convergence. The name “linear” originates from the convergence curve when
plotted on an iteration-log error graph, see Figure 2.2.
We are interested in the rate of convergence because of the common case where f (x) is
extremely expensive to evaluate. In N iterations of recursive bisection f (·) must be evaluated
N + 2 times.
14 Chapter 2. Iterative Solution of Non-linear Equations

Figure 2.2: Convergence of recursive bisection (linear – left) and Newton (quadratic –
right).

Recursive bisection is an excellent method: guaranteed to converge, and possessing a

strong upper-bound on the error. The major limitation is that it applies in 1d only. It can
not be generalized to the case where f (·) is a vector function of a vector variable:

f : Rn → Rn .

The fixed-point iteration and Newton methods in the following can be generalized.

Exercise 2.3
Consider applying the recursive bisection method to a continuous function f (x) with multiple
roots in the initial interval.

1. What restrictions exist on the number of roots in [a, b] given that f is continuous and
f (a) < 0, f (b) > 0?

2. If a continuous curve has 3 roots in the interval [a, b], can recursive bisection converge
to the middle root?

Now consider the case that the function is not continuous - consider for example f (x) = 1/x
on the initial interval [−1, 2]. Does recursive bisection converge? Does it converge to a root?

2.2 Fixed-point iteration

The fixed-point iteration requires only that the original equation:

f (x) = 0 (2.2)

be rewritten in an equivalent form

x = ϕ(x). (2.3)
2.2. Fixed-point iteration 15

Provided that the two equations (2.2), (2.3) are equivalent, i.e. x = ϕ(x) ⇔ f (x) = 0, it
follows that any solution of the original equation (2.2) is a solution of the second equation.
Then, an initial guess x0 generates a sequence of estimates of the fixed point by setting

xn = ϕ(xn−1 ), n = 1, 2, . . . . (2.4)

Clearly if the exact solution x̃ is achieved on iteration n, xn = x̃, then xn+1 = x̃. The exact
solution is preserved, it is a fixed point of the iteration. See Figure 2.3. The algorithm can
be summarized as follows:
Algorithm 2.4 (Fixed-Point Iteration)
Let the initial guess be x0 . Then the following performs a fixed-point iteration:
for i = [0 : Nmax ] do
xi+1 ← ϕ(xi )
end for
return xi+1

Figure 2.3: Progress of a fixed-point iteration for 3 different ϕ.

Note that the choice of ϕ(·) is not unique. As before, the questions we have to answer are
(a) whether the iteration scheme converges, and if so (b) how the error behaves. Both these
depend on the choice of ϕ(·).

Example 2.5 (Continuation of example 2.2)

Before answering these questions let us first illustrate the method using the example 2.2
given before. Usually, the iteration function ϕ is obtained by solving (2.2) for x in some way.
Equation (2.1) may be written as

• x = x3 − 1 =: ϕ(x)

• x = (x + 1)1/3 =: ϕ(x)
16 Chapter 2. Iterative Solution of Non-linear Equations

1 1
• x= x + x2
=: ϕ(x), x 6= 0
1
• x= x2 −1
=: ϕ(x), x 6= {1, −1}

Let us take ϕ(x) = (x + 1)1/3 . As starting value we use x0 = 1.5. Then, following 2.4 we
obtain the scheme
xn = (xn−1 + 1)1/3 , n = 1, 2, . . . , (2.5)
which generates the following values for x1 , x2 , x3 , and x4 :

x1 = 1.3572, x2 = 1.3309, x3 = 1.3259, x4 = 1.3249.

When using ϕ(x) = x3 − 1, we obtain the scheme

xn = x3n−1 − 1, n = 1, 2, . . . , (2.6)

which generates the values

x1 = 2.375, x2 = 12.396, x3 = 1904.003, x4 = 6.902 · 109 .

Obviously, the latter sequence does not converge, since the solution of the given equation is
1.3247179 . . . . However, the first iteration scheme seems to converge, because x4 = 1.3249 is
a much better approximation than the starting value x0 = 1.5.

The example shows that it is necessary to investigate under what conditions the iteration
scheme will converge and how fast the convergence will be. First we need some calculus:

Theorem 2.6 (Mean-Value Theorem)

If f (x) and f 0 (x) are continuous, there exists a ξ ∈ [a, b] such that

f (b) − f (a)
f 0 (ξ) = .
b−a
Note that this is a special case of Taylors theorem with Lagrange remainder. Visually the
mean-value theorem is easy to verify, see Figure 2.4.
We now investigate the convergence of a fixed-point iteration. Let x̃ be the exact root of
f (x), and xi the current approximation. Now we can derive a relationship between the error
in xi and xi+1 as follows:

xi+1 = ϕ(xi )
xi+1 − x̃ = ϕ(xi ) − x̃
xi+1 − x̃ = ϕ(xi ) − ϕ(x̃)
ϕ(xi ) − ϕ(x̃)
xi+1 − x̃ = (xi − x̃)
xi − x̃
xi+1 − x̃ = ϕ0 (ξi )(xi − x̃),
2.2. Fixed-point iteration 17

Figure 2.4: Graphical argument for the Mean-Value Theorem

where the mean-value theorem has been applied at the last stage. Note that the ξ in the
MVT will be different at each iteration, hence ξi . By defining the error i := |xi − x̃|, we have

i+1 = ϕ0 (ξi )i .

If we want the error to strictly drop at iteration i, then we require |ϕ0 (ξi )| < 1. Otherwise
error grows (divergence). If −1 < ϕ0 (ξ) < 0 the error oscillates around the root, as we
observed in Figure 2.3.
Assume that |ϕ0 (ξi )| < K < 1, ∀i. Then we have an error bound after n iterations:

i < Ki−1 < K 2 i−2 < · · · < K i 0 .

Again we have linear convergence - error is reduced by a constant factor at each iteration.

Example 2.7 (Continuation of example 2.2)

We consider again the iteration function ϕ(x) = x3 −1 and investigate ϕ0 . From ϕ(x) = x3 −1
it follows ϕ0 (x) = 3x2 . For the initial guess x = 1.5, we obtain ϕ0 (1.5) = 6.75 > 1, i.e. the
condition |ϕ0 | < 1 is already not fulfilled for the initial guess. Since the condition |ϕ0 | < 1
p
is only fulfilled for |x| < 2/3, the iteration function ϕ(x) = x3 − 1 is not suited for the
iteration scheme.
Using the iteration function ϕ(x) = (x + 1)1/3 , we obtain ϕ0 (x) = 31 (x + 1)−2/3 . Thus,
|ϕ0 (x)| < 0.21 for all x ∈ [1, 2]. Therefore, this iteration function leads to a sequence xn of
approximations which does converge to the solution ξ.
18 Chapter 2. Iterative Solution of Non-linear Equations

Because of the flexibility in choice of ϕ, fixed-point iterations encompass an enormous

variety of iterative techniques. FPIs of some variety are used almost universally in physics
simulation codes, including CFD. They can be very efficient for vector-valued f (·) and x, but
convergence is rarely guaranteed, and often slow (think K = 0.99). In some cases we can
improve on FPI with Newton’s method.

2.3 Newton’s method

The principle of Newton’s method is to construct a linear approximation to f (·) at xi . The
next approximation of the root is where this linear approximation crosses the axis. Therefore
in the trivial case that f (x) is a straight-line the method will converge in 1 iteration (compare
with an FPI in the same case).

Figure 2.5: Progression of Newton’s method

Assume that f is two times continuously differentiable in [a, b] and that it has a simple
zero x̃ ∈ [a, b], i.e. f (x̃) = 0 and f 0 (x̃) 6= 0. The Taylor series expansion of f at x0 is

(x̃ − x0 )2
f (x̃) = f (x0 ) + f 0 (x0 )(x̃ − x0 ) + f 00 (ξ) = 0, ξ ∈ [x0 , x̃].
2
If we are close to the solution, then |x̃ − x0 | is small, and we can neglect the remainder term.
By neglecting the remainder we are invoking the linear approximation. We obtain

0 ≈ f (x0 ) + f 0 (x0 )(x̃ − x0 ),

2.3. Newton’s method 19

rearranging for x̃ gives,

f (x0 )
x̃ ≈ x0 − =: x1 .
f 0 (x0 )
Thus we can establish the following iteration scheme:
f (xn−1 )
xn = xn−1 − , n = 1, 2, . . .
f 0 (xn−1 )
The method is called Newton’s method. It is a special case of a FPI with iteration function
f (x)
ϕ(x) = x − .
f 0 (x)

Stability and Convergence Again we ask the questions: (a) does the iteration converge?,
and (b) how rapidly? To answer (a) we observe that
f (x)f 00 (x)
ϕ0 (x) = .
(f 0 (x))2
So, ϕ0 (x̃) = 0. Therefore, it follows that (provided ϕ0 is continuous, which it is if f is twice
continuously differentiable) |ϕ0 (x)| < 1 for all x in some neighbourhood of the solution x̃.
Therefore, Newton’s method will converge provided that the initial guess x0 is “close enough”
to x̃. The convergence result demands a starting point x0 which may need to be very close
to the solution we look for. This is called local convergence. Such a starting point can often
be found by first using several iterations of a FPI in the hope that a suitably small interval
is obtained.
Something special happens in the convergence rate of Newton – which makes it unique.
As before, let x̃ be a root of f (x), and xi the current approximation. By defining the current
error:
ei := xi − x̃
which we expect to be small (at least close to the root), the iteration can be approximated
with Taylor series about the exact solution:

xi+1 = ϕ(xi ) = ϕ(x̃ + ei ) (2.7)

ϕ00 (x̃)
= ϕ(x̃) + ϕ0 (x̃)ei + e2i + O(e3i ) (2.8)
2
ϕ00 (x̃) 2
≈ x̃ + 0 + ei , (2.9)
2
where the final approximation neglects only e3i terms. Rearranging
ϕ00 (x̃) 2
ei+1 = ei .
2
The error is reduced by its square, each iteration. This is called quadratic convergence, see
Figure 2.2.
20 Chapter 2. Iterative Solution of Non-linear Equations

Example 2.8 (Continuation of example 2.2)

We apply Newton’s method to example 2.2 and obtain for the initial guess x0 = 1.5 the series

n xn
0 1.5
1 1.348
2 1.3252
3 1.3247182
4 1.324717957

Exercise 2.9
√
Use Newton’s method to compute 2. Use the function f (x) = x2 − 2 and the initial guess
x0 = 0.2. How many iterations are needed to get six decimal places? Perform the iterations.
Chapter 3

Polynomial Interpolation in 1d

The process of constructing a smooth function which passes exactly though specified data
points is called interpolation. Introducing some notation, the data points we denote

(xi , fi ), for i = 0, 1, . . . , n,

which we imagine to come from some exact function f (x) which is unknown, and which we
wish to reconstruct on the interval [x0 , xn ]. We usually expect that x0 < x1 < · · · < xn , in
particular that no two xi s are equal.
An interpolating function is called an interpolant and is a linear combination of prescribed
basis functions. If the basis functions are:

{ϕi (x) : i = 0, . . . , n}

then the interpolant will have the form

n
X
φ(x) := a0 ϕ0 + a1 ϕ1 + · · · + an ϕn = ai ϕi (x)
i=0

where ai are the interpolation coefficients, and are constant (not a function of x), and must
be choosen to force φ(x) to match the data (xi , fi ).

Definition 3.1 (Interpolation of data)

Given n + 1 pairs of numbers (xi , fi ), 0 ≤ i ≤ n, with xi 6= xj for all i 6= j. We are looking
for a function
Xn
φ(x) := ai ϕi ,
i=0

satisfying the interpolation conditions

φ(xi ) = fi , for i = 0, . . . , n.

21
22 Chapter 3. Polynomial Interpolation in 1d

In the above definition note that we have n + 1 interpolation conditions to satisfy, and
also n + 1 degrees of freedom (DoFs) the ai that we can change to satsify the conditions. This
suggests that the problem can be solved.
The functions ϕi (x) might be polynomials (leading to polynomial interpolation), trigono-
metric functions (Fourier interpolation), rational functions (rational interpolation), or any-
thing else. Then, the problem of linear interpolation can be formulated as follows:
Pn
If the element φ = i=0 ai ϕi is to satisfy the interpolation conditions φ(xi ) = fi for
i = 0, . . . , n, then the coefficients must satisfy the linear system of equations:

a0 ϕ0 (x0 ) + · · · + an ϕn (x0 ) = f0 ,
a0 ϕ0 (x1 ) + · · · + an ϕn (x1 ) = f1 ,
.. .. ..
. . .
a0 ϕ0 (xn ) + · · · + an ϕn (xn ) = fn ,

or more concisely:
n
X
ai ϕi (xj ) = fj , for j = 0, . . . , n. (3.1)
i=0

This is a matrix equation. If A is the matrix with elements aij = φj (xi ), a = (a0 , . . . , an ),
f = (f0 , . . . , fn ), we write (3.1) as

Aa = f . (3.2)

Equation (3.2) is a linear system of dimension (n+1)×(n+1). There exists a unique solution

a = A−1 f ,

det A 6= 0.

The value of det A depends on the chosen basis functions {φj } and on the data locations
{xi }, but not on the data values {fi }. If det A 6= 0 for every selection of n + 1 distinct data
points, then the system of basis functions {φj (x)} is called unisolvent - a highly desirable
property. Note that if xi = xj for any i 6= j then two rows of A will be identical and therefore
det A = 0.
Initially we concentrate on the one-dimensional case (that is with only one independent
variable x). This is called univariate interpolation.
3.1. The Monomial Basis 23

3.1 The Monomial Basis

One natural basis for polynomials are the monomials:

ϕ0 (x) = 1
ϕ1 (x) = x
ϕ2 (x) = x2
.. ..
. .
ϕn (x) = xn ,

denote the monomial basis

mn (x) := (1, x, x2 , . . . , xn ).

Given which the interpolant has the form

n
X
φ(x) := a0 + a1 x + a2 x2 + · · · + an xn = ai xi .
i=0

It is clear that all polynomials of degree ≤ n can be written in this form. This is not the only
possible basis, the Lagrange basis and the Newton basis will be discussed later.
As for all bases the coefficients {ai } are uniquely determined by the interpolation condi-
tions
pn (xi ) = fi , i = 0, . . . , n,

these are n + 1 conditions for n + 1 unknowns. We write the conditions a linear system

V a = f.

The particular form of the matrix that results with a monomial basis has a special name: the
Vandermonde matrix
1 x0 x20 . . . xn−1 xn0
 
0
V =  ... ... .. .. ..  . (3.3)

. . . 
1 xn x2n . . . xn−1
n xnn
The right-hand side is simply
T
f = f0 f1 f2 . . . fn−1 fn .

The interpolating polynomial is therefore

n
X
pn (x) = ai ϕi (x) = aT mn (x).
i=0
24 Chapter 3. Polynomial Interpolation in 1d

Example 3.2 (Monomial interpolation of sin(x))

Construct a quadratic approximation of f (x) = sin(x) on the interval [0, π]. First choose
the nodal locations - a quadratic approximation always requires 3 support points - we choose
uniformly spaced points on the interval: x = (0, π/2, π). Evaluating at these points gives
f = (0, 1, 0).
The Vandermonde matrix is
   
1 x0 x20 1 0 0
V = 1 x1 x21  = V = 1 π/2 π 2 /4 .
   
1 x2 x22 1 π π2

We therefore solve     
1 0 0 a0 0
2
1 π/2 π /4 a1  = 1
    
1 π π2 a2 0
for a, giving
a0 = 0 a1 = 4/π a2 = −4/π 2 .
The approximating function is therefore

p(x) = 1.27323954x − 0.40528473x2 ,

plotted in Figure 3.1.

Example 3.3 (Polynomial reconstruction with Taylor series)

A Taylor series can be regarded as an approximation to a function at a point. In the language
of polynomial interpolation, a Taylor series approximation about x0 = 0 is a kind of inter-
polation with a monomial basis. We don’t need to solve a linear system for the coefficients,
they are given by Taylor’s theorem:

a0 = f (x0 )
a1 = f 0 (x0 )
1
a2 = f 00 (x0 )
2!
.. ..
. .
1 (n)
an = f (x0 ).
n!
Thus
1 00 1
φ(x) = f (x0 ) + f 0 (x0 )x +
f (x0 )x2 + · · · + f (n) (x0 )xn ,
2! n!
is the polynomial reconstruction — and we already know an expression for the error (the
Lagrange remainder). Note that it is not an interpolation, as it does not in general pass
3.1. The Monomial Basis 25

2.0
sin(x)
Monomial 2th order
1.5 Taylor 5th order

1.0

0.5

0.0

0.5

1.0

1.5 1 0 1 2 3 4
x

Figure 3.1: Approximating sin(x) with monomial interpolation and Taylor expansion.
26 Chapter 3. Polynomial Interpolation in 1d

through a prescribed set of points {(xi , fi )}. Also the approximation is typically very good
close to x0 , and deteriorates rapidly with distance. In interpolation we would like the error to
be small on an entire interval. Finally in practice we often have a set of data points, but we
very rarely have higher derivatives f (n) (x0 ) of the function of interest, so we can not compute
ai .
The Taylor expansion of sin(x) about x0 = 0 is plotted in Figure 3.1. The expansion is
better than 2nd-order monomial interpolation close to 0, but the latter much better over the
whole interval with a polynomial of much lower order.

3.2 Why interpolation with polynomials?

Univariate interpolation is usually done using polynomials - this is an important case we
consider in detail. The question is whether it is a good idea to approximate a smooth
univariate function f (x) by polynomials at all. The answer is ’yes’ due to a theorem by
Weierstrass (1885):

Theorem 3.4 (Weierstrass, 1885)

For any continuous function on the interval [a, b] and any ε > 0, there exists an algebraic
polynomial p such that1

kf − pk∞ := max |f (x) − p(x)| < ε. (3.4)

The theorem tells us that any continuous function may be approximated as closely as we
wish by a polynomial of suffciently high degree. The theorem does not tell us how to find
that polynomial, and the remainder of this chapter is dedicated to that task.
Since a polynomial of degree n has n + 1 coefficients, there is a unique polynomial of
degree ≤ n which agrees with a given function at n + 1 data points.
There are other reasons for using polynomials as interpolant:

• polynomials can be evaluated using the arithmetic operations +,−,× only (i.e. easy for
a computer);

• derivatives and indefinite integrals of polynomials are easy to compute and are polyno-
mials themselves;
1
The ∞-norm is defined by
kf k∞ := max |f (x)|,
x∈[a,b]

and is one measure of the “size” of a function or distance between two functions. Another common measure
is the L2 -norm (pronounced “L-two”),
s
Z b
kf k2 := [f (x)]2 dx,
a
p
which is comparable to the Euclidian norm for vectors |x|2 = x2 + y 2 + z 2 .
3.3. Newton polynomial basis 27

Figure 3.2: Graphical representation of the Weierstrass Approximation Theorm

• polynomials are always continuous and infinitely differentiable;

• univariate polynomial interpolation is always uniquely solvable (i.e. polynomial bases

are unisolvent).

This last is very important: it means there is exactly one polynomial of degree ≤ n that
passes through n + 1 points:

Theorem 3.5 (Uniqueness of polynomial interpolation)

Given a continuous function f (x) and a grid X of n + 1 nodes {x0 , x1 , . . . , xn } with a ≤ x0 <
x1 < . . . < xn ≤ b. There exists a unique polynomial pn (x) of degree ≤ n, such that

pn (xi ) = fi , i = 0, 1, . . . , n. (3.5)

Note carefully that the polynomial pn has degree n or less. Thus, if the n + 1 data points lie
on a straight line, the polynomial πn (x) will look like

a0 + a1 x + 0x2 + . . . + 0xn .

3.3 Newton polynomial basis

In order to identify the interpolation coefficients using a monomial basis, the Vandermonde
matrix V must be inverted, which could potentially be an ill-conditioned matrix. Also, every
28 Chapter 3. Polynomial Interpolation in 1d

time we add a new node to the data-set, we have to re-evaluate the values of all coeffients,
which is inefficient. This can be avoided when using the Newton basis:
k−1
Y
π0 (x) = 1, πk (x) = (x − xj ), k = 1, 2, . . . , n. (3.6)
j=0

Then as usual the interpolation polynomial can be written as a sum of basis functions:

pn (x) = d0 π0 (x) + d1 π1 (x) + . . . + dn πn (x) = dT π(x). (3.7)

Why is this a good choice of basis? Consider the interpolation conditions (3.1) for the
Newton basis, and the linear system Aa = f resulting from these. In general the matrix A
has entries aij = φj (xi ), in the case of the Newton basis we have the matrix
 
π0 (x0 ) π1 (x0 ) . . . πn (x0 )
U =  ... .. ..  . (3.8)

. . 
π0 (xn ) π1 (xn ) . . . πn (xn )

However, examining (3.6) shows that

πk (xj ) = 0 if j < k. (3.9)

Therefore, the matrix U is a lower triangular matrix:

 
1 0 0 ... 0
1 (x1 − x0 ) 0 ... 0
 

U=  .. .. .. .. . (3.10)
. . . .


Qn−1
1 (xn − x0 ) (xn − x0 )(xn − x1 ) . . . j=0 (x n − x j )

and the linear system

Ad = f

is particularly easy to solve.

Example 3.6 (Continuation of Example 3.2)

Construct a quadratic approximation of f (x) = sin(x) on the interval [0, π] with nodes at
x = (0, π/2, π).
The Newton basis depends only on the nodes xi :

π0 (x) = 1
π1 (x) = (x − x0 ) = x
π2 (x) = (x − x0 )(x − x1 ) = x(x − π/2)
3.4. Lagrange polynomial basis 29

and the resulting linear system for the coefficients is

       
π0 (x0 ) π1 (x0 ) π2 (x0 ) d0 1 0 0 d0 0
U = π0 (x1 ) π1 (x1 ) π2 (x1 ) d1  = 1 π/2 0  d1  = 1 . (3.11)
       
π0 (x2 ) π1 (x2 ) π2 (x2 ) d2 2
1 π π /2 d2 0

Clearly from the first row d0 = 0. Substituting into the 2nd row gives directly d1 = 2/π, and
then the 3rd row becomes
2 π2
π + d2 = 0
π 2
so d2 = −4/π 2 and the interpolating polynomial is

2 4 π 4 4
p(x) = x − 2 x(x − ) = x − 2 x2 ,
π π 2 π π
i.e. exactly the same polynomial obtained in Example 3.2.

If you ever have to do polynomial interpolation on paper (e.g. in an exam) — this is usually
the easiest way to do it

3.4 Lagrange polynomial basis

We would like to construct a polynomial basis with particularly simple form of the matrix A
– this is the Lagrange basis. It is obtained in the following way: consider the reconstruction
of a function using a monomial basis, and substitute in the expression for a:
T
pn (x) = aT mn (x) = V−1 f mn (x) = mT (x)V−1 f . (3.12)

Suppose we define a new polynomial basis

l(x) := m(x)T V−1 . (3.13)

Then, every element li (x) of l is a polynomial in x of degree n, and the expression for the
interpolating polynomial (3.12) becomes simply:
n
X
pn (x) = l(x)T f = fi li (x).
i=0

I.e. the interpolation matrix A has become the identity, and the interpolation coefficients a
are just the function values f .
From the interpolation condition pn (xi ) = fi , i = 0, . . . , n, it follows that

li (xj ) = δij ,
30 Chapter 3. Polynomial Interpolation in 1d

where
(
1 if i = j
δij := .
0 otherwise

Hence, the nodes {xj : i 6= j} are the zeros of the polynomial li (x). In the next step we
therefore construct polynomials which take the value 0 at all nodes except xi , and take the
value 1 at xi :2
n
Y x − xj
li (x) = , i = 0, . . . , n.
j=0
x i − xj
j6=i

These polynomials of degree n, li (x), are called Lagrange polynomials and are often denoted
li (x). This notation will be used in these notes:

n
Y x − xj
li (x) = , i = 0, . . . , n.
j=0
xi − xj
j6=i

When using the Lagrange polynomials, the interpolant can be written as

pn (x) = f T l(x), (3.14)

which is the Lagrange representation of the interpolation polynomial. The advantage of the
Lagrange representation is that the Vandermonde matrix (3.3) may become ill-conditioned if
n is large and/or the distance between two interpolation nodes xi and xj is small (and two
rows of V become similar, so det V → 0. When using the Lagrange representation (3.14) it is
possible to write down the interpolating polynomial pn (x) without solving a linear system to
compute the coefficients; the coefficient of the Lagrange basis functions li (x) is the function
value fi (this is useful on quizzes).

Example 3.7
Find the Lagrange interpolation polynomial which agrees with the following data. Use it to
estimate the value of f (2.5).

i 0 1 2 3
xi 0 1 3 4
f (xi ) 3 2 1 0

2
Alternatively we could invert V in (3.13) which gives the same answer, but in a less clean form.
3.5. Interpolation Error 31

The Lagrange polynomials for this case are

(x − 1)(x − 3)(x − 4) 1
l0 (x) = =− (x − 1)(x − 3)(x − 4),
(0 − 1)(0 − 3)(0 − 4) 12
(x − 0)(x − 3)(x − 4) 1
l1 (x) = = x(x − 3)(x − 4),
(1 − 0)(1 − 3)(1 − 4) 6
(x − 0)(x − 1)(x − 4) 1
l2 (x) = = − x(x − 1)(x − 4),
(3 − 0)(3 − 1)(3 − 4) 6
(x − 0)(x − 1)(x − 3) 1
l3 (x) = = x(x − 1)(x − 3),
(4 − 0)(4 − 1)(4 − 3) 12
and, therefore, the interpolation polynomial is
−x3 + 6x2 − 17x + 36
p(x) = 3l0 (x) + 2l1 (x) + 1l2 (x) + 0l3 (x) = .
12
The estimated value of f (2.5) is p(2.5) = 1.28125.

3.5 Interpolation Error

Intuitively we feel that as the number of data points and polynomial order increases, the
accuracy of the interpolation should improve, i.e. the interpolation error (3.15) should be-
come smaller and smaller — at least for smooth functions. This feeling is supported by the
Weierstrass’ theorem. However, what Weierstrass’s theorem states is that there always exists
arbitrarily exact polynomial approximations; what it does not say is that any interpolation
process can be used to find them. In fact, within [x0 , xn ] it is not always true that finer and
finer samplings of the function will give better and better approximations through interpo-
lation. This is the question of convergence. To find the answer to the question whether the
interpolation error (3.15) goes to zero if the number of nodes goes to infinity is much more
difficult then one might expect at first glance.
It turns out that the situation is particularly bad for equidistant nodes, i.e., if the interpo-
lation points are uniformly spaced on the interval. Then, for instance, uniform convergence
is not guaranteed even for f (x) ∈ C ∞ (infinitely differentiable or analytic functions). One
classical example is due to Carl Runge:

Example 3.8 (Non-convergence of polynomial interpolation)

1
Let f be defined on the interval [−5, 5] by f (x) = 1+x 2 . Calculate the interpolant using

n = 11 and n = 21 equally spaced data points. The results are shown in figure 3.3. It can be
shown that pn (x) 6→ f (x) as n → ∞ for any |x| > 3.63.
Hence, the interpolation error kf − pn k∞ grows without bound as n → ∞. Another
example is f (x) = |x| on [−1, 1], for which the interpolation polynomials do not even converge
pointwise except at the 3 points −1, 0, 1. So, the answer to the question, whether there
exists a single grid X for which the sequence of interpolation polynomials converge to any
32 Chapter 3. Polynomial Interpolation in 1d

2 20
n1 n2
1.5
0
1

0.5 −20

0
−40
−0.5
x− x+ x− x+
−1 −60
−5 0 5 −5 0 5
x x
0.4 0.4
n1 n2
0.3 0.3

0.2 0.2

0.1 0.1

0 0

−0.1 −0.1
x− x+ x− x+
−0.2 −0.2
−5 0 5 −5 0 5
x x

Figure 3.3: The top row shows f (x) (dashed line) and the interpolant (solid line) using
n = 11 and n = 21 equally spaced data points (stars) respectively. We also
indicated the lines x = ±3.5. At the bottom row the difference between f and
its interpolant are shown for x ∈ [−3.5, 3.5].
3.5. Interpolation Error 33

continuous function f is ’no’. However, Weierstrass’ theorem tells us that such a grid exists
for every continuous function! Hence, the statement made before implies that we would need
to determine this grid for each individual continuous function newly.

We need a way of estimating the error in an interpolation. Up to now, it has been

irrelevant whether the data values fi are related to each other in some way or not. However,
if we want so say something about how the interpolant behaves between the data points, this
question becomes important. This question is answered by a theorem due to Cauchy

Theorem 3.9 (Interpolation error (Cauchy))

If f ∈ C n+1 ([a, b]), then for any grid X of n + 1 nodes and for any x ∈ [a, b], the interpolation
error at x is
f (n+1) (ξ)
Rn (f ; x) := f (x) − pn (x) = ωn+1 (x), (3.15)
(n + 1)!
where ξ = ξ(x) ∈ (mink (xk , x), maxk (xk , x)) ⊂ [a, b] and ωn+1 (x) is the nodal polynomial
associated with the grid X, i.e.,
n
Y
ωn+1 (x) = (x − xi ).
i=0

The nodal polynomial is unique, a polynomial of degree n + 1, and has leading coefficient 1.
The latter means that the coefficient of the highest power of x, xn+1 , is equal to 1. Moreover,
all nodes xi are the zeros of the nodal polynomial.
In most instances we don’t know ξ. Then, we may use the following estimate:

|ωn+1 (x)|
|Rn (f ; x)| ≤ max |f n+1 (x)| . (3.16)
x∈[a,b] (n + 1)!

In (3.16), we have no control over maxx∈[a,b] |f n+1 (x)|, which depends on the function, and
which can be very large. Take for instance

1
f (x) = ⇒ kf n+1 k∞ = (n + 1)! αn+1 .
1 + α 2 x2
However, we may have control on the grid X. Hence, we may reduce the interpolation
error if we choose the grid so that kωn+1 k∞ is small. The question is therefore: what is the
grid for which this is minimized? The answer is given by the following theorem:

Theorem 3.10 (Control of kωn+1 k∞ )

Among all the polynomials of degree n + 1 and leading coefficient 1, the unique polynomial
which has the smallest uniform norm on [−1, 1] is the Chebychev polynomial of degree n + 1
divided by 2n :
Tn+1 (x)
.
2n
34 Chapter 3. Polynomial Interpolation in 1d

3.5.1 Chebychev polynomials

Since it can be shown that kTn+1 k∞ = 1, we conclude that if we choose the grid X to be the
n + 1 zeros of the Chebychev polynomial Tn+1 (x), it is

Tn+1 (x)
ωn+1 (x) = ,
2n
and
1
kωn+1 k∞ =
2n
and this is the smallest possible value! The grid X such that the xi ’s are the n+1 zeros of the
Chebychev polynomial of degree n + 1, Tn+1 (x), is called the Chebychev-Gauss grid. Then,
Eq. (3.16) tells us that if f n+1 < C, for some constant C the convergence of the interpolation
polynomial towards f for n → ∞ is extremely fast. Hence, the Chebyshev-Gauss grid has
much better interpolation properties than any other grid, in particular the uniform one.
What is left is to define briefly the Chebyshev polynomials. The Chebyshev polynomial
of degree n for x ∈ [−1, 1] is defined by

Tn (x) := cos(n arccos(x)),

or equivalently
Tn (cos(x)) = cos(nx).
Note that the Chebyshev polynomials are only defined on the interval [−1, 1].
Chebyshev polynomials can also be defined recursively. The first two are

T0 (x) = 1, T1 (x) = x,

and the higher-degree Chebyshev polynomials can be obtained by applying

Tn+1 (x) = 2x Tn (x) − Tn−1 (x), n = 1, 2, . . . . (3.17)

For instance, T2 (x) = 2x2 − 1, T3 (x) = 4x3 − 3x etc. Note that Tn (x) is an polynomial in x
of degree n. It has n distinct zeros, which are all located inside the interval [−1, 1]. These
zeros are given by
2i − 1
ξi = cos π , i = 1, . . . , n. (3.18)
2n

The first few Chebyshev polynomials are plotted in Figure 3.4 - note their roots, and
take maximum and minimum values of ±1. The coeffcient of xn in Tn is 2n−1 , which can be
deduced from examining the recurrence relation (3.17). Therefore the maximum value of the
monic polynomial
Tn (x)
T̃n (x) := n+1 ,
2
3.5. Interpolation Error 35

T0 (x)
T1 (x)
1.0
T2 (x)
T3 (x)
T4 (x)
T5 (x)

0.5

0.0

0.5

1.0

x
1.0 0.5 0.0 0.5 1.0

Figure 3.4: The first 6 Chebyshev polynomials.

36 Chapter 3. Polynomial Interpolation in 1d

is 1/2n+1 .
When we want to interpolate a function f (x) on the interval [a, b] using a polynomial of
degree n on a Chebyshev-Gauss grid, the grid points a < x0 < x1 < . . . < xn < b can be
computed by a simple linear transformation, which maps [−1, 1] on [a, b]. Taking also into
account that the first node of the grid, x0 , has index 0 and the zero closest to −1 of the
Chebyshev polynomial Tn+1 (x) is ξn+1 , the nodes on [a, b] can be computed by

b+a b−a
xn−i = + ξi , i = 1, . . . , n. (3.19)
2 2

The {xi } are the nodes of the grid X we would choose for interplation on the interval [a, b].

Example 3.11 (Nodes of a Chebyshev-Gauss grid)

We want to interpolate a function f (x) on the interval [6, 10] by a degree-4 univariate poly-
nomial using a Chebyshev-Gauss grid. Compute the grid nodes. Solution: we need 5 nodes,
i.e., we take the zeros of T5 (x), which can be computed as
2i − 1
ξi = cos π , i = 1, . . . , 5. (3.20)
10

These zeros are transformed onto [6, 10] using the transformation

x5−i = 8 + 2 ξi , i = 1, . . . , 5. (3.21)

The results are shown in table 3.1 Hence, the interpolation nodes are x0 = 6.098, x1 = 6.824,

i ξi x5−i
1 0.951 9.902
2 0.588 9.176
3 0.000 8.000
4 -0.588 6.824
5 -0.951 6.098

Table 3.1: 5-point Chebyshev-Gauss grid on [6, 10].

x2 = 8.000, x3 = 9.176, and x4 = 9.902.

Though the choice of the Chebyshev-Gauss grid for interpolation has advantages over a
uniform grid, there are also some penalties.

1. Extrapolation using the interpolant based on data measured at Chebyshev-Gauss points

is even more disastrous than extrapolation based on the same number of equally spaced
data points.
3.5. Interpolation Error 37

2. In practice it may be difficult to obtain the data fi measured at the Chebyshev points.
Therefore, if for some reason it is not possible to choose the Chebyshev-Gauss grid,
choose the grid so that there are more nodes towards the endpoints of the interval
[a, b].

Example 3.12 (Continuation of example 3.8)

Figure 3.5 shows the result of applying interpolation to the data points (3.19) for n = 11
and n = 21, distributed over the interval [−5, 5]. In the case of 11 data points, 6 of them are
placed in the previously troublesome region |x| > 3.5.

1 1
n1 n2
0.8
0.5
0.6

0.4
0
0.2

−0.5 0
−5 0 5 −5 0 5
x x
0.6 0.6
n1 n2
0.4 0.4

0.2 0.2

0 0

−0.2 −0.2
−5 0 5 −5 0 5
x x

Figure 3.5: Here we show f (dashed line) and its interpolant (solid line) using unequally
spaced data points (stars) distributed over [−5, 5] according to (3.19). The
bottom row shows again the difference between f and its interpolant.
38 Chapter 3. Polynomial Interpolation in 1d
Chapter 4

Advanced Interpolation: Splines,

Multi-dimensions and Radial Bases

4.1 Spline interpolation

Standard polynomial interpolation through n + 1 given data points produces a single poly-
nomial of degree ≤ n, which passes exactly through every data point. This can be unstable
if the number of points is large. Further, if the number of data is large, the solution of
the linear system may be expensive. Finally, the Runge phenomenon implies large oscilla-
tions around the boundaries of the grid. A solution to the latter problem is the choice of a
Chebyshev-Gauss grid. However, as has been noticed already before, this may not always be
possible in practice. Therefore, we want to consider another technique designed to reduce
the problems that arise when data are interpolated by a single polynomial. The basic idea is
to use a collection of low-degree polynomials to interpolate the give data rather than a single
high-degree polynomial. That this may be beneficial is seen for instance in the estimation of
the interpolation error. Remember the estimate
kf n+1 (ξ)k∞
kR(f ; x)k∞ ≤ |ωn+1 (x)|, (4.1)
(n + 1)!
where
ωn+1 (x) = (x − x0 )(x − x1 ) · · · (x − xn ) (4.2)
is the nodal polynomial of degree n + 1. Since |ωn+1 (x)| is is the product of n + 1 linear
factors |x − xi |, each the distance between the two points that both lie in [a, b], we have
|x − xi | ≤ b − a and so
|ωn+1 (x)| ≤ (b − a)n+1 . (4.3)
Hence, the maximum absolute interpolation error is
kf n+1 k∞
kR(f ; x)k∞ ≤ (b − a)n+1 . (4.4)
(n + 1)!

39
40 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases

This large error bound suggests that we can make the interpolation error as small as we wish
by freezing the value of n and then reducing the size of b − a. We still need an approximation
over the original interval [a, b], so we use a piecewise polynomial approximation: the original
interval is divided into non-overlapping subintervals and a different polynomial fit of the data
is used on each subinterval.
A simple piecewise polynomial fit is obtained in the following way: for data {(xi , fi ) :
i = 0, . . . , n}, where a ≤ x0 < x1 < · · · < xn ≤ b, we take the straight line connection of
two neighbouring data points xi and xi+1 as the interpolant on the interval [xi , xi+1 ]. Hence,
globally the interpolant is the unique polygon obtained by joining the data points together.
Instead of using linear polynomials to interpolate between two neighboring data points we
may use quadratic polynomials to interpolate between three neighboring data points, cubic
polynomials to interpolate between four neighboring data points etc. In this way we may
improve the performance but at the expense of smoothness in the approximating function.
Globally the interpolant will be at best continuous, but it will not be differentiable at the data
points. For many applications a higher degree of smoothness at the data points is required.
This additional smoothness can be achieved by using low-degree polynomials on each interval
[xi , xi+1 ] while imposing some smoothness conditions at the data points to ensure that the
overall interpolating function has globally as high a degree of continuity as possible. The
corresponding functions are called splines.

Definition 4.1 (Spline of degree d)

Let a = x0 < x1 < · · · < xn = b be an increasing sequence of data points. The function s is
a spline of degree d if

(a) s is a polynomial of degree at most d on each of the subintervals [xi , xi+1 ],

(b) s, s0 , . . . , s(d−1) are all continuous on [a, b].

4.1.1 Linear Splines (d=1)

The connection of points by straight lines as explained above is an example of a spline of
degree 1, i.e. a piecewise linear function which is continuous at the data points. This spline
is called a linear spline. It is given as



 s0 (x) x ∈ [x0 , x1 ]

s1 (x) x ∈ [x1 , x2 ]

s(x) = . .. , (4.5)

 .. .



sn−1 (x) x ∈ [xn−1 , xn ]


where
fi+1 − fi
si (x) = fi + (x − xi ), i = 0, 1, . . . , n − 1. (4.6)
xi+1 − xi
4.1. Spline interpolation 41

Hence, if we are given n+1 data points, the linear spline will consists of n degree-1 polynomials
each of which holds between a pair of consecutive data points. With h := maxi |xi+1 − xi |,
we obtain an upper bound for the interpolation error for x ∈ [xi , xi+1 ] from (4.1) with n = 1
as follows: define
xi + xi+1
x̂ := ,
2
the interval midpoint. Then for the nodal polynomial in this case:
xi+1 − xi xi+1 − xi
|ω2 (x)| = |(x − xi )(x − xi+1 )| ≤ |(x̂ − xi )(x̂ − xi+1 )| = ,
2 2
where the 1st inequality follows from the fact that the maximum or minimum of a parabola
is located at the midpoint of the two roots. Therefore by (4.1), an upper bound on the
interpolation error on the interval is:

|xi+1 − xi |2 (2) h2 (2)

kR(f ; x)k∞ ≤ kf k∞ ≤ kf k∞ . (4.7)
8 8
Note that this bound is smaller, by a factor of 4, than the general upper bound given in (4.4).
It is therefore tighter (i.e. better). If an upper bound is the best possible upper bound in a
given situation, it is called tight.
The most important feature of the above error bound is that it behaves like h2 . Suppose
that the nodes are chosen equally spaced in [a, b], so that xi = a + ih, i = 0, . . . , N , where
h = b−aN . As the number of data points N + 1 increases, the error using a linear spline as an
approximation to f (x) tends to zero like N12 . In general when a polynomial approximation
uses polynomials of degree d, we expect the error to behave like hd+1 .

4.1.2 Cubic Splines (d=3)

Linear splines suffer from a major limitation: the derivative of a linear spline is generally
discontinuous at each interior node xi . Hence, the linear spline is a continuous function on
[a, b], but not smoother. To derive a piecewise polynomial approximation with a continuous
derivative requires that we use polynomial pieces of higher degree and constrain the pieces
to make the interpolant smoother. By far the most commonly used spline for interpolation
purposes is the cubic spline (d = 3), which is in C 2 ([a, b]), i.e., the first and second derivatives
at the interior nodes xi are still continuous functions. We will restrict our attention on them.
Let us denote the subinterval [xi , xi+1 ] by Ii and the spline restricted to Ii by si , i.e.

s(x) = si (x) for x ∈ Ii , i = 0, . . . , n − 1.

The conditions which s must satisfy are that s interpolates f at the data points x0 , . . . , xn and
that s0 and s00 must be continuous at the interior data points x1 , . . . , xn−1 . Let us determine
the spline interpolant s from these conditions.
42 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases

Since si is a cubic polynomial, s00 is linear. Let us denote the yet unknown values of s00 at
the data points xi and xi+1 by Mi and Mi+1 , respectively, i.e. s00 (xi ) = Mi and s00 (xi+1 ) =
Mi+1 .
si (x) is a cubic polynomial, hence it can be written as

si (x) = ai (x − xi )3 + bi (x − xi )2 + ci (x − xi ) + d1 , i = 0, . . . , n − 1. (4.8)

Then,

s0i (x) = 3ai (x − xi )2 + 2bi (x − xi ) + ci

s00i (x) = 6ai (x − xi ) + 2bi .

hence,

Mi
s00i (xi ) = Mi = 2bi ⇒ bi =
2
Mi+1 − Mi
s00i (xi+1 ) = 6ai hi + 2bi ⇒ ai = ,
6hi
where we have defined hi := xi+1 − xi . We insert the results for ai and bi into the equation
of the spline si (x) and find

Mi+1 − Mi Mi
si (x) = (x − xi )3 + (x − xi )2 + ci (x − xi ) + di . (4.9)
6hi 2

The interpolation conditions yield si (xi ) = fi and si (xi+1 ) = fi+1 , hence,

si (xi ) = fi = di ⇒ di = fi
Mi+1 − Mi 3 Mi 2
si (xi+1 ) = fi+1 = hi + h + chi + fi
6hi 2 i
fi+1 − fi hi hi
⇒ ci = − Mi − Mi+1 .
hi 3 6

We insert the results for di and ci into the equation for si (x) and find

Mi+1 − Mi Mi f
i+1 − fi hi hi
si (x) = (x−xi )3 + (x−xi )2 + − Mi − Mi+1 (x−xi )+fi . (4.10)
6hi 2 hi 3 6
So far, we have not used the conditions

s0i (xi ) = s0i−1 (xi ), i = 1, . . . , n − 1, (4.11)

which provide n − 1 equations. It is

Mi+1 − Mi f
i+1 − fi hi hi
s0i (x) = (x − xi )2 + Mi (x − xi ) + − Mi − Mi+1 , (4.12)
2hi hi 3 6
4.1. Spline interpolation 43

hence,
f
i+1 − fi hi hi
s0i (xi ) = − Mi − Mi+1 . (4.13)
hi 3 6
In the same way, we find
Mi − Mi−1 f − f
i i−1 hi−1 hi−1
s0i−1 (x) = (x − xi−1 )2 + Mi−1 (x − xi−1 ) + − Mi−1 − Mi ,
2hi−1 hi−1 3 6
(4.14)
hence,
hi−1 hi−1 fi − fi−1
s0i−1 (xi ) = Mi + Mi−1 + . (4.15)
3 6 hi−1
Therefore, the conditions s0i (xi ) = s0i−1 (xi ), i = 1, . . . , n − 1 yield the following system of n − 1
equations for the n + 1 unknowns M0 , M1 , . . . , Mn−1 , Mn :
hi−1 (hi−1 + hi ) hi fi+1 − fi fi − fi−1
Mi−1 + Mi + Mi+1 = − , i = 1, . . . , n − 1. (4.16)
6 3 6 hi hi−1
This system has infinitely many solutions. A unique solution can only be obtained if addi-
tional constraints are imposed. There are many constraints we could choose. The simplest
constraints are
M0 = Mn = 0. (4.17)
When making this choice, the cubic spline is called natural cubic spline. The natural cubic
spline may deliver no accurate approximation of the underlying function at the ends of the
interval [x0 , xn ]. This may be anticipated from the fact that we are forcing a zero value on
the second derivative when this is not necessarily the value of the second derivative of the
function which the data measures. For instance, a natural cubic spline is built up from cubic
polynomials, so it is reasonable to expect that if the data is measured from a cubic polynomial
then the natural cubic spline will reproduce the cubic polynomial. However, if the data are
measured from, e.g., the function f (x) = x2 , then the natural cubic spline s(x) 6= f (x). The
function f (x) = x2 has nonzero second derivatives at the nodes x0 and xn where the value
of the second derivative of the natural cubic spline is zero by definition.
To clear up the inaccuracy problem associated with the natural spline conditions, we
could replace them with the correct second derivatives values

s00 (x0 ) = f 00 (x0 ), s00 (xn ) = f 00 (xn ). (4.18)

These second derivatives of the data are not usually available, but they can be replaced by
reasonable approximations. Anyway, if the exact values or sufficiently accurate approxima-
tions are used then the resulting spline will be as accurate as possible for a cubic spline. Such
approximations may be obtained by using polynomial interpolation to sufficient data values
separately near each end of the interval [x0 , xn ]. Then, the two interpolating polynomials are
each twice differentiated and the resulting twice differentiated polynomials are evaluated at
the corresponding end points to approximate the f 00 there.
44 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases

A simpler, and usually sufficiently accurate spline may be determined as follows: on the
first two and the last two intervals we define each a cubic polynomial, hence x1 and xn−1
are in fact no nodes. This fixes in fact the two unknown second derivatives M1 and Mn−1
uniquely and we are left with the solution of a linear system of dimension n − 1 for the n − 1
unknowns M0 , M2 , M3 , . . . , Mn−2 , Mn . The corresponding cubic spline is sometimes called a
not-a-knot spline (’knot’ is an alternative notation for ’node’).
For each way of supplying the additional constraints that is discussed before, the cubic
spline is unique. From the error bound for polynomial interpolation, for a cubic polynomial
interpolating at data points in the interval [a, b], we have

kR(f ; x)k∞ ≤ Ch4 kf (4) k∞ , (4.19)

where C is a constant and h = maxi (xi+1 − xi ). Therefore, we might anticipate that the error
associated with a cubic spline interpolant behaves like h4 for h small. However, the maximum
absolute error associated with a natural cubic spline behaves like h2 as h → 0. In contrast,
the maximum absolute error for a cubic spline based on correct endpoint second derivatives
or on the not-a-knot conditions behaves like h4 . Unlike the natural cubic spline, the correct
second derivative value and not-a-knot cubic splines reproduce cubic polynomials.

Example 4.2
Find the natural cubic spline which interpolates the data

xi 0.0 0.1 0.3 0.6

fi 0.0000 0.2624 0.6419 1.0296

With h0 = 0.1, h1 = 0.2, and h2 = 0.3 we obtain the following linear system of equations for
the unknowns M1 and M2 :

0.3 0.2
M1 + M2 = 1.8975 − 2.6240
3 6
0.2 0.5
M1 + M2 = 1.2923 − 1.8975
6 3

It has the solution M1 = −6.4871, M2 = −2.3336. Then, the natural cubic spline is

3
 −11.0812 x + 2.7321 x
 x ∈ [0, 0.1]
s(x) = 3.4613(x − 0.1)3 − 3.2436(x − 0.1)2 + 2.4078(x − 0.1) + 0.2624 x ∈ [0.1, 0.3]
 1.2964(x − 0.3)3 − 1.1668(x − 0.3)2 + 1.5257(x − 0.3) + 0.6419 x ∈ [0.3, 0.6]


We can simplify substantially the rather time-consuming computations if the data points are
equally spaced so that xi = x0 + ih, i = 1, . . . , n. Then we obtain the following linear system
4.2. Bivariate interpolation 45

of equations for the coefficients Mi , i = 1, . . . , n − 1 of the natural cubic spline:

 
4 1 0 ... ... 0    
 1 4 M 1 f [x 0 , x 1 , x2 ]
1 0 ... 0 
 M2  6  f [x1 , x2 , x3 ]
    
 . ..  
 .. .  .. = h .. (4.20)
    
. .
 

 0 ... 0 1 4 1 
   
Mn−1 f [xn−2 , xn−1 , xn ]
0 ... ... 0 1 4

Example 4.3
1
Find the natural cubic spline interpolant to the function f (x) = 1+x2
from the following
table:

xi −1 −0.5 0 0.5 1
fi 0.5 0.8 1 0.8 0.5

Compute s(0.8) and the error f (0.8) − s(0.8).

It is n = 4, h = 12 . Moreover, for a natural spline we have M0 = M4 = 0. Further, we
obtain f [x0 , x1 , x2 ] = −0.2, f [x1 , x2 , x3 ] = −0.8, f [x2 , x3 , x4 ] = −0.2. This gives the linear
system of equations     
4 1 0 M1 −2.4
 1 4 1   M2  =  −9.6 
    
0 1 4 M3 −2.4
with the solution M1 = 0, M2 = −2.4, M3 = 0. Thus, the natural cubic spline interpolant to
the data is:


 s0 (x) = 0.5 + 0.6(x + 1) x ∈ [−1, −0.5]
s1 (x) = 0.8 + 0.6(x + 0.5) − 0.8(x + 0.5)3

 x ∈ [−0.5, 0]
s(x) =

 s2 (x) = 1 − 1.2x2 + 0.8x3 x ∈ [0, 0.5]

s3 (x) = 0.8 − 0.6(x − 0.5) x ∈ [0.5, 1]


It is s(0.8) = s3 (0.8) = 0.8 − 0.6 · 0.3 = 0.62 and the absolute error is |f (0.8) − s(0.8)| =
|0.61 − 0.62| = 0.01.

Exercise 4.4
Find the natural cubic spline which interpolates the values ln(1+2x) at x = 0, 0.1, 0.2, 0.3, 0.4,
and 0.5. Use this spline to estimate the values of ln(1.1), ln(1.3), ln(1.5), ln(1.7), and ln(1.9).
Compare the results with those using the linear spline.

4.2 Bivariate interpolation

The interpolation problem can be generalized to two dimensions (bivariate interpolation):
46 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases

Definition 4.5 (Bivariate interpolation)

Given data points {xi : i = 1, . . . , n} with x ∈ D̄ ⊂ R2 and function values {fi : i = 1, . . . , n},
find a (continuous) function p such that

p(xi ) = fi , i = 1, . . . , n. (4.21)

Geometrically, we can interpret this as finding an approximation of a surface in R3 . The

generalization to N -dimensions follows similarly.
In many cases the domain D̄ is a rectangle and the data points lie on a rectangular grid.
It may also happen that D̄ is of unusual shape and the data points are irregularly scattered
throughout D̄.
In the following we will discuss bivariate interpolation methods for regular and scattered
data. To distinguish between regular and scattered data is also interesting for so-called two-
stage processes in which we first construct an interpolation function based on scattered data
and then use this interpolation function to generate data on a grid for the construction of
another, perhaps smoother or more convenient interpolation function.
A convenient and common approach to the construction of a two-dimensional interpolation
function is the following: assume p is a linear combination of certain basis functions bk , i.e.,
n
X
p(x) = ck bk (x), x ∈ R2 . (4.22)
k=1

Solving the interpolation problem under this assumption leads to a system of linear equations
of the form
A c = f, (4.23)
where the entries of the matrix A are

aij = bj (xi ), i, j = 1, . . . , n. (4.24)

A unique solution of the interpolation problem will exist if and only if det(A) 6= 0. In one-
dimensional interpolation, it is well-known that one can interpolate to arbitrary data at n
distinct points using a polynomial of degree n − 1. However, it can be shown that det(A) 6= 0
does not hold for arbitrary distributions of distinct data points in two (or more) dimensions.
Hence, it is not possible to perform a unique interpolation with bivariate polynomials of a
certain degree for data given at arbitrary locations in R2 . If we want to have a well-posed
(i.e. uniquely solvable) bivariate interpolation problem for scattered data, then the basis
needs to depend on the data (what this means will be explained later). There are a very few
exceptions of this rule, for specific data distributions, choice of basis functions, and choice of
the orientation of the Cartesian coordinate system.
In the following we will discuss bivariate interpolation. We will start with polynomial
interpolation, which is suitable for gridded data. For scattered data, basis functions will be
used that depend on the data, the so-called radial basis functions. Points in R2 are described
using Cartesian coordinates, i.e., x = (x, y)T .
4.2. Bivariate interpolation 47

4.2.1 Tensor product polynomial interpolation

The bivariate interpolation problem is much more complicated than the univariate one even
when restricting to polynomial interpolation. One reason is that it is not so easy to guarantee
that the problem has a unique solution. For instance, the linear bivariate interpolation
polynomial p(x, y) = a00 + a10 x + a01 y may uniquely be determined by the function values at
three data points {(xi , yi ) : i = 1 . . . , 3}. However, if the three data points lie on a line, then
the problem does not have a unique solution. Therefore, to guarantee that a unique solution
of the problem of polynomial interpolation exists, we have to impose some constraints on the
location of the interpolation points and/or the domain.
Of particular practical interest is the case where the data points lie on a rectangular lattice
in the (x, y)-coordinate plane and function values are assigned to each point. Thus, suppose
there are (n+1)(m+1) points (xi , yj ), 0 ≤ i ≤ n, 0 ≤ j ≤ m in the lattice, n+1 in x-direction
and m + 1 in y-direction (this automatically implies that the coordinate axes are supposed to
be drawn parallel to the edges of the lattice; this pre-requisite is important to get a unique
solution!). Then, each point is assigned coordinates (xi , yj ), where x0 < x1 < . . . < xn−1 < xn
is the grid X and y0 < y1 < . . . < ym−1 < ym is the grid Y . The interpolation problem then
requires to determine a function p = p(x, y) defined on the rectangle [x0 , xn ] × [y0 , ym ], for
which P (xi , yj ) = fij for each i and j. If the xi ’s and the yj ’s are uniformly spaced, the
lattice is called uniform.
The basic idea is to make use of the results of one-dimensional interpolation. Several
representations of one-dimensional polynomials have been considered in the previous sec-
tion: the monomial basis, the Lagrange basis, the Newton basis, and the basis of orthogonal
polynomials. Now we consider products of these basis functions, i.e., we consider the set of
(n + 1) · (m + 1) functions of the form

bij (x, y) = ϕi (x) · ψj (y), i = 0, . . . , n, j = 0, . . . , m. (4.25)

For ϕi and ψj we may use any of the representations mentioned before. In particular, if we
choose the Lagrange representation, it is
(X) (Y )
bij (x, y) = li (x) · lj (y), (4.26)

where
n n
(X)
Y x − xj (Y )
Y y − yk
li (x) = , lj (y) = , (4.27)
xi − xj yj − yk
j=0,j6=i k=0,k6=j

and (
(X) (Y ) 1 if k = i and l = j
bij (xk , yl ) = li (xk ) · lj (yl ) = . (4.28)
0 otherwise
Hence, the functions bij (x, y) behave like the univariate Lagrange basis functions, but are
based now on the rectangular lattice. Then, the bivariate interpolation polynomial is given
48 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases

by
n X
X m
p(x, y) = fij bij (x, y). (4.29)
i=0 j=0

Defining a data matrix F with entries Fij = fij and vectors of basis functions lX (x) =
T T
(X) (X) (Y ) (Y )
l0 (x) . . . ln (x) and lY (y) = l0 (y) . . . lm (y) , we can write the bivariate interpo-
lation polynomial also as
T
p(x, y) = lTX (x) F lY (y). (4.30)

(X) (Y )
Note that in general lk (x) 6= lk (y) unless X = Y .
Instead of using the Lagrange representation of the univariate polynomials in x and y,
we could also use the Newton representation or the monomial representation. However, as
already mentioned in the section on univariate interpolation, the Newton representation has to
be preferred, because it is less computationally intensive than the other two representations.
While in this manner quite acceptable interpolation surfaces result for small values of n
and m, for larger values of n and m and/or certain configurations of data (i.e., equidistant
data), the resulting surfaces have a very wavy appearance, due to strong oscillations along
the boundary of the lattice. This phenomenon has already been discussed in the section on
univariate interpolation, and all statements also apply to bivariate polynomial interpolation.
Therefore, for larger values of n and m it is better to use piecewise polynomials, e.g.,
splines. Howeover, the approach described above can only be used for Overhauser splines in
x and y, but not for instance for the cubic spline. Theoretically, it would also be possible to
use different basis functions for x and for y, though this must be justified by some a-priori
information about the behavior of the function f in x and y.
The approach outlined before is called tensor-product interpolation. Existence and unique-
ness of the tensor product interpolation is stated in the following theorem:

Theorem 4.6
Let ϕ0 , . . . , ϕn be a set of functions and x0 < x1 < . . . < xn be a set of points with the property
that, for any f0 , f1 , . . . , fn ,there exist unique numbers α0 , . . . , αn such that ni=0 αi ϕi (xk ) =
P

fk for k = 1, . . . , n. Let ψ0 , . . . , ψm have the corresponding property with respect to points

y0 < y1 < . . . < ym and define

bij (x, y) = ϕi (x) · ψj (y), i = 0, . . . , n, j = 0, . . . , m. (4.31)

Then, given any set of numbers fij there exists a unique corresponding set of numbers aij
such that the function
Xn X m
p(x, y) = aij bij (x, y) (4.32)
i=0 j=0

satisfies the interpolation conditions p(xk , yl ) = fkl for each k and l.

4.2. Bivariate interpolation 49

Example 4.7
Find the bilinear interpolation polynomial from the data f (0, 0) = 1, f (1, 0) = f (0, 1) =
f (1, 1) = 0.
The bilinear interpolation polynomial p(x, y) is given by:
1 X
X 1
p(x, y) = fij li (x)lj (y)
i=0 j=0

The Lagrange polynomials are:

l0 (x) = 1 − x, l1 (x) = x, l0 (y) = 1 − y, l1 (y) = y

Therefore, we obtain

p(x, y) = (x − 1)(y − 1) = 1 − x − y + xy.

The surface p(x, y) = constant is called a “hyperbolic paraboloid”.

Exercise 4.8
Interpolate f (x, y) = sin(πx) sin(πy) on (x, y) ∈ [0, 1] × [0, 1] on a rectangular grid with the
step size hx = hy = 21 using a polynomial which is quadratic in x and y.

4.2.2 Patch interpolation

A special case of polynomial interpolation is based on a triangulation of the interpolation
domain. Normally rectangular or triangular elements are used for that. Whereas triangular
elements can also be used for scattered data, the use of rectangular elements is restricted to
gridded data.
Let us first consider a rectangular element, as in figure 4.1, with four nodes P1 , . . . , P4 .
If the function values are given at these four nodes, we can easily determine an interpolating
function. Since 4 nodes determine uniquely an interpolation function with 4 parameters, a
reasonable ‘Ansatz’ might be:

p(x, y) = a0 + a1 x + a2 y + a3 xy

Let us assume that the points P1 , P2 , P3 , and P4 have the local coordinates (−1, 1), (1, 1),
(−1, −1), and (1, −1), respectively. Then, the interpolation condition reads

p(xi , yi ) = fi , i = 1, . . . , 4

or     
f1 1 −1 1 −1 a0
 f2   1 1 1 1  a1 
=
    
1 −1 −1
  
 f3   1  a2 
f4 1 1 −1 −1 a3
50 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases

P4 P3

P1 P2

Figure 4.1: Rectangular patch.

This linear system of equations to determine the coefficients ai can be written as

f =A·a

Formally,
a = A−1 f ,
and we can write the interpolation polynomial on the rectangle as

p(x, y) = aT b(x, y) = bT (x, y)a = bT (x, y)A−1 f =: S(x, y)f

The vector S(x, y) obtains the so-called “shape functions”. They can be calculated without
knowing the function values at the nodes. In our case:

bT (x, y) = (1, x, y, xy),

and we obtain
   
1 − x + y − xy (1 + y)(1 − x)
 1 + x + y + xy   (1 + y)(1 + x) 
S(x, y) = bT (x, y) A−1 = 1
= 1
(4.33)
   
4 
 1 − x − y + xy  4 
 (1 − y)(1 − x)


1 + x − y − xy (1 − y)(1 + x)

Then, we can represent the interpolation function in the form

4
X
p(x, y) = fi Si (x, y) (4.34)
i=1
4.2. Bivariate interpolation 51

1 5 2

8 6

3 7 4

Figure 4.2: Quadratic Serendipity element.

Example 4.9 (Rectangular patch)

Given four data points P1 = (1.0, 0.5), P2 = (2.0, 0.5), P3 = (1.0, 0.2), P4 = (2.0, 0.2) and
the function values f1 = 1.703, f2 = 3.943, f3 = 0.640, f4 = 1.568. Calculate the bilinear
interpolation polynomial.
With respect to the standard rectangular patch, Fig 4.1, (coordinates (x, y)), the bilinear
interpolating polynomial is given by (4.34) with the basis functions given by (4.33). The
mapping of the standard rectangular patch onto the given rectangular patch (coordinates
(ξ, η)) is given by
20 7
x = 2ξ − 3, y = η − .
3 3
Therefore, in coordinates (ξ, η), the basis functions are
 
−2(5η − 1)(ξ − 2)
1 2(5η − 1)(ξ − 1) 
S̃(ξ, η) = S(x(ξ), y(η)) =  ,
 
3 5(2η − 1)(ξ − 2) 
−5(2η − 1)(ξ − 1)

and the bilinear interpolating polynomial is

4
X
p(ξ, η) = fi S̃i (ξ, η).
i=1

If the function values at the mid-side nodes (as figure 4.2 shows) are known as well, we obtain
with
bT (x, y) = (1, x, y, xy, x2 , y 2 , x2 y, xy 2 )
52 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases
 
−(y + 1)(1 − x)(1 + x − y)

 −(y + 1)(1 + x)(1 − x − y) 

−(1 − x)(1 − y)(1 + x + y)
 
 
 
1
 −(1 + x)(1 − y)(1 − x + y) 
S(x, y) = 4  
 2(1 − x)(1 + x)(1 + y) 

2(1 + x)(1 + y)(1 − y)
 
 
2(1 + x)(1 − x)(1 − y)
 
 
2(1 + y)(1 − x)(1 − y)

Exercise 4.10 (Quadratic Serendipity element)

Given 8 data points and corresponding function values. Calculate the quadratic polynomial
of Serendipity-type, which interpolates the data.

1 2 3 4 5 6 7 8
ξi 1.0 2.0 1.0 2.0 1.5 2.0 1.5 1.0
ηi 0.5 0.5 0.2 0.2 0.5 0.35 0.2 0.35
fi 1.703 3.943 0.640 1.568 2.549 2.780 0.990 1.181

We may also consider triangular elements, e.g. the standard triangle (0, 0), (1, 0), (0, 1), as
shown in figure 4.3. If we know the function values at the vertices, we can uniquely determine
a linear interpolator:
p(x, y) = a0 + a1 x + a2 y
using the interpolation condition

p(xi , yi ) = fi i = 1, 2, 3.

This yields the linear system of equations

    
f1 1 0 0 a0
 f2  =  1 1 0   a1 
    
f3 1 0 1 a2

With bT (x, y) = (1, x, y) we obtain

   
1 0 0 1−x−y
S(x, y) = (1, x, y)  −1 1 0  =  x .
   
−1 0 1 y

Thus, the linear interpolator is

3
X
p(x, y) = fi si (x, y) = f1 (1 − x − y) + f2 x + f3 y
i=1
4.2. Bivariate interpolation 53

Following the same procedure, a quadratic interpolator

p(x, y) = a0 + a1 x + a2 y + a3 x2 + a4 xy + a5 y 2

can be determined if in addition the function values at the three mid-sides are known, as is
shown in figure 4.3.
y y

P3 P3

P6 P5

x x

P1 P2 P1 P4 P2

Figure 4.3: Linear triangular element (left) and quadratic triangular element (right).

With bT (x, y) = (1, x, y, x2 , xy, y 2 ), we obtain

    
f1 1 0 0 0 0 0 a0

 f2  
  1 1 0 1 0 0 
 a1 

 f3   1 0 1 0 0 1  a2 
=
    
 1 1  
 f4   1 2 0 4 0 0  a3 
1 1 1 1 1
    
 f5   1 2 2 4 4 4
 a4 
1 1
f6 1 0 2 0 0 4 a5

and finally
 
(1 − x − y)(1 − 2x − 2y)

 x(2x − 1) 

 y(2y − 1) 
S(x, y) =  .
 
 4x(1 − x − y) 
 
 4xy 
4y(1 − x − y)

If we have a triangle in general position as in figure 4.4, we perform simply a parameter

54 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases

v y

P3
P3a

u x

P1a P2a

Figure 4.4: General triangle and transformed triangle.

transformation (x, y) → (u, v): if Pi = (ui , vi ) and Pi0 = (xi , yi ), we have

(
u = u1 + (u2 − u1 ) x + (u3 − u1 ) y
⇒
v = v1 + (v2 − v1 ) x + (v3 − v1 ) y

 x = ((v1 − v3 )(u − u1 ) + (u3 − u1 )(v − v1 )) /N

y = ((v2 − v1 )(u − u1 ) + (u1 − u2 )(v − v1 )) /N

 N := u (v − v ) + u (v − v ) + u (v − v )
2 1 3 1 3 2 3 2 1

Thus,
S(x, y) = S (x(u, v), y(u, v)) = S∗ (u, v) (4.35)
Then X X
p∗ (u, v) = fi Si∗ (u, v) = p(x, y) = fi Si (x, y) (4.36)
i i
is the interpolant in (u, v)-coordinates.
In an analogous manner we can transform a general parallelogram onto the standard
rectangle [−1, 1] × [−1, 1], as figure 4.5 shows.
v y

P4 P3 Q4 Q3

P1 P2

u Q1 Q2

Figure 4.5: Parallelogram and standard rectangle.

4.2. Bivariate interpolation 55

4.2.3 Radial function interpolation

We mentioned already that the bivariate interpolation problem has in general no solution.
Only for special arrangements of data points parallel to the coordinate axes of the Cartesian
coordinate system we could find uniquely solvable interpolation problems. We also mentioned
that uniqueness and existence of the bivariate interpolation problem for arbitrary distinct data
points requires a different approach: instead of taking linear combinations of a set of basis
functions that are independent of the data points, one takes a linear combination of translates
of a single basis function that is radially symmetric about its centre. This approach is referred
to as the radials basis function (RBF) method. This method is now one of the primary tools
for interpolating multi-dimensional scattered data. Its simple form, and ability to accurately
approximate an underlying function have made the method particularly popular.

Definition 4.11 (Radial function)

A function φ : R2 → R is called radial provided there exists a univariate function ϕ : [0, ∞) →
R such that
Φ(x) = ϕ(r), r = kxk, (4.37)
and k · k is some norm on R2 , usually the Euclidean norm. Hence,

kx1 k = kx2 k ⇒ Φ(x1 ) = Φ(x2 ). (4.38)

The interpolation problem using radial basis functions (RBF’s) can be formulated as
follows:
Definition 4.12 (Basic RBF method)
Given a set of n distinct data points {xi : i = 1, . . . , n} and data values {fi : i = 1, . . . , n},
the basic RBF interpolant is given by
n
X
s(x) = aj Φ(kx − xj k), (4.39)
j=1

where Φ(r), r ≥ 0 is some radial function. The coefficients aj are determined from the
interpolation conditions
s(xj ) = fj , j = 1, . . . , n, (4.40)
which leads to the following symmetric linear system:

A a = f. (4.41)

The entries of the matrix A are given by

Aij = Φ(kxi − xj k). (4.42)

Some common examples of the Φ(r) that lead to a uniquely solvable method (i.e., to a
non-singular matrix A) are:
56 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases

1. Gaussian:
2
Φ(r) = e−(εr) . (4.43)

2. Inverse Quadratic (IQ):

1
Φ(r) = . (4.44)
1 + (εr)2
3. Inverse Multiquadric (IMQ)
1
Φ(r) = p . (4.45)
1 + (εr)2

4. Multiquadric (MQ): p
Φ(r) = 1 + (εr)2 . (4.46)

5. Linear:
Φ(r) = r. (4.47)

In all cases, ε is a free parameter to be chosen appropriately. It controls the shape of the
functions: as ε → 0, radial functions become more flat. At this point, assume that it is
some fixed non-zero real value. The cubic RBF and the very popular thin-plate spline RBF,
defined by
6. Cubic RBF:
Φ(r) = r3 . (4.48)

7. Thin-plate spline:
Φ(r) = r2 log r. (4.49)

do not provide a regular matrix A unless some additional restrictions are met, which leads to
the so-called augmented RBF method, which is discussed next. Before we define this method
we need to make one more definition:
Definition 4.13
Let Πm (R2 ) be the space of all bivariate polynomials that have degree less than or equal to
m. Furthermore, let M denote the dimension of Πm (R2 ), then
1
M = (m + 1)(m + 2). (4.50)
2
For instance, a basis of Π1 comprises the polynomials 1, x, and y; this space has dimension
M = 3; a basis of Π2 comprises the polynomials 1, x, y, x2 , xy, and y 2 ; the dimension of the
space is M = 6. In general, any function f ∈ Πm can be written as
X
f (x, y) = aij xi y j , aij ∈ R. (4.51)
0≤i+j≤m

Now we are ready to define the augmented RBF method.

4.2. Bivariate interpolation 57

Definition 4.14 (Augmented RBF method)

Given a set of n distinct points {xi : i = 1, . . . , n} and data values {fi : i = 1, . . . , n}, the
augmented RBF interpolant is given by
n
X M
X
s(x) = λj Φ(kx − xj k) + γk pk (x), x ∈ R2 , (4.52)
j=1 k=1

where {pk (x) : k = 1, . . . , M } is a basis of the space Πm (R2 ) and Φ(r), r ≥ 0 is some
radial function. To account for the additional polynomial terms, the following constraints
are imposed:
Xn
λj pk (xj ) = 0, k = 1, . . . , M. (4.53)
j=1

The expansion coefficients λj and γk are then determined from the interpolation conditions
and the constraints (4.53): ! ! !
A P λ f
· = . (4.54)
PT 0 γ 0
P is the n × M matrix with entries Pij = pj (xi ) for i = 1, . . . , n and j = 1, . . . , M .
It can be shown that the augmented RBF method is uniquely solvable for the cubic and
thin-plate spline RBFs when m = 1 and the data points are such that the matrix P has
rank(P) = M . This is equivalent to saying that for a given basis {pk (x) : k = 1, . . . , M } for
Πm (R2 ) the data points {xj : j = 1, . . . , n} must satisfy the condition
M
X
γk pk (xj ) = 0 ⇒ γ = 0, (4.55)
k=1

for j = 1, . . . , n. For m = 1, it is M = 3, and the RBF interpolant is augmented with a

constant and linear bivariate polynomial, i.e., it is

p1 (x) = 1, p2 (x) = x, p3 (x) = y, (4.56)

and the RBF interpolant is given by

n
X
s(x, y) = λj Φ(kx − xj k) + γ0 + γ1 x + γ2 y. (4.57)
j=1

Hence, we see that the augmented RBF method is much less restrictive than the basic RBF
method on the functions Φ(r) that can be used; however, it is far more restrictive on the data
points {xj : j = 1, . . . , n} that can be used. Remember that the only restriction on the data
points for the basic method is that the points are distinct.
Let us make some remarks related to the effect of ε and n on the stability of the basic
and/or augmented RBF method. For the infinitely smooth Φ(r) (i.e., for the Gaussian, the
58 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases

inverse quadratic, the inverse multiquadric, and the multiquadric RBF), the accuracy and
the stability depend on the number of data points n and the value of the shape parameter ε.
For a fixed ε, as the number of data points increases, the RBF interpolant converges to the
underlying (sufficiently smooth) function being interpolated at a spectral rate, i.e., O(e−c/h ),
where c is a constant and h is a measure of the typical distance between data points. The
2
Gaussian RBF exhibits even ’super-spectral’ convergence, i.e., O(e−c/h ). In either case, the
value of c in the estimates is effected by the value of ε. For a fixed number of data points, the
accuracy of the RBF interpolant can often be significantly improved by decreasing the value
of ε. However, decreasing ε or increasing the number n of data points has a severe effect on
the stability of the linear system (4.41) and (4.54), respectively. For a fixed ε, the condition
number of the matrix in the linear systems grows exponentially as the number of data points
is increased. For a fixed number of data points, similar growth occurs as ε → 0.
A very important feature of the RBF method is that its complexity does not increase
as dimension of the interpolation increases. Their simple form makes implementing the
methods extremely easy compared to, for example, a bicubic spline method. However, main
computational challenges are that i) the matrix for determining the interpolation coefficients,
(4.41) and (4.54) is dense, which makes the computational cost of the methods high; ii) the
matrix is ill-conditioned when the number of data points is large; iii) for the infinitely smooth
RBFs and a fixed number of data points, the matrix is also ill-conditioned when ε is small.
There are techniques available, which address these problems. However, they are beyond the
scope of this course.

4.2.4 Bicubic spline interpolation (?? not examined)

In univariate interpolation, the cubic spline offers a good interpolant also for a large number
of data points. The idea of a cubic spline can also be generalized to the bivariate case if
the data are given on a lattice. The corresponding spline is called a bicubic spline, denoted
S(x, y). A bicubic spline on a lattice with n + 1 points in x and m + 1 points in y is defined
by the following properties:

(1) S fulfills the interpolation property

S(xi , yj ) = fij , i = 0, 1, . . . , n, j = 0, 1, . . . , m.

∂2S
(2) S ∈ C 1 (D), continuous on D.
∂x∂y
(3) S is a bicubic polynomial within each rectangle Dij :

Dij := {(x, y) : xi ≤ x ≤ xi+1 , i = 0, . . . , n − 1; yj ≤ y ≤ yj+1 , j = 0, . . . , m − 1}.

(4) S fulfills certain conditions on the boundary of the domain D̄ which still have to be
defined.
4.2. Bivariate interpolation 59

Because of (3), the bicubic spline function S(x, y) has on (x, y) ∈ Dij the representation
3 X
X 3
S(x, y)|Dij =: Sij = aijkl (x − xi )k (y − yj )l ,
k=0 l=0
(x, y) ∈ Dij , i = 0, . . . , n − 1, j = 0, . . . , m − 1 (4.58)

The 16m · n coefficients aijkl have to be determined such that the conditions (1) and (2) are
fulfilled. To determine them uniquely we must formulate certain boundary conditions like in
the one-dimensional case. One possibility is to prescribe the following partial derivatives of
S:
∂S
(xi , yj ) =: pij = aij10 , i = 0, n, j = 0, 1, . . . , m
∂x
∂S
(xi , yj ) =: qij = aij01 , i = 0, 1, . . . , n, j = 0, m (4.59)
∂y
∂2S
(xi , yj ) =: rij = aij11 , i = 0, n, j = 0, m
∂x∂y
They can be calculated approximately using one-dimensional splines or other interpolation
methods. For example one-dimensional splines through three points each and their derivatives
can be used to approximate the boundary conditions. The following algorithm assumes that
the boundary values (4.59) are given. It determines all coefficients aijkl in 9 steps:
Step 1: Calculation of aij10 = pij for i = 1, . . . , n − 1 and j = 0, . . . , m:

1 1 1 1
ai−1,j10 + 2 + aij10 + ai+1,j10 =
hi−1 h i−1 hi hi
3 3
= 2 (aij00 − ai−1,j00 ) + 2 (ai+1,j00 − aij00 ),
hi−1 hi
i = 1, . . . , n − 1, j = 0, . . . , m (4.60)

where hκ = xκ+1 − xκ , κ = 0, 1, . . . , n − 1. These are m + 1 linear systems, each

with n − 1 equations for n + 1 unknowns. By prescribing the 2(m + 1) quantities
pij = aij10 , i = 0, n, j = 0, . . . , m, these systems are uniquely solvable.

Step 2: Calculation of aij01 = qij for i = 0, . . . , n and j = 1, . . . , m − 1 with

1 1 1 1
ai,j−1,01 + 2 + aij01 + ai,j+1,01 =
hj−1 hj−1 hj hj
3 3
= 2 (aij00 − ai,j−1,00 ) + 2 (ai,j+1,00 − aij00 ),
hj−1 hj
i = 0, . . . , n, j = 1, . . . , m − 1 (4.61)

where hκ = yκ+1 − yκ , κ = 0, 1, . . . , m − 1. With the prescribed 2(n + 1) boundary

values qij = aij01 , i = 0, . . . , n, j = 0, m, these systems are uniquely solvable.
60 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases

Step 3: Calculation of aij11 = rij for i = 1, 2, . . . , n − 1 and j = 0, m from the system

1 1 1 1
ai−1,j11 + 2 + aij11 + ai+1,j11 =
hi−1 hi−1 hi hi
3 3
= 2 (aij01 − ai−1,j01 ) + 2 (ai+1,j01 − aij01 ),
hi−1 hi
i = 1, 2, . . . , n − 1, j = 0, m (4.62)
where hκ = xκ+1 − xκ . The values of the four corner points a0011 , an011 , a0m11 , and
anm11 are given.

Step 4: Calculation of the derivatives rij = aij11 for i = 0, 1, . . . , n and j = 1, 2, . . . , m − 1

with

1 1 1 1
ai,j−1,11 + 2 + aij11 + ai,j+1,11 =
hj−1 hj−1 hj hj
3 3
= 2 (aij10 − ai,j−1,10 ) + 2 (ai,j+1,10 − aij10 ),
hj−1 hj
i = 0, 1, . . . , n, j = 1, 2, . . . , m − 1 (4.63)
where hκ = yκ+1 − yκ . The required values of the boundary points aij11 , i = 1, . . . , n −
1, j = 0, m, have been calculated in step 3. The corner points aij11 , i = 0, n, j = 0, m
are prescribed.

Step 5: Determination of the matrix G(xi )−1 . Because

 
1 0 0 0
 0 1 0 0 
G(xi ) = 
 
 1 hi h2i h3i


0 1 2hi 3h2i
with det G(xi ) = h4i 6= 0, hi = xi+1 − xi , i = 0, . . . , n − 1, the inverse G(xi )−1 exists
and can explicitly be calculated:
 
1 0 0 0
0 1 0 0
 
G(xi )−1 = 
 
 − 32 − h2 3
− h1i

h2i

 hi i 
2 1
h3 h2
− h23 1
h2
i i i i

−1
Step 6: Determination of the matrix G(yj )T . Because
 
1 0 0 0
 0 1 0 0 
G(yj ) = 
 
 1 hj hj 2 h3j


0 1 2hj 3h2j
4.2. Bivariate interpolation 61

−1
with det G(yj ) = h4j 6= 0, hj = xj+1 − xj , j = 0, . . . , m − 1, the inverse G(yj )T
exists:
0 − h32 2
 
1 h3j
j
1 − h2j 1
 
 0
−1  h2j

G(yj )T =

3
− h23

 0 0 h2j

 j 
1 1
0 0 −j h2j

Step 7: Determination of the matrices Mij , i = 0, . . . , n − 1, j = 0, . . . , m − 1 as

 
aij00 aij01 ai,j+1,00 ai,j+1,01
 a aij11 ai,j+1,10 ai,j+1,11 
ij10
Mij = 
 

 ai+1,j00 ai+1,j01 ai+1,j+1,00 ai+1,j+1,01 
ai+1,j10 ai+1,j11 ai+1,j+1,10 ai+1,j+1,11

Step 8: Calculation of the coefficient matrices Aij :

−1
Aij = {aijkl }3k=0, 3l=0 = G(xi )−1 Mij G(yj )T

Step 9: Formation of the bicubic spline function Sij (x, y) for all rectangles Dij with equa-
tion (4.58).

The boundary conditions can be determined for instance by fitting one-dimensional splines
through three points each: Through the points (xi , fij ), i = 0, 1, 2, n − 2, n − 1, n we fit
for j = 0, . . . , m one-dimensional natural cubic splines and calculate the derivatives; they
provide the pij at the boundary. Through the points (yj , fij ), j = 0, 1, 2, m − 2, m − 1, m
for i = 0, . . . , n we fit the same type of spline functions and calculate the derivatives; they
provide the qij at the boundary. To calculate the rij = aij11 for i = 0, n, j = 0, m we fit
one-dimensional natural splines through (xi , qij ) for i = 0, 1, 2, n − 2, n − 1, n and j = 0, m
and determine the derivatives.
62 Chapter 4. Advanced Interpolation: Splines, Multi-dimensions and Radial Bases
Chapter 5

Least-squares Regression

In the chapters on interpolation we discussed the approximation of a given function by an-

other function that exactly reproduced the given data values. We considered both univariate
functions (i.e., functions which depend on one variable) and bivariate functions (i.e. func-
tions which depend on two variables). The function that interpolates given data has been
called “interpolant”. By an “approximant” (or “approximating function”) we mean a func-
tion which approximates the given data values as well as possible (in some sense we need to
agree upon later) but does not necessarily reproduce the given data values exactly. That is,
the graph of the approximant will not in general go through the data points, but will be close
to them — this is called regression.

A justification for approximation rather than interpolation is the case of experimental or

statistical data. The data from experiments are normally subject to errors. The interpolant
would exactly reproduce the errors. The approximant, however, allows to adjust for these
errors such that a smooth function results. In the presence of noise it would even be foolish
and, indeed, inherently dangerous, to attempt to determine an interpolant, because it is very
likely that the interpolant oscillates violently about the curve or surface which represents the
true function. Another justification is that there may be so many data points that efficiency
considerations force us to approximate from a space spanned by fewer basis functions that
data points.

In regression, as for interpolation, we consider the case where a function has to be recov-
ered from partial information, e.g. when we only know (possibly noisy) values of the function
at a set of points.

63
64 Chapter 5. Least-squares Regression

5.1 Least-squares basis functions

The most commonly used classes of approximating functions are functions that can be written
as linear combinations of basis functions ϕi , i = 1, . . . , M , i.e., approximants of type:
M
X
φ(x) = ci ϕi (x) = cT ϕ(x), (5.1)
i=1

with T
c := c1 c2 . . . cM , (5.2)
and T
ϕ(x) := ϕ1 (x) ϕ2 (x) . . . ϕM (x) ,
exactly the same as for interpolation. In the following we will restrict to this type of ap-
proximants. Moreover, we will assume that the function we want to approximate is at least
continuous.
Where least-squares differs from interpolation is that the number of basis functions M is
in general less than the number of data points N ,

M ≤ N,

which means we will have more constraints than free variables, and therefore we will not be
able to satisfy all constraints exactly.
Commonly used classes of (univariate) basis functions are:

• algebraic polynomials:
ϕ1 = 1, ϕ2 = x, ϕ3 = x2 , . . .

• trigonometric polynomials:

ϕ1 = 1, ϕ2 = cos x, ϕ3 = sin x, ϕ4 = cos 2x, ϕ5 = sin 2x, . . .

• exponential functions:
ϕ1 = 1, ϕ2 = eα1 x , ϕ3 = eα2 x , . . .

• rational functions:
1 1
ϕ1 = 1, ϕ2 = p
, ϕ3 = ,..., pi ∈ N
(x − α1 ) 1 (x − α2 )p2
.

For bivariate regression, radial functions are also popular; they have been introduced in the
previous chapter.
5.2. Least-squares approximation - Example 65

5.2 Least-squares approximation - Example

In practical applications we are interested in getting “good” approximations, i.e. the approx-
imant should not deviate “much” from the given data. However, what is the precise meaning
of “good” and “much”? In other words, how do we measure the quality of the approximation
or, equivalently, how do we measure the error in the approximation?
In order to answer this question let us first discuss a simple example. Suppose we are
given the data in table 5.1. We have 3 data points. Let us look for a straight line, which best
fits in some sense the given data. The equation of the straight line is

i 1 2 3
xi 0 1 2
fi 4.5 3.0 2.0

Table 5.1: Example data set.

φ(x) = a0 + a1 x,

with so far unknown coefficients a0 and a1 . Intuitively, we would like the straight line to be
as close as possible to the function f (x) that generates the data. A measure of the ’closeness’
between φ(x) and f (x) could be based on the difference of the function values of f and φ at
the given data points, i.e., on the quantities

ri := f (xi ) − φ(xi ) = fi − (a0 + a1 xi ), i = 1, 2, 3.

The residuals are shown graphically in Figure 5.1 as the vertical red bars capturing the
distance between samples of f at the nodes, and the approximation φ. The ri ’s are called the
residuals. The least-squares method finds among all possible coefficients a0 and a1 the pair
that minimizes the square sum of the residuals,
3
X
ri2 ,
i=1
P3
i.e., that makes i=1 ri2 as small as possible. This minimization principle is sometimes called
the (discrete) least-squares principle. Of course, other choices also exists: we could, e.g.,
minimize the absolute sum of the residuals,
N
X
|ri |,
i=1

or we could minimize the largest absolute residual:

max(|ri |).
i
66 Chapter 5. Least-squares Regression

Figure 5.1: Example of regression of samples of a function f (x) with the curve φ(x). Red
bars show the residuals, which are minimized in order to solve for φ.

The advantage of the least-squares principle is that it is the only one among the three princi-
ples, which yields a linear system of equations for the unknown coefficients a0 and a1 . That is
the main reason why this principle has become so popular. Let us determine the coefficients
a0 and a1 according to the least-squares principle. We define a function Φ, which is equal to
the square sum of the residuals:
3
X 3
X
Φ(a0 , a1 ) := ri2 = (fi − a0 − a1 xi )2 .
i=1 i=1

We have written Φ(a0 , a1 ) to emphasize that the square sum of the residuals is seen as a
function of the unknown coefficients a0 and a1 . Hence, minimizing the square sum of the
residuals means to look for the minimum of the function Φ(a0 , a1 ). A necessary condition
for Φ to attain a minimum is that the first derivatives with respect to a0 and a1 are equal to
zero:
3
∂Φ X
= −2 (fi − a0 − a1 xi ) = 0,
∂a0
i=1
3
∂Φ X
= −2 (fi − a0 − a1 xi )xi = 0.
∂a1
i=1

This is a system of 2 equations for the 2 unknowns a0 and a1 . It is called normal equations.
The solution of the normal equations is sometimes called the least-squares solution, denoted
â0 and â1 . This is mostly done to emphasize that other solutions are possible, as well, as
5.3. Least-squares approximation - The general case 67

outlined before. Adopting this notation for the least-squares solution, the normal equations
are written as
3
X 3
X
(â0 + â1 xi ) = fi
i=1 i=1
3
X 3
X
2
(â0 xi + â1 xi ) = fi xi ,
i=1 i=1

and in matrix-vector notation

P3 P3 ! ! P3 !
1 x â f
P3i=1 P3i=1 2i 0
= P3 i=1 i . (5.3)
i=1 x i x
i=1 i â 1 i=1 fi xi

Numerically, we find ! ! !
3 3 â0 9.5
= .
3 5 â1 7
The solution is (to 4 decimal places) â0 = 4.4167 and â1 = −1.2500. Hence, the least-squares
approximation of the given data by a straight line is

φ̂(x) = 4.4167 − 1.2500 x.

The least-squares residuals are computed as r̂i = fi − φ̂i , which gives r̂1 = 0.0833, r̂2 =
−0.1667, and r̂3 = 0.0833. The square sum of the (least-squares) residuals is Φ̂ = 3i=1 r̂i2 =
P

0.0417. Notice that the numerical values of the coefficients found before is the choice which
yields the smallest possible square sum of the residuals. No other pair of coefficients a0 , a1
yields a smaller Φ (try it yourself!).
In order to give the normal equations more ’structure’, we can define the following scalar
product of two functions given on a set of N points xi :
N
X
hf, gi := f (xi )g(xi ).
i=1

Obviously, hf, gi = hg, f i, i.e., the scalar product is symmetric. Using this scalar product,
the normal equations (5.3) can be written as (try it yourself!):
! ! !
< 1, 1 > < 1, x > â0 < f, 1 >
= .
< 1, x > < x, x > â1 < f, x >

5.3 Least-squares approximation - The general case

Let us now generalize this approach to an arbitrary number N of given data and an approx-
imation function of type
XM
φ(x) = ai ϕi (x), M ≤ N,
i=1
68 Chapter 5. Least-squares Regression

for given basis functions {ϕi (x)}. Note that we allow for both univariate and bivariate
data. In the univariate case, the location of a data point is uniquely described by 1 variable,
denoted e.g., x; in the bivariate case, we need 2 variables to uniquely describe the location
of a data point, e.g., the Cartesian coordinates (x, y). Please also notice that the number of
basis functions, M , must not exceed the number of data points, N . In least-squares, usually
M N.
Then, the residuals are (cf. the example above),

r(xi ) = f (xi ) − φ(xi ), i = 1, . . . , N,

or, simply, ri = fi − φi , and the least-squares principle is

N N
!
X X
Φ(a) = r(xi )2 = ri2 = hr, ri = minimum, (5.4)
i=1 i=1

where the vector a is defined as a := (a1 , . . . , aM )T . The advantage of the use of a scalar
product becomes clear now, because the normal equations, which are the solution of the
minimization problem (5.4), are
    
hϕ1 , ϕ1 i . . . hϕ1 , ϕM i â1 hf, ϕ1 i
.. ..   ..   ..
 .  =  , (5.5)
 
 . . .
hϕM , ϕ1 i . . . hϕM , ϕM i âM hf, ϕM i
with
N
X
hϕj , ϕk i = ϕj (xi )ϕk (xi ). (5.6)
i=1
Note that the normal equations are symmetric, because

hϕk , ϕj i = hϕj , ϕk i.

The solution of the normal equations yields the least-squares estimate of the coefficients a,
denoted â, and the discrete least-squares approximation is the function
M
X
φ̂(x) = âi ϕi (x).
i=1

The smallness of the square sum of the residuals,

hr̂, r̂i = hf − φ̂, f − φ̂i

can be used as a criterion for the efficiency of the approximation. Alternatively, the so-called
root-mean-square (RMS) error in the approximation is also used as a measure of the fit of
the function φ̂(x) to the given data. It is defined by
r
hr̂, r̂i
σRM S := ,
N
5.3. Least-squares approximation - The general case 69

where r̂ = f − φ̂, and

N
X 2
hr̂, r̂i = f (xi ) − φ̂(xi ) .
i=1

When univariate algebraic polynomials are used as basis functions it can be shown that
N ≥ M together with the linear independence of the basis functions guarantee the unique
solvability of the normal equations. For other types of univariate basis functions this is not
guaranteed. For N < M the uniqueness gets lost. For N = M we have interpolation and
φ̂(xi ) = f (xi ) for i = 1, . . . , N .
For the bivariate case using radial functions as basis functions we have the additional
problem that we have less basis functions (namely M ) than we have data sites (namely N ).
Hence, we cannot place below every data point a basis function. Therefore, we need to have
a strategy of where to locate the radial functions. This already indicates that the question
whether the normal equations can be solved in the bivariate case with radial functions is
non-trivial. In fact, we can only guarantee unique solvability of the normal equations for
certain radial functions (i.e., multiquadrics, Gaussian) and severe restrictions to the location
of the centres of the radial functions (they must be sufficiently well distributed over D in
some sense) and to the location of the data sites (they must be fairly evenly clustered about
the centres of the radial functions with the diameter of the clusters being relatively small
compared to the separation distance of the data sites). Least-squares approximation with
radial basis functions is not subject of this course.
The method of (discrete) least squares has been developed by Gauss in 1794 for smoothing
data in connection with geodetic and astronomical problems.

Example 5.1
Given 5 function values f (xi ), i = 1, . . . , 5, of the function f (x) = (1 + x2 )−1 (see the
table below). We look for the discrete least-squares approximation φ̂ among all quadratic
polynomials φ.

Step 1: the choice of the basis functions is prescribed by the task description:
X3
2
ϕ1 = 1, ϕ2 = x, ϕ3 = x ⇒ φ(x) = ci ϕi (x).
i=1

Step 2: because no other information is available, we choose wi = 1, i = 1, . . . , 5.

Step 3: the given values f (xi ), i = 1, . . . , 5

i 1 2 3 4 5
xi −1 − 21 0 1
2 1
f (xi ) 0.5 0.8 1 0.8 0.5
70 Chapter 5. Least-squares Regression

lead to the normal equations

    
5 0 2.5 ĉ1 3.6
 0 2.5 0   ĉ2  =  0 
    
2.5 0 2.125 ĉ3 1.4

with the solution (to 5 decimal places) ĉ1 = 0.94857, ĉ2 = 0.00000, ĉ3 = −0.45714,
yielding φ̂(x) = 0.94857 − 0.45714 x2 , x ∈ [−1, 1]. Under all quadratic polynomials,
φ̂(x) is the best approximation of f in the discrete least squares sense. For x = 0.8, we
obtain for instance φ̂(0.8) = 0.65600. The absolute error at x = 0.8 is |f (0.8)− φ̂(0.8)| =
4.6 · 10−2 . The results are plotted in figure 5.2.

1.1

1 fx

0.9

0.8

0.7

0.6 px

0.5

0.4

−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1

Figure 5.2: The 5 data points given by f (x) = (1+x2 )−1 (dots) and the best approximation
φ (solid line) of all quadratic polynomials in the discrete least squares sense.

Exercise 5.2
Given points (xi , f (xi )), i = 1, . . . , 4.

i 1 2 3 4
xi 0.02 0.10 0.50 1.00
f (xi ) 50 10 1 0

We look for the best approximation φ̂ in the discrete least squares sense of all functions
P
φ(x) = ck ϕk (x) and the following basis functions:
5.4. Weighted least-squares (?? not examined) 71

1. ϕ1 = 1, ϕ2 = x

2. ϕ1 = 1, ϕ2 = x, ϕ3 = x2

3. ϕ1 = 1, ϕ2 = x, ϕ3 = x2 , ϕ4 = x3

4. ϕ1 = 1, ϕ2 = 1/x

Give a graphical representation of the four different φ̂. What choice of the basis functions
yields the “best” result?

5.4 Weighted least-squares (?? not examined)

We can slightly generalize the least-squares method by assigning to each data value fi a so-
called weight wi > 0. This may be justified, for instance, if the accuracy of the data values
vary, i.e., if one data point, say, the one with index i, is more accurate than another one, say,
with index j. If this is true, then it is natural to expect that |fi − φi | is smaller than |fj − φj |.
To achieve this, we have to assign the data point with index i a larger weight than the data
point with index j. The corresponding method is called weighted least-squares method. It can
be shown that the normal equations associated with the weighted least-squares method are
formally identical to the normal equations associated with the classical least-squares method
(which uses unit weights wi = 1) if we slightly redefine the scalar product: instead of the
definition (5.6), we use
N
X
hϕj , ϕk i := wi ϕj (xi )ϕk (xi ). (5.7)
i=1

The weighted least-squares principle is

N
!
X
Φ(a) = hr, ri = wi (fi − φi )2 = minimum, (5.8)
i=1

and the weighted least-squares solution is given by Eq. (5.5) when using the definition of
the scalar product, Eq. (5.7). Note that according to Eq. (5.8), the weighted least-squares
method does not minimize the square sum of the residuals, but the weighted square sum of
the residuals. The RMS error in the weighted least-squares approximation is
v
N
r u
hr̂, r̂i u 1 X
σRM S = =t wi r̂i2 .
N N
i=1
72 Chapter 5. Least-squares Regression
Chapter 6

Numerical Differentiation

6.1 Introduction
Numerical differentiation is the computation of derivatives of a function using a computer.
The function may be given analytically or on a discrete set of points x0 , . . . , xn , the nodes,
similarly to interpolation. The derivatives may be needed at arbitrary points or at the nodes
x0 , . . . , x n .
The basic idea of numerical differentiation of a function f (x) is to first interpolate f (x)
at the n + 1 nodes x0 , . . . , xn and then to use the analytical derivatives of the interpolating
polynomial φ(x) as an approximation to the derivatives of the function f (x).
If the interpolation error E(f ) = f (x) − φ(x) is small it may be hoped that the result of
differentiation, i.e., φ(k) (x) will also satisfactorily approximate the corresponding derivative
f (k) (x). However, if we visualize an interpolating polynomial φ(x) we often observe that it
oscillates about the function f (x), i.e., we may anticipate the fact that even though the devi-
ation between the interpolating polynomial φ(x) and the function f (x) (i.e., the interpolation
error E(x)) be small throughout an interval, still the slopes of the two functions may differ
significantly. Furthermore, roundoff errors or noise in the given data of alternating sign in
consecutive nodes could effect the calculation of the derivative strongly if those nodes are
closely spaced.
There are different procedures for deriving numerical difference formulae. In this course,
we will discuss three of them: (i) using Taylor series, (ii) Richardson extrapolation, and (iii)
using interpolating polynomials.

6.2 Numerical differentiation using Taylor series

Taylor series can be used to derive difference formulae for equidistant nodes if the derivative
at a node is looked for. Suppose f ∈ C k ([a, b]) for some k ≥ 1. For a given step size h, we

73
74 Chapter 6. Numerical Differentiation

consider the Taylor series

f 00 (x) 2 f (k−1) (x) k−1

f (x + h) = f (x) + f 0 (x) h + h + ... + h + Rk (ξ), (6.1)
2 (k − 1)!

where the Lagrange form of the remainder is

f (k) (ξ) k
Rk (ξ) = h , x < ξ < x + h. (6.2)
k!
We want to determine an approximation of the first derivative f 0 (x) for a fixed x ∈ (a, b). If
we choose k = 2 then
1
f (x + h) = f (x) + h f 0 (x) + h2 f 00 (ξ), x < ξ < x + h, (6.3)
2
which can be rearranged to give

f (x + h) − f (x) 1
f 0 (x) = − h f 00 (ξ), x < ξ < x + h. (6.4)
h 2
Suppose f (x) is given at a set of equidistant nodes x0 < x1 . . . < xi < . . . with distance h.
Using the equation just derived we find for f 0 (xi )

fi+1 − fi
f 0 (xi ) = + O(h). (6.5)
h
This formula is referred to as the forward difference formula 1 . It has simple geometrical
interpretation: we are trying to find the gradient of a curve by looking at the gradient of a
short chord to that curve (cf. Fig 6.1).

Example 6.1 (Application of the forward difference formula)

Given the function f (x) = xx for x > 0. Approximate f 0 (1) using the forward difference
formula trying some different values of h (work with 10 significant digits): the solution is
given in table 6.1.

We observe that as h is taken smaller and smaller, the forward difference result appears
to converge to the correct value f 0 (1) = 1. We also observe that each time we divide h by
a factor of 10, then the error decreases (roughly) by a factor of 10. Thus we might say the
forward difference formula has error roughly proportional to h. This is exactly what the
term O(h) in Eq. (6.5) tells us; the error of the forward difference formula is on the order
of O(h), i.e., it scales proportional to h for sufficiently small h. Difference formulae, which
1
Remember from real analysis that the derivative is defined as the limit
f (x + h) − f (x)
f 0 (x) = lim .
h→0 h
Compare this with the forward difference formula.
6.2. Numerical differentiation using Taylor series 75

Figure 6.1: Geometric interpretation of the forward difference formula.

h (f (1 + h) − f (1))/h error
0.1 1.105342410 -0.105342410
0.01 1.010050300 -0.010050300
0.001 1.001001000 -0.001001000
0.0001 1.000100000 -0.000100000

Table 6.1: Convergence of the (first order) forward difference formula.

76 Chapter 6. Numerical Differentiation

have an error O(h) are called first order formulae. Therefore, the forward difference formula
is sometimes also called the first order forward difference formula. Note, however, that if h
is taken much smaller, then rounding error becomes an issue in these calculations, see below.
Similarly,

f 00 (η)
f (x − h) = f (x) + f 0 (x)(−h) + (−h)2 , x − h < η < x. (6.6)
2

Hence, for equidistant nodes, we find at x = xi :

fi − fi−1
f 0 (xi ) = + O(h), (6.7)
h

which is called the (first order) backward difference formula for the first derivative. A geo-
metrical interpretation of the backward difference formula is shown in Fig 6.2.

Figure 6.2: Geometric interpretation of the backward difference formula.

Higher order (i.e., more accurate) schemes can be derived by Taylor series of the function f
at different points about the point xi . For instance, assume that f ∈ C 3 ([a, b]). Taking the
6.2. Numerical differentiation using Taylor series 77

difference between the two Taylor series

f 00 (x) 2 f (3) (ξ1 ) 3
f (x + h) = f (x) + f 0 (x) h + h + h , x < ξ1 < x + h, and (6.8)
2 6
f 00 (x) 2 f (3) (ξ2 ) 3
f (x − h) = f (x) − f 0 (x) h + h − h , x − h < ξ2 < x, (6.9)
2 6
we find
f (3) (ξ1 ) + f (3) (ξ2 ) 3
f (x + h) − f (x − h) = 2f 0 (x) h + h . (6.10)
6
Since f (3) is continuous, the intermediate value theorem implies that there must exist some
ξ between ξ1 and ξ2 such that
f (3) (ξ1 ) + f (3) (ξ2 )
f (3) (ξ) = . (6.11)
2
Thus,
f (3) (ξ) 3
f (x + h) − f (x − h) = 2f 0 (x) h + h , x − h < ξ < x + h. (6.12)
3
Solving for f 0 (x),
f (x + h) − f (x − h) f (3) (ξ) 2
f 0 (x) = − h , x−h<ξ <x+h (6.13)
2h 6
f (x + h) − f (x − h)
= + O(h2 ) (6.14)
2h
For equidistant nodes, we find at x = xi
fi+1 − fi−1
f 0 (xi ) = + O(h2 ). (6.15)
2h
The first term on the right-hand side of this equation is the central difference formula for the
first derivative. Like the forward difference formula, it has a geometrical interpretation in
terms of the gradient of a chord (cf. Fig 6.3). Beware that it might also suffer catastrophic
cancellation problems when h is very small, see below.
Example 6.2 (Application of the central difference formula)
We apply the central difference formula to estimate f 0 (1) when f (x) = xx . To 10 significant
digits, we find the results shown in table 6.2.
From table 6.2 we deduce that each time we decrease h by a factor of 10, the error is reduced
(roughly) by a factor of 100. This is due to the fact that the error of the central difference
formula is O(h2 ). We thus say that the central difference formula is a second order formula
and therefore sometimes referred to as the second order central difference formula. It follows
that the central difference formula is usually superior to the forward difference formula,
because as h is decreased, the error for the central difference formula will converge to zero
much faster than that for the forward difference formula. The superiority may be seen in
the above example: for h = 0.0001, the forward difference formula is correct to 4 significant
figures, whereas the central difference formula is correct to 8 significant figures.
78 Chapter 6. Numerical Differentiation

Figure 6.3: Geometric interpretation of the central difference formula.

f (1+h)−f (1−h)
h 2h error
0.1 1.005008325 0.105342410
0.01 1.000050001 0.010050300
0.001 1.000000500 0.000000500
0.0001 1.000000005 0.000000005

Table 6.2: Example (second order) central difference formula for the first derivative.
.
6.2. Numerical differentiation using Taylor series 79

Example 6.3 (Comparison of various difference formulae for the first derivative)
Use the data given in table 6.3 to estimate the acceleration at t = 16 s using the forward,
backward, and central difference formula for various step sizes. The data have been generated
with the formula
h 1.4 · 105 i
v(t) = 2000 ln − 9.8 t, 0 ≤ t ≤ 30, (6.16)
1.4 · 105 − 2100 t
where v is given in [m/s] and t is given in [s]. The results are shown in table 6.3.

method approximation absolute error

[m/s2 ] [m/s2 ]
forward 30.475 2.6967
backward 28.915 2.5584
central 29.695 0.0692

Table 6.3: Comparison of forward, backward, and central difference formula. Note that
the central difference formula outperforms the forward and backward difference
formulae.

In the same way, a fourth-order formula would look like

fi−2 − 8fi−1 + 8fi+1 − fi+2
f 0 (xi ) = + O(h4 ), (6.17)
12h
for f ∈ C 5 ([a, b]). One of the difficulties with higher order formulae occurs near the bound-
aries of the interval. They require functional values at points outside the interval, which are
not available. Near the boundaries one usually resorts to lower order (backward and forward)
formulae.

6.2.1 Approximation of derivatives of 2nd degree

Similar formulae can be derived for second or higher order derivatives. For example, suppose
f (x) ∈ C 4 ([a, b]). Then,

f 00 (x) 2 f (3) (x) 3 f (4) (ξ1 ) 4

f (x + h) = f (x) + f 0 (x) h + h + h + h , x < ξ1 < x + h, (6.18)
2 6 24
and
f 00 (x) 2 f (3) (x) 3 f (4) (ξ2 ) 4
f (x − h) = f (x) − f 0 (x) h + h − h + h , x − h < ξ2 < x. (6.19)
2 6 24
Adding the last two equations gives

f (4) (ξ1 ) + f (4) (ξ2 )

f (x + h) + f (x − h) = 2f (x) + f 00 (x) h2 + . (6.20)
2
80 Chapter 6. Numerical Differentiation

Since f (4) is continuous, the intermediate value theorem implies that there must be some ξ
between ξ1 and ξ2 such that

f (4) (ξ1 ) + f (4) (ξ2 )

f (4) (ξ) = . (6.21)
2
Therefore,

f (4) (ξ) 4
f (x + h) − 2f (x) + f (x − h) = f 00 (x) h2 + h , x − h < ξ < x + h. (6.22)
12
Solving for f 00 (x) gives

f (x + h) − 2f (x) + f (x − h) f (4) (ξ) 2

f 00 (x) = − h , x − h < ξ < x + h. (6.23)
h2 12
For equidistant data, we obtain for the second derivative of the function f at the point xi :
fi+1 − 2fi + fi−1
f 00 (xi ) = + O(h2 ). (6.24)
h2
This is the central difference formula for the second derivative.

Example 6.4 (Central difference formula for the second derivative)

The velocity of a rocket is given by
h 1.4 · 105 i
v(t) = 2000 ln − 9.8 t, 0 ≤ t ≤ 30, (6.25)
1.4 · 105 − 2100 t
where v is given in [m/s] and t is given in [s]. Use the forward difference formula for the
second derivative of v(t) to calculate the jerk at t = 16 s. Use a step size of ∆t = 2 s. The
forward difference approximation of the second derivative is
vi+2 − 2vi+1 + vi
φ00 (ti ) = . (6.26)
∆t2
With ti = 16 and ∆t = 2, we find ti+1 = ti + ∆t = 18 and ti+2 = ti + 2∆t = 20. Hence,

v(20) − 2v(18) + v(16) 517.35 − 2 · 453.02 + 392.07

φ00 (16) = = = 0.845 [m/s3 ]. (6.27)
22 4
The exact value, v 00 (16) can be computed from
18000
v 00 (t) = . (6.28)
(−200 + 3t)2

Hence, v 00 (16) = 0.7791m/s3 . The absolute relative error is

0.7791 − 0.8452
|ε| = · 100 = 8.5%. (6.29)

0.7791
6.2. Numerical differentiation using Taylor series 81

6.2.2 Balancing truncation error and rounding error

Note that all of these methods suffer from problems with rounding errors due to cancellation.
This is because the terms in the numerator for each formula become close together as the step
size h decreases. There is a point where the rounding error, which increases if h decreases,
becomes larger than the truncation error, which decreases as a power of h. This can be
illustrated using the first order forward difference formula
f (x + h) − f (x) 1
f 0 (x) = − h f 00 (ξ), x < ξ < x + h. (6.30)
h 2
Suppose that when evaluating f (x + h) and f (x), we encounter roundoff errors ε(x + h)
and ε(x), respectively. Then, our computation uses in fact the values F (x + h) and F (x),
where F (x + h) = f (x + h) − ε(x + h) and F (x) = f (x) − ε(x). Then, the total error in the
approximation of f 0 (x) by the forward difference formula is
F (x + h) − F (x) ε(x + h) − ε(x) h 00
f 0 (x) − = − f (ξ), x < ξ < x + h. (6.31)
h h 2
If we assume that the roundoff errors ε(x + h) and ε(x) are bounded by some number ε > 0
and the second derivative of f is bounded by a number M > 0 in the interval (x, x + h), then

0 F (x + h) − F (x) ε(x + h) − ε(x) h 00
f (x) − ≤ − f (ξ)
h h 2

ε(x + h) ε(x) h
≤ + + f 00 (ξ)

h h 2
2ε M
≤ + h.
h 2
This is an upper bound of the total error of the first order forward difference scheme. It
consists of two parts. The second term on the right-hand side results from the error of the
forward difference scheme (truncation error ). This term decreases when reducing the step
size h. The first term is caused by roundoff errors in the function evaluation. Thereby, a
small roundoff error ε is amplified in case of tiny step sizes h. This implies that when using
the forward difference scheme, it does not make sense to make h arbitrarily small to obtain
a more accurate estimate of the first derivative. If h is smaller than some optimal value, say,
hopt , roundoff errors will exceed the error of the forward difference scheme as they increase
proportional to 1/h. The optimal h, denoted hopt , is the one which minimizes the upper
bound of the total error: let e(h) denote the upper bound of the total error, i.e., we define
2ε M
e(h) := + h. (6.32)
h 2
de
e(h) attains a minimum if dh = 0, i.e. for a value hopt given by
r
ε
hopt = 2 . (6.33)
M
Similar equations can be derived for other difference schemes, as well.
82 Chapter 6. Numerical Differentiation

6.3 Richardson extrapolation

Richardson extrapolation is a method of increasing the order of accuracy of any numerical
scheme — here we show it by the example of numerical differentiation but it applies equally
to numerical integration and numerical solution of ODEs. For numerical differentiation, we
can always express the truncation error as a series of powers of the step size h. In general,
let F (h) be a difference formula with step-size h, approximating f 0 (x):

F (h) = f 0 (x) + α1 hp1 + α2 hp2 + α3 hp3 + . . . , (6.34)

where we assume that p1 < p2 < p3 < . . .. Here, αi are constants,

f 0 (x) = F (h) − α1 hp1 − α2 hp2 − . . . = F (h) + O(hp1 ). (6.35)

Hence, if we take F (h) as approximation to f 0 (x), the error is O(hp1 ). If h goes to zero,
F (h) → f 0 (x) — i.e. the method is consistent. Assume that we know the power p1 , which we
can obtain by the analysis of the previous sections. If we compute F (h) for two step sizes, h
and h/2, we have

F (h) = f 0 (x) + α1 hp1 + α2 hp2 + α3 hp3 + . . .

hp1 hp2 hp3
F (h/2) = f 0 (x) + α1 p1 + α2 p2 + α3 p3 + . . .
2 2 2
By taking a linear combination of these two formulae, we can eliminate α1 , giving
2p1 F (h/2) − F (h) 2p1 F (h/2) − F (h)
f 0 (x) = + β 2 hp2
+ β3 h p3
+ . . . = + O(hp2 ). (6.36)
2p 1 − 1 2p1 − 1
If we take the first term on the right-hand side as new approximation to f 0 (x), we immedi-
ately see the important point here: we have reduced the error from O(hp1 ) in the original
approximation to O(hp2 ) in the new approximation. If F (h) is known for several values of h,
for example, h, h/2, h/4, h/8 etc, the extrapolation process above can be repeated to obtain
still more accurate approximations.

Example 6.5 (Richardson extrapolation of 1st-order forward differences)

Let us apply this method to the 1st-order forward difference formula. The Taylor series

f 00 (x) 2 f (3) (x) 3

f (x + h) = f (x) + f 0 (x) h + h + h + ..., (6.37)
2 6
can be rearranged for f 0 (x), giving

f (x + h) − f (x) f 00 (x) f (3) (x) 2

f 0 (x) = − h− h − ... (6.38)
h 2 6
The forward difference approximation of f 0 (x) is therefore
f (x + h) − f (x)
F1 (h) := , (6.39)
h
6.4. Difference formulae from interpolating polynomials (?? - not examined) 83

and the truncation error can be written

f 00 (x) f (3) (x) 2
F1 (h) = f 0 (x) + h+ h + . . . = f 0 (x) + O(h). (6.40)
2 6
Applying Richardson extrapolation via equation (6.36) gives the new difference formula
(where we have used p1 = 1 and p2 = 2):
2F1 (h/2) − F1 (h)
F2 (h) = = 2F1 (h/2) − F1 (h). (6.41)
21 − 1
Because p2 = 2,
F2 (h) = f 0 (x) + O(h2 ), (6.42)
i.e., the error of the new approximation F2 (h) is O(h2 ).
Given the formula F2 (h), we can apply Richardson extrapolation again by applying (6.36)
to F2 (h), where now p1 = 2 and p2 = 3 corresponding to the truncation error of F2 (h):

22 F2 (h/2) − F2 (h) 4F2 (h/2) − F2 (h)

F3 (h) = 2
= = f 0 (x) + O(h3 ). (6.43)
2 −1 3
The truncation error of the F3 (h) is O(h3 ), and we can continue like this indefinitely, provided
samples of f (x + ih/2n ), i ∈ {0, . . . , 2n } are available.

6.4 Difference formulae from interpolating polynomials (?? -

not examined)
The Taylor series approach for approximating the derivatives of a function f (x) requires the
knowledge of values of the function at specific points in relation to where the derivative is
to be approximated. For example, to approximate f (x) by means of the central difference
formula, one must know the value of f (x) at x + h and x − h for some small value of h. This
does not present a problem if we already know the function f (x), but if f (x) is only known
through a finite collection of unequally spaced points, then the formulae we derived earlier
will not, in general, have much value if we wish to find a derivative at a specific value of x.
To be able to deal with this, we shall create a function that interpolates a finite number of
points lying on the graph of f (x). We shall then approximate f (r) (x) by differentiating this
interpolating polynomial r times. The approach is shown graphically in Figure 6.4.
When we interpolate a function f (x) at n + 1 points x0 , . . . , xn , we have (cf. Chapters on
interpolation), the function f (x) can be written as:
n
X f (n+1) (ξ)
f (x) = li (x)fi + ωn+1 (x), x0 < ξ < xn , (6.44)
(n + 1)!
i=0

where li (x) are the Lagrange basis functions associated to the nodes x0 . . . xn . Note that
the first term on the right-hand side is the Lagrange representation of the interpolating
84 Chapter 6. Numerical Differentiation

Figure 6.4: Approximating derivatives of f (x) (black), by first approximating with a inter-
polating polynomial p(x) (red), and subsequently differentiating the polynomial
(blue).
6.4. Difference formulae from interpolating polynomials (?? - not examined) 85

polynomial. Correspondingly, the second term on the right-hand side is the interpolation
error. We differentiate this equation r times and obtain
n
X (r) 1 dr h i
f (r) (x) = li (x) fi + ω n+1 (x) f (n+1)
(ξ) , (6.45)
(n + 1)! dxr
i=0

where ξ = ξ(x), i.e., ξ is a function of x. Hence, we can compute an estimate of the r-th
derivative f (r) (x) by computing the r-th derivative of the interpolating polynomial.

Example 6.6 (Differentiation using interpolating polynomials)

If r = 1, we obtain
n
X f (n+1) (ξ1 ) f (n+2) (ξ2 )
f 0 (x) = li0 (x) fi + ωn+1
0
(x) + ωn+1 (x) , (6.46)
(n + 1)! (n + 2)!
i=0

where ξ1 , ξ2 ∈ (x0 , xn ). If we need to know f 0 at a node, say, xj , we obtain

n
X f (n+1) (ξ1 )
f 0 (xj ) = li0 (xj ) fi + ωn+1
0
(xj ) , (6.47)
(n + 1)!
i=0

since ωn+1 (x) vanishes at x = xj .

Example 6.7 (Differentiation using interpolating polynomials)

The 3 points (0.1, e0.1 ), (1, e), and (2, e2 ) lie on the graph of f (x) = ex . Find an approximation
of the value f 0 (x) using the derivative of the interpolating polynomial through the 3 points.
The (quadratic) interpolating polynomial is given by (test if yourself)

p(x) = 1.151496 x2 + 0.125887 x + 1.077433, (6.48)

to 6 decimal places. The first derivative of p(x) is given by

p0 (x) = 3.029925 x + 0.125887, (6.49)

while f 0 (x) = ex . Table 6.4 compares values of p0 (x) with f 0 (x) for various values of x (limited
to 4 decimal places):

For instance, from quadratic polynomial interpolation through the data points (xi−1 , fi−1 ),
(xi , fi ), and (xi+1 , fi+1 ), we find
(x − xi )(x − xi+1 )
φ(x) = fi−1
(xi−1 − xi )(xi−1 − xi+1 )
(x − xi−1 )(x − xi+1 )
+ fi (6.50)
(xi − xi−1 )(xi − xi+1 )
(x − xi−1 )(x − xi )
+ fi+1 ,
(xi+1 − xi−1 )(xi+1 − xi )
86 Chapter 6. Numerical Differentiation

x p0 (x) f 0 (x) error

0.0 0.1259 1.0000 08741
0.1 0.4289 1.1052 0.6763
0.5 1.6408 1.6487 0.0078
1.0 3.1558 2.7183 -0.4375
1.5 4.6708 4.4817 -0.1891
2.0 6.1857 7.3891 1.2033
2.5 7.7007 12.1825 4.4818

Table 6.4: Differentiation using interpolating polynomials

for x ∈ [xi , xi+1 ]. We differentiate φ(x) analytically and can use the result to obtain an
approximation of f 0 (x) at any point x ∈ [xi , xi+1 ]. Note that you should not evaluate φ0 (x)
at a point x outside the interval [xi , xi+1 ]. If we want to know an approximation of f 0 (x) at
the node xi , we evaluate φ0 (x) at x = xi :
hi hi−1 − hi hi−1
φ0 (xi ) = − fi−1 − fi + fi+1 , (6.51)
hi−1 (hi−1 + hi ) hi−1 hi hi (hi−1 + hi )
where hk = xk+1 − xk . For equally spaced intervals it is hk = h = constant, and we find
fi+1 − fi−1
φ0 (xi ) = . (6.52)
2h
This is the central difference formula for the first derivative, which has already been derived
previously making use of Taylor series. If we differentiate φ(x), Eq. (6.50), twice we obtain a
constant:
2 2 2
φ00 (x) = fi−1 − fi + fi+1 , x ∈ [xi .xi+1 ]. (6.53)
hi−1 (hi−1 + hi ) hi−1 hi hi (hi−1 + hi )

This constant can be used as an approximation of f 00 (x) at any point in the interval [xi , xi+1 ].
For equidistant points, it simplifies to
fi−1 − 2fi + fi+1
φ00 (x) = . (6.54)
h2
When we use Eq. (6.54) as an approximation to f 00 (xi ), we obtain the well-known central
difference formula for the second derivative. Correspondingly, if we use Eq. (6.54) as an ap-
proximation to f 00 (xi−1 ) and f 00 (xi+1 ) it is called the forward formula for the second derivative
and backward formula for the second derivative, respectively. Note that the backward formula
for the second derivative is in fact based on an extrapolation.
Of course, additional differentiation formulae can be obtained if we fit an algebraic poly-
nomial through 4 or more points and computing the derivative(s) analytically. Note that
6.4. Difference formulae from interpolating polynomials (?? - not examined) 87

the data points do not need to be equidistant and that the use of interpolating polynomials
allows the computation of derivative(s) at any point inside the data interval. For instance,
for equidistant nodes through 5 points, we obtain the approximation

1 h4
f 0 (xi ) ≈ fi−2 − 8fi−1 + 8fi+1 − fi+2 + f (5) (ξ), x − 2h < ξ < x + 2h. (6.55)
12h 30
Example 6.8 (Differentiation using interpolating polynomials)
The upward velocity of a rocket is given as a function of time in table 6.5. Use a cubic
interpolating polynomial to obtain an approximation of the acceleration of the rocket at
t = 16 s. Since we want to find the velocity at t = 16 s, and we are using a cubic interpolating

time t [s] velocity v(t) [m/s]

0 0
10 227.04
15 362.78
20 517.35
22.5 602.97
30 901.67

Table 6.5: Example: approximate first derivatives from a cubic interpolating polynomial

polynomial, we need to choose the four points closest to t = 16 s and that also bracket t = 16
s to evaluate it. The four points are t0 = 10, t1 = 15, t2 = 20, and t3 = 22.5. We use the
monomial representation of the cubic interpolating polynomial,

φ(t) = a0 + a1 t + a2 t2 + a3 t3 , (6.56)

and obtain from the interpolation condition the linear system

    
1 10 100 1000 a0 227.04
1 15 225 3375 
 a1  362.78
   
  =  (6.57)

 
1 20 400 8000  a2  517.35
1 22.5 506.25 11391 a3 602.97

Solving the above system of equations gives

a0 = −4.3810, a1 = 21.289, a2 = 0.13065, a3 = 0.0054606. (6.58)

The acceleration at t = 16 s is found by the evaluation of φ0 (t) = a1 + 2a2 t + 3a3 t2 at t = 16:

φ0 (t = 16) = 29.664 m/s2 . (6.59)

88 Chapter 6. Numerical Differentiation
Chapter 7

Numerical Integration

7.1 Introduction
The numerical computation of definite integrals is one of the oldest problems in mathematics.
In its earliest form the problem involved finding the area of a region bounded by curved lines,
a problem which has been recognised long before the concept of integrals was developed in
the 17th and 18th century.
Numerical integration is often referred to as numerical quadrature, a name which comes
from the problem of computing the area of a circle by finding a square with the same area.
The numerical computation of two- and higher-dimensional integrals is often referred to as
numerical cubature. We will treat both topics in this chapter.
There are mainly three situations where it is necessary to calculate approximations to
definite integrals. First of all it may be so that the antiderivative of the function to be
integrated cannot be expressed in terms of elementary functions such as algebraic polynomials,
2
logarithmic functions, or exponential functions. A typical example is e−x dx. Secondly, it
R

might be that the antiderivative function can be written down but is so complicated that its
function values are better computed using numerical integration formulas. For instance, the
number of computations that must be carried out to evaluate
x
√
x2 + 2x + 1
Z
dt 1 1 x x
= √ log √ + √ arctan √ + arctan √
0 1 + t4 4 2 x2 − 2x + 1 2 2 2−x 2+x

using the ‘exact’ formula is substantial. A final reason for developing numerical integration
formulas is that, in many instances, we are confronted with the problem of integrating ex-
perimental data, which also includes data generated by a computation. In that case the
integrand is only given at discrete points and theoretical devices may be wholly inapplica-
ble. This case also arises when quadrature methods are applied in the numerical treatment
of differential equations. Most methods for discretizing such equations rest on numerical
integration methods.

89
90 Chapter 7. Numerical Integration

Let us begin by introducing some notations which will be used in this chapter. The idea
Rb
of numerical integration of a definite integral a f (x)dx is to look for an approximation of
the form Z b Xn
f (x)dx ≈ wi f (xi ). (7.1)
a i=0

The coefficients wi are called the “weights”; they depend on the location of the points xi
and on the integration domain [a, b]. The points xi are called the “nodes” (or “knots”). The
definite sum is called a “quadrature formula”. All quadrate formulas look like (7.1); they
only differ in the choice of the nodes and the weights. We will often write
Z b
I(f ; a, b) := f (x)dx (7.2)
a
n
X
Q(f ; a, b) := wi f (xi ) (7.3)
i=0

The difference I(f ; a, b)−Q(f ; a, b) =: E(f ; a, b) is called the “error of numerical integration”,
or briefly “quadrature error ”. Thus,

I(f ; a, b) = Q(f ; a, b) + E(f ; a, b) (7.4)

Definition 7.1 (Concept of the degree d of precision (DOP))

A quadrature formula
n
X
Q(f ; a, b) = wi f (xi ) (7.5)
i=0

approximating the integral

Z b
I(f ; a, b) = f (x) dx (7.6)
a
has DOP d if Z b n
X
f (x) dx = wi f (xi ) (7.7)
a i=0

whenever f (x) is a polynomial of degree at most d, but

Z b n
X
f (x) dx 6= wi f (xi ) (7.8)
a i=0

for some f (x) of degree d + 1.

Equivalent to this is the following formulation: the rule Q(f ; a, b) approximating the definite
integral I(f ; a, b) has DOP d if
Z b n
wi xqi
X
xq dx = (7.9)
a i=0
7.2. Solving for quadrature weights 91

Figure 7.1: Trapezoidal rule (Section 7.4.1) and Simpson’s rule (Section 7.4.2) for integral
approximation.

for q = 0, 1, . . . , d, but
Z b n
wi xqi
X
q
x dx 6= (7.10)
a i=0

for q = d + 1. From a practical point of view, if a quadrature formula Q1 (f ; a, b) has a higher

DOP than another quadrature formula Q2 (f ; a, b), then Q1 is considered more accurate than
Q2 . This DOP concept can be used to derive quadrature formulas directly.
Figure 7.1 shows the basic idea: the area under the integrand is approximated by the
area under a polynomial interpolant. The error in the approximation E[f ; a, b] corresponds
to the area “missed” by the fitting polynomial.

7.2 Solving for quadrature weights

Suppose we want to develop a formula of approximate integration
Z b n
X
f (x) dx ≈ wi f (xi )
a i=0

of a given degree of precision at least n. Assume that the nodes x0 , . . . , xn are fixed and
distinct numbers lying between a and b and prescribed in advance. Then the only degrees of
freedom are the n + 1 weights wi which can be determined in two ways:
92 Chapter 7. Numerical Integration

(a) Interpolate the function f (x) at the n + 1 points x0 , . . . , xn by a polynomial of degree

≤ n. Then integrate this interpolation polynomial exactly. The degree of precision d
will be at least n.

(b) Select the weights w0 , . . . , wn so that the rule integrates the monomials xj exactly for
j = 0, . . . , n. I.e. we obtain the n + 1 integration conditions:
Z b n
wi xji , ∀j = 0, . . . , n,
X
xj dx =
a i=0

(compare these conditions with the interpolation conditions). Again the degree of
precision will be at least n by construction.

It can be shown that both ways lead to one and the same formula, which is called “an
interpolatory quadrature formula”. Following (b), the integration conditions yields a linear
system for the unknown weights wi :
    Rb 
1 1 ··· 1 w0 dx
    R ba
 x0 x1 · · · xn   w1   a xdx 
 
 .
 . .. ..   ..  = 
    ..  (7.11)
. . . . .

    
Rb n
xn0 xn1 · · · xnn wn a x dx

The matrix is the familiar “Vandermonde matrix”. It is invertible if and only if the nodes
xi are distinct. Obviously, quadrature formulas of interpolatory type using n + 1 nodes (i.e.
n + 1 function evaluations) have exact degree of precision d = n, i.e. they integrate exactly
polynomials of degree at most n.

Exercise 7.2
Construct a rule of the form
Z 1
I(f ; −1, 1) = f (x)dx ≈ w0 f (−1/2) + w1 f (0) + w2 f (1/2)
−1

which is exact for all polynomials of degree ≤ 2.

Rather than fixing xi and determining wi , one could prescribe the weights wi and deter-
mining the nodes xi such that the integration conditions are satisfied for all polynomials of
degree ≤ n. However this yields a non-linear system of algebraic equations for the nodes xi ,
which is much more difficult to solve then the linear system of equations for wi .
A more exciting possibility is to determine weights and nodes simultaneously such that
the degree of precision d is as high as possible. The number of degrees of freedom is 2(n + 1),
so we should be able to satisfy integration conditions of up to degree 2n + 1 – these are the
Gauss quadrature rules, and are the best possible integration rules (in the sense that they
have highest degree of precision).
7.3. Numerical integration error – Main results 93

7.3 Numerical integration error – Main results

We would like to have an expression for the numerical integration error E(f ; a, b) for a
quadrature rule with degree of precision d. In this section the result is simply stated, for a
derivation see Section 7.8. We define:
Z b n
X
E(f ; a, b) = f (x) dx − wi f (xi ), (7.12)
a i=0

where a ≤ x0 < x1 < . . . < xn ≤ b are the nodes and wi are the weights and E is the
integration error. By the definition of degree of precision, if the quadrature rule has DOP d
then:
E(xj ; a, b) = 0, for all j = 0, . . . , d, (7.13)

and
E(xd+1 ; a, b) 6= 0. (7.14)

A general expression for E will now be stated. First we must introduce some notation. The
truncated power function is defined as
(
(x − s)d when x ≥ s
(x − s)d+ := . (7.15)
0 when x < s

Theorem 7.3 (Peano kernel theorem)

Let f ∈ C d+1 ([a, b]) and let Q(f ; a, b) = ni=0 wi f (xi ) be a quadrature rule with DOP d.
P

Then,
1 b (d+1)
Z
E(f ; a, b) = f (s) K(s) ds, (7.16)
d! a
where
K(s) = E((x − s)d+ ; a, b). (7.17)

This result may appear somewhat ponderous, but it captures a beautiful idea: the error in
the quadrate formula of f is related to the integral of a function K(s) (called the Peano
kernel or the influence function) that is independent of f .
Under typical circumstances K(s) does not change sign on [a, b]. In this circumstance
the second law of the mean can be applied to extract f from inside the integral in (7.16).
Therefore E(f ; a, b) can be described as

b
f (d+1) (ξ)
Z
E(f ; a, b) = K(s) ds a < ξ < b, (7.18)
d! a
= κ f (d+1) (ξ), a < ξ < b, (7.19)
94 Chapter 7. Numerical Integration

where κ is independent of f (x). Since we have this independence, to calculate κ we may

make the special choice f (x) = xd+1 . Then from (7.19) we have

E(xd+1 ; a, b) = κ (d + 1)!, (7.20)

which fixes κ:
E(xd+1 ; a, b)
κ= .
(d + 1)!
Hence for any (d + 1)-times continuously differentiable function f ∈ C d+1 ([a, b]), we have

E(xd+1 ; a, b) (d+1)
E(f ; a, b) = f (ξ), a<ξ<b (7.21)
(d + 1)!

so we can test our quadrature rule against xd+1 , and if it does well, it will do well for all f (·)!
In the case of quadrature rules with equidistant nodes (closed Newton-Cotes rules), we
can be even more specific:

Theorem 7.4 (Errors for Newton-Cotes Rules)

Let Qn (f ; a, b) be a quadrature rule on [a, b] with n + 1 equidistant nodes x0 , . . . , xn

b−a
xi = a + ih, h=
n
and degree of precision n. If n is odd (and f ∈ C n+1 ([a, b])) then the integration error can
be written:
hn+2 f (n+1) (ξ) n
Z
En (f ; a, b) = s(s − 1) · · · (s − n) ds. (7.22)
(n + 1)! 0

If n is even (and f ∈ C n+2 ([a, b])) it is:

n
hn+3 f (n+2) (ξ)
Z
n
En (f ; a, b) = (s − ) s(s − 1) · · · (s − n) ds, (7.23)
(n + 2)! 0 2

where a < ξ < b in both cases.

These expressions are used in the following sections to derive error estimates for closed
Newton-Cotes rules of varying degree.

7.4 Newton-Cotes formulas

The Newton-Cotes integration formulas are quadrature rules with equidistant nodes. We
distinguish two cases:

(a) the end-points of the integration domain are nodes, i.e. a = x0 , b = xn . Then, we talk
about closed Newton-Cotes formulas.
7.4. Newton-Cotes formulas 95

(b) the end-points of the integration domain are not nodes, i.e., a < x0 and xn < b. Then,
we talk about open Newton-Cotes formulas.

First let us consider closed-type formulas. Thus, the integration domain [a, b] is divided into
n subintervals of equal length h at the points

xi = a + ih, i = 0, . . . , n,

where h = b−a
n . The number of nodes is s = n + 1. For different s we obtain different closed
Newton-Cotes formulas:

7.4.1 Closed Newton-Cotes (s=2) – Trapezoidal rule

With s = 2 we only have 2 points: x0 = a and x1 = b. The weights w0 and w1 are the
solution of the linear system of equations
! ! Rb !
1 1 w0 1dx = b − a
= R ab 1 2 2
(7.24)
a b w1 a xdx = 2 (b − a )

b−a
The solution is w0 = w1 = 2 . Thus, the quadrature formula is

b 1
b − ah
Z X i
f (x)dx ≈ wi f (xi ) = f (a) + f (b) (7.25)
a 2
i=0

This quadrature formula is called the trapezoidal rule. The same result is obtained if we use
the Lagrange representation of the interpolation polynomial through the data points:

b Z b
x − x1 x − x0
Z
h
p1 (x) dx = f0 + f1 dx = f0 + f1 )
a a x0 − x1 x1 − x0 2
h
= f (a) + f (b) , where h = b − a.
2

Since n = 1, the integration error is:

h3 f 00 (ξ) 1
Z
1 3 (2)
E(f ; a, b) = s(s − 1) ds = − h f (ξ), a < ξ < b. (7.26)
2 0 12

We say that E(f ; a, b) = O(h3 ), which means that E(f ; a, b) tends to zero as h3 , h → 0. Note
that if f is a linear function, f 00 = 0, hence E(f ; a, b) = 0. This was to be expected, because
the degree of precision of the trapezoidal rule is, by construction, equal to 1. Remember that
the error estimates requires f ∈ C 2 ([a, b]).
96 Chapter 7. Numerical Integration

7.4.2 Closed Newton-Cotes (s=3) – Simpson’s rule

For s = 3 we have 3 points: x0 = a, x1 = a+b
2 , x2 = b. The weights are the solution of the
linear system     
1 1 1 w0 b−a
a+b
 a b   w1  =  12 (b2 − a2 )  (7.27)
    
2
2 a+b 2 2 1 3 3
a ( 2 ) b w2 3 (b − a )

The solution is w0 = 61 (b − a), w1 = 23 (b − a), w3 = 61 (b − a), and the quadrature formula is

b 2
b − ah
Z X a+b i
f (x)dx ≈ wi f (xi ) = f (a) + 4f + f (b) (7.28)
a 6 2
i=0

This quadrature formula is called “Simpson’s rule”. Using the Lagrange representation of
the degree-2 interpolation polynomial yields
b Z bh
(x − x1 )(x − x2 ) (x − x0 )(x − x2 )
Z
p2 (x) dx = f0 + f1
a a (x0 − x1 )(x0 − x2 ) (x1 − x0 )(x1 − x2 )
(x − x0 )(x − x1 ) i h b−a
+ f2 dx = f0 + 4f1 + f2 , where h = .
(x2 − x0 )(x2 − x0 ) 3 2
It is n = 2, and the error of Simpson’s rule is

h5 f (4) (ξ) 2 h5
Z
E(f ; a, b) = s(s − 1)2 (s − 2) ds = − f (4) (ξ), a < ξ < b. (7.29)
4! 0 90

Therefore, the error is of the order O(h5 ). This is remarkable, because the trapezoidal rule,
which uses only one node less, is of the order O(h3 ). That is the reason why Simpson’s rule
is so popular and often used.

7.4.3 Closed Newton-Cotes (s=4) – Simpson’s 3/8-rule

For s = 4 the points are x0 = a, x1 = a + b−a 2
3 , x2 = a + 3 (b − a), x3 = b, i.e., h =
b−a
3 . The
weights are the solution of the linear system
    
1 1 1 1 w0 b−a
2a+b a+2b   1
  w1   2 (b2 − a2 ) 
 a b   

 2 2a+b 3 3 = (7.30)
 a ( 3 )2 ( a+2b 2 b2   w   31 (b3 − a3 ) 
    
3 ) 2
a3 ( 2a+b
3 )
3 ( a+2b )3 b3
3 w3 1 4 4
4 (b − a )

The solution is w0 = 81 (b − a), w1 = w2 = 83 (b − a), w3 = 18 (b − a) and the quadrature formula

is
Z b 3
X b − ah 2a + b a + 2b i
f (x)dx ≈ wi f (xi ) = f (a) + 3f + 3f + f (b) (7.31)
a 8 3 3
i=0
7.4. Newton-Cotes formulas 97

This formula is called “Simpson’s 3/8-rule”. The error is

3 5 (4) (b − a)5 (4)

E(f ; a, b) = − h f (ξ) = − f (ξ), a < ξ < b. (7.32)
80 6480
We observe that the error is of the order O(h5 ). Though Simpson’s 3/8-rule uses one node
more than Simpson’s rule, it has the same order O(h5 ).

7.4.4 Closed Newton-Cotes (s=5) – Boules’s rule

For s = 5, xi = a + h i, i = 0, . . . , 4, and h = b−a
4 . The weights are the solution of a linear
system which is straightforward to set up, but is not shown here. The result is
7 32 12 32 7
w0 = (b − a), w1 = (b − a), w2 = (b − a), w3 = (b − a), w4 = (b − a)
90 90 90 90 90
Thus, the quadrature formula is
b
b − ah
Z
3a + b a+b a + 3b i
f (x)dx ≈ 7f (a) + 32f + 12f + 32f + 7f (b) (7.33)
a 90 4 2 4

This formula is called “Boules’s rule” or “4/90-rule”. The error is of the order O(h7 ):

8 7 (6) (b − a)7 (6)

E(f ; a, b) = − h f (ξ) = − f (ξ), a<ξ<b (7.34)
945 1935360

7.4.5 Open Newton-Cotes Rules

The Newton-Cotes formulas discussed so far are called closed Newton-Cotes formulas, because
the endpoints of he integration interval (i.e., a and b) belong to the nodes. Otherwise, they
are called open Newton-Cotes formulas. The most simple open Newton-Cotes formula uses
s = 1 node at a+b 2 , i.e., the node is at the midpoint of the integration interval. Therefore,
this rule is called midpoint rule:

a+b (b − a)3 (2)

Q(f ; a, b, ) = (b − a) f ( ), E(f ; a, b) = f (ξ), a < ξ < b. (7.35)
2 24
If the open Newton-Cotes rule has s points, the location of these points may be
b−a
xi = a + h(i + 1), i = 0, . . . , s − 1, h = . (7.36)
s+1
Alternatively, open Newton-Cotes formulas with s = n − 1 points can be obtained by leaving
out the nodes x0 = a and xn = b of a (n + 1)-point closed Newton-Cotes formula.
Another well-known open Newton-Cotes formula is Milne’s rule: s = 3, h = b−a4 :

b − a 4h
Q(f ; a, b) = 2f0 − f1 + 2f2 = 2f0 − f1 + 2f2 , (7.37)
3 3
98 Chapter 7. Numerical Integration

7(b − a)5 (4) 14h5 (4)

E(f ; a, b) = f (ξ) = f (ξ), a < ξ < b. (7.38)
23040 45
Note that Simpson’s rule and Milne’s rule are even exact with degree of precision 3, though
they use only s = 3 nodes. This is due to the symmetric distribution of the nodes over the
interval [a, b].

Example 7.5
1
Given the function f (x) = 1+x 2 for x ∈ [0, 1]. We look for an approximation to π/4 by

integrating f over the interval [0, 1] using (a) the trapezoidal rule, (b) Simpson’s rule, (c)
Simpson’s 3/8-rule, and (d) the 4/90-rule.

(a) trapezoidal rule: Q(f ; 0, 1) = 12 (f (0) + f (1)) = 21 (1 + 1/2) = 0.7500.

(b) Simpson’s rule: Q(f ; 0, 1) = 16 (f (0) + 4f (1/2) + f (1)) = 16 (1 + 3.2 + 0.5) = 0.78333.

1 1
(c) Simpson’s 3/8-rule: Q(f ; 0, 1) = 8 (f (0) + 3f (1/3) + 3f (2/3) + f (1)) = 8 (1 + 2.7 +
27/13 + 0.5) = 0.78462.

1 1
(d) 4/90-rule: Q(f ; 0, 1) = 90 (7f (0) + 32f (1/4) + 12f (1/2) + 32f (3/4) + 7f (1)) = 90 (7 +
512/17 + 48/5 + 512/25 + 3.5) = 0.78553.

The corresponding errors are for (a) 3.5 · 10−2 , (b) 2.1 · 10−3 , (c) 7.8 · 10−4 , and (d) 1.3 · 10−4 .

7.5 Composite Newton-Cotes formulas

The idea of piecewise polynomial interpolation leads to the so-called composite Newton-Cotes
formulas. The basic idea is to integrate piecewise polynomials, i.e., to subdivide the interval
[a, b] into n subintervals [xi , xi+1 ], i = 0, . . . , n − 1 where a = x0 < x1 < . . . < xn = b and to
use the fact that
Z b n−1
X Z xi+1
f (x) dx = f (x) dx. (7.39)
a i=0 xi

We then approximate each of the integrals

Z xi+1
f (x) dx (7.40)
xi

by one of the Newton-Cotes formulas just developed and add the results. Hence, if we
use a closed Newton-Cotes formula with s nodes, the composite Newton-Cotes formula has
n(s − 1) + 1 nodes. The corresponding rules are known as composite Newton-Cotes rules.
7.5. Composite Newton-Cotes formulas 99

7.5.1 Composite mid-point rule

Let h = b−a
n , xi = a + ih for i = 0, . . . n − 1. The most simple quadrature rule to be used on
each subinterval [xi , xi+1 ] is the midpoint rule:
xi+1
h3 (2)
Z
f (x) dx = hfi+1/2 + f (ξi ), ξi ∈ [xi , xi+1 ], (7.41)
xi 24
xi +xi+1
where fi+1/2 = f (xi+1/2 ) and xi+1/2 = 2 . Hence,

b n−1 n−1
h3 X (2)
Z X
f (x) dx = h fi+1/2 + f (ξi )
a 24
i=0 i=0
n−1
X b − a 2 (2)
=h fi+1/2 + h f (ξ), a < ξ < b.
24
i=0

Note that if
|f (2) (x)| ≤ M for all x ∈ [a, b], (7.42)

then by choosing h sufficiently small, we can achieve any desired accuracy (neglecting roundoff
errors). The rule
n−1
X b−a
Q(f ; a, b) = h fi+1/2 , h = (7.43)
n
i=0

is called the composite midpoint rule.

7.5.2 Composite trapezoidal rule

b−a
Following the same idea, we may construct the composite trapezoidal rule: let h = n ; then,
xi+1
h3 (2)
Z
h
f (x) dx = (fi + fi+1 ) − f (ξi ), xi < ξ < xi+1 , (7.44)
xi 2 12

so that
Z b n−1 i h3 X s−2
h Xh
f (x) dx = fi + fi+1 − f (2) (ξi )
a 2 12
i=0 i=0
h1 1 i b − a 2 (2)
= h f0 + f1 + f2 + . . . + fn−1 + fn − h f (ξ), a < ξ < b.
2 2 12

Example 7.6 (Error in the composite trapezoidal rule)

We consider the trapezoidal rule

b−a h3 (2)
Q(f ; a, b) = (f (a) + f (b)) ; E(f ; a, b) = − f (ξ), a < ξ < b.
2 12
100 Chapter 7. Numerical Integration

We want to apply a composite trapezoidal rule to evaluate approximately

Z b
f (x)dx
a

For m subintervals [ai , bi ], i = 1, . . . , m, we have

b m Z bi m
bi − ai
Z X X
f (x)dx = f (x)dx ≈ (f (ai ) + f (bi )) ,
a ai 2
i=1 i=1

b−a
with ai = a + (i − 1)h, bi = a + ih, h = m . Since bi − ai = h, we obtain
m
X bi − ai
Qm (f ; a, b) = (f (ai ) + f (bi )) =
2
i=1

h
= f (a) + 2f (a + h) + · · · + 2f (a + (m − 1)h) + f (b) .
2

For the error of the composed rule, we obtain

h3
f 00 (ξ1 ) + f 00 (ξ2 ) + · · · + f 00 (ξm )

Em (f ; a, b) = −
12
with ξi ∈ (ai , bi ), i = 1, . . . , m.
Following the mean value theorem that any continuous function g on [a, b] with c1 ≤
g(x) ≤ c2 for x ∈ [a, b] takes on every value between c1 and c2 at least once, we find at least
one ξ ∈ [a, b] such that
1
f 00 (ξ) = f 00 (ξ1 ) + f 00 (ξ2 ) + · · · + f 00 (ξm ) .

m
Therefore, we obtain

h2 b − a h2
Em (f ; a, b) = − m f 00 (ξ) = − (b − a) f 00 (ξ), a < ξ < b,
12 m 12
which means that the total error is of the order O(h2 ), one order less than that of the simple
trapezoidal rule, which has O(h3 ).

7.6 Interval transformation

Quadrature rules are often developed for special integration intervals, e.g. [0, 1] or [−1, 1].
That is, the nodes and weights given in tables or computer software refer to these intervals.
How can be use these formulas to approximate
Z b
f (x) dx? (7.45)
a
7.7. Gauss quadrature (?? – not examined) 101

Suppose the interval to which the nodes and weights of Q refer is [−1, 1]. Let {ξi : i = 1, . . . , s}
be the nodes of Q(f ; −1, 1) and {wi : i = 1, . . . , s} be the weights. Then,
b 1 1
b−a b−a
Z Z Z
f (x) dx = f (x(ξ)) dξ = g(ξ) dξ = Q(g; −1, 1) + E(g; −1, 1), (7.46)
a 2 −1 2 −1

where
b−a
x= (ξ + 1) + a. (7.47)
2
Hence,
b−a
Q(f ; a, b) = Q(g; −1, 1), g(ξ) := f (x(ξ)). (7.48)
2
That is if Q(f ; −1, 1) has nodes ξ and weights wi , then Q(f ; a, b) has nodes xi and weights
b−a
2 wi , where
b−a
xi = (ξi + 1) + a. (7.49)
2
Hence,
Z b s
X b−a
f (x) dx ≈ wi fi , fi = f (xi ). (7.50)
a 2
i=1
That means if we want to construct a quadrature rule for an arbitrary integration interval
[a, b] from given nodes and weights referring to another interval, say, [−1, 1], we only need
to find the (linear) mapping that maps [−1, 1] onto [a, b]. This is exactly Eq. (7.47). The
transformation allows to compute the nodes referring to the interval [a, b] and the Jacobian
of the mapping multiplied by the weights referring to [−1, 1] gives the weights referring to
[a, b].
We shall call two rules by the same name, provided that the abscissas and weights are
related by a linear transformation as above. For instance, we shall speak of Simpson’s rule
over the interval [−1, 1] and over [a, b].

7.7 Gauss quadrature (?? – not examined)

Till now we prescribed the nodes {xi : i = 1, . . . , s} of the quadrature formula and determined
the weights {wi : i = 1, . . . , s} as solution of the linear system
s
X bm+1 − am+1
wk x m
k = , m = 0, 1, . . . , s − 1 (7.51)
m+1
k=1

Then, the resulting quadrature formula has DOP s − 1, i.e., it is exact for all polynomials of
degree at most s − 1.
If we do not prescribe nodes and weights, the above system of equations contains 2s free
parameters, namely s nodes and s weights. The system is linear in the weights and non-
linear in the nodes. We may use the 2s degrees of freedom to demand that the integration
102 Chapter 7. Numerical Integration

formula is exact for polynomials of degree at most 2s − 1, which are uniquely determined by
2s coefficients. That means we require that
Z b Xs
p(x)dx = wi p(xi ), (7.52)
a i=1

for all p ∈ P2s−1 , i.e., for all polynomials of degree at most 2s − 1. This is the idea behind
the so-called Gauss quadrature formulas. Of course, the question is whether the system of
non-linear equations (7.52) is uniquely solvable.
Suppose the integration interval is [−1, 1] and suppose there exists nodes and weights so
that E(f ; −1, 1) = 0 for all f ∈ P2s−1 . Let q ∈ Ps−1 arbitrary and consider the polynomial

q(x) · (x − x0 )(x − x1 ) · · · (x − xs−1 ) ∈ P2s−1 . (7.53)

The polynomial
ωs (x) := (x − x0 )(x − x1 ) · · · (x − xs−1 ) (7.54)
is the nodal polynomial; it has degree s. Moreover, it holds obviously Q(q · ωs ; −1, 1) = 0.
Hence, it must hold Z 1
q(x) ωs (x) dx = 0, for all q ∈ Ps−1 . (7.55)
−1
We define in C([−1, 1]) the scalar product
Z 1
hf, gi := f (x)g(x) dx. (7.56)
−1

Then, condition (7.55) means that ωs (x) is orthogonal to all polynomials q ∈ Ps−1 with
respect to the scalar product (7.56). We construct a system of orthogonal polynomials Li ,
i.e., (
0 for i 6= j
hLi , Lj i = . (7.57)
6= 0 for i = j
This can be done from the monomial basis xk using the Gram-Schmidt orthogonalization
procedure. With L0 (x) = 1 and L1 (x) = x, the solution is
n−1
X hxn , Li i
Ln (x) = xn − Li (x). (7.58)
hLi , Li i
i=0

The polynomials Ln (x) are unique and can be written as

2n (n!)2
Ln (x) = Pn (x), (7.59)
(2n)!
where Pn (x) is the Legendre polynomial of degree n:
1 dn 2
Pn = (x − 1)n . (7.60)
2n n! dxn
7.7. Gauss quadrature (?? – not examined) 103

It is P0 (x) = 1, P1 (x) = x, P2 (x) = 21 (3x2 − 1), etc. They can be computed recursively:

2n + 1 n
Pn+1 (x) = x Pn (x) − Pn−1 (x), n = 1, 2, . . . . (7.61)
n+1 n+1
Hence, Ln is up to a scaling factor identical to Pn . The Legendre polynomial Pn (x) has n
distinct zeros in the interval (−1, 1) and so has Ln . For given number of nodes s, we choose
the zeros of Ps (x) as nodes of the integration formula. For arbitrary p ∈ P2s−1 , we have by
polynomial division
p(x) = q(x)Ls (x) + r(x), (7.62)
where q ∈ Ps−1 and a remainder r ∈ Ps−1 . Due to the orthogonality, we have
Z 1 Z 1 Z 1
p(x) dx = q(x)Ls (x) dx + r(x) dx. (7.63)
−1 −1 −1

The first integral on the right-hand site is zero due to the orthogonality. Hence,
Z 1 Z 1
p(x) dx = r(x) dx. (7.64)
−1 −1

Moreover, it is for arbitrary weights wi :

s−1
X
Q(p; −1, 1) = Q(qLs ; −1, 1) + Q(r; −1, 1) = wi q(xi )Ls (xi ) + Q(r; −1, 1) = Q(r; −1, 1),
i=0
(7.65)
because Ls (xi ) = 0. Hence, it holds
Z 1
p(x) dx = Q(p; −1, 1), for all p ∈ P2s−1 , (7.66)
−1

if and only if Z 1
r(x) dx = Q(r; −1, 1), for all r ∈ Ps−1 . (7.67)
−1
That the latter holds can be achieved by a suitable choice of the weights according to the
principle of interpolatory quadrature. It is for r ∈ Ps−1 :
Z 1 Z s−1
1 X s−1 Z
X 1
r(x) dx = r(xi ) li (x) dx = li (x) dx r(xi ). (7.68)
−1 −1 i=0 i=0 −1

Hence, the weights are

1 1
x − xj
Z Z Y
wi = li (x) dx = dx, i = 0, . . . s − 1, (7.69)
−1 −1 j=0,j6=i xi − xj

and can be computed from the nodes x0 , . . . xs−1 . The result is the following:
104 Chapter 7. Numerical Integration

Definition 7.7 (Gauss-Legendre quadrature)

Let {xi : i = 0, . . . s − 1} be the s zeros of the Legendre polynomial Ps (x) and let
Z 1
wi = li (x) dx, (7.70)
−1

where li (x) is the i-th Lagrange polynomial associated with the nodes {xi }. Then,
Z 1 s−1
X
p(x) dx = wi p(xi ) for all p ∈ P2s−1 , (7.71)
−1 i=0

and for f ∈ C 2s ([−1, 1]) the integration error is

1
f (2s) (ξ)
Z
2
E(f ; −1, 1) = Ls (x)2 dx = f (2s) (ξ), −1 < ξ < 1. (7.72)
(2s)! −1 (2s + 1)(2s)!

The weights of all Gauss-Legendre quadrature rules are positive. The sum of the weights
is always equal to 2, the length of the interval [−1, 1]. A Gauss-Legendre rule with s nodes
has DOP 2s − 1. This is the maximum degree of precision a quadrate rule with s nodes
can have. All nodes are in (−1, 1). The nodes are placed symmetrically around the origin,
and the weights are correspondingly symmetric. For s even, the nodes satisfy x0 = −xs−1 ,
x1 = −xs−2 etc., and the weights satisfy w0 = ws−1 , w1 = ws−2 etc. For s odd, the nodes
and weights satisfy the same relation as for s even plus we have x s+1 = 0. Notice that we
2
have shown before that the Gauss-Legendre rules are interpolatory rules! If we need to apply
a Gauss-Legendre rule to approximate an integral over the interval [a, b] we transform the
nodes and weights from the interval [−1, 1] to the interval [a, b] as explained in section 7.6.

Example 7.8
When applying a Gauss-Legendre formula with s nodes to a function f on the interval [−1, 1],
we obtain for
h4 (4)
R1
s = 2: −1 f (x)dx ≈ f (− √13 ) + f ( √13 ), E(f ; −1, 1) = 135 f (ξ), −1 < ξ < 1.
R1 1
√ √ h6 (6) (ξ),
s = 3: −1 f (x)dx ≈ 9 5f (− 0.6) + 8f (0) + 5f ( 0.6) , E(f ; −1, 1) = 15750 f

−1 < ξ < 1.

Obviously, the error estimates are better than for the Newton-Cotes formulas with the
Ps
same number of nodes. Since no integration rule of type i=1 wi f (xi ) can integrate ex-
Qs 2
actly the function i=1 (x − xi ) ∈ P2s , we see that Gauss rules are best in the sense that
they integrate exactly polynomials of a degree as high as possible with a formula of type
Ps
i=1 wi f (xi ). Therefore, Gauss formulas possess maximum degree of precision. There is no
s-point quadrature formula with degree of precision 2s.
7.8. Numerical integration error – Details (?? – not examined) 105

Example 7.9 (Continuation of example 7.5)

We want to apply a Gauss-Legendre formula with s = 1, 2, 3, and 4 nodes. The nodes and
weights w.r.t. [−1, 1] are given in the following table.

s nodes xi , i = 1, . . . , 2s − 1 weights wi , i = 1, . . . , 2s − 1
1 x0 = 0 w0 = 2
2 x0,1 = ± √13 w0 = w1 = 1
√
3 x0,2 = ± 0.6 w0 = w2 = 59
x1 = 0 w1 = 98
4 x0,3 = ±0.86113631 w0 = w3 = 0.34785485
x1,2 = ±0.33998104 w1 = w2 = 0.65214515

We obtain, observing the parameter transformation [−1, 1] → [0, 1]:

s = 1: Q(f ; 0, 1) = f (1/2) = 0.8000

s = 2: Q(f ; 0, 1) = 12 (f (0.7887) + f (0.2113)) = 0.7869

5
s = 3: Q(f ; 0, 1) = 18 (f (0.8873) + f (0.1127)) + 49 f (0.5) = 0.7853

s = 4: Q(f ; 0, 1) = 0.1739(f (0.9306) + f (0.6943)) + 0.3261(f (0.6700) + f (0.3300)) = 0.7854.

The corresponding integration errors are 1.5 · 10−2 , 1.5 · 10−3 , 1.3 · 10−4 , and 4.8 · 10−6 ,
respectively.
Exercise 7.10
Find an approximation to
3
sin2 x
Z
I= dx
1 x
using Gaussian quadrature with s = 3 nodes. Evaluate the same integral but now using
composed Gaussian quadrature with m = 2 subintervals and s = 3 nodes each.

7.8 Numerical integration error – Details (?? – not examined)

We want to derive expressions for the numerical integration error E(f ; a, b) for an interpola-
tory integration formula with DOP = d. We consider
Z b n
X
f (x) dx = wi f (xi ) + E(f ; a, b), (7.73)
a i=0

where a ≤ x0 < x1 < . . . < xn ≤ b are the nodes and {wi } are the weights of the integration
formula and E is the integration error. Hence,
Z b n
X
E(f ; a, b) = f (x) dx − wi f (xi ), (7.74)
a i=0
106 Chapter 7. Numerical Integration

and
E(xq ; a, b) = 0, for all q ≤ d, (7.75)
and
E(xq+1 ; a, b) 6= 0. (7.76)
Suppose f ∈ C d+1 ([a, b]). Then, for any value of x and x̄ ∈ [a, b], we can write

f 0 (x̄) f (d) (x̄) f (d+1) (ξ)

f (x) = f (x̄) + (x − x̄) + . . . + (x − x̄)d + (x − x̄)d+1 , (7.77)
1! d! (d + 1)!
where for any fixed x̄ ∈ [a, b], ξ depends on x, but is in [a, b]. We write the last equation as

f (x) = pd (x) + ed (x), (7.78)

where pd (x) comprises the first d + 1 terms, i.e., is a polynomial of degree d. Hence,
Z b Z b n
X
E(f ; a, b) = pd (x) dx + ed (x) dx − wi pd (xi ) − ed (xi )
a a i=0
Z b n
X Z b n
X
= pd (x) dx − wi pd (xi ) + ed (x) dx − wi ed (xi )
a i=0 a i=0
Z b n
X
= ed (x) − wi ed (xi ),
a i=0

because DOP = d. Inserting the expression for ed (x) yields

b n
f (d+1) (ξ) X f (d+1) (ξi )
Z
E(f ; a, b) = (x − x̄)d+1 dx − wi (xi − x̄)d+1 , (7.79)
a (d + 1)! (d + 1)!
i=0

where ξ, ξ0 , ξ1 , . . . ξn all lie in [a, b]. Let

M := max |f (d+1) (x)| (7.80)

x∈[a,b]

and notice that

b−a
|x − x̄| ≤ (7.81)
2
a+b
in [a, b] when we choose x̄ = 2 . Then,
b n n
M (b − a)d+1 h i M (b − a)d+1 h
Z X X i
|E(f ; a, b)| ≤ dx + |wi | ≤ d+1 (b − a) + |wi | . (7.82)
2d+1 (d + 1)! a 2 (d + 1)!
i=0 i=0

When all the weights wi are positive, we have

n
X n
X
|wi | = wi = b − a, (7.83)
i=0 i=0
7.8. Numerical integration error – Details (?? – not examined) 107

because any interpolatory rule integrates a constant exactly, i.e., the sum of all weights must
be equal to b − a. Hence, for integration rules with positive weights, it holds

M (b − a)d+2
|E(f ; a, b)| ≤ . (7.84)
2d (d + 1)!

The error bounds (7.82) and (7.84) are often too conservative. A more useful form is obtained
when we start with the interpolation error: let
n
X 1
f (x) = li (x) f (xi ) + ωn+1 (x)f (n+1) (ξ), (7.85)
(n + 1)!
i=0

where wn+1 (x) is the nodal polynomial of degree n + 1 and

min (xi , x) ≤ ξ ≤ max (xi , x). (7.86)

i=0,...,n i=0,...,n

Then,
Z b n Z b
X 1
f (x) dx = wi f (xi ) + ωn+1 (x) f (n+1) (ξ) dx, (7.87)
a (n + 1)! a
i=0

where Z b
wi = li (x) dx. (7.88)
a

Note that ξ = ξ(x), i.e., we can’t take the term f (n+1) (ξ) out of the integral. Hence, even
if f (n+1) (x) is known analytically, we can’t evaluate the second term on the right-hand side,
i.e., the integration error, analytically. However, with

M := max |f (n+1) (x)|, (7.89)

x∈[a,b]

we obtain the estimate

Z b
M
|E(f ; a, b)| ≤ |ωn+1 (x)| dx. (7.90)
(n + 1)! a

If no one of the abscissas xi lies in (a, b), for instance if we want to compute the integral
Z xi+1
f (x) dx, (7.91)
xi

the nodal polynomial ωn+1 (x) does not change sign in (a, b) and the second law of the mean
may be invoked to show that
b
f (n+1) (η)
Z
E(f ; a, b) = ωn+1 (x) dx. (7.92)
(n + 1)! a
108 Chapter 7. Numerical Integration

Definition 7.11 (Second law of the mean)

If f ∈ C([a, b]) and g does not change sign inside [a, b], then
Z b Z b
f (x) g(x) dx = f (η) g(x) dx (7.93)
a a

for at least one η such that a < η < b.

If the nodes are distributed equidistantly over [a, b], we can obtain other error estimates. First
observe that if x = a + b−a
n s, n ∈ N+ , then

b n
b−a
Z Z
f (x) dx = F (s) ds, (7.94)
a n 0

where
b−a
F (s) = f a+ s . (7.95)
n
Suppose f (x) is interpolated by a polynomial of degree n, which agrees with f (x) at the n + 1
equally spaced nodes xi in [a, b]. Then,
Z b n
X
F (s) ds = wi F (i) + En (F ; 0, n), (7.96)
a i=0

where Z b Z n
wi = li (x) dx = Li (s) ds, (7.97)
a 0
where Y s−j
Li (s) = (7.98)
i−j
j=0,j6=i

and
Z n
1
En (F ; 0, n) = s(s − 1) · · · (s − n) F (n+1) (ξ1 ) ds, 0 < ξ1 < n. (7.99)
(n + 1)! 0

Since s(s − 1) · · · (s − n) is not of constant sign on [0, n], we can’t apply the second law of
the mean. However, it can be shown that when n is odd, the error can be expressed in the
form which would be obtained if the second law of the mean could be applied:
n
F (n+1) (ξ2 )
Z
En (F ; 0, n) = s(s − 1) · (s − n) ds, n odd. (7.100)
(n + 1)! 0

Moreover, it can be shown that when n is even, it holds

n
F (n+2) (ξ2 )
Z
n
En (F ; 0, n) = (s − ) s(s − 1) · · · (s − n) ds, n even, (7.101)
(n + 2)! 0 2
7.8. Numerical integration error – Details (?? – not examined) 109

where in both cases 0 < ξ2 < n. Since we assumed equally spaced nodes, we may write if
b−a
h= , xi = a + ih, i = 0, . . . , n : (7.102)
n
n
hn+2 f (n+1) (ξ)
Z
En (f ; a, b) = s(s − 1) · · · (s − n) ds, n odd (7.103)
(n + 1)! 0
and
n
hn+3 f (n+2) (ξ)
Z
n
En (f ; a, b) = (s − ) s(s − 1) · · · (s − n) ds, n even, (7.104)
(n + 2)! 0 2
where a < ξ < b in each case.
We return to the integration error E(f ; a, b) for any interpolatory quadrature rule. Instead
of
f (d+1) (ξ)
ed (x) = (x − x̄)d+1 , (7.105)
(d + 1)!
we use the integral remainder term for the Taylor series:
1 x
Z
ed (x) = (x − s)d f (d+1) (s) ds. (7.106)
d! x̄
Let x̄ = a. Then,
Z bZ x n
X Z xi
d! E(f ; a, b) = (x − s)d f (d+1) (s) ds dx − wi (xi − s)d f (d+1) (s) ds. (7.107)
a a i=0 a

It will be convenient to remove the x from the upper limit of the interior integral above. To
this end we introduce the truncated power function
(
(x − s)d when x ≥ s
(x − s)d+ := . (7.108)
0 when x < s

Since (x − s)d+ = 0 for s > x, we have

1 x 1 b
Z Z
d (d+1)
ed (x) = (x − s) f (s) ds = (x − s)d+ f (d+1) (s) ds, (7.109)
d! a d! a
and so
Z bhZ b i n
X Z b
d! E(f ; a, b) = (x − s)d+ f (d+1) (s) ds dx − wi (xi − s)d+ f (d+1) (s) ds
a a i=0 a
Z b hZ b i Z b n
X
(d+1) d
= f (s) (x − s)+ dx ds − f (d+1) (s) wi (xi − s)d+ ds
a a a i=0
Z b hZ b n
X i
(d+1)
= f (s) (x − s)d+ dx − wi (xi − s)d+ ds
a a i=0
Z b
= f (d+1) (s) E((x − s)d+ ; a, b) ds.
a
110 Chapter 7. Numerical Integration

We have just proved the Peano kernel theorem (in fact, the full theorem is a bit more general
than what we proved here, though our development is sufficient for analysis of Newton-Cotes
rules).

Theorem 7.12 (Peano kernel theorem)

Suppose f ∈ C d+1 ([a, b]) and let Q(f ; a, b) = ni=0 wi f (xi ) be a quadrature rule with DOP
P

d. Then,
1 b (d+1)
Z
E(f ; a, b) = f (s) K(s) ds, (7.110)
d! a
where
K(s) = E((x − s)d+ ; a, b). (7.111)

This result may appear somewhat ponderous, but it captures a beautiful idea: the error in
the quadrate formula of f is related to the integral of a function K(s) (called the Peano
kernel or the influence function) that is independent of f .
In typical circumstances, K(s) does not change sign on [a, b], so the second law of the mean
can be applied to extract f from inside the integral. This allows E(f ; a, b) to be described as

E(f ; a, b) = κ f (d+1) (ξ), a < ξ < b, (7.112)

where κ is independent of f (x). Hence, to calculate κ, we may make the special choice
f (x) = xd+1 . Then,
E(xd+1 ; a, b) = κ (d + 1)!, (7.113)
which fixes κ, hence,

E(xd+1 ; a, b) (d+1)
E(f ; a, b) = f (ξ), a<ξ<b (7.114)
(d + 1)!

if K does not change sign in [a, b]. Note that positive weights wi is not sufficient to guarantee
that K does not change sign in [a, b].

7.9 Two-dimensional integration

The problem of two-dimensional integration can be formulated as follows:
Let B define a fixed closed region in R2 and dB = dxdy the surface element. Find fixed
points (x1 , y1 ), (x2 , y2 ), . . . , (xs , ys ) and fixed weights w1 , w2 , . . . , ws such that
ZZ s
X
w(x, y)f (x, y)dxdy ≈ wi f (xi , yi ) (7.115)
B i=1

is a useful approximation to the integral on the left for a reasonable large class of functions
of two variables defined over B.
7.9. Two-dimensional integration 111

In passing from one-dimensional integration to two-dimensional integration a series of

new problems arise since the diversity of integrals and the difficulty in handling them greatly
increases.

(1) In one dimension only three different types of integration domains are possible: the
finite interval (which has been discussed before), the single infinite interval, and the
double infinite interval. In two dimensions we have to deal with a variety of domains.

(2) The behaviour of functions of two variables can be considerably more complicated than
that of functions of one variable and the analysis of them is limited.

(3) The evaluation of functions of two variables takes much more time.

Only for some standard regions the theory is developed much further and the following brief
discussion will be restricted to the two most important standard regions, namely

(a) the unit square U := {(u, v) : −1 ≤ u, v ≤ 1}

(b) the standard triangle T := {(u, v) : u, v ≥ 0, u + v ≤ 1}

They allow to transform two-dimensional integrals into two iterated integrals. Arbitrary
domains B can sometimes be transformed into a standard region D by affine or other trans-
formations. Then, given cubature formulas for D can be transformed as well providing a
cubature formula for B.
To explain this, let B be a region in the x, y-plane and D be our standard region in the
u, v-plane. Let the regions B and D be related to each other by means of the transformation

x = φ(u, v),
y = ψ(u, v).

We assume that φ, ψ ∈ C 1 (D) and that the Jacobian

∂φ ∂φ
!
∂u ∂v
J(u, v) = det ∂ψ ∂ψ
(7.116)
∂u ∂v

does not vanish in D. Suppose further that a cubature formula for the standard domain D
is available, i.e.
ZZ Xs
h(u, v)dudv ≈ wi h(ui , vi ), (ui , vi ) ∈ D, (7.117)
D i=1
112 Chapter 7. Numerical Integration

with given nodes (ui , vi ) and given weights wi . Now we have

ZZ ZZ
f (x, y)dxdy = f (φ(u, v), ψ(u, v)) |J(u, v)|dudv (7.118)
B D
s
X
≈ wi f (φ(ui , vi ), ψ(ui , vi )) |J(ui , vi )| (7.119)
i=1
Xs
=: Wi f (xi , yi ), (7.120)
i=1

where
xi = φ(ui , vi ), yi = ψ(ui , vi ), Wi = wi |J(ui , vi )|.
Thus,
ZZ s
X
f (x, y)dxdy ≈ Wi f (xi , yi ). (7.121)
B i=1

An important special case occurs when B and D are related by a non-singular affine trans-
formation:

x = a0 + a1 u + a2 v,
y = b0 + b1 u + b2 v.

Then, the Jacobian is constant:

!
a1 a2
|J(u, v)| = det = |a1 b2 − a2 b1 | =
6 0. (7.122)

b1 b2

Examples are the parallelogram which is an affine transformation of the standard square U
and any triangle which is an affine transformation of the standard triangle T .

Example 7.13
Given the s nodes (ui , vi ) and weights wi of a cubature formula w.r.t. to the standard triangle
T = {(u, v) : 0 ≤ u ≤ 1, 0 ≤ v ≤ 1 − u}. Calculate the nodes (xi , yi ) and weights Wi of a
cubature formula
Xs Z
Q(f ; B) = Wi f (xi , yi ) ≈ f (x, y)dx dy,
i=1 B

where B = {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ x}.

First of all we have to find the transformation which maps T onto B. Since both are
triangles, the transformation is an affine transformation which can be written as

x = a0 + a1 u + a2 v; y = b0 + b1 u + b2 v.
7.9. Two-dimensional integration 113

The coefficients are determined using the identical points (u = 0, v = 0) → (x = 0, y = 0),

(u = 1, v = 0) → (x = 1, y = 0), and (u = 0, v = 1) → (x = 1, y = 1). This yields x =
u + v, y = v. The Jacobian of the parameter transformation is equal to 1, i.e. dx dy = du dv.
Now, we obtain
Z Z
f (x, y)dx dy = f (x(u, v), y(u, v))du dv
B T
s
X s
X
≈ wi f (x(ui , vi ), y(ui , vi )) = Wi f (xi , yi ) =: Q(f ; B).
i=1 i=1

Therefore the nodes of Q(f ; B) are xi = ui + vi , yi = vi , i = 1, . . . , s and the weights are

Wi = wi , i = 1, . . . , s.
Exercise 7.14
Given a triangle B with vertices (1, 1), (2, 2), and (2, 3). Determine the transformation which
maps the standard triangle T onto B. Let (ui , vi ) and wi , i = 1, . . . , s denote the nodes and
weights of a cubature formula Q(f ; T ) = si=1 wi f (ui , vi ) which approximates the integral
P
R
T f (u, v) du dv.
R
Use these nodes and weights and derive a cubature formula Q(g; B) to
approximate B g(x, y)dx dy.

7.9.1 Cartesian products and product rules

Let I1 , I2 be intervals. The symbol I1 × I2 denotes the Cartesian product of I1 and I2 , and
by this we mean the region in R2 with points (x, y) that satisfy x ∈ I1 , y ∈ I2 .
Example 7.15
I1 : −1 ≤ x ≤ 1, I2 : −1 ≤ y ≤ 1. Then I1 × I2 is the unit square Q : {(x, y) : −1 ≤ x, y ≤ 1}.

Suppose now that Q1 is an s1 -point rule of integration over I1 and Q2 is an s2 -point rule of
integration over I2 :
Xs1 Z
Q1 (h, I1 ) = wi h(xi ) ≈ h dx, (7.123)
i=1 I1

Xs2 Z
Q2 (g, I2 ) = vi g(yi ) ≈ g dy. (7.124)
i=1 I2

wi and vi denote the weights of the quadrature formulas Q1 and Q2 , respectively. Then, by
the product rule of Q1 and Q2 we mean the s1 · s2 -point rule applicable to I1 × I2 and defined
by
s1 X
X s2
Q1 × Q2 (f ; I1 × I2 ) = wi vj f (xi , yj ) (7.125)
i=1 j=1
ZZ
≈ f (x, y) dxdy. (7.126)
I1 ×I2
114 Chapter 7. Numerical Integration

We can show that if Q1 integrates h(x) exactly over I1 , if Q2 integrates g(y) exactly over I2 ,
and if f (x, y) = h(x)g(y), then Q1 × Q2 will integrate f (x, y) exactly over I1 × I2 .

Example 7.16
Let B be the rectangle a ≤ x ≤ b, c ≤ y ≤ d. The evaluation of
ZZ
f (x, y)dxdy
B

by the product of two Simpson’s rules yields

(b − a)(c − d)
ZZ
f (x, y)dxdy ≈ f (a, c) + f (a, d) + f (b, c) + f (b, d)
36
B

c+d a+b c+d a+b
+ 4 f a, +f , c + f b, +f ,d
2 2 2 2

a+b c+d
+ 16f , . (7.127)
2 2
This 9-point rule will integrate exactly all linear combinations of the 16 monomials xi y j ,
0 ≤ i, j ≤ 3. The corresponding weights are the products of the weights of the two Simpson
rules.
Example 7.17
Given the s nodes and weights of a Gauss-Legendre quadrature formula w.r.t. the standard
interval [−1, 1]. Construct a cubature formula to compute the integral
Z
I := f (x, y) dD,
D

with D := {(x, y) : 0 ≤ x2 + y 2 ≤ 1}.

D is the unit circle. Introducing polar coordinates x = r cos α, y = r sin α, we can
transform D into the rectangle R := {(r, α) : 0 ≤ r ≤ 1, 0 ≤ α ≤ 2π}. The Jacobian of this
transformation is J1 (r, α) = r. Therefore, we obtain
Z Z
I= f dD = f |J1 | dR,
D R

with dR = dr dα. The affine transformation r = 12 (u + 1), α = π(v + 1) transforms the unit
square U = [−1, 1] × [−1, 1] onto R. Its Jacobian is J2 (u, v) = π2 . Therefore, we obtain
Z Z
I= f |J1 |dR = f |J1 ||J2 |dU,
R U

with dU = du dv. Let g := f |J1 ||J2 |. Then

Z u=1 Z v=1
I= g(u, v)du dv.
u=−1 v=−1
7.9. Two-dimensional integration 115

R v=1
Let v=−1 g(u, v)dv =: h(u). Then, we may write
Z u=1 s
X
I= h(u)du ≈ wi h(ui ),
u=−1 i=1

where ui and wi denote the nodes and weights of the Gauss-Legendre quadrature w.r.t. the
interval [−1, 1]. Because of the definition of h(u) we obtain
Xs Xs Z v=1
I≈ wi h(ui ) = wi g(ui , v) dv.
i=1 i=1 v=−1

For fixed i, g(ui , v) is a function of v only, thus we can again apply the Gauss-Legendre
quadrature formula to compute the integral:
Z v=1 s
X
g(ui , v) dv ≈ wj g(ui , vj ).
v=−1 j=1

This yields
s X
X s
I≈ wi wj g(ui , vj ),
i=1 j=1
u+1 u+1 1
with g = f |J1 | |J2 |, x = 2 cos(π(v+1)), y = 2 sin(π(v+1)), r = 2(u+1) , and α = π(v+1).

We can generalize the product rule approach. If the region B is sufficiently simple, the
integral ZZ
I= f (x, y) dxdy
B
may be expressed as an iterated integral of the form
Z b Z u2 (x) Z b Z u2 (x)
I= f (x, y) dxdy =: g(x) dx, g(x) := f (x, y) dy (7.128)
a u1 (x) a u1 (x)

Let us use an s1 -point rule Q1 for integrating g(x):

Z b Xs1 s1
X Z u2 (xi )
g(x)dx ≈ wi g(xi ) = wi f (xi , y)dy (7.129)
a i=1 i=1 u1 (xi )

For each of the s1 integrals we use an s2 -point rule Q2

Z u2 (xi ) s2
X
f (xi , y)dy ≈ vji f (xi , yji ) (7.130)
u1 (xi ) j=1

The double subscripts on the right reflect the fact that the abscissas and weights of Q2 must
be adjusted for each value of i to the interval u1 (xi ) ≤ y ≤ u2 (xi ). Thus
s1
X s2
X
I= wi vji f (xi , yji ) (7.131)
i=1 j=1
116 Chapter 7. Numerical Integration

is an s1 · s2 -point rule for I. Note that this is the same formula that results when applying the
product rule Q1 × Q2 to the transformed integral over the square [a, b] × [a, b] with Jacobian

u2 (x) − u1 (x)
J(x) = .
b−a

7.9.2 Some remarks on 2D-interpolatory formulas

Univariate interpolatory integration formulas with prescribed nodes x1 , . . . , xs can be ob-
tained by requiring Q(f ; a, b) to integrate all polynomials in Ps−1 exactly. All interpolatory
quadrature formulas Q have a degree of precision d ≥ s − 1, where the actual degree of pre-
cision d depends on the specific abscissas x1 , . . . , xs . The largest possible degree of precision,
d = 2s − 1, is attained for Gauss formulas, whereas Newton-Cotes formulas only provide
d = s − 1 and d = s, depending on whether s is even or odd. It seems reasonable to construct
multivariate integration formulas Q by generalizing this univariate approach, i.e. by requiring
Q to integrate all polynomials of total degree d exactly. That means that for s given distinct
nodes (x1 , y1 ), . . . , (xs , ys ) the weights w1 , . . . , ws of the cubature formula
s
X
Q(f ; I) = wi f (xi , yi )
i=1

have to be chosen in such a way that

E(f ; I) = 0 for all f ∈ Pd . (7.132)

The space Pd has dimension M := 2+d

d , i.e. all functions from Pd can be written as a linear
2+d

combination of d basis functions φi . Then (7.132) is equivalent to

E(φi ; I) = 0, i = 1, . . . , M. (7.133)

Equations (7.133) are called moment equations. For a fixed choice of the basis {φ1 , . . . , φM }
and of the abscissas (x1 , y1 ), . . . , (xs , ys ) they define a system of linear equations
s
X
wi φj (xi , yi ) = Iφj , j = 1, . . . , M. (7.134)
i=1

Then, Q is called an interpolatory cubature formula, provided the system has a unique
solution.
In one dimension, Q is identical to the quadrature formula obtained by interpolation with
polynomials from Pd at the distinct nodes x1 . . . . , xs . This relationship does not generally
hold in two (or more) dimensions!
For arbitrary given distinct points (x1 , y1 ), . . . , (xs , ys ) the moment equations usually
have no unique solution. Thus, when trying to construct s-point interpolatory cubature
formulas, the linear system has to be solved not only for the weights w1 , . . . , ws , but also
7.9. Two-dimensional integration 117

for the nodes (x1 , y1 ), . . . , (xs , ys ). Then the system (7.134) is non-linear in the unknowns
(x1 , y1 ), . . . , (xs , ys ). Each node (xi , yi ) introduces three unknowns: the weight wi and the
two coordinates of each node (xi , yi ). Therefore, the cubature formula Q to be constructed
has to satisfy a system of d+2

d non-linear equations in 3s unknowns. For non-trivial values
of s, these non-linear equations are too complex to be solved directly.
However, Chakalov’s theorem guarantees at least that such a cubature formula exists:

Theorem 7.18 (Chakalov)

Let B be a closed bounded region in R2 and let W be a non-negative weight function on
B. Then there exist s = d+2

d points (x1 , y1 ), . . . , (xs , ys ) in B and corresponding positive
weights w1 , . . . , ws so that the corresponding cubature formula Q has a degree of precision d.

Unfortunately, Chakalov’s theorem is not constructive, and therefore useless for the practical
construction of cubature formulas. Therefore, the construction of interpolatory cubature
formulas is an area of ongoing research. For most of the practical applications, however,
product rules as discussed in section 7.9.1 can be derived.
118 Chapter 7. Numerical Integration
Chapter 8

Numerical Methods for Solving

Ordinary Differential Equations

8.1 Introduction
Many problems in technical sciences result in the task to look for a differentiable function
y = y(x) of one real variable x, whose derivative y 0 (x) fulfils an equation of the form

y 0 (x) = f (x, y(x)), x ∈ I = [x0 , xn ]. (8.1)

Equation (8.1) is called ordinary differential equation. Since only the first derivative of y
occurs, the ordinary differential equations (8.1) is of first order. In general it has infinitely
many solutions y(x). Through additional conditions one can single out a specific one. Two
types of conditions are commonly used, the so-called initial condition

y(x0 ) = y0 , (8.2)

i.e. the value of y at the initial point x0 of I is given, and the so-called boundary condition

g(y(x0 ), y(xn )) = 0, (8.3)

where g is a function of two variables. Equations (8.1),(8.2) define an initial value problem
and equations (8.1),(8.3) define a boundary value problem.
More generally, we can also consider systems of p ordinary differential equations of first
order
y10 (x) = f1 (x, y1 (x), y2 (x), . . . , yp (x))
y20 (x) = f2 (x, y1 (x), y2 (x), . . . , yp (x))
.. (8.4)
.
yp0 (x) = fp (x, y1 (x), y2 (x), . . . , yp (x))

119
120 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

for p unknown functions yi (x), i = 1, . . . , p of a real variable x. Such systems can be written
in the form (8.1) when y 0 and f are interpreted as vector of functions:
 0   
y1 (x) f1 (x, y1 (x), . . . , yp (x))
y 0 (x) :=  ...  , f (x, y(x)) :=  ..
. (8.5)
   
.
yp0 (x) fp (x, y1 (x), . . . , yp (x))

Correspondingly the initial condition (8.2) has to be interpreted as

   
y1 (x0 ) y1,0
y(x0 ) = y0 =  ..   .. 
 =  . . (8.6)

.
yp (x0 ) yp,0

In addition to ordinary differential equations of first order there are ordinary differential
equations of the mth order , which have the form

y (m) (x) = f (x, y(x), y 0 (x), . . . , y (m−1) (x)). (8.7)

The corresponding inital value problem is to determine a function y(x) which is m-times
differentiable, satisfies (8.7), and fulfils the initial condition:
(i)
y (i) (x0 ) = y0 , i = 0, 1, . . . , m − 1. (8.8)

By introducing auxiliary functions

z1 (x) : = y(x)
z2 (x) : = y 0 (x),
.. (8.9)
.
zm (x) : = y (m−1) (x),

the ordinary differential equation of the mth order can always be transformed into an equiv-
alent system of m first order differential equations
   
z10 z2
 .   ..
 ..  

 = . 
. (8.10)
 0
 zm−1   zm
  

zm0 f (x, z1 , z2 , . . . , zm )

In this chapter we will restrict to initial value problems for ordinary differential equations of
first order, i.e. to the case of only one ordinary differential equation of first order for only
one unknown function. However, all methods and results holds for systems of p ordinary
differential equations of first order, as well, provided quantities such as y and f (x, y) are
interpreted as vectors, and | · | as norm k · k. Moreover, we always assume that the initial
value problem is uniquely solvable. This is guaranteed by the following theorem:
8.1. Introduction 121

Figure 8.1: Graphical interpretation of Lipschitz continuity: finding a cone at each point
on the curve such that the function does not intersect the cone.

Theorem 8.1 (Existence and uniqueness)

Let f be defined and continuous on a strip G := {(x, y) : a ≤ x ≤ b, y ∈ R}, a, b finite.
Further, let there be a constant L such that

|f (x, y1 ) − f (x, y2 )| ≤ L|y1 − y2 | (8.11)

for all x ∈ [a, b] and all y1 , y2 ∈ R (“Lipschitz condition”). L is called the “Lipschitz constant”
of f . Then for every x0 ∈ [a, b] and every y0 ∈ R there exists exactly one solution of the
initial value problem (8.1),(8.2).

This theorem has a geometric interpretation illustrated in Figure 8.1: if at every point on
the curve we can draw a cone (in gray) with some (finite) slope L, such that the curve does
not intersect the cone, then the function is Lipschitz continuous. The function sketched in
the figure is clearly Lipschitz continuous.
Condition (8.11) is fulfilled if the partial derivative fy := ∂f
∂y exists on the strip G and is
continuous and bounded there. Then we can choose

L = max fy (x, y).
(x,y)∈G

In most applications f is continuous on G, but the partial derivative fy is often unbounded

on G. Then, the initial value problem (8.1),(8.2) is still solvable, but the solution may only
be defined in a certain neighborhood of the initial point x0 and not on all of [x0 , xn ].
Most initial value problems for ordinary differential equations cannot be solved ana-
lytically. Instead, numerical methods have to be used which provide to certain abscissae
(“nodes”) xi , i = 0, . . . , n approximate values η(xi ) for the exact values y(xi ). The abscissae
122 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

xi are often equidistant, i.e. xi = x0 + i h, where h is the step size. We will discuss various
numerical methods and will examine whether and how fast η(x) converges to y(x) as h → 0.
We will intensively make use of the results of Chapter 1 and Chapter 4.

8.2 Basic concepts and classification

Let us assume that n + 1 nodes subdivide the integration interval I = [x0 , xn ] into equally
large subintervals, i.e.
xi = x0 + i h, i = 0, . . . , n, (8.12)
with the fixed step size h = xn −x n . By formal integration of (8.1) over the interval
0

[xi , xi+1 ], i = 0, . . . , n − 1, we obtain

Z xi+1
y(xi+1 ) = y(xi ) + f (t, y(t)) dt, i = 0, . . . , n − 1. (8.13)
xi

The numerical methods for solving the initial value problem (8.1), (8.2) differ in what quadra-
ture formula Q(f, y; xi , xi+1 ) is used to compute the integral on the right-hand side of equa-
tion (8.13):

y(xi+1 ) = y(xi ) + Q(f, y; xi , xi+1 ) + E(f, y; xi , xi+1 ), i = 0, . . . , n − 1. (8.14)

Approximations η̄(xi+1 ) of y(xi+1 ), i = 0, . . . , n−1 can be obtained by neglecting the quadra-

ture error E(f, y; xi , xi+1 ), yielding an approximation

η̄(xi+1 ) = y(xi ) + Q(f, y; xi , xi+1 ), i = 0, . . . , n − 1. (8.15)

The difference y(xi+1 ) − η̄(xi+1 ) is then equal to the quadrature error E(f, y; xi , xi+1 ). In
the context of ODE it is called local discretization error or local truncation error of step i + 1
and denoted by i+1 :

i+1 = E(f, y; xi , xi+1 ) = y(xi+1 ) − η̄(xi+1 ), i = 0, . . . , n − 1. (8.16)

For applications it is important how fast the local discretization error decreases as h → 0.
This is expressed by the order of the local discretization error. It is the largest number p
such that
i+1
lim p < ∞. (8.17)
h→0 h

We simply write i+1 = O(hp ) which means nothing else but i+1 ≈ C hp with some C ∈ R.
Usually y(xi ) is not known, and (8.15) cannot be used directly. Therefore, an additional
approximation has to be introduced, for instance, by replacing the terms with y on the
right-hand side of (8.15) by the corresponding approximations η, which yields:

η(xi+1 ) = η(xi ) + Q(f, η; xi , xi+1 ), i = 0, . . . , n − 1, (8.18)

8.2. Basic concepts and classification 123

where y(x0 ) = η(x0 ) = y0 . To keep notation simple we will often write yj , ηj instead of
y(xj ), η(xj ).
Mostly we want to know how large the difference between yi+1 and its approximation
ηi+1 is. This question is not answered by the local discretization error because it only tells
us how good the numerical integration has been performed in the corresponding step of the
algorithm; the numerical integration errors of the previous steps and also the errors introduced
by replacing terms with y by the approximations η on the right-hand side of (8.15) are not
taken into account. The difference yi+1 −ηi+1 is called global discretization error or sometimes
global truncation error at xi+1 , denoted by ei+1 :

ei+1 := yi+1 − ηi+1 , i = 0, 2, . . . , n − 1. (8.19)

The order of the global discretization error is defined in the same way as the order of the
local discretization error. It can be shown that the global order is always one less than the
local order. It can be estimated using approximations with different step size: If ηhj (xi ) is
the approximation to y(xi ) obtained with a method of global order O(hpj ), and step size hj ,
j = 1, 2, we can proof that

ηh1 (xi ) − ηh2 (xi )

ei,h1 = y(xi ) − ηh1 (xi ) ≈ p = ēi,h1 .
1 − hh12

This estimate of the global discretization error can be used to obtain an improved approxi-
mation η̄h1 (xi ):
p
h2
h1 ηh1 (xi ) − ηh2 (xi )
η̄h1 (xi ) = ηh1 (xi ) + ēi,h1 = p .
h2
h1 −1

The approximation η̄h1 (xi ) has a global order of at least O(hp+1

1 ), i.e. the order of the global
discretization error is at least one higher than for the original approximation ηh1 (xi ).
For the special case that h2 = 2h1 = 2h, we obtain

ηh (xi ) − η2h (xi )

ēi,h (xi ) = p ,
1 − 12

and
2p ηh (xi ) − η2h (xi )
η̄h (xi ) = .
2p − 1
In addition to discretization errors, there are rounding errors which have to be taken into
account. Rounding errors are unavoidable since all numerical computations on a computer
are done in floating-point arithmetic with a limited number of decimal digits. They behave
like
ri = O(h−q ),
124 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

with some q > 0, where the error constant depends on the number of decimal digits in the
floating-point arithmetic. Note that the rounding error increases with decreasing step size h,
while the discretization error decreases with decreasing h. Therefore, the step size h has to be
chosen carefully. Investigations have shown that usually the choice hL = 0.05, . . . , 0.20 yields
good results, where L denotes the Lipschitz constant of f (cf. (8.11)). For more information
about rounding errors we refere to Stoer and Bulirsch (1993).
In this chapter we will focus on single-step methods and multistep methods. Single-step
methods are of the form

ηi+1 = ηi + h · Φ(ηi+1 , ηi , xi , h), i = 0, 1, . . . , n − 1, (8.20)

where the function Φ can be linear or non-linear in its arguments. If Φ does not depend on
ηi+1 , the method is called explicit, otherwise implicit. Characteristic for single-step methods
is that they use only one previously calculated approximation ηi to get an approximation
ηi+1 . Multistep methods (i.e., s-step methods with s > 1) make use of s + 1, s ≥ 1 previously
calculated approximations ηi−s , ηi−s+1 , . . . , ηi−1 , ηi to calculate ηi+1 . The linear multistep
methods have the form

s
X s
X
ηi+1 = as−k ηi+1−k + h bs−k fi+1−k , i = 0, 1, . . . , n − 1, (8.21)
k=1 k=0

with some real coefficients aj , bj . If bs = 0 they are called explicit, otherwise implicit.
A third group of numerical method are so-called extrapolation methods, which are similar
to Romberg integration methods; they are beyond the scope of this lecture. The reader is
referred to Stoer and Bulirsch (1993).
Among the single-step methods we will discuss the methods of Euler-Cauchy, the method
of Heun, and the classical Runge-Kutta method. Among the multistep methods we will dis-
cuss the method of Adams-Bashforth and the method of Adams-Moulton. In addition we
will discuss a subclass of single-step and multistep methods, the so-called predictor-corrector
(0)
methods. These are methods which first determine an approximation ηi+1 using a single-step
or a multistep method; the corresponding formula is called predictor. Then, the approxima-
(0)
tion ηi+1 is improved using another single-step or a multistep method (the so-called “correc-
(1) (2) 0 (k )
tor”), yielding approximations ηi+1 , ηi+1 , . . ., ηi+1 . For instance, the methods of Heun and
Adams-Moulton belong to the class of predictor-corrector methods.
What method is used to solve a given initial value problem depends on several factors,
among them are the required accuracy, the computer time and memory needed, the flexibility
w.r.t. the step size h, and the number of function evaluations. We will give some hints for
practical applications.
8.3. Single-step methods 125

8.3 Single-step methods

8.3.1 The methods of Euler-Cauchy
We start from equation (8.13) with i = 0 and use a closed Newton-Cotes formula with one
node, namely at x0 , (see Chapter 4) to calculate the integral on the right-hand side of (8.13).
We obtain:
y1 = y0 + h f (x0 , y(x0 )) + E(f, y; x0 , x1 ). (8.22)
The local discretization error of the first Euler-Cauchy step is simply (see Chapter 4):
1 1
EC
1 = h2 f 0 (ξ0 , y(ξ0 )) = h2 y 00 (ξ0 ), ξ0 ∈ [x0 , x1 ].
2 2
Assuming that f (x, y(x)) is differentiable w.r.t. x, we obtain

y1 = y0 + h f (x0 , y0 ) + EC
1 = η1 + EC
1 ,

with
η1 := y0 + h f (x0 , y0 ).
When integrating over [x1 , x2 ] we obtain correspondingly
Z x2
y2 = y1 + f (t, y(t)) dt = y1 + h f (x1 , y1 ) + EC
2 .
x1

Since y1 is unknown, we use instead its approximation η1 to obtain the approximation η2 :

η2 = η1 + h f (x1 , η1 ).

The local discretization error of the second Euler-Cauchy step is:

1
EC
2 = h2 y 00 (ξ1 ), ξ1 ∈ [x1 , x2 ].
2
Continuing in the same way, we obtain, when using xi as node of the one point Newton-Cotes
formula over [xi , xi+1 ]:

ηi+1 = ηi + h f (xi , ηi ),
1 2 00 (8.23)
EC
i+1 = h y (ξi ), ξi ∈ [xi , xi+1 ], i = 0, . . . , n − 1.
2
The scheme (8.23) is called forward Euler-Cauchy method. The local discretization error has
order 2, and the global discretization error has order 1. The latter means that if the step
size in the Euler-Cauchy method is reduced by a factor of 21 , we can expect that the global
discretization error will be reduced by a factor of 12 . Thus, the forward Euler-Cauchy method
converges for h → 0, i.e.,
lim (y(x) − η(x)) = 0.
h→0
126 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

Figure 8.2: Solution space for the ODE y 0 = f (t, y). Exact solutions with various initial
conditions are plotted (black lines). The progress of 4 steps of forward Euler-
Cauchy is shown (red arrows).

Graphically we can plot the path of forward Euler-Cauchy in the solution space, see
Figure 8.2. The black lines in this figure correspond to exact solutions of the ODE with
different initial conditions. The thick black line is the exact solution for the initial condition
we are interested in, ỹ. Using the ODE y 0 = f (x, y) the exact derivative of the solution y can
be computed at every point in this space. Forward Euler starts from the initial condition, and
takes a linear step using the initial derivative. Because the exact solution is curved, forward
Euler lands a little to the side of the exact solution. The next step proceeds from this new
point. Over a number of steps these errors compound.
Example 8.2 (Forward Euler-Cauchy)
Given the initial value problem y 0 = y, y(0) = 1. We seek an approximation to y(0.5) using
the forward Euler-Cauchy formula with step size h = 0.1. Compare the approximations with
the true solution at the nodes.
Solution: With f (x, y(x)) = y, we obtain the recurrence formula

ηi+1 = (1 + h)ηi , i = 0, . . . , 5.

The nodes are xi = x0 + ih = 0.1 i, i = 0, . . . , 5. With h = 0.1 and retaining four decimal
places, we obtain
8.3. Single-step methods 127

i 0 1 2 3 4 5
xi 0 0.1 0.2 0.3 0.4 0.5
ηi 1 1.1 1.21 1.331 1.4641 1.61051
yi 1 1.1052 1.2214 1.3499 1.4918 1.6487
|yi − ηi | 0 5.3(-3) 1.1(-2) 1.9 (-2) 2.8(-2) 3.8 (-2)

The exact solution of the initial value problem is y(x) = ex , the correct value at x = 0.5
is y(0.5) = 1.64872. Thus, the absolute error is eEC (0.5) = 3.8(−2). A smaller step size
yields higher accuracies. Taking e.g. h = 0.005, we obtain η(0.5) = 1.6467, which is correct
to 2.1(−3).

Example 8.3 (Euler-Cauchy: Global discretization error and step size)

Consider the initial value problem y 0 = x−y2 , y(0) = 1 over the interval [0, 3] using the forward
Euler-Cauchy method with step sizes h = 1, 12 , . . . , 64
1
and calculate the global discretization
− x2
error. The exact solution is y(x) = 3e − 2 + x. The results clearly show that the error in

step size h number of steps n η(3) eEC (0.3) O(h) ≈ Ch, C = 0.256
1 3 1.375 0.294390 0.256
1
2 6 1.533936 0.135454 0.128
1
4 12 1.604252 0.065138 0.064
1
8 24 1.637429 0.031961 0.032
1
16 48 1.653557 0.015833 0.016
1
32 96 1.661510 0.007880 0.008
1
64 192 1.665459 0.003931 0.004

1
the approximation of y(3) descreases by about 2 when the step size h is reduced by 12 . The
error constant can be estimated to C ≈ 0.256.

Instead of using the node xi we may use xi+1 yielding the so-called backward Euler-Cauchy
formula:
ηi+1 = ηi + h f (xi+1 , ηi+1 ), i = 0, 1, . . . , n − 1. (8.24)
Obviously it is an implicit formula since ηi+1 also appears on the right-hand side of the
equation, in contrast to the explicit forward Euler-Cauchy formula (8.23). For general f we
have to solve for the solution ηi+1 by iteration. Since equation (8.24) is of the form

ηi+1 = ϕ(ηi+1 ),

we can do that with the aid of the fixed-point iteration of Chapter 1. That means we start
(0) (0)
with an approximation ηi+1 , e.g. ηi+1 = ηi + h f (xi , ηi ), and solve the equation iteratively:

(k+1) (k)
ηi+1 = ϕ ηi+1 , k = 0, 1, . . . , k0 .
128 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

It can be shown that under the conditions of Theorem 8.1 and for hL = 0.05, . . . , 0.20 the
iteration converges. Mostly, one iteration step is already sufficient.

Exercise 8.4 (Backward Euler-Cauchy)

Given the initial value problem y 0 = y, y(0) = 1. Compute an approximation η(0.5) to y(0.5)
using the backward Euler-Cauchy method with step size h = 0.1. Compare the results with
the forward Euler-Cauchy formula w.r.t. accuracy. Calculate the absolute error for each step.

8.3.2 The method of Heun

The Euler-Cauchy method makes use of the most simple quadrature formula, namely the
closed one point Newton-Cotes formula with the node at the lower or upper bound of the
integration domain. Higher-order methods can easily be obtained by chosing more accurate
quadrature formulas. For instance, when using the trapezoidal rule (see Chapter 4), we obtain

h
yi+1 = yi + f (xi , yi ) + f (xi+1 , yi+1 ) + H
i+1 ,
2

and, correspondingly,

h
ηi+1 = ηi + f (xi , ηi ) + f (xi+1 , ηi+1 ) . (8.25)
2

Equation (8.25) defines an implicit single-step method which is called the trapezoidal rule.
The local discretization error is
h3 00
H
i+1 = − f (ξi , y(ξi )), ξi ∈ [xi , xi+1 ], i = 0, . . . , n − 1. (8.26)
12
The right-hand side of (8.25) still contains the unknown value ηi+1 . We can solve it using
the fixed-point iteration:

(k) h (k−1)

ηi+1 = ηi + f (xi , ηi ) + f (xi+1 , ηi+1 ) , k = 1, 2, . . . , k0 . (8.27)
2
As starting value for the iteration we use the approximation of the forward Euler-Cauchy
method:
(0)
ηi+1 = ηi + h f (xi , ηi ), i = 0, 1, . . . , n − 1, (8.28)
Equations (8.27),(8.28) define a predictor-corrector method, which is called Method of Heun.
The forward Euler-Cauchy formula (8.28) is used as predictor ; it yields a first approximation
(0)
ηi+1 . The trapezoidal rule is used to correct the first approximation and, therefore, is called
corrector (equation (8.27)). A graphical summary is shown in Figure 8.3.
It can be shown that the corrector converges if hL < 1, where L is the Lipschitz constant
of the function f (cf. (8.11)). As already mentioned, in practical applications the step size is
chosen such that hL ≈ 0.05, . . . , 0.20.
8.3. Single-step methods 129

Figure 8.3: Graphical representation of the predictor-corrector in the solution space, with
exact solutions plotted (block lines). A prediction is made using the forward
Euler-Cauchy formula (thin red line). The solution gradient at the prediction
is evaluated (thick red line). This gradient is averaged with the initial solution
gradient, and on this basis a new correction step is made (blue line). The final
result is much closer to the truth than the predictor.
130 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

The local discretization error of the Heun method is identical to the quadrature error
of the trapezoidal rule, and given by equation (8.26). Since we do not know exactly the
H := η (k0 )
approximation ηi+1 , an additional iteration error, δi+1 i+1 − ηi+1 , is introduced, and
the total local discretization error becomes H H
i+1 + δi+1 . It can be shown that the total
local discretization error is still of the order O(h3 ), if hL ≈ 0.05, . . . , 0.20 is chosen, even
with k0 = 0. Thus, the global discretization error is of the order O(h2 ) (see e.g. Stoer and
Bulirsch (1993)). The latter means that when reducing the step size by a factor 21 , the global
discretization error will be reduced by a factor of about 14 .

Example 8.5 (Method of Heun)

Given the initial value problem y 0 = y, y(0) = 1. We seek an approximation to y(0.5) using
the method of Heun with step size h = 0.1. Use two iterations to get ηi+1 . Compare the
approximations with the true solution at the nodes.
Solution: With f (x, y(x)) = y, we obtain the recurrence formula
(0)
predictor: ηi+1 = (1 + h)ηi , i = 0, . . . , 5,

(k+1) h (k)
corrector: ηi+1 = ηi + ηi + ηi+1 , i = 0, . . . , 5, k = 0, 1.
2

The nodes are xi = x0 + ih = 0.1 i, i = 0, . . . , 5. With h = 0.1 and retaining five decimal
places, we obtain the results shown in the next table. The absolute error is 5.9 × 10−4 ,

i 0 1 2 3 4 5
xi 0 0.1 0.2 0.3 0.4 0.5
ηi 1 1.10525 1.22158 1.35015 1.49225 1.64931
(0)
ηi+1 1.1 1.21578 1.34374 1.48517 1.64148
(1)
ηi+1 1.105 1.22130 1.34985 1.49192 1.64894
(2)
ηi+1 1.10525 1.22158 1.35015 1.49225 1.64931
yi 1 1.1052 1.2214 1.3499 1.4918 1.6487
|yi − ηi | 0 0.8 × 10−5 1.8 × 10−4 2.9 × 10−4 4.3 × 10−4 5.9 × 10−4

thus the approximation is about 2 decimals more accuracte than when using the forward
Euler-Cauchy method.

Example 8.6 (Heun: Global discretization error and step size)

Consider the initial value problem y 0 = x−y 2 , y(0) = 1 over the interval [0, 3] using the method
of Heun with step sizes h = 1, 12 , . . . , 64
1
and calculate the global discretization error. The
− x2
exact solution is y(x) = 3e − 2 + x.

Exercise 8.7
We seek an error estimate for η(0.2) obtained by the method of Heun. We know ηh (0.2) =
1.2217. Repeat the calculation with step size 2h = h̃ = 0.2. Estimate the global discretization
8.3. Single-step methods 131

step size h number of steps n η(3) eH (0.3) O(h) ≈ Ch2 , C = −0.0432

1 3 1.732422 −6.30 × 10−2 −4.32 × 10−2
1
2 6 1.682121 −1.27 × 10−2 −1.08 × 10−2
1
4 12 1.672269 −2.88 × 10−3 −2.70 × 10−3
1
8 24 1.670076 −6.86 × 10−4 −6.75 × 10−4
1
16 48 1.669558 −1.68 × 10−4 −1.69 × 10−4
1
32 96 1.669432 −4.20 × 10−5 −4.20 × 10−4
1
64 192 1.669401 −1.10 × 10−5 −1.10 × 10−5

error eH
h (0.2) and calculate an improved approximation η̄h (0.2). Compare it with the exact
solution.

8.3.3 Classical Runge-Kutta method

When we want to improve the accuracy of the methods of Euler-Cauchy and Heun we can
take smaller step sizes h and at the same time try to control the rounding errors. However,
there are methods which have global discretization errors of orders higher than 1 or 2. Thus,
they converge much faster if h → 0 and usually provide higher accuracies for a given h. One
class of methods are the so-called Runge-Kutta methods. Since the construction of Runge-
Kutta formulas is very difficult we restrict to the classical Runge-Kutta method, an explicit
method with a local discretization error of order O(h5 ) and a global discretization error of
order O(h4 ). It makes use of the “ansatz”
4
X
ηi+1 = ηi + wj · kj,i
j=1

tj − xi
kj,i = h · f tj , ηi + · kj−1,i , j = 1, . . . 4, k0,i = 0.
h

The nodes tj and weights wj are determined such that ηi+1 agrees with the Taylor series
expansion of yi+1 at xi up to terms of order O(h4 ). This yields a linear system of equations
which has two equations less than the number of unknown parameters; so two parameters may
be chosen arbitrarily. They are fixed such that the resulting formula has certain symmetry
properties w.r.t. the nodes and the weights. The result is
1 1 1 1
w1 = , w2 = , w3 = , w4 = ,
6 3 3 6
h h
t 1 = xi , t2 = xi + , t3 = xi + , t4 = xi + h = xi+1 .
2 2
Therefore, the classical Runge-Kutta formula to perform the step from xi to xi+1 is given by
1
ηi+1 = ηi + (k1,i + 2k2,i + 2k3,i + k4,i ) , i = 0, . . . , n − 1, η(x0 ) = y(x0 ) = y0 ,
6
132 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

with

k1,i = h · f (xi , ηi ),
h k1,i
k2,i = h · f (xi + , ηi + ),
2 2
h k2,i
k3,i = h · f (xi + , ηi + ),
2 2
k4,i = h · f (xi + h, ηi + k3,i ).

Obviously, per step four evaluations of the function f are required. That is the price we
have to pay for the favorable discretization error. It may be considerable if the function f is
complicated.
It is very difficult to specify the optimal choice of the step size h. Mostly, hL ≈ 0.05 . . . 0.20
will give good accuracies. After each step it is possible to check whether hL is still in the
given range: we use the estimate

k2,i − k3,i
hL ≈ 2 .
k1,i − k2,i

This enables to vary the step size h during the calculations. It can be shown that, when
f does not depend on y, the classical Runge-Kutta formula corresponds to using Simpson’s
3/8-rule (cf. Chapter 4) for evaluating the integral on the right-hand side of (8.13).
Higher-order Runge-Kutta formulas have also been derived, e.g. the Runge-Kutta-Butcher
formula (m = 6) and the Runge-Kutta-Shanks formula (m = 8). The general structure of all
Runge-Kutta formulas is
m
X
ηi+1 = ηi + wj · kj,i
j=1
m
!
X
kj,i = h · f tj , ηi + aj,n · kn,i , j = 1, . . . , m
n=1

If aj,n = 0 for n ≥ j, we obtain an explicit Runge-Kutta formula, otherwise the formula

is implicit. An example for an implicit Runge-Kutta formula is the classical Runge-Kutta
formula discussed before with m = 4 function evaluations per step. With an explicit Runge-
Kutta formula we obtain a local discretization error of at most order O(hm+1 ) for m ≤ 4
whereas for m > 4 the best we can get is O(hm ). The optimal local discretization error
for implicite Runge-Kutta formulas is O(h2m+1 ) if the m nodes of the corresponding Gauss-
Legendre quadrature formula are used. The corresponding formulas are sometimes called
implicit Runge-Kutta formulas of Gauss-type. They are well-suited if high accuracies are
required and if the integration interval is large. For more details see e.g. Engeln-Müllges and
Reutter (1985).
8.3. Single-step methods 133

Example 8.8 (Classical Runge-Kutta method)

Given the initial value problem y 0 = y, y(0) = 1. We seek an approximation to y(0.2) using
the classical Runge-Kutta method with step size h = 0.1. Compare the approximations with
the true solution at the nodes.
Solution: With f (x, y(x)) = y, we obtain the recurrence formula

1
ηi+1 = ηi + k1,i + 2k2,i + 2k3,i + k4,i ,
6
with

k1,i k2,i
k1,i = hηi , k2,i = h ηi + , k3,i = h ηi + , k4,i = h (ηi + k3,i ) , i = 0, 1.
2 2
The nodes are xi = x0 + ih = 0.1 i, i = 0, . . . , 5. With h = 0.1 and retaining five decimal
places, we obtain the results shown in the next table. It is η(0.2) = 1.22140, and since

i ti ȳi f (ti , ȳi ) kj,i

0 0 1.00000 1.00000 0.100000
0.05 1.05000 1.05000 0.105000
0.05 1.05250 1.05250 0.110525
0.1 1.10525 1.10525 0.110525
0.1 1.10517
1 0.1 1.10517 1.10517 0.110517
0.15 1.16043 1.16043 0.116043
0.15 1.16319 1.16319 0.116319
0.2 1.22149 1.22149 0.122149
0.2 1.22140

e0.2 = 1.22140, the approximation agrees with the exact solution up to 5 decimal places.
Example 8.9 (Classical Runge-Kutta: Global discretization error and step size)
Consider the initial value problem y 0 = x−y
2 , y(0) = 1 over the interval [0, 3] using the classical
Runge-Kutta method with step sizes h = 1, 21 , . . . , 18 and calculate the global discretization
x
error. The exact solution is y(x) = 3e− 2 − 2 + x. The results are shown in the next table.

step size h number of steps n η(3) eH (0.3) O(h) ≈ Ch4 , C = −6.14 × 10−4
1 3 1.670186 −7.96 × 10−4 −6.14 × 10−4
1
2 6 1.6694308 −4.03 × 10−5 −3.84 × 10−5
1
4 12 1.6693928 −0.23 × 10−5 −0.24 × 10−5
1
8 24 1.6693906 −0.01 × 10−5 −0.01 × 10−5

Since the global discretization error is of the order O(h4 ) we can expect that when reducing
the step size by a factor of 21 , the error will reduce by about 16
1
.
134 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

Exercise 8.10
Solve the equation y 0 = −2x − y, y(0) = −1 with step size h = 0.1 at x = 0.1, . . . , 0.5
using the classical Runge-Kutta method. The analytical solution is y(x) = −3e−x − 2x + 2.
Calculate the global discretization error at each node.

8.4 Multistep methods

The methods of Euler-Cauchy, Heun, and Runge-Kutta are single-step methods, because
they use only the information from the previous node to compute an approximation at the
next node. That is, only ηi is used to compute ηi+1 . Multistep methods for the solution of
the initial value problem (8.1),(8.2) use, in order to compute an approximate value ηi+1 ,
s + 1, s ∈ N, previously computed approximations ηi−s , ηi−s+1 , . . . , ηi−1 , ηi at equidistant
points. To initialize such methods, it is necessary that s starting values η0 , η1 , . . . , ηs−1 are
at our disposal. s is the step number of the multistep method. Since η0 = y0 is given,
the remaining approximations η1 , . . . , ηs−1 must be obtained by other means, e.g. by using
single-step methods.
In the following we assume that the starting values (xi , f (xi , ηi )), i = −s, −s + 1, . . . , 0
are given. We will often write fi instead of f (xi , ηi ). One class of multistep methods can be
derived from the formula
Z xi+1
yi+1 = yi + f (t, y(t)) dt, i = 0, . . . , n − 1, (8.29)
xi

which is obtained by formally integrating the ODE y 0 = f (x, y(x)). To evaluate the integral
on the right-hand side we replace f by the interpolating polynomial Ps of degree s through
the past s + 1 points (xi−s+j , fi−s+j ), j = 0, . . . , s:
s
X
Ps (x) = fi−s+j · lj (x),
j=0

with the Lagrange basis functions

s
Y x − xi−s+k
lj (x) = .
xi−s+j − xi−s+k
k=0
k6=j

The approximation ηi+1 is then obtained from

Z xi+1 s
X Z xi+1
ηi+1 = ηi + Ps (t) dt = ηi + fi−s+j lj (x) dx,
xi j=0 xi

where the integral is calculated exactly. Note that [xi−s , xi ] is the interpolation interval to
determine Ps , but Ps is used to integrate over [xi , xi+1 ]. That means we extrapolate! The
8.4. Multistep methods 135

result is an explicit formula which yields the approximation ηi+1 :

s
X
ηi+1 = ηi + wj fi−s+j ,
j=0

with the weights Z xi+1

wj := lj (x) dx.
xi
A well-known example of an explicit multistep method is the method of Adams-Bashforth
with s = 3. The corresponding formula is
h
ηi+1 = ηi + (55fi − 59fi−1 + 37fi−2 − 9fi−3 ) , i = 0, . . . , n − 1.
24
The local order of the method is O(h5 ), the global order is O(h4 ). To initialize this formula we
need s+1 = 4 starting values (xj , fj ), j = 0, . . . , 3. They have to be computed using a method
with the same local order, e.g. the classical Runge-Kutta formula discussed in the previous
Section. Compared with the latter, the Adams-Bashforth formula has the advantage that per
step only one evaluation of f is necessary whereas the classical Runge-Kutta formula requires
four function evaluations per step. Thus, the Adams-Bashforth formula is much faster than
the classical Runge-Kutta formula although both have the same order.
Example 8.11 (Adams-Bashforth method)
Given the initial value problem y 0 = y, y(0) = 1. Choose step size h = 0.1 and determine
an approximation η(0.4) to y(0.4) using the Adams-Bashforth method. Initialize the method
using the classical Runge-Kutta formula. Compare the solution with the exact solution y(0.4).
Calculate with 9 decimal digits.
Solution: η0 , . . . , η3 are calculated using the classical Runge-Kutta formula. Then, the
Adams-Bashforth formula has been initialized and η4 can be calculated.

step i node xi ηi fi
0 0 1.00000000 1.00000000
1 0.1 1.10517092 1.10517092
2 0.2 1.22140276 1.22140276
3 0.3 1.34985881 1.34985881
4 0.4 1.49182046

The exact solution is y(0.4) = e0.4 = 1.49182470, i.e. the absolute global discretization
error at x = 0.4 is 4.24 × 10−6 . Computing η4 with the classical Runge-Kutta formula,
we obtain 1.491824586, which has an absolute error of 1.14 × 10−7 . That is, in this case
the classical Runge-Kutta method, which has the same local order O(h5 ) than the Adams-
Bashforth method yields a higher accuracy. The reason is that the Adams-Bashforth method
is based on extrapolation which yields larger local errors especially when the step size h is
large.
136 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

In the same way we can also derive an implicit multistep formula if the node xi+1 is used to
construct the interpolating polynomial Ps instead of the node xi−s . In that case, f (x, y(x))
is approximated by the interpolating polynomial Ps of degree s through the s + 1 points
(xi−s+j , fi−s+j ), j = 1, . . . , s + 1. Then, we avoid to extrapolate but the price we have to
pay is that ηi+1 also appears on the right-hand side, so we have to iterate in order to solve
the equation. One example of an implicit multistep formula is the Adams-Moulton formula
with s = 4:
h
ηi+1 = ηi + (251fi+1 + 646fi − 264fi−1 + 106fi−2 − 19fi−3 ) .
720
The fixed-point iteration yields

(k+1) h (k)

ηi+1 = ηi + 251f (xi+1 , ηi+1 ) + 646fi − 264fi−1 + 106fi−2 − 19fi−3 , k = 0, . . . , k0 .
720
The iteration converges if hL < 360
251 . However, when continuing the iteration on the corrector,
the result will converge to a fixed point of the Adams-Moulton formula rather than to the
ordinary differential equation. Therefore, if a higher accuracy is needed it is more efficient to
reduce the step size. In practical applications, mostly hL ≈ 0.05 . . . 0.20 is chosen and two
iteration steps are sufficient. The Adams-Moulton formula has the local order O(h6 ) and the
global order O(h5 ).

Example 8.12 (Adams-Moulton method)

Given the initial value problem y 0 = y, y(0) = 1. Choose step size h = 0.1 and determine
an approximation η(0.4) to y(0.4) using the Adams-Bashforth method. Initialize the method
using the classical Runge-Kutta formula. Compare the solution with the exact solution y(0.4).
Calculate with 9 decimal digits

step i node xi ηi fi
0 0 1.00000000 1.00000000
1 0.1 1.10517092 1.10517092
2 0.2 1.22140276 1.22140276
3 0.3 1.34985881 1.34985881
4 (A-B) 0.4 1.49182046 1.49182046
4 (A-M) 0.4 1.49182458
1.49182472

The iteration even stops after k0 = 1. The exact solution is y(0.4) = e0.4 = 1.49182470,
i.e. the absolute global error at x = 0.4 is 0.2 × 10−7 . Thus, Adams-Moulton yields in this
case a higher accuracy than the classical Runge-Kutta and the Adams-Bashforth method.

Explicit multistep methods such as the Adams-Bashforth have the disadvantage that due
to extrapolation the numerical integration error may become large, especially for large step
8.5. Stability and convergence 137

size h. Therefore, such formulas should only be used as predictor and should afterwards be
corrected by an implicit multistep formula which is used as corrector. For instance, when
using Adams-Bashforth as predictor (local order O(h5 )) and Adams-Moulton as corrector
(local order O(h6 )), we obtain to perform the step from xi to xi+1 :
(0)
(a) Compute ηi+1 using the Adams-Bashforth formula,
(0)
(b) Compute f (xi+1 , ηi+1 ),
(k+1)
(c) Compute ηi+1 for k = 0, . . . , k0 using the Adams-Moulton formula.

Other multistep methods can easily be constructed: in equation (8.29) we replace f (x, y(x))
by the interpolation polynomial through the data points (xi−s+j , fi−s+j ), j = 0, . . . , s, and
integrate over [xi−r , xi+1 ] with 0 ≤ r ≤ s. If r = 0 we obtain the Adams-Bashforth formula.
More examples and a detailed error analysis of multistep methods is given in Stoer and
Bulirsch (1993).
For every predictor-corrector method, where the predictor is of order O(hp ) and the
corrector is of order O(hc ), the local discretization error after k + 1 iteration steps is of order
O(hmin(c,p+k+1) ). Therefore, if p = c − 1 one iteration is sufficient to obtain the order of the
corrector. For arbitrary p < c, the order O(hc ) is obtained after k + 1 = c − p iteration steps.
However, since the error constant of the predictor are larger than the error constant of the
corrector, it may happen that some more iterations are needed to guarantee the order of the
corrector. Therefore, in practical applications the order of the corrector is chosen one higher
than the order of the predictor, and one iteration step is performed.
Numerous modifications of the discussed algorithms have been developed so far. One
example is the extrapolation method of Bulirsch and Stoer, which is one of the best algorithms
at all w.r.t. accuracy and stability. It is widely used in planetology and satellite geodesy to
compute long orbits. In fact it is a predictor-corrector method with a multistep method as
predictor. By repeating the predictor for different choices of h a series of approximations is
constructed, and the final approximation is obtained by extrapolation. For more information
we refer to Engeln-Müllges and Reutter (1985); Stoer and Bulirsch (1993).

8.5 Stability and convergence

The numerical solution of ordinary differential equations provides to a given set of nodes xi
approximations η(xi ) to the unknown solutions y(xi ). For all discussed methods we can show
that
lim |y(xi ) − η(xi )| = 0, xi ∈ [x0 , xn ],
h→0
if certain conditions are fulfiled. That means the approximations η(xi ) converge for h → 0
towards the exact values y(xi ), assuming that no rounding errors occur. However, in practical
applications it is important to know how discretization errors and rounding errors propagate,
138 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

i.e. how stable the algorithm is. An algorithm is called stable if an error, which is tolerated
in one step, is not amplified when performing the next steps. It is called instable if for an
arbitrary large number of steps the difference between approximation and unknown solution
continuously increases, yielding a totally wrong solution when approaching the upper bound
of the integration interval. Responsible for instability can be the ordinary differential equation
and/or the used algorithm.

8.5.1 Stability of the ordinary differential equation

Let y denote the solution of the initial value problem

y 0 = f (x, y), y(x0 ) = y0 .

Let u be a solution of the same ordinary differential equation (ODE), i.e. u0 = f (x, u) but let
u fulfil a slightly different initial condition u(x0 ) = u0 . If u differs only slightly from y, we
may write
u(x) = y(x) + s(x), u(x0 ) = u0 = y0 + s0 ,
where s is the so-called disturbing function and a small parameter, i.e. 0 < << 1.
Therefore, u fulfils the ODE u0 (x) = y 0 (x) + s0 (x). Taylor series expansion of f (x, u) yields

f (x, u) = f (x, y + s) = f (x, y) + s fy + O(2 ),

and, when neglecting terms of order O(2 ), we obtain the so-called differential variational
equation
s0 = fy s.
Assuming fy = c = constant, this equation has the solution

s(x) = s0 ec(x−x0 ) , s(x0 ) = s0 .

This equation describes the disturbance at x due to a perturbed initial condition. If fy = c < 0
the disturbance descreases with increasing x, and the ODE is called stable, otherwise we call
it unstable. In case of a stable ODE, the solutions w.r.t. the perturbed initial condition
will approach the solutions w.r.t. the unperturbed initial condition. On the other hand,
if the initial value problem is unstable, then, as x increases, the solution that starts with
the perturbed value y0 + s0 will diverge away from the solution that started at y0 . If the
slight difference in initial conditions is due to rounding errors or discretization errors stability
(instability) means that this error is damped (amplified) if the integration process continues.

Example 8.13
Let us consider the ODE y 0 = f (x, y) = λ y, λ ∈ R, and the two initial conditions y(0) = 1
and y(0) = 1 + with a small parameter 0 < << 1. Obviously f is differentiable w.r.t. y
and it is fy = λ. The solution of the unperturbed initial value problem is y(x) = eλx and
8.5. Stability and convergence 139

the solution of the perturbed initial value problem is u(x) = (1 + )eλx ; the difference is
(u − y)(x) = eλx . It is clear that when λ < 0, the inital error is strongly damped, while
when λ > 0 it is amplified for increasing values x.

If the ODE is stable, a stable numerical algorithm provides a stable solution. However,
if the ODE is unstable we can never find a stable numerical algorithm. Therefore, let us
assume in the following that the ODE is stable. Then it remains to address the problem of
stability of the numerical algorithm.

8.5.2 Stability of the numerical algorithm

Let us consider linear s-step methods, s ≥ 1, which have the general form
s
X s
X
ηi+1 = as−k ηi+1−k + h bs−k fi+1−k , i = 0, 1, . . . , n − 1. (8.30)
k=1 k=0

We consider the disturbed initial value of the algorithm, u0 = η0 +H0 , with a small parameter
0 < << 1. The original solution is denoted by ηi , the disturbed solution by ui . Let
δi := ui − ηi . The algorithm (8.30) is called stable if

lim |δi | = lim |ui − ηi | < ∞.

i→∞ i→∞

With ui = ηi + Hi , we easily can derive the difference equation for the disturbances δi ,
sometimes called difference variational equation. We obtain
s
X X s
δi+1 = as−k δi+1−k + h bs−k f (xi+1−k , ηi+1−k + δi+1−k ) − f (xi+1−k , ηi+1−k ) . (8.31)
k=1 k=0

It is very difficult to analyse this equation for general functions f . However, a consideration
of a certain simplified ODE is sufficient, to give an indication of the stability of an algorithm.
This simple ODE is derived from the general equation y 0 = f (x, y(x)) (a) by assuming that
f does not depend on x and (b) by restricting to a neighborhood of a value ȳ. Within this
neighborhood we can approximate y 0 = f (y) by its linear version y 0 = fy (ȳ) y =: c y with
c ∈ C. The corresponding initial value problem y 0 = c y, y(x̄) = ȳ has the exact solution
y(x) = ȳ · ec(x−x̄) . For our simple ODE, the difference variational equation (8.31) becomes
s
X s
X
δi+1 = as−k δi+1−k + hc bs−k δi+1−k . (8.32)
k=1 k=0

This is a homogeneous difference equation of order s with constant coefficients. Its solutions
can be looked for in the form δi = β i for all i. Substituting into (8.32) yields
s
X s
X
β i+1 = as−k β i+1−k + hc bs−k β i+1−k . (8.33)
k=1 k=0
140 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

Since β 6= 0, we can divide by β i+1−s and obtain

s
X s
X
s s−k
β = as−k β + hc bs−k β s−k , (8.34)
k=1 k=0

or
s
X s
X
s s−k
P (β) := β − as−k β − hc bs−k β s−k = 0. (8.35)
k=1 k=0

P (β) is called characteristic polynomial of the difference variational equation (8.31). It is a

polynomial of degree s. Assuming that its zeros β1 , . . . , βs are distinct, the general solution
is
s
X
δi = cj βji , for all i, (8.36)
j=1

with arbitrary constants cj . One of the zeros, say β1i , will tend to zero as h → 0. All
the other zeros are extraneous. If the extraneous zeros satisfy as h → 0 the condition
|βi | < 1, i = 2, . . . , s, then the algorithm is absolutely stable.
Since the zeros are continuous functions of hc, the stability condition |βj | < 1 defines a
region in the hc-plane, where the algorithm is stable. This region is called region of stability
(S), i.e.:
S = {z ∈ C : |β(z)| < 1}, z := hc ∈ C.
That means for any z = hc ∈ S the algorithm is stable.

Example 8.14 (Stability of the forward Euler-Cauchy formula)

Let us consider the initial value problem y 0 = −y 2 , y(0) = 1. The forward Euler-Cauchy
formula becomes ηi+1 = ηi + hηi . Comparing with (8.30) it is s = 1, a0 = 1, b0 = 1, and
b1 = 0. The difference variational equation becomes, observing that

f (xi , ηi + δi ) − f (xi , ηi ) = −2ηi δi + O(δi2 ) :

δi+1 = (1 − 2hηi ) δi .
That means, c = −2ηi . The characteristic polynomial is β = (1 − 2hηi ) and the stability
conditions is |1 − 2hηi | < 1. That is the region of stability is S = {z ∈ C : |1 + z| < 1},
where z := −2hηi . S is a circle with center at the point -1 on the real line and radius 1.
Assuming e.g. ηi > 0, we have stability of the forward Euler-Cauchy method if h < η1i . Since
the stability analysis is based on local linearization, we have to be careful in practise and
have to choose h smaller than the criterium says to be on the safe side.

Example 8.15 (Stability of the backward Euler-Cauchy formula)

Let us consider the initial value problem y 0 = −y 2 , y(0) = 1. The backward Euler-Cauchy
formula becomes ηi+1 = ηi + h fi+1 . Comparing this equation with (8.30), we observe that
s = 1, a0 = 1, b0 = 0, and b1 = 1. The difference variational equation is δi+1 = δi − 2hηi δi+1 ,
8.6. How to choose a suitable method? 141

i.e. c = −2ηi and z = hc = −2hηi . The characteristic polynomial is P (β) = 1 + z β and

stability of the backward Euler-Cauchy formula is obtained if |1 − z|−1 < 1. Assuming ηi > 0,
stability of the backward Euler-Cauchy formula is guaranteed if h > 0, i.e. for all h.

In a similar way we can determine the region of stability for other methods. We obtain
for instance:

β(z) = 1 + z (forward Euler-Cauchy)

1
β(z) = (backward Euler-Cauchy)
1−z
1 + z/2
β(z) = (trapezoidal rule)
1 − z/2
z2
β(z) = 1 + z + (Heun)
2
z2 z3 z4
β(z) = 1 + z + + + (classical Runge-Kutta)
2 6 24
The stability regions are shown in Figure 8.4 and Figure 8.5.

Example 8.16 (Stability region for the Heun method)

The Heun scheme is ηi+1 = ηi + h2 (fi + f (xi+1 , ηi + hfi )). To derive the stability region we
neglect the dependency of f on x and obtain after linearization around ηi :

hc
ηi+1 = ηi + h fi (1 + )
2

with c = (fy )i . Comparing with (8.30), it is s = 1, a0 = 1, b0 = 1 + hc 2 , and b1 = 0.

z2
Thus, the characteristic polynomial is P (β) = β − (1 + z + 2 ) and the stability region is
2
S = {z ∈ C : |1 + z + z2 | < 1}.

8.6 How to choose a suitable method?

Many methods to solve ordinary differential equations have been proposed so far. We could
only discuss some of them. None of the methods has such advantages that it can be preferred
over all others. On the other hand, the question which method should be used to solve a
specific problem still depends on a variety of factors that cannot be discussed in full detail.
Examples are the dimension of the system of ODE, the complexity of the function f and
the number of function evaluations, the required accuracy, the computer time and memory
needed, the flexibility w.r.t. variable step size h, and the stability properties of the initial
value problem.
For many applications the computational effort is important. He consists of three parts,
(a) the effort for evaluating f , (b) the effort in order to vary the step size, and (c) the effort
for all other operations. If f can easily be calculated (a) is not very expensive. Then, for
142 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

Figure 8.4: The stability region for (1) forward Euler-Cauchy, (2) Heun, (3) classical
Runge-Kutta, and (4) backward Euler-Cauchy. Within the curve (1), a cir-
cle centered at (−1, 0), we fulfill the condition |1 + z| < 1, which provides
stability of the forward Euler-Cauchy method. The region within curve (2),
defined by |1 + z + 21 z 2 | < 1, guarantees stability of the Heun method.
The classical Runge-Kutta method is stable within region (3), defined by
|1 + z + 12 z 2 + 16 z 3 + 24
1 4
z | < 1. The backward Euler-Cauchy method is stable
outside the circle (4), i.e. if |1 − z| > 1.
8.6. How to choose a suitable method? 143

1+z/2
Figure 8.5: Regions 1−z/2 = b for various values b. The stability region of the trapezoidal
rule is obtained if b = 1. Obviously the trapezoidal rule is stable for Re(z) < 0,
i.e. for the complete left half-plane.
144 Chapter 8. Numerical Methods for Solving Ordinary Differential Equations

moderate accuracies < 10−4 the classical Runge-Kutta formula is a good choice. For lower
accuracies > 10−4 single-step methods with local order < 4 can be used. If the computation
of f is costly, multistep methods are superior to single-step methods, although they are
more costly when the step size has to be changed during the calculations. Very efficient are
implicit Adams methods of variable order. Implicit Runge-Kutta methods, which have not
been discussed, are suitable if very high accuracies (10−10 . . . 10−20 ) are needed.
Independent on the initial value problem we can summarize the following:

(a) Runge-Kutta methods have the advantage that they have not to be initialized, they
have a high local order, they are easy to handle, and an automatic step size control is
easy to implement. A drawback is that each step requires several evaluations of the
function f .

(b) Multistep methods have the advantage that in general only 2-3 function evaluations per
step are needed (including iterations). Moreover, formulas of arbitrarily high order can
easily be constructed. A drawback is that they have to be initialized, especially if the
step size has to be changed since after any change of the step size, a new initialization
is necessary.
Chapter 9

Numerical Optimization

9.1 Statement of the problem

The basic continuous optimization problem is to find x ∈ Ω such that some function f (x) is
minimized (or maximized) on the domain Ω ⊂ RN , the design space. We say: Find x̄ ∈ Ω
such that
f (x̄) ≤ f (x), ∀x ∈ Ω. (9.1)

Equivalently: Find
f¯ := min f (x).
x∈Ω

And equivalently again: Find

x̄ := arg min f (x),
x∈Ω

where f¯ = f (x̄). In the terminology of optimization

• f (·) is the objective function or cost function.

• x = (x0 , . . . , xN −1 ) is the vector of design variables.

• Ω is the design space.

• N is the dimension of the problem.

• We say the solution x is constrained to lie in Ω.

In 1-dimension Ω = [a, b] is typically an interval of the real line. As for all other numerical
methods in these notes, we are mainly interested in the richer and more challenging multi-
dimensional case.
More generally we can consider optimization problems where the design space is partly
or wholly discrete, e.g. find Ω = RN × N. Further we may have problems with an additional

145
146 Chapter 9. Numerical Optimization

constraint on x, such as: Find x ∈ Ω such that

f¯ := min f (x), subject to (9.2)

x∈Ω

g(x) = 0, and (9.3)

h(x) ≥ 0, (9.4)

where g(·) = 0 is called an equality constraint and h(·) ≥ 0 an inequality constraint. The
problem is now known as a constrained optimization problem, an important example of which
is PDE-constrained optimization, where Ω is a function space and g(x) = 0 is a partial differ-
ential equation. Of course in all cases the minimization can be replaced with a maximization.

Example 9.1 (Drag minimization for aircraft)

When designing aircraft wings one goal is to minimize the drag coefficient cD , which we can
reduce by modifying the shape of the wing. First we choose some parameters x that specify
the shape, e.g. x0 ∈ [0, Cmax ] could be camber, x1 ∈ [0, ∞) leading-edge radius, x2 mid-
chord thickness, etc. where the intervals exclude unrealistic shapes. By choosing sufficiently
many parameters we can allow for a wide variety of wing shapes - this process is known as
parameterization. In practice to specify the shape of a 3d wing fully we might need O(100)
parameters, giving a design space and optimization problem of the same dimension. Let x(0)
be our initial design.
Given a parameterization x ∈ Ω, our basic optimization problem for drag is: Find x ∈ Ω
such that
c̄D := min cD (x).
x∈Ω

But the new shape might result in an undesirable reduction in lift cL compared to the initial
design, in which case we should introduce a constraint that guarantees the lift is not reduced

cL (x) − cL (x(0) ) ≥ 0,

and we now have a constrained optimization problem. Furthermore imagine that we are at
a very early stage in the design of the aircraft, and have a choice a regarding the number of
engines we should install on the wing. The number of engines is a new discrete design vari-
able, Neng ∈ {1, 2, 4}, and the optimization problem becomes continuous-discrete inequality
constrained.
In any case, to solve the problem we need to evaluate cD (·) and cL (·) for different x
representing different wing geometries. This is the job of Computational Fluid Dynamics
(CFD), and may be a computationally expensive operation. We therefore need efficient
numerical methods that find x̄ with as few evaluations of cD (·) as possible.

The numerical methods required for solving each type of optimization problem given above
are quite different. In the following we consider methods for the most common unconstrained
continuous optimization problem only.
9.2. Global and local minima 147

9.2 Global and local minima

As for previous problem statements we would first like to establish the existence and unique-
ness of solutions to (9.1). Unfortunately in optimization the situation is hopeless. By con-
sidering the optimization problems

min sin x
x∈[0,2πM ]

where M > 0 and

min 1
x∈[0,1]

it’s easy to convince ourselves respectively that optimization problems may have any number
of solutions, and even infinitely many solutions on a finite interval - so solutions can definitely
not be regarded as unique in general. As for existence, consider the problem

min x,
x∈(0,1)

where (0, 1) indicates the open interval (not including the end points). This problem has no
solution by the following proof by contradiction: assume there exists a solution x̄ = ∈ (0, 1);
now consider /2 which is certainly in (0, 1) and /2 < , so is not the solution. Contra-
diction. QED.1 So we can not establish existence of solutions in general either. However in
engineering practice (e.g. Example 9.1) we rarely deal with open intervals and therefore don’t
encounter issues of existence. And if there are multiple solutions any may be a satisfactory
design, or we may choose one based on other considerations.
In fact we often reduce the scope of the problem by accepting local solutions. In particular
a global optimum is a solution of (9.1), while a local optimum is a solution of the related
problem: Find x ∈ E() ⊂ Ω such that

x̃ := arg min f (x), (9.5)

x∈E

where > 0, and

E() := {x | kx − x̄k < } ∩ Ω,

i.e. the ball of radius surrounding the local solution x̃. In other words a local minima
represents the optimal choice in a local sense – the value of the objective function can not
be reduced without performing a step greater than some small but finite > 0. Note that a
global optimum is necessarily also a local optimum. Needless to say, we may have multiple
local and global optima in a single problem.
The reason we often compromise and only ask for local optima, is that it is easy to find
a local optima, but very difficult to determine if a given optima is also global — one has to
1
This is the reason we need the concept of the infimum in real analysis.
148 Chapter 9. Numerical Optimization

Figure 9.1: One-d objective function f (x) containing multiple local and global optima.
Extrema are marked with circles.

search all Ω and make sure there are no other (potentially very limited) regions where f is
smaller. With the algorithms presented here we can find local optima quickly and cheaply
close to a starting point x(0) . This is often sufficient for engineering purposes, where an
acceptable design is known a priori — e.g. we know roughly what a wing looks like and want
to find another very similar shape with lower drag.
Several algorithms for finding optima will be discussed below. They all find optima close
to the starting point x(0) , and are therefore all local. Optimization algorithms are divided into
gradient-based methods that use f 0 , and gradient-free methods that require only the ability
to evaluate f . In general the former are effective for any N , the latter only for N ∼ O(10).

9.3 Golden-section search

Golden-section search is a gradient-free method that does not rely on the differentiability of
f 0 (x), and is guaranteed to converge to some minimum. However it only applies to problems
of dimension 1, which limits its usefulness. It may be considered the contemporary of the
recursive bisection for root-finding. Whereas in recursive bisection we needed 2 points to
establish the presence of a root in an interval, in Golden-section search we need 3 points to
establish the presence of an optimum in an interval.
Consider the case of minimization, and that we wish to find a minimum of f (x) on the
interval [a, b]. We first choose two nodes xL , xR ∈ [a, b] with xL < xR , and evaluate the
objective function at these points. If f (xL ) < f (xR ) we can conclude that f is decreasing
as we head from xR towards xL , and that there is guaranteed to be a minimum in the the
interval [a, xR ] (which is not at xR ). Similarly if f (xL ) > f (xR ) we can conclude that there is
9.3. Golden-section search 149

certainly a minimum in the interval [xL , b] (that is not at xL . See Figure 9.2 for a graphical
interpretation.
This suggests a recursive algorithm where we progressively reduce the size of the interval
on which we know an minimum to exist.

Algorithm 9.2 (Recursive interval minimization)

Assume an initial interval [a, b] ⊂ R and a continuous function f (x), and choose α, β ∈
(0, 1) with α < β. Then pseudocode for Nmax iterations of an interval minimization search
is:
a0 ← a; b0 ← b
for i = [0 : Nmax ] do
xL ← ai + α(bi − ai )
xR ← ai + β(bi − ai )
if f (xL ) < f (xR ) then
ai+1 ← ai
bi+1 ← xR
else
ai+1 ← xL
bi+1 ← bi
end if
end for
return [ai+1 , bi+1 ]
We might also augment this algorithm with a convergence criteria, stopping the algorithm
after the size of the interval is less than some tolerance.

The question remains: how should we choose xL and xR , or α and β in the above algo-
rithm? We specify a desirable property: that we can reuse one of f (xL ) or f (xR ) from the
previous step — that is, if we choose the sub-interval [an , xR,n ] on step n, then xL,n should
be in the position of xR,n+1 on step n + 1. Similarly if we choose the sub-interval [xL,n , bn ]
then xR,n should be in the position of xL,n+1 on step n + 1. Thus we reduce the number of
evaluations of f (·) to 1 per step, rather than 2 per step.
Satisfying this conditions is possible using the Golden section (or Gulden snede in Dutch).
The conditions require that
xR − xL xL − a
= ,
xL − a b − xL
and
xR − xL xL − a
= .
b − (xR − xL ) b − xL
Eliminating xR − xL from these equations we get

ϕ2 = ϕ + 1 (9.6)
150 Chapter 9. Numerical Optimization

Figure 9.2: One iteration of the golden-section search.

9.4. Newton’s method 151

where
b − xL
ϕ= . (9.7)
xL − a
The quadratic equation (9.6) has one positive and one negative root, the positive one is
√
1+ 5
ϕ= = 1.618033988...
2
the Golden ratio. Solving (9.7) for xL gives

xL = a + (1 − ϕ−1 )(b − a)

given which
xR = a + ϕ−1 (b − a)

and therefore
α = 1 − ϕ−1 ≈ 0.3820, β = ϕ−1 ≈ 0.6180.

With these values of α and β the above algorithm is known as the golden-section search
method. The interval width decreases by a constant factor ϕ−1 on each iteration, no matter
which side of the interval is choosen. Therefore if we take the midpoint of the interval as our
approxiation of the minimum, we make an error of at most

b−a
0 =
2
on the first step, and
b−a
n = (ϕ−1 )n
2
on the nth step. As for recursive bisection we have therefore linear convergence, in this case
at a rate of ϕ−1 . Applying the method to various hand-sketched functions is a good way of
getting an intuition for the behaviour of the method.

9.4 Newton’s method

One strategy for solving (9.1) is the following: assume that the objective function f (x) is
twice continuously differentiable, i.e. f ∈ C 2 (RN ), that the derivative
T
∂f ∂f ∂f
f 0 (x) = ...

∂x0 ∂x1 ∂xN −1
x

is available, and further that Ω = RN . In this case we can rewrite (9.1) as: Find one x̄ ∈ Ω
such that
f 0 (x̄) = 0, (9.8)
152 Chapter 9. Numerical Optimization

which is an N -dimensional root-finding problem. This gives us a location where f is station-

ary, and then it only remains to establish whether f (x̄) is a local maxima, local minima, or
saddle point. This can be done by examining the N × N Hessian matrix
 
∂2f ∂2f ∂2f
2 ∂x0 ∂x1 · · ·
∂x0 ∂xN −1 
 ∂x0
 ∂2f .. ..
.

00
 ∂x ∂x . 
f (x̄) = 
 1
..
0
. .
 .
 . . . .
.


2 2
 
∂ f ∂ f
∂xN −1 ∂x0 · · · · · · ∂x 2

N −1 x̄

The Hessian is symmetric so all eigenvalues λ are real. If all eigenvalues are positive we
have a local minimum, all negative a local maximum, and mixed implies a saddle point. See
Example 9.3.2
In order to solve (9.8) we can use any of the methods of Chapter 2, for example Newton’s
method: h i−1
x(n+1) = x(n) − f 00 (x(n) ) f 0 (x(n) ), (9.9)
with a suitable choice of starting point. The properties of Newton’s method for root-finding
are transfered to this algorithm. In particular we know that Newton’s method converges
quadratically, so we expect only a few iterations are required for an accurate solution. Fur-
thermore Newton tends to converge to a solution close to the initial guess x(0) , so this
algorithm will have the property of finding a local optimum, and different starting points
may give different optima.

Example 9.3 (Quadratic forms)

Consider the special objective function

1
f (x) = c + bT x + xT Ax, (9.10)
2
known as a quadratic form, where A = AT is a symmetric matrix, b a vector and c a constant.
This is a useful objective function to consider as it is the first 3 terms of a Taylor expansion
of a general function f :
1 T 00
f (x0 + h) = f (x0 ) + f 0 (x0 )T h + h f (x0 )h + O(khk3 ),
2!
where we can immediately identify

c = f (x0 )
b = f 0 (x0 )
A = f 00 (x0 ).
2
This is a generalization of the 1d principle that if the 1st-derivative of f is zero and the 2nd-derivative is
positive we have a local minimum (consider what this means for the Taylor expansion of f about x̄).
9.4. Newton’s method 153

Therefore for small khk, f is approximated well by a quadratic form. In particular if f has
a stationary point at x̄ then f 0 = 0 and
1
f (x̄ + h) ≈ f (x̄) + hT f 00 (x̄)h. (9.11)
2
Now we would like to understand under what conditions on A = f 00 (x̄) the function f
has a minimum at x̄ (rather than a maximum or something else). Since A is symmetric
(i) all eigenvalues are real, and (ii) we can find an orthonormal basis of eigenvectors vi for
RN −1 , satisfying Avi = λi vi and viT vj = δij . Now consider (9.11), and write h as a sum of
eigenvectors of A:
NX −1
h= ai vi ,
i=0
substituting into the second term in (9.11) we have
N −1 N −1 N −1 N −1 N −1
1 T 1X X 1X X 1X
h Ah = ai vi Avj aj = ai vi λj vj aj = λj a2j .
2 2 2 2
i=0 j=0 i=0 j=0 j=0

Now if λj > 0, ∀j, then this term will be positive no matter what direction h we choose, and
therefore the function increases in every direction, and x̄ is a local minimum. Similarly if
λj < 0, ∀j then this term will be negative for all directions h, and we have a local maximum.
If λj are mixed positive and negative, then in some directions f increases, and in others
it decreases — these are known as saddle-points.3 A quadratic form with λj > 0 in 2d is
sketched in Figure 9.3.
Example 9.4 (Newton’s method applied to a quadratic form)
To apply Newton to the quadratic form
1
f (x) = c + bT x + xT Ax,
2
where we assume A is positive definite, we first differentiate once:
1 T
f 0 (x) = b +

x A + Ax = b + Ax
2
and then again:
f 00 (x) = A,
and apply the Newton iteration (9.9)

x(1) := x(0) − A−1 (b + Ax(0) ) = −A−1 b.

But then x(1) satisfies immediately f 0 (x(1) ) = 0. So if f is a quadratic form Newton converges
to the exact optimum in 1 iteration from any starting point x(0) ! Compare this result to the
result from root-finding that if f is linear Newton finds the root f (x̄) = 0 in one iteration.
3
A symmetric matrix A is called strictly positive definite if λj > 0, ∀j, which is equivalent to xT Ax >
0, ∀x ∈ RN −1 . Therefore we are mainly interested in quadratic forms with strictly positive definite A.
154 Chapter 9. Numerical Optimization

Figure 9.3: Contours of a quadratic form in 2-d with positive definite A.

9.5 Steepest descent method

In practice we may wish to optimize a function where we have access to f 0 but not f 00 .
We cannot apply Newton but would still like to use the information provided by f 0 . The
essential idea of the steepest descent method is that, given a starting guess x(n) , we search
for a minimum of f in the direction in which f decreases fastest. This direction is simply
rn = −f 0 (x(n) ). We then solve the 1-d optimization problem: Find α(n) ∈ [0, ∞) such that

α(n) := arg min g(α),

α∈[0,∞)

g(α) = f x(n) − αf 0 (x(n) ) ,

using Golden-section search (for example). Then the new guess is

x(n+1) = x(n) − α(n) f 0 (x(n) ),

and the algorithm repeats with a new search direction. See Figure 9.4 for a sketch of the
progress of this method in 2d. At each step we are guaranteed to reduce the value of the
objective function:
f (x(n+1) ) ≤ f (x(n) ),

and if x(n) = x̄ then x(n+1) = x̄ and the exact solution is preserved, but it unclear how fast
this method converges. For this analysis we return to the example of a quadratic form, which
will describe the behaviour of the algorithm close to the optimum of any function f .
9.5. Steepest descent method 155

Figure 9.4: Three iterations of the steepest descent method.

Example 9.5 (The steepest descent method applied to a quadratic form)

In the case of a quadratic form we do not need to apply an algorithm to search for the
minimum in the search direction, we can get an explicit expression for the size of step to
take. Consider the quadratic form:
1
f (x) = c + bT x + xT Ax.
2
The search direction on step n of the method is

rn = −f 0 (x(n) ) = −b − Ax(n) .

Now the function f (·) will have a minimum in this direction when the gradient of f (·) is
orthogonal to the search direction, i.e. when

f 0 (x(n+1) )T · f 0 (x(n) ) = 0.

Starting from this point we derive an expression for the step-size α(n) :
T
rn+1 · rn = 0
(−b − Ax(n+1) )T · rn = 0 Defn. of rn+1
(n) (n) T
(−b − A(x +α rn )) · rn = 0 Defn. of x(n+1)
(−b − Ax(n) )T · rn − α(n) (Arn )T · rn = 0 Linearity of A
rnT · rn − α(n) rnT Arn = 0 Symmetry of A
156 Chapter 9. Numerical Optimization

so that
rnT rn
α(n) = . (9.12)
rnT Arn
So for a step of steepest descent for a quadratic form we know exactly what distance to step in
the direction rn . We could also take this step if we know the Taylor expansion of f including
the quadratic term; though in general this will be a different step than that performed using
a numerical 1d search method (because of the additional higher-order terms neglected).

Example 9.6 (Convergence of steepest descent for a stiff problem)

Consider the simple quadratic form
!
1 1 0
f (x) = xT x,
2 0 1000

with exact solution x̄ = 0, f¯ = 0.4 Apply steepest decent with an initial guess x(0) =
(−1, 0.001). The derivative is
! !
0 1 0 x0
f (x) = x= .
0 1000 1000x1

and therefore the first search direction is

r0 = −f 0 (x(0) ) = (1, −1),

and by (9.12):
r0T r0 2
α(0) = T
= .
r0 Ar0 1001
Note that this is a tiny step in comparision to the distance to the true minimum at 0, which
is ≈ 1. We have
1 T
x(1) = 999
−999 − 1000
1001
which is extremely close to the starting point. The next search direction is

999
r1 = −f 0 (x(1) ) = (1, 1),
1001
which is just 90◦ to the previous direction but slightly shorter. The next step is

r1T r1 2
α(1) = T
= ,
r1 Ar1 1001

i.e. the same as before, but since r1 is slightly shorter the actual step is slightly shorter.
The iteration proceeds in this way, with the search direction alternating between (1, 1) and
9.6. Nelder-Mead simplex method 157

Figure 9.5: Slow progress of the steepest descent method for the stiff problem of Exam-
ple 9.6.

(1, −1), α = 2/1001, and the total step reducing a factor of 999/1001 at each step. See
Figure 9.5
Therefore the error approximate minimum at step n can be approximately written:

999 n

n ≈ .
1001
This is familiar linear convergence, but extremely slow. After 100 steps the error is 100 ≈
0.82, after 1000 steps still 1000 ≈ 0.14. To solve this issue with stiffness causing slow conver-
gence there is a related algorithm called the conjugate gradient method, which make a cleverer
choice of search direction, but this is outside the scope of this course.

9.6 Nelder-Mead simplex method

The final algorithm we consider in this course takes the most space to describe, but as we
have seen again and again the definition of a numerical method is only the tip of the iceberg;
understanding convergence, accuracy and other important properties requires investigating
deep below the water-line. We do not analyse the Nelder-Mead simplex method in detail,
but note that it has many nice properties: it is gradient-free, converges rapidly for moderate
N and stretched objective functions, is stable, and robust to noise in the objective function.5
In the following the method is defined for 2 dimensions in which the simplex is a triangle.
In 3 dimensions the simplex is a tetrahedron, and in higher dimensions a hyper-tetrahedron
or “N -simplex”, the generalizations of the operations to these cases is clear.
4
In the case of ODEs we would call this a stiff problem, where there are 2 or more disparate scales in the
same system— a similar concept applies here.
5
Probably because of these favorable properties the Nelder-Mead simplex method happens to be the default
method in the fmin optimization routine in MATLAB.
158 Chapter 9. Numerical Optimization

Algorithm 9.7 (Nelder-Mead Simplex)

Start with an initial triangle with three nodes (x1 , x2 , x3 ) not lying on a line. Perform the
following steps:

1. ORDER the nodes according to the values of f at the vertices:

f (x1 ) ≤ f (x2 ) ≤ f (x3 )

i.e. x1 corresponds to the best value of the objective function, and x3 the worst.

2. Compute x0 = 21 (x1 + x2 ), the midpoint of all nodes except the worst x3 .

3. REFLECT: Compute xR = x0 + (x0 − x3 ), the reflection of x3 in the line containing

the two other points. Evaluate f (xR ), and:

• If f (xR ) < f (x1 ) then EXPAND.

• If f (xR ) > f (x2 ) then CONTRACT.
• Otherwise replace x3 with xR and goto 1.

4. EXPAND: Compute xE = x0 + 2(x0 − x3 ), the

• If f (xE ) < f (xR ) then replace x3 with xE and goto 1.

• Otherwise replace x3 with xR and goto 1.

5. CONTRACT: Compute xC = x0 + 21 (x0 − x3 ),

• If f (xC ) < f (x3 ) then replace x3 with xC and goto 1.

• Otherwise REDUCE.

6. REDUCE: Nothing is producing an improvement, so replace all the nodes but the best
node with the midpoints of the triangle edges, to create a triangle in the same location
but half the size:

• x2 = x1 + 12 (x2 − x1 )
• x3 = x1 + 12 (x3 − x1 ) and goto 1.

The basic idea is that the values of f at the 3 nodes of the triangle give a clue about the
best direction in which to search for a new point without needing any gradient information. At
each step the algorithm tries to move away from the worst point in the triangle (REFLECT).
If this strategy is successful then it goes even further and stretches the triangle in that
direction (EXPAND); if not successful it is more conservative (CONTRACT); and if none of
this seems to work then it makes the entire triangle smaller (REDUCE), before trying again.
See Figure 9.6
9.6. Nelder-Mead simplex method 159

After a number of iterations the stretching of the simplex corresponds roughly to the
stretching of the objective function in the design space. If the EXPAND operation is per-
formed repeatedly the result will be an increasingly stretched triangle, but this will only
occur if this strategy is producing a consistent reduction in that direction. This flexibility
allows the method to take large steps when necessary, and thereby gain some of the efficiency
advantages of gradient-based methods.
160 Chapter 9. Numerical Optimization

Figure 9.6: The component operations of Nelder-Mead simplex.

Bibliography

Buchanan, F. L. and Turner, P. R. (1992). Numerical Methods and Analysis. McGraw-Hill.

Engeln-Müllges, G. and Reutter, F. (1985). Numerische Mathemathik für Ingenieure. B.I.-

Wissenschaftsverlag, Mannheim.

Gerald, C. F. and Wheatly, P. O. (1994). Applied numerical Analysis. Addison-Wesley, 5th

edition.

Stoer, J. and Bulirsch, R. (1993). Introduction to numerical Analysis. Springer, New York,
2nd edition.

Vetterling, W. T., Teukolsky, S. A., and Press, W. H. (1992). Numerical Recipes in FOR-
TRAN; the Art of scientific Computing; Numerical Recipes Example Book. Cambridge
University Press, New York.

161

3-Uninformed Search
No ratings yet
3-Uninformed Search
48 pages
Chapter 14 Arithmetic Sequences
No ratings yet
Chapter 14 Arithmetic Sequences
12 pages
57 Book Manuscript 394 1 10 20230209
100% (1)
57 Book Manuscript 394 1 10 20230209
135 pages
(Ebook PDF) A First Course On Numerical Methods 5th Edition Download
100% (1)
(Ebook PDF) A First Course On Numerical Methods 5th Edition Download
57 pages
Practice Challenge Calendar by TwoSet Apparel
100% (1)
Practice Challenge Calendar by TwoSet Apparel
1 page
Numerical Methods
No ratings yet
Numerical Methods
106 pages
Bài Giảng Giải Tích Số
No ratings yet
Bài Giảng Giải Tích Số
174 pages
MODULE 3 Software Testing Notes
No ratings yet
MODULE 3 Software Testing Notes
10 pages
Lagrangians and Hamiltonians - 250411 - 180334
No ratings yet
Lagrangians and Hamiltonians - 250411 - 180334
10 pages
KP Portfolio 2024
No ratings yet
KP Portfolio 2024
36 pages
Mock Test CH - 1 10th 2025-26
No ratings yet
Mock Test CH - 1 10th 2025-26
2 pages
Num Comp EWN2004
No ratings yet
Num Comp EWN2004
383 pages
Graph
No ratings yet
Graph
28 pages
Introduction Numerical Analysis
No ratings yet
Introduction Numerical Analysis
252 pages
Google Maps and Graph Theory
No ratings yet
Google Maps and Graph Theory
16 pages
Lecture Notes2019
No ratings yet
Lecture Notes2019
91 pages
Numerical Analysis Lecture Ch.01 06
No ratings yet
Numerical Analysis Lecture Ch.01 06
241 pages
MA214-Lecture Notes
No ratings yet
MA214-Lecture Notes
282 pages
Metnum V5
No ratings yet
Metnum V5
114 pages
MAT321 Lecture Notes Boumal 2019
No ratings yet
MAT321 Lecture Notes Boumal 2019
203 pages
Gasdynamics1 7
No ratings yet
Gasdynamics1 7
241 pages
Introduction Numerical Analysis
No ratings yet
Introduction Numerical Analysis
443 pages
Autumn 2024 CS30001 DAA Complete Lesson Plan - Section CSE-XX
No ratings yet
Autumn 2024 CS30001 DAA Complete Lesson Plan - Section CSE-XX
6 pages
NM Script
No ratings yet
NM Script
181 pages
Lec 9 New
No ratings yet
Lec 9 New
12 pages
A Course On Integral Equations With Numerical Analysis: Tofigh Allahviranloo Armin Esfandiari
No ratings yet
A Course On Integral Equations With Numerical Analysis: Tofigh Allahviranloo Armin Esfandiari
222 pages
Numerical Methods For Ordinary Differential Equations
100% (2)
Numerical Methods For Ordinary Differential Equations
134 pages
Kruskal's Minimum Spanning Tree Algorithm
No ratings yet
Kruskal's Minimum Spanning Tree Algorithm
4 pages
Week 5
No ratings yet
Week 5
12 pages
Mat637 Notes
No ratings yet
Mat637 Notes
127 pages
Manual 2022
No ratings yet
Manual 2022
21 pages
Curseng
No ratings yet
Curseng
230 pages
Numerical Analysis I-1
100% (1)
Numerical Analysis I-1
205 pages
MA214LectureNotesFULL PDF
No ratings yet
MA214LectureNotesFULL PDF
273 pages
Gasdynamics AE4140 Chapter 1: Introduction
No ratings yet
Gasdynamics AE4140 Chapter 1: Introduction
60 pages
Violin LRSM Repertoire (7 Nov 2023) FINAL
No ratings yet
Violin LRSM Repertoire (7 Nov 2023) FINAL
2 pages
NUM701S Lecture Notes Book
No ratings yet
NUM701S Lecture Notes Book
58 pages
Trigonometry 01
No ratings yet
Trigonometry 01
20 pages
NumericalAnalysis Notes (In Progress)
No ratings yet
NumericalAnalysis Notes (In Progress)
79 pages
AE4133 CFD II Part 1 Discretisations For Compressible Flows
No ratings yet
AE4133 CFD II Part 1 Discretisations For Compressible Flows
104 pages
Notes ITSC
No ratings yet
Notes ITSC
117 pages
Gasdynamics AE4140 Chapter 2: Linearized Flow Equations
No ratings yet
Gasdynamics AE4140 Chapter 2: Linearized Flow Equations
47 pages
2 Iteration Bound
No ratings yet
2 Iteration Bound
19 pages
SI507lecturenotes PDF
No ratings yet
SI507lecturenotes PDF
245 pages
Math 248: Computers and Numerical Algorithms
No ratings yet
Math 248: Computers and Numerical Algorithms
162 pages
Control Theory Python Summary
No ratings yet
Control Theory Python Summary
25 pages
Lecture 0201 Python Program GCD
No ratings yet
Lecture 0201 Python Program GCD
3 pages
Numerical Methods: Radostin Simitev Simon Candelaresi
No ratings yet
Numerical Methods: Radostin Simitev Simon Candelaresi
127 pages
Course Note
No ratings yet
Course Note
121 pages
Assignment6 Final
No ratings yet
Assignment6 Final
17 pages
Main PDF
No ratings yet
Main PDF
137 pages
Autopilot Design With Root-Locus
No ratings yet
Autopilot Design With Root-Locus
36 pages
Text Book
100% (1)
Text Book
129 pages
Mona Lisa Rozycki Violin 1-Violins - 1
No ratings yet
Mona Lisa Rozycki Violin 1-Violins - 1
6 pages
Lecture Notes-1
No ratings yet
Lecture Notes-1
98 pages
An Intuitive Guide To Numerical Methods Heinold
100% (1)
An Intuitive Guide To Numerical Methods Heinold
100 pages
Lecture 8 - More State-Space and System Properties: G(S) G(S) 6205 S (+ 13s + 1281) S (S) Y (S) /R(S) H
No ratings yet
Lecture 8 - More State-Space and System Properties: G(S) G(S) 6205 S (+ 13s + 1281) S (S) Y (S) /R(S) H
33 pages
Numerical Analysis With - Matlab
No ratings yet
Numerical Analysis With - Matlab
76 pages
Be Information Technology Engineering Semester 3 2022 November Discrete Mathematics DM Pattern 2019
No ratings yet
Be Information Technology Engineering Semester 3 2022 November Discrete Mathematics DM Pattern 2019
4 pages
Problems For Algorithm Development: Java Programming
0% (1)
Problems For Algorithm Development: Java Programming
7 pages
Lecture 6 - State-Space Systems in Matlab: (T) A (T) + B (T) X X U C + D y X U
No ratings yet
Lecture 6 - State-Space Systems in Matlab: (T) A (T) + B (T) X X U C + D y X U
33 pages
01 Logic1
No ratings yet
01 Logic1
18 pages
CFD Exams Solution
No ratings yet
CFD Exams Solution
12 pages
An Intuitive Guide To Numerical Methods Heinold PDF
No ratings yet
An Intuitive Guide To Numerical Methods Heinold PDF
121 pages
Numerical Methods: Jeffrey R. Chasnov
No ratings yet
Numerical Methods: Jeffrey R. Chasnov
60 pages
1.1 The Real Number System: Types of Numbers
No ratings yet
1.1 The Real Number System: Types of Numbers
4 pages
NumeericalAnalysis PDF
No ratings yet
NumeericalAnalysis PDF
167 pages
The Billionth Digit of Pi
No ratings yet
The Billionth Digit of Pi
10 pages
Course Notes MATH
No ratings yet
Course Notes MATH
130 pages
Numerical Analysis Notes
No ratings yet
Numerical Analysis Notes
73 pages
Numerical
No ratings yet
Numerical
146 pages
Compositions and Fibonacci Identities
No ratings yet
Compositions and Fibonacci Identities
16 pages
Numerical Computions
No ratings yet
Numerical Computions
103 pages
Num
No ratings yet
Num
114 pages
AM341
No ratings yet
AM341
118 pages
Num Computing Notes Only
No ratings yet
Num Computing Notes Only
102 pages
Full SSG Ma214 Napostmidsem 201718
100% (1)
Full SSG Ma214 Napostmidsem 201718
267 pages
Numerical Methods For Graduate School: JP Bersamina October 11,2018
No ratings yet
Numerical Methods For Graduate School: JP Bersamina October 11,2018
67 pages
Organizational Communication in Business
No ratings yet
Organizational Communication in Business
4 pages
IJAMSS - V-Super and E-Super Vertex-Magic Total Labeling of Graphs
No ratings yet
IJAMSS - V-Super and E-Super Vertex-Magic Total Labeling of Graphs
8 pages
Numerical Analysis
No ratings yet
Numerical Analysis
117 pages
Log BBA
No ratings yet
Log BBA
7 pages
Question Set 1
No ratings yet
Question Set 1
9 pages
Question Set 2: Your Answer
No ratings yet
Question Set 2: Your Answer
7 pages
Numerical Analysis Durham UNI
No ratings yet
Numerical Analysis Durham UNI
87 pages
The Handshaking Lemma
No ratings yet
The Handshaking Lemma
14 pages
Esc101: Fundamentals of Computing Esc101: Fundamentals of Computing
No ratings yet
Esc101: Fundamentals of Computing Esc101: Fundamentals of Computing
7 pages
Linear Algebra
No ratings yet
Linear Algebra
43 pages
Combinatorial Set Theory PDF
No ratings yet
Combinatorial Set Theory PDF
2 pages
On The Mathematics of Flat Origami - Hull
No ratings yet
On The Mathematics of Flat Origami - Hull
10 pages
Ece198jl sp13 Exam1
No ratings yet
Ece198jl sp13 Exam1
9 pages
Sample - Programs Pps
No ratings yet
Sample - Programs Pps
6 pages
Delft University of Technology Faculty of Aerospace Engineering
No ratings yet
Delft University of Technology Faculty of Aerospace Engineering
4 pages
Lecture Notes For Math-CSE 451: Introduction To Numerical Computation
100% (1)
Lecture Notes For Math-CSE 451: Introduction To Numerical Computation
102 pages
Root-Locus: Your Answer
No ratings yet
Root-Locus: Your Answer
5 pages
Mathematics QM016 Topic 1: Number System - Tutorial
No ratings yet
Mathematics QM016 Topic 1: Number System - Tutorial
4 pages
CBSE Class 7 Mathematics Question Paper Set A
No ratings yet
CBSE Class 7 Mathematics Question Paper Set A
2 pages
Python Command Explanation: Ae2235-I Aerospace Systems and Control Theory Command Summary
No ratings yet
Python Command Explanation: Ae2235-I Aerospace Systems and Control Theory Command Summary
2 pages
Operation Longlife
From Everand
Operation Longlife
E. Hoffmann Price
3.5/5 (3)
Numerical Methods
No ratings yet
Numerical Methods
60 pages
Undergraduate Text
No ratings yet
Undergraduate Text
351 pages
INMO-2018 Paper & Solution..
No ratings yet
INMO-2018 Paper & Solution..
6 pages
Mathematics N4: FET College Nated, #6
From Everand
Mathematics N4: FET College Nated, #6
Efetobo Emede
No ratings yet
Bach Bach Black Sheep: Arr. Anthony Cummins
No ratings yet
Bach Bach Black Sheep: Arr. Anthony Cummins
1 page
Dragon Prince Ocarina Melody PDF
No ratings yet
Dragon Prince Ocarina Melody PDF
1 page
PDF
No ratings yet
PDF
1 page
ADVANCED COLLEGE ALGEBRA STUDY GUIDE
From Everand
ADVANCED COLLEGE ALGEBRA STUDY GUIDE
Harrison K Cook
No ratings yet
Advanced college algebra study guide
From Everand
Advanced college algebra study guide
Harrison Cook
No ratings yet
Big o Cheatsheet
No ratings yet
Big o Cheatsheet
2 pages
Introduction To Scientific Computing: Using Matlab
No ratings yet
Introduction To Scientific Computing: Using Matlab
8 pages
Frère Jacques: Picard & Daren: As Played in Star Trek: The Next Generation - Lessons
No ratings yet
Frère Jacques: Picard & Daren: As Played in Star Trek: The Next Generation - Lessons
2 pages
Deadline Yemen (The Elizabeth Darcy Series)
From Everand
Deadline Yemen (The Elizabeth Darcy Series)
Peggy Hanson
5/5 (1)
Osama the Gun
From Everand
Osama the Gun
Norman Spinrad
5/5 (1)
Hamlet Had an Uncle: A Comedy of Honor
From Everand
Hamlet Had an Uncle: A Comedy of Honor
James Branch Cabell
4.5/5 (7)
A Discourse Analysis of 1 Peter
From Everand
A Discourse Analysis of 1 Peter
Ervin Ray Starwalt
No ratings yet
Kellory the Warlock
From Everand
Kellory the Warlock
Lin Carter
No ratings yet
Gray Hat Hacking the Ethical Hacker's
From Everand
Gray Hat Hacking the Ethical Hacker's
Çağatay Şanlı
5/5 (1)

Applied Numerical Analysis (AE2220-I) : R. Klees and R.P. Dwight

Uploaded by

Applied Numerical Analysis (AE2220-I) : R. Klees and R.P. Dwight

Uploaded by

Applied Numerical Analysis (AE2220-I)

R. Klees and R.P. Dwight

1 Preliminaries: Motivation, Computer arithmetic, Taylor series 1

2 Iterative Solution of Non-linear Equations 11

4 Advanced Interpolation: Splines, Multi-dimensions and Radial Bases 39

4.2.4 Bicubic spline interpolation (?? not examined) . . . . . . . . . . . . . 58

8 Numerical Methods for Solving Ordinary Differential Equations 119

8.3.2 The method of Heun . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

9 Numerical Optimization 145

1.1 Numerical Analysis Motivation

Numerical analysis is a toolbox of methods to find solutions of analysis problems by purely

You can solve this exactly by applying the chain-rule, to discover I = π.

Figure 1.1: Trapezoidal rule for integral (1.1).

1.2 Computer Representation of Numbers

taking values 0 or 1. A given b represents the natural number

E.g. to represent the number 19:

1.2.2 Real numbers - Fixed-point arithmetic

• Overflow - as for integers.

1.2.3 Real numbers - Floating-point arithmetic

• b - base, usually 2 or 10, fixed for system.

• s - significand (or mantissa), 1 ≤ s < b, with n-digits - a fixed-point number. E.g. in

• e - exponent, an integer emin ≤ e ≤ emax .

• Undefined operation - such as divide-by-zero, sqrt of −1. Special value, not-an-

• Rounding error - 1 + 1 × 10−5 = 1 in above system. We define the machine epsilon,

e=-2 e=-1 e=0 e=1

Figure 1.2: Graphical representation of a floating-point number system in base 10 (deci-

• Overflow at ≈ ±1.7 × 10308 (atoms in universe ≈ 1080 ).

• Underflow at ≈ ±2.2 × 10−308 (Planck scale ≈ 1.6 × 10−35 m).

• Machine epsilon ≈ 1.1 × 10−16 .

• Smallest integer not in system 9007199254740993 = 253 + 1.

1.3 Taylor Series Review

single-precision IEEE 754 (32-bit)

double-precision IEEE 754 (64-bit)

Figure 1.3: Bit layout in the IEEE754 standard.

be performed easily and exactly. So to find e.g.

A convenient rewriting of (1.2) is obtained by defining the step-size

Substituting into (1.2) gives

for some ξ ∈ [0, x].

f (x) = ax4 + bx3 + cx2 + dx + e

0.5 0.5 0.5

0.0 0.0 0.0

0.5 0.5 0.5

1.0 1.0 1.0

1.5 1.5 1.5

Figure 1.4: Successive approximations to cos(x) at 0 with a Taylor expansion.

1.3.1 Truncation error versus Rounding error

f (h) = f (0) + f 0 (0)h + O(h2 ),

and rearrange for f 0 (0):

Iterative Solution of Non-linear

One of the fundamental problems of mathematics is the solution of an equation in a single

1. Under what conditions the algorithm converges.

2. A bound on the error of the estimate xN .

2.1 Recursive Bisection

Algorithm 2.1 (Recursive Bisection)

Figure 2.1: Progress of the Recursive Bisection Method.

Recursive bisection is an excellent method: guaranteed to converge, and possessing a

2.2 Fixed-point iteration

be rewritten in an equivalent form

Figure 2.3: Progress of a fixed-point iteration for 3 different ϕ.

Example 2.5 (Continuation of example 2.2)

x1 = 1.3572, x2 = 1.3309, x3 = 1.3259, x4 = 1.3249.

When using ϕ(x) = x3 − 1, we obtain the scheme

which generates the values

x1 = 2.375, x2 = 12.396, x3 = 1904.003, x4 = 6.902 · 109 .

Theorem 2.6 (Mean-Value Theorem)

Figure 2.4: Graphical argument for the Mean-Value Theorem

i+1 = ϕ0 (ξi )i .

i < Ki−1 < K 2 i−2 < · · · < K i 0 .

Example 2.7 (Continuation of example 2.2)

Because of the flexibility in choice of ϕ, fixed-point iterations encompass an enormous

2.3 Newton’s method

Figure 2.5: Progression of Newton’s method

0 ≈ f (x0 ) + f 0 (x0 )(x̃ − x0 ),

rearranging for x̃ gives,

xi+1 = ϕ(xi ) = ϕ(x̃ + ei ) (2.7)

Example 2.8 (Continuation of example 2.2)

i+1 = ϕ0 (ξi )i .

i < Ki−1 < K 2 i−2 < · · · < K i 0 .