
MATH 2160

Numerical Analysis 1 Notes

S. H. Lui
Department of Mathematics
University of Manitoba

Warning! This is a draft (2023) and may contain errors!


Contents

Table of Contents i

1 Floating Point Numbers 1
1.1 Introduction 1
1.2 Binary Numbers 1
1.3 Normalized Floating Point Numbers 2
1.4 Chopping and Rounding 3
1.5 Cancellation Error 5
1.6 Numerical Stability 8

2 Nonlinear Equations 12
2.1 Bisection 12
2.2 Fixed Point Iteration (FPI) 13
2.3 Newton's Method 16
2.4 Secant Method 19
2.5 System of Nonlinear Equations 23

3 Linear Systems 25
3.1 Basic Gaussian Elimination 25
3.2 Gaussian Elimination with Partial Pivoting 27
3.3 Errors in Solving Linear Systems 33
3.4 Symmetric Positive Definite Systems 37
3.5 Iterative Solvers 39

4 Least Squares 45
4.1 Polynomial Models 46
4.2 Trigonometric and Exponential Models 47

5 Interpolation and Approximation 50
5.1 Polynomial Interpolation 50
5.2 Hermite Interpolation 61
5.3 Splines 63
5.4 Approximation 67
5.5 Rational Approximation 70

6 Numerical Differentiation and Integration 73
6.1 Numerical Differentiation 73
6.2 Richardson Extrapolation 77
6.3 Numerical Integration 77
6.4 Improper Integrals 85
6.5 Additional Theory for Trapezoidal Rule 88
6.6 Peano Kernel 92
6.7 Gaussian Quadrature Over Finite Intervals 94
6.8 Monte Carlo Methods 105
6.9 Multiple Integrals 107
Chapter 1

Floating Point Numbers

1.1 Introduction
Numerical analysis is the design and analysis of accurate and efficient algorithms to solve prob-
lems in science and engineering. Computers work with numbers which can be represented by
finitely many bits whereas some real numbers require infinitely many bits to represent exactly.
Thus there is an error involved in representing each real number and this error propagates in
subsequent arithmetic operations. The question is whether we can trust the result of a long
sequence of calculations.

1.2 Binary Numbers


We shall use the following notation for a binary number

$$(b_k b_{k-1}\cdots b_0\,.\,b_{-1}\cdots b_{-n})_2 = b_k 2^k + b_{k-1}2^{k-1} + \cdots + b_1 2 + b_0 + b_{-1}2^{-1} + \cdots + b_{-n}2^{-n}$$

where each bj = 0 or 1.

Example: $(10101.1011)_2 = 2^4 + 2^2 + 1 + 2^{-1} + 2^{-3} + 2^{-4} = 21\tfrac{11}{16}$.

Example: x = (.1011)₂ with the block 1011 repeating, that is, x = (.1011 1011 1011 · · · )₂. To obtain the value of x, note that multiplying x by 16 = 2⁴ is equivalent to shifting the radix point right by four places. Hence 16x = (1011.1011 · · · )₂. Subtract x to get 15x = (1011)₂ = 2³ + 2¹ + 1 = 11. Hence x = 11/15.

Example: x = (.10101)₂ with the block 101 repeating after the first two digits. Subtract 4x = (10.101 · · · )₂ from 32x = (10101.101 · · · )₂ to get 28x = (10011)₂ = 19, and so x = 19/28.

Example: Convert 53.7 to binary.


First convert the integer part 53 = 25 + 24 + 22 + 1 = (110101)2 . Next convert the fractional
part
.7 = b−1 2−1 + b−2 2−2 + · · · ,
where bᵢ = 0 or 1. We calculate the coefficients one at a time by repeatedly multiplying by 2. Now
1.4 = b−1 + [b−2 2−1 + b−3 2−2 + · · · ].


This implies that b−1 = 1 and .4 = [· · · ]. Multiplying the latter by two results in

.8 = b−2 + [b−3 2−1 + b−4 2−2 + · · · ].

Thus b−2 = 0 and .8 = [· · · ], which upon multiplication by two yields

1.6 = b−3 + [b−4 2−1 + b−5 2−2 + · · · ].

Hence b−3 = 1 and


1.2 = b−4 + [b−5 2−1 + b−6 2−2 + · · · ].
This leads to b−4 = 1 and

.4 = b−5 + [b−6 2−1 + b−7 2−2 + · · · ].

Thus b₋₅ = 0. The pattern then repeats itself. We conclude that 53.7 = (110101.1 0110 0110 · · · )₂, with the block 0110 repeating.
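The procedure above is easy to mechanize. The following short Python sketch (an illustration, not part of the original notes) converts a fraction in [0, 1) to its first few binary digits by repeated multiplication by 2; applied to .7 it reproduces the digits 1, 0, 1, 1, 0, 0, ... found above.

def frac_to_binary(frac, n_bits=10):
    """Return the first n_bits binary digits of a fraction in [0, 1)
    by repeatedly multiplying by 2 and peeling off the integer part."""
    bits = []
    for _ in range(n_bits):
        frac *= 2
        bit = int(frac)      # 1 if frac >= 1, else 0
        bits.append(bit)
        frac -= bit          # keep only the fractional part
    return bits

# The fractional part of 53.7:
print(frac_to_binary(0.7))  # [1, 0, 1, 1, 0, 0, 1, 1, 0, 0]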

1.3 Normalized Floating Point Numbers


One model for a normalized floating point number is

±(1.b−1 · · · b−n )2 × 2p

where bᵢ = 0 or 1 and p is an integer, called the exponent part of the number. One bit is used to
store the sign of this number while n bits are used for the fractional part which is also called
the mantissa. The 1 before the radix point renders this a normalized floating point number.
This digit is not allocated storage.
There are infinitely many floating point representations of a number but only one normalized
representation.

Example: (1.1)₂ × 2⁰ = (.11)₂ × 2¹, with the former normalized.

9¼ = (1001.01)₂ = (1.00101)₂ × 2³ is normalized.

1 = (.1)₂ × 2¹ = (1.0)₂ × 2⁰, with the latter normalized.

Suppose m bits are allocated for the exponent. Let M = 2m−1 − 1. Then instead of storing
p, to avoid allocating a bit to store the sign, we store the biased exponent q = p + M . Hence
with −M ≤ p ≤ M + 1, it follows that 0 ≤ q ≤ 2m − 1.
The range 0 < q < 2^m − 1 is reserved for normalized floating point numbers. Denormalized
floating point numbers are those having the form

(0.b−1 · · · b−n )2 × 2p

where bi = 0 or 1. These numbers are indicated by q = 0 or p = −M . The number zero is


stored with b−1 = · · · = b−n = 0 = q. Actually, there are two floating point zeros: ±0. All
denormalized floating point numbers have magnitudes smaller than 2−M .
At the other extreme are floating point numbers with q = 2m − 1 or p = M + 1. If
b−1 = · · · = b−n = 0, then this is used to store either ±∞. This occurs when a nonzero number
is divided by zero. The calculation should stop whenever this occurs. If at least one of the bi is
nonzero, then this is called NaN or not a number. This results when an invalid mathematical
operation such as ∞ − ∞ or square root of a negative number is executed.

An overflow is said to have occurred if the result of an operation is a number whose magnitude
is larger than the largest machine representable number. An underflow occurs when the result of an
operation is a nonzero number whose magnitude is smaller than the smallest non-zero machine
number. Typically, the result is set to zero in this instance. Underflows are usually harmless
while overflows are almost always fatal.

Example: Let m = 11 so that M = 1023. The exponent part of (1.10)2 × 2−1000 is stored as q =
−1000 + 1023 = 23 = (00000010111)2 .

The largest (finite) machine representable number is (1.1 · · · 1)2 × 2M = (2 − 2−n )2M ≈ 2M +1
(remembering that p = M + 1 is reserved for infinities and NaNs) while the smallest nonzero
one is (0.0 · · · 01)2 × 2−M = 2−n−M . The total number of machine representable numbers
(including infinities and NaN) is 2m+n+1 . The two most common formats are single precision
(m = 8, n = 23) and double precision (m = 11, n = 52).
It is interesting that the density of machine representable numbers is not uniform. For
instance, there are 2n machine representable numbers in [1, 2). In general, there are 2n−p floating
point numbers in [2p , 2p + 1) for 0 ≤ p ≤ M . Take the case p = 1 so the interval of concern is
[2, 3). The binary numbers in this interval are (10.b−1 b−2 · · · )2 which when normalized becomes
(1.0b−1 b−2 · · · )2 × 21 .
When p is much larger than one, then there are far fewer machine numbers in the interval [2^p, 2^p + 1). On the other hand, there are M·2^n ≈ 2^{m+n−1} machine numbers in [0, 1). This

nonuniform distribution of machine numbers is actually good since most numbers that we deal
with in everyday life are between (−10, 10), say, where the density of machine numbers is large.

Example: Suppose we wish to evaluate


$$I = \sin\!\left(e^{-x}x^{1000} + \frac{\pi}{2}\right), \qquad x \ge 0.$$
Note that x1000 may overflow, while e−x x1000 may be perfectly machine representable. Another
potential problem arises when the argument s of sin is large, in which case sin s may not be
evaluated accurately. Write

z = e−x x1000 = e−x e1000 ln x = e−x+1000 ln x .

The last expression on the right is a better way to evaluate z than the naive way. Next, define
z + π/2 = y + r, where y = 2nπ for some integer n and r ∈ [0, 2π). Then I = sin(r) is a
better method to calculate I than the straightforward way. Let a satisfy $e^{-a}a^{1000} = 10^{15}\pi$.
Note that a ≈ 1.0374. When x = 10000, 1000, a, the naive way reports NaN, NaN and 0.1504,
respectively, while the better method gives 1, NaN and 1. While clearly superior to the naive
way, some additional work is necessary before the second method can give a correct answer when
x = 1000.

1.4 Chopping and Rounding


To store a real number, that is, to obtain a machine representation of a real number, we must
truncate the fractional part of the binary expansion of the real number to n bits, the number
of bits in the mantissa. This is done by chopping (ignoring all digits after the nth bit) or by
rounding. In the latter, if the (n + 1)st bit of the expansion of the given number is one, then we

round up (take the first n bits and then add one to the last bit). Otherwise, we simply take
the first n bits as in chopping. For a real number x, its floating point representation is denoted
by x̃.
Example: Suppose n = 7, x = 53.7 = (1.101011011 · · · )2 × 25 . Then using chopping arithmetic, x̃ =
(1.1010110)2 × 25 while x̃ = (1.1010111)2 × 25 in rounding arithmetic. If n = 5, then using
rounding, x̃ = (1.10110)2 × 25 .
In the above definition of rounding, there is a bias in that we always round up if the (n + 1)st
digit is one. A better implementation is (still with the (n + 1)st digit as one) round up if, for
instance, the nth bit is one and round down otherwise. This way, we round up and down with
equal probability.
Define the absolute error of the truncation of a real number x to be |x − x̃|, while the relative error is defined as |x − x̃|/|x|. The magnitude of the absolute error depends on the magnitude of x. For instance, an absolute error of 10 may seem large but it may be quite acceptable if x and x̃ have values of magnitude 10²⁰. The relative error takes the scale of the problem into account but it is not defined if x = 0 and it may give a misleading result if |x| is small.
It is not difficult to see that the relative errors of chopping and rounding are at most 2⁻ⁿ and 2⁻ⁿ⁻¹, respectively. For instance, in chopping, x and x̃ can differ only in the digits after the nth bit. Hence
$$\frac{|x - \tilde{x}|}{|x|} \le \sum_{j=n+1}^{\infty} \frac{1}{2^j} = 2^{-n}.$$

The unit roundoff is defined as 2−n or 2−n−1 depending on whether chopping or rounding
is being used. In double precision, the unit roundoff is 2−53 which is approximately 10−16 .
Needless to say, rounding is preferred to chopping.
Another characterization of the unit roundoff ǫM is that it is the smallest machine number
so that 1 ⊕ ǫM > 1. Here ⊕ denotes machine addition which is different than (exact) addition
because of truncation. For all positive machine numbers x smaller than ǫM , we have the strange
looking 1 ⊕ x = 1. Roundoff errors occur in floating point arithmetic operations, even if the
operands can be represented exactly.
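The characterization of the unit roundoff can be observed directly in double precision. Python exposes the machine epsilon 2⁻⁵², the spacing between 1 and the next larger float, which is twice the unit roundoff 2⁻⁵³ defined above; the behaviour 1 ⊕ x = 1 for small x can then be checked in a few lines (an illustration, not part of the notes):

import sys

eps = sys.float_info.epsilon      # 2**-52, the gap between 1.0 and the next float
print(eps)                        # about 2.22e-16
print(1.0 + eps > 1.0)            # True: adding eps to 1 is detectable
print(1.0 + eps / 2 == 1.0)       # True: eps/2 = 2**-53 is absorbed, so 1 (+) x == 1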
Example: For simplicity use base 10 and assume n = 2. The exact sum 2.34 + 1.09 × 10⁻¹ + 7.65 is 1.0099 × 10¹. Let us calculate (2.34 ⊕ 1.09 × 10⁻¹) ⊕ 7.65. The exact first sum 2.34 + .109 is 2.449, which becomes 2.44 after chopping. Adding this to 7.65 gives 10.09 = 1.009 × 10¹, which becomes 1.00 × 10¹ after chopping. The absolute and relative errors are .099 and .099/10.099 ≈ 10⁻², respectively.
If rounding is used, then the rounded first sum is 2.45. Adding this to 7.65 followed by rounding results in 1.01 × 10¹. The absolute and relative errors are .001 and .001/10.099 ≈ 10⁻⁴, respectively.
Still with base 10 and n = 2, 1.23 × 2.01 = 2.4723. Chopping or rounding both lead to 1.23 ⊗
2.01 = 2.47 with an absolute error of .0023 and relative error of ≈ 10−3 .
Floating point arithmetic does not obey the usual rules of arithmetic such as the associative and distributive laws. When programming, one should avoid pitfalls such as
if x = .3 then...
Here roundoff errors may mean that the statement is not executed even if the exact value of x
is .3. A better construct is to test if |x − .3| is less than some tolerance.

1.5 Cancellation Error


Cancellation error is the result of loss of significant digits when subtracting two nearly equal
numbers. For instance, in base 10 and 6 digit mantissa, suppose we wish to find the difference
of 1.2345671 and 1.2345660. After rounding,

1.234567 ⊖ 1.234566 = .000001 = 1.000000 × 10−6

a result which has only one correct digit. This phenomenon can lead to an answer which is
totally different from the exact one, especially over the course of a sequence of operations.

Example: In base 10 and 2-digit mantissa,
$$\sqrt{9.01} \ominus 3.00 = 3.00 \ominus 3.00 = 0,$$
which is very different from the exact answer ≈ 1.666 × 10⁻³.

Observe that
$$\sqrt{9.01} - 3 = (\sqrt{9.01} - 3)\,\frac{\sqrt{9.01} + 3}{\sqrt{9.01} + 3} = \frac{.01}{\sqrt{9.01} + 3}.$$
Calculation of this quantity using the last expression has no cancellation error and leads to 1.67 × 10⁻³, which is as good as one can expect in two-digit arithmetic.

Example: Solve x² + 10⁹x − 3 = 0. By the quadratic formula, the two roots are
$$x_- = \frac{-10^9 - \sqrt{10^{18}+12}}{2}, \qquad x_+ = \frac{-10^9 + \sqrt{10^{18}+12}}{2}.$$
There is no cancellation in x₋, which can be computed to full precision. On the other hand, x₊ is a difference of nearly equal numbers. In double precision, the result is zero. As above, a better calculation is
$$x_+ = \frac{-10^9 + \sqrt{10^{18}+12}}{2}\cdot\frac{10^9 + \sqrt{10^{18}+12}}{10^9 + \sqrt{10^{18}+12}} = \frac{6}{10^9 + \sqrt{10^{18}+12}},$$
which results in a computed answer accurate to full precision.
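The effect is easy to reproduce. Below is a small Python comparison of the naive and the rearranged formulas for x₊ (an illustration, not from the notes; the coefficient names are mine):

import math

b = 1e9            # coefficient of x in x^2 + b x - 3 = 0
c = -3.0
disc = math.sqrt(b * b - 4 * c)     # sqrt(10^18 + 12)

x_plus_naive = (-b + disc) / 2          # nearly equal numbers are subtracted
x_plus_stable = (2 * (-c)) / (b + disc) # rationalized form: 6 / (10^9 + sqrt(...))

print(x_plus_naive)    # 0.0 in double precision
print(x_plus_stable)   # about 3e-9, accurate to full precision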
Example: Let f(x) = (1 − cos x)/sin²x. If x = 10⁻⁸, then the computed answer in double precision is zero because of severe cancellation error. Using sin²x = 1 − cos²x, we obtain f(x) = (1 + cos x)⁻¹, and this leads to an answer with a relative error close to the machine epsilon for x = 10⁻⁸.

Example: The direct computation of $\sum_{i=M}^{N} \frac{(-1)^i}{i}$ for large values of M, N leads to disastrous cancellation. The following rearrangement
$$\sum_{i=M,\ i\ \mathrm{even}}^{N} \frac{1}{i} \;-\; \sum_{i=M,\ i\ \mathrm{odd}}^{N} \frac{1}{i}$$
gives a more accurate answer.


More generally, suppose we wish to evaluate the sum
$$s = \sum_{i=1}^{n} x_i.$$

Figure 1.1: Evaluation of log(1 + x)/x at x = −10⁻¹⁵ + j·10⁻¹⁶, j = 0, . . . , 20, by the naive way ('x') and the better way ('o').

The following method, known as compensated summation, tries to capture the rounding error
of each sum and adds that to the next sum.
s = 0; e = 0;
for i = 1:n
    t = s; y = x_i + e;
    s = t + y;
    e = (t − s) + y;
end
We shall apply this method in a numerical solution of an ODE with dramatic improvement.
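As a concrete illustration (not from the notes), here is the same compensated (Kahan) summation written in Python and compared against a plain left-to-right loop; the test data is an arbitrary choice:

def compensated_sum(xs):
    """Kahan compensated summation: carry the rounding error of each
    partial sum forward into the next addition."""
    s, e = 0.0, 0.0
    for x in xs:
        t = s
        y = x + e
        s = t + y
        e = (t - s) + y   # recovers the part of y lost when forming s
    return s

def naive_sum(xs):
    s = 0.0
    for x in xs:
        s += x
    return s

xs = [0.1] * 10**6
print(naive_sum(xs) - 1e5)        # naive loop: visible accumulated roundoff
print(compensated_sum(xs) - 1e5)  # compensated sum: error typically far smaller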

Example: This example illustrates the effects of cancellation errors graphically. Consider evaluation of f(x) = log(1 + x)/x in a neighbourhood of x = 0. By L'Hopital's rule, f(0) = 1. The naive way to evaluate f suffers from serious cancellation error near x = 0. A better way is the following:
$$f(x) \approx \begin{cases} \dfrac{\log(1 \oplus x)}{(1 \oplus x) \ominus 1}, & 1 \oplus x \ne 1; \\[1ex] 1, & 1 \oplus x = 1. \end{cases}$$

See Figure 1.1.

Let us examine more carefully errors in floating point addition (subtraction) and multiplication
(division). Let x1 , x2 be non–zero real numbers and x̃1 , x̃2 be the corresponding floating point
representations. Recall that
x̃i = (1 + ǫi )xi , |ǫi | ≤ ǫM
where ǫM is the unit roundoff. Because of roundoff,

x̃1 ⊕ x̃2 = (x̃1 + x̃2 )(1 + ǫ3 ), |ǫ3 | ≤ ǫM .



Therefore the relative error in floating point addition of two numbers is
$$E = \frac{\tilde{x}_1 \oplus \tilde{x}_2 - (x_1 + x_2)}{x_1 + x_2} \approx (\epsilon_1 + \epsilon_3)\frac{x_1}{x_1 + x_2} + (\epsilon_2 + \epsilon_3)\frac{x_2}{x_1 + x_2}.$$

In the above, we have ignored all terms which contain a product of two epsilons which are
necessarily small. Hence E, the relative error of addition has been expressed as the relative error
of x1 , which is (ǫ1 + ǫ3 ) times the amplification factor x1 (x1 + x2 )−1 plus a similar term for x2 .
Now if the amplification factor is large, then E is large. This happens when we are subtracting
two nearly equal numbers, that is, x2 ≈ −x1 and so the amplification factor |x1 (x1 + x2 )−1 | ≫ 1.
This is of course cancellation error which we have discussed above. If x1 and x2 have the same
sign, then the magnitude of the amplification factor is at most one and so the relative error of
floating point addition will be near ǫM .
Now we carry out the estimation of the relative error of floating point multiplication. Using
the same notation as above,

x̃1 ⊗ x̃2 = x̃1 x̃2 (1 + ǫ3 ), |ǫ3 | ≤ ǫM .

Hence, ignoring all terms containing products of two or more epsilons,
$$\left|\frac{\tilde{x}_1 \otimes \tilde{x}_2 - x_1 x_2}{x_1 x_2}\right| \approx |\epsilon_1 + \epsilon_2 + \epsilon_3| \le 3\epsilon_M$$

and so multiplication can always be computed accurately.


Now consider the addition of three positive machine numbers x1 , x2 , x3 . Since x1 ⊕ x2 =
(x1 + x2 )(1 + ǫ1 ) for some |ǫ1 | ≤ ǫM , we have

(x1 ⊕ x2 ) ⊕ x3 = ((x1 + x2 )(1 + ǫ1 ) + x3 )(1 + ǫ2 )

for some |ǫ2 | ≤ ǫM . Thus the absolute error of the sum is

|(x1 ⊕ x2 ) ⊕ x3 − (x1 + x2 + x3 )| ≈ |(x1 + x2 )(ǫ1 + ǫ2 ) + x3 ǫ2 | ≤ (x1 + x2 )2ǫM + x3 ǫM .

By induction, the absolute error of the sum of n machine numbers x1 , · · · , xn is

|(· · · ((x1 ⊕ x2 )⊕ x3 )⊕ · · · )⊕ xn − (x1 + · · ·+ xn )| ≤ (x1 + x2 )(n − 1)ǫM + x3 (n − 2)ǫM + · · ·+ xn ǫM .

This analysis suggests that when adding many positive numbers, it is best to add starting with
the smallest numbers first.
We have been discussing roundoff errors. A second source of error is called truncation
error. This is the error that arises when the exact answer is the result of applying an infinite number of steps and we approximate it by carrying out only finitely many steps.

Example: Consider computing $\sum_{n=1}^{\infty} \frac{1}{n^2}$ by calculating $\sum_{n=1}^{N} \frac{1}{n^2}$. The truncation error is
$$\sum_{n=N+1}^{\infty} \frac{1}{n^2} \le \int_N^{\infty} \frac{dx}{x^2} = \frac{1}{N}.$$

Example: Consider estimating $\int_0^{.1} e^x\,dx$ by
$$\int_0^{.1}\left(1 + x + \frac{x^2}{2} + \cdots + \frac{x^N}{N!}\right)dx.$$
(Assume we only know how to integrate polynomials and not exponentials analytically.) Then the truncation error is
$$\int_0^{.1} \sum_{j=N+1}^{\infty} \frac{x^j}{j!}\,dx = \sum_{j=N+1}^{\infty} \frac{.1^{j+1}}{(j+1)!} \le \frac{10^{-N-2}}{.9\,(N+2)!}.$$

Another source of errors which requires no further explanation is errors in computer program-
ming. All these errors can render the result of a numerical method to be completely wrong.
Great care must be exercised when using a computer to solve a problem.
We list some actual disasters caused by these errors, with losses of many lives and billions of dollars. During the Gulf War in 1991, a Patriot missile failed to intercept a Scud missile, resulting in the loss of 28 lives. The source of the problem was an accumulation of roundoff errors. In 1996, the Ariane 5 rocket went off course and exploded. The problem was traced to the failure of the conversion of a 64-bit floating point number to a 16-bit integer. (The floating point number was larger than 32,767, the largest signed 16-bit integer, and this caused an overflow error.) In 1999, NASA lost contact with the Mars Climate Orbiter. The reason was that the program used a mixture of imperial and metric units. As we can see, the answer spat out by a computer need not be correct! Great care must be exercised in writing codes and interpreting results.

1.6 Numerical Stability


Given a problem f with input x, f is said to be well-conditioned if small changes in the input lead to small changes in the solution. Otherwise, it is said to be ill-conditioned. These
definitions are relevant because the input is almost always polluted by roundoff errors as well as
errors in modeling, measurement, etc.
Consider the simplest example where f is a real-valued differentiable function of a real
variable. Let y = f (x) and ỹ = f (x̃) where x̃ = x + dx. We have approximately

ỹ ≈ y + f ′ (x)dx

and so
$$\frac{\tilde{y} - y}{y} \approx \frac{f'(x)\,dx}{f(x)} = C\,\frac{dx}{x},$$
where $|C| = \left|\dfrac{x f'(x)}{f(x)}\right|$ is called the condition number. A large condition number indicates that the problem is ill-conditioned.

Example: Let y = √x. Then the condition number is 1/2 and so the problem is well-conditioned.

Suppose f : Rn → R is differentiable. Then


$$\tilde{y} \approx y + f'(x)\,dx = y + \sum_{i=1}^{n} \frac{\partial f(x)}{\partial x_i}\,dx_i.$$

Assuming xᵢ ≠ 0 for all i,
$$\frac{\tilde{y} - y}{y} \approx \sum_{i=1}^{n} C_i\,\frac{dx_i}{x_i}, \qquad C_i = \frac{\partial f(x)}{\partial x_i}\,\frac{x_i}{f(x)}.$$
Hence
$$\left|\frac{\tilde{y} - y}{y}\right| \lesssim |C|\left(\sum_{i=1}^{n}\frac{dx_i^2}{x_i^2}\right)^{1/2}, \qquad |C| = \left(\sum_{i=1}^{n} C_i^2\right)^{1/2},$$
with |C| as the condition number.


We can carry out an error analysis of addition and multiplication using condition numbers.

Example: Let y = x₁x₂. Then
$$\frac{\tilde{y} - y}{y} = \frac{x_2\,dx_1 + x_1\,dx_2}{x_1 x_2} = \frac{dx_1}{x_1} + \frac{dx_2}{x_2}$$
and so multiplication is a well-conditioned problem.

Example: Let y = x₁ + x₂. Then
$$\frac{\tilde{y} - y}{y} = \frac{x_1}{x_1 + x_2}\,\frac{dx_1}{x_1} + \frac{x_2}{x_1 + x_2}\,\frac{dx_2}{x_2}$$
and so addition is a well-conditioned problem only if x1 + x2 is not small relative to both x1 and
x2 . For instance, if x1 and x2 have nearly equal magnitudes and are of different signs, then the
problem is ill-conditioned. We have already discussed this phenomenon of cancellation error.

Example: The roots of the polynomial

p(x) = (x − 1)(x − 2) · · · (x − 20) = x20 − 210x19 + · · ·

are obviously 1, 2, · · · , 20. Suppose the coefficient 210 is changed to 210 + ǫ where ǫ = 10−7 .
Naively, it is expected that the roots will perturb by a small amount. However this is not the
case. For instance, the perturbed polynomial has two roots at approximately 16.7 ± 2.8i which
are quite far from any integer.
Call the perturbed polynomial p(x, ǫ). We determine how sensitive the root x = x(ǫ) is to the perturbation ǫ. By the chain rule,
$$\frac{\partial p}{\partial x}\frac{\partial x}{\partial \epsilon} + \frac{\partial p}{\partial \epsilon} = 0$$
and so
$$\frac{\partial x}{\partial \epsilon} = -\frac{\partial p/\partial \epsilon}{\partial p/\partial x} = \frac{x^{19}}{\sum_{j=1}^{20}\prod_{i=1,\, i\ne j}^{20}(x - i)}.$$
The sensitivity of the root x = j is
$$\left.\frac{\partial x}{\partial \epsilon}\right|_{x=j} = \frac{j^{19}}{\prod_{i=1,\, i\ne j}^{20}(j - i)}.$$
At j = 16, the above evaluates to O(10⁹).



Example: Consider the problem of finding the intersection of two straight lines. This can be posed as a
system of two equations in two unknowns: Ax = b. In case the lines are nearly parallel, then
the rows of A are nearly linearly dependent. Thus a small perturbation of A or b can lead to
large changes in the solution. This is an ill-conditioned problem. As a specific example, let ǫ be
a small positive real and
$$A = \begin{pmatrix} 1 & 1 \\ 1 & 1+\epsilon \end{pmatrix}, \qquad b = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.$$
The solution of the system Ax = b is x = ǫ−1 [−1, 1]T . Consequently, a small change in the value
of ǫ leads to a huge change in the solution.
More generally, consider any non-singular A ∈ ℝⁿˣⁿ and b, b̃ ∈ ℝⁿ with b non-zero. Define x = A⁻¹b and x̃ = A⁻¹b̃. Note that |b| = |Ax| ≤ ‖A‖ |x|. Using this inequality,
$$\frac{|x - \tilde{x}|}{|x|} = \frac{|A^{-1}(b - \tilde{b})|}{|x|} \le \|A^{-1}\|\,\frac{|b|}{|x|}\,\frac{|b - \tilde{b}|}{|b|} \le \kappa\,\frac{|b - \tilde{b}|}{|b|},$$
where κ = ‖A‖‖A⁻¹‖ is the condition number.

Example: For any f ∈ C[a, b], define
$$I(f) = \int_a^b f(x)\,dx.$$
Suppose f, f̃ ∈ C[a, b]. Assuming that I(f) ≠ 0, then
$$\frac{|I(f) - I(\tilde f)|}{|I(f)|} = \frac{\left|\int_a^b (f(x) - \tilde f(x))\,dx\right|}{|I(f)|} \le \frac{\|f\|_1}{|I(f)|}\,\frac{\|f - \tilde f\|_1}{\|f\|_1},$$
where, for any continuous g,
$$\|g\|_1 = \int_a^b |g(x)|\,dx,$$
and
$$\kappa = \frac{\|f\|_1}{|I(f)|}$$
is the condition number, which can be arbitrarily large for some f . However, if f is restricted
to the class of non-negative continuous functions, then the condition number becomes one and
so integration becomes a well conditioned operation.

Example: For any differentiable function on [a, b], define D(f) = f' and ‖f‖∞ = max_{x∈[a,b]} |f(x)|. Suppose f, f̃ are distinct differentiable functions on [a, b], with f non-constant. Then
$$\frac{\|D(f) - D(\tilde f)\|_\infty}{\|D(f)\|_\infty} = \frac{\|f' - \tilde f'\|_\infty}{\|f'\|_\infty} = \kappa\,\frac{\|f - \tilde f\|_\infty}{\|f\|_\infty},$$
where
$$\kappa = \frac{\|f' - \tilde f'\|_\infty}{\|f - \tilde f\|_\infty}\,\frac{\|f\|_\infty}{\|f'\|_\infty}$$
is the condition number, which can be arbitrarily large. For instance, take f(x) = sin x and f̃(x) = cos kx for k large. Then κ = O(k).

A numerical method f˜ to calculate problem f is stable if whenever f is well-conditioned at


input x, then f˜(x + dx) is close to f (x) for all dx small.

Example: Let f(x) = 1 − √(1 − x). The direct way to calculate f is unstable if x is small in magnitude due to cancellation errors. A stable method is
$$\tilde{f}(x) = \frac{x}{1 + \sqrt{1 - x}}.$$


Of course, f (x) = f˜(x) in exact arithmetic. However, in the presence of roundoff errors, they
behave differently.
Chapter 2

Nonlinear Equations

Given f : R → R. The goal is to find a root x so that f (x) = 0. This is a difficult problem
because usually we do not know how many solutions there are. There might be none, one,
two, or infinitely many solutions. A solution is found by an iterative method which in general
converges to the root after infinitely many steps (assuming that it converges). Bisection is a
slowly converging method but it is guaranteed to find a root. Newton’s method is a rapidly
converging method when it converges but if the initial guess is not good enough, the sequence
may diverge.

2.1 Bisection
Recall the intermediate value theorem, which states that if a continuous function f defined on [a, b] satisfies f(a)f(b) < 0, then there is some x∗ ∈ (a, b) so that f(x∗) = 0. The theorem says that there exists at least one root in (a, b) provided that f(a)f(b) < 0. The method of bisection
finds one such root. At each iteration, a root is bounded between an interval whose length
decreases by one half with each iteration while maintaining the property that the function has
different signs at the end points of the interval.

BISECTION: Given [a, b] with f (a)f (b) < 0 and tolerance ǫ > 0.
WHILE (b − a)/2 > ǫ
c = (a + b)/2
If f (c) = 0 then STOP
If f (a)f (c) < 0 THEN b ← c ELSE a ← c
END
RETURN c = (a + b)/2
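A direct Python transcription of the algorithm above (a sketch; the test function and tolerance below are illustrative choices):

def bisection(f, a, b, tol):
    """Bisection for f continuous on [a, b] with f(a) f(b) < 0."""
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    while (b - a) / 2 > tol:
        c = (a + b) / 2
        if f(c) == 0:
            return c
        if f(a) * f(c) < 0:
            b = c
        else:
            a = c
    return (a + b) / 2

# Root of x^3 + x - 1 in (0, 1), the function used in the example below:
print(bisection(lambda x: x**3 + x - 1, 0.0, 1.0, 1e-6))  # about 0.6823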

Theorem: Suppose f is a continuous function on [a, b] with f(a)f(b) < 0. Let c_n denote the estimate of the root given by the bisection algorithm after n passes through the loop. Then lim_{n→∞} c_n = x∗, where x∗ ∈ (a, b) is a root of f. Also |c_{n+1} − x∗| ≤ 2⁻¹(b_n − a_n) = 2⁻ⁿ⁻¹(b − a).

Proof: Denote the successive intervals generated by the algorithm by [an , bn ] with a0 = a, b0 = b. For
each positive n,
a0 ≤ an ≤ an+1 ≤ bn+1 ≤ bn ≤ b0


and bn − an = 2−n (b − a). Thus for some x∗ ∈ (a, b),

lim an = x∗ = lim bn .
n→∞ n→∞

If the algorithm terminates in finitely many steps, then we have f (c) = 0 for some c ∈ (a, b).
Otherwise, since f (an )f (bn ) < 0, take the limit to get f (x∗ )2 ≤ 0 which implies that f (x∗ ) = 0.
Since cn+1 = 2−1 (an + bn ),
$$|c_{n+1} - x_*| \le \frac{b_n - a_n}{2} = \frac{b-a}{2^{n+1}}.$$


Example: Let f (x) = x3 + x − 1. Since f (0) = −1, f (1) = 1, there is a root in (0, 1) by the intermediate
value theorem. Take a0 = 0, b0 = 1. Then c1 = 1/2 with f (c1 ) < 0. Thus a1 = 1/2, b1 = 1 and
so c2 = 3/4 with f (c2 ) > 0. Next a2 = 1/2, b2 = 3/4 and so c3 = 5/8 with f (c3 ) < 0. Finally,
a3 = 5/8, b3 = 3/4 and so c4 = 11/16. If ǫ = 1/10, then we can stop here since the difference
between c4 and a root is now smaller than 1/16.
In this method, it is easy to determine how many iterations are needed to satisfy a given tolerance. For instance, suppose we wish an absolute error of no bigger than 10⁻⁴. Then we need n so that
$$10^{-4} \approx 2^{-n-1} \quad\text{or}\quad n \approx \frac{4}{\log_{10} 2} - 1 \approx 13.$$

The method of bisection converges slowly but it is guaranteed to converge. We now examine other methods which may not converge at all but which, when they do converge, may converge more quickly.

2.2 Fixed Point Iteration (FPI)


Given a function g. A point x∗ ∈ R is called a fixed point of g if g(x∗ ) = x∗ . Given x0 , the
FPI of g is given by
xn+1 = g(xn ), n ≥ 0.
Suppose g is continuous and the fixed point iterates converge to some number x∗ . Then

$$x_* = \lim_{n\to\infty} x_{n+1} = \lim_{n\to\infty} g(x_n) = g(x_*)$$

since g is continuous. In other words, the limit x∗ is a fixed point of g.


Given any nonlinear equation f (x) = 0, it can always be converted into a fixed point problem.
For instance, define g(x) = f (x) + x. Then f (x) = 0 iff g(x) = x. There are infinitely many
ways to define the fixed point function. The game is to find a function g whose fixed point is
the same as the root(s) of f and whose FPI converges quickly.

Example: Let f(x) = x³ + x − 1 again. Consider three different FPIs, all of whose fixed points coincide with the roots of f:

1. x = g₁(x) = 1 − x³
2. x = g₂(x) = (1 − x)^{1/3}
3. x = g₃(x) = (1 + 2x³)/(1 + 3x²)

Observe that the latter can be obtained from f(x) = 0 as follows. From x³ + x = 1, add 2x³ to both sides to obtain x(3x² + 1) = 1 + 2x³. Now divide by 3x² + 1 on both sides.
Suppose we start with x₀ = .5. Then for the first iteration function g₁, we obtain x₁ = 1 − .5³ = .875, · · · , x₉ ≈ 1, x₁₀ ≈ 0, x₁₁ ≈ 1, x₁₂ ≈ 0, · · · . Thus the sequence oscillates between 0 and 1 and does not converge. This is the major drawback of FPI: its iteration may not converge.
For the second iteration function g₂, we have x₁ = (1 − .5)^{1/3} ≈ .7937, · · · , x₂₅ ≈ .6823, · · · . This iteration converges, but rather slowly.
For the final function g₃, the iteration converges very rapidly: x₁ = (1 + 2(.5)³)/(1 + 3(.5)²) ≈ .7142, x₂ ≈ .6831, x₃ ≈ .6823. Hence in three iterations, this sequence has reached the point which took the previous function 25 iterations.
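The behaviour of the three iteration functions is easy to reproduce with a short Python loop (a sketch, not part of the notes):

def fpi(g, x0, n_iter):
    """Run the fixed point iteration x_{k+1} = g(x_k) for n_iter steps."""
    x = x0
    for _ in range(n_iter):
        x = g(x)
    return x

g1 = lambda x: 1 - x**3
g2 = lambda x: (1 - x)**(1/3)
g3 = lambda x: (1 + 2*x**3) / (1 + 3*x**2)

print(fpi(g1, 0.5, 12))   # oscillates between values near 0 and 1: no convergence
print(fpi(g2, 0.5, 25))   # about 0.6823, but only after many iterations
print(fpi(g3, 0.5, 3))    # about 0.6823 after just three iterations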

We shall try to understand this discrepancy among the three iteration functions.
The following is a well-known result in analysis known as the Contraction Mapping Principle:
Theorem: Suppose g : R → R satisfies |g(x) − g(y)| ≤ c |x − y| for all real x, y and some constant c ∈ [0, 1).
Then g has a unique fixed point.
The function g above is called a contraction mapping. The proof of this result bears a strong
resemblance to a slightly different theory which now follows.

Definition: Let e_n = x_n − x∗ where x∗ is a fixed point and x_n is the nth iterate of a FPI. Suppose for some positive S < 1,
$$\lim_{n\to\infty}\left|\frac{e_{n+1}}{e_n}\right| = S;$$
then the FPI is said to converge linearly to x∗ at rate S.

Theorem: Suppose g is a continuously differentiable function with fixed point x∗ . Define S = |g ′ (x∗ )|. If
S < 1, then for any x0 sufficiently close to x∗ , the FPI converges linearly to x∗ at rate S. If
S > 1 and x₀ ≠ x∗, then the FPI diverges.

Proof: Subtract the equations xn+1 = g(xn ) and x∗ = g(x∗ ) to obtain

en+1 = g(xn ) − g(x∗ ) = g ′ (cn )en (2.1)

where cn is some number between x∗ and xn . The mean value theorem guarantees the existence
of cn . If S < 1, then there is some open interval B containing x∗ so that
S+1
|g′ (x)| ≤ < 1, ∀x ∈ B.
2
For any x0 ∈ B (this is the meaning of sufficiently close to x∗ in the statement of the theorem),
S+1
|g′ (c0 )| ≤ (S + 1)/2 and so by (2.1), |e1 | ≤ |e0 |. By induction, for every n ≥ 0,
2
S+1
|en+1 | ≤ |en |
2
and so lim en = 0. By (2.1),
n→∞

en+1
lim = lim |g′ (cn )| = |g ′ (x∗ )| = S.
n→∞ en n→∞

If S > 1, then there is some open interval B̃ containing x∗ so that
$$|g'(x)| \ge \frac{S+1}{2} > 1, \qquad \forall x \in \tilde{B}.$$
If x₀ ∈ B̃ \ {x∗}, then for every n ≤ N (for some N),
$$|e_{n+1}| \ge \frac{S+1}{2}\,|e_n|$$
and x_{N+1} ∉ B̃. This means that the sequence does not converge. □

This theorem says that the rate of convergence of a FPI depends on the value of g′ (x∗ ) where
x∗ is a fixed point. The smaller the value of |g ′ (x∗ )| is, the faster the FPI converges.
The case |g'(x∗)| = 1 is indeterminate. Consider g±(x) = x ± x³, which has fixed point x∗ = 0. Note that g'±(x) = 1 ± 3x² and so g'±(0) = 1. The fixed point iteration for g₊ diverges while the one for g₋ converges.

Example: Consider f(x) = x³ + x − 1 again with the three different FPIs and fixed point x∗ ≈ .6823.

1. x = g₁(x) = 1 − x³. Here g₁'(x) = −3x² and so |g₁'(.6823)| ≈ 1.3966 > 1; thus the iteration diverges.
2. x = g₂(x) = (1 − x)^{1/3}. Here g₂'(x) = −3⁻¹(1 − x)^{−2/3} and so |g₂'(.6823)| ≈ .716, which implies that the iteration converges.
3. x = g₃(x) = (1 + 2x³)/(1 + 3x²). Here g₃'(x) = 6x(x³ + x − 1)/(1 + 3x²)² and so g₃'(x∗) = 0. Thus the FPI converges linearly at rate 0, the best possible rate!

Example: On your calculator, type in any number x0 and then repeatedly press the cos key. This cor-
responds to a FPI with g(x) = cos x with fixed point x∗ ≈ .7390. Since g ′ (x) = − sin x, the
iteration converges linearly at rate |g′ (.7390)| ≈ .67.
Because |g'(x)| < 1 unless x = π/2 + kπ for some integer k, and the fixed point x∗ ≈ .7390 is not one of these special values, this FPI converges.

Unlike the method of bisection, we cannot predict in advance how many iterations it takes a
FPI to satisfy a given tolerance (assuming that the iteration converges). Typically we stop the
iteration if
$$|x_{n+1} - x_n| \qquad\text{or}\qquad \frac{|x_{n+1} - x_n|}{|x_n|}$$
is sufficiently small. This can be supplemented with |f(x_{n+1})| sufficiently small.
We emphasize that even if |g ′ (x∗ )| < 1 at a fixed point x∗ , the iteration may not converge if
the initial iterate is not sufficiently close to x∗ . FPI is said to converge locally. The ideal method
is one which converges globally, that is, converge for arbitrary x0 . For instance, if g(x) = x/2,
then g ′ (x) = 1/2 and so it converges globally to the fixed point zero.
Example: Let g(x) = x/2 − x³, which has a unique fixed point x∗ = 0. Note g'(0) = 1/2. The fixed point iteration x_{n+1} = g(x_n) only converges locally. For instance, if x₀ = 10, then x₁ = −995, x₂ ≈ 10⁹. It is clearly divergent.

Figure 2.1: Geometry of Newton's method.

2.3 Newton’s Method


FPI in general converges linearly. We saw one example where convergence is extremely quickly
at rate zero. Newton’s method is a special instance of a FPI where it converges linearly at rate
zero. In fact, it converges quadratically, meaning that if the iteration converges, then the number
of correct digits doubles with each iteration. For the nonlinear equation f (x) = 0, Newton’s
method is defined by
f (xn )
xn+1 = g(xn ) ≡ xn − ′ , n ≥ 0.
f (xn )
It achieves its fast convergence by using derivative information f ′ .
The method is easily derived. Suppose the current iterate is xn . We model the function at
xn by a straight line y = mx + b. Naturally, take m = f ′ (xn ) and requiring that this line pass
through the point (xn , f (xn )), we obtain b = f (xn ) − f ′ (xn )xn . The next iterate xn+1 is defined
as the zero of this line: 0 = mxn+1 + b. Solving for xn+1 results in the Newton iterate defined
above. See Figure 2.1.

Example: Let f (x) = x3 + x − 1. We have already seen the function several times. With f ′ (x) = 3x2 + 1,
the Newton iteration is
$$x_{n+1} = x_n - \frac{x_n^3 + x_n - 1}{3x_n^2 + 1} = \frac{2x_n^3 + 1}{3x_n^2 + 1}.$$
One root is x∗ ≈ .6823. Take x0 = .5. Define en = xn − x∗ . Then |e0 | ≈ 1.8 × 10−1 , |e1 | ≈
3.2 × 10−2 , |e2 | ≈ 8.5 × 10−4 , |e3 | ≈ 6.2 × 10−7 , |e4 | ≈ 3.3 × 10−13 . Thus |en+1 | ≈ c e2n for some
constant c.

This example shows that when Newton’s method converges, it converges very quickly indeed.

Example: Find the square root of 2 using Newton’s method.


The required quantity is the root of f (x) = x2 − 2. With f ′ (x) = 2x, Newton’s iteration is

$$x_{n+1} = x_n - \frac{x_n^2 - 2}{2x_n} = \frac{x_n}{2} + \frac{1}{x_n}.$$
If x₀ > 0, the iterates will converge to √2 while if x₀ < 0, they will converge to −√2. If x₀ = 0, then x₁ is not well defined since f'(x₀) = 0.
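The iteration can be carried out in a few lines of Python (an illustrative sketch):

def newton(f, fprime, x0, n_iter):
    """Newton's method: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(n_iter):
        x = x - f(x) / fprime(x)
    return x

# Square root of 2 as the positive root of f(x) = x^2 - 2:
print(newton(lambda x: x**2 - 2, lambda x: 2*x, 1.0, 6))  # 1.41421356..., full accuracy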

Definition: Suppose x_n → x∗. Let e_n = x_n − x∗. If lim_{n→∞} |e_{n+1}|/e_n² < ∞, then {x_n} is said to converge quadratically.

The proof of the quadratic convergence of Newton’s method requires

Theorem: Taylor’s Theorem. Let x and x0 be real numbers and f be k + 1 times continuously differentiable
on the interval between x and x0 . Then there exists some c in between x and x0 so that

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k + \frac{f^{(k+1)}(c)}{(k+1)!}(x - x_0)^{k+1}.$$
2! k! (k + 1)!

Theorem: Suppose f is twice continuously differentiable with f(x∗) = 0 and f'(x∗) ≠ 0. Then Newton's
method is locally quadratically convergent to x∗ .

Proof: The fixed point function in Newton's method is
$$g(x) = x - \frac{f(x)}{f'(x)} \quad\text{with}\quad g'(x) = \frac{f(x)f''(x)}{f'(x)^2}.$$
Hence g'(x∗) = 0 and so Newton's iteration is locally linearly convergent at rate 0. By Taylor's theorem, there is some c_n between x∗ and x_n so that
$$0 = f(x_*) = f(x_n) + f'(x_n)(x_* - x_n) + \frac{f''(c_n)}{2}(x_* - x_n)^2$$
and so
$$-\frac{f(x_n)}{f'(x_n)} = x_* - x_n + \frac{f''(c_n)}{2f'(x_n)}(x_* - x_n)^2.$$
Now
$$e_{n+1} = x_{n+1} - x_* = x_n - \frac{f(x_n)}{f'(x_n)} - x_* = e_n - e_n + \frac{f''(c_n)}{2f'(x_n)}\,e_n^2.$$
Since x_n → x∗, we must have c_n → x∗. Thus
$$\lim_{n\to\infty}\frac{e_{n+1}}{e_n^2} = \frac{f''(x_*)}{2f'(x_*)},$$
meaning quadratic convergence. □

In this theorem, if f''(x∗) = 0 as well, then the method converges locally cubically: lim_{n→∞} e_{n+1}/e_n³ < ∞. If, however, f'(x∗) = 0, then quadratic convergence is lost.

Example: Let f(x) = x². Note that f(0) = 0 = f'(0) and f''(0) ≠ 0. Newton's iteration here is
$$x_{n+1} = x_n - \frac{x_n^2}{2x_n} = \frac{1}{2}x_n = g(x_n).$$
The convergence rate is linear at rate 1/2 = g'(0) and not quadratic.

This example suggests



Theorem: Suppose f is three times continuously differentiable. If f(x∗) = 0 = f'(x∗) and f''(x∗) ≠ 0, then Newton's method is locally linearly convergent to x∗ at rate 1/2. The modified Newton's method
$$x_{n+1} = x_n - \frac{2f(x_n)}{f'(x_n)}$$
converges locally quadratically to x∗.

Proof: Let f(x) = (x − x∗)²h(x) for some twice continuously differentiable function h with h(x∗) ≠ 0. The fixed point function corresponding to Newton's method is
$$g(x) = x - \frac{f(x)}{f'(x)} = x - \frac{(x - x_*)h(x)}{2h(x) + (x - x_*)h'(x)}.$$
By a direct calculation, g'(x∗) = 1/2 and so Newton's method converges locally linearly at rate 1/2.
Now for the modified method,
$$x_{n+1} = g_2(x_n) = x_n - \frac{2(x_n - x_*)h(x_n)}{2h(x_n) + (x_n - x_*)h'(x_n)}.$$
It can be checked that g₂'(x∗) = 0 and so this iteration locally converges linearly at rate 0. From (2.1),
$$e_{n+1} = g_2(x_n) - g_2(x_*) = g_2(x_*) + g_2'(x_*)e_n + \frac{g_2''(c_n)}{2}e_n^2 - g_2(x_*) = \frac{g_2''(c_n)}{2}\,e_n^2,$$
where c_n is some number between x_n and x∗ and g₂''(c_n) is a complicated expression involving h(c_n), h'(c_n), h''(c_n), h'''(c_n). Since e_n → 0 and f is three times continuously differentiable, g₂''(c_n) → g₂''(x∗) and so
$$\lim_{n\to\infty}\frac{e_{n+1}}{e_n^2} = \frac{g_2''(x_*)}{2} = \frac{3h'(x_*)}{4h(x_*)}.$$
Thus the modified Newton's method converges locally quadratically. □

Again, Newton’s method is locally convergent meaning that the initial iterate x0 must be
sufficiently close to the root for convergence. If x0 is far from the root, then the iterates may
diverge. In fact, they may not even be well defined.

Example: Let f (x) = −x4 + 3x2 + 2. Newton iterates are

$$x_{n+1} = x_n - \frac{-x_n^4 + 3x_n^2 + 2}{-4x_n^3 + 6x_n}.$$

If x0 = 1, then −1 = x1 = x3 = x5 = · · · and 1 = x2 = x4 = x6 = · · · and so the iterates do not


converge. If x0 = 0, then x1 is undefined since f ′ (0) = 0.

Example: Let f (x) = xe−x . From the graph of this function (Figure 2.2), we see that Newton’s method
converges to the root 0 for every x0 < 1 while it diverges for every x0 > 1. The method is
undefined if x0 = 1 since f ′ (1) = 0.

Figure 2.2: Graph of y = xe⁻ˣ.

Example: Let f(x) = x³ + x − 1. With x₀ = .5, estimate the number of Newton iterations it takes to approximate the root x∗ ≈ .6823 correct to 10⁻⁶.
Recall that
$$\lim_{n\to\infty}\frac{e_{n+1}}{e_n^2} = \frac{f''(x_*)}{2f'(x_*)} \approx .854.$$
Then e₀ ≈ 1.8 × 10⁻¹, e₁ ≈ .854 e₀² ≈ 2.8 × 10⁻², e₂ ≈ .854 e₁² ≈ 6.9 × 10⁻⁴, e₃ ≈ .854 e₂² ≈ 4.1 × 10⁻⁷. Hence three Newton iterations are sufficient. In practice, such estimates are difficult to obtain since the exact solution is unknown. Furthermore, the estimate e_{n+1} = C e_n² may be completely incorrect when the iterate is far from the solution. If the initial iterate is far away, then Newton iteration may not converge or take many iterations before it reaches the region where quadratic convergence holds.

2.4 Secant Method


To solve the nonlinear equation f (x) = 0, Newton’s method requires calculation of the derivative
of the function which may not be available or may be too difficult to evaluate. The secant method
is Newton’s method where the derivative is replaced by a finite difference approximation:

f (xn ) − f (xn−1 )
f ′ (xn ) ≈ .
xn − xn−1

The secant scheme is


f (xn )(xn − xn−1 )
xn+1 = xn − , n ≥ 1.
f (xn ) − f (xn−1 )
Note that this scheme require two given initial iterates x0 and x1 .
We now give a geometric derivation of secant method. We seek a new iterate given xn−1 and
xn . Let y = mx + b be the unique line which passes through the two points (xn−1 , f (xn−1 )) and
(xn , f (xn )). The new iterate xn+1 is defined as the zero of this line: xn+1 = −b/m. A simple
calculation will result in the secant iteration.
The following convergence result is known.

Theorem: Suppose f ∈ C³(ℝ) and there is some x∗ ∈ ℝ so that f(x∗) = 0 and f'(x∗) ≠ 0. There is some δ > 0 so that if |e₀|, |e₁| ≤ δ, then
$$\lim_{n\to\infty}\frac{e_{n+1}}{e_n^\alpha} = \left(\frac{f''(x_*)}{2f'(x_*)}\right)^{\alpha-1}, \qquad \alpha = \frac{1+\sqrt{5}}{2} \approx 1.62.$$
Hence the secant method converges locally linearly at rate 0 (faster than linear convergence)
but not quite as quickly as quadratic convergence, in terms of the number of iterations. Note
however that in each iteration, Newton’s method requires two function evaluations (f (xn ) and
f ′ (xn )), while in the secant method, only one function evaluation (f (xn )) is required. In real–life
examples where each function evaluation is very expensive (for instance, requires the solution
of a differential equation), then the number of function evaluations is a better indication of the
time complexity of the algorithm.
Example: Let f (x) = x3 + x − 1. Then the secant iterates are
$$x_{n+1} = x_n - \frac{(x_n^3 + x_n - 1)(x_n - x_{n-1})}{(x_n^3 + x_n - 1) - (x_{n-1}^3 + x_{n-1} - 1)}.$$
With x0 = 0, x1 = .5, we calculate x2 = .8, · · · , x6 ≈ .6823. Thus the secant method is slower
than Newton’s method but it is definitely faster than linearly converging methods.
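The same computation can be done in Python (a sketch using the example's starting values):

def secant(f, x0, x1, n_iter):
    """Secant method: replace f'(x_n) by a finite difference through
    the last two iterates."""
    for _ in range(n_iter):
        x2 = x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        x0, x1 = x1, x2
    return x1

f = lambda x: x**3 + x - 1
print(secant(f, 0.0, 0.5, 5))   # about 0.6823, reached by x_6 as in the example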
Example: Suppose A(x) ∈ Rm×m for each x. Let Λ(x) denote the set of eigenvalues of A(x). Define
f(x) = min_{λ∈Λ(x)} Re λ, where Re z denotes the real part of a complex number z. We wish to find a zero of f. This problem comes up in determining stability of a steady solution of a differential
equation. Here it is exceedingly difficult to calculate the derivative of f . In fact, there may be
points where the derivative does not exist. Here secant method is a more appropriate method
to numerically find the root of f than Newton’s method.
We now give a proof of convergence of the secant method provided that the initial two
iterates are sufficiently close to a root x∗ of f . The secant iteration is
$$x_{n+1} = x_n - \frac{f(x_n)(x_n - x_{n-1})}{f(x_n) - f(x_{n-1})} = \frac{f(x_n)x_{n-1} - f(x_{n-1})x_n}{f(x_n) - f(x_{n-1})} := g(x_n, x_{n-1}).$$
Note that
$$\lim_{v\to u} g(u, v) = u - \frac{f(u)}{f'(u)}.$$
Thus the secant method is identical to Newton's method in case x_{n-1} = x_n. It is simple to check that g satisfies the following equalities:
$$g(u, x_*) = x_* = g(x_*, v), \qquad g_u(u, x_*) = g_{uu}(u, x_*) = 0 = g_v(x_*, v) = g_{vv}(x_*, v)$$
for all u, v. Using Taylor's expansion twice, there are some θ, γ, µ ∈ (0, 1) so that
$$\begin{aligned}
g(x_*+\xi, x_*+\eta) &= g(x_*, x_*) + g_u(x_*, x_*)\xi + g_v(x_*, x_*)\eta \\
&\quad + \tfrac{1}{2}\Big(g_{uu}(x_*+\theta\xi, x_*+\theta\eta)\xi^2 + 2g_{uv}(x_*+\theta\xi, x_*+\theta\eta)\xi\eta + g_{vv}(x_*+\theta\xi, x_*+\theta\eta)\eta^2\Big) \\
&= x_* + \tfrac{1}{2}\Big(g_{uu}(x_*+\theta\xi, x_*)\xi^2 + g_{uuv}(x_*+\theta\xi, x_*+\gamma\theta\eta)\theta\eta\xi^2 + 2g_{uv}(x_*+\theta\xi, x_*+\theta\eta)\xi\eta \\
&\qquad\qquad + g_{vv}(x_*, x_*+\theta\eta)\eta^2 + g_{uvv}(x_*+\mu\theta\xi, x_*+\theta\eta)\theta\xi\eta^2\Big) \\
&= x_* + \tfrac{1}{2}\Big(g_{uuv}(x_*+\theta\xi, x_*+\gamma\theta\eta)\theta\eta\xi^2 + 2g_{uv}(x_*+\theta\xi, x_*+\theta\eta)\xi\eta + g_{uvv}(x_*+\mu\theta\xi, x_*+\theta\eta)\theta\xi\eta^2\Big).
\end{aligned}$$
Recall that e_n = x_n − x∗. With ξ = e₁ and η = e₀, it follows that
$$\begin{aligned}
e_2 &= g(x_*+e_1, x_*+e_0) - x_* \\
&= \frac{e_1 e_0}{2}\Big(g_{uuv}(x_*+\theta e_1, x_*+\gamma\theta e_0)\theta e_1 + 2g_{uv}(x_*+\theta e_1, x_*+\theta e_0) + g_{uvv}(x_*+\mu\theta e_1, x_*+\theta e_0)\theta e_0\Big) \\
&:= \frac{e_1 e_0}{2}\, h(e_1, e_0).
\end{aligned}$$
Since h(0, 0) = 2guv (x∗ , x∗ ), there is some δ > 0 so that |vh(u, v)| ≤ ǫ < 1 whenever |u|, |v| ≤ δ.
In particular, if |e0 |, |e1 | ≤ δ, it follows that |e2 | ≤ ǫ |e1 | ≤ δ. By induction, |en | ≤ ǫ|en−1 | ≤
ǫn−1 |e1 | → 0 as n → ∞. This establishes convergence of the secant method.
Now
$$\begin{aligned}
e_{n+1} &= \frac{f(x_n)x_{n-1} - f(x_{n-1})x_n - x_*(f(x_n) - f(x_{n-1}))}{f(x_n) - f(x_{n-1})} \\
&= \frac{f(x_n)e_{n-1} - f(x_{n-1})e_n}{f(x_n) - f(x_{n-1})} \\
&= \frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}\cdot\frac{\dfrac{f(x_n)}{e_n} - \dfrac{f(x_{n-1})}{e_{n-1}}}{x_n - x_{n-1}}\; e_n e_{n-1} \\
&\to \frac{1}{f'(x_*)}\,\frac{f''(x_*)}{2}\; e_n e_{n-1}.
\end{aligned}$$

Here, we use the fact that for x₀ sufficiently close to x∗, x_n → x∗. The following fact has also been used:
$$\lim_{n\to\infty}\left(\frac{f(x_n)}{e_n} - \frac{f(x_{n-1})}{e_{n-1}}\right) = \frac{f''(x_*)}{2}\lim_{n\to\infty}(x_n - x_{n-1}). \tag{2.2}$$
To see this, note by Taylor's theorem that there is some c_n in between x_n and x∗ so that
$$f(x_n) = f(x_* + e_n) = f(x_*) + f'(x_*)e_n + \frac{f''(c_n)}{2}e_n^2 = f'(x_*)e_n + \frac{f''(c_n)}{2}e_n^2$$
and so
$$\frac{f(x_n)}{e_n} = f'(x_*) + \frac{f''(c_n)}{2}e_n.$$
Subtract this equation with n replaced by n − 1 from the above equation and then take the limit as n → ∞ to obtain
$$\lim_{n\to\infty}\left(\frac{f(x_n)}{e_n} - \frac{f(x_{n-1})}{e_{n-1}}\right) = \frac{f''(x_*)}{2}\lim_{n\to\infty}(e_n - e_{n-1}) = \frac{f''(x_*)}{2}\lim_{n\to\infty}(x_n - x_{n-1}),$$
which is (2.2).
Let C = f''(x∗)/(2f'(x∗)) and y_n = −ln(C e_n). Note that y_n → ∞ as n → ∞. Recall that in the limit of large n, e_{n+1} = C e_n e_{n−1}. In terms of y_n this relation becomes
$$y_{n+1} = y_n + y_{n-1}, \qquad n \to \infty.$$
The solution of this recurrence relation can be found by the substitution y_n = αⁿ, from which the equation α² − α − 1 = 0 follows. The roots of this quadratic equation are α± = (1 ± √5)/2. The general solution of the recurrence relation is y_n = c₁α₊ⁿ + c₂α₋ⁿ for some constants c₁, c₂. Since y_n → ∞ and |α₋| < 1, we can make the approximation y_n ≈ c₁α₊ⁿ with c₁ ≠ 0. To simplify the notation, take α = α₊. Now e_n = C⁻¹e^{−y_n} ≈ C⁻¹e^{−c₁αⁿ}. In the limit of large n,
$$\frac{e_{n+1}}{e_n^\alpha} \approx \frac{C^{-1}e^{-c_1\alpha^{n+1}}}{C^{-\alpha}e^{-c_1\alpha^{n+1}}} = C^{\alpha-1}.$$
This completes the demonstration of the convergence rate of the secant method.

Other Methods
There are other methods to find zeroes of f with higher order of convergence provided f is
sufficiently smooth. Below we assume that f ′ (x∗ ) 6= 0 for some root x∗ of f .
The first method can be shown to be cubically convergent, but it requires three function evaluations per step. Recall that a linear model at x_n is y = f(x_n) + f'(x_n)(x − x_n). From the Fundamental Theorem of Calculus,
$$f(x) = f(x_n) + \int_{x_n}^{x} f'(\xi)\,d\xi.$$
The linear model results from approximating the above integral by the area f'(x_n)(x − x_n) of a rectangle. A more accurate quadratic model approximates the integral by the trapezoidal rule:
$$y = M(x) := f(x_n) + \frac{f'(x) + f'(x_n)}{2}(x - x_n).$$
Observe that M''(x_n) = f''(x_n). This is in addition to M(x_n) = f(x_n) and M'(x_n) = f'(x_n). Define the next iterate x = x_{n+1} so that it is a zero of the model:
$$x_{n+1} = x_n - \frac{2f(x_n)}{f'(x_n) + f'(x_{n+1})}.$$
This is a nonlinear equation for x_{n+1} and so the iteration is not really practical. One simple idea is to approximate x_{n+1} on the right-hand side by the Newton iterate: x_{n+1} ≈ x_n − f(x_n)/f'(x_n), resulting in the final iteration
$$x_{n+1} = x_n - \frac{2f(x_n)}{f'(x_n) + f'\big(x_n - f(x_n)/f'(x_n)\big)}.$$

If f'' is available, then two third-order methods are
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\left(1 + \frac{f''(x_n)f(x_n)}{2f'(x_n)^2}\right)$$
and
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\cdot\frac{2f'(x_n)^2}{2f'(x_n)^2 - f''(x_n)f(x_n)}.$$
A fourth-order method is
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\left(1 + \frac{f(y_n)}{f(x_n)}\left(1 + 2\,\frac{f(y_n)}{f(x_n)}\right)\right), \qquad y_n = x_n - \frac{f(x_n)}{f'(x_n)}.$$

All three methods above require three function evaluations per iteration. The final method is a modification of the first method of this subsection; it only requires two function evaluations per step, just like Newton's method, but its convergence order reduces to 1 + √2. Define x̃₀ = x₀, x₁ = x₀ − f(x₀)/f'(x₀) and
$$\tilde{x}_n = x_n - \frac{f(x_n)}{f'\big((x_{n-1} + \tilde{x}_{n-1})/2\big)}, \qquad x_{n+1} = x_n - \frac{f(x_n)}{f'\big((x_n + \tilde{x}_n)/2\big)}, \qquad n \ge 1.$$
All methods discussed in this subsection typically require a very good initial guess for con-
vergence.

2.5 System of Nonlinear Equations


So far, we have only looked at scalar functions. In many applications, a system of nonlinear
equations must be solved. Let F : Rm → Rm for some m ≥ 2. We wish to find X∗ ∈ Rm so
that F (X∗ ) = 0. Newton’s method can be extended to such systems. In the system case, the
Jacobian DF (Xn ) ∈ Rm×m replaces the derivative f ′ (xn ) in the scalar case. The (i, j) entry of
this m × m matrix is
$$(DF(X))_{ij} = \frac{\partial F_i(X)}{\partial X_j}.$$
The Newton iteration is

Xn+1 = Xn − DF (Xn )−1 F (Xn ), n ≥ 0.

The convergence result is similar to the scalar case. If DF (X∗ ) is nonsingular, then the iterates
converge locally quadratically:
$$\lim_{n\to\infty}\frac{\|E_{n+1}\|}{\|E_n\|^2} = C$$

for some constant C independent of n. Here E_n = X_n − X∗ and ‖X‖ denotes the length of the
vector X.

Example: Consider the intersection of the curve y = x³ and the unit circle. This can be solved by finding the roots of the system of equations
$$F(x_1, x_2) = \begin{pmatrix} x_2 - x_1^3 \\ x_1^2 + x_2^2 - 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$
We find that
$$DF(x_1, x_2) = \begin{pmatrix} -3x_1^2 & 1 \\ 2x_1 & 2x_2 \end{pmatrix}.$$
If X₀ = [1, 2]ᵀ, then
$$X_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix} - \begin{pmatrix} -3 & 1 \\ 2 & 4 \end{pmatrix}^{-1}\begin{pmatrix} 1 \\ 4 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$
Continuing,
$$X_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix} - \begin{pmatrix} -3 & 1 \\ 2 & 2 \end{pmatrix}^{-1}\begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 7/8 \\ 5/8 \end{pmatrix}.$$
If X₀ = [0, 0]ᵀ, then
$$DF(0, 0) = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix},$$
which is singular and so X₁ cannot be defined.
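The same two Newton steps can be reproduced with NumPy (an illustration; numpy is assumed available, and the linear system is solved rather than forming the inverse explicitly):

import numpy as np

def F(X):
    x1, x2 = X
    return np.array([x2 - x1**3, x1**2 + x2**2 - 1.0])

def DF(X):
    x1, x2 = X
    return np.array([[-3*x1**2, 1.0],
                     [2*x1,     2*x2]])

X = np.array([1.0, 2.0])
for _ in range(2):
    # Newton step: solve DF(X) d = F(X) and subtract d.
    X = X - np.linalg.solve(DF(X), F(X))
    print(X)   # first [1, 1], then [0.875, 0.625] up to roundoff, as computed above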

Example: Let
$$F(x_1, x_2) = \begin{pmatrix} \cos x_1 + x_1^2 e^{x_2} \\ x_1 + x_2 \end{pmatrix}.$$
Then
$$DF(x_1, x_2) = \begin{pmatrix} -\sin x_1 + 2x_1 e^{x_2} & x_1^2 e^{x_2} \\ 1 & 1 \end{pmatrix}.$$

Nonlinear systems arise naturally in optimization problems of functions of several variables.


Let f : Rn → R. A necessary condition for a local extremum at X∗ is that the gradient vanishes:
∇f (X∗ ) = 0. This is a nonlinear system of equations. If Newton’s method is used to solve this
system, then
X (n+1) = X (n) − H(X (n) )−1 ∇f (X (n) ), n ≥ 0.
Here, the Jacobian of the system is the Hessian of f, which is defined by
$$H_{ij} = \frac{\partial^2 f}{\partial X_i \partial X_j}.$$
When m, the number of nonlinear equations, is large, say m > 1000, then it is very time consuming to calculate the Jacobian since it is an m × m matrix. Also, solving the linear system at each iteration can be very expensive. This practical issue has led to several modified Newton's methods. We describe one such method, known as the chord method. Instead of calculating DF at each
iteration, this method uses the same Jacobian DF (X0 ) for all iterations:

Xn+1 = Xn − DF (X0 )−1 F (Xn ), n ≥ 0.

The obvious advantage is that there is no need to calculate a new Jacobian at each iteration.
Also, as we shall see later, we can factor DF (X0 ) into triangular factors once in the beginning and
so all subsequent linear solves involving DF (X0 ) can be performed quickly. The drawback of the
chord method is that it is no longer quadratically convergent but only locally linearly convergent.
To see the latter, the chord method is a FPI with iteration function g(X) = X −DF (X0 )−1 F (X)
and so Dg(X) = I − DF(X₀)⁻¹DF(X). If S = ‖Dg(X∗)‖ < 1, then convergence is locally linear
at rate S.
The secant method also has an analogue in the system case. Recall that we use this method
if it is difficult or even impossible to obtain an analytic expression for the derivative (Jacobian).
In the scalar case, the derivative, which is a number, must be estimated. In the system case, the Jacobian is a matrix and it is not obvious how it can be approximated. One such method, known as Broyden's method, is quite popular. Given two initial vectors X₀, X₁ and an initial matrix A₀, the Broyden iteration is

FOR n = 1, 2, 3, · · ·
    δ_n = X_n − X_{n−1}
    A_n = A_{n−1} + (F(X_n) − F(X_{n−1}) − A_{n−1}δ_n) δ_nᵀ / (δ_nᵀ δ_n)
    X_{n+1} = X_n − A_n⁻¹ F(X_n)
END

If F ′ (X∗ ) is nonsingular, then it can be shown that it converges locally linearly at rate 0
(faster than linear convergence) but it does not converge quadratically.
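Broyden's iteration can be sketched in a few lines of NumPy. In the sketch below, A₀ is taken to be the Jacobian at X₀ and X₁ is generated from X₀ using A₀; these are common choices, and the test problem and iteration count are illustrative assumptions rather than part of the notes.

import numpy as np

def broyden(F, X0, A0, n_iter):
    """Broyden's method: update a Jacobian approximation A by a rank-one
    correction and use it in place of DF(X_n)."""
    X_prev, A = X0, A0
    X = X_prev - np.linalg.solve(A, F(X_prev))     # X_1 from X_0 using A_0
    for _ in range(n_iter):
        delta = X - X_prev
        if np.linalg.norm(delta) < 1e-12:          # already converged
            break
        A = A + np.outer(F(X) - F(X_prev) - A @ delta, delta) / (delta @ delta)
        X_prev, X = X, X - np.linalg.solve(A, F(X))
    return X

# The curve/circle system from the earlier example, started at X0 = [1, 2]:
F = lambda X: np.array([X[1] - X[0]**3, X[0]**2 + X[1]**2 - 1.0])
A0 = np.array([[-3.0, 1.0], [2.0, 4.0]])   # DF at X0
print(broyden(F, np.array([1.0, 2.0]), A0, 10))  # approaches the intersection point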
Chapter 3

Linear Systems

Given a nonsingular A ∈ Rn×n and b ∈ Rn , the goal is to find x ∈ Rn so that Ax = b. This


is without a doubt the most important problem in scientific computing. The solution of many
problems in science and engineering require the solution of some linear system of equations. For
instance, after discretization, a system of linear partial differential equations becomes a system
of linear equations.
We shall examine two classes of methods to solve linear systems. The first one is Gaussian
Elimination which is called a direct method because it calculates the exact solution (modulo
roundoff errors) in finitely many steps. The second class is called iterative methods, which may converge only after infinitely many steps. Iterative methods are used primarily for large systems, say,
n > 1000.

3.1 Basic Gaussian Elimination


Gaussian Elimination (GE) is the workhorse solver for small to medium systems (n ≤ O(1000)).
In this section, we shall discuss the basic version which does not work for all matrices. The full
version will be given in the following section.

Example: Given the system

x + 2y − z = 3
2x + y − 2z = 3
−3x + y + z = −6.

In the basic version of GE, we write the matrix and the right-hand side as the augmented matrix
$$\left[\begin{array}{rrr|r} 1 & 2 & -1 & 3 \\ 2 & 1 & -2 & 3 \\ -3 & 1 & 1 & -6 \end{array}\right].$$
Let Ri denote the ith row of the above augmented matrix. The notation Rj ← aRi + Rj means
replacing row j by a times Ri plus Rj . GE performs a sequence of row operations to reduce the
augmented matrix to upper triangular form. For this example performing R2 ← −2R1 + R2 and
R3 ← 3R1 + R3 results in
$$\left[\begin{array}{rrr|r} 1 & 2 & -1 & 3 \\ 0 & -3 & 0 & -3 \\ 0 & 7 & -2 & 3 \end{array}\right].$$


Notice that the first column is zero except for the first entry. Now perform R3 ← 7/3R2 + R3
to the above augmented matrix to get
$$\left[\begin{array}{rrr|r} 1 & 2 & -1 & 3 \\ 0 & -3 & 0 & -3 \\ 0 & 0 & -2 & -4 \end{array}\right].$$

Note that the new matrix is now upper triangular. The solution can easily be calculated by
back substitution. From the last equation, −2z = −4 which results in z = 2. From the second
equation, −3y + 0z = −3 and so y = 1. From the first equation

x + 2y − z = 3.

Using our values for y and z enables us to solve for x = 3.
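The same elimination and back substitution can be written as a short Python routine. This is a sketch of basic GE without pivoting, so it assumes all pivots are nonzero; the function name and test data are illustrative.

def gauss_solve(A, b):
    """Basic Gaussian elimination (no pivoting) followed by back substitution.
    A is a list of rows; copies are modified, not the inputs."""
    n = len(A)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n - 1):                 # eliminate column k below the pivot
        for i in range(k + 1, n):
            mu = -A[i][k] / A[k][k]        # multiplier for R_i <- mu*R_k + R_i
            for j in range(k, n):
                A[i][j] += mu * A[k][j]
            b[i] += mu * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):         # back substitution
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# The worked example above:
print(gauss_solve([[1, 2, -1], [2, 1, -2], [-3, 1, 1]], [3, 3, -6]))  # [3, 1, 2] up to roundoff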

Let us calculate the complexity of GE for an n × n matrix. In the first pass, we zero out all
entries in the first column except the first one. This is accomplished by

Ri ← µi1 R1 + Ri , 2≤i≤n

for some numbers µi1 called multipliers. (In the above example, µ21 = −2, µ31 = 3.) This
takes n(n − 1) multiplications. In the second pass, we zero all entries in the second column
except the first two. This is accomplished by

Ri ← µi2 R2 + Ri , 3 ≤ i ≤ n.

(In the above example, µ32 = 7/3.) This takes (n − 1)(n − 2) multiplications. Continuing to the
final (n − 1)st pass, we zero out the (n, n − 1) entry. This takes 2 · 1 multiplications. Hence the
total number of multiplications is
$$\sum_{j=1}^{n-1}(j+1)j = \sum_{j=1}^{n-1}j^2 + \sum_{j=1}^{n-1}j = \frac{(n-1)n(2n-1)}{6} + \frac{n(n-1)}{2} = O\!\left(\frac{n^3}{3}\right).$$

Here g(n) = O(f(n)) means that there is some constant C independent of n so that
$$\lim_{n\to\infty}\frac{g(n)}{f(n)} \le C.$$

We have dropped the terms in n and n^2 since they are insignificant when compared to the n^3 term
for large n. Note that we have ignored additions in the above count; this is because there
are the same number of additions as multiplications. Actually, on modern computer architectures,
the operation R_j ← aR_i + R_j takes about the same amount of CPU time as adding two vectors,
which is significantly faster than n times the time it takes to add two numbers.
The complexity of back substitution is easier to estimate. Starting from the last equation,
we solve for xn in one operation. From the second last equation, xn−1 can be solved in two
operations. Continuing, the solution of x1 requires n operations. The total number then is
 
1 + 2 + \cdots + n = \frac{n(n+1)}{2} = O\!\left(\frac{n^2}{2}\right).

Mathematically, GE is equivalent to the factorization of the matrix A into a product of a unit


lower triangular matrix L (unit referring to ones along the diagonal) and an upper triangular
matrix U . That is, A = LU ,
   
L = \begin{pmatrix} 1 & & & & \\ -\mu_{21} & 1 & & & \\ -\mu_{31} & -\mu_{32} & 1 & & \\ \vdots & \vdots & & \ddots & \\ -\mu_{n1} & -\mu_{n2} & \cdots & -\mu_{n,n-1} & 1 \end{pmatrix},
\qquad
U = \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1,n-1} & u_{1n} \\ & u_{22} & \cdots & u_{2,n-1} & u_{2n} \\ & & \ddots & \vdots & \vdots \\ & & & u_{n-1,n-1} & u_{n-1,n} \\ & & & & u_{nn} \end{pmatrix}.

The numbers µij are multipliers which are generated in the GE process. The upper triangular
matrix U is the same as the one at the end of the GE process.

Example: For the previous example,


    
A = \begin{pmatrix} 1 & 2 & -1 \\ 2 & 1 & -2 \\ -3 & 1 & 1 \end{pmatrix}
  = \begin{pmatrix} 1 & & \\ 2 & 1 & \\ -3 & -7/3 & 1 \end{pmatrix}
    \begin{pmatrix} 1 & 2 & -1 \\ & -3 & 0 \\ & & -2 \end{pmatrix}.

3.2 Gaussian Elimination with Partial Pivoting


Not every matrix has an LU factorization. For instance, GE fails at the first step for the matrix
 
0 1
1 1

because the (1, 1) entry, called the pivot, is zero. Even if an LU factorization exists mathemati-
cally, its calculation may not be stable numerically.

Example: Consider the system

ǫx1 + x2 = 1 (3.1)
x1 + 2x2 = 4

where ǫ is a small number. This system has the solution


\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} \frac{2}{1-2\epsilon} \\[4pt] \frac{1-4\epsilon}{1-2\epsilon} \end{pmatrix}
\approx \begin{pmatrix} 2 \\ 1 \end{pmatrix}.

After one step of GE, we have


   
\begin{pmatrix} \epsilon & 1 & 1 \\ 0 & -\epsilon^{-1}+2 & -\epsilon^{-1}+4 \end{pmatrix}
\;\approx\;
\begin{pmatrix} \epsilon & 1 & 1 \\ 0 & -\epsilon^{-1} & -\epsilon^{-1} \end{pmatrix}

after machine roundoff. The solution of this approximate system is x2 = 1 and x1 = 0 which is
quite different from the exact solution.

The reason for the poor approximation in the above example is that the multiplier is huge
(ǫ−1 ). The fix is to permute the rows so that all multipliers are bounded above by one. This is
GE with partial pivoting.

Example: Consider the problem Ax = b with the augmented matrix


 
0 1 2
.
1 1 3

Recall that GE fails at the first step. Suppose we interchange the two rows:
 
1 1 3
.
0 1 2

Clearly the solution remains the same as the original system. It is fortunate that the matrix
is already upper triangular and so the system can be solved by back substitution, obtaining
x2 = 2, x1 = 1. Notice that after this row interchange, GE is now well defined.

Example: Consider the system (3.1) again:


 
\begin{pmatrix} \epsilon & 1 & 1 \\ 1 & 2 & 4 \end{pmatrix}.
We permute the rows and then perform GE:
     
\begin{pmatrix} 1 & 2 & 4 \\ \epsilon & 1 & 1 \end{pmatrix}
\;\Longrightarrow\;
\begin{pmatrix} 1 & 2 & 4 \\ 0 & 1-2\epsilon & 1-4\epsilon \end{pmatrix}
\;\approx\;
\begin{pmatrix} 1 & 2 & 4 \\ 0 & 1 & 1 \end{pmatrix}.

The latter approximate system has the solution x2 = 1, x1 = 2 which is a good approximation
to the exact solution. Notice that the multiplier of the system after the permutation is ǫ and
there are no entries in the new matrix of magnitude ǫ−1 like before.
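The effect is easy to reproduce numerically. The following sketch solves (3.1) in double precision with and without the row interchange; the value of ǫ is an arbitrary choice, small enough that ǫ−1 swamps the other entries.

import numpy as np

eps = 1e-20
A = np.array([[eps, 1.0], [1.0, 2.0]])
b = np.array([1.0, 4.0])

# GE without pivoting: the multiplier is the huge number 1/eps.
m = A[1, 0] / A[0, 0]
a22 = A[1, 1] - m * A[0, 1]     # -1/eps + 2 rounds to -1/eps
b2  = b[1] - m * b[0]           # -1/eps + 4 rounds to -1/eps
x2 = b2 / a22
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
print("no pivoting:", x1, x2)                # roughly [0, 1]

# With partial pivoting (what np.linalg.solve does internally) the multiplier is eps.
print("with pivoting:", np.linalg.solve(A, b))   # close to [2, 1]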

Given the general system


 
a11 ··· a1n b1
 .. .. ..  .
 . . .
an1 · · · ann bn
We now give the general GE with partial pivoting, which works for arbitrary nonsingular matrices and
is numerically stable in practice, unlike the previous version of GE.
In step one, we find a_{p1} with the property that |a_{p1}| ≥ |a_{i1}| for all i. In other words, a_{p1}
is the entry of largest magnitude in column one. If p ≠ 1, then interchange rows 1 and p.
The entry a_{p1} becomes the pivot for the new system. Carry out one step of GE to zero
out the second through nth entries of column one. Note that all multipliers are µ_{i1} = −a_{i1}/a_{p1}, i ≥ 2,
and they satisfy |µ_{i1}| ≤ 1. Note also that a_{p1} ≠ 0 since otherwise the matrix is singular. The
resultant matrix is  
ap1 ap2 · · · apn bp
 0 c22 · · · c2n d2 
 
 .. .. .. .. 
 . . . .
0 cn2 · · · cnn dn
for some numbers cij , dk .
In step two, find the new pivot c_{q2} with the property that |c_{q2}| ≥ |c_{i2}|, 2 ≤ i ≤ n. Note
that c_{q2} ≠ 0 since otherwise the matrix is singular. If q ≠ 2, interchange rows 2 and q. Carry
out one step of GE to zero out the third through nth entries of column two. The multipliers are
µ_{i2} = −c_{i2}/c_{q2}, i ≥ 3, and they satisfy |µ_{i2}| ≤ 1. The resultant matrix is
 
ap1 ap2 · · · · · · apn bp
 0 c22 c23 · · · c2n d2 
 
 0 0 e · · · e f 
 33 3n 3 
 .. .. .. .. .. 
 . . . . .
0 0 en3 · · · enn fn

for some numbers e_{ij}, f_k. Next find e_{r3}, which has the largest magnitude among the entries e_{i3},
and interchange rows 3 and r if r ≠ 3. Continue until the matrix is upper triangular. If the
initial matrix is non-singular, this procedure cannot fail. In the example below, R_i ↔ R_j denotes
interchanging rows i and j.

Example: GE with partial pivoting proceeds as follows:


   
\begin{pmatrix} 1 & 2 & -1 & 3 \\ 2 & 1 & -2 & 3 \\ -3 & 1 & 1 & -6 \end{pmatrix}
\overset{R_1 \leftrightarrow R_3}{\Longrightarrow}
\begin{pmatrix} -3 & 1 & 1 & -6 \\ 2 & 1 & -2 & 3 \\ 1 & 2 & -1 & 3 \end{pmatrix}
\overset{R_2 \leftarrow \frac{2}{3}R_1+R_2,\; R_3 \leftarrow \frac{1}{3}R_1+R_3}{\Longrightarrow}
\begin{pmatrix} -3 & 1 & 1 & -6 \\ 0 & \frac{5}{3} & -\frac{4}{3} & -1 \\ 0 & \frac{7}{3} & -\frac{2}{3} & 1 \end{pmatrix}
\overset{R_2 \leftrightarrow R_3}{\Longrightarrow}
\begin{pmatrix} -3 & 1 & 1 & -6 \\ 0 & \frac{7}{3} & -\frac{2}{3} & 1 \\ 0 & \frac{5}{3} & -\frac{4}{3} & -1 \end{pmatrix}
\overset{R_3 \leftarrow -\frac{5}{7}R_2+R_3}{\Longrightarrow}
\begin{pmatrix} -3 & 1 & 1 & -6 \\ 0 & \frac{7}{3} & -\frac{2}{3} & 1 \\ 0 & 0 & -\frac{6}{7} & -\frac{12}{7} \end{pmatrix}.
On back substitution, we obtain x3 = 2, x2 = 1, x1 = 3.

Partial pivoting has eliminated the problem of zero pivots and numerical instability.
The interchange of two rows of a matrix can be performed by left multiplication by a matrix
which is sometimes called a transposition.

Example: The transposition T which interchanges rows 2 and 5 is given below.

T A = \tilde{A}:

\begin{pmatrix} 1 & & & & \\ & 0 & & & 1 \\ & & 1 & & \\ & & & 1 & \\ & 1 & & & 0 \end{pmatrix}
\begin{pmatrix} R_1 \\ R_2 \\ R_3 \\ R_4 \\ R_5 \end{pmatrix}
= \begin{pmatrix} R_1 \\ R_5 \\ R_3 \\ R_4 \\ R_2 \end{pmatrix}

A permutation matrix is a product of several transpositions. A general permutation matrix


applied to the left of a matrix permutes the rows of that matrix.

Example: General permutation.

\begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} R_1 \\ R_2 \\ R_3 \\ R_4 \end{pmatrix}
= \begin{pmatrix} R_2 \\ R_4 \\ R_1 \\ R_3 \end{pmatrix}.
The above permutation matrix is a product of transpositions T3 T2 T1 where T1 transposes rows
one and two, T2 transposes rows two and four while T3 transposes rows four and three.

It turns out that GE with partial pivoting applied to a matrix A is equivalent to the math-
ematical representation P A = LU where P is a permutation matrix (equal to the product of
the transpositions in GE), L, U are unit lower and upper triangular matrices. For the previous
example of GE with partial pivoting, ignoring the righthand side,
       
A = \begin{pmatrix} 1 & 2 & -1 \\ 2 & 1 & -2 \\ -3 & 1 & 1 \end{pmatrix}, \quad
P = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad
L = \begin{pmatrix} 1 & & \\ -\frac{1}{3} & 1 & \\ -\frac{2}{3} & \frac{5}{7} & 1 \end{pmatrix}, \quad
U = \begin{pmatrix} -3 & 1 & 1 \\ & \frac{7}{3} & -\frac{2}{3} \\ & & -\frac{6}{7} \end{pmatrix}.

Here P is the product of two transpositions (note the order)

P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
    \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}.

Note also that the order of the multipliers in the first column of L has been switched because
of the last permutation R2 ↔ R3 .
Instead of calculating the permutation matrix P at the end, we can also update it as the
calculation proceeds. Start with the identity matrix. For the first transposition, apply it to I
to get P1 . For the next transposition, apply it to P1 to get P2 , etc. The matrix at the end of
this process is the permutation P .
In practice, the multipliers are stored immediately after they have been computed in the
strictly lower triangular part of A. Subsequent permutations can permute these multipliers as
well. Entries of U are stored at the corresponding entries in the upper triangular part of A. At
the end of the elimination process, L and U can be read off and no extra storage is needed.
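The following Python/NumPy sketch implements this in-place scheme. Note that it stores the entries of L (the negatives of the multipliers µij defined above) in the strictly lower triangular part, and returns a permutation vector p such that A[p] = LU; the function name and interface are ours, not a standard library routine.

import numpy as np

def lu_partial_pivoting(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    p = np.arange(n)                          # permutation counter
    for k in range(n - 1):
        q = k + np.argmax(np.abs(A[k:, k]))   # row of largest pivot candidate
        if q != k:
            A[[k, q]] = A[[q, k]]             # swap rows (stored entries too)
            p[[k, q]] = p[[q, k]]
        A[k+1:, k] /= A[k, k]                 # store L entries below the pivot
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return p, L, U

Usage: p, L, U = lu_partial_pivoting(A); then np.allclose(A[p], L @ U) should hold, i.e. PA = LU with P the permutation encoded by p.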

Example: Apply GE with partial pivoting to


 
−1 1 0 −3
1 0 3 1
 . (3.2)
0 1 −1 −1
3 0 1 2

First permute rows 1 and 4:


     
\begin{pmatrix} 3 & 0 & 1 & 2 \\ 1 & 0 & 3 & 1 \\ 0 & 1 & -1 & -1 \\ -1 & 1 & 0 & -3 \end{pmatrix}
\overset{GE}{\Longrightarrow}
\begin{pmatrix} 3 & 0 & 1 & 2 \\ 0 & 0 & \frac{8}{3} & \frac{1}{3} \\ 0 & 1 & -1 & -1 \\ 0 & 1 & \frac{1}{3} & -\frac{7}{3} \end{pmatrix}
\overset{R_2 \leftrightarrow R_3}{\Longrightarrow}
\begin{pmatrix} 3 & 0 & 1 & 2 \\ 0 & 1 & -1 & -1 \\ 0 & 0 & \frac{8}{3} & \frac{1}{3} \\ 0 & 1 & \frac{1}{3} & -\frac{7}{3} \end{pmatrix}
\overset{GE}{\Longrightarrow}
\begin{pmatrix} 3 & 0 & 1 & 2 \\ 0 & 1 & -1 & -1 \\ 0 & 0 & \frac{8}{3} & \frac{1}{3} \\ 0 & 0 & \frac{4}{3} & -\frac{4}{3} \end{pmatrix}
\overset{GE}{\Longrightarrow}
\begin{pmatrix} 3 & 0 & 1 & 2 \\ 0 & 1 & -1 & -1 \\ 0 & 0 & \frac{8}{3} & \frac{1}{3} \\ 0 & 0 & 0 & -\frac{3}{2} \end{pmatrix} = U.

Note that there is no need to do a permutation before the final GE step. We have P A = LU where

P = T_2 T_1 = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \qquad
L = \begin{pmatrix} 1 & & & \\ 0 & 1 & & \\ \frac{1}{3} & 0 & 1 & \\ -\frac{1}{3} & 1 & \frac{1}{2} & 1 \end{pmatrix}.

In the above, T_1 is the transposition R_1 ↔ R_4 while T_2 is the transposition R_2 ↔ R_3. Let us
derive the entries of L. From the first pass of GE, we have

\begin{pmatrix} 1 \\ -\mu_{21} \\ -\mu_{31} \\ -\mu_{41} \end{pmatrix}
= \begin{pmatrix} 1 \\ \frac{1}{3} \\ 0 \\ -\frac{1}{3} \end{pmatrix}
\overset{R_2 \leftrightarrow R_3}{\Longrightarrow}
\begin{pmatrix} 1 \\ 0 \\ \frac{1}{3} \\ -\frac{1}{3} \end{pmatrix}.

We now redo this example where we perform the row interchanges of the multipliers and keep
track of the permutations at the same time that we perform GE. Append a permutation counter
to the right of the matrix starting with the initial order 1, 2, 3, 4. The multipliers are displayed
in their natural positions in the lower triangular part of the matrix in boldface.
     
\left(\begin{array}{cccc|c} -1 & 1 & 0 & -3 & 1 \\ 1 & 0 & 3 & 1 & 2 \\ 0 & 1 & -1 & -1 & 3 \\ 3 & 0 & 1 & 2 & 4 \end{array}\right)
\overset{R_1 \leftrightarrow R_4}{\Longrightarrow}
\left(\begin{array}{cccc|c} 3 & 0 & 1 & 2 & 4 \\ 1 & 0 & 3 & 1 & 2 \\ 0 & 1 & -1 & -1 & 3 \\ -1 & 1 & 0 & -3 & 1 \end{array}\right)
\overset{GE}{\Longrightarrow}
\left(\begin{array}{cccc|c} 3 & 0 & 1 & 2 & 4 \\ \mathbf{-\frac{1}{3}} & 0 & \frac{8}{3} & \frac{1}{3} & 2 \\ \mathbf{0} & 1 & -1 & -1 & 3 \\ \mathbf{\frac{1}{3}} & 1 & \frac{1}{3} & -\frac{7}{3} & 1 \end{array}\right)
\overset{R_2 \leftrightarrow R_3}{\Longrightarrow}
\left(\begin{array}{cccc|c} 3 & 0 & 1 & 2 & 4 \\ \mathbf{0} & 1 & -1 & -1 & 3 \\ \mathbf{-\frac{1}{3}} & 0 & \frac{8}{3} & \frac{1}{3} & 2 \\ \mathbf{\frac{1}{3}} & 1 & \frac{1}{3} & -\frac{7}{3} & 1 \end{array}\right)
\overset{GE}{\Longrightarrow}
\left(\begin{array}{cccc|c} 3 & 0 & 1 & 2 & 4 \\ \mathbf{0} & 1 & -1 & -1 & 3 \\ \mathbf{-\frac{1}{3}} & 0 & \frac{8}{3} & \frac{1}{3} & 2 \\ \mathbf{\frac{1}{3}} & \mathbf{-1} & \frac{4}{3} & -\frac{4}{3} & 1 \end{array}\right)
\overset{GE}{\Longrightarrow}
\left(\begin{array}{cccc|c} 3 & 0 & 1 & 2 & 4 \\ \mathbf{0} & 1 & -1 & -1 & 3 \\ \mathbf{-\frac{1}{3}} & 0 & \frac{8}{3} & \frac{1}{3} & 2 \\ \mathbf{\frac{1}{3}} & \mathbf{-1} & \mathbf{-\frac{1}{2}} & -\frac{3}{2} & 1 \end{array}\right).
The strict lower triangular entries of L are the negatives of the strict lower triangular entries
of the above matrix (in boldface), while the entries of U are the upper triangular entries of the
above matrix. The final permutation matrix P is determined from the rightmost column of
the above matrix.
 
Example: Apply GE with partial pivoting to

A = \begin{pmatrix} 2 & 1 & 1 & 0 \\ 4 & 3 & 3 & 1 \\ 8 & 7 & 9 & 5 \\ 6 & 7 & 9 & 8 \end{pmatrix}.
First permute rows 1 and 3:
     
\begin{pmatrix} 8 & 7 & 9 & 5 \\ 4 & 3 & 3 & 1 \\ 2 & 1 & 1 & 0 \\ 6 & 7 & 9 & 8 \end{pmatrix}
\overset{GE}{\Longrightarrow}
\begin{pmatrix} 8 & 7 & 9 & 5 \\ 0 & -\frac{1}{2} & -\frac{3}{2} & -\frac{3}{2} \\ 0 & -\frac{3}{4} & -\frac{5}{4} & -\frac{5}{4} \\ 0 & \frac{7}{4} & \frac{9}{4} & \frac{17}{4} \end{pmatrix}
\overset{R_2 \leftrightarrow R_4}{\Longrightarrow}
\begin{pmatrix} 8 & 7 & 9 & 5 \\ 0 & \frac{7}{4} & \frac{9}{4} & \frac{17}{4} \\ 0 & -\frac{3}{4} & -\frac{5}{4} & -\frac{5}{4} \\ 0 & -\frac{1}{2} & -\frac{3}{2} & -\frac{3}{2} \end{pmatrix}
\overset{GE}{\Longrightarrow}
\begin{pmatrix} 8 & 7 & 9 & 5 \\ 0 & \frac{7}{4} & \frac{9}{4} & \frac{17}{4} \\ 0 & 0 & -\frac{2}{7} & \frac{4}{7} \\ 0 & 0 & -\frac{6}{7} & -\frac{2}{7} \end{pmatrix}
\overset{R_3 \leftrightarrow R_4}{\Longrightarrow}
\begin{pmatrix} 8 & 7 & 9 & 5 \\ 0 & \frac{7}{4} & \frac{9}{4} & \frac{17}{4} \\ 0 & 0 & -\frac{6}{7} & -\frac{2}{7} \\ 0 & 0 & -\frac{2}{7} & \frac{4}{7} \end{pmatrix}
\overset{GE}{\Longrightarrow}
\begin{pmatrix} 8 & 7 & 9 & 5 \\ 0 & \frac{7}{4} & \frac{9}{4} & \frac{17}{4} \\ 0 & 0 & -\frac{6}{7} & -\frac{2}{7} \\ 0 & 0 & 0 & \frac{2}{3} \end{pmatrix} = U.

We have P A = LU where
   
P = T_3 T_2 T_1 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \qquad
L = \begin{pmatrix} 1 & & & \\ \frac{3}{4} & 1 & & \\ \frac{1}{2} & -\frac{2}{7} & 1 & \\ \frac{1}{4} & -\frac{3}{7} & \frac{1}{3} & 1 \end{pmatrix}.

In the above, T_1, T_2, T_3 are the transpositions R_1 ↔ R_3, R_2 ↔ R_4, R_3 ↔ R_4, respectively. Let
us derive the entries of L. From the first pass of GE, we have

\begin{pmatrix} 1 \\ -\mu_{21} \\ -\mu_{31} \\ -\mu_{41} \end{pmatrix}
= \begin{pmatrix} 1 \\ \frac{1}{2} \\ \frac{1}{4} \\ \frac{3}{4} \end{pmatrix}
\overset{R_2 \leftrightarrow R_4}{\Longrightarrow}
\begin{pmatrix} 1 \\ \frac{3}{4} \\ \frac{1}{4} \\ \frac{1}{2} \end{pmatrix}
\overset{R_3 \leftrightarrow R_4}{\Longrightarrow}
\begin{pmatrix} 1 \\ \frac{3}{4} \\ \frac{1}{2} \\ \frac{1}{4} \end{pmatrix}

and from the second pass,

\begin{pmatrix} 0 \\ 1 \\ -\mu_{32} \\ -\mu_{42} \end{pmatrix}
= \begin{pmatrix} 0 \\ 1 \\ -\frac{3}{7} \\ -\frac{2}{7} \end{pmatrix}
\overset{R_3 \leftrightarrow R_4}{\Longrightarrow}
\begin{pmatrix} 0 \\ 1 \\ -\frac{2}{7} \\ -\frac{3}{7} \end{pmatrix}.
Let us repeat this example using the second method.
     
\left(\begin{array}{cccc|c} 2 & 1 & 1 & 0 & 1 \\ 4 & 3 & 3 & 1 & 2 \\ 8 & 7 & 9 & 5 & 3 \\ 6 & 7 & 9 & 8 & 4 \end{array}\right)
\overset{R_1 \leftrightarrow R_3}{\Longrightarrow}
\left(\begin{array}{cccc|c} 8 & 7 & 9 & 5 & 3 \\ 4 & 3 & 3 & 1 & 2 \\ 2 & 1 & 1 & 0 & 1 \\ 6 & 7 & 9 & 8 & 4 \end{array}\right)
\overset{GE}{\Longrightarrow}
\left(\begin{array}{cccc|c} 8 & 7 & 9 & 5 & 3 \\ \mathbf{-\frac{1}{2}} & -\frac{1}{2} & -\frac{3}{2} & -\frac{3}{2} & 2 \\ \mathbf{-\frac{1}{4}} & -\frac{3}{4} & -\frac{5}{4} & -\frac{5}{4} & 1 \\ \mathbf{-\frac{3}{4}} & \frac{7}{4} & \frac{9}{4} & \frac{17}{4} & 4 \end{array}\right)
\overset{R_2 \leftrightarrow R_4}{\Longrightarrow}
\left(\begin{array}{cccc|c} 8 & 7 & 9 & 5 & 3 \\ \mathbf{-\frac{3}{4}} & \frac{7}{4} & \frac{9}{4} & \frac{17}{4} & 4 \\ \mathbf{-\frac{1}{4}} & -\frac{3}{4} & -\frac{5}{4} & -\frac{5}{4} & 1 \\ \mathbf{-\frac{1}{2}} & -\frac{1}{2} & -\frac{3}{2} & -\frac{3}{2} & 2 \end{array}\right)
\overset{GE}{\Longrightarrow}
\left(\begin{array}{cccc|c} 8 & 7 & 9 & 5 & 3 \\ \mathbf{-\frac{3}{4}} & \frac{7}{4} & \frac{9}{4} & \frac{17}{4} & 4 \\ \mathbf{-\frac{1}{4}} & \mathbf{\frac{3}{7}} & -\frac{2}{7} & \frac{4}{7} & 1 \\ \mathbf{-\frac{1}{2}} & \mathbf{\frac{2}{7}} & -\frac{6}{7} & -\frac{2}{7} & 2 \end{array}\right)
\overset{R_3 \leftrightarrow R_4}{\Longrightarrow}
\left(\begin{array}{cccc|c} 8 & 7 & 9 & 5 & 3 \\ \mathbf{-\frac{3}{4}} & \frac{7}{4} & \frac{9}{4} & \frac{17}{4} & 4 \\ \mathbf{-\frac{1}{2}} & \mathbf{\frac{2}{7}} & -\frac{6}{7} & -\frac{2}{7} & 2 \\ \mathbf{-\frac{1}{4}} & \mathbf{\frac{3}{7}} & -\frac{2}{7} & \frac{4}{7} & 1 \end{array}\right)
\overset{GE}{\Longrightarrow}
\left(\begin{array}{cccc|c} 8 & 7 & 9 & 5 & 3 \\ \mathbf{-\frac{3}{4}} & \frac{7}{4} & \frac{9}{4} & \frac{17}{4} & 4 \\ \mathbf{-\frac{1}{2}} & \mathbf{\frac{2}{7}} & -\frac{6}{7} & -\frac{2}{7} & 2 \\ \mathbf{-\frac{1}{4}} & \mathbf{\frac{3}{7}} & \mathbf{-\frac{1}{3}} & \frac{2}{3} & 1 \end{array}\right).
Given a system of equations Ax = b, we perform GE with partial pivoting to obtain
P Ax = LU x = P b. The system is now solved by first solving the lower triangular system
Ly = P b. Next, the solution can be found from solving the upper triangular system U x = y.
The approximate number of operations is n3 /3 + n2 = O(n3 /3) for a system with n unknowns.
Note that by far, the bulk of the work is in the factorization step. Hence if we need to solve
systems with the same matrix A but different righthand side vectors b, then it is only necessary
to do the factorization once.
Example: Take A as defined in (3.2). Define b = [1, 6, −1, 2]T . We solve the system Ax = b by solving
LU x = P b = [2, −1, 6, 1]T . First solve the lower triangular system Ly = P b:
    
\begin{pmatrix} 1 & & & \\ 0 & 1 & & \\ \frac{1}{3} & 0 & 1 & \\ -\frac{1}{3} & 1 & \frac{1}{2} & 1 \end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix}
= \begin{pmatrix} 2 \\ -1 \\ 6 \\ 1 \end{pmatrix}
to obtain [y1 , y2 , y3 , y4 ]T = [2, −1, 16/3, 0]T . Next solve the upper triangular system U x = y:
    
\begin{pmatrix} 3 & 0 & 1 & 2 \\ 0 & 1 & -1 & -1 \\ 0 & 0 & \frac{8}{3} & \frac{1}{3} \\ 0 & 0 & 0 & -\frac{3}{2} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
= \begin{pmatrix} 2 \\ -1 \\ 16/3 \\ 0 \end{pmatrix}
to obtain the solution [x1 , x2 , x3 , x4 ]T = [0, 1, 2, 0]T .
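In SciPy, the factor-once / solve-many-times pattern can be expressed with lu_factor and lu_solve; a small sketch using the matrix (3.2) and the right-hand side above:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[-1., 1., 0., -3.],
              [ 1., 0., 3.,  1.],
              [ 0., 1.,-1., -1.],
              [ 3., 0., 1.,  2.]])
b = np.array([1., 6., -1., 2.])

lu, piv = lu_factor(A)          # O(n^3/3) work, done once
x = lu_solve((lu, piv), b)      # O(n^2) work per right-hand side
print(x)                        # approximately [0, 1, 2, 0]

Any further right-hand sides reuse the same (lu, piv) pair, which is the point of separating the factorization from the triangular solves.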

The wrong way to solve the system Ax = b is to calculate A−1 followed by the multiplication
by b. This is a very common mistake. The reason is that to calculate A−1 , we must solve
the systems Azi = Ii , i = 1, · · · , n where Ii is the ith column of the identity matrix. The
columns of A−1 are given by the vectors zi . The amount of work required is O(n3 /3) for one
factorization plus the cost of n back substitutions to solve for the vectors zi . The latter costs
n O(n2 ) = O(n3 ). Hence the total cost is O(4n3 /3) which is four times more expensive than the
method in the above paragraph. In addition, it takes twice as much storage (store A and A−1 )
and it is also less accurate because of extra operations.
GE with partial pivoting is a numerically stable algorithm in practice although there are
academic examples where roundoff errors increase exponentially quickly as a function of the
number of unknowns in the system. Such examples, fortunately, rarely come up in practice.
There is a more stable version known as GE with full pivoting. Here, at the jth pass of the
process, we choose as the pivot the entry of largest magnitude in the (n − j + 1) × (n − j + 1) submatrix
starting at the jth row and jth column. In this case, exponential growth of the entries
of U cannot happen. Full pivoting is rarely used in practice because of the extra cost of
searching for the pivot and because partial pivoting is quite adequate.

3.3 Errors in Solving Linear Systems


Given a linear system Ax = b, suppose we have an approximate solution x̃. How do we measure
the size of the error x̃ − x? One way is by the length of the vector. For vector y ∈ Rn , define

|y|_2 = \left( \sum_{i=1}^{n} y_i^2 \right)^{1/2},

the 2–norm of vector y. Another measure is given by the ∞–norm:

|y|_\infty = \max_{1 \le i \le n} |y_i|.

For instance, if y = [−1, 2, −3], then |y|_2 = \sqrt{14} and |y|_\infty = 3.
There are many other ways to measure the length of a vector. In general, we define a
vector norm ν : R^n → R which satisfies the following three conditions (for any x, y ∈ R^n):

1. ν(x) ≥ 0 and ν(x) = 0 iff x = 0.

2. ν(ax) = |a|ν(x) for all a ∈ R.

3. ν(x + y) ≤ ν(x) + ν(y).

| · |_2 and | · |_∞ are two examples of a norm.


For vectors of a fixed dimension, it makes little difference which norm we use because it
can be shown that all norms are equivalent in the sense that the results differ at most by a
constant multiplicative factor independent of the vector. More precisely, for any two vector
norms | · | and | · |′ , there exist some constants c1 , c2 such that for all vectors x, the inequalities
c1 |x| ≤ |x|′ ≤ c2 |x| hold.
Given the system Ax = b, suppose x̃ is an approximate solution. Define the forward error
by |x̃ − x|∞ and the backward error (or residual) by |Ax̃ − b|∞ . (Of course, other norms can
also be used.)

Example: Consider the system


        
\begin{pmatrix} 1 & 1 \\ 3 & -4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 3 \\ 2 \end{pmatrix}
\quad\text{with solution}\quad
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}.

Let x̃ = [1, 1]^T. By a direct calculation, the forward error in the infinity norm is 1 while the
backward error in the 2–norm is \sqrt{10}.

Example: Consider the system


        
\begin{pmatrix} 1 & 1 \\ 1.0001 & 1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 2.0001 \end{pmatrix}
\quad\text{with solution}\quad
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.

Let x̃ = [−1, 3.0001]^T. By a direct calculation, the forward error in the infinity norm is 2.0001
while the backward error is .0001.
In practice, the exact solution x is unknown while x̃, usually the result of a numerical calculation
is known. Hence the forward error is not computable while the backward error is. In practice, we
gauge the accuracy of an approximate solution by its backward error. In the last example, the
backward error is .0001 which is apparently very small (assume we work with a 4–digit mantissa).
However the actual (forward) error is about 2 which is unacceptably large. This is an example
of an ill–conditioned linear system where it is possible for an approximate solution to have a
small backward error but large forward error. The solution of this system corresponds to the
point of intersection of two straight lines. What makes the system ill–conditioned is that the
straight lines are nearly parallel. In the example before that, the two straight lines are not nearly
parallel and so the forward and backward errors have approximately the same magnitudes.
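A quick numerical check of the two examples above (a sketch; the numbers are the ones used in the text):

import numpy as np

A = np.array([[1.0, 1.0], [1.0001, 1.0]])
b = np.array([2.0, 2.0001])
x_exact = np.array([1.0, 1.0])
x_tilde = np.array([-1.0, 3.0001])

forward  = np.linalg.norm(x_tilde - x_exact, np.inf)   # about 2
backward = np.linalg.norm(A @ x_tilde - b, np.inf)     # about 1e-4
print(forward, backward)

The tiny residual together with the large forward error is exactly the signature of an ill-conditioned system.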

The “size” of a matrix can be measured by matrix norms. Given a vector norm | · | and
A ∈ R^{n×n}, the matrix norm ‖ · ‖ induced by | · | is defined by

\|A\| = \max_{x \ne 0} \frac{|Ax|}{|x|}.

The matrix norm satisfies the three properties of a vector norm, plus the following:

\|AB\| \le \|A\|\, \|B\|

for any square matrices A, B.


The matrix norms k·k2 and k·k∞ are induced by the vector norms |·|2 and |·|∞ , respectively.
The definition of a matrix norm does not lend itself to a practical way of calculating the norm.
Fortunately, we have
Theorem: Let A ∈ R^{n×n}. Then

\|A\|_2 = \sqrt{\Lambda_{\max}(A^T A)}, \qquad \|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|.

Here Λmax () denotes the maximum eigenvalue of a matrix.

Proof: By the definition of the two–norm, we need to maximize f(x) = |Ax|_2^2 subject to the constraint
g(x) = x^T x − 1 = 0. By the method of Lagrange multipliers, we form

L = |Ax|_2^2 − λ(x^T x − 1) = x^T A^T Ax − λ(x^T x − 1).

At a critical point of L,

L_x = 2A^T Ax − 2λx = 0,
L_λ = 1 − x^T x = 0.

Hence A^T Ax = λx and x^T x = 1. That is, x is an eigenvector of A^T A with corresponding
eigenvalue λ. Therefore the maximum of f(x) = x^T A^T Ax is the maximum eigenvalue of A^T A.
Why is ∇_x L = 2A^T Ax − 2λx? Let B = A^T A, which is a symmetric matrix. Then

f(x) = x^T Bx = \sum_{i,j=1}^{n} b_{ij} x_i x_j.

We compute ∇f componentwise:

\frac{\partial f}{\partial x_k}
= \sum_{i,j=1}^{n} b_{ij}\left( \frac{\partial x_i}{\partial x_k} x_j + x_i \frac{\partial x_j}{\partial x_k} \right)
= \sum_{i,j=1}^{n} b_{ij}(\delta_{ik} x_j + x_i \delta_{jk})
= \sum_{j=1}^{n} b_{kj} x_j + \sum_{i=1}^{n} b_{ik} x_i
= \sum_{j=1}^{n} b_{kj} x_j + \sum_{i=1}^{n} b_{ki} x_i
= 2 \sum_{j=1}^{n} b_{kj} x_j
= 2 B_{k*} \cdot x.

Here B_{k*} refers to the kth row of B (and we used the symmetry b_{ik} = b_{ki}). Therefore ∇f = 2Bx = 2A^T Ax.
Now we prove the result on the infinity norm. Take any x ≠ 0 and look at the ith component of Ax:

\left| \sum_{j=1}^{n} a_{ij} x_j \right| \le \sum_{j=1}^{n} |a_{ij}|\,|x_j| \le \left( \sum_{j=1}^{n} |a_{ij}| \right) |x|_\infty
\;\Longrightarrow\;
\frac{\left| \sum_{j=1}^{n} a_{ij} x_j \right|}{|x|_\infty} \le \sum_{j=1}^{n} |a_{ij}|.

Taking the maximum over i gives \frac{|Ax|_\infty}{|x|_\infty} \le \max_i \sum_{j=1}^{n} |a_{ij}|, and taking the maximum over x ≠ 0 gives \|A\|_\infty \le \max_i \sum_{j=1}^{n} |a_{ij}|.

For equality, we find x such that \frac{|Ax|_\infty}{|x|_\infty} = \max_i \sum_{j=1}^{n} |a_{ij}|. Suppose the maximum row sum occurs at the kth row, that is,

\sum_{j=1}^{n} |a_{kj}| = \max_i \sum_{j=1}^{n} |a_{ij}|.

Define x_j = \mathrm{sign}(a_{kj}), so |x|_\infty = 1. The kth component of Ax is then

\sum_{j=1}^{n} a_{kj} x_j = \sum_{j=1}^{n} |a_{kj}| = \frac{|Ax|_\infty}{|x|_\infty}.


 
Example: Let A = \begin{pmatrix} 1 & 0 \\ -3 & 4 \end{pmatrix}. Then ‖A‖_2 ≈ 5.0368 (the square root of the largest eigenvalue of A^T A) and
‖A‖_∞ = max(1, 7) = 7.

The infinity matrix norm is much easier to calculate because it only involves computing
row sums. The matrix 2–norm, however, requires the largest eigenvalue of
A^T A, which is a computationally intensive operation when A is a large matrix. It does have nice
properties which make it attractive for theoretical purposes.

Definition: Let k · k be the matrix norm induced by the vector norm | · |. Let A ∈ Rn×n be non–singular.
The condition number of A is κ(A) = kAk kA−1 k.

A matrix is said to be ill–conditioned if its condition number is large relative to the working
precision of the calculation. For instance, on a computer with double precision arithmetic, there
are approximately 15 significant (decimal) digits to represent a real number. If the condition
number is, say, greater than 1010 , then the matrix is ill-conditioned. On a machine with 30
significant digits, then the same matrix is not ill–conditioned. The concept of ill-conditioning is
irrelevant if all calculations are done exactly.
It is simple to show that for any non–singular matrix A, κ(A) ≥ 1. To show this, note that

1 = kIk = kA−1 Ak ≤ kA−1 k kAk = κ(A).


   
Example: Let A = \begin{pmatrix} 1 & 1 \\ 3 & -4 \end{pmatrix}, B = \begin{pmatrix} 1 & 1 \\ 1.0001 & 1 \end{pmatrix}, which we have already encountered before. By a direct
calculation, κ_2(A) = 3.5776 and κ_∞(A) = 5, while κ_2(B) = 40002 and κ_∞(B) = 40004. Hence ill–
conditioned systems are characterized by large condition numbers (independent of the norm).
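With NumPy these quantities can be computed directly; a small sketch for the two matrices above:

import numpy as np

A = np.array([[1.0, 1.0], [3.0, -4.0]])
B = np.array([[1.0, 1.0], [1.0001, 1.0]])

print(np.linalg.norm(A, 2), np.linalg.norm(A, np.inf))   # 2-norm and inf-norm of A
print(np.linalg.cond(A, 2), np.linalg.cond(A, np.inf))   # about 3.58 and 5
print(np.linalg.cond(B, 2), np.linalg.cond(B, np.inf))   # about 4e4 in both norms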

The following theorem bounds the relative error of an approximate solution in terms of the
condition number and the relative error of the data. Let | · | be any vector norm and k · k be the
induced matrix norm.

Theorem: Let A ∈ Rn×n be non–singular and x, b ∈ Rn . Suppose Ax = b. Let x̃ ∈ Rn and r = Ax̃ − b be


the backward error. Then
|x̃ − x| |r|
≤ κ(A) .
|x| |b|

Proof: Use the facts A^{−1}r = A^{−1}(Ax̃ − b) = A^{−1}(Ax̃ − Ax) = x̃ − x and |b| = |Ax| ≤ ‖A‖ |x| to get

\frac{|\tilde{x} - x|}{|x|} \le \frac{\|A\|}{|b|} |A^{-1} r| \le \frac{\|A\|}{|b|} \|A^{-1}\|\, |r| = \kappa(A) \frac{|r|}{|b|}.

This is a sharp upper bound, which is achievable as follows: let x and r be such that
‖A‖ |x| = |Ax| and ‖A^{−1}‖ |r| = |A^{−1}r|. Then

\frac{|\tilde{x} - x|}{|x|} = \frac{\|A\|}{|Ax|} |A^{-1} r| = \kappa(A) \frac{|r|}{|b|}.

What this theorem says is that the size of the residual r is indicative of the relative error only
if the condition number of the matrix is not large. If the condition number is large, then one can
have a large relative error with a small residual. The following is an alternative interpretation.
Suppose Ax̃ = b + r. That is, x̃ is the exact solution of a system whose righthand side is
perturbed (due to roundoff error or uncertainty in data). The significance of this result is that
the relative error of the solution is proportional to the condition number of the matrix – the
larger the condition number, the larger the relative error.
In the above theorem, the righthand side vector b is perturbed. Now we perturb the matrix.
Again the relative error will be seen to depend on the condition number of the matrix.

Theorem: Let A be non–singular and kA−1 Ek < 1/2 for some matrix E. Suppose Ax = b and (A+E)x̃ = b.
Then
|x̃ − x| kEk
≤ 2κ(A) .
|x| kAk

Proof: From simple algebra, we have x̃ − x = −A^{−1}E x̃, and so |x̃| − |x| ≤ |x̃ − x| ≤ ‖A^{−1}E‖ |x̃|, from
which we obtain

|\tilde{x}| \le \frac{|x|}{1 - \|A^{-1}E\|} \le 2|x|.

Therefore

\frac{|\tilde{x} - x|}{|x|} \le \|A^{-1}E\| \frac{|\tilde{x}|}{|x|} \le 2\,\|A^{-1}\|\,\|E\| = 2\kappa(A)\frac{\|E\|}{\|A\|}.


A rule–of–thumb for the solution of a linear system in double precision is the following. If
the condition number of the matrix is 10m , then the computed solution by GE (with partial
pivoting) is expected to have 16 − m correct digits. For example, if the condition number is 10,
then we expect 15–digit accuracy in the computed solution. If the condition number is 1016 or
larger, then the computed solution can be totally worthless. Roughly speaking, the system is
ill-conditioned if the condition number is larger than 10m/2 . Hence the degree of ill-conditioning
depends on the number of digits in the computation.

3.4 Symmetric Positive Definite Systems


Matrix A ∈ Rn×n is symmetric if A = AT . It is said to be positive definite if xT Ax > 0 for
every nonzero x ∈ Rn . For a symmetric matrix, it can be shown that it is positive definite iff all

its (real) eigenvalues are positive. Indeed, if λ is an eigenvalue with corresponding eigenvector
x, i.e., Ax = λx, then
0 < xT Ax = λxT x
iff λ > 0. In particular, a symmetric positive definite matrix is non–singular.
  

Example: Let A = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}, B = \begin{pmatrix} 0 & 3 \\ 3 & 1 \end{pmatrix}. Note that A is symmetric positive definite while B is not because,
for x = [1, −1]^T, it is easy to check that x^T Bx = −5. It can be checked that the eigenvalues of
B are (1 ± \sqrt{37})/2.

Symmetric positive definite matrices occur frequently in practice. Because of their special
properties, no pivoting is required in the GE. The operation count for the factorization is O(n3 /6)
which is one half of the count of GE for general matrices.

Theorem: Let A be symmetric, positive definite. Then there exists a unique upper triangular R with
positive diagonal entries such that A = RT R (called the Cholesky decomposition of A).

Proof: Induction on n. n = 1 is fine.


Suppose true for n − 1. Let A ∈ Rn×n be symmetric positive definite.
 
B a
A= , B ∈ Rn−1×n−1 symmetric, a ∈ Rn−1 .
aT ann
Note that ann = eTn Aen > 0 since A is positive definite. Let y ∈ Rn−1 , y 6= 0, show y T By > 0.
Since A is positive definite,

 T    T   
\begin{pmatrix} y \\ 0 \end{pmatrix}^{T} A \begin{pmatrix} y \\ 0 \end{pmatrix}
= \begin{pmatrix} y \\ 0 \end{pmatrix}^{T} \begin{pmatrix} B & a \\ a^T & a_{nn} \end{pmatrix} \begin{pmatrix} y \\ 0 \end{pmatrix}
= y^T By > 0.

Induction hypothesis implies that there exists a unique upper triangular S with positive diagonal
entries such that B = S T S. Let
    
A = \begin{pmatrix} S^T S & a \\ a^T & a_{nn} \end{pmatrix}
  = \begin{pmatrix} S^T & \\ b^T & c \end{pmatrix} \begin{pmatrix} S & b \\ & c \end{pmatrix}
  = R^T R

where c ∈ R, b ∈ R^{n−1}. From the above system, a = S^T b and a_{nn} = b^T b + c^2. Since S is
nonsingular, we have b = S^{−T} a. If it can be shown that a_{nn} − b^T b > 0, then we can define
c = \sqrt{a_{nn} - b^T b}.
By a direct calculation, ann − bT b = ann − aT B −1 a. Define γ = B −1 a. Since A is positive
definite,
  
0 < \begin{pmatrix} \gamma^T & -1 \end{pmatrix}
\begin{pmatrix} B & a \\ a^T & a_{nn} \end{pmatrix}
\begin{pmatrix} \gamma \\ -1 \end{pmatrix}
= a_{nn} - a^T B^{-1} a
= a_{nn} - b^T b.

This is what we wanted to show.



Finally, we show uniqueness. Let A = RT R = R̃T R̃ where R̃ is upper triangular with positive
diagonal entries. Then R̃−T RT = R̃R−1 . Note that the inverse of an upper triangular matrix is
upper triangular and the product of two upper triangular matrices is upper triangular. The same
remark applies to lower triangular matrices. Hence R̃−T RT = R̃R−1 = D, a diagonal matrix.
Look at the (i, i) entry of R̃ = DR and of RT = R̃T D to obtain r̃ii = dii rii and rii = r̃ii dii or
dii = ±1. Since rii and r̃ii are positive, dii = 1 for every i and so D is the identity matrix which
implies that R = R̃. 
Example: Cholesky decomposition.
    
\begin{pmatrix} 2 & -1 & \\ -1 & 2 & -1 \\ & -1 & 2 \end{pmatrix}
= \begin{pmatrix} a & & \\ b & d & \\ c & e & f \end{pmatrix}
  \begin{pmatrix} a & b & c \\ & d & e \\ & & f \end{pmatrix}
= \begin{pmatrix} \sqrt{2} & & \\ -\frac{1}{\sqrt{2}} & \sqrt{\frac{3}{2}} & \\ 0 & -\sqrt{\frac{2}{3}} & \frac{2}{\sqrt{3}} \end{pmatrix}
  \begin{pmatrix} \sqrt{2} & -\frac{1}{\sqrt{2}} & 0 \\ & \sqrt{\frac{3}{2}} & -\sqrt{\frac{2}{3}} \\ & & \frac{2}{\sqrt{3}} \end{pmatrix}.

The variables can be solved in the order a, b, c, d, e, f .


When the matrix is symmetric positive definite, use the Cholesky decomposition instead of
GE. The execution time and storage requirement are halved!
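A minimal sketch of the factorization A = R^T R, written to follow the order a, b, c, d, e, f used above; it assumes A is symmetric positive definite (no pivoting, no checks).

import numpy as np

def cholesky_upper(A):
    n = A.shape[0]
    R = np.zeros_like(A, dtype=float)
    for i in range(n):
        # diagonal entry: what is left of a_ii after subtracting earlier rows of R
        R[i, i] = np.sqrt(A[i, i] - R[:i, i] @ R[:i, i])
        # remainder of row i of R
        R[i, i+1:] = (A[i, i+1:] - R[:i, i] @ R[:i, i+1:]) / R[i, i]
    return R

A = np.array([[2., -1., 0.], [-1., 2., -1.], [0., -1., 2.]])
R = cholesky_upper(A)
print(np.allclose(R.T @ R, A))                   # True
print(np.allclose(R, np.linalg.cholesky(A).T))   # NumPy returns the lower factor L = R^T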

3.5 Iterative Solvers


GE calculates the exact solution of a linear system in a finite number of steps in the absence
of roundoff errors. A second class of methods, called iterative solvers, is similar to FPI, usually
converging to the exact solution only after infinitely many iterations. At first, it may seem odd that one
would want to consider iterative solvers at all. There are several reasons. In many applications,
solution of partial differential equations for instance, the matrix is large, say, 106 × 106 or larger.
The matrix is called sparse meaning that most of the entries are zero, say, only O(106 ) nonzero
entries. GE requires storage of the entire matrix, which has 10^{12} elements. This is because during
the elimination process, zero entries can fill in, meaning that they become nonzero. In general, the
LU factors are full triangular matrices even though the original matrix is sparse. No computer
can store 1012 numbers in its memory and one would need to rely on secondary memory (disks)
which drastically slows down the solution process. Iterative solvers, on the other hand, require
only the storage of the nonzero entries. Another consideration is the number of operations
in GE which is approximately O(1018 /3). Assuming that each operation can be performed in
10−9 seconds, then GE will require O(109 /3) seconds or approximately 10 years.
A second reason for using iterative solvers is that sometimes, we have a good guess to the
solution. In this case, there is a good chance that iterative solvers will converge to a better
solution (to within some tolerance) much quicker than GE which cannot make use of a good
initial guess.
Iterative solvers, however, may not converge for arbitrary non–singular matrices. There are
classes of matrices, symmetric positive definite and strictly diagonally dominant matrices to
name two examples, for which certain iterative solvers are guaranteed to converge.
Suppose the given linear system is Ax = b. Let B be any non–singular matrix. The system
is equivalent to Bx + (A − B)x = b or
x = (I − B −1 A)x + B −1 b. (3.3)

It is natural to define an iterative method as

x(k+1) = Gx(k) + B −1 b, G = I − B −1 A. (3.4)

The matrix G is called the iteration matrix.


In general, B should be easy to invert and should be as close to A as possible, that is,
G = I − B −1 A should be small. For instance, if kGk is large, then the iteration may diverge.
These two criteria are, unfortunately, conflicting. If we choose B = A, then I − B −1 A = 0 and
only one iteration is necessary to compute the exact solution. However, in that iteration, we
just solve the original linear system. At the other extreme, if we choose B = I, linear solves
involving B are trivial, but I − B^{−1}A is, in general, not small and the iteration may not converge.
Note that (3.4) is a FPI. In the scalar case, recall that xk+1 = g(xk ) converges to a fixed
point x∗ of g if |g′ (x∗ )| < 1. In the present vector case, we might guess that kGk < 1 is a
condition for convergence. But which matrix norm should be taken?
Define e(k) = x(k) − x. Let ρ(G) denote the spectral radius of matrix G, that is,

ρ(G) = \max_i |λ_i|,

where λ_i are the eigenvalues of G. The following is necessary and sufficient for the convergence
of the above iterative method.

Theorem: Let G be a diagonalizable matrix: G = XDX −1 where D is diagonal and X is non–singular.


Let x(0) be any initial iterate. Then e(k) → 0 iff ρ(G) < 1.

Proof: Subtracting (3.3) and (3.4), we obtain e(k+1) = Ge(k) or e(k) = Gk e(0) . If ρ(G) ≥ 1, let e(0) be a
normalized eigenvector of G with eigenvalue of magnitude ρ(G). Clearly, e(n) does not converge
to 0.
On the other hand, suppose ρ(G) < 1. Then Gk = XD k X −1 . Since every eigenvalue of D has
magnitude less than one, D k → 0 and hence Gk → 0 which implies that e(k) → 0. 

We remark that the assumption of a diagonalizable matrix G in the above theorem is not
necessary. It merely serves to simplify the proof.
We can now state that for any matrix norm k · k induced by a vector norm | · |, the condition
‖G‖ < 1 is a sufficient condition for the FPI (3.4) to converge. To see this, let Gz = λz where
|λ| = ρ(G) and z is the corresponding eigenvector. Now

\|G\| = \max_{x \ne 0} \frac{|Gx|}{|x|} \ge \frac{|Gz|}{|z|} = \frac{|λ|\,|z|}{|z|} = ρ(G).

Jacobi and Gauss Seidel


We now discuss two classical iterative methods. Suppose the given matrix A = D + L + U where
D is a diagonal matrix containing the diagonal entries of A, L and U are the strict lower and
upper triangular parts of A. (These L and U should not be confused with the LU factors of A
in the GE process.) The Jacobi iteration takes B = D and thus

x^{(k+1)} = −D^{−1}(L + U)x^{(k)} + D^{−1}b.

Suppose x^{(k)} is known. The ith component of the new iterate is obtained by solving for x_i from the
ith equation, where the current values x_j^{(k)} are used for all j ≠ i:

x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j \ne i} a_{ij} x_j^{(k)} \right).

The Gauss Seidel iteration takes B = D + L and thus

x(k+1) = −(D + L)−1 U x(k) + (D + L)−1 b.

This scheme is similar to Jacobi scheme except that the most up–to–date components are used
in the definition:  
x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j < i} a_{ij} x_j^{(k+1)} - \sum_{j > i} a_{ij} x_j^{(k)} \right).

     
Example: Let A = \begin{pmatrix} 3 & 1 & 0 \\ 1 & 2 & 0 \\ 0 & 0 & -4 \end{pmatrix}, b = \begin{pmatrix} 5 \\ 5 \\ 0 \end{pmatrix}, with exact solution \begin{pmatrix} 1 \\ 2 \\ 0 \end{pmatrix}. For this problem

D = \begin{pmatrix} 3 & & \\ & 2 & \\ & & -4 \end{pmatrix}, \quad
L = \begin{pmatrix} 0 & & \\ 1 & 0 & \\ 0 & 0 & 0 \end{pmatrix}, \quad
U = \begin{pmatrix} 0 & 1 & 0 \\ & 0 & 0 \\ & & 0 \end{pmatrix}.

Using the initial iterate x^{(0)} = [0, 0, 0]^T, the Jacobi iterates are

x^{(1)} = \begin{pmatrix} 5/3 \\ 5/2 \\ 0 \end{pmatrix}, \quad
x^{(2)} = \begin{pmatrix} 5/6 \\ 5/3 \\ 0 \end{pmatrix}, \quad
x^{(3)} = \begin{pmatrix} 10/9 \\ 25/12 \\ 0 \end{pmatrix}

and they converge nicely to the exact solution (after infinitely many iterations). The Jacobi
iteration matrix is

G = \begin{pmatrix} 0 & -1/3 & 0 \\ -1/2 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},

which has eigenvalues 0, ±6^{−1/2}. Thus ρ(G) = 6^{−1/2} < 1, and so, as predicted, the iteration must converge.

The Gauss–Seidel iterates for the same initial guess are

x^{(1)} = \begin{pmatrix} 5/3 \\ 5/3 \\ 0 \end{pmatrix}, \quad
x^{(2)} = \begin{pmatrix} 10/9 \\ 35/18 \\ 0 \end{pmatrix}

and they converge faster to the solution compared to the Jacobi iteration. The iterates are given by

x_1^{(k+1)} = \frac{5 - x_2^{(k)} - 0\,x_3^{(k)}}{3}, \qquad
x_2^{(k+1)} = \frac{5 - x_1^{(k+1)} - 0\,x_3^{(k)}}{2}, \qquad
x_3^{(k+1)} = \frac{0 - 0\,x_1^{(k+1)} - 0\,x_2^{(k+1)}}{-4}.

The Gauss–Seidel iteration matrix is

G = \begin{pmatrix} 0 & -1/3 & 0 \\ 0 & 1/6 & 0 \\ 0 & 0 & 0 \end{pmatrix},

which has spectral radius 1/6 < 1. Thus this iteration must converge, and in fact it converges faster than the Jacobi iteration because it has
a smaller spectral radius.
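A minimal sketch of both iterations, using the componentwise formulas above; the iteration count is an arbitrary choice and no convergence test is included.

import numpy as np

def jacobi(A, b, x0, num_iter=50):
    n = len(b)
    x = x0.astype(float).copy()
    for _ in range(num_iter):
        x_new = np.empty(n)
        for i in range(n):
            s = A[i, :] @ x - A[i, i] * x[i]       # sum over j != i
            x_new[i] = (b[i] - s) / A[i, i]
        x = x_new
    return x

def gauss_seidel(A, b, x0, num_iter=50):
    n = len(b)
    x = x0.astype(float).copy()
    for _ in range(num_iter):
        for i in range(n):                          # overwrite x in place
            s = A[i, :] @ x - A[i, i] * x[i]        # uses the newest values for j < i
            x[i] = (b[i] - s) / A[i, i]
    return x

A = np.array([[3., 1., 0.], [1., 2., 0.], [0., 0., -4.]])
b = np.array([5., 5., 0.])
print(jacobi(A, b, np.zeros(3)))         # tends to [1, 2, 0]
print(gauss_seidel(A, b, np.zeros(3)))   # converges faster to [1, 2, 0]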

As already indicated, iterative methods can fail.


     
1 2 5 1

Example: Let A = 3 1  , b = 5 with exact solution 2. This system is equivalent to the one
  
−4 0 0
in the above example where the first two rows have been switched. Now
     
1 0 0 2
D= 1 , L = 3 0 , U = 0 .
−4 0 0

Using the initial iterate x(0) = [0, 0, 0]T , the Jacobi iterates are
     
5 −5 25
x(1) = 5 , x(2) = −10 , x(3) = 20
0 0 0
 
−2 √
and they diverge. The iteration matrix is G = −3  which has eigenvalues 0, ± 6 and
0

so ρ(G) = 6 > 1. It is no surprise that the iteration diverges.

The condition ρ(G) < 1 unambiguously decides whether an iteration converges or diverges.
However, calculating the spectral radius is often much more difficult (takes more work) than solv-
ing the linear system. We now give a simpler sufficient condition that determines convergence.
This condition is much simpler to apply.
Matrix A ∈ Rn×n is said to be strictly diagonally dominant if for every i,
|a_{ii}| > \sum_{j \ne i} |a_{ij}|.

   
3 1 −1 3 2 6
Example: Let A = 2 −5 2  , B = 1 8 1 . It is easy to check that A is strictly diagonally
1 6 8 9 2 −2
dominant while B is not.

The following theorem gives a simple condition for convergence.

Theorem: If A is strictly diagonally dominant, then the Jacobi and Gauss Seidel iterations converge for
any initial iterate.

Proof: Let λ be an eigenvalue of the iteration matrix G = I − B −1 A such that |λ| = ρ(G) and let
v be the corresponding eigenvector with |vm | = 1 and |vi | ≤ 1, ∀i. For the Jacobi method,

G = −D^{−1}(L + U), which implies that (L + U)v = −λDv. The mth row of this equation reads
\sum_{j \ne m} a_{mj} v_j = −λ a_{mm} v_m, which implies that

|λ| \le \frac{\sum_{j \ne m} |a_{mj}|\,|v_j|}{|a_{mm}|} \le \frac{\sum_{j \ne m} |a_{mj}|}{|a_{mm}|} < 1
since A is strictly diagonally dominant. The previous theorem shows that the iterates must
converge to the true solution.
For the Gauss Seidel method, G = −(L + D)^{−1}U, which implies that Uv = −λ(L + D)v. The
mth row of this equation reads \sum_{j > m} a_{mj} v_j = −λ \sum_{j \le m} a_{mj} v_j, which implies that

|λ| \le \frac{\sum_{j > m} |a_{mj}|}{|a_{mm}| - \sum_{j < m} |a_{mj}|}
    = \frac{\sum_{j > m} |a_{mj}|}{\sum_{j > m} |a_{mj}| + \left( |a_{mm}| - \sum_{j \ne m} |a_{mj}| \right)} < 1

since A is diagonally dominant. An application of the previous theorem yields the result. 

In light of this result, it pays to try to rearrange the matrix so that it is diagonally dominant.
This is only practical if the matrix is not too large. Note that not all matrices can become
diagonally dominant with a permutation. For instance, the matrix
 
3 2
1 1

is not diagonally dominant regardless of any permutation of rows or columns.


Although the Jacobi method usually converges slower than the Gauss Seidel method, it is
ideally suited for parallel computers. These methods are rarely used directly to solve linear
systems but are rather used as preconditioners which will be discussed later.
We also mention one relaxation method called successive over–relaxation or SOR where the
iterates are defined as

Dx(k+1) = ω[b − Lx(k+1) − U x(k) ] + (1 − ω)Dx(k) or

x(k+1) = Gx(k) + ω(D + ωL)−1 b, G = (D + ωL)−1 [(1 − ω)D − ωU ].


Here ω is a positive constant whose values must lie in (0, 2). When ω = 1, then SOR reduces to
the Gauss Seidel iteration. In some sense, the SOR iterate x(k+1) is a convex combination of a
Gauss Seidel iterate and the current iterate x(k) .
Theorem: Let A be an n × n matrix and G be the SOR iteration matrix. Then ρ(G) ≥ |1 − ω|.

Proof:

\det(G) = \det\!\left((D + ωL)^{-1}\right) \det[(1 − ω)D − ωU]
        = \frac{(1 - ω)^n \det(D)}{\det(D)}
        = (1 - ω)^n.

Thus the product of the eigenvalues of G must equal (1 − ω)^n. Hence |1 − ω|^n ≤ ρ(G)^n, which
implies the result. 

Thus if ω 6∈ (0, 2), SOR iterates will diverge in general. It turns out that if A is symmetric and
positive definite, then the SOR iterates converge for an arbitrary initial guess iff ω ∈ (0, 2). SOR
can converge quickly provided that one can choose an optimal value of the parameter ω which
in general is extremely difficult to find. The methods below do not require the user to specify
such parameters and thus are preferred over SOR.

Modern Iterative Solvers


The above classical solvers are rarely used to solve linear systems nowadays. Modern iterative
solvers converge much quicker and are more robust. For symmetric positive definite systems, the
conjugate gradient method is the preferred iterative solver while for non–symmetric matrices,
GMRES is a popular option. It is beyond the scope of this course to describe the convergence
properties of these algorithms. Suffice to say that the rate of convergence depends on the
condition number of the coefficient matrix. If the condition number is large, then the iteration
will converge very slowly. To solve the linear system Ax = b, iterative methods only require the
user to supply a subroutine which calculates Az given any vector z. Hence unlike in GE, there
is no need to allocate storage for the entire matrix. In fact, in case where the entries of A are
given by simple expressions, then there is no need to store A at all.

Example: Consider the n × n symmetric tridiagonal matrix


 
A = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}.

The subroutine which calculates y = Az is given by


y1 = 2z1 − z2
yi = 2zi − zi−1 − zi+1 , i = 2, · · · , n − 1
yn = 2zn − zn−1

Notice that there is no need to store the entries of A at all.
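With SciPy this idea can be expressed through a LinearOperator: we supply only the matrix–vector product and hand it to an iterative solver such as conjugate gradient (a sketch; the problem size, right-hand side, and the solver's default tolerance are arbitrary choices here).

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 1000

def matvec(z):
    y = 2.0 * z
    y[:-1] -= z[1:]     # superdiagonal contribution: -z_{i+1}
    y[1:]  -= z[:-1]    # subdiagonal contribution:  -z_{i-1}
    return y

A = LinearOperator((n, n), matvec=matvec)
b = np.ones(n)
x, info = cg(A, b)                         # info == 0 signals convergence
print(info, np.linalg.norm(matvec(x) - b)) # small residual

The matrix itself is never formed; only the subroutine for Az is needed, exactly as described above.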

One technique to accelerate the convergence is to precondition the system. In place of the
system Ax = b, we solve M −1 Ax = M −1 b where M is called a preconditioner. If the condition
number of M −1 A is much smaller than the condition number of A, the iterative solvers for the
new system will converge much quicker. This subject is an active area of research.
Chapter 4

Least Squares

In the last chapter, we solved linear systems where the coefficient matrix is square. Now we
consider the case where A is a rectangular matrix. Usually, there are more equations than
unknowns so that there is no solution in general. The system is said to be over–determined.
The method of least squares finds a solution which minimizes the residual.
Let A ∈ Rm×n where m > n. Given b ∈ Rm , there is in general no solution to the system
Ax = b for any x ∈ Rn . Assume that the columns of A are linearly independent so that rank
A = n. We require that the residual Ax − b be perpendicular to all vectors in the range space
of A. Recall that vectors p and q are perpendicular if pT q = 0. So the requirement that the
residual be perpendicular to the range space means that (Az)T (Ax − b) = 0 for every z ∈ Rn .
This implies that z T (AT Ax − AT b) = 0 for every z which implies that

AT Ax = AT b.

This is called the normal equation. It is not difficult to show that AT A is symmetric positive
definite. Symmetry is easy to show. To show positive definiteness, let z be any non–zero vector.
Then
z T (AT A)z = (Az)T (Az) = |Az|22 > 0
since z ≠ 0 and A has rank n. Consequently, the normal equation has a unique solution
x = x∗ , called the least squares solution of Ax = b. This system can be solved by Cholesky
factorization. We stress that in general Ax∗ 6= b but Ax∗ − b is perpendicular to the range space
of A.
Let us give an alternative derivation of the normal equation. This will explain why the
method is called least squares. Since Ax = b has no solution in general, we would like the
“solution” to give the smallest possible residual. That is, we find the x ∈ Rn which minimizes
|Ax − b|2 or equivalently, minimize

f (x) = |Ax − b|22 = (Ax − b)T (Ax − b) = xT AT Ax − 2bT Ax + bT b.

From calculus, the minimum occurs at the critical point of this quadratic function:

0 = ∇f = 2AT Ax − 2AT b.

Solve this equation to obtain AT Ax = AT b which is the normal equation again. Note that
the second derivative matrix is 2AT A which is symmetric positive definite and has positive
eigenvalues and so the critical point must be a local as well as global minimum.


It should be remarked that the method of normal equation is not the best method to solve
the least squares problem. The reason is that the condition number of the normal system is
κ(AT A) = κ(A)2 , the square of the condition number of A. (We have not defined the condition
number of a rectangular matrix.) This means that the normal equation can be much more
sensitive to roundoff errors than the solution of the least squares method by other approaches
(for instance, QR factorization which will be discussed later.) The virtue of the normal equation
is its simplicity.

4.1 Polynomial Models


The classical application of least squares is to fit a line through some data. Given data points
(t_1, y_1), \cdots, (t_m, y_m), m ≥ 3. We wish to find the best line y = at + c which fits the data. The
goodness of the fit is measured in terms of the size of the residual R = \sum_{i=1}^{m} (at_i + c - y_i)^2. Observe
that R = |Ax − b|_2^2 where

A = \begin{pmatrix} t_1 & 1 \\ \vdots & \vdots \\ t_m & 1 \end{pmatrix}, \quad
x = \begin{pmatrix} a \\ c \end{pmatrix}, \quad
b = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}.

The minimum of R can be found by the normal equation, for instance. Hence fitting a line is a
least squares problem.
Recall in the formulation of the least squares problem, we assume that rank A = n. This
means that the terms in the model are “linearly independent”. For instance, the model y =
at + c(2t) would lead to linearly dependent columns.
One can also fit the data using a quadratic y = pt^2 + qt + r for some real numbers p, q, r. In this
case, m ≥ 4. The method of least squares means that we minimize R = \sum_{i=1}^{m} (pt_i^2 + qt_i + r - y_i)^2 =
|Ax − b|_2^2 where

A = \begin{pmatrix} t_1^2 & t_1 & 1 \\ \vdots & \vdots & \vdots \\ t_m^2 & t_m & 1 \end{pmatrix}, \quad
x = \begin{pmatrix} p \\ q \\ r \end{pmatrix}, \quad
b = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}.

Example: Given data points (−1, 1), (0, 0), (1, 0), (2, −2). Fit a line and a quadratic through the data
using the method of least squares.
First fit a line. Define

A = \begin{pmatrix} -1 & 1 \\ 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{pmatrix}, \quad
x = \begin{pmatrix} a \\ c \end{pmatrix}, \quad
b = \begin{pmatrix} 1 \\ 0 \\ 0 \\ -2 \end{pmatrix}.

The method of least squares is to minimize |Ax − b|_2. The normal equation is A^T Ax = A^T b where

A^T A = \begin{pmatrix} 6 & 2 \\ 2 & 4 \end{pmatrix}, \qquad
A^T b = \begin{pmatrix} -5 \\ -1 \end{pmatrix}.

Solving this equation leads to x = [a, c]T = [−.9, .2]. Hence the line has the equation y =
−.9t + .2. Here |Ax − b|22 = .7.
Now fit a quadratic. Define

A = \begin{pmatrix} 1 & -1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \\ 4 & 2 & 1 \end{pmatrix}, \quad
x = \begin{pmatrix} p \\ q \\ r \end{pmatrix}, \quad
b = \begin{pmatrix} 1 \\ 0 \\ 0 \\ -2 \end{pmatrix}.

The method of least squares is to minimize |Ax − b|_2. The normal equation is A^T Ax = A^T b where

A^T A = \begin{pmatrix} 18 & 8 & 6 \\ 8 & 6 & 2 \\ 6 & 2 & 4 \end{pmatrix}, \qquad
A^T b = \begin{pmatrix} -7 \\ -5 \\ -1 \end{pmatrix}.
Solving this equation leads to x = [p, q, r]T = [−.25, −.65, .45]T . Hence the quadratic has the
equation y = −.25t2 − .65t + .45. Here |Ax − b|22 = .45. Why is this number smaller than the
corresponding one for the least squares line?
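A sketch of both fits in Python/NumPy, solving the normal equations directly (np.linalg.lstsq would be a more stable alternative):

import numpy as np

t = np.array([-1., 0., 1., 2.])
y = np.array([ 1., 0., 0., -2.])

A_line = np.column_stack([t, np.ones_like(t)])
a, c = np.linalg.solve(A_line.T @ A_line, A_line.T @ y)
print(a, c)                                    # -0.9, 0.2

A_quad = np.column_stack([t**2, t, np.ones_like(t)])
p, q, r = np.linalg.solve(A_quad.T @ A_quad, A_quad.T @ y)
print(p, q, r)                                 # -0.25, -0.65, 0.45
print(np.sum((A_quad @ [p, q, r] - y)**2))     # residual 0.45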

4.2 Trigonometric and Exponential Models


We have used polynomials (line and quadratic) to model data but others are possible. We shall
examine a trigonometric model as well as two exponential models. The choice of the model is a
non–trivial one. Sometimes, theory is available which predicts a certain model such as a linear
model. In these cases, the model is given. In other instances, for example, the average daily
temperature distribution of Winnipeg over several years, one expects the data to be periodic
and so a trigonometric model is appropriate while a linear model is not.
As always, given data (t1 , y1 ), · · · , (tm , ym ), m ≥ 4. Suppose we wish to fit the data with
the model y = c1 + c2 cos πt + c3 sin πt. The over–determined system of equations is Ax = b
where

A = \begin{pmatrix} 1 & \cos \pi t_1 & \sin \pi t_1 \\ \vdots & \vdots & \vdots \\ 1 & \cos \pi t_m & \sin \pi t_m \end{pmatrix}, \quad
b = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}, \quad
x = \begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix}.
In the method of least squares, we minimize |Ax − b|2 which can be solved by the normal
equation, for instance.

Example: Given data (0, 2), (.5, 0), (1, −1), (2, 1). Using the above trigonometric model, we obtain
     
A = \begin{pmatrix} 1 & \cos 0 & \sin 0 \\ 1 & \cos \pi/2 & \sin \pi/2 \\ 1 & \cos \pi & \sin \pi \\ 1 & \cos 2\pi & \sin 2\pi \end{pmatrix}
  = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & -1 & 0 \\ 1 & 1 & 0 \end{pmatrix}, \qquad
b = \begin{pmatrix} 2 \\ 0 \\ -1 \\ 1 \end{pmatrix}.

The normal equation is A^T Ax = A^T b where

A^T A = \begin{pmatrix} 4 & 1 & 1 \\ 1 & 3 & 0 \\ 1 & 0 & 1 \end{pmatrix}, \qquad
A^T b = \begin{pmatrix} 2 \\ 4 \\ 0 \end{pmatrix}.
Solve this system to get c1 = .25, c2 = 1.25, c3 = −.25. Thus the model is y = .25+1.25 cos πt−
.25 sin πt.

We now examine an exponential model y = c1 ec2 t . Using the same procedure as before to
fit the data would lead to a system of over–determined nonlinear equations which is much more
difficult to solve than the (linear) least squares problems which we have seen so far. A better
way is to take the log of the model to obtain
ln y = ln c1 + c2 t = c3 + c2 t
with c3 = ln c1 . Now the over–determined system becomes Ax = b where
   
t1 1 ln y1  
    c
A =  ... ...  , b =  ...  , x= 2 .
c3
tm 1 ln ym
After solving this least squares problem, the original parameter c1 can easily be recovered from
c1 = ec3 .
Example: Fit the data (0, e0 ), (1, e1 ), (2, e3 ) using the above exponential model. Now
   
0 1 0
A = 1 1 , b = 1 .
2 1 3
The normal equation is AT Ax = AT b where
     
5 3 7 c
T
A A= , T
A b= , x= 2 .
3 3 4 c3
Solve this system to get c2 = 1.5, c3 = −.1667 from which we obtain c1 = ec3 = .8465. Thus
the model is y = .8465e1.5t .
Of course, other exponential models are possible. We consider another one y = c1 tec2 t . As
before, take the log on both sides to obtain
ln y − ln t = ln c1 + c2 t = c3 + c2 t.
Notice the term ln t is placed on the left-hand side which makes this a linear least squares
problem Ax = b with
   
t1 1 ln y1 − ln t1  
 .. ..   .  c
A =  . . , b= .
. , x= 2 .
c3
tm 1 ln ym − ln tm
After solving this least squares problem, the original parameter c1 can be recovered from c1 = ec3 .
Example: Fit the data (.5, e0 ), (1, e1 ), (2, e3 ) using the new exponential model. Now
   
.5 1 .69315
A =  1 1 , b =  1 .
2 1 2.3069
The normal equation is AT Ax = AT b where
     
5.25 3.5 5.9603 c
T
A A= , T
A b= , x= 2 .
3.5 3 4 c3
Solve this system to get c2 = 1.1088, c3 = .039721 from which we obtain c1 = ec3 = 1.0405.
Thus the model is y = 1.0405te1.1088t .

Suppose we have a linearly independent set {x1 , x2 , ..., xn }. We want an orthonormal set
{u1 , u2 , ..., un } spanning the same subspace.
Gram Schmidt Procedure

u_1 = \frac{x_1}{|x_1|_2}

y_2 = x_2 - (x_2^T u_1)\,u_1, \qquad u_2 = \frac{y_2}{|y_2|_2}

y_3 = x_3 - (x_3^T u_1)\,u_1 - (x_3^T u_2)\,u_2, \qquad u_3 = \frac{y_3}{|y_3|_2}

\vdots

y_k = x_k - (x_k^T u_1)\,u_1 - \cdots - (x_k^T u_{k-1})\,u_{k-1}, \qquad u_k = \frac{y_k}{|y_k|_2}.

It is not difficult to check that uTi uj = δij and that {u1 , u2 , ..., un } spans the same space as
{x1 , x2 , ..., xn }.

Let A = [x1 | · · · |xn ] ∈ Rm×n . Assume m ≥ n and rank A = n. In matrix notation, Gram
Schmidt gives a rectangular factorization A = QR, where Q ∈ Rm×n is orthogonal (QT Q = I)
and R ∈ Rn×n is upper triangular.
Note that R is a non-singular matrix since

n = rankA = rank(QR) = rankR.

Denote the columns of Q by uj and so Q = [u1 | · · · |un ].

Example:

A = \begin{pmatrix} 1 & 1 & 3 \\ 0 & 2 & 1 \\ 0 & 0 & 1 \\ -1 & -1 & -1 \end{pmatrix}
= QR
= \begin{pmatrix} \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{3}} \\ 0 & 1 & 0 \\ 0 & 0 & \frac{1}{\sqrt{3}} \\ -\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{3}} \end{pmatrix}
  \begin{pmatrix} \sqrt{2} & \sqrt{2} & 2\sqrt{2} \\ 0 & 2 & 1 \\ 0 & 0 & \sqrt{3} \end{pmatrix}.

We now use the QR factorization to solve the least squares problem \min_{x \in R^n} |Ax − b|_2 for A ∈
R^{m×n} with rank A = n. Recall that the solution is given by the normal equation A^T Ax = A^T b.
Let A = QR be the QR factorization. On substitution into the normal equation, we obtain
RT QT QRx = RT QT b. Since Q is orthogonal and R is invertible, we obtain

Rx = QT b.

The solution can easily be obtained after a simple back substitution. This method of solution
is numerically stable and better than the method of normal equation.
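A sketch of classical Gram Schmidt and the resulting least squares solve Rx = Q^T b; the right-hand side b below is an arbitrary example.

import numpy as np

def gram_schmidt_qr(A):
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        y = A[:, k].astype(float)
        for j in range(k):
            R[j, k] = Q[:, j] @ A[:, k]        # projection coefficient
            y = y - R[j, k] * Q[:, j]
        R[k, k] = np.linalg.norm(y)
        Q[:, k] = y / R[k, k]
    return Q, R

A = np.array([[1., 1., 3.], [0., 2., 1.], [0., 0., 1.], [-1., -1., -1.]])
b = np.array([1., 2., 3., 4.])
Q, R = gram_schmidt_qr(A)
x = np.linalg.solve(R, Q.T @ b)                # least squares solution of Ax ~ b
print(np.allclose(A.T @ A @ x, A.T @ b))       # it satisfies the normal equations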
Chapter 5

Interpolation and Approximation

Given data points (x1 , y1 ), · · · , (xn , yn ). Throughout this chapter, it is assumed that xi 6= xj
if i 6= j. The goal is to find a function f (x) which interpolates the data points. That is, the
function satisfies yi = f (xi ), i = 1, · · · , n. The simplest function to choose is a polynomial.
However polynomial interpolation is unstable if n is not small, say, n > 5 unless the nodes {xi }
are chosen properly. For larger values of n, it is better to use several lower–order polynomials.
This is the topic of splines.

5.1 Polynomial Interpolation


Given data points (x1 , y1 ), · · · , (xn , yn ). Let Pm denote the set of polynomials of degree m or
smaller. The goal is to find p ∈ Pn−1 which interpolates the data. This means that p(xi ) =
yi , i = 1, · · · , n.
The simplest case is n = 1. Given point (x1 , y1 ), then the constant polynomial y = y1 is the
unique polynomial interpolant. When n = 2, then the line which passes through the two given
points is the unique polynomial interpolant. This line is
y = y_1 + \frac{y_2 - y_1}{x_2 - x_1}(x - x_1).

In case n = 3, the formula for the polynomial interpolant is more complicated. Let p(x) = a+bx+
cx2 be the polynomial which passes through the three given points. Hence p(xi ) = yi , i = 1, 2, 3.
This yields a linear system of three equations for the three unknown coefficients:
    
\begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \end{pmatrix}
\begin{pmatrix} a \\ b \\ c \end{pmatrix}
= \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}.

The matrix is called a Vandermonde matrix. It can be shown that the n × n Vandermonde
matrix has determinant \prod_{1 \le i < j \le n} (x_j - x_i), which is non-zero. Hence the linear system which
determines the coefficients of the interpolating polynomial has a unique solution. This method
of finding the coefficients takes O(n3 ) operations which is highly inefficient and it is also highly
inaccurate since the coefficient matrix is ill–conditioned. (The condition number of the n × n
Vandermonde matrix grows exponentially quickly as a function of n.)
We shall give two other approaches to the interpolation problem: Lagrange interpolation
and Newton’s divided difference. The difference between all the methods lies in the basis used.


Lagrange Interpolation
Fix n, the number of data points. Define the polynomials of degree n − 1
L_i(x) = \prod_{j=1,\, j \ne i}^{n} \frac{x - x_j}{x_i - x_j}, \qquad i = 1, \cdots, n.    (5.1)

These polynomials have the special property that



L_i(x_j) = \delta_{ij} = \begin{cases} 1, & i = j; \\ 0, & i \ne j. \end{cases}

Using this property, it is easy to verify that the unique polynomial interpolant (of degree n − 1
or less) of the data points (x1 , y1 ), · · · , (xn , yn ) is
p(x) = \sum_{i=1}^{n} y_i L_i(x).

Indeed, for each j, p(x_j) = \sum_{i=1}^{n} y_i L_i(x_j) = \sum_{i=1}^{n} y_i \delta_{ij} = y_j. This is called the Lagrange form of
the interpolant.
We now show that there is a unique polynomial interpolant.
Theorem: Given the data points (x1 , y1 ), · · · , (xn , yn ). There exists a unique polynomial p of degree at
most n − 1 so that yi = p(xi ), i = 1, · · · , n.

Proof: The Lagrange form of the interpolant gives a polynomial of degree n − 1 which interpolates the
data. Suppose p and q are polynomials of degree n − 1 or less which interpolate the data. Define
d = p − q which is a polynomial of degree at most n − 1. We wish to show that d is the identically
zero function so that the polynomial interpolant is unique. Now d(xj ) = 0, j = 1, · · · , n. By
the Fundamental Theorem of Algebra, a polynomial of degree at most n − 1 has at most n − 1
zeroes or is the zero polynomial. In this case, d ≡ 0. 

Example: Given the data (0, 2), (1, 1), (2, 0), (3, −1). They happen to lie along the straight line y = −x+2
which is the unique polynomial interpolant. Notice that this polynomial has degree one and not
three.

Example: Find the Lagrange interpolating polynomial for the data (0, 1), (2, 2), (3, 4).
First, calculate the polynomials Li :
(x − 2)(x − 3) (x − 2)(x − 3)
L1 (x) = = ,
(0 − 2)(0 − 3) 6
(x − 0)(x − 3) x(x − 3)
L2 (x) = = ,
(2 − 0)(2 − 3) −2
(x − 0)(x − 2) x(x − 2)
L3 (x) = = .
(3 − 0)(3 − 2) 3
Consequently, the interpolating polynomial is
(x − 2)(x − 3) x(x − 3) x(x − 2) x2 x
p(x) = 1 · +2· +4· = − + 1.
6 −2 3 2 2

For x different from the nodes {x_i}, evaluation of the Lagrange polynomial interpolant p(x) takes O(n^2)
operations. There is an alternative form which takes only O(n) operations. Define

\phi_n(x) = \prod_{i=1}^{n} (x - x_i), \qquad w_i = \frac{1}{\prod_{j \ne i} (x_i - x_j)}.

Then the polynomial interpolant for the data (x_i, y_i) is

p(x) = \phi_n(x) \sum_{i=1}^{n} \frac{y_i w_i}{x - x_i}.

Since this is exact for the constant function y ≡ 1,

1 = \phi_n(x) \sum_{i=1}^{n} \frac{w_i}{x - x_i}

and so for x distinct from the nodes,

p(x) = \frac{\displaystyle\sum_{i=1}^{n} \frac{w_i}{x - x_i}\, y_i}{\displaystyle\sum_{i=1}^{n} \frac{w_i}{x - x_i}}.

Once the weights wi have been computed, evaluation of p(x) takes O(n) operations. This form
is known as the Barycentric Lagrange interpolant.
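A sketch of the barycentric form: the weights cost O(n^2) once, and each evaluation then costs O(n). It assumes the evaluation point is not one of the nodes.

import numpy as np

def barycentric_weights(xs):
    n = len(xs)
    w = np.ones(n)
    for i in range(n):
        for j in range(n):
            if j != i:
                w[i] /= (xs[i] - xs[j])
    return w

def barycentric_eval(x, xs, ys, w):
    c = w / (x - xs)                  # assumes x is not one of the nodes
    return np.sum(c * ys) / np.sum(c)

xs = np.array([0., 2., 3.])
ys = np.array([1., 2., 4.])
w = barycentric_weights(xs)
print(barycentric_eval(1.0, xs, ys, w))   # p(1) = 1/2 - 1/2 + 1 = 1.0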

Newton’s Divided Differences


We have seen that there is a unique interpolating polynomial for a given set of data points.
Newton’s divided differences gives the same polynomial interpolant as the Lagrange interpolant
but it does so in a more efficient manner as we shall see.
Given data (x1 , y1 ), · · · , (xn , yn ). For any function f , define the sequence of divided differ-
ences

f[x_i] = f(x_i), \quad i = 1, \cdots, n;

f[x_i, x_{i+1}] = \frac{f[x_{i+1}] - f[x_i]}{x_{i+1} - x_i}, \quad i = 1, \cdots, n-1;

f[x_i, x_{i+1}, x_{i+2}] = \frac{f[x_{i+1}, x_{i+2}] - f[x_i, x_{i+1}]}{x_{i+2} - x_i}, \quad i = 1, \cdots, n-2;

f[x_i, x_{i+1}, x_{i+2}, x_{i+3}] = \frac{f[x_{i+1}, x_{i+2}, x_{i+3}] - f[x_i, x_{i+1}, x_{i+2}]}{x_{i+3} - x_i}, \quad i = 1, \cdots, n-3;

\vdots

f[x_i, \cdots, x_n] = \frac{f[x_{i+1}, \cdots, x_n] - f[x_i, \cdots, x_{n-1}]}{x_n - x_i}, \quad i = 1, \cdots, n-1.
From the key observation that

1, x − x1 , (x − x1 )(x − x2 ), (x − x1 )(x − x2 )(x − x3 ), · · · , (x − x1 ) · · · (x − xn−1 )

is a basis for the space of polynomials of degree n − 1 or lower, we can express p, any polynomial
of degree n − 1 or lower by

p(x) = c1 + c2 (x − x1 ) + c3 (x − x1 )(x − x2 ) + · · · + cn (x − x1 ) · · · (x − xn−1 ). (5.2)



Theorem: Let p be the interpolating polynomial of degree at most n − 1 for the data (x1 , y1 ), · · · , (xn , yn ).
Then
cj = p[x1 , · · · , xj ], 1≤j≤n
where cj is defined in (5.2).

Proof: The proof is induction on n. Notice that y1 = p(x1 ) = p[x1 ] = c1 . Next, y2 = p(x2 ) =
c1 + c2 (x2 − x1 ) and so
c_2 = \frac{p[x_2] - p[x_1]}{x_2 - x_1} = p[x_1, x_2].
(We also show the case n = 2 because it is instructive. Strictly speaking it is not necessary for
the proof.)
Assume the statement of the theorem holds for n data points. Now consider the case with n + 1
data points. Let p be the unique interpolating polynomial of degree at most n. Let q be the
unique polynomial of degree n − 1 or less interpolating (x1 , y1 ), · · · , (xn , yn ) and let r be the
unique polynomial of degree n − 1 or less interpolating (x2 , y2 ), · · · , (xn+1 , yn+1 ). We claim, to
be shown later, that
p(x) = q(x) + \frac{x - x_1}{x_{n+1} - x_1}\,(r(x) - q(x)).    (5.3)
Therefore the coefficient of xn of p is the same as that of the expression on the right, that is,

c_{n+1} = \frac{p[x_2, \cdots, x_{n+1}] - p[x_1, \cdots, x_n]}{x_{n+1} - x_1} = p[x_1, \cdots, x_{n+1}]

by the induction hypothesis and the definition of divided differences.


The proof is complete once we show the claim. Note that q(x1 ) = p(x1 ) and q(xn+1 ) + r(xn+1 ) −
q(xn+1 ) = r(xn+1 ) = p(xn+1 ). For 1 < i < n + 1, we have

q(x_i) + \frac{x_i - x_1}{x_{n+1} - x_1}\,(r(x_i) - q(x_i)) = q(x_i) = p(x_i).

Therefore p and the righthand side of (5.3) agree at n + 1 distinct points and since they are both
polynomials of degree at most n, they must in fact be the same polynomial. 

A corollary of this theorem is that p[x1 , · · · , xn ] is a symmetric function meaning that


the order of nodes in the argument is arbitrary. (For instance, p[x1 , x2 , x3 ] = p[x1 , x3 , x2 ] =
p[x2 , x1 , x3 ].) To see this, observe that the coefficient of xn−1 of p, the polynomial interpolant,
is p[x1 , · · · , xn ]. Now the interpolant is unique and is independent of the ordering of the nodes
and so is the coefficient of xn−1 .
Another observation is that some or all of the arguments in p[x1 , · · · , xn ] can be the same.
For instance,
p[x_1, x_1] = \lim_{x \to x_1} \frac{p[x] - p[x_1]}{x - x_1} = p'(x_1)

and p[\underbrace{x_1, \cdots, x_1}_{n}] = \frac{p^{(n-1)}(x_1)}{(n-1)!} by induction.
Using a divided difference table, it is easy to read off the coefficients of the interpolating
polynomial. The case n = 3 is shown below

x1 p[x1 ]
p[x1 , x2 ]
x2 p[x2 ] p[x1 , x2 , x3 ] .
p[x2 , x3 ]
x3 p[x3 ]
Example: Given the data (0, 1), (2, 2), (3, 4). Construct the Newton’s divided difference table and write
down the interpolating polynomial of degree 2.

    0    1
                1/2
    2    2              1/2
                2
    3    4

The interpolating polynomial is

    p(x) = 1 + \frac{1}{2} x + \frac{1}{2} x(x - 2) = \frac{x^2}{2} - \frac{x}{2} + 1.
One advantage of the Newton divided difference form over the Lagrange form is that if a new
data point arrives, one can simply add one more row to the current divided difference table
using O(n) operations. In contrast, the Lagrange form must be rebuilt from scratch, so the
Newton form is far more efficient in this situation.
Example: Continuing with the last example, suppose we add a new data point (1, 0). We append a new
row at the bottom of the above table to obtain
    0    1
                1/2
    2    2              1/2
                2                  -1/2
    3    4              0
                2
    1    0

The new interpolating polynomial is

    p(x) = 1 + \frac{1}{2} x + \frac{1}{2} x(x - 2) - \frac{1}{2} x(x - 2)(x - 3).
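The computation above can be organized in a few lines of code. The following Python sketch (not part of the notes; the function and variable names are ours) builds the divided difference coefficients in place and evaluates the Newton form by nested multiplication; it reproduces the coefficients 1, 1/2, 1/2, −1/2 of this example.

    import numpy as np

    def divided_differences(x, y):
        """Return the Newton coefficients c_j = f[x_0, ..., x_j] for the data (x_j, y_j)."""
        x = np.asarray(x, dtype=float)
        c = np.asarray(y, dtype=float).copy()
        n = len(x)
        for k in range(1, n):
            # after pass k, c[i] holds f[x_{i-k}, ..., x_i]
            c[k:] = (c[k:] - c[k-1:-1]) / (x[k:] - x[:-k])
        return c

    def newton_eval(x_nodes, c, t):
        """Evaluate p(t) = c_0 + c_1 (t-x_0) + ... by nested multiplication."""
        p = c[-1]
        for xj, cj in zip(x_nodes[-2::-1], c[-2::-1]):
            p = cj + (t - xj) * p
        return p

    # data from the example above: (0,1), (2,2), (3,4), then the extra point (1,0)
    x, y = [0.0, 2.0, 3.0, 1.0], [1.0, 2.0, 4.0, 0.0]
    c = divided_differences(x, y)        # [1, 0.5, 0.5, -0.5]
    print(c, newton_eval(x, c, 2.0))     # p(2) reproduces y = 2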
Example: Given the data (0, 2), (1, 1), (2, 0), (3, −1). Construct the Newton’s divided difference table
and write down the interpolating polynomial.

    0     2
                -1
    1     1              0
                -1                  0
    2     0              0
                -1
    3    -1
The interpolating polynomial is p(x) = 2 − x.

Example: Given the data (0, 1), (2, 2), (3, 4). How many degree three polynomials interpolate these three
points?
Recall that there is a unique polynomial of degree two which interpolates these points:
p(x) = x^2/2 − x/2 + 1. There are actually infinitely many polynomials of degree three which
interpolate these points:

    q(x) = \frac{x^2}{2} - \frac{x}{2} + 1 + c\, x(x - 2)(x - 3)

works for every real number c.

Example: How many polynomials of degree d, where 0 ≤ d < ∞, pass through the points (−1, −5), (0, −1), (2, 1), (3, 11)?
First construct the divided difference table:

    -1   -5
                 4
     0   -1             -1
                 1                  1
     2    1              3
                10
     3   11
Hence the unique interpolating polynomial of degree three is

p(x) = −5 + 4(x + 1) − (x + 1)x + (x + 1)x(x − 2).

Therefore there can be no interpolating polynomials of degree smaller than three and there are
infinitely many interpolating polynomials of degree larger than three.

Example: Interpolation can be used to approximate the values of complicated functions using only additions
and multiplications which are needed in evaluating polynomials. As a simple example, consider
approximating sin x by a third degree polynomial.
First, since sin is 2π–periodic, we can restrict the argument x ∈ [0, 2π). Using symmetries
sin x = − sin(2π − x) for x ∈ [π, 2π) and sin x = sin(π − x) for x ∈ [π/2, π], we can further
restrict x ∈ [0, π/2]. For a third degree polynomial, we construct the divided difference table for
sin x using four equally spaced points 0, π/6, π/3, π/2:
    0       0
                   .9549
    π/6     .5                -.2443
                   .6991                 -.1139
    π/3     .8660             -.4232
                   .2559
    π/2     1

Hence the third degree polynomial which interpolates sin x at the above four points is

    p(x) = .9549\,x - .2443\,x\left(x - \frac{\pi}{6}\right) - .1139\,x\left(x - \frac{\pi}{6}\right)\left(x - \frac{\pi}{3}\right),    x ∈ [0, π/2].
The absolute error is less than .01 for all x. Of course, the error can be made as small as desired
by taking sufficiently many points. In some sense, this is a compression problem: reducing the
sine function to the coefficients of the interpolating polynomial.

The following polynomial interpolation error holds:

Theorem: Let n be a positive integer. Given f ∈ C n [a, b] and distinct points {xj , j = 1, · · · , n} lying in
[a, b]. Let p be the (unique) polynomial interpolant of f of degree n − 1 or less over {xj }. Then
for every x ∈ [a, b],

    f(x) - p(x) = \frac{f^{(n)}(\xi)}{n!} \prod_{j=1}^{n} (x - x_j)        (5.4)

where ξ = ξ(x) is some number in (a, b). Furthermore,

    \frac{f^{(n)}(\xi)}{n!} = f[x_1, ..., x_n, x]        (5.5)

and this is a continuous function of x.

Proof: Fix some x ∈ [a, b]. If x is a node xj , then the result is clearly true. So suppose x is not a node.
Define ψ(z) = f (z) − p(z) − αφ(z) for z ∈ R where φ is the product in (5.4) and α ∈ R is chosen
such that ψ(x) = 0. Thus ψ has at least n + 1 zeroes in [a, b], namely, x, x1 , · · · , xn . By Rolle’s
theorem, ψ ′ has at least n zeroes different from the points just enumerated. By repeatedly
applying Rolle’s theorem, ψ (n) has some zero ξ ∈ (a, b):

0 = ψ (n) (ξ) = f (n) (ξ) − p(n) (ξ) − αφ(n) (ξ).

Noting that p is a polynomial of degree at most n − 1 while the leading term of φ is x^n, it follows
that 0 = f^{(n)}(ξ) − α n! and so α = f^{(n)}(ξ)/n!. Therefore, 0 = ψ(x) = f(x) − p(x) − αφ(x), which
gives (5.4).
Let x ∈ [a, b] be distinct from the nodes. Continuity of f [x1 , · · · , xn , x] as a function of x follows
by induction. The case n = 0 is trivial. Suppose it holds for n − 1. Let zj → x with each zj
distinct from the nodes. Now

    \lim_{j\to\infty} f[x_1, ..., x_n, z_j] = \lim_{j\to\infty} \frac{f[x_2, ..., x_n, z_j] - f[x_1, ..., x_n]}{z_j - x_1}
                                            = \frac{f[x_2, ..., x_n, x] - f[x_1, ..., x_n]}{x - x_1}
                                            = f[x_1, ..., x_n, x].

Now if x = xk for some k > 1, then the above argument still holds. If x = x1, then recall that
divided differences are symmetric in their arguments and so we can permute, for instance, the first two arguments.
Let p be the polynomial interpolant as in the statement of this theorem. Assume first that x is
distinct from the nodes. Then the polynomial interpolant for the points x1, · · · , xn, x is

    \hat p(z) = p(z) + f[x_1, ..., x_n, x] \prod_{j=1}^{n} (z - x_j),    z ∈ R

by (5.2). But f (x) = p̂(x). Uniqueness of polynomial interpolation and (5.4) imply (5.5). The
case if x is one of the nodes follows by continuity. 

Example: Find the smallest value of positive integer n so that for any n distinct points {x1 , · · · , xn } in
[0, 1], their polynomial interpolant p satisfies | sin x − p(x)| ≤ .001 for all x ∈ [0, 1].
From the above theorem, for any x ∈ [0, 1] there is some ξ ∈ (0, 1) with

    |\sin x - p(x)| = \frac{|\sin^{(n)}(\xi)|}{n!} \prod_{j=1}^{n} |x - x_j| \le \frac{1}{n!}.

Since 7! = 5040 and 1/5040 < .001, n = 7 is sufficient.

Example: Given f (x) = 2x3 + 5x − 1 and distinct nodes xj , j = 1, · · · , 5. Let p be the polynomial
interpolant of these nodes. Find the maximum difference |f (x) − p(x)| for x ∈ R.
The maximum difference is zero since from the above theorem, the difference is proportional to
f (5) = 0.

Example: Given f (x) = 2x3 + 5x − 1 and nodes x1 = 1, x2 = 2. Let p be the polynomial interpolant of
these nodes. Find the maximum difference |f (x) − p(x)| for x ∈ [1, 2].
Here, n = 2 and so f''(x) = 12x. Thus |f''(x)| ≤ 24 for all x ∈ [1, 2]. Let q(x) = (x−1)(x−2).
A simple estimate is |q(x)| ≤ 1 but we can do better. Since q is a parabola which vanishes at
x = 1, 2, its maximum magnitude in [1, 2] occurs at x = 3/2 with q(3/2) = −1/4. Hence

    \max_{x\in[1,2]} |f(x) - p(x)| \le \frac{\max_{x\in[1,2]}|f''(x)|}{2} \max_{x\in[1,2]} |q(x)| \le \frac{24}{2}\cdot\frac{1}{4} = 3.

From the interpolation error of the previous theorem, an immediate question is how to choose
the nodes {x_0, . . . , x_n} so that the error is as small as possible. The error is bounded by

    \frac{|f^{(n+1)}(\xi)|}{(n+1)!}\, |\varphi(x)|,    \varphi(x) = \prod_{j=0}^{n} (x - x_j).

For convenience, let the interval be [−1, 1]. We wish to determine {x_j} so that ||\varphi||_\infty is as small
as possible. Here, for any function g,

    ||g||_\infty = \max_{x\in[-1,1]} |g(x)|.

This is a classical problem in approximation theory. Its solution is given by Chebyshev


polynomials, which are defined as

T0 (x) = 1, T1 (x) = x, Tn+1 (x) = 2xTn (x) − Tn−1 (x),

for n ≥ 1. Note that Tn is a polynomial of degree n. By induction, it can be shown that for
n ≥ 0 and x ∈ [−1, 1],
Tn (x) = cos(n cos−1 x).
Observe that kTn k∞ = 1 and that the coefficient of xn of Tn is 2n−1 . Thus T̃n ≡ 21−n Tn is a
monic polynomial. It is remarkable that among all monic polynomials of degree n, T̃n achieves
the smallest supremum norm of 21−n on [−1, 1]. To see this, we argue by contradiction. Let
pn be a monic polynomial of degree n so that kpn k∞ < 21−n . Denote the extrema of T̃n by
yi = cos(iπ/n), i = 0, . . . , n. Observe that

(−1)i pn (yi ) ≤ |pn (yi )| < 21−n = (−1)i T̃n (yi ).



So (−1)i (T̃n (yi ) − pn (yi )) > 0 meaning that T̃n − pn oscillates with at least n zeroes in (−1, 1).
However, since both T̃n and pn are monic, T̃n − pn is a polynomial of degree at most n − 1 having
at least n roots. This is a contradiction.
Since φ is a polynomial of degree n + 1, apply the above minimal property of the Chebyshev
polynomial to conclude that
    ||\varphi||_\infty \ge \left\|\frac{T_{n+1}}{2^n}\right\|_\infty = \frac{1}{2^n}.

Equality is obtained if {x_j} is the set of zeroes of T_{n+1}, the so-called Chebyshev points:

    x_j = \cos\left(\frac{2j+1}{2n+2}\,\pi\right),    j = 0, ..., n.

For other choices of the nodes, for instance, equally spaced nodes, kφk∞ can be much larger.
Using Chebyshev points, the interpolation error can be bounded as

    ||f - p||_\infty \le \frac{||f^{(n+1)}||_\infty}{2^n (n+1)!},

where p is the polynomial of degree at most n interpolating f at the Chebyshev points.
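As a quick numerical illustration (a sketch, not part of the notes; the degree n = 10 is an arbitrary choice), one can compare max |φ(x)| on [−1, 1] for equally spaced nodes with the value 2^{−n} attained at the Chebyshev points:

    import numpy as np

    def phi(nodes, x):
        # nodal polynomial prod_j (x - x_j), evaluated on an array x
        return np.prod([x - t for t in nodes], axis=0)

    n = 10                                                      # phi has degree n + 1
    xe = np.linspace(-1.0, 1.0, n + 1)                          # equally spaced nodes
    xc = np.cos((2*np.arange(n + 1) + 1) * np.pi / (2*n + 2))   # Chebyshev points

    xx = np.linspace(-1.0, 1.0, 5001)
    print(np.max(np.abs(phi(xe, xx))))                  # noticeably larger than 2^{-n}
    print(np.max(np.abs(phi(xc, xx))), 2.0**(-n))       # equals 2^{-n} up to rounding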


Another question is the error of the derivative of the polynomial interpolant f ′ − p′ . Naively,
one might simply differentiate (5.4). However, ξ = ξ(x) and there is no simple way to obtain
ξ ′ (x). In fact, it is not clear that it is differentiable. There is an alternative approach.

Theorem: Let n be a positive integer. Given f ∈ C n [a, b] and distinct points {xj , j = 1, · · · , n} lying
in [a, b]. Let p be the (unique) polynomial interpolant of f of degree n − 1 or less over {xj }.
Then there exist distinct points zi , i = 1, · · · , n − 1 in (a, b) and for each x ∈ [a, b], there exists
η ∈ (a, b) so that

    f'(x) - p'(x) = \frac{f^{(n)}(\eta)}{(n-1)!} \prod_{j=1}^{n-1} (x - z_j).        (5.6)

Proof: Since f − p vanishes at the nodes, by Rolle’s theorem, there are points zi ∈ (xi , xi+1 ), i =
1, · · · , n − 1 so that f ′ − p′ vanishes at each zi . Fix x ∈ [a, b]. If x = zi , then clearly (5.6) holds.
Suppose x is distinct from the {z_j}. Define

    \psi(x) = f'(x) - p'(x) - \alpha \prod_{j=1}^{n-1} (x - z_j).

Choose α so that ψ(x) = 0. Of course ψ has at least n − 1 other roots at {zj }. Repeatedly apply
Rolle’s theorem to ψ to obtain the existence of η so that 0 = ψ (n−1) (η) from which (5.6) follows.


We conclude this section with some important theoretical results. Fix any positive integer
n. Define the interpolation operator I_n : C[0, 1] → C[0, 1] by

    I_n(f)(x) = \sum_{i=1}^{n} f(x_{in}) L_{in}(x),    f ∈ C[0, 1],  x ∈ [0, 1].

Here {xin , 1 ≤ i ≤ n} is the set of distinct nodes and Lin ∈ Pn−1 was defined in (5.1), where
the subscript n was omitted there. Of course In (f )(xin ) = f (xin ) for 1 ≤ i ≤ n. It is easy to
see that In is a linear operator. {In } is said to be consistent if for each polynomial p,

    \lim_{n\to\infty} ||I_n(p) - p||_\infty = 0.

{I_n} is defined to be stable if

    \sup_{n\ge 1} ||I_n||_\infty =: C < \infty.

Here, C is called the stability constant or Lebesgue constant. {I_n} is convergent if

    \lim_{n\to\infty} I_n(f) = f,    f ∈ C[0, 1].

There is an explicit expression for ||I_n||_\infty for each n:

    ||I_n||_\infty = \left\| \sum_{i=1}^{n} |L_{in}| \right\|_\infty.

To see this, let f ∈ C[0, 1]. Then

    ||I_n(f)||_\infty = \left\| \sum_{i=1}^{n} f(x_{in}) L_{in} \right\|_\infty \le ||f||_\infty \left\| \sum_{i=1}^{n} |L_{in}| \right\|_\infty.

This shows that

    ||I_n||_\infty = \sup_{f\in C[0,1]\setminus 0} \frac{||I_n(f)||_\infty}{||f||_\infty} \le \left\| \sum_{i=1}^{n} |L_{in}| \right\|_\infty.

To show that equality is achieved, let x_* ∈ [0, 1] be such that

    \left\| \sum_{i=1}^{n} |L_{in}| \right\|_\infty = \sum_{i=1}^{n} |L_{in}(x_*)|.

Define any f_* ∈ C[0, 1] so that ||f_*||_\infty = 1 and

    f_*(x_{in}) = \mathrm{sign}\, L_{in}(x_*),    1 ≤ i ≤ n.

Then

    |I_n(f_*)(x_*)| = \left| \sum_{i=1}^{n} f_*(x_{in}) L_{in}(x_*) \right| = \sum_{i=1}^{n} |L_{in}(x_*)|,

implying that

    ||I_n||_\infty \ge \frac{||I_n(f_*)||_\infty}{||f_*||_\infty} \ge \frac{|I_n(f_*)(x_*)|}{1} = \sum_{i=1}^{n} |L_{in}(x_*)|.
An example of an unstable interpolating family is one with equally spaced nodes: xin = i/n.
For Chebyshev nodes, it can be shown that the interpolating family is stable with

    ||I_n||_\infty \le 1 + \frac{2}{\pi} \log(n + 1),    n ≥ 1.

(Here the interval is [−1, 1] rather than [0, 1].)
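A rough numerical check of these claims can be made by estimating the Lebesgue constant on a fine grid. The following sketch (ours, not from the notes; the grid size 4001 and the values of n are arbitrary) evaluates the sup of Σ|L_in| on [−1, 1] for equally spaced and Chebyshev nodes and compares the latter with the logarithmic bound quoted above.

    import numpy as np

    def lebesgue_constant(nodes, m=4001):
        xx = np.linspace(-1.0, 1.0, m)            # sample the interval [-1, 1]
        lam = np.zeros_like(xx)
        for i, xi in enumerate(nodes):
            Li = np.ones_like(xx)
            for j, xj in enumerate(nodes):
                if j != i:
                    Li *= (xx - xj) / (xi - xj)   # Lagrange basis L_i on the grid
            lam += np.abs(Li)
        return lam.max()

    for n in (5, 10, 15, 20):
        eq = np.linspace(-1.0, 1.0, n)                          # equally spaced
        ch = np.cos((2*np.arange(n) + 1) * np.pi / (2*n))       # Chebyshev
        print(n, lebesgue_constant(eq), lebesgue_constant(ch),
              1 + 2/np.pi*np.log(n + 1))                        # bound quoted above

The equispaced values grow rapidly with n while the Chebyshev values stay close to the logarithmic bound.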
The next result relates the error of the best polynomial approximation and that of the
polynomial interpolant.

Theorem: Let f ∈ C[0, 1] and n be a positive integer. Let p*_n be the best polynomial approximation of f
of degree at most n − 1:

    ||f - p*_n||_\infty \le ||f - p||_\infty,    p ∈ P_{n-1},

and let p_n = I_n(f) ∈ P_{n-1} be the polynomial interpolant of f at the nodes {x_{in}}. Then

    ||f - p_n||_\infty \le (1 + ||I_n||_\infty)\, ||f - p*_n||_\infty.

Proof: Observe that In (p∗n ) = p∗n . Then

f − pn = f − In (f )
= f − In (p∗n ) + In (p∗n − f )
= f − p∗n − In (f − p∗n ).

Hence
kf − pn k∞ ≤ kf − p∗n k∞ + kIn k∞ kf − p∗n k∞ ≤ (1 + kIn k∞ ) kf − p∗n k∞ .

A theoretical result from functional analysis will be needed.

Theorem: Principle of Uniform Boundedness. Let X and Y be Banach spaces. Suppose {Tn , n ≥ 1}
is a family of linear operators from X to Y which is pointwise bounded:

    \sup_{n\ge 1} ||T_n x||_Y \le C_x ||x||_X,    x ∈ X.

Here C_x is a positive number depending on x, but independent of n. Then {T_n} is uniformly
bounded:

    \sup_{n\ge 1} ||T_n|| < \infty.

Now we are ready for one of the most important results of this chapter. It states that for
a consistent family of interpolation operators, stability is equivalent to convergence. Results of
this flavour are pervasive in numerical analysis.

Theorem: Suppose {In } is consistent. Then it is stable iff it is convergent.

Proof: Suppose {I_n} is stable. Let f ∈ C[0, 1] and ǫ > 0 be given; we need to find some integer N so that

    ||I_n(f) - f||_\infty < ǫ    for every n ≥ N.

From the Weierstrass Approximation Theorem, there is some polynomial p so that

    ||f - p||_\infty < \frac{ǫ}{2(1 + C)},

where C is the stability constant. By consistency, there is some integer N so that for all n ≥ N,

    ||I_n(p) - p||_\infty < \frac{ǫ}{2}.

Then for every n ≥ N,

    ||I_n(f) - f||_\infty \le ||I_n(f) - I_n(p)||_\infty + ||I_n(p) - p||_\infty + ||p - f||_\infty
                          < ||I_n||_\infty ||f - p||_\infty + \frac{ǫ}{2} + \frac{ǫ}{2(1 + C)}
                          \le C\,\frac{ǫ}{2(1 + C)} + \frac{ǫ}{2} + \frac{ǫ}{2(1 + C)}
                          = ǫ.

Suppose {In } is convergent. That is, for any f ∈ C[0, 1],

kIn (f ) − f k∞ → 0.

Therefore, there is some real number Cf which depends on f , but is independent of n, so that

kIn (f )k∞ ≤ Cf kf k∞ .

This means that {In } is pointwise bounded. By the Principle of Uniform Boundedness, supn≥1 kIn k∞ <
∞. This means that {In } is stable. 

5.2 Hermite Interpolation


Given the data (x1, y1), · · · , (xn, yn), we discussed interpolation of the data by a polynomial p in
the previous section. A generalization is to require the interpolating polynomial to take on
given derivative values at the nodes as well. There are 2n conditions and so we expect p to be
a polynomial of degree 2n − 1. This is called Hermite interpolation and the unique polynomial
interpolant is called the Hermite interpolation polynomial.

Theorem: Let n be a positive integer. Given distinct nodes x1 , · · · , xn and numbers yi , zi , i = 1, · · · , n.


There is a unique polynomial p of degree 2n − 1 or less such that

p(xi ) = yi , p′ (xi ) = zi , i = 1, · · · , n.

Proof: The case n = 1 is trivial since the unique Hermite interpolant is the line p(x) = z1 (x − x1 ) + y1 .
Suppose n ≥ 2. Define the following polynomials of degree 2n − 1:

Mi (x) = Li (x)2 (1 − 2L′i (xi )(x − xi )), Ni (x) = Li (x)2 (x − xi )

where Li is the polynomial defined in (5.1). It is easy to check these new polynomials satisfy
the following properties. For all i, j,

1. Mi(xj) = δij, Mi′(xj) = 0,

2. Ni(xj) = 0, Ni′(xj) = δij.

Hence

    p(x) := \sum_{i=1}^{n} M_i(x)\, y_i + N_i(x)\, z_i

is an interpolating polynomial satisfying the 2n conditions.
Next we check that p is unique. Suppose q is a polynomial of degree 2n − 1 or less which satisfies
the same 2n conditions. Then p − q has at least n distinct zeroes at the nodes. By Rolle’s

theorem, p′ − q ′ has (at least) n − 1 distinct zeroes, (at least) one in (xi , xi+1 ). But p′ − q ′ also
vanishes at each node by assumption and so p′ − q ′ has at least 2n − 1 distinct zeroes. Since
p′ − q ′ is a polynomial of degree at most 2n − 2, it can be concluded that p′ − q ′ ≡ 0 or p − q is
a constant. Since p − q vanishes at the nodes, p ≡ q. 

Example: Find the Hermite polynomial p of degree 3 so that p(0) = 0, p(1) = 1, p′ (0) = 1, p′ (1) = 0.
Here n = 2 with
L1 (x) = 1 − x, L2 (x) = x
and so

M1 (x) = (1 − x)2 (1 + 2x), M2 (x) = x2 (1 − 2(x − 1)), N1 (x) = (1 − x)2 x, N2 (x) = x2 (x − 1).

Hence
p(x) = 0 · M1 (x) + 1 · M2 (x) + 1 · N1 (x) + 0 · N2 (x) = −x3 + x2 + x.
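The basis construction above translates directly into code. The following Python sketch (ours, not from the notes) assembles the Hermite interpolant from the M_i and N_i and checks it against this example.

    import numpy as np

    def hermite_interp(x_nodes, y, dy, t):
        x_nodes = np.asarray(x_nodes, float)
        t = np.asarray(t, float)
        p = np.zeros_like(t)
        for i, xi in enumerate(x_nodes):
            others = np.delete(x_nodes, i)
            # Lagrange basis L_i and its derivative at x_i
            Li = np.prod([(t - xj) / (xi - xj) for xj in others], axis=0)
            dLi_xi = np.sum(1.0 / (xi - others))
            Mi = Li**2 * (1 - 2 * dLi_xi * (t - xi))
            Ni = Li**2 * (t - xi)
            p += y[i] * Mi + dy[i] * Ni
        return p

    t = np.linspace(0, 1, 5)
    print(hermite_interp([0.0, 1.0], [0.0, 1.0], [1.0, 0.0], t))
    print(-t**3 + t**2 + t)        # the two rows agree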

Next, we give the error when a smooth function is interpolated by a Hermite polynomial
interpolant.

Theorem: Let n be a positive integer. Given f ∈ C 2n [a, b] and distinct points {xj , j = 1, · · · , n} lying
in [a, b]. Let p be the (unique) Hermite polynomial interpolant of f over {xj }. Then for every
x ∈ [a, b],

    f(x) - p(x) = \frac{f^{(2n)}(\xi)}{(2n)!} \prod_{j=1}^{n} (x - x_j)^2        (5.7)

where ξ = ξ(x) ∈ (a, b) is a continuous function of x.

Proof: Fix some x ∈ [a, b]. If x is a node xj , then the result is clearly true. So suppose x is not a node.
For any z ∈ R, define ψ(z) = f (z) − p(z) − αφ(z) where φ is the product in (5.7) and α ∈ R
is chosen such that ψ(x) = 0. Thus ψ has at least n + 1 zeroes in [a, b], namely, x, x1 , · · · , xn .
By Rolle’s theorem, ψ ′ has at least n zeroes different from the points just enumerated. Since ψ ′
also vanishes at the nodes by definition, ψ ′ has at least 2n distinct zeroes in [a, b]. By repeatedly
applying Rolle’s theorem, ψ (2n) has some zero ξ ∈ (a, b):

0 = ψ (2n) (ξ) = f (2n) (ξ) − p(2n) (ξ) − αφ(2n) (ξ).

Noting that p is a polynomial of degree at most 2n − 1 while the leading term of φ is x2n , it
follows that 0 = f (2n) (ξ) − α (2n)! and the claim follows upon solving for α. Next, we show
that ξ = ξ(x) is a continuous function. This is accomplished by using the Newton form of
interpolation.
Let xj,k → xj as k → ∞ for 1 ≤ j ≤ n. Assume that every element of the set {xj,k , xi , 1 ≤
i, j ≤ n, k ≥ 0} is distinct. Let qk be the interpolating polynomial of f of degree 2n − 1 at the
nodes x1 , x1,k , x2 , x2,k , . . . , xn , xn,k . Then

qk (x) = f (x1 ) + f [x1 , x1,k ](x − x1 ) + f [x1 , x1,k , x2 ](x − x1 )(x − x1,k )
+f [x1 , x1,k , x2 , x2,k ](x − x1 )(x − x1,k )(x − x2 ) + . . .
+f [x1 , x1,k , . . . , xn−1 , xn−1,k , xn ](x − x1 )(x − x1,k ) . . . (x − xn−1 )(x − xn−1,k )
+f [x1 , x1,k , . . . , xn , xn,k ](x − x1 )(x − x1,k ) . . . (x − xn−1 )(x − xn−1,k )(x − xn ).

Take the limit k → ∞ to obtain

q(x) := lim qk (x)


k→∞
= f (x1 ) + f [x1 , x1 ](x − x1 ) + f [x1 , x1 , x2 ](x − x1 )2 + f [x1 , x1 , x2 , x2 ](x − x1 )2 (x − x2 )
+ . . . + f [x1 , x1 , . . . , xn−1 , xn−1 , xn ](x − x1 )2 . . . (x − xn−1 )2
+f [x1 , x1 , . . . , xn , xn ](x − x1 )2 . . . (x − xn−1 )2 (x − xn ).

Induction can be used to show that q ≡ p, the Hermite interpolant of f at x1 , . . . , xn . Recall


that there is some continuous ξk = ξk (x) so that

    \frac{f^{(2n)}(\xi_k(x))}{(2n)!} = f[x_1, x_{1,k}, ..., x_n, x_{n,k}, x].

Fix x. Show that ξ_k(x) → ξ(x) and so

    \frac{f^{(2n)}(\xi(x))}{(2n)!} = f[x_1, x_1, ..., x_n, x_n, x].

Since the right-hand side is continuous in x, so is ξ. 

In the special case n = 2, it is easy to show that for all x ∈ [x1, x2],

    |f(x) - p(x)| \le \max_{x_1\le\xi\le x_2} |f^{(4)}(\xi)| \, \frac{|x_2 - x_1|^4}{384}.

We had mentioned earlier that polynomial interpolation using equally spaced points is unstable
unless there is a small number of points, say, fewer than five points. Consider interpolation
of the function f(x) = (1 + 16x^2)^{-1} for x ∈ [−1, 1]. The left graph in Figure 5.1 shows the case
with 11 equally spaced points. Note the occurrence of huge oscillations near the end points.
This is known as the Runge phenomenon. The optimal choice of points (Chebyshev points) clusters
near the end points. The right graph in Figure 5.1 clearly shows the superiority of the latter
choice. An alternative to using one polynomial to interpolate all the data points is to use several.
This leads to the following topic.

5.3 Splines
Given data (x1 , y1 ), · · · , (xn , yn ). Assume x1 < · · · < xn . We employ several low–degree poly-
nomials, splines, to interpolate data. At the end of the last section, we had alluded to the
ill–conditioning of high degree polynomial interpolation with equally spaced points. Another
problem is that if the function to be interpolated is singular at some point, then a global inter-
polating polynomial can result in poor approximation everywhere. Splines do not suffer from
either of these two difficulties since they use many low degree piecewise polynomials. Hence any
bad behaviour in the underlying function is localized.
The simplest spline uses piecewise polynomials of degree one. In this case of linear splines,
the interpolating function is piecewise linear. On the interval [xi, xi+1], the interpolant is the
line which passes through the two points (xi, yi), (xi+1, yi+1). The virtues of this method are
its simplicity and efficiency. However, depending on the application, the data may come from a
smooth function while the interpolant is at best continuous and in general non-differentiable.
Figure 5.1: Interpolation by a 10th degree polynomial using equally spaced points (left) and
Chebyshev points (right).

Theorem: Let f be twice continuously differentiable on [a, b]. Let p be the linear spline interpolant of f at
the nodes x1 , · · · , xn . Then for every x ∈ [a, b],

    |f(x) - p(x)| \le \max_{a\le c\le b} |f''(c)| \, \frac{h^2}{8},    h = \max_{1\le i\le n-1} (x_{i+1} - x_i).

Proof: On [x_i, x_{i+1}], there is some c_i ∈ (x_i, x_{i+1}) so that

    f(x) - p(x) = \frac{f''(c_i)}{2} (x - x_i)(x - x_{i+1})

by (5.4). The largest magnitude of the quadratic term occurs at the midpoint x = (x_i + x_{i+1})/2.
Hence for all x ∈ [x_i, x_{i+1}],

    |f(x) - p(x)| \le \frac{|f''(c_i)|}{2} \cdot \frac{(x_{i+1} - x_i)^2}{4}

and the result of the theorem follows immediately. 

In quadratic splines, we use a piecewise quadratic interpolant. On the interval [xi , xi+1 ],
define the quadratic pi (x) = ai (x − xi )2 + bi (x − xi ) + ci . There are n − 1 intervals and so there
are 3(n − 1) parameters to be determined.
Since pi is an interpolant, it must satisfy pi(xi) = yi, pi(xi+1) = yi+1, i = 1, · · · , n − 1.
There are 2(n − 1) conditions. Since we still have extra degrees of freedom, we also impose that
the piecewise quadratic interpolant be continuously differentiable: p′i(xi+1) = p′i+1(xi+1), i =
1, · · · , n − 2. Thus far, we have 3n − 4 conditions, which is still one short. The last condition
is usually supplied by specifying one of p′1(x1), p′′1(x1), p′n−1(xn), p′′n−1(xn). In the absence of
any other knowledge, it is simplest to take the value as zero.
The most common type of splines is cubic splines – using a third degree polynomial in each
interval [xi , xi+1 ]. Since there are n − 1 intervals and each interval requires four coefficients to

define the cubic polynomial in that interval, there is a total of 4(n − 1) unknown coefficients
to determine. Let pi be the cubic polynomial on [xi , xi+1 ], i = 1, · · · , n − 1. The following are
reasonable conditions imposed on the polynomials:
1. pi (xi ) = yi , pi (xi+1 ) = yi+1 , i = 1, · · · , n − 1
2. p′i (xi+1 ) = p′i+1 (xi+1 ), i = 1, · · · , n − 2
3. p′′i (xi+1 ) = p′′i+1 (xi+1 ), i = 1, · · · , n − 2.
With these conditions, the interpolating function is at least twice continuously differentiable,
much smoother than linear splines.
There are 4n − 6 conditions and 4n − 4 unknowns and so two more conditions are needed.
There are many possibilities. For instance, in some applications, it may be desirable to have
p′1 (x1 ), p′n−1 (xn ) prescribed, say, to be zero. Another possibility is to specify p′′1 (x1 ), p′′n−1 (xn ).
One of the most popular methods, called natural cubic splines, is to set p′′1 (x1 ) = 0 =
p′′n−1 (xn ). This choice minimizes the curvature among all interpolants of the data.
Theorem: Let f be twice continuously differentiable on [a, b], where a = x1 < x2 < · · · < xn−1 < xn = b.
Let p be the natural cubic spline interpolating f at x1 , · · · , xn . Then
    \int_a^b p''(x)^2\, dx \le \int_a^b f''(x)^2\, dx.

Proof: Let g = f − p. Thus

    \int_a^b f''(x)^2\, dx = \int_a^b p''(x)^2\, dx + \int_a^b g''(x)^2\, dx + 2 \int_a^b p''(x) g''(x)\, dx.

The task at hand is to show that the last term is non-negative. On [x_i, x_{i+1}], let p = p_i, the
cubic interpolant.

    \int_a^b p''(x) g''(x)\, dx = \sum_{i=1}^{n-1} \int_{x_i}^{x_{i+1}} p_i''(x) g''(x)\, dx
        = \sum_{i=1}^{n-1} \left[ (p_i'' g')(x_{i+1}) - (p_i'' g')(x_i) - \int_{x_i}^{x_{i+1}} p_i'''(x) g'(x)\, dx \right]
        = (p_{n-1}'' g')(x_n) - (p_1'' g')(x_1) - \sum_{i=1}^{n-1} \int_{x_i}^{x_{i+1}} p_i''' g'(x)\, dx
        = - \sum_{i=1}^{n-1} p_i''' \big( g(x_{i+1}) - g(x_i) \big)
        = 0.

In the above, we have used the conditions defining the natural cubic splines, including the fact
that g(x_i) = 0 for every i, and that p_i''' is a constant since p_i is cubic. 

In practice, the unknown coefficients are determined by setting up a system of 4n − 4 equations.
This system is non-symmetric and is not ideal if n is large. One can eliminate some of the
variables by hand and, with some algebra, reduce the problem to a symmetric positive definite
tridiagonal system of size n − 2.
The following error estimate is available.

Theorem: Let f be four times continuously differentiable on [a, b] and p be the natural cubic spline
interpolant of f at the nodes x_1, · · · , x_n. Then for every x ∈ [a, b],

    |f(x) - p(x)| \le \max_{a\le\xi\le b} |f^{(4)}(\xi)| \, \frac{5h^4}{384},    h = \max_{1\le i\le n-1} (x_{i+1} - x_i).

Example: Find the linear spline, quadratic spline and natural cubic spline which interpolates the data

(0, 3), (1, −2), (2, 1).

For the quadratic spline, assume the quadratic interpolants over the two intervals have a con-
tinuous derivative at x = 1 and that the derivative of the interpolant vanishes at x = 0.
The linear spline is

    p(x) = \begin{cases} -5x + 3, & x ∈ [0, 1]; \\ 3x - 5, & x ∈ [1, 2]. \end{cases}
For the quadratic spline, let p1 (x) = a1 x2 + b1 x + c1 and p2 (x) = a2 (x − 1)2 + b2 (x − 1) + c2
be the interpolating quadratics on [0, 1] and [1, 2], respectively. The equations determining the
coefficients are:
3 = p1 (0), −2 = p1 (1), −2 = p2 (1), 1 = p2 (2)
since pi are interpolants and
p′1 (1) = p′2 (1), p′1 (0) = 0
are the constraints on the derivatives. Solve these equations to obtain a1 = −5, b1 = 0, c1 =
3, a2 = 13, b2 = −10, c2 = −2. Hence the piecewise quadratic interpolant is

    p(x) = \begin{cases} -5x^2 + 3, & x ∈ [0, 1]; \\ 13(x-1)^2 - 10(x-1) - 2, & x ∈ [1, 2]. \end{cases}

The natural cubic spline requires a fair amount of calculation. Let the cubic in the first interval
be p1(x) = a1 x^3 + b1 x^2 + c1 x + d1. From the conditions p1(0) = 3, p1(1) = −2, p′′1(0) = 0, we
obtain d1 = 3, b1 = 0, c1 = −5 − a1. Let the cubic in the second interval be
p2(x) = a2(x − 1)^3 + b2(x − 1)^2 + c2(x − 1) + d2. From the conditions p2(1) = −2, p2(2) = 1, p′′2(2) = 0,
p′2(1) = p′1(1), p′′2(1) = p′′1(1), we obtain d2 = −2, a2 + b2 + c2 = 3, b2 = −3a2, c2 = 3a1 + c1, b2 = 3a1.
Solve these equations to get a1 = 2, c1 = −7, a2 = −2, b2 = 6, c2 = −1. Hence the natural cubic
spline is

    p(x) = \begin{cases} 2x^3 - 7x + 3, & x ∈ [0, 1]; \\ -2(x-1)^3 + 6(x-1)^2 - (x-1) - 2, & x ∈ [1, 2]. \end{cases}
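If SciPy is available, the natural cubic spline can be obtained without the hand computation; the following sketch (an illustration under that assumption, not part of the notes) checks the library result against the spline derived above.

    import numpy as np
    from scipy.interpolate import CubicSpline

    # natural cubic spline through (0, 3), (1, -2), (2, 1)
    s = CubicSpline([0.0, 1.0, 2.0], [3.0, -2.0, 1.0], bc_type='natural')

    xx = np.array([0.5, 1.0, 1.5])
    by_hand = np.where(xx <= 1.0,
                       2*xx**3 - 7*xx + 3,
                       -2*(xx - 1)**3 + 6*(xx - 1)**2 - (xx - 1) - 2)
    print(s(xx))
    print(by_hand)        # the two rows agree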

Hermite cubic splines


In some applications, it is possible that both the function value and derivative value are known at
each node. In this case, the Hermite cubic splines is a popular piecewise polynomial interpolant.
More precisely, given f ∈ C 1 [a, b] and nodes {xi } satisfying a = x1 < x2 < · · · < xn−1 <
xn = b. The Hermite cubic spline p ∈ C 1 [a, b] is a piecewise cubic polynomial (that is, it is cubic
on each [xi , xi+1 ]) which satisfies

p(xi ) = f (xi ), p′ (xi ) = f ′ (xi ), i = 1, · · · , n.

Let us determine p on [xi , xi+1 ]. Let

p(x) = c0 + c1 (x − xi ) + c2 (x − xi )2 + c3 (x − xi )3 on [xi , xi+1 ]



for some coefficients cj . Since p(xi ) = f (xi ) and p′ (xi ) = f ′ (xi ), it is easy to see that c0 =
f (xi ), c1 = f ′ (xi ). From p(xi+1 ) = f (xi+1 ) and p′ (xi+1 ) = f ′ (xi+1 ), it can be deduced that

    c_2 = 3\,\frac{f(x_{i+1}) - f(x_i)}{(x_{i+1} - x_i)^2} - \frac{f'(x_{i+1}) + 2f'(x_i)}{x_{i+1} - x_i},
    \qquad
    c_3 = \frac{f'(x_{i+1}) + f'(x_i)}{(x_{i+1} - x_i)^2} - 2\,\frac{f(x_{i+1}) - f(x_i)}{(x_{i+1} - x_i)^3}.
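A small sketch of one Hermite cubic segment using these coefficients (the helper name is ours, not from the notes); as a sanity check, a cubic such as f(x) = x^3 is reproduced exactly.

    import numpy as np

    def hermite_segment(xi, xip1, fi, fip1, dfi, dfip1):
        h = xip1 - xi
        c0, c1 = fi, dfi
        c2 = 3*(fip1 - fi)/h**2 - (dfip1 + 2*dfi)/h
        c3 = (dfip1 + dfi)/h**2 - 2*(fip1 - fi)/h**3
        return lambda x: c0 + c1*(x - xi) + c2*(x - xi)**2 + c3*(x - xi)**3

    # check against f(x) = x^3 on [1, 2]: a cubic is reproduced exactly
    p = hermite_segment(1.0, 2.0, 1.0, 8.0, 3.0, 12.0)
    xx = np.linspace(1, 2, 5)
    print(p(xx) - xx**3)        # zeros up to rounding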

The following is an error estimate of the Hermite cubic spline interpolant. Its proof is very
similar to the case of linear splines given earlier, using instead (5.7) with n = 2.

Theorem: Let f ∈ C 4 [a, b] and p be the Hermite cubic spline interpolant of f at the nodes x1 , · · · , xn .
Then for every x ∈ [a, b],

    |f(x) - p(x)| \le \max_{a\le\xi\le b} |f^{(4)}(\xi)| \, \frac{h^4}{384},    h = \max_{1\le i\le n-1} (x_{i+1} - x_i).

5.4 Approximation
Given f ∈ C[a, b] and ǫ > 0. Recall that the Weierstrass Approximation Theorem says that
there is some polynomial p so that kf − pk∞ < ǫ. We shall be using two norms in this section:
    ||g||_\infty := \max_{x\in[a,b]} |g(x)|,    ||g||_2 := \left( \int_a^b g(x)^2\, dx \right)^{1/2}.

Note that the degree of p in the above theorem can be large. Suppose we fix n, the degree of
the polynomial and pose the minimization problem

inf kf − qk∞ .
q∈Pn

A related problem replaces the k · k∞ by k · k2 . We shall look at the latter problem.


The minimization problem can be solved using Calculus. For simplicity, consider n = 1 and
[a, b] = [0, 1]. The minimization problem is

    \min_{a_0, a_1} f(a_0, a_1),    f(a_0, a_1) = ||a_0 + a_1 x - f(x)||_2^2.

By a direct calculation,

    f(a_0, a_1) = a_0^2 + \frac{a_1^2}{3} + \int_0^1 f(x)^2 dx - 2a_0 \int_0^1 f(x)\, dx - 2a_1 \int_0^1 x f(x)\, dx + a_0 a_1.

The unique critical point of f is given by the solution of ∇f(a_0, a_1) = 0:

    A \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} 2\int_0^1 f(x)\, dx \\ 2\int_0^1 x f(x)\, dx \end{bmatrix},
    \qquad
    A = \begin{bmatrix} 2 & 1 \\ 1 & 2/3 \end{bmatrix}.

The solution of this linear system must correspond to the unique minimum of f since the matrix
of second derivatives of f is A, which has positive eigenvalues. This means that f is a strictly
convex function and so its critical point is a global minimum of f .
The above method also works for any positive value of n. Unfortunately, if n is not small, the
resultant linear system is usually very ill-conditioned. Instead of using the basis {1, x, x2 , . . . , xn }

of Pn , it is better to use orthogonal polynomials. Orthogonality is defined in terms of the L2 (a, b)


inner product

    \langle f, g \rangle = \int_a^b f(x) g(x)\, dx.

Taking [a, b] = [−1, 1] for convenience, use Gram-Schmidt to obtain an orthogonal set of polynomials
of Pn, called the Legendre polynomials. The first few are given by

    \varphi_0(x) = 1,    \varphi_1(x) = x,    \varphi_2(x) = \frac{3x^2 - 1}{2}.
These polynomials are normalized by the condition φj(1) = 1. Note that φj is a polynomial
of degree j and ⟨φi, φj⟩ = 0 if i ≠ j. We now solve the same minimization problem for the
case n = 1 and [a, b] = [−1, 1] using the first two Legendre polynomials. The function to be
minimized becomes

    f(a_0, a_1) = ||a_0 + a_1 x - f(x)||_2^2 = 2a_0^2 + \frac{2a_1^2}{3} + \int_{-1}^1 f(x)^2 dx - 2a_0 \int_{-1}^1 f(x)\, dx - 2a_1 \int_{-1}^1 x f(x)\, dx.

The critical point is the solution of the simpler linear system

    4a_0 - 2\int_{-1}^1 f(x)\, dx = 0 = \frac{4a_1}{3} - 2\int_{-1}^1 x f(x)\, dx.

The advantage of using orthogonal polynomials is that it is trivial to solve the resultant linear
system. The following is the main theorem for the least-squares solution of approximating a
function by a general set of basis functions. First define the L2 (a, b) inner product by
    (f, g) = \int_a^b f(x) g(x)\, dx,    f, g ∈ L^2(a, b).

Theorem: Let {φ1 , . . . , φn } ⊂ C[a, b] be linearly independent and Sn be the span of {φj , 1 ≤ j ≤ n}. Given
f ∈ C[a, b]. The solution of the best approximation of f by a function in S_n in the L^2 sense,

    \min_{a\in R^n} \left\| f - \sum_{i=1}^{n} a_i \varphi_i \right\|_2,

is given by a = A−1 b, where aij = (φi , φj ) and bi = (φi , f ).

Proof: The problem is equivalent to minimizing

    F(a) := \left( f - \sum_{i=1}^n a_i \varphi_i,\; f - \sum_{i=1}^n a_i \varphi_i \right)
          = (f, f) - 2 \sum_{i=1}^n a_i (\varphi_i, f) + \sum_{i,j=1}^n a_i a_j (\varphi_i, \varphi_j),    a ∈ R^n.

A necessary condition for a_* ∈ R^n to be a minimizer is that the gradient of F with respect to a
vanishes:

    0 = -2(\varphi_p, f) + 2 \sum_{j=1}^n a_{*j} (\varphi_p, \varphi_j)
    \quad\text{or}\quad
    \sum_{j=1}^n a_{pj} a_{*j} = b_p,    1 ≤ p ≤ n.

In matrix notation the above equation can be written as A a_* = b. Note that A is a symmetric
matrix and

    a^T A a = \sum_{i,j=1}^n a_i a_j (\varphi_i, \varphi_j) = \left( \sum_{i=1}^n a_i \varphi_i, \sum_{j=1}^n a_j \varphi_j \right) = \left\| \sum_{i=1}^n a_i \varphi_i \right\|_2^2 > 0

for all non-zero vectors a, by the linear independence of the φ_i. Hence A is positive definite, which
implies that A is invertible. Since the Hessian D^2 F(a) = 2A for all a ∈ R^n, it is positive definite,
implying that F is a strictly convex function of a. Consequently, a_* is the unique global minimizer of F. 

In the notation of the above theorem, it is not difficult to see that the error of the approximation satisfies

    \left\| f - \sum_{i=1}^n a_{*i} \varphi_i \right\|_2 = \left( ||f||_2^2 - \left\| \sum_{i=1}^n a_{*i} \varphi_i \right\|_2^2 \right)^{1/2}.
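The following sketch (ours; the basis {1, x, x^2, x^3}, the interval [−1, 1] and f(x) = e^x are arbitrary choices) assembles the Gram matrix A and right-hand side b of the theorem by numerical quadrature and solves A a = b.

    import numpy as np
    from scipy.integrate import quad

    f = np.exp
    basis = [lambda x, j=j: x**j for j in range(4)]      # {1, x, x^2, x^3}

    n = len(basis)
    A = np.array([[quad(lambda x: basis[i](x)*basis[j](x), -1, 1)[0]
                   for j in range(n)] for i in range(n)])
    b = np.array([quad(lambda x: basis[i](x)*f(x), -1, 1)[0] for i in range(n)])
    a = np.linalg.solve(A, b)
    print(a)    # coefficients of the best cubic L2 approximation of e^x on [-1, 1]

As the text warns, the monomial basis leads to an ill-conditioned matrix A when n is large; an orthogonal basis such as the Legendre polynomials makes the system diagonal.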

Instead of the 2-norm, one can also pose the minimization problem in the ∞-norm.

Theorem: Let f ∈ C[a, b] and n be a non-negative integer. There is a unique p ∈ Pn so that

    ||f - p||_\infty = \min_{q\in P_n} ||f - q||_\infty.

We omit the proof of this theorem, but sketch some results that give rise to this theorem.
Define {φ_i, 1 ≤ i ≤ n} ⊂ C[a, b] to satisfy the Haar condition if

    D[x_1, ..., x_n] := \det \begin{bmatrix} \varphi_1(x_1) & \cdots & \varphi_n(x_1) \\ \vdots & & \vdots \\ \varphi_1(x_n) & \cdots & \varphi_n(x_n) \end{bmatrix} \ne 0

whenever a ≤ x_1 < · · · < x_n ≤ b.


The canonical example of functions satisfying the Haar condition is φ_j = x^j, 0 ≤ j ≤ n − 1.
In this case, for any strictly increasing sequence {x_i, 1 ≤ i ≤ n} ⊂ [a, b],

    D[x_1, ..., x_n] = \det \begin{bmatrix} 1 & x_1 & \cdots & x_1^{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^{n-1} \end{bmatrix} = \prod_{1\le i<j\le n} (x_j - x_i) > 0,

which can be shown by induction. The above matrix is known as the Vandermonde matrix.
The above theorem is a consequence of the following, which says that the error of the best
approximation measured in the infinity norm satisfies an equi-oscillation property: the error
oscillates between minimum and maximum points, and the magnitudes of the errors at these
extreme points are the same.

Theorem: Let {φ_1, . . . , φ_n} ⊂ C[a, b] satisfy the Haar condition. Suppose f ∈ C[a, b]. Define
r = f − \sum_{i=1}^{n} a_i φ_i, a_i ∈ R. Then a minimizes ||r||_\infty iff |r(x_i)| = ||r||_\infty for all i and
r(x_i) = −r(x_{i−1}), 1 ≤ i ≤ n, for some strictly increasing sequence {x_i, 0 ≤ i ≤ n} ⊂ [a, b].
We can set up a linear system that solves for the coefficients a as follows. Let p = \sum_{i=1}^{n} a_i φ_i
and {x_i, 0 ≤ i ≤ n} be a given strictly increasing sequence of points in [a, b]. According to the
above theorem, f(x_i) − p(x_i) = −( f(x_{i−1}) − p(x_{i−1}) ), or

    f(x_i) - p(x_i) = (-1)^i h,    0 ≤ i ≤ n,

where h = f(x_0) − p(x_0) with |h| = |r(x_i)| for all i. Therefore

    f(x_i) - \sum_{j=1}^{n} a_j \varphi_j(x_i) = (-1)^i \left( f(x_0) - \sum_{j=1}^{n} a_j \varphi_j(x_0) \right),

or

    \sum_{j=1}^{n} a_j \big( \varphi_j(x_i) - (-1)^i \varphi_j(x_0) \big) = f(x_i) - (-1)^i f(x_0),    0 ≤ i ≤ n.

This linear system uniquely determines the coefficient a.

Example: Solve the minimization problem in case n = 0. That is,

    \min_{c\in R} ||c - f||_\infty.

Since f ∈ C[a, b], there exist α, β ∈ [a, b] so that

    f(α) = \min_{x\in[a,b]} f(x),    f(β) = \max_{x\in[a,b]} f(x).

For any c ∈ [f(α), f(β)],

    ||c - f||_\infty = \max\big( f(β) - c,\; c - f(α) \big).

It is not difficult to see that the above is minimized when

    f(β) - c = c - f(α)    or    c = \frac{f(α) + f(β)}{2}.

Example: Solve the minimization problem for f (x) = ex on the interval [−1, 1]:

    \min_{a\in R^2} ||e^x - (a_0 + a_1 x)||_\infty.

Note that f is a convex function. Hence the three extreme points in the equi-oscillation property
can be taken as x0 = −1, x1 ∈ (−1, 1) and x2 = 1. Let r denote the error as before. Then
r(−1) = −r(x1 ) = r(1) and r ′ (x1 ) = 0 since x1 is an interior critical point. Solve these equations
to obtain the solution a1 = (e − e−1 )/2, a0 = (e − a1 ln a1 )/2, x1 = ln a1 .

5.5 Rational Approximation


Instead of approximating a given continuous function f by a polynomial, we can also approximate
it by a more general class of functions such as rational functions (ratio of polynomials with no

common factors). Specifically, given non-negative integers n and m, we approximate f by the


rational function

    r(x) = \frac{p_0 + p_1 x + \dots + p_n x^n}{q_0 + q_1 x + \dots + q_m x^m}.        (5.8)
Since polynomials are special cases of rational functions (with m = 0), it is expected that r
should give a better approximation of f than a polynomial approximation provided m and n
are appropriately chosen.
We consider one specific type of rational approximation known as Pade approximation.
Suppose f is defined on [a, b] with a < 0 < b. It is implicitly assumed that q ≠ 0 on [a, b].
Without loss of generality, assume q0 = 1; this can always be arranged by rescaling p and q. Given
non-negative integers n and m, the idea is to find a rational function r so that the lowest degree
appearing in the Taylor expansion of f − r is as large as possible. Let

    f(x) = \sum_{i=0}^{\infty} a_i x^i

and r be the ratio of polynomials defined as in (5.8) with q0 = 1. The goal is to find the
coefficients pi and qi so that

    f(x) - \frac{p(x)}{q(x)} = \sum_{i=s}^{\infty} c_i x^i

for some coefficients {ci } so that s is as large as possible. Toward this end, write

X ∞
X
f (x)q(x) − p(x) = q(x) ci xi =: di xi . (5.9)
s=i s=i

Plugging in the expansions for f, p and q, it follows that the left-hand side is

a0 − p0 + (a0 q1 + a1 − p1 )x + (a0 q2 + a1 q1 + a2 − p2 )x2 + · · · .

We set as many of the above coefficients to zero as possible.

Example: Let f (x) = ex defined on [−5, 1] and n = 1, m = 2. Then


 
x2 x3 x4
f (x)q(x) − p(x) = 1+x+ + + + · · · (1 + q1 x + q2 x2 ) − p0 − p1 x
2 6 24
   
1 2 1 q1
= 1 − p0 + (1 + q1 − p1 )x + + q1 + q2 x + + + q 2 x3
2 6 2
 
1 q1 q2
+ + + x4 + · · · .
24 6 2

Since there are four unknowns (p0 , p1 , q1 , q2 ), we expect to be able to set the coefficients of xi to
be zero for 0 ≤ i ≤ 3. The four resultant equations are
    p_0 = 1,    p_1 = 1 + q_1,    q_1 + q_2 = -\frac{1}{2},    \frac{q_1}{2} + q_2 = -\frac{1}{6}.        (5.10)

The solution of the system is

    p_0 = 1,    p_1 = \frac{1}{3},    q_1 = -\frac{2}{3},    q_2 = \frac{1}{6}.
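The same coefficients can be recovered by solving the linear equations obtained from matching Taylor coefficients, as in the following sketch (ours, not from the notes; the ordering of the unknowns is an arbitrary choice).

    import numpy as np

    a = np.array([1.0, 1.0, 1/2, 1/6, 1/24])   # Taylor coefficients of e^x
    n, m = 1, 2                                 # degrees of p and q, with q_0 = 1

    # unknowns u = (p_0, ..., p_n, q_1, ..., q_m); match x^0, ..., x^{n+m}
    N = n + m + 1
    M = np.zeros((N, N))
    rhs = np.zeros(N)
    for k in range(N):
        if k <= n:
            M[k, k] = -1.0                      # the -p_k term
        for j in range(1, m + 1):               # the a_{k-j} q_j terms
            if k - j >= 0:
                M[k, n + j] = a[k - j]
        rhs[k] = -a[k]                          # move a_k q_0 to the right side
    u = np.linalg.solve(M, rhs)
    print(u)    # [1, 1/3, -2/3, 1/6], matching the solution above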

Figure 5.2: Pade approximation r(x) (with n = 1, m = 2) gives a superior approximation of e^x on [−5, 1] than f_3(x), the Taylor expansion of e^x up to degree three.

Note that the coefficient of x4 is non-zero. In Figure 5.2, it is clear that this Pade approximation
is superior to f3 (x), the Taylor’s expansion of ex up to degree three that also contains four
coefficients. However on [−1, 5], f3 (x) gives a better approximation because on this interval, r
goes to zero for large x and thus is not an appropriate choice to approximate ex .

In (5.9), typically s = m + n + 1. However, the following example shows that this does not
always occur: f(x) = a_0 + a_2 x^2 + · · · , p(x) = p_0 + p_1 x, q(x) = 1 + q_1 x with a_2 ≠ 0.
The analogous equations for the unknowns are

    p_0 = a_0,    p_1 = a_0 q_1,    a_2 = 0.

The last equation yields a contradiction, so in this case s = m + n + 1 = 3 cannot be achieved.


Chapter 6

Numerical Differentiation and Integration

In this chapter, we approximate the operations of differentiation and integration. Given a dif-
ferentiable function, we can always find its derivative although this may not be inviting if the
function is complicated. Numerical differentiation approximates the derivative using only func-
tion values. Another instance where numerical differentiation is needed is in solving differential
equations. Numerical integration is clearly useful since there are many functions that cannot be
integrated analytically.

6.1 Numerical Differentiation


For a differentiable function f on (a, b), recall that

    f'(x) = \lim_{h\to 0} \frac{f(x+h) - f(x)}{h}.

For a sufficiently small h, we have the approximation

    f'(x) \approx \frac{f(x+h) - f(x)}{h}.
Let us study the error in this approximation known as forward difference. From Taylor’s expan-
sion (provided f is twice continuously differentiable),

    f(x+h) = f(x) + f'(x) h + \frac{f''(c)}{2} h^2

for some c between x and x + h. Hence

    \left| f'(x) - \frac{f(x+h) - f(x)}{h} \right| \le \frac{h}{2} \max_{a\le\xi\le b} |f''(\xi)|.

This holds for all x ∈ (a, b). We say that this approximation is first–order accurate.
There is another first–order scheme known as backward difference:
    f'(x) \approx \frac{f(x) - f(x-h)}{h},
which has the same upper bound of the error as above.


In many applications, this is not accurate enough. A second-order accurate scheme, that is, one
whose error is bounded above by a term proportional to h^2, is

    f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}.
This is called a centered difference scheme. To see this, subtract the Taylor’s expansions

    f(x \pm h) = f(x) \pm f'(x) h + \frac{f''(x)}{2} h^2 \pm \frac{f'''(c_\pm)}{6} h^3

for some c_± between x and x ± h to obtain

    \left| f'(x) - \frac{f(x+h) - f(x-h)}{2h} \right| = \left| \frac{f'''(c_+) + f'''(c_-)}{12} \right| h^2 \le \frac{h^2}{6} \max_{a\le\xi\le b} |f'''(\xi)|        (6.1)

for all x ∈ (a, b).

Example: Let f (x) = x3 . Then f ′ (1) = 3. The first–order and second–order finite difference schemes using
h = .1 yield
    \frac{f(1.1) - f(1)}{.1} = 3.31,    \frac{f(1.1) - f(.9)}{.2} = 3.01.

The second-order scheme is clearly superior. With h = .05,

    \frac{f(1.05) - f(1)}{.05} = 3.1525,    \frac{f(1.05) - f(.95)}{.1} = 3.0025.

Hence we see that the errors do behave like h and h^2 for the two schemes.
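A few lines of code confirm this behaviour (a sketch, not part of the notes):

    f, x0, exact = (lambda x: x**3), 1.0, 3.0
    for h in (0.1, 0.05, 0.025):
        fwd = (f(x0 + h) - f(x0)) / h              # forward difference
        ctr = (f(x0 + h) - f(x0 - h)) / (2*h)      # centered difference
        print(h, abs(fwd - exact), abs(ctr - exact))
    # halving h roughly halves the forward error and quarters the centered error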

Since the finite difference error decreases like h or h2 , why not simply take h very small,
say, 10−100 so that the error would be insignificant? The answer is that this would be fine in
exact arithmetic. However in floating point arithmetic, there are roundoff errors. In particular,
cancellation is at work here since two nearly equal quantities are subtracted. Let us estimate
an optimal value of h > 0 for the second–order finite difference scheme.
Assume f˜ is the floating point representation of f and

    \tilde f(z) = f(z) + ǫ_z,    |ǫ_z| \le ǫ_M,

where ǫ_M is the unit roundoff. In this model, we ignore the floating point error of z itself. For
some |ǫ_\pm| \le ǫ_M,

    \left| f'(x) - \frac{\tilde f(x+h) - \tilde f(x-h)}{2h} \right|
        = \left| f'(x) - \frac{f(x+h) + ǫ_+ - f(x-h) - ǫ_-}{2h} \right|
        \le \left| f'(x) - \frac{f(x+h) - f(x-h)}{2h} \right| + \frac{ǫ_M}{h}
        \le E(h),

where we have used (6.1) in

    E(h) = \frac{M h^2}{6} + \frac{ǫ_M}{h},    M = \max_{a\le\xi\le b} |f'''(\xi)|.
The task is to minimize E(h), an upper bound of the error. The critical point of E is easily
computed from

    0 = E'(h) = \frac{M h}{3} - \frac{ǫ_M}{h^2}

to obtain h = (3ǫ_M / M)^{1/3}. It is easy to check that this is a global minimum. As a rough
estimate, set ǫ_M = 10^{-16} for double precision arithmetic, and take 3M^{-1} = 10 to obtain the
value h = 10^{-5}.

Example: For the same example f(x) = x^3 with f'(1) = 3. Take h = 10^{-5}:

    \frac{f(1 + .00001) - f(1)}{.00001} \approx 3.00003,    \frac{f(1 + .00001) - f(1 - .00001)}{.00002} \approx 3.0000000001.

Suppose f is a (complex) analytic function in some neighbourhood of x_0 ∈ R so that f(z) is
real whenever z is real. Here is an elegant way to approximate f'(x_0) without worrying about
the cancellation errors of the usual finite difference schemes. By a Taylor expansion, with i = \sqrt{-1},

    f(x_0 + ih) = f(x_0) + i h f'(x_0) - \frac{h^2}{2} f''(x_0) - \frac{i h^3}{6} f'''(x_0) + \cdots.

Since f(x_0) is real, this yields

    f'(x_0) = \frac{\mathrm{Im}\, f(x_0 + ih)}{h} + \frac{h^2}{6} f'''(x_0) + \cdots.

The approximation of f'(x_0) given by the first term on the right-hand side is not prone to
cancellation errors because no subtraction is involved.

Example: Let f (x) = e10x . The relative error of the central difference scheme with h = 10−6 to evaluate
f ′ (1) is 1.5 × 10−11 , while using the above technique with h = 10−8 , the relative error is 1.8 ×
10−15 .
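The following sketch (ours, not from the notes) reproduces this comparison; the step sizes are the ones quoted in the example.

    import numpy as np

    f = lambda z: np.exp(10*z)
    x0, exact = 1.0, 10*np.exp(10.0)

    h = 1e-8
    cstep = np.imag(f(x0 + 1j*h)) / h                 # complex-step derivative
    h2 = 1e-6
    central = (f(x0 + h2) - f(x0 - h2)) / (2*h2)      # centered difference
    print(abs(cstep - exact)/exact, abs(central - exact)/exact)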

Another way to assess the accuracy of a difference scheme is to see its effect on a plane wave
f(x) = e^{ikx}, where i = \sqrt{-1} and k is a given wave number, or frequency, of the wave. Suppose the
wave has period L > 0 so that k = 2πn/L for n ≥ 0. Given a grid x_j = jh for some positive grid
spacing h. The exact derivative at x_j is f'(x_j) = i k e^{ikx_j}. The centered difference approximation
of f'(x_j) is

    \frac{e^{i(j+1)hk} - e^{i(j-1)hk}}{2h} = e^{ijhk}\, \frac{e^{ihk} - e^{-ihk}}{2h} = i\, \frac{\sin(hk)}{h}\, e^{ikx_j}.

Comparing with the exact derivative, it is readily seen that the wave number of the difference
scheme has changed from k to sin(hk)/h. When k = O(1), then sin(hk)/h ∼ k, approximating
well the exact wave number. However, the approximation gets progressively worse as k increases.
This method of measuring the accuracy of a difference scheme is meaningful when the given
function f is expanded as a Fourier series, a sum of complex exponentials.
Note that the forward difference scheme applied to f(x) = e^{ikx} yields

    i\, \frac{e^{ihk} - 1}{ih}\, e^{ikx_j}.

Here the wave number is complex, but it still approximates k well when k = O(1).

Second Derivative
It is not difficult to derive a second–order finite difference approximation of the second derivative.
Assume that f is four times continuously differentiable on (a, b). From Taylor’s expansion, for
some c± in between x and x ± h,

    f(x \pm h) = f(x) \pm f'(x) h + \frac{f''(x)}{2} h^2 \pm \frac{f'''(x)}{6} h^3 + \frac{f''''(c_\pm)}{24} h^4.

Add these equations to obtain, after some algebra,

    f''(x) = \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} - E,    E = \frac{f''''(c_+) + f''''(c_-)}{24} h^2.

Thus

    |E| \le \frac{M}{12} h^2,    M = \max_{a\le\xi\le b} |f''''(\xi)|.

Now we estimate an optimal value of h to calculate the second derivative in the presence
of roundoff errors. Assume f˜ is the floating point representation of f as before. For some
|ǫi | ≤ ǫM , i = 1, 2, 3,

    \left| f''(x) - \frac{\tilde f(x-h) - 2\tilde f(x) + \tilde f(x+h)}{h^2} \right|
        = \left| f''(x) - \frac{f(x-h) + ǫ_1 - 2(f(x) + ǫ_2) + f(x+h) + ǫ_3}{h^2} \right|
        \le \left| f''(x) - \frac{f(x-h) - 2f(x) + f(x+h)}{h^2} \right| + \frac{4ǫ_M}{h^2}
        \le E(h),

where

    E(h) = \frac{M h^2}{12} + \frac{4 ǫ_M}{h^2},    M = \max_{a\le\xi\le b} |f''''(\xi)|.

The task is to minimize E(h), an upper bound of the error. The critical point of E is easily
computed from

    0 = E'(h) = \frac{M h}{6} - \frac{8 ǫ_M}{h^3}

to obtain h = (48 ǫ_M / M)^{1/4}. It is easy to check that this is a global minimum. A rough estimate
of the optimal value is h = 10^{-4} in double precision arithmetic.

Example: Let f(x) = x^4 with f''(1) = 12. Take h = 10^{-4}:

    \frac{f(1 + .0001) - 2f(1) + f(1 - .0001)}{10^{-8}} \approx 12.000000005.

Suppose the task is to evaluate f (k) (x) where k is large, say, k = 100. A finite difference
formula would be impossibly long and the resultant scheme may suffer from serious cancellation
errors. If f is analytic, the Cauchy integral formula offers a much better solution:

    f^{(k)}(x) = \frac{k!}{2\pi i} \oint_\Gamma \frac{f(z)}{(z - x)^{k+1}}\, dz,

where Γ is any circle in the complex plane with centre at x.

6.2 Richardson Extrapolation


Let Dh denote the finite difference operator:
    (D_h f)(x) = \frac{f(x+h) - f(x-h)}{2h}.

For a positive integer k, the notation g(h) = O(h^k) means that

    \lim_{h\to 0} \frac{|g(h)|}{h^k} \le C

for some constant C independent of h. From before,

    f'(x) = (D_h f)(x) + K h^2 + O(h^4),    K = -\frac{f'''(x)}{6}.

Here K is regarded as an unknown and it is assumed that f is four times continuously differentiable. Observe that

    f'(x) = (D_{h/2} f)(x) + \frac{K}{4} h^2 + O(h^4).

Eliminate K from these two equations to obtain

    f'(x) = \frac{4 (D_{h/2} f)(x) - (D_h f)(x)}{3} + O(h^4).

Hence a fourth-order accurate scheme can be obtained by two applications of the finite difference
operator at h and h/2. It can be verified that this more accurate scheme is

    f'(x) \approx \frac{f(x-h) - 8f(x-h/2) + 8f(x+h/2) - f(x+h)}{6h}.

This formula can also be derived from a Taylor expansion. However, Richardson extrapolation
can be applied in many other situations where the asymptotic expansion of an operation is
known. We shall encounter it again in numerical integration.
Richardson extrapolation can be applied repeatedly to get higher order formulae.
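One level of extrapolation is easy to test numerically; the following sketch (ours, with f = sin and x0 = 1 as arbitrary choices) shows the error dropping from O(h^2) to O(h^4).

    import numpy as np

    f, x0 = np.sin, 1.0
    D = lambda h: (f(x0 + h) - f(x0 - h)) / (2*h)      # centered difference D_h

    for h in (0.1, 0.05):
        rich = (4*D(h/2) - D(h)) / 3                   # one Richardson step
        print(h, abs(D(h) - np.cos(x0)), abs(rich - np.cos(x0)))
    # halving h quarters the D_h error but divides the extrapolated error by ~16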

6.3 Numerical Integration


In this section, we are interested in numerically integrating a function f :
    \int_a^b f(x)\, dx.

Newton–Cotes Formulae
A simple scheme is to subdivide the interval [a, b] into n equal intervals. Define for 0 ≤ i ≤ n,

    x_i = a + i h,    h = \frac{b - a}{n}.

Approximate f on each interval [x_{i−1}, x_i] by a straight line. The integral over this interval is
approximated by the area of the trapezoid:

    \int_a^b f(x)\, dx = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} f(x)\, dx \approx \sum_{i=1}^{n} \frac{f(x_{i-1}) + f(x_i)}{2}\, h.

This leads to the trapezoidal rule

    \int_a^b f(x)\, dx \approx \frac{h}{2}\big( f(a) + f(b) \big) + h \sum_{i=1}^{n-1} f(x_i).

Example: Let f(x) = x^4. Find the integral I_4 over [0, 1] using the trapezoidal rule with n = 4 and n = 8
intervals.
For n = 4, h = 1/4. Hence

    I_4 = \frac{f(0) + f(1)}{8} + \frac{f(1/4) + f(2/4) + f(3/4)}{4} = 0.2207.

The exact answer is .2 and so the error is about .02. For n = 8, I_8 = .2052 with an error of
about .005 which is four times smaller than that of I_4.
The proof of the error of the trapezoidal rule requires
Theorem: Mean Value Theorem for Integrals. Let f be continuous on [a, b] and let g be an integrable
function on [a, b] that does not change sign on this interval. Then there exists some c ∈ (a, b) so
that

    \int_a^b f(x) g(x)\, dx = f(c) \int_a^b g(x)\, dx.

Let Eh be the difference between the exact integral and the value given by the trapezoidal
rule.
Theorem: Let f be twice continuously differentiable on [a, b]. Then

    |E_h| \le \frac{b - a}{12} h^2 \max_{a\le c\le b} |f''(c)|.

Proof: On [x_{i−1}, x_i], the trapezoidal rule interpolates f by a linear function p(x). From the interpolation
error (5.4),

    f(x) - p(x) = \frac{f''(c_x)}{2} (x - x_{i-1})(x - x_i),    x ∈ [x_{i-1}, x_i],

where c_x lies in (x_{i−1}, x_i). Hence

    \int_{x_{i-1}}^{x_i} (f - p)(x)\, dx = \frac{1}{2} \int_{x_{i-1}}^{x_i} f''(c_x)(x - x_{i-1})(x - x_i)\, dx
        = \frac{f''(c_i)}{2} \int_{x_{i-1}}^{x_i} (x - x_{i-1})(x - x_i)\, dx
        = \frac{f''(c_i)}{2} \left( -\frac{h^3}{6} \right)

for some c_i ∈ (x_{i−1}, x_i) by the mean value theorem for integrals. (Continuity of f''(c_x) with
respect to x follows from (5.5).) Thus the error over the entire interval is

    \int_a^b (f - p)(x)\, dx = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} (f - p)(x)\, dx = -\frac{h^3}{12} \sum_{i=1}^{n} f''(c_i).

This leads to

    |E_h| \le n\, \frac{h^3}{12} \max_{a\le c\le b} |f''(c)| = \frac{b - a}{12} h^2 \max_{a\le c\le b} |f''(c)|.


The trapezoidal rule approximates f by a linear function on each subinterval. The next
method uses a quadratic function. Consider the interval [x2i , x2i+2 ], let pi be the unique poly-
nomial interpolant of (x2i , f (x2i )), (x2i+1 , f (x2i+1 )), (x2i+2 , f (x2i+2 )) of degree two or lower:

    p_i(x) = f(x_{2i}) \frac{(x - x_{2i+1})(x - x_{2i+2})}{(x_{2i} - x_{2i+1})(x_{2i} - x_{2i+2})}
           + f(x_{2i+1}) \frac{(x - x_{2i})(x - x_{2i+2})}{(x_{2i+1} - x_{2i})(x_{2i+1} - x_{2i+2})}
           + f(x_{2i+2}) \frac{(x - x_{2i})(x - x_{2i+1})}{(x_{2i+2} - x_{2i})(x_{2i+2} - x_{2i+1})}.

Assume the number of subintervals is even: n = 2m. Then


    \int_a^b f(x)\, dx = \sum_{i=0}^{m-1} \int_{x_{2i}}^{x_{2i+2}} f(x)\, dx \approx \sum_{i=0}^{m-1} \int_{x_{2i}}^{x_{2i+2}} p_i(x)\, dx
        = \sum_{i=0}^{m-1} \left[ \frac{f(x_{2i})}{(x_{2i}-x_{2i+1})(x_{2i}-x_{2i+2})} \int_{x_{2i}}^{x_{2i+2}} (x-x_{2i+1})(x-x_{2i+2})\, dx \right.
          + \frac{f(x_{2i+1})}{(x_{2i+1}-x_{2i})(x_{2i+1}-x_{2i+2})} \int_{x_{2i}}^{x_{2i+2}} (x-x_{2i})(x-x_{2i+2})\, dx
          \left. + \frac{f(x_{2i+2})}{(x_{2i+2}-x_{2i})(x_{2i+2}-x_{2i+1})} \int_{x_{2i}}^{x_{2i+2}} (x-x_{2i})(x-x_{2i+1})\, dx \right]
        = \sum_{i=0}^{m-1} \left[ \frac{h}{3} f(x_{2i}) + \frac{4h}{3} f(x_{2i+1}) + \frac{h}{3} f(x_{2i+2}) \right]
        = \frac{h}{3} \sum_{i=0}^{m-1} \big( f(x_{2i}) + 4 f(x_{2i+1}) + f(x_{2i+2}) \big)
        = \frac{h}{3} \big( f(a) + f(b) \big) + \frac{4h}{3} \sum_{i=0}^{m-1} f(x_{2i+1}) + \frac{2h}{3} \sum_{i=1}^{m-1} f(x_{2i}).

This is Simpson's rule.


Let Eh be the difference between the exact integral and the value given by Simpson’s rule.

Theorem: Let f be four times continuously differentiable on [a, b]. Then

    |E_h| \le \frac{b - a}{180} h^4 \max_{a\le c\le b} |f''''(c)|.

Let us follow the approach used in the proof of the corresponding result for the trapezoidal
rule. On [x2i , x2i+2 ], Simpson’s rule interpolates f by a quadratic function p(x). From the
interpolation error (5.4),

    f(x) - p(x) = \frac{f'''(c_x)}{6} (x - x_{2i})(x - x_{2i+1})(x - x_{2i+2}),    x ∈ [x_{2i}, x_{2i+2}].
We cannot use the mean value theorem for integrals as before because the cubic polynomial
changes sign in [x2i , x2i+2 ].
Proof: We first estimate the error of the integral over [a, a + 2h]. Using the symmetry property and
definition of the divided differences,

    f[a, a+h, a+2h, x] = f[a, a+h, a+2h, a+h] + f[a, a+h, a+2h, a+h, x]\,(x - a - h).

Let p be the unique polynomial interpolant over [a, a + 2h] of degree at most two. Use the
interpolation error (5.4) to obtain the integration error on [a, a + 2h] as

    \int_a^{a+2h} \big( f(x) - p(x) \big)\, dx
        = \int_a^{a+2h} f[a, a+h, a+2h, x]\, (x-a)(x-a-h)(x-a-2h)\, dx
        = \int_a^{a+2h} \big( f[a, a+h, a+2h, a+h] + f[a, a+h, a+2h, a+h, x](x-a-h) \big)(x-a)(x-a-h)(x-a-2h)\, dx
        = f[a, a+h, a+2h, a+h] \int_a^{a+2h} (x-a)(x-a-h)(x-a-2h)\, dx
          + \int_a^{a+2h} f[a, a+h, a+2h, a+h, x]\, (x-a)(x-a-h)^2(x-a-2h)\, dx
        = 0 + \int_a^{a+2h} \frac{f^{(4)}(\xi(x))}{24} (x-a)(x-a-h)^2(x-a-2h)\, dx
        = \frac{f^{(4)}(\eta)}{24} \int_a^{a+2h} (x-a)(x-a-h)^2(x-a-2h)\, dx
        = -\frac{f^{(4)}(\eta)}{24} \cdot \frac{4h^5}{15}.

In the above η, ξ(x) ∈ (a, a + 2h). We have used the mean value theorem for integrals and the
fact that the integral of (x−a)(x−a−h)(x−a−2h) is zero by symmetry. (In fact, the additional
point a + h has been chosen so that the mean value theorem for integrals can be used.) Now
sum over the m = n/2 double intervals to obtain the global error estimate

    |E_h| \le \frac{n}{2} \cdot \frac{h^5}{90} \max_{a\le c\le b} |f^{(4)}(c)|,

from which the desired result follows. 

Example: Let f (x) = x4 . Find the integral I4 over [0, 1] using Simpson’s method with n = 4 intervals.
For n = 4, h = 1/4. Hence
f (0) + f (1) f (1/4) + f (3/4) f (1/2)
I4 = + + = 0.2005.
12 3 6
The error is .0005 which is much smaller than that of the trapezoidal rule. For n = 8, I8 =
.200033 with an error of about .000033 which is 16 times smaller than the error of I4 .
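The composite trapezoidal and Simpson rules are short to implement; the following Python sketch (ours, not from the notes) reproduces the values of the last two examples.

    import numpy as np

    def trapezoid(f, a, b, n):
        x = np.linspace(a, b, n + 1)
        h = (b - a) / n
        return h * (0.5*f(x[0]) + f(x[1:-1]).sum() + 0.5*f(x[-1]))

    def simpson(f, a, b, n):          # n must be even
        x = np.linspace(a, b, n + 1)
        h = (b - a) / n
        return h/3 * (f(x[0]) + f(x[-1])
                      + 4*f(x[1:-1:2]).sum() + 2*f(x[2:-1:2]).sum())

    f = lambda x: x**4
    for n in (4, 8):
        print(n, trapezoid(f, 0, 1, n), simpson(f, 0, 1, n))   # exact value is 0.2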
Example: Find n in Simpson's method so that \int_0^\pi \sin^2 x\, dx is approximated with an error smaller than
10^{-6}.
Since \sin^2 x = \frac{1 - \cos 2x}{2}, its fourth derivative is -8\cos 2x, which is bounded in magnitude by 8. From
the error bound of Simpson's rule, we set \frac{\pi h^4}{180}\cdot 8 < 10^{-6}. Solve this to get h < .0517 or n = \pi/h \approx 60.77.
The first even number larger than this is 62, which is the number of intervals needed.

Figure 6.1: Numerical integration by the trapezoidal rule 'o' and Simpson's rule '+'. The
integrands are f(x) = e^{sin x} (left) and f(x) = (1 + x^2)^{-1} (right) over the interval [0, 2π].

In Figure 6.1, observe the difference in the integration errors for a periodic function (left)
and a non–periodic function (right) for trapezoidal and Simpson’s rules. This property will be
illustrated in a later section.
Trapezoidal and Simpson’s rules are examples of closed Newton–Cotes Formulae, where
the integrand is approximated on each subinterval by a polynomial. In these formulae, the
integrand is evaluated at each node xi . Open Newton–Cotes Formulae also approximate
each subinterval by a polynomial but the integrand is evaluated strictly between the nodes.
These formulae are useful if the integrand has some singular behaviour at an end point or the
interval of integration is infinite.
The midpoint rule is the simplest open Newton–Cotes formula. With the same equally
spaced nodes xi as before, this formula is
    \int_a^b f(x)\, dx \approx \sum_{i=0}^{n-1} f\!\left( \frac{x_i + x_{i+1}}{2} \right) h.

It approximates f on [x_i, x_{i+1}] by the constant f\!\left( \frac{x_i + x_{i+1}}{2} \right). This explains why it is a member
of the Newton–Cotes family.
Let Eh be the difference between the exact integral and the value given by the midpoint rule.

Theorem: Let f be twice continuously differentiable on [a, b]. Then

    |E_h| \le \frac{b - a}{24} h^2 \max_{a\le c\le b} |f''(c)|.

Proof: On [a, a + h], the midpoint rule interpolates f by the constant function y = f(a + h/2). From
Taylor's expansion, there is some c_x ∈ (a, a + h) so that

    f(x) = f\!\left(a + \tfrac{h}{2}\right) + f'\!\left(a + \tfrac{h}{2}\right)\left(x - a - \tfrac{h}{2}\right) + \frac{f''(c_x)}{2} \left(x - a - \tfrac{h}{2}\right)^2,    x ∈ [a, a + h].

Hence the integration error on [a, a + h] is

    \int_a^{a+h} \left( f(x) - f\!\left(a + \tfrac{h}{2}\right) \right) dx
        = \int_a^{a+h} \left[ f'\!\left(a + \tfrac{h}{2}\right)\left(x - a - \tfrac{h}{2}\right) + \frac{f''(c_x)}{2} \left(x - a - \tfrac{h}{2}\right)^2 \right] dx
        = 0 + \frac{f''(c)}{2} \int_a^{a+h} \left(x - a - \tfrac{h}{2}\right)^2 dx
        = \frac{f''(c)}{24} h^3

for some c ∈ (a, a + h) by the mean value theorem for integrals. Summing the errors over all
intervals, we obtain the desired result. 
Example: The integrals \int_0^1 \frac{\sin x}{x}\, dx and \int_0^1 x^{-1/2}\, dx are examples where open Newton–Cotes formulae
are preferable. In the first example, one has to do analysis to realize that the function evaluates
to 1 in the limit x → 0. For more difficult examples, such analysis may not be possible. In the
second example, the integrand is undefined at the origin and so closed Newton–Cotes formulae
cannot be used.

The final integration scheme we consider is known as Romberg integration. It is nothing


more than applying Richardson extrapolation to the trapezoidal rule. Let h be the initial step
size of a coarse uniform mesh and T (h) be the result of trapezoidal rule. We next apply the
trapezoidal rule on a fine mesh with step size h/2 to obtain T (h/2). Recall that the error of the
quadrature formula has the form c2 h2 + c4 h4 + · · · . Richardson extrapolation says that

4T (h/2) − T (h)
3
approximates the exact integral to O(h4 ). It can be checked that this is exactly Simpson’s rule
with step size h/2. One can continue to calculate T (h/4) resulting in an extrapolated scheme
which is accurate to O(h6 ). One attraction of these schemes is that they are quite efficient
since all function values can be reused at subsequent levels. For instance, the evaluation of T (h)
requires n + 1 function values while the evaluation of T (h/2) requires only n extra function
evaluations if the previous n + 1 values have been saved.

Example: Let $f(x) = x^4$. Earlier we had calculated the integral from 0 to 1 using the trapezoidal rule with
$n = 4$ and $n = 8$ intervals: $I_4 = .2207$, $I_8 = .2052$. Using Richardson extrapolation, a fourth
order approximation is
$$\frac{4I_8 - I_4}{3} = \frac{4(.2052) - .2207}{3} = .200033.$$
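The same computation can be scripted. The following sketch (Python/NumPy, illustrative) builds trapezoidal values and one or two levels of Richardson extrapolation for $\int_0^1 x^4\,dx = 0.2$; the first extrapolated column reproduces Simpson's rule.

```python
import numpy as np

def trapezoid(f, a, b, n):
    x = np.linspace(a, b, n + 1)
    return (b - a) / n * (np.sum(f(x)) - 0.5 * (f(a) + f(b)))

f = lambda x: x**4                        # exact integral over [0, 1] is 0.2
T4, T8, T16 = (trapezoid(f, 0.0, 1.0, n) for n in (4, 8, 16))
R8  = (4 * T8  - T4) / 3                  # O(h^4): equals Simpson with step h/2
R16 = (4 * T16 - T8) / 3
print(T4, T8, T16)
print(R8, R16)
print((16 * R16 - R8) / 15)               # one more extrapolation: O(h^6)
```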

Adaptive Quadrature
In all numerical integration schemes so far, all nodes are equally spaced. However, this is not
wise when a function varies rapidly in one region but is slowly varying in others. A scheme
which uses equally spaced nodes must take a step size h very small to resolve the function in the
rapidly varying region. However, this is a waste of computational resources in the region where
the function is slowly changing. A solution is to modify the step size according to the need. In
rapidly varying regions, the step size is small while in slowly varying regions, the step size can
be larger.

Example: $\displaystyle\int_{-1}^1 (1 + \sin e^{5x})\,dx$ is rapidly oscillatory near $x = 1$ and slowly varying near the other end. An
adaptive quadrature is ideal for this integration.

The following strategy detects whether the function is varying rapidly or not. Use a scheme
such as the trapezoidal rule to find Ih , an approximation to
$$I = \int_a^b f(x)\,dx$$

with uniform step size $h$. Repeat the calculation now with step size $h/2$ to obtain the approximation $I_{h/2}$. Let $\{x_i\}$ be the set of nodes separated by $h$ and define $x_{i-\frac12} = \frac{x_{i-1}+x_i}{2}$. The
trapezoidal rule with step size $h$ applied to $[x_{i-1}, x_i]$ gives
$$I_i := \int_{x_{i-1}}^{x_i} f(x)\,dx = T_{i,h} - \frac{h^3}{12} f''(x_{i-\frac12}) + O(h^5), \qquad T_{i,h} = \frac{f(x_{i-1}) + f(x_i)}{2}\, h,$$
while the trapezoidal rule with step size $h/2$ yields
$$I_i = T_{i,h/2} - \frac{h^3}{96}\left( f''(x_{i-\frac14}) + f''(x_{i-\frac34}) \right) + O(h^5),$$
where $x_{i-\frac j4} = a + h\left(i - \frac j4\right)$. The above assumes $f \in C^4[a,b]$. The following calculation
$$f''(x_{i-\frac14}) + f''(x_{i-\frac34}) = f''(x_{i-\frac12}) + \frac h4 f'''(x_{i-\frac12}) + O(h^2) + f''(x_{i-\frac12}) - \frac h4 f'''(x_{i-\frac12}) + O(h^2) = 2 f''(x_{i-\frac12}) + O(h^2)$$
shows that
$$I_i - T_{i,h/2} = -\frac{h^3}{48} f''(x_{i-\frac12}) + O(h^5).$$
Observe that
$$\frac{T_{i,h/2} - T_{i,h}}{3} = -\frac{h^3}{48} f''(x_{i-\frac12}) + O(h^5) = I_i - T_{i,h/2} + O(h^5).$$
Given a tolerance $\epsilon > 0$ for the approximate integral, the adaptive quadrature scheme requires
that the error in each interval $[x_{i-1}, x_i]$ be no more than $\epsilon h/(b-a)$. Hence if $|T_{i,h/2} - T_{i,h}|/3 \le
\epsilon h/(b-a)$, then $T_{i,h/2}$ is accepted as an approximation on $[x_{i-1}, x_i]$. Otherwise calculate
$T_{i,h/4}, T_{i,h/8}, \ldots$ until the tolerance on $[x_{i-1}, x_i]$ is satisfied. It should be emphasized that
the above strategy is only a heuristic and it can fail.
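A minimal recursive sketch of this strategy is given below (Python/NumPy; the integrand from the example above and the tolerance are illustrative assumptions). It compares T_h with T_{h/2} on each subinterval, uses (T_{h/2} - T_h)/3 as the error indicator, and splits the tolerance between the two halves so that the allowance stays proportional to the subinterval width.

```python
import numpy as np

def adaptive_trap(f, a, b, tol):
    # one trapezoid on [a, b] versus two trapezoids with the midpoint
    m = 0.5 * (a + b)
    T_h  = 0.5  * (b - a) * (f(a) + f(b))
    T_h2 = 0.25 * (b - a) * (f(a) + 2.0 * f(m) + f(b))
    if abs(T_h2 - T_h) / 3.0 <= tol:
        return T_h2
    # split the tolerance so it stays proportional to the subinterval width
    return adaptive_trap(f, a, m, tol / 2.0) + adaptive_trap(f, m, b, tol / 2.0)

f = lambda x: 1.0 + np.sin(np.exp(5.0 * x))
print(adaptive_trap(f, -1.0, 1.0, 1e-4))
```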

We conclude this section with the most important theoretical result in numerical integration.
We begin with some definitions. Given some continuous function f , define the quadrature scheme
$$Q_n(f) = \sum_{i=1}^n a_{in} f(x_{in}), \qquad n \ge 1,$$
where $a_{in}$ are weights and $x_{in}$ are nodes associated with the scheme. We say that $\{Q_n\}$ is
consistent if there is some function $g : \mathbb{N} \to \mathbb{N}$ with $g(n) \to \infty$ as $n \to \infty$ so that $Q_n(p)$ is
exact for every polynomial $p$ with degree at most $g(n)$:
$$Q_n(p) = \int_0^1 p(x)\,dx.$$

Define $\{Q_n\}$ to be stable if
$$C := \sup_{n\ge1} \sum_{i=1}^n |a_{in}| < \infty.$$
If $\{Q_n\}$ is exact for constants and all weights $\{a_{in}\}$ are non-negative, then $\{Q_n\}$ is stable and
the stability constant can be taken as 1. To see this, observe that the scheme is exact for the
constant function $f(x) = 1$. Therefore for all $n \ge 1$,
$$\sum_{i=1}^n a_{in} = Q_n(1) = \int_0^1 1\,dx = 1$$
implies that
$$\sup_{n\ge1} \sum_{i=1}^n |a_{in}| = \sup_{n\ge1} \sum_{i=1}^n a_{in} = 1.$$

Finally, define $\{Q_n\}$ to be convergent if
$$\lim_{n\to\infty} \left( Q_n(f) - \int_0^1 f(x)\,dx \right) = 0, \qquad f \in C[0,1].$$
The result is that for a consistent scheme, stability is equivalent to convergence. We saw an
analogous result for interpolation.

Theorem: Suppose {Qn } is consistent. Then it is stable iff it is convergent.

Proof: Suppose $\{Q_n\}$ is stable. Let $f \in C[0,1]$ and $\epsilon > 0$ be given. We need to find some integer $N$ so that for every $n \ge N$,
$$\left| Q_n(f) - \int_0^1 f(x)\,dx \right| < \epsilon.$$
From the Weierstrass Approximation Theorem, there is some polynomial $p$ so that
$$\|f - p\|_\infty < \frac{\epsilon}{1+C},$$
where $C$ is the stability constant. Let $m$ be the degree of $p$. Since $g(n) \to \infty$, choose $N$ so that $g(n) \ge m$ for every $n \ge N$; then $Q_n(p)$ is exact for all such $n$. Then for every $n \ge N$,
$$\left| Q_n(f) - \int_0^1 f(x)\,dx \right| \le |Q_n(f) - Q_n(p)| + \left| Q_n(p) - \int_0^1 p(x)\,dx \right| + \left| \int_0^1 (p-f)(x)\,dx \right|$$
$$< \sum_{i=1}^n |a_{in}|\, |f(x_{in}) - p(x_{in})| + 0 + \frac{\epsilon}{1+C} \le C\,\|f-p\|_\infty + \frac{\epsilon}{1+C} < \epsilon.$$

Suppose $\{Q_n\}$ is convergent. That is,
$$Q_n(f) \to I(f) := \int_0^1 f(x)\,dx, \qquad f \in C[0,1].$$
It is not difficult to check that for each $n \ge 1$, $Q_n, I : C[0,1] \to \mathbb{R}$ are linear operators with
$$\|Q_n\|_\infty = \sum_{i=1}^n |a_{in}|, \qquad \|I\|_\infty = 1.$$
Recall that
$$\|Q_n\|_\infty = \sup_{f \in C[0,1]\setminus\{0\}} \frac{|Q_n(f)|}{\|f\|_\infty}.$$
Since $\{Q_n\}$ is convergent, $Q_n$ is pointwise bounded, meaning that
$$\sup_{n\ge1} |Q_n f| \le C_f, \qquad f \in C[0,1].$$
Here $C_f$ is a real number depending on $f$. By the Principle of Uniform Boundedness, $\sup_{n\ge1} \|Q_n\|_\infty <
\infty$. This means that $\{Q_n\}$ is stable. 

6.4 Improper Integrals


Numerical integration of functions which are smooth is quite routine, and modern quadrature
algorithms are very reliable. We now examine some techniques to treat improper integrals,
that is, those where the integrand contains a singularity or the interval of integration is infinite.
Suppose the integrand has a mild singularity at c in the interior of the interval of integration
[a, b]. A direct application of a numerical integration over [a, b] would typically yield a poor
result. (We shall see that the convergence of the quadrature depends on the smoothness of
the integrand.) Without loss of generality, it can be assumed that the singularity occurs at an
end point. This is because it is possible to perform two integrations over [a, c] and [c, b]. For
instance, $\int_0^1 |x - .3|^{1/2}\,dx$ can be split into two integrals over $[0, .3]$ and $[.3, 1]$. It should
be emphasized that one should not rely on general quadrature algorithms to numerically inte-
grate improper integrals. Henceforth, we only consider improper integrals where the singularity
occurs at an end point. Open Newton–Cotes formulae give one approach. We now examine
other strategies.
Substitution can often remove the singularity of an integrand, or transform an infinite interval
of integration to a finite one.
Example: $\displaystyle\int_0^1 x^{-1/2} e^x\,dx = 2\int_0^1 e^{t^2}\,dt$ where $x = t^2$.

Example: $\displaystyle\int_1^\infty x^{-3/2} \sin\frac1x\,dx = \int_0^1 \frac{\sin t}{\sqrt t}\,dt$ where $x = 1/t$. The second integral can be integrated using any
integration scheme since the integrand has a limit 0 at the origin.

Integration by parts sometimes can remove a singularity in the integrand.

Example: Continuing the example above,
$$\int_0^1 \frac{\sin t}{\sqrt t}\,dt = \Bigl[\, 2\sqrt t \sin t \,\Bigr]_0^1 - \int_0^1 2\sqrt t \cos t\,dt = 2\sin 1 - 2\int_0^1 \sqrt t \cos t\,dt.$$
The above integral now has a weaker singularity at $t = 0$ and it can be routinely integrated
numerically. It is possible to weaken the singularity further by performing one more integration
by parts:
$$\int_0^1 \sqrt t \cos t\,dt = \frac23 \cos 1 + \frac23 \int_0^1 t^{3/2} \sin t\,dt.$$
We shall see later that the smoother the integrand (or the weaker the singularity), the faster the
convergence of the numerical solution to the exact value.
One can also subtract away any singular term in the integrand, assuming that it can be
integrated analytically. The remaining smooth term can be integrated numerically.
Example:
$$\int_0^1 \frac{\sin x}{\sqrt x}\,dx = \int_0^1 \left( \frac{\sin x - x}{\sqrt x} + \frac{x}{\sqrt x} \right) dx = \frac23 + \int_0^1 \frac{\sin x - x}{\sqrt x}\,dx.$$
The integrand in the last integral behaves like $O(x^{5/2})$ at the origin and has a much weaker
singularity compared to the original integrand.
Example: Here is a technique for an infinite interval of integration. Consider
$$I = \int_0^\infty \frac{dx}{1+x^{10}}.$$
Write the integral as the sum
$$\int_0^1 \frac{dx}{1+x^{10}} + \int_1^\infty \frac{dx}{1+x^{10}}.$$
The first integral can be routinely integrated numerically. The second can be converted to a
proper integral using the substitution $t = x^{-1}$:
$$\int_1^\infty \frac{dx}{1+x^{10}} = \int_1^0 \frac{-t^{-2}\,dt}{t^{-10}+1} = \int_0^1 \frac{t^8\,dt}{1+t^{10}},$$
which can easily be integrated numerically.
We split the integral into one over $[0,1]$ and another one over $[1,\infty)$ because the substitution
$t = x^{-1}$ cannot handle the point $x = 0$. Another possible substitution is $t = (1+x)^{-1}$. In this
case, it is no longer necessary to split up the integral and we obtain
$$I = \int_0^1 \frac{t^8\,dt}{t^{10} + (1-t)^{10}}.$$

Example: Besides substitution, which only works if a good substitution is known, another way to handle an
infinite interval of integration is to truncate the domain. Let $R$ be a number to be determined.
$$I = \int_0^\infty e^{-x} \cos^2(x^2)\,dx = \int_0^R e^{-x} \cos^2(x^2)\,dx + \int_R^\infty e^{-x} \cos^2(x^2)\,dx.$$
If an error of $2\epsilon$ or less is desired, then use any method to evaluate the first integral $I_R$ to within $\epsilon$.
Now
$$\int_R^\infty e^{-x} \cos^2(x^2)\,dx \le \int_R^\infty e^{-x}\,dx = e^{-R} \le \epsilon,$$
which means that we can choose $R = -\ln\epsilon$. Suppose the numerical integration of $I_R$ yields $\tilde I_R$.
We know that $|I_R - \tilde I_R| \le \epsilon$. Therefore the error of the computation is
$$|I - \tilde I_R| \le |I - I_R| + |I_R - \tilde I_R| \le 2\epsilon.$$

Before doing the truncation, sometimes it may be wise to do an integration by parts or use
substitution so that the new integrand decays more rapidly at infinity. This way, R does not
need to be very large for a more efficient quadrature. For instance,
$$\int_1^\infty \frac{\sin x}{x^2}\,dx = \cos 1 - 2\int_1^\infty \frac{\cos x}{x^3}\,dx.$$

To obtain an accuracy of 10−6 , truncation on the left integral requires integration to R = O(106 ),
while R = O(103 ) for the integral on the right. Clearly, the second integral requires less work
to evaluate numerically.

Example: A common improper integral is


$$I = \int_a^\infty f(x)\sin x\,dx.$$

Assume that $|f^{(j)}(x)| = O(x^{-m-j})$ as $x \to \infty$ for all $j \ge 0$ and some $m > 1$. Truncation of
the domain means that we calculate $I_N$, that is, integrate only over $[a, b]$ for some $b = b(N)$. By taking
a clever choice of $b$, we can obtain a simple method with quite acceptable errors. Take $b = N\pi$
for some large positive integer $N$. The error is
$$E_N = I - I_N = \int_{N\pi}^\infty f(x)\sin x\,dx = (-1)^N f(N\pi) + \int_{N\pi}^\infty f'(x)\cos x\,dx = (-1)^N f(N\pi) - \int_{N\pi}^\infty f''(x)\sin x\,dx.$$


If simple truncation is used, that is, approximate $I$ by $I_N$, then using the bound on $f(x)$ for large
$x$, we get $|E_N| = O(N^{1-m})$. To see this, observe that there is some $C$ so that $|f(x)|\,x^m \le C$ for
$x$ large enough. Therefore for $N$ large enough,
$$|E_N| \le C\int_{N\pi}^\infty \frac{dx}{x^m} = C_1 N^{1-m}, \qquad C_1 = \frac{C\pi^{1-m}}{m-1}.$$

By adding the simple correction $(-1)^N f(N\pi)$ to $I_N$, the error now behaves like $O(N^{-m-1})$,
which can be a drastic improvement over the estimate $I_N$. In this case, the approximation to $I$
is
$$\int_a^{N\pi} f(x)\sin x\,dx + (-1)^N f(N\pi).$$
Of course, $b = N\pi$ has been chosen so that $\sin b = 0$ and the boundary term $f'(b)\sin b$ vanishes, so no
correction involving $f'$ is needed. We can add more corrections but the complexity of the expression $f^{(2j)}$ typically increases
exponentially with $j$.
How would you design a similar method for
$$\int_a^\infty f(x)\cos x\,dx?$$
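A sketch of the sine version follows (Python/NumPy; the choices f(x) = 1/x², a = π, the node counts and the reference value are illustrative assumptions). It compares plain truncation at b = Nπ with the corrected value I_N + (-1)^N f(Nπ), using a well-resolved larger-N computation as the reference.

```python
import numpy as np

def composite_simpson(g, a, b, n):
    # n subintervals, n even
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    return h / 3 * (g(x[0]) + g(x[-1]) + 4 * np.sum(g(x[1:-1:2])) + 2 * np.sum(g(x[2:-1:2])))

def f(x):
    return 1.0 / x**2            # assumed test function: |f^(j)(x)| = O(x^(-2-j)), so m = 2

def truncated(N, corrected):
    a = np.pi                    # illustrative lower limit
    I_N = composite_simpson(lambda x: f(x) * np.sin(x), a, N * np.pi, 400 * N)
    if corrected:
        I_N += (-1.0)**N * f(N * np.pi)
    return I_N

ref = truncated(400, corrected=True)      # well-resolved reference value
for N in (5, 10, 20, 40):
    print(N, abs(truncated(N, False) - ref), abs(truncated(N, True) - ref))
```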

Example: As a final example, we make a transformation so that an integral on a finite domain is replaced
by one on an infinite domain! The reason is that the second integrand is smooth and decays
exponentially quickly at infinity. Consider
$$I = \int_{-1}^1 \frac{dy}{\sqrt{1-y^2}}.$$
Note that the integrand has square root singularities at the end points. Define $y = \tanh x$. Then
$$I = \int_{-\infty}^\infty \frac{dx}{\cosh x},$$
where the integrand now is smooth and decays like $e^{-|x|}$ for large $|x|$. Now it is easy to estimate
$I$ by doing truncation.

6.5 Additional Theory for Trapezoidal Rule


Let $f$ be a continuous function defined on $[a, b]$. The trapezoidal rule approximates the area
under the curve $y = f(x)$ by a sum of areas of trapezoids:
$$\int_a^b f(x)\,dx \approx T_n \equiv \frac h2 \sum_{i=1}^n \bigl( f(x_i) + f(x_{i-1}) \bigr), \qquad x_i = a + ih, \quad h = \frac{b-a}{n}.$$
We derived an error bound in a previous section. We now give several other bounds on the error
$$E_n = \int_a^b f(x)\,dx - T_n$$

depending on the smoothness of f – the smoother f is, the smaller the error is.
If the integrand f is not twice continuously differentiable, then the error of the trapezoidal
rule decays at a slower rate than O(h2 ). We give two results in this direction. First assume that
$f' \in L^1(a,b)$, that is, $f'$ is integrable. Let
$$\|g\|_1 \equiv \int_a^b |g(x)|\,dx$$
denote the norm of any $g \in L^1(a,b)$. The error in the interval $[x_{i-1}, x_i]$ is
$$\epsilon_i \equiv \frac h2 \bigl( f(x_{i-1}) + f(x_i) \bigr) - \int_{x_{i-1}}^{x_i} f(x)\,dx.$$
By integration by parts,
$$\epsilon_i = \int_{x_{i-1}}^{x_i} (x - x_{i-\frac12})\, f'(x)\,dx, \qquad x_{i-\frac12} = \frac{x_{i-1}+x_i}{2}.$$
Thus
$$|E_n| \le \sum_{i=1}^n |\epsilon_i| \le \frac h2 \int_a^b |f'(x)|\,dx = \frac h2 \|f'\|_1 = O(h).$$
i=1

Now suppose that $f'' \in L^1(a,b)$; then apply integration by parts once more to obtain
$$\epsilon_i = -\frac12 \int_{x_{i-1}}^{x_i} \left[ \left(\frac h2\right)^2 - \bigl(x - x_{i-\frac12}\bigr)^2 \right] f''(x)\,dx.$$
Consequently
$$|E_n| \le \sum_{i=1}^n |\epsilon_i| \le \frac12 \sum_{i=1}^n \int_{x_{i-1}}^{x_i} \left[ \left(\frac h2\right)^2 - \bigl(x - x_{i-\frac12}\bigr)^2 \right] |f''(x)|\,dx \qquad (*)$$
$$\le \frac12 \left(\frac h2\right)^2 \sum_{i=1}^n \int_{x_{i-1}}^{x_i} |f''(x)|\,dx = \frac{h^2}{8}\|f''\|_1 = O(h^2).$$
This implies, in particular, that the trapezoidal rule integrates linear functions exactly.
Recall that $L^\infty(a,b)$ is the space of bounded functions, so that $f \in L^\infty(a,b)$ if
$$\|f\|_\infty := \sup_{x\in[a,b]} |f(x)| < \infty.$$

Assume $f'' \in L^\infty(a,b)$. We had already considered this case in an earlier section but we shall
derive the result in a different way. From (*),
$$|E_n| \le \frac{\|f''\|_\infty}{2} \sum_{i=1}^n \int_{x_{i-1}}^{x_i} \left[ \left(\frac h2\right)^2 - \bigl(x - x_{i-\frac12}\bigr)^2 \right] dx = \frac{\|f''\|_\infty}{2} \sum_{i=1}^n \frac{h^3}{6} = \frac{(b-a)\|f''\|_\infty}{12}\, h^2.$$

Example: Let $f(x) = \sqrt x$ on $[0,1]$. Note that $f' \in L^1(0,1)$ but $f'' \notin L^1(0,1)$. According to the above error
analysis, $|E_n| \le (2n)^{-1}$. It can be shown that $|E_n| \ge c\, n^{-1}$ for some positive constant $c$ and
hence the error estimate is sharp.
If $f$ is infinitely many times differentiable and $f^{(j)}(a) = f^{(j)}(b)$ for all non-negative integers
$j$ (i.e., $f$ is smooth and periodic), then it will be shown below that the quadrature error decays
faster than $h^k$ for every positive integer $k$, where $h = (b-a)/n$ is the width of each sub-interval.
We call this exponential convergence. The trapezoidal rule is perfect for integrating functions
such as
$$\int_0^{2\pi} e^{\sin x}\,dx$$
because the integrand is smooth and periodic and so the error decays exponentially. See the left
figure of Figure 6.1.
Note that when $f$ is periodic, then
$$T_n = \frac h2 \sum_{j=1}^n \bigl( f(x_j) + f(x_{j-1}) \bigr) = h\left( \frac{f(a)+f(b)}{2} + \sum_{j=1}^{n-1} f(x_j) \right) = h \sum_{j=0}^{n-1} f(x_j).$$
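The exponential decay is easy to observe in practice. The sketch below (Python/NumPy, illustrative) uses the periodic form of T_n just written down for the integral of e^{sin x} over [0, 2π], comparing against a finely resolved value.

```python
import numpy as np

def periodic_trap(f, a, b, n):
    # T_n for a periodic integrand: h times the sum of f at n equally spaced nodes
    x = a + (b - a) / n * np.arange(n)
    return (b - a) / n * np.sum(f(x))

f = lambda x: np.exp(np.sin(x))
ref = periodic_trap(f, 0.0, 2 * np.pi, 200)     # effectively exact
for n in (2, 4, 8, 16, 32):
    print(n, abs(periodic_trap(f, 0.0, 2 * np.pi, n) - ref))
```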

Theorem: Let $f \in C^m([0,1])$ be periodic of period 1 where $m \ge 2$. Fix $n \ge 2$. Then
$$\left| \int_0^1 f(x)\,dx - \frac1n \sum_{j=0}^{n-1} f\!\left(\frac jn\right) \right| \le \frac{C}{n^m}.$$

Proof: Consider first the case $f(x) = e^{2i\pi kx}$ where $i^2 = -1$ and $k$ is an arbitrary integer. Observe
the geometric sum
$$\sum_{j=0}^{n-1} e^{2i\pi kj/n} = \begin{cases} n, & n \text{ divides } k;\\ 0, & \text{otherwise.} \end{cases}$$
Note that if $f(x) = e^{2i\pi kx}$, $\int_0^1 f(x)\,dx = \delta_{0k}$, and so
$$e_{kn} \equiv \int_0^1 f(x)\,dx - \frac1n \sum_{j=0}^{n-1} f\!\left(\frac jn\right) = \begin{cases} -1, & n \text{ divides } k,\ k \ne 0;\\ 0, & \text{otherwise.} \end{cases}$$

Now consider a general $f \in C^m[0,1]$ which is periodic of period 1. Recall that $\mathbb{Z}$ denotes the set
of integers, and $f$ has a Fourier expansion
$$f(x) = \sum_{k\in\mathbb{Z}} \hat f(k) e^{2i\pi kx},$$
which is uniformly convergent in $[0,1]$. Recall that
$$\hat f(k) = \int_0^1 f(x) e^{-2i\pi kx}\,dx$$
is the Fourier coefficient of $f$. Now
$$\int_0^1 f(x)\,dx - \frac1n \sum_{j=0}^{n-1} f\!\left(\frac jn\right) = \sum_{k\in\mathbb{Z}} \hat f(k)\, e_{kn} = -\sum_{p\in\mathbb{Z}\setminus 0} \hat f(pn).$$

We claim (to be shown later) that the Fourier coefficients satisfy
$$|\hat f(k)| \le \frac{C_m}{|k|^m}, \qquad k \ne 0, \qquad (6.2)$$
for some $C_m$ independent of $k$. Then
$$\left| \int_0^1 f(x)\,dx - \frac1n \sum_{j=0}^{n-1} f\!\left(\frac jn\right) \right| \le \frac{2C_m}{n^m} \sum_{p=1}^\infty \frac1{p^m},$$
which is the desired result since the infinite sum on the right-hand side converges.
To complete the proof, we prove the claim (6.2). Let $k$ be a non-zero integer. Then
$$\hat f(k) = \int_0^1 f(x)e^{-2i\pi kx}\,dx = \left[ \frac{f(x)e^{-2i\pi kx}}{-2i\pi k} \right]_0^1 - \int_0^1 \frac{f'(x)e^{-2i\pi kx}}{-2i\pi k}\,dx = \frac{\widehat{f'}(k)}{2i\pi k}.$$

Repeating integration by parts $m-1$ more times results in
$$|\hat f(k)| = \left| \frac{\widehat{f^{(m)}}(k)}{(2i\pi k)^m} \right| \le \frac{C_m}{|k|^m}, \qquad C_m = \frac{\int_0^1 |f^{(m)}(x)|\,dx}{(2\pi)^m}.$$

This result illustrates a common principle: the smoother the integrand, the faster the quadrature
error goes to zero. The fast decay of the error is due to the fast decay of the coefficient fˆ(k) as
a function of |k|.

Example: Since the trapezoidal rule is so efficient at evaluating integrals of smooth periodic integrands,
there are transformations which take advantage of this fact. Consider
$$I = \int_{-\infty}^\infty \frac{dx}{1+x^4}.$$
Let $x = \tan(\theta/2)$. Then
$$I = \frac12 \int_{-\pi}^\pi \frac{1+\tan^2(\theta/2)}{1+\tan^4(\theta/2)}\,d\theta.$$
The new integrand is smooth (vanishing at both end points) and $2\pi$-periodic. Hence the trapezoidal rule converges exponentially quickly.
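A short sketch of this transformation (Python/NumPy, illustrative) follows; the trapezoidal sums stabilize to the known value π/√2 of the original integral with very few points.

```python
import numpy as np

def periodic_trap(g, n):
    # trapezoidal rule for a 2π-periodic integrand on [-π, π]
    theta = -np.pi + 2 * np.pi / n * np.arange(n)
    return 2 * np.pi / n * np.sum(g(theta))

g = lambda t: 0.5 * (1 + np.tan(t / 2)**2) / (1 + np.tan(t / 2)**4)
for n in (4, 8, 16, 32, 64):
    print(n, periodic_trap(g, n))
print(np.pi / np.sqrt(2))        # exact value of the original integral, for comparison
```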

Example: Consider the evaluation of the function

$$f(x) = \frac{e^x - 1 - x - x^2/2}{x^3}.$$
When $x$ is small, severe cancellation error occurs. We can find some cutoff value $r$ so that
whenever $|x| \le r$, we compute $f$ by a Taylor expansion, and evaluate $f$ directly if $|x| > r$.
Another solution is to use contour integration
$$f(x) = \frac{1}{2\pi i} \int_\Gamma \frac{f(z)}{z-x}\,dz,$$
where $\Gamma$ is, say, the circle of radius one with centre at the origin of the complex plane traversed in
the counterclockwise direction. Here $i = \sqrt{-1}$. Since $f$ is real, we can save some work by computing
the contour integral in the upper half of the circle, doubling its value and then taking its real part.
Suppose $m$ points are taken. Since the integrand is periodic, it can be accurately computed by
the trapezoidal rule:
$$\mathrm{real}\left( \frac1m \sum_{j=1}^m f(z_j) \right), \qquad z_j = e^{i\pi(j-0.5)/m}.$$

With m = 15, the function can be evaluated to full double precision.
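Here is one way such a computation might look (Python/NumPy, illustrative). The factor z_j/(z_j - x), which comes from parametrizing the circle by angle together with the Cauchy kernel, is written out explicitly in this sketch; the display above corresponds to the case x ≈ 0, where this factor is close to 1.

```python
import numpy as np

def f_direct(x):
    return (np.exp(x) - 1 - x - x**2 / 2) / x**3

def f_contour(x, m=15):
    j = np.arange(1, m + 1)
    z = np.exp(1j * np.pi * (j - 0.5) / m)     # nodes on the upper unit semicircle
    vals = f_direct(z) * z / (z - x)           # |z| = 1, so no cancellation here
    return np.real(np.sum(vals)) / m           # doubling + real part for real x

x = 1e-4
print(f_direct(x))                 # ruined by cancellation
print(f_contour(x))                # accurate
print(1/6 + x/24 + x**2/120)       # leading terms of the Taylor expansion
```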


This technique can be extended to matrices:
$$f(A) = \frac{1}{2\pi i} \int_\Gamma f(z)\,(z - A)^{-1}\,dz,$$
where $\Gamma$ encloses all the eigenvalues of the matrix $A$. It can be used in accurate solutions of ODEs
and PDEs.

6.6 Peano Kernel


In the previous section, we used integration by parts to obtain estimates of the quadrature
error for the trapezoidal rule. We now give a general theory which works for a wide class of
quadrature schemes. Let w be a positive function, wj be positive reals. Fix some nodes xj so
that a ≤ x0 < . . . < xn ≤ b. Consider the quadrature error
$$E_n f = \int_a^b f(x)w(x)\,dx - \sum_{j=0}^n f(x_j)\,w_j.$$
For any function $g$, define
$$g_+(x) = \begin{cases} g(x), & g(x) \ge 0;\\ 0, & \text{otherwise.} \end{cases}$$
Fix any integer $k$, and define the Peano kernel
$$K(y) = \int_a^b (x-y)_+^k\, w(x)\,dx - \sum_{j=0}^n w_j\, (x_j - y)_+^k.$$
Note that $a_+^k$ means $(a_+)^k$. Let $P_m$ denote the set of polynomials of degree at most $m$.
Theorem: Fix a positive integer $n$ and a non-negative integer $m$. Suppose $E_n p = 0$ for all $p \in P_m$. Let
$0 \le k \le m$. Suppose $f \in C^{k+1}[a,b]$. Then
$$E_n f = \frac1{k!} \int_a^b f^{(k+1)}(y)\,K(y)\,dy.$$

Proof: By Taylor's theorem, $f(x) = p(x) + r(x)$, where
$$p(x) = \sum_{j=0}^k \frac{f^{(j)}(a)}{j!}(x-a)^j, \qquad r(x) = \frac1{k!}\int_a^x f^{(k+1)}(y)(x-y)^k\,dy.$$
Observe that $p \in P_k$ and
$$r(x) = \frac1{k!}\int_a^b f^{(k+1)}(y)(x-y)_+^k\,dy.$$
Thus
$$E_n f = E_n p + E_n r = 0 + \int_a^b r(x)w(x)\,dx - \sum_{j=0}^n r(x_j)w_j$$
$$= \frac1{k!}\int_a^b \int_a^b f^{(k+1)}(y)(x-y)_+^k\,dy\, w(x)\,dx - \frac1{k!}\sum_{j=0}^n \int_a^b f^{(k+1)}(y)(x_j-y)_+^k\,dy\, w_j$$
$$= \frac1{k!}\int_a^b f^{(k+1)}(y)\left( \int_a^b (x-y)_+^k\, w(x)\,dx - \sum_{j=0}^n (x_j-y)_+^k\, w_j \right) dy = \frac1{k!}\int_a^b f^{(k+1)}(y)\,K(y)\,dy.$$

Example: We recover the error of the trapezoidal rule using the Peano kernel, assuming $f \in C^2[a,b]$.
This corresponds to $k = 1 = m$ in the theorem since the trapezoidal rule integrates all linear
polynomials exactly. Recall the nodes are $x_j = a + jh$, where $h = (b-a)/n$. Then
$$E_n f = \int_a^b f(x)\,dx - \frac h2 \sum_{j=0}^{n-1} \bigl( f(x_{j+1}) + f(x_j) \bigr) = \sum_{j=0}^{n-1} \left( \int_{x_j}^{x_{j+1}} f(x)\,dx - \frac h2 \bigl( f(x_{j+1}) + f(x_j) \bigr) \right).$$
Let $\epsilon_j$ denote the quadrature error on the interval $[x_j, x_{j+1}]$; that is, $\epsilon_j$ is the expression to the
right of the summation in the above expression. The Peano kernel on this interval is
$$K_j(y) = \int_{x_j}^{x_{j+1}} (x-y)_+\,dx - \frac h2 (x_j - y)_+ - \frac h2 (x_{j+1} - y)_+ = \int_y^{x_{j+1}} (x-y)\,dx - 0 - \frac h2 (x_{j+1} - y) = -\frac{(x_{j+1}-y)(y-x_j)}{2}.$$
Consequently,
$$\epsilon_j = -\frac12 \int_{x_j}^{x_{j+1}} f''(y)(x_{j+1}-y)(y-x_j)\,dy.$$

To derive a $\|\cdot\|_1$ error estimate, notice that
$$(x_{j+1}-y)(y-x_j) \le \frac{h^2}{4}, \qquad y \in [x_j, x_{j+1}].$$
Therefore
$$|\epsilon_j| \le \frac{h^2}{4}\cdot\frac12 \int_{x_j}^{x_{j+1}} |f''(y)|\,dy,$$
resulting in
$$|E_n f| \le \sum_{j=0}^{n-1} |\epsilon_j| \le \frac{h^2}{8} \sum_{j=0}^{n-1} \int_{x_j}^{x_{j+1}} |f''(y)|\,dy = \frac{h^2}{8}\|f''\|_1.$$

For a $\|\cdot\|_\infty$ error estimate,
$$|\epsilon_j| \le \frac{\|f''\|_\infty}{2} \int_{x_j}^{x_{j+1}} (x_{j+1}-y)(y-x_j)\,dy = \frac{h^3}{12}\|f''\|_\infty.$$
This easily leads to
$$|E_n f| \le \sum_{j=0}^{n-1} |\epsilon_j| \le \frac{(b-a)\,h^2}{12}\|f''\|_\infty.$$

For the

6.7 Gaussian Quadrature Over Finite Intervals


Simple integration schemes take equal subintervals. If we relax this restriction, then the resultant
method can be more accurate. The problem now is to find points xi and weights wi so that
$$\int_a^b f(x)\,dx \approx \sum_{i=1}^k f(x_i)\,w_i$$
with the error as small as possible. Here $k$ is a given positive integer. There are $2k$ degrees of
freedom ($x_i$ and $w_i$) and so it is reasonable to expect that all polynomials of degree $2k-1$ or
lower can be integrated exactly.
By a change of variable $z = 2\,\dfrac{x-a}{b-a} - 1$,
$$\int_a^b f(x)\,dx = \frac{b-a}{2}\int_{-1}^1 F(z)\,dz, \qquad F(z) = f\!\left( \frac{(z+1)(b-a)}{2} + a \right).$$

Hence we shall only consider integration on [−1, 1] from now on.


Two-point Gaussian quadrature refers to the above scheme with $k = 2$:
$$\int_{-1}^1 f(x)\,dx \approx f(x_1)w_1 + f(x_2)w_2.$$
With four parameters, all polynomials of degree three or less can be integrated exactly. Substituting $f = 1, x, x^2, x^3$ results in four equations
$$2 = w_1 + w_2$$
$$0 = x_1 w_1 + x_2 w_2$$
$$\tfrac23 = x_1^2 w_1 + x_2^2 w_2$$
$$0 = x_1^3 w_1 + x_2^3 w_2$$
which can be solved to yield $w_1 = w_2 = 1$ and $x_{1,2} = \pm\dfrac1{\sqrt3}$. (We shall see later that $x_{1,2}$ are
the roots of a quadratic polynomial.) Hence
$$\int_{-1}^1 f(x)\,dx = f\!\left(\frac1{\sqrt3}\right) + f\!\left(-\frac1{\sqrt3}\right)$$
is exact for polynomials of degree three or less. This should be contrasted with the trapezoidal
rule. The above procedure is not recommended for higher order Gaussian quadrature because
of the difficulty in solving the associated nonlinear system.
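A quick numerical check of this exactness claim (Python/NumPy, illustrative) follows: the two-point rule reproduces the integral of x^p over [-1, 1] exactly for p ≤ 3, while the trapezoidal rule using the two endpoints does not.

```python
import numpy as np

nodes = np.array([-1.0, 1.0]) / np.sqrt(3)      # two-point Gauss nodes; both weights are 1
for p in range(6):
    exact = 0.0 if p % 2 else 2.0 / (p + 1)     # integral of x^p over [-1, 1]
    gauss2 = np.sum(nodes**p)
    trap2 = (-1.0)**p + 1.0                     # trapezoidal rule with the two endpoints
    print(p, exact, gauss2, trap2)
```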
Before proceeding with the general theory of Gaussian quadrature, we recall a few facts
about Legendre polynomials. The first few are defined by

$$p_0(x) = 1, \quad p_1(x) = x, \quad p_2(x) = \frac{3x^2-1}{2}, \quad p_3(x) = \frac{5x^3-3x}{2}.$$
A recurrence relation defining the Legendre polynomials is
$$(j+1)p_{j+1}(x) - (2j+1)x\,p_j(x) + j\,p_{j-1}(x) = 0, \qquad j \ge 1. \qquad (6.3)$$



It is known that each $p_j$ is an eigenfunction of a singular Sturm–Liouville eigenvalue problem
$$(1-x^2)p_j'' - 2xp_j' + j(j+1)p_j = 0,$$
where the solution is required to be bounded on $[-1,1]$ and satisfies $p_j(1) = 1$. It can be shown
that
$$\int_{-1}^1 p_j^2(x)\,dx = \frac{2}{2j+1}. \qquad (6.4)$$
Note that $p_i$ is a polynomial of degree $i$. For a fixed positive $n$, it is known that $\{p_i,\ 0 \le i \le n\}$
is an orthogonal basis for $P_n$, the space of polynomials of degree at most $n$, under the inner
product
$$\langle f, g\rangle = \int_{-1}^1 f(x)g(x)\,dx, \qquad f, g \in P_n.$$
Orthogonality refers to the fact that $\int_{-1}^1 p_i p_j\,dx$ is zero if $i \ne j$ and is positive if $i = j$. It is also
known that the $n$ zeroes of $p_n$ are all real, distinct and lie in $(-1,1)$.
The sequence of Legendre polynomials can be derived in many ways. Here we describe
three of them. First, recall that a basis for $P_n$ is $\{1, x, \ldots, x^n\}$. Gram–Schmidt can be used to
produce an orthogonal set $\{p_j,\ j = 0, \ldots, n\}$ which spans the same space. To ensure uniqueness,
we require $p_j(1) = 1$ for every $j$. Taking $p_0 := 1$, the second function must be orthogonalized
against the first one:
$$x - \frac{\langle x, p_0\rangle}{\langle p_0, p_0\rangle}p_0 = x.$$
Thus we can take $p_1(x) = x$ since it satisfies $p_1(1) = 1$. Next, orthogonalize $x^2$ against $p_0$ and
$p_1$:
$$x^2 - \frac{\langle x^2, p_0\rangle}{\langle p_0, p_0\rangle}p_0 - \frac{\langle x^2, p_1\rangle}{\langle p_1, p_1\rangle}p_1 = x^2 - \frac13 - 0.$$
We must rescale the above function so that it takes on the value 1 at $x = 1$. Thus $p_2(x) =
(3x^2-1)/2$. This procedure can be continued indefinitely.
Next we show that $p_n$ satisfies the eigenvalue relation
$$(1-x^2)p_n'' - 2xp_n' + n(n+1)p_n = 0, \qquad p_n(1) = 1.$$
We seek a power series solution
$$p_n(x) = \sum_{j=0}^\infty a_j x^j.$$
Substitute this into the differential equation to get
$$\bigl( 2a_2 + n(n+1)a_0 \bigr) + \bigl( 6a_3 - 2a_1 + n(n+1)a_1 \bigr)x + \sum_{j=2}^\infty \Bigl( a_{j+2}(j+2)(j+1) - a_j\bigl[ j(j+1) - n(n+1) \bigr] \Bigr)x^j = 0.$$
We can conclude that
$$a_2 = -\frac{n(n+1)}{2}a_0, \qquad a_3 = \frac{2-n(n+1)}{6}a_1, \qquad a_{j+2} = \frac{j(j+1)-n(n+1)}{(j+2)(j+1)}a_j, \quad j \ge 2.$$
Note that $a_{n+2k} = 0$ for all $k \ge 1$ and that $a_0$ and $a_1$ are parameters to be determined.
If $n$ is odd, then take $a_0 = 0$ and define $a_1$ so that $p_n(1) = 1$. Observe that $a_{2k} = 0$ for all
non-negative $k$. If $n$ is even, take $a_1 = 0$ so that $a_{2k-1} = 0$ for all positive $k$ and define $a_0$ so
that $p_n(1) = 1$. In both cases, $p_n$ is, in fact, a polynomial of degree $n$.
The final derivation of the Legendre polynomials uses the 3-term recurrence relation. Its
proof is by induction. Since this will be done for a more general case in the next chapter, we do
not prove the special case here.
Fix a positive integer k and let {x1 , . . . , xk } be the roots of pk . Recall the following polyno-
mials which were defined in the discussion on Lagrange interpolation:

$$L_i(x) = \prod_{j=1,\ j\ne i}^{k} \frac{x-x_j}{x_i-x_j}, \qquad i = 1, \cdots, k.$$

These polynomials are of degree k − 1 and satisfy the property Li (xj ) = δij .

Theorem: Let $k$ be a positive integer and $\{x_i,\ i = 1, \cdots, k\}$ be the roots of $p_k$. Define $w_i = \int_{-1}^1 L_i(x)\,dx$.
Then
$$\int_{-1}^1 f(x)\,dx = \sum_{i=1}^k w_i f(x_i)$$
holds for all polynomials $f$ of degree $2k-1$ or less.

Proof: Let $f$ be a polynomial of degree $2k-1$ or less. There exist polynomials $q, r$ of degree $k-1$ or
less so that $f = qp_k + r$. Note
$$r(x) = \sum_{i=1}^k L_i(x)r(x_i) = \sum_{i=1}^k L_i(x)f(x_i).$$
The first equality above holds by the definition of $L_i$ and the fact that $r(x) - \sum_i L_i(x)r(x_i)$ is a
polynomial of degree at most $k-1$ with $k$ distinct roots $\{x_i\}$. The second equality holds because
$p_k$ vanishes at the nodes $\{x_i\}$. In fact, $r$ is the polynomial interpolant of $f$ at the $k$ nodes.
Using the orthogonality of Legendre polynomials,
$$\int_{-1}^1 f(x)\,dx = \int_{-1}^1 qp_k\,dx + \int_{-1}^1 r\,dx = 0 + \sum_{i=1}^k f(x_i)\int_{-1}^1 L_i(x)\,dx = \sum_{i=1}^k w_i f(x_i).$$
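In practice the nodes and weights are not computed by hand. The sketch below assumes NumPy's leggauss routine, which returns the roots of p_k and the corresponding weights, and shows how rapidly the quadrature error decays for a smooth integrand.

```python
import numpy as np

f = lambda x: np.exp(x) * np.cos(x)
# exact value of the integral of e^x cos x over [-1, 1]; antiderivative is e^x (cos x + sin x)/2
exact = (np.exp(1) * (np.cos(1) + np.sin(1)) - np.exp(-1) * (np.cos(1) - np.sin(1))) / 2
for k in range(1, 9):
    x, w = np.polynomial.legendre.leggauss(k)   # k-point Gauss-Legendre nodes and weights
    print(k, abs(np.dot(w, f(x)) - exact))
```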

The following theorems give additional properties about Gaussian quadrature.

Theorem: Fix a positive k. The weights wi and nodes xi defined above are unique for the exact integration
of all polynomials of degree 2k − 1 or less. Furthermore 0 < wi < 2, i = 1, · · · , k.

Proof: Let $\{W_i\}$ be non-zero and let $\{y_i\}$ be distinct, $i = 1, \cdots, k$. Suppose for every polynomial $p$ of
degree $2k-1$ or less,
$$\int_{-1}^1 p(x)\,dx = \sum_{i=1}^k W_i\, p(y_i).$$
We wish to show that $w_i = W_i$ and $x_i = y_i$ for every $i$.
By the orthogonality of Legendre polynomials, for $0 \le j \le k-1$,
$$0 = \int_{-1}^1 p_j(x)p_k(x)\,dx = \sum_{i=1}^k W_i\, p_j(y_i)p_k(y_i).$$
This can be represented as a system of linear equations $Ac = 0$ where
$$A = \begin{pmatrix} p_0(y_1) & \cdots & p_0(y_k)\\ \vdots & & \vdots\\ p_{k-1}(y_1) & \cdots & p_{k-1}(y_k) \end{pmatrix}, \qquad c = \begin{pmatrix} W_1 p_k(y_1)\\ \vdots\\ W_k p_k(y_k) \end{pmatrix}.$$
We shall show below that $A$ is non-singular, and so this means that $\{y_i\}$ are the roots of $p_k$.
That is, $y_i = x_i$ for every $i$.
We have $\sum_{i=1}^k w_i p(x_i) = \sum_{i=1}^k W_i p(x_i)$ for every polynomial $p$ of degree less than or equal to $2k-1$.
Take $p = p_j$, $j = 0, \cdots, k-1$. The system of equations becomes $\sum_{i=1}^k p_j(x_i)(w_i - W_i) = 0$, $0 \le j \le k-1$, or
$$A\begin{pmatrix} w_1 - W_1\\ \vdots\\ w_k - W_k \end{pmatrix} = 0.$$
Since $A$ is non-singular, $w_i = W_i$ for every $i$.
Finally, we demonstrate that $A$ is non-singular. Suppose $A^T c = 0$ for some vector $c$. Define the
polynomial $q(x) = \sum_{i=0}^{k-1} c_i p_i(x)$ of degree at most $k-1$. Now $A^T c = 0$ means that $q$ has $k$ distinct
roots $y_1, \cdots, y_k$. This implies that $q$ is the zero function. Since $\{p_i\}$ are linearly independent,
$c = 0$.
Fix $i$. Recall $L_i$ satisfies $L_i(x_q) = \delta_{iq}$. Note that $L_i^2$ is a non-negative polynomial of degree $2k-2$.
Thus
$$0 < \int_{-1}^1 L_i^2(x)\,dx = \sum_{j=1}^k w_j L_i^2(x_j) = w_i.$$
Next,
$$2 = \int_{-1}^1 1\,dx = \sum_{j=1}^k w_j\cdot 1 = \sum_{j=1}^k w_j.$$
Since every $w_j > 0$, we can conclude that $w_j < 2$. 



Theorem: Let $k$ be a positive integer and $\{x_i,\ i = 1, \cdots, k\}$ be the roots of $p_k$. Define the weights $w_i$ as
in the above theorem. Let $f \in C^{2k}(-1,1)$. Then for some $\xi \in (-1,1)$,
$$E_k \equiv \int_{-1}^1 f(x)\,dx - \sum_{i=1}^k w_i f(x_i) = \frac{f^{(2k)}(\xi)}{(2k)!}\int_{-1}^1 \phi_k(x)\,dx, \qquad \phi_k(x) = \prod_{i=1}^k (x-x_i)^2.$$

Proof: Let $p(x)$ be the (Hermite) polynomial of degree $2k-1$ such that $p(x_i) = f(x_i)$ and $p'(x_i) =
f'(x_i)$, $i = 1, \cdots, k$. Hence
$$\int_{-1}^1 p(x)\,dx = \sum_{i=1}^k w_i p(x_i) = \sum_{i=1}^k w_i f(x_i).$$
Recall from (5.7) that for some fixed $x \in (-1,1)$ and $\xi(x) \in (-1,1)$,
$$f(x) - p(x) = \frac{f^{(2k)}(\xi(x))}{(2k)!}\phi_k(x).$$
Consequently,
$$\int_{-1}^1 f(x)\,dx - \sum_{j=1}^k w_j f(x_j) = \frac1{(2k)!}\int_{-1}^1 f^{(2k)}(\xi(x))\phi_k(x)\,dx = \frac{f^{(2k)}(\xi)}{(2k)!}\int_{-1}^1 \phi_k(x)\,dx$$
for some $\xi$ by the mean value theorem for integrals. This permits us to conclude that
$$E_k = \int_{-1}^1 (f - p)\,dx = \frac{f^{(2k)}(\xi)}{(2k)!}\int_{-1}^1 \phi_k(x)\,dx.$$


Using (6.3) and (6.4), it can be shown that
$$\int_{-1}^1 \phi_k(x)\,dx = \frac{2^{2k+1}(k!)^4}{(2k+1)[(2k)!]^2}. \qquad (6.5)$$
Consequently,
$$|E_k| \le \frac{2^{2k+1}(k!)^4}{(2k+1)[(2k)!]^2}\,\sup_{\xi\in[-1,1]}\frac{|f^{(2k)}(\xi)|}{(2k)!}.$$
Using the Stirling approximation $n! \approx e^{-n}n^n\sqrt{2\pi n}$ for large $n$, it follows that in the limit $k \to \infty$,
$$|E_k| \le \frac{\pi}{4^k}\,\sup_{\xi\in[-1,1]}\frac{|f^{(2k)}(\xi)|}{(2k)!}.$$

Although the above is only an approximation for large k, it turns out to be very accurate even
for small k, and, in fact, gives an upper bound for all k ≥ 1.
Thus the error decays exponentially quickly as a function of k provided that f is smooth
enough. This should be compared with a decay of $k^{-2}$ for the trapezoidal rule, even if $f$ is smooth.
The following result says that Gaussian quadrature converges to the exact integral in the limit of infinitely many
nodes, assuming only continuity of $f$.

  k                                     1        2        3        5                 10                20
  2^{2k+1}(k!)^4 / ((2k+1)[(2k)!]^2)    0.6667   0.1778   0.0457   2.9318 × 10^-3    2.9256 × 10^-6    2.8226 × 10^-12
  π 4^{-k}                              0.7854   0.1963   0.0491   3.0680 × 10^-3    2.9961 × 10^-6    2.8573 × 10^-12

Theorem: Suppose $f \in C[-1,1]$. Then $\displaystyle\lim_{k\to\infty} E_k = 0$.

Proof: Given any $\epsilon > 0$. By the Weierstrass approximation theorem, there is some polynomial $p$ so that
$$|f(x) - p(x)| < \frac\epsilon4, \qquad \forall x \in [-1,1].$$
Take $k$ to be any positive integer so that $2k$ is larger than the degree of $p$. Recall that the Gaussian
quadrature with $k$ nodes integrates $p$ exactly. Now
$$|E_k| = \left| \int_{-1}^1 f(x)\,dx - \sum_{j=1}^k f(x_j)w_j \right| \le \left| \int_{-1}^1 \bigl(f(x)-p(x)\bigr)\,dx \right| + \left| \int_{-1}^1 p(x)\,dx - \sum_{j=1}^k f(x_j)w_j \right|$$
$$\le \int_{-1}^1 |f(x)-p(x)|\,dx + \left| \sum_{j=1}^k p(x_j)w_j - \sum_{j=1}^k f(x_j)w_j \right| \le \frac\epsilon4 \int_{-1}^1 dx + \frac\epsilon4 \sum_{j=1}^k w_j = \frac\epsilon2 + \frac\epsilon4 \int_{-1}^1 dx = \epsilon.$$

The conclusion of this theorem may seem unspectacular. It should be contrasted with the
quadrature scheme using equi-spaced nodes, with the integrand approximated by the polynomial
interpolant at those nodes. There the quadrature error not only fails to decrease
to zero, it can increase exponentially as a function of the number of nodes. See Figure 5.1.
The final result is a general theorem about convergence of the quadrature
$$I_n(f) := \sum_{i=1}^n f(x_i)w_i$$
with $-1 \le x_1 < \cdots < x_n \le 1$ and arbitrary weights $w_i \in \mathbb{R}$ to approximate the integral of
$f \in C[-1,1]$:
$$I(f) := \int_{-1}^1 f(x)\,dx.$$
The nodes $\{x_i\}$ and weights $\{w_i\}$ depend on $n$, but not on $f$. We do not indicate explicitly their
dependence on $n$ to simplify the notation. Define the quadrature error $E_n(f) = |I(f) - I_n(f)|$.

Theorem: Assume the notation described above. Then $E_n(f) \to 0$ for every $f \in C[-1,1]$ iff
$$\lim_{n\to\infty} I_n(x^k) = I(x^k), \qquad k \ge 0,$$
and
$$\sum_{i=1}^n |w_i| \le M, \qquad n \ge 1,$$
for some positive constant $M$.
Proof: We only give a sketch of the proof of the "if" part. Suppose the two assumptions hold.
We show that $E_n(f) \to 0$. Given any positive integer $n$, there exists some polynomial $p_N$ of degree
$N = N(n)$ so that $E_n(p_N) = 0$ by the first assumption, and $\|f - p_N\|_\infty \to 0$ as $n \to \infty$. (In
the case of Gaussian quadrature, $N = 2n-1$ and the latter property follows from the Weierstrass
approximation theorem.) Then
$$E_n(f) = E_n(f - p_N) = \left| \int_{-1}^1 \bigl( f(x) - p_N(x) \bigr)\,dx - \sum_{j=1}^n w_j \bigl( f(x_j) - p_N(x_j) \bigr) \right| \le \|f - p_N\|_\infty \left( 2 + \sum_{j=1}^n |w_j| \right) \le \|f - p_N\|_\infty (2 + M).$$
Take $n \to \infty$ to conclude that $E_n(f) \to 0$. 
This theorem offers an alternative proof that Gaussian quadrature converges for continuous
functions on $[-1,1]$. This follows since
$$I_n(1) = \sum_{i=1}^n w_i = I(1) = 2, \qquad n \ge 1,$$
using the fact that Gaussian quadrature $I_n$ is exact for polynomials of degree at most $2n-1$
and that all the weights are positive.

General Gaussian Quadrature


Let $P_k$ denote the space of polynomials of degree $k$ or less. Let $[a,b]$ be a (finite or infinite)
interval and $w$ be a positive piecewise continuous function defined on $[a,b]$. Assume that
$$\int_a^b |x|^n\, w(x)\,dx < \infty, \qquad n \ge 0.$$
Define the inner product
$$\langle f, g\rangle = \int_a^b f(x)g(x)w(x)\,dx.$$
Using the Gram–Schmidt process, the set $\{1, x, x^2, \ldots, x^n\}$ can be transformed to a set of monic
polynomials $\{p_0, \ldots, p_n\}$ which is orthogonal with respect to the above inner product:
$$\langle p_i, p_j\rangle = 0, \qquad i \ne j.$$
In fact, deg pj = j, j ≥ 0. It turns out that the orthogonal polynomials satisfy a simple
recurrence relation.

Theorem: There are constants $a_n$ and $b_n$ so that
$$p_n(x) = (x + a_n)\,p_{n-1}(x) + b_n\,p_{n-2}(x), \qquad n \ge 2.$$
Proof: The proof is by induction on $n$. The base case $n = 2$ is trivial. Assume that the result holds for
$n$. We proceed to show that it holds for $n+1$. Since $p_{n+1} - xp_n \in P_n$ and $\{p_0, \ldots, p_n\}$ is
an orthogonal basis for $P_n$, it follows that
$$p_{n+1}(x) - xp_n(x) = a_{n+1}p_n + b_{n+1}p_{n-1} + \sum_{j=0}^{n-2} d_j p_j, \qquad (6.6)$$
for some real constants $a_{n+1}, b_{n+1}, d_j$. Take the inner product with $p_n$ to get
$$0 - \langle xp_n, p_n\rangle = a_{n+1}\langle p_n, p_n\rangle,$$
or
$$a_{n+1} = -\frac{\langle xp_n, p_n\rangle}{\langle p_n, p_n\rangle}.$$
Next, take the inner product with $p_{n-1}$ in (6.6) to get
$$0 - \langle xp_n, p_{n-1}\rangle = b_{n+1}\langle p_{n-1}, p_{n-1}\rangle,$$
or
$$b_{n+1} = -\frac{\langle xp_n, p_{n-1}\rangle}{\langle p_{n-1}, p_{n-1}\rangle} = -\frac{\langle p_n, xp_{n-1}\rangle}{\langle p_{n-1}, p_{n-1}\rangle} = -\frac{\langle p_n, p_n + q\rangle}{\langle p_{n-1}, p_{n-1}\rangle} = -\frac{\langle p_n, p_n\rangle}{\langle p_{n-1}, p_{n-1}\rangle},$$
where $q \in P_{n-1}$. Finally, take the inner product with $p_j$, $0 \le j \le n-2$, in (6.6) to get
$$0 - \langle xp_n, p_j\rangle = d_j\langle p_j, p_j\rangle,$$
or
$$d_j\langle p_j, p_j\rangle = -\langle p_n, xp_j\rangle = -\langle p_n, p_{j+1} + r\rangle = 0,$$
where $r \in P_j$. This shows that $d_j = 0$ for every $j$ and completes the induction proof. 

Theorem: For each $n \ge 1$, all zeros of $p_n$ are real, simple and lie in $(a,b)$.

Proof: If $p_n$ does not change sign in $[a,b]$, then
$$\int_a^b p_n(x)w(x)\,dx = \langle p_n, p_0\rangle$$
is either positive or negative. But this is a contradiction since the right-hand side must be zero
by orthogonality. Hence $p_n$ must have some zero $x_1 \in (a,b)$. If $x_1$ is not a simple zero, then
$p_n(x)(x-x_1)^{-2} \in P_{n-2}$ and by orthogonality,
$$0 = \left\langle \frac{p_n}{(x-x_1)^2},\, p_n \right\rangle = \left\langle \frac{p_n^2}{(x-x_1)^2},\, 1 \right\rangle > 0,$$
a contradiction. Thus every zero of $p_n$ in $(a,b)$ must be simple.
Let $z_1, \ldots, z_m$ be all the zeros of $p_n$ in $(a,b)$. Clearly $m \le n$. If $m < n$, then since the zeros are simple,
$$p_n(x)(x-z_1)\cdots(x-z_m) = q(x)(x-z_1)^2\cdots(x-z_m)^2,$$
for some $q \in P_{n-m}$ which does not vanish in $(a,b)$. Without loss of generality, assume $q > 0$ in
$(a,b)$. Therefore
$$\langle p_n,\ (x-z_1)\cdots(x-z_m)\rangle = \langle q(x)(x-z_1)^2\cdots(x-z_m)^2,\ 1\rangle > 0.$$
However, $(x-z_1)\cdots(x-z_m) \in P_m$ with $m < n$, so by orthogonality the left-hand side is zero, which is absurd. It can be inferred that $m = n$. 

Denote the corresponding set of orthonormal polynomials by $\{\hat p_0, \ldots, \hat p_n\}$, whose elements
satisfy
$$\langle \hat p_i, \hat p_j\rangle = \delta_{ij}, \qquad \hat p_i = \frac{p_i}{\langle p_i, p_i\rangle^{1/2}}, \quad i, j \ge 0.$$
In the special case $w \equiv 1$, the orthogonal polynomials are the Legendre polynomials. Another
common case is $w = (1-x^2)^{-1/2}$, where the orthogonal polynomials are known as Chebyshev
polynomials.
Define the kernel polynomial
$$K_n(x,y) = \sum_{j=0}^n \hat p_j(x)\hat p_j(y).$$

Theorem: Let $p \in P_n$. Then
$$\langle p, K_n(\cdot, y)\rangle = p(y).$$
$K_n$ is the unique polynomial of degree at most $n$ in $x$ (with $y$ fixed) and of degree at most $n$ in
$y$ (with $x$ fixed) satisfying the above relation.

Proof: Since $\{\hat p_0, \ldots, \hat p_n\}$ is an orthonormal basis for $P_n$, it follows that
$$p(x) = \sum_{j=0}^n \langle p, \hat p_j\rangle \hat p_j(x).$$
Therefore,
$$\langle p, K_n(\cdot, y)\rangle = \sum_{j,k=0}^n \Bigl\langle \langle p, \hat p_j\rangle \hat p_j(x),\ \hat p_k(x)\hat p_k(y) \Bigr\rangle = \sum_{j=0}^n \langle p, \hat p_j\rangle \hat p_j(y) = p(y).$$
Let $k(x,y)$ be a polynomial of degree at most $n$ in $x$ and at most $n$ in $y$ so that $\langle p, k(\cdot, y)\rangle = p(y)$
for every $p \in P_n$. Take $p(x) = K_n(x,z)$. Then
$$\langle K_n(\cdot, z), k(\cdot, y)\rangle = K_n(y,z).$$
From the hypothesis of this theorem,
$$\langle K_n(\cdot, z), k(\cdot, y)\rangle = \langle k(\cdot, y), K_n(\cdot, z)\rangle = k(z,y).$$
Combine the symmetry of $K_n$ and these results to get $K_n(z,y) = K_n(y,z) = k(z,y)$, demonstrating the uniqueness of $K_n$. 

The kernel polynomial is useful because for any continuous function $f$, $\langle K_n(x, \cdot), f\rangle \in P_n$ is a good approximation to $f(x)$ in the following senses. Let $E_n(x) =
f(x) - \langle K_n(x,\cdot), f\rangle$. Then
1. $\langle E_n, \hat p_j\rangle = 0$, $0 \le j \le n$,
2. $E_n$ vanishes at at least $n+1$ points in $(a,b)$.
We shall only prove the first property.
$$\langle E_n, \hat p_j\rangle = \langle f, \hat p_j\rangle - \sum_i \langle \hat p_i, f\rangle\langle \hat p_i, \hat p_j\rangle = \langle f, \hat p_j\rangle - \sum_i \langle \hat p_i, f\rangle\delta_{ij} = 0.$$
Now we are ready to apply the above theory of orthogonal polynomials to Gaussian quadra-
ture. Let w be the weight function and pn be an orthogonal polynomial as before.
Theorem: For $n \ge 1$, let $x_1 < \cdots < x_n$ be the zeroes of $p_n$. There are positive constants $w_1, \ldots, w_n$ so that
$$\int_a^b p(x)w(x)\,dx = \sum_{j=1}^n w_j\,p(x_j), \qquad p \in P_{2n-1}.$$
Furthermore,
$$0 < w_i < \int_a^b w(x)\,dx, \qquad 1 \le i \le n.$$

Proof: Given any $p \in P_{2n-1}$, write $p = qp_n + r$ for some $q, r \in P_{n-1}$. Since $x_j$ is a root of $p_n$, it follows
that $p(x_j) = r(x_j)$, $j = 1, \ldots, n$. Hence
$$r(x) = \sum_{j=1}^n p(x_j)L_j(x), \qquad L_i(x) = \prod_{j=1,\ j\ne i}^n \frac{x-x_j}{x_i-x_j}.$$
Note that $r$ is the polynomial interpolant of $p$ at the $n$ nodes. Consequently,
$$\int_a^b p(x)w(x)\,dx = \int_a^b \bigl( r(x) + p_n(x)q(x) \bigr)w(x)\,dx = \int_a^b r(x)w(x)\,dx + \int_a^b p_n(x)q(x)w(x)\,dx = \sum_{j=1}^n p(x_j)w_j,$$
where
$$w_j = \int_a^b L_j(x)w(x)\,dx.$$
Note that the integral of $p_n q w$ is zero since $q \in P_{n-1}$ is orthogonal to $p_n$ in the inner product
$\langle\cdot,\cdot\rangle$. Since $L_j^2 \in P_{2n-2}$ and $L_j(x_k) = \delta_{jk}$, substitute $L_j^2$ for $p$ to get
$$w_j = \sum_{k=1}^n w_k L_j^2(x_k) = \int_a^b L_j^2(x)w(x)\,dx > 0.$$
Finally, apply the exact quadrature formula to the constant function 1 to get
$$\int_a^b w(x)\,dx = \sum_{j=1}^n w_j > w_i, \qquad 1 \le i \le n.$$

Theorem: Assume the setting of the above theorem. Let $f \in C^{2n}[a,b]$ and let $d_n$ be the leading coefficient of $\hat p_n$, so that $d_n^{-2} = \langle p_n, p_n\rangle$. Then there is
some $\eta \in (a,b)$ so that
$$\int_a^b f(x)w(x)\,dx - \sum_{j=1}^n f(x_j)w_j = \frac{f^{(2n)}(\eta)}{(2n)!\,d_n^2}.$$

Proof: Let $h \in P_{2n-1}$ be the Hermite interpolant of $f$. By (5.7),
$$f(x) = h(x) + \frac{f^{(2n)}(\xi(x))}{(2n)!}(x-x_1)^2\cdots(x-x_n)^2, \qquad x \in [a,b],$$
for some $\xi(x) \in (a,b)$. According to the Theorem associated with (5.7),
$$\frac{f(x)-h(x)}{(x-x_1)^2\cdots(x-x_n)^2} \in C[a,b].$$
Therefore $f^{(2n)}(\xi(x)) \in C[a,b]$. Since $p_n$ is monic with zeros $x_1, \ldots, x_n$, we have $(x-x_1)^2\cdots(x-x_n)^2 = p_n^2(x) = \hat p_n^2(x)/d_n^2$. Integrate
$$f(x)w(x) = h(x)w(x) + \frac{f^{(2n)}(\xi(x))\,\hat p_n^2(x)}{(2n)!\,d_n^2}\,w(x)$$
to get
$$\int_a^b f(x)w(x)\,dx = \int_a^b h(x)w(x)\,dx + \frac1{(2n)!\,d_n^2}\int_a^b f^{(2n)}(\xi(x))\,\hat p_n^2(x)w(x)\,dx$$
$$= \int_a^b h(x)w(x)\,dx + \frac{f^{(2n)}(\eta)}{(2n)!\,d_n^2}\int_a^b \hat p_n^2(x)w(x)\,dx = \sum_{j=1}^n f(x_j)w_j + \frac{f^{(2n)}(\eta)}{(2n)!\,d_n^2}.$$
The second equality follows from the mean value theorem for integrals (the factor $\hat p_n^2 w$ is non-negative); the last uses $\int_a^b \hat p_n^2 w\,dx = 1$ and $\int_a^b h w\,dx = \sum_j h(x_j)w_j = \sum_j f(x_j)w_j$ since $h \in P_{2n-1}$ and $h$ interpolates $f$ at the nodes.



Example: Let $a = -1$, $b = 1$ and $w(x) = (1-x^2)^{-1/2}$. The associated orthogonal polynomials are called
Chebyshev polynomials. The first few are
$$T_0(x) = 1, \quad T_1(x) = x, \quad T_2(x) = 2x^2 - 1, \quad T_3(x) = 4x^3 - 3x.$$

Example: Let $a = 0$, $b = \infty$ and $w(x) = e^{-x}$. The associated orthogonal polynomials are called Laguerre
polynomials. The first few are
$$G_0(x) = 1, \quad G_1(x) = -x+1, \quad G_2(x) = x^2-4x+2, \quad G_3(x) = -x^3+9x^2-18x+6.$$

Example: Let $a = -\infty$, $b = \infty$ and $w(x) = e^{-x^2}$. The associated orthogonal polynomials are called
Hermite polynomials. The first few are
$$H_0(x) = 1, \quad H_1(x) = 2x, \quad H_2(x) = 4x^2-2, \quad H_3(x) = 8x^3-12x.$$

6.8 Monte Carlo Methods


This method approximates the value of an integral by the average of the integrand evaluated at
many random points. It is typically used only for high-dimensional integrals.
Given a continuous function $f$, we wish to approximate
$$I = \int_a^b f(x)\,dx.$$
Suppose $x_i$ comes from the uniform distribution on $[a,b]$. Then with $N$ such random points, an
approximation to $I$ is
$$I_N = \frac{b-a}{N}\sum_{i=1}^N f(x_i).$$
We now compute two statistics, $E(I_N)$, the mean value of $I_N$, and $\sigma(I_N)$, the standard deviation.
Now
$$E(I_N) = \frac{b-a}{N}\sum_{i=1}^N E(f) = (b-a)E(f) = (b-a)\,\frac{\int_a^b f(x)\,dx}{b-a} = I.$$
This is a desirable result.
Next we compute the variance of $I_N$:
$$\mathrm{var}\left( \frac{b-a}{N}\sum_{i=1}^N f(x_i) \right) = \frac{(b-a)^2}{N^2}\,\mathrm{var}\left( \sum_{i=1}^N f(x_i) \right) = \frac{(b-a)^2}{N^2}\,N\,\mathrm{var}(f).$$
Recall that $\mathrm{var}(f) = E(f^2) - E(f)^2$ with $E(f) = I/(b-a)$; then
$$\sigma(I_N) = \bigl(\mathrm{var}(I_N)\bigr)^{1/2} = \frac{(b-a)\,\bigl( E(f^2) - E(f)^2 \bigr)^{1/2}}{\sqrt N}. \qquad (6.7)$$

Thus the standard deviation, which measures the deviation from the mean and thus is in some
sense the ‘error’ of the approximation, behaves like N −1/2 . Since the error of the trapezoidal
rule is O(N −2 ) for f smooth, the Monte Carlo method should not be used for one-dimensional
integrals. However in d dimensions, the error of the trapezoidal rule is O(N −2/d ) while that
of the Monte Carlo method is still O(N −1/2 ), independent of d! Hence when d > 4, the Monte
Carlo method is more efficient than the trapezoidal rule. As an added bonus, the Monte Carlo
method is not sensitive to singularities or discontinuities of the integrand.
In the above description of the Monte Carlo method, we employed the uniform distribution.
This may not be the best. Suppose we choose the distribution p(x), which of course satisfies
$$\int_a^b p(x)\,dx = 1.$$
Then
$$\int_a^b f(x)\,dx = \int_a^b g(x)p(x)\,dx, \qquad g(x) = \frac{f(x)}{p(x)}.$$
We can now define the approximation
$$I_N = \frac1N\sum_{i=1}^N g(x_i),$$

where xi has probability distribution p. It follows that


$$E(I_N) = \frac1N\sum_{i=1}^N E(g) = E(g) = \int_a^b gp\,dx = \int_a^b f(x)\,dx = I.$$

Now
$$\mathrm{var}(I_N) = \frac1N\mathrm{var}(g) = \frac1N\bigl( E(g^2) - E(g)^2 \bigr) = \frac1N\left( \int_a^b g^2 p\,dx - I^2 \right) = \frac1N\left( \int_a^b \frac{f^2}{p}\,dx - I^2 \right).$$
Finally,
$$\sigma(I_N) = \frac1{\sqrt N}\left( \int_a^b \frac{f^2}{p}\,dx - I^2 \right)^{1/2}.$$
While it is still O(N −1/2 ), the constant multiplying N −1/2 can be smaller if p is chosen properly.

Example: Let $I = \int_0^{\pi/2} \sin x\,dx = 1$. The Monte Carlo method with the uniform distribution yields $\sigma(I_N) \approx
0.483\,N^{-1/2}$, with $I_{10} = 0.952$ for one set of random numbers, while using the distribution function
$p(x) = 8\pi^{-2}x$, we obtain $I_{10} = 1.016$.
Another useful technique to improve the efficiency is to modify the integrand $f$ so that $E(f^2)$
is as small as possible. See (6.7).

Example: For the same integral as in the above example, write
$$I = \int_0^{\pi/2} \frac{2x}{\pi}\,dx + \int_0^{\pi/2} \left( \sin x - \frac{2x}{\pi} \right) dx.$$
The first term on the right-hand side is easily evaluated as $\pi/4$ while the second term can be
evaluated using the Monte Carlo method. Since the square of the integrand of the second term
is small, the result will be more accurate. Here, $\sigma(I_N) \approx 0.1\,N^{-1/2}$.
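The following sketch (Python/NumPy; the sample size and random seed are illustrative assumptions) implements the three estimators discussed above for this integral: plain Monte Carlo, importance sampling with p(x) = 8x/π² (sampled by inverse transform), and the variance-reduced version that subtracts 2x/π and adds back its known integral π/4.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
a, b = 0.0, np.pi / 2                    # I = integral of sin x over [0, pi/2] = 1

# plain Monte Carlo with the uniform distribution
x = rng.uniform(a, b, N)
I_plain = (b - a) * np.mean(np.sin(x))

# importance sampling with p(x) = 8x/pi^2; inverse transform gives x = (pi/2) sqrt(u)
u = rng.uniform(0.0, 1.0, N)
x = (np.pi / 2) * np.sqrt(u)
I_importance = np.mean(np.sin(x) / (8 * x / np.pi**2))

# subtract 2x/pi (integral pi/4 known), apply Monte Carlo to the small remainder
x = rng.uniform(a, b, N)
I_reduced = np.pi / 4 + (b - a) * np.mean(np.sin(x) - 2 * x / np.pi)

print(I_plain, I_importance, I_reduced)
```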

6.9 Multiple Integrals


One-dimensional quadrature extends in a straightforward way to rectangles and cubes. For
more general domains, the procedure can be more difficult. For instance, a Gaussian quadrature
for
$$\int_{-1}^1 \int_0^{x^2+1} f(x,y)\,dy\,dx$$
is
$$\sum_{i=1}^k \sum_{j=1}^K w_i\,W_{ij}\,f(x_i, y_{ij}),$$
for some fixed positive integers $k$ and $K$. Here, $x_i$ and $w_i$ are the usual nodes and weights for
a $k$ point Gaussian quadrature over $[-1,1]$. Integration in the $y$ direction requires a little extra
effort. For a fixed $x_i$, the range of integration in $y$ is from 0 to $x_i^2+1$. One must do a linear
transformation from $y$ to $Y$ so that $-1 \le Y \le 1$. Then $\{Y_j\}$ and $W_{ij}$ are the nodes and weights
for a $K$ point Gaussian quadrature over $[-1,1]$, with $W_{ij}$ incorporating the Jacobian of the linear
transformation. Now back transform $Y_j$ to $y_{ij}$ to get the nodes in the original variable.
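A sketch of this procedure (Python/NumPy, with an illustrative integrand and NumPy's leggauss assumed for the one-dimensional nodes and weights) follows; the factor (x_i²+1)/2 in the weights is the Jacobian of the linear transformation in y.

```python
import numpy as np

f = lambda x, y: np.exp(-x * y)          # illustrative integrand

def gauss_2d(k, K):
    xs, wx = np.polynomial.legendre.leggauss(k)   # nodes/weights in x on [-1, 1]
    Ys, WY = np.polynomial.legendre.leggauss(K)   # nodes/weights in Y on [-1, 1]
    total = 0.0
    for xi, wi in zip(xs, wx):
        c = xi**2 + 1.0                           # length of the y-interval [0, x_i^2 + 1]
        y = (Ys + 1.0) * c / 2.0                  # back transform Y -> y
        Wij = WY * c / 2.0                        # weights pick up the Jacobian c/2
        total += wi * np.dot(Wij, f(xi, y))
    return total

for k in (4, 8, 16):
    print(k, gauss_2d(k, k))                      # values stabilize as k grows
```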
There is another approach. Suppose the region of integration is $T$, the right triangle with
vertices $(0,0)$, $(0,1)$, $(1,0)$. In the spirit of Gaussian quadrature, we wish to find weights and
points on or inside $T$ so that
$$\int_T f(x,y)\,dx\,dy = w_A f(A) + w_B f(B) + w_C f(C)$$
is exact for $f$ a polynomial of degree as high as possible. This problem is quite difficult. Let us simplify the problem by choosing the points $A, B, C$ as the midpoints of the sides:
$A(0,1/2)$, $B(1/2,0)$, $C(1/2,1/2)$. It seems reasonable to expect that we can choose the weights
so that the integral is exact for all linear polynomials $a + bx + cy$ for arbitrary constants $a, b, c$.
Indeed, substituting $f = 1, x, y$, we obtain three equations
$$\frac12 = w_A + w_B + w_C, \qquad \frac16 = \frac12 w_A + \frac12 w_C, \qquad \frac16 = \frac12 w_B + \frac12 w_C.$$

Solving these equations yields $w_A = w_B = w_C = 1/6$. If we substitute $f = x^2, y^2, xy$ into the
quadrature formula with the above weights, we get the pleasant surprise that they are also exact.
Hence
$$\int_T f(x,y)\,dx\,dy = \frac{f(A) + f(B) + f(C)}{6}$$
is actually exact for all polynomials of degree two or less. For an arbitrary triangle $T$,
$$\int_T f(x,y)\,dx\,dy = \frac{|T|}{3}\bigl( f(A) + f(B) + f(C) \bigr)$$
for all polynomials of degree two or less, where $|T|$ is the area of $T$ and $A, B, C$ are the midpoints
of the sides.
If the given domain of integration can be partitioned into many triangles, the above
quadrature rule can be used to get an accurate approximation to the integral over it. Of course,
the more triangles we use, the smaller the error, provided that $f$ is smooth.
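A small verification of the edge-midpoint rule on the reference triangle (Python, illustrative) follows; the exact moments use the standard formula for the triangle with vertices (0,0), (1,0), (0,1), namely the integral of x^p y^q equals p! q!/(p+q+2)!.

```python
from math import factorial

A, B, C = (0.0, 0.5), (0.5, 0.0), (0.5, 0.5)     # midpoints of the sides

def midpoint_rule(f):
    # |T| = 1/2 for the reference triangle, so |T|/3 = 1/6
    return (f(*A) + f(*B) + f(*C)) / 6.0

for p in range(3):
    for q in range(3 - p):                        # all monomials of degree <= 2
        exact = factorial(p) * factorial(q) / factorial(p + q + 2)
        approx = midpoint_rule(lambda x, y: x**p * y**q)
        print(p, q, exact, approx)
```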
