MATH 2160 Numerical Analysis 1 Notes: S. H. Lui Department of Mathematics University of Manitoba
MATH 2160 Numerical Analysis 1 Notes: S. H. Lui Department of Mathematics University of Manitoba
S. H. Lui
Department of Mathematics
University of Manitoba
Table of Contents i
2 Nonlinear Equations 12
2.1 Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Fixed Point Iteration (FPI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 System of Nonlinear Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Linear Systems 25
3.1 Basic Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Gaussian Elimination with Partial Pivoting . . . . . . . . . . . . . . . . . . . . . 27
3.3 Errors in Solving Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Symmetric Positive Definite Systems . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Iterative Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Least Squares 45
4.1 Polynomial Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Trigonometric and Exponential Models . . . . . . . . . . . . . . . . . . . . . . . . 47
i
CONTENTS 1
1.1 Introduction
Numerical analysis is the design and analysis of accurate and efficient algorithms to solve prob-
lems in science and engineering. Computers work with numbers which can be represented by
finitely many bits whereas some real numbers require infinitely many bits to represent exactly.
Thus there is an error involved in representing each real number and this error propagates in
subsequent arithmetic operations. The question is whether we can trust the result of a long
sequence of calculations.
(bk bk−1 · · · b0 .b−1 · · · b−n )2 = bk 2k + bk−1 2k−1 + b1 2 + b0 + b−1 2−1 + · · · + b−n 2−n
where each bj = 0 or 1.
11
Example: (10101.1011)2 = 24 + 22 + 1 + 2−1 + 2−3 + 2−4 = 21 .
16
Example: x = (.1011)2 = (.1011 1011 1011 · · · )2 . To obtain the value of x, note that multiplying x by 16 is
equivalent to shifting the radix point right by four places. Hence 16x = (1011.1011)2 . Subtract
by x to get 15x = (1011)2 = 23 + 21 + 1 = 11. Hence x = 11/15.
Example: x = (.10101)2 . Subtract 4x = (10.101)2 from 32x = (10101.101)2 to get 28x = (10011)2 = 19
and so x = 19/28.
1
2 CHAPTER 1. FLOATING POINT NUMBERS
This implies that b−1 = 1 and .4 = [· · · ]. Multiplying the latter by two results in
Thus b−5 = 0. The pattern then repeats itself. We conclude that 53.7 = (110101.10110)2 .
±(1.b−1 · · · b−n )2 × 2p
where bi = 0 or 1 and p is an integer, called the exponent part the number. One bit is used to
store the sign of this number while n bits are used for the fractional part which is also called
the mantissa. The 1 before the radix point renders this a normalized floating point number.
This digit is not allocated storage.
There are infinitely many floating point representations of a number but only one normalized
representation.
Suppose m bits are allocated for the exponent. Let M = 2m−1 − 1. Then instead of storing
p, to avoid allocating a bit to store the sign, we store the biased exponent q = p + M . Hence
with −M ≤ p ≤ M + 1, it follows that 0 ≤ q ≤ 2m − 1.
The range 0 < q < 2m − 1 are reserved for normalized floating point numbers. Denormalized
floating point numbers are those having the form
(0.b−1 · · · b−n )2 × 2p
An overflow is said to have occurred if the result of an operation is a number whose magnitude
is larger than the largest machine representable number. An underflow occurs when result of an
operation is a nonzero number whose magnitude is smaller than the smallest non-zero machine
number. Typically, the result is set to zero in this instance. Underflows are usually harmless
while overflows are almost always fatal.
Example: Let m = 11 so that M = 1023. The exponent part of (1.10)2 × 2−1000 is stored as q =
−1000 + 1023 = 23 = (00000010111)2 .
The largest (finite) machine representable number is (1.1 · · · 1)2 × 2M = (2 − 2−n )2M ≈ 2M +1
(remembering that p = M + 1 is reserved for infinities and NaNs) while the smallest nonzero
one is (0.0 · · · 01)2 × 2−M = 2−n−M . The total number of machine representable numbers
(including infinities and NaN) is 2m+n+1 . The two most common formats are single precision
(m = 8, n = 23) and double precision (m = 11, n = 52).
It is interesting that the density of machine representable numbers is not uniform. For
instance, there are 2n machine representable numbers in [1, 2). In general, there are 2n−p floating
point numbers in [2p , 2p + 1) for 0 ≤ p ≤ M . Take the case p = 1 so the interval of concern is
[2, 3). The binary numbers in this interval are (10.b−1 b−2 · · · )2 which when normalized becomes
(1.0b−1 b−2 · · · )2 × 21 .
When p is much larger than one, then there are far fewer machine numbers in the interval
[2 , 2p + 1). On the other hand, there are M 2n ≈ 2m+n−1 machine numbers in [0, 1). This
p
nonuniform distribution of machine numbers is actually good since most numbers that we deal
with in everyday life are between (−10, 10), say, where the density of machine numbers is large.
The last expression on the right is a better way to evaluate z than the naive way. Next, define
z + π/2 = y + r, where y = 2nπ for some integer n and r ∈ [0, 2π). Then I = sin(r) is a
better method to calculate I than the straightforward way. Let a satisfy e−a a1 000 = 1015 π.
Note that a ≈ 1.0374. When x = 10000, 1000, a, the naive way reports NaN, NaN and 0.1504,
respectively, while the better method gives 1, NaN and 1. While clearly superior to the naive
way, some additional work is necessary before the second method can give a correct answer when
x = 1000.
round up (take the first n bits and then add one to the last bit). Otherwise, we take simply take
the first n bits as in chopping. For a real number x, its floating point representation is denoted
by x̃.
Example: Suppose n = 7, x = 53.7 = (1.101011011 · · · )2 × 25 . Then using chopping arithmetic, x̃ =
(1.1010110)2 × 25 while x̃ = (1.1010111)2 × 25 in rounding arithmetic. If n = 5, then using
rounding, x̃ = (1.10110)2 × 25 .
In the above definition of rounding, there is a bias in that we always round up if the (n + 1)st
digit is one. A better implementation is (still with the (n + 1)st digit as one) round up if, for
instance, the nth bit is one and round down otherwise. This way, we round up and down with
equal probability.
Define the absolute error of the truncation of a real number x to be |x − x̃| while the relative
|x − x̃|
error is defined as . The magnitude of the absolute error depends on the magnitude of
|x|
x. For instance, an absolute error of 10 may seem large but it may be quite acceptable if x and
x̃ have values of magnitude 1020 . The relative error takes the scale of the problem into account
but it is not defined if x = 0 and it may give a misleading result if |x| is small.
It is not difficult to see that the relative errors of chopping and rounding are 2−n and 2−n−1 ,
respectively. For instance, in chopping, x and x̃ differ possibly for digits after the (n + 1)st bit.
Hence
X∞
|x − x̃| 1
≤ = 2−n .
|x| 2j
j=n+1
The unit roundoff is defined as 2−n or 2−n−1 depending on whether chopping or rounding
is being used. In double precision, the unit roundoff is 2−53 which is approximately 10−16 .
Needless to say, rounding is preferred to chopping.
Another characterization of the unit roundoff ǫM is that it is the smallest machine number
so that 1 ⊕ ǫM > 1. Here ⊕ denotes machine addition which is different than (exact) addition
because of truncation. For all positive machine numbers x smaller than ǫM , we have the strange
looking 1 ⊕ x = 1. Roundoff errors occur in floating point arithmetic operations, even if the
operands can be represented exactly.
Example: For simplicity use base 10 and assume n = 2. The exact sum 2.34 + 1.09 × 10−1 + 7.65 is
1.0099 × 101 . Let us calculate (2.34 ⊕ 1.09 × 10−1 ) ⊕ 7.65. The exact first sum 2.34 + .109 is 2.449
which becomes 2.44 after chopping. Adding this to 7.65 becomes 10.09 = 1.009 × 101 which
.099
becomes 1.00 × 101 after chopping. The absolute and relative errors are .099 and ≈ 10−2 ,
10.099
respectively.
If rounding is used, then the rounded first sum is 2.45. Add this to 7.65 followed by rounding
.001
results in 1.01 × 101 . The absolute and relative errors are .001 and ≈ 10−4 , respectively.
10.099
Still with base 10 and n = 2, 1.23 × 2.01 = 2.4723. Chopping or rounding both lead to 1.23 ⊗
2.01 = 2.47 with an absolute error of .0023 and relative error of ≈ 10−3 .
Floating point arithmetic do not obey the usual rules of arithmetic such as associative and
distributive laws. When programming, one should avoid pitfalls such as
if x = .3 then...
Here roundoff errors may mean that the statement is not executed even if the exact value of x
is .3. A better construct is to test if |x − .3| is less than some tolerance.
1.5. CANCELLATION ERROR 5
a result which has only one correct digit. This phenomenon can lead to an answer which is
totally different from the exact one, especially over the course of a sequence of operations.
Example: Solve x2 + 109 x − 3 = 0. By the quadratic formula, the two roots are
√ √
−109 − 1018 + 12 −109 + 1018 + 12
x− = , x+ = .
2 2
There is no cancellation in x− which can be computed to full precision. On the other hand x+ is
a difference of nearly equal numbers. In double precision, the result is zero. As above, a better
calculation is
√ √
−109 + 1018 + 12 109 + 1018 + 12 6
x+ = √ = √
2 109 + 1018 + 12 109 + 1018 + 12
which results in a computed answer accurate to full precision.
1 − cos x
Example: Let f (x) = . If x = 10−8 , then the computed answer in double precision is zero because
sin2 x
of severe cancellation error. Using sin2 x = 1 − cos2 x, we obtain f (x) = (1 + cos x)−1 and this
leads to an answer with a relative error close to the machine epsilon for x = 10−8 .
XN
(−1)i
Example: The direct computation of for large values of M, N leads to disastrous cancellation.
i
i=M
The following rearrangement
XN XN
1 1
−
i i
i=M, i even i=M, i odd
1.2
0.8
log(1+x)/x
0.6
0.4
0.2
0
−1 −0.5 0 0.5 1
x x 10
−15
log(1 + x)
Figure 1.1: Evaluation of at x = −10−15 + j10−16 , j = 0, . . . , 20 by the naive way
x
(‘x’) and the better way (‘o’).
The following method, known as compensated summation, tries to capture the rounding error
of each sum and adds that to the next sum.
s = 0; e = 0;
for i = 1 : n
t = s; y = xi + e;
s = t + y;
e = (t − s) + y;
end
We shall apply this method in a numerical solution of an ODE with dramatic improvement.
Example: This example illustrates the effects of cancellation errors graphically. Consider evaluation of
log(1 + x)
f (x) = in a neighbourhood of x = 0. By the L’Hopital’s rule, f (0) = 1. The naive
x
way to evaluate suffers from serious cancellation error near x = 0. A better way is the following:
(
log(1⊕x)
(1⊕x)⊖1 , 1 ⊕ x 6= 1;
1, 1 ⊕ x = 1.
Let us examine more carefully errors in floating point addition (subtract) and multiplication
(division). Let x1 , x2 be non–zero real numbers and x̃1 , x̃2 be the corresponding floating point
representations. Recall that
x̃i = (1 + ǫi )xi , |ǫi | ≤ ǫM
where ǫM is the unit roundoff. Because of roundoff,
In the above, we have ignored all terms which contain a product of two epsilons which are
necessarily small. Hence E, the relative error of addition has been expressed as the relative error
of x1 , which is (ǫ1 + ǫ3 ) times the amplification factor x1 (x1 + x2 )−1 plus a similar term for x2 .
Now if the amplification factor is large, then E is large. This happens when we are subtracting
two nearly equal numbers, that is, x2 ≈ −x1 and so the amplification factor |x1 (x1 + x2 )−1 | ≫ 1.
This is of course cancellation error which we have discussed above. If x1 and x2 have the same
sign, then the magnitude of the amplification factor is at most one and so the relative error of
floating point addition will be near ǫM .
Now we carry out the estimation of the relative error of floating point multiplication. Using
the same notation as above,
x̃1 ⊗ x̃2 − x1 x2
≈ |ǫ1 + ǫ2 + ǫ3 | ≤ 3ǫM
x1 x2
This analysis suggests that when adding many positive numbers, it is best to add starting with
the smallest numbers first.
We have been discussing roundoff errors. A second source of error is called truncation
error. This is the error where an answer is the result of applying infinite number of steps and
we approximate this answer by doing finitely many steps.
X∞ XN
1 1
Example: Consider computing 2
by calculating . The truncation error is
n=1
n n=1
n2
∞
X Z ∞
1 dx 1
≤ = .
n2 N x 2 N
n=N +1
8 CHAPTER 1. FLOATING POINT NUMBERS
Z .1
Example: Consider estimating ex dx by
0
Z .1
x2 xN
1+x+ + ··· + dx.
0 2 N!
(Assume we only know how to integrate polynomials and not exponentials analytically.) Then
the truncation error is
Z .1 X
∞ X∞
xj .1j+1 10−N −2
dx = ≤ .
0 j! (j + 1)! .9 (N + 2)!
j=N +1 j=N +1
Another source of errors which requires no further explanation is errors in computer program-
ming. All these errors can render the result of a numerical method to be completely wrong.
Great care must be exercised when using a computer to solve a problem.
We list some actual disasters caused due to these errors, with losses of many lives and billions
of dollars. During the Gulf War in 1991, a Patriot missile failed to intercept a Scud missile,
resulting in the loss of 28 lives. The source of the problem was an accumulation of round off
errors. In 1996, the Ariane 5 rocket went off course and exploded. The problem was traced to
the failure of the conversion of a 64-bit floating point number to a 16-bit integer. (The floating
point number was larger than 32,768 which was the largest 16-bit (signed) integer possible. This
caused an overflow error.) In 1999, NASA lost contact with the Mars Climate Orbiter. The
reason was that the program used a mixture of miles and meters. As we can see, the answer
spits out by a computer need not be correct! Great care must be exercised in writing codes and
interpreting results.
ỹ ≈ y + f ′ (x)dx
and so
ỹ − y f ′ (x)dx dx
≈ =C
y f (x) x
xf ′ (x)
where |C| = is called the condition number. A large condition number indicates
f (x)
that the problem is ill-conditioned.
√
Example: Let y = x. Then the condition number is 1/2 and so the problem is well-conditioned.
Hence
n
!1/2 n
!1/2
ỹ − y X dx2 X
. |C| i
, |C| = Ci2 ,
y
i=1
x2i i=1
are obviously 1, 2, · · · , 20. Suppose the coefficient 210 is changed to 210 + ǫ where ǫ = 10−7 .
Naively, it is expected that the roots will perturb by a small amount. However this is not the
case. For instance, the perturbed polynomial has two roots at approximately 16.7 ± 2.8i which
are quite far from any integer.
Call the perturbed polynomial p(x, ǫ). We determine how sensitive the root x = x(ǫ) is to the
perturbation ǫ. By the chain rule,
∂p ∂x ∂p
+ =0
∂x ∂ǫ ∂ǫ
and so
∂p
∂x ∂ǫ x19
= − ∂p = P20 Q20 .
∂ǫ j=1 i=1, i6=j (x − i)
∂x
∂x j 19
= Q20 .
∂ǫ x=j i=1, i6=j (j − i)
Example: Consider the problem of finding the intersection of two straight lines. This can be posed as a
system of two equations in two unknowns: Ax = b. In case the lines are nearly parallel, then
the rows of A are nearly linearly dependent. Thus a small perturbation of A or b can lead to
large changes in the solution. This is an ill-conditioned problem. As a specific example, let ǫ be
a small positive real and
1 1
A= , b= 0 1 .
1 1+ǫ
The solution of the system Ax = b is x = ǫ−1 [−1, 1]T . Consequently, a small change in the value
of ǫ leads to a huge change in the solution.
More generally, consider any non-singular A ∈ Rn×n and b, b̃ ∈ Rn with b non-zero. Define
x = A−1 b and x̃ = A−1 b̃. Note that |b| = |Ax| ≤ kAk |x|. Using this inequality,
Example: For any differentiable function on [a, b], define D(f ) = f ′ and kf k∞ = maxx∈[a,b] |f (x)|. Suppose
f, f˜ are distinct differentiable on [a, b], with f non-constant. Then
where
kf ′ − f˜′ k∞ kf k∞
κ=
kf − f˜k∞ kf ′ k∞
is the condition number which can be arbitrarily large. For instance, take f (x) = sin x and
f˜(x) = cos kx for k large. Then κ = O(k).
1.6. NUMERICAL STABILITY 11
Of course, f (x) = f˜(x) in exact arithmetic. However, in the presence of roundoff errors, they
behave differently.
Chapter 2
Nonlinear Equations
Given f : R → R. The goal is to find a root x so that f (x) = 0. This is a difficult problem
because usually, we do not know how many solutions are there. There might be none, one,
two, or infinitely many solutions. A solution is found by an iterative method which in general
converges to the root after infinitely many steps (assuming that it converges). Bisection is a
slowly converging method but it is guaranteed to find a root. Newton’s method is a rapidly
converging method when it converges but if the initial guess is not good enough, the sequence
may diverge.
2.1 Bisection
Recall the intermediate value theorem which states that for a continuous function f defined on
[a, b] satisfying f (a)f (b) < 0, then there is some x∗ ∈ (a, b) so that f (x∗ ) = 0. The theorem says
that there exists at least one root in (a, b) provided that f (a)f (b) < 0. The method of bisection
finds one such root. At each iteration, a root is bounded between an interval whose length
decreases by one half with each iteration while maintaining the property that the function has
different signs at the end points of the interval.
BISECTION: Given [a, b] with f (a)f (b) < 0 and tolerance ǫ > 0.
WHILE (b − a)/2 > ǫ
c = (a + b)/2
If f (c) = 0 then STOP
If f (a)f (c) < 0 THEN b ← c ELSE a ← c
END
RETURN c = (a + b)/2
Theorem: Suppose f is a continuous function on [a, b] with f (a)f (b) < 0. Let cn denote the estimate of
the root given by the bisection algorithm after n times of the loop. Then lim cn = x∗ where
n→∞
x∗ ∈ (a, b) is a root f . Also |cn+1 − x∗ | ≤ 2−1 (bn − an ) = 2−n−1 (b − a).
Proof: Denote the successive intervals generated by the algorithm by [an , bn ] with a0 = a, b0 = b. For
each positive n,
a0 ≤ an ≤ an+1 ≤ bn+1 ≤ bn ≤ b0
12
2.2. FIXED POINT ITERATION (FPI) 13
lim an = x∗ = lim bn .
n→∞ n→∞
If the algorithm terminates in finitely many steps, then we have f (c) = 0 for some c ∈ (a, b).
Otherwise, since f (an )f (bn ) < 0, take the limit to get f (x∗ )2 ≤ 0 which implies that f (x∗ ) = 0.
Since cn+1 = 2−1 (an + bn ),
bn − a n b−a
|cn+1 − x∗ | ≤ = n+1 .
2 2
Example: Let f (x) = x3 + x − 1. Since f (0) = −1, f (1) = 1, there is a root in (0, 1) by the intermediate
value theorem. Take a0 = 0, b0 = 1. Then c1 = 1/2 with f (c1 ) < 0. Thus a1 = 1/2, b1 = 1 and
so c2 = 3/4 with f (c2 ) > 0. Next a2 = 1/2, b2 = 3/4 and so c3 = 5/8 with f (c3 ) < 0. Finally,
a3 = 5/8, b3 = 3/4 and so c4 = 11/16. If ǫ = 1/10, then we can stop here since the difference
between c4 and a root is now smaller than 1/16.
In this method, it is easy to determine how many iterations to satisfy a given tolerance. For
instance, suppose we wish an absolute error of no bigger than 10−4 . Then we need n so that
4
10−4 ≈ 2−n−1 or n ≈ − 1 ≈ 13.
log10 2
The method of bisection converges slowly but it is guaranteed to converge. We now examine
other methods which may not converge but may converge quicker if the iteration converges.
Example: Let f (x) = x3 + x − 1 again. Consider three different FPI all of whose fixed points coincide with
the roots of f
1. x = g1 (x) = 1 − x3
2. x = g2 (x) = (1 − x)1/3
1 + 2x3
3. x = g3 (x) =
1 + 3x2
14 CHAPTER 2. NONLINEAR EQUATIONS
Observe that that latter can be obtained from f (x) = 0 as follows. From x3 + x = 1, add 2x3 to
both sides to obtain x(3x2 + 1) = 1 + 2x3 . Now divide by 3x2 + 1 on both sides.
Suppose we start with x0 = .5. Then for the first iteration function g1 , we obtain x1 = 1 − .53 =
.875, · · · , x9 ≈ 1, x10 ≈ 0, x11 ≈ 1, x12 ≈ 0, · · · . Thus the sequence oscillates between 0 and 1
and does not converge. This is the major drawback of FPI: its iteration may not converge.
For the second iteration function g2 , we have x1 = (1 − .5)1/3 ≈ .7937, · · · , x25 ≈ .6823, · · · . This
iteration converges, but rather slowly.
1 + 2 .53
For the final function g3 , the iteration converges very rapidly: x1 = ≈ .7142, x2 ≈
1 + 3 .52
.6831, x3 ≈ .6823. Hence in three iterations, this sequence has reached the point which took
the above function 25 iterations.
We shall try to understand this discrepancy among the three iteration functions.
The following is a well-known result in analysis known as the Contraction Mapping Principle:
Theorem: Suppose g : R → R satisfies |g(x) − g(y)| ≤ c |x − y| for all real x, y and some constant c ∈ [0, 1).
Then g has a unique fixed point.
The function g above is called a contraction mapping. The proof of this result bears a strong
resemblance to a slightly different theory which now follows.
Definition: Let en = xn − x∗ where x∗ is a fixed point and xn is the nth iterate of a FPI. Suppose for some
positive S < 1,
en+1
lim = S,
n→∞ en
Theorem: Suppose g is a continuously differentiable function with fixed point x∗ . Define S = |g ′ (x∗ )|. If
S < 1, then for any x0 sufficiently close to x∗ , the FPI converges linearly to x∗ at rate S. If
S > 1 and x0 6= x∗ , then the FPI diverges.
where cn is some number between x∗ and xn . The mean value theorem guarantees the existence
of cn . If S < 1, then there is some open interval B containing x∗ so that
S+1
|g′ (x)| ≤ < 1, ∀x ∈ B.
2
For any x0 ∈ B (this is the meaning of sufficiently close to x∗ in the statement of the theorem),
S+1
|g′ (c0 )| ≤ (S + 1)/2 and so by (2.1), |e1 | ≤ |e0 |. By induction, for every n ≥ 0,
2
S+1
|en+1 | ≤ |en |
2
and so lim en = 0. By (2.1),
n→∞
en+1
lim = lim |g′ (cn )| = |g ′ (x∗ )| = S.
n→∞ en n→∞
2.2. FIXED POINT ITERATION (FPI) 15
S+1
|g′ (x)| ≥ > 1, ∀x ∈ B̃.
2
S+1
|en+1 | ≥ |en |
2
and xN +1 6∈ B̃. This means that the sequence does not converge.
This theorem says that the rate of convergence of a FPI depends on the value of g′ (x∗ ) where
x∗ is a fixed point. The smaller the value of |g ′ (x∗ )| is, the faster the FPI converges.
The case |g′ (x∗ )| = 1 is indeterminant. Consider g± (x) = x ± x3 which has fixed point
x∗ = 0. Note that g± ′ (x) = 1 ± 3x2 and so g (0) = 1. The fixed point iteration for g diverges
± −
and the one for g+ converges.
Example: Consider f (x) = x3 + x − 1 again with the three different FPIs and fixed point x∗ ≈ .6823.
1. x = g1 (x) = 1 − x3 . Here g1′ (x) = −3x2 and so g1′ (.6823) ≈ 1.3966 > 1 and thus the
iteration diverges.
2. x = g2 (x) = (1 − x)1/3 . Here g2′ (x) = −3−1 (1 − x)−2/3 and so |g2′ (.6823)| ≈ .716 which
implies that the iteration converges.
1 + 2x3 3
′ (x) = 6x(x + x − 1) and so g ′ (x ) = 0. Thus the FPI
3. x = g3 (x) = . Here g3 3 ∗
1 + 3x2 (1 + 3x2 )2
converges linearly at rate 0, the best possible rate!
Example: On your calculator, type in any number x0 and then repeatedly press the cos key. This cor-
responds to a FPI with g(x) = cos x with fixed point x∗ ≈ .7390. Since g ′ (x) = − sin x, the
iteration converges linearly at rate |g′ (.7390)| ≈ .67.
Because |g′ (x)| < 1 unless x = π/2+ 2kπ for any integer k, we know that this FPI must converge
if x∗ is not equal any of the above special values.
Unlike the method of bisection, we cannot predict in advance how many iterations it takes a
FPI to satisfy a given tolerance (assuming that the iteration converges). Typically we stop the
iteration if
|xn+1 − xn |
|xn+1 − xn | or
|xn |
is sufficiently small. This can be supplemented with |f (xn+1 )| sufficiently small.
We emphasize that even if |g ′ (x∗ )| < 1 at a fixed point x∗ , the iteration may not converge if
the initial iterate is not sufficiently close to x∗ . FPI is said to converge locally. The ideal method
is one which converges globally, that is, converge for arbitrary x0 . For instance, if g(x) = x/2,
then g ′ (x) = 1/2 and so it converges globally to the fixed point zero.
x
Example: Let g(x) = − x3 which has a unique fixed point x∗ = 0. Note g′ (0) = 1/2. The fixed point
2
iteration xn+1 = g(xn ) only converges locally. For instance, if x0 = 10, then x1 = −995, x2 ≈
109 . It is clearly divergent.
16 CHAPTER 2. NONLINEAR EQUATIONS
x*
x
xn+1 xn
Example: Let f (x) = x3 + x − 1. We have already seen the function several times. With f ′ (x) = 3x2 + 1,
the Newton iteration is
x3 + xn − 1 2x3 + 1
xn+1 = xn − n 2 = n2 .
3xn + 1 3xn + 1
One root is x∗ ≈ .6823. Take x0 = .5. Define en = xn − x∗ . Then |e0 | ≈ 1.8 × 10−1 , |e1 | ≈
3.2 × 10−2 , |e2 | ≈ 8.5 × 10−4 , |e3 | ≈ 6.2 × 10−7 , |e4 | ≈ 3.3 × 10−13 . Thus |en+1 | ≈ c e2n for some
constant c.
This example shows that when Newton’s method converges, it converges very quickly indeed.
x2n − 2 xn 1
xn+1 = xn − = + .
2xn 2 xn
√ √
If x0 > 0, the iterates will converge to 2 while if x0 < 0, they will converge to − 2. If x0 = 0,
then x1 is not well defined since f ′ (x0 ) = 0.
2.3. NEWTON’S METHOD 17
en+1
Definition: Suppose xn → x∗ . Let en = xn − x∗ . If lim < ∞, then {xn } is said to converge
n→∞ e2n
quadratically.
Theorem: Taylor’s Theorem. Let x and x0 be real numbers and f be k + 1 times continuously differentiable
on the interval between x and x0 . Then there exists some c in between x and x0 so that
Theorem: Suppose f is twice continuously differentiable with f (x∗ ) = 0 and f ′ (x∗ ) 6= 0. Then Newton’s
method is locally quadratically convergent to x∗ .
Hence g′ (x∗ ) = 0 and so Newton’s iteration is locally linearly convergent at rate 0. By Taylor’s
theorem, there is some cn between x∗ and xn so that
f ′′ (cn )
0 = f (x∗ ) = f (xn ) + f ′ (xn )(x∗ − xn ) + (x∗ − xn )2
2
and so
f (xn ) f ′′ (cn )
− = x ∗ − x n + (x∗ − xn )2 .
f ′ (xn ) 2f ′ (xn )
Now
f (xn ) f ′′ (cn ) 2
en+1 = xn+1 − x∗ = xn − − x ∗ = en − en + e .
f ′ (xn ) 2f ′ (xn ) n
Since xn → x∗ , we must have cn → x∗ . Thus
en+1 f ′′ (x∗ )
lim =
n→∞ e2
n 2f ′ (x∗ )
en+1
In this theorem, if f ′′ (x∗ ) = 0 as well, then the method converges locally cubically: lim <
n→∞ e3n
∞. If, however, f ′ (x∗ ) = 0, then quadratic convergence is lost.
Example: Let f (x) = x2 . Note that f (0) = 0 = f ′ (0) and f ′′ (0) 6= 0. Newton’s iteration here is
x2n 1
xn+1 = xn − = xn = g(xn ).
2xn 2
The convergence rate is linear at rate 1/2 = g ′ (0) and not quadratic.
Theorem: Suppose f is three times continuously differentiable. If f (x∗ ) = 0 = f ′ (x∗ ) and f ′′ (x∗ ) 6= 0,
then Newton’s method is locally linearly convergent to x∗ at rate 1/2. The modified Newton’s
method
2f (xn )
xn+1 = xn − ′
f (xn )
converges locally quadratically to x∗ .
Proof: Let f (x) = (x − x∗ )2 h(x) for some twice continuously differentiable function h with h(x∗ ) 6= 0.
The fixed point function corresponding to Newton’s method is
f (x) (x − x∗ )h(x)
g(x) = x − =x− .
f ′ (x) 2h(x) + (x − x∗ )h′ (x)
By a direct calculation, g′ (x∗ ) = 1/2 and so Newton’s method converges locally linearly at rate
1/2.
Now for the modified method,
2(xn − x∗ )h(xn )
xn+1 = g2 (xn ) = xn − .
2h(xn ) + (xn − x∗ )h′ (xn )
It can be checked that g2′ (x∗ ) = 0 and so this iteration locally converges linearly at rate 0. From
(2.1),
Again, Newton’s method is locally convergent meaning that the initial iterate x0 must be
sufficiently close to the root for convergence. If x0 is far from the root, then the iterates may
diverge. In fact, they may not even be well defined.
−x4n + 3x2n + 2
xn+1 = xn − .
−4x3n + 6xn
Example: Let f (x) = xe−x . From the graph of this function (Figure 2.2), we see that Newton’s method
converges to the root 0 for every x0 < 1 while it diverges for every x0 > 1. The method is
undefined if x0 = 1 since f ′ (1) = 0.
2.4. SECANT METHOD 19
0.4
0.2
−0.2
−0.4
−0.6
−0.8
−1
−1 0 1 2 3 4 5
Example: Let f (x) = x3 + x − 1. With x0 = .5, estimate the number of Newton iteration it takes to
approximate the root x∗ ≈ .6823 that is correct to 10−6 .
Recall that
en+1 f ′′ (x∗ )
lim = ≈ .4656.
n→∞ e2
n 2f ′ (x∗ )
Then e0 ≈ 1.8 × 10−1 , e1 ≈ .4656e20 ≈ 1.5 × 10−2 , e2 ≈ .4656e21 ≈ 1.1 × 10−4 , e3 ≈ .4656e22 ≈
5.2 × 10−9 . Hence three Newton iterations are sufficient. In practice, such estimates are difficult
to obtain since the exact solution is unknown. Furthermore, the estimate en+1 = C e2n may be
completely incorrect when the iterate is far from the solution. If the initial iterate is far away,
then Newton iteration may not converge or take many iterations before it reaches the region
where quadratic convergence holds.
f (xn ) − f (xn−1 )
f ′ (xn ) ≈ .
xn − xn−1
Theorem: Suppose f ∈ C 3 (R) and there is some x∗ ∈ R so that f (x∗ ) = 0 and f ′ (x∗ ) 6= 0. There is some
δ > 0 so that if |e0 |, |e1 | ≤ δ, then
√
en+1 f ′′ (x∗ ) α−1 1+ 5
lim = , α= ≈ 1.62.
n→∞ eα n 2f ′ (x∗ ) 2
Hence the secant method converges locally linearly at rate 0 (faster than linear convergence)
but not quite as quickly as quadratic convergence, in terms of the number of iterations. Note
however that in each iteration, Newton’s method requires two function evaluations (f (xn ) and
f ′ (xn )), while in the secant method, only one function evaluation (f (xn )) is required. In real–life
examples where each function evaluation is very expensive (for instance, requires the solution
of a differential equation), then the number of function evaluations is a better indication of the
time complexity of the algorithm.
Example: Let f (x) = x3 + x − 1. Then the secant iterates are
(x3n + xn − 1)(xn − xn−1 )
xn+1 = xn − .
(x3n + xn − 1) − (x3n−1 + xn−1 − 1)
With x0 = 0, x1 = .5, we calculate x2 = .8, · · · , x6 ≈ .6823. Thus the secant method is slower
than Newton’s method but it is definitely faster than linearly converging methods.
Example: Suppose A(x) ∈ Rm×m for each x. Let Λ(x) denote the set of eigenvalues of A(x). Define
f (x) = min Re λ where Re z denotes the real part of a complex number z. We wish to find a
λ∈Λ(x)
zero of f . This problem comes up in determining stability of a steady solution of a differential
equation. Here it is exceedingly difficult to calculate the derivative of f . In fact, there may be
points where the derivative does not exist. Here secant method is a more appropriate method
to numerically find the root of f than Newton’s method.
We now give a proof of convergence of the secant method provided that the initial two
iterates are sufficiently close to a root x∗ of f . The secant iteration is
f (xn )(xn − xn−1 ) f (xn )xn−1 − f (xn−1 )xn
xn+1 = xn − = := g(xn , xn−1 ).
f (xn ) − f (xn−1 ) f (xn ) − f (xn−1 )
Note that
f (u)
lim g(u, v) = u − ′ .
v→u f (u)
Thus the secant method is identical to Newton’s method in case xn−1 = xn . It is simple to
check that g satisfies the following equalities:
g(u, x∗ ) = 0 = g(x∗ , v), gu (u, x∗ ) = guu (u, x∗ ) = 0 = gv (x∗ , v) = gvv (x∗ , v)
for all u, v. Using Taylor’s expansion twice, there are some θ, γ, µ ∈ (0, 1) so that
g(x∗ + ξ, x∗ + η) = g(x∗ , x∗ ) + gu (x∗ , x∗ )ξ + gv (x∗ , x∗ )η
1
+ guu (x∗ + θξ, x∗ + θη)ξ 2 + 2guv (x∗ + θξ, x∗ + θη)ξη + gvv (x∗ + θξ, x∗ + θη)η 2
2
1
= x∗ + guu (x∗ + θξ, x∗ )ξ 2 + guuv (x∗ + θξ, x∗ + γθη)θηξ 2 + 2guv (x∗ + θξ, x∗ + θη)ξη
2
+gvv (x∗ , x∗ + θη)η 2 + guvv (x∗ + µθξ, x∗ + θη)θξη 2
1
= x∗ + guuv (x∗ + θξ, x∗ + γθη)θηξ 2 + 2guv (x∗ + θξ, x∗ + θη)ξη
2
+guvv (x∗ + µθξ, x∗ + θη)θξη 2 .
2.4. SECANT METHOD 21
e2 = g(x∗ + e1 , x∗ + e0 ) − x∗
e1 e0
= guuv (x∗ + θe1 , x∗ + γθe0 )θe1 + 2guv (x∗ + θe1 , x∗ + θe0 ) + guvv (x∗ + µθe1 , x∗ + θe0 )θe0
2
e1 e0
:= h(e1 , e0 ).
2
Since h(0, 0) = 2guv (x∗ , x∗ ), there is some δ > 0 so that |vh(u, v)| ≤ ǫ < 1 whenever |u|, |v| ≤ δ.
In particular, if |e0 |, |e1 | ≤ δ, it follows that |e2 | ≤ ǫ |e1 | ≤ δ. By induction, |en | ≤ ǫ|en−1 | ≤
ǫn−1 |e1 | → 0 as n → ∞. This establishes convergence of the secant method.
Now
f (xn )xn−1 − f (xn−1 )xn − x∗ (f (xn ) − f (xn−1 ))
en+1 =
f (xn ) − f (xn−1 )
f (xn )en−1 − f (xn−1 )en
=
f (xn ) − f (xn−1 )
f (xn ) f (xn−1 )
xn − xn−1 en − en−1
= en en−1
f (xn ) − f (xn−1 ) xn − xn−1
1 f ′′ (x∗ )
→ en en−1 .
f ′ (x∗ ) 2
Here, we use the fact that for x0 sufficiently close to x∗ , then xn → x. The following fact has
also been used:
f (xn ) f (xn−1 ) f ′′ (x∗ )
lim − = lim (xn − xn−1 ). (2.2)
n→∞ en en−1 2 n→∞
To see this, note by Taylor’s theorem that there is some cn in between xn and x∗ so that
f ′′ (cn ) 2 f ′′ (cn ) 2
f (xn ) = f (x∗ + en ) = f (x∗ ) + f ′ (x∗ )en + en = f ′ (x∗ )en + en
2 2
and so
f (xn ) f ′′ (cn )
= f ′ (x∗ ) + en .
en 2
Subtract this equation with n replaced by n − 1 from the above equation and then take the limit
as n → ∞ to obtain
f (xn ) f (xn−1 ) f ′′ (x∗ ) f ′′ (x∗ )
lim − = lim (en − en−1 ) = lim (xn − xn−1 )
n→∞ en en−1 2 n→∞ 2 n→∞
which is (2.2).
f ′′ (x∗ )
Let C = and yn = − ln(Cen ). Note that yn → ∞ as n → ∞. Recall that in the
2f ′ (x∗ )
limit of large n, en+1 = Cen en−1 . In terms of yn this relation becomes
yn+1 = yn + yn−1 , n → ∞.
The solution of this recurrence relation can be found by the substitution yn = αn , from √ which
1± 5
the equation α2 − α − 1 = 0 follows. The roots of this quadratic equation are α± = . The
2
general solution of the recurrence relation is yn = c1 αn+ + c2 αn− for some constants c1 , c2 . Since
22 CHAPTER 2. NONLINEAR EQUATIONS
yn → ∞ and |α− | < 1, we can make the approximation yn ≈ c1 αn+ with c1 6= 0. To simplify the
n
notation, take α = α+ . Now en = C −1 e−yn ≈ C −1 e−c1 α . In the limit of large n,
n+1
en+1 C −1 e−c1 α α−1
≈ n+1 = C .
eαn −α
C e −c 1 α
This completes the demonstration of the convergence rate of the secant method.
Other Methods
There are other methods to find zeroes of f with higher order of convergence provided f is
sufficiently smooth. Below we assume that f ′ (x∗ ) 6= 0 for some root x∗ of f .
The first method can be shown to be cubically convergent, but it requires three function
evaluations per step. Recall that a linear model at xn is y = f (xn ) + f ′ (xn )(x − xn ). From the
Fundamental Theorem of Calculus,
Z x
f (x) = f (xn ) + f ′ (ξ) dξ.
xn
The linear model results from approximation the above integral by the area f ′ (xn )(x − xn ) of a
rectangle. A more accurate quadratic model approximates the integral by the trapazoidal rule:
f ′ (x) + f ′ (xn )
y = M (x) := f (xn ) + (x − xn ).
2
Observe that M ′′ (xn ) = f ′′ (xn ). This is in addition to M (xn ) = f (xn ) and M ′ (xn ) = f ′ (xn ).
Define the next iterate x = xn+1 so that it is a zero of the model:
2xn
xn+1 = xn − .
f ′ (x ′
n ) + f (xn+1 )
This is a nonlinear equation for xn+1 and so the iteration is not really practical. One simple idea
is to approximate xn+1 on the right-hand side by the Newton iterate: xn+1 ≈ xn − f (xn )/f ′ (xn ),
resulting in the final iteration
2xn
xn+1 = xn − .
f ′ (x n) + f′ xn − f (xn )/f ′ (xn )
and
f (xn ) 2f ′ (xn )2
xn+1 = xn − ′ .
f (xn ) 2f ′ (xn )2 − f ′′ (xn )f (xn )
A fourth-order method is
f (xn ) f (yn ) f (yn ) f (xn )
xn+1 = xn − ′ 1+ 1+2 , y n = xn − .
f (xn ) f (xn ) f (xn ) f ′ (xn )
All three methods above require three function evaluations per iteration. The final method
is a modification of the first method of this subsection; it only requires two function evaluations
2.5. SYSTEM OF NONLINEAR EQUATIONS 23
√
per step, just like Newton’s method, but its convergence order reduces to 1 + 2. Define
x̃0 = x0 , x1 = x0 − f (x0 )/f ′ (x0 ) and
f (xn ) f (xn )
x̃n = xn − , xn+1 = xn − , n ≥ 1.
f′ (xn−1 + x̃n−1 )/2 f′ (xn + x̃n )/2
All methods discussed in this subsection typically require a very good initial guess for con-
vergence.
The convergence result is similar to the scalar case. If DF (X∗ ) is nonsingular, then the iterates
converge locally quadratically:
kEn+1 k
lim =C
n→∞ kEn k2
for some constant C independent of n. Here En = Xn − X∗ and kXk denotes the length of the
vector X.
Example: Consider the intersection of the curve y = x3 and the unit circle. This can be solved by finding
the roots of the system of equations
x2 − x31 0
F (x1 , x2 ) = 2 2 = .
x1 + x2 − 1 0
We find that
−3x21 1
DF (x1 , x2 ) = .
2x1 2x2
If X0 = [1, 2]T , then
−1
1 −3 1 1 1
X1 = − = .
2 2 4 4 1
Continuing,
−1
1 −3 1 0 7/8
X2 = − = .
1 2 2 1 5/8
If X0 = [0, 0]T , then
0 1
DF (0, 0) =
0 0
which is singular and so X1 cannot be defined.
24 CHAPTER 2. NONLINEAR EQUATIONS
Example: Let
cos x1 + x21 ex2
F (x1 , x2 ) = .
x1 + x2
Then
− sin x1 + 2x1 ex2 ex2 x21
DF (x1 , x2 ) = .
1 1
The obvious advantage is that there is no need to calculate a new Jacobian at each iteration.
Also, as we shall see later, we can factor DF (X0 ) into triangular factors once in the beginning and
so all subsequent linear solves involving DF (X0 ) can be performed quickly. The drawback of the
chord method is that it is no longer quadratically convergent but only locally linearly convergent.
To see the latter, the chord method is a FPI with iteration function g(X) = X −DF (X0 )−1 F (X)
and so Dg(X) = I − DF (X0 )−1 DF (X). If S = kDg(X∗ )k < 1, then convergence is locally linear
at rate S.
The secant method also has an analogue in the system case. Recall that we use this method
if it is difficult or even impossible to obtain an analytic expression for the derivative (Jacobian).
In the scalar case, the derivative which is a number, must be estimated. In the system case,
the Jacobian is a matrix and it is not obvious how it can be approximated. One such method,
known as the Broyden’s method, is quite popular. Given two initial vectors X0 , X1 and an
initial matrix A0 , the Broyden’s iteration is
FOR n = 1, 2, 3, · · ·
δn = Xn − Xn−1
(F (Xn ) − F (Xn−1 ) − An−1 δn ) δnT
An = An−1 +
δnT δn
−1
Xn+1 = Xn − An F (Xn )
END
If F ′ (X∗ ) is nonsingular, then it can be shown that it converges locally linearly at rate 0
(faster than linear convergence) but it does not converge quadratically.
Chapter 3
Linear Systems
x + 2y − z = 3
2x + y − 2z = 3
−3x + y + z = −6.
In the basic version of GE, we write the matrix and the righthand side as
1 2 −1 3
2 1 −2 3 .
−3 1 1 −6
Let Ri denote the ith row of the above augmented matrix. The notation Rj ← aRi + Rj means
replacing row j by a times Ri plus Rj . GE performs a sequence of row operations to reduce the
augmented matrix to upper triangular form. For this example performing R2 ← −2R1 + R2 and
R3 ← 3R1 + R3 results in
1 2 −1 3
0 −3 0 −3 .
0 7 −2 3
25
26 CHAPTER 3. LINEAR SYSTEMS
Notice that the first column is zero except for the first entry. Now perform R3 ← 7/3R2 + R3
to the above augmented matrix to get
1 2 −1 3
0 −3 0 −3 .
0 0 −2 −4
Note that the new matrix is now upper triangular. The solution can easily be calculated by
back substitution. From the last equation, −2z = −4 which results in z = 2. From the second
equation, −3y + 0z = −3 and so y = 1. From the first equation
x + 2y − z = 3.
Let us calculate the complexity of GE for a n × n matrix. In the first pass, we zero out all
entries in the first column except the first one. This is accomplished by
Ri ← µi1 R1 + Ri , 2≤i≤n
for some numbers µi1 called multipliers. (In the above example, µ21 = −2, µ31 = 3.) This
takes n(n − 1) multiplications. In the second pass, we zero all entries in the second column
except the first two. This is accomplished by
Ri ← µi2 R2 + Ri , 3 ≤ i ≤ n.
(In the above example, µ32 = 7/3.) This takes (n − 1)(n − 2) multiplications. Continuing to the
final (n − 1)st pass, we zero out the (n, n − 1) entry. This takes 2 · 1 multiplications. Hence the
total number of multiplications is
n−1
X n−1
X n−1
X
2 (n − 1)n(2n − 1) n(n − 1) n3
(j + 1)j = j + j= + =O .
6 2 3
j=1 j=1 j=1
Here O(f (n)) = g(n) means that there is some constant C independent of n so that
f (n)
lim ≤ C.
n→∞ g(n)
We have dropped the terms of n and n2 since they are insignificant when compared to the term n3
for large n. Note that we have ignored additions in the above accounting. This is because there
are the same number of additions as multiplications. Actually in modern computer architecture,
the operation Ri ← aRi + Rj takes about the same amount of CPU time as adding two vectors
which is significantly faster than n times the time it takes to add two numbers.
The complexity of back substitution is easier to estimate. Starting from the last equation,
we solve for xn in one operation. From the second last equation, xn−1 can be solved in two
operations. Continuing, the solution of x1 requires n operations. The total number then is
n(n + 1) n2
1 + 2 + ··· + n = =O .
2 2
3.2. GAUSSIAN ELIMINATION WITH PARTIAL PIVOTING 27
The numbers µij are multipliers which are generated in the GE process. The upper triangular
matrix U is the same as the one at the end of the GE process.
because the (1, 1) entry, called the pivot, is zero. Even if an LU factorization exists mathemati-
cally, its calculation may not be stable numerically.
ǫx1 + x2 = 1 (3.1)
x1 + 2x2 = 4
after machine roundoff. The solution of this approximate system is x2 = 1 and x1 = 0 which is
quite different from the exact solution.
The reason for the poor approximation in the above example is that the multiplier is huge
(ǫ−1 ). The fix is to permute the rows so that all multipliers are bounded above by one. This is
GE with partial pivoting.
28 CHAPTER 3. LINEAR SYSTEMS
Recall that GE fails at the first step. Suppose we interchange the two rows:
1 1 3
.
0 1 2
Clearly the solution remains the same as the original system. It is fortunate that the matrix
is already upper triangular and so the system can be solved by back substitution, obtaining
x2 = 2, x1 = 1. Notice that after this row interchange, GE is now well defined.
The latter approximate system has the solution x2 = 1, x1 = 2 which is a good approximation
to the exact solution. Notice that the multiplier of the system after the permutation is ǫ and
there are no entries in the new matrix of magnitude ǫ−1 like before.
out one step of GE to zero out the third through nth entry of column two. The multipliers are
ci2
µi2 = , i ≥ 3 and they satisfy |µi2 | ≤ 1. The resultant matrix is
cq2
ap1 ap2 · · · · · · apn bp
0 c22 c23 · · · c2n d2
0 0 e · · · e f
33 3n 3
.. .. .. .. ..
. . . . .
0 0 en3 · · · enn fn
for some numbers eij , fk . Next find er3 which has the largest magnitude among the entries ei3
and interchange columns 3 and r if r 6= 3. Continue until the matrix is upper triangular. If the
initial matrix is non-singular, this procedure cannot fail. In the example below Ri ↔ Rj denotes
interchanging rows i and j.
Partial pivoting has eliminated the problem of zero pivots and numerical instability.
The interchange of two rows of a matrix can be performed by left multiplication of a matrix
which is sometimes called a transposition.
T A = Ã
1 R1 R1
0
1 R2
R5
1 R3 = R3
1 R4 R4
1 0 R5 R2
It turns out that GE with partial pivoting applied to a matrix A is equivalent to the math-
ematical representation P A = LU where P is a permutation matrix (equal to the product of
the transpositions in GE), L, U are unit lower and upper triangular matrices. For the previous
example of GE with partial pivoting, ignoring the righthand side,
1 2 −1 1 1 −3 1 1
A = 2 1 −2 , P = 1 , L = − 1 1 , U =
3
7
3 −3 .
2
−3 1 1 1 − 23 75 1 − 67
Note also that the order of the multipliers in the first column of L has been switched because
of the last permutation R2 ↔ R3 .
Instead of calculating the permutation matrix P at the end, we can also update it as the
calculation proceeds. Start with the identity matrix. For the first transposition, apply it to I
to get P1 . For the next transposition, apply it to P1 to get P2 , etc. The matrix at the end of
this process is the permutation P .
In practice, the multipliers are stored immediately after they have been computed in the
strictly lower triangular part of A. Subsequent permutations can permute these multipliers as
well. Entries of U are stored at the corresponding entries in the upper triangular part of A. At
the end of the elimination process, L and U can be read off and no extra storage is needed.
We now redo this example where we perform the row interchanges of the multipliers and keep
track of the permutations at the same time that we perform GE. Append a permutation counter
to the right of the matrix starting with the initial order 1, 2, 3, 4. The multipliers are displayed
in their natural positions in the lower triangular part of the matrix in boldface.
−1 1 0 −3 1 3 0 1 2 4 3 0 1 2 4
1 0 3 1 2 1 0 3 1 2 GE
1 8 1
2
R=⇒
1 ↔R4 =⇒ −3 0 3 3 R2 ↔R3
0 1 −1 −1 3 0 1 −1 −1 3 0 1 −1 −1 3 =⇒
3 0 1 2 4 −1 1 0 −3 1 1
3 1 13 − 37 1
3 0 1 2 4 3 0 1 2 4 3 0 1 2 4
0 1 −1 −1
3 GE 0 1 −1 −1
3 GE 0 1 −1 −1 3
1 =⇒ 1 =⇒ 1 .
− 0 38 1
2 −3 0 8 1
2 −3 0 8 1
2
3 3 3 3 3 3
1
3 1 3 − 37
1
1 1
3 −1 3 − 34
4
1 1
3 −1 − 2 − 23
1
1
The strict lower triangular entries of L are the negative of the strict lower triangular entries
of the above matrix (in boldface) while the entries of U are the upper triangular entries of the
above matrix. The final permutation matrix P are determined from the rightmost column of
the above matrix.
2 1 1 0
4 3 3 1
Example: Apply GE with partial pivoting to 8 7 9 5.
6 7 9 8
First permute rows 1 and 3:
8 7 9 5 8 7 9 5 8 7 9 5
4 3 3 1 GE 0 − 1 − 23 − 32 R2 ↔R4 0 7 9 17
GE
=⇒ 2 =⇒ 4 4 4 =⇒
2 1 1 0 0 − 3 − 45 − 54 0 − 3 − 45 − 45
4 4
6 7 9 8 0 47 9
4
17
4 0 − 12 − 23 − 23
8 7 9 5 8 7 9 5 8 7 9 5
0 7 9 17 0 7 9 17 7
GE 0 4
9 17
4 4 4 R=⇒
3 ↔R4 4 4 4 =⇒ 4 4 = U.
0 0 − 2 4 0 0 − 6 − 72 0 0 − 6 − 72
7 7 7 7
0 0 − 67 − 72 0 0 − 27 4
7 0 0 0 2
3
We have P A = LU where
1 1
1 3 1
P = T3 T2 T1 =
,
L=
1
4 .
1 2 − 72 1
1
1 4 − 73 1
3 1
32 CHAPTER 3. LINEAR SYSTEMS
The wrong way to solve the system Ax = b is to calculate A−1 followed by the multiplication
by b. This is a very common mistake. The reason is that to calculate A−1 , we must solve
the systems Azi = Ii , i = 1, · · · , n where Ii is the ith column of the identity matrix. The
columns of A−1 are given by the vectors zi . The amount of work required is O(n3 /3) for one
factorization plus the cost of n back substitutions to solve for the vectors zi . The latter costs
n O(n2 ) = O(n3 ). Hence the total cost is O(4n3 /3) which is four times more expensive than the
method in the above paragraph. In addition, it takes twice as much storage (store A and A−1 )
and it is also less accurate because of extra operations.
GE with partial pivoting is a numerically stable algorithm in practice although there are
academic examples where roundoff errors increase exponentially quickly as a function of the
number of unknowns in the system. Such examples, fortunately, rarely come up in practice.
There is a more stable version known as GE with full pivoting. Here at the jth pass of the
process, we choose as the pivot the entry in the (n − j + 1) × (n − j + 1) submatrix starting with
jth row and the jth column of the largest magnitude. In this case, exponential growth of entries
of U cannot happen. Full pivoting is rarely used in practice because of the extra complexity
involved in searching for the pivot and that partial pivoting is quite adequate.
n
!1/2
X
|y|2 = yi2 ,
i=1
√
For instance, if y = [−1, 2, −3], then |y|2 = 14, |y|∞ = 3.
There other many other ways to measure the length of a vector. In general, we define a
vector norm ν : Rn → R which satisfies the three conditions (for any x ∈ Rn ):
Let x̃ = [1, 1]T . By√a direction calculation, the forward error in the infinity norm is 1 while the
backward error is 10.
Let x̃ = [−1, 3.0001]T . By a direction calculation, the forward error in the infinity norm is 2.0001
while the backward error is .0001.
In practice, the exact solution x is unknown while x̃, usually the result of a numerical calculation
is known. Hence the forward error is not computable while the backward error is. In practice, we
gauge the accuracy of an approximate solution by its backward error. In the last example, the
backward error is .0001 which is apparently very small (assume we work with a 4–digit mantissa).
However the actual (forward) error is about 2 which is unacceptably large. This is an example
of an ill–conditioned linear system where it is possible for an approximate solution to have a
small backward error but large forward error. The solution of this system corresponds to the
point of intersection of two straight lines. What makes the system ill–conditioned is that the
straight lines are nearly parallel. In the example before that, the two straight lines are not nearly
parallel and so the forward and backward errors have approximately the same magnitudes.
The “size” of a matrix can be measured by matrix norms. Given a vector norm | · | and
A ∈ Rn×n . The matrix norm k · k induced by | · | is defined by
|Ax|
kAk = max .
x6=0 |x|
The matrix norm satisfies the three properties of a vector norm, plus the following
Proof: By the definition of the two–norm, we need to maximize f (x) = |Ax|22 subject to the constraint
g(x) = xT x − 1 = 0. By the method of Lagrange multipliers, we form
At a critical point of L,
Lx = 2AT Ax − λ2x = 0
Lλ = 1 − xT x = 0.
∂f
∂x1
Find ▽f = ... .
∂f
∂xn
n
X
∂f ∂xi ∂xj
= bij xj + xi
∂xk ∂xk ∂xk
i,j=1
X n
= bij (δik xj + xi δjk )
i,j=1
X n n
X
= bkj xj + bik xi
j=1 i=1
Xn Xn
= bkj xj + bki xi
j=1 i=1
Xn
= 2 bkj xj
j=1
= 2Bk∗ · x.
Here Bk∗ refers to the kth row of B. Therefore ▽f = 2Bx = 2AT Ax.
Now we prove the result on the infinity norm. Take any x 6= 0. Look at the ith component of
Ax.
Pn
n n n n
X X X j=1 aij xj X
aij xj ≤ |aij | |xj | ≤ |aij | |x|∞ =⇒ ≤ |aij |.
|x|∞
j=1 j=1 j=1 j=1
X n
|Ax|∞
Take maximum over i =⇒ ≤ max |aij |.
|x|∞ i
j=1
n
X
Take maximum over x 6= 0 =⇒ |A|∞ ≤ max |aij |.
i
j=1
36 CHAPTER 3. LINEAR SYSTEMS
X n
|Ax|∞
For equality, find x such that = max |aij |. Suppose the maximum row sum occurs at
|x|∞ i
j=1
the kth row. That is,
n
X n
X
|akj | = max |aij |.
i
j=1 j=1
n
X n
X |Ax|∞
akj xj = |akj | = .
|x|∞
j=1 j=1
1 0
Example: Let A = . Then kAk2 ≈ 5.0368 (square root of the largest eigenvalue of AT A) and
−3 4
kAk∞ = max(1, 7) = 7.
The infinity matrix norm is much easier to calculate because it only involves calculating
the row sums. The matrix 2–norm however requires the calculation of the largest eigenvalue of
AT A which is a computational intensive operation when A is a large matrix. It does have nice
properties which makes it attractive for theoretical purposes.
Definition: Let k · k be the matrix norm induced by the vector norm | · |. Let A ∈ Rn×n be non–singular.
The condition number of A is κ(A) = kAk kA−1 k.
A matrix is said to be ill–conditioned if its condition number is large relative to the working
precision of the calculation. For instance, on a computer with double precision arithmetic, there
are approximately 15 significant (decimal) digits to represent a real number. If the condition
number is, say, greater than 1010 , then the matrix is ill-conditioned. On a machine with 30
significant digits, then the same matrix is not ill–conditioned. The concept of ill-conditioning is
irrelevant if all calculations are done exactly.
It is simple to show that for any non–singular matrix A, κ(A) ≥ 1. To show this, note that
The following theorem bounds the relative error of an approximate solution in terms of the
condition number and the relative error of the data. Let | · | be any vector norm and k · k be the
induced matrix norm.
Proof: Use the facts A−1 r = A−1 (Ax̃ − b) = A−1 (Ax̃ − Ax) = x̃ − x and |b| = |Ax| ≤ kAk |x| to get
This is a sharp upper bound which is achievable by the following: let x and r be such that
kAk |x| = |Ax| and kA−1 k |r| = |A−1 r|. Then
What this theorem says is that the size of the residual r is indicative of the relative error only
if the condition number of the matrix is not large. If the condition number is large, then one can
have a large relative error with a small residual. The following is an alternative interpretation.
Suppose Ax̃ = b + r. That is, x̃ is the exact solution of a system whose righthand side is
perturbed (due to roundoff error or uncertainty in data). The significance of this result is that
the relative error of the solution is proportional to the condition number of the matrix – the
larger the condition number, the larger the relative error.
In the above theorem, the righthand side vector b is perturbed. Now we perturb the matrix.
Again the relative error will be seen to depend on the condition number of the matrix.
Theorem: Let A be non–singular and kA−1 Ek < 1/2 for some matrix E. Suppose Ax = b and (A+E)x̃ = b.
Then
|x̃ − x| kEk
≤ 2κ(A) .
|x| kAk
Proof: From simple algebra, we have x̃ − x = −A−1 E x̃ and so |x̃| − |x| ≤ |x̃ − x| ≤ kA−1 Ek |x̃| from
which we obtain
|x|
|x̃| ≤ ≤ 2 |x|.
1 − kA−1 Ek
Therefore
|x̃ − x| |x̃| kEk
≤ kA−1 Ek ≤ kA−1 k kEk 2 = 2κ(A) .
|x| |x| kAk
A rule–of–thumb for the solution of a linear system in double precision is the following. If
the condition number of the matrix is 10m , then the computed solution by GE (with partial
pivoting) is expected to have 16 − m correct digits. For example, if the condition number is 10,
then we expect 15–digit accuracy in the computed solution. If the condition number is 1016 or
larger, then the computed solution can be totally worthless. Roughly speaking, the system is
ill-conditioned if the condition number is larger than 10m/2 . Hence the degree of ill-conditioning
depends on the number of digits in the computation.
its (real) eigenvalues are positive. Indeed, if λ is an eigenvalue with corresponding eigenvector
x, i.e., Ax = λx, then
0 < xT Ax = λxT x
iff λ > 0. In particular, a symmetric positive definite matrix is non–singular.
2 0 3
Example: Let A = , B= . Note that A is symmetric positive definite while B is not because
3 3 1
for x = [1, −1]T T
√ , it is easy to check that x Bx = −5. It can be checked that the eigenvalues of
B are (1 ± 37)/2.
Symmetric positive definite matrices occur frequently in practice. Because of their special
properties, no pivoting is required in the GE. The operation count for the factorization is O(n3 /6)
which is one half of the count of GE for general matrices.
Theorem: Let A be symmetric, positive definite. Then there exists a unique upper triangular R with
positive diagonal entries such that A = RT R (called the Cholesky decomposition of A).
T T
y y y B a y
A = = y T By > 0.
0 0 0 aT ann 0
Induction hypothesis implies that there exists a unique upper triangular S with positive diagonal
entries such that B = S T S. Let
ST S a ST S b
A= = = RT R
aT ann bT c c
where c ∈ R, b ∈ Rn−1 . From the above system, a = S T b and ann = bT b + c2 . Since S is
nonsingular,
p we have b = S −T a. If it can be shown that ann − bT b > 0, then we can define
c = ann − bT b.
By a direct calculation, ann − bT b = ann − aT B −1 a. Define γ = B −1 a. Since A is positive
definite,
T B a γ
0 < [γ , −1] T
a ann −1
= ann − aT B −1 a
= ann − bT b.
Finally, we show uniqueness. Let A = RT R = R̃T R̃ where R̃ is upper triangular with positive
diagonal entries. Then R̃−T RT = R̃R−1 . Note that the inverse of an upper triangular matrix is
upper triangular and the product of two upper triangular matrices is upper triangular. The same
remark applies to lower triangular matrices. Hence R̃−T RT = R̃R−1 = D, a diagonal matrix.
Look at the (i, i) entry of R̃ = DR and of RT = R̃T D to obtain r̃ii = dii rii and rii = r̃ii dii or
dii = ±1. Since rii and r̃ii are positive, dii = 1 for every i and so D is the identity matrix which
implies that R = R̃.
Example: Cholesky decomposition.
2 −1 a a b c
−1 2 −1 = b d d e
−1 2 c e f f
√ √
2 q 2 − √12 0
1 3 q q
=
− √2
q 2
3
2 − 23
.
0 − 23 √23 √2
3
where λi are the eigenvalues of G. The following is necessary and sufficient for the convergence
of the above iterative method.
Proof: Subtracting (3.3) and (3.4), we obtain e(k+1) = Ge(k) or e(k) = Gk e(0) . If ρ(G) ≥ 1, let e(0) be a
normalized eigenvector of G with eigenvalue of magnitude ρ(G). Clearly, e(n) does not converge
to 0.
On the other hand, suppose ρ(G) < 1. Then Gk = XD k X −1 . Since every eigenvalue of D has
magnitude less than one, D k → 0 and hence Gk → 0 which implies that e(k) → 0.
We remark that the assumption of a diagonalizable matrix G in the above theorem is not
necessary. It merely serves to simplify the proof.
We can now state that for any matrix norm k · k induced by a vector norm | · |, the condition
kGk < 1 is a sufficient condition for the FPI (3.4) to converge. To see this, let Gz = λz where
|λ| = ρ(G) and z is corresponding eigenvector. Now
x(k+1) = −D −1 (L + U )x(n) + D −1 b.
3.5. ITERATIVE SOLVERS 41
Suppose x(k) is known. For ith component of the new iterate is obtained by solving xi from the
(k)
ith equation where the current value xj is used for all j 6= i:
(k+1) 1 X (k)
xi = bi − aij xj .
aii
j6=i
This scheme is similar to Jacobi scheme except that the most up–to–date components are used
in the definition:
(k+1) 1 X (k+1)
X (k)
xi = bi − aij xj − aij xj .
aii
j<i j>i
Example: Let A = [ 3 1 0 ; 1 2 0 ; 0 0 −4 ], b = [5, 5, 0]^T, with exact solution [1, 2, 0]^T. For this problem
D = [ 3 0 0 ; 0 2 0 ; 0 0 −4 ], L = [ 0 0 0 ; 1 0 0 ; 0 0 0 ], U = [ 0 1 0 ; 0 0 0 ; 0 0 0 ].
Using the initial iterate x^(0) = [0, 0, 0]^T, the Jacobi iterates are
x^(1) = [5/3, 5/2, 0]^T, x^(2) = [5/6, 5/3, 0]^T, x^(3) = [10/9, 25/12, 0]^T.
For Gauss–Seidel applied to the same system, the iterates converge faster to the solution than the Jacobi iterates. They are given by
x_1^(k+1) = (5 − x_2^(k) − 0·x_3^(k))/3,
x_2^(k+1) = (5 − x_1^(k+1) − 0·x_3^(k))/2,
x_3^(k+1) = (0 − 0·x_1^(k+1) − 0·x_2^(k+1))/(−4).
The iteration matrix is G = [ 0 −1/3 0 ; 0 1/6 0 ; 0 0 0 ], which has spectral radius 1/6 < 1. Thus this
iteration must converge, and in fact it converges faster than the Jacobi iteration because it has
a smaller spectral radius.
Now interchange the first two equations of the system, so that A = [ 1 2 0 ; 3 1 0 ; 0 0 −4 ] with the same b. Using the initial iterate x^(0) = [0, 0, 0]^T, the Jacobi iterates are
x^(1) = [5, 5, 0]^T, x^(2) = [−5, −10, 0]^T, x^(3) = [25, 20, 0]^T,
and they diverge. The iteration matrix is G = [ 0 −2 0 ; −3 0 0 ; 0 0 0 ], which has eigenvalues 0, ±√6 and
so ρ(G) = √6 > 1. It is no surprise that the iteration diverges.
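As an aside, here is a hedged Python sketch (NumPy assumed; the helper names jacobi and gauss_seidel are only illustrative) that runs both iterations on the example system above.

```python
import numpy as np

def jacobi(A, b, x0, iters):
    """Jacobi iteration: x^(k+1) = -D^{-1}(L+U) x^(k) + D^{-1} b."""
    D = np.diag(np.diag(A))
    LU = A - D
    x = x0.astype(float)
    for _ in range(iters):
        x = np.linalg.solve(D, b - LU @ x)
    return x

def gauss_seidel(A, b, x0, iters):
    """Gauss-Seidel: use the most up-to-date components within each sweep."""
    n = len(b)
    x = x0.astype(float)
    for _ in range(iters):
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            x[i] = (b[i] - s) / A[i, i]
    return x

A = np.array([[3., 1., 0.], [1., 2., 0.], [0., 0., -4.]])
b = np.array([5., 5., 0.])
x0 = np.zeros(3)
print(jacobi(A, b, x0, 3))        # ~ [10/9, 25/12, 0]
print(gauss_seidel(A, b, x0, 3))  # noticeably closer to [1, 2, 0]
```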
The condition ρ(G) < 1 unambiguously decides whether an iteration converges or diverges.
However, calculating the spectral radius is often much more difficult (takes more work) than solv-
ing the linear system. We now give a simpler sufficient condition that determines convergence.
This condition is much simpler to apply.
Matrix A ∈ Rn×n is said to be strictly diagonally dominant if for every i,
X
|aii | > |aij |.
j6=i
Example: Let A = [ 3 1 −1 ; 2 −5 2 ; 1 6 8 ], B = [ 3 2 6 ; 1 8 1 ; 9 2 −2 ]. It is easy to check that A is strictly diagonally
dominant while B is not.
Theorem: If A is strictly diagonally dominant, then the Jacobi and Gauss Seidel iterations converge for
any initial iterate.
Proof: Let λ be an eigenvalue of the iteration matrix G = I − B −1 A such that |λ| = ρ(G) and let
v be the corresponding eigenvector with |vm | = 1 and |vi | ≤ 1, ∀i. For the Jacobi method,
G = −D^{−1}(L + U), which implies that (L + U)v = −λDv. The mth row of this equation reads
∑_{j≠m} a_mj v_j = −λ a_mm v_m, which implies that
|λ| ≤ ∑_{j≠m} |a_mj| |v_j| / |a_mm| ≤ ∑_{j≠m} |a_mj| / |a_mm| < 1
since A is strictly diagonally dominant. The previous theorem shows that the iterates must
converge to the true solution.
For the Gauss–Seidel method, G = −(L + D)^{−1}U, which implies that Uv = −λ(L + D)v. The
mth row of this equation reads ∑_{j>m} a_mj v_j = −λ ∑_{j≤m} a_mj v_j, which implies that
|λ| ≤ ∑_{j>m} |a_mj| / ( |a_mm| − ∑_{j<m} |a_mj| ) = ∑_{j>m} |a_mj| / ( ∑_{j>m} |a_mj| + |a_mm| − ∑_{j≠m} |a_mj| ) < 1
since A is diagonally dominant. An application of the previous theorem yields the result.
In light of this result, it pays to try to rearrange the matrix so that it is diagonally dominant.
This is only practical if the matrix is not too large. Note that not all matrices can become
diagonally dominant with a permutation. For instance, the matrix
[ 3 2 ; 1 1 ]
is not strictly diagonally dominant, and neither is the matrix obtained by interchanging its two rows.
Theorem: Let G be the iteration matrix of the SOR method with parameter ω. Then ρ(G) ≥ |1 − ω|.
Proof: The SOR iteration matrix is G = (D + ωL)^{−1}((1 − ω)D − ωU), whose determinant is (1 − ω)^n.
Thus the product of the eigenvalues of G must equal (1 − ω)^n. Hence |1 − ω|^n ≤ ρ(G)^n which
implies the result.
Thus if ω 6∈ (0, 2), SOR iterates will diverge in general. It turns out that if A is symmetric and
positive definite, then the SOR iterates converge for an arbitrary initial guess iff ω ∈ (0, 2). SOR
can converge quickly provided that one can choose an optimal value of the parameter ω which
in general is extremely difficult to find. The methods below do not require the user to specify
such parameters and thus are preferred over SOR.
One technique to accelerate the convergence is to precondition the system. In place of the
system Ax = b, we solve M −1 Ax = M −1 b where M is called a preconditioner. If the condition
number of M −1 A is much smaller than the condition number of A, the iterative solvers for the
new system will converge much quicker. This subject is an active area of research.
Chapter 4
Least Squares
In the last chapter, we solved linear systems where the coefficient matrix is square. Now we
consider the case where A is a rectangular matrix. Usually, there are more equations than
unknowns so that there is no solution in general. The system is said to be over–determined.
The method of least squares finds a solution which minimizes the residual.
Let A ∈ Rm×n where m > n. Given b ∈ Rm , there is in general no solution to the system
Ax = b for any x ∈ Rn . Assume that the columns of A are linearly independent so that rank
A = n. We require that the residual Ax − b be perpendicular to all vectors in the range space
of A. Recall that vectors p and q are perpendicular if pT q = 0. So the requirement that the
residual be perpendicular to the range space means that (Az)T (Ax − b) = 0 for every z ∈ Rn .
This implies that z T (AT Ax − AT b) = 0 for every z which implies that
AT Ax = AT b.
This is called the normal equation. It is not difficult to show that AT A is symmetric positive
definite. Symmetry is easy to show. To show positive definiteness, let z be any non–zero vector.
Then
z T (AT A)z = (Az)T (Az) = |Az|22 > 0
since z 6= 0 and that A has rank n. Consequently, the normal equation has a unique solution
x = x∗ , called the least squares solution of Ax = b. This system can be solved by Cholesky
factorization. We stress that in general Ax∗ 6= b but Ax∗ − b is perpendicular to the range space
of A.
Let us give an alternative derivation of the normal equation. This will explain why the
method is called least squares. Since Ax = b has no solution in general, we would like the
“solution” to give the smallest possible residual. That is, we find the x ∈ Rn which minimizes
|Ax − b|_2 or, equivalently, minimize
f(x) = |Ax − b|_2² = x^T A^T Ax − 2x^T A^T b + b^T b.
From calculus, the minimum occurs at the critical point of this quadratic function:
0 = ∇f = 2AT Ax − 2AT b.
Solve this equation to obtain AT Ax = AT b which is the normal equation again. Note that
the second derivative matrix is 2AT A which is symmetric positive definite and has positive
eigenvalues and so the critical point must be a local as well as global minimum.
It should be remarked that the method of normal equation is not the best method to solve
the least squares problem. The reason is that the condition number of the normal system is
κ(AT A) = κ(A)2 , the square of the condition number of A. (We have not defined the condition
number of a rectangular matrix.) This means that the normal equation can be much more
sensitive to roundoff errors than the solution of the least squares method by other approaches
(for instance, QR factorization which will be discussed later.) The virtue of the normal equation
is its simplicity.
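For a concrete illustration (not from the notes), the following Python sketch solves a least squares problem both by the normal equation and by NumPy's built-in solver; the random data is only an assumption used for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3
A = rng.standard_normal((m, n))     # tall matrix; rank n with probability 1
b = rng.standard_normal(m)

# Normal equation A^T A x = A^T b (A^T A is symmetric positive definite)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library least squares solver (uses an orthogonal factorization internally)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))
# The residual A x* - b is orthogonal to the range of A:
print(np.max(np.abs(A.T @ (A @ x_normal - b))))
```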
The minimum of R can be found by the normal equation, for instance. Hence fitting a line is a
least squares problem.
Recall in the formulation of the least squares problem, we assume that rank A = n. This
means that the terms in the model are “linearly independent”. For instance, the model y =
at + c(2t) would lead to linearly dependent columns.
One can also fit the data using a quadratic y = pt² + qt + r for some real numbers p, q, r. In this case, m ≥ 4. The method of least squares means that we minimize
R = ∑_{i=1}^{m} (pt_i² + qt_i + r − y_i)² = |Ax − b|_2²,
where
A = [ t_1² t_1 1 ; ... ; t_m² t_m 1 ], x = [ p ; q ; r ], b = [ y_1 ; ... ; y_m ].
Example: Given data points (−1, 1), (0, 0), (1, 0), (2, −2). Fit a line and a quadratic through the data
using the method of least squares.
First fit a line. Define
A = [ −1 1 ; 0 1 ; 1 1 ; 2 1 ], x = [ a ; c ], b = [ 1 ; 0 ; 0 ; −2 ].
The method of least squares is to minimize |Ax − b|_2. The normal equation is A^T Ax = A^T b where
A^T A = [ 6 2 ; 2 4 ], A^T b = [ −5 ; −1 ].
Solving this equation leads to x = [a, c]^T = [−.9, .2]^T. Hence the line has the equation y = −.9t + .2. Here |Ax − b|_2² = .7.
Now fit a quadratic. Define
A = [ 1 −1 1 ; 0 0 1 ; 1 1 1 ; 4 2 1 ], x = [ p ; q ; r ], b = [ 1 ; 0 ; 0 ; −2 ].
The method of least squares is to minimize |Ax − b|_2. The normal equation is A^T Ax = A^T b where
A^T A = [ 18 8 6 ; 8 6 2 ; 6 2 4 ], A^T b = [ −7 ; −5 ; −1 ].
Solving this equation leads to x = [p, q, r]^T = [−.25, −.65, .45]^T. Hence the quadratic has the
equation y = −.25t² − .65t + .45. Here |Ax − b|_2² = .45. Why is this number smaller than the
corresponding one for the least squares line?
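A hedged Python sketch of the same computations (assuming NumPy); the data points are those of the example above.

```python
import numpy as np

t = np.array([-1., 0., 1., 2.])
y = np.array([1., 0., 0., -2.])

# Line y = a*t + c: columns of A are [t, 1]
A1 = np.column_stack([t, np.ones_like(t)])
a, c = np.linalg.solve(A1.T @ A1, A1.T @ y)
print(a, c)                                  # -0.9, 0.2
print(np.sum((A1 @ [a, c] - y) ** 2))        # 0.7

# Quadratic y = p*t^2 + q*t + r: columns are [t^2, t, 1]
A2 = np.column_stack([t**2, t, np.ones_like(t)])
p, q, r = np.linalg.solve(A2.T @ A2, A2.T @ y)
print(p, q, r)                               # -0.25, -0.65, 0.45
```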
4.2 Trigonometric and Exponential Models
A trigonometric model has the form y = c_1 + c_2 cos πt + c_3 sin πt for coefficients c_1, c_2, c_3; fitting it to data (t_i, y_i) is again a linear least squares problem.
Example: Given data (0, 2), (.5, 0), (1, −1), (2, 1). Using the above trigonometric model, we obtain
A = [ 1 cos 0 sin 0 ; 1 cos π/2 sin π/2 ; 1 cos π sin π ; 1 cos 2π sin 2π ] = [ 1 1 0 ; 1 0 1 ; 1 −1 0 ; 1 1 0 ], b = [ 2 ; 0 ; −1 ; 1 ].
The normal equation is AT Ax = AT b where
A^T A = [ 4 1 1 ; 1 3 0 ; 1 0 1 ], A^T b = [ 2 ; 4 ; 0 ].
Solve this system to get c1 = .25, c2 = 1.25, c3 = −.25. Thus the model is y = .25+1.25 cos πt−
.25 sin πt.
We now examine an exponential model y = c1 ec2 t . Using the same procedure as before to
fit the data would lead to a system of over–determined nonlinear equations which is much more
difficult to solve than the (linear) least squares problems which we have seen so far. A better
way is to take the log of the model to obtain
ln y = ln c1 + c2 t = c3 + c2 t
with c3 = ln c1 . Now the over–determined system becomes Ax = b where
A = [ t_1 1 ; ... ; t_m 1 ], b = [ ln y_1 ; ... ; ln y_m ], x = [ c_2 ; c_3 ].
After solving this least squares problem, the original parameter c1 can easily be recovered from
c1 = ec3 .
Example: Fit the data (0, e0 ), (1, e1 ), (2, e3 ) using the above exponential model. Now
A = [ 0 1 ; 1 1 ; 2 1 ], b = [ 0 ; 1 ; 3 ].
The normal equation is A^T Ax = A^T b where
A^T A = [ 5 3 ; 3 3 ], A^T b = [ 7 ; 4 ], x = [ c_2 ; c_3 ].
Solve this system to get c2 = 1.5, c3 = −.1667 from which we obtain c1 = ec3 = .8465. Thus
the model is y = .8465e1.5t .
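A small Python sketch of the log-transform fit (assuming NumPy), applied to the data of the example.

```python
import numpy as np

t = np.array([0., 1., 2.])
y = np.exp([0., 1., 3.])

# Model y = c1 * exp(c2*t); take logs: ln y = c3 + c2*t with c3 = ln c1
A = np.column_stack([t, np.ones_like(t)])
c2, c3 = np.linalg.solve(A.T @ A, A.T @ np.log(y))
c1 = np.exp(c3)
print(c2, c1)    # approximately 1.5 and 0.8465, as in the example
```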
Of course, other exponential models are possible. We consider another one y = c1 tec2 t . As
before, take the log on both sides to obtain
ln y − ln t = ln c1 + c2 t = c3 + c2 t.
Notice the term ln t is placed on the left-hand side which makes this a linear least squares
problem Ax = b with
A = [ t_1 1 ; ... ; t_m 1 ], b = [ ln y_1 − ln t_1 ; ... ; ln y_m − ln t_m ], x = [ c_2 ; c_3 ].
After solving this least squares problem, the original parameter c1 can be recovered from c1 = ec3 .
Example: Fit the data (.5, e0 ), (1, e1 ), (2, e3 ) using the new exponential model. Now
A = [ .5 1 ; 1 1 ; 2 1 ], b = [ .69315 ; 1 ; 2.3069 ].
The normal equation is A^T Ax = A^T b where
A^T A = [ 5.25 3.5 ; 3.5 3 ], A^T b = [ 5.9603 ; 4 ], x = [ c_2 ; c_3 ].
Solve this system to get c_2 = 1.1088, c_3 = .039721, from which we obtain c_1 = e^{c_3} = 1.0405.
Thus the model is y = 1.0405 t e^{1.1088t}.
Suppose we have a linearly independent set {x1 , x2 , ..., xn }. We want an orthonormal set
{u1 , u2 , ..., un } spanning the same subspace.
Gram Schmidt Procedure
x1
u1 =
|x1 |2
y2 = x2 − xT2 u1 u1
y2
u2 =
|y2 |2
y3 = x3 − xT3 u1 u1 − xT3 u2 u2
y3
u3 =
|y3 |2
yk = xk − xTk u1 u1 − · · · − xTk uk−1 uk−1
yk
uk = .
|yk |2
It is not difficult to check that uTi uj = δij and that {u1 , u2 , ..., un } spans the same space as
{x1 , x2 , ..., xn }.
Let A = [x1 | · · · |xn ] ∈ Rm×n . Assume m ≥ n and rank A = n. In matrix notation, Gram
Schmidt gives a rectangular factorization A = QR, where Q ∈ Rm×n is orthogonal (QT Q = I)
and R ∈ Rn×n is upper triangular.
Note that R is a non-singular matrix since its diagonal entries r_ii = |y_i|_2 are positive.
Example:
A = [ 1 1 3 ; 0 2 1 ; 0 0 1 ; −1 −1 −1 ] = QR = [ 1/√2 0 1/√3 ; 0 1 0 ; 0 0 1/√3 ; −1/√2 0 1/√3 ] [ √2 √2 2√2 ; 0 2 1 ; 0 0 √3 ].
We now use the QR factorization to solve the least squares problem minn |Ax − b|2 for A ∈
x∈R
Rm×n with rank A = n. Recall that the solution is given by the normal equation AT Ax = AT b.
Let A = QR be the QR factorization. On substitution into the normal equation, we obtain
RT QT QRx = RT QT b. Since Q is orthogonal and R is invertible, we obtain
Rx = QT b.
The solution can easily be obtained after a simple back substitution. This method of solution
is numerically stable and better than the method of normal equation.
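Here is a hedged Python sketch (assuming NumPy) of this approach: classical Gram–Schmidt produces Q and R, and the least squares solution is obtained from Rx = Q^T b. It is applied to the earlier line-fitting data as a check.

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt: A = QR with Q^T Q = I, R upper triangular."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        y = A[:, k].copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ A[:, k]
            y -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(y)
        Q[:, k] = y / R[k, k]
    return Q, R

def lstsq_qr(A, b):
    """Solve min |Ax - b|_2 via Rx = Q^T b."""
    Q, R = gram_schmidt_qr(A)
    return np.linalg.solve(R, Q.T @ b)

t = np.array([-1., 0., 1., 2.])
A = np.column_stack([t, np.ones_like(t)])
b = np.array([1., 0., 0., -2.])
print(lstsq_qr(A, b))   # [-0.9, 0.2], as in the earlier example
```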
Chapter 5
Interpolation and Approximation
Given data points (x1 , y1 ), · · · , (xn , yn ). Throughout this chapter, it is assumed that xi 6= xj
if i 6= j. The goal is to find a function f (x) which interpolates the data points. That is, the
function satisfies yi = f (xi ), i = 1, · · · , n. The simplest function to choose is a polynomial.
However polynomial interpolation is unstable if n is not small, say, n > 5 unless the nodes {xi }
are chosen properly. For larger values of n, it is better to use several lower–order polynomials.
This is the topic of splines.
In case n = 3, the formula for the polynomial interpolant is more complicated. Let p(x) = a+bx+
cx2 be the polynomial which passes through the three given points. Hence p(xi ) = yi , i = 1, 2, 3.
This yields a linear system of three equations for the three unknown coefficients:
1 x1 x21 a y1
1 x2 x22 b = y2 .
1 x3 x23 c y3
Lagrange Interpolation
Fix n, the number of data points. Define the polynomials of degree n − 1
L_i(x) = ∏_{j=1, j≠i}^{n} (x − x_j)/(x_i − x_j),   i = 1, · · · , n.   (5.1)
Note that L_i(x_j) = δ_ij, the Kronecker delta.
Using this property, it is easy to verify that the unique polynomial interpolant (of degree n − 1
or less) of the data points (x_1, y_1), · · · , (x_n, y_n) is
p(x) = ∑_{i=1}^{n} y_i L_i(x).
Indeed, for each j, p(x_j) = ∑_{i=1}^{n} y_i L_i(x_j) = ∑_{i=1}^{n} y_i δ_ij = y_j. This is called the Lagrange form of
the interpolant.
We now show that there is a unique polynomial interpolant.
Theorem: Given the data points (x1 , y1 ), · · · , (xn , yn ). There exists a unique polynomial p of degree at
most n − 1 so that yi = p(xi ), i = 1, · · · , n.
Proof: The Lagrange form of the interpolant gives a polynomial of degree n − 1 which interpolates the
data. Suppose p and q are polynomials of degree n − 1 or less which interpolate the data. Define
d = p − q which is a polynomial of degree at most n − 1. We wish to show that d is the identically
zero function so that the polynomial interpolant is unique. Now d(xj ) = 0, j = 1, · · · , n. By
the Fundamental Theorem of Algebra, a polynomial of degree at most n − 1 has at most n − 1
zeroes or is the zero polynomial. In this case, d ≡ 0.
Example: Given the data (0, 2), (1, 1), (2, 0), (3, −1). They happen to lie along the straight line y = −x+2
which is the unique polynomial interpolant. Notice that this polynomial has degree one and not
three.
Example: Find the Lagrange interpolating polynomial for the data (0, 1), (2, 2), (3, 4).
First, calculate the polynomials Li :
L_1(x) = (x − 2)(x − 3) / ((0 − 2)(0 − 3)) = (x − 2)(x − 3)/6,
L_2(x) = (x − 0)(x − 3) / ((2 − 0)(2 − 3)) = x(x − 3)/(−2),
L_3(x) = (x − 0)(x − 2) / ((3 − 0)(3 − 2)) = x(x − 2)/3.
Consequently, the interpolating polynomial is
p(x) = 1 · (x − 2)(x − 3)/6 + 2 · x(x − 3)/(−2) + 4 · x(x − 2)/3 = x²/2 − x/2 + 1.
For x different from {xi }, evaluation of the Lagrange polynomial interpolant p(x) takes O(n2 )
operations. There is an alternative form which takes only O(n) operations. Define
φ_n(x) = ∏_{i=1}^{n} (x − x_i),   w_i = 1 / ∏_{j≠i} (x_i − x_j).
Then
p(x) = φ_n(x) ∑_{i=1}^{n} w_i y_i / (x − x_i).
Once the weights w_i have been computed, evaluation of p(x) takes O(n) operations. This form
is known as the barycentric Lagrange interpolant.
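A short Python sketch (assuming NumPy) of the barycentric form just described, checked against the Lagrange example above.

```python
import numpy as np

def barycentric_weights(x):
    """w_i = 1 / prod_{j != i} (x_i - x_j); computed once per node set."""
    n = len(x)
    w = np.ones(n)
    for i in range(n):
        w[i] = 1.0 / np.prod(np.delete(x[i] - x, i))
    return w

def barycentric_eval(x, y, w, t):
    """Evaluate p(t) = phi(t) * sum_i w_i y_i / (t - x_i) in O(n) operations."""
    d = t - x
    if np.any(d == 0):                 # t coincides with a node
        return y[np.argmin(np.abs(d))]
    return np.prod(d) * np.sum(w * y / d)

x = np.array([0., 2., 3.])
y = np.array([1., 2., 4.])
w = barycentric_weights(x)
print(barycentric_eval(x, y, w, 1.0))  # p(1) = 1/2 - 1/2 + 1 = 1.0
```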
f [xi ] = f (xi ), i = 1, · · · , n;
f [xi+1 ] − f [xi ]
f [xi , xi+1 ] = , i = 1, · · · , n − 1;
xi+1 − xi
f [xi+1 , xi+2 ] − f [xi , xi+1 ]
f [xi , xi+1 , xi+2 ] = , i = 1, · · · , n − 2;
xi+2 − xi
f [xi+1 , xi+2 , xi+3 ] − f [xi , xi+1 , xi+2 ]
f [xi , xi+1 , xi+2 , xi+3 ] = , i = 1, · · · , n − 3;
xi+3 − xi
..
.
f [xi+1 , · · · , xn ] − f [xi , · · · , xn−1 ]
f [xi , · · · , xn ] = , i = 1, · · · , n − 1.
xn − xi
From the key observation that
{1, (x − x_1), (x − x_1)(x − x_2), · · · , (x − x_1)(x − x_2) · · · (x − x_{n−1})}
is a basis for the space of polynomials of degree n − 1 or lower, we can express p, any polynomial
of degree n − 1 or lower, by
p(x) = c_1 + c_2(x − x_1) + c_3(x − x_1)(x − x_2) + · · · + c_n(x − x_1) · · · (x − x_{n−1})   (5.2)
for some coefficients c_1, · · · , c_n.
Theorem: Let p be the interpolating polynomial of degree at most n − 1 for the data (x1 , y1 ), · · · , (xn , yn ).
Then
cj = p[x1 , · · · , xj ], 1≤j≤n
where cj is defined in (5.2).
Proof: The proof is induction on n. Notice that y1 = p(x1 ) = p[x1 ] = c1 . Next, y2 = p(x2 ) =
c1 + c2 (x2 − x1 ) and so
p[x2 ] − p[x1 ]
c2 = = p[x1 , x2 ].
x2 − x1
(We also show the case n = 2 because it is instructive. Strictly speaking it is not necessary for
the proof.)
Assume the statement of the theorem holds for n data points. Now consider the case with n + 1
data points. Let p be the unique interpolating polynomial of degree at most n. Let q be the
unique polynomial of degree n − 1 or less interpolating (x1 , y1 ), · · · , (xn , yn ) and let r be the
unique polynomial of degree n − 1 or less interpolating (x2 , y2 ), · · · , (xn+1 , yn+1 ). We claim, to
be shown later, that
(x − x1 )
p(x) = q(x) + (r(x) − q(x)). (5.3)
xn+1 − x1
Therefore the coefficient of x^n of p is the same as that of the expression on the right, that is,
c_{n+1} = ( f[x_2, · · · , x_{n+1}] − f[x_1, · · · , x_n] ) / (x_{n+1} − x_1) = f[x_1, · · · , x_{n+1}],
by the induction hypothesis applied to q and r. It remains to verify the claim (5.3). For 1 ≤ i ≤ n, either r(x_i) = q(x_i) = y_i (when i ≥ 2) or x_i − x_1 = 0 (when i = 1), so
q(x_i) + ( (x_i − x_1)/(x_{n+1} − x_1) ) ( r(x_i) − q(x_i) ) = q(x_i) = p(x_i),
while at x_{n+1} the right-hand side of (5.3) equals r(x_{n+1}) = y_{n+1} = p(x_{n+1}).
Therefore p and the righthand side of (5.3) agree at n + 1 distinct points and since they are both
polynomials of degree at most n, they must in fact be the same polynomial.
x1 p[x1 ]
p[x1 , x2 ]
x2 p[x2 ] p[x1 , x2 , x3 ] .
p[x2 , x3 ]
x3 p[x3 ]
Example: Given the data (0, 1), (2, 2), (3, 4). Construct the Newton’s divided difference table and write
down the interpolating polynomial of degree 2.
0   1
          1/2
2   2              1/2
          2
3   4
The interpolating polynomial is
p(x) = 1 + (1/2)x + (1/2)x(x − 2) = x²/2 − x/2 + 1.
One advantage of the Newton’s divided difference over the Lagrange form is that if a new
data point comes in, one can simply add one more row in the current divided difference table
using O(n) operations. In contrast, the Lagrange method must restart from the scratch. Hence
the Newton’s method is far more efficient.
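A hedged Python sketch (assuming NumPy) of the divided difference table and Newton-form evaluation; the values reproduce the table above.

```python
import numpy as np

def divided_differences(x, y):
    """Return the Newton coefficients c_j = f[x_1,...,x_j] (top edge of the table)."""
    c = np.array(y, dtype=float)
    n = len(x)
    for j in range(1, n):
        # update the trailing entries in place, reusing previous column
        c[j:] = (c[j:] - c[j-1:-1]) / (x[j:] - x[:-j])
    return c

def newton_eval(c, x, t):
    """Evaluate p(t) = c_1 + c_2 (t-x_1) + ... by nested multiplication."""
    p = c[-1]
    for k in range(len(c) - 2, -1, -1):
        p = p * (t - x[k]) + c[k]
    return p

x = np.array([0., 2., 3.])
y = np.array([1., 2., 4.])
c = divided_differences(x, y)
print(c)                         # [1, 1/2, 1/2], the top edge of the table
print(newton_eval(c, x, 1.0))    # p(1) = 1
```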
Example: Continuing with the last example, suppose we add a new data point (1, 0). We append a new
row at the bottom of the above table to obtain
0   1
          1/2
2   2              1/2
          2                  −1/2
3   4              0
          2
1   0
The new interpolating polynomial is
p(x) = 1 + (1/2)x + (1/2)x(x − 2) − (1/2)x(x − 2)(x − 3).
Example: Given the data (0, 2), (1, 1), (2, 0), (3, −1). Construct the Newton’s divided difference table
and write down the interpolating polynomial.
0    2
          −1
1    1             0
          −1                 0
2    0             0
          −1
3   −1
The interpolating polynomial is p(x) = 2 − x.
Example: Given the data (0, 1), (2, 2), (3, 4). How many degree three polynomials interpolate these three
points?
Recall that there is a unique polynomial of degree two which interpolates these points: p(x) =
x2 x
− + 1. There are actually infinitely many polynomials of degree three which interpolate
2 2
these points:
x2 x
q(x) = − + 1 + cx(x − 2)(x − 3)
2 2
works for every real number c.
Example: How many polynomials of degree d, where 0 ≤ d < ∞, pass through the points (−1, −5), (0, −1), (2, 1), (3, 11)?
First construct the divided difference table:
−1   −5
            4
 0   −1             −1
            1                   1
 2    1              3
           10
 3   11
Hence the unique interpolating polynomial of degree three is
p(x) = −5 + 4(x + 1) − (x + 1)x + (x + 1)x(x − 2) = x³ − 2x² + x − 1.
Therefore there can be no interpolating polynomials of degree smaller than three and there are
infinitely many interpolating polynomials of degree larger than three.
Example: Interpolation can be used to approximate the values of complicated functions using only additions
and multiplications which are needed in evaluating polynomials. As a simple example, consider
approximating sin x by a third degree polynomial.
First, since sin is 2π–periodic, we can restrict the argument x ∈ [0, 2π). Using symmetries
sin x = − sin(2π − x) for x ∈ [π, 2π) and sin x = sin(π − x) for x ∈ [π/2, π], we can further
restrict x ∈ [0, π/2]. For a third degree polynomial, we construct the divided difference table for
sin x using four equally spaced points 0, π/6, π/3, π/2:
0      0
              .9549
π/6    1/2               −.2443
              .6990                   −.1139
π/3    √3/2              −.4232
              .2559
π/2    1
Hence the third degree polynomial which interpolates sin x at the above four points is
p(x) = .9549x − .2443x(x − π/6) − .1139x(x − π/6)(x − π/3),   x ∈ [0, π/2].
The absolute error is less than .01 for all x. Of course, the error can be made as small as desired
by taking sufficiently many points. In some sense, this is a compression problem: reducing the
sine function to the coefficients of the interpolating polynomial.
Theorem: Let n be a positive integer. Given f ∈ C n [a, b] and distinct points {xj , j = 1, · · · , n} lying in
[a, b]. Let p be the (unique) polynomial interpolant of f of degree n − 1 or less over {xj }. Then
for every x ∈ [a, b],
n
f (n) (ξ) Y
f (x) − p(x) = (x − xj ) (5.4)
n!
j=1
f (n) (ξ)
= f [x1 , · · · , xn , x] (5.5)
n!
and this is a continuous function of x.
Proof: Fix some x ∈ [a, b]. If x is a node xj , then the result is clearly true. So suppose x is not a node.
Define ψ(z) = f (z) − p(z) − αφ(z) for z ∈ R where φ is the product in (5.4) and α ∈ R is chosen
such that ψ(x) = 0. Thus ψ has at least n + 1 zeroes in [a, b], namely, x, x1 , · · · , xn . By Rolle’s
theorem, ψ ′ has at least n zeroes different from the points just enumerated. By repeatedly
applying Rolle’s theorem, ψ (n) has some zero ξ ∈ (a, b):
Noting that p is a polynomial of degree at most n − 1 while the leading term of φ is xn , it follows
f (n) (ξ)
that 0 = f (n) (ξ) − α n! and so α = . Therefore, 0 = ψ(x) = f (x) − p(x) − αφ(x) which
n!
gives (5.4).
Let x ∈ [a, b] be distinct from the nodes. Continuity of f [x1 , · · · , xn , x] as a function of x follows
by induction. The case n = 0 is trivial. Suppose it holds for n − 1. Let zj → x with each zj
distinct from the nodes. Now
f [x2 , · · · , xn , zj ] − f [x1 , · · · , xn ]
lim f [x1 , · · · , xn , zj ] = lim
j→∞ j→∞ zj − x1
f [x2 , · · · , xn , x] − f [x1 , · · · , xn ]
=
x − x1
= f [x1 , · · · , xn , x].
Now if x = xk for some k > 1, then the above argument still holds. If x = x1 , then recall that
f is symmetric and so we can permute, for instance, the first two arguments of f .
Let p be the polynomial interpolant as in the statement of this theorem. Assume first that x is
distinct from the nodes. Then the polynomial interpolant for the points x1 , · · · , xn , x is
n
Y
p̂(z) = p(z) + f [x1 , · · · , xn , x] (z − xj ), z∈R
j=1
by (5.2). But f (x) = p̂(x). Uniqueness of polynomial interpolation and (5.4) imply (5.5). The
case if x is one of the nodes follows by continuity.
Example: Find the smallest value of positive integer n so that for any n distinct points {x1 , · · · , xn } in
[0, 1], their polynomial interpolant p satisfies | sin x − p(x)| ≤ .001 for all x ∈ [0, 1].
From the above theorem, for any x ∈ [0, 1],
n
| sin(n) (x)| Y 1
| sin x − p(x)| = |x − xj | ≤ .
n! n!
j=1
Example: Given f (x) = 2x3 + 5x − 1 and distinct nodes xj , j = 1, · · · , 5. Let p be the polynomial
interpolant of these nodes. Find the maximum difference |f (x) − p(x)| for x ∈ R.
The maximum difference is zero since from the above theorem, the difference is proportional to
f (5) = 0.
Example: Given f (x) = 2x3 + 5x − 1 and nodes x1 = 1, x2 = 2. Let p be the polynomial interpolant of
these nodes. Find the maximum difference |f (x) − p(x)| for x ∈ [1, 2].
Here, n = 2 and and so f ′′ (x) = 12x. Thus |f ′′ (x)| ≤ 24 for all x ∈ [1, 2]. Let q(x) = (x−1)(x−2).
A simple estimate is |q(x)| ≤ 1 but we can do better. Since q is a parabola which vanishes at
x = 1, 2, its maximum magnitude in [1, 2] occurs at x = 3/2 with q(3/2) = −1/4. Hence
|f ′′ (x)| 24 1
max |f (x) − p(x)| ≤ max max |q(x)| ≤ = 3.
x∈[1,2] x∈[1,2] 2 x∈[1,2] 2 4
From the interpolation error of the previous theorem, an immediate question is how to choose
the nodes {x0 , . . . , xn } so that the error is as small as possible. The error is bounded by
( f^(n+1)(ξ) / (n + 1)! ) |φ(x)|,   φ(x) = ∏_{j=0}^{n} (x − x_j).
For convenience, let the interval be [−1, 1]. We wish to determine {xj } so that kφk∞ is as small
as possible. Here, for any function g, kgk∞ := max_{x∈[−1,1]} |g(x)|.
Define the Chebyshev polynomials by T_0(x) = 1, T_1(x) = x and T_{n+1}(x) = 2xT_n(x) − T_{n−1}(x)
for n ≥ 1. Note that T_n is a polynomial of degree n. By induction, it can be shown that for
n ≥ 0 and x ∈ [−1, 1],
Tn (x) = cos(n cos−1 x).
Observe that kTn k∞ = 1 and that the coefficient of xn of Tn is 2n−1 . Thus T̃n ≡ 21−n Tn is a
monic polynomial. It is remarkable that among all monic polynomials of degree n, T̃n achieves
the smallest supremum norm of 21−n on [−1, 1]. To see this, we argue by contradiction. Let
pn be a monic polynomial of degree n so that kpn k∞ < 21−n . Denote the extrema of T̃n by
yi = cos(iπ/n), i = 0, . . . , n. Observe that
So (−1)i (T̃n (yi ) − pn (yi )) > 0 meaning that T̃n − pn oscillates with at least n zeroes in (−1, 1).
However, since both T̃n and pn are monic, T̃n − pn is a polynomial of degree at most n − 1 having
at least n roots. This is a contradiction.
Since φ is a polynomial of degree n + 1, apply the above minimal property of the Chebyshev
polynomial to conclude that
kφk∞ ≥ k T_{n+1}/2^n k∞ = 1/2^n.
Equality is obtained if {xj } is the set of zeroes of Tn+1 , so-called, Chebyshev points:
x_j = cos( (2j + 1)π / (2n + 2) ),   j = 0, . . . , n.
For other choices of the nodes, for instance, equally spaced nodes, kφk∞ can be much larger.
Using Chebyshev points, the interpolation error can be bounded as:
kf − pk∞ ≤ kf^(n+1)k∞ / (2^n (n + 1)!),
where the norms are taken over [−1, 1].
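A hedged Python sketch (assuming NumPy) comparing equally spaced and Chebyshev nodes for the Runge-type function f(x) = 1/(1 + 16x²) used later in these notes; np.polyfit/np.polyval are used only as a convenient way to build and evaluate the interpolant.

```python
import numpy as np

def interp_max_error(f, nodes, m=2000):
    """Max error on [-1,1] of the polynomial interpolant of f at the given nodes."""
    coeffs = np.polyfit(nodes, f(nodes), len(nodes) - 1)
    t = np.linspace(-1, 1, m)
    return np.max(np.abs(f(t) - np.polyval(coeffs, t)))

f = lambda x: 1.0 / (1.0 + 16.0 * x**2)
n = 10                                          # degree-10 interpolant, 11 nodes
equi = np.linspace(-1, 1, n + 1)
cheb = np.cos((2 * np.arange(n + 1) + 1) * np.pi / (2 * n + 2))
print(interp_max_error(f, equi))   # large: Runge oscillations near the end points
print(interp_max_error(f, cheb))   # much smaller
```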
Theorem: Let n be a positive integer. Given f ∈ C n [a, b] and distinct points {xj , j = 1, · · · , n} lying
in [a, b]. Let p be the (unique) polynomial interpolant of f of degree n − 1 or less over {xj }.
Then there exist distinct points zi , i = 1, · · · , n − 1 in (a, b) and for each x ∈ [a, b], there exists
η ∈ (a, b) so that
n−1
f (n) (η) Y
f ′ (x) − p′ (x) = (x − zj ). (5.6)
(n − 1)!
j=1
Proof: Since f − p vanishes at the nodes, by Rolle’s theorem, there are points zi ∈ (xi , xi+1 ), i =
1, · · · , n − 1 so that f ′ − p′ vanishes at each zi . Fix x ∈ [a, b]. If x = zi , then clearly (5.6) holds.
Suppose x is distinct from the {zj }. Define
n−1
Y
ψ(x) = f ′ (x) − p′ (x) − α (x − zj ).
j=1
Choose α so that ψ(x) = 0. Of course ψ has at least n − 1 other roots at {zj }. Repeatedly apply
Rolle’s theorem to ψ to obtain the existence of η so that 0 = ψ (n−1) (η) from which (5.6) follows.
We conclude this section with some important theoretical results. Fix any positive integer
n. Define the interpolation operator In : C[0, 1] → C[0, 1] by
n
X
In (f )(x) = f (xin )Lin (x), f ∈ C[0, 1], x ∈ [0, 1].
i=1
Here {xin , 1 ≤ i ≤ n} is the set of distinct nodes and Lin ∈ Pn−1 was defined in (5.1), where
the subscript n was omitted there. Of course In (f )(xin ) = f (xin ) for 1 ≤ i ≤ n. It is easy to
see that I_n is a linear operator. {I_n} is said to be consistent if for each polynomial p, kI_n(p) − pk∞ → 0 as n → ∞. Define the operator norm kI_nk∞ = sup_{kfk∞ = 1} kI_n(f)k∞; {I_n} is said to be stable if sup_{n≥1} kI_nk∞ < ∞.
To estimate kI_nk∞ from below, fix x_∗ ∈ [0, 1] and choose f_∗ ∈ C[0, 1] with kf_∗k∞ = 1 and f_∗(x_in) = sign(L_in(x_∗)) for every i. Then
|I_n(f_∗)(x_∗)| = | ∑_{i=1}^{n} f_∗(x_in) L_in(x_∗) | = ∑_{i=1}^{n} |L_in(x_∗)|,
implying that
n
kIn (f∗ )k∞ |In (f∗ )(x∗ )| X
kIn k∞ ≥ ≥ = |Lin (x∗ )|.
kf∗ k∞ 1
i=1
An example of an unstable interpolating family is one with equally spaced nodes: xin = i/n.
For Chebyshev nodes, it can be shown that the interpolating family is stable with
kI_nk∞ ≤ 1 + (2/π) log(n + 1),   n ≥ 1.
(Here the interval is [−1, 1] rather than [0, 1].)
The next result relates the error of the best polynomial approximation and that of the
polynomial interpolant.
Theorem: Let f ∈ C[0, 1] and n be a positive integer. Let p∗n be the best polynomial approximation of f
of degree at most n:
kf − p∗n k∞ ≤ kf − pk∞ , p ∈ Pn ,
and pn = In (f ) ∈ Pn be the polynomial interpolant of f at the nodes {xin }. Then
kf − pn k∞ ≤ (1 + kIn k∞ ) kf − p∗n k∞ .
f − pn = f − In (f )
= f − In (p∗n ) + In (p∗n − f )
= f − p∗n − In (f − p∗n ).
Hence
kf − pn k∞ ≤ kf − p∗n k∞ + kIn k∞ kf − p∗n k∞ ≤ (1 + kIn k∞ ) kf − p∗n k∞ .
Theorem: Principle of Uniform Boundedness. Let X and Y be Banach spaces. Suppose {T_n, n ≥ 1}
is a family of bounded linear operators from X to Y which is pointwise bounded: for each x ∈ X, sup_{n≥1} kT_n xk_Y < ∞. Then sup_{n≥1} kT_nk < ∞.
Now we are ready for one of the most important results of this chapter. It states that for
a consistent family of interpolating operators, stability is equivalent to convergence. Results of
this flavour are pervasive in numerical analysis.
Theorem: Suppose {I_n} is consistent. Then {I_n} is stable iff kI_n(f) − fk∞ → 0 as n → ∞ for every f ∈ C[0, 1].
Proof: Suppose {In } is stable. Given ǫ > 0, we need to find some integer N so that for every n ≥ N ,
Let f ∈ C[0, 1]. From the Weierstrass Approximation Theorem, there is some polynomial p so
that
kf − pk∞ < ǫ / (2(1 + C)),
where C is the stability constant. By consistency, there is some integer N so that for all n ≥ N,
kI_n(p) − pk∞ < ǫ/2.
Then for every n ≥ N,
kI_n(f) − fk∞ ≤ kI_n(f − p)k∞ + kI_n(p) − pk∞ + kp − fk∞ ≤ (1 + C)kf − pk∞ + ǫ/2 < ǫ.
Conversely, suppose that for every f ∈ C[0, 1],
kIn (f ) − f k∞ → 0.
Therefore, there is some real number Cf which depends on f , but is independent of n, so that
kIn (f )k∞ ≤ Cf kf k∞ .
This means that {In } is pointwise bounded. By the Principle of Uniform Boundedness, supn≥1 kIn k∞ <
∞. This means that {In } is stable.
5.2 Hermite Interpolation
Given distinct nodes x_1, · · · , x_n and values y_i, z_i, Hermite interpolation matches both values and first derivatives at the nodes.
Theorem: There is a unique polynomial p of degree at most 2n − 1 such that
p(x_i) = y_i,   p′(x_i) = z_i,   i = 1, · · · , n.
Proof: The case n = 1 is trivial since the unique Hermite interpolant is the line p(x) = z1 (x − x1 ) + y1 .
Suppose n ≥ 2. Define the following polynomials of degree 2n − 1:
M_i(x) = [ 1 − 2(x − x_i)L_i′(x_i) ] L_i(x)²,   N_i(x) = (x − x_i) L_i(x)²,
where L_i is the polynomial defined in (5.1). It is easy to check these new polynomials satisfy
the following properties. For all i, j,
M_i(x_j) = δ_ij,   M_i′(x_j) = 0,   N_i(x_j) = 0,   N_i′(x_j) = δ_ij.
Hence
n
X
p(x) := Mi (x)yi + Ni (x)zi
i=1
is an interpolating polynomial satisfying the 2n conditions.
Next we check that p is unique. Suppose q is a polynomial of degree 2n − 1 or less which satisfies
the same 2n conditions. Then p − q has at least n distinct zeroes at the nodes. By Rolle’s
theorem, p′ − q ′ has (at least) n − 1 distinct zeroes, (at least) one in (xi , xi+1 ). But p′ − q ′ also
vanishes at each node by assumption and so p′ − q ′ has at least 2n − 1 distinct zeroes. Since
p′ − q ′ is a polynomial of degree at most 2n − 2, it can be concluded that p′ − q ′ ≡ 0 or p − q is
a constant. Since p − q vanishes at the nodes, p ≡ q.
Example: Find the Hermite polynomial p of degree 3 so that p(0) = 0, p(1) = 1, p′ (0) = 1, p′ (1) = 0.
Here n = 2 with
L1 (x) = 1 − x, L2 (x) = x
and so
M1 (x) = (1 − x)2 (1 + 2x), M2 (x) = x2 (1 − 2(x − 1)), N1 (x) = (1 − x)2 x, N2 (x) = x2 (x − 1).
Hence
p(x) = 0 · M1 (x) + 1 · M2 (x) + 1 · N1 (x) + 0 · N2 (x) = −x3 + x2 + x.
Next, we give the error when a smooth function is interpolated by a Hermite polynomial
interpolant.
Theorem: Let n be a positive integer. Given f ∈ C 2n [a, b] and distinct points {xj , j = 1, · · · , n} lying
in [a, b]. Let p be the (unique) Hermite polynomial interpolant of f over {xj }. Then for every
x ∈ [a, b],
n
f (2n) (ξ) Y
f (x) − p(x) = (x − xj )2 (5.7)
(2n)!
j=1
Proof: Fix some x ∈ [a, b]. If x is a node xj , then the result is clearly true. So suppose x is not a node.
For any z ∈ R, define ψ(z) = f (z) − p(z) − αφ(z) where φ is the product in (5.7) and α ∈ R
is chosen such that ψ(x) = 0. Thus ψ has at least n + 1 zeroes in [a, b], namely, x, x1 , · · · , xn .
By Rolle’s theorem, ψ ′ has at least n zeroes different from the points just enumerated. Since ψ ′
also vanishes at the nodes by definition, ψ ′ has at least 2n distinct zeroes in [a, b]. By repeatedly
applying Rolle’s theorem, ψ (2n) has some zero ξ ∈ (a, b):
Noting that p is a polynomial of degree at most 2n − 1 while the leading term of φ is x2n , it
follows that 0 = f (2n) (ξ) − α (2n)! and the claim follows upon solving for α. Next, we show
that ξ = ξ(x) is a continuous function. This is accomplished by using the Newton form of
interpolation.
Let xj,k → xj as k → ∞ for 1 ≤ j ≤ n. Assume that every element of the set {xj,k , xi , 1 ≤
i, j ≤ n, k ≥ 0} is distinct. Let qk be the interpolating polynomial of f of degree 2n − 1 at the
nodes x1 , x1,k , x2 , x2,k , . . . , xn , xn,k . Then
qk (x) = f (x1 ) + f [x1 , x1,k ](x − x1 ) + f [x1 , x1,k , x2 ](x − x1 )(x − x1,k )
+f [x1 , x1,k , x2 , x2,k ](x − x1 )(x − x1,k )(x − x2 ) + . . .
+f [x1 , x1,k , . . . , xn−1 , xn−1,k , xn ](x − x1 )(x − x1,k ) . . . (x − xn−1 )(x − xn−1,k )
+f [x1 , x1,k , . . . , xn , xn,k ](x − x1 )(x − x1,k ) . . . (x − xn−1 )(x − xn−1,k )(x − xn ).
f (2n) (ξ(x))
= f [x1 , x1 , . . . , xn , xn , x].
(2n)!
In particular, for n = 2 (two nodes x_1 < x_2), the maximum of (x − x_1)²(x − x_2)² over [x_1, x_2] is (x_2 − x_1)⁴/16, so (5.7) gives
|f(x) − p(x)| ≤ max_{x_1≤ξ≤x_2} |f^(4)(ξ)| · |x_2 − x_1|⁴ / 384.
We had mentioned earlier that polynomial interpolation using equally spaced points is unsta-
ble unless there is a small number of points, say, fewer than five points. Consider interpolation
of the function f (x) = (1 + 16x2 )−1 for x ∈ [−1, 1]. The left graph in Figure 5.1 shows the case
with 11 equally spaced points. Note the occurrence of huge oscillations near the end points.
This is known as Runge phenomenon. The optimal choice of points (Chebyshev points) cluster
near the end points. The right graph in Figure 5.1 clearly shows the superiority of the latter
case. An alternative to using one polynomial to interpolate all the data points is to use several.
This leads to the following topic.
5.3 Splines
Given data (x1 , y1 ), · · · , (xn , yn ). Assume x1 < · · · < xn . We employ several low–degree poly-
nomials, splines, to interpolate data. At the end of the last section, we had alluded to the
ill–conditioning of high degree polynomial interpolation with equally spaced points. Another
problem is that if the function to be interpolated is singular at some point, then a global inter-
polating polynomial can result in poor approximation everywhere. Splines do not suffer from
either of these two difficulties since they use many low degree piecewise polynomials. Hence any
bad behaviour in the underlying function is localized.
The simplest spline uses piecewise polynomials of degree one. In this case of linear splines,
the interpolating function is piecewise linear. In the interval, [xi , xi+1 ], the interpolant is the
line which passes through the two points (xi , yi ), (xi+1 , yi+1 ). The virtues of this method are
its simplicity and efficiency. However, depending on the application, the data may come from a
smooth function and the interpolant is at best continuous and in general non–differentiable.
Figure 5.1: Interpolation by a 10th degree polynomial using equally spaced points (left) and
Chebyshev points (right).
Theorem: Let f be twice continuously differentiable on [a, b]. Let p be the linear spline interpolant of f at
the nodes x1 , · · · , xn . Then for every x ∈ [a, b],
|f(x) − p(x)| ≤ max_{a≤c≤b} |f″(c)| · h²/8,   h = max_{1≤i≤n−1} (x_{i+1} − x_i).
Proof: On [x_i, x_{i+1}], p is the linear interpolant of f at x_i, x_{i+1}, so by (5.4) with n = 2,
f(x) − p(x) = (f″(c_i)/2)(x − x_i)(x − x_{i+1}) for some c_i, and |(x − x_i)(x − x_{i+1})| ≤ (x_{i+1} − x_i)²/4. Hence
|f(x) − p(x)| ≤ ( |f″(c_i)| / 2 ) · (x_{i+1} − x_i)² / 4
and the result of the theorem follows immediately.
In quadratic splines, we use a piecewise quadratic interpolant. On the interval [xi , xi+1 ],
define the quadratic pi (x) = ai (x − xi )2 + bi (x − xi ) + ci . There are n − 1 intervals and so there
are 3(n − 1) parameters to be determined.
Since pi is an interpolant, it must satisfy pi (xi ) = yi , pi (xi+1 ) = yi+1 , i = 1, · · · , n − 1.
There are 2(n − 1) conditions. Since we still have extra degrees of freedom, why not require
the piecewise quadratic interpolant to be continuously differentiable: p_i′(x_{i+1}) = p_{i+1}′(x_{i+1}), i =
1, · · · , n − 2. Thus far, we have 3n − 4 conditions which is still one short. This last condition
is usually determined by specifying one of p′1 (x1 ), p′′1 (x1 ), p′n−1 (xn ), p′′n−1 (xn ). In the absence of
any knowledge, it is simplest to take the value as zero.
The most common type of splines is cubic splines – using a third degree polynomial in each
interval [xi , xi+1 ]. Since there are n − 1 intervals and each interval requires four coefficients to
define the cubic polynomial in that interval, there is a total of 4(n − 1) unknown coefficients
to determine. Let pi be the cubic polynomial on [xi , xi+1 ], i = 1, · · · , n − 1. The following are
reasonable conditions imposed on the polynomials:
1. pi (xi ) = yi , pi (xi+1 ) = yi+1 , i = 1, · · · , n − 1
2. p′i (xi+1 ) = p′i+1 (xi+1 ), i = 1, · · · , n − 2
3. p′′i (xi+1 ) = p′′i+1 (xi+1 ), i = 1, · · · , n − 2.
With these conditions, the interpolating function is at least twice continuously differentiable,
much smoother than linear splines.
There are 4n − 6 conditions and 4n − 4 unknowns and so two more conditions are needed.
There are many possibilities. For instance, in some applications, it may be desirable to have
p′1 (x1 ), p′n−1 (xn ) prescribed, say, to be zero. Another possibility is to specify p′′1 (x1 ), p′′n−1 (xn ).
One of the most popular methods, called natural cubic splines, is to set p′′1 (x1 ) = 0 =
p′′n−1 (xn ). This choice minimizes the curvature among all interpolants of the data.
Theorem: Let f be twice continuously differentiable on [a, b], where a = x1 < x2 < · · · < xn−1 < xn = b.
Let p be the natural cubic spline interpolating f at x1 , · · · , xn . Then
∫_a^b p″(x)² dx ≤ ∫_a^b f″(x)² dx.
Proof: Let g = f − p, so that g(x_i) = 0 for every i. Then
∫_a^b f″(x)² dx − ∫_a^b p″(x)² dx = ∫_a^b g″(x)² dx + 2 ∫_a^b p″(x)g″(x) dx.
The task at hand is to show that the last term is non-negative. On [x_i, x_{i+1}], let p = p_i, the cubic interpolant.
∫_a^b p″(x)g″(x) dx = ∑_{i=1}^{n−1} ∫_{x_i}^{x_{i+1}} p_i″(x)g″(x) dx
  = ∑_{i=1}^{n−1} [ (p_i″g′)(x_{i+1}) − (p_i″g′)(x_i) − ∫_{x_i}^{x_{i+1}} p_i‴(x)g′(x) dx ]
  = (p_{n−1}″g′)(x_n) − (p_1″g′)(x_1) − ∑_{i=1}^{n−1} ∫_{x_i}^{x_{i+1}} p_i‴ g′(x) dx
  = − ∑_{i=1}^{n−1} p_i‴ ( g(x_{i+1}) − g(x_i) )
  = 0.
In the above, we have used the conditions defining the natural cubic splines, including the fact
that g(xi ) = 0 for every i, and that p′′′
i is a constant since pi is cubic.
Theorem: Let f be four times continuously differentiable on [a, b] and p be the natural cubic spline inter-
polant of f at the nodes x1 , · · · , xn . Then for every x ∈ [a, b],
|f(x) − p(x)| ≤ max_{a≤ξ≤b} |f^(4)(ξ)| · 5h⁴/384,   h = max_{1≤i≤n−1} (x_{i+1} − x_i).
Example: Find the linear spline, quadratic spline and natural cubic spline which interpolate the data (0, 3), (1, −2), (2, 1).
For the quadratic spline, assume the quadratic interpolants over the two intervals have a con-
tinuous derivative at x = 1 and that the derivative of the interpolant vanishes at x = 0.
The linear spline is
p(x) = −5x + 3, x ∈ [0, 1];   p(x) = 3x − 5, x ∈ [1, 2].
For the quadratic spline, let p1 (x) = a1 x2 + b1 x + c1 and p2 (x) = a2 (x − 1)2 + b2 (x − 1) + c2
be the interpolating quadratics on [0, 1] and [1, 2], respectively. The equations determining the
coefficients are:
3 = p1 (0), −2 = p1 (1), −2 = p2 (1), 1 = p2 (2)
since pi are interpolants and
p′1 (1) = p′2 (1), p′1 (0) = 0
are the constraints on the derivatives. Solve these equations to obtain a1 = −5, b1 = 0, c1 =
3, a_2 = 13, b_2 = −10, c_2 = −2. Hence the piecewise quadratic interpolant is
p(x) = −5x² + 3, x ∈ [0, 1];   p(x) = 13(x − 1)² − 10(x − 1) − 2, x ∈ [1, 2].
The natural cubic spline requires a fair amount of calculations. Let the cubic in the first interval
be p_1(x) = a_1x³ + b_1x² + c_1x + d_1. From the conditions p_1(0) = 3, p_1(1) = −2, p_1″(0) = 0, we
obtain b_1 = 0, d_1 = 3, c_1 = −5 − a_1. Let the cubic in the second interval be p_2(x) = a_2(x −
1)³ + b_2(x − 1)² + c_2(x − 1) + d_2. From the conditions p_2(1) = −2, p_2(2) = 1, p_2″(2) = 0, p_2′(1) =
p_1′(1), p_2″(1) = p_1″(1), we obtain d_2 = −2, a_2 + b_2 + c_2 = 3, b_2 = −3a_2, c_2 = 3a_1 + c_1, b_2 = 3a_1.
Solve these equations to get a_2 = −2, b_2 = 6, c_2 = −1, a_1 = 2, c_1 = −7. Hence the natural cubic
spline is
p(x) = 2x³ − 7x + 3, x ∈ [0, 1];   p(x) = −2(x − 1)³ + 6(x − 1)² − (x − 1) − 2, x ∈ [1, 2].
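As a check, here is a hedged Python sketch using SciPy's CubicSpline with natural boundary conditions (assuming SciPy is available); it reproduces the example's spline at two sample points.

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0., 1., 2.])
y = np.array([3., -2., 1.])

# bc_type='natural' imposes zero second derivative at both end points
s = CubicSpline(x, y, bc_type='natural')

print(s(0.5), 2*0.5**3 - 7*0.5 + 3)                    # first piece: both -0.25
print(s(1.5), -2*0.5**3 + 6*0.5**2 - 0.5 - 2)          # second piece: both -1.25
```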
for some coefficients cj . Since p(xi ) = f (xi ) and p′ (xi ) = f ′ (xi ), it is easy to see that c0 =
f (xi ), c1 = f ′ (xi ). From p(xi+1 ) = f (xi+1 ) and p′ (xi+1 ) = f ′ (xi+1 ), it can be deduced that
The following is an error estimate of the Hermite cubic spline interpolant. Its proof is very
similar to the case of linear splines given earlier, using instead (5.7) with n = 2.
Theorem: Let f ∈ C 4 [a, b] and p be the Hermite cubic spline interpolant of f at the nodes x1 , · · · , xn .
Then for every x ∈ [a, b],
|f(x) − p(x)| ≤ max_{a≤ξ≤b} |f^(4)(ξ)| · h⁴/384,   h = max_{1≤i≤n−1} (x_{i+1} − x_i).
5.4 Approximation
Given f ∈ C[a, b] and ǫ > 0. Recall that the Weierstrass Approximation Theorem says that
there is some polynomial p so that kf − pk∞ < ǫ. We shall be using two norms in this section:
kgk∞ := max_{x∈[a,b]} |g(x)|,   kgk_2 := ( ∫_a^b g(x)² dx )^{1/2}.
Note that the degree of p in the above theorem can be large. Suppose we fix n, the degree of
the polynomial and pose the minimization problem
inf kf − qk∞ .
q∈Pn
Consider first the 2-norm with n = 1 on [a, b] = [0, 1]: minimize f(a_0, a_1) = ka_0 + a_1x − f(x)k_2² over a_0, a_1 ∈ R. By a direct calculation,
f(a_0, a_1) = a_0² + a_1²/3 + ∫_0^1 f(x)² dx − 2a_0 ∫_0^1 f(x) dx − 2a_1 ∫_0^1 x f(x) dx + a_0a_1.
Setting the partial derivatives with respect to a_0 and a_1 to zero yields a 2 × 2 linear system Aa = b with A = [ 2 1 ; 1 2/3 ].
The solution of this linear system must correspond to the unique minimum of f since the matrix
of second derivatives of f is A, which has positive eigenvalues. This means that f is a strictly
convex function and so its critical point is a global minimum of f .
The above method also works for any positive value of n. Unfortunately, if n is not small, the
resultant linear system is usually very ill-conditioned. Instead of using the basis {1, x, x², . . . , x^n},
it is better to use a basis of orthogonal polynomials.
Taking [a, b] = [−1, 1] for convenience, use Gram-Schmidt to obtain an orthogonal set of poly-
nomials of Pn , called the set of Legendre polynomials. The first few are given by
φ_0(x) = 1,   φ_1(x) = x,   φ_2(x) = (3x² − 1)/2.
These polynomials are normalized by the condition φj (1) = 1. Note that φj is a polynomial
of degree j and hφi , φj i = 0 if i 6= j. We now solve the same minimization problem for the
case n = 1 and [a, b] = [−1, 1] using the first two Legendre polynomials. The function to be
minimized becomes
Z 1 Z 1 Z 1
2 2 2a21 2
f (a0 , a1 ) = ka0 + a1 x − f (x)k2 = 2a0 + + f (x) dx − 2a0 f (x) dx − 2a1 xf (x) dx.
3 −1 −1 −1
The advantage of using orthogonal polynomials is that it is trivial to solve the resultant linear
system. The following is the main theorem for the least-squares solution of approximating a
function by a general set of basis functions. First define the L2 (a, b) inner product by
(f, g) = ∫_a^b f(x)g(x) dx,   f, g ∈ L²(a, b).
Theorem: Let {φ1 , . . . , φn } ⊂ C[a, b] be linearly independent and Sn be the span of {φj , 1 ≤ j ≤ n}. Given
f ∈ C[a, b]. The solution of the best approximation of f by a function in Sn in the L2 sense:
n
X
min f − ai φi
a∈Rn
i=1 2
In matrix notation the above equation can be written as Aa∗ = b. Note that A is a symmetric
matrix and since
n n n n 2
X X X X
T
a Aa = ai aj (φi , φj ) = ai φi ,
aj φj = ai φi > 0
i,j=1 i=1 j=1 i=1 2
for all non-zero vectors a. Hence A is positive definite, which implies that A is invertible. Since
the Hessian D 2 F (a) = 2A for all a ∈ Rn , it is positive definite, implying that F is a strictly
convex function of a. Consequently, a∗ is the unique global minimizer of F .
In the notation of the above theorem, it is not difficult to see that the error of the approxi-
mation satisfies
n n 2 1/2
X X
f− a∗i φi = kf k22 − a∗i φi .
i=1 2 i=1 2
Instead of the 2-norm, one can also pose the minimization problem in the ∞-norm.
We omit the proof of this theorem, but sketch some results that give rise to this theorem.
Define {φi , 1 ≤ i ≤ n} ⊂ C[a, b] to satisfy the Haar condition if
φ1 (x1 ) · · · φn (x1 )
.. .. 6= 0
D[x1 , . . . , xn ] := det . ··· .
φ1 (xn ) · · · φn (xn )
which can be shown by induction. The above matrix is known as the Vandermonde matrix.
The above theorem is a consequence of the following, which says that the best polynomial
interpolant measured in the infinity norm satisfies an equi-oscillation property: the interpolation
error oscillates between minimum and maximum points, and the magnitudes of the errors at
these extreme points are the same.
Theorem: Let {φ1 , . . . , φn } ⊂ C[a, b] satisfy the Haar condition. Suppose f ∈ C[a, b]. Define r = f −
Xn
ai φi , ai ∈ R. Then a minimizes krk∞ iff |r(xi )| = krk∞ for all i and r(xi ) = −r(xi−1 ), 1 ≤
i=1
i ≤ n for some strictly increasing sequence {xi , 0 ≤ i ≤ n} ⊂ [a, b].
n
X
We can set up a linear system that solves for the coefficient a as follows. Let p = ai φi
i=1
and {xi , 0 ≤ i ≤ n} be a given strictly increasing sequence
of points in [a, b]. According to the
above theorem, f (xi ) − p(xi ) = − f (xi−1 ) − p(xi−1 ) , or
or
n
X
aj φj (xi ) − (−1)i φj (x0 ) = f (xi ) − (−1)i f (x0 ), 0 ≤ i ≤ n.
j=1
min kc − f k∞ .
c∈R
Example: Solve the minimization problem for f (x) = ex on the interval [−1, 1]:
Note that f is a convex function. Hence the three extreme points in the equi-oscillation property
can be taken as x0 = −1, x1 ∈ (−1, 1) and x2 = 1. Let r denote the error as before. Then
r(−1) = −r(x1 ) = r(1) and r ′ (x1 ) = 0 since x1 is an interior critical point. Solve these equations
to obtain the solution a1 = (e − e−1 )/2, a0 = (e − a1 ln a1 )/2, x1 = ln a1 .
and r be the ratio of polynomials defined as in (5.8) with q0 = 1. The goal is to find the
coefficients pi and qi so that
∞
p(x) X i
f (x) − = ci x
q(x)
i=s
for some coefficients {ci } so that s is as large as possible. Toward this end, write
∞
X ∞
X
f (x)q(x) − p(x) = q(x) ci xi =: di xi . (5.9)
s=i s=i
Plugging in the expansions for f, p and q, it follows that the left-hand side is
Since there are four unknowns (p0 , p1 , q1 , q2 ), we expect to be able to set the coefficients of xi to
be zero for 0 ≤ i ≤ 3. The four resultant equations are
p_0 = 1,   p_1 = 1 + q_1,   q_1 + q_2 = −1/2,   q_1/2 + q_2 = −1/6.   (5.10)
The solution of the system is
p_0 = 1,   p_1 = 1/3,   q_1 = −2/3,   q_2 = 1/6.
[Figure 5.2: plots of e^x, the Padé approximant r(x), and the cubic Taylor polynomial f_3(x) on the interval [−5, 1].]
Note that the coefficient of x4 is non-zero. In Figure 5.2, it is clear that this Pade approximation
is superior to f3 (x), the Taylor’s expansion of ex up to degree three that also contains four
coefficients. However on [−1, 5], f3 (x) gives a better approximation because on this interval, r
goes to zero for large x and thus is not an appropriate choice to approximate ex .
In (5.9), typically s = m + n + 1. However, the following example shows that this does not
always occur: f (x) = a0 + a2 x2 + . . . , p(x) = p0 + p1 x, q(x) = q0 + q1 x with a2 6= 0. Equations
(5.10) for the unknowns are
p 0 = a0 , p 1 = a0 q 1 , a2 = 0.
Chapter 6
Numerical Differentiation and Integration
In this chapter, we approximate the operations of differentiation and integration. Given a dif-
ferentiable function, we can always find its derivative although this may not be inviting if the
function is complicated. Numerical differentiation approximates the derivative using only func-
tion values. Another instance where numerical differentiation is needed is in solving differential
equations. Numerical integration is clearly useful since there are many functions that cannot be
integrated analytically.
6.1 Numerical Differentiation
Recall that
f′(x) = lim_{h→0} ( f(x + h) − f(x) ) / h.
For a sufficiently small h, we have the approximation
f (x + h) − f (x)
f ′ (x) ≈ .
h
Let us study the error in this approximation known as forward difference. From Taylor’s expan-
sion (provided f is twice continuously differentiable),
f ′′ (c) 2
f (x + h) = f (x) + f ′ (x)h + h
2
for some c between x and x + h. Hence
f (x + h) − f (x) h
f ′ (x) − ≤ max |f ′′ (ξ)|.
h 2 a≤ξ≤b
This holds for all x ∈ (a, b). We say that this approximation is first–order accurate.
There is another first–order scheme known as backward difference:
f (x) − f (x − h)
f ′ (x) ≈
h
which has the same upper bound of the error as above.
In many applications, this is not accurate enough. A second–order accurate scheme, that is,
the error is bounded above by a term proportional to h2 is
f′(x) ≈ ( f(x + h) − f(x − h) ) / (2h).
This is called a centered difference scheme. To see this, subtract the Taylor's expansions
f(x ± h) = f(x) ± f′(x)h + (f″(x)/2)h² ± (f‴(c_±)/6)h³
to obtain
| f′(x) − ( f(x + h) − f(x − h) )/(2h) | ≤ (h²/6) max_{a≤ξ≤b} |f‴(ξ)|.
Example: Let f (x) = x3 . Then f ′ (1) = 3. The first–order and second–order finite difference schemes using
h = .1 yield
( f(1.1) − f(1) )/.1 = 3.31,   ( f(1.1) − f(.9) )/.2 = 3.01.
The second–order scheme is clearly superior. With h = .05, the forward and centered differences give 3.1525 and 3.0025, respectively: halving h roughly halves the first-order error and quarters the second-order error.
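A short Python sketch (not from the notes) of the two schemes applied to f(x) = x³ at x = 1; it also previews the roundoff issue discussed next.

```python
import numpy as np

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x**3              # f'(1) = 3
for h in [0.1, 0.05, 1e-5, 1e-12]:
    print(h, abs(forward_diff(f, 1.0, h) - 3), abs(central_diff(f, 1.0, h) - 3))
# For very small h (e.g. 1e-12) the error grows again because of cancellation
# in floating point arithmetic, as explained below.
```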
Since the finite difference error decreases like h or h2 , why not simply take h very small,
say, 10−100 so that the error would be insignificant? The answer is that this would be fine in
exact arithmetic. However in floating point arithmetic, there are roundoff errors. In particular,
cancellation is at work here since two nearly equal quantities are subtracted. Let us estimate
an optimal value of h > 0 for the second–order finite difference scheme.
Assume f̃ is the floating point representation of f with |f̃(z) − f(z)| ≤ ǫ_M for all z,
where ǫM is the unit roundoff. In this model, we ignore the floating point error of z itself. For
some |ǫ± | ≤ ǫM ,
f˜(x + h) − f˜(x − h) f (x + h) + ǫ+ − f (x − h) − ǫ−
f ′ (x) − = f ′ (x) −
2h 2h
f (x + h) − f (x − h) ǫM
≤ f ′ (x) − +
2h h
≤ E(h)
M h2 ǫM
E(h) = + , M = max |f ′′′ (ξ)|.
6 h a≤ξ≤b
The task is to minimize E(h), an upper bound of the error. The critical point of E is easily
computed from
M h ǫM
0 = E ′ (h) = − 2
3 h
1/3
3ǫM
to obtain h = . It is easy to check that this is a global minimum. As a rough
M
estimate, set ǫM = 10−16 for double precision arithmetic, and take 3M −1 = 10 to obtain the
value h = 10−5 .
Example: For the same example f(x) = x³ with f′(1) = 3, take h = 10⁻⁵; the centered difference error is then of the order 10⁻¹⁰, and it cannot be reduced much further by shrinking h because of roundoff.
If f is analytic and real-valued on the real axis, cancellation can be avoided altogether by taking a complex step. Replacing h by ih in Taylor's expansion,
f(x_0 + ih) = f(x_0) + ihf′(x_0) − (h²/2)f″(x_0) − (ih³/6)f‴(x_0) + · · · .
Since f(x_0) is real, this yields
f′(x_0) = Im( f(x_0 + ih) )/h + O(h²),
a second-order approximation that involves no subtraction of nearly equal numbers.
Example: Let f (x) = e10x . The relative error of the central difference scheme with h = 10−6 to evaluate
f ′ (1) is 1.5 × 10−11 , while using the above technique with h = 10−8 , the relative error is 1.8 ×
10−15 .
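A hedged Python sketch (assuming NumPy) of the complex-step idea just described, compared with a central difference for f(x) = e^{10x}.

```python
import numpy as np

def complex_step(f, x, h=1e-20):
    """f'(x) ~ Im(f(x + i*h))/h; no cancellation, so h can be taken tiny."""
    return np.imag(f(x + 1j * h)) / h

f = lambda x: np.exp(10 * x)
exact = 10 * np.exp(10.0)
print(abs(complex_step(f, 1.0) - exact) / exact)                  # ~ machine precision
print(abs((f(1 + 1e-6) - f(1 - 1e-6)) / 2e-6 - exact) / exact)    # central difference
```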
Another way to assess√ the accuracy of a difference scheme is to see its effect on a plane wave
f (x) = eikx , where i = −1 and k is a given wave number, or, frequency of a wave. Suppose the
wave has period L > 0 so that k = 2πn/L for n ≥ 0. Given a grid xj = jh for some positive grid
spacing h. The exact derivative at xj is f ′ (xj ) = ikeikxj . The centered difference approximation
of f ′ (xj ) is
ei(j+1)hk − ei(j−1)hk eijhk eihk − e−ihk sin(hk) ikxj
=i =i e .
2h h 2i h
Comparing with the exact derivative, it is readily seen that the wave number of the difference
scheme has changed from k to sin(hk)/h. When k = O(1), then sin(hk)/h ∼ k, approximating
well the exact wave number. However, the approximation gets progressively worse as k increases.
This method of measuring the accuracy of a difference scheme is meaningful when the given
function f is expanded as a Fourier series, a sum of complex exponentials.
Note that the forward difference scheme applied to f (x) = eikx yields
eihk − 1 ikxj
i e .
ih
Here the wave number is complex, but it still approximates k well when k = O(1).
Second Derivative
It is not difficult to derive a second–order finite difference approximation of the second derivative.
Assume that f is four times continuously differentiable on (a, b). From Taylor's expansion, for some c_± in between x and x ± h,
f(x ± h) = f(x) ± f′(x)h + (f″(x)/2)h² ± (f‴(x)/6)h³ + (f^(4)(c_±)/24)h⁴.
Adding the two expansions and solving for f″(x) gives
| f″(x) − ( f(x + h) − 2f(x) + f(x − h) )/h² | ≤ (h²/12) max_{a≤ξ≤b} |f^(4)(ξ)|.
Now we estimate an optimal value of h to calculate the second derivative in the presence
of roundoff errors. Assume f˜ is the floating point representation of f as before. For some
|ǫi | ≤ ǫM , i = 1, 2, 3,
where
M h2 4ǫM
E(h) = + 2 , M = max |f ′′′′ (ξ)|.
12 h a≤ξ≤b
The task is to minimize E(h), an upper bound of the error. The critical point of E is easily
computed from
M h 8ǫM
0 = E ′ (h) = − 3
6 h
1/4
48ǫM
to obtain h = . It is easy to check that this is a global minimum. A rough estimate
M
of the optimal value is h = 10−4 in double precision arithmetic.
Suppose the task is to evaluate f (k) (x) where k is large, say, k = 100. A finite difference
formula would be impossibly long and the resultant scheme may suffer from serious cancellation
errors. If f is analytic, the Cauchy integral formula offers a much better solution:
I
k! f (z)
f (k) (x) = dz,
2πi Γ (z − x)k+1
where Γ is any circle in the complex plane with centre at x.
Newton–Cotes Formulae
A simple scheme is to subdivide the interval [a, b] into n equal intervals. Define for 0 ≤ i ≤ n,
x_i = a + ih,   h = (b − a)/n.
Approximate f on each interval [xi−1 , xi ] by a straight line. The integral over this interval is
approximated by the area of the trapezoid:
∫_a^b f(x) dx = ∑_{i=1}^{n} ∫_{x_{i−1}}^{x_i} f(x) dx ≈ ∑_{i=1}^{n} ( ( f(x_{i−1}) + f(x_i) ) / 2 ) h.
Example: Let f (x) = x4 . Find the integral I4 over [0, 1] using the trapezoidal rule with n = 4 and n = 8
intervals.
For n = 4, h = 1/4. Hence
I_4 = ( f(0) + f(1) )/8 + ( f(1/4) + f(2/4) + f(3/4) )/4 = 0.2207.
The exact answer is .2 and so the error is about .02. For n = 8, I8 = .2052 with an error of
about .005 which is four times smaller than that of I4 .
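A hedged Python sketch (assuming NumPy) of the composite trapezoidal and Simpson rules, applied to the example f(x) = x⁴ on [0, 1].

```python
import numpy as np

def trapezoid(f, a, b, n):
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    return h * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))

def simpson(f, a, b, n):
    """Composite Simpson's rule; n must be even."""
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    return (h / 3) * (f(x[0]) + f(x[-1])
                      + 4 * np.sum(f(x[1:-1:2]))    # odd-indexed nodes
                      + 2 * np.sum(f(x[2:-1:2])))   # even-indexed interior nodes

f = lambda x: x**4
print(trapezoid(f, 0, 1, 4), trapezoid(f, 0, 1, 8))   # 0.2207..., 0.2052...
print(simpson(f, 0, 1, 4))                             # 0.2005...
```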
The proof of the error of the trapezoidal rule requires
Theorem: Mean Value Theorem for Integrals. Let f be continuous on [a, b] and let g be an integrable
function on [a, b] that does not change sign on this interval. Then there exists some c ∈ (a, b) so
that Z b Z b
f (x)g(x) dx = f (c) g(x) dx.
a a
Let Eh be the difference between the exact integral and the value given by the trapezoidal
rule.
Theorem: Let f be twice continuously differentiable on [a, b]. Then
|E_h| ≤ ( (b − a)/12 ) h² max_{a≤c≤b} |f″(c)|.
Proof: On [xi−1 , xi ], the trapezoidal rule interpolates f by a linear function p(x). From the interpolation
error (5.4),
f ′′ (cx )
f (x) − p(x) = (x − xi−1 )(x − xi ), x ∈ [xi−1 , xi ]
2
where cx lies in between (xi−1 , xi ). Hence
Z xi Z
1 xi ′′
(f − p)(x) dx = f (cx )(x − xi−1 )(x − xi ) dx
xi−1 2 xi−1
Z
f ′′ (ci ) xi
= (x − xi−1 )(x − xi ) dx
2 xi−1
3
f ′′ (ci ) h
= −
2 6
for some ci ∈ (xi−1 , xi ) by the mean value theorem for integrals. (Continuity of f ′′ (cx ) with
respect to x follows from (5.5).) Thus the error over the entire interval is
Z b n Z xi
X n
h3 X ′′
(f − p)(x) dx = (f − p)(x) dx = − f (ci ).
a xi−1 12
i=1 i=1
This leads to
h3 b−a 2
|Eh | ≤ n max |f ′′ (c)| = h max |f ′′ (c)|.
12 a≤c≤b 12 a≤c≤b
The trapezoidal rule approximates f by a linear function on each subinterval. The next
method uses a quadratic function. Consider the interval [x2i , x2i+2 ], let pi be the unique poly-
nomial interpolant of (x2i , f (x2i )), (x2i+1 , f (x2i+1 )), (x2i+2 , f (x2i+2 )) of degree two or lower:
∫_a^b f(x) dx ≈ ∑_{i=0}^{m−1} (h/3) [ f(x_{2i}) + 4f(x_{2i+1}) + f(x_{2i+2}) ]
  = (h/3) f(a) + (h/3) f(b) + (4h/3) ∑_{i=0}^{m−1} f(x_{2i+1}) + (2h/3) ∑_{i=1}^{m−1} f(x_{2i}),
where n = 2m is even. This is the composite Simpson's rule. Let E_h be the error as before.
Theorem: Let f be four times continuously differentiable on [a, b]. Then
|E_h| ≤ ( (b − a)/180 ) h⁴ max_{a≤c≤b} |f^(4)(c)|.
Let us follow the approach used in the proof of the corresponding result for the trapezoidal
rule. On [x2i , x2i+2 ], Simpson’s rule interpolates f by a quadratic function p(x). From the
interpolation error (5.4),
f ′′′ (cx )
f (x) − p(x) = (x − x2i )(x − x2i+1 )(x − x2i+2 ), x ∈ [x2i , x2i+2 ].
6
We cannot use the mean value theorem for integrals as before because the cubic polynomial
changes sign in [x2i , x2i+2 ].
Proof: We first estimate the error of the integral over [a, a + 2h]. Using the symmetry property and definition of the divided differences,
f[a, a + h, a + 2h, x] = f[a, a + h, a + 2h, a + h] + f[a, a + h, a + 2h, a + h, x](x − a − h).
Let p be the unique polynomial interpolant over [a, a + 2h] of degree at most two. Use the
interpolation error (5.4) to obtain the integration error on [a, a + 2h] as
∫_a^{a+2h} ( f(x) − p(x) ) dx = ∫_a^{a+2h} f[a, a + h, a + 2h, x] (x − a)(x − a − h)(x − a − 2h) dx
  = ∫_a^{a+2h} { f[a, a + h, a + 2h, a + h] + f[a, a + h, a + 2h, a + h, x](x − a − h) } (x − a)(x − a − h)(x − a − 2h) dx
  = f[a, a + h, a + 2h, a + h] ∫_a^{a+2h} (x − a)(x − a − h)(x − a − 2h) dx
    + ∫_a^{a+2h} f[a, a + h, a + 2h, a + h, x] (x − a)(x − a − h)²(x − a − 2h) dx
  = 0 + ∫_a^{a+2h} ( f^(4)(ξ(x))/24 ) (x − a)(x − a − h)²(x − a − 2h) dx
  = ( f^(4)(η)/24 ) ∫_a^{a+2h} (x − a)(x − a − h)²(x − a − 2h) dx
  = −( f^(4)(η)/24 ) (4h⁵/15).
In the above η, ξ(x) ∈ (a, a + 2h). We have used the mean value theorem for integrals and the
fact that the integral of (x−a)(x−a−h)(x−a−2h) is zero by symmetry. (In fact, the additional
point a + h has been chosen so that the mean value theorem for integrals can be used.) Now
sum over intervals to obtain the global error estimate
n h5
|Eh | ≤ max |f (4) (c)|
a≤c≤b 2 90
from which the desired result follows.
Example: Let f (x) = x4 . Find the integral I4 over [0, 1] using Simpson’s method with n = 4 intervals.
For n = 4, h = 1/4. Hence
I_4 = ( f(0) + f(1) )/12 + ( f(1/4) + f(3/4) )/3 + f(1/2)/6 = 0.2005.
The error is .0005 which is much smaller than that of the trapezoidal rule. For n = 8, I8 =
.200033 with an error of about .000033 which is 16 times smaller than the error of I4 .
Z π
Example: Find the n in Simpson’s method so that sin2 x dx is approximated with an error smaller than
0
10−6 .
1 − cos 2x
Since sin2 x = , its fourth derivative is −8 cos 2x which is bounded above by 8. From
2
πh4
the error of Simpson’s rule, we set 8 < 10−6 . Solve this to get h < .0517 or n = π/h ≈ 60.77.
180
The first even number larger than this is 62 which is the number of intervals needed.
Figure 6.1: Numerical integration by the trapezoidal rule ‘o’ and Simpson’s rule ‘+’. The
integrands are f (x) = esin x (left) and f (x) = (1 + x2 )−1 (right) over the interval [0, 2π].
In Figure 6.1, observe the difference in the integration errors for a periodic function (left)
and a non–periodic function (right) for trapezoidal and Simpson’s rules. This property will be
illustrated in a later section.
Trapezoidal and Simpson's rules are examples of closed Newton–Cotes Formulae, where the integrand is approximated on each subinterval by a polynomial and is evaluated at each node $x_i$. Open Newton–Cotes Formulae also approximate the integrand on each subinterval by a polynomial, but the integrand is evaluated only strictly between the nodes. These formulae are useful if the integrand has some singular behaviour at an end point or the interval of integration is infinite.
The midpoint rule is the simplest open Newton–Cotes formula. With the same equally
spaced nodes xi as before, this formula is
$$\int_a^b f(x)\, dx \approx \sum_{i=0}^{n-1} f\Big(\frac{x_i + x_{i+1}}{2}\Big)\, h.$$
It approximates f on $[x_i, x_{i+1}]$ by the constant $f\big(\frac{x_i + x_{i+1}}{2}\big)$. This explains why it is a member of the Newton–Cotes family.
Let Eh be the difference between the exact integral and the value given by the midpoint rule.
$$|E_h| \le \frac{b-a}{24}\, h^2 \max_{a\le c\le b} |f''(c)|.$$
Proof: On $[a, a+h]$, the midpoint rule interpolates f by the constant function $y = f(a + \frac h2)$. From Taylor's expansion, there is some $c_x \in (a, a+h)$ so that
$$f(x) = f\Big(a + \frac h2\Big) + f'\Big(a + \frac h2\Big)\Big(x - a - \frac h2\Big) + \frac{f''(c_x)}{2}\Big(x - a - \frac h2\Big)^2, \qquad x \in [a, a+h].$$
$$\frac{4T(h/2) - T(h)}{3}$$
approximates the exact integral to $O(h^4)$. It can be checked that this is exactly Simpson's rule with step size h/2. One can continue to calculate T(h/4), resulting in an extrapolated scheme which is accurate to $O(h^6)$. One attraction of these schemes is that they are quite efficient since all function values can be reused at subsequent levels. For instance, the evaluation of T(h) requires n + 1 function values while the evaluation of T(h/2) requires only n extra function evaluations if the previous n + 1 values have been saved.
Example: Let $f(x) = x^4$. Earlier we had calculated the integral from 0 to 1 using the trapezoidal rule with n = 4 and n = 8 intervals: $I_4 = .2207$, $I_8 = .2052$. Using Richardson extrapolation, a fourth order approximation is
$$\frac{4I_8 - I_4}{3} = \frac{4(.2052) - .2207}{3} = .200033.$$
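A minimal Python sketch of this extrapolation (illustrative, not code from the notes); the trapezoid helper is the standard composite rule.

    def trapezoid(f, a, b, n):
        """Composite trapezoidal rule with n equal subintervals."""
        h = (b - a) / n
        return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

    def richardson(f, a, b, n):
        """One level of Richardson extrapolation: (4 T(h/2) - T(h)) / 3."""
        coarse = trapezoid(f, a, b, n)        # T(h)
        fine = trapezoid(f, a, b, 2 * n)      # T(h/2)
        return (4 * fine - coarse) / 3

    # For f(x) = x^4 on [0, 1] with n = 4 this reproduces roughly 0.200033.
    print(richardson(lambda x: x**4, 0.0, 1.0, 4))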
Adaptive Quadrature
In all numerical integration schemes so far, the nodes are equally spaced. This is not wise when a function varies rapidly in one region but slowly in others. A scheme with equally spaced nodes must take a very small step size h to resolve the function in the rapidly varying region, which wastes computational resources in the regions where the function changes slowly. A solution is to adapt the step size to the local behaviour: in rapidly varying regions the step size is small, while in slowly varying regions it can be larger.
Example: $\int_{-1}^{1} (1 + \sin e^{5x})\, dx$ is rapidly oscillatory near x = 1 and slowly varying near the other end. An adaptive quadrature is ideal for this integration.
The following strategy detects whether the function is varying rapidly or not. Use a scheme
such as the trapezoidal rule to find Ih , an approximation to
$$I = \int_a^b f(x)\, dx$$
with uniform step size h. Repeat the calculation now with step size h/2 to obtain the approximation $I_{h/2}$. Let $\{x_i\}$ be the set of nodes separated by h and define $x_{i-\frac12} = \frac{x_{i-1} + x_i}{2}$. The trapezoidal rule with step size h applied to $[x_{i-1}, x_i]$ gives
$$I_i := \int_{x_{i-1}}^{x_i} f(x)\, dx = T_{i,h} - \frac{h^3}{12} f''(x_{i-\frac12}) + O(h^5), \qquad T_{i,h} = \frac{f(x_{i-1}) + f(x_i)}{2}\, h,$$
while the trapezoidal rule with step size h/2 yields
$$I_i = T_{i,h/2} - \frac{h^3}{96}\Big( f''(x_{i-\frac14}) + f''(x_{i-\frac34}) \Big) + O(h^5),$$
where $x_{i-\frac j4} = a + h\big(i - \frac j4\big)$. The above assumes $f \in C^4[a,b]$. The following calculation
$$f''(x_{i-\frac14}) + f''(x_{i-\frac34}) = f''(x_{i-\frac12}) + \frac h4 f'''(x_{i-\frac12}) + O(h^2) + f''(x_{i-\frac12}) - \frac h4 f'''(x_{i-\frac12}) + O(h^2) = 2 f''(x_{i-\frac12}) + O(h^2)$$
shows that
$$I_i - T_{i,h/2} = -\frac{h^3}{48} f''(x_{i-\frac12}) + O(h^5).$$
Observe that
$$\frac{T_{i,h/2} - T_{i,h}}{3} = -\frac{h^3}{48} f''(x_{i-\frac12}) + O(h^5) = I_i - T_{i,h/2} + O(h^5).$$
Let ǫ > 0 be a tolerance for the approximate integral. The adaptive quadrature scheme requires that the error in each interval $[x_{i-1}, x_i]$ be no more than $\epsilon h/(b-a)$. Hence if $|T_{i,h/2} - T_{i,h}|/3 \le \epsilon h/(b-a)$, then $T_{i,h/2}$ is accepted as an approximation on $[x_{i-1}, x_i]$. Otherwise calculate $T_{i,h/4}, T_{i,h/8}, \ldots$ until the tolerance on $[x_{i-1}, x_i]$ is satisfied; a sketch of this strategy appears below. It should be emphasized that the above strategy is only a heuristic and it can fail.
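Here is one way the heuristic might be organized as a recursive routine in Python (an illustrative sketch, not code from the notes): an interval is accepted when $|T_{h/2} - T_h|/3$ is below its share of the tolerance, and is bisected otherwise, which mimics the $\epsilon h/(b-a)$ scaling above.

    import math

    def adaptive_trapezoid(f, a, b, tol):
        """Recursive adaptive trapezoidal rule driven by the estimate (T_{h/2} - T_h)/3."""
        coarse = 0.5 * (b - a) * (f(a) + f(b))                 # T_h on [a, b]
        mid = 0.5 * (a + b)
        fine = 0.25 * (b - a) * (f(a) + 2 * f(mid) + f(b))     # T_{h/2} on [a, b]
        if abs(fine - coarse) / 3 <= tol:
            return fine
        # Otherwise bisect and give half the tolerance to each piece.
        return (adaptive_trapezoid(f, a, mid, tol / 2) +
                adaptive_trapezoid(f, mid, b, tol / 2))

    # The rapidly oscillating example above; the tolerance is arbitrary.
    print(adaptive_trapezoid(lambda x: 1 + math.sin(math.exp(5 * x)), -1.0, 1.0, 1e-4))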
We conclude this section with the most important theoretical result in numerical integration.
We begin with some definitions. Given some continuous function f , define the quadrature scheme
$$Q_n(f) = \sum_{i=1}^n a_{in}\, f(x_{in}), \qquad n \ge 1,$$
where $a_{in}$ are weights and $x_{in}$ are nodes associated with the scheme. We say that $\{Q_n\}$ is consistent if there is some function $g : \mathbb{N} \to \mathbb{N}$ with $g(n) \to \infty$ as $n \to \infty$ so that $Q_n(p)$ is exact for every polynomial p with degree at most $g(n)$:
$$Q_n(p) = \int_0^1 p(x)\, dx.$$
If {Qn } is exact for constants and all weights {ain } are non-negative, then {Qn } is stable and
the stability constant can be taken as 1. To see this, observe that the scheme is exact for the
constant function f (x) = 1. Therefore for all n ≥ 1,
$$\sum_{i=1}^n a_{in} = Q_n(1) = \int_0^1 1\, dx = 1$$
implies that
$$\sup_{n\ge1} \sum_{i=1}^n |a_{in}| = \sup_{n\ge1} \sum_{i=1}^n a_{in} = 1.$$
The result is that for a consistent scheme, stability is equivalent to convergence. We saw an
analogous result for interpolation.
Proof: Suppose {Qn } is stable. Given ǫ > 0, we need to find some integer N so that for every n ≥ N ,
$$\Big| Q_n(f) - \int_0^1 f(x)\, dx \Big| < \epsilon, \qquad f \in C[0,1].$$
Let $f \in C[0,1]$. From the Weierstrass Approximation Theorem, there is some polynomial p so that
$$\|f - p\|_\infty < \frac{\epsilon}{1+C},$$
where C is the stability constant. Let m be the degree of p and choose N so large that $g(n) \ge m$ for every $n \ge N$; this is possible since $g(n) \to \infty$. Then for every $n \ge N$,
$$\Big| Q_n(f) - \int_0^1 f(x)\, dx \Big| \le |Q_n(f) - Q_n(p)| + \Big| Q_n(p) - \int_0^1 p(x)\, dx \Big| + \Big| \int_0^1 (p - f)(x)\, dx \Big|$$
$$\le \sum_{i=1}^n |a_{in}|\, |f(x_{in}) - p(x_{in})| + 0 + \frac{\epsilon}{1+C} \le C\, \|f - p\|_\infty + \frac{\epsilon}{1+C} < \epsilon.$$
It is not difficult to check that for each $n \ge 1$, $Q_n, I : C[0,1] \to \mathbb{R}$ are linear operators with
$$\|Q_n\|_\infty = \sum_{i=1}^n |a_{in}|, \qquad \|I\|_\infty = 1.$$
Recall that
$$\|Q_n\|_\infty = \sup_{f \in C[0,1]\setminus\{0\}} \frac{|Q_n(f)|}{\|f\|_\infty}.$$
For each fixed f, the convergence of $Q_n(f)$ implies that $\sup_{n\ge1} |Q_n(f)| \le C_f$, where $C_f$ is a real number depending on f. By the Principle of Uniform Boundedness, $\sup_{n\ge1} \|Q_n\|_\infty < \infty$. This means that $\{Q_n\}$ is stable.
The above integral now has a weaker singularity at t = 0 and it can be routinely integrated
numerically. It is possible to weaken the singularity further by performing one more integration
by parts:
$$\int_0^1 \sqrt{t}\, \cos t\, dt = \frac23 \cos 1 + \frac23 \int_0^1 t^{3/2} \sin t\, dt.$$
We shall see later that the smoother the integrand (or the weaker the singularity), the faster the
convergence of the numerical solution to the exact value.
One can also subtract away any singular term in the integrand, assuming that it can be
integrated analytically. The remaining smooth term can be integrated numerically.
Example:
$$\int_0^1 \frac{\sin x}{\sqrt{x}}\, dx = \int_0^1 \frac{\sin x - x}{\sqrt{x}} + \frac{x}{\sqrt{x}}\, dx = \frac23 + \int_0^1 \frac{\sin x - x}{\sqrt{x}}\, dx.$$
The integrand in the last integral behaves like O(x5/2 ) at the origin and has a much weaker
singularity compared to the original integrand.
Example: Here is a technique for an infinite interval of integration. Consider
$$I = \int_0^\infty \frac{dx}{1 + x^{10}}.$$
Write the integral as the sum
$$\int_0^1 \frac{dx}{1 + x^{10}} + \int_1^\infty \frac{dx}{1 + x^{10}}.$$
The first integral can be routinely integrated numerically. The second can be converted to a
proper integral using the substitution t = x−1 :
$$\int_1^\infty \frac{dx}{1 + x^{10}} = \int_1^0 \frac{-t^{-2}\, dt}{1 + t^{-10}} = \int_0^1 \frac{t^8\, dt}{1 + t^{10}},$$
which can easily be integrated numerically.
We split the integral into one over [0, 1] and another one over [1, ∞) because the substitution $t = x^{-1}$ cannot handle the point x = 0. Another possible substitution is $t = (1+x)^{-1}$. In this case, it is no longer necessary to split up the integral and we obtain
$$I = \int_0^1 \frac{t^8\, dt}{t^{10} + (1-t)^{10}}.$$
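As a sanity check (not part of the notes), the split form and the single-substitution form can be compared numerically; both are smooth integrands on [0, 1], so any standard rule handles them. The sketch below uses a plain composite Simpson's rule.

    def simpson(f, a, b, n):
        h = (b - a) / n
        s = f(a) + f(b)
        s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
        s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
        return h / 3 * s

    # Split form: [0, 1] directly plus the tail mapped by t = 1/x.
    split = (simpson(lambda x: 1 / (1 + x**10), 0.0, 1.0, 200) +
             simpson(lambda t: t**8 / (1 + t**10), 0.0, 1.0, 200))

    # Single substitution t = 1/(1 + x) over [0, 1].
    whole = simpson(lambda t: t**8 / (t**10 + (1 - t)**10), 0.0, 1.0, 200)

    print(split, whole)   # the two values should agree closely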
Example: Besides substitution, which only works if a good substitution is known, another way to handle an
infinite interval of integration is to truncate the domain. Let R be a number to be determined.
$$I = \int_0^\infty e^{-x} \cos^2(x^2)\, dx = \int_0^R e^{-x} \cos^2(x^2)\, dx + \int_R^\infty e^{-x} \cos^2(x^2)\, dx.$$
If an error of 2ǫ or less is desired, then use any method to evaluate the first integral $I_R$ to within ǫ. Now
$$\int_R^\infty e^{-x} \cos^2(x^2)\, dx \le \int_R^\infty e^{-x}\, dx = e^{-R} \le \epsilon,$$
which means that we can choose $R = -\ln \epsilon$. Suppose the numerical integration of $I_R$ yields $\tilde I_R$. We know that $|I_R - \tilde I_R| \le \epsilon$. Therefore the error of the computation is
$$|I - \tilde I_R| \le |I - I_R| + |I_R - \tilde I_R| \le 2\epsilon.$$
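A short Python sketch of this strategy (illustrative; the number of subintervals is an assumption chosen generously rather than derived from an error bound):

    import math

    def simpson(f, a, b, n):
        h = (b - a) / n
        s = f(a) + f(b)
        s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
        s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
        return h / 3 * s

    eps = 1e-6
    R = -math.log(eps)        # tail bound: the integral beyond R is at most e^{-R} = eps
    approx = simpson(lambda x: math.exp(-x) * math.cos(x * x)**2, 0.0, R, 4000)
    print(approx)             # total error should be on the order of 2 * eps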
Before doing the truncation, sometimes it may be wise to do an integration by parts or use
substitution so that the new integrand decays more rapidly at infinity. This way, R does not
need to be very large for a more efficient quadrature. For instance,
$$\int_1^\infty \frac{\sin x}{x^2}\, dx = \cos 1 - 2 \int_1^\infty \frac{\cos x}{x^3}\, dx.$$
To obtain an accuracy of 10−6 , truncation on the left integral requires integration to R = O(106 ),
while R = O(103 ) for the integral on the right. Clearly, the second integral requires less work
to evaluate numerically.
Assume that |f (j) (x)| = O(x−m−j ) as x → ∞ for all j ≥ 0 and some m > 1. Truncation of
domain means that we calculate IN , that is, integrate only in [a, b] for some b = b(N ). By taking
a clever choice of b, we can obtain a simple method with quite acceptable errors. Take b = N π
for some large positive integer N . The error is
$$E_N = I - I_N = \int_{N\pi}^\infty f(x) \sin x\, dx = (-1)^N f(N\pi) + \int_{N\pi}^\infty f'(x) \cos x\, dx.$$
If simple truncation is used, that is, approximate I by $I_N$, then using the bound on f(x) for large x, we get $|E_N| = O(N^{1-m})$. To see this, observe that there is some C so that $|f(x)|\, x^m \le C$ for x large enough. Therefore for N large enough,
$$|E_N| \le C \int_{N\pi}^\infty \frac{dx}{x^m} = C_1 N^{1-m}, \qquad C_1 = \frac{C\pi^{1-m}}{m-1}.$$
By adding the simple correction $(-1)^N f(N\pi)$ to $I_N$, the error now behaves like $O(N^{-m-1})$, which can be a drastic improvement over the estimate $I_N$. In this case, the approximation to I is
$$\int_a^{N\pi} f(x) \sin x\, dx + (-1)^N f(N\pi).$$
Of course, $b = N\pi$ has been chosen so that $\sin(N\pi) = 0$: integrating by parts once more, the boundary term $f'(N\pi)\sin(N\pi)$ vanishes, thus gaining extra powers of accuracy. We can add more corrections but the complexity of the expression $f^{(2j)}$ typically increases exponentially with j.
How would you design a similar method for
$$\int_a^\infty f(x) \cos x\, dx?$$
Example: As a final example, we make a transformation so that an integral on a finite domain is replaced
by one on an infinite domain! The reason is that the second integrand is smooth and decays
exponentially quickly at infinity. Consider
$$I = \int_{-1}^1 \frac{dy}{\sqrt{1 - y^2}}.$$
Note that the integrand has square root singularities at the end points. Define $y = \tanh x$. Then
$$I = \int_{-\infty}^\infty \frac{dx}{\cosh x},$$
where the integrand now is smooth and decays like e−|x| for large |x|. Now it is easy to estimate
I by doing truncation.
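A small Python sketch of the transformed integral (illustrative; the truncation point and node count are assumptions):

    import math

    def trapezoid(f, a, b, n):
        h = (b - a) / n
        return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

    # int_{-1}^{1} dy / sqrt(1 - y^2) = pi, transformed to the integral of sech x over the real line.
    L = 20.0    # the tail beyond |x| = L contributes only about 2 e^{-L}
    approx = trapezoid(lambda x: 1.0 / math.cosh(x), -L, L, 400)
    print(approx, math.pi)   # the two numbers agree to many digits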
We have given an error bound in a previous section. We now give several other bounds on the error
$$E_n = \int_a^b f(x)\, dx - T_n$$
depending on the smoothness of f: the smoother f is, the smaller the error.
If the integrand f is not twice continuously differentiable, then the error of the trapezoidal
rule decays at a slower rate than O(h2 ). We give two results in this direction. First assume that
f ′ ∈ L1 (a, b), that is, f ′ is integrable. Let
$$\|g\|_1 \equiv \int_a^b |g(x)|\, dx$$
denote the norm of any $g \in L^1(a,b)$. The error in the interval $[x_{i-1}, x_i]$ is
$$\epsilon_i \equiv \frac h2\big( f(x_{i-1}) + f(x_i) \big) - \int_{x_{i-1}}^{x_i} f(x)\, dx.$$
By integration by parts,
$$\epsilon_i = \int_{x_{i-1}}^{x_i} (x - x_{i-\frac12})\, f'(x)\, dx, \qquad x_{i-\frac12} = \frac{x_{i-1} + x_i}{2}.$$
Thus
$$|E_n| \le \sum_{i=1}^n |\epsilon_i| \le \frac h2 \int_a^b |f'(x)|\, dx = \frac h2 \|f'\|_1 = O(h).$$
Now suppose that $f'' \in L^1(a,b)$; then apply integration by parts once more to obtain
$$\epsilon_i = -\frac12 \int_{x_{i-1}}^{x_i} \Big[ (x - x_{i-\frac12})^2 - \Big(\frac h2\Big)^2 \Big] f''(x)\, dx.$$
Consequently
$$|E_n| \le \sum_{i=1}^n |\epsilon_i| \le \frac12 \sum_{i=1}^n \int_{x_{i-1}}^{x_i} \Big[ \Big(\frac h2\Big)^2 - (x - x_{i-\frac12})^2 \Big] |f''(x)|\, dx \qquad (*)$$
$$\le \frac12 \Big(\frac h2\Big)^2 \sum_{i=1}^n \int_{x_{i-1}}^{x_i} |f''(x)|\, dx = \frac{h^2}{8} \|f''\|_1 = O(h^2).$$
This implies, in particular, that the trapezoidal rule integrates linear functions exactly.
Recall that $L^\infty(a,b)$ is the space of bounded functions, so that $f \in L^\infty(a,b)$ if
$$\|f\|_\infty := \sup_{x\in[a,b]} |f(x)| < \infty.$$
Assume $f'' \in L^\infty(a,b)$. We had already considered this case in an earlier section but we shall derive the result in a different way. From (*),
$$|E_n| \le \frac{\|f''\|_\infty}{2} \sum_{i=1}^n \int_{x_{i-1}}^{x_i} \Big[ \Big(\frac h2\Big)^2 - (x - x_{i-\frac12})^2 \Big] dx = \frac{\|f''\|_\infty}{2} \sum_{i=1}^n \frac{h^3}{6} = \frac{(b-a)\|f''\|_\infty}{12}\, h^2.$$
Example: Let $f(x) = \sqrt{x}$ on [0, 1]. Note that $f' \in L^1(0,1)$ but $f'' \notin L^1(0,1)$. According to the above error analysis, $|E_n| \le (2n)^{-1}$. It can be shown that $|E_n| \ge c\, n^{-1}$ for some positive constant c and hence the error estimate is sharp.
If f is infinitely many times differentiable and $f^{(j)}(a) = f^{(j)}(b)$ for all non-negative integers j (i.e., f is smooth and periodic), then it will be shown below that the quadrature error decays faster than $h^k$ for every positive integer k, where $h = (b-a)/n$ is the width of each sub-interval. We call this exponential convergence. The trapezoidal rule is perfect for integrating functions such as
$$\int_0^{2\pi} e^{\sin x}\, dx$$
because the integrand is smooth and periodic and so the error decays exponentially. See the left figure of Figure 6.1.
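The following Python sketch (illustrative) makes this concrete: it applies the trapezoidal rule to $\int_0^{2\pi} e^{\sin x}\, dx$ for a few values of n and prints the error against a fine-grid reference value.

    import math

    def trapezoid_periodic(f, a, b, n):
        # For a periodic integrand the trapezoidal rule is just h times a sum over n nodes.
        h = (b - a) / n
        return h * sum(f(a + j * h) for j in range(n))

    f = lambda x: math.exp(math.sin(x))
    reference = trapezoid_periodic(f, 0.0, 2.0 * math.pi, 256)   # effectively exact
    for n in (4, 8, 16, 32):
        print(n, abs(trapezoid_periodic(f, 0.0, 2.0 * math.pi, n) - reference))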
Note that when f is periodic, then
$$T_n = \frac h2 \sum_{j=1}^n \big( f(x_j) + f(x_{j-1}) \big) = h\Big( \frac{f(a) + f(b)}{2} + \sum_{j=1}^{n-1} f(x_j) \Big) = h \sum_{j=0}^{n-1} f(x_j).$$
Proof: Consider first the case $f(x) = e^{2i\pi kx}$ where $i^2 = -1$ and k is an arbitrary integer. Observe the geometric sum
$$\sum_{j=0}^{n-1} e^{2i\pi kj/n} = \begin{cases} n, & n \text{ divides } k;\\ 0, & \text{otherwise.}\end{cases}$$
Note that if $f(x) = e^{2i\pi kx}$, $\int_0^1 f(x)\, dx = \delta_{0k}$, and so
$$e_{kn} \equiv \int_0^1 f(x)\, dx - \frac1n \sum_{j=0}^{n-1} f\Big(\frac jn\Big) = \begin{cases} -1, & n \text{ divides } k,\ k \ne 0;\\ 0, & \text{otherwise.}\end{cases}$$
Now consider a general $f \in C^m[0,1]$ which is periodic of period 1. Recall that $\mathbb{Z}$ denotes the set of integers, and f has a Fourier expansion
$$f(x) = \sum_{k\in\mathbb{Z}} \hat f(k)\, e^{2i\pi kx},$$
which is the desired result since the infinite sum on the right-hand side converges.
To complete the proof, we prove the claim (6.2). Let k be a non-zero integer. Then
$$\hat f(k) = \int_0^1 f(x) e^{-2i\pi kx}\, dx = \left[ \frac{f(x) e^{-2i\pi kx}}{-2i\pi k} \right]_0^1 - \int_0^1 \frac{f'(x) e^{-2i\pi kx}}{-2i\pi k}\, dx = \frac{\widehat{f'}(k)}{2i\pi k}.$$
This result illustrates a common principle: the smoother the integrand, the faster the quadrature
error goes to zero. The fast decay of the error is due to the fast decay of the coefficient fˆ(k) as
a function of |k|.
Example: Since the trapezoidal rule is so efficient at evaluating integrals of smooth periodic integrands,
there are transformations which take advantage of this fact. Consider
$$I = \int_{-\infty}^\infty \frac{dx}{1 + x^4}.$$
$$f(x) = \frac{e^x - 1 - x - x^2/2}{x^3}.$$
When x is small, severe cancellation error occurs. We can find some cutoff value r so that
whenever |x| ≤ r, we compute f by a Taylor’s expansion and evaluate f directly if |x| > r.
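A sketch of the cutoff idea in Python (the cutoff value r and the number of Taylor terms are illustrative assumptions, not prescribed by the notes):

    import math

    def f(x, r=1e-2):
        """(e^x - 1 - x - x^2/2) / x^3, with a Taylor series used for small |x|."""
        if abs(x) > r:
            return (math.exp(x) - 1.0 - x - 0.5 * x * x) / x**3
        # Taylor expansion of the quotient: 1/6 + x/24 + x^2/120 + x^3/720 + ...
        return 1.0 / 6.0 + x / 24.0 + x * x / 120.0 + x**3 / 720.0

    print(f(0.5), f(1e-6), f(0.0))   # smooth behaviour through x = 0; the value at 0 is 1/6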
Another solution is to use the contour integral
$$f(x) = \frac{1}{2\pi i} \int_\Gamma \frac{f(z)}{z - x}\, dz,$$
where Γ is, say, the circle of radius one with centre at the origin of the complex plane traversed in the counterclockwise direction. Here $i = \sqrt{-1}$. Since f is real, we can save some work by computing the contour integral in the upper half of the circle, double its value and then take its real part. Suppose m points are taken. Since the integral is periodic, it can be accurately computed by the trapezoidal rule:
$$\mathrm{real}\Big( \frac1m \sum_{j=1}^m \frac{z_j\, f(z_j)}{z_j - x} \Big), \qquad z_j = e^{i\pi(j - 0.5)/m}.$$
Note that $a_+^k$ means $(a_+)^k$. Let $\mathcal P_m$ denote the set of polynomials of degree at most m.
Theorem: Fix positive integer n and non-negative integer m. Suppose En p = 0 for all p ∈ Pm . Let
0 ≤ k ≤ m. Suppose f ∈ C k+1 [a, b]. Then
$$E_n f = \frac{1}{k!} \int_a^b f^{(k+1)}(y)\, K(y)\, dy.$$
$$E_n f = E_n p + E_n r = 0 + \int_a^b r(x) w(x)\, dx - \sum_{j=0}^n r(x_j) w_j$$
$$= \frac{1}{k!} \int_a^b \Big( \int_a^b f^{(k+1)}(y)(x - y)_+^k\, dy \Big) w(x)\, dx - \frac{1}{k!} \sum_{j=0}^n \Big( \int_a^b f^{(k+1)}(y)(x_j - y)_+^k\, dy \Big) w_j$$
$$= \frac{1}{k!} \int_a^b f^{(k+1)}(y) \Big( \int_a^b (x - y)_+^k\, w(x)\, dx - \sum_{j=0}^n (x_j - y)_+^k\, w_j \Big) dy$$
$$= \frac{1}{k!} \int_a^b f^{(k+1)}(y)\, K(y)\, dy.$$
Example: We recover the error of the trapezoidal rule using the Peano kernel assuming f ∈ C 2 [a, b].
This corresponds to k = 1 = m in the theorem since the trapezoidal rule integrates all linear
polynomials exactly. Recall the nodes are xj = a + jh, where h = (b − a)/n. Then
$$E_n f = \int_a^b f(x)\, dx - \frac h2 \sum_{j=0}^{n-1} \big( f(x_{j+1}) + f(x_j) \big) = \sum_{j=0}^{n-1} \Big( \int_{x_j}^{x_{j+1}} f(x)\, dx - \frac h2 \big( f(x_{j+1}) + f(x_j) \big) \Big).$$
Let $\epsilon_j$ denote the quadrature error on the interval $[x_j, x_{j+1}]$. That is, $\epsilon_j$ is the expression to the right of the summation in the above expression. The Peano kernel on this interval is
$$K_j(y) = \int_{x_j}^{x_{j+1}} (x - y)_+\, dx - \frac h2 (x_j - y)_+ - \frac h2 (x_{j+1} - y)_+ = \int_y^{x_{j+1}} (x - y)\, dx - 0 - \frac h2 (x_{j+1} - y) = -\frac{(x_{j+1} - y)(y - x_j)}{2}.$$
Consequently,
$$\epsilon_j = -\frac12 \int_{x_j}^{x_{j+1}} f''(y)(x_{j+1} - y)(y - x_j)\, dy.$$
Note that $(x_{j+1} - y)(y - x_j) \le \frac{h^2}{4}$ for $y \in [x_j, x_{j+1}]$. Therefore
$$|\epsilon_j| \le \frac{h^2}{4}\cdot\frac12 \int_{x_j}^{x_{j+1}} |f''(y)|\, dy,$$
resulting in
$$|E_n f| \le \sum_{j=0}^{n-1} |\epsilon_j| \le \frac{h^2}{8} \sum_{j=0}^{n-1} \int_{x_j}^{x_{j+1}} |f''(y)|\, dy = \frac{h^2}{8} \|f''\|_1.$$
with the error as small as possible. Here k is a given positive integer. There are 2k degrees of
freedom (xi and wi ) and so it is reasonable to expect that all polynomials of degree 2k − 1 or
lower can be integrated exactly.
By the change of variable $z = 2\frac{x-a}{b-a} - 1$,
$$\int_a^b f(x)\, dx = \frac{b-a}{2} \int_{-1}^1 F(z)\, dz, \qquad F(z) = f\Big( \frac{(z+1)(b-a)}{2} + a \Big).$$
With four parameters, all polynomials of degree three or less can be integrated exactly. Substituting f by 1, x, $x^2$, $x^3$ results in four equations
$$2 = w_1 + w_2, \qquad 0 = x_1 w_1 + x_2 w_2, \qquad \frac23 = x_1^2 w_1 + x_2^2 w_2, \qquad 0 = x_1^3 w_1 + x_2^3 w_2,$$
which can be solved to yield $w_1 = w_2 = 1$ and $x_{1,2} = \pm\frac{1}{\sqrt3}$. (We shall see later that $x_{1,2}$ are roots of a quadratic polynomial.) Hence
$$\int_{-1}^1 f(x)\, dx \approx f\Big(\frac{1}{\sqrt3}\Big) + f\Big(-\frac{1}{\sqrt3}\Big)$$
is exact for polynomials of degree three or less. This should be contrasted with the trapezoidal rule. The above procedure is not recommended for higher order Gaussian quadrature because of the difficulty in solving the associated nonlinear system.
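A quick Python check (not from the notes) that the two-point rule integrates cubics exactly but not quartics:

    import math

    def gauss2(f):
        """Two-point Gauss-Legendre rule on [-1, 1]: nodes +-1/sqrt(3), weights 1."""
        x = 1.0 / math.sqrt(3.0)
        return f(-x) + f(x)

    print(gauss2(lambda x: x**2), 2.0 / 3.0)   # both equal 2/3
    print(gauss2(lambda x: x**3), 0.0)         # both equal 0
    print(gauss2(lambda x: x**4), 2.0 / 5.0)   # 2/9 versus the exact 2/5: not exact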
Before proceeding with the general theory of Gaussian quadrature, we recall a few facts
about Legendre polynomials. The first few are defined by
$$p_0(x) = 1, \quad p_1(x) = x, \quad p_2(x) = \frac{3x^2 - 1}{2}, \quad p_3(x) = \frac{5x^3 - 3x}{2}.$$
A recurrence relation defining the Legendre polynomials is
where the solution is required to be bounded on [−1, 1] and satisfies pj (1) = 1. It can be shown
that
$$\int_{-1}^1 p_j^2(x)\, dx = \frac{2}{2j+1}. \qquad (6.4)$$
Note that $p_i$ is a polynomial of degree i. For a fixed positive n, it is known that $\{p_i,\ 0 \le i \le n\}$ is an orthogonal basis for $\mathcal P_n$, the space of polynomials of degree at most n, under the inner product
$$\langle f, g\rangle = \int_{-1}^1 f(x) g(x)\, dx, \qquad f, g \in \mathcal P_n.$$
Orthogonality refers to the fact that $\int_{-1}^1 p_i p_j\, dx$ is zero if $i \ne j$ and is positive if $i = j$. It is also known that the n zeroes of $p_n$ are all real, distinct and lie in (−1, 1).
The sequence of Legendre polynomials can be derived in many ways. Here we describe three of them. First, recall that a basis for $\mathcal P_n$ is $\{1, x, \ldots, x^n\}$. Gram–Schmidt can be used to produce an orthogonal set $\{p_j,\ j = 0, \ldots, n\}$ which spans the same space. To ensure uniqueness, we require $p_j(1) = 1$ for every j. Taking $p_0 := 1$, the second function must be orthogonalized against the first one:
$$x - \frac{\langle x, p_0\rangle}{\langle p_0, p_0\rangle}\, p_0 = x.$$
Thus we can take $p_1(x) = x$ since it satisfies $p_1(1) = 1$. Next, orthogonalize $x^2$ against $p_0$ and $p_1$:
$$x^2 - \frac{\langle x^2, p_0\rangle}{\langle p_0, p_0\rangle}\, p_0 - \frac{\langle x^2, p_1\rangle}{\langle p_1, p_1\rangle}\, p_1 = x^2 - \frac13 - 0.$$
We must rescale the above function so that it takes on the value 1 at x = 1. Thus $p_2(x) = (3x^2 - 1)/2$. This procedure can be continued indefinitely.
Next we show that pn satisfies the eigenvalue relation
non-negative k. If n is even, take a1 = 0 so that a2k−1 = 0 for all positive k and define a0 so
that pn (1) = 1. In both cases, pn is, in fact, a polynomial of degree n.
The final derivation of the Legendre polynomials uses the 3-term recurrence relation. Its
proof is by induction. Since this will be done for a more general case in the next chapter, we do
not prove the special case here.
Fix a positive integer k and let {x1 , . . . , xk } be the roots of pk . Recall the following polyno-
mials which were defined in the discussion on Lagrange interpolation:
$$L_i(x) = \prod_{j=1,\, j\ne i}^k \frac{x - x_j}{x_i - x_j}, \qquad i = 1, \cdots, k.$$
These polynomials are of degree k − 1 and satisfy the property Li (xj ) = δij .
Theorem: Let k be a positive integer and $\{x_i,\ i = 1, \cdots, k\}$ be the roots of $p_k$. Define $w_i = \int_{-1}^1 L_i(x)\, dx$. Then
$$\int_{-1}^1 f(x)\, dx = \sum_{i=1}^k w_i f(x_i)$$
for every polynomial f of degree at most 2k − 1.
Proof: Write $f = q p_k + r$, where $q, r \in \mathcal P_{k-1}$. Then
$$r(x) = \sum_{i=1}^k L_i(x) r(x_i) = \sum_{i=1}^k L_i(x) f(x_i).$$
The first equality above holds by the definition of $L_i$ and the fact that $r(x) - \sum_i L_i(x) r(x_i)$ is a polynomial of degree at most k − 1 with k distinct roots $\{x_i\}$. The second equality holds because $p_k$ vanishes at the nodes $\{x_i\}$. In fact, $r \in \mathcal P_{k-1}$ is the polynomial interpolant of f at the k nodes. Using the orthogonality of the Legendre polynomials,
$$\int_{-1}^1 f(x)\, dx = \int_{-1}^1 q p_k\, dx + \int_{-1}^1 r\, dx = 0 + \sum_{i=1}^k f(x_i) \int_{-1}^1 L_i(x)\, dx = \sum_{i=1}^k w_i f(x_i).$$
Theorem: Fix a positive k. The weights wi and nodes xi defined above are unique for the exact integration
of all polynomials of degree 2k − 1 or less. Furthermore 0 < wi < 2, i = 1, · · · , k.
Proof: Let {Wi } be non-zero and let {yi } be distinct, i = 1, · · · , k. Suppose for every polynomial p of
degree 2k − 1 or less,
$$\int_{-1}^1 p(x)\, dx = \sum_{i=1}^k W_i\, p(y_i).$$
We shall show below that A is non–singular and so this means that $\{y_i\}$ are the roots of $p_k$. That is, $y_i = x_i$ for every i.
We have $\sum_{i=1}^k w_i p(x_i) = \sum_{i=1}^k W_i p(x_i)$ for every polynomial p of degree less than or equal to 2k − 1. Take $p = p_j$, $j = 0, \cdots, k-1$. The system of equations becomes $\sum_{i=1}^k p_j(x_i)(w_i - W_i) = 0$, $0 \le j \le k-1$, or
$$A \begin{bmatrix} w_1 - W_1 \\ \vdots \\ w_k - W_k \end{bmatrix} = 0.$$
Since A is non–singular, $w_i = W_i$ for every i.
Finally, we demonstrate that A is non–singular. Suppose $A^T c = 0$ for some vector c. Define the polynomial $q(x) = \sum_{i=0}^{k-1} c_i p_i(x)$ of degree at most k − 1. Now $A^T c = 0$ means that q has k distinct roots $x_1, \cdots, x_k$. This implies that q is the zero function. Since $\{p_i\}$ are linearly independent, c = 0.
Fix i. Recall $L_i$ satisfies $L_i(x_q) = \delta_{iq}$. Note that $L_i^2$ is a non-negative polynomial, not identically zero, of degree 2k − 2. Thus
$$0 < \int_{-1}^1 L_i^2(x)\, dx = \sum_{j=1}^k w_j L_i^2(x_j) = w_i.$$
Next,
$$2 = \int_{-1}^1 1\, dx = \sum_{j=1}^k w_j \cdot 1 = \sum_{j=1}^k w_j.$$
Since every $w_j$ is positive, it follows that $w_i < 2$.
Theorem: Let k be a positive integer and {xi , i = 1, · · · , k} be the roots of pk . Define the weights wi as
in the above theorem. Let f ∈ C 2k (−1, 1). Then for some ξ ∈ (−1, 1),
$$E_k \equiv \int_{-1}^1 f(x)\, dx - \sum_{i=1}^k w_i f(x_i) = \frac{f^{(2k)}(\xi)}{(2k)!} \int_{-1}^1 \phi_k(x)\, dx, \qquad \phi_k(x) = \prod_{i=1}^k (x - x_i)^2.$$
Proof: Let p(x) be the (Hermite) polynomial of degree 2k − 1 such that $p(x_i) = f(x_i)$ and $p'(x_i) = f'(x_i)$, $i = 1, \cdots, k$. Hence
$$\int_{-1}^1 p(x)\, dx = \sum_{i=1}^k w_i p(x_i) = \sum_{i=1}^k w_i f(x_i).$$
Recall from (5.7) that for some fixed $x \in (-1,1)$ and $\xi(x) \in (-1,1)$,
$$f(x) - p(x) = \frac{f^{(2k)}(\xi(x))}{(2k)!}\, \phi_k(x).$$
Consequently,
$$\int_{-1}^1 f(x)\, dx - \sum_{j=1}^k w_j f(x_j) = \frac{1}{(2k)!} \int_{-1}^1 f^{(2k)}(\xi(x))\, \phi_k(x)\, dx = \frac{f^{(2k)}(\xi)}{(2k)!} \int_{-1}^1 \phi_k(x)\, dx$$
for some ξ by the mean value theorem for integrals. This permits us to conclude that
$$E_k = \int_{-1}^1 (f - p)\, dx = \frac{f^{(2k)}(\xi)}{(2k)!} \int_{-1}^1 \phi_k(x)\, dx.$$
Although the above is only an approximation for large k, it turns out to be very accurate even
for small k, and, in fact, gives an upper bound for all k ≥ 1.
Thus the error decays exponentially quickly as a function of k provided that f is smooth
enough. This should be compared with a decay of $k^{-2}$ for the trapezoidal rule, even if f is
smooth.
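This decay can be observed numerically. The sketch below (illustrative; it relies on NumPy's numpy.polynomial.legendre.leggauss routine for the nodes and weights) applies k-point Gaussian quadrature to a smooth integrand and prints the error.

    import numpy as np

    def gauss_legendre(f, k):
        """k-point Gauss-Legendre quadrature on [-1, 1]."""
        nodes, weights = np.polynomial.legendre.leggauss(k)
        return np.sum(weights * f(nodes))

    exact = np.exp(1.0) - np.exp(-1.0)        # integral of e^x over [-1, 1]
    for k in (1, 2, 3, 5, 10):
        print(k, abs(gauss_legendre(np.exp, k) - exact))   # the error falls off extremely fast in k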
The following result says that Gaussian quadrature converges to the exact integral in the limit of infinitely many nodes, assuming only continuity of f.
k                                            1        2        3        5                  10                 20
$\frac{2^{2k+1}(k!)^4}{(2k+1)[(2k)!]^2}$     0.6667   0.1778   0.0457   2.9318 × 10^{-3}   2.9256 × 10^{-6}   2.8226 × 10^{-12}
$\pi\, 4^{-k}$                               0.7854   0.1963   0.0491   3.0680 × 10^{-3}   2.9961 × 10^{-6}   2.8573 × 10^{-12}
Proof: Given any ǫ > 0, by the Weierstrass approximation theorem there is some polynomial p so that
$$|f(x) - p(x)| < \frac{\epsilon}{4}, \qquad \forall x \in [-1, 1].$$
Take k to be any positive integer so that 2k is larger than the degree of p. Recall that the Gaussian quadrature with k nodes integrates p exactly. Now
$$|E_k| = \Big| \int_{-1}^1 f(x)\, dx - \sum_{j=1}^k f(x_j) w_j \Big| \le \Big| \int_{-1}^1 \big( f(x) - p(x) \big)\, dx \Big| + \Big| \int_{-1}^1 p(x)\, dx - \sum_{j=1}^k f(x_j) w_j \Big|$$
$$\le \int_{-1}^1 |f(x) - p(x)|\, dx + \Big| \sum_{j=1}^k p(x_j) w_j - \sum_{j=1}^k f(x_j) w_j \Big| \le \frac{\epsilon}{4} \int_{-1}^1 dx + \frac{\epsilon}{4} \sum_{j=1}^k w_j = \frac{\epsilon}{2} + \frac{\epsilon}{4} \int_{-1}^1 dx = \epsilon.$$
The conclusion of this theorem may seem unspectacular. It should be contrasted with the quadrature scheme using equi-spaced nodes, with the integrand approximated by the polynomial interpolant at those nodes. There, not only does the quadrature error fail to decrease to zero, it can increase exponentially as a function of the number of nodes. See Figure 5.1.
The final result is a general theorem about convergence of the quadrature
$$I_n(f) := \sum_{i=1}^n f(x_i)\, w_i.$$
The nodes {xi } and weights {wi } depend on n, but not on f . We do not indicate explicitly their
dependence on n to simplify the notation. Define the quadrature error En (f ) = |I(f ) − In (f )|.
and
$$\sum_{i=1}^n |w_i| \le M, \qquad n \ge 1,$$
for some positive constant M.
Proof: We only give a sketch of the proof of the “only if” part. Suppose the two assumptions hold. We show that $E_n \to 0$. Given any positive integer n, there exists some polynomial $p_N$ of degree $N = N(n)$ so that $E_n(p_N) = 0$ by the first assumption, and $\|f - p_N\|_\infty \to 0$ as $n \to \infty$. (In case of Gaussian quadrature, $N = 2n - 1$ and the latter property follows from the Weierstrass approximation theorem.) Then
$$E_n(f) = E_n(f - p_N) = \Big| \int_{-1}^1 \big( f(x) - p_N(x) \big)\, dx - \sum_{j=1}^n w_j \big( f(x_j) - p_N(x_j) \big) \Big|$$
$$\le \|f - p_N\|_\infty \Big( 2 + \sum_{j=1}^n |w_j| \Big) \le \|f - p_N\|_\infty (2 + M).$$
Take n → ∞ to conclude that En (f ) → 0.
This theorem offers an alternative proof that Gaussian quadrature converges for continuous
functions on [−1, 1]. This follows since
$$I_n(1) = \sum_{i=1}^n w_i = I(1) = 2, \qquad n \ge 1,$$
using the fact that Gaussian quadrature In is exact for polynomials of degree at most 2n − 1
and that all the weights are positive.
Proof: The proof is by induction on n. The base case n = 2 is trivial. Assume that the result holds for
n. We proceed to show that it holds for n + 1. Since pn+1 − xpn ∈ Pn and that {p0 , . . . , pn } is
an orthogonal basis for Pn , it follows that
$$p_{n+1}(x) - x\, p_n(x) = a_{n+1}\, p_n + b_{n+1}\, p_{n-1} + \sum_{j=0}^{n-2} d_j\, p_j, \qquad (6.6)$$
for some real constants $a_{n+1}, b_{n+1}, d_j$. Take the inner product with $p_n$ to get
$$0 - \langle x p_n, p_n\rangle = a_{n+1} \langle p_n, p_n\rangle,$$
or
$$a_{n+1} = -\frac{\langle x p_n, p_n\rangle}{\langle p_n, p_n\rangle}.$$
Next, take the inner product with $p_{n-1}$ in (6.6) to get
$$0 - \langle x p_n, p_{n-1}\rangle = b_{n+1} \langle p_{n-1}, p_{n-1}\rangle,$$
or
$$b_{n+1} = -\frac{\langle x p_n, p_{n-1}\rangle}{\langle p_{n-1}, p_{n-1}\rangle} = -\frac{\langle p_n, x p_{n-1}\rangle}{\langle p_{n-1}, p_{n-1}\rangle} = -\frac{\langle p_n, p_n + q\rangle}{\langle p_{n-1}, p_{n-1}\rangle} = -\frac{\langle p_n, p_n\rangle}{\langle p_{n-1}, p_{n-1}\rangle},$$
where $q \in \mathcal P_{n-1}$. Finally, take the inner product with $p_j$, $0 \le j \le n-2$, in (6.6) to get
$$0 - \langle x p_n, p_j\rangle = d_j \langle p_j, p_j\rangle,$$
or
$$d_j \langle p_j, p_j\rangle = -\langle p_n, x p_j\rangle = -\langle p_n, p_{j+1} + r\rangle = 0,$$
where r ∈ Pj . This shows that dj = 0 for every j and completes the induction proof.
Theorem: For each n ≥ 1, all zeros of pn are real, simple and lie in (a, b).
is either positive or negative. But this is a contradiction since the right-hand side must be zero by orthogonality. Hence $p_n$ must have some zero $x_1 \in (a,b)$. If $x_1$ is not a simple zero, then $p_n(x)(x - x_1)^{-2} \in \mathcal P_{n-2}$ and by orthogonality,
$$0 = \Big\langle \frac{p_n}{(x - x_1)^2},\, p_n \Big\rangle = \Big\langle \frac{p_n^2}{(x - x_1)^2},\, 1 \Big\rangle > 0,$$
a contradiction.
$$p_n(x)(x - x_1)\cdots(x - x_m) = q(x)(x - x_1)^2 \cdots (x - x_m)^2,$$
for some $q \in \mathcal P_{n-m}$ which does not vanish in (a, b). Without loss of generality, assume q > 0 in (a, b). Therefore
$$\langle p_n(x),\, (x - x_1)\cdots(x - x_m)\rangle = \langle q(x)(x - x_1)^2\cdots(x - x_m)^2,\, 1\rangle > 0.$$
However, if m < n then $(x - x_1)\cdots(x - x_m) \in \mathcal P_{n-1}$, so orthogonality forces this inner product to be zero, which is absurd. It can be inferred that m = n.
Denote the corresponding set of orthonormal polynomials by $\{\hat p_0, \ldots, \hat p_n\}$, whose elements satisfy
$$\langle \hat p_i, \hat p_j\rangle = \delta_{ij}, \qquad \hat p_i = \frac{p_i}{\langle p_i, p_i\rangle^{1/2}}, \quad i, j \ge 0.$$
In the special case $w \equiv 1$, the orthogonal polynomials are the Legendre polynomials. Another common case is $w = (1 - x^2)^{-1/2}$, where the orthogonal polynomials are known as Chebyshev polynomials.
Define the kernel polynomial
$$K_n(x, y) = \sum_{j=0}^n \hat p_j(x)\, \hat p_j(y).$$
Therefore,
$$\langle p, K_n(\cdot, y)\rangle = \Big\langle \sum_{j=0}^n \langle p, \hat p_j\rangle\, \hat p_j(x),\ \sum_{k=0}^n \hat p_k(x)\, \hat p_k(y) \Big\rangle = \sum_{j=0}^n \langle p, \hat p_j\rangle\, \hat p_j(y) = p(y).$$
Let k(x, y) be a polynomial of degree at most n in x and at most n in y so that $\langle p, k(\cdot, y)\rangle = p(y)$ for every $p \in \mathcal P_n$. Take $p(x) = K_n(x, z)$. Then
$$\langle K_n(\cdot, z), k(\cdot, y)\rangle = K_n(y, z).$$
From the hypothesis of this theorem,
$$\langle K_n(\cdot, z), k(\cdot, y)\rangle = \langle k(\cdot, y), K_n(\cdot, z)\rangle = k(z, y).$$
Combine the symmetry of $K_n$ and these results to get $K_n(z, y) = K_n(y, z) = k(z, y)$, demonstrating the uniqueness of $K_n$.
The kernel polynomial is useful because for any continuous function f, it follows that $\langle K_n(x, \cdot), f\rangle \in \mathcal P_n$ is a good approximation to f(x) in the following senses. Let $E_n(x) = f(x) - \langle K_n(x, \cdot), f\rangle$. Then
1. $\langle E_n, \hat p_j\rangle = 0$, $0 \le j \le n$,
2. $E_n$ vanishes at at least n + 1 points in (a, b).
We shall only prove the first property.
$$\langle E_n, \hat p_j\rangle = \langle f, \hat p_j\rangle - \sum_i \langle \hat p_i, f\rangle \langle \hat p_i, \hat p_j\rangle = \langle f, \hat p_j\rangle - \sum_i \langle \hat p_i, f\rangle\, \delta_{ij} = 0.$$
Now we are ready to apply the above theory of orthogonal polynomials to Gaussian quadra-
ture. Let w be the weight function and pn be an orthogonal polynomial as before.
Theorem: For n ≥ 1, let x1 < · · · < xn be the zeroes of pn . There are positive constants w1 , . . . , wn so that
$$\int_a^b p(x) w(x)\, dx = \sum_{j=1}^n w_j\, p(x_j), \qquad p \in \mathcal P_{2n-1}.$$
Furthermore,
$$0 < w_i < \int_a^b w(x)\, dx, \qquad 1 \le i \le n.$$
Proof: Given any $p \in \mathcal P_{2n-1}$, write $p = q p_n + r$ for some $q, r \in \mathcal P_{n-1}$. Since $x_j$ is a root of $p_n$, it follows that $p(x_j) = r(x_j)$, $j = 1, \ldots, n$. Hence
$$r(x) = \sum_{j=1}^n p(x_j) L_j(x), \qquad L_i(x) = \prod_{j=1,\, j\ne i}^n \frac{x - x_j}{x_i - x_j},$$
and therefore
$$\int_a^b p(x) w(x)\, dx = \int_a^b q p_n w\, dx + \int_a^b r w\, dx = \sum_{j=1}^n w_j\, p(x_j),$$
where
$$w_j = \int_a^b L_j(x) w(x)\, dx.$$
Note that the integral of $p_n q w$ is zero since $q \in \mathcal P_{n-1}$ is orthogonal to $p_n$ in the inner product $\langle\cdot,\cdot\rangle$. Since $L_j^2 \in \mathcal P_{2n-2}$ and $L_j(x_k) = \delta_{jk}$, substitute $L_j^2$ for p to get
$$w_j = \sum_{k=1}^n w_k L_j^2(x_k) = \int_a^b L_j^2(x) w(x)\, dx > 0.$$
Finally, apply the exact quadrature formula to the constant function 1 to get
$$\int_a^b w(x)\, dx = \sum_{j=1}^n w_j > w_i, \qquad 1 \le i \le n.$$
Theorem: Assume the setting of the above theorem. Let $f \in C^{2n}[a,b]$ and $d_{2n} = \langle p_n, p_n\rangle$. Then there is some $\eta \in (a,b)$ so that
$$\int_a^b f(x) w(x)\, dx - \sum_{j=1}^n f(x_j) w_j = \frac{f^{(2n)}(\eta)}{(2n)!}\, d_{2n}.$$
$$f(x) = h(x) + \frac{f^{(2n)}(\xi(x))}{(2n)!}\, (x - x_1)^2 \cdots (x - x_n)^2, \qquad x \in [a,b],$$
for some $\xi(x) \in (a,b)$. According to the Theorem associated with (5.7),
$$\frac{f(x) - h(x)}{(x - x_1)^2 \cdots (x - x_n)^2} \in C[a,b].$$
Example: Let a = −1, b = 1 and $w(x) = (1 - x^2)^{-1/2}$. The associated orthogonal polynomials are called Chebyshev polynomials. The first few are
$$T_0(x) = 1, \quad T_1(x) = x, \quad T_2(x) = 2x^2 - 1, \quad T_3(x) = 4x^3 - 3x.$$
Example: Let a = 0, b = ∞ and $w(x) = e^{-x}$. The associated orthogonal polynomials are called Laguerre polynomials. The first few are
$$G_0(x) = 1, \quad G_1(x) = -x + 1, \quad G_2(x) = x^2 - 4x + 2, \quad G_3(x) = -x^3 + 9x^2 - 18x + 6.$$
Example: Let a = −∞, b = ∞ and $w(x) = e^{-x^2}$. The associated orthogonal polynomials are called Hermite polynomials. The first few are
$$H_0(x) = 1, \quad H_1(x) = 2x, \quad H_2(x) = 4x^2 - 2, \quad H_3(x) = 8x^3 - 12x.$$
Suppose xi comes from the uniform distribution on [a, b]. Then with N such random points, an
approximation to I is
$$I_N = \frac{b-a}{N} \sum_{i=1}^N f(x_i).$$
We now compute two statistics, $E(I_N)$, the mean value of $I_N$, and $\sigma(I_N)$, the standard deviation. Now
$$E(I_N) = \frac{b-a}{N} \sum_{i=1}^N E(f) = (b-a)\, E(f) = (b-a)\, \frac{\int_a^b f(x)\, dx}{b-a} = I.$$
This is a desirable result.
Next we compute the variance of $I_N$:
$$\mathrm{var}\Big( \frac{b-a}{N} \sum_{i=1}^N f(x_i) \Big) = \frac{(b-a)^2}{N^2}\, \mathrm{var}\Big( \sum_{i=1}^N f(x_i) \Big) = \frac{(b-a)^2}{N^2}\, N\, \mathrm{var}(f).$$
Recall that $\mathrm{var}(f) = E(f^2) - E(f)^2$; then
$$\sigma(I_N) = \big(\mathrm{var}(I_N)\big)^{1/2} = \frac{(b-a)\big( E(f^2) - E(f)^2 \big)^{1/2}}{\sqrt N}. \qquad (6.7)$$
Thus the standard deviation, which measures the deviation from the mean and thus is in some sense the ‘error’ of the approximation, behaves like $N^{-1/2}$. Since the error of the trapezoidal rule is $O(N^{-2})$ for f smooth, the Monte Carlo method should not be used for one-dimensional integrals. However in d dimensions, the error of the trapezoidal rule is $O(N^{-2/d})$ while that of the Monte Carlo method is still $O(N^{-1/2})$, independent of d! Hence when d > 4, the Monte Carlo method is more efficient than the trapezoidal rule. As an added bonus, the Monte Carlo method is not sensitive to singularities or discontinuities of the integrand.
In the above description of the Monte Carlo method, we employed the uniform distribution.
This may not be the best. Suppose we choose the distribution p(x), which of course satisfies
$$\int_a^b p(x)\, dx = 1.$$
Then
$$\int_a^b f(x)\, dx = \int_a^b g(x) p(x)\, dx, \qquad g(x) = \frac{f(x)}{p(x)}.$$
We can now define the approximation
$$I_N = \frac1N \sum_{i=1}^N g(x_i),$$
where the points $x_i$ are now drawn from the distribution p.
Now
$$\mathrm{var}(I_N) = \frac1N\, \mathrm{var}(g) = \frac1N \big( E(g^2) - E(g)^2 \big) = \frac1N \Big( \int_a^b g^2 p\, dx - I^2 \Big) = \frac1N \Big( \int_a^b \frac{f^2}{p}\, dx - I^2 \Big).$$
Finally
$$\sigma(I_N) = \frac{1}{\sqrt N} \Big( \int_a^b \frac{f^2}{p}\, dx - I^2 \Big)^{1/2}.$$
While it is still $O(N^{-1/2})$, the constant multiplying $N^{-1/2}$ can be smaller if p is chosen properly.
Example: Let $I = \int_0^{\pi/2} \sin x\, dx = 1$. The Monte Carlo method with uniform distribution yields $\sigma(I_N) \approx 0.483\, N^{-1/2}$, with $I_{10} = 0.952$ for one set of random numbers, while using the distribution function $p(x) = 8\pi^{-2} x$ we obtain $I_{10} = 1.016$.
Another useful technique to improve the efficiency is to modify the integrand f so that $E(f^2)$ is as small as possible. See (6.7).
Example: For the same integral as in the above example, write
$$I = \int_0^{\pi/2} \frac{2x}{\pi}\, dx + \int_0^{\pi/2} \Big( \sin x - \frac{2x}{\pi} \Big)\, dx.$$
The first term on the right-hand side is easily evaluated as π/4 while the second term can be evaluated using the Monte Carlo method. Since the square of the integrand of the second term is small, the result will be more accurate. Here, $\sigma(I_N) \approx 0.1\, N^{-1/2}$.
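A small Python sketch of the plain and importance-sampled estimators for this integral (the sample size and seed are arbitrary choices):

    import math
    import random

    random.seed(1)
    N = 10_000
    a, b = 0.0, math.pi / 2.0

    # Plain Monte Carlo with the uniform distribution on [0, pi/2].
    xs = [random.uniform(a, b) for _ in range(N)]
    plain = (b - a) / N * sum(math.sin(x) for x in xs)

    # Importance sampling with density p(x) = 8 x / pi^2, sampled by inverting the CDF: x = (pi/2) sqrt(u).
    ys = [0.5 * math.pi * math.sqrt(1.0 - random.random()) for _ in range(N)]
    weighted = sum(math.sin(y) / (8.0 * y / math.pi**2) for y in ys) / N

    print(plain, weighted)   # both estimate the exact value 1; the second has a smaller variance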
$$\int_K f\, dA = \frac{|K|}{3} \big( f(A) + f(B) + f(C) \big)$$
for all polynomials of degree two or less, where |K| is the area of K and A, B, C are the midpoints of the sides.
If T , the given domain of integration, can be partitioned into many triangles, the above
quadrature rule can be used to get an accurate approximation to the integral over T . Of course,
the more triangles we use, the smaller the error, provided that f is smooth.
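A Python sketch of the edge-midpoint rule on a single triangle (illustrative; the test integrand is arbitrary):

    def triangle_midpoint_rule(f, v0, v1, v2):
        """Integrate f(x, y) over the triangle with vertices v0, v1, v2
        using the edge-midpoint rule, which is exact for quadratics."""
        (x0, y0), (x1, y1), (x2, y2) = v0, v1, v2
        area = 0.5 * abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))
        mids = [((x0 + x1) / 2, (y0 + y1) / 2),
                ((x1 + x2) / 2, (y1 + y2) / 2),
                ((x2 + x0) / 2, (y2 + y0) / 2)]
        return area / 3 * sum(f(x, y) for x, y in mids)

    # On the triangle (0,0), (1,0), (0,1), the integral of x*y is exactly 1/24.
    print(triangle_midpoint_rule(lambda x, y: x * y, (0, 0), (1, 0), (0, 1)))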