Copyright by
2016
Preface
This book evolved from lecture notes written by James Lambers and used in undergraduate numerical analysis courses at the University of California at Irvine, Stanford University and the University of Southern Mississippi. It is written for a year-long sequence of numerical analysis courses for either advanced undergraduate or beginning graduate students. Part II is suitable for a semester-long first course on numerical linear algebra.
The goal of this book is to introduce students to numerical analysis from both a theoretical
and practical perspective, in such a way that these two perspectives reinforce each other. It is
not assumed that the reader has prior programming experience. As mathematical concepts are
introduced, code is used to illustrate them. As algorithms are developed from these concepts, the
reader is invited to traverse the path from pseudocode to code.
Coding examples throughout the book are written in Matlab. Matlab has been a vital tool throughout the numerical analysis community since its creation thirty years ago, and its vector- and matrix-oriented syntax greatly accelerates the prototyping of algorithms compared to other programming environments.
The authors are indebted to the students in the authors’ MAT 460/560 and 461/561 courses,
taught in 2015-16, who were subjected to an early draft of this book.
J. V. Lambers
A. C. Sumner
Contents
I Preliminaries 1
1.4.5 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.5 Computer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.5.1 Floating-Point Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.5.2 Issues with Floating-Point Arithmetic . . . . . . . . . . . . . . . . . . . . . . 38
1.5.3 Loss of Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
V Appendices 367
List of Figures

1.1 The dotted red curves demonstrate polynomial interpolation (left plot) and least-
squares approximation (right plot) applied to f (x) = 1/(1 + x2 ) (blue solid curve). . 7
1.2 Screen shot of Matlab at startup in Mac OS X . . . . . . . . . . . . . . . . . . . . 10
1.3 Figure for Exercise 1.2.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1 The function f (x) = 1/(1 + x2 ) (solid curve) cannot be interpolated accurately on
[−5, 5] using a tenth-degree polynomial (dashed curve) with equally-spaced interpo-
lation points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.2 Cubic spline passing through the points (0, 3), (1/2, −4), (1, 5), (2, −6), and (3, 7). 190
6.1 Data points (xi , yi ) (circles) and least-squares line (solid line) . . . . . . . . . . . . . 196
6.2 Data points (xi , yi ) (circles) and quadratic least-squares fit (solid curve) . . . . . . . 199
6.3 Graphs of f (x) = ex (red dashed curve) and 4th-degree continuous least-squares
polynomial approximation f4 (x) on [0, 5] (blue solid curve) . . . . . . . . . . . . . . 203
6.4 Graph of cos x (solid blue curve) and its continuous least-squares quadratic approx-
imation (red dashed curve) on (−π/2, π/2) . . . . . . . . . . . . . . . . . . . . . . . 208
6.5 (a) Left plot: noisy signal (b) Right plot: discrete Fourier transform . . . . . . . . . 221
6.6 Aliasing effect on noisy signal: coefficients fˆ(ω), for ω outside (−63, 64), are added
to coefficients inside this interval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
7.1 Graph of f (x) = e3x sin 2x on [0, π/4], with quadrature nodes from Example 7.7.2
shown on the graph and on the x-axis. . . . . . . . . . . . . . . . . . . . . . . . . . . 261
8.1 Left plot: Well-conditioned problem of solving f (x) = 0. f ′(x∗ ) = 24, and an approximate solution ŷ = f −1 (ε) has small error relative to ε. Right plot: Ill-conditioned problem of solving f (x) = 0. f ′(x∗ ) = 0, and ŷ has large error relative to ε. 273
8.2 Illustrations of the Intermediate Value Theorem. Left plot: f (x) = x − cos x has a
unique root on [0, π/2]. Right plot: g(x) = ex cos(x2 ) has multiple roots on [0, π]. . . 274
8.3 Because f (π/4) > 0, f (x) has a root in (0, π/4). . . . . . . . . . . . . . . . . . . . . . 275
8.4 Progress of the Bisection method toward finding a root of f (x) = x − cos x on (0, π/2) 277
8.5 Fixed-point Iteration applied to g(x) = cos x + 2. . . . . . . . . . . . . . . . . . . . . 286
8.6 Approximating a root of f (x) = x − cos x using the tangent line of f (x) at x0 = 1. . 289
8.7 Newton’s Method used to compute the reciprocal of 8 by solving the equation f (x) =
8 − 1/x = 0. When x0 = 0.1, the tangent line of f (x) at (x0 , f (x0 )) crosses the x-axis
at x1 = 0.12, which is close to the exact solution. When x0 = 1, the tangent line
crosses the x-axis at x1 = −6, which causes searching to continue on the wrong
portion of the graph, so the sequence of iterates does not converge to the correct
solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
8.8 Newton’s Method applied to f (x) = x2 − 2. The bold curve is the graph of f . The
initial iterate x0 is chosen to be 1. The tangent line of f (x) at the point (x0 , f (x0 ))
is used to approximate f (x), and it crosses the x-axis at x1 = 1.5, which is much
closer to the exact solution than x0 . Then, the tangent line at (x1 , f (x1 )) is used
to approximate f (x), and it crosses the x-axis at x2 = 1.416̄, which is already very
close to the exact solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
9.1 Solutions of y′ = −2ty, y(0) = 1 on [0, 1], computed using Euler's method and the
fourth-order Runge-Kutta method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
10.1 Left plot: exact (solid curve) and approximate (dashed curve with circles) solutions
of the BVP (10.8) computed using finite differences. Right plot: error in the approx-
imate solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
10.2 Exact (solid curve) and approximate (dashed curve with circles) solutions of the
BVP (10.11) from Example 10.2.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
10.3 Exact (blue curve) and approximate (dashed curve) solutions of (10.18), (10.19) from
Example 10.3.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
10.4 Piecewise linear basis functions φj (x), as defined in (10.25), for j = 1, 2, 3, 4, with
N =4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
10.5 Exact (solid curve) and approximate (dashed curve) solutions of (10.22), (10.23) with
f (x) = x and N = 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
List of Tables
6.1 Data points (xi , yi ), for i = 1, 2, . . . , 10, to be fit by a linear function . . . . . . . . . 195
6.2 Data points (xi , yi ), for i = 1, 2, . . . , 10, to be fit by a quadratic function . . . . . . . 198
6.3 Data points (xi , yi ), for i = 1, 2, . . . , 5, to be fit by an exponential function . . . . . . 200
Part I
Preliminaries
Chapter 1

What Is Numerical Analysis?
1.1 Overview
This book provides a comprehensive introduction to the subject of numerical analysis, which is the
study of the design, analysis, and implementation of numerical algorithms for solving mathematical
problems that arise in science and engineering. These numerical algorithms differ from the analytical
methods that are presented in other mathematics courses, in that they rely exclusively on the four
basic arithmetic operations, addition, subtraction, multiplication and division, so that they can be
implemented on a computer.
The goal in numerical analysis is to develop numerical methods that are effective, in terms of
the following criteria:
• A numerical method must be accurate. While this seems like common sense, careful consid-
eration must be given to the notion of accuracy. For a given problem, what level of accuracy
is considered sufficient? As will be discussed in Section 1.4, there are many sources of error.
As such, it is important to question whether it is prudent to expend resources to reduce one
type of error, when another type of error is already more significant. This will be illustrated
in Section 7.1.
• A numerical method must be efficient. Although computing power has been rapidly increasing
in recent decades, this has resulted in expectations of solving larger-scale problems. Therefore,
it is essential that numerical methods produce approximate solutions with as few arithmetic
operations or data movements as possible. Efficiency is not only important in terms of time;
memory is still a finite resource and therefore algorithms must also aim to minimize data
storage needs.
• A numerical method must be robust. A method that is highly accurate and efficient for some
(or even most) problems, but performs poorly on others, is unreliable and therefore not likely
to be used in applications, even if any alternative is not as accurate and efficient. The user
of a numerical method needs to know that the result produced can be trusted.
These criteria should be balanced according to the requirements of the application. For example,
if less accuracy is acceptable, then greater efficiency can be achieved. This can be the case, for
example, if there is so much uncertainty in the underlying mathematical model that there is no
point in obtaining high accuracy.
• Using Gaussian elimination, with exact arithmetic, to solve a system of linear equations, and
Such analytical methods have the benefit that they yield exact solutions, but the drawback is that
they can only be applied to a limited range of problems. Numerical methods, on the other hand, can
be applied to a much wider range of problems, but only yield approximate solutions. Fortunately, in
many applications, one does not necessarily need very high accuracy, and even when such accuracy
is required, it can still be obtained, if one is willing to expend the extra computational effort (or,
really, have a computer do so).
Because solutions produced by numerical algorithms are not exact, we will begin our exploration
of numerical analysis with one of its most fundamental concepts, which is error analysis. Numerical
algorithms must not only be efficient, but they must also be accurate and robust. In other words,
the solutions they produce are at best approximate solutions because an exact solution cannot
be computed by analytical techniques. Furthermore, these computed solutions should not be too
sensitive to the input data, because if they are, any error in the input can result in a solution that
is essentially useless. Such error can arise from many sources, such as
• discretization error, which arises from approximating continuous functions by sets of discrete
data points,
• convergence error, which arises from truncating a sequence of approximations that is meant
to converge to the exact solution, to make computation possible, and
• roundoff error, which is due to the fact that computers represent real numbers approximately,
in a fixed amount of storage in memory.
We will see that in some cases, these errors can be surprisingly large, so one must be careful
when designing and implementing numerical algorithms. Section 1.4 will introduce fundamental
concepts of error analysis that will be used throughout this book, and Section 1.5 will discuss
computer arithmetic and roundoff error in detail.
called the normal equations. While this system can be solved directly using methods discussed
above, this can be problematic due to sensitivity to roundoff error. We therefore explore other
approaches based on orthogonalization of the columns of A.
Another fundamental problem from linear algebra is the solution of the eigenvalue problem
Ax = λx,
where the scalar λ is called an eigenvalue and the nonzero vector x is called an eigenvector. This
problem has many applications throughout applied mathematics, including the solution of differen-
tial equations and statistics. We will see that the tools developed for efficient and robust solution
of least squares problems are useful for the eigenvalue problem as well.
Figure 1.1: The dotted red curves demonstrate polynomial interpolation (left plot) and least-squares
approximation (right plot) applied to f (x) = 1/(1 + x2 ) (blue solid curve).
Therefore, it is important to have methods for evaluating derivatives and integrals that are
insensitive to the complexity of the function being acted upon. Numerical techniques for these op-
erations make use of polynomial interpolation by (implicitly) constructing a polynomial interpolant
that fits the given data, and then applying differentiation or integration rules to the polynomial.
We will see that by choosing the method of polynomial approximation judiciously, accurate results
can be obtained with far greater efficiency than one might expect.
As an example, consider the definite integral
∫₀¹ 1/(x² − 5x + 6) dx.
Evaluating this integral exactly entails factoring the denominator, which is simple in this case but
not so in general, and then applying partial fraction decomposition to obtain an antiderivative,
which is then evaluated at the limits. Alternatively, simply computing
(1/12)[f (0) + 4f (1/4) + 2f (1/2) + 4f (3/4) + f (1)],
where f (x) is the integrand, yields an approximation with 0.01% error (that is, the error is 10−4 ).
While the former approach is less tedious to carry out by hand, at least if one has a calculator,
clearly the latter approach is the far more practical use of computational resources.
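This comparison is easy to reproduce in Matlab. The following commands (an illustration of our own; the exact value of the integral is ln(4/3)) evaluate the approximation and its error:

f = @(x) 1./(x.^2 - 5*x + 6);                  % the integrand
approx = (1/12)*(f(0) + 4*f(1/4) + 2*f(1/2) + 4*f(3/4) + f(1))
exact = log(4/3)                               % from the antiderivative ln|x-3| - ln|x-2|
err = abs(approx - exact)                      % roughly 1e-4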
One key issue with time-stepping methods is stability. If the time step is not chosen to be
sufficiently small, the computed solution can grow without bound, even if the exact solution is
bounded. Generally, the need for stability imposes a more severe restriction on the size of the time
step for explicit methods, which is why implicit methods are commonly used, even though they
tend to require more computational effort per time step. Certain systems of differential equations
can require an extraordinarily small time step to be solved by explicit methods; such systems are
said to be stiff.
y′′ = f (x, y, y′), a ≤ x ≤ b,
output, to a text file. By default, this output is saved to a file that is named diary in the current
working directory, but we will supply our own filename as an argument, to make the saved file
easier to open later in a text editor.
>> a=3+4
a =
7
>> b=sqrt(a)
b =
2.645751311064591
>> c=exp(a)
c =
1.096633158428459e+003
As can be seen from these statements, arithmetic operations and standard mathematical func-
tions can readily be performed, so Matlab could be used as a “desk calculator” in which results
of expressions can be stored in variables, such as a, b and c in the preceding example. Also, note
that once a command is executed, the output displayed is the variable name, followed by its value.
This is typical behavior in Matlab, so for the rest of this tutorial, the output will not be displayed
in the text.
Name   Size   Bytes   Class
A      2x2       32   double
B      2x2       32   double
C      2x2       32   double
a      1x1        8   double
ans    2x2       32   double
b      1x1        8   double
c      1x1        8   double
w      3x1       24   double
Note that each number, such as a, or each entry of a matrix, occupies 8 bytes of storage, which
is the amount of memory allocated to a double-precision floating-point number. This system of
representing real numbers will be discussed further in Section 1.5. Also, note the variable ans. It
was not explicitly created by any of the commands that we have entered. It is a special variable
that is assigned the most recent expression that is not already assigned to a variable. In this case,
the value of ans is the output of the operation 4*A, since that was not assigned to any variable.
>> E=ones(6,5)
>> E=ones(3)
>> R=rand(3,2)
As the name suggests, rand creates a matrix with random entries. More precisely, the entries
are random numbers that are uniformly distributed on [0, 1].
Exercise 1.2.1 What if we want the entries distributed within a different interval, such
as [−1, 1]? Create such a matrix, of size 3 × 2, using matrix arithmetic that we have seen,
and the ones function.
In many situations, it is helpful to have a vector of equally spaced values. For example, if we
want a vector consisting of the integers from 1 to 10, inclusive, we can create it using the statement
>> z=[ 1 2 3 4 5 6 7 8 9 10 ]
However, this can be very tedious if a vector with many more entries is needed. Imagine creating
a vector with all of the integers from 1 to 1000! Fortunately, this can easily be accomplished using
the colon operator. Try the following commands to see how this operator behaves.
>> z=1:10
>> z=1:2:10
>> z=10:-2:1
>> z=1:-2:10
It should be noted that the second argument, that determines spacing between entries, need not
be an integer.
Exercise 1.2.2 Use the colon operator to create a vector of real numbers between 0 and
1, inclusive, with spacing 0.01.
>> z=(0:0.1:1)’
has the desired effect. However, one should not simply conclude that the single quote is the
transpose operator, or they could be in for an unpleasant surprise when working with complex-
valued matrices. Try these commands to see why:
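For example, with a small complex matrix of our own choosing:

>> A=[ 1+2i 3; 4 5-1i ]
>> A'
>> A.'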
We can see that the single quote is an operator that takes the Hermitian transpose of a matrix A, commonly denoted by AH : it is the transpose and complex conjugate of A. That is, AH = ĀT , where Ā is the matrix whose entries are the complex conjugates of the entries of A.
Meanwhile, the dot followed by the single quote is the transpose operator.
Either operator can be used to take the transpose for matrices with real entries, but one must
be more careful when working with complex entries. That said, why is the “default” behavior, rep-
resented by the simpler single quote operator, the Hermitian transpose rather than the transpose?
This is because in general, results or techniques established for real matrices, that make use of the
transpose, do not generalize to the complex case unless the Hermitian transpose is used instead.
1.2.8 if Statements
Now, we will learn some essential programming constructs, that Matlab shares with many other
programming languages. The first is an if statement, which is used to perform a different task
based on the result of a given conditional expression, that is either true or false.
At this point, we will also learn how to write a script in Matlab. Scripts are very useful for
the following reasons:
• Some Matlab statements, such as the programming constructs we are about to discuss, are
quite complicated and span several lines. Typing them at the command window prompt can
be very tedious, and if a mistake is made, then the entire construct must be retyped.
• It frequently occurs that a sequence of commands needs to be executed several times, with no
or minor changes. It can be very tedious and inefficient to repeatedly type out such command
sequences several times, even if Matlab’s history features (such as using the arrow keys to
scroll through past commands) are used.
A script can be written in a plain text file, called an M-file, which is a file that has a .m extension.
An M-file can be written in any text editor, or in Matlab’s own built-in editor. To create a new
M-file or edit an existing M-file, one can use the edit command at the prompt:
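For example, the script developed below can be created with

>> edit entermonthyear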
If no extension is given, a .m extension is assumed. If the file does not exist in the current working
directory, Matlab will ask if the file should be created.
In the editor, type in the following code, that computes the number of days in a given month,
while taking leap years into account.
% entermonthyear - script that asks the user to provide a month and year,
% and displays the number of days in that month
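% (what follows is one possible body for this script; it is a sketch
% consistent with the sample run shown later in this section)
month=input('Enter the month (1-12): ');
year=input('Enter the 4-digit year: ');
monthnames=['January  '; 'February '; 'March    '; 'April    '; ...
    'May      '; 'June     '; 'July     '; 'August   '; ...
    'September'; 'October  '; 'November '; 'December '];
if month==4 || month==6 || month==9 || month==11
    days=30;
elseif month==2
    % leap years: divisible by 4, except centuries not divisible by 400
    if mod(year,400)==0
        days=29;
    elseif mod(year,100)==0
        days=28;
    elseif mod(year,4)==0
        days=29;
    else
        days=28;
    end
else
    days=31;
end
disp([ deblank(monthnames(month,:)) ', ' num2str(year) ' has ' ...
    num2str(days) ' days.' ])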
Can you figure out how an if statement works, based on your knowledge of what the result should
be?
Note that this M-file includes comments, which are preceded by a percent sign (%). Once a %
is entered on a line, the rest of that line is ignored. This is very useful for documenting code so
that a reader can understand what it is doing. The importance of documenting one’s code cannot
be overstated. In fact, it is good practice to write the documentation before the code, so that the
process of writing code is informed with a clearer idea of the task at hand.
As this example demonstrates, if statements can be nested within one another. Also note the
use of the keywords else and elseif. These are used to provide alternative conditions under which
different code can be executed, if the original condition in the if statement turns out to be false.
If any conditions paired with the elseif keyword also turn out to be false, then the code following
the else keyword, if any, is executed.
This script features some new functions that can be useful in many situations:
• deblank(s): returns a new string variable that is the same as s, except that any “white
space” (spaces, tabs, or newlines) at the end of s is removed
To execute a script M-file, simply type the name of the file (without the .m extension) at the
prompt.
>> entermonthyear
Enter the month (1-12): 5
Enter the 4-digit year: 2001
May, 2001 has 31 days.
Note that in the above script M-file, most of the statements are terminated with semicolons.
Unlike in programming languages such as C++, the semicolon is not required. If it is omitted from
a statement, then the value of any expression that is computed in that statement is displayed, along
with its variable name (or ans, if there is no variable associated with the expression). Including the
semicolon suppresses printing. In most cases, this is the desired behavior, because excessive output
can make software less user-friendly. However, omitting semicolons can be useful when writing and
debugging new code, because seeing intermediate results of a computation can expose bugs. Once
the code is working, then semicolons can be added to suppress superfluous output.
Note the syntax for a for statement: the keyword for is followed by a loop variable, such as i,
j or k in this example, and that variable is assigned a value. Then the body of the loop is given,
followed by the keyword end.
What does this loop actually do? During the nth iteration, the loop variable is set equal to the
nth column of the expression that is assigned to it by the for statement. Then, the loop variable
retains this value throughout the body of the loop (unless the loop variable is changed within
the body of the loop, which is ill-advised, and sometimes done by mistake!), until the iteration is
completed. Then, the loop variable is assigned the next column for the next iteration. In most
cases, the loop variable is simply used as a counter, in which case assigning to it a row vector of
values, created using the colon operator, yields the desired behavior.
Now run this script, just like in the previous example. The script displays a randomly generated
matrix A, then performs Gaussian elimination on A to obtain an upper triangular matrix U , and
then displays the final result U .
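A script along the following lines (a sketch of our own; the authors' gausselim.m may differ in its details) carries out this process:

% gausselim - script that performs Gaussian elimination on a random matrix
n=4;
A=rand(n)                      % display the randomly generated matrix
for j=1:n-1
    for i=j+1:n
        mult=A(i,j)/A(j,j);    % multiplier for eliminating A(i,j)
        for k=j:n
            A(i,k)=A(i,k)-mult*A(j,k);
        end
    end
end
U=A                            % display the resulting upper triangular matrix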
Exercise 1.2.3 An upper triangular matrix U has the property that uij = 0 whenever
i > j; that is, the entire “lower triangle” of U , consisting of all entries below the main
diagonal, must be zero. Examine the matrix U produced by the script gausselim above.
Why are some subdiagonal entries nonzero?
% newtonsqrt - script that uses Newton's method to compute the square root
% of 2
x = 1;
while true
    oldx = x;
    x = x/2 + 1/x      % Newton iteration; no semicolon, so each iterate is displayed
    % stop when the relative difference between successive iterates is below eps
    if abs(x - oldx)/abs(oldx) < eps
        break
    end
end
% display result and verify that it really is the square root of 2
disp('The square root of 2 is:')
x
disp('x^2 is:')
disp(x^2)
Note the use of the expression true in the while statement. The value of the predefined
variable true is 1, while the value of false is 0, following the convention used in many programming
languages that a nonzero number is interpreted as the boolean value “true”, while zero is interpreted
as “false”.
A while loop runs as long as the condition in the while statement is true. It follows that
this particular while statement is an infinite loop, since the value of true will never be false.
However, this loop will exit when the condition in the enclosed if statement is true, due to the
break statement. A break statement causes the enclosing for or while loop to immediately exit.
This particular while loop computes an approximation to √2 such that the relative difference between each new iterate x and the previous iterate oldx is less than the value of the predefined variable eps, which is informally known as “unit roundoff” or “machine precision”. This concept will be discussed further in Section 1.5.1. The approximation to √2 itself is obtained using Newton's method, which will be discussed in Section 8.4.
Go ahead and run this script by typing its name, newtonsqrt, at the prompt. The code will display each iteration as the approximation is improved until it is sufficiently accurate. Note that convergence to √2 is quite rapid! This effect will be explored further in Chapter 8.
function tc=fahtocel(tf)
% converts function input ’tf’ of temperatures in Fahrenheit to
% function output ’tc’ of temperatures in Celsius
temp=tf-32;
tc=temp*5/9;
Note that a function definition begins with the keyword function; this is how Matlab distinguishes
a script M-file from a function M-file (though in a function M-file, comments can still precede
function).
After the keyword function, the output arguments of the function are specified. In this function,
there is only one output argument, tc, which represents the Celsius temperature. If there were
more than one, then they would be enclosed in square brackets, and in a comma-separated list.
After the output arguments, there is a = sign, then the function name, which should match the
name of the M-file aside from the .m extension. Finally, if there are any input arguments, then they
are listed after the function name, separated by commas and enclosed in parentheses.
After this first line, all subsequent code is considered the body of the function: the statements
that are executed when the function is called. The only exception is that other functions can be
defined within a function M-file, but they are “helper” functions, that can only be called by code
within the same M-file. Helper functions must appear after the function after which the M-file is
named.
Type in the above code for fahtocel into a file fahtocel.m in the current working directory.
Then, it can be executed as follows:
>> tc=fahtocel(212)
tc =
100
If tc= had been omitted, then the output value 100 would have been assigned to the special variable
ans, described in §1.2.4.
Note that the definition of fahtocel uses a variable temp. Here, it should be emphasized that
all variables defined within a function, including input and output arguments, are only defined
within the function itself. If a variable inside a function, such as temp, happens to have the same
name as another variable defined in the top-level workspace (the memory used by variables defined
outside of any function), or in another function, then this other variable is completely independent
of the one that is internal to the function. Consider the following example:
>> temp=32
temp =
32
>> tfreeze=fahtocel(temp)
tfreeze =
0
>> temp
temp =
32
Inside fahtocel, temp is set equal to zero by the subtraction of 32, but the temp in the top-level
workspace retains its value of 32.
Comments included at the top of an M-file (whether script or function) are assumed by Matlab
to provide documentation of the M-file. As such, these comments are displayed by the help
command, as applied to that function. Try the following commands:
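For example, assuming fahtocel.m and entermonthyear.m are in the current working directory:

>> help fahtocel
>> help entermonthyear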
We now illustrate other important aspects of functions, using the following M-file, which is
called vecangle2.m:
% vecangle2 - function that computes the angle between two given vectors
% in both degrees and radians. We use the formula
% x’*y = ||x||_2 ||y||_2 cos(theta), where theta is the angle between x and
% y
function [anglerad,angledeg]=vecangle2(x,y)
n=length(x);
if n~=length(y)
error(’vector lengths must agree’)
end
% compute needed quantities for above formula
dotprod=x’*y;
xnorm=norm(x);
ynorm=norm(y);
% obtain cos(angle)
cosangle=dotprod/(xnorm*ynorm);
% use inverse cosine to obtain angle in radians
anglerad=acos(cosangle);
% if angle in degrees is desired (that is, two output arguments are
% specified), then convert to degrees. Otherwise, don’t bother
if nargout==2,
angledeg=anglerad*180/pi;
end
As described in the comments, the purpose of this function is to compute the angle between two
vectors in n-dimensional space, in both radians and degrees. Note that this function accepts
multiple input arguments and multiple output arguments. The way in which this function is called
is similar to how it is defined. For example, try this command:
>> [arad,adeg]=vecangle2(rand(5,1),rand(5,1))
It is important that code include error-checking, if it might be used by other people. To that
end, the first task performed in this function is to check whether the input arguments x and y have
the same length, using the length function that returns the number of elements of a vector or
matrix. If they do not have the same length, then the error function is used to immediately exit
the function vecangle2 and display an informative error message.
Note the use of the variable nargout at the end of the function definition. The function
vecangle2 is defined to have two output arguments, but nargout is the number of output arguments
that are actually specified when the function is called. Similarly, nargin is the number of input
arguments that are specified.
These variables allow functions to behave more flexibly and more efficiently. In this case, the
angle between the vectors is only converted to degrees if the user specified both output arguments,
thus making nargout equal to 2. Otherwise, it is assumed that the user only wanted the angle in
radians, so the conversion is never performed. Matlab typically provides several interfaces to its
functions, and uses nargin and nargout to determine which interface is being used. These multiple
interfaces are described in the help pages for such functions.
Exercise 1.2.4 Try calling vecangle2 in various ways, with different numbers of input
and output arguments, and with vectors of either the same or different lengths. Observe
the behavior of Matlab in each case.
1.2.12 Graphics
Next, we learn some basic graphics commands. We begin by plotting the graph of the function
y = x2 on [−1, 1]. Start by creating a vector x, of equally spaced values between −1 and 1, using
the colon operator. Then, create a vector y that contains the squares of the values in x. Make sure
to use the correct approach to squaring the elements of a vector!
The plot command, in its simplest form, takes two input arguments that are vectors, that must
have the same length. The first input argument contains x-values, and the second input argument
contains y-values. The command plot(x,y) creates a new figure window (if none already exists)
and plots y as a function of x in a set of axes contained in the figure window. Try plotting the
graph of the function y = x2 on [−1, 1] using this command.
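For example, one way to do this (the spacing 0.01 is our own choice) is:

>> x=-1:0.01:1;
>> y=x.^2;        % element-wise squaring
>> plot(x,y)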
Note that by default, plot produces a solid blue curve. In reality, it is not a curve; it simply
“connects the dots” using solid blue line segments, but if the segments are small enough, the
resulting piecewise linear function resembles a smooth curve. But what if we want to plot curves
using different colors, different line styles, or different symbols at each point?
Use the help command to view the help page for plot, which lists the specifications for different
colors, line styles, and marker styles (which are used at each point that is plotted). The optional
third argument to the plot command is used to specify these colors and styles. They can be mixed
together; for example, the third argument ’r--’ plots a dashed red curve. Experiment with these
different colors and styles, and with different functions.
Matlab provides several commands that can be used to produce more sophisticated plots. It is recommended that you view the help pages for these commands, and also experiment with their usage; a brief example follows the list below.
• hold is used to specify that subsequent plot commands should be superimposed on the same
set of axes, rather than the default behavior in which the current axes are cleared with each
new plot command.
• subplot is used to divide a figure window into an m × n matrix of axes, and specify which
set of axes should be used for subsequent plot commands.
• xlabel and ylabel are used to label the horizontal and vertical axes, respectively, with given
text.
• legend is used to place a legend within a set of axes, so that the curves displayed on the axes
can be labeled with given text.
• gtext is used to place given text at an arbitrary point within a figure window, indicated by
clicking the mouse at the desired point.
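For example, the following commands (the functions plotted are our own choice) superimpose two curves, label the axes, and add a legend:

>> x=-1:0.01:1;
>> plot(x,x.^2)
>> hold on
>> plot(x,x.^3,'r--')
>> xlabel('x')
>> ylabel('y')
>> legend('x^2','x^3')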
Exercise 1.2.5 Reproduce the plot shown in Figure 1.3 using the commands discussed
in this section.
Finally, it is essential to be able to save a figure so that it can be printed or included in a
document. In the figure window, go to the File menu and choose “Save” or “Save As”. You will
see that the figure can be saved in a variety of standard image formats, such as JPEG or Windows
bitmap (BMP).
Another format is “Matlab Figure (*.fig)”. It is highly recommended that you save your figure
in this format, as well as the desired image format. Then, if you need to go back and change
something about the figure after you have already closed the figure window, you can simply use
the open command, with the .fig filename as its input argument, to reopen the figure window.
Otherwise, you would have to recreate the entire figure from scratch.
• r=roots(p) returns a column vector r consisting of the roots of the polynomial represented
by p
• p=poly(r) is, in a sense, an inverse of roots. This function produces a row vector p that
represents the monic polynomial (that is, with leading coefficient 1) whose roots are the
entries of the vector r.
• q=polyder(p) computes the coefficients of the polynomial q that is the derivative of the
polynomial p.
• r=conv(p,q) computes the coefficients of the polynomial r that is the product of the poly-
nomials p and q.
It is recommended that you experiment with these functions in order to get used to working with
them.
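For example, the following commands (using the polynomial x² − 3x + 2 = (x − 1)(x − 2) as an illustration) exercise each of these functions:

>> p=[ 1 -3 2 ]      % represents x^2 - 3x + 2
>> r=roots(p)        % returns the roots 2 and 1
>> poly(r)           % recovers the coefficients [ 1 -3 2 ]
>> q=polyder(p)      % derivative 2x - 3, represented by [ 2 -3 ]
>> s=conv(p,[ 1 1 ]) % coefficients of (x^2 - 3x + 2)(x + 1)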
>> f=inline(’exp(sin(2*x))’);
>> f(pi/4)
ans =
2.7183
If an inline function takes more than one argument, it is important to specify which argument is first, which is second, and so on, when the function is constructed. For example, to construct an inline function for f (x, y) = √(x² + y²), the order of the arguments x and y must be stated explicitly.
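For example, a call along these lines lists the argument names, in order, after the expression:

>> f=inline('sqrt(x^2+y^2)','x','y');
>> f(3,4)
ans =
     5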
>> f=@(x)exp(sin(2*x));
>> f(0)
ans =
1
“hands-on” approach is needed to achieve this level of proficiency, and this book is written with
this necessity in mind.
Exercise 1.4.1 Consider the process of computing cos(π/4) using a calculator or com-
puter. Indicate sources of data error and computational error, including both truncation
and roundoff error.
Definition 1.4.1 (Absolute Error, Relative Error) Let ŷ be a real number that is an approximation to the real number y. The absolute error in ŷ is
Eabs = ŷ − y.
The relative error in ŷ is
Erel = (ŷ − y)/y,
provided that y ≠ 0.
Example 1.4.2 If we add the numbers 0.4567 × 10⁰ and 0.8580 × 10⁻², we obtain the exact result
x = 0.465280 × 10⁰,
which is rounded to
x̂ = 0.4652 × 10⁰.
The absolute error in this computation is
Eabs = x̂ − x = 0.4652 − 0.46528 = −0.8 × 10⁻⁴.
which is rounded to
x̂ = 0.3896 × 102 = 38.96.
The absolute error in this computation is
Example 1.4.3 Suppose that the exact value of a computation is supposed to be 10⁻¹⁶, and an approximation of 2 × 10⁻¹⁶ is obtained. Then the absolute error in this approximation is
Eabs = 2 × 10⁻¹⁶ − 10⁻¹⁶ = 10⁻¹⁶,
which suggests the computation is accurate because this error is small. However, the relative error is
Erel = (2 × 10⁻¹⁶ − 10⁻¹⁶)/10⁻¹⁶ = 1,
which suggests that the computation is completely erroneous, because by this measure, the error is
equal in magnitude to the exact value; that is, the error is 100%. It follows that an approximation of
zero would be just as accurate. This example, although an extreme case, illustrates why the absolute
error can be a misleading measure of error. 2
Clearly, our primary goal in error analysis is to obtain an estimate of the forward error ∆y. Un-
fortunately, it can be difficult to obtain this estimate directly.
An alternative approach is to instead view the computed value ŷ as the exact solution of a
problem with modified data; i.e., ŷ = f (x̂) where x̂ is a perturbation of x.
Exercise 1.4.2 Let x0 = 1, and f (x) = ex . If the magnitude of the forward error
in computing f (x0 ), given by |fˆ(x0 ) − f (x0 )|, is 0.01, then determine a bound on the
magnitude of the backward error.
Exercise 1.4.3 For a general function f (x), explain when the magnitude of the forward
error is greater than, or less than, that of the backward error. Assume f is differentiable
near x and use calculus to explain your reasoning.
The absolute condition number of the problem of computing y = f (x), denoted by κabs , is the ratio of the magnitude of the absolute forward error to the magnitude of the absolute backward error,
κabs = |f (x̂) − f (x)| / |x̂ − x|.
If f (x) ≠ 0, then the relative condition number of the problem of computing y = f (x), denoted by κrel , is the ratio of the magnitude of the relative forward error to the magnitude of the relative backward error,
κrel = (|f (x̂) − f (x)|/|f (x)|) / (|x̂ − x|/|x|).
Intuitively, either condition number is a measure of the change in the solution due to a change in
the data. Since the relative condition number tends to be a more reliable measure of this change,
it is sometimes referred to as simply the condition number.
If the condition number is large, e.g. much greater than 1, then a small change in the data can
cause a disproportionately large change in the solution, and the problem is said to be ill-conditioned
or sensitive. If the condition number is small, then the problem is said to be well-conditioned or
insensitive.
Since the condition number, as defined above, depends on knowledge of the exact solution f (x),
it is necessary to estimate the condition number in order to estimate the relative forward error. To
that end, we assume, for simplicity, that f : R → R is differentiable and obtain
κrel = |x∆y| / |y∆x|
     = |x(f (x + ∆x) − f (x))| / |f (x)∆x|
     ≈ |xf ′(x)∆x| / |f (x)∆x|
     = |xf ′(x)/f (x)|.
Therefore, if we can estimate the backward error ∆x, and if we can bound f and f ′ near x, we can then bound the relative condition number and obtain an estimate of the relative forward error. Of course, the relative condition number is undefined if the exact value f (x) is zero. In this case, we can instead use the absolute condition number. Using the same approach as before, the absolute condition number can be estimated using the derivative of f . Specifically, we have κabs ≈ |f ′(x)|.
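For instance (an illustration of our own), consider computing y = f (x) = √x for x > 0. Since f ′(x) = 1/(2√x), the estimate above gives
κrel ≈ |x f ′(x)/f (x)| = |x/(2√x · √x)| = 1/2,
so the problem of computing a square root is well-conditioned for every positive x.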
Exercise 1.4.4 Let f (x) = ex , g(x) = e−x , and x0 = 2. Suppose that the relative
backward error in x0 satisfies |∆x0 /x0 | = |x̂0 − x0 |/|x0 | ≤ 10−2 . What is an upper
bound on the relative forward error in f (x0 ) and g(x0 )? Use Matlab or a calculator to
experimentally confirm that this bound is valid, by evaluating f (x) and g(x) at selected
points and comparing values.
The condition number of a function f depends on, among other things, the absolute forward
error f (x̂) − f (x). However, an algorithm for evaluating f (x) actually evaluates a function fˆ
that approximates f , producing an approximation ŷ = fˆ(x) to the exact solution y = f (x). In
our definition of backward error, we have assumed that fˆ(x) = f (x̂) for some x̂ that is close
to x; i.e., our approximate solution to the original problem is the exact solution to a “nearby”
problem. This assumption has allowed us to define the condition number of f independently of any
approximation fˆ. This independence is necessary, because the sensitivity of a problem depends
solely on the problem itself and not any algorithm that may be used to approximately solve it.
Is it always reasonable to assume that any approximate solution is the exact solution to a
nearby problem? Unfortunately, it is not. It is possible that an algorithm that yields an accurate
approximation for given data may be unreasonably sensitive to perturbations in that data. This
leads to the concept of a stable algorithm: an algorithm applied to a given problem with given data
x is said to be stable if it computes an approximate solution that is the exact solution to the same
problem with data x̂, where x̂ is a small perturbation of x.
It can be shown that if a problem is well-conditioned, and if we have a stable algorithm for
solving it, then the computed solution can be considered accurate, in the sense that the relative
error in the computed solution is small. On the other hand, a stable algorithm applied to an
ill-conditioned problem cannot be expected to produce an accurate solution.
Example 1.4.7 This example will illustrate the last point made above. To solve a system of linear
equations Ax = b in Matlab, we can use the \ operator:
x = A\b
Enter the following matrix and column vectors in Matlab, as shown. Recall that a semicolon (;)
separates rows.
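For instance, a nearly singular matrix such as the following (our own choice, not necessarily the one used in the authors' course materials) exhibits the behavior described below:

>> A=[ 1 2; 1 2.0001 ];
>> b1=[ 3; 3.0001 ];
>> b2=[ 3; 3.0002 ];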
Then, solve the systems A*x1 = b1 and A*x2 = b2. Note that b1 and b2 are not very different,
but what about the solutions x1 and x2? The algorithm implemented by the \ operator is stable, but
what can be said about the conditioning of the problem of solving Ax = b for this matrix A? The
conditioning of systems of linear equations will be studied in depth in Chapter 2. 2
Exercise 1.4.5 Let f (x) be a function that is one-to-one. Then, solving the equation
f (x) = c for some c in the range of f is equivalent to computing x = f −1 (c). What is the
condition number of the problem of solving f (x) = c?
1.4.5 Convergence
Many algorithms in numerical analysis are iterative methods that produce a sequence {αn } of ap-
proximate solutions which, ideally, converges to a limit α that is the exact solution as n approaches
∞. Because we can only perform a finite number of iterations, we cannot obtain the exact solution,
and we have introduced computational error.
If our iterative method is properly designed, then this computational error will approach zero
as n approaches ∞. However, it is important that we obtain a sufficiently accurate approximate
solution using as few computations as possible. Therefore, it is not practical to simply perform
enough iterations so that the computational error is determined to be sufficiently small, because it
is possible that another method may yield comparable accuracy with less computational effort.
The total computational effort of an iterative method depends on both the effort per iteration
and the number of iterations performed. Therefore, in order to determine the amount of compu-
tation that is needed to attain a given accuracy, we must be able to measure the error in αn as
a function of n. The more rapidly this function approaches zero as n approaches ∞, the more
rapidly the sequence of approximations {αn } converges to the exact solution α, and as a result,
fewer iterations are needed to achieve a desired accuracy. We now introduce some terminology that
will aid in the discussion of the convergence behavior of iterative methods.
Definition 1.4.8 (Big-O Notation) Let f and g be two functions defined on a domain
D ⊆ R that is not bounded above. We write that f (n) = O(g(n)) if there exists a positive
constant c such that
|f (n)| ≤ c|g(n)|, n ≥ n0 ,
for some n0 ∈ D.
As sequences are functions defined on N, the domain of the natural numbers, we can apply
big-O notation to sequences. Therefore, this notation is useful to describe the rate at which a
sequence of computations converges to a limit.
Suppose that limn→∞ αn = α, where α is a real number. We say that {αn } converges to α with rate of convergence O(βn ) if αn − α = O(βn ).
We say that an iterative method converges rapidly, in some sense, if it produces a sequence
of approximate solutions whose rate of convergence is O(βn ), where the terms of the sequence
βn approach zero rapidly as n approaches ∞. Intuitively, if two iterative methods for solving the
same problem perform a comparable amount of computation during each iteration, but one method
exhibits a faster rate of convergence, then that method should be used because it will require less
overall computational effort to obtain an approximate solution that is sufficiently accurate.
Example Consider the sequence
αn = (n + 1)/(n + 2), n = 1, 2, . . . .
Then, we have
limn→∞ αn = limn→∞ [(n + 1)/(n + 2)] · [(1/n)/(1/n)]
          = limn→∞ (1 + 1/n)/(1 + 2/n)
          = [limn→∞ (1 + 1/n)] / [limn→∞ (1 + 2/n)]
          = (1 + limn→∞ 1/n)/(1 + limn→∞ 2/n)
          = (1 + 0)/(1 + 0)
          = 1.
That is, the sequence {αn } converges to α = 1. To determine the rate of convergence, we note that
αn − α = (n + 1)/(n + 2) − 1 = −1/(n + 2),
and since
|−1/(n + 2)| ≤ 1/n, n = 1, 2, . . . ,
we conclude that {αn } converges to α = 1 with rate of convergence O(1/n).
Example Consider the sequence
αn = (2n² + 4n)/(n² + 2n + 1), n = 1, 2, . . . .
Then, we have
limn→∞ αn = limn→∞ [(2n² + 4n)/(n² + 2n + 1)] · [(1/n²)/(1/n²)]
          = limn→∞ (2 + 4/n)/(1 + 2/n + 1/n²)
          = [limn→∞ (2 + 4/n)] / [limn→∞ (1 + 2/n + 1/n²)]
          = (2 + limn→∞ 4/n)/(1 + limn→∞ (2/n + 1/n²))
          = 2.
That is, the sequence {αn } converges to α = 2. To determine the rate of convergence, we note that
αn − α = (2n² + 4n)/(n² + 2n + 1) − 2 = −2/(n² + 2n + 1),
and since |−2/(n² + 2n + 1)| ≤ 2/n² for n ≥ 1, we conclude that {αn } converges to α = 2 with rate of convergence O(1/n²).
We can also use big-O notation to describe the rate of convergence of a function to a limit.
Example 1.4.11 Consider the function f (h) = 1 + 2h. Since this function is continuous for all
h, we have
lim f (h) = f (0) = 1.
h→0
It follows that
f (h) − f0 = (1 + 2h) − 1 = 2h = O(h),
so we can conclude that as h → 0, 1 + 2h converges to 1 of order O(h). 2
Example 1.4.12 Consider the function f (h) = 1 + 4h + 2h2 . Since this function is continuous for
all h, we have
lim f (h) = f (0) = 1.
h→0
It follows that
f (h) − f0 = (1 + 4h + 2h2 ) − 1 = 4h + 2h2 .
To determine the rate of convergence as h → 0, we consider h in the interval [−1, 1]. In this
interval, |h²| ≤ |h|. It follows that
|4h + 2h²| ≤ 4|h| + 2|h²| ≤ 4|h| + 2|h| = 6|h|.
Since there exists a constant C (namely, 6) such that |4h + 2h2 | ≤ C|h| for h satisfying |h| ≤ h0 for
some h0 (namely, 1), we can conclude that as h → 0, 1 + 4h + 2h2 converges to 1 of order O(h). 2
In general, if
f0 = limh→0 f (h)
denotes the exact value, then f (h) − f0 represents the absolute error in the approximation f (h). When
this error is a polynomial in h, as in this example and the previous example, the rate of convergence
is O(hk ) where k is the smallest exponent of h in the error. This is because as h → 0, the
smallest power of h approaches zero more slowly than higher powers, thereby making the dominant
contribution to the error.
By contrast, when determining the rate of convergence of a sequence {αn } as n → ∞, the
highest power of n determines the rate of convergence. As powers of n are negative if convergence
occurs at all as n → ∞, and powers of h are positive if convergence occurs at all as h → 0, it can be
said that for either type of convergence, it is the exponent that is closest to zero that determines
the rate of convergence.
Example 1.4.13 Consider the function f (h) = cos h. Since this function is continuous for all h,
we have
lim f (h) = f (0) = 1.
h→0
By Taylor's Theorem,
f (h) = f (0) + f ′(0)h + (f ′′(ξ(h))/2) h²,
where ξ(h) is between 0 and h. Substituting f (h) = cos h into the above, we obtain
cos h = 1 − (sin 0)h − (cos ξ(h)/2) h²,
or
cos h = 1 − (cos ξ(h)/2) h².
Since |cos ξ(h)| ≤ 1, it follows that |cos h − 1| ≤ h²/2, so cos h converges to 1 with rate of convergence O(h²) as h → 0.
Example 1.4.14 The following approximation to the derivative is based on the definition:
f ′(x0 ) ≈ [f (x0 + h) − f (x0 )]/h, error = −(h/2) f ′′(ξ).
An alternative approximation is
f ′(x0 ) ≈ [f (x0 + h) − f (x0 − h)]/(2h), error = −(h²/6) f ′′′(ξ).
Taylor series can be used to obtain the error terms in both cases.
Exercise 1.4.7 Use Taylor’s Theorem to derive the error terms in both formulas.
Exercise 1.4.8 Try both of these approximations with f (x) = sin x, x0 = 1, and h =
10−1 , 10−2 , 10−3 , and then h = 10−14 . What happens, and can you explain why?
If you can’t explain what happens for the smallest value of h, fortunately this will be addressed in
Section 1.5. 2
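A short script along these lines (the variable names are our own) carries out the experiment described in Exercise 1.4.8:

f=@(x) sin(x);
x0=1;
exact=cos(x0);                       % exact value of f'(x0)
for h=[ 1e-1 1e-2 1e-3 1e-14 ]
    fwd=(f(x0+h)-f(x0))/h;           % forward difference
    ctr=(f(x0+h)-f(x0-h))/(2*h);     % centered difference
    fprintf('h = %8.0e   forward error = %10.3e   centered error = %10.3e\n', ...
        h, abs(fwd-exact), abs(ctr-exact));
end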
The term “floating-point” comes from the fact that as a number x ∈ F is multiplied by or divided by
a power of β, the mantissa does not change, only the exponent. As a result, the decimal point shifts,
or “floats,” to account for the changing exponent. Nearly all computers use a binary floating-point
system, in which β = 2.
Example 1.5.2 Let x = −117. Then, in a floating-point number system with base β = 10, x is represented as
x = −(1.17) × 10²,
where 1.17 is the mantissa and 2 is the exponent. If the base β = 2, then we have
x = −(1.110101)₂ × 2⁶,
where 1.110101 is the mantissa and 6 is the exponent. The mantissa should be interpreted as a string of binary digits, rather than decimal digits; that is,
(1.110101)₂ = 1 + 2⁻¹ + 2⁻² + 2⁻⁴ + 2⁻⁶ = 1.828125,
so that x = −1.828125 × 2⁶ = −117.
The smallest positive number in F is called the underflow level, and it has the value
UFL = mmin β^L,
where L is the smallest valid exponent and mmin is the smallest mantissa. The largest positive number in F is called the overflow level, and it has the value
OFL = β^(U+1) (1 − β^(−p)).
The value of mmin depends on whether floating-point numbers are normalized in F; this point
will be discussed later. The overflow level is the value obtained by setting each digit in the mantissa
to β − 1 and using the largest possible value, U , for the exponent.
Exercise 1.5.2 Determine the value of OFL for a floating-point system with base β = 2,
precision p = 53, and largest exponent U = 1023.
It is important to note that the real numbers that can be represented in F are not equally spaced
along the real line. Numbers having the same exponent are equally spaced, and the spacing between
numbers in F decreases as their magnitude decreases.
1.5.1.2 Normalization
It is common to normalize floating-point numbers by specifying that the leading digit d0 of the
mantissa be nonzero. In a binary system, with β = 2, this implies that the leading digit is equal
to 1, and therefore need not be stored. In addition to the benefit of gaining one additional bit of
precision, normalization also ensures that each floating-point number has a unique representation.
One drawback of normalization is that fewer numbers near zero can be represented exactly than
if normalization is not used. One workaround is a practice called gradual underflow, in which the
leading digit of the mantissa is allowed to be zero when the exponent is equal to L, thus allowing
smaller values of the mantissa. In such a system, the number UFL is equal to β L−p+1 , whereas in
a normalized system, UFL = β L .
Exercise 1.5.3 Determine the value of UFL for a floating-point system with base β = 2,
precision p = 53, and smallest exponent L = −1022, both with and without normalization.
1.5.1.3 Rounding
A number that can be represented exactly in a floating-point system is called a machine number.
Since only finitely many real numbers are machine numbers, it is necessary to determine how non-
machine numbers are to be approximated by machine numbers. The process of choosing a machine
1.5. COMPUTER ARITHMETIC 37
number to approximate a non-machine number is called rounding, and the error introduced by such
an approximation is called roundoff error. Given a real number x, the machine number obtained
by rounding x is denoted by fl(x).
In most floating-point systems, rounding is achieved by one of two strategies:
• chopping, or rounding to zero, is the simplest strategy, in which the base-β expansion of a
number is truncated after the first p digits. As a result, fl(x) is the unique machine number
between 0 and x that is nearest to x.
• rounding to nearest sets fl(x) to be the machine number that is closest to x in absolute value;
if two numbers satisfy this property, then an appropriate tie-breaking rule must be used, such
as setting fl(x) equal to the choice whose last digit is even.
Example 1.5.4 Suppose we are using a floating-point system with β = 10 (decimal), with p = 4
significant digits. Then, if we use chopping, or rounding to zero, we have fl(2/3) = 0.6666, whereas
if we use rounding to nearest, then we have fl(2/3) = 0.6667. 2
Example 1.5.5 When rounding to even in decimal, 88.5 is rounded to 88, not 89, so that the last
digit is even, while 89.5 is rounded to 90, again to make the last digit even. 2
Rounding can be characterized in terms of a quantity u, known as the unit roundoff or machine precision: fl(x) = x(1 + δ) for some δ with |δ| ≤ u, for any real number x such that UFL < x < OFL.
An intuitive definition of u is that it is the smallest positive number such that
fl (1 + u) > 1.
The value of u depends on the rounding strategy that is used. If rounding toward zero is used,
then u = β 1−p , whereas if rounding to nearest is used, u = 21 β 1−p .
It is important to avoid confusing u with the underflow level UFL. The unit roundoff is deter-
mined by the number of digits in the mantissa, whereas the underflow level is determined by the
range of allowed exponents. However, we do have the relation that 0 < UFL < u.
In analysis of roundoff error, it is assumed that fl(x op y) = (x op y)(1 + δ), where op is an
arithmetic operation and δ is an unknown constant satisfying |δ| ≤ u. From this assumption, it
can be seen that the relative error in fl(x op y) is |δ|. In the case of addition, the relative backward
error in each operand is also |δ|.
In fact, most computers conform to the IEEE standard for floating-point arithmetic. The standard
specifies, among other things, how floating-point numbers are to be represented in memory. Two
representations are given, one for single-precision and one for double-precision.
Under the standard, single-precision floating-point numbers occupy 4 bytes in memory, with
23 bits used for the mantissa, 8 for the exponent, and one for the sign. IEEE double-precision
floating-point numbers occupy eight bytes in memory, with 52 bits used for the mantissa, 11 for
the exponent, and one for the sign. That is, in the IEEE floating-point standard, p = 24 for single
precision, and p = 53 for double precision, even though only 23 and 52 bits, respectively, are used
to store mantissas.
Example 1.5.7 The following table summarizes the main aspects of a general floating-point system
and a double-precision floating-point system that uses a 52-bit mantissa and 11-bit exponent. For
both systems, we assume that rounding to nearest is used, and that normalization is used. 2
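A summary along these lines captures the main quantities (the double-precision values follow from β = 2, p = 53, L = −1022, U = 1023; the general column gives the defining formulas):

                      General system          IEEE double precision
  Base β              β                       2
  Precision p         p                       53
  Exponent range      [L, U]                  [−1022, 1023]
  UFL                 β^L                     2^−1022 ≈ 2.2 × 10^−308
  OFL                 β^(U+1)(1 − β^(−p))     ≈ 1.8 × 10^308
  Unit roundoff u     (1/2)β^(1−p)            2^−53 ≈ 1.1 × 10^−16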
Exercise 1.5.4 Are the values for UFL and OFL given in the table above the actual
values used in the IEEE double-precision floating point system? Experiment with powers
of 2 in Matlab to find out. What are the largest and smallest positive numbers you can
represent? Can you explain any discrepancies between these values and the ones in the
table?
Exercise 1.5.5 Consider computing the sum x1 + x2 + · · · + xn , where each term xi is positive. Will the sum be computed more accurately in floating-
point arithmetic if the numbers are added in order from smallest to largest, or largest to
smallest? Justify your answer.
In multiplication or division, the operands need not be shifted, but the mantissas, when mul-
tiplied or divided, cannot necessarily be represented using only p digits of precision. The product
of two mantissas requires 2p digits to be represented exactly, while the quotient of two mantissas
could conceivably require infinitely many digits.
Because floating-point arithmetic operations are not exact, they do not follow all of the laws of real
arithmetic. In particular, floating-point arithmetic is not associative; i.e., x + (y + z) 6= (x + y) + z
in floating-point arithmetic.
Exercise 1.5.6 In Matlab, generate three random numbers x, y and z, and compute
x + (y + z) and (x + y) + z. Do they agree? Try this a few times with different random
numbers and observe what happens.
Furthermore, overflow or underflow may occur depending on the exponents of the operands, since
their sum or difference may lie outside of the interval [L, U ].
Exercise 1.5.7 Consider the formula z = √(x² + y²). Explain how overflow can occur in computing z, even if x, y and z all have magnitudes that can be represented. How can this formula be rewritten so that overflow does not occur?
1.5.3.3 Cancellation
Subtraction of floating-point numbers presents a unique difficulty, in addition to the rounding error
previously discussed. If the operands, after shifting exponents as needed, have leading digits in
common, then these digits cancel and the first digit in which the operands do not match becomes
the leading digit. However, since each operand is represented using only p digits, it follows that
the result contains only p − m correct digits, where m is the number of leading digits that cancel.
In an extreme case, if the two operands differ by less than u, then the result contains no
correct digits; it consists entirely of roundoff error from previous computations. This phenomenon
is known as catastrophic cancellation. Because of the highly detrimental effect of this cancellation,
it is important to ensure that no steps in a computation compute small values from relatively large
operands. Often, computations can be rearranged to avoid this risky practice.
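For illustration, the following short Matlab experiment (the particular numbers are our own choice) shows how subtracting two nearly equal quantities leaves a result dominated by roundoff error:

x = 1 + 1e-15;                 % a number very close to 1
y = 1;
d = x - y;                     % exact answer would be 1e-15
relerr = abs(d - 1e-15)/1e-15  % relative error is far larger than u

Even though each of x and y is accurate to roughly 16 digits, their difference retains only a few correct digits, because the leading digits they share cancel.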
Exercise 1.5.8 Use the Matlab function randn to generate 1000 normally distributed
random numbers with mean 1000 and standard deviation 0.1. Then, use these formulas
to compute the variance:
where x̄ is the mean. How do the results differ? Which formula is more susceptible to
issues with floating-point arithmetic, and why?
Exercise 1.5.9 Recall Exercise 1.4.8, in which two approximations of the derivative were
tested using various values of the spacing h. In light of the discussion in this chapter,
explain the behavior for the case of h = 10−14 .
Part II
Chapter 2
Methods for Systems of Linear Equations
In this chapter we discuss methods for solving the problem Ax = b, where A is a square invertible
matrix of size n, x is an unknown n-vector, and b is a vector of size n. Solving a system of
linear equations comes up in a number of applications such as solving a linear system of ordinary
differential equations (ODEs).
From the above augmented matrix, it is easy to see that we can now simply solve
a33 x3 = b3
x3 = b3 /a33
Now that we have found x3 , we substitute it into the previous equation to solve for the unknown
in that equation, which is
a22 x2 + a23 x3 = b2
x2 = (b2 − a23 x3 )/a22 .
As before, we now substitute x2 and x3 into the first equation, and again we have a linear
equation with only one unknown.
The process we have just described is known as back substitution. Just by looking at the
above 3 × 3 case you can see a pattern emerging, and this pattern can be described in terms of an
algorithm that can be used to solve these types of triangular systems.
Exercise 2.1.1 (a) Write a Matlab function to solve the following triangular system:
2x + y − 3z = −10
−2y + z = −2
z = 6
(b) How many floating point operations does this function perform?
In the algorithm, we assume that U is the upper triangular matrix containing the coefficients
of the system, and y is the vector containing the right-hand sides of the equations.
for i = n, n − 1, . . . , 1 do
xi = yi
for j = i + 1, i + 2, . . . , n do
xi = xi − uij xj
end
xi = xi /uii
end
Upper triangular systems are the goal we are trying to achieve through Gaussian elimination, which
we discuss in the next section. Looking at the number of levels of loops in the above algorithm, we
can see that there are two levels of nesting involved; therefore, solving these systems requires O(n²)
floating-point operations.
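The pseudocode above translates directly into Matlab. The following is a minimal sketch; the function name backsub is our own choice, and U is assumed to be upper triangular with nonzero diagonal entries.

function x = backsub(U, y)
    % Solve U*x = y by back substitution
    n = length(y);
    x = zeros(n, 1);
    for i = n:-1:1
        x(i) = y(i);
        for j = i+1:n
            x(i) = x(i) - U(i,j)*x(j);   % subtract known terms
        end
        x(i) = x(i)/U(i,i);              % divide by the diagonal entry
    end
end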
It can be seen that in such a system each linear equation has only one unknown, so we can solve each
equation independently of the other equations. It does not matter which equation we start with in
solving this system, since they can all be solved independently of each other. But for the purpose
of this book, we will start at the top. The equations we need to solve are as follows:
a11 x1 = b1
a22 x2 = b2
a33 x3 = b3
..
.
ann xn = bn .
From the above we can see that each solution is found by computing xi = bi /aii .
Exercise 2.1.2 (a) Write a Matlab function to solve the following diagonal system
of equations:
3x = 4
2y = 8
7z = 21
(b) How many floating point operations does this function perform?
for i = 1, 2, . . . , n do
xi = bi /aii
end
We can see that the above algorithm has only one level of loops; therefore, the number of
floating point operations required to solve a system of this type is only n.
• Scaling both sides of an equation by a nonzero constant
• Reordering the equations by interchanging both sides of the ith and jth equation in the
system (Ri ↔ Rj )
• Replacing equation i by the sum of equation i and a multiple of both sides of equation j
(Ri → Ri + sRj )
Exercise 2.2.1 Prove that the following row operations do not change the solution set.
• Replacing equation i by the sum of equation i and a multiple of both sides of equation j.
The third operation is by far the most useful. We will now demonstrate how it can be used to
reduce a system of equations to a form in which it can easily be solved.
Example Consider the system of linear equations
x1 + 2x2 + x3 = 5,
3x1 + 2x2 + 4x3 = 17,
4x1 + 4x2 + 3x3 = 26.
First, we eliminate x1 from the second equation by subtracting 3 times the first equation from the
second. This yields the equivalent system
x1 + 2x2 + x3 = 5,
−4x2 + x3 = 2,
4x1 + 4x2 + 3x3 = 26.
Next, we subtract 4 times the first equation from the third, to eliminate x1 from the third equation
as well:
x1 + 2x2 + x3 = 5,
−4x2 + x3 = 2,
−4x2 − x3 = 6.
Then, we eliminate x2 from the third equation by subtracting the second equation from it, which
yields the system
x1 + 2x2 + x3 = 5,
−4x2 + x3 = 2,
−2x3 = 4.
This system is in upper-triangular form, because the third equation depends only on x3 , and the
second equation depends on x2 and x3 .
Because the third equation is a linear equation in x3 , it can easily be solved to obtain x3 = −2.
x1 + 2x2 + x3 = 5,            x1 + 2x2 + x3 = 5,
     −4x2 + x3 = 2,     →          −4x2 + x3 = 2,
          −2x3 = 4.                       x3 = −2.
Then, we can substitute this value into the second equation, which yields −4x2 = 4.
x1 + 2x2 + x3 = 5,            x1 + 2x2 + x3 = 5,
  −4x2 + (−2) = 2,      →               x2 = −1,
           x3 = −2.                     x3 = −2.
This equation only depends on x2 , so we can easily solve it to obtain x2 = −1. Finally, we substitute
the values of x2 and x3 into the first equation to obtain x1 = 9.
x1 + 2(−1) + (−2) = 5,        x1 = 9,
              x2 = −1,   →    x2 = −1,
              x3 = −2.        x3 = −2.
This process of computing the unknowns from a system that is in upper-triangular form is called
back substitution. 2
Exercise 2.2.2 Write a Matlab function to solve the following upper triangular system
using back substitution.
3x1 + 2x2 + x3 − x4 = 0
−x2 + 2x3 + x4 = 0
x3 + x4 = 1
−x4 = 2
Your function should return the solution, and should take the corresponding matrix and
right hand side vector as input.
In general, a system of n linear equations in n unknowns is in upper-triangular form if the ith
equation depends only on the unknowns xj for j = i, i + 1, . . . , n.
Now, performing row operations on the system Ax = b can be accomplished by performing
them on the augmented matrix
[ A  b ] = [ a11  a12  · · ·  a1n | b1 ]
           [ a21  a22  · · ·  a2n | b2 ]
           [  ⋮    ⋮    ⋱     ⋮   |  ⋮ ]
           [ an1  an2  · · ·  ann | bn ]
By working with the augmented matrix instead of the original system, there is no need to continually
rewrite the unknowns or arithmetic operators. Once the augmented matrix is reduced to upper
triangular form, the corresponding system of linear equations can be solved by back substitution,
as before.
The process of eliminating variables from the equations, or, equivalently, zeroing entries of
the corresponding matrix, in order to reduce the system to upper-triangular form is called Gaus-
sian elimination. We will now step through an example as we discuss the steps of the Gaussian
Elimination algorithm. The algorithm is as follows:
for j = 1, 2, . . . , n − 1 do
for i = j + 1, j + 2, . . . , n do
mij = aij /ajj
for k = j + 1, j + 2, . . . , n do
aik = aik − mij ajk
end
bi = bi − mij bj
end
end
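A minimal Matlab sketch of this algorithm is given below; it assumes, as the pseudocode does, that every pivot ajj encountered is nonzero (pivoting is discussed later in this chapter).

function [A, b] = gausselim(A, b)
    % Reduce A and b to upper triangular form by Gaussian elimination
    n = length(b);
    for j = 1:n-1
        for i = j+1:n
            m = A(i,j)/A(j,j);                       % multiplier m_ij
            A(i,j+1:n) = A(i,j+1:n) - m*A(j,j+1:n);  % update row i
            b(i) = b(i) - m*b(j);                    % update right-hand side
            A(i,j) = 0;                              % eliminated entry
        end
    end
end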
Example Consider the system of linear equations
x1 + 2x2 + x3 − x4 = 5
3x1 + 2x2 + 4x3 + 4x4 = 16
4x1 + 4x2 + 3x3 + 4x4 = 22
2x1 + x3 + 5x4 = 15.
This system can be represented by the coefficient matrix A and right-hand side vector b, as follows:
A = [ 1  2  1 −1 ]         b = [  5 ]
    [ 3  2  4  4 ]             [ 16 ]
    [ 4  4  3  4 ]             [ 22 ]
    [ 2  0  1  5 ] ,           [ 15 ] .
To perform row operations to reduce this system to upper triangular form, we define the augmented
matrix
Ã = [ A  b ] = [ 1  2  1 −1 |  5 ]
               [ 3  2  4  4 | 16 ]
               [ 4  4  3  4 | 22 ]
               [ 2  0  1  5 | 15 ] .
We first define Ã(1) = Ã to be the original augmented matrix. Then, we denote by Ã(2) the result of
the first elementary row operation, which entails subtracting 3 times the first row from the second
in order to eliminate x1 from the second equation:
Ã(2) = [ 1  2  1 −1 |  5 ]
       [ 0 −4  1  7 |  1 ]
       [ 4  4  3  4 | 22 ]
       [ 2  0  1  5 | 15 ] .
Next, we eliminate x1 from the third equation by subtracting 4 times the first row from the
third:
Ã(3) = [ 1  2  1 −1 |  5 ]
       [ 0 −4  1  7 |  1 ]
       [ 0 −4 −1  8 |  2 ]
       [ 2  0  1  5 | 15 ] .
Then, we complete the elimination of x1 by subtracting 2 times the first row from the fourth:
Ã(4) = [ 1  2  1 −1 | 5 ]
       [ 0 −4  1  7 | 1 ]
       [ 0 −4 −1  8 | 2 ]
       [ 0 −4 −1  7 | 5 ] .
We now need to eliminate x2 from the third and fourth equations. This is accomplished by sub-
tracting the second row from the third, which yields
Ã(5) = [ 1  2  1 −1 | 5 ]
       [ 0 −4  1  7 | 1 ]
       [ 0  0 −2  1 | 1 ]
       [ 0 −4 −1  7 | 5 ] ,
and the fourth, which yields
Ã(6) = [ 1  2  1 −1 | 5 ]
       [ 0 −4  1  7 | 1 ]
       [ 0  0 −2  1 | 1 ]
       [ 0  0 −2  0 | 4 ] .
Finally, we subtract the third row from the fourth to obtain the augmented matrix of an upper-
triangular system,
Ã(7) = [ 1  2  1 −1 | 5 ]
       [ 0 −4  1  7 | 1 ]
       [ 0  0 −2  1 | 1 ]
       [ 0  0  0 −1 | 3 ] .
Note that in a matrix for such a system, all entries below the main diagonal (the entries where the
row index is equal to the column index) are equal to zero. That is, aij = 0 for i > j.
From this, we see that we need to process all columns j = 1, 2, . . . , n − 1. Likewise, within column
j we need to process rows i = j + 1, j + 2, . . . , n. Now we are getting ready to build our algorithm.
Now, we can perform back substitution on the corresponding system,
x1 + 2x2 + x3 − x4 = 5,
−4x2 + x3 + 7x4 = 1,
−2x3 + x4 = 1,
−x4 = 3,
to obtain the solution, which yields x4 = −3, x3 = −2, x2 = −6, and x1 = 16. 2
where the entry −mij is in row i, column j. Each such matrix Mij is an example of an elementary
row matrix, which is a matrix that results from applying any elementary row operation to the
identity matrix.
More generally, if we let A(1) = A and let A(k+1) be the matrix obtained by eliminating elements
of column k in A(k) , then we have, for k = 1, 2, . . . , n − 1,

A(k+1) = M (k) A(k) ,

where
M (k) = [ 1                                  ]
        [    ⋱                               ]
        [        1                           ]
        [      −mk+1,k   1                   ]
        [      −mk+2,k        1              ]
        [         ⋮               ⋱          ]
        [      −mn,k                  1      ] ,
with the elements −mk+1,k , . . . , −mnk occupying column k. It follows that the matrix
being the result of applying the same row operations to b, is the right-hand side for the upper-
triangular system that is to be solved by back substitution.
Exercise 2.2.3 (a) Write a Matlab function that computes the matrix U by us-
ing the above description. Start in the first column, and accumulate all of the
multipliers in a matrix. Once you have done this for each n − 1 columns, you
can multiply them together in the manner described to accumulate the matrix
U = M (n−1) M (n−2) · · · M (1) A.
(b) Your function should store the values of mij in the appropriate entry of a matrix
we will call L. This matrix will be lower unit triangular, and is discussed in the
next section.
process similar to back substitution, called forward substitution. As with upper triangular matri-
ces, a lower triangular matrix is nonsingular if and only if all of its diagonal entries are nonzero.
Exercise 2.2.5 Prove the following useful properties for triangular matrices.
Triangular matrices have the following useful properties:
• A unit lower/upper triangular matrix is nonsingular, and its inverse is unit lower/upper
triangular.
In fact, the inverse of each M (k) is easily computed. We have
L(k) = [M (k) ]−1 = [ 1                                ]
                    [    ⋱                             ]
                    [        1                         ]
                    [      mk+1,k   1                  ]
                    [      mk+2,k        1             ]
                    [        ⋮              ⋱          ]
                    [      mn,k                  1     ] .
It follows that if we define M = M (n−1) · · · M (1) , then M is unit lower triangular, and M A = U ,
where U is upper triangular. It follows that A = M −1 U = LU , where

L = M −1 = L(1) L(2) · · · L(n−1)

is also unit lower triangular. Furthermore, from the structure of each matrix L(k) , it can readily
be determined that
be determined that
L = [  1                              ]
    [ m21    1                        ]
    [ m31   m32    ⋱                  ]
    [  ⋮     ⋮      ⋱     1           ]
    [ mn1   mn2   · · ·  mn,n−1   1   ] .
That is, L stores all of the multipliers used during Gaussian elimination. The factorization of A
that we have obtained,
A = LU,
is called the LU decomposition, or LU factorization, of A.
Exercise 2.2.6 Write a Matlab function to compute the matrices L and U for a ran-
domly generated matrix. Check your accuracy by multiplying them together to see if you
get A = LU .
2.2.2.2 Solution of Ax = b
Once the LU decomposition A = LU has been computed, we can solve the system Ax = b by first
noting that if x is the solution, then
Ax = LU x = b.
Therefore, we can obtain x by first solving the system

Ly = b,

and then solving the system U x = y by back substitution. The system Ly = b can be solved by
the following algorithm, known as forward substitution:
for i = 1, 2, . . . , n do
yi = bi
for j = 1, 2, . . . , i − 1 do
yi = yi − `ij yj
end
end
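Putting the two triangular solves together, a minimal Matlab sketch for solving Ax = b once L and U are available might look like the following (the function name lusolve is our own choice; L is assumed unit lower triangular):

function x = lusolve(L, U, b)
    n = length(b);
    y = zeros(n, 1);
    for i = 1:n                               % forward substitution: L*y = b
        y(i) = b(i) - L(i,1:i-1)*y(1:i-1);
    end
    x = zeros(n, 1);
    for i = n:-1:1                            % back substitution: U*x = y
        x(i) = (y(i) - U(i,i+1:n)*x(i+1:n))/U(i,i);
    end
end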
Exercise 2.2.7 Implement a function for forward substitution in Matlab. Try your
function on the following lower unit triangular matrix.
A = [ 1  0  0  0 ]
    [ 2  1  0  0 ]
    [ 3  3  1  0 ]
    [ 4  6  4  1 ]
Like back substitution, this algorithm requires O(n2 ) floating-point operations. Unlike back sub-
stitution, there is no division of the ith component of the solution by a diagonal element of the
matrix, but this is only because in this context, L is unit lower triangular, so `ii = 1. When
applying forward substitution to a general lower triangular matrix, such a division is required.
Exercise 2.2.8 (a) How might these elementary row matrices be used as a part of the
matrix factorization? Consider how you would apply each one of them individually
to the matrix A to get the intermediate result, U = L−1 A.
(b) Check your results by multiplying the matrices in the order that you have discovered
to get U as a result.
(c) Explain why this works. What is this matrix multiplication equivalent to?
Now that you have discovered what we call pre-multiplying the matrix A by L−1 , we know that
L = E1−1 E2−1 E3−1 E4−1 E5−1 E6−1 = [ 1  0  0  0 ]
                                    [ 3  1  0  0 ]
                                    [ 4  1  1  0 ]
                                    [ 2  1  1  1 ] ,
LU x = b        (let U x = y)
L y = b         (solving requires O(n²) operations)
U x = y         (solving requires O(n²) operations)
If a11 = 0, then multiples of the first row cannot be subtracted from the other rows to eliminate
subdiagonal elements in the first column. That is, Gaussian elimination can break down. Even if
a11 ≠ 0, it can happen that the (j, j) element of A(j) is zero, in which case a similar breakdown
occurs. When this is the case, the LU decomposition of A does not exist.
This will be addressed by pivoting, resulting in a modification of the LU decomposition.
It can be shown that the LU decomposition of an n × n matrix A does exist if the leading
principal submatrices of A, defined by
[A]1:k,1:k = [ a11  · · ·  a1k ]
             [  ⋮    ⋱     ⋮  ]        k = 1, 2, . . . , n,
             [ ak1  · · ·  akk ]
are all nonsingular. Furthermore, when the LU decomposition exists, it is unique.
Exercise 2.2.9 Prove, by contradiction, that the LU factorization is unique. Start by
supposing that two distinct LU factorizations exist.
and therefore (I − F ) is nonsingular, with inverse (I − F )−1 = Σ_{i=0}^{∞} F^i . By the properties of matrix
norms, and convergence of geometric series, we then obtain

‖(I − F )−1 ‖p ≤ Σ_{i=0}^{∞} ‖F ‖p^i = 1/(1 − ‖F ‖p ).
Now, let A be a nonsingular matrix, and let E be a perturbation of A such that r ≡ ‖A−1 E‖p <
1. Because A + E = A(I − F ), where F = −A−1 E, with ‖F ‖p = r < 1, I − F is nonsingular, and
therefore so is A + E. We then have
where a = maxi,j |aij |, ℓ = maxi,j |L̄ij |, and G is the growth factor. Putting our bounds together,
we have

maxi,j |δAij | ≤ maxi,j |eij | + maxi,j |(L̄ δŪ )ij | + maxi,j |(Ū δ L̄)ij | + maxi,j |(δ L̄ δŪ )ij |
             ≤ n(1 + ℓ)Gau + n²ℓGau + n²ℓGau + O(u²)
Note that a similar conclusion is reached if we assume that the computed solution x̄ solves a nearby
problem in which both A and b are perturbed, rather than just A.
We see that the important factors in the accuracy of the computed solution are
• The growth factor G
• The precision u
• The condition number κ(A)
In particular, κ(A) must be large relative to the accuracy in order to be troublesome. For
example, consider the scenario where κ(A) = 10^2 and u = 10^−3 , as opposed to the case where
κ(A) = 10^2 and u = 10^−50 . However, it is important to note that even if A is well-conditioned, the
error in the solution can still be very large, if G and ℓ are large.
2.2.3 Pivoting
During Gaussian elimination, it is necessary to interchange rows of the augmented matrix whenever
the diagonal element of the column currently being processed, known as the pivot element, is equal
to zero.
However, if we examine the main step in Gaussian elimination,
aik(j+1) = aik(j) − mij ajk(j) ,
we can see that any roundoff error in the computation of ajk(j) is amplified by mij . Because the mul-
tipliers can be arbitrarily large, it follows from the previous analysis that the error in the computed
solution can be arbitrarily large, meaning that Gaussian elimination is numerically unstable.
Therefore, it is helpful if it can be ensured that the multipliers are small. This can be accom-
plished by performing row interchanges, or pivoting, even when it is not absolutely necessary to do
so for elimination to proceed.
and then using both row and column interchanges to move apq into the pivot position in row j and
column j. It has been proven that this is an effective strategy for ensuring that Gaussian elimination
is backward stable, meaning it does not cause the entries of the matrix to grow exponentially as
they are updated by elementary row operations, which is undesirable because it can cause undue
amplification of roundoff error.
multiplying A(j) on the left by P (j) interchanges these rows of A(j) . It follows that the process of
Gaussian elimination with pivoting can be described in terms of the matrix multiplications.
Exercise 2.2.10 (a) Find the order, as described above, in which the permutation ma-
trices P and the multiplier matrices M should be multiplied by A.
(b) What permutation matrix P would constitute no change in the previous matrix?
However, because each permutation matrix P (k) at most interchanges row k with row p, where
p > k, there is no difference between applying all of the row interchanges “up front” and applying
each P (k) immediately before the corresponding M (k) . It follows that
P A = LU,
where L is defined as before, and P = P (n−1) P (n−2) · · · P (1) . This decomposition exists for any
nonsingular matrix A.
Once the LU decomposition P A = LU has been computed, we can solve the system Ax = b
by first noting that if x is the solution, then
P Ax = LU x = P b.
Therefore, we can obtain x by first solving the system Ly = P b, and then solving U x = y. Then,
if b should change, then only these last two systems need to be solved in order to obtain the new
solution; as in the case of Gaussian elimination without pivoting, the LU decomposition does not
need to be recomputed.
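The following short Matlab sketch illustrates this procedure using the built-in function lu, which returns P , L and U with P A = LU ; the matrix and right-hand side here are illustrative choices.

A = [1 4 7; 2 8 5; 3 6 9];
b = [1; 2; 3];                 % an illustrative right-hand side
[L, U, P] = lu(A);             % P*A = L*U
y = L \ (P*b);                 % forward substitution
x = U \ y;                     % back substitution
norm(A*x - b)                  % residual should be near roundoff level

If b changes, only the last two triangular solves need to be repeated.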
Example Let
A = [ 1  4  7 ]
    [ 2  8  5 ]
    [ 3  6  9 ] .
Applying Gaussian elimination to A, we subtract twice the first row from the second, and three
times the first row from the third, to obtain
A(2) = [ 1  4   7 ]
       [ 0  0  −9 ]
       [ 0 −6 −12 ] .
At this point, Gaussian elimination breaks down, because the multiplier m32 = a32 /a22 = −6/0 is
undefined.
Therefore, we must interchange the second and third rows, which yields the upper triangular
matrix
U = A(3) = P (2) A(2) = [ 1  4   7 ]
                        [ 0 −6 −12 ]
                        [ 0  0  −9 ] ,
P A = LU,
or
[ 1 0 0 ] [ 1 4 7 ]   [ 1 0 0 ] [ 1  4   7 ]
[ 0 0 1 ] [ 2 8 5 ] = [ 3 1 0 ] [ 0 −6 −12 ] .
[ 0 1 0 ] [ 3 6 9 ]   [ 2 0 1 ] [ 0  0  −9 ]
It can be seen in advance that A does not have an LU factorization because the second minor of
A, a1:2,1:2 , is a singular matrix. 2
This formula for x suggests that if σn is small relative to the other singular values, then the system
Ax = b can be sensitive to perturbations in A or b. This makes sense, considering that σn is the
distance between A and the set of all singular n × n matrices.
In an attempt to measure the sensitivity of this system, we consider the parameterized system

(A + εE)x(ε) = b + εe,

where E ∈ Rn×n and e ∈ Rn are perturbations of A and b, respectively. Taking the Taylor
expansion of x(ε) around ε = 0 yields

x(ε) = x + εx′(0) + O(ε²),

where
x′(ε) = (A + εE)−1 (e − Ex),
which yields x′(0) = A−1 (e − Ex).
Using norms to measure the relative error in x, we obtain

‖x(ε) − x‖/‖x‖ = |ε| ‖A−1 (e − Ex)‖/‖x‖ + O(ε²) ≤ |ε| ‖A−1 ‖ (‖e‖/‖x‖ + ‖E‖) + O(ε²).

Multiplying and dividing by ‖A‖, and using Ax = b to obtain ‖b‖ ≤ ‖A‖ ‖x‖, yields

‖x(ε) − x‖/‖x‖ ≤ κ(A)|ε| (‖e‖/‖b‖ + ‖E‖/‖A‖) + O(ε²),
where
κ(A) = ‖A‖ ‖A−1 ‖
is called the condition number of A. We conclude that the relative errors in A and b can be
amplified by κ(A) in the solution. Therefore, if κ(A) is large, the problem Ax = b can be quite
sensitive to perturbations in A and b. In this case, we say that A is ill-conditioned; otherwise, we
say that A is well-conditioned.
The definition of the condition number depends on the matrix norm that is used. Using the
ℓ2 -norm, we obtain

κ2 (A) = ‖A‖2 ‖A−1 ‖2 = σ1 (A)/σn (A).
It can readily be seen from this formula that κ2 (A) is large if σn is small relative to σ1 . We
also note that because the singular values are the lengths of the semi-axes of the hyperellipsoid
{Ax | ‖x‖2 = 1}, the condition number in the ℓ2 -norm measures the elongation of this hyperellipsoid.
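This relationship is easy to check numerically. The following small Matlab experiment (the random test matrix is an illustrative assumption) computes σ1/σn from the singular values and compares it with the built-in function cond:

A = rand(5);               % a random test matrix
s = svd(A);                % singular values in decreasing order
kappa2 = s(1)/s(end)       % sigma_1 / sigma_n
cond(A)                    % should agree with kappa2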
Example The matrices
A1 = [ 0.7674  0.0477 ]          A2 = [ 0.7581  0.1113 ]
     [ 0.6247  0.1691 ] ,             [ 0.6358  0.0933 ]
do not appear to be very different from one another, but κ2 (A1 ) = 10 while κ2 (A2 ) = 10^10 . That
is, A1 is well-conditioned while A2 is ill-conditioned.
To illustrate the ill-conditioned nature of A2 , we solve the two systems of equations A2 x1 = b1
and A2 x2 = b2 for the unknown vectors x1 and x2 , where
b1 = [ 0.7662 ]          b2 = [ 0.7019 ]
     [ 0.6426 ] ,             [ 0.7192 ] .
These vectors differ from one another by roughly 10%, but the solutions
x1 = [ 0.9894 ]          x2 = [ −1.4522 × 10^8 ]
     [ 0.1452 ] ,             [  9.8940 × 10^8 ]

differ from one another by several orders of magnitude. 2
Just as the largest singular value of A is the ℓ2 -norm of A, and the smallest singular value is
the distance from A to the nearest singular matrix in the ℓ2 -norm, we have, for any ℓp -norm,

1/κp (A) = min_{A+∆A singular} ‖∆A‖p /‖A‖p .

That is, in any ℓp -norm, κp (A) measures the relative distance in that norm from A to the set of
singular matrices.
Because det(A) = 0 if and only if A is singular, it would appear that the determinant could be
used to measure the distance from A to the nearest singular matrix. However, this is generally not
the case. It is possible for a matrix to have a relatively large determinant, but be very close to a
singular matrix, or for a matrix to have a relatively small determinant, but not be nearly singular.
In other words, there is very little correlation between det(A) and the condition number of A.
Example Let
A = [ 1 −1 −1 · · · −1 −1 ]
    [ 0  1 −1 · · · −1 −1 ]
    [ 0  0  1 · · · −1 −1 ]
    [ ⋮  ⋮       ⋱      ⋮ ]
    [ 0  0  0 · · ·  1 −1 ]
    [ 0  0  0 · · ·  0  1 ]

be the 10 × 10 upper triangular matrix with ones on the diagonal and −1 in every entry above the
diagonal.
Then det(A) = 1, but κ2 (A) ≈ 1, 918, and σ10 ≈ 0.0029. That is, A is quite close to a singular
matrix, even though det(A) is not near zero. For example, the nearby matrix Ã = A − σ10 u10 v10^T ,
whose entries are equal to those of A to within two decimal places, is singular. 2
Although we have learned about solving a system of linear equations Ax = b, we have yet to
discuss methods of estimating the error in a computed solution x̃. A simple approach to judging
the accuracy of x̃ is to compute the residual vector r = b − Ax̃, and then compute the magnitude
of r using any vector norm. However, this approach can be misleading, as a small residual does not
necessarily imply that the error in the solution, which is e = x − x̃, is small.
To see this, we first note that

r = b − Ax̃ = Ax − Ax̃ = A(x − x̃) = Ae.

It follows that for any vector norm ‖ · ‖, and the corresponding induced matrix norm, we have

‖e‖ = ‖A−1 r‖
    ≤ ‖A−1 ‖ ‖r‖
    = ‖A−1 ‖ ‖r‖ ‖b‖/‖b‖
    = ‖A−1 ‖ ‖r‖ ‖Ax‖/‖b‖
    ≤ ‖A‖ ‖A−1 ‖ (‖r‖/‖b‖) ‖x‖,

from which we obtain

‖e‖/‖x‖ ≤ κ(A) ‖r‖/‖b‖,
where
κ(A) = ‖A‖ ‖A−1 ‖
is the condition number of A. Therefore, it is possible for the residual to be small, and the error
to still be large.
We can exploit the relationship between the error e and the residual r, Ae = r, to obtain an
estimate of the error, ẽ, by solving the system Ae = r in the same manner in which we obtained x̃
by attempting to solve Ax = b.
Since ẽ is an estimate of the error e = x − x̃ in x̃, it follows that x̃ + ẽ is a more accurate
approximation of x than x̃ is. This is the basic idea behind iterative refinement, also known as
iterative improvement or residual correction.
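A minimal Matlab sketch of one way to organize this iteration is given below; the test problem, tolerance and iteration limit are illustrative assumptions, and in practice the residual would be computed in higher precision.

A = hilb(8);  b = ones(8,1);           % an ill-conditioned test problem
tol = 1e-14;  maxit = 5;
[L, U, P] = lu(A);
x = U \ (L \ (P*b));                   % initial computed solution
r = b - A*x;                           % initial residual
for k = 1:maxit
    e = U \ (L \ (P*r));               % error estimate: solve A*e = r
    x = x + e;                         % corrected solution
    r = r - A*e;                       % updated residual
    if norm(r) <= tol*norm(b), break, end
end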
The algorithm repeatedly applies the relationship Ae = r, where e is the error and r is the
residual, to update the computed solution with an estimate of its error. For this algorithm to be
effective, it is important that the residual r̃(k) be computed as accurately as possible, for example
using higher-precision arithmetic than for the rest of the computation.
It can be shown that if the vector r(k) is computed using double or extended precision, then x(k)
converges to a solution where almost all digits are correct when κ(A)u ≤ 1.
It is important to note that in the above algorithm, the new residual r(k+1) is computed using
the formula r(k+1) = r(k) − Aẽ(k) , rather than the definition r(k+1) = b − Ax(k+1) . To see that
these formulas are equivalent, we use the definition of x(k+1) to obtain
r(k+1) = b − Ax̃(k+1)
= b − A(x̃(k) + ẽ(k) )
= b − Ax̃(k) − Aẽ(k)
= r(k) − Aẽ(k) .
This formula is preferable to the definition because as k increases, both r(k) and ẽ(k) should approach
the zero vector, and therefore smaller vectors than b and Ax̃(k+1) will be subtracted to obtain r(k+1) ,
thus reducing the amount of cancellation error that occurs.
Ax = b,

and

κ = (1 + 7/√50) / (1 − 7/√50) ≈ 200.
One scaling strategy is called equilibration. The idea is to set A(0) = A and compute A(1/2) =
D(1) A(0) = {di(1) aij(0)}, choosing the diagonal matrix D(1) so that di(1) Σ_{j=1}^{n} |aij(0)| = 1. That is, all row
sums of |D(1) A(0)| are equal to one. Then, we compute A(1) = A(1/2) E (1) = {aij(1/2) ej(1)}, choosing
each element of the diagonal matrix E (1) so that ej(1) Σ_{i=1}^{n} |aij(1/2)| = 1. That is, all column sums of
|A(1/2) E (1)| are equal to one. We then repeat this process, which yields

A(k+1/2) = D(k+1) A(k) ,     A(k+1) = A(k+1/2) E (k+1) ,     k = 0, 1, 2, . . . .
Under very general conditions, the A(k) converge to a matrix whose row and column sums are all
equal.
Exercise 2.4.2 If A has O(1) bandwidth, then how many FLOPS do Gaussian elimina-
tion, forward substitution and back substitution require?
Example The matrix
A = [ −2  1  0  0  0 ]
    [  1 −2  1  0  0 ]
    [  0  1 −2  1  0 ]
    [  0  0  1 −2  1 ]
    [  0  0  0  1 −2 ] ,
which arises from discretization of the second derivative operator, is banded with lower bandwidth 1
and upper bandwidth 1, and total bandwidth 3. Its LU factorization is
[ −2  1  0  0  0 ]   [   1    0    0    0   0 ] [ −2   1     0     0     0  ]
[  1 −2  1  0  0 ]   [ −1/2   1    0    0   0 ] [  0 −3/2    1     0     0  ]
[  0  1 −2  1  0 ] = [   0  −2/3   1    0   0 ] [  0   0   −4/3    1     0  ]
[  0  0  1 −2  1 ]   [   0    0  −3/4   1   0 ] [  0   0     0   −5/4    1  ]
[  0  0  0  1 −2 ]   [   0    0    0  −4/5  1 ] [  0   0     0     0   −6/5 ] .
We see that L has lower bandwidth 1, and U has upper bandwidth 1. 2
Exercise 2.4.3 (a) Write a Matlab function to find the LU factorization of a tridi-
agonal matrix.
(b) Now write a Matlab function for any banded matrix with bandwidth w.
(c) How do the two functions differ in performance? How many FLOPS do each re-
quire?
When a matrix A is banded with bandwidth w, it is wasteful to store it in the traditional
2-dimensional array. Instead, it is much more efficient to store the elements of A in w vectors of
length at most n. Then, the algorithms for Gaussian elimination, forward substitution and back
substitution can be modified appropriately to work with these vectors. For example, to perform
Gaussian elimination on a tridiagonal matrix, we can proceed as in the following algorithm. We
assume that the main diagonal of A is stored in the vector a, the subdiagonal (entries aj+1,j ) is
stored in the vector l, and the superdiagonal (entries aj,j+1 ) is stored in the vector u.
for j = 1, 2, . . . , n − 1 do
lj = lj /aj
aj+1 = aj+1 − lj uj
end
Notice that this algorithm is much shorter than regular Gaussian Elimination. That is because
the number of operations for solving a tridiagonal system is significantly reduced.
y1 = b1
for i = 2, 3, . . . , n do
yi = bi − li−1 yi−1
end
xn = yn /an
for i = n − 1, n − 2, . . . , 1 do
xi = (yi − ui xi+1 )/ai
end
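Combining the factorization and the two substitution loops, a minimal Matlab sketch of a complete tridiagonal solver (the function name trisolve is our own choice) is:

function x = trisolve(a, l, u, b)
    % a = main diagonal, l = subdiagonal, u = superdiagonal, b = right side
    n = length(a);
    for j = 1:n-1                       % factorization (no pivoting)
        l(j) = l(j)/a(j);
        a(j+1) = a(j+1) - l(j)*u(j);
    end
    y = zeros(n, 1);                    % forward substitution
    y(1) = b(1);
    for i = 2:n
        y(i) = b(i) - l(i-1)*y(i-1);
    end
    x = zeros(n, 1);                    % back substitution
    x(n) = y(n)/a(n);
    for i = n-1:-1:1
        x(i) = (y(i) - u(i)*x(i+1))/a(i);
    end
end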
After Gaussian elimination, the components of the vector l are the subdiagonal entries of L in the
LU decomposition of A, and the components of the vector u are the superdiagonal entries of U .
Pivoting can cause difficulties for banded systems because it can cause fill-in: the introduction
of nonzero entries outside of the band. For this reason, when pivoting is necessary, pivoting schemes
that offer more flexibility than partial pivoting are typically used. The resulting trade-off is that
the entries of L are permitted to be somewhat larger, but the sparsity (that is, the occurrence of
zero entries) of A is preserved to a greater extent.
then D is also nonsingular, and the matrix D−1 U has entries

[D−1 U ]ij = uij /uii ,     i, j = 1, 2, . . . , n.
The diagonal entries of this matrix are equal to one, and therefore D−1 U is unit upper-triangular.
Therefore, if we define the matrix M by M T = D−1 U , then we have the factorization
A = LU = LDD−1 U = LDM T ,
where both L and M are unit lower-triangular, and D is diagonal. This is called the LDM T
factorization of A.
Because of the close connection between the LDM T factorization and the LU factorization,
the LDM T factorization is not normally used in practice for solving the system Ax = b for a
general nonsingular matrix A. However, this factorization becomes much more interesting when A
is symmetric.
A = LDLT .
S = [  1   0 ] [ 4  0 ] [ 1  1/2 ]
    [ 1/2  1 ] [ 0  2 ] [ 0   1  ]
From the above example we see that whenever we have a symmetric matrix we have the factorization
LDU where L is lower unit triangular, D is diagonal, and U is upper unit triangular. Note that
U = LT , and therefore we have the factorization LDLT . This factorization is quite economical,
compared to the LU and LDM T factorizations, because only n(n + 1)/2 entries are needed to
represent L and D. Once these factors are obtained, we can solve Ax = b by solving the simple
systems
Ly = b, Dz = y, LT x = z,
using forward substitution, simple divisions, and back substitution.
The LDLT factorization can be obtained by performing Gaussian elimination, but this is not
efficient, because Gaussian elimination requires performing operations on entire rows of A, which
does not exploit symmetry. This can be addressed by omitting updates of the upper-triangular
portion of A, as they do not influence the computation of L and D. An alternative approach, that
is equally efficient in terms of the number of floating-point operations, but more desirable overall
due to its use of vector operations, involves computing L column-by-column.
If we multiply both sides of the equation A = LDLT by the standard basis vector ej to extract
the jth column of this matrix equation, we obtain
aj = Σ_{k=1}^{j} ℓk vkj ,
where
A = [ a1  · · ·  an ] ,     L = [ ℓ1  · · ·  ℓn ]
are column partitions of A and L, and vj = DLT ej .
Suppose that columns 1, 2, . . . , j − 1 of L, as well as d11 , d22 , . . . , dj−1,j−1 , the first j − 1 diagonal
elements of D, have already been computed. Then, we can compute vkj = dkk `jk for k = 1, 2, . . . , j−
1, because these quantities depend on elements of L and D that are available. It follows that
aj − Σ_{k=1}^{j−1} ℓk vkj = ℓj vjj = ℓj djj ℓjj .
However, `jj = 1, which means that we can obtain djj from the jth component of the vector
uj = aj − Σ_{k=1}^{j−1} ℓk vkj ,
and then obtain the “interesting” portion of the new column `j , that is, entries j : n, by computing
`j = uj /djj . The remainder of this column is zero, because L is lower-triangular.
Exercise 2.4.4 Write a Matlab function that implements the following LDLT algo-
rithm. This algorithm should take the random symmetric matrix A as input, and return
L and D as output. Check to see if A = LDLT .
The entire algorithm proceeds as follows:
L=0
D=0
for j = 1 : n do
for k = 1 : j − 1 do
vkj = dkk `jk
end
uj = aj:n,j
for k = 1 : j − 1 do
uj = uj − `j:n,k vkj
end
djj = u1j
`j:n,j = uj /djj
end
2.4.3.1 Properties
A real, n × n symmetric matrix A is symmetric positive definite if A = AT and, for any nonzero
vector x,
xT Ax > 0.
Exercise 2.4.5 Show that if matrices A and B are positive definite, then A+B is positive
definite.
A symmetric positive definite matrix is the generalization to n×n matrices of a positive number.
If A is symmetric positive definite, then it has the following properties:
In general it is not easy to determine whether a given n × n symmetric matrix A is also positive
definite. One approach is to check the matrices
Ak = [ a11  a12  · · ·  a1k ]
     [ a21  a22  · · ·  a2k ]
     [  ⋮    ⋮    ⋱     ⋮  ]        k = 1, 2, . . . , n.
     [ ak1  ak2  · · ·  akk ]
Exercise 2.4.6 Show that A is positive definite if and only if det(Ak ) > 0 for k =
1, 2, . . . , n.
There are related classes of matrices defined by similar conditions on xT Ax, in which the inequality
is reversed or zero is allowed: A is negative definite if xT Ax < 0, positive semi-definite if xT Ax ≥ 0,
and negative semi-definite if xT Ax ≤ 0, for all nonzero vectors x.
Exercise 2.4.7 Find the values of c for which the following matrix is
Gaussian elimination applied to such matrices is robust with respect to the accumulation of roundoff
error. However, Gaussian elimination is not the most practical approach to solving systems of
linear equations involving symmetric positive definite matrices, because it is not the most efficient
approach in terms of the number of floating-point operations that are required.
If A is symmetric positive definite, then it has a Cholesky factorization

A = GGT ,
where G is a lower triangular matrix with positive diagonal entries. Because A is factored into two
matrices that are the transpose of one another, the process of computing the Cholesky factorization
requires about half as many operations as the LU decomposition.
The algorithm for computing the Cholesky factorization can be derived by matching entries of
GGT with those of A. This yields the following relation between the entries of G and A,
aik = Σ_{j=1}^{k} gij gkj ,     i, k = 1, 2, . . . , n,  i ≥ k.
for j = 1, 2, . . . , n do
√
gjj = ajj
for i = j + 1, j + 2, . . . , n do
gij = aij /gjj
for k = j + 1, . . . , i do
aik = aik − gij gkj
end
end
end
The innermost loop subtracts off all terms but the last (corresponding to j = k) in the above
summation that expresses aik in terms of entries of G. Equivalently, for each j, this loop subtracts
the matrix gj gjT from A, where gj is the jth column of G. Note that based on the outer product
view of matrix multiplication, the equation A = GGT is equivalent to
A = Σ_{j=1}^{n} gj gj^T .
Therefore, for each j, the contributions of all columns g` of G, where ` < j, have already been
subtracted from A, thus allowing column j of G to easily be computed by the steps in the outer
loops, which account for the last term in the summation for aik , in which j = k.
Exercise 2.4.8 Write a Matlab function that performs the Cholesky factorization A =
GGT . Have your function:
(a) return a variable isposdef that checks whether the matrix A is positive definite;
(b) use the Matlab commands ’tic’ and ’toc’ to compare calculation times as the size
of the matrix is doubled, including large matrices;
(c) count how many FLOPs are performed in your implementation of the Cholesky
algorithm.
Example Let

A = [  9  −3   3   9 ]
    [ −3  17  −1  −7 ]
    [  3  −1  17  15 ]
    [  9  −7  15  44 ] .
We compute the nonzero entries of G one column at a time. For the first column, we have
g11 = √a11 = √9 = 3,
g21 = a21 /g11 = −3/3 = −1,
g31 = a31 /g11 = 3/3 = 1,
g41 = a41 /g11 = 9/3 = 3.
Before proceeding to the next column, we first subtract all contributions to the remaining entries of
A from the entries of the first column of G. That is, we update A as follows:
a22 = a22 − g21² = 17 − (−1)² = 16,
a32 = a32 − g31 g21 = −1 − (1)(−1) = 0,
a42 = a42 − g41 g21 = −7 − (3)(−1) = −4,
a33 = a33 − g31² = 17 − 1² = 16,
a43 = a43 − g41 g31 = 15 − (3)(1) = 12,
a44 = a44 − g41² = 44 − 3² = 35.
Now, we can compute the nonzero entries of the second column of G just as for the first column:
g22 = √a22 = √16 = 4,
g32 = a32 /g22 = 0/4 = 0,
g42 = a42 /g22 = −4/4 = −1.
We then remove the contributions from G’s second column to the remaining entries of A:
a33 = a33 − g32² = 16 − 0² = 16,
a43 = a43 − g42 g32 = 12 − (−1)(0) = 12,
a44 = a44 − g42² = 35 − (−1)² = 34.

Continuing with the third and fourth columns yields g33 = √16 = 4, g43 = 12/4 = 3, the update
a44 = 34 − 3² = 25, and finally g44 = √25 = 5, so that

G = [  3  0  0  0 ]
    [ −1  4  0  0 ]
    [  1  0  4  0 ]
    [  3 −1  3  5 ] . 2
If A is not symmetric positive definite, then the algorithm will break down, because it will attempt
to compute gjj , for some j, by taking the square root of a negative number, or divide by a zero gjj .
Example The matrix

A = [ 4  3 ]
    [ 3  2 ]

is symmetric but not positive definite, because det(A) = 4(2) − 3(3) = −1 < 0. If we attempt to
compute the Cholesky factorization A = GGT , we have

g11 = √a11 = √4 = 2,
g21 = a21 /g11 = 3/2,
a22 = a22 − g21² = 2 − 9/4 = −1/4,
g22 = √a22 = √(−1/4),

which is undefined in real arithmetic, so the algorithm breaks down. 2
This is similar to the process of solving Ax = b using the LDLT factorization, except that
there is no diagonal system to solve. In fact, the LDLT factorization is also known as the “square-
root-free Cholesky factorization”, since it computes factors that are similar in structure to the
Cholesky factors, but without computing any square roots. Specifically, if A = GGT is the Cholesky
factorization of A, then G = LD1/2 . As with the LU factorization, the Cholesky factorization is
unique, because the diagonal is required to be positive.
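This relationship is easy to verify numerically. The following Matlab sketch uses the built-in function chol on the 4 × 4 example matrix from earlier in this section and recovers L and D from the Cholesky factor G:

A = [9 -3 3 9; -3 17 -1 -7; 3 -1 17 15; 9 -7 15 44];
G = chol(A, 'lower');              % Cholesky factor: A = G*G'
d = diag(G).^2;                    % diagonal entries of D
L = G/diag(diag(G));               % unit lower triangular factor
norm(A - L*diag(d)*L')             % confirms A = L*D*L'
norm(G - L*diag(sqrt(d)))          % confirms G = L*D^(1/2)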
When solving a system of linear equations

Ax = b,

where A is an invertible n × n matrix, we can obtain the solution using a direct method such as
Gaussian elimination in conjunction with forward and back substitution. However, there are several
drawbacks to this approach:
• If we have an approximation to the solution x, a direct method does not provide any means
of taking advantage of this information to reduce the amount of computation required.
• If we only require an approximate solution, rather than the exact solution except for roundoff
error, it is not possible to terminate the algorithm for a direct method early in order to obtain
such an approximation.
• If the matrix A is sparse, Gaussian elimination or similar methods can cause fill-in, which is
the introduction of new nonzero elements in the matrix, thus reducing efficiency.
A stationary iterative method computes a sequence of iterates according to

x(k+1) = g(x(k) ),

for some function g : Rn → Rn . The solution x is a fixed point, or stationary point, of g. In other
words, a stationary iterative method is one in which fixed-point iteration, which we have previously
applied to solve nonlinear equations, is used to obtain the solution.
If we split A = M − N , where M is nonsingular, then Ax = b can be rewritten as

M x = N x + b,
or
x = M −1 (N x + b).
We therefore define
g(x) = M −1 (N x + b),
so that the iteration takes the form
M x(k+1) = N x(k) + b.
It follows that for the sake of efficiency, the splitting A = M − N should be chosen so that the
system M y = c is easily solved.
M (x(k+1) − x) = N (x(k) − x) + b − b,
which yields
x(k+1) − x = M −1 N (x(k) − x)
and
x(k) − x = (M −1 N )k (x(0) − x).
That is, the error after each iteration is obtained by multiplying the error from the previous iteration
by T = M −1 N . Therefore, in order for the error to converge to the zero vector, for any choice of
the initial guess x(0) , we must have ρ(T ) < 1, where ρ(T ) is the spectral radius of T . Let λ1 , . . . , λn
be the eigenvalues of a matrix A. Then ρ(A) = max{|λ1 |, . . . , |λn |}.
We now discuss some basic stationary iterative methods. For convenience, we write
A = D + L + U,
where D is a diagonal matrix whose diagonal entries are the diagonal entries of A, L is a strictly
lower triangular matrix defined by
ℓij = { aij ,   i > j,
      {  0,     i ≤ j,
and U is a strictly upper triangular matrix that is similarly defined: uij = aij if i < j, and 0
otherwise.
The Jacobi method is defined by the splitting
A = M − N, M = D, N = −(L + U ).
That is,
x(k+1) = D−1 [−(L + U )x(k) + b].
This description of the Jacobi method is helpful for its practical implementation, but it also reveals
how the method can be improved. If the components of x(k+1) are computed in order, then the
computation of xi(k+1) uses components 1, 2, . . . , i − 1 of x(k) even though these components of
x(k+1) have already been computed.
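A minimal Matlab sketch of the Jacobi iteration in the matrix form just described is given below; the diagonally dominant test system and the stopping parameters are illustrative assumptions.

A = [5 -2 1; 1 6 2; -1 1 4];  b = [1; 2; 3];
D = diag(diag(A));  N = -(A - D);          % splitting A = D - N, N = -(L+U)
x = zeros(size(b));
for k = 1:100
    xnew = D \ (N*x + b);                  % one Jacobi sweep
    if norm(xnew - x) < 1e-10, x = xnew; break, end
    x = xnew;
end
x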
Exercise 2.5.1 Solve the linear system Ax = b by using the Jacobi method, where
A = [ 2  7   1 ]          b = [ 19 ]
    [ 4  1  −1 ]              [  3 ]
    [ 1 −3  12 ]              [ 31 ] .
Compute the iteration matrix T using the fact that M = D and N = −(L + U ) for the
Jacobi method. Is ρ(T ) < 1?
Hint: First rearrange the order of the equations so that the matrix is strictly diagonally
dominant.
Exercise 2.5.2 Solve the linear system Ax = b from Exercise 2.5.1, with the same matrix A and

b = [ 19 ]
    [  3 ]
    [ 31 ] ,

using the Gauss-Seidel method. What are the differences between this computation and
the one from Exercise 2.5.1?
important applications, both methods can converge quite slowly. To accelerate convergence, we
first rewrite the Gauss-Seidel method as follows:
The quantity in brackets is the step taken from x(k) to x(k+1) . However, if the direction of this
step corresponds closely to the step x − x(k) to the exact solution, it may be possible to accelerate
convergence by increasing the length of this step. That is, we introduce a parameter ω so that
x(k+1) = x(k) + ω[xGS(k+1) − x(k) ],

where xGS(k+1) is the iterate obtained from x(k) by the Gauss-Seidel method. By choosing ω > 1,
which is called overrelaxation, we take a larger step in the direction of [xGS(k+1) − x(k) ] than Gauss-
Seidel would call for.
This approach leads to the method of successive overrelaxation (SOR),
Note that if ω = 1, then SOR reduces to the Gauss-Seidel method. If we examine the iteration
matrix Tω for SOR, we have

Tω = (D + ωL)−1 [(1 − ω)D − ωU ].

Because the matrices (D + ωL) and [(1 − ω)D − ωU ] are both triangular, it follows that
det(Tω ) = ( Π_{i=1}^{n} aii )^{−1} ( Π_{i=1}^{n} (1 − ω)aii ) = (1 − ω)^n .
Because the determinant is the product of the eigenvalues, it follows that ρ(Tω ) ≥ |1 − ω|.
Exercise 2.5.3 By the above argument, find a lower and upper bound for the parameter
ω. Hint: Consider criteria for ρ(Tω ) if this method converges.
In some cases, it is possible to analytically determine the optimal value of ω, for which con-
vergence is most rapid. For example, if A is symmetric positive definite and tridiagonal, then the
optimal value is
ω = 2 / (1 + √(1 − [ρ(Tj )]²)),
where Tj is the iteration matrix −D−1 (L + U ) for the Jacobi method.
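A componentwise Matlab sketch of SOR is given below (Gauss-Seidel corresponds to ω = 1); the test system, the value of ω, and the stopping parameters are illustrative assumptions.

A = [5 -2 1; 1 6 2; -1 1 4];  b = [1; 2; 3];  omega = 1.1;
n = length(b);  x = zeros(n, 1);
for k = 1:200
    xold = x;
    for i = 1:n
        % Gauss-Seidel value using the most recent components of x
        s = b(i) - A(i,1:i-1)*x(1:i-1) - A(i,i+1:n)*x(i+1:n);
        x(i) = (1 - omega)*x(i) + omega*s/A(i,i);
    end
    if norm(x - xold) < 1e-10, break, end
end
x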
Exercise 2.5.4 Suppose that SOR is applied to a system Ax = b for which there exists a
nonsingular diagonal matrix D such that

B = DAD−1

is symmetric and positive definite. Show that SOR converges for the original problem.
Hint: Use the fact that SOR converges for any symmetric positive definite matrix.
A natural criterion for stopping any iterative method is to check whether ‖x(k) − x(k−1) ‖ is less
than some tolerance. However, if ‖T ‖ < 1 in some natural matrix norm, then we have

‖x(k) − x‖ ≤ (‖T ‖/(1 − ‖T ‖)) ‖x(k) − x(k−1) ‖.

Therefore, the tolerance must be chosen with ‖T ‖/(1 − ‖T ‖) in mind, as this can be quite large
when ‖T ‖ ≈ 1.
We assume that A is symmetric positive definite, and consider the problem of minimizing the
function
φ(x) = (1/2) xT Ax − bT x.
Differentiating, we obtain
∇φ(x) = Ax − b.
Therefore, this function has one critical point, when Ax = b. Differentiating ∇φ, we find that
the Hessian matrix of φ is A. Because A is symmetric positive definite, it follows that the unique
minimizer of φ is the solution to Ax = b. Therefore, we can use techniques for minimizing φ to
solve Ax = b.
From any vector x0 , the direction of steepest descent is given by
−∇φ(x0 ) = b − Ax0 = r0 ,
the residual vector. This suggests a simple non-stationary iterative method, which is called the
method of steepest descent. The basic idea is to choose the search direction pk to be rk = b − Ax(k) ,
and then to choose αk so as to minimize φ(x(k+1) ) = φ(x(k) + αk rk ). This entails solving a single-
variable minimization problem to obtain αk . We have
d/dαk [φ(x(k) + αk rk )] = d/dαk [ (1/2)(x(k) + αk rk )T A(x(k) + αk rk ) − bT (x(k) + αk rk ) ]
                        = rk^T Ax(k) + αk rk^T Ark − bT rk
                        = −rk^T rk + αk rk^T Ark .

It follows that the optimal choice for αk is

αk = rk^T rk / (rk^T Ark ),
and since A is symmetric positive definite, the denominator is guaranteed to be positive.
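A minimal Matlab sketch of the method of steepest descent follows; the random symmetric positive definite test problem and the stopping test are illustrative assumptions.

n = 50;  B = randn(n);  A = B'*B + n*eye(n);  b = randn(n, 1);
x = zeros(n, 1);  r = b - A*x;
for k = 1:1000
    Ar = A*r;
    alpha = (r'*r)/(r'*Ar);          % optimal step length along r
    x = x + alpha*r;
    r = r - alpha*Ar;                % updated residual
    if norm(r) < 1e-8*norm(b), break, end
end
norm(A*x - b)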
The method of steepest descent is effective when A is well-conditioned, but when A is ill-
conditioned, convergence is very slow, because the level curves of φ become long, thin hyperellipsoids
in which the direction of steepest descent does not yield much progress toward the minimum.
Another problem with this method is that while it can be shown that rk+1 is orthogonal to rk , so that
each direction is completely independent of the previous one, rk+1 is not necessarily independent
of previous search directions.
Exercise 2.5.5 Show that each search direction rk is orthogonal to the previous search
direction rk−1 .
In fact, even in the 2 × 2 case, where only two independent search directions are available, the
method of steepest descent exhibits a “zig-zag” effect because it continually alternates between two
orthogonal search directions, and the more ill-conditioned A is, the smaller each step tends to be.
k = 0, rk = b, qk = 0, x(k) = 0
while x(k) is not converged do
βk = ‖rk ‖2
qk+1 = rk /βk
k =k+1
vk = Aqk
αk = qTk vk
rk = vk − αk qk − βk−1 qk−1
x(k) = β0 Qk Tk^{−1} e1
end
This method is the Lanczos iteration. It is not only used for solving linear systems; the matrix
Tk is also useful for approximating extremal eigenvalues of A, and for approximating quadratic or
bilinear forms involving functions of A, such as the inverse or exponential.
Exercise 2.5.6 Implement the above Lanczos algorithm in Matlab. Your function
should take a random symmetric matrix A, a random initial vector u, and n where n
is the number of iterations as input. Also, your function should return the symmetric
matrix T , where T is the tridiagonal matrix described above that contains quantities αj
and βj that are computed by the algorithm.
P̃k Lk^T = Qk ,
then
x(k) = Qk yk = P̃k Lk^T Tk^{−1} e1 = P̃k wk ,
where wk satisfies
Lk Dk wk = e1 .
This representation of x(k) is more convenient than Qk yk , because, as a consequence of the recursive
definitions of Lk and Dk , and the fact that Lk Dk is lower triangular, we have
wk = [ wk−1 ]
     [  wk  ] ,
which implies that

p̃i^T Ap̃j = { di > 0,   i = j,
          {    0,      i ≠ j.

We see that these search directions, while not orthogonal, are A-orthogonal, or A-conjugate. There-
fore, they are linearly independent, thus guaranteeing convergence, in exact arithmetic, within n
iterations. It is for this reason that the Lanczos iteration, reformulated in this manner with these
search directions, is called the conjugate gradient method.
From the definition of Pk , we obtain the relation
It follows that each search direction p̃k is a linear combination of the residual rk = b − Ax(k−1) ,
which is a scalar multiple of qk , and the previous search direction pk−1 , except for the first direction,
which is equal to q1 = b. The exact linear combination can be determined by the requirement that
the search directions be A-conjugate.
Specifically, if we define pk = ‖rk ‖2 p̃k , then, from the previous linear combination, we have

pk = rk + µk pk−1 ,

for some constant µk . From the requirement that pk−1^T Apk = 0, we obtain

µk = − (pk−1^T Ark ) / (pk−1^T Apk−1 ).
We have eliminated the computation of the qk from the algorithm, as we can now use the residuals
rk instead to obtain the search directions that we will actually use, the pk .
The relationship between the residuals and the search direction also provides a simple way to
compute each residual from the previous one. We have
The orthogonality of the residuals, and the A-orthogonality of the search directions, yields the
relations
rk^T rk = −wk−1 rk^T Ap̃k−1 = −(wk−1 /‖rk−1 ‖2 ) rk^T Apk−1 ,
and
for some constant νk . From the definition of pk = ‖rk ‖2 p̃k , we know that νk = wk /‖rk ‖2 , but
we wish to avoid the explicit computation of Tk and its LDLT factorization that are needed to
compute wk . Instead, we use the relation
rk+1 = rk − νk Apk
k = 1, r1 = b, x(0) = 0
while not converged do
if k > 1 then
µk = rTk rk /rTk−1 rk−1
pk = rk + µk pk−1
else
p1 = r1
end if
vk = Apk
νk = rTk rk /pTk vk
x(k) = x(k−1) + νk pk
rk+1 = rk − νk vk
k =k+1
end while
An appropriate stopping criterion is that the norm of the residual rk+1 is smaller than some
tolerance. It is also important to impose a maximum number of iterations. Note that only one
matrix-vector multiplication per iteration is required.
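The following is a minimal Matlab sketch of this iteration; the random symmetric positive definite test problem and the tolerance are illustrative assumptions.

n = 100;  B = randn(n);  A = B'*B + n*eye(n);  b = randn(n, 1);
x = zeros(n, 1);  r = b;  p = r;
for k = 1:n
    v = A*p;                         % the single matrix-vector product
    nu = (r'*r)/(p'*v);              % step length
    x = x + nu*p;
    rnew = r - nu*v;                 % updated residual
    if norm(rnew) < 1e-10*norm(b), break, end
    mu = (rnew'*rnew)/(r'*r);        % coefficient for the new direction
    p = rnew + mu*p;
    r = rnew;
end
norm(A*x - b)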
2.5.2.4 Preconditioning
The conjugate gradient method is far more effective than the method of steepest descent, but it
can also suffer from slow convergence when A is ill-conditioned. Therefore, the conjugate gradient
method is often paired with a preconditioner that transforms the problem Ax = b into an equivalent
problem in which the matrix is close to I. The basic idea is to solve the problem
Ãx̃ = b̃
where
à = C −1 AC −1 , x̃ = Cx, b̃ = C −1 b,
and C is symmetric positive definite. Modifying the conjugate gradient method to solve this
problem, we obtain the algorithm
k = 1, C −1 r1 = C −1 b, Cx(0) = 0
while not converged do
if k > 1 then
µk = rTk C −2 rk /rTk−1 C −2 rk−1
Cpk = C −1 rk + µk Cpk−1
else
Cp1 = C −1 r1
end if
C −1 vk = C −1 Apk
νk = rTk C −2 rk /pTk Cvk
Cx(k) = Cx(k−1) + νk Cpk
C −1 rk+1 = C −1 rk − νk vk
k =k+1
end while
k = 1, r1 = b, x(0) = 0
while not converged do
Solve M zk = rk
if k > 1 then
µk = rTk zk /rTk−1 zk−1
pk = zk + µk pk−1
else
p1 = z1
end if
vk = Apk
νk = rTk zk /pTk vk
x(k) = x(k−1) + νk pk
rk+1 = rk − νk vk
k =k+1
end while
We see that the action of the transformation is only felt through the preconditioner M = C 2 .
Because a system involving M is solved during each iteration, it is essential that such a system
is easily solved. One example of such a preconditioner is to define M = HH T , where H is an
“incomplete Cholesky factor” of A, which is a sparse matrix that approximates the true Cholesky
factor.
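In Matlab, this idea can be tried directly using the built-in functions ichol and pcg; the sparse test matrix, tolerance, and iteration limit below are illustrative assumptions.

A = gallery('poisson', 20);        % sparse SPD matrix of order 400
b = ones(400, 1);
H = ichol(A);                      % incomplete Cholesky factor, M = H*H'
x = pcg(A, b, 1e-8, 200, H, H');   % preconditioned conjugate gradient
norm(A*x - b)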
Chapter 3
Least Squares Problems
we have

∂ψ/∂xk = Σ_{j=1}^{n} cj ∂xj /∂xk = Σ_{j=1}^{n} cj δjk = ck ,
and therefore
∇ψ(x) = c.
Now, let
ϕ(x) = xT Bx = Σ_{i=1}^{n} Σ_{j=1}^{n} bij xi xj .
Then
∂ϕ/∂xk = Σ_{i=1}^{n} Σ_{j=1}^{n} bij ∂(xi xj )/∂xk
       = Σ_{i=1}^{n} Σ_{j=1}^{n} bij (δik xj + xi δjk )
       = Σ_{i=1}^{n} Σ_{j=1}^{n} bij xj δik + Σ_{i=1}^{n} Σ_{j=1}^{n} bij xi δjk
       = Σ_{j=1}^{n} bkj xj + Σ_{i=1}^{n} bik xi
       = (Bx)k + Σ_{i=1}^{n} (B T )ki xi
       = (Bx)k + (B T x)k .
We conclude that
∇ϕ(x) = (B + B T )x.
From
‖y‖2² = yT y,
and the properties of the transpose, we obtain
(1/2)‖b − Ax‖2² = (1/2)(b − Ax)T (b − Ax)
                = (1/2)bT b − (1/2)(Ax)T b − (1/2)bT Ax + (1/2)xT AT Ax
                = (1/2)bT b − bT Ax + (1/2)xT AT Ax
                = (1/2)bT b − (AT b)T x + (1/2)xT AT Ax.
Using the above formulas, with c = AT b and B = (1/2)AT A, we have

∇ (1/2)‖b − Ax‖2² = −AT b + (1/2)(AT A + (AT A)T )x.
However, because

(AT A)T = AT (AT )T = AT A,

this simplifies to

∇ (1/2)‖b − Ax‖2² = −AT b + AT Ax = AT Ax − AT b.
The Hessian of the function ϕ(x), denoted by Hϕ (x), is the matrix with entries
hij = ∂²ϕ/∂xi ∂xj .
Because the mixed second partial derivatives satisfy

∂²ϕ/∂xi ∂xj = ∂²ϕ/∂xj ∂xi

as long as they are continuous, the Hessian is symmetric under these assumptions.
In the case of ϕ(x) = xT Bx, whose gradient is ∇ϕ(x) = (B + B T )x, the Hessian is Hϕ (x) =
B + B T . It follows from the previously computed gradient of (1/2)‖b − Ax‖2² that its Hessian is AT A.
Recall that A is m × n, with m ≥ n and rank(A) = n. Then, if x ≠ 0, it follows from the linear
independence of A’s columns that Ax ≠ 0. We then have

xT (AT A)x = (Ax)T (Ax) = ‖Ax‖2² > 0,

since the norm of a nonzero vector must be positive. It follows that AT A is not only symmetric,
but positive definite as well. Therefore, the Hessian of φ(x) is positive definite, which means that
the unique critical point x, the solution to the equations AT Ax − AT b = 0, is a minimum.
In general, if the Hessian at a critical point is
• positive definite, meaning that its eigenvalues are all positive, then the critical point is a local
minimum.
• negative definite, meaning that its eigenvalues are all negative, then the critical point is a
local maximum.
• indefinite, meaning that it has both positive and negative eigenvalues, then the critical point
is a saddle point.
• singular, meaning that one of its eigenvalues is zero, then the second derivative test is incon-
clusive.
In summary, we can minimize φ(x) by noting that ∇φ(x) = AT (b − Ax), which means that
∇φ(x) = 0 if and only if AT Ax = AT b. This system of equations is called the normal equations,
which were used by Gauss to solve the least squares problem. If m ≫ n, then AT A is n × n, which
is a much smaller system to solve than Ax = b, and if κ(AT A) is not too large, we can use the
Cholesky factorization to solve for x, as AT A is symmetric positive definite.
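The following Matlab sketch carries this out on a small random problem (the data are illustrative assumptions) and compares the result with Matlab's backslash least squares solve:

m = 100;  n = 5;
A = rand(m, n);  b = rand(m, 1);
R = chol(A'*A);                % A'A = R'R with R upper triangular
x = R \ (R' \ (A'*b));         % two triangular solves
norm(x - A\b)                  % compare with the built-in least squares solve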
r + Ax̂ = b,
AT r = 0,
AT Avj = σj² vj .

That is, σj² is an eigenvalue of AT A. Furthermore, because AT A is symmetric positive definite, the
eigenvalues of AT A are also its singular values. Specifically, if A = U ΣV T is the SVD of A, then
V (ΣT Σ)V T is the SVD of AT A.
It follows that the condition number in the 2-norm of AT A is
κ2 (AT A) = ‖AT A‖2 ‖(AT A)−1 ‖2 = σ1²/σn² = (σ1 /σn )² = κ2 (A)².
Note that because A has full column rank, AT A is nonsingular, and therefore (AT A)−1 exists, even
though A−1 may not.
If we partition

QT b = [ c ]
       [ d ] ,

where c is an n-vector, then

min_x ‖b − Ax‖2² = min_x ‖ [ c ] − [ R1 ] x ‖2² = min_x ( ‖c − R1 x‖2² + ‖d‖2² ).
                           [ d ]   [ 0  ]

Therefore, the minimum is achieved by the vector x such that R1 x = c, and the minimum value of
‖b − Ax‖2 is ‖d‖2 .
2. Using Givens rotations, also known as Jacobi rotations, used by W. Givens and originally
invented by Jacobi for use in solving the symmetric eigenvalue problem in 1846.
where
A = [ a1  · · ·  an ] ,     Q = [ q1  · · ·  qm ] .
From the above matrix product we can see that a1 = r11 q1 , from which it follows that
r11 = ±‖a1 ‖2 ,     q1 = (1/‖a1 ‖2 ) a1 .
r12 = q1^T a2 ,     r22 = ‖a2 − r12 q1 ‖2 ,     q2 = (1/r22 )(a2 − r12 q1 ).
to obtain

qk = (1/rkk ) ( ak − Σ_{j=1}^{k−1} rjk qj ),     rjk = qj^T ak .
If we define Pi = qi qi^T , then Pi is a symmetric projection that satisfies Pi² = Pi , and Pi Pj = δij Pi .
Thus we can write

qk = (1/rkk ) ( I − Σ_{j=1}^{k−1} Pj ) ak = (1/rkk ) Π_{j=1}^{k−1} (I − Pj ) ak ,
which means
A − C (k) = [ 0  0  · · ·  0  A(k) ] ,
because the first k − 1 columns of A are linear combinations of the first k − 1 columns of Q1 , and
the contributions of these columns of Q1 to all columns of A are removed by subtracting C (k) .
If we write
A(k) = [ zk  Bk ]
then, because the kth column of A is a linear combination of the first k columns of Q1 , and the
contributions of the first k − 1 columns are removed in A(k) , zk must be a multiple of qk . Therefore,
rkk = ‖zk ‖2 ,     qk = (1/rkk ) zk .
We then compute

[ rk,k+1  · · ·  rk,n ] = qk^T Bk ,
which yields

A(k+1) = Bk − qk [ rk,k+1  · · ·  rk,n ] .
This process is numerically stable.
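A minimal Matlab sketch of Modified Gram-Schmidt, following the column-at-a-time process just described, is given below (the function name mgs is our own choice):

function [Q, R] = mgs(A)
    % Modified Gram-Schmidt QR factorization: A = Q*R
    [m, n] = size(A);
    Q = A;  R = zeros(n);
    for k = 1:n
        R(k,k) = norm(Q(:,k));                    % r_kk = ||z_k||_2
        Q(:,k) = Q(:,k)/R(k,k);                   % q_k
        R(k,k+1:n) = Q(:,k)'*Q(:,k+1:n);          % row k of R
        Q(:,k+1:n) = Q(:,k+1:n) - Q(:,k)*R(k,k+1:n);   % remove q_k's contribution
    end
end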
Note that Modified Gram-Schmidt computes the entries of R1 row-by-row, rather than column-
by-column, as Classical Gram-Schmidt does. This rearrangement of the order of operations, while
mathematically equivalent to Classical Gram-Schmidt, is much more stable, numerically, because
each entry of R1 is obtained by computing an inner product of a column of Q1 with a modified
column of A, from which the contributions of all previous columns of Q1 have been removed.
To see why this is significant, consider the inner products

uT v,     uT (v + w),

where uT w = 0. The above inner products are equal, but suppose that |uT v| ≪ ‖w‖. Then uT v
is a small number that is being computed by subtraction of potentially large numbers, which is
susceptible to catastrophic cancellation.
It can be shown that Modified Gram-Schmidt produces a matrix Q̂1 such that
and Q̂1 can be computed in approximately 2mn² flops (floating-point operations), whereas with
Householder QR,

Q̂1^T Q̂1 = I + En ,     ‖En ‖ ≈ u,

with Q̂1 being computed in approximately 2mn² − 2n³/3 flops to factor A and an additional
2mn² − 2n³/3 flops to obtain Q1 , the first n columns of Q. That is, Householder QR is much less
sensitive to roundoff error than Gram-Schmidt, even with modification, although Gram-Schmidt is
more efficient if an explicit representation of Q1 is desired.
It follows that if τ = 2/uT u, then P T P = I for any nonzero u. Without loss of generality, we can
stipulate that uT u = 1, and therefore P takes the form P = I − 2vvT , where vT v = 1.
Why is the matrix P called a reflection? This is because for any nonzero vector x, P x is the
reflection of x across the hyperplane that is normal to v. To see this, we consider the 2 × 2 case
3.2. THE QR FACTORIZATION 99
and set v = [ 1  0 ]T and x = [ 1  2 ]T . Then

P = I − 2vvT = I − 2 [ 1 ] [ 1  0 ] = [ 1  0 ] − 2 [ 1  0 ] = [ −1  0 ]
                     [ 0 ]            [ 0  1 ]     [ 0  0 ]   [  0  1 ] .

Therefore

P x = [ −1  0 ] [ 1 ] = [ −1 ]
      [  0  1 ] [ 2 ]   [  2 ] .
Now, let $x$ be any vector. We wish to construct $P$ so that $Px = \alpha e_1$ for some $\alpha$, where $e_1 = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}^T$. From the relations
Rearranging, we obtain
\[ \frac{1}{2}(x - \alpha e_1) = (v^Tx)v. \]
It follows that the vector $v$, which is a unit vector, must be a scalar multiple of $x - \alpha e_1$. Therefore, $v$ is defined by the equations
\[
\begin{aligned}
v_1 &= \frac{x_1 - \alpha}{\|x - \alpha e_1\|_2} = \frac{x_1 - \alpha}{\sqrt{\|x\|_2^2 - 2\alpha x_1 + \alpha^2}} = \frac{x_1 - \alpha}{\sqrt{2\alpha^2 - 2\alpha x_1}} = -\frac{\alpha - x_1}{\sqrt{2\alpha(\alpha - x_1)}} = -\operatorname{sgn}(\alpha)\sqrt{\frac{\alpha - x_1}{2\alpha}}, \\
v_2 &= \frac{x_2}{\sqrt{2\alpha(\alpha - x_1)}} = -\frac{x_2}{2\alpha v_1}, \\
&\;\;\vdots \\
v_n &= \frac{x_n}{\sqrt{2\alpha(\alpha - x_1)}} = -\frac{x_n}{2\alpha v_1}.
\end{aligned}
\]
To avoid catastrophic cancellation, it is best to choose the sign of α so that it has the opposite sign
of x1 . It can be seen that the computation of v requires about 3n operations.
Note that the matrix P is never formed explicitly. For any vector b, the product P b can be
computed as follows:
P b = (I − 2vvT )b = b − 2(vT b)v.
This process requires only 4n operations. It is easy to see that we can represent P simply by storing
only v.
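A Matlab sketch of these formulas (house_vector is an illustrative name, and the guard for the case $x_1 = 0$ is an added assumption):

    function [v, alpha] = house_vector(x)
    % Return a unit vector v and scalar alpha such that (I - 2*v*v')*x = alpha*e_1,
    % with sign(alpha) chosen opposite to sign(x(1)) to avoid cancellation
    s = sign(x(1));
    if s == 0, s = 1; end
    alpha = -s * norm(x);
    v = x;  v(1) = v(1) - alpha;
    v = v / norm(v);
    end
    % Applying the reflection to a vector b costs O(n):  Pb = b - 2*(v'*b)*v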
Now, suppose that $x = a_1$ is the first column of a matrix $A$. Then we construct a Householder reflection $H_1 = I - 2v_1v_1^T$ such that $H_1x = \alpha e_1$, and we have
\[ A^{(2)} = H_1A = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ 0 & & & \\ \vdots & a^{(2)}_{2:m,2} & \cdots & a^{(2)}_{2:m,n} \\ 0 & & & \end{bmatrix}, \]
where we denote the constant $\alpha$ by $r_{11}$, as it is the $(1,1)$ element of the updated matrix $A^{(2)}$. Now, we can construct $\tilde H_2$ such that
\[ \tilde H_2a^{(2)}_{2:m,2} = \begin{bmatrix} r_{22} \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \]
and then
\[ A^{(3)} = \begin{bmatrix} 1 & 0 \\ 0 & \tilde H_2 \end{bmatrix}A^{(2)} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1n} \\ 0 & r_{22} & r_{23} & \cdots & r_{2n} \\ 0 & 0 & & & \\ \vdots & \vdots & a^{(3)}_{3:m,3} & \cdots & a^{(3)}_{3:m,n} \\ 0 & 0 & & & \end{bmatrix}. \]
Note that the first column of A(2) is unchanged by H̃2 , because H̃2 only operates on rows 2 through
m, which, in the first column, have zero entries. Continuing this process, we obtain
\[ H_n\cdots H_1A = A^{(n+1)} = R, \]
where, for $j = 1, 2, \ldots, n$,
\[ H_j = \begin{bmatrix} I_{j-1} & 0 \\ 0 & \tilde H_j \end{bmatrix} \]
and R is an upper triangular matrix. We have thus factored A = QR, where Q = H1 H2 · · · Hn is
an orthogonal matrix.
Note that for each $j = 1, 2, \ldots, n$, $H_j$ is also a Householder reflection, based on a vector whose first $j-1$ components are equal to zero. Therefore, application of $H_j$ to a matrix does not affect the first $j-1$ rows or columns. We also note that
\[ A^TA = R^TQ^TQR = R^TR, \]
so that the upper-triangular factor $R$ is essentially the Cholesky factor of $A^TA$.
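A Matlab sketch of the whole Householder QR factorization for an $m\times n$ matrix with $m \ge n$; $Q$ is accumulated explicitly for clarity, although in practice only the vectors $v_j$ would be stored (householder_qr is an illustrative name):

    function [Q, R] = householder_qr(A)
    [m, n] = size(A);
    Q = eye(m);  R = A;
    for j = 1:n
        x = R(j:m, j);
        s = sign(x(1));  if s == 0, s = 1; end
        v = x;  v(1) = v(1) + s * norm(x);        % reflects x to -s*||x||*e_1
        v = v / norm(v);
        R(j:m, j:n) = R(j:m, j:n) - 2 * v * (v' * R(j:m, j:n));
        Q(:, j:m) = Q(:, j:m) - 2 * (Q(:, j:m) * v) * v';
    end
    end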
Example We apply Householder reflections to compute the QR factorization of the matrix from the previous example,
\[ A^{(1)} = A = \begin{bmatrix} 0.8147 & 0.0975 & 0.1576 \\ 0.9058 & 0.2785 & 0.9706 \\ 0.1270 & 0.5469 & 0.9572 \\ 0.9134 & 0.9575 & 0.4854 \\ 0.6324 & 0.9649 & 0.8003 \end{bmatrix}. \]
First, we work with the first column of $A$,
\[ x_1 = a^{(1)}_{1:5,1} = \begin{bmatrix} 0.8147 \\ 0.9058 \\ 0.1270 \\ 0.9134 \\ 0.6324 \end{bmatrix}, \qquad \|x_1\|_2 = 1.6536. \]
Applying these same Householder reflections, in order, on the right of the identity matrix, yields the orthogonal matrix
\[ Q = H_1H_2H_3 = \begin{bmatrix} -0.4927 & -0.4806 & 0.1780 & -0.6015 & -0.3644 \\ -0.5478 & -0.3583 & -0.5777 & 0.3760 & 0.3104 \\ -0.0768 & 0.4754 & -0.6343 & -0.1497 & -0.5859 \\ -0.5523 & 0.3391 & 0.4808 & 0.5071 & -0.3026 \\ -0.3824 & 0.5473 & 0.0311 & -0.4661 & 0.5796 \end{bmatrix} \]
such that
\[ A^{(4)} = R = Q^TA = H_3H_2H_1A = \begin{bmatrix} -1.6536 & -1.1405 & -1.2569 \\ 0 & 0.9661 & 0.6341 \\ 0 & 0 & -0.8816 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \]
is upper triangular, where
\[ H_1 = \tilde H_1, \qquad H_2 = \begin{bmatrix} 1 & 0 \\ 0 & \tilde H_2 \end{bmatrix}, \qquad H_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \tilde H_3 \end{bmatrix}, \]
are the same Householder transformations as before, defined in such a way that they can be applied to the entire matrix $A$. Note that for $j = 1, 2, 3$,
\[ H_j = I - 2v_jv_j^T, \qquad v_j = \begin{bmatrix} 0 \\ \tilde v_j \end{bmatrix}, \qquad \|v_j\|_2 = \|\tilde v_j\|_2 = 1, \]
where $\gamma^2 + \sigma^2 = 1$.
\[ \gamma a_{21} = \sigma a_{11}, \qquad \gamma^2a_{21}^2 = \sigma^2a_{11}^2 = (1 - \gamma^2)a_{11}^2, \]
which yields
\[ \gamma = \pm\frac{a_{11}}{\sqrt{a_{21}^2 + a_{11}^2}}. \]
It is conventional to choose the + sign. Then, we obtain
\[ \sigma^2 = 1 - \gamma^2 = 1 - \frac{a_{11}^2}{a_{21}^2 + a_{11}^2} = \frac{a_{21}^2}{a_{21}^2 + a_{11}^2}, \]
or
\[ \sigma = \pm\frac{a_{21}}{\sqrt{a_{21}^2 + a_{11}^2}}. \]
Again, we choose the + sign. As a result, we have
\[ r_{11} = a_{11}\frac{a_{11}}{\sqrt{a_{21}^2 + a_{11}^2}} + a_{21}\frac{a_{21}}{\sqrt{a_{21}^2 + a_{11}^2}} = \sqrt{a_{21}^2 + a_{11}^2}. \]
The matrix
\[ Q = \begin{bmatrix} \gamma & -\sigma \\ \sigma & \gamma \end{bmatrix}^T \]
is called a Givens rotation. It is called a rotation because it is orthogonal, and therefore length-preserving, and also because there is an angle $\theta$ such that $\sin\theta = \sigma$ and $\cos\theta = \gamma$, and its effect is to rotate a vector clockwise through the angle $\theta$. In particular,
\[ \begin{bmatrix} \gamma & -\sigma \\ \sigma & \gamma \end{bmatrix}^T\begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} \rho \\ 0 \end{bmatrix}, \]
where $\rho = \sqrt{\alpha^2 + \beta^2}$, $\alpha = \rho\cos\theta$ and $\beta = \rho\sin\theta$. It is easy to verify that the product of two rotations is itself a rotation. Now, in the case where $A$ is an $n\times n$ matrix, suppose that we have the vector
\[ \begin{bmatrix} \times & \cdots & \times & \alpha & \times & \cdots & \times & \beta & \times & \cdots & \times \end{bmatrix}^T. \]
Then
\[ \begin{bmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & \gamma & & \sigma & & \\ & & & \ddots & & & \\ & & -\sigma & & \gamma & & \\ & & & & & \ddots & \\ & & & & & & 1 \end{bmatrix}\begin{bmatrix} \times \\ \vdots \\ \alpha \\ \vdots \\ \beta \\ \vdots \\ \times \end{bmatrix} = \begin{bmatrix} \times \\ \vdots \\ \rho \\ \vdots \\ 0 \\ \vdots \\ \times \end{bmatrix}. \]
So, in order to transform A into an upper triangular matrix R, we can find a product of rotations Q
such that QT A = R. It is easy to see that O(n2 ) rotations are required. Each rotation takes O(n)
operations, so the entire process of computing the QR factorization requires O(n3 ) operations.
It is important to note that the straightforward approach to computing the entries $\gamma$ and $\sigma$ of the Givens rotation,
\[ \gamma = \frac{\alpha}{\sqrt{\alpha^2 + \beta^2}}, \qquad \sigma = \frac{\beta}{\sqrt{\alpha^2 + \beta^2}}, \]
is not always advisable, because in floating-point arithmetic, the computation of $\sqrt{\alpha^2 + \beta^2}$ could overflow. To get around this problem, suppose that $|\beta| \ge |\alpha|$. Then, we can instead compute
\[ \tau = \frac{\alpha}{\beta}, \qquad \sigma = \frac{1}{\sqrt{1 + \tau^2}}, \qquad \gamma = \sigma\tau, \]
which is guaranteed not to overflow since the only number that is squared is less than one in magnitude. On the other hand, if $|\alpha| \ge |\beta|$, then we compute
\[ \tau = \frac{\beta}{\alpha}, \qquad \gamma = \frac{1}{\sqrt{1 + \tau^2}}, \qquad \sigma = \gamma\tau. \]
Now, we describe the entire algorithm for computing the QR factorization using Givens rotations. Let $[c, s] = \mathrm{givens}(a, b)$ be a Matlab-style function that computes $c$ and $s$ such that
\[ \begin{bmatrix} c & -s \\ s & c \end{bmatrix}^T\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ 0 \end{bmatrix}, \qquad r = \sqrt{a^2 + b^2}. \]
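A Matlab sketch of such a function, using the overflow-avoiding formulas above (with this scaling the computed value $r = ca + sb$ may carry either sign, with $|r| = \sqrt{a^2 + b^2}$):

    function [c, s] = givens(a, b)
    % Compute c, s with c^2 + s^2 = 1 so that [c -s; s c]' * [a; b] = [r; 0]
    if b == 0
        c = 1;  s = 0;
    elseif abs(b) >= abs(a)
        tau = a / b;  s = 1 / sqrt(1 + tau^2);  c = s * tau;
    else
        tau = b / a;  c = 1 / sqrt(1 + tau^2);  s = c * tau;
    end
    end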
Then, let $G(i, j, c, s)^T$ be the Givens rotation matrix that rotates the $i$th and $j$th elements of a vector $v$ clockwise by the angle $\theta$ such that $\cos\theta = c$ and $\sin\theta = s$, so that if $v_i = a$ and $v_j = b$, and $[c, s] = \mathrm{givens}(a, b)$, then in the updated vector $u = G(i, j, c, s)^Tv$, $u_i = r = \sqrt{a^2 + b^2}$ and $u_j = 0$. The QR factorization of an $m\times n$ matrix $A$ is then computed as follows.
Q = I
R = A
for j = 1 : n do
    for i = m : −1 : j + 1 do
        [c, s] = givens(r_{i−1,j}, r_{ij})
        R(i − 1 : i, j : n) = G(i − 1, i, c, s)^T R(i − 1 : i, j : n)
        Q(:, i − 1 : i) = Q(:, i − 1 : i) G(i − 1, i, c, s)
    end
end
Note that the matrix Q is accumulated by column rotations of the identity matrix, because the
matrix by which A is multiplied to reduce A to upper-triangular form, a product of row rotations,
is QT .
Example We now use Givens rotations to compute the QR factorization of the matrix from the previous example,
\[ A = \begin{bmatrix} 0.8147 & 0.0975 & 0.1576 \\ 0.9058 & 0.2785 & 0.9706 \\ 0.1270 & 0.5469 & 0.9572 \\ 0.9134 & 0.9575 & 0.4854 \\ 0.6324 & 0.9649 & 0.8003 \end{bmatrix}. \]
First, we compute a Givens rotation that, when applied to $a_{41}$ and $a_{51}$, zeros $a_{51}$:
\[ \begin{bmatrix} 0.8222 & -0.5692 \\ 0.5692 & 0.8222 \end{bmatrix}^T\begin{bmatrix} 0.9134 \\ 0.6324 \end{bmatrix} = \begin{bmatrix} 1.1109 \\ 0 \end{bmatrix}. \]
\[ \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0.8222 & -0.5692 \\ 0 & 0 & 0 & 0.5692 & 0.8222 \end{bmatrix}^T\begin{bmatrix} 0.8147 & 0.0975 & 0.1576 \\ 0.9058 & 0.2785 & 0.9706 \\ 0.1270 & 0.5469 & 0.9572 \\ 0.9134 & 0.9575 & 0.4854 \\ 0.6324 & 0.9649 & 0.8003 \end{bmatrix} = \begin{bmatrix} 0.8147 & 0.0975 & 0.1576 \\ 0.9058 & 0.2785 & 0.9706 \\ 0.1270 & 0.5469 & 0.9572 \\ 1.1109 & 1.3365 & 0.8546 \\ 0 & 0.2483 & 0.3817 \end{bmatrix}. \]
Next, we compute a Givens rotation that, when applied to $a_{31}$ and $a_{41}$, zeros $a_{41}$:
\[ \begin{bmatrix} 0.1136 & -0.9935 \\ 0.9935 & 0.1136 \end{bmatrix}^T\begin{bmatrix} 0.1270 \\ 1.1109 \end{bmatrix} = \begin{bmatrix} 1.1181 \\ 0 \end{bmatrix}. \]
Moving to the second column, we compute a Givens rotation that, when applied to $a_{42}$ and $a_{52}$, zeros $a_{52}$:
\[ \begin{bmatrix} 0.8445 & 0.5355 \\ -0.5355 & 0.8445 \end{bmatrix}^T\begin{bmatrix} -0.3916 \\ 0.2483 \end{bmatrix} = \begin{bmatrix} -0.4636 \\ 0 \end{bmatrix}. \]
\[ \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0.8445 & 0.5355 \\ 0 & 0 & 0 & -0.5355 & 0.8445 \end{bmatrix}^T\begin{bmatrix} 1.6536 & 1.1405 & 1.2569 \\ 0 & 0.5336 & 0.5305 \\ 0 & 0.6585 & -0.1513 \\ 0 & -0.3916 & -0.8539 \\ 0 & 0.2483 & 0.3817 \end{bmatrix} = \begin{bmatrix} 1.6536 & 1.1405 & 1.2569 \\ 0 & 0.5336 & 0.5305 \\ 0 & 0.6585 & -0.1513 \\ 0 & -0.4636 & -0.9256 \\ 0 & 0 & -0.1349 \end{bmatrix}. \]
Next, we compute a Givens rotation that, when applied to $a_{32}$ and $a_{42}$, zeros $a_{42}$:
\[ \begin{bmatrix} 0.8177 & 0.5757 \\ -0.5757 & 0.8177 \end{bmatrix}^T\begin{bmatrix} 0.6585 \\ -0.4636 \end{bmatrix} = \begin{bmatrix} 0.8054 \\ 0 \end{bmatrix}. \]
\[ \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0.8177 & 0.5757 & 0 \\ 0 & 0 & -0.5757 & 0.8177 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}^T\begin{bmatrix} 1.6536 & 1.1405 & 1.2569 \\ 0 & 0.5336 & 0.5305 \\ 0 & 0.6585 & -0.1513 \\ 0 & -0.4636 & -0.9256 \\ 0 & 0 & -0.1349 \end{bmatrix} = \begin{bmatrix} 1.6536 & 1.1405 & 1.2569 \\ 0 & 0.5336 & 0.5305 \\ 0 & 0.8054 & 0.4091 \\ 0 & 0 & -0.8439 \\ 0 & 0 & -0.1349 \end{bmatrix}. \]
Next, we compute a Givens rotation that, when applied to $a_{22}$ and $a_{32}$, zeros $a_{32}$:
\[ \begin{bmatrix} 0.5523 & -0.8336 \\ 0.8336 & 0.5523 \end{bmatrix}^T\begin{bmatrix} 0.5336 \\ 0.8054 \end{bmatrix} = \begin{bmatrix} 0.9661 \\ 0 \end{bmatrix}. \]
Applying the transpose of each Givens rotation, in order, to the columns of the identity matrix yields the matrix
\[ Q = \begin{bmatrix} 0.4927 & -0.4806 & 0.1780 & -0.7033 & 0 \\ 0.5478 & -0.3583 & -0.5777 & 0.4825 & 0.0706 \\ 0.0768 & 0.4754 & -0.6343 & -0.4317 & -0.4235 \\ 0.5523 & 0.3391 & 0.4808 & 0.2769 & -0.5216 \\ 0.3824 & 0.5473 & 0.0311 & -0.0983 & 0.7373 \end{bmatrix} \]
\[ \mathrm{span}\{a_1, \ldots, a_k\} = \mathrm{span}\{q_1, \ldots, q_k\} \]
cannot be expected to hold, because the first $k$ columns of $A$ could be linearly dependent, while the first $k$ columns of $Q$, being orthonormal, must be linearly independent.
Example The matrix
\[ A = \begin{bmatrix} 1 & -2 & 1 \\ 2 & -4 & 0 \\ 1 & -2 & 3 \end{bmatrix} \]
has rank 2, because the first two columns are parallel, and therefore are linearly dependent, while
the third column is not parallel to either of the first two. Columns 1 and 3, or columns 2 and 3,
form linearly independent sets. 2
Therefore, in the case where rank(A) = r < n, we seek a decomposition of the form AΠ = QR,
where Π is a permutation matrix chosen so that the diagonal elements of R are maximized at each
stage. Specifically, suppose H1 is a Householder reflection chosen so that
\[ H_1A = \begin{bmatrix} r_{11} & \ast \\ 0 & \\ \vdots & \ast \\ 0 & \end{bmatrix}, \qquad r_{11} = \|a_1\|_2. \]
To maximize r11 , we choose Π1 so that in the column-permuted matrix A = AΠ1 , we have ka1 k2 ≥
kaj k2 for j ≥ 2. For Π2 , we examine the lengths of the columns of the submatrix of A obtained by
removing the first row and column. It is not necessary to recompute the lengths of the columns,
because we can update them by subtracting the square of the first component from the square of
the total length.
This process is called QR with column pivoting. It yields the decomposition
\[ Q\begin{bmatrix} R & S \\ 0 & 0 \end{bmatrix} = A\Pi, \]
where Q = H1 · · · Hr , Π = Π1 · · · Πr , and R is an upper triangular, r × r matrix. The last m − r
rows are necessarily zero, because every column of A is a linear combination of the first r columns
of Q.
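In Matlab, the built-in qr returns a column-pivoted factorization when called with three outputs, which can also be used to estimate the rank; the tolerance below is an illustrative choice, and the matrix is the one used in the example that follows.

    A = [1 3 5 1; 2 -1 2 1; 1 4 6 1; 4 5 10 1];
    [Q, R, P] = qr(A);                     % A*P = Q*R, |R(1,1)| >= |R(2,2)| >= ...
    tol = max(size(A)) * eps(abs(R(1,1)));
    r = nnz(abs(diag(R)) > tol)            % estimated rank (here r = 3)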
Example We perform QR with column pivoting on the matrix
\[ A = \begin{bmatrix} 1 & 3 & 5 & 1 \\ 2 & -1 & 2 & 1 \\ 1 & 4 & 6 & 1 \\ 4 & 5 & 10 & 1 \end{bmatrix}. \]
Computing the squared 2-norms of the columns yields
\[ \|a_1\|_2^2 = 22, \qquad \|a_2\|_2^2 = 51, \qquad \|a_3\|_2^2 = 165, \qquad \|a_4\|_2^2 = 4. \]
We see that the third column has the largest 2-norm. We therefore interchange the first and third columns to obtain
\[ A^{(1)} = A\Pi_1 = A\begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 5 & 3 & 1 & 1 \\ 2 & -1 & 2 & 1 \\ 6 & 4 & 1 & 1 \\ 10 & 5 & 4 & 1 \end{bmatrix}. \]
We then apply a Householder transformation $H_1$ to $A^{(1)}$ to make the first column a multiple of $e_1$, which yields
\[ H_1A^{(1)} = \begin{bmatrix} -12.8452 & -6.7729 & -4.2817 & -1.7905 \\ 0 & -2.0953 & 1.4080 & 0.6873 \\ 0 & 0.7141 & -0.7759 & 0.0618 \\ 0 & -0.4765 & 1.0402 & -0.5637 \end{bmatrix}. \]
Next, we consider the submatrix obtained by removing the first row and column of $H_1A^{(1)}$:
\[ \tilde A^{(2)} = \begin{bmatrix} -2.0953 & 1.4080 & 0.6873 \\ 0.7141 & -0.7759 & 0.0618 \\ -0.4765 & 1.0402 & -0.5637 \end{bmatrix}. \]
We compute the lengths of the columns, as before, except that this time, we update the lengths of the columns of $A$, rather than recomputing them. This yields
\[
\begin{aligned}
\|\tilde a^{(2)}_1\|_2^2 &= \|a^{(1)}_2\|_2^2 - [a^{(1)}_{12}]^2 = 51 - (-6.7729)^2 = 5.1273, \\
\|\tilde a^{(2)}_2\|_2^2 &= \|a^{(1)}_3\|_2^2 - [a^{(1)}_{13}]^2 = 22 - (-4.2817)^2 = 3.6667, \\
\|\tilde a^{(2)}_3\|_2^2 &= \|a^{(1)}_4\|_2^2 - [a^{(1)}_{14}]^2 = 4 - (-1.7905)^2 = 0.7939.
\end{aligned}
\]
The second column of $A^{(1)}$ has the largest updated length, so there is no need for a column interchange this time. We apply a Householder transformation $\tilde H_2$ to the first column of $\tilde A^{(2)}$ so that the updated column is a multiple of $e_1$, which is equivalent to applying a $4\times 4$ Householder transformation $H_2 = I - 2v_2v_2^T$, where the first component of $v_2$ is zero, to the second column of $A^{(2)}$ so that the updated column is a linear combination of $e_1$ and $e_2$. This yields
\[ \tilde H_2\tilde A^{(2)} = \begin{bmatrix} 2.2643 & -1.7665 & -0.4978 \\ 0 & -0.2559 & 0.2559 \\ 0 & 0.6933 & -0.6933 \end{bmatrix}. \]
Then, we consider the submatrix obtained by removing the first row and column of $\tilde H_2\tilde A^{(2)}$:
\[ \tilde A^{(3)} = \begin{bmatrix} -0.2559 & 0.2559 \\ 0.6933 & -0.6933 \end{bmatrix}. \]
Both columns have the same lengths, so no column interchange is required. Applying a Householder
reflection H̃3 to the first column to make it a multiple of e1 will have the same effect on the second
column, because they are parallel. We have
\[ \tilde H_3\tilde A^{(3)} = \begin{bmatrix} 0.7390 & -0.7390 \\ 0 & 0 \end{bmatrix}. \]
It follows that the matrix $\tilde A^{(4)}$ obtained by removing the first row and column of $\tilde H_3\tilde A^{(3)}$ will be the zero matrix. We conclude that $\mathrm{rank}(A) = 3$, and that $A$ has the factorization
\[ A\Pi = Q\begin{bmatrix} R & S \\ 0 & 0 \end{bmatrix}, \]
where
\[ \Pi = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad R = \begin{bmatrix} -12.8452 & -6.7729 & -4.2817 \\ 0 & 2.2643 & -1.7665 \\ 0 & 0 & 0.7390 \end{bmatrix}, \qquad S = \begin{bmatrix} -1.7905 \\ -0.4978 \\ -0.7390 \end{bmatrix}, \]
and $Q = H_1H_2H_3$ is the product of the Householder reflections used to reduce $A\Pi$ to upper-triangular form. 2
Using this decomposition, we can solve the linear least squares problem $Ax = b$ by observing that
\[
\begin{aligned}
\|b - Ax\|_2^2 &= \left\|b - Q\begin{bmatrix} R & S \\ 0 & 0 \end{bmatrix}\Pi^Tx\right\|_2^2 \\
&= \left\|Q^Tb - \begin{bmatrix} R & S \\ 0 & 0 \end{bmatrix}\begin{bmatrix} u \\ v \end{bmatrix}\right\|_2^2 \\
&= \left\|\begin{bmatrix} c \\ d \end{bmatrix} - \begin{bmatrix} Ru + Sv \\ 0 \end{bmatrix}\right\|_2^2 \\
&= \|c - Ru - Sv\|_2^2 + \|d\|_2^2,
\end{aligned}
\]
where
\[ Q^Tb = \begin{bmatrix} c \\ d \end{bmatrix}, \qquad \Pi^Tx = \begin{bmatrix} u \\ v \end{bmatrix}, \]
with $c$ and $u$ being $r$-vectors. Thus $\min\|b - Ax\|_2^2 = \|d\|_2^2$, provided that $Ru + Sv = c$. A basic solution is obtained by choosing $v = 0$. A second solution is to choose $u$ and $v$ so that $\|u\|_2^2 + \|v\|_2^2$ is minimized. This criterion is related to the pseudo-inverse of $A$.
\[ A^T = \Pi\begin{bmatrix} R^T & 0 \\ S^T & 0 \end{bmatrix}Q^T. \]
Computing a QR factorization of the block $\begin{bmatrix} R^T \\ S^T \end{bmatrix} = Z\begin{bmatrix} L^T \\ 0 \end{bmatrix}$ then yields
\[ A\Pi = Q\begin{bmatrix} L & 0 \\ 0 & 0 \end{bmatrix}Z^T, \]
where $L$ is a lower-triangular matrix of size $r\times r$, where $r$ is the rank of $A$. This is the complete orthogonal decomposition of $A$.
Recall that X is the pseudo-inverse of A if
1. AXA = A
2. XAX = X
3. (XA)T = XA
4. (AX)T = AX
Let $\mathcal{X} = \{x \mid \|b - Ax\|_2 = \min\}$. If $x \in \mathcal{X}$ and we desire $\|x\|_2 = \min$, then $x = A^+b$. Note that in this case,
\[ r = b - Ax = b - AA^+b = (I - AA^+)b, \]
where the matrix $I - AA^+$ is a projection matrix $P^\perp$. To see that $P^\perp$ is a projection, note that
\[ P = AA^+ = Q\begin{bmatrix} L & 0 \\ 0 & 0 \end{bmatrix}Z^TZ\begin{bmatrix} L^{-1} & 0 \\ 0 & 0 \end{bmatrix}Q^T = Q\begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}Q^T. \]
It can then be verified directly that $P = P^T$ and $P^2 = P$.
and A+ = V Σ+ U T . The matrix A+ is called the pseudo-inverse of A. In the case where A is square
and has full rank, the pseudo-inverse is equal to A−1 . Note that A+ is independent of b. It also
has the properties
1. AA+ A = A
2. A+ AA+ = A+
3. (A+ A)T = A+ A
4. (AA+ )T = AA+
The solution x of the least-squares problem minimizes kb − Axk2 , and therefore is the vector
that solves the system Ax = b as closely as possible. However, we can use the SVD to show that
x is the exact solution to a related system of equations. We write b = b1 + b2 , where
1. P = P T
2. P 2 = P
In other words, the matrix P is a projection. In particular, it is a projection onto the space spanned
by the columns of A, i.e. the range of A. That is, P = Ur UrT , where Ur is the matrix consisting of
the first r columns of U .
The residual vector r = b − Ax can be expressed conveniently using this projection. We have
r = b − Ax = b − AA+ b = b − P b = (I − P )b = P ⊥ b.
That is, the residual is the projection of b onto the orthogonal complement of the range of A, which
is the null space of AT . Furthermore, from the SVD, the 2-norm of the residual satisfies
where, as before, c = U T b.
\[ P\frac{dA}{d\epsilon} + \frac{dP}{d\epsilon}A = \frac{dA}{d\epsilon}. \]
It follows that
\[ \frac{dP}{d\epsilon}A = (I - P)\frac{dA}{d\epsilon} = P^\perp\frac{dA}{d\epsilon}. \]
Multiplying through by $A^+$, we obtain
\[ \frac{dP}{d\epsilon}P = P^\perp\frac{dA}{d\epsilon}A^+. \]
Because $P$ is a projection,
\[ \frac{d(P^2)}{d\epsilon} = P\frac{dP}{d\epsilon} + \frac{dP}{d\epsilon}P = \frac{dP}{d\epsilon}, \]
so, using the symmetry of $P$,
\[ \frac{dP}{d\epsilon} = P^\perp\frac{dA}{d\epsilon}A^+ + (A^+)^T\frac{dA^T}{d\epsilon}P^\perp. \]
\[
\begin{aligned}
r(\epsilon) &= r(0) + \epsilon\frac{dP^\perp}{d\epsilon}b + O(\epsilon^2) \\
&= r(0) + \epsilon\frac{d(I - P)}{d\epsilon}b + O(\epsilon^2) \\
&= r(0) - \epsilon\frac{dP}{d\epsilon}b + O(\epsilon^2) \\
&= r(0) - \epsilon[P^\perp E\hat x(0) + (A^+)^TE^Tr(0)] + O(\epsilon^2).
\end{aligned}
\]
Note that if $A$ is scaled so that $\|A\|_2 = 1$, then the second term above involves the condition number $\kappa_2(A)$. We also have
\[ \frac{\|x(\epsilon) - x(0)\|_2}{\|\hat x\|_2} = |\epsilon|\|E\|_2\left\{2\kappa_2(A) + \frac{\|r(0)\|_2}{\|\hat x(0)\|_2}\kappa_2(A)^2\right\} + O(\epsilon^2). \]
Note that a small perturbation in the residual does not imply a small perturbation in the solution.
Continuing this process on $B$, and keeping in mind that $\|B\|_2 \le \|A\|_2$, we obtain the decomposition
\[ U^TAV = \Sigma, \]
where
\[ U = \begin{bmatrix} u_1 & \cdots & u_m \end{bmatrix} \in \mathbb{R}^{m\times m}, \qquad V = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix} \in \mathbb{R}^{n\times n} \]
are both orthogonal matrices, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_p) \in \mathbb{R}^{m\times n}$, $p = \min\{m, n\}$, is diagonal with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$.
This decomposition of A is called the singular value decomposition, or SVD. It is more commonly
written as a factorization of A,
A = U ΣV T .
3.4.2 Properties
The diagonal entries of $\Sigma$ are the singular values of $A$. The columns of $U$ are the left singular vectors, and the columns of $V$ are the right singular vectors. It follows from the SVD itself that the singular values and vectors satisfy the relations
\[ Av_i = \sigma_iu_i, \qquad A^Tu_i = \sigma_iv_i, \qquad i = 1, \ldots, \min\{m, n\}. \]
For convenience, we denote the ith largest singular value of A by σi (A), and the largest and smallest
singular values are commonly denoted by σmax (A) and σmin (A), respectively.
The SVD conveys much useful information about the structure of a matrix, particularly with
regard to systems of linear equations involving the matrix. Let r be the number of nonzero singular
values. Then $r$ is the rank of $A$, and
\[ \mathrm{range}(A) = \mathrm{span}\{u_1, \ldots, u_r\}, \qquad \mathrm{null}(A) = \mathrm{span}\{v_{r+1}, \ldots, v_n\}. \]
That is, the SVD yields orthonormal bases of the range and null space of $A$.
It follows that we can write
\[ A = \sum_{i=1}^r \sigma_iu_iv_i^T. \]
This is called the SVD expansion of $A$. If $m \ge n$, then this expansion yields the "economy-size" SVD
\[ A = U_1\Sigma_1V^T, \]
where
\[ U_1 = \begin{bmatrix} u_1 & \cdots & u_n \end{bmatrix} \in \mathbb{R}^{m\times n}, \qquad \Sigma_1 = \mathrm{diag}(\sigma_1, \ldots, \sigma_n) \in \mathbb{R}^{n\times n}. \]
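In Matlab, this factorization is available directly; the test matrix below is arbitrary.

    A = randn(8, 3);                       % arbitrary tall test matrix (m >= n)
    [U1, Sigma1, V] = svd(A, 'econ');      % U1 is 8x3, Sigma1 is 3x3, V is 3x3
    norm(A - U1 * Sigma1 * V')             % reconstruction error at roundoff level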
and
\[ S = \begin{bmatrix} 42.4264 & 0 & 0 \\ 0 & 2.4495 & 0 \\ 0 & 0 & 0 \end{bmatrix}. \]
Let $U = \begin{bmatrix} u_1 & u_2 & u_3 \end{bmatrix}$ and $V = \begin{bmatrix} v_1 & v_2 & v_3 \end{bmatrix}$ be column partitions of $U$ and $V$, respectively. Because there are only two nonzero singular values, we have $\mathrm{rank}(A) = 2$. Furthermore, $\mathrm{range}(A) = \mathrm{span}\{u_1, u_2\}$, and $\mathrm{null}(A) = \mathrm{span}\{v_3\}$. We also have
2
The SVD is also closely related to the $\ell_2$-norm and Frobenius norm. We have
\[ \|A\|_2 = \sigma_1, \qquad \|A\|_F = \sqrt{\sigma_1^2 + \cdots + \sigma_r^2}, \]
and
\[ \min_{x\neq 0}\frac{\|Ax\|_2}{\|x\|_2} = \sigma_p, \qquad p = \min\{m, n\}. \]
These relationships follow directly from the invariance of these norms under orthogonal transformations.
3.4.3 Applications
The SVD has many useful applications, but one of particular interest is that the truncated SVD
expansion
\[ A_k = \sum_{i=1}^k \sigma_iu_iv_i^T, \]
where k < r = rank(A), is the best approximation of A by a rank-k matrix. It follows that the
distance between A and the set of matrices of rank k is
\[ \min_{\mathrm{rank}(B)=k}\|A - B\|_2 = \|A - A_k\|_2 = \left\|\sum_{i=k+1}^r \sigma_iu_iv_i^T\right\|_2 = \sigma_{k+1}, \]
because the `2 -norm of a matrix is its largest singular value, and σk+1 is the largest singular value
of the matrix obtained by removing the first k terms of the SVD expansion. Therefore, σp , where
p = min{m, n}, is the distance between A and the set of all rank-deficient matrices (which is zero
when A is already rank-deficient). Because a matrix of full rank can have arbitrarily small, but still
positive, singular values, it follows that the set of all full-rank matrices in Rm×n is both open and
dense.
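A short Matlab illustration of the truncated expansion (the matrix and the value of k are arbitrary choices):

    A = magic(5);  k = 2;                        % arbitrary test matrix and target rank
    [U, S, V] = svd(A);
    Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';   % best rank-k approximation
    norm(A - Ak) - S(k+1, k+1)                   % zero up to roundoff: error equals sigma_{k+1}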
Example The best approximation of the matrix A from the previous example, which has rank
two, by a matrix of rank one is obtained by taking only the first term of its SVD expansion,
\[ A_1 = 42.4264\,u_1v_1^T = \begin{bmatrix} 10 & 20 & 10 \\ 10 & 20 & 10 \\ 10 & 20 & 10 \end{bmatrix}. \]
The solution is
\[ \hat x = A^+b = V\Sigma^+U^Tb. \]
\[ \|A - BQ\|_F = \min \]
\[ \hat Q = UV^T, \qquad B^TA = U\Sigma V^T. \]
Then
\[ \|A - \hat B\|_F^2 = \|U(\Sigma - \Omega_k)V^T\|_F^2 = \|\Sigma - \Omega_k\|_F^2 = \sigma_{k+1}^2 + \cdots + \sigma_r^2. \]
We now consider a variation of this problem. Suppose that $B$ is a perturbation of $A$ such that $A = B + E$, where $\|E\|_F^2 \le \epsilon^2$. We wish to find $\hat B$ such that $\|A - \hat B\|_F^2 \le \epsilon^2$, where the rank of $\hat B$ is minimized. We know that if $B_k = U\Omega_kV^T$, then
\[ \|A - B_k\|_F^2 = \sigma_{k+1}^2 + \cdots + \sigma_r^2. \]
Note that
\[ \|A^+ - \hat B^+\|_F^2 = \frac{1}{\sigma_{k+1}^2} + \cdots + \frac{1}{\sigma_r^2}. \]
The general form of a least squares problem with linear constraints is as follows: we wish to
find an n-vector x that minimizes kAx − bk2 , subject to the constraint C T x = d, where C is a
known n × p matrix and d is a known p-vector.
This problem is usually solved using Lagrange multipliers. We define
\[ f(x; \lambda) = \|b - Ax\|_2^2 + 2\lambda^T(C^Tx - d). \]
Then
\[ \nabla_x f = 2(A^TAx - A^Tb + C\lambda). \]
To minimize $f$, we can solve the system
\[ \begin{bmatrix} A^TA & C \\ C^T & 0 \end{bmatrix}\begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} A^Tb \\ d \end{bmatrix}. \]
From AT Ax = AT b − Cλ, we see that we can first compute x = x̂ − (AT A)−1 Cλ where x̂ is the
solution to the unconstrained least squares problem. Then, from the equation C T x = d we obtain
the equation
C T (AT A)−1 Cλ = C T x̂ − d,
which we can now solve for λ. The algorithm proceeds as follows:
2. Compute A = QR.
\[
\begin{aligned}
U^TU &= (P^TW)^T(P^TW) \\
&= W^TPP^TW \\
&= C^TR^{-1}(R^T)^{-1}C \\
&= C^T(R^TR)^{-1}C \\
&= C^T(R^TQ^TQR)^{-1}C \\
&= C^T(A^TA)^{-1}C.
\end{aligned}
\]
This method is not the most practical since it has more unknowns than the unconstrained least
squares problem, which is odd because the constraints should have the effect of eliminating un-
knowns, not adding them. We now describe an alternate approach.
Suppose that we compute the QR factorization of $C$ to obtain
\[ Q^TC = \begin{bmatrix} R \\ 0 \end{bmatrix}, \]
where $R$ is a $p\times p$ upper triangular matrix. Then the constraint $C^Tx = d$ takes the form
\[ R^Tu = d, \qquad Q^Tx = \begin{bmatrix} u \\ v \end{bmatrix}. \]
Then
\[
\begin{aligned}
\|b - Ax\|_2 &= \|b - AQQ^Tx\|_2 \\
&= \left\|b - \tilde A\begin{bmatrix} u \\ v \end{bmatrix}\right\|_2, \qquad \tilde A = AQ \\
&= \left\|b - \begin{bmatrix} \tilde A_1 & \tilde A_2 \end{bmatrix}\begin{bmatrix} u \\ v \end{bmatrix}\right\|_2 \\
&= \|b - \tilde A_1u - \tilde A_2v\|_2.
\end{aligned}
\]
2. Compute à = AQ
3. Solve RT u = d
4. Solve the new least squares problem of minimizing k(b − Ã1 u) − Ã2 vk2
5. Compute
u
x=Q .
v
This approach has the advantage that there are fewer unknowns in each system that needs to be
solved, and also that κ(Ã2 ) ≤ κ(A). The drawback is that sparsity can be destroyed.
This problem is known as least squares with quadratic constraints. To solve this problem, we define
\[ \lambda_i(A^TA) = \lambda_1, \ldots, \lambda_n, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0; \]
then
\[ \lambda_i(A^TA + \mu I) = \lambda_1 + \mu, \ldots, \lambda_n + \mu. \]
For $\mu > 0$,
\[ \frac{\lambda_1 + \mu}{\lambda_n + \mu} \le \frac{\lambda_1}{\lambda_n}, \]
so $A^TA + \mu I$ is better conditioned.
Solving the least squares problem with quadratic constraints arises in many settings, including
2. Regularization: Tikhonov
\[ x = (A^TA + \mu I)^{-1}A^Tb, \]
where
\[ x^Tx = b^TA(A^TA + \mu I)^{-2}A^Tb = \alpha^2. \]
\[
\begin{aligned}
\alpha^2 &= b^TU\Sigma V^T(V\Sigma^T\Sigma V^T + \mu I)^{-2}V\Sigma^TU^Tb \\
&= c^T\Sigma(\Sigma^T\Sigma + \mu I)^{-2}\Sigma^Tc, \qquad c = U^Tb \\
&= \sum_{i=1}^r \frac{c_i^2\sigma_i^2}{(\sigma_i^2 + \mu)^2} \\
&= \chi(\mu)
\end{aligned}
\]
1. Compute the SVD $A = U\Sigma V^T$.
2. Compute $c = U^Tb$.
3. Solve $\chi(\mu^*) = \alpha^2$ where $\mu^* \ge 0$. Don't use Newton's method on this equation directly; solving $1/\chi(\mu) = 1/\alpha^2$ is much better.
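For a fixed $\mu > 0$, the Tikhonov-regularized solution above can be computed in Matlab without forming $A^TA$ explicitly, by solving an equivalent augmented least squares problem; the data below are arbitrary.

    A = randn(20, 5);  b = randn(20, 1);  mu = 1e-2;   % arbitrary test data
    % Minimizes ||A*x - b||^2 + mu*||x||^2, i.e. x = (A'*A + mu*I) \ (A'*b)
    x = [A; sqrt(mu) * eye(5)] \ [b; zeros(5, 1)];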
\[ Ax = b + r, \qquad \|r\|_2 = \min. \]
\[ \hat Cv_{n+1} = U\Omega_nV^Tv_{n+1} = 0. \]
Our solution is
\[ \begin{bmatrix} \hat x \\ -1 \end{bmatrix} = -\frac{1}{v_{n+1,n+1}}v_{n+1}, \]
provided that $v_{n+1,n+1} \neq 0$.
Now, suppose that only some of the data is contaminated, i.e., $E = \begin{bmatrix} 0 & E_1 \end{bmatrix}$, where the first $p$ columns of $E$ are zero. Then, in solving $(C + F)z = 0$, we use Householder transformations to compute $Q^T(C + F)$ where the first $p$ columns are zero below the diagonal. Since $\|F\|_F = \|Q^TF\|_F$, we then have a block upper triangular system
\[ \begin{bmatrix} R_{11} & R_{12} + F_{12} \\ 0 & R_{22} + F_{22} \end{bmatrix}z = 0, \qquad z = \begin{bmatrix} u \\ v \end{bmatrix}, \]
\[ (R_{22} + F_{22})v = 0, \]
Eigenvalue Problems
\[ \lambda y^H = y^HA. \]
The superscript $H$ refers to the Hermitian transpose, which includes transposition and complex conjugation. That is, for any matrix $A$, $A^H = \overline{A}^T$. An eigenvector of $A$, as defined above, is sometimes called a right eigenvector of $A$, to distinguish it from a left eigenvector. It can be seen that if $y$ is a left eigenvector of $A$ with eigenvalue $\lambda$, then $y$ is also a right eigenvector of $A^H$, with eigenvalue $\bar\lambda$.
Because x is nonzero, it follows that if x is an eigenvector of A, then the matrix A − λI is
singular, where λ is the corresponding eigenvalue. Therefore, λ satisfies the equation
det(A − λI) = 0.
The expression det(A−λI) is a polynomial of degree n in λ, and therefore is called the characteristic
polynomial of A (eigenvalues are sometimes called characteristic values). It follows from the fact
that the eigenvalues of A are the roots of the characteristic polynomial that A has n eigenvalues,
which can repeat, and can also be complex, even if A is real. However, if A is real, any complex
eigenvalues must occur in complex-conjugate pairs.
The set of eigenvalues of $A$ is called the spectrum of $A$, and denoted by $\lambda(A)$. This terminology explains why the largest of the magnitudes of the eigenvalues is called the spectral radius of $A$. The trace
of A, denoted by tr(A), is the sum of the diagonal elements of A. It is also equal to the sum of the
eigenvalues of A. Furthermore, det(A) is equal to the product of the eigenvalues of A.
Example A $2\times 2$ matrix
\[ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \]
has trace $\mathrm{tr}(A) = a + d$ and determinant $\det(A) = ad - bc$. Its characteristic polynomial is
\[ \det(A - \lambda I) = \begin{vmatrix} a - \lambda & b \\ c & d - \lambda \end{vmatrix} = (a - \lambda)(d - \lambda) - bc = \lambda^2 - (a + d)\lambda + (ad - bc) = \lambda^2 - \mathrm{tr}(A)\lambda + \det(A). \]
4.1.2 Decompositions
A subspace W of Rn is called an invariant subspace of A if, for any vector x ∈ W , Ax ∈ W .
Suppose that dim(W ) = k, and let X be an n × k matrix such that range(X) = W . Then, because
each column of X is a vector in W , each column of AX is also a vector in W , and therefore is a
linear combination of the columns of X. It follows that AX = XB, where B is a k × k matrix.
Now, suppose that y is an eigenvector of B, with eigenvalue λ. It follows from By = λy that
Let $x = \begin{bmatrix} x_1^T & x_2^T \end{bmatrix}^T$ be an eigenvector of $B$, where $x_1 \in \mathbb{C}^p$ and $x_2 \in \mathbb{C}^q$. Then, for some
Therefore, if we can use similarity transformations to reduce A to such a block structure, the
problem of computing the eigenvalues of A decouples into two smaller problems of computing the
eigenvalues of Bii for i = 1, 2. Using an inductive argument, it can be shown that if A is block
upper-triangular, then the eigenvalues of A are equal to the union of the eigenvalues of the diagonal
blocks. If each diagonal block is 1 × 1, then it follows that the eigenvalues of any upper-triangular
matrix are the diagonal elements. The same is true of any lower-triangular matrix; in fact, it can
be shown that because det(A) = det(AT ), the eigenvalues of AT are the same as the eigenvalues of
A.
Example The matrix
\[ A = \begin{bmatrix} 1 & -2 & 3 & -3 & 4 \\ 0 & 4 & -5 & 6 & -5 \\ 0 & 0 & 6 & -7 & 8 \\ 0 & 0 & 0 & 7 & 0 \\ 0 & 0 & 0 & -8 & 9 \end{bmatrix} \]
has eigenvalues 1, 4, 6, 7, and 9. This is because $A$ has a block upper-triangular structure
\[ A = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix}, \qquad A_{11} = \begin{bmatrix} 1 & -2 & 3 \\ 0 & 4 & -5 \\ 0 & 0 & 6 \end{bmatrix}, \qquad A_{22} = \begin{bmatrix} 7 & 0 \\ -8 & 9 \end{bmatrix}. \]
Because both of these blocks are themselves triangular, their eigenvalues are equal to their diagonal
elements, and the spectrum of A is the union of the spectra of these blocks. 2
Suppose that x is a normalized eigenvector of A, with eigenvalue λ. Furthermore, suppose that
P is a Householder reflection such that P x = e1 . Because P is symmetric and orthogonal, P is its
own inverse, so P e1 = x. It follows that the matrix P T AP , which is a similarity transformation of
A, satisfies
\[ P^TAPe_1 = P^TAx = \lambda P^Tx = \lambda Px = \lambda e_1. \]
That is, $e_1$ is an eigenvector of $P^TAP$ with eigenvalue $\lambda$, and therefore $P^TAP$ has the block structure
\[ P^TAP = \begin{bmatrix} \lambda & v^T \\ 0 & B \end{bmatrix}. \]
Therefore, λ(A) = {λ} ∪ λ(B), which means that we can now focus on the (n − 1) × (n − 1) matrix
B to find the rest of the eigenvalues of A. This process of reducing the eigenvalue problem for A
to that of B is called deflation.
\[ A = QTQ^H, \]
where T is an upper-triangular matrix whose diagonal elements are the eigenvalues of A, and Q is
a unitary matrix, meaning that QH Q = I. That is, a unitary matrix is the generalization of a real
orthogonal matrix to complex matrices. Every square matrix has a Schur decomposition.
The columns of Q are called Schur vectors. However, for a general matrix A, there is no relation
between Schur vectors of A and eigenvectors of A, as each Schur vector qj satisfies Aqj = AQej =
QT ej . That is, Aqj is a linear combination of q1 , . . . , qj . It follows that for j = 1, 2, . . . , n, the
first j Schur vectors q1 , q2 , . . . , qj span an invariant subspace of A.
The Schur vectors and eigenvectors of A are the same when A is a normal matrix, which means
that AH A = AAH . Any symmetric or skew-symmetric matrix, for example, is normal. It can be
shown that in this case, the normalized eigenvectors of A form an orthonormal basis for Rn . It
follows that if λ1 , λ2 , . . . , λn are the eigenvalues of A, with corresponding (orthonormal) eigenvectors
q1 , q2 , . . . , qn , then we have
\[ AQ = QD, \qquad Q = \begin{bmatrix} q_1 & \cdots & q_n \end{bmatrix}, \qquad D = \mathrm{diag}(\lambda_1, \ldots, \lambda_n). \]
\[ Q^HAQ = Q^HQD = D, \]
xk = c1 x1 + c2 x2 + · · · + ck−1 xk−1 .
because Axi = λi xi for i = 1, 2, . . . , k − 1. However, because both sides are equal to xk , and
Axk = λk xk , we also have
It follows that
However, because the eigenvalues λ1 , . . . , λk are distinct, and not all of the coefficients c1 , . . . , ck−1
are zero, this means that we have a nontrivial linear combination of linearly independent vectors be-
ing equal to the zero vector, which is a contradiction. We conclude that eigenvectors corresponding
to distinct eigenvalues are linearly independent.
It follows that if $A$ has $n$ distinct eigenvalues, then it has a set of $n$ linearly independent eigenvectors. If $X$ is a matrix whose columns are these eigenvectors, then $AX = XD$, where $D$ is a diagonal matrix of the eigenvalues, and because the columns of $X$ are linearly independent, $X$ is invertible, and therefore $X^{-1}AX = D$, and $A$ is diagonalizable.
Now, suppose that the eigenvalues of A are not distinct; that is, the characteristic polynomial
has repeated roots. Then an eigenvalue with multiplicity m does not necessarily correspond to m
linearly independent eigenvectors. The algebraic multiplicity of an eigenvalue λ is the number of
times that λ occurs as a root of the characteristic polynomial. The geometric multiplicity of λ is
the dimension of the eigenspace corresponding to λ, which is equal to the maximal size of a set of
linearly independent eigenvectors corresponding to λ. The geometric multiplicity of an eigenvalue
λ is always less than or equal to the algebraic multiplicity. When it is strictly less, then we say
that the eigenvalue is defective. When both multiplicities are equal to one, then we say that the
eigenvalue is simple.
The Jordan canonical form of an n × n matrix A is a decomposition that yields information
about the eigenspaces of A. It has the form
A = XJX −1
The number of Jordan blocks, p, is equal to the number of linearly independent eigenvectors of A.
The diagonal element of Ji , λi , is an eigenvalue of A. The number of Jordan blocks associated with
λi is equal to the geometric multiplicity of λi . The sum of the sizes of these blocks is equal to the
algebraic multiplicity of λi . If A is diagonalizable, then each Jordan block is 1 × 1.
Example Consider a matrix with Jordan canonical form
\[ J = \begin{bmatrix} 2 & 1 & 0 & & & \\ 0 & 2 & 1 & & & \\ 0 & 0 & 2 & & & \\ & & & 3 & 1 & \\ & & & 0 & 3 & \\ & & & & & 2 \end{bmatrix}. \]
The eigenvalues of this matrix are 2, with algebraic multiplicity 4, and 3, with algebraic multiplicity
2. The geometric multiplicity of the eigenvalue 2 is 2, because it is associated with two Jordan
blocks. The geometric multiplicity of the eigenvalue 3 is 1, because it is associated with only one
Jordan block. Therefore, there are a total of three linearly independent eigenvectors, and the matrix
is not diagonalizable. 2
The Jordan canonical form, while very informative about the eigensystem of A, is not practical
to compute using floating-point arithmetic. This is due to the fact that while the eigenvalues of a
matrix are continuous functions of its entries, the Jordan canonical form is not. If two computed
eigenvalues are nearly equal, and their computed corresponding eigenvectors are nearly parallel, we
do not know if they represent two distinct eigenvalues with linearly independent eigenvectors, or a
multiple eigenvalue that could be defective.
\[ x = -(D - \lambda I)^{-1}Fx, \]
which yields
\[ \|(D - \lambda I)^{-1}F\|_\infty = \max_{1\le i\le n}\sum_{j=1,\,j\neq i}^n \frac{|f_{ij}|}{|d_{ii} - \lambda|} \ge 1. \]
That is, $\lambda$ lies within one of the Gerschgorin circles in the complex plane, the circle that has center $a_{ii}$ and radius
\[ r_i = \sum_{j=1,\,j\neq i}^n |a_{ij}|. \]
D1 = {z ∈ C | |z − 7| ≤ 4},
D2 = {z ∈ C | |z − 2| ≤ 3},
D3 = {z ∈ C | |z + 5| ≤ 2}.
\[ \min_{\lambda\in\lambda(A)}|\lambda - \mu| \le \kappa_p(X)\|E\|_p. \]
We conclude that, because
\[ \cos\theta = \frac{y^Hx}{\|y\|_2\|x\|_2} = \frac{1}{\|y\|_2\|x\|_2}, \]
it follows that
\[ |\lambda'(0)| \le \frac{1}{|\cos\theta|}. \]
We define 1/| cos θ| to be the condition number of the simple eigenvalue λ. We require λ to be
simple because otherwise, the angle between the left and right eigenvectors is not unique, because
the eigenvectors themselves are not unique.
It should be noted that the condition number is also defined by 1/|yH x|, where x and y are
normalized so that kxk2 = kyk2 = 1, but either way, the condition number is equal to 1/| cos θ|. The
interpretation of the condition number is that an O() perturbation in A can cause an O(/| cos θ|)
perturbation in the eigenvalue λ. Therefore, if x and y are nearly orthogonal, a large change in the
eigenvalue can occur. Furthermore, if the condition number is large, then A is close to a matrix
with a multiple eigenvalue.
Example The matrix
\[ A = \begin{bmatrix} 3.1482 & -0.2017 & -0.5363 \\ 0.4196 & 0.5171 & 1.0888 \\ 0.3658 & -1.7169 & 3.3361 \end{bmatrix} \]
has a simple eigenvalue $\lambda = 1.9833$ with left and right eigenvectors
\[ x = \begin{bmatrix} 0.4150 \\ 0.6160 \\ 0.6696 \end{bmatrix}, \qquad y = \begin{bmatrix} -7.9435 \\ 83.0701 \\ -70.0066 \end{bmatrix}, \]
such that yH x = 1. It follows that the condition number of this eigenvalue is kxk2 kyk2 = 108.925.
In fact, the nearby matrix
\[ B = \begin{bmatrix} 3.1477 & -0.2023 & -0.5366 \\ 0.4187 & 0.5169 & 1.0883 \\ 0.3654 & -1.7176 & 3.3354 \end{bmatrix} \]
has a double eigenvalue that is equal to 2. 2
We now consider the sensitivity of repeated eigenvalues. First, it is important to note that while
the eigenvalues of a matrix A are continuous functions of the entries of A, they are not necessarily
differentiable functions of the entries. To see this, we consider the matrix
\[ A = \begin{bmatrix} 1 & a \\ \epsilon & 1 \end{bmatrix}, \]
where $a > 0$. Computing its characteristic polynomial
\[ \det(A - \lambda I) = \lambda^2 - 2\lambda + 1 - a\epsilon \]
and computing its roots yields the eigenvalues $\lambda = 1 \pm \sqrt{a\epsilon}$. Differentiating these eigenvalues with respect to $\epsilon$ yields
\[ \frac{d\lambda}{d\epsilon} = \pm\frac{1}{2}\sqrt{\frac{a}{\epsilon}}, \]
which is undefined at $\epsilon = 0$. In general, an $O(\epsilon)$ perturbation in $A$ causes an $O(\epsilon^{1/p})$ perturbation in an eigenvalue associated with a $p\times p$ Jordan block, meaning that the "more defective" an eigenvalue is, the more sensitive it is.
We now consider the sensitivity of eigenvectors, or, more generally, invariant subspaces of a
matrix A, such as a subspace spanned by the first k Schur vectors, which are the first k columns in
a matrix Q such that QH AQ is upper triangular. Suppose that an n × n matrix A has the Schur
decomposition
\[ A = QTQ^H, \qquad Q = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}, \qquad T = \begin{bmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{bmatrix}, \]
where $Q_1$ is $n\times r$ and $T_{11}$ is $r\times r$. We define the separation between the matrices $T_{11}$ and $T_{22}$ by
\[ \operatorname{sep}(T_{11}, T_{22}) = \min_{X\neq 0}\frac{\|T_{11}X - XT_{22}\|_F}{\|X\|_F}. \]
It can be shown that an $O(\epsilon)$ perturbation in $A$ causes an $O(\epsilon/\operatorname{sep}(T_{11}, T_{22}))$ perturbation in the invariant subspace $Q_1$.
We now consider the case where r = 1, meaning that Q1 is actually a vector q1 , that is also an
eigenvector, and T11 is the corresponding eigenvalue, λ. Then, we have
\[
\begin{aligned}
\operatorname{sep}(\lambda, T_{22}) &= \min_{X\neq 0}\frac{\|\lambda X - XT_{22}\|_F}{\|X\|_F} \\
&= \min_{\|y\|_2=1}\|y^H(T_{22} - \lambda I)\|_2 \\
&= \sigma_{\min}((T_{22} - \lambda I)^H) \\
&= \sigma_{\min}(T_{22} - \lambda I),
\end{aligned}
\]
since the Frobenius norm of a vector is equivalent to the vector 2-norm. Because the smallest
singular value indicates the distance to a singular matrix, sep(λ, T22 ) provides a measure of the
separation of λ from the other eigenvalues of A. It follows that eigenvectors are more sensitive to
perturbation if the corresponding eigenvalues are clustered near one another. That is, eigenvectors
associated with nearby eigenvalues are “wobbly”.
It should be emphasized that there is no direct relationship between the sensitivity of an eigen-
value and the sensitivity of its corresponding invariant subspace. The sensitivity of a simple eigen-
value depends on the angle between its left and right eigenvectors, while the sensitivity of an
invariant subspace depends on the clustering of the eigenvalues. Therefore, a sensitive eigenvalue,
that is nearly defective, can be associated with an insensitive invariant subspace, if it is distant
from other eigenvalues, while an insensitive eigenvalue can have a sensitive invariant subspace if it
is very close to other eigenvalues.
We also assume that A is diagonalizable, meaning that it has n linearly independent eigenvectors
x1 , . . . , xn such that Axi = λi xi for i = 1, . . . , n.
Suppose that we continually multiply a given vector $x^{(0)}$ by $A$, generating a sequence of vectors $x^{(1)}, x^{(2)}, \ldots$ defined by $x^{(k)} = Ax^{(k-1)} = A^kx^{(0)}$. Writing $x^{(0)} = c_1x_1 + \cdots + c_nx_n$, we have
\[ x^{(k)} = A^kx^{(0)} = \sum_{i=1}^n c_iA^kx_i = \sum_{i=1}^n c_i\lambda_i^kx_i = \lambda_1^k\left[c_1x_1 + \sum_{i=2}^n c_i\left(\frac{\lambda_i}{\lambda_1}\right)^kx_i\right]. \]
Because |λ1 | > |λi | for i = 2, . . . , n, it follows that the coefficients of xi , for i = 2, . . . , n, converge
to zero as k → ∞. Therefore, the direction of x(k) converges to that of x1 . This leads to the most
basic method of computing an eigenvalue and eigenvector, the Power Method:
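A Matlab sketch of the iteration (the matrix and starting vector are arbitrary; in practice a convergence test would replace the fixed iteration count):

    A = [4 1 0; 1 3 1; 0 1 2];             % arbitrary test matrix
    q = randn(size(A, 1), 1);  q = q / norm(q);
    for k = 1:200
        z = A * q;
        q = z / norm(z);                   % normalize each iterate
    end
    lambda1 = q' * A * q;                  % Rayleigh quotient estimate of lambda_1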
This algorithm continues until qk converges to within some tolerance. If it converges, it con-
verges to a unit vector that is a scalar multiple of x1 , an eigenvector corresponding to the largest
eigenvalue, λ1 . The rate of convergence is |λ1 /λ2 |, meaning that the distance between qk and a
vector parallel to x1 decreases by roughly this factor from iteration to iteration.
It follows that convergence can be slow if λ2 is almost as large as λ1 , and in fact, the power
method fails to converge if |λ2 | = |λ1 |, but λ2 6= λ1 (for example, if they have opposite signs). It
is worth noting the implementation detail that if $\lambda_1$ is negative, for example, it may appear that $q_k$ is not converging, as it "flip-flops" between two vectors. This is remedied by normalizing $q_k$ so that it is not only a unit vector, but also has a fixed sign convention, for example a positive component of largest magnitude.
Once the normalized eigenvector x1 is found, the corresponding eigenvalue λ1 can be computed
using a Rayleigh quotient. Then, deflation can be carried out by constructing a Householder
reflection P1 so that P1 x1 = e1 , as discussed previously, and then P1 AP1 is a matrix with block
upper-triangular structure. This decouples the problem of computing the eigenvalues of A into the
(solved) problem of computing λ1 , and then computing the remaining eigenvalues by focusing on
the lower right (n − 1) × (n − 1) submatrix.
Generally, this method computes a convergent sequence {Qk }, as long as Q0 is not deficient in
the directions of certain eigenvectors of AH , and |λr | > |λr+1 |. From the relationship
\[ R_k = Q_k^HZ_k = Q_k^HAQ_{k-1}, \]
\[ \begin{bmatrix} Q & Q^\perp \end{bmatrix}^HA\begin{bmatrix} Q & Q^\perp \end{bmatrix} = \begin{bmatrix} Q^HAQ & Q^HAQ^\perp \\ (Q^\perp)^HAQ & (Q^\perp)^HAQ^\perp \end{bmatrix} = \begin{bmatrix} R & Q^HAQ^\perp \\ 0 & (Q^\perp)^HAQ^\perp \end{bmatrix}. \]
That is, λ(A) = λ(R) ∪ λ((Q⊥ )H AQ⊥ ), and because R is upper-triangular, the eigenvalues of R are
its diagonal elements. We conclude that Orthogonal Iteration, when it converges, yields the largest
r eigenvalues of A.
and
\[
\begin{aligned}
T_k &= Q_k^HAQ_k \\
&= Q_k^HAQ_{k-1}Q_{k-1}^HQ_k \\
&= Q_k^HZ_kQ_{k-1}^HQ_k \\
&= Q_k^HQ_kR_k(Q_{k-1}^HQ_k) \\
&= R_k(Q_{k-1}^HQ_k).
\end{aligned}
\]
That is, Tk is obtained from Tk−1 by computing the QR factorization of Tk−1 , and then multi-
plying the factors in reverse order. Equivalently, Tk is obtained by applying a unitary similarity
transformation to Tk−1 , as
\[ T_k = R_k(Q_{k-1}^HQ_k) = (Q_{k-1}^HQ_k)^HT_{k-1}(Q_{k-1}^HQ_k) = U_k^HT_{k-1}U_k. \]
Choose $Q_0$ so that $Q_0^HQ_0 = I_n$, and set $T_0 = Q_0^HAQ_0$
for k = 1, 2, . . . do
    T_{k−1} = U_k R_k (QR factorization)
    T_k = R_k U_k
end
It is this version of Orthogonal Iteration that serves as the cornerstone of an efficient algorithm
for computing all of the eigenvalues of a matrix. As described, QR iteration is prohibitively expen-
sive, because O(n3 ) operations are required in each iteration to compute the QR factorization of
Tk−1 , and typically, many iterations are needed to obtain convergence. However, we will see that
with a judicious choice of Q0 , the amount of computational effort can be drastically reduced.
It should be noted that if A is a real matrix with complex eigenvalues, then Orthogonal Iteration
or the QR Iteration will not converge, due to distinct eigenvalues having equal magnitude. However,
the structure of the matrix Tk in QR Iteration generally will converge to “quasi-upper-triangular”
form, with 1 × 1 or 2 × 2 diagonal blocks corresponding to real eigenvalues or complex-conjugate
pairs of eigenvalues, respectively. It is this type of convergence that we will seek in our continued
development of the QR Iteration.
where each diagonal block Tii is 1 × 1, corresponding to a real eigenvalue, or a 2 × 2 block, corre-
sponding to a pair of complex eigenvalues that are conjugates of one another.
If QR iteration is applied to such a matrix, then the sequence {Tk } will not converge, but a
block upper-triangular structure will be obtained, which can then be used to compute all of the
eigenvalues. Therefore, the iteration can be terminated when appropriate entries below the diagonal
have been made sufficiently small.
However, one significant drawback to the QR iteration is that each iteration is too expensive, as
it requires O(n3 ) operations to compute the QR factorization, and to multiply the factors in reverse
order. Therefore, it is desirable to first use a similarity transformation H = U T AU to reduce A to
a form for which the QR factorization and matrix multiplication can be performed more efficiently.
Suppose that U T includes a Householder reflection, or a product of Givens rotations, that trans-
forms the first column of A to a multiple of e1 , as in algorithms to compute the QR factorization.
Then U operates on all rows of A, so when U is applied to the columns of A, to complete the
similarity transformation, it affects all columns. Therefore, the work of zeroing the elements of the
first column of A is undone.
Now, suppose that instead, U T is designed to zero all elements of the first column except the
first two. Then, U T affects all rows except the first, meaning that when U T A is multiplied by U
on the right, the first column is unaffected. Continuing this reasoning with subsequent columns
of A, we see that a sequence of orthogonal transformations can be used to reduce A to an upper
Hessenberg matrix H, in which hij = 0 whenever i > j +1. That is, all entries below the subdiagonal
are equal to zero.
It is particularly efficient to compute the QR factorization of an upper Hessenberg, or simply
Hessenberg, matrix, because it is only necessary to zero one element in each column. Therefore,
it can be accomplished with a sequence of n − 1 Givens row rotations, which requires only O(n2 )
operations. Then, these same Givens rotations can be applied, in the same order, to the columns in
order to complete the similarity transformation, or, equivalently, accomplish the task of multiplying
the factors of the QR factorization.
Specifically, given a Hessenberg matrix H, we apply Givens row rotations GT1 , GT2 , . . . , GTn−1 to
H, where GTi rotates rows i and i + 1, to obtain
\[ \tilde H = Q^THQ = RQ = RG_1G_2\cdots G_{n-1}: \]
for j = 1, 2, . . . , n − 1 do
    [c, s] = givens(h_{jj}, h_{j+1,j})
    G_j = [ c  −s ; s  c ]
    H(j : j + 1, j : n) = G_j^T H(j : j + 1, j : n)
end
Q = I
for j = 1, 2, . . . , n − 1 do
    H(1 : j + 1, j : j + 1) = H(1 : j + 1, j : j + 1) G_j
    Q(1 : j + 1, j : j + 1) = Q(1 : j + 1, j : j + 1) G_j
end
Note that when performing row rotations, it is only necessary to update certain columns, and
when performing column rotations, it is only necessary to update certain rows, because of the
structure of the matrix at the time the rotation is performed; for example, after the first loop, H
is upper-triangular.
Before a Hessenberg QR step can be performed, it is necessary to actually reduce the original
matrix A to Hessenberg form H = U T AU . This can be accomplished by performing a sequence of
Householder reflections U = P1 P2 · · · Pn−2 on the columns of A, as in the following algorithm.
U = I
for j = 1, 2, . . . , n − 2 do
    v = house(A(j + 1 : n, j)), c = 2/(v^T v)
    A(j + 1 : n, j : n) = A(j + 1 : n, j : n) − c v (v^T A(j + 1 : n, j : n))
    A(1 : n, j + 1 : n) = A(1 : n, j + 1 : n) − c (A(1 : n, j + 1 : n) v) v^T
    U(1 : n, j + 1 : n) = U(1 : n, j + 1 : n) − c (U(1 : n, j + 1 : n) v) v^T
end
The function house(x) computes a vector $v$ such that $Px = (I - cvv^T)x = \alpha e_1$, where $c = 2/v^Tv$
and α = ±kxk2 . The algorithm for the Hessenberg reduction requires O(n3 ) operations, but it is
performed only once, before the QR Iteration begins, so it still leads to a substantial reduction in
the total number of operations that must be performed to compute the Schur Decomposition.
If a subdiagonal entry hj+1,j of a Hessenberg matrix H is equal to zero, then the problem of
computing the eigenvalues of H decouples into two smaller problems of computing the eigenvalues
of H11 and H22 , where
\[ H = \begin{bmatrix} H_{11} & H_{12} \\ 0 & H_{22} \end{bmatrix} \]
and H11 is j ×j. Therefore, an efficient implementation of the QR Iteration on a Hessenberg matrix
H focuses on a submatrix of H that is unreduced, meaning that all of its subdiagonal entries are
nonzero. It is also important to monitor the subdiagonal entries after each iteration, to determine if
any of them have become nearly zero, thus allowing further decoupling. Once no further decoupling
is possible, H has been reduced to quasi-upper-triangular form and the QR Iteration can terminate.
It is essential to choose a maximal unreduced diagonal block of $H$ for applying a Hessenberg QR step. That is, the step must be applied to a submatrix $H_{22}$ such that $H$ has the structure
\[ H = \begin{bmatrix} H_{11} & H_{12} & H_{13} \\ 0 & H_{22} & H_{23} \\ 0 & 0 & H_{33} \end{bmatrix} \]
where H22 is unreduced. This condition ensures that the eigenvalues of H22 are also eigenvalues
of H, as λ(H) = λ(H11 ) ∪ λ(H22 ) ∪ λ(H33 ) when H is structured as above. Note that the size of
either H11 or H33 may be 0 × 0.
The following property of unreduced Hessenberg matrices is useful for improving the efficiency
of a Hessenberg QR step.
Theorem (Implicit Q Theorem) Let $A$ be an $n\times n$ matrix, and let $Q$ and $V$ be $n\times n$ orthogonal matrices such that $Q^TAQ = H$ and $V^TAV = G$ are both upper Hessenberg, and $H$ is unreduced. If $Q = \begin{bmatrix} q_1 & \cdots & q_n \end{bmatrix}$ and $V = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix}$, and if $q_1 = v_1$, then $q_i = \pm v_i$ for $i = 2, \ldots, n$, and $|h_{ij}| = |g_{ij}|$ for $i, j = 1, 2, \ldots, n$.
That is, if two orthogonal similarity transformations that reduce A to Hessenberg form have the
same first column, then they are “essentially equal”, as are the Hessenberg matrices.
The proof of the Implicit Q Theorem proceeds as follows: From the relations QT AQ = H
and V T AV = G, we obtain GW = W H, where W = V T Q is orthogonal. Because q1 = v1 , we
have W e1 = e1 . Equating first columns of GW = W H, and keeping in mind that G and H are
both upper Hessenberg, we find that only the first two elements of W e2 are nonzero. Proceeding
by induction, it follows that W is upper triangular, and therefore W −1 is also upper triangular.
However, because W is orthogonal, W −1 = W T , which means that W −1 is lower triangular as well.
Therefore, W is a diagonal matrix, so by the orthogonality of W , W must have diagonal entries
that are equal to ±1, and the theorem follows.
Another important property of an unreduced Hessenberg matrix is that all of its eigenvalues
have a geometric multiplicity of one. To see this, consider the matrix H − λI, where H is an n × n
unreduced Hessenberg matrix and λ is an arbitrary scalar. If λ is not an eigenvalue of H, then H
is nonsingular and rank(H) = n. Otherwise, because H is unreduced, from the structure of H it
can be seen that the first n − 1 columns of H − λI must be linearly independent. We conclude
that rank(H − λI) = n − 1, and therefore at most one vector x (up to a scalar multiple) satisfies
the equation Hx = λx. That is, there can only be one linearly independent eigenvector. It follows
that if any eigenvalue of H repeats, then it is defective.
where λp is the pth largest eigenvalue of A in magnitude. It follows that convergence can be
particularly slow if eigenvalues are very close to one another in magnitude. Suppose that we shift
H by a scalar µ, meaning that we compute the QR factorization of H − µI instead of H, and then
update H to obtain a new Hessenberg H̃ by multiplying the QR factors in reverse order as before,
but then adding µI. Then, we have
\[
\begin{aligned}
\tilde H &= RQ + \mu I \\
&= Q^T(H - \mu I)Q + \mu I \\
&= Q^THQ - \mu Q^TQ + \mu I \\
&= Q^THQ - \mu I + \mu I \\
&= Q^THQ.
\end{aligned}
\]
So, we are still performing an orthogonal similarity transformation of $H$, but with a different $Q$. The convergence rate then becomes $|\lambda_{p+1} - \mu|/|\lambda_p - \mu|$, so if $\mu$ is close to an eigenvalue, convergence of a particular subdiagonal entry will be much more rapid.
then the value of h21 after each of the first three QR steps is 0.1575, −0.0037, and 2.0876 × 10−5 .
2
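A Matlab sketch of the shifted iteration $\tilde H = RQ + \mu I$ with the simple shift $\mu = h_{nn}$; the test matrix is arbitrary and is symmetrized so that its eigenvalues are real.

    A = randn(6);  H = hess(A + A');       % arbitrary symmetric test matrix in Hessenberg form
    n = size(H, 1);
    for k = 1:50
        mu = H(n, n);                      % single shift
        [Q, R] = qr(H - mu * eye(n));
        H = R * Q + mu * eye(n);           % orthogonally similar to the previous H
    end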
This shifting strategy is called the single shift strategy. Unfortunately, it is not very effective
if H has complex eigenvalues. An alternative is the double shift strategy, which is used if the two
eigenvalues, µ1 and µ2 , of the lower-right 2 × 2 block of H are complex. Then, these two eigenvalues
are used as shifts in consecutive iterations to achieve quadratic convergence in the complex case as
well. That is, we compute
H − µ1 I = U1 R1
H1 = R1 U1 + µ1 I
H1 − µ2 I = U2 R2
H2 = R2 U2 + µ2 I.
To avoid complex arithmetic when using complex shifts, the double implicit shift strategy is
used. We first note that
\[
\begin{aligned}
U_1U_2R_2R_1 &= U_1(H_1 - \mu_2I)R_1 \\
&= U_1H_1R_1 - \mu_2U_1R_1 \\
&= U_1(R_1U_1 + \mu_1I)R_1 - \mu_2(H - \mu_1I) \\
&= U_1R_1U_1R_1 + \mu_1U_1R_1 - \mu_2(H - \mu_1I) \\
&= (H - \mu_1I)^2 + \mu_1(H - \mu_1I) - \mu_2(H - \mu_1I) \\
&= H^2 - 2\mu_1H + \mu_1^2I + \mu_1H - \mu_1^2I - \mu_2H + \mu_1\mu_2I \\
&= H^2 - (\mu_1 + \mu_2)H + \mu_1\mu_2I.
\end{aligned}
\]
\[
\begin{aligned}
H_2 &= R_2U_2 + \mu_2I \\
&= U_2^TU_2R_2U_2 + \mu_2U_2^TU_2 \\
&= U_2^T(U_2R_2 + \mu_2I)U_2 \\
&= U_2^TH_1U_2 \\
&= U_2^T(R_1U_1 + \mu_1I)U_2 \\
&= U_2^T(U_1^TU_1R_1U_1 + \mu_1U_1^TU_1)U_2 \\
&= U_2^TU_1^T(U_1R_1 + \mu_1I)U_1U_2 \\
&= U_2^TU_1^THU_1U_2.
\end{aligned}
\]
That is, U1 U2 is the orthogonal matrix that implements the similarity transformation of H to obtain
H2 . Therefore, we could use exclusively real arithmetic by forming M = H 2 − (µ1 + µ2 )H + µ1 µ2 I,
compute its QR factorization to obtain M = ZR, and then compute H2 = Z T HZ, since Z = U1 U2 ,
in view of the uniqueness of the QR decomposition. However, M is computed by squaring H, which
requires O(n3 ) operations. Therefore, this is not a practical approach.
We can work around this difficulty using the Implicit Q Theorem. Instead of forming M in its
entirety, we only form its first column, which, being a second-degree polynomial of a Hessenberg
matrix, has only three nonzero entries. We compute a Householder transformation P0 that makes
this first column a multiple of e1 . Then, we compute P0 HP0 , which is no longer Hessenberg, because
it operates on the first three rows and columns of H. Finally, we apply a series of Householder
reflections P1 , P2 , . . . , Pn−2 that restore Hessenberg form. Because these reflections are not applied
to the first row or column, it follows that if we define Z̃ = P0 P1 P2 · · · Pn−2 , then Z and Z̃ have
the same first column. Since both matrices implement similarity transformations that preserve the
Hessenberg form of H, it follows from the Implicit Q Theorem that Z and Z̃ are essentially equal,
and that they essentially produce the same updated matrix H2 . This variation of a Hessenberg QR
step is called a Francis QR step.
A Francis QR step requires 10n2 operations, with an additional 10n2 operations if orthogonal
transformations are being accumulated to obtain the entire real Schur decomposition. Generally,
the entire QR algorithm, including the initial reduction to Hessenberg form, requires about 10n3
operations, with an additional 15n3 operations to compute the orthogonal matrix Q such that
A = QT QT is the real Schur decomposition of A.
In fact, by computing the gradient of r(y), it can be shown that every eigenvector of A is a critical
point of r(y), with the corresponding eigenvalue being the value of r(y) at that critical point.
are
λ(A) = {14.6515, 4.0638, −10.7153}.
The Gerschgorin intervals are
D1 = {x ∈ R | |x − 14| ≤ 4},
D2 = {x ∈ R | |x − 4| ≤ 5},
D3 = {x ∈ R | |x + 10| ≤ 5}.
Furthermore,
|λk (A + E) − λk (A)| ≤ kEk2 .
The second inequality in the above theorem follows directly from the first, as the 2-norm of the
symmetric matrix E, being equal to its spectral radius, must be equal to the larger of the absolute
value of λ1 (E) or λn (E).
For a symmetric matrix, or even a more general normal matrix, the left eigenvectors and right
eigenvectors are the same, from which it follows that every simple eigenvalue is “perfectly condi-
tioned”; that is, the condition number 1/| cos θ| is equal to 1 because θ = 0 in this case. However,
the same results concerning the sensitivity of invariant subspaces from the nonsymmetric case apply
in the symmetric case as well: such sensitivity increases as the eigenvalues become more clustered,
even though there is no chance of a defective eigenvalue. This is because for a nondefective, re-
peated eigenvalue, there are infinitely many possible bases of the corresponding invariant subspace.
Therefore, as the eigenvalues approach one another, the eigenvectors become more sensitive to small
perturbations, for any matrix.
Let Q1 be an n × r matrix with orthonormal columns, meaning that QT1 Q1 = Ir . If it spans an
invariant subspace of an n × n symmetric matrix A, then AQ1 = Q1 S, where S = QT1 AQ1 . On the
other hand, if range(Q1 ) is not an invariant subspace, but the matrix
AQ1 − Q1 S = E1
is small for any given r × r symmetric matrix S, then the columns of Q1 define an approximate
invariant subspace.
It turns out that $\|E_1\|_F$ is minimized by choosing $S = Q_1^TAQ_1$. Furthermore, we have
\[ \|AQ_1 - Q_1S\|_F = \|P_1^\perp AQ_1\|_F, \]
where $P_1^\perp = I - Q_1Q_1^T$ is the orthogonal projection onto $(\mathrm{range}(Q_1))^\perp$, and there exist eigenvalues $\mu_1, \ldots, \mu_r \in \lambda(A)$ such that
\[ |\mu_k - \lambda_k(S)| \le \sqrt{2}\,\|E_1\|_2, \qquad k = 1, \ldots, r. \]
That is, r eigenvalues of A are close to the eigenvalues of S, which are known as Ritz values,
while the corresponding eigenvectors are called Ritz vectors. If (θk , yk ) is an eigenvalue-eigenvector
pair, or an eigenpair of S, then, because S is defined by S = QT1 AQ1 , it is also known as a Ritz
pair. Furthermore, as θk is an approximate eigenvalue of A, Q1 yk is an approximate corresponding
eigenvector.
To see this, let σk (not to be confused with singular values) be an eigenvalue of S, with eigen-
vector yk . We multiply both sides of the equation Syk = σk yk by Q1 :
Q1 Syk = σk Q1 yk .
Then, we use the relation AQ1 − Q1 S = E1 to obtain
(AQ1 − E1 )yk = σk Q1 yk .
Rearranging yields
A(Q1 yk ) = σk (Q1 yk ) + E1 yk .
If we let xk = Q1 yk , then we conclude
Axk = σk xk + E1 yk .
Therefore, if $\|E_1\|$ is small in some norm, $Q_1y_k$ is nearly an eigenvector.
Let $A$ have eigenvalues $\lambda_1, \ldots, \lambda_n$. Then, the eigenvalues of the matrix $(A - \mu I)^{-1}$ are $1/(\lambda_i - \mu)$, for $i = 1, 2, \ldots, n$. Therefore, this method finds the eigenvalue that is closest to $\mu$.
Now, suppose that we vary $\mu$ from iteration to iteration, by setting it equal to the Rayleigh quotient
\[ r(x) = \frac{x^HAx}{x^Hx}, \]
of which the eigenvalues of $A$ are constrained extrema. We then obtain Rayleigh Quotient Iteration:
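A Matlab sketch of the iteration for a symmetric matrix (arbitrary test data; a stopping test would be used in practice, since $A - \mu I$ becomes nearly singular as the iteration converges):

    A = [4 1 0; 1 3 1; 0 1 2];             % arbitrary symmetric test matrix
    x = randn(3, 1);  x = x / norm(x);
    for k = 1:10
        mu = x' * A * x;                   % Rayleigh quotient shift
        y = (A - mu * eye(3)) \ x;         % inverse iteration step
        x = y / norm(y);
    end
    lambda = x' * A * x;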
From
\[ A - \mu_kI = \begin{bmatrix} \lambda_1 - (\lambda_1c_k^2 + \lambda_2s_k^2) & 0 \\ 0 & \lambda_2 - (\lambda_1c_k^2 + \lambda_2s_k^2) \end{bmatrix} = (\lambda_1 - \lambda_2)\begin{bmatrix} s_k^2 & 0 \\ 0 & -c_k^2 \end{bmatrix}, \]
we obtain
\[ z_k = \frac{1}{\lambda_1 - \lambda_2}\begin{bmatrix} c_k/s_k^2 \\ -s_k/c_k^2 \end{bmatrix} = \frac{1}{c_k^2s_k^2(\lambda_1 - \lambda_2)}\begin{bmatrix} c_k^3 \\ -s_k^3 \end{bmatrix}. \]
Normalizing yields
\[ x_{k+1} = \frac{1}{\sqrt{c_k^6 + s_k^6}}\begin{bmatrix} c_k^3 \\ -s_k^3 \end{bmatrix}, \]
While the shift µ = tnn can always be used, it is actually more effective to use the Wilkinson
shift, which is given by
\[ \mu = t_{nn} + d - \operatorname{sign}(d)\sqrt{d^2 + t_{n,n-1}^2}, \qquad d = \frac{t_{n-1,n-1} - t_{nn}}{2}. \]
This expression yields the eigenvalue of the lower 2 × 2 block of T that is closer to tnn . It can be
shown that this choice of shift leads to cubic convergence of tn,n−1 to zero.
The symmetric QR algorithm is much faster than the unsymmetric QR algorithm. A single
QR step requires about 30n operations, because it operates on a tridiagonal matrix rather than a
Hessenberg matrix, with an additional 6n2 operations for accumulating orthogonal transformations.
The overall symmetric QR algorithm requires 4n3 /3 operations to compute only the eigenvalues,
and approximately $8n^3$ additional operations to accumulate transformations. Because a symmetric matrix is unitarily diagonalizable, the columns of the orthogonal matrix $Q$ such that $Q^TAQ$ is diagonal are the eigenvectors of $A$.
\[ A = U\Sigma V^T, \qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p, \quad p = \min\{m, n\}, \]
known as the singular values of A, is an extremely useful decomposition that yields much informa-
tion about A, including its range, null space, rank, and 2-norm condition number. We now discuss
a practical algorithm for computing the SVD of A, due to Golub and Kahan.
Let U and V have column partitions
U= u1 · · · um , V = v1 · · · vn .
it follows that
AT Avj = σj2 vj .
That is, the squares of the singular values are the eigenvalues of AT A, which is a symmetric matrix.
It follows that one approach to computing the SVD of A is to apply the symmetric QR algorithm
to AT A to obtain a decomposition AT A = V ΣT ΣV T . Then, the relations Avj = σj uj , j = 1, . . . , p,
can be used in conjunction with the QR factorization with column pivoting to obtain U . However,
this approach is not the most practical, because of the expense and loss of information incurred
from computing AT A.
Instead, we can implicitly apply the symmetric QR algorithm to AT A. As the first step of the
symmetric QR algorithm is to use Householder reflections to reduce the matrix to tridiagonal form,
we can use Householder reflections to instead reduce A to upper bidiagonal form
\[ U_1^TAV_1 = B = \begin{bmatrix} d_1 & f_1 & & & \\ & d_2 & f_2 & & \\ & & \ddots & \ddots & \\ & & & d_{n-1} & f_{n-1} \\ & & & & d_n \end{bmatrix}. \]
1. Determine the first Givens row rotation $G_1^T$ that would be applied to $T - \mu I$, where $\mu$ is the Wilkinson shift from the symmetric QR algorithm. This requires only computing the first column of $T$, which has only two nonzero entries $t_{11} = d_1^2$ and $t_{21} = d_1f_1$.
2. Apply $G_1$ to columns 1 and 2 of $B$, which yields $B_1 = BG_1$. Then, $B_1$ has an unwanted nonzero in the $(2,1)$ entry.
3. Apply a Givens row rotation $H_1$ to rows 1 and 2 to zero the $(2,1)$ entry of $B_1$, which yields $B_2 = H_1^TBG_1$. Then, $B_2$ has an unwanted nonzero in the $(1,3)$ entry.
4. Apply a Givens column rotation $G_2$ to columns 2 and 3 of $B_2$, which yields $B_3 = H_1^TBG_1G_2$. This introduces an unwanted nonzero in the $(3,2)$ entry.
5. Continue applying alternating row and column rotations to “chase” the unwanted nonzero
entry down the diagonal of B, until finally B is restored to upper bidiagonal form.
By the Implicit Q Theorem, since G1 is the Givens rotation that would be applied to the first
column of T , the column rotations that help restore upper bidiagonal form are essentially equal to
those that would be applied to T if the symmetric QR algorithm was being applied to T directly.
Therefore, the symmetric QR algorithm is being correctly applied, implicitly, to B.
To detect decoupling, we note that if any superdiagonal entry fi is small enough to be “declared”
equal to zero, then decoupling has been achieved, because the ith subdiagonal entry of T is equal
to di fi , and therefore such a subdiagonal entry must be zero as well. If a diagonal entry di becomes
zero, then decoupling is also achieved, because row or column rotations can be used to zero an
entire row or column of B. In summary, if any diagonal or superdiagonal entry of B becomes zero, then the tridiagonal matrix T = B^T B is no longer unreduced.
Eventually, sufficient decoupling is achieved so that B is reduced to a diagonal matrix Σ. All
Householder reflections that have pre-multiplied A, and all row rotations that have been applied
to B, can be accumulated to obtain U , and all Householder reflections that have post-multiplied
A, and all column rotations that have been applied to B, can be accumulated to obtain V .
\operatorname{off}(B)^2 = \operatorname{off}(A)^2 - 2a_{pq}^2 < \operatorname{off}(A)^2.
We see that the "size" of the off-diagonal part of the matrix is guaranteed to decrease from such a similarity transformation.
If we define
\tau = \frac{a_{qq} - a_{pp}}{2a_{pq}}, \qquad t = \frac{s}{c},
then t satisfies the quadratic equation
t^2 + 2\tau t - 1 = 0.
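For illustration, a Matlab sketch of how the rotation pair (c, s) might be computed from this quadratic, choosing the root of smaller magnitude so that |θ| ≤ π/4, is given below; the matrix name A and the indices p, q are placeholders.

% Compute the Jacobi rotation (c,s) that zeroes the (p,q) entry of a
% symmetric matrix A (illustrative sketch; smaller root chosen so |theta| <= pi/4).
if A(p,q) ~= 0
    tau = (A(q,q) - A(p,p))/(2*A(p,q));
    if tau >= 0
        t = 1/(tau + sqrt(1 + tau^2));    % smaller root of t^2 + 2*tau*t - 1 = 0
    else
        t = -1/(-tau + sqrt(1 + tau^2));
    end
    c = 1/sqrt(1 + t^2);  s = t*c;
else
    c = 1;  s = 0;                        % entry already zero: no rotation needed
end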
\operatorname{off}(A)^2 \le 2N a_{pq}^2,
which implies linear convergence. However, it has been shown that for sufficiently large k, there exists a constant c such that
\operatorname{off}(A^{(k+N)}) \le c\, \operatorname{off}(A^{(k)})^2,
where A^{(k)} is the matrix after k Jacobi updates, meaning that the classical Jacobi algorithm converges quadratically as a function of sweeps. Heuristically, it has been argued that approximately \log n sweeps are needed in practice.
It is worth noting that the guideline that θ be chosen so that |θ| ≤ π/4 is actually essential
to ensure quadratic convergence, because otherwise it is possible that Jacobi updates may simply
interchange nearly converged diagonal entries.
another algorithm, such as the symmetric QR algorithm, on a smaller scale. Then, if p ≥ n/(2r),
an entire block Jacobi sweep can be parallelized.
• One-sided Jacobi: This approach, like the Golub-Kahan SVD algorithm, implicitly applies the Jacobi method for the symmetric eigenvalue problem to A^T A. The idea is, within each update, to use a column Jacobi rotation to rotate columns p and q of A so that they are orthogonal, which has the effect of zeroing the (p, q) entry of A^T A. Once all columns of AV are orthogonal, where V is the accumulation of all column rotations, the relation AV = U \Sigma is used to obtain U and \Sigma by simple column scaling. To find a suitable rotation, we note that if a_p and a_q, the pth and qth columns of A, are rotated through an angle \theta, then the rotated columns satisfy
where c = \cos \theta and s = \sin \theta. Dividing by c^2 and defining t = s/c, we obtain a quadratic equation for t that can be solved to obtain c and s.
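A minimal Matlab sketch of one such column rotation is given below; it assumes A is the working matrix, V has been initialized to the identity to accumulate the rotations, and p, q are the column indices. The rotation is the Jacobi rotation for the 2-by-2 Gram matrix of columns p and q.

% One-sided Jacobi rotation of columns p and q of A (illustrative sketch).
app = A(:,p)'*A(:,p);  aqq = A(:,q)'*A(:,q);  apq = A(:,p)'*A(:,q);
if apq ~= 0
    tau = (aqq - app)/(2*apq);
    if tau >= 0
        t = 1/(tau + sqrt(1 + tau^2));    % smaller root of t^2 + 2*tau*t - 1 = 0
    else
        t = -1/(-tau + sqrt(1 + tau^2));
    end
    c = 1/sqrt(1 + t^2);  s = t*c;
    A(:,[p q]) = A(:,[p q])*[c s; -s c];  % rotated columns are now orthogonal
    V(:,[p q]) = V(:,[p q])*[c s; -s c];  % accumulate the rotation in V
end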
Part III
Chapter 5
Polynomial Interpolation
Calculus provides many tools that can be used to understand the behavior of functions, but in most
cases it is necessary for these functions to be continuous or differentiable. This presents a problem
in most “real” applications, in which functions are used to model relationships between quantities,
but our only knowledge of these functions consists of a set of discrete data points, where the data
is obtained from measurements. Therefore, we need to be able to construct continuous functions
based on discrete data.
The problem of constructing such a continuous function is called data fitting. In this chapter, we discuss a special case of data fitting known as interpolation, in which the goal is to find
a linear combination of n known functions to fit a set of data that imposes n constraints, thus
guaranteeing a unique solution that fits the data exactly, rather than approximately. The broader
term “constraints” is used, rather than simply “data points”, since the description of the data may
include additional information such as rates of change or requirements that the fitting function
have a certain number of continuous derivatives.
When it comes to the study of functions using calculus, polynomials are particularly simple to
work with. Therefore, in this course we will focus on the problem of constructing a polynomial
that, in some sense, fits given data. We first discuss some algorithms for computing the unique
polynomial pn (x) of degree n that satisfies pn (xi ) = yi , i = 0, . . . , n, where the points (xi , yi ) are
given. The points x0 , x1 , . . . , xn are called interpolation points. The polynomial pn (x) is called
the interpolating polynomial of the data (x0 , y0 ), (x1 , y1 ), . . ., (xn , yn ). At first, we will assume
that the interpolation points are all distinct; this assumption will be relaxed in a later section.
p_n(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n
Exercise 5.1.1 Solve the system of equations
a_0 + a_1 x_0 = y_0,
a_0 + a_1 x_1 = y_1
for the coefficients of the linear function p_1(x) = a_0 + a_1 x that interpolates the data (x_0, y_0), (x_1, y_1). What is the system of equations that must be solved to compute the coefficients a_0, a_1 and a_2 of the quadratic function p_2(x) = a_0 + a_1 x + a_2 x^2 that interpolates the data (x_0, y_0), (x_1, y_1), (x_2, y_2)? Express both systems of equations in matrix-vector form.
For general n, computing the coefficients a0 , a1 , . . . , an of pn (x) requires solving the system of
linear equations V_n a = y, where the entries of V_n are defined by [V_n]_{ij} = x_i^j, i, j = 0, \ldots, n, where
x0 , x1 , . . . , xn are the points at which the data y0 , y1 , . . . , yn are obtained. The basis {1, x, . . . , xn }
of the space of polynomials of degree n is called the monomial basis, and the corresponding
matrix Vn is called the Vandermonde matrix for the points x0 , x1 , . . . , xn .
Unfortunately, this approach to computing pn (x) is not practical. Solving this system of equa-
tions requires O(n3 ) floating-point operations; we will see that O(n2 ) is possible. Furthermore, the
Vandermonde matrix can be ill-conditioned, especially when the interpolation points x0 , x1 , . . . , xn
are close together. Instead, we will construct pn (x) using a representation other than the monomial
basis. That is, we will represent p_n(x) as
p_n(x) = \sum_{i=0}^{n} a_i \varphi_i(x),
for some choice of polynomials \varphi_0(x), \varphi_1(x), \ldots, \varphi_n(x). This is equivalent to solving the linear system P a = y, where the matrix P has entries p_{ij} = \varphi_j(x_i). By choosing the basis functions \{\varphi_i(x)\}_{i=0}^{n} judiciously, we can obtain a simpler system of linear equations to solve.
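As a brief illustration of the monomial-basis approach just described, the following Matlab sketch forms and solves the Vandermonde system for a small data set; the data values are taken from an example later in this chapter, and the comparison with the built-in polyfit is optional.

% Solve the Vandermonde system V*a = y for the monomial-basis coefficients
% (illustrative sketch; note that V can be ill-conditioned).
x = [-1; 0; 1; 2];  y = [3; -4; 5; -6];   % data used in Examples 5.2.2 and 5.3.5
n = length(x) - 1;
V = zeros(n+1);
for j = 0:n
    V(:,j+1) = x.^j;                      % column j+1 holds x_i^j
end
a = V\y;                                  % coefficients a_0, ..., a_n of p_n(x)
p = polyfit(x,y,n);                       % built-in alternative (highest degree first)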
Exercise 5.1.2 Write down the Vandermonde matrix Vn for the points x0 , x1 , . . . , xn .
Show that
\det V_n = \prod_{i=0}^{n} \prod_{j=0}^{i-1} (x_i - x_j).
Exercise 5.1.3 In this exercise, we consider another approach to proving the uniqueness
of the interpolating polynomial. Let pn (x) and qn (x) be polynomials of degree n such that
pn (xi ) = qn (xi ) = yi for i = 0, 1, 2, . . . , n. Prove that pn (x) ≡ qn (x) for all x.
Exercise 5.1.4 Suppose we express the interpolating polynomial of degree one in the
form
p1 (x) = a0 (x − x1 ) + a1 (x − x0 ).
What is the matrix of the system of equations p1 (xi ) = yi , for i = 0, 1? How should the
form of the interpolating polynomial of degree two, p2 (x), be chosen to obtain an equally
simple system of equations to solve for the coefficients a0 , a1 , and a2 ?
The polynomials {Ln,j }, j = 0, . . . , n, are called the Lagrange polynomials for the interpolation
points x0 , x1 , . . ., xn .
To obtain a formula for the Lagrange polynomials, we note that the above definition specifies the roots of L_{n,j}(x): x_i, for i \ne j. It follows that L_{n,j}(x) has the form
L_{n,j}(x) = \beta_j \prod_{i=0, i \ne j}^{n} (x - x_i)
for some constant \beta_j. Substituting x = x_j and requiring L_{n,j}(x_j) = 1 yields \beta_j = \prod_{i=0, i \ne j}^{n} \frac{1}{x_j - x_i}. We conclude that
L_{n,j}(x) = \prod_{k=0, k \ne j}^{n} \frac{x - x_k}{x_j - x_k}.
As the following result indicates, the problem of polynomial interpolation can be solved using Lagrange polynomials: the unique polynomial of degree at most n that interpolates f(x) at the distinct points x_0, x_1, \ldots, x_n is
p_n(x) = \sum_{j=0}^{n} f(x_j) L_{n,j}(x),
which satisfies p_n(x_j) = f(x_j), j = 0, 1, \ldots, n.
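A Matlab sketch of evaluating this Lagrange form directly is given below; the function name lagrangeval and the argument conventions are illustrative.

% Evaluate the Lagrange form of the interpolating polynomial at the points in xx
% (illustrative sketch; x holds distinct interpolation points, y the data values).
function yy = lagrangeval(x,y,xx)
n  = length(x) - 1;
yy = zeros(size(xx));
for j = 0:n
    L = ones(size(xx));                   % build L_{n,j} evaluated at xx
    for k = 0:n
        if k ~= j
            L = L.*(xx - x(k+1))/(x(j+1) - x(k+1));
        end
    end
    yy = yy + y(j+1)*L;                   % p_n(xx) = sum_j y_j L_{n,j}(xx)
end
end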
Example 5.2.2 We will use Lagrange interpolation to find the unique polynomial p3 (x), of degree
3 or less, that agrees with the following data:
i xi yi
0 −1 3
1 0 −4
2 1 5
3 2 −6
In other words, we must have p3 (−1) = 3, p3 (0) = −4, p3 (1) = 5, and p3 (2) = −6.
First, we construct the Lagrange polynomials \{L_{3,j}(x)\}_{j=0}^{3}, using the formula
L_{3,j}(x) = \prod_{i=0, i \ne j}^{3} \frac{x - x_i}{x_j - x_i}.
This yields
L_{3,0}(x) = \frac{(x - x_1)(x - x_2)(x - x_3)}{(x_0 - x_1)(x_0 - x_2)(x_0 - x_3)} = \frac{(x - 0)(x - 1)(x - 2)}{(-1 - 0)(-1 - 1)(-1 - 2)} = \frac{x(x^2 - 3x + 2)}{(-1)(-2)(-3)} = -\frac{1}{6}(x^3 - 3x^2 + 2x),
L_{3,1}(x) = \frac{(x - x_0)(x - x_2)(x - x_3)}{(x_1 - x_0)(x_1 - x_2)(x_1 - x_3)} = \frac{(x + 1)(x - 1)(x - 2)}{(0 + 1)(0 - 1)(0 - 2)} = \frac{(x^2 - 1)(x - 2)}{(1)(-1)(-2)} = \frac{1}{2}(x^3 - 2x^2 - x + 2),
L_{3,2}(x) = \frac{(x - x_0)(x - x_1)(x - x_3)}{(x_2 - x_0)(x_2 - x_1)(x_2 - x_3)} = \frac{(x + 1)(x - 0)(x - 2)}{(1 + 1)(1 - 0)(1 - 2)} = \frac{x(x^2 - x - 2)}{(2)(1)(-1)} = -\frac{1}{2}(x^3 - x^2 - 2x),
L_{3,3}(x) = \frac{(x - x_0)(x - x_1)(x - x_2)}{(x_3 - x_0)(x_3 - x_1)(x_3 - x_2)} = \frac{(x + 1)(x - 0)(x - 1)}{(2 + 1)(2 - 0)(2 - 1)} = \frac{x(x^2 - 1)}{(3)(2)(1)} = \frac{1}{6}(x^3 - x).
By substituting x_i for x in each Lagrange polynomial L_{3,j}(x), for j = 0, 1, 2, 3, it can be verified that
L_{3,j}(x_i) = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases}
The interpolating polynomial is then p_3(x) = 3L_{3,0}(x) - 4L_{3,1}(x) + 5L_{3,2}(x) - 6L_{3,3}(x). Substituting each x_i, for i = 0, 1, 2, 3, into p_3(x), we can verify that we obtain p_3(x_i) = y_i in each case. 2
Next, we define
\pi_n(x) = (x - x_0)(x - x_1) \cdots (x - x_n).
Then, each Lagrange polynomial can be rewritten as
L_{n,j}(x) = \frac{\pi_n(x)\, w_j}{x - x_j}, \qquad x \ne x_j,
Although O(n2 ) products are needed to compute the barycentric weights, they need only be com-
puted once, and then re-used for each x, which is not the case with the Lagrange form.
Exercise 5.2.3 Write a Matlab function w=baryweights(x) that accepts as input a
vector x of length n + 1, consisting of the distinct interpolation points x0 , x1 , . . . , xn , and
returns a vector w of length n + 1 consisting of the barycentric weights wj as defined in
(5.1).
It may be determined after computing the kth-degree interpolating polynomial p_k(x) of a function f(x) that p_k(x) is not a sufficiently accurate approximation of f(x) on some domain. Therefore, an interpolating polynomial of higher degree must be computed, which requires additional interpolation points.
To address these issues, we consider the problem of computing the interpolating polynomial
recursively. More precisely, let k > 0, and let pk (x) be the polynomial of degree k that interpolates
the function f (x) at the points x0 , x1 , . . . , xk . Ideally, we would like to be able to obtain pk (x) from
polynomials of degree k − 1 that interpolate f (x) at points chosen from among x0 , x1 , . . . , xk . The
following result shows that this is possible.
Theorem 5.3.1 Let n be a positive integer, and let f (x) be a function defined on a
domain containing the n + 1 distinct points x0 , x1 , . . . , xn , and let pn (x) be the polynomial
of degree n that interpolates f (x) at the points x0 , x1 , . . . , xn . For each i = 0, 1, . . . , n, we
define pn−1,i (x) to be the polynomial of degree n − 1 that interpolates f (x) at the points
x0 , x1 , . . . , xi−1 , xi+1 , . . . , xn . If i and j are distinct nonnegative integers not exceeding n,
then
p_n(x) = \frac{(x - x_j)\, p_{n-1,j}(x) - (x - x_i)\, p_{n-1,i}(x)}{x_i - x_j}.
This theorem can be proved by substituting x = xi into the above form for pn (x), and using the
fact that the interpolating polynomial is unique.
Algorithm 5.3.2 Let x_0, x_1, \ldots, x_n be distinct numbers, and let f(x) be a function defined on a domain containing these numbers. Given a number x^*, the following algorithm computes y^* = p_n(x^*), where p_n(x) is the polynomial of degree n that interpolates f(x) at the points x_0, x_1, \ldots, x_n.
for j = 0 to n do
    Q_j = f(x_j)
end
for j = 1 to n do
    for k = n, n - 1, \ldots, j do
        Q_k = [(x^* - x_k)Q_{k-1} - (x^* - x_{k-j})Q_k]/(x_{k-j} - x_k)
    end
end
y^* = Q_n
At the jth iteration of the outer loop, the number Q_k, for k = n, n - 1, \ldots, j, represents the value at x^* of the polynomial that interpolates f(x) at the points x_k, x_{k-1}, \ldots, x_{k-j}.
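A Matlab translation of Algorithm 5.3.2 might look as follows; the function name neville is an illustrative choice, and the indices are shifted by one because Matlab arrays start at index 1.

% Neville's method: evaluate the interpolating polynomial at xstar
% (illustrative sketch of Algorithm 5.3.2).
function ystar = neville(x,y,xstar)
n = length(x) - 1;
Q = y(:);                                 % Q_j = f(x_j)
for j = 1:n
    for k = n+1:-1:j+1                    % k corresponds to index k-1 in the algorithm
        Q(k) = ((xstar - x(k))*Q(k-1) - (xstar - x(k-j))*Q(k))/(x(k-j) - x(k));
    end
end
ystar = Q(n+1);
end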
The preceding theorem can be used to compute the polynomial pn (x) itself, rather than its value
at a given point. This yields an alternative method of constructing the interpolating polynomial,
called Newton interpolation, that is more suitable for tasks such as inclusion of additional
interpolation points. The basic idea is to represent interpolating polynomials using the Newton
form, which uses linear factors involving the interpolation points, instead of monomials of the form
xj .
N_0(x) = 1, \qquad N_j(x) = \prod_{k=0}^{j-1} (x - x_k), \quad j = 1, \ldots, n.
The advantage of Newton interpolation is that the interpolating polynomial is easily updated as
interpolation points are added, since the basis functions {Nj (x)}, j = 0, . . . , n, do not change from
the addition of the new points.
Using Theorem 5.3.1, it can be shown that the coefficients cj of the Newton interpolating
polynomial
p_n(x) = \sum_{j=0}^{n} c_j N_j(x)
are given by
cj = f [x0 , . . . , xj ]
where f [x0 , . . . , xj ] denotes the divided difference of x0 , . . . , xj . The divided difference is defined
as follows:
f[x_i] = y_i,
f[x_i, x_{i+1}] = \frac{y_{i+1} - y_i}{x_{i+1} - x_i},
f[x_i, x_{i+1}, \ldots, x_{i+k}] = \frac{f[x_{i+1}, \ldots, x_{i+k}] - f[x_i, \ldots, x_{i+k-1}]}{x_{i+k} - x_i}.
This definition implies that for each nonnegative integer j, the divided difference f [x0 , x1 , . . . , xj ]
only depends on the interpolation points x0 , x1 , . . . , xj and the value of f (x) at these points. It
follows that the addition of new interpolation points does not change the coefficients c0 , . . . , cn .
Specifically, we have
p_{n+1}(x) = p_n(x) + \frac{y_{n+1} - p_n(x_{n+1})}{N_{n+1}(x_{n+1})} N_{n+1}(x).
This ease of updating makes Newton interpolation the most commonly used method of obtaining
the interpolating polynomial.
The following result shows how the Newton interpolating polynomial bears a resemblance to a
Taylor polynomial.
Theorem 5.3.3 Let f be n times continuously differentiable on [a, b], and let
x0 , x1 , . . . , xn be distinct points in [a, b]. Then there exists a number ξ ∈ [a, b] such
that
f[x_0, x_1, \ldots, x_n] = \frac{f^{(n)}(\xi)}{n!}.
Exercise 5.3.2 Prove Theorem 5.3.3 for the case of n = 2, using the definition of divided
differences and Taylor’s theorem.
Exercise 5.3.3 Let p_n(x) be the interpolating polynomial for f(x) at points x_0, x_1, \ldots, x_n \in [a, b], and assume that f is n times differentiable. Use Rolle's Theorem to prove that p_n^{(n)}(\xi) = f^{(n)}(\xi) for some point \xi \in [a, b].
Exercise 5.3.4 Use Exercise 5.3.3 to prove Theorem 5.3.3. Hint: Think of pn−1 (x) as
an interpolant of pn (x).
We now describe in detail how to compute the coefficients cj = f [x0 , x1 , . . . , xj ] of the Newton
interpolating polynomial pn (x), and how to evaluate pn (x) efficiently using these coefficients.
We construct this table by filling in the n + 1 entries in column 0, which are the trivial divided
differences f [xj ] = f (xj ), for j = 0, 1, . . . , n. Then, we use the recursive definition of the divided
differences to fill in the entries of subsequent columns. Once the construction of the table is
complete, we can obtain the coefficients of the Newton interpolating polynomial from the first
entry in each column, which is f [x0 , x1 , . . . , xj ], for j = 0, 1, . . . , n.
In a practical implementation of this algorithm, we do not need to store the entire table, because
we only need the first entry in each column. Because each column has one fewer entry than the
previous column, we can overwrite all of the other entries that we do not need. The following
algorithm implements this idea.
for i = 0, 1, . . . , n do
di,0 = f (xi )
end
for j = 1, 2, . . . , n do
for i = n, n − 1, . . . , j do
di,j = (di,j−1 − di−1,j−1 )/(xi − xi−j )
end
end
for j = 0, 1, . . . , n do
cj = dj,j
end
Example 5.3.5 We will use Newton interpolation to construct the third-degree polynomial p3 (x)
that fits the data
i xi f (xi )
0 −1 3
1 0 −4
2 1 5
3 2 −6
In other words, we must have p3 (−1) = 3, p3 (0) = −4, p3 (1) = 5, and p3 (2) = −6.
First, we construct the divided-difference table from this data. The divided differences in the
table are computed as follows:
f [x0 ] = f (x0 ) = 3, f [x1 ] = f (x1 ) = −4, f [x2 ] = f (x2 ) = 5, f [x3 ] = f (x3 ) = −6,
f[x_0, x_1] = \frac{f[x_1] - f[x_0]}{x_1 - x_0} = \frac{-4 - 3}{0 - (-1)} = -7,
f[x_1, x_2] = \frac{f[x_2] - f[x_1]}{x_2 - x_1} = \frac{5 - (-4)}{1 - 0} = 9,
f[x_2, x_3] = \frac{f[x_3] - f[x_2]}{x_3 - x_2} = \frac{-6 - 5}{2 - 1} = -11,
f[x_0, x_1, x_2] = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} = \frac{9 - (-7)}{1 - (-1)} = 8,
f[x_1, x_2, x_3] = \frac{f[x_2, x_3] - f[x_1, x_2]}{x_3 - x_1} = \frac{-11 - 9}{2 - 0} = -10,
f[x_0, x_1, x_2, x_3] = \frac{f[x_1, x_2, x_3] - f[x_0, x_1, x_2]}{x_3 - x_0} = \frac{-10 - 8}{2 - (-1)} = -6.
x0 = −1 f [x0 ] = 3
f [x0 , x1 ] = −7
x1 = 0 f [x1 ] = −4 f [x0 , x1 , x2 ] = 8
f [x1 , x2 ] = 9 f [x0 , x1 , x2 , x3 ] = −6
x2 = 1 f [x2 ] = 5 f [x1 , x2 , x3 ] = −10
f [x2 , x3 ] = −11
x3 = 2 f [x3 ] = −6
It follows that the interpolating polynomial p3 (x) can be expressed in Newton form as follows:
p_3(x) = \sum_{j=0}^{3} f[x_0, \ldots, x_j] \prod_{i=0}^{j-1} (x - x_i)
       = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + f[x_0, x_1, x_2, x_3](x - x_0)(x - x_1)(x - x_2)
       = 3 - 7(x + 1) + 8(x + 1)x - 6(x + 1)x(x - 1).
We see that Newton interpolation produces an interpolating polynomial that is in the Newton form,
with centers x0 = −1, x1 = 0, and x2 = 1. 2
Exercise 5.3.5 Write a Matlab function c=divdiffs(x,y) that computes the divided
difference table from the given data stored in the input vectors x and y, and returns a
vector c consisting of the divided differences f [x0 , . . . , xj ], j = 0, 1, 2, . . . , n, where n + 1
is the length of both x and y.
Once the coefficients have been computed, we can use nested multiplication to evaluate the
resulting interpolating polynomial, which is represented using the Newton form
p_n(x) = \sum_{j=0}^{n} c_j N_j(x)   (5.2)
       = \sum_{j=0}^{n} f[x_0, x_1, \ldots, x_j] \prod_{i=0}^{j-1} (x - x_i)   (5.3)
       = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + \cdots +   (5.4)
         f[x_0, x_1, \ldots, x_n](x - x_0)(x - x_1) \cdots (x - x_{n-1}).   (5.5)
b_n = c_n
for i = n - 1, n - 2, \ldots, 0 do
    b_i = c_i + (x - x_i) b_{i+1}
end
y = b_0
It can be seen that this algorithm closely resembles Horner’s Method, which is a special case of
nested multiplication that works with the power form of a polynomial, whereas nested multiplication
works with the more general Newton form.
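A Matlab sketch of evaluating the Newton form by nested multiplication is given below; the function name newtonval and the argument conventions are illustrative.

% Evaluate a polynomial in Newton form at the points in xx by nested
% multiplication (illustrative sketch). c holds the divided differences
% f[x_0,...,x_j], and x holds the centers x_0, ..., x_{n-1} (x_n is not used).
function yy = newtonval(c,x,xx)
n  = length(c) - 1;
yy = c(n+1)*ones(size(xx));               % b_n = c_n
for i = n-1:-1:0
    yy = c(i+1) + (xx - x(i+1)).*yy;      % b_i = c_i + (x - x_i) b_{i+1}
end
end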
Example 5.3.7 Consider the interpolating polynomial obtained in the previous example,
p_3(x) = 3 - 7(x + 1) + 8(x + 1)x - 6(x + 1)x(x - 1).
We will use nested multiplication to write this polynomial in the power form
p_3(x) = b_3 x^3 + b_2 x^2 + b_1 x + b_0.
b3 = c3
b2 = c2 + (z − x2 )b3
b1 = c1 + (z − x1 )b2
b0 = c0 + (z − x0 )b1 ,
It follows that b_0 = p(z), which is why this algorithm is the preferred method for evaluating a polynomial in Newton form at a given point z. It should be noted that the algorithm can be derived by writing p(x) in the nested form
p(x) = c_0 + (x - x_0)[\,c_1 + (x - x_1)[\,c_2 + \cdots + (x - x_{n-1})c_n \cdots ]].
Initially, we have
c_0 = 3, \quad c_1 = -7, \quad c_2 = 8, \quad c_3 = -6,
and the centers are x_0 = -1, x_1 = 0 and x_2 = 1. Applying nested multiplication with z = 0 yields
b_3 = -6,
b_2 = 8 + (0 - 1)(-6) = 14,
b_1 = -7 + (0 - 0)(14) = -7,
b_0 = 3 + (0 - (-1))(-7) = -4.
It follows that
p_3(x) = -4 - 7(x - 0) + 14(x - 0)(x + 1) - 6(x - 0)(x + 1)(x - 0),
with centers 0, -1 and 0. Applying nested multiplication with z = 0 again yields
b_3 = -6,
b_2 = 14 + (0 - 0)(-6) = 14,
b_1 = -7 + (0 - (-1))(14) = 7,
b_0 = -4 + (0 - 0)(7) = -4.
It follows that
p_3(x) = -4 + 7(x - 0) + 14(x - 0)(x - 0) - 6(x - 0)(x - 0)(x + 1),
with centers 0, 0 and -1. A third application of nested multiplication with z = 0 yields
b_3 = -6,
b_2 = 14 + (0 - (-1))(-6) = 8,
b_1 = 7 + (0 - 0)(8) = 7,
b_0 = -4 + (0 - 0)(7) = -4.
It follows that
p_3(x) = -4 + 7x + 8x^2 - 6x^3,
and the centers are now 0, 0 and 0. Since all of the centers are equal to zero, the polynomial is now in the power form. 2
Because
q(x) = b_1 + b_2(x - x_0) + b_3(x - x_0)(x - x_1), \qquad p(x) = b_0 + (x - z)q(x),
it follows that once we have changed all of the centers of q(x) to be equal to z, then all of the centers
of p(x) will be equal to z as well. In summary, we can convert a polynomial of degree n from Newton
form to power form by applying nested multiplication n times, where the jth application is to a
polynomial of degree n − j + 1, for j = 1, 2, . . . , n.
Since the coefficients of the appropriate Newton form of each of these polynomials of successively
lower degree are computed by the nested multiplication algorithm, it follows that we can implement
this more efficient procedure simply by proceeding exactly as before, except that during the jth
application of nested multiplication, we do not compute the coefficients b0 , b1 , . . . , bj−2 , because
they will not change anyway, as can be seen from the previous computations. For example, in the
second application, we did not need to compute b0 , and in the third, we did not need to compute
b0 and b1 .
Exercise 5.3.7 Write a Matlab function p=powerform(x,c) that accepts as input vec-
tors x and c, both of length n + 1, consisting of the interpolation points xj and divided
differences f [x0 , x1 , . . . , xj ], respectively, j = 0, 1, . . . , n. The output is a (n + 1)-vector
consisting of the coefficients of the interpolating polynomial pn (x) in power form, ordered
from highest degree to lowest.
Exercise 5.3.8 Write a Matlab function p=newtonfit(x,y) that accepts as input vec-
tors x and y of length n + 1 consisting of the x- and y-coordinates, respectively, of points
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ), where the x-values must all be distinct, and returns a (n+1)-
vector p consisting of the coefficients of the Newton interpolating polynomial pn (x), in
power form, with highest-degree coefficient in the first position. Use your divdiffs func-
tion from Exercise 5.3.5 and your powerform function from Exercise 5.3.7. Test your
function by comparing your output to that of the built-in function polyfit.
\Delta x_k = x_{k+1} - x_k,
where \{x_k\} is any sequence. If the interpolation points are equally spaced, with spacing h, that is, x_{k+1} = x_k + h, then the divided differences f[x_0, x_1, \ldots, x_k] are given by
f[x_0, x_1, \ldots, x_k] = \frac{1}{k!\,h^k} \Delta^k f(x_0). \qquad (5.6)
The interpolating polynomial can then be described by the Newton forward-difference formula
p_n(x) = f[x_0] + \sum_{k=1}^{n} \binom{s}{k} \Delta^k f(x_0), \qquad (5.7)
where s = (x - x_0)/h and \binom{s}{k} denotes the extended binomial coefficient.
Exercise 5.3.9 Use induction to prove (5.6). Then show that the Newton interpolating
polynomial (5.5) reduces to (5.7) in the case of equally spaced interpolation points.
i xi f (xi )
0 −1 3
1 0 −4
2 1 5
3 2 −6
In other words, we must have p3 (−1) = 3, p3 (0) = −4, p3 (1) = 5, and p3 (2) = −6. Note that the
interpolation points x0 = −1, x1 = 0, x2 = 1 and x3 = 2 are equally spaced, with spacing h = 1.
To apply the forward-difference formula, we define s = (x - x_0)/h = x + 1 and compute the extended binomial coefficients
\binom{s}{1} = s = x + 1, \qquad \binom{s}{2} = \frac{s(s - 1)}{2} = \frac{x(x + 1)}{2}, \qquad \binom{s}{3} = \frac{s(s - 1)(s - 2)}{6} = \frac{(x + 1)x(x - 1)}{6},
and then the coefficients
f[x_0] = f(x_0) = 3,
\Delta f(x_0) = f(x_1) - f(x_0) = -4 - 3 = -7,
\Delta^2 f(x_0) = \Delta(\Delta f(x_0)) = \Delta[f(x_1) - f(x_0)] = [f(x_2) - f(x_1)] - [f(x_1) - f(x_0)] = f(x_2) - 2f(x_1) + f(x_0) = 5 - 2(-4) + 3 = 16,
\Delta^3 f(x_0) = \Delta(\Delta^2 f(x_0)) = f(x_3) - 3f(x_2) + 3f(x_1) - f(x_0) = -6 - 3(5) + 3(-4) - 3 = -36.
It follows that
p_3(x) = f[x_0] + \sum_{k=1}^{3} \binom{s}{k} \Delta^k f(x_0)
       = 3 + \binom{s}{1} \Delta f(x_0) + \binom{s}{2} \Delta^2 f(x_0) + \binom{s}{3} \Delta^3 f(x_0)
       = 3 + (x + 1)(-7) + \frac{x(x + 1)}{2}(16) + \frac{(x + 1)x(x - 1)}{6}(-36)
       = 3 - 7(x + 1) + 8(x + 1)x - 6(x + 1)x(x - 1).
Note that the forward-difference formula computes the same form of the interpolating polynomial
as the Newton divided-difference formula. 2
Exercise 5.3.10 Define the backward difference
\nabla x_k = x_k - x_{k-1},
for any sequence \{x_k\}. Then derive the Newton backward-difference formula
p_n(x) = f[x_n] + \sum_{k=1}^{n} (-1)^k \binom{-s}{k} \nabla^k f(x_n),
where s = (x - x_n)/h, and the preceding definition of the extended binomial coefficient applies.
Exercise 5.3.11 Look up the documentation for the Matlab function diff. Then write
functions yy=newtonforwdiff(x,y,xx) and yy=newtonbackdiff(x,y,xx) that use diff
to implement the Newton forward-difference and Newton backward-difference formulas,
respectively, and evaluate the interpolating polynomial pn (x), where n = length(x) − 1,
at the elements of xx. The resulting values must be returned in yy.
It is interesting to note that the error closely resembles the Taylor remainder Rn (x).
Exercise 5.4.1 Prove Theorem 5.4.1. Hint: work with the Newton interpolating polyno-
mial for the points x0 , x1 , . . . , xn , x.
Exercise 5.4.2 Determine a bound on the error |f (x) − p2 (x)| for x in [0, 1], where
f (x) = ex , and p2 (x) is the interpolating polynomial of f (x) at x0 = 0, x1 = 0.5, and
x2 = 1.
If the number of data points is large, then polynomial interpolation becomes problematic, since high-degree interpolation yields oscillatory polynomials even when the data come from a smooth function.
Example 5.4.2 Suppose that we wish to approximate the function f (x) = 1/(1+x2 ) on the interval
[−5, 5] with a tenth-degree interpolating polynomial that agrees with f (x) at 11 equally-spaced points
x0 , x1 , . . . , x10 in [−5, 5], where xj = −5+j, for j = 0, 1, . . . , 10. Figure 5.1 shows that the resulting
polynomial is not a good approximation of f (x) on this interval, even though it agrees with f (x) at
the interpolation points. The following MATLAB session shows how the plot in the figure can be
created.
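One possible session along these lines, using the built-in functions polyfit and polyval (the exact commands used to produce Figure 5.1 are an assumption of this sketch), is:

% Runge's example: interpolate f(x) = 1/(1+x^2) at 11 equally spaced points.
xj = -5:1:5;                              % interpolation points x_0, ..., x_10
yj = 1./(1 + xj.^2);
p  = polyfit(xj,yj,10);                   % degree-10 interpolating polynomial
x  = linspace(-5,5,401);                  % fine grid for plotting
plot(x,1./(1 + x.^2),'-',x,polyval(p,x),'--',xj,yj,'o')
xlabel('x'), legend('f(x)','p_{10}(x)','data')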
The example shown in Figure 5.1 is a well-known example of the difficulty of high-degree polynomial
interpolation using equally-spaced points, and it is known as Runge’s example [33]. 2
Figure 5.1: The function f (x) = 1/(1 + x2 ) (solid curve) cannot be interpolated accurately on
[−5, 5] using a tenth-degree polynomial (dashed curve) with equally-spaced interpolation points.
Is it possible to choose the interpolation points so that the error is minimized? To answer this question, we introduce the Chebyshev polynomials, which can be defined by
T_k(x) = \cos(k \cos^{-1} x), \qquad k = 0, 1, 2, \ldots, \quad x \in [-1, 1]. \qquad (5.8)
Using (5.8) and the sum and difference formulas for cosine, it can be shown that the Chebyshev polynomials satisfy the three-term recurrence relation
T_{k+1}(x) = 2x\, T_k(x) - T_{k-1}(x), \qquad k \ge 1. \qquad (5.11)
It can easily be seen from this relation, and the first two Chebyshev polynomials T_0(x) = 1 and T_1(x) = x, that T_k(x) is in fact a polynomial for all integers k \ge 0.
The Chebyshev polynomials have the following properties of interest:
Exercise 5.4.4 Use (5.11) and induction to show that the leading coefficient of Tk (x) is
2k−1 , for k ≥ 1.
Exercise 5.4.5 Use the roots of cosine to compute the roots of Tk (x). Show that they are
real, distinct, and lie within (−1, 1). These roots are known as the Chebyshev points.
Let f (x) be a function that is (n+1) times continuously differentiable on [a, b]. If we approximate
f (x) by a nth-degree polynomial pn (x) that interpolates f (x) at the n + 1 roots of the Chebyshev
polynomial Tn+1 (x), mapped from [−1, 1] to [a, b],
\xi_j = \frac{1}{2}(b - a) \cos\left( \frac{(2j + 1)\pi}{2n + 2} \right) + \frac{1}{2}(a + b), \qquad j = 0, 1, \ldots, n,
then the error in this approximation is
f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n + 1)!} \left( \frac{b - a}{2} \right)^{n+1} 2^{-n}\, T_{n+1}(t(x)),
where
t(x) = -1 + \frac{2}{b - a}(x - a)
is the linear map from [a, b] to [−1, 1]. This is because
\prod_{j=0}^{n} (x - \xi_j) = \left( \frac{b - a}{2} \right)^{n+1} \prod_{j=0}^{n} (t(x) - \tau_j) = \left( \frac{b - a}{2} \right)^{n+1} 2^{-n}\, T_{n+1}(t(x)),
where τj is the jth root of Tn+1 (t). From |Tn+1 (t)| ≤ 1, we obtain
|f(x) - p_n(x)| \le \frac{(b - a)^{n+1}}{2^{2n+1}(n + 1)!} \max_{\xi \in [a,b]} |f^{(n+1)}(\xi)|.
It can be shown that using Chebyshev points leads to much less error in the function f (x) =
1/(1 + x2 ) from Runge’s example [28].
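The following Matlab sketch illustrates this by interpolating Runge's function at the Chebyshev points on [-5, 5]; the variable names are illustrative.

% Interpolate Runge's function at the Chebyshev points on [a,b] (illustrative sketch).
a = -5;  b = 5;  n = 10;
j  = 0:n;
xj = (b - a)/2*cos((2*j + 1)*pi/(2*n + 2)) + (a + b)/2;   % Chebyshev points
yj = 1./(1 + xj.^2);
p  = polyfit(xj,yj,n);
x  = linspace(a,b,401);
maxerr = max(abs(1./(1 + x.^2) - polyval(p,x)))   % much smaller than with equally spaced points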
Exercise 5.5.2 Suppose that osculatory interpolation is used to construct the polynomial p_n(x) that interpolates f(x) at only one x-value, x_0, and satisfies p_n(x_0) = f(x_0), p_n'(x_0) = f'(x_0), p_n''(x_0) = f''(x_0), and so on, up to p_n^{(n)}(x_0) = f^{(n)}(x_0). What polynomial approximation of f(x) is obtained?
where, as before, Li (x) is the ith Lagrange polynomial for the interpolation points x0 , x1 , . . . , xn .
It can be verified directly that these polynomials satisfy, for i, j = 0, 1, . . . , n,
Exercise 5.5.3 Derive the formulas (5.12), (5.13) for Hi (x) and Ki (x), respectively,
using the specified constraints for these polynomials. Hint: use an approach similar to
that used to derive the formula for Lagrange polynomials.
To prove that this polynomial is the unique polynomial of degree 2n + 1, we assume that there is another polynomial \tilde p_{2n+1} of degree 2n + 1 that satisfies the constraints. Because p_{2n+1}(x_i) = \tilde p_{2n+1}(x_i) = f(x_i) for i = 0, 1, \ldots, n, p_{2n+1} - \tilde p_{2n+1} has at least n + 1 zeros. It follows from Rolle's Theorem that p'_{2n+1} - \tilde p'_{2n+1} has n zeros that lie within the intervals (x_i, x_{i+1}) for i = 0, 1, \ldots, n - 1. Furthermore, because p'_{2n+1}(x_i) = \tilde p'_{2n+1}(x_i) = f'(x_i) for i = 0, 1, \ldots, n, it follows that p'_{2n+1} - \tilde p'_{2n+1} has n + 1 additional zeros, for a total of at least 2n + 1. However, p'_{2n+1} - \tilde p'_{2n+1} is a polynomial of degree 2n, and the only way that a polynomial of degree 2n can have 2n + 1 zeros is if it is identically zero. Therefore, p_{2n+1} - \tilde p_{2n+1} is a constant function, but since this function is known to have at least n + 1 zeros, that constant must be zero, and the Hermite polynomial is unique.
Using a similar approach as for the Lagrange interpolating polynomial, the following result can
be proved.
Theorem 5.5.1 Let f be 2n + 2 times continuously differentiable on [a, b], and let p2n+1
denote the Hermite polynomial of f with interpolation points x0 , x1 , . . . , xn in [a, b]. Then
there exists a point ξ(x) ∈ [a, b] such that
f(x) - p_{2n+1}(x) = \frac{f^{(2n+2)}(\xi(x))}{(2n + 2)!} (x - x_0)^2 (x - x_1)^2 \cdots (x - x_n)^2.
i      z_i   f(z_i)   f'(z_i)
0,1    0     0        1
2,3    1     0        1
In other words, we must have p_3(0) = 0, p_3'(0) = 1, p_3(1) = 0, and p_3'(1) = 1. To include the values of f'(x) at the two distinct interpolation points, we repeat each point once, so that the number of interpolation points, including repetitions, is equal to the number of constraints described by the data.
First, we construct the divided-difference table from this data. The divided differences in the
table are computed as follows:
f[z_0, z_1] = \frac{f[z_1] - f[z_0]}{z_1 - z_0} = f'(z_0) = 1,
f[z_1, z_2] = \frac{f[z_2] - f[z_1]}{z_2 - z_1} = \frac{0 - 0}{1 - 0} = 0,
f[z_2, z_3] = \frac{f[z_3] - f[z_2]}{z_3 - z_2} = f'(z_2) = 1,
f[z_0, z_1, z_2] = \frac{f[z_1, z_2] - f[z_0, z_1]}{z_2 - z_0} = \frac{0 - 1}{1 - 0} = -1,
f[z_1, z_2, z_3] = \frac{f[z_2, z_3] - f[z_1, z_2]}{z_3 - z_1} = \frac{1 - 0}{1 - 0} = 1,
f[z_0, z_1, z_2, z_3] = \frac{f[z_1, z_2, z_3] - f[z_0, z_1, z_2]}{z_3 - z_0} = \frac{1 - (-1)}{1 - 0} = 2.
Note that the values of the derivative are used whenever a divided difference of the form f[z_i, z_{i+1}] would require division by zero, that is, whenever z_i = z_{i+1}.
Exercise 5.5.4 Use the Newton form of the Hermite interpolating polynomial to prove
Theorem 5.5.1.
Definition 5.6.1 (Piecewise polynomial) Let [a, b] be an interval that is divided into
subintervals [xi , xi+1 ], where i = 0, . . . , n − 1, x0 = a and xn = b. A piecewise polyno-
mial is a function p(x) defined on [a, b] by
It is essential to note that by this definition, a piecewise polynomial defined on [a, b] is equal to
some polynomial on each subinterval [xi−1 , xi ] of [a, b], for i = 1, 2, . . . , n, but a different polynomial
may be used for each subinterval.
To study the accuracy of piecewise polynomials, we need to work with various function spaces,
including Sobolev spaces; these function spaces are defined in Section B.12.
Exercise 5.6.1 Given that sL (x) must satisfy sL (xi ) = f (xi ) for i = 0, 1, 2, . . . , n, ex-
plain how the formula (5.14) can be derived.
Exercise 5.6.2 Given the points (x0 , f (x0 )), (x1 , f (x1 )), . . . , (xn , f (xn )) plotted in the
xy-plane, explain how sL can easily be graphed. How can the graph be produced in a
single line of Matlab code, given vectors x and y containing the x- and y- coordinates
of these points, respectively?
If f \in C^2[a, b], then by the error in Lagrange interpolation (Theorem 5.4.1), on each subinterval [x_{i-1}, x_i], for i = 1, 2, \ldots, n, we have
f(x) - s_L(x) = \frac{f''(\xi)}{2}(x - x_{i-1})(x - x_i).
This leads to the following Theorem.
Theorem 5.6.2 Let f \in C^2[a, b], and let s_L be the piecewise linear spline defined by (5.14). For i = 1, 2, \ldots, n, let h_i = x_i - x_{i-1}, and define h = \max_{1 \le i \le n} h_i. Then
\|f - s_L\|_\infty \le \frac{M}{8} h^2,
where |f''(x)| \le M on [a, b].
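The following Matlab sketch illustrates the h^2 behavior predicted by Theorem 5.6.2, using f(x) = sin x on [0, \pi] (so that M = 1) and the built-in interp1 to evaluate the linear spline; the choice of test function and grids is an assumption of this sketch.

% Check the O(h^2) error bound for the linear spline (illustrative sketch).
f = @(x) sin(x);
for n = [10 20 40 80]
    xk = linspace(0,pi,n+1);              % knots
    x  = linspace(0,pi,2001);             % fine evaluation grid
    sL = interp1(xk,f(xk),x);             % piecewise linear interpolant
    h  = pi/n;
    fprintf('h = %6.4f   error = %8.2e   bound M*h^2/8 = %8.2e\n', ...
        h, max(abs(f(x) - sL)), h^2/8);
end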
In Section 5.4, it was observed in Runge’s example that even when f (x) is smooth, an inter-
polating polynomial of f (x) can be highly oscillatory, depending on the number and placement of
interpolation points. By contrast, one of the most welcome properties of the linear spline sL (x)
is that among all functions in H^1(a, b) that interpolate f(x) at the knots x_0, x_1, \ldots, x_n, it is the "flattest". That is, for any function v \in H^1(a, b) that interpolates f at the knots,
\|s_L'\|_2 \le \|v'\|_2.
We then note that on each subinterval [x_{i-1}, x_i], since s_L is a linear function, s_L' is a constant function, which we denote by
s_L'(x) \equiv m_i = \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}}, \qquad i = 1, 2, \ldots, n.
We then have
\langle v' - s_L', s_L' \rangle = \int_a^b [v'(x) - s_L'(x)]\, s_L'(x)\,dx
 = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} [v'(x) - s_L'(x)]\, s_L'(x)\,dx
 = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} [m_i v'(x) - m_i^2]\,dx
 = \sum_{i=1}^{n} m_i [v(x) - m_i x]\Big|_{x_{i-1}}^{x_i}
 = \sum_{i=1}^{n} m_i \left[ v(x_i) - v(x_{i-1}) - \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}} (x_i - x_{i-1}) \right]
 = 0,
because by assumption, v(x) interpolates f (x) at the knots. This leaves us with
Definition 5.6.3 (Cubic Spline) Let f(x) be a function defined on an interval [a, b], and let x_0, x_1, \ldots, x_n be n + 1 distinct points in [a, b], where a = x_0 < x_1 < \cdots < x_n = b. A cubic spline, or cubic spline interpolant, is a piecewise polynomial s(x) that satisfies the following conditions:
(a) s''(a) = s''(b) = 0, which are called free or natural boundary conditions, and
(b) s'(a) = f'(a) and s'(b) = f'(b), which are called clamped boundary conditions.
If s(x) satisfies free boundary conditions, we say that s(x) is a natural spline. The
points x0 , x1 , . . . , xn are called the nodes of s(x).
Clamped boundary conditions are often preferable because they use more information about f(x), which yields a spline that better approximates f(x) on [a, b]. However, if information about f'(x) is not available, then natural boundary conditions must be used instead.
If we define hi = xi − xi−1 , for i = 1, 2, . . . , n, and define an+1 = yn , then the requirement that s(x)
is continuous at the interior nodes implies that we must have si (xi ) = si+1 (xi ) for i = 1, 2, . . . , n−1.
Furthermore, because s(x) must fit the given data, we must also have s(x_n) = s_n(x_n) = y_n. These conditions lead to the constraints
s_i(x_i) = d_i h_i^3 + c_i h_i^2 + b_i h_i + a_i = a_{i+1} = s_{i+1}(x_i), \quad i = 1, 2, \ldots, n.   (5.17)
To ensure that s(x) has a continuous first derivative at the interior nodes, we require that s_i'(x_i) = s_{i+1}'(x_i) for i = 1, 2, \ldots, n - 1, which imposes the constraints
s_i'(x_i) = 3d_i h_i^2 + 2c_i h_i + b_i = b_{i+1} = s_{i+1}'(x_i), \quad i = 1, 2, \ldots, n - 1.   (5.18)
Similarly, to enforce continuity of the second derivative at the interior nodes, we require that s_i''(x_i) = s_{i+1}''(x_i) for i = 1, 2, \ldots, n - 1, which leads to the constraints
s_i''(x_i) = 6d_i h_i + 2c_i = 2c_{i+1} = s_{i+1}''(x_i), \quad i = 1, 2, \ldots, n - 1.   (5.19)
There are 4n coefficients to determine, since there are n cubic polynomials, with 4 coefficients each. However, we have only prescribed 4n - 2 constraints, so we must specify 2 more in order to determine a unique solution. If we use natural boundary conditions, then these constraints are
s_1''(x_0) = 2c_1 = 0,   (5.20)
s_n''(x_n) = 6d_n h_n + 2c_n = 0.   (5.21)
On the other hand, if we use clamped boundary conditions, then our additional constraints are
s_1'(x_0) = b_1 = z_0,   (5.22)
s_n'(x_n) = 3d_n h_n^2 + 2c_n h_n + b_n = z_n,   (5.23)
where z_i = f'(x_i) for i = 0, 1, \ldots, n.
Having determined our constraints that must be satisfied by s(x), we can set up a system of 4n linear equations based on these constraints, and then solve this system to determine the coefficients a_i, b_i, c_i, d_i for i = 1, 2, \ldots, n.
However, it is not necessary to construct the matrix for such a system, because it is possible
to instead solve a smaller system of only O(n) equations obtained from the continuity conditions
(5.18) and the boundary conditions (5.20), (5.21) or (5.22), (5.23), depending on whether natural
or clamped boundary conditions, respectively, are imposed. This reduced system is accomplished
by using equations (5.16), (5.17) and (5.19) to eliminate the ai , bi and di , respectively.
Exercise 5.6.4 Show that under natural boundary conditions, the coefficients c_2, \ldots, c_n of the cubic spline (5.15) satisfy the system of equations Ac = b, where
A = \begin{bmatrix} 2(h_1 + h_2) & h_2 & 0 & \cdots & 0 \\ h_2 & 2(h_2 + h_3) & h_3 & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & h_{n-1} \\ 0 & \cdots & 0 & h_{n-1} & 2(h_{n-1} + h_n) \end{bmatrix},
c = \begin{bmatrix} c_2 \\ \vdots \\ c_n \end{bmatrix}, \qquad b = \begin{bmatrix} \frac{3}{h_2}(a_3 - a_2) - \frac{3}{h_1}(a_2 - a_1) \\ \vdots \\ \frac{3}{h_n}(a_{n+1} - a_n) - \frac{3}{h_{n-1}}(a_n - a_{n-1}) \end{bmatrix}.
Exercise 5.6.5 Show that under clamped boundary conditions, the coefficients c_1, \ldots, c_{n+1} of the cubic spline (5.15) satisfy the system of equations Ac = b, where
A = \begin{bmatrix} 2h_1 & h_1 & 0 & \cdots & \cdots & 0 \\ h_1 & 2(h_1 + h_2) & h_2 & \ddots & & \vdots \\ 0 & h_2 & 2(h_2 + h_3) & h_3 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & 0 \\ \vdots & & \ddots & h_{n-1} & 2(h_{n-1} + h_n) & h_n \\ 0 & \cdots & \cdots & 0 & h_n & 2h_n \end{bmatrix},
c = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_{n+1} \end{bmatrix}, \qquad b = \begin{bmatrix} \frac{3}{h_1}(a_2 - a_1) - 3z_0 \\ \frac{3}{h_2}(a_3 - a_2) - \frac{3}{h_1}(a_2 - a_1) \\ \vdots \\ \frac{3}{h_n}(a_{n+1} - a_n) - \frac{3}{h_{n-1}}(a_n - a_{n-1}) \\ 3z_n - \frac{3}{h_n}(a_{n+1} - a_n) \end{bmatrix},
and c_{n+1} = s_n''(x_n)/2.
Example 5.6.4 We will construct a cubic spline interpolant for the following data on the interval
[0, 2].
j xj yj
0 0 3
1 1/2 −4
2 1 5
3 3/2 −6
4 2 7
The spline, s(x), will consist of four pieces \{s_j(x)\}_{j=1}^{4}, each of which is a cubic polynomial of the form
s_j(x) = a_j + b_j(x - x_{j-1}) + c_j(x - x_{j-1})^2 + d_j(x - x_{j-1})^3, \qquad j = 1, 2, 3, 4.
We will impose natural boundary conditions on this spline, so it will satisfy the conditions s''(0) = s''(2) = 0, in addition to the "essential" conditions imposed on a spline: it must fit the given data and have continuous first and second derivatives on the interval [0, 2].
These conditions lead to the following system of equations that must be solved for the coefficients c_1, c_2, c_3, c_4, and c_5, where c_j = s''(x_{j-1})/2 for j = 1, 2, \ldots, 5. We define h = (2 - 0)/4 = 1/2 to be the spacing between the interpolation points.
c_1 = 0,
\frac{h}{3}(c_1 + 4c_2 + c_3) = \frac{y_2 - 2y_1 + y_0}{h},
\frac{h}{3}(c_2 + 4c_3 + c_4) = \frac{y_3 - 2y_2 + y_1}{h},
\frac{h}{3}(c_3 + 4c_4 + c_5) = \frac{y_4 - 2y_3 + y_2}{h},
c_5 = 0.
Substituting h = 1/2 and the values of y_j, and also taking into account the boundary conditions, we obtain
\frac{1}{6}(4c_2 + c_3) = 32,
\frac{1}{6}(c_2 + 4c_3 + c_4) = -40,
\frac{1}{6}(c_3 + 4c_4) = 48.
This system has the solutions
c2 = 516/7, c3 = −720/7, c4 = 684/7.
Using (5.16), (5.17), and (5.19), we obtain
a_1 = 3, \quad a_2 = -4, \quad a_3 = 5, \quad a_4 = -6,
b_1 = -184/7, \quad b_2 = 74/7, \quad b_3 = -4, \quad b_4 = -46/7,
and
d_1 = 344/7, \quad d_2 = -824/7, \quad d_3 = 936/7, \quad d_4 = -456/7.
We conclude that the spline s(x) that fits the given data, has two continuous derivatives on [0, 2], and satisfies natural boundary conditions is
s(x) = \begin{cases} \frac{344}{7}x^3 - \frac{184}{7}x + 3 & \text{if } x \in [0, 0.5], \\ -\frac{824}{7}(x - 1/2)^3 + \frac{516}{7}(x - 1/2)^2 + \frac{74}{7}(x - 1/2) - 4 & \text{if } x \in [0.5, 1], \\ \frac{936}{7}(x - 1)^3 - \frac{720}{7}(x - 1)^2 - 4(x - 1) + 5 & \text{if } x \in [1, 1.5], \\ -\frac{456}{7}(x - 3/2)^3 + \frac{684}{7}(x - 3/2)^2 - \frac{46}{7}(x - 3/2) - 6 & \text{if } x \in [1.5, 2]. \end{cases}
The graph of the spline is shown in Figure 5.2. 2
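As a check on these computations, the tridiagonal system from Exercise 5.6.4, augmented with the natural boundary conditions c_1 = c_5 = 0, can be assembled and solved numerically; the following Matlab sketch is one way to do so, with variable names chosen for illustration.

% Natural cubic spline coefficients for the data of Example 5.6.4 (illustrative sketch).
x = [0 1/2 1 3/2 2];  y = [3 -4 5 -6 7];
n = length(x) - 1;  h = diff(x);          % subinterval widths h_i
A = zeros(n+1);  r = zeros(n+1,1);
A(1,1) = 1;  A(n+1,n+1) = 1;              % natural boundary conditions: c_1 = c_{n+1} = 0
for i = 2:n
    A(i,i-1:i+1) = [h(i-1) 2*(h(i-1)+h(i)) h(i)];
    r(i) = 3*(y(i+1) - y(i))/h(i) - 3*(y(i) - y(i-1))/h(i-1);
end
c = A\r;                                  % c(2:4) should equal [516 -720 684]/7
b = (y(2:n+1) - y(1:n))./h - h.*(2*c(1:n)' + c(2:n+1)')/3;
d = (c(2:n+1)' - c(1:n)')./(3*h);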
The Matlab function spline can be used to construct cubic splines; by default it imposes the "not-a-knot" boundary conditions, and it can also impose clamped boundary conditions. The following exercises require reading the documentation for this function.
Exercise 5.6.6 Use spline to construct a cubic spline for the data from Example 5.6.4.
First, use the interface pp=spline(x,y), where x and y are vectors consisting of the x-
and y-coordinates, respectively, of the given data points, and pp is a structure that rep-
resents the cubic spline s(x). Examine the members of pp and determine how to interpret
them. Where do you see the coefficients computed in Example 5.6.4?
Exercise 5.6.8 If the input argument y in the function call pp=spline(x,y) has two
components more than x, it is assumed that the first and last components are the slopes
z0 = s0 (x0 ) and zn = s0 (xn ) imposed by clamped boundary conditions. Use the given data
from Example 5.6.4, with various values of z0 and zn , and construct the clamped cubic
spline using this interface to spline. Compare the coefficients and graphs to that of the
natural cubic spline from Exercises 5.6.6 and 5.6.7.
Figure 5.2: Cubic spline that passes through the points (0, 3), (1/2, −4), (1, 5), (2, −6), and (3, 7).
Theorem 5.6.5 Let x_0, x_1, \ldots, x_n be n + 1 distinct points in the interval [a, b], where a = x_0 < x_1 < \cdots < x_n = b, and let f(x) be a function defined on [a, b]. Then f has a unique cubic spline interpolant s(x) that is defined on the nodes x_0, x_1, \ldots, x_n and satisfies the natural boundary conditions s''(a) = s''(b) = 0.
Theorem 5.6.6 Let x_0, x_1, \ldots, x_n be n + 1 distinct points in the interval [a, b], where a = x_0 < x_1 < \cdots < x_n = b, and let f(x) be a function defined on [a, b] that is differentiable at a and b. Then f has a unique cubic spline interpolant s(x) that is defined on the nodes x_0, x_1, \ldots, x_n and satisfies the clamped boundary conditions s'(a) = f'(a) and s'(b) = f'(b).
Just as the linear spline is the “flattest” interpolant, in an average sense, the natural cubic spline
has the least "average curvature". Specifically, if s_2(x) is the natural cubic spline for f \in C[a, b] on [a, b] with knots a = x_0 < x_1 < \cdots < x_n = b, and v \in H^2(a, b) is any interpolant of f with these knots, then
\|s_2''\|_2 \le \|v''\|_2.
This can be proved in the same way as the corresponding result for the linear spline. It is this
property of the natural cubic spline, called the smoothest interpolation property, from which
splines were named.
The following result, proved in [34, p. 57-58], provides insight into the accuracy with which a
cubic spline interpolant s(x) approximates a function f (x).
Theorem 5.6.7 Let f be four times continuously differentiable on [a, b], and assume that \|f^{(4)}\|_\infty = M. Let s(x) be the unique clamped cubic spline interpolant of f(x) on the nodes x_0, x_1, \ldots, x_n, where a = x_0 < x_1 < \cdots < x_n = b. Then for x \in [a, b],
\|f - s\|_\infty \le \frac{5M}{384} \max_{1 \le i \le n} h_i^4,
where h_i = x_i - x_{i-1}.
A similar result applies in the case of natural boundary conditions [6].
s(x). Requiring that s(x) interpolates f(x) at the knots, and that s'(x) interpolates f'(x) at the knots, imposes 2n + 2 constraints on the coefficients. We can then use the remaining 2n - 2 degrees of freedom to require that s(x) belong to C^1[a, b]; that is, it is continuously differentiable on [a, b]. Note that unlike the cubic spline interpolant, the Hermite cubic spline does not have a continuous second derivative.
The following result provides insight into the accuracy with which a Hermite cubic spline inter-
polant s(x) approximates a function f (x).
Theorem 5.6.8 Let f be four times continuously differentiable on [a, b], and assume that \|f^{(4)}\|_\infty = M. Let s(x) be the unique Hermite cubic spline interpolant of f(x) on the nodes x_0, x_1, \ldots, x_n, where a = x_0 < x_1 < \cdots < x_n = b. Then
\|f - s\|_\infty \le \frac{M}{384} \max_{1 \le i \le n} h_i^4,
where h_i = x_i - x_{i-1}.
This can be proved in the same way as the error bound for the linear spline, except that the error
formula for Hermite interpolation is used instead of the error formula for Lagrange interpolation.
Exercise 5.6.9 Prove Theorem 5.6.8.
An advantage of Hermite cubic splines over cubic spline interpolants is that they are local
approximations rather than global; that is, if the values of f (x) and f 0 (x) change at some knot
xi , only the polynomials defined on the pieces containing xi need to be changed. In cubic spline
interpolation, all pieces are coupled, so a change at one point changes the polynomials for all pieces.
To see this, we represent the Hermite cubic spline using the same form as in the cubic spline
interpolant,
s_i(x_{i-1}) = f(x_{i-1}), \quad s_i(x_i) = f(x_i), \quad s_i'(x_{i-1}) = f'(x_{i-1}), \quad s_i'(x_i) = f'(x_i)
Chapter 6
Approximation of Functions
Previously we have considered the problem of polynomial interpolation, in which a function f (x)
is approximated by a polynomial pn (x) that agrees with f (x) at n + 1 distinct points, based on the
assumption that pn (x) will be, in some sense, a good approximation of f (x) at other points. As we
have seen, however, this assumption is not always valid, and in fact, such an approximation can be
quite poor, as demonstrated by Runge’s example.
Therefore, we consider an alternative approach to approximation of a function f (x) on an
interval [a, b] by a polynomial, in which the polynomial is not required to agree with f at any
specific points, but rather approximate f well in an “overall” sense, by not deviating much from
f at any point in [a, b]. This requires that we define an appropriate notion of “distance” between
functions that is, intuitively, consistent with our understanding of distance between numbers or
points in space.
To that end, we can use vector norms, as defined in Section B.11, where the vectors in question
consist of the values of functions at selected points. In this case, the problem can be reduced to a
least squares problem, as discussed in Chapter 6. This is discussed in Section 6.1.
Still, finding an approximation of f (x) that is accurate with respect to any discrete, finite
subset of the domain cannot guarantee that it accurately approximates f (x) on the entire domain.
Therefore, in Section 6.2 we generalize least squares approximations to a continuous setting by
working with norms on function spaces, which are vector spaces in which the vectors are
functions. Such function spaces and norms are reviewed in Section B.12.
In the remainder of the chapter, we consider approximating f (x) by functions other than poly-
nomials. Section 6.3 presents an approach to approximating f (x) with a rational function, to
overcome the limitations of polynomial approximation, while Section 6.4 explores approximation
through trigonometric polynomials, or sines and cosines, to capture the frequency content of f (x).
f (xi ) = yi for i = 1, 2, . . . , m.
However, fitting the data exactly may not be the best approach to describing the data with a
function. We have seen that high-degree polynomial interpolation can yield oscillatory functions
that behave very differently than a smooth function from which the data is obtained. Also, it
may be pointless to try to fit data exactly, for if it is obtained by previous measurements or other
computations, it may be erroneous. Therefore, we consider revising our notion of what constitutes
a “best fit” of given data by a function.
Let f = [f(x_1) \; f(x_2) \; \cdots \; f(x_m)] and let y = [y_1 \; y_2 \; \cdots \; y_m]. One alternative approach to data fitting is to solve the minimax problem, which is the problem of finding a function f(x) of a given form for which
\|f - y\|_\infty = \max_{1 \le i \le m} |f(x_i) - y_i|
is minimized.
is minimized. However, we cannot apply standard minimization techniques to this function, be-
cause, like the absolute value function that it employs, it is not differentiable.
This defect is overcome by considering the problem of finding f (x) of a given form for which
\|f - y\|_2^2 = \sum_{i=1}^{m} [f(x_i) - y_i]^2
is minimized. This is known as the least squares problem. We will first show how this problem is
solved for the case where f (x) is a linear function of the form f (x) = a1 x + a0 , and then generalize
this solution to other types of functions.
When f (x) is linear, the least squares problem is the problem of finding constants a0 and a1
such that the function
E(a_0, a_1) = \sum_{i=1}^{m} (a_1 x_i + a_0 - y_i)^2
is minimized. In order to minimize this function of a0 and a1 , we must compute its partial derivatives
with respect to a0 and a1 . This yields
\frac{\partial E}{\partial a_0} = \sum_{i=1}^{m} 2(a_1 x_i + a_0 - y_i), \qquad \frac{\partial E}{\partial a_1} = \sum_{i=1}^{m} 2(a_1 x_i + a_0 - y_i)\, x_i.
At a minimum, both of these partial derivatives must be equal to zero. This yields the system of
linear equations
m\, a_0 + \left( \sum_{i=1}^{m} x_i \right) a_1 = \sum_{i=1}^{m} y_i,
\left( \sum_{i=1}^{m} x_i \right) a_0 + \left( \sum_{i=1}^{m} x_i^2 \right) a_1 = \sum_{i=1}^{m} x_i y_i.
Example 6.1.1 We wish to find the linear function y = a1 x + a0 that best approximates the data
shown in Table 6.1, in the least-squares sense. Using the summations
Table 6.1: Data points (xi , yi ), for i = 1, 2, . . . , 10, to be fit by a linear function
i xi yi
1 2.0774 3.3123
2 2.3049 3.8982
3 3.0125 4.6500
4 4.7092 6.5576
5 5.5016 7.5173
6 5.8704 7.0415
7 6.2248 7.7497
8 8.4431 11.0451
9 8.7594 9.8179
10 9.3900 12.2477
\sum_{i=1}^{m} x_i = 56.2933, \quad \sum_{i=1}^{m} x_i^2 = 380.5426, \quad \sum_{i=1}^{m} y_i = 73.8373, \quad \sum_{i=1}^{m} x_i y_i = 485.9487,
we obtain
a_0 = \frac{380.5426 \cdot 73.8373 - 56.2933 \cdot 485.9487}{10 \cdot 380.5426 - 56.2933^2} = \frac{742.5703}{636.4906} = 1.1667,
a_1 = \frac{10 \cdot 485.9487 - 56.2933 \cdot 73.8373}{10 \cdot 380.5426 - 56.2933^2} = \frac{702.9438}{636.4906} = 1.1044.
We conclude that the linear function that best fits this data in the least-squares sense is
y = 1.1044x + 1.1667.
Figure 6.1: Data points (xi , yi ) (circles) and least-squares line (solid line)
Exercise 6.1.2 Generalize the above derivation of the coefficients a0 and a1 of the least-
squares line to obtain formulas for the coefficients a, b and c of the quadratic function
y = ax2 + bx + c that best fits the data (xi , yi ), i = 1, 2, . . . , m, in the least-squares sense.
Then generalize your function leastsqline from Exercise 6.1.1 to obtain a new function
leastsqquad that computes these coefficients.
It is interesting to note that if we define the m \times 2 matrix A, the 2-vector a, and the m-vector y by
A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}, \qquad a = \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix},
then a is the solution to the system of equations
A^T A a = A^T y.
These equations are the normal equations defined earlier, written in matrix-vector form. They arise from the problem of minimizing the residual norm \|Aa - y\|_2.
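For illustration, given column vectors x and y containing the data, the least-squares line can be computed in Matlab either from the normal equations or with the backslash operator; the following sketch shows both, and the variable names are assumptions.

% Least-squares line from the normal equations and via backslash (illustrative sketch).
A  = [ones(size(x)) x];                   % m-by-2 matrix from the derivation above
a  = (A'*A)\(A'*y);                       % normal equations: a = [a_0; a_1]
a2 = A\y;                                 % equivalent least-squares solve via QR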
Our goal is to minimize the sum of squares of the deviations in p_n(x) from each y-value,
E(a) = \sum_{i=1}^{m} [p_n(x_i) - y_i]^2 = \sum_{i=1}^{m} \left[ \sum_{j=0}^{n} a_j x_i^j - y_i \right]^2.
Setting each of these partial derivatives equal to zero yields the system of equations
\sum_{j=0}^{n} \left( \sum_{i=1}^{m} x_i^{j+k} \right) a_j = \sum_{i=1}^{m} x_i^k y_i, \qquad k = 0, 1, \ldots, n.
These are the normal equations. They are a generalization of the normal equations previously
defined for the linear case, where n = 1. Solving this system yields the coefficients {aj }nj=0 of the
least-squares polynomial pn (x).
As in the linear case, the normal equations can be written in matrix-vector form
A^T A a = A^T y,
where
A = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^n \\ 1 & x_2 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^n \end{bmatrix}, \qquad a = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}.   (6.1)
The matrix A is called a Vandermonde matrix for the points x_1, x_2, \ldots, x_m.
The normal equations can be used to compute the coefficients of any linear combination of functions \{\phi_j(x)\}_{j=0}^n that best fits data in the least-squares sense, provided that these functions are linearly independent. In this general case, the entries of the matrix A are given by a_{ij} = \phi_j(x_i), for i = 1, 2, \ldots, m and j = 0, 1, \ldots, n.
Example 6.1.2 We wish to find the quadratic function y = a2 x2 + a1 x + a0 that best approximates
the data shown in Table 6.2, in the least-squares sense. By defining
Table 6.2: Data points (xi , yi ), for i = 1, 2, . . . , 10, to be fit by a quadratic function
i xi yi
1 2.0774 2.7212
2 2.3049 3.7798
3 3.0125 4.8774
4 4.7092 6.6596
5 5.5016 10.5966
6 5.8704 9.8786
7 6.2248 10.5232
8 8.4431 23.3574
9 8.7594 24.0510
10 9.3900 27.4827
A = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{10} & x_{10}^2 \end{bmatrix}, \qquad a = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{10} \end{bmatrix},
and solving the normal equations
A^T A a = A^T y,
we obtain the coefficients
a_0 = 4.7681, \quad a_1 = -1.5193, \quad a_2 = 0.4251,
and conclude that the quadratic function that best fits this data in the least-squares sense is
y = 0.4251x^2 - 1.5193x + 4.7681.
The data, and this function, are shown in Figure 6.2. 2
Figure 6.2: Data points (xi , yi ) (circles) and quadratic least-squares fit (solid curve)
Exercise 6.1.4 Test your function leastsqpoly from Exercise 6.1.3 to approximate the
function y = e−cx on the interval [0, 1] where c is a chosen positive constant. Experiment
with different values of c, as well as m and n, the number of data points and degree of
the approximating polynomial, respectively. What combination yields the smallest relative
residual kAa − yk2 /kyk2 ?
Least-squares fitting can also be used to fit data with functions that are not linear combinations
of functions such as polynomials. Suppose we believe that given data points can best be matched
to an exponential function of the form y = beax , where the constants a and b are unknown. Taking
the natural logarithm of both sides of this equation yields
ln y = ln b + ax.
If we define z = \ln y and c = \ln b, then the problem of fitting the original data points \{(x_i, y_i)\}_{i=1}^{m}
with an exponential function is transformed into the problem of fitting the data points \{(x_i, z_i)\}_{i=1}^{m}
with a linear function of the form c + ax, for unknown constants a and c.
Similarly, suppose the given data is believed to approximately conform to a function of the form
y = bxa , where the constants a and b are unknown. Taking the natural logarithm of both sides of
this equation yields
ln y = ln b + a ln x.
If we define z = \ln y, c = \ln b and w = \ln x, then the problem of fitting the original data points \{(x_i, y_i)\}_{i=1}^{m} with a constant times a power of x is transformed into the problem of fitting the data points \{(w_i, z_i)\}_{i=1}^{m} with a linear function of the form c + aw, for unknown constants a and c.
Example 6.1.3 We wish to find the exponential function y = beax that best approximates the data
shown in Table 6.3, in the least-squares sense. By defining
A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_5 \end{bmatrix}, \qquad c = \begin{bmatrix} c \\ a \end{bmatrix}, \qquad z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_5 \end{bmatrix},
where c = \ln b and z_i = \ln y_i for i = 1, 2, \ldots, 5, and solving the normal equations
A^T A c = A^T z,
we obtain the coefficients
a = 0.4040, \qquad b = e^c = e^{-0.2652} = 0.7670,
and conclude that the exponential function that best fits this data in the least-squares sense is
y = 0.7670\, e^{0.4040x}.
2
that minimizes
E(c_0, c_1, \ldots, c_n) = \|f_n - f\|_2^2 = \int_a^b [f_n(x) - f(x)]^2\,dx = \int_a^b \left[ \sum_{j=0}^{n} c_j \phi_j(x) - f(x) \right]^2 dx,
where
\|f_n - f\|_2 = \left( \int_a^b [f_n(x) - f(x)]^2\,dx \right)^{1/2}.
We refer to fn as the best approximation in span(φ0 , φ1 , . . . , φn ) to f in the 2-norm on (a, b).
This minimization can be performed for f ∈ C[a, b], the space of functions that are continuous
on [a, b], but it is not necessary for a function f (x) to be continuous for kf k2 to be defined. Rather,
we consider the space L2 (a, b), the space of real-valued functions such that |f (x)|2 is integrable over
(a, b). Both of these spaces, in addition to being normed spaces, are also inner product spaces, as
they are equipped with an inner product
\langle f, g \rangle = \int_a^b f(x)\, g(x)\,dx.
and requiring that each partial derivative be equal to zero yields the normal equations
\sum_{j=0}^{n} \left( \int_a^b \phi_k(x)\, \phi_j(x)\,dx \right) c_j = \int_a^b \phi_k(x)\, f(x)\,dx, \qquad k = 0, 1, \ldots, n.
We can then solve this system of equations to obtain the coefficients \{c_j\}_{j=0}^n. This system can be solved as long as the functions \{\phi_j(x)\}_{j=0}^n are linearly independent. That is, the condition
\sum_{j=0}^{n} c_j \phi_j(x) \equiv 0, \qquad x \in [a, b],
must imply that c_0 = c_1 = \cdots = c_n = 0.
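As an illustration, the following Matlab sketch assembles and solves these normal equations for the monomial basis \phi_j(x) = x^j, using the built-in integral function; the choice f(x) = e^x on [0, 5] with n = 4 mirrors Figure 6.3 and is otherwise an assumption of this sketch.

% Continuous least-squares normal equations for the monomial basis (illustrative sketch).
f = @(x) exp(x);  a = 0;  b = 5;  n = 4;
G = zeros(n+1);  r = zeros(n+1,1);
for k = 0:n
    for j = 0:n
        G(k+1,j+1) = integral(@(x) x.^(k+j), a, b);   % <phi_k, phi_j>
    end
    r(k+1) = integral(@(x) x.^k.*f(x), a, b);         % <phi_k, f>
end
c = G\r;      % coefficients of f_n(x) = c_0 + c_1*x + ... + c_n*x^n
              % note: G becomes ill-conditioned as n grows (cf. Exercise 6.2.3)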
Exercise 6.2.1 Prove that the functions {φj (x)}nj=0 are linearly independent in C[a, b]
if, for j = 0, 1, . . . , n, φj (x) is a polynomial of degree j.
where the functions {φj (x)}nj=0 are real-valued functions that are linearly independent
in C[a, b]. Prove that A is symmetric positive definite. Why is the assumption of linear
independence essential? How does this guarantee that the solution of the normal equations
yields a minimum rather than a maximum or saddle point?
Figure 6.3: Graphs of f (x) = ex (red dashed curve) and 4th-degree continuous least-squares poly-
nomial approximation f4 (x) on [0, 5] (blue solid curve)
Exercise 6.2.3 Repeat Example 6.2.1 with f (x) = x7 . What happens to the coefficients
{cj }4j=0 if the right-hand side vector b is perturbed?
For the remainder of this section, we restrict ourselves to the case where the functions {φj (x)}nj=0
are polynomials. These polynomials form a basis of Pn , the vector space of polynomials of degree
at most n. Then, for f ∈ L2 (a, b), we refer to the polynomial fn that minimizes kf − pk2 over
all p ∈ Pn as the best 2-norm approximating polynomial, or least-squares approximating
polynomial, of degree n to f on (a, b).
It follows that the coefficients {cj }nj=0 of the least-squares approximation fn (x) are simply
\[ c_k = \frac{\langle \phi_k, f \rangle}{\|\phi_k\|_2^2}, \qquad k = 0, 1, \ldots, n. \]
If the constants {αk }nk=0 above satisfy αk = 1 for k = 0, 1, . . . , n, then we say that the orthogonal set
of functions {φj (x)}nj=0 is orthonormal. In that case, the solution to the continuous least-squares
problem is simply given by
\[ c_k = \langle \phi_k, f \rangle, \qquad k = 0, 1, \ldots, n. \tag{6.2} \]
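To make the normal equations and the coefficient formula above concrete, here is a minimal Matlab sketch that computes a continuous least-squares approximation of a given f on (a, b) using the monomial basis \phi_j(x) = x^j (a hypothetical choice made only for illustration). The Gram matrix entries and the right-hand side are evaluated with the built-in integral function; if the basis were orthogonal, G would be diagonal and the solve would reduce to formula (6.2).

f = @(x) exp(x);          % function to approximate (illustrative choice)
a = 0;  b = 1;  n = 3;    % interval and degree
G = zeros(n+1);  r = zeros(n+1, 1);
for k = 0:n
    for j = 0:n
        G(k+1, j+1) = integral(@(x) x.^k .* x.^j, a, b);   % <phi_k, phi_j>
    end
    r(k+1) = integral(@(x) x.^k .* f(x), a, b);            % <phi_k, f>
end
c = G \ r;    % coefficients c_0, ..., c_n of the least-squares polynomial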
Next, we will learn how sets of orthogonal polynomials can be constructed.
\[
p_1 = a_1, \qquad
p_2 = a_2 - \frac{p_1 \cdot a_2}{p_1 \cdot p_1}\, p_1, \qquad \ldots, \qquad
p_n = a_n - \sum_{j=1}^{n-1} \frac{p_j \cdot a_n}{p_j \cdot p_j}\, p_j,
\]
and a set of orthonormal vectors {q_j}_{j=1}^n, in that they are orthogonal (q_k \cdot q_j = 0 for k \neq j) and unit vectors (q_j \cdot q_j = 1).
We can use a similar process to compute a set of orthogonal polynomials {pj (x)}nj=0 . For
convenience, we will require that all polynomials in the set be monic; that is, their leading (highest-degree) coefficient must be equal to 1. We then define p_0(x) = 1. Then, because p_1(x) is supposed to
be of degree 1, it must have the form p1 (x) = x − α1 for some constant α1 . To ensure that p1 (x)
is orthogonal to p0 (x), we compute their inner product, and obtain
0 = hp0 , p1 i = h1, x − α1 i,
so we must have
\[ \alpha_1 = \frac{\langle 1, x \rangle}{\langle 1, 1 \rangle}. \]
For j > 1, we start by setting pj (x) = xpj−1 (x), since pj should be of degree one greater
than that of pj−1 , and this satisfies the requirement that pj be monic. Then, we need to subtract
polynomials of lower degree to ensure that pj is orthogonal to pi , for i < j. To that end, we apply
Gram-Schmidt orthogonalization and obtain
\[ p_j(x) = x\, p_{j-1}(x) - \sum_{i=0}^{j-1} \frac{\langle p_i, x\, p_{j-1} \rangle}{\langle p_i, p_i \rangle}\, p_i(x). \]
However, by the definition of the inner product, hpi , xpj−1 i = hxpi , pj−1 i. Furthermore, because
xpi is of degree i + 1, and pj−1 is orthogonal to all polynomials of degree less than j, it follows that
hpi , xpj−1 i = 0 whenever i < j − 1.
We have shown that sequences of orthogonal polynomials satisfy a three-term recurrence relation
\[ p_j(x) = (x - \alpha_j)\, p_{j-1}(x) - \beta_{j-1}^2\, p_{j-2}(x), \qquad j > 1, \]
where the recursion coefficients \alpha_j and \beta_{j-1}^2 are defined by
\[ \alpha_j = \frac{\langle p_{j-1}, x\, p_{j-1} \rangle}{\langle p_{j-1}, p_{j-1} \rangle}, \qquad \beta_{j-1}^2 = \frac{\langle p_{j-1}, p_{j-1} \rangle}{\langle p_{j-2}, p_{j-2} \rangle}, \qquad j > 1.
\]
The recurrence is started with
\[ p_1(x) = (x - \alpha_1)\, p_0(x), \qquad \alpha_1 = \frac{\langle p_0, x\, p_0 \rangle}{\langle p_0, p_0 \rangle}, \qquad \beta_0^2 = \langle p_0, p_0 \rangle. \]
Exercise 6.2.5 Write a Matlab function P=orthpoly(a,b,w,n) that computes the co-
efficients of monic orthogonal polynomials on the interval (a, b), up to and including
degree n, and stores their coefficients in the rows of the (n + 1) × (n + 1) matrix P . The
vector w stores the coefficients of a polynomial w(x) that serves as the weight function.
Use Matlab’s polynomial functions to evaluate the required inner products. How can
you ensure that the weight function does not change sign on (a, b)?
L0 (x) = 1, (6.3)
L1 (x) = x, (6.4)
\[ L_{j+1}(x) = \frac{2j+1}{j+1}\, x\, L_j(x) - \frac{j}{j+1}\, L_{j-1}(x), \qquad j = 1, 2, \ldots \tag{6.5} \]
These are known as the Legendre polynomials [23]. One of their most important applications is
in the construction of Gaussian quadrature rules (see Section 7.5). Specifically, the roots of Ln (x),
for n ≥ 1, are the nodes of a Gaussian quadrature rule for the interval (−1, 1). However, they
can also be used to easily compute continuous least-squares polynomial approximations, as the
following example shows.
Example 6.2.2 We will use Legendre polynomials to approximate f (x) = cos x on [−π/2, π/2] by
a quadratic polynomial. First, we note that the first three Legendre polynomials, which are the ones
of degree 0, 1 and 2, are
\[ L_0(x) = 1, \qquad L_1(x) = x, \qquad L_2(x) = \frac{1}{2}(3x^2 - 1). \]
However, it is not practical to use these polynomials directly to approximate f (x), because they
are orthogonal with respect to the inner product defined on the interval (−1, 1), and we wish to
approximate f (x) on (−π/2, π/2).
To obtain orthogonal polynomials on (−π/2, π/2), we replace x by 2t/π, where t belongs to
[−π/2, π/2], in the Legendre polynomials, which yields
\[ \tilde{L}_0(t) = 1, \qquad \tilde{L}_1(t) = \frac{2t}{\pi}, \qquad \tilde{L}_2(t) = \frac{1}{2}\left( \frac{12}{\pi^2}\, t^2 - 1 \right). \]
Then, we can express our quadratic approximation f_2(x) of f(x) by the linear combination
\[ f_2(x) = c_0 \tilde{L}_0(x) + c_1 \tilde{L}_1(x) + c_2 \tilde{L}_2(x), \]
where
\[ c_j = \frac{\langle f, \tilde{L}_j \rangle}{\langle \tilde{L}_j, \tilde{L}_j \rangle}, \qquad j = 0, 1, 2. \]
We compute the required inner products:
\[
\begin{aligned}
\langle f, \tilde{L}_0 \rangle &= \int_{-\pi/2}^{\pi/2} \cos t\, dt = 2, \\
\langle f, \tilde{L}_1 \rangle &= \int_{-\pi/2}^{\pi/2} \frac{2t}{\pi} \cos t\, dt = 0, \\
\langle f, \tilde{L}_2 \rangle &= \int_{-\pi/2}^{\pi/2} \frac{1}{2}\left( \frac{12}{\pi^2} t^2 - 1 \right) \cos t\, dt = \frac{2}{\pi^2}(\pi^2 - 12), \\
\langle \tilde{L}_0, \tilde{L}_0 \rangle &= \int_{-\pi/2}^{\pi/2} 1\, dt = \pi, \\
\langle \tilde{L}_1, \tilde{L}_1 \rangle &= \int_{-\pi/2}^{\pi/2} \left( \frac{2t}{\pi} \right)^2 dt = \frac{\pi}{3}, \\
\langle \tilde{L}_2, \tilde{L}_2 \rangle &= \int_{-\pi/2}^{\pi/2} \left[ \frac{1}{2}\left( \frac{12}{\pi^2} t^2 - 1 \right) \right]^2 dt = \frac{\pi}{5}.
\end{aligned}
\]
It follows that
\[ c_0 = \frac{2}{\pi}, \qquad c_1 = 0, \qquad c_2 = \frac{2}{\pi^2}(\pi^2 - 12) \cdot \frac{5}{\pi} = \frac{10}{\pi^3}(\pi^2 - 12), \]
and therefore
\[ f_2(x) = \frac{2}{\pi} + \frac{5}{\pi^3}(\pi^2 - 12)\left( \frac{12}{\pi^2} x^2 - 1 \right) \approx 0.98016 - 0.4177 x^2. \]
Exercise 6.2.6 Write a Matlab script that computes the coefficients of the Legendre
polynomials up to a given degree n, using the recurrence relation (6.5) and the function
conv for multiplying polynomials. Then, plot the graphs of these polynomials on the
interval (−1, 1). What properties can you observe in these graphs? Is there any symmetry
to them?
Exercise 6.2.7 Prove that the Legendre polynomial Lj (x) is an odd function if j is odd,
and an even function if j is even. Hint: use mathematical induction.
Figure 6.4: Graph of cos x (solid blue curve) and its continuous least-squares quadratic approxi-
mation (red dashed curve) on (−π/2, π/2)
Exercise 6.2.8 Let A be the Vandermonde matrix from (6.1), where the points
x1 , x2 , . . . , xm are equally spaced points in the interval (−1, 1). Construct this matrix
in Matlab for a small chosen value of n and a large value of m, and then compute the
QR factorization of A (See Chapter 6). How do the columns of Q relate to the Legendre
polynomials?
It is possible to compute sequences of orthogonal polynomials with respect to other inner products.
A generalization of the inner product that we have been using is defined by
\[ \langle f, g \rangle = \int_a^b f(x)\, g(x)\, w(x)\, dx, \]
where w(x) is a weight function. To be a weight function, it is required that w(x) \geq 0 on (a, b), and that w(x) not be identically zero on any subinterval of (a, b). So far, we have only considered the case of w(x) \equiv 1.
Exercise 6.2.9 Prove that the discussion of Section 6.2.2 also applies when using the
inner product
\[ \langle f, g \rangle = \int_a^b f(x)\, g(x)\, w(x)\, dx, \]
where w(x) is a weight function that satisfies w(x) ≥ 0 on (a, b). That is, polynomials
orthogonal with respect to this inner product also satisfy a three-term recurrence relation,
with analogous definitions of the recursion coefficients αj and βj .
Another weight function of interest is
\[ w(x) = \frac{1}{\sqrt{1 - x^2}}, \qquad -1 < x < 1. \]
A sequence of polynomials that is orthogonal with respect to this weight function, and the associated
inner product
\[ \langle f, g \rangle = \int_{-1}^{1} f(x)\, g(x)\, \frac{1}{\sqrt{1 - x^2}}\, dx, \]
is the sequence of Chebyshev polynomials, previously introduced in Section 5.4.2:
T0 (x) = 1,
T1 (x) = x,
Tj+1 (x) = 2xTj (x) − Tj−1 (x), j = 1, 2, . . .
In Section 6.4, we will investigate continuous and discrete least-squares approximation of functions
by linear combinations of trigonometric polynomials such as cos jθ or sin jθ, which will reveal how
these coefficients hf, Tj i can be computed very rapidly.
where w is a function handle for the weight function w(x). Use the built-in function
integral to evaluate the required inner products. Make the fifth argument w an optional
argument, using w(x) ≡ 1 as a default.
Exercise 6.2.11 Compute the best 2-norm approximating polynomial of degree 3 to the
functions f (x) = ex and g(x) = sin πx on (−1, 1), using both Legendre and Chebyshev
polynomials. Comment on the accuracy of these approximations.
\[ \langle f - f_n, p \rangle = 0 \]
for all p \in P_n.
Exercise 6.2.12 Use (6.6) to prove one part of Theorem 6.2.3: assume f_n is the best 2-norm approximating polynomial to f \in L^2(a, b), and show that \langle f - f_n, p \rangle = 0 for any p \in P_n.
Exercise 6.2.13 Use the Cauchy-Schwarz inequality to prove the converse of Exercise 6.2.12: that if f \in L^2(a, b) and \langle f - f_n, p \rangle = 0 for an arbitrary p \in P_n, then f_n is the best 2-norm approximating polynomial of degree n to f; that is,
\[ \|f - f_n\|_2 \leq \|f - p\|_2. \]
and let the points \xi_1, \xi_2, \ldots, \xi_k be the points in (a, b) at which \varphi_j(x) changes sign. This set of points cannot be empty, because \varphi_j, being a polynomial of degree at least one, is orthogonal to any constant function, which means
\[ \int_a^b \varphi_j(x)\, w(x)\, dx = 0. \]
Because w(x) is a weight function, it does not change sign. Therefore, in order for the integral to
be zero, ϕj (x) must change sign at least once in (a, b).
If we define
πk (x) = (x − ξ1 )(x − ξ2 ) · · · (x − ξk ),
then ϕj (x)πk (x) does not change sign on (a, b), because πk changes sign at exactly the same points
in (a, b) as ϕj . Because both polynomials are also nonzero on (a, b), we must have
\[ \langle \varphi_j, \pi_k \rangle = \int_a^b \varphi_j(x)\, \pi_k(x)\, w(x)\, dx \neq 0. \]
If k < j, then we have a contradiction, because ϕj is orthogonal to any polynomial of lesser degree.
Therefore, k ≥ j. However, if k > j, we also have a contradiction, because a polynomial of degree
j cannot change sign more than j times on the entire real number line, let alone an interval. We
conclude that k = j, which implies that all of the roots of ϕj are real and distinct, and lie in (a, b).
Exercise 6.2.14 Use your function orthpoly from Exercise 6.2.5 to generate orthogonal
polynomials of a fixed degree n for various weight functions. How does the distribution of
the roots of pn (x) vary based on where the weight function has smaller or larger values?
Hint: consider the distribution of the roots of Chebyshev polynomials, and their weight
function w(x) = (1 − x2 )−1/2 .
\[ r_{m,n}(x) = \frac{p_m(x)}{q_n(x)} = \frac{a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m}{b_0 + b_1 x + b_2 x^2 + \cdots + b_n x^n}, \]
where pm (x) and qn (x) are polynomials of degree m and n, respectively. For convenience, we impose
b0 = 1, since otherwise the other coefficients can simply be scaled.
To construct pm (x) and qn (x), we generalize approximation of f (x) by a Taylor polynomial of
degree n. Consider the error
\[ E(x) = f(x) - r_{m,n}(x) = \frac{f(x)\, q_n(x) - p_m(x)}{q_n(x)}. \]
As in Taylor polynomial approximation, our goal is to choose the coefficients of p_m and q_n so that
\[ E(0) = E'(0) = \cdots = E^{(m+n)}(0) = 0. \]
That is, 0 is a root of multiplicity m + n + 1. It follows that x^{m+n+1} is included in the factorization of the numerator of E(x).
For convenience, we express p and q as polynomials of degree m + n, by padding them with coefficients that are zero: a_{m+1} = a_{m+2} = \cdots = a_{m+n} = 0 and b_{n+1} = b_{n+2} = \cdots = b_{n+m} = 0.
Taking a Maclaurin expansion of f(x),
\[ f(x) = \sum_{i=0}^{\infty} c_i x^i, \qquad c_i = \frac{f^{(i)}(0)}{i!}, \]
we can then ensure that 0 is a root of multiplicity m + n + 1 if the numerator f(x)\, q_n(x) - p_m(x) has no terms of degree m + n or less; that is, each coefficient of x^i, for i = 0, 1, \ldots, m + n, must equal zero.
This entails solving the system of m + n + 1 equations
c0 = a0
c1 + b1 c0 = a1
c2 + b1 c1 + b2 c0 = a2 (6.7)
..
.
cm+n + b1 cm+n−1 + · · · + bm+n c0 = am+n .
Example 6.3.1 We consider the approximation of f (x) = e−x by a rational function of the form
\[ r_{2,3}(x) = \frac{a_0 + a_1 x + a_2 x^2}{1 + b_1 x + b_2 x^2 + b_3 x^3}. \]
The Maclaurin series for f (x) has coefficients cj = (−1)j /j!. The system of equations (6.7) becomes
c0 = a0
c1 + b1 c0 = a1
c2 + b1 c1 + b2 c0 = a2
c3 + b1 c2 + b2 c1 + b3 c0 = 0
c4 + b1 c3 + b2 c2 + b3 c1 = 0
c5 + b1 c4 + b2 c3 + b3 c2 = 0.
This system can be written in matrix-vector form Ax = b, where
\[
A = \begin{bmatrix}
-1 & & & & & \\
& -1 & & c_0 & & \\
& & -1 & c_1 & c_0 & \\
& & & c_2 & c_1 & c_0 \\
& & & c_3 & c_2 & c_1 \\
& & & c_4 & c_3 & c_2
\end{bmatrix}, \qquad
x = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}, \qquad
b = \begin{bmatrix} -c_0 \\ -c_1 \\ -c_2 \\ -c_3 \\ -c_4 \\ -c_5 \end{bmatrix}.
\]
If possible, we would like to work with a structure that facilitates Gaussian elimination. To that
end, we can reverse the rows and columns of A, x and b to obtain the system
\[
A = \begin{bmatrix}
c_2 & c_3 & c_4 & & & \\
c_1 & c_2 & c_3 & & & \\
c_0 & c_1 & c_2 & & & \\
& c_0 & c_1 & -1 & & \\
& & c_0 & & -1 & \\
& & & & & -1
\end{bmatrix}, \qquad
x = \begin{bmatrix} b_3 \\ b_2 \\ b_1 \\ a_2 \\ a_1 \\ a_0 \end{bmatrix}, \qquad
b = \begin{bmatrix} -c_5 \\ -c_4 \\ -c_3 \\ -c_2 \\ -c_1 \\ -c_0 \end{bmatrix}.
\]
It follows that Gaussian elimination can be carried out by eliminating m = 2 entries in each of the
first n = 3 columns. After that, the matrix will be reduced to upper triangular form so that back
substitution can be carried out. If pivoting is required, it can be carried out on only the first n rows,
because due to the block lower triangular structure of A, it follows that A is nonsingular if and only
if the upper left n × n block is.
After carrying out Gaussian elimination for this example, with Maclaurin series coefficients
cj = (−1)j /j!, we obtain the rational approximation
\[ e^{-x} \approx r_{2,3}(x) = \frac{p_2(x)}{q_3(x)} = \frac{\frac{1}{20}x^2 - \frac{2}{5}x + 1}{\frac{1}{60}x^3 + \frac{3}{20}x^2 + \frac{3}{5}x + 1}. \]
Plotting the error in this approximation on the interval [0, 1], we see that the error is maximum at
x = 1, at roughly 4.5 × 10−5 . 2
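The linear system above can be formed and solved directly. Here is a minimal Matlab sketch (an illustration, not the book's code) that sets up the 6-by-6 Padé system for e^{-x} with m = 2 and n = 3 and recovers the coefficients shown above.

m = 2;  n = 3;
N = m + n;                             % highest power retained
c = ((-1).^(0:N) ./ factorial(0:N)).'; % Maclaurin coefficients of exp(-x)

% Unknowns ordered as [a_0 ... a_m, b_1 ... b_n].
A = zeros(N+1);  rhs = -c;
A(:, 1:m+1) = -eye(N+1, m+1);          % columns multiplying a_0, ..., a_m
for k = 1:n                            % columns multiplying b_1, ..., b_n
    A(k+1:N+1, m+1+k) = c(1:N+1-k);
end
coeffs = A \ rhs;
a = coeffs(1:m+1);                     % should be [1; -2/5; 1/20]
b = [1; coeffs(m+2:end)];              % should be [1; 3/5; 3/20; 1/60]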
Example 6.3.2 If we apply nested multiplication to p2 (x) and q3 (x) from Example 6.3.1, we obtain
\[ p_2(x) = 1 + x\left( -\frac{2}{5} + \frac{1}{20}x \right), \qquad q_3(x) = 1 + x\left( \frac{3}{5} + x\left( \frac{3}{20} + \frac{1}{60}x \right) \right). \]
It follows that evaluating r2,3 (x) requires 5 multiplications, 5 additions, and one division.
An alternative approach is to represent r2,3 (x) as a continued fraction [29, p. 285-322]. We
have
\[
\begin{aligned}
r_{2,3}(x) = \frac{p_2(x)}{q_3(x)}
&= \frac{\frac{1}{20}x^2 - \frac{2}{5}x + 1}{\frac{1}{60}x^3 + \frac{3}{20}x^2 + \frac{3}{5}x + 1} \\
&= \frac{3}{\dfrac{x^3 + 9x^2 + 36x + 60}{x^2 - 8x + 20}} \\
&= \frac{3}{x + 17 + \dfrac{152x - 280}{x^2 - 8x + 20}} \\
&= \frac{3}{x + 17 + \dfrac{8}{\dfrac{x^2 - 8x + 20}{19x - 35}}} \\
&= \frac{3}{x + 17 + \dfrac{152}{x - \dfrac{117}{19} + \dfrac{3125/361}{x - 35/19}}}.
\end{aligned}
\]
In this form, evaluation of r2,3 (x) requires three divisions, no multiplications, and five additions,
resulting in significantly more efficiency than using nested multiplication on p2 (x) and q3 (x). 2
It is important to note that the efficiency of this approach comes from the ability to make the polynomial in each denominator monic (that is, having a leading coefficient of one), which removes the need for a multiplication.
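As an illustration (not the general routine requested in Exercise 6.3.2 below), the particular continued fraction above can be evaluated in a few lines of Matlab; the coefficients are those derived in Example 6.3.2.

% Evaluate r_{2,3}(x) in continued-fraction form at a vector of points x.
x = linspace(0, 1, 101);
r = 3 ./ (x + 17 + 152 ./ (x - 117/19 + (3125/361) ./ (x - 35/19)));

% Compare with direct evaluation of p_2(x)/q_3(x).
p = [1/20 -2/5 1];  q = [1/60 3/20 3/5 1];
err = max(abs(r - polyval(p, x) ./ polyval(q, x)));   % should be near machine precision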
Exercise 6.3.2 Write a Matlab function y=contfrac(p,q,x) that takes as input poly-
nomials p(x) and q(x), represented as vectors of coefficients p and q, respectively, and
outputs y = p(x)/q(x) by evaluating p(x)/q(x) as a continued fraction. Hint: use the
Matlab function deconv to divide polynomials.
Example 6.3.3 We consider the approximation of f (x) = e−x by a rational function of the form
\[ r_{2,3}(x) = \frac{a_0 T_0(x) + a_1 T_1(x) + a_2 T_2(x)}{1 + b_1 T_1(x) + b_2 T_2(x) + b_3 T_3(x)}. \]
The Chebyshev series (6.8) for f (x) has coefficients cj that can be obtained using the result of
Exercise 6.3.3. The system of equations implied by (6.10) becomes
\[
\begin{aligned}
c_0 + \tfrac{1}{2}(b_1 c_1 + b_2 c_2 + b_3 c_3) &= a_0 \\
c_1 + b_1 c_0 + \tfrac{1}{2}(b_1 c_2 + b_2 c_1 + b_2 c_3 + b_3 c_2 + b_3 c_4) &= a_1 \\
c_2 + b_2 c_0 + \tfrac{1}{2}(b_1 c_1 + b_1 c_3 + b_2 c_4 + b_3 c_1 + b_3 c_5) &= a_2 \\
c_3 + b_3 c_0 + \tfrac{1}{2}(b_1 c_2 + b_1 c_4 + b_2 c_1 + b_2 c_5 + b_3 c_6) &= 0 \\
c_4 + \tfrac{1}{2}(b_1 c_3 + b_1 c_5 + b_2 c_2 + b_2 c_6 + b_3 c_1 + b_3 c_7) &= 0 \\
c_5 + \tfrac{1}{2}(b_1 c_4 + b_1 c_6 + b_2 c_3 + b_2 c_7 + b_3 c_2 + b_3 c_8) &= 0.
\end{aligned}
\]
This can be written as Ax = b, where, reading the entries off of the system above,
\[
A = \begin{bmatrix}
-1 & & & \tfrac{1}{2}c_1 & \tfrac{1}{2}c_2 & \tfrac{1}{2}c_3 \\
& -1 & & c_0 + \tfrac{1}{2}c_2 & \tfrac{1}{2}(c_1 + c_3) & \tfrac{1}{2}(c_2 + c_4) \\
& & -1 & \tfrac{1}{2}(c_1 + c_3) & c_0 + \tfrac{1}{2}c_4 & \tfrac{1}{2}(c_1 + c_5) \\
& & & \tfrac{1}{2}(c_2 + c_4) & \tfrac{1}{2}(c_1 + c_5) & c_0 + \tfrac{1}{2}c_6 \\
& & & \tfrac{1}{2}(c_3 + c_5) & \tfrac{1}{2}(c_2 + c_6) & \tfrac{1}{2}(c_1 + c_7) \\
& & & \tfrac{1}{2}(c_4 + c_6) & \tfrac{1}{2}(c_3 + c_7) & \tfrac{1}{2}(c_2 + c_8)
\end{bmatrix},
\]
\[
x = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}, \qquad
b = \begin{bmatrix} -c_0 \\ -c_1 \\ -c_2 \\ -c_3 \\ -c_4 \\ -c_5 \end{bmatrix}.
\]
After carrying out Gaussian elimination for this example, we obtain the rational approximation
\[ e^{-x} \approx r_{2,3}(x) = \frac{p_2(x)}{q_3(x)} \approx \frac{0.0231 x^2 - 0.3722 x + 0.9535}{0.0038 x^3 + 0.0696 x^2 + 0.5696 x + 1}. \]
Plotting the error in this approximation on the interval (−1, 1), we see that the error is maximum
at x = −1, at roughly 1.1×10−5 , which is less than one-fourth of the error in the Padé approximant
on [0, 1]. In fact, on [0, 1], the error is maximum at x = 0 and is only 4.1 × 10−6 . 2
which can be established using trigonometric identities. The complex conjugation of f (x) in (6.13)
is necessary to ensure that the norm k · k defined by
\[ \|u\| = \sqrt{(u, u)} \tag{6.14} \]
satisfies one of the essential properties of norms, that the norm of a function must be nonnegative.
Exercise 6.4.1 Prove that if m, n are integers, then
\[
\left( \cos\frac{2\pi m x}{L}, \cos\frac{2\pi n x}{L} \right) = \begin{cases} 0 & m \neq n, \\ L/2 & m = n,\ n \neq 0, \\ L & m = n = 0, \end{cases}
\qquad
\left( \sin\frac{2\pi m x}{L}, \sin\frac{2\pi n x}{L} \right) = \begin{cases} 0 & m \neq n, \\ L/2 & m = n, \end{cases}
\]
\[
\left( \cos\frac{2\pi m x}{L}, \sin\frac{2\pi n x}{L} \right) = 0,
\]
where the inner product (f, g) is as defined in (6.13).
Alternatively, we can use the relation e^{i\theta} = \cos\theta + i\sin\theta to express the solution in terms of complex exponentials,
\[ u(x) = \frac{1}{\sqrt{L}} \sum_{\omega=-\infty}^{\infty} \hat{u}(\omega)\, e^{2\pi i \omega x / L}, \tag{6.15} \]
where
\[ \hat{u}(\omega) = \frac{1}{\sqrt{L}} \int_0^L e^{-2\pi i \omega x / L}\, u(x)\, dx. \tag{6.16} \]
Like the sines and cosines in (6.11), the functions e2πiωx/L are orthogonal with respect to the inner
product (6.13). Specifically, we have
\[ \left( e^{2\pi i \omega x/L}, e^{2\pi i \eta x/L} \right) = \begin{cases} L & \omega = \eta, \\ 0 & \omega \neq \eta. \end{cases} \tag{6.17} \]
This explains the presence of the scaling constant 1/\sqrt{L} in (6.15). It normalizes the functions
e2πiωx/L so that they form an orthonormal set, meaning that they are orthogonal to one another,
and have unit norm.
Exercise 6.4.2 Prove (6.17).
We say that f(x) is square-integrable on (0, L) if
\[ \int_0^L |f(x)|^2\, dx < \infty. \tag{6.18} \]
That is, the above integral must be finite; we also say that f \in L^2(0, L). If such a function is also piecewise continuous, the following identity, known as Parseval's identity, is satisfied:
\[ \sum_{\omega=-\infty}^{\infty} |\hat{f}(\omega)|^2 = \|f\|^2, \tag{6.19} \]
where each f˜(ω) approximates the corresponding coefficient fˆ(ω) of the true Fourier series. Ideally,
this approximate series should satisfy
\[ f_N(x_j) = f(x_j), \qquad j = 0, 1, \ldots, N-1. \tag{6.21} \]
That is, f_N(x) should be an interpolant of f(x), with the N points x_j, j = 0, 1, \ldots, N-1, as the interpolation points.
Because the functions e2πiωx/L are orthogonal with respect to the discrete inner product
\[ (u, v)_N = \Delta x \sum_{j=0}^{N-1} u(x_j)\, v(x_j), \tag{6.23} \]
it is straightforward to verify that fN (x) does in fact satisfy the conditions (6.21). Note that the
discrete inner product is an approximation of the continuous inner product.
From (6.21), we have
\[ f(x_j) = \frac{1}{\sqrt{L}} \sum_{\eta=-N/2+1}^{N/2} e^{2\pi i \eta x_j / L}\, \tilde{f}(\eta). \tag{6.24} \]
or
\[ \Delta x \sum_{j=0}^{N-1} e^{-2\pi i \omega x_j / L} f(x_j) = \frac{1}{\sqrt{L}} \sum_{\eta=-N/2+1}^{N/2} \tilde{f}(\eta)\, \Delta x \sum_{j=0}^{N-1} e^{-2\pi i \omega x_j / L}\, e^{2\pi i \eta x_j / L}. \tag{6.26} \]
Because
\[ \left( e^{2\pi i \omega x/L}, e^{2\pi i \eta x/L} \right)_N = \begin{cases} L & \omega = \eta, \\ 0 & \omega \neq \eta, \end{cases} \tag{6.27} \]
all terms in the outer sum on the right side of (6.26) vanish except for η = ω, and we obtain the
formula (6.22). It should be noted that the algebraic operations performed on (6.24) are equivalent
to taking the discrete inner product of both sides of (6.24) with e2πiωx/L .
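A direct implementation of this computation is straightforward, if slow. The following Matlab sketch (a simple illustration, with a test function of our own choosing) evaluates the approximate Fourier coefficients from samples of a function on [0, L), with the frequency index running from -N/2 + 1 to N/2.

L = 1;  N = 64;
dx = L/N;  xj = (0:N-1)' * dx;        % grid points x_j = j*dx
f  = @(x) sin(2*pi*3*x) + 0.5*cos(2*pi*5*x);
fj = f(xj);

omega = (-N/2+1 : N/2)';              % frequencies
ftilde = zeros(size(omega));
for k = 1:length(omega)
    ftilde(k) = dx/sqrt(L) * sum(exp(-2i*pi*omega(k)*xj/L) .* fj);
end
% Nonzero coefficients should appear only at omega = +/-3 and +/-5.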
Exercise 6.4.4 Prove (6.27). Hint: use formulas associated with geometric series.
The process of obtaining the approximate Fourier coefficients as in (6.22) is called the discrete
Fourier transform (DFT) of f (x). The discrete inverse Fourier transform is given by (6.20). As
at the beginning of this section, we can also work with the real form of the Fourier interpolant,
\[ f_N(x) = \frac{\tilde{a}_0}{2} + \sum_{j=1}^{N/2-1} \left( \tilde{a}_j \cos\frac{2\pi j x}{L} + \tilde{b}_j \sin\frac{2\pi j x}{L} \right) + \tilde{a}_{N/2} \cos\frac{\pi N x}{L}, \tag{6.28} \]
where the coefficients ãj , b̃j are approximations of the coefficients aj , bj from (6.12).
Exercise 6.4.5 Express the coefficients ãj , b̃j of the real form of the Fourier interpolant
(6.28) in terms of the coefficients f˜(ω) from the complex exponential form (6.20).
Exercise 6.4.7 Use the result of Exercise 6.4.4 to prove the following discrete orthogo-
nality relations:
\[
\left( \cos\frac{2\pi m x}{L}, \cos\frac{2\pi n x}{L} \right)_N = \begin{cases} 0 & m \neq n, \\ L/2 & m = n,\ n \neq 0, \\ L & m = n = 0, \end{cases}
\qquad
\left( \sin\frac{2\pi m x}{L}, \sin\frac{2\pi n x}{L} \right)_N = \begin{cases} 0 & m \neq n, \\ L/2 & m = n, \end{cases}
\]
\[
\left( \cos\frac{2\pi m x}{L}, \sin\frac{2\pi n x}{L} \right)_N = 0,
\]
where m and n are integers, and the discrete inner product (f, g)N is as defined in (6.23).
The function f (x), shown in Figure 6.5(a), is quite noisy. However, by taking the discrete Fourier
transform (Figure 6.5(b)), we can extract the original sine wave quite easily. The DFT shows two
distinct spikes, corresponding to frequencies of ω = ±10, that is, the frequencies of the original sine
wave. The first N/2 + 1 values of the Fourier transform correspond to frequencies of 0 ≤ ω ≤ ωmax ,
Figure 6.5: (a) Left plot: noisy signal (b) Right plot: discrete Fourier transform
where ωmax = N/2. The remaining N/2 − 1 values of the Fourier transform correspond to the
frequencies −ωmax < ω < 0.
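An experiment along these lines can be reproduced with Matlab's built-in fft function. The sketch below builds a hypothetical noisy sine wave (the exact signal used to produce Figure 6.5 is not reproduced here) and examines the magnitudes of its DFT coefficients.

N = 128;  L = 1;
x = (0:N-1)'/N * L;
f = sin(2*pi*10*x) + 0.5*randn(N, 1);    % sine wave of frequency 10 plus noise

fhat = fft(f);
% fhat(1:N/2+1) corresponds to frequencies 0, 1, ..., N/2;
% fhat(N/2+2:N) corresponds to frequencies -N/2+1, ..., -1.
omega = [0:N/2, -N/2+1:-1]';
stem(omega, abs(fhat));                  % two large spikes near omega = +/-10
xlabel('\omega'), ylabel('|DFT coefficient|');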
The DFT only considers a finite range of frequencies. If there are frequencies beyond this
present in the Fourier series, an effect known as aliasing occurs. The effect of aliasing is shown in
Figure 6.6: it “folds” these frequencies back into the computed DFT. Specifically,
\[ \tilde{f}(\omega) = \sum_{\ell=-\infty}^{\infty} \hat{f}(\omega + \ell N), \qquad -N/2 + 1 \leq \omega \leq N/2. \tag{6.30} \]
Aliasing can be avoided by filtering the function before the DFT is applied, to prevent high-
frequency components from “contaminating” the coefficients of the DFT.
Exercise 6.4.8 Use (6.18) and (6.20) to prove (6.30). Hint: Let x = xj for some j.
The discrete Fourier transform, as presented above, requires O(N^2) operations to compute. In fact, the discrete Fourier transform can be computed much more efficiently
than that (O(N log2 N ) operations) by using the fast Fourier transform (FFT). The FFT arises by
noting that a DFT of length N can be written as the sum of two Fourier transforms each of length
N/2. One of these transforms is formed from the even-numbered points of the original N , and the
other from the odd-numbered points.
Figure 6.6: Aliasing effect on noisy signal: coefficients fˆ(ω), for ω outside (−63, 64), are added to
coefficients inside this interval.
We have
\[
\begin{aligned}
\tilde{f}(\omega) &= \frac{\Delta x}{\sqrt{L}} \sum_{j=0}^{N-1} e^{-2\pi i j \omega / N} f(x_j) \\
&= \frac{\Delta x}{\sqrt{L}} \sum_{j=0}^{N/2-1} e^{-2\pi i \omega (2j)/N} f(x_{2j}) + \frac{\Delta x}{\sqrt{L}} \sum_{j=0}^{N/2-1} e^{-2\pi i \omega (2j+1)/N} f(x_{2j+1}) \\
&= \frac{\Delta x}{\sqrt{L}} \sum_{j=0}^{N/2-1} e^{-2\pi i \omega j/(N/2)} f(x_{2j}) + \frac{\Delta x}{\sqrt{L}}\, W^{\omega} \sum_{j=0}^{N/2-1} e^{-2\pi i \omega j/(N/2)} f(x_{2j+1}),
\end{aligned}
\tag{6.31}
\]
where
\[ W = e^{-2\pi i / N}. \tag{6.32} \]
It follows that
\[ \tilde{f}(\omega) = \frac{1}{2}\tilde{f}_e(\omega) + \frac{1}{2} W^{\omega} \tilde{f}_o(\omega), \qquad \omega = -N/2+1, \ldots, N/2, \tag{6.33} \]
where f˜e (ω) is the DFT of f obtained from its values at the even-numbered points of the N -point
grid on which f is defined, and f̃_o(ω) is the DFT of f obtained from its values at the odd-numbered points. Because the coefficients of a DFT of length N are N-periodic, in view of the identity e^{2\pi i} = 1, evaluation of f̃_e and f̃_o at ω between −N/2 + 1 and N/2 is valid, even though
they are transforms of length N/2 instead of N .
This reduction to half-size transforms can be performed recursively; i.e. a transform of length
N/2 can be written as the sum of two transforms of length N/4, etc. Because only O(N ) operations
are needed to construct a transform of length N from two transforms of length N/2, the entire
process requires only O(N log2 N ) operations.
Exercise 6.4.9 Write two functions to compute the DFT of a function f (x) defined
on [0, L], represented by an N-vector f that contains its values at x_j = j\Delta x, j = 0, 1, 2, \ldots, N-1, where \Delta x = L/N. For the first function, use the formula (6.20), and
for the second, use recursion and the formula (6.33) for the FFT. Compare the efficiency
of your functions for different values of N . How does the running time increase as N
increases?
If f (x) is not L-periodic, then there is a jump discontinuity in the L-periodic extension of f (x)
beyond [0, L], and the Fourier series will again converge to the average of the values of f (x) on
either side of this discontinuity.
Such discontinuities pose severe difficulties for trigonometric interpolation, because the basis
functions eiωx grow more oscillatory as |ω| increases. In particular, the truncated Fourier series of a
function f (x) with a jump discontinuity at x = c exhibits what is known as Gibbs’ phenomenon,
first discussed in [37], in which oscillations appear on either side of x = c, even if f (x) itself is
smooth there.
Convergence of the Fourier series of f is more rapid when f is smooth. In particular, if f is
p-times differentiable and its pth derivative is at least piecewise continuous (that is, continuous
except possibly for jump discontinuities), then the coefficients of the complex exponential form of
the Fourier series satisfy
\[ |\hat{f}(\omega)| \leq \frac{C}{|\omega|^{p+1} + 1} \tag{6.35} \]
for some constant C that is independent of ω [17].
Exercise 6.4.10 Generate a random vector of DFT coefficients that satisfy the decay
rate (6.35), for some value of p. Then, perform an inverse FFT to obtain the truncated
Fourier series (6.20), and plot the resulting function fN (x). How does the behavior of the
function change as p decreases?
The solution of many mathematical models requires performing the basic operations of calculus,
differentiation and integration. In this chapter, we will learn several techniques for approximating a
derivative of a function at a point, and a definite integral of a function over an interval. As we will
see, our previous discussion of polynomial interpolation will play an essential role, as polynomials
are the easiest functions on which to perform these operations.
\[ f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}. \]
This definition suggests a method for approximating f 0 (x0 ). If we choose h to be a small positive
constant, then
\[ f'(x_0) \approx \frac{f(x_0 + h) - f(x_0)}{h}. \]
This approximation is called the forward difference formula.
To estimate the accuracy of this approximation, we note that if f 00 (x) exists on [x0 , x0 + h],
then, by Taylor’s Theorem, f (x0 + h) = f (x0 ) + f 0 (x0 )h + f 00 (ξ)h2 /2, where ξ ∈ [x0 , x0 + h]. Solving
for f 0 (x0 ), we obtain
\[ f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{f''(\xi)}{2}\, h, \]
so the error in the forward difference formula is O(h). We say that this formula is first-order
accurate.
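A quick numerical check of this first-order behavior can be carried out in Matlab; the sketch below applies the forward difference formula to a hypothetical test function and shows the error shrinking roughly in proportion to h.

f  = @(x) exp(sin(x));          % hypothetical test function
df = @(x) cos(x).*exp(sin(x));  % its exact derivative
x0 = 1;
for d = 1:8
    h = 10^(-d);
    approx = (f(x0 + h) - f(x0)) / h;   % forward difference
    fprintf('h = %.0e   error = %.3e\n', h, abs(approx - df(x0)));
end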
Our goal is to compute f 0 (0.25). Differentiating, using the Quotient Rule and the Chain Rule, we
obtain
[The full expression for f'(x) is a lengthy quotient involving nested radicals and trigonometric functions, and is omitted here.]
A similar approach can be used to obtain finite difference approximations of f 0 (x0 ) involving
any points of our choosing, and at an arbitrarily high order of accuracy, provided that sufficiently
many points are used.
Exercise 7.1.1 Use Taylor series expansions of f (x0 ± jh), for j = 1, 2, 3, to derive a
finite difference approximation of f 0 (x0 ) that is 6th-order accurate. What is the error
formula?
Exercise 7.1.2 Generalizing the process carried out by hand in Exercise 7.1.1, write a
Matlab function c=makediffrule(p) that takes as input a row vector of indices p and
returns in a vector c the coefficients of a finite-difference approximation of f 0 (x0 ) that
has the form
\[ f'(x_0) \approx \frac{1}{h} \sum_{j=1}^{n} c_j f(x_0 + p_j h), \]
where j and k are known nonnegative integers, x−j < x−j+1 < · · · < xk−1 < xk , and yi = f (xi ) for
i = −j, . . . , k. Then, a finite difference formula for f 0 (x0 ) can be obtained by analytically computing
the derivatives of the Lagrange polynomials {Ln,i (x)}ki=−j for these points, where n = j + k, and
the values of these derivatives at x0 are the proper weights for the function values y−j , . . . , yk . If
f (x) is n + 1 times continuously differentiable on [x−j , xk ], then we obtain an approximation of the
form
\[ f'(x_0) = \sum_{i=-j}^{k} y_i L'_{n,i}(x_0) + \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{\substack{i=-j \\ i \neq 0}}^{k} (x_0 - x_i), \tag{7.2} \]
where ξ ∈ [x−j , xk ].
Exercise 7.1.3 Prove (7.2) using the error formula for Lagrange interpolation. Hint:
Use the fact that the unknown point ξ in the error formula is an (unknown) function of
x.
Exercise 7.1.4 Modify your function makediffrule from Exercise 7.1.2 so that it uses
Lagrange interpolation rather than Taylor series expansion. Make it return a second out-
put err which is the constant C such that the error in (7.2) is of the form Chn f (n+1) (ξ),
where n = j + k.
Among the best-known finite difference formulas that can be derived using this approach is the
second-order-accurate three-point formula
\[ f'(x_0) = \frac{-3f(x_0) + 4f(x_0+h) - f(x_0+2h)}{2h} + \frac{f'''(\xi)}{3}\, h^2, \qquad \xi \in [x_0, x_0+2h], \tag{7.3} \]
which is useful when there is no information available about f (x) for x < x0 . If there is no
information available about f (x) for x > x0 , then we can replace h by −h in the above formula to
obtain a second-order-accurate three-point formula that uses the values of f (x) at x0 , x0 − h and
x0 − 2h.
Another formula is the five-point formula
\[ f'(x_0) = \frac{f(x_0-2h) - 8f(x_0-h) + 8f(x_0+h) - f(x_0+2h)}{12h} + \frac{f^{(5)}(\xi)}{30}\, h^4, \qquad \xi \in [x_0-2h, x_0+2h], \]
which is fourth-order accurate. The reason it is called a five-point formula, even though it uses
the value of f (x) at four points, is that it is derived from the Lagrange polynomials for the five
points x0 − 2h, x0 − h, x0 , x0 + h, and x0 + 2h. However, f (x0 ) is not used in the formula because
L04,0 (x0 ) = 0, where L4,0 (x) is the Lagrange polynomial that is equal to one at x0 and zero at the
other four points.
If we do not have any information about f (x) for x < x0 , then we can use the following five-point
formula that actually uses the values of f (x) at five points,
\[ f'(x_0) = \frac{-25f(x_0) + 48f(x_0+h) - 36f(x_0+2h) + 16f(x_0+3h) - 3f(x_0+4h)}{12h} + \frac{f^{(5)}(\xi)}{5}\, h^4, \]
where ξ ∈ [x0 , x0 + 4h]. As before, we can replace h by −h to obtain a similar formula that
approximates f 0 (x0 ) using the values of f (x) at x0 , x0 − h, x0 − 2h, x0 − 3h, and x0 − 4h.
Exercise 7.1.5 Use (7.2) to derive a general error formula for the approximation of
f 0 (x0 ) in the case where xi = x0 + ih, for i = −j, . . . , k. Use the preceding examples to
check the correctness of your error formula.
Example 7.1.2 We will construct a formula for approximating f 0 (x) at a given point x0 by inter-
polating f (x) at the points x0 , x0 + h, and x0 + 2h using a second-degree polynomial p2 (x), and
then approximating f 0 (x0 ) by p02 (x0 ). Since p2 (x) should be a good approximation of f (x) near x0 ,
especially when h is small, its derivative should be a good approximation to f 0 (x) near this point.
Using Lagrange interpolation, we obtain
p2 (x) = f (x0 )L2,0 (x) + f (x0 + h)L2,1 (x) + f (x0 + 2h)L2,2 (x),
where {L2,j (x)}2j=0 are the Lagrange polynomials for the points x0 , x1 = x0 + h and x2 = x0 + 2h.
Recall that these polynomials satisfy
\[ L_{2,j}(x_k) = \delta_{jk} = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{otherwise}. \end{cases} \]
we obtain
\[
\begin{aligned}
L_{2,0}(x) &= \frac{(x - (x_0+h))(x - (x_0+2h))}{(x_0 - (x_0+h))(x_0 - (x_0+2h))} = \frac{x^2 - (2x_0+3h)x + (x_0+h)(x_0+2h)}{2h^2}, \\
L_{2,1}(x) &= \frac{(x - x_0)(x - (x_0+2h))}{(x_0+h - x_0)(x_0+h - (x_0+2h))} = \frac{x^2 - (2x_0+2h)x + x_0(x_0+2h)}{-h^2}, \\
L_{2,2}(x) &= \frac{(x - x_0)(x - (x_0+h))}{(x_0+2h - x_0)(x_0+2h - (x_0+h))} = \frac{x^2 - (2x_0+h)x + x_0(x_0+h)}{2h^2}.
\end{aligned}
\]
It follows that
\[
L'_{2,0}(x) = \frac{2x - (2x_0+3h)}{2h^2}, \qquad
L'_{2,1}(x) = -\frac{2x - (2x_0+2h)}{h^2}, \qquad
L'_{2,2}(x) = \frac{2x - (2x_0+h)}{2h^2},
\]
and therefore
\[
\begin{aligned}
p_2'(x_0) &= f(x_0) L'_{2,0}(x_0) + f(x_0+h) L'_{2,1}(x_0) + f(x_0+2h) L'_{2,2}(x_0) \\
&= f(x_0) \frac{-3}{2h} + f(x_0+h) \frac{2}{h} + f(x_0+2h) \frac{-1}{2h} \\
&= \frac{-3f(x_0) + 4f(x_0+h) - f(x_0+2h)}{2h}.
\end{aligned}
\]
From (7.2), it can be shown (see Exercise 7.1.5) that the error in this approximation is O(h2 ), and
that this formula is exact when f (x) is a polynomial of degree 2 or less. The error formula is given
in (7.3). 2
Exercise 7.1.7 Generalize your function makediffrule from Exercise 7.1.4 so that it
can compute the coefficients of a finite difference approximation to a derivative of a given
order, which is specified as an input argument.
7.1.4 Sensitivity
Based on the error formula for each of these finite difference approximations, one would expect
that it is possible to obtain an approximation that is accurate to within machine precision simply
by choosing h sufficiently small. In the following exercise, we can put this expectation to the test.
Exercise 7.1.8 Use the centered difference formula (7.1) to compute an approximation
of f 0 (x0 ) for f (x) = sin x, x0 = 1.2, and h = 10−d for d = 1, 2, . . . , 15. Compare the
error in each approximation with an upper bound for the error formula given in (7.1).
How does the actual error compare to theoretical expectations?
The reason for the discrepancy observed in Exercise 7.1.8 is that the error formula in (7.1), or any
other finite difference approximation, only accounts for discretization error, not roundoff error.
In a practical implementation of finite difference formulas, it is essential to note that roundoff
error in evaluating f (x) is bounded independently of the spacing h between points at which f (x)
is evaluated. It follows that the roundoff error in the approximation of f 0 (x) actually increases
as h decreases, because the errors incurred by evaluating f (x) are divided by h. Therefore, one
must choose h sufficiently small so that the finite difference formula can produce an accurate
approximation, and sufficiently large so that this approximation is not too contaminated by roundoff
error.
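A minimal Matlab sketch in the spirit of Exercise 7.1.8 (using an assumed test function, not necessarily the one in the text) makes this trade-off visible: the error first decreases with h and then grows again as roundoff error takes over.

f  = @(x) sin(x);  x0 = 1.2;  exact = cos(x0);
for d = 1:15
    h = 10^(-d);
    approx = (f(x0 + h) - f(x0 - h)) / (2*h);   % centered difference
    fprintf('h = 1e-%02d   error = %.3e\n', d, abs(approx - exact));
end
% The error reaches a minimum near h ~ 1e-5 and then increases as h shrinks further.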
Example 7.1.3 We construct a differentiation matrix for functions defined on [0, 1], and satisfying
the boundary conditions f (0) = f (1) = 0. Let x1 , x2 , . . . , xn be n equally spaced points in (0, 1),
defined by xi = ih, where h = 1/(n + 1). If we use the forward difference approximation, we then
have
\[
f'(x_1) \approx \frac{f(x_2) - f(x_1)}{h}, \quad
f'(x_2) \approx \frac{f(x_3) - f(x_2)}{h}, \quad \ldots, \quad
f'(x_{n-1}) \approx \frac{f(x_n) - f(x_{n-1})}{h}, \quad
f'(x_n) \approx \frac{0 - f(x_n)}{h}.
\]
Writing these equations in matrix-vector form, we obtain a relation of the form g ≈ Df , where
\[
g = \begin{bmatrix} f'(x_1) \\ f'(x_2) \\ \vdots \\ f'(x_n) \end{bmatrix}, \qquad
f = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix}, \qquad
D = \frac{1}{h} \begin{bmatrix}
-1 & 1 & & & \\
& -1 & 1 & & \\
& & \ddots & \ddots & \\
& & & -1 & 1 \\
& & & & -1
\end{bmatrix}.
\]
The entries of D can be determined from the coefficients of each value f (xj ) used to approximate
f 0 (xi ), for i = 1, 2, . . . , n. From the structure of this upper bidiagonal matrix, it follows that we
can approximate f 0 (x) at these grid points by a matrix-vector multiplication which costs only O(n)
floating-point operations.
Now, suppose that we instead impose periodic boundary conditions f (0) = f (1). In this case,
we again use n equally spaced points, but including the left boundary: xi = ih, i = 0, 1, . . . , n − 1,
where h = 1/n. Using forward differencing again, we have the same approximations as before,
except
\[ f'(x_{n-1}) \approx \frac{f(1) - f(x_{n-1})}{h} = \frac{f(0) - f(x_{n-1})}{h} = \frac{f(x_0) - f(x_{n-1})}{h}. \]
It follows that the differentiation matrix is
\[
D = \frac{1}{h} \begin{bmatrix}
-1 & 1 & & & \\
& -1 & 1 & & \\
& & \ddots & \ddots & \\
& & & -1 & 1 \\
1 & & & & -1
\end{bmatrix}.
\]
Note the “wrap-around” effect in which the superdiagonal appears to continue past the last column
into the first column. For this reason, D is an example of what is called a circulant matrix. 2
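Such matrices are easy to build and apply in Matlab. The following sketch (an illustration, not code from the text) constructs the sparse periodic differentiation matrix above and applies it to samples of a smooth periodic function.

n = 100;  h = 1/n;
x = (0:n-1)' * h;                        % grid including the left endpoint

% Periodic forward-difference matrix: -1 on the diagonal, 1 on the
% superdiagonal, and 1 in the lower-left corner ("wrap-around").
e = ones(n, 1);
D = spdiags([-e e], [0 1], n, n);
D(n, 1) = 1;
D = D / h;

f  = sin(2*pi*x);                        % periodic test function
df = D * f;                              % approximates 2*pi*cos(2*pi*x), to O(h)
max_err = max(abs(df - 2*pi*cos(2*pi*x)))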
Exercise 7.1.9 What are the differentiation matrices corresponding to (7.4) for func-
tions defined on [0, 1], for (a) boundary conditions f (0) = f (1) = 0, and (b) periodic
boundary conditions f (0) = f (1)?
In some cases, I[f ] can be computed by applying the Fundamental Theorem of Calculus and
computing
I[f ] = F (b) − F (a),
where F (x) is an antiderivative of f , meaning that F 0 (x) = f (x). Unfortunately, this is not practical
if an antiderivative of f is not available. In such cases, numerical techniques must be employed
instead. The basics of integrals are reviewed in Section A.4.
Exercise 7.2.1 Write a Matlab script that computes the Riemann sum Rn for
\[ \int_0^1 x^2\, dx = \frac{1}{3}, \]
where the left endpoint of each subinterval is used to obtain the height of the corresponding
rectangle. How large must n, the number of subintervals, be to obtain an approximate
answer that is accurate to within 10−5 ?
Instead, we use a quadrature rule to approximate I[f ]. A quadrature rule is a sum of the
form
\[ Q_n[f] = \sum_{i=1}^{n} f(x_i)\, w_i, \tag{7.5} \]
where the points xi , i = 1, . . . , n, are called the nodes of the quadrature rule, and the numbers wi ,
i = 1, . . . , n, are the weights. We say that a quadrature rule is open if the nodes do not include
the endpoints a and b, and closed if they do.
The objective in designing quadrature rules is to achieve sufficient accuracy in approximating
I[f ], for any Riemann integrable function f , while using as few nodes as possible in order to
maximize efficiency. In order to determine suitable nodes and weights, we consider the following
questions:
• Given a general Riemann integrable function f , can I[f ] be approximated by the integral of
a function g for which I[g] is easy to compute?
\[
\begin{aligned}
&= \sum_{i=1}^{n} p_{n-1}(x_i)\, w_i \\
&= Q_n[p_{n-1}],
\end{aligned}
\]
where
\[ w_i = \int_a^b L_{n-1,i}(x)\, dx, \qquad i = 1, \ldots, n, \tag{7.6} \]
are the weights of a quadrature rule with nodes x1 , . . . , xn .
Therefore, any n-point quadrature rule with weights chosen as in (7.6) computes I[f ] exactly
when f is a polynomial of degree less than n. For a more general function f , we can use this
quadrature rule to approximate I[f ] by I[pn−1 ], where pn−1 is the polynomial that interpolates
f at the points x1 , . . . , xn . Quadrature rules that use the weights defined above for given nodes
x1 , . . . , xn are called interpolatory quadrature rules. We say that an interpolatory quadrature
rule has degree of accuracy n if it integrates polynomials of degree n exactly, but is not exact
for polynomials of degree n + 1.
Exercise 7.2.2 Use Matlab’s polynomial functions to write a function
I=polydefint(p,a,b) that computes and returns the definite integral of a polyno-
mial with coefficients stored in the vector p over the interval [a, b].
Exercise 7.2.3 Use your function polydefint from Exercise 7.2.2 to write a function
w=interpweights(x,a,b) that returns a vector of weights w for an interpolatory quadra-
ture rule for the interval [a, b] with nodes stored in the vector x.
Exercise 7.2.4 Use your function interpweights from Exercise 7.2.3 to write a func-
tion I=interpquad(f,a,b,x) that approximates I[f ] over [a, b] using an interpolatory
quadrature rule with nodes stored in the vector x. The input argument f must be a func-
tion handle. Test your function by using it to evaluate the integrals of polynomials of
various degrees, comparing the results to the exact integrals returned by your function
polydefint from Exercise 7.2.2.
7.2.3 Sensitivity
To determine the sensitivity of I[f ], we define the ∞-norm of a function f (x) by
\[ \|f\|_\infty = \max_{x \in [a,b]} |f(x)| \]
and let f̂ be a perturbation of f that is also Riemann integrable. Then the absolute condition number of the problem of computing I[f] can be approximated by
\[ \frac{|I[f] - I[\hat{f}]|}{\|f - \hat{f}\|_\infty} \leq \frac{\int_a^b |f(x) - \hat{f}(x)|\, dx}{\|f - \hat{f}\|_\infty} \leq b - a, \]
from which it follows that the problem is fairly well-conditioned in most cases. Similarly, perturbations of the endpoints a and b do not lead to large perturbations in I[f], in most cases.
Exercise 7.2.5 What is the relative condition number of the problem of computing I[f ]?
If the weights wi , i = 1, . . . , n, are nonnegative, then the quadrature rule is stable, as its absolute
condition number can be bounded by (b − a), which is the same absolute condition number as the
underlying integration problem. However, if any of the weights are negative, then the condition
number can be arbitrarily large.
Exercise 7.2.6 Find the absolute condition number of the problem of computing Qn [f ]
for a general quadrature rule of the form (7.5).
It is of degree one, and it is based on the principle that the area under f (x) can be approxi-
mated by the area of a rectangle with width b − a and height f (m), where m = (a + b)/2 is
the midpoint of the interval [a, b].
It is of degree three, and it is derived by computing the integral of the quadratic polynomial
that interpolates f (x) at the points a, (a + b)/2, and b.
\[ \int_a^b f(x)\, dx = \int_0^1 x^3\, dx = \left. \frac{x^4}{4} \right|_0^1 = \frac{1}{4}. \]
That is, the approximation of the integral by Simpson’s Rule is actually exact, which is expected
because Simpson’s Rule is of degree three. On the other hand, if we approximate the integral of
f (x) = x4 from 0 to 1, Simpson’s Rule yields 5/24, while the exact value is 1/5. Still, this is a
better approximation than those obtained using the Midpoint Rule (1/16) or the Trapezoidal Rule
(1/2). 2
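These three approximations are easy to reproduce; the following Matlab sketch applies the basic Midpoint, Trapezoidal, and Simpson's Rules to f(x) = x^4 on [0, 1] and compares them with the exact value 1/5.

f = @(x) x.^4;
a = 0;  b = 1;  m = (a + b)/2;
midpoint  = (b - a) * f(m);                       % 1/16
trapezoid = (b - a)/2 * (f(a) + f(b));            % 1/2
simpson   = (b - a)/6 * (f(a) + 4*f(m) + f(b));   % 5/24
exact     = 1/5;
errors = abs([midpoint trapezoid simpson] - exact)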
Exercise 7.3.2 Use your code from Exercise 7.2.4 to write a function
I=quadnewtoncotes(f,a,b,n) to integrate f (x), implemented by the function handle f,
over [a, b] using an n-node Newton-Cotes rule.
\[ \int_a^b f(x)\, dx - \frac{b-a}{2}\left[ f(a) + f(b) \right] = \frac{f''(\eta)}{2} \int_a^b (x-a)(x-b)\, dx = -\frac{f''(\eta)}{12}(b-a)^3, \tag{7.7} \]
where a ≤ η ≤ b. Because the error depends on the second derivative, it follows that the Trapezoidal
Rule is exact for any linear function.
A similar approach can be used to obtain expressions for the error in the Midpoint Rule and
Simpson’s Rule, although the process is somewhat more complicated due to the fact that the
functions (x − m), for the Midpoint Rule, and (x − a)(x − m)(x − b), for Simpson’s Rule, where in
both cases m = (a + b)/2, change sign on [a, b], thus making the Weighted Mean Value Theorem
for Integrals impossible to apply in the same straightforward manner as it was for the Trapezoidal
Rule.
We instead use the following approach, illustrated for the Midpoint Rule and adapted from a
similar proof for Simpson’s Rule from [36]. We assume that f is twice continuously differentiable
on [a, b]. First, we make a change of variable
\[ x = \frac{a+b}{2} + \frac{b-a}{2}\, t, \qquad t \in [-1, 1], \]
to map the interval [−1, 1] to [a, b], and then define F(t) = f(x(t)). The error in the Midpoint Rule is then given by
\[ \int_a^b f(x)\, dx - (b-a)\, f\!\left( \frac{a+b}{2} \right) = \frac{b-a}{2} \left[ \int_{-1}^{1} F(\tau)\, d\tau - 2F(0) \right]. \]
We now define
\[ G(t) = \int_{-t}^{t} F(\tau)\, d\tau - 2t F(0). \]
It is easily seen that the error in the Midpoint Rule is \frac{1}{2}(b-a)G(1). We then define
\[ H(t) = G(t) - t^3 G(1). \]
Because H(0) = H(1) = 0, it follows from Rolle's Theorem that there exists a point \xi_1 \in (0, 1) such that H'(\xi_1) = 0. However, from
\[ H'(t) = G'(t) - 3t^2 G(1) = F(t) + F(-t) - 2F(0) - 3t^2 G(1), \]
we see that H'(0) = 0 as well, so it follows from Rolle's Theorem that there exists a point \xi_2 \in (0, 1) such that H''(\xi_2) = 0. From
\[ H''(t) = G''(t) - 6t G(1) = F'(t) - F'(-t) - 6t G(1), \]
and the Mean Value Theorem, we obtain, for some \xi_3 \in (-1, 1),
\[ 0 = H''(\xi_2) = 2\xi_2 F''(\xi_3) - 6\xi_2 G(1), \]
or
\[ G(1) = \frac{1}{3} F''(\xi_3) = \frac{1}{3} \left( \frac{b-a}{2} \right)^2 f''(x(\xi_3)). \]
Multiplying by (b − a)/2 yields the error in the Midpoint Rule.
The result of the analysis is that for the Midpoint Rule,
\[ \int_a^b f(x)\, dx - (b-a)\, f\!\left( \frac{a+b}{2} \right) = \frac{f''(\eta)}{24}(b-a)^3, \tag{7.8} \]
Exercise 7.3.3 Adapt the approach in the preceding derivation of the error formula (7.8)
for the Midpoint Rule to obtain the error formula (7.9) for Simpson’s Rule.
In general, the degree of accuracy of Newton-Cotes rules can easily be determined by expanding
the integrand f (x) in a Taylor series around the midpoint of [a, b], m = (a + b)/2. This technique
can be used to show that n-point Newton-Cotes rules with an odd number of nodes have degree
n, which is surprising since, in general, interpolatory n-point quadrature rules have degree n − 1.
This extra degree of accuracy is due to the cancellation of the high-order error terms in the Taylor
expansion used to determine the error. Such cancellation does not occur with Newton-Cotes rules
that have an even number of nodes.
Exercise 7.3.4 Prove the statement from the preceding paragraph: an n-node Newton-Cotes rule has degree of accuracy n if n is odd, and n − 1 if n is even.
be ill-conditioned. This can be seen by revisiting Runge’s Example from Section 5.4, and attempting
to approximate
\[ \int_{-5}^{5} \frac{1}{1 + x^2}\, dx \tag{7.10} \]
using a Newton-Cotes rule. As n increases, the approximate integral does not converge to the exact
result; in fact, it increases without bound.
Exercise 7.3.5 What is the smallest value of n for which an n-node Newton-Cotes rule has a negative weight?
Exercise 7.3.6 Use your code from Exercise 7.3.2 to evaluate the integral from (7.10)
for increasing values of n and describe the behavior of the error as n increases.
Exercise 7.4.2 Apply your functions from Exercise 7.4.1 to approximate the integrals
\[ \int_0^1 \sqrt{x}\, dx, \qquad \int_1^2 \sqrt{x}\, dx. \]
Use different values of n, the number of subintervals. How does the accuracy increase
as n increases? Explain any discrepancy between the observed behavior and theoretical
expectations.
Although it appears that the Composite Midpoint Rule is less accurate than the Composite Trape-
zoidal Rule, it should be noted that it uses about half as many function evaluations. In other words,
the Basic Midpoint Rule is applied n/2 times, each on a subinterval of width 2h. Rewriting the
Composite Midpoint Rule in such a way that it uses n function evaluations, each on a subinterval
of width h, we obtain
\[ \int_a^b f(x)\, dx = h \sum_{i=1}^{n} f\!\left( x_{i-1} + \frac{h}{2} \right) + \frac{f''(\eta)}{24}(b-a) h^2, \tag{7.15} \]
which reveals that the Composite Midpoint Rule is generally more accurate.
Finally, for the Composite Simpson’s Rule, we have
\[ E_{\text{simp}} = -\sum_{i=1}^{n/2} \frac{f^{(4)}(\eta_i)}{90}\, h^5 = -\frac{f^{(4)}(\eta)}{180}(b-a) h^4, \tag{7.16} \]
because the Basic Simpson's Rule is applied n/2 times, each on a subinterval of width 2h. We conclude that the Composite Simpson's Rule is fourth-order accurate.
Exercise 7.4.3 Derive the error formula (7.16) for the Composite Simpson’s Rule
(7.13).
Example 7.4.1 We wish to approximate
\[ \int_0^1 e^x\, dx \]
using composite quadrature, to 3 decimal places. That is, the error must be less than 10−3 . This
requires choosing the number of subintervals, n, sufficiently large so that an upper bound on the
error is less than 10−3 .
For the Composite Trapezoidal Rule, the error is
\[ E_{\text{trap}} = -\frac{f''(\eta)}{12}(b-a) h^2 = -\frac{e^\eta}{12 n^2}, \]
since f (x) = ex , a = 0 and b = 1, which yields h = (b − a)/n = 1/n. Since 0 ≤ η ≤ 1, and ex is
increasing, the factor eη is bounded above by e1 = e. It follows that |Etrap | < 10−3 if
\[ \frac{e}{12 n^2} < 10^{-3} \;\Longrightarrow\; \frac{1000 e}{12} < n^2 \;\Longrightarrow\; n > 15.0507. \]
Therefore, the error will be sufficiently small provided that we choose n ≥ 16.
On the other hand, if we use the Composite Simpson’s Rule, the error is
\[ E_{\text{simp}} = -\frac{f^{(4)}(\eta)}{180}(b-a) h^4 = -\frac{e^\eta}{180 n^4} \]
for some \eta in [0, 1], which is less than 10^{-3} in absolute value if
\[ n > \left( \frac{1000 e}{180} \right)^{1/4} \approx 1.9713, \]
so n = 2 is sufficient. That is, we can approximate the integral to 3 decimal places by setting
h = (b − a)/n = (1 − 0)/2 = 1/2 and computing
\[ \int_0^1 e^x\, dx \approx \frac{h}{3}\left[ e^{x_0} + 4e^{x_1} + e^{x_2} \right] = \frac{1/2}{3}\left[ e^0 + 4e^{1/2} + e^1 \right] \approx 1.71886, \]
whereas the exact value is approximately 1.71828. 2
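The computations in this example can be verified with a few lines of Matlab. The sketch below implements the Composite Trapezoidal and Simpson's Rules directly (these are standard formulas, written here only for illustration).

f = @(x) exp(x);  a = 0;  b = 1;
exact = exp(1) - 1;

n = 16;  h = (b - a)/n;  x = a:h:b;          % Composite Trapezoidal Rule
trap = h/2 * (f(x(1)) + 2*sum(f(x(2:end-1))) + f(x(end)));

n = 2;   h = (b - a)/n;  x = a:h:b;          % Composite Simpson's Rule
simp = h/3 * (f(x(1)) + 4*sum(f(x(2:2:end-1))) + 2*sum(f(x(3:2:end-1))) + f(x(end)));

[abs(trap - exact)  abs(simp - exact)]       % both errors should be below 1e-3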
For given n, our goal is to select weights and nodes so that the first 2n moments are computed
exactly; i.e.,
\[ \mu_k = \sum_{i=1}^{n} w_i x_i^k, \qquad k = 0, 1, \ldots, 2n-1. \tag{7.17} \]
Since we have 2n free parameters, it is reasonable to think that appropriate nodes and weights can
be found. Unfortunately, this system of equations is nonlinear, so it can be quite difficult to solve.
Exercise 7.5.1 Use direct construction to solve the equations (7.17) for the case of n = 2
on the interval (a, b) = (−1, 1) for the nodes x1 , x2 and weights w1 , w2 .
where
Notice that this method is exact for polynomials of degree 2n − 1 since the error functional E[G]
depends on the (2n)th derivative of G.
To prove this, we shall construct an orthonormal family of polynomials {qi (x)}ni=0 , as in Section
6.2, so that
\[ \langle q_r, q_s \rangle = \int_a^b q_r(x)\, q_s(x)\, dx = \begin{cases} 0 & r \neq s, \\ 1 & r = s. \end{cases} \]
Recall that this can be accomplished using the fact that such a family of polynomials satisfies a
three-term recurrence relation
βj qj (x) = (x − αj )qj−1 (x) − βj−1 qj−2 (x), q0 (x) = (b − a)−1/2 , q−1 (x) = 0,
where
\[ \alpha_j = \langle q_{j-1}, x\, q_{j-1} \rangle = \int_a^b x\, q_{j-1}(x)^2\, dx, \qquad \beta_j^2 = \langle q_j, x\, q_{j-1} \rangle = \int_a^b x\, q_j(x)\, q_{j-1}(x)\, dx, \qquad j \geq 1, \]
with β02 = b − a.
We choose the nodes {xi } to be the roots of the nth -degree polynomial in this family, which are
real, distinct and lie within (a, b), as proved in Section 6.2.6. Next, we construct the interpolant of
degree n − 1, denoted pn−1 (x), of g(x) through the nodes:
n
X
pn−1 (x) = g(xi )Ln−1,i (x),
i=1
where, for i = 1, . . . , n, Ln−1,i (x) is the ith Lagrange polynomial for the points x1 , . . . , xn . We shall
now look at the interpolation error function
\[ e(x) = g(x) - p_{n-1}(x). \]
Clearly, since g \in P_{2n-1}, e \in P_{2n-1}. Since e(x) has roots at each of the roots of q_n(x), we can
factor e so that
e(x) = qn (x)r(x),
where r ∈ Pn−1 . It follows from the fact that qn (x) is orthogonal to any polynomial in Pn−1 that
the integral of g can then be written as
\[
\begin{aligned}
I[g] &= \int_a^b p_{n-1}(x)\, dx + \int_a^b q_n(x)\, r(x)\, dx \\
&= \int_a^b p_{n-1}(x)\, dx \\
&= \int_a^b \sum_{i=1}^{n} g(x_i)\, L_{n-1,i}(x)\, dx \\
&= \sum_{i=1}^{n} g(x_i) \int_a^b L_{n-1,i}(x)\, dx \\
&= \sum_{i=1}^{n} g(x_i)\, w_i,
\end{aligned}
\]
where
\[ w_i = \int_a^b L_{n-1,i}(x)\, dx, \qquad i = 1, 2, \ldots, n. \]
For a more general function G(x), the error functional E[G] can be obtained from the expression
for Hermite interpolation error presented in Section 5.5.1, as we will now investigate.
\[ 0 < \int_a^b L_{n-1,i}^2(x)\, dx = \sum_{j=1}^{n} w_j L_{n-1,i}^2(x_j) = w_i. \]
Note that we have thus obtained an alternative formula for the weights.
This formula also arises from an alternative approach to constructing Gaussian quadrature
rules, from which a representation of the error can easily be obtained. We construct the Hermite
interpolating polynomial G2n−1 (x) of G(x), using the Gaussian quadrature nodes as interpolation
points, that satisfies the 2n conditions
\[ G_{2n-1}(x_i) = G(x_i), \qquad G_{2n-1}'(x_i) = G'(x_i), \qquad i = 1, 2, \ldots, n. \]
We recall from Section 5.5.1 that this interpolant has the form
\[ G_{2n-1}(x) = \sum_{i=1}^{n} G(x_i)\, H_i(x) + \sum_{i=1}^{n} G'(x_i)\, K_i(x), \]
Then, we have
\[ \int_a^b G_{2n-1}(x)\, dx = \sum_{i=1}^{n} G(x_i) \int_a^b H_i(x)\, dx + \sum_{i=1}^{n} G'(x_i) \int_a^b K_i(x)\, dx. \]
where
\[ H_i(x) = L_{n-1,i}(x)^2 \left[ 1 - 2 L_{n-1,i}'(x_i)(x - x_i) \right], \qquad K_i(x) = L_{n-1,i}(x)^2 (x - x_i), \qquad i = 1, 2, \ldots, n. \]
Defining
\[ \pi_n(x) = (x - x_1)(x - x_2) \cdots (x - x_n), \]
We then have
\[
\begin{aligned}
\int_a^b H_i(x)\, dx &= \int_a^b L_{n-1,i}(x)^2\, dx - 2 L_{n-1,i}'(x_i) \int_a^b L_{n-1,i}(x)^2 (x - x_i)\, dx \\
&= \int_a^b L_{n-1,i}(x)^2\, dx - \frac{2 L_{n-1,i}'(x_i)}{\pi_n'(x_i)} \int_a^b L_{n-1,i}(x)\, \pi_n(x)\, dx \\
&= \int_a^b L_{n-1,i}(x)^2\, dx,
\end{aligned}
\]
as the second term vanishes because Ln−1,i (x) is of degree n − 1, and πn (x), a polynomial of degree
n, is orthogonal to all polynomials of lesser degree.
Similarly,
\[ \int_a^b K_i(x)\, dx = \int_a^b L_{n-1,i}(x)^2 (x - x_i)\, dx = \frac{1}{\pi_n'(x_i)} \int_a^b L_{n-1,i}(x)\, \pi_n(x)\, dx = 0. \]
We conclude that
\[ \int_a^b G_{2n-1}(x)\, dx = \sum_{i=1}^{n} G(x_i)\, w_i, \]
where, as before,
\[ w_i = \int_a^b L_{n-1,i}(x)^2\, dx = \int_a^b L_{n-1,i}(x)\, dx, \qquad i = 1, 2, \ldots, n. \]
The equivalence of these formulas for the weights can be seen from the fact that the difference
Ln−1,i (x)2 − Ln−1,i (x) is a polynomial of degree 2n − 2 that is divisible by πn (x), because it
vanishes at all of the nodes. The quotient, a polynomial of degree n − 2, is orthogonal to πn (x).
Therefore, the integrals of Ln−1,i (x)2 and Ln−1,i (x) must be equal.
We now use the error in the Hermite interpolating polynomial to obtain
\[
\begin{aligned}
E[G] &= \int_a^b G(x)\, dx - \sum_{i=1}^{n} G(x_i)\, w_i \\
&= \int_a^b \left[ G(x) - G_{2n-1}(x) \right] dx \\
&= \int_a^b \frac{G^{(2n)}(\xi(x))}{(2n)!}\, \pi_n(x)^2\, dx \\
&= \frac{G^{(2n)}(\xi)}{(2n)!} \int_a^b \prod_{i=1}^{n} (x - x_i)^2\, dx,
\end{aligned}
\]
where ξ ∈ (a, b). The last step is obtained using the Weighted Mean Value Theorem for Integrals,
which applies because πn (x)2 does not change sign.
In addition to this error formula, we can easily obtain qualitative bounds on the error. For
instance, if we know that the even derivatives of g are positive, then we know that the quadrature
rule yields a lower bound for I[g]. Similarly, if the even derivatives of g are negative, then the
quadrature rule gives an upper bound.
Finally, it can be shown that as n → ∞, the n-node Gaussian quadrature approximation of I[f ]
converges to I[f]. The key to the proof is the fact that the weights are guaranteed to be positive, and therefore the sum of their absolute values is equal to b − a and remains bounded as n increases. Such a result does not hold for n-node
Newton-Cotes quadrature rules, because the sum of the absolute values of the weights cannot be
bounded, due to the presence of negative weights.
The particular Gaussian quadrature rule that we will use consists of 5 nodes x1 , x2 , x3 , x4 and x5 ,
and 5 weights w1 , w2 , w3 , w4 and w5 . To determine the proper nodes and weights, we use the fact
that the nodes and weights of a 5-point Gaussian rule for integrating over the interval [−1, 1] are
given by
i Nodes r5,i Weights c5,i
1 0.9061798459 0.2369268850
2 0.5384693101 0.4786286705
3 0.0000000000 0.5688888889
4 −0.5384693101 0.4786286705
5 −0.9061798459 0.2369268850
To obtain the corresponding nodes and weights for integrating over [0, 1], we can use the fact that
in general,
\[ \int_a^b f(x)\, dx = \int_{-1}^{1} f\!\left( \frac{b-a}{2}\, t + \frac{a+b}{2} \right) \frac{b-a}{2}\, dt, \]
as can be shown using the change of variable x = [(b − a)/2]t + (a + b)/2 that maps [a, b] into [−1, 1].
We then have
We then have
\[
\begin{aligned}
\int_a^b f(x)\, dx &= \int_{-1}^{1} f\!\left( \frac{b-a}{2}\, t + \frac{a+b}{2} \right) \frac{b-a}{2}\, dt \\
&\approx \sum_{i=1}^{5} f\!\left( \frac{b-a}{2}\, r_{5,i} + \frac{a+b}{2} \right) \frac{b-a}{2}\, c_{5,i} \\
&\approx \sum_{i=1}^{5} f(x_i)\, w_i,
\end{aligned}
\]
where
\[ x_i = \frac{b-a}{2}\, r_{5,i} + \frac{a+b}{2}, \qquad w_i = \frac{b-a}{2}\, c_{5,i}, \qquad i = 1, \ldots, 5. \]
In this example, a = 0 and b = 1, so the nodes and weights for a 5-point Gaussian quadrature rule for integrating over [0, 1] are given by
\[ x_i = \frac{1}{2}\, r_{5,i} + \frac{1}{2}, \qquad w_i = \frac{1}{2}\, c_{5,i}, \qquad i = 1, \ldots, 5, \]
which yields
i Nodes xi Weights wi
1 0.95308992295 0.11846344250
2 0.76923465505 0.23931433525
3 0.50000000000 0.28444444444
4 0.23076534495 0.23931433525
5 0.04691007705 0.11846344250
It follows that
\[
\begin{aligned}
\int_0^1 e^{-x^2}\, dx &\approx \sum_{i=1}^{5} e^{-x_i^2}\, w_i \\
&\approx 0.11846344250\, e^{-0.95308992295^2} + 0.23931433525\, e^{-0.76923465505^2} + 0.28444444444\, e^{-0.5^2} \\
&\qquad + 0.23931433525\, e^{-0.23076534495^2} + 0.11846344250\, e^{-0.04691007705^2} \\
&\approx 0.74682412673352.
\end{aligned}
\]
Since the exact value is 0.74682413281243, the error is approximately −6.08 × 10^{-9}, which is remarkably accurate considering that only five nodes are used. 2
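The tabulated values above make this computation easy to reproduce; here is a short Matlab sketch that maps the standard 5-point rule from [−1, 1] to [0, 1] and applies it to e^{-x^2}.

r = [ 0.9061798459;  0.5384693101; 0.0; -0.5384693101; -0.9061798459];
c = [ 0.2369268850;  0.4786286705; 0.5688888889; 0.4786286705; 0.2369268850];

a = 0;  b = 1;
x = (b - a)/2 * r + (a + b)/2;    % transformed nodes
w = (b - a)/2 * c;                % transformed weights

f = @(x) exp(-x.^2);
Q = sum(f(x) .* w)                % approximately 0.74682412673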
The high degree of accuracy of Gaussian quadrature rules make them the most commonly used
rules in practice. However, they are not without their drawbacks:
• They are not progressive, so the nodes must be recomputed whenever additional degrees of
accuracy are desired. An alternative is to use Gauss-Kronrod rules [22]. A (2n + 1)-point
Gauss-Kronrod rule uses the nodes of the n-point Gaussian rule. For this reason, practical
quadrature procedures use both the Gaussian rule and the corresponding Gauss-Kronrod rule
to estimate accuracy.
• Because the nodes are the roots of a polynomial, they must be computed using traditional
root-finding methods, which are not always accurate. Errors in the computed nodes lead
to lost degrees of accuracy in the approximate integral. In practice, however, this does not
normally cause significant difficulty.
Exercise 7.5.3 A 5-node Gaussian quadrature rule is exact for the integrand f (x) =
x8 , while a 5-node Newton-Cotes rule is not. How important is it that the Gaussian
quadrature nodes be computed with high accuracy? Investigate this by approximating \int_{-1}^{1} x^8\, dx using a 5-node interpolatory quadrature rule with nodes
xi = θx̃i + (1 − θ)x̂i ,
where {x̃i }5i=1 and {x̂i }5i=1 are the nodes for a 5-node Gaussian and Newton-Cotes rule,
respectively, and θ ∈ [0, 1]. Use your function interpquad from Exercise 7.2.4 and let θ
vary from 0 to 1. How does the error behave as θ increases?
In Section 6.2 we learned how to construct sequences of orthogonal polynomials for the inner
product
\[ \langle f, g \rangle = \int_a^b f(x)\, g(x)\, w(x)\, dx, \]
where f and g are real-valued functions on (a, b) and w(x) is a weight function satisfying w(x) > 0 on
(a, b). These orthogonal polynomials can be used to construct Gauss quadrature rules for integrals
with weight functions, in a similar manner to how they were constructed earlier in this section for
the case w(x) ≡ 1.
Exercise 7.5.4 Let w(x) be a weight function satisfying w(x) > 0 in (a, b). Derive a
Gauss quadrature rule of the form
\[ \int_a^b f(x)\, w(x)\, dx = \sum_{i=1}^{n} f(x_i)\, w_i + E[f] \]
A case of particular interest is the interval (−1, 1) with the weight function w(x) = (1 − x2 )−1/2 ,
as the orthogonal polynomials for the corresponding inner product are the Chebyshev polynomials,
introduced in Section 5.4.2. Unlike other Gauss quadrature rules, there are simple formulas for the
nodes and weights in this case.
Exercise 7.5.6 Use the result of Exercise 7.5.5 and direct construction to derive the
nodes and weights for an n-node Gaussian quadrature rule of the form
\[ \int_{-1}^{1} (1 - x^2)^{-1/2} f(x)\, dx = \sum_{i=1}^{n} f(x_i)\, w_i, \]
for a given weight function w(x), exactly when f ∈ P2n . That is, prescribing a node reduces the
degree of accuracy by only one. Such a quadrature rule is called a Gauss-Radau quadrature
rule [1].
We begin by dividing f (x) by (x − a), which yields
for the weight function w∗ (x) = (x − a)w(x). It is clear that w∗ (x) > 0 on (a, b). Because this rule
is exact for g ∈ P2n−1 , we have
\[
\begin{aligned}
I[f] &= \int_a^b q_{2n-1}(x)\, w^*(x)\, dx + f(a) \int_a^b w(x)\, dx \\
&= \sum_{i=1}^{n} q_{2n-1}(x_i^*)\, w_i^* + f(a) \int_a^b w(x)\, dx \\
&= \sum_{i=1}^{n} \frac{f(x_i^*) - f(a)}{x_i^* - a}\, w_i^* + f(a) \int_a^b w(x)\, dx \\
&= \sum_{i=1}^{n} f(x_i^*)\, w_i + f(a) \left[ \int_a^b w(x)\, dx - \sum_{i=1}^{n} w_i \right],
\end{aligned}
\]
Exercise 7.5.7 Following the discussion above, derive a Gauss-Radau quadrature rule in
which a node is prescribed at x = b. Prove that the weights w1 , w2 , . . . , wn+1 are positive.
Does this rule yield an upper bound or lower bound for the integrand f (x) = 1/(x − a)?
Exercise 7.5.8 Derive formulas for the nodes and weights for a Gauss-Lobatto
quadrature rule [1], in which nodes are prescribed at x = a and x = b. Specifically, the
rule must have n + 1 nodes x0 = a < x1 < x2 < · · · < xn−1 < xn = b. Prove that the
weights w0 , w1 , . . . , wn are positive. What is the degree of accuracy of this rule?
Our goal is to compute f 0 (0.25) as accurately as possible. Using a centered difference approximation,
\[ f'(x) = \frac{f(x+h) - f(x-h)}{2h} + O(h^2), \]
with x = 0.25 and h = 0.01, we obtain the approximation
f′(0.25) ≈ [f(0.26) − f(0.24)]/0.02 = −9.06975297890147,
which has absolute error 3.0 × 10^{-3}, and if we use h = 0.005, we obtain the approximation
f′(0.25) ≈ [f(0.255) − f(0.245)]/0.01 = −9.06746429492149,
which has absolute error 7.7 × 10^{-4}. As expected, the error decreases by a factor of approximately 4 when we halve the step size h, because the error in the centered difference formula is O(h^2).
We can obtain a more accurate approximation by applying Richardson Extrapolation to these approximations. We define the function N_1(h) to be the centered difference approximation to f′(0.25) obtained using the step size h. Then, with h = 0.01, we have N_1(h) = −9.06975297890147 and N_1(h/2) = −9.06746429492149, and the exact value is given by N_1(0) = −9.06669877124279. Because the error in the centered difference approximation satisfies
N_1(h) − N_1(0) = K_1 h^2 + K_2 h^4 + K_3 h^6 + ⋯,    (7.19)
where the constants K_1, K_2 and K_3 depend on the derivatives of f(x) at x = 0.25, it follows that
the new approximation
N_2(h) = N_1(h/2) + [N_1(h/2) − N_1(h)]/(2^2 − 1) = −9.06670140026149,
has fourth-order accuracy. Specifically, if we denote the exact value by N_2(0), we have
N_2(h) − N_2(0) = K̃_2 h^4 + K̃_3 h^6 + ⋯
for some constants K̃_2, K̃_3 that are independent of h.
Exercise 7.6.1 Use Taylor series expansion to prove (7.19); that is, the error in the
centered difference approximation can be expressed as a sum of terms involving only even
powers of h.
Exercise 7.6.2 Based on the preceding example, give a general formula for Richardson
extrapolation, applied to the approximation F (h) from (7.18), that uses F (h) and F (h/d),
for some integer d > 1, to obtain an approximation of F (0) that is of order r.
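As a small illustration of one step of this process, the following Matlab sketch (with our own choice of function, not the one used in the example above) computes N_1(h), N_1(h/2) and the extrapolated value N_2(h) for the centered difference approximation of a derivative.

f = @(x) exp(x);                                % hypothetical test function
x0 = 1;  h = 0.01;
N1 = @(h) (f(x0 + h) - f(x0 - h))/(2*h);        % second-order approximation
N2 = N1(h/2) + (N1(h/2) - N1(h))/(2^2 - 1);     % fourth-order approximation
disp(abs([N1(h) N2] - exp(1)))                  % the error in N2 is far smaller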
Using this expression for the error in the context of the Composite Trapezoidal Rule, applied
to the integral of a 2k-times differentiable function f (x) on a general interval [a, b], yields the
Euler-Maclaurin Expansion
∫_a^b f(x) dx = (h/2) [ f(a) + 2 Σ_{i=1}^{n−1} f(x_i) + f(b) ]
    + Σ_{r=1}^{k} c_r h^{2r} [ f^{(2r−1)}(b) − f^{(2r−1)}(a) ] − (h^{2k}/2) Σ_{i=1}^{n} ∫_{x_{i−1}}^{x_i} q_{2k}(t) f^{(2k)}(x) dx,
where, for each i = 1, 2, . . . , n, t = −1 + (2/h)(x − x_{i−1}), and the constants
c_r = q_{2r}(1)/2^{2r} = −B_{2r}/(2r)!,   r = 1, 2, . . . , k,
are closely related to the Bernoulli numbers Br .
It can be seen from this expansion that the error E_trap(h) in the Composite Trapezoidal Rule, like the error in the centered difference approximation of the derivative, has the form
E_trap(h) = K_1 h^2 + K_2 h^4 + K_3 h^6 + ⋯ + O(h^{2k}),
where the constants Ki are independent of h, provided that the integrand is at least 2k times
continuously differentiable. This knowledge of the error provides guidance on how Richardson Ex-
trapolation can be repeatedly applied to approximations obtained using the Composite Trapezoidal
Rule at different spacings in order to obtain higher-order accurate approximations.
It can also be seen from the Euler-Maclaurin Expansion that the Composite Trapezoidal Rule
is particularly accurate when the integrand is a periodic function, of period b − a, as this causes
the terms involving the derivatives of the integrand at a and b to vanish. Specifically, if f (x) is
periodic with period b − a, and is at least 2k times continuously differentiable, then the error in the Composite Trapezoidal Rule approximation to ∫_a^b f(x) dx, with spacing h, is O(h^{2k}), rather than O(h^2). It follows that if f(x) is infinitely differentiable, such as a finite linear combination of sines
or cosines, then the Composite Trapezoidal Rule has an exponential order of accuracy, meaning
that as h → 0, the error converges to zero more rapidly than any power of h.
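The following Matlab experiment, a sketch of our own with an arbitrarily chosen periodic integrand, illustrates this behavior: the error decreases far faster than the O(h^2) rate observed for general integrands.

f = @(x) 1./(2 + cos(2*pi*x));     % periodic on [0, 1]; exact integral is 1/sqrt(3)
for n = [4 8 16]
    x = linspace(0, 1, n+1);
    h = 1/n;
    T = h*(sum(f(x)) - (f(x(1)) + f(x(end)))/2);    % Composite Trapezoidal Rule
    fprintf('n = %2d, error = %.2e\n', n, abs(T - 1/sqrt(3)));
end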
Exercise 7.6.3 Use integration by parts to obtain explicit formulas for the polynomials
qr (t) for r = 2, 3, 4.
Exercise 7.6.4 Use the Composite Trapezoidal Rule to integrate f (x) = sin kπx, where
k is an integer, over [0, 1]. How does the error behave?
In Romberg Integration, an integral I[f] = ∫_a^b f(x) dx is approximated using the Composite Trapezoidal Rule with step sizes h_k = (b − a) 2^{−k}, where k is a nonnegative integer. Then, for each k, Richardson extrapolation is used k − 1 times to previously computed approximations in order to improve the order of accuracy as much as possible.
More precisely, suppose that we compute approximations T1,1 and T2,1 to the integral, using
the Composite Trapezoidal Rule with one and two subintervals, respectively. That is,
T_{1,1} = [(b − a)/2] [ f(a) + f(b) ],
T_{2,1} = [(b − a)/4] [ f(a) + 2 f((a + b)/2) + f(b) ].
Suppose that f has continuous derivatives of all orders on [a, b]. Then, the Composite Trapezoidal
Rule, for a general number of subintervals n, satisfies
∫_a^b f(x) dx = (h/2) [ f(a) + 2 Σ_{j=1}^{n−1} f(x_j) + f(b) ] + Σ_{i=1}^{∞} K_i h^{2i},
where h = (b − a)/n, x_j = a + jh, and the constants {K_i}_{i=1}^{∞} depend only on the derivatives of f.
It follows that we can use Richardson Extrapolation to compute an approximation with a higher order of accuracy. If we denote the exact value of the integral by I[f], then we have
I[f] = T_{1,1} + K_1 (b − a)^2 + O((b − a)^4),   I[f] = T_{2,1} + K_1 [(b − a)/2]^2 + O((b − a)^4).
Neglecting the O((b − a)^4) terms, we have a system of equations that we can solve for K_1 and I[f]. The value of I[f] obtained from this system, which we denote by T_{2,2}, is an improved approximation given by
T_{2,2} = T_{2,1} + (T_{2,1} − T_{1,1})/3.
It follows from the representation of the error in the Composite Trapezoidal Rule that I[f] = T_{2,2} + O(h^4).
Suppose that we compute another approximation T3,1 using the Composite Trapezoidal Rule
with 4 subintervals. Then, as before, we can use Richardson Extrapolation with T2,1 and T3,1
to obtain a new approximation T_{3,2} that is fourth-order accurate. Now, however, we have two approximations, T_{2,2} and T_{3,2}, that satisfy
T_{2,2} = I[f] + K̃_2 h^4 + O(h^6),   T_{3,2} = I[f] + K̃_2 (h/2)^4 + O(h^6),
for an appropriate step size h and some constant K̃_2. It follows that we can apply Richardson Extrapolation to these approximations to obtain a new approximation T_{3,3} that is sixth-order accurate. We can continue this process
to obtain as high an order of accuracy as we wish. We now describe the entire algorithm.
Algorithm 7.6.2 (Romberg Integration) Given a positive integer J, an interval [a, b] and a function f(x), the following algorithm computes an approximation to I[f] = ∫_a^b f(x) dx that is accurate to order 2J.

h = b − a
for j = 1, 2, . . . , J do
    T_{j,1} = (h/2) [ f(a) + 2 Σ_{i=1}^{2^{j−1}−1} f(a + ih) + f(b) ]   (Composite Trapezoidal Rule)
    for k = 2, 3, . . . , j do
        T_{j,k} = T_{j,k−1} + (T_{j,k−1} − T_{j−1,k−1})/(4^{k−1} − 1)   (Richardson Extrapolation)
    end
    h = h/2
end
It should be noted that in a practical implementation, Tj,1 can be computed more efficiently by
using Tj−1,1 , because Tj−1,1 already includes more than half of the function values used to compute
Tj,1 , and they are weighted correctly relative to one another. It follows that for j > 1, if we split
the summation in the algorithm into two summations containing odd- and even-numbered terms,
respectively, we obtain
T_{j,1} = (h/2) [ f(a) + 2 Σ_{i=1}^{2^{j−2}} f(a + (2i − 1)h) + 2 Σ_{i=1}^{2^{j−2}−1} f(a + 2ih) + f(b) ]
       = (h/2) [ f(a) + 2 Σ_{i=1}^{2^{j−2}−1} f(a + 2ih) + f(b) ] + h Σ_{i=1}^{2^{j−2}} f(a + (2i − 1)h)
       = (1/2) T_{j−1,1} + h Σ_{i=1}^{2^{j−2}} f(a + (2i − 1)h).
Example 7.6.3 We will use Romberg integration to obtain a sixth-order accurate approximation to
∫_0^1 e^{−x^2} dx,
an integral that cannot be computed using the Fundamental Theorem of Calculus. We begin by
using the Trapezoidal Rule, or, equivalently, the Composite Trapezoidal Rule
∫_a^b f(x) dx ≈ (h/2) [ f(a) + 2 Σ_{j=1}^{n−1} f(x_j) + f(b) ],   h = (b − a)/n,   x_j = a + jh,
with n = 1 and n = 2 subintervals to obtain the approximations T_{1,1} and T_{2,1}. Because the error in each of these approximations has the form K_1 h^2 + K_2 h^4 + K_3 h^6 + ⋯, where the constants K_1, K_2 and K_3 depend on the derivatives of f(x) on [a, b] and are independent of h, we can conclude that the extrapolated approximation T_{2,2} = T_{2,1} + (T_{2,1} − T_{1,1})/3 has fourth-order accuracy.
We can obtain a second approximation of fourth-order accuracy by using the Composite Trape-
zoidal Rule with n = 4 to obtain a third approximation of second-order accuracy. We set h =
(1 − 0)/4 = 1/4, and then compute
T_{3,1} = (0.25/2) [ f(0) + 2[f(0.25) + f(0.5) + f(0.75)] + f(1) ] = 0.74298409780038,
which has an absolute error of 3.8 × 10^{-3}. Now, we can apply Richardson Extrapolation to T_{2,1} and T_{3,1} to obtain
T_{3,2} = T_{3,1} + (T_{3,1} − T_{2,1})/3 = 0.74685537979099,
which has an absolute error of 3.1 × 10^{-5}. This significant decrease in error from T_{2,2} is to be expected, since both T_{2,2} and T_{3,2} have fourth-order accuracy, and T_{3,2} is computed using half the step size of T_{2,2}.
It follows from the error term in the Composite Trapezoidal Rule, and the formula for Richardson Extrapolation, that
T_{2,2} = ∫_0^1 e^{−x^2} dx + K̃_2 h^4 + O(h^6),   T_{3,2} = ∫_0^1 e^{−x^2} dx + K̃_2 (h/2)^4 + O(h^6).
Therefore, we can use Richardson Extrapolation with these two approximations to obtain a new approximation
T_{3,3} = T_{3,2} + (T_{3,2} − T_{2,2})/(2^4 − 1) = 0.74683370984975,
which has an absolute error of 9.6 × 10^{-6}. Because T_{3,3} is a linear combination of T_{3,2} and T_{2,2} in which the terms of order h^4 cancel, we can conclude that T_{3,3} is of sixth-order accuracy. 2
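The following Matlab function is a minimal sketch of Algorithm 7.6.2 that incorporates the efficient update of T_{j,1} derived above; the function name and argument list are our own choices. Called as romberg(@(x) exp(-x.^2), 0, 1, 3), its entries T(3,1) and T(3,3) reproduce the values T_{3,1} and T_{3,3} computed in Example 7.6.3 (the integrand must accept vector arguments).

function T = romberg(f, a, b, J)
    T = zeros(J, J);
    h = b - a;
    T(1,1) = h/2*(f(a) + f(b));          % Trapezoidal Rule on one subinterval
    for j = 2:J
        h = h/2;
        i = 1:2^(j-2);
        T(j,1) = T(j-1,1)/2 + h*sum(f(a + (2*i - 1)*h));    % reuse the previous row
        for k = 2:j
            T(j,k) = T(j,k-1) + (T(j,k-1) - T(j-1,k-1))/(4^(k-1) - 1);  % Richardson Extrapolation
        end
    end
end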
S is an empty stack
push(S, [a, b])
I = 0
while S is not empty do
    [a, b] = pop(S)   (the interval [a, b] on top of S is removed from S)
    I_1 = [(b − a)/2][f(a) + f(b)]   (Trapezoidal Rule)
    m = (a + b)/2
    I_2 = [(b − a)/4][f(a) + 2f(m) + f(b)]   (Composite Trapezoidal Rule with 2 subintervals)
    if |I_1 − I_2| < 3(b − a)TOL then
        I = I + I_2   (from the error term in the Trapezoidal Rule, |I[f] − I_2| ≈ (1/3)|I_1 − I_2|)
    else
        push(S, [a, m])
        push(S, [m, b])
    end
end
Throughout the execution of the loop in the above algorithm, the stack S contains all intervals
over which f still needs to be integrated to within the desired accuracy. Initially, the only such
interval is the original interval [a, b]. As long as intervals remain in the stack, the interval on top of
the stack is removed, and we attempt to integrate over it. If we obtain a sufficiently accurate result,
then we are finished with the interval. Otherwise, the interval is bisected into two subintervals,
both of which are pushed on the stack so that they can be processed later. Once the stack is
empty, we know that we have accurately integrated f over a collection of intervals whose union is
the original interval [a, b], so the algorithm can terminate.
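The following Matlab function is a minimal sketch of this stack-based algorithm, with the stack simulated by a matrix whose rows are the intervals awaiting processing; the function name and argument list are our own choices.

function I = adaptrapstack(f, a, b, tol)
    S = [a b];                    % the stack: one interval per row
    I = 0;
    while ~isempty(S)
        ab = S(end,:);  S(end,:) = [];        % pop the interval on top
        a1 = ab(1);  b1 = ab(2);  m = (a1 + b1)/2;
        I1 = (b1 - a1)/2*(f(a1) + f(b1));              % Trapezoidal Rule
        I2 = (b1 - a1)/4*(f(a1) + 2*f(m) + f(b1));     % two subintervals
        if abs(I1 - I2) < 3*(b1 - a1)*tol
            I = I + I2;                       % accept I2 on this subinterval
        else
            S = [S; a1 m; m b1];              % push both halves
        end
    end
end

For instance, adaptrapstack(@(x) exp(3*x).*sin(2*x), 0, pi/4, 1e-4) approximates the integral considered in Example 7.7.2 below, using the Trapezoidal Rule rather than Simpson's Rule as the underlying method.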
Example 7.7.2 We will use adaptive quadrature to compute the integral
∫_0^{π/4} e^{3x} sin 2x dx
to within (π/4) · 10^{-4}.
Let f(x) = e^{3x} sin 2x denote the integrand. First, we use Simpson's Rule, or, equivalently, the Composite Simpson's Rule with n = 2 subintervals, to obtain an approximation I_1 to this integral.
We have
I_1 = (π/4)/6 [ f(0) + 4f(π/8) + f(π/4) ] = 2.58369640324748.
Then, we divide the interval [0, π/4] into two subintervals of equal width, [0, π/8] and [π/8, π/4],
and integrate over each one using Simpson’s Rule to obtain a second approximation I2 . This is
equivalent to using the Composite Simpson’s Rule on [0, π/4] with n = 4 subintervals. We obtain
I_2 = (π/8)/6 [ f(0) + 4f(π/16) + f(π/8) ] + (π/8)/6 [ f(π/8) + 4f(3π/16) + f(π/4) ]
    = (π/16)/3 [ f(0) + 4f(π/16) + 2f(π/8) + 4f(3π/16) + f(π/4) ]
    = 2.58770145345862.
Now, we need to determine whether the approximation I_2 is sufficiently accurate. Because the error in the Composite Simpson's Rule is O(h^4), where h is the width of each subinterval used in the rule, it follows that the actual error in I_2 satisfies
|I_2 − I[f]| ≈ (1/15)|I_2 − I_1|,
where I[f ] is the exact value of the integral of f .
We find that the relation
|I_2 − I[f]| ≈ (1/15)|I_2 − I_1| < (π/4) · 10^{-4}
is not satisfied, so we must divide the interval [0, π/4] into two subintervals of equal width, [0, π/8]
and [π/8, π/4], and use the Composite Simpson’s Rule with these smaller intervals in order to
achieve the desired accuracy.
First, we work with the interval [0, π/8]. Proceeding as before, we use the Composite Simpson’s
Rule with n = 2 and n = 4 subintervals to obtain approximations I1 and I2 to the integral of f (x)
over this interval. We have
I_1 = (π/8)/6 [ f(0) + 4f(π/16) + f(π/8) ] = 0.33088926959519.
and
I_2 = (π/16)/6 [ f(0) + 4f(π/32) + f(π/16) ] + (π/16)/6 [ f(π/16) + 4f(3π/32) + f(π/8) ]
    = (π/32)/3 [ f(0) + 4f(π/32) + 2f(π/16) + 4f(3π/32) + f(π/8) ]
    = 0.33054510467064.
This time, the accuracy requirement (1/15)|I_2 − I_1| < (π/8) · 10^{-4} is satisfied, so we accept the approximation I_2 on this subinterval. Now, we need to achieve sufficient accuracy on the remaining subinterval, [π/8, π/4]. As before, we compute the approximations I_1 and I_2 of the integral of f over this interval and obtain
I_1 = (π/8)/6 [ f(π/8) + 4f(3π/16) + f(π/4) ] = 2.25681218386343.
and
I_2 = (π/16)/6 [ f(π/8) + 4f(5π/32) + f(3π/16) ] + (π/16)/6 [ f(3π/16) + 4f(7π/32) + f(π/4) ]
    = (π/32)/3 [ f(π/8) + 4f(5π/32) + 2f(3π/16) + 4f(7π/32) + f(π/4) ]
    = 2.25801455892266.
On this subinterval the accuracy requirement is not met, so we divide [π/8, π/4] into two subintervals of equal width, [π/8, 3π/16] and [3π/16, π/4]. Working first with [π/8, 3π/16], we obtain
I_1 = (π/16)/6 [ f(π/8) + 4f(5π/32) + f(3π/16) ] = 0.72676545197054
and
I_2 = (π/32)/6 [ f(π/8) + 4f(9π/64) + f(5π/32) ] + (π/32)/6 [ f(5π/32) + 4f(11π/64) + f(3π/16) ]
    = (π/64)/3 [ f(π/8) + 4f(9π/64) + 2f(5π/32) + 4f(11π/64) + f(3π/16) ]
    = 0.72677918153379.
This approximation satisfies the accuracy requirement on [π/8, 3π/16], so it is accepted.
Now, we work with the interval [3π/16, π/4]. Proceeding as before, we use the Composite Simp-
son’s Rule with n = 2 and n = 4 subintervals to obtain approximations I1 and I2 to the integral of
f (x) over this interval. We have
I_1 = (π/16)/6 [ f(3π/16) + 4f(7π/32) + f(π/4) ] = 1.53124910695212
and
I_2 = (π/32)/6 [ f(3π/16) + 4f(13π/64) + f(7π/32) ] + (π/32)/6 [ f(7π/32) + 4f(15π/64) + f(π/4) ]
    = (π/64)/3 [ f(3π/16) + 4f(13π/64) + 2f(7π/32) + 4f(15π/64) + f(π/4) ]
    = 1.53131941583939.
This approximation is also sufficiently accurate, and summing the accepted approximations over the subintervals [0, π/8], [π/8, 3π/16] and [3π/16, π/4] yields the overall approximation 2.58864370204382. Since the exact value is 2.58862863250716, the absolute error is −1.507 × 10^{-5}, which is less in magnitude than our desired error bound of (π/4) · 10^{-4} ≈ 7.854 × 10^{-5}. This is because on each subinterval, we ensured that our approximation was accurate to within 10^{-4} times the width of the subinterval, so that when we added these approximations, the total error in the sum would be bounded by 10^{-4} times the width of the union of these subintervals, which is the original interval [0, π/4].
The graph of the integrand over the interval of integration is shown in Figure 7.1. 2
Adaptive quadrature can be very effective, but it should be used cautiously, for the following
reasons:
• The integrand is only evaluated at a few points within each subinterval. Such sampling can
miss a portion of the integrand whose contribution to the integral can be misjudged.
• Regions in which the function is not smooth will still only make a small contribution to the
integral if the region itself is small, so this should be taken into account to avoid unnecessary
function evaluations.
Figure 7.1: Graph of f (x) = e3x sin 2x on [0, π/4], with quadrature nodes from Example 7.7.2
shown on the graph and on the x-axis.
• Adaptive quadrature can be very inefficient if the integrand has a discontinuity within a subin-
terval, since repeated subdivision will occur. This is unnecessary if the integrand is smooth
on either side of the discontinuity, so subintervals should be chosen so that discontinuities
occur between subintervals whenever possible.
Exercise 7.7.4 Let f(x) = e^{−1000(x−c)^2}, where c is a parameter. Use your function adapquadrecur from Exercise 7.7.3 to approximate ∫_0^1 f(x) dx for the cases c = 1/8 and c = 1/4. Explain the difference in performance between these two cases.
Then, to evaluate I[f ], one can use a Cartesian product rule, whose nodes and weights are
obtained by combining one-dimensional quadrature rules that are applied to each dimension.
For example, if functions of x are integrated along the line between x = a and x = b using
nodes xi and weights wi , for i = 1, . . . , n, and if functions of y are integrated along the line
between y = c and y = d using nodes yi and weights zi , for i = 1, . . . , m, then the resulting
Cartesian product rule
Q_{n,m}[f] = Σ_{i=1}^{n} Σ_{j=1}^{m} f(x_i, y_j) w_i z_j
where g(x) is evaluated by using a one-dimensional quadrature rule to compute the inner
integral
g(x) = ∫_{y_1(x)}^{y_2(x)} f(x, y) dy.
• For various simple regions such as triangles, there exist cubature rules that are not combi-
nations of one-dimensional quadrature rules. Cubature rules are more direct generalizations
of quadrature rules, in that they evaluate the integrand at selected nodes and use weights
determined by the geometry of the domain and the placement of the nodes.
It should be noted that all of these strategies apply only to certain special cases. The first algorithm
capable of integrating over a general two-dimensional domain was developed by Lambers and Rice
[21]. This algorithm combines the second and third strategies described above, decomposing the
domain into subdomains that are either triangles or regions between two curves.
Example 7.8.1 We will use the Composite Trapezoidal Rule with m = n = 2 to evaluate the double integral
∫_0^{1/2} ∫_0^{1/2} e^{y−x} dy dx.
We first apply the Composite Trapezoidal Rule in the x-direction, with h = (1/2 − 0)/2 = 1/4, to the single integral ∫_0^{1/2} g(x) dx, where
g(x) = ∫_0^{1/2} e^{y−x} dy.
This yields
∫_0^{1/2} ∫_0^{1/2} e^{y−x} dy dx = ∫_0^{1/2} g(x) dx
    ≈ (1/8) [ g(0) + 2g(1/4) + g(1/2) ]
    ≈ (1/8) [ ∫_0^{1/2} e^{y−0} dy + 2 ∫_0^{1/2} e^{y−1/4} dy + ∫_0^{1/2} e^{y−1/2} dy ].
Now, to evaluate each of these integrals, we use the Composite Trapezoidal Rule in the y-direction
with m = 2. If we let k denote the step size in the y-direction, we have k = (1/2 − 0)/2 = 1/4, and
therefore we have
∫_0^{1/2} ∫_0^{1/2} e^{y−x} dy dx ≈ (1/8) [ ∫_0^{1/2} e^{y−0} dy + 2 ∫_0^{1/2} e^{y−1/4} dy + ∫_0^{1/2} e^{y−1/2} dy ]
    ≈ (1/8) { (1/8)[ e^{0−0} + 2e^{1/4−0} + e^{1/2−0} ] + 2 · (1/8)[ e^{0−1/4} + 2e^{1/4−1/4} + e^{1/2−1/4} ] + (1/8)[ e^{0−1/2} + 2e^{1/4−1/2} + e^{1/2−1/2} ] }
    = (1/64)[ e^0 + 2e^{1/4} + e^{1/2} ] + (1/32)[ e^{−1/4} + 2e^0 + e^{1/4} ] + (1/64)[ e^{−1/2} + 2e^{−1/4} + e^0 ]
    = (3/32) e^0 + (1/16) e^{−1/4} + (1/64) e^{−1/2} + (1/16) e^{1/4} + (1/64) e^{1/2}
    ≈ 0.25791494889765.
The exact value, to 15 digits, is 0.255251930412762. The error is 2.66 × 10^{-3}, which is to be expected due to the use of few subintervals, and the fact that the Composite Trapezoidal Rule is only second-order accurate. 2
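A Cartesian product rule of this kind is easily expressed in terms of the one-dimensional trapezoidal weights. The following Matlab sketch, in the spirit of Exercise 7.8.1 but with a function name and argument list of our own choosing, reproduces the approximation just computed when called as trap2d(@(x,y) exp(y-x), 0, 1/2, 0, 1/2, 2, 2).

function I = trap2d(f, a, b, c, d, m, n)
    x = linspace(a, b, m+1);                  % m subintervals in the x-direction
    y = linspace(c, d, n+1);                  % n subintervals in the y-direction
    wx = (b - a)/m*[1/2, ones(1, m-1), 1/2];  % trapezoidal weights in x
    wy = (d - c)/n*[1/2, ones(1, n-1), 1/2];  % trapezoidal weights in y
    [X, Y] = meshgrid(x, y);
    I = wy*f(X, Y)*wx';                       % sum over all (x_i, y_j) with weights wx_i wy_j
end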
Example 7.8.2 We will use the Composite Simpson’s Rule with n = 2 and m = 4 to evaluate the
double integral
∫_0^1 ∫_x^{2x} (x^2 + y^3) dy dx.
In this case, the domain of integration described by the limits is not a rectangle, but a triangle
defined by the lines y = x, y = 2x, and x = 1. The Composite Simpson's Rule with n = 2 subintervals is
∫_a^b f(x) dx ≈ (h/3) [ f(a) + 4f((a + b)/2) + f(b) ],   h = (b − a)/n.
If a = 0 and b = 1, then h = (1 − 0)/2 = 1/2, and this simplifies to
∫_0^1 f(x) dx ≈ (1/6) [ f(0) + 4f(1/2) + f(1) ].
We first use this rule to evaluate the "single" integral
∫_0^1 g(x) dx,
where
g(x) = ∫_x^{2x} (x^2 + y^3) dy.
This yields
∫_0^1 ∫_x^{2x} (x^2 + y^3) dy dx = ∫_0^1 g(x) dx
    ≈ (1/6) [ g(0) + 4g(1/2) + g(1) ]
    ≈ (1/6) [ ∫_0^0 (0^2 + y^3) dy + 4 ∫_{1/2}^1 ((1/2)^2 + y^3) dy + ∫_1^2 (1^2 + y^3) dy ].
The first integral will be zero, since the limits of integration are equal. To evaluate the second and
third integrals, we use the Composite Simpson’s Rule in the y-direction with m = 4. If we let k
denote the step size in the y-direction, we have k = (2x−x)/4 = x/4, and therefore we have k = 1/8
for the second integral and k = 1/4 for the third. This yields
∫_0^1 ∫_x^{2x} (x^2 + y^3) dy dx ≈ (1/6) [ 4 ∫_{1/2}^1 ((1/2)^2 + y^3) dy + ∫_1^2 (1 + y^3) dy ]
    ≈ (1/6) { 4 · (1/24) [ (1/4 + (1/2)^3) + 4(1/4 + (5/8)^3) + 2(1/4 + (3/4)^3) + 4(1/4 + (7/8)^3) + (1/4 + 1^3) ]
        + (1/12) [ (1 + 1^3) + 4(1 + (5/4)^3) + 2(1 + (3/2)^3) + 4(1 + (7/4)^3) + (1 + 2^3) ] }
    ≈ 1.03125.
The exact value is 1. The error 3.125 × 10^{-2} is rather large, which is to be expected due to the poor distribution of nodes through the triangular domain of integration. A better distribution is achieved if we use n = 4 and m = 2, which yields the much more accurate approximation of 1.001953125.
2
Exercise 7.8.2 Generalize your function quadcomptrap2d from Exercise 7.8.1 so that
the arguments c and d can be either scalars or function handles. If they are function
handles, then your function approximates the integral
I[f] = ∫_a^b ∫_{c(x)}^{d(x)} f(x, y) dy dx.
Hint: use the Matlab function isnumeric to determine whether c and d are numbers.
Exercise 7.8.3 Generalize your function quadcomptrap2d from Exercise 7.8.2 so that
the arguments a, b, c and d can be either scalars or function handles. If a and b are
function handles, then your function approximates the integral
I[f] = ∫_c^d ∫_{a(y)}^{b(y)} f(x, y) dx dy.
Exercise 7.8.4 Use the error formula for the Composite Trapezoidal Rule to obtain an
error formula for a Cartesian product rule such as the one implemented in Exercise
7.8.1. As in that exercise, assume that m subintervals are used in the x-direction and n
subintervals in the y-direction. Hint: first, apply the single-variable error formula to the
integral
∫_a^b g(x) dx,   g(x) = ∫_c^d f(x, y) dy.
Exercise 7.8.6 Modify your function quadcomptrap3d from Exercise 7.8.5 to obtain
a function quadcompsimp3d that uses the Composite Simpson’s Rule in each direction.
Then, use an approach similar to that used in Exercise 7.8.4 to obtain an error formula
for the Cartesian product rule used in quadcompsimp3d.
Exercise 7.8.7 Generalize your function quadcomptrap3d from Exercise 7.8.5 so that
any of the arguments a, b, c, d, s and t can be either scalars or function handles.
For example, if a and b are scalars, c and d are function handles that have two input
arguments, and s and t are function handles that have one input argument, then your
function approximates the integral
I[f] = ∫_a^b ∫_{s(x)}^{t(x)} ∫_{c(x,z)}^{d(x,z)} f(x, y, z) dy dz dx.
Hint: use the Matlab function isnumeric to determine whether any arguments are
numbers, and the function nargin to determine how many arguments a function handle
requires.
In more than three dimensions, generalizations of quadrature rules are not practical, since
the number of function evaluations needed to attain sufficient accuracy grows very rapidly as the
number of dimensions increases. An alternative is the Monte Carlo method, which samples the
integrand at n randomly selected points and attempts to compute the mean value of the integrand
on the entire domain. The method converges rather slowly but its convergence rate depends only
on n, not the number of dimensions.
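The following Matlab sketch (entirely our own; the integrand and dimension are arbitrary choices) illustrates the idea for the unit hypercube [0, 1]^d, whose volume is 1, so that the integral equals the mean value of the integrand. The error of such an estimate typically behaves like O(1/sqrt(n)), regardless of d.

f = @(x) exp(-sum(x.^2, 2));    % integrand, evaluated row-wise
d = 6;                          % number of dimensions
n = 1e5;                        % number of random samples
x = rand(n, d);                 % each row is a point drawn uniformly from [0,1]^d
I = mean(f(x))                  % Monte Carlo estimate of the integral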
Part IV
Chapter 8
Zeros of Nonlinear Functions

This chapter is concerned with solving an equation of the form
f(x) = 0,
where f(x) : R^n → R^m can be any known function. A solution x of such a nonlinear equation is called a root of the equation, as well as a zero of the function f.
In general, nonlinear equations cannot be solved in a finite sequence of steps. Unlike linear equations, which can be solved using direct methods such as Gaussian elimination, nonlinear equations usually require iterative methods. In iterative methods, an approximate solution is refined with each iteration until
it is determined to be sufficiently accurate, at which time the iteration terminates. Since it is
desirable for iterative methods to converge to the solution as rapidly as possible, it is necessary to
be able to measure the speed with which an iterative method converges.
To that end, we assume that an iterative method generates a sequence of iterates x0 , x1 , x2 , . . .
that converges to the exact solution x∗ . Ideally, we would like the error in a given iterate xk+1 to
be much smaller than the error in the previous iterate xk . For example, if the error is raised to a
power greater than 1 from iteration to iteration, then, because the error is typically less than 1, it
will approach zero very rapidly. This leads to the following definition.
Definition 8.1.1 (Order and Rate of Convergence) Let {x_k}_{k=0}^{∞} be a sequence in R^n that converges to x* ∈ R^n, and assume that x_k ≠ x* for each k. We say that the order of convergence of {x_k} to x* is order r, with asymptotic error constant C, if
lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖^r = C,
If r = 1, and 0 < C < 1, we say that convergence is linear. If r = 1 and C = 0, or if 1 < r < 2 for
any positive C, then we say that convergence is superlinear. If r = 2, then the method converges
quadratically, and if r = 3, we say it converges cubically, and so on. Note that the value of C need
only be bounded above in the case of linear convergence.
When convergence is linear, the asymptotic rate of convergence ρ = −log_{10} C indicates the number of correct decimal digits obtained in a single iteration. In other words, ⌊1/ρ⌋ + 1 iterations are required to obtain an additional correct decimal digit, where ⌊x⌋ is the "floor" of x, which is the largest integer that is less than or equal to x.
where c is between x and x + ε. With ε being small, the absolute condition number can be approximated by |f′(x)|, the factor by which the perturbation ε in x is amplified to obtain the perturbation in f(x).
In solving a nonlinear equation in one dimension, we are trying to solve an inverse problem; that
is, instead of computing y = f (x) (the forward problem), we are computing x = f −1 (0), assuming
that f is invertible near the root. It follows from the differentiation rule
d/dx [f^{-1}(x)] = 1 / f′(f^{-1}(x))
that the condition number for solving f(x) = 0 is approximately 1/|f′(x*)|, where x* is the solution.
This discussion can be generalized to higher dimensions, where the condition number is measured
using the norm of the Jacobian.
Using backward error analysis, we assume that the approximate solution x̂ = f̂^{-1}(0), obtained by evaluating an approximation of f^{-1} at the exact input y = 0, can also be viewed as evaluating the exact function f^{-1} at a nearby input ŷ = ε. That is, the approximate solution x̂ = f^{-1}(ε) is the exact solution of a nearby problem.
From this viewpoint, it can be seen from a graph that if |f′| is large near x*, which means that the condition number of the problem f(x) = 0 is small (that is, the problem is well-conditioned), then even if ε is relatively large, x̂ = f^{-1}(ε) is close to x*. On the other hand, if |f′| is small near x*, so that the problem is ill-conditioned, then even if ε is small, x̂ can be far away from x*. These
contrasting situations are illustrated in Figure 8.1.
Figure 8.1: Left plot: Well-conditioned problem of solving f(x) = 0. f′(x*) = 24, and an approximate solution x̂ = f^{-1}(ε) has small error relative to ε. Right plot: Ill-conditioned problem of solving f(x) = 0. f′(x*) = 0, and x̂ has large error relative to ε.
Consider, for example, the functions f(x) = x − cos x and g(x) = e^x cos(x^2) on the intervals [0, π/2] and [0, π], respectively. The graphs of these functions are shown in Figure 8.2. It can be seen that f(a) and f(b) have different signs, and since both functions are continuous
Figure 8.2: Illustrations of the Intermediate Value Theorem. Left plot: f(x) = x − cos x has a unique root on [0, π/2]. Right plot: g(x) = e^x cos(x^2) has multiple roots on [0, π].
on [a, b], the Intermediate Value Theorem guarantees the existence of a root in [a, b]. However,
both of these intervals are rather large, so we cannot obtain a useful approximation of a root from
this information alone.
At each root in these examples, f(x) changes sign, so f(x) > 0 for x on one side of the root x*, and f(x) < 0 on the other side. Therefore, if we can find two values a_0 and b_0 such that f(a_0) and f(b_0) have different signs, but a_0 and b_0 are very close to one another, then we can accurately approximate x*.
Consider the first example in Figure 8.2, which has a unique root. We have f(0) < 0 and
f (π/2) > 0. From the graph, we see that if we evaluate f at any other point x0 in (0, π/2), and
we do not “get lucky” and happen to choose the root, then either f (x0 ) > 0 or f (x0 ) < 0. If
f (x0 ) > 0, then f has a root on (0, x0 ), because f changes sign on this interval. On the other hand,
if f (x0 ) < 0, then f has a root on (x0 , π/2), because of a sign change. This is illustrated in Figure
8.3. The bottom line is, by evaluating f (x) at an intermediate point within (a, b), the size of the
interval in which we need to search for a root can be reduced.
The Method of Bisection attempts to reduce the size of the interval in which a solution is known
to exist. Suppose that we evaluate f (m), where m = (a + b)/2. If f (m) = 0, then we are done.
Otherwise, f must change sign on the interval [a, m] or [m, b], since f (a) and f (b) have different
signs and therefore f (m) must have a different sign from one of these values.
Let us try this approach on the function f (x) = x − cos x, on [a, b] = [0, π/2]. This example
can be set up in Matlab as follows:
>> a=0;
>> b=pi/2;
>> f=inline('x-cos(x)');
To help visualize the results of the computational process that we will carry out to find an approx-
imate solution, we will also graph f (x) on [a, b]:
>> dx=(b-a)/100;
>> x=a:dx:b;
>> plot(x,f(x))
Figure 8.3: Because f (π/4) > 0, f (x) has a root in (0, π/4).
We then compute the midpoint m = (a + b)/2 = π/4 and evaluate f there. We can see that f(a) and f(m) have different signs, so a root exists within (a, m). We can therefore update our search space [a, b] accordingly:
>> b=m;
We then repeat the process, working with the midpoint of our new interval:
>> m=(a+b)/2
>> plot(m,f(m),'ro')
Now, it does not matter at which endpoint of our interval f (x) has a positive value; we only need
the signs of f (a) and f (b) to be different. Therefore, we can simply check whether the product of
the values of f at the endpoints is negative:
>> f(a)*f(m)
ans =
0.531180450812563
>> f(m)*f(b)
ans =
-0.041586851697525
We see that the sign of f changes on (m, b), so we update a to reflect that this is our new interval
to search:
>> a=m;
>> m=(a+b)/2
>> plot(m,f(m),'ro')
Exercise 8.2.1 Repeat this process a few more times: check whether f changes sign on
(a, m) or (m, b), update [a, b] accordingly, and then compute a new midpoint m. After
computing some more midpoints, and plotting each one as we have been, what behavior
can you observe? Are the midpoints converging, and if so, are they converging to a root
of f ? Check by evaluating f at each midpoint.
We can continue this process until the interval of interest [a, b] is sufficiently small, in which
case we must be close to a solution. By including these steps in a loop, we obtain the following
algorithm that implements the approach that we have been carrying out.
Figure 8.4: Progress of the Bisection method toward finding a root of f (x) = x − cos x on (0, π/2)
Algorithm 8.2.1 (Bisection) Let f be a continuous function on the interval [a, b] that
changes sign on (a, b). The following algorithm computes an approximation x∗ to a num-
ber x in (a, b) such that f (x) = 0.
for j = 1, 2, . . . do
m = (a + b)/2
if f (m) = 0 or b − a is sufficiently small then
x∗ = m
return x∗
end
if f (a)f (m) < 0 then
b=m
else
a=m
end
end
At the beginning, it is known that (a, b) contains a solution. During each iteration, this algo-
rithm updates the interval (a, b) by checking whether f changes sign in the first half (a, m), or in
the second half (m, b). Once the correct half is found, the interval (a, b) is set equal to that half.
Therefore, at the beginning of each iteration, it is known that the current interval (a, b) contains a
solution.
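A minimal Matlab implementation of Algorithm 8.2.1 might look as follows; the function name, argument list and stopping criterion are our own choices, not prescribed by the text.

function xstar = bisection(f, a, b, tol)
    if f(a)*f(b) >= 0
        error('f must change sign on (a,b)');
    end
    while b - a > tol
        m = (a + b)/2;
        if f(m) == 0
            xstar = m;          % found an exact root
            return;
        elseif f(a)*f(m) < 0
            b = m;              % root lies in (a, m)
        else
            a = m;              % root lies in (m, b)
        end
    end
    xstar = (a + b)/2;
end

For example, bisection(@(x) x - cos(x), 0, pi/2, 1e-10) returns a value near 0.739085, the root of x − cos x.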
Theorem 8.2.2 Let f be continuous on [a, b], and assume that f (a)f (b) < 0. For each
positive integer n, let pn be the nth iterate that is produced by the bisection algorithm.
Then the sequence {pn }∞ n=1 converges to a number p in (a, b) such that f (p) = 0, and
each iterate p_n satisfies
|p_n − p| ≤ (b − a)/2^n.
It should be noted that because the nth iterate can lie anywhere within the interval (a, b) that is
used during the nth iteration, it is possible that the error bound given by this theorem may be
quite conservative.
For example, applying bisection to f(x) = x^2 − x − 1 on the interval [1, 2] produces the following sequence of intervals [a, b], midpoints m = (a + b)/2, and values f(m):

a           b            m = (a + b)/2   f(m)
1           2            1.5             −0.25
1.5         2            1.75            0.3125
1.5         1.75         1.625           0.015625
1.5         1.625        1.5625          −0.12109
1.5625      1.625        1.59375         −0.053711
1.59375     1.625        1.609375        −0.019287
1.609375    1.625        1.6171875       −0.0018921
1.6171875   1.625        1.62109375      0.0068512
1.6171875   1.62109375   1.619140625     0.0024757
1.6171875   1.619140625  1.6181640625    0.00029087
The correct solution, to ten decimal places, is 1.6180339887, which is the number known as the
golden ratio. 2
Exercise 8.2.5 The function f(x) = x^2 − 2x − 2 has two real roots, one positive and one negative. Find two disjoint intervals [a_1, b_1] and [a_2, b_2] that can be used with bisection to find the negative and positive roots, respectively. Why is it not practical to use a much larger interval that contains both roots, so that bisection can supposedly find one of them?
For this method, it is easier to determine the order of convergence if we use a different measure of the error in each iterate x_k. Since each iterate is contained within an interval [a_k, b_k] where b_k − a_k = 2^{-k}(b − a), with [a, b] being the original interval, it follows that we can bound the error x_k − x* by e_k = b_k − a_k. Using this measure, we can easily conclude that bisection converges linearly, with asymptotic error constant 1/2.
A nonlinear equation of the form f(x) = 0 can be rewritten to obtain an equation of the form
x = g(x),
where a solution x* of this equation, which satisfies g(x*) = x*, is called a fixed point of g. The reformulation of the original problem f(x) = 0 as one of the form x = g(x) leads to a simple solution method known as fixed-point iteration, or simple iteration, which we now describe.
When rewriting this equation in the form x = g(x), it is essential to choose the function g
wisely. One guideline is to choose g(x) = x − φ(x)f (x), where the function φ(x) is, ideally, nonzero
except possibly at a solution of f (x) = 0. This can be satisfied by choosing φ(x) to be constant,
but this can fail, as the following example illustrates.
Example 8.3.2 Consider the equation
x + ln x = 0.
By the Intermediate Value Theorem, this equation has a solution in the interval [0.5, 0.6]. Furthermore, this solution is unique. To see this, let f(x) = x + ln x. Then f′(x) = 1 + 1/x > 0 on the domain of f, which means that f is increasing on its entire domain. Therefore, it is not possible for f(x) = 0 to have more than one solution.
We consider using Fixed-point Iteration to solve the equivalent equation
x = g(x) = −ln x.
That is, we choose φ(x) ≡ 1. Let us try applying Fixed-point Iteration in Matlab:
>> g=inline('-log(x)')
g =
Inline function:
g(x) = -log(x)
>> x=0.55;
>> x=g(x)
x =
0.597837000755620
>> x=g(x)
x =
0.514437136173803
>> x=g(x)
x =
0.664681915480620
Exercise 8.3.1 Try this for a few more iterations. What happens?
Clearly, we need to use a different approach for converting our original equation f (x) = 0 to an
equivalent equation of the form x = g(x).
What went wrong? To help us answer this question, we examine the error e_k = x_k − x*. Suppose that x = g(x) has a solution x* in [a, b], as it does in the preceding example, and that g is also continuously differentiable on [a, b], as was the case in that example. We can use the Mean Value Theorem to obtain
e_{k+1} = x_{k+1} − x* = g(x_k) − g(x*) = g′(ξ_k)(x_k − x*) = g′(ξ_k) e_k,
where ξ_k lies between x_k and x*, and therefore
lim_{k→∞} |e_{k+1}| / |e_k| = lim_{k→∞} |g′(ξ_k)| = |g′(x*)|.
Recall, though, that for linear convergence, the asymptotic error constant C = |g′(x*)| must satisfy C < 1. Unfortunately, with g(x) = −ln x, we have |g′(x)| = |−1/x| > 1 on the interval [0.5, 0.6], so it is not surprising that the iteration diverged.
What if we could convert the original equation f(x) = 0 into an equation of the form x = g(x) so that g′ satisfied |g′(x)| < 1 on an interval [a, b] where a fixed point was known to exist? What we can do is take advantage of the differentiation rule
d/dx [f^{-1}(x)] = 1 / f′(f^{-1}(x))
and apply g^{-1}(x) = e^{-x} to both sides of the equation x = g(x) to obtain
g^{-1}(x) = g^{-1}(g(x)) = x,
which simplifies to
x = e^{-x}.
The function g(x) = e^{-x} satisfies |g′(x)| < 1 on [0.5, 0.6], as g′(x) = −e^{-x}, and e^{-x} < 1 when the argument x is positive. What happens if you try Fixed-point Iteration with this choice of g?
>> g=inline('exp(-x)')
g =
Inline function:
g(x) = exp(-x)
>> x=0.55;
>> x=g(x)
x =
0.576949810380487
>> x=g(x)
x =
0.561608769952327
>> x=g(x)
x =
0.570290858658895
This is more promising.
Exercise 8.3.2 Continue this process to confirm that the iteration is in fact converging.
2
Having seen what can go wrong if we are not careful in applying Fixed-Point Iteration, we
should now address the questions of existence and uniqueness of a solution to the modified problem
g(x) = x. The following result, first proved in [9], answers the first of these questions.
Theorem 8.3.3 (Brouwer’s Fixed Point Theorem) Let g be continuous on [a, b]. If
g(x) ∈ [a, b] for each x ∈ [a, b], then g has a fixed point in [a, b].
Exercise 8.3.3 Use the Intermediate Value Theorem (see Theorem A.1.10) to prove
Theorem 8.3.3.
Given a continuous function g that is known to have a fixed point in an interval [a, b], we can try to
find this fixed point by repeatedly evaluating g at points in [a, b] until we find a point x for which
g(x) = x. This is the essence of the method of Fixed-point Iteration. However, just because g has
a fixed point does not mean that this iteration will necessarily converge. We will now investigate
this further.
Therefore, if g satisfies the conditions of the Brouwer Fixed-Point Theorem, and g is a contraction
on [a, b], and x0 ∈ [a, b] , then fixed-point iteration is convergent; that is, xk converges to x∗ .
Furthermore, the fixed point x* must be unique, for if there exist two distinct fixed points x* and y* in [a, b], then, by the Lipschitz condition, we have
|x* − y*| = |g(x*) − g(y*)| ≤ L|x* − y*| < |x* − y*|,
which is a contradiction. Therefore, we must have x* = y*. We summarize our findings with the
statement of the following result, first established in [3].
In general, when fixed-point iteration converges, it does so at a rate that varies inversely with the
Lipschitz constant L.
If g satisfies the conditions of the Contraction Mapping Theorem with Lipschitz constant L,
then Fixed-point Iteration achieves at least linear convergence, with an asymptotic error constant
that is bounded above by L. This value can be used to estimate the number of iterations needed
to obtain an additional correct decimal digit, but it can also be used to estimate the total number
of iterations needed for a specified degree of accuracy.
From the Lipschitz condition, we have, for k ≥ 1,
|x_k − x*| ≤ L|x_{k−1} − x*| ≤ ⋯ ≤ L^k |x_0 − x*|.
From
|x_0 − x*| ≤ |x_0 − x_1 + x_1 − x*| ≤ |x_0 − x_1| + |x_1 − x*| ≤ |x_0 − x_1| + L|x_0 − x*|
we obtain
|x_k − x*| ≤ [L^k / (1 − L)] |x_1 − x_0|.    (8.1)
We can bound the number of iterations after performing a single iteration, as long as the Lipschitz
constant L is known.
Exercise 8.3.4 Use (8.1) to determine a lower bound on the number of iterations required to ensure that |x_k − x*| ≤ ε for some error tolerance ε.
We can now develop a practical implementation of Fixed-Point Iteration.
Exercise 8.3.5 Write a Matlab function [x,niter]=fixedpt(g,x0,tol) that imple-
ments Algorithm 8.3.1 to solve the equation x = g(x) with initial guess x0 , except that
instead of simply using the absolute difference between iterates to test for convergence, the
error estimate (8.1) is compared to the specified tolerance tol, and a maximum number
of iterations is determined based on the result of Exercise 8.3.4. The output arguments x
and niter are the computed solution x∗ and number of iterations, respectively. Test your
function on the equation from Example 8.3.2.
We know that Fixed-point Iteration will converge to the unique fixed point in [a, b] if g satisfies
the conditions of the Contraction Mapping Theorem. However, if g is differentiable on [a, b], its
derivative can be used to obtain an alternative criterion for convergence that can be more practical
than computing the Lipschitz constant L. If we denote the error in xk by ek = xk − x∗ , we can see
from the Mean Value Theorem and the fact that g(x∗ ) = x∗ that
ek+1 = xk+1 − x∗ = g(xk ) − g(x∗ ) = g 0 (ξk )(xk − x∗ )
where ξk lies between xk and x∗ . However, from
g(xk ) − g(x∗ )
0
|g (ξk )| =
xk − x∗
it follows that if |g′(x)| ≤ L on (a, b), where L < 1, then the Contraction Mapping Theorem applies.
This leads to the following result.
Using Taylor expansion of the error around x∗ , it can also be shown that if g 0 is continuous at x∗
and |g 0 (x∗ )| < 1, then Fixed-point Iteration is locally convergent; that is, it converges if x0 is chosen
sufficiently close to x∗ .
It can be seen from the preceding discussion why |g′(x)| must be bounded away from 1 on (a, b), as opposed to the weaker condition |g′(x)| < 1 on (a, b). If |g′(x)| is allowed to approach 1 as x approaches a point c ∈ (a, b), then it is possible that the error e_k might not approach zero as k increases, in which case Fixed-point Iteration would not converge.
Exercise 8.3.6 Find a function g and an interval [a, b] such that g is continuous on [a, b] and differentiable on (a, b), but does not satisfy a Lipschitz condition on [a, b] for any Lipschitz constant L.
The derivative can also be used to indicate why Fixed-point Iteration might not converge.
Example 8.3.6 The function g(x) = x^2 + 3/16 has two fixed points, x*_1 = 1/4 and x*_2 = 3/4, as can be determined by solving the quadratic equation x^2 + 3/16 = x. If we consider the interval [0, 3/8], then g satisfies the conditions of the Fixed-point Theorem, as g′(x) = 2x < 1 on this interval, and therefore Fixed-point Iteration will converge to x*_1 for any x_0 ∈ [0, 3/8].
On the other hand, g′(3/4) = 2(3/4) = 3/2 > 1. Therefore, it is not possible for g to satisfy the conditions of the Fixed-point Theorem on an interval containing x*_2. Furthermore, if x_0 is chosen so that 1/4 < x_0 < 3/4, then Fixed-point Iteration will converge to x*_1 = 1/4, whereas if x_0 > 3/4, then Fixed-point Iteration diverges. 2
The fixed point x∗2 = 3/4 in the preceding example is an unstable fixed point of g, meaning that
no choice of x0 yields a sequence of iterates that converges to x∗2 . The fixed point x∗1 = 1/4 is a
stable fixed point of g, meaning that any choice of x0 that is sufficiently close to x∗1 yields a sequence
of iterates that converges to x∗1 .
The preceding example shows that Fixed-point Iteration applied to an equation of the form x = g(x) can fail to converge to a fixed point x* if |g′(x*)| > 1. We wish to determine whether this condition indicates non-convergence in general. If |g′(x*)| > 1, and g′ is continuous in a neighborhood of x*, then there exists an interval |x − x*| ≤ δ such that |g′(x)| > 1 on the interval.
If x_k lies within this interval, it follows from the Mean Value Theorem that
|x_{k+1} − x*| = |g(x_k) − g(x*)| = |g′(η)||x_k − x*|,
where η lies between x_k and x*. Because η is also in this interval, we have
|x_{k+1} − x*| > |x_k − x*|.
8.3. FIXED-POINT ITERATION 285
In other words, the error in the iterates increases whenever they fall within a sufficiently small
interval containing the fixed point. Because of this increase, the iterates must eventually fall outside
of the interval. Therefore, it is not possible to find a k0 , for any given δ, such that |xk − x∗ | ≤ δ
for all k ≥ k0 . We have thus proven the following result.
Theorem 8.3.7 Let g have a fixed point at x*, and let g′ be continuous in a neighborhood of x*. If |g′(x*)| > 1, then Fixed-point Iteration does not converge to x* for any initial guess x_0, except in a finite number of iterations.
Now, suppose that in addition to the conditions of the Fixed-point Theorem, we assume that g′(x*) = 0, and that g is twice continuously differentiable on [a, b]. Then, using Taylor's Theorem, we obtain
e_{k+1} = g(x_k) − g(x*) = g′(x*)(x_k − x*) + (1/2) g″(ξ_k)(x_k − x*)^2 = (1/2) g″(ξ_k) e_k^2,
where ξ_k lies between x_k and x*. It follows that for any initial iterate x_0 ∈ [a, b], Fixed-point Iteration converges at least quadratically, with asymptotic error constant |g″(x*)/2|. Later, this will be exploited to obtain a quadratically convergent method for solving nonlinear equations of the form f(x) = 0.
8.3.3 Relaxation
Now that we understand the convergence behavior of Fixed-point Iteration, we consider the appli-
cation of Fixed-point Iteration to the solution of an equation of the form f (x) = 0.
Example 8.3.8 We use Fixed-point Iteration to solve the equation f (x) = 0, where f (x) = x −
cos x − 2. It makes sense to work with the equation x = g(x), where g(x) = cos x + 2.
Where should we look for a solution to this equation? For example, consider the interval [0, π/4]. On this interval, g′(x) = −sin x, which certainly satisfies the condition |g′(x)| ≤ ρ < 1 where ρ = √2/2, but g does not map this interval into itself, as required by the Brouwer Fixed-point Theorem.
On the other hand, if we consider the interval [1, 3], it can readily be confirmed that g(x) maps
this interval to itself, as 1 ≤ 2 + cos x ≤ 3 for all real x, so a fixed point exists in this interval.
First, let’s set up a figure with a graph of g(x):
>> x=1:0.01:3;
>> figure(1)
>> plot(x,g(x))
>> hold on
>> plot([ 1 3 ],[ 1 3 ],'k--')
>> xlabel('x')
>> ylabel('y')
>> title('g(x) = cos x + 2')
Exercise 8.3.7 Go ahead and try Fixed-point Iteration on g(x), with initial guess x0 = 2.
What happens?
The behavior is quite interesting, as the iterates seem to bounce back and forth. This is illustrated in Figure 8.5. Continuing, we see that convergence is achieved, but it is quite slow. An examination of the derivative explains why: g′(x) = −sin x, and we have |g′(π/2)| = |−sin π/2| = 1, so the conditions of the Fixed-point Theorem are not satisfied; in fact, we could not be assured of convergence at all, though it does occur in this case.
An examination of the iterates shown in Figure 8.5, along with an indication of the solution,
suggests how convergence can be accelerated. What if we used the average of x and g(x) at each
iteration? That is, we solve x = h(x), where
h(x) = (1/2)[x + g(x)] = (1/2)[x + cos x + 2].
You can confirm that if x = h(x), then f(x) = 0. However, we have
h′(x) = (1/2)[1 − sin x],
and how large can this be on the interval [1, 3]? In this case, the Fixed-point Theorem does apply.
Exercise 8.3.8 Try Fixed-point Iteration with h(x), and with initial guess x0 = 2. What
behavior can you observe?
The lesson to be learned here is that the most straightforward choice of g(x) is not always the
wisest–the key is to minimize the size of the derivative near the solution. 2
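The following short Matlab experiment (our own, not from the text) compares the two formulations, starting both iterations from x_0 = 2.

g = @(x) cos(x) + 2;               % original formulation
h = @(x) (x + cos(x) + 2)/2;       % averaged formulation
xg = 2;  xh = 2;
for k = 1:10
    xg = g(xg);                    % slow, oscillating convergence
    xh = h(xh);                    % rapid convergence
end
disp([xg xh])                      % after ten iterations, xh has essentially converged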
As previously discussed, a common choice for a function g(x) to use with Fixed-point Iteration
to solve the equation f (x) = 0 is a function of the form g(x) = x − φ(x)f (x), where φ(x) is nonzero.
Clearly, the simplest choice of φ(x) is a constant function φ(x) ≡ λ, but it is important to choose
λ so that Fixed-point Iteration with g(x) will converge.
Suppose that x* is a solution of the equation f(x) = 0, and that f is continuously differentiable in a neighborhood of x*, with f′(x*) = α > 0. Then, by continuity, there exists an interval [x* − δ, x* + δ] containing x* on which m ≤ f′(x) ≤ M, for positive constants m and M. We can then prove the following results.
Exercise 8.3.9 Let f′(x*) = α > 0. Prove that there exist δ, λ > 0 such that on the interval |x − x*| ≤ δ, there exists L < 1 such that |g′(x)| ≤ L, where g(x) = x − λf(x). What is the value of L? Hint: choose λ so that upper and lower bounds on g′(x) are equal to ±L.
Exercise 8.3.10 Under the assumptions of Exercise 8.3.9, prove that if I_δ = [x* − δ, x* + δ], and λ > 0 is chosen so that |g′(x)| ≤ L < 1 on I_δ, then g maps I_δ into itself.
We conclude from the preceding two exercises that the Fixed-point Theorem applies, and Fixed-
point Iteration converges linearly to x∗ for any choice of x0 in [x∗ − δ, x∗ + δ], with asymptotic error
constant |1 − λα| ≤ L.
In summary, if f is continuously differentiable in a neighborhood of a root x* of f(x) = 0, and f′(x*) is nonzero, then there exists a constant λ such that Fixed-point Iteration with g(x) = x − λf(x) converges to x* for x_0 chosen sufficiently close to x*. This approach to Fixed-point Iteration, with a constant φ, is known as relaxation.
Convergence can be accelerated by allowing λ to vary from iteration to iteration. Intuitively, an effective choice is to try to minimize |g′(x)| near x* by setting λ = 1/f′(x_k), for each k, so that g′(x_k) = 1 − λf′(x_k) = 0. This results in linear convergence with an asymptotic error constant of 0, which indicates faster-than-linear convergence. We will see that convergence is actually quadratic.
• Are there cases in which the problem is easy to solve, and if so, how do we solve it in such cases?
• Is it possible to apply our method of solving the problem in these "easy" cases to more general cases?
A recurring theme in this book is that these questions are useful for solving a variety of problems.
m(x − a) + b = 0,
This function has a useful geometric interpretation, as its graph is the tangent line of f (x) at the
point (x0 , f (x0 )).
We will illustrate this in Matlab, for the example f (x) = x − cos x, and initial guess x0 = 1.
The following code plots f (x) and the tangent line at (x0 , f (x0 )).
>> f=inline('x - cos(x)');
>> a=0.5;
>> b=1.5;
>> h=0.01;
>> x=a:h:b;
>> % plot f(x) on [a,b]
>> plot(x,f(x))
>> hold on
>> % plot x-axis
>> plot([ a b ],[ 0 0 ],’k’)
>> x0=1;
>> % plot initial guess on graph of f(x)
>> plot(x0,f(x0),'ro')
>> % f’(x) = 1 + sin(x)
>> % slope of tangent line: m = f’(x0)
>> m=1+sin(x0);
>> % plot tangent line using points x=a,b
>> plot([ a b ],[ f(x0) + m*([ a b ] - x0) ],'r')
>> xlabel('x')
>> ylabel('y')
Exercise 8.4.1 Rearrange the formula for the tangent line approximation P1 (x) to obtain
a formula for its x-intercept x1 . Compute this value in Matlab and plot the point
(x_1, f(x_1)) as a red '+'.
The plot that should be obtained from the preceding code and Exercise 8.4.1 is shown in Figure
8.6.
As can be seen in Figure 8.6, we can obtain an approximate solution to the equation f (x) = 0
by determining where the linear function P1 (x) is equal to zero. If the resulting value, x1 , is not a
Figure 8.6: Approximating a root of f (x) = x − cos x using the tangent line of f (x) at x0 = 1.
solution, then we can repeat this process, approximating f by a linear function near x1 and once
again determining where this approximation is equal to zero.
Exercise 8.4.2 Modify the above Matlab statements to effectively “zoom in” on the
graph of f (x) near x = x1 , the zero of the tangent line approximation P1 (x) above. Use
the tangent line at (x1 , f (x1 )) to compute a second approximation x2 of the root of f (x).
What do you observe?
The algorithm that results from repeating this process of approximating a root of f (x) using
tangent line approximations is known as Newton’s method, which we now describe in detail.
Example 8.4.3 We will use Newton's Method to compute a root of f(x) = x − cos x. Since f′(x) = 1 + sin x, it follows that in Newton's Method, we can obtain the next iterate x_{n+1} from the previous iterate x_n by
x_{n+1} = x_n − f(x_n)/f′(x_n) = x_n − (x_n − cos x_n)/(1 + sin x_n) = (x_n sin x_n + cos x_n)/(1 + sin x_n).
We choose our starting iterate x_0 = 1 and compute the next several iterates. Since the fourth and fifth iterates agree to 15 decimal places, we assume that 0.739085133215161 is a correct solution to f(x) = 0, to at least 15 decimal places. 2
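The computation in this example can be reproduced with a few lines of Matlab; the following sketch is our own, not code from the text.

f = @(x) x - cos(x);
fp = @(x) 1 + sin(x);              % f'(x)
x = 1;                             % starting iterate x0 = 1
for k = 1:5
    x = x - f(x)/fp(x);            % Newton iteration
end
disp(x)                            % approximately 0.739085133215161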
Exercise 8.4.4 How many correct decimal places are obtained in each xk in the preceding
example? What does this suggest about the order of convergence of Newton’s method?
We can see from this example that Newton’s method converged to a root far more rapidly than
the Bisection method did when applied to the same function. However, unlike Bisection, Newton’s
method is not guaranteed to converge. Whereas Bisection is guaranteed to converge to a root in
[a, b] if f (a)f (b) < 0, we must be careful about our choice of the initial guess x0 for Newton’s
method, as the following example illustrates.
Example 8.4.4 Newton’s Method can be used to compute the reciprocal of a number a without
performing any divisions. The solution, 1/a, satisfies the equation f (x) = 0, where
f(x) = a − 1/x.
Since
f′(x) = 1/x^2,
it follows that in Newton's Method, we can obtain the next iterate x_{n+1} from the previous iterate x_n by
x_{n+1} = x_n − (a − 1/x_n)/(1/x_n^2) = x_n − a x_n^2 + x_n = 2x_n − a x_n^2.
Note that no divisions are necessary to obtain xn+1 from xn . This iteration was actually used on
older IBM computers to implement division in hardware.
We use this iteration to compute the reciprocal of a = 12. Choosing our starting iterate to be 0.1, which satisfies 0 < x_0 < 2/a, the iterates converge rapidly to 1/12 = 0.0833 . . .. If, on the other hand, we choose x_0 = 1, then x_1 = 2(1) − 12(1)^2 = −10, and it is clear that this sequence of iterates is not going to converge to the correct solution. In general, for this iteration to converge to the reciprocal of a, the initial iterate x_0 must be chosen so that 0 < x_0 < 2/a. This condition guarantees that the next iterate x_1 will at least be positive. The contrast between the two choices of x_0 is illustrated in Figure 8.7 for a = 8. 2
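The following quick Matlab check (our own) contrasts the two choices of starting iterate for a = 12.

a = 12;
x = 0.1;                           % satisfies 0 < x0 < 2/a
for k = 1:6
    x = 2*x - a*x^2;               % division-free Newton iteration for 1/a
end
disp(x)                            % approximately 1/12 = 0.0833...
x = 1;                             % violates x0 < 2/a
disp(2*x - a*x^2)                  % gives -10, and subsequent iterates diverge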
We now analyze the convergence of Newton’s Method applied to the equation f (x) = 0, where we
assume that f is twice continuously differentiable near the exact solution x∗ . As before, we define
Figure 8.7: Newton’s Method used to compute the reciprocal of 8 by solving the equation f (x) =
8 − 1/x = 0. When x0 = 0.1, the tangent line of f (x) at (x0 , f (x0 )) crosses the x-axis at x1 = 0.12,
which is close to the exact solution. When x0 = 1, the tangent line crosses the x-axis at x1 = −6,
which causes searching to continue on the wrong portion of the graph, so the sequence of iterates
does not converge to the correct solution.
e_{k+1} = x_{k+1} − x*
       = x_k − f(x_k)/f′(x_k) − x*
       = e_k − f(x_k)/f′(x_k)
       = e_k − [1/f′(x_k)] [ f(x*) − f′(x_k)(x* − x_k) − (1/2) f″(ξ_k)(x_k − x*)^2 ]
       = [f″(ξ_k) / (2 f′(x_k))] e_k^2,
where ξ_k is between x_k and x*.
Because, for each k, ξ_k lies between x_k and x*, ξ_k converges to x* as well. By the continuity of f″, we conclude that Newton's method converges quadratically to x*, with asymptotic error constant
C = |f″(x*)| / (2|f′(x*)|).
Example 8.4.5 Suppose that Newton’s Method is used to find the solution of f (x) = 0, where
f(x) = x^2 − 2. We examine the error e_k = x_k − x*, where x* = √2 is the exact solution. The first two iterations are illustrated in Figure 8.8. Continuing, we obtain
Figure 8.8: Newton’s Method applied to f (x) = x2 − 2. The bold curve is the graph of f . The
initial iterate x0 is chosen to be 1. The tangent line of f (x) at the point (x0 , f (x0 )) is used to
approximate f (x), and it crosses the x-axis at x1 = 1.5, which is much closer to the exact solution
than x0 . Then, the tangent line at (x1 , f (x1 )) is used to approximate f (x), and it crosses the x-axis
at x2 = 1.416̄, which is already very close to the exact solution.
k xk |ek |
0 1 0.41421356237310
1 1.5 0.08578643762690
2 1.41666666666667 0.00245310429357
3 1.41421568627457 0.00000212390141
4 1.41421356237469 0.00000000000159
We can estimate the asymptotic error constant numerically: |e_4|/|e_3|^2 ≈ 0.35352, which is very close to the theoretical value C = |f″(x*)|/(2|f′(x*)|) = 1/(2√2) ≈ 0.35355,
so the actual behavior of the error is consistent with the behavior that is predicted by theory. 2
It is easy to see from the above analysis, however, that if f′(x*) is very small, or zero, then convergence can be very slow, or may not even occur.
For example, consider f(x) = (x − 1)^2 e^x, which has a double root at x* = 1. Applying Newton's method to this function, the error e_k = x_k − 1 satisfies
e_{k+1} = x_{k+1} − 1
       = x_k − f(x_k)/f′(x_k) − 1
       = x_k − (x_k − 1)^2 e^{x_k} / { [2(x_k − 1) + (x_k − 1)^2] e^{x_k} } − 1
       = e_k − e_k^2 / (2e_k + e_k^2)
       = e_k x_k / (x_k + 1).
It follows that if we choose x_0 > 0, then Newton's method converges to x* = 1 linearly, with asymptotic error constant C = 1/2. 2
If x* is a root of f(x) = 0 with f′(x*) ≠ 0, then viewing Newton's method as Fixed-point Iteration applied to g(x) = x − f(x)/f′(x), we have g(x*) = x* − f(x*)/f′(x*) = x*, so Newton's method converges to x*. Using the previous analysis, it can be shown that this convergence is quadratic.
Exercise 8.4.6 Prove that Newton's method converges quadratically if f′(x) > 0 and f″(x) < 0 on an interval [a, b] that contains x* and x_0, where x_0 < x*. What goes wrong if x_0 > x*?
[f(x_1) − f(x_0)]/(x_1 − x_0) · (x_2 − x_1) + f(x_1) = 0,
which has the solution
x_2 = x_1 − f(x_1)(x_1 − x_0)/[f(x_1) − f(x_0)].
This leads to the following algorithm.
As with Newton's method, it is necessary to choose the starting iterate x_0 to be reasonably close to the solution x*. Convergence is not as rapid as that of Newton's Method, since the secant-line approximation of f is not as accurate as the tangent-line approximation employed by Newton's method.
Example 8.4.8 We will use the Secant Method to solve the equation f (x) = 0, where f (x) = x2 −2.
This method requires that we choose two initial iterates x0 and x1 , and then compute subsequent
iterates using the formula
x_{n+1} = x_n − f(x_n)(x_n − x_{n−1}) / [f(x_n) − f(x_{n−1})],   n = 1, 2, 3, . . . .
We choose x0 = 1 and x1 = 1.5. Applying the above formula, we obtain
x2 = 1.4
x3 = 1.413793103448276
x4 = 1.414215686274510
x_5 = 1.414213562057320.
As we can see, the iterates produced by the Secant Method are converging to the exact solution x* = √2, but not as rapidly as those produced by Newton's Method. 2
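The iterates in this example can be generated with the following Matlab sketch (our own, not code from the text).

f = @(x) x.^2 - 2;
x0 = 1;  x1 = 1.5;                 % the two initial iterates used above
for n = 1:4
    x2 = x1 - f(x1)*(x1 - x0)/(f(x1) - f(x0));    % Secant Method
    x0 = x1;  x1 = x2;             % shift the iterates
end
disp(x1)                           % approximately 1.414213562057320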
Exercise 8.4.8 How many correct decimal places are obtained in each xk in the preceding
example? What does this suggest about the order of convergence of the Secant method?
We now prove that the Secant Method converges if x_0 is chosen sufficiently close to a solution x* of f(x) = 0, if f is continuously differentiable near x* and f′(x*) = α ≠ 0. Without loss of generality, we assume α > 0. Then, by the continuity of f′, there exists an interval I_δ = [x* − δ, x* + δ] such that
3α/4 ≤ f′(x) ≤ 5α/4,   x ∈ I_δ.
It follows from the Mean Value Theorem that
\[
\begin{aligned}
x_{k+1} - x^* &= x_k - x^* - f(x_k)\,\frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})} \\
              &= x_k - x^* - \frac{f'(\theta_k)(x_k - x^*)}{f'(\varphi_k)} \\
              &= \left[ 1 - \frac{f'(\theta_k)}{f'(\varphi_k)} \right](x_k - x^*),
\end{aligned}
\]
where θk lies between xk and x∗ , and ϕk lies between xk and xk−1 . Therefore, if xk−1 and xk are
in Iδ , then so are ϕk and θk , and xk+1 satisfies
\[
|x_{k+1} - x^*| \le \max\left\{ \left|1 - \frac{5\alpha/4}{3\alpha/4}\right|,\; \left|1 - \frac{3\alpha/4}{5\alpha/4}\right| \right\} |x_k - x^*| \le \frac{2}{3}\,|x_k - x^*|.
\]
We conclude that if x0 , x1 ∈ Iδ , then all subsequent iterates lie in Iδ , and the Secant Method
converges at least linearly, with asymptotic rate constant 2/3.
The order of convergence of the Secant Method can be determined using a result, which we will
not prove here, stating that if {x_k}_{k=0}^∞ is the sequence of iterates produced by the Secant Method
for solving f(x) = 0, and if this sequence converges to a solution x∗, then for k sufficiently large,
|xk+1 − x∗ | ≈ S|xk − x∗ ||xk−1 − x∗ |
for some constant S.
We assume that {xk } converges to x∗ of order α. Then, dividing both sides of the above relation
by |xk − x∗ |α , we obtain
\[
\frac{|x_{k+1} - x^*|}{|x_k - x^*|^\alpha} \approx S\,|x_k - x^*|^{1-\alpha}\,|x_{k-1} - x^*|.
\]
Because α is the order of convergence, the left side must converge to a positive constant C as
k → ∞. It follows that the right side must converge to a positive constant as well, as must its
reciprocal. In other words, there must exist positive constants C1 and C2 such that
\[
\frac{|x_k - x^*|}{|x_{k-1} - x^*|^\alpha} \to C_1, \qquad \frac{|x_k - x^*|^{\alpha-1}}{|x_{k-1} - x^*|} \to C_2.
\]
This can only be the case if there exists a nonzero constant β such that
\[
\frac{|x_k - x^*|}{|x_{k-1} - x^*|^\alpha} = \left( \frac{|x_k - x^*|^{\alpha-1}}{|x_{k-1} - x^*|} \right)^{\beta},
\]
which implies that
1 = (α − 1)β and α = β.
Eliminating β, we obtain the equation
\[
\alpha^2 - \alpha - 1 = 0,
\]
which has the solutions
\[
\alpha_1 = \frac{1 + \sqrt{5}}{2} \approx 1.618, \qquad \alpha_2 = \frac{1 - \sqrt{5}}{2} \approx -0.618.
\]
Since we must have α ≥ 1, the order of convergence is 1.618.
Exercise 8.4.9 What is the value of S in the preceding discussion? That is, compute
\[
S = \lim_{x_{k-1},\,x_k \to x^*} \frac{x_{k+1} - x^*}{(x_k - x^*)(x_{k-1} - x^*)}.
\]
Hint: Take one limit at a time, and use Taylor expansion. Assume that x∗ is not a double
root.
Exercise 8.4.10 Use the value of S from the preceding exercise to obtain the asymptotic
error constant for the Secant method.
Exercise 8.4.11 Use both Newton’s method and the Secant method to compute a root
of the same polynomial. For both methods, count the number of floating-point operations
required in each iteration, and the number of iterations required to achieve convergence
with the same error tolerance. Which method requires fewer floating-point operations?
\[
\hat{x}_k = x_k - \frac{(x_{k+1} - x_k)^2}{x_{k+2} - 2x_{k+1} + x_k},
\]
that also converges to x∗ . This sequence has the following desirable property.
In other words, the sequence {x̂k } converges to x∗ more rapidly than {xk } does.
Exercise 8.5.2 Prove Theorem 8.5.1. Assume that the sequence {x_k}_{k=0}^∞ converges
linearly with asymptotic error constant C, where 0 < C < 1.
If we define the forward difference operator ∆ by
∆xk = xk+1 − xk ,
then
∆2 xk = ∆(xk+1 − xk ) = (xk+2 − xk+1 ) − (xk+1 − xk ) = xk+2 − 2xk+1 + xk ,
and therefore x̂k can be rewritten as
\[
\hat{x}_k = x_k - \frac{(\Delta x_k)^2}{\Delta^2 x_k}, \quad k = 0, 1, 2, \ldots.
\]
For this reason, the method of accelerating the convergence of {xk } by constructing {x̂k } is called
Aitken’s ∆2 Method [2].
A slight variation of this method, called Steffensen’s Method, can be used to accelerate the
convergence of Fixed-point Iteration, which, as previously discussed, is linearly convergent. The
basic idea is as follows:
The principle behind Steffensen’s Method is that x̂0 is thought to be a better approximation to the
fixed point x∗ than x2 , so it should be used as the next iterate for Fixed-point Iteration.
Example 8.5.2 We wish to find the unique fixed point of the function g(x) = cos x on the interval
[0, 1]. If we use Fixed-point Iteration with x0 = 0.5, then we obtain the following iterates from the
formula xk+1 = g(xk) = cos(xk). All iterates are rounded to five decimal places.
x1 = 0.87758
x2 = 0.63901
x3 = 0.80269
x4 = 0.69478
x5 = 0.76820
x6 = 0.71917.
These iterates show little sign of converging, as they are oscillating around the fixed point.
If, instead, we use Fixed-point Iteration with acceleration by Aitken’s ∆2 method, we obtain a
new sequence of iterates {x̂k }, where
\[
\hat{x}_k = x_k - \frac{(\Delta x_k)^2}{\Delta^2 x_k} = x_k - \frac{(x_{k+1} - x_k)^2}{x_{k+2} - 2x_{k+1} + x_k},
\]
which yields
x̂0 = 0.73139
x̂1 = 0.73609
x̂2 = 0.73765
x̂3 = 0.73847
x̂4 = 0.73880.
Clearly, these iterates are converging much more rapidly than Fixed-point Iteration, as they are not
oscillating around the fixed point, but convergence is still linear.
Finally, we try Steffensen’s Method. We begin with the first three iterates of Fixed-point Itera-
tion,
x_0^{(0)} = x_0 = 0.5, \quad x_1^{(0)} = x_1 = 0.87758, \quad x_2^{(0)} = x_2 = 0.63901.
Applying the formula from Aitken's ∆2 Method to these three iterates yields x_0^{(1)} = 0.73139.
We use this value to restart Fixed-point Iteration and compute two iterates, which are
x_1^{(1)} = \cos(x_0^{(1)}) = 0.74425, \qquad x_2^{(1)} = \cos(x_1^{(1)}) = 0.73560.
Repeating this process, we apply the formula from Aitken's ∆2 Method to the iterates x_0^{(1)}, x_1^{(1)} and
x_2^{(1)} to obtain
\[
x_0^{(2)} = x_0^{(1)} - \frac{(x_1^{(1)} - x_0^{(1)})^2}{x_2^{(1)} - 2x_1^{(1)} + x_0^{(1)}} = 0.739076.
\]
Restarting Fixed-point Iteration with x_0^{(2)} as the initial iterate, we obtain
x_1^{(2)} = \cos(x_0^{(2)}) = 0.739091, \qquad x_2^{(2)} = \cos(x_1^{(2)}) = 0.739081.
The most recent iterate x_2^{(2)} is correct to five decimal places.
Using all three methods to compute the fixed point to ten decimal digits of accuracy, we find that
Fixed-point Iteration requires 57 iterations, so x57 must be computed. Aitken’s ∆2 Method requires
us to compute 25 iterates of the modified sequence {x̂_k}, which in turn requires 27 iterates of the
sequence {x_k}, where the first iterate x_0 is given. Steffensen's Method requires us to compute x_2^{(3)},
which means that only 11 iterates need to be computed, 8 of which require a function evaluation. 2
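As a sketch of how Steffensen's Method might be coded in Matlab for this example (the names g, x, and the tolerance are assumptions made here for illustration):

    % Steffensen's method for the fixed-point problem x = g(x)
    g = @(x) cos(x);
    x = 0.5;                                     % initial iterate
    for k = 1:20
        x1 = g(x);                               % two steps of fixed-point iteration
        x2 = g(x1);
        xnew = x - (x1 - x)^2/(x2 - 2*x1 + x);   % Aitken's Delta^2 formula
        if abs(xnew - x) < 1e-10
            x = xnew;
            break
        end
        x = xnew;                                % restart fixed-point iteration here
    end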
x = g(x).
x^{(k+1)} = g(x^{(k)}), \quad k = 0, 1, 2, \ldots,
x^{(k+1)} = T x^{(k)} + M^{-1} b
f1 (x1 , x2 , . . . , xn ) = 0,
f2 (x1 , x2 , . . . , xn ) = 0,
..
.
fn (x1 , x2 , . . . , xn ) = 0.
F(x) = 0,
F(x) = 0.
First, we transform this system of equations into an equivalent system of the form
x = G(x).
One approach to doing this is to solve the ith equation in the original system for xi . This is
analogous to the derivation of the Jacobi method for solving systems of linear equations. Next, we
choose an initial guess x(0) . Then, we compute subsequent iterates by
x^{(k+1)} = G(x^{(k)}), \quad k = 0, 1, 2, \ldots.
‖J_G(x)‖ ≤ ρ, \quad x ∈ D,
where JG (x) is the Jacobian matrix of first partial derivatives of G evaluated at x, then G has a
unique fixed point x∗ in D, and fixed-point iteration is guaranteed to converge to x∗ for any initial
guess chosen in D. This can be seen by computing a multivariable Taylor expansion of the error
x(k+1) − x∗ around x∗ .
Exercise 8.6.2 Use a multivariable Taylor expansion to prove that if G satisfies the
assumptions in the preceding discussion (that it is continuous, maps D into itself, has
continuous first partial derivatives, and satisfies ‖J_G(x)‖ ≤ ρ < 1 for x ∈ D and any
natural matrix norm ‖·‖), then G has a unique fixed point x∗ ∈ D and fixed-point
iteration will converge to x∗ for any initial guess x(0) ∈ D.
Exercise 8.6.3 Under the assumptions of Exercise 8.6.2, obtain a bound for the error
after k iterations, ‖x^{(k)} − x∗‖, in terms of the initial difference ‖x^{(1)} − x^{(0)}‖.
The constant ρ measures the rate of convergence of fixed-point iteration, as the error approxi-
mately decreases by a factor of ρ at each iteration. It is interesting to note that the convergence
of fixed-point iteration for functions of several variables can be accelerated by using an approach
similar to how the Jacobi method for linear systems is modified to obtain the Gauss-Seidel method.
That is, when computing x_i^{(k+1)} by evaluating g_i(x^{(k)}), we replace x_j^{(k)}, for j < i, by x_j^{(k+1)}, since it
has already been computed (assuming all components of x^{(k+1)} are computed in order). Therefore,
as in Gauss-Seidel, we are using the most up-to-date information available when computing each
iterate.
\[
x_2 = x_1^2, \qquad x_1^2 + x_2^2 = 1.
\]
The first equation describes a parabola, while the second describes the unit circle. By graphing both
equations, it can easily be seen that this system has two solutions, one of which lies in the first
quadrant (x1 , x2 > 0).
To solve this system using fixed-point iteration, we solve the second equation for x1 , and obtain
the equivalent system
\[
x_1 = \sqrt{1 - x_2^2}, \qquad x_2 = x_1^2.
\]
The function G(x_1, x_2) = \left(\sqrt{1 - x_2^2},\; x_1^2\right), defined on the domain
\[
D = \{(x_1, x_2) \mid 0 \le x_1 \le 1,\ 0 \le x_2 \le 1\},
\]
maps D into itself. Because G is also continuous on D, it follows that G has a fixed point in D.
However, G has the Jacobian matrix
\[
J_G(x) = \begin{bmatrix} 0 & -x_2/\sqrt{1 - x_2^2} \\ 2x_1 & 0 \end{bmatrix},
\]
which cannot satisfy ‖J_G‖ < 1 on D. Therefore, we cannot guarantee that fixed-point iteration with
this choice of G will converge, and, in fact, it can be shown that it does not converge. Instead, the
iterates tend to approach the corners of D, at which they remain.
In an attempt to achieve convergence, we note that ∂g2 /∂x1 = 2x1 > 1 near the fixed point.
Therefore, we modify G as follows:
\[
G(x_1, x_2) = \left( \sqrt{x_2},\; \sqrt{1 - x_1^2} \right).
\]
For this choice of G, JG still has partial derivatives that are greater than 1 in magnitude near the
fixed point. However, there is one crucial distinction: near the fixed point, ρ(JG ) < 1, whereas with
the original choice of G, ρ(JG ) > 1. Attempting fixed-point iteration with the new G, we see that
convergence is actually achieved, although it is slow. 2
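A Matlab sketch of fixed-point iteration with the modified G follows; the initial guess, tolerance, and iteration cap are assumptions made here for illustration.

    % Fixed-point iteration for x1 = sqrt(x2), x2 = sqrt(1 - x1^2)
    G = @(x) [sqrt(x(2)); sqrt(1 - x(1)^2)];
    x = [0.5; 0.5];                     % initial guess in D
    for k = 1:200
        xnew = G(x);
        dx = norm(xnew - x, inf);
        x = xnew;
        if dx < 1e-8
            break
        end
    end
    % x now approximates the solution in the first quadrant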
It can be seen from this example that the conditions for the existence and uniqueness of a fixed
point are sufficient, but not necessary.
Furthermore, the condition
\[
\|J_G(x)\| \le \rho, \quad x \in D,
\]
provides an indication of the rate of convergence, in the sense that as the iterates converge to a
fixed point x∗, if they converge, the error satisfies
\[
\|x^{(k+1)} - x^*\| \le \rho\, \|x^{(k)} - x^*\|.
\]
Furthermore, as the iterates converge, a suitable value for ρ is given by ρ(JG (x∗ )), the spectral
radius of the Jacobian matrix at the fixed point.
Therefore, it makes sense to ask: what if this spectral radius is equal to zero? In that case, if the
first partial derivatives are continuous near x∗ , and the second partial derivatives are continuous and
304 CHAPTER 8. ZEROS OF NONLINEAR FUNCTIONS
bounded at x∗ , then fixed-point iteration converges quadratically. That is, there exists a constant
M such that
\[
\|x^{(k+1)} - x^*\| \le M\, \|x^{(k)} - x^*\|^2.
\]
\[
x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{f'(x^{(k)})}, \quad k = 0, 1, 2, \ldots,
\]
where x(0) is an initial guess. We now wish to generalize this method to systems of nonlinear
equations.
Consider the fixed-point iteration function
\[
G(x) = x - J_F(x)^{-1} F(x), \qquad (8.4)
\]
where J_F(x) is the Jacobian matrix of F. Then, it can be shown by direct differentiation that the Jacobian matrix of this function is equal
to the zero matrix at x = x∗, a solution to F(x) = 0. If we define
\[
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad
\mathbf{F}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, x_2, \ldots, x_n) \\ f_2(x_1, x_2, \ldots, x_n) \\ \vdots \\ f_n(x_1, x_2, \ldots, x_n) \end{bmatrix}, \quad
\mathbf{G}(\mathbf{x}) = \begin{bmatrix} g_1(x_1, x_2, \ldots, x_n) \\ g_2(x_1, x_2, \ldots, x_n) \\ \vdots \\ g_n(x_1, x_2, \ldots, x_n) \end{bmatrix},
\]
where f_i and g_i, i = 1, 2, \ldots, n, are the coordinate functions of F and G, respectively, and b_{ij}(x)
denotes the (i, j) entry of J_F(x)^{-1}, then we have
\[
\frac{\partial}{\partial x_k} g_i(x)
= \frac{\partial}{\partial x_k}\left[ x_i - \sum_{j=1}^n b_{ij}(x) f_j(x) \right]
= \delta_{ik} - \sum_{j=1}^n b_{ij}(x) \frac{\partial}{\partial x_k} f_j(x) - \sum_{j=1}^n \left[\frac{\partial}{\partial x_k} b_{ij}(x)\right] f_j(x),
\]
Exercise 8.6.5 Prove that if G(x) is defined as in (8.4) and F(x∗) = 0, then J_G(x∗) = 0.
We see that this choice of fixed-point iteration is a direct generalization of Newton's method to
systems of equations, in which the division by f'(x^{(k)}) is replaced by multiplication by the inverse
of J_F(x^{(k)}), the total derivative of F(x).
In summary, Newton’s method proceeds as follows: first, we choose an initial guess x(0) . Then,
for k = 0, 1, 2, . . . , we iterate as follows:
y_k = -F(x^{(k)})
s_k = [J_F(x^{(k)})]^{-1} y_k
x^{(k+1)} = x^{(k)} + s_k
J_F(x^{(k)})\, s_k = y_k,
\[
x_2 - x_1^2 = 0, \qquad x_1^2 + x_2^2 - 1 = 0.
\]
Fixed-point iteration converged rather slowly for this system, if it converged at all. Now, we apply
Newton’s method to this system. We have
\[
F(x_1, x_2) = \begin{bmatrix} x_2 - x_1^2 \\ x_1^2 + x_2^2 - 1 \end{bmatrix}, \qquad
J_F(x_1, x_2) = \begin{bmatrix} -2x_1 & 1 \\ 2x_1 & 2x_2 \end{bmatrix}.
\]
Using the formula for the inverse of a 2 × 2 matrix, we obtain the iteration
\[
\begin{bmatrix} x_1^{(k+1)} \\ x_2^{(k+1)} \end{bmatrix}
= \begin{bmatrix} x_1^{(k)} \\ x_2^{(k)} \end{bmatrix}
+ \frac{1}{4 x_1^{(k)} x_2^{(k)} + 2 x_1^{(k)}}
\begin{bmatrix} 2 x_2^{(k)} & -1 \\ -2 x_1^{(k)} & -2 x_1^{(k)} \end{bmatrix}
\begin{bmatrix} x_2^{(k)} - (x_1^{(k)})^2 \\ (x_1^{(k)})^2 + (x_2^{(k)})^2 - 1 \end{bmatrix}.
\]
Implementing this iteration in Matlab, we see that it converges quite rapidly, much more so than
fixed-point iteration. Note that in order for the iteration to not break down, we must have x_1^{(k)} ≠ 0
and x_2^{(k)} ≠ −1/2. 2
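A Matlab sketch of this Newton iteration, written with a linear solve rather than the explicit inverse, is given below; the starting guess and stopping rule are illustrative assumptions.

    % Newton's method for F(x) = [x2 - x1^2; x1^2 + x2^2 - 1] = 0
    F  = @(x) [x(2) - x(1)^2; x(1)^2 + x(2)^2 - 1];
    JF = @(x) [-2*x(1), 1; 2*x(1), 2*x(2)];
    x = [0.5; 0.5];                 % initial guess
    for k = 1:20
        s = -JF(x)\F(x);            % solve J_F(x) s = -F(x)
        x = x + s;
        if norm(s, inf) < 1e-12
            break
        end
    end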
by the improved efficiency of each iteration. However, simply replacing the analytical Jacobian
matrix of F with a matrix consisting of finite difference approximations of the partial derivatives
does not do much to reduce the cost of each iteration, because the cost of solving the system of
linear equations is unchanged.
However, because the Jacobian matrix consists of the partial derivatives evaluated at an element
of a convergent sequence, intuitively Jacobian matrices from consecutive iterations are “near” one
another in some sense, which suggests that it should be possible to cheaply update an approximate
Jacobian matrix from iteration to iteration, in such a way that the inverse of the Jacobian matrix
can be updated efficiently as well.
This is the case when a matrix has the form
B = A + uvT ,
where u and v are given vectors. This modification of A to obtain B is called a rank-one update,
since uvT , an outer product, has rank one, since every vector in the range of uvT is a scalar multiple
of u. To obtain B −1 from A−1 , we note that if
Ax = u,
then
Bx = (A + uvT )x = (1 + vT x)u,
which yields
\[
B^{-1} u = \frac{1}{1 + v^T A^{-1} u}\, A^{-1} u.
\]
On the other hand, if x is such that v^T A^{-1} x = 0, then
\[
B A^{-1} x = (A + u v^T) A^{-1} x = x,
\]
which yields
\[
B^{-1} x = A^{-1} x.
\]
This takes us to the following more general problem: given a matrix C, we wish to construct a
matrix D such that the following conditions are satisfied:
• Dw = z, for given vectors w and z, and
• Dy = Cy, whenever g^T y = 0, for a given vector g.
Exercise 8.6.7 Prove that the matrix D defined in (8.5) satisfies Dw = z and Dy = Cy
for g^T y = 0.
Exercise 8.6.8 Prove the final form of the Sherman-Morrison formula given in (8.6).
We now return to the problem of approximating the Jacobian of F, and efficiently obtaining
its inverse, at each iterate x(k) . We begin with an exact Jacobian, A0 = JF (x(0) ), and use A0 to
compute the first iterate, x(1) , using Newton’s Method. Then, we recall that for the Secant Method,
we use the approximation
f (x1 ) − f (x0 )
f 0 (x1 ) ≈ .
x1 − x0
Generalizing this approach to a system of equations, we seek an approximation A_1 to J_F(x^{(1)}) that
has these properties:
• A_1(x^{(1)} − x^{(0)}) = F(x^{(1)}) − F(x^{(0)}), and
• A_1 z = A_0 z whenever z is orthogonal to x^{(1)} − x^{(0)}.
Then, as A_1 is an approximation to J_F(x^{(1)}), we can obtain our next iterate x^{(2)} as follows:
\[
x^{(2)} = x^{(1)} - A_1^{-1}\, F(x^{(1)}).
\]
Repeating this process, we obtain the following algorithm, which is known as Broyden’s Method:
Choose x^{(0)}
A_0 = J_F(x^{(0)})
s_1 = -A_0^{-1} F(x^{(0)})
x^{(1)} = x^{(0)} + s_1
k = 1
while not converged do
    y_k = F(x^{(k)}) - F(x^{(k-1)})
    w_k = A_{k-1}^{-1} y_k
    c = 1/(s_k^T w_k)
    A_k^{-1} = A_{k-1}^{-1} + c (s_k - w_k) s_k^T A_{k-1}^{-1}
    s_{k+1} = -A_k^{-1} F(x^{(k)})
    x^{(k+1)} = x^{(k)} + s_{k+1}
    k = k + 1
end
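The algorithm above translates nearly line-for-line into Matlab. The following sketch assumes function handles F and JF for F and its Jacobian, an initial guess x, and a tolerance; these names are illustrative.

    function x = broyden(F, JF, x, tol, maxit)
        % Broyden's method with a rank-one update of the inverse Jacobian
        Binv = inv(JF(x));              % A_0^{-1}: exact inverse Jacobian at x^(0)
        s = -Binv*F(x);
        xold = x;  x = x + s;
        for k = 1:maxit
            y = F(x) - F(xold);
            w = Binv*y;
            Binv = Binv + ((s - w)*(s'*Binv))/(s'*w);  % update of A_k^{-1}
            s = -Binv*F(x);
            xold = x;  x = x + s;
            if norm(s) < tol
                return
            end
        end
    end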
Chapter 9

Initial Value Problems

In this chapter, we begin our exploration of the development of numerical methods for solving differ-
ential equations, which are equations that depend on derivatives of unknown quantities. Differential
equations arise in mathematical models of a wide variety of phenomena, such as propagation of
waves, dissipation of heat energy, population growth, or motion of fluids. Solutions of differen-
tial equations yield valuable insight about such phenomena, and therefore techniques for solving
differential equations are among the most essential methods of applied mathematics.
We now illustrate mathematical models based on differential equations. Newton’s Second Law
states
\[
F = ma = m\frac{dv}{dt},
\]
where F , m, a and v represent force, mass, acceleration and velocity, respectively. We use this
law to develop a mathematical model for the velocity of a falling object that includes a differential
equation. The forces on the falling object include gravity and air resistance, or drag; to simplify
the discussion, we neglect any other forces.
The force due to gravity is equal to mg, where g is the acceleration due to gravity, and the
drag force is equal to −γv, where γ is the drag coefficient. We use downward orientation, so that
gravity is acting in the positive (downward) direction and drag is acting in the negative (upward)
direction. In summary, we have
F = mg − γv.
Combining with Newton’s Second Law yields the differential equation
\[
m\frac{dv}{dt} = mg - \gamma v
\]
for the velocity v of the falling object.
Another example of a mathematical model is a differential equation for the population p of a
species, which can have the form
\[
\frac{dp}{dt} = rp - d,
\]
where the constant r is the rate of reproduction of the species. In general, r is called a rate constant
or growth rate. The constant d indicates the number of specimens that die per unit of time, perhaps
due to predation or other causes.
A differential equation such as this one does not have a unique solution, as it does not include
enough information. Typically, the differential equation is paired with an initial condition of the
form
y(t0 ) = y0 ,
where t0 represents an initial time and y0 is an initial value. The differential equation, together
with the initial condition, is called an initial value problem. As discussed in the next section,
under certain assumptions, it can be proven that an initial value problem has a unique solution.
This chapter explores the numerical solution of initial value problems. Chapter 10 investigates the
numerical solution of boundary value problems, which are differential equations defined on a spatial
domain, such as a bounded interval [a, b], paired with boundary conditions that ensure a unique
solution.
There are many types of differential equations, and a wide variety of solution techniques, even
for equations of the same type, let alone different types. We now introduce some terminology that
aids in classification of equations and, by extension, selection of solution techniques.
• An ordinary differential equation, or ODE, is an equation that depends on one or more deriva-
tives of functions of a single variable. Differential equations given in the preceding examples
are all ordinary differential equations, and we will consider these equations exclusively in this
book.
• A partial differential equation, or PDE, is an equation that depends on one or more partial
derivatives of functions of several variables. In many cases, PDEs are solved by reducing to
multiple ODEs.
• The order of a differential equation is the order of the highest derivative of any unknown
function in the equation.
In this chapter, we limit ourselves to numerical methods for the solution of first-order ODEs. In
Section 9.6, we consider systems of first-order ODEs, which allows these numerical methods to be
applied to higher-order ODEs.
If, in addition, ∂f /∂y exists on D, we can also conclude that |∂f /∂y| ≤ L on D.
When solving a problem numerically, it is not sufficient to know that a unique solution exists.
As discussed in Section 1.4.4, if a small change in the problem data can cause a substantial change
in the solution, then the problem is ill-conditioned, and a numerical solution is therefore unreliable,
because it could be unduly influenced by roundoff error. The following definition characterizes
problems involving differential equations for which numerical solution is feasible.
This theorem can be proved using Fixed-Point Iteration, in which the Lipschitz condition on f is
used to prove that the iteration converges [7, p. 142-155].
tn = t0 + nh,
with h being a chosen time step. Computing approximate solution values in this manner is
called time-stepping or time-marching. Taking a Taylor expansion of the exact solution y(t)
at t = tn+1 around the center t = tn , we obtain
\[
y(t_{n+1}) = y(t_n) + h\,y'(t_n) + \frac{h^2}{2} y''(\xi),
\]
where tn ≤ ξ ≤ tn+1 .
yn+1 = yn + hf (tn , yn ),
To that end, we attempt to bound the error at time tn . We begin with a comparison of the difference
equation and the Taylor expansion of the exact solution,
\[
y(t_{n+1}) = y(t_n) + h f(t_n, y(t_n)) + \frac{h^2}{2} y''(\xi),
\qquad
y_{n+1} = y_n + h f(t_n, y_n).
\]
Subtracting the second equation from the first yields
\[
e_{n+1} = e_n + h\left[ f(t_n, y(t_n)) - f(t_n, y_n) \right] + \frac{h^2}{2} y''(\xi).
\]
Exercise 9.2.1 Repeat the convergence analysis for Euler’s method on (9.5) to obtain
the error bound
\[
|y(t_n) - \tilde{y}_n| \le \frac{1}{L}\left( \frac{hM}{2} + \frac{\delta}{h} \right)\left[ e^{L(t_n - t_0)} - 1 \right] + \delta\, e^{L(t_n - t_0)}.
\]
What happens to this error bound as h → 0? What is an optimal choice of h so that the
error bound is minimized?
We conclude our discussion of Euler’s method with an example of how the previous convergence
analyses can be used to select a suitable time step h.
Example 9.2.1 Consider the initial value problem y' = −y, y(0) = 1, to be solved on the interval
[0, 10]. We know that the exact solution is y(t) = e^{−t}. Euler's method applied to this problem yields the
difference equation
yn+1 = yn − hyn = (1 − h)yn , y0 = 1.
We wish to select h so that the error at time T = 10 is less than 0.001. To that end, we use the
error bound
\[
|y(t_n) - y_n| \le \frac{hM}{2L}\left[ e^{L(t_n - t_0)} - 1 \right],
\]
with M = 1, since y''(t) = e^{−t}, which satisfies 0 < y''(t) ≤ 1 on [0, 10], and L = 1, since
f(t, y) = −y satisfies |∂f/∂y| = |−1| ≡ 1. Substituting t_n = 10 and t_0 = 0 yields
\[
|y(10) - y_n| \le \frac{h}{2}\left[ e^{10} - 1 \right] \approx 11012.7\, h.
\]
Ensuring that the error at this time is less than 10−3 requires choosing h < 9.08 × 10−8 . However,
the bound on the error at t = 10 is quite crude. Applying Euler’s method with this time step yields
a solution whose error at t = 10 is 2 × 10−11 .
Now, suppose that we include roundoff error in our error analysis. The optimal time step is
\[
h = \sqrt{\frac{2\delta}{M}},
\]
where δ is a bound on the roundoff error during any time step. We use δ = 2u, where u is the
unit roundoff, because each time step performs only two floating-point operations. Even if 1 − h
is computed once, in advance, its error still propagates to the multiplication with yn . In a typical
double-precision floating-point number system, u ≈ 1.1 × 10−16 . It follows that the optimal time
step is
\[
h = \sqrt{\frac{2\delta}{M}} = \sqrt{\frac{2(1.1 \times 10^{-16})}{1}} \approx 1.5 \times 10^{-8}.
\]
With this value of h, we find that the error at t = 10 is approximately 3.55 × 10−12 . This is even
more accurate than with the previous choice of time step, which makes sense, because the new value
of h is smaller. 2
where f is a function handle for f(t, y). The first output t is a column vector consisting of times
t_0, t_1, \ldots, t_n = T, where n is the number of time steps. The second output y is an n × m matrix,
where m is the length of y0. The ith row of y consists of the values of y(t_i), for i = 1, 2, \ldots, n.
This is the simplest usage of one of the ODE solvers; additional interfaces are described in the
documentation.
(b) Test your function on the IVP from Example 9.2.1 with h=0.1 and h=0.01, and
compute the error at the final time t using the known exact solution. What happens
to the error as h decreases? Is the behavior what you would expect based on theory?
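One possible sketch of the function requested in Exercise 9.2.2 is shown below. The signature [t,y]=eulersmethod(f,tspan,y0,h) mirrors the rk4 signature in Exercise 9.2.4; the row-wise storage of the solution and the assumption that h divides the interval evenly are choices made here.

    function [t, y] = eulersmethod(f, tspan, y0, h)
        % Euler's method: y_{n+1} = y_n + h f(t_n, y_n)
        t = (tspan(1):h:tspan(2))';       % assumes h divides tspan(2)-tspan(1)
        y = zeros(length(t), length(y0));
        y(1, :) = y0;
        for i = 1:length(t)-1
            y(i+1, :) = y(i, :) + h*f(t(i), y(i, :).').';  % one Euler step
        end
    end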
with
f (t + α1 , y + β1 ),
at least up to and including terms of O(h), so that the local truncation error will be O(h2 ).
Applying the multivariable version of Taylor’s theorem to f (see Theorem A.6.6), we obtain
\[
f(t + \alpha_1, y + \beta_1) = f(t, y) + \alpha_1 \frac{\partial f}{\partial t}(t, y) + \beta_1 \frac{\partial f}{\partial y}(t, y) +
\frac{\alpha_1^2}{2} \frac{\partial^2 f}{\partial t^2}(\xi, \mu) + \alpha_1 \beta_1 \frac{\partial^2 f}{\partial t\, \partial y}(\xi, \mu) + \frac{\beta_1^2}{2} \frac{\partial^2 f}{\partial y^2}(\xi, \mu),
\]
where ξ is between t and t + α1 and µ is between y and y + β1 . Meanwhile, computing the full
derivatives with respect to t in the Taylor expansion of the solution yields
\[
f(t, y) + \frac{h}{2} \frac{\partial f}{\partial t}(t, y) + \frac{h}{2} \frac{\partial f}{\partial y}(t, y)\, f(t, y) + O(h^2).
\]
The resulting scheme,
\[
y_{n+1} = y_n + h f\!\left( t_n + \frac{h}{2},\; y_n + \frac{h}{2} f(t_n, y_n) \right),
\]
is known as the midpoint method, or the explicit midpoint method. Note that
it evaluates f at the midpoint of the intervals [t_n, t_{n+1}] and [y_n, y_{n+1}], where the midpoint in y is
approximated using Euler's method with time step h/2.
The midpoint method is the simplest example of a Runge-Kutta method, which is the name
given to any of a class of time-stepping schemes that are derived by matching multivariable Taylor
series expansions of f(t, y) with terms in a Taylor series expansion of y(t + h). Another often-used
Runge-Kutta method is the modified Euler method
\[
y_{n+1} = y_n + \frac{h}{2}\left[ f(t_n, y_n) + f(t_{n+1}, y_n + h f(t_n, y_n)) \right], \qquad (9.6)
\]
also known as the explicit trapezoidal method, as it resembles the Trapezoidal Rule from
numerical integration. This method is also second-order accurate.
Exercise 9.2.3 Derive the explicit trapezoidal method (9.6) by finding a method of the
form
yn+1 = yn + h[a1 f (tn + α1 , yn + β1 ) + a2 f (tn + α2 , yn + β2 )]
that is second-order accurate.
However, the best-known Runge-Kutta method is the fourth-order Runge-Kutta method,
which uses four evaluations of f during each time step. The method proceeds as follows:
\[
\begin{aligned}
k_1 &= h f(t_n, y_n), \\
k_2 &= h f\!\left( t_n + \frac{h}{2},\; y_n + \frac{1}{2} k_1 \right), \\
k_3 &= h f\!\left( t_n + \frac{h}{2},\; y_n + \frac{1}{2} k_2 \right), \\
k_4 &= h f(t_{n+1},\, y_n + k_3), \\
y_{n+1} &= y_n + \frac{1}{6}(k_1 + 2k_2 + 2k_3 + k_4). \qquad (9.7)
\end{aligned}
\]
In a sense, this method is similar to Simpson’s Rule from numerical integration, which is also
fourth-order accurate, as values of f at the midpoint in time are given four times as much weight
as values at the endpoints tn and tn+1 .
The values k1 , . . . , k4 are referred to as stages; more precisely, a stage of a Runge-Kutta method
is an evaluation of f (t, y), and the number of stages of a Runge-Kutta method is the number of
evaluations required per time step. We therefore say that (9.7) is a four-stage, fourth-order method,
while the explicit trapezoidal method (9.6) is a two-stage, second-order method. We will see in
Section 9.5 that the number of stages does not always correspond to the order of accuracy.
Example 9.2.2 We compare Euler’s method with the fourth-order Runge-Kutta scheme on the
initial value problem
\[
y' = -2ty, \quad 0 < t \le 1, \quad y(0) = 1,
\]
which has the exact solution y(t) = e^{−t^2}. We use a time step of h = 0.1 for both methods. The
computed solutions, and the exact solution, are shown in Figure 9.1.
It can be seen that the fourth-order Runge-Kutta method is far more accurate than Euler’s
method, which is first-order accurate. In fact, the solution computed using the fourth-order Runge-
Kutta method is visually indistinguishable from the exact solution. At the final time T = 1, the
relative error in the solution computed using Euler’s method is 0.038, while the relative error in the
solution computed using the fourth-order Runge-Kutta method is 4.4 × 10^{−6}. 2
Exercise 9.2.4 Modify your function eulersmethod from Exercise 9.2.2 to obtain a
new function [T,Y]=rk4(f,tspan,y0,h) that implements the fourth-order Runge-Kutta
method.
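The following is one possible sketch of such a function; the storage conventions (one row of Y per time step) and the assumption that h divides the interval evenly are choices made here.

    function [T, Y] = rk4(f, tspan, y0, h)
        % Classical four-stage, fourth-order Runge-Kutta method (9.7)
        T = (tspan(1):h:tspan(2))';
        Y = zeros(length(T), length(y0));
        Y(1, :) = y0;
        for i = 1:length(T)-1
            t = T(i);  y = Y(i, :).';
            k1 = h*f(t, y);
            k2 = h*f(t + h/2, y + k1/2);
            k3 = h*f(t + h/2, y + k2/2);
            k4 = h*f(t + h, y + k3);
            Y(i+1, :) = (y + (k1 + 2*k2 + 2*k3 + k4)/6).';
        end
    end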
by applying the Trapezoidal Rule to the integral. This yields a one-step method
\[
y_{n+1} = y_n + \frac{h}{2}\left[ f(t_n, y_n) + f(t_{n+1}, y_{n+1}) \right], \qquad (9.8)
\]
Figure 9.1: Solutions of y 0 = −2ty, y(0) = 1 on [0, 1], computed using Euler’s method and the
fourth-order Runge-Kutta method
Exercise 9.2.6 Suppose that fixed-point iteration is used to solve for yn+1 in backward
Euler’s method. What is the function g in the equation yn+1 = g(yn+1 )? Assuming that
g satisfies the condition for a fixed point to exist, how should h be chosen to help ensure
convergence of the fixed-point iteration?
Exercise 9.2.7 Repeat Exercise 9.2.6 for the trapezoidal method (9.8).
A general s-step multistep method has the form
\[
\sum_{i=0}^s \alpha_i y_{n+1-i} = h \sum_{i=0}^s \beta_i f(t_{n+1-i}, y_{n+1-i}),
\]
where s is the number of steps in the method (s = 1 for a one-step method), and h is the time step
size, as before.
By convention, α0 = 1, so that yn+1 can be conveniently expressed in terms of other values.
If β0 = 0, the multistep method is said to be explicit, because then yn+1 can be described using
an explicit formula, whereas if β0 6= 0, the method is implicit, because then an equation, generally
nonlinear, must be solved to compute yn+1 .
For a general implicit multistep method, for which β0 ≠ 0, Newton's method can be applied to
the function
\[
F(y) = \alpha_0 y + \sum_{i=1}^s \alpha_i y_{n+1-i} - h\beta_0 f(t_{n+1}, y) - h \sum_{i=1}^s \beta_i f_{n+1-i}.
\]
The resulting iteration is
\[
y_{n+1}^{(k+1)} = y_{n+1}^{(k)} - \frac{F(y_{n+1}^{(k)})}{F'(y_{n+1}^{(k)})}
= y_{n+1}^{(k)} - \frac{\alpha_0 y_{n+1}^{(k)} + \sum_{i=1}^s \alpha_i y_{n+1-i} - h\beta_0 f(t_{n+1}, y_{n+1}^{(k)}) - h\sum_{i=1}^s \beta_i f_{n+1-i}}{\alpha_0 - h\beta_0 f_y(t_{n+1}, y_{n+1}^{(k)})},
\]
with y_{n+1}^{(0)} = y_n. If one does not wish to compute f_y, then the Secant Method can be used instead.
The general idea behind Adams methods is to approximate the above integral using polynomial
interpolation of f at the points tn+1−s , tn+2−s , . . . , tn if the method is explicit, and tn+1 as well if
the method is implicit. In all Adams methods, α0 = 1, α1 = −1, and αi = 0 for i = 2, . . . , s.
Explicit Adams methods are called Adams-Bashforth methods. To derive an Adams-
Bashforth method, we interpolate f at the points tn , tn−1 , . . . , tn−s+1 with a polynomial of degree
s − 1. We then integrate this polynomial exactly. It follows that the constants βi , i = 1, . . . , s,
are the integrals of the corresponding Lagrange polynomials from tn to tn+1 , divided by h, because
there is already a factor of h in the general multistep formula.
The constants βi , i = 1, 2, 3, are obtained by evaluating the integral from tn to tn+1 of a polynomial
p2 (t) that passes through f (tn , yn ), f (tn−1 , yn−1 ), and f (tn−2 , yn−2 ).
Because we can write
\[
p_2(t) = \sum_{i=0}^{2} f(t_{n-i}, y_{n-i})\, L_i(t),
\]
where Li (t) is the ith Lagrange polynomial for the interpolation points tn , tn−1 and tn−2 , and because
our final method expresses yn+1 as a linear combination of yn and values of f , it follows that the
constants βi , i = 1, 2, 3, are the integrals of the Lagrange polynomials from tn to tn+1 , divided by h.
However, using a change of variable u = (tn+1 − s)/h, we can instead interpolate at the points
u = 1, 2, 3, thus simplifying the integration. If we define p̃2 (u) = p2 (s) = p2 (tn+1 − hu) and
L̃i (u) = Li (tn+1 − hu), then we obtain
\[
\begin{aligned}
\int_{t_n}^{t_{n+1}} f(s, y(s))\, ds &= \int_{t_n}^{t_{n+1}} p_2(s)\, ds \\
&= h \int_0^1 \tilde{p}_2(u)\, du \\
&= h \int_0^1 \left[ f(t_n, y_n)\tilde{L}_0(u) + f(t_{n-1}, y_{n-1})\tilde{L}_1(u) + f(t_{n-2}, y_{n-2})\tilde{L}_2(u) \right] du \\
&= h\left[ f(t_n, y_n)\int_0^1 \tilde{L}_0(u)\, du + f(t_{n-1}, y_{n-1})\int_0^1 \tilde{L}_1(u)\, du + f(t_{n-2}, y_{n-2})\int_0^1 \tilde{L}_2(u)\, du \right] \\
&= h\left[ f(t_n, y_n)\int_0^1 \frac{(u-2)(u-3)}{(1-2)(1-3)}\, du + f(t_{n-1}, y_{n-1})\int_0^1 \frac{(u-1)(u-3)}{(2-1)(2-3)}\, du \right. \\
&\qquad\left. + f(t_{n-2}, y_{n-2})\int_0^1 \frac{(u-1)(u-2)}{(3-1)(3-2)}\, du \right] \\
&= h\left[ \frac{23}{12} f(t_n, y_n) - \frac{4}{3} f(t_{n-1}, y_{n-1}) + \frac{5}{12} f(t_{n-2}, y_{n-2}) \right].
\end{aligned}
\]
We conclude that the three-step Adams-Bashforth method is
\[
y_{n+1} = y_n + \frac{h}{12}\left[ 23 f(t_n, y_n) - 16 f(t_{n-1}, y_{n-1}) + 5 f(t_{n-2}, y_{n-2}) \right]. \qquad (9.10)
\]
This method is third-order accurate. 2
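A Matlab sketch of the time-stepping loop for (9.10) follows, under the assumption that the arrays t and y already contain three starting values (computed, for example, with a one-step method) and that f, h, and the number of steps N are defined; all of these names are illustrative.

    % Three-step Adams-Bashforth method (9.10) for a scalar IVP y' = f(t,y)
    % t(1:3), y(1:3) are assumed to hold the starting values
    for n = 3:N
        y(n+1) = y(n) + h/12*( 23*f(t(n),   y(n))   ...
                             - 16*f(t(n-1), y(n-1)) ...
                             +  5*f(t(n-2), y(n-2)) );
        t(n+1) = t(n) + h;
    end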
The same approach can be used to derive an implicit Adams method, which is known as an
Adams-Moulton method. The only difference is that because tn+1 is an interpolation point, after
the change of variable to u, the interpolation points 0, 1, 2, . . . , s are used. Because the resulting
interpolating polynomial is of degree one greater than in the explicit case, the error in an s-step
Adams-Moulton method is O(hs+1 ), as opposed to O(hs ) for an s-step Adams-Bashforth method.
• Predict: Use the Adams-Bashforth method to compute a first approximation to yn+1 , which
we denote by ỹn+1 .
• Correct: Use the Adams-Moulton method to compute yn+1 , but instead of solving an equation,
use f (tn+1 , ỹn+1 ) in place of f (tn+1 , yn+1 ) so that the Adams-Moulton method can be used
as if it was an explicit method.
• Evaluate: Evaluate f at the newly computed value of yn+1 , computing f (tn+1 , yn+1 ), to use
during the next time step.
Example 9.3.2 We illustrate the predictor-corrector approach with the two-step Adams-Bashforth
method
\[
y_{n+1} = y_n + \frac{h}{2}\left[ 3 f(t_n, y_n) - f(t_{n-1}, y_{n-1}) \right]
\]
and the two-step Adams-Moulton method
\[
y_{n+1} = y_n + \frac{h}{12}\left[ 5 f(t_{n+1}, y_{n+1}) + 8 f(t_n, y_n) - f(t_{n-1}, y_{n-1}) \right]. \qquad (9.13)
\]
First, we apply the Adams-Bashforth method, and compute
\[
\tilde{y}_{n+1} = y_n + \frac{h}{2}\left[ 3 f(t_n, y_n) - f(t_{n-1}, y_{n-1}) \right].
\]
Then, we compute f(t_{n+1}, ỹ_{n+1}) and apply the Adams-Moulton method, to compute
\[
y_{n+1} = y_n + \frac{h}{12}\left[ 5 f(t_{n+1}, \tilde{y}_{n+1}) + 8 f(t_n, y_n) - f(t_{n-1}, y_{n-1}) \right].
\]
This new value of yn+1 is used when evaluating f (tn+1 , yn+1 ) during the next time step. 2
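A Matlab sketch of one predict-correct-evaluate step with this pair is given below; fn and fnm1 are assumed to hold the stored values f(t_n, y_n) and f(t_{n-1}, y_{n-1}), and all names are illustrative.

    % One step of the two-step Adams-Bashforth/Adams-Moulton predictor-corrector
    ypred  = y(n) + h/2*(3*fn - fnm1);              % predict (Adams-Bashforth)
    fpred  = f(t(n) + h, ypred);                    % evaluate f at the prediction
    y(n+1) = y(n) + h/12*(5*fpred + 8*fn - fnm1);   % correct (Adams-Moulton)
    t(n+1) = t(n) + h;
    fnm1 = fn;                                      % shift stored values of f
    fn   = f(t(n+1), y(n+1));                       % evaluate for the next step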
One drawback of multistep methods is that because they rely on values of the solution from
previous time steps, they cannot be used during the first time steps, because not enough values
are available. Therefore, it is necessary to use a one-step method, with at least the same order
of accuracy, to compute enough starting values of the solution to be able to use the multistep
method. For example, to use the three-step Adams-Bashforth method, it is necessary to first use
a one-step method such as the fourth-order Runge-Kutta method to compute y1 and y2 , and then
the Adams-Bashforth method can be used to compute y3 using y2 , y1 and y0 .
Exercise 9.3.2 How many starting values are needed to use an s-step multistep method?
where
\[
\alpha_i = L'_{s,i}(t_{n+1}),
\]
with L_{s,0}(t), L_{s,1}(t), \ldots, L_{s,s}(t) being the Lagrange polynomials for t_{n+1-s}, \ldots, t_n, t_{n+1}.
Exercise 9.3.5 Show that a 1-step BDF is simply backward Euler’s method.
Recall that when Euler's method,
\[
y_{n+1} = y_n + h f(t_n, y_n),
\]
is applied
to the initial value problem (9.1), (9.2), the error in the computed solution satisfies the error bound
\[
|y(t_n) - y_n| \le \frac{Mh}{2L}\left( e^{L(t_n - t_0)} - 1 \right),
\]
where L is the Lipschitz constant for f and M is an upper bound on |y''(t)|. This error bound
indicates that the numerical solution converges to the exact solution as h → 0; that is,
It would be desirable to be able to prove that a numerical method converges without having to
proceed through the same detailed error analysis that was carried out with Euler’s method, since
other methods are more complex and such analysis would require more assumptions to obtain a
similar bound on the error in yn .
To that end, we define two properties that a numerical method should have in order to be
convergent.
Definition 9.4.1
Informally, a stable method converges to the differential equation as h → 0, and the solution
computed using a stable method is not overly sensitive to perturbations in the initial data. While
the difference in solutions is allowed to grow over time, it is “controlled” growth, meaning that the
rate of growth is independent of the step size h.
9.4.1 Consistency
The definition of consistency in Definition 9.4.1 can be cumbersome to apply directly to a given
method. Therefore, we consider one-step and multistep methods separately to obtain simple ap-
proaches for determining whether a given method is consistent, and if so, its order of accuracy.
as h → 0.
We now analyze the convergence of a general one-step method of the form
\[
y_{n+1} = y_n + h\,\Phi(t_n, y_n, h) \qquad (9.14)
\]
for some continuous function Φ(t, y, h). We define the local truncation error of this one-step
method by
\[
\tau_n(h) = \frac{y(t_{n+1}) - y(t_n)}{h} - \Phi(t_n, y(t_n), h).
\]
That is, the local truncation error is the result of substituting the exact solution into the approxi-
mation of the ODE by the numerical method.
Exercise 9.4.1 Find the local truncation error of the modified Euler method (9.6).
As h → 0 and n → ∞, in such a way that t0 + nh = t ∈ [t0 , T ], we obtain
Recall that a consistent one-step method is one that converges to the ODE as h → 0.
so it is consistent. 2
Exercise 9.4.2 Verify that the fourth-order Runge-Kutta method (9.7) is consistent.
as h → 0.
To compute the local truncation error of Adams methods, integrate the error in the polynomial
interpolation used to derive the method from tn to tn+1 . For the explicit s-step method, this yields
\[
\tau_n(h) = \frac{1}{h} \int_{t_n}^{t_{n+1}} \frac{f^{(s)}(\xi, y(\xi))}{s!}\, (t - t_n)(t - t_{n-1}) \cdots (t - t_{n-s+1})\, dt.
\]
Using the substitution u = (tn+1 − t)/h, and the Weighted Mean Value Theorem for Integrals,
yields
\[
\tau_n(h) = \frac{1}{h}\, \frac{f^{(s)}(\xi, y(\xi))}{s!}\, h^{s+1} (-1)^s \int_0^1 (u - 1)(u - 2) \cdots (u - s)\, du.
\]
Evaluating the integral yields the constant in the error term. We also use the fact that y' =
f(t, y) to replace f^{(s)}(ξ, y(ξ)) with y^{(s+1)}(ξ). Obtaining the local truncation error for an implicit,
Adams-Moulton method can be accomplished in the same way, except that tn+1 is also used as an
interpolation point.
For a general multistep method, we substitute the exact solution into the method, as in one-step
methods, and obtain
\[
\tau_n(h) = \frac{\sum_{j=0}^s \alpha_j\, y(t_{n+1-j}) - h \sum_{j=0}^s \beta_j\, f(t_{n+1-j}, y(t_{n+1-j}))}{h \sum_{j=0}^s \beta_j},
\]
where the scaling by h \sum_{j=0}^s \beta_j is designed to make this definition of local truncation error consistent
with that of one-step methods.
By replacing each evaluation of y(t) by a Taylor series expansion around tn+1 , we obtain
\[
\begin{aligned}
\tau_n(h) &= \frac{1}{h\sum_{j=0}^s \beta_j} \sum_{j=0}^s \left[ \alpha_j \sum_{k=0}^\infty \frac{1}{k!}\, y^{(k)}(t_{n+1})(-jh)^k - h\beta_j \sum_{k=0}^\infty \frac{1}{k!}\, \frac{d^k}{dt^k}\big[f(t_{n+1}, y(t_{n+1}))\big](-jh)^k \right] \\
&= \frac{1}{h\sum_{j=0}^s \beta_j} \sum_{j=0}^s \left[ \sum_{k=0}^\infty (-1)^k \frac{h^k}{k!}\, \alpha_j\, y^{(k)}(t_{n+1})\, j^k + \sum_{k=1}^\infty (-1)^k \frac{h^k}{(k-1)!}\, \beta_j\, y^{(k)}(t_{n+1})\, j^{k-1} \right] \\
&= \frac{1}{h\sum_{j=0}^s \beta_j} \left[ y(t_{n+1}) \sum_{j=0}^s \alpha_j + \sum_{k=1}^\infty (-h)^k y^{(k)}(t_{n+1}) \left( \frac{1}{k!}\sum_{j=1}^s j^k \alpha_j + \frac{1}{(k-1)!}\sum_{j=0}^s j^{k-1}\beta_j \right) \right] \\
&= \frac{1}{h\sum_{j=0}^s \beta_j} \left[ y(t_{n+1})\, C_0 + \sum_{k=1}^\infty (-h)^k y^{(k)}(t_{n+1})\, C_k \right],
\end{aligned}
\]
where
\[
C_0 = \sum_{j=0}^s \alpha_j, \qquad
C_k = \frac{1}{k!} \sum_{j=1}^s j^k \alpha_j + \frac{1}{(k-1)!} \sum_{j=0}^s j^{k-1} \beta_j, \quad k = 1, 2, \ldots.
\]
Exercise 9.4.3 Use the conditions (9.15) to verify the order of accuracy of the four-step
Adams-Bashforth method (9.11). What is the local truncation error?
Further analysis is required to obtain the local truncation error of a predictor-corrector method
that is obtained by combining two Adams methods. The result of this analysis is the following
theorem, which is proved in [19, p. 387-388].
and corrector
\[
y_{n+1} = y_n + h\left[ \beta_0 f(t_{n+1}, \tilde{y}_{n+1}) + \sum_{i=1}^s \beta_i f_{n+1-i} \right].
\]
where T_n(h) and T̃_n(h) are the local truncation errors of the predictor and corrector,
respectively, and ξ_{n+1} is between 0 and hT_n(h). Furthermore, there exist constants α and
β such that
\[
|y(t_n) - y_n| \le \left[ \max_{0 \le i \le s-1} |y(t_i) - y_i| + \beta S(h) \right] e^{\alpha (t_n - t_0)},
\]
Exercise 9.4.4 Show that an s-step predictor-corrector method, in which the corrector
is repeatedly applied until yn+1 converges, has local truncation error O(hs+1 ).
9.4.2 Stability
We now specialize the definition of stability from Definition 9.4.1 to one-step and multistep methods,
so that their stability (or lack thereof) can readily be determined.
From Definition 9.4.1, a one-step method of the form (9.14) is stable if Φ(t, y, h) is Lipschitz
continuous in y. That is,
It follows that φ(t, y, h) satisfies a Lipschitz condition on the domain [t_0, T] × (−∞, ∞) × [0, h_0] with
Lipschitz constant L̃ = L + \frac{1}{2} h_0 L^2. Therefore it is stable. 2
Exercise 9.4.5 Prove that the modified Euler method (9.6) is stable.
y' = 0, \quad y(t_0) = y_0, \quad y_0 \ne 0,
for which the exact solution is y(t) = y0 , then for the method to be stable, the computed solution
must remain bounded.
It follows that the computed solution satisfies the m-term recurrence relation
\[
\sum_{i=0}^s \alpha_i y_{n+1-i} = 0,
\]
whose solutions are determined by the roots of the characteristic equation
\[
\alpha_0 \lambda^s + \alpha_1 \lambda^{s-1} + \cdots + \alpha_{s-1} \lambda + \alpha_s = 0.
\]
When a root λi is distinct, pi = 0. Therefore, to ensure that the solution does not grow exponen-
tially, the method must satisfy the root condition:
• All roots must satisfy |λi | ≤ 1.
• If |λi | = 1 for any i, then it must be a simple root, meaning that its multiplicity is one.
It can be shown that a multistep method is zero-stable if and only if it satisfies the root condition.
Furthermore, λ = 1 is always a root, because in order to be consistent, a multistep method must
have the property that \sum_{i=0}^s \alpha_i = 0. If this is the only root that has absolute value 1, then we say
that the method is strongly stable, whereas if there are multiple roots that are distinct from one
another, but have absolute value 1, then the method is said to be weakly stable.
Because all Adams methods have the property that α0 = 1, α1 = −1, and αi = 0 for i =
2, 3, . . . , s, it follows that the roots of the characteristic equation are all zero, except for one root
that is equal to 1. Therefore, all Adams methods are strongly stable. The same is not true for
BDFs; they are zero-unstable for s > 6 [36, p. 349].
Example 9.4.5 A multistep method that is neither an Adams method, nor a backward differenti-
ation formula, is an implicit 2-step method known as Simpson’s method:
\[
y_{n+1} = y_{n-1} + \frac{h}{3}\left[ f_{n+1} + 4 f_n + f_{n-1} \right].
\]
Although it is only a 2-step method, it is fourth-order accurate, due to the high degree of accuracy
of Simpson’s Rule.
This method is obtained from the relation satisfied by the exact solution,
\[
y(t_{n+1}) = y(t_{n-1}) + \int_{t_{n-1}}^{t_{n+1}} f(t, y(t))\, dt.
\]
Since the integral is over an interval of width 2h, it follows that the coefficients βi obtained by
polynomial interpolation of f must satisfy the condition
\[
\sum_{i=0}^s \beta_i = 2,
\]
Exercise 9.4.6 Determine whether the 2-step BDF from Exercise 9.3.6 is strongly
stable, weakly stable, or unstable.
9.4.3 Convergence
It can be shown that a consistent and stable one-step method of the form (9.14) is convergent.
Using the same approach and notation as in the convergence proof of Euler’s method, and the fact
that the method is stable, we obtain the following bound for the global error en = y(tn ) − yn :
\[
|e_n| \le \left( \frac{e^{L_\Phi (T - t_0)} - 1}{L_\Phi} \right) \max_{0 \le m \le n-1} |\tau_m(h)|,
\]
\[
\lim_{n \to \infty} |e_n| = 0,
\]
\[
\Phi(t, y, h) = f(t, y), \qquad \tau_n(h) = \frac{h}{2}\, y''(\tau), \quad \tau \in (t_0, T).
\]
Therefore, there exists a constant K such that
\[
|y(t_n) - y_n| \le K h
\]
for some sufficiently small h0 . We say that Euler’s method is first-order accurate. More generally,
we say that a one-step method has order of accuracy p if, for any sufficiently smooth solution
y(t), there exist constants K and h_0 such that
\[
|y(t_n) - y_n| \le K h^p, \quad 0 < h \le h_0.
\]
Exercise 9.4.7 Prove that the modified Euler method (9.6) is convergent and second-
order accurate.
As for multistep methods, a consistent multistep method is convergent if and only if it is stable.
Because Adams methods are always strongly stable, it follows that all Adams-Moulton predictor-
corrector methods are convergent.
Consider, for example, the initial value problem y' = −100y, y(0) = 1. The exact solution is
y(t) = e^{−100t}, which rapidly decays to zero as t increases. If we solve this problem using Euler's
method, with step size h = 0.1, then we have
\[
y_{n+1} = y_n - 100 h\, y_n = -9 y_n,
\]
which yields the exponentially growing solution y_n = (−9)^n. On the other hand, if we choose
h = 10−3 , we obtain the computed solution yn = (0.9)n , which is much more accurate, and correctly
captures the qualitative behavior of the exact solution, in that it rapidly decays to zero. 2
The ODE in the preceding example is a special case of the test equation
\[
y' = \lambda y, \quad y(0) = 1, \quad \operatorname{Re}\,\lambda < 0.
\]
The exact solution to this problem is y(t) = e^{λt}. However, as λ increases in magnitude, the problem
becomes increasingly stiff. By applying a numerical method to this problem, we can determine how
small h must be, for a given value of λ, in order to obtain a qualitatively accurate solution.
When applying a one-step method to the test equation, the computed solution has the form
yn+1 = Q(hλ)yn ,
where Q(hλ) is a polynomial in hλ if the method is explicit, and a rational function if it is implicit.
This polynomial is meant to approximate ehλ , since the exact solution satisfies y(tn+1 ) = ehλ y(tn ).
However, to obtain a qualitatively correct solution, that decays to zero as t increases, we must
choose h so that |Q(hλ)| < 1.
The test equation can also be used to determine how to choose h for a multistep method. The
process is similar to the one used to determine whether a multistep method is stable, except that
we use f (t, y) = λy, rather than f (t, y) ≡ 0.
Given a general multistep method of the form
\[
\sum_{i=0}^s \alpha_i y_{n+1-i} = h \sum_{i=0}^s \beta_i f_{n+1-i},
\]
applying it to the test equation with f(t, y) = λy leads to the stability polynomial
\[
Q(\mu, h\lambda) = \sum_{i=0}^s (\alpha_i - h\lambda\, \beta_i)\, \mu^{s-i},
\]
whose roots determine the behavior of the computed solution.
Let λ = −100. If we choose h = 0.1, so that λh = −10, then Q(µ, hλ) has a root approximately
equal to −18.884, so h is too large for this method. On the other hand, if we choose h = 0.005,
so that hλ = −1/2, then the largest root of Q(µ, hλ) is approximately −0.924, so h is sufficiently
small to produce a qualitatively correct solution.
Next, we consider the 2-step Adams-Moulton method
\[
y_{n+1} = y_n + \frac{h}{12}\left[ 5 f_{n+1} + 8 f_n - f_{n-1} \right].
\]
In this case, we have
\[
Q(\mu, h\lambda) = \left( 1 - \frac{5}{12} h\lambda \right)\mu^2 + \left( -1 - \frac{2}{3} h\lambda \right)\mu + \frac{1}{12} h\lambda.
\]
Setting h = 0.05, so that hλ = −5, the largest root of Q(µ, hλ) turns out to be approximately
−0.906, so a larger step size can safely be chosen for this method. 2
In general, larger step sizes can be chosen for implicit methods than for explicit methods.
However, the savings achieved from having to take fewer time steps can be offset by the expense
of having to solve a nonlinear equation during every time step.
and since Re λ < 0, it follows that |Q(hλ)| < 1 regardless of the value of h. The only A-stable
multistep method is the implicit trapezoidal method
\[
y_{n+1} = y_n + \frac{h}{2}\left[ f_{n+1} + f_n \right],
\]
because
\[
Q(\mu, h\lambda) = \left( 1 - \frac{h\lambda}{2} \right)\mu + \left( -1 - \frac{h\lambda}{2} \right),
\]
which has the root
\[
\mu = \frac{1 + \frac{h\lambda}{2}}{1 - \frac{h\lambda}{2}}.
\]
The numerator and denominator have imaginary parts of the same magnitude, but because Re λ <
0, the real part of the denominator has a larger magnitude than that of the numerator, so |µ| < 1,
regardless of h.
Implicit multistep methods, such as the implicit trapezoidal method, are often used for stiff
differential equations because of their larger regions of absolute stability. However, as the next
exercises illustrate, it is important to properly estimate the largest possible value of λ for a given
ODE in order to select an h such that hλ actually lies within the region of absolute stability.
Exercise 9.4.8 Form the stability polynomial for the 2-step Adams-Moulton method
\[
y_{n+1} = y_n + \frac{h}{12}\left[ 5 f_{n+1} + 8 f_n - f_{n-1} \right]. \qquad (9.17)
\]
Exercise 9.4.9 Suppose the 2-step Adams-Moulton method (9.17) is applied to the IVP
y' = -2y, \quad y(0) = 1.
Exercise 9.4.10 Now, suppose the same Adams-Moulton method is applied to the IVP
\[
y' = -2y + e^{-100t}, \quad y(0) = 1.
\]
How does the addition of the source term e^{−100t} affect the choice of h?
Exercise 9.4.11 In general, for an ODE of the form y' = f(t, y), how should the value
of λ be determined for the purpose of choosing an h such that hλ lies within the region of
absolute stability?
This theorem shows that local error provides an indication of global error only for zero-stable
methods. A proof can be found in [15, Theorem 6.3.4].
The second theorem imposes a limit on the order of accuracy of zero-stable methods.
For example, because of this theorem, it can be concluded that a 6th-order accurate three-step
method cannot be zero stable, whereas a 4th-order accurate, zero-stable two-step method has the
highest order of accuracy that can be achieved. A proof can be found in [15, Section 4.2].
Finally, we state a result, proved in [11], concerning absolute stability that highlights the trade-
off between explicit and implicit methods.
In order to obtain A-stable methods with higher-order accuracy, it is necessary to relax the
condition of A-stability. Backward differentiation formulae (BDF), mentioned previously in our
initial discussion of multistep methods, are efficient implicit methods that are high-order accurate
and have a region of absolute stability that includes a large portion of the negative half-plane,
including the entire negative real axis.
Exercise 9.4.12 Find a BDF of order greater than 1 that has a region of absolute sta-
bility that includes the entire negative real axis.
• the chosen time step may be too large to resolve the solution with sufficient accuracy, especially
if it is highly oscillatory, or
• the chosen time step may be too small when the solution is particularly smooth, thus wasting
computational effort required for evaluations of f .
This is reminiscent of the problem of choosing appropriate subintervals when applying composite
quadrature rules to approximate definite integrals. In that case, adaptive quadrature rules were
designed to get around this problem. These methods used estimates of the error in order to
determine whether certain subintervals should be divided. In this section, we seek to develop an
analogous strategy for time-stepping to solve initial value problems.
9.5. ADAPTIVE METHODS 335
Suppose that we compute two approximations to y(t_{n+1}) using a pair of one-step methods,
\[
y_{n+1} = y_n + h\,\Phi_p(t_n, y_n, h), \qquad (9.18)
\]
\[
\tilde{y}_{n+1} = y_n + h\,\Phi_{p+1}(t_n, y_n, h), \qquad (9.19)
\]
of orders p and p + 1, respectively. Recall that their local truncation errors are
\[
\tau_{n+1}(h) = \frac{1}{h}\left[ y(t_{n+1}) - y(t_n) \right] - \Phi_p(t_n, y(t_n), h),
\qquad
\tilde{\tau}_{n+1}(h) = \frac{1}{h}\left[ y(t_{n+1}) - y(t_n) \right] - \Phi_{p+1}(t_n, y(t_n), h).
\]
We make the assumption that both methods are exact at time tn ; that is, yn = ỹn = y(tn ). It
then follows from (9.18) and (9.19) that
\[
\tau_{n+1}(h) = \frac{1}{h}\left[ y(t_{n+1}) - y_{n+1} \right],
\qquad
\tilde{\tau}_{n+1}(h) = \frac{1}{h}\left[ y(t_{n+1}) - \tilde{y}_{n+1} \right].
\]
Subtracting these equations yields
\[
\tau_{n+1}(h) = \tilde{\tau}_{n+1}(h) + \frac{1}{h}\left[ \tilde{y}_{n+1} - y_{n+1} \right].
\]
Because τ_{n+1}(h) is O(h^p) while τ̃_{n+1}(h) is O(h^{p+1}), we neglect τ̃_{n+1}(h) and obtain the simple error
estimate
\[
\tau_{n+1}(h) \approx \frac{1}{h}\left( \tilde{y}_{n+1} - y_{n+1} \right).
\]
The approach for multistep methods is similar. We use a pair of Adams methods, consisting of
an s-step Adams-Bashforth (explicit) method,
\[
\sum_{i=0}^s \alpha_i y_{n+1-i} = h \sum_{i=1}^s \beta_i f_{n+1-i},
\]
and an (s − 1)-step Adams-Moulton (implicit) method,
\[
\sum_{i=0}^{s} \tilde{\alpha}_i \tilde{y}_{n+1-i} = h \sum_{i=0}^{s} \tilde{\beta}_i \tilde{f}_{n+1-i},
\]
where f˜i = f (ti , ỹi ), so that both are O(hs )-accurate. We then have
\[
\sum_{i=0}^s \alpha_i\, y(t_{n+1-i}) = h \sum_{i=1}^s \beta_i\, f(t_{n+1-i}, y(t_{n+1-i})) + h\,\tau_{n+1}(h),
\]
\[
\sum_{i=0}^{s} \tilde{\alpha}_i\, y(t_{n+1-i}) = h \sum_{i=0}^{s} \tilde{\beta}_i\, f(t_{n+1-i}, y(t_{n+1-i})) + h\,\tilde{\tau}_{n+1}(h),
\]
where τn+1 (h) and τ̃n+1 (h) are the local truncation errors of the explicit and implicit methods,
respectively.
As before, we assume that yn+1−s , . . . , yn are exact, which yields
\[
\tau_{n+1}(h) = \frac{1}{h}\left[ y(t_{n+1}) - y_{n+1} \right],
\qquad
\tilde{\tau}_{n+1}(h) = \frac{1}{h}\left[ y(t_{n+1}) - \tilde{y}_{n+1} \right],
\]
as in the case of one-step methods. It follows that
\[
\tau_{n+1}(h) = C h^s\, y^{(s+1)}(\xi_n), \qquad \tilde{\tau}_{n+1}(h) = \tilde{C} h^s\, y^{(s+1)}(\tilde{\xi}_n),
\]
where ξn , ξ˜n ∈ [tn+1−s , tn+1 ]. We assume that these unknown values are equal, which yields
\[
\tilde{\tau}_{n+1}(h) \approx \frac{\tilde{C}}{h(C - \tilde{C})}\left[ \tilde{y}_{n+1} - y_{n+1} \right]. \qquad (9.20)
\]
Exercise 9.5.1 Formulate an error estimate of the form (9.20) for the case s = 4;
that is, estimate the error in the 3-step Adams-Moulton method (9.12) using the 4-step
Adams-Bashforth method (9.11). Hint: use the result of Exercise 9.4.3.
In practice, though, the step size is kept bounded by chosen values hmin and hmax in order to avoid
missing sensitive regions of the solution by using excessively large time steps, as well as expending
too much computational effort on regions where y(t) is oscillatory by using step sizes that are too
small [10].
For one-step methods, if the error is deemed small enough so that yn+1 can be accepted, but
ỹn+1 is obtained using a higher-order method, then it makes sense to instead use ỹn+1 as input for
the next time step, since it is ostensibly more accurate, even though the error estimate applies to
yn+1 . Using ỹn+1 instead is called local extrapolation.
The Runge-Kutta-Fehlberg method [14] is an example of an adaptive time-stepping method.
It uses a four-stage, fourth-order Runge-Kutta method and a five-stage, fifth-order Runge-Kutta
method. These two methods share some evaluations of f (t, y), in order to reduce the number of
evaluations of f per time step to six, rather than the nine that would normally be required from a
pairing of fourth- and fifth-order methods. A pair of Runge-Kutta methods that can share stages
in this way is called an embedded pair.
The Bogacki-Shampine method [8], which is used in the Matlab function ode23, is an
embedded pair consisting of a four-stage, second-order Runge-Kutta method and a three-stage,
third-order Runge-Kutta method. As in the Runge-Kutta-Fehlberg method, evaluations of f(t, y)
are shared, reducing the number of evaluations per time step from seven to four. However, unlike
the Runge-Kutta-Fehlberg method, its last stage, k_4 = f(t_{n+1}, y_{n+1}), is the same as the first stage
of the next time step, k1 = f (tn , yn ), if yn+1 is accepted, as local extrapolation is used. This reduces
the number of new evaluations of f per time step from four to three. A Runge-Kutta method that
shares stages across time-steps in this manner is called a FSAL (First Same as Last) method.
Exercise 9.5.2 Find the definitions of the two Runge-Kutta methods used in the Bogacki-
Shampine method (they can easily be found online). Use these definitions to write a
Matlab function [t,y]=rk23(f,tspan,y0,h) that implements the Bogacki-Shampine
method, using an initial step size specified in the input argument h. How does the perfor-
mance of your method compare to that of ode23?
The Matlab ODE solver ode45 uses the Dormand-Prince method [13], an embedded pair in
which a fifth-order Runge-Kutta method and a fourth-order Runge-Kutta method share their stages,
so that only seven evaluations of f(t, y) are needed per time step. Like the Bogacki-
Shampine method, the Dormand-Prince method is FSAL, so in fact only six new evaluations per
time step are required.
Exercise 9.5.3 Find the definitions of the two Runge-Kutta methods used in the
Dormand-Prince method (they can easily be found online). Use these definitions to write
a Matlab function [t,y]=rk45(f,tspan,y0,h) that implements the Dormand-Prince
method, using an initial step size specified in the input argument h. How does the perfor-
mance of your method compare to that of ode45?
For multistep methods, we assume as before that an s-step predictor and (s − 1)-step corrector
are used. Recall that the error estimate τn+1 (h) for the corrector is given in (9.20). As with one-
step methods, we relate the error estimate τn+1 (qh) to the error tolerance ε and solve for q, which
yields
\[
q \approx \left( \frac{\varepsilon}{\tau_{n+1}(h)} \right)^{1/s}.
\]
Then, the time step can be adjusted to qh, but as with one-step methods, q is constrained to
avoid drastic changes in the time step. Unlike one-step methods, a change in the time step is
computationally expensive, as it requires the computation of new starting values at equally spaced
times.
Exercise 9.5.4 Implement an adaptive multistep method based on the 4-step Adams
Bashforth method (9.11) and 3-step Adams-Moulton method (9.12). Use the fourth-order
Runge-Kutta method to obtain starting values.
\[
\begin{aligned}
y_1' &= f_1(t, y_1, y_2, \ldots, y_m), \\
y_2' &= f_2(t, y_1, y_2, \ldots, y_m), \\
&\;\;\vdots \\
y_m' &= f_m(t, y_1, y_2, \ldots, y_m),
\end{aligned}
\]
This initial-value problem has a unique solution y(t) on [t0 , T ] if f is continuous on the domain D =
[t0 , T ] × (−∞, ∞)m , and satisfies a Lipschitz condition on D in each of the variables y1 , y2 , . . . , ym .
\[
\begin{aligned}
k_1 &= h f(t_n, y_n), \\
k_2 &= h f(t_n + h,\, y_n + k_1), \\
y_{n+1} &= y_n + \frac{1}{2}\left[ k_1 + k_2 \right].
\end{aligned}
\]
2
Exercise 9.6.1 Try your rk4 function from Exercise 9.2.4 on the system (9.21), (9.22)
with initial conditions y1 (0) = 1, y2 (0) = −1. Write your time derivative function
yp=f(t,y) for this system so that the input argument y and the value yp returned by f
are both column vectors, and pass a column vector containing the initial values as the
input argument y0. Do you even need to modify rk4?
Multistep methods generalize in a similar way. A general s-step multistep method for a system of
first-order ODEs y' = f(t, y) has the form
\[
\sum_{i=0}^s \alpha_i\, \mathbf{y}_{n+1-i} = h \sum_{i=0}^s \beta_i\, \mathbf{f}(t_{n+1-i}, \mathbf{y}_{n+1-i}),
\]
where the constants αi and βi , for i = 0, 1, . . . , s, are determined in the same way as in the case of
a single equation.
Example 9.6.2 The explicit 3-step Adams-Bashforth method applied to the system in the previous
example has the form
\[
\begin{aligned}
y_{1,n+1} &= y_{1,n} + \frac{h}{12}\left[ 23 f_{1,n} - 16 f_{1,n-1} + 5 f_{1,n-2} \right], \\
y_{2,n+1} &= y_{2,n} + \frac{h}{12}\left[ 23 f_{2,n} - 16 f_{2,n-1} + 5 f_{2,n-2} \right],
\end{aligned}
\]
where
\[
f_{1,i} = -2 y_{1,i} + 3 t_i y_{2,i}, \qquad f_{2,i} = t_i^2 y_{1,i} - e^{-t_i} y_{2,i}, \quad i = 0, \ldots, n.
\]
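In Matlab, the right-hand side of this system can be passed to a solver such as the rk4 function from Exercise 9.2.4 as a single vector-valued function handle; the handle name is arbitrary.

    % Right-hand side of the system from Example 9.6.2
    f = @(t, y) [ -2*y(1) + 3*t*y(2);
                   t^2*y(1) - exp(-t)*y(2) ];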
The order of accuracy for a one-step or multistep method, when applied to a system of equations,
is the same as when it is applied to a single equation. For example, the modified Euler’s method is
second-order accurate for systems, and the 3-step Adams-Bashforth method is third-order accurate.
However, when using adaptive step size control for any of these methods, it is essential that the
step size h is selected so that all components of the solution are sufficiently accurate, or it is likely
that none of them will be.
Exercise 9.6.2 Modify your rk4 function from Exercise 9.2.4 so that it solves a single
ODE of the form
\[
y^{(m)} = f(t, y, y', y'', \ldots, y^{(m-1)})
\]
with initial conditions
\[
y(t_0) = y_0, \quad y'(t_0) = y_0', \quad y''(t_0) = y_0'', \quad \ldots, \quad y^{(m-1)}(t_0) = y_0^{(m-1)}.
\]
Assume that the input argument ym = f(t, y) treats the input argument y as a row vector
consisting of the values of y, y 0 , y 00 , . . . , y (m−1) at time t, and that f returns a scalar value
ym that represents the value of y (m) . Your function should also assume that the argument
y0 containing the initial values is a row vector. The value of m indicating the order of the
ODE can be automatically inferred from length(y0), rather than passed as a parameter.
Exercise 9.6.3 How would you use your function from the previous exercise to solve a
system of ODEs of order m?
Chapter 10
Two-Point Boundary Value Problems
This problem is guaranteed to have a unique solution if the following conditions hold:
• f, fy, and fy′ are continuous on the domain D,
• fy > 0 on D,
• fy′ is bounded on D.
In this chapter, we will introduce several methods for solving this kind of problem, most of which
can be generalized to partial differential equations (PDEs) on higher-dimensional domains. A
comprehensive treatment of numerical methods for two-point BVPs can be found in [20].
approach requires selecting the proper slope, or "trajectory", so that the solution will "hit the target" of y(b) = β. This viewpoint indicates how the shooting method earned its name. Note that since the ODE associated with the IVP is of second order, it must normally be rewritten as a system of first-order equations before it can be solved by standard numerical methods such as Runge-Kutta or multistep methods.
selecting the slope t is relatively simple. Let y1 (x) be the solution of the IVP
where t is the correct slope, since any linear combination of solutions of the ODE also satisfies the
ODE, and the initial values are linearly combined in the same manner as the solutions themselves.
Exercise 10.1.1 Assume y2(b) ≠ 0. Find the value of t in (10.6) such that the boundary conditions (10.2) are satisfied.
Exercise 10.1.2 Explain why the condition y2(b) ≠ 0 is guaranteed to be satisfied, due to the previously stated assumptions about f(x, y, y′) that guarantee the existence and uniqueness of the solution.
y(b, t) − β = 0,
where y(b, t) is the value of the solution, at x = b, of the IVP specified by the shooting method, with
initial slope t. This nonlinear equation can be solved using an iterative method such as the bisection
method, fixed-point iteration, Newton’s Method, or the Secant Method. The only difference is that
each evaluation of the function y(b, t), at a new value of t, is relatively expensive, since it requires
the solution of an IVP over the interval [a, b], for which y 0 (a) = t. The value of that solution at
x = b is taken to be the value of y(b, t).
If Newton’s Method is used, then an additional complication arises, because it requires the
derivative of y(b, t), with respect to t, during each iteration. This can be computed using the fact
that z(x, t) = ∂y(x, t)/∂t satisfies the ODE z″ = fy(x, y, y′)z + fy′(x, y, y′)z′, with z(a, t) = 0 and ∂z/∂x(a, t) = 1,
which can be obtained by differentiating the original BVP and its boundary conditions with respect
to t. Therefore, each iteration of Newton’s Method requires two IVPs to be solved, but this extra
effort can be offset by the rapid convergence of Newton’s Method.
Suppose that Euler's method,
yn+1 = yn + hf(xn, yn),
for the IVP y′ = f(x, y), y(x0) = y0, is to be used to solve any IVPs arising from the Shooting Method in conjunction with Newton's Method. Because each IVP, for y(x, t) and z(x, t), is of second order, we must rewrite each one as a first-order system. We first define
y1 = y,   y2 = y′,   z1 = z,   z2 = z′.
This yields the system
∂y1/∂x = y2,
∂y2/∂x = f(x, y1, y2),
∂z1/∂x = z2,
∂z2/∂x = fy(x, y1, y2)z1 + fy′(x, y1, y2)z2,
with initial conditions y1(a) = α, y2(a) = t, z1(a) = 0, z2(a) = 1.
The resulting algorithm is as follows:
Choose t(0)
Choose h such that b − a = hN, where N is the number of steps
for k = 0, 1, 2, . . . until convergence do
    y1,0 = α, y2,0 = t(k), z1,0 = 0, z2,0 = 1
    for i = 0, 1, 2, . . . , N − 1 do
        xi = a + ih
        y1,i+1 = y1,i + h y2,i
        y2,i+1 = y2,i + h f(xi, y1,i, y2,i)
        z1,i+1 = z1,i + h z2,i
        z2,i+1 = z2,i + h[fy(xi, y1,i, y2,i) z1,i + fy′(xi, y1,i, y2,i) z2,i]
    end
    t(k+1) = t(k) − (y1,N − β)/z1,N
end
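A minimal Matlab sketch of this algorithm is given below. The function handles f, fy, fyp for f, fy, fy′, the data a, b, alpha, beta, the number of steps N, and the initial slope t0 are assumed to be defined; the iteration limit and stopping tolerance are also assumptions.
h=(b-a)/N;
t=t0;
for k=1:100                          % Newton iteration on the slope t
    y1=alpha; y2=t; z1=0; z2=1;
    for i=0:N-1                      % Euler's method for the two IVPs
        x=a+i*h;
        y1new=y1+h*y2;
        y2new=y2+h*f(x,y1,y2);
        z1new=z1+h*z2;
        z2new=z2+h*(fy(x,y1,y2)*z1+fyp(x,y1,y2)*z2);
        y1=y1new; y2=y2new; z1=z1new; z2=z2new;
    end
    dt=(y1-beta)/z1;                 % Newton correction for t
    t=t-dt;
    if abs(dt)<1e-10, break; end
end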
Exercise 10.1.4 What would be a logical choice of initial guess for the slope t(0) , that
would not require any information about the function f(x, y, y′)?
Exercise 10.1.5 Implement the above algorithm to solve the BVP from Example 10.2.2.
Changing the implementation to use a different IVP solver, such as a Runge-Kutta method or
multistep method, in place of Euler’s method only changes the inner loop.
Exercise 10.1.6 Modify your code from Exercise 10.1.5 to use the fourth-order Runge-
Kutta method in place of Euler’s method. How does this affect the convergence of the
Newton iteration?
Exercise 10.1.7 Modify your code from Exercise 10.1.5 to use the Secant Method instead
of Newton's Method. How can the function f(x, y, y′) from the ODE be used to obtain a
logical second initial guess t(1) ? Hint: consider a solution that is a parabola. How is the
efficiency of the iteration affected by the change to the Secant Method?
for the values of the solution at each xi , in which the local truncation error is O(h2 ).
Then, the above system of equations is also linear, and can therefore be expressed in matrix-vector
form
Ay = r,
where A is a tridiagonal matrix, since the approximations of y′ and y″ at xi only use yi−1, yi and
yi+1 , and r is a vector that includes the values of r(x) at the grid points, as well as additional terms
that account for the boundary conditions.
Specifically,
aii = 2 + h²q(xi),   i = 1, 2, . . . , N,
ai,i+1 = −1 + (h/2)p(xi),   i = 1, 2, . . . , N − 1,
ai+1,i = −1 − (h/2)p(xi+1),   i = 1, 2, . . . , N − 1,
r1 = −h²r(x1) + [1 + (h/2)p(x1)]α,
ri = −h²r(xi),   i = 2, 3, . . . , N − 1,
rN = −h²r(xN) + [1 − (h/2)p(xN)]β.
This system of equations is guaranteed to have a unique solution if A is diagonally dominant, which is the case if q(x) ≥ 0 and h < 2/L, where L is an upper bound on |p(x)|.
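A minimal sketch of such a solver, with the same interface as the function FDBVP referenced below (Exercise 10.2.1), is as follows; it simply assembles the tridiagonal system above and solves it with backslash, and is only one possible implementation.
function [x,y]=FDBVP(p,q,r,a,b,alpha,beta,N)
% finite difference solution of y'' = p(x)y' + q(x)y + r(x), y(a)=alpha, y(b)=beta
h=(b-a)/(N+1);
x=a+h*(1:N)';                          % interior grid points
A=diag(2+h^2*q(x)) ...
  +diag(-1+h/2*p(x(1:N-1)),1) ...
  +diag(-1-h/2*p(x(2:N)),-1);
rhs=-h^2*r(x);
rhs(1)=rhs(1)+(1+h/2*p(x(1)))*alpha;
rhs(N)=rhs(N)+(1-h/2*p(x(N)))*beta;
y=A\rhs;
x=[a; x; b];                           % include boundary points
y=[alpha; y; beta];                    % and boundary values
end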
The following script uses the function FDBVP (see Exercise 10.2.1) to solve this problem with N = 10 interior grid points, and then visualizes the exact and approximate solutions, as well as the error.
% coefficients
p=@(x)(2*ones(size(x)));
q=@(x)(-ones(size(x)));
r=@(x)(x.*exp(x)-x);
% boundary conditions
a=0;
b=2;
alpha=0;
beta=-4;
% number of interior grid points
N=10;
% solve using finite differences
[x,y]=FDBVP(p,q,r,a,b,alpha,beta,N);
% exact solution: y = x^3 e^x/6 - 5xe^x/3 + 2e^x - x - 2
yexact=x.^3.*exp(x)/6-5*x.*exp(x)/3+2*exp(x)-x-2;
% plot exact and approximate solutions for comparison
subplot(121)
plot(x,yexact,'b-')
hold on
plot(x,y,'r--o')
hold off
xlabel('x')
ylabel('y')
subplot(122)
plot(x,abs(yexact-y))
xlabel('x')
ylabel('error')
Exercise 10.2.2 After evaluating the coefficients p(x), q(x) and r(x) from (10.7) at
the grid points xi , i = 1, 2, . . . , N , how many floating-point operations are necessary to
solve the system Ay = r? If the boundary conditions are changed but the ODE remains
the same, how many additional floating-point operations are needed? Hint: review the
material in Chapter 2 on the solution of banded systems.
Figure 10.1: Left plot: exact (solid curve) and approximate (dashed curve with circles) solutions of
the BVP (10.8) computed using finite differences. Right plot: error in the approximate solution.
If the ODE is nonlinear, then we must solve a system of nonlinear equations of the form
F(y) = 0,
where F(y) is a vector-valued function with coordinate functions Fi(y), for i = 1, 2, . . . , N. These coordinate functions are defined as follows:
F1(y) = y2 − 2y1 + α − h²f(x1, y1, (y2 − α)/(2h)),
F2(y) = y3 − 2y2 + y1 − h²f(x2, y2, (y3 − y1)/(2h)),
⋮                                                        (10.9)
FN−1(y) = yN − 2yN−1 + yN−2 − h²f(xN−1, yN−1, (yN − yN−2)/(2h)),
FN(y) = β − 2yN + yN−1 − h²f(xN, yN, (β − yN−1)/(2h)).
This system of equations can be solved approximately using an iterative method such as Fixed-point
Iteration, Newton’s Method, or the Secant Method.
For example, if Newton's Method is used, then, by the Chain Rule, the nonzero entries of the Jacobian matrix JF(y) are
∂Fi/∂yi−1 = 1 + (h/2)fy′(xi, yi, (yi+1 − yi−1)/(2h)),
∂Fi/∂yi = −2 − h²fy(xi, yi, (yi+1 − yi−1)/(2h)),
∂Fi/∂yi+1 = 1 − (h/2)fy′(xi, yi, (yi+1 − yi−1)/(2h)),
where, for convenience, we use y0 = α and yN+1 = β. Then, during each iteration of Newton's Method, the system of equations JF(y(k))s(k) = −F(y(k)) is solved for the correction s(k), and the new iterate y(k+1) = y(k) + s(k) is obtained from the previous iterate y(k). An appropriate initial guess is the unique linear function that satisfies
the boundary conditions,
y(0) = α + [(β − α)/(b − a)](x − a),
where x is the vector with coordinates x1 , x2 , . . . , xN .
The following Matlab function can be used to evaluate F(y) for a general BVP of the form (10.1).
Its arguments are assumed to be vectors of x- and y-values, including boundary values, along with
a function handle for the right-hand side f(x, y, y′) of the ODE (10.1) and the spacing h.
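The sketch below is consistent with this description and with (10.9), but the name FDF and the exact interface are assumptions, not the text's own listing.
function F=FDF(x,y,f,h)
% evaluate F(y) of (10.9); x and y include the boundary values
N=length(y)-2;                      % number of interior grid points
F=zeros(N,1);
for i=1:N
    yp=(y(i+2)-y(i))/(2*h);         % centered approximation of y' at x_i
    F(i)=y(i+2)-2*y(i+1)+y(i)-h^2*f(x(i+1),y(i+1),yp);
end
end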
Using an absolute error tolerance of 10⁻⁸, Newton's Method converges in just three iterations, and does so quadratically. The resulting plot is shown in Figure 10.2. □
Figure 10.2: Exact (solid curve) and approximate (dashed curve with circles) solutions of the BVP
(10.11) from Example 10.2.2.
Solving this system is also less expensive when using a quasi-Newton method such as the Secant Method or Broyden's Method than for a general system of nonlinear equations, because the Jacobian matrix is tridiagonal. This reduces the expense of the computation of s(k+1) from O(N³) operations in the general case to only O(N) for two-point boundary value problems.
Exercise 10.2.6 Modify your function FDNLBVP from Exercise 10.2.5 to use Broyden’s
Method instead of Newton’s Method. How does this affect the efficiency, when applied to
the BVP from Example 10.2.2?
Exercise 10.2.7 Although Newton’s Method is much more efficient for such a problem
than for a general system of nonlinear equations, what is an advantage of using the Secant
Method over Newton’s Method or Broyden’s Method?
It can be shown that regardless of the choice of iterative method used to solve the system of
equations arising from discretization, the local truncation error of the finite difference method for
nonlinear problems is O(h2 ), as in the linear case. The order of accuracy can be increased by
applying Richardson extrapolation.
10.3 Collocation
While the finite-difference approach from the previous section is generally effective for two-point
boundary value problems, and is more flexible than the Shooting Method as it can be applied to
higher-dimensional BVPs, it does have its drawbacks.
• First, the accuracy of finite difference approximations relies on the existence of the higher-
order derivatives that appear in their error formulas. Unfortunately, the existence of these
higher-order derivatives is not assured.
• Third, it is best suited for problems in which the domain is relatively simple, such as a
rectangular domain.
We now consider an alternative approach that, in higher dimensions, is more readily applied to
problems on domains with complicated geometries.
First, we assume that the solution y(x) is approximated by a function yN (x) that is a linear
combination of chosen linearly independent functions φ1 (x), φ2 (x), . . . , φN (x), called basis functions
as they form a basis for an N -dimensional vector space. We then have
yN(x) = Σ_{i=1}^{N} ci φi(x),   (10.12)
where the constants c1 , c2 , . . . , cN are unknown. Substituting this form of the solution into (10.1),
(10.2) yields the equations
Σ_{j=1}^{N} cj φj″(x) = f(x, Σ_{j=1}^{N} cj φj(x), Σ_{j=1}^{N} cj φj′(x)),   a < x < b,   (10.13)
Σ_{j=1}^{N} cj φj(a) = α,   Σ_{j=1}^{N} cj φj(b) = β.   (10.14)
Already, the convenience of this assumption is apparent: instead of solving for a function y(x) on the entire interval (a, b), we only need to solve for the N coefficients c1, c2, . . . , cN. However, it is
not realistic to think that there is any choice of these coefficients that satisfies (10.13) on the entire
interval (a, b), as well as the boundary conditions (10.14). Rather, we need to impose N conditions
on these N unknowns, in the hope that the resulting system of N equations will have a unique
solution that is also an accurate approximation of the exact solution y(x).
To that end, we require that (10.13) is satisfied at N −2 points in (a, b), denoted by x1 , x2 , . . . , xN −2 ,
and that the boundary conditions (10.14) are satisfied. The points a = x0 , x1 , x2 , . . . , xN −2 , xN −1 =
b are called collocation points. This approach of approximating y(x) by imposing (10.12) and solving
the system of N equations given by
Σ_{j=1}^{N} cj φj″(xi) = f(xi, Σ_{j=1}^{N} cj φj(xi), Σ_{j=1}^{N} cj φj′(xi)),   i = 1, 2, . . . , N − 2,   (10.15)
along with (10.14), is known as collocation. This system can be written in the form Ac = b, where c = [c1 · · · cN]ᵀ. We can then solve this system using any of the methods from Chapter 2.
Exercise 10.3.1 Express the system of linear equations (10.17), (10.14) in the form
Ac = b, where c is defined as above. What are the entries aij and bi of the matrix A and
right-hand side vector b, respectively?
y4 (x) = c1 + c2 x + c3 x2 + c4 x3 .
That is, N = 4, since we are assuming that y(x) is a linear combination of the four functions
1, x, x² and x³. Substituting this form into the BVP yields
2c3 + 6c4 x = x²,   0 < x < 1,
along with the boundary conditions
c1 = 0,   c1 + c2 + c3 + c4 = 1.
Writing this system of equations in matrix-vector form, we obtain
    [ 1  0  0   0   ] [ c1 ]   [  0  ]
    [ 0  0  2  6x1  ] [ c2 ] = [ x1² ]      (10.20)
    [ 0  0  2  6x2  ] [ c3 ]   [ x2² ]
    [ 1  1  1   1   ] [ c4 ]   [  1  ]
For the system to be specified completely, we need to choose the two collocation points x1 , x2 ∈ (0, 1).
As long as these points are chosen to be distinct, the matrix of the system will be nonsingular. For
this example, we choose x1 = 1/3 and x2 = 2/3. We then have the system
    [ 1  0  0  0 ] [ c1 ]   [  0  ]
    [ 0  0  2  2 ] [ c2 ] = [ 1/9 ]      (10.21)
    [ 0  0  2  4 ] [ c3 ]   [ 4/9 ]
    [ 1  1  1  1 ] [ c4 ]   [  1  ]
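One way to solve this system in Matlab, consistent with the output shown below (the exact commands used are an assumption), is:
>> A=[ 1 0 0 0; 0 0 2 2; 0 0 2 4; 1 1 1 1 ];
>> b=[ 0; 1/9; 4/9; 1 ];
>> format rat
>> c=A\b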
c =
0
17/18
-1/9
1/6
The format rat statement was used to obtain exact values of the entries of c, since these entries
are guaranteed to be rational numbers in this case. It follows that our approximate solution yN (x)
is
y4(x) = (17/18)x − (1/9)x² + (1/6)x³.
The exact solution of the original BVP is easily obtained by integration:
y(x) = (1/12)x⁴ + (11/12)x.
From these formulas, though, it is not easy to gauge how accurate y4 (x) is. To visualize the error,
we plot both solutions:
>> xp=0:0.01:1;
>> y4p=c(1)+c(2)*xp+c(3)*xp.^2+c(4)*xp.^3;
>> yp=xp.^4/12+11*xp/12;
>> plot(xp,yp)
>> hold on
>> plot(xp,y4p,'r--')
>> xlabel('x')
>> ylabel('y')
>> legend('exact','approximate')
The result is shown in Figure 10.3. As we can see, this approximate solution is reasonably accurate.
Figure 10.3: Exact (blue curve) and approximate (dashed curve) solutions of (10.18), (10.19) from
Example 10.3.1.
To get a numerical indication of the accuracy, we can measure the error at the points in xp that
were used for plotting:
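One command consistent with the output below (the exact command used is an assumption) is the maximum pointwise error:
>> norm(yp-y4p,inf)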
ans =
0.0023
Since the exact solution and approximation are both polynomials, we can also compute the L2
norm of the error:
>> py=[ 1/12 0 0 11/12 0 ];
>> py4=c(end:-1:1)';
>> err4=py-[ 0 py4 ];
>> err42=conv(err4,err4);
>> Ierr42=polyint(err42);
>> norm22=polyval(Ierr42,1)-polyval(Ierr42,0);
>> norm2=sqrt(norm22)
norm2 =
0.0019
□
Exercise 10.3.2 Solve the BVP from Example 10.3.1 again, but with different collocation
points x1 , x2 ∈ (0, 1). What happens to the error?
Exercise 10.3.3 Use Matlab to compute the relative error in the ∞-norm and L2 -
norm from the preceding example.
Exercise 10.3.4 What would happen if N = 5 collocation points were used, along with
the functions φj (x) = xj−1 , j = 1, 2, . . . , 5?
Exercise 10.3.6 Use your function linearcollocation from Exercise 10.3.5 to solve
the BVP
y″ = eˣ,   0 < x < 1,   y(0) = 0,   y(1) = 1.
What happens to the error in the approximate solution as the number of collocation points,
N , increases? Plot the error as a function of N , using logarithmic scales.
The choice of functions φj (x), j = 1, 2, . . . , N , can significantly affect the process of solving the
resulting system of equations. The choice used in Example 10.3.1, φj (x) = xj−1 , while convenient,
is not a good choice, especially when N is large. As illustrated in Section 6.2, these functions can
be nearly linearly dependent on the interval [a, b], which can lead to ill-conditioned systems.
Exercise 10.3.7 What happens to the condition number of the matrix used by your func-
tion linearcollocation from Exercise 10.3.5 as N increases?
Exercise 10.3.8 Modify your function linearcollocation from Exercise 10.3.5 to use
Chebyshev polynomials instead of the monomial basis, and the Chebyshev points as the
collocation points instead of equally spaced points. What happens to the condition number
of the matrix as N increases?
Collocation can be used for either linear or nonlinear BVPs. In the nonlinear case, choosing the functions φj(x), j = 1, 2, . . . , N, and the collocation points xi, i = 0, 1, . . . , N − 1, yields a system
of nonlinear equations for the unknowns c1 , c2 , . . . , cN . This system can then be solved using any
of the techniques from Section 8.6, just as when using finite differences.
Exercise 10.3.9 Describe the system of nonlinear equations F(c) = 0 that must be solved
at each iteration when using Newton’s method to solve a general nonlinear BVP of the
form (10.1), (10.2). What is the Jacobian of F, JF (c)?
What is the exact solution? Hint: Solve the simpler ODE y′ = y²; the form of its solution suggests the form of the solution of y″ = y².
This equation can be used to model, for example, transverse vibration of a string due to an external
force f (x), or longitudinal displacement of a beam subject to a load f (x). In either case, the
boundary conditions prescribe that the endpoints of the object in question are fixed.
If we multiply both sides of (10.22) by a test function w(x), and then integrate over the domain
[0, 1], we obtain
∫_0^1 −w(x)u″(x) dx = ∫_0^1 w(x)f(x) dx.
Applying integration by parts, we obtain
∫_0^1 w(x)u″(x) dx = [w(x)u′(x)]_0^1 − ∫_0^1 w′(x)u′(x) dx.
Let C²[0, 1] be the space of all functions with two continuous derivatives on [0, 1], and let C₀²[0, 1] be the space of all functions in C²[0, 1] that are equal to zero at the endpoints x = 0 and x = 1. If we require that our test function w(x) belongs to C₀²[0, 1], then w(0) = w(1) = 0, and the boundary term in the above application of integration by parts vanishes. We then have
∫_0^1 w′(x)u′(x) dx = ∫_0^1 w(x)f(x) dx.
This is called the weak form of the boundary value problem (10.22), (10.23); the original problem is known as the strong form or classical form. The weak form only requires that the first derivative of u(x) exist, whereas the strong form requires the existence of the second derivative. The weak form is also known as the variational form. It can be shown that both the weak form and the strong form have the same solution u ∈ C₀²[0, 1].
To find an approximate solution of the weak form, we restrict ourselves to an N-dimensional subspace VN of C₀²[0, 1] by requiring that the approximate solution, denoted by uN(x), satisfies
uN(x) = Σ_{j=1}^{N} cj φj(x),   (10.24)
where the trial functions φ1, φ2, . . . , φN form a basis for VN. For now, we only assume that these trial functions belong to C₀²[0, 1], and are linearly independent. Substituting this form into the weak form yields
Σ_{j=1}^{N} [∫_0^1 w′(x)φj′(x) dx] cj = ∫_0^1 w(x)f(x) dx.
Since our trial functions and test functions come from the same space, this version of the weighted
mean residual method is known as the Galerkin method. As in collocation, we need N equations
to uniquely determine the N unknowns c1 , c2 , . . . , cN . To that end, we use the basis functions
φ1 , φ2 , . . . , φN as test functions. This yields the system of equations
Σ_{j=1}^{N} [∫_0^1 φi′(x)φj′(x) dx] cj = ∫_0^1 φi(x)f(x) dx,   i = 1, 2, . . . , N.
This is a system of linear equations Ac = f, where the entries of A and f are
aij = ∫_0^1 φi′(x)φj′(x) dx,   fi = ∫_0^1 φi(x)f(x) dx,   i, j = 1, 2, . . . , N.
By finding the coefficients c1, c2, . . . , cN that satisfy these equations, we ensure that the residual R(x, uN, uN′, uN″) = f(x) + uN″(x) satisfies ⟨w, R⟩ = 0 for all w ∈ VN, where ⟨·, ·⟩ denotes the L² inner product on [0, 1]; that is, the residual is orthogonal to the subspace VN.
Figure 10.4: Piecewise linear basis functions φj (x), as defined in (10.25), for j = 1, 2, 3, 4, with
N =4
Then we define
          { 0,                  0 ≤ x ≤ xj−1,
φj(x) =   { (1/h)(x − xj−1),    xj−1 < x ≤ xj,        j = 1, 2, . . . , N.      (10.25)
          { (1/h)(xj+1 − x),    xj < x ≤ xj+1,
          { 0,                  xj+1 < x ≤ 1,
These functions automatically satisfy the boundary conditions. Because they are only piecewise
linear, their derivatives are discontinuous. They are
          { 0,        0 ≤ x ≤ xj−1,
φj′(x) =  { 1/h,      xj−1 < x ≤ xj,        j = 1, 2, . . . , N.
          { −1/h,     xj < x ≤ xj+1,
          { 0,        xj+1 < x ≤ 1,
It follows from these definitions that φi (x) and φj (x) cannot simultaneously be nonzero at any
point in [0, 1] unless |i − j| ≤ 1. This yields a symmetric tridiagonal matrix A with entries
aii = ∫_{xi−1}^{xi} (1/h)² dx + ∫_{xi}^{xi+1} (−1/h)² dx = 2/h,   i = 1, 2, . . . , N,
ai,i+1 = ∫_{xi}^{xi+1} −(1/h²) dx = −1/h,   i = 1, 2, . . . , N − 1,
ai+1,i = ai,i+1,   i = 1, 2, . . . , N − 1.
For the right-hand side vector f , known as the load vector, we have
fi = (1/h)∫_{xi−1}^{xi} (x − xi−1)f(x) dx + (1/h)∫_{xi}^{xi+1} (xi+1 − x)f(x) dx,   i = 1, 2, . . . , N.   (10.26)
When the Galerkin method is used with basis functions such as these, that are only nonzero within
a small portion of the spatial domain, the method is known as the finite element method. In this
context, the subintervals [xi−1 , xi ] are called elements, and each xi is called a node. As we have
seen, an advantage of this choice of trial function is that the resulting matrix A, known as the
stiffness matrix, is sparse.
It can be shown that the matrix A with entries defined from these approximate integrals is
not only symmetric and tridiagonal, but also positive definite. It follows that the system Ac = f
is stable with respect to roundoff error, and can be solved using methods such as the conjugate
gradient method that are appropriate for sparse symmetric positive definite systems.
Example 10.4.1 We illustrate the finite element method by solving (10.22), (10.23) with f (x) = x,
with N = 4. The following Matlab commands are used to help specify the problem.
>> % solve -u'' = f on (0,1), u(0)=u(1)=0, f polynomial
>> % represent f(x) = x as a polynomial
>> fx=[ 1 0 ];
>> % set number of interior nodes
>> N=4;
>> h=1/(N+1);
>> % compute vector containing all nodes, including boundary nodes
>> x=h*(0:N+1)’;
This vector of nodes will be convenient when constructing the load vector f and performing other
tasks.
We need to solve the system Ac = f , where
              [  2  −1   0   0 ]          [ c1 ]
    A = (1/h) [ −1   2  −1   0 ] ,    c = [ c2 ] ,
              [  0  −1   2  −1 ]          [ c3 ]
              [  0   0  −1   2 ]          [ c4 ]
with h = 1/5. The following Matlab commands set up the stiffness matrix for a general value of
N.
>> % construct stiffness matrix:
>> e=ones(N-1,1);
>> % use diag to place entries on subdiagonal and superdiagonal
>> A=1/h*(2*eye(N)-diag(e,1)-diag(e,-1));
The load vector f has elements
fi = (1/h)∫_{xi−1}^{xi} (x − xi−1)x dx + (1/h)∫_{xi}^{xi+1} (xi+1 − x)x dx,   i = 1, 2, . . . , N.
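These integrals can be evaluated exactly: each hat function φi has integral h and is symmetric about xi, so fi = h·xi. Continuing the session above, one way to construct the load vector is (the variable name f is an assumption):
>> % load vector for f(x) = x: f_i = h*x_i at the interior nodes
>> f=h*x(2:N+1);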
Now that the system Ac = f is set up, we can easily solve it in Matlab using the command c=A\f.
For this simple BVP, we can obtain the exact solution analytically, to help gauge the accuracy of our approximate solution. The following statements accomplish this for the BVP −u″ = f on (0, 1), u(x0) = u(xN+1) = 0, when f is a polynomial.
Now, we can visualize the exact solution u(x) and approximate solution uN (x), which is a piecewise
linear function due to the use of piecewise linear trial functions.
In the preceding example, the integrals in (10.26) could be evaluated exactly. Generally, how-
ever, they must be approximated, using techniques such as those presented in Chapter 7.
Exercise 10.4.1 What is the value of fi if the Trapezoidal Rule is used on each of the
integrals in (10.26)? What if Simpson’s Rule is used?
Exercise 10.4.2 Write a Matlab function [x,u]=FEMBVP(f,N) that solves the BVP
(10.22), (10.23) with N interior nodes. The input argument f must be a function handle.
The outputs x and u must be column vectors consisting of the nodes and values of the
approximate solution at the nodes, respectively. Use the Trapezoidal rule to approximate
the integrals (10.26). Test your function with f (x) = ex . What happens to the error as
N increases?
Figure 10.5: Exact (solid curve) and approximate (dashed curve) solutions of (10.22), (10.23) with
f (x) = x and N = 4
Exercise 10.4.3 Generalize your function FEMBVP from Exercise 10.4.2 so that it can be
used to solve the BVP −u″ + q(x)u = f(x) on (0, 1), with boundary conditions u(0) =
u(1) = 0, for a given function q(x) that must be passed as an input argument that is a
function handle. Hint: re-derive the weak form of the BVP to determine how the matrix
A must be modified. Use the Trapezoidal Rule to approximate any integrals involving q(x).
Exercise 10.4.4 Modify your function FEMBVP from Exercise 10.4.3 so that it can be
used to solve the BVP −u″ + q(x)u = f(x) on (0, 1), with boundary conditions u(0) = u0,
u(1) = u1 , where the scalars u0 and u1 must be passed as input arguments. Hint: Modify
(10.24) to include additional basis functions φ0 (x) and φN +1 (x), that are equal to 1 at
x = x0 and x = xN +1 , respectively, and equal to 0 at all other nodes. How must the load
vector f be modified to account for these nonhomogeneous boundary conditions?
Exercise 10.4.5 Modify your function FEMBVP from Exercise 10.4.4 so that it can be
used to solve the BVP −(p(x)u′)′ + q(x)u = f(x) on (0, 1), with boundary conditions
u(0) = u0 , u(1) = u1 , where the coefficient p(x) must be passed as an input argument
that is a function handle. Hint: re-derive the weak form of the BVP to determine how
the matrix A must be modified. Use the Trapezoidal Rule to approximate any integrals
involving p(x).
It can be shown that when using the finite element method with piecewise linear trial functions,
the error in the approximate solution is O(h2 ). Higher-order accuracy can be achieved by using
higher-degree piecewise polynomials as basis functions, such as cubic B-splines. Such a choice
also helps to ensure that the approximate solution is differentiable, unlike the solution computed
using piecewise linear basis functions, which are continuous but not differentiable at the points xi ,
i = 1, 2, . . . , N . With cubic B-splines, the error in the computed solution is O(h4 ) as opposed to
O(h2 ) in the piecewise linear case, due to the two additional degrees of differentiability. However,
the drawback is that the matrix arising from the use of higher-degree basis functions is no longer tridiagonal; the upper and lower bandwidths are each equal to the degree of the piecewise polynomial
that is used.
Appendices
Appendix A
Review of Calculus
Among the mathematical problems that can be solved using techniques from numerical analysis
are the basic problems of differential and integral calculus:
• computing the instantaneous rate of change of one quantity with respect to another, which
is a derivative, and
• computing the total change in a function over some portion of its domain, which is a definite
integral.
Calculus also plays an essential role in the development and analysis of techniques used in numerical
analysis, including those techniques that are applied to problems not arising directly from calculus.
Therefore, it is appropriate to review some basic concepts from calculus before we begin our study
of numerical analysis.
A.1.1 Limits
The basic problems of differential and integral calculus described in the previous paragraph can be
solved by computing a sequence of approximations to the desired quantity and then determining
what value, if any, the sequence of approximations approaches. This value is called a limit of the
sequence. As a sequence is a function, we begin by defining, precisely, the concept of the limit of a
function.
We write lim_{x→a} f(x) = L if for any open interval I1 containing L, there is some open interval I2 containing a such that f(x) ∈ I1 whenever x ∈ I2, and x ≠ a. We say that L is the limit of f(x) as x
approaches a.
We write
lim_{x→a⁻} f(x) = L
if, for any open interval I1 containing L, there is an open interval I2 of the form (c, a),
where c < a, such that f (x) ∈ I1 whenever x ∈ I2 . We say that L is the limit of f (x) as
x approaches a from the left, or the left-hand limit of f (x) as x approaches a.
Similarly, we write
lim_{x→a⁺} f(x) = L
if, for any open interval I1 containing L, there is an open interval I2 of the form (a, c),
where c > a, such that f (x) ∈ I1 whenever x ∈ I2 . We say that L is the limit of f (x) as
x approaches a from the right, or the right-hand limit of f (x) as x approaches
a.
We can make the definition of a limit a little more concrete by imposing sizes on the intervals
I1 and I2 , as long as the interval I1 can still be of arbitrary size. It can be shown that the following
definition is equivalent to the previous one.
Definition A.1.2 We write
lim_{x→a} f(x) = L
if, for any ε > 0, there exists a number δ > 0 such that |f(x) − L| < ε whenever 0 < |x − a| < δ.
Similar definitions can be given for the left-hand and right-hand limits.
Note that in either definition, the point x = a is specifically excluded from consideration when
requiring that f (x) be close to L whenever x is close to a. This is because the concept of a limit
is only intended to describe the behavior of f (x) near x = a, as opposed to its behavior at x = a.
Later in this appendix we discuss the case where the two distinct behaviors coincide.
To illustrate limits, we consider
L = lim_{x→0⁺} (sin x)/x.   (A.1)
We will visualize this limit in Matlab. First, we construct a vector of x-values that are near zero,
but excluding zero. This can readily be accomplished using the colon operator:
>> dx=0.01;
>> x=dx:dx:1;
Then, the vector x contains the values xi = i∆x, i = 1, 2, . . . , 100, where ∆x = 0.01.
Exercise A.1.1 Use the vector x to plot sin x/x on the interval (0, 1]. What appears to
be the value of the limit L in (A.1)?
The preceding exercise can be completed using a for loop, but it is much easier to use component-
wise operators. Since x is a vector, the expression sin(x)/x would cause an error. Instead, the ./
operator can be used to perform componentwise division of the vectors sin(x) and x. The . can
be used with several other arithmetic operators to perform componentwise operations on matrices
and vectors. For example, if A is a matrix, then A.^2 is a matrix in which each entry is the square
of the corresponding entry of A.
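For example, the following commands (with illustrative values) square each entry of a matrix and divide two vectors componentwise:
>> A=[ 1 2; 3 4 ];
>> A.^2
>> v=[ 2 4 6 ]; w=[ 1 2 3 ];
>> v./w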
Exercise A.1.2 Use one statement to plot sin x/x on the interval (0, 1].
The notions of limit and continuity generalize to vector-valued functions and functions of several
variables in a straightforward way.
We write
lim_{x→x0} f(x) = L
if, for any ε > 0, there exists a δ > 0 such that
|f(x) − L| < ε
whenever x ∈ D and
0 < ‖x − x0‖ < δ.
In this definition, we can use any appropriate vector norm k · k, as discussed in Section B.11.
F(x) = (f1(x), f2(x), . . . , fn(x))ᵀ = (f1(x1, x2, . . . , xn), f2(x1, x2, . . . , xn), . . . , fn(x1, x2, . . . , xn))ᵀ.
Then
lim_{x→x0} F(x) = L
if and only if
lim_{x→x0} fi(x) = Li,   i = 1, 2, . . . , n.
if for any open interval I containing L, there exists a number M such that fn ∈ I whenever n > M.
since for any ε > 0, no matter how small, we can find a positive integer n0 such that |fn| < ε for all n ≥ n0. In fact, for any given ε, we can choose n0 = ⌈1/ε⌉, where ⌈x⌉, known as the ceiling function, denotes the smallest integer that is greater than or equal to x. □
A.1.4 Continuity
In many cases, the limit of a function f(x) as x approaches a can be obtained by simply computing
f (a). Intuitively, this indicates that f has to have a graph that is one continuous curve, because
any “break” or “jump” in the graph at x = a is caused by f approaching one value as x approaches
a, only to actually assume a different value at a. This leads to the following precise definition of
what it means for a function to be continuous at a given point.
The preceding definition describes continuity at a single point. In describing where a function
is continuous, the concept of continuity over an interval is useful, so we define this concept as well.
3. [a, b] if f is continuous on (a, b), continuous from the right at a, and continuous
from the left at b.
In numerical analysis, it is often necessary to construct a continuous function, such as a polyno-
mial, based on data obtained by measurements and problem-dependent constraints. In this course,
we will learn some of the most basic techniques for constructing such continuous functions by a
process called interpolation.
Suppose that a function f is continuous on some closed interval [a, b]. The graph of such a function
is a continuous curve connecting the points (a, f (a)) with (b, f (b)). If one were to draw such a
graph, their pen would not leave the paper in the process, and therefore it would be impossible
to “avoid” any y-value between f (a) and f (b). This leads to the following statement about such
continuous functions.
Theorem A.1.10 (Intermediate Value Theorem) Let f be continuous on [a, b].
Then, on (a, b), f assumes every value between f (a) and f (b); that is, for any value
y between f (a) and f (b), f (c) = y for some c in (a, b).
The Intermediate Value Theorem has a very important application in the problem of finding
solutions of a general equation of the form f (x) = 0, where x is the solution we wish to compute
and f is a given continuous function. Often, methods for solving such an equation try to identify
an interval [a, b] where f (a) > 0 and f (b) < 0, or vice versa. In either case, the Intermediate Value
374 APPENDIX A. REVIEW OF CALCULUS
Theorem states that f must assume every value between f (a) and f (b), and since 0 is one such
value, it follows that the equation f (x) = 0 must have a solution somewhere in the interval (a, b).
We can find an approximation to this solution using a procedure called bisection, which re-
peatedly applies the Intermediate Value Theorem to smaller and smaller intervals that contain the
solution. We will study bisection, and other methods for solving the equation f (x) = 0, in this
course.
A.2 Derivatives
The basic problem of differential calculus is computing the instantaneous rate of change of one
quantity y with respect to another quantity x. For example, y may represent the position of an
object and x may represent time, in which case the instantaneous rate of change of y with respect
to x is interpreted as the velocity of the object.
When the two quantities x and y are related by an equation of the form y = f (x), it is certainly
convenient to describe the rate of change of y with respect to x in terms of the function f . Because
the instantaneous rate of change is so commonplace, it is practical to assign a concise name and
notation to it, which we do now.
• the slope of the tangent line of f at the point (a, f (a)), and
This can be seen from the fact that all three numbers are defined in the same way. 2
Exercise A.2.1 Let f(x) = x² − 3x + 2. Use the Matlab function polyder to compute the coefficients of f′(x). Then use the polyval function to obtain the equation of the
tangent line of f (x) at x = 2. Finally, plot the graph of f (x) and this tangent line, on
the same graph, restricted to the interval [0, 4].
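One way to carry out the first two steps is sketched below (the plotting commands are omitted):
>> p=[ 1 -3 2 ];          % coefficients of f(x) = x^2 - 3x + 2
>> dp=polyder(p);         % coefficients of f'(x)
>> m=polyval(dp,2);       % slope of the tangent line at x = 2
>> y2=polyval(p,2);       % f(2)
>> % tangent line: y = y2 + m*(x - 2), which here simplifies to y = x - 2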
Many functions can be differentiated using differentiation rules such as those learned in a cal-
culus course. However, many functions cannot be differentiated using these rules. For example,
we may need to compute the instantaneous rate of change of a quantity y = f (x) with respect to
another quantity x, where our only knowledge of the function f that relates x and y is a set of pairs
of x-values and y-values that may be obtained using measurements. In this course we will learn how
to approximate the derivative of such a function using this limited information. The most common
methods involve constructing a continuous function, such as a polynomial, based on the given data,
A.3. EXTREME VALUES 375
using interpolation. The polynomial can then be differentiated using differentiation rules. Since the
polynomial is an approximation to the function f (x), its derivative is an approximation to f 0 (x).
It is important to keep in mind, however, that the converse of the above statement, “if f
is continuous, then f is differentiable”, is not true. It is actually very easy to find examples of
functions that are continuous at a point, but fail to be differentiable at that point. As an extreme
example, it is known that there is a function that is continuous everywhere, but is differentiable
nowhere.
Example A.2.3 The functions f (x) = |x| and g(x) = x1/3 are examples of functions that are
continuous for all x, but are not differentiable at x = 0. The graph of the absolute value function
|x| has a sharp corner at x = 0, since the one-sided limits
lim_{h→0⁺} (f(0 + h) − f(0))/h = 1   and   lim_{h→0⁻} (f(0 + h) − f(0))/h = −1
do not agree, but in general these limits must agree in order for f(x) to have a derivative at x = 0.
The cube root function g(x) = x1/3 is not differentiable at x = 0 because the tangent line to the
graph at the point (0, 0) is vertical, so it has no finite slope. We can also see that the derivative
does not exist at this point by noting that the function g 0 (x) = (1/3)x−2/3 has a vertical asymptote
at x = 0.
Exercise A.2.2 Plot both of the functions from this example on the interval [−1, 1]. Use the colon operator to create a vector of x-values and use the dot for performing componentwise operations on a vector, where applicable. From the plot, identify the non-differentiability in these continuous functions.
Before computing the maximum or minimum value of a function, it is natural to ask whether it
is possible to determine in advance whether a function even has a maximum or minimum, so that
effort is not wasted in trying to solve a problem that has no solution. The following result is very
helpful in answering this question.
Theorem A.3.2 (Extreme Value Theorem) If f is continuous on [a, b], then f has
an absolute maximum and an absolute minimum on [a, b].
Now that we can easily determine whether a function has a maximum or minimum on a closed
interval [a, b], we can develop a method for actually finding them. It turns out that it is easier
to find points at which f attains a maximum or minimum value in a “local” sense, rather than
a “global” sense. In other words, we can best find the absolute maximum or minimum of f
by finding points at which f achieves a maximum or minimum with respect to “nearby” points,
and then determine which of these points is the absolute maximum or minimum. The following
definition makes this notion precise.
The following result describes the relationship between critical numbers and local extrema.
This theorem suggests that the maximum or minimum value of a function f (x) can be found by
solving the equation f 0 (x) = 0. As mentioned previously, we will be learning techniques for solving
such equations in this course. These techniques play an essential role in the solution of problems
in which one must compute the maximum or minimum value of a function, subject to constraints
on its variables. Such problems are called optimization problems.
The following exercise highlights the significance of critical numbers. It relies on some of the
Matlab functions for working with polynomials that were introduced in Section 1.2.
Exercise A.3.1 Consider the polynomial f (x) = x3 − 4x2 + 5x − 2. Plot the graph of
this function on the interval [0, 3]. Use the colon operator to create a vector of x-values
and use the dot for componentwise operations on vectors wherever needed. Then, use the
polyder and roots functions to compute the critical numbers of f (x). How do they relate
to the absolute maximum and minimum values of f (x) on [0, 3], or any local maxima or
minima on this interval?
A.4 Integrals
There are many cases in which some quantity is defined to be the product of two other quantities.
For example, a rectangle of width w and uniform height h has area A = wh. Unfortunately, in many applications, we cannot necessarily assume that certain quantities such as height are constant, and therefore formulas such as A = wh cannot be used directly. However, they can be used indirectly to solve more general problems by employing the notion known as integral calculus.
Suppose we wish to compute the area of a shape that is not a rectangle. To simplify the
discussion, we assume that the shape is bounded by the vertical lines x = a and x = b, the x-axis,
and the curve defined by some continuous function y = f (x), where f (x) ≥ 0 for a ≤ x ≤ b. Then,
we can approximate this shape by n rectangles that have width ∆x = (b − a)/n and height f (xi ),
where xi = a + i∆x, for i = 0, . . . , n. We obtain the approximation
A ≈ An = Σ_{i=1}^{n} f(xi)∆x.
Intuitively, we can conclude that as n → ∞, the approximate area An will converge to the exact
area of the given region. This can be seen by observing that as n increases, the n rectangles defined
above comprise a more accurate approximation of the region.
More generally, suppose that for each n = 1, 2, . . ., we define the quantity Rn by choosing points
a = x0 < x1 < · · · < xn = b, and computing the sum
Rn = Σ_{i=1}^{n} f(xi*)∆xi,   ∆xi = xi − xi−1,   xi−1 ≤ xi* ≤ xi.
The sum that defines Rn is known as a Riemann sum. Note that the interval [a, b] need not
be divided into subintervals of equal width, and that f (x) can be evaluated at arbitrary points
belonging to each subinterval.
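As an illustration, the following short Matlab sketch computes a Riemann sum with equal widths and right endpoints; the integrand f(x) = x² and the values of a, b and n are chosen only for illustration.
>> f=@(x)(x.^2);             % example integrand
>> a=0; b=1; n=100;
>> dx=(b-a)/n;
>> x=a+dx*(1:n);             % right endpoints of the n subintervals
>> Rn=sum(f(x))*dx           % approaches 1/3 as n increases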
If f (x) ≥ 0 on [a, b], then Rn converges to the area under the curve y = f (x) as n → ∞,
provided that the widths of all of the subintervals [xi−1 , xi ], for i = 1, . . . , n, approach zero. This
behavior is ensured if we require that
lim_{n→∞} max_{1≤i≤n} ∆xi = 0.
This condition is necessary because if it does not hold, then, as n → ∞, the region formed by the
n rectangles will not converge to the region whose area we wish to compute. If f assumes negative
values on [a, b], then, under the same conditions on the widths of the subintervals, Rn converges
to the net area between the graph of f and the x-axis, where area below the x-axis is counted
negatively.
The definite integral of f(x) over [a, b] is defined by
∫_a^b f(x) dx = lim_{n→∞} Rn,
where the sequence of Riemann sums {Rn}_{n=1}^{∞} is defined so that δ(n), the maximum width of the subintervals used in Rn, satisfies δ(n) → 0 as n → ∞, as in the previous discussion. The function f(x) is called the integrand, and the values a
and b are the lower and upper limits of integration, respectively. The process of computing
an integral is called integration.
In Chapter 7, we will study the problem of computing an approximation to the definite integral
of a given function f (x) over an interval [a, b]. We will learn a number of techniques for computing
such an approximation, and all of these techniques involve the computation of an appropriate
Riemann sum.
Exercise A.4.1 Let f(x) = e^(−x²). Write a Matlab function that takes as input a parameter n, and computes the Riemann sum Rn for f(x) using n rectangles. Use a for loop to compute the Riemann sum. First use xi* = xi−1, then use xi* = xi, and then use xi* = (xi−1 + xi)/2. For each case, compute Riemann sums for several values of n. What can you observe about the convergence of the Riemann sum in these three cases?
By applying Rolle's Theorem to a function f, then to its derivative f′, its second derivative f″, and
so on, we obtain the following more general result, which will be useful in analyzing the accuracy
of methods for approximating functions by polynomials.
A more fundamental consequence of Rolle’s Theorem is the Mean Value Theorem itself, which
we now state.
(f(b) − f(a))/(b − a) = f′(c)
for some number c in (a, b).
The quantity (f(b) − f(a))/(b − a) is the slope of the secant line passing through the points (a, f(a)) and (b, f(b)). The Mean Value Theorem therefore states that under the given assumptions, the slope of this secant line is equal to the slope of the tangent line of f at the point (c, f(c)), where c ∈ (a, b). □
The Mean Value Theorem has the following practical interpretation: the average rate of change of
y = f(x) with respect to x on an interval [a, b] is equal to the instantaneous rate of change of y with
respect to x at some point in (a, b).
Suppose that f (x) is a continuous function on an interval [a, b]. Then, by the Fundamental Theorem
of Calculus, f (x) has an antiderivative F (x) defined on [a, b] such that F 0 (x) = f (x). If we apply
the Mean Value Theorem to F (x), we obtain the following relationship between the integral of f
over [a, b] and the value of f at a point in (a, b).
Theorem A.5.4 (Mean Value Theorem for Integrals) If f is continuous on [a, b], then
∫_a^b f(x) dx = f(c)(b − a)
for some c in (a, b).
In other words, f assumes its average value over [a, b], defined by
fave = (1/(b − a)) ∫_a^b f(x) dx,
at some point in [a, b], just as the Mean Value Theorem states that the derivative of a function
assumes its average value over an interval at some point in the interval.
The Mean Value Theorem for Integrals is also a special case of the following more general result.
In the case where g(x) is a function that is easy to antidifferentiate and f (x) is not, this theorem
can be used to obtain an estimate of the integral of f (x)g(x) over an interval.
Example A.5.6 Let f (x) be continuous on the interval [a, b]. Then, for any x ∈ [a, b], by the
Weighted Mean Value Theorem for Integrals, we have
∫_a^x f(s)(s − a) ds = f(c) ∫_a^x (s − a) ds = f(c)[(s − a)²/2]_{s=a}^{s=x} = f(c)(x − a)²/2,
where a < c < x. It is important to note that we can apply the Weighted Mean Value Theorem
because the function g(x) = (x − a) does not change sign on [a, b]. 2
Pn(x) = f(x0) + f′(x0)(x − x0) + (1/2)f″(x0)(x − x0)² + · · · + (f^(n)(x0)/n!)(x − x0)^n
and
Rn(x) = ∫_{x0}^{x} (f^(n+1)(s)/n!)(x − s)^n ds = (f^(n+1)(ξ(x))/(n + 1)!)(x − x0)^(n+1),
where ξ(x) is between x0 and x.
The polynomial Pn (x) is the nth Taylor polynomial of f with center x0 , and the expression Rn (x)
is called the Taylor remainder of Pn (x). When the center x0 is zero, the nth Taylor polynomial is
also known as the nth Maclaurin polynomial.
Exercise A.6.1 Plot the graph of f (x) = cos x on the interval [0, π], and then use the
hold command to include the graphs of P0 (x), P2 (x), and P4 (x), the Maclaurin polyno-
mials of degree 0, 2 and 4, respectively, on the same plot but with different colors and line
styles. To what extent do these Taylor polynomials agree with f (x)?
The final form of the remainder is obtained by applying the Mean Value Theorem for Integrals
to the integral form. As Pn (x) can be used to approximate f (x), the remainder Rn (x) is also
referred to as the truncation error of Pn (x). The accuracy of the approximation on an interval can
be analyzed by using techniques for finding the extreme values of functions to bound the (n + 1)-st
derivative on the interval.
Because approximation of functions by polynomials is employed in the development and analysis
of many techniques in numerical analysis, the usefulness of Taylor’s Theorem cannot be overstated.
In fact, it can be said that Taylor’s Theorem is the Fundamental Theorem of Numerical Analysis,
just as the theorem describing inverse relationship between derivatives and integrals is called the
Fundamental Theorem of Calculus.
The following examples illustrate how the nth-degree Taylor polynomial Pn (x) and the remain-
der Rn (x) can be computed for a given function f (x).
where
P1(x) = f(x0) + f′(x0)(x − x0).
This polynomial is a linear function that describes the tangent line to the graph of f at the point
(x0 , f (x0 )).
If we set n = 0 in the theorem, then we obtain
where
P0 (x) = f (x0 )
and
R0(x) = f′(ξ(x))(x − x0),
where ξ(x) lies between x0 and x. If we use the integral form of the remainder,
Rn(x) = ∫_{x0}^{x} (f^(n+1)(s)/n!)(x − s)^n ds,
then we have
f(x) = f(x0) + ∫_{x0}^{x} f′(s) ds,
which is equivalent to the Total Change Theorem and part of the Fundamental Theorem of Calculus.
Using the Mean Value Theorem for integrals, we can see how the first form of the remainder can
be obtained from the integral form. 2
where
P3(x) = x − x³/3! = x − x³/6,
and
R3(x) = (1/4!)x⁴ sin ξ(x) = (1/24)x⁴ sin ξ(x),
where ξ(x) is between 0 and x. The polynomial P3 (x) is the 3rd Maclaurin polynomial of sin x, or
the 3rd Taylor polynomial with center x0 = 0.
If x ∈ [−1, 1], then
|R3(x)| = |(1/24)x⁴ sin ξ(x)| = (1/24)|x⁴||sin ξ(x)| ≤ 1/24,
since |sin x| ≤ 1 for all x. This bound on |R3(x)| serves as an upper bound for the error in the approximation of sin x by P3(x) for x ∈ [−1, 1]. □
Exercise A.6.2 On the interval [−1, 1], plot f(x) = sin x and its Taylor polynomial of degree 3 centered at x0 = 0, P3(x). How do they compare? In a separate figure window, plot the error R3(x) = f(x) − P3(x) on [−1, 1], and also plot the lines y = ±1/24, corresponding to the upper bound on |R3(x)| from the preceding example. Confirm that the error actually does satisfy this bound.
where
P2(x) = 1 + x + x²/2,
and
R2(x) = (x³/6)e^(ξ(x)),
where ξ(x) is between 0 and x. The polynomial P2 (x) is the 2nd Maclaurin polynomial of ex , or
the 2nd Taylor polynomial with center x0 = 0.
If x > 0, then R2 (x) can become quite large, whereas its magnitude is much smaller if x < 0.
Therefore, one method of computing ex using a Maclaurin polynomial is to use the nth Maclaurin
polynomial Pn (x) of ex when x < 0, where n is chosen sufficiently large so that Rn (x) is small for
the given value of x. If x > 0, then we instead compute e−x using the nth Maclaurin polynomial
for e−x , which is given by
Pn(x) = 1 − x + x²/2 − x³/6 + · · · + (−1)ⁿxⁿ/n!,
and then obtain an approximation to eˣ by taking the reciprocal of our computed value of e⁻ˣ. □
where
P1(x) = x0² + 2x0(x − x0) = 2x0x − x0²,
and
R1(x) = (x − x0)².
Note that the remainder does not include a “mystery point” ξ(x) since the 2nd derivative of x2 is
only a constant. The linear function P1 (x) describes the tangent line to the graph of f (x) at the
point (x0 , f (x0 )). If x0 = 1, then we have
P1 (x) = 2x − 1,
and
R1 (x) = (x − 1)2 .
We can see that near x = 1, P1 (x) is a reasonable approximation to x2 , since the error in this
approximation, given by R1 (x), would be small in this case. 2
Taylor’s theorem can be generalized to functions of several variables, using partial derivatives.
Here, we consider the case of two independent variables.
Appendix B
Review of Linear Algebra
B.1 Matrices
Writing a system of equations can be quite tedious. Therefore, we instead represent a system of
linear equations using a matrix, which is an array of elements, or entries. We say that a matrix A
is m × n if it has m rows and n columns, and we denote the element in row i and column j by aij .
We also denote the matrix A by [aij ].
With this notation, a general system of m equations with n unknowns can be represented using
a matrix A that contains the coefficients of the equations, a vector x that contains the unknowns,
and a vector b that contains the quantities on the right-hand sides of the equations. Specifically,
        [ a11  a12  · · ·  a1n ]        [ x1 ]        [ b1 ]
    A = [ a21  a22  · · ·  a2n ] ,  x = [ x2 ] ,  b = [ b2 ] .
        [  ⋮    ⋮            ⋮  ]        [  ⋮ ]        [  ⋮ ]
        [ am1  am2  · · ·  amn ]        [ xn ]        [ bm ]
3x1 + 2x2 = 4,
−x1 + 5x2 = −3
of vectors called vector spaces. A vector space over a field (such as the field of real or complex
numbers) is a set of vectors, together with two operations: addition of vectors, and multiplication
of a vector by a scalar from the field.
Specifically, if u and v are vectors belonging to a vector space V over a field F , then the sum
of u and v, denoted by u + v, is a vector in V , and the scalar product of u with a scalar α in F ,
denoted by αu, is also a vector in V . These operations have the following properties:
• Commutativity: For any vectors u and v in V ,
u+v =v+u
• Identity element for vector addition: There is a vector 0, known as the zero vector, such that
for any vector u in V ,
u+0=0+u=u
• Additive inverse: For any vector u in V , there is a unique vector −u in V such that
u + (−u) = −u + u = 0
• Distributivity over vector addition: For any vectors u and v in V , and scalar α in F ,
α(u + v) = αu + αv
• Distributivity over scalar multiplication: For any vector u in V , and scalars α and β in F ,
(α + β)u = αu + βu
• Associativity of scalar multiplication: For any vector u in V and any scalars α and β in F ,
α(βu) = (αβ)u
Example Let V be the vector space R³. The vector addition operation on V consists of adding corresponding components of vectors, as in
(3, 0, 1)ᵀ + (−2, 4, 5)ᵀ = (1, 4, 6)ᵀ.
The scalar multiplication operation consists of scaling each component of a vector by a scalar:
(1/2)(3, 0, 1)ᵀ = (3/2, 0, 1/2)ᵀ. □
B.3 Subspaces
Before we can explain how matrices can be used to easily describe linear transformations, we must
introduce some important concepts related to vector spaces.
A subspace of a vector space V is a subset of V that is, itself, a vector space. In particular, a
subset S of V is also a subspace if it is closed under the operations of vector addition and scalar
multiplication. That is, if u and v are vectors in S, then the vectors u + v and αu, where α is any
scalar, must also be in S. In particular, S cannot be a subspace unless it includes the zero vector.
Example The set S of all vectors in R3 of the form
x = (x1, x2, 0)ᵀ,
where x1 , x2 ∈ R, is a subspace of R3 , as the sum of any two vectors in S, or a scalar multiple of
any vector in S, must have a third component of zero, and therefore is also in S.
On the other hand, the set S̃ consisting of all vectors in R3 that have a third component of 1 is
not a subspace of R3 , as the sum of vectors in S̃ will not have a third component of 1, and therefore
will not be in S̃. That is, S̃ is not closed under addition. 2
We then define the span of {v1 , v2 , . . . , vk }, denoted by span {v1 , v2 , . . . , vk }, to be the set of all
linear combinations of v1 , v2 , . . ., vk . From the definition of a linear combination, it follows that
this set is a subspace of V .
Example Let
v1 = (1, 0, 1)ᵀ,   v2 = (3, 4, 0)ᵀ,   v3 = (−1, 2, 1)ᵀ.
Then the vector
v = (6, 10, 2)ᵀ
belongs to span {v1, v2, v3}, since
v = v1 + 2v2 + v3. □
When a subspace is defined as the span of a set of vectors, it is helpful to know whether the set
includes any vectors that are, in some sense, redundant, for if this is the case, the description of
the subspace can be simplified. To that end, we say that a set of vectors {v1 , v2 , . . . , vk } is linearly
independent if the equation
c1 v1 + c2 v2 + · · · + ck vk = 0
holds if and only if c1 = c2 = · · · = ck = 0. Otherwise, we say that the set is linearly dependent.
If the set is linearly independent, then any vector v in the span of the set is a unique linear
combination of members of the set; that is, there is only one way to choose the coefficients of a
linear combination that is used to obtain v.
Example The subspace S of R³ defined by
S = { (x1, x2, 0)ᵀ : x1, x2 ∈ R }
can be described as
S = span { (1, 0, 0)ᵀ, (0, 1, 0)ᵀ }
or
S = span { (1, 1, 0)ᵀ, (1, −1, 0)ᵀ },
but
S ≠ span { (1, 1, 0)ᵀ, (−1, −1, 0)ᵀ },
as these vectors only span the subspace of vectors whose first two components are equal, and whose
third component is zero, which does not account for every vector in S. It should be noted that the
two vectors in the third set are linearly dependent, while the pairs of vectors in the previous two
sets are linearly independent. 2
Example The vectors
v1 = (1, 0, 1)ᵀ,   v2 = (1, 1, 0)ᵀ
are linearly independent. It follows that the only way in which the vector
v = (3, 1, 2)ᵀ
can be expressed as a linear combination of v1 and v2 is v = 2v1 + v2.
then, because v1 and v2 are linearly dependent, any linear combination of the form c1 v1 + c2 v2 ,
such that c1 + 2c2 = 3, will equal v. 2
Given a vector space V , if there exists a set of vectors {v1 , v2 , . . . , vk } such that V is the span
of {v1 , v2 , . . . , vk }, and {v1 , v2 , . . . , vk } is linearly independent, then we say that {v1 , v2 , . . . , vk }
is a basis of V . Any basis of V must have the same number of elements, k. We call this number
the dimension of V , which is denoted by dim(V ).
Example The standard basis of R³ is
e1 = (1, 0, 0)ᵀ,   e2 = (0, 1, 0)ᵀ,   e3 = (0, 0, 1)ᵀ.
The set
v1 = (1, 1, 0)ᵀ,   v2 = (1, −1, 0)ᵀ,   v3 = (0, 0, 1)ᵀ
is also a basis for R3 , as it consists of three linearly independent vectors, and the dimension of R3
is three. 2
where x and y are vectors in V and α is a scalar from F . If V and W are the same vector space,
then we say that fA is a linear operator on V .
Suppose that the set of vectors {v1 , v2 , . . . , vn } is a basis for V , and the set {w1 , w2 , . . . , wm }
is a basis for W . Then, aij is the scalar by which wi is multiplied when applying the function fA
to the vector vj . That is,
fA(vj) = a1j w1 + a2j w2 + · · · + amj wm = Σ_{i=1}^{m} aij wi.
In other words, the jth column of A describes the image under fA of the vector vj , in terms of the
coefficients of fA (vj ) in the basis {w1 , w2 , . . . , wm }.
If V and W are spaces of real or complex vectors, then, by convention, the bases {vj}_{j=1}^{n} and {wi}_{i=1}^{m} are each chosen to be the standard bases for Rⁿ and Rᵐ, respectively. The jth vector in
the standard basis is a vector whose components are all zero, except for the jth component, which
is equal to one. These vectors are called the standard basis vectors of an n-dimensional space of
real or complex vectors, and are denoted by ej . From this point on, we will generally assume that
V is Rn , and that the field is R, for simplicity.
Example The standard basis for R^3 consists of the vectors
\[
e_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad
e_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad
e_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}. \qquad \Box
\]
Given an m × n matrix A and a vector x ∈ R^n, the matrix-vector product y = Ax is the vector in R^m with components
\[
y_i = \sum_{j=1}^{n} a_{ij} x_j, \quad i = 1, 2, \ldots, m.
\]
From this definition, we see that the jth column of A is equal to the matrix-vector product Ae_j.
Example Let
\[
A = \begin{bmatrix} 3 & 0 & -1 \\ 1 & -4 & 2 \\ 5 & 1 & -3 \end{bmatrix}, \quad
x = \begin{bmatrix} 10 \\ 11 \\ 12 \end{bmatrix}.
\]
Then
\[
Ax = 10 \begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix}
+ 11 \begin{bmatrix} 0 \\ -4 \\ 1 \end{bmatrix}
+ 12 \begin{bmatrix} -1 \\ 2 \\ -3 \end{bmatrix}
= \begin{bmatrix} 18 \\ -10 \\ 25 \end{bmatrix}.
\]
We see that Ax is a linear combination of the columns of A, with the coefficients of the linear combination obtained from the components of x. □
Example Let
\[
A = \begin{bmatrix} 3 & 1 \\ 1 & 0 \\ 2 & 4 \end{bmatrix}, \quad
x = \begin{bmatrix} 1 \\ -1 \end{bmatrix}.
\]
Then the matrix-vector product of A and x is
\[
y = Ax = \begin{bmatrix} 3(1) + 1(-1) \\ 1(1) + 0(-1) \\ 2(1) + 4(-1) \end{bmatrix}
= \begin{bmatrix} 2 \\ 1 \\ -2 \end{bmatrix}. \qquad \Box
\]
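In Matlab, the matrix-vector product is computed with the * operator. As a quick check of the column-combination viewpoint, the following sketch reproduces the first example above:

A = [3 0 -1; 1 -4 2; 5 1 -3];
x = [10; 11; 12];
y1 = A*x;                                      % built-in matrix-vector product
y2 = x(1)*A(:,1) + x(2)*A(:,2) + x(3)*A(:,3);  % linear combination of the columns
% Both y1 and y2 equal [18; -10; 25]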
The null space of A, denoted by null(A), is the set of all vectors x such that Ax = 0, and the rank of A, denoted by rank(A), is the dimension of the span of the columns of A. For an m × n matrix A, these quantities are related by
\[
\dim(\operatorname{null}(A)) + \operatorname{rank}(A) = n.
\]
A system of m linear equations in n unknowns can be written compactly as
\[
Ax = b,
\]
where Ax is a matrix-vector product of the m × n coefficient matrix A and the vector of unknowns x, and b is the vector of right-hand side values.
Of course, if m = n = 1, the system of equations Ax = b reduces to the scalar linear equation ax = b, which has the solution x = a^{-1}b, provided that a ≠ 0. As a^{-1} is the unique number such that a^{-1}a = aa^{-1} = 1, it is desirable to generalize the concepts of multiplication and identity
element to square matrices, for which m = n.
The matrix-vector product can be used to define the composition of linear functions represented
by matrices. Let A be an m × n matrix, and let B be an n × p matrix. Then, if x is a vector of
length p, and y = Bx, then we have
Ay = A(Bx) = (AB)x = Cx,
where C is an m × p matrix with entries
\[
C_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}.
\]
We define the matrix product of A and B to be the matrix C = AB with entries defined in
this manner. It should be noted that the product BA is not defined, unless m = p. Even if this
is the case, in general, AB ≠ BA. That is, matrix multiplication is not commutative. However,
matrix multiplication is associative, meaning that if A is m × n, B is n × p, and C is p × k, then
A(BC) = (AB)C.
Example Consider the 2 × 2 matrices
\[
A = \begin{bmatrix} 1 & -2 \\ -3 & 4 \end{bmatrix}, \quad
B = \begin{bmatrix} -5 & 6 \\ 7 & -8 \end{bmatrix}.
\]
Then
\[
AB = \begin{bmatrix} 1 & -2 \\ -3 & 4 \end{bmatrix} \begin{bmatrix} -5 & 6 \\ 7 & -8 \end{bmatrix}
= \begin{bmatrix} 1(-5) - 2(7) & 1(6) - 2(-8) \\ -3(-5) + 4(7) & -3(6) + 4(-8) \end{bmatrix}
= \begin{bmatrix} -19 & 22 \\ 43 & -50 \end{bmatrix},
\]
whereas
\[
BA = \begin{bmatrix} -5 & 6 \\ 7 & -8 \end{bmatrix} \begin{bmatrix} 1 & -2 \\ -3 & 4 \end{bmatrix}
= \begin{bmatrix} -5(1) + 6(-3) & -5(-2) + 6(4) \\ 7(1) - 8(-3) & 7(-2) - 8(4) \end{bmatrix}
= \begin{bmatrix} -23 & 34 \\ 31 & -46 \end{bmatrix}.
\]
We see that AB ≠ BA. □
Example If
\[
A = \begin{bmatrix} 3 & 1 \\ 1 & 0 \\ 2 & 4 \end{bmatrix}, \quad
B = \begin{bmatrix} 1 & 5 \\ 4 & -1 \end{bmatrix},
\]
then the matrix-matrix product of A and B is
\[
C = AB = \begin{bmatrix} 3(1) + 1(4) & 3(5) + 1(-1) \\ 1(1) + 0(4) & 1(5) + 0(-1) \\ 2(1) + 4(4) & 2(5) + 4(-1) \end{bmatrix}
= \begin{bmatrix} 7 & 14 \\ 1 & 5 \\ 18 & 6 \end{bmatrix}.
\]
It does not make sense to compute BA, because the dimensions are incompatible. □
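These computations are easily reproduced in Matlab, which also reports an error when the dimensions of a product are incompatible. A brief sketch using the matrices from the two preceding examples:

A2 = [1 -2; -3 4];  B2 = [-5 6; 7 -8];
A2*B2        % returns [-19 22; 43 -50]
B2*A2        % returns [-23 34; 31 -46], so the products differ
A = [3 1; 1 0; 2 4];  B = [1 5; 4 -1];
C = A*B      % returns [7 14; 1 5; 18 6]
% B*A would raise a dimension-mismatch error, since B is 2-by-2 and A is 3-by-2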
B.7 Other Fundamental Matrix Operations

The transpose of an m × n matrix A is the n × m matrix A^T whose entries are given by (A^T)_{ij} = a_{ji}; that is, the rows of A^T are the columns of A.

Example If
\[
A = \begin{bmatrix} 3 & 1 \\ 1 & 0 \\ 2 & 4 \end{bmatrix},
\]
then
\[
A^T = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 0 & 4 \end{bmatrix}. \qquad \Box
\]
Example Let A be the matrix from a previous example,
\[
A = \begin{bmatrix} 3 & 0 & -1 \\ 1 & -4 & 2 \\ 5 & 1 & -3 \end{bmatrix}.
\]
Then
\[
A^T = \begin{bmatrix} 3 & 1 & 5 \\ 0 & -4 & 1 \\ -1 & 2 & -3 \end{bmatrix}.
\]
It follows that
\[
A + A^T = \begin{bmatrix} 3+3 & 0+1 & -1+5 \\ 1+0 & -4-4 & 2+1 \\ 5-1 & 1+2 & -3-3 \end{bmatrix}
= \begin{bmatrix} 6 & 1 & 4 \\ 1 & -8 & 3 \\ 4 & 3 & -6 \end{bmatrix}.
\]
This matrix is symmetric. This can also be seen by the properties of the transpose, since
\[
(A + A^T)^T = A^T + (A^T)^T = A^T + A = A + A^T. \qquad \Box
\]
Example The matrix
\[
A = \begin{bmatrix} 3 & 1 & 5 \\ 1 & 2 & 0 \\ 5 & 0 & 4 \end{bmatrix}
\]
is symmetric, while
\[
B = \begin{bmatrix} 0 & 1 & 2 \\ -1 & 0 & -3 \\ -2 & 3 & 0 \end{bmatrix}
\]
is skew-symmetric, meaning that B^T = −B. □
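In Matlab, the transpose of a real matrix A is written A', and symmetry can be checked directly by comparison; the following sketch uses the matrices from the example above:

% Matrices from the example above
A = [3 1 5; 1 2 0; 5 0 4];
B = [0 1 2; -1 0 -3; -2 3 0];
isequal(A, A')     % true: A is symmetric
isequal(B', -B)    % true: B is skew-symmetric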
Given two vectors x, y ∈ R^n, the inner product (or dot product) of x and y is the scalar
\[
x^T y = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n = \sum_{i=1}^{n} x_i y_i,
\]
where
\[
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
\]
Note that x and y must both be defined to be column vectors, and they must have the same length. If x^T y = 0, then we say that x and y are orthogonal.

Let x ∈ R^m and y ∈ R^n, where m and n are not necessarily equal. The term "inner product" suggests the existence of another operation called the outer product, which is defined by
\[
xy^T = \begin{bmatrix}
x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\
x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\
\vdots & & & \vdots \\
x_m y_1 & x_m y_2 & \cdots & x_m y_n
\end{bmatrix}.
\]
Note that whereas the inner product is a scalar, the outer product is an m × n matrix.
Example Let
\[
x = \begin{bmatrix} 1 \\ 0 \\ 2 \end{bmatrix}, \quad
y = \begin{bmatrix} 4 \\ -1 \\ 3 \end{bmatrix}.
\]
Then the inner (dot) product of x and y is
\[
x^T y = 1(4) + 0(-1) + 2(3) = 10,
\]
while the outer product of x and y is
\[
xy^T = \begin{bmatrix} 1(4) & 1(-1) & 1(3) \\ 0(4) & 0(-1) & 0(3) \\ 2(4) & 2(-1) & 2(3) \end{bmatrix}
= \begin{bmatrix} 4 & -1 & 3 \\ 0 & 0 & 0 \\ 8 & -2 & 6 \end{bmatrix}. \qquad \Box
\]
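In Matlab, with x and y stored as column vectors, the inner and outer products are obtained from the same * operator, differing only in which factor is transposed. A short sketch with the vectors above:

x = [1; 0; 2];  y = [4; -1; 3];
ip = x'*y      % inner product: returns 10
op = x*y'      % outer product: returns [4 -1 3; 0 0 0; 8 -2 6]
% dot(x, y) gives the same result as x'*y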
Example Let
\[
A = \begin{bmatrix} 1 & -3 & 7 \\ 2 & 5 & -8 \\ 4 & -6 & -9 \end{bmatrix}.
\]
To change a_{11} from 1 to 10, we can perform the outer product update B = A + (10 − 1)e_1 e_1^T. Similarly, the outer product update C = B + 5e_2 e_1^T adds 5 to b_{21}, resulting in the matrix
\[
C = \begin{bmatrix} 10 & -3 & 7 \\ 7 & 5 & -8 \\ 4 & -6 & -9 \end{bmatrix}.
\]
Note that
\[
e_2 e_1^T = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}
= \begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}. \qquad \Box
\]
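Such rank-one updates are easy to express in Matlab using the standard basis vectors; the following sketch carries out the two updates from the example above:

A = [1 -3 7; 2 5 -8; 4 -6 -9];
e1 = [1; 0; 0];  e2 = [0; 1; 0];
B = A + (10 - 1)*(e1*e1');   % changes the (1,1) entry from 1 to 10
C = B + 5*(e2*e1')           % adds 5 to the (2,1) entry; C(2,1) is now 7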
It is useful to describe matrices as collections of row or column vectors. Specifically, a row partition of an m × n matrix A is a description of A as a "stack" of row vectors r_1^T, r_2^T, ..., r_m^T. That is,
\[
A = \begin{bmatrix} r_1^T \\ r_2^T \\ \vdots \\ r_m^T \end{bmatrix}.
\]
Example Let
\[
A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}.
\]
A row partitioning of A is
\[
A = \begin{bmatrix} r_1^T \\ r_2^T \end{bmatrix}, \quad
r_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad
r_2 = \begin{bmatrix} 3 \\ 4 \end{bmatrix}. \qquad \Box
\]
The matrix-matrix product C = AB can be viewed in several equivalent ways:
• Dot product: each entry c_{ij} is the dot product of the ith row of A and the jth column of B.
• Outer product: if we partition A into columns and B into rows,
\[
A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}, \quad
B = \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_n^T \end{bmatrix},
\]
we can write
\[
C = a_1 b_1^T + a_2 b_2^T + \cdots + a_n b_n^T = \sum_{i=1}^{n} a_i b_i^T.
\]
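The outer-product form can be verified numerically in Matlab; the sketch below, using the small matrices from the product example earlier in this appendix, accumulates the sum of outer products and compares it with the built-in product:

A = [3 1; 1 0; 2 4];          % 3-by-2
B = [1 5; 4 -1];              % 2-by-2
C = zeros(size(A,1), size(B,2));
for i = 1:size(A,2)
    C = C + A(:,i)*B(i,:);    % add the ith outer product a_i * b_i^T
end
norm(C - A*B)                 % returns 0 (up to roundoff)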
We now seek an n × n matrix I such that AI = A for any matrix A with n columns. This can only be the case for any matrix A if I_{jj} = 1 for j = 1, 2, ..., n, and I_{ij} = 0 when i ≠ j. We call this matrix the identity matrix
\[
I = \begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 \\
0 & 1 & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & 1 & 0 \\
0 & \cdots & \cdots & 0 & 1
\end{bmatrix},
\]
which satisfies AI = IA = A for any n × n matrix A. If A is an n × n matrix, we say that A is invertible, or nonsingular, if there exists an n × n matrix A^{-1}, called the inverse of A, such that
\[
A^{-1}A = AA^{-1} = I.
\]
In that case, the unique solution of the system Ax = b is x = A^{-1}b, which generalizes the solution x = a^{-1}b of a single linear equation in one unknown.
However, just as we can use the inverse to describe the solution to a system of linear equations, we can use systems of linear equations to characterize the inverse. Because A^{-1} satisfies AA^{-1} = I, it follows from multiplication of both sides of this equation by the jth standard basis vector e_j that
\[
A b_j = e_j, \quad j = 1, 2, \ldots, n,
\]
where b_j = A^{-1}e_j is the jth column of B = A^{-1}. That is, we can compute A^{-1} by solving n systems of linear equations of the form Ab_j = e_j, using a method such as Gaussian elimination and back substitution. If Gaussian elimination fails due to the inability to obtain a nonzero pivot element for each column, then A^{-1} does not exist, and we conclude that A is singular.
The inverse of a nonsingular matrix A has the following properties:
• A^{-1} is unique.
• A^{-1} is nonsingular, and (A^{-1})^{-1} = A.
• If B is also a nonsingular n × n matrix, then (AB)^{-1} = B^{-1}A^{-1}.
• (A^{-1})^T = (A^T)^{-1}. It is common practice to denote the transpose of A^{-1} by A^{-T}.
Because matrix multiplication is associative, the identity matrix serves as an identity element, and each nonsingular n × n matrix has a unique inverse with respect to matrix multiplication that is also an n × n nonsingular matrix, the set of all nonsingular n × n matrices forms a group, which is denoted by GL(n), the general linear group.
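The characterization of A^{-1} through the systems Ab_j = e_j translates directly into Matlab. The sketch below, using the nonsingular matrix from an earlier example, builds the inverse one column at a time and compares it with the built-in inv function (in practice, of course, one solves Ax = b directly rather than forming A^{-1}):

A = [3 0 -1; 1 -4 2; 5 1 -3];
n = size(A,1);
B = zeros(n);
I = eye(n);
for j = 1:n
    B(:,j) = A \ I(:,j);   % solve A*b_j = e_j for the jth column of the inverse
end
norm(B - inv(A))           % roundoff-level: B agrees with inv(A)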
B.10 Determinants
We previously learned that a 2 × 2 matrix A is invertible if and only if the quantity a11 a22 − a12 a21
is nonzero. This generalizes the fact that a 1 × 1 matrix a is invertible if and only if its single
entry, a11 = a, is nonzero. We now discuss the generalization of this determination of invertibility
to general square matrices.
The determinant of an n × n matrix A, denoted by det(A) or |A|, is defined as follows:
• If n = 1, then det(A) = a_{11}.
• If n > 1, then, for any fixed row index i,
\[
\det(A) = \sum_{j=1}^{n} a_{ij} (-1)^{i+j} \det(M_{ij}), \quad 1 \le i \le n,
\]
where M_{ij}, called a minor of A, is the matrix obtained by removing row i and column j of A.
• Alternatively, the expansion can be carried out along any column:
\[
\det(A) = \sum_{i=1}^{n} a_{ij} (-1)^{i+j} \det(M_{ij}), \quad 1 \le j \le n.
\]
The determinant has a number of useful properties; for example, det(A^T) = det(A).
The best-known application of the determinant is the fact that it indicates whether a matrix A is
nonsingular, or invertible. The following statements are all equivalent.
• det(A) ≠ 0.
• A is nonsingular.
• A−1 exists.
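In Matlab, the determinant is computed with det (internally via an LU factorization rather than cofactor expansion, which would be far too expensive for large n). A quick check with the matrix used earlier in this appendix:

A = [3 0 -1; 1 -4 2; 5 1 -3];
d = det(A)     % returns 9, so A is nonsingular and A^(-1) exists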
B.11 Vector and Matrix Norms

For x ∈ R^n and p ≥ 1, the ℓ_p-norm of x is defined by
\[
\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.
\]
It can be shown that for any p ≥ 1, ‖·‖_p defines a vector norm. The following p-norms are of particular interest:
• p = 1: The ℓ_1-norm
\[
\|x\|_1 = |x_1| + |x_2| + \cdots + |x_n|
\]
• p = 2: The ℓ_2-norm, or Euclidean norm,
\[
\|x\|_2 = \left( |x_1|^2 + |x_2|^2 + \cdots + |x_n|^2 \right)^{1/2} = \sqrt{x^T x}
\]
• p = ∞: The ℓ_∞-norm
\[
\|x\|_\infty = \max_{1 \le i \le n} |x_i|
\]
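Matlab's norm function computes all of these vector norms; a brief illustration with an arbitrary vector chosen for demonstration:

x = [3; -4; 0; 1];
norm(x, 1)      % l1-norm:  |3| + |-4| + |0| + |1| = 8
norm(x, 2)      % l2-norm:  sqrt(9 + 16 + 0 + 1) = sqrt(26)
norm(x)         % same as norm(x, 2)
norm(x, inf)    % l-infinity norm: largest absolute entry, 4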
It can be shown that the ℓ_2-norm satisfies the Cauchy-Bunyakovsky-Schwarz inequality, also known as simply the Cauchy-Schwarz inequality,
\[
|x^T y| \le \|x\|_2 \|y\|_2
\]
for any vectors x, y ∈ R^n. This inequality is useful for showing that the ℓ_2-norm satisfies the triangle inequality. It is a special case of the Hölder inequality
\[
|x^T y| \le \|x\|_p \|y\|_q, \quad \frac{1}{p} + \frac{1}{q} = 1.
\]
We will prove the Cauchy-Schwarz inequality for vectors in R^n; the proof can be generalized to a complex vector space. For x, y ∈ R^n and c ∈ R, with y ≠ 0, we have
\[
0 \le (x - cy)^T (x - cy) = x^T x - x^T(cy) - (cy)^T x + (cy)^T (cy) = \|x\|_2^2 - 2c\, x^T y + c^2 \|y\|_2^2.
\]
We now try to find the value of c that minimizes this expression. Differentiating with respect to c and equating to zero yields the equation
\[
-2 x^T y + 2c \|y\|_2^2 = 0,
\]
whose solution is c = x^T y / ‖y‖_2^2. Substituting this value of c into the inequality above gives
\[
0 \le \|x\|_2^2 - \frac{(x^T y)^2}{\|y\|_2^2},
\]
from which the Cauchy-Schwarz inequality follows.

We say that a sequence of vectors {x^{(k)}} in R^n converges to a vector x with respect to a given norm ‖·‖ if
\[
\lim_{k \to \infty} \|x^{(k)} - x\| = 0.
\]
That is, the distance between x^{(k)} and x must approach zero. It can be shown that, regardless of the choice of norm, x^{(k)} → x if and only if
\[
x^{(k)}_i \to x_i, \quad i = 1, 2, \ldots, n.
\]
That is, each component of x^{(k)} must converge to the corresponding component of x. This is due to the fact that for any vector norm ‖·‖, ‖x‖ = 0 if and only if x is the zero vector.
Because we have defined convergence with respect to an arbitrary norm, it is important to know whether a sequence can converge to a limit with respect to one norm, while converging to a different limit in another norm, or perhaps not converging at all. Fortunately, for p-norms, this is never the case. We say that two vector norms ‖·‖_α and ‖·‖_β are equivalent if there exist constants C_1 and C_2, that are independent of x, such that for any vector x ∈ R^n,
\[
C_1 \|x\|_\alpha \le \|x\|_\beta \le C_2 \|x\|_\alpha.
\]
It follows that if two norms are equivalent, then a sequence of vectors that converges to a limit with respect to one norm will converge to the same limit in the other. It can be shown that all ℓ_p-norms are equivalent. In particular, if x ∈ R^n, then
\[
\|x\|_2 \le \|x\|_1 \le \sqrt{n}\, \|x\|_2,
\]
\[
\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\, \|x\|_\infty.
\]
Another property that is often, but not always, included in the definition of a matrix norm is the submultiplicative property: if A is m × n and B is n × p, we require that
\[
\|AB\| \le \|A\| \|B\|.
\]
Given a vector norm ‖·‖, it can be shown that
\[
\|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|} = \max_{\|x\|=1} \|Ax\|
\]
is a matrix norm. It is called the natural, or induced, matrix norm. Furthermore, if the vector norm is an ℓ_p-norm, then the induced matrix norm satisfies the submultiplicative property.
The following matrix norms are of particular interest:
• The ℓ_1-norm:
\[
\|A\|_1 = \max_{\|x\|_1 = 1} \|Ax\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}|.
\]
That is, the ℓ_1-norm of a matrix is its maximum absolute column sum. To see this, let x ∈ R^n satisfy ‖x‖_1 = 1. Then
\[
\|Ax\|_1 = \sum_{i=1}^{m} |(Ax)_i|
= \sum_{i=1}^{m} \left| \sum_{j=1}^{n} a_{ij} x_j \right|
\le \sum_{j=1}^{n} |x_j| \left( \sum_{i=1}^{m} |a_{ij}| \right)
\le \sum_{j=1}^{n} |x_j| \left( \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}| \right)
= \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}|,
\]
since ‖x‖_1 = |x_1| + ⋯ + |x_n| = 1. Equality is attained by choosing x = e_J, where J is the index of a column of A with maximum absolute sum. It follows that the maximum absolute column sum of A is equal to the maximum of ‖Ax‖_1 taken over the set of all unit 1-norm vectors.
• The ℓ_∞-norm:
\[
\|A\|_\infty = \max_{\|x\|_\infty = 1} \|Ax\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{ij}|.
\]
That is, the ℓ_∞-norm of a matrix is its maximum absolute row sum. This formula can be obtained in a similar manner as the one for the matrix 1-norm.
• The ℓ_2-norm:
\[
\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2.
\]
To obtain a formula for this norm, one can show that the function
\[
g(x) = \frac{\|Ax\|_2^2}{\|x\|_2^2}
\]
has a local maximum or minimum whenever x is a unit ℓ_2-norm vector (that is, ‖x‖_2 = 1) that satisfies
\[
A^T A x = \|Ax\|_2^2\, x,
\]
that is, whenever x is an eigenvector of A^T A with eigenvalue ‖Ax‖_2^2. That is, the ℓ_2-norm of a matrix is the square root of the largest eigenvalue of A^T A, which is guaranteed to be nonnegative, as can be shown using the vector 2-norm. We see that unlike the vector ℓ_2-norm, the matrix ℓ_2-norm is much more difficult to compute than the matrix ℓ_1-norm or ℓ_∞-norm.
• The Frobenius norm:
\[
\|A\|_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 \right)^{1/2}.
\]
It should be noted that the Frobenius norm is not induced by any vector ℓ_p-norm, but it is equivalent to the vector ℓ_2-norm in the sense that ‖A‖_F = ‖x‖_2, where x is obtained by reshaping A into a vector.
Like vector norms, matrix norms are equivalent. For example, if A is an m × n matrix, we have
\[
\|A\|_2 \le \|A\|_F \le \sqrt{n}\, \|A\|_2,
\]
\[
\frac{1}{\sqrt{n}} \|A\|_\infty \le \|A\|_2 \le \sqrt{m}\, \|A\|_\infty,
\]
\[
\frac{1}{\sqrt{m}} \|A\|_1 \le \|A\|_2 \le \sqrt{n}\, \|A\|_1.
\]
Example Let
\[
A = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 1 & 0 \\ -1 & 0 & 4 \end{bmatrix}.
\]
Then
\[
\|A\|_1 = \max\{|1| + |0| + |-1|,\; |2| + |1| + |0|,\; |3| + |0| + |4|\} = 7,
\]
and
\[
\|A\|_\infty = \max\{|1| + |2| + |3|,\; |0| + |1| + |0|,\; |-1| + |0| + |4|\} = 6. \qquad \Box
\]
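Matlab's norm function also computes matrix norms; as a check on the example above (with the 2-norm and Frobenius norm included for comparison):

A = [1 2 3; 0 1 0; -1 0 4];
norm(A, 1)      % maximum absolute column sum: 7
norm(A, inf)    % maximum absolute row sum: 6
norm(A, 2)      % largest singular value, sqrt of largest eigenvalue of A'*A
norm(A, 'fro')  % Frobenius norm: sqrt of the sum of squared entries, sqrt(32)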
B.12 Function Spaces and Norms

For functions belonging to C[a, b], the space of functions that are continuous on [a, b], a weighted 2-norm can be defined by
\[
\|f\|_2 = \left( \int_a^b |f(x)|^2 w(x)\, dx \right)^{1/2},
\]
where the weight function w(x) is positive and integrable on (a, b). It is allowed to be singular at the endpoints, as will be seen in certain examples. This norm is called the 2-norm or weighted 2-norm.
The 2-norm and ∞-norm are related as follows:
\[
\|f\|_2 \le W \|f\|_\infty, \quad W = \|1\|_2.
\]
However, unlike the ∞-norm and 2-norm defined for the vector space R^n, these norms are not equivalent: a function that has a small 2-norm does not necessarily have a small ∞-norm. In fact, given any ε > 0, no matter how small, and any M > 0, no matter how large, there exists a function f ∈ C[a, b] such that
\[
\|f\|_2 < \varepsilon, \quad \|f\|_\infty > M.
\]
We say that a function f is absolutely continuous on [a, b] if its derivative is finite almost everywhere in [a, b] (meaning that it is not finite on at most a subset of [a, b] that has measure zero), is integrable on [a, b], and satisfies
\[
\int_a^x f'(s)\, ds = f(x) - f(a), \quad a \le x \le b.
\]
Any continuously differentiable function is absolutely continuous, but the converse is not necessarily
true.
Example B.12.1 The function f(x) = |x| is absolutely continuous on any interval of the form [−a, a], but it is not continuously differentiable on such an interval. □
Next, we define the Sobolev spaces H^k(a, b) as follows. The space H^1(a, b) is the set of all absolutely continuous functions on [a, b] whose derivatives belong to L^2(a, b). Then, for k > 1, H^k(a, b) is the subset of H^{k-1}(a, b) consisting of functions whose (k − 1)st derivatives are absolutely continuous, and whose kth derivatives belong to L^2(a, b). If we denote by C^k[a, b] the set of all functions defined on [a, b] that are k times continuously differentiable, then C^k[a, b] is a proper subset of H^k(a, b). For example, any piecewise linear function belongs to H^1(a, b), but does not generally belong to C^1[a, b].
Example B.12.2 The function f(x) = x^{3/4} belongs to H^1(0, 1), because f'(x) = (3/4) x^{-1/4} is integrable on [0, 1], and also square-integrable on [0, 1], since
\[
\int_0^1 |f'(x)|^2\, dx = \int_0^1 \frac{9}{16} x^{-1/2}\, dx = \frac{9}{8} x^{1/2} \Big|_0^1 = \frac{9}{8}. \qquad \Box
\]
B.13 Inner Product Spaces

Given two functions f, g ∈ C[a, b], we define the (continuous) inner product of f and g by
\[
\langle f, g \rangle = \int_a^b f(x)\, g(x)\, dx.
\]
Then, we say f and g are orthogonal with respect to this inner product if ⟨f, g⟩ = 0.
In general, an inner product on a vector space V over R, be it continuous or discrete, has the following properties:
1. ⟨f + g, h⟩ = ⟨f, h⟩ + ⟨g, h⟩ for all f, g, h ∈ V
2. ⟨cf, g⟩ = c⟨f, g⟩ for all f, g ∈ V and all scalars c ∈ R
3. ⟨f, g⟩ = ⟨g, f⟩ for all f, g ∈ V
4. ⟨f, f⟩ ≥ 0 for all f ∈ V, with ⟨f, f⟩ = 0 if and only if f = 0
An inner product induces a norm. For example, the dot product on R^n induces the vector norm
\[
\|v\| = (v \cdot v)^{1/2}.
\]
Along similar lines, we define the 2-norm of a function f (x) defined on [a, b] by
\[
\|f\|_2 = \left( \langle f, f \rangle \right)^{1/2} = \left( \int_a^b [f(x)]^2\, dx \right)^{1/2}.
\]
It can be verified that this function does in fact satisfy the properties required of a norm. One very important property that ‖·‖_2 has is that it satisfies the Cauchy-Schwarz inequality
\[
|\langle f, g \rangle| \le \|f\|_2 \|g\|_2.
\]
To prove this, note that for any real number c,
\[
\|g\|_2^2\, c^2 + 2\langle f, g \rangle\, c + \|f\|_2^2 = \langle f + cg, f + cg \rangle = \|f + cg\|_2^2 \ge 0.
\]
The left side is a quadratic polynomial in c. In order for this polynomial to be nonnegative for all c, it must either have complex roots or a double real root. This is the case if the discriminant satisfies
\[
4\langle f, g \rangle^2 - 4\|f\|_2^2 \|g\|_2^2 \le 0,
\]
from which the Cauchy-Schwarz inequality immediately follows. By setting c = 1 and applying this inequality, we immediately obtain the triangle-inequality property of norms.
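Continuous inner products and norms can be approximated in Matlab with numerical integration. The following sketch, using two functions on [0, 1] chosen purely for illustration, checks the Cauchy-Schwarz inequality numerically:

f = @(x) exp(x);
g = @(x) cos(pi*x);
ip = integral(@(x) f(x).*g(x), 0, 1);      % inner product <f,g> on [0,1]
nf = sqrt(integral(@(x) f(x).^2, 0, 1));   % ||f||_2
ng = sqrt(integral(@(x) g(x).^2, 0, 1));   % ||g||_2
abs(ip) <= nf*ng                           % returns true: Cauchy-Schwarz holds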
B.14 Eigenvalues
We have learned what it means for a sequence of vectors to converge to a limit. However, using
the definition alone, it may still be difficult to determine, conclusively, whether a given sequence of
vectors converges. For example, suppose a sequence of vectors is defined as follows: we choose the
initial vector x(0) arbitrarily, and then define the rest of the sequence by
\[
x^{(k+1)} = A x^{(k)}, \quad k = 0, 1, 2, \ldots,
\]
for some matrix A. Such a sequence actually arises in the study of the convergence of various
iterative methods for solving systems of linear equations.
An important question is whether a sequence of this form converges to the zero vector. This
will be the case if
\[
\lim_{k \to \infty} \|x^{(k)}\| = 0
\]
for some vector norm ‖·‖. From the relation x^{(k)} = A^k x^{(0)} and the submultiplicative property of natural matrix norms, we have
\[
\|x^{(k)}\| = \|A^k x^{(0)}\| \le \|A\|^k \|x^{(0)}\|,
\]
from which it follows that the sequence will converge to the zero vector if ‖A‖ < 1. However, this
is only a sufficient condition; it is not necessary.
To obtain a sufficient and necessary condition, it is necessary to achieve a better understanding
of the effect of matrix-vector multiplication on the magnitude of a vector. However, because
matrix-vector multiplication is a complicated operation, this understanding can be difficult to
acquire. Therefore, it is helpful to identify circumstances under which this operation can be simply
described.
To that end, we say that a nonzero vector x is an eigenvector of an n × n matrix A if there
exists a scalar λ such that
Ax = λx.
The scalar λ is called an eigenvalue of A corresponding to x. Note that although x is required to
be nonzero, it is possible that λ can be zero. It can also be complex, even if A is a real matrix.
Rearranging the equation Ax = λx, we see that x is a nonzero solution of the homogeneous system
\[
(A - \lambda I)x = 0.
\]
That is, if λ is an eigenvalue of A, then A − λI is a singular matrix, and therefore det(A − λI) = 0.
The left side of this equation is a polynomial in λ, which is called the characteristic polynomial of A. If
A is an n × n matrix, then the characteristic polynomial is of degree n, which means that A has n
eigenvalues, which may repeat.
The following properties of eigenvalues and eigenvectors are helpful to know:
• If there exists an invertible matrix P such that B = PAP^{-1}, then A and B have the same eigenvalues. We say that A and B are similar, and the transformation PAP^{-1} is called a similarity transformation.
• If A is a skew-symmetric matrix, meaning that AT = −A, then its eigenvalues are either equal
to zero, or are purely imaginary.
• tr(A), the sum of the diagonal entries of A, is also equal to the sum of the eigenvalues of A.
It follows that any method for computing the roots of a polynomial can be used to obtain the
eigenvalues of a matrix A. However, in practice, eigenvalues are normally computed using iterative
methods that employ orthogonal similarity transformations to reduce A to upper triangular form,
thus revealing the eigenvalues of A. In practice, such methods for computing eigenvalues are used
to compute roots of polynomials, rather than using polynomial root-finding methods to compute
eigenvalues, because they are much more robust with respect to roundoff error.
It can be shown that if each eigenvalue λ of a matrix A satisfies |λ| < 1, then, for any vector x,
\[
\lim_{k \to \infty} A^k x = 0.
\]
Furthermore, the converse of this statement is also true: if there exists a vector x such that Ak x
does not approach 0 as k → ∞, then at least one eigenvalue λ of A must satisfy |λ| ≥ 1.
Therefore, it is through the eigenvalues of A that we can describe a necessary and sufficient
condition for a sequence of vectors of the form x(k) = Ak x(0) to converge to the zero vector.
Specifically, we need only check if the magnitude of the largest eigenvalue is less than 1. For
convenience, we define the spectral radius of A, denoted by ρ(A), to be max |λ|, where λ is an
eigenvalue of A. We can then conclude that the sequence x(k) = Ak x(0) converges to the zero
vector if and only if ρ(A) < 1.
Example Let
\[
A = \begin{bmatrix} 2 & 3 & 1 \\ 0 & 4 & 5 \\ 0 & 0 & 1 \end{bmatrix}.
\]
Then
\[
A - 2I = \begin{bmatrix} 2-2 & 3 & 1 \\ 0 & 4-2 & 5 \\ 0 & 0 & 1-2 \end{bmatrix}
= \begin{bmatrix} 0 & 3 & 1 \\ 0 & 2 & 5 \\ 0 & 0 & -1 \end{bmatrix}.
\]
Because A − 2I has a column of all zeros, it is singular. Therefore, 2 is an eigenvalue of A. In fact, Ae_1 = 2e_1, so e_1 = [1, 0, 0]^T is an eigenvector.
Because A is upper triangular, its eigenvalues are the diagonal elements, 2, 4 and 1. Because the largest eigenvalue in magnitude is 4, the spectral radius of A is ρ(A) = 4. □
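In Matlab, eigenvalues are computed with eig, and the spectral radius follows directly; a quick check with the matrix from the example above:

A = [2 3 1; 0 4 5; 0 0 1];
lambda = eig(A);         % the eigenvalues 2, 4, 1 (in some order)
rho = max(abs(lambda))   % spectral radius: returns 4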
The spectral radius is closely related to natural (induced) matrix norms. Let λ be the largest eigenvalue of A in magnitude, with x being a corresponding eigenvector. Then, for any natural matrix norm ‖·‖, we have
\[
\rho(A)\|x\| = |\lambda|\, \|x\| = \|\lambda x\| = \|Ax\| \le \|A\| \|x\|,
\]
and therefore ρ(A) ≤ ‖A‖ for any natural matrix norm. Furthermore, if A is symmetric, then
\[
\|A\|_2 = \rho(A),
\]
while for a general matrix A, ‖A‖_2 = [ρ(A^T A)]^{1/2}, which can be seen to reduce to ρ(A) when A^T = A, since, in general, ρ(A^k) = ρ(A)^k.
Because the condition ρ(A) < 1 is necessary and sufficient to ensure that lim_{k→∞} A^k x = 0, it is possible that such convergence may occur even if ‖A‖ ≥ 1 for some natural norm ‖·‖. However, if ρ(A) < 1, we can conclude that
\[
\lim_{k \to \infty} \|A^k\| = 0
\]
for any natural norm ‖·‖. In summary, the following statements are equivalent:
1. ρ(A) < 1
2. lim_{k→∞} ‖A^k‖ = 0 for any natural norm ‖·‖
3. lim_{k→∞} (A^k)_{ij} = 0 for i, j = 1, 2, ..., n
4. lim_{k→∞} A^k x = 0 for any vector x
These results are very useful for analyzing the convergence behavior of various iterative methods
for solving systems of linear equations.
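The role of the spectral radius can be observed numerically. The following sketch uses a small matrix, chosen for illustration, whose spectral radius is less than 1, and shows the powers A^k x decaying toward zero:

A = [0.5 0.2; 0.1 0.4];       % illustrative matrix with rho(A) < 1
rho = max(abs(eig(A)))         % returns 0.6
x = [1; 1];
for k = 1:5
    x = A*x;                   % x now holds A^k times the initial vector
    fprintf('k = %d, norm = %g\n', k, norm(x));   % norms decrease toward 0
end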
B.15 Differentiation of Matrices

Suppose that A(t) is an n × n matrix whose entries are differentiable functions of the parameter t, and that A(t) is nonsingular. Then, differentiating the identity A(t)A(t)^{-1} = I with respect to t yields the formula
\[
\frac{d}{dt}\left[ A(t)^{-1} \right] = -A(t)^{-1} \frac{d}{dt}\left[ A(t) \right] A(t)^{-1}.
\]
It is also useful to know how to differentiate functions of vectors with respect to their compo-
nents. Let A be an n × n matrix and x ∈ Rn . Then, we have
\[
\nabla\left( x^T A x \right) = (A + A^T)x, \qquad \nabla\left( x^T b \right) = b.
\]
These formulas are useful in problems involving minimization of functions of x, such as the least-
squares problem, which entails approximately solving a system of m equations in n unknowns, where
typically m > n.
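The gradient formula for the quadratic form can be checked numerically with finite differences. The following sketch uses a small matrix and vector chosen for illustration, and compares a central-difference approximation of ∇(x^T A x) with (A + A^T)x:

A = [2 1; 0 3];  x = [1; -2];
f = @(z) z'*A*z;                   % the quadratic form
g_exact = (A + A')*x;              % gradient according to the formula above
h = 1e-6;  g_fd = zeros(2,1);
for i = 1:2
    e = zeros(2,1);  e(i) = 1;     % ith standard basis vector
    g_fd(i) = (f(x + h*e) - f(x - h*e)) / (2*h);   % central difference
end
norm(g_fd - g_exact)               % small: finite differences agree with (A + A')*x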